
Prometheus Installation and Configuration: Detailed Installation, Configuration, Upgrade, and Migration Procedure

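1. Prometheus Overview and Environment Planning

Prometheus is an open-source systems monitoring and alerting toolkit originally developed at SoundCloud. It collects data with a pull model, supports a multi-dimensional data model and the flexible query language PromQL, and is widely used for cloud-native and microservice monitoring.

1.1 Prometheus Version

This tutorial uses Prometheus 2.45, a long-term support (LTS) release, as the example throughout.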

# Check the Prometheus version
$ prometheus --version
prometheus, version 2.45.0 (branch: HEAD, revision: abc123)
build user: root@buildhost
build date: 20240101-00:00:00
go version: go1.21.0
platform: linux/amd64

# Validate the configuration file (the check is done with promtool, not the prometheus binary)
$ promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file syntax

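1.2 Environment Planning

The installation environment is planned as follows:

Hostname: fgedudb01.fgedu.net.cn
IP address: 192.168.1.51
HTTP port: 9090
Data directory: /data/prometheus
Configuration directory: /etc/prometheus
Log directory: /var/log/prometheus

Storage planning:
Data retention: 15 days
Scrape interval: 15 seconds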
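
The retention period and scrape interval above largely determine disk usage. As a rough sketch, the Prometheus storage documentation estimates the needed disk space as retention time in seconds × ingested samples per second × bytes per sample (about 1 to 2 bytes per sample). The series count of 100,000 and the 2 bytes per sample used below are assumptions for illustration, not measurements from this environment, and the function name is hypothetical.

```shell
#!/bin/sh
# Rough TSDB disk estimate: retention_seconds * samples_per_second * bytes_per_sample.
# All inputs are illustrative assumptions; measure your own series count in production.

estimate_tsdb_bytes() {
    # $1 = active series (assumed), $2 = scrape interval in seconds,
    # $3 = retention in days, $4 = bytes per sample (1-2 is typical)
    est_series=$1
    est_interval=$2
    est_days=$3
    est_bytes_per_sample=$4
    est_retention_seconds=$((est_days * 86400))
    # Multiply before dividing to avoid integer truncation of samples/second.
    echo $((est_series * est_retention_seconds / est_interval * est_bytes_per_sample))
}

# 100000 series, 15s interval, 15 days retention, 2 bytes per sample
bytes=$(estimate_tsdb_bytes 100000 15 15 2)
echo "estimated bytes: $bytes"
echo "estimated GiB:   $((bytes / 1024 / 1024 / 1024))"
```

Under these assumptions the estimate comes to 17280000000 bytes, roughly 16 GiB. The real figure depends on series churn and compression, so treat it as a starting point for capacity planning only.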

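1.3 Core Features of Prometheus

Key features:
1. Multi-dimensional data model: metric names plus key-value labels
2. PromQL query language: powerful data querying
3. Pull-based collection: Prometheus actively scrapes metrics from targets
4. Service discovery: automatic discovery of monitoring targets
5. Alerting: alert handling through Alertmanager
6. Visualization: integration with Grafana
7. Time series storage: efficient local storage
8. Federation: support for multi-level federated architectures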

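2. Hardware Requirements and Checks

Before installing Prometheus, check the server hardware and system environment thoroughly.

2.1 Minimum Hardware Requirements

Minimum configuration:
CPU: 2 cores
Memory: 4 GB
Disk: 20 GB

Recommended configuration (production):
CPU: 4 cores or more
Memory: 16 GB or more
Disk: 200 GB or more, SSD

Large-scale monitoring configuration:
CPU: 8 cores or more
Memory: 32 GB or more
Disk: 500 GB or more, SSD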

2.2 System Environment Checks

# Check the operating system version
# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.8 (Ootpa)

# Check the kernel version
# uname -a
Linux fgedudb01 4.18.0-477.10.1.el8_8.x86_64 #1 SMP Sat Apr 4 10:00:00 CST 2026 x86_64 x86_64 x86_64 GNU/Linux

# Check memory
# free -h
              total        used        free      shared  buff/cache   available
Mem:           31Gi       1.0Gi        29Gi       256Mi       1.0Gi        30Gi
Swap:           7Gi          0B         7Gi

# Check disk space
# df -h
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/vg_system-lv_root   50G  2.5G   48G   5% /
/dev/sda2                     1014M  150M  865M  15% /boot
/dev/mapper/vg_data-lv_data    200G   20G  180G  10% /data

# Check time synchronization
# timedatectl status
Local time: Sat 2026-04-04 10:00:00 CST
Universal time: Sat 2026-04-04 02:00:00 UTC
RTC time: Sat 2026-04-04 02:00:00
Time zone: Asia/Shanghai (CST, +0800)
NTP enabled: yes
NTP synchronized: yes

2.3 Kernel Parameter Configuration

# Configure kernel parameters
# vi /etc/sysctl.d/99-prometheus.conf

# Add the following parameters
# Network parameters
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 300

# System-wide file handle limit
fs.file-max = 655360

# Apply the kernel parameters
# sysctl -p /etc/sysctl.d/99-prometheus.conf

# Sample output (truncated):
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535

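2.4 User Resource Limits

# Configure user limits
# vi /etc/security/limits.conf

# Add the following entries
# Note: limits.conf applies to PAM login sessions only; for the systemd service,
# the effective limit is LimitNOFILE in the unit file (section 3.4)
prometheus soft nofile 65535
prometheus hard nofile 65535
prometheus soft nproc 65535
prometheus hard nproc 65535

# Create the service user (system account, no login shell)
# useradd -r -s /sbin/nologin prometheus

Production recommendation: Prometheus is demanding on disk I/O, so SSD storage is recommended. Accurate time is essential for a monitoring system, so make sure NTP synchronization is configured.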

3. Prometheus Installation Steps

This section walks through the installation of Prometheus 2.45.

3.1 Create the Directory Structure

# Create the directories
# mkdir -p /etc/prometheus
# mkdir -p /data/prometheus
# mkdir -p /var/log/prometheus

# Set directory ownership and permissions
# chown -R prometheus:prometheus /etc/prometheus
# chown -R prometheus:prometheus /data/prometheus
# chown -R prometheus:prometheus /var/log/prometheus
# chmod -R 750 /data/prometheus
# chmod -R 750 /var/log/prometheus

# Verify directory permissions (750 is shown as drwxr-x---)
# ls -la /data/
total 0
drwxr-x---. 2 prometheus prometheus 6 Apr  4 10:00 prometheus

3.2 Download and Install Prometheus

# Download Prometheus
# cd /usr/local/src
# wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz

# Sample output:
--2026-04-04 10:00:00--  https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
Resolving github.com... 140.82.121.4
Connecting to github.com|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 100000000 (95M) [application/octet-stream]
Saving to: 'prometheus-2.45.0.linux-amd64.tar.gz'
100%[======================================>] 100,000,000 10.0MB/s in 9.5s
2026-04-04 10:00:10 (10.0 MB/s) - 'prometheus-2.45.0.linux-amd64.tar.gz' saved

# Extract the archive
# tar -xzf prometheus-2.45.0.linux-amd64.tar.gz

# Copy the binaries
# cp prometheus-2.45.0.linux-amd64/prometheus /usr/local/bin/
# cp prometheus-2.45.0.linux-amd64/promtool /usr/local/bin/

# Copy the default configuration file
# cp prometheus-2.45.0.linux-amd64/prometheus.yml /etc/prometheus/

# Copy the console templates and libraries referenced by the service unit in section 3.4
# cp -r prometheus-2.45.0.linux-amd64/consoles /etc/prometheus/
# cp -r prometheus-2.45.0.linux-amd64/console_libraries /etc/prometheus/

# Set ownership and permissions
# chown prometheus:prometheus /usr/local/bin/prometheus
# chown prometheus:prometheus /usr/local/bin/promtool
# chmod 755 /usr/local/bin/prometheus
# chmod 755 /usr/local/bin/promtool

# Verify the installation
# prometheus --version
prometheus, version 2.45.0 (branch: HEAD, revision: abc123)
build user: root@buildhost
build date: 20240101-00:00:00
go version: go1.21.0
platform: linux/amd64

3.3 Create the Configuration File

# Edit the configuration file
# vi /etc/prometheus/prometheus.yml

# Add the following configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'fgedu-monitor'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'fgedudb01'

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['192.168.1.51:9100', '192.168.1.52:9100']
        labels:
          group: 'production'

  - job_name: 'mysql_exporter'
    static_configs:
      - targets: ['192.168.1.51:9104']

  - job_name: 'redis_exporter'
    static_configs:
      - targets: ['192.168.1.51:9121']

# Create the rules directory
# mkdir -p /etc/prometheus/rules

# Create the alert rules file
# vi /etc/prometheus/rules/alert_rules.yml

groups:
  - name: node_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Available disk space is below 15% (current value: {{ $value }}%)"

# Set ownership
# chown -R prometheus:prometheus /etc/prometheus

3.4 Create the Systemd Service

# Create the service file
# vi /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
RestartSec=5
LimitNOFILE=65535
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=http://192.168.1.51:9090 \
  --log.level=info \
  --log.format=logfmt
# Prometheus reloads its configuration on SIGHUP; this line is what makes "systemctl reload prometheus" work
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

# Reload systemd
# systemctl daemon-reload

# Start the Prometheus service
# systemctl start prometheus

# Enable start at boot
# systemctl enable prometheus

# Sample output:
Created symlink /etc/systemd/system/multi-user.target.wants/prometheus.service → /etc/systemd/system/prometheus.service.

# Check the service status
# systemctl status prometheus

● prometheus.service - Prometheus Monitoring System
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2026-04-04 10:00:00 CST; 5s ago
     Docs: https://prometheus.io/docs/introduction/overview/
 Main PID: 12345 (prometheus)
    Tasks: 8 (limit: 4915)
   Memory: 100.0M
   CGroup: /system.slice/prometheus.service
           └─12345 /usr/local/bin/prometheus --config.file=/etc/prometheus/prometheus.yml ...

# Check the listening port
# netstat -tlnp | grep prometheus
tcp6 0 0 :::9090 :::* LISTEN 12345/prometheus

Tip: The Prometheus configuration file is YAML, so indentation matters. The --storage.tsdb.retention.time flag controls how long data is kept; set it according to the available storage capacity. When --storage.tsdb.retention.size is also set, whichever limit is reached first triggers deletion of old data.

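4. Prometheus Parameter Configuration

Parameter configuration is a key step in building the monitoring system and directly affects monitoring quality and performance.

4.1 Configure Service Discovery

# Edit the configuration file
# vi /etc/prometheus/prometheus.yml

# The three examples below are alternatives. A configuration file may contain only one
# scrape_configs key, so add the chosen job(s) to the existing scrape_configs list.

# File-based service discovery
scrape_configs:
  - job_name: 'file_sd'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
        refresh_interval: 5m

# Create the target file
# mkdir -p /etc/prometheus/targets
# vi /etc/prometheus/targets/nodes.json

[
  {
    "targets": ["192.168.1.51:9100", "192.168.1.52:9100"],
    "labels": {
      "job": "node_exporter",
      "env": "production"
    }
  }
]

# Consul service discovery
scrape_configs:
  - job_name: 'consul_sd'
    consul_sd_configs:
      - server: '192.168.1.100:8500'
        services: ['node-exporter', 'mysql-exporter']

# Kubernetes service discovery
scrape_configs:
  - job_name: 'kubernetes_sd'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - monitoring
            - default

# Reload the configuration (sends SIGHUP; the unit file must define ExecReload for this to work)
# systemctl reload prometheus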
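
The target file from section 4.1 can also be generated by a script, which helps when the host list comes from an inventory. A minimal sketch, assuming a hypothetical helper named `render_targets` and the same labels as the example file. Prometheus watches the files matched by `file_sd_configs` and applies changes when they are modified, so writing to a temporary file and renaming it avoids a half-written file being picked up. `TARGET_DIR` is an assumption of this sketch and defaults to a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: print one file_sd target group as JSON.
# $1 = value of the "env" label, remaining arguments = host:port targets.
render_targets() {
    rt_env=$1
    shift
    rt_list=""
    for rt_target in "$@"; do
        # Join the targets as "a", "b", ...
        if [ -z "$rt_list" ]; then
            rt_list="\"$rt_target\""
        else
            rt_list="$rt_list, \"$rt_target\""
        fi
    done
    printf '[\n  {\n    "targets": [%s],\n' "$rt_list"
    printf '    "labels": {\n      "job": "node_exporter",\n      "env": "%s"\n    }\n  }\n]\n' "$rt_env"
}

# Write to a temporary file first, then rename it into place (a rename is
# atomic on the same filesystem). The .tmp name does not match *.json, so
# Prometheus ignores it. Point TARGET_DIR at /etc/prometheus/targets on a real server.
TARGET_DIR=${TARGET_DIR:-$(mktemp -d)}
render_targets production 192.168.1.51:9100 192.168.1.52:9100 > "$TARGET_DIR/nodes.json.tmp"
mv "$TARGET_DIR/nodes.json.tmp" "$TARGET_DIR/nodes.json"
cat "$TARGET_DIR/nodes.json"
```

The helper does no escaping, so it is only suitable for plain host:port strings and simple label values.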

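4.2 Configure Remote Storage

# Edit the configuration file
# vi /etc/prometheus/prometheus.yml

# Configure remote write
remote_write:
  - url: "http://192.168.1.51:8428/api/v1/write"
    queue_config:
      max_samples_per_send: 10000
      max_shards: 200
      # capacity is the per-shard buffer and should be several times max_samples_per_send
      capacity: 30000

# Configure remote read
remote_read:
  - url: "http://192.168.1.51:8428/api/v1/read"
    read_recent: true

# Restart the service
# systemctl restart prometheus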

4.3 Configure Alert Rules

# Create a second alert rules file for the database exporters
# (a separate file name keeps the node rules from section 3.3 intact; the rule_files glob loads both)
# vi /etc/prometheus/rules/db_alert_rules.yml

groups:
  - name: mysql_alerts
    rules:
      - alert: MySQLDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL instance {{ $labels.instance }} is down"
          description: "MySQL instance has been down for more than 1 minute."

      - alert: MySQLReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL replication lag on {{ $labels.instance }}"
          description: "Replication lag is {{ $value }} seconds."

      - alert: MySQLTooManyConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL too many connections on {{ $labels.instance }}"
          description: "Connection usage is {{ $value }}%."

  - name: redis_alerts
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis instance {{ $labels.instance }} is down"
          description: "Redis instance has been down for more than 1 minute."

      - alert: RedisMemoryHigh
        # The second condition skips instances without maxmemory set (max is 0, which would divide by zero)
        expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 80 and redis_memory_max_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage high on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%."

# Validate the rules
# promtool check rules /etc/prometheus/rules/db_alert_rules.yml

Checking /etc/prometheus/rules/db_alert_rules.yml
SUCCESS: 5 rules found

# Reload the configuration
# systemctl reload prometheus

Production recommendation: Use file-based or Consul service discovery to manage targets dynamically. Tailor alert rules to actual business requirements to avoid both false positives and missed alerts.

5. Data Collection and Querying

Prometheus collects data in pull mode and queries it with PromQL.

5.1 Install Node Exporter

# Download Node Exporter
# cd /usr/local/src
# wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz

# Extract and install
# tar -xzf node_exporter-1.6.1.linux-amd64.tar.gz
# cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
# chmod 755 /usr/local/bin/node_exporter

# Create the service file
# vi /etc/systemd/system/node_exporter.service

[Unit]
Description=Node Exporter
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=:9100 \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
Restart=on-failure

[Install]
WantedBy=multi-user.target

# Start the service
# systemctl daemon-reload
# systemctl start node_exporter
# systemctl enable node_exporter

# Verify that metrics are exposed
# curl -s http://localhost:9100/metrics | head -20

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000123
go_gc_duration_seconds{quantile="0.25"} 0.000234
go_gc_duration_seconds{quantile="0.5"} 0.000345
go_gc_duration_seconds{quantile="0.75"} 0.000456
go_gc_duration_seconds{quantile="1"} 0.001234
go_gc_duration_seconds_sum 0.012345
go_gc_duration_seconds_count 10
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 10

5.2 PromQL Query Examples

# Query CPU usage
$ curl -G 'http://192.168.1.51:9090/api/v1/query' \
  --data-urlencode 'query=100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

# Sample output:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "instance": "192.168.1.51:9100"
        },
        "value": [1712205600, "25.5"]
      }
    ]
  }
}

# Query memory usage
$ curl -G 'http://192.168.1.51:9090/api/v1/query' \
  --data-urlencode 'query=(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100'

# Query disk usage
$ curl -G 'http://192.168.1.51:9090/api/v1/query' \
  --data-urlencode 'query=(1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100'

# Query network traffic
$ curl -G 'http://192.168.1.51:9090/api/v1/query' \
  --data-urlencode 'query=rate(node_network_receive_bytes_total{device="eth0"}[5m])'

# Query a time range
$ curl -G 'http://192.168.1.51:9090/api/v1/query_range' \
  --data-urlencode 'query=node_cpu_seconds_total{mode="idle"}' \
  --data-urlencode 'start=1712205000' \
  --data-urlencode 'end=1712205600' \
  --data-urlencode 'step=60s'

# Aggregation query
$ curl -G 'http://192.168.1.51:9090/api/v1/query' \
  --data-urlencode 'query=sum by(job) (rate(node_cpu_seconds_total[5m]))'

# Count the monitored targets
$ curl -G 'http://192.168.1.51:9090/api/v1/query' \
  --data-urlencode 'query=count(up)'

5.3 Querying in the Web UI

# Open the web interface
# In a browser, go to http://192.168.1.51:9090

# Common query examples:

# 1. Status of all instances
up

# 2. CPU usage (per instance)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 3. Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# 4. Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# 5. Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# 6. System load
node_load1
node_load5
node_load15

# 7. Process counts
node_procs_running
node_procs_blocked

# 8. File descriptors
node_filefd_allocated
node_filefd_maximum

Tip: PromQL is the core query language of Prometheus and offers a rich set of functions and operators. Getting familiar with the common query patterns makes day-to-day monitoring work much more efficient.
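
The `query_range` example in section 5.2 hard-codes `start` and `end` as Unix timestamps. A small sketch of computing them relative to the current time follows; the function name `range_params` and the 10-minute window are illustrative assumptions, and the final `curl` call is left commented out because it needs a reachable Prometheus server.

```shell
#!/bin/sh
# Hypothetical helper: print "start end" Unix timestamps for a window that
# ends at $1 (epoch seconds) and spans $2 seconds.
range_params() {
    rp_end=$1
    rp_window=$2
    echo "$((rp_end - rp_window)) $rp_end"
}

now=$(date +%s)
set -- $(range_params "$now" 600)   # last 10 minutes
start=$1
end=$2
echo "start=$start end=$end step=60s"

# With a running server, the values plug into a range query like the one in section 5.2:
# curl -G 'http://192.168.1.51:9090/api/v1/query_range' \
#   --data-urlencode 'query=node_load1' \
#   --data-urlencode "start=$start" \
#   --data-urlencode "end=$end" \
#   --data-urlencode 'step=60s'
```

Calling `range_params 1712205600 600` reproduces the `start=1712205000` and `end=1712205600` values used in section 5.2.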

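6. Network Connection Configuration

Network configuration determines how clients reach Prometheus, so the listen address, port, and access method must be set correctly.

6.1 Configure the Listen Address

# Check the current listening port
# netstat -tlnp | grep prometheus
tcp6 0 0 :::9090 :::* LISTEN 12345/prometheus

# Change the listen address: in ExecStart, replace the existing --web.listen-address
# flag and leave the remaining flags unchanged
# vi /etc/systemd/system/prometheus.service
ExecStart=/usr/local/bin/prometheus \
  --web.listen-address=192.168.1.51:9090 \
# Note: once bound to a single IP, Prometheus no longer answers on localhost:9090,
# so anything that connects through localhost must use 192.168.1.51:9090 instead

# Restart the service
# systemctl daemon-reload
# systemctl restart prometheus

# Open the port in the firewall
# firewall-cmd --permanent --add-port=9090/tcp
success
# firewall-cmd --reload
success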

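6.2 Configure Authentication

# Prometheus 2.24 and later can enable basic authentication and TLS natively through --web.config.file
# This tutorial uses an nginx reverse proxy for authentication instead

# Install nginx (the htpasswd command is provided by the httpd-tools package)
# dnf install -y nginx httpd-tools

# Configure the nginx reverse proxy
# (Prometheus must be reachable on 127.0.0.1:9090 for the proxy_pass below)
# vi /etc/nginx/conf.d/prometheus.conf

server {
    listen 80;
    server_name prometheus.fgedu.net.cn;

    location / {
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:9090;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

# Create the password file
# htpasswd -c /etc/nginx/.htpasswd admin
New password:
Re-type new password:
Adding password for user admin

# Start nginx
# systemctl start nginx
# systemctl enable nginx

# Access Prometheus through the authenticated proxy
# curl -u admin:password 'http://prometheus.fgedu.net.cn/api/v1/query?query=up'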

6.3 Configure HTTPS

# Generate a self-signed SSL certificate
# mkdir -p /etc/nginx/ssl
# openssl req -x509 -nodes -newkey rsa:2048 \
  -keyout /etc/nginx/ssl/prometheus.key \
  -out /etc/nginx/ssl/prometheus.crt \
  -days 365 \
  -subj "/CN=prometheus.fgedu.net.cn"

# Configure HTTPS
# vi /etc/nginx/conf.d/prometheus.conf

server {
    listen 443 ssl;
    server_name prometheus.fgedu.net.cn;

    ssl_certificate /etc/nginx/ssl/prometheus.crt;
    ssl_certificate_key /etc/nginx/ssl/prometheus.key;

    location / {
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:9090;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

# Restart nginx
# systemctl restart nginx

Production recommendation: Put Prometheus behind a reverse proxy that provides authentication and HTTPS. In sensitive environments, additionally restrict access with an IP allowlist.

7. Backup and Recovery

Backup and recovery are an important part of operating a monitoring system. Prometheus stores its data in a local TSDB.

7.1 Data Backup

# Create the backup directory
# mkdir -p /backup/prometheus

# Stop the Prometheus service
# systemctl stop prometheus

# Back up the data directory
# tar -czf /backup/prometheus/prometheus_data_$(date +%Y%m%d).tar.gz /data/prometheus

# Back up the configuration files
# tar -czf /backup/prometheus/prometheus_config_$(date +%Y%m%d).tar.gz /etc/prometheus

# Start the service
# systemctl start prometheus

# Verify the backup files
# ls -la /backup/prometheus/
total 2048
-rw-r--r--. 1 root root 1024000 Apr  4 10:00 prometheus_config_20260404.tar.gz
-rw-r--r--. 1 root root 5120000 Apr  4 10:00 prometheus_data_20260404.tar.gz

7.2 Data Recovery

# Stop the Prometheus service
# systemctl stop prometheus

# Restore the data directory (tar stored the paths without the leading "/", so extract relative to /)
# rm -rf /data/prometheus/*
# tar -xzf /backup/prometheus/prometheus_data_20260404.tar.gz -C /
# chown -R prometheus:prometheus /data/prometheus

# Restore the configuration files
# rm -rf /etc/prometheus/*
# tar -xzf /backup/prometheus/prometheus_config_20260404.tar.gz -C /
# chown -R prometheus:prometheus /etc/prometheus

# Start the service
# systemctl start prometheus

# Verify the recovery
# curl -s 'http://localhost:9090/api/v1/query?query=up' | head -20

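7.3 Automated Backup Script

# Create the backup script
# vi /usr/local/bin/prometheus_backup.sh

#!/bin/bash
# Prerequisite: the snapshot API only works when Prometheus is started with --web.enable-admin-api
BACKUP_DIR=/backup/prometheus
DATE=$(date +%Y%m%d)
LOG_FILE=/var/log/prometheus/backup.log

mkdir -p "$BACKUP_DIR"
echo "=== Backup started at $(date) ===" >> "$LOG_FILE"

# Back up the configuration files
tar -czf "${BACKUP_DIR}/prometheus_config_${DATE}.tar.gz" /etc/prometheus >> "$LOG_FILE" 2>&1

# Create a TSDB snapshot through the HTTP admin API
# (-f makes curl return a non-zero status on HTTP errors, so the check below is meaningful)
if curl -fsS -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot >> "$LOG_FILE" 2>&1; then
    # Archive the newest snapshot directory
    SNAPSHOT_DIR=$(ls -td /data/prometheus/snapshots/* | head -1)
    tar -czf "${BACKUP_DIR}/prometheus_snapshot_${DATE}.tar.gz" "${SNAPSHOT_DIR}" >> "$LOG_FILE" 2>&1
    echo "Backup completed successfully" >> "$LOG_FILE"
else
    echo "Backup failed" >> "$LOG_FILE"
fi

# Remove backups older than 30 days
find "${BACKUP_DIR}" -name "*.tar.gz" -mtime +30 -delete >> "$LOG_FILE" 2>&1

echo "=== Backup finished at $(date) ===" >> "$LOG_FILE"
echo "" >> "$LOG_FILE"

# Make the script executable
# chmod +x /usr/local/bin/prometheus_backup.sh

# Configure the scheduled job
# crontab -e

# Add the following entry (runs the backup every day at 02:00)
0 2 * * * /usr/local/bin/prometheus_backup.sh

Tip: In production, schedule the backup script so that backups run regularly. For large deployments, consider remote storage such as VictoriaMetrics or Thanos.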
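
The backup script above locates the newest snapshot by directory modification time. The snapshot API also returns the snapshot name in its JSON response, which is a more direct way to find it. A minimal sketch, assuming the response has the documented shape `{"status":"success","data":{"name":"..."}}`; the function name `snapshot_name` and the sample response string are illustrative, and `sed` is used only to avoid a dependency on `jq`.

```shell
#!/bin/sh
# Hypothetical helper: extract data.name from a snapshot API response on stdin.
# Prints nothing if the response contains no "name" field.
snapshot_name() {
    sed -n 's/.*"name":"\([^"]*\)".*/\1/p'
}

# Sample response used for illustration; on a real server it would come from:
#   curl -fsS -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
response='{"status":"success","data":{"name":"20260404T020000Z-0123456789abcdef"}}'

name=$(printf '%s\n' "$response" | snapshot_name)
if [ -n "$name" ]; then
    echo "snapshot directory: /data/prometheus/snapshots/$name"
else
    echo "no snapshot name in response" >&2
fi
```

Using the returned name avoids archiving the wrong directory when an older snapshot has been touched or when two snapshots are created close together.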

8. Upgrade and Migration

Upgrading and migrating Prometheus are important operational tasks that need careful planning and execution.

8.1 Version Upgrade

# Check the current version
$ prometheus --version
prometheus, version 2.45.0

# Take a full backup
# systemctl stop prometheus
# tar -czf /backup/prometheus/pre_upgrade.tar.gz /data/prometheus /etc/prometheus

# Download the new version
# cd /usr/local/src
# wget https://github.com/prometheus/prometheus/releases/download/v2.46.0/prometheus-2.46.0.linux-amd64.tar.gz

# Extract and replace the binaries
# tar -xzf prometheus-2.46.0.linux-amd64.tar.gz
# cp prometheus-2.46.0.linux-amd64/prometheus /usr/local/bin/
# cp prometheus-2.46.0.linux-amd64/promtool /usr/local/bin/
# chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool

# Check that the configuration is still valid with the new promtool
# promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file syntax

# Start the service
# systemctl start prometheus

# Verify the version
$ prometheus --version
prometheus, version 2.46.0

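8.2 Migrate to a New Server

# Take a backup on the source server
# systemctl stop prometheus
# tar -czf prometheus_full_backup.tar.gz /data/prometheus /etc/prometheus

# Transfer the backup file
# scp prometheus_full_backup.tar.gz new-server:/backup/

# Install Prometheus on the new server
# Follow the installation steps in section 3.2

# Restore the data on the new server
# systemctl stop prometheus
# tar -xzf /backup/prometheus_full_backup.tar.gz -C /
# chown -R prometheus:prometheus /data/prometheus /etc/prometheus
# systemctl start prometheus

# Verify the migration
# curl -s 'http://localhost:9090/api/v1/query?query=up'

Production recommendation: Always take a full backup before upgrading, and rehearse the upgrade in a test environment first. For upgrades across major versions, read the upgrade documentation carefully.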
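
The recommendation above treats upgrades across major versions differently from ordinary upgrades. A small sketch of checking this before replacing the binaries follows; the function names are hypothetical, and the version strings are passed in by hand rather than parsed from `prometheus --version`.

```shell
#!/bin/sh
# Hypothetical helpers for a pre-upgrade check.

# Print the major component of a dotted version such as 2.45.0 (a leading "v" is allowed).
major_of() {
    mo_version=${1#v}
    echo "${mo_version%%.*}"
}

# Succeed (exit status 0) when the two versions differ in their major component.
is_major_upgrade() {
    [ "$(major_of "$1")" != "$(major_of "$2")" ]
}

current=2.45.0
target=v2.46.0
if is_major_upgrade "$current" "$target"; then
    echo "major version change: read the migration guide before upgrading"
else
    echo "same major version: back up, then replace the binaries"
fi
```

For the 2.45.0 to 2.46.0 upgrade in section 8.1 the check reports the same major version, so the backup-and-replace procedure shown there applies.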

9. Production Environment Case Study

This section provides a complete production configuration example to illustrate how Prometheus is used in practice.

9.1 Install Alertmanager

# Download Alertmanager
# cd /usr/local/src
# wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz

# Extract and install
# tar -xzf alertmanager-0.26.0.linux-amd64.tar.gz
# cp alertmanager-0.26.0.linux-amd64/alertmanager /usr/local/bin/
# cp alertmanager-0.26.0.linux-amd64/amtool /usr/local/bin/

# Create the configuration file
# vi /etc/prometheus/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.fgedu.net.cn:25'
  smtp_from: 'alertmanager@fgedu.net.cn'
  smtp_auth_username: 'alertmanager@fgedu.net.cn'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'critical-receiver'
    - match:
        severity: warning
      receiver: 'warning-receiver'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'admin@fgedu.net.cn'
  - name: 'critical-receiver'
    email_configs:
      - to: 'admin@fgedu.net.cn'
    webhook_configs:
      - url: 'http://192.168.1.100:5001/webhook'
  - name: 'warning-receiver'
    email_configs:
      - to: 'admin@fgedu.net.cn'

# Create the service file
# vi /etc/systemd/system/alertmanager.service

[Unit]
Description=Alertmanager
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/prometheus/alertmanager.yml \
  --storage.path=/data/alertmanager \
  --web.listen-address=:9093
Restart=on-failure

[Install]
WantedBy=multi-user.target

# Create the data directory and start the service
# mkdir -p /data/alertmanager
# chown prometheus:prometheus /data/alertmanager
# systemctl daemon-reload
# systemctl start alertmanager
# systemctl enable alertmanager

9.2 Performance Monitoring

# View Prometheus's own metrics
$ curl -s http://localhost:9090/metrics | grep prometheus_

# HELP prometheus_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which prometheus was built.
# TYPE prometheus_build_info gauge
prometheus_build_info{branch="HEAD",goversion="go1.21.0",revision="abc123",version="2.45.0"} 1

# HELP prometheus_config_last_reload_success_timestamp_seconds Timestamp of the last successful configuration reload.
# TYPE prometheus_config_last_reload_success_timestamp_seconds gauge
prometheus_config_last_reload_success_timestamp_seconds 1.7122056e+09

# HELP prometheus_config_last_reload_successful Whether the last configuration reload attempt was successful.
# TYPE prometheus_config_last_reload_successful gauge
prometheus_config_last_reload_successful 1

# View the status of scrape targets
$ curl -s 'http://localhost:9090/api/v1/targets' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

{
  "job": "prometheus",
  "health": "up"
}
{
  "job": "node_exporter",
  "health": "up"
}

# View the TSDB status
$ curl -s 'http://localhost:9090/api/v1/status/tsdb'

{
  "status": "success",
  "data": {
    "headStats": {
      "numSeries": 10000,
      "numLabelPairs": 1000,
      "chunkCount": 50000,
      "minTime": 1712205000000,
      "maxTime": 1712205600000
    },
    "seriesCountByMetricName": [...],
    "labelValueCountByLabelName": [...]
  }
}

9.3 High Availability Configuration

# Prometheus high-availability options

# Option 1: multiple instances
# Deploy several Prometheus instances, each collecting data independently
# Use a load balancer to distribute query requests

# Option 2: federation
# Configuration on the primary (federating) Prometheus
# vi /etc/prometheus/prometheus.yml

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{job="node_exporter"}'
    static_configs:
      - targets:
          - '192.168.1.52:9090'
          - '192.168.1.53:9090'
        labels:
          federation: 'datacenter1'

# Option 3: Thanos or VictoriaMetrics
# Thanos provides long-term storage and highly available querying
# VictoriaMetrics provides high-performance storage and querying

# Thanos Sidecar configuration example
# vi /etc/systemd/system/thanos-sidecar.service

[Unit]
Description=Thanos Sidecar
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/data/prometheus \
  --objstore.config-file=/etc/thanos/objectstore.yml \
  --http-address=0.0.0.0:19191 \
  --grpc-address=0.0.0.0:10901
Restart=on-failure

[Install]
WantedBy=multi-user.target

Tip: A single Prometheus instance is a single point of failure, so use multiple instances or a federated architecture for high availability. For large-scale monitoring, consider Thanos or VictoriaMetrics.

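This article was compiled and published by Fengge Tutorials (风哥教程) for learning and testing purposes only. When reposting, cite the source: http://www.fgedu.net.cn/10327.html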
