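1. Prometheus Overview and Environment Planning
Prometheus is an open-source monitoring system and time series database for tracking the performance and health of systems and applications. It uses a pull model: the server scrapes metrics from its targets over HTTP.
1.1 Prometheus Version Notes
At the time of writing, the mainstream Prometheus release line is 2.x, and this tutorial uses Prometheus 2.44.0 as its example. Compared with the 1.x series, the 2.x releases are markedly better in performance, stability, and functionality, and support more monitoring features.
# Check the Prometheus version
$ prometheus --version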
prometheus, version 2.44.0 (branch: HEAD, revision: 734b0952f4c3a96b8586e05e661312425a1a05b0)
build user: root@1a2b3c4d5e6f
build date: 2023-04-05T15:33:19Z
go version: go1.19.6
platform: linux/amd64
# Check the OS version
$ cat /etc/os-release
NAME="Oracle Linux Server"
VERSION="8.9"
ID="ol"
PRETTY_NAME="Oracle Linux Server 8.9"
# Check the kernel version
$ uname -r
5.4.17-2136.302.7.2.el8uek.x86_64
1.2 Environment Planning
The installation environment is planned as follows:
monitor01.fgedu.net.cn (192.168.1.81) - Prometheus primary node
monitor02.fgedu.net.cn (192.168.1.82) - Prometheus standby node
Prometheus version: 2.44.0
AlertManager version: 0.25.0
Grafana version: 9.5.2
Installation method: binary installation
Data storage: local file system + NFS shared storage
2. Hardware Requirements
How much hardware Prometheus needs depends mainly on the number of monitored targets and the data retention period.
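As a rough sizing aid, the Prometheus storage documentation gives the estimate needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample, with roughly 1 to 2 bytes per sample. The sketch below applies that formula. The ingestion rate of 100000 samples per second is an assumed figure for illustration, not a measurement from this environment, and the function name is hypothetical.

```shell
#!/bin/bash
# Estimate TSDB disk usage in bytes.
# Arguments: retention in days, ingested samples per second, bytes per sample.
estimate_disk_bytes() {
    local retention_days=$1 samples_per_sec=$2 bytes_per_sample=$3
    echo $(( retention_days * 86400 * samples_per_sec * bytes_per_sample ))
}

# Assumed example: 15 days retention, 100000 samples/s, 2 bytes per sample.
bytes=$(estimate_disk_bytes 15 100000 2)
echo "Estimated disk usage: $(( bytes / 1000000000 )) GB"
```

With these assumed inputs the estimate comes to about 259 GB, which fits comfortably on the 1 TB data disk planned below. Measure the real ingestion rate on a running server before relying on the number.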
2.1 Physical Host Requirements
- CPU: at least 8 cores
- Memory: at least 32 GB
- Disk: 120 GB SSD system disk + 1 TB SSD data disk
# Check the monitoring server's resources
# free -h
total used free shared buff/cache available
Mem: 32G 8.4G 22G 512M 3.6G 23G
Swap: 8G 0B 8G
# Check disk space
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 120G 20G 100G 17% /
/dev/sdb1 1.0T 50G 950G 5% /data
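2.2 vSphere Virtual Machine Requirements
- Monitoring server:
  - vCPU: 8 cores
  - Memory: 32 GB
  - Disk: 120 GB SSD system disk + 1 TB SSD data disk
  - Network: VMXNET3 adapter, 10 Gbps network
Resource pool settings:
- CPU reservation: 4 GHz
- Memory reservation: 16 GB
- Memory limit: 32 GB
- CPU shares: Normal
- Memory shares: Normal
2.3 Cloud Platform Host Requirements
- Monitoring server:
  - Instance type: ecs.g6.4xlarge or equivalent
  - vCPU: 16 cores
  - Memory: 64 GB
  - System disk: 120 GB SSD cloud disk
  - Data disk: 1 TB SSD cloud disk
  - Network bandwidth: 10 Gbps or higher
Storage configuration:
- OSS object storage: stores backups of monitoring data
- NAS file storage: shares monitoring data
- Cloud disk snapshots: periodic backups of monitoring data
3. Operating System Preparation
Before installing Prometheus, the operating system needs some basic configuration and tuning.
3.1 Operating System Version Check
# Check the OS version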
# cat /etc/os-release
NAME="Oracle Linux Server"
VERSION="8.9"
ID="ol"
PRETTY_NAME="Oracle Linux Server 8.9"
# Check the kernel version
# uname -r
5.4.17-2136.302.7.2.el8uek.x86_64
# Check the SELinux status
# getenforce
Enforcing
# Check the firewall status
# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: active (running)
3.2 Installing Dependencies
# Install basic tools
# dnf install -y wget curl tar gzip
# Stop and disable the firewall
# systemctl stop firewalld
# systemctl disable firewalld
# Disable SELinux (permissive immediately, disabled after the next reboot)
# setenforce 0
# sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
# Create the prometheus user
# useradd -r -s /bin/false prometheus
# Create the directory layout
# mkdir -p /data/prometheus/{data,config,bin}
# chown -R prometheus:prometheus /data/prometheus
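The steps above turn firewalld off entirely, which is simple for a lab. As a possible alternative, firewalld can stay enabled with only the needed ports opened. The sketch below prints the firewall-cmd commands for the ports used in this tutorial rather than running them; the function name is hypothetical, and the port list is an assumption to adjust for whichever exporters are actually installed.

```shell
#!/bin/bash
# Print the firewall-cmd commands that would open the given TCP ports.
# Nothing is executed; review the output, then run it as root.
print_firewall_commands() {
    local port
    for port in "$@"; do
        echo "firewall-cmd --permanent --add-port=${port}/tcp"
    done
    echo "firewall-cmd --reload"
}

# Ports used in this tutorial: Prometheus, AlertManager, Node Exporter, Grafana.
print_firewall_commands 9090 9093 9100 3000
```

Printing instead of executing keeps the sketch safe to run anywhere; pipe the output to a root shell only after checking it.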
3.3 Network Configuration
# Edit the network interface configuration
# vi /etc/sysconfig/network-scripts/ifcfg-ens33
TYPE=Ethernet
BOOTPROTO=static
NAME=ens33
DEVICE=ens33
ONBOOT=yes
IPADDR=192.168.1.81
NETMASK=255.255.255.0
GATEWAY=192.168.1.1
DNS1=114.114.114.114
# Restart the network
# systemctl restart NetworkManager
# Verify the network
# ping -c 4 google.com
4. Installing and Configuring Prometheus
With the environment prepared, Prometheus can now be installed.
4.1 Installing Prometheus
# Download Prometheus
# wget https://github.com/prometheus/prometheus/releases/download/v2.44.0/prometheus-2.44.0.linux-amd64.tar.gz
# Extract the archive
# tar -xzf prometheus-2.44.0.linux-amd64.tar.gz
# Move the binaries, plus the console files referenced by the systemd unit below
# mv prometheus-2.44.0.linux-amd64/{prometheus,promtool,consoles,console_libraries} /data/prometheus/bin/
# Create the configuration file
# vi /data/prometheus/config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
# Create the rules directory
# mkdir -p /data/prometheus/config/rules
# Create the systemd service file
# vi /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
After=network.target
[Service]
User=prometheus
ExecStart=/data/prometheus/bin/prometheus \
  --config.file=/data/prometheus/config/prometheus.yml \
  --storage.tsdb.path=/data/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --web.console.templates=/data/prometheus/bin/consoles \
  --web.console.libraries=/data/prometheus/bin/console_libraries \
  --web.listen-address=:9090
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Start Prometheus
# systemctl daemon-reload
# systemctl start prometheus
# systemctl enable prometheus
# Verify the installation
# systemctl status prometheus
# curl http://localhost:9090/metrics
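Right after `systemctl start`, Prometheus may still be replaying its write-ahead log, so a single curl can fail even though the service is healthy. One way to handle that is a small retry helper, sketched below. The helper name and the attempt and delay values are assumptions; `/-/ready` is the readiness endpoint Prometheus exposes.

```shell
#!/bin/bash
# Retry a command until it succeeds or the attempts run out.
# Arguments: max attempts, delay in seconds, then the command to run.
wait_until() {
    local attempts=$1 delay=$2 i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        i=$(( i + 1 ))
    done
    return 1
}

# Usage (assumes Prometheus listens on localhost:9090):
#   wait_until 30 2 curl -sf http://localhost:9090/-/ready && echo "Prometheus is ready"
```

The same helper works for the other services in this tutorial by swapping the URL.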
4.2 Installing Node Exporter
# Download Node Exporter
# wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
# Extract the archive
# tar -xzf node_exporter-1.5.0.linux-amd64.tar.gz
# mv node_exporter-1.5.0.linux-amd64/node_exporter /data/prometheus/bin/
# Create the systemd service file
# vi /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=prometheus
ExecStart=/data/prometheus/bin/node_exporter
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Start Node Exporter
# systemctl daemon-reload
# systemctl start node_exporter
# systemctl enable node_exporter
# Verify the installation
# systemctl status node_exporter
# curl http://localhost:9100/metrics
4.3 Installing AlertManager
# Download AlertManager
# wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
# Extract the archive
# tar -xzf alertmanager-0.25.0.linux-amd64.tar.gz
# mv alertmanager-0.25.0.linux-amd64/{alertmanager,amtool} /data/prometheus/bin/
# Create the configuration file
# vi /data/prometheus/config/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@fgedu.net.cn'
        from: 'prometheus@fgedu.net.cn'
        smarthost: 'smtp.fgedu.net.cn:25'
        auth_username: 'prometheus'
        auth_password: 'password'
# Create the systemd service file
# vi /etc/systemd/system/alertmanager.service
[Unit]
Description=AlertManager
After=network.target
[Service]
User=prometheus
ExecStart=/data/prometheus/bin/alertmanager \
  --config.file=/data/prometheus/config/alertmanager.yml \
  --storage.path=/data/prometheus/data/alertmanager
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Start AlertManager
# systemctl daemon-reload
# systemctl start alertmanager
# systemctl enable alertmanager
# Verify the installation
# systemctl status alertmanager
# curl http://localhost:9093/metrics
5. Prometheus Configuration Tuning
A few configuration changes improve Prometheus performance and stability.
5.1 Storage Configuration Tuning
# Edit the Prometheus configuration
# vi /data/prometheus/config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
# Edit the systemd service file
# vi /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
After=network.target
[Service]
User=prometheus
ExecStart=/data/prometheus/bin/prometheus \
  --config.file=/data/prometheus/config/prometheus.yml \
  --storage.tsdb.path=/data/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.wal-compression \
  --web.console.templates=/data/prometheus/bin/consoles \
  --web.console.libraries=/data/prometheus/bin/console_libraries \
  --web.listen-address=:9090
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Restart Prometheus
# systemctl daemon-reload
# systemctl restart prometheus
5.2 High Availability Configuration
# On the standby node, repeat the installation steps used for the primary node
# Configure Prometheus on the primary node
# vi /data/prometheus/config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
            - monitor02.fgedu.net.cn:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090', 'monitor02.fgedu.net.cn:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100', 'monitor02.fgedu.net.cn:9100']
# Configure Prometheus on the standby node
# vi /data/prometheus/config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
            - monitor01.fgedu.net.cn:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090', 'monitor01.fgedu.net.cn:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100', 'monitor01.fgedu.net.cn:9100']
# Restart Prometheus on both nodes
# systemctl restart prometheus
5.3 Memory Configuration
# Edit the systemd service file
# vi /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
After=network.target
[Service]
User=prometheus
Environment="GODEBUG=madvdontneed=1"
ExecStart=/data/prometheus/bin/prometheus \
  --config.file=/data/prometheus/config/prometheus.yml \
  --storage.tsdb.path=/data/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.wal-compression \
  --web.console.templates=/data/prometheus/bin/consoles \
  --web.console.libraries=/data/prometheus/bin/console_libraries \
  --web.listen-address=:9090
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
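# Restart Prometheus
# systemctl daemon-reload
# systemctl restart prometheus
6. Prometheus Exporter Configuration
Prometheus collects metrics from systems and applications through exporters.
6.1 Node Exporter Configuration
# Edit the systemd service file
# vi /etc/systemd/system/node_exporter.service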
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=prometheus
ExecStart=/data/prometheus/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.diskstats \
  --collector.filesystem \
  --collector.netstat \
  --collector.loadavg \
  --collector.meminfo \
  --collector.cpu
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Restart Node Exporter
# systemctl daemon-reload
# systemctl restart node_exporter
6.2 Blackbox Exporter Configuration
# Download Blackbox Exporter
# wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.23.0/blackbox_exporter-0.23.0.linux-amd64.tar.gz
# Extract the archive
# tar -xzf blackbox_exporter-0.23.0.linux-amd64.tar.gz
# mv blackbox_exporter-0.23.0.linux-amd64/blackbox_exporter /data/prometheus/bin/
# Create the configuration file
# vi /data/prometheus/config/blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 201, 202, 203, 204, 205, 206, 207, 208, 226]
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
# Create the systemd service file
# vi /etc/systemd/system/blackbox_exporter.service
[Unit]
Description=Blackbox Exporter
After=network.target
[Service]
User=prometheus
ExecStart=/data/prometheus/bin/blackbox_exporter \
  --config.file=/data/prometheus/config/blackbox.yml \
  --web.listen-address=:9115
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Start Blackbox Exporter
# systemctl daemon-reload
# systemctl start blackbox_exporter
# systemctl enable blackbox_exporter
# Configure Prometheus: add the following job under the existing scrape_configs
# vi /data/prometheus/config/prometheus.yml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://prometheus.io
          - http://grafana.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
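The relabeling rules above copy each listed target into the `target` query parameter, keep it as the `instance` label, and then point the scrape address at the exporter itself. In effect Prometheus requests the exporter's `/probe` endpoint once per target. The sketch below builds that URL for a manual check. The function name is hypothetical, and the target is not URL-encoded, which is an acceptable simplification only for simple URLs like the ones used here.

```shell
#!/bin/bash
# Build the probe URL for a Blackbox Exporter instance.
# Arguments: exporter address, module name, probe target.
# Note: the target is not URL-encoded here, which is fine for simple URLs.
blackbox_probe_url() {
    local exporter=$1 module=$2 target=$3
    echo "http://${exporter}/probe?module=${module}&target=${target}"
}

blackbox_probe_url localhost:9115 http_2xx http://prometheus.io
# Manual check once the exporter is running:
#   curl -s "$(blackbox_probe_url localhost:9115 http_2xx http://prometheus.io)" | grep probe_success
```

A `probe_success 1` line in the curl output indicates that the module accepted the response.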
6.3 MySQL Exporter Configuration
# Download MySQL Exporter
# wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.14.0/mysqld_exporter-0.14.0.linux-amd64.tar.gz
# Extract the archive
# tar -xzf mysqld_exporter-0.14.0.linux-amd64.tar.gz
# mv mysqld_exporter-0.14.0.linux-amd64/mysqld_exporter /data/prometheus/bin/
# Create the MySQL user
# mysql -u root -p
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'password' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
EXIT;
# Create the credentials file
# vi /data/prometheus/config/.my.cnf
[client]
user=exporter
password=password
# Create the systemd service file
# vi /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=MySQL Exporter
After=network.target
[Service]
User=prometheus
ExecStart=/data/prometheus/bin/mysqld_exporter \
  --config.my-cnf=/data/prometheus/config/.my.cnf \
  --web.listen-address=:9104
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Start MySQL Exporter
# systemctl daemon-reload
# systemctl start mysqld_exporter
# systemctl enable mysqld_exporter
# Configure Prometheus: add the following job under the existing scrape_configs
# vi /data/prometheus/config/prometheus.yml
scrape_configs:
  - job_name: 'mysql'
    static_configs:
      - targets: ['localhost:9104']
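The `.my.cnf` file created in this section holds a database password in plain text, so it is worth restricting it to the prometheus user. The sketch below is one way to verify the permission bits after tightening them. The helper name is hypothetical, and it relies on GNU `stat` as shipped with Oracle Linux.

```shell
#!/bin/bash
# Return 0 if the file's permission bits equal the expected octal mode.
# Uses GNU stat (coreutils), as found on Oracle Linux.
has_mode() {
    local file=$1 expected=$2
    [ "$(stat -c '%a' "$file")" = "$expected" ]
}

# Usage, as root (paths from this tutorial):
#   chown prometheus:prometheus /data/prometheus/config/.my.cnf
#   chmod 600 /data/prometheus/config/.my.cnf
#   has_mode /data/prometheus/config/.my.cnf 600 && echo "permissions OK"
```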
7. AlertManager Configuration
AlertManager handles the alerts that Prometheus generates and sends them to the configured receivers.
7.1 AlertManager Configuration
# Edit the AlertManager configuration
# vi /data/prometheus/config/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.fgedu.net.cn:25'
  smtp_from: 'prometheus@fgedu.net.cn'
  smtp_auth_username: 'prometheus'
  smtp_auth_password: 'password'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
  routes:
    - match:
        severity: critical
      receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@fgedu.net.cn'
        send_resolved: true
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
        channel: '#alerts'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
# Restart AlertManager
# systemctl restart alertmanager
7.2 Alert Rule Configuration
# Create the alert rules file
# vi /data/prometheus/config/rules/node-alerts.yml
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes"
      - alert: HighCPUUsage
        expr: (100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 80% for more than 5 minutes"
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Disk usage is above 80% for more than 5 minutes"
# Restart Prometheus
# systemctl restart prometheus
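To make the HighMemoryUsage expression concrete, the sketch below reproduces its arithmetic, (total - available) / total * 100, with shell integer math. The function name is hypothetical, and the inputs are the 32G total and 23G available figures from the `free -h` sample in section 2.1.

```shell
#!/bin/bash
# Integer percentage of memory in use: (total - available) / total * 100.
# Arguments: total and available memory, in the same unit.
mem_used_percent() {
    local total=$1 available=$2
    echo $(( (total - available) * 100 / total ))
}

# Values from the `free -h` sample in section 2.1: 32G total, 23G available.
used=$(mem_used_percent 32 23)
if [ "$used" -gt 80 ]; then
    echo "memory usage ${used}% exceeds the 80% threshold"
else
    echo "memory usage ${used}% is below the 80% threshold"
fi
```

With those inputs usage is 28 percent, so the alert would stay silent. The rule only fires after the expression has stayed above 80 for the full 5 minute `for` window.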
8. Grafana Integration
Grafana visualizes the metrics that Prometheus collects.
8.1 Installing Grafana
# Install Grafana
# dnf install -y https://dl.grafana.com/oss/release/grafana-9.5.2-1.x86_64.rpm
# Start Grafana
# systemctl start grafana-server
# systemctl enable grafana-server
# Verify the installation
# systemctl status grafana-server
# curl http://localhost:3000
8.2 Configuring Grafana
# Open http://localhost:3000 in a browser
# Log in with username admin and password admin
# Add the Prometheus data source
# 1. In the left-hand menu, click "Configuration" -> "Data sources"
# 2. Click "Add data source"
# 3. Select "Prometheus"
# 4. Set the URL to http://localhost:9090
# 5. Click "Save & Test"
# Import a dashboard
# 1. In the left-hand menu, click "Dashboards" -> "Import"
# 2. Enter dashboard ID 1860 (Node Exporter Full)
# 3. Click "Load"
# 4. Select the Prometheus data source
# 5. Click "Import"
9. Prometheus Security Configuration
Prometheus supports basic authentication and TLS encryption for its web interface and API.
9.1 Authentication Configuration
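# Install htpasswd
# dnf install -y httpd-tools
# Create the password file with a bcrypt hash (-B); Prometheus only accepts bcrypt hashes
# htpasswd -B -C 10 -c /data/prometheus/config/.htpasswd admin
New password:
Re-type new password:
Adding password for user admin
# Basic auth goes in a separate web configuration file, not in prometheus.yml
# vi /data/prometheus/config/web.yml
# Add the following, using the hash stored in .htpasswd
basic_auth_users:
  admin: $2y$10$xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# Add this flag to ExecStart in /etc/systemd/system/prometheus.service
  --web.config.file=/data/prometheus/config/web.yml
# Restart Prometheus
# systemctl daemon-reload
# systemctl restart prometheus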
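9.2 TLS Encryption Configuration
# Generate a self-signed certificate
# openssl req -newkey rsa:2048 -nodes -keyout /data/prometheus/config/prometheus.key -x509 -days 365 -out /data/prometheus/config/prometheus.crt
# chown prometheus:prometheus /data/prometheus/config/prometheus.{key,crt}
# TLS goes in the web configuration file passed with --web.config.file, not in prometheus.yml
# vi /data/prometheus/config/web.yml
# Add the following
tls_server_config:
  cert_file: /data/prometheus/config/prometheus.crt
  key_file: /data/prometheus/config/prometheus.key
# Make sure ExecStart in prometheus.service includes --web.config.file=/data/prometheus/config/web.yml
# Restart Prometheus
# systemctl daemon-reload
# systemctl restart prometheus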
10. Prometheus Performance Tuning
In production, Prometheus should be tuned so that it monitors efficiently.
10.1 Storage Tuning
# Edit the systemd service file
# vi /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
After=network.target
[Service]
User=prometheus
# Note: equal min and max block durations (2h) disable local block compaction; this is typically done only when blocks are uploaded to remote storage such as Thanos
Environment="GODEBUG=madvdontneed=1"
ExecStart=/data/prometheus/bin/prometheus \
  --config.file=/data/prometheus/config/prometheus.yml \
  --storage.tsdb.path=/data/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.wal-compression \
  --storage.tsdb.max-block-duration=2h \
  --storage.tsdb.min-block-duration=2h \
  --web.console.templates=/data/prometheus/bin/consoles \
  --web.console.libraries=/data/prometheus/bin/console_libraries \
  --web.listen-address=:9090
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Restart Prometheus
# systemctl daemon-reload
# systemctl restart prometheus
10.2 Memory Tuning
# Edit the systemd service file
# vi /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
After=network.target
[Service]
User=prometheus
Environment="GODEBUG=madvdontneed=1"
Environment="GOMAXPROCS=8"
ExecStart=/data/prometheus/bin/prometheus \
  --config.file=/data/prometheus/config/prometheus.yml \
  --storage.tsdb.path=/data/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.wal-compression \
  --web.console.templates=/data/prometheus/bin/consoles \
  --web.console.libraries=/data/prometheus/bin/console_libraries \
  --web.listen-address=:9090
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Restart Prometheus
# systemctl daemon-reload
# systemctl restart prometheus
10.3 Scrape Configuration Tuning
# Set a scrape interval per job
# vi /data/prometheus/config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    scrape_interval: 30s
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'mysql'
    scrape_interval: 60s
    static_configs:
      - targets: ['localhost:9104']
# Restart Prometheus
# systemctl restart prometheus
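Lengthening a job's scrape interval lowers the ingestion rate roughly in proportion, since each target delivers its series once per interval. The sketch below shows that relationship with integer math. The function name is hypothetical, and the figures of 100 targets and 1000 series per target are assumptions chosen for illustration.

```shell
#!/bin/bash
# Approximate ingestion rate in samples per second.
# Arguments: number of targets, series per target, scrape interval in seconds.
ingest_rate() {
    local targets=$1 series_per_target=$2 interval=$3
    echo $(( targets * series_per_target / interval ))
}

# Assumed example: 100 node targets exposing 1000 series each.
echo "15s interval: $(ingest_rate 100 1000 15) samples/s"
echo "30s interval: $(ingest_rate 100 1000 30) samples/s"
```

Doubling the interval from 15s to 30s halves the rate in this model, at the cost of coarser data and slower alert detection for that job.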
11. Prometheus Upgrade and Migration
This section covers upgrading Prometheus and migrating its data.
11.1 Upgrading Prometheus
# Stop Prometheus so the data directory is copied in a consistent state
# systemctl stop prometheus
# Back up data and configuration
# mkdir -p /backup
# cp -r /data/prometheus/data /backup/prometheus-data-$(date +%Y%m%d)
# cp /data/prometheus/config/prometheus.yml /backup/prometheus-config-$(date +%Y%m%d).yml
# Download the new Prometheus version
# wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
# Extract the archive
# tar -xzf prometheus-2.45.0.linux-amd64.tar.gz
# Replace the binaries
# mv prometheus-2.45.0.linux-amd64/{prometheus,promtool} /data/prometheus/bin/
# Start Prometheus
# systemctl start prometheus
# Verify the upgrade
# prometheus --version
prometheus, version 2.45.0 (branch: HEAD, revision: abcdefg1234567890abcdefg1234567890abcdefg)
build user: root@1a2b3c4d5e6f
build date: 2023-05-01T15:33:19Z
go version: go1.19.6
platform: linux/amd64
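# Access the Prometheus web UI
# Open http://localhost:9090 in a browser
11.2 Migrating Prometheus Data
# Stop Prometheus on the old server
# systemctl stop prometheus
# Copy the data to the new server
# scp -r /data/prometheus/data root@new-server:/data/prometheus/
# scp /data/prometheus/config/prometheus.yml root@new-server:/data/prometheus/config/
# On the new server, fix ownership and start Prometheus
# chown -R prometheus:prometheus /data/prometheus
# systemctl start prometheus
# Verify the migration
# curl http://new-server:9090/metrics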
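A successful curl only shows that the new server started; it does not prove that every block was copied intact. One possible check is to compare a checksum of the data directory on both servers, as sketched below. The function name is hypothetical, and it assumes GNU coreutils and findutils (`sha256sum`, `sort -z`, `xargs -0 -r`).

```shell
#!/bin/bash
# Print one SHA-256 digest summarizing every regular file under a directory.
# Paths are taken relative to the directory so two copies compare equal.
dir_checksum() {
    local dir=$1
    ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum | sha256sum | awk '{print $1}' )
}

# Usage: run on both servers while Prometheus is stopped, then compare the output.
#   dir_checksum /data/prometheus/data
```

Run it before starting Prometheus on the new server, because a running instance rewrites the WAL and the digests will no longer match.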
12. Prometheus Backup and Restore
This section covers backing up and restoring Prometheus.
12.1 Backing Up Prometheus
# Create the backup script
# mkdir -p /data/prometheus/scripts
# vi /data/prometheus/scripts/backup.sh
#!/bin/bash
BACKUP_DIR="/backup/prometheus"
DATE=$(date +%Y%m%d)
# Create the backup directory
mkdir -p "$BACKUP_DIR"
# Stop Prometheus
systemctl stop prometheus
# Back up data, configuration, and rules
cp -r /data/prometheus/data "$BACKUP_DIR/data-$DATE"
cp /data/prometheus/config/prometheus.yml "$BACKUP_DIR/config-$DATE.yml"
cp /data/prometheus/config/alertmanager.yml "$BACKUP_DIR/alertmanager-$DATE.yml"
cp -r /data/prometheus/config/rules "$BACKUP_DIR/rules-$DATE"
# Start Prometheus
systemctl start prometheus
# Remove backups older than 7 days (top-level files and directories only)
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} +
# Make the script executable
# chmod +x /data/prometheus/scripts/backup.sh
# Add a cron job
# crontab -e
0 0 * * * /data/prometheus/scripts/backup.sh
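The cleanup step in the backup script has to remove old configuration files as well as old data directories, and it must never delete the backup directory itself. The sketch below isolates that step as a function so it can be tried on a scratch directory first. The function name is hypothetical, and `-mindepth` and `-maxdepth` assume GNU find.

```shell
#!/bin/bash
# Delete top-level entries of a backup directory older than the given number of days.
# -mindepth 1 protects the directory itself; -maxdepth 1 avoids descending into
# entries that are about to be removed.
prune_backups() {
    local dir=$1 days=$2
    find "$dir" -mindepth 1 -maxdepth 1 -mtime +"$days" -exec rm -rf {} +
}

# Usage (path and retention from this tutorial):
#   prune_backups /backup/prometheus 7
```

Because find matches on modification time, a backup is kept for at least the stated number of full days after it was written.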
12.2 Restoring Prometheus
# Stop Prometheus
# systemctl stop prometheus
# Remove the existing data
# rm -rf /data/prometheus/data
# Restore the data
# cp -r /backup/prometheus/data-20230405 /data/prometheus/data
# cp /backup/prometheus/config-20230405.yml /data/prometheus/config/prometheus.yml
# cp /backup/prometheus/alertmanager-20230405.yml /data/prometheus/config/alertmanager.yml
# cp -r /backup/prometheus/rules-20230405/* /data/prometheus/config/rules/
# Restore ownership for the prometheus user
# chown -R prometheus:prometheus /data/prometheus
# Start Prometheus
# systemctl start prometheus
# Verify the restore
# curl http://localhost:9090/metrics
# Open http://localhost:9090 in a browser
12.3 Prometheus Monitoring Script
# Create the monitoring script
# vi /data/prometheus/scripts/monitor.sh
#!/bin/bash
LOG_FILE="/var/log/prometheus_monitor.log"
ALERT_EMAIL="admin@fgedu.net.cn"
# Append a timestamped message to the log file
log() {
    echo "$(date): $1" >> "$LOG_FILE"
}
# Check a systemd service; send an alert and start it if it is not active
# Arguments: unit name, display name
check_service_status() {
    local unit=$1 name=$2
    log "Checking $unit status..."
    if ! systemctl is-active --quiet "$unit"; then
        log "$name is not running"
        echo "$name is not running" | mail -s "Prometheus Alert" "$ALERT_EMAIL"
        systemctl start "$unit"
    else
        log "$name is running"
    fi
}
# Check the Prometheus health endpoint; the root URL answers with a redirect, not 200
check_prometheus_web() {
    local status
    log "Checking prometheus web..."
    status=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:9090/-/healthy)
    if [ "$status" = "200" ]; then
        log "Prometheus web: OK"
    else
        log "Prometheus web: FAILED"
        echo "Prometheus web failed" | mail -s "Prometheus Alert" "$ALERT_EMAIL"
    fi
}
main() {
    check_service_status prometheus "Prometheus"
    check_prometheus_web
    check_service_status alertmanager "AlertManager"
    check_service_status node_exporter "Node Exporter"
}
main
# Make the script executable
# chmod +x /data/prometheus/scripts/monitor.sh
# Add a cron job
# crontab -e
*/15 * * * * /data/prometheus/scripts/monitor.sh
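With the steps above, Prometheus installation and configuration, performance tuning, upgrade and migration, and backup and restore are complete. As an open-source monitoring system, Prometheus collects and analyzes monitoring data efficiently and is an important part of an enterprise monitoring solution.
This article was compiled and published by 风哥教程 for learning and testing purposes only. When reposting, please cite the source: http://www.fgedu.net.cn/10327.html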
