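1. Prometheus Overview and Environment Planning
Prometheus is an open-source systems monitoring and alerting toolkit originally developed at SoundCloud. It collects data with a pull model, supports a multi-dimensional data model and the flexible query language PromQL, and is widely used to monitor cloud-native and microservice environments.
1.1 Prometheus Version
This tutorial uses Prometheus 2.45 throughout; all commands and sample output refer to version 2.45.0.
# Check the installed version
$ prometheus --version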
prometheus, version 2.45.0 (branch: HEAD, revision: abc123)
build user: root@buildhost
build date: 20240101-00:00:00
go version: go1.21.0
platform: linux/amd64
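# Validate the configuration file (the check is done by promtool, not by the prometheus binary)
$ promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
 SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file syntax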
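1.2 Environment Planning
The installation in this tutorial uses the following layout:
IP address: 192.168.1.51
HTTP port: 9090
Data directory: /data/prometheus
Configuration directory: /etc/prometheus
Log directory: /var/log/prometheus
Storage plan:
Data retention: 15 days
Scrape interval: 15 seconds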
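The retention period and scrape interval above largely determine how much disk the TSDB needs. The Prometheus storage documentation gives a rough rule: required disk space is about retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes). The sketch below only does that arithmetic. The function name `estimate_tsdb_bytes` and the figure of 100,000 active series are assumptions for illustration, not measurements from this environment.

```shell
# Rough TSDB disk estimate:
#   retention_seconds * samples_per_second * bytes_per_sample
# Arguments: retention in days, number of active series (assumed),
# scrape interval in seconds, bytes per sample (about 1-2).
estimate_tsdb_bytes() {
    retention_days=$1
    active_series=$2
    scrape_interval=$3
    bytes_per_sample=$4
    # samples_per_second = active_series / scrape_interval
    echo $(( retention_days * 86400 * active_series / scrape_interval * bytes_per_sample ))
}

# Example: 15 days, 100000 series, 15s interval, 2 bytes per sample
estimate_tsdb_bytes 15 100000 15 2
```

With these assumed inputs the result is 17280000000 bytes, about 17 GB. Replace the series count with the real value of `prometheus_tsdb_head_series` once the server is running.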
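1.3 Core Features of Prometheus
1. Multi-dimensional data model: metric names plus key-value labels
2. PromQL query language: powerful data querying
3. Pull-based collection: Prometheus actively scrapes metrics from its targets
4. Service discovery: monitoring targets are discovered automatically
5. Alerting: alerts are handled through Alertmanager
6. Visualization: integrates with Grafana
7. Time series storage: efficient local storage
8. Federation: supports hierarchical (multi-level) federation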
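2. Hardware Requirements and Checks
Before installing Prometheus, check the server's hardware and system environment thoroughly.
2.1 Minimum Hardware Requirements
CPU: 2 cores
Memory: 4GB
Disk: 20GB
Recommended configuration (production):
CPU: 4 cores or more
Memory: 16GB or more
Disk: 200GB or more, SSD
Large-scale monitoring configuration:
CPU: 8 cores or more
Memory: 32GB or more
Disk: 500GB or more, SSD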
2.2 System Environment Checks
# Check the operating system release
# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.8 (Ootpa)
# Check the kernel version
# uname -a
Linux fgedudb01 4.18.0-477.10.1.el8_8.x86_64 #1 SMP Fri Apr 4 10:00:00 CST 2026 x86_64 x86_64 x86_64 GNU/Linux
# Check memory
# free -h
              total        used        free      shared  buff/cache   available
Mem:           31Gi       1.0Gi        29Gi       256Mi       1.0Gi        30Gi
Swap:           7Gi          0B         7Gi
# Check disk space
# df -h
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/vg_system-lv_root   50G  2.5G   48G   5% /
/dev/sda2                     1014M  150M  865M  15% /boot
/dev/mapper/vg_data-lv_data    200G   20G  180G  10% /data
# Check time synchronization
# timedatectl status
Local time: Fri 2026-04-04 10:00:00 CST
Universal time: Fri 2026-04-04 02:00:00 UTC
RTC time: Fri 2026-04-04 02:00:00
Time zone: Asia/Shanghai (CST, +0800)
NTP enabled: yes
NTP synchronized: yes
2.3 Kernel Parameter Configuration
# vi /etc/sysctl.d/99-prometheus.conf
# Add the following parameters
# Network parameters
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 300
# System-wide file descriptor limit
fs.file-max = 655360
# Apply the kernel parameters
# sysctl -p /etc/sysctl.d/99-prometheus.conf
# Sample output:
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
…
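2.4 User Resource Limits
# vi /etc/security/limits.conf
# Add the following entries
prometheus soft nofile 65535
prometheus hard nofile 65535
prometheus soft nproc 65535
prometheus hard nproc 65535
# Note: limits.conf is applied by pam_limits to login sessions only.
# A service started by systemd takes its limits from the unit file (LimitNOFILE= in section 3.4).
# Create the prometheus system user
# useradd -r -s /sbin/nologin prometheus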
3. Installing Prometheus
This section walks through the installation of Prometheus 2.45.
3.1 Create the Directory Layout
# mkdir -p /etc/prometheus
# mkdir -p /data/prometheus
# mkdir -p /var/log/prometheus
# Set directory ownership and permissions
# chown -R prometheus:prometheus /etc/prometheus
# chown -R prometheus:prometheus /data/prometheus
# chown -R prometheus:prometheus /var/log/prometheus
# chmod -R 750 /data/prometheus
# chmod -R 750 /var/log/prometheus
# Verify the directory permissions
# ls -la /data/
total 0
drwxr-x---. 2 prometheus prometheus 6 Apr  4 10:00 prometheus
3.2 Download and Install Prometheus
# cd /usr/local/src
# wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
# Sample output:
--2026-04-04 10:00:00--  https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
Resolving github.com... 140.82.121.4
Connecting to github.com|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 100000000 (95M) [application/octet-stream]
Saving to: 'prometheus-2.45.0.linux-amd64.tar.gz'
100%[======================================>] 100,000,000 10.0MB/s in 9.5s
2026-04-04 10:00:10 (10.0 MB/s) - 'prometheus-2.45.0.linux-amd64.tar.gz' saved
# Extract the archive
# tar -xzf prometheus-2.45.0.linux-amd64.tar.gz
# Copy the binaries
# cp prometheus-2.45.0.linux-amd64/prometheus /usr/local/bin/
# cp prometheus-2.45.0.linux-amd64/promtool /usr/local/bin/
# Copy the default configuration file
# cp prometheus-2.45.0.linux-amd64/prometheus.yml /etc/prometheus/
# Copy the console templates and libraries that the systemd unit in 3.4 points to
# cp -r prometheus-2.45.0.linux-amd64/consoles /etc/prometheus/
# cp -r prometheus-2.45.0.linux-amd64/console_libraries /etc/prometheus/
# Set ownership and permissions
# chown prometheus:prometheus /usr/local/bin/prometheus
# chown prometheus:prometheus /usr/local/bin/promtool
# chmod 755 /usr/local/bin/prometheus
# chmod 755 /usr/local/bin/promtool
# Verify the installation
# prometheus --version
prometheus, version 2.45.0 (branch: HEAD, revision: abc123)
build user: root@buildhost
build date: 20240101-00:00:00
go version: go1.21.0
platform: linux/amd64
3.3 Create the Configuration File
# vi /etc/prometheus/prometheus.yml
# Replace the default contents with the following configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'fgedu-monitor'
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
rule_files:
  - /etc/prometheus/rules/*.yml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'fgedudb01'
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['192.168.1.51:9100', '192.168.1.52:9100']
        labels:
          group: 'production'
  - job_name: 'mysql_exporter'
    static_configs:
      - targets: ['192.168.1.51:9104']
  - job_name: 'redis_exporter'
    static_configs:
      - targets: ['192.168.1.51:9121']
# Create the rules directory
# mkdir -p /etc/prometheus/rules
# Create the alerting rules file
# vi /etc/prometheus/rules/alert_rules.yml
groups:
  - name: node_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Less than 15% disk space is available (current value: {{ $value }}% free)"
# Set ownership
# chown -R prometheus:prometheus /etc/prometheus
3.4 Create the Systemd Service
# vi /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=http://192.168.1.51:9090 \
  --log.level=info \
  --log.format=logfmt
# Prometheus re-reads its configuration on SIGHUP; this makes "systemctl reload prometheus" work
ExecReload=/bin/kill -HUP $MAINPID
RestartSec=5
LimitNOFILE=65535
[Install]
WantedBy=multi-user.target
# Reload systemd
# systemctl daemon-reload
# Start the Prometheus service
# systemctl start prometheus
# Enable start on boot
# systemctl enable prometheus
# Sample output:
Created symlink /etc/systemd/system/multi-user.target.wants/prometheus.service → /etc/systemd/system/prometheus.service.
# Check the service status
# systemctl status prometheus
● prometheus.service - Prometheus Monitoring System
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2026-04-04 10:00:00 CST; 5s ago
     Docs: https://prometheus.io/docs/introduction/overview/
 Main PID: 12345 (prometheus)
    Tasks: 8 (limit: 4915)
   Memory: 100.0M
   CGroup: /system.slice/prometheus.service
           └─12345 /usr/local/bin/prometheus --config.file=/etc/prometheus/prometheus.yml …
# Check the listening port
# netstat -tlnp | grep prometheus
tcp6 0 0 :::9090 :::* LISTEN 12345/prometheus
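A listening port shows that the process started, but not that it has finished loading its data. Prometheus exposes the management endpoints `/-/healthy` and `/-/ready` for that purpose. The sketch below polls `/-/ready` until it answers successfully. The function name `wait_for_prometheus` and the default retry count and delay are assumptions chosen for illustration.

```shell
# Poll the Prometheus readiness endpoint until it answers with a 2xx status
# or the attempts run out.
# Usage: wait_for_prometheus BASE_URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_prometheus() {
    base_url=$1          # e.g. http://localhost:9090
    attempts=${2:-30}    # assumed default: 30 tries
    delay=${3:-1}        # assumed default: 1 second between tries
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl return non-zero on HTTP errors such as 503
        if curl -sf -o /dev/null "$base_url/-/ready"; then
            echo "ready"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "not ready after $attempts attempts" >&2
    return 1
}

# Usage sketch:
# wait_for_prometheus http://localhost:9090 && echo "Prometheus is serving queries"
```

This kind of check is useful right after `systemctl start prometheus` in scripts, because replaying the write-ahead log can take a while on a large data directory.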
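4. Prometheus Configuration
Configuration is the key step in building the monitoring system; it directly affects both monitoring coverage and performance.
4.1 Configure Service Discovery
# vi /etc/prometheus/prometheus.yml
# The jobs below are shown one by one; add them under the existing scrape_configs key rather than repeating the key
# File-based service discovery
scrape_configs:
  - job_name: 'file_sd'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
        refresh_interval: 5m
# Create the target file
# mkdir -p /etc/prometheus/targets
# vi /etc/prometheus/targets/nodes.json
[
  {
    "targets": ["192.168.1.51:9100", "192.168.1.52:9100"],
    "labels": {
      "job": "node_exporter",
      "env": "production"
    }
  }
]
# Consul service discovery
scrape_configs:
  - job_name: 'consul_sd'
    consul_sd_configs:
      - server: '192.168.1.100:8500'
        services: ['node-exporter', 'mysql-exporter']
# Kubernetes service discovery
scrape_configs:
  - job_name: 'kubernetes_sd'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - monitoring
            - default
# Reload the configuration (this needs an ExecReload= line in the unit file; otherwise send SIGHUP to the prometheus process)
# systemctl reload prometheus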
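File-based discovery is convenient to drive from a script, since Prometheus picks up changes to the target files without a reload. The sketch below prints a target group in the same JSON shape as nodes.json above. The function name `make_file_sd` is hypothetical, and only the `env` label is emitted to keep the example short.

```shell
# Print one file_sd target group as JSON.
# Usage: make_file_sd ENV HOST:PORT [HOST:PORT ...]
make_file_sd() {
    env_label=$1
    shift
    targets=""
    for t in "$@"; do
        # Build a comma-separated list of quoted JSON strings
        targets="${targets:+$targets, }\"$t\""
    done
    printf '[\n  {\n    "targets": [%s],\n    "labels": {\n      "env": "%s"\n    }\n  }\n]\n' \
        "$targets" "$env_label"
}

# Usage sketch: write to a temporary file first, then rename it, so that
# Prometheus never reads a half-written file.
# make_file_sd production 192.168.1.51:9100 192.168.1.52:9100 > /etc/prometheus/targets/nodes.json.tmp
# mv /etc/prometheus/targets/nodes.json.tmp /etc/prometheus/targets/nodes.json
```

The helper does no escaping, so it assumes plain host:port values and label values without quotes or backslashes.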
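4.2 Configure Remote Storage
# vi /etc/prometheus/prometheus.yml
# Configure remote write
remote_write:
  - url: "http://192.168.1.51:8428/api/v1/write"
    queue_config:
      max_samples_per_send: 10000
      max_shards: 200
      # capacity is the per-shard buffer; the Prometheus tuning guide suggests 3 to 10 times max_samples_per_send
      capacity: 30000
# Configure remote read
remote_read:
  - url: "http://192.168.1.51:8428/api/v1/read"
    read_recent: true
# Restart the service
# systemctl restart prometheus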
4.3 Configure Alerting Rules
# vi /etc/prometheus/rules/alert_rules.yml
# Append the following groups under the existing groups key (the file already holds node_alerts from 3.3)
groups:
  - name: mysql_alerts
    rules:
      - alert: MySQLDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL instance {{ $labels.instance }} is down"
          description: "MySQL instance has been down for more than 1 minute."
      - alert: MySQLReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL replication lag on {{ $labels.instance }}"
          description: "Replication lag is {{ $value }} seconds."
      - alert: MySQLTooManyConnections
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL too many connections on {{ $labels.instance }}"
          description: "Connection usage is {{ $value }}%."
  - name: redis_alerts
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis instance {{ $labels.instance }} is down"
          description: "Redis instance has been down for more than 1 minute."
      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage high on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%."
# Validate the rules (4 node rules from 3.3 plus the 5 rules above)
# promtool check rules /etc/prometheus/rules/alert_rules.yml
Checking /etc/prometheus/rules/alert_rules.yml
  SUCCESS: 9 rules found
# Reload the configuration
# systemctl reload prometheus
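A reload that fails leaves Prometheus running with the previous configuration, so it is worth confirming that the reload was accepted. Prometheus reports this in its own metric `prometheus_config_last_reload_successful` (1 for success, 0 for failure). The sketch below reads the /metrics text on standard input and checks that value; the function name `last_reload_ok` is hypothetical.

```shell
# Read Prometheus /metrics text on stdin and succeed only if the last
# configuration reload was successful (metric present with value 1).
last_reload_ok() {
    awk '$1 == "prometheus_config_last_reload_successful" { found = 1; ok = ($2 == 1) }
         END { exit !(found && ok) }'
}

# Usage sketch:
# systemctl reload prometheus
# curl -s http://localhost:9090/metrics | last_reload_ok && echo "reload applied"
```

If the check fails, the Prometheus log (`journalctl -u prometheus`) normally names the offending line in the configuration or rule file.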
5. Data Collection and Querying
Prometheus collects data with a pull model and queries it with PromQL.
5.1 Install Node Exporter
# cd /usr/local/src
# wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
# Extract and install
# tar -xzf node_exporter-1.6.1.linux-amd64.tar.gz
# cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
# chmod 755 /usr/local/bin/node_exporter
# Create the service file
# vi /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=:9100 \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
Restart=on-failure
[Install]
WantedBy=multi-user.target
# Start the service
# systemctl daemon-reload
# systemctl start node_exporter
# systemctl enable node_exporter
# Verify that metrics are exposed
# curl -s http://localhost:9100/metrics | head -20
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000123
go_gc_duration_seconds{quantile="0.25"} 0.000234
go_gc_duration_seconds{quantile="0.5"} 0.000345
go_gc_duration_seconds{quantile="0.75"} 0.000456
go_gc_duration_seconds{quantile="1"} 0.001234
go_gc_duration_seconds_sum 0.012345
go_gc_duration_seconds_count 10
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 10
5.2 PromQL Query Examples
# Query CPU usage
$ curl -G 'http://192.168.1.51:9090/api/v1/query' \
  --data-urlencode 'query=100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
# Sample output:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "instance": "192.168.1.51:9100"
        },
        "value": [1712205600, "25.5"]
      }
    ]
  }
}
# Query memory usage
$ curl -G 'http://192.168.1.51:9090/api/v1/query' \
  --data-urlencode 'query=(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100'
# Query disk usage
$ curl -G 'http://192.168.1.51:9090/api/v1/query' \
  --data-urlencode 'query=(1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100'
# Query network traffic
$ curl -G 'http://192.168.1.51:9090/api/v1/query' \
  --data-urlencode 'query=rate(node_network_receive_bytes_total{device="eth0"}[5m])'
# Query a time range
$ curl -G 'http://192.168.1.51:9090/api/v1/query_range' \
  --data-urlencode 'query=node_cpu_seconds_total{mode="idle"}' \
  --data-urlencode 'start=1712205000' \
  --data-urlencode 'end=1712205600' \
  --data-urlencode 'step=60s'
# Aggregation query
$ curl -G 'http://192.168.1.51:9090/api/v1/query' \
  --data-urlencode 'query=sum by(job) (rate(node_cpu_seconds_total[5m]))'
# Count the monitored targets
$ curl -G 'http://192.168.1.51:9090/api/v1/query' \
  --data-urlencode 'query=count(up)'
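The instant queries above differ only in the PromQL expression, so a small wrapper saves retyping the URL and the URL-encoding flag. This is a convenience sketch only: the function name `prom_query` and the `PROM_URL` variable are assumptions, with the default taken from the server address used in this tutorial.

```shell
# Base URL of the Prometheus server; override by exporting PROM_URL.
PROM_URL=${PROM_URL:-http://192.168.1.51:9090}

# Run an instant PromQL query through the HTTP API and print the raw JSON.
# -G sends the --data-urlencode pair as a URL query string (GET request).
prom_query() {
    curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage sketch:
# prom_query 'count(up)'
# prom_query 'sum by(job) (rate(node_cpu_seconds_total[5m]))'
```

Pass the expression in single quotes so the shell leaves the braces, brackets and double quotes of PromQL untouched.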
5.3 Querying in the Web UI
# Open http://192.168.1.51:9090 in a browser
# Common queries:
# 1. Status of all instances
up
# 2. CPU usage (per instance)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# 3. Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# 4. Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# 5. Network traffic
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
# 6. System load
node_load1
node_load5
node_load15
# 7. Processes
node_procs_running
node_procs_blocked
# 8. File descriptors
node_filefd_allocated
node_filefd_maximum
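6. Network Access Configuration
Clients reach Prometheus over the network, so the listen address, port, and access method must be configured correctly.
6.1 Configure the Listen Address
# Check the current listener
# netstat -tlnp | grep prometheus
tcp6 0 0 :::9090 :::* LISTEN 12345/prometheus
# Change the listen address
# Note: binding to a single address stops Prometheus from answering on 127.0.0.1.
# If access goes through the nginx reverse proxy in 6.2, bind to 127.0.0.1:9090 instead.
# vi /etc/systemd/system/prometheus.service
ExecStart=/usr/local/bin/prometheus \
  --web.listen-address=192.168.1.51:9090 \
  …
# Restart the service
# systemctl daemon-reload
# systemctl restart prometheus
# Configure the firewall
# firewall-cmd --permanent --add-port=9090/tcp
success
# firewall-cmd --reload
success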
6.2 Configure Authentication
# A reverse proxy is the recommended way to add authentication
# Install nginx (htpasswd is provided by the httpd-tools package)
# dnf install -y nginx httpd-tools
# Configure nginx as a reverse proxy
# vi /etc/nginx/conf.d/prometheus.conf
server {
    listen 80;
    server_name prometheus.fgedu.net.cn;
    location / {
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:9090;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
# Create the password file
# htpasswd -c /etc/nginx/.htpasswd admin
New password:
Re-type new password:
Adding password for user admin
# Start nginx
# systemctl start nginx
# systemctl enable nginx
# Access Prometheus with authentication
# curl -u admin:password 'http://prometheus.fgedu.net.cn/api/v1/query?query=up'
6.3 Configure HTTPS
# Generate a self-signed certificate
# mkdir -p /etc/nginx/ssl
# openssl req -x509 -nodes -newkey rsa:2048 \
  -keyout /etc/nginx/ssl/prometheus.key \
  -out /etc/nginx/ssl/prometheus.crt \
  -days 365 \
  -subj "/CN=prometheus.fgedu.net.cn"
# Configure HTTPS
# vi /etc/nginx/conf.d/prometheus.conf
server {
    listen 443 ssl;
    server_name prometheus.fgedu.net.cn;
    ssl_certificate /etc/nginx/ssl/prometheus.crt;
    ssl_certificate_key /etc/nginx/ssl/prometheus.key;
    location / {
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:9090;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
# Restart nginx
# systemctl restart nginx
7. Backup and Recovery
Backup and recovery are an essential part of operating the monitoring system. Prometheus keeps its data in a local TSDB.
7.1 Data Backup
# Create the backup directory
# mkdir -p /backup/prometheus
# Stop the Prometheus service
# systemctl stop prometheus
# Back up the data directory
# tar -czf /backup/prometheus/prometheus_data_$(date +%Y%m%d).tar.gz /data/prometheus
# Back up the configuration files
# tar -czf /backup/prometheus/prometheus_config_$(date +%Y%m%d).tar.gz /etc/prometheus
# Start the service
# systemctl start prometheus
# Verify the backup files
# ls -la /backup/prometheus/
total 2048
-rw-r--r--. 1 root root 1024000 Apr  4 10:00 prometheus_config_20260404.tar.gz
-rw-r--r--. 1 root root 5120000 Apr  4 10:00 prometheus_data_20260404.tar.gz
7.2 Data Recovery
# Stop the service
# systemctl stop prometheus
# Restore the data directory
# rm -rf /data/prometheus/*
# tar -xzf /backup/prometheus/prometheus_data_20260404.tar.gz -C /
# chown -R prometheus:prometheus /data/prometheus
# Restore the configuration files
# rm -rf /etc/prometheus/*
# tar -xzf /backup/prometheus/prometheus_config_20260404.tar.gz -C /
# chown -R prometheus:prometheus /etc/prometheus
# Start the service
# systemctl start prometheus
# Verify the restore
# curl -s 'http://localhost:9090/api/v1/query?query=up' | head -20
7.3 Automated Backup Script
# vi /usr/local/bin/prometheus_backup.sh
#!/bin/bash
# The snapshot request only works when Prometheus is started with --web.enable-admin-api
BACKUP_DIR=/backup/prometheus
DATE=$(date +%Y%m%d)
LOG_FILE=/var/log/prometheus/backup.log
mkdir -p "$BACKUP_DIR"
echo "=== Backup started at $(date) ===" >> "$LOG_FILE"
# Back up the configuration files
tar -czf "${BACKUP_DIR}/prometheus_config_${DATE}.tar.gz" /etc/prometheus >> "$LOG_FILE" 2>&1
# Create a TSDB snapshot through the admin API (-f makes curl fail on an HTTP error status)
if curl -sf -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot >> "$LOG_FILE" 2>&1; then
    # Archive the newest snapshot into the backup directory
    SNAPSHOT_DIR=$(ls -td /data/prometheus/snapshots/* | head -1)
    tar -czf "${BACKUP_DIR}/prometheus_snapshot_${DATE}.tar.gz" "${SNAPSHOT_DIR}" >> "$LOG_FILE" 2>&1
    echo "Backup completed successfully" >> "$LOG_FILE"
else
    echo "Backup failed" >> "$LOG_FILE"
fi
# Remove backups older than 30 days
find "${BACKUP_DIR}" -name "*.tar.gz" -mtime +30 -delete >> "$LOG_FILE" 2>&1
echo "=== Backup finished at $(date) ===" >> "$LOG_FILE"
echo "" >> "$LOG_FILE"
# Make the script executable
# chmod +x /usr/local/bin/prometheus_backup.sh
# Schedule the job
# crontab -e
# Add the following line (runs the backup every day at 02:00)
0 2 * * * /usr/local/bin/prometheus_backup.sh
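The script above finds the snapshot by taking the newest directory under snapshots/, which can pick the wrong one if two snapshots are created close together. The snapshot API returns the directory name in its JSON response, in the form `{"status":"success","data":{"name":"..."}}`, so the name can be read from there instead. The sketch below extracts it with sed; the function name `snapshot_name` is hypothetical, and the paths in the usage lines are the ones used in this tutorial.

```shell
# Extract the snapshot directory name from the JSON returned by
# POST /api/v1/admin/tsdb/snapshot, for example:
# {"status":"success","data":{"name":"20171210T211224Z-2be650b6d019eb54"}}
# Prints nothing when the response contains no "name" field.
snapshot_name() {
    sed -n 's/.*"name":"\([^"]*\)".*/\1/p'
}

# Usage sketch (needs Prometheus started with --web.enable-admin-api):
# name=$(curl -sf -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot | snapshot_name)
# [ -n "$name" ] && tar -czf "/backup/prometheus/prometheus_snapshot_${name}.tar.gz" \
#     -C /data/prometheus/snapshots "$name"
```

Snapshots are made of hard links to existing blocks, so they can be removed from snapshots/ after archiving without touching live data.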
8. Upgrade and Migration
Upgrading and migrating Prometheus are routine but sensitive operations that need careful planning and execution.
8.1 Version Upgrade
# Check the current version
$ prometheus --version
prometheus, version 2.45.0
# Take a full backup
# systemctl stop prometheus
# tar -czf /backup/prometheus/pre_upgrade.tar.gz /data/prometheus /etc/prometheus
# Download the new version
# cd /usr/local/src
# wget https://github.com/prometheus/prometheus/releases/download/v2.46.0/prometheus-2.46.0.linux-amd64.tar.gz
# Extract the archive and replace the binaries
# tar -xzf prometheus-2.46.0.linux-amd64.tar.gz
# cp prometheus-2.46.0.linux-amd64/prometheus /usr/local/bin/
# cp prometheus-2.46.0.linux-amd64/promtool /usr/local/bin/
# chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
# Verify that the configuration is still valid with the new version
# promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
 SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file syntax
# Start the service
# systemctl start prometheus
# Verify the version
$ prometheus --version
prometheus, version 2.46.0
8.2 Migrating to a New Server
# On the old server: stop the service and create a full backup
# systemctl stop prometheus
# tar -czf prometheus_full_backup.tar.gz /data/prometheus /etc/prometheus
# Transfer the backup file
# scp prometheus_full_backup.tar.gz new-server:/backup/
# Install Prometheus on the new server
# See the installation steps in section 3.2
# Restore the data
# systemctl stop prometheus
# tar -xzf /backup/prometheus_full_backup.tar.gz -C /
# chown -R prometheus:prometheus /data/prometheus /etc/prometheus
# systemctl start prometheus
# Verify the migration
# curl -s 'http://localhost:9090/api/v1/query?query=up'
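Before extracting a large archive on the new server, it can help to confirm that the file arrived intact. A minimal sketch using `sha256sum` from coreutils follows; the function name `verify_sha256` is hypothetical, and the digest has to be recorded on the old server first.

```shell
# Succeed only if FILE has the expected SHA-256 digest.
# Usage: verify_sha256 FILE EXPECTED_HEX_DIGEST
verify_sha256() {
    file=$1
    expected=$2
    # sha256sum prints "<digest>  <file>"; keep only the digest
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}

# Usage sketch:
# old server:  sha256sum prometheus_full_backup.tar.gz
# new server:  verify_sha256 /backup/prometheus_full_backup.tar.gz <digest from old server> \
#                  && echo "archive intact"
```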
9. Production Case Study
This section gives a complete production-style configuration to show how Prometheus is used in practice.
9.1 Install Alertmanager
# cd /usr/local/src
# wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
# Extract and install
# tar -xzf alertmanager-0.26.0.linux-amd64.tar.gz
# cp alertmanager-0.26.0.linux-amd64/alertmanager /usr/local/bin/
# cp alertmanager-0.26.0.linux-amd64/amtool /usr/local/bin/
# Create the configuration file
# vi /etc/prometheus/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.fgedu.net.cn:25'
  smtp_from: 'alertmanager@fgedu.net.cn'
  smtp_auth_username: 'alertmanager@fgedu.net.cn'
  smtp_auth_password: 'password'
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'critical-receiver'
    - match:
        severity: warning
      receiver: 'warning-receiver'
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'admin@fgedu.net.cn'
  - name: 'critical-receiver'
    email_configs:
      - to: 'admin@fgedu.net.cn'
    webhook_configs:
      - url: 'http://192.168.1.100:5001/webhook'
  - name: 'warning-receiver'
    email_configs:
      - to: 'admin@fgedu.net.cn'
# Create the service file
# vi /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/prometheus/alertmanager.yml \
  --storage.path=/data/alertmanager \
  --web.listen-address=:9093
Restart=on-failure
[Install]
WantedBy=multi-user.target
# Create the data directory and start the service
# mkdir -p /data/alertmanager
# chown prometheus:prometheus /data/alertmanager
# systemctl daemon-reload
# systemctl start alertmanager
# systemctl enable alertmanager
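Once Alertmanager is running, the routing tree above can be exercised without waiting for a real outage by posting a hand-made alert to its v2 API (`/api/v2/alerts`). The sketch below only builds the JSON payload. The function name `test_alert_payload`, the alert name `RoutingTest`, and the summary text are all made up for the test.

```shell
# Print a one-element alert list in the JSON format accepted by
# Alertmanager's POST /api/v2/alerts endpoint.
# Usage: test_alert_payload ALERTNAME SEVERITY
test_alert_payload() {
    printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"manual test alert"}}]\n' \
        "$1" "$2"
}

# Usage sketch: with severity=critical the alert should be routed to
# 'critical-receiver' in the configuration above.
# test_alert_payload RoutingTest critical |
#     curl -s -X POST -H 'Content-Type: application/json' \
#          --data-binary @- http://localhost:9093/api/v2/alerts
```

Running `amtool check-config /etc/prometheus/alertmanager.yml` before starting the service catches syntax errors in the configuration file itself.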
9.2 Performance Monitoring
# Inspect the metrics Prometheus exposes about itself
$ curl -s http://localhost:9090/metrics | grep prometheus_
# HELP prometheus_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which prometheus was built.
# TYPE prometheus_build_info gauge
prometheus_build_info{branch="HEAD",goversion="go1.21.0",revision="abc123",version="2.45.0"} 1
# HELP prometheus_config_last_reload_success_timestamp_seconds Timestamp of the last successful configuration reload.
# TYPE prometheus_config_last_reload_success_timestamp_seconds gauge
prometheus_config_last_reload_success_timestamp_seconds 1.7122056e+09
# HELP prometheus_config_last_reload_successful Whether the last configuration reload attempt was successful.
# TYPE prometheus_config_last_reload_successful gauge
prometheus_config_last_reload_successful 1
# Check the health of the scrape targets
$ curl -s 'http://localhost:9090/api/v1/targets' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
{
  "job": "prometheus",
  "health": "up"
}
{
  "job": "node_exporter",
  "health": "up"
}
# Check the TSDB status (minTime and maxTime are in milliseconds)
$ curl -s 'http://localhost:9090/api/v1/status/tsdb'
{
  "status": "success",
  "data": {
    "headStats": {
      "numSeries": 10000,
      "numLabelPairs": 1000,
      "chunkCount": 50000,
      "minTime": 1712205000000,
      "maxTime": 1712205600000
    },
    "seriesCountByMetricName": […],
    "labelValueCountByLabelName": […]
  }
}
9.3 High Availability
# Option 1: multiple instances
# Deploy several Prometheus instances that each scrape the same targets independently
# Put a load balancer in front of them to distribute query requests
# Option 2: federation
# Configuration on the primary (global) Prometheus
# vi /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{job="node_exporter"}'
    static_configs:
      - targets:
          - '192.168.1.52:9090'
          - '192.168.1.53:9090'
        labels:
          federation: 'datacenter1'
# Option 3: Thanos or VictoriaMetrics
# Thanos adds long-term storage and a highly available query layer
# VictoriaMetrics provides high-performance storage and querying
# Sample Thanos Sidecar unit
# vi /etc/systemd/system/thanos-sidecar.service
[Unit]
Description=Thanos Sidecar
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/data/prometheus \
  --objstore.config-file=/etc/thanos/objectstore.yml \
  --http-address=0.0.0.0:19191 \
  --grpc-address=0.0.0.0:10901
Restart=on-failure
[Install]
WantedBy=multi-user.target
Compiled and published by 风哥教程 for learning and testing purposes only. When reposting, cite the source: http://www.fgedu.net.cn/10327.html
