Linux Tutorial FG325-Cluster Resource Monitoring and Alerting

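Summary: Drawing on the official Linux, Red Hat Enterprise Linux, Ansible Automation Platform, Docker, Kubernetes, and Podman documentation, this Fengge tutorial describes how to monitor a Pacemaker/Corosync cluster and configure alerting for it.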

Fengge's note:

This document describes in detail how to configure monitoring and alerting for cluster resources.

Part01-Cluster Status Monitoring

1.1 Monitoring cluster status in real time

# View the cluster status
[root@ha-node1 ~]# pcs status
Cluster name: mycluster
Cluster Summary:
* Stack: corosync
* Current DC: ha-node1 (version 2.1.6-1.el9)
* Last updated: Fri Apr 4 12:00:00 2026
* Last change: Fri Apr 4 11:55:00 2026
* 2 nodes configured
* 3 resource instances configured

Node List:
* Online: [ ha-node1 ha-node2 ]
* Standby:
* Maintenance:
* Offline:

Full List of Resources:
* vip (ocf::heartbeat:IPaddr2): Started ha-node1
* nginx (systemd:nginx): Started ha-node1
* mysql (systemd:mariadb): Started ha-node2

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

# Monitor the cluster status continuously
[root@ha-node1 ~]# watch -n 5 'pcs status'
Every 5.0s: pcs status

# View resource status
[root@ha-node1 ~]# pcs status resources
* vip (ocf::heartbeat:IPaddr2): Started ha-node1
* nginx (systemd:nginx): Started ha-node1
* mysql (systemd:mariadb): Started ha-node2

# View node status
[root@ha-node1 ~]# pcs status nodes
Pacemaker Nodes:
Online: ha-node1 ha-node2
Standby:
Maintenance:
Offline:
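
The node listing above lends itself to simple scripted checks. The following is a minimal sketch, assuming the `pcs status nodes` output format shown above; the function name `count_nodes_in_state` is hypothetical and not part of pcs.

```shell
#!/bin/bash
# Hypothetical helper: count the nodes listed for a given state
# (Online, Standby, Maintenance, Offline) in `pcs status nodes` output
# read from stdin. Only the first matching line is used, which is the
# "Pacemaker Nodes" section in the output format shown above.
count_nodes_in_state() {
    local state="$1"
    awk -v s="$state:" '$1 == s { print NF - 1; found = 1; exit }
                        END { if (!found) print 0 }'
}

# Example usage on a cluster node:
#   pcs status nodes | count_nodes_in_state Online
```

A count of online nodes lower than the number of configured nodes is a cheap signal that a closer look at `pcs status` is warranted.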

1.2 Viewing resource operation history

# View resource operation history
[root@ha-node1 ~]# pcs status operations
Operations:
* vip: monitor interval=30s last-rc-change=Fri Apr 4 12:00:00 2026 exec-time=10ms
* nginx: monitor interval=20s last-rc-change=Fri Apr 4 12:00:00 2026 exec-time=5ms
* mysql: monitor interval=30s last-rc-change=Fri Apr 4 12:00:00 2026 exec-time=8ms

# View failed operations
[root@ha-node1 ~]# pcs status failcounts
Fail counts for nginx
ha-node1: 1 (last-failure: Fri Apr 4 11:50:00 2026)

# View cluster events
[root@ha-node1 ~]# pcs status events
Events:
* Fri Apr 4 11:50:00 2026: nginx failed on ha-node1
* Fri Apr 4 11:50:01 2026: nginx restarted on ha-node1
* Fri Apr 4 11:50:10 2026: nginx recovered on ha-node1

# View cluster logs
[root@ha-node1 ~]# journalctl -u pacemaker -u corosync --since "1 hour ago"
-- Logs begin at Fri 2026-04-04 10:00:00 CST. --
Apr 04 11:50:00 ha-node1 pacemaker[12345]: notice: nginx_monitor_20000[12346] exited with status 7
Apr 04 11:50:01 ha-node1 pacemaker[12345]: notice: Starting nginx on ha-node1
Apr 04 11:50:10 ha-node1 pacemaker[12345]: notice: nginx started on ha-node1
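
Pacemaker prefixes its log messages with a severity label such as `notice:`, `warning:`, or `error:`, as in the sample above. A quick tally per severity can show whether the last hour was quiet. This is a minimal sketch, assuming that label format; the function name `summarize_severity` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count journal lines per severity label.
# Reads log lines on stdin and prints "<label> <count>" pairs, sorted.
summarize_severity() {
    awk '{
        # Use the first field that looks like a severity label.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^(notice|warning|error|crit):$/) { count[$i]++; break }
    }
    END { for (s in count) print s, count[s] }' | sort
}

# Example usage on a cluster node:
#   journalctl -u pacemaker -u corosync --since "1 hour ago" | summarize_severity
```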

Part02-Configuring Monitoring Scripts

2.1 Creating a monitoring script

# Create the cluster monitoring script
[root@ha-node1 ~]# cat > /usr/local/bin/cluster_monitor.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/cluster_monitor.log"
ALERT_EMAIL="admin@fgedu.net.cn"

# Append a timestamped message to the log file
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

check_cluster_status() {
    # pcs status fails when the cluster stack is not running on this node
    if ! pcs status > /dev/null 2>&1; then
        log_message "ERROR: Cluster is not running"
        echo "Cluster is not running" | mail -s "Cluster Alert" "$ALERT_EMAIL"
        return 1
    fi

    # Collect every node name on the "Offline:" lines, not only the first one
    OFFLINE_NODES=$(pcs status nodes | awk '/Offline:/ {for (i = 2; i <= NF; i++) printf "%s ", $i}')
    if [ -n "$OFFLINE_NODES" ]; then
        log_message "WARNING: Offline nodes: $OFFLINE_NODES"
        echo "Offline nodes: $OFFLINE_NODES" | mail -s "Cluster Alert - Offline Nodes" "$ALERT_EMAIL"
    fi

    # Match case-insensitively because Pacemaker reports failures as "FAILED"
    FAILED_RESOURCES=$(pcs status resources | grep -Ei 'stopped|failed' | awk '{print $2}')
    if [ -n "$FAILED_RESOURCES" ]; then
        log_message "WARNING: Failed resources: $FAILED_RESOURCES"
        echo "Failed resources: $FAILED_RESOURCES" | mail -s "Cluster Alert - Failed Resources" "$ALERT_EMAIL"
    fi

    log_message "INFO: Cluster status check completed"
}

check_cluster_status
EOF

# Make the script executable
[root@ha-node1 ~]# chmod +x /usr/local/bin/cluster_monitor.sh

# Test the script
[root@ha-node1 ~]# /usr/local/bin/cluster_monitor.sh

# View the log
[root@ha-node1 ~]# cat /var/log/cluster_monitor.log
2026-04-04 12:05:00 - INFO: Cluster status check completed

2.2 Scheduling periodic monitoring

# Add the script to cron (runs every 5 minutes)
[root@ha-node1 ~]# crontab -e
*/5 * * * * /usr/local/bin/cluster_monitor.sh

# Verify the cron job
[root@ha-node1 ~]# crontab -l
*/5 * * * * /usr/local/bin/cluster_monitor.sh
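
With a five-minute schedule, a slow check (for example, a `mail` call that hangs) could still be running when the next one starts. One possible guard is a non-blocking file lock. This is a minimal sketch, assuming `flock` from util-linux is installed; the lock path and the function name `run_exclusive` are hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command only if no other instance holds the lock.
LOCK_FILE="${LOCK_FILE:-/tmp/cluster_monitor.lock}"

run_exclusive() {
    # -n: give up immediately instead of waiting if the lock is already held.
    flock -n "$LOCK_FILE" "$@"
}

# Example crontab entry using flock directly (assumed lock path):
#   */5 * * * * flock -n /run/cluster_monitor.lock /usr/local/bin/cluster_monitor.sh
```

When the lock is already held, `flock -n` exits with a non-zero status and the command is skipped, so overlapping runs do not pile up.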

# Create a detailed health-check script
[root@ha-node1 ~]# cat > /usr/local/bin/cluster_health_check.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/cluster_health.log"

check_node_status() {
    echo "=== Node Status ===" >> "$LOG_FILE"
    pcs status nodes >> "$LOG_FILE"
    echo "" >> "$LOG_FILE"
}

check_resource_status() {
    echo "=== Resource Status ===" >> "$LOG_FILE"
    pcs status resources >> "$LOG_FILE"
    echo "" >> "$LOG_FILE"
}

check_quorum_status() {
    echo "=== Quorum Status ===" >> "$LOG_FILE"
    pcs quorum status >> "$LOG_FILE"
    echo "" >> "$LOG_FILE"
}

check_stonith_status() {
    echo "=== Stonith Status ===" >> "$LOG_FILE"
    pcs stonith status >> "$LOG_FILE"
    echo "" >> "$LOG_FILE"
}

echo "$(date '+%Y-%m-%d %H:%M:%S') - Starting health check" >> "$LOG_FILE"
check_node_status
check_resource_status
check_quorum_status
check_stonith_status
echo "$(date '+%Y-%m-%d %H:%M:%S') - Health check completed" >> "$LOG_FILE"
EOF

# Make the script executable
[root@ha-node1 ~]# chmod +x /usr/local/bin/cluster_health_check.sh
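
Both scripts append to their log files on every run, so the files grow without bound. The following is a minimal sketch of one way to cap their size, assuming a single `.1` backup is enough; the function name `rotate_if_large` and the threshold are hypothetical. On a production system `logrotate` would be the more usual tool.

```shell
#!/bin/bash
# Hypothetical helper: if FILE is larger than MAX_BYTES, move it to FILE.1
# (overwriting any previous backup) and start a fresh, empty FILE.
rotate_if_large() {
    local file="$1" max_bytes="$2" size
    [ -f "$file" ] || return 0
    size=$(wc -c < "$file")
    if [ "$size" -gt "$max_bytes" ]; then
        mv -f "$file" "$file.1"
        : > "$file"
    fi
}

# Example usage at the top of a monitoring script (10 MiB threshold):
#   rotate_if_large /var/log/cluster_health.log 10485760
```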

Part03-Configuring Alert Notifications

3.1 Configuring email alerts

# Install the mail tools
[root@ha-node1 ~]# dnf install -y mailx postfix
Updating Subscription Management repositories.
Last metadata expiration check: 0:05:23 ago on Fri Apr 4 12:00:00 2026.
Dependencies resolved.
================================================================================
Package Architecture Version Repository Size
================================================================================
Installing:
mailx x86_64 12.5-38.el9 appstream 250 k
postfix x86_64 3.6.4-1.el9 appstream 1.5 M

Transaction Summary
================================================================================
Install 2 Packages

Total download size: 1.8 M
Installed size: 5.0 M
Downloading Packages:
(1/2): mailx-12.5-38.el9.x86_64.rpm 2.5 MB/s | 250 kB 00:00
(2/2): postfix-3.6.4-1.el9.x86_64.rpm 5.0 MB/s | 1.5 MB 00:00
--------------------------------------------------------------------------------
Total 4.5 MB/s | 1.8 MB 00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : mailx-12.5-38.el9.x86_64 1/2
Installing : postfix-3.6.4-1.el9.x86_64 2/2
Running scriptlet: postfix-3.6.4-1.el9.x86_64 2/2
Verifying : mailx-12.5-38.el9.x86_64 1/2
Verifying : postfix-3.6.4-1.el9.x86_64 2/2

Installed:
mailx-12.5-38.el9.x86_64 postfix-3.6.4-1.el9.x86_64

Complete!

# Enable and start Postfix
[root@ha-node1 ~]# systemctl enable --now postfix
Created symlink /etc/systemd/system/multi-user.target.wants/postfix.service → /usr/lib/systemd/system/postfix.service.

# Send a test email
[root@ha-node1 ~]# echo "Test email from cluster" | mail -s "Cluster Test" admin@fgedu.net.cn

3.2 Configuring Pacemaker alerts

# Create the alert script
[root@ha-node1 ~]# cat > /usr/local/bin/cluster_alert.sh << 'EOF'
#!/bin/bash
# Pacemaker passes event details to alert agents in CRM_alert_* environment
# variables rather than as positional arguments.
ALERT_EMAIL="${CRM_alert_recipient:-admin@fgedu.net.cn}"
EVENT="$CRM_alert_kind"
RESOURCE="$CRM_alert_rsc"
NODE="$CRM_alert_node"

case "$EVENT" in
    resource)
        SUBJECT="[ALERT] Resource Event: $RESOURCE on $NODE"
        BODY="Resource $RESOURCE on node $NODE: $CRM_alert_desc at $(date)"
        ;;
    node)
        SUBJECT="[ALERT] Node Event: $NODE"
        BODY="Node $NODE is now $CRM_alert_desc at $(date)"
        ;;
    fencing)
        SUBJECT="[ALERT] Fence Event: $NODE"
        BODY="Fencing of node $NODE: $CRM_alert_desc at $(date)"
        ;;
    *)
        SUBJECT="[ALERT] Cluster Event"
        BODY="Unknown event: $EVENT on $RESOURCE at $NODE"
        ;;
esac

echo "$BODY" | mail -s "$SUBJECT" "$ALERT_EMAIL"
# Alert agents run as the hacluster user, which must be able to write this log
echo "$(date '+%Y-%m-%d %H:%M:%S') - $SUBJECT" >> /var/log/cluster_alerts.log
EOF

# Make the script executable
[root@ha-node1 ~]# chmod +x /usr/local/bin/cluster_alert.sh

# Register the script as a Pacemaker alert
[root@ha-node1 ~]# pcs alert create id=cluster_alert path=/usr/local/bin/cluster_alert.sh
Alert created: cluster_alert

# View the alert configuration
[root@ha-node1 ~]# pcs alert config
Alerts:
Alert: cluster_alert
Path: /usr/local/bin/cluster_alert.sh
Timeout: 30s

# Add an alert recipient
[root@ha-node1 ~]# pcs alert recipient add cluster_alert value=admin@fgedu.net.cn
Recipient added to alert 'cluster_alert'

# View the alert configuration again
[root@ha-node1 ~]# pcs alert config
Alerts:
Alert: cluster_alert
Path: /usr/local/bin/cluster_alert.sh
Timeout: 30s
Recipients:
admin@fgedu.net.cn
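
Pacemaker passes event details to an alert agent through `CRM_alert_*` environment variables (for example `CRM_alert_kind`, `CRM_alert_node`, `CRM_alert_rsc`, and `CRM_alert_desc`) and exposes the configured recipient as `CRM_alert_recipient`. An agent can therefore be exercised by hand before a real failure occurs. The helper below is a minimal sketch; `simulate_alert`, its argument order, and the `ALERT_RECIPIENT` variable are hypothetical, and it sets only a few of the variables Pacemaker provides.

```shell
#!/bin/bash
# Hypothetical helper: run an alert agent with a small subset of the
# CRM_alert_* variables that Pacemaker would set for a real event.
simulate_alert() {
    local agent="$1" kind="$2" node="$3" rsc="$4" desc="$5"
    env CRM_alert_kind="$kind" \
        CRM_alert_node="$node" \
        CRM_alert_rsc="$rsc" \
        CRM_alert_desc="$desc" \
        CRM_alert_recipient="${ALERT_RECIPIENT:-root}" \
        "$agent"
}

# Example usage (the agent will send a real mail if it is written to do so):
#   simulate_alert /usr/local/bin/cluster_alert.sh resource ha-node1 nginx "not running"
```

Running the simulation as the `hacluster` user gives a closer approximation, since that is the account Pacemaker uses to launch alert agents.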

Part04-Prometheus Monitoring Integration

4.1 Configuring the Prometheus exporter

# Install pacemaker_exporter
[root@ha-node1 ~]# dnf install -y pacemaker_exporter
Updating Subscription Management repositories.
Last metadata expiration check: 0:05:23 ago on Fri Apr 4 12:10:00 2026.
Dependencies resolved.
================================================================================
Package Architecture Version Repository Size
================================================================================
Installing:
pacemaker_exporter x86_64 0.1.0-1.el9 appstream 500 k

Transaction Summary
================================================================================
Install 1 Package

Total download size: 500 k
Installed size: 1.2 M
Downloading Packages:
pacemaker_exporter-0.1.0-1.el9.x86_64.rpm 2.0 MB/s | 500 kB 00:00
--------------------------------------------------------------------------------
Total 2.0 MB/s | 500 kB 00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : pacemaker_exporter-0.1.0-1.el9.x86_64 1/1
Running scriptlet: pacemaker_exporter-0.1.0-1.el9.x86_64 1/1
Verifying : pacemaker_exporter-0.1.0-1.el9.x86_64 1/1

Installed:
pacemaker_exporter-0.1.0-1.el9.x86_64

Complete!

# Enable and start the exporter
[root@ha-node1 ~]# systemctl enable --now pacemaker_exporter
Created symlink /etc/systemd/system/multi-user.target.wants/pacemaker_exporter.service → /usr/lib/systemd/system/pacemaker_exporter.service.

# Verify the exporter
[root@ha-node1 ~]# curl http://localhost:9256/metrics
# HELP pacemaker_cluster_nodes_total Total number of nodes in cluster
# TYPE pacemaker_cluster_nodes_total gauge
pacemaker_cluster_nodes_total 2
# HELP pacemaker_cluster_nodes_online Number of online nodes
# TYPE pacemaker_cluster_nodes_online gauge
pacemaker_cluster_nodes_online 2
# HELP pacemaker_cluster_resources_total Total number of resources
# TYPE pacemaker_cluster_resources_total gauge
pacemaker_cluster_resources_total 3
# HELP pacemaker_cluster_resources_running Number of running resources
# TYPE pacemaker_cluster_resources_running gauge
pacemaker_cluster_resources_running 3
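
The metrics endpoint can also be checked from a script, for instance to compare the online node count with the total. This is a minimal sketch, assuming the metric names and the endpoint shown in the sample output above; the function names `metric_value` and `all_nodes_online` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one un-labelled metric from
# Prometheus text-format input on stdin; prints nothing if it is absent.
metric_value() {
    awk -v name="$1" '$1 == name { print $2; exit }'
}

# Hypothetical check: succeed only if every configured node is online.
all_nodes_online() {
    local metrics total online
    metrics="$(cat)"
    total="$(printf '%s\n' "$metrics" | metric_value pacemaker_cluster_nodes_total)"
    online="$(printf '%s\n' "$metrics" | metric_value pacemaker_cluster_nodes_online)"
    # Treat a missing metric as a failure rather than as "all online".
    [ -n "$total" ] && [ "$total" = "$online" ]
}

# Example usage:
#   curl -s http://localhost:9256/metrics | all_nodes_online || echo "node offline"
```

The `# HELP` and `# TYPE` lines are skipped automatically because their first field is `#`, not a metric name.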

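Fengge's recommendations for monitoring and alerting:

Configure real-time monitoring scripts
Set up email alert notifications
Integrate Prometheus monitoring
Check the cluster status regularly
Retain monitoring logs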

Compiled and published by Fengge Tutorials for study and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html
