Linux Tutorial FG326 - Cluster Log Analysis and Troubleshooting

Summary: This Fengge tutorial draws on the official Linux, Red Hat Enterprise Linux, Ansible Automation Platform, Docker, Kubernetes, and Podman documentation to describe how to configure and use the relevant technologies.

Fengge's note:

This document describes in detail how to analyze cluster logs and troubleshoot cluster faults.

Part01-Viewing Logs

1.1 Viewing Cluster Logs

# View the Pacemaker log
[root@ha-node1 ~]# journalctl -u pacemaker --since "1 hour ago"
-- Logs begin at Fri 2026-04-04 10:00:00 CST. --
Apr 04 12:00:00 ha-node1 pacemaker[12345]: notice: Result of monitor operation for vip on ha-node1: 0
Apr 04 12:00:00 ha-node1 pacemaker[12345]: notice: Result of monitor operation for nginx on ha-node1: 0
Apr 04 12:00:30 ha-node1 pacemaker[12345]: notice: Result of monitor operation for vip on ha-node1: 0
Apr 04 12:00:30 ha-node1 pacemaker[12345]: notice: Result of monitor operation for mysql on ha-node2: 0

# View the Corosync log
[root@ha-node1 ~]# journalctl -u corosync --since "1 hour ago"
-- Logs begin at Fri 2026-04-04 10:00:00 CST. --
Apr 04 12:00:00 ha-node1 corosync[12340]: [TOTEM ] A new membership (192.168.1.10:5405) was formed. Members joined: 1 2
Apr 04 12:00:00 ha-node1 corosync[12340]: [QUORUM] Members[2]: 1 2
Apr 04 12:00:00 ha-node1 corosync[12340]: [MAIN ] Completed service synchronization, ready to provide service.

# View all cluster-related logs
[root@ha-node1 ~]# journalctl -u pacemaker -u corosync -u pcsd --since "today"
-- Logs begin at Fri 2026-04-04 10:00:00 CST. --
Apr 04 10:00:00 ha-node1 systemd[1]: Starting Corosync Cluster Engine...
Apr 04 10:00:01 ha-node1 systemd[1]: Started Corosync Cluster Engine.
Apr 04 10:00:02 ha-node1 systemd[1]: Starting Pacemaker High Availability Cluster Manager...
Apr 04 10:00:03 ha-node1 systemd[1]: Started Pacemaker High Availability Cluster Manager.
Apr 04 10:00:04 ha-node1 systemd[1]: Starting PCS GUI and remote configuration interface...
Apr 04 10:00:05 ha-node1 systemd[1]: Started PCS GUI and remote configuration interface.

# Follow the logs in real time
[root@ha-node1 ~]# journalctl -u pacemaker -u corosync -f
-- Logs begin at Fri 2026-04-04 10:00:00 CST. --
Apr 04 12:15:00 ha-node1 pacemaker[12345]: notice: Result of monitor operation for vip on ha-node1: 0

1.2 Viewing System Logs

# View the system message log
[root@ha-node1 ~]# tail -100 /var/log/messages
Apr 4 12:00:00 ha-node1 pacemaker[12345]: notice: Result of monitor operation for vip on ha-node1: 0
Apr 4 12:00:00 ha-node1 pacemaker[12345]: notice: Result of monitor operation for nginx on ha-node1: 0
Apr 4 12:00:30 ha-node1 pacemaker[12345]: notice: Result of monitor operation for vip on ha-node1: 0
Apr 4 12:00:30 ha-node1 pacemaker[12345]: notice: Result of monitor operation for mysql on ha-node2: 0

# Search for specific keywords
[root@ha-node1 ~]# grep -i "error\|fail\|critical" /var/log/messages | tail -20
Apr 4 11:50:00 ha-node1 pacemaker[12345]: error: nginx_monitor_20000[12346] exited with status 7
Apr 4 11:50:00 ha-node1 pacemaker[12345]: notice: Starting nginx on ha-node1
Apr 4 11:50:10 ha-node1 pacemaker[12345]: notice: nginx started on ha-node1
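
The keyword search above can be extended into a per-severity count, which makes a sudden rise in errors easier to spot. The following is a minimal sketch, assuming syslog-style lines in which the severity appears as "error:", "warning:", or "notice:" after the process tag, as in the sample output. The function name count_severity is hypothetical and not part of any cluster tool.

```shell
#!/bin/bash
# count_severity (hypothetical helper): read syslog-style lines on stdin and
# print how many carry an " error: ", " warning: ", or " notice: " tag.
count_severity() {
    awk '
        / error: /   { n["error"]++ }
        / warning: / { n["warning"]++ }
        / notice: /  { n["notice"]++ }
        END {
            printf "error=%d warning=%d notice=%d\n", n["error"], n["warning"], n["notice"]
        }
    '
}

# Example with the three lines from the sample output above.
printf '%s\n' \
    'Apr 4 11:50:00 ha-node1 pacemaker[12345]: error: nginx_monitor_20000[12346] exited with status 7' \
    'Apr 4 11:50:00 ha-node1 pacemaker[12345]: notice: Starting nginx on ha-node1' \
    'Apr 4 11:50:10 ha-node1 pacemaker[12345]: notice: nginx started on ha-node1' |
    count_severity
# prints: error=1 warning=0 notice=2
```

On a live node the same function could read the real log, for example: count_severity < /var/log/messages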

# List the Pacemaker log directory (including the CIB log)
[root@ha-node1 ~]# ls -la /var/log/pacemaker/
total 16
drwxr-xr-x. 2 root root 4096 Apr 4 10:00 .
drwxr-xr-x. 10 root root 4096 Apr 4 10:00 ..
-rw-r--r--. 1 root root 1234 Apr 4 12:00 cib.log
-rw-r--r--. 1 root root 5678 Apr 4 12:00 pacemaker.log

[root@ha-node1 ~]# tail -50 /var/log/pacemaker/pacemaker.log
2026-04-04T12:00:00.000Z ha-node1 pacemaker[12345]: notice: Result of monitor operation for vip on ha-node1: 0
2026-04-04T12:00:00.000Z ha-node1 pacemaker[12345]: notice: Result of monitor operation for nginx on ha-node1: 0

Part02-Troubleshooting Common Faults

2.1 Troubleshooting Node Faults

# Check for offline nodes
[root@ha-node1 ~]# pcs status nodes
Pacemaker Nodes:
Online: ha-node1
Standby:
Maintenance:
Offline: ha-node2

# Check network connectivity to the node
[root@ha-node1 ~]# ping -c 3 ha-node2
PING ha-node2 (192.168.1.11) 56(84) bytes of data.
64 bytes from ha-node2 (192.168.1.11): icmp_seq=1 ttl=64 time=0.521 ms
64 bytes from ha-node2 (192.168.1.11): icmp_seq=2 ttl=64 time=0.489 ms
64 bytes from ha-node2 (192.168.1.11): icmp_seq=3 ttl=64 time=0.512 ms

--- ha-node2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.489/0.507/0.521/0.013 ms

# Check the cluster service status on the node
[root@ha-node1 ~]# ssh ha-node2 'systemctl status corosync pacemaker'
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; preset: disabled)
Active: inactive (dead) since Fri 2026-04-04 12:10:00 CST; 5min ago

● pacemaker.service - Pacemaker High Availability Cluster Manager
Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; preset: disabled)
Active: inactive (dead) since Fri 2026-04-04 12:10:00 CST; 5min ago

# Start the cluster services on the node
[root@ha-node1 ~]# pcs cluster start ha-node2
ha-node2: Starting Cluster (corosync)...
ha-node2: Starting Cluster (pacemaker)...

# Verify that the node is back online
[root@ha-node1 ~]# pcs status nodes
Pacemaker Nodes:
Online: ha-node1 ha-node2
Standby:
Maintenance:
Offline:
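
The offline check above can be scripted so that a monitoring job reports only the node names. This is a minimal sketch, assuming the "Offline:" line format shown in the pcs status nodes output above. The helper name offline_nodes is hypothetical, and if the output also contains a remote-nodes section, its offline entries would be printed too.

```shell
#!/bin/bash
# offline_nodes (hypothetical helper): read `pcs status nodes` output on stdin
# and print one offline node name per line (nothing if all nodes are online).
offline_nodes() {
    awk '$1 == "Offline:" { for (i = 2; i <= NF; i++) print $i }'
}

# Example using the output from the offline check above.
printf '%s\n' \
    'Pacemaker Nodes:' \
    ' Online: ha-node1' \
    ' Standby:' \
    ' Maintenance:' \
    ' Offline: ha-node2' |
    offline_nodes
# prints: ha-node2
```

On a cluster node this would typically be used as: pcs status nodes | offline_nodes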

2.2 Troubleshooting Resource Faults

# Check resource status
[root@ha-node1 ~]# pcs status resources
* vip (ocf::heartbeat:IPaddr2): Started ha-node1
* nginx (systemd:nginx): Stopped (failed)
* mysql (systemd:mariadb): Started ha-node2

# Check the resource fail count
[root@ha-node1 ~]# pcs resource failcount show nginx
Fail counts for nginx
ha-node1: 3 (last-failure: Fri Apr 4 12:15:00 2026)

# View the details of the resource failure
[root@ha-node1 ~]# journalctl -u pacemaker --since "30 minutes ago" | grep nginx
Apr 04 12:15:00 ha-node1 pacemaker[12345]: error: nginx_monitor_20000[12346] exited with status 7
Apr 04 12:15:00 ha-node1 pacemaker[12345]: notice: Starting nginx on ha-node1
Apr 04 12:15:10 ha-node1 pacemaker[12345]: error: nginx_start_0[12347] exited with status 1
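
The numeric statuses in these log lines are easier to read once mapped to names. The sketch below assumes that the reported statuses follow the standard OCF resource agent return codes, which Pacemaker uses when reporting operation results. The function name ocf_rc_name is hypothetical, and only the common codes 0 to 7 are covered.

```shell
#!/bin/bash
# ocf_rc_name (hypothetical helper): translate an OCF resource agent exit
# status into its symbolic name. Only the common codes 0-7 are listed.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;
        7) echo "OCF_NOT_RUNNING" ;;
        *) echo "unknown ($1)" ;;
    esac
}

# The two statuses seen in the log excerpt above.
ocf_rc_name 7   # monitor result: OCF_NOT_RUNNING
ocf_rc_name 1   # start result:   OCF_ERR_GENERIC
```

Read this way, the excerpt says that the monitor found nginx not running and that the subsequent start attempt failed with a generic error, which is why the next step inspects the service itself.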

# Check the service status
[root@ha-node1 ~]# systemctl status nginx
● nginx.service - The nginx HTTP and reverse proxy server
Loaded: loaded (/usr/lib/systemd/system/nginx.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Fri 2026-04-04 12:15:10 CST; 5min ago
Process: 12347 ExecStart=/usr/sbin/nginx (code=exited, status=1/FAILURE)
Main PID: 12345 (code=exited, status=0/SUCCESS)

Apr 04 12:15:10 ha-node1 nginx[12347]: nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)

# Find and stop the process occupying port 80
[root@ha-node1 ~]# netstat -tlnp | grep :80
tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 12348/httpd

[root@ha-node1 ~]# systemctl stop httpd

# Clear the resource failure state
[root@ha-node1 ~]# pcs resource cleanup nginx
Cleaned up nginx on ha-node1
Cleaned up nginx on ha-node2

# Restart the resource
[root@ha-node1 ~]# pcs resource restart nginx
nginx: Restarting...
nginx: Successfully restarted

# Verify the resource status
[root@ha-node1 ~]# pcs status resources
* vip (ocf::heartbeat:IPaddr2): Started ha-node1
* nginx (systemd:nginx): Started ha-node1
* mysql (systemd:mariadb): Started ha-node2

Part03-Troubleshooting Quorum Faults

3.1 Troubleshooting Quorum Loss

# Check quorum status
[root@ha-node1 ~]# pcs quorum status
Quorum information
------------------
Date: Fri Apr 4 12:20:00 2026
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 1
Quorate: No

Votequorum information
----------------------
Expected votes: 2
Highest expected: 2
Total votes: 1
Quorum: 2
Flags: 2Node

Membership information
----------------------
Nodeid Votes Name
1 1 ha-node1 (local)
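
The "Quorum: 2" value in this output matches the usual majority rule, floor(expected votes / 2) + 1. The following is a minimal sketch of that arithmetic, assuming the default majority calculation. The helper names quorum_threshold and is_quorate are hypothetical, and votequorum options such as two_node or last_man_standing can change the effective behavior.

```shell
#!/bin/bash
# quorum_threshold (hypothetical helper): votes needed for a majority of the
# given number of expected votes, i.e. floor(expected / 2) + 1.
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# is_quorate (hypothetical helper): succeed if the votes present ($1) reach
# the majority threshold for the expected votes ($2).
is_quorate() {
    local total=$1 expected=$2
    [ "$total" -ge "$(quorum_threshold "$expected")" ]
}

# Values from the quorum status above: 1 vote present, 2 expected.
if is_quorate 1 2; then
    echo "quorate"
else
    echo "not quorate (need $(quorum_threshold 2) votes)"
fi
# prints: not quorate (need 2 votes)
```

With these numbers a single surviving node of a two-node cluster cannot reach the threshold, which is consistent with "Quorate: No" above.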

# Check node status
[root@ha-node1 ~]# pcs status nodes
Pacemaker Nodes:
Online: ha-node1
Standby:
Maintenance:
Offline: ha-node2

# Check network connectivity
[root@ha-node1 ~]# ping -c 3 ha-node2
PING ha-node2 (192.168.1.11) 56(84) bytes of data.
From ha-node1 (192.168.1.10) icmp_seq=1 Destination Host Unreachable
From ha-node1 (192.168.1.10) icmp_seq=2 Destination Host Unreachable
From ha-node1 (192.168.1.10) icmp_seq=3 Destination Host Unreachable

--- ha-node2 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2003ms

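# Temporarily ignore quorum loss (emergency use only: risks split-brain if ha-node2 is still running resources)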
[root@ha-node1 ~]# pcs property set no-quorum-policy=ignore

# Verify the setting
[root@ha-node1 ~]# pcs property list no-quorum-policy
no-quorum-policy: ignore

# Bring the node back, then restore the quorum policy
[root@ha-node1 ~]# pcs cluster start ha-node2
ha-node2: Starting Cluster (corosync)...
ha-node2: Starting Cluster (pacemaker)...

[root@ha-node1 ~]# pcs property set no-quorum-policy=stop

Part04-Log Analysis Tools

4.1 Log Analysis with pcs

# View cluster events (this subcommand may not exist in every pcs version; check `pcs status --help`)
[root@ha-node1 ~]# pcs status events
Events:
* Fri Apr 4 12:15:00 2026: nginx failed on ha-node1
* Fri Apr 4 12:15:01 2026: nginx restarted on ha-node1
* Fri Apr 4 12:15:10 2026: nginx recovered on ha-node1
* Fri Apr 4 12:20:00 2026: ha-node2 went offline
* Fri Apr 4 12:25:00 2026: ha-node2 came online

# Create a log analysis script
[root@ha-node1 ~]# cat > /usr/local/bin/analyze_cluster_log.sh << 'EOF'
#!/bin/bash
# Summarize recent cluster-related errors, failures, warnings, and fence events.
LOG_FILE="/var/log/messages"
OUTPUT="/var/log/cluster_analysis.log"

# Start a fresh report (overwrites the previous one)
echo "=== Cluster Log Analysis ===" > "$OUTPUT"
echo "Date: $(date)" >> "$OUTPUT"
echo "" >> "$OUTPUT"

echo "=== Errors ===" >> "$OUTPUT"
grep -i "error" "$LOG_FILE" | grep -E "pacemaker|corosync|pcs" | tail -20 >> "$OUTPUT"
echo "" >> "$OUTPUT"

echo "=== Failures ===" >> "$OUTPUT"
grep -i "fail" "$LOG_FILE" | grep -E "pacemaker|corosync|pcs" | tail -20 >> "$OUTPUT"
echo "" >> "$OUTPUT"

echo "=== Warnings ===" >> "$OUTPUT"
grep -i "warning" "$LOG_FILE" | grep -E "pacemaker|corosync|pcs" | tail -20 >> "$OUTPUT"
echo "" >> "$OUTPUT"

echo "=== Fence Events ===" >> "$OUTPUT"
grep -i "fence" "$LOG_FILE" | tail -10 >> "$OUTPUT"
echo "" >> "$OUTPUT"

echo "Analysis completed. See $OUTPUT"
EOF

# Make the script executable
[root@ha-node1 ~]# chmod +x /usr/local/bin/analyze_cluster_log.sh

# Run the analysis
[root@ha-node1 ~]# /usr/local/bin/analyze_cluster_log.sh
Analysis completed. See /var/log/cluster_analysis.log

# View the analysis results
[root@ha-node1 ~]# cat /var/log/cluster_analysis.log
=== Cluster Log Analysis ===
Date: Fri Apr 4 12:30:00 CST 2026

=== Errors ===
Apr 4 12:15:00 ha-node1 pacemaker[12345]: error: nginx_monitor_20000[12346] exited with status 7

=== Failures ===
Apr 4 12:15:10 ha-node1 pacemaker[12345]: notice: nginx_start_0[12347] exited with status 1

=== Warnings ===
(no output)

=== Fence Events ===
(no output)
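
The script above hard-codes its input and output paths. As a variation, the same filters can be wrapped in a function that takes the log file as an argument and writes to standard output, which makes it easy to try on a copy of a log or on rotated files. This is only a sketch: the function name analyze_log and the lowercase section titles are hypothetical choices, and the sample file exists purely so the example runs without a cluster.

```shell
#!/bin/bash
# analyze_log (hypothetical helper): print the same kind of sections as the
# script above for the log file given as $1, writing the report to stdout.
analyze_log() {
    local log_file=$1 keyword
    for keyword in error fail warning; do
        echo "=== ${keyword} ==="
        # An empty section is fine, so a non-matching grep is not an error.
        grep -i -- "$keyword" "$log_file" | grep -E "pacemaker|corosync|pcs" | tail -20 || true
        echo ""
    done
    echo "=== fence ==="
    grep -i "fence" "$log_file" | tail -10 || true
    return 0
}

# Example on a throwaway file so the sketch runs without a cluster.
sample=$(mktemp)
printf '%s\n' \
    'Apr 4 12:15:00 ha-node1 pacemaker[12345]: error: nginx_monitor_20000[12346] exited with status 7' \
    'Apr 4 12:15:00 ha-node1 pacemaker[12345]: notice: Starting nginx on ha-node1' > "$sample"
analyze_log "$sample"
rm -f "$sample"
```

On a live node the call would be analyze_log /var/log/messages, optionally redirected to a report file.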

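Fengge's recommendations for log analysis and troubleshooting:

Check the cluster logs regularly
Pay attention to error and warning messages
Set up log analysis scripts
Keep historical logs for later analysis
Configure log-based alert notifications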

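This article was compiled and published by Fengge Tutorials for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html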