Linux Tutorial FG330 - High-Availability Cluster Practical Summary

Compiled by: Fengge Tutorials (风哥教程) | Updated: 2026-01-25 | Category: Linux Tutorials | Document views: 36

Overview: This Fengge tutorial draws on the official documentation for Linux, Red Hat Enterprise Linux, Ansible Automation Platform, Docker, Kubernetes, and Podman, and explains how to configure and use the relevant technologies.

Fengge's note:

>This document summarizes hands-on experience and best practices for high-availability clusters.

Part01-Cluster Deployment Checklist

1.1 Pre-Deployment Checklist

# Check the OS version
[root@ha-node1 ~]# cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.2"
ID="rocky"
ID_LIKE="rhel centos fedora"

# Check the network configuration
[root@ha-node1 ~]# ip addr show
1: lo: mtu 65536
inet 127.0.0.1/8 scope host lo
2: ens33: mtu 1500
inet 192.168.1.10/24 brd 192.168.1.255 scope global ens33

# Check hostname resolution
[root@ha-node1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain
192.168.1.10 ha-node1.fgedu.net.cn ha-node1
192.168.1.11 ha-node2.fgedu.net.cn ha-node2
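The hostname check above can be automated. The following is a minimal sketch, assuming a hypothetical helper named check_hosts that takes a hosts file and a list of node names; it only confirms that each name appears as an entry, not that the address is correct.

```shell
# Sketch: confirm every cluster node name appears in a hosts file.
# check_hosts is a hypothetical helper; usage: check_hosts FILE NODE...
check_hosts() {
    hosts_file=$1
    shift
    missing=0
    for node in "$@"; do
        # Match the name as a whole field after the address, skipping comment lines
        if ! awk -v n="$node" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file"; then
            echo "MISSING: $node"
            missing=1
        fi
    done
    return "$missing"
}

# Example on a live node:
#   check_hosts /etc/hosts ha-node1 ha-node2
```

The function returns non-zero when any name is missing, so it can gate later deployment steps in a script.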

# Check time synchronization
[root@ha-node1 ~]# chronyc sources
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* 192.168.1.1 2 6 17 10 +123us[ +123us] +/- 10ms

# Check the firewall
[root@ha-node1 ~]# firewall-cmd --list-all
public (active)
target: default
icmp-block-inversion: no
interfaces: ens33
sources:
services: cockpit dhcpv6-client ssh
ports:
protocols:
forward: no
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
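The services line in the output above lists only cockpit, dhcpv6-client and ssh. Cluster traffic (pcsd, Pacemaker, Corosync) is normally opened through firewalld's predefined high-availability service. The sketch below, built around a hypothetical helper named has_ha_service, checks for that service in the firewall-cmd output; the commented commands show one way to enable it.

```shell
# Sketch: report whether the firewalld "high-availability" service is enabled.
# has_ha_service is a hypothetical helper that reads
# `firewall-cmd --list-all` output on stdin.
has_ha_service() {
    awk '
        $1 == "services:" {
            for (i = 2; i <= NF; i++) if ($i == "high-availability") found = 1
        }
        END { exit !found }'
}

# On a live node (run as root, on every cluster node):
#   firewall-cmd --list-all | has_ha_service || {
#       firewall-cmd --permanent --add-service=high-availability
#       firewall-cmd --reload
#   }
```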

# Check SELinux
[root@ha-node1 ~]# getenforce
Enforcing

1.2 Cluster Status Checks

# Check cluster service status
[root@ha-node1 ~]# pcs status
Cluster name: mycluster
Cluster Summary:
* Stack: corosync
* Current DC: ha-node1 (version 2.1.7-1.el9)
* Last updated: Fri Apr 4 13:20:00 2026
* Last change: Fri Apr 4 13:15:00 2026
* 2 nodes configured
* 5 resource instances configured

Node List:
* Online: [ ha-node1 ha-node2 ]

Full List of Resources:
* vip (ocf::heartbeat:IPaddr2): Started ha-node1
* nginx (systemd:nginx): Started ha-node1
* mysql (systemd:mariadb): Started ha-node1
* ipmi_fence (stonith:fence_ipmilan): Started ha-node1

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled

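# Check resources (pcs 0.11 on EL9 no longer provides "pcs resource show"; use "pcs resource status", or "pcs resource config" for full definitions)
[root@ha-node1 ~]# pcs resource status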
vip (ocf::heartbeat:IPaddr2): Started
nginx (systemd:nginx): Started
mysql (systemd:mariadb): Started
ipmi_fence (stonith:fence_ipmilan): Started

# Check constraints
[root@ha-node1 ~]# pcs constraint
Location Constraints:
No location constraints found
Ordering Constraints:
start vip then start nginx (Mandatory)
start nginx then start mysql (Mandatory)
Colocation Constraints:
nginx with vip (score:INFINITY)
mysql with nginx (score:INFINITY)

# Check quorum status
[root@ha-node1 ~]# pcs quorum status
Quorum information
------------------
Date: Fri Apr 4 13:20:00 2026
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 1
Quorate: Yes
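Quorum is a simple majority of the expected votes. The sketch below (majority_votes is a hypothetical helper, not a pcs command) works through the arithmetic and shows why a two-node cluster is a special case: the majority threshold equals the node count, so losing either node loses quorum unless Corosync's two_node option or a quorum device is in use.

```shell
# majority_votes: hypothetical helper; prints the votes needed for quorum
# given the total expected votes (simple majority: floor(n/2) + 1).
majority_votes() {
    echo $(( $1 / 2 + 1 ))
}

# Compare a few cluster sizes
for n in 2 3 5; do
    echo "expected_votes=$n quorum=$(majority_votes "$n")"
done
```

For 2, 3 and 5 expected votes this yields thresholds of 2, 2 and 3, so a three-node cluster (or two nodes plus a one-vote quorum device) tolerates one failure.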

Part02-Best Practices Summary

2.1 Cluster Configuration Best Practices

# 1. Enable STONITH
[root@ha-node1 ~]# pcs property set stonith-enabled=true

# 2. Set a sensible quorum policy
[root@ha-node1 ~]# pcs property set no-quorum-policy=stop

# 3. Set resource stickiness
[root@ha-node1 ~]# pcs resource defaults update resource-stickiness=100

# 4. Set the migration threshold
[root@ha-node1 ~]# pcs resource defaults update migration-threshold=3

# 5. Configure the monitor interval
[root@ha-node1 ~]# pcs resource update vip op monitor interval=15s timeout=10s

# 6. Configure start and stop timeouts
[root@ha-node1 ~]# pcs resource update nginx op start timeout=30s op stop timeout=30s

# 7. Configure resource constraints
[root@ha-node1 ~]# pcs constraint order vip then nginx
[root@ha-node1 ~]# pcs constraint colocation add nginx with vip INFINITY

# 8. Configure the fence device
[root@ha-node1 ~]# pcs stonith create ipmi_fence fence_ipmilan \
ipaddr=192.168.1.200 login=admin passwd=password

# 9. Configure a quorum device (two-node clusters)
[root@ha-node1 ~]# pcs quorum device add model net host=qdevice.fgedu.net.cn algorithm=lms

# 10. Enable ACLs
[root@ha-node1 ~]# pcs property set enable-acl=true


2.2 Operations Best Practices

# 1. Back up the cluster configuration regularly
[root@ha-node1 ~]# cat > /usr/local/bin/cluster_backup.sh << 'EOF'
#!/bin/bash
BACKUP_DIR="/backup/cluster"
DATE=$(date +%Y%m%d_%H%M%S)
# pcs config backup writes a bzip2 tarball, so the file is named .tar.bz2
BACKUP_FILE="${BACKUP_DIR}/cluster_${DATE}.tar.bz2"

mkdir -p "${BACKUP_DIR}"

if pcs config backup "${BACKUP_FILE}"; then
    echo "Backup successful: ${BACKUP_FILE}"
    # Delete backups older than 7 days
    find "${BACKUP_DIR}" -name "cluster_*.tar.bz2" -mtime +7 -delete
else
    echo "Backup failed!"
    exit 1
fi
EOF
[root@ha-node1 ~]# chmod +x /usr/local/bin/cluster_backup.sh
[root@ha-node1 ~]# crontab -e
0 2 * * * /usr/local/bin/cluster_backup.sh >> /var/log/cluster_backup.log 2>&1
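The backup script prunes archives by age. An alternative is to keep a fixed number of the newest archives, which behaves the same whatever the backup schedule is. The following is a minimal sketch, assuming a hypothetical helper named prune_backups and relying on the cluster_YYYYMMDD_HHMMSS naming used above, where sorting by name is the same as sorting by time.

```shell
# prune_backups: hypothetical helper; keeps the newest KEEP archives in DIR
# and deletes the rest. Usage: prune_backups DIR KEEP
prune_backups() {
    dir=$1
    keep=$2
    # Timestamped names sort chronologically, so reverse sort lists newest first;
    # tail skips the first $keep entries and passes the older ones to rm.
    ls -1 "$dir"/cluster_* 2>/dev/null | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r f; do
        rm -f -- "$f"
    done
}

# Example: keep the 7 most recent backups
#   prune_backups /backup/cluster 7
```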

# 2. Check cluster health regularly
[root@ha-node1 ~]# cat > /usr/local/bin/cluster_health_check.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/cluster_health.log"
echo "$(date '+%Y-%m-%d %H:%M:%S') - Starting health check" >> "$LOG_FILE"

# Check node status
ONLINE_NODES=$(pcs status nodes | grep "Online:" | awk -F: '{print $2}' | wc -w)
if [ "$ONLINE_NODES" -lt 2 ]; then
    echo "WARNING: Only $ONLINE_NODES nodes online" >> "$LOG_FILE"
fi

# Check resource status
FAILED_RESOURCES=$(pcs status resources | grep -c "Stopped\|Failed")
if [ "$FAILED_RESOURCES" -gt 0 ]; then
    echo "WARNING: $FAILED_RESOURCES resources failed" >> "$LOG_FILE"
fi

# Check quorum status
QUORUM=$(pcs quorum status | grep "Quorate:" | awk '{print $2}')
if [ "$QUORUM" != "Yes" ]; then
    echo "ERROR: Cluster lost quorum" >> "$LOG_FILE"
fi

echo "$(date '+%Y-%m-%d %H:%M:%S') - Health check completed" >> "$LOG_FILE"
EOF

[root@ha-node1 ~]# chmod +x /usr/local/bin/cluster_health_check.sh
[root@ha-node1 ~]# crontab -e
*/10 * * * * /usr/local/bin/cluster_health_check.sh
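The node count in the health check comes from text parsing, which is easier to trust when it can be exercised without a live cluster. The sketch below assumes a hypothetical helper named count_online_nodes and assumes that pcs status nodes prints a line of the form "Online: node1 node2" for cluster nodes before any remote-node section; verify that against your pcs version.

```shell
# count_online_nodes: hypothetical helper; reads `pcs status nodes` output on
# stdin and prints how many names follow the first "Online:" label.
# Only the first match is used so a later remote-node section is ignored.
count_online_nodes() {
    awk '$1 == "Online:" { n = NF - 1; exit } END { print n + 0 }'
}

# Example on a live node:
#   pcs status nodes | count_online_nodes
```

Because the helper reads stdin, saved command output can be replayed through it to check the parsing after a pcs upgrade.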

# 3. Test failover regularly
[root@ha-node1 ~]# cat > /usr/local/bin/test_failover.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/cluster_failover_test.log"
echo "$(date '+%Y-%m-%d %H:%M:%S') - Starting failover test" >> "$LOG_FILE"

# Record where each resource is running (resource@node; $2 alone is only the resource name)
BEFORE=$(pcs status resources | grep "Started" | awk '{print $2 "@" $NF}')

# Move the resource
pcs resource move vip

# Wait for the resource to move
sleep 10

# Record the new resource locations
AFTER=$(pcs status resources | grep "Started" | awk '{print $2 "@" $NF}')

# Clear the move constraint
pcs resource clear vip

echo "$(date '+%Y-%m-%d %H:%M:%S') - Failover test completed:" $BEFORE "->" $AFTER >> "$LOG_FILE"
EOF

[root@ha-node1 ~]# chmod +x /usr/local/bin/test_failover.sh
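The test script waits a fixed 10 seconds, which may be too short for slow resources and longer than necessary for fast ones. One option is to poll until the move is visible or a timeout expires. This is a sketch only: wait_until and vip_moved_from are hypothetical helpers, and the awk field positions assume the "* name (agent): Started node" layout shown in the status output earlier.

```shell
# wait_until: hypothetical helper; retries a command once per second until it
# succeeds or TIMEOUT seconds have elapsed. Returns 0 on success, 1 on timeout.
# Usage: wait_until TIMEOUT COMMAND [ARGS...]
wait_until() {
    timeout=$1
    shift
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# vip_moved_from: hypothetical check; succeeds once vip is reported as Started
# on a node other than the one given as $1.
vip_moved_from() {
    node=$(pcs status resources | awk '$2 == "vip" && /Started/ { print $NF }')
    [ -n "$node" ] && [ "$node" != "$1" ]
}

# Usage inside the failover test:
#   pcs resource move vip
#   wait_until 60 vip_moved_from ha-node1 || echo "vip did not move within 60s"
```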

Part03-Troubleshooting Guide

3.1 Common Troubleshooting

# 1. Node offline
[root@ha-node1 ~]# pcs status nodes
[root@ha-node1 ~]# ping -c 3 ha-node2
[root@ha-node1 ~]# ssh ha-node2 'systemctl status corosync pacemaker'

# 2. Resource failures
[root@ha-node1 ~]# pcs status resources
[root@ha-node1 ~]# pcs resource failcount show
[root@ha-node1 ~]# journalctl -u pacemaker --since "30 minutes ago" | grep <pattern>

# 3. Quorum failures (no-quorum-policy=ignore is a temporary workaround only; it risks split-brain)
[root@ha-node1 ~]# pcs quorum status
[root@ha-node1 ~]# pcs property set no-quorum-policy=ignore

# 4. Fence failures
[root@ha-node1 ~]# pcs stonith status
[root@ha-node1 ~]# pcs stonith fence <node> --off

# 5. Network failures (2224 and 3121 are TCP, 5405 is Corosync UDP, so list both protocols)
[root@ha-node1 ~]# ping -c 3 <host>
[root@ha-node1 ~]# netstat -tulnp | grep -E "2224|3121|5405"
[root@ha-node1 ~]# firewall-cmd --list-all

# 6. Log analysis
[root@ha-node1 ~]# journalctl -u pacemaker -u corosync --since "1 hour ago"
[root@ha-node1 ~]# tail -100 /var/log/messages | grep -E "pacemaker|corosync"
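When the logs are long, a quick count of error and warning lines helps decide where to look first. The following is a minimal sketch using a hypothetical helper named summarize_log; it matches the words "error" and "warning" anywhere in a line, which is an approximation and not a parse of Pacemaker's log format.

```shell
# summarize_log: hypothetical helper; reads journal or syslog text on stdin and
# prints how many lines mention "error" and "warning" (case-insensitive).
summarize_log() {
    awk '
        { line = tolower($0) }
        line ~ /error/   { err++ }
        line ~ /warning/ { warn++ }
        END { printf "errors=%d warnings=%d\n", err, warn }'
}

# Example on a live node:
#   journalctl -u pacemaker -u corosync --since "1 hour ago" --no-pager | summarize_log
```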

Part04-Cluster Optimization Recommendations

4.1 Performance Optimization Recommendations

# 1. Tune cluster timeouts
[root@ha-node1 ~]# pcs property set cluster-delay=30s
[root@ha-node1 ~]# pcs property set dc-deadtime=10s

# 2. Tune resource monitoring
[root@ha-node1 ~]# pcs resource update <resource id> op monitor interval=15s timeout=10s

# 3. Tune the Corosync configuration
[root@ha-node1 ~]# pcs cluster corosync | grep token
token: 5000
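The configured token value is not always the timeout Corosync actually uses. According to corosync.conf(5), when the nodelist contains at least three nodes the runtime token timeout is token + (number_of_nodes - 2) * token_coefficient, with token_coefficient defaulting to 650 ms. The sketch below works through that arithmetic; effective_token_ms is a hypothetical helper, not a Corosync tool.

```shell
# effective_token_ms: hypothetical helper; prints the runtime token timeout in ms.
# Args: token (ms), node count, token_coefficient (ms, defaults to 650).
effective_token_ms() {
    token=$1
    nodes=$2
    coeff=${3:-650}
    if [ "$nodes" -ge 3 ]; then
        # The coefficient applies only with three or more nodes
        echo $(( token + (nodes - 2) * coeff ))
    else
        echo "$token"
    fi
}

# Example: the two-node cluster in this article, then the same token with five nodes
#   effective_token_ms 5000 2
#   effective_token_ms 5000 5
```

With token set to 5000, a two-node cluster keeps 5000 ms, while five nodes would run with 6950 ms, which is worth remembering when sizing failover time.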

# 4. Tune resource stickiness
[root@ha-node1 ~]# pcs resource update <resource id> meta resource-stickiness=200

# 5. Tune the start order
[root@ha-node1 ~]# pcs constraint order <resource id> then <resource id>

# 6. Tune batch limits
[root@ha-node1 ~]# pcs property set batch-limit=30
[root@ha-node1 ~]# pcs property set node-action-limit=20

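High-availability cluster practical summary:

Plan thoroughly before deployment
Configure STONITH and a quorum device
Set sensible resource constraints
Back up and test regularly
Build comprehensive monitoring and alerting
Define a troubleshooting procedure
Keep optimizing cluster performance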

Compiled and published by Fengge Tutorials (风哥教程) for study and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html