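Summary: This Fengge tutorial draws on the official Linux, Red Hat Enterprise Linux, Ansible Automation Platform, Docker, Kubernetes, and Podman documentation, and describes in detail how to configure and use the relevant technologies.
Fengge's note:
>This document summarizes hands-on experience and best practices for high-availability clusters.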
Part01-Cluster Deployment Checklist
1.1 Pre-deployment checklist
# Check the OS version
[root@ha-node1 ~]# cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.2"
ID="rocky"
ID_LIKE="rhel centos fedora"
# Check the network configuration
[root@ha-node1 ~]# ip addr show
1: lo:
inet 127.0.0.1/8 scope host lo
2: ens33:
inet 192.168.1.10/24 brd 192.168.1.255 scope global ens33
# Check host name resolution
[root@ha-node1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain
192.168.1.10 ha-node1.fgedu.net.cn ha-node1
192.168.1.11 ha-node2.fgedu.net.cn ha-node2
# Check time synchronization
[root@ha-node1 ~]# chronyc sources
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* 192.168.1.1 2 6 17 10 +123us[ +123us] +/- 10ms
# Check the firewall
[root@ha-node1 ~]# firewall-cmd --list-all
public (active)
target: default
icmp-block-inversion: no
interfaces: ens33
sources:
services: cockpit dhcpv6-client ssh
ports:
protocols:
forward: no
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
# Check SELinux
[root@ha-node1 ~]# getenforce
Enforcing
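The host name check above can be scripted so that it runs the same way on every node. The following is a minimal sketch, assuming a hosts-format file; the function name `hosts_has_name` is hypothetical and not part of any tool used in this article.

```shell
#!/bin/bash
# hosts_has_name FILE NAME  (hypothetical helper, for illustration only)
# Succeeds if NAME appears as a host name or alias on a non-comment
# line of the hosts-format FILE.
hosts_has_name() {
    local file=$1 name=$2
    awk -v n="$name" '
        { sub(/#.*/, "") }    # strip comments; awk re-splits the fields
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit found ? 0 : 1 }
    ' "$file"
}
```

One possible use is `for n in ha-node1 ha-node2; do hosts_has_name /etc/hosts "$n" || echo "missing: $n"; done`, run on each node before creating the cluster.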
1.2 Cluster status check
[root@ha-node1 ~]# pcs status
Cluster name: mycluster
Cluster Summary:
* Stack: corosync
* Current DC: ha-node1 (version 2.1.7-1.el9)
* Last updated: Fri Apr 4 13:20:00 2026
* Last change: Fri Apr 4 13:15:00 2026
* 2 nodes configured
* 5 resource instances configured
Node List:
* Online: [ ha-node1 ha-node2 ]
Full List of Resources:
* vip (ocf::heartbeat:IPaddr2): Started ha-node1
* nginx (systemd:nginx): Started ha-node1
* mysql (systemd:mariadb): Started ha-node1
* ipmi_fence (stonith:fence_ipmilan): Started ha-node1
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
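The "Daemon Status" section of the output above is easy to check by script. The sketch below assumes the section layout shown above (one `name: state` pair per line); `daemons_not_ready` is a hypothetical helper name.

```shell
#!/bin/bash
# daemons_not_ready  (hypothetical helper, for illustration only)
# Reads `pcs status` text on stdin and prints the name of every daemon
# in the "Daemon Status:" section whose state is not "active/enabled".
daemons_not_ready() {
    awk '
        /^Daemon Status:/ { in_section = 1; next }
        in_section && NF == 2 && $2 != "active/enabled" {
            sub(/:$/, "", $1)    # drop the trailing colon from the name
            print $1
        }
    '
}
```

It would be called as `pcs status | daemons_not_ready`; empty output means all three daemons are running and enabled at boot.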
# Check the resource configuration
[root@ha-node1 ~]# pcs resource status
vip (ocf::heartbeat:IPaddr2): Started
nginx (systemd:nginx): Started
mysql (systemd:mariadb): Started
ipmi_fence (stonith:fence_ipmilan): Started
# Check the constraint configuration
[root@ha-node1 ~]# pcs constraint
Location Constraints:
No location constraints found
Ordering Constraints:
start vip then start nginx (Mandatory)
start nginx then start mysql (Mandatory)
Colocation Constraints:
nginx with vip (score:INFINITY)
mysql with nginx (score:INFINITY)
# Check the quorum status
[root@ha-node1 ~]# pcs quorum status
Quorum information
------------------
Date: Fri Apr 4 13:20:00 2026
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 1
Quorate: Yes
Part02-Best Practices Summary
2.1 Cluster configuration best practices
# 1. Enable STONITH
[root@ha-node1 ~]# pcs property set stonith-enabled=true
# 2. Set a sensible quorum policy
[root@ha-node1 ~]# pcs property set no-quorum-policy=stop
# 3. Set resource stickiness
[root@ha-node1 ~]# pcs resource defaults update resource-stickiness=100
# 4. Set the migration threshold
[root@ha-node1 ~]# pcs resource defaults update migration-threshold=3
# 5. Configure the monitor interval
[root@ha-node1 ~]# pcs resource update vip op monitor interval=15s timeout=10s
# 6. Configure start and stop timeouts
[root@ha-node1 ~]# pcs resource update nginx op start timeout=30s op stop timeout=30s
# 7. Configure resource constraints
[root@ha-node1 ~]# pcs constraint order vip then nginx
[root@ha-node1 ~]# pcs constraint colocation add nginx with vip INFINITY
# 8. Configure the fence device
[root@ha-node1 ~]# pcs stonith create ipmi_fence fence_ipmilan \
ipaddr=192.168.1.200 login=admin passwd=password
# 9. Configure a quorum device (two-node cluster)
[root@ha-node1 ~]# pcs quorum device add model net host=qdevice.fgedu.net.cn algorithm=lms
# 10. Enable ACLs
[root@ha-node1 ~]# pcs property set enable-acl=true
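Item 9 is easier to justify with the majority arithmetic written out. The sketch below uses the plain majority rule (quorum = floor(expected votes / 2) + 1) and assumes the quorum device contributes one vote in a two-node cluster. It deliberately ignores corosync's `two_node` special case. The function names are hypothetical.

```shell
#!/bin/bash
# quorum_threshold EXPECTED_VOTES  (hypothetical helper)
# Prints the votes needed for quorum: floor(EXPECTED_VOTES / 2) + 1.
quorum_threshold() {
    local expected=$1
    echo $(( expected / 2 + 1 ))
}

# survives_loss EXPECTED_VOTES LOST_VOTES  (hypothetical helper)
# Succeeds if the remaining votes still reach the quorum threshold.
survives_loss() {
    local expected=$1 lost=$2
    [ $(( expected - lost )) -ge "$(quorum_threshold "$expected")" ]
}
```

With two nodes and no quorum device, `survives_loss 2 1` fails, because one remaining vote is below the threshold of 2. With the quorum device added, `survives_loss 3 1` succeeds, because two of three votes remain.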
2.2 Operations best practices
# 1. Back up the cluster configuration regularly
[root@ha-node1 ~]# cat > /usr/local/bin/cluster_backup.sh << 'EOF'
#!/bin/bash
BACKUP_DIR="/backup/cluster"
DATE=$(date +%Y%m%d_%H%M%S)
# pcs config backup writes a .tar.bz2 archive, so name the file accordingly
BACKUP_FILE="${BACKUP_DIR}/cluster_${DATE}.tar.bz2"
mkdir -p "${BACKUP_DIR}"
if pcs config backup "${BACKUP_FILE}"; then
    echo "Backup successful: ${BACKUP_FILE}"
    # Delete backups older than 7 days
    find "${BACKUP_DIR}" -name "cluster_*.tar.bz2" -mtime +7 -delete
else
    echo "Backup failed!"
    exit 1
fi
EOF
[root@ha-node1 ~]# chmod +x /usr/local/bin/cluster_backup.sh
[root@ha-node1 ~]# crontab -e
0 2 * * * /usr/local/bin/cluster_backup.sh >> /var/log/cluster_backup.log 2>&1
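A backup is only useful if the newest archive can be found quickly when a restore is needed. The following sketch relies on the `cluster_<timestamp>` naming used by the backup script above; `latest_backup` is a hypothetical helper name.

```shell
#!/bin/bash
# latest_backup DIR  (hypothetical helper, for illustration only)
# Prints the newest cluster_* archive in DIR. The timestamp in the file
# name (YYYYmmdd_HHMMSS) sorts lexicographically, so the last name the
# glob yields is the newest. Fails if DIR holds no such archive.
latest_backup() {
    local dir=$1 f newest=""
    for f in "$dir"/cluster_*; do
        [ -e "$f" ] || continue    # the glob matched nothing
        newest=$f                  # globs expand in sorted order
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

The result could then be handed to `pcs config restore`, for example `pcs config restore "$(latest_backup /backup/cluster)"`. Rehearse this on a test cluster first.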
# 2. Check the cluster status regularly
[root@ha-node1 ~]# cat > /usr/local/bin/cluster_health_check.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/cluster_health.log"
echo "$(date '+%Y-%m-%d %H:%M:%S') - Starting health check" >> "$LOG_FILE"
# Check node status
ONLINE_NODES=$(pcs status nodes | grep "Online:" | awk -F: '{print $2}' | wc -w)
if [ "$ONLINE_NODES" -lt 2 ]; then
    echo "WARNING: Only $ONLINE_NODES nodes online" >> "$LOG_FILE"
fi
# Check resource status
FAILED_RESOURCES=$(pcs status resources | grep -c "Stopped\|Failed")
if [ "$FAILED_RESOURCES" -gt 0 ]; then
    echo "WARNING: $FAILED_RESOURCES resources failed" >> "$LOG_FILE"
fi
# Check quorum status
QUORUM=$(pcs quorum status | grep "Quorate:" | awk '{print $2}')
if [ "$QUORUM" != "Yes" ]; then
    echo "ERROR: Cluster lost quorum" >> "$LOG_FILE"
fi
echo "$(date '+%Y-%m-%d %H:%M:%S') - Health check completed" >> "$LOG_FILE"
EOF
[root@ha-node1 ~]# chmod +x /usr/local/bin/cluster_health_check.sh
[root@ha-node1 ~]# crontab -e
*/10 * * * * /usr/local/bin/cluster_health_check.sh
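The health check above only appends to a log file, so problems go unnoticed unless someone reads it. The sketch below shows one way to surface them. It assumes the WARNING/ERROR prefixes written by the script above; `health_problems` is a hypothetical helper name.

```shell
#!/bin/bash
# health_problems LOG_FILE [LINES]  (hypothetical helper)
# Prints the WARNING/ERROR entries found in the last LINES lines
# (default 50) of LOG_FILE. The exit status is that of grep, so the
# function succeeds only if at least one such entry exists.
health_problems() {
    local log=$1 lines=${2:-50}
    tail -n "$lines" "$log" | grep -E 'WARNING|ERROR'
}
```

A cron job or monitoring agent could then run something like `health_problems /var/log/cluster_health.log && logger -p daemon.warning -t cluster_health "cluster problems logged"`.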
# 3. Test failover regularly
[root@ha-node1 ~]# cat > /usr/local/bin/test_failover.sh << 'EOF'
#!/bin/bash
LOG_FILE="/var/log/cluster_failover_test.log"
echo "$(date '+%Y-%m-%d %H:%M:%S') - Starting failover test" >> "$LOG_FILE"
# Record the node currently running vip (last field of its status line)
BEFORE=$(pcs status resources | awk '/vip/ && /Started/ {print $NF}')
# Move the resource
pcs resource move vip
# Wait for the resource to move
sleep 10
# Record the node running vip after the move
AFTER=$(pcs status resources | awk '/vip/ && /Started/ {print $NF}')
# Clear the move constraint
pcs resource clear vip
echo "$(date '+%Y-%m-%d %H:%M:%S') - Failover test completed: $BEFORE -> $AFTER" >> "$LOG_FILE"
EOF
[root@ha-node1 ~]# chmod +x /usr/local/bin/test_failover.sh
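The fixed `sleep 10` in the failover test may be too short on a loaded cluster and needlessly long on an idle one. Polling until the resource has actually moved is one alternative. This is a minimal sketch; `wait_until` is a hypothetical helper, not a pcs feature.

```shell
#!/bin/bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# (hypothetical helper, for illustration only)
# Re-runs COMMAND every INTERVAL_SECONDS until it succeeds or
# TIMEOUT_SECONDS have elapsed. Returns 0 as soon as COMMAND succeeds
# and 1 on timeout.
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    local deadline=$(( SECONDS + timeout ))    # SECONDS is a bash built-in counter
    while true; do
        "$@" && return 0
        [ "$SECONDS" -ge "$deadline" ] && return 1
        sleep "$interval"
    done
}
```

In the test script it might replace the sleep with something like `wait_until 60 2 sh -c 'pcs status resources | grep vip | grep -q "Started ha-node2"'`, where the target node name is an assumption for a two-node setup.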
Part03-Troubleshooting Guide
3.1 Common troubleshooting
# 1. Node failure troubleshooting
[root@ha-node1 ~]# pcs status nodes
[root@ha-node1 ~]# ping -c 3 ha-node2
[root@ha-node1 ~]# ssh ha-node2 'systemctl status corosync pacemaker'
# 2. Resource failure troubleshooting
[root@ha-node1 ~]# pcs status resources
[root@ha-node1 ~]# pcs resource failcount show
[root@ha-node1 ~]# journalctl -u pacemaker --since "30 minutes ago" | grep
# 3. Quorum failure troubleshooting
[root@ha-node1 ~]# pcs quorum status
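# Temporary workaround only: ignoring quorum loss risks split-brain; restore no-quorum-policy=stop afterwards
[root@ha-node1 ~]# pcs property set no-quorum-policy=ignore
# 4. Fence troubleshooting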
[root@ha-node1 ~]# pcs stonith status
[root@ha-node1 ~]# pcs stonith fence
# 5. Network troubleshooting
[root@ha-node1 ~]# ping -c 3
# corosync (5405) uses UDP, so list UDP sockets as well as TCP
[root@ha-node1 ~]# netstat -tulnp | grep -E "2224|3121|5405"
[root@ha-node1 ~]# firewall-cmd --list-all
# 6. Log analysis
[root@ha-node1 ~]# journalctl -u pacemaker -u corosync --since "1 hour ago"
[root@ha-node1 ~]# tail -100 /var/log/messages | grep -E "pacemaker|corosync"
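The port check in step 5 can be turned into a helper that reports which cluster ports have no socket. The sketch below parses `ss -tuln` (or `netstat -tuln`) output from stdin; `missing_ports` is a hypothetical name. Port 3121 is only expected where pacemaker_remote is in use.

```shell
#!/bin/bash
# missing_ports PORT...  (hypothetical helper, for illustration only)
# Reads `ss -tuln` or `netstat -tuln` output on stdin and prints each
# PORT that has no listening or bound socket.
missing_ports() {
    local listing port
    listing=$(cat)
    for port in "$@"; do
        # ":PORT" followed by whitespace matches the local-address column
        printf '%s\n' "$listing" | grep -Eq ":${port}[[:space:]]" || echo "$port"
    done
}
```

It would be used as `ss -tuln | missing_ports 2224 5405`; empty output means both pcsd and corosync have sockets open.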
Part04-Cluster Optimization Recommendations
4.1 Performance tuning recommendations
# 1. Tune cluster properties
[root@ha-node1 ~]# pcs property set cluster-delay=30s
[root@ha-node1 ~]# pcs property set dc-deadtime=10s
# 2. Tune resource monitoring
[root@ha-node1 ~]# pcs resource update
# 3. Tune the Corosync configuration
[root@ha-node1 ~]# pcs cluster corosync | grep token
token: 5000
# 4. Tune resource stickiness
[root@ha-node1 ~]# pcs resource update
# 5. Tune the start order
[root@ha-node1 ~]# pcs constraint order
# 6. Tune batch limits
[root@ha-node1 ~]# pcs property set batch-limit=30
[root@ha-node1 ~]# pcs property set node-action-limit=20
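When tuning the Corosync token in step 3, note that the configured `token` is not always the timeout in effect. According to corosync.conf(5), with three or more nodes in the nodelist the real timeout is `token + (nodes - 2) * token_coefficient`, and the coefficient defaults to 650 ms. A minimal sketch of that arithmetic follows; `effective_token_ms` is a hypothetical helper name.

```shell
#!/bin/bash
# effective_token_ms TOKEN_MS NODE_COUNT [TOKEN_COEFFICIENT_MS]
# (hypothetical helper, for illustration only)
# Prints the token timeout corosync actually applies: for three or more
# nodes, TOKEN_MS + (NODE_COUNT - 2) * TOKEN_COEFFICIENT_MS; otherwise
# TOKEN_MS unchanged. The coefficient defaults to 650 ms.
effective_token_ms() {
    local token=$1 nodes=$2 coeff=${3:-650}
    if [ "$nodes" -ge 3 ]; then
        echo $(( token + (nodes - 2) * coeff ))
    else
        echo "$token"
    fi
}
```

For the two-node cluster in this article, `effective_token_ms 5000 2` prints 5000. If the cluster later grew to five nodes, `effective_token_ms 5000 5` would print 6950.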
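- Plan thoroughly before deployment
- Configure STONITH and a quorum device
- Set sensible resource constraints
- Back up and test regularly
- Build comprehensive monitoring and alerting
- Define a troubleshooting procedure
- Keep tuning cluster performance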
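This article was compiled and published by Fengge Tutorials for study and testing only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html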
