Linux Tutorial FG319-Cluster Service Failover Testing

Summary: This Fengge tutorial draws on the official Linux, Red Hat Enterprise Linux, Ansible Automation Platform, Docker, Kubernetes, and Podman documentation.

It explains how to configure cluster service resources with pcs and how to test failover.

Part01-Service Resource Configuration

1.1 Create the Web service resources

# Create the virtual IP and the httpd service resource
[root@ha-node1 ~]# pcs resource create web_vip ocf:heartbeat:IPaddr2 \
ip=192.168.1.100 cidr_netmask=24 \
op monitor interval=30s

[root@ha-node1 ~]# pcs resource create web_service systemd:httpd \
op monitor interval=20s timeout=10s

# Add both resources to a group (members run on the same node and start in the listed order)
[root@ha-node1 ~]# pcs resource group add webgroup web_vip web_service

# Show the group configuration
[root@ha-node1 ~]# pcs resource config webgroup
Group: webgroup
Resource: web_vip (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=24 ip=192.168.1.100
Operations: monitor interval=30s (web_vip-monitor-interval-30s)
Resource: web_service (class=systemd type=httpd)
Operations: monitor interval=20s timeout=10s (web_service-monitor-interval-20s)

# Check resource status
[root@ha-node1 ~]# pcs status resources
* Resource Group: webgroup:
  * web_vip (ocf::heartbeat:IPaddr2): Started ha-node1
  * web_service (systemd:httpd): Started ha-node1

1.2 Create the database service resources

# Create the virtual IP and the MariaDB service resource
[root@ha-node1 ~]# pcs resource create db_vip ocf:heartbeat:IPaddr2 \
ip=192.168.1.101 cidr_netmask=24 \
op monitor interval=30s

[root@ha-node1 ~]# pcs resource create db_service systemd:mariadb \
op monitor interval=20s timeout=10s

# Add both resources to a group
[root@ha-node1 ~]# pcs resource group add dbgroup db_vip db_service

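# Check all resources
[root@ha-node1 ~]# pcs status resources
* Resource Group: webgroup:
  * web_vip (ocf::heartbeat:IPaddr2): Started ha-node1
  * web_service (systemd:httpd): Started ha-node1
* Resource Group: dbgroup:
  * db_vip (ocf::heartbeat:IPaddr2): Started ha-node2
  * db_service (systemd:mariadb): Started ha-node2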

# Verify that both services respond on their virtual IPs
[root@ha-node1 ~]# curl -I http://192.168.1.100
HTTP/1.1 200 OK
Date: Fri, 04 Apr 2026 11:10:00 GMT
Server: Apache/2.4.53 (Rocky Linux)
Last-Modified: Fri, 04 Apr 2026 10:00:00 GMT
ETag: "1234-5678"
Accept-Ranges: bytes
Content-Length: 1234
Content-Type: text/html; charset=UTF-8

[root@ha-node1 ~]# mysql -h 192.168.1.101 -u root -p -e "SELECT 1"
Enter password:
+---+
| 1 |
+---+
| 1 |
+---+

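Part02-Failover Testing

2.1 Node failure test

# Check the current resource distribution
[root@ha-node1 ~]# pcs status resources
* Resource Group: webgroup:
  * web_vip (ocf::heartbeat:IPaddr2): Started ha-node1
  * web_service (systemd:httpd): Started ha-node1
* Resource Group: dbgroup:
  * db_vip (ocf::heartbeat:IPaddr2): Started ha-node2
  * db_service (systemd:mariadb): Started ha-node2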

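# Simulate a node1 failure by putting it into standby (resources are moved off gracefully; fencing is not exercised)
[root@ha-node1 ~]# pcs node standby ha-node1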

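# Check resource status (webgroup should have moved to node2)
[root@ha-node1 ~]# pcs status resources
* Resource Group: webgroup:
  * web_vip (ocf::heartbeat:IPaddr2): Started ha-node2
  * web_service (systemd:httpd): Started ha-node2
* Resource Group: dbgroup:
  * db_vip (ocf::heartbeat:IPaddr2): Started ha-node2
  * db_service (systemd:mariadb): Started ha-node2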

# Verify from a client that the services are still reachable
[root@client ~]# curl -I http://192.168.1.100
HTTP/1.1 200 OK
Date: Fri, 04 Apr 2026 11:12:00 GMT
Server: Apache/2.4.53 (Rocky Linux)
Content-Type: text/html; charset=UTF-8

[root@client ~]# mysql -h 192.168.1.101 -u root -p -e "SELECT 1"
Enter password:
+---+
| 1 |
+---+
| 1 |
+---+

# Bring node1 back out of standby
[root@ha-node1 ~]# pcs node unstandby ha-node1

# Check node status
[root@ha-node1 ~]# pcs status nodes
Pacemaker Nodes:
Online: ha-node1 ha-node2
Standby:
Maintenance:
Offline:
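
To see how long clients actually lose the service during the standby test above, a probe can be run from the client while the resources move. The following is a minimal sketch, not part of pcs or Pacemaker: measure_downtime is a hypothetical helper that counts consecutive failed probes, so the downtime is roughly that count multiplied by the probe interval.

```shell
# Sketch: count how many consecutive probes fail while resources move.
# measure_downtime is a hypothetical helper, not part of pcs or Pacemaker.
#
# Usage: measure_downtime MAX_PROBES INTERVAL_SECONDS PROBE_COMMAND [ARGS...]
# Prints the number of failed probes between the first failure and the next
# success (0 if the probe never failed). Returns 1 if the probe is still
# failing when MAX_PROBES is exhausted.
measure_downtime() {
    local max_probes interval probe_no down_since
    max_probes=$1
    interval=$2
    shift 2
    probe_no=0
    down_since=-1
    while [ "$probe_no" -lt "$max_probes" ]; do
        if "$@" >/dev/null 2>&1; then
            # Probe succeeded: report the outage length if one was in progress.
            if [ "$down_since" -ge 0 ]; then
                echo $((probe_no - down_since))
                return 0
            fi
        elif [ "$down_since" -lt 0 ]; then
            # First failed probe marks the start of the outage.
            down_since=$probe_no
        fi
        probe_no=$((probe_no + 1))
        sleep "$interval"
    done
    if [ "$down_since" -ge 0 ]; then
        echo "probe still failing after $max_probes attempts" >&2
        return 1
    fi
    echo 0
}

# Example (run on the client, then put ha-node1 into standby):
#   measure_downtime 120 1 curl -fsS -o /dev/null --max-time 1 http://192.168.1.100
```

The probe command is passed as arguments so the same helper can check the database VIP with a mysql command instead of curl.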

2.2 Service failure test

# Stop the Web service manually
[root@ha-node1 ~]# systemctl stop httpd

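# Check resource status (the cluster should detect the failure at the next 20s monitor and restart the service)
[root@ha-node1 ~]# pcs status resources
* Resource Group: webgroup:
  * web_vip (ocf::heartbeat:IPaddr2): Started ha-node1
  * web_service (systemd:httpd): Started ha-node1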

# Check the service status
[root@ha-node1 ~]# systemctl status httpd
● httpd.service - The Apache HTTP Server
Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; preset: disabled)
Active: active (running) since Fri 2026-04-04 11:15:00 CST; 10s ago
Docs: man:httpd(8)
man:apachectl(8)
Main PID: 12345 (httpd)
Status: "Total requests: 0; Idle/Busy workers 100/0;Requests/sec: 0; Bytes served/sec: 0 B/sec"
Tasks: 213 (limit: 11232)
Memory: 25.6M
CGroup: /system.slice/httpd.service
├─12345 /usr/sbin/httpd -DFOREGROUND
├─12346 /usr/sbin/httpd -DFOREGROUND
└─12347 /usr/sbin/httpd -DFOREGROUND

Apr 04 11:15:00 ha-node1 systemd[1]: Started The Apache HTTP Server.

# Show the fail count
[root@ha-node1 ~]# pcs resource failcount show web_service
Failcounts for resource 'web_service'
  ha-node1: 1

# Clear the fail count and failure history
[root@ha-node1 ~]# pcs resource cleanup web_service
Cleaned up web_service on ha-node1
Cleaned up web_service on ha-node2

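Part03-Failover Tuning

3.1 Set the migration threshold

# Set the migration threshold (the resource leaves a node after 3 failures there)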
[root@ha-node1 ~]# pcs resource update web_service meta migration-threshold=3

# Set the failure timeout (failures older than 60s no longer count)
[root@ha-node1 ~]# pcs resource update web_service meta failure-timeout=60s

# Show the resource configuration
[root@ha-node1 ~]# pcs resource config web_service
Resource: web_service (class=systemd type=httpd)
Meta Attrs: failure-timeout=60s migration-threshold=3
Operations: monitor interval=20s timeout=10s (web_service-monitor-interval-20s)
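
How the two meta attributes above interact can be sketched as a small decision function. This is a simplified model under stated assumptions, not Pacemaker's implementation: recovery_action is a hypothetical helper, and real fail counts only expire when the cluster next re-evaluates its state, so expiry is not instantaneous.

```shell
# Simplified model of migration-threshold and failure-timeout.
# recovery_action is a hypothetical helper for illustration only.
#
# Usage: recovery_action FAILCOUNT THRESHOLD SECONDS_SINCE_LAST_FAILURE FAILURE_TIMEOUT
# Prints "restart" while the node may still host the resource and
# "move" once the effective fail count has reached the threshold.
recovery_action() {
    local failcount threshold age timeout
    failcount=$1
    threshold=$2
    age=$3
    timeout=$4
    # A failure-timeout of 0 means fail counts never expire on their own.
    if [ "$timeout" -gt 0 ] && [ "$age" -ge "$timeout" ]; then
        failcount=0
    fi
    if [ "$failcount" -ge "$threshold" ]; then
        echo move
    else
        echo restart
    fi
}

# With migration-threshold=3 and failure-timeout=60s:
#   recovery_action 1 3 10 60   # first failure, restart in place
#   recovery_action 3 3 10 60   # third failure within 60s, move away
#   recovery_action 3 3 90 60   # failures have expired, restart in place
```

In this model the single failure from section 2.2 leaves web_service on ha-node1, which matches the restart observed there.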

# Set resource stickiness on the group
[root@ha-node1 ~]# pcs resource meta webgroup resource-stickiness=100

# Show the group configuration
[root@ha-node1 ~]# pcs resource config webgroup
Group: webgroup
Meta Attrs: resource-stickiness=100
Resource: web_vip (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=24 ip=192.168.1.100
Operations: monitor interval=30s (web_vip-monitor-interval-30s)
Resource: web_service (class=systemd type=httpd)
Meta Attrs: failure-timeout=60s migration-threshold=3
Operations: monitor interval=20s timeout=10s (web_service-monitor-interval-20s)

3.2 Set the failover policy

# Set node preferences (a higher score means a more preferred node)
[root@ha-node1 ~]# pcs constraint location webgroup prefers ha-node1=100
[root@ha-node1 ~]# pcs constraint location webgroup prefers ha-node2=50

# Show location constraints
[root@ha-node1 ~]# pcs constraint location
Location Constraints:
Resource: webgroup
Enabled on: ha-node1 (score:100)
Enabled on: ha-node2 (score:50)

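# Control failback: raise stickiness above the ha-node1 location score (100) so the group stays where it is after ha-node1 recovers
[root@ha-node1 ~]# pcs resource meta webgroup resource-stickiness=200

# Verify the configuration
[root@ha-node1 ~]# pcs resource config webgroup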
Group: webgroup
Meta Attrs: resource-stickiness=200
Resource: web_vip (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=24 ip=192.168.1.100
Operations: monitor interval=30s (web_vip-monitor-interval-30s)
Resource: web_service (class=systemd type=httpd)
Meta Attrs: failure-timeout=60s migration-threshold=3
Operations: monitor interval=20s timeout=10s (web_service-monitor-interval-20s)
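
Whether the group fails back depends on how the location scores compare with stickiness. The sketch below is a deliberately simplified model of that comparison. pick_node is a hypothetical helper; it assumes only location scores and stickiness matter, and that stickiness counts once per active group member, as the Pacemaker documentation describes for groups. Ties and all other constraints are ignored.

```shell
# Simplified model: which node wins when location scores meet stickiness.
# pick_node is a hypothetical helper, not a Pacemaker or pcs command.
#
# Usage: pick_node CURRENT_NODE STICKINESS ACTIVE_MEMBERS NODE=SCORE...
# Prints the winning node and its total score.
pick_node() {
    local current stickiness members best best_score pair node score
    current=$1
    stickiness=$2
    members=$3
    shift 3
    best=""
    best_score=0
    for pair in "$@"; do
        node=${pair%%=*}
        score=${pair#*=}
        # Only the node currently running the group gets the stickiness bonus.
        if [ "$node" = "$current" ]; then
            score=$((score + stickiness * members))
        fi
        if [ -z "$best" ] || [ "$score" -gt "$best_score" ]; then
            best=$node
            best_score=$score
        fi
    done
    echo "$best $best_score"
}

# webgroup is running on ha-node2 after a failover and ha-node1 has recovered:
#   pick_node ha-node2 200 2 ha-node1=100 ha-node2=50   # stickiness 200
#   pick_node ha-node2 0 2 ha-node1=100 ha-node2=50     # no stickiness
```

With stickiness 200 the group stays on ha-node2; with no stickiness the location preference for ha-node1 wins and the group fails back. On a live cluster, the scores Pacemaker actually computed can be inspected with crm_simulate -sL.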

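Part04-Failover Monitoring

4.1 Monitor failover

# Review cluster events for web_service in the Pacemaker journal
[root@ha-node1 ~]# journalctl -u pacemaker --no-pager --since "2026-04-04 11:14:00" | grep web_service
# The entries should cover the sequence from the service failure test in 2.2:
#   Fri Apr 4 11:14:50 2026: web_service failed on ha-node1
#   Fri Apr 4 11:14:51 2026: web_service restarted on ha-node1
#   Fri Apr 4 11:15:00 2026: web_service recovered on ha-node1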

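# Show the resource operation history (output abbreviated to the recurring monitor operations)
[root@ha-node1 ~]# crm_mon -1 --operations --timing-details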
Operations:
* web_vip: monitor interval=30s last-rc-change=Fri Apr 4 11:15:00 2026 exec-time=10ms
* web_service: monitor interval=20s last-rc-change=Fri Apr 4 11:15:00 2026 exec-time=5ms
* db_vip: monitor interval=30s last-rc-change=Fri Apr 4 11:10:00 2026 exec-time=10ms
* db_service: monitor interval=20s last-rc-change=Fri Apr 4 11:10:00 2026 exec-time=8ms

# Check node status in both Pacemaker and Corosync
[root@ha-node1 ~]# pcs status nodes both
Pacemaker Nodes:
Online: ha-node1 ha-node2
Standby:
Maintenance:
Offline:

Corosync Nodes:
Online: ha-node1 ha-node2
Offline:

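Fengge's recommendations for failover testing:

Run failover tests on a regular schedule.
Set a reasonable migration threshold.
Configure resource stickiness to avoid frequent switching.
Monitor how long each failover takes.
Record failover events.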
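
For the last recommendation, one possible approach is to poll the resource location and append a line to a log file whenever it changes. This is a sketch under assumptions: node_from_locate and record_move are hypothetical helper names, the log path is an example, and the parser assumes crm_resource --locate prints a line of the form "resource webgroup is running on: ha-node1".

```shell
# Sketch: append a line to a log file whenever a resource changes node.
# node_from_locate and record_move are hypothetical helper names.

# Reads `crm_resource --locate` output on stdin and prints the node name.
# Assumed line format: "resource webgroup is running on: ha-node1"
# Prints nothing if the resource is not running anywhere.
node_from_locate() {
    sed -n 's/^resource .* is running on: \([^ ]*\).*$/\1/p' | head -n 1
}

# Usage: record_move LOGFILE RESOURCE OLD_NODE NEW_NODE
# Writes a timestamped line only when the node actually changed.
record_move() {
    if [ "$3" != "$4" ]; then
        printf '%s %s moved from %s to %s\n' \
            "$(date '+%Y-%m-%d %H:%M:%S')" "$2" "${3:-none}" "${4:-none}" >> "$1"
    fi
}

# Example polling loop (commented out because it needs a running cluster):
#   last=""
#   while sleep 5; do
#       now=$(crm_resource --resource webgroup --locate | node_from_locate)
#       record_move /var/log/failover-events.log webgroup "$last" "$now"
#       last=$now
#   done
```

The first iteration records the initial placement as a move from "none", which gives the log a known starting point.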

Compiled and published by Fengge Tutorials for learning and testing purposes only. When reposting, credit the source: http://www.fgedu.net.cn/10327.html