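This article walks through Rancher disaster recovery (DR) design and failover drills, covering backup strategy, recovery procedures, disaster drills, and failover in practice. The Fengge Tutorial (风哥教程) draws on the backup/restore and high-availability chapters of the official Rancher documentation.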
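Part01-Basic Concepts and Theory
1.1 Disaster Recovery Architecture Design
Rancher DR architectures fall into three types: active-standby, active-active, and multi-active. In active-standby, the primary node serves traffic while the standby node waits idle; in active-active, two nodes serve traffic at the same time; in multi-active, more than two nodes do. DR targets are expressed as RPO (recovery point objective, the maximum tolerable data loss measured in time) and RTO (recovery time objective, the maximum tolerable time to restore service).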
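To make RPO and RTO concrete, here is a small sketch. It assumes periodic backups, so the worst-case data loss is roughly one backup interval, and it assumes RTO can be estimated as detection time plus restore time plus verification time. The function names and that breakdown are illustrative, not taken from the Rancher documentation.

```shell
#!/usr/bin/env bash
# With periodic backups, a failure just before the next run loses
# everything written since the previous one, so the worst-case data
# loss is about one backup interval.

# meets_rpo INTERVAL_MIN RPO_MIN
# Succeeds (exit 0) when the backup interval fits within the RPO.
meets_rpo() {
    interval_min=$1
    rpo_min=$2
    [ "$interval_min" -le "$rpo_min" ]
}

# estimate_rto DETECT_MIN RESTORE_MIN VERIFY_MIN
# Prints a rough RTO estimate in minutes (assumed breakdown).
estimate_rto() {
    echo $(( $1 + $2 + $3 ))
}
```

For example, `meets_rpo 60 60` succeeds for an hourly backup against a one-hour RPO, while `meets_rpo 1440 60` fails for a daily backup against the same target.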
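1.2 Failover Mechanisms
Failover is either automatic or manual. Automatic failover floats a virtual IP (VIP) between nodes using Keepalived, often paired with HAProxy for load balancing and health checks; manual failover is done by changing DNS records or load balancer configuration. Failure detection relies on heartbeats, health checks, and monitoring alerts. The failover flow is: detect failure → decide → switch → verify.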
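The detect → decide → switch → verify flow can be sketched as a small loop. This is a hypothetical illustration, not Keepalived's actual logic: the check, switch, and verify commands are passed in by the caller, and the "N consecutive failures" rule is an assumed decision policy.

```shell
#!/usr/bin/env bash
# watch_and_failover CHECK_CMD SWITCH_CMD VERIFY_CMD THRESHOLD MAX_PROBES [INTERVAL_SEC]
# Returns 0 if a switch happened and verification passed,
#         2 if a switch happened but verification failed,
#         1 if the threshold was never reached (no failover needed).
watch_and_failover() {
    check_cmd=$1
    switch_cmd=$2
    verify_cmd=$3
    threshold=$4
    max_probes=$5
    interval=${6:-0}
    failures=0
    probe=0
    while [ "$probe" -lt "$max_probes" ]; do
        probe=$((probe + 1))
        # Detect: run the health check (left unquoted so it may carry arguments).
        if $check_cmd >/dev/null 2>&1; then
            failures=0                      # any success resets the streak
        else
            failures=$((failures + 1))
        fi
        # Decide: only act after THRESHOLD consecutive failures.
        if [ "$failures" -ge "$threshold" ]; then
            $switch_cmd                     # Switch
            if $verify_cmd >/dev/null 2>&1; then
                return 0                    # Verify passed
            fi
            return 2                        # Verify failed
        fi
        sleep "$interval"
    done
    return 1
}
```

Requiring several consecutive failures before switching is one common way to avoid failing over on a single dropped probe; the right threshold depends on the RTO target.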
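Part02-Production Planning and Recommendations
2.1 Backup Strategy Planning
Backup types: full, incremental, and differential. Backup frequency: ETCD hourly, Rancher daily, and application data according to business needs. Retention: keep backups in tiered windows of 7, 30, and 90 days. Storage: local, off-site, and cloud storage.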
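A retention window can be enforced with a small pruning helper. The sketch below assumes GNU or BSD `find` (for `-maxdepth` and `-delete`) and that backups sit directly in one directory; the function name is illustrative.

```shell
#!/usr/bin/env bash
# prune_backups DIR PATTERN KEEP_DAYS
# Deletes files in DIR matching PATTERN whose age in whole days exceeds
# KEEP_DAYS, printing each deleted path.
prune_backups() {
    dir=$1
    pattern=$2
    keep_days=$3
    # -mtime +N counts age in 24-hour periods and matches strictly more than N.
    find "$dir" -maxdepth 1 -type f -name "$pattern" \
        -mtime +"$keep_days" -print -delete
}
```

With the frequencies above, one possible mapping is `prune_backups /Rancher/fgdata/etcd-backup 'etcd-backup-*.db' 7` for the hourly ETCD snapshots and a 30-day window for the daily Rancher archives; which tier applies to which data is a choice for each environment.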
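2.2 DR Plan Design
Design principles: highly available, recoverable, and verifiable. DR tiers: cold standby (data backups only), warm standby (system-level backups), and hot standby (real-time synchronization). DR drills: run one every quarter to validate the backup and restore procedure. DR documentation: maintain detailed DR documents and runbooks.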
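One simple way to make backups verifiable is to store a checksum next to each file and re-check it before a restore or after copying to off-site storage. This is a minimal sketch assuming `sha256sum` from coreutils is available; the function names and the `.sha256` suffix are illustrative.

```shell
#!/usr/bin/env bash
# record_checksum FILE
# Writes FILE.sha256 alongside the backup, using a relative name so the
# pair can be moved to another directory together.
record_checksum() {
    ( cd "$(dirname "$1")" &&
      sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# verify_checksum FILE
# Succeeds only if FILE still matches the recorded checksum.
verify_checksum() {
    ( cd "$(dirname "$1")" &&
      sha256sum -c "$(basename "$1").sha256" ) >/dev/null 2>&1
}
```

A checksum only shows that the file has not changed since it was written; it does not prove the backup can be restored, which is what the quarterly drill is for.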
Part03-Production Implementation Plan
3.1 ETCD Backup Configuration
Configure automatic ETCD backups.
> /Rancher/fgdata/etcd-backup/etcd_backup.log 2>&1

Snapshot saved at /Rancher/fgdata/etcd-backup/etcd-backup-20260410-210000.db
Hash: 1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef
Revision: 123456
Total key: 1234
Total size: 123456789
ETCD backup complete: /Rancher/fgdata/etcd-backup/etcd-backup-20260410-210000.db

total 500M
-rw-r--r-- 1 root root 120M Apr 10 21:00 etcd-backup-20260410-210000.db
-rw-r--r-- 1 root root 120M Apr 10 20:00 etcd-backup-20260410-200000.db
-rw-r--r-- 1 root root 120M Apr 10 19:00 etcd-backup-20260410-190000.db
-rw-r--r-- 1 root root 120M Apr 10 18:00 etcd-backup-20260410-180000.db
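The backup script itself is not shown in this section, only its output. As a rough sketch of what such a job could look like, the function below wraps `etcdctl snapshot save`. The backup directory, file naming, and endpoint follow the output in this article; the certificate paths are placeholders and the function name is an assumption.

```shell
#!/usr/bin/env bash
# Sketch of an ETCD snapshot job. The certificate paths below are
# placeholders (assumptions) and must be replaced with real ones.
BACKUP_DIR=${BACKUP_DIR:-/Rancher/fgdata/etcd-backup}
ETCD_ENDPOINT=${ETCD_ENDPOINT:-https://192.168.1.100:2379}
ETCD_CACERT=${ETCD_CACERT:-/path/to/ca.pem}
ETCD_CERT=${ETCD_CERT:-/path/to/etcd-client.pem}
ETCD_KEY=${ETCD_KEY:-/path/to/etcd-client-key.pem}

# etcd_backup
# Saves a timestamped snapshot and reports success or failure.
etcd_backup() {
    snapshot="$BACKUP_DIR/etcd-backup-$(date +%Y%m%d-%H%M%S).db"
    mkdir -p "$BACKUP_DIR" || return 1
    if ETCDCTL_API=3 etcdctl \
        --endpoints="$ETCD_ENDPOINT" \
        --cacert="$ETCD_CACERT" --cert="$ETCD_CERT" --key="$ETCD_KEY" \
        snapshot save "$snapshot"; then
        echo "ETCD backup complete: $snapshot"
    else
        echo "ETCD backup failed" >&2
        return 1
    fi
}
```

Saved as a script that ends by calling `etcd_backup`, this could be scheduled from an hourly cron entry whose output is redirected to `/Rancher/fgdata/etcd-backup/etcd_backup.log`, matching the redirect shown above.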
3.2 Rancher Backup Configuration
Configure Rancher data backups.
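The opening lines of the backup script are missing from this section. As a hedged sketch of the archive step, the function below tars the Rancher data directory and checks that the archive is readable. It assumes Rancher runs as a single Docker container with `/var/lib/rancher` bind-mounted from the host, as the service status later in this article suggests; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch of the archive step for a Docker-based Rancher install.
# Assumes the data directory is bind-mounted on the host.
RANCHER_DATA=${RANCHER_DATA:-/var/lib/rancher}
BACKUP_DIR=${BACKUP_DIR:-/Rancher/fgdata/rancher-backup}

# rancher_backup
# Creates a timestamped tar.gz of RANCHER_DATA and prints its path.
rancher_backup() {
    backup_file="$BACKUP_DIR/rancher-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
    mkdir -p "$BACKUP_DIR" || return 1
    # Archive relative to the parent so paths inside start with "rancher/".
    tar -czf "$backup_file" \
        -C "$(dirname "$RANCHER_DATA")" "$(basename "$RANCHER_DATA")" || return 1
    # Listing the archive is a cheap check that it can be read back.
    tar -tzf "$backup_file" >/dev/null || return 1
    echo "$backup_file"
}
```

Archiving a live data directory can capture an inconsistent state. Rancher's documented procedure for Docker installs stops the container before archiving, so a production script would likely wrap this step in a stop and start of the container.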
$BACKUP_FILE

# Verify the backup
if [ -f "$BACKUP_FILE" ]; then
    echo "Rancher backup succeeded: $BACKUP_FILE"
    ls -lh "$BACKUP_FILE"
else
    echo "Rancher backup failed"
    exit 1
fi

# Remove backups older than 30 days
find "$BACKUP_DIR" -name "rancher-backup-*.tar.gz" -mtime +30 -delete

> /Rancher/fgdata/rancher-backup/rancher_backup.log 2>&1

Rancher backup succeeded: /Rancher/fgdata/rancher-backup/rancher-backup-20260410-210000.tar.gz
-rw-r--r-- 1 root root 500M Apr 10 21:00 /Rancher/fgdata/rancher-backup/rancher-backup-20260410-210000.tar.gz

total 2.0G
-rw-r--r-- 1 root root 500M Apr 10 21:00 rancher-backup-20260410-210000.tar.gz
-rw-r--r-- 1 root root 500M Apr 9 02:00 rancher-backup-20260409-020000.tar.gz
-rw-r--r-- 1 root root 500M Apr 8 02:00 rancher-backup-20260408-020000.tar.gz
-rw-r--r-- 1 root root 500M Apr 7 02:00 rancher-backup-20260407-020000.tar.gz
3.3 Failover Configuration
Configure Keepalived for VIP failover.
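The body of `/etc/keepalived/keepalived.conf` is not reproduced in this section. A minimal sketch of a VRRP instance is given here as a shell function that prints the configuration. The instance name `VI_1`, interface `eth0`, VIP `192.168.1.200`, and priorities 100 and 90 come from the logs in this article; `virtual_router_id 51` and the `auth_pass` value are placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
# render_keepalived_conf STATE PRIORITY INTERFACE VIP
# Prints a minimal keepalived.conf with a single VRRP instance.
# virtual_router_id and auth_pass are assumed placeholder values; both
# nodes must use the same ones.
render_keepalived_conf() {
    state=$1
    priority=$2
    iface=$3
    vip=$4
    cat <<EOF
vrrp_instance VI_1 {
    state $state
    interface $iface
    virtual_router_id 51
    priority $priority
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme
    }
    virtual_ipaddress {
        $vip
    }
}
EOF
}
```

On the primary node this might be used as `render_keepalived_conf MASTER 100 eth0 192.168.1.200 > /etc/keepalived/keepalived.conf`, and on the standby with `BACKUP 90`. The node with the higher priority holds the VIP while it is healthy.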
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirrors.aliyun.com
 * extras: mirrors.aliyun.com
 * updates: mirrors.aliyun.com
Resolving Dependencies
--> Running transaction check
---> Package keepalived.x86_64 0:2.1.5-8.el9 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package          Arch          Version             Repository          Size
================================================================================
Installing:
 keepalived       x86_64        2.1.5-8.el9         base               432 k

Transaction Summary
================================================================================
Install  1 Package

Total download size: 432 k
Installed size: 1.1 M
Downloading packages:
keepalived-2.1.5-8.el9.x86_64.rpm                          | 432 kB  00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : keepalived-2.1.5-8.el9.x86_64                               1/1
  Verifying  : keepalived-2.1.5-8.el9.x86_64                               1/1

Installed:
  keepalived.x86_64 0:2.1.5-8.el9

Complete!

/etc/keepalived/keepalived.conf <
/etc/keepalived/keepalived.conf <

Created symlink /etc/systemd/system/multi-user.target.wants/keepalived.service → /usr/lib/systemd/system/keepalived.service.

● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2026-04-10 21:00:00 CST; 5s ago
  Process: 12345 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 12346 (keepalived)
    Tasks: 3 (limit: 4915)
   Memory: 2.1M
   CGroup: /system.slice/keepalived.service
           ├─12346 /usr/sbin/keepalived -D
           ├─12347 /usr/sbin/keepalived -D
           └─12348 /usr/sbin/keepalived -D

Apr 10 21:00:00 fgedu.net.cn Keepalived[12346]: Starting VRRP child process, pid=12348
Apr 10 21:00:00 fgedu.net.cn Keepalived_vrrp[12348]: (VI_1) Entering MASTER STATE
Apr 10 21:00:00 fgedu.net.cn Keepalived_vrrp[12348]: (VI_1) setting VIPs.
Apr 10 21:00:00 fgedu.net.cn Keepalived_vrrp[12348]: (VI_1) Sending gratuitous ARPs on eth0 for 192.168.1.200
Apr 10 21:00:00 fgedu.net.cn Keepalived_vrrp[12348]: (VI_1) Sending gratuitous ARPs on eth0 for 192.168.1.200
Apr 10 21:00:00 fgedu.net.cn systemd[1]: Started Keepalived service.

2: eth0: mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.100/24 brd 192.168.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 192.168.1.200/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::211:22ff:fe33:4455/64 scope link
       valid_lft forever preferred_lft forever
Part04-Production Cases and Hands-On Walkthroughs
4.1 DR Drill in Practice
Run a DR drill to validate the backup and restore procedure.
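During a drill, two quick checks are whether the standby node now holds the VIP and whether Rancher answers through it. The helpers below are a sketch: `holds_vip` reads `ip addr` style text on standard input, and `rancher_alive` assumes Rancher exposes a `/ping` endpoint that returns `pong`. Both function names are illustrative.

```shell
#!/usr/bin/env bash
# holds_vip VIP
# Reads `ip addr` output on stdin and succeeds if VIP is assigned.
# -F treats the dots in the address literally; the trailing "/" stops
# 192.168.1.20 from matching 192.168.1.200.
holds_vip() {
    grep -qF "inet $1/"
}

# rancher_alive BASE_URL
# Succeeds if BASE_URL/ping answers "pong" within 5 seconds.
# -k skips TLS verification, which is only reasonable in a drill
# against self-signed certificates.
rancher_alive() {
    [ "$(curl -ksf --max-time 5 "$1/ping")" = "pong" ]
}
```

On the standby node after the primary is stopped, `ip -4 addr show eth0 | holds_vip 192.168.1.200 && rancher_alive https://192.168.1.200` would cover both checks in one line.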
Apr 10 21:30:00 fgedu.net.cn Keepalived_vrrp[12348]: (VI_1) Master received advert with higher priority 100, ours 90
Apr 10 21:30:00 fgedu.net.cn Keepalived_vrrp[12348]: (VI_1) Entering BACKUP STATE
Apr 10 21:30:00 fgedu.net.cn Keepalived_vrrp[12348]: (VI_1) removing VIPs.
Apr 10 21:30:01 fgedu.net.cn Keepalived_vrrp[12348]: (VI_1) Master received advert with higher priority 100, ours 90
Apr 10 21:30:02 fgedu.net.cn Keepalived_vrrp[12348]: (VI_1) Entering BACKUP STATE
Apr 10 21:30:03 fgedu.net.cn Keepalived_vrrp[12348]: (VI_1) removing VIPs.

2: eth0: mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:11:22:33:44:66 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.101/24 brd 192.168.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 192.168.1.200/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::211:22ff:fe33:4466/64 scope link
       valid_lft forever preferred_lft forever

pong
4.2 Failover in Practice
Run a failover drill.
● rancher-server.service - Rancher Server
   Loaded: loaded (/usr/lib/systemd/system/rancher-server.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2026-04-10 21:00:00 CST; 30min ago
  Process: 12345 ExecStart=/usr/local/bin/docker run --name rancher --restart=unless-stopped -d --privileged -v /var/lib/rancher:/var/lib/rancher -p 80:80 -p 443:443 rancher/rancher:latest (code=exited, status=0/SUCCESS)
 Main PID: 12346 (docker run)
    Tasks: 0 (limit: 4915)
   Memory: 500M
   CGroup: /system.slice/rancher-server.service
           └─12346 /usr/bin/docker run --name rancher --restart=unless-stopped -d --privileged -v /var/lib/rancher:/var/lib/rancher -p 80:80 -p 443:443 rancher/rancher:latest

Apr 10 21:00:00 fgedu.net.cn systemd[1]: Started Rancher Server.

NAME                                      READY   STATUS    RESTARTS   AGE
cattle-cluster-agent-abc123def456-ghi78   1/1     Running   0          30m
cattle-node-agent-jkl012mno345            1/1     Running   0          30m

● rancher-server.service - Rancher Server
   Loaded: loaded (/usr/lib/systemd/system/rancher-server.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2026-04-10 21:35:00 CST; 5s ago
  Process: 12345 ExecStart=/usr/local/bin/docker run --name rancher --restart=unless-stopped -d --privileged -v /var/lib/rancher:/var/lib/rancher -p 80:80 -p 443:443 rancher/rancher:latest (code=exited, status=0/SUCCESS)
 Main PID: 12346 (docker run)
    Tasks: 0 (limit: 4915)
   Memory: 500M
   CGroup: /system.slice/rancher-server.service
           └─12346 /usr/bin/docker run --name rancher --restart=unless-stopped -d --privileged -v /var/lib/rancher:/var/lib/rancher -p 80:80 -p 443:443 rancher/rancher:latest

Apr 10 21:35:00 fgedu.net.cn systemd[1]: Started Rancher Server.
4.3 Recovery Verification in Practice
Verify the backup and restore procedure.
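After a restore, one scripted check is that every node reports `Ready`. The helper below is a sketch that parses `kubectl get nodes` output from standard input; the function name is an assumption, and it only inspects the STATUS column.

```shell
#!/usr/bin/env bash
# all_nodes_ready
# Reads `kubectl get nodes` output on stdin. Succeeds only if there is
# at least one node row and every row has STATUS exactly "Ready".
all_nodes_ready() {
    awk 'NR > 1 { seen++; if ($2 != "Ready") bad++ }
         END    { exit !(seen > 0 && bad == 0) }'
}
```

It could be used as `kubectl get nodes | all_nodes_ready && echo "cluster nodes OK"`. A node in `Ready,SchedulingDisabled` counts as not ready here, which is a deliberately strict choice for a post-restore check.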
2026-04-10 21:40:00.123456 I | snapshot restored successfully

https://192.168.1.100:2379 is healthy: successfully committed proposal: took = 12.345678ms

NAME           STATUS   ROLES    AGE   VERSION
fgedu-node-1   Ready    <none>   30d   v1.28.5
fgedu-node-2   Ready    <none>   30d   v1.28.5
fgedu-node-3   Ready    <none>   30d   v1.28.5
Part05-Fengge's Lessons Learned and Takeaways
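5.1 Production Best Practices
1. Run backup and restore drills regularly
2. Use multiple backup storage locations
3. Configure monitoring and alerting to catch problems early
4. Maintain detailed DR documentation and runbooks
5. Use automation tools to simplify DR procedures
6. Test the failover procedure regularly
7. Set realistic RPO and RTO targets
8. Establish a DR team and an incident response process
5.2 Common Problems and Solutions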
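1. Backup fails: check disk space and validate the backup script
2. Restore fails: verify backup integrity and check the restore environment
3. Failover fails: check network connectivity and verify the VIP configuration
4. Data inconsistency: use incremental backups and verify data integrity
5. Drill affects production: use a test environment and schedule drills during off-peak hours
6. Restore takes too long: streamline the restore procedure and add resources
7. Backup storage runs out: prune old backups and expand storage capacity
8. Documentation is incomplete: update documents regularly and build a knowledge base
This article was compiled and published by Fengge Tutorial (风哥教程) for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html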
