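Summary: This Fengge tutorial draws on the official Linux, Red Hat Enterprise Linux, Ansible Automation Platform, Docker, Kubernetes, and Podman documentation and explains in detail how to configure and use the related technologies.
Fengge's note:
This document describes a disaster recovery (DR) solution for large-scale environments.
Part01-DR Architecture Design
1.1 DR Site Planning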
[root@dr-site ~]# cat > /root/dr-architecture.txt << 'EOF'
DR Architecture Design
======================
Primary data center:
- Location: Beijing
- IP range: 192.168.1.0/24
- Servers: 100
- Storage: SAN, 100 TB
DR site:
- Location: Shanghai
- IP range: 192.168.2.0/24
- Servers: 50
- Storage: SAN, 100 TB
Data synchronization:
- Mode: asynchronous replication
- Interval: 15 minutes
- Bandwidth: 1 Gbps dedicated line
Failover objectives:
- RPO (recovery point objective): 15 minutes
- RTO (recovery time objective): 4 hours
EOF
# Check network connectivity
[root@dr-site ~]# ping -c 3 192.168.1.10
PING 192.168.1.10 (192.168.1.10) 56(84) bytes of data.
64 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=10.5 ms
64 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=10.3 ms
64 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=10.4 ms
--- 192.168.1.10 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 10.300/10.400/10.500/0.100 ms
# Configure the VPN tunnel
[root@dr-site ~]# cat > /etc/ipsec.conf << 'EOF'
config setup
    charondebug="ike 2, knl 2, cfg 2"
    uniqueids=no
conn dr-tunnel
    type=transport
    left=192.168.2.10
    leftid=@dr-site
    right=192.168.1.10
    rightid=@primary-site
    ike=aes256-sha2_256-modp2048!
    esp=aes256-sha2_256!
    keyexchange=ikev2
    authby=secret
    auto=start
EOF
# Configure the pre-shared key (replace the sample value with a strong random secret)
[root@dr-site ~]# cat > /etc/ipsec.secrets << 'EOF'
@dr-site @primary-site : PSK "DRSecretKey123!"
EOF
# Restrict the key file to root
[root@dr-site ~]# chmod 600 /etc/ipsec.secrets
# Start IPsec
[root@dr-site ~]# systemctl enable --now ipsec
Created symlink /etc/systemd/system/multi-user.target.wants/ipsec.service → /usr/lib/systemd/system/ipsec.service.
# Verify IPsec status
[root@dr-site ~]# ipsec statusall
Status of IKE charon daemon (strongSwan 5.9.10, Linux 5.14.0-284.11.1.el9_2.x86_64, x86_64):
  uptime: 10 seconds, since Apr 04 16:00:00 2026
  worker threads: 11 of 16 idle, 5/0/0/0 working, job queue: 0/0/0/0, scheduled: 1
  loaded plugins: charon aesni aes des rc2 sha2 sha3 md4 md5 random nonce x509 revocation constraints pubkey pkcs1 pkcs7 pkcs8 pkcs12 pgp dnskey sshkey pem openssl fips-prf gmp curve25519 chapoly xcbc cmac hmac ctr ccm gcm curl attr kernel-netlink resolve socket-default stroke vici updown xauth-generic
Listening IP addresses:
  192.168.2.10
Connections:
  dr-tunnel: 192.168.2.10...192.168.1.10 IKEv2
  dr-tunnel: local: [dr-site] uses pre-shared key authentication
  dr-tunnel: remote: [primary-site] uses pre-shared key authentication
Security Associations (1 up, 0 connecting):
  dr-tunnel[1]: ESTABLISHED 10 seconds ago, 192.168.2.10[dr-site]...192.168.1.10[primary-site]
  dr-tunnel[1]: IKEv2 SPIs: 1234567890abcdef_i 1234567890abcdef_r*, pre-shared key reauthentication in 2 hours
  dr-tunnel[1]: IKE proposal: AES_CBC_256/HMAC_SHA2_256_128/PRF_HMAC_SHA2_256/MODP_2048
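The numbers in the design file (1 Gbps link, 15-minute interval, 100 TB of storage) can be sanity-checked with a little arithmetic. The sketch below is only an estimate: the function names `max_transfer_mb` and `full_sync_hours` are illustrative and not part of the original setup, and the calculation assumes the whole link is available and ignores protocol overhead.

```shell
#!/bin/bash
# Illustrative helpers (names are assumptions, not from the original article).
# Integer arithmetic, decimal units: 1 GB = 1000 MB, 1 TB = 1000000 MB.

# max_transfer_mb <bandwidth_mbps> <interval_minutes>
# Upper bound on the data (in MB) that fits through the link in one interval.
max_transfer_mb() {
    local mbps=$1 minutes=$2
    echo $(( mbps * minutes * 60 / 8 ))
}

# full_sync_hours <volume_tb> <bandwidth_mbps>
# Whole hours needed to copy the full volume once at line rate.
full_sync_hours() {
    local tb=$1 mbps=$2
    echo $(( tb * 1000000 * 8 / mbps / 3600 ))
}

max_transfer_mb 1000 15     # 1 Gbps link, 15-minute interval
full_sync_hours 100 1000    # 100 TB over a 1 Gbps link
```

With the values from the design, the first call prints 112500 (about 112 GB per 15-minute window) and the second prints 222 (roughly nine days for an initial full copy). The change rate on the primary site has to stay well below the first figure for a 15-minute RPO to hold.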
1.2 Data Synchronization Configuration
[root@dr-site ~]# dnf install -y drbd-utils
Updating Subscription Management repositories.
Last metadata expiration check: 0:05:23 ago on Fri Apr 4 16:00:00 2026.
Dependencies resolved.
================================================================================
Package Architecture Version Repository Size
================================================================================
Installing:
drbd-utils x86_64 9.21.0-1.el9 elrepo 500 k
Transaction Summary
================================================================================
Install 1 Package
Total download size: 500 k
Installed size: 1.5 M
Downloading Packages:
drbd-utils-9.21.0-1.el9.x86_64.rpm 1.0 MB/s | 500 kB 00:00
--------------------------------------------------------------------------------
Total 1.0 MB/s | 500 kB 00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : drbd-utils-9.21.0-1.el9.x86_64 1/1
Running scriptlet: drbd-utils-9.21.0-1.el9.x86_64 1/1
Verifying : drbd-utils-9.21.0-1.el9.x86_64 1/1
Installed:
drbd-utils-9.21.0-1.el9.x86_64
Complete!
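# Configure the DRBD resource
[root@dr-site ~]# cat > /etc/drbd.d/dr0.res << 'EOF'
resource dr0 {
    meta-disk internal;
    device /dev/drbd0;
    disk /dev/sdb1;
    net {
        # Protocol A is asynchronous, matching the replication mode chosen in 1.1;
        # protocol C would make every write wait for the remote site.
        protocol A;
        cram-hmac-alg sha1;
        shared-secret "DRBDSecret123!";
    }
    on primary-site {
        address 192.168.1.10:7789;
    }
    on dr-site {
        address 192.168.2.10:7789;
    }
}
EOF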
# Initialize the DRBD metadata
[root@dr-site ~]# drbdadm create-md dr0
initializing activity log
initializing bitmap (320 KB) to all zero
Writing meta data...
New drbd meta data block successfully created.
success
# Start DRBD
[root@dr-site ~]# systemctl enable --now drbd
Created symlink /etc/systemd/system/multi-user.target.wants/drbd.service → /usr/lib/systemd/system/drbd.service.
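# Check DRBD status (the DR node is the sync target while its disk is still Inconsistent)
[root@dr-site ~]# drbdadm status dr0
dr0 role:Secondary
disk:Inconsistent
peer role:Primary
replication:SyncTarget peer-disk:UpToDate done:45.67
# Wait for synchronization to complete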
[root@dr-site ~]# drbdadm status dr0
dr0 role:Secondary
disk:UpToDate
peer role:Primary
replication:Established peer-disk:UpToDate
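Rather than re-running the status command by hand, the wait can be scripted. The following is a minimal sketch, assuming the `drbdadm status` output has the `disk:` and `peer-disk:` fields shown above; `drbd_synced` and `wait_for_sync` are hypothetical helper names, not DRBD commands.

```shell
#!/bin/bash
# drbd_synced: reads `drbdadm status <resource>` text on stdin and succeeds
# only when the local disk and every peer disk report UpToDate.
drbd_synced() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^(disk|peer-disk):/) {
                    seen = 1
                    split($i, kv, ":")
                    if (kv[2] != "UpToDate") bad = 1
                }
            }
        }
        END { exit ((seen && !bad) ? 0 : 1) }
    '
}

# wait_for_sync <resource> [poll_seconds]
# Polls the resource until both sides are UpToDate.
wait_for_sync() {
    local res=$1 interval=${2:-10}
    until drbdadm status "$res" | drbd_synced; do
        sleep "$interval"
    done
    echo "$res is UpToDate on both nodes"
}
```

A call such as `wait_for_sync dr0 30` would then block until the initial synchronization has finished, which is convenient before moving on to the database steps.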
# Configure database replication
[root@dr-site ~]# cat > /etc/my.cnf.d/replication.cnf << 'EOF'
[mysqld]
server-id = 2
log-bin = mysql-bin
relay-log = relay-bin
read-only = 1
binlog_format = ROW
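# The next two options are MySQL GTID settings; MariaDB uses its own GTID variables and rejects them
gtid_mode = ON
enforce_gtid_consistency = ON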
EOF
# Point the replica at the primary and start replication
[root@dr-site ~]# mysql -e "CHANGE MASTER TO
MASTER_HOST='192.168.1.10',
MASTER_USER='repl',
MASTER_PASSWORD='ReplPass123!',
MASTER_AUTO_POSITION=1;"
[root@dr-site ~]# mysql -e "START SLAVE;"
# Verify replication status
[root@dr-site ~]# mysql -e "SHOW SLAVE STATUS\G"
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.1.10
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 12345
Relay_Log_File: relay-bin.000002
Relay_Log_Pos: 5678
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 12345
Relay_Log_Space: 6789
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_UUID: 12345678-90ab-cdef-1234-567890abcdef
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 12345678-90ab-cdef-1234-567890abcdef:1-100
Executed_Gtid_Set: 12345678-90ab-cdef-1234-567890abcdef:1-100
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
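The fields that matter most in this output are Slave_IO_Running, Slave_SQL_Running, and Seconds_Behind_Master. As a rough sketch, they can be checked automatically against the 15-minute RPO. The helper name `replication_healthy` and the 900-second default are assumptions made for illustration.

```shell
#!/bin/bash
# replication_healthy [max_lag_seconds]
# Reads `SHOW SLAVE STATUS\G` output on stdin. Succeeds when both replication
# threads are running and the reported lag is within the limit
# (default 900 s, i.e. the 15-minute RPO).
replication_healthy() {
    local max_lag=${1:-900}
    awk -v max="$max_lag" '
        $1 == "Slave_IO_Running:"      { io  = $2 }
        $1 == "Slave_SQL_Running:"     { sql = $2 }
        $1 == "Seconds_Behind_Master:" { lag = $2 }
        END {
            if (io != "Yes" || sql != "Yes") exit 1
            if (lag == "" || lag == "NULL")  exit 1
            exit ((lag + 0 <= max + 0) ? 0 : 1)
        }
    '
}
```

It could be wired into monitoring along the lines of `mysql -e "SHOW SLAVE STATUS\G" | replication_healthy 900 || echo "replication is broken or lagging beyond the RPO"`. Seconds_Behind_Master is NULL when a replication thread is stopped, so that case is treated as unhealthy.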
Part02-Failover Procedures
2.1 Automatic Failure Detection
[root@dr-site ~]# cat > /usr/local/bin/dr-failover-check.sh << 'EOF'
#!/bin/bash
PRIMARY_SITE="192.168.1.10"
DR_SITE="192.168.2.10"
LOG_FILE="/var/log/dr-failover.log"
# Append a timestamped message to the log file
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}
# Succeeds if the primary site answers ping
check_primary() {
    ping -c 3 "$PRIMARY_SITE" > /dev/null 2>&1
}
# Succeeds if nginx and mariadb are both active on the primary
# (available for service-level checks; the loop below only calls check_primary)
check_services() {
    ssh "root@$PRIMARY_SITE" "systemctl is-active nginx mariadb" > /dev/null 2>&1
}
activate_dr() {
    log_message "Activating DR site..."
    # Promote DRBD to primary
    drbdadm primary dr0
    mount /dev/drbd0 /data
    # Promote the database to primary
    mysql -e "STOP SLAVE;"
    mysql -e "SET GLOBAL read_only = OFF;"
    # Start services
    systemctl start nginx mariadb
    # Update DNS (the heredoc body and its terminator must start at column 0)
    nsupdate << DNS
server dns.fgedu.net.cn
zone fgedu.net.cn
update delete www.fgedu.net.cn A
update add www.fgedu.net.cn 300 A $DR_SITE
send
DNS
    log_message "DR site activated successfully"
}
# Main detection loop: fail over only if the primary is still down after 60 s
while true; do
    if ! check_primary; then
        log_message "Primary site unreachable"
        sleep 60
        if ! check_primary; then
            log_message "Primary site still unreachable after 60s"
            activate_dr
            break
        fi
    fi
    sleep 30
done
EOF
[root@dr-site ~]# chmod +x /usr/local/bin/dr-failover-check.sh
# Create the systemd service
[root@dr-site ~]# cat > /etc/systemd/system/dr-failover.service << 'EOF'
[Unit]
Description=DR Failover Check Service
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/dr-failover-check.sh
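# on-failure rather than always: the script exits 0 after a failover, and
# restarting it would run the activation again while the primary is still down
Restart=on-failure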
[Install]
WantedBy=multi-user.target
EOF
[root@dr-site ~]# systemctl daemon-reload
[root@dr-site ~]# systemctl enable --now dr-failover
Created symlink /etc/systemd/system/multi-user.target.wants/dr-failover.service → /etc/systemd/system/dr-failover.service.
# Verify the service status
[root@dr-site ~]# systemctl status dr-failover
● dr-failover.service - DR Failover Check Service
Loaded: loaded (/etc/systemd/system/dr-failover.service; enabled; preset: disabled)
Active: active (running) since Fri 2026-04-04 16:10:00 CST; 10s ago
Main PID: 12345 (dr-failover-c)
Tasks: 1 (limit: 11232)
Memory: 5.0M
CGroup: /system.slice/dr-failover.service
└─12345 /bin/bash /usr/local/bin/dr-failover-check.sh
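The detection loop in the script declares the primary site down after two failed ping rounds 60 seconds apart. That idea generalizes to "N consecutive failed probes". Below is a minimal sketch of such a check. `confirm_down` is a hypothetical helper, and the probe is passed in as a command so that ping, an SSH service check, or anything else can be used.

```shell
#!/bin/bash
# confirm_down <attempts> <delay_seconds> <probe command...>
# Runs the probe up to <attempts> times, sleeping <delay_seconds> between
# tries. Returns 0 (confirmed down) only if every attempt fails, and returns 1
# as soon as one attempt succeeds.
confirm_down() {
    local attempts=$1 delay=$2
    shift 2
    local i
    for (( i = 1; i <= attempts; i++ )); do
        if "$@" > /dev/null 2>&1; then
            return 1    # the target answered, so it is not down
        fi
        if (( i < attempts )); then
            sleep "$delay"
        fi
    done
    return 0
}
```

In the failover script this could replace the nested `if` blocks, for example `confirm_down 3 30 ping -c 3 "$PRIMARY_SITE" && activate_dr`, which tolerates short network blips better than a single retry.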
2.2 Manual Failover Drill
[root@primary-site ~]# systemctl stop nginx mariadb
# Run the failover on the DR site
[root@dr-site ~]# /usr/local/bin/dr-manual-failover.sh
Starting manual failover to DR site...
Step 1: Checking primary site status...
Primary site is down or unreachable.
Step 2: Promoting DRBD to primary...
drbdadm primary dr0
mount /dev/drbd0 /data
Step 3: Promoting database to primary...
mysql -e "STOP SLAVE;"
mysql -e "SET GLOBAL read_only = OFF;"
Step 4: Starting services...
systemctl start nginx mariadb
Step 5: Updating DNS records...
Updating DNS record for www.fgedu.net.cn to 192.168.2.10
Step 6: Verifying services...
nginx service: active
mariadb service: active
Manual failover completed successfully!
# Verify service status
[root@dr-site ~]# systemctl status nginx mariadb
● nginx.service - The nginx HTTP and reverse proxy server
Loaded: loaded (/usr/lib/systemd/system/nginx.service; enabled; preset: disabled)
Active: active (running) since Fri 2026-04-04 16:15:00 CST; 5min ago
● mariadb.service - MariaDB 10.5 database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; preset: disabled)
Active: active (running) since Fri 2026-04-04 16:15:00 CST; 5min ago
# Verify data integrity
[root@dr-site ~]# mysql -e "SELECT COUNT(*) FROM appdb.users;"
+----------+
| COUNT(*) |
+----------+
|     1000 |
+----------+
# Verify the web service
[root@dr-site ~]# curl -I http://localhost
HTTP/1.1 200 OK
Server: nginx/1.20.1
Date: Fri, 04 Apr 2026 16:20:00 GMT
Content-Type: text/html
Connection: keep-alive
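The article runs /usr/local/bin/dr-manual-failover.sh without listing its contents. The following is a hypothetical sketch of what such a script might look like, reconstructed only from the step output above. The `DRY_RUN` switch and the helper names `run` and `primary_is_down` are assumptions added for safe rehearsal, and the DNS update step is left out for brevity.

```shell
#!/bin/bash
# Hypothetical sketch of a manual failover script; not the original.
PRIMARY_SITE="${PRIMARY_SITE:-192.168.1.10}"

# Print a command, then execute it unless DRY_RUN=1 is set.
run() {
    echo "  $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# In dry-run mode the primary is assumed to be down so the plan can be printed.
primary_is_down() {
    [ "${DRY_RUN:-0}" = "1" ] && return 0
    ! ping -c 3 -W 2 "$PRIMARY_SITE" > /dev/null 2>&1
}

manual_failover() {
    echo "Starting manual failover to DR site..."
    echo "Step 1: Checking primary site status..."
    if ! primary_is_down; then
        echo "Primary site is still reachable; aborting." >&2
        return 1
    fi
    echo "Primary site is down or unreachable."
    echo "Step 2: Promoting DRBD to primary..."
    run drbdadm primary dr0 || return 1
    run mount /dev/drbd0 /data || return 1
    echo "Step 3: Promoting database to primary..."
    run mysql -e "STOP SLAVE;" || return 1
    run mysql -e "SET GLOBAL read_only = OFF;" || return 1
    echo "Step 4: Starting services..."
    run systemctl start nginx mariadb || return 1
    echo "Step 5: Verifying services..."
    run systemctl is-active nginx mariadb || return 1
    echo "Manual failover completed successfully!"
}
```

Saved as a script with `manual_failover` called on the last line, running it with `DRY_RUN=1` prints the planned commands without touching DRBD, the database, or systemd, which is handy for a tabletop drill. Unlike the automatic script, each step here stops the run on failure instead of continuing.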
Part03-DR Drill Plan
3.1 Creating the Drill Plan
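[root@dr-site ~]# cat > /root/dr-drill-plan.txt << 'EOF'
DR Drill Plan
=============
Objectives:
1. Validate the DR failover procedure
2. Test data integrity
3. Measure RTO and RPO
4. Find and fix problems
Drill types:
1. Tabletop drill (quarterly)
   - Simulate failure scenarios
   - Walk through the failover procedure
   - No actual failover
2. Partial drill (every six months)
   - Select non-critical services
   - Perform an actual failover test
   - Validate the recovery procedure
3. Full drill (annually)
   - Fail over all services
   - Validate the complete procedure
   - Test performance and capacity
Drill steps:
1. Preparation
   - Notify everyone involved
   - Prepare the drill environment
   - Back up the current configuration
2. Execution
   - Simulate the failure scenario
   - Run the failover procedure
   - Verify service status
3. Review
   - Record the drill results
   - Analyze problems and improvements
   - Update the documentation
Key metrics:
- RTO target: 4 hours
- RPO target: 15 minutes
- Data integrity: 100%
- Service availability: 99.99%
EOF
# Create the drill checklist
[root@dr-site ~]# cat > /root/dr-drill-checklist.txt << 'EOF'
DR Drill Checklist
==================
Before the drill:
[ ] Notify everyone involved
[ ] Confirm the drill time window
[ ] Back up the current configuration
[ ] Confirm data synchronization status
[ ] Prepare a rollback plan
During failover:
[ ] Primary site services stopped
[ ] Data synchronization complete
[ ] DRBD promoted to primary
[ ] Database promoted to primary
[ ] Services started successfully
[ ] DNS update complete
Service verification:
[ ] Web service reachable
[ ] Database connections working
[ ] Application functions working
[ ] Performance metrics normal
[ ] Monitoring and alerting normal
Rollback:
[ ] Primary site restored
[ ] Data synchronization restored
[ ] Services switched back to the primary site
[ ] DNS records restored
[ ] Monitoring restored
After the drill:
[ ] Record the drill results
[ ] Analyze problems and improvements
[ ] Update the documentation
[ ] Send the drill report
EOF
# Create the drill report template
[root@dr-site ~]# cat > /root/dr-drill-report-template.txt << 'EOF'
DR Drill Report
===============
Drill information:
- Date: YYYY-MM-DD
- Type: tabletop / partial / full
- Duration: XX hours XX minutes
- Participants: XXX
Results:
- Actual RTO: XX hours XX minutes (target: 4 hours)
- Actual RPO: XX minutes (target: 15 minutes)
- Data integrity: XX%
- Service availability: XX%
Issues:
1. Issue description
   - Time found:
   - Impact:
   - Resolution:
Improvements:
1. Improvement item
   - Owner:
   - Due date:
Conclusion:
Drill succeeded/failed; areas that need improvement.
EOF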
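To fill in the "Actual RTO" and "Actual RPO" lines of the report, the measured values can be compared with the targets mechanically. The sketch below is one possible way to do it. `elapsed_minutes` and `drill_result` are hypothetical helpers, and `elapsed_minutes` relies on GNU `date -d`, which is what RHEL ships.

```shell
#!/bin/bash
# elapsed_minutes <start> <end>
# Whole minutes between two timestamps in any format GNU date understands.
elapsed_minutes() {
    local start end
    start=$(date -d "$1" +%s) || return 1
    end=$(date -d "$2" +%s) || return 1
    echo $(( (end - start) / 60 ))
}

# drill_result <label> <actual_minutes> <target_minutes>
# Prints one report line and returns non-zero when the target was missed.
drill_result() {
    local label=$1 actual=$2 target=$3
    local verdict="PASS"
    if (( actual > target )); then
        verdict="FAIL"
    fi
    printf '%s actual: %dh%02dm (target: %dh%02dm) %s\n' \
        "$label" $(( actual / 60 )) $(( actual % 60 )) \
        $(( target / 60 )) $(( target % 60 )) "$verdict"
    [ "$verdict" = "PASS" ]
}
```

For example, if the outage began at 16:00 and services were verified at 18:35, `drill_result RTO "$(elapsed_minutes '2026-04-04 16:00:00' '2026-04-04 18:35:00')" 240` prints `RTO actual: 2h35m (target: 4h00m) PASS`. The RPO line works the same way with a 15-minute target.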
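- Build a sound DR architecture
- Configure real-time data synchronization
- Define a detailed failover procedure
- Run DR drills regularly
- Keep improving the DR plan
This article was compiled and published by Fengge Tutorials for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html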
