KubeSphere-044: Disaster Recovery Drill and Cluster Fault Recovery Practice
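1. Basic Concepts
1.1 Disaster Recovery Overview
Disaster recovery (DR) is the ability to restore systems and data quickly after a catastrophic event. KubeSphere supports several disaster recovery approaches:
- Backup and restore: back up and restore cluster resources with tools such as Velero
- Multi-cluster deployment: achieve high availability by running multiple clusters
- Data synchronization: keep redundant copies of data through synchronization
- Failover: recover quickly by switching over to a standby
1.2 Cluster Failure Types
A Kubernetes cluster can experience several types of failure: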
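| Failure type | Description | Scope of impact |
|---|---|---|
| Node failure | A node is down or unavailable | Some Pods |
| Network failure | Network partition or network outage | Cluster communication |
| Storage failure | Storage system failure or data corruption | Persistent data |
| Control plane failure | Failure of the API Server, controllers, or other components | Cluster management |
| etcd failure | etcd data is corrupted or lost | Cluster configuration |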
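1.3 Disaster Recovery Strategies
Common disaster recovery objectives and strategies include:
- RPO (Recovery Point Objective): the amount of data, expressed as a time window, that may be lost
- RTO (Recovery Time Objective): the time allowed to restore service
- Cold standby: periodic backups only; longest recovery time
- Warm standby: periodic backups with part of the system online; moderate recovery time
- Hot standby: real-time synchronization with the system online; shortest recovery time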
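2. Production Environment Planning
2.1 Disaster Recovery Planning
2.1.1 Defining RPO and RTO
# RPO: 15 minutes (up to 15 minutes of data may be lost)
# RTO: 1 hour (service is restored within 1 hour)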
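2.1.2 Backup Strategy
# - Incremental backup: every 15 minutes (an hourly interval cannot meet a 15-minute RPO)
# - Full backup: daily
# - Backup retention: 30 days
# - Backup storage: off-site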
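A backup interval only satisfies an RPO if it is no longer than the RPO window, because a failure just before the next backup loses up to one full interval of data. The sketch below checks that relationship; the function name `meets_rpo` is hypothetical and the numbers are illustrative.

```shell
#!/usr/bin/env bash
# meets_rpo INTERVAL_MIN RPO_MIN
# Worst case, the failure happens just before the next backup runs,
# so up to one full backup interval of data can be lost.
meets_rpo() {
  local interval_min=$1 rpo_min=$2
  if [ "$interval_min" -le "$rpo_min" ]; then
    echo "OK: worst-case loss ${interval_min}m <= RPO ${rpo_min}m"
  else
    echo "VIOLATION: worst-case loss ${interval_min}m > RPO ${rpo_min}m"
    return 1
  fi
}

meets_rpo 60 15 || true   # hourly backups against a 15-minute RPO
meets_rpo 15 15           # 15-minute backups against a 15-minute RPO
```

The first call reports a violation and the second passes, which is why the backup interval has to be chosen from the RPO rather than the other way round.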
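2.2 Cluster Architecture Planning
2.2.1 High-Availability Architecture
# - Control plane: 3 nodes
# - Worker nodes: at least 3 nodes
# - etcd: 3 nodes
# - Storage: distributed storage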
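The choice of 3 etcd members follows from quorum arithmetic: a cluster of n members needs floor(n/2)+1 of them to agree, so it tolerates floor((n-1)/2) failures. A small sketch (the function names are hypothetical):

```shell
#!/usr/bin/env bash
# Number of members that must be healthy for etcd to accept writes.
etcd_quorum() { echo $(( $1 / 2 + 1 )); }
# Number of members that can fail while quorum is still held.
etcd_fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 3 4 5; do
  echo "members=$n quorum=$(etcd_quorum "$n") tolerates=$(etcd_fault_tolerance "$n")"
done
```

Three members tolerate one failure. A fourth member raises the quorum to three without raising the tolerance, which is why odd member counts are the usual choice.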
2.2.2 Multi-Cluster Architecture
# - Primary cluster
# - Secondary cluster
# - Data synchronization: real-time
# - Failover: automatic switchover
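2.3 Storage Planning
2.3.1 Storage Types
# - Local storage: high performance, but not suitable for multi-node use
# - Network storage: suitable for multi-node use, with lower performance
# - Distributed storage: high performance and suitable for multi-node use
2.3.2 Storage Redundancy
# - Replica count: 3
# - Failure domains: across nodes, racks, and data centers
# - Data verification: periodic integrity checks
3. Implementation Steps
3.1 Deploying Velero
3.1.1 Installing the Velero CLI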
# Download the Velero release archive
wget https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
--2026-01-15 10:00:00-- https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
Resolving github.com... 140.82.112.4
Connecting to github.com|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12345678 (12M) [application/octet-stream]
Saving to: ‘velero-v1.12.0-linux-amd64.tar.gz’
# Extract the archive
tar -xzf velero-v1.12.0-linux-amd64.tar.gz
# Install the Velero CLI
sudo mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/
# Verify the installation
velero version --client-only
Client:
Version: v1.12.0
Git commit: abc123def456
Git tree state: clean
3.1.2 Configuring Object Storage
kubectl create namespace velero
namespace/velero created
# Create the MinIO Secret
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: minio-credentials
  namespace: velero
type: Opaque
data:
  accessKeyId: YWRtaW4= # admin
  secretAccessKey: YWRtaW4xMjM= # admin123
EOF
secret/minio-credentials created
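# Install Velero
# Note: --secret-file takes a local file in AWS credentials format
# ([default], aws_access_key_id, aws_secret_access_key), not the name of a Kubernetes Secret
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.8.0 \
--bucket velero-backups \
--secret-file ./minio-credentials \
--use-volume-snapshots=false \
--backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.velero.svc.cluster.local:9000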
CustomResourceDefinition/backups.velero.io: created
CustomResourceDefinition/backupstoragelocations.velero.io: created
CustomResourceDefinition/deletebackuprequests.velero.io: created
CustomResourceDefinition/downloadrequests.velero.io: created
CustomResourceDefinition/podvolumebackups.velero.io: created
CustomResourceDefinition/podvolumerestores.velero.io: created
CustomResourceDefinition/restores.velero.io: created
CustomResourceDefinition/schedules.velero.io: created
CustomResourceDefinition/serverstatusrequests.velero.io: created
CustomResourceDefinition/volumesnapshotlocations.velero.io: created
Namespace/velero: created
ServiceAccount/velero: created
ClusterRole/velero: created
ClusterRoleBinding/velero: created
Secret/cloud-credentials: created
BackupStorageLocation/default: created
Deployment/velero: created
Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the logs.
# Verify the installation
kubectl get pods -n velero
NAME READY STATUS RESTARTS AGE
velero-7d6f8b9c5d-abc123 1/1 Running 0 1m
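The file passed to `velero install --secret-file` is read from the local filesystem and, for the AWS plugin, uses the AWS shared-credentials layout. The sketch below writes such a file; the helper name `write_velero_credentials` is hypothetical, and the `admin` / `admin123` values are the sample MinIO credentials used in this section.

```shell
#!/usr/bin/env bash
# write_velero_credentials ACCESS_KEY SECRET_KEY OUTPUT_FILE
# Writes an AWS-style credentials file for use with --secret-file.
write_velero_credentials() {
  local access_key=$1 secret_key=$2 out=$3
  # Restrict permissions to the owner because the file holds secrets.
  ( umask 077
    printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
      "$access_key" "$secret_key" > "$out" )
}

# Demo: write to a temporary file and show the result.
creds_file=$(mktemp)
write_velero_credentials admin admin123 "$creds_file"
cat "$creds_file"
rm -f "$creds_file"
```

To match the install command in this section, the output path would be `./minio-credentials`.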
3.2 Configuring Backups
3.2.1 Creating a Backup
velero backup create myapp-backup --include-namespaces myapp
Backup request "myapp-backup" submitted successfully.
Run `velero backup describe myapp-backup` or `velero backup logs myapp-backup` for more details.
# Check the backup status
velero backup get
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
myapp-backup Completed 0 0 2026-01-15 10:10:00 +0000 UTC 29d default
# View the backup details
velero backup describe myapp-backup
Name: myapp-backup
Namespace: velero
Labels: velero.io/storage-location=default
Annotations: velero.io/source-cluster-k8s-gitversion=v1.26.5
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=26
Phase: Completed
Namespaces:
Included: myapp
Excluded:
Resources:
Included: *
Excluded: nodes, events, events.events.k8s.io, pods.metrics.k8s.io
Cluster-scoped: auto
Label selector:
Storage location: default
Velero-Native Snapshot PVs: auto
TTL: 720h0m0s
Hook attempts: 3
Hook timeout: 0s
CSISnapshotTimeout: 0s
Backup Format Version: 1.1.0
Started: 2026-01-15 10:10:00 +0000 UTC
Completed: 2026-01-15 10:10:30 +0000 UTC
Expiration: 2026-02-14 10:10:00 +0000 UTC
Total items to be backed up: 15
Items backed up: 15
Velero-Native Snapshots:
CSISnapshots:
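Checking the STATUS column by eye does not scale to many backups. As a minimal sketch, the filter below reads the table printed by `velero backup get` on standard input and lists every backup whose status is not `Completed`. The function name `unhealthy_backups` and the `other-backup` row are hypothetical, and the filter assumes the column layout shown above.

```shell
#!/usr/bin/env bash
# unhealthy_backups: reads a `velero backup get` table on stdin and prints the
# NAME of every backup whose STATUS column is not "Completed".
# Backups that are still InProgress are listed too.
unhealthy_backups() {
  awk 'NR > 1 && $2 != "Completed" { print $1 }'
}

# Demo with sample input; on a live cluster: velero backup get | unhealthy_backups
unhealthy_backups <<'EOF'
NAME           STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
myapp-backup   Completed         0        0          2026-01-15 10:10:00 +0000 UTC   29d       default            <none>
other-backup   PartiallyFailed   2        0          2026-01-15 11:10:00 +0000 UTC   29d       default            <none>
EOF
```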
3.2.2 Creating a Scheduled Backup
velero schedule create myapp-daily-backup --schedule="0 2 * * *" --include-namespaces myapp
Schedule "myapp-daily-backup" created successfully.
# List the backup schedules
velero schedule get
NAME STATUS CREATED SCHEDULE BACKUP TTL LAST BACKUP SELECTOR
myapp-daily-backup Enabled 2026-01-15 10:15:00 +0000 UTC 0 2 * * * 720h0m0s 1m ago
# View the schedule details
velero schedule describe myapp-daily-backup
Name: myapp-daily-backup
Namespace: velero
Labels: velero.io/storage-location=default
Annotations: velero.io/source-cluster-k8s-gitversion=v1.26.5
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=26
Phase: Enabled
Paused: false
Namespaces:
Included: myapp
Excluded:
Resources:
Included: *
Excluded: nodes, events, events.events.k8s.io, pods.metrics.k8s.io
Cluster-scoped: auto
Label selector:
Storage location: default
Velero-Native Snapshot PVs: auto
TTL: 720h0m0s
Hook attempts: 3
Hook timeout: 0s
CSISnapshotTimeout: 0s
Schedule: 0 2 * * *
Backup Format Version: 1.1.0
Created: 2026-01-15 10:15:00 +0000 UTC
Last Backup: 2026-01-15 10:16:00 +0000 UTC (1m ago)
Next Backup: 2026-01-16 02:00:00 +0000 UTC
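The `720h0m0s` TTL in the output is Velero's default of 30 days. When a different retention period is needed, the TTL can be set explicitly with `--ttl`. The sketch below converts a retention period in days to that duration format and prints the resulting command without running it; `days_to_ttl` and `schedule_cmd` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# days_to_ttl DAYS: converts a retention period to an hours-based duration.
days_to_ttl() { echo "$(( $1 * 24 ))h0m0s"; }

# schedule_cmd NAME NAMESPACE CRON RETENTION_DAYS
# Prints (does not execute) a `velero schedule create` command line.
schedule_cmd() {
  printf 'velero schedule create %s --schedule="%s" --include-namespaces %s --ttl %s\n' \
    "$1" "$3" "$2" "$(days_to_ttl "$4")"
}

schedule_cmd myapp-daily-backup myapp "0 2 * * *" 30
```

Printing the command first makes it easy to review the cron expression and TTL before creating the schedule.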
3.3 Restoring from a Backup
3.3.1 Running a Restore
# Simulate data loss by deleting the namespace
kubectl delete namespace myapp
namespace "myapp" deleted
# Confirm the deletion
kubectl get namespace myapp
Error from server (NotFound): namespaces "myapp" not found
# Run the restore
velero restore create --from-backup myapp-backup
Restore request "myapp-backup-20260115102000" submitted successfully.
Run `velero restore describe myapp-backup-20260115102000` or `velero restore logs myapp-backup-20260115102000` for more details.
# Check the restore status
velero restore get
NAME BACKUP STATUS STARTED COMPLETED ERRORS WARNINGS CREATED SELECTOR
myapp-backup-20260115102000 myapp-backup Completed 2026-01-15 10:20:00 +0000 UTC 2026-01-15 10:20:30 +0000 UTC 0 0 2026-01-15 10:20:00 +0000 UTC
# Verify the restore
kubectl get namespace myapp
NAME STATUS AGE
myapp Active 30s
# List the Pods
kubectl get pods -n myapp
NAME READY STATUS RESTARTS AGE
myapp-7d6f8b9c5d-abc123 1/1 Running 0 30s
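A running Pod alone does not prove that everything came back. One way to verify a restore more thoroughly is to save a resource inventory before the failure (for example with `kubectl get deployment,service,configmap,secret -n myapp -o name`) and compare it with the inventory after the restore. The sketch below does the comparison on two plain text files; the function name `missing_after_restore` and the sample resource names are hypothetical.

```shell
#!/usr/bin/env bash
# missing_after_restore BEFORE_FILE AFTER_FILE
# Each file holds one resource name per line. Prints every line that is in
# BEFORE_FILE but not in AFTER_FILE, i.e. resources the restore did not bring back.
missing_after_restore() {
  grep -Fxv -f "$2" "$1" || true
}

# Demo with sample inventories.
before=$(mktemp); after=$(mktemp)
printf '%s\n' deployment.apps/myapp service/myapp configmap/myapp-config > "$before"
printf '%s\n' deployment.apps/myapp service/myapp > "$after"
missing_after_restore "$before" "$after"
rm -f "$before" "$after"
```

An empty result means every resource in the earlier inventory exists again.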
3.3.2 Viewing the Restore Details
velero restore describe myapp-backup-20260115102000
Name: myapp-backup-20260115102000
Namespace: velero
Labels:
Annotations:
Phase: Completed
Backup: myapp-backup
Namespaces:
Included: *
Excluded:
Resources:
Included: *
Excluded: nodes, events, events.events.k8s.io, pods.metrics.k8s.io
Cluster-scoped: auto
Label selector:
Restore PVs: auto
Excluded resources:
nodes, events, events.events.k8s.io, pods.metrics.k8s.io
Hook attempts: 3
Hook timeout: 0s
CSISnapshotTimeout: 0s
Preserve service nodeports: auto
Backup Format Version: 1.1.0
Started: 2026-01-15 10:20:00 +0000 UTC
Completed: 2026-01-15 10:20:30 +0000 UTC
Total items to be restored: 15
Items restored: 15
Restored:
+ Namespace/myapp
+ Deployment/myapp
+ Service/myapp
+ Pod/myapp-7d6f8b9c5d-abc123
…
3.4 Cluster Failure Recovery
3.4.1 Recovering from a Node Failure
kubectl get nodes
NAME STATUS ROLES AGE VERSION
node-1 Ready control-plane 30d v1.26.5
node-2 Ready
node-3 NotReady
# View the node details
kubectl describe node node-3
Name: node-3
Roles:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=node-3
kubernetes.io/os=linux
Annotations: flannel.alpha.coreos.com/backend-data: {"VtepMAC":"02:42:ac:11:00:03"}
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 192.168.1.103
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 15 Dec 2025 10:00:00 +0000
Taints: node.kubernetes.io/not-ready
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
MemoryPressure False Thu, 15 Jan 2026 10:30:00 +0000 Thu, 15 Dec 2025 10:00:00 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 15 Jan 2026 10:30:00 +0000 Thu, 15 Dec 2025 10:00:00 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 15 Jan 2026 10:30:00 +0000 Thu, 15 Dec 2025 10:00:00 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Thu, 15 Jan 2026 10:30:00 +0000 Thu, 15 Jan 2026 10:30:00 +0000 KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
# Reboot the node
ssh node-3 "sudo reboot"
Connection to node-3 closed by remote host
# Wait for the node to recover
kubectl get nodes -w
NAME STATUS ROLES AGE VERSION
node-1 Ready control-plane 30d v1.26.5
node-2 Ready
node-3 NotReady
node-3 Ready
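On larger clusters it helps to extract the unhealthy nodes from the node list instead of reading it by eye. A minimal sketch, assuming the default `kubectl get nodes` column layout with a header row; the function name `not_ready_nodes` is hypothetical.

```shell
#!/usr/bin/env bash
# not_ready_nodes: reads `kubectl get nodes` output on stdin and prints the
# names of nodes whose STATUS does not start with "Ready".
# "Ready,SchedulingDisabled" (a cordoned node) still counts as ready here.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Demo with sample input; on a live cluster: kubectl get nodes | not_ready_nodes
not_ready_nodes <<'EOF'
NAME     STATUS     ROLES           AGE   VERSION
node-1   Ready      control-plane   30d   v1.26.5
node-2   Ready      <none>          30d   v1.26.5
node-3   NotReady   <none>          30d   v1.26.5
EOF
```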
3.4.2 Recovering from an etcd Failure
kubectl get pods -n kube-system | grep etcd
NAME READY STATUS RESTARTS AGE
etcd-node-1 1/1 Running 0 30d
etcd-node-2 1/1 Running 0 30d
etcd-node-3 0/1 Error 0 30d
# Back up the etcd data
ETCDCTL_API=3 etcdctl \
--endpoints=https://192.168.1.101:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /tmp/etcd-backup.db
Snapshot saved at /tmp/etcd-backup.db
# Restore the etcd data. Run this on the etcd node: restore reads the local snapshot file
# and writes a new data directory, so no endpoint or TLS flags are needed.
# /var/lib/etcd-restore is an example path; etcd must then be started against that directory.
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-backup.db \
--data-dir=/var/lib/etcd-restore
2026-01-15 10:40:00.000Z INFO snapshot restore: restoring snapshot…
2026-01-15 10:40:05.000Z INFO snapshot restore: snapshot restored successfully
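# Restart the etcd Pod. With kubeadm, etcd is a static Pod: deleting it only recreates the mirror Pod,
# so if the container keeps running, move /etc/kubernetes/manifests/etcd.yaml out of the directory and back
kubectl delete pod etcd-node-3 -n kube-system
pod "etcd-node-3" deleted
# Verify the etcd status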
kubectl get pods -n kube-system | grep etcd
NAME READY STATUS RESTARTS AGE
etcd-node-1 1/1 Running 0 30d
etcd-node-2 1/1 Running 0 30d
etcd-node-3 1/1 Running 0 10s
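A snapshot is only useful if it is still intact when it is needed. One simple safeguard is to record a checksum right after `snapshot save` and verify it before any restore. The sketch below assumes GNU coreutils `sha256sum`; the function names are hypothetical, and the demo uses a temporary file in place of a real snapshot.

```shell
#!/usr/bin/env bash
# record_checksum FILE: stores the SHA-256 of FILE in FILE.sha256.
record_checksum() { sha256sum "$1" > "$1.sha256"; }
# verify_checksum FILE: exits non-zero if FILE no longer matches FILE.sha256.
verify_checksum() { sha256sum --status -c "$1.sha256"; }

# Demo on a stand-in file (a real run would use /tmp/etcd-backup.db).
snap=$(mktemp)
echo "example snapshot bytes" > "$snap"
record_checksum "$snap"
verify_checksum "$snap" && echo "snapshot intact"
echo "corruption" >> "$snap"
verify_checksum "$snap" || echo "snapshot changed since the checksum was recorded"
rm -f "$snap" "$snap.sha256"
```

Storing the `.sha256` file alongside the off-site copy lets the check run at the restore site as well.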
4. Hands-On Cases
4.1 Full Disaster Recovery Drill
4.1.1 Preparing the Drill
# Create a test namespace
kubectl create namespace test-app
namespace/test-app created
# Deploy a test application
kubectl create deployment nginx --image=nginx:latest -n test-app
deployment.apps/nginx created
# Create a Service
kubectl expose deployment nginx --port=80 -n test-app
service/nginx exposed
# Create a ConfigMap
kubectl create configmap nginx-config --from-literal=key=value -n test-app
configmap/nginx-config created
# Create a Secret
kubectl create secret generic nginx-secret --from-literal=username=admin --from-literal=password=admin123 -n test-app
secret/nginx-secret created
# List the resources
kubectl get all -n test-app
NAME READY STATUS RESTARTS AGE
pod/nginx-7d6f8b9c5d-abc123 1/1 Running 0 1m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/nginx ClusterIP 10.233.123.456
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/nginx 1/1 1 1 1m
NAME DESIRED CURRENT READY AGE
replicaset.apps/nginx-7d6f8b9c5d 1 1 1 1m
4.1.2 Creating the Backup
# Create the backup and block until it finishes
velero backup create test-app-backup --include-namespaces test-app --wait
Backup request "test-app-backup" submitted successfully.
# View the backup details
velero backup describe test-app-backup
Name: test-app-backup
Namespace: velero
Labels: velero.io/storage-location=default
Annotations: velero.io/source-cluster-k8s-gitversion=v1.26.5
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=26
Phase: Completed
Namespaces:
Included: test-app
Excluded:
Resources:
Included: *
Excluded: nodes, events, events.events.k8s.io, pods.metrics.k8s.io
Cluster-scoped: auto
Label selector:
Storage location: default
Velero-Native Snapshot PVs: auto
TTL: 720h0m0s
Hook attempts: 3
Hook timeout: 0s
CSISnapshotTimeout: 0s
Backup Format Version: 1.1.0
Started: 2026-01-15 10:50:00 +0000 UTC
Completed: 2026-01-15 10:50:30 +0000 UTC
Expiration: 2026-02-14 10:50:00 +0000 UTC
Total items to be backed up: 8
Items backed up: 8
Velero-Native Snapshots:
CSISnapshots:
4.1.3 Simulating a Disaster
kubectl delete namespace test-app
namespace "test-app" deleted
# Confirm the deletion
kubectl get namespace test-app
Error from server (NotFound): namespaces "test-app" not found
# List all namespaces
kubectl get namespaces
NAME STATUS AGE
default Active 30d
kube-node-lease Active 30d
kube-public Active 30d
kube-system Active 30d
kubesphere-controls-system Active 30d
kubesphere-devops-system Active 30d
kubesphere-logging-system Active 30d
kubesphere-monitoring-system Active 30d
kubesphere-system Active 30d
velero Active 1h
4.1.4 Running the Restore
# Create the restore and block until it finishes
velero restore create --from-backup test-app-backup --wait
Restore request "test-app-backup-20260115105500" submitted successfully.
# Check the restore status
velero restore get
NAME BACKUP STATUS STARTED COMPLETED ERRORS WARNINGS CREATED SELECTOR
test-app-backup-20260115105500 test-app-backup Completed 2026-01-15 10:55:00 +0000 UTC 2026-01-15 10:55:30 +0000 UTC 0 0 2026-01-15 10:55:00 +0000 UTC
# Verify the restore
kubectl get namespace test-app
NAME STATUS AGE
test-app Active 30s
# List all resources
kubectl get all -n test-app
NAME READY STATUS RESTARTS AGE
pod/nginx-7d6f8b9c5d-abc123 1/1 Running 0 30s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/nginx ClusterIP 10.233.123.456
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/nginx 1/1 1 1 30s
NAME DESIRED CURRENT READY AGE
replicaset.apps/nginx-7d6f8b9c5d 1 1 1 30s
# Check the ConfigMap
kubectl get configmap nginx-config -n test-app
NAME DATA AGE
nginx-config 1 30s
# Check the Secret
kubectl get secret nginx-secret -n test-app
NAME TYPE DATA AGE
nginx-secret Opaque 2 30s
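A drill is most useful when its duration is measured against the RTO. The sketch below computes the elapsed time between the STARTED and COMPLETED timestamps of the restore and compares it with a 1-hour RTO. It assumes GNU `date` (for the `-d` option), and the function names are hypothetical. The restore duration is only one part of the real recovery time, which also includes detection and decision time.

```shell
#!/usr/bin/env bash
# elapsed_seconds START END: seconds between two timestamps that GNU date can parse.
elapsed_seconds() {
  local start end
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( end - start ))
}

# within_rto ELAPSED_S RTO_S: succeeds if the elapsed time is within the RTO.
within_rto() { [ "$1" -le "$2" ]; }

took=$(elapsed_seconds "2026-01-15 10:55:00" "2026-01-15 10:55:30")
if within_rto "$took" 3600; then
  echo "restore took ${took}s, within the 1h RTO"
else
  echo "restore took ${took}s, exceeding the 1h RTO"
fi
```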
4.2 Cluster Failure Recovery Drill
4.2.1 Simulating a Control Plane Failure
kubectl get pods -n kube-system | grep kube-apiserver
NAME READY STATUS RESTARTS AGE
kube-apiserver-node-1 1/1 Running 0 30d
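# Stop the API Server (simulated failure). With kubeadm the API Server is a static Pod,
# so move its manifest out of the manifests directory; stopping only the kubelet leaves the container running
ssh node-1 "sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/"
# Wait for the kubelet to stop the Pod
sleep 30
# Try to reach the API Server
kubectl get nodes
The connection to the server 192.168.1.101:6443 was refused - did you specify the right host or port?
# Put the manifest back so the kubelet restarts the API Server
ssh node-1 "sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/"
# Wait for the API Server to recover
sleep 30
# Verify that the API Server has recovered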
kubectl get nodes
NAME STATUS ROLES AGE VERSION
node-1 Ready control-plane 30d v1.26.5
node-2 Ready
node-3 Ready
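A fixed `sleep 30` either waits too long or not long enough. A drill script can instead poll until a condition holds, with a timeout. The helper below is a minimal sketch; the name `wait_until` is hypothetical, and the demo uses `true` and `false` as stand-in conditions.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT_S INTERVAL_S COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_S seconds until it succeeds.
# Returns 1 if it still fails after roughly TIMEOUT_S seconds.
wait_until() {
  local timeout=$1 interval=$2 waited=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}

wait_until 5 1 true && echo "condition met"
wait_until 2 1 false || echo "timed out"
```

In the drill, `wait_until 300 5 kubectl get nodes` would wait up to five minutes for the API Server to answer again.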
4.2.2 Simulating a Network Partition
kubectl get pods -n test-app -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-7d6f8b9c5d-abc123 1/1 Running 0 10m 10.233.0.10 node-2
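# Simulate a network partition (cut node-2 off from the node subnet).
# SSH (port 22) is allowed first so the drill cannot lock out the session needed to undo it
ssh node-2 "sudo iptables -A INPUT -s 192.168.1.0/24 -p tcp --dport 22 -j ACCEPT && sudo iptables -A INPUT -s 192.168.1.0/24 -j DROP"
# Wait a while
sleep 30
# Check the Pod status
kubectl get pods -n test-app -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-7d6f8b9c5d-abc123 1/1 Running 0 10m 10.233.0.10 node-2
# Restore the network
ssh node-2 "sudo iptables -D INPUT -s 192.168.1.0/24 -j DROP && sudo iptables -D INPUT -s 192.168.1.0/24 -p tcp --dport 22 -j ACCEPT"
# Wait for the network to recover
sleep 30
# Verify the Pod status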
kubectl get pods -n test-app -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-7d6f8b9c5d-abc123 1/1 Running 0 10m 10.233.0.10 node-2
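5. Lessons Learned
5.1 Best Practices
5.1.1 Backup Best Practices
- Back up regularly: run backups on a schedule that matches the RPO
- Store backups off-site: keep copies in a separate location to avoid a single point of failure
- Verify backups: regularly test backup integrity and restorability
- Encrypt backups: protect backup data with encryption
- Document the process: record the backup and restore procedures in detail
5.1.2 Restore Best Practices
- Drill regularly: run disaster recovery drills to confirm that the restore procedure works
- Respond quickly: establish a rapid-response process to shorten recovery time
- Prioritize: order recovery by business criticality
- Communicate: set up channels to notify the people involved promptly
- Improve continuously: refine the restore procedure based on drill results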
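5.2 Common Problems
5.2.1 Backup Problems
- Problem 1: the backup fails
  - Solution: check the storage configuration and network connectivity
- Problem 2: the backup takes too long
  - Solution: tune the backup strategy and use incremental backups
- Problem 3: backup storage runs out of space
  - Solution: clean up expired backups and add storage capacity
5.2.2 Restore Problems
- Problem 1: the restore fails
  - Solution: check the backup integrity and the cluster state
- Problem 2: the restore takes too long
  - Solution: streamline the restore procedure and restore in parallel
- Problem 3: the application does not start after the restore
  - Solution: check the application configuration and its dependencies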
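5.3 Monitoring and Alerting
5.3.1 Backup Monitoring
- Backup status: monitor backup status to catch failures promptly
- Backup duration: monitor how long backups take to spot performance problems
- Backup size: monitor backup size to spot abnormal growth
- Storage usage: monitor storage usage to avoid running out of space
5.3.2 Cluster Monitoring
- Node status: monitor node status to catch failures promptly
- Pod status: monitor Pod status to catch anomalies promptly
- Resource usage: monitor resource usage to find performance bottlenecks
- Network status: monitor network status to catch network problems promptly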
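5.4 Documentation and Training
5.4.1 Documentation Management
- Backup strategy document: record the backup strategy and procedures in detail
- Restore procedure document: record the restore procedure step by step
- Drill reports: record drill results and suggested improvements
- Troubleshooting handbook: record common failures and how to handle them
5.4.2 Staff Training
- Regular training: hold disaster recovery training on a regular basis
- Drill participation: involve the relevant staff in drills
- Knowledge sharing: share experience and lessons from drills
- Continuous learning: follow new disaster recovery techniques and methods
This article was compiled and published by 风哥教程 for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html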
