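This article walks through troubleshooting and repairing common Rancher cluster failures, covering Pod failures, node failures, network problems, storage problems, and performance problems, with hands-on examples. It draws on the troubleshooting and operations chapters of the official Rancher documentation.
Outline
Part01-Basic Concepts and Theory
1.1 Classifying Cluster Failures and Diagnostic Methods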
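Rancher cluster failures fall into three levels: Pod, node, and cluster. Pod failures include the CrashLoopBackOff, ImagePullBackOff, and Pending states. Node failures include NotReady, DiskPressure, and MemoryPressure. Cluster failures include etcd failures, an unavailable API Server, and network partitions. Diagnostic methods include reading logs, checking events, watching monitoring metrics, and capturing network packets.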
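As a small illustration of this classification, the sketch below groups observed status strings by level. The lookup table and the function name `group_by_level` are hypothetical; they only encode the states named in this section, and anything else is reported as "unknown".

```python
# Hypothetical lookup table: only the states named in this section.
LEVELS = {
    "pod": {"CrashLoopBackOff", "ImagePullBackOff", "Pending"},
    "node": {"NotReady", "DiskPressure", "MemoryPressure"},
}


def group_by_level(statuses):
    """Group status strings into 'pod', 'node', or 'unknown' buckets."""
    grouped = {"pod": [], "node": [], "unknown": []}
    for status in statuses:
        for level, known in LEVELS.items():
            if status in known:
                grouped[level].append(status)
                break
        else:
            # Not a Pod or node state from the table above.
            grouped["unknown"].append(status)
    return grouped


if __name__ == "__main__":
    print(group_by_level(["Pending", "NotReady", "CrashLoopBackOff"]))
```

Grouping first helps decide where to look: Pod-level states usually start with the Pod's own events and logs, and node-level states with the node's conditions and kubelet logs.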
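1.2 Log Collection and Analysis Tools
Kubernetes logs fall into container logs, system logs, and audit logs. Container logs are viewed with kubectl logs, and system logs with journalctl. The ELK/EFK stack provides log aggregation and analysis. Prometheus and Grafana provide monitoring and alerting. Rancher has built-in log collection with support for Fluentd and Fluent Bit.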
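The two log commands can be assembled as argument lists, for example before handing them to `subprocess.run`. This is a minimal sketch: the helper names are hypothetical, and the `kubelet` unit name and the one-hour window are assumptions rather than values from this article. The flags used (`-n`, `--previous`, `--tail` for kubectl logs; `-u`, `--since`, `--no-pager` for journalctl) are standard options.

```python
import shlex


def container_log_cmd(pod, namespace, previous=False, tail=None):
    """Build a `kubectl logs` command for one Pod."""
    cmd = ["kubectl", "logs", pod, "-n", namespace]
    if previous:
        # Logs of the previous, crashed container instance.
        cmd.append("--previous")
    if tail is not None:
        cmd.append(f"--tail={tail}")
    return cmd


def system_log_cmd(unit="kubelet", since="1 hour ago"):
    """Build a `journalctl` command for one systemd unit (unit name assumed)."""
    return ["journalctl", "-u", unit, "--since", since, "--no-pager"]


if __name__ == "__main__":
    print(shlex.join(container_log_cmd("fgedu-web-5d4f8b6c6-abc12", "fgedu-prod",
                                       previous=True, tail=50)))
    print(shlex.join(system_log_cmd()))
```

For a container stuck in a restart loop, `--previous` is usually the more useful variant, because the current instance may not have logged anything yet.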
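Part02-Production Planning and Recommendations
2.1 Designing the Monitoring and Alerting System
Metrics to monitor include resource utilization, Pod status, node status, network traffic, and storage capacity. Alert rules include CPU > 80%, memory > 85%, disk > 90%, and Pod restart count > 5. Alert channels include email, SMS, DingTalk, and WeCom (企业微信). Review alert effectiveness regularly to avoid alert fatigue.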
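The alert rules above can be written as a small threshold table. The sketch below is illustrative only: the metric key names and the function `firing_alerts` are hypothetical, and a real setup would express these as Prometheus alerting rules rather than Python.

```python
# Thresholds taken from the rules above; the key names are hypothetical.
THRESHOLDS = {
    "cpu_percent": 80,
    "memory_percent": 85,
    "disk_percent": 90,
    "pod_restarts": 5,
}


def firing_alerts(metrics):
    """Return the sorted names of metrics strictly above their threshold."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    )


if __name__ == "__main__":
    sample = {"cpu_percent": 95, "memory_percent": 94,
              "disk_percent": 50, "pod_restarts": 5}
    print(firing_alerts(sample))
```

The comparison is strict, so a Pod with exactly 5 restarts does not fire; whether the boundary should be inclusive is a policy choice the rules leave open.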
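2.2 Incident Response Process
Establish an incident severity scheme: P0 (core service unavailable), P1 (functional failure), and P2 (performance degradation). A P0 incident requires a response within 15 minutes, root-cause location within 30 minutes, and recovery within 1 hour. Maintain an incident handbook that records common problems and their solutions. Hold a post-incident review after each failure to capture lessons learned.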
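The P0 targets translate directly into deadlines. The sketch below assumes all three targets are measured from the moment the incident is detected, which the text does not state explicitly; the function names are hypothetical.

```python
from datetime import datetime, timedelta

# P0 targets from the text; assumed to be measured from detection time.
P0_TARGETS = {
    "respond": timedelta(minutes=15),
    "locate": timedelta(minutes=30),
    "recover": timedelta(hours=1),
}


def p0_deadlines(detected_at):
    """Return the deadline for each P0 stage."""
    return {stage: detected_at + delta for stage, delta in P0_TARGETS.items()}


def past_deadline(detected_at, now):
    """List the stages whose deadline has already passed at `now`."""
    return [stage for stage, due in p0_deadlines(detected_at).items() if now > due]


if __name__ == "__main__":
    detected = datetime(2026, 4, 10, 15, 0, 0)
    print(p0_deadlines(detected))
    print(past_deadline(detected, datetime(2026, 4, 10, 15, 40, 0)))
```

A stage listed by `past_deadline` counts as an SLA breach only if it was not completed in time, so a real tracker would also record completion timestamps.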
Part03-Production Implementation
3.1 Pod Failure Troubleshooting and Repair
Investigate abnormal Pod states and repair them.
NAMESPACE    NAME                         READY   STATUS             RESTARTS   AGE
fgedu-prod   fgedu-web-5d4f8b6c6-abc12    0/1     CrashLoopBackOff   5          10m
fgedu-prod   fgedu-api-7g8h9i0j1-def34    0/1     ImagePullBackOff   0          5m
fgedu-dev    fgedu-test-9k0l1m2n3-ghi56   0/1     Pending            0          3m

Name:         fgedu-web-5d4f8b6c6-abc12
Namespace:    fgedu-prod
Priority:     0
Node:         fgedu-worker-1/192.168.1.11
Start Time:   Fri, 10 Apr 2026 15:00:00 +0800
Labels:       app=fgedu-web
Annotations:  <none>
Status:       Running
IP:           10.42.1.10
IPs:
  IP:  10.42.1.10
Containers:
  nginx:
    Container ID:   docker://abc123def456
    Image:          nginx:1.25
    Image ID:       docker-pullable://nginx@sha256:7890123456789012345678901234567890123456789012345678901234567890
    Port:           80/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Fri, 10 Apr 2026 15:05:00 +0800
      Finished:     Fri, 10 Apr 2026 15:05:05 +0800
    Ready:          False
    Restart Count:  5
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  10m   default-scheduler  Successfully assigned fgedu-web-5d4f8b6c6-abc12 to fgedu-worker-1
  Normal   Pulled     10m   kubelet            Container image "nginx:1.25" already present on machine
  Normal   Created    10m   kubelet            Created container nginx
  Normal   Started    10m   kubelet            Started container nginx
  Warning  BackOff    5m    kubelet            Back-off restarting failed container

2026/04/10 15:00:00 [notice] 1#1: start worker processes
2026/04/10 15:00:00 [notice] 1#1: start worker process 32
2026/04/10 15:00:05 [alert] 32#32: *1 worker process 32 exited on signal 9 (SIGKILL)
2026/04/10 15:00:05 [notice] 1#1: signal 17 (SIGCHLD) received from 32
2026/04/10 15:00:05 [notice] 32#32: exit
2026/04/10 15:00:05 [alert] 1#1: send signal 9 to worker process 32 (pid: 32) failed (3: No such process)

resources:
  limits:
    cpu: 100m
    memory: 64Mi
  requests:
    cpu: 50m
    memory: 32Mi

deployment.apps/fgedu-web patched

NAME                        READY   STATUS    RESTARTS   AGE
fgedu-web-7g8h9i0j1-abc12   1/1     Running   0          1m
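The Exit Code 137 in the Pod description follows the common shell convention that a code above 128 means "terminated by signal (code - 128)", so 137 corresponds to signal 9 (SIGKILL), which matches the nginx log. A minimal sketch of that decoding follows; the function name is hypothetical, and the hint about memory limits is an interpretation, not something the output proves.

```python
import signal


def describe_exit_code(code):
    """Explain a container exit code using the 128 + signal convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"application error (exit {code})"


if __name__ == "__main__":
    # 137 = 128 + 9; with a 64Mi memory limit this is often, though not
    # always, the kernel OOM killer.
    print(describe_exit_code(137))
```

When the cause is really an out-of-memory kill, `kubectl describe pod` usually reports `Reason: OOMKilled` in Last State. The output above shows `Error`, so raising the memory limit is a reasonable first fix rather than a certain one.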
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  5m    default-scheduler  Successfully assigned fgedu-api-7g8h9i0j1-def34 to fgedu-worker-2
  Warning  Failed     5m    kubelet            Failed to pull image "fgedu/fgedu-api:1.0.0": rpc error: code = Unknown desc = Error response from daemon: pull access denied for fgedu/fgedu-api, repository does not exist or may require 'docker login'

secret/fgedu-registry-secret created
deployment.apps/fgedu-api patched

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  3m    default-scheduler  0/3 nodes are available: 3 Insufficient cpu.

NAME             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
fgedu-master-1   800m         20%    8Gi             50%
fgedu-worker-1   1900m        95%    15Gi            94%
fgedu-worker-2   1850m        93%    14Gi            88%

deployment.apps/fgedu-test scaled

3.2 Node Failure Troubleshooting and Repair
Investigate abnormal node states and repair them.
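One way to spot unhealthy nodes programmatically is to read the Ready condition from the JSON printed by `kubectl get nodes -o json`. The sketch below parses that structure; the function name and the embedded sample document are hypothetical and heavily trimmed.

```python
import json


def not_ready_nodes(nodes_json):
    """Map each node whose Ready condition is not 'True' to its message."""
    data = json.loads(nodes_json)
    result = {}
    for item in data.get("items", []):
        for cond in item["status"].get("conditions", []):
            # Ready can be "True", "False", or "Unknown".
            if cond["type"] == "Ready" and cond["status"] != "True":
                result[item["metadata"]["name"]] = cond.get("message", "")
    return result


# Hypothetical, trimmed sample of `kubectl get nodes -o json` output.
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "fgedu-worker-1"},
     "status": {"conditions": [
         {"type": "Ready", "status": "False",
          "message": "PLEG is not healthy"}]}},
    {"metadata": {"name": "fgedu-worker-2"},
     "status": {"conditions": [
         {"type": "Ready", "status": "True",
          "message": "kubelet is posting ready status"}]}},
]})

if __name__ == "__main__":
    print(not_ready_nodes(SAMPLE))
```

The condition message (here a PLEG health error) usually points at the next step, such as checking the container runtime and the kubelet logs on that node.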
NAME             STATUS     ROLES                  AGE   VERSION
fgedu-master-1   Ready      control-plane,master   30d   v1.28.5
fgedu-worker-1   NotReady   <none>                 30d   v1.28.5
fgedu-worker-2   Ready      <none>                 30d   v1.28.5

Name:               fgedu-worker-1
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=fgedu-worker-1
                    kubernetes.io/os=linux
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"02:42:ac:11:00:02"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 192.168.1.11
Conditions:
  Type            Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
  ----            ------  -----------------                ------------------               ------                      -------
  MemoryPressure  False   Fri, 10 Apr 2026 15:30:00 +0800  Fri, 10 Apr 2026 14:00:00 +0800  KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Fri, 10 Apr 2026 15:30:00 +0800  Fri, 10 Apr 2026 14:00:00 +0800  KubeletHasSufficientDisk    kubelet has sufficient disk space
  PIDPressure     False   Fri, 10 Apr 2026 15:30:00 +0800  Fri, 10 Apr 2026 14:00:00 +0800  KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready           False   Fri, 10 Apr 2026 15:30:00 +0800  Fri, 10 Apr 2026 15:25:00 +0800  KubeletNotReady             PLEG is not healthy: pleg was last seen active 5m ago

Apr 10 15:25:00 fgedu-worker-1 kubelet[1234]: E0410 15:25:00.123456 12345 kubelet.go:2466] "Error getting node" err="node \"fgedu-worker-1\" not found"
Apr 10 15:25:05 fgedu-worker-1 kubelet[1234]: E0410 15:25:05.234567 12345 kubelet.go:2466] "Error getting node" err="node \"fgedu-worker-1\" not found"
Apr 10 15:25:10 fgedu-worker-1 kubelet[1234]: E0410 15:25:10.345678 12345 kubelet.go:2466] "Error getting node" err="node \"fgedu-worker-1\" not found"
Apr 10 15:25:15 fgedu-worker-1 kubelet[1234]: I0410 15:25:15.456789 12345 kubelet.go:2466] "Attempting to sync node" node="fgedu-worker-1"
Apr 10 15:25:20 fgedu-worker-1 kubelet[1234]: E0410 15:25:20.567890 12345 kubelet.go:2466] "PLEG is not healthy: pleg was last seen active 5m ago"

node/fgedu-worker-1 condition met

NAME             STATUS   ROLES                  AGE   VERSION
fgedu-master-1   Ready    control-plane,master   30d   v1.28.5
fgedu-worker-1   Ready    <none>                 30d   v1.28.5
fgedu-worker-2   Ready    <none>                 30d   v1.28.5

3.3 Network Failure Troubleshooting and Repair
Investigate cluster network failures.
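Network checks usually start by confirming that the CNI and CoreDNS Pods in kube-system are fully ready. The sketch below scans the text table that `kubectl get pods` prints and returns the Pods whose READY column is not N/N. The function name and the sample table are hypothetical. The label selectors in the comment are the conventional ones for Calico and CoreDNS and may differ in a given Rancher distribution.

```python
def unready_pods(table):
    """Return names of Pods whose READY column (e.g. 0/1) is not N/N."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header = lines[0].split()
    name_i, ready_i = header.index("NAME"), header.index("READY")
    bad = []
    for line in lines[1:]:
        cols = line.split()
        ready, total = cols[ready_i].split("/")
        if ready != total:
            bad.append(cols[name_i])
    return bad


# Hypothetical sample, e.g. from:
#   kubectl get pods -n kube-system -l k8s-app=calico-node
#   kubectl get pods -n kube-system -l k8s-app=kube-dns
SAMPLE = """\
NAME                       READY   STATUS             RESTARTS   AGE
calico-node-abc12          1/1     Running            0          30d
coredns-5d78c9869d-abc12   0/1     CrashLoopBackOff   7          30d
"""

if __name__ == "__main__":
    print(unready_pods(SAMPLE))
```

If every Pod is ready, the next things to test are DNS resolution from inside a Pod and any NetworkPolicy objects in the affected namespace.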
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=117 time=12.345 ms
64 bytes from 8.8.8.8: seq=1 ttl=117 time=11.234 ms
64 bytes from 8.8.8.8: seq=2 ttl=117 time=13.456 ms

--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 11.234/12.345/13.456 ms

NAME                READY   STATUS    RESTARTS   AGE
calico-node-abc12   1/1     Running   0          30d
calico-node-def34   1/1     Running   0          30d
calico-node-ghi56   1/1     Running   0          30d

NAME                       READY   STATUS    RESTARTS   AGE
coredns-5d78c9869d-abc12   1/1     Running   0          30d
coredns-5d78c9869d-def34   1/1     Running   0          30d

Server:    10.43.0.10
Address 1: 10.43.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.43.0.1 kubernetes.default.svc.cluster.local

NAMESPACE    NAME
fgedu-prod   fgedu-deny-all
fgedu-prod   fgedu-allow-dns
fgedu-prod   fgedu-allow-same-namespace

Part04-Production Cases and Hands-On Walkthroughs
4.1 Storage Troubleshooting in Practice
Investigate storage-related failures.
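When a volume cannot be provisioned on a local-path style StorageClass, a full backing filesystem is one common cause, so checking `df -h` on the node is a natural step. The sketch below flags filesystems above a usage threshold. The function name is hypothetical, the default of 90 mirrors the "disk > 90%" alert rule from section 2.1, and mount points containing spaces are not handled.

```python
def full_filesystems(df_output, threshold=90):
    """Return (mount point, use%) pairs whose usage exceeds the threshold."""
    hits = []
    for row in df_output.strip().splitlines()[1:]:  # skip the header row
        cols = row.split()
        use = int(cols[4].rstrip("%"))
        if use > threshold:
            hits.append((cols[5], use))
    return hits


# Sample in the `df -h` layout, using the values shown in this section.
SAMPLE = """\
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       500G  495G    5G  99% /Rancher/fgdata
"""

if __name__ == "__main__":
    print(full_filesystems(SAMPLE))
```

After cleanup, the same check on a filesystem at exactly 90% no longer fires, because the comparison is strict.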
NAMESPACE          NAME              STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
fgedu-middleware   fgedu-mysql-pvc   Pending                                                         RWO            fast-ssd       10m
fgedu-middleware   fgedu-redis-pvc   Bound     pvc-abc123def456789012345678901234567890   100Gi      RWO            fast-ssd       1d

Name:          fgedu-mysql-pvc
Namespace:     fgedu-middleware
StorageClass:  fast-ssd
Status:        Pending
Volume:
Labels:        <none>
Annotations:
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:
Events:
  Type     Reason              Age   From                         Message
  ----     ------              ----  ----                         -------
  Warning  ProvisioningFailed  5m    persistentvolume-controller  Failed to provision volume with StorageClass "fast-ssd": rpc error: code = Internal desc = Could not mount target "/var/lib/kubelet/pods/abc123def456789012345678901234567890/volumes/kubernetes.io~pvc/abc123def456789012345678901234567890": exit status 32

NAME       PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
fast-ssd   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  30d

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       500G  495G    5G  99% /Rancher/fgdata

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       500G  450G   50G  90% /Rancher/fgdata

persistentvolumeclaim "fgedu-mysql-pvc" deleted
persistentvolumeclaim/fgedu-mysql-pvc created

NAME              STATUS   VOLUME                                    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
fgedu-mysql-pvc   Bound    pvc-123456789012345678901234567890123     200Gi      RWO            fast-ssd       1m

4.2 Performance Troubleshooting in Practice
Investigate cluster performance problems.
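A useful number when reading `kubectl top pods` next to a Pod's configured limits is the ratio of CPU usage to CPU limit. The sketch below converts Kubernetes CPU quantities such as `1500m` or `2` into millicores and computes that ratio. Both function names are hypothetical, and only the plain and `m`-suffixed forms are handled.

```python
def cpu_millicores(quantity):
    """Convert a CPU quantity ('500m', '2', '0.5') to integer millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def cpu_utilization(usage, limit):
    """Ratio of observed CPU usage to the configured CPU limit."""
    return cpu_millicores(usage) / cpu_millicores(limit)


if __name__ == "__main__":
    # Values as they appear in this section: 1500m used, 500m limit.
    print(cpu_utilization("1500m", "500m"))
```

A ratio near or above 1.0 suggests the workload wants more CPU than its limit allows, which points to raising the limit, adding replicas, or both.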
NAME             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
fgedu-master-1   1200m        30%    12Gi            75%
fgedu-worker-1   3800m        95%    15Gi            94%
fgedu-worker-2   3600m        90%    14Gi            88%

NAMESPACE          NAME                          CPU(cores)   MEMORY(bytes)
fgedu-prod         fgedu-web-5d4f8b6c6-abc12     1500m        2Gi
fgedu-prod         fgedu-api-7g8h9i0j1-def34     1200m        1.5Gi
fgedu-middleware   fgedu-mysql-9k0l1m2n3-ghi56   800m         4Gi
fgedu-middleware   fgedu-redis-1n2o3p4q5-jkl78   600m         2Gi
fgedu-prod         fgedu-web-5d4f8b6c6-mno90     500m         1.8Gi

Name:         fgedu-web-5d4f8b6c6-abc12
Namespace:    fgedu-prod
Priority:     0
Node:         fgedu-worker-1/192.168.1.11
Start Time:   Fri, 10 Apr 2026 16:00:00 +0800
Labels:       app=fgedu-web
Containers:
  nginx:
    State:          Running
      Started:      Fri, 10 Apr 2026 16:00:00 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  256Mi
    Requests:
      cpu:     200m
      memory:  128Mi

deployment.apps/fgedu-web patched
deployment.apps/fgedu-web scaled

NAME                        READY   STATUS    RESTARTS   AGE
fgedu-web-7g8h9i0j1-abc12   1/1     Running   0          1m
fgedu-web-7g8h9i0j1-def34   1/1     Running   0          1m
fgedu-web-7g8h9i0j1-ghi56   1/1     Running   0          1m
fgedu-web-7g8h9i0j1-jkl78   1/1     Running   0          1m
fgedu-web-7g8h9i0j1-mno90   1/1     Running   0          1m

4.3 Cluster Recovery in Practice
Perform cluster recovery operations.
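Recovery depends on having a recent etcd snapshot, and a timestamped file name keeps successive backups apart. The sketch below builds a path in the `etcd-backup-YYYYMMDD-HHMMSS.db` style and the matching `etcdctl snapshot save` argument list. The helper names are hypothetical. The endpoint and TLS flags that a real cluster normally needs (`--endpoints`, `--cacert`, `--cert`, `--key`) are left out because their values are cluster-specific.

```python
from datetime import datetime


def snapshot_path(backup_dir, when):
    """Build a timestamped etcd snapshot path inside backup_dir."""
    return f"{backup_dir.rstrip('/')}/etcd-backup-{when:%Y%m%d-%H%M%S}.db"


def snapshot_cmd(path):
    """Argument list for saving an etcd snapshot (TLS flags omitted)."""
    return ["etcdctl", "snapshot", "save", path]


if __name__ == "__main__":
    path = snapshot_path("/Rancher/fgdata", datetime(2026, 4, 10, 17, 0, 0))
    print(" ".join(snapshot_cmd(path)))
```

Storing snapshots on the same disk as etcd's data directory protects against logical mistakes but not against losing that disk, so copying them off the node is worth considering.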
NAME                  READY   STATUS    RESTARTS   AGE
etcd-fgedu-master-1   1/1     Running   0          30d

Snapshot saved at /Rancher/fgdata/etcd-backup-20260410-170000.db

NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health":"true"}

I0410 17:00:00.123456 1 genericapiserver.go:533] Serving insecurely on [::]:8080
I0410 17:00:00.234567 1 serving.go:331] Generated self-signed cert in-memory
I0410 17:00:00.345678 1 secure_serving.go:178] Serving securely on [::]:6443
I0410 17:00:00.456789 1 controller.go:608] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
I0410 17:00:00.567890 1 controller.go:608] OpenAPI AggregationController: Processing item v1.custom.metrics.k8s.io

NAMESPACE    LAST SEEN   TYPE     REASON              OBJECT                           MESSAGE
fgedu-prod   5m          Normal   ScalingReplicaSet   deployment/fgedu-web             Scaled up replica set fgedu-web-7g8h9i0j1 to 5
fgedu-prod   5m          Normal   SuccessfulCreate    replicaset/fgedu-web-7g8h9i0j1   Created pod: fgedu-web-7g8h9i0j1-abc12
fgedu-prod   5m          Normal   Scheduled           pod/fgedu-web-7g8h9i0j1-abc12    Successfully assigned fgedu-web-7g8h9i0j1-abc12 to fgedu-worker-1
fgedu-prod   4m          Normal   Pulling             pod/fgedu-web-7g8h9i0j1-abc12    Pulling image "nginx:1.25"
fgedu-prod   3m          Normal   Pulled              pod/fgedu-web-7g8h9i0j1-abc12    Successfully pulled image "nginx:1.25"

Part05-风哥's Lessons Learned
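5.1 Production Best Practices
1. Build a comprehensive monitoring and alerting system
2. Back up etcd and other important data regularly
3. Define an incident response process and handbook
4. Run failure drills and stress tests regularly
5. Use log aggregation and analysis tools
6. Maintain a knowledge base and an issue database
7. Run security audits and vulnerability scans regularly
8. Establish backup and disaster recovery plans
5.2 Common Problems and Solutions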
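1. Pod restarts frequently: check resource limits, read the application logs, verify the health checks
2. Node NotReady: check kubelet status, verify network connectivity, read the system logs
3. Network unreachable: check the CNI plugin, verify network policies, test DNS resolution
4. Storage mount failure: check PV/PVC status, verify available storage space, read the node logs
5. Performance degradation: check resource usage, tune application configuration, add node resources
6. API Server unavailable: check etcd status, verify certificate validity, read the service logs
7. Image pull failure: check the image registry, verify credentials, review network configuration
8. DNS resolution failure: check CoreDNS status, verify network policies, test connectivity
This article was compiled and published by 风哥教程 for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html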
