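Summary: This Fengge tutorial draws on the official documentation for Linux, Red Hat Enterprise Linux, Ansible Automation Platform, Docker, Kubernetes, and Podman, and explains how to configure and use the relevant technologies.
Fengge's note:
This document walks through hands-on cases of troubleshooting common Kubernetes failures.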
Part01-Pod Troubleshooting
1.1 Pod fails to start
[root@k8s-master ~]# kubectl get pods -n fgedu-prod
NAME                     READY   STATUS             RESTARTS   AGE
fgedu-app-abc12-xyz789   0/1     ImagePullBackOff   0          5m
fgedu-app-abc12-abc12    0/1     CrashLoopBackOff   5          10m
# Inspect the Pod details
[root@k8s-master ~]# kubectl describe pod fgedu-app-abc12-xyz789 -n fgedu-prod
Events:
  Type     Reason   Age              From     Message
  ----     ------   ----             ----     -------
  Normal   Pulling  5m (x4 over 5m)  kubelet  Pulling image "fgedu/app:v1.0"
  Warning  Failed   5m (x4 over 5m)  kubelet  Failed to pull image "fgedu/app:v1.0": rpc error: code = Unknown
  Warning  Failed   5m (x4 over 5m)  kubelet  Error: ErrImagePull
  Normal   BackOff  5m (x6 over 5m)  kubelet  Back-off pulling image
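# Fix: check the image name and registry access.
# The failing reference "fgedu/app:v1.0" carries no registry prefix; the manual pull below uses the private registry at 192.168.1.100:30002.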
[root@k8s-master ~]# docker pull 192.168.1.100:30002/fgedu/app:v1.0
v1.0: Pulling from fgedu/app
Digest: sha256:abc123
Status: Downloaded newer image
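The events above show the pull of "fgedu/app:v1.0" failing, while the manual pull with the 192.168.1.100:30002 prefix succeeds. A minimal sketch of how that comparison could be scripted follows. The function names `pod_images` and `image_has_registry` are hypothetical, and the registry check assumes the usual image reference convention: the first path component counts as a registry host when it contains a dot or a colon, or equals localhost.

```shell
# Print the image reference(s) requested by the containers of a Pod.
# Usage: pod_images <pod> <namespace>
pod_images() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.containers[*].image}'
}

# Succeed (exit 0) when an image reference starts with an explicit registry host.
# Usage: image_has_registry <image-reference>
image_has_registry() {
  case "$1" in
    */*) first=${1%%/*} ;;   # text before the first slash
    *)   return 1 ;;         # no slash at all, e.g. "busybox:1.36"
  esac
  case "$first" in
    localhost|*.*|*:*) return 0 ;;
    *)                 return 1 ;;
  esac
}
```

For example, `image_has_registry "$(pod_images fgedu-app-abc12-xyz789 fgedu-prod)" || echo "no registry prefix"` would flag the failing Pod if its spec really contains the unprefixed reference shown in the events.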
# Find out why the Pod is in CrashLoopBackOff
[root@k8s-master ~]# kubectl logs fgedu-app-abc12-abc12 -n fgedu-prod --previous
2026-04-04 23:30:00 ERROR Failed to connect to database: Connection refused
2026-04-04 23:30:00 ERROR Application startup failed
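The previous container's log points at a refused database connection. As a rough sketch, the two checks that usually come first for CrashLoopBackOff (the last termination reason and the error lines of the previous run) can be wrapped as follows. The function names `last_termination` and `previous_errors` are hypothetical, and matching on the word ERROR assumes the log format shown above.

```shell
# Print name, termination reason and exit code of each container's last run.
# Usage: last_termination <pod> <namespace>
last_termination() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Print only the ERROR lines from the previous (crashed) container instance.
# Usage: previous_errors <pod> <namespace>
previous_errors() {
  kubectl logs "$1" -n "$2" --previous | grep 'ERROR'
}
```

A call such as `previous_errors fgedu-app-abc12-abc12 fgedu-prod` would narrow the output to the two ERROR lines above.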
Part02-Network Troubleshooting
2.1 Service unreachable
[root@k8s-master ~]# kubectl get svc -n fgedu-prod
NAME        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
fgedu-web   ClusterIP   10.96.100.100
# Check the Endpoints
[root@k8s-master ~]# kubectl get endpoints fgedu-web -n fgedu-prod
NAME        ENDPOINTS   AGE
fgedu-web   <none>
# The Service has no endpoints, so its selector matches no Pods. Check the Pod labels
[root@k8s-master ~]# kubectl get pods -n fgedu-prod --show-labels
NAME                     READY   STATUS    LABELS
fgedu-app-abc12-xyz789   1/1     Running   app=fgedu-app
# Fix the Service selector so that it matches the Pod label
[root@k8s-master ~]# kubectl patch svc fgedu-web -n fgedu-prod -p '{"spec":{"selector":{"app":"fgedu-app"}}}'
service/fgedu-web patched
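After patching the selector, the same two facts checked by hand above can be verified again: whether the Service has endpoints, and how many Pods carry the label the selector asks for. The sketch below is one possible way to do that; `service_has_endpoints` and `count_matching_pods` are hypothetical helper names.

```shell
# Succeed (exit 0) when the Service has at least one ready endpoint address.
# Usage: service_has_endpoints <service> <namespace>
service_has_endpoints() {
  ips=$(kubectl get endpoints "$1" -n "$2" -o jsonpath='{.subsets[*].addresses[*].ip}')
  [ -n "$ips" ]
}

# Count the Pods in a namespace that match a label selector such as app=fgedu-app.
# Usage: count_matching_pods <namespace> <selector>
count_matching_pods() {
  kubectl get pods -n "$1" -l "$2" --no-headers 2>/dev/null | awk 'END { print NR }'
}
```

Typical use: `service_has_endpoints fgedu-web fgedu-prod || count_matching_pods fgedu-prod app=fgedu-app`. A count of 0 for the selector the Service actually uses would explain an empty endpoint list.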
# Test DNS resolution
[root@k8s-master ~]# kubectl run dns-test --image=busybox --rm -it -- nslookup fgedu-web.fgedu-prod
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name:      fgedu-web.fgedu-prod
Address 1: 10.96.100.100 fgedu-web.fgedu-prod.svc.cluster.local
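The lookup above resolves the short name fgedu-web.fgedu-prod to the fully qualified name fgedu-web.fgedu-prod.svc.cluster.local. When a short name fails to resolve, testing the full name directly helps separate search-domain problems from DNS outages. A small sketch follows; `svc_fqdn` and `dns_test` are hypothetical names, and the default domain cluster.local is an assumption taken from the output above.

```shell
# Build the fully qualified DNS name of a Service.
# Usage: svc_fqdn <service> <namespace> [cluster-domain]
# The cluster domain defaults to cluster.local (assumption, matches the output above).
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Resolve the fully qualified name from a throwaway busybox Pod.
# Usage: dns_test <service> <namespace>
dns_test() {
  kubectl run dns-test --image=busybox --rm -i --restart=Never -- \
    nslookup "$(svc_fqdn "$1" "$2")"
}
```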
Part03-Storage Troubleshooting
3.1 PVC fails to bind
[root@k8s-master ~]# kubectl get pvc -n fgedu-prod
NAME             STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS        AGE
fgedu-data-pvc   Pending                                      fgedu-nfs-storage   5m
# Inspect the PVC details
[root@k8s-master ~]# kubectl describe pvc fgedu-data-pvc -n fgedu-prod
Events:
  Type    Reason                Age  From                         Message
  ----    ------                ---- ----                         -------
  Normal  WaitForFirstConsumer  5m   persistentvolume-controller  waiting for first consumer to be created before binding
# WaitForFirstConsumer means binding is deferred until a Pod that uses the PVC is scheduled. Check the StorageClass
[root@k8s-master ~]# kubectl get storageclass
NAME                PROVISIONER        AGE
fgedu-nfs-storage   nfs.fgedu.net.cn   10m
# Check the provisioner status
[root@k8s-master ~]# kubectl get pods -n kube-system -l app=nfs-provisioner
NAME                           READY   STATUS    RESTARTS   AGE
nfs-provisioner-abc12-xyz789   1/1     Running   0          10m
# Manually create a PV as a test
[root@k8s-master ~]# cat > test-pv.yaml << 'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 192.168.1.100
    path: /data/k8s-storage
EOF
[root@k8s-master ~]# kubectl apply -f test-pv.yaml
persistentvolume/test-pv created
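Whether the test PV helps depends on details the transcript does not show. A claim that names a StorageClass binds only to PVs of the same class, and test-pv as written sets no storageClassName, so it may stay Available while fgedu-data-pvc remains Pending. The sketch below, with hypothetical helper names, gathers the three facts worth comparing: which claims are unbound, the binding mode of the class, and the phase of the PV.

```shell
# List the PVCs in a namespace whose STATUS column is not Bound.
# Usage: pending_pvcs <namespace>
pending_pvcs() {
  kubectl get pvc -n "$1" --no-headers | awk '$2 != "Bound" { print $1 }'
}

# Print the volumeBindingMode of a StorageClass (Immediate or WaitForFirstConsumer).
# Usage: sc_binding_mode <storageclass>
sc_binding_mode() {
  kubectl get storageclass "$1" -o jsonpath='{.volumeBindingMode}'
}

# Print the phase of a PV (for example Available, Bound or Released).
# Usage: pv_phase <pv>
pv_phase() {
  kubectl get pv "$1" -o jsonpath='{.status.phase}'
}
```

If `sc_binding_mode fgedu-nfs-storage` prints WaitForFirstConsumer, a Pending claim with no consuming Pod is expected behavior rather than a fault.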
Part04-Node Troubleshooting
4.1 Node NotReady
[root@k8s-master ~]# kubectl get nodes
NAME         STATUS     ROLES           AGE   VERSION
k8s-master   Ready      control-plane   10d   v1.28.3
k8s-node1    NotReady
k8s-node2    Ready
# Inspect the node details
[root@k8s-master ~]# kubectl describe node k8s-node1
Conditions:
  Type            Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
  ----            ------  -----------------                ------------------               ------                      -------
  MemoryPressure  False   Sat, 04 Apr 2026 23:00:00 +0800  Sat, 04 Apr 2026 22:00:00 +0800  KubeletHasSufficientMemory  kubelet has sufficient memory
  DiskPressure    False   Sat, 04 Apr 2026 23:00:00 +0800  Sat, 04 Apr 2026 22:00:00 +0800  KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure     False   Sat, 04 Apr 2026 23:00:00 +0800  Sat, 04 Apr 2026 22:00:00 +0800  KubeletHasSufficientPID     kubelet has sufficient PID
  Ready           False   Sat, 04 Apr 2026 23:00:00 +0800  Sat, 04 Apr 2026 22:30:00 +0800  KubeletNotReady             container runtime not ready
# Log in to the node and check kubelet
[root@k8s-node1 ~]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet Server
   Active: inactive (dead)
[root@k8s-node1 ~]# systemctl restart kubelet
[root@k8s-node1 ~]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet Server
   Active: active (running)
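The Ready condition above reports "container runtime not ready", so it is worth checking the runtime service as well as kubelet. The following sketch prints the state of several systemd units in one pass. `unit_states` is a hypothetical helper, and the unit name containerd used in the usage note is an assumption, since the transcript does not say which runtime the node uses.

```shell
# Print "<unit> <state>" for every systemd unit given as an argument.
# Usage: unit_states <unit> [<unit> ...]
unit_states() {
  for unit in "$@"; do
    # is-active exits non-zero for inactive units, so ignore the exit status
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    printf '%s %s\n' "$unit" "${state:-unknown}"
  done
}
```

On the node this could be run as `unit_states kubelet containerd`. If kubelet keeps stopping after a restart, `journalctl -u kubelet -n 50 --no-pager` shows its most recent log lines.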
# Verify that the node has recovered
[root@k8s-master ~]# kubectl get nodes
NAME         STATUS     ROLES           AGE   VERSION
k8s-master   Ready      control-plane   10d   v1.28.3
k8s-node1    Ready
k8s-node2    Ready
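On a larger cluster, scanning the node list by eye gets tedious. A minimal sketch that prints only the nodes whose STATUS does not start with Ready follows; `not_ready_nodes` is a hypothetical helper name. Nodes shown as Ready,SchedulingDisabled are treated as ready here, which is a design choice rather than a rule.

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Usage: not_ready_nodes
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}
```

An empty result after restarting kubelet confirms the recovery shown above.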
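- Use describe to view detailed information
- Check the event log to locate the problem
- Verify that the resource configuration is correct
- Check network connectivity
- Verify the storage configuration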
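The second item in the list, checking events, can be done for a whole namespace at once instead of one object at a time. A short sketch, with the hypothetical helper name `recent_warnings`, that lists only Warning events in chronological order:

```shell
# List Warning events in a namespace, oldest first.
# Usage: recent_warnings <namespace>
recent_warnings() {
  kubectl get events -n "$1" --field-selector type=Warning --sort-by=.lastTimestamp
}
```

Running `recent_warnings fgedu-prod` would surface entries such as the failed image pulls from Part01 without describing each Pod separately.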
Compiled and published by Fengge Tutorials for learning and testing purposes only. Cite the source when reposting: http://www.fgedu.net.cn/10327.html
