
Linux Tutorial FG464-Kubernetes Troubleshooting in Practice

Summary: This Fengge tutorial draws on the official documentation for Linux, Red Hat Enterprise Linux, Ansible Automation Platform, Docker, Kubernetes, and Podman, and explains in detail how to configure and use the technologies covered.

Fengge's note:

This document walks through practical troubleshooting cases for common Kubernetes failures.

Part01-Pod Troubleshooting

1.1 Pod Fails to Start

# Check Pod status
[root@k8s-master ~]# kubectl get pods -n fgedu-prod
NAME                     READY   STATUS             RESTARTS   AGE
fgedu-app-abc12-xyz789   0/1     ImagePullBackOff   0          5m
fgedu-app-abc12-abc12    0/1     CrashLoopBackOff   5          10m
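
On a busy namespace the failing Pods can be hard to spot by eye. The following is a minimal sketch of a filter for the listing above; the function name `list_unhealthy_pods` is hypothetical, and it assumes the default column order (NAME, READY, STATUS) of `kubectl get pods`.

```shell
# Print NAME and STATUS for every Pod that needs attention.
# Reads the plain-text output of `kubectl get pods` on standard input.
# A Pod is reported when its STATUS is not Running, or when it is Running
# but not all of its containers are ready. Completed Pods are skipped.
list_unhealthy_pods() {
    awk 'NR > 1 && $3 != "Completed" {
        split($2, ready, "/")
        if ($3 != "Running" || ready[1] != ready[2]) print $1, $3
    }'
}

# Example (requires cluster access):
#   kubectl get pods -n fgedu-prod | list_unhealthy_pods
```

The filter only narrows down the list; the cause still has to be read from the Pod's events or logs.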

# Inspect the Pod in detail
[root@k8s-master ~]# kubectl describe pod fgedu-app-abc12-xyz789 -n fgedu-prod
Events:
Type     Reason   Age               From     Message
----     ------   ----              ----     -------
Normal   Pulling  5m (x4 over 5m)   kubelet  Pulling image "fgedu/app:v1.0"
Warning  Failed   5m (x4 over 5m)   kubelet  Failed to pull image "fgedu/app:v1.0": rpc error: code = Unknown
Warning  Failed   5m (x4 over 5m)   kubelet  Error: ErrImagePull
Normal   BackOff  5m (x6 over 5m)   kubelet  Back-off pulling image

# Fix: check the image name and access to the registry
[root@k8s-master ~]# docker pull 192.168.1.100:30002/fgedu/app:v1.0
v1.0: Pulling from fgedu/app
Digest: sha256:abc123
Status: Downloaded newer image

# Find the cause of the CrashLoopBackOff (logs of the previous, crashed container)
[root@k8s-master ~]# kubectl logs fgedu-app-abc12-abc12 -n fgedu-prod --previous
2026-04-04 23:30:00 ERROR Failed to connect to database: Connection refused
2026-04-04 23:30:00 ERROR Application startup failed
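
The two cases above follow a simple rule: image pull problems show up in the Pod's events, while crash loops show up in the logs of the previous container. A small sketch of that rule, assuming a hypothetical helper name `next_step_for_status`, might look like this:

```shell
# Print the diagnostic command this section uses for a given Pod STATUS.
# Arguments: STATUS, Pod name, namespace.
next_step_for_status() {
    status=$1
    pod=$2
    ns=$3
    case $status in
        CrashLoopBackOff)
            # The container started and then exited, so its last logs matter.
            echo "kubectl logs $pod -n $ns --previous" ;;
        *)
            # ImagePullBackOff, ErrImagePull, Pending and similar states:
            # the container never ran, so read the events instead.
            echo "kubectl describe pod $pod -n $ns" ;;
    esac
}

next_step_for_status CrashLoopBackOff fgedu-app-abc12-abc12 fgedu-prod
```

The function only prints the command, so it is safe to run without cluster access.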

Part02-Network Troubleshooting

2.1 Service Unreachable

# Check Service status
[root@k8s-master ~]# kubectl get svc -n fgedu-prod
NAME        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
fgedu-web   ClusterIP   10.96.100.100   <none>        80/TCP    10m

# Check Endpoints (an empty list means the selector matches no ready Pod)
[root@k8s-master ~]# kubectl get endpoints fgedu-web -n fgedu-prod
NAME        ENDPOINTS   AGE
fgedu-web   <none>      10m

# Check Pod labels
[root@k8s-master ~]# kubectl get pods -n fgedu-prod --show-labels
NAME                     READY   STATUS    LABELS
fgedu-app-abc12-xyz789   1/1     Running   app=fgedu-app

# Fix the Service selector so that it matches the Pod labels
[root@k8s-master ~]# kubectl patch svc fgedu-web -n fgedu-prod -p '{"spec":{"selector":{"app":"fgedu-app"}}}'
service/fgedu-web patched
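
A Service only gets endpoints when every key=value pair in its selector is present in a Pod's labels. The sketch below shows that comparison on plain strings; `selector_matches` is a hypothetical helper, and both arguments are assumed to be comma-separated `key=value` lists such as the LABELS column prints.

```shell
# Return success if every key=value pair of the selector (first argument)
# also appears in the Pod's label list (second argument).
selector_matches() {
    selector=$1
    labels=$2
    # A Service without a selector gets no automatically managed endpoints.
    [ -n "$selector" ] || return 1
    old_ifs=$IFS
    IFS=,
    for pair in $selector; do
        case ",$labels," in
            *",$pair,"*) ;;                      # this pair is present
            *) IFS=$old_ifs; return 1 ;;         # missing pair: no match
        esac
    done
    IFS=$old_ifs
    return 0
}

if selector_matches "app=fgedu-app" "app=fgedu-app,tier=web"; then
    echo "selector matches"
fi
```

Extra labels on the Pod do not matter; only a selector pair that is missing from the Pod prevents the match.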

# Test DNS resolution
[root@k8s-master ~]# kubectl run dns-test --image=busybox --rm -it -- nslookup fgedu-web.fgedu-prod
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name: fgedu-web.fgedu-prod
Address 1: 10.96.100.100 fgedu-web.fgedu-prod.svc.cluster.local
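
The lookup result shows how the short name expands to `<service>.<namespace>.svc.<cluster domain>`. As a small illustration, the hypothetical helper `service_fqdn` below builds that name; it assumes the cluster domain is `cluster.local`, as in the output above, unless a third argument says otherwise.

```shell
# Build the fully qualified DNS name of a Service.
# Arguments: Service name, namespace, optional cluster domain.
service_fqdn() {
    svc=$1
    ns=$2
    domain=${3:-cluster.local}   # assumed default cluster domain
    printf '%s.%s.svc.%s\n' "$svc" "$ns" "$domain"
}

service_fqdn fgedu-web fgedu-prod
```

Testing with the full name is useful when a client runs in a different namespace, where the bare Service name alone would not resolve.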

Part03-Storage Troubleshooting

3.1 PVC Cannot Bind

# Check PVC status
[root@k8s-master ~]# kubectl get pvc -n fgedu-prod
NAME             STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS        AGE
fgedu-data-pvc   Pending                                      fgedu-nfs-storage   5m

# Inspect the PVC in detail
[root@k8s-master ~]# kubectl describe pvc fgedu-data-pvc -n fgedu-prod
Events:
Type    Reason                Age   From                         Message
----    ------                ----  ----                         -------
Normal  WaitForFirstConsumer  5m    persistentvolume-controller  waiting for first consumer to be created before binding
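
The `WaitForFirstConsumer` event is worth reading carefully: with that volume binding mode a PVC stays Pending by design until a Pod that uses it is scheduled, so Pending alone does not prove a fault. The sketch below, with the hypothetical helper `explain_binding_mode`, turns a StorageClass `volumeBindingMode` value into a hint.

```shell
# Explain what a StorageClass volumeBindingMode means for a Pending PVC.
explain_binding_mode() {
    case $1 in
        WaitForFirstConsumer)
            echo "binding is deferred until a Pod using the PVC is scheduled" ;;
        Immediate)
            echo "binding should happen right away; check the provisioner" ;;
        *)
            echo "unrecognized volumeBindingMode: $1" ;;
    esac
}

# Example (requires cluster access):
#   mode=$(kubectl get storageclass fgedu-nfs-storage -o jsonpath='{.volumeBindingMode}')
#   explain_binding_mode "$mode"
explain_binding_mode WaitForFirstConsumer
```

If the mode is `WaitForFirstConsumer` and no Pod references the PVC yet, creating the consuming Pod is the next step rather than debugging the provisioner.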

# Check the StorageClass
[root@k8s-master ~]# kubectl get storageclass
NAME                PROVISIONER        AGE
fgedu-nfs-storage   nfs.fgedu.net.cn   10m

# Check the provisioner status
[root@k8s-master ~]# kubectl get pods -n kube-system -l app=nfs-provisioner
NAME                           READY   STATUS    RESTARTS   AGE
nfs-provisioner-abc12-xyz789   1/1     Running   0          10m

# Manually create a PV as a test
[root@k8s-master ~]# cat > test-pv.yaml << 'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 192.168.1.100
    path: /data/k8s-storage
EOF
[root@k8s-master ~]# kubectl apply -f test-pv.yaml
persistentvolume/test-pv created

Part04-Node Troubleshooting

4.1 Node NotReady

# Check node status
[root@k8s-master ~]# kubectl get nodes
NAME         STATUS     ROLES           AGE   VERSION
k8s-master   Ready      control-plane   10d   v1.28.3
k8s-node1    NotReady   <none>          10d   v1.28.3
k8s-node2    Ready      <none>          10d   v1.28.3
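
For monitoring or a quick check on a larger cluster, the node listing can be filtered in the same way. This is a minimal sketch with a hypothetical function name, `list_not_ready_nodes`, assuming STATUS is the second column of `kubectl get nodes`.

```shell
# Print the names of nodes whose STATUS does not start with "Ready".
# Reads the plain-text output of `kubectl get nodes` on standard input.
# A cordoned but healthy node ("Ready,SchedulingDisabled") is not reported.
list_not_ready_nodes() {
    awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Example (requires cluster access):
#   kubectl get nodes | list_not_ready_nodes
```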

# Inspect the node in detail
[root@k8s-master ~]# kubectl describe node k8s-node1
Conditions:
Type            Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
----            ------  -----------------                ------------------               ------                      -------
MemoryPressure  False   Sat, 04 Apr 2026 23:00:00 +0800  Sat, 04 Apr 2026 22:00:00 +0800  KubeletHasSufficientMemory  kubelet has sufficient memory
DiskPressure    False   Sat, 04 Apr 2026 23:00:00 +0800  Sat, 04 Apr 2026 22:00:00 +0800  KubeletHasNoDiskPressure    kubelet has no disk pressure
PIDPressure     False   Sat, 04 Apr 2026 23:00:00 +0800  Sat, 04 Apr 2026 22:00:00 +0800  KubeletHasSufficientPID     kubelet has sufficient PID
Ready           False   Sat, 04 Apr 2026 23:00:00 +0800  Sat, 04 Apr 2026 22:30:00 +0800  KubeletNotReady             container runtime not ready
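
The Conditions table reads in two directions: the pressure conditions are healthy when False, while Ready is healthy when True. A rough sketch of that rule follows; `abnormal_conditions` is a hypothetical helper that expects the condition rows (Type in the first column, Status in the second) on standard input.

```shell
# Print Type=Status for every node condition that is in an unhealthy state.
# Ready must be True; *Pressure and NetworkUnavailable must not be True.
abnormal_conditions() {
    awk '
        $1 == "Ready" && $2 != "True" { print $1 "=" $2 }
        ($1 ~ /Pressure$/ || $1 == "NetworkUnavailable") && $2 == "True" { print $1 "=" $2 }
    '
}

# Example (requires cluster access; assumes no other line of the
# describe output starts with one of these condition names):
#   kubectl describe node k8s-node1 | abnormal_conditions
```

For the table above this leaves only the Ready condition, which points at the kubelet or the container runtime on the node.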

# Log in to the node and check kubelet
[root@k8s-node1 ~]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet Server
Active: inactive (dead)

[root@k8s-node1 ~]# systemctl restart kubelet
[root@k8s-node1 ~]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet Server
Active: active (running)
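
Restarting kubelet brings the node back, but it also hides why the service stopped. The sketch below wraps the same steps in a hypothetical function, `ensure_kubelet_running`, that prints the most recent kubelet journal lines before restarting; the choice of 20 lines is arbitrary.

```shell
# Restart kubelet only if it is not active, and show recent logs first.
# Meant to be run as root on the affected node.
ensure_kubelet_running() {
    if systemctl is-active --quiet kubelet; then
        echo "kubelet is already active"
        return 0
    fi
    # Recent journal entries usually name the reason kubelet stopped.
    journalctl -u kubelet -n 20 --no-pager
    systemctl restart kubelet
    # Exit status reflects whether kubelet is active after the restart.
    systemctl is-active kubelet
}

# Example (on the node):
#   ensure_kubelet_running
```

If kubelet stops again shortly after the restart, the journal output is the place to continue, since the restart did not address the underlying cause.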

# Verify that the node has recovered
[root@k8s-master ~]# kubectl get nodes
NAME         STATUS   ROLES           AGE   VERSION
k8s-master   Ready    control-plane   10d   v1.28.3
k8s-node1    Ready    <none>          10d   v1.28.3
k8s-node2    Ready    <none>          10d   v1.28.3

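Fengge's troubleshooting recommendations:

  • Use describe to view detailed information
  • Check events and logs to locate the problem
  • Verify that resource configuration is correct
  • Check network connectivity
  • Verify storage configuration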
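
The first two recommendations can be combined into one routine for a single Pod. The following is a sketch only: `triage_pod` is a hypothetical name, and the 50-line log tail is an arbitrary choice.

```shell
# Collect the basic diagnostics for one Pod: description, related events
# in chronological order, and the tail of the previous container's logs.
# Arguments: namespace, Pod name.
triage_pod() {
    ns=$1
    pod=$2
    kubectl describe pod "$pod" -n "$ns"
    kubectl get events -n "$ns" \
        --field-selector "involvedObject.name=$pod" \
        --sort-by=.metadata.creationTimestamp
    # --previous fails when the container has never restarted.
    kubectl logs "$pod" -n "$ns" --previous --tail=50 ||
        echo "no previous container logs for $pod"
}

# Example (requires cluster access):
#   triage_pod fgedu-prod fgedu-app-abc12-abc12
```

Network and storage checks depend on the specific Service or PVC involved, so they are left out of this routine.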

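This article was compiled and published by Fengge Tutorials and is intended for learning and testing only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html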
