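Kubernetes Tutorial FG029: Common Kubernetes Errors and Solutions in Practice

This document from the Fengge (风哥) tutorial series covers common Kubernetes errors and their solutions: an overview of errors, what counts as a solution, Kubernetes-specific errors, error prevention, solution strategy, best-practice planning, Pod, Service, node, and control plane errors, and worked cases for each. It draws on the official Kubernetes documentation and related troubleshooting material, and is intended for DevOps engineers and system administrators to use for study and testing. Verify everything yourself before applying it to a production environment.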

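Part01-Basic Concepts and Theory

1.1 Error Overview

An error is a fault or exception that occurs while a system is running and that can break functionality or make a service unavailable. Kubernetes errors fall into several groups, including Pod errors, Service errors, node errors, and control plane errors, and each needs to be identified and handled promptly.

1.2 What a Solution Is

A solution is the way an error is handled. It can be temporary or root-cause: a temporary fix restores service quickly, while a root-cause fix removes the underlying problem so it does not recur.

1.3 Kubernetes Errors

Kubernetes errors are faults or exceptions that occur while a cluster is running, such as abnormal Pod states, unavailable Services, abnormal node states, and failed control plane components. Kubernetes offers several ways to identify and handle them, including kubectl commands, event inspection, and log analysis.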
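As a first look at a misbehaving cluster, the events and logs mentioned above can be pulled with a couple of kubectl calls. The sketch below wraps them in two shell functions. The function names, the default namespace, and the default line count are illustrative assumptions, not part of the original tutorial; only the kubectl subcommands and flags are real.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; the kubectl flags used here are standard.

# List Warning events in a namespace, oldest first.
warning_events() {
  local ns="${1:-default}"
  kubectl get events -n "$ns" \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}

# Show the last N log lines of a Pod (N defaults to 50).
recent_logs() {
  local pod="$1" ns="${2:-default}" lines="${3:-50}"
  kubectl logs "$pod" -n "$ns" --tail="$lines"
}
```

For example, `warning_events kube-system` narrows the event stream to warnings before reading any logs, which is often enough to tell a scheduling problem from an application crash.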

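Part02-Production Planning and Recommendations

2.1 Error Prevention

Preventing Kubernetes errors in production:

# Error prevention
– Monitoring and alerting:
  – Deploy Prometheus and Grafana to monitor cluster and application state
  – Configure alert rules so anomalies are reported promptly
  – Build dashboards that show system state at a glance
  – Review the monitoring configuration regularly to confirm it still works
– Log management:
  – Set up centralized logging, such as the ELK Stack or Loki
  – Standardize log formats to simplify analysis and queries
  – Set a log retention policy that balances storage cost against query needs
  – Purge expired logs regularly so storage does not fill up
– Configuration management:
  – Manage application configuration with ConfigMaps and Secrets
  – Keep configuration files under version control for rollback and auditing
  – Review configuration files regularly to confirm they are correct
  – Avoid hard-coded configuration; use environment variables or configuration files
– Resource management:
  – Set resource requests and limits on every Pod
  – Use resource quotas to cap each namespace's resource usage
  – Review resource usage regularly and adjust allocations as needed
  – Avoid overcommitting resources, which leads to contention
– Network management:
  – Configure network policies to block unnecessary traffic
  – Monitor network performance to catch problems early
  – Isolate networks to improve security
  – Test connectivity regularly to confirm the network is reliable
– Security management:
  – Restrict Pod privileges with Pod security controls (Pod Security Admission on Kubernetes 1.25 and later, where PodSecurityPolicy has been removed)
  – Control access with RBAC
  – Scan for vulnerabilities regularly and patch promptly
  – Monitor security events and respond promptly
– Documentation and process:
  – Define an error-handling process with clear owners and steps
  – Record how each error was handled to build up experience
  – Write error-handling documentation to guide the team
  – Train regularly to improve the team's error-handling skills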

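2.2 Solution Strategy

Solution strategy for Kubernetes in production:

# Solution strategy
– Temporary fixes:
  – Restart the Pod or service to restore service quickly
  – Scale out the number of Pods to relieve load
  – Fail over to a standby node or cluster to keep the service available
  – Roll back to the previous version to sidestep problems in a new release
– Root-cause fixes:
  – Analyze the error to find the underlying problem
  – Fix the application code or configuration
  – Tune the system configuration to improve stability
  – Add monitoring and alerting so the problem is caught early next time
– Preventive measures:
  – Check system state regularly to catch latent problems
  – Run automated tests to safeguard code quality
  – Maintain a disaster recovery plan to improve reliability
  – Train team members regularly to improve their error-handling skills
– Team collaboration:
  – Set up cross-team collaboration for handling errors
  – Define communication channels so information travels quickly
  – Share error-handling experience across the team
  – Review past incidents regularly and keep improving
– Tooling:
  – Use monitoring tools to detect anomalies promptly
  – Use log analysis tools to locate problems quickly
  – Use diagnostic tools to dig into root causes
  – Use automation to make error handling more efficient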

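2.3 Best-Practice Planning

Best-practice planning for common Kubernetes errors and solutions in production:

# Best-practice planning
– Establish an error-handling process:
  – Define owners and steps for handling errors
  – Grade errors by severity and handle the most severe first
  – Set response-time targets so errors are handled promptly
  – Define an escalation path so handling does not stall
– Write error-handling documentation:
  – Record common errors and their solutions
  – Write an error-handling guide for the team
  – Build a searchable knowledge base of errors
  – Update the documentation regularly so it stays current
– Implement monitoring and alerting:
  – Deploy Prometheus and Grafana to monitor the cluster and applications
  – Configure alert rules so anomalies are reported promptly
  – Build dashboards that show system state at a glance
  – Review the monitoring configuration regularly to confirm it still works
– Automate error handling:
  – Use automation tools such as Kubernetes Operators
  – Configure automatic restart policies for common errors
  – Enable autoscaling to cope with load changes
  – Set up automatic backup and restore to protect data
– Run regular drills:
  – Rehearse error handling regularly to build team skills
  – Simulate common error scenarios and test the solutions
  – Evaluate how well the handling worked and keep improving
  – Feed the lessons from each drill back into the documentation
– Improve continuously:
  – Review past incidents regularly and identify improvements
  – Tune the system configuration to improve stability
  – Refine the error-handling process to make it more efficient
  – Train team members to improve their error-handling skills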

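Part03-Production Implementation Plan

3.1 Pod Errors

Handling Kubernetes Pod errors in production:

# Pod errors
– Abnormal Pod states:
  – CrashLoopBackOff: the container crashes repeatedly and the kubelet backs off between restarts
  – ImagePullBackOff: the image pull failed
  – Pending: the Pod is waiting to be scheduled
  – Failed: the Pod terminated with a failure
  – Unknown: the Pod state cannot be determined
– Common Pod errors and solutions:
  1. CrashLoopBackOff:
    – Check the Pod logs: kubectl logs <pod-name>
    – Check the container state: kubectl describe pod <pod-name>
    – Fix the application code or configuration
    – Restart the Pod: kubectl delete pod <pod-name> (a controller such as a Deployment recreates it)
  2. ImagePullBackOff:
    – Check that the image name and tag are correct
    – Check access permissions on the image registry
    – Check network connectivity
    – Pull the image manually on the node: docker pull <image> (or crictl pull <image> on containerd nodes)
  3. Pending:
    – Check that the nodes have enough free resources
    – Check that the nodes are Ready
    – Check the Pod's scheduling constraints
    – Scale out the cluster or lower the Pod's resource requests
  4. Failed:
    – Check the Pod logs: kubectl logs <pod-name>
    – Check the container state: kubectl describe pod <pod-name>
    – Fix the application code or configuration
    – Restart the Pod: kubectl delete pod <pod-name>
  5. Unknown:
    – Check the node status
    – Check network connectivity
    – Restart the kubelet: systemctl restart kubelet
    – Reschedule the Pod: kubectl delete pod <pod-name>
– Pod error handling flow:
  1. Check the Pod status: kubectl get pods
  2. Check the Pod details: kubectl describe pod <pod-name>
  3. Check the Pod logs: kubectl logs <pod-name>
  4. Analyze the cause
  5. Apply the solution
  6. Verify the solution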
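3.2 Service Errors

Handling Kubernetes Service errors in production:

# Service errors
– Abnormal Service states:
  – The Service cannot be reached
  – The Service's backend Pods are unavailable
  – The Service port is misconfigured
  – A network policy blocks the Service
– Common Service errors and solutions:
  1. Service unreachable:
    – Check the Service status: kubectl get services
    – Check the Service details: kubectl describe service <service-name>
    – Check the backend Pod status: kubectl get pods -l <label-selector>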
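A quick way to tell "no healthy backends" apart from "network problem" is to check whether the Service has any endpoint addresses at all. The following is a minimal sketch, assuming the classic Endpoints object is available in the cluster; the function name `service_has_endpoints` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if the Service has at least one
# ready endpoint address, fail (exit 1) if it has none, and return 2 if
# kubectl itself failed.
service_has_endpoints() {
  local svc="$1" ns="${2:-default}" ips
  ips="$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')" || return 2
  [ -n "$ips" ]
}
```

For example, `service_has_endpoints fgedu-app || echo "no ready backends"`. An empty result means the selector matches no ready Pods, so the next step is the Pod checks in section 3.1; a non-empty result points toward ports, network policy, or DNS instead.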

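3.3 Node Errors

Handling Kubernetes node errors in production. The commands below assume Docker as the container runtime; substitute containerd or another runtime's service name where appropriate.

# Node errors
– Node states:
  – NotReady: the node is unavailable
  – Ready: the node is available (the healthy state)
  – Unknown: the node state cannot be determined
– Common node errors and solutions:
  1. NotReady:
    – Check the node status: kubectl get nodes
    – Check the node details: kubectl describe node <node-name>
    – Check the kubelet status: systemctl status kubelet
    – Check the container runtime status: systemctl status docker
    – Restart the kubelet: systemctl restart kubelet
    – Restart the container runtime: systemctl restart docker
  2. Unknown:
    – Check the node's network connectivity
    – Check the kubelet status: systemctl status kubelet
    – Restart the kubelet: systemctl restart kubelet
    – Re-register the node: kubeadm join (with the endpoint and token for your cluster)
– Node error handling flow:
  1. Check the node status: kubectl get nodes
  2. Check the node details: kubectl describe node <node-name>
  3. Check the kubelet status: systemctl status kubelet
  4. Check the container runtime status: systemctl status docker
  5. Analyze the cause
  6. Apply the solution
  7. Verify the solution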
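Step 1 of the node flow can be narrowed to just the problem nodes with a one-line filter. The sketch below is illustrative; the function name `unhealthy_nodes` is an assumption, and it treats any STATUS other than exactly `Ready` as worth a look.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get nodes --no-headers` on stdin and
# print the name and status of every node whose STATUS column is not
# exactly "Ready".
unhealthy_nodes() {
  awk '$2 != "Ready" { print $1 "\t" $2 }'
}
```

Used as `kubectl get nodes --no-headers | unhealthy_nodes`. Cordoned nodes report a status such as `Ready,SchedulingDisabled` and are therefore listed too, which is usually what you want during an incident but is a design choice rather than a rule.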

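3.4 Control Plane Errors

Handling Kubernetes control plane errors in production. The systemctl and journalctl commands below assume the control plane components run as systemd services (a binary installation). On kubeadm clusters they run as static Pods, so inspect them with kubectl get pods -n kube-system and kubectl logs instead.

# Control plane errors
– Abnormal control plane component states:
  – API Server: unreachable
  – etcd: data store failure
  – Controller Manager: controller failure
  – Scheduler: scheduler failure
– Common control plane errors and solutions:
  1. API Server unreachable:
    – Check the API Server status: systemctl status kube-apiserver
    – Check the API Server logs: journalctl -u kube-apiserver
    – Check the etcd status: systemctl status etcd
    – Restart the API Server: systemctl restart kube-apiserver
  2. etcd data store failure:
    – Check the etcd status: systemctl status etcd
    – Check the etcd logs: journalctl -u etcd
    – Check the etcd data directory: ls -la /var/lib/etcd
    – Restore the etcd data: etcdctl snapshot restore <snapshot-file>
  3. Controller Manager failure:
    – Check the Controller Manager status: systemctl status kube-controller-manager
    – Check the Controller Manager logs: journalctl -u kube-controller-manager
    – Restart the Controller Manager: systemctl restart kube-controller-manager
  4. Scheduler failure:
    – Check the Scheduler status: systemctl status kube-scheduler
    – Check the Scheduler logs: journalctl -u kube-scheduler
    – Restart the Scheduler: systemctl restart kube-scheduler
– Control plane error handling flow:
  1. Check the control plane component status: kubectl get pods -n kube-system
  2. Check the component logs: journalctl -u <component>
  3. Analyze the cause
  4. Apply the solution
  5. Verify the solution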
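When the API Server still answers but the cluster misbehaves, its readiness endpoint shows which internal check is failing. The sketch below filters the verbose output of that endpoint; it assumes the usual format in which passing checks start with `[+]` and failing checks with `[-]`, and the function name `failed_readyz_checks` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the output of
#   kubectl get --raw='/readyz?verbose'
# on stdin and print only the names of failing checks, i.e. lines that
# start with "[-]". The marker and everything after the check name are
# stripped.
failed_readyz_checks() {
  sed -n 's/^\[-]\([^ ]*\).*/\1/p'
}
```

Used as `kubectl get --raw='/readyz?verbose' | failed_readyz_checks`. A result of `etcd` points at the data store rather than the API Server itself. If the API Server is down entirely, kubectl cannot reach this endpoint and the host-level checks in the flow above apply instead.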

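Part04-Production Cases and Walkthroughs

4.1 Pod Error Cases

Pod error cases from a production Kubernetes environment.

# Case: CrashLoopBackOff
# Scenario: the Pod crashes repeatedly and its status is CrashLoopBackOff
# Symptoms:
– The Pod status is CrashLoopBackOff
– The application log shows an error message
– The application cannot run normally
# Solution:
1. Check the Pod status: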
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
fgedu-app-6d6f58987b-7f5f8 0/1 CrashLoopBackOff 5 10m
2. Check the Pod details:
$ kubectl describe pod fgedu-app-6d6f58987b-7f5f8
Name: fgedu-app-6d6f58987b-7f5f8
Namespace: default
Priority: 0
Node: fgedu-node1/192.168.1.101
Start Time: Fri, 01 Jan 2024 00:00:00 +0000
Labels: app=fgedu-app
pod-template-hash=6d6f58987b
Annotations:
Status: Running
IP: 10.244.1.2
IPs:
IP: 10.244.1.2
Controlled By: ReplicaSet/fgedu-app-6d6f58987b
Containers:
fgedu-app:
Container ID: docker://1234567890ab
Image: fgedu/app:latest
Image ID: docker-pullable://fgedu/app@sha256:1234567890ab
Port: 8080/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 01 Jan 2024 00:05:00 +0000
Finished: Fri, 01 Jan 2024 00:05:30 +0000
Ready: False
Restart Count: 5
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xyz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-xyz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-xyz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
----  ------  ----  ----  -------
Normal Scheduled 10m default-scheduler Successfully assigned default/fgedu-app-6d6f58987b-7f5f8 to fgedu-node1
Normal Pulling 10m kubelet Pulling image "fgedu/app:latest"
Normal Pulled 10m kubelet Successfully pulled image "fgedu/app:latest"
Normal Created 10m kubelet Created container fgedu-app
Normal Started 10m kubelet Started container fgedu-app
Normal Killing 9m (x5 over 10m) kubelet Container fgedu-app failed liveness probe, will be restarted
Warning Unhealthy 9m (x6 over 10m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 500
Warning BackOff 2m (x22 over 9m) kubelet Back-off restarting failed container
3. Check the Pod logs:
$ kubectl logs fgedu-app-6d6f58987b-7f5f8
Error: Cannot connect to database: dial tcp 10.96.123.45:3306: connect: connection refused
panic: Cannot connect to database
goroutine 1 [running]:
main.main()
/app/main.go:100 +0x123
4. Check the database service:
$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
fgedu-app LoadBalancer 10.96.123.45 192.168.1.200 80:30080/TCP 1d
kubernetes ClusterIP 10.96.0.1 443/TCP 1d
$ kubectl get pods -l app=mysql
No resources found in default namespace.
5. Deploy the database service:
$ cat > mysql-deployment.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        ports:
        - containerPort: 3306
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: "fgedu123"
        - name: MYSQL_DATABASE
          value: "fgedudb"
        - name: MYSQL_USER
          value: "fgedu"
        - name: MYSQL_PASSWORD
          value: "fgedu123"
---
apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: default
spec:
  selector:
    app: mysql
  ports:
  - port: 3306
    targetPort: 3306
EOF
$ kubectl apply -f mysql-deployment.yaml
6. Verify the database service:
$ kubectl get pods -l app=mysql
NAME READY STATUS RESTARTS AGE
mysql-6d6f58987b-7f5f8 1/1 Running 0 5m
$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
fgedu-app LoadBalancer 10.96.123.45 192.168.1.200 80:30080/TCP 1d
kubernetes ClusterIP 10.96.0.1 443/TCP 1d
mysql ClusterIP 10.96.123.46 3306/TCP 5m
7. Verify the application status:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
fgedu-app-6d6f58987b-7f5f8 1/1 Running 0 15m
mysql-6d6f58987b-7f5f8 1/1 Running 0 5m
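In the CrashLoopBackOff case above, `kubectl describe pod` reported `Exit Code: 1` for the last terminated container. The exit code can also be read directly and interpreted with a small helper. This is a sketch under assumptions: the function names are hypothetical, the jsonpath reads only the first container of the Pod, and the interpretation follows the common shell convention that codes above 128 mean "terminated by signal (code minus 128)".

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading and interpreting a container exit code.

# Print the exit code of the last terminated instance of the Pod's
# first container (empty if the container has never terminated).
last_exit_code() {
  local pod="$1" ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Turn a numeric exit code into a short human-readable hint.
explain_exit_code() {
  local code="$1"
  case "$code" in
    ''|*[!0-9]*)
      echo "no exit code recorded"
      return 1 ;;
  esac
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application error (exit code $code)"
  fi
}
```

For the Pod in this case, `explain_exit_code "$(last_exit_code fgedu-app-6d6f58987b-7f5f8)"` would report an application error, which matches the panic in the logs. A code of 137 would instead mean signal 9 (SIGKILL), which in Kubernetes often, though not always, indicates the container was killed for exceeding its memory limit.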
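# Case: ImagePullBackOff
# Scenario: the Pod cannot pull its image and its status is ImagePullBackOff
# Symptoms:
– The Pod status is ImagePullBackOff
– The image pull fails
– The application cannot be deployed
# Solution:
1. Check the Pod status: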
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
fgedu-app-6d6f58987b-7f5f8 0/1 ImagePullBackOff 0 10m
2. Check the Pod details:
$ kubectl describe pod fgedu-app-6d6f58987b-7f5f8
Name: fgedu-app-6d6f58987b-7f5f8
Namespace: default
Priority: 0
Node: fgedu-node1/192.168.1.101
Start Time: Fri, 01 Jan 2024 00:00:00 +0000
Labels: app=fgedu-app
pod-template-hash=6d6f58987b
Annotations:
Status: Pending
IP: 10.244.1.2
IPs:
IP: 10.244.1.2
Controlled By: ReplicaSet/fgedu-app-6d6f58987b
Containers:
fgedu-app:
Container ID: docker://1234567890ab
Image: fgedu/app:latest
Image ID: docker-pullable://fgedu/app@sha256:1234567890ab
Port: 8080/TCP
Host Port: 0/TCP
State: Waiting
Reason: ImagePullBackOff
Ready: False
Restart Count: 0
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xyz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-xyz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-xyz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
----  ------  ----  ----  -------
Normal Scheduled 10m default-scheduler Successfully assigned default/fgedu-app-6d6f58987b-7f5f8 to fgedu-node1
Normal Pulling 10m kubelet Pulling image "fgedu/app:latest"
Warning Failed 10m kubelet Failed to pull image "fgedu/app:latest": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/fgedu/app:latest": failed to resolve reference "docker.io/fgedu/app:latest": pull access denied, repository does not exist or may require authorization
Warning Failed 10m kubelet Error: ErrImagePull
Normal BackOff 10m kubelet Back-off pulling image "fgedu/app:latest"
Warning Failed 10m kubelet Error: ImagePullBackOff
3. Check the image name and tag:
$ kubectl get deployment fgedu-app -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fgedu-app
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fgedu-app
  template:
    metadata:
      labels:
        app: fgedu-app
    spec:
      containers:
      - name: fgedu-app
        image: fgedu/app:latest
        ports:
        - containerPort: 8080
4. Fix the image name:
$ kubectl edit deployment fgedu-app
# Change image to the correct image name
image: nginx:latest
5. Verify the application status:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
fgedu-app-6d6f58987b-7f5f8 1/1 Running 0 15m
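The image fix in step 4 of the ImagePullBackOff case can also be applied without an interactive editor, which is easier to script and to record. The sketch below uses `kubectl set image` followed by `kubectl rollout status`; the function name and the 120-second timeout are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: change one container's image in a Deployment and
# wait for the rollout to finish (or time out).
set_deployment_image() {
  local deploy="$1" container="$2" image="$3" ns="${4:-default}"
  kubectl set image "deployment/$deploy" "$container=$image" -n "$ns" &&
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s
}
```

For the case above this would be `set_deployment_image fgedu-app fgedu-app nginx:latest`. Because the Pod template changes, the Deployment creates a new ReplicaSet, so the replacement Pod normally appears under a new name rather than the one that was failing.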

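4.2 Service Error Cases

Service error cases from a production Kubernetes environment.

# Case: Service unreachable
# Scenario: the Service cannot be reached and the application is unavailable
# Symptoms:
– The Service cannot be reached
– The application is unavailable
– The backend Pods are healthy
# Solution:
1. Check the Service status: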
$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
fgedu-app LoadBalancer 10.96.123.45 192.168.1.200 80:30080/TCP 1d
kubernetes ClusterIP 10.96.0.1 443/TCP 1d
2. Check the Service details:
$ kubectl describe service fgedu-app
Name: fgedu-app
Namespace: default
Labels: app=fgedu-app
Annotations:
Selector: app=fgedu-app
Type: LoadBalancer
IP Families:
IP: 10.96.123.45
IPs: 10.96.123.45
LoadBalancer Ingress: 192.168.1.200
Port: http 80/TCP
TargetPort: 8080/TCP
NodePort: http 30080/TCP
Endpoints: 10.244.1.2:8080,10.244.2.2:8080
Session Affinity: None
External Traffic Policy: Cluster
Events:
3. Check the backend Pod status:
$ kubectl get pods -l app=fgedu-app
NAME READY STATUS RESTARTS AGE
fgedu-app-6d6f58987b-7f5f8 1/1 Running 0 1d
fgedu-app-6d6f58987b-8d2k3 1/1 Running 0 1d
4. Check network connectivity:
$ kubectl run -it --rm --image=busybox:1.28 busybox -- wget -O- http://fgedu-app.default.svc.cluster.local
Connecting to fgedu-app.default.svc.cluster.local (10.96.123.45:80)
wget: server returned error: HTTP/1.1 503 Service Unavailable
5. Check the network policies:
$ kubectl get networkpolicies
NAME AGE
fgedu-app-network-policy 1d
$ kubectl describe networkpolicy fgedu-app-network-policy
Name: fgedu-app-network-policy
Namespace: default
Created on: 2024-01-01 00:00:00 +0000 UTC
Spec:
PodSelector: app=fgedu-app
Allowing ingress traffic:
To Port: 80/TCP
From:
PodSelector: app=fgedu-app
Allowing egress traffic:
To Port: 80/TCP
To:
PodSelector: app=fgedu-app
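6. Modify the network policy:
$ kubectl edit networkpolicy fgedu-app-network-policy
# Allow ingress from all Pods in the namespace.
# NetworkPolicy ports match the Pod's container port (8080 here), not the Service port (80).
ingress:
- from:
  - podSelector: {}
  ports:
  - protocol: TCP
    port: 8080
7. Verify Service access:
$ kubectl run -it --rm --image=busybox:1.28 busybox -- wget -O- http://fgedu-app.default.svc.cluster.local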
Connecting to fgedu-app.default.svc.cluster.local (10.96.123.45:80)
HTTP/1.1 200 OK
Server: nginx/1.21.6
Date: Fri, 01 Jan 2024 00:00:00 GMT
Content-Type: text/html
Content-Length: 615
Last-Modified: Tue, 14 Dec 2021 14:49:29 GMT
Connection: keep-alive
ETag: "61b8a129-267"
Accept-Ranges: bytes

Welcome to nginx!

If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.

For online documentation and support please refer to
nginx.org.
Commercial support is available at
nginx.com.

Thank you for using nginx.

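# Case: Service backend Pods unavailable
# Scenario: the Service's backend Pods are unavailable, so the Service does not work
# Symptoms:
– The Service's backend Pods are unavailable
– The Service does not work
– The application is unavailable
# Solution:
1. Check the Service status: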
$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
fgedu-app LoadBalancer 10.96.123.45 192.168.1.200 80:30080/TCP 1d
kubernetes ClusterIP 10.96.0.1 443/TCP 1d
2. Check the Service details:
$ kubectl describe service fgedu-app
Name: fgedu-app
Namespace: default
Labels: app=fgedu-app
Annotations:
Selector: app=fgedu-app
Type: LoadBalancer
IP Families:
IP: 10.96.123.45
IPs: 10.96.123.45
LoadBalancer Ingress: 192.168.1.200
Port: http 80/TCP
TargetPort: 8080/TCP
NodePort: http 30080/TCP
Endpoints:
Session Affinity: None
External Traffic Policy: Cluster
Events:
3. Check the backend Pod status:
$ kubectl get pods -l app=fgedu-app
NAME READY STATUS RESTARTS AGE
fgedu-app-6d6f58987b-7f5f8 0/1 CrashLoopBackOff 5 10m
fgedu-app-6d6f58987b-8d2k3 0/1 CrashLoopBackOff 5 10m
4. Check the Pod logs:
$ kubectl logs fgedu-app-6d6f58987b-7f5f8
Error: Cannot connect to database: dial tcp 10.96.123.45:3306: connect: connection refused
panic: Cannot connect to database
goroutine 1 [running]:
main.main()
/app/main.go:100 +0x123
5. Check the database service:
$ kubectl get pods -l app=mysql
NAME READY STATUS RESTARTS AGE
mysql-6d6f58987b-7f5f8 1/1 Running 0 1d
6. Check the database connection:
$ kubectl exec -it mysql-6d6f58987b-7f5f8 -- mysql -u fgedu -pfgedu123 fgedudb -e "SELECT 1;"
+---+
| 1 |
+---+
| 1 |
+---+
7. Check the application configuration:
$ kubectl get configmap fgedu-app-config -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: fgedu-app-config
namespace: default
data:
database_url: mysql://fgedu:fgedu123@mysql:3306/fgedudb
8. Fix the application configuration:
$ kubectl edit configmap fgedu-app-config
# Change database_url to the correct address
database_url: mysql://fgedu:fgedu123@mysql.default.svc.cluster.local:3306/fgedudb
9. Restart the application:
$ kubectl rollout restart deployment fgedu-app
10. Verify the application status:
$ kubectl get pods -l app=fgedu-app
NAME READY STATUS RESTARTS AGE
fgedu-app-6d6f58987b-7f5f8 1/1 Running 0 15m
fgedu-app-6d6f58987b-8d2k3 1/1 Running 0 15m
11. Verify the Service status:
$ kubectl describe service fgedu-app
Name: fgedu-app
Namespace: default
Labels: app=fgedu-app
Annotations:
Selector: app=fgedu-app
Type: LoadBalancer
IP Families:
IP: 10.96.123.45
IPs: 10.96.123.45
LoadBalancer Ingress: 192.168.1.200
Port: http 80/TCP
TargetPort: 8080/TCP
NodePort: http 30080/TCP
Endpoints: 10.244.1.2:8080,10.244.2.2:8080
Session Affinity: None
External Traffic Policy: Cluster
Events:
12. Verify Service access:
$ kubectl run -it --rm --image=busybox:1.28 busybox -- wget -O- http://fgedu-app.default.svc.cluster.local
Connecting to fgedu-app.default.svc.cluster.local (10.96.123.45:80)
HTTP/1.1 200 OK
Server: nginx/1.21.6
Date: Fri, 01 Jan 2024 00:00:00 GMT
Content-Type: text/html
Content-Length: 615
Last-Modified: Tue, 14 Dec 2021 14:49:29 GMT
Connection: keep-alive
ETag: "61b8a129-267"
Accept-Ranges: bytes

Welcome to nginx!

If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.

For online documentation and support please refer to
nginx.org.
Commercial support is available at
nginx.com.

Thank you for using nginx.

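4.3 Node Error Cases

Node error cases from a production Kubernetes environment.

# Case: node NotReady
# Scenario: the node status is NotReady and Pods cannot be scheduled onto it
# Symptoms:
– The node status is NotReady
– Pods cannot be scheduled onto the node
– Pods already on the node are in an abnormal state
# Solution:
1. Check the node status: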
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
fgedu-master Ready control-plane,master 1d v1.24.0
fgedu-node1 NotReady 1d v1.24.0
fgedu-node2 Ready 1d v1.24.0
2. Check the node details:
$ kubectl describe node fgedu-node1
Name: fgedu-node1
Roles:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=fgedu-node1
kubernetes.io/os=linux
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 01 Jan 2024 00:00:00 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
----  ------  -----------------  ------------------  ------  -------
Ready False Fri, 01 Jan 2024 01:00:00 +0000 Fri, 01 Jan 2024 01:00:00 +0000 KubeletNotReady PLEG is not healthy: pleg was last seen active 3m ago; threshold is 3m
3. Check the kubelet status:
$ ssh fgedu-node1 systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2024-01-01 00:00:00 UTC; 1d ago
Docs: https://kubernetes.io/docs/
Main PID: 1234 (kubelet)
Tasks: 20
Memory: 100.0M
CPU: 10.0%
CGroup: /system.slice/kubelet.service
└─1234 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml
4. Check the container runtime status:
$ ssh fgedu-node1 systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Fri 2024-01-01 01:00:00 UTC; 10s ago
Docs: https://docs.docker.com
Process: 5678 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=255)
Main PID: 5678 (code=exited, status=255)
5. Start the container runtime:
$ ssh fgedu-node1 systemctl start docker
$ ssh fgedu-node1 systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2024-01-01 01:05:00 UTC; 5s ago
Docs: https://docs.docker.com
Process: 5678 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
Main PID: 5678 (dockerd)
6. Verify the node status:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
fgedu-master Ready control-plane,master 1d v1.24.0
fgedu-node1 Ready 1d v1.24.0
fgedu-node2 Ready 1d v1.24.0
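# Case: node Unknown
# Scenario: the node status is Unknown and it cannot communicate with the cluster
# Symptoms:
– The node status is Unknown
– The node cannot communicate with the cluster
– Pods on the node are in an abnormal state
# Solution:
1. Check the node status: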
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
fgedu-master Ready control-plane,master 1d v1.24.0
fgedu-node1 Unknown 1d v1.24.0
fgedu-node2 Ready 1d v1.24.0
2. Check the node's network connectivity:
$ ping fgedu-node1
PING fgedu-node1 (192.168.1.101) 56(84) bytes of data.
From fgedu-master (192.168.1.100) icmp_seq=1 Destination Host Unreachable
3. Check the node's power state:
# Check the physical state of the node
4. Reboot the node:
# Reboot the node's server
5. Verify the node status:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
fgedu-master Ready control-plane,master 1d v1.24.0
fgedu-node1 Ready 1d v1.24.0
fgedu-node2 Ready 1d v1.24.0

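4.4 Control Plane Error Cases

Control plane error cases from a production Kubernetes environment.

# Case: API Server unreachable
# Scenario: the API Server cannot be reached and the cluster cannot operate normally
# Symptoms:
– kubectl commands time out
– Cluster state cannot be viewed
– Applications cannot be deployed or managed
# Solution:
1. Check the API Server status:
$ systemctl status kube-apiserver
● kube-apiserver.service - Kubernetes API Server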
Loaded: loaded (/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Fri 2024-01-01 00:00:00 UTC; 10s ago
Docs: https://kubernetes.io/docs/
Process: 1234 ExecStart=/usr/local/bin/kube-apiserver $KUBE_API_ARGS (code=exited, status=255)
Main PID: 1234 (code=exited, status=255)
2. Check the API Server logs:
$ journalctl -u kube-apiserver
Jan 01 00:00:00 fgedu-master kube-apiserver[1234]: E0101 00:00:00.123456 1234 storage_decorator.go:114] Unable to create storage backend: config (&{etcd3 {https://127.0.0.1:2379} /etc/kubernetes/pki/apiserver-etcd-client.crt /etc/kubernetes/pki/apiserver-etcd-client.key /etc/kubernetes/pki/etcd/ca.crt true false 10s 1m0s 10s}), err (dial tcp 127.0.0.1:2379: connect: connection refused)
3. Check the etcd status:
$ systemctl status etcd
● etcd.service - etcd
Loaded: loaded (/lib/systemd/system/etcd.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Fri 2024-01-01 00:00:00 UTC; 20s ago
Docs: https://github.com/coreos/etcd
Process: 5678 ExecStart=/usr/local/bin/etcd $ETCD_ARGS (code=exited, status=255)
Main PID: 5678 (code=exited, status=255)
4. Start etcd:
$ systemctl start etcd
$ systemctl status etcd
● etcd.service - etcd
Loaded: loaded (/lib/systemd/system/etcd.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2024-01-01 00:05:00 UTC; 5s ago
Docs: https://github.com/coreos/etcd
Process: 5678 ExecStart=/usr/local/bin/etcd $ETCD_ARGS (code=exited, status=0/SUCCESS)
Main PID: 5678 (etcd)
5. Start the API Server:
$ systemctl start kube-apiserver
$ systemctl status kube-apiserver
● kube-apiserver.service - Kubernetes API Server
Loaded: loaded (/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2024-01-01 00:10:00 UTC; 5s ago
Docs: https://kubernetes.io/docs/
Process: 1234 ExecStart=/usr/local/bin/kube-apiserver $KUBE_API_ARGS (code=exited, status=0/SUCCESS)
Main PID: 1234 (kube-apiserver)
6. Verify the cluster status:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
fgedu-master Ready control-plane,master 1d v1.24.0
fgedu-node1 Ready 1d v1.24.0
fgedu-node2 Ready 1d v1.24.0
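# Case: etcd data store failure
# Scenario: the etcd data store has failed and the cluster cannot operate normally
# Symptoms:
– etcd fails to start
– The API Server cannot be reached
– Cluster state cannot be viewed
# Solution:
1. Check the etcd status:
$ systemctl status etcd
● etcd.service - etcd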
Loaded: loaded (/lib/systemd/system/etcd.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Fri 2024-01-01 00:00:00 UTC; 10s ago
Docs: https://github.com/coreos/etcd
Process: 5678 ExecStart=/usr/local/bin/etcd $ETCD_ARGS (code=exited, status=255)
Main PID: 5678 (code=exited, status=255)
2. Check the etcd logs:
$ journalctl -u etcd
Jan 01 00:00:00 fgedu-master etcd[5678]: E0101 00:00:00.123456 5678 etcdmain.go:292] etcdserver: cannot restore from snapshot /var/lib/etcd/member/snap/db: snapshot file does not exist
3. Check the etcd data directory:
$ ls -la /var/lib/etcd
total 16
drwxr-xr-x 3 etcd etcd 4096 Jan 1 00:00 .
drwxr-xr-x 21 root root 4096 Jan 1 00:00 ..
drwxr-xr-x 2 etcd etcd 4096 Jan 1 00:00 member
4. Restore the etcd data:
$ etcdctl snapshot restore /tmp/etcd-snapshot.db --data-dir=/var/lib/etcd
5. Start etcd:
$ systemctl start etcd
$ systemctl status etcd
● etcd.service - etcd
Loaded: loaded (/lib/systemd/system/etcd.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2024-01-01 00:05:00 UTC; 5s ago
Docs: https://github.com/coreos/etcd
Process: 5678 ExecStart=/usr/local/bin/etcd $ETCD_ARGS (code=exited, status=0/SUCCESS)
Main PID: 5678 (etcd)
6. Start the API Server:
$ systemctl start kube-apiserver
$ systemctl status kube-apiserver
● kube-apiserver.service - Kubernetes API Server
Loaded: loaded (/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2024-01-01 00:10:00 UTC; 5s ago
Docs: https://kubernetes.io/docs/
Process: 1234 ExecStart=/usr/local/bin/kube-apiserver $KUBE_API_ARGS (code=exited, status=0/SUCCESS)
Main PID: 1234 (kube-apiserver)
7. Verify the cluster status:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
fgedu-master Ready control-plane,master 1d v1.24.0
fgedu-node1 Ready 1d v1.24.0
fgedu-node2 Ready 1d v1.24.0
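One practical detail in the etcd case above: the restore targets /var/lib/etcd, which still exists at that point, and `etcdctl snapshot restore` typically refuses to write into a non-empty data directory. A hedged sketch of a safer sequence follows. It assumes etcd runs as the systemd unit `etcd` under the `etcd` user (as the `ls -la` output in the case suggests); the function names, the `DRY_RUN` switch, and the backup suffix are hypothetical conveniences.

```shell
#!/usr/bin/env bash
# Hypothetical restore wrapper. With DRY_RUN set, commands are printed
# instead of executed, so the sequence can be reviewed first.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# restore_etcd <snapshot-file> [data-dir] [backup-suffix]
restore_etcd() {
  local snapshot="$1" data_dir="${2:-/var/lib/etcd}" suffix="${3:-bak}"
  # Make sure etcd is stopped before touching its data.
  run systemctl stop etcd || return 1
  # Move the damaged data directory aside rather than deleting it.
  run mv "$data_dir" "$data_dir.$suffix" || return 1
  # Restore the snapshot into a fresh data directory.
  run etcdctl snapshot restore "$snapshot" --data-dir="$data_dir" || return 1
  # Restored files are owned by the invoking user; hand them back to etcd.
  run chown -R etcd:etcd "$data_dir" || return 1
  run systemctl start etcd
}
```

Running `DRY_RUN=1 restore_etcd /tmp/etcd-snapshot.db` prints the five commands without changing anything. On etcd 3.5 and later the upstream project recommends `etcdutl snapshot restore` for this step, so treat the `etcdctl` call as matching the case above rather than as the only option.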

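Part05-Fengge's Lessons Learned

5.1 Error-Handling Best Practices

Best practices for handling Kubernetes errors:

  • Respond quickly: detect and handle errors promptly to limit their impact
  • Troubleshoot systematically: check the control plane, nodes, applications, and network in turn
  • Analyze logs: read container, system, and application logs for error messages
  • Diagnose the network: test connectivity with tools such as ping, curl, and netstat
  • Monitor resources: check CPU, memory, and storage usage for bottlenecks
  • Check configuration: review application, network, and storage configuration for mistakes
  • Check version compatibility: confirm that the Kubernetes version and container image versions are compatible
  • Test and verify: after a fix, test to confirm the error is resolved

5.2 Solution Best Practices

Best practices for Kubernetes solutions:

  • Temporary fixes: restore service quickly to keep the business running
  • Root-cause fixes: remove the underlying problem so it does not recur
  • Preventive measures: use monitoring and alerting to catch latent problems early
  • Team collaboration: set up cross-team collaboration for handling errors
  • Tooling: use monitoring, log analysis, and diagnostic tools to handle errors more efficiently
  • Documentation and process: define an error-handling process, record how errors were handled, and write error-handling documentation
  • Continuous improvement: review past incidents regularly and keep refining the process and methods
  • Training and knowledge sharing: train team members regularly and share error-handling knowledge and experience

Future trends in Kubernetes errors and solutions:

  1. Automated error handling: using AI and machine learning to detect and handle errors automatically
  2. Intelligent diagnosis: diagnosing causes from historical data and pattern recognition
  3. Predictive maintenance: analyzing system state and trends to predict errors and act before they occur
  4. Edge computing: extending error handling to edge devices
  5. Multi-cloud: handling errors across cloud platforms through a unified mechanism
  6. Service mesh integration: using a service mesh for finer-grained error handling and debugging
  7. Zero-trust architecture: building on zero trust to make error handling more secure and reliable
  8. Visual debugging: using richer visualization tools to show system state and error causes
Keep learning: the techniques for handling common errors keep evolving, so continue to learn new tools and methods as business needs change.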

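This article was compiled and published by the Fengge tutorial series for study and testing only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html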
