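Kubernetes Tutorial FG088-Kubernetes Routine Inspection and Health Checks in Practice
Outline
Part01-Basic Concepts and Theory
1.1 Routine Inspection Basics
Routine inspection is a key means of keeping a Kubernetes cluster running stably. It covers the following areas:
- Cluster status check: verify the overall health of the cluster
- Node health check: verify that each node is running normally
- Pod status check: verify that Pods are running normally
- Storage check: review storage usage
- Network check: verify network connectivity
- Resource usage check: review CPU, memory, and disk usage
- Log check: review system and application logs
- Security check: review the security posture of the cluster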
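1.2 Health Check Types
- Liveness probe: checks whether the container is still alive; if it fails, the kubelet restarts the container
- Readiness probe: checks whether the container is ready to serve traffic; if it fails, the Pod is removed from Service endpoints
- Startup probe: checks whether the container has finished starting; liveness and readiness probes are held off until it succeeds
1.3 Inspection Tools
- kubectl: the Kubernetes command-line tool
- kube-state-metrics: exposes metrics about the state of cluster objects
- Prometheus: monitoring system
- Grafana: visualization and dashboards for monitoring data
- ELK Stack: log management system
- Velero: backup and restore tool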
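Part02-Production Planning and Recommendations
2.1 Inspection Frequency Planning
- Daily inspection: run once a day
- Weekly inspection: run once a week
- Monthly inspection: run once a month
- Quarterly inspection: run once a quarter
2.2 Inspection Scope Planning
- Daily inspection: cluster status, node health, Pod status, resource usage
- Weekly inspection: storage usage, network status, log analysis
- Monthly inspection: security status, backup status, version updates
- Quarterly inspection: full review, performance assessment, optimization recommendations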
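The mapping from inspection tier to inspection scope can be captured in a small dispatcher, so that one script can be scheduled at different frequencies. The following is a minimal sketch; the function name `checks_for_tier` and the check identifiers are hypothetical labels chosen for illustration, not part of any tool.

```shell
# Print the (hypothetical) check identifiers that belong to an inspection tier.
checks_for_tier() {
  case "$1" in
    daily)     echo "cluster-status node-health pod-status resource-usage" ;;
    weekly)    echo "storage-usage network-status log-analysis" ;;
    monthly)   echo "security-status backup-status version-updates" ;;
    quarterly) echo "full-review performance-assessment optimization-recommendations" ;;
    *)         echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Example: list what a weekly run would cover.
checks_for_tier weekly
```

A scheduler such as cron could then call the same script with `daily`, `weekly`, `monthly`, or `quarterly` as its argument.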
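2.3 Alerting Planning
- Set reasonable alert thresholds
- Configure alert notification channels (email, SMS, WeChat, etc.)
- Establish alert severity levels
- Define an alert handling process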
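Thresholds and severity levels work together: a single measurement is compared against an ordered set of thresholds to pick a level. The sketch below assumes integer percentages, a warning threshold of 80 (the value used by the alert rules later in this tutorial), and a critical threshold of 90, which is an illustrative assumption only.

```shell
# Classify a usage percentage into a severity level.
# Arguments: usage [warning_threshold] [critical_threshold], all integers.
alert_level() {
  local usage="$1" warn="${2:-80}" crit="${3:-90}"
  if [ "$usage" -ge "$crit" ]; then
    echo critical
  elif [ "$usage" -ge "$warn" ]; then
    echo warning
  else
    echo ok
  fi
}

alert_level 85   # prints "warning" with the default thresholds
```

Each level would then be routed to a different notification channel and handling process.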
Part03-Production Implementation Plan
3.1 Deploy the Monitoring System
3.1.1 Deploy Prometheus and Grafana
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# Deploy Prometheus and Grafana
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
# Check the status of the monitoring components
kubectl get pods -n monitoring
Output →
NAME                                             READY   STATUS    RESTARTS   AGE
prometheus-prometheus-node-exporter-5432b        1/1     Running   0          5m
prometheus-prometheus-node-exporter-6548b        1/1     Running   0          5m
prometheus-prometheus-node-exporter-7890b        1/1     Running   0          5m
prometheus-grafana-5678b9c8d9-8x3y7              1/1     Running   0          5m
prometheus-kube-state-metrics-7890b1c2d3-9z4w8   1/1     Running   0          5m
prometheus-prometheus-0                          2/2     Running   0          5m
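A STATUS of Running does not guarantee that every container in a Pod is ready; the READY column (ready/total) is the more reliable signal. The following sketch filters `kubectl get pods` output for Pods where the two numbers differ. The function name `not_ready_pods` is hypothetical, and the default single-namespace column layout (NAME READY STATUS RESTARTS AGE) is assumed.

```shell
# Read `kubectl get pods` output on stdin and print the names of Pods
# whose READY column shows fewer ready containers than total containers.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, r, "/")          # r[1] = ready containers, r[2] = total
    if (r[1] != r[2]) print $1
  }'
}

# Usage (requires cluster access):
#   kubectl get pods -n monitoring | not_ready_pods
```

An empty result means every listed Pod has all of its containers ready.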
3.2 Configure Health Checks
3.2.1 Configure Health Checks for an Application
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
        startupProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 30
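The startup probe settings determine how long a slow container is allowed to take before the kubelet gives up and restarts it. As a rough estimate, that budget is initialDelaySeconds + failureThreshold × periodSeconds. The helper below is a small sketch of that arithmetic; `startup_budget` is a hypothetical name, and the result ignores per-probe timeouts.

```shell
# Approximate worst-case startup window in seconds:
#   initialDelaySeconds + failureThreshold * periodSeconds
# Arguments: initial_delay period failure_threshold
startup_budget() {
  local initial_delay="$1" period="$2" failure_threshold="$3"
  echo $(( initial_delay + failure_threshold * period ))
}

# Values from the startupProbe above: 10 + 30 * 5
startup_budget 10 5 30   # prints 160
```

If the application can take longer than this to start, raising failureThreshold is usually preferable to inflating the liveness probe's delay.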
3.3 Write Inspection Scripts
3.3.1 Daily Inspection Script
#!/bin/bash
# daily_check.sh
echo "===== Kubernetes Daily Inspection Report ====="
echo "Inspection time: $(date)"
echo ""
# Check cluster status
echo "1. Cluster status"
kubectl cluster-info
echo ""
# Check node status
echo "2. Node status"
kubectl get nodes
echo ""
# Check Pod status
echo "3. Pod status"
kubectl get pods --all-namespaces
echo ""
# Check resource usage (requires metrics-server)
echo "4. Resource usage"
kubectl top nodes
echo ""
# Check storage usage
echo "5. Storage usage"
kubectl get persistentvolumes
echo ""
# Check service status
echo "6. Service status"
kubectl get services --all-namespaces
echo ""
echo "===== Inspection complete ====="
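On a large cluster the full Pod listing is mostly noise; what matters in a daily report is the handful of Pods that are not healthy. The sketch below reduces `kubectl get pods --all-namespaces` output to Pods whose STATUS is neither Running nor Completed. The function name `unhealthy_pods` is hypothetical, and the all-namespaces column layout (NAMESPACE NAME READY STATUS RESTARTS AGE) is assumed.

```shell
# Read `kubectl get pods --all-namespaces` output on stdin and print
# "namespace/name status" for Pods that are not Running or Completed.
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" {
    print $1 "/" $2 " " $4
  }'
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces | unhealthy_pods
```

Filtering on the STATUS column rather than on the Pod phase is deliberate here: a Pod in CrashLoopBackOff still reports the phase Running, so a phase-based field selector would miss it.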
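Part04-Production Cases and Hands-On Walkthroughs
4.1 Case Study: Running the Daily Inspection
4.1.1 Run the Inspection Script
# Run the inspection script
chmod +x daily_check.sh
./daily_check.sh
Output →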
===== Kubernetes Daily Inspection Report =====
Inspection time: 2024-01-01 10:00:00

1. Cluster status
Kubernetes control plane is running at https://192.168.1.100:6443
CoreDNS is running at https://192.168.1.100:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

2. Node status
NAME           STATUS   ROLES           AGE   VERSION
k8s-master-1   Ready    control-plane   30d   v1.27.0
k8s-master-2   Ready    control-plane   30d   v1.27.0
k8s-worker-1   Ready    <none>          30d   v1.27.0
k8s-worker-2   Ready    <none>          30d   v1.27.0

3. Pod status
NAMESPACE     NAME                                             READY   STATUS    RESTARTS   AGE
default       nginx-5432b8c8d9-7k2z7                           1/1     Running   0          5m
default       nginx-5432b8c8d9-9p4xq                           1/1     Running   0          5m
default       nginx-5432b8c8d9-fg67h                           1/1     Running   0          5m
kube-system   coredns-6548b8c8d9-7k2z7                         1/1     Running   0          30d
kube-system   coredns-6548b8c8d9-9p4xq                         1/1     Running   0          30d
kube-system   etcd-k8s-master-1                                1/1     Running   0          30d
kube-system   etcd-k8s-master-2                                1/1     Running   0          30d
kube-system   kube-apiserver-k8s-master-1                      1/1     Running   0          30d
kube-system   kube-apiserver-k8s-master-2                      1/1     Running   0          30d
kube-system   kube-controller-manager-k8s-master-1             1/1     Running   0          30d
kube-system   kube-controller-manager-k8s-master-2             1/1     Running   0          30d
kube-system   kube-proxy-5432b                                 1/1     Running   0          30d
kube-system   kube-proxy-6548b                                 1/1     Running   0          30d
kube-system   kube-proxy-7890b                                 1/1     Running   0          30d
kube-system   kube-proxy-8901c                                 1/1     Running   0          30d
kube-system   kube-scheduler-k8s-master-1                      1/1     Running   0          30d
kube-system   kube-scheduler-k8s-master-2                      1/1     Running   0          30d
monitoring    prometheus-prometheus-node-exporter-5432b        1/1     Running   0          5m
monitoring    prometheus-prometheus-node-exporter-6548b        1/1     Running   0          5m
monitoring    prometheus-prometheus-node-exporter-7890b        1/1     Running   0          5m
monitoring    prometheus-grafana-5678b9c8d9-8x3y7              1/1     Running   0          5m
monitoring    prometheus-kube-state-metrics-7890b1c2d3-9z4w8   1/1     Running   0          5m
monitoring    prometheus-prometheus-0                          2/2     Running   0          5m

4. Resource usage
NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
k8s-master-1   150m         15%    1500Mi          30%
k8s-master-2   120m         12%    1300Mi          26%
k8s-worker-1   80m          8%     800Mi           16%
k8s-worker-2   90m          9%     900Mi           18%

5. Storage usage
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                   STORAGECLASS   REASON   AGE
pvc-abc   10Gi       RWO            Delete           Bound    default/nginx-pvc       standard                5m
pvc-def   20Gi       RWO            Delete           Bound    monitoring/prometheus   standard                5m

6. Service status
NAMESPACE     NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE
default       kubernetes           ClusterIP   10.96.0.1      <none>        443/TCP                  30d
default       nginx                ClusterIP   10.96.123.45   <none>        80/TCP                   5m
kube-system   kube-dns             ClusterIP   10.96.0.10     <none>        53/UDP,53/TCP,9153/TCP   30d
monitoring    prometheus-grafana   NodePort    10.96.78.90    <none>        80:30000/TCP             5m

===== Inspection complete =====
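The resource usage section of the report becomes actionable once it is compared against a threshold. The sketch below reads `kubectl top nodes` output and prints the nodes whose CPU% or MEMORY% exceeds a limit. The function name `hot_nodes` is hypothetical, and the default of 80 mirrors the threshold used in the alert rules of this tutorial.

```shell
# Read `kubectl top nodes` output on stdin and print nodes whose CPU% or
# MEMORY% is above the threshold (first argument, default 80).
# Assumed columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
hot_nodes() {
  local threshold="${1:-80}"
  awk -v t="$threshold" 'NR > 1 {
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the percent signs
    if (cpu + 0 > t || mem + 0 > t) print $1
  }'
}

# Usage (requires cluster access and metrics-server):
#   kubectl top nodes | hot_nodes 80
```

With the sample output above, no node would be printed, since the highest values are 15% CPU and 30% memory.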
4.2 Case Study: Checking Pod Health
4.2.1 Inspect Pod Details
# Show Pod details
kubectl describe pod nginx-5432b8c8d9-7k2z7
# Show Pod logs
kubectl logs nginx-5432b8c8d9-7k2z7
# Show the Pod's condition types and statuses
kubectl get pod nginx-5432b8c8d9-7k2z7 -o jsonpath='{.status.conditions[*].type} {.status.conditions[*].status}'
Output →
# Pod details
Name:             nginx-5432b8c8d9-7k2z7
Namespace:        default
Priority:         0
Node:             k8s-worker-1/192.168.1.102
Start Time:       2024-01-01T09:55:00Z
Labels:           app=nginx
                  pod-template-hash=5432b8c8d9
Annotations:      <none>
Status:           Running
IP:               10.244.1.2
IPs:
  IP:           10.244.1.2
Controlled By:  ReplicaSet/nginx-5432b8c8d9
Containers:
  nginx:
    Container ID:   docker://abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890
    Image:          nginx:1.21
    Image ID:       docker-pullable://nginx@sha256:1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      2024-01-01T09:55:05Z
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:80/ delay=30s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:80/ delay=5s timeout=1s period=10s #success=1 #failure=3
    Startup:        http-get http://:80/ delay=10s timeout=1s period=5s #success=1 #failure=30
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xyz (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-xyz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  5m    default-scheduler  Successfully assigned default/nginx-5432b8c8d9-7k2z7 to k8s-worker-1
  Normal  Pulling    5m    kubelet            Pulling image "nginx:1.21"
  Normal  Pulled     5m    kubelet            Successfully pulled image "nginx:1.21" in 1.2s
  Normal  Created    5m    kubelet            Created container nginx
  Normal  Started    5m    kubelet            Started container nginx
# Pod logs
192.168.1.1 - - [01/Jan/2024:09:55:10 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.27"
192.168.1.1 - - [01/Jan/2024:09:55:20 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.27"
192.168.1.1 - - [01/Jan/2024:09:55:30 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.27"
# Condition types and statuses
Initialized Ready ContainersReady PodScheduled True True True True
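The jsonpath query prints all condition types first and all statuses afterwards, which makes it hard to see which status belongs to which condition. The sketch below pairs them up; `pair_conditions` is a hypothetical helper name, and it assumes the input is a single line of N types followed by N statuses.

```shell
# Read one line of "type1 ... typeN status1 ... statusN" on stdin and
# print each condition as "type=status" on its own line.
pair_conditions() {
  awk '{
    n = NF / 2
    for (i = 1; i <= n; i++) print $i "=" $(i + n)
  }'
}

# Usage (requires cluster access):
#   kubectl get pod nginx-5432b8c8d9-7k2z7 \
#     -o jsonpath='{.status.conditions[*].type} {.status.conditions[*].status}' \
#     | pair_conditions
```

An alternative that avoids post-processing is to let jsonpath iterate over the conditions itself, for example with `-o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'`.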
4.3 Case Study: Configuring Monitoring Alerts
4.3.1 Configure Prometheus Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
  labels:
    # Assumes the Helm release name "prometheus" from 3.1.1; with default chart
    # values, Prometheus only loads rules that carry the release label.
    release: prometheus
spec:
  groups:
  - name: kubernetes
    rules:
    - alert: NodeDown
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} is down"
        description: "Node {{ $labels.node }} has been NotReady for more than 5 minutes"
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total{container!=""}[5m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        description: "Pod {{ $labels.pod }} has restarted within the last 5 minutes"
    - alert: HighCPUUsage
      # node_cpu_seconds_total is a cumulative counter, so utilization is
      # computed from its rate over a window rather than from the raw values.
      expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on node {{ $labels.instance }}"
        description: "CPU usage on node {{ $labels.instance }} is above 80%"
    - alert: HighMemoryUsage
      expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on node {{ $labels.instance }}"
        description: "Memory usage on node {{ $labels.instance }} is above 80%"
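CPU alerts built on node_cpu_seconds_total depend on how the counter is read. Because it accumulates CPU seconds since boot, dividing the raw non-idle value by the raw total yields the average since boot, not current load; utilization over a window comes from the change between two samples. The sketch below shows the arithmetic with made-up sample values; `cpu_usage_pct` is a hypothetical helper.

```shell
# CPU utilization (%) between two samples of cumulative CPU seconds.
# Arguments: idle_t0 total_t0 idle_t1 total_t1
cpu_usage_pct() {
  awk -v i0="$1" -v t0="$2" -v i1="$3" -v t1="$4" \
    'BEGIN { printf "%.1f\n", (1 - (i1 - i0) / (t1 - t0)) * 100 }'
}

# Illustrative numbers: over the last window 300 CPU seconds elapsed,
# of which only 30 were idle.
cpu_usage_pct 9000 10000 9030 10300   # prints 90.0 (windowed utilization)
cpu_usage_pct 0 0 9030 10300          # prints 12.3 (average since boot)
```

The same node looks 90% busy over the recent window but only about 12% busy when measured since boot, which is why a threshold alert on the raw counters may never fire.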
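Part05-Fengge's Lessons Learned and Tips
5.1 Routine Inspection Best Practices
- Establish a solid inspection process: draw up an inspection plan that defines scope and frequency
- Use automation: write inspection scripts and run inspection tasks automatically
- Deploy a monitoring system: use Prometheus and Grafana to monitor cluster state
- Configure health checks: give each application appropriate liveness, readiness, and startup probes
- Set up alerting: choose reasonable thresholds so that problems are detected promptly
- Review inspection results regularly: summarize recurring problems and propose optimizations
- Keep the cluster up to date: upgrade Kubernetes regularly to pick up security fixes
- Back up important data: back up cluster configuration and application data on a schedule
5.2 Common Problems and Solutions
- Abnormal node status: check the node's resource usage and restart the kubelet service if needed
- Pod crash loop: check the Pod logs and analyze the cause of the crash
- High resource usage: tune the application configuration and adjust resource limits
- Insufficient storage: clean up unused data and expand storage capacity
- Network problems: check network policies and verify connectivity
- Too many alerts: adjust alert thresholds to reduce false positives
5.3 Fengge's Note
This article was compiled and published by Fengge Tutorials (风哥教程) and is intended for learning and testing only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html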
