Kubernetes Tutorial FG088-Kubernetes Routine Inspection and Health Checks: A Practical Walkthrough

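Part01-Basic Concepts and Theory

1.1 Routine Inspection Basics

Routine inspection is a key means of keeping a Kubernetes cluster running stably. It covers the following areas:

  • Cluster status check: verify the overall health of the cluster
  • Node health check: verify that each node is running normally
  • Pod status check: verify that Pods are running normally
  • Storage check: review storage usage
  • Network check: verify network connectivity
  • Resource usage check: review CPU, memory, and disk usage
  • Log check: review system and application logs
  • Security check: review the security posture of the cluster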

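1.2 Health Check Types

  • Liveness Probe: checks whether the container is still alive; if it fails, the kubelet restarts the container
  • Readiness Probe: checks whether the container is ready to serve traffic; while it fails, the Pod is removed from Service endpoints
  • Startup Probe: checks whether the application in the container has finished starting; liveness and readiness probes are held back until it succeeds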

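1.3 Inspection Tools

  • kubectl: the Kubernetes command-line tool
  • kube-state-metrics: exposes metrics about the state of cluster objects
  • Prometheus: monitoring system
  • Grafana: visualization and dashboards for monitoring data
  • ELK Stack: log management
  • Velero: backup and restore tool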

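Part02-Production Environment Planning and Recommendations

2.1 Inspection Frequency

  • Daily inspection: run once a day
  • Weekly inspection: run once a week
  • Monthly inspection: run once a month
  • Quarterly inspection: run once a quarter

2.2 Inspection Scope

  • Daily inspection: cluster status, node health, Pod status, resource usage
  • Weekly inspection: storage usage, network status, log analysis
  • Monthly inspection: security status, backup status, version updates
  • Quarterly inspection: full review, performance evaluation, optimization recommendations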
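
The four tiers above can share one scheduled entry point that decides, on each run, which tiers are due. The sketch below is one possible way to do that. The calendar it encodes (weekly on Mondays, monthly on the 1st, quarterly on the 1st of January, April, July and October) and the function name tiers_due are assumptions made for illustration, not something the plan above prescribes.

```shell
# tiers_due DOW DOM MONTH
#   DOW:   ISO day of week, 1 (Monday) .. 7 (Sunday), as printed by `date +%u`
#   DOM:   day of month, as printed by `date +%d`
#   MONTH: month number, as printed by `date +%m`
# Prints the inspection tiers that fall due on that day, space-separated.
# Assumed (hypothetical) calendar: weekly on Mondays, monthly on the 1st,
# quarterly on the 1st of January, April, July and October.
tiers_due() {
    dow=$1
    dom=${2#0}      # strip a leading zero so "01" compares as 1
    month=${3#0}
    tiers="daily"
    if [ "$dow" -eq 1 ]; then
        tiers="$tiers weekly"
    fi
    if [ "$dom" -eq 1 ]; then
        tiers="$tiers monthly"
        case "$month" in
            1|4|7|10) tiers="$tiers quarterly" ;;
        esac
    fi
    echo "$tiers"
}
```

A daily cron job could call `tiers_due "$(date +%u)" "$(date +%d)" "$(date +%m)"` and run the matching checks for each tier it prints.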

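2.3 Alerting Plan

  • Set sensible alert thresholds
  • Configure alert notification channels (email, SMS, WeChat, and so on)
  • Define alert severity levels
  • Document the alert handling process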

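Part03-Production Environment Implementation

3.1 Deploying the Monitoring System

3.1.1 Deploying Prometheus and Grafana

# Add the Prometheus community Helm repository and refresh the local index
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Deploy Prometheus and Grafana
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
# Check the status of the monitoring components
kubectl get pods -n monitoring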

Output →

NAME                                               READY   STATUS    RESTARTS   AGE
prometheus-prometheus-node-exporter-5432b           1/1     Running   0          5m
prometheus-prometheus-node-exporter-6548b           1/1     Running   0          5m
prometheus-prometheus-node-exporter-7890b           1/1     Running   0          5m
prometheus-grafana-5678b9c8d9-8x3y7               1/1     Running   0          5m
prometheus-kube-state-metrics-7890b1c2d3-9z4w8     1/1     Running   0          5m
prometheus-prometheus-0                             2/2     Running   0          5m
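
Reading a pod listing by eye gets error-prone as the namespace grows. As a rough sketch, the helper below filters the same listing down to pods that are not fully ready. The function name not_ready_pods and the rule it applies (READY must be n/n with STATUS Running, or STATUS Completed) are assumptions chosen for this example.

```shell
# not_ready_pods: reads `kubectl get pods -n NAMESPACE` output (with header)
# on stdin and prints the name of every pod that is not fully ready.
# A pod counts as healthy when READY is n/n and STATUS is Running, or when
# STATUS is Completed (a finished Job pod).
not_ready_pods() {
    awk 'NR > 1 && NF >= 3 {
        split($2, r, "/")
        if ($3 == "Completed") next
        if ($3 != "Running" || r[1] != r[2]) print $1
    }'
}
```

Typical use would be `kubectl get pods -n monitoring | not_ready_pods`, where empty output means nothing needs attention. To block until everything is up after an install, `kubectl wait --for=condition=Ready pod --all -n monitoring --timeout=300s` is a built-in alternative.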

3.2 Configuring Health Checks

3.2.1 Configuring Health Checks for an Application

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
        startupProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 30
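
The startup probe settings determine how long a slow application may take to come up before the kubelet restarts it. As a rough guide, that budget is initialDelaySeconds + periodSeconds × failureThreshold; per-probe timeouts can shift the exact figure slightly. The helper below only does that arithmetic, and its name startup_budget is made up for this sketch.

```shell
# startup_budget INITIAL_DELAY PERIOD FAILURE_THRESHOLD
# Prints the approximate number of seconds a container gets to start before
# the kubelet gives up on the startup probe and restarts the container.
startup_budget() {
    echo $(( $1 + $2 * $3 ))
}
```

For the manifest above, `startup_budget 10 5 30` prints 160, so nginx has roughly 160 seconds to answer on port 80 before it is restarted. If an application needs longer, raising failureThreshold is usually the gentlest adjustment.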

3.3 Writing Inspection Scripts

3.3.1 Daily Inspection Script

#!/bin/bash
# daily_check.sh
echo "===== Kubernetes Daily Inspection Report ====="
echo "Inspection time: $(date '+%Y-%m-%d %H:%M:%S')"
echo ""
# Check cluster status
echo "1. Cluster status check"
kubectl cluster-info
echo ""
# Check node status
echo "2. Node status check"
kubectl get nodes
echo ""
# Check Pod status
echo "3. Pod status check"
kubectl get pods --all-namespaces
echo ""
# Check resource usage (requires the Metrics API, e.g. metrics-server)
echo "4. Resource usage"
kubectl top nodes
echo ""
# Check storage usage
echo "5. Storage usage"
kubectl get persistentvolumes
echo ""
# Check service status
echo "6. Service status check"
kubectl get services --all-namespaces
echo ""
echo "===== Inspection complete ====="
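
The script above prints raw listings and leaves the verdict to the reader. When it runs unattended, it helps if a check can also fail with a non-zero exit status. The sketch below shows one way to do that for the node check. The function names not_ready_nodes and check_nodes are hypothetical, and treating cordoned nodes as problems is a design choice you may not want.

```shell
# not_ready_nodes: reads `kubectl get nodes --no-headers` output on stdin and
# prints "name status" for each node whose STATUS is not exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are reported too, on purpose.
not_ready_nodes() {
    awk 'NF >= 2 && $2 != "Ready" { print $1 " " $2 }'
}

# check_nodes: same input; prints a verdict and returns non-zero on problems.
check_nodes() {
    bad=$(not_ready_nodes)
    if [ -n "$bad" ]; then
        echo "FAIL: nodes needing attention:"
        echo "$bad"
        return 1
    fi
    echo "OK: all nodes Ready"
}
```

In the daily script this could replace the plain listing with `kubectl get nodes --no-headers | check_nodes || exit 1`, so a cron wrapper or CI job can react to the exit code.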

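Part04-Production Cases and Hands-On Walkthroughs

4.1 Hands-On Case: Running the Daily Inspection

4.1.1 Running the Inspection Script

# Run the inspection script
chmod +x daily_check.sh
./daily_check.sh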

Output →

===== Kubernetes Daily Inspection Report =====
Inspection time: 2024-01-01 10:00:00
1. Cluster status check
Kubernetes control plane is running at https://192.168.1.100:6443
CoreDNS is running at https://192.168.1.100:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
2. Node status check
NAME           STATUS   ROLES           AGE   VERSION
k8s-master-1   Ready    control-plane   30d   v1.27.0
k8s-master-2   Ready    control-plane   30d   v1.27.0
k8s-worker-1   Ready    <none>          30d   v1.27.0
k8s-worker-2   Ready    <none>          30d   v1.27.0
3. Pod status check
NAMESPACE         NAME                                      READY   STATUS    RESTARTS   AGE
default           nginx-5432b8c8d9-7k2z7                    1/1     Running   0          5m
default           nginx-5432b8c8d9-9p4xq                    1/1     Running   0          5m
default           nginx-5432b8c8d9-fg67h                    1/1     Running   0          5m
kube-system       coredns-6548b8c8d9-7k2z7                  1/1     Running   0          30d
kube-system       coredns-6548b8c8d9-9p4xq                  1/1     Running   0          30d
kube-system       etcd-k8s-master-1                         1/1     Running   0          30d
kube-system       etcd-k8s-master-2                         1/1     Running   0          30d
kube-system       kube-apiserver-k8s-master-1               1/1     Running   0          30d
kube-system       kube-apiserver-k8s-master-2               1/1     Running   0          30d
kube-system       kube-controller-manager-k8s-master-1      1/1     Running   0          30d
kube-system       kube-controller-manager-k8s-master-2      1/1     Running   0          30d
kube-system       kube-proxy-5432b                          1/1     Running   0          30d
kube-system       kube-proxy-6548b                          1/1     Running   0          30d
kube-system       kube-proxy-7890b                          1/1     Running   0          30d
kube-system       kube-proxy-8901c                          1/1     Running   0          30d
kube-system       kube-scheduler-k8s-master-1               1/1     Running   0          30d
kube-system       kube-scheduler-k8s-master-2               1/1     Running   0          30d
monitoring        prometheus-prometheus-node-exporter-5432b 1/1     Running   0          5m
monitoring        prometheus-prometheus-node-exporter-6548b 1/1     Running   0          5m
monitoring        prometheus-prometheus-node-exporter-7890b 1/1     Running   0          5m
monitoring        prometheus-grafana-5678b9c8d9-8x3y7       1/1     Running   0          5m
monitoring        prometheus-kube-state-metrics-7890b1c2d3-9z4w8 1/1 Running 0          5m
monitoring        prometheus-prometheus-0                   2/2     Running   0          5m
4. Resource usage
NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
k8s-master-1   150m         15%    1500Mi          30%
k8s-master-2   120m         12%    1300Mi          26%
k8s-worker-1   80m          8%     800Mi           16%
k8s-worker-2   90m          9%     900Mi           18%
5. Storage usage
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                    STORAGECLASS   REASON   AGE
pvc-abc   10Gi       RWO            Delete           Bound    default/nginx-pvc        standard                5m
pvc-def   20Gi       RWO            Delete           Bound    monitoring/prometheus     standard                5m
6. Service status check
NAMESPACE         NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE
default           kubernetes           ClusterIP   10.96.0.1      <none>        443/TCP                  30d
default           nginx                ClusterIP   10.96.123.45   <none>        80/TCP                   5m
kube-system       kube-dns             ClusterIP   10.96.0.10     <none>        53/UDP,53/TCP,9153/TCP   30d
monitoring        prometheus-grafana   NodePort    10.96.78.90    <none>        80:30000/TCP             5m
===== Inspection complete =====
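
The resource usage section of the report is easier to act on when it is compared against a threshold. The following sketch flags nodes from `kubectl top nodes` output whose CPU% or MEMORY% reaches a limit. The function name hot_nodes is hypothetical, and any threshold you pass is a value to tune for your own cluster.

```shell
# hot_nodes THRESHOLD: reads `kubectl top nodes` output (with header) on stdin
# and prints every node whose CPU% or MEMORY% is at or above THRESHOLD.
# THRESHOLD is a whole-number percentage.
hot_nodes() {
    awk -v limit="$1" 'NR > 1 && NF >= 5 {
        cpu = $3; mem = $5
        sub(/%/, "", cpu); sub(/%/, "", mem)
        if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0)
            print $1 " cpu=" cpu "% mem=" mem "%"
    }'
}
```

Running `kubectl top nodes | hot_nodes 80` against the sample report above prints nothing, since no node is near 80%. Lowering the limit to 25 would list k8s-master-1 and k8s-master-2 because of their memory usage.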

4.2 Hands-On Case: Checking Pod Health

4.2.1 Inspecting Pod Details

# Inspect Pod details
kubectl describe pod nginx-5432b8c8d9-7k2z7
# View Pod logs
kubectl logs nginx-5432b8c8d9-7k2z7
# Check the Pod's condition status
kubectl get pod nginx-5432b8c8d9-7k2z7 -o jsonpath='{.status.conditions[*].type} {.status.conditions[*].status}'

Output →

# Pod details
Name:         nginx-5432b8c8d9-7k2z7
Namespace:    default
Priority:     0
Node:         k8s-worker-1/192.168.1.102
Start Time:   2024-01-01T09:55:00Z
Labels:       app=nginx
              pod-template-hash=5432b8c8d9
Annotations:  <none>
Status:       Running
IP:           10.244.1.2
IPs:
  IP:           10.244.1.2
Controlled By:  ReplicaSet/nginx-5432b8c8d9
Containers:
  nginx:
    Container ID:   docker://abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890
    Image:          nginx:1.21
    Image ID:       docker-pullable://nginx@sha256:1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      2024-01-01T09:55:05Z
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:80/ delay=30s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:80/ delay=5s timeout=1s period=10s #success=1 #failure=3
    Startup:        http-get http://:80/ delay=10s timeout=1s period=5s #success=1 #failure=30
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xyz (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True
  PodScheduled      True 
Volumes:
  kube-api-access-xyz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  5m    default-scheduler  Successfully assigned default/nginx-5432b8c8d9-7k2z7 to k8s-worker-1
  Normal  Pulling    5m    kubelet            Pulling image "nginx:1.21"
  Normal  Pulled     5m    kubelet            Successfully pulled image "nginx:1.21" in 1.2s
  Normal  Created    5m    kubelet            Created container nginx
  Normal  Started    5m    kubelet            Started container nginx
# Pod logs
192.168.1.1 - - [01/Jan/2024:09:55:10 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.27"
192.168.1.1 - - [01/Jan/2024:09:55:20 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.27"
192.168.1.1 - - [01/Jan/2024:09:55:30 +0000] "GET / HTTP/1.1" 200 612 "-" "kube-probe/1.27"
# Health check status
Initialized Ready ContainersReady PodScheduled True True True True
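
The jsonpath query above prints all condition types first and all statuses second, so the reader has to match them up by position. As a small sketch, the helper below pairs them; the name pair_conditions is made up for this example.

```shell
# pair_conditions: reads one line holding N condition types followed by N
# statuses (the shape printed by the jsonpath query above) and prints
# "Type=Status" pairs, one per line. Lines with an odd field count are skipped.
pair_conditions() {
    awk 'NF > 0 && NF % 2 == 0 {
        half = NF / 2
        for (i = 1; i <= half; i++) print $i "=" $(i + half)
    }'
}
```

Piping the status line shown above through `pair_conditions` yields Initialized=True, Ready=True, ContainersReady=True and PodScheduled=True on separate lines. The same pairing can be produced by kubectl itself with a range expression: `-o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'`.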

4.3 Hands-On Case: Configuring Monitoring Alerts

4.3.1 Configuring Prometheus Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
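metadata:
  name: kubernetes-alerts
  namespace: monitoring
  labels:
    # kube-prometheus-stack selects rules by this label by default; use your own Helm release name
    release: prometheus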
spec:
  groups:
  - name: kubernetes
    rules:
    - alert: NodeDown
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} is down"
        description: "Node {{ $labels.node }} has been NotReady for more than 5 minutes"
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total{container!=""}[5m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
        description: "Pod {{ $labels.pod }} has restarted within the past 5 minutes"
    - alert: HighCPUUsage
      expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage on node {{ $labels.instance }}"
        description: "CPU usage on node {{ $labels.instance }} is above 80%"
    - alert: HighMemoryUsage
      expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on node {{ $labels.instance }}"
        description: "Memory usage on node {{ $labels.instance }} is above 80%"
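
The HighMemoryUsage expression is plain arithmetic over two values that node_exporter reads from /proc/meminfo on each node. To sanity-check the 80% threshold on a single machine, the same formula can be evaluated by hand. The sketch below assumes a Linux node, and the function name mem_usage_pct is hypothetical.

```shell
# mem_usage_pct: reads /proc/meminfo-style text on stdin and prints
# (MemTotal - MemAvailable) / MemTotal * 100 with one decimal place.
# Prints nothing if MemTotal is missing or zero.
mem_usage_pct() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END {
             if (total > 0) printf "%.1f\n", (total - avail) / total * 100
         }'
}
```

On a node, `mem_usage_pct < /proc/meminfo` gives the figure the alert compares against 80. A node with 8000000 kB total and 1000000 kB available works out to 87.5 and would fire the alert once the condition has held for 5 minutes.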

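Part05-Fengge's Lessons Learned

5.1 Routine Inspection Best Practices

  • Establish a solid inspection process: draw up an inspection plan with a clear scope and frequency
  • Use automation: write inspection scripts and run inspection tasks automatically
  • Deploy a monitoring system: use Prometheus and Grafana to monitor cluster state
  • Configure health checks: give every application sensible probes
  • Set up alerting: choose sensible alert thresholds so problems surface early
  • Review inspection results regularly: summarize problems and propose optimizations
  • Keep the cluster up to date: upgrade Kubernetes regularly to fix security vulnerabilities
  • Back up important data: back up cluster configuration and application data on a schedule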

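5.2 Common Problems and Solutions

  • Abnormal node status: check node resource usage and the kubelet logs; restart the kubelet service if needed
  • Pod crash loop: check the Pod logs and analyze the cause of the crash
  • High resource usage: tune the application configuration and adjust resource requests and limits
  • Insufficient storage: clean up unused data and expand storage capacity
  • Network problems: check network policies and verify connectivity
  • Too many alerts: tune alert thresholds to reduce false positives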
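
For the crash-loop case, the first step is usually finding which Pods are restarting at all. One possible way to shortlist them from a cluster-wide listing is sketched below; the function name restart_offenders and the minimum count you pass are assumptions for illustration.

```shell
# restart_offenders MIN: reads `kubectl get pods --all-namespaces --no-headers`
# output on stdin and prints "namespace/pod restarts" for every pod whose
# RESTARTS count is at least MIN.
restart_offenders() {
    awk -v min="$1" 'NF >= 5 && $5 + 0 >= min + 0 { print $1 "/" $2 " " $5 }'
}
```

A run such as `kubectl get pods --all-namespaces --no-headers | restart_offenders 3` gives a shortlist. For each entry, `kubectl logs POD_NAME -n NAMESPACE --previous` shows the log of the container instance that crashed, and `kubectl describe pod` shows the last exit code and recent events.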

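5.3 Fengge's Note

This article was compiled and published by Fengge Tutorials (风哥教程) and is intended for learning and testing only. When reposting, please cite the source: http://www.fgedu.net.cn/10327.html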