
Kubernetes Tutorial FG020 - Kubernetes Daily Inspection and Health Checks in Practice

In this document, Fengge covers daily inspection and health checks for Kubernetes: an overview of daily inspection, health check concepts, Kubernetes health checks, inspection planning, health check planning, implementation strategy, cluster inspection, node inspection, and application inspection. The material draws on the official Kubernetes documentation and related health check documentation. It is intended for DevOps engineers and system administrators to use in learning and testing; validate everything yourself before applying it to production.

Part01-Basic Concepts and Theory

1.1 Overview of Daily Inspection

Daily inspection means checking a Kubernetes cluster on a regular schedule to keep it running stably. It covers cluster status, node status, application status, and resource usage.

1.2 Health Check Concepts

A health check is a mechanism for verifying that a system or application is healthy and running normally. In Kubernetes, health checks take the form of the liveness probe, the readiness probe, and the startup probe.

1.3 Kubernetes Health Checks

Kubernetes checks container health through its probe mechanism. Three probe types are supported:

  • Liveness probe: checks whether the container is still alive; on failure, the kubelet restarts the container
  • Readiness probe: checks whether the Pod is ready to serve traffic; on failure, Kubernetes removes the Pod from the Service's endpoints
  • Startup probe: checks whether the application has finished starting; while it runs, the liveness and readiness probes are held off, and if it ultimately fails, the container is restarted
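The three probe types can be seen side by side in a hypothetical Pod manifest. This is a sketch for learning only; the image, port, and thresholds are illustrative assumptions, not production recommendations:

```shell
# Write a sketch manifest showing all three probe types on one container.
# Image, port, and threshold values below are illustrative assumptions.
cat > probe-demo.yaml << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
    startupProbe:            # gates liveness/readiness until it first succeeds
      httpGet: {path: /, port: 80}
      periodSeconds: 5
      failureThreshold: 30
    livenessProbe:           # on failure, the kubelet restarts the container
      httpGet: {path: /, port: 80}
      periodSeconds: 30
    readinessProbe:          # on failure, the Pod leaves the Service endpoints
      httpGet: {path: /, port: 80}
      periodSeconds: 10
EOF
grep -c 'Probe:' probe-demo.yaml   # expect 3: one stanza per probe type
```

On a test cluster, `kubectl apply -f probe-demo.yaml` would create the Pod; note how the startup probe holds off the other two until it first succeeds.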

Part02-Production Planning and Recommendations

2.1 Inspection Planning

Planning daily inspection for a production Kubernetes cluster:

# Inspection planning
– Frequency: set the inspection cadence (daily, weekly, monthly) according to how critical the cluster is
– Scope: define exactly what is inspected, e.g. cluster status, node status, application status, resource usage
– Tools: choose suitable tools, e.g. kubectl, Prometheus, Grafana
– Staffing: assign inspectors and make responsibilities explicit
– Reporting: create an inspection report template and record results
– Incident handling: define a process for promptly handling problems found during inspection
# Suggested inspection frequency
– Daily: cluster status, node status, application status, resource usage
– Weekly: cluster component status, network status, storage status
– Monthly: cluster version, security configuration, backup status
# Suggested inspection scope
– Cluster status: overall health, e.g. API server, etcd, scheduler, controller manager
– Node status: per-node health, e.g. CPU, memory, disk usage
– Application status: e.g. Pod state, Service availability
– Resource usage: cluster-wide CPU, memory, disk, network
– Security: e.g. RBAC configuration, network policies, Secrets management
– Backups: e.g. etcd backups, application data backups
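The planning items above can be wired into a single automated driver. A minimal sketch, assuming a configured kubeconfig; the check list, report file name, and output format are illustrative (the script only runs live checks when kubectl is present, so it can be tried anywhere):

```shell
#!/usr/bin/env bash
# daily-check.sh -- sketch of an automated daily-inspection driver.
# Each check's output and exit status are appended to a dated report file.
report="inspection-$(date +%F).txt"
run() {
  echo "== $*" >> "$report"
  if "$@" >> "$report" 2>&1; then echo "OK: $*"; else echo "FAILED: $*"; fi
}
if command -v kubectl >/dev/null 2>&1; then
  run kubectl get nodes
  run kubectl get pods -A --field-selector=status.phase!=Running
  run kubectl get events -A --sort-by='.lastTimestamp'
else
  echo "kubectl not found; no live checks run" >> "$report"
fi
echo "report written to $report"
```

A cron entry or CI job can then run this on the daily cadence and mail or archive the report.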

2.2 Health Check Planning

Planning health checks for a production Kubernetes environment:

# Health check planning
– Probe types: decide which probes to use (liveness, readiness, startup)
– Parameters: set the check period, timeout, failure threshold, and success threshold
– Methods: choose the check method (HTTP, TCP, command)
– Targets: choose what to check, e.g. the HTTP path or the command to run
– Failure handling: define what happens on failure, e.g. restart the container or remove the Pod from the Service
# Suggested probe parameters
– Liveness probe:
  – period: 30s
  – timeout: 10s
  – failure threshold: 3
  – success threshold: 1
– Readiness probe:
  – period: 10s
  – timeout: 5s
  – failure threshold: 3
  – success threshold: 1
– Startup probe:
  – period: 5s
  – timeout: 10s
  – failure threshold: 30
  – success threshold: 1
# Suggested probe methods
– HTTP check: for web applications; probes an HTTP endpoint's response
– TCP check: for network services; verifies a TCP port accepts connections
– Command check: for applications checked by running a command; succeeds when the command exits 0
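The three methods map to three probe handlers in the Pod spec. A sketch of all three; the paths, ports, and command are illustrative assumptions, not real endpoints:

```shell
# Sketch of the three probe handlers; all values are illustrative.
cat > probe-methods.yaml << 'EOF'
# HTTP check: kubelet sends a GET; any status >= 200 and < 400 is success
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
---
# TCP check: success if the port accepts a connection
livenessProbe:
  tcpSocket:
    port: 3306
---
# Command check: success if the command exits 0
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]
EOF
grep -c 'livenessProbe:' probe-methods.yaml   # 3 handler variants
```

The same three handlers are accepted under readinessProbe and startupProbe; only the consequences of failure differ.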

2.3 Implementation Strategy

Implementation strategy for daily inspection and health checks in production:

# Implementation strategy
– Automated inspection: automate checks with tools such as Ansible or shell scripts
– Monitoring and alerting: build a monitoring stack, e.g. Prometheus and Grafana, for real-time monitoring and alerts
– Scheduled inspection: perform manual inspections on the planned cadence
– Incident handling: establish a process for promptly handling problems found during inspection
– Continuous improvement: refine the inspection plan and health check configuration based on results
# Automated inspection scripts
– Cluster status script: checks overall cluster health
– Node status script: checks node health
– Application status script: checks application health
– Resource usage script: checks cluster resource consumption
– Security status script: checks cluster security posture
# Alerting configuration
– Cluster component alerts: fire when a control-plane component is unhealthy
– Node resource alerts: fire when node resource usage is too high
– Application alerts: fire when an application is unhealthy
– Storage alerts: fire when storage capacity runs low
– Network alerts: fire when network connectivity fails
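With the Prometheus stack mentioned above, alerts like these are typically expressed as rules. A hypothetical PrometheusRule sketch, assuming kube-state-metrics and node-exporter are deployed; the rule names, thresholds, and `for` durations are illustrative:

```shell
# Sketch of alert rules for the Prometheus Operator; names and
# thresholds are illustrative assumptions to tune for your cluster.
cat > k8s-inspection-alerts.yaml << 'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inspection-alerts
spec:
  groups:
  - name: inspection
    rules:
    - alert: NodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
    - alert: NodeHighMemory
      expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
      for: 10m
    - alert: PodNotRunning
      expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Failed|Pending"}) > 0
      for: 15m
EOF
grep -c 'alert:' k8s-inspection-alerts.yaml   # 3 example rules
```

On a cluster running the Prometheus Operator, `kubectl apply -f k8s-inspection-alerts.yaml` would load these rules.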

Part03-Production Implementation Plan

3.1 Cluster Inspection

Inspecting a production Kubernetes cluster:

# Cluster status check
$ kubectl cluster-info
Kubernetes control plane is running at https://192.168.1.100:6443
CoreDNS is running at https://192.168.1.100:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
# Cluster component status check (note: componentstatuses is deprecated since v1.19)
$ kubectl get componentstatuses
NAME STATUS MESSAGE ERROR
scheduler Healthy ok
controller-manager Healthy ok
etcd-0 Healthy {"health":"true"}
etcd-1 Healthy {"health":"true"}
etcd-2 Healthy {"health":"true"}
# Cluster node status check
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
master1 Ready control-plane,master 1d v1.24.0
master2 Ready control-plane,master 1d v1.24.0
master3 Ready control-plane,master 1d v1.24.0
worker1 Ready worker 1d v1.24.0
worker2 Ready worker 1d v1.24.0
# Cluster Pod status check
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-6d4b75cb6d-7f5f8 1/1 Running 0 1d
calico-node-4q7k8 1/1 Running 0 1d
calico-node-7c9x6 1/1 Running 0 1d
calico-node-8d2k3 1/1 Running 0 1d
calico-node-9f5g7 1/1 Running 0 1d
calico-node-b7c4d 1/1 Running 0 1d
coredns-6d4b75cb6d-7f5f8 1/1 Running 0 1d
coredns-6d4b75cb6d-8d2k3 1/1 Running 0 1d
etcd-master1 1/1 Running 0 1d
etcd-master2 1/1 Running 0 1d
etcd-master3 1/1 Running 0 1d
kube-apiserver-master1 1/1 Running 0 1d
kube-apiserver-master2 1/1 Running 0 1d
kube-apiserver-master3 1/1 Running 0 1d
kube-controller-manager-master1 1/1 Running 0 1d
kube-controller-manager-master2 1/1 Running 0 1d
kube-controller-manager-master3 1/1 Running 0 1d
kube-proxy-4q7k8 1/1 Running 0 1d
kube-proxy-7c9x6 1/1 Running 0 1d
kube-proxy-8d2k3 1/1 Running 0 1d
kube-proxy-9f5g7 1/1 Running 0 1d
kube-proxy-b7c4d 1/1 Running 0 1d
kube-scheduler-master1 1/1 Running 0 1d
kube-scheduler-master2 1/1 Running 0 1d
kube-scheduler-master3 1/1 Running 0 1d
# Cluster events check
$ kubectl get events --sort-by='.lastTimestamp'
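The per-command checks above can be condensed into quick pass/fail tests. For example, counting kube-system Pods whose STATUS is not Running; the sketch below runs against a captured sample so it works without a cluster (with a live cluster, pipe `kubectl get pods -n kube-system` in instead):

```shell
# Count Pods whose STATUS column (field 3 of the default `kubectl get
# pods` output) is not "Running"; NR>1 skips the header line.
sample='NAME              READY   STATUS             RESTARTS   AGE
coredns-1         1/1     Running            0          1d
broken-operator   0/1     CrashLoopBackOff   5          1d'
printf '%s\n' "$sample" | awk 'NR>1 && $3 != "Running" {n++} END {print n+0}'
# prints 1
```

A nonzero count is a natural trigger for the incident-handling process from Part02.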

3.2 Node Inspection

Inspecting production Kubernetes nodes:

# Node detail check
$ kubectl describe node worker1
# Node resource usage check
$ kubectl top node worker1
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
worker1 1500m 75% 12288Mi 60%
# Node disk usage check (run on the node itself; kubectl exec targets Pods, not nodes)
$ ssh worker1 df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 25G 25G 50% /
/dev/sdb1 200G 50G 150G 25% /Kubernetes/fgdata
# Node network status check
$ ssh worker1 ip addr
$ ssh worker1 ip route
$ ssh worker1 ping -c 3 8.8.8.8
# Node system service status check
$ ssh worker1 systemctl status kubelet
$ ssh worker1 systemctl status docker
$ ssh worker1 systemctl status containerd
# Node log check
$ ssh worker1 journalctl -u kubelet --no-pager
$ ssh worker1 journalctl -u docker --no-pager
$ ssh worker1 journalctl -u containerd --no-pager
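Raw `df -h` output is easier to act on with a threshold filter. A sketch that flags filesystems above 80% use, run here against a captured sample (on a node, pipe `df -h` in directly; the 80% threshold is an assumption to tune):

```shell
# Flag filesystems whose Use% (field 5) exceeds the threshold; the
# trailing % is stripped so the value compares numerically.
sample='Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda1   50G   25G   25G    50%   /
/dev/sdb1   200G  190G  10G    95%   /Kubernetes/fgdata'
printf '%s\n' "$sample" | awk 'NR>1 {gsub(/%/,"",$5); if ($5+0 > 80) print $6, $5"%"}'
# prints: /Kubernetes/fgdata 95%
```

The same filter works in the daily-inspection script, turning disk checks from eyeballing into a pass/fail test.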

3.3 Application Inspection

Inspecting production Kubernetes applications:

# Application Pod status check
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-6d6f58987b-7f5f8 1/1 Running 0 1d
nginx-6d6f58987b-8d2k3 1/1 Running 0 1d
nginx-6d6f58987b-9f5g7 1/1 Running 0 1d
# Application Pod detail check
$ kubectl describe pod nginx-6d6f58987b-7f5f8
# Application Pod log check
$ kubectl logs nginx-6d6f58987b-7f5f8
# Application Service status check
$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 1d
nginx LoadBalancer 10.100.123.45 192.168.1.200 80:30080/TCP 1d
# Application Service detail check
$ kubectl describe service nginx
# Application Ingress status check
$ kubectl get ingresses
$ kubectl describe ingress nginx
# Application resource usage check
$ kubectl top pod nginx-6d6f58987b-7f5f8
NAME CPU(cores) MEMORY(bytes)
nginx-6d6f58987b-7f5f8 50m 100Mi
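A Pod can be Running yet not Ready (its readiness probe failing), and comparing the two halves of the READY column catches exactly that. A sketch against a captured sample; with a live cluster, pipe `kubectl get pods` in instead:

```shell
# Split READY ("ready/total", field 2) and report Pods where the
# ready-container count differs from the total.
sample='NAME                     READY   STATUS    RESTARTS   AGE
nginx-6d6f58987b-7f5f8   1/1     Running   0          1d
nginx-6d6f58987b-8d2k3   0/1     Running   0          1d'
printf '%s\n' "$sample" | awk 'NR>1 {split($2, r, "/"); if (r[1] != r[2]) print $1}'
# prints: nginx-6d6f58987b-8d2k3
```

Pods flagged this way are the ones a Service has quietly stopped routing traffic to.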

Part04-Production Cases and Hands-on Walkthroughs

4.1 Cluster Health Check Case

A cluster health check walkthrough for production Kubernetes.

# Case: cluster health check
# Check cluster status
$ kubectl cluster-info
Kubernetes control plane is running at https://192.168.1.100:6443
CoreDNS is running at https://192.168.1.100:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
# Check cluster component status
$ kubectl get componentstatuses
NAME STATUS MESSAGE ERROR
scheduler Healthy ok
controller-manager Healthy ok
etcd-0 Healthy {"health":"true"}
etcd-1 Healthy {"health":"true"}
etcd-2 Healthy {"health":"true"}
# Check cluster node status
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
master1 Ready control-plane,master 1d v1.24.0
master2 Ready control-plane,master 1d v1.24.0
master3 Ready control-plane,master 1d v1.24.0
worker1 Ready worker 1d v1.24.0
worker2 Ready worker 1d v1.24.0
# Check cluster Pod status
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-6d4b75cb6d-7f5f8 1/1 Running 0 1d
calico-node-4q7k8 1/1 Running 0 1d
calico-node-7c9x6 1/1 Running 0 1d
calico-node-8d2k3 1/1 Running 0 1d
calico-node-9f5g7 1/1 Running 0 1d
calico-node-b7c4d 1/1 Running 0 1d
coredns-6d4b75cb6d-7f5f8 1/1 Running 0 1d
coredns-6d4b75cb6d-8d2k3 1/1 Running 0 1d
etcd-master1 1/1 Running 0 1d
etcd-master2 1/1 Running 0 1d
etcd-master3 1/1 Running 0 1d
kube-apiserver-master1 1/1 Running 0 1d
kube-apiserver-master2 1/1 Running 0 1d
kube-apiserver-master3 1/1 Running 0 1d
kube-controller-manager-master1 1/1 Running 0 1d
kube-controller-manager-master2 1/1 Running 0 1d
kube-controller-manager-master3 1/1 Running 0 1d
kube-proxy-4q7k8 1/1 Running 0 1d
kube-proxy-7c9x6 1/1 Running 0 1d
kube-proxy-8d2k3 1/1 Running 0 1d
kube-proxy-9f5g7 1/1 Running 0 1d
kube-proxy-b7c4d 1/1 Running 0 1d
kube-scheduler-master1 1/1 Running 0 1d
kube-scheduler-master2 1/1 Running 0 1d
kube-scheduler-master3 1/1 Running 0 1d
# Check cluster events
$ kubectl get events --sort-by='.lastTimestamp'
No resources found in default namespace.
# Check cluster version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"1234567890abcdef", GitTreeState:"clean", BuildDate:"2024-01-01T00:00:00Z", GoVersion:"go1.18.0", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"1234567890abcdef", GitTreeState:"clean", BuildDate:"2024-01-01T00:00:00Z", GoVersion:"go1.18.0", Compiler:"gc", Platform:"linux/amd64"}

4.2 Node Health Check Case

A node health check walkthrough for production Kubernetes.

# Case: node health check
# Check node status
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
master1 Ready control-plane,master 1d v1.24.0
master2 Ready control-plane,master 1d v1.24.0
master3 Ready control-plane,master 1d v1.24.0
worker1 Ready worker 1d v1.24.0
worker2 Ready worker 1d v1.24.0
# Check node details
$ kubectl describe node worker1
Name: worker1
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=worker1
kubernetes.io/os=linux
node-role.kubernetes.io/worker=worker
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: 2024-01-01T00:00:00Z
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
—- —— —————– —————— —— ——-
NetworkUnavailable False 2024-01-01T00:00:00Z 2024-01-01T00:00:00Z CalicoIsUp Calico is running on this node
MemoryPressure False 2024-01-01T00:00:00Z 2024-01-01T00:00:00Z KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False 2024-01-01T00:00:00Z 2024-01-01T00:00:00Z KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False 2024-01-01T00:00:00Z 2024-01-01T00:00:00Z KubeletHasSufficientPID kubelet has sufficient PID available
Ready True 2024-01-01T00:00:00Z 2024-01-01T00:00:00Z KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.1.104
Hostname: worker1
Capacity:
cpu: 8
ephemeral-storage: 510223Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16384Mi
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 465533Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16384Mi
pods: 110
System Info:
Machine ID: 12345678-1234-1234-1234-1234567890ab
System UUID: 12345678-1234-1234-1234-1234567890ab
Boot ID: 12345678-1234-1234-1234-1234567890ab
Kernel Version: 5.4.0-100-generic
OS Image: Ubuntu 20.04 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://20.10.8
Kubelet Version: v1.24.0
Kube-Proxy Version: v1.24.0
PodCIDR: 10.244.1.0/24
PodCIDRs: 10.244.1.0/24
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
——— —- ———— ———- ————— ————- —
default nginx-6d6f58987b-7f5f8 100m (1%) 200m (2%) 256Mi (1%) 512Mi (3%) 1d
default nginx-6d6f58987b-8d2k3 100m (1%) 200m (2%) 256Mi (1%) 512Mi (3%) 1d
default nginx-6d6f58987b-9f5g7 100m (1%) 200m (2%) 256Mi (1%) 512Mi (3%) 1d
kube-system calico-node-4q7k8 250m (3%) 500m (6%) 512Mi (3%) 1Gi (6%) 1d
kube-system kube-proxy-4q7k8 100m (1%) 0 (0%) 0 (0%) 0 (0%) 1d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
——– ——– ——
cpu 650m (8%) 1100m (13%)
memory 1280Mi (7%) 2560Mi (15%)
ephemeral-storage 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
—- —— —- —- ——-
Normal Starting 1d kubelet, worker1 Starting kubelet.
Normal NodeHasSufficientMemory 1d kubelet, worker1 Node worker1 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 1d kubelet, worker1 Node worker1 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 1d kubelet, worker1 Node worker1 status is now: NodeHasSufficientPID
Normal NodeReady 1d kubelet, worker1 Node worker1 status is now: NodeReady
# Check node resource usage
$ kubectl top node worker1
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
worker1 1500m 75% 12288Mi 60%
# Check node disk usage (on the node itself; kubectl exec targets Pods, not nodes)
$ ssh worker1 df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 25G 25G 50% /
/dev/sdb1 200G 50G 150G 25% /Kubernetes/fgdata

4.3 Application Health Check Case


An application health check walkthrough for production Kubernetes.

# Case: application health check
# Create an application with health checks configured
$ cat > nginx-deployment.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 10
          failureThreshold: 30
EOF
$ kubectl apply -f nginx-deployment.yaml
# Check application Pod status
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-6d6f58987b-7f5f8 1/1 Running 0 5m
nginx-6d6f58987b-8d2k3 1/1 Running 0 5m
nginx-6d6f58987b-9f5g7 1/1 Running 0 5m
# Check application Pod details
$ kubectl describe pod nginx-6d6f58987b-7f5f8
# Check application Service status
$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 1d
nginx LoadBalancer 10.100.123.45 192.168.1.200 80:30080/TCP 5m
# Test application health
$ curl http://192.168.1.200
...
<title>Welcome to nginx!</title>
...
# Simulate an application fault
$ kubectl exec -it nginx-6d6f58987b-7f5f8 -- rm /usr/share/nginx/html/index.html
# Check application Pod status
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-6d6f58987b-7f5f8 0/1 CrashLoopBackOff 1 10m
nginx-6d6f58987b-8d2k3 1/1 Running 0 10m
nginx-6d6f58987b-9f5g7 1/1 Running 0 10m
# Check application Pod logs
$ kubectl logs nginx-6d6f58987b-7f5f8
2024/01/01 00:00:00 [error] 1#1: *1 open() "/usr/share/nginx/html/index.html" failed (2: No such file or directory), client: 127.0.0.1, server: localhost, request: "GET / HTTP/1.1", host: "localhost"
# Fix the fault (the redirect must run inside the container, hence sh -c)
$ kubectl exec -it nginx-6d6f58987b-7f5f8 -- sh -c 'echo "<h1>Welcome to nginx!</h1>" > /usr/share/nginx/html/index.html'
# Check application Pod status
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-6d6f58987b-7f5f8 1/1 Running 1 15m
nginx-6d6f58987b-8d2k3 1/1 Running 0 15m
nginx-6d6f58987b-9f5g7 1/1 Running 0 15m

Part05-Fengge's Experience and Takeaways

5.1 Inspection Best Practices

Best practices for daily Kubernetes inspection:

  • Inspect regularly: follow the inspection schedule so problems are found early
  • Automate: use automation tools to make inspection efficient and repeatable
  • Be thorough: cover every aspect of the cluster, including cluster, node, and application status
  • Record in detail: keep detailed inspection records for later analysis and troubleshooting
  • Act promptly: fix problems found during inspection before they escalate
  • Improve continuously: refine the inspection plan and cluster configuration based on findings
  • Train: train operations staff to build inspection skills
  • Share: share inspection experience within the team to raise overall capability

5.2 Health Check Best Practices

Best practices for Kubernetes health checks:

  • Configure sensibly: tune probe parameters to the application's behavior
  • Combine probe types: use liveness, readiness, and startup probes together where appropriate
  • Choose check paths carefully: probe an endpoint that truly reflects application health
  • Set timeouts sensibly: avoid false failures from overly tight timeouts
  • Set periods sensibly: detect problems quickly without excessive probe overhead
  • Set failure thresholds sensibly: avoid restarting on a single transient failure
  • Monitor and alert: track probe failures so problems surface quickly
  • Optimize continuously: adjust probe configuration as the application's behavior changes
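The thresholds above combine into concrete time windows: a dead container is detected after roughly periodSeconds × failureThreshold, and the startup probe bounds allowed startup time the same way. Working through the parameter suggestions from Part02:

```shell
# Worst-case detection window for the suggested liveness settings
periodSeconds=30; failureThreshold=3
echo "liveness: up to $((periodSeconds * failureThreshold))s before restart"   # 90s
# Maximum startup time tolerated by the suggested startup probe
periodSeconds=5; failureThreshold=30
echo "startup: up to $((periodSeconds * failureThreshold))s to start"          # 150s
```

When tuning, check these products against reality: the liveness window should exceed any normal pause (e.g. GC), and the startup window should exceed the slowest observed cold start.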

5.3 Optimization Suggestions

Suggestions for optimizing daily inspection and health checks:

  1. Tools: use the right inspection and health check tools to improve efficiency
  2. Automation: automate inspection and health checks to reduce manual work
  3. Monitoring: build a complete monitoring system to surface problems quickly
  4. Alerting: tune alert rules to cut false positives and improve accuracy
  5. Process: streamline inspection and health check workflows
  6. Documentation: maintain complete inspection and health check documentation for reference
  7. Training: train operations staff in inspection and health check skills
  8. Experience: capture lessons learned and keep improving
Keep learning: daily inspection and health checks are a core part of Kubernetes cluster operations; keep up with new techniques and methods as business needs evolve.

This article was compiled and published by Fengge Tutorials for learning and testing only. When reposting, credit the source: http://www.fgedu.net.cn/10327.html
