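Summary: This Fengge tutorial draws on the official documentation for Linux, Red Hat Enterprise Linux, Ansible Automation Platform, Docker, Kubernetes, and Podman, and describes how to configure and use the relevant technologies.
This document describes how to configure monitoring for a Kubernetes cluster.
Part01-Monitoring Architecture
1.1 Monitoring components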
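[root@k8s-master ~]# cat > /root/k8s-monitoring.txt << 'EOF'
Kubernetes Monitoring Architecture
==================================

1. Monitoring components
- Prometheus: metrics collection and storage
- Grafana: visualization
- Alertmanager: alert management
- Node Exporter: node metrics
- kube-state-metrics: cluster object state metrics

2. Metric types
- Resource metrics: CPU, memory, storage
- Workload metrics: Pod, Deployment
- Network metrics: traffic, latency
- Custom metrics: application metrics

3. Monitoring layers
- Node monitoring
- Container monitoring
- Application monitoring
- Cluster monitoring

4. Alerting
- PrometheusRule
- Alertmanager configuration
- Notification channels
EOF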
Part02-Deploying Prometheus
2.1 Deploying with Helm
[root@k8s-master ~]# curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
Downloading https://get.helm.sh/helm-v3.13.0-linux-amd64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm
# Add the Prometheus community chart repository
[root@k8s-master ~]# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
"prometheus-community" has been added to your repositories
[root@k8s-master ~]# helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈
# Create the namespace
[root@k8s-master ~]# kubectl create namespace monitoring
namespace/monitoring created
# Deploy kube-prometheus-stack
[root@k8s-master ~]# helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring
NAME: prometheus
LAST DEPLOYED: Sat Apr 4 11:00:00 2026
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace monitoring get pods -l "release=prometheus"
Get Grafana 'admin' user password by running:
kubectl --namespace monitoring get secrets prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
Access Grafana local instance by forwarding the Grafana port:
kubectl --namespace monitoring port-forward service/prometheus-grafana 3000:80
Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and PrometheusOperator instances.
# Check the deployment status
[root@k8s-master ~]# kubectl get pods -n monitoring
NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running   0          2m
prometheus-grafana-abc123-xyz789                         3/3     Running   0          2m
prometheus-kube-prometheus-operator-abc123               1/1     Running   0          2m
prometheus-kube-state-metrics-abc123                     1/1     Running   0          2m
prometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running   0          2m
prometheus-prometheus-node-exporter-abc12                1/1     Running   0          2m
prometheus-prometheus-node-exporter-def34                1/1     Running   0          2m
prometheus-prometheus-node-exporter-ghi56                1/1     Running   0          2m
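A status check like this can be scripted so that an installation script notices pods that are not yet in the Running state. The following is a minimal sketch and not part of the original tutorial: the function name count_not_running is hypothetical, and it assumes the default column order of kubectl get pods --no-headers (NAME, READY, STATUS, RESTARTS, AGE). The second sample row is invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count pods whose STATUS column is not "Running".
# Reads `kubectl get pods --no-headers` style rows on stdin and
# assumes STATUS is the third whitespace-separated column.
count_not_running() {
  awk '$3 != "Running" { n++ } END { print n + 0 }'
}

# Demo with sample rows; no cluster access is needed.
count_not_running <<'SAMPLE'
prometheus-grafana-abc123-xyz789   3/3   Running            0   2m
example-pod-0                      0/1   CrashLoopBackOff   4   2m
SAMPLE
```

Against a live cluster the helper would be fed with: kubectl get pods -n monitoring --no-headers | count_not_running. Note that pods in the Completed state would also be counted. kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s is a built-in alternative when blocking until readiness is all that is needed.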
# Retrieve the Grafana admin password
[root@k8s-master ~]# kubectl --namespace monitoring get secrets prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
prom-operator
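Kubernetes stores Secret values base64-encoded, which is why the command above pipes the jsonpath result through base64 --decode. The sketch below reproduces that step offline so the encoding can be observed without a cluster. The helper name decode_secret_value is hypothetical, and the sample value is simply the password shown above, re-encoded locally.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode one base64-encoded Secret field read from stdin.
decode_secret_value() {
  base64 --decode
  echo   # the stored value has no trailing newline, so add one for display
}

# Encode the sample password locally, then decode it the same way the
# kubectl pipeline does. No cluster access is needed for this demo.
encoded=$(printf '%s' 'prom-operator' | base64)
echo "encoded: ${encoded}"
printf '%s' "${encoded}" | decode_secret_value
```

prom-operator is the chart's default Grafana admin password. For anything beyond a test cluster it would normally be overridden at install time, for example through the chart's grafana.adminPassword value.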
# Port-forward to access Grafana
[root@k8s-master ~]# kubectl --namespace monitoring port-forward service/prometheus-grafana 3000:80
Forwarding from 127.0.0.1:3000 -> 3000
Forwarding from [::1]:3000 -> 3000
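Part03-Monitoring Metrics
3.1 Viewing monitoring data
# View node resource usage (kubectl top relies on the Metrics API, usually served by metrics-server, which must be installed separately)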
[root@k8s-master ~]# kubectl top nodes
NAME         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
k8s-master   200m         5%     1024Mi          12%
k8s-node1    150m         3%     512Mi           6%
k8s-node2    150m         3%     512Mi           6%
# View Pod resource usage
[root@k8s-master ~]# kubectl top pods -A
NAMESPACE     NAME                                 CPU(cores)   MEMORY(bytes)
default       fgedu-web-abc12-xyz789               1m           5Mi
default       fgedu-web-abc12-abc12                1m           5Mi
default       fgedu-web-abc12-def34                1m           5Mi
kube-system   coredns-5dd5756b68-abc12             3m           15Mi
kube-system   coredns-5dd5756b68-def34             3m           15Mi
kube-system   etcd-k8s-master                      20m          50Mi
kube-system   kube-apiserver-k8s-master            50m          200Mi
kube-system   kube-controller-manager-k8s-master   15m          50Mi
kube-system   kube-proxy-abc12                     1m           10Mi
kube-system   kube-scheduler-k8s-master            5m           20Mi
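The CPU% column of kubectl top nodes lends itself to quick threshold checks in scripts. As a rough sketch (the function name nodes_over_cpu and the 4% demo threshold are illustrative assumptions, and the parsing relies on the column order NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of nodes whose CPU% exceeds a threshold.
# Reads `kubectl top nodes` output (including the header row) on stdin.
# $1: threshold in percent, e.g. 80
nodes_over_cpu() {
  awk -v limit="$1" '
    NR > 1 {                 # skip the header row
      pct = $3               # third column holds CPU%
      sub(/%$/, "", pct)     # strip the trailing percent sign
      if (pct + 0 > limit + 0) print $1
    }'
}

# Demo using the sample node figures from this section.
nodes_over_cpu 4 <<'SAMPLE'
NAME         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
k8s-master   200m         5%     1024Mi          12%
k8s-node1    150m         3%     512Mi           6%
k8s-node2    150m         3%     512Mi           6%
SAMPLE
```

With the sample figures only k8s-master exceeds 4%. On a live cluster the call would be kubectl top nodes | nodes_over_cpu 80. For continuous monitoring, the Prometheus alert rules in Part04 are the better tool; a script like this is only suited to ad hoc checks.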
# Access the Prometheus UI
[root@k8s-master ~]# kubectl --namespace monitoring port-forward service/prometheus-operated 9090:9090
Forwarding from 127.0.0.1:9090 -> 9090
# Query Prometheus metrics (from a second terminal, while the port-forward is running)
[root@k8s-master ~]# curl -s 'http://localhost:9090/api/v1/query?query=up' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "job": "prometheus",
          "instance": "localhost:9090"
        },
        "value": [1712206800, "1"]
      },
      {
        "metric": {
          "job": "node-exporter",
          "instance": "k8s-node1:9100"
        },
        "value": [1712206800, "1"]
      },
      {
        "metric": {
          "job": "node-exporter",
          "instance": "k8s-node2:9100"
        },
        "value": [1712206800, "1"]
      }
    ]
  }
}
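The raw response is verbose, so it is often reduced to one line per target. Below is a minimal sketch using jq, which this section already relies on. The function name summarize_up is hypothetical, and the field paths follow the instant-query response structure (data.result[].metric and data.result[].value).

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn an instant-query response into
# "job instance value" lines. Reads the JSON body of /api/v1/query on stdin.
summarize_up() {
  jq -r '.data.result[] | "\(.metric.job) \(.metric.instance) \(.value[1])"'
}

# Demo with a trimmed sample response; no Prometheus server is needed.
summarize_up <<'JSON'
{"status":"success","data":{"resultType":"vector","result":[
  {"metric":{"job":"node-exporter","instance":"k8s-node1:9100"},"value":[1712206800,"1"]}
]}}
JSON
```

Against the port-forwarded server, one way to call it is: curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=up' | summarize_up. Using -G with --data-urlencode lets curl handle the escaping that longer PromQL expressions need.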
Part04-Configuring Alerts
4.1 Alert rule configuration
[root@k8s-master ~]# cat > fgedu-alert-rules.yaml << 'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fgedu-alert-rules
  namespace: monitoring
  labels:
    # Matches the Helm release name so the chart's default rule selector picks up this rule
    release: prometheus
spec:
  groups:
  - name: fgedu-node-alerts
    rules:
    - alert: NodeHighCPU
      expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High node CPU usage"
        description: "Node {{ $labels.instance }} CPU usage is above 80%, current value: {{ $value }}%"
    - alert: NodeHighMemory
      expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High node memory usage"
        description: "Node {{ $labels.instance }} memory usage is above 85%, current value: {{ $value }}%"
    - alert: NodeDiskPressure
      expr: (1 - (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"})) * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High node disk usage"
        description: "Node {{ $labels.instance }} disk {{ $labels.mountpoint }} usage is above 85%, current value: {{ $value }}%"
  - name: fgedu-pod-alerts
    rules:
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod restarting frequently"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restart count over the last 15 minutes: {{ $value }}"
    - alert: PodNotReady
      expr: kube_pod_status_phase{phase=~"Pending|Unknown"} > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod not ready"
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is in phase {{ $labels.phase }}"
EOF
[root@k8s-master ~]# kubectl apply -f fgedu-alert-rules.yaml
prometheusrule.monitoring.coreos.com/fgedu-alert-rules created
# List the alert rules
[root@k8s-master ~]# kubectl get prometheusrule -n monitoring
NAME                AGE
fgedu-alert-rules   10s
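To make the arithmetic behind two of these expressions concrete: NodeHighCPU subtracts the average idle fraction (scaled to percent) from 100, and PodCrashLooping multiplies a per-second restart rate by the 900 seconds in the 15-minute window. The sketch below reproduces both calculations with awk on made-up inputs. The function names and sample numbers are illustrative assumptions, not values from a real cluster.

```shell
#!/usr/bin/env bash
# cpu_usage_pct IDLE_FRACTION
# Mirrors: 100 - (average idle rate * 100). IDLE_FRACTION stands in for the
# averaged per-second rate of the idle-mode CPU counter, between 0 and 1.
cpu_usage_pct() {
  awk -v idle="$1" 'BEGIN { printf "%.1f\n", 100 - idle * 100 }'
}

# restarts_in_window RATE_PER_SECOND
# Mirrors: rate(...[15m]) * 60 * 15, i.e. restarts extrapolated over 900 seconds.
restarts_in_window() {
  awk -v r="$1" 'BEGIN { printf "%.1f\n", r * 60 * 15 }'
}

cpu_usage_pct 0.15          # hypothetical: CPUs idle 15% of the time on average
restarts_in_window 0.002    # hypothetical per-second restart rate
```

With these inputs the first call prints 85.0, which would exceed the 80 threshold of NodeHighCPU, and the second prints 1.8. Because rate() extrapolates, the restart figure is an estimate and can be fractional. PromQL's increase() function expresses the same windowed count more directly.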
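- Deploy a complete monitoring stack
- Configure sensible alert rules
- Set up alert notification channels
- Review monitoring data regularly
- Tune alert thresholds
This article was compiled and published by Fengge Tutorials for study and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html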
