1. Kubernetes Monitoring Overview
Monitoring is a key part of keeping containerized applications running reliably: it gives real-time visibility into cluster state, resource usage, and application performance. In a cloud-native architecture, the monitoring stack must cover the infrastructure, the container platform, and the application services.
In the FGedu Kubernetes cluster we built a complete monitoring stack, including Prometheus for metrics, Grafana for visualization, AlertManager for alerting, and ELK for log analysis.
1.1 Monitoring Architecture Design
A Kubernetes monitoring architecture must cover data collection, storage, visualization, and alerting.
Monitoring layers:
1. Node layer
- Node resource usage (CPU, memory, disk, network)
- Node conditions (Ready, NotReady)
- Kernel parameters and system load
2. Container layer
- Container resource usage
- Container state (Running, Pending, Failed)
- Container restart counts
3. Pod layer
- Pod status and lifecycle
- Pod resource limits and usage
- Pod networking and storage
4. Service layer
- Service availability
- Ingress traffic
- DNS resolution
5. Application layer
- Application performance metrics
- Business metrics
- Custom metrics
# Monitoring tool selection
Core components:
- Prometheus: metrics collection and storage
- AlertManager: alert management
- Grafana: visualization
- Node Exporter: node-level metrics
- kube-state-metrics: Kubernetes object metrics
- cAdvisor: container metrics
# Monitoring data flow
Collection -> Transport   -> Storage    -> Display   -> Alerting
Exporter      Pushgateway    Prometheus    Grafana      AlertManager
DaemonSet     Sidecar        TSDB          Dashboard    Webhook
2. Deploying Prometheus Monitoring
2.1 Deploying Prometheus with Helm
Helm is the package manager for Kubernetes and makes it easy to deploy the full Prometheus monitoring stack.
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
# Create the namespace
$ kubectl create namespace monitoring
# Deploy kube-prometheus-stack
$ helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=10Gi \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi
NAME: prometheus
LAST DEPLOYED: Fri Apr 3 10:00:00 2026
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed.
# Check deployment status
$ kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 5m
prometheus-grafana-7b8f9c-d4e5f 3/3 Running 0 5m
prometheus-kube-prometheus-operator-7b8f9c-g6h7i 1/1 Running 0 5m
prometheus-kube-state-metrics-7b8f9c-j8k9l 1/1 Running 0 5m
prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 5m
prometheus-prometheus-node-exporter-abc12 1/1 Running 0 5m
# List services
$ kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
alertmanager-operated ClusterIP None
prometheus-grafana ClusterIP 10.96.100.1
prometheus-kube-prometheus-alertmanager ClusterIP 10.96.100.2
prometheus-kube-prometheus-operator ClusterIP 10.96.100.3
prometheus-kube-prometheus-prometheus ClusterIP 10.96.100.4
prometheus-operated ClusterIP None
# Access Grafana
$ kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
# Access Prometheus
$ kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
# Retrieve the Grafana admin password
$ kubectl get secret prometheus-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 --decode
prom-operator
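Secret values are stored base64-encoded; the `base64 --decode` step above can also be done in Python. A minimal illustration (the encoded string below is simply the encoding of the default `prom-operator` password):

```python
import base64

# kubectl's jsonpath returns the raw base64-encoded Secret value,
# which must be decoded before use.
encoded = "cHJvbS1vcGVyYXRvcg=="  # value of .data.admin-password
password = base64.b64decode(encoded).decode("utf-8")
print(password)  # -> prom-operator
```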
2.2 Prometheus Configuration in Detail
Understanding Prometheus's core configuration lets you implement custom monitoring requirements.
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'fgedu-k8s'
    region: 'cn-north'

# Alerting rule files
rule_files:
  - /etc/prometheus/rules/*.yml

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Scrape configuration
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Kubernetes API server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Kubernetes nodes
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  # Kubernetes pods
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  # Custom application monitoring
  - job_name: 'fgedu-app'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - fgedu-app
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: fgedu-webapp

# Inspect the configuration managed by the operator
$ kubectl get prometheus -n monitoring -o yaml
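The `__address__` rewriting rule in the `kubernetes-pods` job is worth unpacking: the joined source labels (separated by `;`) are matched against the regex, and the replacement swaps the discovered port for the one in the `prometheus.io/port` annotation. A sketch with Python's `re` module (Prometheus actually uses RE2 with fully anchored regexes, but this pattern behaves identically here):

```python
import re

# Prometheus relabeling: regex ([^:]+)(?::\d+)?;(\d+), replacement $1:$2.
# Source labels __address__ and the port annotation are joined with ';'.
PATTERN = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(address: str, annotation_port: str) -> str:
    """Replace the discovered port with the annotated scrape port."""
    joined = f"{address};{annotation_port}"
    m = PATTERN.fullmatch(joined)  # Prometheus anchors its regexes fully
    return f"{m.group(1)}:{m.group(2)}" if m else address

print(rewrite_address("10.244.1.17:80", "8080"))  # -> 10.244.1.17:8080
print(rewrite_address("10.244.1.17", "8080"))     # -> 10.244.1.17:8080
```

Note that the optional `(?::\d+)?` group is what lets the rule work whether or not the discovered address already carries a port.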
3. Metrics Collection Configuration
3.1 Custom Metrics Collection
Configure custom metrics collection for your applications to monitor them at the business level.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fgedu-webapp
  namespace: fgedu-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fgedu-webapp
  template:
    metadata:
      labels:
        app: fgedu-webapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: webapp
          image: fgedu/webapp:v1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
# ServiceMonitor definition
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fgedu-webapp-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: fgedu-webapp
  namespaceSelector:
    matchNames:
      - fgedu-app
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
# PodMonitor definition
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: fgedu-webapp-pods
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: fgedu-webapp
  namespaceSelector:
    matchNames:
      - fgedu-app
  podMetricsEndpoints:
    - port: http
      path: /metrics
      interval: 30s
# Example of exposing application metrics
# Python application using prometheus_client
from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter('request_count', 'Total request count', ['method', 'endpoint'])
REQUEST_LATENCY = Histogram('request_latency_seconds', 'Request latency', ['endpoint'])
ACTIVE_CONNECTIONS = Gauge('active_connections', 'Active connections')

# Use the metrics
@app.route('/api/users')
def get_users():
    REQUEST_COUNT.labels(method='GET', endpoint='/api/users').inc()
    with REQUEST_LATENCY.labels(endpoint='/api/users').time():
        # Handle the request
        pass
    return jsonify(users=[])

# Start the metrics HTTP server
start_http_server(8080)
# Verify metrics collection
$ kubectl port-forward pod/fgedu-webapp-abc123 -n fgedu-app 8080:8080
$ curl http://localhost:8080/metrics
# HELP request_count Total request count
# TYPE request_count counter
request_count{endpoint="/api/users",method="GET"} 1234.0
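The `/metrics` output uses Prometheus's text exposition format: `# HELP`/`# TYPE` comments plus `name{labels} value` sample lines. A minimal stdlib parser sketch, meant for understanding the format rather than production use (the official client libraries handle edge cases such as label-value escaping):

```python
import re

# One sample line: metric name, optional {label="value",...}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')

def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        m = SAMPLE_RE.match(line)
        if not m:
            continue
        name, labelstr, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', labelstr or ""))
        samples.append((name, labels, float(value)))
    return samples

text = '''# HELP request_count Total request count
# TYPE request_count counter
request_count{endpoint="/api/users",method="GET"} 1234.0'''
print(parse_metrics(text))
```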
3.2 Key Monitoring Metrics
Knowing the key Kubernetes metrics is the foundation of an effective monitoring system.
# Node metrics
# CPU usage (%)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage (%)
(1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes)) * 100
# Network throughput
irate(node_network_receive_bytes_total[5m])
irate(node_network_transmit_bytes_total[5m])
# Pod metrics
# CPU usage (cores)
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
# Memory usage (bytes)
sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace)
# Network throughput
sum(rate(container_network_receive_bytes_total[5m])) by (pod, namespace)
# Restarts over the last hour
increase(kube_pod_container_status_restarts_total[1h])
# Kubernetes object metrics
# Pod phase
kube_pod_status_phase{phase="Running"}
kube_pod_status_phase{phase="Pending"}
kube_pod_status_phase{phase="Failed"}
# Node condition
kube_node_status_condition{condition="Ready",status="true"}
# Deployment availability ratio
kube_deployment_status_replicas_available / kube_deployment_spec_replicas
# Resource quotas
kube_resourcequota{type="hard"}
kube_resourcequota{type="used"}
# Query examples
# CPU usage for all pods (% of one core)
$ curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace) * 100'
# Node memory usage
$ curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100'
# Current pod resource usage via the metrics API (requires metrics-server)
$ kubectl get --raw '/apis/metrics.k8s.io/v1beta1/namespaces/fgedu-app/pods' | jq .
{
  "kind": "PodMetricsList",
  "items": [
    {
      "metadata": {
        "name": "fgedu-webapp-abc123",
        "namespace": "fgedu-app"
      },
      "containers": [
        {
          "name": "webapp",
          "usage": {
            "cpu": "50m",
            "memory": "256Mi"
          }
        }
      ]
    }
  ]
}
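The curl queries above can also be issued from Python using only the standard library. A sketch assuming Prometheus is reachable at `localhost:9090` (for example via the port-forward shown earlier); `build_query_url` and `instant_query` are illustrative helper names, not part of any client library:

```python
import json
import urllib.parse
import urllib.request

def build_query_url(base, promql):
    """Build an instant-query URL for Prometheus's HTTP API."""
    return base + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def instant_query(base, promql):
    """Run the query and return the result vector (status must be 'success')."""
    with urllib.request.urlopen(build_query_url(base, promql)) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

url = build_query_url(
    "http://localhost:9090",
    '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100',
)
print(url)
# With a live Prometheus:
# for series in instant_query("http://localhost:9090", "up"):
#     print(series["metric"], series["value"])
```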
4. Grafana Visualization
4.1 Importing Dashboards
Grafana has a rich library of ready-made dashboard templates for getting visual monitoring up quickly.
Recommended dashboards:
1. Kubernetes Cluster Monitoring (ID: 315)
- Node resource monitoring
- Pod resource monitoring
- Network traffic monitoring
2. Node Exporter Full (ID: 1860)
- Detailed node monitoring
- CPU, memory, disk, network
3. Kubernetes Pods (ID: 6417)
- Pod status monitoring
- Container resource usage
4. Kubernetes API Server (ID: 12006)
- API server performance
- Request latency
# How to import a dashboard
Method 1: via the UI
1. Open the Grafana UI
2. Click "+" -> "Import"
3. Enter the dashboard ID
4. Select the Prometheus data source
5. Click "Import"
Method 2: via the API
$ curl -X POST http://admin:prom-operator@grafana:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -d @dashboard.json
# Custom dashboard definition
# dashboard.json example
{
  "dashboard": {
    "title": "FGedu Application Dashboard",
    "uid": "fgedu-app",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
      }
    ]
  },
  "overwrite": true
}
# Defining alert rules
# PrometheusRule CRD
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fgedu-alerts
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: kubernetes-alerts
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
        - alert: HighCPUUsage
          expr: sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on {{ $labels.namespace }}/{{ $labels.pod }}"
        - alert: PodNotReady
          expr: kube_pod_status_phase{phase=~"Pending|Failed"} > 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not ready"
# Apply the alert rules
$ kubectl apply -f prometheus-rule.yaml
5. Logging Architecture Design
5.1 Kubernetes Logging Architecture
A Kubernetes logging architecture must unify the management of container logs, application logs, and system logs.
Log sources:
1. Container stdout logs
- Location: /var/log/containers/*.log
- Format: JSON
- Managed by: kubelet
2. Container stderr logs
- Location: /var/log/containers/*.log
- Format: JSON
- Managed by: kubelet
3. Logs written inside the container
- Location: the container's own filesystem
- Requires: a sidecar container or a volume mount
4. Kubernetes component logs
- kubelet: /var/log/kubelet
- API server: /var/log/kube-apiserver
- Scheduler: /var/log/kube-scheduler
- Controller manager: /var/log/kube-controller-manager
# Log collection architectures
Option 1: DaemonSet
- A log collection agent runs on every node
- Collects logs from all containers on that node
- Ships them to a central logging system
Option 2: Sidecar
- A log collection sidecar runs in every pod
- Collects logs for one specific application
- Flexible, but resource-hungry
Option 3: Direct shipping from the application
- The application sends logs to the central system itself
- No extra components required
- Requires application support
# The FGedu logging architecture
Components:
- Fluentd/Fluent Bit: log collection
- Elasticsearch: log storage
- Kibana: log visualization
- Kafka: log buffering (optional)
Data flow:
Container logs -> Fluentd -> Elasticsearch -> Kibana
Container logs -> Fluentd -> Kafka -> Logstash -> Elasticsearch (buffered path)
6. Log Collection and Analysis
6.1 Deploying the EFK Stack
Deploy Elasticsearch, Fluentd, and Kibana for Kubernetes log management.
# Deploy Elasticsearch (single-node demo; use a proper multi-node cluster in production)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
spec:
  serviceName: elasticsearch
  replicas: 1   # discovery.type=single-node only supports one replica
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
          ports:
            - containerPort: 9200
          env:
            - name: discovery.type
              value: single-node
            - name: ES_JAVA_OPTS
              value: "-Xms2g -Xmx2g"
            - name: xpack.security.enabled
              value: "false"
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 100Gi
---
# Elasticsearch Service
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
spec:
  ports:
    - port: 9200
  selector:
    app: elasticsearch
# Deploy the Fluentd DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
        - key: node-role.kubernetes.io/control-plane   # taint name used since K8s 1.24
          effect: NoSchedule
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch7
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch.logging.svc.cluster.local"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
            - name: FLUENT_ELASTICSEARCH_SCHEME
              value: "http"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: config
              mountPath: /fluentd/etc/fluent.conf
              subPath: fluent.conf
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: config
          configMap:
            name: fluentd-config
# Fluentd configuration
# fluent.conf (the <source>/<filter>/<match> directive tags are required)
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
  @id filter_kube_metadata
  kubernetes_url "https://#{ENV['KUBERNETES_SERVICE_HOST']}:#{ENV['KUBERNETES_SERVICE_PORT']}"
</filter>

<match kubernetes.**>
  @type elasticsearch
  host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
  port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
  logstash_format true
  logstash_prefix k8s-logs
  <buffer>
    @type file
    path /var/log/fluentd-buffer
    flush_interval 5s
  </buffer>
</match>
# Deploy Kibana
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
        - name: kibana
          image: docker.elastic.co/kibana/kibana:8.12.0
          ports:
            - containerPort: 5601
          env:
            - name: ELASTICSEARCH_HOSTS
              value: "http://elasticsearch:9200"
---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
spec:
  ports:
    - port: 5601
  selector:
    app: kibana
# Check log collection status
$ kubectl logs -n logging daemonset/fluentd
2026-04-03 10:00:00 +0000 [info]: starting fluentd-1.16.0
2026-04-03 10:00:00 +0000 [info]: spawn command to main: cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/local/bin/fluentd", "-c", "/fluentd/etc/fluent.conf", "-p", "/fluentd/plugins"]
2026-04-03 10:00:00 +0000 [info]: adding match pattern="kubernetes.**" type="elasticsearch"
# Access Kibana
$ kubectl port-forward svc/kibana -n logging 5601:5601
# Then open http://localhost:5601 in a browser
6.2 Log Querying and Analysis
Query and analyze logs with Kibana and kubectl.
# View a pod's logs
$ kubectl logs -n fgedu-app fgedu-webapp-abc123
# Stream logs in real time
$ kubectl logs -f -n fgedu-app fgedu-webapp-abc123
# View the last N lines
$ kubectl logs --tail=100 -n fgedu-app fgedu-webapp-abc123
# View logs from the last hour
$ kubectl logs --since=1h -n fgedu-app fgedu-webapp-abc123
# View the previous container's logs (after a restart)
$ kubectl logs -n fgedu-app fgedu-webapp-abc123 --previous
# View one container in a multi-container pod
$ kubectl logs -n fgedu-app fgedu-webapp-abc123 -c webapp
# View logs from all pods matching a label
$ kubectl logs -l app=fgedu-webapp -n fgedu-app
# Elasticsearch log queries
# Using Kibana Dev Tools
# Logs from a specific namespace
GET k8s-logs-*/_search
{
  "query": {
    "match": {
      "kubernetes.namespace_name": "fgedu-app"
    }
  },
  "size": 100,
  "sort": [
    { "@timestamp": { "order": "desc" } }
  ]
}
# Logs from a specific pod
GET k8s-logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "kubernetes.namespace_name": "fgedu-app" } },
        { "match": { "kubernetes.pod_name": "fgedu-webapp-abc123" } }
      ]
    }
  }
}
# Error logs
GET k8s-logs-*/_search
{
  "query": {
    "match": {
      "log": "error"
    }
  }
}
# Aggregation: log volume by namespace
GET k8s-logs-*/_search
{
  "size": 0,
  "aggs": {
    "by_namespace": {
      "terms": {
        "field": "kubernetes.namespace_name.keyword",
        "size": 10
      }
    }
  }
}
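When issuing these queries from code rather than Kibana Dev Tools, it helps to build the query-DSL bodies programmatically. A small sketch (`match_query` is a hypothetical helper; the field names match the Fluentd/`k8s-logs-*` setup above):

```python
import json

def match_query(filters, size=100):
    """Build a bool/must Elasticsearch query body from field -> value filters."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: value}} for field, value in filters.items()]
            }
        },
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
    }

body = match_query({
    "kubernetes.namespace_name": "fgedu-app",
    "kubernetes.pod_name": "fgedu-webapp-abc123",
})
print(json.dumps(body, indent=2))
# POST this body to http://elasticsearch:9200/k8s-logs-*/_search
```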
# Log-based alerting
# Use Kibana Alerting or ElastAlert
# Logging best practices
1. Emit structured (e.g. JSON) log output
2. Use a consistent log format across services
3. Include the necessary context (request IDs, user IDs)
4. Set log levels appropriately
5. Archive and purge logs on a regular schedule
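Practice 1 above (structured logs) can be implemented with Python's standard `logging` module; a minimal JSON formatter sketch (the `request_id` field is an illustrative context field, not a fixed schema):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line, easy for Fluentd/ES to index."""
    def format(self, record):
        entry = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context fields passed via `extra=` (e.g. request_id) are an assumed
        # convention -- use whatever fields your services agree on.
        if hasattr(record, "request_id"):
            entry["request_id"] = record.request_id
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("fgedu-webapp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user list requested", extra={"request_id": "req-42"})
```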
Summary
Monitoring and logging are key to running containerized applications reliably on Kubernetes. This tutorial covered Prometheus deployment, metrics collection configuration, Grafana visualization, logging architecture design, and log collection and analysis. With a solid monitoring and logging stack in place, you can find and diagnose problems quickly and keep the cluster stable.
In practice, tailor the metrics and alert rules to your business, and plan log storage capacity and security protection from the start.
This article was compiled and published by 风哥教程 for learning and testing purposes only; credit the source when reposting: http://www.fgedu.net.cn/10327.html
