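1. Cloud Monitoring Overview
Cloud monitoring is key to keeping a cloud environment running reliably. By tracking the runtime state, performance metrics, and cost of cloud resources in real time, operators can detect and resolve problems early and protect business continuity. Hybrid-cloud and multi-cloud architectures make monitoring harder, because they need a unified monitoring platform and standardized management processes.
FGedu's cloud environment uses a hybrid architecture that combines a private cloud platform with public cloud services, so it needs a unified monitoring and management system. Effective cloud monitoring helps the operations team locate problems quickly, optimize resource usage, and control operating costs.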
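2. Monitoring Architecture Design
2.1 Monitoring System Architecture
A complete cloud monitoring system has to be designed across several layers: data collection, data processing, data storage, and alert notification.
Monitoring layers:
1. Infrastructure layer
- Physical servers: CPU, memory, disk, network
- Network devices: switches, routers, firewalls
- Storage devices: SAN, NAS, distributed storage
2. Virtualization layer
- Virtual machines: resource usage, running state
- Containers: Pod status, resource limits
- Orchestration platform: Kubernetes cluster state
3. Platform services layer
- Database services: connection count, query performance
- Middleware services: request handling, thread state
- Message queues: message backlog, consumer lag
4. Application layer
- Application performance: response time, throughput
- Business metrics: user count, transaction volume
- User experience: page load time, error rate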
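# Monitoring tool selection
Core monitoring components:
- Prometheus: metrics collection and storage
- Grafana: visualization
- AlertManager: alert management
- ELK Stack: log analysis
- Jaeger: distributed tracing
- Zabbix: traditional infrastructure monitoring
# Monitoring data flow
Collection -> Transport -> Processing -> Storage  -> Presentation
    |             |            |            |             |
 Exporter       Agent        Stream        TSDB       Dashboard
 Telegraf      Fluentd      Processor    InfluxDB      Grafana
 Filebeat       Kafka                       ES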
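2.2 Deploying Prometheus
Prometheus is the de facto standard for cloud-native monitoring. It provides a multi-dimensional data model and a powerful query language (PromQL).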
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'fgedu-cloud'
    region: 'cn-north'
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
rule_files:
  - "/etc/prometheus/rules/*.yml"
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['fgedudb:9090']
  # Node Exporter
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - '10.0.1.11:9100'
          - '10.0.1.12:9100'
          - '10.0.1.13:9100'
        labels:
          env: 'production'
  # Kubernetes API servers
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    # Keep only the API server endpoints; without this filter the
    # endpoints role would scrape every endpoint in the cluster.
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
  # Cloud service metrics (via API)
  - job_name: 'cloud-metrics'
    metrics_path: /metrics
    static_configs:
      - targets: ['cloud-exporter:9173']
# Start Prometheus
$ docker run -d \
    --name prometheus \
    -p 9090:9090 \
    -v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v /etc/prometheus/rules:/etc/prometheus/rules \
    prom/prometheus:v2.45.0
# Check Prometheus status
$ curl http://fgedudb:9090/-/healthy
Prometheus is Healthy.
$ curl -s http://fgedudb:9090/api/v1/targets | jq '.data.activeTargets | length'
15
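3. Resource Monitoring
3.1 Compute Resource Monitoring
Monitor compute usage on cloud hosts and containers to make sure resources are allocated and used sensibly.
CPU:
- CPU usage: instance:cpu_usage:rate5m (a recording-rule name; it has to be defined in a rule file before it can be queried, for example as 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- CPU load: node_load1, node_load5, node_load15
- CPU core count: count by (instance) (node_cpu_seconds_total{mode="idle"})
Memory:
- Memory usage (%): (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
- Memory used (bytes): node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
- Swap usage (%): (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) * 100 (only meaningful on hosts where node_memory_SwapTotal_bytes > 0)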
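The usage formulas above all share the shape (1 - free / total) * 100. The following is a minimal sketch of that arithmetic, assuming hypothetical byte values rather than readings from a real host; the helper name `usage_percent` is illustrative only. It also shows why the Swap expression needs care on hosts without swap, where the total is zero.

```python
from typing import Optional


def usage_percent(available: float, total: float) -> Optional[float]:
    """Return (1 - available / total) * 100, or None when total is 0.

    PromQL would yield NaN for 0 / 0 (for example Swap on a host
    without swap); here that case is reported as None instead.
    """
    if total == 0:
        return None
    return (1 - available / total) * 100


# Hypothetical sample values in bytes (not taken from a real host).
mem_total = 16 * 1024**3
mem_available = 4 * 1024**3

print(usage_percent(mem_available, mem_total))  # 75.0
print(usage_percent(0, 0))                      # None (no swap configured)
```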
# Prometheus query examples
# CPU usage query
$ curl -s -G 'http://fgedudb:9090/api/v1/query' \
    --data-urlencode 'query=100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)' | jq '.'
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {"instance": "10.0.1.11:9100"},
        "value": [1680508800, "25.5"]
      },
      {
        "metric": {"instance": "10.0.1.12:9100"},
        "value": [1680508800, "32.1"]
      }
    ]
  }
}
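An instant-query response like the one above carries each sample value as a string inside a [timestamp, value] pair. As a rough sketch of how a script might consume it, the snippet below flattens the response into a plain mapping. The function name `vector_to_dict` is hypothetical, and the sample payload is copied from the output shown above rather than fetched from a live server.

```python
import json

# Sample payload copied from the query output above.
SAMPLE = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"instance": "10.0.1.11:9100"}, "value": [1680508800, "25.5"]},
            {"metric": {"instance": "10.0.1.12:9100"}, "value": [1680508800, "32.1"]}
          ]}}
"""


def vector_to_dict(payload: dict, label: str = "instance") -> dict:
    """Map the chosen label to the float sample value of an instant vector."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector, got %s" % data["resultType"])
    # value is [unix_timestamp, "value-as-string"]
    return {s["metric"].get(label, ""): float(s["value"][1]) for s in data["result"]}


usage = vector_to_dict(json.loads(SAMPLE))
for instance, cpu in usage.items():
    print(f"{instance}: {cpu:.1f}% CPU")
```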
# Memory usage query
$ curl -s -G 'http://fgedudb:9090/api/v1/query' \
    --data-urlencode 'query=(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100'
# Kubernetes resource monitoring
$ kubectl top nodes
NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
k8s-master01   500m         25%    4Gi             50%
k8s-node01     800m         40%    6Gi             75%
k8s-node02     600m         30%    5Gi             62%
$ kubectl top pods -n fgedu-app
NAME                      CPU(cores)   MEMORY(bytes)
webapp-7b8f9c-d4e5f       100m         256Mi
webapp-7b8f9c-g6h7i       150m         384Mi
api-server-9j0k1l-m2n3o   200m         512Mi
# Resource quota monitoring
$ kubectl describe resourcequota compute-quota -n fgedu-app
Name:          compute-quota
Namespace:     fgedu-app
Resource       Used  Hard
--------       ----  ----
limits.cpu     4     10
limits.memory  8Gi   20Gi
pods           15    50
3.2 Storage Resource Monitoring
Monitor the usage and performance of cloud storage to make sure capacity is sufficient and performance is stable.
Disk space:
- Disk usage (%): (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100
- Disk used (bytes): node_filesystem_size_bytes - node_filesystem_avail_bytes
- inode usage (%): (1 - node_filesystem_files_free / node_filesystem_files) * 100
Disk I/O:
- Read IOPS: irate(node_disk_reads_completed_total[5m])
- Write IOPS: irate(node_disk_writes_completed_total[5m])
- Read throughput: irate(node_disk_read_bytes_total[5m])
- Write throughput: irate(node_disk_written_bytes_total[5m])
- I/O utilization (fraction of time the device was busy, not wait time): irate(node_disk_io_time_seconds_total[5m])
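The disk I/O expressions above turn ever-increasing counters into per-second rates. The sketch below illustrates the idea under simplifying assumptions: it handles counter resets but omits the boundary extrapolation that Prometheus's rate() applies, so it is not a reimplementation of PromQL. The sample values and the name `counter_rate` are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def counter_rate(samples: List[Sample]) -> Optional[float]:
    """Average per-second increase over the samples, tolerating resets."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset (for example a
        # reboot), so the increase since the reset is the new value itself.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical node_disk_reads_completed_total samples, scraped every 15 s,
# with a counter reset between t=30 and t=45.
reads = [(0, 1000), (15, 1300), (30, 1600), (45, 100), (60, 400)]

print(counter_rate(reads))       # rate-like: average over the whole window
print(counter_rate(reads[-2:]))  # irate-like: last two samples only -> 20.0
```

The whole-window form smooths short spikes, while the last-two-samples form reacts faster. That difference is why rate() is usually preferred for alerting and irate() for high-resolution graphs.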
# Storage monitoring query
$ curl -s -G 'http://fgedudb:9090/api/v1/query' \
    --data-urlencode 'query=(1 - (node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"})) * 100'
# Kubernetes PVC monitoring
$ kubectl get pvc -n fgedu-app
NAME       STATUS   VOLUME       CAPACITY   ACCESS MODES   STORAGECLASS
data-pvc   Bound    pvc-abc123   100Gi      RWO            fast-ssd
logs-pvc   Bound    pvc-def456   50Gi       RWO            standard
# Check PVC usage (run on the node that hosts the Pods)
$ df -h | grep /var/lib/kubelet/pods
/dev/mapper/vg-data  100G  65G  35G  65%  /var/lib/kubelet/pods/abc123/volumes/kubernetes.io~csi
/dev/mapper/vg-logs   50G  20G  30G  40%  /var/lib/kubelet/pods/def456/volumes/kubernetes.io~csi
# Cloud storage monitoring (AWS S3 example)
$ aws cloudwatch get-metric-statistics \
    --namespace AWS/S3 \
    --metric-name BucketSizeBytes \
    --dimensions Name=BucketName,Value=fgedu-backup Name=StorageType,Value=StandardStorage \
    --start-time 2026-04-01T00:00:00Z \
    --end-time 2026-04-03T23:59:59Z \
    --period 86400 \
    --statistics Average
{
  "Datapoints": [
    {
      "Average": 536870912000,
      "Timestamp": "2026-04-01T00:00:00Z",
      "Unit": "Bytes"
    }
  ]
}
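3.3 Network Resource Monitoring
Monitor network traffic, bandwidth usage, and connection state to make sure network communication stays healthy.
Network traffic:
- Inbound traffic (bytes/s): irate(node_network_receive_bytes_total{device!~"lo|veth.*"}[5m])
- Outbound traffic (bytes/s): irate(node_network_transmit_bytes_total{device!~"lo|veth.*"}[5m])
- Inbound packets/s: irate(node_network_receive_packets_total[5m])
- Outbound packets/s: irate(node_network_transmit_packets_total[5m])
Network errors:
- Inbound errors/s: irate(node_network_receive_errs_total[5m])
- Outbound errors/s: irate(node_network_transmit_errs_total[5m])
- Inbound dropped packets/s (a rate of drops, not a loss percentage): irate(node_network_receive_drop_total[5m])
TCP connections:
- Established TCP connections: node_netstat_Tcp_CurrEstab
- TCP connection opens (cumulative counters, not per-state connection counts): node_netstat_Tcp_ActiveOpens, node_netstat_Tcp_PassiveOpens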
# Network monitoring query (inbound traffic per instance, MiB/s)
$ curl -s -G 'http://fgedudb:9090/api/v1/query' \
    --data-urlencode 'query=sum by (instance) (irate(node_network_receive_bytes_total[5m])) / 1024 / 1024'
# Network connection state
$ ss -s
Total: 15234 (kernel 15678)
TCP:   12456 (estab 8234, closed 2345, orphaned 123, synrecv 0, timewait 1890/0)
Transport Total     IP        IPv6
*         15678     -         -
RAW       1         0         1
UDP       234       123       111
TCP       12456     10234     2222
INET      12691     10357     2334
FRAG      0         0         0
# Kubernetes network policies
$ kubectl get networkpolicy -n fgedu-app
NAME           POD-SELECTOR     AGE
allow-webapp   app=webapp       30d
allow-api      app=api-server   30d
deny-all
# Services
$ kubectl get svc -n fgedu-app
NAME         TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)        AGE
webapp       LoadBalancer   10.96.100.1   10.0.100.100   80:30080/TCP   30d
api-server   ClusterIP      10.96.100.2
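4. Performance Monitoring
4.1 Application Performance Monitoring (APM)
Application performance monitoring gives detailed insight into how an application is running and where its bottlenecks are.
# Distributed tracing with Jaeger
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fgedu-webapp
  namespace: fgedu-app
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the Pod template labels
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
        - name: webapp
          image: fgedu/webapp:v1.0
          env:
            - name: JAEGER_AGENT_HOST
              value: "jaeger-agent.observability.svc.cluster.local"
            - name: JAEGER_AGENT_PORT
              value: "6831"
            - name: JAEGER_SAMPLER_TYPE
              value: "probabilistic"
            # Sample 10% of traces
            - name: JAEGER_SAMPLER_PARAM
              value: "0.1"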
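# Deploy Jaeger (the observability namespace must already exist)
$ kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.47.0/jaeger-operator.yaml -n observability
# View trace data (then open http://localhost:16686)
$ kubectl port-forward svc/jaeger-query 16686:16686 -n observability
# Application performance metrics
Key metrics:
- Latency (response time): P50, P95, P99
- Throughput: requests per second
- Error rate: share of failed requests
- Apdex score: Application Performance Index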
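Of the key metrics above, Apdex is the least self-explanatory. The standard definition counts requests at or below a threshold T as satisfied, requests between T and 4T as tolerating (worth half), and slower requests as frustrated. The sketch below applies that definition to hypothetical latencies; the 0.5 s threshold and the function names are assumptions for illustration, not values from FGedu's environment.

```python
from typing import Optional, Sequence


def apdex(latencies: Sequence[float], threshold: float) -> Optional[float]:
    """Apdex = (satisfied + tolerating / 2) / total, in the range 0..1."""
    if not latencies:
        return None
    satisfied = sum(1 for x in latencies if x <= threshold)
    tolerating = sum(1 for x in latencies if threshold < x <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(latencies)


def error_rate(failed: int, total: int) -> Optional[float]:
    """Share of failed requests, as a percentage."""
    return failed / total * 100 if total else None


# Hypothetical request latencies in seconds, with T = 0.5 s:
# 4 satisfied, 3 tolerating (0.6, 0.9, 1.5), 1 frustrated (2.5).
latencies = [0.1, 0.2, 0.3, 0.6, 0.9, 1.5, 2.5, 0.4]

print(apdex(latencies, threshold=0.5))  # 0.6875
print(error_rate(failed=5, total=200))  # 2.5
```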
# Scraping application metrics with Prometheus
$ curl http://webapp:8080/metrics
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 12345
http_requests_total{method="POST",status="201"} 5678
http_requests_total{method="GET",status="404"} 123
# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 8900
http_request_duration_seconds_bucket{le="0.5"} 11500
http_request_duration_seconds_bucket{le="1"} 12200
http_request_duration_seconds_bucket{le="+Inf"} 12345
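The P50, P95, and P99 latencies are not stored directly. They are estimated from cumulative histogram buckets like the ones above. The following sketch mirrors the linear-interpolation idea behind PromQL's histogram_quantile() on the raw bucket counts shown above. It is a simplified illustration: in practice the function is applied to rate() of the buckets, and this helper is not Prometheus's actual code.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            if count == lower_count:
                return upper
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper - lower_bound) * fraction
        lower_bound, lower_count = upper, count
    return math.nan


# Bucket counts copied from the /metrics output above.
BUCKETS = [(0.1, 8900), (0.5, 11500), (1.0, 12200), (math.inf, 12345)]

for q in (0.5, 0.95, 0.99):
    print(f"P{int(q * 100)}: {histogram_quantile(q, BUCKETS):.3f}s")
```

With these counts the estimate is about 0.069 s for P50 and 0.663 s for P95. P99 falls into the +Inf bucket and is therefore capped at 1 s, which suggests adding buckets above 1 s if tail latency matters.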
4.2 Database Performance Monitoring
Monitor database performance metrics to keep data services stable and efficient.
MySQL:
- Connections: mysql_global_status_threads_connected
- Queries per second: rate(mysql_global_status_queries[5m])
- Slow queries per second: rate(mysql_global_status_slow_queries[5m])
- Buffer pool hit ratio (%): (mysql_global_status_innodb_buffer_pool_read_requests - mysql_global_status_innodb_buffer_pool_reads) / mysql_global_status_innodb_buffer_pool_read_requests * 100
PostgreSQL:
- Active connections: pg_stat_activity_count
- Transactions per second: rate(pg_stat_database_xact_commit[5m])
- Cache hit ratio (%): pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) * 100
# MySQL monitoring queries
$ mysql -u monitor -p -e "SHOW GLOBAL STATUS LIKE 'Threads%';"
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_cached    | 32    |
| Threads_connected | 150   |
| Threads_created   | 256   |
| Threads_running   | 12    |
+-------------------+-------+
$ mysql -u monitor -p -e "SHOW ENGINE INNODB STATUS\G" | grep -A 10 "BUFFER POOL"
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 137438953472
Dictionary memory allocated 12345678
Buffer pool size   8192000
Free buffers       1024
Database pages     8188976
Old database pages 3019830
Modified db pages  12345
Percent of dirty pages(LRU & free pages): 0.151
# PostgreSQL monitoring queries
$ psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
 count
-------
    45
$ psql -c "SELECT blks_hit, blks_read FROM pg_stat_database WHERE datname = 'fgedu_db';"
 blks_hit | blks_read
----------+-----------
  1234567 |     12345
# Cache hit ratio (NULLIF avoids division by zero on a database with no reads yet)
SELECT round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_ratio
FROM pg_stat_database WHERE datname = 'fgedu_db';
 cache_hit_ratio
-----------------
           99.01
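5. Cost Management
5.1 Cloud Cost Monitoring
Monitor what cloud resources cost, optimize resource configuration, and keep operating expenses under control.
Cost dimensions:
- By service type: compute, storage, network, database
- By project or department: cost center, business line
- By resource type: instance type, storage class
- By time period: day, week, month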
# AWS cost monitoring (the End date is exclusive)
$ aws ce get-cost-and-usage \
    --time-period Start=2026-04-01,End=2026-04-03 \
    --granularity DAILY \
    --metrics BlendedCost \
    --group-by Type=DIMENSION,Key=SERVICE
{
  "ResultsByTime": [
    {
      "TimePeriod": {"Start": "2026-04-01", "End": "2026-04-02"},
      "Groups": [
        {"Keys": ["Amazon EC2"], "Metrics": {"BlendedCost": {"Amount": "1234.56"}}},
        {"Keys": ["Amazon S3"], "Metrics": {"BlendedCost": {"Amount": "234.56"}}},
        {"Keys": ["Amazon RDS"], "Metrics": {"BlendedCost": {"Amount": "567.89"}}}
      ]
    }
  ]
}
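A response grouped by service, like the one above, usually needs to be summed across days before it is useful in a report. The sketch below does that aggregation on the sample response shown above, using Decimal so that currency amounts are not distorted by binary floating point. The function name `cost_by_service` is hypothetical, and the payload is the sample from this article rather than live billing data.

```python
from collections import defaultdict
from decimal import Decimal

# Sample response copied from the output above.
RESPONSE = {
    "ResultsByTime": [
        {
            "TimePeriod": {"Start": "2026-04-01", "End": "2026-04-02"},
            "Groups": [
                {"Keys": ["Amazon EC2"], "Metrics": {"BlendedCost": {"Amount": "1234.56"}}},
                {"Keys": ["Amazon S3"], "Metrics": {"BlendedCost": {"Amount": "234.56"}}},
                {"Keys": ["Amazon RDS"], "Metrics": {"BlendedCost": {"Amount": "567.89"}}},
            ],
        }
    ]
}


def cost_by_service(response: dict, metric: str = "BlendedCost") -> dict:
    """Sum the chosen cost metric per service across all returned periods."""
    totals = defaultdict(Decimal)  # Decimal() starts at 0
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            totals[service] += Decimal(group["Metrics"][metric]["Amount"])
    return dict(totals)


costs = cost_by_service(RESPONSE)
# Print services from most to least expensive, then the grand total.
for service, amount in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{service:<12} {amount:>10}")
print(f"{'Total':<12} {sum(costs.values()):>10}")
```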
# Alibaba Cloud cost monitoring
$ aliyun bssopenapi QueryBill --BillingCycle 2026-04
{
  "Data": {
    "Items": {
      "Item": [
        {"ProductCode": "ecs", "ProductName": "云服务器ECS", "PretaxGrossAmount": "1234.56"},
        {"ProductCode": "oss", "ProductName": "对象存储OSS", "PretaxGrossAmount": "234.56"},
        {"ProductCode": "rds", "ProductName": "云数据库RDS", "PretaxGrossAmount": "567.89"}
      ]
    }
  }
}
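# Cost analysis report
$ cat /tmp/cloud_cost_report.md
# FGedu Cloud Cost Analysis Report
## Monthly cost trend
| Month | Compute | Storage | Network | Total |
|-------|---------|---------|---------|-------|
| Jan | ¥50,000 | ¥10,000 | ¥5,000 | ¥65,000 |
| Feb | ¥52,000 | ¥11,000 | ¥5,500 | ¥68,500 |
| Mar | ¥48,000 | ¥12,000 | ¥5,200 | ¥65,200 |
## Cost optimization recommendations
1. Release idle resources: estimated saving ¥5,000/month
2. Purchase reserved instances: estimated saving ¥8,000/month
3. Optimize storage tiering: estimated saving ¥2,000/month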
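The report above lists totals and savings but not the trend between months. As a small worked example, the sketch below recomputes the monthly totals from the table, derives month-over-month change, and relates the combined savings estimate to the March bill. All figures are copied from the report above; the function names are illustrative.

```python
# Figures copied from the cost report above (CNY).
MONTHLY = {
    "Jan": {"compute": 50_000, "storage": 10_000, "network": 5_000},
    "Feb": {"compute": 52_000, "storage": 11_000, "network": 5_500},
    "Mar": {"compute": 48_000, "storage": 12_000, "network": 5_200},
}
SAVINGS = {
    "release idle resources": 5_000,
    "reserved instances": 8_000,
    "storage tiering": 2_000,
}


def totals(monthly: dict) -> dict:
    """Total cost per month."""
    return {month: sum(costs.values()) for month, costs in monthly.items()}


def month_over_month(totals_by_month: dict) -> dict:
    """Percentage change of each month relative to the previous one."""
    months = list(totals_by_month)
    return {
        cur: (totals_by_month[cur] - totals_by_month[prev]) / totals_by_month[prev] * 100
        for prev, cur in zip(months, months[1:])
    }


t = totals(MONTHLY)
print(t)  # {'Jan': 65000, 'Feb': 68500, 'Mar': 65200}
for month, pct in month_over_month(t).items():
    print(f"{month}: {pct:+.1f}% vs previous month")

potential = sum(SAVINGS.values())
print(f"Potential saving: {potential} per month "
      f"({potential / t['Mar'] * 100:.1f}% of the March total)")
```

The recomputed totals match the Total column of the report, and the three recommendations together come to ¥15,000 per month, roughly 23% of the March bill.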
5.2 Resource Optimization Recommendations
Use monitoring data to recommend resource optimizations and raise utilization.
# CPU utilization analysis (30-day average; a subquery needs the [range:step] form,
# and rate() is used because irate() only looks at the last two samples of each window)
$ curl -s -G 'http://fgedudb:9090/api/v1/query' \
    --data-urlencode 'query=avg_over_time((100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))[30d:5m])'
# Memory utilization analysis (30-day average)
$ curl -s -G 'http://fgedudb:9090/api/v1/query' \
    --data-urlencode 'query=avg_over_time(((1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)[30d:5m])'
# Kubernetes resource optimization
$ kubectl describe hpa webapp-hpa -n fgedu-app
Name:         webapp-hpa
Metrics:      ( current / target )
  resource cpu on pods (as a percentage of request):     25% / 70%
  resource memory on pods (as a percentage of request):  45% / 80%
Min replicas: 2
Max replicas: 10
Replicas:     3 current / 3 desired
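To read HPA output like the above, it helps to know the scaling rule documented for the Horizontal Pod Autoscaler: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), evaluated per metric, taking the largest proposal and clamping it to the min/max bounds. The sketch below is a simplified model of that rule with the default 10% tolerance. It ignores stabilization windows, scaling policies, and unready Pods, so it should be read as an approximation, not the controller's actual logic.

```python
import math
from typing import List, Tuple


def desired_replicas(
    current_replicas: int,
    metrics: List[Tuple[float, float]],  # (current value, target value)
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA rule: largest per-metric proposal, clamped to bounds."""
    proposals = []
    for current, target in metrics:
        ratio = current / target
        if abs(ratio - 1.0) <= tolerance:
            # Close enough to the target: keep the current size.
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    return max(min_replicas, min(max_replicas, max(proposals)))


# Values from the webapp-hpa output above: CPU 25%/70%, memory 45%/80%.
print(desired_replicas(3, [(25, 70), (45, 80)], min_replicas=2, max_replicas=10))  # 2
```

The simplified rule proposes 2 replicas, while the output above reports 3 desired. One plausible explanation is the scale-down stabilization window (300 seconds by default), which delays scale-down after load drops. Either way, utilization this far below target suggests the workload's resource requests could be lowered.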
# Resource optimization script
#!/bin/bash
# File: resource_optimization.sh
PROMETHEUS_URL="http://fgedudb:9090"
# Find under-utilized nodes.
# "kubectl top" has no JSON output, so parse its columns: NAME CPU(cores) CPU% ...
echo "=== Low-utilization instances (CPU < 20%) ==="
kubectl top nodes --no-headers | while read -r name cpu_cores cpu_pct _; do
    # Strip the trailing % and skip nodes whose metrics are unavailable
    if [ "${cpu_pct%\%}" -lt 20 ] 2>/dev/null; then
        echo "Instance: $name, CPU Usage: $cpu_cores"
    fi
done
# Find idle storage.
# Assumes Prometheus scrapes the kubelet, which exposes kubelet_volume_stats_* metrics.
echo "=== Idle storage (usage < 10%) ==="
curl -s -G "${PROMETHEUS_URL}/api/v1/query" \
    --data-urlencode 'query=kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10' \
    | jq -r '.data.result[] | "PVC: \(.metric.namespace)/\(.metric.persistentvolumeclaim)"'
# Sample output
$ ./resource_optimization.sh
=== Low-utilization instances (CPU < 20%) ===
Instance: k8s-node05, CPU Usage: 150m
Instance: k8s-node06, CPU Usage: 180m
=== Idle storage (usage < 10%) ===
PVC: fgedu-test/test-data-pvc
PVC: fgedu-dev/dev-backup-pvc
6. Automated Operations
6.1 Automated Alerting
Build a thorough alerting mechanism so that problems are detected and handled promptly.
# /etc/prometheus/rules/cloud-alerts.yml
groups:
  - name: cloud_infrastructure
    rules:
      # CPU usage alert
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"
      # Memory usage alert
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"
      # Disk space alert
      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk {{ $labels.mountpoint }} is {{ $value }}% full"
      # Pod status alert
      # rate() is per second, so "> 0.1" would need about 90 restarts in 15 minutes;
      # counting restarts with increase() matches the intent better.
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
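Each rule above pairs a threshold expression with `for: 5m`, which means the condition must hold continuously for five minutes before the alert fires. Until then the alert is only pending. The sketch below simulates that state machine on a series of hypothetical CPU readings. The 60-second evaluation interval is chosen for brevity (the configuration in this article evaluates every 15 seconds), and the function name is illustrative.

```python
from typing import List, Optional, Sequence


def alert_states(
    values: Sequence[float],
    threshold: float,
    for_seconds: int,
    interval_seconds: int,
) -> List[str]:
    """Return inactive / pending / firing for each evaluation cycle."""
    states = []
    active_since: Optional[int] = None
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now  # condition just became true
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any dip resets the timer
            states.append("inactive")
    return states


# Hypothetical CPU usage readings, one per minute.
cpu = [70, 85, 90, 75, 88, 91, 93, 95, 97, 99]
for minute, state in enumerate(alert_states(cpu, threshold=80, for_seconds=300, interval_seconds=60)):
    print(f"t={minute}m cpu={cpu[minute]}% -> {state}")
```

The dip to 75% at minute 3 resets the timer, so the alert only fires at minute 9, five minutes after the condition became true again at minute 4. This is how `for` filters out short spikes at the cost of slower notification.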
# AlertManager configuration
# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.fgedu.net.cn:25'
  smtp_from: 'alertmanager@fgedu.net.cn'
  smtp_auth_username: 'alertmanager@fgedu.net.cn'
  smtp_auth_password: 'Fgedu@Alert123'
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'critical-receiver'
    - match:
        severity: warning
      receiver: 'warning-receiver'
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops@fgedu.net.cn'
        send_resolved: true
  - name: 'critical-receiver'
    email_configs:
      - to: 'ops-critical@fgedu.net.cn'
    webhook_configs:
      - url: 'http://webhook-server:5000/alerts'
        send_resolved: true
  - name: 'warning-receiver'
    email_configs:
      - to: 'ops-warning@fgedu.net.cn'
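The routing tree above sends each alert to the first child route whose matchers all agree with the alert's labels, and falls back to the top-level receiver when none match. The sketch below models just that selection step for the three receivers configured above. It is a simplified illustration that leaves out grouping, nested routes, and the `continue` option.

```python
from typing import Dict, List, Tuple

# Child routes in the order they appear in alertmanager.yml above.
ROUTES: List[Tuple[Dict[str, str], str]] = [
    ({"severity": "critical"}, "critical-receiver"),
    ({"severity": "warning"}, "warning-receiver"),
]
DEFAULT_RECEIVER = "default-receiver"


def pick_receiver(labels: Dict[str, str]) -> str:
    """Return the receiver of the first route whose matchers all match."""
    for matchers, receiver in ROUTES:
        if all(labels.get(name) == value for name, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


print(pick_receiver({"alertname": "DiskSpaceLow", "severity": "critical"}))  # critical-receiver
print(pick_receiver({"alertname": "HighCPUUsage", "severity": "warning"}))   # warning-receiver
print(pick_receiver({"alertname": "TestAlert", "severity": "info"}))         # default-receiver
```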
# Send a test alert
$ amtool alert add --alertmanager.url=http://fgedudb:9093 \
    alertname="TestAlert" severity="warning" instance="test-node"
6.2 Automation Scripts
Scripts can take over routine operations tasks so that they run without manual intervention.
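#!/bin/bash
# File: cloud_ops_automation.sh
# Purpose: automate routine operations tasks in the cloud environment
# Configuration
PROMETHEUS_URL="http://prometheus:9090"
ALERT_THRESHOLD=90
LOG_FILE="/var/log/cloud_ops.log"
# Logging helper: timestamped message to stdout and to the log file
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}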
# Check instance health
check_instance_health() {
    log "Checking instance health..."
    # List only instances that expose node metrics; other jobs would return no data
    instances=$(curl -s -G "${PROMETHEUS_URL}/api/v1/label/instance/values" \
        --data-urlencode 'match[]=node_cpu_seconds_total' | jq -r '.data[]')
    for instance in $instances; do
        # Check CPU usage ("// empty" yields an empty string when there is no result)
        cpu_usage=$(curl -s -G "${PROMETHEUS_URL}/api/v1/query" \
            --data-urlencode "query=100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\",instance=\"${instance}\"}[5m])) * 100)" \
            | jq -r '.data.result[0].value[1] // empty')
        if [ -n "$cpu_usage" ] && (( $(echo "$cpu_usage > $ALERT_THRESHOLD" | bc -l) )); then
            log "WARNING: High CPU usage on $instance: ${cpu_usage}%"
            # Scale out automatically (if supported)
            auto_scale_instance "$instance"
        fi
        # Check memory usage
        mem_usage=$(curl -s -G "${PROMETHEUS_URL}/api/v1/query" \
            --data-urlencode "query=(1 - (node_memory_MemAvailable_bytes{instance=\"${instance}\"} / node_memory_MemTotal_bytes{instance=\"${instance}\"})) * 100" \
            | jq -r '.data.result[0].value[1] // empty')
        if [ -n "$mem_usage" ] && (( $(echo "$mem_usage > $ALERT_THRESHOLD" | bc -l) )); then
            log "WARNING: High memory usage on $instance: ${mem_usage}%"
        fi
    done
}
# Scale out automatically
auto_scale_instance() {
    local instance=$1
    log "Attempting to auto-scale instance: $instance"
    # Raise minReplicas by one on every HPA whose name contains the instance name.
    # Note: $instance is a Prometheus instance label (host:port), so this only
    # matches HPAs that are named after it.
    kubectl get hpa -A -o json \
        | jq -r --arg inst "$instance" '.items[] | select(.metadata.name | contains($inst)) | "\(.metadata.namespace) \(.metadata.name) \(.spec.minReplicas // 1)"' \
        | while read -r namespace hpa_name min_replicas; do
            kubectl patch hpa "$hpa_name" -n "$namespace" --type merge \
                -p "{\"spec\":{\"minReplicas\":$((min_replicas + 1))}}"
            log "Scaled up HPA $hpa_name in namespace $namespace"
        done
}
# Clean up idle resources
cleanup_idle_resources() {
    log "Cleaning up idle resources..."
    # Delete PVCs that have been explicitly annotated unused=true
    unused_pvcs=$(kubectl get pvc --all-namespaces -o json | jq -r '.items[] | select(.metadata.annotations["unused"]=="true") | "\(.metadata.namespace)/\(.metadata.name)"')
    for pvc in $unused_pvcs; do
        ns=$(echo "$pvc" | cut -d'/' -f1)
        name=$(echo "$pvc" | cut -d'/' -f2)
        log "Removing unused PVC: $pvc"
        kubectl delete pvc "$name" -n "$ns"
    done
    # Delete Pods that have finished (succeeded or failed)
    kubectl delete pods --all-namespaces --field-selector status.phase=Succeeded
    kubectl delete pods --all-namespaces --field-selector status.phase=Failed
}
# Generate the daily report
generate_daily_report() {
    log "Generating daily report..."
    report_file="/tmp/cloud_daily_report_$(date +%Y%m%d).txt"
    echo "=== FGedu Cloud Daily Report ===" > "$report_file"
    echo "Date: $(date)" >> "$report_file"
    echo "" >> "$report_file"
    # Resource usage
    echo "=== Resource Usage ===" >> "$report_file"
    kubectl top nodes >> "$report_file"
    echo "" >> "$report_file"
    # Alert summary
    echo "=== Alert Summary ===" >> "$report_file"
    curl -s "${PROMETHEUS_URL}/api/v1/alerts" | jq -r '.data.alerts[] | "\(.labels.alertname): \(.state)"' >> "$report_file"
    # Send the report
    mail -s "FGedu Cloud Daily Report" ops@fgedu.net.cn < "$report_file"
}
# Main entry point
main() {
    log "Starting cloud operations automation..."
    check_instance_health
    cleanup_idle_resources
    generate_daily_report
    log "Cloud operations automation completed."
}
main
# Schedule the script (runs daily at 08:00)
$ crontab -l
0 8 * * * /opt/scripts/cloud_ops_automation.sh
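Summary
Cloud monitoring and management are key to keeping a cloud environment running reliably, and they depend on a thorough monitoring system and automated operations. This tutorial covered monitoring architecture design, resource monitoring, performance monitoring, cost management, and automated operations to help operations teams manage their cloud environments effectively.
In practice, tailor the monitoring metrics and alert rules to the characteristics of the business, and keep tuning resource configuration to improve cloud resource utilization.
This article was compiled and published by 风哥教程 for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html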
