IT Tutorial FG397: Cloud Monitoring and Alerting

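Outline

1. Overview of Cloud Monitoring and Alerting
2. AWS CloudWatch
3. Azure Monitor
4. Google Cloud Monitoring
5. Alibaba Cloud Monitor
6. Tencent Cloud Monitor
7. Prometheus + Grafana
8. Monitoring and Alerting Best Practices
9. Monitoring and Alerting Troubleshooting
10. Monitoring and Alerting Automation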

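1. Overview of Cloud Monitoring and Alerting

Cloud monitoring and alerting is a core part of cloud service management. By collecting, analyzing, and visualizing performance data from cloud resources, it lets users see the state of their systems in real time and find and fix problems early, which keeps services highly available and reliable.

The core functions of cloud monitoring and alerting include:

Resource performance monitoring
System health assessment
Anomaly detection and alerting
Performance trend analysis
Capacity planning and forecasting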


2. AWS CloudWatch

2.1 CloudWatch Basic Concepts

AWS CloudWatch is the AWS monitoring service. It collects and tracks metrics, collects and monitors log files, and triggers alarms.

2.2 Metric Monitoring

# View the CPU utilization of an EC2 instance
$ aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2026-04-02T00:00:00Z \
  --end-time 2026-04-03T00:00:00Z \
  --period 3600 \
  --statistics Average

{
  "Datapoints": [
    {
      "Timestamp": "2026-04-02T01:00:00Z",
      "Average": 10.5,
      "Unit": "Percent"
    },
    {
      "Timestamp": "2026-04-02T02:00:00Z",
      "Average": 12.3,
      "Unit": "Percent"
    }
  ],
  "Label": "CPUUtilization"
}
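
The JSON returned by get-metric-statistics can be post-processed in a script. The following is a minimal sketch, assuming the output has been captured as a string; summarize_datapoints is a hypothetical helper name, not part of the AWS CLI. It sorts the datapoints by timestamp first, because CloudWatch does not return them in chronological order.

```python
import json

# Sample output in the shape returned by get-metric-statistics (order shuffled on purpose).
SAMPLE = """
{
  "Datapoints": [
    {"Timestamp": "2026-04-02T02:00:00Z", "Average": 12.3, "Unit": "Percent"},
    {"Timestamp": "2026-04-02T01:00:00Z", "Average": 10.5, "Unit": "Percent"}
  ],
  "Label": "CPUUtilization"
}
"""


def summarize_datapoints(raw_json):
    """Return (datapoints sorted by time, mean, peak) for the Average statistic."""
    datapoints = json.loads(raw_json)["Datapoints"]
    # ISO 8601 UTC timestamps in one fixed format sort correctly as plain strings.
    ordered = sorted(datapoints, key=lambda dp: dp["Timestamp"])
    values = [dp["Average"] for dp in ordered]
    mean = sum(values) / len(values) if values else None
    peak = max(values) if values else None
    return ordered, mean, peak


ordered, mean, peak = summarize_datapoints(SAMPLE)
print(ordered[0]["Timestamp"], round(mean, 2), peak)
```

The mean and peak over a day are a quick way to judge whether a 70% alarm threshold is realistic for the instance.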

2.3 Alarm Configuration

# Create a CPU utilization alarm
$ aws cloudwatch put-metric-alarm \
  --alarm-name EC2-CPU-High \
  --alarm-description "Alarm when CPU utilization exceeds 70%" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:MyTopic \
  --ok-actions arn:aws:sns:us-west-2:123456789012:MyTopic

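Feng Ge's tip: when configuring an alarm, choose the evaluation periods and the threshold carefully. Values that are too sensitive cause false alarms, and values that are too lenient cause missed alarms.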
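
To see why the evaluation periods matter, here is a simplified model of the alarm logic. It is a sketch under stated assumptions, not CloudWatch's actual implementation: alarm_state is a hypothetical function, and it ignores features such as "M out of N" datapoints and missing-data handling. The alarm fires only when every one of the most recent evaluation_periods values is above the threshold.

```python
def alarm_state(values, threshold, evaluation_periods):
    """Return the alarm state for a list of per-period averages (oldest first)."""
    if len(values) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = values[-evaluation_periods:]
    # GreaterThanThreshold: every recent period must breach for the alarm to fire.
    return "ALARM" if all(v > threshold for v in recent) else "OK"


spike = [40, 45, 85]       # one short CPU spike in the last period
sustained = [40, 82, 85]   # load stays high for two periods

print(alarm_state(spike, 70, 1))      # a single period reacts to the spike
print(alarm_state(spike, 70, 2))      # two periods ignore the spike
print(alarm_state(sustained, 70, 2))  # sustained load still triggers
```

With a 300-second period and two evaluation periods, as in the command above, a brief spike is filtered out, at the cost of roughly ten minutes of delay before sustained load is reported.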

2.4 Log Monitoring

# Create a log group and a log stream for the EC2 instance logs
$ aws logs create-log-group \
  --log-group-name /aws/ec2/my-instance-logs

$ aws logs create-log-stream \
  --log-group-name /aws/ec2/my-instance-logs \
  --log-stream-name my-stream

# Push a log event to CloudWatch
# The timestamp is in milliseconds since the epoch; events older than 14 days are rejected
$ aws logs put-log-events \
  --log-group-name /aws/ec2/my-instance-logs \
  --log-stream-name my-stream \
  --log-events timestamp=$(date +%s000),message="Error: Connection failed"
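
When several events are pushed in one call, put-log-events expects them in chronological order with millisecond timestamps. The sketch below prepares such a batch; build_log_events is a hypothetical helper, and the 14-day cutoff mirrors the service limit on event age.

```python
import time

# Events older than 14 days are rejected by put-log-events.
MAX_AGE_MS = 14 * 24 * 60 * 60 * 1000


def build_log_events(entries, now_ms=None):
    """Turn (timestamp_ms, message) pairs into a sorted list of log event dicts."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Drop entries that are too old to be accepted.
    fresh = [(ts, msg) for ts, msg in entries if now_ms - ts <= MAX_AGE_MS]
    # Sort by timestamp so the batch is in chronological order.
    return [{"timestamp": ts, "message": msg}
            for ts, msg in sorted(fresh, key=lambda e: e[0])]


now = int(time.time() * 1000)
events = build_log_events([
    (now - 1000, "Error: Connection failed"),
    (now - 5000, "Warning: Retrying connection"),
    (1234567890000, "Stale entry from 2009"),
])
print(events)
```

The resulting list can be serialized with json.dumps and passed as the --log-events value, or handed to an SDK call that takes the same structure.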

More tutorials: www.fgedu.net.cn

3. Azure Monitor

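3.1 Azure Monitor Basic Concepts

Azure Monitor is the Azure monitoring service. It collects and analyzes metrics, logs, and application performance data.

3.2 Metric Monitoring

# View the CPU utilization of a virtual machine
$ az monitor metrics list \
  --resource /subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM \
  --metric "Percentage CPU" \
  --start-time 2026-04-02T00:00:00Z \
  --end-time 2026-04-03T00:00:00Z \
  --interval PT1H \
  --aggregation Average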

{
  "cost": 0,
  "timespan": "2026-04-02T00:00:00Z/2026-04-03T00:00:00Z",
  "interval": "PT1H",
  "value": [
    {
      "id": "/subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM/providers/Microsoft.Insights/metrics/Percentage%20CPU",
      "type": "Microsoft.Insights/metrics",
      "name": {
        "value": "Percentage CPU",
        "localizedValue": "Percentage CPU"
      },
      "unit": "Percent",
      "timeseries": [
        {
          "metadatavalues": [],
          "data": [
            {
              "timeStamp": "2026-04-02T01:00:00Z",
              "average": 15.2
            },
            {
              "timeStamp": "2026-04-02T02:00:00Z",
              "average": 18.7
            }
          ]
        }
      ]
    }
  ]
}
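
The Azure response nests the datapoints three levels deep (value, timeseries, data). As a rough illustration, the sketch below flattens that structure into simple rows; extract_averages is a hypothetical helper, and the sample is a trimmed copy of the output above. Intervals without data may carry a null or missing average, so those points are skipped.

```python
import json

# Trimmed sample in the shape returned by "az monitor metrics list".
SAMPLE = """
{
  "interval": "PT1H",
  "value": [
    {
      "name": {"value": "Percentage CPU", "localizedValue": "Percentage CPU"},
      "unit": "Percent",
      "timeseries": [
        {
          "metadatavalues": [],
          "data": [
            {"timeStamp": "2026-04-02T01:00:00Z", "average": 15.2},
            {"timeStamp": "2026-04-02T02:00:00Z", "average": 18.7},
            {"timeStamp": "2026-04-02T03:00:00Z", "average": null}
          ]
        }
      ]
    }
  ]
}
"""


def extract_averages(raw_json):
    """Flatten the response into (metric name, timestamp, average) tuples."""
    rows = []
    for metric in json.loads(raw_json)["value"]:
        name = metric["name"]["value"]
        for series in metric["timeseries"]:
            for point in series["data"]:
                # Skip intervals that have no average value.
                if point.get("average") is not None:
                    rows.append((name, point["timeStamp"], point["average"]))
    return rows


for row in extract_averages(SAMPLE):
    print(row)
```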

3.3 Alert Configuration

# Create a CPU utilization alert
$ az monitor metrics alert create \
  --name VM-CPU-High \
  --resource-group myResourceGroup \
  --scopes /subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM \
  --condition "avg Percentage CPU > 70" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action my-action-group \
  --severity 2

author:www.itpux.com

4. Google Cloud Monitoring

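4.1 Cloud Monitoring Basic Concepts

Google Cloud Monitoring is the Google Cloud monitoring service. It collects and analyzes metrics, logs, and application performance data.

4.2 Metric Monitoring

# Look up the CPU utilization metric of Compute Engine instances
# (Cloud Monitoring API v3, projects.metricDescriptors.list)
$ curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/my-project/metricDescriptors" \
  --data-urlencode 'filter=metric.type="compute.googleapis.com/instance/cpu/utilization"'

# Fetch CPU utilization data as hourly means
# (Cloud Monitoring API v3, projects.timeSeries.list)
$ curl -s -G \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/my-project/timeSeries" \
  --data-urlencode 'filter=metric.type="compute.googleapis.com/instance/cpu/utilization"' \
  --data-urlencode 'interval.startTime=2026-04-02T00:00:00Z' \
  --data-urlencode 'interval.endTime=2026-04-03T00:00:00Z' \
  --data-urlencode 'aggregation.alignmentPeriod=3600s' \
  --data-urlencode 'aggregation.perSeriesAligner=ALIGN_MEAN'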

4.3 Alert Configuration

# Create a CPU utilization alert (the metric is a fraction, so 0.7 means 70%)
$ gcloud alpha monitoring policies create \
  --display-name="VM CPU Utilization Alert" \
  --documentation="Alert when CPU utilization exceeds 70%" \
  --condition-display-name="CPU Usage" \
  --condition-filter='metric.type="compute.googleapis.com/instance/cpu/utilization" AND resource.type="gce_instance"' \
  --aggregation='{"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN"}' \
  --if="> 0.7" \
  --duration=60s \
  --combiner=OR \
  --notification-channels="projects/my-project/notificationChannels/1234567890"
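
Note that the Google Cloud threshold is 0.7 while the other providers in this tutorial use 70: the Compute Engine CPU utilization metric is reported as a fraction, not a percentage. When one script manages alerts across providers, a small conversion step avoids thresholds that never fire. The sketch below is illustrative only; threshold_for and the FRACTION_METRICS set are assumptions made for this example.

```python
# Metrics assumed to be reported as a fraction (0.0 to 1.0) instead of a percentage.
FRACTION_METRICS = {"compute.googleapis.com/instance/cpu/utilization"}


def threshold_for(metric_type, percent):
    """Translate a percentage threshold into the unit the metric reports."""
    if metric_type in FRACTION_METRICS:
        return percent / 100.0
    return float(percent)


print(threshold_for("compute.googleapis.com/instance/cpu/utilization", 70))
print(threshold_for("CPUUtilization", 70))
```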


5. Alibaba Cloud Monitor

5.1 Alibaba Cloud Monitor Basic Concepts

Alibaba Cloud Monitor is the Alibaba Cloud monitoring service. It collects and analyzes performance data from cloud products and triggers alarms.

5.2 Metric Monitoring

# View the CPU utilization of an ECS instance
$ aliyun cms DescribeMetricList \
  --RegionId cn-hangzhou \
  --Namespace acs_ecs_dashboard \
  --MetricName cpu_total \
  --Dimensions '[{"instanceId":"i-1234567890abcdef0"}]' \
  --StartTime '2026-04-02T00:00:00Z' \
  --EndTime '2026-04-03T00:00:00Z' \
  --Period 3600

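5.3 Alarm Configuration

# Create a CPU utilization alarm
# Note: CreateAlarm belongs to an older Cloud Monitor API version and may not be
# available in current versions, where alarm rules are created with PutResourceMetricRule
$ aliyun cms CreateAlarm \
  --RegionId cn-hangzhou \
  --AlarmName ECS-CPU-High \
  --Description "Alarm when CPU utilization exceeds 70%" \
  --Namespace acs_ecs_dashboard \
  --MetricName cpu_total \
  --Dimensions '[{"instanceId":"i-1234567890abcdef0"}]' \
  --Period 60 \
  --EvaluationCount 2 \
  --ComparisonOperator GreaterThanThreshold \
  --Threshold 70 \
  --Statistics Average \
  --ContactGroups '["MyContactGroup"]'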

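Feng Ge's tip: Alibaba Cloud Monitor supports several notification channels, including SMS, email, and DingTalk. Choose the channels that fit your actual needs.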

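6. Tencent Cloud Monitor

6.1 Tencent Cloud Monitor Basic Concepts

Tencent Cloud Monitor is the Tencent Cloud monitoring service. It collects and analyzes performance data from cloud products and triggers alarms.

6.2 Metric Monitoring

# View the CPU utilization of a CVM instance
$ tccli monitor GetMonitorData \
  --Namespace QCE/CVM \
  --MetricName CPUUsage \
  --Instances '[{"Dimensions":[{"Name":"InstanceId","Value":"ins-12345678"}]}]' \
  --Period 3600 \
  --StartTime '2026-04-02T00:00:00+08:00' \
  --EndTime '2026-04-03T00:00:00+08:00'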

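6.3 Alarm Configuration

# Create a CPU utilization alarm policy
# The trigger rule goes in --Condition; notice-xxxxxxxx is a placeholder for the ID of a
# notification template that defines the receivers and channels (for example MyReceiverGroup)
$ tccli monitor CreateAlarmPolicy \
  --Module monitor \
  --PolicyName CVM-CPU-High \
  --MonitorType MT_QCE \
  --Namespace cvm_device \
  --Condition '{"IsUnionRule":0,"Rules":[{"MetricName":"CpuUsage","Period":60,"Operator":"gt","Value":"70","ContinuePeriod":2}]}' \
  --NoticeIds '["notice-xxxxxxxx"]'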


7. Prometheus + Grafana

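7.1 Prometheus Basic Concepts

Prometheus is an open-source monitoring system that collects and stores time series data. Grafana is an open-source visualization tool that displays monitoring data.

7.2 Deploying Prometheus

# Deploy Prometheus with Docker
$ docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus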

# Example prometheus.yml configuration
global:
  scrape_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

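7.3 Deploying Grafana

# Deploy Grafana with Docker
$ docker run -d \
  --name grafana \
  -p 3000:3000 \
  grafana/grafana

# Configure the Prometheus data source
# 1. Log in to Grafana (default username/password: admin/admin)
# 2. Click "Configuration" > "Data sources"
# 3. Click "Add data source"
# 4. Select "Prometheus"
# 5. Set the URL to "http://prometheus:9090"
#    (this assumes both containers share a Docker network where Prometheus resolves as "prometheus")
# 6. Click "Save & Test"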

7.4 Alert Configuration

# Example alerts.yml configuration
groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU Usage"
          description: "CPU usage is above 70% for 5 minutes"
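
The alert expression can be hard to read at first. The sketch below reproduces its arithmetic in plain Python, assuming two CPU cores and ignoring counter resets: irate() is approximated as the per-second increase between the last two samples of the idle counter, avg by(instance) as the mean over the cores, and the result is subtracted from 100. The function name and the sample numbers are made up for illustration.

```python
def cpu_usage_percent(idle_samples):
    """Estimate CPU usage from the last two idle-counter samples of each core.

    idle_samples maps a core id to ((t0, idle0), (t1, idle1)), where the idle
    values are cumulative idle seconds, as in node_cpu_seconds_total{mode="idle"}.
    """
    rates = []
    for (t0, v0), (t1, v1) in idle_samples.values():
        # Idle seconds gained per wall-clock second, similar to irate().
        rates.append((v1 - v0) / (t1 - t0))
    # Mean over all cores of the instance, similar to avg by(instance).
    avg_idle = sum(rates) / len(rates)
    return 100 - avg_idle * 100


samples = {
    "cpu0": ((0.0, 1000.0), (15.0, 1001.5)),  # idle for 1.5 s out of 15 s
    "cpu1": ((0.0, 2000.0), (15.0, 2004.5)),  # idle for 4.5 s out of 15 s
}
usage = cpu_usage_percent(samples)
print(round(usage, 1), usage > 70)
```

In this example the cores are idle 10% and 30% of the time, so average usage is about 80% and the expression evaluates as true. The alert still fires only after the condition has held for the 5 minutes set in the "for" field.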


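8. Monitoring and Alerting Best Practices

8.1 Choosing Metrics

System level: CPU, memory, disk, network
Application level: response time, throughput, error rate
Business level: number of users, transaction volume, revenue

8.2 Defining an Alerting Strategy

Set reasonable thresholds
Configure appropriate evaluation periods
Use tiered alerts (warning, critical, emergency)
Avoid alert storms

8.3 Visualizing Monitoring Data

Create dashboards that show key metrics in one place
Use trend charts to analyze performance changes
Draw threshold lines so alert conditions are visible at a glance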
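
Tiered alerts and storm avoidance can be sketched in a few lines. The example below is a minimal illustration, not a production design: the 85 and 95 cut-offs, the 300-second cooldown, and the names classify and AlertSuppressor are all assumptions made for this sketch. Each value is mapped to a severity, and repeated notifications for the same resource and severity are dropped until the cooldown has passed.

```python
def classify(value, warning=70, critical=85, emergency=95):
    """Map a metric value to a severity, or None when it is below every threshold."""
    if value >= emergency:
        return "emergency"
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return None


class AlertSuppressor:
    """Drop repeat notifications for the same (resource, severity) during a cooldown."""

    def __init__(self, cooldown_seconds=300):
        self.cooldown = cooldown_seconds
        self._last_sent = {}

    def should_notify(self, resource, severity, now):
        key = (resource, severity)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown window
        self._last_sent[key] = now
        return True


suppressor = AlertSuppressor(cooldown_seconds=300)
for t, cpu in [(0, 72), (60, 74), (120, 91), (400, 73)]:
    severity = classify(cpu)
    if severity and suppressor.should_notify("web-1", severity, now=t):
        print(t, severity)
```

An escalation from warning to critical still gets through immediately, because the cooldown is tracked per severity.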

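9. Monitoring and Alerting Troubleshooting

9.1 Common Problems

Monitoring data collection fails
Alerts fire falsely or fail to fire
The monitoring system itself has performance problems
Alert notifications are not delivered

9.2 Troubleshooting Steps

Check that the monitoring agent is running
Verify network connectivity
Check that the monitoring configuration is correct
Review the monitoring system logs
Test the alert notification channels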
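
The network connectivity step can be scripted. The following is a small sketch using only the Python standard library; can_connect is a hypothetical helper, and the host names and ports you pass to it depend on your own setup.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For the Prometheus setup in section 7, for example, can_connect("localhost", 9090) checks the Prometheus port and can_connect("node-exporter", 9100) checks the node exporter target. A successful TCP connection only shows that the port is reachable, so configuration and logs still need to be checked separately.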

10. Monitoring and Alerting Automation

10.1 Automated Response

# Respond to alarms automatically with AWS Lambda
import json

import boto3

autoscaling = boto3.client('autoscaling')


def lambda_handler(event, context):
    # CloudWatch puts the alarm details in the SNS message body as a JSON string
    message = json.loads(event['Records'][0]['Sns']['Message'])
    alarm_name = message['AlarmName']
    alarm_description = message.get('AlarmDescription')
    new_state = message['NewStateValue']

    # The topic may also receive OK notifications, so act only on the ALARM state
    if new_state == 'ALARM' and 'CPU' in alarm_name:
        # Handle the CPU alarm, for example by scaling out the Auto Scaling group
        autoscaling.update_auto_scaling_group(
            AutoScalingGroupName='my-asg',
            MinSize=2,
            MaxSize=4,
            DesiredCapacity=3
        )

    return {
        'statusCode': 200,
        'body': 'Alarm handled successfully'
    }

10.2 Automated Configuration

# Configure monitoring with Terraform
# aws_sns_topic.my_topic and aws_instance.my_instance must be defined elsewhere in the configuration
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "EC2-CPU-High"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 70
  alarm_description   = "Alarm when CPU utilization exceeds 70%"
  alarm_actions       = [aws_sns_topic.my_topic.arn]
  ok_actions          = [aws_sns_topic.my_topic.arn]

  dimensions = {
    InstanceId = aws_instance.my_instance.id
  }
}

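Feng Ge's recommendations for production environments:

Build a monitoring system that covers all critical resources and services
Set reasonable alert thresholds to avoid excessive false alarms
Review monitoring data regularly and refine the monitoring strategy
Set up automated responses to handle failures faster
Back up the monitoring configuration so the monitoring system itself stays reliable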