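Contents
- 1. Overview of Cloud Monitoring and Alerting
- 2. AWS CloudWatch
- 3. Azure Monitor
- 4. Google Cloud Monitoring
- 5. Alibaba Cloud Monitoring
- 6. Tencent Cloud Monitoring
- 7. Prometheus Monitoring
- 8. Grafana Visualization
- 9. Alert Management
- 10. Best Practices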
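1. Overview of Cloud Monitoring and Alerting
Cloud monitoring and alerting are a core part of cloud service management. By tracking the state and performance of cloud resources in real time, teams can detect and respond to anomalies promptly and keep services available and reliable.
The core goals of cloud monitoring and alerting are to:
- Monitor resource state in real time
- Detect anomalies promptly
- Respond quickly to alert events
- Analyze performance bottlenecks
- Optimize resource allocation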
2. AWS CloudWatch
2.1 CloudWatch Monitoring Configuration
# Create a CloudWatch alarm
$ aws cloudwatch put-metric-alarm \
  --alarm-name CPU-Utilization-High \
  --alarm-description "Alarm when CPU exceeds 70%" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:MyTopic
# Check the alarm state
$ aws cloudwatch describe-alarms --alarm-names CPU-Utilization-High
# Output
{
  "MetricAlarms": [
    {
      "AlarmName": "CPU-Utilization-High",
      "AlarmArn": "arn:aws:cloudwatch:us-west-2:123456789012:alarm:CPU-Utilization-High",
      "StateValue": "OK",
      "StateReason": "Threshold Crossed: The most recent datapoints (45.2, 48.3) were not greater than the threshold (70.0).",
      "MetricName": "CPUUtilization",
      "Namespace": "AWS/EC2",
      "Statistic": "Average",
      "Period": 300,
      "EvaluationPeriods": 2,
      "Threshold": 70.0,
      "ComparisonOperator": "GreaterThanThreshold"
    }
  ]
}
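The alarm above evaluates two consecutive 300-second periods against a threshold of 70. As a rough illustration of that rule, the sketch below reports ALARM only when every one of the most recent N datapoints is greater than the threshold. `alarm_state` is a hypothetical helper, not part of the AWS CLI, and it deliberately ignores CloudWatch's missing-data handling and the optional datapoints-to-alarm setting.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration, not an AWS tool):
# mimics "all of the last N datapoints are greater than the threshold".
# Usage: alarm_state THRESHOLD PERIODS DATAPOINT...   (datapoints oldest first)
alarm_state() {
  local threshold=$1 periods=$2
  shift 2
  # Fewer datapoints than evaluation periods: nothing to decide yet.
  if [ "$#" -lt "$periods" ]; then
    echo "INSUFFICIENT_DATA"
    return 0
  fi
  # Drop everything except the most recent N datapoints.
  shift $(( $# - periods ))
  local value
  for value in "$@"; do
    # awk performs the floating-point comparison that shell arithmetic cannot.
    if ! awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v > t) }'; then
      echo "OK"
      return 0
    fi
  done
  echo "ALARM"
}

alarm_state 70 2 45.2 48.3
alarm_state 70 2 60 75 82
```

With the datapoints from the sample output (45.2, 48.3) the helper reports OK, which matches the `StateValue` shown; the second call ends on two datapoints above 70 and reports ALARM.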
# Publish a custom metric
$ aws cloudwatch put-metric-data \
  --metric-name RequestLatency \
  --namespace MyApplication \
  --value 123 \
  --dimensions Instance=i-0123456789abcdef0,Service=WebApp
# Create a CloudWatch dashboard
$ aws cloudwatch put-dashboard \
  --dashboard-name MyDashboard \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "x": 0,
        "y": 0,
        "width": 12,
        "height": 6,
        "properties": {
          "metrics": [
            ["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"]
          ],
          "period": 300,
          "stat": "Average",
          "region": "us-west-2",
          "title": "EC2 CPU Utilization"
        }
      }
    ]
  }'
# Create a log group
$ aws logs create-log-group --log-group-name /aws/ec2/myapp
# Create a log stream
$ aws logs create-log-stream --log-group-name /aws/ec2/myapp --log-stream-name i-0123456789abcdef0
# Write a log event (the timestamp is in milliseconds since the epoch)
$ aws logs put-log-events \
  --log-group-name /aws/ec2/myapp \
  --log-stream-name i-0123456789abcdef0 \
  --log-events timestamp=$(date +%s)000,message="Application started successfully"
2.2 CloudWatch Alarm Configuration
# Create an SNS topic
$ aws sns create-topic --name MyAlerts
# Subscribe to the SNS topic
$ aws sns subscribe \
  --topic-arn arn:aws:sns:us-west-2:123456789012:MyAlerts \
  --protocol email \
  --notification-endpoint admin@fgedu.net.cn
# Create a composite alarm
# (assumes an alarm named Memory-Utilization-High already exists alongside CPU-Utilization-High)
$ aws cloudwatch put-composite-alarm \
  --alarm-name High-CPU-And-Memory \
  --alarm-description "Alarm when both CPU and Memory are high" \
  --actions-enabled \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:MyAlerts \
  --alarm-rule 'ALARM(CPU-Utilization-High) AND ALARM(Memory-Utilization-High)'
# Create an anomaly detector
$ aws cloudwatch put-anomaly-detector \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --stat Average \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0
# Create an alarm on the anomaly detection band
# (with --metrics, the namespace, metric name, statistic, and period are given inside the JSON, not as separate options)
$ aws cloudwatch put-metric-alarm \
  --alarm-name CPU-Anomaly-Detection \
  --alarm-description "Alarm when CPU usage is anomalous" \
  --evaluation-periods 2 \
  --threshold-metric-id e1 \
  --metrics '[{"Id":"m1","ReturnData":true,"MetricStat":{"Metric":{"Namespace":"AWS/EC2","MetricName":"CPUUtilization","Dimensions":[{"Name":"InstanceId","Value":"i-0123456789abcdef0"}]},"Period":300,"Stat":"Average"}},{"Id":"e1","ReturnData":true,"Expression":"ANOMALY_DETECTION_BAND(m1, 2)"}]' \
  --comparison-operator LessThanLowerOrGreaterThanUpperThreshold
Fengge's tip: CloudWatch is the core AWS monitoring service and can collect metrics and logs from nearly every AWS service.
3. Azure Monitor
3.1 Azure Monitor Configuration
# Create a Log Analytics workspace
$ az monitor log-analytics workspace create \
  --resource-group myResourceGroup \
  --workspace-name myWorkspace \
  --location eastus
# Enable diagnostic settings
# (log categories differ by resource type; list the supported ones with
#  az monitor diagnostic-settings categories list --resource <resource-id>)
$ az monitor diagnostic-settings create \
  --name myDiagnosticSetting \
  --resource /subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM \
  --workspace /subscriptions/12345678-1234-1234-1234-123456789012/resourcegroups/myResourceGroup/providers/microsoft.operationalinsights/workspaces/myWorkspace \
  --logs '[{"category": "Administrative","enabled": true}]' \
  --metrics '[{"category": "AllMetrics","enabled": true}]'
# Create a metric alert rule
$ az monitor metrics alert create \
  --name CPU-Utilization-High \
  --resource-group myResourceGroup \
  --scopes /subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM \
  --condition "avg Percentage CPU > 70" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action my-action-group
# Show the alert rule
$ az monitor metrics alert show --name CPU-Utilization-High --resource-group myResourceGroup
# Output
{
  "actions": [
    {
      "actionGroupId": "/subscriptions/12345678-1234-1234-1234-123456789012/resourcegroups/myResourceGroup/providers/microsoft.insights/actiongroups/my-action-group"
    }
  ],
  "criteria": {
    "allOf": [
      {
        "metricName": "Percentage CPU",
        "metricNamespace": "Microsoft.Compute/virtualMachines",
        "operator": "GreaterThan",
        "threshold": 70.0,
        "timeAggregation": "Average"
      }
    ]
  },
  "enabled": true,
  "evaluationFrequency": "PT1M",
  "name": "CPU-Utilization-High",
  "resourceGroup": "myResourceGroup",
  "scopes": [
    "/subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/myResourceGroup/providers/Microsoft.Compute/virtualMachines/myVM"
  ],
  "severity": 3,
  "windowSize": "PT5M"
}
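# Query logs
# (--workspace expects the workspace GUID, also called the customer ID; myWorkspace stands in for it here)
$ az monitor log-analytics query \
  --workspace myWorkspace \
  --analytics-query "AzureDiagnostics | where ResourceProvider == 'MICROSOFT.COMPUTE' | summarize count() by bin(TimeGenerated, 1h)"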
3.2 Azure Monitor Alert Configuration
# Create an action group
$ az monitor action-group create \
  --name myActionGroup \
  --resource-group myResourceGroup \
  --short-name myAction \
  --email-receivers '[{"name": "Admin","email_address": "admin@fgedu.net.cn"}]' \
  --sms-receivers '[{"name": "Admin","country_code": "86","phone_number": "13800138000"}]'
# Create a log alert
# (FailedOps is a placeholder name that ties --condition to the query in --condition-query)
$ az monitor scheduled-query create \
  --name High-Error-Rate \
  --resource-group myResourceGroup \
  --scopes /subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/myResourceGroup/providers/microsoft.operationalinsights/workspaces/myWorkspace \
  --condition "count 'FailedOps' > 10" \
  --condition-query FailedOps="AzureDiagnostics | where ResourceProvider == 'MICROSOFT.COMPUTE' | where ResultSignature == 'Failed'" \
  --description "Alert when error rate is high" \
  --severity 2 \
  --evaluation-frequency 5m \
  --window-size 5m \
  --action-groups /subscriptions/12345678-1234-1234-1234-123456789012/resourceGroups/myResourceGroup/providers/microsoft.insights/actionGroups/myActionGroup
# Create an activity log alert
$ az monitor activity-log alert create \
  --name VM-Restart \
  --resource-group myResourceGroup \
  --condition category=Administrative and operationName=Microsoft.Compute/virtualMachines/restart/action \
  --scope /subscriptions/12345678-1234-1234-1234-123456789012 \
  --action-group myActionGroup
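# Create a smart detection alert
# (check that your az CLI version and its application-insights extension provide this subcommand before relying on it)
$ az monitor app-insights alert create \
  --name Failure-Anomalies \
  --resource-group myResourceGroup \
  --app myApp \
  --type FailureAnomalies \
  --action-groups myActionGroup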
4. Google Cloud Monitoring
4.1 Google Cloud Monitoring Configuration
# Create an alerting policy
# (cpu/utilization is reported as a fraction between 0 and 1, so 0.7 means 70%)
$ gcloud alpha monitoring policies create \
  --display-name="CPU Utilization High" \
  --documentation="Alert when CPU utilization exceeds 70%" \
  --condition-display-name="CPU Usage" \
  --condition-filter='metric.type="compute.googleapis.com/instance/cpu/utilization" AND resource.type="gce_instance"' \
  --aggregation='{"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN", "crossSeriesReducer": "REDUCE_MEAN"}' \
  --if="> 0.7" \
  --duration=300s \
  --combiner=OR \
  --notification-channels="projects/my-project/notificationChannels/1234567890"
# Output
Created policy [projects/my-project/alertPolicies/1234567890].
# List alerting policies
$ gcloud alpha monitoring policies list
# Create a notification channel
$ gcloud beta monitoring channels create \
  --display-name="Email Channel" \
  --type=email \
  --channel-labels=email_address=admin@fgedu.net.cn
# Create a dashboard
$ gcloud monitoring dashboards create --config-from-file=dashboard.json
# Example dashboard.json
# (cpu/utilization is a gauge metric, so it is aligned with ALIGN_MEAN rather than ALIGN_RATE)
{
  "displayName": "VM Monitoring Dashboard",
  "gridLayout": {
    "widgets": [
      {
        "title": "CPU Utilization",
        "xyChart": {
          "dataSets": [
            {
              "timeSeriesQuery": {
                "timeSeriesFilter": {
                  "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
                  "aggregation": {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_MEAN"
                  }
                }
              }
            }
          ]
        }
      }
    ]
  }
}
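# Query metrics
# (check that your gcloud version provides this subcommand; time series can also be read
#  through the Cloud Monitoring API method projects.timeSeries.list)
$ gcloud monitoring read \
  --project=my-project \
  "compute.googleapis.com/instance/cpu/utilization" \
  --start-time=2026-04-02T00:00:00Z \
  --end-time=2026-04-03T00:00:00Z \
  --aggregation=mean \
  --interval=3600s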
4.2 Google Cloud Logging Configuration
# Create a log sink
$ gcloud logging sinks create my-sink \
  bigquery.googleapis.com/projects/my-project/datasets/my_dataset \
  --log-filter='resource.type="gce_instance"'
# Create a log exclusion filter
# (exclusions attach to an existing sink; this sketch adds one to the _Default sink)
$ gcloud logging sinks update _Default \
  --add-exclusion='name=my-exclusion,filter=resource.type="gce_instance" AND severity="DEBUG"'
# Read logs
$ gcloud logging read "resource.type=gce_instance" --limit=10
# Output
insertId: 1234567890
logName: projects/my-project/logs/compute.googleapis.com%2Factivity_log
receiveTimestamp: '2026-04-03T10:00:00.000000000Z'
resource:
  labels:
    instance_id: '1234567890123456789'
    project_id: my-project
    zone: us-central1-a
  type: gce_instance
severity: INFO
timestamp: '2026-04-03T10:00:00.000000000Z'
# Create a log-based metric
$ gcloud logging metrics create my-metric \
  --description="My custom log metric" \
  --log-filter='resource.type="gce_instance" AND severity="ERROR"'
# Create an alert on the log-based metric
$ gcloud alpha monitoring policies create \
  --display-name="Error Log Alert" \
  --condition-display-name="Error Logs" \
  --condition-filter='metric.type="logging.googleapis.com/user/my-metric" AND resource.type="gce_instance"' \
  --if="> 10" \
  --duration=300s \
  --combiner=OR \
  --notification-channels="projects/my-project/notificationChannels/1234567890"
author:www.itpux.com
5. Alibaba Cloud Monitoring
5.1 Alibaba Cloud CloudMonitor Configuration
# Create an alert rule
$ aliyun cms PutMetricRule \
  --Namespace acs_ecs_dashboard \
  --MetricName CPUUtilization \
  --RuleName CPU-Utilization-High \
  --Threshold 70 \
  --ComparisonOperator GreaterThanOrEqualToThreshold \
  --Statistics Average \
  --Period 300 \
  --EvaluationCount 2 \
  --ContactGroups my-contact-group
# List alert rules
$ aliyun cms DescribeMetricRuleList --Namespace acs_ecs_dashboard --MetricName CPUUtilization
# Output
{
  "Alarms": {
    "Alarm": [
      {
        "Id": "1234567890",
        "Name": "CPU-Utilization-High",
        "Namespace": "acs_ecs_dashboard",
        "MetricName": "CPUUtilization",
        "Threshold": "70",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Statistics": "Average",
        "Period": "300",
        "EvaluationCount": 2,
        "State": "OK"
      }
    ]
  },
  "RequestId": "473469C7-AA6F-4DC5-B3DB-A3DC0DE3C83E"
}
# Query monitoring data (timestamps contain a space, so they must be quoted)
$ aliyun cms QueryMetricList \
  --Namespace acs_ecs_dashboard \
  --MetricName CPUUtilization \
  --Dimensions '{"instanceId":"i-0123456789abcdef0"}' \
  --StartTime "2026-04-02 00:00:00" \
  --EndTime "2026-04-03 00:00:00"
# Create a contact group
$ aliyun cms CreateContactGroup \
  --ContactGroupName my-contact-group \
  --Describe "My contact group"
# Add a contact
$ aliyun cms CreateContact \
  --ContactGroupName my-contact-group \
  --Name Admin \
  --Channels '{"Email":"admin@fgedu.net.cn","SMS":"13800138000"}'
5.2 Alibaba Cloud Log Service Configuration
# Create a project
$ aliyun log CreateProject \
  --project-name my-project \
  --project-describe "My log project"
# Create a Logstore
$ aliyun log CreateLogStore \
  --project-name my-project \
  --logstore-name my-logstore \
  --ttl 30 \
  --shard-count 2
# Create an index (the '\'' sequence embeds a single quote as one of the token delimiters)
$ aliyun log CreateIndex \
  --project-name my-project \
  --logstore-name my-logstore \
  --index-config '{"keys":{"level":{"type":"text"},"message":{"type":"text"}},"line":{"token":[","," ","'\''"]}}'
# Write logs
$ aliyun log PutLogs \
  --project-name my-project \
  --logstore-name my-logstore \
  --log-item '[{"Time":1680500400,"Contents":[{"Key":"level","Value":"INFO"},{"Key":"message","Value":"Application started"}]}]'
# Query logs (timestamps contain a space, so they must be quoted)
$ aliyun log GetLogs \
  --project-name my-project \
  --logstore-name my-logstore \
  --from-time "2026-04-02 00:00:00" \
  --to-time "2026-04-03 00:00:00" \
  --query "level:ERROR"
# Create an alert
$ aliyun log CreateAlert \
  --project-name my-project \
  --alert-name my-alert \
  --alert-display-name "Error Log Alert" \
  --alert-config '{"query":"level:ERROR","from":-3600,"to":0,"condition":"count > 10","notification":{"type":"Email","receivers":["admin@fgedu.net.cn"]}}'
6. Tencent Cloud Monitoring
6.1 Tencent Cloud Monitor Configuration
# Create an alarm policy
$ tccli monitor CreateAlarmPolicy \
  --Module monitor \
  --PolicyName CPU-Utilization-High \
  --Namespace QCE/CVM \
  --MonitorType MT_QCE \
  --Conditions '[{"MetricName":"CPUUsage","CalcType":1,"CalcValue":70,"ContinueTime":300}]'
# Output
{
  "PolicyId": "policy-1234567890",
  "RequestId": "473469C7-AA6F-4DC5-B3DB-A3DC0DE3C83E"
}
# Bind the policy to the monitored object
$ tccli monitor BindingPolicyObject \
  --Module monitor \
  --PolicyId policy-1234567890 \
  --InstanceGroupId ins-0123456789abcdef0
# Create a notification template
$ tccli monitor CreateAlarmNotice \
  --Module monitor \
  --NoticeName my-notice \
  --NoticeType ALARM \
  --NoticeReceivers '[{"ReceiverType":"USER","ReceiverId":"user-1234567890","StartTime":0,"EndTime":1}]'
# Query monitoring data
$ tccli monitor GetMonitorData \
  --Namespace QCE/CVM \
  --MetricName CPUUsage \
  --Instances '[{"Dimensions":[{"Name":"InstanceId","Value":"ins-0123456789abcdef0"}]}]' \
  --Period 300 \
  --StartTime 2026-04-02T00:00:00Z \
  --EndTime 2026-04-03T00:00:00Z
# Create a dashboard
$ tccli monitor CreateDashboard \
  --Module monitor \
  --DashboardName my-dashboard \
  --DashboardConfig '{"widgets":[{"type":"line","title":"CPU Usage","metrics":[{"namespace":"QCE/CVM","metricName":"CPUUsage","dimensions":[{"name":"InstanceId","value":"ins-0123456789abcdef0"}]}]}]}'
6.2 Tencent Cloud Log Service Configuration
# Create a logset
$ tccli cls CreateLogset \
  --LogsetName my-logset
# Create a log topic
$ tccli cls CreateTopic \
  --LogsetId cls-1234567890 \
  --TopicName my-topic
# Create an index
$ tccli cls CreateIndex \
  --TopicId topic-1234567890 \
  --Rule '{"FullText":{"CaseSensitive":false},"KeyValue":{"Key":{"type":"text"},"Level":{"type":"text"}}}'
# Write logs
$ tccli cls UploadLog \
  --TopicId topic-1234567890 \
  --LogBody '[{"timestamp":1680500400,"content":"level=INFO message=Application started"}]'
# Search logs
$ tccli cls SearchLog \
  --TopicId topic-1234567890 \
  --From 1680496800 \
  --To 1680500400 \
  --Query "level:ERROR"
# Create an alarm
$ tccli cls CreateAlarm \
  --AlarmName my-alarm \
  --TopicId topic-1234567890 \
  --Query "level:ERROR" \
  --Condition "count > 10" \
  --NoticeId notice-1234567890
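Fengge's tip: Cloud monitoring and alerting are essential for keeping cloud services available and reliable, so it pays to build a thorough monitoring system.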
7. Prometheus Monitoring
7.1 Installing and Configuring Prometheus
# Download Prometheus
$ wget https://github.com/prometheus/prometheus/releases/download/v2.37.0/prometheus-2.37.0.linux-amd64.tar.gz
$ tar -zxvf prometheus-2.37.0.linux-amd64.tar.gz
$ cd prometheus-2.37.0.linux-amd64
# Write prometheus.yml
$ cat > prometheus.yml << EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - fgedudb:9093
rule_files:
  - "alert_rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['fgedudb:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['fgedudb:9100']
  - job_name: 'mysql'
    static_configs:
      - targets: ['fgedudb:9104']
  - job_name: 'nginx'
    static_configs:
      - targets: ['fgedudb:9113']
EOF
# Write the alert rules
# (the quoted 'EOF' keeps the shell from expanding $value inside the templates)
$ cat > alert_rules.yml << 'EOF'
groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 70% (current value: {{ $value }}%)"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 80% (current value: {{ $value }}%)"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Disk space is below 20% (current value: {{ $value }}%)"
EOF
# Start Prometheus
$ ./prometheus --config.file=prometheus.yml
# Output
level=info ts=2026-04-03T10:00:00.000Z caller=main.go:388 msg="Starting Prometheus" version="(version=2.37.0, branch=HEAD, revision=1234567890)"
level=info ts=2026-04-03T10:00:00.000Z caller=main.go:393 msg="Build context" build_context="(go=go1.19.1, user=root@build, date=20260403-00:00:00)"
level=info ts=2026-04-03T10:00:00.000Z caller=main.go:652 msg="Completed loading of configuration file" filename=prometheus.yml
level=info ts=2026-04-03T10:00:00.000Z caller=main.go:526 msg="Server is ready to receive web requests."
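Each rule above carries `for: 5m`, which means a matching expression first puts the alert into a pending state and only fires once the match has held for the full duration. The sketch below models that behaviour under simplifying assumptions: `alert_state` is a hypothetical helper, not a Prometheus tool, and it treats every evaluation as a plain 1 (matched) or 0 (not matched) spaced one evaluation interval apart.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration, not part of Prometheus):
# models how a rule's "for" clause delays firing.
# Usage: alert_state FOR_SECONDS EVAL_INTERVAL_SECONDS SAMPLE...
#   SAMPLE is 1 if the expression matched at that evaluation, 0 otherwise (oldest first).
alert_state() {
  local for_seconds=$1 interval=$2
  shift 2
  local streak=0 sample
  for sample in "$@"; do
    if [ "$sample" -eq 1 ]; then
      streak=$(( streak + 1 ))
    else
      streak=0   # a single miss resets the pending timer
    fi
  done
  if [ "$streak" -eq 0 ]; then
    echo "inactive"
  elif [ $(( (streak - 1) * interval )) -ge "$for_seconds" ]; then
    # Time elapsed since the first matching evaluation has reached "for".
    echo "firing"
  else
    echo "pending"
  fi
}

alert_state 300 60 1 1 1
alert_state 300 60 1 1 1 1 1 1
```

With a 60-second interval, three consecutive matches cover only 120 seconds and stay pending, while six cover 300 seconds and fire. A shorter `for` reacts faster but lets brief spikes page someone, so the duration is a trade-off between latency and noise.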
7.2 Prometheus Queries
# CPU usage (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage (%)
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100
# Network traffic (bytes per second)
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])
# HTTP request rate
rate(http_requests_total[5m])
# HTTP error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# Response time (95th percentile)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
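The error-rate query divides the rate of 5xx responses by the rate of all responses and scales the result to a percentage. To make that arithmetic concrete, the sketch below applies it to two per-second rates typed in by hand. `error_rate_pct` is a hypothetical helper and the input numbers are made up for illustration; nothing here talks to a Prometheus server.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration): the same arithmetic as the
# error-rate query, applied to two per-second rates supplied by hand.
# Usage: error_rate_pct ERROR_RATE TOTAL_RATE
error_rate_pct() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) { print "NaN"; exit }   # no traffic: the ratio is undefined
    printf "%.2f\n", e / t * 100
  }'
}

# Illustrative values: 5 failing requests/s out of 200 requests/s.
error_rate_pct 5 200
```

The zero-traffic branch matters in practice: when a service receives no requests the ratio has no meaningful value, so an alert built on it should not be read as "0% errors".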
8. Grafana Visualization
8.1 Installing and Configuring Grafana
# Install Grafana
$ yum install -y grafana
# Start Grafana
$ systemctl start grafana-server
$ systemctl enable grafana-server
# Open Grafana
# http://fgedudb:3000
# Default username: admin
# Default password: admin
# Provision the Prometheus data source
$ cat > /etc/grafana/provisioning/datasources/prometheus.yml << EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://fgedudb:9090
    isDefault: true
EOF
# Provision the dashboard provider
$ cat > /etc/grafana/provisioning/dashboards/dashboard.yml << EOF
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    options:
      path: /var/lib/grafana/dashboards
EOF
# Create the dashboard JSON
# (file-provisioned dashboards use the dashboard model itself as the top-level object,
#  without the "dashboard" wrapper that the HTTP API expects)
$ mkdir -p /var/lib/grafana/dashboards
$ cat > /var/lib/grafana/dashboards/node-dashboard.json << 'EOF'
{
  "title": "Node Monitoring",
  "panels": [
    {
      "title": "CPU Usage",
      "type": "graph",
      "targets": [
        {
          "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
          "legendFormat": "{{instance}}"
        }
      ]
    },
    {
      "title": "Memory Usage",
      "type": "graph",
      "targets": [
        {
          "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
          "legendFormat": "{{instance}}"
        }
      ]
    },
    {
      "title": "Disk Usage",
      "type": "graph",
      "targets": [
        {
          "expr": "(node_filesystem_size_bytes{mountpoint=\"/\"} - node_filesystem_avail_bytes{mountpoint=\"/\"}) / node_filesystem_size_bytes{mountpoint=\"/\"} * 100",
          "legendFormat": "{{instance}}"
        }
      ]
    }
  ]
}
EOF
# Restart Grafana
$ systemctl restart grafana-server
8.2 Grafana Alert Configuration
# Provision the notification channel
$ cat > /etc/grafana/provisioning/notifiers/email.yml << EOF
apiVersion: 1
notifiers:
  - name: email
    type: email
    uid: email
    settings:
      addresses: admin@fgedu.net.cn
EOF
# Configure alert rules in the Grafana UI:
# 1. Open the dashboard
# 2. Select the panel
# 3. Click "Edit"
# 4. Switch to the "Alert" tab
# 5. Configure the alert rule
# Example alert rule
{
  "name": "CPU Usage Alert",
  "conditions": [
    {
      "evaluator": { "type": "gt", "params": [70] },
      "operator": { "type": "and" },
      "query": { "params": ["A", "5m", "now"] },
      "reducer": { "type": "avg" }
    }
  ],
  "executionErrorState": "alerting",
  "frequency": "1m",
  "handler": 1,
  "message": "CPU usage is above 70%",
  "noDataState": "no_data"
}
9. Alert Management
9.1 Alertmanager Configuration
# Download Alertmanager
$ wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
$ tar -zxvf alertmanager-0.24.0.linux-amd64.tar.gz
$ cd alertmanager-0.24.0.linux-amd64
# Write alertmanager.yml
$ cat > alertmanager.yml << EOF
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.fgedu.net.cn:587'
  smtp_from: 'alertmanager@fgedu.net.cn'
  smtp_auth_username: 'alertmanager@fgedu.net.cn'
  smtp_auth_password: 'MyPassword123'
  smtp_require_tls: true
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'team-email'
  routes:
    - match:
        severity: critical
      receiver: 'team-email-critical'
    - match:
        severity: warning
      receiver: 'team-email-warning'
receivers:
  - name: 'team-email'
    email_configs:
      - to: 'admin@fgedu.net.cn'
        send_resolved: true
  - name: 'team-email-critical'
    email_configs:
      - to: 'admin-critical@fgedu.net.cn'
        send_resolved: true
  - name: 'team-email-warning'
    email_configs:
      - to: 'admin-warning@fgedu.net.cn'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
EOF
# Start Alertmanager
$ ./alertmanager --config.file=alertmanager.yml
# Output
level=info ts=2026-04-03T10:00:00.000Z caller=main.go:220 msg="Starting Alertmanager" version="(version=0.24.0, branch=HEAD, revision=1234567890)"
level=info ts=2026-04-03T10:00:00.000Z caller=main.go:225 msg="Build context" build_context="(go=go1.18.1, user=root@build, date=20260403-00:00:00)"
level=info ts=2026-04-03T10:00:00.000Z caller=cluster.go:161 msg="setting advertise address explicitly" addr=127.0.0.1 port=9094
level=info ts=2026-04-03T10:00:00.000Z caller=main.go:505 msg="Listening for connections" address=0.0.0.0:9093
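The `route` section of alertmanager.yml sends alerts to a receiver based on their `severity` label, falling back to the root receiver when no child route matches. The sketch below restates that decision as a small function so the fallback is easy to see. `route_receiver` is a hypothetical helper written for this illustration; it covers only the two `match` blocks in the configuration above and none of Alertmanager's grouping, timing, or inhibition logic.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration, not an Alertmanager tool):
# mirrors the route tree from alertmanager.yml for a single severity label.
# Child routes are tried in order and the first match wins; anything that matches
# no child route falls back to the root receiver.
# Usage: route_receiver SEVERITY
route_receiver() {
  case "$1" in
    critical) echo "team-email-critical" ;;
    warning)  echo "team-email-warning" ;;
    *)        echo "team-email" ;;
  esac
}

route_receiver critical
route_receiver info
```

An alert labelled `severity: info`, or one with no severity label at all, still reaches `team-email`, which is why the root route should always name a receiver that someone actually reads.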
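9.2 Alerting Best Practices
- Set reasonable alert thresholds
- Avoid alert fatigue
- Classify alerts by severity
- Establish an alert response process
- Review alert rules regularly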
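10. Best Practices
10.1 Cloud Monitoring and Alerting Best Practices
- Build a thorough monitoring system
- Set reasonable alert thresholds
- Manage alerts by severity level
- Establish an alert response process
- Review and tune the monitoring strategy regularly
- Use visualization tools
- Retain historical monitoring data
- Run monitoring drills regularly
10.2 Choosing Monitoring Metrics
- CPU usage
- Memory usage
- Disk usage
- Network traffic
- Application response time
- Error rate
- Request volume
- Service availability
10.3 Alerting Strategy
- Set reasonable alert thresholds
- Classify alerts by severity
- Configure alert notification channels
- Establish an alert response process
- Review alert rules regularly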
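10.4 Visualization Best Practices
- Create clear dashboards
- Use appropriate chart types
- Provide an overview of key metrics
- Support drill-down analysis
- Update dashboards regularly
- Build a thorough cloud monitoring and alerting system
- Monitor key business metrics
- Set reasonable alert thresholds
- Manage alerts by severity level
- Establish an alert response process
- Use visualization tools
- Retain historical monitoring data
- Review and tune the monitoring strategy regularly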
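This article was compiled and published by 风哥教程 (Fengge Tutorials) and is intended for learning and testing only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html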
