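This document, by Fengge (风哥), introduces the alerting features of Oracle Enterprise Manager Cloud Control (EMCC): alert concepts, alert types, alert strategies, alert rule configuration, notification configuration, and alert management. It was written by Fengge Tutorials (风哥教程) with reference to the EMCC content in Oracle's official documentation, and is intended for DBAs to use for study and testing. Verify everything yourself before applying it in a production environment.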
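Part01 - Basic Concepts and Theory
1.1 EMCC Alert Concepts
Alerting is a core part of the monitoring features in Oracle Enterprise Manager Cloud Control (EMCC). When a monitored metric crosses its preset threshold or a specific event occurs, EMCC automatically generates an alert and notifies the responsible people. Well-configured alerts let you detect problems early and respond quickly, which keeps systems running stably.
- Alert: a notification generated when a metric crosses its threshold or an event occurs
- Alert severity: one of four levels: Fatal, Critical, Warning, Advisory
- Alert rule: a rule that defines the trigger conditions and the notification method
- Notification: the message sent to the responsible people after an alert is triggered
- Alert suppression (Incident Suppression): a mechanism that prevents repeated alerts
- Alert escalation (Escalation): a mechanism that escalates alerts nobody has handled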
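1.2 Alert Types
Alert types supported by EMCC:
1. Metric alert
   - Triggered by a metric threshold
   - Supports two threshold levels: warning and critical
   - The number of consecutive occurrences is configurable
   - Supports automatic clearing
2. Event alert
   - Triggered by a specific event
   - Examples: target down, Agent disconnected
   - Fires immediately, no threshold needed
   - Supports event correlation
3. Compliance alert
   - Based on compliance check results
   - Triggered when a configuration does not meet the standard
   - Supports compliance scores
   - Can generate compliance reports
4. Job alert
   - Based on job execution status
   - Triggered when a job fails or times out
   - Supports job retries
   - Job notifications are configurable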
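# Alert severity levels
| Severity | Priority | Color | Description |
|---|---|---|---|
| Fatal | Highest | Red | Fatal error; handle immediately |
| Critical | High | Orange | Serious problem; handle as soon as possible |
| Warning | Medium | Yellow | Warning; needs attention |
| Advisory | Low | Blue | Advisory information; for reference |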
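To make the threshold idea concrete, here is a minimal sketch of how a single metric sample could be compared against a warning and a critical threshold. The function name `classify_metric` is hypothetical and only illustrates the logic; it is not an EMCC or EMCLI feature, and it assumes a metric where higher values are worse.

```shell
#!/bin/sh
# Classify one metric sample against warning/critical thresholds.
# Hypothetical helper for illustration only; not part of EMCC or EMCLI.
classify_metric() {
    # $1 = sample value, $2 = warning threshold, $3 = critical threshold
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (v + 0 > c + 0)      print "Critical"
        else if (v + 0 > w + 0) print "Warning"
        else                    print "Clear"
    }'
}

classify_metric 96.5 80 95   # Critical
classify_metric 88.2 85 95   # Warning
classify_metric 45.2 80 95   # Clear
```

The comparison is done in awk so that fractional values such as 96.5 work; plain `[ ... -gt ... ]` only handles integers.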
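1.3 Advantages of EMCC Alerts
Advantages of EMCC alerting:
- Proactive notification: people are notified when a problem occurs, with no manual inspection needed
- Multi-channel notification: supports email, SMS, SNMP, and other notification methods
- Flexible configuration: alert rules and notification strategies can be customized
- Alert escalation: supports multi-level escalation
- Alert suppression: prevents alert storms and reduces useless alerts
- History tracking: a complete alert history is kept, which helps with audits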
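Part02 - Production Planning and Recommendations
2.1 Alert Planning
Key points when planning EMCC alerts:
# 1. Alert requirements analysis
- Classify systems by business importance
- Alert response-time requirements
- Identify notification recipients
- Alert escalation strategy
# 2. Alert strategy design
- Severity level definitions
- Threshold settings
- Suppression rules
- Clearing strategy
# 3. Notification strategy design
- Choice of notification methods
- Notification time windows
- Recipient groups
- Notification template design
# 4. Alert workflow design
- Alert generation workflow
- Alert handling workflow
- Alert escalation workflow
- Alert closing workflow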
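# Reference response times
| Severity | Response time | Resolution deadline | Notification method |
|---|---|---|---|
| Fatal | 5 minutes | 30 minutes | Phone + SMS + email |
| Critical | 15 minutes | 2 hours | SMS + email |
| Warning | 1 hour | 24 hours | Email |
| Advisory | 4 hours | 72 hours | Email digest |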
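The response times in the table above can be turned into a simple overdue check. The sketch below is only an illustration: `response_minutes` and `sla_breached` are hypothetical helper names, and the minute values are copied from the reference table.

```shell
#!/bin/sh
# Map a severity to its response target in minutes (values from the reference table).
response_minutes() {
    case "$1" in
        Fatal)    echo 5 ;;
        Critical) echo 15 ;;
        Warning)  echo 60 ;;
        Advisory) echo 240 ;;
        *)        echo "unknown severity: $1" >&2; return 1 ;;
    esac
}

# Succeeds (exit 0) when an alert has waited longer than its response target.
sla_breached() {
    # $1 = severity, $2 = whole minutes since the alert was raised
    limit=$(response_minutes "$1") || return 2
    [ "$2" -gt "$limit" ]
}

if sla_breached Critical 20; then
    echo "Critical alert is overdue"
fi
```

A check like this could feed a daily report on how many alerts missed their response target.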
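2.2 Alert Strategies
Choosing an EMCC alert strategy:
| Strategy | Use case | Characteristics | Maintenance cost |
|---|---|---|---|
| Strict | Core production systems | Low thresholds, immediate notification | High |
| Standard | General production systems | Standard thresholds, email notification | Medium |
| Relaxed | Test and development environments | High thresholds, digest notification | Low |
| On-demand | Special monitoring needs | Custom configuration | Medium |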
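# Strategy details
1. Strict strategy
   - Applies to: core production databases
   - Thresholds: warning 60%, critical 80%
   - Notification: immediate phone call + SMS
   - Escalation: escalate after 15 minutes without a response
   - Suppression: suppress alerts of the same kind within 5 minutes
2. Standard strategy
   - Applies to: general production databases
   - Thresholds: warning 70%, critical 90%
   - Notification: immediate email + SMS
   - Escalation: escalate after 30 minutes without a response
   - Suppression: suppress alerts of the same kind within 10 minutes
3. Relaxed strategy
   - Applies to: test and development environments
   - Thresholds: warning 85%, critical 95%
   - Notification: daily email digest
   - Escalation: no automatic escalation
   - Suppression: suppress alerts of the same kind within 30 minutes
4. On-demand strategy
   - Applies to: special monitoring needs
   - Thresholds: set from the baseline
   - Notification: custom
   - Escalation: custom
   - Suppression: custom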
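The practical effect of the three fixed strategies is that the same reading maps to a different severity depending on the policy. The following sketch shows this with hypothetical helpers `policy_thresholds` and `evaluate`; the policy names `strict`, `standard`, and `relaxed` are labels chosen for this example, and the percentages are the ones listed above.

```shell
#!/bin/sh
# Print "warning critical" thresholds (percent) for a named policy.
# The on-demand strategy has no fixed thresholds, so it is not listed.
policy_thresholds() {
    case "$1" in
        strict)   echo "60 80" ;;
        standard) echo "70 90" ;;
        relaxed)  echo "85 95" ;;
        *)        echo "no fixed thresholds for policy: $1" >&2; return 1 ;;
    esac
}

# Evaluate an integer usage percentage under a given policy.
evaluate() {
    # $1 = policy name, $2 = usage percent (integer)
    thresholds=$(policy_thresholds "$1") || return 1
    warn=${thresholds% *}    # text before the space
    crit=${thresholds#* }    # text after the space
    if   [ "$2" -gt "$crit" ]; then echo "Critical"
    elif [ "$2" -gt "$warn" ]; then echo "Warning"
    else                            echo "Clear"
    fi
}

evaluate strict 75     # Warning
evaluate standard 75   # Warning
evaluate relaxed 75    # Clear
```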
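2.3 Alert Considerations
Points to watch when using EMCC alerts:
- Alert storms: avoid overly sensitive thresholds, which cause alert storms
- Alert fatigue: too many useless alerts lead operations staff to ignore alerts
- Notification channels: make sure the channels work and test them regularly
- Alert escalation: configure escalation sensibly so that problems actually get handled
- Alert clearing: clear alerts promptly once they have been handled
- Alert audits: audit the alert configuration regularly to confirm it is still effective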
Part03 - Production Implementation Plan
3.1 Alert Rule Configuration
3.1.1 Creating Alert Rules
# 1. Log in to EMCLI and synchronize
$ emcli login -username=sysman
Enter password :
Login successful
$ emcli sync
# 2. Create an alert rule
$ emcli create_notification_rule \
  -name="Production_DB_Critical_Alerts" \
  -target_type="oracle_database" \
  -description="Critical alerts for production databases" \
  -targets="prod_orcl:oracle_database,prod_orcl2:oracle_database" \
  -severity="Fatal,Critical" \
  -notification_method="Email,SMS" \
  -recipients="dba-team@fgedu.net.cn,+8613800138000"
Notification rule "Production_DB_Critical_Alerts" created successfully
# 3. List alert rules
$ emcli get_notification_rules \
  -target_type="oracle_database"
Rule Name                       Target Type       Severity         Status
------------------------------  ----------------  ---------------  --------
Production_DB_Critical_Alerts   oracle_database   Fatal,Critical   Enabled
Production_DB_Warning_Alerts    oracle_database   Warning          Enabled
Host_Critical_Alerts            host              Fatal,Critical   Enabled
# 4. Modify an alert rule
$ emcli modify_notification_rule \
  -name="Production_DB_Critical_Alerts" \
  -add_targets="prod_orcl3:oracle_database" \
  -add_recipients="manager@fgedu.net.cn"
Notification rule modified successfully
# 5. Create a metric-based alert rule
$ emcli create_metric_alert_rule \
  -name="High_CPU_Alert" \
  -target_type="oracle_database" \
  -targets="prod_orcl:oracle_database" \
  -metric="CPU_Utilization" \
  -warning_threshold="80" \
  -critical_threshold="95" \
  -consecutive_occurrences="3" \
  -notification_method="Email" \
  -recipients="dba-team@fgedu.net.cn"
Metric alert rule "High_CPU_Alert" created successfully
# 6. Create an event-based alert rule
$ emcli create_event_alert_rule \
  -name="Target_Down_Alert" \
  -target_type="oracle_database" \
  -targets="prod_orcl:oracle_database" \
  -event_type="TargetDown" \
  -notification_method="Email,SMS" \
  -recipients="dba-team@fgedu.net.cn,+8613800138000"
Event alert rule "Target_Down_Alert" created successfully
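The metric rule above only fires after three consecutive occurrences, which filters out short spikes. As a rough sketch of that idea (not EMCC's actual implementation), the hypothetical `first_trigger` function below reads a series of samples and reports at which sample the alert would first fire.

```shell
#!/bin/sh
# Print the 1-based index of the sample at which an alert would fire, i.e. the
# first time `need` consecutive samples exceed `limit`.
# Prints nothing if that never happens. Hypothetical helper for illustration.
first_trigger() {
    # $1 = threshold, $2 = required consecutive occurrences
    # stdin: one numeric sample per line, oldest first
    awk -v limit="$1" -v need="$2" '
        { run = ($1 + 0 > limit + 0) ? run + 1 : 0 }   # reset the streak on any normal sample
        run == need { print NR; exit }
    '
}

# Two high samples, a dip, then three high samples in a row.
printf '%s\n' 96 97 50 96 98 99 | first_trigger 95 3   # prints 6
```

With a required count of 1 the same series would fire at the first sample, which is why a higher count reduces noise at the cost of a slower alert.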
3.1.2 Configuring Alert Suppression
# 1. Create a maintenance-window suppression rule
$ emcli create_suppression_rule \
  -name="Maintenance_Window_Suppression" \
  -target_type="oracle_database" \
  -targets="prod_orcl:oracle_database" \
  -start_time="2026-03-31 02:00:00" \
  -end_time="2026-03-31 06:00:00" \
  -repeat="WEEKLY" \
  -days="SAT,SUN"
Suppression rule "Maintenance_Window_Suppression" created successfully
# 2. Create a condition-based suppression rule
$ emcli create_conditional_suppression \
  -name="Duplicate_Alert_Suppression" \
  -target_type="oracle_database" \
  -condition="same_metric_same_target" \
  -suppression_period="5" \
  -suppression_period_unit="MINUTE"
Conditional suppression created successfully
# 3. List suppression rules
$ emcli get_suppression_rules \
  -target_type="oracle_database"
Rule Name                        Targets     Status    Schedule
-------------------------------  ----------  --------  --------------------
Maintenance_Window_Suppression   prod_orcl   Enabled   SAT,SUN 02:00-06:00
Duplicate_Alert_Suppression      All         Enabled   Always
# 4. Suppress alerts temporarily
$ emcli suppress_alerts \
  -target_type="oracle_database" \
  -target_name="prod_orcl" \
  -duration="60" \
  -duration_unit="MINUTE" \
  -reason="Scheduled maintenance"
Alerts suppressed for 60 minutes
# 5. Remove alert suppression
$ emcli unsuppress_alerts \
  -target_type="oracle_database" \
  -target_name="prod_orcl"
Alert suppression removed
# 6. Configure alert aggregation
$ emcli create_alert_aggregation \
  -name="DB_Alert_Aggregation" \
  -target_type="oracle_database" \
  -aggregation_period="10" \
  -aggregation_period_unit="MINUTE" \
  -group_by="target,metric" \
  -notification_method="Email" \
  -recipients="dba-team@fgedu.net.cn"
Alert aggregation created successfully
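Duplicate suppression by "same metric, same target" within a time window can be sketched as a small stream filter. The example below is an assumption-laden illustration: the function name `suppress_duplicates` and the input format (`<epoch_seconds> <target> <metric>` per line, sorted by time) are invented for this sketch and do not describe how EMCC stores alerts.

```shell
#!/bin/sh
# Pass an alert through only if no alert with the same target and metric
# was passed through within the last `window` seconds.
# Input format is assumed: "<epoch_seconds> <target> <metric>", sorted by time.
suppress_duplicates() {
    # $1 = suppression window in seconds
    awk -v window="$1" '
        {
            key = $2 SUBSEP $3
            if (!(key in last) || $1 - last[key] >= window) {
                print              # first alert for this key, or window has expired
                last[key] = $1     # remember when we last let this key through
            }
        }
    '
}

# 5-minute window: the second CPU alert (at 120 s) is suppressed.
suppress_duplicates 300 <<'EOF'
0 prod_orcl CPU_Utilization
120 prod_orcl CPU_Utilization
200 prod_orcl Tablespace_Usage
300 prod_orcl CPU_Utilization
EOF
```

The window is measured from the last alert that was let through, so a constantly firing metric still produces one alert per window instead of going silent.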
3.2 Notification Configuration
3.2.1 Configuring Email Notification
# 1. Configure the mail server
$ emcli set_mail_server \
  -host="smtp.fgedu.net.cn" \
  -port="25" \
  -username="emcc@fgedu.net.cn" \
  -password="EmailPassword123" \
  -use_ssl="true" \
  -sender_address="emcc@fgedu.net.cn"
Mail server configured successfully
# 2. Send a test email
$ emcli test_mail_server \
  -recipient="test@fgedu.net.cn"
Mail server test successful. Test email sent to test@fgedu.net.cn
# 3. Create an email notification template
$ emcli create_notification_template \
  -name="DB_Critical_Alert_Template" \
  -type="Email" \
  -subject="EMCC Alert: [SEVERITY] - [TARGET_NAME] - [METRIC_NAME]" \
  -body="Alert Details:
Target: [TARGET_NAME]
Target Type: [TARGET_TYPE]
Metric: [METRIC_NAME]
Severity: [SEVERITY]
Value: [METRIC_VALUE]
Threshold: [THRESHOLD]
Time: [TIMESTAMP]
Message: [MESSAGE]
Please take immediate action.
This is an automated message from EMCC.
Do not reply to this email.
--
EMCC Team
www.fgedu.net.cn"
Notification template created successfully
# 4. Apply the notification template
$ emcli apply_notification_template \
  -rule_name="Production_DB_Critical_Alerts" \
  -template_name="DB_Critical_Alert_Template"
Notification template applied successfully
# 5. Configure an email notification schedule
$ emcli create_notification_schedule \
  -name="Business_Hours_Notification" \
  -start_time="09:00" \
  -end_time="18:00" \
  -days="MON,TUE,WED,THU,FRI" \
  -timezone="Asia/Shanghai"
Notification schedule created successfully
# 6. View the mail configuration
$ emcli get_mail_configuration
Mail Server: smtp.fgedu.net.cn:25
SSL Enabled: Yes
Sender: emcc@fgedu.net.cn
Test Status: Passed
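The business-hours schedule above (Monday to Friday, 09:00 to 18:00) boils down to a simple window test. Here is a minimal sketch; `in_business_hours` is a hypothetical helper, the end time is treated as exclusive (an assumption), and the time zone is whatever the local `date` command uses.

```shell
#!/bin/sh
# Succeeds (exit 0) when the given moment falls inside Mon-Fri 09:00-18:00.
# Hypothetical helper; the 18:00 boundary is treated as outside the window.
in_business_hours() {
    # $1 = ISO weekday (1 = Monday ... 7 = Sunday, as printed by `date +%u`)
    # $2 = time of day as HHMM (as printed by `date +%H%M`)
    awk -v d="$1" -v t="$2" 'BEGIN {
        exit !(d >= 1 && d <= 5 && t + 0 >= 900 && t + 0 < 1800)
    }'
}

if in_business_hours "$(date +%u)" "$(date +%H%M)"; then
    echo "inside the notification window: send now"
else
    echo "outside the notification window: defer or route to on-call"
fi
```

The comparison runs in awk so that values with leading zeros such as `0930` are read as decimal numbers.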
3.2.2 Configuring SMS and SNMP Notification
# 1. Configure the SMS gateway
$ emcli configure_sms_gateway \
  -gateway_type="HTTP" \
  -url="http://sms-api.fgedu.net.cn/send" \
  -username="emcc_sms" \
  -password="SmsPassword123" \
  -sender_id="EMCC"
SMS gateway configured successfully
# 2. Send a test SMS
$ emcli test_sms_gateway \
  -phone_number="+8613800138000"
SMS gateway test successful. Test SMS sent to +8613800138000
# 3. Configure SNMP traps
$ emcli configure_snmp_trap \
  -destination_host="nms-server.fgedu.net.cn" \
  -destination_port="162" \
  -community="public" \
  -snmp_version="2c"
SNMP trap configured successfully
# 4. Send a test SNMP trap
$ emcli test_snmp_trap \
  -destination_host="nms-server.fgedu.net.cn"
SNMP trap test successful. Test trap sent to nms-server.fgedu.net.cn
# 5. Create an SMS notification template
$ emcli create_notification_template \
  -name="SMS_Critical_Template" \
  -type="SMS" \
  -body="EMCC:[SEVERITY] [TARGET_NAME] [METRIC_NAME]=[METRIC_VALUE] at [TIMESTAMP]"
Notification template created successfully
# 6. Configure multi-channel notification
$ emcli create_multi_channel_notification \
  -name="Critical_Multi_Channel" \
  -channels="Email,SMS" \
  -email_recipients="dba-team@fgedu.net.cn" \
  -sms_recipients="+8613800138000,+8613900139000" \
  -severity="Fatal,Critical"
Multi-channel notification created successfully
# 7. View the notification configuration
$ emcli get_notification_channels
Channel Type   Status   Configuration
-------------  -------  ---------------------
Email          Active   smtp.fgedu.net.cn:25
SMS            Active   HTTP Gateway
SNMP           Active   nms-server:162
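To preview what a template such as the SMS one above produces, the placeholders can be substituted locally. The sketch below uses `sed` and a hypothetical `render_template` function; it only illustrates bracketed-placeholder substitution and says nothing about how EMCC renders templates internally.

```shell
#!/bin/sh
# Replace [SEVERITY], [TARGET_NAME], [METRIC_NAME], [METRIC_VALUE] and [TIMESTAMP]
# in a template string. Hypothetical helper for previewing templates.
# Limitation: values must not contain "|", "&" or "\" (special to sed here).
render_template() {
    # $1 = template, $2 = severity, $3 = target, $4 = metric, $5 = value, $6 = timestamp
    printf '%s\n' "$1" | sed \
        -e "s|\[SEVERITY\]|$2|g" \
        -e "s|\[TARGET_NAME\]|$3|g" \
        -e "s|\[METRIC_NAME\]|$4|g" \
        -e "s|\[METRIC_VALUE\]|$5|g" \
        -e "s|\[TIMESTAMP\]|$6|g"
}

tpl='EMCC:[SEVERITY] [TARGET_NAME] [METRIC_NAME]=[METRIC_VALUE] at [TIMESTAMP]'
render_template "$tpl" Critical prod_orcl CPU_Utilization 96.5 '2026-03-31 10:15:00'
# EMCC:Critical prod_orcl CPU_Utilization=96.5 at 2026-03-31 10:15:00
```

Rendering a sample like this is a quick way to check that an SMS body stays short enough for the gateway before the template goes live.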
3.3 Alert Management
3.3.1 Viewing and Handling Alerts
# 1. List active alerts
$ emcli get_alerts \
  -status="Active"
Alert ID  Target Name  Severity  Metric            Value  Time
--------  -----------  --------  ----------------  -----  -------------------
12345     prod_orcl    Critical  CPU_Utilization   96.5   2026-03-31 10:15:00
12346     prod_orcl    Warning   Tablespace_Usage  88.2   2026-03-31 10:10:00
12347     db-server    Critical  Disk_Usage        92.5   2026-03-31 10:05:00
# 2. List alerts for a specific target
$ emcli get_alerts \
  -target_type="oracle_database" \
  -target_name="prod_orcl"
Alert ID  Severity  Metric            Value  Threshold  Status
--------  --------  ----------------  -----  ---------  ------
12345     Critical  CPU_Utilization   96.5   95         Active
12346     Warning   Tablespace_Usage  88.2   85         Active
# 3. View alert details
$ emcli get_alert_details \
  -alert_id=12345
Alert ID: 12345
Target: prod_orcl
Target Type: oracle_database
Metric: CPU_Utilization
Severity: Critical
Value: 96.5%
Threshold: 95%
Time Generated: 2026-03-31 10:15:00
Message: CPU utilization has exceeded the critical threshold
Collection Time: 2026-03-31 10:14:00
Notification Sent: Yes
Notification Time: 2026-03-31 10:15:05
# 4. Acknowledge an alert
$ emcli acknowledge_alert \
  -alert_id=12345 \
  -comment="Investigating CPU spike issue"
Alert acknowledged successfully
# 5. Clear an alert
$ emcli clear_alert \
  -alert_id=12345 \
  -comment="Issue resolved - identified runaway query and killed session"
Alert cleared successfully
# 6. Handle alerts in bulk
$ emcli bulk_acknowledge_alerts \
  -alert_ids="12345,12346,12347" \
  -comment="Batch acknowledgment for maintenance window"
Alerts acknowledged successfully
# 7. Create an alert report
$ emcli create_alert_report \
  -name="Weekly_Alert_Summary" \
  -time_range="LAST_7_DAYS" \
  -group_by="Target,Severity" \
  -email="dba-team@fgedu.net.cn"
Alert report created successfully
3.3.2 Configuring Alert Escalation
# 1. Create an escalation policy
$ emcli create_escalation_policy \
  -name="Critical_Alert_Escalation" \
  -description="Escalation policy for critical alerts"
Escalation policy created successfully
# 2. Add escalation levels
$ emcli add_escalation_level \
  -policy_name="Critical_Alert_Escalation" \
  -level="1" \
  -delay="15" \
  -delay_unit="MINUTE" \
  -recipients="dba-team@fgedu.net.cn" \
  -notification_method="Email,SMS"
Escalation level added successfully
$ emcli add_escalation_level \
  -policy_name="Critical_Alert_Escalation" \
  -level="2" \
  -delay="30" \
  -delay_unit="MINUTE" \
  -recipients="dba-manager@fgedu.net.cn,+8613900139000" \
  -notification_method="Email,SMS,Phone"
Escalation level added successfully
$ emcli add_escalation_level \
  -policy_name="Critical_Alert_Escalation" \
  -level="3" \
  -delay="60" \
  -delay_unit="MINUTE" \
  -recipients="it-director@fgedu.net.cn,+8613700137000" \
  -notification_method="Email,SMS,Phone"
Escalation level added successfully
# 3. Apply the escalation policy to an alert rule
$ emcli apply_escalation_policy \
  -rule_name="Production_DB_Critical_Alerts" \
  -policy_name="Critical_Alert_Escalation"
Escalation policy applied successfully
# 4. View the escalation policy
$ emcli get_escalation_policy \
  -name="Critical_Alert_Escalation"
Policy Name: Critical_Alert_Escalation
Description: Escalation policy for critical alerts
Level  Delay   Recipients                Methods
-----  ------  ------------------------  -----------------
1      15 min  dba-team@fgedu.net.cn     Email, SMS
2      30 min  dba-manager@fgedu.net.cn  Email, SMS, Phone
3      60 min  it-director@fgedu.net.cn  Email, SMS, Phone
# 5. Escalate an alert manually
$ emcli escalate_alert \
  -alert_id=12345 \
  -level="2" \
  -comment="No response from primary DBA"
Alert escalated to level 2
# 6. Configure an automatic escalation rule
$ emcli create_auto_escalation_rule \
  -name="Auto_Escalate_Critical" \
  -severity="Critical" \
  -unacknowledged_delay="30" \
  -escalation_policy="Critical_Alert_Escalation"
Auto escalation rule created successfully
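The three-level policy above can be read as a lookup from "minutes without acknowledgement" to "current escalation level". The sketch below assumes that each delay is measured from the moment the alert was raised (not from the previous level), which the policy definition itself does not state; `escalation_level` is a hypothetical helper.

```shell
#!/bin/sh
# Print the escalation level reached after N minutes without acknowledgement,
# using the 15/30/60-minute delays of the policy (0 = not escalated yet).
# Assumption: every delay counts from the time the alert was raised.
escalation_level() {
    # $1 = whole minutes the alert has been unacknowledged
    if   [ "$1" -ge 60 ]; then echo 3
    elif [ "$1" -ge 30 ]; then echo 2
    elif [ "$1" -ge 15 ]; then echo 1
    else                       echo 0
    fi
}

escalation_level 10   # 0
escalation_level 45   # 2
escalation_level 90   # 3
```

If the delays were meant to be cumulative (15, then 30 more, then 60 more), the boundaries would become 15, 45, and 105 minutes instead, so it is worth confirming which reading your policy uses.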
Part04 - Production Cases and Hands-on Walkthrough
4.1 Alert Case Study
A hands-on EMCC alert configuration case at a large enterprise:
- Monitored targets: 300+ Oracle databases, 200+ hosts
- Requirements: tiered alerts, multi-channel notification, automatic escalation
- Goal: build a complete alerting system and improve response speed
# Implementation steps
# 1. Create tiered alert rules
# Alert rule for core systems
$ emcli create_notification_rule \
  -name="Core_System_Critical" \
  -target_type="oracle_database" \
  -group="Core_Production_Databases" \
  -severity="Fatal,Critical" \
  -notification_method="Email,SMS,Phone" \
  -recipients="core-dba@fgedu.net.cn,+8613800138000" \
  -escalation_policy="Critical_Immediate_Escalation"
# Alert rule for general systems
$ emcli create_notification_rule \
  -name="Standard_System_Alerts" \
  -target_type="oracle_database" \
  -group="Standard_Production_Databases" \
  -severity="Critical,Warning" \
  -notification_method="Email,SMS" \
  -recipients="dba-team@fgedu.net.cn,+8613900139000" \
  -escalation_policy="Standard_Escalation"
# Alert rule for test systems
$ emcli create_notification_rule \
  -name="Dev_Test_Alerts" \
  -target_type="oracle_database" \
  -group="Dev_Test_Databases" \
  -severity="Critical" \
  -notification_method="Email" \
  -recipients="dev-team@fgedu.net.cn"
# 2. Configure suppression rules
# Maintenance-window suppression
$ emcli create_suppression_rule \
  -name="Weekly_Maintenance" \
  -target_type="oracle_database" \
  -group="All_Databases" \
  -start_time="02:00" \
  -end_time="06:00" \
  -repeat="WEEKLY" \
  -days="SUN"
# Duplicate-alert suppression
$ emcli create_conditional_suppression \
  -name="Suppress_Duplicate_Alerts" \
  -target_type="oracle_database" \
  -condition="same_metric_same_target" \
  -suppression_period="10" \
  -suppression_period_unit="MINUTE"
# 3. Create the escalation policy
$ cat /software/create_escalation.sh
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# Create a three-level escalation policy
emcli create_escalation_policy \
  -name="Production_Escalation" \
  -description="3-level escalation for production alerts"
# Level 1: escalate to the DBA lead after 15 minutes
emcli add_escalation_level \
  -policy_name="Production_Escalation" \
  -level="1" \
  -delay="15" \
  -recipients="dba-lead@fgedu.net.cn,+8613800138000" \
  -notification_method="Email,SMS"
# Level 2: escalate to the IT manager after 30 minutes
emcli add_escalation_level \
  -policy_name="Production_Escalation" \
  -level="2" \
  -delay="30" \
  -recipients="it-manager@fgedu.net.cn,+8613900139000" \
  -notification_method="Email,SMS,Phone"
# Level 3: escalate to the CTO after 60 minutes
emcli add_escalation_level \
  -policy_name="Production_Escalation" \
  -level="3" \
  -delay="60" \
  -recipients="cto@fgedu.net.cn,+8613700137000" \
  -notification_method="Email,SMS,Phone"
echo "Escalation policy created successfully"
# 4. Configure alert reports
$ emcli create_alert_report_schedule \
  -name="Daily_Alert_Summary" \
  -schedule="DAILY" \
  -time="08:00" \
  -email="dba-team@fgedu.net.cn,management@fgedu.net.cn" \
  -include_sections="ALERT_SUMMARY,RESPONSE_TIME,TOP_ALERTS,ESCALATIONS"
# 5. Alert effectiveness statistics
$ cat /software/alert_statistics.sh
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
echo "=== EMCC Alert Statistics ==="
echo ""
echo "Active Alerts:"
emcli get_alerts -status="Active" | wc -l
echo ""
echo "Alerts by Severity:"
emcli get_alerts -status="Active" -format="csv" | \
  awk -F',' '{count[$3]++} END {for(s in count) print s": "count[s]}'
echo ""
echo "Average Response Time (Last 24h):"
emcli get_alert_metrics -time_range="LAST_24_HOURS" | \
  grep "ResponseTime" | awk '{print $2}'
echo ""
echo "Escalation Count (Last 24h):"
emcli get_escalation_statistics -time_range="LAST_24_HOURS" | \
  grep "TotalEscalations" | awk '{print $2}'
# Run the statistics script
$ ./alert_statistics.sh
=== EMCC Alert Statistics ===
Active Alerts: 5
Alerts by Severity:
Critical: 2
Warning: 3
Average Response Time (Last 24h): 12 minutes
Escalation Count (Last 24h): 3
# Results
# - Average alert response time dropped from 4 hours to 15 minutes
# - Alert handling rate rose from 60% to 95%
# - Number of alert escalations fell by 70%
# - Alert satisfaction improved by 85%
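The per-severity count in the statistics script relies on an awk associative array. Here is a self-contained sketch of the same technique on sample data, so it can be tried without an EMCC installation. The CSV layout (a header row, severity in column 3) is an assumption made for this example, and `count_by_severity` is a hypothetical name.

```shell
#!/bin/sh
# Count alerts per severity from CSV on stdin.
# Assumed layout: one header row, severity in the third column.
count_by_severity() {
    awk -F',' '
        NR > 1 { count[$3]++ }                           # skip the header row
        END    { for (s in count) print s ": " count[s] }
    ' | sort                                             # awk iteration order is unspecified
}

count_by_severity <<'EOF'
alert_id,target,severity,metric,value
12345,prod_orcl,Critical,CPU_Utilization,96.5
12346,prod_orcl,Warning,Tablespace_Usage,88.2
12347,db-server,Critical,Disk_Usage,92.5
EOF
# Critical: 2
# Warning: 1
```

Skipping the header with `NR > 1` matters: without it the column title itself would show up as a severity with a count of 1.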
4.2 Alert Troubleshooting
Handling common EMCC alert faults:
# Fault 1: notifications not delivered
# Symptom: an alert is generated but no notification is received
# Diagnosis
$ emcli get_alert_details -alert_id=12345 | grep "Notification"
Notification Sent: Yes
Notification Time: 2026-03-31 10:15:05
Notification Status: Failed
Error: SMTP connection timeout
# Check the mail server configuration
$ emcli get_mail_configuration
Mail Server: smtp.fgedu.net.cn:25
Status: Connection Failed
Error: Connection timeout
# Solution
# 1. Check network connectivity
$ telnet smtp.fgedu.net.cn 25
Trying 192.168.1.100...
telnet: connect to address 192.168.1.100: Connection timed out
# 2. Update the mail server configuration
$ emcli set_mail_server \
  -host="smtp-backup.fgedu.net.cn" \
  -port="587" \
  -use_ssl="true"
Mail server configured successfully
# 3. Send a test email
$ emcli test_mail_server -recipient="test@fgedu.net.cn"
Mail server test successful
# 4. Resend the notification
$ emcli resend_notification -alert_id=12345
Notification resent successfully
# Fault 2: alert storm
# Symptom: a large number of repeated alerts arrive within a short time
# Diagnosis
$ emcli get_alerts -status="Active" -time_range="LAST_1_HOUR" | wc -l
500
# Check how the alerts are distributed across targets
$ emcli get_alerts -status="Active" -format="csv" | \
  awk -F',' '{count[$2]++} END {for(t in count) print t": "count[t]}'
prod_orcl: 450
prod_orcl2: 30
db-server: 20
# Solution
# 1. Create an emergency suppression
$ emcli suppress_alerts \
  -target_type="oracle_database" \
  -target_name="prod_orcl" \
  -duration="60" \
  -reason="Investigating alert storm"
# 2. Configure alert aggregation
$ emcli create_alert_aggregation \
  -name="Emergency_Aggregation" \
  -target_type="oracle_database" \
  -target_name="prod_orcl" \
  -aggregation_period="15" \
  -group_by="metric"
# 3. Adjust the thresholds
$ emcli modify_metric_threshold \
  -target_type="oracle_database" \
  -target_name="prod_orcl" \
  -metric="CPU_Utilization" \
  -warning_threshold="85" \
  -critical_threshold="95"
# Fault 3: escalation not triggered
# Symptom: an alert stays unacknowledged but is never escalated
# Diagnosis
$ emcli get_escalation_status -alert_id=12345
Alert ID: 12345
Current Level: 0
Escalation Policy: Production_Escalation
Status: Not Escalated
Time Since Generated: 45 minutes
# Check the escalation policy
$ emcli get_escalation_policy -name="Production_Escalation"
Policy Name: Production_Escalation
Status: Disabled
# Solution
# 1. Enable the escalation policy
$ emcli enable_escalation_policy -name="Production_Escalation"
Escalation policy enabled successfully
# 2. Trigger the escalation manually
$ emcli escalate_alert -alert_id=12345 -level="1"
Alert escalated to level 1
# 3. Verify the escalation configuration
$ emcli get_escalation_policy -name="Production_Escalation"
Policy Name: Production_Escalation
Status: Enabled
Applied Rules: 5
# Fault 4: alert cannot be cleared
# Symptom: the problem is resolved but the alert still shows as Active
# Diagnosis
$ emcli get_alert_details -alert_id=12345
Alert ID: 12345
Status: Active
Current Value: 45.2%
Threshold: 95%
Auto Clear: Disabled
# Solution
# 1. Enable automatic clearing
$ emcli enable_auto_clear \
  -target_type="oracle_database" \
  -metric="CPU_Utilization" \
  -clear_after="3" \
  -clear_after_unit="COLLECTIONS"
Auto clear enabled successfully
# 2. Clear the alert manually
$ emcli clear_alert \
  -alert_id=12345 \
  -comment="Issue resolved, CPU utilization back to normal"
Alert cleared successfully
# 3. Clear historical alerts in bulk
$ emcli bulk_clear_alerts \
  -target_type="oracle_database" \
  -metric="CPU_Utilization" \
  -condition="value < threshold"
Alerts cleared successfully: 25
4.3 Alert Optimization
EMCC alert optimization practices:
# Optimization 1: tune alert thresholds
# Adjust thresholds based on historical data
$ emcli analyze_metric_baseline \
  -target_type="oracle_database" \
  -target_name="prod_orcl" \
  -metric="CPU_Utilization" \
  -baseline_period="30" \
  -baseline_period_unit="DAYS"
Baseline Analysis:
Average: 45.2%
Std Dev: 15.3
95th Percentile: 75.8
99th Percentile: 85.2
Recommended Thresholds:
Warning: 76 (95th percentile)
Critical: 85 (99th percentile)
# Apply the recommended thresholds
$ emcli apply_baseline_thresholds \
  -target_type="oracle_database" \
  -target_name="prod_orcl" \
  -metric="CPU_Utilization"
Thresholds updated successfully
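Baseline thresholds like the ones above come from percentiles of historical samples. As a rough illustration of the arithmetic, the sketch below computes a percentile with the nearest-rank method using `sort` and `awk`. The function name `percentile` is hypothetical, nearest-rank is only one of several percentile definitions, and nothing here claims to match how EMCC computes its baselines.

```shell
#!/bin/sh
# Nearest-rank percentile: sort the samples and take the value at
# position ceil(p * N / 100). Hypothetical helper for illustration.
percentile() {
    # $1 = percentile as an integer from 1 to 100
    # stdin: one numeric sample per line
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)   # integer form of ceil(p * NR / 100)
            if (rank < 1) rank = 1
            print v[rank]
        }
    '
}

# Ten CPU samples; with N = 10 the 95th percentile is the largest value.
printf '%s\n' 40 45 50 55 60 65 70 75 80 90 | percentile 95   # 90
printf '%s\n' 40 45 50 55 60 65 70 75 80 90 | percentile 50   # 60
```

With only a handful of samples the high percentiles collapse onto the maximum, which is one reason a 30-day baseline period is more useful than a short one.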
# Optimization 2: tune alert suppression
# Create a smart suppression rule
$ emcli create_smart_suppression \
  -name="Intelligent_Suppression" \
  -target_type="oracle_database" \
  -conditions="same_target,same_metric,time_window=10m" \
  -action="suppress_duplicate" \
  -notification="consolidated"
Smart suppression created successfully
# Optimization 3: tune notification frequency
# Configure notification throttling
$ emcli configure_notification_throttling \
  -max_notifications_per_hour="50" \
  -consolidate_similar="true" \
  -consolidation_window="5" \
  -consolidation_window_unit="MINUTES"
Notification throttling configured successfully
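A "maximum notifications per hour" limit can be modeled as a rolling one-hour window over the notifications already sent. The sketch below is an illustration under assumptions: `throttle` is a hypothetical function, the input is a list of epoch timestamps in ascending order, and held notifications are simply marked rather than queued or consolidated.

```shell
#!/bin/sh
# Rolling-window throttle: allow at most `max` notifications in any 3600-second
# window and mark the rest as held. Hypothetical helper for illustration.
throttle() {
    # $1 = maximum notifications per rolling hour
    # stdin: epoch seconds, ascending, one per line
    awk -v max="$1" '
        BEGIN { head = 0; tail = 0 }
        {
            # Forget notifications that were sent an hour or more ago.
            while (head < tail && $1 - sent[head] >= 3600) delete sent[head++]
            if (tail - head < max) { sent[tail++] = $1; print $1 " send" }
            else                   { print $1 " hold" }
        }
    '
}

# Limit of 2 per hour: the third and fifth notifications are held.
printf '%s\n' 0 10 20 3600 3605 | throttle 2
# 0 send
# 10 send
# 20 hold
# 3600 send
# 3605 hold
```

In practice the held items would usually be merged into one consolidated message rather than dropped, which is what the consolidation window in the command above is for.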
# Optimization 4: alert effectiveness analysis
# Create an alert effectiveness report
$ cat /software/alert_effectiveness.sh
#!/bin/bash
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
echo "=== Alert Effectiveness Report ==="
echo ""
# Calculate the alert handling rate
total_alerts=$(emcli get_alerts -time_range="LAST_30_DAYS" | wc -l)
cleared_alerts=$(emcli get_alerts -time_range="LAST_30_DAYS" -status="Cleared" | wc -l)
acknowledged_alerts=$(emcli get_alerts -time_range="LAST_30_DAYS" -status="Acknowledged" | wc -l)
echo "Total Alerts (30 days): $total_alerts"
echo "Cleared Alerts: $cleared_alerts"
echo "Acknowledged Alerts: $acknowledged_alerts"
echo "Clear Rate: $(echo "scale=2; $cleared_alerts*100/$total_alerts" | bc)%"
echo ""
echo "Average Response Time by Severity:"
emcli get_alert_metrics -time_range="LAST_30_DAYS" -group_by="severity"
echo ""
echo "Top 10 Alert Sources:"
emcli get_alerts -time_range="LAST_30_DAYS" -format="csv" | \
  awk -F',' '{count[$2]++} END {for(t in count) print count[t], t}' | \
  sort -rn | head -10
echo ""
echo "Escalation Statistics:"
emcli get_escalation_statistics -time_range="LAST_30_DAYS"
# Add the script to crontab
$ crontab -e
0 8 * * MON /software/alert_effectiveness.sh | mail -s "Weekly Alert Report" dba-team@fgedu.net.cn
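The clear-rate calculation in the report script divides by the total alert count, which fails when no alerts were recorded in the period. A small guarded version is sketched below; `clear_rate` is a hypothetical helper, and printing `n/a` for an empty period is a choice made for this example.

```shell
#!/bin/sh
# Percentage of alerts that were cleared, with a guard for an empty period.
# Hypothetical helper for illustration.
clear_rate() {
    # $1 = number of cleared alerts, $2 = total number of alerts
    awk -v c="$1" -v t="$2" 'BEGIN {
        if (t + 0 == 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.2f\n", c * 100 / t
    }'
}

echo "Clear Rate: $(clear_rate 57 60)%"   # Clear Rate: 95.00%
echo "Clear Rate: $(clear_rate 0 0)"      # Clear Rate: n/a
```

Using awk instead of `bc` also avoids a dependency that is not installed on every minimal server image.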
# Optimization 5: alert noise reduction
# Configure a noise-reduction rule
$ emcli create_noise_reduction_rule \
  -name="Reduce_Noise" \
  -target_type="oracle_database" \
  -filters="ignore_transient=true,min_duration=5m" \
  -consolidate_related="true"
Noise reduction rule created successfully
# Optimization 6: predictive alerts
# Enable predictive alerts
$ emcli enable_predictive_alerts \
  -target_type="oracle_database" \
  -target_name="prod_orcl" \
  -metrics="Tablespace_Usage,CPU_Utilization" \
  -prediction_window="7" \
  -prediction_window_unit="DAYS" \
  -confidence_level="95"
Predictive alerts enabled successfully
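The simplest form of a predictive alert for a metric such as tablespace usage is a straight-line projection: given the current usage and the average daily growth, estimate how many days remain until 100%. The sketch below shows only that arithmetic. `days_until_full` is a hypothetical helper, and a linear projection is an assumption of this example, not a description of the prediction model EMCC uses.

```shell
#!/bin/sh
# Linear projection: whole days until usage reaches 100%, rounded up.
# Hypothetical helper; assumes constant growth, which real workloads rarely have.
days_until_full() {
    # $1 = current usage in percent, $2 = average growth in percentage points per day
    awk -v u="$1" -v g="$2" 'BEGIN {
        if (g + 0 <= 0) { print "never"; exit }   # flat or shrinking usage
        d = (100 - u) / g
        if (d < 0) d = 0
        days = (d == int(d)) ? d : int(d) + 1     # round up to whole days
        printf "%d\n", days
    }'
}

days_until_full 88.2 0.5   # 24
days_until_full 90 2       # 5
days_until_full 70 0       # never
```

Comparing the result against the prediction window (7 days in the command above) gives a simple rule: raise a predictive alert when the projected number of days is smaller than the window.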
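Part05 - Fengge's Lessons Learned
5.1 EMCC Alert Summary
Key lessons from working with EMCC alerts:
- Sensible thresholds: set thresholds from the historical baseline to avoid alert storms
- Tiered alerts: configure alert strategies according to business importance
- Multi-channel notification: use several channels for critical alerts so that they are delivered
- Alert escalation: configure a sensible escalation strategy so that problems get handled
- Alert suppression: configure suppression rules to reduce useless alerts
- Continuous improvement: review alert effectiveness regularly and keep tuning the configuration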
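5.2 Alert Checklist
# Alert rule checks
□ Alert rules created
□ Alert thresholds set
□ Alert severities configured
□ Suppression rules
□ Escalation policies
# Notification configuration checks
□ Mail server configuration
□ SMS gateway configuration
□ SNMP configuration
□ Notification templates
□ Notification schedules
# Alert handling checks
□ Alert viewing procedure
□ Alert acknowledgement procedure
□ Alert clearing procedure
□ Alert escalation procedure
□ Alert report generation
# Alert optimization checks
□ Thresholds are reasonable
□ Suppression rules are effective
□ Notification frequency is reasonable
□ Escalation policies are effective
□ Alert handling rate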
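5.3 Recommended Alert Tools
Recommended tools for EMCC alert management:
| Tool | Purpose | Description |
|---|---|---|
| EMCC Console | Alert management | Graphical interface for managing alerts |
| EMCLI | Command-line management | Bulk configuration and automation scripts |
| Incident Manager | Incident management | Unified incident management platform |
| Notification System | Notification management | Multi-channel notification management |
| Alert History | History analysis | Querying and analyzing alert history |
| Alert Reports | Alert reporting | Alert statistics and trend reports |
- Build a tiered alert strategy with differentiated configuration
- Set thresholds from the historical baseline
- Configure multi-channel notification so that critical alerts are delivered
- Set up an escalation mechanism so that problems get handled
- Review alert effectiveness regularly and keep optimizing
- Establish an alert handling process with clear ownership
This article was compiled and published by Fengge Tutorials (风哥教程) for study and testing only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html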
