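This article walks through configuring alerts on key Hadoop metrics in practice. It covers Prometheus and Grafana alert configuration and alerts for HDFS, YARN, and the operating system, and is aimed at big data operations engineers.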
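Part01-Basic Concepts and Theory
1.1 Alerting Overview
Alerting means notifying the responsible people as soon as the system detects an abnormal condition. Its purpose is to:
- Detect problems promptly
- Respond and handle them quickly
- Reduce business impact
- Keep problems from spreading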
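1.2 Key Metrics
Key metrics:
1. System metrics
   - CPU utilization
   - Memory utilization
   - Disk utilization
   - Disk I/O
   - Network traffic
2. HDFS metrics
   - NameNode status
   - Number of DataNodes
   - Available HDFS space
   - Block status
   - Slow nodes
3. YARN metrics
   - ResourceManager status
   - Number of NodeManagers
   - Queue resources
   - Application status
   - Failed applications
4. Application metrics
   - Job success rate
   - Job latency
   - Data volume processed
   - Response time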
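1.3 Alert Levels
Alert levels: the thresholds and rules in this article use two severity levels, Warning and Critical.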
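Part02-Production Environment Planning and Recommendations
2.1 Alert Planning
Alert planning:
1. Metric selection
   - Choose key metrics
   - Avoid alert storms
   - Focus on business impact
2. Threshold settings
   - Set reasonable thresholds
   - Avoid false positives
   - Avoid missed alerts
3. Notification channels
   - Email
   - SMS
   - WeChat
   - Phone call
   - Ticket
4. Alert timing
   - Working hours
   - Off hours
   - Holidays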
2.2 Threshold Settings
Threshold settings:
- CPU utilization: Warning 70%, Critical 90%
- Memory utilization: Warning 80%, Critical 95%
- Disk utilization: Warning 80%, Critical 90%
- DataNodes: Warning below 3 nodes, Critical below 2 nodes
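The threshold table above can be read as a small classification function. The following is a minimal sketch rather than part of any monitoring tool: the dictionary layout and function names are hypothetical, and it assumes a value must exceed a threshold (strictly greater than) to count as a breach.

```python
# Thresholds copied from the list above; the data layout is illustrative only.
THRESHOLDS = {
    "cpu_percent": {"warning": 70.0, "critical": 90.0},
    "memory_percent": {"warning": 80.0, "critical": 95.0},
    "disk_percent": {"warning": 80.0, "critical": 90.0},
}


def classify_usage(metric, value):
    """Return 'critical', 'warning', or 'ok' for a utilization percentage."""
    limits = THRESHOLDS[metric]
    # Check the stricter level first so a critical value is never
    # reported as a mere warning.
    if value > limits["critical"]:
        return "critical"
    if value > limits["warning"]:
        return "warning"
    return "ok"


def classify_datanodes(live_count):
    """DataNode alerts trigger when the count falls below a floor."""
    if live_count < 2:
        return "critical"
    if live_count < 3:
        return "warning"
    return "ok"
```

Note that the utilization metrics alert on values that are too high, while the DataNode count alerts on values that are too low, which is why the two cases are kept in separate functions.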
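2.3 Alert Escalation
Alert escalation:
1. Level 1 alert
   - Send an email
   - Notify the on-call engineer
   - Escalate if there is no response within 15 minutes
2. Level 2 alert
   - Send an SMS
   - Notify the technical lead
   - Escalate if there is no response within 30 minutes
3. Level 3 alert
   - Make a phone call
   - Notify the department manager
   - Escalate if there is no response within 1 hour
4. Level 4 alert
   - Notify senior executives
   - Activate the emergency response plan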
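The escalation ladder above can be sketched as a function that maps the time an alert has gone unacknowledged to its current level. This is an illustrative sketch only: the names are hypothetical, and it assumes each level's timer starts when that level is reached, so level 3 begins after 15 + 30 = 45 minutes and level 4 after 105 minutes.

```python
# (minutes allowed without a response at this level, action taken)
# A wait of None marks the final level, which never escalates further.
ESCALATION_STEPS = [
    (15, "email the on-call engineer"),
    (30, "send an SMS to the technical lead"),
    (60, "phone the department manager"),
    (None, "notify executives and activate the emergency plan"),
]


def escalation_level(minutes_unacknowledged):
    """Return the 1-based escalation level for an unacknowledged alert."""
    level = 1
    remaining = minutes_unacknowledged
    for wait, _action in ESCALATION_STEPS:
        if wait is None or remaining < wait:
            return level
        # This level's window has fully elapsed: move up one level.
        remaining -= wait
        level += 1
    return level


def escalation_action(minutes_unacknowledged):
    """Return the action associated with the current level."""
    return ESCALATION_STEPS[escalation_level(minutes_unacknowledged) - 1][1]
```

In a real deployment this logic usually lives in the paging or ticketing system rather than in Prometheus itself.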
Part03-Production Implementation Plan
3.1 Prometheus Alert Configuration
3.1.1 Prometheus Alert Rules
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
rule_files:
  - "alerts.yml"
# alerts.yml
groups:
  - name: hadoop_alerts
    rules:
      - alert: HighCpuUsage
        # rate() over 5m smooths short spikes so the "for" clause is not reset by brief dips
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
      - alert: LowDiskSpace
        # Fires on free space (not used space) dropping below 20%
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 20
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Free disk space is {{ $value }}% on {{ $labels.instance }}"
      - alert: DataNodeDown
        # up is 1 for a reachable target and 0 otherwise, so the sum is the live count
        expr: sum(up{job="datanode"}) < 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Too few DataNodes"
          description: "Only {{ $value }} DataNodes are up"
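To make the CPU expression above easier to follow, here is a minimal sketch of the same arithmetic in Python: take the per-second growth of each core's idle counter, average across cores, and subtract the idle percentage from 100. The function name and input layout are hypothetical, and the sketch uses only the first and last sample of each window, without the boundary extrapolation Prometheus applies.

```python
def cpu_usage_percent(idle_samples):
    """Compute CPU usage from idle-time counters.

    idle_samples maps a core id to a list of (timestamp_seconds,
    idle_seconds_total) tuples covering the lookup window.
    """
    idle_rates = []
    for samples in idle_samples.values():
        (t0, v0), (t1, v1) = samples[0], samples[-1]
        # Idle seconds gained per wall-clock second, i.e. the idle fraction.
        idle_rates.append((v1 - v0) / (t1 - t0))
    average_idle = sum(idle_rates) / len(idle_rates)
    return 100.0 - average_idle * 100.0


# Two cores observed over 60 seconds: one idle half the time,
# the other idle a quarter of the time.
window = {
    "cpu0": [(0.0, 100.0), (60.0, 130.0)],
    "cpu1": [(0.0, 200.0), (60.0, 215.0)],
}
usage = cpu_usage_percent(window)
```

With these inputs the average idle fraction is 0.375, so the node would be reported as 62.5% busy and would not yet breach an 80% threshold.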
3.2 Grafana Alert Configuration
3.2.1 Grafana Alert Rules
# Configure in the Grafana UI
# 1. Create a Dashboard
# 2. Add a Panel
# 3. Configure the Query
# 4. Set up the Alert
# Example: HDFS disk utilization alert
# Query: 100 - (hdfs_datanode_remaining / hdfs_datanode_capacity) * 100
# Condition: avg() > 80
# For: 5m
# Severity: warning
# Example: DataNode count alert
# Query: sum(up{job="datanode"})
# Condition: avg() < 3
# For: 5m
# Severity: critical
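For a live-node count the choice of aggregation matters: `count(up{job="datanode"})` counts every scraped target, including those whose `up` value is 0, whereas `sum(up{job="datanode"})` counts only reachable ones. The sketch below mimics both aggregations on made-up sample data (the host names are hypothetical).

```python
# Simulated instant vector for up{job="datanode"}: 1 = reachable, 0 = down.
up_samples = {"dn1": 1, "dn2": 1, "dn3": 0, "dn4": 0}


def count_series(samples):
    """Like PromQL count(): number of series, regardless of their values."""
    return len(samples)


def sum_values(samples):
    """Like PromQL sum(): adds the values, so only targets with up == 1 count."""
    return sum(samples.values())


configured = count_series(up_samples)
alive = sum_values(up_samples)
```

Here two of four DataNodes are down, yet a `count()`-based condition of `< 3` would stay silent because it still sees four series; the `sum()`-based condition sees two and fires.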
3.3 Email Alerts
3.3.1 Alertmanager Email Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'admin@fgedu.net.cn'
        from: 'alertmanager@fgedu.net.cn'
        smarthost: 'smtp.fgedu.net.cn:587'
        auth_username: 'alertmanager@fgedu.net.cn'
        auth_password: 'fgedu123'  # example value only
        require_tls: true
inhibit_rules:
  # A firing critical alert mutes warning alerts that share its alertname
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname']
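The inhibit rule above can be hard to reason about, so here is a simplified sketch of its matching logic in Python. It is not Alertmanager's implementation, and the function name is hypothetical; it only mirrors the three conditions in the rule: the source is critical, the target is warning, and both carry the same `alertname`.

```python
def is_inhibited(target, firing_alerts):
    """Return True if a warning alert is muted by a firing critical alert.

    Alerts are plain dicts of label name to label value.
    """
    if target.get("severity") != "warning":
        return False
    return any(
        source.get("severity") == "critical"
        and source.get("alertname") == target.get("alertname")
        for source in firing_alerts
    )


firing = [
    {"alertname": "LowDiskSpace", "severity": "critical"},
    {"alertname": "CriticalCpuUsage", "severity": "critical"},
]
disk_warning = {"alertname": "LowDiskSpace", "severity": "warning"}
cpu_warning = {"alertname": "HighCpuUsage", "severity": "warning"}
```

With this data `disk_warning` is muted but `cpu_warning` is not, because `HighCpuUsage` and `CriticalCpuUsage` are different alert names. If warning and critical rules use different names, matching on a shared label such as `instance` may be a better fit.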
Part04-Production Cases and Hands-On Walkthrough
4.1 HDFS Alert Configuration
4.1.1 Hands-On Example
# The entries below belong under groups[].rules in alerts.yml
# 1. NameNode status
- alert: NameNodeDown
  expr: up{job="namenode"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "NameNode {{ $labels.instance }} is down"
# 2. Number of DataNodes
- alert: DataNodeLow
  expr: sum(up{job="datanode"}) < 3
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Too few DataNodes: {{ $value }}"
# 3. HDFS space
- alert: LowHdfsSpace
  expr: (hdfs_namenode_capacity_remaining / hdfs_namenode_capacity_total) * 100 < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Low HDFS space: {{ $value }}%"
# 4. Corrupt blocks
- alert: CorruptBlocks
  expr: hdfs_namenode_corrupt_blocks > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Corrupt blocks: {{ $value }}"
4.2 YARN Alert Configuration
4.2.1 Hands-On Example
# The entries below belong under groups[].rules in alerts.yml
# 1. ResourceManager status
- alert: ResourceManagerDown
  expr: up{job="resourcemanager"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "ResourceManager {{ $labels.instance }} is down"
# 2. Number of NodeManagers
- alert: NodeManagerLow
  expr: sum(up{job="nodemanager"}) < 3
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Too few NodeManagers: {{ $value }}"
# 3. Failed applications
- alert: FailedApplications
  expr: increase(yarn_apps_failed_total[1h]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Many failed applications: {{ $value }}"
# 4. Queue resources
- alert: QueueHighUsage
  # Ratio of used to configured queue memory; 0.9 means 90% of capacity
  expr: yarn_queue_used_memory / yarn_queue_capacity_memory > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High queue usage: {{ $value }}"
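The failed-application rule above relies on `increase()`, which measures how much a counter grew over a window even if the exporting process restarted in between. The sketch below shows that idea in simplified form. It is an approximation under stated assumptions: the function name is hypothetical, and Prometheus additionally extrapolates to the window boundaries, which is omitted here.

```python
def counter_increase(values):
    """Total growth of a monotonically increasing counter across samples.

    A sample lower than its predecessor is treated as a counter reset
    (for example after a ResourceManager restart).
    """
    total = 0.0
    for previous, current in zip(values, values[1:]):
        if current < previous:
            # After a reset the counter restarted from zero, so the whole
            # current value is new growth.
            total += current
        else:
            total += current - previous
    return total


steady = counter_increase([100, 104, 109])
with_restart = counter_increase([100, 104, 2, 5])
```

Both series represent nine new failures; a naive "last minus first" calculation would report -95 for the second one.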
4.3 System Alert Configuration
4.3.1 Hands-On Example
# The entries below belong under groups[].rules in alerts.yml
# 1. CPU utilization
- alert: HighCpuUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU: {{ $value }}%"
- alert: CriticalCpuUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Critical CPU: {{ $value }}%"
# 2. Memory utilization
- alert: HighMemoryUsage
  expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High memory: {{ $value }}%"
- alert: CriticalMemoryUsage
  expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 95
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Critical memory: {{ $value }}%"
# 3. Disk utilization (expressed as free space remaining)
- alert: LowDiskSpace
  expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Low disk: {{ $value }}% free"
- alert: CriticalLowDiskSpace
  expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 10
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Critical low disk: {{ $value }}% free"
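Part05-风哥 (Fengge) Lessons Learned and Takeaways
5.1 Best Practices
Best practices:
- Key metrics: alert only on key metrics
- Reasonable thresholds: set thresholds that match real risk
- Noise reduction: avoid alert storms
- Multiple channels: deliver alerts through more than one channel
- Escalation: escalate when an alert goes unanswered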
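5.2 Alert Noise Reduction
1. Alert grouping
   - Group identical alerts
   - Reduce duplicate notifications
2. Alert inhibition
   - Higher-priority alerts suppress lower-priority ones
   - Root-cause alerts suppress derived alerts
3. Alert silencing
   - Silence during maintenance windows
   - Silence known issues
4. Threshold tuning
   - Avoid firing on frequent fluctuations
   - Lengthen the required duration
5. Alert tiering
   - Handle each level differently
   - Distinguish working hours from off hours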
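Lengthening the required duration (item 4 above) is what the `for:` clause in a Prometheus rule does: the condition must hold on every evaluation for the whole period before the alert fires. The sketch below simulates that behaviour. The function name is hypothetical, and it assumes the duration is expressed as a whole number of evaluation intervals (for example, `for: 5m` with a 15s interval is 20 intervals).

```python
def first_firing_index(breaches, for_intervals):
    """Return the index of the evaluation at which the alert fires, or None.

    breaches is a list of booleans, one per rule evaluation, saying whether
    the threshold expression was true. The alert becomes pending on the
    first breach and fires once the breach has lasted for_intervals
    further evaluations without interruption.
    """
    streak = 0
    for index, breached in enumerate(breaches):
        # Any non-breaching evaluation resets the pending state.
        streak = streak + 1 if breached else 0
        if streak > for_intervals:
            return index
    return None


sustained = first_firing_index([True, False, True, True, True, True], 2)
flapping = first_firing_index([True, False, True, False, True, False], 2)
```

The sustained breach fires at index 4, while the flapping series never fires, which is exactly the noise the longer duration is meant to filter out.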
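5.3 Checklist
## Configuration checks
- [ ] Alerts are configured for key metrics
- [ ] Thresholds are set reasonably
- [ ] Notification channels are configured
- [ ] Alert escalation is configured
- [ ] Alert logging is configured
## Functional checks
- [ ] Alerts fire as expected
- [ ] Alerts are delivered as expected
- [ ] Alert grouping works
- [ ] Alert inhibition works
- [ ] Alert escalation works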
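One way to work through the functional checks is to push a synthetic alert into Alertmanager and confirm that the notification arrives. The following is a hedged sketch using only the Python standard library and Alertmanager's `POST /api/v2/alerts` endpoint. The alert name `PipelineSelfTest` and the helper names are hypothetical, and the base URL assumes the `alertmanager:9093` address used earlier in this article.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname="PipelineSelfTest", severity="warning", minutes=5):
    """Build a one-element alert list in the shape the v2 API accepts."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": alertname, "severity": severity},
        "annotations": {"summary": "Synthetic alert for delivery testing"},
        "startsAt": now.isoformat(),
        # The alert resolves on its own after the given number of minutes.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(alerts, base_url="http://alertmanager:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_alerts(build_test_alert())` from a host that can reach Alertmanager should produce an email through the configured receiver; sending a `critical` and a `warning` alert with the same name also exercises the inhibition check.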
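This article was compiled and published by 风哥教程 (Fengge Tutorials) for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html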
