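This article walks through building a governance and monitoring stack for a Hadoop big data platform: installing and configuring Prometheus, Grafana, EFK, and Atlas, setting up monitoring and alerting, and managing data governance. The Fengge tutorial (风哥教程) follows each tool's official documentation and is aimed at big data operations engineers.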
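Part01-Basic Concepts and Theory
1.1 Big Data Governance Overview
Big data governance is the process of managing the availability, integrity, consistency, and security of big data so that the data can be used correctly, securely, and efficiently. It covers:
- Metadata management: data dictionary, lineage
- Data quality management: quality monitoring, cleansing
- Data security management: access control, encryption, masking
- Data lifecycle management: archiving, cleanup
- Data asset management: data catalog, value assessment
- Monitoring and alerting: system monitoring, anomaly alerts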
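1.2 Governance Framework
The big data governance framework has four layers:
Tool layer:
- Metadata management: Apache Atlas
- Data quality: Apache Griffin
- Data security: Apache Ranger, Sentry
- Monitoring and alerting: Prometheus, Grafana
- Log platform: ELK/EFK
- Workflow scheduling: Apache Oozie, Airflow
Management layer:
- Metadata management
- Data quality management
- Data security management
- Lifecycle management
- Cost management
Process layer:
- Data ingestion process
- Data development process
- Data release process
- Data destruction process
Organization layer:
- Data governance committee
- Data owners
- Data stewards
- Data users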
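1.3 Monitoring Stack Overview
Core components of the monitoring stack:
- Metrics collection: Prometheus, Telegraf, JMX Exporter
- Log collection: Fluentd, Filebeat, Logstash
- Storage: Prometheus TSDB, Elasticsearch
- Visualization: Grafana, Kibana
- Alerting: Alertmanager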
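Part02-Production Environment Planning and Recommendations
2.1 Governance Architecture Planning
Key points for sizing the governance architecture:
Monitoring servers:
- Count: 3 (Prometheus high availability)
- Spec: 16 cores, 32 GB RAM
- Disk: 2 TB SSD
Log servers:
- Count: 3-5 (Elasticsearch cluster)
- Spec: 16 cores, 64 GB RAM
- Disk: 4 TB SSD
Governance servers:
- Count: 2 (Atlas high availability)
- Spec: 8 cores, 16 GB RAM
- Disk: 500 GB SSD
# Component planning
Monitoring components:
- Prometheus: metrics storage and querying
- Grafana: visualization
- Alertmanager: alerting
- Node Exporter: host metrics
- JMX Exporter: JVM metrics
Logging components:
- Elasticsearch: log storage
- Fluentd: log collection
- Kibana: log search
Governance components:
- Atlas: metadata management
- Ranger: access control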
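2.2 Monitoring Architecture Planning
The monitoring architecture is organized into four layers:
Collection layer:
- Node Exporter: on every host
- JMX Exporter: on every JVM application
- Fluentd: on every host
Transport layer:
- Prometheus: pull model
- Fluentd: push model
Storage layer:
- Prometheus TSDB: time-series data
- Elasticsearch: log data
Application layer:
- Grafana: monitoring dashboards
- Kibana: log search
- Alertmanager: alert notifications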
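2.3 Alert Rule Planning
Alert rules to plan for each component:
- Hosts: CPU usage, memory usage, disk usage
- HDFS: NameNode status, DataNode status, disk space
- YARN: ResourceManager status, NodeManager status, queue resources
- Hive: HiveServer2 status, Metastore status
- Other components: service status, key metrics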
Part03-Production Implementation Plan
3.1 Setting Up Prometheus + Grafana Monitoring
3.1.1 Installing and Configuring Prometheus
# 1. Download and unpack Prometheus
cd /bigdata/app
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar -zxvf prometheus-2.45.0.linux-amd64.tar.gz
ln -s prometheus-2.45.0.linux-amd64 prometheus
# 2. Configure prometheus.yml
cat > /bigdata/app/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - fgedu-alertmanager:9093
rule_files:
  - "rules/*.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    file_sd_configs:
      - files:
          - '/bigdata/app/prometheus/targets/node.json'
  - job_name: 'hadoop_jmx'
    file_sd_configs:
      - files:
          - '/bigdata/app/prometheus/targets/hadoop_jmx.json'
EOF
# 3. Create the configuration and data directories
mkdir -p /bigdata/app/prometheus/rules
mkdir -p /bigdata/app/prometheus/targets
mkdir -p /bigdata/fgdata/prometheus
# 4. Configure the Node Exporter targets
cat > /bigdata/app/prometheus/targets/node.json << 'EOF'
[
  {
    "targets": [
      "fgedu-nn:9100",
      "fgedu-dn01:9100",
      "fgedu-dn02:9100",
      "fgedu-dn03:9100",
      "fgedu-rm:9100",
      "fgedu-hive:9100"
    ],
    "labels": {
      "env": "production"
    }
  }
]
EOF
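Maintaining node.json by hand gets error-prone as hosts are added. As a rough sketch, the same file_sd structure can be generated from a host list. The function name `build_file_sd_targets` and the idea of scripting this step are assumptions for illustration, not part of the original setup.

```python
import json


def build_file_sd_targets(hosts, port, labels):
    """Return a Prometheus file_sd document (JSON string) for the given hosts."""
    group = {
        "targets": [f"{host}:{port}" for host in hosts],
        "labels": dict(labels),
    }
    # file_sd expects a JSON list of target groups, even when there is only one.
    return json.dumps([group], indent=2)


if __name__ == "__main__":
    # Host names taken from the node.json example in this article.
    hosts = ["fgedu-nn", "fgedu-dn01", "fgedu-dn02", "fgedu-dn03", "fgedu-rm", "fgedu-hive"]
    print(build_file_sd_targets(hosts, 9100, {"env": "production"}))
```

Redirecting the output to /bigdata/app/prometheus/targets/node.json would produce a file equivalent to the hand-written one; Prometheus re-reads file_sd files when they change, so no restart should be needed.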
# 5. Configure the Hadoop JMX targets
cat > /bigdata/app/prometheus/targets/hadoop_jmx.json << 'EOF'
[
  {
    "targets": ["fgedu-nn:9400"],
    "labels": {
      "job": "hdfs_namenode",
      "env": "production"
    }
  },
  {
    "targets": ["fgedu-rm:9401"],
    "labels": {
      "job": "yarn_resourcemanager",
      "env": "production"
    }
  }
]
EOF
# 6. Configure the alert rules
cat > /bigdata/app/prometheus/rules/host_rules.yml << 'EOF'
groups:
  - name: host_alerts
    rules:
      - alert: HostHighCpuUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host high CPU usage"
          description: "CPU usage is {{ $value }}%"
      - alert: HostHighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host high memory usage"
          description: "Memory usage is {{ $value }}%"
      - alert: HostDiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 20
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host disk space low"
          description: "Disk {{ $labels.mountpoint }} has {{ $value }}% available"
EOF
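To make the arithmetic in the three expressions concrete, here is a small sketch that reproduces it on plain numbers. The function names are hypothetical, and the `for: 5m` hold time is deliberately ignored; this only shows how each percentage is derived and compared with its threshold.

```python
def cpu_usage_percent(idle_rates):
    """idle_rates: per-core idle fraction (0..1) for one instance, as the idle-rate term yields."""
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


def memory_usage_percent(mem_available_bytes, mem_total_bytes):
    """Used memory as a percentage of total, based on MemAvailable."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100


def disk_available_percent(avail_bytes, size_bytes):
    """Free space on one filesystem as a percentage of its size."""
    return avail_bytes / size_bytes * 100


def firing_alerts(cpu, mem, disk_avail):
    """Apply the thresholds from host_rules.yml (without the 5-minute hold)."""
    alerts = []
    if cpu > 80:
        alerts.append("HostHighCpuUsage")
    if mem > 85:
        alerts.append("HostHighMemoryUsage")
    if disk_avail < 20:
        alerts.append("HostDiskSpaceLow")
    return alerts
```

For example, a host whose four cores are idle 10%, 10%, 20%, and 20% of the time averages 15% idle, so its CPU usage is about 85% and crosses the 80% threshold.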
# 7. Start Prometheus
cd /bigdata/app/prometheus
nohup ./prometheus \
  --config.file=prometheus.yml \
  --storage.tsdb.path=/bigdata/fgdata/prometheus \
  --web.enable-lifecycle &
# 8. Verify Prometheus
curl http://localhost:9090/-/healthy
# 9. Install Grafana
cd /bigdata/app
wget https://dl.grafana.com/oss/release/grafana-9.5.2.linux-amd64.tar.gz
tar -zxvf grafana-9.5.2.linux-amd64.tar.gz
ln -s grafana-9.5.2 grafana
# 10. Start Grafana
cd /bigdata/app/grafana
nohup ./bin/grafana-server &
# 11. Open Grafana
# http://fgedu-grafana:3000
# Default username: admin
# Default password: admin (change it after the first login)
# 12. Configure the Prometheus data source
# Add a Prometheus data source in Grafana
# URL: http://fgedu-prometheus:9090
# 13. Import dashboards
# Node Exporter dashboard: 1860
# JVM dashboard: 8563
# HDFS dashboard: custom
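Before building dashboards it helps to confirm that every scrape target is actually up. The sketch below queries the Prometheus HTTP API (`/api/v1/query`) for the built-in `up` metric and lists targets reporting 0. The helper names are hypothetical, and the URL assumes the fgedu-prometheus host used for the data source above.

```python
import json
import urllib.parse
import urllib.request


def query_prometheus(base_url, promql, timeout=5):
    """Run an instant query against the Prometheus HTTP API and return the decoded JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def down_targets(response):
    """Return (job, instance) pairs whose 'up' sample is 0 in an instant-query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the sample value arrives as a string
        if float(value) == 0:
            metric = series["metric"]
            down.append((metric.get("job"), metric.get("instance")))
    return down
```

A typical call would be `down_targets(query_prometheus("http://fgedu-prometheus:9090", "up"))`; an empty list means every configured target answered its last scrape.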
3.2 Setting Up the EFK Log Platform
3.2.1 Installing and Configuring Elasticsearch
# 1. Download and unpack Elasticsearch
cd /bigdata/app
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.17.10-linux-x86_64.tar.gz
tar -zxvf elasticsearch-7.17.10-linux-x86_64.tar.gz
ln -s elasticsearch-7.17.10 elasticsearch
# 2. Adjust system settings
vi /etc/security/limits.conf
# Add the following lines:
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096
vi /etc/sysctl.conf
# Add the following line:
vm.max_map_count=262144
sysctl -p
# 3. Configure elasticsearch.yml (node.name must be unique on each node)
cat > /bigdata/app/elasticsearch/config/elasticsearch.yml << 'EOF'
cluster.name: fgedu-es-cluster
node.name: fgedu-es01
path.data: /bigdata/fgdata/elasticsearch/data
path.logs: /bigdata/fgdata/logs/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.seed_hosts: ["fgedu-es01", "fgedu-es02", "fgedu-es03"]
cluster.initial_master_nodes: ["fgedu-es01", "fgedu-es02", "fgedu-es03"]
EOF
# 4. Create the directories and the service user
mkdir -p /bigdata/fgdata/elasticsearch/data
mkdir -p /bigdata/fgdata/logs/elasticsearch
useradd elasticsearch
chown -R elasticsearch:elasticsearch /bigdata/app/elasticsearch
chown -R elasticsearch:elasticsearch /bigdata/fgdata/elasticsearch
chown -R elasticsearch:elasticsearch /bigdata/fgdata/logs/elasticsearch
# 5. Start Elasticsearch (it refuses to run as root)
su - elasticsearch
cd /bigdata/app/elasticsearch
nohup ./bin/elasticsearch &
# 6. Verify Elasticsearch
curl http://localhost:9200
# 7. Install Kibana
cd /bigdata/app
wget https://artifacts.elastic.co/downloads/kibana/kibana-7.17.10-linux-x86_64.tar.gz
tar -zxvf kibana-7.17.10-linux-x86_64.tar.gz
ln -s kibana-7.17.10-linux-x86_64 kibana
# 8. Configure kibana.yml
cat > /bigdata/app/kibana/config/kibana.yml << 'EOF'
server.host: "0.0.0.0"
server.port: 5601
elasticsearch.hosts: ["http://fgedu-es01:9200"]
EOF
# 9. Start Kibana
cd /bigdata/app/kibana
nohup ./bin/kibana &
# 10. Install Fluentd (td-agent 4)
curl -L https://toolbelt.treasuredata.com/sh/install-redhat-td-agent4.sh | sh
# 11. Configure Fluentd
cat > /etc/td-agent/td-agent.conf << 'EOF'
<source>
  @type tail
  path /bigdata/fgdata/logs/hadoop/hadoop-*.log
  pos_file /var/log/td-agent/hadoop.log.pos
  tag hadoop.log
  <parse>
    @type none
  </parse>
</source>
<source>
  @type tail
  path /bigdata/fgdata/logs/yarn/yarn-*.log
  pos_file /var/log/td-agent/yarn.log.pos
  tag yarn.log
  <parse>
    @type none
  </parse>
</source>
<match hadoop.log>
  @type elasticsearch
  host fgedu-es01
  port 9200
  logstash_format true
  logstash_prefix hadoop
  flush_interval 5s
</match>
<match yarn.log>
  @type elasticsearch
  host fgedu-es01
  port 9200
  logstash_format true
  logstash_prefix yarn
  flush_interval 5s
</match>
EOF
# 12. Start Fluentd and enable it at boot
systemctl start td-agent
systemctl enable td-agent
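With `logstash_format true`, the Elasticsearch output plugin writes to one index per day named after `logstash_prefix`, for example hadoop-2024.01.05 (the date is taken in UTC unless configured otherwise). A quick way to confirm that logs are arriving is to ask Elasticsearch for the document count of today's index. The sketch below only builds the index name and `_count` URL and interprets the response; the helper names are hypothetical.

```python
from datetime import date


def daily_index_name(prefix, day):
    """Index name for one day, following the prefix-YYYY.MM.DD pattern."""
    return f"{prefix}-{day.strftime('%Y.%m.%d')}"


def count_url(es_base_url, prefix, day):
    """URL of the Elasticsearch _count API for one daily index."""
    return f"{es_base_url}/{daily_index_name(prefix, day)}/_count"


def has_documents(count_response):
    """True if a decoded _count response reports at least one document."""
    return count_response.get("count", 0) > 0


if __name__ == "__main__":
    # fgedu-es01 is the Elasticsearch host used in td-agent.conf above.
    print(count_url("http://fgedu-es01:9200", "hadoop", date.today()))
```

Fetching the printed URL with curl returns a small JSON document whose `count` field should grow while the Hadoop daemons are logging.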
3.3 Setting Up Atlas Metadata Management
3.3.1 Installing and Configuring Atlas
# 1. Download and unpack the Atlas source
cd /bigdata/app
wget https://archive.apache.org/dist/atlas/2.3.0/apache-atlas-2.3.0-sources.tar.gz
tar -zxvf apache-atlas-2.3.0-sources.tar.gz
cd apache-atlas-sources-2.3.0
# 2. Build Atlas (skipping tests)
mvn clean -DskipTests package -Pdist,embedded-hbase-solr
# 3. Unpack the built server package
cd distro/target
tar -zxvf apache-atlas-2.3.0-server.tar.gz
mv apache-atlas-2.3.0 /bigdata/app/
ln -s /bigdata/app/apache-atlas-2.3.0 /bigdata/app/atlas
# 4. Configure Atlas
cd /bigdata/app/atlas
vi conf/atlas-application.properties
# Key settings
atlas.server.address=fgedu-atlas
atlas.server.http.port=21000
atlas.graph.storage.backend=hbase2
atlas.graph.storage.hostname=fgedu-hbase
atlas.graph.storage.hbase.table=atlas_janus
atlas.kafka.zookeeper.connect=fgedu-zk:2181
atlas.notification.embedded=false
atlas.kafka.bootstrap.servers=fgedu-kafka:9092
# 5. Start Atlas
bin/atlas_start.py
# 6. Open Atlas
# http://fgedu-atlas:21000
# Default username: admin
# Default password: admin
# 7. Configure the Hive hook
# Add the following to hive-site.xml
<property>
  <name>hive.exec.post.hooks</name>
  <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
# Copy the Atlas configuration and hook jars to Hive
cp /bigdata/app/atlas/conf/atlas-application.properties /bigdata/app/hive/conf/
cp /bigdata/app/atlas/hook/hive/*.jar /bigdata/app/hive/lib/
# 8. Restart HiveServer2 (the hive launcher has no stop option, so stop the running process first)
pkill -f org.apache.hive.service.server.HiveServer2
nohup hive --service hiveserver2 > /dev/null 2>&1 &
# 9. Test Hive metadata synchronization
beeline
> CREATE TABLE fgedu_test (id INT, name STRING);
> INSERT INTO fgedu_test VALUES (1, 'fgedu');
# 10. Check the metadata in Atlas
# Open the Atlas UI and search for the fgedu_test table
Part04-Production Cases and Hands-On Walkthroughs
4.1 Cluster Monitoring in Practice
4.1.1 Installing and Configuring Node Exporter
# 1. Download and unpack Node Exporter
cd /bigdata/app
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar -zxvf node_exporter-1.6.1.linux-amd64.tar.gz
ln -s node_exporter-1.6.1.linux-amd64 node_exporter
# 2. Start Node Exporter
cd /bigdata/app/node_exporter
nohup ./node_exporter --web.listen-address=:9100 &
# 3. Verify Node Exporter
curl http://localhost:9100/metrics
# 4. Download JMX Exporter
cd /bigdata/app
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.19.0/jmx_prometheus_javaagent-0.19.0.jar
# 5. Configure HDFS to use JMX Exporter
# jmx_exporter.yaml must already exist at the path below; this article does not show its contents
# NameNode and DataNode both use port 9400 here, which only works when they run on different hosts
vi /bigdata/app/hadoop/etc/hadoop/hadoop-env.sh
export HDFS_NAMENODE_OPTS="-javaagent:/bigdata/app/jmx_prometheus_javaagent-0.19.0.jar=9400:/bigdata/app/hadoop/etc/hadoop/jmx_exporter.yaml $HDFS_NAMENODE_OPTS"
export HDFS_DATANODE_OPTS="-javaagent:/bigdata/app/jmx_prometheus_javaagent-0.19.0.jar=9400:/bigdata/app/hadoop/etc/hadoop/jmx_exporter.yaml $HDFS_DATANODE_OPTS"
# 6. Configure YARN to use JMX Exporter
export YARN_RESOURCEMANAGER_OPTS="-javaagent:/bigdata/app/jmx_prometheus_javaagent-0.19.0.jar=9401:/bigdata/app/hadoop/etc/hadoop/jmx_exporter.yaml $YARN_RESOURCEMANAGER_OPTS"
export YARN_NODEMANAGER_OPTS="-javaagent:/bigdata/app/jmx_prometheus_javaagent-0.19.0.jar=9402:/bigdata/app/hadoop/etc/hadoop/jmx_exporter.yaml $YARN_NODEMANAGER_OPTS"
# 7. Restart HDFS and YARN
hdfs --daemon stop namenode
hdfs --daemon start namenode
yarn --daemon stop resourcemanager
yarn --daemon start resourcemanager
# 8. Verify JMX Exporter
curl http://fgedu-nn:9400/metrics
# 9. Configure the Grafana dashboard
# Import Node Exporter dashboard 1860 in Grafana
# Configure the Prometheus data source
# Check the monitoring charts
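The /metrics endpoints checked above return the Prometheus text exposition format: comment lines starting with `#`, then one `name{labels} value` sample per line. When a dashboard panel stays empty, it can help to check whether the expected series is exposed at all. The parser below is a deliberately simplified sketch (it ignores timestamps and assumes label values contain no `}`), and the metric names in the usage note are only examples of what a JVM agent typically exposes.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into a {series: value} dict.

    Simplified: skips comment lines, ignores trailing timestamps, and assumes
    label values do not contain '}' characters.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series with labels: everything up to the last '}' is the series key.
            series, rest = line.rsplit("}", 1)
            series += "}"
        else:
            # Series without labels: the name ends at the first whitespace.
            series, rest = line.split(None, 1)
        samples[series] = float(rest.split()[0])
    return samples
```

Saving the output of `curl http://fgedu-nn:9400/metrics` to a file and passing its contents to `parse_metrics` gives a dictionary that can be searched for names such as `jvm_threads_current`.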
4.2 Alerting Configuration in Practice
4.2.1 Installing and Configuring Alertmanager
# 1. Download and unpack Alertmanager
cd /bigdata/app
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar -zxvf alertmanager-0.25.0.linux-amd64.tar.gz
ln -s alertmanager-0.25.0.linux-amd64 alertmanager
# 2. Configure alertmanager.yml
cat > /bigdata/app/alertmanager/alertmanager.yml << 'EOF'
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.fgedu.net.cn:587'
  smtp_from: 'alertmanager@fgedu.net.cn'
  smtp_auth_username: 'alertmanager@fgedu.net.cn'
  smtp_auth_password: 'fgedu123'
route:
  group_by: ['alertname', 'env']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'critical-notifications'
      continue: true
receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'admin@fgedu.net.cn'
        send_resolved: true
  - name: 'critical-notifications'
    email_configs:
      - to: 'admin@fgedu.net.cn,oncall@fgedu.net.cn'
        send_resolved: true
    webhook_configs:
      - url: 'http://fgedu-webhook/api/alert'
EOF
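It is easy to misread which receiver an alert ends up at in this routing tree. The sketch below imitates, in simplified form, how Alertmanager walks a route tree as I understand it: child routes are tried in order, `continue` decides whether later siblings are also tried, and the parent's receiver is used only when no child matched. It supports only equality `match` blocks, and the function and variable names are hypothetical.

```python
def match_route(route, labels):
    """Return the receivers an alert with these labels is delivered to."""
    # A node matches only if every key in its 'match' block equals the alert's label.
    if any(labels.get(key) != value for key, value in route.get("match", {}).items()):
        return []
    receivers = []
    for child in route.get("routes", []):
        matched = match_route(child, labels)
        receivers.extend(matched)
        if matched and not child.get("continue", False):
            break  # stop at the first matching child unless it sets continue
    # Only when no child route matched does the alert use this node's receiver.
    if not receivers:
        receivers.append(route["receiver"])
    return receivers


# The route tree from alertmanager.yml above, written as a Python dict.
ROUTE = {
    "receiver": "email-notifications",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-notifications", "continue": True},
    ],
}
```

Under this model a critical alert goes only to critical-notifications and a warning goes only to email-notifications. Because there are no later sibling routes, `continue: true` has no visible effect in this configuration; it would matter only if further routes were added after the critical one.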
# 3. Start Alertmanager
cd /bigdata/app/alertmanager
nohup ./alertmanager --config.file=alertmanager.yml &
# 4. Verify Alertmanager
curl http://localhost:9093/-/healthy
# 5. Test alerting
# Trigger an alert in Prometheus
# Check the Alertmanager UI
# Check the email notification
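Waiting for a real threshold breach is a slow way to test notifications. One alternative is to post a synthetic alert straight to Alertmanager's v2 API (`POST /api/v2/alerts`), which exercises routing and email delivery without involving Prometheus. The sketch below is an illustration under assumptions: the helper names and the alert name `ManualTestAlert` are made up, and the base URL assumes the fgedu-alertmanager host from prometheus.yml.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(alertname, severity, env, summary):
    """Build a one-alert payload in the shape accepted by POST /api/v2/alerts."""
    return [{
        "labels": {"alertname": alertname, "severity": severity, "env": env},
        "annotations": {"summary": summary},
        # RFC 3339 timestamp in UTC marking when the alert started.
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def post_alerts(base_url, alerts, timeout=5):
    """Send alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

For example, `post_alerts("http://fgedu-alertmanager:9093", build_test_alert("ManualTestAlert", "critical", "production", "manual test"))` should make the alert appear in the Alertmanager UI, with the first notification sent after the 30-second `group_wait`.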
4.3 Data Governance in Practice
4.3.1 Metadata Management in Practice
# 1. Browse metadata in the Atlas UI
# Open the Atlas UI
# Search by table name or column name
# View data lineage
# 2. Use the Atlas API
# Search for entities
curl -u admin:admin -X GET \
  "http://fgedu-atlas:21000/api/atlas/v2/search/basic?query=fgedu_test"
# Get entity details
curl -u admin:admin -X GET \
  "http://fgedu-atlas:21000/api/atlas/v2/entity/guid/<guid>"
# Get lineage
curl -u admin:admin -X GET \
  "http://fgedu-atlas:21000/api/atlas/v2/lineage/<guid>"
# 3. Create a classification
curl -u admin:admin -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "classificationDefs": [
      {
        "category": "CLASSIFICATION",
        "name": "Sensitive",
        "description": "Sensitive data"
      }
    ]
  }' \
  "http://fgedu-atlas:21000/api/atlas/v2/types/typedefs"
# 4. Attach the classification to an entity
# The bulk endpoint takes one classification plus the list of entity GUIDs it applies to
curl -u admin:admin -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "classification": {"typeName": "Sensitive"},
    "entityGuids": ["<guid>"]
  }' \
  "http://fgedu-atlas:21000/api/atlas/v2/entity/bulk/classification"
# 5. Data lifecycle management
# Configure an HDFS data archiving policy
# Configure a retention policy for table partitions
# Clean up expired data on a schedule
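The partition retention step can be sketched as a small script that decides which partitions have aged out and renders the matching HiveQL. Everything specific here is an assumption for illustration: the table name `fgedu_logs`, the `dt=YYYY-MM-DD` partition layout, and the 90-day retention period are hypothetical and not part of the original setup.

```python
from datetime import date, datetime


def expired_partitions(partitions, today, retention_days):
    """Return the dt values (from 'dt=YYYY-MM-DD' specs) older than retention_days."""
    expired = []
    for spec in partitions:
        key, _, value = spec.partition("=")
        if key != "dt":
            continue  # ignore partitions that are not keyed by date
        day = datetime.strptime(value, "%Y-%m-%d").date()
        if (today - day).days > retention_days:
            expired.append(value)
    return expired


def drop_statements(table, values):
    """Render one HiveQL DROP PARTITION statement per expired value."""
    return [f"ALTER TABLE {table} DROP IF EXISTS PARTITION (dt='{value}');" for value in values]


if __name__ == "__main__":
    # Partition specs as printed by SHOW PARTITIONS for a hypothetical table.
    specs = ["dt=2024-01-01", "dt=2024-03-15", "dt=2024-04-01"]
    for statement in drop_statements("fgedu_logs", expired_partitions(specs, date(2024, 4, 10), 90)):
        print(statement)
```

Printing the statements instead of executing them keeps the cleanup reviewable; the output can be inspected and then fed to beeline from a scheduled job.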
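Part05-Fengge's Lessons Learned
5.1 Governance and Monitoring Best Practices
Best practices for governance and monitoring:
- Monitoring first: build the monitoring stack before starting other governance work
- Tune alerts: avoid alert storms and keep only the alerts that require action
- Proceed step by step: build the governance system incrementally rather than all at once
- Integrate tools: choose mature tools and avoid reinventing the wheel
- Automate: automate operational tasks wherever possible
- Document: keep monitoring and governance documentation complete and current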
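5.2 Troubleshooting Common Problems
# Common problem 1: Prometheus cannot scrape a target
- Check the exporter status
- Check network connectivity
- Check the firewall
- Check the Prometheus logs
# Common problem 2: Grafana shows no data
- Check the data source configuration
- Check the Prometheus status
- Check the selected time range
- Check the Grafana logs
# Common problem 3: Alerts are not sent
- Check the Alertmanager configuration
- Check the alert rules
- Check the email/WeChat configuration
- Check the Alertmanager logs
# Common problem 4: Elasticsearch queries are slow
- Check index sharding
- Check JVM memory
- Check the query itself
- Optimize the indices
# Common problem 5: Atlas metadata is not synchronized
- Check the hook configuration
- Check Kafka
- Check HBase
- Check the Atlas logs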
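Several of the checks above come down to the same question: can this machine open a TCP connection to the service port? A minimal sketch of that check follows, assuming it is run from the Prometheus server. The helper names are hypothetical; the host and port pairs in the usage note are the ones used earlier in this article.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


def unreachable(targets, check=port_open):
    """Return the (host, port) pairs that fail the connectivity check."""
    return [(host, port) for host, port in targets if not check(host, port)]
```

Calling `unreachable([("fgedu-nn", 9100), ("fgedu-nn", 9400), ("fgedu-alertmanager", 9093), ("fgedu-es01", 9200), ("fgedu-atlas", 21000)])` returns the endpoints that are down or blocked. A port that is open when tested on the service host itself but unreachable from the Prometheus server usually points to a firewall rule.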
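5.3 Operations Checklist
- [ ] Prometheus status
- [ ] Grafana status
- [ ] Alertmanager status
- [ ] Elasticsearch status
- [ ] Kibana status
- [ ] Atlas status
- [ ] Monitoring data is being collected
- [ ] Alert rules are working
- [ ] Log collection is working
- [ ] Metadata synchronization is working
- [ ] Alert notifications are delivered
- [ ] Alert rule review
- [ ] Log review
# Daily inspection items
1. Check the status of the monitoring systems
2. Review alerts
3. Review key metrics
4. Check log collection
5. Check metadata synchronization
6. Review error logs
7. Tune alert rules
8. Extend monitoring coverage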
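The status items in the checklist lend themselves to a small script that probes each component's HTTP health endpoint. The sketch below is one possible shape. The host names follow those used earlier in this article, Kibana and Atlas are left out because their addresses and authentication depend on the deployment, and the function names are hypothetical.

```python
import urllib.error
import urllib.request

# Health endpoints exposed by each component; host names as used in this article.
HEALTH_ENDPOINTS = {
    "prometheus": "http://fgedu-prometheus:9090/-/healthy",
    "alertmanager": "http://fgedu-alertmanager:9093/-/healthy",
    "grafana": "http://fgedu-grafana:3000/api/health",
    "elasticsearch": "http://fgedu-es01:9200/_cluster/health",
}


def http_status(url, timeout=5):
    """Return the HTTP status code for url, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None


def run_checks(endpoints, probe=http_status):
    """Map each component name to 'ok', 'unreachable', or 'http <code>'."""
    report = {}
    for name, url in endpoints.items():
        status = probe(url)
        if status == 200:
            report[name] = "ok"
        elif status is None:
            report[name] = "unreachable"
        else:
            report[name] = f"http {status}"
    return report
```

Running `run_checks(HEALTH_ENDPOINTS)` from cron and forwarding any non-"ok" entries covers the first checklist items. Note that an HTTP 200 from Elasticsearch only shows that the node answers; the `status` field in the response body (green, yellow, red) still needs to be read separately.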
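This article was compiled and published by the Fengge tutorial (风哥教程) for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html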
