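Outline
Part01-Basic Concepts and Theory
1.1 Cluster Inspection Overview
1.2 Inspection Item Categories
1.3 Maintenance Operation Standards
Part02-Production Environment Planning and Recommendations
2.1 Inspection Schedule Planning
2.2 Inspection Tool Planning
2.3 Maintenance Window Planning
Part03-Production Environment Implementation
3.1 HDFS Inspection
3.2 YARN Inspection
3.3 System Resource Inspection
3.4 Log Inspection and Analysis
Part04-Production Cases and Hands-on Walkthroughs
4.1 Automated Inspection Script Case
4.2 Incident Handling Case
4.3 Routine Maintenance Case
Part05-Lessons Learned from 风哥
5.1 Inspection and Maintenance Best Practices
5.2 Operations Lessons Learned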
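Part01-Basic Concepts and Theory
1.1 Cluster Inspection Overview
Regular inspection is a key practice for keeping a Hadoop cluster stable. Routine checks surface latent problems early, before they turn into outages.
1.2 Inspection Item Categories
Hadoop cluster inspection items fall into four categories:
- HDFS inspection: NameNode status, DataNode status, storage capacity, block health
- YARN inspection: ResourceManager status, NodeManager status, resource usage
- System inspection: CPU, memory, disk, network
- Log inspection: error logs, warning logs
1.3 Maintenance Operation Standards
Maintenance operations should follow a standard procedure:
- Take backups before the operation
- Choose a suitable maintenance window
- Record each step of the operation
- Verify the result
- Prepare a rollback plan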
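Part02-Production Environment Planning and Recommendations
2.1 Inspection Schedule Planning
The inspection schedule should reflect the size of the cluster and the criticality of the workloads it serves:
- Daily: basic status checks
- Weekly: performance metric analysis
- Monthly: in-depth health checks
- Quarterly: full security audit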
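As a rough illustration of the schedule above, the sketch below maps a calendar date to the inspection tiers due on that day. The function name `inspection_tiers` is hypothetical, and running weekly checks on Mondays and the monthly and quarterly checks on the first day of the month or quarter is an assumption made for illustration; the original plan does not fix those days. It relies on GNU `date -d`.

```shell
# Hypothetical helper: print the inspection tiers due on a given date (YYYY-MM-DD).
# Assumptions: weekly checks run on Mondays, monthly checks on the 1st,
# quarterly checks on the 1st of January, April, July and October.
inspection_tiers() {
    local dow month day tiers
    dow=$(date -d "$1" +%u)     # 1 = Monday ... 7 = Sunday (GNU date)
    month=$(date -d "$1" +%m)
    day=$(date -d "$1" +%d)
    tiers="daily"
    if [ "$dow" = "1" ]; then
        tiers="$tiers weekly"
    fi
    if [ "$day" = "01" ]; then
        tiers="$tiers monthly"
        case "$month" in
            01|04|07|10) tiers="$tiers quarterly" ;;
        esac
    fi
    echo "$tiers"
}

# 2024-01-17 is a Wednesday in the middle of the month.
inspection_tiers 2024-01-17    # prints: daily
```

A wrapper started once a day from cron could call this function and run only the checks listed in its output.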
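2.2 Inspection Tool Planning
The inspection toolset should cover every layer of the cluster. The built-in hdfs and yarn commands are the starting point.
# Show HDFS and YARN command usage
hdfs --help
yarn --help
# HDFS commands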
Usage: hdfs [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
SUBCOMMAND is one of:
dfs run a filesystem command on the file systems
dfsadmin run a DFS admin client
fsck run a DFS filesystem checking utility
haadmin run a DFS HA admin client
# YARN commands
Usage: yarn [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
SUBCOMMAND is one of:
application prints application(s) report/kill application
applicationattempt prints applicationattempt(s) report
node prints node report(s)
rmadmin ResourceManager admin client
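2.3 Maintenance Window Planning
Schedule maintenance windows during off-peak business hours. 风哥's tip: perform maintenance between 02:00 and 06:00.
# Count running applications (count only application rows, not the header lines)
yarn application -list -appStates RUNNING | grep -c "^application_"
# Check cluster-wide HDFS usage (the first match is the cluster summary)
hdfs dfsadmin -report | grep "DFS Used%" | head -1
# Number of running applications
5
# HDFS usage
DFS Used%: 50.00%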
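The two checks above can be combined into a simple go/no-go gate before maintenance starts. This is a minimal sketch: the function names are hypothetical, the 02:00 to 06:00 window comes from the tip above, and the limit of 10 running applications is an assumption chosen only for illustration.

```shell
# Hypothetical helper: decide whether maintenance may start.
# $1 = hour of day (00-23), $2 = running applications, $3 = allowed maximum.
# The 02:00-06:00 window comes from the text; the application limit is an assumption.
can_start_maintenance() {
    local hour running max_running
    hour="${1#0}"            # strip one leading zero so "03" compares as 3
    hour="${hour:-0}"
    running="$2"
    max_running="$3"
    if [ "$hour" -lt 2 ] || [ "$hour" -ge 6 ]; then
        echo "outside maintenance window"
        return 1
    fi
    if [ "$running" -gt "$max_running" ]; then
        echo "too many running applications: $running"
        return 1
    fi
    echo "ok"
}

# Count application rows in `yarn application -list` output read from stdin.
count_running_apps() {
    grep -c '^application_' || true
}

# Intended use on a cluster node (not executed here):
#   running=$(yarn application -list -appStates RUNNING 2>/dev/null | count_running_apps)
#   can_start_maintenance "$(date +%H)" "$running" 10
can_start_maintenance 03 5 10    # prints: ok
```

The function returns a non-zero status when maintenance should not start, so a maintenance script can abort early with `can_start_maintenance ... || exit 1`.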
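Part03-Production Environment Implementation
3.1 HDFS Inspection
3.1.1 NameNode Inspection
# Check safe mode and the cluster summary
hdfs dfsadmin -safemode get
hdfs dfsadmin -report | head -20
# Check NameNode heap and GC (proc_namenode avoids matching the SecondaryNameNode)
jstat -gcutil $(pgrep -f proc_namenode) 1 1
# Safe mode status
Safe mode is OFF
# NameNode report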
Configured Capacity: 131941395087360 (120 TB)
Present Capacity: 125244325333120 (113.87 TB)
DFS Remaining: 59273548779520 (53.93 TB)
DFS Used: 65970776553600 (60 TB)
DFS Used%: 52.70%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
# GC status
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
0.00 75.00 50.00 30.00 93.75 93.75 100 5.000 5 2.500 7.500
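To turn the GC line above into an automated check, one option is to look up the old-generation column (O) by its header name and compare it with a limit. The sketch below is one way to do that; the function names `gc_column` and `exceeds` and the 80% limit are assumptions for illustration, not recommendations from the original text.

```shell
# Print the value of a named column from `jstat -gcutil` output read from stdin.
# Looking the column up by header name keeps this working if the column set differs.
gc_column() {
    awk -v col="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
        NR == 2 && idx { print $idx }
    '
}

# Succeed when the first number is greater than the second (floating point).
exceeds() {
    awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value + 0 > limit + 0) }'
}

# Sample taken from the jstat output shown above.
sample='S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
0.00 75.00 50.00 30.00 93.75 93.75 100 5.000 5 2.500 7.500'

old_pct=$(printf '%s\n' "$sample" | gc_column O)
if exceeds "$old_pct" 80; then
    echo "WARN: NameNode old generation at ${old_pct}%"
else
    echo "OK: NameNode old generation at ${old_pct}%"
fi
```

On a live node the sample would be replaced by the real `jstat -gcutil` output piped into `gc_column`; the same lookup works for FGC or FGCT when tracking full GC activity between inspections.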
3.1.2 DataNode Inspection
# Check live DataNodes
hdfs dfsadmin -report -live | grep -E "Datanode|State|Capacity|Used|Remaining"
# Check block health
hdfs fsck / -files -blocks | tail -10
# DataNode status
Datanode: fgedu01:9866
State: In Service
Capacity: 21990232555520 B (20 TB)
Used: 10995116277760 B (10 TB)
Remaining: 10995116277760 B (10 TB)
Datanode: fgedu02:9866
State: In Service
Capacity: 21990232555520 B (20 TB)
Used: 10995116277760 B (10 TB)
Remaining: 10995116277760 B (10 TB)
# Block health status
Total size: 65970697543680 B
Total blocks: 1000000
Minimally replicated blocks: 1000000
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 3
Average block replication: 3.0
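The fsck summary above lends itself to a scripted pass/fail check. The following sketch reads a summary from stdin and warns when under-replicated or mis-replicated blocks are reported. The function names are hypothetical, and the parser assumes the `key: value` layout shown in the sample output.

```shell
# Print the first number after "<key>:" in an fsck summary read from stdin.
fsck_value() {
    awk -F: -v key="$1" '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim surrounding whitespace
        }
        name == key { split($2, parts, " "); print parts[1]; exit }
    '
}

# Print OK only when no under- or mis-replicated blocks are listed.
fsck_replication_ok() {
    local report key value
    report=$(cat)
    for key in "Under-replicated blocks" "Mis-replicated blocks"; do
        value=$(printf '%s\n' "$report" | fsck_value "$key")
        if [ "${value:-0}" -ne 0 ]; then
            echo "WARN: $key = $value"
            return 1
        fi
    done
    echo "OK"
}

# Sample taken from the block health output shown above.
fsck_replication_ok <<'EOF'
Total size: 65970697543680 B
Total blocks: 1000000
Minimally replicated blocks: 1000000
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 3
Average block replication: 3.0
EOF
```

On a cluster node the heredoc would be replaced by `hdfs fsck / | fsck_replication_ok`.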
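3.2 YARN Inspection
3.2.1 ResourceManager Inspection
# Check ResourceManager HA state
yarn rmadmin -getServiceState rm1
yarn rmadmin -getAllServiceState
# View cluster resources
yarn node -list -showDetails
# State of rm1
active
# State of all ResourceManagers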
rm1: active
rm2: standby
# Node details
Total Nodes: 6
Node-Id State Memory-Used Memory-Avail VCores-Used VCores-Avail
fgedu01:8041 RUNNING 16384 MB 65536 MB 8 24
fgedu02:8041 RUNNING 16384 MB 65536 MB 8 24
fgedu03:8041 RUNNING 16384 MB 65536 MB 8 24
fgedu04:8041 RUNNING 16384 MB 65536 MB 8 24
fgedu05:8041 RUNNING 16384 MB 65536 MB 8 24
fgedu06:8041 RUNNING 16384 MB 65536 MB 8 24
3.2.2 NodeManager Inspection
# List NodeManagers by state
yarn node -list -states RUNNING,UNHEALTHY
# View application states
yarn application -list -appStates RUNNING,ACCEPTED
# NodeManager status
Total Nodes:6
Node-Id State
fgedu01:8041 RUNNING
fgedu02:8041 RUNNING
fgedu03:8041 RUNNING
fgedu04:8041 RUNNING
fgedu05:8041 RUNNING
fgedu06:8041 RUNNING
# Application status
Total Applications:5
Application-Id Application-Name User State
app_1705473600001 etl-job fgedu RUNNING
app_1705473600002 streaming-job fgedu RUNNING
app_1705473600003 hive-query fgedu RUNNING
app_1705473600004 spark-sql fgedu RUNNING
app_1705473600005 flink-job fgedu RUNNING
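For a larger cluster, reading the node list by eye does not scale, so a per-state count is easier to compare against the expected node total. The sketch below counts nodes by state from a node list on stdin. The function name `count_nodes_by_state` is hypothetical, and it assumes that node rows start with a `host:port` Node-Id followed by the state, as in the sample output above.

```shell
# Count NodeManagers per state from `yarn node -list` output read from stdin.
# Only rows whose first field looks like host:port are treated as node rows.
count_nodes_by_state() {
    awk '$1 ~ /:[0-9]+$/ { count[$2]++ }
         END { for (state in count) print state, count[state] }' | LC_ALL=C sort
}

# Sample taken from the NodeManager list shown above.
count_nodes_by_state <<'EOF'
Total Nodes:6
Node-Id State
fgedu01:8041 RUNNING
fgedu02:8041 RUNNING
fgedu03:8041 RUNNING
fgedu04:8041 RUNNING
fgedu05:8041 RUNNING
fgedu06:8041 RUNNING
EOF
```

For the sample this prints `RUNNING 6`; any additional line, such as an UNHEALTHY or LOST count, is a signal to investigate.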
3.3 System Resource Inspection
3.3.1 CPU and Memory Inspection
# CPU load
top -bn1 | head -5
# Memory usage
free -h
# Per-process resource usage
ps aux --sort=-%mem | head -10
# CPU status
top - 10:00:00 up 30 days, 2:00, 1 user, load average: 2.50, 2.00, 1.50
Tasks: 200 total, 5 running, 195 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.3 us, 5.0 sy, 0.0 ni, 79.7 id, 0.0 wa, 0.0 hi, 0.0 si
# Memory status
total used free shared buff/cache available
Mem: 125G 60G 10G 1.0G 55G 60G
Swap: 16G 0B 16G
# Process memory
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
hdfs 12345 50.0 20.0 25g 25g ? Sl Jan01 100:00 /usr/lib/jvm/java/bin/java -Dproc_namenode
yarn 12346 30.0 10.0 12g 12g ? Sl Jan01 60:00 /usr/lib/jvm/java/bin/java -Dproc_resourcemanager
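3.3.2 Disk and Network Inspection
# Disk space
df -h | grep -E "Filesystem|/data|/bigdata"
# Disk I/O
iostat -x 1 3 | grep -E "Device|sda|sdb"
# Network (include Iface so the column header line is kept)
netstat -i | grep -E "Kernel|Iface|eth0"
# Disk space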
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 20T 10T 10T 50% /bigdata
/dev/sdb1 20T 10T 10T 50% /bigdata/fgdata
# Disk I/O
Device rrqm/s wrqm/s r/s w/s rMB/s wMB/s %util
sda 0.00 10.00 50.00 100.00 5.00 10.00 85.00
sdb 0.00 5.00 30.00 50.00 3.00 5.00 60.00
# Network status
Kernel Interface table
Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 10000000 0 0 0 5000000 0 0 0 BMRU
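The disk space output above can be checked against a usage limit automatically. The sketch below prints every mount point whose Use% is above a given limit. The function name `disks_over` and the 80% limit are assumptions for illustration, and the first input line is assumed to be the `df` header.

```shell
# Print "mountpoint use%" for every filesystem above the limit.
# Reads `df -h` or `df -P` output (header line included) from stdin.
disks_over() {
    awk -v limit="$1" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)                  # "85%" -> "85"
            if (use + 0 > limit + 0) print $6, use "%"
        }
    '
}

# Sample taken from the df output shown above; both disks are at 50%,
# so nothing is printed for a limit of 80.
disks_over 80 <<'EOF'
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 20T 10T 10T 50% /bigdata
/dev/sdb1 20T 10T 10T 50% /bigdata/fgdata
EOF
```

An empty result means every filesystem is under the limit, which makes the function easy to use in a condition such as `[ -n "$(df -P | disks_over 80)" ]`.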
3.4 Log Inspection and Analysis
3.4.1 NameNode Log Inspection
# NameNode error log check
grep -E "ERROR|FATAL" /bigdata/app/hadoop/logs/hadoop-hdfs-namenode-fgedu01.log | tail -20
# NameNode warning log check
grep "WARN" /bigdata/app/hadoop/logs/hadoop-hdfs-namenode-fgedu01.log | tail -20
# Error log
2024-01-17 10:00:00,000 ERROR namenode.NameNode: Failed to load image from /bigdata/fgdata/namenode/current
2024-01-17 10:00:05,000 FATAL namenode.NameNode: Exiting due to unrecoverable error
# Warning log
2024-01-17 10:00:00,000 WARN namenode.NameNode: Slow block report from DataNode fgedu03:9866
2024-01-17 10:00:05,000 WARN hdfs.server.blockmanagement.BlockManager: Block blk_1073741825 has corrupt replicas
3.4.2 ResourceManager Log Inspection
# ResourceManager error log check
grep -E "ERROR|FATAL" /bigdata/app/hadoop/logs/yarn-resourcemanager-fgedu01.log | tail -20
# ResourceManager warning log check
grep "WARN" /bigdata/app/hadoop/logs/yarn-resourcemanager-fgedu01.log | tail -20
# Error log
2024-01-17 10:00:00,000 ERROR resourcemanager.ResourceManager: Failed to recover applications
2024-01-17 10:00:05,000 FATAL resourcemanager.ResourceManager: Exiting due to unrecoverable error
# Warning log
2024-01-17 10:00:00,000 WARN resourcemanager.RMAppManager: Application application_1705473600000_0001 failed
2024-01-17 10:00:05,000 WARN resourcemanager.ResourceManager: NodeManager fgedu05:8041 disconnected
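When a log holds many entries, `tail -20` shows only the most recent ones. A count per level and logger gives a quicker overview of what dominates. The sketch below is one way to build such a summary; the function name `summarize_log` is hypothetical, and it assumes the `date time LEVEL logger: message` layout of the sample lines above.

```shell
# Summarize WARN/ERROR/FATAL lines as "count LEVEL logger", most frequent first.
# Expects lines shaped like: 2024-01-17 10:00:00,000 ERROR namenode.NameNode: message
summarize_log() {
    awk '$3 ~ /^(WARN|ERROR|FATAL)$/ {
             sub(/:$/, "", $4)                 # drop the colon after the logger name
             n[$3 " " $4]++
         }
         END { for (k in n) print n[k], k }' | LC_ALL=C sort -k1,1nr -k2
}

# Sample taken from the NameNode log lines shown above.
summarize_log <<'EOF'
2024-01-17 10:00:00,000 ERROR namenode.NameNode: Failed to load image from /bigdata/fgdata/namenode/current
2024-01-17 10:00:05,000 FATAL namenode.NameNode: Exiting due to unrecoverable error
2024-01-17 10:00:00,000 WARN namenode.NameNode: Slow block report from DataNode fgedu03:9866
2024-01-17 10:00:05,000 WARN hdfs.server.blockmanagement.BlockManager: Block blk_1073741825 has corrupt replicas
EOF
```

On a cluster node the heredoc would be replaced by the log file itself, for example `summarize_log < /bigdata/app/hadoop/logs/hadoop-hdfs-namenode-fgedu01.log`.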
Part04-Production Cases and Hands-on Walkthroughs
4.1 Automated Inspection Script Case
An automated inspection script makes the daily checks repeatable and reduces manual effort.
#!/bin/bash
# daily_check.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
# Daily inspection script for the Hadoop cluster
LOG_FILE="/bigdata/logs/daily_check_$(date +%Y%m%d).log"
ALERT_EMAIL="admin@fgedu.net.cn"
echo "=== Hadoop Cluster Daily Check ===" > "${LOG_FILE}"
echo "Date: $(date)" >> "${LOG_FILE}"
# 1. HDFS check
echo "=== HDFS Check ===" >> "${LOG_FILE}"
# Run the report once and reuse it
HDFS_REPORT=$(hdfs dfsadmin -report)
# Section headers look like "Live datanodes (6):", so extract the number
LIVE_NODES=$(echo "${HDFS_REPORT}" | grep "Live datanodes" | grep -oE "[0-9]+")
DEAD_NODES=$(echo "${HDFS_REPORT}" | grep "Dead datanodes" | grep -oE "[0-9]+")
# Match only the cluster-wide "Missing blocks:" line
MISSING_BLOCKS=$(echo "${HDFS_REPORT}" | grep -m1 "^Missing blocks:" | awk '{print $3}')
echo "Live Nodes: ${LIVE_NODES:-0}" >> "${LOG_FILE}"
echo "Dead Nodes: ${DEAD_NODES:-0}" >> "${LOG_FILE}"
echo "Missing Blocks: ${MISSING_BLOCKS:-0}" >> "${LOG_FILE}"
# Default to 0 when a value is absent so the numeric tests do not fail
if [ "${DEAD_NODES:-0}" -gt 0 ] || [ "${MISSING_BLOCKS:-0}" -gt 0 ]; then
echo "ALERT: HDFS has issues!" >> "${LOG_FILE}"
mail -s "HDFS Alert" "${ALERT_EMAIL}" < "${LOG_FILE}"
fi
# 2. YARN check
echo "=== YARN Check ===" >> "${LOG_FILE}"
RUNNING_NODES=$(yarn node -list | grep -c "RUNNING")
UNHEALTHY_NODES=$(yarn node -list -states UNHEALTHY | grep -c "UNHEALTHY")
echo "Running Nodes: ${RUNNING_NODES}" >> "${LOG_FILE}"
echo "Unhealthy Nodes: ${UNHEALTHY_NODES}" >> "${LOG_FILE}"
if [ "${UNHEALTHY_NODES:-0}" -gt 0 ]; then
echo "ALERT: YARN has unhealthy nodes!" >> "${LOG_FILE}"
mail -s "YARN Alert" "${ALERT_EMAIL}" < "${LOG_FILE}"
fi
# 3. System resource check
echo "=== System Resource Check ===" >> "${LOG_FILE}"
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}')
MEM_USAGE=$(free | grep Mem | awk '{printf "%.1f", $3/$2 * 100.0}')
DISK_USAGE=$(df -h /bigdata | tail -1 | awk '{print $5}')
echo "CPU Usage: ${CPU_USAGE}%" >> "${LOG_FILE}"
echo "Memory Usage: ${MEM_USAGE}%" >> "${LOG_FILE}"
echo "Disk Usage: ${DISK_USAGE}" >> "${LOG_FILE}"
echo "=== Check Completed ===" >> "${LOG_FILE}"
# Run the script
./daily_check.sh
# Inspection report
=== Hadoop Cluster Daily Check ===
Date: Wed Jan 17 10:00:00 CST 2024
=== HDFS Check ===
Live Nodes: 6
Dead Nodes: 0
Missing Blocks: 0
=== YARN Check ===
Running Nodes: 6
Unhealthy Nodes: 0
=== System Resource Check ===
CPU Usage: 15.3%
Memory Usage: 48.0%
Disk Usage: 50%
=== Check Completed ===
4.2 Incident Handling Case
Incidents need to be located and resolved quickly. In this case the DataNode on fgedu03 is reported as Dead.
# 1. Check DataNode status
hdfs dfsadmin -report | grep -A5 "Datanode: fgedu03"
# 2. Check the DataNode log (the log file lives on fgedu03)
ssh fgedu03 "tail -100 /bigdata/app/hadoop/logs/hadoop-hdfs-datanode-fgedu03.log"
# 3. Restart the DataNode
ssh fgedu03 "hdfs --daemon stop datanode"
ssh fgedu03 "hdfs --daemon start datanode"
# 4. Verify recovery
hdfs dfsadmin -report | grep -A5 "Datanode: fgedu03"
# DataNode status
Datanode: fgedu03:9866
State: Dead
# DataNode log
2024-01-17 10:00:00,000 ERROR datanode.DataNode: Received fatal signal 15
2024-01-17 10:00:05,000 INFO datanode.DataNode: Shutting down
# Restart the DataNode
stopping datanode
starting datanode, logging to /bigdata/app/hadoop/logs/hadoop-hdfs-datanode-fgedu03.log
# Verify recovery
Datanode: fgedu03:9866
State: In Service
Capacity: 21990232555520 B (20 TB)
Used: 10995116277760 B (10 TB)
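4.3 Routine Maintenance Case
4.3.1 Cluster Balancing
# Check data distribution before balancing
hdfs dfsadmin -report | grep -E "Capacity|Used|Remaining"
# Run the balancer
hdfs balancer -threshold 5
# Verify the result
hdfs dfsadmin -report | grep -E "Capacity|Used|Remaining"
# State before balancing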
Capacity: 21990232555520 B
Used: 13194139508736 B (60%)
Capacity: 21990232555520 B
Used: 8796093011968 B (40%)
# Balancer run
24/01/17 11:00:00 INFO balancer.Balancer: Starting balancer
24/01/17 11:30:00 INFO balancer.Balancer: Balancing completed
# State after balancing
Capacity: 21990232555520 B
Used: 10995116277760 B (50%)
Capacity: 21990232555520 B
Used: 10995116277760 B (50%)
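The balancer considers the cluster balanced when each DataNode's utilization differs from the overall cluster utilization by no more than the threshold, in percentage points. The sketch below applies that rule to a list of per-node figures to decide whether a balancer run is worthwhile. The function name `needs_balancing`, the `name used capacity` input layout, and the node names `dn1` and `dn2` are hypothetical; the byte counts are taken from the output above.

```shell
# Print "yes" when any node's utilization deviates from the cluster-wide
# utilization by more than the threshold (percentage points), otherwise "no".
# Input on stdin: one line per DataNode, "name used_bytes capacity_bytes".
needs_balancing() {
    awk -v threshold="$1" '
        { used[$1] = $2; cap[$1] = $3; total_used += $2; total_cap += $3 }
        END {
            if (total_cap == 0) exit 2         # no usable input
            cluster = 100 * total_used / total_cap
            for (node in used) {
                diff = 100 * used[node] / cap[node] - cluster
                if (diff < 0) diff = -diff
                if (diff > threshold + 0) over++
            }
            print (over > 0 ? "yes" : "no")
        }
    '
}

# Before balancing: roughly 60% and 40% used, threshold 5.
needs_balancing 5 <<'EOF'
dn1 13194139508736 21990232555520
dn2 8796093011968 21990232555520
EOF
```

For the figures before balancing this prints `yes`; with both nodes at 10995116277760 bytes used it prints `no`, which matches the state after the balancer run.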
4.3.2 Log Cleanup
#!/bin/bash
# log_cleanup.sh
# Log cleanup script
LOG_DIR="/bigdata/app/hadoop/logs"
RETENTION_DAYS=30
# Remove rotated logs older than the retention period
find "${LOG_DIR}" -type f -name "*.log.*" -mtime +${RETENTION_DAYS} -delete
find "${LOG_DIR}" -type f -name "*.out.*" -mtime +${RETENTION_DAYS} -delete
# Do not delete edits_* files under /bigdata/fgdata/namenode/current by hand.
# They are the NameNode edit log (metadata), not audit logs, and the NameNode
# purges old segments itself after checkpoints.
echo "Log cleanup completed at $(date)"
# Run the script
./log_cleanup.sh
Log cleanup completed at Wed Jan 17 12:00:00 CST 2024
# Cleanup result
# Before cleanup: logs used 50GB
# After cleanup: logs used 10GB
# Space saved: 40GB
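Because `find ... -delete` is irreversible, it can help to list the files first and delete only after reviewing the list. The sketch below wraps the cleanup in a function that defaults to a dry run. The function name `cleanup_logs` and the `delete` mode argument are hypothetical; the file patterns are the ones used in the script above.

```shell
# List (default) or delete rotated log files older than a number of days.
# $1 = log directory, $2 = retention in days, $3 = "delete" to actually remove files.
cleanup_logs() {
    local dir days mode
    dir="$1"
    days="$2"
    mode="${3:-dry-run}"
    if [ "$mode" = "delete" ]; then
        find "$dir" -type f \( -name '*.log.*' -o -name '*.out.*' \) -mtime +"$days" -delete
    else
        find "$dir" -type f \( -name '*.log.*' -o -name '*.out.*' \) -mtime +"$days" -print
    fi
}

# Intended use (not executed here):
#   cleanup_logs /bigdata/app/hadoop/logs 30           # review the list first
#   cleanup_logs /bigdata/app/hadoop/logs 30 delete    # then remove the files
```

Only names matching the two rotated-log patterns are touched, so files such as the active `.log` file or anything outside the log directory are left alone.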
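Part05-Lessons Learned from 风哥
5.1 Inspection and Maintenance Best Practices
In production, keep the following points in mind for inspection and maintenance:
1. Establish a standardized inspection process
2. Use automated inspection scripts
3. Build a knowledge base of resolved problems
4. Run maintenance drills regularly
5. Archive inspection records
5.2 Operations Lessons Learned
5.2.1 Operations Recommendations
- Take backups before any operation
- Choose a suitable maintenance window
- Record every operation
- Verify the result
- Prepare an emergency response plan
5.2.2 Inspection Report Template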
#!/bin/bash
# inspection_report.sh
REPORT_FILE="/bigdata/reports/inspection_$(date +%Y%m%d).html"
cat > "${REPORT_FILE}" << 'EOF'
<html>
<head><title>Hadoop Cluster Inspection Report</title></head>
<body>
<h1>Hadoop Cluster Inspection Report</h1>
<p>Date: DATE_PLACEHOLDER</p>
<h2>1. HDFS Status</h2>
<p>Live nodes: LIVE_NODES</p>
<p>Storage usage: STORAGE_USAGE</p>
<h2>2. YARN Status</h2>
<p>Running nodes: RUNNING_NODES</p>
<p>Resource usage: RESOURCE_USAGE</p>
<h2>3. System Resources</h2>
<p>CPU usage: CPU_USAGE</p>
<p>Memory usage: MEM_USAGE</p>
<p>Disk usage: DISK_USAGE</p>
</body>
</html>
EOF
# Substitute placeholders; STORAGE_USAGE, RUNNING_NODES and the rest can be filled the same way
sed -i "s/DATE_PLACEHOLDER/$(date)/g" "${REPORT_FILE}"
# The section header looks like "Live datanodes (6):", so extract the number
LIVE_NODES=$(hdfs dfsadmin -report | grep "Live datanodes" | grep -oE "[0-9]+")
sed -i "s/LIVE_NODES/${LIVE_NODES:-0}/g" "${REPORT_FILE}"
echo "Report generated: ${REPORT_FILE}"
# Run the script
./inspection_report.sh
Report generated: /bigdata/reports/inspection_20240117.html
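Substituting values with a plain `sed "s/KEY/value/g"` breaks when the value contains `/`, `&`, or a backslash, for example a date printed as 2024/01/17. A small helper that escapes those characters first is sketched below. The function name `fill_placeholder` is hypothetical, and it uses `sed -i` in the same way as the script above, which assumes GNU sed.

```shell
# Replace every occurrence of a placeholder token in a file with a literal value.
# Escapes the characters that are special in a sed replacement: / & and backslash.
fill_placeholder() {
    local file key value escaped
    file="$1"
    key="$2"
    value="$3"
    escaped=$(printf '%s' "$value" | sed 's|[/&\\]|\\&|g')
    sed -i "s/${key}/${escaped}/g" "$file"
}

# Small demonstration on a temporary file.
demo=$(mktemp)
printf '<p>Date: DATE_PLACEHOLDER</p>\n' > "$demo"
fill_placeholder "$demo" DATE_PLACEHOLDER "2024/01/17"
cat "$demo"
rm -f "$demo"
```

Each placeholder in the template can then be filled with one call, for example `fill_placeholder "${REPORT_FILE}" DISK_USAGE "50%"`.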
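This article was compiled and published by 风哥教程 for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html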
