Outline
Part01-Basic Concepts and Theory
1.1 Fault Type Classification
1.2 Fault Diagnosis Methods
1.3 Fault Handling Principles
Part02-Production Environment Planning and Recommendations
2.1 Fault Prevention Planning
2.2 Emergency Response Planning
2.3 Post-Incident Review Planning
Part03-Production Implementation Plan
3.1 NameNode Fault Handling
3.2 DataNode Fault Handling
3.3 YARN Fault Handling
3.4 Network Fault Handling
Part04-Production Cases and Hands-On Walkthroughs
4.1 NameNode Crash Case
4.2 Corrupt Data Block Case
4.3 Resource Exhaustion Case
Part05-Fenge's Experience Summary and Sharing
5.1 Fault Handling Best Practices
5.2 Fault Prevention Lessons
Part01-Basic Concepts and Theory
1.1 Fault Type Classification
Hadoop cluster faults fall into several categories. Understanding the fault types helps you locate problems quickly.
1.2 Fault Diagnosis Methods
Fault diagnosis calls for a systematic approach:
1. Collect the failure symptoms
2. Review the relevant logs
3. Analyze the error messages
4. Locate the root cause
5. Formulate a fix
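The log-review step above can be sketched as a small triage helper. This is a minimal sketch, not part of the original runbook: the function name `triage_log` and the Hadoop log path in the usage comment are illustrative assumptions.

```shell
# Minimal log-triage sketch: count ERROR/WARN lines in a log file so you
# can decide which node to investigate first. Works on any text file.
triage_log() {
  # $1: path to a log file
  local errors warns
  errors=$(grep -c "ERROR" "$1" || true)
  warns=$(grep -c "WARN" "$1" || true)
  echo "errors=${errors} warns=${warns}"
}

# Example usage (assumed path from this document's layout):
# triage_log /bigdata/app/hadoop/logs/hadoop-hdfs-namenode-fgedu01.log
```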
1.3 Fault Handling Principles
Fault handling should follow a few basic principles:
- Restore service first, analyze the cause afterwards
- Preserve the scene (logs, state) for later analysis
- Record the handling process
- Verify that the fix works
- Summarize the lessons learned
Part02-Production Environment Planning and Recommendations
2.1 Fault Prevention Planning
Fault prevention is a core part of operations work:
- Build a complete monitoring system
- Configure reasonable alert thresholds
- Run regular health checks
- Keep data backups current
- Prepare emergency response plans
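The "reasonable alert thresholds" point above can be made concrete with a tiny threshold check. A minimal sketch: the 85% disk threshold and the `/bigdata` mount in the usage comment are assumed example values, not Hadoop defaults.

```shell
# Threshold-check sketch: compare a usage percentage against a limit
# and emit an alert line a monitoring system could pick up.
check_disk_threshold() {
  # $1: usage percentage (integer, without '%'); $2: threshold
  if [ "$1" -ge "$2" ]; then
    echo "ALERT: disk usage ${1}% >= ${2}%"
  else
    echo "OK: disk usage ${1}% < ${2}%"
  fi
}

# In a real check, usage would come from df, e.g.:
# usage=$(df /bigdata | tail -1 | awk '{print $5}' | tr -d '%')
# check_disk_threshold "$usage" 85
```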
2.2 Emergency Response Planning
Emergency response has to be fast and effective. Two quick status checks:
hdfs dfsadmin -report
yarn node -list
# HDFS status
Live datanodes: 6
Dead datanodes: 0
# YARN status
Total Nodes: 6
Running: 6
2.3 Post-Incident Review Planning
A post-incident review is a key way to improve operational capability. Tip from Fenge: run a review after every incident.
- Root-cause analysis
- Review of the handling process
- Definition of improvement actions
- Knowledge-base updates
Part03-Production Implementation Plan
3.1 NameNode Fault Handling
3.1.1 NameNode Fails to Start
# Check the NameNode log
tail -100 /bigdata/app/hadoop/logs/hadoop-hdfs-namenode-fgedu01.log
# Check the metadata storage directory
ls -la /bigdata/fgdata/namenode/current/
# Check port usage
netstat -tlnp | grep 9870
2024-01-17 14:00:00,000 ERROR namenode.NameNode: Failed to load image from /bigdata/fgdata/namenode/current
2024-01-17 14:00:05,000 ERROR namenode.NameNode: Cannot read file /bigdata/fgdata/namenode/current/seen_txid
# Storage directory
total 8
drwxr-xr-x 2 hdfs hdfs 4096 Jan 17 14:00 .
drwxr-xr-x 3 hdfs hdfs 4096 Jan 17 14:00 ..
# Directory is empty; the metadata is lost
# Port status
# Port 9870 is not in use
3.1.2 NameNode Recovery
# Restore the metadata from backup
cp -r /bigdata/backup/namenode/current/* /bigdata/fgdata/namenode/current/
# Or sync it from the Standby NameNode
hdfs namenode -bootstrapStandby
# Start the NameNode
hdfs --daemon start namenode
# Verify the startup status
hdfs dfsadmin -safemode get
# Restore from backup succeeded
# NameNode startup
starting namenode, logging to /bigdata/app/hadoop/logs/hadoop-hdfs-namenode-fgedu01.log
# Verify the status
Safe mode is OFF
# NameNode is back to normal
3.2 DataNode Fault Handling
3.2.1 Handling an Offline DataNode
# Check the DataNode's status
hdfs dfsadmin -report | grep -A10 "Datanode: fgedu03"
# Check the DataNode log
tail -100 /bigdata/app/hadoop/logs/hadoop-hdfs-datanode-fgedu03.log
# Check the disk status
ssh fgedu03 "df -h"
Datanode: fgedu03:9866
State: Dead
# DataNode log
2024-01-17 14:30:00,000 ERROR datanode.DataNode: Disk error on /bigdata/fgdata/datanode
2024-01-17 14:30:05,000 ERROR datanode.DataNode: No available storage directories
# Disk status
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 20T 20T 0 100% /bigdata
# The disk is full
3.2.2 DataNode Recovery
# Free disk space: remove rotated logs and temporary files
ssh fgedu03 "rm -rf /bigdata/logs/*.log.*"
ssh fgedu03 "rm -rf /bigdata/tmp/*"
# Restart the DataNode
ssh fgedu03 "hdfs --daemon stop datanode"
ssh fgedu03 "hdfs --daemon start datanode"
# Verify the recovery
hdfs dfsadmin -report | grep -A10 "Datanode: fgedu03"
# Cleanup done; 50 GB freed
# DataNode restart
stopping datanode
starting datanode, logging to /bigdata/app/hadoop/logs/hadoop-hdfs-datanode-fgedu03.log
# Verify the status
Datanode: fgedu03:9866
State: In Service
Capacity: 21990232555520 B (20 TB)
Used: 16492674331648 B (15 TB)
# DataNode is back in service
3.3 YARN Fault Handling
3.3.1 ResourceManager Fault Handling
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
# Check the RM log
tail -100 /bigdata/app/hadoop/logs/yarn-resourcemanager-fgedu01.log
# Check the ZooKeeper connection
echo stat | nc fgedu01 2181
rm1: standby
rm2: active
# RM log
2024-01-17 15:00:00,000 ERROR resourcemanager.ResourceManager: Failed to connect to ZooKeeper
2024-01-17 15:00:05,000 ERROR resourcemanager.ResourceManager: Unable to become active
# ZooKeeper status
Zookeeper version: 3.6.3
Mode: leader
# ZooKeeper is healthy
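The diagnosis above shows rm2 active and a healthy ZooKeeper, so scheduling continues; the usual fix is to restart the RM that lost its session (rm1 here) so it rejoins the ensemble. A minimal sketch of the "is any RM active?" guard follows; the hostname and the `yarn --daemon` restart commands in the usage comment are assumed Hadoop 3.x syntax, to be verified before running in production.

```shell
# Guard: before forcing any transition, confirm whether either RM is
# active. States come from `yarn rmadmin -getServiceState rm1|rm2`.
ha_has_active() {
  # $1: state of rm1; $2: state of rm2 ("active" or "standby")
  [ "$1" = "active" ] || [ "$2" = "active" ]
}

# Assumed usage (hostnames from this document):
# if ha_has_active "$(yarn rmadmin -getServiceState rm1)" \
#                  "$(yarn rmadmin -getServiceState rm2)"; then
#   # Cluster still scheduling: just restart the stuck standby
#   ssh fgedu01 "yarn --daemon stop resourcemanager && yarn --daemon start resourcemanager"
# fi
```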
3.3.2 NodeManager Fault Handling
yarn node -list -states UNHEALTHY
# Check the NodeManager log
tail -100 /bigdata/app/hadoop/logs/yarn-nodemanager-fgedu04.log
# Check the local resources
ssh fgedu04 "ls -la /bigdata/fgdata/yarn/local/"
Total Nodes:1
Node-Id State
fgedu04:8041 UNHEALTHY
# NodeManager log
2024-01-17 15:30:00,000 WARN nodemanager.NodeManager: Node health checker failed
2024-01-17 15:30:05,000 ERROR nodemanager.NodeManager: Disk space is low
# Local resources
total 0
# The local directory is empty
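For a disk-pressure UNHEALTHY node like this one, a typical remediation is to clear old rotated logs and then restart the NodeManager. A minimal sketch, assuming this document's paths and a 7-day retention (both illustrative, not Hadoop defaults):

```shell
# Cleanup sketch: delete rotated log files older than a given age.
clean_old_logs() {
  # $1: directory to clean; $2: minimum age in days
  find "$1" -type f -name "*.log.*" -mtime "+$2" -delete
}

# Typical sequence on the unhealthy node (here fgedu04; assumed layout):
# clean_old_logs /bigdata/logs 7
# yarn --daemon stop nodemanager
# yarn --daemon start nodemanager
```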
3.4 Network Fault Handling
3.4.1 Network Connectivity Faults
ping -c 3 fgedu03
# Check port connectivity
telnet fgedu03 9866
# Check the firewall
firewall-cmd --list-ports
# Check the network configuration
ip addr show eth0
PING fgedu03 (192.168.1.13) 56(84) bytes of data.
64 bytes from fgedu03: icmp_seq=1 ttl=64 time=0.5 ms
64 bytes from fgedu03: icmp_seq=2 ttl=64 time=0.5 ms
64 bytes from fgedu03: icmp_seq=3 ttl=64 time=0.5 ms
# Port connectivity
Trying 192.168.1.13...
Connected to fgedu03.
Escape character is '^]'.
# Firewall ports
9870/tcp 8088/tcp 9866/tcp 8042/tcp
# Network configuration
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
inet 192.168.1.11/24 brd 192.168.1.255 scope global eth0
3.4.2 DNS Resolution Faults
nslookup fgedu03
# Check the hosts file
cat /etc/hosts | grep fgedu
# Check the DNS configuration
cat /etc/resolv.conf
Server: 192.168.1.1
Address: 192.168.1.1#53
Name: fgedu03.fgedu.net.cn
Address: 192.168.1.13
# hosts file
192.168.1.11 fgedu01 fgedu01.fgedu.net.cn
192.168.1.12 fgedu02 fgedu02.fgedu.net.cn
192.168.1.13 fgedu03 fgedu03.fgedu.net.cn
# DNS configuration
nameserver 192.168.1.1
search fgedu.net.cn
Part04-Production Cases and Hands-On Walkthroughs
4.1 NameNode Crash Case
A NameNode crash is a severe fault that requires fast recovery.
#!/bin/bash
# namenode_failover.sh
# NameNode failover script
FAILED_NN="fgedu01"
STANDBY_NN="fgedu02"
echo "=== NameNode Failover Process ==="
echo "Failed NN: ${FAILED_NN}"
echo "Standby NN: ${STANDBY_NN}"
# 1. Check the Standby's status
echo "Checking Standby NameNode status..."
ssh ${STANDBY_NN} "hdfs dfsadmin -safemode get"
# 2. Force the transition to active (haadmin takes NameNode IDs, here nn2)
echo "Forcing transition to active..."
ssh ${STANDBY_NN} "hdfs haadmin -transitionToActive --forcemanual nn2"
# 3. Verify the failover result
echo "Verifying failover..."
ssh ${STANDBY_NN} "hdfs haadmin -getServiceState nn2"
# 4. Update the client configuration
echo "Updating client configuration..."
# Point clients at the new active NameNode address
echo "=== Failover Completed ==="
./namenode_failover.sh
=== NameNode Failover Process ===
Failed NN: fgedu01
Standby NN: fgedu02
Checking Standby NameNode status...
Safe mode is OFF
Forcing transition to active...
Successfully transitioned nn2 to active
Verifying failover...
active
=== Failover Completed ===
# Failover succeeded
4.2 Corrupt Data Block Case
Corrupt data blocks put data at risk of loss.
hdfs fsck / -list-corruptfileblocks
# Show corrupt block details
hdfs fsck / -files -blocks -locations | grep "CORRUPT"
# Attempt repair: recover the lease so HDFS can re-replicate the file
hdfs debug recoverLease -path /bigdata/warehouse/fgedu/data_202401.parquet -retries 3
# Verify the repair
hdfs fsck / -list-corruptfileblocks
The list of corrupt files is:
/bigdata/warehouse/fgedu/data_202401.parquet
/bigdata/warehouse/fgedu/data_202402.parquet
# Corrupt block details
/bigdata/warehouse/fgedu/data_202401.parquet: CORRUPT blockpool BP-12345678 block blk_1073741825
# Repair operation
Recover lease for /bigdata/warehouse/fgedu/data_202401.parquet succeeded
# Verification
The filesystem under path '/' has 0 CORRUPT files
# The corrupt blocks are repaired
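When `recoverLease` cannot help (all replicas of a block are gone), the usual last resort is `hdfs fsck <path> -delete` followed by a restore from backup, applied per file from the corrupt-file list. A minimal sketch of extracting those paths for a retry loop; the parsing assumes the output format shown above (paths on their own lines), and the pipeline in the comment is illustrative.

```shell
# Extract file paths from `hdfs fsck / -list-corruptfileblocks` output:
# keep only the lines that are absolute paths.
extract_corrupt_paths() {
  grep '^/'
}

# Example pipeline (do NOT run fsck -delete without a restore plan):
# hdfs fsck / -list-corruptfileblocks | extract_corrupt_paths | \
#   while read -r f; do hdfs debug recoverLease -path "$f" -retries 3; done
```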
4.3 Resource Exhaustion Case
4.3.1 Memory Exhaustion
free -h
# Check per-process memory usage
ps aux --sort=-%mem | head -10
# Check the OOM log
grep "Out of memory" /var/log/messages | tail -10
# Drop the page cache
sync && echo 3 > /proc/sys/vm/drop_caches
total used free shared buff/cache available
Mem: 125G 120G 500M 1.0G 5G 2G
Swap: 16G 15G 1G
# Process memory
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
hdfs 12345 50.0 80.0 100g 100g ? Sl Jan01 100:00 /usr/lib/jvm/java/bin/java -Dproc_namenode
# OOM log
Jan 17 16:00:00 fgedu01 kernel: Out of memory: Kill process 12345 (java) score 800 or sacrifice child
# After dropping caches
total used free shared buff/cache available
Mem: 125G 60G 60G 1.0G 5G 60G
4.3.2 Disk Space Exhaustion
df -h
# Find large files
find /bigdata -type f -size +1G -exec ls -lh {} \; | head -20
# Clean up rotated log files older than 7 days
find /bigdata/logs -name "*.log.*" -mtime +7 -delete
# Clean up temporary files
rm -rf /bigdata/tmp/*
# Verify the cleanup
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 20T 20T 0 100% /bigdata
# Large files
-rw-r--r-- 1 hdfs hdfs 5.0G Jan 17 16:00 /bigdata/logs/hadoop-hdfs-namenode-fgedu01.log
-rw-r--r-- 1 hdfs hdfs 3.0G Jan 17 16:00 /bigdata/logs/yarn-resourcemanager-fgedu01.log
# Log cleanup
# Done; 10 GB freed
# Temp file cleanup
# Done; 5 GB freed
# Result
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 20T 15T 5.0T 75% /bigdata
Part05-Fenge's Experience Summary and Sharing
5.1 Fault Handling Best Practices
In real production environments, pay attention to the following points when handling faults:
1. Build a complete monitoring and alerting system
2. Prepare detailed emergency response plans
3. Run fault drills regularly
4. Keep data backups current
5. Maintain a fault knowledge base
5.2 Fault Prevention Lessons
5.2.1 Prevention Recommendations
- Inspect the cluster status regularly
- Monitor the key metrics
- Handle alerts promptly
- Plan capacity ahead of demand
- Rehearse the emergency plans regularly
5.2.2 Fault Diagnosis Script
#!/bin/bash
# fault_diagnosis.sh
echo "=== Hadoop Fault Diagnosis ==="
echo "Date: $(date)"
# 1. HDFS diagnosis
echo "=== HDFS Diagnosis ==="
LIVE=$(hdfs dfsadmin -report | grep "Live datanodes" | awk '{print $3}')
DEAD=$(hdfs dfsadmin -report | grep "Dead datanodes" | awk '{print $3}')
MISSING=$(hdfs dfsadmin -report | grep "Missing blocks" | awk '{print $3}')
CORRUPT=$(hdfs fsck / -list-corruptfileblocks 2>/dev/null | grep "CORRUPT" | wc -l)
echo "Live Nodes: ${LIVE}"
echo "Dead Nodes: ${DEAD}"
echo "Missing Blocks: ${MISSING}"
echo "Corrupt Files: ${CORRUPT}"
# 2. YARN diagnosis
echo "=== YARN Diagnosis ==="
RUNNING=$(yarn node -list | grep "RUNNING" | wc -l)
UNHEALTHY=$(yarn node -list -states UNHEALTHY | grep "UNHEALTHY" | wc -l)
echo "Running Nodes: ${RUNNING}"
echo "Unhealthy Nodes: ${UNHEALTHY}"
# 3. System resource diagnosis
echo "=== System Resource Diagnosis ==="
CPU=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}')
MEM=$(free | grep Mem | awk '{printf "%.1f", $3/$2 * 100.0}')
DISK=$(df -h /bigdata | tail -1 | awk '{print $5}')
echo "CPU Usage: ${CPU}%"
echo "Memory Usage: ${MEM}%"
echo "Disk Usage: ${DISK}"
# 4. Fault assessment (default empty values to 0 so the tests don't break)
if [ "${DEAD:-0}" -gt 0 ] || [ "${MISSING:-0}" -gt 0 ] || [ "${CORRUPT:-0}" -gt 0 ]; then
  echo "ALERT: HDFS has issues!"
fi
if [ "${UNHEALTHY:-0}" -gt 0 ]; then
  echo "ALERT: YARN has unhealthy nodes!"
fi
./fault_diagnosis.sh
=== Hadoop Fault Diagnosis ===
Date: Wed Jan 17 17:00:00 CST 2024
=== HDFS Diagnosis ===
Live Nodes: 6
Dead Nodes: 0
Missing Blocks: 0
Corrupt Files: 0
=== YARN Diagnosis ===
Running Nodes: 6
Unhealthy Nodes: 0
=== System Resource Diagnosis ===
CPU Usage: 15.3%
Memory Usage: 48.0%
Disk Usage: 50%
# Cluster status is normal
Compiled and published by Fenge Tutorials for learning and testing use only; credit the source when reposting: http://www.fgedu.net.cn/10327.html
