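This article walks through automatic restart of failed Hadoop services in practice, covering systemd auto-restart, script-based auto-restart, and Monit monitoring. It is aimed at big data operations engineers.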
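Part01-Basic Concepts and Theory
1.1 Auto-Restart Overview
Auto-restart means that when a service stops abnormally, the system starts it again without operator action. Its benefits:
- Higher service availability
- Less manual intervention
- Faster service recovery
- Fewer business interruptions
1.2 Common Tools
Common tools:
1. systemd
– Ships with mainstream Linux distributions
– Simple to configure
– Powerful
– Recommended
2. Custom scripts
– Flexible
– Customizable
– You have to write and maintain them yourself
3. Monit
– Lightweight
– Friendly interface
– Feature-rich
4. Supervisor
– Written in Python
– Process management
– Web interface
1.3 Implementation Principle
Implementation principle: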
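Part02-Production Environment Planning and Recommendations
2.1 Restart Strategies
Restart strategies:
1. Always restart
– Restart whenever the service stops
– Suits stateless services
– Example: DataNode
2. Restart on failure
– Restart after an abnormal exit
– Do not restart after a normal exit
– More selective than always restarting
3. Limit the number of restarts
– Avoids restart loops
– Example: at most 3 restarts within 5 minutes
– Stop restarting once the limit is exceeded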
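The first two strategies differ only in the decision taken when the process exits. The sketch below makes that decision explicit. `should_restart` is a hypothetical helper written for illustration, not part of any tool; its policy names `always` and `on-failure` mirror the systemd `Restart=` values used later in this article. It is simplified: it looks only at the exit status, whereas real supervisors also consider signals and timeouts.

```shell
# should_restart POLICY EXIT_STATUS
# Prints "restart" or "leave-stopped". Hypothetical helper, illustration only.
should_restart() {
    policy=$1
    status=$2
    case "$policy" in
        always)
            # Strategy 1: restart no matter how the service ended.
            echo restart
            ;;
        on-failure)
            # Strategy 2: restart only after a non-zero (abnormal) exit.
            if [ "$status" -ne 0 ]; then
                echo restart
            else
                echo leave-stopped
            fi
            ;;
        *)
            echo "unknown policy: $policy" >&2
            return 2
            ;;
    esac
}
```

For example, `should_restart on-failure 0` prints `leave-stopped`, while `should_restart on-failure 137` prints `restart`. The restart limit (strategy 3) is a separate check applied on top of either policy.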
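2.2 Limits
Limits:
- Restart frequency: avoid restarting too often
- Restart count: cap the total number of restarts
- Time window: count restarts within a fixed period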
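The three limits combine into one rule: allow a restart only if fewer than N restarts happened in the last T seconds. A minimal sketch of such a sliding-window limiter follows. The function name `allow_restart` and the state-file format (one epoch timestamp per line) are assumptions made for this example.

```shell
# allow_restart STATE_FILE MAX_RESTARTS WINDOW_SECONDS [NOW]
# Returns 0 and records the restart if fewer than MAX_RESTARTS were recorded
# in the last WINDOW_SECONDS; returns 1 otherwise.
# NOW defaults to the current epoch time and can be overridden for testing.
allow_restart() {
    state_file=$1
    max=$2
    window=$3
    now=${4:-$(date +%s)}
    cutoff=$((now - window))
    touch "$state_file"
    # Drop timestamps that have fallen out of the window.
    awk -v c="$cutoff" '$1 > c' "$state_file" > "$state_file.tmp"
    mv "$state_file.tmp" "$state_file"
    recent=$(awk 'END { print NR }' "$state_file")
    if [ "$recent" -ge "$max" ]; then
        return 1    # limit reached: do not restart again
    fi
    echo "$now" >> "$state_file"
    return 0
}
```

With `allow_restart /tmp/datanode.restarts 3 300`, the fourth call inside any 5-minute window fails, which is the point at which a script would stop restarting and raise an alert instead.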
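2.3 Alerting
Alerting:
1. Service-stopped alert
– Raised when a service stops
– Notifies the operations team
2. Auto-restart alert
– Raised when a service is restarted automatically
– Records the restart event
3. Frequent-restart alert
– Multiple restarts within a short time
– Critical severity
4. Alert channels
– Email
– SMS
– WeChat
– Phone call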
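The alert types above differ mainly in severity. The sketch below classifies an event by how many restarts have occurred in the current window and appends a formatted line to a file. Everything here is an assumption for illustration: the names `classify_restarts` and `send_alert`, and the `ALERT_LOG` path. Delivery by email, SMS, WeChat, or phone is deliberately left to whatever agent reads that file.

```shell
# ALERT_LOG is a hypothetical path; point it at whatever your alerting agent tails.
ALERT_LOG=${ALERT_LOG:-/tmp/hadoop_alerts.log}

# classify_restarts COUNT THRESHOLD
# Prints CRITICAL once COUNT restarts reach THRESHOLD, otherwise WARNING.
classify_restarts() {
    if [ "$1" -ge "$2" ]; then
        echo CRITICAL
    else
        echo WARNING
    fi
}

# send_alert LEVEL SERVICE MESSAGE
# Appends "<timestamp> [LEVEL] SERVICE: MESSAGE" to ALERT_LOG.
send_alert() {
    printf '%s [%s] %s: %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$2" "$3" >> "$ALERT_LOG"
}
```

A monitoring script could call `send_alert "$(classify_restarts 3 3)" datanode "3 restarts in 5 minutes"` so that a restart loop is logged at a higher severity than a single restart.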
Part03-Production Implementation Plan
3.1 systemd Auto-Restart
3.1.1 systemd Service Configuration
# 1. Create the service file
cat > /etc/systemd/system/hadoop-datanode.service << 'EOF'
[Unit]
Description=Hadoop DataNode
After=network.target

[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=HADOOP_HOME=/bigdata/app/hadoop
Environment=JAVA_HOME=/bigdata/app/jdk
ExecStart=/bigdata/app/hadoop/sbin/hadoop-daemon.sh start datanode
ExecStop=/bigdata/app/hadoop/sbin/hadoop-daemon.sh stop datanode
Restart=on-failure
RestartSec=10s
StartLimitInterval=5min
StartLimitBurst=3

[Install]
WantedBy=multi-user.target
EOF

# 2. Reload systemd
systemctl daemon-reload

# 3. Start the service
systemctl start hadoop-datanode

# 4. Enable start at boot
systemctl enable hadoop-datanode

# 5. Check status
systemctl status hadoop-datanode

# 6. Follow the logs
journalctl -u hadoop-datanode -f
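After `daemon-reload` it is worth confirming which restart settings systemd actually loaded, since a typo in a directive name tends to be ignored with only a log warning. `systemctl show UNIT -p PROPERTY` prints `Key=Value` lines. The small helper below (a hypothetical name, `unit_prop`) extracts one value from that output so it can be compared in a script. Property names can differ between systemd versions, so treat the ones in the usage note as assumptions to verify on your host.

```shell
# unit_prop KEY
# Reads "Key=Value" lines (the format printed by `systemctl show`) on stdin
# and prints the value for KEY. Returns 1 if KEY is not present.
unit_prop() {
    awk -v k="$1" -F= '
        $1 == k { print substr($0, length(k) + 2); found = 1 }
        END     { exit !found }
    '
}
```

Usage: `systemctl show hadoop-datanode -p Restart -p RestartUSec | unit_prop Restart` should print `on-failure` for the unit defined above.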
3.1.2 systemd Units for Other Services
# NameNode
# (NameNode is stateful; section 5.2 advises against auto-restart for stateful
# services, so review Restart= before using this unit in production.)
cat > /etc/systemd/system/hadoop-namenode.service << 'EOF'
[Unit]
Description=Hadoop NameNode
After=network.target

[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=HADOOP_HOME=/bigdata/app/hadoop
Environment=JAVA_HOME=/bigdata/app/jdk
ExecStart=/bigdata/app/hadoop/sbin/hadoop-daemon.sh start namenode
ExecStop=/bigdata/app/hadoop/sbin/hadoop-daemon.sh stop namenode
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target
EOF

# ResourceManager
cat > /etc/systemd/system/hadoop-resourcemanager.service << 'EOF'
[Unit]
Description=Hadoop ResourceManager
After=network.target

[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=HADOOP_HOME=/bigdata/app/hadoop
Environment=JAVA_HOME=/bigdata/app/jdk
ExecStart=/bigdata/app/hadoop/sbin/yarn-daemon.sh start resourcemanager
ExecStop=/bigdata/app/hadoop/sbin/yarn-daemon.sh stop resourcemanager
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target
EOF

# NodeManager
cat > /etc/systemd/system/hadoop-nodemanager.service << 'EOF'
[Unit]
Description=Hadoop NodeManager
After=network.target

[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=HADOOP_HOME=/bigdata/app/hadoop
Environment=JAVA_HOME=/bigdata/app/jdk
ExecStart=/bigdata/app/hadoop/sbin/yarn-daemon.sh start nodemanager
ExecStop=/bigdata/app/hadoop/sbin/yarn-daemon.sh stop nodemanager
Restart=on-failure
RestartSec=10s
StartLimitInterval=5min
StartLimitBurst=3

[Install]
WantedBy=multi-user.target
EOF

# Apply the configuration
systemctl daemon-reload
systemctl enable hadoop-namenode
systemctl enable hadoop-resourcemanager
systemctl enable hadoop-nodemanager
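The four Hadoop units in 3.1.1 and 3.1.2 differ only in their description, daemon script, and component name, so they can be generated from one template instead of being copied by hand. The sketch below assumes the same paths and user as the units above. `render_unit` is a hypothetical helper name, and it writes the restart limit into every unit, which is a deliberate deviation from the NameNode and ResourceManager units above (they set none).

```shell
# render_unit DESCRIPTION DAEMON_SCRIPT COMPONENT
# Prints a unit file to stdout, reusing the paths and settings of the
# DataNode unit from section 3.1.1.
render_unit() {
    cat <<EOF
[Unit]
Description=$1
After=network.target

[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=HADOOP_HOME=/bigdata/app/hadoop
Environment=JAVA_HOME=/bigdata/app/jdk
ExecStart=/bigdata/app/hadoop/sbin/$2 start $3
ExecStop=/bigdata/app/hadoop/sbin/$2 stop $3
Restart=on-failure
RestartSec=10s
StartLimitInterval=5min
StartLimitBurst=3

[Install]
WantedBy=multi-user.target
EOF
}
```

Usage: `render_unit "Hadoop NodeManager" yarn-daemon.sh nodemanager > /etc/systemd/system/hadoop-nodemanager.service`, followed by `systemctl daemon-reload`.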
3.2 Script-Based Auto-Restart
3.2.1 Monitoring Script
#!/bin/bash
# monitor_hadoop.sh
# Run as the hadoop user. jps must be on PATH (cron provides only a minimal PATH).
LOG_DIR=/bigdata/logs/monitor
mkdir -p "$LOG_DIR"
LOG_FILE=$LOG_DIR/monitor_$(date +%Y%m%d).log

# Write a line to the log with the current timestamp
log() {
    echo "$(date +%Y-%m-%d_%H:%M:%S) $*" >> "$LOG_FILE"
}

# Check DataNode
check_datanode() {
    if ! jps | grep -q DataNode; then
        log "DataNode is down, restarting..."
        /bigdata/app/hadoop/sbin/hadoop-daemon.sh start datanode
        sleep 30
        if jps | grep -q DataNode; then
            log "DataNode restarted successfully"
        else
            log "DataNode restart failed"
        fi
    fi
}

# Check NodeManager
check_nodemanager() {
    if ! jps | grep -q NodeManager; then
        log "NodeManager is down, restarting..."
        /bigdata/app/hadoop/sbin/yarn-daemon.sh start nodemanager
        sleep 30
        if jps | grep -q NodeManager; then
            log "NodeManager restarted successfully"
        else
            log "NodeManager restart failed"
        fi
    fi
}

# Run the checks
check_datanode
check_nodemanager

# Add to crontab (runs every minute)
# * * * * * /bigdata/scripts/monitor_hadoop.sh
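Two details of the script deserve care if it is extended. First, `jps | grep -q NAME` is a substring match, so a check for `NameNode` would also succeed when only `SecondaryNameNode` is running. Second, the script can take a minute or more when both services need a restart (two 30-second sleeps), so the next cron run may start while the previous one is still working. The sketch below addresses both. The helper names `jps_has` and `run_locked` and the lock-directory approach are assumptions for illustration.

```shell
# jps_has NAME
# Reads `jps` output ("<pid> <MainClass>") on stdin and succeeds only if a
# line's class name equals NAME exactly.
jps_has() {
    awk -v n="$1" '$2 == n { found = 1 } END { exit !found }'
}

# run_locked LOCK_DIR CMD...
# Runs CMD only if LOCK_DIR can be created. mkdir is atomic, so two
# overlapping runs cannot both acquire the lock.
run_locked() {
    lock_dir=$1
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        return 75    # another run is still in progress; skip this one
    fi
    rc=0
    "$@" || rc=$?
    rmdir "$lock_dir"
    return "$rc"
}
```

Usage: `jps | jps_has DataNode` in place of `jps | grep -q DataNode`, and a crontab entry that calls the checks through `run_locked /tmp/monitor_hadoop.lock`. A run that is killed leaves the lock directory behind, so a production version would also need stale-lock cleanup.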
3.3 Monit Monitoring
3.3.1 Installing and Configuring Monit
# 1. Install Monit
yum install -y monit

# 2. Configure Monit
# Monit runs start/stop programs as root unless a uid/gid is given,
# so the commands below are run as the hadoop user.
cat > /etc/monit.d/hadoop << 'EOF'
check process datanode with pidfile /bigdata/app/hadoop/logs/hadoop-hadoop-datanode.pid
    start program = "/bigdata/app/hadoop/sbin/hadoop-daemon.sh start datanode" as uid "hadoop" and gid "hadoop"
    stop program = "/bigdata/app/hadoop/sbin/hadoop-daemon.sh stop datanode" as uid "hadoop" and gid "hadoop"
    if failed host 127.0.0.1 port 9866 then restart
    if 3 restarts within 5 cycles then timeout

check process nodemanager with pidfile /bigdata/app/hadoop/logs/yarn-hadoop-nodemanager.pid
    start program = "/bigdata/app/hadoop/sbin/yarn-daemon.sh start nodemanager" as uid "hadoop" and gid "hadoop"
    stop program = "/bigdata/app/hadoop/sbin/yarn-daemon.sh stop nodemanager" as uid "hadoop" and gid "hadoop"
    if failed host 127.0.0.1 port 8040 then restart
    if 3 restarts within 5 cycles then timeout
EOF

# 3. Start Monit
systemctl start monit
systemctl enable monit

# 4. Check status
monit status
monit summary

# 5. Web interface
# Default port 2812; the httpd section in the Monit configuration must allow your client address
# http://server:2812
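Monit's `if failed host ... port ...` test checks that the service accepts TCP connections, which catches a hung process that a pid or `jps` check would miss. A comparable probe can be added to a shell script. The sketch below assumes bash built with `/dev/tcp` support and the coreutils `timeout` command are available on the host. `port_open` is a hypothetical helper name.

```shell
# port_open HOST PORT [TIMEOUT_SECONDS]
# Succeeds if a TCP connection to HOST:PORT can be opened within the timeout
# (default 3 seconds). Relies on bash's /dev/tcp pseudo-device.
port_open() {
    timeout "${3:-3}" bash -c ': < "/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

Usage: `port_open 127.0.0.1 9866 || echo "DataNode port is not answering"`, using the same port as the Monit rule above.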
Part04-Production Cases and Hands-On Walkthroughs
4.1 DataNode Auto-Restart
4.1.1 Hands-On Case
# 1. Configure the systemd service
# PIDFile must match the directory set by HADOOP_PID_DIR on this cluster.
cat > /etc/systemd/system/hadoop-datanode.service << 'EOF'
[Unit]
Description=Hadoop DataNode
After=network.target

[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=HADOOP_HOME=/bigdata/app/hadoop
Environment=JAVA_HOME=/bigdata/app/jdk
PIDFile=/bigdata/app/hadoop/logs/hadoop-hadoop-datanode.pid
ExecStart=/bigdata/app/hadoop/sbin/hadoop-daemon.sh start datanode
ExecStop=/bigdata/app/hadoop/sbin/hadoop-daemon.sh stop datanode
Restart=on-failure
RestartSec=10
StartLimitInterval=300
StartLimitBurst=3

[Install]
WantedBy=multi-user.target
EOF

# 2. Start the service
systemctl daemon-reload
systemctl start hadoop-datanode
systemctl enable hadoop-datanode

# 3. Verify the service
systemctl status hadoop-datanode
jps

# 4. Simulate a failure
# Stop the DataNode outside of systemd (run as the hadoop user)
hdfs --daemon stop datanode
jps

# 5. Observe the automatic restart
sleep 20
jps
# The DataNode should have been restarted automatically

# 6. Check the logs
journalctl -u hadoop-datanode -n 50
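The fixed `sleep 20` in step 5 either waits longer than needed or, on a slow node, not long enough. A polling helper gives a clearer pass/fail result. This is a generic sketch; the name `wait_until` and its argument order are choices made for this example.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS CMD...
# Re-runs CMD every INTERVAL_SECONDS until it succeeds (returns 0) or until
# TIMEOUT_SECONDS have been spent waiting (returns 1).
wait_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    waited=0
    while ! "$@"; do
        if [ "$waited" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        waited=$((waited + interval_s))
    done
    return 0
}
```

Usage: `wait_until 60 2 systemctl is-active --quiet hadoop-datanode && echo "DataNode is back"`. A non-zero result after 60 seconds means the automatic restart did not happen and the journal should be checked.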
4.2 NodeManager Auto-Restart
4.2.1 Hands-On Case
# 1. Configure the systemd service
cat > /etc/systemd/system/hadoop-nodemanager.service << 'EOF'
[Unit]
Description=Hadoop NodeManager
After=network.target

[Service]
Type=forking
User=hadoop
Group=hadoop
Environment=HADOOP_HOME=/bigdata/app/hadoop
Environment=JAVA_HOME=/bigdata/app/jdk
PIDFile=/bigdata/app/hadoop/logs/yarn-hadoop-nodemanager.pid
ExecStart=/bigdata/app/hadoop/sbin/yarn-daemon.sh start nodemanager
ExecStop=/bigdata/app/hadoop/sbin/yarn-daemon.sh stop nodemanager
Restart=on-failure
RestartSec=10
StartLimitInterval=300
StartLimitBurst=3

[Install]
WantedBy=multi-user.target
EOF

# 2. Start the service
systemctl daemon-reload
systemctl start hadoop-nodemanager
systemctl enable hadoop-nodemanager

# 3. Verify
systemctl status hadoop-nodemanager
jps

# 4. Simulate a failure (run as the hadoop user)
yarn --daemon stop nodemanager
jps

# 5. Observe the automatic restart
sleep 20
jps

# 6. Check the logs
journalctl -u hadoop-nodemanager -n 50
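Step 4 stops NodeManager with `yarn --daemon stop nodemanager`, which is a graceful shutdown. To also exercise the abrupt-death path, the process can be killed with SIGKILL using the PID recorded in the pid file. A process killed this way gets no chance to clean up, and systemd treats it as a failure for `Restart=on-failure`. The helper name `crash_from_pidfile` is an assumption for this sketch; use it on test nodes only.

```shell
# crash_from_pidfile PIDFILE
# Sends SIGKILL to the PID stored in PIDFILE to imitate an abrupt crash.
# Returns 1 if the file is unreadable or the PID is not running.
crash_from_pidfile() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")
    kill -0 "$pid" 2>/dev/null || return 1    # nothing to kill
    kill -9 "$pid"
}
```

Usage: `crash_from_pidfile /bigdata/app/hadoop/logs/yarn-hadoop-nodemanager.pid`, run as root or the hadoop user, then watch `journalctl -u hadoop-nodemanager` for the restart.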
4.3 Auto-Restart for Other Services
4.3.1 Hands-On Case
# ZooKeeper
cat > /etc/systemd/system/zookeeper.service << 'EOF'
[Unit]
Description=ZooKeeper
After=network.target

[Service]
Type=forking
User=zookeeper
Group=zookeeper
ExecStart=/bigdata/app/zookeeper/bin/zkServer.sh start
ExecStop=/bigdata/app/zookeeper/bin/zkServer.sh stop
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

# HBase Master
# (After= lists hadoop.service, which this article never defines; the ordering
# only applies if a unit with that name exists on the host.)
cat > /etc/systemd/system/hbase-master.service << 'EOF'
[Unit]
Description=HBase Master
After=network.target hadoop.service

[Service]
Type=forking
User=hbase
Group=hbase
Environment=HBASE_HOME=/bigdata/app/hbase
Environment=JAVA_HOME=/bigdata/app/jdk
ExecStart=/bigdata/app/hbase/bin/hbase-daemon.sh start master
ExecStop=/bigdata/app/hbase/bin/hbase-daemon.sh stop master
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

# HBase RegionServer
cat > /etc/systemd/system/hbase-regionserver.service << 'EOF'
[Unit]
Description=HBase RegionServer
After=network.target hadoop.service

[Service]
Type=forking
User=hbase
Group=hbase
Environment=HBASE_HOME=/bigdata/app/hbase
Environment=JAVA_HOME=/bigdata/app/jdk
ExecStart=/bigdata/app/hbase/bin/hbase-daemon.sh start regionserver
ExecStop=/bigdata/app/hbase/bin/hbase-daemon.sh stop regionserver
Restart=on-failure
RestartSec=10
StartLimitInterval=300
StartLimitBurst=3

[Install]
WantedBy=multi-user.target
EOF

# Enable the services
systemctl daemon-reload
systemctl enable zookeeper
systemctl enable hbase-master
systemctl enable hbase-regionserver
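The ZooKeeper and HBase Master units above set no `StartLimitInterval`/`StartLimitBurst`. Rather than editing each unit file, a limit can be added through a systemd drop-in file under `UNIT.d/`. The sketch below writes such a drop-in. The helper name `write_restart_limit` and the file name `restart-limit.conf` are arbitrary choices. It uses the same directive names as the rest of this article.

```shell
# write_restart_limit UNIT [INTERVAL_SECONDS] [BURST] [ROOT]
# Creates ROOT/UNIT.d/restart-limit.conf with a restart limit.
# ROOT defaults to /etc/systemd/system; pass another directory for a dry run.
write_restart_limit() {
    unit=$1
    interval=${2:-300}
    burst=${3:-3}
    root=${4:-/etc/systemd/system}
    mkdir -p "$root/$unit.d" || return 1
    cat > "$root/$unit.d/restart-limit.conf" <<EOF
[Service]
StartLimitInterval=$interval
StartLimitBurst=$burst
EOF
}
```

Usage: `write_restart_limit zookeeper.service && write_restart_limit hbase-master.service && systemctl daemon-reload`. Running `systemctl cat zookeeper.service` afterwards shows the unit together with the drop-in.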
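Part05-Fengge's Lessons Learned
5.1 Best Practices
Best practices:
- Use systemd: it ships with the system and is simple to configure
- Limit restarts: avoid restart loops
- Send alerts: notify operations whenever a restart happens
- Keep logs: record every restart event
- Treat services differently: be cautious about auto-restarting stateful services
5.2 Common Pitfalls
1. Auto-restarting stateful services
– Symptom: NameNode is restarted automatically over and over
– Risk: data corruption
– How to avoid: do not configure auto-restart for stateful services
2. No limit on restarts
– Symptom: the service gets stuck in a restart loop
– Risk: resource exhaustion
– How to avoid: configure a restart limit
3. No alerting
– Symptom: services restart automatically and nobody knows
– Risk: problems pile up unnoticed
– How to avoid: configure alerts
4. Not investigating the cause
– Symptom: the service is restarted but the cause is never examined
– Risk: the problem keeps coming back
– How to avoid: investigate the cause after every restart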
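For pitfall 4, a quick first step after any automatic restart is to pull the most recent error lines from the daemon's own log. The sketch below assumes the common Hadoop log layout in which the level appears as a space-delimited word (for example `... ERROR org.apache...`). The helper name `recent_errors` and the log path in the usage note are assumptions to adapt.

```shell
# recent_errors LOG_FILE [LINES]
# Prints the ERROR and FATAL lines found in the last LINES lines
# (default 200) of LOG_FILE. Returns 1 if there are none.
recent_errors() {
    tail -n "${2:-200}" "$1" | grep -E ' (ERROR|FATAL) '
}
```

Usage: `recent_errors /bigdata/app/hadoop/logs/hadoop-hadoop-datanode-$(hostname).log 500`. The output can be attached to the restart alert so the on-call engineer sees the likely cause immediately.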
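5.3 Checklist
## Configuration checks
– [ ] systemd service is configured correctly
– [ ] Restart strategy is appropriate
– [ ] Restart limit is configured
– [ ] Alerting is configured
– [ ] Logging is configured
## Functional checks
– [ ] Service starts normally
– [ ] Service restarts automatically
– [ ] Alerts are delivered
– [ ] Logs are written
– [ ] Restart limit takes effect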
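Part of the configuration checklist can be automated by scanning each unit file for the restart directives used in this article. The sketch below is a minimal check. `check_unit_file` is a hypothetical helper, and it looks only for the directive names that appear in this article, so a unit that uses a different spelling such as `StartLimitIntervalSec` would be reported as missing.

```shell
# check_unit_file PATH
# Prints "OK <directive>" or "MISSING <directive>" for each restart-related
# directive and returns 1 if any of them is missing.
check_unit_file() {
    missing=0
    for key in Restart RestartSec StartLimitInterval StartLimitBurst; do
        if grep -q "^${key}=" "$1"; then
            echo "OK $key"
        else
            echo "MISSING $key"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: `check_unit_file /etc/systemd/system/hadoop-namenode.service` reports the two `StartLimit*` directives as missing for the NameNode unit from section 3.1.2.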
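This article was compiled and published by Fengge Tutorials (风哥教程) for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html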
