This tutorial covers methods and hands-on techniques for automated operations (O&M) of big data clusters, including automated deployment, automated monitoring, and automated fault handling. The Fenge tutorial draws on the official bigdata automation O&M guides, configuration references, and related material.
By the end of this tutorial you will know how to automate the operation of a big data cluster and improve both O&M efficiency and reliability.
Outline
Part01 - Basic Concepts and Theory
1.1 Overview of automated O&M
Automated O&M for a big data cluster means using automation tools and techniques to handle deployment, monitoring, fault handling, and task execution without manual intervention. It mainly covers:
- Automated deployment: install and configure the cluster automatically
- Automated monitoring: track cluster state and performance automatically
- Automated fault handling: detect and handle faults automatically
- Automated task execution: run routine maintenance tasks automatically
- Automated configuration management: manage and update configuration automatically
Automated O&M is a core part of big data cluster management: it improves efficiency, reduces human error, and keeps the cluster running stably.
1.2 Automation tools
Commonly used automation tools:
- Configuration management: Ansible, Puppet, Chef, etc.
- Monitoring: Prometheus, Grafana, Zabbix, etc.
- Containerization: Docker, Kubernetes, etc.
- Orchestration: Apache Airflow, Luigi, etc.
- Log analysis: ELK Stack, Graylog, etc.
- Automation scripts: shell scripts, Python scripts, etc.
1.3 Automation workflow
A typical automation workflow:
- Requirements analysis: analyze the O&M needs and decide what to automate
- Tool selection: choose suitable automation tools
- Solution design: design the automation plan
- Implementation: deploy the automation tools and scripts
- Testing and validation: verify that the automation works as intended
- Continuous optimization: keep refining the automation based on real-world results
Part02 - Production Planning and Recommendations
2.1 Automation planning
Fenge's tip: base the automation plan on cluster size and business needs, and set a sensible automation strategy so the automation actually pays off.
Planning recommendations:
- Goals: state the automation goals explicitly, e.g. higher efficiency, fewer errors, better stability
- Scope: define what to automate, e.g. deployment, monitoring, fault handling
- Tools: pick tools with usability, reliability, and extensibility in mind
- Implementation plan: draw up a detailed plan with timeline, steps, and owners
- Risk assessment: assess the risks of the rollout and prepare mitigations
2.2 Automation strategy
Strategy recommendations:
- Incremental rollout: start with simple tasks, then widen the automation scope step by step
- Standardized configuration: build standard configuration templates to keep configs consistent
- Version control: keep configs and scripts in a version control system
- Documentation: document the automated workflows and scripts for future maintenance
- Monitoring and alerting: set up alerting so problems are found and handled quickly
- Regular testing: test the automation scripts and workflows regularly to make sure they still work
2.3 Automation rollout
Rollout recommendations:
- Team training: train the team to raise automation skills
- Pilot first: pilot on a small scope and validate the results
- Gradual rollout: after a successful pilot, extend to the whole cluster
- Continuous improvement: keep improving the automation based on actual results
- Feedback loop: collect user feedback and suggestions
Part03 - Production Implementation Plan
3.1 Automated deployment
Configure automated deployment:
## 1.1 Deploy with Ansible
### 1.1.1 Install Ansible
yum install -y ansible
### 1.1.2 Configure the Ansible inventory
vi /etc/ansible/hosts
[hadoop]
fgedu01 ansible_ssh_host=192.168.1.10
fgedu02 ansible_ssh_host=192.168.1.11
fgedu03 ansible_ssh_host=192.168.1.12
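For illustration, a static inventory like the one above can also be parsed from your own scripts, e.g. to reuse the same host list in monitoring or backup tooling. The sketch below is a hedged example: `parse_inventory` is a hypothetical helper written here for the tutorial, not an Ansible API, and it only understands the simple `[group]` / `host key=value` layout shown above.

```python
# Minimal sketch: parse an INI-style Ansible inventory into {group: {host: ip}}.
# parse_inventory is an illustrative helper, NOT part of Ansible.
def parse_inventory(text):
    groups, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        if line.startswith('[') and line.endswith(']'):
            current = line[1:-1]
            groups[current] = {}
        elif current is not None:
            host, *host_vars = line.split()
            ip = None
            for var in host_vars:
                if var.startswith('ansible_ssh_host='):
                    ip = var.split('=', 1)[1]
            groups[current][host] = ip
    return groups

inventory = """
[hadoop]
fgedu01 ansible_ssh_host=192.168.1.10
fgedu02 ansible_ssh_host=192.168.1.11
fgedu03 ansible_ssh_host=192.168.1.12
"""
hosts = parse_inventory(inventory)
print(hosts['hadoop']['fgedu01'])  # -> 192.168.1.10
```

The same dictionary can then feed target lists for Prometheus or for backup scripts, so host information lives in one place.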
### 1.1.3 Create the Ansible playbook
vi hadoop-deploy.yml
---
- hosts: hadoop
  become: yes
  tasks:
    - name: Install Java
      yum:
        name: java-1.8.0-openjdk-devel
        state: present
    - name: Create hadoop user
      user:
        name: fgedu
        state: present
    - name: Download Hadoop
      get_url:
        url: https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
        dest: /tmp/hadoop-3.3.6.tar.gz
    - name: Extract Hadoop
      unarchive:
        src: /tmp/hadoop-3.3.6.tar.gz
        dest: /bigdata/app/
        remote_src: yes
    - name: Create symbolic link
      file:
        src: /bigdata/app/hadoop-3.3.6
        dest: /bigdata/app/hadoop
        state: link
    - name: Configure Hadoop
      template:
        src: "hadoop/templates/{{ item }}.j2"
        dest: "/bigdata/app/hadoop/etc/hadoop/{{ item }}"
      loop:
        - core-site.xml
        - hdfs-site.xml
        - yarn-site.xml
        - mapred-site.xml
    - name: Format HDFS
      command: /bigdata/app/hadoop/bin/hdfs namenode -format
      when: inventory_hostname == 'fgedu01'
    - name: Start HDFS
      command: /bigdata/app/hadoop/sbin/start-dfs.sh
      when: inventory_hostname == 'fgedu01'
    - name: Start YARN
      command: /bigdata/app/hadoop/sbin/start-yarn.sh
      when: inventory_hostname == 'fgedu01'
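The `template` tasks above render Jinja2 templates with per-host variables before copying them to the cluster. As a rough stand-in for what that rendering does, here is a sketch using Python's stdlib `string.Template` (Ansible actually uses Jinja2; the template text and the `hdfs://...:8020` value are illustrative, not taken from this tutorial's real templates):

```python
from string import Template

# Stand-in for the Jinja2 rendering done by Ansible's template module.
# The XML snippet and port 8020 are illustrative assumptions.
core_site = Template("""<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$namenode:8020</value>
  </property>
</configuration>""")

print(core_site.substitute(namenode='fgedu01'))
```

In the playbook, the variable values would come from the inventory or `group_vars`, so one template serves every environment.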
3.2 Automated monitoring
Configure automated monitoring:
## 1.1 Monitor with Prometheus
### 1.1.1 Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
tar -xzf prometheus-2.35.0.linux-amd64.tar.gz -C /bigdata/app/
ln -s /bigdata/app/prometheus-2.35.0.linux-amd64 /bigdata/app/prometheus
### 1.1.2 Configure Prometheus
vi /bigdata/app/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'hadoop'
    static_configs:
      - targets: ['fgedu01:9100', 'fgedu02:9100', 'fgedu03:9100']
  - job_name: 'hdfs'
    static_configs:
      - targets: ['fgedu01:9870']
  - job_name: 'yarn'
    static_configs:
      - targets: ['fgedu01:8088']
Note: ports 9870 and 8088 are the NameNode and ResourceManager web UIs, which do not expose Prometheus-format metrics out of the box; in practice you would scrape a JMX Exporter (or similar) endpoint for HDFS and YARN metrics.
### 1.1.3 Install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar -xzf node_exporter-1.3.1.linux-amd64.tar.gz -C /bigdata/app/
ln -s /bigdata/app/node_exporter-1.3.1.linux-amd64 /bigdata/app/node_exporter
### 1.1.4 Start Node Exporter
/bigdata/app/node_exporter/node_exporter &
### 1.1.5 Start Prometheus
/bigdata/app/prometheus/prometheus --config.file=/bigdata/app/prometheus/prometheus.yml &
## 1.2 Visualize with Grafana
### 1.2.1 Install Grafana
wget https://dl.grafana.com/oss/release/grafana-8.5.5-1.x86_64.rpm
yum install -y grafana-8.5.5-1.x86_64.rpm
### 1.2.2 Start Grafana
systemctl start grafana-server
systemctl enable grafana-server
### 1.2.3 Configure Grafana
# Visit http://fgedu01:3000 and log in as admin/admin
# Add a Prometheus data source
# Import a Hadoop monitoring dashboard
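Once Prometheus is running, its HTTP API (`/api/v1/query`) can also be queried from scripts, e.g. to check which exporter targets are up. A hedged sketch, assuming Prometheus listens on its default port 9090 on fgedu01 (not stated above) and with `build_query_url` as an illustrative helper:

```python
from urllib.parse import urlencode

# Build a Prometheus instant-query URL.
# build_query_url is an illustrative helper; fgedu01:9090 is an assumed address.
def build_query_url(base, promql):
    return base.rstrip('/') + '/api/v1/query?' + urlencode({'query': promql})

url = build_query_url('http://fgedu01:9090', 'up{job="hadoop"}')
print(url)

# A real check would then fetch and decode the JSON response, e.g.:
#   import json, urllib.request
#   result = json.load(urllib.request.urlopen(url))
```

A result value of 0 for an `up` series means the corresponding Node Exporter target is down, which is a natural trigger for the fault-handling scripts in the next section.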
3.3 Automated fault handling
Configure automated fault handling:
## 1.1 Write a fault-detection script
vi /bigdata/app/scripts/fault_detection.sh
#!/bin/bash
# fault_detection.sh
# Check HDFS status. The report always contains a "Missing blocks" line,
# so test that its value is non-zero rather than only grepping for the label.
hdfs dfsadmin -report > /tmp/hdfs_report.txt
if grep "Missing blocks" /tmp/hdfs_report.txt | grep -vq ": 0"; then
  echo "HDFS missing blocks detected, sending alert..."
  # send alert here
fi
# Check YARN status
yarn node -list > /tmp/yarn_report.txt
if grep -q "UNHEALTHY" /tmp/yarn_report.txt; then
  echo "YARN unhealthy nodes detected, sending alert..."
  # send alert here
fi
# Check the local DataNode process. jps only sees this host, so at most
# one DataNode can appear here; alert when it is missing.
jps | grep DataNode > /tmp/datanode_status.txt
if [ "$(wc -l < /tmp/datanode_status.txt)" -eq 0 ]; then
  echo "DataNode process not running, sending alert..."
  # send alert here
fi
## 1.2 Write a fault auto-repair script
vi /bigdata/app/scripts/fault_repair.sh
#!/bin/bash
# fault_repair.sh
# Restart the DataNode if its process is not running locally
if ! jps | grep -q DataNode; then
  echo "Starting DataNode..."
  /bigdata/app/hadoop/sbin/hadoop-daemon.sh start datanode
fi
# Restart the ResourceManager if needed
if ! jps | grep -q ResourceManager; then
  echo "Starting ResourceManager..."
  /bigdata/app/hadoop/sbin/yarn-daemon.sh start resourcemanager
fi
# Restart the NameNode if needed
if ! jps | grep -q NameNode; then
  echo "Starting NameNode..."
  /bigdata/app/hadoop/sbin/hadoop-daemon.sh start namenode
fi
## 1.3 Schedule the scripts with cron
crontab -e
*/5 * * * * /bigdata/app/scripts/fault_detection.sh
*/10 * * * * /bigdata/app/scripts/fault_repair.sh
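The shell checks above boil down to simple text predicates on captured status output, and that logic can be unit-tested in isolation before wiring it to real commands. A sketch (the function names are illustrative, and the sample report strings are fabricated test inputs, not real cluster output):

```python
# Decide which alerts to raise from captured status text.
# hdfs_alerts/yarn_alerts are illustrative helpers mirroring the shell checks.
def hdfs_alerts(dfsadmin_report):
    alerts = []
    for line in dfsadmin_report.splitlines():
        # Alert only when the "Missing blocks" value is non-zero.
        if 'Missing blocks' in line and not line.rstrip().endswith(': 0'):
            alerts.append('HDFS missing blocks detected')
    return alerts

def yarn_alerts(node_list):
    return ['YARN unhealthy nodes detected'] if 'UNHEALTHY' in node_list else []

print(hdfs_alerts('Missing blocks: 5'))   # one alert
print(hdfs_alerts('Missing blocks: 0'))   # no alert
print(yarn_alerts('fgedu02:8042 UNHEALTHY'))
```

Keeping the decision logic pure like this makes it easy to verify the alert conditions without touching a live cluster.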
3.4 Automated task execution
Configure automated task execution:
## 1.1 Use Apache Airflow
### 1.1.1 Install Airflow
pip install apache-airflow
### 1.1.2 Initialize Airflow
airflow db init
### 1.1.3 Create an Airflow user
airflow users create --username fgedu --password fgedu --firstname fgedu --lastname fgedu --role Admin --email fgedu@fgedu.net.cn
### 1.1.4 Start Airflow
airflow webserver -p 8080 &
airflow scheduler &
### 1.1.5 Create a DAG
vi /home/fgedu/airflow/dags/hadoop_tasks.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

default_args = {
    'owner': 'fgedu',
    'start_date': datetime(2026, 4, 8),
    'retries': 1,
}

with DAG('hadoop_tasks', default_args=default_args, schedule_interval='@daily') as dag:
    # Clean up HDFS temp files
    clean_hdfs = BashOperator(
        task_id='clean_hdfs',
        bash_command='hdfs dfs -rm -r /user/fgedu/tmp/*'
    )
    # Run the HDFS balancer
    balance_hdfs = BashOperator(
        task_id='balance_hdfs',
        bash_command='hdfs balancer'
    )
    # Back up data
    backup_data = BashOperator(
        task_id='backup_data',
        bash_command='hdfs dfs -cp /user/fgedu/data hdfs://backup-cluster/user/fgedu/backup/$(date +%Y%m%d)'
    )
    clean_hdfs >> balance_hdfs >> backup_data
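The `>>` chaining in the DAG only records task dependencies; the Airflow scheduler then runs tasks in any order that respects them. The idea can be sketched without Airflow installed, using the stdlib topological sorter (the dependency map below mirrors the DAG above; this is an illustration of the concept, not Airflow's actual scheduler):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Dependency map equivalent to clean_hdfs >> balance_hdfs >> backup_data:
# each key maps to the set of tasks it depends on.
deps = {
    'clean_hdfs': set(),
    'balance_hdfs': {'clean_hdfs'},
    'backup_data': {'balance_hdfs'},
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # -> ['clean_hdfs', 'balance_hdfs', 'backup_data']
```

With a richer dependency graph (e.g. several independent cleanup tasks feeding one backup task), the scheduler can run independent tasks in parallel, which is the main advantage over plain cron.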
## 1.2 Use crontab
### 1.2.1 Configure crontab
crontab -e
# Clean up HDFS temp files at 01:00 every day
0 1 * * * hdfs dfs -rm -r /user/fgedu/tmp/*
# Run the HDFS balancer at 02:00 every Sunday
0 2 * * 0 hdfs balancer
# Back up data at 03:00 every day
0 3 * * * hdfs dfs -cp /user/fgedu/data hdfs://backup-cluster/user/fgedu/backup/$(date +%Y%m%d)
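The `$(date +%Y%m%d)` suffix in the backup entry produces a per-day target directory. When the backup is driven from a script instead of cron, the same path can be built in Python (the path layout mirrors the cron entry above; `backup_path` is an illustrative helper):

```python
from datetime import date

# Build the dated HDFS backup path, matching $(date +%Y%m%d) in the cron entry.
# backup_path is an illustrative helper for this tutorial.
def backup_path(day=None):
    d = day or date.today()
    return 'hdfs://backup-cluster/user/fgedu/backup/' + d.strftime('%Y%m%d')

print(backup_path(date(2026, 4, 8)))  # -> hdfs://backup-cluster/user/fgedu/backup/20260408
```

Dated directories like this keep each day's backup separate, so an accidental deletion on the source side does not silently overwrite the only copy.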
Part04 - Production Cases and Hands-on Walkthroughs
4.1 Automated deployment in practice
Case study: deploying a Hadoop cluster with Ansible
# Install Ansible
$ yum install -y ansible
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Resolving Dependencies
--> Running transaction check
---> Package ansible.noarch 0:2.9.27-1.el8 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
ansible noarch 2.9.27-1.el8 epel 16 M
Transaction Summary
================================================================================
Install 1 Package
Total download size: 16 M
Installed size: 77 M
Downloading Packages:
ansible-2.9.27-1.el8.noarch.rpm 3.5 MB/s | 16 MB 00:04
--------------------------------------------------------------------------------
Total 3.5 MB/s | 16 MB 00:04
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : ansible-2.9.27-1.el8.noarch 1/1
Verifying : ansible-2.9.27-1.el8.noarch 1/1
Installed:
ansible.noarch 0:2.9.27-1.el8
Complete!
# Configure the Ansible inventory
[hadoop]
fgedu01 ansible_ssh_host=192.168.1.10
fgedu02 ansible_ssh_host=192.168.1.11
fgedu03 ansible_ssh_host=192.168.1.12
# Create the Ansible playbook
---
- hosts: hadoop
  become: yes
  tasks:
    - name: Install Java
      yum:
        name: java-1.8.0-openjdk-devel
        state: present
    - name: Create hadoop user
      user:
        name: fgedu
        state: present
    - name: Download Hadoop
      get_url:
        url: https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
        dest: /tmp/hadoop-3.3.6.tar.gz
    - name: Extract Hadoop
      unarchive:
        src: /tmp/hadoop-3.3.6.tar.gz
        dest: /bigdata/app/
        remote_src: yes
    - name: Create symbolic link
      file:
        src: /bigdata/app/hadoop-3.3.6
        dest: /bigdata/app/hadoop
        state: link
    - name: Configure Hadoop
      template:
        src: "hadoop/templates/{{ item }}.j2"
        dest: "/bigdata/app/hadoop/etc/hadoop/{{ item }}"
      loop:
        - core-site.xml
        - hdfs-site.xml
        - yarn-site.xml
        - mapred-site.xml
    - name: Format HDFS
      command: /bigdata/app/hadoop/bin/hdfs namenode -format
      when: inventory_hostname == 'fgedu01'
    - name: Start HDFS
      command: /bigdata/app/hadoop/sbin/start-dfs.sh
      when: inventory_hostname == 'fgedu01'
    - name: Start YARN
      command: /bigdata/app/hadoop/sbin/start-yarn.sh
      when: inventory_hostname == 'fgedu01'
# Run the Ansible playbook
$ ansible-playbook hadoop-deploy.yml
PLAY [hadoop] ********************************************************************
TASK [Gathering Facts] ************************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Install Java] ****************************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Create hadoop user] **********************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Download Hadoop] ************************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Extract Hadoop] **************************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Create symbolic link] ********************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Configure Hadoop] ************************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Format HDFS] ****************************************************************
ok: [fgedu01]
skipping: [fgedu02]
skipping: [fgedu03]
TASK [Start HDFS] ******************************************************************
ok: [fgedu01]
skipping: [fgedu02]
skipping: [fgedu03]
TASK [Start YARN] ******************************************************************
ok: [fgedu01]
skipping: [fgedu02]
skipping: [fgedu03]
PLAY RECAP ************************************************************************
fgedu01 : ok=10 changed=6 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
fgedu02 : ok=7 changed=4 unreachable=0 failed=0 skipped=3 rescued=0 ignored=0
fgedu03 : ok=7 changed=4 unreachable=0 failed=0 skipped=3 rescued=0 ignored=0
4.2 Automated monitoring in practice
Case study: monitoring a Hadoop cluster with Prometheus and Grafana
# Install Prometheus
$ wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
--2026-04-08 10:00:00--  https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
Resolving github.com (github.com)... 192.168.1.1
Connecting to github.com (github.com)|192.168.1.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/... [following]
HTTP request sent, awaiting response... 200 OK
Length: 86764336 (83M) [application/octet-stream]
Saving to: 'prometheus-2.35.0.linux-amd64.tar.gz'
100%[======================================>] 86,764,336  10MB/s    in 8.3s
2026-04-08 10:00:08 (10.0 MB/s) - 'prometheus-2.35.0.linux-amd64.tar.gz' saved [86764336/86764336]
$ tar -xzf prometheus-2.35.0.linux-amd64.tar.gz -C /bigdata/app/
$ ln -s /bigdata/app/prometheus-2.35.0.linux-amd64 /bigdata/app/prometheus
# Configure Prometheus
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'hadoop'
    static_configs:
      - targets: ['fgedu01:9100', 'fgedu02:9100', 'fgedu03:9100']
  - job_name: 'hdfs'
    static_configs:
      - targets: ['fgedu01:9870']
  - job_name: 'yarn'
    static_configs:
      - targets: ['fgedu01:8088']
# Install and start Node Exporter
$ wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
--2026-04-08 10:00:00--  https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
Resolving github.com (github.com)... 192.168.1.1
Connecting to github.com (github.com)|192.168.1.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/... [following]
HTTP request sent, awaiting response... 200 OK
Length: 15937080 (15M) [application/octet-stream]
Saving to: 'node_exporter-1.3.1.linux-amd64.tar.gz'
100%[======================================>] 15,937,080  10MB/s    in 1.5s
2026-04-08 10:00:01 (10.0 MB/s) - 'node_exporter-1.3.1.linux-amd64.tar.gz' saved [15937080/15937080]
$ tar -xzf node_exporter-1.3.1.linux-amd64.tar.gz -C /bigdata/app/
$ ln -s /bigdata/app/node_exporter-1.3.1.linux-amd64 /bigdata/app/node_exporter
$ /bigdata/app/node_exporter/node_exporter &
[1] 12345
# Start Prometheus
$ /bigdata/app/prometheus/prometheus --config.file=/bigdata/app/prometheus/prometheus.yml &
[1] 23456
# Install and start Grafana
$ wget https://dl.grafana.com/oss/release/grafana-8.5.5-1.x86_64.rpm
--2026-04-08 10:00:00--  https://dl.grafana.com/oss/release/grafana-8.5.5-1.x86_64.rpm
Resolving dl.grafana.com (dl.grafana.com)... 192.168.1.3
Connecting to dl.grafana.com (dl.grafana.com)|192.168.1.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68839320 (66M) [application/x-redhat-package-manager]
Saving to: 'grafana-8.5.5-1.x86_64.rpm'
100%[======================================>] 68,839,320  10MB/s    in 6.9s
2026-04-08 10:00:06 (9.9 MB/s) - 'grafana-8.5.5-1.x86_64.rpm' saved [68839320/68839320]
$ yum install -y grafana-8.5.5-1.x86_64.rpm
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Resolving Dependencies
--> Running transaction check
---> Package grafana.x86_64 0:8.5.5-1 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
grafana x86_64 8.5.5-1 /grafana-8.5.5-1.x86_64 66 M
Transaction Summary
================================================================================
Install 1 Package
Total size: 66 M
Installed size: 218 M
Downloading Packages:
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : grafana-8.5.5-1.x86_64 1/1
Verifying : grafana-8.5.5-1.x86_64 1/1
Installed:
grafana.x86_64 0:8.5.5-1
Complete!
$ systemctl start grafana-server
$ systemctl enable grafana-server
Created symlink /etc/systemd/system/multi-user.target.wants/grafana-server.service →
/usr/lib/systemd/system/grafana-server.service.
4.3 Automated fault handling in practice
Case study: automated fault detection and repair
# Write the fault-detection script
#!/bin/bash
# fault_detection.sh
# Check HDFS status; test the "Missing blocks" value rather than only the label
hdfs dfsadmin -report > /tmp/hdfs_report.txt
if grep "Missing blocks" /tmp/hdfs_report.txt | grep -vq ": 0"; then
  echo "HDFS missing blocks detected, sending alert..."
  # send alert here
fi
# Check YARN status
yarn node -list > /tmp/yarn_report.txt
if grep -q "UNHEALTHY" /tmp/yarn_report.txt; then
  echo "YARN unhealthy nodes detected, sending alert..."
  # send alert here
fi
# Check the local DataNode process (jps only sees this host)
jps | grep DataNode > /tmp/datanode_status.txt
if [ "$(wc -l < /tmp/datanode_status.txt)" -eq 0 ]; then
  echo "DataNode process not running, sending alert..."
  # send alert here
fi
# Write the fault auto-repair script
#!/bin/bash
# fault_repair.sh
# Restart the DataNode if its process is not running locally
if ! jps | grep -q DataNode; then
  echo "Starting DataNode..."
  /bigdata/app/hadoop/sbin/hadoop-daemon.sh start datanode
fi
# Restart the ResourceManager if needed
if ! jps | grep -q ResourceManager; then
  echo "Starting ResourceManager..."
  /bigdata/app/hadoop/sbin/yarn-daemon.sh start resourcemanager
fi
# Restart the NameNode if needed
if ! jps | grep -q NameNode; then
  echo "Starting NameNode..."
  /bigdata/app/hadoop/sbin/hadoop-daemon.sh start namenode
fi
# Schedule the scripts with cron
*/5 * * * * /bigdata/app/scripts/fault_detection.sh
*/10 * * * * /bigdata/app/scripts/fault_repair.sh
4.4 Automated task execution in practice
Case study: running automated tasks with Apache Airflow
# Install Airflow
$ pip install apache-airflow
Collecting apache-airflow
Downloading apache_airflow-2.3.3-py3-none-any.whl (11.9 MB)
|████████████████████████████████| 11.9 MB 1.2 MB/s
Collecting alembic<2.0,>=1.6.5
Downloading alembic-1.7.7-py3-none-any.whl (160 kB)
|████████████████████████████████| 160 kB 1.5 MB/s
Collecting argcomplete<2.0,>=1.10.0
Downloading argcomplete-1.12.3-py2.py3-none-any.whl (38 kB)
|████████████████████████████████| 38 kB 1.3 MB/s
…
Installing collected packages: zipp, typing-extensions, six, pyparsing, pyrsistent, pyopenssl,
pyjwt, pycparser, pyasn1, psutil, psycopg2-binary, protobuf, prison, prometheus-client, ply,
pluggy, pika, pathlib2, pandas, packaging, oauthlib, numpy, mysql-connector-python, monotonic,
Mako, MarkupSafe, marshmallow, marshmallow-enum, lz4, lazy-object-proxy, kubernetes, jinja2,
idna, importlib-metadata, importlib-resources, greenlet, graphviz, gunicorn,
googleapis-common-protos, Flask-WTF, Flask-SQLAlchemy, Flask-Login, Flask-OpenID, Flask,
clickclick, click, cffi, chardet, certifi, boto3, backports.zoneinfo, apache-airflow
Successfully installed Flask-2.0.3 Flask-Login-0.4.1 Flask-OpenID-1.3.0 Flask-SQLAlchemy-2.5.1
Flask-WTF-0.14.3 Mako-1.1.6 MarkupSafe-2.0.1 apache-airflow-2.3.3 argcomplete-1.12.3
backports.zoneinfo-0.2.1 boto3-1.22.10 certifi-2021.10.8 cffi-1.15.0 chardet-4.0.0 click-7.1.2
clickclick-20.10.2 greenlet-1.1.2 graphviz-0.19.1 gunicorn-20.1.0 idna-2.10
importlib-metadata-4.11.4 importlib-resources-5.4.0 jinja2-3.0.3 kubernetes-18.20.0
lazy-object-proxy-1.7.1 lz4-3.1.3 marshmallow-3.14.1 marshmallow-enum-1.5.1 monotonic-1.6
mysql-connector-python-8.0.28 numpy-1.22.3 oauthlib-3.2.0 packaging-21.3 pandas-1.4.2
pathlib2-2.3.6 pika-1.2.0 ply-3.11 pluggy-1.0.0 prison-0.2.1 prometheus-client-0.13.1
protobuf-3.19.4 psycopg2-binary-2.9.3 psutil-5.9.0 pyasn1-0.4.8 pycparser-2.21 pyjwt-2.3.0
pyopenssl-20.0.1 pyparsing-3.0.8 pyrsistent-0.18.1 six-1.16.0 typing-extensions-4.1.1
zipp-3.7.0
# Initialize Airflow
$ airflow db init
DB: sqlite:////home/fgedu/airflow/airflow.db
[2026-04-08 10:00:00,000] {db.py:685} INFO - Creating tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> e3a246e0dc1, current version
...(schema migration log truncated)...
Initialization done
# Create an Airflow user
$ airflow users create --username fgedu --password fgedu --firstname fgedu --lastname fgedu --role Admin --email fgedu@fgedu.net.cn
Admin user fgedu created
# Start Airflow
$ airflow webserver -p 8080 &
[1] 12345
$ airflow scheduler &
[2] 23456
# Create the DAG
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

default_args = {
    'owner': 'fgedu',
    'start_date': datetime(2026, 4, 8),
    'retries': 1,
}

with DAG('hadoop_tasks', default_args=default_args, schedule_interval='@daily') as dag:
    # Clean up HDFS temp files
    clean_hdfs = BashOperator(
        task_id='clean_hdfs',
        bash_command='hdfs dfs -rm -r /user/fgedu/tmp/*'
    )
    # Run the HDFS balancer
    balance_hdfs = BashOperator(
        task_id='balance_hdfs',
        bash_command='hdfs balancer'
    )
    # Back up data
    backup_data = BashOperator(
        task_id='backup_data',
        bash_command='hdfs dfs -cp /user/fgedu/data hdfs://backup-cluster/user/fgedu/backup/$(date +%Y%m%d)'
    )
    clean_hdfs >> balance_hdfs >> backup_data
# Configure crontab
# Clean up HDFS temp files at 01:00 every day
0 1 * * * hdfs dfs -rm -r /user/fgedu/tmp/*
# Run the HDFS balancer at 02:00 every Sunday
0 2 * * 0 hdfs balancer
# Back up data at 03:00 every day
0 3 * * * hdfs dfs -cp /user/fgedu/data hdfs://backup-cluster/user/fgedu/backup/$(date +%Y%m%d)
Part05 - Fenge's Experience and Takeaways
5.1 Solutions to common problems
Solutions to common problems:
- Automation script fails to execute: check the script's permissions, and …