This tutorial covers methods and hands-on techniques for automated operations (O&M) of big data clusters, including automated deployment, automated monitoring, and automated fault handling. The Fenge tutorial draws on the official bigdata automation O&M guides, configuration references, and related material.
By the end of this tutorial you will know how to automate the operation of a big data cluster and improve both O&M efficiency and reliability.
Outline
Part01 - Basic Concepts and Theory
1.1 Overview of automated O&M
Automated O&M for a big data cluster means using automation tools and techniques to handle deployment, monitoring, fault handling, and task execution without manual intervention. It mainly covers:
- Automated deployment: install and configure the cluster automatically
- Automated monitoring: track cluster state and performance automatically
- Automated fault handling: detect and handle faults automatically
- Automated task execution: run routine maintenance tasks automatically
- Automated configuration management: manage and update configuration automatically
Automated O&M is a core part of big data cluster management: it improves efficiency, reduces human error, and keeps the cluster running stably.
1.2 Automation tools
Commonly used automation tools:
- Configuration management: Ansible, Puppet, Chef, etc.
- Monitoring: Prometheus, Grafana, Zabbix, etc.
- Containerization: Docker, Kubernetes, etc.
- Orchestration: Apache Airflow, Luigi, etc.
- Log analysis: ELK Stack, Graylog, etc.
- Automation scripts: shell scripts, Python scripts, etc.
1.3 Automation workflow
A typical automation workflow:
- Requirements analysis: analyze the O&M needs and decide what to automate
- Tool selection: choose suitable automation tools
- Solution design: design the automation plan
- Implementation: deploy the automation tools and scripts
- Testing and validation: verify that the automation works as intended
- Continuous optimization: keep refining the automation based on real-world results
Part02 - Production Planning and Recommendations
2.1 Automation planning
Fenge's tip: base the automation plan on cluster size and business needs, and set a sensible automation strategy so the automation actually pays off.
Planning recommendations:
- Goals: state the automation goals explicitly, e.g. higher efficiency, fewer errors, better stability
- Scope: define what to automate, e.g. deployment, monitoring, fault handling
- Tools: pick tools with usability, reliability, and extensibility in mind
- Implementation plan: draw up a detailed plan with timeline, steps, and owners
- Risk assessment: assess the risks of the rollout and prepare mitigations
2.2 Automation strategy
Strategy recommendations:
- Incremental rollout: start with simple tasks, then widen the automation scope step by step
- Standardized configuration: build standard configuration templates to keep configs consistent
- Version control: keep configs and scripts in a version control system
- Documentation: document the automated workflows and scripts for future maintenance
- Monitoring and alerting: set up alerting so problems are found and handled quickly
- Regular testing: test the automation scripts and workflows regularly to make sure they still work
2.3 Automation rollout
Rollout recommendations:
- Team training: train the team to raise automation skills
- Pilot first: pilot on a small scope and validate the results
- Gradual rollout: after a successful pilot, extend to the whole cluster
- Continuous improvement: keep improving the automation based on actual results
- Feedback loop: collect user feedback and suggestions
Part03 - Production Implementation Plan
3.1 Automated deployment
Configure automated deployment:
## 1.1 Deploy with Ansible
### 1.1.1 Install Ansible
yum install -y ansible
### 1.1.2 Configure the Ansible inventory
vi /etc/ansible/hosts
[hadoop]
fgedu01 ansible_ssh_host=192.168.1.10
fgedu02 ansible_ssh_host=192.168.1.11
fgedu03 ansible_ssh_host=192.168.1.12
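For illustration, a static inventory like the one above can also be parsed from your own scripts, e.g. to reuse the same host list in monitoring or backup tooling. The sketch below is a hedged example: `parse_inventory` is a hypothetical helper written here for the tutorial, not an Ansible API, and it only understands the simple `[group]` / `host key=value` layout shown above.

```python
# Minimal sketch: parse an INI-style Ansible inventory into {group: {host: ip}}.
# parse_inventory is an illustrative helper, NOT part of Ansible.
def parse_inventory(text):
    groups, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        if line.startswith('[') and line.endswith(']'):
            current = line[1:-1]
            groups[current] = {}
        elif current is not None:
            host, *host_vars = line.split()
            ip = None
            for var in host_vars:
                if var.startswith('ansible_ssh_host='):
                    ip = var.split('=', 1)[1]
            groups[current][host] = ip
    return groups

inventory = """
[hadoop]
fgedu01 ansible_ssh_host=192.168.1.10
fgedu02 ansible_ssh_host=192.168.1.11
fgedu03 ansible_ssh_host=192.168.1.12
"""
hosts = parse_inventory(inventory)
print(hosts['hadoop']['fgedu01'])  # -> 192.168.1.10
```

The same dictionary can then feed target lists for Prometheus or for backup scripts, so host information lives in one place.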
### 1.1.3 Create the Ansible playbook
vi hadoop-deploy.yml
---
- hosts: hadoop
  become: yes
  tasks:
    - name: Install Java
      yum:
        name: java-1.8.0-openjdk-devel
        state: present
    - name: Create hadoop user
      user:
        name: fgedu
        state: present
    - name: Download Hadoop
      get_url:
        url: https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
        dest: /tmp/hadoop-3.3.6.tar.gz
    - name: Extract Hadoop
      unarchive:
        src: /tmp/hadoop-3.3.6.tar.gz
        dest: /bigdata/app/
        remote_src: yes
    - name: Create symbolic link
      file:
        src: /bigdata/app/hadoop-3.3.6
        dest: /bigdata/app/hadoop
        state: link
    - name: Configure Hadoop
      template:
        src: "hadoop/templates/{{ item }}.j2"
        dest: "/bigdata/app/hadoop/etc/hadoop/{{ item }}"
      loop:
        - core-site.xml
        - hdfs-site.xml
        - yarn-site.xml
        - mapred-site.xml
    - name: Format HDFS
      command: /bigdata/app/hadoop/bin/hdfs namenode -format
      when: inventory_hostname == 'fgedu01'
    - name: Start HDFS
      command: /bigdata/app/hadoop/sbin/start-dfs.sh
      when: inventory_hostname == 'fgedu01'
    - name: Start YARN
      command: /bigdata/app/hadoop/sbin/start-yarn.sh
      when: inventory_hostname == 'fgedu01'
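The `template` tasks above render Jinja2 templates with per-host variables before copying them to the cluster. As a rough stand-in for what that rendering does, here is a sketch using Python's stdlib `string.Template` (Ansible actually uses Jinja2; the template text and the `hdfs://...:8020` value are illustrative, not taken from this tutorial's real templates):

```python
from string import Template

# Stand-in for the Jinja2 rendering done by Ansible's template module.
# The XML snippet and port 8020 are illustrative assumptions.
core_site = Template("""<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$namenode:8020</value>
  </property>
</configuration>""")

print(core_site.substitute(namenode='fgedu01'))
```

In the playbook, the variable values would come from the inventory or `group_vars`, so one template serves every environment.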
3.2 Automated monitoring
Configure automated monitoring:
## 1.1 Monitor with Prometheus
### 1.1.1 Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
tar -xzf prometheus-2.35.0.linux-amd64.tar.gz -C /bigdata/app/
ln -s /bigdata/app/prometheus-2.35.0.linux-amd64 /bigdata/app/prometheus
### 1.1.2 Configure Prometheus
vi /bigdata/app/prometheus/prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'hadoop'
    static_configs:
      - targets: ['fgedu01:9100', 'fgedu02:9100', 'fgedu03:9100']
  - job_name: 'hdfs'
    static_configs:
      - targets: ['fgedu01:9870']
  - job_name: 'yarn'
    static_configs:
      - targets: ['fgedu01:8088']
Note: ports 9870 and 8088 are the NameNode and ResourceManager web UIs, which do not expose Prometheus-format metrics out of the box; in practice you would scrape a JMX Exporter (or similar) endpoint for HDFS and YARN metrics.
### 1.1.3 Install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar -xzf node_exporter-1.3.1.linux-amd64.tar.gz -C /bigdata/app/
ln -s /bigdata/app/node_exporter-1.3.1.linux-amd64 /bigdata/app/node_exporter
### 1.1.4 Start Node Exporter
/bigdata/app/node_exporter/node_exporter &
### 1.1.5 Start Prometheus
/bigdata/app/prometheus/prometheus --config.file=/bigdata/app/prometheus/prometheus.yml &
## 1.2 Visualize with Grafana
### 1.2.1 Install Grafana
wget https://dl.grafana.com/oss/release/grafana-8.5.5-1.x86_64.rpm
yum install -y grafana-8.5.5-1.x86_64.rpm
### 1.2.2 Start Grafana
systemctl start grafana-server
systemctl enable grafana-server
### 1.2.3 Configure Grafana
# Visit http://fgedu01:3000 and log in as admin/admin
# Add a Prometheus data source
# Import a Hadoop monitoring dashboard
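Once Prometheus is running, its HTTP API (`/api/v1/query`) can also be queried from scripts, e.g. to check which exporter targets are up. A hedged sketch, assuming Prometheus listens on its default port 9090 on fgedu01 (not stated above) and with `build_query_url` as an illustrative helper:

```python
from urllib.parse import urlencode

# Build a Prometheus instant-query URL.
# build_query_url is an illustrative helper; fgedu01:9090 is an assumed address.
def build_query_url(base, promql):
    return base.rstrip('/') + '/api/v1/query?' + urlencode({'query': promql})

url = build_query_url('http://fgedu01:9090', 'up{job="hadoop"}')
print(url)

# A real check would then fetch and decode the JSON response, e.g.:
#   import json, urllib.request
#   result = json.load(urllib.request.urlopen(url))
```

A result value of 0 for an `up` series means the corresponding Node Exporter target is down, which is a natural trigger for the fault-handling scripts in the next section.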
3.3 Automated fault handling
Configure automated fault handling:
## 1.1 Write a fault-detection script
vi /bigdata/app/scripts/fault_detection.sh
#!/bin/bash
# fault_detection.sh
# Check HDFS status. The report always contains a "Missing blocks" line,
# so test that its value is non-zero rather than only grepping for the label.
hdfs dfsadmin -report > /tmp/hdfs_report.txt
if grep "Missing blocks" /tmp/hdfs_report.txt | grep -vq ": 0"; then
  echo "HDFS missing blocks detected, sending alert..."
  # send alert here
fi
# Check YARN status
yarn node -list > /tmp/yarn_report.txt
if grep -q "UNHEALTHY" /tmp/yarn_report.txt; then
  echo "YARN unhealthy nodes detected, sending alert..."
  # send alert here
fi
# Check the local DataNode process. jps only sees this host, so at most
# one DataNode can appear here; alert when it is missing.
jps | grep DataNode > /tmp/datanode_status.txt
if [ "$(wc -l < /tmp/datanode_status.txt)" -eq 0 ]; then
  echo "DataNode process not running, sending alert..."
  # send alert here
fi
## 1.2 Write a fault auto-repair script
vi /bigdata/app/scripts/fault_repair.sh
#!/bin/bash
# fault_repair.sh
# Restart the DataNode if its process is not running locally
if ! jps | grep -q DataNode; then
  echo "Starting DataNode..."
  /bigdata/app/hadoop/sbin/hadoop-daemon.sh start datanode
fi
# Restart the ResourceManager if needed
if ! jps | grep -q ResourceManager; then
  echo "Starting ResourceManager..."
  /bigdata/app/hadoop/sbin/yarn-daemon.sh start resourcemanager
fi
# Restart the NameNode if needed
if ! jps | grep -q NameNode; then
  echo "Starting NameNode..."
  /bigdata/app/hadoop/sbin/hadoop-daemon.sh start namenode
fi
## 1.3 Schedule the scripts with cron
crontab -e
*/5 * * * * /bigdata/app/scripts/fault_detection.sh
*/10 * * * * /bigdata/app/scripts/fault_repair.sh
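The shell checks above boil down to simple text predicates on captured status output, and that logic can be unit-tested in isolation before wiring it to real commands. A sketch (the function names are illustrative, and the sample report strings are fabricated test inputs, not real cluster output):

```python
# Decide which alerts to raise from captured status text.
# hdfs_alerts/yarn_alerts are illustrative helpers mirroring the shell checks.
def hdfs_alerts(dfsadmin_report):
    alerts = []
    for line in dfsadmin_report.splitlines():
        # Alert only when the "Missing blocks" value is non-zero.
        if 'Missing blocks' in line and not line.rstrip().endswith(': 0'):
            alerts.append('HDFS missing blocks detected')
    return alerts

def yarn_alerts(node_list):
    return ['YARN unhealthy nodes detected'] if 'UNHEALTHY' in node_list else []

print(hdfs_alerts('Missing blocks: 5'))   # one alert
print(hdfs_alerts('Missing blocks: 0'))   # no alert
print(yarn_alerts('fgedu02:8042 UNHEALTHY'))
```

Keeping the decision logic pure like this makes it easy to verify the alert conditions without touching a live cluster.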
3.4 Automated task execution
Configure automated task execution:
## 1.1 Use Apache Airflow
### 1.1.1 Install Airflow
pip install apache-airflow
### 1.1.2 Initialize Airflow
airflow db init
### 1.1.3 Create an Airflow user
airflow users create --username fgedu --password fgedu --firstname fgedu --lastname fgedu --role Admin --email fgedu@fgedu.net.cn
### 1.1.4 Start Airflow
airflow webserver -p 8080 &
airflow scheduler &
### 1.1.5 Create a DAG
vi /home/fgedu/airflow/dags/hadoop_tasks.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

default_args = {
    'owner': 'fgedu',
    'start_date': datetime(2026, 4, 8),
    'retries': 1,
}

with DAG('hadoop_tasks', default_args=default_args, schedule_interval='@daily') as dag:
    # Clean up HDFS temp files
    clean_hdfs = BashOperator(
        task_id='clean_hdfs',
        bash_command='hdfs dfs -rm -r /user/fgedu/tmp/*'
    )
    # Run the HDFS balancer
    balance_hdfs = BashOperator(
        task_id='balance_hdfs',
        bash_command='hdfs balancer'
    )
    # Back up data
    backup_data = BashOperator(
        task_id='backup_data',
        bash_command='hdfs dfs -cp /user/fgedu/data hdfs://backup-cluster/user/fgedu/backup/$(date +%Y%m%d)'
    )
    clean_hdfs >> balance_hdfs >> backup_data
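The `>>` chaining in the DAG only records task dependencies; the Airflow scheduler then runs tasks in any order that respects them. The idea can be sketched without Airflow installed, using the stdlib topological sorter (the dependency map below mirrors the DAG above; this is an illustration of the concept, not Airflow's actual scheduler):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Dependency map equivalent to clean_hdfs >> balance_hdfs >> backup_data:
# each key maps to the set of tasks it depends on.
deps = {
    'clean_hdfs': set(),
    'balance_hdfs': {'clean_hdfs'},
    'backup_data': {'balance_hdfs'},
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # -> ['clean_hdfs', 'balance_hdfs', 'backup_data']
```

With a richer dependency graph (e.g. several independent cleanup tasks feeding one backup task), the scheduler can run independent tasks in parallel, which is the main advantage over plain cron.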
## 1.2 Use crontab
### 1.2.1 Configure crontab
crontab -e
# Clean up HDFS temp files at 01:00 every day
0 1 * * * hdfs dfs -rm -r /user/fgedu/tmp/*
# Run the HDFS balancer at 02:00 every Sunday
0 2 * * 0 hdfs balancer
# Back up data at 03:00 every day
0 3 * * * hdfs dfs -cp /user/fgedu/data hdfs://backup-cluster/user/fgedu/backup/$(date +%Y%m%d)
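The `$(date +%Y%m%d)` suffix in the backup entry produces a per-day target directory. When the backup is driven from a script instead of cron, the same path can be built in Python (the path layout mirrors the cron entry above; `backup_path` is an illustrative helper):

```python
from datetime import date

# Build the dated HDFS backup path, matching $(date +%Y%m%d) in the cron entry.
# backup_path is an illustrative helper for this tutorial.
def backup_path(day=None):
    d = day or date.today()
    return 'hdfs://backup-cluster/user/fgedu/backup/' + d.strftime('%Y%m%d')

print(backup_path(date(2026, 4, 8)))  # -> hdfs://backup-cluster/user/fgedu/backup/20260408
```

Dated directories like this keep each day's backup separate, so an accidental deletion on the source side does not silently overwrite the only copy.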
Part04 - Production Cases and Hands-on Walkthroughs
4.1 Automated deployment in practice
Case study: deploying a Hadoop cluster with Ansible
# Install Ansible
$ yum install -y ansible
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Resolving Dependencies
--> Running transaction check
---> Package ansible.noarch 0:2.9.27-1.el8 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
ansible noarch 2.9.27-1.el8 epel 16 M
Transaction Summary
================================================================================
Install 1 Package
Total download size: 16 M
Installed size: 77 M
Downloading Packages:
ansible-2.9.27-1.el8.noarch.rpm 3.5 MB/s | 16 MB 00:04
--------------------------------------------------------------------------------
Total 3.5 MB/s | 16 MB 00:04
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : ansible-2.9.27-1.el8.noarch 1/1
Verifying : ansible-2.9.27-1.el8.noarch 1/1
Installed:
ansible.noarch 0:2.9.27-1.el8
Complete!
# Configure the Ansible inventory
[hadoop]
fgedu01 ansible_ssh_host=192.168.1.10
fgedu02 ansible_ssh_host=192.168.1.11
fgedu03 ansible_ssh_host=192.168.1.12
# Create the Ansible playbook
---
- hosts: hadoop
  become: yes
  tasks:
    - name: Install Java
      yum:
        name: java-1.8.0-openjdk-devel
        state: present
    - name: Create hadoop user
      user:
        name: fgedu
        state: present
    - name: Download Hadoop
      get_url:
        url: https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
        dest: /tmp/hadoop-3.3.6.tar.gz
    - name: Extract Hadoop
      unarchive:
        src: /tmp/hadoop-3.3.6.tar.gz
        dest: /bigdata/app/
        remote_src: yes
    - name: Create symbolic link
      file:
        src: /bigdata/app/hadoop-3.3.6
        dest: /bigdata/app/hadoop
        state: link
    - name: Configure Hadoop
      template:
        src: "hadoop/templates/{{ item }}.j2"
        dest: "/bigdata/app/hadoop/etc/hadoop/{{ item }}"
      loop:
        - core-site.xml
        - hdfs-site.xml
        - yarn-site.xml
        - mapred-site.xml
    - name: Format HDFS
      command: /bigdata/app/hadoop/bin/hdfs namenode -format
      when: inventory_hostname == 'fgedu01'
    - name: Start HDFS
      command: /bigdata/app/hadoop/sbin/start-dfs.sh
      when: inventory_hostname == 'fgedu01'
    - name: Start YARN
      command: /bigdata/app/hadoop/sbin/start-yarn.sh
      when: inventory_hostname == 'fgedu01'
# Run the Ansible playbook
$ ansible-playbook hadoop-deploy.yml
PLAY [hadoop] ********************************************************************
TASK [Gathering Facts] ************************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Install Java] ****************************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Create hadoop user] **********************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Download Hadoop] ************************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Extract Hadoop] **************************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Create symbolic link] ********************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Configure Hadoop] ************************************************************
ok: [fgedu01]
ok: [fgedu02]
ok: [fgedu03]
TASK [Format HDFS] ****************************************************************
ok: [fgedu01]
skipping: [fgedu02]
skipping: [fgedu03]
TASK [Start HDFS] ******************************************************************
ok: [fgedu01]
skipping: [fgedu02]
skipping: [fgedu03]
TASK [Start YARN] ******************************************************************
ok: [fgedu01]
skipping: [fgedu02]
skipping: [fgedu03]
PLAY RECAP ************************************************************************
fgedu01 : ok=10 changed=6 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
fgedu02 : ok=7 changed=4 unreachable=0 failed=0 skipped=3 rescued=0 ignored=0
fgedu03 : ok=7 changed=4 unreachable=0 failed=0 skipped=3 rescued=0 ignored=0
4.2 Automated monitoring in practice
Case study: monitoring a Hadoop cluster with Prometheus and Grafana
# Install Prometheus
$ wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
--2026-04-08 10:00:00--  https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
Resolving github.com (github.com)... 192.168.1.1
Connecting to github.com (github.com)|192.168.1.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/... [following]
HTTP request sent, awaiting response... 200 OK
Length: 86764336 (83M) [application/octet-stream]
Saving to: 'prometheus-2.35.0.linux-amd64.tar.gz'
100%[======================================>] 86,764,336  10MB/s    in 8.3s
2026-04-08 10:00:08 (10.0 MB/s) - 'prometheus-2.35.0.linux-amd64.tar.gz' saved [86764336/86764336]
$ tar -xzf prometheus-2.35.0.linux-amd64.tar.gz -C /bigdata/app/
$ ln -s /bigdata/app/prometheus-2.35.0.linux-amd64 /bigdata/app/prometheus
# Configure Prometheus
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'hadoop'
    static_configs:
      - targets: ['fgedu01:9100', 'fgedu02:9100', 'fgedu03:9100']
  - job_name: 'hdfs'
    static_configs:
      - targets: ['fgedu01:9870']
  - job_name: 'yarn'
    static_configs:
      - targets: ['fgedu01:8088']
# Install and start Node Exporter
$ wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
--2026-04-08 10:00:00--  https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
Resolving github.com (github.com)... 192.168.1.1
Connecting to github.com (github.com)|192.168.1.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/... [following]
HTTP request sent, awaiting response... 200 OK
Length: 15937080 (15M) [application/octet-stream]
Saving to: 'node_exporter-1.3.1.linux-amd64.tar.gz'
100%[======================================>] 15,937,080  10MB/s    in 1.5s
2026-04-08 10:00:01 (10.0 MB/s) - 'node_exporter-1.3.1.linux-amd64.tar.gz' saved [15937080/15937080]
$ tar -xzf node_exporter-1.3.1.linux-amd64.tar.gz -C /bigdata/app/
$ ln -s /bigdata/app/node_exporter-1.3.1.linux-amd64 /bigdata/app/node_exporter
$ /bigdata/app/node_exporter/node_exporter &
[1] 12345
# Start Prometheus
$ /bigdata/app/prometheus/prometheus --config.file=/bigdata/app/prometheus/prometheus.yml &
[1] 23456
# Install and start Grafana
$ wget https://dl.grafana.com/oss/release/grafana-8.5.5-1.x86_64.rpm
--2026-04-08 10:00:00--  https://dl.grafana.com/oss/release/grafana-8.5.5-1.x86_64.rpm
Resolving dl.grafana.com (dl.grafana.com)... 192.168.1.3
Connecting to dl.grafana.com (dl.grafana.com)|192.168.1.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 68839320 (66M) [application/x-redhat-package-manager]
Saving to: 'grafana-8.5.5-1.x86_64.rpm'
100%[======================================>] 68,839,320  10MB/s    in 6.9s
2026-04-08 10:00:06 (9.9 MB/s) - 'grafana-8.5.5-1.x86_64.rpm' saved [68839320/68839320]
$ yum install -y grafana-8.5.5-1.x86_64.rpm
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Resolving Dependencies
--> Running transaction check
---> Package grafana.x86_64 0:8.5.5-1 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
grafana x86_64 8.5.5-1 /grafana-8.5.5-1.x86_64 66 M
Transaction Summary
================================================================================
Install 1 Package
Total size: 66 M
Installed size: 218 M
Downloading Packages:
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : grafana-8.5.5-1.x86_64 1/1
Verifying : grafana-8.5.5-1.x86_64 1/1
Installed:
grafana.x86_64 0:8.5.5-1
Complete!
$ systemctl start grafana-server
$ systemctl enable grafana-server
Created symlink /etc/systemd/system/multi-user.target.wants/grafana-server.service →
/usr/lib/systemd/system/grafana-server.service.
4.3 Automated fault handling in practice
Case study: automated fault detection and repair
# Write the fault-detection script
#!/bin/bash
# fault_detection.sh
# Check HDFS status; test the "Missing blocks" value rather than only the label
hdfs dfsadmin -report > /tmp/hdfs_report.txt
if grep "Missing blocks" /tmp/hdfs_report.txt | grep -vq ": 0"; then
  echo "HDFS missing blocks detected, sending alert..."
  # send alert here
fi
# Check YARN status
yarn node -list > /tmp/yarn_report.txt
if grep -q "UNHEALTHY" /tmp/yarn_report.txt; then
  echo "YARN unhealthy nodes detected, sending alert..."
  # send alert here
fi
# Check the local DataNode process (jps only sees this host)
jps | grep DataNode > /tmp/datanode_status.txt
if [ "$(wc -l < /tmp/datanode_status.txt)" -eq 0 ]; then
  echo "DataNode process not running, sending alert..."
  # send alert here
fi
# Write the fault auto-repair script
#!/bin/bash
# fault_repair.sh
# Restart the DataNode if its process is not running locally
if ! jps | grep -q DataNode; then
  echo "Starting DataNode..."
  /bigdata/app/hadoop/sbin/hadoop-daemon.sh start datanode
fi
# Restart the ResourceManager if needed
if ! jps | grep -q ResourceManager; then
  echo "Starting ResourceManager..."
  /bigdata/app/hadoop/sbin/yarn-daemon.sh start resourcemanager
fi
# Restart the NameNode if needed
if ! jps | grep -q NameNode; then
  echo "Starting NameNode..."
  /bigdata/app/hadoop/sbin/hadoop-daemon.sh start namenode
fi
# Schedule the scripts with cron
*/5 * * * * /bigdata/app/scripts/fault_detection.sh
*/10 * * * * /bigdata/app/scripts/fault_repair.sh
4.4 Automated task execution in practice
Case study: running automated tasks with Apache Airflow
# Install Airflow
$ pip install apache-airflow
Collecting apache-airflow
Downloading apache_airflow-2.3.3-py3-none-any.whl (11.9 MB)
|████████████████████████████████| 11.9 MB 1.2 MB/s
Collecting alembic<2.0,>=1.6.5
Downloading alembic-1.7.7-py3-none-any.whl (160 kB)
|████████████████████████████████| 160 kB 1.5 MB/s
Collecting argcomplete<2.0,>=1.10.0
Downloading argcomplete-1.12.3-py2.py3-none-any.whl (38 kB)
|████████████████████████████████| 38 kB 1.3 MB/s
…
Installing collected packages: zipp, typing-extensions, six, pyparsing, pyrsistent, pyopenssl,
pyjwt, pycparser, pyasn1, psutil, psycopg2-binary, protobuf, prison, prometheus-client, ply,
pluggy, pika, pathlib2, pandas, packaging, oauthlib, numpy, mysql-connector-python, monotonic,
Mako, MarkupSafe, marshmallow, marshmallow-enum, lz4, lazy-object-proxy, kubernetes, jinja2,
idna, importlib-metadata, importlib-resources, greenlet, graphviz, gunicorn,
googleapis-common-protos, Flask-WTF, Flask-SQLAlchemy, Flask-Login, Flask-OpenID, Flask,
clickclick, click, cffi, chardet, certifi, boto3, backports.zoneinfo, apache-airflow
Successfully installed Flask-2.0.3 Flask-Login-0.4.1 Flask-OpenID-1.3.0 Flask-SQLAlchemy-2.5.1
Flask-WTF-0.14.3 Mako-1.1.6 MarkupSafe-2.0.1 apache-airflow-2.3.3 argcomplete-1.12.3
backports.zoneinfo-0.2.1 boto3-1.22.10 certifi-2021.10.8 cffi-1.15.0 chardet-4.0.0 click-7.1.2
clickclick-20.10.2 greenlet-1.1.2 graphviz-0.19.1 gunicorn-20.1.0 idna-2.10
importlib-metadata-4.11.4 importlib-resources-5.4.0 jinja2-3.0.3 kubernetes-18.20.0
lazy-object-proxy-1.7.1 lz4-3.1.3 marshmallow-3.14.1 marshmallow-enum-1.5.1 monotonic-1.6
mysql-connector-python-8.0.28 numpy-1.22.3 oauthlib-3.2.0 packaging-21.3 pandas-1.4.2
pathlib2-2.3.6 pika-1.2.0 ply-3.11 pluggy-1.0.0 prison-0.2.1 prometheus-client-0.13.1
protobuf-3.19.4 psycopg2-binary-2.9.3 psutil-5.9.0 pyasn1-0.4.8 pycparser-2.21 pyjwt-2.3.0
pyopenssl-20.0.1 pyparsing-3.0.8 pyrsistent-0.18.1 six-1.16.0 typing-extensions-4.1.1
zipp-3.7.0
# Initialize Airflow
$ airflow db init
DB: sqlite:////home/fgedu/airflow/airflow.db
[2026-04-08 10:00:00,000] {db.py:685} INFO - Creating tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> e3a246e0dc1, current version
...(schema migration log truncated)...
Initialization done
# Create an Airflow user
$ airflow users create --username fgedu --password fgedu --firstname fgedu --lastname fgedu --role Admin --email fgedu@fgedu.net.cn
Admin user fgedu created
# Start Airflow
$ airflow webserver -p 8080 &
[1] 12345
$ airflow scheduler &
[2] 23456
# Create the DAG
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

default_args = {
    'owner': 'fgedu',
    'start_date': datetime(2026, 4, 8),
    'retries': 1,
}

with DAG('hadoop_tasks', default_args=default_args, schedule_interval='@daily') as dag:
    # Clean up HDFS temp files
    clean_hdfs = BashOperator(
        task_id='clean_hdfs',
        bash_command='hdfs dfs -rm -r /user/fgedu/tmp/*'
    )
    # Run the HDFS balancer
    balance_hdfs = BashOperator(
        task_id='balance_hdfs',
        bash_command='hdfs balancer'
    )
    # Back up data
    backup_data = BashOperator(
        task_id='backup_data',
        bash_command='hdfs dfs -cp /user/fgedu/data hdfs://backup-cluster/user/fgedu/backup/$(date +%Y%m%d)'
    )
    clean_hdfs >> balance_hdfs >> backup_data
# Configure crontab
# Clean up HDFS temp files at 01:00 every day
0 1 * * * hdfs dfs -rm -r /user/fgedu/tmp/*
# Run the HDFS balancer at 02:00 every Sunday
0 2 * * 0 hdfs balancer
# Back up data at 03:00 every day
0 3 * * * hdfs dfs -cp /user/fgedu/data hdfs://backup-cluster/user/fgedu/backup/$(date +%Y%m%d)
Part05 - Fenge's Experience and Takeaways
5.1 Solutions to common problems
Solutions to common problems:
- Automation script fails to execute: check the script's permissions, and …