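Outline
Part01-Basic Concepts and Theory
1.1 YARN Application Lifecycle
1.2 How Application Scheduling Works
1.3 Application Priority Mechanism
Part02-Production Planning and Recommendations
2.1 Application Management Planning
2.2 Scheduling Policy Planning
2.3 Application Monitoring Planning
Part03-Production Implementation
3.1 Application Lifecycle Management
3.2 Application Status Monitoring
3.3 Scheduling Policy Tuning
3.4 Application Priority Configuration
Part04-Production Cases and Walkthroughs
4.1 Batch Application Scheduling Case
4.2 Real-Time Application Scheduling Case
4.3 Application Failure Handling Case
Part05-Fengge's Lessons Learned
5.1 Application Management Best Practices
5.2 Scheduling Tuning Lessons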
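Part01-Basic Concepts and Theory
1.1 YARN Application Lifecycle
A YARN application passes through several states between submission and completion. The states include NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, and KILLED, among others.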
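Of the states listed above, FINISHED, FAILED, and KILLED are terminal; the rest are transitional. A minimal shell sketch, assuming a helper named `is_terminal_state` (a hypothetical name, not part of the YARN CLI), that a script could use to decide whether an application still needs watching:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit status 0) when a YARN application state is terminal.
is_terminal_state() {
  case "$1" in
    FINISHED|FAILED|KILLED) return 0 ;;   # the application has ended
    *) return 1 ;;                        # NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, ...
  esac
}

for state in ACCEPTED RUNNING KILLED; do
  if is_terminal_state "$state"; then
    echo "$state: terminal"
  else
    echo "$state: still in progress"
  fi
done
```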
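1.2 How Application Scheduling Works
The YARN scheduler assigns each application to a suitable queue and places its containers on suitable nodes. It bases scheduling decisions on queue capacity, application priority, resource availability, and similar factors.
1. The application is submitted to the ResourceManager
2. The ResourceManager places the application in a scheduling queue
3. The scheduler selects an application according to its policy
4. Containers are allocated to the application
5. The ApplicationMaster starts and manages the tasks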
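From the client side, steps 2 and 3 show up as the ACCEPTED state and steps 4 and 5 as the move to RUNNING. The sketch below polls an application until it reaches a wanted state. `wait_for_state` is a hypothetical helper name, and the parsing assumes the status report contains a `State` line like the sample reports in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the YARN CLI): poll an application until it
# reaches the wanted state. Assumes the status report has a "State" line.
wait_for_state() {
  local app_id="$1" wanted="$2" interval="${3:-5}" max_polls="${4:-60}"
  local state="" polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    state=$(yarn application -status "$app_id" 2>/dev/null |
      awk -F':' '/^[ \t]*State[ \t]*:/ { gsub(/[ \t]/, "", $2); print $2; exit }')
    echo "$app_id: ${state:-UNKNOWN}"
    case "$state" in
      "$wanted") return 0 ;;                 # reached the requested state
      FINISHED|FAILED|KILLED) return 1 ;;    # ended before reaching it
    esac
    polls=$((polls + 1))
    sleep "$interval"
  done
  return 2                                   # gave up after max_polls checks
}
```

A call such as `wait_for_state application_1705473600006 RUNNING 10` would check every 10 seconds; the return code distinguishes "reached", "ended early", and "timed out".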
1.3 Application Priority Mechanism
YARN supports application priorities; an application with a higher priority is given resources first.
cat /bigdata/app/hadoop/etc/hadoop/yarn-site.xml | grep -A3 "priority"
<property>
<name>yarn.cluster.max-application-priority</name>
<value>100</value>
</property>
<!-- Priority scheduling policy -->
<property>
<name>yarn.scheduler.capacity.root.default.priority</name>
<value>50</value>
</property>
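As we understand the ResourceManager's behaviour, a priority requested above `yarn.cluster.max-application-priority` is capped at that maximum rather than rejected. A minimal sketch of the same rule for pre-submission checks; `effective_priority` and `MAX_PRIORITY` are hypothetical names, and the value 100 mirrors the snippet above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: cap a requested priority at the cluster maximum.
# MAX_PRIORITY mirrors yarn.cluster.max-application-priority (assumed to be 100 here).
MAX_PRIORITY=100

effective_priority() {
  local requested="$1"
  case "$requested" in
    ''|*[!0-9]*)
      echo "priority must be a non-negative integer: $requested" >&2
      return 1 ;;
  esac
  if [ "$requested" -gt "$MAX_PRIORITY" ]; then
    echo "$MAX_PRIORITY"      # requests above the maximum are capped
  else
    echo "$requested"
  fi
}

effective_priority 50
effective_priority 150
```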
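Part02-Production Planning and Recommendations
2.1 Application Management Planning
Application management needs solid monitoring and operations processes behind it.
- Define application classification standards
- Set resource quotas for applications
- Configure application timeouts
- Set up application alerting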
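2.2 Scheduling Policy Planning
The scheduling policy should follow the characteristics of each workload.
# Note: "yarn scheduler -printClusterInfo" is not a subcommand of the stock Apache Hadoop YARN CLI; treat it as a site-specific wrapper.
# On a stock cluster, the ResourceManager REST endpoint http://fgedu01:8088/ws/v1/cluster/scheduler returns the scheduler type and queue settings.
yarn scheduler -printClusterInfo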
Cluster Info:
Scheduler: CapacityScheduler
Cluster Resources: memory: 491520 MB, vCores: 48
Scheduling Mode: FIFO per queue
Preemption Enabled: true
2.3 Application Monitoring Planning
Application monitoring should cover resource usage, run state, and performance metrics. Tip from Fengge: configure an alert on application run time.
yarn application -list -appStates RUNNING
Total Applications:5
Application-Id             Application-Name  Application-Type  User   State    Queue
application_1705473600001  etl-job           SPARK             fgedu  RUNNING  root.production.etl
application_1705473600002  streaming-job     SPARK             fgedu  RUNNING  root.production.realtime
application_1705473600003  hive-query        MAPREDUCE         fgedu  RUNNING  root.production.adhoc
application_1705473600004  spark-sql         SPARK             fgedu  RUNNING  root.development
application_1705473600005  flink-job         FLINK             fgedu  RUNNING  root.production.realtime
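A listing like the one above can be summarised per queue, which is a quick way to see where running applications concentrate. This is a sketch under stated assumptions: `count_by_queue` is a hypothetical helper, and the queue is taken from column 6 as in the sample listing (the column layout of a real cluster's output may differ, so the column is a parameter).

```shell
#!/usr/bin/env bash
# Hypothetical helper: count applications per queue from `yarn application -list`
# output on stdin. The queue column defaults to 6, matching the sample listing.
count_by_queue() {
  local col="${1:-6}"
  awk -v col="$col" '
    $1 ~ /^app(lication)?_/ { count[$col]++ }
    END { for (q in count) print count[q], q }
  ' | sort -k2
}

count_by_queue <<'EOF'
Application-Id Application-Name Application-Type User State Queue
application_1705473600001 etl-job SPARK fgedu RUNNING root.production.etl
application_1705473600002 streaming-job SPARK fgedu RUNNING root.production.realtime
application_1705473600005 flink-job FLINK fgedu RUNNING root.production.realtime
EOF
```

In practice the input would come from a pipe, for example `yarn application -list -appStates RUNNING | count_by_queue`.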
Part03-Production Implementation
3.1 Application Lifecycle Management
3.1.1 Submitting an Application
spark-submit --master yarn --deploy-mode cluster \
  --name fgedu-etl-job \
  --queue root.production.etl \
  --driver-memory 4g \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  --class com.fgedu.etl.DataProcessor \
  /bigdata/app/spark/etl-job.jar
# Check application status
yarn application -list -appTypes SPARK
24/01/17 17:00:00 INFO client.RMProxy: Connecting to ResourceManager
24/01/17 17:00:05 INFO yarn.Client: Requesting a new application from RM
24/01/17 17:00:10 INFO yarn.Client: Submitted application application_1705473600006
# Application status
Total Applications:1
Application-Id             Application-Name  User   State    Final-Status
application_1705473600006  fgedu-etl-job     fgedu  RUNNING  UNDEFINED
3.1.2 Managing Application State
yarn application -status application_1705473600006
# List application attempts
yarn applicationattempt -list application_1705473600006
Application Report:
Application-Id: application_1705473600006
Application-Name: fgedu-etl-job
Application-Type: SPARK
User: fgedu
Queue: root.production.etl
State: RUNNING
Final-Status: UNDEFINED
Started-Time: 1705473600000
Elapsed-Time: 300000 ms
Tracking-URL: http://fgedu01:8088/proxy/application_1705473600006/
RPC-Port: 4040
AM-Host: fgedu02
Aggregate-Resource-Allocation: 819200 MB-seconds, 400 vcore-seconds
# Application attempts
Total Application Attempts:1
Attempt-Id                     State    AM-Container-Id
appattempt_1705473600006_0001  RUNNING  container_1705473600006_0001_01_000001
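The aggregate figures in the report above are easier to read as averages: dividing MB-seconds and vcore-seconds by the elapsed time gives the mean footprint over the run. A small worked sketch, with `avg_allocation` as a hypothetical helper name and the three inputs taken from the sample report (819200 MB-seconds, 400 vcore-seconds, 300000 ms):

```shell
#!/usr/bin/env bash
# Hypothetical helper: average memory and vcores held over the run, computed
# from Aggregate-Resource-Allocation and Elapsed-Time (milliseconds).
avg_allocation() {
  local mb_seconds="$1" vcore_seconds="$2" elapsed_ms="$3"
  awk -v mb="$mb_seconds" -v vc="$vcore_seconds" -v ms="$elapsed_ms" 'BEGIN {
    if (ms <= 0) { print "elapsed time must be positive" > "/dev/stderr"; exit 1 }
    secs = ms / 1000
    printf "%.0f MB, %.2f vcores\n", mb / secs, vc / secs
  }'
}

avg_allocation 819200 400 300000
```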
3.1.3 Terminating an Application
yarn application -kill application_1705473600006
# Fail the current application attempt (-fail belongs to "yarn applicationattempt" and takes an attempt ID)
yarn applicationattempt -fail appattempt_1705473600006_0001
# List stopped applications
yarn application -list -appStates FINISHED,KILLED,FAILED
24/01/17 17:10:00 INFO client.RMProxy: Connecting to ResourceManager
Killing application application_1705473600006
24/01/17 17:10:05 INFO impl.YarnClientImpl: Killed application application_1705473600006
# Stopped applications
Total Applications:1
Application-Id             Application-Name  User   State   Final-Status
application_1705473600006  fgedu-etl-job     fgedu  KILLED  KILLED
3.2 Application Status Monitoring
3.2.1 Monitoring Application Resources
yarn application -status application_1705473600001 | grep -A5 "Aggregate"
# List the containers of an application attempt
yarn container -list appattempt_1705473600001_0001
Aggregate Resource Allocation: 2457600 MB-seconds, 1200 vcore-seconds
Aggregate Resource Preempted: 0 MB-seconds, 0 vcore-seconds
# Container list
Total Containers: 11
Container-Id                            State    Node-Id       Resource
container_1705473600001_0001_01_000001  RUNNING  fgedu01:8041  memory:4096, vCores:1
container_1705473600001_0001_01_000002  RUNNING  fgedu02:8041  memory:8192, vCores:4
container_1705473600001_0001_01_000003  RUNNING  fgedu03:8041  memory:8192, vCores:4
…
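To see what an attempt currently holds, the per-container resources can be summed. The sketch below assumes the `memory:<MB>, vCores:<n>` layout of the sample listing above; `sum_container_resources` is a hypothetical helper, and real `yarn container -list` output may use other columns.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total the memory and vCores of the containers listed on stdin.
sum_container_resources() {
  awk '
    $1 ~ /^container_/ {
      # "memory:" and "vCores:" are both 7 characters long
      if (match($0, /memory:[0-9]+/)) mem += substr($0, RSTART + 7, RLENGTH - 7)
      if (match($0, /vCores:[0-9]+/)) cores += substr($0, RSTART + 7, RLENGTH - 7)
      n++
    }
    END { printf "%d containers, %d MB, %d vCores\n", n, mem, cores }
  '
}

sum_container_resources <<'EOF'
container_1705473600001_0001_01_000001 RUNNING fgedu01:8041 memory:4096, vCores:1
container_1705473600001_0001_01_000002 RUNNING fgedu02:8041 memory:8192, vCores:4
container_1705473600001_0001_01_000003 RUNNING fgedu03:8041 memory:8192, vCores:4
EOF
```

The demo input covers only the three containers shown in the sample; the listing itself is truncated.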
3.2.2 Viewing Application Logs
yarn logs -applicationId application_1705473600001
# View the log of a specific container
yarn logs -applicationId application_1705473600001 -containerId container_1705473600001_0001_01_000001
# Download application logs (-out names a local directory; container logs are written beneath it)
yarn logs -applicationId application_1705473600001 -out /bigdata/logs/app_1705473600001.log
Container: container_1705473600001_0001_01_000001 on fgedu01:8041
LogAggregationType: AGGREGATED
LogType:stderr
24/01/17 17:00:00 INFO spark.SparkContext: Running Spark version 3.3.0
24/01/17 17:00:05 INFO spark.SparkContext: Submitted application: fgedu-etl-job
24/01/17 17:00:10 INFO spark.SparkContext: Starting SparkContext
…
LogType:stdout
Processing data partition 1
Processing data partition 2
…
# Logs downloaded successfully
Log saved to /bigdata/logs/app_1705473600001.log
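Aggregated logs interleave several log types per container, as the sample above shows. A minimal sketch for pulling out one type from saved output; `extract_log_type` is a hypothetical helper, and it assumes each section starts with a `LogType:<name>` line as in the sample. Any extra per-section header lines that a given Hadoop version prints would pass through unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only the lines belonging to one LogType section
# (for example stderr or stdout) from `yarn logs` output on stdin.
extract_log_type() {
  awk -v want="LogType:$1" '
    /^LogType:/        { printing = ($0 == want); next }
    /^End of LogType:/ { printing = 0; next }
    /^Container: /     { printing = 0 }
    printing           { print }
  '
}

extract_log_type stdout <<'EOF'
Container: container_1705473600001_0001_01_000001 on fgedu01:8041
LogAggregationType: AGGREGATED
LogType:stderr
24/01/17 17:00:00 INFO spark.SparkContext: Running Spark version 3.3.0
LogType:stdout
Processing data partition 1
Processing data partition 2
EOF
```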
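3.3 Scheduling Policy Tuning
3.3.1 Configuring the Scheduling Policy
cat /bigdata/app/hadoop/etc/hadoop/capacity-scheduler.xml | grep -A3 "ordering-policy"
<property>
<name>yarn.scheduler.capacity.root.production.etl.ordering-policy</name>
<value>fifo</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.production.realtime.ordering-policy</name>
<value>fair</value>
</property>
<!-- Preemption settings (verify these two names against your Hadoop version; the documented switch is yarn.resourcemanager.scheduler.monitor.enable in yarn-site.xml, see 3.3.2) -->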
<property>
<name>yarn.scheduler.capacity.preemption.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.scheduler.capacity.preemption.monitor.enabled</name>
<value>true</value>
</property>
3.3.2 Enabling Preemption
cat /bigdata/app/hadoop/etc/hadoop/yarn-site.xml | grep -A3 "preemption"
<property>
<name>yarn.resourcemanager.scheduler.monitor.enable</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.monitor.policies</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>
<property>
<name>yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round</name>
<value>0.1</value>
</property>
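`total_preemption_per_round` is the maximum fraction of cluster resources that may be preempted in a single round. A quick worked calculation, assuming the cluster size quoted in section 2.2 (491520 MB, 48 vCores); `preemption_budget` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: upper bound on resources preempted in one round,
# i.e. cluster total multiplied by total_preemption_per_round.
preemption_budget() {
  awk -v total="$1" -v fraction="$2" 'BEGIN { printf "%.1f\n", total * fraction }'
}

echo "memory per round: $(preemption_budget 491520 0.1) MB"
echo "vcores per round: $(preemption_budget 48 0.1)"
```

With the value 0.1 above, at most about 48 GB of memory can be reclaimed per round on this cluster, so a large imbalance is corrected over several rounds rather than at once.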
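3.4 Application Priority Configuration
3.4.1 Setting Application Priority
# spark-submit has no --priority flag; on YARN the priority is set through spark.yarn.priority (Spark 3.0 and later)
spark-submit --master yarn --deploy-mode cluster \
  --name fgedu-critical-job \
  --queue root.production.etl \
  --conf spark.yarn.priority=100 \
  --class com.fgedu.etl.CriticalProcessor \
  /bigdata/app/spark/critical-job.jar
# Check the application priority
yarn application -status application_1705473600007 | grep Priority
24/01/17 17:30:00 INFO yarn.Client: Submitted application application_1705473600007
# Application priority
Priority: 100
# Priority range: 0-100 (the upper bound is yarn.cluster.max-application-priority); a larger value means a higher priority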
3.4.2 Changing Priority at Runtime
yarn application -appId application_1705473600007 -updatePriority 90
# Verify the priority update
yarn application -status application_1705473600007 | grep Priority
24/01/17 17:35:00 INFO client.RMProxy: Connecting to ResourceManager
Application application_1705473600007 priority updated to 90
# Verification result
Priority: 90
Part04-Production Cases and Walkthroughs
4.1 Batch Application Scheduling Case
Batch scheduling has to balance resource utilization against execution time.
#!/bin/bash
# batch_job_scheduler.sh
# Batch job submission script
QUEUE="root.production.etl"
PRIORITY=80
# Batch jobs to submit
JOBS=("daily-etl" "data-sync" "report-gen")
for job in "${JOBS[@]}"; do
  echo "Submitting job: ${job}"
  # The class name is built from the job name as in the source; Java class names
  # cannot contain "-", so map each job to its real class before using this
  spark-submit --master yarn --deploy-mode cluster \
    --name "${job}-$(date +%Y%m%d)" \
    --queue "${QUEUE}" \
    --conf spark.yarn.priority="${PRIORITY}" \
    --driver-memory 4g \
    --executor-memory 8g \
    --executor-cores 4 \
    --num-executors 5 \
    --class "com.fgedu.batch.${job}" \
    /bigdata/app/spark/batch-jobs.jar
  # Pause before submitting the next job
  sleep 60
done
echo "All batch jobs submitted"
./batch_job_scheduler.sh
Submitting job: daily-etl
24/01/17 18:00:00 INFO yarn.Client: Submitted application application_1705473600010
Submitting job: data-sync
24/01/17 18:10:00 INFO yarn.Client: Submitted application application_1705473600011
Submitting job: report-gen
24/01/17 18:20:00 INFO yarn.Client: Submitted application application_1705473600012
All batch jobs submitted
# Check job status
yarn application -list -appStates RUNNING,ACCEPTED
Total Applications:3
Application-Id             Application-Name     State
application_1705473600010  daily-etl-20240117   RUNNING
application_1705473600011  data-sync-20240117   RUNNING
application_1705473600012  report-gen-20240117  ACCEPTED
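The script above passes `com.fgedu.batch.${job}` to `--class`, but a job name such as `daily-etl` contains a hyphen, which is not legal in a Java class name. One possible convention, purely an assumption for illustration, is to turn the hyphenated job name into CamelCase; `job_to_class` is a hypothetical helper and the resulting class names are not taken from the source.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a hyphenated job name into a CamelCase class name,
# e.g. daily-etl becomes DailyEtl. The naming convention is an assumption.
job_to_class() {
  echo "$1" | awk -F'-' '{
    out = ""
    for (i = 1; i <= NF; i++) out = out toupper(substr($i, 1, 1)) substr($i, 2)
    print out
  }'
}

for job in daily-etl data-sync report-gen; do
  echo "${job} -> com.fgedu.batch.$(job_to_class "$job")"
done
```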
4.2 Real-Time Application Scheduling Case
Real-time applications need a stable resource allocation.
spark-submit --master yarn --deploy-mode cluster \
  --name fgedu-streaming-job \
  --queue root.production.realtime \
  --conf spark.yarn.priority=100 \
  --driver-memory 8g \
  --executor-memory 16g \
  --executor-cores 8 \
  --num-executors 3 \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.yarn.am.waitTime=300s \
  --class com.fgedu.streaming.RealtimeProcessor \
  /bigdata/app/spark/streaming-job.jar
24/01/17 18:30:00 INFO yarn.Client: Submitted application application_1705473600020
24/01/17 18:30:05 INFO yarn.Client: Application report for application_1705473600020
# Application run status
yarn application -status application_1705473600020
Application Report:
Application-Id: application_1705473600020
Application-Name: fgedu-streaming-job
State: RUNNING
Queue: root.production.realtime
Priority: 100
Started-Time: 1705474200000
Elapsed-Time: 60000 ms
4.3 Application Failure Handling Case
4.3.1 Diagnosing a Failed Application
yarn application -list -appStates FAILED
# Show the failure reason
yarn application -status application_1705473600030 | grep -A5 "Diagnostics"
Total Applications:1
Application-Id             Application-Name  State   Final-Status
application_1705473600030  test-job          FAILED  FAILED
# Failure diagnostics
Diagnostics:
Application application_1705473600030 failed 1 times due to AM Container for appattempt_1705473600030_000001 exited with exitCode: -104
Failing this attempt.
Diagnostics report from attempt:
Container [pid=12345,containerID=container_1705473600030_0001_01_000001] is running beyond physical memory limits.
Current usage: 8.5 GB of 8 GB physical memory used; 17.2 GB of 16.8 GB virtual memory used.
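The negative exit code in the diagnostics is one of YARN's container exit statuses; -104 means the container exceeded its physical memory allocation, and -103 is the virtual memory counterpart. A small lookup sketch for triage scripts, with `extract_exit_code` and `describe_exit_code` as hypothetical helper names:

```shell
#!/usr/bin/env bash
# Hypothetical helpers: pull the exit code out of a diagnostics message and
# translate the common YARN container exit statuses.
extract_exit_code() {
  sed -n 's/.*exitCode: \(-\{0,1\}[0-9]\{1,\}\).*/\1/p' | head -n 1
}

describe_exit_code() {
  case "$1" in
    0)    echo "SUCCESS" ;;
    -100) echo "ABORTED: released by the application or lost, e.g. node failure" ;;
    -101) echo "DISKS_FAILED: too many NodeManager local or log directories went bad" ;;
    -102) echo "PREEMPTED: container preempted by the framework" ;;
    -103) echo "KILLED_EXCEEDED_VMEM: exceeded allocated virtual memory" ;;
    -104) echo "KILLED_EXCEEDED_PMEM: exceeded allocated physical memory" ;;
    -105) echo "KILLED_BY_APPMASTER: stopped at the ApplicationMaster's request" ;;
    -106) echo "KILLED_BY_RESOURCEMANAGER: terminated by the ResourceManager" ;;
    -107) echo "KILLED_AFTER_APP_COMPLETION: terminated after the application finished" ;;
    *)    echo "exit code $1: not a YARN-defined status, check the container logs" ;;
  esac
}

code=$(echo "AM Container for appattempt_1705473600030_000001 exited with exitCode: -104" | extract_exit_code)
describe_exit_code "$code"
```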
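4.3.2 Recovering the Application
The failed container (container_1705473600030_0001_01_000001) is the AM container, which in cluster mode also hosts the Spark driver, so its limit is governed by --driver-memory and spark.driver.memoryOverhead. The resubmission below follows the source and raises executor memory and overhead.
spark-submit --master yarn --deploy-mode cluster \
  --name test-job-retry \
  --queue root.development \
  --driver-memory 4g \
  --executor-memory 16g \
  --conf spark.executor.memoryOverhead=4g \
  --class com.fgedu.TestJob \
  /bigdata/app/spark/test-job.jar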
# Monitor the application run
yarn application -status application_1705473600031
24/01/17 19:00:00 INFO yarn.Client: Submitted application application_1705473600031
# Application completed successfully
Application Report:
Application-Id: application_1705473600031
Application-Name: test-job-retry
State: FINISHED
Final-Status: SUCCEEDED
Elapsed-Time: 300000 ms
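The container size that Spark requests from YARN is the JVM heap plus the memory overhead, so the retry above asks for 16 GB + 4 GB per executor. When no overhead is set, Spark's default is 10% of the heap with a floor of 384 MB. A worked sketch of that arithmetic; `container_request_mb` is a hypothetical helper, and YARN may round the result up to its allocation increment, which is not modelled here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: memory (MB) requested per container = heap + overhead.
# With no explicit overhead, use Spark's default: max(384 MB, 10% of the heap).
container_request_mb() {
  local heap_mb="$1" overhead_mb="${2:-}"
  if [ -z "$overhead_mb" ]; then
    overhead_mb=$((heap_mb / 10))
    if [ "$overhead_mb" -lt 384 ]; then
      overhead_mb=384
    fi
  fi
  echo $((heap_mb + overhead_mb))
}

echo "16g heap + 4g overhead: $(container_request_mb 16384 4096) MB"
echo "8g heap, default overhead: $(container_request_mb 8192) MB"
```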
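Part05-Fengge's Lessons Learned
5.1 Application Management Best Practices
In production, application management should pay attention to the following points:
1. Set sensible resource quotas for applications
2. Configure application timeouts and retries
3. Set up application monitoring and alerting
4. Clean up historical application logs regularly
5. Prepare an emergency plan for application failures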
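5.2 Scheduling Tuning Lessons
5.2.1 Tuning Recommendations
- Choose the scheduling policy to match the workload
- Configure preemption parameters sensibly
- Monitor scheduling latency
- Avoid splitting queues too finely
- Review scheduling results regularly
5.2.2 Application Monitoring Script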
#!/bin/bash
# app_monitor.sh
ALERT_DURATION=3600000 # 1 hour, in milliseconds
echo "=== YARN Application Monitor ==="
echo "Date: $(date)"
# Find applications that have been running too long
yarn application -list -appStates RUNNING | while read -r line; do
  APP_ID=$(echo "${line}" | awk '{print $1}')
  if [[ ${APP_ID} == application_* ]]; then
    # Take the first number on the Elapsed-Time line (milliseconds);
    # stdin is redirected so the inner command cannot consume the loop's input
    ELAPSED=$(yarn application -status "${APP_ID}" < /dev/null | grep "Elapsed-Time" | grep -oE '[0-9]+' | head -n 1)
    if [[ -n ${ELAPSED} && ${ELAPSED} -gt ${ALERT_DURATION} ]]; then
      echo "WARNING: Application ${APP_ID} running for $((ELAPSED / 1000 / 60)) minutes"
    fi
  fi
done
=== YARN Application Monitor ===
Date: Wed Jan 17 19:30:00 CST 2024
WARNING: Application application_1705473600020 running for 90 minutes
# Long run times are normal for streaming applications
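Since long run times are expected for streaming jobs, the run-time alert can skip the queues that host them. A minimal sketch of that filter; `should_alert` and `EXEMPT_QUEUES` are hypothetical names, and treating `root.production.realtime` as the streaming queue is an assumption based on the examples in this article.

```shell
#!/usr/bin/env bash
# Hypothetical filter: alert only when the run time exceeds the threshold and
# the application is not in a queue reserved for long-running streaming jobs.
EXEMPT_QUEUES="root.production.realtime"   # assumed streaming queue(s), space-separated

should_alert() {
  local elapsed_ms="$1" threshold_ms="$2" queue="$3" q
  for q in $EXEMPT_QUEUES; do
    if [ "$q" = "$queue" ]; then
      return 1                              # streaming queue: never alert on run time
    fi
  done
  [ "$elapsed_ms" -gt "$threshold_ms" ]
}

if should_alert 5400000 3600000 root.production.etl; then
  echo "alert: batch job over threshold"
fi
if ! should_alert 5400000 3600000 root.production.realtime; then
  echo "no alert: streaming queue is exempt"
fi
```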
This article was compiled and published by 风哥教程 (Fengge Tutorials) for study and testing only; when reposting, cite the source: http://www.fgedu.net.cn/10327.html
