This document covers Spark monitoring and operations management, including the Spark monitoring stack, the Metrics system, the Web UI, the History Server, and Prometheus integration. It follows the Monitoring and Metrics sections of the official Spark documentation and is intended for big data development and operations engineers to use in learning and test environments; validate everything yourself before applying it to production.
Part01 - Basic Concepts and Theory
1.1 Overview of the Spark Monitoring Stack
The Spark monitoring stack spans several layers - the Web UI, the Metrics system, logs, and external monitoring systems - which together give full visibility into cluster and application state.
- Web UI: visual monitoring interface
- Metrics: metrics collection system
- Event Log: application event logs
- History Server: server for viewing completed applications
- External monitoring: Prometheus, Grafana, etc.
1.2 The Metrics System in Detail
Spark uses the Codahale/Dropwizard Metrics library to collect and report metrics.
# Metric sources
1. Master/Worker metrics
- Resource usage
- Number of applications
- Worker state
2. Driver metrics
- Task execution
- Memory usage
- Shuffle data
3. Executor metrics
- Task execution
- Memory usage
- Disk usage
# Metric types
- Counter: a monotonically increasing count
- Gauge: an instantaneous value
- Meter: a rate of events
- Histogram: a statistical distribution
- Timer: timing of events (rate plus duration distribution)
# Metric instances
- driver: Driver process metrics
- executor: Executor process metrics
- application: application metrics
- master: Master metrics
- worker: Worker metrics
# Sink types
- ConsoleSink: writes metrics to the console
- CsvSink: writes metrics to CSV files
- GraphiteSink: sends metrics to Graphite
- PrometheusServlet: exposes metrics in Prometheus format
- JmxSink: exposes metrics via JMX
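Sink settings do not have to live in metrics.properties; Spark also accepts them as configuration entries with the spark.metrics.conf. prefix, so a sink can be enabled for a single application at submit time. A minimal sketch that prints driver metrics to the console every 10 seconds (the application class and jar are placeholders):
# Enable the console sink for one application only (sketch)
$ spark-submit \
  --conf spark.metrics.conf.driver.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink \
  --conf spark.metrics.conf.driver.sink.console.period=10 \
  --conf spark.metrics.conf.driver.sink.console.unit=seconds \
  --class com.example.MyApp myapp.jar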
1.3 The Logging System in Detail
Spark produces three kinds of logs:
1. Event logs
- Application execution events
- Stored in HDFS
- Consumed by the History Server
2. Runtime logs
- Standard output/error
- Stored in local files
- Used for troubleshooting
3. Audit logs
- Records of user operations
- Used for security auditing
# Logging configuration
# Event log (spark-defaults.conf)
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://192.168.1.60:9000/spark-logs
spark.eventLog.compress=true
spark.eventLog.compression.codec=org.apache.spark.io.SnappyCompressionCodec
# Runtime log
# log4j2.properties
rootLogger.level=INFO
rootLogger.appenderRef.stdout.ref=console
rootLogger.appenderRef.rolling.ref=file
# Console appender
appender.console.type=Console
appender.console.name=console
appender.console.target=SYSTEM_OUT
appender.console.layout.type=PatternLayout
appender.console.layout.pattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Log levels
logger.spark.name=org.apache.spark
logger.spark.level=INFO
# Log rolling
appender.rolling.type=RollingFile
appender.rolling.name=file
appender.rolling.fileName=${sys:spark.log.dir}/spark.log
appender.rolling.filePattern=${sys:spark.log.dir}/spark-%d{yyyy-MM-dd}-%i.log.gz
appender.rolling.layout.type=PatternLayout
appender.rolling.layout.pattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
appender.rolling.policies.type=Policies
appender.rolling.policies.time.type=TimeBasedTriggeringPolicy
appender.rolling.policies.time.interval=1
appender.rolling.policies.time.modulate=true
appender.rolling.policies.size.type=SizeBasedTriggeringPolicy
appender.rolling.policies.size.size=100MB
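For the driver and executors to actually pick up a customized log4j2.properties, the file has to be shipped and referenced via JVM options. A sketch, assuming the file lives under the Spark conf directory (the application class and jar are placeholders):
# Point driver and executors at a custom log4j2.properties (sketch)
$ spark-submit \
  --files /bigdata/app/spark/conf/log4j2.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configurationFile=file:/bigdata/app/spark/conf/log4j2.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configurationFile=log4j2.properties" \
  --class com.example.MyApp myapp.jar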
Part02 - Production Environment Planning and Recommendations
2.1 Monitoring Architecture Planning
Recommended monitoring architecture:
1. Data collection layer
- Spark Metrics
- Event Log
- System metrics
2. Data storage layer
- Prometheus (time series)
- HDFS (event logs)
- Elasticsearch (logs)
3. Presentation layer
- Spark Web UI
- Grafana
- Kibana
4. Alerting layer
- Prometheus Alertmanager
- Custom alert scripts
# Monitoring architecture diagram
┌─────────────────────────────────────────┐
│            Data collection              │
│  Spark Metrics │ Event Log │ System     │
└─────────────────────────────────────────┘
                    │
┌─────────────────────────────────────────┐
│             Data storage                │
│  Prometheus │ HDFS │ Elasticsearch      │
└─────────────────────────────────────────┘
                    │
┌─────────────────────────────────────────┐
│             Presentation                │
│  Spark Web UI │ Grafana │ Kibana        │
└─────────────────────────────────────────┘
                    │
┌─────────────────────────────────────────┐
│              Alerting                   │
│  Alertmanager │ Notifications           │
└─────────────────────────────────────────┘
# Metric planning
1. Cluster metrics
- Resource utilization
- Worker state
- Number of applications
2. Application metrics
- Task execution time
- Memory usage
- Shuffle data volume
3. System metrics (see the node_exporter sketch below)
- CPU utilization
- Memory utilization
- Disk I/O
- Network I/O
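System metrics (CPU, memory, disk and network I/O) are normally collected with Prometheus node_exporter rather than through Spark itself. A sketch of the corresponding scrape job, assuming node_exporter runs on its default port 9100 on every Spark node (targets are illustrative):
# Added under scrape_configs: in /etc/prometheus/prometheus.yml
- job_name: 'spark-nodes'
  static_configs:
    - targets:
        - '192.168.1.60:9100'
        - '192.168.1.61:9100'
        - '192.168.1.62:9100'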
2.2 Alerting Planning
Recommended alerting plan:
1. Cluster alerts
- Worker down
- Insufficient resources
- Low disk space
2. Application alerts
- Application failure
- Task execution timeout
- Out of memory
3. Performance alerts
- Slow task execution
- Large shuffle data volume
- Long GC time
# Alert severity levels
- Critical: severe, requires immediate action
- Warning: needs attention
- Info: informational only
# Notification channels
- Email
- SMS
- WeCom / DingTalk
- Webhook (see the sketch below)
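As a sketch of the webhook channel, a DingTalk-style robot can be called with a plain curl POST; the URL and token below are placeholders and the message format follows DingTalk's text message convention:
#!/bin/bash
# send_alert.sh - minimal webhook notification sketch (URL/token are placeholders)
WEBHOOK_URL="https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"
MESSAGE="$1"
curl -s -X POST "$WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d "{\"msgtype\": \"text\", \"text\": {\"content\": \"Spark alert: ${MESSAGE}\"}}"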
# Example alert rules
# Prometheus rules
groups:
  - name: spark-alerts
    rules:
      - alert: SparkWorkerDown
        expr: spark_worker_alive == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Spark Worker down"
          description: "Worker {{ $labels.worker_id }} has been down for more than 5 minutes"
      - alert: SparkApplicationFailed
        expr: spark_app_status{status="failed"} > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Spark application failed"
          description: "Application {{ $labels.app_name }} failed"
2.3 Log Planning
Recommended log plan:
1. Event logs
- Location: hdfs://192.168.1.60:9000/spark-logs
- Retention: 30 days
- Compression: enabled
2. Runtime logs
- Location: /var/log/spark/
- Rolling policy: daily plus size-based
- Retention: 10 files
3. Audit logs
- Location: /var/log/spark/audit/
- Retention: 90 days
# Log cleanup script
#!/bin/bash
# spark_log_cleanup.sh
LOG_DIR=/var/log/spark
RETENTION_DAYS=30
# Remove old local log files
find $LOG_DIR -name "*.log.*" -mtime +$RETENTION_DAYS -delete
find $LOG_DIR -name "*.gz" -mtime +$RETENTION_DAYS -delete
# Remove expired HDFS event logs, using the modification date from the listing
# (covers both rolled event log directories and single event log files)
hdfs dfs -ls /spark-logs | grep "^[d-]" | awk '{print $6, $8}' | while read mod_date path; do
  if [ -n "$mod_date" ] && [ -n "$path" ]; then
    diff_days=$(( ($(date +%s) - $(date -d "$mod_date" +%s)) / 86400 ))
    if [ $diff_days -gt $RETENTION_DAYS ]; then
      hdfs dfs -rm -r -skipTrash "$path"
    fi
  fi
done
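To run the cleanup automatically, a crontab entry along these lines could be used (the script path is illustrative):
# crontab -e
# 0 2 * * * /bigdata/scripts/spark_log_cleanup.sh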
Part03 - Production Implementation
3.1 Web UI Monitoring
3.1.1 Master Web UI
# http://192.168.1.60:8080
# Web UI features
1. Cluster overview
- Number of workers
- Total cores
- Total memory
- Number of applications
2. Workers page
- Worker list
- Resource usage
- State information
3. Running Applications
- Currently running applications
- Resource allocation
- Run time
4. Completed Applications
- Finished applications
- Execution results
# Fetch cluster information via the REST API
$ curl http://192.168.1.60:8080/api/v1/applications
[
  {
    "id" : "app-20260408130000-0001",
    "name" : "fgedu-app",
    "attempts" : [ {
      "startTime" : "2026-04-08T13:00:00.000Z",
      "endTime" : "2026-04-08T13:30:00.000Z",
      "sparkUser" : "fgeduuser",
      "completed" : true,
      "appSparkVersion" : "3.5.1",
      "duration" : 1800000
    } ]
  }
]
# Fetch worker information
$ curl http://192.168.1.60:8080/api/v1/workers
[
  {
    "id" : "worker-20260408120000-192.168.1.61-8081",
    "host" : "192.168.1.61",
    "port" : 8081,
    "webuiaddress" : "http://192.168.1.61:8081",
    "cores" : 16,
    "coresused" : 8,
    "coresfree" : 8,
    "memory" : 98304,
    "memoryused" : 49152,
    "memoryfree" : 49152,
    "state" : "ALIVE",
    "lastheartbeat" : 1680941400000
  }
]
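For scripted checks, the worker listing above can be filtered with jq (a sketch, assuming jq is installed and the endpoint above is available in your Spark version), e.g. printing any worker that is not ALIVE:
$ curl -s http://192.168.1.60:8080/api/v1/workers | \
    jq -r '.[] | select(.state != "ALIVE") | .id'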
3.1.2 Application Web UI
# http://192.168.1.61:4040
# Web UI features
1. Jobs page
- Job list
- Execution status
- Run time
2. Stages page
- Stage list
- Task distribution
- Shuffle data
3. Storage page
- Cached RDDs
- Memory usage
- Disk usage
4. Environment page
- Environment variables
- Spark configuration
- Dependencies
5. Executors page
- Executor list
- Resource usage
- Task statistics
6. SQL page
- SQL execution plans
- Execution time
# Fetch job information via the REST API
$ curl http://192.168.1.61:4040/api/v1/applications/app-20260408130000-0001/jobs
[
  {
    "jobId" : 0,
    "name" : "count at FgeduApp.scala:25",
    "submissionTime" : "2026-04-08T13:00:00.000Z",
    "completionTime" : "2026-04-08T13:05:00.000Z",
    "stageIds" : [ 0, 1 ],
    "status" : "SUCCEEDED",
    "numTasks" : 200,
    "numActiveTasks" : 0,
    "numCompletedTasks" : 200,
    "numFailedTasks" : 0,
    "numSkippedTasks" : 0
  }
]
# Fetch executor information
$ curl http://192.168.1.61:4040/api/v1/applications/app-20260408130000-0001/executors
[
  {
    "id" : "driver",
    "hostPort" : "192.168.1.60:54321",
    "isActive" : true,
    "rddBlocks" : 0,
    "memoryUsed" : 0,
    "diskUsed" : 0,
    "totalCores" : 1,
    "maxMemory" : 4294967296,
    "memoryMetrics" : {
      "usedOnHeapStorageMemory" : 0,
      "usedOffHeapStorageMemory" : 0
    }
  }
]
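The same REST API also exposes per-stage details, which helps when hunting for the slowest stages of a run. A sketch (assuming jq is installed; the application ID is the one used above), listing the five stages with the highest executor run time:
$ curl -s http://192.168.1.61:4040/api/v1/applications/app-20260408130000-0001/stages | \
    jq -r '.[] | [.stageId, .status, .executorRunTime] | @tsv' | sort -k3 -nr | head -5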
3.2 Metrics Configuration
3.2.1 Configuring Metrics Sinks
$ cat > /bigdata/app/spark/conf/metrics.properties << 'EOF'
# Prometheus sink
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
worker.sink.prometheusServlet.path=/metrics/worker/prometheus
application.sink.prometheusServlet.path=/metrics/app/prometheus
# Console sink (for debugging)
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
# CSV sink
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=60
*.sink.csv.unit=seconds
*.sink.csv.directory=/var/log/spark/metrics
# Graphite sink
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=192.168.1.70
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark
# JMX sink
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
# Metric sources
# Master JVM metrics
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
# Worker JVM metrics
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
# Driver JVM metrics
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
# Executor JVM metrics
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
EOF
# Configure spark-defaults.conf
$ cat >> /bigdata/app/spark/conf/spark-defaults.conf << 'EOF'
spark.metrics.conf=/bigdata/app/spark/conf/metrics.properties
spark.metrics.namespace=spark
spark.ui.prometheus.enabled=true
spark.ui.prometheus.path=/metrics/prometheus
EOF
# Query the Prometheus-format metrics
$ curl http://192.168.1.60:8080/metrics/master/prometheus
# HELP spark_master_alive_workers Number of alive workers
# TYPE spark_master_alive_workers gauge
spark_master_alive_workers 5
# HELP spark_master_apps Number of applications
# TYPE spark_master_apps gauge
spark_master_apps 3
# HELP spark_master_workers Number of workers
# TYPE spark_master_workers gauge
spark_master_workers 5
# HELP spark_master_memory_mb Memory in MB
# TYPE spark_master_memory_mb gauge
spark_master_memory_mb{type="total"} 491520
spark_master_memory_mb{type="used"} 245760
spark_master_memory_mb{type="free"} 245760
# HELP spark_master_cores Cores
# TYPE spark_master_cores gauge
spark_master_cores{type="total"} 80
spark_master_cores{type="used"} 40
spark_master_cores{type="free"} 40
3.2.2 Prometheus Integration
$ cat >> /etc/prometheus/prometheus.yml << 'EOF'
scrape_configs:
  # Spark Master
  - job_name: 'spark-master'
    static_configs:
      - targets: ['192.168.1.60:8080']
    metrics_path: '/metrics/master/prometheus'
  # Spark Workers
  - job_name: 'spark-workers'
    static_configs:
      - targets:
          - '192.168.1.61:8081'
          - '192.168.1.62:8081'
          - '192.168.1.63:8081'
          - '192.168.1.64:8081'
          - '192.168.1.65:8081'
    metrics_path: '/metrics/worker/prometheus'
  # Spark Applications
  - job_name: 'spark-apps'
    static_configs:
      - targets: ['192.168.1.60:4040']
    metrics_path: '/metrics/app/prometheus'
EOF
# Restart Prometheus
$ systemctl restart prometheus
# Verify that Prometheus is scraping the targets
$ curl http://192.168.1.70:9090/api/v1/targets
{
  "status": "success",
  "data": {
    "activeTargets": [
      {
        "discoveredLabels": { "job": "spark-master" },
        "labels": { "job": "spark-master" },
        "scrapeUrl": "http://192.168.1.60:8080/metrics/master/prometheus",
        "lastError": "",
        "lastScrape": "2026-04-08T13:00:00.000Z",
        "health": "up"
      }
    ]
  }
}
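Once the targets report as up, the Master metrics shown in 3.2.1 can be queried through the Prometheus HTTP API. Two example queries (a sketch; the exact metric names depend on your configured metrics namespace):
# Cluster memory utilization (used / total)
$ curl -s http://192.168.1.70:9090/api/v1/query \
    --data-urlencode 'query=spark_master_memory_mb{type="used"} / spark_master_memory_mb{type="total"}'
# Free cores across the cluster
$ curl -s http://192.168.1.70:9090/api/v1/query \
    --data-urlencode 'query=spark_master_cores{type="free"}'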
3.3 History Server Configuration
$ cat >> /bigdata/app/spark/conf/spark-defaults.conf << 'EOF'
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://192.168.1.60:9000/spark-logs
spark.eventLog.compress=true
spark.eventLog.compression.codec=org.apache.spark.io.SnappyCompressionCodec
spark.eventLog.rolling.enabled=true
spark.eventLog.rolling.maxFileSize=128m
EOF
# Create the event log directory
$ hdfs dfs -mkdir -p /spark-logs
$ hdfs dfs -chmod 777 /spark-logs
# Configure the History Server
$ cat > /bigdata/app/spark/conf/spark-history-server.conf << 'EOF'
spark.history.fs.logDirectory=hdfs://192.168.1.60:9000/spark-logs
spark.history.fs.update.interval=10s
spark.history.retainedApplications=50
spark.history.ui.port=18080
spark.history.kerberos.enabled=false
spark.history.fs.cleaner.enabled=true
spark.history.fs.cleaner.maxAge=7d
spark.history.fs.cleaner.interval=1h
EOF
# Start the History Server
$ /bigdata/app/spark/sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /bigdata/app/spark/logs/spark-spark-org.apache.spark.deploy.history.HistoryServer-1-fgedu-spark-master.out
# Access the History Server
# http://192.168.1.60:18080
# List historical applications
$ curl http://192.168.1.60:18080/api/v1/applications
[
  {
    "id" : "app-20260408130000-0001",
    "name" : "fgedu-app",
    "attempts" : [ {
      "startTime" : "2026-04-08T13:00:00.000Z",
      "endTime" : "2026-04-08T13:30:00.000Z",
      "sparkUser" : "fgeduuser",
      "completed" : true,
      "appSparkVersion" : "3.5.1",
      "duration" : 1800000
    } ]
  }
]
# Stop the History Server
$ /bigdata/app/spark/sbin/stop-history-server.sh
Part04 - Production Cases and Hands-On Examples
4.1 Prometheus Monitoring Case
# 1. Configure alert rules
$ cat > /etc/prometheus/rules/spark-alerts.yml << 'EOF'
groups:
  - name: spark-cluster-alerts
    rules:
      - alert: SparkWorkerDown
        expr: spark_worker_alive == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Spark Worker down"
          description: "Worker {{ $labels.worker_id }} has been down for more than 5 minutes"
      - alert: SparkHighMemoryUsage
        expr: spark_worker_memory_used / spark_worker_memory_total > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Spark memory usage high"
          description: "Worker {{ $labels.worker_id }} memory usage is above 90%"
      - alert: SparkApplicationFailed
        expr: increase(spark_app_status{status="failed"}[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Spark application failed"
          description: "Application {{ $labels.app_name }} failed"
      - alert: SparkHighGCTime
        expr: spark_executor_jvm_gc_time / spark_executor_run_time > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Spark GC time too high"
          description: "Executor {{ $labels.executor_id }} spends more than 10% of its time in GC"
EOF
# 2. Configure Alertmanager
$ cat > /etc/prometheus/alertmanager.yml << 'EOF'
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.fgedu.net.cn:25'
  smtp_from: 'alert@fgedu.net.cn'
  smtp_auth_username: 'alert@fgedu.net.cn'
  smtp_auth_password: 'fgedu123'
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'critical-receiver'
    - match:
        severity: warning
      receiver: 'warning-receiver'
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'admin@fgedu.net.cn'
  - name: 'critical-receiver'
    email_configs:
      - to: 'admin@fgedu.net.cn'
    webhook_configs:
      - url: 'http://192.168.1.80:8060/dingtalk/webhook/send'
  - name: 'warning-receiver'
    email_configs:
      - to: 'admin@fgedu.net.cn'
EOF
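Before restarting, the rule file and the Alertmanager configuration can be validated (assuming promtool and amtool are available on the host):
$ promtool check rules /etc/prometheus/rules/spark-alerts.yml
$ amtool check-config /etc/prometheus/alertmanager.yml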
# 3. Restart the services
$ systemctl restart prometheus
$ systemctl restart alertmanager
# 4. Grafana Dashboard
# Import Spark dashboards
# Dashboard ID: 7059 (Spark Metrics Dashboard)
# Dashboard ID: 13187 (Spark Application Dashboard)
4.2 Daily Operations Case
#!/bin/bash
# spark_daily_check.sh
SPARK_MASTER="192.168.1.60:8080"
LOG_FILE="/var/log/spark/daily_check.log"
echo "========================================" >> $LOG_FILE
echo "Spark cluster daily check - $(date)" >> $LOG_FILE
echo "========================================" >> $LOG_FILE
# 1. Check Master status
echo "1. Master status" >> $LOG_FILE
curl -s http://$SPARK_MASTER/api/v1/applications | python -m json.tool >> $LOG_FILE
# 2. Check Worker status
echo "2. Worker status" >> $LOG_FILE
curl -s http://$SPARK_MASTER/api/v1/workers | python -m json.tool >> $LOG_FILE
# 3. Check resource usage
echo "3. Resource usage" >> $LOG_FILE
curl -s http://$SPARK_MASTER/metrics/master/prometheus | grep -E "spark_master_(memory|cores)" >> $LOG_FILE
# 4. Check local disk space
echo "4. Local disk space" >> $LOG_FILE
df -h /bigdata >> $LOG_FILE
# 5. Check HDFS space
echo "5. HDFS space" >> $LOG_FILE
hdfs dfs -df -h / >> $LOG_FILE
# 6. Check the event log directory
echo "6. Event log directory" >> $LOG_FILE
hdfs dfs -du -h /spark-logs >> $LOG_FILE
echo "Daily check finished" >> $LOG_FILE
# Schedule via cron
# crontab -e
# 0 8 * * * /bigdata/scripts/spark_daily_check.sh
4.3 Common Issues
4.3.1 Web UI Not Accessible
# Troubleshooting steps
# 1. Check the processes
$ jps | grep -E "Master|Worker|HistoryServer"
# 2. Check the ports
$ netstat -tlnp | grep -E "8080|8081|18080|4040"
# 3. Check the firewall
$ firewall-cmd --list-ports
# Fixes
# 1. Restart the services
$ /bigdata/app/spark/sbin/stop-all.sh
$ /bigdata/app/spark/sbin/start-all.sh
# 2. Open the ports
$ firewall-cmd --add-port=8080/tcp --permanent
$ firewall-cmd --add-port=8081/tcp --permanent
$ firewall-cmd --add-port=18080/tcp --permanent
$ firewall-cmd --reload
# 3. Check the configuration
spark.ui.port=8080
spark.ui.enabled=true
4.3.2 History Server Shows No Data
# Troubleshooting steps
# 1. Check the event log configuration
$ cat /bigdata/app/spark/conf/spark-defaults.conf | grep eventLog
# 2. Check the HDFS directory
$ hdfs dfs -ls /spark-logs
# 3. Check the History Server configuration
$ cat /bigdata/app/spark/conf/spark-history-server.conf
# Fixes
# 1. Make sure event logging is enabled
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://192.168.1.60:9000/spark-logs
# 2. Create the log directory
$ hdfs dfs -mkdir -p /spark-logs
$ hdfs dfs -chmod 777 /spark-logs
# 3. Restart the History Server
$ /bigdata/app/spark/sbin/stop-history-server.sh
$ /bigdata/app/spark/sbin/start-history-server.sh
Part05 - Experience Summary and Recommendations
5.1 Monitoring and Operations Best Practices
Recommended best practices for Spark monitoring and operations:
1. Enable event logging and the Metrics system
2. Deploy the History Server
3. Integrate Prometheus and Grafana
4. Define sensible alert rules
5. Inspect cluster health on a regular schedule
6. Keep enough log history for troubleshooting
5.2 Troubleshooting Tips
Troubleshooting suggestions:
- Start from the Web UI for detailed job and stage information
- Check log files to locate the problem
- Use Metrics to analyze performance bottlenecks
- Compare normal and abnormal application runs
5.3 Recommended Tools
Monitoring and operations tools:
- Spark Web UI: built-in monitoring interface
- Prometheus: metrics collection and alerting
- Grafana: visualization dashboards
- History Server: viewing historical applications
