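In this document, Feng Ge (风哥) introduces Apache Spark fundamentals and hands-on cluster deployment, covering Spark core concepts and architecture, cluster planning and deployment, core configuration, and application submission. It draws on the Overview, Cluster Overview, and Submitting Applications pages of the official Spark documentation. It is intended for big data developers and operations engineers to use for learning and testing; verify everything yourself before applying it to a production environment.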
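Part01 - Basic Concepts and Theory
1.1 Spark Core Concepts and Architecture
Apache Spark is a fast, general-purpose, scalable engine for large-scale data analytics. It was originally developed at UC Berkeley's AMPLab and later became a top-level Apache project. Spark keeps intermediate data in memory where it can, and the project has reported speedups of up to 100x over Hadoop MapReduce for in-memory workloads; the real gain depends on the job.
- Driver: the driver program; runs the application's main function and creates the SparkContext
- Executor: runs on worker nodes, executes tasks, and stores data
- Cluster Manager: manages cluster resources (Standalone, YARN, K8s)
- Worker Node: a node that runs Executor processes
- RDD: Resilient Distributed Dataset, Spark's core data abstraction
- DAG: directed acyclic graph that describes the dependencies between tasks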
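1.2 Spark Core Components in Detail
The Spark ecosystem consists of several core components:
1. Spark Core
   - The core engine; provides the RDD abstraction
   - Task scheduling, memory management, fault recovery
   - Interaction with storage systems
2. Spark SQL
   - Structured data processing
   - DataFrame and Dataset APIs
   - SQL query support
3. Spark Streaming
   - Real-time stream processing
   - Micro-batch processing model
   - Support for many data sources
4. MLlib
   - Machine learning library
   - Classification, regression, and clustering algorithms
   - Feature engineering and model evaluation
5. GraphX
   - Graph computation engine
   - Graph algorithms and graph operators
   - Property graph abstraction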
# Spark runtime architecture
               +-----------------+
               | Driver Program  |
               | (SparkContext)  |
               +--------+--------+
                        |
               +--------+--------+
               | Cluster Manager |
               +--------+--------+
                        |
      +-----------+-----+-----+-----------+
      |           |           |           |
  +---+----+  +---+----+  +---+----+  +---+----+
  | Worker |  | Worker |  | Worker |  | Worker |
  +---+----+  +---+----+  +---+----+  +---+----+
      |           |           |           |
  +---+----+  +---+----+  +---+----+  +---+----+
  |Executor|  |Executor|  |Executor|  |Executor|
  +--------+  +--------+  +--------+  +--------+
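1.3 Spark Key Features and Advantages
Key features of Spark:
- Fast: in-memory computation; up to 100x faster than MapReduce on in-memory workloads, depending on the job
- Easy to use: APIs for Java, Scala, Python, and R
- General-purpose: one platform for batch processing, stream processing, machine learning, and graph computation
- Compatible: works with HDFS, HBase, Kafka, and many other data sources
- Scalable: runs on the Standalone, YARN, and Kubernetes cluster managers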
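Part02 - Production Environment Planning and Recommendations
2.1 Spark Cluster Planning
Consider the following when planning a Spark cluster:
1. Development environment
   - Nodes: 1-3
   - Memory: 8-16 GB per node
   - CPU: 4-8 cores per node
2. Test environment
   - Nodes: 3-5
   - Memory: 16-32 GB per node
   - CPU: 8-16 cores per node
3. Production environment
   - Nodes: 5 or more
   - Memory: 64-256 GB per node
   - CPU: 16-64 cores per node
# Recommended production layout
Master node (192.168.1.60): Master + History Server
Worker node 1 (192.168.1.61): Worker
Worker node 2 (192.168.1.62): Worker
Worker node 3 (192.168.1.63): Worker
Worker node 4 (192.168.1.64): Worker
Worker node 5 (192.168.1.65): Worker
# Resource allocation guidelines
- Give Spark about 75% of each Worker node's physical memory
- Reserve 1-2 cores on each Worker node for the operating system
- Allocate 4-8 GB of memory to each Executor
- Allocate 2-5 cores to each Executor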
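The guidelines above can be turned into concrete numbers with a little shell arithmetic. The following is a minimal sketch, assuming a node with 128 GB of RAM and 32 cores and 8 GB / 4-core Executors; `plan_worker` and `executors_per_worker` are hypothetical helper names, not part of Spark, and the calculation ignores per-Executor memory overhead.

```shell
#!/usr/bin/env bash
# Sketch only: derive Standalone Worker settings from a node's hardware.
# plan_worker and executors_per_worker are hypothetical helpers, not Spark tools.

plan_worker() {
  local total_mem_gb=$1 total_cores=$2 reserved_cores=${3:-2}
  # About 75% of physical memory goes to Spark (integer math rounds down).
  local worker_mem_gb=$(( total_mem_gb * 75 / 100 ))
  # Leave the reserved cores to the operating system.
  local worker_cores=$(( total_cores - reserved_cores ))
  echo "SPARK_WORKER_MEMORY=${worker_mem_gb}g"
  echo "SPARK_WORKER_CORES=${worker_cores}"
}

executors_per_worker() {
  local worker_mem_gb=$1 worker_cores=$2 exec_mem_gb=$3 exec_cores=$4
  local by_mem=$(( worker_mem_gb / exec_mem_gb ))
  local by_cores=$(( worker_cores / exec_cores ))
  # The scarcer resource limits how many Executors fit on one Worker.
  if [ "$by_mem" -lt "$by_cores" ]; then echo "$by_mem"; else echo "$by_cores"; fi
}

plan_worker 128 32 2            # prints SPARK_WORKER_MEMORY=96g and SPARK_WORKER_CORES=30
executors_per_worker 96 30 8 4  # prints 7 (cores are the limiting resource here)
```

In this example the core count, not memory, caps the number of Executors, which is a hint that larger Executors or fewer reserved cores could be considered.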
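2.2 Hardware Resource Planning
Hardware planning recommendations:
# CPU planning
- Type: Intel Xeon or AMD EPYC
- Cores: 16-64 per node
- Advice: more cores allow higher parallelism
# Memory planning
- Capacity: 64-256 GB per node
- Type: DDR4 ECC memory
- Advice: the more memory, the more data can be cached
# Disk planning
- Type: SSD or NVMe SSD
- Capacity: 1-4 TB per node
- Purpose: shuffle data and RDD cache spill
- Advice: use multiple disks to improve I/O throughput
# Network planning
- Bandwidth: 1 GbE or 10 GbE NICs
- Advice: a low-latency network improves shuffle performance
# Example node configuration (mid-size cluster)
- CPU: 32 cores
- Memory: 128 GB
- Disk: 2 TB SSD
- Network: 10 GbE NIC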
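2.3 Deployment Mode Planning
Spark supports several deployment modes:
1. Standalone mode
   - Pros: simple to deploy, runs on its own
   - Cons: limited resource management
   - Best for: a dedicated Spark cluster
2. YARN mode
   - Pros: integrates with Hadoop and shares its resources
   - Cons: depends on Hadoop
   - Best for: sites with an existing Hadoop cluster
3. Kubernetes mode
   - Pros: containerized deployment, elastic scaling
   - Cons: more complex configuration
   - Best for: cloud-native environments
# Recommendations
- Existing Hadoop cluster: YARN mode
- Cloud-native environment: Kubernetes mode
- Dedicated deployment: Standalone mode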
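Each deployment mode corresponds to a different form of the `--master` URL. The sketch below maps a mode name to that URL; `master_url` is a hypothetical helper, and the Kubernetes API server port 6443 used as a default is an assumption that may differ in your cluster.

```shell
#!/usr/bin/env bash
# Sketch only: master_url is a hypothetical helper, not a Spark command.
master_url() {
  local mode=$1 host=${2:-} port=${3:-}
  case "$mode" in
    # Standalone Masters listen on 7077 by default.
    standalone) echo "spark://${host}:${port:-7077}" ;;
    # On YARN the ResourceManager address comes from the Hadoop configuration.
    yarn)       echo "yarn" ;;
    # 6443 is assumed here as the Kubernetes API server port.
    k8s)        echo "k8s://https://${host}:${port:-6443}" ;;
    *)          echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

master_url standalone 192.168.1.60   # prints spark://192.168.1.60:7077
master_url yarn                      # prints yarn
```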
Part03 - Production Implementation Plan
3.1 Installing and Deploying the Spark Cluster
3.1.1 Preparing the Environment
$ java -version
openjdk version "17.0.2" 2022-01-18 LTS
OpenJDK Runtime Environment (build 17.0.2+8-LTS-86)
OpenJDK 64-Bit Server VM (build 17.0.2+8-LTS-86, mixed mode, sharing)
# Create the spark user (run as root or through sudo)
$ useradd -m -s /bin/bash spark
$ id spark
uid=1002(spark) gid=1002(spark) groups=1002(spark)
# Create the installation directories
$ mkdir -p /bigdata/app/spark
$ mkdir -p /bigdata/spark-logs
$ mkdir -p /bigdata/spark-work
$ chown -R spark:spark /bigdata/app/spark
$ chown -R spark:spark /bigdata/spark-logs
$ chown -R spark:spark /bigdata/spark-work
# Set up passwordless SSH (all nodes)
$ ssh-keygen -t rsa
$ ssh-copy-id spark@192.168.1.60
$ ssh-copy-id spark@192.168.1.61
$ ssh-copy-id spark@192.168.1.62
$ ssh-copy-id spark@192.168.1.63
$ ssh-copy-id spark@192.168.1.64
$ ssh-copy-id spark@192.168.1.65
3.1.2 Downloading and Installing Spark
$ cd /bigdata/app
$ wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
--2026-04-08 10:00:00--  https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
Resolving archive.apache.org... 192.168.1.100
Connecting to archive.apache.org|192.168.1.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 304096789 (290M) [application/x-gzip]
Saving to: ‘spark-3.5.1-bin-hadoop3.tgz’
spark-3.5.1-bin-hadoop3.tgz 100%[=====================================>] 290.10M 15.2MB/s in 19s
2026-04-08 10:00:19 (15.2 MB/s) - ‘spark-3.5.1-bin-hadoop3.tgz’ saved
# Unpack. /bigdata/app/spark already exists from step 3.1.1, so extract straight
# into it; renaming the unpacked folder with mv would nest it one level too deep.
$ tar -zxf spark-3.5.1-bin-hadoop3.tgz -C /bigdata/app/spark --strip-components=1
$ chown -R spark:spark /bigdata/app/spark
# Inspect the directory layout
$ ls -la /bigdata/app/spark/
total 64
drwxr-xr-x 1 spark spark 4096 Apr 8 10:00 .
drwxr-xr-x 3 root root 4096 Apr 8 10:00 ..
drwxr-xr-x 2 spark spark 4096 Apr 8 10:00 bin
drwxr-xr-x 2 spark spark 4096 Apr 8 10:00 conf
drwxr-xr-x 5 spark spark 4096 Apr 8 10:00 data
drwxr-xr-x 4 spark spark 4096 Apr 8 10:00 examples
drwxr-xr-x 2 spark spark 4096 Apr 8 10:00 jars
drwxr-xr-x 4 spark spark 4096 Apr 8 10:00 kubernetes
-rw-r--r-- 1 spark spark 1453 Jan 1 00:00 LICENSE
drwxr-xr-x 2 spark spark 4096 Apr 8 10:00 licenses
-rw-r--r-- 1 spark spark 8762 Jan 1 00:00 NOTICE
drwxr-xr-x 7 spark spark 4096 Apr 8 10:00 python
drwxr-xr-x 3 spark spark 4096 Apr 8 10:00 R
-rw-r--r-- 1 spark spark 3720 Jan 1 00:00 RELEASE
drwxr-xr-x 2 spark spark 4096 Apr 8 10:00 sbin
drwxr-xr-x 2 spark spark 4096 Apr 8 10:00 yarn
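Before unpacking a downloaded release it is reasonable to check its integrity. The sketch below compares a file's SHA-512 digest with an expected value; `verify_sha512` is a hypothetical helper, and the expected digest is assumed to be copied by hand from the `.sha512` file that Apache publishes next to the tarball (the layout of that file varies, so it is not parsed here).

```shell
#!/usr/bin/env bash
# Sketch only: verify_sha512 is a hypothetical helper built on coreutils sha512sum.
verify_sha512() {
  local file=$1 expected=$2 actual
  # Normalise both digests to lower case before comparing.
  actual=$(sha512sum "$file" | awk '{print tolower($1)}')
  expected=$(printf '%s' "$expected" | tr 'A-F' 'a-f')
  [ "$actual" = "$expected" ]
}

# Usage (replace the placeholder with the published digest):
# verify_sha512 spark-3.5.1-bin-hadoop3.tgz "<published sha512>" && echo "checksum OK"
```

The function only returns an exit status, so it can gate the `tar` step with `&&`.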
3.2 Spark Core Configuration
3.2.1 Configuring spark-env.sh
$ cat > /bigdata/app/spark/conf/spark-env.sh << 'EOF'
#!/bin/bash
# spark-env.sh

# Java environment
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk

# Spark environment
export SPARK_HOME=/bigdata/app/spark
export SPARK_CONF_DIR=/bigdata/app/spark/conf

# Master settings
export SPARK_MASTER_HOST=192.168.1.60
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080

# Worker settings
export SPARK_WORKER_CORES=16
export SPARK_WORKER_MEMORY=96g
export SPARK_WORKER_DIR=/bigdata/spark-work
export SPARK_WORKER_WEBUI_PORT=8081

# History Server settings
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://192.168.1.60:9000/spark-logs"

# Daemon memory and JVM options
export SPARK_DAEMON_MEMORY=4g
export SPARK_DAEMON_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20"
EOF
$ chmod +x /bigdata/app/spark/conf/spark-env.sh
3.2.2 Configuring spark-defaults.conf
$ cat > /bigdata/app/spark/conf/spark-defaults.conf << 'EOF'
# Spark default configuration

# Application
spark.app.name                    fgedu-spark-app
spark.master                      spark://192.168.1.60:7077

# Resources
spark.executor.memory             8g
spark.executor.cores              4
spark.executor.instances          2
spark.driver.memory               4g
spark.driver.cores                2

# Serialization
spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max   512m

# Memory
spark.memory.fraction             0.6
spark.memory.storageFraction      0.5

# Shuffle
spark.shuffle.service.enabled     true
spark.shuffle.compress            true
spark.shuffle.spill.compress      true

# History Server
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs://192.168.1.60:9000/spark-logs
spark.history.fs.logDirectory     hdfs://192.168.1.60:9000/spark-logs

# UI
spark.ui.enabled                  true
spark.ui.port                     4040
spark.ui.retainedJobs             100
spark.ui.retainedStages           100

# Dynamic allocation
spark.dynamicAllocation.enabled   false
EOF
3.2.3 Configuring the workers File
$ cat > /bigdata/app/spark/conf/workers << 'EOF'
192.168.1.61
192.168.1.62
192.168.1.63
192.168.1.64
192.168.1.65
EOF
# Distribute the configuration to every Worker node
# (Spark must already be unpacked at the same path on each node)
$ scp -r /bigdata/app/spark/conf/* spark@192.168.1.61:/bigdata/app/spark/conf/
$ scp -r /bigdata/app/spark/conf/* spark@192.168.1.62:/bigdata/app/spark/conf/
$ scp -r /bigdata/app/spark/conf/* spark@192.168.1.63:/bigdata/app/spark/conf/
$ scp -r /bigdata/app/spark/conf/* spark@192.168.1.64:/bigdata/app/spark/conf/
$ scp -r /bigdata/app/spark/conf/* spark@192.168.1.65:/bigdata/app/spark/conf/
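Since the `workers` file already lists every Worker host, the per-host `scp` commands can be generated from it instead of being typed out. This is a minimal sketch under that assumption; `list_workers` and `distribute_conf` are hypothetical helpers, and `DRY_RUN` (default `1`, which only prints the commands) is a convention introduced here, not a Spark setting.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers that drive scp from the workers file.

# Print the hosts in a workers file, skipping comments and blank lines.
list_workers() {
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# Copy the conf directory contents to every Worker.
# With DRY_RUN=1 (the default) the commands are printed instead of executed.
distribute_conf() {
  local conf_dir=$1 workers_file=$2 host
  for host in $(list_workers "$workers_file"); do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "scp -r ${conf_dir}/* spark@${host}:${conf_dir}/"
    else
      scp -r "${conf_dir}"/* "spark@${host}:${conf_dir}/"
    fi
  done
}

# Usage:
# distribute_conf /bigdata/app/spark/conf /bigdata/app/spark/conf/workers
# DRY_RUN=0 distribute_conf /bigdata/app/spark/conf /bigdata/app/spark/conf/workers
```

Reviewing the dry-run output first makes it easy to spot a wrong host before anything is copied.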
3.3 Starting and Verifying the Spark Cluster
3.3.1 Starting the Spark Cluster
# Create the event log directory in HDFS
$ hdfs dfs -mkdir -p /spark-logs
# Start the Master
$ /bigdata/app/spark/sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /bigdata/app/spark/logs/spark-spark-org.apache.spark.deploy.master.Master-1-fgedu-spark-master.out
# Start all Workers
$ /bigdata/app/spark/sbin/start-workers.sh
192.168.1.61: starting org.apache.spark.deploy.worker.Worker, logging to /bigdata/app/spark/logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-fgedu-spark-worker1.out
192.168.1.62: starting org.apache.spark.deploy.worker.Worker, logging to /bigdata/app/spark/logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-fgedu-spark-worker2.out
192.168.1.63: starting org.apache.spark.deploy.worker.Worker, logging to /bigdata/app/spark/logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-fgedu-spark-worker3.out
192.168.1.64: starting org.apache.spark.deploy.worker.Worker, logging to /bigdata/app/spark/logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-fgedu-spark-worker4.out
192.168.1.65: starting org.apache.spark.deploy.worker.Worker, logging to /bigdata/app/spark/logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-fgedu-spark-worker5.out
# Start the History Server
$ /bigdata/app/spark/sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /bigdata/app/spark/logs/spark-spark-org.apache.spark.deploy.history.HistoryServer-1-fgedu-spark-master.out
# Check the processes on the Master node
$ jps
12345 Master
12456 HistoryServer
12567 Jps
# Check a Worker node
$ ssh spark@192.168.1.61 jps
12345 Worker
12456 Jps
3.3.2 Verifying Cluster Status
# Master Web UI: http://192.168.1.60:8080
# History Server: http://192.168.1.60:18080
# Query the cluster status (the worker list below is shortened to two of the five entries)
$ curl http://192.168.1.60:8080/json
{
  "url" : "spark://192.168.1.60:7077",
  "workers" : [ {
    "id" : "worker-20260408100000-192.168.1.61-8081",
    "host" : "192.168.1.61",
    "port" : 8081,
    "cores" : 16,
    "coresUsed" : 0,
    "memory" : 98304,
    "memoryUsed" : 0,
    "state" : "ALIVE"
  }, {
    "id" : "worker-20260408100000-192.168.1.62-8081",
    "host" : "192.168.1.62",
    "port" : 8081,
    "cores" : 16,
    "coresUsed" : 0,
    "memory" : 98304,
    "memoryUsed" : 0,
    "state" : "ALIVE"
  } ],
  "cores" : 80,
  "coresUsed" : 0,
  "memory" : 491520,
  "memoryUsed" : 0
}
# Run a test application
$ /bigdata/app/spark/bin/run-example SparkPi 10
Pi is roughly 3.1415791415791416
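The status check can be reduced to a single number by counting the Workers that report `ALIVE`. The sketch below does this with `grep` only; `count_alive_workers` is a hypothetical helper, and it assumes the Master prints one JSON field per line, as in the output shown above.

```shell
#!/usr/bin/env bash
# Sketch only: count Workers in state ALIVE from the Master's /json output on stdin.
# Relies on pretty-printed JSON with one field per line; it is not a JSON parser.
count_alive_workers() {
  # grep -c prints the number of matching lines (and exits non-zero when it is 0).
  grep -c -E '"state"[[:space:]]*:[[:space:]]*"ALIVE"'
}

# Usage against the Master from this guide (5 is expected for the five-Worker layout):
# curl -s http://192.168.1.60:8080/json | count_alive_workers
```

For anything beyond a quick health check, a real JSON tool is the safer choice.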
Part04 - Production Cases and Hands-on Walkthrough
4.1 Spark Shell in Practice
4.1.1 Starting the Spark Shell
$ /bigdata/app/spark/bin/spark-shell \
--master spark://192.168.1.60:7077 \
--executor-memory 4g \
--total-executor-cores 4
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.60:4040
Spark context available as 'sc' (master = spark://192.168.1.60:7077, app id = app-20260408100000-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 17.0.2)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
4.1.2 Basic Operations
scala> val data = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
data: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val rdd = sc.parallelize(data, 3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at
# Basic actions (each one triggers a job and returns a value to the driver)
scala> rdd.count()
res0: Long = 10
scala> rdd.sum()
res1: Double = 55.0
scala> rdd.mean()
res2: Double = 5.5
scala> rdd.max()
res3: Int = 10
scala> rdd.min()
res4: Int = 1
# Map transformation
scala> val mapped = rdd.map(_ * 2)
mapped: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at
scala> mapped.collect()
res5: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
# Filter transformation
scala> val filtered = rdd.filter(_ > 5)
filtered: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at
scala> filtered.collect()
res6: Array[Int] = Array(6, 7, 8, 9, 10)
# Reduce action
scala> rdd.reduce(_ + _)
res7: Int = 55
# Exit the Spark Shell
scala> :quit
4.2 Submitting Spark Applications
4.2.1 Submitting an Application
$ /bigdata/app/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://192.168.1.60:7077 \
--executor-memory 4g \
--total-executor-cores 4 \
/bigdata/app/spark/examples/jars/spark-examples_2.12-3.5.1.jar \
100
2026-04-08 10:30:00 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2026-04-08 10:30:01 INFO SparkContext:54 - Running Spark version 3.5.1
2026-04-08 10:30:02 INFO ResourceUtils:54 - Starting Spark Master
2026-04-08 10:30:03 INFO SparkContext:54 - Submitted application: SparkPi
...
Pi is roughly 3.1415591415591416
# Submit a Python application
$ /bigdata/app/spark/bin/spark-submit \
--master spark://192.168.1.60:7077 \
--executor-memory 4g \
/bigdata/app/spark/examples/src/main/python/pi.py \
100
Pi is roughly 3.141240
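4.2.2 Submit Options
--master MASTER_URL          # Master URL of the cluster
--deploy-mode DEPLOY_MODE    # Deploy mode: client or cluster
--class CLASS_NAME           # Main class (Java/Scala applications)
--name NAME                  # Application name
--jars JARS                  # Comma-separated dependency JARs
--packages PACKAGES          # Maven coordinates of dependencies
--executor-memory MEM        # Memory per Executor
--executor-cores NUM         # Cores per Executor
--num-executors NUM          # Number of Executors (YARN and Kubernetes only)
--total-executor-cores NUM   # Total cores across all Executors (used with Standalone in 4.2.1)
--driver-memory MEM          # Driver memory
--driver-cores NUM           # Driver cores (cluster deploy mode only)
--conf PROP=VALUE            # Arbitrary Spark configuration property
# Example: submit an application to YARN
$ /bigdata/app/spark/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--name fgedu-spark-app \
--executor-memory 8g \
--executor-cores 4 \
--num-executors 10 \
--conf spark.dynamicAllocation.enabled=false \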
/bigdata/app/spark-apps/fgedu-app.jar
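The Executor sizing flags differ between cluster managers: the YARN example uses `--num-executors`, while the Standalone examples in 4.2.1 use `--total-executor-cores`. The sketch below prints the matching flags for a desired Executor count; `executor_flags` is a hypothetical helper, and treating "Executors times cores per Executor" as the Standalone total is a simplifying assumption.

```shell
#!/usr/bin/env bash
# Sketch only: executor_flags is a hypothetical helper, not part of spark-submit.
executor_flags() {
  local mode=$1 executors=$2 cores=$3
  case "$mode" in
    # YARN and Kubernetes take an explicit Executor count.
    yarn|k8s)
      echo "--num-executors ${executors} --executor-cores ${cores}" ;;
    # Standalone is sized by the total number of cores the application may use.
    standalone)
      echo "--total-executor-cores $(( executors * cores )) --executor-cores ${cores}" ;;
    *)
      echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

executor_flags yarn 10 4        # prints --num-executors 10 --executor-cores 4
executor_flags standalone 10 4  # prints --total-executor-cores 40 --executor-cores 4
```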
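4.3 Troubleshooting Common Problems
4.3.1 Worker Cannot Connect to the Master
# Diagnosis
# 1. Check network connectivity
$ ping 192.168.1.60
# 2. Check that the port is reachable
$ telnet 192.168.1.60 7077
# 3. Check the firewall
$ firewall-cmd --list-ports
# 4. Check the Master log (file name as printed by start-master.sh in 3.3.1)
$ tail -100 /bigdata/app/spark/logs/spark-*-org.apache.spark.deploy.master.Master-*.out
# Fixes
# 1. Open the firewall ports
$ firewall-cmd --add-port=7077/tcp --permanent
$ firewall-cmd --add-port=8080/tcp --permanent
$ firewall-cmd --reload
# 2. Check the Master address in spark-env.sh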
export SPARK_MASTER_HOST=192.168.1.60
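Where `telnet` is not installed, the port check can be done with bash alone. This is a minimal sketch, assuming bash's `/dev/tcp` pseudo-device and the coreutils `timeout` command are available; `check_port` is a hypothetical helper and the 3-second limit is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch only: check_port is a hypothetical helper.
# Succeeds if a TCP connection to host:port can be opened within 3 seconds.
check_port() {
  local host=$1 port=$2
  # /dev/tcp/<host>/<port> is handled by bash itself; timeout bounds the wait.
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Usage from a Worker node:
# for port in 7077 8080; do
#   if check_port 192.168.1.60 "$port"; then echo "$port open"; else echo "$port closed"; fi
# done
```

A closed result for 7077 while `ping` succeeds usually points at the firewall or at the Master binding to a different address.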
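4.3.2 Out-of-Memory Problems
# Diagnosis
# 1. Check the Executor memory setting
$ grep executor.memory /bigdata/app/spark/conf/spark-defaults.conf
# 2. Check the memory allocated on the Workers
$ curl http://192.168.1.60:8080/json | grep memory
# Fixes
# 1. Increase Executor memory
spark.executor.memory 16g
# 2. Adjust the memory fractions (0.6 and 0.5 are Spark's defaults; tune from there)
spark.memory.fraction 0.6
spark.memory.storageFraction 0.5
# 3. Enable dynamic allocation (relies on the shuffle service enabled in 3.2.2)
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 20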
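To see what the two fraction settings mean in megabytes, the arithmetic can be worked through in shell. The sketch below follows the unified memory model described in the Spark tuning guide, where `spark.memory.fraction` applies to the heap minus 300 MiB of reserved memory; `memory_regions` is a hypothetical helper, percentages are passed as integers, and the heap is assumed to equal `spark.executor.memory` exactly (the JVM reports slightly less in practice).

```shell
#!/usr/bin/env bash
# Sketch only: memory_regions is a hypothetical helper for back-of-envelope sizing.
memory_regions() {
  local heap_mb=$1 fraction_pct=${2:-60} storage_pct=${3:-50}
  local reserved_mb=300
  # Unified region = (heap - reserved) * spark.memory.fraction
  local unified_mb=$(( (heap_mb - reserved_mb) * fraction_pct / 100 ))
  # Storage share of the unified region = spark.memory.storageFraction
  local storage_mb=$(( unified_mb * storage_pct / 100 ))
  local execution_mb=$(( unified_mb - storage_mb ))
  echo "unified=${unified_mb}MB storage=${storage_mb}MB execution=${execution_mb}MB"
}

memory_regions 8192    # prints unified=4735MB storage=2367MB execution=2368MB
memory_regions 16384   # prints unified=9650MB storage=4825MB execution=4825MB
```

With the defaults, an 8g Executor has roughly 4.6 GB for execution and storage combined, which helps explain why doubling `spark.executor.memory` is often the first fix to try.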
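Part05 - Feng Ge's Lessons Learned and Tips
5.1 Spark Deployment Best Practices
Recommended practices for deploying Spark:
1. Plan resources sensibly
   - Worker memory: about 75% of physical memory
   - Executor memory: 4-8 GB
   - Executor cores: 2-5
2. Tune the configuration
   - Use Kryo serialization
   - Enable the external shuffle service
   - Configure the History Server
3. Monitoring and operations
   - Monitor resource usage
   - Monitor application performance
   - Clean up logs regularly
4. High availability
   - Master HA (ZooKeeper)
   - Run multiple Masters
   - Enable application retries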
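For the Master HA item, Standalone mode supports ZooKeeper-based recovery through three `spark.deploy.*` properties passed to the daemons. The following is a sketch of a spark-env.sh fragment, assuming a three-node ZooKeeper ensemble; the host names `zk1`, `zk2`, `zk3` and the znode path `/spark-ha` are placeholders, not values from this guide.

```shell
# spark-env.sh fragment for every Master node (sketch only).
# zk1/zk2/zk3 and /spark-ha are placeholder values; substitute your own.
# Existing daemon options (for example the G1 flags from 3.2.1) are preserved.
export SPARK_DAEMON_JAVA_OPTS="${SPARK_DAEMON_JAVA_OPTS:-} \
-Dspark.deploy.recoveryMode=ZOOKEEPER \
-Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
-Dspark.deploy.zookeeper.dir=/spark-ha"
```

With more than one Master running, applications and Workers list all of them in the master URL, separated by commas, in the form `spark://host1:7077,host2:7077`.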
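5.2 Deployment Checklist
Deployment checklist:
- Is the Java environment correct?
- Is passwordless SSH configured?
- Have the configuration files been distributed to all nodes?
- Did the Master start normally?
- Have the Workers registered?
- Is the Web UI reachable?
- Does the test application run successfully?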
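The first checklist item can be automated by reading the major version out of `java -version`. This is a minimal sketch; `java_major_version` is a hypothetical helper, and it assumes the usual `version "x.y.z"` banner format. Spark 3.5 documents support for Java 8, 11, and 17.

```shell
#!/usr/bin/env bash
# Sketch only: java_major_version is a hypothetical helper.
# Reads `java -version` text on stdin and prints the major version:
# "1.8.0_392" yields 8, "17.0.2" yields 17.
java_major_version() {
  sed -n 's/.*version "\([0-9]*\)\.\{0,1\}\([0-9]*\).*/\1 \2/p' | head -n1 | {
    read -r first second
    # Legacy versions are written as 1.<major>; newer ones lead with the major.
    if [ "$first" = "1" ]; then echo "$second"; else echo "$first"; fi
  }
}

# Usage (java -version writes to stderr, hence the redirect):
# java -version 2>&1 | java_major_version
```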
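5.3 Recommended Operations Tools
Spark operations tools:
- Spark Web UI: cluster and application monitoring
- History Server: viewing completed applications
- Ganglia: system resource monitoring
- Prometheus + Grafana: metrics monitoring
- Spark UI: application performance analysis
This article was compiled and published by Feng Ge Tutorials (风哥教程) for learning and testing only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html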
