Table of Contents
Part01 - Basic Concepts and Theory
1.1 Hadoop SQL Overview
Hadoop SQL refers to the family of technologies for querying and analyzing data with the SQL language inside the Hadoop ecosystem, chiefly Hive SQL, Spark SQL, and Presto SQL. It lets users process and analyze massive datasets with the SQL language they already know.
1.2 Hive SQL Syntax
- Data Definition Language (DDL): CREATE, ALTER, DROP, etc.
- Data Manipulation Language (DML): INSERT, UPDATE, DELETE, etc. (in Hive, UPDATE and DELETE require ACID transactional tables)
- Data Query Language (DQL): SELECT, FROM, WHERE, GROUP BY, ORDER BY, etc.
- Data Control Language (DCL): GRANT, REVOKE, etc.
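One short statement per category above, as a sketch (the table name, column names, and user name are hypothetical):

```sql
-- DDL: define a table
CREATE TABLE demo_orders (order_id STRING, amount DOUBLE);
-- DML: add data
INSERT INTO demo_orders VALUES ('O001', 99.5);
-- DQL: query it
SELECT order_id, SUM(amount) AS total FROM demo_orders GROUP BY order_id;
-- DCL: grant read access (requires Hive authorization to be enabled)
GRANT SELECT ON TABLE demo_orders TO USER analyst;
```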
1.3 Spark SQL Syntax
Spark SQL syntax is largely compatible with Hive SQL, but it supports additional functions and features, such as window functions and complex types (arrays, maps, structs).
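A sketch of the window-function support mentioned above (the table and columns are hypothetical and match the sales schema used later in this document):

```sql
-- Rank products by sales amount within each day using a window function
SELECT
  dt,
  product_id,
  sale_amount,
  RANK() OVER (PARTITION BY dt ORDER BY sale_amount DESC) AS amount_rank
FROM fgedu_sales;
```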
Part02 - Production Environment Planning and Recommendations
2.1 SQL Development Environment Setup
# Install Hive
[root@fgedu.net.cn ~]# wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
[root@fgedu.net.cn ~]# tar -zxvf apache-hive-3.1.3-bin.tar.gz
[root@fgedu.net.cn ~]# mv apache-hive-3.1.3-bin /bigdata/app/hive
# Install Spark
[root@fgedu.net.cn ~]# wget https://downloads.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
[root@fgedu.net.cn ~]# tar -zxvf spark-3.2.1-bin-hadoop3.2.tgz
[root@fgedu.net.cn ~]# mv spark-3.2.1-bin-hadoop3.2 /bigdata/app/spark
# Configure environment variables
[root@fgedu.net.cn ~]# vi ~/.bashrc
export HIVE_HOME=/bigdata/app/hive
export SPARK_HOME=/bigdata/app/spark
export PATH=$HIVE_HOME/bin:$SPARK_HOME/bin:$PATH
[root@fgedu.net.cn ~]# source ~/.bashrc
2.2 SQL Development Conventions
Recommended SQL development conventions cover naming, code style, commenting, and performance. Good conventions improve the readability and maintainability of your code.
2.3 SQL Performance Optimization
# SQL performance optimization guidelines
# 1. Use partitioned tables: partition by time or another query dimension
# 2. Use bucketed tables: bucket on a join or sampling column
# 3. Skip-index the hot columns: Hive 3 removed classic indexes, so use ORC/Parquet column statistics and bloom filters instead
# 4. Avoid full table scans: filter with WHERE, especially on partition columns
# 5. Optimize JOINs: choose the right join type and avoid Cartesian products
# 6. Aggregate early: reduce the amount of data shuffled across the network
# 7. Use subqueries judiciously: avoid deeply nested subqueries
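Guidelines 1 and 2 above can be sketched as DDL (the table name, columns, and bucket count are illustrative assumptions):

```sql
-- Daily-partitioned, bucketed sales table in ORC format
CREATE TABLE sales_opt (
  sale_id STRING,
  product_id STRING,
  sale_amount DOUBLE
)
PARTITIONED BY (dt STRING)                  -- enables partition pruning on dt
CLUSTERED BY (product_id) INTO 32 BUCKETS   -- bucketing helps joins and sampling
STORED AS ORC;                              -- columnar format with statistics
```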
Part03 - Production Implementation
3.1 Hive SQL Development
# Create a table
hive> CREATE TABLE fgedu_sales (
    > sale_id STRING,
    > order_id STRING,
    > customer_id STRING,
    > product_id STRING,
    > sale_date STRING,
    > sale_amount DOUBLE,
    > sale_quantity INT
    > )
    > PARTITIONED BY (dt STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
# Load data
hive> LOAD DATA INPATH '/user/fgedu/data/sales' INTO TABLE fgedu_sales PARTITION (dt='20230408');
# Query data
hive> SELECT
    > product_id,
    > SUM(sale_amount) AS total_amount,
    > SUM(sale_quantity) AS total_quantity
    > FROM fgedu_sales
    > WHERE dt='20230408'
    > GROUP BY product_id
    > ORDER BY total_amount DESC;
# Create a view
hive> CREATE VIEW fgedu_sales_view AS
    > SELECT
    > sale_date,
    > product_id,
    > SUM(sale_amount) AS total_amount
    > FROM fgedu_sales
    > GROUP BY sale_date, product_id;
3.2 Spark SQL Development
# Using Spark SQL
[root@fgedu.net.cn ~]# spark-shell
scala> import org.apache.spark.sql.SparkSession
scala> // Note: spark-shell already provides a SparkSession named `spark`;
scala> // building one explicitly is shown here for completeness.
scala> val spark = SparkSession.builder()
.appName("Spark SQL Example")
.master("local[*]")
.getOrCreate()
scala> // Read the data
scala> val salesDF = spark.read
.option("delimiter", "\t")
.schema("sale_id STRING, order_id STRING, customer_id STRING, product_id STRING, sale_date STRING, sale_amount DOUBLE, sale_quantity INT")
.csv("/user/fgedu/data/sales")
scala> // Register a temporary view
scala> salesDF.createOrReplaceTempView("fgedu_sales")
scala> // Run a SQL query
scala> val result = spark.sql("""
SELECT
product_id,
SUM(sale_amount) AS total_amount,
SUM(sale_quantity) AS total_quantity
FROM fgedu_sales
GROUP BY product_id
ORDER BY total_amount DESC
""")
scala> // Show the results
scala> result.show()
3.3 SQL Job Scheduling
# Schedule SQL jobs with Oozie
[root@fgedu.net.cn ~]# vi /bigdata/app/oozie/workflows/sql_workflow.xml
# Reconstructed minimal hive-action workflow (the XML tags were lost in the
# original text; the script name below is a placeholder)
<workflow-app name="sql_workflow" xmlns="uri:oozie:workflow:0.5">
  <start to="hive-node"/>
  <action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>daily_sales.hql</script>
      <param>dt=${dt}</param>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
# Submit the Oozie job
[root@fgedu.net.cn ~]# oozie job -oozie http://localhost:11000/oozie -config job.properties -run
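The job.properties file passed to the submit command above is not shown in the original; a minimal sketch might look like this (host names, ports, and the application path are assumptions):

```properties
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
oozie.wf.application.path=${nameNode}/bigdata/app/oozie/workflows
dt=20230408
```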
Part04 - Production Cases and Hands-On Walkthroughs
4.1 E-commerce Sales Analysis
Background
An e-commerce platform needs to analyze its sales data to understand sales trends, product performance, and user purchasing behavior.
Implementation Steps
- Create the sales data tables
- Load the sales data
- Write SQL queries to analyze the sales data
- Present the results with a visualization tool
Outcome
Through the SQL analysis, the platform gained insight into sales trends, product performance, and user purchasing behavior, supporting its business decisions.
4.2 User Behavior Analysis
# User behavior analysis SQL
hive> -- Analyze daily user activity
SELECT
dt,
COUNT(DISTINCT user_id) AS active_users,
COUNT(*) AS total_behavior,
SUM(CASE WHEN behavior = 'purchase' THEN 1 ELSE 0 END) AS purchase_count,
SUM(CASE WHEN behavior = 'purchase' THEN 1 ELSE 0 END) / COUNT(DISTINCT user_id) AS purchase_rate
FROM fgedu_user_behavior
GROUP BY dt
ORDER BY dt;
-- Analyze user retention (assumes dt is a 'yyyy-MM-dd' date string).
-- Each user's first-seen date is joined back to the full behavior log, so a
-- user counts as retained if active on any event exactly 1 (or 7) days later.
SELECT
t.first_date,
COUNT(DISTINCT t.user_id) AS new_users,
COUNT(DISTINCT CASE WHEN b.dt = date_add(t.first_date, 1) THEN t.user_id END) AS day1_retention,
COUNT(DISTINCT CASE WHEN b.dt = date_add(t.first_date, 7) THEN t.user_id END) AS day7_retention,
COUNT(DISTINCT CASE WHEN b.dt = date_add(t.first_date, 1) THEN t.user_id END) / COUNT(DISTINCT t.user_id) AS day1_retention_rate,
COUNT(DISTINCT CASE WHEN b.dt = date_add(t.first_date, 7) THEN t.user_id END) / COUNT(DISTINCT t.user_id) AS day7_retention_rate
FROM (
SELECT
user_id,
MIN(dt) AS first_date
FROM fgedu_user_behavior
GROUP BY user_id
) t
JOIN fgedu_user_behavior b ON t.user_id = b.user_id
GROUP BY t.first_date
ORDER BY t.first_date;
4.3 Operational Metrics Analysis
# Operational metrics SQL
hive> -- Analyze sales metrics
SELECT
dt,
SUM(sale_amount) AS total_sales,
COUNT(DISTINCT order_id) AS order_count,
SUM(sale_amount) / COUNT(DISTINCT order_id) AS avg_order_value,
COUNT(DISTINCT customer_id) AS customer_count,
SUM(sale_amount) / COUNT(DISTINCT customer_id) AS avg_customer_value
FROM fgedu_sales
GROUP BY dt
ORDER BY dt;
-- Analyze product category performance
SELECT
category,
SUM(sale_amount) AS total_sales,
COUNT(DISTINCT p.product_id) AS product_count,
SUM(sale_amount) / COUNT(DISTINCT p.product_id) AS avg_product_sales
FROM fgedu_sales s
JOIN fgedu_product p ON s.product_id = p.product_id
GROUP BY category
ORDER BY total_sales DESC;
Part05 - Lessons Learned and Best Practices
5.1 SQL Development Best Practices
- Consistent naming: use a uniform naming convention to improve readability
- Code style: keep a consistent style with indentation and line breaks
- Comments: add comments that explain what the code does and why
- Performance: use appropriate partitioning, bucketing, and file formats
- Error handling: add error handling logic to make jobs more robust
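As a small illustration of the naming, style, and commenting conventions above (the target table and its refresh cadence are hypothetical):

```sql
-- dws_sales_daily: daily sales rollup, partitioned by dt, refreshed once per day
INSERT OVERWRITE TABLE dws_sales_daily PARTITION (dt='20230408')
SELECT
  product_id,                        -- grain: one row per product per day
  SUM(sale_amount)   AS total_amount,
  SUM(sale_quantity) AS total_quantity
FROM fgedu_sales
WHERE dt = '20230408'                -- partition filter avoids a full scan
GROUP BY product_id;
```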
5.2 Common SQL Problems and Solutions
| Problem | Cause | Solution |
|---|---|---|
| Slow queries | Full table scans or complex JOINs | Use partitioning and pruning; optimize the JOINs |
| Data skew | A few key values carry most of the data | Use bucketing, key salting, or skew-join settings |
| Out of memory | Data volume too large or query too complex | Increase memory settings; simplify the query |
| Syntax errors | Mistakes in the SQL statement | Check the syntax; use an IDE or linter |
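The salting fix for data skew in the table above can be sketched as a two-stage aggregation (the salt width of 10 is an arbitrary choice): each hot key is first split across random sub-keys so its rows spread over many reducers, then the partial sums are merged.

```sql
-- Stage 2: merge the partial sums back per product
SELECT product_id, SUM(partial_amount) AS total_amount
FROM (
  -- Stage 1: aggregate per (product, salt) so one hot key uses 10 reducers
  SELECT product_id, salt, SUM(sale_amount) AS partial_amount
  FROM (
    SELECT product_id, sale_amount, CAST(rand() * 10 AS INT) AS salt
    FROM fgedu_sales
  ) salted
  GROUP BY product_id, salt
) partials
GROUP BY product_id;
```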
5.3 SQL Performance Tuning Tips
# SQL performance tuning tips
# 1. Use EXPLAIN to inspect the execution plan
hive> EXPLAIN SELECT * FROM fgedu_sales WHERE dt='20230408' AND product_id='P001';
# 2. Use partition pruning
hive> SELECT * FROM fgedu_sales WHERE dt='20230408';
# 3. Use predicate pushdown
hive> SELECT * FROM fgedu_sales s JOIN fgedu_product p ON s.product_id = p.product_id WHERE s.dt='20230408';
# 4. Aggregate to reduce data volume
hive> SELECT product_id, SUM(sale_amount) FROM fgedu_sales GROUP BY product_id;
# 5. Use the appropriate JOIN type
hive> SELECT * FROM fgedu_sales s INNER JOIN fgedu_product p ON s.product_id = p.product_id;
By adopting SQL on a Hadoop cluster, you can process and analyze massive datasets with familiar SQL and support business decisions with the results. SQL development is a core big-data skill that requires mastering both SQL syntax and performance tuning.
Compiled and published by the Fenge (风哥) tutorial series for learning and testing purposes only; when reposting, credit the source: http://www.fgedu.net.cn/10327.html
