目录大纲
Part01-基础概念与理论知识
1.1 Hive SQL语法概述
1.2 数据类型与函数
1.3 查询执行原理
Part02-生产环境规划与建议
2.1 SQL开发规范
2.2 性能优化策略
2.3 资源配置规划
Part03-生产环境项目实施方案
3.1 基础查询开发
3.2 复杂查询开发
3.3 查询性能优化
3.4 执行计划分析
Part04-生产案例与实战讲解
4.1 ETL开发案例
4.2 报表查询案例
4.3 性能调优案例
Part05-风哥经验总结与分享
5.1 SQL开发最佳实践
5.2 性能优化经验总结
Part01-基础概念与理论知识
1.1 Hive SQL语法概述
Hive SQL是类SQL语言,支持大部分标准SQL语法。更多视频教程www.fgedu.net.cn Hive SQL将SQL转换为MapReduce或Tez任务执行。
1.2 数据类型与函数
Hive支持丰富的数据类型和内置函数。学习交流加群风哥微信: itpux-com
hive -e “DESCRIBE FUNCTION EXTENDED cast;”
# 查看内置函数
hive -e “SHOW FUNCTIONS;” | head -20
# 查看函数说明
hive -e “DESCRIBE FUNCTION EXTENDED date_format;”
cast(expr AS type) – Converts the expression to the given type
# 内置函数列表
!
!=
%
&
*
+
–
/
<
<=>
<>
=
==
>
>=
abs
acos
add_months
and
array
# 函数说明
date_format(date/timestamp/string, fmt) – Converts a date/timestamp/string to a string with the specified format
1.3 查询执行原理
Hive查询执行过程包括解析、编译、优化、执行四个阶段。from bigdata视频:www.itpux.com
– 解析:SQL语法解析
– 编译:生成逻辑执行计划
– 优化:优化逻辑执行计划
– 执行:转换为物理执行计划并执行
Part02-生产环境规划与建议
2.1 SQL开发规范
SQL开发规范是保证代码质量的重要手段。更多学习教程公众号风哥教程itpux_com
– 使用明确的列名,避免SELECT *
– 合理使用分区过滤
– 避免数据倾斜
– 添加必要的注释
– 遵循命名规范
2.2 性能优化策略
性能优化需要从多个角度考虑。学习交流加群风哥QQ113257174
– 分区裁剪
– 列裁剪
– 谓词下推
– Map端聚合
– 合理设置并行度
2.3 资源配置规划
资源配置影响查询性能。风哥提示:需要根据集群资源和查询特点合理配置。
hive -e “SET hive.exec.reducers.bytes.per.reducer;”
hive -e “SET mapreduce.map.memory.mb;”
hive -e “SET mapreduce.reduce.memory.mb;”
# 设置资源参数
hive -e “SET hive.exec.reducers.bytes.per.reducer=256000000;”
hive -e “SET mapreduce.map.memory.mb=4096;”
hive -e “SET mapreduce.reduce.memory.mb=8192;”
hive.exec.reducers.bytes.per.reducer=256000000
mapreduce.map.memory.mb=4096
mapreduce.reduce.memory.mb=8192
# 资源配置设置
# 设置成功
Part03-生产环境项目实施方案
3.1 基础查询开发
3.1.1 SELECT查询
hive -e “
SELECT user_id, user_name, age, gender
FROM fgedudb.fgedu_user
WHERE dt=’20240118′
LIMIT 10;
“
# 聚合查询
hive -e “
SELECT gender, COUNT(*) AS cnt, AVG(age) AS avg_age
FROM fgedudb.fgedu_user
WHERE dt=’20240118′
GROUP BY gender;
“
# 排序查询
hive -e “
SELECT user_id, user_name, age
FROM fgedudb.fgedu_user
WHERE dt=’20240118′
ORDER BY age DESC
LIMIT 10;
”
user_id user_name age gender
1 fgedu01 25 M
2 fgedu02 30 F
3 fgedu03 28 M
…
Time taken: 5.0 seconds
# 聚合查询结果
gender cnt avg_age
M 500000 30.5
F 500000 28.3
Time taken: 30.0 seconds
# 排序查询结果
user_id user_name age
999999 fgedu99 80
888888 fgedu88 78
…
Time taken: 60.0 seconds
3.1.2 JOIN查询
hive -e “
SELECT u.user_id, u.user_name, o.order_id, o.order_amount
FROM fgedudb.fgedu_user u
JOIN fgedudb.fgedu_order o ON u.user_id = o.user_id
WHERE u.dt=’20240118′ AND o.dt=’20240118′
LIMIT 10;
“
# LEFT JOIN
hive -e “
SELECT u.user_id, u.user_name, COUNT(o.order_id) AS order_cnt
FROM fgedudb.fgedu_user u
LEFT JOIN fgedudb.fgedu_order o ON u.user_id = o.user_id AND o.dt=’20240118′
WHERE u.dt=’20240118′
GROUP BY u.user_id, u.user_name
LIMIT 10;
“
# MAP JOIN优化
hive -e “
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=50000000;
SELECT /*+ MAPJOIN(d) */ f.user_id, d.city_name
FROM fgedudb.fgedu_user f
JOIN fgedudb.dim_city d ON f.city_id = d.city_id
WHERE f.dt=’20240118′
LIMIT 10;
”
user_id user_name order_id order_amount
1 fgedu01 10001 100.00
1 fgedu01 10002 200.00
…
Time taken: 120.0 seconds
# LEFT JOIN结果
user_id user_name order_cnt
1 fgedu01 5
2 fgedu02 3
…
Time taken: 180.0 seconds
# MAP JOIN结果
user_id city_name
1 北京
2 上海
…
Time taken: 30.0 seconds
3.2 复杂查询开发
3.2.1 窗口函数
hive -e “
SELECT user_id, user_name, age,
ROW_NUMBER() OVER (PARTITION BY gender ORDER BY age DESC) AS rank
FROM fgedudb.fgedu_user
WHERE dt=’20240118′
LIMIT 10;
“
# SUM窗口函数
hive -e “
SELECT user_id, order_amount,
SUM(order_amount) OVER (PARTITION BY user_id ORDER BY order_time) AS cum_amount
FROM fgedudb.fgedu_order
WHERE dt=’20240118′
LIMIT 10;
“
# LAG/LEAD窗口函数
hive -e “
SELECT user_id, order_time, order_amount,
LAG(order_amount, 1, 0) OVER (PARTITION BY user_id ORDER BY order_time) AS prev_amount,
LEAD(order_amount, 1, 0) OVER (PARTITION BY user_id ORDER BY order_time) AS next_amount
FROM fgedudb.fgedu_order
WHERE dt=’20240118′
LIMIT 10;
”
user_id user_name age rank
999999 fgedu99 80 1
888888 fgedu88 78 2
…
Time taken: 60.0 seconds
# SUM窗口函数结果
user_id order_amount cum_amount
1 100.00 100.00
1 200.00 300.00
1 150.00 450.00
…
Time taken: 90.0 seconds
# LAG/LEAD结果
user_id order_time order_amount prev_amount next_amount
1 2024-01-18 10:00:00 100.00 0.00 200.00
1 2024-01-18 11:00:00 200.00 100.00 150.00
…
Time taken: 90.0 seconds
3.2.2 子查询
hive -e “
SELECT user_id, user_name, total_amount
FROM (
SELECT u.user_id, u.user_name, SUM(o.order_amount) AS total_amount
FROM fgedudb.fgedu_user u
JOIN fgedudb.fgedu_order o ON u.user_id = o.user_id AND o.dt=’20240118′
WHERE u.dt=’20240118′
GROUP BY u.user_id, u.user_name
) t
WHERE total_amount > 1000
LIMIT 10;
“
# WITH子句
hive -e “
WITH user_orders AS (
SELECT user_id, SUM(order_amount) AS total_amount
FROM fgedudb.fgedu_order
WHERE dt=’20240118′
GROUP BY user_id
),
high_value_users AS (
SELECT user_id, total_amount
FROM user_orders
WHERE total_amount > 1000
)
SELECT u.user_id, u.user_name, h.total_amount
FROM fgedudb.fgedu_user u
JOIN high_value_users h ON u.user_id = h.user_id
WHERE u.dt=’20240118′
LIMIT 10;
”
user_id user_name total_amount
100 fgedu100 5000.00
200 fgedu200 3000.00
…
Time taken: 180.0 seconds
# WITH子句结果
user_id user_name total_amount
100 fgedu100 5000.00
200 fgedu200 3000.00
…
Time taken: 120.0 seconds
3.3 查询性能优化
3.3.1 分区裁剪优化
hive -e “SET hive.optimize.ppd=true;”
hive -e “SET hive.optimize.ppd.storage=true;”
# 使用分区过滤
hive -e “
EXPLAIN SELECT user_id, user_name
FROM fgedudb.fgedu_user
WHERE dt=’20240118′ AND age > 30;
“
# 查看执行计划
hive -e “
EXPLAIN EXTENDED SELECT user_id, user_name
FROM fgedudb.fgedu_user
WHERE dt=’20240118′;
”
# 设置成功
# 执行计划
STAGE DEPENDENCIES:
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-0
Map Reduce
Alias -> Map Operator Tree:
fgedu_user
TableScan
alias: fgedu_user
filterExpr: (dt = ‘20240118’) (type: boolean)
Statistics: Num rows: 1000000
# 详细执行计划
…
partition values:
dt 20240118
…
3.3.2 并行度优化
hive -e “SET hive.exec.parallel=true;”
hive -e “SET hive.exec.parallel.thread.number=16;”
# 设置Reducer数量
hive -e “SET hive.exec.reducers.bytes.per.reducer=256000000;”
hive -e “SET mapreduce.job.reduces=10;”
# 动态分区设置
hive -e “SET hive.exec.dynamic.partition=true;”
hive -e “SET hive.exec.dynamic.partition.mode=nonstrict;”
hive -e “SET hive.exec.max.dynamic.partitions=10000;”
# 执行优化后的查询
hive -e “
INSERT OVERWRITE TABLE fgedudb.fgedu_user_agg PARTITION(dt)
SELECT user_id, COUNT(*) AS cnt, dt
FROM fgedudb.fgedu_user
GROUP BY user_id, dt;
”
# 设置成功
# Reducer设置
# 设置成功
# 动态分区设置
# 设置成功
# 查询执行
Loading data to table fgedudb.fgedu_user_agg partition (dt=null)
Time taken: 300.0 seconds
3.4 执行计划分析
3.4.1 EXPLAIN命令
hive -e “
EXPLAIN SELECT user_id, COUNT(*) AS cnt
FROM fgedudb.fgedu_user
WHERE dt=’20240118′
GROUP BY user_id;
“
# 查看详细执行计划
hive -e “
EXPLAIN EXTENDED SELECT user_id, COUNT(*) AS cnt
FROM fgedudb.fgedu_user
WHERE dt=’20240118′
GROUP BY user_id;
“
# 查看依赖关系
hive -e “
EXPLAIN DEPENDENCY SELECT user_id
FROM fgedudb.fgedu_user
WHERE dt=’20240118′;
”
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: fgedu_user
Statistics: Num rows: 1000000
Select Operator
Statistics: Num rows: 1000000
Group By Operator
aggregations: count()
keys: user_id
# 详细执行计划
…
input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
…
# 依赖关系
{“input_tables”:[{“tablename”:”fgedudb@fgedu_user”,”tabletype”:”MANAGED_TABLE”}]}
3.4.2 性能分析
hive -e “SET hive.execution.engine=tez;”
# 开启向量化执行
hive -e “SET hive.vectorized.execution.enabled=true;”
hive -e “SET hive.vectorized.execution.reduce.enabled=true;”
# 开启CBO优化
hive -e “SET hive.cbo.enable=true;”
hive -e “SET hive.compute.query.using.stats=true;”
# 执行查询并查看时间
time hive -e “
SELECT user_id, COUNT(*) AS cnt
FROM fgedudb.fgedu_user
WHERE dt=’20240118′
GROUP BY user_id;
”
# 设置成功
# 向量化设置
# 设置成功
# CBO设置
# 设置成功
# 查询执行时间
1000000
Time taken: 30.0 seconds
real 0m35.000s
user 0m1.000s
sys 0m0.500s
Part04-生产案例与实战讲解
4.1 ETL开发案例
ETL是数据仓库的核心工作。更多视频教程www.fgedu.net.cn
# etl_user_order.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
DT=$(date -d “1 day ago” +%Y%m%d)
echo “=== ETL Process for ${DT} ===”
# 1. ODS层数据加载
echo “=== Load ODS Layer ===”
hive -e “
INSERT OVERWRITE TABLE ods.ods_user PARTITION(dt=’${DT}’)
SELECT user_id, user_name, age, gender, city_id
FROM staging.stg_user
WHERE dt=’${DT}’;
“
# 2. DWD层数据清洗
echo “=== Transform DWD Layer ===”
hive -e “
INSERT OVERWRITE TABLE dwd.fact_order PARTITION(dt=’${DT}’)
SELECT order_id, user_id, product_id, order_amount,
order_quantity, order_time, CURRENT_TIMESTAMP AS etl_time
FROM ods.ods_order
WHERE dt=’${DT}’ AND order_amount > 0;
“
# 3. DWS层数据汇总
echo “=== Aggregate DWS Layer ===”
hive -e “
INSERT OVERWRITE TABLE dws.agg_user_order PARTITION(dt=’${DT}’)
SELECT user_id,
COUNT(*) AS order_count,
SUM(order_amount) AS total_amount,
MIN(order_time) AS first_order_time,
MAX(order_time) AS last_order_time,
CURRENT_TIMESTAMP AS etl_time
FROM dwd.fact_order
WHERE dt=’${DT}’
GROUP BY user_id;
“
echo “=== ETL Completed ===”
./etl_user_order.sh
=== ETL Process for 20240117 ===
=== Load ODS Layer ===
Loading data to table ods.ods_user partition (dt=20240117)
Time taken: 60.0 seconds
=== Transform DWD Layer ===
Loading data to table dwd.fact_order partition (dt=20240117)
Time taken: 120.0 seconds
=== Aggregate DWS Layer ===
Loading data to table dws.agg_user_order partition (dt=20240117)
Time taken: 180.0 seconds
=== ETL Completed ===
4.2 报表查询案例
报表查询是数据仓库的重要应用。学习交流加群风哥微信: itpux-com
hive -e “
SELECT
dt,
COUNT(DISTINCT user_id) AS active_users,
COUNT(*) AS order_count,
SUM(order_amount) AS total_amount,
AVG(order_amount) AS avg_amount
FROM dwd.fact_order
WHERE dt BETWEEN ‘20240101’ AND ‘20240118’
GROUP BY dt
ORDER BY dt;
“
# 用户留存报表
hive -e “
WITH first_login AS (
SELECT user_id, MIN(dt) AS first_dt
FROM dwd.fact_user_login
GROUP BY user_id
),
retention AS (
SELECT
f.first_dt,
COUNT(DISTINCT CASE WHEN l.dt = f.first_dt THEN l.user_id END) AS day0,
COUNT(DISTINCT CASE WHEN l.dt = DATE_ADD(f.first_dt, 1) THEN l.user_id END) AS day1,
COUNT(DISTINCT CASE WHEN l.dt = DATE_ADD(f.first_dt, 7) THEN l.user_id END) AS day7
FROM first_login f
LEFT JOIN dwd.fact_user_login l ON f.user_id = l.user_id
GROUP BY f.first_dt
)
SELECT first_dt, day0, day1, day7
FROM retention
ORDER BY first_dt;
”
dt active_users order_count total_amount avg_amount
20240101 50000 100000 5000000.00 50.00
20240102 52000 105000 5250000.00 50.00
…
20240118 55000 110000 5500000.00 50.00
Time taken: 300.0 seconds
# 用户留存报表
first_dt day0 day1 day7
20240101 50000 25000 10000
20240102 52000 26000 10500
…
Time taken: 600.0 seconds
4.3 性能调优案例
4.3.1 数据倾斜优化
hive -e “
SELECT user_id, COUNT(*) AS cnt
FROM dwd.fact_order
WHERE dt=’20240118′
GROUP BY user_id;
“
# 开启数据倾斜优化
hive -e “SET hive.groupby.skewindata=true;”
hive -e “SET hive.optimize.skewjoin=true;”
hive -e “SET hive.skewjoin.key=100000;”
# 优化后查询
hive -e “
SELECT user_id, COUNT(*) AS cnt
FROM dwd.fact_order
WHERE dt=’20240118′
GROUP BY user_id;
”
user_id cnt
…
Time taken: 1800.0 seconds
# 存在数据倾斜问题
# 优化配置
# 设置成功
# 优化后查询
user_id cnt
…
Time taken: 300.0 seconds
# 性能提升6倍
4.3.2 内存优化
hive -e “SET mapreduce.map.memory.mb=8192;”
hive -e “SET mapreduce.reduce.memory.mb=16384;”
hive -e “SET mapreduce.map.java.opts=-Xmx6144m;”
hive -e “SET mapreduce.reduce.java.opts=-Xmx12288m;”
# 堆内存优化
hive -e “SET hive.exec.reducers.bytes.per.reducer=512000000;”
# 执行优化查询
hive -e “
SELECT user_id, COUNT(*) AS cnt
FROM dwd.fact_order
WHERE dt=’20240118′
GROUP BY user_id;
”
# 设置成功
# 堆内存配置
# 设置成功
# 查询执行
user_id cnt
…
Time taken: 200.0 seconds
# 内存优化后性能稳定
Part05-风哥经验总结与分享
5.1 SQL开发最佳实践
在实际生产环境中,SQL开发需要注意以下几点:from bigdata视频:www.itpux.com
1. 使用分区过滤减少数据扫描
2. 避免SELECT * 全列扫描
3. 合理使用MAP JOIN
4. 注意数据倾斜问题
5. 定期收集统计信息
5.2 性能优化经验总结
5.2.1 优化建议
– 选择合适的执行引擎
– 开启向量化执行
– 合理设置并行度
– 优化数据存储格式
– 定期执行表压缩
5.2.2 SQL优化脚本
# hive_optimize.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
echo “=== Hive Optimization ===”
echo “Date: $(date)”
# 1. 收集统计信息
echo “=== Collect Statistics ===”
hive -e “
SET hive.stats.autogather=true;
ANALYZE TABLE dwd.fact_order COMPUTE STATISTICS FOR COLUMNS;
“
# 2. 压缩表
echo “=== Compact Tables ===”
hive -e “ALTER TABLE dwd.fact_order COMPACT ‘major’;”
# 3. 清理小文件
echo “=== Merge Small Files ===”
hive -e “
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;
“
echo “=== Optimization Completed ===”
./hive_optimize.sh
=== Hive Optimization ===
Date: Thu Jan 18 15:00:00 CST 2024
=== Collect Statistics ===
OK
=== Compact Tables ===
Compaction enqueued
=== Merge Small Files ===
# 设置成功
=== Optimization Completed ===
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
