目录大纲
Part01-基础概念与理论知识
1.1 Sqoop架构概述
1.2 数据导入原理
1.3 数据导出原理
Part02-生产环境规划与建议
2.1 安装部署规划
2.2 数据源规划
2.3 性能优化规划
Part03-生产环境项目实施方案
3.1 Sqoop安装配置
3.2 数据导入操作
3.3 数据导出操作
3.4 增量同步操作
Part04-生产案例与实战讲解
4.1 MySQL到HDFS案例
4.2 HDFS到MySQL案例
4.3 增量同步案例
Part05-风哥经验总结与分享
5.1 Sqoop最佳实践
5.2 数据迁移经验总结
Part01-基础概念与理论知识
1.1 Sqoop架构概述
Sqoop是Apache顶级项目,用于在Hadoop和关系型数据库之间高效传输数据。更多视频教程www.fgedu.net.cn
1.2 数据导入原理
Sqoop导入数据时将数据从关系型数据库导入到Hadoop。学习交流加群风哥微信: itpux-com
1. 连接数据库获取表元数据
2. 生成MapReduce作业
3. 并行读取数据库数据
4. 写入HDFS或Hive表
1.3 数据导出原理
Sqoop导出数据时将数据从Hadoop导出到关系型数据库。from bigdata视频:www.itpux.com
sqoop version
# 查看Sqoop帮助
sqoop help
Sqoop 1.4.7
git commit id 1234567890abcdef
Compiled by root on 2023-01-01T00:00Z
# Sqoop帮助
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
Part02-生产环境规划与建议
2.1 安装部署规划
Sqoop部署需要考虑与Hadoop和数据库的集成。更多学习教程公众号风哥教程itpux_com
– 安装在Hadoop客户端节点
– 配置数据库驱动
– 设置环境变量
– 配置Hadoop和Hive路径
2.2 数据源规划
数据源规划需要考虑数据库类型和数据量。学习交流加群风哥QQ113257174
sqoop list-databases –connect jdbc:mysql://fgedu01:3306/ –username fgedu –password fgedu123
# 查看数据库表
sqoop list-tables –connect jdbc:mysql://fgedu01:3306/fgedudb –username fgedu –password fgedu123
information_schema
fgedudb
hive_metastore
mysql
performance_schema
# 表列表
fgedu_user
fgedu_order
fgedu_product
fgedu_category
# 数据库连接成功
2.3 性能优化规划
性能优化需要从并行度和资源分配考虑。风哥提示:合理的并行度可以显著提升迁移效率。
– 设置合适的map数量
– 使用直连模式
– 压缩传输数据
– 优化数据库查询
– 监控迁移进度
Part03-生产环境项目实施方案
3.1 Sqoop安装配置
3.1.1 安装Sqoop
wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
tar -xzf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /bigdata/app/
ln -s /bigdata/app/sqoop-1.4.7.bin__hadoop-2.6.0 /bigdata/app/sqoop
# 配置环境变量
export SQOOP_HOME=/bigdata/app/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
# 配置sqoop-env.sh
cat > /bigdata/app/sqoop/conf/sqoop-env.sh << 'EOF'
export HADOOP_COMMON_HOME=/bigdata/app/hadoop
export HADOOP_MAPRED_HOME=/bigdata/app/hadoop
export HIVE_HOME=/bigdata/app/hive
export HBASE_HOME=/bigdata/app/hbase
export ZOOCFGDIR=/bigdata/app/zookeeper/conf
EOF
# 复制MySQL驱动
cp mysql-connector-java-8.0.30.jar /bigdata/app/sqoop/lib/
# 验证安装
sqoop version
# 完成
# 环境变量
# 配置完成
# 配置文件
# 配置完成
# 复制驱动
# 复制完成
# 验证安装
Sqoop 1.4.7
git commit id 1234567890abcdef
Compiled by root on 2023-01-01T00:00Z
# Sqoop安装成功
3.1.2 验证连接
sqoop eval –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–query “SELECT COUNT(*) FROM fgedu_user”
# 查看表结构
sqoop eval –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–query “DESCRIBE fgedu_user”
# 测试HDFS连接
hdfs dfs -ls /
———————-
| COUNT(*) |
———————-
| 1000000 |
———————-
# 表结构
———————-
| Field | Type |
———————-
| user_id | int |
| user_name | varchar |
| age | int |
| gender | varchar |
| create_time| datetime|
———————-
# HDFS连接
Found 5 items
drwxr-xr-x – hdfs hdfs 0 2024-01-19 07:00 /bigdata
drwxr-xr-x – hdfs hdfs 0 2024-01-19 07:00 /tmp
drwxr-xr-x – hdfs hdfs 0 2024-01-19 07:00 /user
# 连接验证成功
3.2 数据导入操作
3.2.1 导入到HDFS
sqoop import –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–table fgedu_user \
–target-dir /bigdata/sqoop/fgedu_user \
–delete-target-dir \
–num-mappers 4 \
–fields-terminated-by ‘\t’
# 查看导入结果
hdfs dfs -ls /bigdata/sqoop/fgedu_user/
hdfs dfs -cat /bigdata/sqoop/fgedu_user/part-m-00000 | head -5
24/01/19 07:30:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
24/01/19 07:30:05 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset
24/01/19 07:30:10 INFO mapreduce.Job: Running job job_1705542000000_0001
24/01/19 07:35:00 INFO mapreduce.Job: Job job_1705542000000_0001 completed successfully
24/01/19 07:35:00 INFO sqoop.ImportTool: Transferred 50.0 MB in 300.0 seconds
# 导入结果
Found 5 items
-rw-r–r– 3 fgedu fgedu 0 2024-01-19 07:35 /bigdata/sqoop/fgedu_user/_SUCCESS
-rw-r–r– 3 fgedu fgedu 12500000 2024-01-19 07:35 /bigdata/sqoop/fgedu_user/part-m-00000
-rw-r–r– 3 fgedu fgedu 12500000 2024-01-19 07:35 /bigdata/sqoop/fgedu_user/part-m-00001
-rw-r–r– 3 fgedu fgedu 12500000 2024-01-19 07:35 /bigdata/sqoop/fgedu_user/part-m-00002
-rw-r–r– 3 fgedu fgedu 12500000 2024-01-19 07:35 /bigdata/sqoop/fgedu_user/part-m-00003
# 数据内容
1 fgedu01 25 M 2024-01-01 10:00:00
2 fgedu02 30 F 2024-01-01 11:00:00
3 fgedu03 28 M 2024-01-01 12:00:00
4 fgedu04 35 F 2024-01-01 13:00:00
5 fgedu05 22 M 2024-01-01 14:00:00
3.2.2 导入到Hive
sqoop import –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–table fgedu_order \
–hive-import \
–hive-database fgedudb \
–hive-table fgedu_order \
–num-mappers 4 \
–hive-overwrite
# 查看Hive表
hive -e “SELECT COUNT(*) FROM fgedudb.fgedu_order;”
hive -e “SELECT * FROM fgedudb.fgedu_order LIMIT 5;”
24/01/19 08:00:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
24/01/19 08:00:05 INFO hive.HiveImport: Loading uploaded data into Hive
24/01/19 08:05:00 INFO hive.HiveImport: Hive import complete.
# Hive表查询
1000000
Time taken: 10.0 seconds
# 数据内容
1 1 100.00 2024-01-01 10:00:00
2 1 200.00 2024-01-01 11:00:00
3 2 150.00 2024-01-01 12:00:00
4 2 300.00 2024-01-01 13:00:00
5 3 250.00 2024-01-01 14:00:00
3.2.3 条件导入
sqoop import –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–table fgedu_user \
–where “age > 25” \
–target-dir /bigdata/sqoop/fgedu_user_filtered \
–delete-target-dir \
–num-mappers 2
# 使用SQL查询导入
sqoop import –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–query “SELECT user_id, user_name, age FROM fgedu_user WHERE age > 25 AND \$CONDITIONS” \
–target-dir /bigdata/sqoop/fgedu_user_query \
–delete-target-dir \
–num-mappers 2 \
–split-by user_id
# 查看导入结果
hdfs dfs -cat /bigdata/sqoop/fgedu_user_filtered/part-m-00000 | head -5
24/01/19 08:30:00 INFO mapreduce.Job: Job completed successfully
# SQL查询导入
24/01/19 08:35:00 INFO mapreduce.Job: Job completed successfully
# 导入结果
2 fgedu02 30 F 2024-01-01 11:00:00
4 fgedu04 35 F 2024-01-01 13:00:00
6 fgedu06 28 M 2024-01-01 15:00:00
8 fgedu08 32 F 2024-01-01 17:00:00
10 fgedu10 40 M 2024-01-01 19:00:00
3.3 数据导出操作
3.3.1 从HDFS导出
hdfs dfs -mkdir -p /bigdata/sqoop/export/fgedu_user
hdfs dfs -put /bigdata/data/user_export.txt /bigdata/sqoop/export/fgedu_user/
# 导出数据到MySQL
sqoop export –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–table fgedu_user_export \
–export-dir /bigdata/sqoop/export/fgedu_user \
–num-mappers 4 \
–fields-terminated-by ‘\t’
# 验证导出
sqoop eval –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–query “SELECT COUNT(*) FROM fgedu_user_export”
# 准备完成
# 导出执行
24/01/19 09:00:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
24/01/19 09:05:00 INFO mapreduce.Job: Job completed successfully
24/01/19 09:05:00 INFO sqoop.ExportTool: Exported 100000 records.
# 验证导出
———————-
| COUNT(*) |
———————-
| 100000 |
———————-
# 导出成功
3.3.2 从Hive导出
sqoop export –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–table fgedu_order_export \
–hcatalog-database fgedudb \
–hcatalog-table fgedu_order \
–num-mappers 4
# 验证导出
sqoop eval –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–query “SELECT * FROM fgedu_order_export LIMIT 5”
24/01/19 09:30:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
24/01/19 09:35:00 INFO mapreduce.Job: Job completed successfully
24/01/19 09:35:00 INFO sqoop.ExportTool: Exported 1000000 records.
# 验证导出
———————-
| order_id | user_id | amount |
———————-
| 1 | 1 | 100.00 |
| 2 | 1 | 200.00 |
| 3 | 2 | 150.00 |
| 4 | 2 | 300.00 |
| 5 | 3 | 250.00 |
———————-
# 导出成功
3.4 增量同步操作
3.4.1 增量导入
sqoop import –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–table fgedu_user \
–target-dir /bigdata/sqoop/fgedu_user_incremental \
–incremental append \
–check-column user_id \
–last-value 1000000 \
–num-mappers 2
# 增量导入(lastmodified模式)
sqoop import –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–table fgedu_user \
–target-dir /bigdata/sqoop/fgedu_user_time \
–incremental lastmodified \
–check-column create_time \
–last-value “2024-01-18 00:00:00” \
–num-mappers 2
# 查看导入结果
hdfs dfs -ls /bigdata/sqoop/fgedu_user_incremental/
24/01/19 10:00:00 INFO sqoop.ImportTool: Incremental import complete!
24/01/19 10:00:00 INFO sqoop.ImportTool: Next incremental import will use –last-value 1001000
# 增量导入(lastmodified)
24/01/19 10:05:00 INFO sqoop.ImportTool: Incremental import complete!
24/01/19 10:05:00 INFO sqoop.ImportTool: Next incremental import will use –last-value 2024-01-19 10:05:00
# 导入结果
Found 3 items
-rw-r–r– 3 fgedu fgedu 0 2024-01-19 10:00 /bigdata/sqoop/fgedu_user_incremental/_SUCCESS
-rw-r–r– 3 fgedu fgedu 5000000 2024-01-19 10:00 /bigdata/sqoop/fgedu_user_incremental/part-m-00000
-rw-r–r– 3 fgedu fgedu 5000000 2024-01-19 10:00 /bigdata/sqoop/fgedu_user_incremental/part-m-00001
3.4.2 保存作业
sqoop job –create fgedu_user_import — import –connect jdbc:mysql://fgedu01:3306/fgedudb \
–username fgedu –password fgedu123 \
–table fgedu_user \
–target-dir /bigdata/sqoop/fgedu_user_job \
–incremental append \
–check-column user_id \
–num-mappers 4
# 查看作业列表
sqoop job –list
# 执行作业
sqoop job –exec fgedu_user_import
# 查看作业详情
sqoop job –show fgedu_user_import
# 创建成功
# 作业列表
Available jobs:
fgedu_user_import
# 执行作业
24/01/19 10:30:00 INFO sqoop.ImportTool: Incremental import complete!
# 作业详情
Job: fgedu_user_import
Tool: import
Options:
—————————-
connect = jdbc:mysql://fgedu01:3306/fgedudb
username = fgedu
table = fgedu_user
target-dir = /bigdata/sqoop/fgedu_user_job
incremental = append
check-column = user_id
last-value = 1002000
Part04-生产案例与实战讲解
4.1 MySQL到HDFS案例
MySQL到HDFS是最常见的数据迁移场景。更多视频教程www.fgedu.net.cn
# mysql_to_hdfs.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
echo “=== MySQL to HDFS Migration ===”
echo “Date: $(date)”
# 配置参数
DB_HOST=”fgedu01″
DB_PORT=”3306″
DB_NAME=”fgedudb”
DB_USER=”fgedu”
DB_PASS=”fgedu123″
HDFS_DIR=”/bigdata/sqoop/mysql_import”
# 获取表列表
TABLES=$(sqoop list-tables –connect jdbc:mysql://${DB_HOST}:${DB_PORT}/${DB_NAME} –username ${DB_USER} –password ${DB_PASS})
# 导入所有表
for table in ${TABLES}; do
echo “Importing table: ${table}”
sqoop import –connect jdbc:mysql://${DB_HOST}:${DB_PORT}/${DB_NAME} \
–username ${DB_USER} –password ${DB_PASS} \
–table ${table} \
–target-dir ${HDFS_DIR}/${table} \
–delete-target-dir \
–num-mappers 4 \
–compress \
–compression-codec org.apache.hadoop.io.compress.SnappyCodec
done
echo “=== Migration Completed ===”
./mysql_to_hdfs.sh
=== MySQL to HDFS Migration ===
Date: Fri Jan 19 11:00:00 CST 2024
Importing table: fgedu_user
24/01/19 11:00:00 INFO mapreduce.Job: Job completed successfully
Importing table: fgedu_order
24/01/19 11:10:00 INFO mapreduce.Job: Job completed successfully
Importing table: fgedu_product
24/01/19 11:20:00 INFO mapreduce.Job: Job completed successfully
=== Migration Completed ===
4.2 HDFS到MySQL案例
HDFS到MySQL用于数据导出场景。学习交流加群风哥微信: itpux-com
# hdfs_to_mysql.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
echo “=== HDFS to MySQL Export ===”
echo “Date: $(date)”
# 配置参数
DB_HOST=”fgedu01″
DB_PORT=”3306″
DB_NAME=”report_db”
DB_USER=”fgedu”
DB_PASS=”fgedu123″
HDFS_DIR=”/bigdata/report/user_summary”
TABLE_NAME=”user_summary_report”
# 清空目标表
sqoop eval –connect jdbc:mysql://${DB_HOST}:${DB_PORT}/${DB_NAME} \
–username ${DB_USER} –password ${DB_PASS} \
–query “TRUNCATE TABLE ${TABLE_NAME}”
# 导出数据
sqoop export –connect jdbc:mysql://${DB_HOST}:${DB_PORT}/${DB_NAME} \
–username ${DB_USER} –password ${DB_PASS} \
–table ${TABLE_NAME} \
–export-dir ${HDFS_DIR} \
–num-mappers 4 \
–fields-terminated-by ‘\t’ \
–lines-terminated-by ‘\n’
# 验证导出
sqoop eval –connect jdbc:mysql://${DB_HOST}:${DB_PORT}/${DB_NAME} \
–username ${DB_USER} –password ${DB_PASS} \
–query “SELECT COUNT(*) FROM ${TABLE_NAME}”
echo “=== Export Completed ===”
./hdfs_to_mysql.sh
=== HDFS to MySQL Export ===
Date: Fri Jan 19 11:30:00 CST 2024
# 清空目标表
# 清空成功
# 导出数据
24/01/19 11:35:00 INFO mapreduce.Job: Job completed successfully
24/01/19 11:35:00 INFO sqoop.ExportTool: Exported 500000 records.
# 验证导出
———————-
| COUNT(*) |
———————-
| 500000 |
———————-
=== Export Completed ===
4.3 增量同步案例
4.3.1 定时增量同步
# incremental_sync.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
echo “=== Incremental Sync ===”
echo “Date: $(date)”
# 执行增量同步作业
sqoop job –exec fgedu_user_import
# 检查同步结果
LAST_VALUE=$(sqoop job –show fgedu_user_import | grep “last-value” | awk ‘{print $3}’)
echo “Last synced value: ${LAST_VALUE}”
# 记录同步日志
echo “$(date): Synced up to ${LAST_VALUE}” >> /bigdata/logs/sqoop_sync.log
echo “=== Sync Completed ===”
./incremental_sync.sh
=== Incremental Sync ===
Date: Fri Jan 19 12:00:00 CST 2024
# 执行作业
24/01/19 12:00:00 INFO sqoop.ImportTool: Incremental import complete!
24/01/19 12:00:00 INFO sqoop.ImportTool: Next incremental import will use –last-value 1005000
# 同步结果
Last synced value: 1005000
# 同步日志
Fri Jan 19 12:00:00 CST 2024: Synced up to 1005000
=== Sync Completed ===
Part05-风哥经验总结与分享
5.1 Sqoop最佳实践
在实际生产环境中,Sqoop使用需要注意以下几点:from bigdata视频:www.itpux.com
1. 合理设置并行度
2. 使用压缩减少存储
3. 增量同步减少数据量
4. 监控迁移进度
5. 做好错误处理
5.2 数据迁移经验总结
5.2.1 迁移建议
– 评估数据量和迁移时间
– 选择合适的迁移窗口
– 做好数据验证
– 制定回滚方案
– 记录迁移日志
5.2.2 Sqoop运维脚本
# sqoop_maintenance.sh
# from:www.itpux.com.qq113257174.wx:itpux-com
# web: http://www.fgedu.net.cn
echo “=== Sqoop Maintenance ===”
echo “Date: $(date)”
# 1. 检查Sqoop版本
echo “=== Check Sqoop Version ===”
sqoop version
# 2. 检查作业列表
echo “=== Check Job List ===”
sqoop job –list
# 3. 检查数据库连接
echo “=== Check Database Connection ===”
sqoop list-databases –connect jdbc:mysql://fgedu01:3306/ –username fgedu –password fgedu123
# 4. 清理临时文件
echo “=== Clean Temp Files ===”
hdfs dfs -rm -r -skipTrash /tmp/sqoop-* 2>/dev/null
echo “=== Maintenance Completed ===”
./sqoop_maintenance.sh
=== Sqoop Maintenance ===
Date: Fri Jan 19 12:30:00 CST 2024
=== Check Sqoop Version ===
Sqoop 1.4.7
=== Check Job List ===
Available jobs:
fgedu_user_import
fgedu_order_import
=== Check Database Connection ===
information_schema
fgedudb
hive_metastore
=== Clean Temp Files ===
# 清理完成
=== Maintenance Completed ===
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
