1. Sqoop Overview and Version Notes
Sqoop is an open-source Apache tool for efficiently transferring bulk data between Apache Hadoop and relational databases. It can import data from a relational database into HDFS, Hive, or HBase, and export data from Hadoop back into a relational database.
Sqoop version notes:
Sqoop 1.4.7 (latest stable release; used throughout this guide)
Sqoop 1.4.6 / 1.4.5 (older releases)
Sqoop2 1.99.7 (next-generation branch; incomplete and not recommended for production)
Databases supported by Sqoop:
– MySQL / MariaDB
– PostgreSQL
– Oracle
– SQL Server
– DB2
– Informix
– Teradata
– Netezza
Supported big-data targets:
– HDFS
– Hive
– HBase
– Accumulo
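All of these transfers are driven by a single command-line client whose first argument selects a tool; the scenarios later in this guide are just different tool invocations. Once Sqoop is installed (section 3), the tools can be listed with the help command. Abridged sample output (wording may differ slightly between releases):
$ sqoop help
Available commands:
  codegen            Generate code to interact with database records
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  metastore          Run a standalone Sqoop metastore
  version            Display version information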
2. Downloading Sqoop
Sqoop can be downloaded in several ways, including from the official Apache archive and from mirror sites in China.
Option 1: Official download
$ mkdir -p /fgeudb/software/sqoop
$ cd /fgeudb/software/sqoop
# Download Sqoop 1.4.7
$ wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
# Download the Sqoop source tarball (optional)
$ wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.tar.gz
# List the downloaded files
$ ls -lh
Sample output:
total 21M
-rw-r--r-- 1 root root 18M Apr 4 10:00 sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
-rw-r--r-- 1 root root 3.1M Apr 4 10:00 sqoop-1.4.7.tar.gz
Option 2: Mirror sites in China
Note: Sqoop was retired to the Apache Attic in 2021, so active mirrors may no longer carry it; the archive.apache.org URL above is the reliable fallback.
# Aliyun mirror
$ wget https://mirrors.aliyun.com/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
# Huawei Cloud mirror
$ wget https://mirrors.huawei.com/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
# Tsinghua University mirror
$ wget https://mirrors.tuna.tsinghua.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
Option 3: Download the JDBC drivers
$ cd /fgeudb/software/sqoop
$ wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.33/mysql-connector-java-8.0.33.jar
# Oracle JDBC driver (requires an Oracle account):
# download ojdbc8.jar from the Oracle website
# Download the PostgreSQL JDBC driver
$ wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.1/postgresql-42.7.1.jar
# List the driver jars
$ ls -lh *.jar
Sample output:
-rw-r--r-- 1 root root 2.5M Apr 4 10:00 mysql-connector-java-8.0.33.jar
-rw-r--r-- 1 root root 1.0M Apr 4 10:00 postgresql-42.7.1.jar
3. Sqoop Installation and Deployment
Sqoop depends on Java and Hadoop, so install and configure those first.
Step 1: Extract the tarball
$ cd /fgeudb
$ tar -zxvf /fgeudb/software/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
$ mv sqoop-1.4.7.bin__hadoop-2.6.0 sqoop
# Inspect the directory layout
$ ls -la /fgeudb/sqoop/
Sample output:
total 24
drwxr-xr-x 2 root root 4096 Apr 4 10:00 bin
drwxr-xr-x 2 root root 4096 Apr 4 10:00 conf
drwxr-xr-x 2 root root 4096 Apr 4 10:00 lib
drwxr-xr-x 2 root root 4096 Apr 4 10:00 docs
-rw-r--r-- 1 root root 8196 Apr 4 10:00 README.txt
Step 2: Configure environment variables
$ vi ~/.bash_profile
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
export HADOOP_HOME=/fgeudb/hadoop
export HIVE_HOME=/fgeudb/hive
export SQOOP_HOME=/fgeudb/sqoop
export PATH=$SQOOP_HOME/bin:$PATH
# Apply the changes
$ source ~/.bash_profile
# Verify the variables
$ echo $SQOOP_HOME
/fgeudb/sqoop
# Check the Java version
$ java -version
Sample output:
openjdk version "1.8.0_402"
OpenJDK Runtime Environment (build 1.8.0_402-b06)
OpenJDK 64-Bit Server VM (build 25.402-b06, mixed mode)
# Check the Hadoop version
$ hadoop version
Sample output:
Hadoop 3.3.6
Source code repository https://github.com/apache/hadoop.git -r abc123
Compiled by root on 2024-01-15T10:00Z
Step 3: Configure Sqoop
$ cd /fgeudb/sqoop/conf
$ cp sqoop-env-template.sh sqoop-env.sh
# Edit the configuration file
$ vi sqoop-env.sh
export HADOOP_COMMON_HOME=/fgeudb/hadoop
export HADOOP_MAPRED_HOME=/fgeudb/hadoop
export HBASE_HOME=/fgeudb/hbase
export HIVE_HOME=/fgeudb/hive
export ZOOCFGDIR=/fgeudb/zookeeper/conf
export ZOOKEEPER_HOME=/fgeudb/zookeeper
# Copy the JDBC drivers into Sqoop's lib directory
$ cp /fgeudb/software/sqoop/mysql-connector-java-8.0.33.jar /fgeudb/sqoop/lib/
$ cp /fgeudb/software/sqoop/postgresql-42.7.1.jar /fgeudb/sqoop/lib/
# Verify the jars are in place
$ ls -la /fgeudb/sqoop/lib/
Sample output (abridged; the lib directory also contains Sqoop's own jars):
-rw-r--r-- 1 root root 2580480 Apr 4 10:00 mysql-connector-java-8.0.33.jar
-rw-r--r-- 1 root root 1048576 Apr 4 10:00 postgresql-42.7.1.jar
Step 4: Verify the installation
$ sqoop version
Sample output:
Sqoop 1.4.7
Git commit hash abc123def456
Compiled by root on Mon Jan 15 10:00:00 UTC 2024
# Test the MySQL connection
$ sqoop list-databases --connect jdbc:mysql://192.168.1.51:3306/ --username root --password root123
Sample output:
information_schema
fgedu_db
mysql
performance_schema
sys
# Test the PostgreSQL connection
$ sqoop list-databases --connect jdbc:postgresql://192.168.1.51:5432/ --username postgres --password postgres123
Sample output:
fgedu_db
postgres
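Beyond list-databases, the eval tool runs an arbitrary SQL statement against the source and prints the result, which is a quick way to confirm that the JDBC driver, credentials, and network path all work before launching a full import. A minimal sketch (table name and query are illustrative):
$ sqoop eval \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password root123 \
--query "SELECT COUNT(*) FROM orders"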
4. Sqoop Configuration in Detail
Sqoop configuration falls into two parts: environment settings and connector/metastore settings. Note that correctly set environment variables are a prerequisite for Sqoop to run at all.
sqoop-env.sh settings
$ vi /fgeudb/sqoop/conf/sqoop-env.sh
# Hadoop paths
export HADOOP_COMMON_HOME=/fgeudb/hadoop
export HADOOP_MAPRED_HOME=/fgeudb/hadoop
# Hive path (needed for Hive imports)
export HIVE_HOME=/fgeudb/hive
# HBase path (needed for HBase imports)
export HBASE_HOME=/fgeudb/hbase
# ZooKeeper paths
export ZOOCFGDIR=/fgeudb/zookeeper/conf
export ZOOKEEPER_HOME=/fgeudb/zookeeper
# Accumulo path (optional)
export ACCUMULO_HOME=/fgeudb/accumulo
sqoop-site.xml settings
$ vi /fgeudb/sqoop/conf/sqoop-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>sqoop.metastore.client.record.password</name>
<value>true</value>
<description>Whether to store passwords in the metastore</description>
</property>
<property>
<name>sqoop.metastore.client.autoconnect.url</name>
<value>jdbc:mysql://192.168.1.51:3306/sqoop_metastore</value>
<description>JDBC URL of the Sqoop metastore database</description>
</property>
<property>
<name>sqoop.metastore.client.autoconnect.username</name>
<value>sqoop</value>
</property>
<property>
<name>sqoop.metastore.client.autoconnect.password</name>
<value>sqoop123</value>
</property>
</configuration>
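These metastore properties matter only if you want saved jobs (section 7) shared across machines. Be aware that the metastore bundled with Sqoop 1 is HSQLDB-based, and HSQLDB is the only backend the official documentation covers, so treat the MySQL URL above as something to verify in your own environment. A minimal sketch of running and addressing a shared metastore (host is illustrative; 16000 is the default port):
$ sqoop metastore &
# From any client, target the shared metastore with --meta-connect
$ sqoop job --list \
--meta-connect jdbc:hsqldb:hsql://192.168.1.51:16000/sqoop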
Creating a password file
$ echo -n "root123" > /fgeudb/sqoop/.mysql.password
$ chmod 400 /fgeudb/sqoop/.mysql.password
# Connect using the password file
$ sqoop list-databases \
--connect jdbc:mysql://192.168.1.51:3306/ \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password
Sample output:
information_schema
fgedu_db
mysql
performance_schema
sys
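If you prefer not to keep a password on disk at all, most Sqoop tools also accept -P, which prompts for the password interactively; this suits ad-hoc commands but not cron jobs:
$ sqoop list-databases \
--connect jdbc:mysql://192.168.1.51:3306/ \
--username root \
-P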
5. Data Import in Practice
Sqoop can import data from relational databases into HDFS, Hive, HBase, and other storage systems.
Scenario 1: Importing MySQL data into HDFS
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--target-dir /user/hadoop/orders \
--delete-target-dir \
--num-mappers 4 \
--fields-terminated-by '\t' \
--lines-terminated-by '\n'
Sample output:
26/04/04 10:00:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
26/04/04 10:00:01 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
26/04/04 10:00:02 INFO mapreduce.ImportJobBase: Beginning import of orders
26/04/04 10:00:05 INFO mapreduce.Job: Running job job_1712217600000_0001
26/04/04 10:00:30 INFO mapreduce.Job: Job job_1712217600000_0001 completed successfully
26/04/04 10:00:31 INFO mapreduce.ImportJobBase: Transferred 125.5 MB in 30.25 seconds (4.15 MB/sec)
26/04/04 10:00:31 INFO mapreduce.ImportJobBase: Retrieved 1000000 records.
# Inspect the imported files
$ hdfs dfs -ls /user/hadoop/orders
Sample output:
Found 5 items
-rw-r--r-- 1 root supergroup 0 2026-04-04 10:00 /user/hadoop/orders/_SUCCESS
-rw-r--r-- 1 root supergroup 31457280 2026-04-04 10:00 /user/hadoop/orders/part-m-00000
-rw-r--r-- 1 root supergroup 31457280 2026-04-04 10:00 /user/hadoop/orders/part-m-00001
-rw-r--r-- 1 root supergroup 31457280 2026-04-04 10:00 /user/hadoop/orders/part-m-00002
-rw-r--r-- 1 root supergroup 31457280 2026-04-04 10:00 /user/hadoop/orders/part-m-00003
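Each mapper writes one part-m-* file, so --num-mappers 4 yields four data files. The tab-separated rows can be spot-checked directly; the values below are illustrative:
$ hdfs dfs -cat /user/hadoop/orders/part-m-00000 | head -3
Sample output:
1	ORD20260404001	100	999.99	pending
2	ORD20260404002	101	59.00	paid
3	ORD20260404003	102	120.50	paid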
Scenario 2: Importing with a free-form query
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--query "SELECT id, order_no, user_id, amount, status FROM orders WHERE create_time >= '2026-01-01' AND \$CONDITIONS" \
--target-dir /user/hadoop/orders_2026 \
--delete-target-dir \
--num-mappers 4 \
--split-by id
Sample output:
26/04/04 10:00:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
26/04/04 10:00:30 INFO mapreduce.ImportJobBase: Transferred 45.2 MB in 28.15 seconds (1.61 MB/sec)
26/04/04 10:00:30 INFO mapreduce.ImportJobBase: Retrieved 350000 records.
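The literal \$CONDITIONS token is mandatory in a --query import: Sqoop splits the range of the --split-by column across the mappers and substitutes a range predicate for the token in each mapper's copy of the query. Conceptually, with four mappers over ids 1 to 1000000, the per-mapper queries look like the sketch below (boundaries are illustrative; Sqoop computes them from MIN(id) and MAX(id)):
-- mapper 0 runs:
SELECT id, order_no, user_id, amount, status FROM orders
WHERE create_time >= '2026-01-01' AND ( id >= 1 AND id < 250001 )
-- mapper 3 runs:
SELECT id, order_no, user_id, amount, status FROM orders
WHERE create_time >= '2026-01-01' AND ( id >= 750001 AND id <= 1000000 )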
# Select a subset of columns with --columns
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--columns "id,order_no,user_id,amount" \
--target-dir /user/hadoop/orders_partial \
--delete-target-dir \
--num-mappers 2
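For row-level filtering without a full free-form query, --where pushes a predicate into the generated SELECT and combines naturally with --columns (the predicate here is illustrative):
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--columns "id,order_no,amount" \
--where "status = 'paid'" \
--target-dir /user/hadoop/orders_paid \
--delete-target-dir \
--num-mappers 2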
Scenario 3: Importing into a Hive table
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--hive-import \
--hive-database fgedu_dw \
--hive-table orders \
--create-hive-table \
--num-mappers 4
Sample output:
26/04/04 10:00:00 INFO hive.HiveImport: Loading uploaded data into Hive
26/04/04 10:00:30 INFO hive.HiveImport: Table fgedu_dw.orders stats: [numFiles=4, numRows=1000000, totalSize=125500000, rawDataSize=125000000]
26/04/04 10:00:30 INFO hive.HiveImport: Hive import complete.
# Verify the Hive table
$ hive -e "SELECT COUNT(*) FROM fgedu_dw.orders;"
Sample output:
1000000
# Incremental import (append mode): only rows with id greater than --last-value are fetched
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--hive-import \
--hive-database fgedu_dw \
--hive-table orders \
--incremental append \
--check-column id \
--last-value 1000000 \
--num-mappers 2
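Append mode only picks up rows with an id above --last-value; it never sees updates to existing rows. For tables that carry an update timestamp, lastmodified mode re-imports rows whose timestamp advanced and, with --merge-key, folds them into the previously imported data. A sketch against a plain HDFS directory (as of 1.4.7 lastmodified cannot be combined with --hive-import; the update_time column is an assumption about the table schema):
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--target-dir /user/hadoop/orders \
--incremental lastmodified \
--check-column update_time \
--last-value "2026-04-01 00:00:00" \
--merge-key id \
--num-mappers 2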
Scenario 4: Importing into an HBase table
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--hbase-table fgedu_orders \
--column-family cf \
--hbase-row-key id \
--num-mappers 4
Sample output:
26/04/04 10:00:00 INFO hbase.HBaseImportJob: Uploading data to HBase table fgedu_orders
26/04/04 10:00:30 INFO hbase.HBaseImportJob: HBase import complete.
# Verify the HBase data
$ echo "scan 'fgedu_orders', {LIMIT => 5}" | hbase shell
Sample output:
ROW COLUMN+CELL
1 column=cf:amount, timestamp=1712217600000, value=999.99
1 column=cf:order_no, timestamp=1712217600000, value=ORD20260404001
1 column=cf:status, timestamp=1712217600000, value=pending
1 column=cf:user_id, timestamp=1712217600000, value=100
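By default the target HBase table and column family must exist before the import. Either create them in the HBase shell first, or add --hbase-create-table to the import command and Sqoop will create the table itself:
$ echo "create 'fgedu_orders', 'cf'" | hbase shell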
6. Data Export in Practice
Sqoop can export data from HDFS or Hive into relational databases.
Scenario 1: Exporting HDFS data to MySQL
# Create the target table first
$ mysql -h 192.168.1.51 -u root -proot123 -e "
CREATE TABLE IF NOT EXISTS fgedu_db.orders_export (
id INT PRIMARY KEY,
order_no VARCHAR(50),
user_id INT,
amount DECIMAL(10,2),
status VARCHAR(20),
create_time DATETIME
);
"
# Export the HDFS data to MySQL
$ sqoop export \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders_export \
--export-dir /user/hadoop/orders \
--input-fields-terminated-by '\t' \
--input-lines-terminated-by '\n' \
--num-mappers 4
Sample output:
26/04/04 10:00:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
26/04/04 10:00:01 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
26/04/04 10:00:02 INFO mapreduce.ExportJobBase: Beginning export of orders_export
26/04/04 10:00:30 INFO mapreduce.ExportJobBase: Transferred 125.5 MB in 28.15 seconds (4.46 MB/sec)
26/04/04 10:00:30 INFO mapreduce.ExportJobBase: Exported 1000000 records.
# Verify the exported rows
$ mysql -h 192.168.1.51 -u root -proot123 -e "SELECT COUNT(*) FROM fgedu_db.orders_export;"
Sample output:
+----------+
| COUNT(*) |
+----------+
|  1000000 |
+----------+
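Because each mapper commits its own transactions, a failed export can leave the target table partially written. For an all-or-nothing export, stage through an intermediate table with the same schema: Sqoop loads the staging table first and copies into the target in one final transaction. A sketch, assuming a pre-created table orders_export_stage (--clear-staging-table empties it before loading; staging cannot be combined with --update-key):
$ sqoop export \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders_export \
--staging-table orders_export_stage \
--clear-staging-table \
--export-dir /user/hadoop/orders \
--input-fields-terminated-by '\t' \
--num-mappers 4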
Scenario 2: Export in update mode
$ sqoop export \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders_export \
--export-dir /user/hadoop/orders_update \
--input-fields-terminated-by '\t' \
--update-key id \
--update-mode allowinsert \
--num-mappers 4
Sample output:
26/04/04 10:00:00 INFO mapreduce.ExportJobBase: Beginning export of orders_export
26/04/04 10:00:30 INFO mapreduce.ExportJobBase: Exported 50000 records.
# Export via a stored procedure; --call replaces --table (the two are mutually exclusive)
$ sqoop export \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--call process_orders \
--export-dir /user/hadoop/orders \
--num-mappers 2
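With --call, Sqoop invokes the procedure once per input record, passing one argument per parsed column, so the parameter list must match the exported columns in order and count. A hypothetical MySQL procedure matching the six-column orders layout:
$ mysql -h 192.168.1.51 -u root -proot123 fgedu_db -e "
CREATE PROCEDURE process_orders(
  IN p_id INT, IN p_order_no VARCHAR(50), IN p_user_id INT,
  IN p_amount DECIMAL(10,2), IN p_status VARCHAR(20), IN p_create_time DATETIME)
INSERT INTO orders_export (id, order_no, user_id, amount, status, create_time)
VALUES (p_id, p_order_no, p_user_id, p_amount, p_status, p_create_time);
"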
7. Sqoop Job Management
Sqoop jobs save a tool invocation and its parameters so that recurring transfer tasks can be re-run by name.
Step 1: Create a Sqoop job
$ sqoop job --create import_orders \
-- import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--target-dir /user/hadoop/orders \
--incremental append \
--check-column id \
--last-value 0 \
--num-mappers 4
Sample output:
Job 'import_orders' created successfully.
# Create an export job
$ sqoop job --create export_orders \
-- export \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders_export \
--export-dir /user/hadoop/orders \
--input-fields-terminated-by '\t' \
--num-mappers 4
Sample output:
Job 'export_orders' created successfully.
Step 2: Manage Sqoop jobs
$ sqoop job --list
Sample output:
Available jobs:
export_orders
import_orders
# Show the job definition
$ sqoop job --show import_orders
Sample output:
Job: import_orders
Tool: import
Options:
-------------------------------
connect = jdbc:mysql://192.168.1.51:3306/fgedu_db
username = root
password-file = file:///fgeudb/sqoop/.mysql.password
table = orders
target-dir = /user/hadoop/orders
incremental = append
check-column = id
last-value = 0
num-mappers = 4
# Execute the job
$ sqoop job --exec import_orders
Sample output:
26/04/04 10:00:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
26/04/04 10:00:30 INFO mapreduce.ImportJobBase: Retrieved 50000 records.
Incremental import complete! LastValue = 1050000
# Delete the job
$ sqoop job --delete import_orders
Sample output:
Job 'import_orders' deleted successfully.
Step 3: Run a Sqoop job on a schedule (this assumes the import_orders job created in Step 1 still exists)
$ vi /fgeudb/scripts/sqoop_daily_import.sh
#!/bin/bash
export SQOOP_HOME=/fgeudb/sqoop
export HADOOP_HOME=/fgeudb/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
LOG_FILE=/fgeudb/logs/sqoop_import_$(date +%Y%m%d).log
echo "Starting Sqoop import job at $(date)" >> $LOG_FILE
$SQOOP_HOME/bin/sqoop job --exec import_orders >> $LOG_FILE 2>&1
echo "Sqoop import job completed at $(date)" >> $LOG_FILE
# Make the script executable
$ chmod +x /fgeudb/scripts/sqoop_daily_import.sh
# Schedule it with crontab
$ crontab -e
# Run every day at 02:00
0 2 * * * /fgeudb/scripts/sqoop_daily_import.sh
# List the scheduled tasks
$ crontab -l
Sample output:
0 2 * * * /fgeudb/scripts/sqoop_daily_import.sh
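After the first scheduled run, check the log and confirm that the saved job's stored last-value has advanced (the exact property name in the --show output varies by version, hence the loose grep):
$ tail -n 5 /fgeudb/logs/sqoop_import_$(date +%Y%m%d).log
$ sqoop job --show import_orders | grep -i last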