
Sqoop Download: Hadoop Data Migration Tool Download Links and Installation Guide

1. Sqoop Overview and Versions

Sqoop is an Apache open-source tool for bulk data transfer, designed to move large volumes of data efficiently between Apache Hadoop and relational databases. Sqoop can import data from a relational database into HDFS, Hive, or HBase, and can also export data from Hadoop back to a relational database.
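
For orientation, the two core operations look like this (a minimal sketch; host, database, and table names are placeholders, authentication options are omitted, and both commands are covered in depth in sections 5 and 6):

# Import a relational table into HDFS (sketch with placeholder names)
$ sqoop import --connect jdbc:mysql://dbhost:3306/mydb --table mytable --target-dir /user/hadoop/mytable

# Export an HDFS directory back into a relational table
$ sqoop export --connect jdbc:mysql://dbhost:3306/mydb --table mytable --export-dir /user/hadoop/mytable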

Sqoop versions:

Sqoop 1.4.7 (latest stable release)
Sqoop 1.4.6 (stable release)
Sqoop 1.4.5 (older release)

Sqoop2 1.99.7 (next-generation version; not recommended for production use)

Databases and stores supported by Sqoop:

Relational databases:
– MySQL / MariaDB
– PostgreSQL
– Oracle
– SQL Server
– DB2
– Informix
– Teradata
– Netezza

Big data stores:
– HDFS
– Hive
– HBase
– Accumulo

2. Downloading Sqoop

Sqoop can be downloaded in several ways, including from the official Apache site or from mirrors located in China.

Option 1: Official download

# Create the download directory
$ mkdir -p /fgeudb/software/sqoop
$ cd /fgeudb/software/sqoop

# Download Sqoop 1.4.7
$ wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

# Download the Sqoop source package (optional)
$ wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.src.tar.gz

# List the downloaded files
$ ls -lh

Sample output:
total 180M
-rw-r--r-- 1 root root 170M Apr 4 10:00 sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
-rw-r--r-- 1 root root 10M Apr 4 10:00 sqoop-1.4.7.src.tar.gz
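
Before extracting, it is worth verifying the download. Apache publishes GPG signatures alongside each release; the .asc and KEYS file names below follow the usual Apache archive layout, so adjust them if your mirror differs:

# Verify the release signature (sketch; file names follow Apache conventions)
$ wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz.asc
$ wget https://archive.apache.org/dist/sqoop/KEYS

# Import the release keys and check the signature
$ gpg --import KEYS
$ gpg --verify sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz.asc sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz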

Option 2: Download from a mirror in China

# Aliyun mirror
$ wget https://mirrors.aliyun.com/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

# Huawei Cloud mirror
$ wget https://mirrors.huawei.com/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

# Tsinghua University mirror
$ wget https://mirrors.tuna.tsinghua.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

Option 3: Download JDBC drivers

# Download MySQL Connector/J
$ cd /fgeudb/software/sqoop
$ wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.33/mysql-connector-java-8.0.33.jar

# Oracle JDBC driver (requires an Oracle account)
# Download ojdbc8.jar from the Oracle website

# Download the PostgreSQL JDBC driver
$ wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.1/postgresql-42.7.1.jar

# List the driver files
$ ls -lh *.jar

Sample output:
-rw-r--r-- 1 root root 2.5M Apr 4 10:00 mysql-connector-java-8.0.33.jar
-rw-r--r-- 1 root root 1.0M Apr 4 10:00 postgresql-42.7.1.jar

3. Sqoop Installation

Sqoop depends on Hadoop and a Java environment, so install and configure those dependencies first.

Step 1: Extract the archive

# Extract the package
$ cd /fgeudb
$ tar -zxvf /fgeudb/software/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
$ mv sqoop-1.4.7.bin__hadoop-2.6.0 sqoop

# Inspect the directory layout
$ ls -la /fgeudb/sqoop/

Sample output:
total 24
drwxr-xr-x 2 root root 4096 Apr 4 10:00 bin
drwxr-xr-x 2 root root 4096 Apr 4 10:00 conf
drwxr-xr-x 2 root root 4096 Apr 4 10:00 lib
drwxr-xr-x 2 root root 4096 Apr 4 10:00 docs
-rw-r--r-- 1 root root 8196 Apr 4 10:00 README.txt

Step 2: Configure environment variables

# Edit the profile
$ vi ~/.bash_profile

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
export HADOOP_HOME=/fgeudb/hadoop
export HIVE_HOME=/fgeudb/hive
export SQOOP_HOME=/fgeudb/sqoop
export PATH=$SQOOP_HOME/bin:$PATH

# Reload the profile
$ source ~/.bash_profile

# Verify the environment
$ echo $SQOOP_HOME
/fgeudb/sqoop

# Verify Java
$ java -version

Sample output:
openjdk version "1.8.0_402"
OpenJDK Runtime Environment (build 1.8.0_402-b06)
OpenJDK 64-Bit Server VM (build 25.402-b06, mixed mode)

# Verify Hadoop
$ hadoop version

Sample output:
Hadoop 3.3.6
Source code repository https://github.com/apache/hadoop.git -r abc123
Compiled by root on 2024-01-15T10:00Z

Step 3: Configure Sqoop

# Copy the configuration template
$ cd /fgeudb/sqoop/conf
$ cp sqoop-env-template.sh sqoop-env.sh

# Edit the configuration file
$ vi sqoop-env.sh

export HADOOP_COMMON_HOME=/fgeudb/hadoop
export HADOOP_MAPRED_HOME=/fgeudb/hadoop
export HBASE_HOME=/fgeudb/hbase
export HIVE_HOME=/fgeudb/hive
export ZOOCFGDIR=/fgeudb/zookeeper/conf
export ZOOKEEPER_HOME=/fgeudb/zookeeper

# Copy the JDBC drivers into the lib directory
$ cp /fgeudb/software/sqoop/mysql-connector-java-8.0.33.jar /fgeudb/sqoop/lib/
$ cp /fgeudb/software/sqoop/postgresql-42.7.1.jar /fgeudb/sqoop/lib/

# Check the lib directory
$ ls -la /fgeudb/sqoop/lib/

Sample output:
total 5120
-rw-r--r-- 1 root root 2580480 Apr 4 10:00 mysql-connector-java-8.0.33.jar
-rw-r--r-- 1 root root 1048576 Apr 4 10:00 postgresql-42.7.1.jar

Step 4: Verify the installation

# Verify the Sqoop installation
$ sqoop version

Sample output:
Sqoop 1.4.7
Git commit hash abc123def456
Compiled by root on Mon Jan 15 10:00:00 UTC 2024

# Test the MySQL connection
$ sqoop list-databases --connect jdbc:mysql://192.168.1.51:3306/ --username root --password root123

Sample output:
information_schema
fgedu_db
mysql
performance_schema
sys

# Test the PostgreSQL connection
$ sqoop list-databases --connect jdbc:postgresql://192.168.1.51:5432/ --username postgres --password postgres123

Sample output:
fgedu_db
postgres

Production tip: place JDBC drivers in the $SQOOP_HOME/lib directory so that Sqoop can load them correctly. Do not put passwords directly on the command line; use the --password-file option to point at a password file instead.

4. Sqoop Configuration in Detail

Sqoop configuration consists of two parts: the environment configuration and the connector configuration. Note that correctly set environment variables are a prerequisite for Sqoop to run at all.

sqoop-env.sh settings

# Sqoop environment configuration file
$ vi /fgeudb/sqoop/conf/sqoop-env.sh

# Hadoop paths
export HADOOP_COMMON_HOME=/fgeudb/hadoop
export HADOOP_MAPRED_HOME=/fgeudb/hadoop

# Hive path (needed for imports into Hive)
export HIVE_HOME=/fgeudb/hive

# HBase path (needed for imports into HBase)
export HBASE_HOME=/fgeudb/hbase

# ZooKeeper paths
export ZOOCFGDIR=/fgeudb/zookeeper/conf
export ZOOKEEPER_HOME=/fgeudb/zookeeper

# Accumulo path (optional)
export ACCUMULO_HOME=/fgeudb/accumulo

sqoop-site.xml settings

# Create sqoop-site.xml
$ vi /fgeudb/sqoop/conf/sqoop-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>sqoop.metastore.client.record.password</name>
<value>true</value>
<description>Whether to store passwords in the metastore</description>
</property>

<property>
<name>sqoop.metastore.client.autoconnect.url</name>
<value>jdbc:mysql://192.168.1.51:3306/sqoop_metastore</value>
<description>Sqoop metastore database connection URL</description>
</property>

<property>
<name>sqoop.metastore.client.autoconnect.username</name>
<value>sqoop</value>
</property>

<property>
<name>sqoop.metastore.client.autoconnect.password</name>
<value>sqoop123</value>
</property>
</configuration>

Creating a password file

# Create a password file (the safer approach)
$ echo -n "root123" > /fgeudb/sqoop/.mysql.password
$ chmod 400 /fgeudb/sqoop/.mysql.password

# Connect using the password file
$ sqoop list-databases \
--connect jdbc:mysql://192.168.1.51:3306/ \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password

Sample output:
information_schema
fgedu_db
mysql
performance_schema
sys
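
An alternative to a plain password file is Hadoop's credential provider, which Sqoop supports through --password-alias. A sketch, where the alias name and the JCEKS keystore path are assumptions of this example:

# Store the password in a JCEKS keystore (prompts for the value)
$ hadoop credential create fgedu.mysql.password \
-provider jceks://hdfs/user/root/sqoop.password.jceks

# Reference the alias instead of a password file
$ sqoop list-databases \
-Dhadoop.security.credential.provider.path=jceks://hdfs/user/root/sqoop.password.jceks \
--connect jdbc:mysql://192.168.1.51:3306/ \
--username root \
--password-alias fgedu.mysql.password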

5. Data Import in Practice

Sqoop can import data from relational databases into storage systems such as HDFS, Hive, and HBase.

Scenario 1: Import MySQL data into HDFS

# Import a whole table into HDFS
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--target-dir /user/hadoop/orders \
--delete-target-dir \
--num-mappers 4 \
--fields-terminated-by '\t' \
--lines-terminated-by '\n'

Sample output:
26/04/04 10:00:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
26/04/04 10:00:01 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
26/04/04 10:00:02 INFO mapreduce.ImportJobBase: Beginning import of orders
26/04/04 10:00:05 INFO mapreduce.Job: Running job job_1712217600000_0001
26/04/04 10:00:30 INFO mapreduce.Job: Job job_1712217600000_0001 completed successfully
26/04/04 10:00:31 INFO mapreduce.ImportJobBase: Transferred 125.5 MB in 30.25 seconds (4.15 MB/sec)
26/04/04 10:00:31 INFO mapreduce.ImportJobBase: Retrieved 1000000 records.

# Inspect the imported data
$ hdfs dfs -ls /user/hadoop/orders

Sample output:
Found 5 items
-rw-r--r-- 1 root supergroup 0 2026-04-04 10:00 /user/hadoop/orders/_SUCCESS
-rw-r--r-- 1 root supergroup 31457280 2026-04-04 10:00 /user/hadoop/orders/part-m-00000
-rw-r--r-- 1 root supergroup 31457280 2026-04-04 10:00 /user/hadoop/orders/part-m-00001
-rw-r--r-- 1 root supergroup 31457280 2026-04-04 10:00 /user/hadoop/orders/part-m-00002
-rw-r--r-- 1 root supergroup 31457280 2026-04-04 10:00 /user/hadoop/orders/part-m-00003
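
To spot-check the contents, print the first few rows of one part file:

# Preview the imported rows (tab-delimited, as specified at import time)
$ hdfs dfs -cat /user/hadoop/orders/part-m-00000 | head -5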

Scenario 2: Conditional query import

# Conditional import with the --query option
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--query "SELECT id, order_no, user_id, amount, status FROM orders WHERE create_time >= '2026-01-01' AND \$CONDITIONS" \
--target-dir /user/hadoop/orders_2026 \
--delete-target-dir \
--num-mappers 4 \
--split-by id

Sample output:
26/04/04 10:00:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
26/04/04 10:00:30 INFO mapreduce.ImportJobBase: Transferred 45.2 MB in 28.15 seconds (1.61 MB/sec)
26/04/04 10:00:30 INFO mapreduce.ImportJobBase: Retrieved 350000 records.
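
The literal \$CONDITIONS token is mandatory in a --query import: each mapper runs its own copy of the query with the token replaced by a range predicate on the --split-by column. Illustratively (the actual ranges depend on the min and max of id):

# What one of the 4 mappers effectively executes (illustration only):
# SELECT id, order_no, user_id, amount, status FROM orders
# WHERE create_time >= '2026-01-01' AND ( id >= 1 AND id < 250001 )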

# Select specific columns with the --columns option
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--columns "id,order_no,user_id,amount" \
--target-dir /user/hadoop/orders_partial \
--delete-target-dir \
--num-mappers 2

Scenario 3: Import into a Hive table

# Import data into a Hive table
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--hive-import \
--hive-database fgedu_dw \
--hive-table orders \
--create-hive-table \
--num-mappers 4

Sample output:
26/04/04 10:00:00 INFO hive.HiveImport: Loading uploaded data into Hive
26/04/04 10:00:30 INFO hive.HiveImport: Table default.orders stats: [numFiles=4, numRows=1000000, totalSize=125500000, rawDataSize=125000000]
26/04/04 10:00:30 INFO hive.HiveImport: Hive import complete.

# Verify the Hive table
$ hive -e "SELECT COUNT(*) FROM fgedu_dw.orders;"

Sample output:
1000000

# Incremental import (append mode)
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--hive-import \
--hive-database fgedu_dw \
--hive-table orders \
--incremental append \
--check-column id \
--last-value 1000000 \
--num-mappers 2
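
Append mode only picks up new rows with a growing id. If existing rows can also be updated, Sqoop's lastmodified mode combined with --merge-key deduplicates on a key column. A sketch, assuming the table has an update_time timestamp column (an assumption; adjust to your schema); note that this mode writes to an HDFS directory rather than through --hive-import:

# Incremental import of new and updated rows (sketch; update_time is assumed)
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--incremental lastmodified \
--check-column update_time \
--last-value "2026-04-01 00:00:00" \
--merge-key id \
--target-dir /user/hadoop/orders \
--num-mappers 2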

Scenario 4: Import into an HBase table

# Import data into HBase
$ sqoop import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--hbase-table fgedu_orders \
--column-family cf \
--hbase-row-key id \
--num-mappers 4

Sample output:
26/04/04 10:00:00 INFO hbase.HBaseImportJob: Uploading data to HBase table fgedu_orders
26/04/04 10:00:30 INFO hbase.HBaseImportJob: HBase import complete.

# Verify the HBase data
$ echo "scan 'fgedu_orders', {LIMIT => 5}" | hbase shell

Sample output:
ROW COLUMN+CELL
1 column=cf:amount, timestamp=1712217600000, value=999.99
1 column=cf:order_no, timestamp=1712217600000, value=ORD20260404001
1 column=cf:status, timestamp=1712217600000, value=pending
1 column=cf:user_id, timestamp=1712217600000, value=100
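
For a rough total against the source table, the HBase shell's count command works (it scans the whole table, so it is slow on large tables):

# Count rows in the HBase table (full scan; INTERVAL controls progress output)
$ echo "count 'fgedu_orders', INTERVAL => 100000" | hbase shell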

6. Data Export in Practice

Sqoop can export data from HDFS and Hive back to relational databases.

Scenario 1: Export HDFS data to MySQL

# Create the target table
$ mysql -h 192.168.1.51 -u root -proot123 -e "
CREATE TABLE IF NOT EXISTS fgedu_db.orders_export (
id INT PRIMARY KEY,
order_no VARCHAR(50),
user_id INT,
amount DECIMAL(10,2),
status VARCHAR(20),
create_time DATETIME
);"

# Export HDFS data to MySQL
$ sqoop export \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders_export \
--export-dir /user/hadoop/orders \
--input-fields-terminated-by '\t' \
--input-lines-terminated-by '\n' \
--num-mappers 4

Sample output:
26/04/04 10:00:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
26/04/04 10:00:01 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
26/04/04 10:00:02 INFO mapreduce.ExportJobBase: Beginning export of orders_export
26/04/04 10:00:30 INFO mapreduce.ExportJobBase: Transferred 125.5 MB in 28.15 seconds (4.46 MB/sec)
26/04/04 10:00:30 INFO mapreduce.ExportJobBase: Exported 1000000 records.

# Verify the exported data
$ mysql -h 192.168.1.51 -u root -proot123 -e "SELECT COUNT(*) FROM fgedu_db.orders_export;"

Sample output:
+----------+
| COUNT(*) |
+----------+
|  1000000 |
+----------+
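
Exports run as parallel insert transactions, so a failed job can leave the target table partially loaded. For insert-style exports, --staging-table writes rows to an intermediate table first and moves them into the target only after every mapper succeeds. A sketch, assuming a staging table named orders_export_stage (note this option cannot be combined with --update-key):

# Create a staging table with the same schema as the target
$ mysql -h 192.168.1.51 -u root -proot123 -e "
CREATE TABLE IF NOT EXISTS fgedu_db.orders_export_stage LIKE fgedu_db.orders_export;"

# Export through the staging table
$ sqoop export \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders_export \
--staging-table orders_export_stage \
--clear-staging-table \
--export-dir /user/hadoop/orders \
--input-fields-terminated-by '\t' \
--num-mappers 4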

Scenario 2: Update-mode export

# Export in update mode
$ sqoop export \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders_export \
--export-dir /user/hadoop/orders_update \
--input-fields-terminated-by '\t' \
--update-key id \
--update-mode allowinsert \
--num-mappers 4

Sample output:
26/04/04 10:00:00 INFO mapreduce.ExportJobBase: Beginning export of orders_export
26/04/04 10:00:30 INFO mapreduce.ExportJobBase: Exported 50000 records.

# Export by calling a stored procedure for each record
# (note: --call replaces --table; the two options are mutually exclusive)
$ sqoop export \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--export-dir /user/hadoop/orders \
--call process_orders \
--num-mappers 2
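
With --call, Sqoop invokes the stored procedure once per record, passing the record's fields as procedure arguments in order. The procedure body is entirely up to you; the following is a hypothetical sketch of what process_orders could look like (parameter names and the upsert logic are illustrative, not from the original setup):

$ mysql -h 192.168.1.51 -u root -proot123 fgedu_db <<'SQL'
DELIMITER //
-- Hypothetical: one IN parameter per exported field, in column order
CREATE PROCEDURE process_orders(
    IN p_id INT, IN p_order_no VARCHAR(50), IN p_user_id INT,
    IN p_amount DECIMAL(10,2), IN p_status VARCHAR(20), IN p_create_time DATETIME)
BEGIN
    -- Example logic: upsert into the target table
    INSERT INTO orders_export (id, order_no, user_id, amount, status, create_time)
    VALUES (p_id, p_order_no, p_user_id, p_amount, p_status, p_create_time)
    ON DUPLICATE KEY UPDATE amount = p_amount, status = p_status;
END //
DELIMITER ;
SQL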

7. Sqoop Job Management

Sqoop supports creating and managing saved jobs, which makes recurring data migration tasks easy to re-run.

Step 1: Create a Sqoop job

# Create an import job
$ sqoop job --create import_orders \
-- import \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders \
--target-dir /user/hadoop/orders \
--incremental append \
--check-column id \
--last-value 0 \
--num-mappers 4

Sample output:
Job 'import_orders' created successfully.

# Create an export job
$ sqoop job --create export_orders \
-- export \
--connect jdbc:mysql://192.168.1.51:3306/fgedu_db \
--username root \
--password-file file:///fgeudb/sqoop/.mysql.password \
--table orders_export \
--export-dir /user/hadoop/orders \
--input-fields-terminated-by '\t' \
--num-mappers 4

Sample output:
Job 'export_orders' created successfully.

Step 2: Manage Sqoop jobs

# List all jobs
$ sqoop job --list

Sample output:
Available jobs:
export_orders
import_orders

# Show job details
$ sqoop job --show import_orders

Sample output:
Job: import_orders
Tool: import
Options:
-------------------------------
connect = jdbc:mysql://192.168.1.51:3306/fgedu_db
username = root
password-file = file:///fgeudb/sqoop/.mysql.password
table = orders
target-dir = /user/hadoop/orders
incremental = append
check-column = id
last-value = 0
num-mappers = 4

# Run a job
$ sqoop job --exec import_orders

Sample output:
26/04/04 10:00:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
26/04/04 10:00:30 INFO mapreduce.ImportJobBase: Retrieved 50000 records.
Incremental import complete! LastValue = 1050000

# Delete a job
$ sqoop job --delete import_orders

Sample output:
Job 'import_orders' deleted successfully.
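
By default, saved jobs live in a private HSQLDB metastore under the user's home directory, so they are only visible on that machine. For team use, Sqoop can run a shared metastore service that clients reach through --meta-connect (16000 is the default metastore port; the host name below is a placeholder):

# On the metastore host: start the shared metastore service
$ sqoop metastore &

# On any client: list the jobs stored in the shared metastore
$ sqoop job --list \
--meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop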

Step 3: Schedule Sqoop jobs

# Create the scheduled-import script
$ vi /fgeudb/scripts/sqoop_daily_import.sh

#!/bin/bash
export SQOOP_HOME=/fgeudb/sqoop
export HADOOP_HOME=/fgeudb/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk

LOG_FILE=/fgeudb/logs/sqoop_import_$(date +%Y%m%d).log

echo "Starting Sqoop import job at $(date)" >> $LOG_FILE

$SQOOP_HOME/bin/sqoop job --exec import_orders >> $LOG_FILE 2>&1

echo "Sqoop import job completed at $(date)" >> $LOG_FILE

# Make the script executable
$ chmod +x /fgeudb/scripts/sqoop_daily_import.sh

# Edit the crontab
$ crontab -e

# Run daily at 2:00 AM
0 2 * * * /fgeudb/scripts/sqoop_daily_import.sh

# List the crontab entries
$ crontab -l

Sample output:
0 2 * * * /fgeudb/scripts/sqoop_daily_import.sh
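
For unattended runs it also helps to check the exit code and raise an alert on failure. A minimal sketch of how the call inside the script could be hardened (the alerting hook is a placeholder):

# Sketch: fail loudly instead of only logging
$SQOOP_HOME/bin/sqoop job --exec import_orders >> $LOG_FILE 2>&1
if [ $? -ne 0 ]; then
    echo "Sqoop import job FAILED at $(date)" >> $LOG_FILE
    # placeholder: plug in your alerting here (mail, webhook, etc.)
    exit 1
fi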

Production tips: manage recurring data migration tasks as Sqoop jobs instead of retyping commands. For incremental imports, use the --check-column and --last-value options so that data is neither duplicated nor lost. Run large bulk migrations during off-peak hours, with a sensible degree of parallelism.

