1. Apache Impala简介与版本说明
Apache Impala是一个开源的分布式SQL查询引擎,专为Hadoop环境中的大规模数据分析而设计。更多学习教程www.fgedu.net.cn。Impala提供低延迟的SQL查询能力,支持对HDFS、HBase、Amazon S3等存储系统中的数据进行交互式分析。
Impala采用MPP架构,支持标准ANSI SQL,与Hive Metastore兼容。学习交流加群风哥微信: itpux-com。Impala广泛应用于数据仓库、实时分析、BI报表等场景,是Cloudera平台的核心组件。
Apache Impala核心特性:
– 标准SQL:支持ANSI SQL-92标准
– 多数据源:支持HDFS、HBase、S3、Kudu
– Hive兼容:共享Hive Metastore
– 实时查询:支持交互式数据分析
– 并行处理:MPP架构,分布式执行
– 资源管理:与YARN集成
– 安全性:支持Kerberos、Sentry、Ranger
– JDBC/ODBC:标准数据库连接接口
– UDF支持:支持用户自定义函数
Impala架构组件:
Impalad Impala守护进程,执行查询
Statestore 状态存储服务,维护集群状态
Catalogd 目录服务,管理元数据
Hive Metastore 元数据存储服务
HDFS 分布式文件系统
Kudu 列式存储系统
2. Impala版本选择与下载地址
Apache Impala主要通过Cloudera发行版发布,当前最新版本为Cloudera Runtime 7.3.1系列。
Impala版本状态:
Cloudera Runtime 7.3.1 2024-XX-XX 最新稳定版
Cloudera Runtime 7.1.7 2023-XX-XX 稳定版
CDH 6.3.x 2020-XX-XX 旧版支持
CDH 5.16.x 2019-XX-XX 旧版支持
Impala版本对应:
– Cloudera Runtime 7.3.1 包含 Impala 4.x
– Cloudera Runtime 7.1.7 包含 Impala 4.x
– CDH 6.3.x 包含 Impala 3.x
– CDH 5.16.x 包含 Impala 2.x
Impala 4.x主要更新:
– 性能优化
– Iceberg支持
– Kudu功能增强
– 安全性改进
– SQL兼容性增强
官方下载地址:
Cloudera下载:https://www.cloudera.com/downloads/
源码仓库:https://github.com/apache/impala
文档中心:https://impala.apache.org/docs/build/
ODBC驱动:https://www.cloudera.com/downloads/connectors/impala-odbc.html
JDBC驱动:https://www.cloudera.com/downloads/connectors/impala-jdbc.html
3. Impala下载方式详解
方式一:Cloudera Manager安装(推荐生产环境)
1. 下载Cloudera Manager安装包
2. 安装Cloudera Manager Server和Agent
3. 通过Web UI添加Impala服务
4. 配置Impala参数
5. 启动Impala服务
下载Cloudera Manager:
$ cd /fgeudb/software
$ wget https://archive.cloudera.com/cm7/7.3.1/cloudera-manager-centos7-7.3.1.tar.gz
输出示例如下:
–2026-04-04 10:00:00– https://archive.cloudera.com/cm7/7.3.1/cloudera-manager-centos7-7.3.1.tar.gz
Resolving archive.cloudera.com… 151.101.1.167
Connecting to archive.cloudera.com|151.101.1.167|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 1234567890 (1.2G) [application/octet-stream]
Saving to: ‘cloudera-manager-centos7-7.3.1.tar.gz’
解压安装:
$ tar -zxvf cloudera-manager-centos7-7.3.1.tar.gz -C /fgeudb/
$ cd /fgeudb/cm-7.3.1
$ ./cloudera-manager-installer.bin
方式二:RPM包安装
# vi /etc/yum.repos.d/cloudera.repo
[cloudera-runtime]
name=Cloudera Runtime
baseurl=https://archive.cloudera.com/p/cdh7/7.3.1/redhat7/yum/
gpgkey=https://archive.cloudera.com/p/cdh7/7.3.1/redhat7/yum/RPM-GPG-KEY-cloudera
gpgcheck=1
安装Impala:
# yum install -y impala impala-server impala-state-store impala-catalog impala-shell
输出示例如下:
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Resolving Dependencies
–> Running transaction check
—> Package impala.x86_64 0:4.4.0-1.el7 will be installed
—> Package impala-server.x86_64 0:4.4.0-1.el7 will be installed
—> Package impala-state-store.x86_64 0:4.4.0-1.el7 will be installed
—> Package impala-catalog.x86_64 0:4.4.0-1.el7 will be installed
—> Package impala-shell.x86_64 0:4.4.0-1.el7 will be installed
–> Finished Dependency Resolution
Dependencies Resolved
================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
impala x86_64 4.4.0-1.el7 cloudera-runtime 1.2 G
impala-server x86_64 4.4.0-1.el7 cloudera-runtime 1.5 G
impala-state-store x86_64 4.4.0-1.el7 cloudera-runtime 500 M
impala-catalog x86_64 4.4.0-1.el7 cloudera-runtime 500 M
impala-shell x86_64 4.4.0-1.el7 cloudera-runtime 50 M
Transaction Summary
================================================================================
Install 5 Packages
Total download size: 3.7 G
Installed size: 10 G
Downloading packages:
(1/5): impala-4.4.0-1.el7.x86_64.rpm | 1.2 GB 00:05:00
(2/5): impala-server-4.4.0-1.el7.x86_64.rpm | 1.5 GB 00:06:00
(3/5): impala-state-store-4.4.0-1.el7.x86_64.rpm | 500 MB 00:02:00
(4/5): impala-catalog-4.4.0-1.el7.x86_64.rpm | 500 MB 00:02:00
(5/5): impala-shell-4.4.0-1.el7.x86_64.rpm | 50 MB 00:00:10
Complete!
方式三:源码编译安装
$ git clone https://github.com/apache/impala.git
$ cd impala
$ git checkout branch-4.4.0
安装依赖:
$ ./bin/bootstrap_system.sh
编译源码:
$ ./build.sh
输出示例如下:
[INFO] Building Impala…
[INFO] ————————————————————————
[INFO] BUILD SUCCESS
[INFO] ————————————————————————
[INFO] Total time: 02:30:00
[INFO] Finished at: 2026-04-04T12:30:00+08:00
4. Impala安装部署实战
步骤1:安装依赖环境
# yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
验证Java版本:
$ java -version
输出示例如下:
openjdk version “1.8.0_362”
OpenJDK Runtime Environment (build 1.8.0_362-b08)
OpenJDK 64-Bit Server VM (build 25.362-b08, mixed mode)
安装Hadoop:
# yum install -y hadoop hadoop-hdfs hadoop-yarn hadoop-client
安装Hive:
# yum install -y hive hive-metastore hive-server2
创建必要目录:
# mkdir -p /fgeudb/impala/{logs,warehouse}
# chown -R impala:impala /fgeudb/impala
步骤2:配置Hive Metastore
# mysql -u root -p
创建数据库:
mysql> CREATE DATABASE metastore DEFAULT CHARACTER SET utf8mb4;
mysql> CREATE USER ‘hive’@’%’ IDENTIFIED BY ‘hive123’;
mysql> GRANT ALL PRIVILEGES ON metastore.* TO ‘hive’@’%’;
mysql> FLUSH PRIVILEGES;
配置hive-site.xml:
# vi /etc/hive/conf/hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.1.51:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive123</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/fgeudb/impala/warehouse</value>
</property>
</configuration>
初始化Metastore:
$ schematool -dbType mysql -initSchema
步骤3:配置Impala
# vi /etc/default/impala
IMPALA_CATALOG_SERVICE_HOST=192.168.1.51
IMPALA_STATE_STORE_HOST=192.168.1.51
IMPALA_STATE_STORE_PORT=24000
IMPALA_BACKEND_PORT=22000
IMPALA_LOG_DIR=/fgeudb/impala/logs
IMPALA_CATALOG_ARGS=” -log_dir=${IMPALA_LOG_DIR}”
IMPALA_STATE_STORE_ARGS=” -log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}”
IMPALA_SERVER_ARGS=” \
-log_dir=${IMPALA_LOG_DIR} \
-catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}”
配置impala-site.xml:
# vi /etc/impala/conf/impala-site.xml
<configuration>
<property>
<name>impala.catalog.service.hosts</name>
<value>192.168.1.51:26000</value>
</property>
<property>
<name>impala.state.store.hosts</name>
<value>192.168.1.51:24000</value>
</property>
<property>
<name>impala.daemon.log.dir</name>
<value>/fgeudb/impala/logs</value>
</property>
<property>
<name>impala.mem_limit</name>
<value>80%</value>
</property>
</configuration>
步骤4:启动Impala服务
# service impala-state-store start
输出示例如下:
Starting Impala State Store Server: [ OK ]
启动Catalog:
# service impala-catalog start
输出示例如下:
Starting Impala Catalog Server: [ OK ]
启动Impalad:
# service impala-server start
输出示例如下:
Starting Impala Server: [ OK ]
查看服务状态:
# service impala-server status
输出示例如下:
impalad (pid 12345) is running…
设置开机自启:
# chkconfig impala-state-store on
# chkconfig impala-catalog on
# chkconfig impala-server on
5. Impala配置文件详解
核心配置参数
-mem_limit=80% 内存使用上限
-num_threads_per_disk=64 每磁盘线程数
-num_scanner_threads=48 扫描线程数
查询配置:
-query_mem_limit=128GB 查询内存限制
-query_timeout_s=3600 查询超时时间
-idle_session_timeout_s=1800 空闲会话超时
缓存配置:
-disable_cached_reads=false 禁用缓存读取
-cache_force_single_read=false 强制单次读取
日志配置:
-log_dir=/fgeudb/impala/logs 日志目录
-logbuflevel=0 日志缓冲级别
-log_filename=impalad 日志文件名
网络配置:
-be_port=22000 后端端口
-hs2_port=21050 HiveServer2端口
-beeswax_port=21000 Beeswax端口
-webserver_port=25000 Web UI端口
性能优化配置
IMPALA_SERVER_ARGS=” \
-log_dir=/fgeudb/impala/logs \
-catalog_service_host=192.168.1.51 \
-state_store_host=192.168.1.51 \
-state_store_port=24000 \
-mem_limit=80% \
-num_threads_per_disk=64 \
-num_scanner_threads=48 \
-query_mem_limit=128GB \
-query_timeout_s=3600 \
-idle_session_timeout_s=1800 \
-disable_cached_reads=false \
-enable_partitioned_hash_join=true \
-enable_expr_rewrites=true \
-kudu_client_rpc_timeout_ms=30000″
资源队列配置:
# vi /etc/impala/conf/resource_pools.yaml
pools:
– name: default
max_memory: 64GB
max_queued: 100
queue_timeout: 60s
– name: etl
max_memory: 128GB
max_queued: 50
queue_timeout: 300s
– name: interactive
max_memory: 32GB
max_queued: 200
queue_timeout: 30s
6. Impala查询实战
使用impala-shell
$ impala-shell -i 192.168.1.51:21000
输出示例如下:
Connected to 192.168.1.51:21000
Server version: impalad version 4.4.0
***********************************************************************************
Welcome to the Impala shell.
***********************************************************************************
[192.168.1.51:21000] >
查看数据库:
[192.168.1.51:21000] > SHOW DATABASES;
输出示例如下:
Query: show DATABASES
+——————+
| name |
+——————+
| _impala_builtins |
| default |
| fgedu_db |
+——————+
Fetched 3 row(s) in 0.02s
创建数据库:
[192.168.1.51:21000] > CREATE DATABASE fgedu_db;
创建表:
[192.168.1.51:21000] > CREATE TABLE fgedu_db.sales (
> id INT,
> product_name STRING,
> sales_amount DECIMAL(10,2),
> sales_date DATE
> )
> STORED AS PARQUET;
输出示例如下:
Query: create TABLE fgedu_db.sales (
id INT,
product_name STRING,
sales_amount DECIMAL(10,2),
sales_date DATE
)
STORED AS PARQUET
Fetched 0 row(s) in 0.15s
数据加载与查询
[192.168.1.51:21000] > INSERT INTO fgedu_db.sales VALUES
> (1, ‘Product A’, 1000.00, ‘2026-01-01’),
> (2, ‘Product B’, 2000.00, ‘2026-01-02’),
> (3, ‘Product C’, 3000.00, ‘2026-01-03’);
输出示例如下:
Query: insert INTO fgedu_db.sales VALUES
(1, ‘Product A’, 1000.00, ‘2026-01-01’),
(2, ‘Product B’, 2000.00, ‘2026-01-02’),
(3, ‘Product C’, 3000.00, ‘2026-01-03’)
Modified 3 row(s), 0 row error(s) in 0.25s
查询数据:
[192.168.1.51:21000] > SELECT
> product_name,
> SUM(sales_amount) as total_sales
> FROM fgedu_db.sales
> GROUP BY product_name
> ORDER BY total_sales DESC;
输出示例如下:
Query: select
product_name,
sum(sales_amount) as total_sales
from fgedu_db.sales
group by product_name
order by total_sales desc
+————–+————-+
| product_name | total_sales |
+————–+————-+
| Product C | 3000.00 |
| Product B | 2000.00 |
| Product A | 1000.00 |
+————–+————-+
Fetched 3 row(s) in 0.12s
分区表操作
[192.168.1.51:21000] > CREATE TABLE fgedu_db.sales_partitioned (
> id INT,
> product_name STRING,
> sales_amount DECIMAL(10,2)
> )
> PARTITIONED BY (year INT, month INT)
> STORED AS PARQUET;
插入分区数据:
[192.168.1.51:21000] > INSERT INTO fgedu_db.sales_partitioned
> PARTITION (year=2026, month=1)
> SELECT id, product_name, sales_amount
> FROM fgedu_db.sales
> WHERE YEAR(sales_date) = 2026 AND MONTH(sales_date) = 1;
查询分区数据:
[192.168.1.51:21000] > SELECT * FROM fgedu_db.sales_partitioned
> WHERE year = 2026 AND month = 1;
刷新元数据:
[192.168.1.51:21000] > REFRESH fgedu_db.sales_partitioned;
[192.168.1.51:21000] > INVALIDATE METADATA;
7. 安装验证与测试
查看Impala状态
$ ps -ef | grep impala
输出示例如下:
impala 12345 1 5 10:10 ? 00:00:30 /usr/lib/impala/bin/impalad …
impala 12346 1 2 10:10 ? 00:00:10 /usr/lib/impala/bin/statestored …
impala 12347 1 2 10:10 ? 00:00:10 /usr/lib/impala/bin/catalogd …
查看端口监听:
$ netstat -tlnp | grep impala
输出示例如下:
tcp6 0 0 :::21000 :::* LISTEN 12345/java
tcp6 0 0 :::21050 :::* LISTEN 12345/java
tcp6 0 0 :::24000 :::* LISTEN 12346/java
tcp6 0 0 :::25000 :::* LISTEN 12345/java
tcp6 0 0 :::26000 :::* LISTEN 12347/java
访问Web UI:
打开浏览器访问:http://192.168.1.51:25000
查看集群状态:
$ impala-shell -i 192.168.1.51:21000 -q “SELECT * FROM __all_logs__ LIMIT 1”
输出示例如下:
Connected to 192.168.1.51:21000
Server version: impalad version 4.4.0
性能测试
创建TPC-H表:
[192.168.1.51:21000] > CREATE TABLE tpch.lineitem (
> l_orderkey BIGINT,
> l_partkey BIGINT,
> l_suppkey BIGINT,
> l_linenumber INT,
> l_quantity DECIMAL(15,2),
> l_extendedprice DECIMAL(15,2),
> l_discount DECIMAL(15,2),
> l_tax DECIMAL(15,2),
> l_returnflag STRING,
> l_linestatus STRING,
> l_shipdate DATE,
> l_commitdate DATE,
> l_receiptdate DATE,
> l_shipinstruct STRING,
> l_shipmode STRING,
> l_comment STRING
> )
> STORED AS PARQUET;
执行查询:
[192.168.1.51:21000] > SELECT
> l_returnflag,
> l_linestatus,
> SUM(l_quantity) as sum_qty,
> SUM(l_extendedprice) as sum_base_price
> FROM tpch.lineitem
> WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY
> GROUP BY l_returnflag, l_linestatus
> ORDER BY l_returnflag, l_linestatus;
查询性能示例:
– SF1(1GB数据):0.5秒
– SF10(10GB数据):2秒
– SF100(100GB数据):15秒
8. 常见问题与解决方案
问题1:内存不足
解决方案:
1. 增加内存限制:
-mem_limit=90%
2. 增加查询内存限制:
-query_mem_limit=256GB
3. 优化查询:
– 减少处理的数据量
– 使用分区裁剪
– 添加过滤条件
4. 查看内存使用:
访问Web UI查看内存使用情况
问题2:元数据不同步
解决方案:
1. 刷新元数据:
[192.168.1.51:21000] > REFRESH table_name;
2. 使元数据失效:
[192.168.1.51:21000] > INVALIDATE METADATA;
3. 重新加载所有元数据:
[192.168.1.51:21000] > INVALIDATE METADATA table_name;
4. 检查Hive Metastore连接:
确保Hive Metastore服务正常运行
问题3:查询超时
解决方案:
1. 增加超时时间:
-query_timeout_s=7200
2. 优化查询:
– 使用分区表
– 添加索引
– 优化JOIN顺序
3. 检查资源使用:
查看CPU和内存使用情况
4. 检查数据倾斜:
确保数据分布均匀
Impala管理命令
# service impala-state-store start
# service impala-catalog start
# service impala-server start
停止服务:
# service impala-server stop
# service impala-catalog stop
# service impala-state-store stop
重启服务:
# service impala-server restart
查看状态:
# service impala-server status
连接Shell:
$ impala-shell -i 192.168.1.51:21000
$ impala-shell -i 192.168.1.51:21000 -f query.sql
$ impala-shell -i 192.168.1.51:21000 -q “SELECT 1”
执行SQL文件:
$ impala-shell -i 192.168.1.51:21000 -f /path/to/query.sql
输出到文件:
$ impala-shell -i 192.168.1.51:21000 -q “SELECT * FROM table” -o output.txt
查看执行计划:
[192.168.1.51:21000] > EXPLAIN SELECT * FROM table;
[192.168.1.51:21000] > EXPLAIN LEVEL=VERBOSE SELECT * FROM table;
查看查询Profile:
[192.168.1.51:21000] > PROFILE;
1. 使用Cloudera Manager管理Impala集群;2. 配置足够的内存资源;3. 使用Parquet格式存储数据;4. 合理设计分区策略;5. 定期刷新元数据;6. 配置资源池管理查询;7. 启用Kerberos认证;8. 监控集群状态;9. 优化查询语句;10. 定期维护表统计信息。
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
