1. Impala Overview and Environment Planning
Apache Impala is an open-source big data query engine that provides high-performance, low-latency SQL access to data stored in Hadoop HDFS and HBase. Its distributed query engine architecture can process petabyte-scale data quickly, making it well suited to real-time queries and interactive analytics.
1.1 Impala Version Notes
The current major release line is Impala 4.x; this tutorial uses Impala 4.4 as its example. Compared with the 3.x line, Impala 4.x delivers significant improvements in performance, stability, and functionality.
$ impala-shell --version
Impala Shell v4.4.0
Built on 2024-01-15
# Check the Java version
$ java -version
openjdk version "1.8.0_402"
OpenJDK Runtime Environment (build 1.8.0_402-b06)
OpenJDK 64-Bit Server VM (build 25.402-b06, mixed mode)
1.2 Environment Plan
The installation environment is planned as follows:
Node 1: impala01.fgedu.net.cn (192.168.1.51) - Catalog Server + StateStore + Impala Server
Node 2: impala02.fgedu.net.cn (192.168.1.52) - Impala Server
Node 3: impala03.fgedu.net.cn (192.168.1.53) - Impala Server
Hadoop cluster:
NameNode: 192.168.1.51:9870
ResourceManager: 192.168.1.51:8088
DataNode: 192.168.1.51-53
Hive Metastore:
Metastore service: 192.168.1.51:9083
Database: MySQL 8.0
Impala version: 4.4.0
Hadoop version: 3.3.6
Hive version: 3.1.3
MySQL version: 8.0.35
Installation directory: /data/impala
Data directory: /data/impala/data
Log directory: /data/impala/logs
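Before going further, it is worth verifying that every node in the plan above is reachable. A minimal POSIX shell sketch follows; the host list mirrors the planning table (edit it for your cluster), and the `ping` options assume Linux iputils:

```shell
#!/bin/sh
# check_hosts pings each cluster node once and reports reachability.
# The host list mirrors the planning table above; edit it for your cluster.
check_hosts() {
  for h in impala01.fgedu.net.cn impala02.fgedu.net.cn impala03.fgedu.net.cn; do
    if ping -c 1 -W 2 "$h" > /dev/null 2>&1; then
      echo "$h reachable"
    else
      echo "$h UNREACHABLE"
    fi
  done
}
check_hosts
```

Run it on each node before installation; any UNREACHABLE line points at a DNS, /etc/hosts, or network problem to fix first.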
2. Hardware Requirements
Impala is demanding on hardware resources, particularly memory and disk I/O. The following are suggested hardware configurations for production environments.
2.1 Minimum Hardware Requirements
# free -h
total used free shared buff/cache available
Mem: 128G 8.5G 115G 2.1G 4.2G 118G
Swap: 32G 0B 32G
# Check disk space
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 12G 39G 24% /
/dev/sdb1 2.0T 50G 2.0T 3% /data
/dev/sdc1 500G 20G 480G 4% /backup
# Check the number of CPU cores
# nproc
64
# Check disk I/O performance
# hdparm -t /dev/sdb
/dev/sdb:
Timing buffered disk reads: 854 MB in 3.00 seconds = 284.67 MB/sec
2.2 Network Requirements
# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
inet 127.0.0.1/8 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
inet 192.168.1.51/24 brd 192.168.1.255 scope global eth0
# Check network connectivity
# ping -c 3 192.168.1.52
PING 192.168.1.52 (192.168.1.52) 56(84) bytes of data.
64 bytes from 192.168.1.52: icmp_seq=1 ttl=64 time=0.234 ms
64 bytes from 192.168.1.52: icmp_seq=2 ttl=64 time=0.198 ms
64 bytes from 192.168.1.52: icmp_seq=3 ttl=64 time=0.212 ms
# Check hostname resolution
# hostname -f
impala01.fgedu.net.cn
# Configure /etc/hosts
# cat /etc/hosts
127.0.0.1 localhost
192.168.1.51 impala01.fgedu.net.cn impala01
192.168.1.52 impala02.fgedu.net.cn impala02
192.168.1.53 impala03.fgedu.net.cn impala03
3. Operating System Configuration
Impala runs on Linux, and the system needs a series of tuning steps before installation.
3.1 Disable the Firewall and SELinux
# getenforce
Disabled
# Disable SELinux (if not already disabled)
# vi /etc/selinux/config
SELINUX=disabled
# Check firewall status
# systemctl status firewalld
# Stop the firewall (in production, prefer opening only the required ports)
# systemctl stop firewalld
# systemctl disable firewalld
# Or open the ports Impala needs
# firewall-cmd --permanent --add-port=21000/tcp
# firewall-cmd --permanent --add-port=21050/tcp
# firewall-cmd --permanent --add-port=25000/tcp
# firewall-cmd --permanent --add-port=25010/tcp
# firewall-cmd --permanent --add-port=26000/tcp
# firewall-cmd --permanent --add-port=28000/tcp
# firewall-cmd --reload
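The repeated firewall-cmd invocations can be generated from a single port list. A small sketch, assuming firewalld and using the port numbers from this guide; it only prints the commands, so you can review them before piping the output to sh:

```shell
#!/bin/sh
# emit_port_cmds prints the firewall-cmd invocations for the Impala ports
# used in this guide; review the output, then pipe it to sh to apply.
emit_port_cmds() {
  for p in 21000 21050 25000 25010 26000 28000; do
    echo "firewall-cmd --permanent --add-port=${p}/tcp"
  done
  echo "firewall-cmd --reload"
}
emit_port_cmds
```

Example: `./impala_ports.sh | sh` applies the rules on a firewalld host.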
3.2 Kernel Parameter Tuning
# vi /etc/sysctl.conf
# Add the following parameters
fs.file-max = 6815744
vm.swappiness = 10
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.overcommit_memory = 1
vm.max_map_count = 262144
net.core.somaxconn = 32768
net.ipv4.tcp_max_syn_backlog = 65536
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.ip_local_port_range = 1024 65535
# Apply the settings
# sysctl -p
# Configure file descriptor limits
# vi /etc/security/limits.conf
# Add the following lines
* soft nofile 65536
* hard nofile 65536
* soft nproc 65536
* hard nproc 65536
* soft memlock unlimited
* hard memlock unlimited
# Verify the limits
# ulimit -n
65536
# ulimit -u
65536
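The ulimit checks above can be automated with a small helper that compares each current value against the required minimum. A sketch (the 65536 thresholds come from the limits.conf entries above; `require_min` is a hypothetical helper name):

```shell
#!/bin/sh
# require_min prints a pass/fail line for a limit value against a minimum;
# "unlimited" always passes. Used to sanity-check the limits.conf values.
require_min() {  # label current minimum
  if [ "$2" = "unlimited" ]; then
    echo "$1: unlimited (ok)"
  elif [ "$2" -ge "$3" ]; then
    echo "$1: $2 (ok, >= $3)"
  else
    echo "$1: $2 (TOO LOW, need >= $3)"
  fi
}
require_min "open files (nofile)" "$(ulimit -n)" 65536
require_min "max user processes" "$(ulimit -u)" 65536
```

Run it as the impala user after re-logging in, since limits.conf changes only apply to new sessions.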
3.3 Disable Transparent Huge Pages
# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# Disable THP
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Make the change persistent across reboots
# vi /etc/rc.d/rc.local
# Add the following lines
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Make the script executable
# chmod +x /etc/rc.d/rc.local
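A quick way to confirm THP really is off after a reboot is to check which value is bracketed in the sysfs file. A sketch; `check_thp` is a hypothetical helper, and the path is a parameter so the logic can be exercised anywhere:

```shell
#!/bin/sh
# check_thp reads a transparent_hugepage "enabled" file and reports whether
# "never" is the active (bracketed) setting.
check_thp() {  # path to the "enabled" file
  case "$(cat "$1" 2>/dev/null)" in
    *"[never]"*) echo "THP disabled" ;;
    *) echo "WARNING: THP still enabled" ;;
  esac
}
check_thp /sys/kernel/mm/transparent_hugepage/enabled
```

Add the same check for the `defrag` file if you also disabled defrag above.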
4. Hadoop Environment Setup
Impala relies on Hadoop HDFS as its underlying storage, so the Hadoop environment must be configured first.
4.1 Install and Configure Hadoop
# useradd hadoop
# mkdir -p /data/hadoop
# chown -R hadoop:hadoop /data/hadoop
# Download Hadoop
# cd /tmp
# wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
# Extract and install
# tar -xzf hadoop-3.3.6.tar.gz
# mv hadoop-3.3.6 /data/hadoop/hadoop
# chown -R hadoop:hadoop /data/hadoop
# Configure environment variables
# vi /etc/profile.d/hadoop.sh
export HADOOP_HOME=/data/hadoop/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# Load the environment variables
# source /etc/profile.d/hadoop.sh
# Verify the Hadoop version
$ hadoop version
Hadoop 3.3.6
Source code repository https://github.com/apache/hadoop.git -r 1be78238728da9266a4f88195058f08fd012bf9c
Compiled by ubuntu on 2023-10-25T15:02Z
Compiled with protoc 3.7.1
From source with checksum 145f274db7609f8d5eab8c7e5ce1c1
This command was run using /data/hadoop/hadoop/share/hadoop/common/hadoop-common-3.3.6.jar
4.2 HDFS Configuration
# vi $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.1.51:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop/tmp</value>
</property>
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.impala.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.impala.groups</name>
<value>*</value>
</property>
</configuration>
# Configure hdfs-site.xml
# vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/data/hadoop/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/data/hadoop/datanode</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>192.168.1.51:9870</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>8192</value>
</property>
</configuration>
4.3 Start HDFS
# su - hadoop
$ hdfs namenode -format
# Sample output:
2024-04-05 10:00:00,000 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = impala01.fgedu.net.cn/192.168.1.51
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 3.3.6
************************************************************/
…
2024-04-05 10:00:05,000 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at impala01.fgedu.net.cn/192.168.1.51
************************************************************/
# Start the HDFS daemons
$ $HADOOP_HOME/sbin/start-dfs.sh
# Check daemon status
$ jps
12345 NameNode
12567 DataNode
12789 SecondaryNameNode
# View HDFS status
$ hdfs dfsadmin -report
Configured Capacity: 644245094400 (600 GB)
Present Capacity: 580000000000 (540 GB)
DFS Remaining: 575000000000 (535 GB)
DFS Used: 5000000000 (4.66 GB)
DFS Used%: 0.86%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks: 0
Pending deletion blocks: 0
Live datanodes (3):
# Create the directories Impala needs
$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -mkdir -p /user/impala
$ hdfs dfs -chmod -R 777 /user/hive
$ hdfs dfs -chmod -R 777 /user/impala
5. Hive Environment Setup
Impala uses the Hive Metastore to manage table metadata, so the Hive environment must be configured first.
5.1 Install MySQL
# yum install -y mysql-server
# Start MySQL
# systemctl start mysqld
# systemctl enable mysqld
# Secure the installation
# mysql_secure_installation
# Create the Hive Metastore database
# mysql -u root -p
mysql> CREATE DATABASE metastore DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'fgedu_hive_2024';
mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'%';
mysql> FLUSH PRIVILEGES;
mysql> EXIT;
# Verify the connection
# mysql -u hive -p'fgedu_hive_2024' -e "SHOW DATABASES;"
+--------------------+
| Database           |
+--------------------+
| information_schema |
| metastore          |
+--------------------+
5.2 Install and Configure Hive
# cd /tmp
# wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
# Extract and install
# tar -xzf apache-hive-3.1.3-bin.tar.gz
# mv apache-hive-3.1.3-bin /data/hive
# chown -R hadoop:hadoop /data/hive
# Configure environment variables
# vi /etc/profile.d/hive.sh
export HIVE_HOME=/data/hive
export PATH=$HIVE_HOME/bin:$PATH
# Load the environment variables
# source /etc/profile.d/hive.sh
# Download the MySQL JDBC driver
# wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.30/mysql-connector-java-8.0.30.jar
# mv mysql-connector-java-8.0.30.jar $HIVE_HOME/lib/
# Configure hive-site.xml
# vi $HIVE_HOME/conf/hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.1.51:3306/metastore?createDatabaseIfNotExist=true&amp;useSSL=false&amp;serverTimezone=Asia/Shanghai</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>fgedu_hive_2024</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://192.168.1.51:9083</value>
</property>
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
</configuration>
5.3 Initialize the Hive Metastore
# su - hadoop
$ $HIVE_HOME/bin/schematool -dbType mysql -initSchema
# Sample output:
Metastore connection URL: jdbc:mysql://192.168.1.51:3306/metastore?createDatabaseIfNotExist=true&useSSL=false&serverTimezone=Asia/Shanghai
Metastore Connection Driver : com.mysql.cj.jdbc.Driver
Metastore connection User: hive
Starting metastore schema initialization to 3.1.0
Initialization script hive-schema-3.1.0.mysql.sql
Initialization script completed
schemaTool completed
# Start the Metastore service
$ nohup $HIVE_HOME/bin/hive --service metastore > /data/hive/logs/metastore.log 2>&1 &
# Check that the service is listening
$ netstat -tlnp | grep 9083
tcp6 0 0 :::9083 :::* LISTEN 12345/java
# Test the Hive connection
$ hive -e "SHOW DATABASES;"
OK
default
Time taken: 2.345 seconds, Fetched: 1 row(s)
6. Impala Installation and Deployment
With the Hadoop and Hive environments configured, install the Impala services.
6.1 Create the Impala User and Directories
# useradd impala
# echo "impala:fgedu_impala_2024" | chpasswd
# Create directories
# mkdir -p /data/impala/{data,logs,tmp}
# mkdir -p /var/run/impala
# chown -R impala:impala /data/impala
# chown -R impala:impala /var/run/impala
6.2 Download and Install Impala
# cd /tmp
# wget https://downloads.apache.org/impala/4.4.0/apache-impala-4.4.0.tar.gz
# Extract and install
# tar -xzf apache-impala-4.4.0.tar.gz
# mv apache-impala-4.4.0 /data/impala/impala
# chown -R impala:impala /data/impala
# Configure environment variables
# vi /etc/profile.d/impala.sh
export IMPALA_HOME=/data/impala/impala
export PATH=$IMPALA_HOME/bin:$PATH
export IMPALA_CONF_DIR=/data/impala/conf
# Load the environment variables
# source /etc/profile.d/impala.sh
# Verify the installation
$ impala-shell --version
Impala Shell v4.4.0
Built on 2024-01-15
6.3 Configure the Impala Environment
# mkdir -p /data/impala/conf
# Copy the Hadoop configuration files
# cp $HADOOP_HOME/etc/hadoop/core-site.xml /data/impala/conf/
# cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml /data/impala/conf/
# Copy the Hive configuration file
# cp $HIVE_HOME/conf/hive-site.xml /data/impala/conf/
# Set ownership
# chown -R impala:impala /data/impala/conf
7. Impala Parameter Configuration
Impala's parameters are critical to performance and stability and should be adjusted for the actual environment.
7.1 Configure impalad
# vi /data/impala/conf/impalad_flags
# impalad service settings
-log_dir=/data/impala/logs
-log_filename=impalad
-log_verbosity=1
# Memory settings
-mem_limit=120g
-spool_query_results=true
-query_mem_limit=100g
-buffer_pool_limit=80g
# Parallelism settings
-num_threads_per_disk=32
-num_scanner_threads=96
-num_executor_threads=96
# Catalog and StateStore addresses
-catalog_service_host=impala01.fgedu.net.cn
-state_store_port=24000
-state_store_host=impala01.fgedu.net.cn
# Network settings
-beeswax_port=21000
-hive_server2_port=21050
-webserver_port=25000
-webserver_doc_root=/data/impala/impala/web
# Metadata settings
-incr_metadata_reload_interval_s=60
-metadata_load_in_background=true
# Query settings
-disable_cached_reads=false
-disable_streaming_preaggregations=false
-enable_partitioned_hash_join=true
-enable_expr_rewrites=true
# Resource management
-enable_rm=true
-rm_always_use_defaults=true
-default_query_options=MT_DOP=4,EXEC_TIME_LIMIT_S=3600,MEM_LIMIT=100g
# Audit logging
-audit_event_log_dir=/data/impala/logs/audit
-max_audit_event_log_file_size=100MB
-max_audit_event_log_files=10
# Kudu settings (if Kudu is used)
-kudu_master_hosts=192.168.1.51:7051
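The -mem_limit value above must be sized against the node's physical RAM. A small helper can compute a suggested value as a percentage of total memory; the 80% figure is a common starting point that leaves headroom for the OS and other daemons, not an official Impala recommendation, and `mem_limit_gb` is a hypothetical helper name:

```shell
#!/bin/sh
# mem_limit_gb suggests a -mem_limit value (in GB) as a percentage of
# total RAM. Integer arithmetic rounds down, which errs on the safe side.
mem_limit_gb() {  # total_gb percent
  echo $(( $1 * $2 / 100 ))
}
# Read total RAM from /proc/meminfo (Linux); fall back to 0 if unavailable.
total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo 2>/dev/null)
total_gb=$(( ${total_kb:-0} / 1024 / 1024 ))
echo "-mem_limit=$(mem_limit_gb "$total_gb" 80)g"
```

On the 128 GB nodes in this plan, this suggests -mem_limit=102g; the 120g used above is a more aggressive choice that assumes the node runs little else.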
7.2 Configure catalogd
# vi /data/impala/conf/catalogd_flags
# catalogd service settings
-log_dir=/data/impala/logs
-log_filename=catalogd
-log_verbosity=1
# Port settings
-catalog_service_port=26000
-webserver_port=25020
# Metadata settings
-hive_metastore_uris=thrift://192.168.1.51:9083
-invalidate_tables_timeout_s=3600
-load_tables_in_parallel=true
# Cache settings
-metadata_cache_size=10g
-max_table_load_threads=16
# HMS event settings
-hms_event_polling_interval_s=5
-hms_event_ttl_s=86400
7.3 Configure statestored
# vi /data/impala/conf/statestored_flags
# statestored service settings
-log_dir=/data/impala/logs
-log_filename=statestored
-log_verbosity=1
# Port settings
-state_store_port=24000
-webserver_port=25010
# Heartbeat settings
-statestore_heartbeat_frequency_ms=3000
-statestore_heartbeat_timeout_ms=10000
# Failure detection
-statestore_num_heartbeat_threads=10
-statestore_num_heartbeat_timeout_threads=10
8. Starting the Impala Services
With configuration complete, start the Impala services in the correct order.
8.1 Start the StateStore Service
# su - impala
$ nohup $IMPALA_HOME/sbin/statestored \
--flagfile=/data/impala/conf/statestored_flags \
> /data/impala/logs/statestored.out 2>&1 &
# Check the process
$ ps -ef | grep statestored
impala 12345 1 2 10:00 ? 00:00:05 /data/impala/impala/sbin/statestored --flagfile=/data/impala/conf/statestored_flags
# Check the ports
$ netstat -tlnp | grep statestored
tcp 0 0 0.0.0.0:24000 0.0.0.0:* LISTEN 12345/statestored
tcp 0 0 0.0.0.0:25010 0.0.0.0:* LISTEN 12345/statestored
# View the log
$ tail -f /data/impala/logs/statestored.INFO
I0405 10:00:00.000000 12345 statestored.cc:65] StateStore service started on port 24000
I0405 10:00:00.000100 12345 statestored.cc:80] Webserver started on port 25010
8.2 Start the Catalog Service
$ nohup $IMPALA_HOME/sbin/catalogd \
--flagfile=/data/impala/conf/catalogd_flags \
> /data/impala/logs/catalogd.out 2>&1 &
# Check the process
$ ps -ef | grep catalogd
impala 12567 1 5 10:01 ? 00:00:10 /data/impala/impala/sbin/catalogd --flagfile=/data/impala/conf/catalogd_flags
# Check the ports
$ netstat -tlnp | grep catalogd
tcp 0 0 0.0.0.0:26000 0.0.0.0:* LISTEN 12567/catalogd
tcp 0 0 0.0.0.0:25020 0.0.0.0:* LISTEN 12567/catalogd
# View the log
$ tail -f /data/impala/logs/catalogd.INFO
I0405 10:01:00.000000 12567 catalogd.cc:85] Catalog service started on port 26000
I0405 10:01:00.000100 12567 catalogd.cc:100] Webserver started on port 25020
I0405 10:01:05.000200 12567 catalogd.cc:150] Connected to Hive Metastore at thrift://192.168.1.51:9083
8.3 Start the impalad Service
$ nohup $IMPALA_HOME/sbin/impalad \
--flagfile=/data/impala/conf/impalad_flags \
> /data/impala/logs/impalad.out 2>&1 &
# Check the process
$ ps -ef | grep impalad
impala 12789 1 15 10:02 ? 00:00:30 /data/impala/impala/sbin/impalad --flagfile=/data/impala/conf/impalad_flags
# Check the ports
$ netstat -tlnp | grep impalad
tcp 0 0 0.0.0.0:21000 0.0.0.0:* LISTEN 12789/impalad
tcp 0 0 0.0.0.0:21050 0.0.0.0:* LISTEN 12789/impalad
tcp 0 0 0.0.0.0:25000 0.0.0.0:* LISTEN 12789/impalad
# View the log
$ tail -f /data/impala/logs/impalad.INFO
I0405 10:02:00.000000 12789 impalad.cc:120] Impala service started
I0405 10:02:00.000100 12789 impalad.cc:135] Beeswax service started on port 21000
I0405 10:02:00.000200 12789 impalad.cc:150] HiveServer2 service started on port 21050
I0405 10:02:00.000300 12789 impalad.cc:165] Webserver started on port 25000
I0405 10:02:05.000400 12789 impalad.cc:200] Connected to StateStore at impala01.fgedu.net.cn:24000
I0405 10:02:10.000500 12789 impalad.cc:250] Connected to Catalog service at impala01.fgedu.net.cn:26000
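The three startup steps above (statestored, then catalogd, then impalad, with pauses in between) can be captured in a single script. A sketch that only prints the commands in dependency order, using the paths and waits from this section, so you can review before piping the output to sh on each node:

```shell
#!/bin/sh
# emit_start_cmds prints the daemon startup commands in dependency order
# (statestored, then catalogd, then impalad) with the waits this section
# uses. Review the output, then pipe it to sh to execute on a node.
emit_start_cmds() {
  for svc_wait in "statestored 10" "catalogd 30" "impalad 0"; do
    set -- $svc_wait   # intentional word-splitting: $1=daemon $2=wait
    echo "nohup \$IMPALA_HOME/sbin/$1 --flagfile=/data/impala/conf/${1}_flags > /data/impala/logs/$1.out 2>&1 &"
    if [ "$2" -gt 0 ]; then
      echo "sleep $2"
    fi
  done
}
emit_start_cmds
```

Only statestored and catalogd need to run on impala01 in this plan; on impala02/03, trim the list to impalad alone.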
8.4 Verify Cluster Status
$ impala-shell -i impala01.fgedu.net.cn
# Sample output:
Starting Impala Shell without Kerberos authentication
Connected to impala01.fgedu.net.cn:21000
Server version: impalad version 4.4.0 RELEASE (build 1a2b3c4d5e6f7g8h)
# View cluster status
[impala01.fgedu.net.cn:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name             | comment                                      |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| default          | Default Hive database                        |
+------------------+----------------------------------------------+
Fetched 2 row(s) in 0.02s
# View the impalad nodes
[impala01.fgedu.net.cn:21000] > show impala nodes;
Query: show impala nodes
+----------+--------------+--------+-------+---------+
| Name     | IP           | State  | Port  | Version |
+----------+--------------+--------+-------+---------+
| impala01 | 192.168.1.51 | ACTIVE | 21000 | 4.4.0   |
| impala02 | 192.168.1.52 | ACTIVE | 21000 | 4.4.0   |
| impala03 | 192.168.1.53 | ACTIVE | 21000 | 4.4.0   |
+----------+--------------+--------+-------+---------+
Fetched 3 row(s) in 0.01s
9. Impala Functional Testing
After installation, run functional tests to verify that Impala works correctly.
9.1 Create a Test Database and Table
$ impala-shell -i impala01.fgedu.net.cn
# Create a database
[impala01.fgedu.net.cn:21000] > CREATE DATABASE fgedudb;
Query: create DATABASE fgedudb
# Switch to the database
[impala01.fgedu.net.cn:21000] > USE fgedudb;
Query: use fgedudb
# Create a test table
[impala01.fgedu.net.cn:21000] > CREATE TABLE fgedu_employees (
> id INT,
> name STRING,
> department STRING,
> salary DECIMAL(10,2),
> hire_date DATE
> )
> STORED AS PARQUET;
Query: create TABLE fgedu_employees (
id INT,
name STRING,
department STRING,
salary DECIMAL(10,2),
hire_date DATE
)
STORED AS PARQUET
# Insert test data
[impala01.fgedu.net.cn:21000] > INSERT INTO fgedu_employees VALUES
> (1, '张三', '技术部', 15000.00, '2020-01-15'),
> (2, '李四', '市场部', 12000.00, '2020-03-20'),
> (3, '王五', '财务部', 13000.00, '2020-05-10'),
> (4, '赵六', '技术部', 18000.00, '2019-08-01'),
> (5, '钱七', '人事部', 11000.00, '2021-02-15');
Query: insert into fgedu_employees values
(1, '张三', '技术部', 15000.00, '2020-01-15'),
(2, '李四', '市场部', 12000.00, '2020-03-20'),
(3, '王五', '财务部', 13000.00, '2020-05-10'),
(4, '赵六', '技术部', 18000.00, '2019-08-01'),
(5, '钱七', '人事部', 11000.00, '2021-02-15')
Inserted 5 row(s) in 0.45s
9.2 Run Test Queries
[impala01.fgedu.net.cn:21000] > SELECT * FROM fgedu_employees;
Query: select * from fgedu_employees
+----+------+------------+----------+------------+
| id | name | department | salary   | hire_date  |
+----+------+------------+----------+------------+
| 1  | 张三 | 技术部     | 15000.00 | 2020-01-15 |
| 2  | 李四 | 市场部     | 12000.00 | 2020-03-20 |
| 3  | 王五 | 财务部     | 13000.00 | 2020-05-10 |
| 4  | 赵六 | 技术部     | 18000.00 | 2019-08-01 |
| 5  | 钱七 | 人事部     | 11000.00 | 2021-02-15 |
+----+------+------------+----------+------------+
Fetched 5 row(s) in 0.12s
# Aggregate query
[impala01.fgedu.net.cn:21000] > SELECT department,
> COUNT(*) as employee_count,
> AVG(salary) as avg_salary,
> MAX(salary) as max_salary
> FROM fgedu_employees
> GROUP BY department
> ORDER BY avg_salary DESC;
Query: select department,
count(*) as employee_count,
avg(salary) as avg_salary,
max(salary) as max_salary
from fgedu_employees
group by department
order by avg_salary desc
+------------+----------------+------------+------------+
| department | employee_count | avg_salary | max_salary |
+------------+----------------+------------+------------+
| 技术部     | 2              | 16500.00   | 18000.00   |
| 财务部     | 1              | 13000.00   | 13000.00   |
| 市场部     | 1              | 12000.00   | 12000.00   |
| 人事部     | 1              | 11000.00   | 11000.00   |
+------------+----------------+------------+------------+
Fetched 4 row(s) in 0.25s
# View the execution plan
[impala01.fgedu.net.cn:21000] > EXPLAIN SELECT * FROM fgedu_employees WHERE salary > 13000;
Query: explain select * from fgedu_employees where salary > 13000
+----------------------------------------------------------+
| explain string                                           |
+----------------------------------------------------------+
| Max Per-Host Resource Reservation: Thread Reservation=0  |
| Per-Host Resource Estimates: Thread Reservation=0        |
|                                                          |
| PLAN-ROOT SINK                                           |
| |                                                        |
| 01:SCAN HDFS [fgedudb.fgedu_employees]                   |
|    HDFS partitions=1/1 files=1 size=2KB                  |
|    predicates: salary > 13000.00                         |
|    table stats: 5 rows total                             |
|    column stats: all                                     |
+----------------------------------------------------------+
Fetched 10 row(s) in 0.03s
9.3 Performance Benchmark Test
[impala01.fgedu.net.cn:21000] > CREATE TABLE fgedu_sales (
> sale_id BIGINT,
> product_id INT,
> customer_id INT,
> sale_amount DECIMAL(12,2),
> sale_date DATE,
> region STRING
> )
> PARTITIONED BY (year INT, month INT)
> STORED AS PARQUET;
Query: create TABLE fgedu_sales (
sale_id BIGINT,
product_id INT,
customer_id INT,
sale_amount DECIMAL(12,2),
sale_date DATE,
region STRING
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
# Insert test data (simulating a larger data volume)
[impala01.fgedu.net.cn:21000] > INSERT INTO fgedu_sales PARTITION (year=2024, month=1)
> SELECT id,
> CAST(rand() * 1000 AS INT) as product_id,
> CAST(rand() * 10000 AS INT) as customer_id,
> CAST(rand() * 10000 AS DECIMAL(12,2)) as sale_amount,
> DATE_ADD('2024-01-01', CAST(rand() * 30 AS INT)) as sale_date,
> CASE CAST(rand() * 5 AS INT)
> WHEN 0 THEN '华东'
> WHEN 1 THEN '华南'
> WHEN 2 THEN '华北'
> WHEN 3 THEN '华中'
> ELSE '西部'
> END as region
> FROM (
> SELECT t1.id FROM
> (SELECT 1 as id UNION ALL SELECT 2 UNION ALL SELECT 3) t1,
> (SELECT 1 as id UNION ALL SELECT 2 UNION ALL SELECT 3) t2,
> (SELECT 1 as id UNION ALL SELECT 2 UNION ALL SELECT 3) t3,
> (SELECT 1 as id FROM
> (SELECT 1 as id UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5) t4,
> (SELECT 1 as id UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5) t5,
> (SELECT 1 as id UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5) t6,
> (SELECT 1 as id UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5) t7,
> (SELECT 1 as id UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5) t8
> ) large
> ) combined;
Query: insert into fgedu_sales partition (year=2024, month=1)
…
Inserted 1000000 row(s) in 12.35s
# Run a more complex aggregate query
[impala01.fgedu.net.cn:21000] > SELECT region,
> COUNT(*) as sale_count,
> SUM(sale_amount) as total_amount,
> AVG(sale_amount) as avg_amount
> FROM fgedu_sales
> WHERE year = 2024 AND month = 1
> GROUP BY region
> ORDER BY total_amount DESC;
Query: select region,
count(*) as sale_count,
sum(sale_amount) as total_amount,
avg(sale_amount) as avg_amount
from fgedu_sales
where year = 2024 and month = 1
group by region
order by total_amount desc
+--------+------------+---------------+------------+
| region | sale_count | total_amount  | avg_amount |
+--------+------------+---------------+------------+
| 华东   | 200123     | 1000123456.78 | 5000.12    |
| 华南   | 199876     | 998765432.10  | 4995.23    |
| 华北   | 200234     | 998234567.89  | 4986.45    |
| 华中   | 199654     | 996543210.98  | 4989.67    |
| 西部   | 200113     | 995432109.87  | 4974.89    |
+--------+------------+---------------+------------+
Fetched 5 row(s) in 0.85s
10. Impala Performance Tuning
Impala performance tuning covers several areas, including memory configuration, query optimization, and storage optimization.
10.1 Memory Tuning
[impala01.fgedu.net.cn:21000] > SHOW MEMORY_STATS;
Query: show memory_stats
+----------------+-----------+
| Category       | Size      |
+----------------+-----------+
| Total Physical | 128.00 GB |
| Total Swap     | 32.00 GB  |
| Process Memory | 85.50 GB  |
| Buffer Pool    | 45.20 GB  |
| Query Memory   | 35.80 GB  |
| Free Memory    | 4.50 GB   |
+----------------+-----------+
Fetched 6 row(s) in 0.01s
# Set the query memory limit
[impala01.fgedu.net.cn:21000] > SET MEM_LIMIT=80g;
MEM_LIMIT set to 80g
# Set the degree of parallelism
[impala01.fgedu.net.cn:21000] > SET MT_DOP=8;
MT_DOP set to 8
# View current query options
[impala01.fgedu.net.cn:21000] > SET;
Query: set
+----------------------+-------+
| Option               | Value |
+----------------------+-------+
| MEM_LIMIT            | 80g   |
| MT_DOP               | 8     |
| EXEC_TIME_LIMIT_S    | 3600  |
| QUERY_TIMEOUT_S      | 3600  |
| DISABLE_CACHED_READS | false |
| ENABLE_EXPR_REWRITES | true  |
+----------------------+-------+
Fetched 6 row(s) in 0.01s
10.2 Query Optimization Tips
[impala01.fgedu.net.cn:21000] > EXPLAIN SELECT * FROM fgedu_sales
> WHERE year = 2024 AND month IN (1, 2, 3);
Query: explain select * from fgedu_sales
where year = 2024 and month in (1, 2, 3)
+----------------------------------------------------------+
| explain string                                           |
+----------------------------------------------------------+
| Max Per-Host Resource Reservation: Thread Reservation=0  |
| Per-Host Resource Estimates: Thread Reservation=0        |
|                                                          |
| PLAN-ROOT SINK                                           |
| |                                                        |
| 02:SCAN HDFS [fgedudb.fgedu_sales]                       |
|    HDFS partitions=3/24 files=3 size=30MB                |
|    predicates: year = 2024, month IN (1, 2, 3)           |
+----------------------------------------------------------+
# Use statistics for optimization
[impala01.fgedu.net.cn:21000] > COMPUTE STATS fgedu_sales;
Query: compute stats fgedu_sales
+-----------------------------------------+
| summary                                 |
+-----------------------------------------+
| Updated 3 partition(s) and 6 column(s). |
+-----------------------------------------+
Fetched 1 row(s) in 2.35s
# View table statistics
[impala01.fgedu.net.cn:21000] > SHOW TABLE STATS fgedu_sales;
Query: show table stats fgedu_sales
+------+-------+---------+--------+--------+----------------+
| year | month | #Rows   | #Files | Size   | Format         |
+------+-------+---------+--------+--------+----------------+
| 2024 | 1     | 1000000 | 3      | 30.5MB | PARQUET/SNAPPY |
| 2024 | 2     | 950000  | 3      | 28.8MB | PARQUET/SNAPPY |
| 2024 | 3     | 1050000 | 3      | 31.2MB | PARQUET/SNAPPY |
+------+-------+---------+--------+--------+----------------+
Fetched 3 row(s) in 0.02s
10.3 Storage Format Optimization
[impala01.fgedu.net.cn:21000] > CREATE TABLE fgedu_sales_optimized (
> sale_id BIGINT,
> product_id INT,
> customer_id INT,
> sale_amount DECIMAL(12,2),
> sale_date DATE,
> region STRING
> )
> PARTITIONED BY (year INT, month INT)
> STORED AS PARQUET
> TBLPROPERTIES ('parquet.compression'='SNAPPY');
Query: create TABLE fgedu_sales_optimized (
sale_id BIGINT,
product_id INT,
customer_id INT,
sale_amount DECIMAL(12,2),
sale_date DATE,
region STRING
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
# Set the target Parquet file size
[impala01.fgedu.net.cn:21000] > SET PARQUET_FILE_SIZE=256MB;
PARQUET_FILE_SIZE set to 256MB
# Insert data (list the columns explicitly; SELECT * would also return the
# partition columns and fail against the static partition clause)
[impala01.fgedu.net.cn:21000] > INSERT INTO fgedu_sales_optimized PARTITION (year=2024, month=1)
> SELECT sale_id, product_id, customer_id, sale_amount, sale_date, region
> FROM fgedu_sales WHERE year=2024 AND month=1;
Query: insert into fgedu_sales_optimized partition (year=2024, month=1)
select sale_id, product_id, customer_id, sale_amount, sale_date, region from fgedu_sales where year=2024 and month=1
Inserted 1000000 row(s) in 8.56s
11. Impala Upgrade and Migration
An Impala upgrade must be performed carefully to protect data integrity and business continuity.
11.1 Pre-upgrade Preparation
$ impala-shell --version
Impala Shell v3.4.0
# Back up the configuration files
# cp -r /data/impala/conf /backup/impala_conf_$(date +%Y%m%d)
# Back up the Hive Metastore database
# mysqldump -u hive -p'fgedu_hive_2024' metastore > /backup/metastore_backup_$(date +%Y%m%d).sql
# Check HDFS data integrity
$ hdfs fsck /user/hive/warehouse
FSCK started by hadoop from /192.168.1.51
………………………………………………
/user/hive/warehouse/fgedudb.db/fgedu_employees: OK
/user/hive/warehouse/fgedudb.db/fgedu_sales: OK
The filesystem under path '/user/hive/warehouse' is HEALTHY
# Record currently running queries
[impala01.fgedu.net.cn:21000] > SHOW RUNNING QUERIES;
Query: show running queries
+----------------------------------+--------+---------+------------+
| Query ID                         | User   | State   | Start Time |
+----------------------------------+--------+---------+------------+
| 4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c9d | hadoop | RUNNING | 10:30:00   |
+----------------------------------+--------+---------+------------+
# Wait for all queries to finish
[impala01.fgedu.net.cn:21000] > SHOW RUNNING QUERIES;
Query: show running queries
Fetched 0 row(s) in 0.01s
11.2 Perform the Upgrade
# Stop impalad on all impalad nodes
$ pkill -f impalad
# Stop catalogd on the Catalog node
$ pkill -f catalogd
# Stop statestored on the StateStore node
$ pkill -f statestored
# Verify that all processes have stopped
$ ps -ef | grep -E 'impalad|catalogd|statestored' | grep -v grep
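pkill returns before the daemons have fully exited, so a polling helper is safer than a one-shot ps check. A sketch; `wait_stopped` is a hypothetical helper using `pgrep -x` to match the daemon process names exactly:

```shell
#!/bin/sh
# wait_stopped polls until none of the named Impala daemons are running,
# up to a timeout in seconds; returns 0 once all are gone, 1 on timeout.
wait_stopped() {  # timeout_seconds
  t=0
  while [ "$t" -lt "$1" ]; do
    if ! pgrep -x impalad > /dev/null 2>&1 \
       && ! pgrep -x catalogd > /dev/null 2>&1 \
       && ! pgrep -x statestored > /dev/null 2>&1; then
      echo "all Impala daemons stopped"
      return 0
    fi
    sleep 1
    t=$((t + 1))
  done
  echo "TIMEOUT: daemons still running after $1s"
  return 1
}
wait_stopped 60
```

Run it on each node before touching the installation directory; a timeout suggests a stuck daemon that may need a harder kill.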
# Download the new Impala version
# cd /tmp
# wget https://downloads.apache.org/impala/4.4.0/apache-impala-4.4.0.tar.gz
# Back up the old version
# mv /data/impala/impala /data/impala/impala_3.4.0_backup
# Extract the new version
# tar -xzf apache-impala-4.4.0.tar.gz
# mv apache-impala-4.4.0 /data/impala/impala
# chown -R impala:impala /data/impala/impala
# Restore the configuration files
# cp -r /backup/impala_conf_$(date +%Y%m%d)/* /data/impala/conf/
# Check configuration compatibility
# diff /backup/impala_conf_$(date +%Y%m%d)/impalad_flags /data/impala/conf/impalad_flags
11.3 Start the New Version and Verify
# Start statestored
$ nohup $IMPALA_HOME/sbin/statestored \
--flagfile=/data/impala/conf/statestored_flags \
> /data/impala/logs/statestored.out 2>&1 &
# Wait 10 seconds
$ sleep 10
# Start catalogd
$ nohup $IMPALA_HOME/sbin/catalogd \
--flagfile=/data/impala/conf/catalogd_flags \
> /data/impala/logs/catalogd.out 2>&1 &
# Wait 30 seconds for catalogd to fully start
$ sleep 30
# Start impalad (on all nodes)
$ nohup $IMPALA_HOME/sbin/impalad \
--flagfile=/data/impala/conf/impalad_flags \
> /data/impala/logs/impalad.out 2>&1 &
# Verify the version
$ impala-shell --version
Impala Shell v4.4.0
# Connection test
$ impala-shell -i impala01.fgedu.net.cn -q "SELECT VERSION();"
Query: select version()
+-----------------------+
| version()             |
+-----------------------+
| impalad version 4.4.0 |
+-----------------------+
Fetched 1 row(s) in 0.02s
# Verify data integrity
[impala01.fgedu.net.cn:21000] > SELECT COUNT(*) FROM fgedudb.fgedu_sales;
Query: select count(*) from fgedudb.fgedu_sales
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
Fetched 1 row(s) in 0.35s
# Verify query functionality
[impala01.fgedu.net.cn:21000] > SELECT * FROM fgedudb.fgedu_employees LIMIT 5;
Query: select * from fgedudb.fgedu_employees limit 5
+----+------+------------+----------+------------+
| id | name | department | salary   | hire_date  |
+----+------+------------+----------+------------+
| 1  | 张三 | 技术部     | 15000.00 | 2020-01-15 |
| 2  | 李四 | 市场部     | 12000.00 | 2020-03-20 |
| 3  | 王五 | 财务部     | 13000.00 | 2020-05-10 |
| 4  | 赵六 | 技术部     | 18000.00 | 2019-08-01 |
| 5  | 钱七 | 人事部     | 11000.00 | 2021-02-15 |
+----+------+------------+----------+------------+
Fetched 5 row(s) in 0.12s
12. Impala Monitoring and Operations
Impala monitoring and operations cover service status monitoring, performance monitoring, log management, and more.
12.1 Service Status Monitoring
$ ps -ef | grep -E 'impalad|catalogd|statestored' | grep -v grep
impala 12345 1 2 10:00 ? 00:05:23 /data/impala/impala/sbin/statestored
impala 12567 1 5 10:01 ? 00:12:45 /data/impala/impala/sbin/catalogd
impala 12789 1 15 10:02 ? 00:35:07 /data/impala/impala/sbin/impalad
# Check listening ports
$ netstat -tlnp | grep -E '21000|21050|24000|25000|26000'
tcp 0 0 0.0.0.0:21000 0.0.0.0:* LISTEN 12789/impalad
tcp 0 0 0.0.0.0:21050 0.0.0.0:* LISTEN 12789/impalad
tcp 0 0 0.0.0.0:24000 0.0.0.0:* LISTEN 12345/statestored
tcp 0 0 0.0.0.0:25000 0.0.0.0:* LISTEN 12789/impalad
tcp 0 0 0.0.0.0:26000 0.0.0.0:* LISTEN 12567/catalogd
# Monitor via the Web UIs
# StateStore Web UI: http://192.168.1.51:25010
# Catalog Web UI: http://192.168.1.51:25020
# Impalad Web UI: http://192.168.1.51:25000
# View cluster node status
[impala01.fgedu.net.cn:21000] > SHOW IMPALA NODES;
Query: show impala nodes
+----------+--------------+--------+-------+---------+
| Name     | IP           | State  | Port  | Version |
+----------+--------------+--------+-------+---------+
| impala01 | 192.168.1.51 | ACTIVE | 21000 | 4.4.0   |
| impala02 | 192.168.1.52 | ACTIVE | 21000 | 4.4.0   |
| impala03 | 192.168.1.53 | ACTIVE | 21000 | 4.4.0   |
+----------+--------------+--------+-------+---------+
12.2 Performance Monitoring
[impala01.fgedu.net.cn:21000] > SHOW RUNNING QUERIES;
Query: show running queries
+----------------------------------+--------+---------+------------+-------------+
| Query ID                         | User   | State   | Start Time | Memory Used |
+----------------------------------+--------+---------+------------+-------------+
| 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d | hadoop | RUNNING | 10:35:00   | 2.5 GB      |
+----------------------------------+--------+---------+------------+-------------+
# View query history
[impala01.fgedu.net.cn:21000] > SHOW COMPLETED QUERIES;
Query: show completed queries
+----------------------------------+--------+----------+------------+-------------+----------+
| Query ID                         | User   | State    | Start Time | Memory Used | Duration |
+----------------------------------+--------+----------+------------+-------------+----------+
| 9f8e7d6c5b4a3210987654321fedcba0 | hadoop | FINISHED | 10:30:00   | 1.8 GB      | 2.35s    |
| 8a7b6c5d4e3f2109876543210fedcba9 | hadoop | FINISHED | 10:25:00   | 3.2 GB      | 5.67s    |
+----------------------------------+--------+----------+------------+-------------+----------+
# View resource usage
[impala01.fgedu.net.cn:21000] > SHOW RESOURCE_POOLS;
Query: show resource_pools
+--------------+------------+-------------+---------+--------+
| Pool Name    | Max Memory | Max Queries | Running | Queued |
+--------------+------------+-------------+---------+--------+
| root.default | 120 GB     | unlimited   | 2       | 0      |
| root.admin   | 40 GB      | 10          | 0       | 0      |
| root.etl     | 80 GB      | 20          | 1       | 0      |
+--------------+------------+-------------+---------+--------+
12.3 Log Management
$ tail -100 /data/impala/logs/impalad.INFO
I0405 10:00:00.000000 12789 impalad.cc:120] Impala service started
I0405 10:00:05.000100 12789 impalad.cc:200] Connected to StateStore
# View the error log
$ tail -100 /data/impala/logs/impalad.ERROR
E0405 10:05:00.000000 12789 impalad.cc:300] Query execution failed
# View the audit log
$ tail -100 /data/impala/logs/audit/audit_event_log.1
{"query_id":"1234567890abcdef","start_time":"2024-04-05 10:00:00","user":"hadoop","query":"SELECT * FROM fgedu_sales","status":"FINISHED"}
# Log rotation settings
$ cat /data/impala/conf/impalad_flags | grep log
-log_dir=/data/impala/logs
-log_filename=impalad
-log_verbosity=1
-max_log_files=10
-max_log_size=100MB
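Beyond the built-in rotation flags above, rotated files still accumulate on disk and need periodic pruning. A sketch; `prune_logs` is a hypothetical helper, and the directory and retention are parameters so the same function can be exercised safely outside the cluster:

```shell
#!/bin/sh
# prune_logs deletes rotated Impala log files older than a given number of
# days under a log directory; -print shows what was removed.
prune_logs() {  # dir retention_days
  find "$1" -type f \
    \( -name '*.INFO.*' -o -name '*.ERROR.*' -o -name '*.WARNING.*' -o -name '*.out' \) \
    -mtime +"$2" -print -delete
}
prune_logs /data/impala/logs 7 2>/dev/null
```

A daily cron entry such as `0 3 * * * /usr/local/bin/prune_impala_logs.sh` keeps the log directory bounded; adjust the 7-day retention to your audit requirements.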
This article was compiled and published by Fenge Tutorials for learning and testing purposes only. When reposting, credit the source: http://www.fgedu.net.cn/10327.html
