本文详细介绍ClickHouse列式数据库实时分析实战,包括ClickHouse概述、架构原理、集群部署、数据导入、SQL查询、性能优化等内容,风哥教程参考ClickHouse官方文档Getting Started、Installation、SQL Reference等内容,适合大数据工程师使用。学习交流加群风哥QQ113257174
Part01-基础概念与理论知识
1.1 ClickHouse概述与核心特性
ClickHouse是Yandex开源的列式数据库管理系统,专门用于联机分析(OLAP)查询。更多视频教程www.fgedu.net.cn
- 列式存储:压缩率高,IO少
- 向量计算:CPU利用率高
- 线性扩展:支持水平扩展
- 实时导入:支持高吞吐数据写入
- SQL支持:标准SQL语法
- 高可用:支持副本和分片
1.2 ClickHouse架构与表引擎
ClickHouse核心表引擎:
MergeTree:
– 最基础的表引擎
– 支持数据分区
– 支持数据排序
– 支持索引
ReplacingMergeTree:
– 去重引擎
– 根据排序键去重
– 后台合并时去重
SummingMergeTree:
– 聚合引擎
– 自动聚合数值列
– 适合预聚合
AggregatingMergeTree:
– 高级聚合引擎
– 支持聚合函数状态
– 增量聚合
CollapsingMergeTree:
– 折叠引擎
– 通过sign列标记
– 适合更新删除
VersionedCollapsingMergeTree:
– 版本折叠引擎
– 带版本号的折叠
– 更可靠
# 分布式引擎
Distributed:
– 分布式表引擎
– 查询时自动路由
– 写入时自动分发
1.3 ClickHouse数据类型
ClickHouse常用数据类型:
- 整数:Int8/Int16/Int32/Int64/UInt8/UInt16/UInt32/UInt64
- 浮点:Float32/Float64/Decimal
- 字符串:String/FixedString
- 日期:Date/DateTime/DateTime64
- 数组:Array(T)
- 枚举:Enum8/Enum16
Part02-生产环境规划与建议
2.1 ClickHouse集群规划
ClickHouse集群规划要点:
ClickHouse节点:
– 数量:3-100个
– 配置:16核64GB-64核256GB
– 磁盘:SSD/NVMe(性能优先)
– 网络:万兆网卡
# 集群架构
副本:
– 至少2副本
– 建议3副本
– 跨机架部署
分片:
– 根据数据量决定
– 建议2-64分片
– 均匀分布数据
# 目录规划
安装目录:/bigdata/app/clickhouse
数据目录:/bigdata/fgdata/clickhouse
日志目录:/bigdata/fgdata/logs/clickhouse
2.2 ClickHouse核心配置
ClickHouse核心配置建议:
<path>/bigdata/fgdata/clickhouse</path>
<tmp_path>/bigdata/fgdata/clickhouse/tmp</tmp_path>
<user_files_path>/bigdata/fgdata/clickhouse/user_files</user_files_path>
<max_concurrent_queries>100</max_concurrent_queries>
<max_memory_usage>64424509440</max_memory_usage>
<max_memory_usage_for_user>53687091200</max_memory_usage_for_user>
<mark_cache_size>5368709120</mark_cache_size>
<uncompressed_cache_size>8589934592</uncompressed_cache_size>
# users.xml
<max_memory_usage>53687091200</max_memory_usage>
<max_query_size>268435456</max_query_size>
<max_insert_block_size>1048576</max_insert_block_size>
# 远程服务器配置
<remote_servers>
<fgedu_cluster>
<shard>
<replica>
<host>fgedu-ck01</host>
<port>9000</port>
</replica>
<replica>
<host>fgedu-ck02</host>
<port>9000</port>
</replica>
</shard>
<shard>
<replica>
<host>fgedu-ck03</host>
<port>9000</port>
</replica>
<replica>
<host>fgedu-ck04</host>
<port>9000</port>
</replica>
</shard>
</fgedu_cluster>
</remote_servers>
2.3 资源配置建议
资源配置建议:
- 小型集群:3节点,每台16核64GB
- 中型集群:10节点,每台32核128GB
- 大型集群:30节点,每台64核256GB
from bigdata视频:www.itpux.com
Part03-生产环境项目实施方案
3.1 ClickHouse集群安装部署
3.1.1 下载安装ClickHouse
yum install -y yum-utils
yum-config-manager –add-repo https://packages.clickhouse.com/rpm/stable/clickhouse.repo
# 2. 安装ClickHouse
yum install -y clickhouse-server clickhouse-client
# 3. 配置
cp /etc/clickhouse-server/config.xml /etc/clickhouse-server/config.xml.bak
cp /etc/clickhouse-server/users.xml /etc/clickhouse-server/users.xml.bak
# 4. 修改配置文件
vi /etc/clickhouse-server/config.xml
# 修改path为/bigdata/fgdata/clickhouse
# 配置remote_servers
# 5. 创建目录
mkdir -p /bigdata/fgdata/clickhouse
mkdir -p /bigdata/fgdata/clickhouse/tmp
mkdir -p /bigdata/fgdata/clickhouse/user_files
mkdir -p /bigdata/fgdata/logs/clickhouse
chown -R clickhouse:clickhouse /bigdata/fgdata/clickhouse
chown -R clickhouse:clickhouse /bigdata/fgdata/logs/clickhouse
# 6. 启动服务
systemctl start clickhouse-server
systemctl enable clickhouse-server
# 7. 验证安装
clickhouse-client
🙂 SELECT version();
# 8. 集群验证
🙂 SELECT * FROM system.clusters;
3.2 ClickHouse表创建与数据导入
3.2.1 创建表并导入数据
clickhouse-client
# 创建数据库
🙂 CREATE DATABASE IF NOT EXISTS fgedu_db;
🙂 USE fgedu_db;
# 创建本地表(MergeTree)
🙂 CREATE TABLE IF NOT EXISTS fgedu_user_log_local (
dt Date,
ts DateTime,
user_id UInt64,
event_type String,
page String,
duration UInt32
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(dt)
ORDER BY (dt, ts, user_id)
SETTINGS index_granularity = 8192;
# 创建分布式表
🙂 CREATE TABLE IF NOT EXISTS fgedu_user_log (
dt Date,
ts DateTime,
user_id UInt64,
event_type String,
page String,
duration UInt32
)
ENGINE = Distributed(‘fgedu_cluster’, ‘fgedu_db’, ‘fgedu_user_log_local’, rand());
# 导入CSV文件
🙂 INSERT INTO fgedu_user_log
FROM INFILE ‘/bigdata/fgdata/user_log.csv’
FORMAT CSV;
# 使用clickhouse-client导入
cat /bigdata/fgdata/user_log.csv | clickhouse-client -h fgedu-ck \
–query=”INSERT INTO fgedu_db.fgedu_user_log FORMAT CSV”
# 并行导入
for file in /bigdata/fgdata/data_part_*.csv; do
cat $file | clickhouse-client -h fgedu-ck \
–query=”INSERT INTO fgedu_db.fgedu_user_log FORMAT CSV” &
done
wait
3.3 HDFS数据同步到ClickHouse
3.3.1 使用Spark同步数据
<dependency>
<groupId>com.github.housepower</groupId>
<artifactId>clickhouse-native-jdbc</artifactId>
<version>2.6.5</version>
</dependency>
# Spark写入ClickHouse
import org.apache.spark.sql.SparkSession
import java.util.Properties
object FgeduSpark2ClickHouse {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName(“FgeduSpark2ClickHouse”)
.enableHiveSupport()
.getOrCreate()
val df = spark.read.table(“hive_db.fgedu_user_table”)
val properties = new Properties()
properties.setProperty(“user”, “default”)
properties.setProperty(“password”, “”)
properties.setProperty(“driver”, “com.github.housepower.jdbc.ClickHouseDriver”)
properties.setProperty(“socket_timeout”, “300000”)
properties.setProperty(“rewriteBatchedStatements”, “true”)
properties.setProperty(“batchsize”, “100000”)
df.write
.mode(“append”)
.option(“isolationLevel”, “NONE”)
.option(“numPartitions”, “10”)
.jdbc(“jdbc:clickhouse://fgedu-ck:8123/fgedu_db”,
“fgedu_user_table”, properties)
spark.stop()
}
}
# 使用ClickHouse表函数直接读取HDFS
🙂 INSERT INTO fgedu_user_log
SELECT * FROM hdfs(
‘hdfs://fgedu-nn:8020/bigdata/fgdata/hive/warehouse/fgedu_user/*’,
‘CSV’,
‘dt Date, ts DateTime, user_id UInt64, event_type String, page String, duration UInt32’
);
Part04-生产案例与实战讲解
4.1 ClickHouse SQL查询实战
4.1.1 基础SQL查询
clickhouse-client -d fgedu_db
# 基础查询
🙂 SELECT COUNT(*) FROM fgedu_user_log WHERE dt = ‘2024-04-08’;
┌─count()─┐
│ 5000000 │
└─────────┘
# 聚合查询
🙂 SELECT
dt,
event_type,
COUNT(*) AS pv,
uniq(user_id) AS uv
FROM fgedu_user_log
WHERE dt BETWEEN ‘2024-04-01’ AND ‘2024-04-07’
GROUP BY dt, event_type
ORDER BY dt, pv DESC;
# TopN查询
🙂 SELECT
user_id,
COUNT(*) AS event_cnt
FROM fgedu_user_log
WHERE dt = ‘2024-04-08’
GROUP BY user_id
ORDER BY event_cnt DESC
LIMIT 100;
# 窗口函数
🙂 SELECT
dt,
city,
sales_amt,
RANK() OVER (PARTITION BY dt ORDER BY sales_amt DESC) AS city_rank
FROM fgedu_sales
WHERE dt = ‘2024-04-08’
QUALIFY RANK() OVER (PARTITION BY dt ORDER BY sales_amt DESC) <= 10;
# 查看执行计划
🙂 EXPLAIN SELECT * FROM fgedu_user_log WHERE dt = ‘2024-04-08’;
4.2 ClickHouse性能优化实战
4.2.1 查询优化技巧
# 不推荐
SELECT * FROM fgedu_user_log WHERE toYYYYMM(dt) = 202404;
# 推荐
SELECT * FROM fgedu_user_log WHERE dt BETWEEN ‘2024-04-01’ AND ‘2024-04-30’;
# 优化2:使用主键索引
# ORDER BY的列要在WHERE中使用
SELECT * FROM fgedu_user_log WHERE dt = ‘2024-04-08’ AND user_id = 12345;
# 优化3:使用物化视图
🙂 CREATE MATERIALIZED VIEW fgedu_user_log_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(dt)
ORDER BY (dt, event_type)
AS SELECT
dt,
event_type,
countState(*) AS pv_state,
uniqState(user_id) AS uv_state
FROM fgedu_user_log
GROUP BY dt, event_type;
# 查询物化视图
🙂 SELECT
dt,
event_type,
countMerge(pv_state) AS pv,
uniqMerge(uv_state) AS uv
FROM fgedu_user_log_mv
GROUP BY dt, event_type;
# 优化4:调整max_threads
🙂 SET max_threads = 16;
# 优化5:调整max_memory_usage
🙂 SET max_memory_usage = 50000000000;
# 优化6:使用prewhere代替where
🙂 SELECT * FROM fgedu_user_log
PREWHERE dt = ‘2024-04-08’
WHERE event_type = ‘click’;
# 优化7:数据压缩
SETTINGS compression = ‘LZ4’;
4.3 ClickHouse监控与运维
4.3.1 系统表监控
🙂 SELECT query_id, user, query, elapsed, read_rows, read_bytes
FROM system.processes;
# 查询慢查询
🙂 SELECT query, duration_ms, read_rows, read_bytes
FROM system.query_log
WHERE type = ‘QueryFinish’
AND duration_ms > 1000
ORDER BY duration_ms DESC
LIMIT 100;
# 查询表大小
🙂 SELECT
database,
table,
formatReadableSize(sum(bytes)) AS size,
sum(rows) AS rows
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes) DESC;
# 查询副本状态
🙂 SELECT * FROM system.replicas;
# 查询ZooKeeper状态(如果使用)
🙂 SELECT * FROM system.zookeeper WHERE path = ‘/’;
# Prometheus监控
# config.xml中配置
<prometheus>
<endpoint>/metrics</endpoint>
<port>9363</port>
<metrics>true</metrics>
<events>true</events>
<asynchronous_metrics>true</asynchronous_metrics>
</prometheus>
# Grafana Dashboard
# 导入ClickHouse Dashboard
# 关键指标:
– QPS
– 查询延迟
– 并发查询数
– 数据量
– 磁盘IO
– CPU/内存使用
Part05-风哥经验总结与分享
5.1 ClickHouse生产最佳实践
ClickHouse生产最佳实践:
- 表引擎:选择合适的表引擎,MergeTree是基础
- 主键设计:ORDER BY列要放在WHERE中
- 分区设计:按时间分区,定期管理分区
- 数据类型:选择合适的数据类型,越小越好
- 物化视图:合理使用物化视图加速查询
- 监控告警:实时监控慢查询和资源使用
5.2 常见问题处理
– 调整max_memory_usage
– 减少查询数据量
– 优化SQL
– 增加节点内存
# 常见问题2:查询慢
– 检查是否使用分区裁剪
– 检查是否使用主键索引
– 创建物化视图
– 增加节点
# 常见问题3:写入慢
– 增加batch size
– 使用并行写入
– 调整max_insert_block_size
– 检查磁盘IO
# 常见问题4:副本不同步
– 检查ZooKeeper状态
– 检查网络连接
– 查看system.replicas
– 手动触发同步
# 常见问题5:磁盘空间不足
– 清理旧分区
– 优化压缩
– 增加磁盘
– 数据归档
5.3 运维检查清单
– [ ] 节点状态
– [ ] 副本状态
– [ ] 查询状态
– [ ] 慢查询
– [ ] 磁盘空间
– [ ] 内存使用
– [ ] CPU使用
– [ ] ZooKeeper状态
– [ ] 分区状态
– [ ] 数据量增长
– [ ] 告警规则检查
– [ ] 日志检查
# 日常巡检内容
1. 检查节点状态
2. 查看慢查询
3. 检查副本同步
4. 检查磁盘空间
5. 检查资源使用
6. 查看错误日志
7. 管理分区
8. 清理旧数据
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
