290. Kylin OLAP Engine Training

1. Kylin Overview

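Apache Kylin is an open-source distributed analytics engine. It provides a SQL interface and multidimensional analysis (OLAP) on top of Hadoop, and supports sub-second queries over very large datasets.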

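1.1 Core Features

Sub-second queries: precomputed Cubes answer queries without scanning the raw data
Standard SQL: supports most ANSI SQL query functions
MOLAP: a multidimensional OLAP analysis engine
Hadoop integration: built on the Hadoop ecosystem

1.2 Kylin Architecture

Kylin architecture components: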

┌─────────────────────────────────────────┐
│           Kylin Web UI                  │
│           Kylin REST API                │
├─────────────────────────────────────────┤
│           Query Engine                  │
│           (Calcite)                     │
├─────────────────────────────────────────┤
│           Metadata Store                │
│           (MySQL)                       │
├─────────────────────────────────────────┤
│           Cube Storage                  │
│           (HBase/HDFS)                  │
├─────────────────────────────────────────┤
│           Build Engine                  │
│           (MapReduce/Spark)             │
└─────────────────────────────────────────┘

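2. Installation and Deployment

2.1 Environment Preparation

# Install dependencies
# Hadoop and Hive must be installed beforehand. HBase is required for Kylin 3.x and earlier;
# Kylin 4.x stores Cubes as Parquet files on HDFS and builds them with Spark.

# Set environment variables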
export JAVA_HOME=/usr/lib/jvm/java-8
export HADOOP_HOME=/opt/hadoop
export HBASE_HOME=/opt/hbase
export HIVE_HOME=/opt/hive
export SPARK_HOME=/opt/spark
export KYLIN_HOME=/opt/kylin

# Download Kylin
wget https://archive.apache.org/dist/kylin/apache-kylin-4.0.3/apache-kylin-4.0.3-bin.tar.gz

# Extract and install
tar -xzf apache-kylin-4.0.3-bin.tar.gz -C /opt/
ln -s /opt/apache-kylin-4.0.3-bin /opt/kylin

# Check the environment
$KYLIN_HOME/bin/check-env.sh

# Sample output
Checking JAVA_HOME…
Checking HADOOP_HOME…
Checking HIVE_HOME…
Checking HBASE_HOME…
Environment check passed.
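
Before running check-env.sh, it can help to confirm that each exported variable points at a real directory. The following is a minimal sketch: `check_homes` is a hypothetical helper (not part of Kylin), and the variable list simply mirrors the exports in this section.

```shell
# check_homes NAME...: report every variable that is unset or not a directory.
# Hypothetical helper for illustration; returns 0 only if all checks pass.
check_homes() {
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ] || [ ! -d "$value" ]; then
      echo "MISSING: $name"
      missing=1
    fi
  done
  return "$missing"
}

check_homes JAVA_HOME HADOOP_HOME HIVE_HOME SPARK_HOME KYLIN_HOME \
  || echo "Fix the variables listed above before running check-env.sh"
```

The helper only checks that the directories exist; it does not replace the version and permission checks that check-env.sh performs.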

2.2 Starting Kylin

# Start Kylin
$KYLIN_HOME/bin/kylin.sh start

# Follow the log
tail -f $KYLIN_HOME/logs/kylin.log

# Open the Web UI
# http://fgedudb:7070/kylin
# Default account: ADMIN/KYLIN (change the password outside of test environments)

# Stop Kylin
$KYLIN_HOME/bin/kylin.sh stop

# Restart Kylin
$KYLIN_HOME/bin/kylin.sh restart

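3. Projects and Data Sources

3.1 Creating a Project

# Create a project in the Web UI
# 1. Log in to the Kylin Web UI
# 2. Click "Modeling" -> "Project"
# 3. Click "+ Project"
# 4. Enter the project name

# Create a project through the REST API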
curl -X POST \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  -H "Content-Type: application/json" \
  -d '{"project":"fgsales_project"}' \
  http://fgedudb:7070/kylin/api/projects
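
The `Authorization` value used in the curl commands is HTTP Basic authentication: the string `user:password` encoded as Base64. A minimal sketch of how the header for the default ADMIN/KYLIN account is derived follows; `basic_auth_header` is a hypothetical helper name.

```shell
# basic_auth_header USER PASSWORD: print the HTTP Basic Authorization value.
basic_auth_header() {
  # printf (unlike echo) adds no trailing newline to the text being encoded
  printf 'Basic %s\n' "$(printf '%s:%s' "$1" "$2" | base64)"
}

basic_auth_header ADMIN KYLIN
# prints: Basic QURNSU46S1lMSU4=
```

Base64 is an encoding, not encryption, so anyone who sees the header can recover the password. Use HTTPS and a non-default password outside of test environments.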

3.2 Loading a Data Source

# Load Hive tables
# In the Web UI:
# 1. Select the project
# 2. Click "Modeling" -> "Data Source"
# 3. Click "Load Table"
# 4. Enter the table names or pick a database

# Through the REST API
curl -X POST \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  -H "Content-Type: application/json" \
  -d '{"calculate": false}' \
  http://fgedudb:7070/kylin/api/tables/default.fgsales,default.users/fgsales_project

# View the table schema
curl -X GET \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  http://fgedudb:7070/kylin/api/tables/default.fgsales

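4. Data Model

4.1 Creating a Data Model

# Create a data model in the Web UI
# 1. Click "Modeling" -> "Models"
# 2. Click "+ Model"
# 3. Select the fact table
# 4. Add the dimension (lookup) tables and their join conditions
# 5. Select the dimension columns and measure columns
# 6. Set the partition column

# Example: sales data model
Fact table: fgsales
Dimension tables: users, products, time
Join conditions:
  fgsales.user_id = users.id
  fgsales.product_id = products.id
  fgsales.time_id = time.id

Dimension columns:
  users.name (user name)
  products.category (product category)
  time.year (year)
  time.month (month)

Measure columns:
  fgsales.amount (sales amount)
  fgsales.quantity (quantity)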

4.2 Model Configuration

# Create the model through the REST API
curl -X POST \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "fgsales_model",
    "fact_table": "default.fgsales",
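    "lookups": [{
      "table": "default.users",
      "join": {
        "type": "inner",
        "primary_key": ["id"],
        "foreign_key": ["user_id"]
      }
    }, {
      "table": "default.products",
      "join": {
        "type": "inner",
        "primary_key": ["id"],
        "foreign_key": ["product_id"]
      }
    }],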
    "dimensions": [
      {"table": "users", "column": "name"},
      {"table": "products", "column": "category"}
    ],
    "metrics": [
      {"table": "fgsales", "column": "amount"},
      {"table": "fgsales", "column": "quantity"}
    ]
  }' \
  http://fgedudb:7070/kylin/api/models

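5. Cube Design

5.1 Creating a Cube

# Create a Cube in the Web UI
# 1. Select the data model
# 2. Click "+ Cube"
# 3. Select the dimensions
# 4. Select the measures
# 5. Configure the aggregation groups
# 6. Configure partitioning and segment merge settings

# Choosing dimensions
Dimension types:
  - Normal dimension (Normal)
  - Derived dimension (Derived)
  - Hierarchy dimension (Hierarchy)

# Choosing measures
Measure types:
  - SUM (sum)
  - COUNT (count)
  - MAX (maximum)
  - MIN (minimum)
  - COUNT_DISTINCT (distinct count)
  - TOP_N (Top N)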

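5.2 Cube Optimization

# Aggregation group configuration
# Put dimensions that are frequently queried together into the same aggregation group

Aggregation group 1: [user dimensions]
  - users.name
  - users.city
  - users.level

Aggregation group 2: [time dimensions]
  - time.year
  - time.month
  - time.day

Aggregation group 3: [product dimensions]
  - products.category
  - products.brand

# Mandatory dimensions
# Dimensions that every query must include; combinations without them are not built

# Joint dimensions
# Dimensions that usually appear together and are treated as a single unit

# Derived dimensions
# Dimensions derived from a lookup table through its key, which reduces Cube size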
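
To see why aggregation groups shrink a Cube, it helps to count dimension combinations (cuboids). The sketch below uses a simplified model: n dimensions yield 2^n - 1 non-empty combinations, each aggregation group only combines its own dimensions, and the base cuboid containing all dimensions is always built. It ignores mandatory, joint, and hierarchy rules, so treat the numbers as an illustration rather than what Kylin reports for a real Cube.

```shell
# combos N: number of non-empty combinations of N dimensions (2^N - 1)
combos() {
  echo "$(( (1 << $1) - 1 ))"
}

# All 8 dimensions of the example in a single group
without_groups="$(combos 8)"

# Three groups of 3, 3 and 2 dimensions, plus 1 for the base cuboid
with_groups="$(( $(combos 3) + $(combos 3) + $(combos 2) + 1 ))"

echo "single group of 8 dimensions: $without_groups cuboids"
echo "three aggregation groups:     $with_groups cuboids"
```

Under these assumptions the example drops from 255 combinations to 18. The trade-off is that a query mixing dimensions from different groups can no longer hit a small cuboid and falls back to a larger one.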

6. Building a Cube

6.1 Full Build

# Build in the Web UI
# 1. Select the Cube
# 2. Click "Build"
# 3. Select the build type: full build
# 4. Click "Submit"

# Build through the REST API
curl -X PUT \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  -H "Content-Type: application/json" \
  -d '{"buildType":"BUILD"}' \
  http://fgedudb:7070/kylin/api/cubes/fgsales_cube/build

# Check the build status
curl -X GET \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  http://fgedudb:7070/kylin/api/cubes/fgsales_cube

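6.2 Incremental Build

# Incremental build (requires a partition column in the model)
# startTime and endTime are epoch milliseconds; the range below is 2024-01-01 to 2024-01-02 UTC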
curl -X PUT \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  -H "Content-Type: application/json" \
  -d '{
    "buildType": "BUILD",
    "startTime": 1704067200000,
    "endTime": 1704153600000
  }' \
  http://fgedudb:7070/kylin/api/cubes/fgsales_cube/build
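
The `startTime` and `endTime` values are epoch milliseconds. A minimal sketch for deriving them from calendar dates follows. It assumes GNU `date` (standard on Linux; the BSD/macOS `date` uses different flags) and that segment boundaries are midnight UTC. `to_epoch_millis` is a hypothetical helper name.

```shell
# to_epoch_millis YYYY-MM-DD: print midnight UTC of that date in epoch milliseconds.
# Assumes GNU date: -u selects UTC, -d parses the date string, +%s prints epoch seconds.
to_epoch_millis() {
  seconds="$(date -u -d "$1 00:00:00" +%s)"
  echo "$(( seconds * 1000 ))"
}

to_epoch_millis 2024-01-01   # prints 1704067200000
to_epoch_millis 2024-01-02   # prints 1704153600000
```

Each daily build then uses one day's start as `startTime` and the next day's start as `endTime`, so consecutive segments neither overlap nor leave gaps.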

# List build jobs
curl -X GET \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  http://fgedudb:7070/kylin/api/jobs

# Cancel a build job (replace {job_id} with the job's ID)
curl -X PUT \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  http://fgedudb:7070/kylin/api/jobs/{job_id}/cancel
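
When scripting builds, the job state usually has to be read out of the job JSON. The sketch below extracts it with `sed`. It assumes the response carries a field named `job_status` with upper-case values such as RUNNING or FINISHED; verify the field name against the responses from your Kylin version. `job_status` as a function name and the sample JSON are hypothetical.

```shell
# job_status JSON: print the value of the "job_status" field, if present.
# Assumption: the field looks like "job_status":"FINISHED" in the response.
job_status() {
  printf '%s\n' "$1" \
    | sed -n 's/.*"job_status"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p'
}

# Hypothetical sample response, trimmed to the fields used here
sample='{"uuid":"job-1","job_status":"FINISHED","progress":100.0}'
job_status "$sample"

# Against a live server (not run here):
# job_status "$(curl -s -H "Authorization: Basic QURNSU46S1lMSU4=" \
#   http://fgedudb:7070/kylin/api/jobs/{job_id})"
```

For anything beyond a quick check, a real JSON parser such as `jq` is more robust than pattern matching.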

7. SQL Queries

7.1 Basic Queries

-- Query the Cube
SELECT 
    users.name,
    products.category,
    SUM(fgsales.amount) as total_amount,
    COUNT(*) as order_count
FROM fgsales
JOIN users ON fgsales.user_id = users.id
JOIN products ON fgsales.product_id = products.id
GROUP BY users.name, products.category
ORDER BY total_amount DESC
LIMIT 10;

-- Query by time dimension
SELECT 
    time.year,
    time.month,
    SUM(fgsales.amount) as monthly_fgsales
FROM fgsales
JOIN time ON fgsales.time_id = time.id
WHERE time.year = 2024
GROUP BY time.year, time.month
ORDER BY time.month;

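-- Filtered query (category lives in the products table, so it must be joined)
SELECT
    products.category,
    SUM(fgsales.amount) as total_amount
FROM fgsales
JOIN products ON fgsales.product_id = products.id
WHERE fgsales.dt = '2024-01-15'
AND products.category = 'Electronics'
GROUP BY products.category;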
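
Besides the Web UI, queries can be submitted to Kylin's query endpoint (`POST /kylin/api/query`) as a JSON body containing the SQL text and the project. The sketch below only builds that body. `query_payload` is a hypothetical helper, the limit of 50000 is an arbitrary illustrative value, and this simple version does not escape double quotes or backslashes inside the SQL.

```shell
# query_payload PROJECT SQL [LIMIT]: print a JSON request body for the query endpoint.
# Simplification: the SQL must not contain double quotes or backslashes.
query_payload() {
  printf '{"sql":"%s","project":"%s","limit":%s,"offset":0,"acceptPartial":false}\n' \
    "$2" "$1" "${3:-50000}"
}

payload="$(query_payload fgsales_project 'SELECT COUNT(*) FROM fgsales')"
echo "$payload"

# Sending it (not run here):
# curl -X POST \
#   -H "Authorization: Basic QURNSU46S1lMSU4=" \
#   -H "Content-Type: application/json" \
#   -d "$payload" \
#   http://fgedudb:7070/kylin/api/query
```

Note that the SQL is sent without a trailing semicolon.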

7.2 Multidimensional Analysis

-- Drill-down: move to a finer granularity (year -> month -> day)
SELECT 
    time.year,
    time.month,
    time.day,
    SUM(fgsales.amount)
FROM fgsales
JOIN time ON fgsales.time_id = time.id
GROUP BY time.year, time.month, time.day;

-- Roll-up: aggregate to a coarser granularity (year only)
SELECT 
    time.year,
    SUM(fgsales.amount)
FROM fgsales
JOIN time ON fgsales.time_id = time.id
GROUP BY time.year;

-- Slice: fix one dimension to a single value (year = 2024)
SELECT 
    products.category,
    SUM(fgsales.amount)
FROM fgsales
JOIN products ON fgsales.product_id = products.id
JOIN time ON fgsales.time_id = time.id
WHERE time.year = 2024
GROUP BY products.category;

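-- Dice: restrict several dimensions at once
SELECT
    time.month,
    products.category,
    SUM(fgsales.amount)
FROM fgsales
JOIN time ON fgsales.time_id = time.id
JOIN products ON fgsales.product_id = products.id
WHERE time.year = 2024
AND time.month IN (1, 2, 3)
AND products.category = 'Electronics'
GROUP BY time.month, products.category;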

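8. Monitoring and Operations

8.1 Monitoring Metrics

# Monitoring in the Web UI
# 1. Cube status
# 2. Build jobs
# 3. Query performance
# 4. Storage usage

# Fetch monitoring data through the REST API
# Cube statistics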
curl -X GET \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  http://fgedudb:7070/kylin/api/cubes/fgsales_cube/statistics

# Query history
curl -X GET \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  http://fgedudb:7070/kylin/api/query/history

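8.2 Operations

# Merge Cube segments: uses the build endpoint with buildType MERGE
# (the time range is illustrative and must cover the segments to merge)
curl -X PUT \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  -H "Content-Type: application/json" \
  -d '{"buildType":"MERGE","startTime":1704067200000,"endTime":1704240000000}' \
  http://fgedudb:7070/kylin/api/cubes/fgsales_cube/build

# Refresh a Cube segment: uses the build endpoint with buildType REFRESH
# (the time range is illustrative and must match an existing segment)
curl -X PUT \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  -H "Content-Type: application/json" \
  -d '{"buildType":"REFRESH","startTime":1704067200000,"endTime":1704153600000}' \
  http://fgedudb:7070/kylin/api/cubes/fgsales_cube/build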

# Delete the Cube
curl -X DELETE \
  -H "Authorization: Basic QURNSU46S1lMSU4=" \
  http://fgedudb:7070/kylin/api/cubes/fgsales_cube

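# Clean up unused metadata with the command-line tool
# Run without --delete first to list what would be removed, then add --delete true
$KYLIN_HOME/bin/metastore.sh clean
$KYLIN_HOME/bin/metastore.sh clean --delete true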

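9. Best Practices

Setting	Recommended value	Notes
Number of dimensions	≤20	Avoids an oversized Cube
Aggregation groups	Group by business area	Reduces the number of dimension combinations
Incremental build	Build daily	Shortens build time
Derived dimensions	Prefer where possible	Reduces storage space

Notes:

Design dimensions carefully to avoid Cube expansion
Merge segments regularly to improve query performance
Monitor storage usage
Tune build parameters to speed up builds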

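10. Summary

Kylin is a high-performance OLAP analytics engine. After working through this training document, you should be familiar with:

Kylin architecture and core concepts
Installation and deployment
Data model design
Cube design and optimization
Cube builds and incremental updates
SQL queries and multidimensional analysis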

IT Operations Training Document Series | No. 290 | Kylin OLAP Engine Training

Compiled and published by Fengge Tutorials (风哥教程) for study and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html