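In this document, Fengge introduces data analysis and machine learning with the DM database. It covers an overview of data analysis and machine learning, data analysis tools and methods, machine learning algorithms, model evaluation, implementing data analysis and machine learning, model deployment, practical cases, and best practices. The tutorial draws on the official DM documentation (the DM8 Data Analysis Guide and the DM8 Machine Learning Guide) and is intended for data analysis and machine learning developers working in learning and production environments.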
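Part01 - Basic Concepts and Theory
1.1 Data Analysis Overview
# Definition of data analysis
Data analysis is the process of applying appropriate statistical methods to large amounts of collected data in order to extract useful information, draw conclusions, and study and summarize the data in detail.
# Types of data analysis
1. Descriptive analysis
– Analyze historical data
– Describe the characteristics of the data
– Summarize patterns in the data
2. Diagnostic analysis
– Analyze the causes of problems
– Identify influencing factors
– Explain changes in the data
3. Predictive analysis
– Forecast future trends
– Estimate future values
– Identify potential risks
4. Prescriptive analysis
– Guide decision making
– Optimize business processes
– Improve business efficiency
# Value of data analysis
– Decision support: back decisions with data
– Higher efficiency: improve operational efficiency
– Lower cost: reduce operating costs
– Competitiveness: strengthen the company's competitive position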
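1.2 Machine Learning Overview
# Definition of machine learning
Machine learning is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, computational complexity theory, and other disciplines.
# Types of machine learning
1. Supervised learning
– Classification: predict a class label
– Regression: predict a continuous value
2. Unsupervised learning
– Clustering: group similar data points
– Dimensionality reduction: reduce the number of features
3. Reinforcement learning
– Learn through a reward mechanism
– Optimize a decision policy
# Machine learning workflow
1. Data collection
– Collect relevant data
– Ensure data quality
2. Data preprocessing
– Data cleaning
– Data transformation
– Feature engineering
3. Model training
– Select an algorithm
– Train the model
– Tune parameters
4. Model evaluation
– Evaluate model performance
– Select the best model
5. Model deployment
– Deploy the model
– Monitor the model
– Update the model
# Value of machine learning
– Automated decisions: make decisions automatically
– Higher accuracy: improve prediction accuracy
– Lower cost: reduce manual effort
– Competitiveness: strengthen the company's competitive position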
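1.3 Data Analysis Tools
Data analysis relies on several kinds of tools: databases, programming languages, visualization tools, and machine learning frameworks.
# Categories of data analysis tools
1. Databases
– DM database: a database developed in China
– MySQL: open-source database
– PostgreSQL: open-source database
2. Programming languages
– Python: the mainstream language for data analysis
– R: a language for statistical analysis
– SQL: the database query language
3. Visualization tools
– Tableau: commercial visualization tool
– Power BI: Microsoft's visualization tool
– Matplotlib: Python plotting library
4. Machine learning frameworks
– Scikit-learn: Python machine learning library
– TensorFlow: Google's deep learning framework
– PyTorch: deep learning framework originally developed by Facebook (now Meta)
# Role of the DM database in data analysis
– Data storage: store large volumes of data
– Data querying: query data quickly
– Data analysis: support complex analytical queries
– Data mining: support data mining workloads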
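Fengge's tip: data analysis and machine learning are important tools for digital transformation, and mastering their methods and tools is key to building intelligent decision systems. Choosing an approach that fits the business requirements and the characteristics of the data goes a long way toward a successful project.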
Part02 - Production Planning and Recommendations
2.1 Data Analysis Methods
2.1.1 Descriptive Analysis
# 1. Statistical analysis
-- Basic statistics
SQL> SELECT COUNT(*) AS count,
            AVG(amount) AS avg_amount,
            MAX(amount) AS max_amount,
            MIN(amount) AS min_amount,
            STDDEV(amount) AS stddev_amount
     FROM fgedu_order;
-- Grouped statistics
SQL> SELECT user_id, COUNT(*) AS order_count,
            SUM(amount) AS total_amount
     FROM fgedu_order
     GROUP BY user_id
     ORDER BY total_amount DESC;
# 2. Trend analysis
-- Trend over time (orders from the last six months)
SQL> SELECT order_date, COUNT(*) AS order_count,
            SUM(amount) AS total_amount
     FROM fgedu_order
     WHERE order_date >= ADD_MONTHS(SYSDATE, -6)
     GROUP BY order_date
     ORDER BY order_date;
-- Growth trend (percentage change against the previous day; NULL for the first day)
SQL> SELECT order_date,
            COUNT(*) AS order_count,
            LAG(COUNT(*)) OVER (ORDER BY order_date) AS prev_count,
            (COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY order_date)) * 100.0 / LAG(COUNT(*)) OVER (ORDER BY order_date) AS growth_rate
     FROM fgedu_order
     WHERE order_date >= ADD_MONTHS(SYSDATE, -6)
     GROUP BY order_date
     ORDER BY order_date;
# 3. Practical example
-- Basic statistics
SQL> SELECT COUNT(*) AS count,
            AVG(amount) AS avg_amount,
            MAX(amount) AS max_amount,
            MIN(amount) AS min_amount,
            STDDEV(amount) AS stddev_amount
     FROM fgedu_order;
# Output
# COUNT  AVG_AMOUNT  MAX_AMOUNT  MIN_AMOUNT  STDDEV_AMOUNT
# -----  ----------  ----------  ----------  -------------
# 10000  100.00      1000.00     10.00       50.00
-- Trend over time
SQL> SELECT order_date, COUNT(*) AS order_count,
            SUM(amount) AS total_amount
     FROM fgedu_order
     WHERE order_date >= ADD_MONTHS(SYSDATE, -6)
     GROUP BY order_date
     ORDER BY order_date;
# Output
# ORDER_DATE  ORDER_COUNT  TOTAL_AMOUNT
# ----------  -----------  ------------
# 2024-01-01  100          10000.00
# 2024-01-02  150          15000.00
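The growth-rate query in the trend analysis divides by the previous day's count, so the first day has no rate and a previous count of zero would be an error. As a minimal sketch of the same day-over-day calculation in Python, the function below returns None in both cases; the function name and the sample counts are hypothetical and only serve as an illustration.

```python
from typing import Optional


def growth_rates(counts: list[int]) -> list[Optional[float]]:
    """Day-over-day growth in percent; None where no valid previous value exists."""
    rates: list[Optional[float]] = []
    prev: Optional[int] = None
    for current in counts:
        if prev is None or prev == 0:
            # Mirrors LAG() returning NULL on the first row and avoids division by zero
            rates.append(None)
        else:
            rates.append((current - prev) * 100.0 / prev)
        prev = current
    return rates


daily_order_counts = [100, 150, 120, 0, 30]  # hypothetical sample data
rates = growth_rates(daily_order_counts)
print(rates)  # [None, 50.0, -20.0, -100.0, None]
```

In SQL, a similar guard can be expressed by wrapping the divisor in NULLIF(..., 0).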
2.1.2 Predictive Analysis
# 1. Time series forecasting
-- Moving average (current row plus the 6 preceding rows)
SQL> SELECT order_date,
            AVG(amount) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg
     FROM fgedu_order
     WHERE order_date >= ADD_MONTHS(SYSDATE, -6)
     ORDER BY order_date;
-- Exponential smoothing (one-step approximation with alpha = 0.3: each value is blended with the previous raw value, not the previous smoothed value)
SQL> SELECT order_date,
            amount,
            0.3 * amount + 0.7 * LAG(amount) OVER (ORDER BY order_date) AS smoothed_value
     FROM fgedu_order
     WHERE order_date >= ADD_MONTHS(SYSDATE, -6)
     ORDER BY order_date;
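The exponential smoothing query above blends each amount with the previous raw amount, which is only a one-step approximation. Full simple exponential smoothing is recursive: each smoothed value depends on the previous smoothed value. The sketch below shows both a trailing moving average and the recursive form in plain Python; the function names and sample amounts are hypothetical.

```python
def moving_average(values, window=7):
    """Trailing mean over the current value and up to window-1 preceding values."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


def exponential_smoothing(values, alpha=0.3):
    """Simple exponential smoothing: s[0] = x[0], s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # seed with the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


amounts = [100.0, 110.0, 120.0, 90.0]  # hypothetical sample data
ma = moving_average(amounts, window=3)
es = exponential_smoothing(amounts, alpha=0.3)
print(ma)
print(es)
```

A smaller alpha gives a smoother series that reacts more slowly to recent changes; a larger alpha tracks the raw data more closely.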
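# 2. Regression analysis
-- Linear regression of total amount on order count across users
-- (the per-user aggregates are computed in a subquery first, because column aliases cannot be referenced in the same SELECT list)
SQL> SELECT REGR_SLOPE(total_amount, order_count) AS slope,
            REGR_INTERCEPT(total_amount, order_count) AS intercept
     FROM (
       SELECT user_id,
              COUNT(*) AS order_count,
              SUM(amount) AS total_amount
       FROM fgedu_order
       GROUP BY user_id
     ) t;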
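REGR_SLOPE and REGR_INTERCEPT fit an ordinary least-squares line, with the dependent variable passed as the first argument. As a rough illustration of what those two numbers mean, the following Python sketch computes the same slope and intercept by hand; the function name and the per-user sample values are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical per-user aggregates: order count (x) and total amount (y)
order_counts = [1, 2, 3, 4]
total_amounts = [110.0, 210.0, 310.0, 410.0]
slope, intercept = linear_fit(order_counts, total_amounts)
print(slope, intercept)  # 100.0 10.0
```

Here each additional order adds 100 to the expected total amount, and the intercept is the fitted total at zero orders.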
# 3. Practical example
-- Moving average
SQL> SELECT order_date,
            AVG(amount) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg
     FROM fgedu_order
     WHERE order_date >= ADD_MONTHS(SYSDATE, -6)
     ORDER BY order_date;
# Output
# ORDER_DATE  MOVING_AVG
# ----------  ----------
# 2024-01-01  100.00
# 2024-01-02  105.00
# 2024-01-03  110.00
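2.2 Machine Learning Algorithms
2.2.1 Supervised Learning
# 1. Classification algorithms
– Logistic regression
  – Binary classification
  – Multi-class classification
  – Probability prediction
– Decision tree
  – Classification trees
  – Regression trees
  – Feature importance
– Random forest
  – Ensemble learning
  – Multiple decision trees
  – Voting mechanism
# 2. Regression algorithms
– Linear regression
  – Simple linear regression
  – Multiple linear regression
  – Regularized regression
– Support vector regression
  – Kernel functions
  – Regression prediction
  – Nonlinear regression
# 3. Practical example
-- User churn prediction
# Features: user behavior data
# Label: churned or not (0/1)
# Algorithm: logistic regression
# Data preparation
SQL> SELECT user_id,
            COUNT(*) AS order_count,
            AVG(amount) AS avg_amount,
            MAX(order_date) AS last_order_date,
            CASE WHEN MAX(order_date) < ADD_MONTHS(SYSDATE, -3) THEN 1 ELSE 0 END AS is_churn
     FROM fgedu_order
     GROUP BY user_id;
# Model training (Python example)
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Prepare the data; df is assumed to hold the result of the query above, with lower-case column names
# Dates cannot be passed to the model directly, so last_order_date is converted to a day count
df['days_since_last_order'] = (pd.Timestamp.now() - pd.to_datetime(df['last_order_date'])).dt.days
# Note: is_churn is derived from last_order_date, so this feature leaks the label; the example only illustrates the API
X = df[['order_count', 'avg_amount', 'days_since_last_order']]
y = df['is_churn']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')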
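2.2.2 Unsupervised Learning
# 1. Clustering algorithms
– K-means clustering
  – Distance-based
  – K clusters
  – Iterative optimization
– Hierarchical clustering
  – Tree structure
  – Bottom-up (agglomerative)
  – Top-down (divisive)
– DBSCAN clustering
  – Density-based clustering
  – Noise points
  – Clusters of arbitrary shape
# 2. Dimensionality reduction algorithms
– PCA (principal component analysis)
  – Linear dimensionality reduction
  – Preserves variance
  – Feature extraction
– t-SNE
  – Nonlinear dimensionality reduction
  – Visualization
  – Manifold learning
# 3. Practical example
-- Customer segmentation
# Features: user behavior data
# Algorithm: K-means clustering
# Data preparation
SQL> SELECT user_id,
            COUNT(*) AS order_count,
            AVG(amount) AS avg_amount,
            MAX(order_date) AS last_order_date
     FROM fgedu_order
     GROUP BY user_id;
# Model training (Python example)
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Prepare the data; df is assumed to hold the result of the query above, with lower-case column names
# Dates cannot be standardized directly, so last_order_date is converted to a day count
df['days_since_last_order'] = (pd.Timestamp.now() - pd.to_datetime(df['last_order_date'])).dt.days
features = ['order_count', 'avg_amount', 'days_since_last_order']
X = df[features]
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train the model
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_scaled)
# Assign cluster labels
labels = kmeans.labels_
df['cluster'] = labels
# Analyze the clusters (mean of each feature per cluster)
print(df.groupby('cluster')[features].mean())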
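To make the "distance-based, K clusters, iterative optimization" description concrete, here is a minimal pure-Python sketch of the K-means loop (Lloyd's algorithm). It is only an illustration: the function name and sample points are hypothetical, and the initial centroids are fixed by hand to keep the run deterministic, whereas scikit-learn's KMeans chooses them with k-means++ by default.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical, already-scaled 2D feature vectors
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Because the assignment step uses Euclidean distance, features on very different scales should be standardized first, as the example above does with StandardScaler.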
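2.3 Model Evaluation
2.3.1 Evaluating Classification Models
# 1. Evaluation metrics
– Accuracy
  – Proportion of predictions that are correct
  – Formula: (TP + TN) / (TP + TN + FP + FN)
– Precision
  – Proportion of predicted positives that are actually positive
  – Formula: TP / (TP + FP)
– Recall
  – Proportion of actual positives that are predicted positive
  – Formula: TP / (TP + FN)
– F1 score
  – Harmonic mean of precision and recall
  – Formula: 2 * (Precision * Recall) / (Precision + Recall)
# 2. Confusion matrix
– True positive (TP): actually positive, predicted positive
– False positive (FP): actually negative, predicted positive
– False negative (FN): actually positive, predicted negative
– True negative (TN): actually negative, predicted negative
# 3. Practical example
-- Model evaluation (Python example)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Compute the evaluation metrics (y_test and y_pred come from the trained classifier)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')
print(f'Confusion Matrix:\n{conf_matrix}')
# Output (illustrative; scikit-learn lays the matrix out as [[TN FP] [FN TP]], and the metrics follow from it)
# Accuracy: 0.85
# Precision: 0.75
# Recall: 0.60
# F1 Score: 0.67
# Confusion Matrix:
# [[700  50]
#  [100 150]]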
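The four formulas above can be checked directly from the confusion-matrix counts. The sketch below does so in plain Python with hypothetical counts (TP=40, FP=10, FN=20, TN=30); the function name is an assumption for illustration, and the zero guards simply return 0.0 when a denominator is empty.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With imbalanced classes such as churn, accuracy alone can look good while recall is poor, which is why the four metrics are usually read together.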
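2.3.2 Evaluating Regression Models
# 1. Evaluation metrics
– Mean squared error (MSE)
  – Mean of the squared differences between predicted and true values
  – Formula: mean((y_true - y_pred) ** 2)
– Root mean squared error (RMSE)
  – Square root of the MSE
  – Formula: sqrt(MSE)
– Mean absolute error (MAE)
  – Mean of the absolute differences between predicted and true values
  – Formula: mean(abs(y_true - y_pred))
– R squared (R²)
  – Proportion of variance explained by the model
  – Formula: 1 - (SS_res / SS_tot)
# 2. Practical example
-- Model evaluation (Python example)
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
# Compute the evaluation metrics (y_test and y_pred come from the trained regressor)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'MSE: {mse:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'MAE: {mae:.2f}')
print(f'R²: {r2:.2f}')
# Output (illustrative)
# MSE: 2500.00
# RMSE: 50.00
# MAE: 40.00
# R²: 0.85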
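As a cross-check of the formulas listed above, the following sketch computes the same four metrics from scratch. The function name and the small set of true and predicted values are hypothetical and chosen so the arithmetic is easy to follow.

```python
import math


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R² computed directly from their definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mae = sum(abs(r) for r in residuals) / n
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, mae, r2


# Hypothetical values for illustration
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 320.0, 380.0]
mse, rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(f"MSE: {mse:.2f}, RMSE: {rmse:.2f}, MAE: {mae:.2f}, R2: {r2:.2f}")
# MSE: 250.00, RMSE: 15.81, MAE: 15.00, R2: 0.98
```

RMSE and MAE are in the same unit as the target, which makes them easier to explain to business users than MSE.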
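Production advice: choose data analysis methods and machine learning algorithms that fit the business requirements and the characteristics of the data. Pay attention to data quality and feature engineering during model training, and establish a thorough model evaluation process to keep models accurate and reliable.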
Part03 - Production Project Implementation
3.1 Implementing Data Analysis
3.1.1 User Behavior Analysis
# 1. User activity analysis
-- Active user count (users with at least one order in the last month)
SQL> SELECT COUNT(*) AS active_user_count
     FROM fgedu_user u
     WHERE EXISTS (
       SELECT 1 FROM fgedu_order o
       WHERE o.user_id = u.user_id
       AND o.order_date >= ADD_MONTHS(SYSDATE, -1)
     );
-- User activity distribution (the level is computed in the subquery so the outer GROUP BY can reference it)
SQL> SELECT activity_level,
            COUNT(*) AS user_count
     FROM (
       SELECT user_id,
              CASE
                WHEN COUNT(*) >= 10 THEN 'High activity'
                WHEN COUNT(*) >= 5 THEN 'Medium activity'
                ELSE 'Low activity'
              END AS activity_level
       FROM fgedu_order
       WHERE order_date >= ADD_MONTHS(SYSDATE, -3)
       GROUP BY user_id
     ) t
     GROUP BY activity_level;
# 2. User value analysis
-- RFM analysis (recency, frequency, monetary)
SQL> SELECT user_id,
            COUNT(*) AS frequency,
            MAX(order_date) AS recency,
            SUM(amount) AS monetary
     FROM fgedu_order
     GROUP BY user_id
     ORDER BY monetary DESC;
-- User value tiers
SQL> SELECT value_level,
            COUNT(*) AS user_count
     FROM (
       SELECT user_id,
              CASE
                WHEN SUM(amount) >= 10000 THEN 'High value'
                WHEN SUM(amount) >= 5000 THEN 'Medium value'
                ELSE 'Low value'
              END AS value_level
       FROM fgedu_order
       GROUP BY user_id
     ) t
     GROUP BY value_level;
# 3. Practical example
-- User activity analysis
SQL> SELECT activity_level,
            COUNT(*) AS user_count
     FROM (
       SELECT user_id,
              CASE
                WHEN COUNT(*) >= 10 THEN 'High activity'
                WHEN COUNT(*) >= 5 THEN 'Medium activity'
                ELSE 'Low activity'
              END AS activity_level
       FROM fgedu_order
       WHERE order_date >= ADD_MONTHS(SYSDATE, -3)
       GROUP BY user_id
     ) t
     GROUP BY activity_level;
# Output
# ACTIVITY_LEVEL   USER_COUNT
# ---------------  ----------
# High activity    200
# Medium activity  300
# Low activity     500
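The RFM and value-tier queries above can also be expressed in a few lines of Python, which is handy when the order rows have already been loaded for modeling. This is a minimal sketch: the function names and sample orders are hypothetical, recency is expressed as days since the last order, and the tier thresholds (10000 and 5000) are the ones used in the SQL.

```python
from datetime import date


def rfm(orders, today):
    """orders: iterable of (user_id, order_date, amount) tuples."""
    summary = {}
    for user_id, order_date, amount in orders:
        s = summary.setdefault(
            user_id, {"last": order_date, "frequency": 0, "monetary": 0.0}
        )
        s["last"] = max(s["last"], order_date)
        s["frequency"] += 1
        s["monetary"] += amount
    return {
        uid: {
            "recency_days": (today - s["last"]).days,
            "frequency": s["frequency"],
            "monetary": s["monetary"],
        }
        for uid, s in summary.items()
    }


def value_level(monetary):
    """Same thresholds as the value-tier query."""
    if monetary >= 10000:
        return "High value"
    if monetary >= 5000:
        return "Medium value"
    return "Low value"


# Hypothetical sample orders
orders = [
    (1, date(2024, 1, 5), 6000.0),
    (1, date(2024, 1, 20), 5000.0),
    (2, date(2024, 1, 10), 800.0),
]
result = rfm(orders, today=date(2024, 2, 1))
for uid, row in result.items():
    print(uid, row, value_level(row["monetary"]))
```

Passing the reference date explicitly, rather than reading the system clock inside the function, keeps the result reproducible.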
3.1.2 Sales Analysis
# 1. Sales trend analysis
-- Daily sales trend
SQL> SELECT order_date, COUNT(*) AS order_count,
            SUM(amount) AS total_amount
     FROM fgedu_order
     WHERE order_date >= ADD_MONTHS(SYSDATE, -6)
     GROUP BY order_date
     ORDER BY order_date;
-- Sales growth rate (percentage change against the previous day)
SQL> SELECT order_date,
            SUM(amount) AS total_amount,
            LAG(SUM(amount)) OVER (ORDER BY order_date) AS prev_amount,
            (SUM(amount) - LAG(SUM(amount)) OVER (ORDER BY order_date)) * 100.0 / LAG(SUM(amount)) OVER (ORDER BY order_date) AS growth_rate
     FROM fgedu_order
     WHERE order_date >= ADD_MONTHS(SYSDATE, -6)
     GROUP BY order_date
     ORDER BY order_date;
# 2. Product sales analysis
-- Product sales ranking (columns are qualified because product_id exists in both tables)
SQL> SELECT p.product_id, p.product_name,
            COUNT(*) AS order_count,
            SUM(o.amount) AS total_amount
     FROM fgedu_order o
     JOIN fgedu_product p ON o.product_id = p.product_id
     GROUP BY p.product_id, p.product_name
     ORDER BY total_amount DESC;
-- Product category analysis
SQL> SELECT p.product_category,
            COUNT(*) AS order_count,
            SUM(o.amount) AS total_amount
     FROM fgedu_order o
     JOIN fgedu_product p ON o.product_id = p.product_id
     GROUP BY p.product_category
     ORDER BY total_amount DESC;
# 3. Practical example
-- Product sales ranking
SQL> SELECT p.product_id, p.product_name,
            COUNT(*) AS order_count,
            SUM(o.amount) AS total_amount
     FROM fgedu_order o
     JOIN fgedu_product p ON o.product_id = p.product_id
     GROUP BY p.product_id, p.product_name
     ORDER BY total_amount DESC;
# Output
# PRODUCT_ID  PRODUCT_NAME  ORDER_COUNT  TOTAL_AMOUNT
# ----------  ------------  -----------  ------------
# 1           Product 1     1000         100000.00
# 2           Product 2     800          80000.00
# 3           Product 3     600          60000.00
3.2 Implementing Machine Learning
3.2.1 User Churn Prediction
# 1. Data preparation
-- Feature extraction
SQL> SELECT user_id,
            COUNT(*) AS order_count,
            AVG(amount) AS avg_amount,
            MAX(order_date) AS last_order_date,
            MIN(order_date) AS first_order_date,
            CASE WHEN MAX(order_date) < ADD_MONTHS(SYSDATE, -3) THEN 1 ELSE 0 END AS is_churn
     FROM fgedu_order
     GROUP BY user_id;
-- Data cleaning
SQL> SELECT user_id,
            order_count,
            avg_amount,
            last_order_date,
            first_order_date,
            is_churn
     FROM (
       SELECT user_id,
              COUNT(*) AS order_count,
              AVG(amount) AS avg_amount,
              MAX(order_date) AS last_order_date,
              MIN(order_date) AS first_order_date,
              CASE WHEN MAX(order_date) < ADD_MONTHS(SYSDATE, -3) THEN 1 ELSE 0 END AS is_churn
       FROM fgedu_order
       GROUP BY user_id
     ) t
     WHERE order_count > 0;
# 2. Model training (Python example)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Read the data; conn is assumed to be an open DB-API connection to the DM database
df = pd.read_sql_query("""
SELECT user_id,
       order_count,
       avg_amount,
       last_order_date,
       first_order_date,
       is_churn
FROM (
  SELECT user_id,
         COUNT(*) AS order_count,
         AVG(amount) AS avg_amount,
         MAX(order_date) AS last_order_date,
         MIN(order_date) AS first_order_date,
         CASE WHEN MAX(order_date) < ADD_MONTHS(SYSDATE, -3) THEN 1 ELSE 0 END AS is_churn
  FROM fgedu_order
  GROUP BY user_id
) t
WHERE order_count > 0
""", conn)
# The database may return upper-case column names, so normalize them first
df.columns = df.columns.str.lower()
# Feature engineering: convert the dates to day counts
now = pd.Timestamp.now()
df['days_since_last_order'] = (now - pd.to_datetime(df['last_order_date'])).dt.days
df['days_since_first_order'] = (now - pd.to_datetime(df['first_order_date'])).dt.days
# Prepare the data
X = df[['order_count', 'avg_amount', 'days_since_last_order', 'days_since_first_order']]
y = df['is_churn']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize (fit on the training set only, then apply to the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train the model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)
# Predict
y_pred = model.predict(X_test_scaled)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')
# Output (illustrative)
# Accuracy: 0.85
# Precision: 0.80
# Recall: 0.75
# F1 Score: 0.77
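One caveat about the example above: is_churn is derived from MAX(order_date), and days_since_last_order is computed from the same column, so the label can be read straight off the feature. A common way to avoid this kind of leakage is to build features only from orders before a cutoff date and to derive the label from what happens after it. The following is a minimal sketch of that idea, not the tutorial's method; the function name, the 90-day horizon, and the sample orders are assumptions for illustration.

```python
from datetime import date, timedelta


def build_churn_rows(orders, cutoff, horizon_days=90):
    """orders: (user_id, order_date, amount) tuples.

    Features use only orders before `cutoff`; a user is labeled as churned
    when no order falls in [cutoff, cutoff + horizon_days).
    """
    horizon_end = cutoff + timedelta(days=horizon_days)
    history = {}
    returned = set()
    for user_id, order_date, amount in orders:
        if order_date < cutoff:
            history.setdefault(user_id, []).append((order_date, amount))
        elif order_date < horizon_end:
            returned.add(user_id)
    rows = []
    for user_id, past in sorted(history.items()):
        dates = [d for d, _ in past]
        amounts = [a for _, a in past]
        rows.append({
            "user_id": user_id,
            "order_count": len(past),
            "avg_amount": sum(amounts) / len(amounts),
            "days_since_last_order": (cutoff - max(dates)).days,
            "is_churn": 0 if user_id in returned else 1,
        })
    return rows


# Hypothetical sample orders
orders = [
    (1, date(2024, 1, 10), 100.0),
    (1, date(2024, 2, 20), 300.0),
    (1, date(2024, 4, 15), 50.0),   # falls after the cutoff: user 1 came back
    (2, date(2024, 3, 1), 80.0),    # user 2 has no order after the cutoff
]
rows = build_churn_rows(orders, cutoff=date(2024, 4, 1))
for row in rows:
    print(row)
```

With this setup the cutoff must lie at least one horizon in the past, so that the label window has fully elapsed before training.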
3.2.2 Customer Segmentation
# 1. Data preparation
-- Feature extraction
SQL> SELECT user_id,
            COUNT(*) AS order_count,
            AVG(amount) AS avg_amount,
            MAX(order_date) AS last_order_date,
            MIN(order_date) AS first_order_date
     FROM fgedu_order
     GROUP BY user_id;
-- Data cleaning
SQL> SELECT user_id,
            order_count,
            avg_amount,
            last_order_date,
            first_order_date
     FROM (
       SELECT user_id,
              COUNT(*) AS order_count,
              AVG(amount) AS avg_amount,
              MAX(order_date) AS last_order_date,
              MIN(order_date) AS first_order_date
       FROM fgedu_order
       GROUP BY user_id
     ) t
     WHERE order_count > 0;
# 2. Model training (Python example)
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Read the data; conn is assumed to be an open DB-API connection to the DM database
df = pd.read_sql_query("""
SELECT user_id,
       order_count,
       avg_amount,
       last_order_date,
       first_order_date
FROM (
  SELECT user_id,
         COUNT(*) AS order_count,
         AVG(amount) AS avg_amount,
         MAX(order_date) AS last_order_date,
         MIN(order_date) AS first_order_date
  FROM fgedu_order
  GROUP BY user_id
) t
WHERE order_count > 0
""", conn)
# The database may return upper-case column names, so normalize them first
df.columns = df.columns.str.lower()
# Feature engineering: convert the dates to day counts
now = pd.Timestamp.now()
df['days_since_last_order'] = (now - pd.to_datetime(df['last_order_date'])).dt.days
df['days_since_first_order'] = (now - pd.to_datetime(df['first_order_date'])).dt.days
# Prepare the data
features = ['order_count', 'avg_amount', 'days_since_last_order', 'days_since_first_order']
X = df[features]
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Search for the best number of clusters
silhouette_scores = []
for n_clusters in range(2, 11):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores.append(score)
    print(f'Clusters: {n_clusters}, Silhouette Score: {score:.3f}')
# Pick the cluster count with the highest silhouette score (index 0 corresponds to 2 clusters)
best_n_clusters = silhouette_scores.index(max(silhouette_scores)) + 2
print(f'Best number of clusters: {best_n_clusters}')
# Train the final model
kmeans = KMeans(n_clusters=best_n_clusters, random_state=42)
labels = kmeans.fit_predict(X_scaled)
df['cluster'] = labels
# Analyze the clusters (mean of each feature per cluster)
print(df.groupby('cluster')[features].mean())
# Output (illustrative; the lines for 5 to 10 clusters are not shown)
# Clusters: 2, Silhouette Score: 0.650
# Clusters: 3, Silhouette Score: 0.680
# Clusters: 4, Silhouette Score: 0.620
# Best number of clusters: 3
#          order_count  avg_amount  days_since_last_order  days_since_first_order
# cluster
# 0              5.234      123.45                  15.23                   45.67
# 1              2.123       78.90                  30.45                   60.78
# 2              8.567      234.56                   5.67                   30.12
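The silhouette score used above to pick the cluster count compares, for each point, its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below computes the mean silhouette in plain Python to show why a good grouping scores close to 1 and a poor one goes negative; the function name and sample points are hypothetical, and singleton clusters are scored 0 by convention.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical one-dimensional points forming two obvious groups
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = mean_silhouette(points, [0, 0, 1, 1])   # groups match the data
bad = mean_silhouette(points, [0, 1, 0, 1])    # groups cut across the data
print(f"good: {good:.3f}, bad: {bad:.3f}")  # good: 0.900, bad: -0.450
```

The computation is quadratic in the number of points, so on large tables it is usually run on a sample.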
3.3 Model Deployment
3.3.1 Saving and Loading Models
# 1. Saving the model (Python example)
import joblib
# Save the model
joblib.dump(model, 'churn_prediction_model.pkl')
# Save the scaler
joblib.dump(scaler, 'scaler.pkl')
print('Model saved successfully')
# 2. Loading the model (Python example)
import joblib
# Load the model
model = joblib.load('churn_prediction_model.pkl')
# Load the scaler
scaler = joblib.load('scaler.pkl')
print('Model loaded successfully')
# 3. Predicting with the model (Python example)
import pandas as pd
# Prepare new data (same columns, in the same order, as during training)
new_data = pd.DataFrame({
    'order_count': [5, 10, 2],
    'avg_amount': [100.0, 200.0, 50.0],
    'days_since_last_order': [10, 5, 30],
    'days_since_first_order': [30, 20, 60]
})
# Standardize with the scaler fitted during training
new_data_scaled = scaler.transform(new_data)
# Predict
predictions = model.predict(new_data_scaled)
probabilities = model.predict_proba(new_data_scaled)
print(f'Predictions: {predictions}')
print(f'Probabilities: {probabilities}')
# Output (illustrative)
# Predictions: [0 0 1]
# Probabilities: [[0.85 0.15]
#  [0.90 0.10]
#  [0.20 0.80]]
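Because the model and the scaler are saved as two separate files, they have to be loaded as a matching pair and fed columns in the training order. One possible safeguard, sketched below, is to store everything in a single bundle together with the feature names and a version tag. The sketch uses the standard-library pickle module with plain dictionaries as stand-ins for the fitted objects; the bundle layout, version tag, file name, and check_features helper are all hypothetical. The same dictionary could equally be written with joblib.dump.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in values; in practice "scaler" and "model" would be the fitted objects
bundle = {
    "version": "2024-01-01",  # hypothetical version tag
    "feature_names": [
        "order_count",
        "avg_amount",
        "days_since_last_order",
        "days_since_first_order",
    ],
    "scaler": {"mean": [5.0, 120.0, 20.0, 45.0], "scale": [2.0, 60.0, 10.0, 15.0]},
    "model": {"coef": [-0.4, -0.1, 1.2, 0.3], "intercept": -0.5},
}


def check_features(loaded_bundle, columns):
    """Fail early when the incoming columns differ from the training columns."""
    if list(columns) != loaded_bundle["feature_names"]:
        raise ValueError(
            f"expected columns {loaded_bundle['feature_names']}, got {list(columns)}"
        )


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "churn_bundle.pkl"
    with path.open("wb") as f:
        pickle.dump(bundle, f)      # one file holds model, scaler and metadata
    with path.open("rb") as f:
        loaded = pickle.load(f)

check_features(loaded, bundle["feature_names"])  # passes silently
print(loaded["version"], loaded["feature_names"])
```

Pickle-based files, including those written by joblib, should only be loaded from trusted sources, since loading can execute arbitrary code.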
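3.3.2 Model Monitoring and Updating
# 1. Model monitoring
– Metrics to monitor
  – Accuracy
  – Precision
  – Recall
  – F1 score
– Monitoring methods
  – Periodic evaluation
  – Real-time monitoring
  – Alerting
# 2. Model updating
– Update triggers
  – Performance degradation
  – Data distribution shift
  – Changing business requirements
– Update methods
  – Incremental training
  – Full retraining
  – Online learning
# 3. Worked example
– Model monitoring (Python example)
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Read the last 7 days of labeled data.
# conn is assumed to be an open DM database connection created earlier;
# model and scaler are the objects loaded in the previous section.
new_data = pd.read_sql_query("""
    SELECT user_id,
           order_count,
           avg_amount,
           last_order_date,
           first_order_date,
           is_churn
    FROM fgedu_order
    WHERE order_date >= ADD_DAYS(SYSDATE, -7)
""", conn)
# Feature engineering: days elapsed since the last and the first order
new_data['days_since_last_order'] = (pd.Timestamp.now() - new_data['last_order_date']).dt.days
new_data['days_since_first_order'] = (pd.Timestamp.now() - new_data['first_order_date']).dt.days
# Select features and label
X_new = new_data[['order_count', 'avg_amount', 'days_since_last_order', 'days_since_first_order']]
y_new = new_data['is_churn']
# Standardize with the scaler fitted at training time (transform only, never refit)
X_new_scaled = scaler.transform(X_new)
# Predict
y_pred = model.predict(X_new_scaled)
# Evaluate against the true labels
accuracy = accuracy_score(y_new, y_pred)
precision = precision_score(y_new, y_pred)
recall = recall_score(y_new, y_pred)
f1 = f1_score(y_new, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')
# Output
# Accuracy: 0.82
# Precision: 0.78
# Recall: 0.72
# F1 Score: 0.75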
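The "alerting" item under monitoring methods can be illustrated with a small threshold check on the four metrics. This is only a sketch: the function name `check_metrics` and the threshold values are hypothetical and would need to be set from the business requirements, and a real system would send the alerts to a notification channel rather than print them.

```python
def check_metrics(metrics, thresholds):
    """Return one alert message for every metric below its minimum."""
    alerts = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            alerts.append(f'{name}: metric missing')
        elif value < minimum:
            alerts.append(f'{name}: {value:.2f} is below threshold {minimum:.2f}')
    return alerts


# Hypothetical minimum acceptable values (assumption, not from the source).
thresholds = {'accuracy': 0.80, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}

# The metric values printed by the monitoring example.
metrics = {'accuracy': 0.82, 'precision': 0.78, 'recall': 0.72, 'f1': 0.75}

alerts = check_metrics(metrics, thresholds)
for message in alerts:
    print('ALERT', message)
# prints: ALERT recall: 0.72 is below threshold 0.75
```

Running a check like this after every periodic evaluation turns the metric report into a concrete trigger for the model update step.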
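Fengge's tip: Data analysis and machine learning are ongoing work. Models need to be adjusted and optimized continually as business requirements and data change, and a sound monitoring system is key to keeping a model running stably.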
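Changes in the data can be detected before labels are available by comparing the distribution of a feature at training time with its recent distribution. The sketch below uses the Population Stability Index (PSI), which is one common choice and not something the source prescribes. The function name, the synthetic samples, and the 0.1 / 0.25 cut-offs (a widely quoted rule of thumb) are all assumptions.

```python
import numpy as np


def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample and a recent sample of one feature."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from quantiles of the reference sample.
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    # Only interior edges are used, so out-of-range values fall in the end bins.
    interior = edges[1:-1]
    n_bins = len(interior) + 1
    exp_counts = np.bincount(np.searchsorted(interior, expected, side='right'),
                             minlength=n_bins)
    act_counts = np.bincount(np.searchsorted(interior, actual, side='right'),
                             minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty.
    eps = 1e-6
    exp_pct = np.clip(exp_counts / exp_counts.sum(), eps, None)
    act_pct = np.clip(act_counts / act_counts.sum(), eps, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))


# Synthetic samples of days_since_last_order (assumption: not real data).
rng = np.random.default_rng(0)
reference = rng.normal(30, 10, 5000)   # distribution at training time
same = rng.normal(30, 10, 5000)        # recent data, unchanged
shifted = rng.normal(45, 10, 5000)     # recent data, customers ordering less often

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(f'PSI (unchanged): {psi_same:.3f}')
print(f'PSI (shifted):   {psi_shifted:.3f}')
```

A PSI well above the chosen cut-off on an important feature is a reasonable signal to re-evaluate the model and consider retraining.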
Part04-Production Cases and Hands-on Walkthroughs
4.1 User Behavior Analysis Case
4.1.1 Case Description
An e-commerce company needs to analyze user behavior to understand user activity and user value, in support of operational decisions.
4.1.2 Analysis Steps
# 1. Data preparation
-- User data
SQL> SELECT user_id, user_name, user_email, user_status,
            create_time, update_time
     FROM fgedu_user;
-- Order data
SQL> SELECT order_id, user_id, product_id, amount, order_date
     FROM fgedu_order;
# 2. User activity analysis
-- Active users: users with at least one order in the last month
SQL> SELECT COUNT(*) AS active_user_count
     FROM fgedu_user u
     WHERE EXISTS (
         SELECT 1 FROM fgedu_order o
         WHERE o.user_id = u.user_id
           AND o.order_date >= ADD_MONTHS(SYSDATE, -1)
     );
-- Activity distribution by order count over the last 3 months
SQL> SELECT
         CASE
             WHEN order_count >= 10 THEN 'High activity'
             WHEN order_count >= 5 THEN 'Medium activity'
             ELSE 'Low activity'
         END AS activity_level,
         COUNT(*) AS user_count
     FROM (
         SELECT user_id, COUNT(*) AS order_count
         FROM fgedu_order
         WHERE order_date >= ADD_MONTHS(SYSDATE, -3)
         GROUP BY user_id
     ) t
     GROUP BY activity_level;
# 3. User value analysis
-- RFM analysis (recency, frequency, monetary)
SQL> SELECT user_id,
            COUNT(*) AS frequency,
            MAX(order_date) AS recency,
            SUM(amount) AS monetary
     FROM fgedu_order
     GROUP BY user_id
     ORDER BY monetary DESC;
-- User value tiers by total spend
SQL> SELECT
         CASE
             WHEN monetary >= 10000 THEN 'High value'
             WHEN monetary >= 5000 THEN 'Medium value'
             ELSE 'Low value'
         END AS value_level,
         COUNT(*) AS user_count
     FROM (
         SELECT user_id, SUM(amount) AS monetary
         FROM fgedu_order
         GROUP BY user_id
     ) t
     GROUP BY value_level;
# 4. Analysis results
– Active users: 800
– High-activity users: 200
– Medium-activity users: 300
– Low-activity users: 500
– High-value users: 100
– Medium-value users: 300
– Low-value users: 600
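The RFM query and the value tiers above can also be reproduced in pandas once the order rows are loaded, which is convenient when the result feeds directly into a model. The following is a minimal sketch on a tiny hand-made order table. The rows and the `snapshot` date are hypothetical, while the 5000 and 10000 thresholds are the ones used in the SQL.

```python
import pandas as pd

# Hand-made order rows (assumption: stand-in for fgedu_order).
orders = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    'amount': [3000.0, 4000.0, 5000.0, 2500.0, 2500.0,
               800.0, 1000.0, 1000.0, 1000.0, 1000.0],
    'order_date': pd.to_datetime([
        '2023-11-05', '2023-12-20', '2024-01-25',
        '2023-12-01', '2024-01-01',
        '2023-12-02',
        '2024-01-03', '2024-01-10', '2024-01-20', '2024-01-30',
    ]),
})

# Fixed reference date so the result is reproducible (hypothetical value).
snapshot = pd.Timestamp('2024-01-31')

# One row per user: days since last order, number of orders, total spend.
rfm = orders.groupby('user_id').agg(
    recency_days=('order_date', lambda s: (snapshot - s.max()).days),
    frequency=('order_date', 'count'),
    monetary=('amount', 'sum'),
)

# Same tiers as the SQL CASE expression: >= 10000 high, >= 5000 medium, else low.
rfm['value_level'] = pd.cut(
    rfm['monetary'],
    bins=[float('-inf'), 5000, 10000, float('inf')],
    labels=['low', 'medium', 'high'],
    right=False,  # intervals are [lower, upper), matching the >= comparisons
)

print(rfm)
print(rfm['value_level'].value_counts())
```

Expressing recency as a number of days, rather than as the raw last order date returned by the SQL, makes the three RFM columns directly usable as numeric features.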
4.2 Sales Forecasting Case
4.2.1 Case Description
An e-commerce company needs to forecast future sales trends to support inventory management and marketing decisions.
4.2.2 Forecasting Steps
# 1. Data preparation
-- Historical sales data for the last 12 months
SQL> SELECT order_date, COUNT(*) AS order_count,
            SUM(amount) AS total_amount
     FROM fgedu_order
     WHERE order_date >= ADD_MONTHS(SYSDATE, -12)
     GROUP BY order_date
     ORDER BY order_date;
# 2. Feature engineering
-- Calendar features
SQL> SELECT order_date,
            COUNT(*) AS order_count,
            SUM(amount) AS total_amount,
            EXTRACT(YEAR FROM order_date) AS year,
            EXTRACT(MONTH FROM order_date) AS month,
            EXTRACT(DAY FROM order_date) AS day,
            EXTRACT(WEEKDAY FROM order_date) AS weekday
     FROM fgedu_order
     WHERE order_date >= ADD_MONTHS(SYSDATE, -12)
     GROUP BY order_date
     ORDER BY order_date;
# 3. Model training (Python example)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
# Read the data (conn is assumed to be an open DM database connection)
df = pd.read_sql_query("""
    SELECT order_date,
           COUNT(*) AS order_count,
           SUM(amount) AS total_amount,
           EXTRACT(YEAR FROM order_date) AS year,
           EXTRACT(MONTH FROM order_date) AS month,
           EXTRACT(DAY FROM order_date) AS day,
           EXTRACT(WEEKDAY FROM order_date) AS weekday
    FROM fgedu_order
    WHERE order_date >= ADD_MONTHS(SYSDATE, -12)
    GROUP BY order_date
""", conn)
# Select features and target
X = df[['year', 'month', 'day', 'weekday']]
y = df['total_amount']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'MSE: {mse:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'MAE: {mae:.2f}')
print(f'R²: {r2:.2f}')
# Output
# MSE: 1000000.00
# RMSE: 1000.00
# MAE: 800.00
# R²: 0.85
# 4. Forecasting future sales
import pandas as pd
from datetime import datetime, timedelta
# Generate the next 30 dates, starting from today
base_date = datetime.now()
future_dates = [base_date + timedelta(days=i) for i in range(30)]
# Build the feature frame with the same columns used in training.
# Note: Python's weekday() numbers Monday as 0; confirm that this matches
# the numbering returned by EXTRACT(WEEKDAY ...) in the training query.
future_data = pd.DataFrame({
    'year': [d.year for d in future_dates],
    'month': [d.month for d in future_dates],
    'day': [d.day for d in future_dates],
    'weekday': [d.weekday() for d in future_dates]
})
# Predict
future_predictions = model.predict(future_data)
# Print the forecast
for date, prediction in zip(future_dates, future_predictions):
    print(f'{date:%Y-%m-%d}: {prediction:.2f}')
# Output
# 2024-01-10: 12000.00
# 2024-01-11: 12500.00
# 2024-01-12: 13000.00
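The training step above evaluates the model on a random 20% sample of days, so the test days are interleaved with the training days. For a forecasting task, a time-ordered evaluation is usually closer to how the model will actually be used. The sketch below is one way to do that under assumptions. The daily series is synthetic, the 60-day hold-out is an arbitrary choice, `year` is dropped because the synthetic series covers a single year, and the calendar features are derived in pandas so that training and prediction share one weekday convention.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic daily sales with a weekend uplift (assumption: not real data).
rng = np.random.default_rng(42)
dates = pd.date_range('2023-01-01', periods=365, freq='D')
weekend_uplift = np.where(dates.weekday >= 5, 3000.0, 0.0)
df = pd.DataFrame({
    'order_date': dates,
    'total_amount': 10000.0 + weekend_uplift + rng.normal(0, 300, len(dates)),
})

# Calendar features derived in pandas (Monday=0 everywhere).
df['month'] = df['order_date'].dt.month
df['day'] = df['order_date'].dt.day
df['weekday'] = df['order_date'].dt.weekday
features = ['month', 'day', 'weekday']

# Chronological split: train on the past, test on the most recent 60 days.
train, test = df.iloc[:-60], df.iloc[-60:]

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(train[features], train['total_amount'])
model_mae = mean_absolute_error(test['total_amount'],
                                model.predict(test[features]))

# Naive baseline: repeat the value observed 7 days earlier.
baseline_pred = df['total_amount'].shift(7).iloc[-60:]
baseline_mae = mean_absolute_error(test['total_amount'], baseline_pred)

print(f'Model MAE:    {model_mae:.2f}')
print(f'Baseline MAE: {baseline_mae:.2f}')
```

Comparing against a naive baseline gives the error figures some context: a model is only worth deploying if it clearly improves on repeating last week's value.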
4.3 Customer Segmentation Case
4.3.1 Case Description
An e-commerce company needs to segment its customers and understand the characteristics of each group, in support of targeted marketing.
4.3.2 Segmentation Steps
# 1. Data preparation
-- User behavior data
SQL> SELECT user_id,
            COUNT(*) AS order_count,
            AVG(amount) AS avg_amount,
            MAX(order_date) AS last_order_date,
            MIN(order_date) AS first_order_date
     FROM fgedu_order
     GROUP BY user_id;
# 2. Feature engineering
-- Compute time-based features
SQL> SELECT user_id,
            order_count,
            avg_amount,
            last_order_date,
            first_order_date,
            DATEDIFF(DAY, last_order_date, SYSDATE) AS days_since_last_order,
            DATEDIFF(DAY, first_order_date, SYSDATE) AS days_since_first_order
     FROM (
         SELECT user_id,
                COUNT(*) AS order_count,
                AVG(amount) AS avg_amount,
                MAX(order_date) AS last_order_date,
                MIN(order_date) AS first_order_date
         FROM fgedu_order
         GROUP BY user_id
     ) t;
# 3. Model training (Python example)
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Read the data (conn is assumed to be an open DM database connection)
df = pd.read_sql_query("""
    SELECT user_id,
           order_count,
           avg_amount,
           last_order_date,
           first_order_date,
           DATEDIFF(DAY, last_order_date, SYSDATE) AS days_since_last_order,
           DATEDIFF(DAY, first_order_date, SYSDATE) AS days_since_first_order
    FROM (
        SELECT user_id,
               COUNT(*) AS order_count,
               AVG(amount) AS avg_amount,
               MAX(order_date) AS last_order_date,
               MIN(order_date) AS first_order_date
        FROM fgedu_order
        GROUP BY user_id
    ) t
""", conn)
# Select the clustering features
feature_cols = ['order_count', 'avg_amount', 'days_since_last_order', 'days_since_first_order']
X = df[feature_cols]
# Standardize so that no single feature dominates the distance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Search for the best number of clusters by silhouette score
silhouette_scores = []
for n_clusters in range(2, 11):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores.append(score)
    print(f'Clusters: {n_clusters}, Silhouette Score: {score:.3f}')
# Pick the cluster count with the highest score (index 0 corresponds to 2 clusters)
best_n_clusters = silhouette_scores.index(max(silhouette_scores)) + 2
print(f'Best number of clusters: {best_n_clusters}')
# Train the final model
kmeans = KMeans(n_clusters=best_n_clusters, random_state=42)
labels = kmeans.fit_predict(X_scaled)
df['cluster'] = labels
# Profile each cluster by its feature means (feature columns only,
# so user_id and the date columns are not averaged)
print(df.groupby('cluster')[feature_cols].mean())
# Output
# Clusters: 2, Silhouette Score: 0.650
# Clusters: 3, Silhouette Score: 0.680
# Clusters: 4, Silhouette Score: 0.620
# Best number of clusters: 3
#          order_count  avg_amount  days_since_last_order  days_since_first_order
# cluster
# 0              5.234      123.45                  15.23                   45.67
# 1              2.123       78.90                  30.45                   60.78
# 2              8.567      234.56                   5.67                   30.12
# 4. Customer group analysis
– Group 0: moderately active, medium-value customers
– Group 1: low-activity, low-value customers
– Group 2: highly active, high-value customers
# 5. Marketing strategy
– Group 0: offer coupons to increase purchase frequency
– Group 1: send promotions to raise activity
– Group 2: provide VIP service to strengthen loyalty
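Because K-means is fitted on standardized features, its centroids are in standardized units and are hard to read as customer profiles. A possible way to describe each group in business terms, sketched here on synthetic data, is to map the centroids back with `scaler.inverse_transform`. The three synthetic groups below are loosely modeled on the cluster means shown above, and they are an assumption, not real customer data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ['order_count', 'avg_amount',
            'days_since_last_order', 'days_since_first_order']

# Three synthetic customer groups of 50 users each (assumption: not real data).
rng = np.random.default_rng(42)
group_means = np.array([[5, 120, 15, 45],
                        [2, 80, 30, 60],
                        [9, 230, 5, 30]], dtype=float)
X = pd.DataFrame(
    np.vstack([m + rng.normal(0, [0.5, 5, 1, 2], size=(50, 4))
               for m in group_means]),
    columns=FEATURES,
)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Convert centroids from standardized units back to the original units.
profile = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_),
                       columns=FEATURES)
profile['size'] = np.bincount(kmeans.labels_, minlength=3)

print(profile.round(2))
```

The cluster numbers assigned by K-means are arbitrary and can change between runs, so labels such as "high value" should be attached by inspecting the profile rather than by relying on a fixed cluster index.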
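Production recommendation: After a data analysis or machine learning project is completed, test it thoroughly to confirm that the model is accurate and reliable. Put a sound monitoring system in place so that problems are detected and resolved promptly, and update the model regularly to maintain its performance.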
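Part05-Fengge's Lessons Learned
5.1 Best Practices for Data Analysis and Machine Learning
Best practices for data analysis and machine learning with the DM database:
- Understand the business: before starting any analysis, understand the business requirements and context
- Ensure data quality: cover data cleaning, data validation, and data monitoring
- Feature engineering: invest in effective feature engineering and extract meaningful features
- Model selection: choose a model that fits the business requirements and the characteristics of the data
- Model evaluation: establish a sound evaluation process to verify model performance
- Model monitoring: set up monitoring so that performance degradation is detected early
- Model updating: update the model regularly to maintain performance
- Documentation: record the analysis process and results to ease later maintenance
- Teamwork: collaborate with the team to complete the analysis
- Continuous improvement: keep improving the model as business requirements and data change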
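5.2 Common Problems and Solutions
# 1. Data quality problems
– Symptom: poor data quality distorts the analysis results
– Cause: low-quality source data, insufficient data cleaning
– Solution: strengthen data cleaning, set up data quality monitoring
# 2. Feature engineering problems
– Symptom: poorly chosen features hurt model performance
– Cause: insufficient understanding of the features, unsuitable feature selection method
– Solution: study the features in depth, apply a feature selection method
# 3. Model performance problems
– Symptom: the model performs poorly and does not meet business requirements
– Cause: unsuitable model choice, insufficient parameter tuning
– Solution: try different models, tune the model parameters
# 4. Overfitting
– Symptom: good performance on the training set, poor performance on the test set
– Cause: model complexity too high, too little training data
– Solution: simplify the model, add training data, use regularization
# 5. Model deployment problems
– Symptom: the model is hard to deploy and cannot be put into use
– Cause: high model complexity, complicated environment setup
– Solution: simplify the model, use containerized deployment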
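The overfitting symptom in item 4 (good on the training set, poor on the test set) can be measured directly as the gap between training and validation scores under cross-validation. The sketch below compares an unconstrained decision tree with a depth-limited one on synthetic data. The dataset, the choice of a decision tree, and `max_depth=3` are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data with some label noise (assumption: not real data).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)


def train_validation_gap(model):
    """Return mean training score, mean validation score, and their gap."""
    scores = cross_validate(model, X, y, cv=5, return_train_score=True)
    train_score = scores['train_score'].mean()
    valid_score = scores['test_score'].mean()
    return train_score, valid_score, train_score - valid_score


# An unconstrained tree can memorize the training folds.
deep = train_validation_gap(DecisionTreeClassifier(random_state=42))
# Limiting the depth is one simple way to reduce model complexity.
shallow = train_validation_gap(DecisionTreeClassifier(max_depth=3, random_state=42))

for name, (train_score, valid_score, gap) in [('deep', deep), ('shallow', shallow)]:
    print(f'{name:8s} train={train_score:.2f} valid={valid_score:.2f} gap={gap:.2f}')
```

A large gap points to overfitting, while low scores on both sides point instead to the model performance problem described in item 3.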
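5.3 Data Analysis and Machine Learning Checklist
Checklist for data analysis and machine learning with the DM database:
- Business understanding: are the business requirements clear and the business context understood?
- Data quality: is data quality assured and has the data been cleaned sufficiently?
- Feature engineering: are the selected features reasonable and has feature extraction been thorough?
- Model selection: is the chosen model suitable and the algorithm appropriate?
- Model evaluation: are the evaluation metrics reasonable and the results reliable?
- Model monitoring: is a monitoring system in place and are its metrics complete?
- Model updating: is there an update strategy and is the update frequency reasonable?
- Documentation: is the analysis process recorded and the documentation complete?
- Teamwork: is collaboration smooth and are responsibilities clearly assigned?
- Continuous improvement: is there an improvement plan and are the measures being carried out?
Continuous improvement: Data analysis and machine learning are ongoing work that must be adjusted and optimized as business requirements and data change. A sound monitoring system is key to keeping a model running stably, and regular updates keep its performance from degrading.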
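This article was compiled and published by 风哥教程 (Fengge Tutorials) for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html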
