IT Tutorial FG308: AI Model Monitoring and Management

1. AI Model Monitoring Overview

AI model monitoring is a key part of keeping models running reliably in production.

# Check model service status
# kubectl get pods -n ai-serving
NAME READY STATUS RESTARTS AGE
model-server-xyz12-abc34 1/1 Running 0 7d
model-server-xyz12-def56 1/1 Running 0 7d
model-server-xyz12-ghi78 1/1 Running 0 7d
tensorflow-serving-0 1/1 Running 0 30d
triton-server-0 1/1 Running 0 15d

# View model service details
# kubectl describe deployment model-server -n ai-serving
Name: model-server
Namespace: ai-serving
CreationTimestamp: Mon, 01 Jan 2026 00:00:00 +0800
Labels: app=model-server
environment=production
Annotations: deployment.kubernetes.io/revision: 5
Selector: app=model-server
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=model-server
Containers:
model-server:
Image: fgedu/model-server:v1.2.0
Ports: 8501/TCP, 8500/TCP
Liveness: http-get http://:8501/v1/models/model delay=30s timeout=5s period=10s
Readiness: http-get http://:8501/v1/models/model delay=10s timeout=3s period=5s

# View the model inference service configuration
# cat /opt/ai-serving/config/model-config.conf
model_config_list {
  config {
    name: "fgedu_classifier"
    base_path: "/models/fgedu_classifier/1"
    model_platform: "tensorflow"
  }
  config {
    name: "fgedu_nlp_model"
    base_path: "/models/fgedu_nlp_model/2"
    model_platform: "tensorflow"
  }
  config {
    name: "fgedu_cv_model"
    base_path: "/models/fgedu_cv_model/3"
    model_platform: "pytorch"
  }
}

# View the model storage layout
# ls -la /models/
total 12
drwxr-xr-x 3 root root 4096 Apr 3 10:00 fgedu_classifier
drwxr-xr-x 3 root root 4096 Apr 3 10:00 fgedu_nlp_model
drwxr-xr-x 3 root root 4096 Apr 3 10:00 fgedu_cv_model

# ls -la /models/fgedu_classifier/
total 12
drwxr-xr-x 2 root root 4096 Mar 1 10:00 1
drwxr-xr-x 2 root root 4096 Mar 15 10:00 2
drwxr-xr-x 2 root root 4096 Apr 1 10:00 3

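Fengge's production recommendation: run multiple replicas of the model service for high availability, configure health checks to keep the service stable, and use model version management so that a bad release can be rolled back quickly.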

2. Model Performance Metrics Monitoring

Monitoring performance metrics is the main way to assess model quality in production.

# Check model inference latency
# curl -s http://fgedudb:8501/v1/models/fgedu_classifier/metrics | grep -E "latency|duration"
# TYPE tensorflow_serving_request_latency_seconds histogram
tensorflow_serving_request_latency_seconds_bucket{model="fgedu_classifier",le="0.001"} 1523
tensorflow_serving_request_latency_seconds_bucket{model="fgedu_classifier",le="0.005"} 8523
tensorflow_serving_request_latency_seconds_bucket{model="fgedu_classifier",le="0.01"} 12345
tensorflow_serving_request_latency_seconds_bucket{model="fgedu_classifier",le="0.025"} 18923
tensorflow_serving_request_latency_seconds_bucket{model="fgedu_classifier",le="0.05"} 21567
tensorflow_serving_request_latency_seconds_bucket{model="fgedu_classifier",le="0.1"} 22345
tensorflow_serving_request_latency_seconds_bucket{model="fgedu_classifier",le="0.25"} 22789
tensorflow_serving_request_latency_seconds_bucket{model="fgedu_classifier",le="0.5"} 22890
tensorflow_serving_request_latency_seconds_bucket{model="fgedu_classifier",le="1"} 22900
tensorflow_serving_request_latency_seconds_bucket{model="fgedu_classifier",le="+Inf"} 22900
tensorflow_serving_request_latency_seconds_sum{model="fgedu_classifier"} 456.78
tensorflow_serving_request_latency_seconds_count{model="fgedu_classifier"} 22900

# Compute the average latency in seconds (sum / count)
# echo "scale=4; 456.78 / 22900" | bc
.0199
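
The average hides the tail of the latency distribution. As a rough illustration, the sketch below estimates percentiles from the cumulative bucket counts shown above by interpolating linearly inside the bucket that contains the target rank, which is similar in spirit to Prometheus's `histogram_quantile`. The function name `estimate_quantile` is hypothetical and is not part of the monitoring scripts used in this tutorial.

```python
import math

# Cumulative bucket counts copied from the metrics output above:
# (upper bound in seconds, cumulative request count).
BUCKETS = [
    (0.001, 1523), (0.005, 8523), (0.01, 12345), (0.025, 18923),
    (0.05, 21567), (0.1, 22345), (0.25, 22789), (0.5, 22890),
    (1.0, 22900), (math.inf, 22900),
]


def estimate_quantile(q, buckets):
    """Estimate the q-quantile by linear interpolation inside one bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the open-ended bucket; report its lower edge.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


p50 = estimate_quantile(0.50, BUCKETS)
p99 = estimate_quantile(0.99, BUCKETS)
print(f"p50 ~ {p50 * 1000:.1f} ms, p99 ~ {p99 * 1000:.1f} ms")
```

For these counts the estimate lands near 8.8 ms for p50 and around 210 ms for p99, roughly ten times the 19.9 ms mean, which is why latency alerts are usually defined on a high percentile rather than on the average.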

# Check model throughput
# curl -s http://fgedudb:8501/v1/models/fgedu_classifier/metrics | grep -E "request_count|throughput"
# TYPE tensorflow_serving_request_count counter
tensorflow_serving_request_count{model="fgedu_classifier",status="success"} 22890
tensorflow_serving_request_count{model="fgedu_classifier",status="error"} 10
# TYPE tensorflow_serving_batching_utilization gauge
tensorflow_serving_batching_utilization{model="fgedu_classifier"} 0.85

# Check model accuracy metrics
# python3 /opt/ai-monitoring/check_accuracy.py --model fgedu_classifier --window 1h
Model: fgedu_classifier
Time Window: 1 hour
Total Predictions: 15234
Correct Predictions: 14567
Accuracy: 95.62%
Precision: 94.89%
Recall: 96.15%
F1-Score: 95.51%

# View the distribution of model predictions
# python3 /opt/ai-monitoring/prediction_distribution.py --model fgedu_classifier
Prediction Distribution (Last 24h):
Class 0: 45.2% (68,532 predictions)
Class 1: 32.8% (49,712 predictions)
Class 2: 15.3% (23,178 predictions)
Class 3: 4.5% (6,822 predictions)
Class 4: 2.2% (3,336 predictions)

# Query model metrics through Prometheus
# curl -s 'http://prometheus:9090/api/v1/query?query=tensorflow_serving_request_latency_seconds_sum' | jq .
{
  "status": "success",
  "data": {
    "result": [
      {
        "metric": {
          "model": "fgedu_classifier",
          "instance": "model-server-0:8501"
        },
        "value": [1712127600, "456.78"]
      }
    ]
  }
}

# Check model QPS
# (-G with --data-urlencode stops curl from treating [5m] as a URL glob range)
# curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=rate(tensorflow_serving_request_count[5m])' | jq .
{
  "status": "success",
  "data": {
    "result": [
      {
        "metric": {
          "model": "fgedu_classifier",
          "status": "success"
        },
        "value": [1712127600, "152.34"]
      }
    ]
  }
}

# Check the error rate (failed requests per second)
# curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=rate(tensorflow_serving_request_count{status="error"}[5m])' | jq .
{
  "status": "success",
  "data": {
    "result": [
      {
        "metric": {
          "model": "fgedu_classifier",
          "status": "error"
        },
        "value": [1712127600, "0.05"]
      }
    ]
  }
}
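
The two queries above return requests per second, not a fraction of traffic. A minimal sketch of how the values relate, assuming two counter samples per series: `per_second_rate` is a simplified stand-in for PromQL's `rate()` (it ignores extrapolation at window edges), and `error_ratio` converts the success and error rates into the share of failed requests. Both function names are hypothetical.

```python
def per_second_rate(first, last):
    """Average per-second increase between two (timestamp, counter) samples."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if v1 < v0:
        # Counter reset: assume the counter restarted from zero.
        return v1 / (t1 - t0)
    return (v1 - v0) / (t1 - t0)


def error_ratio(success_rate, error_rate):
    """Fraction of requests that failed, given the two per-second rates."""
    total = success_rate + error_rate
    return error_rate / total if total > 0 else 0.0


# Values taken from the two Prometheus responses above.
ratio = error_ratio(152.34, 0.05)
print(f"error ratio: {ratio:.4%}")
```

A ratio of this kind is what a threshold such as `error_rate_threshold: 0.01` is normally compared against; comparing the raw per-second error rate to 0.01 would make the alert depend on traffic volume.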

3. Data Drift Detection

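Data drift means that the distribution of production data has shifted away from the distribution of the training data, which can degrade model performance.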

# Detect feature drift
# python3 /opt/ai-monitoring/detect_drift.py --model fgedu_classifier --method ks_test
Feature Drift Detection Report
===============================
Model: fgedu_classifier
Baseline Period: 2026-01-01 to 2026-02-01
Current Period: 2026-04-01 to 2026-04-03

Feature Analysis:
-----------------
feature_age:
KS Statistic: 0.0234
P-value: 0.4521
Drift Status: No Drift

feature_income:
KS Statistic: 0.0892
P-value: 0.0012
Drift Status: DRIFT DETECTED
Baseline Mean: 45678.32
Current Mean: 52341.67
Change: +14.6%

feature_credit_score:
KS Statistic: 0.0156
P-value: 0.7823
Drift Status: No Drift

feature_transaction_amount:
KS Statistic: 0.1234
P-value: 0.0001
Drift Status: DRIFT DETECTED
Baseline Mean: 1234.56
Current Mean: 1876.23
Change: +51.9%
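
The KS statistic in the report above is the largest vertical gap between the empirical CDFs of the baseline and current samples. The following is a minimal standard-library sketch of that statistic only; the function name `ks_statistic` is hypothetical, and the p-value is omitted (in practice a library routine such as `scipy.stats.ks_2samp` would supply it).

```python
import bisect


def ks_statistic(baseline, current):
    """Two-sample KS statistic: maximum gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in set(a) | set(b):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
print(same, shifted, disjoint)
```

The statistic ranges from 0 (identical samples) to 1 (no overlap). Small values such as 0.0892 can still be significant when the samples are large, which is why the report pairs each statistic with a p-value.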

# Detect drift with PSI
# python3 /opt/ai-monitoring/calculate_psi.py --model fgedu_classifier
Population Stability Index (PSI) Report
========================================
Model: fgedu_classifier
Analysis Date: 2026-04-03

Feature PSI Values:
-------------------
feature_age: PSI = 0.05 (Stable)
feature_income: PSI = 0.18 (Moderate Change)
feature_credit_score: PSI = 0.03 (Stable)
feature_transaction: PSI = 0.35 (Significant Change)
feature_location: PSI = 0.08 (Stable)
feature_device: PSI = 0.12 (Moderate Change)

PSI Interpretation:
- PSI < 0.1: No significant change
- PSI 0.1-0.25: Moderate change, monitor closely
- PSI > 0.25: Significant change, action required

Recommendations:
1. Retrain model with recent data for features with PSI > 0.25
2. Monitor features with PSI between 0.1-0.25
3. Investigate root cause of distribution changes
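
PSI compares the share of records that fall into each bin in the baseline period with the share in the current period, summing (current - baseline) * ln(current / baseline) over the bins. The sketch below assumes the data has already been binned into proportions; the function names, the `eps` floor for empty bins, and the example proportions are illustrative assumptions, while the thresholds follow the interpretation table above.

```python
import math


def psi(baseline_fracs, current_fracs, eps=1e-6):
    """Population Stability Index over pre-binned proportions."""
    if len(baseline_fracs) != len(current_fracs):
        raise ValueError("both distributions must use the same bins")
    total = 0.0
    for b, c in zip(baseline_fracs, current_fracs):
        # Floor empty bins so the logarithm stays defined.
        b, c = max(b, eps), max(c, eps)
        total += (c - b) * math.log(c / b)
    return total


def interpret(value):
    """Map a PSI value to the labels used in the report above."""
    if value < 0.1:
        return "Stable"
    if value <= 0.25:
        return "Moderate Change"
    return "Significant Change"


baseline = [0.25, 0.25, 0.25, 0.25]   # illustrative quartile bins
current = [0.10, 0.20, 0.30, 0.40]    # illustrative shifted distribution
value = psi(baseline, current)
print(f"PSI = {value:.3f} ({interpret(value)})")
```

Binning is the main design decision: quantile bins computed on the baseline period keep every baseline proportion equal, so the index reacts only to changes in the current data.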

# Detect label drift
# python3 /opt/ai-monitoring/label_drift.py --model fgedu_classifier
Label Distribution Comparison
==============================
Baseline Period vs Current Period

Class Distribution:
-------------------
Baseline Current Change
Class 0 45.2% 42.8% -2.4%
Class 1 32.8% 35.1% +2.3%
Class 2 15.3% 14.2% -1.1%
Class 3 4.5% 5.2% +0.7%
Class 4 2.2% 2.7% +0.5%

Chi-Square Test:
----------------
Chi-Square Statistic: 12.34
P-value: 0.015
Result: Significant label drift detected

Action Required:
- Investigate business logic changes
- Consider model retraining
- Update monitoring thresholds
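
One way to obtain a chi-square statistic for label drift is a goodness-of-fit test of the current class counts against the baseline proportions. The sketch below uses that variant as an assumption; the report does not say which form of the test `label_drift.py` runs. The sample size of 10,000 is hypothetical, and because the statistic grows with sample size, the result is not expected to reproduce the 12.34 shown above.

```python
def chi_square_statistic(baseline_fracs, current_counts):
    """Goodness-of-fit statistic of observed counts against baseline proportions."""
    if len(baseline_fracs) != len(current_counts):
        raise ValueError("both inputs must cover the same classes")
    total = sum(current_counts)
    statistic = 0.0
    for frac, observed in zip(baseline_fracs, current_counts):
        expected = frac * total
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Baseline shares from the report; current counts assume 10,000 predictions
# split according to the current shares (42.8%, 35.1%, 14.2%, 5.2%, 2.7%).
baseline_fracs = [0.452, 0.328, 0.153, 0.045, 0.022]
current_counts = [4280, 3510, 1420, 520, 270]

stat = chi_square_statistic(baseline_fracs, current_counts)
print(f"chi-square = {stat:.2f} with {len(current_counts) - 1} degrees of freedom")
```

With five classes there are 4 degrees of freedom, for which the 0.05 critical value is about 9.49; a statistic above that value indicates a shift in the label distribution that is unlikely to be noise.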

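Fengge's tip: data drift is one of the main causes of declining model performance. Set up automated drift detection that raises an alert as soon as PSI exceeds the configured threshold.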

4. Model Degradation Detection

Model degradation is a gradual decline in model performance over time, and it needs to be detected and handled promptly.

# Detect model performance degradation
# python3 /opt/ai-monitoring/performance_degradation.py --model fgedu_classifier
Model Performance Degradation Analysis
======================================
Model: fgedu_classifier
Analysis Period: Last 30 days

Performance Metrics Trend:
--------------------------
Date Accuracy Precision Recall F1-Score
2026-03-05 96.23% 95.89% 96.45% 96.17%
2026-03-12 95.89% 95.56% 96.12% 95.84%
2026-03-19 95.45% 95.12% 95.78% 95.45%
2026-03-26 94.98% 94.67% 95.34% 95.00%
2026-04-02 94.52% 94.23% 94.89% 94.56%

Degradation Detection:
----------------------
Accuracy Degradation: -1.71% (WARNING)
Precision Degradation: -1.66% (WARNING)
Recall Degradation: -1.56% (WARNING)
F1-Score Degradation: -1.61% (WARNING)

Statistical Test:
-----------------
Mann-Kendall Trend Test:
Tau: -0.89
P-value: 0.002
Trend: Significant downward trend detected

Root Cause Analysis:
--------------------
1. Data drift detected in 2 features
2. Concept drift in business logic
3. Seasonal pattern change
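
The Mann-Kendall test in the report checks whether a series trends monotonically by counting, over all pairs of observations, how often a later value is higher or lower than an earlier one. A minimal sketch under stated assumptions follows: the function name is hypothetical, ties are ignored in the variance, and the p-value uses the normal approximation, which is rough for series this short. Applied to the five weekly accuracy values in the table, which fall every week, it returns tau = -1; the report's -0.89 would come from a series that is not strictly decreasing.

```python
import math


def mann_kendall(values):
    """Return (Kendall tau, two-sided p-value) for a monotonic trend."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of the pairwise change
    tau = s / (n * (n - 1) / 2)
    var_s = n * (n - 1) * (2 * n + 5) / 18  # variance of S, assuming no ties
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal approximation
    return tau, p_value


weekly_accuracy = [96.23, 95.89, 95.45, 94.98, 94.52]  # from the trend table
tau, p_value = mann_kendall(weekly_accuracy)
print(f"tau = {tau:.2f}, p = {p_value:.3f}")
```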

# Detect concept drift
# python3 /opt/ai-monitoring/concept_drift.py --model fgedu_classifier
Concept Drift Detection Report
===============================
Model: fgedu_classifier
Method: ADWIN (Adaptive Windowing)

Drift Detection Results:
------------------------
Window Size: 10000 samples
Drift Points Detected:
- 2026-03-15 14:23:45 (Sample #45234)
- 2026-03-28 09:15:32 (Sample #67890)

Drift Magnitude:
First Drift: 0.234 (Moderate)
Second Drift: 0.456 (Significant)

Performance Impact:
Pre-Drift Accuracy: 96.12%
Post-Drift Accuracy: 94.56%
Performance Drop: 1.56%

Recommended Actions:
1. Retrain model with recent data
2. Implement online learning
3. Update feature engineering pipeline

# Configure performance alert thresholds
# cat /opt/ai-monitoring/alerts_config.yaml
model_alerts:
  fgedu_classifier:
    accuracy_threshold: 0.95
    precision_threshold: 0.94
    recall_threshold: 0.94
    latency_threshold_ms: 50
    error_rate_threshold: 0.01
    drift_psi_threshold: 0.25

    notification:
      email: ["ai-team@fgedu.net.cn"]
      slack: "#ai-alerts"
      pagerduty: "ai-oncall"

    auto_action:
      retrain_trigger: false
      rollback_threshold: 0.90
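
To make the threshold semantics concrete, here is a small sketch of how such a configuration might be evaluated. Quality metrics alert when they fall below their threshold, while latency, error rate, and PSI alert when they rise above it. The `evaluate_alerts` function and the metric key names are assumptions for illustration; the sample values are taken from outputs earlier in this tutorial.

```python
THRESHOLDS = {
    "accuracy_threshold": 0.95,
    "precision_threshold": 0.94,
    "recall_threshold": 0.94,
    "latency_threshold_ms": 50,
    "error_rate_threshold": 0.01,
    "drift_psi_threshold": 0.25,
}


def evaluate_alerts(metrics, thresholds=THRESHOLDS):
    """Return one message per breached threshold, in a fixed order."""
    alerts = []
    # Quality metrics: lower is worse.
    for name in ("accuracy", "precision", "recall"):
        limit = thresholds[f"{name}_threshold"]
        if metrics[name] < limit:
            alerts.append(f"{name} {metrics[name]:.4f} is below {limit}")
    # Operational and drift metrics: higher is worse.
    upper_bounds = [
        ("latency_ms", "latency_threshold_ms"),
        ("error_rate", "error_rate_threshold"),
        ("psi", "drift_psi_threshold"),
    ]
    for metric, key in upper_bounds:
        if metrics[metric] > thresholds[key]:
            alerts.append(f"{metric} {metrics[metric]:.4f} is above {thresholds[key]}")
    return alerts


latest = {
    "accuracy": 0.9452, "precision": 0.9423, "recall": 0.9489,  # 2026-04-02 row
    "latency_ms": 19.9,             # average latency computed earlier
    "error_rate": 10 / 22900,       # error count / total requests
    "psi": 0.35,                    # feature_transaction
}
alerts = evaluate_alerts(latest)
for message in alerts:
    print("ALERT:", message)
```

With these inputs only the accuracy and PSI checks fire, which matches the active "Accuracy Drop" and recent "Data Drift" entries one would expect in the alert history.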

# View alert history
# python3 /opt/ai-monitoring/alert_history.py --model fgedu_classifier --days 7
Alert History (Last 7 Days)
============================
Time Alert Type Severity Status
2026-04-03 10:00:00 Accuracy Drop Warning Active
2026-04-02 15:30:00 Data Drift Warning Resolved
2026-04-01 09:00:00 Latency Spike Info Resolved
2026-03-30 14:00:00 Error Rate High Critical Resolved

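Fengge's production recommendation: establish a performance baseline for each model, set sensible alert thresholds, evaluate performance on a regular schedule, and trigger the retraining workflow as soon as performance drops past a threshold.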

5. Resource Usage Monitoring

Monitor the resource usage of model services to keep allocation reasonable and costs under control.

# Check resource usage of the model services
# kubectl top pods -n ai-serving
NAME CPU(cores) MEMORY(bytes)
model-server-xyz12-abc34 2500m 8Gi
model-server-xyz12-def56 2300m 7.5Gi
model-server-xyz12-ghi78 2400m 7.8Gi
tensorflow-serving-0 1500m 4Gi
triton-server-0 2000m 6Gi

# Check container resource limits and requests
# kubectl describe pod model-server-xyz12-abc34 -n ai-serving | grep -A 7 "Limits:"
Limits:
cpu: 4
memory: 16Gi
nvidia.com/gpu: 1
Requests:
cpu: 2
memory: 8Gi
nvidia.com/gpu: 1

# Check GPU allocation
# kubectl describe nodes | grep -A 10 "nvidia.com/gpu"
nvidia.com/gpu: 2
nvidia.com/gpu: 2
Allocated resources:
nvidia.com/gpu 1

# Monitor inference batching efficiency
# curl -s http://fgedudb:8501/v1/models/fgedu_classifier/metrics | grep batch
# TYPE tensorflow_serving_batching_session_duration_seconds histogram
tensorflow_serving_batching_session_duration_seconds_sum 1234.56
tensorflow_serving_batching_session_duration_seconds_count 10000
# TYPE tensorflow_serving_batching_num_batched_requests histogram
tensorflow_serving_batching_num_batched_requests_sum 45000
tensorflow_serving_batching_num_batched_requests_count 10000
# TYPE tensorflow_serving_batching_utilization gauge
tensorflow_serving_batching_utilization 0.85

# Average batch size (batched requests / batches)
# echo "scale=2; 45000 / 10000" | bc
4.50

# Analyze resource usage trends
# python3 /opt/ai-monitoring/resource_analysis.py --days 7
Resource Usage Analysis (Last 7 Days)
=====================================
Model: fgedu_classifier

CPU Usage:
Average: 2.3 cores
Peak: 3.8 cores
Trend: Stable

Memory Usage:
Average: 7.8 GB
Peak: 12.5 GB
Trend: Increasing (+5%)

GPU Memory:
Average: 22.5 GB
Peak: 32.0 GB
Utilization: 45%

GPU Utilization:
Average: 42%
Peak: 78%
Trend: Stable

Recommendations:
1. Consider increasing memory limit to 20GB
2. Optimize batch size for better GPU utilization
3. Enable GPU memory growth for dynamic allocation

6. Model Version Management

Model version management is what makes a model traceable and lets you roll it back.

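# Manage model versions with MLflow
# Note: the `mlflow models list/describe/transition/compare/export` commands in this
# section are illustrative. The stock MLflow CLI does not ship these subcommands;
# registry operations are normally performed through the MlflowClient Python API.
# mlflow models list --experiment-name fgedu_classifier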
Name Version Stage Creation Time
fgedu_classifier 1 Archived 2026-01-15 10:00:00
fgedu_classifier 2 Archived 2026-02-15 10:00:00
fgedu_classifier 3 Production 2026-03-15 10:00:00
fgedu_classifier 4 Staging 2026-04-01 10:00:00

# View model version details
# mlflow models describe --name fgedu_classifier --version 3
Model: fgedu_classifier
Version: 3
Stage: Production
Creation Time: 2026-03-15 10:00:00
Source: runs:/abc123def456/model
Metrics:
accuracy: 0.9623
precision: 0.9589
recall: 0.9645
f1_score: 0.9617
Parameters:
learning_rate: 0.001
batch_size: 256
epochs: 100
optimizer: adam
Tags:
framework: tensorflow
dataset: fgedu_v2
author: fengge

# Promote a new model version
# mlflow models transition --name fgedu_classifier --version 4 --stage Production
Model version 4 transitioned to 'Production' stage.

# Roll back to the previous version
# mlflow models transition --name fgedu_classifier --version 3 --stage Production
Model version 3 transitioned to 'Production' stage.
Model version 4 transitioned to 'Archived' stage.

# Compare model versions
# mlflow models compare --name fgedu_classifier --versions 3,4
Model Version Comparison
========================
Metric Version 3 Version 4 Change
Accuracy 96.23% 96.45% +0.22%
Precision 95.89% 96.12% +0.23%
Recall 96.45% 96.34% -0.11%
F1-Score 96.17% 96.23% +0.06%
Latency (ms) 18.5 22.3 +3.8ms
Model Size (MB) 256 312 +56MB

# Export a model version
# mlflow models export --name fgedu_classifier --version 3 --output-path /backup/models/
Model exported to /backup/models/fgedu_classifier_v3/

# Manage model data versions with DVC
# (dvc list takes a repository URL first; "." is the current repository)
# dvc list . models
models/fgedu_classifier_v1.dvc
models/fgedu_classifier_v2.dvc
models/fgedu_classifier_v3.dvc

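# View the model change history as a dependency graph
# dvc dag
+-------------------+
| data/train_v1.dvc |
+-------------------+
          |
          *
          |
+---------------------+
| models/fgedu_v1.dvc |
+---------------------+
          |
          *
          |
+-------------------+
| data/train_v2.dvc |
+-------------------+
          |
          *
          |
+---------------------+
| models/fgedu_v2.dvc |
+---------------------+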

# Restore a previous version
# dvc checkout models/fgedu_classifier_v2.dvc
A models/fgedu_classifier.pkl

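Fengge's tip: put a complete model version management process in place. Record the reason, performance metrics, and deployment status of every model change so that problems can be traced and versions rolled back.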

7. Model Retraining Strategy

A well-defined retraining strategy keeps the model performing at its best over time.

# Configure the automatic retraining policy
# cat /opt/ai-training/retrain_config.yaml
retraining_policy:
  model_name: fgedu_classifier

  triggers:
    performance_based:
      accuracy_threshold: 0.95
      precision_threshold: 0.94
      recall_threshold: 0.94
      evaluation_window: 7d

    drift_based:
      psi_threshold: 0.25
      feature_drift_count: 2

    schedule_based:
      interval: 30d
      start_time: "02:00"

  training_config:
    data_source: /data/fgedu_training
    validation_split: 0.2
    test_split: 0.1

    hyperparameters:
      learning_rate: [0.001, 0.0001]
      batch_size: [128, 256, 512]
      epochs: [50, 100]

    early_stopping:
      patience: 10
      min_delta: 0.001

  deployment:
    auto_deploy: false
    canary_percentage: 10
    rollback_threshold: 0.90
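
The three trigger types can be read as independent checks, any one of which is enough to start retraining. The sketch below is one possible reading of the policy, not the logic of `retrain_model.py`: it assumes `feature_drift_count: 2` means "at least two features above the PSI threshold", and the function and key names are hypothetical.

```python
POLICY = {
    "accuracy_threshold": 0.95,
    "precision_threshold": 0.94,
    "recall_threshold": 0.94,
    "psi_threshold": 0.25,
    "feature_drift_count": 2,
    "interval_days": 30,
}


def retrain_reasons(metrics, psi_by_feature, days_since_last_train, policy=POLICY):
    """Return the list of triggers that currently call for retraining."""
    reasons = []
    if any(metrics[name] < policy[f"{name}_threshold"]
           for name in ("accuracy", "precision", "recall")):
        reasons.append("performance")
    drifted = [f for f, value in psi_by_feature.items()
               if value > policy["psi_threshold"]]
    if len(drifted) >= policy["feature_drift_count"]:
        reasons.append("drift")
    if days_since_last_train >= policy["interval_days"]:
        reasons.append("schedule")
    return reasons


# Inputs taken from the reports earlier in this tutorial.
metrics = {"accuracy": 0.9452, "precision": 0.9423, "recall": 0.9489}
psi_by_feature = {
    "feature_age": 0.05, "feature_income": 0.18, "feature_credit_score": 0.03,
    "feature_transaction": 0.35, "feature_location": 0.08, "feature_device": 0.12,
}
reasons = retrain_reasons(metrics, psi_by_feature, days_since_last_train=19)
print(reasons)
```

With these inputs only the performance trigger fires: accuracy is below 0.95, just one feature exceeds the PSI threshold, and version 3 was trained 19 days earlier. Keeping `auto_deploy: false` means a triggered retrain still produces a Staging model that a person promotes.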

# Run model retraining
# python3 /opt/ai-training/retrain_model.py --model fgedu_classifier --trigger performance
Model Retraining Pipeline
=========================
Model: fgedu_classifier
Trigger: Performance degradation detected

Step 1: Data Collection
-----------------------
Training data collected: 1,234,567 samples
Validation data: 308,642 samples
Test data: 154,321 samples
Data quality check: PASSED

Step 2: Feature Engineering
---------------------------
Features extracted: 45
New features added: 3
Feature selection: 32 features selected

Step 3: Model Training
----------------------
Hyperparameter search: Random Search
Best parameters found:
learning_rate: 0.0005
batch_size: 256
epochs: 85

Training Progress:
Epoch 1/100 - loss: 0.4523 - accuracy: 0.7823
Epoch 10/100 - loss: 0.1234 - accuracy: 0.9234
Epoch 50/100 - loss: 0.0456 - accuracy: 0.9567
Epoch 85/100 - loss: 0.0312 - accuracy: 0.9678
Early stopping triggered

Step 4: Model Evaluation
------------------------
Validation Metrics:
Accuracy: 96.78%
Precision: 96.45%
Recall: 96.89%
F1-Score: 96.67%

Test Metrics:
Accuracy: 96.52%
Precision: 96.23%
Recall: 96.45%
F1-Score: 96.34%

Step 5: Model Registration
--------------------------
Model version: 5
Stage: Staging
MLflow run ID: def456ghi789

# View retraining history
# python3 /opt/ai-training/retrain_history.py --model fgedu_classifier
Retraining History
==================
Model: fgedu_classifier

Version Date Trigger Accuracy Status
1 2026-01-15 Initial 94.23% Archived
2 2026-02-15 Scheduled 95.45% Archived
3 2026-03-15 Performance 96.23% Production
4 2026-04-01 Drift 96.45% Staging
5 2026-04-03 Performance 96.78% Staging

Performance Improvement:
Version 1 -> 5: +2.55% accuracy improvement
Average improvement per retrain: +0.64%

Resource Usage:
Average training time: 4.5 hours
Average GPU hours: 9.0
Average data processed: 1.2M samples

8. Model Deployment Management

Model deployment management makes sure that models are released to production safely and reliably.

# Check the current deployment status
# kubectl get deployments -n ai-serving
NAME READY UP-TO-DATE AVAILABLE AGE
model-server 3/3 3 3 90d
triton-server 1/1 1 1 30d

# Deploy the new model version (canary release)
# kubectl apply -f – <

# Configure traffic splitting
# kubectl apply -f – <
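
As a conceptual illustration of traffic splitting, the sketch below routes a fixed percentage of requests to the canary by hashing a request identifier. It is not the manifest applied above, whose contents are not shown; the function name `route` and the use of a request ID as the hash key are assumptions. The 10% default mirrors `canary_percentage: 10` from the retraining policy.

```python
import hashlib


def route(request_id, canary_percentage=10):
    """Pick a backend deterministically so one ID always hits the same version."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percentage else "stable"


counts = {"stable": 0, "canary": 0}
for i in range(10_000):
    counts[route(f"req-{i}")] += 1

share = counts["canary"] / 10_000
print(f"canary share: {share:.1%}")
```

Hash-based assignment keeps a given caller on one model version for the whole canary window, which makes before/after metric comparisons cleaner than random per-request routing.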

# Complete the canary release
# kubectl scale deployment model-server -n ai-serving --replicas=4
deployment.apps/model-server scaled

# kubectl scale deployment model-server-canary -n ai-serving --replicas=0
deployment.apps/model-server-canary scaled

# Update the image of the main deployment
# kubectl set image deployment/model-server model-server=fgedu/model-server:v1.3.0 -n ai-serving
deployment.apps/model-server image updated

# Verify the rollout status
# kubectl rollout status deployment/model-server -n ai-serving
Waiting for deployment "model-server" rollout to finish: 2 of 3 updated replicas are available...
deployment "model-server" successfully rolled out.

# Roll back the deployment (if needed)
# kubectl rollout undo deployment/model-server -n ai-serving
deployment.apps/model-server rolled back

# View rollout history
# kubectl rollout history deployment/model-server -n ai-serving
deployment.apps/model-server
REVISION CHANGE-CAUSE
1
2
3
4 Update to v1.3.0

9. Model Governance and Compliance

Model governance makes sure that models comply with company standards and regulatory requirements.

# View the model audit log
# python3 /opt/ai-governance/audit_log.py --model fgedu_classifier --days 30
Model Audit Log
===============
Model: fgedu_classifier
Period: Last 30 days

Model Changes:
--------------
Date Action User Details
2026-04-03 Deploy v5 fengge Performance improvement
2026-04-01 Train v5 ai-pipeline Auto-triggered
2026-03-28 Alert system Data drift detected
2026-03-15 Deploy v3 fengge Scheduled release

Access Log:
-----------
Date User Action IP Address
2026-04-03 fengge Predict 192.168.1.100
2026-04-03 api-service Predict 10.0.0.50
2026-04-03 batch-job Predict 10.0.0.51

Compliance Checks:
------------------
Date Check Status Details
2026-04-03 Bias Detection PASS No significant bias found
2026-04-03 Privacy Check PASS No PII in predictions
2026-04-03 Explainability PASS SHAP values available
2026-04-03 Model Card PASS Documentation complete

# Detect model bias
# python3 /opt/ai-governance/bias_detection.py --model fgedu_classifier
Bias Detection Report
=====================
Model: fgedu_classifier
Date: 2026-04-03

Demographic Parity:
-------------------
Group Positive Rate Disparity
Male 45.2% Baseline
Female 44.8% -0.4% (Acceptable)
Age 18-25 46.1% +0.9% (Acceptable)
Age 26-40 45.0% -0.2% (Acceptable)
Age 41-60 44.5% -0.7% (Acceptable)
Age 60+ 43.8% -1.4% (Acceptable)

Equalized Odds:
---------------
Group TPR FPR Disparity
Male 92.3% 5.2% Baseline
Female 91.8% 5.5% Acceptable

Individual Fairness:
--------------------
Similar instances prediction consistency: 98.5%
Fairness threshold: 95%
Status: PASS

Recommendations:
1. Continue monitoring demographic parity
2. Add fairness constraints to training
3. Document bias mitigation measures
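
The two group-level metrics in the report can be computed from binary predictions and labels. Demographic parity compares the positive-prediction rate of each group with a baseline group; equalized odds compares true-positive and false-positive rates. The sketch below is a minimal illustration with hypothetical function names and toy data, not the implementation of `bias_detection.py`.

```python
def positive_rate(predictions):
    """Share of positive (1) predictions in a list of 0/1 values."""
    return sum(predictions) / len(predictions)


def demographic_parity_gaps(preds_by_group, baseline_group):
    """Positive-rate difference of every group relative to the baseline group."""
    base = positive_rate(preds_by_group[baseline_group])
    return {group: positive_rate(preds) - base
            for group, preds in preds_by_group.items()
            if group != baseline_group}


def tpr_fpr(y_true, y_pred):
    """True-positive rate and false-positive rate for one group."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


# Toy data for illustration only.
preds = {"male": [1, 0, 1, 0], "female": [1, 0, 0, 0]}
gaps = demographic_parity_gaps(preds, baseline_group="male")
rates = tpr_fpr(y_true=[1, 1, 0, 0], y_pred=[1, 0, 1, 0])
print(gaps, rates)
```

Equalized odds is checked by running `tpr_fpr` once per group and comparing the results, as the report does for the Male and Female rows.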

# Generate a model card
# python3 /opt/ai-governance/generate_model_card.py --model fgedu_classifier
Model Card Generated
====================
File: /opt/ai-governance/model_cards/fgedu_classifier_v5.md

Model Card Content:
-------------------
# Model Card: fgedu_classifier v5

## Model Details
- Model Type: Classification
- Framework: TensorFlow 2.12
- Version: 5
- Created: 2026-04-03
- Author: fengge

## Intended Use
- Primary Use: Customer behavior prediction
- Users: Marketing team, Product team
- Out-of-scope: Individual decision making

## Training Data
- Source: fgedu_customer_data_v3
- Size: 1,234,567 samples
- Time Period: 2025-01 to 2026-03
- Features: 32

## Evaluation Data
- Size: 154,321 samples
- Time Period: 2026-03

## Metrics
- Accuracy: 96.78%
- Precision: 96.45%
- Recall: 96.89%
- F1-Score: 96.67%

## Ethical Considerations
- Bias Check: Passed
- Privacy: No PII used
- Explainability: SHAP available

## Limitations
- Not suitable for real-time decisions
- Requires periodic retraining
- Performance may degrade with data drift

10. Monitoring Tools and Practices

Build a complete monitoring toolchain to manage models across their full lifecycle.

# Configure Prometheus alerting rules
# cat /opt/monitoring/prometheus_rules/ai_model_rules.yml
groups:
  - name: ai_model_alerts
    rules:
      - alert: ModelAccuracyLow
        expr: model_accuracy < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Model accuracy below threshold"
          description: "Model {{ $labels.model }} accuracy is {{ $value }}"

      - alert: ModelLatencyHigh
        expr: model_latency_p99 > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Model latency above threshold"
          description: "Model {{ $labels.model }} p99 latency is {{ $value }}s"

      - alert: DataDriftDetected
        expr: model_psi > 0.25
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Data drift detected"
          description: "Model {{ $labels.model }} PSI is {{ $value }}"

      - alert: ModelErrorRateHigh
        expr: rate(model_errors_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Model error rate high"
          description: "Model {{ $labels.model }} error rate is {{ $value }}"

# Configure the Grafana dashboard
# cat /opt/monitoring/grafana_dashboards/ai_model_dashboard.json
{
  "dashboard": {
    "title": "AI Model Monitoring Dashboard",
    "panels": [
      {
        "title": "Model Accuracy",
        "type": "graph",
        "targets": [
          {
            "expr": "model_accuracy",
            "legendFormat": "{{model}}"
          }
        ]
      },
      {
        "title": "Request Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(model_latency_bucket[5m]))",
            "legendFormat": "p99"
          },
          {
            "expr": "histogram_quantile(0.95, rate(model_latency_bucket[5m]))",
            "legendFormat": "p95"
          }
        ]
      },
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(model_requests_total[5m])",
            "legendFormat": "{{model}}"
          }
        ]
      },
      {
        "title": "Data Drift PSI",
        "type": "gauge",
        "targets": [
          {
            "expr": "model_psi",
            "legendFormat": "{{feature}}"
          }
        ]
      }
    ]
  }
}

# Create the monitoring script
# cat /opt/ai-monitoring/monitor.sh
#!/bin/bash
# Poll Prometheus for model health metrics and alert on threshold breaches.

MODEL_NAME="fgedu_classifier"
PROM_URL="http://prometheus:9090/api/v1/query"
THRESHOLD_ACCURACY=0.95
THRESHOLD_LATENCY=0.05   # seconds (50 ms); model_latency_p99 is reported in seconds
THRESHOLD_ERROR_RATE=0.01

# Run an instant query and print the first sample value.
# -G with --data-urlencode keeps curl from globbing {} and [] in the query.
prom_query() {
    curl -sG "$PROM_URL" --data-urlencode "query=$1" | jq -r '.data.result[0].value[1]'
}

check_model_health() {
    echo "Checking model health: $MODEL_NAME"
    echo "================================"

    ACCURACY=$(prom_query "model_accuracy{model=\"$MODEL_NAME\"}")
    LATENCY=$(prom_query "model_latency_p99{model=\"$MODEL_NAME\"}")
    ERROR_RATE=$(prom_query "rate(model_errors_total{model=\"$MODEL_NAME\"}[5m])")

    echo "Accuracy: $ACCURACY (Threshold: $THRESHOLD_ACCURACY)"
    echo "Latency: ${LATENCY}s (Threshold: ${THRESHOLD_LATENCY}s)"
    echo "Error Rate: $ERROR_RATE (Threshold: $THRESHOLD_ERROR_RATE)"

    if (( $(echo "$ACCURACY < $THRESHOLD_ACCURACY" | bc -l) )); then
        echo "WARNING: Accuracy below threshold!"
        send_alert "Model accuracy low: $ACCURACY"
    fi

    if (( $(echo "$LATENCY > $THRESHOLD_LATENCY" | bc -l) )); then
        echo "WARNING: Latency above threshold!"
        send_alert "Model latency high: ${LATENCY}s"
    fi

    if (( $(echo "$ERROR_RATE > $THRESHOLD_ERROR_RATE" | bc -l) )); then
        echo "CRITICAL: Error rate above threshold!"
        send_alert "Model error rate high: $ERROR_RATE"
    fi
}

# Post an alert message to the Slack incoming webhook.
send_alert() {
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"AI Model Alert: $1\"}" \
        https://hooks.slack.com/services/xxx/yyy/zzz
}

check_model_health

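Fengge's production recommendation: build a complete monitoring and alerting system, review model performance and data quality regularly, and prepare contingency plans so that model services stay stable and reliable.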

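This article was compiled and published by Fengge Tutorials (风哥教程) for study and testing purposes only. When reposting, please cite the source: http://www.fgedu.net.cn/10327.html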
