1. Model Version Management Overview
AI model version management is a core part of machine learning operations (MLOps): it keeps models traceable, reproducible, and manageable.
# mlflow --version
mlflow, version 2.10.0
# Start the MLflow server
# mlflow server \
    --backend-store-uri postgresql://fgedu:fgedu123@192.168.1.100:5432/mlflow \
    --default-artifact-root s3://fgedu-mlflow-artifacts/ \
    --host 0.0.0.0 \
    --port 5000
[2026-04-03 10:00:00 +0800] [12345] [INFO] Starting gunicorn 21.2.0
[2026-04-03 10:00:00 +0800] [12345] [INFO] Listening at: http://0.0.0.0:5000 (12345)
[2026-04-03 10:00:00 +0800] [12345] [INFO] Using worker: sync
[2026-04-03 10:00:00 +0800] [12346] [INFO] Booting worker with pid: 12346
# Check the MLflow server status (the /health endpoint of the tracking server answers with a plain "OK")
# curl -s http://fgedudb:5000/health
OK
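# List the registered models (the mlflow CLI has no registry-listing subcommand, so query the registry through the Python client)
# python -c "from mlflow.tracking import MlflowClient; c = MlflowClient('http://fgedudb:5000'); print(*sorted(m.name for m in c.search_registered_models()), sep='\n')"
fgedu_classifier
fgedu_ner_model
fgedu_recommendation
fgedu_regressor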
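2. Setting Up the MLflow Environment
MLflow is an open-source platform for managing the machine learning lifecycle. It provides experiment tracking, a model registry, and model deployment.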
# pip install mlflow==2.10.0 boto3 psycopg2-binary
Successfully installed mlflow-2.10.0 boto3-1.34.0 psycopg2-binary-2.9.9
# Set the environment variables
# export MLFLOW_TRACKING_URI="http://192.168.1.100:5000"
# export AWS_ACCESS_KEY_ID="fgedu_access_key"
# export AWS_SECRET_ACCESS_KEY="fgedu_secret_key"
# export MLFLOW_S3_ENDPOINT_URL="http://192.168.1.100:9000"
# Create the MLflow database
# psql -h 192.168.1.100 -U postgres -c "CREATE DATABASE mlflow;"
CREATE DATABASE
# Grant access
# psql -h 192.168.1.100 -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE mlflow TO fgedu;"
GRANT
# Initialize the MLflow database schema
# mlflow db upgrade postgresql://fgedu:fgedu123@192.168.1.100:5432/mlflow
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
INFO [alembic.runtime.migration] Running upgrade -> 45baa85826b8, add model metrics table
INFO [alembic.runtime.migration] Running upgrade 45baa85826b8 -> 84d6ceebbc5b, add latest metrics table
INFO [alembic.runtime.migration] Running upgrade 84d6ceebbc5b -> bc0c6d3c3c2e, add registered model tables
# Record the server settings (for reference only: `mlflow server` takes its options from command-line flags and environment variables and does not read this file)
# cat > /opt/mlflow/mlflow.conf << 'EOF'
[server]
backend_store_uri = postgresql://fgedu:fgedu123@192.168.1.100:5432/mlflow
default_artifact_root = s3://fgedu-mlflow-artifacts/
host = 0.0.0.0
port = 5000
workers = 4
timeout = 120

[auth]
admin_username = fgedu_admin
admin_password = Fgedu@123456

[logging]
level = INFO
file = /var/log/mlflow/mlflow.log
EOF
# Create the systemd service
# cat > /etc/systemd/system/mlflow.service << 'EOF'
[Unit]
Description=MLflow Tracking Server
After=network.target postgresql.service minio.service

[Service]
Type=simple
User=mlflow
Group=mlflow
Environment="MLFLOW_TRACKING_URI=http://192.168.1.100:5000"
Environment="AWS_ACCESS_KEY_ID=fgedu_access_key"
Environment="AWS_SECRET_ACCESS_KEY=fgedu_secret_key"
Environment="MLFLOW_S3_ENDPOINT_URL=http://192.168.1.100:9000"
ExecStart=/opt/mlflow/venv/bin/mlflow server \
    --backend-store-uri postgresql://fgedu:fgedu123@192.168.1.100:5432/mlflow \
    --default-artifact-root s3://fgedu-mlflow-artifacts/ \
    --host 0.0.0.0 \
    --port 5000 \
    --workers 4
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
# Start the service
# systemctl daemon-reload
# systemctl enable mlflow
# systemctl start mlflow
# Check the service status
# systemctl status mlflow
● mlflow.service - MLflow Tracking Server
   Loaded: loaded (/etc/systemd/system/mlflow.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2026-04-03 10:00:00 CST; 1min ago
 Main PID: 12345 (mlflow)
    Tasks: 5 (limit: 49143)
   Memory: 512.0M
   CGroup: /system.slice/mlflow.service
           └─12345 /opt/mlflow/venv/bin/python /opt/mlflow/venv/bin/mlflow server...
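The server and its clients depend on the four environment variables exported earlier in this section, and a missing one tends to surface only later as an opaque S3 or connection error. The following is a minimal sketch of a pre-flight check, assuming those four variable names; the function name `missing_env_vars` is hypothetical and not part of MLflow.

```python
import os

# Variable names taken from the export commands in this section
REQUIRED_VARS = (
    "MLFLOW_TRACKING_URI",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "MLFLOW_S3_ENDPOINT_URL",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("Environment looks complete")
```

Running such a check before `systemctl start mlflow` or before a training script is one way to fail early with a readable message.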
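3. Model Registration and Management
Model registration is the core of version management: the registry stores each model version and lets you label and manage it.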
# Register a model
# cat > register_model.py << 'EOF'
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

mlflow.set_tracking_uri("http://192.168.1.100:5000")
mlflow.set_experiment("fgedu_classifier_experiment")

# Placeholder training data so that the script runs end to end; substitute the real dataset
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42).fit(X, y)

with mlflow.start_run(run_name="fgedu_classifier_v1.0"):
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    # Illustrative metric values; in practice compute them on a held-out test set
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("precision", 0.94)
    mlflow.log_metric("recall", 0.96)
    mlflow.log_metric("f1_score", 0.95)
    # Log the model and register it in the registry in one step
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="fgedu_classifier"
    )
    run_id = mlflow.active_run().info.run_id
    print(f"Run ID: {run_id}")
EOF
# python register_model.py
Run ID: abc123-def456-ghi789-jkl012
Successfully registered model 'fgedu_classifier'.
2026/04/03 10:00:00 INFO mlflow.tracking._model_registry.fluent: Created version '1' of model 'fgedu_classifier'.
# List the registered models (through the Python client, because the mlflow CLI has no registry-listing subcommand)
# python -c "from mlflow.tracking import MlflowClient; c = MlflowClient('http://192.168.1.100:5000'); print(*sorted(m.name for m in c.search_registered_models()), sep='\n')"
fgedu_classifier
fgedu_ner_model
fgedu_regressor
# Show the versions of one model
# python -c "from mlflow.tracking import MlflowClient; c = MlflowClient('http://192.168.1.100:5000'); [print(v.version, v.status, v.run_id, v.source) for v in c.search_model_versions(\"name='fgedu_classifier'\")]"
1 READY abc123-def456-ghi789-jkl012 s3://fgedu-mlflow-artifacts/1/abc123/artifacts/model
# Manage model versions
# cat > model_management.py << 'EOF'
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://192.168.1.100:5000")
model_name = "fgedu_classifier"

# Fetch every version of the model
versions = client.search_model_versions(f"name='{model_name}'")
for version in versions:
    print(f"Version: {version.version}, Status: {version.status}, Stage: {version.current_stage}")

# Point the "champion" alias at version 1
client.set_registered_model_alias(
    name=model_name,
    alias="champion",
    version="1"
)

# Add a description to the version
client.update_model_version(
    name=model_name,
    version="1",
    description="Initial production model with 95% accuracy"
)

# Tag the version
client.set_model_version_tag(
    name=model_name,
    version="1",
    key="production_ready",
    value="true"
)

# Move the version to the Production stage
# (stages are deprecated since MLflow 2.9 in favor of aliases, but still work in 2.10)
client.transition_model_version_stage(
    name=model_name,
    version="1",
    stage="Production"
)
print("Model version 1 transitioned to Production stage")
EOF
# python model_management.py
Version: 1, Status: READY, Stage: None
Model version 1 transitioned to Production stage
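Once a version carries an alias such as "champion", consumers can refer to it without hard-coding a version number. MLflow resolves two URI forms, `models:/<name>/<version>` and `models:/<name>@<alias>`. The sketch below only builds those strings; the helper name `model_uri` is hypothetical.

```python
def model_uri(name, version=None, alias=None):
    """Build a models:/ URI from either a version number or an alias (hypothetical helper)."""
    if (version is None) == (alias is None):
        raise ValueError("pass exactly one of version or alias")
    if alias is not None:
        return f"models:/{name}@{alias}"
    return f"models:/{name}/{version}"


print(model_uri("fgedu_classifier", version=1))         # models:/fgedu_classifier/1
print(model_uri("fgedu_classifier", alias="champion"))  # models:/fgedu_classifier@champion
```

Loading by alias means a later promotion only has to move the alias; the serving code keeps the same URI.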
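4. Version Control Strategy
A version control strategy keeps model versions ordered and traceable. The scripts below attach a semantic version and release metadata to each registry version as tags.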
# Semantic version naming
# cat > version_naming.py << 'EOF'
from mlflow.tracking import MlflowClient
import datetime

client = MlflowClient(tracking_uri="http://192.168.1.100:5000")

def generate_version_name(model_name, version_type="patch"):
    # Look up the newest registry version and read its semantic_version tag
    latest_versions = client.search_model_versions(
        f"name='{model_name}'",
        order_by=["version_number DESC"],
        max_results=1
    )
    if not latest_versions:
        return "v1.0.0"
    # A version without the tag is treated as v1.0.0
    current_version = latest_versions[0].tags.get("semantic_version", "v1.0.0")
    major, minor, patch = map(int, current_version[1:].split("."))
    if version_type == "major":
        major += 1
        minor = 0
        patch = 0
    elif version_type == "minor":
        minor += 1
        patch = 0
    else:
        patch += 1
    return f"v{major}.{minor}.{patch}"

model_name = "fgedu_classifier"
new_version = generate_version_name(model_name, "patch")
print(f"New version: {new_version}")

# Tag registry version 2 with the new semantic version and release metadata
client.set_model_version_tag(
    name=model_name,
    version="2",
    key="semantic_version",
    value=new_version
)
client.set_model_version_tag(
    name=model_name,
    version="2",
    key="release_date",
    value=datetime.datetime.now().strftime("%Y-%m-%d")
)
client.set_model_version_tag(
    name=model_name,
    version="2",
    key="author",
    value="fgedu_ml_team"
)
EOF
# python version_naming.py
New version: v1.0.1
# View the tags of a model version
# cat > view_tags.py << 'EOF'
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://192.168.1.100:5000")
model_name = "fgedu_classifier"
version = "1"

model_version = client.get_model_version(model_name, version)
print(f"Model: {model_name}")
print(f"Version: {version}")
print(f"Stage: {model_version.current_stage}")
print(f"Description: {model_version.description}")
print("Tags:")
for key, value in model_version.tags.items():
    print(f"  {key}: {value}")
EOF
# python view_tags.py
Model: fgedu_classifier
Version: 1
Stage: Production
Description: Initial production model with 95% accuracy
Tags:
  semantic_version: v1.0.0
  release_date: 2026-04-03
  author: fgedu_ml_team
  production_ready: true
# View the change history of a model
# cat > version_history.py << 'EOF'
from datetime import datetime
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://192.168.1.100:5000")
model_name = "fgedu_classifier"

def fmt(ts_ms):
    # MLflow stores timestamps as milliseconds since the Unix epoch
    return datetime.fromtimestamp(ts_ms / 1000).strftime("%Y-%m-%d %H:%M:%S")

print(f"Version History for {model_name}:")
print("=" * 60)
for version in client.search_model_versions(f"name='{model_name}'"):
    print(f"\nVersion: {version.version}")
    print(f"  Stage: {version.current_stage}")
    print(f"  Status: {version.status}")
    print(f"  Created: {fmt(version.creation_timestamp)}")
    print(f"  Updated: {fmt(version.last_updated_timestamp)}")
    print(f"  Run ID: {version.run_id}")
    print(f"  Source: {version.source}")
EOF
# python version_history.py
Version History for fgedu_classifier:
============================================================

Version: 1
  Stage: Production
  Status: READY
  Created: 2026-04-03 10:00:00
  Updated: 2026-04-03 10:30:00
  Run ID: abc123-def456-ghi789-jkl012
  Source: s3://fgedu-mlflow-artifacts/1/abc123/artifacts/model
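The version-bumping rule used in this section (a major bump resets minor and patch, a minor bump resets patch) can be isolated from the registry and tested on its own. This is a minimal sketch, assuming tags always follow the `vMAJOR.MINOR.PATCH` pattern; the function name `bump` is hypothetical.

```python
import re

_SEMVER = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def bump(current, part="patch"):
    """Return the next semantic version string after `current` (hypothetical helper)."""
    match = _SEMVER.match(current)
    if match is None:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH string: {current!r}")
    major, minor, patch = map(int, match.groups())
    if part == "major":
        return f"v{major + 1}.0.0"
    if part == "minor":
        return f"v{major}.{minor + 1}.0"
    if part == "patch":
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("v1.0.0", "patch"))  # v1.0.1
print(bump("v1.4.2", "minor"))  # v1.5.0
print(bump("v1.4.2", "major"))  # v2.0.0
```

Rejecting malformed tags explicitly avoids the silent `ValueError` from `int()` that a plain `split(".")` would raise on a tag such as `v1.0`.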
5. Model Deployment Management
Deployment management covers the path of a registered model from development into production serving.
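# Deploy a model as a local service
# cat > deploy_model.py << 'EOF'
import os
import subprocess

import mlflow
import mlflow.sklearn

TRACKING_URI = "http://192.168.1.100:5000"
# models:/ URIs are resolved against the configured tracking/registry URI
mlflow.set_tracking_uri(TRACKING_URI)

model_name = "fgedu_classifier"
model_version = "1"
model_uri = f"models:/{model_name}/{model_version}"
print(f"Model URI: {model_uri}")

# Load the model once to confirm that the registry entry resolves
model = mlflow.sklearn.load_model(model_uri)
print(f"Model loaded successfully: {type(model)}")

# Serving is started through the `mlflow models serve` CLI
# (the MLflow 2.10 Python API does not document an mlflow.models.serve() function)
server = subprocess.Popen(
    ["mlflow", "models", "serve",
     "-m", model_uri,
     "-h", "0.0.0.0",
     "-p", "5001",
     "--enable-mlserver",
     "--env-manager", "conda"],
    env={**os.environ, "MLFLOW_TRACKING_URI": TRACKING_URI},
)
print("Model serving started on port 5001")
server.wait()
EOF
# python deploy_model.py
Model URI: models:/fgedu_classifier/1
Model loaded successfully: <class 'sklearn.ensemble._forest.RandomForestClassifier'>
Model serving started on port 5001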
# Test the model service
# curl -X POST http://fgedudb:5001/invocations \
    -H "Content-Type: application/json" \
    -d '{"dataframe_split": {"columns": ["feature1", "feature2", "feature3"], "data": [[1.0, 2.0, 3.0]]}}'
{
“predictions”: [1]
}
# Deploy to Kubernetes
# cat > fgedu-model-deployment.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fgedu-classifier
  namespace: mlflow
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fgedu-classifier
  template:
    metadata:
      labels:
        app: fgedu-classifier
        model-version: v1.0.0
    spec:
      containers:
      - name: model-server
        image: mlflow-model-server:v1.0
        ports:
        - containerPort: 5001
        env:
        - name: MLFLOW_TRACKING_URI
          value: "http://mlflow-server:5000"
        - name: MODEL_NAME
          value: "fgedu_classifier"
        - name: MODEL_VERSION
          value: "1"
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "2000m"
            memory: "4Gi"
        livenessProbe:
          httpGet:
            path: /health
            port: 5001
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 5001
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: fgedu-classifier
  namespace: mlflow
spec:
  selector:
    app: fgedu-classifier
  ports:
  - port: 5001
    targetPort: 5001
  type: ClusterIP
EOF
# kubectl apply -f fgedu-model-deployment.yaml
deployment.apps/fgedu-classifier created
service/fgedu-classifier created
# Check the deployment status
# kubectl get pods -n mlflow -l app=fgedu-classifier
NAME                                READY   STATUS    RESTARTS   AGE
fgedu-classifier-6b8b9c8d4f-abc12   1/1     Running   0          1m
fgedu-classifier-6b8b9c8d4f-def34   1/1     Running   0          1m
fgedu-classifier-6b8b9c8d4f-ghi56   1/1     Running   0          1m
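The `/invocations` request in this section sends a JSON body in the `dataframe_split` layout: one list of column names and one list of rows. A client can build that body programmatically instead of hand-writing the JSON. The sketch below uses only the standard library; the helper name `build_payload` is hypothetical, and the feature names are the placeholders from the curl example.

```python
import json


def build_payload(columns, rows):
    """Serialize rows into the dataframe_split request body (hypothetical helper)."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f"row {row!r} does not match {len(columns)} columns")
    body = {"dataframe_split": {"columns": list(columns),
                                "data": [list(row) for row in rows]}}
    return json.dumps(body)


payload = build_payload(["feature1", "feature2", "feature3"], [[1.0, 2.0, 3.0]])
print(payload)
```

The resulting string can be passed as the `-d` argument of curl or as the request body of any HTTP client, with the `Content-Type: application/json` header.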
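6. Model Lineage Tracking
Lineage tracking records the full lifecycle of a model, from the data it was trained on to its deployment. In MLflow, the link runs from a registry version to the training run that produced it.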
# View the lineage of a model version
# cat > model_lineage.py << 'EOF'
from datetime import datetime
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://192.168.1.100:5000")
model_name = "fgedu_classifier"
version = "1"

# Follow the registry version back to the run that produced it
model_version = client.get_model_version(model_name, version)
run = client.get_run(model_version.run_id)

# start_time is stored as milliseconds since the Unix epoch
start_time = datetime.fromtimestamp(run.info.start_time / 1000).strftime("%Y-%m-%d %H:%M:%S")

print(f"Model Lineage for {model_name} v{version}")
print("=" * 60)
print("\nRun Information:")
print(f"  Run ID: {run.info.run_id}")
print(f"  Experiment ID: {run.info.experiment_id}")
print(f"  Start Time: {start_time}")
print(f"  Status: {run.info.status}")
print("\nParameters:")
for key, value in run.data.params.items():
    print(f"  {key}: {value}")
print("\nMetrics:")
for key, value in run.data.metrics.items():
    print(f"  {key}: {value}")
print("\nArtifacts:")
for artifact in client.list_artifacts(run.info.run_id):
    print(f"  {artifact.path}")
print("\nTags:")
for key, value in run.data.tags.items():
    # Skip the system tags that MLflow sets itself
    if key.startswith("mlflow."):
        continue
    print(f"  {key}: {value}")
EOF
# python model_lineage.py
Model Lineage for fgedu_classifier v1
============================================================

Run Information:
  Run ID: abc123-def456-ghi789-jkl012
  Experiment ID: 1
  Start Time: 2026-04-03 10:00:00
  Status: FINISHED

Parameters:
  model_type: RandomForest
  n_estimators: 100
  max_depth: 10
  random_state: 42

Metrics:
  accuracy: 0.95
  precision: 0.94
  recall: 0.96
  f1_score: 0.95

Artifacts:
  model
  confusion_matrix.png
  feature_importance.png
  training_log.txt

Tags:
  data_version: v2.1.0
  training_dataset: fgedu_training_data_2026q1
  author: fgedu_ml_team
# View the data lineage
# cat > data_lineage.py << 'EOF'
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://192.168.1.100:5000")
model_name = "fgedu_classifier"
version = "1"

model_version = client.get_model_version(model_name, version)
run = client.get_run(model_version.run_id)

print("Data Lineage Information:")
print("=" * 60)
# By convention in this tutorial, data lineage is stored in run tags prefixed with "data."
data_tags = {k: v for k, v in run.data.tags.items() if k.startswith("data.")}
for key, value in data_tags.items():
    print(f"{key}: {value}")

print("\nData Processing Pipeline:")
print(f"  Source: {run.data.tags.get('data.source', 'N/A')}")
print(f"  Preprocessing: {run.data.tags.get('data.preprocessing', 'N/A')}")
print(f"  Feature Engineering: {run.data.tags.get('data.feature_engineering', 'N/A')}")
print(f"  Train/Test Split: {run.data.tags.get('data.train_test_split', 'N/A')}")

# Dataset sizes are logged as metrics, which MLflow stores as floats
print("\nDataset Statistics:")
print(f"  Training Samples: {run.data.metrics.get('training_samples', 'N/A')}")
print(f"  Test Samples: {run.data.metrics.get('test_samples', 'N/A')}")
print(f"  Features: {run.data.metrics.get('num_features', 'N/A')}")
print(f"  Classes: {run.data.metrics.get('num_classes', 'N/A')}")
EOF
# python data_lineage.py
Data Lineage Information:
============================================================
data.source: s3://fgedu-data/raw/customer_data_2026q1.parquet
data.preprocessing: normalization, outlier_removal, missing_value_imputation
data.feature_engineering: feature_selection, pca, encoding
data.train_test_split: 80/20 stratified split

Data Processing Pipeline:
  Source: s3://fgedu-data/raw/customer_data_2026q1.parquet
  Preprocessing: normalization, outlier_removal, missing_value_imputation
  Feature Engineering: feature_selection, pca, encoding
  Train/Test Split: 80/20 stratified split

Dataset Statistics:
  Training Samples: 80000.0
  Test Samples: 20000.0
  Features: 50.0
  Classes: 5.0
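The `data.*` tags read by the script above have to be written at training time. A path alone does not prove which bytes were used, so one option is to record a content hash next to the source URI. The following is a minimal sketch using only the standard library; the tag name `data.sha256` and both function names are assumptions for illustration, not an MLflow convention.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def lineage_tags(source_uri, local_path):
    """Build the tag dictionary to attach to a training run (hypothetical tag names)."""
    return {"data.source": source_uri, "data.sha256": file_sha256(local_path)}


if __name__ == "__main__":
    # Demonstration on a throwaway file standing in for the real dataset
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.parquet")
        with open(sample, "wb") as handle:
            handle.write(b"placeholder bytes")
        print(lineage_tags("s3://fgedu-data/raw/customer_data_2026q1.parquet", sample))
```

Inside a training script, the returned dictionary could be passed to `mlflow.set_tags(...)` within the active run.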
7. Model Rollback Mechanism
A rollback mechanism lets you return quickly to a known-good version when the current production model misbehaves.
# Roll back a model manually
# cat > model_rollback.py << 'EOF'
from mlflow.tracking import MlflowClient
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

client = MlflowClient(tracking_uri="http://192.168.1.100:5000")

def rollback_model(model_name, target_version, reason="Performance degradation"):
    # Find the version that is currently in Production
    current_version = None
    for version in client.search_model_versions(f"name='{model_name}'"):
        if version.current_stage == "Production":
            current_version = version.version
            break
    if current_version is None:
        logger.error("No production version found")
        return False

    logger.info(f"Current production version: {current_version}")
    logger.info(f"Target version: {target_version}")
    logger.info(f"Reason: {reason}")

    # Archive the current version and record why
    client.transition_model_version_stage(
        name=model_name,
        version=current_version,
        stage="Archived"
    )
    client.set_model_version_tag(
        name=model_name,
        version=current_version,
        key="rollback_reason",
        value=reason
    )

    # Promote the target version and record where the rollback came from
    client.transition_model_version_stage(
        name=model_name,
        version=target_version,
        stage="Production"
    )
    client.set_model_version_tag(
        name=model_name,
        version=target_version,
        key="rollback_from",
        value=current_version
    )
    logger.info(f"Rollback completed: v{current_version} -> v{target_version}")
    return True

model_name = "fgedu_classifier"
target_version = "1"
rollback_model(model_name, target_version, "Accuracy dropped below threshold")
EOF
# python model_rollback.py
INFO:__main__:Current production version: 2
INFO:__main__:Target version: 1
INFO:__main__:Reason: Accuracy dropped below threshold
INFO:__main__:Rollback completed: v2 -> v1
# Monitor the production model and roll back automatically
# cat > auto_rollback_monitor.py << 'EOF'
import time
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://192.168.1.100:5000")
MODEL_NAME = "fgedu_classifier"
ACCURACY_THRESHOLD = 0.90
CHECK_INTERVAL = 300  # seconds between checks

def get_production_version(model_name):
    for version in client.search_model_versions(f"name='{model_name}'"):
        if version.current_stage == "Production":
            return version.version
    return None

def check_model_performance(model_name, version):
    # Reads the accuracy logged by the training run; a real monitor would
    # read a live evaluation metric instead, since this value never changes
    model_version = client.get_model_version(model_name, version)
    run = client.get_run(model_version.run_id)
    return run.data.metrics.get("accuracy", 0)

def trigger_rollback(model_name, current_version, previous_version):
    client.transition_model_version_stage(
        name=model_name,
        version=current_version,
        stage="Archived"
    )
    client.transition_model_version_stage(
        name=model_name,
        version=previous_version,
        stage="Production"
    )
    print(f"Rollback triggered: v{current_version} -> v{previous_version}")

def monitor():
    while True:
        current_version = get_production_version(MODEL_NAME)
        if current_version:
            accuracy = check_model_performance(MODEL_NAME, current_version)
            print(f"Current version: {current_version}, Accuracy: {accuracy}")
            if accuracy < ACCURACY_THRESHOLD:
                print(f"Accuracy {accuracy} below threshold {ACCURACY_THRESHOLD}")
                # Version numbers start at 1, so version 1 has no predecessor to roll back to
                if int(current_version) > 1:
                    previous_version = str(int(current_version) - 1)
                    trigger_rollback(MODEL_NAME, current_version, previous_version)
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    monitor()
EOF
# python auto_rollback_monitor.py &
[1] 12345
Current version: 2, Accuracy: 0.85
Accuracy 0.85 below threshold 0.90
Rollback triggered: v2 -> v1
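Rolling back to "current version minus one" assumes that the immediately preceding version is healthy, which may not hold if it was itself archived for poor accuracy. A slightly more defensive rule is to pick the newest earlier version whose recorded accuracy still meets the threshold. The sketch below works on a plain dictionary so that it can be tested without a registry; the function name `pick_rollback_target` and the sample numbers are hypothetical.

```python
def pick_rollback_target(accuracy_by_version, current, threshold):
    """Return the newest version older than `current` that meets `threshold`, or None."""
    candidates = [
        version
        for version, accuracy in accuracy_by_version.items()
        if version < current and accuracy >= threshold
    ]
    return max(candidates) if candidates else None


# Hypothetical registry state: version number -> logged accuracy
history = {1: 0.95, 2: 0.88, 3: 0.85}
print(pick_rollback_target(history, current=3, threshold=0.90))  # 1
print(pick_rollback_target(history, current=1, threshold=0.90))  # None
```

In the monitor above, the dictionary would be filled from `search_model_versions` and the `accuracy` metric of each version's run, and a `None` result would raise an alert instead of triggering a transition.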
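8. Comparing Model Versions
Comparing versions side by side, using the metrics and parameters of their training runs, helps you choose which one to promote.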
# Compare model versions
# cat > compare_versions.py << 'EOF'
from datetime import datetime
from mlflow.tracking import MlflowClient
import pandas as pd

client = MlflowClient(tracking_uri="http://192.168.1.100:5000")
model_name = "fgedu_classifier"

versions_data = []
for version in client.search_model_versions(f"name='{model_name}'"):
    run = client.get_run(version.run_id)
    version_info = {
        "Version": version.version,
        "Stage": version.current_stage,
        "Status": version.status,
        # creation_timestamp is in milliseconds since the Unix epoch
        "Created": datetime.fromtimestamp(version.creation_timestamp / 1000).strftime("%Y-%m-%d %H:%M:%S"),
        "Accuracy": run.data.metrics.get("accuracy", 0),
        "Precision": run.data.metrics.get("precision", 0),
        "Recall": run.data.metrics.get("recall", 0),
        "F1_Score": run.data.metrics.get("f1_score", 0),
        "Model_Type": run.data.params.get("model_type", "N/A"),
        "Run_ID": version.run_id[:8]
    }
    versions_data.append(version_info)

df = pd.DataFrame(versions_data)
print("\nModel Version Comparison:")
print("=" * 100)
print(df.to_string(index=False))

print("\n\nBest Version by Accuracy:")
best_version = df.loc[df["Accuracy"].idxmax()]
print(f"  Version: {best_version['Version']}")
print(f"  Accuracy: {best_version['Accuracy']:.4f}")
print(f"  F1 Score: {best_version['F1_Score']:.4f}")
print(f"  Model Type: {best_version['Model_Type']}")
EOF
# python compare_versions.py

Model Version Comparison:
====================================================================================================
Version      Stage Status             Created  Accuracy  Precision  Recall  F1_Score   Model_Type   Run_ID
      1 Production  READY 2026-04-03 10:00:00      0.95       0.94    0.96      0.95 RandomForest abc12345
      2   Archived  READY 2026-04-02 15:00:00      0.85       0.83    0.87      0.85      XGBoost def67890
      3    Staging  READY 2026-04-03 09:00:00      0.94       0.93    0.95      0.94 RandomForest ghi12345


Best Version by Accuracy:
  Version: 1
  Accuracy: 0.9500
  F1 Score: 0.9500
  Model Type: RandomForest
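When the decision is between exactly two versions, for example the Production model and a Staging candidate, the per-metric difference is often easier to read than the full table. This is a minimal sketch on plain dictionaries; the function name `metric_deltas` is hypothetical, and the numbers are the sample values for versions 1 and 3 from the table above.

```python
def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for every metric present in both."""
    shared = sorted(set(baseline) & set(candidate))
    # Round to suppress floating-point noise such as -0.010000000000000009
    return {name: round(candidate[name] - baseline[name], 6) for name in shared}


production = {"accuracy": 0.95, "precision": 0.94, "recall": 0.96, "f1_score": 0.95}
staging = {"accuracy": 0.94, "precision": 0.93, "recall": 0.95, "f1_score": 0.94}

for name, delta in metric_deltas(production, staging).items():
    print(f"{name}: {delta:+.4f}")
```

A negative delta on every metric, as in this sample, suggests that the candidate should not replace the production version on these numbers alone.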
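9. Model Artifact Management
Artifact management makes sure that the model files and the resources that belong to them are stored completely and can be listed, downloaded, and cleaned up.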
# List the artifacts of a model version
# cat > list_artifacts.py << 'EOF'
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://192.168.1.100:5000")
model_name = "fgedu_classifier"
version = "1"

model_version = client.get_model_version(model_name, version)
run_id = model_version.run_id

print(f"Artifacts for {model_name} v{version}:")
print("=" * 60)

def list_artifacts_recursive(path=""):
    artifacts = client.list_artifacts(run_id, path)
    for artifact in artifacts:
        # artifact.path is already relative to the run's artifact root,
        # so it must not be prefixed with the parent path again
        full_path = artifact.path
        if artifact.is_dir:
            print(f"[DIR]  {full_path}/")
            list_artifacts_recursive(artifact.path)
        else:
            size_mb = artifact.file_size / (1024 * 1024) if artifact.file_size else 0
            print(f"[FILE] {full_path} ({size_mb:.2f} MB)")

list_artifacts_recursive()
EOF
# python list_artifacts.py
Artifacts for fgedu_classifier v1:
============================================================
[DIR]  model/
[FILE] model/MLmodel (0.00 MB)
[FILE] model/model.pkl (5.23 MB)
[FILE] model/requirements.txt (0.00 MB)
[FILE] model/conda.yaml (0.00 MB)
[FILE] confusion_matrix.png (0.12 MB)
[FILE] feature_importance.png (0.08 MB)
[FILE] training_log.txt (0.01 MB)
[FILE] model_card.md (0.01 MB)
# Download the artifacts of a model version
# cat > download_artifacts.py << 'EOF'
import os

import mlflow

# models:/ URIs are resolved against the configured tracking/registry URI
mlflow.set_tracking_uri("http://192.168.1.100:5000")

model_name = "fgedu_classifier"
version = "1"
download_path = f"/tmp/{model_name}_v{version}"
model_uri = f"models:/{model_name}/{version}"

local_path = mlflow.artifacts.download_artifacts(
    artifact_uri=model_uri,
    dst_path=download_path
)
print(f"Model downloaded to: {local_path}")

print("\nDownloaded files:")
for root, dirs, files in os.walk(local_path):
    for file in files:
        file_path = os.path.join(root, file)
        size = os.path.getsize(file_path)
        print(f"  {file}: {size / 1024:.2f} KB")
EOF
# python download_artifacts.py
Model downloaded to: /tmp/fgedu_classifier_v1

Downloaded files:
  MLmodel: 0.45 KB
  model.pkl: 5352.96 KB
  requirements.txt: 0.12 KB
  conda.yaml: 0.34 KB
# Clean up old model versions
# cat > cleanup_artifacts.py << 'EOF'
from mlflow.tracking import MlflowClient
from datetime import datetime, timedelta

client = MlflowClient(tracking_uri="http://192.168.1.100:5000")
model_name = "fgedu_classifier"
retention_days = 90
cutoff_date = datetime.now() - timedelta(days=retention_days)

print(f"Cleaning up artifacts older than {retention_days} days...")
print(f"Cutoff date: {cutoff_date}")

deleted_count = 0
for version in client.search_model_versions(f"name='{model_name}'"):
    # Only versions that are archived or were never assigned a stage are candidates
    if version.current_stage in ["Archived", "None"]:
        created_date = datetime.fromtimestamp(version.creation_timestamp / 1000)
        if created_date < cutoff_date:
            print(f"Deleting version {version.version} (created: {created_date})")
            # This removes the registry entry; the files under the artifact root
            # are typically left in place and need a separate cleanup
            client.delete_model_version(model_name, version.version)
            deleted_count += 1

print(f"\nDeleted {deleted_count} old versions")
EOF
# python cleanup_artifacts.py
Cleaning up artifacts older than 90 days...
Cutoff date: 2026-01-03 10:00:00

Deleted 0 old versions
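Because deletion is irreversible, it can help to separate the decision about which versions fall outside the retention window from the deletion itself, so that the decision can be dry-run and tested. The sketch below applies the same rule as the cleanup script (stage is Archived or None, and the version is older than the cutoff) to plain dictionaries; the function name `versions_to_delete` and the record layout are hypothetical.

```python
from datetime import datetime, timedelta


def versions_to_delete(versions, now, retention_days=90,
                       deletable_stages=("Archived", "None")):
    """Return the version numbers that the retention rule would remove."""
    cutoff_ms = (now - timedelta(days=retention_days)).timestamp() * 1000
    return [
        v["version"]
        for v in versions
        if v["stage"] in deletable_stages and v["created_ms"] < cutoff_ms
    ]


def _ms(moment):
    # Same unit as MLflow's creation_timestamp: milliseconds since the Unix epoch
    return moment.timestamp() * 1000


now = datetime(2026, 4, 3, 10, 0, 0)
sample = [
    {"version": "1", "stage": "Production", "created_ms": _ms(now - timedelta(days=200))},
    {"version": "2", "stage": "Archived", "created_ms": _ms(now - timedelta(days=120))},
    {"version": "3", "stage": "Archived", "created_ms": _ms(now - timedelta(days=10))},
    {"version": "4", "stage": "None", "created_ms": _ms(now - timedelta(days=100))},
]
print(versions_to_delete(sample, now))  # ['2', '4']
```

Printing this list first, and calling `delete_model_version` only after review, gives the cleanup a dry-run mode.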
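10. Production Best Practices
In production, model version management should follow explicit, written rules so that the registry stays stable and maintainable. The example below keeps those rules in a YAML file and checks a model version against them before promotion.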
# Production policy file (read by the model_lifecycle.py script in this section, not by MLflow itself)
# cat > /opt/mlflow/production_config.yaml << 'EOF'
model_registry:
  naming_convention:
    pattern: "^[a-z][a-z0-9_]*$"
    max_length: 100
  version_policy:
    max_versions: 50
    retention_days: 365
    archive_stages: ["Archived", "Staging"]
  stage_transitions:
    "None -> Staging":
      required_tags: ["tested", "reviewed"]
      required_metrics: ["accuracy", "f1_score"]
    "Staging -> Production":
      required_tags: ["approved", "performance_validated"]
      required_metrics: ["accuracy >= 0.90", "f1_score >= 0.90"]
      approval_required: true
    "Production -> Archived":
      approval_required: true
      reason_required: true
deployment:
  serving_platform: "kubernetes"
  replicas: 3
  resource_limits:
    cpu: "2000m"
    memory: "4Gi"
  health_check:
    endpoint: "/health"
    interval_seconds: 10
monitoring:
  performance_tracking: true
  drift_detection: true
  alert_threshold:
    accuracy_drop: 0.05
    latency_increase: 100
EOF
# Model version lifecycle script
# cat > model_lifecycle.py << 'EOF'
from mlflow.tracking import MlflowClient
import yaml

with open("/opt/mlflow/production_config.yaml") as f:
    config = yaml.safe_load(f)

client = MlflowClient(tracking_uri="http://192.168.1.100:5000")

def validate_model_for_production(model_name, version):
    model_version = client.get_model_version(model_name, version)
    run = client.get_run(model_version.run_id)
    rules = config["model_registry"]["stage_transitions"]["Staging -> Production"]

    # Every required tag must be present on the model version
    required_tags = rules["required_tags"]
    for tag in required_tags:
        if tag not in model_version.tags:
            print(f"Missing required tag: {tag}")
            return False

    # Every metric rule has the form "<name> <operator> <threshold>"
    required_metrics = rules["required_metrics"]
    for metric_req in required_metrics:
        metric_name, operator, threshold = metric_req.split()
        metric_value = run.data.metrics.get(metric_name, 0)
        threshold_value = float(threshold)
        if operator == ">=" and metric_value < threshold_value:
            print(f"Metric {metric_name} ({metric_value}) below threshold {threshold_value}")
            return False

    print("Model validation passed")
    return True

model_name = "fgedu_classifier"
version = "3"
if validate_model_for_production(model_name, version):
    print(f"Model {model_name} v{version} is ready for production deployment")
EOF
# python model_lifecycle.py
Model validation passed
Model fgedu_classifier v3 is ready for production deployment
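The validation script in this section handles only the `>=` operator, and a rule written with any other operator would pass silently. The policy file also contains bare metric names (in the "None -> Staging" rules), which mean "this metric must exist". A minimal sketch of a rule checker that covers both forms follows, using the standard `operator` module; the function name `check_requirement` is hypothetical.

```python
import operator

# Comparison operators accepted in a rule string
_OPS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def check_requirement(requirement, metrics):
    """Evaluate 'name' (presence) or 'name OP threshold' against a metrics dict."""
    parts = requirement.split()
    if len(parts) == 1:
        return parts[0] in metrics
    if len(parts) != 3 or parts[1] not in _OPS:
        raise ValueError(f"unsupported requirement: {requirement!r}")
    name, op, threshold = parts
    if name not in metrics:
        return False  # a missing metric can never satisfy a threshold
    return _OPS[op](metrics[name], float(threshold))


metrics = {"accuracy": 0.94, "f1_score": 0.94}
print(check_requirement("accuracy >= 0.90", metrics))  # True
print(check_requirement("recall", metrics))            # False
```

Raising on an unknown operator, rather than skipping the rule, keeps a typo in the policy file from weakening the promotion gate.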
This article was compiled and published by 风哥教程 for study and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html
