Kubernetes Tutorial FG077: Machine Learning on Kubernetes in Practice
Overview
This article covers deploying and managing machine learning workloads on Kubernetes. This 风哥教程 tutorial draws on the machine learning-related sections of the official Kubernetes documentation and, grounded in real production scenarios, walks through training, deploying, and managing machine learning models.
Table of Contents
Part01-Fundamental Concepts and Theory
1.1 Machine Learning Overview
Machine learning is a branch of artificial intelligence that lets computers learn from data to make predictions or decisions. Its main task types include:
- Supervised learning: learn patterns from labeled data
- Unsupervised learning: discover patterns in unlabeled data
- Reinforcement learning: learn an optimal policy by interacting with an environment
- Semi-supervised learning: combine labeled and unlabeled data
- Transfer learning: apply knowledge learned on one task to another
1.2 Advantages of Machine Learning on Kubernetes
Running machine learning workloads on Kubernetes offers:
- Elastic scaling: resources adjust automatically to workload demand
- Resource isolation: each machine learning task runs in an isolated environment
- Standardized deployment: containerization provides a consistent deployment environment
- Workflow management: native Kubernetes workload controllers manage training and inference jobs
- High availability: machine learning services stay highly available
- Cost optimization: resources are allocated on demand, reducing cost
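The elastic-scaling advantage above is typically realized with a HorizontalPodAutoscaler. The fragment below is a minimal, hypothetical sketch; the target Deployment name (ml-inference) and the 70% CPU target are illustrative assumptions, not values mandated by this tutorial:

```yaml
# Hypothetical HPA sketch: scale an inference Deployment between
# 2 and 10 replicas, targeting 70% average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-inference      # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

In practice, tune the replica bounds and target utilization to the traffic pattern of the inference service.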
Part02-Production Planning and Recommendations
2.1 Machine Learning Workload Planning
- Task types:
- Model training: compute-intensive; needs substantial CPU/GPU resources
- Model inference: latency-sensitive; needs fast response times
- Data preprocessing: I/O-intensive; needs ample storage and network bandwidth
- Resource needs:
- CPU: suited to traditional machine learning algorithms
- GPU: suited to deep learning algorithms
- Memory: enough to hold models and data
- Storage: high-capacity storage for training data and models
- Deployment strategy:
- Training jobs: use a Job or CronJob
- Inference services: use a Deployment or StatefulSet
- Distributed training: use a TFJob or PyTorchJob
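For recurring training runs, the CronJob option above can be sketched as follows. This is an illustrative fragment; the schedule and image name are assumptions, not part of a specific deployment:

```yaml
# Hypothetical CronJob sketch: retrain nightly at 02:00.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ml-retraining
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid      # never run two training jobs at once
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: trainer
            image: fgedu/ml-training:v1.0   # assumed image; adjust to your registry
            command: ["python", "/app/train.py"]
          restartPolicy: OnFailure
```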
2.2 Resource Configuration Recommendations
- CPU:
- Training jobs: 8-32 cores
- Inference services: 4-16 cores
- Data preprocessing: 16-64 cores
- GPU:
- Training jobs: 1-8 GPUs
- Inference services: 1-2 GPUs
- Memory:
- Training jobs: 32-256GB
- Inference services: 16-64GB
- Data preprocessing: 64-512GB
- Storage:
- Training data: 1TB or more
- Model storage: 100GB or more
- Use PersistentVolumes and StorageClasses
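The PersistentVolume/StorageClass recommendation can be sketched as below. The provisioner and its parameters are cloud-specific assumptions (an AWS EBS CSI driver is shown for illustration); substitute whichever CSI driver your cluster actually runs:

```yaml
# Sketch of a StorageClass for dynamic provisioning; provisioner is an assumption.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ml-fast
provisioner: ebs.csi.aws.com     # example CSI driver; cluster-specific
parameters:
  type: gp3
reclaimPolicy: Retain            # keep training data even if the PVC is deleted
allowVolumeExpansion: true
---
# A PVC that requests a volume from the class above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: train-data-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: ml-fast
  resources:
    requests:
      storage: 1Ti
```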
Part03-Production Implementation
3.1 Deploying a Machine Learning Training Job
Deploy a machine learning training job:
Create the training Job
# Create the training Job manifest
[root@fgedu-master ~]# cat > ml-training-job.yaml << EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-training
  namespace: fgedu-ml
spec:
  template:
    spec:
      containers:
      - name: ml-training
        image: fgedu/ml-training:v1.0
        command: ["python", "/app/train.py"]
        resources:
          requests:
            cpu: 8
            memory: 32Gi
            nvidia.com/gpu: 1
          limits:
            cpu: 16
            memory: 64Gi
            nvidia.com/gpu: 1
        volumeMounts:
        - name: data-volume
          mountPath: /data
        - name: model-volume
          mountPath: /models
      volumes:
      - name: data-volume
        persistentVolumeClaim:
          claimName: data-pvc
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc
      restartPolicy: OnFailure
EOF
# Apply the training Job manifest
[root@fgedu-master ~]# kubectl apply -f ml-training-job.yaml
job.batch/ml-training created
Check the training status
# Check the training Pod status
[root@fgedu-master ~]# kubectl get pods -n fgedu-ml
NAME                READY   STATUS    RESTARTS   AGE
ml-training-2k8x9   1/1     Running   0          10m
View the training logs
# View the training logs
[root@fgedu-master ~]# kubectl logs -f ml-training-2k8x9 -n fgedu-ml
2023-10-01 10:00:00,000 - INFO - Starting training...
2023-10-01 10:00:01,000 - INFO - Loading data from /data/train.csv
2023-10-01 10:00:10,000 - INFO - Data loaded successfully
2023-10-01 10:00:10,000 - INFO - Preprocessing data...
2023-10-01 10:00:20,000 - INFO - Data preprocessed successfully
2023-10-01 10:00:20,000 - INFO - Training model...
2023-10-01 10:05:00,000 - INFO - Epoch 1/10: loss=0.4567, accuracy=0.85
2023-10-01 10:10:00,000 - INFO - Epoch 2/10: loss=0.3456, accuracy=0.88
2023-10-01 10:15:00,000 - INFO - Epoch 3/10: loss=0.2345, accuracy=0.91
2023-10-01 10:20:00,000 - INFO - Epoch 4/10: loss=0.1234, accuracy=0.94
2023-10-01 10:25:00,000 - INFO - Epoch 5/10: loss=0.0987, accuracy=0.96
2023-10-01 10:30:00,000 - INFO - Epoch 6/10: loss=0.0765, accuracy=0.97
2023-10-01 10:35:00,000 - INFO - Epoch 7/10: loss=0.0543, accuracy=0.98
2023-10-01 10:40:00,000 - INFO - Epoch 8/10: loss=0.0321, accuracy=0.99
2023-10-01 10:45:00,000 - INFO - Epoch 9/10: loss=0.0210, accuracy=0.99
2023-10-01 10:50:00,000 - INFO - Epoch 10/10: loss=0.0100, accuracy=0.99
2023-10-01 10:50:00,000 - INFO - Training completed
2023-10-01 10:50:05,000 - INFO - Saving model to /models/model.pth
2023-10-01 10:50:10,000 - INFO - Model saved successfully
3.2 Deploying a Machine Learning Model
Deploy the machine learning model as an inference service.
Create the model Deployment
# Create the model Deployment manifest
[root@fgedu-master ~]# cat > ml-inference-deployment.yaml << EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
  namespace: fgedu-ml
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      containers:
      - name: ml-inference
        image: fgedu/ml-inference:v1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 4
            memory: 16Gi
            nvidia.com/gpu: 1
          limits:
            cpu: 8
            memory: 32Gi
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc
EOF
# Apply the model Deployment manifest
[root@fgedu-master ~]# kubectl apply -f ml-inference-deployment.yaml
deployment.apps/ml-inference created
Create the Service
# Create the Service manifest
[root@fgedu-master ~]# cat > ml-inference-service.yaml << EOF
apiVersion: v1
kind: Service
metadata:
  name: ml-inference
  namespace: fgedu-ml
spec:
  selector:
    app: ml-inference
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
EOF
# Apply the Service manifest
[root@fgedu-master ~]# kubectl apply -f ml-inference-service.yaml
service/ml-inference created
Test the model service
# Test the model service
[root@fgedu-master ~]# curl -X POST http://ml-inference.fgedu-ml.svc.cluster.local/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [1.2, 3.4, 5.6, 7.8]}'
{"prediction": 0.95, "confidence": 0.98}
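Before relying on this service in production, it is usually worth adding liveness and readiness probes so that traffic only reaches a Pod once its model has finished loading. The fragment below is a sketch of container-level fields for the inference container; the /healthz and /ready endpoint paths are assumptions about the serving image, not documented routes:

```yaml
# Probe sketch for the inference container; endpoint paths are assumed.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30    # allow time for model loading
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 5
```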
3.3 Machine Learning Workflow Management
Manage machine learning workflows with Kubeflow.
Install Kubeflow
# Install Kubeflow
[root@fgedu-master ~]# curl -s https://raw.githubusercontent.com/kubeflow/kfctl/master/kfctl.sh | bash -s -- init kfctl_k8s_istio.v1.2.0.yaml
[root@fgedu-master ~]# ./kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml
Create a Pipeline
# Create the Pipeline manifest
[root@fgedu-master ~]# cat > ml-pipeline.yaml << EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: ml-pipeline
  templates:
  - name: ml-pipeline
    dag:
      tasks:
      - name: data-preprocessing
        template: data-preprocessing
      - name: model-training
        template: model-training
        dependencies: [data-preprocessing]
      - name: model-evaluation
        template: model-evaluation
        dependencies: [model-training]
      - name: model-deployment
        template: model-deployment
        dependencies: [model-evaluation]
  - name: data-preprocessing
    container:
      image: fgedu/data-preprocessing:v1.0
      command: ["python", "/app/preprocess.py"]
      volumeMounts:
      - name: data-volume
        mountPath: /data
  - name: model-training
    container:
      image: fgedu/model-training:v1.0
      command: ["python", "/app/train.py"]
      resources:
        requests:
          cpu: 8
          memory: 32Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 16
          memory: 64Gi
          nvidia.com/gpu: 1
      volumeMounts:
      - name: data-volume
        mountPath: /data
      - name: model-volume
        mountPath: /models
  - name: model-evaluation
    container:
      image: fgedu/model-evaluation:v1.0
      command: ["python", "/app/evaluate.py"]
      volumeMounts:
      - name: data-volume
        mountPath: /data
      - name: model-volume
        mountPath: /models
  - name: model-deployment
    container:
      image: fgedu/model-deployment:v1.0
      command: ["python", "/app/deploy.py"]
      volumeMounts:
      - name: model-volume
        mountPath: /models
  volumes:
  - name: data-volume
    persistentVolumeClaim:
      claimName: data-pvc
  - name: model-volume
    persistentVolumeClaim:
      claimName: model-pvc
EOF
# Submit the Pipeline
[root@fgedu-master ~]# kubectl create -f ml-pipeline.yaml
Part04-Production Cases and Hands-On Walkthroughs
4.1 Deploying an Enterprise Machine Learning Platform
An enterprise needs a company-wide machine learning platform for model training and inference services.
Case background
- Platform requirements:
- Support multiple machine learning frameworks (TensorFlow, PyTorch, scikit-learn)
- Provide model training and inference services
- Support distributed training
- Provide model management and version control
- Integrate monitoring and logging
- Technology stack:
- Kubernetes
- Kubeflow
- TensorFlow/PyTorch
- Prometheus/Grafana
- ELK/EFK
Deployment plan
# 1. Prepare the environment
# Create the namespace
kubectl create namespace fgedu-ml-platform
# Create storage
kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  namespace: fgedu-ml-platform
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: standard
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: fgedu-ml-platform
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: standard
EOF

# 2. Install Kubeflow
curl -s https://raw.githubusercontent.com/kubeflow/kfctl/master/kfctl.sh | bash -s -- init kfctl_k8s_istio.v1.2.0.yaml
./kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml

# 3. Deploy the model training job
kubectl apply -f - << EOF
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tensorflow-training
  namespace: fgedu-ml-platform
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: 4
                memory: 16Gi
                nvidia.com/gpu: 1
              limits:
                cpu: 8
                memory: 32Gi
                nvidia.com/gpu: 1
            volumeMounts:
            - name: data-volume
              mountPath: /data
            - name: model-volume
              mountPath: /models
          volumes:
          - name: data-volume
            persistentVolumeClaim:
              claimName: data-pvc
          - name: model-volume
            persistentVolumeClaim:
              claimName: model-pvc
          restartPolicy: OnFailure
EOF

# 4. Deploy the model serving service
kubectl apply -f - << EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-serving
  namespace: fgedu-ml-platform
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tensorflow-serving
  template:
    metadata:
      labels:
        app: tensorflow-serving
    spec:
      containers:
      - name: tensorflow-serving
        image: tensorflow/serving:2.8.0-gpu
        ports:
        - containerPort: 8501
        resources:
          requests:
            cpu: 4
            memory: 16Gi
            nvidia.com/gpu: 1
          limits:
            cpu: 8
            memory: 32Gi
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-volume
          mountPath: /models
        env:
        - name: MODEL_NAME
          value: "model"
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-serving
  namespace: fgedu-ml-platform
spec:
  selector:
    app: tensorflow-serving
  ports:
  - port: 80
    targetPort: 8501
  type: ClusterIP
EOF

# 5. Deploy monitoring and logging
# Install Prometheus and Grafana
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
# Install the ELK stack
helm repo add elastic https://helm.elastic.co
helm install elasticsearch elastic/elasticsearch --namespace logging --create-namespace
helm install kibana elastic/kibana --namespace logging
helm install filebeat elastic/filebeat --namespace logging

# 6. Verify the deployment
kubectl get pods -n fgedu-ml-platform
kubectl get services -n fgedu-ml-platform
kubectl get tfjob -n fgedu-ml-platform
4.2 Deep Learning Model Training and Deployment in Practice
An enterprise needs to train and deploy a deep learning model for an image classification task.
Case background
- Task: image classification
- Dataset: CIFAR-10
- Model: ResNet50
- Framework: PyTorch
- Hardware: GPU cluster
Deployment plan
# 1. Prepare the environment
[root@fgedu-master ~]# kubectl create namespace fgedu-dl
# 2. Create storage
[root@fgedu-master ~]# kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  namespace: fgedu-dl
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: standard
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: fgedu-dl
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: standard
EOF
# 3. Deploy the training job
[root@fgedu-master ~]# cat > pytorch-training.yaml << EOF
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-training
  namespace: fgedu-dl
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: 4
                memory: 16Gi
                nvidia.com/gpu: 1
              limits:
                cpu: 8
                memory: 32Gi
                nvidia.com/gpu: 1
            volumeMounts:
            - name: data-volume
              mountPath: /data
            - name: model-volume
              mountPath: /models
          volumes:
          - name: data-volume
            persistentVolumeClaim:
              claimName: data-pvc
          - name: model-volume
            persistentVolumeClaim:
              claimName: model-pvc
          restartPolicy: OnFailure
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: 4
                memory: 16Gi
                nvidia.com/gpu: 1
              limits:
                cpu: 8
                memory: 32Gi
                nvidia.com/gpu: 1
            volumeMounts:
            - name: data-volume
              mountPath: /data
            - name: model-volume
              mountPath: /models
          volumes:
          - name: data-volume
            persistentVolumeClaim:
              claimName: data-pvc
          - name: model-volume
            persistentVolumeClaim:
              claimName: model-pvc
          restartPolicy: OnFailure
EOF
# Apply the training manifest
[root@fgedu-master ~]# kubectl apply -f pytorch-training.yaml
# 4. Deploy the inference service
[root@fgedu-master ~]# cat > pytorch-inference.yaml << EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-inference
  namespace: fgedu-dl
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pytorch-inference
  template:
    metadata:
      labels:
        app: pytorch-inference
    spec:
      containers:
      - name: pytorch-inference
        image: fgedu/pytorch-inference:v1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 4
            memory: 16Gi
            nvidia.com/gpu: 1
          limits:
            cpu: 8
            memory: 32Gi
            nvidia.com/gpu: 1
        volumeMounts:
        - name: model-volume
          mountPath: /models
      volumes:
      - name: model-volume
        persistentVolumeClaim:
          claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: pytorch-inference
  namespace: fgedu-dl
spec:
  selector:
    app: pytorch-inference
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
EOF
# Apply the inference service manifest
[root@fgedu-master ~]# kubectl apply -f pytorch-inference.yaml
# 5. Test the inference service
[root@fgedu-master ~]# curl -X POST http://pytorch-inference.fgedu-dl.svc.cluster.local/predict \
  -H "Content-Type: application/json" \
  -d '{"image": "base64-encoded-image"}'
{"class": "cat", "confidence": 0.98}
Part05-Experience Summary and Takeaways
5.1 Best Practices for Machine Learning on Kubernetes
- Resource management:
- Set reasonable resource requests and limits for training jobs and inference services
- Use node affinity to schedule GPU workloads onto GPU nodes
- Configure ResourceQuotas and LimitRanges
- Storage management:
- Use PersistentVolumes for training data and models
- Choose an appropriate storage type (SSD, HDD)
- Configure StorageClasses for dynamic volume provisioning
- Workflow management:
- Use Kubeflow to manage machine learning workflows
- Use Pipelines to orchestrate training and deployment
- Implement a CI/CD pipeline to build and deploy models automatically
- Monitoring and logging:
- Use Prometheus to monitor resource usage
- Use Grafana to build monitoring dashboards
- Use ELK/EFK to aggregate and analyze logs
- Set alerting rules to monitor model performance
- Security:
- Use RBAC to control access to machine learning resources
- Use Secrets for sensitive data (API keys, passwords)
- Configure NetworkPolicies to isolate machine learning workloads
- Model management:
- Implement model version control
- Manage the model lifecycle
- Implement model A/B testing
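The GPU node-affinity practice above can be sketched as Pod spec fields. This is a hedged illustration: the accelerator=nvidia-gpu node label and the nvidia.com/gpu taint are assumed cluster conventions, not Kubernetes built-ins:

```yaml
# Sketch: schedule GPU workloads only onto nodes labeled accelerator=nvidia-gpu.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: accelerator
          operator: In
          values:
          - nvidia-gpu
tolerations:
- key: nvidia.com/gpu      # GPU nodes are often tainted; tolerate that taint
  operator: Exists
  effect: NoSchedule
```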
5.2 Common Problems and Solutions
Problem                      Cause                                       Solution
Insufficient GPU resources   Too few GPU nodes in the cluster            Add more GPU nodes
Slow training                Insufficient resources                      Increase CPU/GPU allocation
Model deployment failure     Model file missing or corrupt               Check the model file
High inference latency       Insufficient resources or oversized model   Increase resources or optimize the model
Insufficient data storage    Storage capacity exhausted                  Increase storage capacity
Workflow execution failure   Dependency errors or misconfiguration       Check dependencies and configuration
Missing monitoring data      Monitoring misconfigured                    Check the monitoring configuration
Unmanageable logs            Logs scattered across nodes                 Use centralized log management
This article was compiled and published by 风哥教程 for learning and testing purposes only. When reposting, please credit the source: http://www.fgedu.net.cn/10327.html
