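Part01-Basic Concepts and Theory
1.1 Overview of AI/Machine Learning Platforms
An AI/machine learning platform is the infrastructure and toolset used to develop, train, deploy, and manage machine learning models. It supports the full workflow from data processing and model training through model deployment, helping data scientists and engineers build and ship AI applications more efficiently.
The core functions of an AI/machine learning platform include:
- Data management and processing
- Model training and tuning
- Model deployment and serving
- Model monitoring and management
- Resource management and scheduling
1.2 Using K8s for AI/Machine Learning
Kubernetes is used in AI/machine learning mainly in the following ways:
- Resource management: K8s manages accelerator hardware such as GPUs and TPUs through device plugins and extended resources
- Elastic scaling: resources are adjusted automatically to match the needs of training jobs
- Job scheduling: training jobs are scheduled onto nodes with free capacity, which raises resource utilization
- Service deployment: model services can be deployed and managed quickly
- Environment isolation: different training jobs and model services run in isolated environments
1.3 Common AI/Machine Learning Frameworks
Commonly used AI/machine learning frameworks include:
- TensorFlow: an open-source machine learning framework developed by Google, with support for distributed training
- PyTorch: an open-source machine learning framework developed by Facebook (now Meta); its dynamic computation graph is a defining feature
- Keras: a high-level neural network API; Keras 2.x could run on top of TensorFlow, Theano, or CNTK, while Keras 3 targets TensorFlow, JAX, and PyTorch
- Scikit-learn: a Python library for machine learning that bundles many classical algorithms
- MXNet: an Apache open-source deep learning framework with bindings for multiple programming languages (the project was retired to the Apache Attic in 2023)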
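To make the resource-management point concrete, here is a minimal sketch of how a workload asks K8s for a GPU. It assumes the NVIDIA device plugin is installed so that the nvidia.com/gpu extended resource exists; the function name gpu_pod_manifest, the container name trainer, and the pod name gpu-demo are hypothetical and chosen only for illustration.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, namespace="default"):
    """Build a minimal Pod manifest that requests `gpus` NVIDIA GPUs.

    Extended resources such as nvidia.com/gpu are declared under `limits`;
    the scheduler then only places the Pod on a node with enough free GPUs.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",  # hypothetical container name
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so the output can be piped
    # into `kubectl apply -f -`.
    manifest = gpu_pod_manifest("gpu-demo", "tensorflow/tensorflow:2.8.0-gpu")
    print(json.dumps(manifest, indent=2))
```

Generating the manifest from code rather than editing YAML by hand is only one option; it is handy when the GPU count varies per job.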
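Part02-Production Planning and Recommendations
2.1 Hardware Resource Planning
Hardware planning for an AI/machine learning platform should take the following into account:
- CPU: choose multi-core CPUs with high clock speeds; they carry data preprocessing and model evaluation
- GPU: choose high-performance GPUs such as the NVIDIA A100 or H100; they carry model training
- Memory: provision large memory capacity, especially for large datasets
- Storage: choose fast storage such as NVMe SSDs to speed up data reads and writes
- Network: choose a high-bandwidth network to carry the communication traffic of distributed training
2.2 Platform Architecture Design
The architecture of an AI/machine learning platform should take the following into account:
- Layered architecture: data layer, compute layer, service layer, application layer
- Component selection: choose suitable open-source components such as Kubeflow and MLflow
- Scalability: support horizontal scaling to meet growing compute demand
- Reliability: build for high availability so the platform runs stably
- Maintainability: define clear component boundaries to simplify maintenance and upgrades
2.3 Storage and Network Planning
Storage planning:
- Use a distributed storage system such as Ceph or GlusterFS to hold large datasets
- Choose a storage option that fits each type of data:
  - Training data: fast storage, such as NVMe SSDs
  - Model files: reliable storage, such as object storage
  - Logs and monitoring data: elastic storage, such as NAS
Network planning:
- Use a high-speed network, such as 100 Gbps Ethernet or InfiniBand, to support distributed training
- Apply network isolation so that different training jobs and model services get separate network environments
- Configure network QoS to guarantee bandwidth for critical jobs
Fengge's tip: on an AI/machine learning platform, managing and scheduling GPU resources is the key concern; plan and configure them carefully to raise resource utilization.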
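The bandwidth recommendation above can be checked with some rough arithmetic. The sketch below assumes gradients are synchronized with a bandwidth-optimal ring all-reduce, in which each worker transfers about 2(N-1)/N times the gradient size per step; it ignores latency, protocol overhead, and any overlap with computation, so treat the result as a lower bound. The function name ring_allreduce_seconds and the 1 GB gradient size are assumptions made for illustration.

```python
def ring_allreduce_seconds(grad_bytes, workers, link_gbps):
    """Estimate the time for one ring all-reduce of `grad_bytes` bytes.

    Each worker sends (and receives) 2 * (N - 1) / N * grad_bytes bytes,
    so the per-step time is that volume divided by the link bandwidth.
    """
    if workers < 2:
        return 0.0  # a single worker has nothing to synchronize
    bytes_per_worker = 2 * (workers - 1) / workers * grad_bytes
    link_bytes_per_second = link_gbps * 1e9 / 8  # Gbit/s -> bytes/s
    return bytes_per_worker / link_bytes_per_second


if __name__ == "__main__":
    grad_bytes = 1e9  # assumed: about 250 million float32 parameters
    for gbps in (10, 25, 100):
        seconds = ring_allreduce_seconds(grad_bytes, workers=4, link_gbps=gbps)
        print(f"{gbps:>3} Gbps: {seconds:.2f} s per gradient exchange")
```

Under these assumptions, a 10 Gbps link spends about ten times longer per gradient exchange than a 100 Gbps link, which is why the network tends to become the bottleneck as the worker count grows.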
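Part03-Production Implementation Plan
The implementation below uses Kubeflow as the example:
3.1 Installing Kubeflow
# Fetch the kfctl source at the v1.2.0 tag
$ git clone https://github.com/kubeflow/kfctl.git
$ cd kfctl
$ git checkout v1.2.0
# Set up the installation environment
# (assumes the kfctl binary has been built or downloaded into ./bin)
$ export PATH=$PATH:$PWD/bin
$ export KF_NAME=kubeflow
$ export BASE_DIR=/opt/kubeflow
$ export KF_DIR=${BASE_DIR}/${KF_NAME}
$ mkdir -p ${KF_DIR}
$ cd ${KF_DIR}
# Download the Kubeflow configuration
# (kfctl v1.x uses "build -f"; the "init --config" form belongs to older releases)
$ export CONFIG_URI=https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml
$ kfctl build -V -f ${CONFIG_URI}
# Deploy Kubeflow
$ kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml
# Verify the Kubeflow installation
$ kubectl get pods -n kubeflow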
NAME READY STATUS RESTARTS AGE
admission-webhook-deployment-567890abc-12345 1/1 Running 0 10m
cache-deployer-deployment-567890abc-67890 1/1 Running 0 10m
centraldashboard-567890abc-abcde 1/1 Running 0 10m
jupyter-web-app-deployment-567890abc-fghij 1/1 Running 0 10m
katib-controller-567890abc-ijklm 1/1 Running 0 10m
metacontroller-567890abc-nopqr 1/1 Running 0 10m
metadata-567890abc-stuvw 1/1 Running 0 10m
minio-567890abc-xyzab 1/1 Running 0 10m
mysql-567890abc-cdefg 1/1 Running 0 10m
notebook-controller-deployment-567890abc-hijkl 1/1 Running 0 10m
profile-controller-567890abc-mnopq 1/1 Running 0 10m
pytorch-operator-567890abc-qrstu 1/1 Running 0 10m
resource-driver-deployment-567890abc-vwxyz 1/1 Running 0 10m
tensorboard-controller-deployment-567890abc-12345 1/1 Running 0 10m
tf-job-operator-567890abc-67890 1/1 Running 0 10m
workflow-controller-567890abc-abcde 1/1 Running 0 10m
# Check the Kubeflow ingress gateway service (it runs in the istio-system namespace)
$ kubectl get svc -n istio-system | grep istio-ingressgateway
istio-ingressgateway LoadBalancer 10.96.78.90
3.2 Configuring GPU Support
# Install the NVIDIA driver (requires the NVIDIA package repository to be configured)
$ sudo dnf install -y nvidia-driver
# Install the NVIDIA container runtime
$ sudo dnf install -y nvidia-container-runtime
# Configure Docker to use the NVIDIA runtime
$ sudo tee /etc/docker/daemon.json << 'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
# Restart Docker
$ sudo systemctl restart docker
# Install the NVIDIA device plugin
$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.2/nvidia-device-plugin.yml
# Verify the GPU device plugin
$ kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-567890abc-12345 1/1 Running 0 5m
nvidia-device-plugin-daemonset-567890abc-67890 1/1 Running 0 5m
# Verify the GPU resources
$ kubectl get nodes -o json | jq '.items[].status.allocatable'
{
"cpu": "32",
"ephemeral-storage": "100Gi",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "128Gi",
"nvidia.com/gpu": "2",
"pods": "110"
}
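The same check can be scripted when a per-node summary is more useful than raw JSON. The sketch below parses the output of kubectl get nodes -o json and reports the allocatable nvidia.com/gpu count for each node; the function name allocatable_gpus is hypothetical, and nodes without the device plugin are reported as 0.

```python
import json
import subprocess


def allocatable_gpus(nodes_json):
    """Map each node name to its allocatable nvidia.com/gpu count."""
    result = {}
    for item in nodes_json.get("items", []):
        name = item["metadata"]["name"]
        allocatable = item.get("status", {}).get("allocatable", {})
        # The key is absent on nodes where the device plugin is not running.
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


if __name__ == "__main__":
    raw = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    for node, count in allocatable_gpus(json.loads(raw)).items():
        print(f"{node}: {count} GPU(s)")
```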
3.3 Deploying a Jupyter Notebook
$ kubectl apply -f - << 'EOF'
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: fgedu-notebook
  namespace: kubeflow-user-example-com
spec:
  template:
    spec:
      containers:
      - name: notebook
        image: kubeflownotebookswg/jupyter-tensorflow-full:v1.2.0
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: notebook-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: notebook-data
  namespace: kubeflow-user-example-com
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
EOF
# Check the Notebook status
$ kubectl get notebooks -n kubeflow-user-example-com
NAME STATUS AGE
fgedu-notebook Running 5m
# Get the Notebook access URL
$ kubectl get svc -n istio-system | grep istio-ingressgateway
istio-ingressgateway LoadBalancer 10.96.78.90 192.168.1.100 15020:32400/TCP,80:30880/TCP,443:30443/TCP,31400:31400/TCP,15443:32281/TCP 15m
# Access URL: http://192.168.1.100:30880
3.4 Deploying a Model Training Job
$ kubectl apply -f - << 'EOF'
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: fgedu-tf-training
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "2"
                memory: "8Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "1"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data
          - name: code
            configMap:
              name: training-code
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
  namespace: kubeflow-user-example-com
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: training-code
  namespace: kubeflow-user-example-com
data:
  train.py: |
    import tensorflow as tf
    from tensorflow.keras.datasets import mnist
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten
    from tensorflow.keras.layers import Conv2D, MaxPooling2D
    from tensorflow.keras import backend as K

    # Load the data
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # Preprocess the data
    if K.image_data_format() == 'channels_first':
        x_train = x_train.reshape(x_train.shape[0], 1, 28, 28)
        x_test = x_test.reshape(x_test.shape[0], 1, 28, 28)
        input_shape = (1, 28, 28)
    else:
        x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
        x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
        input_shape = (28, 28, 1)
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255

    # Convert the labels to one-hot vectors
    y_train = tf.keras.utils.to_categorical(y_train, 10)
    y_test = tf.keras.utils.to_categorical(y_test, 10)

    # Build the model
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3),
                     activation='relu',
                     input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10, activation='softmax'))

    # Compile the model
    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=tf.keras.optimizers.Adadelta(),
                  metrics=['accuracy'])

    # Train the model
    model.fit(x_train, y_train,
              batch_size=128,
              epochs=10,
              verbose=1,
              validation_data=(x_test, y_test))

    # Evaluate the model
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

    # Save the model
    model.save('/data/mnist_model.h5')
EOF
# Check the training job status
$ kubectl get tfjobs -n kubeflow-user-example-com
NAME STATUS AGE
fgedu-tf-training Running 5m
# Follow the training logs (TFJob pods are named <job-name>-<replica-type>-<index>)
$ kubectl logs -f fgedu-tf-training-worker-0 -n kubeflow-user-example-com
3.5 Deploying the Model Service
# Note: the TensorFlow predictor is backed by TensorFlow Serving, which loads the
# SavedModel format, so the trained model must be exported as a SavedModel and
# placed in the model-storage PVC before the service can become ready.
$ kubectl apply -f - << 'EOF'
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: mnist-model
  namespace: kubeflow-user-example-com
spec:
  default:
    predictor:
      tensorflow:
        storageUri: pvc://model-storage/mnist_model.h5
        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
          limits:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "1"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: kubeflow-user-example-com
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
# Check the model service status
$ kubectl get inferenceservices -n kubeflow-user-example-com
NAME URL READY AGE
mnist-model http://mnist-model.kubeflow-user-example-com.fgedu.net.cn True 5m
# Test the model service with one all-zero 28x28x1 image (the MNIST input shape)
$ python3 -c 'import json; print(json.dumps({"instances": [[[[0.0]] * 28] * 28]}))' > input.json
$ curl -X POST http://mnist-model.kubeflow-user-example-com.fgedu.net.cn/v1/models/mnist-model:predict -d @input.json
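The same prediction request can be sent from Python, which avoids typing out the 28x28 array by hand. This is a minimal sketch using only the standard library; the helper names zero_image_payload and predict are hypothetical. It assumes the endpoint follows the TensorFlow Serving REST convention of taking an "instances" list, as the curl call above does.

```python
import json
import urllib.request


def zero_image_payload(height=28, width=28, channels=1):
    """Build a predict request body holding one all-zero image."""
    image = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    return {"instances": [image]}


def predict(url, payload, timeout=10.0):
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # URL taken from the InferenceService status shown above.
    url = ("http://mnist-model.kubeflow-user-example-com.fgedu.net.cn"
           "/v1/models/mnist-model:predict")
    print(predict(url, zero_image_payload()))
```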
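Part04-Production Cases and Hands-On Walkthroughs
4.1 Enterprise AI Platform Case
One enterprise's AI platform looks like this in practice:
- Technology stack: Kubernetes + Kubeflow + TensorFlow + PyTorch
- Hardware: an NVIDIA A100 GPU cluster with fast NVMe storage
- Platform functions: data processing, model training, model deployment, model monitoring
- Use cases: image recognition, natural language processing, predictive analytics
- Performance optimization: distributed training and tuning for GPU utilization
4.2 Large-Scale Model Training Case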
$ kubectl apply -f - << 'EOF'
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: distributed-tf-training
  namespace: kubeflow-user-example-com
spec:
  # Each replica type has its own pod template, so the volumes are declared in all three.
  tfReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "8"
                memory: "32Gi"
                nvidia.com/gpu: "1"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data
          - name: code
            configMap:
              name: distributed-training-code
    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "8"
                memory: "32Gi"
                nvidia.com/gpu: "1"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data
          - name: code
            configMap:
              name: distributed-training-code
    PS:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "4"
                memory: "16Gi"
              limits:
                cpu: "8"
                memory: "32Gi"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data
          - name: code
            configMap:
              name: distributed-training-code
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: distributed-training-code
  namespace: kubeflow-user-example-com
data:
  train.py: |
    import tensorflow as tf
    from tensorflow.keras.datasets import cifar10
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten
    from tensorflow.keras.layers import Conv2D, MaxPooling2D
    from tensorflow.keras import backend as K

    # Distribution strategy
    # Note: MirroredStrategy only synchronizes the GPUs inside one pod. Synchronizing
    # across the TFJob replicas requires a multi-worker or parameter-server strategy.
    strategy = tf.distribute.MirroredStrategy()

    # Load the data
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()

    # Preprocess the data
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255

    # Convert the labels to one-hot vectors
    y_train = tf.keras.utils.to_categorical(y_train, 10)
    y_test = tf.keras.utils.to_categorical(y_test, 10)

    # Build and compile the model inside the strategy scope
    with strategy.scope():
        model = Sequential()
        model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
        model.add(Conv2D(64, (3, 3), activation='relu'))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Dropout(0.25))
        model.add(Flatten())
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.5))
        model.add(Dense(10, activation='softmax'))
        model.compile(loss=tf.keras.losses.categorical_crossentropy,
                      optimizer=tf.keras.optimizers.Adadelta(),
                      metrics=['accuracy'])

    # Train the model
    model.fit(x_train, y_train,
              batch_size=128,
              epochs=50,
              verbose=1,
              validation_data=(x_test, y_test))

    # Evaluate the model
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

    # Save the model
    model.save('/data/cifar10_model.h5')
EOF
# Check the training job status
$ kubectl get tfjobs -n kubeflow-user-example-com
NAME STATUS AGE
distributed-tf-training Running 10m
# Follow the training logs (TFJob pods are named <job-name>-<replica-type>-<index>)
$ kubectl logs -f distributed-tf-training-master-0 -n kubeflow-user-example-com
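Every replica in a TFJob runs the same train.py, so the script needs a way to tell which role it has been given. The TFJob operator passes this through the TF_CONFIG environment variable, a JSON document with a cluster section (role to host list) and a task section (this replica's type and index). The sketch below only decodes that variable with the standard library; the function name describe_task and the "standalone" fallback are assumptions for illustration, not part of any TensorFlow API.

```python
import json
import os


def describe_task(tf_config_raw):
    """Summarize this replica's role from a TF_CONFIG JSON string."""
    if not tf_config_raw:
        # Outside a TFJob the variable is normally unset.
        return {"type": "standalone", "index": 0, "replicas": 1}
    config = json.loads(tf_config_raw)
    task = config.get("task", {})
    cluster = config.get("cluster", {})
    return {
        "type": task.get("type", "standalone"),
        "index": int(task.get("index", 0)),
        # Total number of replicas across all roles in the cluster spec.
        "replicas": sum(len(hosts) for hosts in cluster.values()),
    }


if __name__ == "__main__":
    info = describe_task(os.environ.get("TF_CONFIG", ""))
    print(f"role={info['type']} index={info['index']} replicas={info['replicas']}")
```

A training script could branch on the reported type, for example letting only index 0 of the master role save the final model so that replicas do not overwrite each other's files.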
4.3 AI Model Monitoring and Management
# Deploy the model monitoring service
$ kubectl apply -f - << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-monitor
  namespace: kubeflow-user-example-com
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-monitor
  template:
    metadata:
      labels:
        app: model-monitor
    spec:
      containers:
      - name: model-monitor
        image: harbor.fgedu.net.cn/library/model-monitor:v1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: PROMETHEUS_URL
          value: "http://prometheus.kubeflow.svc:9090"
        - name: GRAFANA_URL
          value: "http://grafana.kubeflow.svc:3000"
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: model-monitor
  namespace: kubeflow-user-example-com
spec:
  selector:
    app: model-monitor
  ports:
  - port: 80
    targetPort: 8080
  type: NodePort
EOF
# Configure the Prometheus alerting rules
$ kubectl apply -f - << 'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: model-monitoring-rules
  namespace: kubeflow
spec:
  groups:
  - name: model-monitoring
    rules:
    - alert: ModelAccuracyDrop
      expr: model_accuracy{job="model-monitor"} < 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Model accuracy dropped"
        description: "Accuracy of model {{ $labels.model }} is below 80%"
    - alert: ModelLatencyHigh
      expr: model_inference_latency{job="model-monitor"} > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Model inference latency too high"
        description: "Inference latency of model {{ $labels.model }} exceeds 1000ms"
EOF
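The alert rules above only fire if something exposes the model_accuracy and model_inference_latency series. The source does not show how the model-monitor image does this, so the following is a sketch of one possible approach: rendering gauges in the Prometheus text exposition format with the standard library. The function names render_gauge and _escape are hypothetical; the metric names and the model label come from the rules above, while the job label is attached by Prometheus itself at scrape time.

```python
def _escape(value):
    """Escape a label value as the text exposition format requires."""
    return (str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n"))


def render_gauge(name, value, labels=None, help_text=""):
    """Render one gauge sample in the Prometheus text exposition format."""
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} gauge")
    label_part = ""
    if labels:
        pairs = ",".join(f'{key}="{_escape(val)}"' for key, val in sorted(labels.items()))
        label_part = "{" + pairs + "}"
    lines.append(f"{name}{label_part} {float(value)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The values below are placeholders, not measurements.
    body = render_gauge("model_accuracy", 0.93, {"model": "mnist-model"})
    body += render_gauge("model_inference_latency", 42, {"model": "mnist-model"})
    print(body, end="")
```

Serving this text from an HTTP /metrics endpoint on port 8080 would let Prometheus scrape it. In practice the official prometheus_client library is the more usual choice, and the hand-rolled renderer here is only meant to show what the scraped payload looks like.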
4.4 AI Platform Automation Script
#!/bin/bash
# ai_platform.sh
# Deploy a Jupyter Notebook
deploy_notebook() {
  local name=$1
  local namespace=$2
  local gpu=$3
  echo "=== Deploying Jupyter Notebook: $name ==="
  kubectl apply -f - << EOF
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: $name
  namespace: $namespace
spec:
  template:
    spec:
      containers:
      - name: notebook
        image: kubeflownotebookswg/jupyter-tensorflow-full:v1.2.0
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "$gpu"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "$gpu"
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ${name}-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}-data
  namespace: $namespace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
EOF
}
# Deploy a model training job
deploy_training() {
  local name=$1
  local namespace=$2
  local replicas=$3
  local gpu=$4
  echo "=== Deploying model training job: $name ==="
  kubectl apply -f - << EOF
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: $name
  namespace: $namespace
spec:
  tfReplicaSpecs:
    Worker:
      replicas: $replicas
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "2"
                memory: "8Gi"
                nvidia.com/gpu: "$gpu"
              limits:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "$gpu"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: ${name}-data
          - name: code
            configMap:
              name: ${name}-code
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}-data
  namespace: $namespace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${name}-code
  namespace: $namespace
data:
  train.py: |
    import tensorflow as tf
    from tensorflow.keras.datasets import mnist
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten
    from tensorflow.keras.layers import Conv2D, MaxPooling2D
    from tensorflow.keras import backend as K

    # Load the data
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # Preprocess the data
    if K.image_data_format() == 'channels_first':
        x_train = x_train.reshape(x_train.shape[0], 1, 28, 28)
        x_test = x_test.reshape(x_test.shape[0], 1, 28, 28)
        input_shape = (1, 28, 28)
    else:
        x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
        x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
        input_shape = (28, 28, 1)
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255

    # Convert the labels to one-hot vectors
    y_train = tf.keras.utils.to_categorical(y_train, 10)
    y_test = tf.keras.utils.to_categorical(y_test, 10)

    # Build the model
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3),
                     activation='relu',
                     input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10, activation='softmax'))

    # Compile the model
    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=tf.keras.optimizers.Adadelta(),
                  metrics=['accuracy'])

    # Train the model
    model.fit(x_train, y_train,
              batch_size=128,
              epochs=10,
              verbose=1,
              validation_data=(x_test, y_test))

    # Evaluate the model
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

    # Save the model
    model.save('/data/model.h5')
EOF
}
# Deploy a model service
deploy_model_service() {
  local name=$1
  local namespace=$2
  local model_path=$3
  local gpu=$4
  echo "=== Deploying model service: $name ==="
  kubectl apply -f - << EOF
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: $name
  namespace: $namespace
spec:
  default:
    predictor:
      tensorflow:
        storageUri: $model_path
        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
            nvidia.com/gpu: "$gpu"
          limits:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "$gpu"
EOF
}
# Main menu
main() {
  echo "AI platform management script"
  echo "1. Deploy a Jupyter Notebook"
  echo "2. Deploy a model training job"
  echo "3. Deploy a model service"
  echo "4. Show resource status"
  echo "5. Exit"
  read -p "Select an operation: " choice
  case $choice in
    1)
      read -p "Notebook name: " name
      read -p "Namespace: " namespace
      read -p "Number of GPUs: " gpu
      deploy_notebook "$name" "$namespace" "$gpu"
      ;;
    2)
      read -p "Training job name: " name
      read -p "Namespace: " namespace
      read -p "Number of workers: " replicas
      read -p "Number of GPUs per worker: " gpu
      deploy_training "$name" "$namespace" "$replicas" "$gpu"
      ;;
    3)
      read -p "Model service name: " name
      read -p "Namespace: " namespace
      read -p "Model storage path: " model_path
      read -p "Number of GPUs: " gpu
      deploy_model_service "$name" "$namespace" "$model_path" "$gpu"
      ;;
    4)
      read -p "Namespace: " namespace
      echo "=== Pod status ==="
      kubectl get pods -n "$namespace"
      echo "=== Service status ==="
      kubectl get svc -n "$namespace"
      echo "=== GPU resources ==="
      kubectl get nodes -o json | jq '.items[].status.allocatable' | grep -E 'nvidia|cpu|memory'
      ;;
    5)
      exit 0
      ;;
    *)
      echo "Invalid choice"
      ;;
  esac
  # Show the menu again after each operation
  main
}
main
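The script passes whatever the operator types straight into the manifests, so a typo such as an uppercase name or a non-numeric GPU count is only caught when kubectl rejects the object. A small pre-check could catch these earlier. The sketch below assumes names should satisfy the RFC 1123 DNS label rule (lowercase alphanumerics and hyphens, at most 63 characters), which namespaces must follow and which is a safe subset for the other names used here; the helper names is_dns_label and parse_gpu_count are hypothetical.

```python
import re

# RFC 1123 label: lowercase alphanumerics or '-', starting and ending
# with an alphanumeric character.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_dns_label(value):
    """Return True if `value` is a valid RFC 1123 DNS label."""
    return len(value) <= 63 and _DNS_LABEL.fullmatch(value) is not None


def parse_gpu_count(value):
    """Parse a GPU count typed by the operator, rejecting bad input."""
    count = int(value)  # raises ValueError for non-numeric input
    if count < 0:
        raise ValueError("GPU count must not be negative")
    return count


if __name__ == "__main__":
    for candidate in ("fgedu-notebook", "Fgedu_Notebook", "-bad-", ""):
        verdict = "ok" if is_dns_label(candidate) else "rejected"
        print(f"{candidate!r}: {verdict}")
```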
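Part05-Fengge's Lessons Learned
5.1 Best Practices for AI/Machine Learning Platforms
- Hardware selection: choose GPUs that match the model type and training scale, such as the NVIDIA A100 or H100
- Software stack selection: choose AI frameworks and tools that fit the workload, such as TensorFlow, PyTorch, and Kubeflow
- Resource management: size GPU allocations deliberately and enforce them with K8s resource quotas and limits
- Distributed training: use distributed training to shorten training time for large models
- Model management: establish model versioning and a deployment process so that every model is traceable
5.2 Common Problems and Solutions
- Insufficient GPU resources: use GPU sharing techniques and schedule GPU resources carefully
- Slow training: optimize the model structure, use mixed-precision training, and increase the batch size
- High model serving latency: use techniques such as model quantization and model compression to cut inference latency
- Data processing bottlenecks: use a distributed data processing framework such as Apache Spark
- Complex platform maintenance: use automation tools to simplify platform maintenance and management
5.3 Performance Optimization Recommendations
- GPU optimization: use the latest CUDA and cuDNN versions supported by the framework in use, and optimize GPU memory usage
- Data optimization: use data prefetching and caching to reduce data loading time
- Model optimization: use techniques such as model compression and quantization to reduce model size and inference time
- Network optimization: use a high-speed network to reduce the communication overhead of distributed training
- Storage optimization: use fast storage such as NVMe SSDs to speed up data reads and writes
5.4 Future Trends
- The large-model era: support for training and deploying ever larger models
- Edge AI: deploying AI models to edge devices for real-time inference
- Automated machine learning: using AutoML to select and optimize models automatically
- Federated learning: training models while protecting data privacy
- Deep integration of AI and cloud native: using cloud-native technology to build more flexible and scalable AI platforms
Fengge's tip: building an AI/machine learning platform means weighing hardware, software, network, and other factors together, then optimizing and adjusting for the specific use case and requirements.
This article was compiled and published by Fengge Tutorials (风哥教程) for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html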
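For the resource-management practice above, a namespace-level cap on GPUs is expressed with a ResourceQuota. For extended resources such as nvidia.com/gpu, the quota key uses the requests. prefix. The sketch below generates such a manifest; the function name gpu_quota_manifest and the quota name gpu-quota are hypothetical, and the limit of 4 GPUs is an arbitrary example value.

```python
import json


def gpu_quota_manifest(namespace, max_gpus, name="gpu-quota"):
    """Build a ResourceQuota that caps the GPUs requested in a namespace."""
    if max_gpus < 0:
        raise ValueError("max_gpus must not be negative")
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Extended resources are limited through the "requests." prefix.
            "hard": {"requests.nvidia.com/gpu": str(max_gpus)},
        },
    }


if __name__ == "__main__":
    # The JSON output can be piped into `kubectl apply -f -`.
    print(json.dumps(gpu_quota_manifest("kubeflow-user-example-com", 4), indent=2))
```

With this quota in place, a Pod that would push the namespace above the cap is rejected at admission time instead of sitting in Pending.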
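The data-loading recommendation above comes down to overlapping I/O with computation. Frameworks provide this natively (for example through their dataset pipelines), so the following is only a framework-independent sketch of the idea: a background thread fills a small bounded buffer while the consumer works on the previous batch. The function name prefetch and the default buffer size of 2 are assumptions for illustration.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, loading up to `buffer_size` ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # unique marker for the end of the stream
    errors = []

    def producer():
        try:
            for item in iterable:
                buffer.put(item)  # blocks while the buffer is full
        except Exception as exc:  # hand loader errors to the consumer
            errors.append(exc)
        finally:
            buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            if errors:
                raise errors[0]
            return
        yield item


if __name__ == "__main__":
    import time

    def slow_batches(count):
        """Stand-in for a data loader that takes time per batch."""
        for index in range(count):
            time.sleep(0.05)
            yield index

    for batch in prefetch(slow_batches(5)):
        time.sleep(0.05)  # stand-in for a training step
        print("trained on batch", batch)
```

The bounded buffer keeps memory use predictable: the loader can run at most buffer_size batches ahead of the training step.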
