
Linux Tutorial FG559-Building a Large-Scale K8s AI/Machine Learning Platform

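Part01-Basic Concepts and Theory

1.1 Overview of AI/Machine Learning Platforms

An AI/machine learning platform is the infrastructure and tooling used to develop, train, deploy, and manage machine learning models. It supports the full workflow from data processing through model training to model deployment, helping data scientists and engineers build and ship AI applications more efficiently.

Core capabilities of an AI/machine learning platform include:

  • Data management and processing
  • Model training and tuning
  • Model deployment and serving
  • Model monitoring and management
  • Resource management and scheduling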

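1.2 Kubernetes in AI/Machine Learning

Kubernetes supports AI/machine learning workloads mainly in the following ways:

  • Resource management: K8s manages accelerators such as GPUs and TPUs through device plugins, which expose them as schedulable extended resources (for example nvidia.com/gpu)
  • Elastic scaling: resources are adjusted automatically to match the demands of training jobs
  • Job scheduling: training jobs are placed on suitable nodes to raise resource utilization
  • Service deployment: model services can be deployed and managed quickly
  • Environment isolation: different training jobs and model services run in isolated environments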

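1.3 Common AI/Machine Learning Frameworks

Commonly used AI/machine learning frameworks include:

  • TensorFlow: an open-source machine learning framework developed by Google, with support for distributed training
  • PyTorch: an open-source machine learning framework developed by Facebook (now Meta), known for its dynamic computation graph
  • Keras: a high-level neural network API; early versions ran on top of TensorFlow, Theano, or CNTK, while Keras 3 supports TensorFlow, JAX, and PyTorch backends
  • Scikit-learn: a Python library for classical machine learning that includes a wide range of algorithms
  • MXNet: an open-source deep learning framework from Apache with bindings for multiple programming languages (the project was retired to the Apache Attic in 2023)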

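Part02-Production Environment Planning and Recommendations

2.1 Hardware Resource Planning

Hardware planning for an AI/machine learning platform should take the following into account:

  • CPU: choose multi-core, high-clock CPUs, which suit data preprocessing and model evaluation
  • GPU: choose high-performance GPUs such as the NVIDIA A100 or H100 for model training
  • Memory: provision large memory capacity, especially when processing large datasets
  • Storage: choose fast storage such as NVMe SSDs to improve read/write throughput
  • Network: choose a high-bandwidth network to carry the communication load of distributed training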
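The GPU and memory bullets above can be turned into a rough sizing calculation. The sketch below is a back-of-the-envelope estimate, not a benchmark: it assumes mixed-precision training with the Adam optimizer at roughly 16 bytes of GPU memory per parameter for weights, gradients, and optimizer state, and it ignores activation memory entirely. The function names, the 7-billion-parameter model, and the 80 GiB per-GPU figure are illustrative assumptions, not values from this tutorial.

```python
import math

# Assumption: mixed-precision Adam keeps about 16 bytes of state per parameter
# (fp16 weights + fp16 gradients + fp32 master weights, momentum, variance).
BYTES_PER_PARAM = 16


def model_state_gib(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate GPU memory (GiB) for weights, gradients, and optimizer state."""
    return num_params * bytes_per_param / 2**30


def min_gpus(num_params: int, gpu_memory_gib: float = 80.0) -> int:
    """Lower bound on GPUs needed just to hold model state (activations ignored)."""
    return math.ceil(model_state_gib(num_params) / gpu_memory_gib)


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    print(f"model state: {model_state_gib(params):.1f} GiB")
    print(f"minimum GPUs: {min_gpus(params)}")
```

Because activations, framework overhead, and fragmentation are left out, the result is only a floor; real deployments usually need noticeably more headroom.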

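2.2 Platform Architecture Design

The architecture of an AI/machine learning platform should address the following:

  • Layered architecture: data layer, compute layer, service layer, and application layer
  • Component selection: pick suitable open-source components such as Kubeflow and MLflow
  • Scalability: support horizontal scaling to meet growing compute demand
  • Reliability: build in high availability so the platform runs stably
  • Maintainability: define clear component boundaries to simplify maintenance and upgrades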

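2.3 Storage and Network Planning

Storage planning:

  • Use a distributed storage system such as Ceph or GlusterFS to hold large datasets
  • Match the storage solution to the type of data:
    • Training data: fast storage such as NVMe SSDs
    • Model files: durable storage such as object storage
    • Logs and monitoring data: elastic storage such as NAS

Network planning:

  • Use a high-speed network such as 100 Gbps Ethernet or InfiniBand to support distributed training
  • Implement network isolation so that different training jobs and model services get independent network environments
  • Configure network QoS to guarantee bandwidth for critical jobs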
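To see why link speed matters for distributed training, the following sketch estimates the ideal duration of one gradient synchronization. It assumes a ring all-reduce, in which each worker sends roughly 2(N-1)/N times the gradient size, and it ignores latency, protocol overhead, and overlap with computation. The function name and the example numbers are hypothetical.

```python
def ring_allreduce_seconds(gradient_bytes: float, workers: int, link_gbps: float) -> float:
    """Ideal time for one ring all-reduce, ignoring latency and compute overlap."""
    if workers < 2:
        return 0.0  # a single worker has nothing to synchronize
    # Each worker transmits 2 * (N - 1) / N of the gradient volume.
    bytes_per_worker = 2 * (workers - 1) / workers * gradient_bytes
    link_bytes_per_s = link_gbps * 1e9 / 8  # convert Gbit/s to bytes/s
    return bytes_per_worker / link_bytes_per_s


if __name__ == "__main__":
    grad = 1e9  # hypothetical 1 GB of gradients (about 250M fp32 parameters)
    for gbps in (10, 25, 100):
        t = ring_allreduce_seconds(grad, workers=4, link_gbps=gbps)
        print(f"{gbps:>3} Gbps: {t * 1000:.0f} ms per synchronization")
```

Comparing this figure with the compute time of one training step gives a first indication of whether the network or the GPUs will be the bottleneck.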

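Fengge's tip: GPU management and scheduling are the crux of an AI/machine learning platform; plan and configure them carefully to keep utilization high.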

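Part03-Production Implementation Plan

The implementation below uses Kubeflow as the example.

3.1 Installing Kubeflow

# Get the kfctl sources at the v1.2.0 tag
# (the checkout contains source code only; a built or downloaded kfctl binary is assumed to be in ./bin)
$ git clone https://github.com/kubeflow/kfctl.git
$ cd kfctl
$ git checkout v1.2.0

# Prepare the installation environment
$ export PATH=$PATH:$PWD/bin
$ export KF_NAME=kubeflow
$ export BASE_DIR=/opt/kubeflow
$ export KF_DIR=${BASE_DIR}/${KF_NAME}
$ mkdir -p ${KF_DIR}
$ cd ${KF_DIR}

# Download the Kubeflow configuration
$ export CONFIG_URI=https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml
$ wget -O kfctl_k8s_istio.v1.2.0.yaml ${CONFIG_URI}

# Deploy Kubeflow
$ kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml

# Verify the Kubeflow installation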
$ kubectl get pods -n kubeflow
NAME READY STATUS RESTARTS AGE
admission-webhook-deployment-567890abc-12345 1/1 Running 0 10m
cache-deployer-deployment-567890abc-67890 1/1 Running 0 10m
centraldashboard-567890abc-abcde 1/1 Running 0 10m
jupyter-web-app-deployment-567890abc-fghij 1/1 Running 0 10m
katib-controller-567890abc-ijklm 1/1 Running 0 10m
metacontroller-567890abc-nopqr 1/1 Running 0 10m
metadata-567890abc-stuvw 1/1 Running 0 10m
minio-567890abc-xyzab 1/1 Running 0 10m
mysql-567890abc-cdefg 1/1 Running 0 10m
notebook-controller-deployment-567890abc-hijkl 1/1 Running 0 10m
profile-controller-567890abc-mnopq 1/1 Running 0 10m
pytorch-operator-567890abc-qrstu 1/1 Running 0 10m
resource-driver-deployment-567890abc-vwxyz 1/1 Running 0 10m
tensorboard-controller-deployment-567890abc-12345 1/1 Running 0 10m
tf-job-operator-567890abc-67890 1/1 Running 0 10m
workflow-controller-567890abc-abcde 1/1 Running 0 10m

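# Check the Kubeflow ingress gateway (the Istio gateway normally runs in the istio-system namespace)
$ kubectl get svc -n istio-system | grep istio-ingressgateway
istio-ingressgateway LoadBalancer 10.96.78.90 192.168.1.100 15020:32400/TCP,80:30880/TCP,443:30443/TCP,31400:31400/TCP,15443:32281/TCP 10m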

3.2 Configuring GPU Support

# Install the NVIDIA driver (assumes an NVIDIA package repository is already enabled)
$ sudo dnf install -y nvidia-driver

# Install the NVIDIA container runtime
$ sudo dnf install -y nvidia-container-runtime

# Configure Docker to use the NVIDIA runtime
$ sudo tee /etc/docker/daemon.json << 'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF

# Restart Docker
$ sudo systemctl restart docker

# Install the NVIDIA device plugin
$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.2/nvidia-device-plugin.yml

# Verify the GPU device plugin
$ kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-567890abc-12345 1/1 Running 0 5m
nvidia-device-plugin-daemonset-567890abc-67890 1/1 Running 0 5m

# Verify the GPU resources advertised by the nodes
$ kubectl get nodes -o json | jq '.items[].status.allocatable'
{
  "cpu": "32",
  "ephemeral-storage": "100Gi",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "128Gi",
  "nvidia.com/gpu": "2",
  "pods": "110"
}
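The allocatable check above can also be summarized per node. The sketch below is one possible way to do that, assuming the JSON document has the shape returned by `kubectl get nodes -o json`; the function name and the sample node names are hypothetical.

```python
import json


def gpus_per_node(nodes: dict) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count (0 when the key is absent)."""
    result = {}
    for item in nodes.get("items", []):
        name = item["metadata"]["name"]
        allocatable = item.get("status", {}).get("allocatable", {})
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


# Hypothetical sample in the shape of `kubectl get nodes -o json` output.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "gpu-node-1"},
   "status": {"allocatable": {"cpu": "32", "memory": "128Gi", "nvidia.com/gpu": "2"}}},
  {"metadata": {"name": "cpu-node-1"},
   "status": {"allocatable": {"cpu": "16", "memory": "64Gi"}}}
]}
""")

if __name__ == "__main__":
    counts = gpus_per_node(SAMPLE)
    print(counts, "total GPUs:", sum(counts.values()))
```

Against a live cluster, the same function could be fed with `json.load(sys.stdin)` after piping the kubectl output into the script.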

3.3 Deploying a Jupyter Notebook

# Create the Jupyter Notebook and its data volume
# (check that the image tag below exists in your registry before applying)
$ kubectl apply -f - << 'EOF'
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: fgedu-notebook
  namespace: kubeflow-user-example-com
spec:
  template:
    spec:
      containers:
      - name: notebook
        image: kubeflownotebookswg/jupyter-tensorflow-full:v1.2.0
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: notebook-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: notebook-data
  namespace: kubeflow-user-example-com
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
EOF

# Check the Notebook status
$ kubectl get notebooks -n kubeflow-user-example-com
NAME STATUS AGE
fgedu-notebook Running 5m

# Get the Notebook access URL (the Istio gateway normally runs in the istio-system namespace)
$ kubectl get svc -n istio-system | grep istio-ingressgateway
istio-ingressgateway LoadBalancer 10.96.78.90 192.168.1.100 15020:32400/TCP,80:30880/TCP,443:30443/TCP,31400:31400/TCP,15443:32281/TCP 15m

# Access URL: http://192.168.1.100:30880
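The requests and limits in the Notebook manifest above are Kubernetes quantity strings such as "8Gi". As a small aid for sanity-checking a manifest before applying it, the sketch below converts memory quantities to bytes and verifies that a request does not exceed its limit. It handles only the common binary (Ki, Mi, Gi, Ti) and decimal (k, M, G, T) suffixes, and the function names are hypothetical.

```python
# Subset of Kubernetes quantity suffixes; other forms (e.g. exponents) are not handled.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "8Gi" or "500M" to bytes."""
    # Try two-letter suffixes first so "Gi" is matched before "G".
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # no suffix: plain byte count


def request_within_limit(request: str, limit: str) -> bool:
    """True when the requested memory does not exceed the limit."""
    return parse_memory(request) <= parse_memory(limit)


if __name__ == "__main__":
    print(parse_memory("8Gi"), parse_memory("16Gi"))
    print("request fits limit:", request_within_limit("8Gi", "16Gi"))
```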

3.4 Deploying a Model Training Job

# Create the TensorFlow training job, its data volume, and the training code
$ kubectl apply -f - << 'EOF'
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: fgedu-tf-training
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "2"
                memory: "8Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "1"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data
          - name: code
            configMap:
              name: training-code
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
  namespace: kubeflow-user-example-com
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: training-code
  namespace: kubeflow-user-example-com
data:
  train.py: |
    import tensorflow as tf
    from tensorflow.keras.datasets import mnist
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten
    from tensorflow.keras.layers import Conv2D, MaxPooling2D
    from tensorflow.keras import backend as K

    # Load the data
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # Preprocess: add the channel axis and scale pixels to [0, 1]
    if K.image_data_format() == 'channels_first':
        x_train = x_train.reshape(x_train.shape[0], 1, 28, 28)
        x_test = x_test.reshape(x_test.shape[0], 1, 28, 28)
        input_shape = (1, 28, 28)
    else:
        x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
        x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
        input_shape = (28, 28, 1)

    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255

    # One-hot encode the labels
    y_train = tf.keras.utils.to_categorical(y_train, 10)
    y_test = tf.keras.utils.to_categorical(y_test, 10)

    # Build the model
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10, activation='softmax'))

    # Compile the model
    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=tf.keras.optimizers.Adadelta(),
                  metrics=['accuracy'])

    # Train the model
    model.fit(x_train, y_train,
              batch_size=128,
              epochs=10,
              verbose=1,
              validation_data=(x_test, y_test))

    # Evaluate the model
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

    # Save the model
    model.save('/data/mnist_model.h5')
EOF

# Check the training job status
$ kubectl get tfjobs -n kubeflow-user-example-com
NAME STATUS AGE
fgedu-tf-training Running 5m

# Follow the training logs (pods are named <job-name>-<replica-type>-<index>)
$ kubectl logs -f fgedu-tf-training-worker-0 -n kubeflow-user-example-com
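Each TFJob replica receives a TF_CONFIG environment variable describing the cluster and the replica's own role. The sketch below shows one way a training script might read it, for example to let only one replica write the final model file. The function name is hypothetical, the sample host names and port are illustrative, and treating worker 0 as chief when no chief replica exists is the usual convention rather than something this tutorial configures explicitly.

```python
import json
import os


def task_from_tf_config(raw=None):
    """Return (task_type, task_index, is_chief) parsed from TF_CONFIG."""
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    task = config.get("task", {})
    task_type = task.get("type", "worker")
    task_index = int(task.get("index", 0))
    cluster = config.get("cluster", {})
    # Without a dedicated chief in the cluster, worker 0 conventionally acts as chief.
    has_chief = "chief" in cluster
    is_chief = task_type == "chief" or (
        not has_chief and task_type == "worker" and task_index == 0
    )
    return task_type, task_index, is_chief


# Illustrative TF_CONFIG for the second of two workers (host names are made up).
EXAMPLE = json.dumps({
    "cluster": {"worker": ["worker-0.example:2222", "worker-1.example:2222"]},
    "task": {"type": "worker", "index": 1},
})

if __name__ == "__main__":
    print(task_from_tf_config(EXAMPLE))
```

A script could then guard `model.save(...)` with the returned `is_chief` flag so that two workers sharing /data do not overwrite each other's file.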

3.5 Deploying the Model Service

# Create the model service and its storage volume
# (the TensorFlow predictor serves models in the SavedModel format, so a servable model
#  must be present on the model-storage PVC at the path given in storageUri)
$ kubectl apply -f - << 'EOF'
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: mnist-model
  namespace: kubeflow-user-example-com
spec:
  default:
    predictor:
      tensorflow:
        storageUri: pvc://model-storage/mnist_model.h5
        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
          limits:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "1"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: kubeflow-user-example-com
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF

# Check the model service status
$ kubectl get inferenceservices -n kubeflow-user-example-com
NAME URL READY AGE
mnist-model http://mnist-model.kubeflow-user-example-com.fgedu.net.cn True 5m

# Test the model service with one all-zero 28x28x1 image
$ python3 -c 'import json; print(json.dumps({"instances": [[[[0.0]] * 28] * 28]}))' > input.json
$ curl -X POST http://mnist-model.kubeflow-user-example-com.fgedu.net.cn/v1/models/mnist-model:predict -d @input.json
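The predict request above sends a JSON body with an "instances" array. The following sketch shows how a client might build that body and interpret the reply, assuming the service answers in the TensorFlow Serving REST style with a "predictions" array holding one probability row per instance. The function names and the sample response are hypothetical.

```python
import json


def build_predict_payload(images) -> str:
    """Wrap a batch of 28x28x1 images in the {"instances": [...]} request format."""
    return json.dumps({"instances": images})


def predicted_classes(response_body: str):
    """Return the index of the largest score in each row of "predictions"."""
    predictions = json.loads(response_body)["predictions"]
    return [max(range(len(row)), key=row.__getitem__) for row in predictions]


if __name__ == "__main__":
    blank = [[[0.0] for _ in range(28)] for _ in range(28)]  # one all-zero image
    body = build_predict_payload([blank])
    print("request bytes:", len(body))
    # Made-up response used only to exercise the parser.
    fake_response = json.dumps({"predictions": [[0.05, 0.7, 0.25]]})
    print("predicted classes:", predicted_classes(fake_response))
```

The request body could be sent with `urllib.request` or any HTTP client; only the payload construction and response parsing are shown here.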


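Part04-Production Cases and Hands-On Walkthrough

4.1 Enterprise AI Platform Case

One enterprise's AI platform looks like this in practice:

  • Technology stack: Kubernetes + Kubeflow + TensorFlow + PyTorch
  • Hardware: an NVIDIA A100 GPU cluster with high-speed NVMe storage
  • Platform functions: data processing, model training, model deployment, and model monitoring
  • Use cases: image recognition, natural language processing, and predictive analytics
  • Performance optimization: distributed training and tuning of GPU utilization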

4.2 Large-Scale Model Training Case

# Deploy the distributed training job
# One Chief and four Workers train synchronously with MultiWorkerMirroredStrategy, which reads
# the cluster layout from the TF_CONFIG variable injected by the TFJob operator; this strategy
# needs no parameter servers. MirroredStrategy would only span the GPUs inside a single pod.
# The training-data PVC must be mountable by every pod (for example ReadWriteMany across nodes).
$ kubectl apply -f - << 'EOF'
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: distributed-tf-training
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "8"
                memory: "32Gi"
                nvidia.com/gpu: "1"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data
          - name: code
            configMap:
              name: distributed-training-code
    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "8"
                memory: "32Gi"
                nvidia.com/gpu: "1"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data
          - name: code
            configMap:
              name: distributed-training-code
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: distributed-training-code
  namespace: kubeflow-user-example-com
data:
  train.py: |
    import tensorflow as tf
    from tensorflow.keras.datasets import cifar10
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten
    from tensorflow.keras.layers import Conv2D, MaxPooling2D

    # Distribution strategy: created first, reads TF_CONFIG set by the TFJob operator
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    # Load the data
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()

    # Preprocess: scale pixels to [0, 1]
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255

    # One-hot encode the labels
    y_train = tf.keras.utils.to_categorical(y_train, 10)
    y_test = tf.keras.utils.to_categorical(y_test, 10)

    # Build and compile the model inside the strategy scope
    with strategy.scope():
        model = Sequential()
        model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
        model.add(Conv2D(64, (3, 3), activation='relu'))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Dropout(0.25))
        model.add(Flatten())
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.5))
        model.add(Dense(10, activation='softmax'))

        model.compile(loss=tf.keras.losses.categorical_crossentropy,
                      optimizer=tf.keras.optimizers.Adadelta(),
                      metrics=['accuracy'])

    # Train the model
    model.fit(x_train, y_train,
              batch_size=128,
              epochs=50,
              verbose=1,
              validation_data=(x_test, y_test))

    # Evaluate the model
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

    # Save the model (every replica calls save(); production code usually has
    # non-chief replicas write to a temporary path instead)
    model.save('/data/cifar10_model.h5')
EOF

# Check the training job status
$ kubectl get tfjobs -n kubeflow-user-example-com
NAME STATUS AGE
distributed-tf-training Running 10m

# Follow the training logs (pods are named <job-name>-<replica-type>-<index>)
$ kubectl logs -f distributed-tf-training-chief-0 -n kubeflow-user-example-com
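With synchronous data-parallel training, the batch size and the number of replicas together determine how many optimizer steps an epoch takes. The sketch below works through that arithmetic for the job above, which runs five GPU replicas on the 50,000 CIFAR-10 training images. It assumes each replica processes its own per-replica batch on every step; the function names are hypothetical.

```python
import math


def global_batch_size(per_replica_batch: int, replicas: int) -> int:
    """Samples consumed per synchronous step across all replicas."""
    return per_replica_batch * replicas


def steps_per_epoch(num_samples: int, per_replica_batch: int, replicas: int) -> int:
    """Optimizer steps needed to see every sample once (last step may be partial)."""
    return math.ceil(num_samples / global_batch_size(per_replica_batch, replicas))


if __name__ == "__main__":
    samples = 50_000  # CIFAR-10 training set size
    for replicas in (1, 5):
        print(
            f"{replicas} replica(s): global batch "
            f"{global_batch_size(128, replicas)}, "
            f"{steps_per_epoch(samples, 128, replicas)} steps per epoch"
        )
```

Note that Keras interprets the `batch_size` passed to `fit` as the global batch when a distribution strategy is active, so a script that wants 128 samples per replica would pass the product computed by `global_batch_size`.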

4.3 AI Model Monitoring and Management

# Deploy the model monitoring service
$ kubectl apply -f - << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-monitor
  namespace: kubeflow-user-example-com
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-monitor
  template:
    metadata:
      labels:
        app: model-monitor
    spec:
      containers:
      - name: model-monitor
        image: harbor.fgedu.net.cn/library/model-monitor:v1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: PROMETHEUS_URL
          value: "http://prometheus.kubeflow.svc:9090"
        - name: GRAFANA_URL
          value: "http://grafana.kubeflow.svc:3000"
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: model-monitor
  namespace: kubeflow-user-example-com
spec:
  selector:
    app: model-monitor
  ports:
  - port: 80
    targetPort: 8080
  type: NodePort
EOF

# Configure the Prometheus alerting rules
$ kubectl apply -f - << 'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: model-monitoring-rules
  namespace: kubeflow
spec:
  groups:
  - name: model-monitoring
    rules:
    - alert: ModelAccuracyDrop
      expr: model_accuracy{job="model-monitor"} < 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Model accuracy dropped"
        description: "Accuracy of model {{ $labels.model }} is below 80%"
    - alert: ModelLatencyHigh
      expr: model_inference_latency{job="model-monitor"} > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Model inference latency is too high"
        description: "Inference latency of model {{ $labels.model }} exceeds 1000ms"
EOF
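The alerting rules above assume that the monitoring service exposes `model_accuracy` and `model_inference_latency` with a `model` label. The sketch below shows what such a metrics payload could look like in the Prometheus text exposition format, together with a local check of the two thresholds. It is an illustration only: the function names are hypothetical, both metrics are assumed to be gauges, the `job` label is normally attached by Prometheus at scrape time, and the real rules additionally require the condition to hold for 5 minutes.

```python
def render_metrics(model: str, accuracy: float, latency_ms: float) -> str:
    """Render the two gauges in the Prometheus text exposition format."""
    lines = [
        "# TYPE model_accuracy gauge",
        f'model_accuracy{{model="{model}"}} {accuracy}',
        "# TYPE model_inference_latency gauge",
        f'model_inference_latency{{model="{model}"}} {latency_ms}',
    ]
    return "\n".join(lines) + "\n"


def breached_thresholds(accuracy: float, latency_ms: float) -> list:
    """Names of the alerts whose threshold is crossed by a single sample."""
    alerts = []
    if accuracy < 0.8:
        alerts.append("ModelAccuracyDrop")
    if latency_ms > 1000:
        alerts.append("ModelLatencyHigh")
    return alerts


if __name__ == "__main__":
    print(render_metrics("mnist-model", 0.75, 1200.0), end="")
    print(breached_thresholds(0.75, 1200.0))
```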

4.4 AI Platform Automation Script

#!/bin/bash
# ai_platform.sh
# Interactive helper for deploying notebooks, training jobs, and model services

# Deploy a Jupyter Notebook
deploy_notebook() {
    local name=$1
    local namespace=$2
    local gpu=$3

    echo "=== Deploying Jupyter Notebook: $name ==="
    kubectl apply -f - << EOF
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: $name
  namespace: $namespace
spec:
  template:
    spec:
      containers:
      - name: notebook
        image: kubeflownotebookswg/jupyter-tensorflow-full:v1.2.0
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "$gpu"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "$gpu"
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ${name}-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}-data
  namespace: $namespace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
EOF
}

# Deploy a model training job
deploy_training() {
    local name=$1
    local namespace=$2
    local replicas=$3
    local gpu=$4

    echo "=== Deploying model training job: $name ==="
    kubectl apply -f - << EOF
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: $name
  namespace: $namespace
spec:
  tfReplicaSpecs:
    Worker:
      replicas: $replicas
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "2"
                memory: "8Gi"
                nvidia.com/gpu: "$gpu"
              limits:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "$gpu"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: ${name}-data
          - name: code
            configMap:
              name: ${name}-code
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}-data
  namespace: $namespace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${name}-code
  namespace: $namespace
data:
  train.py: |
    import tensorflow as tf
    from tensorflow.keras.datasets import mnist
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten
    from tensorflow.keras.layers import Conv2D, MaxPooling2D
    from tensorflow.keras import backend as K

    # Load the data
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # Preprocess: add the channel axis and scale pixels to [0, 1]
    if K.image_data_format() == 'channels_first':
        x_train = x_train.reshape(x_train.shape[0], 1, 28, 28)
        x_test = x_test.reshape(x_test.shape[0], 1, 28, 28)
        input_shape = (1, 28, 28)
    else:
        x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
        x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
        input_shape = (28, 28, 1)

    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255

    # One-hot encode the labels
    y_train = tf.keras.utils.to_categorical(y_train, 10)
    y_test = tf.keras.utils.to_categorical(y_test, 10)

    # Build the model
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10, activation='softmax'))

    # Compile the model
    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=tf.keras.optimizers.Adadelta(),
                  metrics=['accuracy'])

    # Train the model
    model.fit(x_train, y_train,
              batch_size=128,
              epochs=10,
              verbose=1,
              validation_data=(x_test, y_test))

    # Evaluate the model
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

    # Save the model
    model.save('/data/model.h5')
EOF
}

# Deploy a model service
deploy_model_service() {
    local name=$1
    local namespace=$2
    local model_path=$3
    local gpu=$4

    echo "=== Deploying model service: $name ==="
    kubectl apply -f - << EOF
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: $name
  namespace: $namespace
spec:
  default:
    predictor:
      tensorflow:
        storageUri: $model_path
        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
            nvidia.com/gpu: "$gpu"
          limits:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "$gpu"
EOF
}

# Main menu: loops until the user chooses to exit
main() {
    while true; do
        echo "AI platform management script"
        echo "1. Deploy Jupyter Notebook"
        echo "2. Deploy model training job"
        echo "3. Deploy model service"
        echo "4. Show resource status"
        echo "5. Exit"
        read -r -p "Select an operation: " choice

        case $choice in
        1)
            read -r -p "Notebook name: " name
            read -r -p "Namespace: " namespace
            read -r -p "Number of GPUs: " gpu
            deploy_notebook "$name" "$namespace" "$gpu"
            ;;
        2)
            read -r -p "Training job name: " name
            read -r -p "Namespace: " namespace
            read -r -p "Number of Workers: " replicas
            read -r -p "GPUs per Worker: " gpu
            deploy_training "$name" "$namespace" "$replicas" "$gpu"
            ;;
        3)
            read -r -p "Model service name: " name
            read -r -p "Namespace: " namespace
            read -r -p "Model storage path: " model_path
            read -r -p "Number of GPUs: " gpu
            deploy_model_service "$name" "$namespace" "$model_path" "$gpu"
            ;;
        4)
            read -r -p "Namespace: " namespace
            echo "=== Pod status ==="
            kubectl get pods -n "$namespace"
            echo "=== Service status ==="
            kubectl get svc -n "$namespace"
            echo "=== GPU resources ==="
            kubectl get nodes -o json | jq '.items[].status.allocatable' | grep -E 'nvidia|cpu|memory'
            ;;
        5)
            exit 0
            ;;
        *)
            echo "Invalid choice"
            ;;
        esac
    done
}

main
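The script above passes user input straight into the manifests, so a mistyped name only fails once kubectl rejects it. As a hedged illustration of pre-validation, the sketch below checks a name against the DNS label rule (lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, at most 63 characters) and checks that a GPU count is a non-negative integer. Some Kubernetes resources accept longer DNS subdomain names; this sketch deliberately uses the stricter label rule, and the function names are hypothetical.

```python
import re

# DNS label: lowercase alphanumerics and '-', alphanumeric at both ends.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_name(name: str) -> bool:
    """True when the name is a valid DNS label of at most 63 characters."""
    return len(name) <= 63 and _DNS_LABEL.fullmatch(name) is not None


def is_valid_gpu_count(value: str) -> bool:
    """True when the input is a non-negative integer written in ASCII digits."""
    return value.isascii() and value.isdigit()


if __name__ == "__main__":
    for candidate in ("fgedu-notebook", "Bad_Name", "-leading-dash"):
        print(f"{candidate!r}: {is_valid_name(candidate)}")
    for count in ("1", "-1", "two"):
        print(f"{count!r}: {is_valid_gpu_count(count)}")
```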


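Part05-Fengge's Lessons Learned

5.1 Best Practices for AI/Machine Learning Platforms

  • Hardware selection: choose GPUs that fit the model type and training scale, such as the NVIDIA A100 or H100
  • Software stack selection: choose suitable AI frameworks and tools, such as TensorFlow, PyTorch, and Kubeflow
  • Resource management: allocate GPUs deliberately and use K8s resource quotas and limits
  • Distributed training: use distributed training to speed up large models
  • Model management: establish model versioning and a deployment workflow so that every model is traceable

5.2 Common Problems and Solutions

  • Insufficient GPU resources: use GPU sharing techniques and schedule GPUs sensibly
  • Slow training: optimize the model structure, use mixed-precision training, and increase the batch size
  • High serving latency: apply model quantization, model compression, and similar techniques to cut inference latency
  • Data processing bottlenecks: use a distributed data processing framework such as Apache Spark
  • Complex platform maintenance: use automation tools to simplify maintenance and management

5.3 Performance Optimization Recommendations

  • GPU optimization: use recent CUDA and cuDNN versions and optimize GPU memory usage
  • Data optimization: prefetch and cache data to reduce loading time
  • Model optimization: apply model compression, quantization, and similar techniques to reduce model size and inference time
  • Network optimization: use a high-speed network to reduce the communication overhead of distributed training
  • Storage optimization: use fast storage such as NVMe SSDs to improve read/write throughput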
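Quantization, mentioned under model optimization above, trades a little precision for a smaller model. The sketch below is a minimal pure-Python illustration of symmetric per-tensor int8 quantization, one of several possible schemes and not necessarily the one a given serving stack uses. The function names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (integer codes, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [-1.27, 0.0, 0.5, 1.27]  # made-up weights
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    print("codes:", codes)
    print("max error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each float32 weight is replaced by a one-byte code plus a shared scale, which is where the roughly fourfold size reduction of int8 quantization comes from; the rounding error per weight stays within half a quantization step.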

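5.4 Future Trends

  • The large-model era: support for training and deploying ever larger models
  • Edge AI: deploying AI models to edge devices for real-time inference
  • Automated machine learning: using AutoML to select and optimize models automatically
  • Federated learning: training models while preserving data privacy
  • Deep integration of AI and cloud native: using cloud-native technology to deliver more flexible and scalable AI platforms

Fengge's tip: building an AI/machine learning platform means weighing hardware, software, networking, and other factors together, then tuning the result to the specific use case and requirements.

This article was compiled and published by Fengge Tutorials for study and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html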
