Linux Tutorial FG559 - Building a Large-Scale K8s AI/Machine Learning Platform

Compiled by: 风哥教程 | Updated: 2026-01-25 | Category: Linux tutorials

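Part01-Basic Concepts and Theory

1.1 Overview of AI/Machine Learning Platforms

An AI/machine learning platform is the infrastructure and tool set used to develop, train, deploy, and manage machine learning models. It supports the full workflow from data processing and model training through model deployment, helping data scientists and engineers build and ship AI applications more efficiently.

The core functions of an AI/machine learning platform include:

Data management and processing
Model training and tuning
Model deployment and serving
Model monitoring and management
Resource management and scheduling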

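1.2 K8s in AI/Machine Learning

Kubernetes is used in AI/machine learning mainly in the following ways:

Resource management: through device plugins, K8s schedules accelerators such as GPUs and TPUs as node resources
Elastic scaling: resources are adjusted automatically to match the demands of training jobs
Job scheduling: training jobs are placed intelligently to raise resource utilization
Service deployment: model services can be deployed and managed quickly
Environment isolation: different training jobs and model services each get an isolated environment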
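
As a rough illustration of the resource-management point above, the following Python sketch builds a Pod manifest that requests one GPU. It assumes the cluster exposes GPUs as the `nvidia.com/gpu` extended resource, as the NVIDIA device plugin does; the function name `gpu_pod_manifest`, the Pod name `gpu-demo`, and the container name `trainer` are illustrative choices, not part of any Kubernetes API.

```python
import json


def gpu_pod_manifest(name, image, gpus):
    """Build a minimal Pod manifest that asks the scheduler for `gpus` GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",  # hypothetical container name
                    "image": image,
                    # GPUs are an extended resource: they are requested under
                    # `limits`, and a whole number of devices is required.
                    "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                }
            ],
        },
    }


manifest = gpu_pod_manifest("gpu-demo", "tensorflow/tensorflow:2.8.0-gpu", 1)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped to `kubectl apply -f -` on a cluster that has GPU nodes.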

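1.3 Common AI/Machine Learning Frameworks

Commonly used AI/machine learning frameworks include:

TensorFlow: an open-source machine learning framework developed by Google, with support for distributed training
PyTorch: an open-source machine learning framework originally developed by Facebook (now Meta), known for its dynamic computation graph
Keras: a high-level neural network API; earlier releases ran on top of TensorFlow, Theano, or CNTK, while Keras 3 supports TensorFlow, JAX, and PyTorch backends
Scikit-learn: a Python library for machine learning that includes a wide range of algorithms
MXNet: an Apache open-source deep learning framework with bindings for several programming languages (retired to the Apache Attic in 2023)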


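Part02-Production Environment Planning and Recommendations

2.1 Hardware Resource Planning

Hardware planning for an AI/machine learning platform should take the following into account:

CPU: choose CPUs with many cores and a high clock speed; they handle data preprocessing and model evaluation
GPU: choose high-performance GPUs such as the NVIDIA A100 or H100 for model training
Memory: large memory capacity is needed, especially when working with large datasets
Storage: choose fast storage such as NVMe SSDs to speed up data reads and writes
Network: choose a high-bandwidth network to carry the communication traffic of distributed training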

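2.2 Platform Architecture Design

The architecture of an AI/machine learning platform should take the following into account:

Layered architecture: data layer, compute layer, service layer, application layer
Component selection: choose suitable open-source components such as Kubeflow and MLflow
Scalability: support horizontal scaling to meet growing compute demand
Reliability: build for high availability so the platform runs stably
Maintainability: define clear component boundaries to simplify maintenance and upgrades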

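2.3 Storage and Network Planning

Storage planning:

Use a distributed storage system such as Ceph or GlusterFS to hold large datasets
Choose a storage option that matches each type of data:
- Training data: fast storage such as NVMe SSDs
- Model files: durable storage such as object storage
- Logs and monitoring data: elastic storage such as NAS

Network planning:

Use a high-speed network such as 100 Gbps Ethernet or InfiniBand to support distributed training
Implement network isolation so that different training jobs and model services get independent network environments
Configure network QoS to guarantee bandwidth for critical jobs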
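
To make the bandwidth recommendation concrete, here is a back-of-the-envelope sketch in Python. It assumes gradients are synchronized with a ring all-reduce, in which each worker sends 2 * (N - 1) / N times the gradient size per step. The model size (100 million float32 parameters) and the worker count are made-up example values, and the estimate ignores latency, protocol overhead, and any overlap of communication with computation.

```python
def ring_allreduce_seconds(param_count, bytes_per_param, workers, link_gbps):
    """Bandwidth-only lower bound for one ring all-reduce of the full gradient."""
    if workers < 2:
        return 0.0  # a single worker has nothing to synchronize
    gradient_bytes = param_count * bytes_per_param
    # Each worker sends 2 * (N - 1) / N times the gradient size around the ring.
    bytes_sent = 2 * (workers - 1) / workers * gradient_bytes
    bytes_per_second = link_gbps * 1e9 / 8  # convert Gbit/s to bytes/s
    return bytes_sent / bytes_per_second


# Assumed example: 100M float32 parameters synchronized across 4 workers.
for gbps in (10, 100):
    seconds = ring_allreduce_seconds(100_000_000, 4, 4, gbps)
    print(f"{gbps} Gbps link: about {seconds * 1000:.0f} ms per synchronization")
```

Under these assumptions the synchronization cost scales inversely with link speed, which is why the per-step overhead on a 10 Gbps link is ten times that of a 100 Gbps link.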

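风哥 tip: On an AI/machine learning platform, managing and scheduling GPU resources is critical. Plan and configure them carefully to raise resource utilization.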


Part03-Production Implementation Plan

Using Kubeflow as the example, the implementation plan is as follows:

3.1 Installing Kubeflow

# Download the Kubeflow installation tool (kfctl)
$ git clone https://github.com/kubeflow/kfctl.git
$ cd kfctl
$ git checkout v1.2.0

# Prepare the Kubeflow installation directory
$ export PATH=$PATH:$PWD/bin
$ export KF_NAME=kubeflow
$ export BASE_DIR=/opt/kubeflow
$ export KF_DIR=${BASE_DIR}/${KF_NAME}
$ mkdir -p ${KF_DIR}
$ cd ${KF_DIR}

# Download the Kubeflow configuration
$ curl -LO https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml

# Deploy Kubeflow
$ kfctl apply -V -f kfctl_k8s_istio.v1.2.0.yaml

# Verify the Kubeflow installation
$ kubectl get pods -n kubeflow
NAME READY STATUS RESTARTS AGE
admission-webhook-deployment-567890abc-12345 1/1 Running 0 10m
cache-deployer-deployment-567890abc-67890 1/1 Running 0 10m
centraldashboard-567890abc-abcde 1/1 Running 0 10m
jupyter-web-app-deployment-567890abc-fghij 1/1 Running 0 10m
katib-controller-567890abc-ijklm 1/1 Running 0 10m
metacontroller-567890abc-nopqr 1/1 Running 0 10m
metadata-567890abc-stuvw 1/1 Running 0 10m
minio-567890abc-xyzab 1/1 Running 0 10m
mysql-567890abc-cdefg 1/1 Running 0 10m
notebook-controller-deployment-567890abc-hijkl 1/1 Running 0 10m
profile-controller-567890abc-mnopq 1/1 Running 0 10m
pytorch-operator-567890abc-qrstu 1/1 Running 0 10m
resource-driver-deployment-567890abc-vwxyz 1/1 Running 0 10m
tensorboard-controller-deployment-567890abc-12345 1/1 Running 0 10m
tf-job-operator-567890abc-67890 1/1 Running 0 10m
workflow-controller-567890abc-abcde 1/1 Running 0 10m

# Check the Kubeflow ingress service
$ kubectl get svc -n kubeflow | grep istio-ingressgateway
istio-ingressgateway LoadBalancer 10.96.78.90 <pending> 15020:32400/TCP,80:30880/TCP,443:30443/TCP,31400:31400/TCP,15443:32281/TCP 10m

3.2 Configuring GPU Support

# Install the NVIDIA driver
$ sudo dnf install -y nvidia-driver

# Install the NVIDIA container runtime
$ sudo dnf install -y nvidia-container-runtime

# Configure Docker to use the NVIDIA runtime
$ sudo tee /etc/docker/daemon.json << 'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF

# Restart Docker
$ sudo systemctl restart docker

# Install the NVIDIA device plugin
$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.2/nvidia-device-plugin.yml

# Verify the GPU device plugin
$ kubectl get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-567890abc-12345 1/1 Running 0 5m
nvidia-device-plugin-daemonset-567890abc-67890 1/1 Running 0 5m

# Verify the GPU resources advertised by the nodes
$ kubectl get nodes -o json | jq '.items[].status.allocatable'
{
  "cpu": "32",
  "ephemeral-storage": "100Gi",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "128Gi",
  "nvidia.com/gpu": "2",
  "pods": "110"
}
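
The jq check above can also be done programmatically. The following Python sketch sums the `nvidia.com/gpu` entries from the JSON that `kubectl get nodes -o json` prints. The function name `gpus_per_node` and the two sample nodes are made up for illustration; in practice the input would be the real kubectl output.

```python
import json


def gpus_per_node(nodes_json):
    """Map each node name to its allocatable GPU count (0 if it has none)."""
    result = {}
    for node in json.loads(nodes_json)["items"]:
        allocatable = node["status"].get("allocatable", {})
        # Nodes without the device plugin simply lack the nvidia.com/gpu key.
        result[node["metadata"]["name"]] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "2"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "16"}}},
    ]
})

gpus = gpus_per_node(sample)
print(gpus, "total:", sum(gpus.values()))
```

A node that reports 0 after the device plugin has been installed usually deserves a look at the driver and container runtime configuration on that host.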

3.3 Deploying a Jupyter Notebook

# Create the Jupyter Notebook and its data volume
$ kubectl apply -f - << 'EOF'
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: fgedu-notebook
  namespace: kubeflow-user-example-com
spec:
  template:
    spec:
      containers:
      - name: notebook
        image: kubeflownotebookswg/jupyter-tensorflow-full:v1.2.0
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: notebook-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: notebook-data
  namespace: kubeflow-user-example-com
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
EOF

# Check the Notebook status
$ kubectl get notebooks -n kubeflow-user-example-com
NAME STATUS AGE
fgedu-notebook Running 5m

# Get the Notebook access URL
$ kubectl get svc -n kubeflow | grep istio-ingressgateway
istio-ingressgateway LoadBalancer 10.96.78.90 192.168.1.100 15020:32400/TCP,80:30880/TCP,443:30443/TCP,31400:31400/TCP,15443:32281/TCP 15m

# Access URL: http://192.168.1.100:30880

3.4 Deploying a Model Training Job

# Create the TensorFlow training job
$ kubectl apply -f - << 'EOF'
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: fgedu-tf-training
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "2"
                memory: "8Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "1"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data
          - name: code
            configMap:
              name: training-code
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
  namespace: kubeflow-user-example-com
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: training-code
  namespace: kubeflow-user-example-com
data:
  train.py: |
    import tensorflow as tf
    from tensorflow.keras.datasets import mnist
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten
    from tensorflow.keras.layers import Conv2D, MaxPooling2D
    from tensorflow.keras import backend as K

    # Load the data
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # Preprocess the data: add the channel axis expected by the backend
    if K.image_data_format() == 'channels_first':
        x_train = x_train.reshape(x_train.shape[0], 1, 28, 28)
        x_test = x_test.reshape(x_test.shape[0], 1, 28, 28)
        input_shape = (1, 28, 28)
    else:
        x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
        x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
        input_shape = (28, 28, 1)

    # Scale pixel values to [0, 1]
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255

    # One-hot encode the labels
    y_train = tf.keras.utils.to_categorical(y_train, 10)
    y_test = tf.keras.utils.to_categorical(y_test, 10)

    # Build the model
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10, activation='softmax'))

    # Compile the model
    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=tf.keras.optimizers.Adadelta(),
                  metrics=['accuracy'])

    # Train the model
    model.fit(x_train, y_train,
              batch_size=128,
              epochs=10,
              verbose=1,
              validation_data=(x_test, y_test))

    # Evaluate the model
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

    # Save the model
    model.save('/data/mnist_model.h5')
EOF

# Check the training job status
$ kubectl get tfjobs -n kubeflow-user-example-com
NAME STATUS AGE
fgedu-tf-training Running 5m

# Follow the training logs (TFJob pods are named <job>-<replica type>-<index>)
$ kubectl logs -f fgedu-tf-training-worker-0 -n kubeflow-user-example-com
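
The TFJob operator tells each replica its role through the `TF_CONFIG` environment variable, a JSON document with a `cluster` map and a `task` entry. The sketch below shows one way a training script might read it, for example so that only one replica writes the model file to the shared volume. The function name `describe_task` and the sample host names are hypothetical, and the chief-selection rule is a common convention rather than something the tutorial itself prescribes.

```python
import json
import os


def describe_task(tf_config_json):
    """Summarize this replica's role from a TF_CONFIG JSON string."""
    config = json.loads(tf_config_json) if tf_config_json else {}
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    task_type = task.get("type", "worker")
    index = int(task.get("index", 0))
    # With a dedicated chief/master replica, that replica is the chief;
    # otherwise worker 0 conventionally takes the role.
    if cluster.get("chief") or cluster.get("master"):
        is_chief = task_type in ("chief", "master")
    else:
        is_chief = task_type == "worker" and index == 0
    return {
        "type": task_type,
        "index": index,
        "num_workers": len(cluster.get("worker", [])),
        "is_chief": is_chief,
    }


# Inside a TFJob pod the variable is set by the operator; outside it is empty.
print(describe_task(os.environ.get("TF_CONFIG", "")))

# Hypothetical TF_CONFIG for the second of two workers.
sample = json.dumps({
    "cluster": {"worker": ["job-worker-0:2222", "job-worker-1:2222"]},
    "task": {"type": "worker", "index": 1},
})
print(describe_task(sample))
```

A script could guard its `model.save(...)` call with `is_chief` so that two workers do not overwrite the same file.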

3.5 Deploying the Model Service

# Create the model service and its storage volume
$ kubectl apply -f - << 'EOF'
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: mnist-model
  namespace: kubeflow-user-example-com
spec:
  default:
    predictor:
      tensorflow:
        storageUri: pvc://model-storage/mnist_model.h5
        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
          limits:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "1"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage
  namespace: kubeflow-user-example-com
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF

# Check the model service status
$ kubectl get inferenceservices -n kubeflow-user-example-com
NAME URL READY AGE
mnist-model http://mnist-model.kubeflow-user-example-com.fgedu.net.cn True 5m

# Test the model service with one all-zero 28x28x1 image
$ python3 -c 'import json; print(json.dumps({"instances": [[[[0.0]] * 28] * 28]}))' > input.json
$ curl -X POST http://mnist-model.kubeflow-user-example-com.fgedu.net.cn/v1/models/mnist-model:predict -d @input.json
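
For callers that prefer a script to a hand-written JSON body, the following Python sketch builds the same kind of request and decodes a reply. It assumes the TensorFlow Serving REST convention of an `instances` list in the request and a `predictions` list in the response. The helper names are illustrative, and the sample response is made up rather than captured from a real service.

```python
import json


def build_predict_request(images):
    """Serialize a batch of 28x28x1 images into a predict request body."""
    return json.dumps({"instances": images})


def predicted_classes(response_body):
    """Return the index of the highest score for each prediction."""
    predictions = json.loads(response_body)["predictions"]
    return [max(range(len(p)), key=p.__getitem__) for p in predictions]


# One all-zero image with shape 28 x 28 x 1.
blank_image = [[[0.0] for _ in range(28)] for _ in range(28)]
body = build_predict_request([blank_image])
print(len(body), "bytes in request body")

# Made-up response: ten class scores for a single instance.
sample_response = json.dumps(
    {"predictions": [[0.01, 0.02, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]]}
)
print(predicted_classes(sample_response))
```

The request body could be written to a file and sent with `curl -d @file`, or posted with any HTTP client.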


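Part04-Production Cases and Hands-On Walkthroughs

4.1 Enterprise AI Platform Case

One enterprise's AI platform looks like this in practice:

Technology stack: Kubernetes + Kubeflow + TensorFlow + PyTorch
Hardware: an NVIDIA A100 GPU cluster with fast NVMe storage
Platform functions: data processing, model training, model deployment, model monitoring
Use cases: image recognition, natural language processing, predictive analytics
Performance tuning: distributed training and optimized GPU utilization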

4.2 Large-Scale Model Training Case

# Deploy the distributed training job
$ kubectl apply -f - << 'EOF'
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: distributed-tf-training
  namespace: kubeflow-user-example-com
spec:
  tfReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "8"
                memory: "32Gi"
                nvidia.com/gpu: "1"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data
          - name: code
            configMap:
              name: distributed-training-code
    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "1"
              limits:
                cpu: "8"
                memory: "32Gi"
                nvidia.com/gpu: "1"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data
          - name: code
            configMap:
              name: distributed-training-code
    PS:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "4"
                memory: "16Gi"
              limits:
                cpu: "8"
                memory: "32Gi"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: training-data
          - name: code
            configMap:
              name: distributed-training-code
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: distributed-training-code
  namespace: kubeflow-user-example-com
data:
  train.py: |
    import tensorflow as tf
    from tensorflow.keras.datasets import cifar10
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten
    from tensorflow.keras.layers import Conv2D, MaxPooling2D

    # Distribution strategy.
    # Note: MirroredStrategy only synchronizes the GPUs visible inside one pod;
    # training across the Master/Worker/PS pods would need a multi-worker or
    # parameter-server strategy driven by TF_CONFIG.
    strategy = tf.distribute.MirroredStrategy()

    # Load the data
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()

    # Preprocess the data: scale pixel values to [0, 1]
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255

    # One-hot encode the labels
    y_train = tf.keras.utils.to_categorical(y_train, 10)
    y_test = tf.keras.utils.to_categorical(y_test, 10)

    # Build and compile the model inside the strategy scope
    with strategy.scope():
        model = Sequential()
        model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
        model.add(Conv2D(64, (3, 3), activation='relu'))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Dropout(0.25))
        model.add(Flatten())
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.5))
        model.add(Dense(10, activation='softmax'))

        model.compile(loss=tf.keras.losses.categorical_crossentropy,
                      optimizer=tf.keras.optimizers.Adadelta(),
                      metrics=['accuracy'])

    # Train the model
    model.fit(x_train, y_train,
              batch_size=128,
              epochs=50,
              verbose=1,
              validation_data=(x_test, y_test))

    # Evaluate the model
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

    # Save the model
    model.save('/data/cifar10_model.h5')
EOF

# Check the training job status
$ kubectl get tfjobs -n kubeflow-user-example-com
NAME STATUS AGE
distributed-tf-training Running 10m

# Follow the training logs (TFJob pods are named <job>-<replica type>-<index>)
$ kubectl logs -f distributed-tf-training-master-0 -n kubeflow-user-example-com

4.3 AI Model Monitoring and Management

# Deploy the model monitoring service
$ kubectl apply -f - << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-monitor
  namespace: kubeflow-user-example-com
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-monitor
  template:
    metadata:
      labels:
        app: model-monitor
    spec:
      containers:
      - name: model-monitor
        image: harbor.fgedu.net.cn/library/model-monitor:v1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: PROMETHEUS_URL
          value: "http://prometheus.kubeflow.svc:9090"
        - name: GRAFANA_URL
          value: "http://grafana.kubeflow.svc:3000"
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: model-monitor
  namespace: kubeflow-user-example-com
spec:
  selector:
    app: model-monitor
  ports:
  - port: 80
    targetPort: 8080
  type: NodePort
EOF

# Configure the Prometheus alerting rules
$ kubectl apply -f - << 'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: model-monitoring-rules
  namespace: kubeflow
spec:
  groups:
  - name: model-monitoring
    rules:
    - alert: ModelAccuracyDrop
      expr: model_accuracy{job="model-monitor"} < 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Model accuracy dropped"
        description: "Accuracy of model {{ $labels.model }} is below 80%"
    - alert: ModelLatencyHigh
      expr: model_inference_latency{job="model-monitor"} > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Model inference latency too high"
        description: "Inference latency of model {{ $labels.model }} exceeds 1000ms"
EOF
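
The alert rules above only fire if something exports `model_accuracy` and `model_inference_latency` with a `model` label. The tutorial does not show how the `model-monitor` image does this, so the sketch below is purely illustrative: it renders two gauges in the Prometheus text exposition format using the standard library only. The helper names and the sample values are assumptions; the `job` label used in the rules is attached by Prometheus at scrape time, so it is not emitted here.

```python
def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge; `samples` maps a model name to its current value."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for model, value in sorted(samples.items()):
        lines.append(f'{name}{{model="{escape_label(model)}"}} {value}')
    return "\n".join(lines) + "\n"


# Made-up sample values for a single model.
text = render_gauge("model_accuracy",
                    "Latest evaluation accuracy per model.",
                    {"mnist-model": 0.93})
text += render_gauge("model_inference_latency",
                     "Inference latency in milliseconds.",
                     {"mnist-model": 120.0})
print(text, end="")
```

Serving this text on an HTTP `/metrics` endpoint is what lets Prometheus scrape the values that the two alert expressions compare against 0.8 and 1000.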

4.4 AI Platform Automation Script

#!/bin/bash
# ai_platform.sh

# Deploy a Jupyter Notebook
deploy_notebook() {
    local name=$1
    local namespace=$2
    local gpu=$3

    echo "=== Deploying Jupyter Notebook: $name ==="
    kubectl apply -f - << EOF
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: $name
  namespace: $namespace
spec:
  template:
    spec:
      containers:
      - name: notebook
        image: kubeflownotebookswg/jupyter-tensorflow-full:v1.2.0
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "$gpu"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "$gpu"
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: ${name}-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}-data
  namespace: $namespace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
EOF
}

# Deploy a model training job
deploy_training() {
    local name=$1
    local namespace=$2
    local replicas=$3
    local gpu=$4

    echo "=== Deploying model training job: $name ==="
    kubectl apply -f - << EOF
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: $name
  namespace: $namespace
spec:
  tfReplicaSpecs:
    Worker:
      replicas: $replicas
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: tensorflow/tensorflow:2.8.0-gpu
            command:
            - python
            - /app/train.py
            resources:
              requests:
                cpu: "2"
                memory: "8Gi"
                nvidia.com/gpu: "$gpu"
              limits:
                cpu: "4"
                memory: "16Gi"
                nvidia.com/gpu: "$gpu"
            volumeMounts:
            - name: data
              mountPath: /data
            - name: code
              mountPath: /app
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: ${name}-data
          - name: code
            configMap:
              name: ${name}-code
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}-data
  namespace: $namespace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${name}-code
  namespace: $namespace
data:
  train.py: |
    import tensorflow as tf
    from tensorflow.keras.datasets import mnist
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Flatten
    from tensorflow.keras.layers import Conv2D, MaxPooling2D
    from tensorflow.keras import backend as K

    # Load the data
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # Preprocess the data: add the channel axis expected by the backend
    if K.image_data_format() == 'channels_first':
        x_train = x_train.reshape(x_train.shape[0], 1, 28, 28)
        x_test = x_test.reshape(x_test.shape[0], 1, 28, 28)
        input_shape = (1, 28, 28)
    else:
        x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
        x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
        input_shape = (28, 28, 1)

    # Scale pixel values to [0, 1]
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255

    # One-hot encode the labels
    y_train = tf.keras.utils.to_categorical(y_train, 10)
    y_test = tf.keras.utils.to_categorical(y_test, 10)

    # Build the model
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10, activation='softmax'))

    # Compile the model
    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=tf.keras.optimizers.Adadelta(),
                  metrics=['accuracy'])

    # Train the model
    model.fit(x_train, y_train,
              batch_size=128,
              epochs=10,
              verbose=1,
              validation_data=(x_test, y_test))

    # Evaluate the model
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

    # Save the model
    model.save('/data/model.h5')
EOF
}

# Deploy a model service
deploy_model_service() {
    local name=$1
    local namespace=$2
    local model_path=$3
    local gpu=$4

    echo "=== Deploying model service: $name ==="
    kubectl apply -f - << EOF
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: $name
  namespace: $namespace
spec:
  default:
    predictor:
      tensorflow:
        storageUri: $model_path
        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
            nvidia.com/gpu: "$gpu"
          limits:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "$gpu"
EOF
}

# Main menu: loop until the user chooses to exit
main() {
    while true; do
        echo "AI platform management script"
        echo "1. Deploy a Jupyter Notebook"
        echo "2. Deploy a model training job"
        echo "3. Deploy a model service"
        echo "4. Show resource status"
        echo "5. Exit"
        read -r -p "Select an operation: " choice

        case $choice in
            1)
                read -r -p "Notebook name: " name
                read -r -p "Namespace: " namespace
                read -r -p "Number of GPUs: " gpu
                deploy_notebook "$name" "$namespace" "$gpu"
                ;;
            2)
                read -r -p "Training job name: " name
                read -r -p "Namespace: " namespace
                read -r -p "Number of workers: " replicas
                read -r -p "GPUs per worker: " gpu
                deploy_training "$name" "$namespace" "$replicas" "$gpu"
                ;;
            3)
                read -r -p "Model service name: " name
                read -r -p "Namespace: " namespace
                read -r -p "Model storage path: " model_path
                read -r -p "Number of GPUs: " gpu
                deploy_model_service "$name" "$namespace" "$model_path" "$gpu"
                ;;
            4)
                read -r -p "Namespace: " namespace
                echo "=== Pod status ==="
                kubectl get pods -n "$namespace"
                echo "=== Service status ==="
                kubectl get svc -n "$namespace"
                echo "=== GPU resources ==="
                kubectl get nodes -o json | jq '.items[].status.allocatable' | grep -E 'nvidia|cpu|memory'
                ;;
            5)
                exit 0
                ;;
            *)
                echo "Invalid choice"
                ;;
        esac
    done
}

main
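
The script passes whatever the operator types straight into the manifests. As a hedged illustration of the checks that could run first, the Python sketch below validates a resource name against the DNS label rules that Kubernetes applies to namespace names (lowercase alphanumerics and hyphens, alphanumeric at both ends, at most 63 characters) and parses the GPU count. The function names are hypothetical, and the same checks could equally be written in bash.

```python
import re

# DNS label: lowercase alphanumerics or '-', alphanumeric at both ends.
DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_name(name):
    """Return True if `name` is a valid DNS label of at most 63 characters."""
    return len(name) <= 63 and DNS_LABEL.fullmatch(name) is not None


def parse_gpu_count(text):
    """Parse a non-negative whole number of GPUs, raising ValueError otherwise."""
    count = int(text)
    if count < 0:
        raise ValueError("GPU count must not be negative")
    return count


for candidate in ("fgedu-notebook", "Bad_Name", "-leading-hyphen"):
    print(candidate, is_valid_name(candidate))
print(parse_gpu_count("2"))
```

Rejecting bad input before calling `kubectl apply` gives the operator a clearer message than the API server's validation error.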


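Part05-Lessons Learned and Tips from 风哥

5.1 Best Practices for AI/Machine Learning Platforms

Hardware selection: choose GPUs that fit the model type and training scale, such as the NVIDIA A100 or H100
Software stack selection: choose suitable AI frameworks and tools, such as TensorFlow, PyTorch, and Kubeflow
Resource management: allocate GPUs deliberately, using K8s resource quotas and limits
Distributed training: for large models, use distributed training to shorten training time
Model management: establish model versioning and a deployment process so that every model is traceable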
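
As one possible reading of the resource-management practice above, the sketch below builds a ResourceQuota that caps the GPUs a namespace may request. For extended resources such as `nvidia.com/gpu`, Kubernetes quotas use the `requests.` prefix. The function name, the quota name `gpu-quota`, and the limit of 4 GPUs are illustrative assumptions.

```python
import json


def gpu_quota_manifest(namespace, max_gpus, name="gpu-quota"):
    """Build a ResourceQuota limiting the total GPUs requested in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        # Extended resources are quota-limited through the `requests.` prefix.
        "spec": {"hard": {"requests.nvidia.com/gpu": str(max_gpus)}},
    }


quota = gpu_quota_manifest("kubeflow-user-example-com", 4)
print(json.dumps(quota, indent=2))
```

With such a quota in place, a Pod that would push the namespace past the cap is rejected at admission time instead of waiting in the scheduler queue.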

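5.2 Common Problems and Solutions

Insufficient GPU resources: use GPU sharing techniques such as NVIDIA MIG or time-slicing, and schedule GPUs carefully
Slow training: optimize the model architecture, use mixed-precision training, and increase the batch size
High model serving latency: use techniques such as model quantization and model compression to cut inference latency
Data processing bottlenecks: use a distributed data processing framework such as Apache Spark
Complex platform maintenance: use automation tools to simplify platform maintenance and management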

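5.3 Performance Optimization Recommendations

GPU optimization: use recent CUDA and cuDNN versions supported by the chosen framework, and optimize GPU memory usage
Data optimization: use data prefetching and caching to reduce data loading time
Model optimization: use techniques such as model compression and quantization to reduce model size and inference time
Network optimization: use a high-speed network to reduce the communication overhead of distributed training
Storage optimization: use fast storage such as NVMe SSDs to speed up data reads and writes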
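
The effect of quantization on model size can be estimated with simple arithmetic: weight storage is roughly the parameter count times the bytes per parameter. The sketch below uses an assumed model of 100 million parameters and counts weights only, ignoring file-format overhead and any layers that are kept in higher precision.

```python
# Storage cost per parameter for common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_megabytes(param_count, dtype):
    """Approximate weight storage in MB (1 MB = 1e6 bytes) for a given dtype."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e6


params = 100_000_000  # assumed example model size
for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: about {weights_megabytes(params, dtype):.0f} MB")
```

Moving from float32 to int8 therefore cuts weight storage to roughly a quarter, which also reduces the memory traffic per inference.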

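5.4 Future Trends

The large-model era: support for training and deploying ever larger models
Edge AI: deploying AI models to edge devices for real-time inference
Automated machine learning: using AutoML to select and optimize models automatically
Federated learning: training models while protecting data privacy
Deep integration of AI and cloud native: using cloud-native technology to deliver more flexible and scalable AI platforms

风哥 tip: Building an AI/machine learning platform means weighing hardware, software, networking, and other factors together, then tuning and adjusting them for the specific use case and requirements.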

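This article was compiled and published by 风哥教程 for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html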