IT Tutorial FG350: AI Model Deployment Architecture

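1. Deployment Architecture Overview

The deployment architecture is what moves a trained model into production, and it has to balance performance, reliability, and scalability. A well-designed architecture keeps the model service efficient and responsive.

For AI model deployment at FGedu, we built a complete pipeline from model training through production deployment that supports multiple model-serving frameworks and deployment modes.

1.1 Deployment Architecture Design

Choose a deployment architecture that matches the business requirements.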

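# AI model deployment architecture design

# Deployment mode comparison
Deployment modes:
1. Online inference
- Real-time responses
- Low-latency requirement
- Resources always resident
- Use cases: real-time recommendation, image recognition

2. Batch inference
- Requests processed in bulk
- Throughput first
- Resources allocated on demand
- Use cases: offline analytics, data processing

3. Edge inference
- Deployed on the local device
- Privacy protection
- Resource-constrained
- Use cases: IoT devices, mobile apps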

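# FGedu AI model deployment architecture
Architecture layers:
├── Model training layer
│   ├── Training cluster (GPU)
│   ├── Experiment management
│   └── Model registry
├── Model serving layer
│   ├── Online inference service
│   ├── Batch inference service
│   └── Model routing
├── Application access layer
│   ├── API gateway
│   ├── Load balancing
│   └── Traffic control
└── Monitoring and operations layer
    ├── Performance monitoring
    ├── Log collection
    └── Alert management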

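# Model-serving framework selection
Framework     Language  Performance  Key trait            Best suited for
------------  --------  -----------  -------------------  --------------------------
TensorFlow    C++/Py    High         Mature ecosystem     Large-scale production
PyTorch       Python    Medium       Easy to use          Research and iteration
ONNX Runtime  C++       High         Cross-platform       Multi-framework deployment
TensorRT      C++       Highest      NVIDIA-optimized     GPU inference
Triton        C++       High         Multi-model support  Enterprise serving
TorchServe    Java      Medium       PyTorch-native       PyTorch models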

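# Resource requirement planning
Model type            CPU       GPU     Memory  Latency target
--------------------  --------  ------  ------  --------------
Image classification  2 cores   1 GPU   4GB     <100ms
Object detection      4 cores   1 GPU   8GB     <200ms
NLP model             4 cores   1 GPU   16GB    <500ms
Recommendation model  8 cores   None    32GB    <50ms
Large language model  16 cores  4 GPUs  128GB   <2s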
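
The per-replica figures in the planning table can be turned into a cluster-level estimate by multiplying by a replica plan. The following is a minimal sketch of that arithmetic; the dictionary keys, the `Profile` type, and the example replica plan are hypothetical names introduced here for illustration and are not part of the FGedu setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    cpu_cores: int
    gpus: int
    memory_gb: int


# Per-replica requirements copied from the resource planning table.
PROFILES = {
    "image_classification": Profile(2, 1, 4),
    "object_detection": Profile(4, 1, 8),
    "nlp": Profile(4, 1, 16),
    "recommendation": Profile(8, 0, 32),
    "llm": Profile(16, 4, 128),
}


def total_resources(replicas: dict) -> Profile:
    """Sum per-replica requirements over a {model_type: replica_count} plan."""
    cpu = sum(PROFILES[m].cpu_cores * n for m, n in replicas.items())
    gpu = sum(PROFILES[m].gpus * n for m, n in replicas.items())
    mem = sum(PROFILES[m].memory_gb * n for m, n in replicas.items())
    return Profile(cpu, gpu, mem)


# Hypothetical plan: 3 image classifiers and 2 recommendation replicas.
plan = {"image_classification": 3, "recommendation": 2}
print(total_resources(plan))
```

The result is a lower bound for scheduling purposes: it ignores per-node overhead and any headroom kept for autoscaling.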

2. Model-Serving Frameworks

2.1 TensorFlow Serving

Deploy a model service with TensorFlow Serving.

# TensorFlow Serving deployment

# 1. Export the model
import tensorflow as tf

# Load the trained model
model = tf.keras.models.load_model('/models/fgedu_classifier')

# Export in SavedModel format; the trailing "1" is the version directory
export_path = '/models/fgedu_classifier/1'
tf.saved_model.save(model, export_path)

print(f"Model exported to: {export_path}")

# Output
Model exported to: /models/fgedu_classifier/1

# 2. Start TensorFlow Serving
# Port 8501 serves REST; port 8500 serves gRPC (used in step 4)
$ docker run -p 8500:8500 -p 8501:8501 \
  --mount type=bind,source=/models/fgedu_classifier,target=/models/fgedu_classifier \
  -e MODEL_NAME=fgedu_classifier \
  tensorflow/serving:latest

# Log output
2026-04-03 10:00:00.000000: I tensorflow_serving/model_servers/main.cc:360]
Building single TensorFlow model file config: model_name: fgedu_classifier model_base_path: /models/fgedu_classifier
2026-04-03 10:00:00.000000: I tensorflow_serving/model_servers/server_core.cc:465]
Adding/updating models.
2026-04-03 10:00:00.000000: I tensorflow_serving/model_servers/server_core.cc:591]
Successfully loaded servable version {name: fgedu_classifier version: 1}

# 3. Call the REST API
$ curl -d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}' \
  -X POST http://fgedudb:8501/v1/models/fgedu_classifier:predict

{
  "predictions": [[0.95, 0.03, 0.02]]
}
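
The same REST call can be issued from Python with only the standard library. This is a sketch under the assumption that the server is reachable at the host and port used in the curl example; `build_predict_request` and `predict` are hypothetical helper names, not part of any TensorFlow Serving client package.

```python
import json
import urllib.request


def build_predict_request(base_url, model, instances):
    """Build (but do not send) a TensorFlow Serving REST predict request."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/models/{model}:predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url, model, instances, timeout=10.0):
    """Send the request and return the 'predictions' field of the response."""
    req = build_predict_request(base_url, model, instances)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["predictions"]


# Building the request needs no running server; predict() does.
req = build_predict_request(
    "http://fgedudb:8501", "fgedu_classifier", [[1.0, 2.0, 3.0, 4.0]]
)
print(req.get_method(), req.full_url)
```

Separating request construction from sending keeps the payload format easy to unit-test without a live server.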

# 4. Call over gRPC
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# Create the gRPC channel (8500 is the gRPC port)
channel = grpc.insecure_channel('fgedudb:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build the request; 'input_tensor' must match the input name in the model signature
request = predict_pb2.PredictRequest()
request.model_spec.name = 'fgedu_classifier'
request.model_spec.signature_name = 'serving_default'
request.inputs['input_tensor'].CopyFrom(
    tf.make_tensor_proto([[1.0, 2.0, 3.0, 4.0]])
)

# Send the request with a 10-second timeout
response = stub.Predict(request, 10.0)
print(response.outputs['output_tensor'].float_val)

# Output
[0.95, 0.03, 0.02]

# 5. Configuration file
# models.config
model_config_list {
  config {
    name: 'fgedu_classifier'
    base_path: '/models/fgedu_classifier'
    model_platform: 'tensorflow'
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
  }
  config {
    name: 'fgedu_nlp'
    base_path: '/models/fgedu_nlp'
    model_platform: 'tensorflow'
  }
}

# Start the server with the configuration file
$ docker run -p 8501:8501 \
  --mount type=bind,source=/models,target=/models \
  --mount type=bind,source=/config/models.config,target=/config/models.config \
  tensorflow/serving:latest \
  --model_config_file=/config/models.config
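
The two `config` entries in models.config behave differently: `fgedu_classifier` pins versions 1 and 2 through a `specific` policy, while `fgedu_nlp` has no policy and so falls back to TensorFlow Serving's default of serving only the latest version. The sketch below models that selection in plain Python; `served_versions` is a hypothetical helper written for illustration, not a TensorFlow Serving API.

```python
def served_versions(available, policy="latest", num_versions=1, specific=()):
    """Return the version numbers that would be served under a version policy.

    available: version directories found under the model's base_path.
    policy: "latest" (the default behaviour), "all", or "specific".
    """
    versions = sorted(set(available))
    if policy == "latest":
        return versions[-num_versions:] if num_versions > 0 else []
    if policy == "all":
        return versions
    if policy == "specific":
        wanted = set(specific)
        return [v for v in versions if v in wanted]
    raise ValueError(f"unknown policy: {policy}")


# fgedu_nlp: no policy given, so only the newest version is served.
print(served_versions([1, 2, 3]))
# fgedu_classifier: versions 1 and 2 are pinned even though 3 exists.
print(served_versions([1, 2, 3], policy="specific", specific=(1, 2)))
```

Pinning versions is what makes side-by-side comparison and rollback possible, since a newly exported version is not served until the policy lists it.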

# 6. Model version management
# Check the model status
$ curl http://fgedudb:8501/v1/models/fgedu_classifier

{
  "model_version_status": [
    {
      "version": "1",
      "state": "AVAILABLE",
      "status": {
        "error_code": "OK",
        "error_message": ""
      }
    },
    {
      "version": "2",
      "state": "AVAILABLE",
      "status": {
        "error_code": "OK",
        "error_message": ""
      }
    }
  ]
}

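# 7. Performance tuning
# Enable request batching (the batching file and model config must be mounted into the container)
$ docker run -p 8501:8501 \
  --mount type=bind,source=/models,target=/models \
  --mount type=bind,source=/config,target=/config \
  tensorflow/serving:latest \
  --model_config_file=/config/models.config \
  --enable_batching=true \
  --batching_parameters_file=/config/batching_config.txt

# batching_config.txt
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
num_batch_threads { value: 8 }
max_enqueued_batches { value: 1000000 }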
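
The interaction between `max_batch_size` and `batch_timeout_micros` is easier to see with a small simulation. The sketch below is a deliberately simplified model, not TensorFlow Serving's scheduler: it assumes a single queue, closes a batch when it is full, and closes a non-full batch once the timeout has elapsed since the batch's first request. `form_batches` is a hypothetical helper.

```python
def form_batches(arrivals_us, max_batch_size=32, batch_timeout_micros=5000):
    """Group request arrival times (microseconds) into batches.

    A batch closes when it holds max_batch_size requests, or when a new
    request arrives at least batch_timeout_micros after the batch opened.
    """
    batches, current, opened_at = [], [], None
    for t in sorted(arrivals_us):
        if current and t - opened_at >= batch_timeout_micros:
            batches.append(current)  # timed out before filling up
            current = []
        if not current:
            opened_at = t  # this request opens a new batch
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)  # full batch is dispatched immediately
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 1000, 2000, 6000, 7000]
print(form_batches(arrivals))                    # timeout-driven batches
print(form_batches(arrivals, max_batch_size=2))  # size-driven batches
```

A larger timeout yields fuller batches and better throughput at the cost of added queueing delay, which is why the 5000-microsecond value has to be weighed against the latency targets in section 1.1.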

2.2 Triton Inference Server

Use Triton to serve models from multiple frameworks.

# Triton Inference Server deployment

# 1. Model repository layout
$ tree /model_repository
/model_repository
├── fgedu_classifier
│   ├── config.pbtxt
│   └── 1
│       └── model.onnx
├── fgedu_nlp
│   ├── config.pbtxt
│   └── 1
│       └── model.pt
└── fgedu_detection
    ├── config.pbtxt
    └── 1
        └── model.plan

# 2. Model configuration file
# config.pbtxt
name: "fgedu_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100
}
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
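
Because `max_batch_size` is greater than zero, the `dims` in config.pbtxt describe a single item and every request adds a leading batch dimension, so a valid input for this model has shape `[batch, 3, 224, 224]` with `batch` between 1 and 32. The sketch below checks a request shape against that rule; `check_request_shape` is a hypothetical client-side helper, not part of the Triton client library.

```python
def check_request_shape(shape, dims, max_batch_size):
    """Return True if a request tensor shape is compatible with the config.

    dims uses -1 for a variable-size dimension. When max_batch_size > 0 the
    request must carry an extra leading batch dimension.
    """
    if max_batch_size > 0:
        if len(shape) != len(dims) + 1:
            return False
        if not 1 <= shape[0] <= max_batch_size:
            return False
        rest = shape[1:]
    else:
        if len(shape) != len(dims):
            return False
        rest = shape
    return all(d == -1 or d == s for d, s in zip(dims, rest))


dims = [3, 224, 224]
print(check_request_shape([1, 3, 224, 224], dims, 32))   # single item
print(check_request_shape([64, 3, 224, 224], dims, 32))  # batch too large
print(check_request_shape([3, 224, 224], dims, 32))      # batch dim missing
```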

# 3. Start Triton Server
$ docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.01-py3 \
  tritonserver --model-repository=/models

# Log output
I0403 10:00:00.000000 1 server.cc:262] "Initializing Triton Inference Server"
I0403 10:00:00.000000 1 server.cc:178] "Loading model: fgedu_classifier"
I0403 10:00:00.000000 1 model_repository_manager.cc:1195] "successfully loaded 'fgedu_classifier' version 1"
I0403 10:00:00.000000 1 server.cc:612]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
| fgedu_classifier | 1    |
| fgedu_nlp        | 1    |
| fgedu_detection  | 1    |
+------------------+------+

# 4. Client call
import tritonclient.http as httpclient
import numpy as np

# Create the client (port 8000 is Triton's HTTP endpoint)
client = httpclient.InferenceServerClient(url='fgedudb:8000')

# Prepare the input
inputs = []
inputs.append(httpclient.InferInput('input', [1, 3, 224, 224], 'FP32'))
inputs[0].set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

# Prepare the requested output
outputs = []
outputs.append(httpclient.InferRequestedOutput('output'))

# Send the request
response = client.infer('fgedu_classifier', inputs, outputs=outputs)
result = response.as_numpy('output')
print(result.shape)

# Output
(1, 1000)

# 5. Model ensemble
# ensemble_config.pbtxt
name: "fgedu_ensemble"
platform: "ensemble"
max_batch_size: 32
input [
  {
    name: "IMAGE"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "CLASSIFICATION"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "fgedu_preprocess"
      model_version: 1
      input_map {
        key: "RAW_IMAGE"
        value: "IMAGE"
      }
      output_map {
        key: "PROCESSED_IMAGE"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "fgedu_classifier"
      model_version: 1
      input_map {
        key: "input"
        value: "preprocessed_image"
      }
      output_map {
        key: "output"
        value: "CLASSIFICATION"
      }
    }
  ]
}
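
In the ensemble configuration, each `input_map` value must name either an ensemble input or a tensor produced by an earlier step's `output_map`, and every ensemble output must be produced by some step. The sketch below checks that wiring on a plain-dictionary copy of the two steps; `check_ensemble` is a hypothetical helper for illustration and assumes steps run in the listed order, which is a simplification of Triton's data-flow scheduling.

```python
def check_ensemble(inputs, outputs, steps):
    """Return a list of wiring problems; an empty list means fully connected."""
    available = set(inputs)
    problems = []
    for step in steps:
        for model_input, tensor in step["input_map"].items():
            if tensor not in available:
                problems.append(
                    f"{step['model_name']}.{model_input} reads unknown tensor {tensor!r}"
                )
        # Tensors written by this step become visible to later steps.
        available.update(step["output_map"].values())
    for name in outputs:
        if name not in available:
            problems.append(f"ensemble output {name!r} is never produced")
    return problems


steps = [
    {
        "model_name": "fgedu_preprocess",
        "input_map": {"RAW_IMAGE": "IMAGE"},
        "output_map": {"PROCESSED_IMAGE": "preprocessed_image"},
    },
    {
        "model_name": "fgedu_classifier",
        "input_map": {"input": "preprocessed_image"},
        "output_map": {"output": "CLASSIFICATION"},
    },
]
print(check_ensemble(["IMAGE"], ["CLASSIFICATION"], steps))
```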

# 6. Performance analysis
$ perf_analyzer -m fgedu_classifier -u fgedudb:8000

*** Measurement Settings ***
Batch size: 1
Measurement window: 5000 msec
Using synchronous calls for inference

*** Client Triton Performance Inference Client ***
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1250 infer/sec, latency 800 usec
Concurrency: 2, throughput: 2450 infer/sec, latency 816 usec
Concurrency: 4, throughput: 4800 infer/sec, latency 833 usec
Concurrency: 8, throughput: 9200 infer/sec, latency 869 usec
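
For a synchronous, closed-loop client such as this perf_analyzer run, Little's law ties the columns of the report together: average latency is approximately concurrency divided by throughput. The sketch below recomputes the latency implied by each concurrency and throughput pair in the report, as a sanity check on the units; `expected_latency_usec` is a hypothetical helper name.

```python
def expected_latency_usec(concurrency, throughput_per_sec):
    """Little's law for a closed loop: latency = in-flight requests / throughput."""
    return concurrency / throughput_per_sec * 1_000_000


# (concurrency, throughput in infer/sec) pairs taken from the report.
report = [(1, 1250), (2, 2450), (4, 4800), (8, 9200)]
for concurrency, throughput in report:
    latency = expected_latency_usec(concurrency, throughput)
    print(f"concurrency {concurrency}: about {latency:.0f} usec")
```

The implied latencies are in the hundreds of microseconds, so throughput scaling almost linearly with concurrency here means the server is not yet saturated at concurrency 8.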

# 7. Health checks
$ curl http://fgedudb:8000/v2/health/ready
{"ready":true}

$ curl http://fgedudb:8000/v2/models/fgedu_classifier
{
  "name": "fgedu_classifier",
  "versions": ["1"],
  "platform": "onnxruntime_onnx",
  "inputs": [
    {
      "name": "input",
      "datatype": "FP32",
      "shape": [3, 224, 224]
    }
  ],
  "outputs": [
    {
      "name": "output",
      "datatype": "FP32",
      "shape": [1000]
    }
  ]
}

3. Kubernetes Deployment

3.1 Deploying the Model Service

Deploy the AI model service on Kubernetes.

# Kubernetes model deployment

# 1. Deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fgedu-classifier
  namespace: ai-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fgedu-classifier
  template:
    metadata:
      labels:
        app: fgedu-classifier
        version: v1.0.0
    spec:
      containers:
        - name: triton-server
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          args:
            - tritonserver
            - --model-repository=/models
            - --strict-model-config=false
            - --log-verbose=1
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: 1
            limits:
              cpu: "8"
              memory: "16Gi"
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-storage
              mountPath: /models
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 5
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
      nodeSelector:
        accelerator: nvidia-tesla-v100

# 2. Service manifest
apiVersion: v1
kind: Service
metadata:
  name: fgedu-classifier
  namespace: ai-serving
spec:
  type: ClusterIP
  selector:
    app: fgedu-classifier
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002

# 3. Ingress manifest
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: fgedu-classifier-ingress
  namespace: ai-serving
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: ai.fgedu.net.cn
      http:
        paths:
          - path: /v1/classifier
            pathType: Prefix
            backend:
              service:
                name: fgedu-classifier
                port:
                  number: 8000

# 4. HPA manifest
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fgedu-classifier-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fgedu-classifier
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: triton_inference_count
        target:
          type: AverageValue
          averageValue: 1000
# 5. GPU scheduling
# Label the GPU node
$ kubectl label nodes k8s-gpu01 accelerator=nvidia-tesla-v100

# GPU resource limit
resources:
  limits:
    nvidia.com/gpu: 1 # request one GPU

# 6. Verify the deployment
$ kubectl get pods -n ai-serving
NAME                            READY   STATUS    RESTARTS   AGE
fgedu-classifier-7b8f9c-d4e5f   1/1     Running   0          10m
fgedu-classifier-7b8f9c-g6h7i   1/1     Running   0          10m
fgedu-classifier-7b8f9c-j8k9l   1/1     Running   0          10m

$ kubectl get svc -n ai-serving
NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
fgedu-classifier   ClusterIP   10.96.100.100   <none>        8000/TCP,8001/TCP,8002/TCP   10m

# 7. Model update strategy
# Blue-green deployment (abridged: selector, labels, and volumes are omitted here)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fgedu-classifier-blue
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: triton-server
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          args: ["tritonserver", "--model-repository=/models/v1"]

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fgedu-classifier-green
spec:
  replicas: 0
  template:
    spec:
      containers:
        - name: triton-server
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          args: ["tritonserver", "--model-repository=/models/v2"]

4. Model Optimization Techniques

4.1 Quantization and Pruning

Reduce model size and speed up inference.

# Model optimization techniques

# 1. Model quantization
import os
import torch

# Load the full pickled model (recent PyTorch releases may need weights_only=False for this)
model = torch.load('fgedu_model.pt')
model.eval()

# Dynamic quantization of Linear and LSTM layers to int8
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear, torch.nn.LSTM},
    dtype=torch.qint8
)

# Save the quantized weights
torch.save(quantized_model.state_dict(), 'fgedu_model_quantized.pt')

# Compare file sizes in MB
original_size = os.path.getsize('fgedu_model.pt') / 1024 / 1024
quantized_size = os.path.getsize('fgedu_model_quantized.pt') / 1024 / 1024

print(f"Original model: {original_size:.2f} MB")
print(f"Quantized model: {quantized_size:.2f} MB")
print(f"Size reduction: {(1 - quantized_size/original_size)*100:.2f}%")

# Output
Original model: 256.78 MB
Quantized model: 68.45 MB
Size reduction: 73.34%

# 2. ONNX export and optimization
import torch.onnx

# Export the model to ONNX with a dynamic batch dimension
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "fgedu_model.onnx",
    opset_version=14,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)

# ONNX post-processing
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

# Load, validate, and re-save the ONNX model (this step alone applies no graph optimization)
onnx_model = onnx.load('fgedu_model.onnx')
onnx.checker.check_model(onnx_model)
onnx.save(onnx_model, 'fgedu_model_optimized.onnx')

# Dynamic quantization of the weights to uint8
quantize_dynamic(
    'fgedu_model.onnx',
    'fgedu_model_quantized.onnx',
    weight_type=QuantType.QUInt8
)

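# 3. TensorRT optimization
# Use the trtexec tool. For an ONNX model with a dynamic batch axis, the batch
# range is given through shape flags rather than --batch, and recent TensorRT
# releases size the workspace with --memPoolSize (MiB) instead of --workspace.
$ trtexec --onnx=fgedu_model.onnx \
  --saveEngine=fgedu_model.trt \
  --fp16 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:32x3x224x224 \
  --maxShapes=input:32x3x224x224 \
  --memPoolSize=workspace:4096

# Output
[I] Building engine…
[I] Engine built in 15.23 seconds
[I] Running inference…
[I] Average over 100 runs: 2.56 ms/batch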

# 4. Model pruning
import torch.nn.utils.prune as prune

# Structured pruning for conv layers, unstructured L1 pruning for linear layers
def prune_model(model, amount=0.3):
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d):
            # Remove whole output channels (dim=0) ranked by L2 norm (n=2)
            prune.ln_structured(module, name='weight', amount=amount, n=2, dim=0)
        elif isinstance(module, torch.nn.Linear):
            # Zero out individual weights with the smallest absolute value
            prune.l1_unstructured(module, name='weight', amount=amount)
    return model

# Apply pruning
pruned_model = prune_model(model, amount=0.3)

# 5. Knowledge distillation
import torch.nn.functional as F

def distillation_loss(student_output, teacher_output, labels, temperature=5.0, alpha=0.5):
    # Soft-label loss: KL divergence between temperature-softened distributions
    soft_loss = F.kl_div(
        F.log_softmax(student_output / temperature, dim=1),
        F.softmax(teacher_output / temperature, dim=1),
        reduction='batchmean'
    ) * (temperature ** 2)

    # Hard-label loss: cross-entropy against the ground-truth labels
    hard_loss = F.cross_entropy(student_output, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss

# 6. Performance comparison
import time

def benchmark_model(model, input_data, num_runs=100):
    model.eval()
    with torch.no_grad():
        # Warm-up runs
        for _ in range(10):
            _ = model(input_data)

        # Timed runs
        start_time = time.perf_counter()
        for _ in range(num_runs):
            _ = model(input_data)
        end_time = time.perf_counter()

    # Average time per run in milliseconds
    avg_time = (end_time - start_time) / num_runs * 1000
    return avg_time

# Comparison test
input_data = torch.randn(1, 3, 224, 224)

original_time = benchmark_model(model, input_data)
quantized_time = benchmark_model(quantized_model, input_data)

print(f"Original model inference time: {original_time:.2f} ms")
print(f"Quantized model inference time: {quantized_time:.2f} ms")
print(f"Speedup: {original_time/quantized_time:.2f}x")

# Output
Original model inference time: 15.23 ms
Quantized model inference time: 5.67 ms
Speedup: 2.69x

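5. Monitoring and Operations

5.1 Model Service Monitoring

Build a thorough monitoring system so the model service keeps running reliably.

# Model service monitoring

# 1. Prometheus monitoring configuration
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: triton-metrics
  namespace: ai-serving
spec:
  selector:
    matchLabels:
      app: fgedu-classifier # the target Service must carry this label
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

# 2. Key metrics
# Grafana dashboard configuration
# Note: Triton's built-in /metrics endpoint names its series with an nv_ prefix
# (for example nv_inference_count, nv_gpu_utilization); map the names below accordingly.
Metrics to monitor:
- triton_inference_count: total number of inference requests
- triton_inference_duration: inference latency
- triton_queue_size: request queue size
- triton_batch_size: batch size
- gpu_utilization: GPU utilization
- gpu_memory_used: GPU memory in use
- cpu_utilization: CPU utilization
- memory_utilization: memory utilization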

# 3. Exporting custom metrics
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define the metrics
inference_count = Counter(
    'fgedu_inference_total',
    'Total number of inferences',
    ['model_name', 'model_version']
)

inference_latency = Histogram(
    'fgedu_inference_latency_seconds',
    'Inference latency in seconds',
    ['model_name'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

queue_size = Gauge(
    'fgedu_queue_size',
    'Current queue size',
    ['model_name']
)

# Usage example (`model` is the loaded model object, defined elsewhere)
@inference_latency.labels(model_name='fgedu_classifier').time()
def predict(input_data):
    inference_count.labels(model_name='fgedu_classifier', model_version='1').inc()
    result = model(input_data)
    return result

# Start the metrics endpoint
start_http_server(8002)

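# 4. Log collection
# Fluentd configuration
# The section tags and regexp capture-group names were lost in the source;
# the tags follow standard Fluentd syntax and the group names are placeholders.
<source>
  @type tail
  path /var/log/triton/*.log
  pos_file /var/log/fluentd/triton.log.pos
  tag triton
  <parse>
    @type json
  </parse>
</source>

<filter triton>
  @type parser
  key_name message
  <parse>
    @type regexp
    expression /^\[(?<time>[^\]]+)\]\[(?<level>\w+)\]\[(?<module>\w+)\]\s+(?<message>.*)$/
  </parse>
</filter>

<match triton>
  @type elasticsearch
  host elasticsearch
  port 9200
  index_name triton-logs
  type_name _doc
</match>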

# 5. Alert rules
# Prometheus alerting rules
groups:
  - name: ai-serving-alerts
    rules:
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.95, rate(fgedu_inference_latency_seconds_bucket[5m])) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Inference latency too high"
          description: "Model {{ $labels.model_name }} P95 latency exceeds 1 second"

      - alert: LowGPUUtilization
        expr: gpu_utilization < 30
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Low GPU utilization"
          description: "GPU {{ $labels.gpu }} utilization is below 30%"

      - alert: ModelServiceDown
        expr: up{job="triton-metrics"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Model service unavailable"
          description: "Model service {{ $labels.instance }} is unreachable"

# 6. Monitoring script
$ cat /opt/scripts/model_monitor.sh
#!/bin/bash

echo "=== AI model service monitoring ==="

# Check service status
echo -e "\n1. Service status:"
kubectl get pods -n ai-serving -l app=fgedu-classifier

# Check inference latency
# Note: quantile-labelled series are exposed by Summary metrics; a Histogram
# exposes _bucket, _sum, and _count series instead.
echo -e "\n2. Inference latency:"
curl -s http://fgedudb:8002/metrics | grep fgedu_inference_latency_seconds | grep quantile

# Check GPU usage
echo -e "\n3. GPU usage:"
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv

# Check the request queue
echo -e "\n4. Request queue:"
curl -s http://fgedudb:8002/metrics | grep fgedu_queue_size

# Count recent errors
echo -e "\n5. Error count:"
kubectl logs -n ai-serving -l app=fgedu-classifier --tail=100 | grep -i error | wc -l

$ chmod +x /opt/scripts/model_monitor.sh
$ /opt/scripts/model_monitor.sh
=== AI model service monitoring ===

1. Service status:
NAME                            READY   STATUS    RESTARTS   AGE
fgedu-classifier-7b8f9c-d4e5f   1/1     Running   0          10m

2. Inference latency:
fgedu_inference_latency_seconds{model_name="fgedu_classifier",quantile="0.5"} 0.025
fgedu_inference_latency_seconds{model_name="fgedu_classifier",quantile="0.95"} 0.085

3. GPU usage:
index, name, utilization.gpu [%], memory.used [MiB], memory.total [MiB]
0, Tesla V100-SXM2-32GB, 75 %, 8192 MiB, 32768 MiB

4. Request queue:
fgedu_queue_size{model_name="fgedu_classifier"} 5

5. Error count:
0
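
The HighInferenceLatency rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that estimate in plain Python for the bucket bounds used by `fgedu_inference_latency_seconds`; the cumulative counts are made-up illustration data, and the function is a simplified model of the PromQL behaviour rather than Prometheus code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in +Inf: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for the buckets defined on the histogram.
buckets = [(0.01, 10), (0.05, 60), (0.1, 90), (0.5, 98),
           (1.0, 100), (5.0, 100), (math.inf, 100)]
print(f"estimated P95: {histogram_quantile(0.95, buckets):.3f} s")
```

Because the estimate is interpolated, its accuracy depends on bucket placement: with only 0.5 and 1.0 around the 1-second alert threshold, the rule cannot distinguish a P95 of 0.6 s from one of 0.9 s very precisely.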

6. Elastic Scaling Strategy

6.1 Autoscaling Configuration

Configure autoscaling policies to absorb traffic fluctuations.

# Elastic scaling strategy

# 1. HPA manifest
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fgedu-classifier-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fgedu-classifier
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: triton_inference_count
        target:
          type: AverageValue
          averageValue: "500"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
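
With `selectPolicy: Max`, the scale-up policies above let the HPA add, in each 15-second period, whichever is larger: 100% of the current replicas or 4 pods. The sketch below computes the resulting ceiling per period, capped at `maxReplicas`. It is a simplified model that assumes the percentage allowance is rounded up and that each period starts from the previous ceiling; `scale_up_limit` is a hypothetical helper.

```python
import math


def scale_up_limit(current, max_replicas=20, percent=100, pods=4):
    """Highest replica count reachable in one period under selectPolicy: Max."""
    by_percent = math.ceil(current * percent / 100)  # Percent policy allowance
    allowance = max(by_percent, pods)                # Max picks the larger one
    return min(max_replicas, current + allowance)


# Fastest possible ramp from minReplicas, one step per 15-second period.
replicas = 2
ramp = [replicas]
while replicas < 20:
    replicas = scale_up_limit(replicas)
    ramp.append(replicas)
print(ramp)
```

At small replica counts the Pods policy dominates and at larger counts the Percent policy does, so the ramp from 2 to 20 replicas takes only a few periods, while the 10%-per-minute scale-down policy makes the reverse trip much slower.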

# 2. VPA manifest (vertical scaling)
# Note: avoid running VPA in Auto mode together with an HPA that scales on the
# same CPU or memory metric, since the two controllers can work against each other.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: fgedu-classifier-vpa
  namespace: ai-serving
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fgedu-classifier
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: triton-server
        minAllowed:
          cpu: 2
          memory: 4Gi
        maxAllowed:
          cpu: 16
          memory: 32Gi
        controlledResources: ["cpu", "memory"]

# 3. Scaling on custom metrics
# Custom metrics adapter configuration (abridged)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-metrics-adapter
spec:
  template:
    spec:
      containers:
        - name: adapter
          image: directxman12/k8s-prometheus-adapter:latest
          args:
            - --prometheus-url=http://prometheus:9090
            - --metrics-relist-interval=30s
            - --v=4
            - --config=/etc/adapter/config.yaml

# 4. Event-driven scaling with KEDA
# The query uses fgedu_inference_total, the name under which the Counter from section 5.1 is exported
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: fgedu-classifier-scaler
  namespace: ai-serving
spec:
  scaleTargetRef:
    name: fgedu-classifier
  minReplicaCount: 2
  maxReplicaCount: 20
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: fgedu_inference_total
        threshold: "1000"
        query: sum(rate(fgedu_inference_total[1m]))
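
For a Prometheus trigger, the `threshold` acts as a per-replica target for the query result, so the replica count tends toward the query value divided by the threshold, rounded up and clamped to the configured bounds. The sketch below shows that arithmetic under the assumption that the trigger is treated as an average-value target; the query values are hypothetical, and `keda_replicas` is an illustrative helper rather than anything KEDA exposes.

```python
import math


def keda_replicas(query_value, threshold, min_replicas=2, max_replicas=20):
    """Replica count implied by a per-replica threshold on a total metric."""
    wanted = math.ceil(query_value / threshold)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical cluster-wide inference rates (requests per second).
for rate in (100, 4500, 50000):
    print(f"{rate} req/s -> {keda_replicas(rate, 1000)} replicas")
```

With a threshold of 1000 and a cap of 20 replicas, this configuration stops adding capacity once the total rate passes 20000 requests per second, which is worth comparing against the peak QPS used for capacity planning.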

# 5. Testing the scaling policy
$ cat /opt/scripts/scaling_test.sh
#!/bin/bash

echo "=== Elastic scaling test ==="

# Initial state
echo -e "\n1. Initial replica count:"
kubectl get deployment fgedu-classifier -n ai-serving -o jsonpath='{.spec.replicas}'

# Send test traffic
echo -e "\n2. Sending test traffic…"
for i in {1..1000}; do
  curl -s -X POST http://ai.fgedu.net.cn/v1/classifier/predict \
    -d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}' > /dev/null &
done
wait

# Wait for the autoscaler to react
echo -e "\n3. Waiting for autoscaling…"
sleep 120

# Check the replica count
echo -e "\n4. Current replica count:"
kubectl get deployment fgedu-classifier -n ai-serving -o jsonpath='{.spec.replicas}'

# Check the HPA status
echo -e "\n5. HPA status:"
kubectl get hpa fgedu-classifier-hpa -n ai-serving

$ chmod +x /opt/scripts/scaling_test.sh
$ /opt/scripts/scaling_test.sh
=== Elastic scaling test ===

1. Initial replica count:
2

2. Sending test traffic…

3. Waiting for autoscaling…

4. Current replica count:
8

5. HPA status:
NAME                   REFERENCE                     TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
fgedu-classifier-hpa   Deployment/fgedu-classifier   85%/70%   2         20        8          10m

# 6. Capacity planning
# Capacity planning script
$ cat /opt/scripts/capacity_planning.sh
#!/bin/bash

echo "=== AI service capacity planning ==="

# Current configuration
current_replicas=$(kubectl get deployment fgedu-classifier -n ai-serving -o jsonpath='{.spec.replicas}')
cpu_per_pod=$(kubectl get deployment fgedu-classifier -n ai-serving -o jsonpath='{.spec.template.spec.containers[0].resources.requests.cpu}')
mem_per_pod=$(kubectl get deployment fgedu-classifier -n ai-serving -o jsonpath='{.spec.template.spec.containers[0].resources.requests.memory}')

# Compute total resources (strip the Gi suffix before multiplying)
mem_gi=$(echo "$mem_per_pod" | sed 's/Gi//')
total_cpu=$(echo "$current_replicas * $cpu_per_pod" | bc)
total_mem=$(echo "$current_replicas * $mem_gi" | bc)

echo "Current replicas: $current_replicas"
echo "CPU per pod: $cpu_per_pod"
echo "Memory per pod: $mem_per_pod"
echo "Total CPU: ${total_cpu} cores"
echo "Total memory: ${total_mem}Gi"

# Estimate peak demand
# Little's law: in-flight requests = QPS x latency. This equals the pod count
# only if each pod serves one request at a time.
peak_qps=10000
avg_latency=0.05 # 50ms
required_pods=$(echo "scale=0; $peak_qps * $avg_latency / 1" | bc)

echo -e "\nEstimated peak QPS: $peak_qps"
echo "Average latency: ${avg_latency}s"
echo "Required pods: $required_pods"

$ chmod +x /opt/scripts/capacity_planning.sh
$ /opt/scripts/capacity_planning.sh
=== AI service capacity planning ===

Current replicas: 8
CPU per pod: 4
Memory per pod: 8Gi
Total CPU: 32 cores
Total memory: 64Gi

Estimated peak QPS: 10000
Average latency: 0.05s
Required pods: 500
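
The pod estimate in the script multiplies peak QPS by average latency, which by Little's law gives the number of requests in flight at peak. That matches the pod count only when each pod handles one request at a time; with batching or several model instances per pod, the figure is divided by the per-pod concurrency. The sketch below makes that factor explicit; `concurrency_per_pod` is a hypothetical parameter that would have to be measured, for example with perf_analyzer.

```python
import math


def required_pods(peak_qps, avg_latency_ms, concurrency_per_pod=1):
    """Pods needed to hold the peak number of in-flight requests."""
    in_flight = peak_qps * avg_latency_ms / 1000  # Little's law
    return math.ceil(in_flight / concurrency_per_pod)


# Same inputs as the script: 10000 QPS at 50 ms average latency.
print(required_pods(10000, 50))
# If each pod can serve 8 requests concurrently at that latency.
print(required_pods(10000, 50, concurrency_per_pod=8))
```

Measured per-pod concurrency can change the answer by an order of magnitude, so it is worth establishing before sizing the cluster or setting `maxReplicas`.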

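Summary

The deployment architecture is the key step in taking an AI model to production, and it has to weigh performance, reliability, and scalability together. This tutorial covered model-serving frameworks, Kubernetes deployment, model optimization, monitoring and operations, and elastic scaling, as a basis for building an efficient AI model service.

In practice, choose a deployment architecture that fits the characteristics of your workload, and keep optimizing it over time.

Tip from Fengge: pay particular attention to version management and canary releases when deploying models, so that model updates roll out smoothly.

Compiled and published by Fengge Tutorials for learning and testing purposes only. Please credit the source when reposting: http://www.fgedu.net.cn/10327.html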