12 - Kubernetes Deployment
Build a production-grade vLLM inference service cluster on Kubernetes.
12.1 Kubernetes Deployment Architecture
12.1.1 Overall Architecture
┌─────────────────────────┐
│ Ingress Controller │
│ (Nginx/Traefik) │
└────────────┬────────────┘
│
┌────────────▼────────────┐
│ Service (ClusterIP) │
│ load-balancing │
└────────────┬────────────┘
┌────────────────┼────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Pod (GPU 0) │ │ Pod (GPU 1)│ │ Pod (GPU 2)│
│ ┌───────────┐│ │ ┌───────────┐│ │ ┌───────────┐│
│ │vLLM Server││ │ │vLLM Server││ │ │vLLM Server││
│ │ Model A ││ │ │ Model B ││ │ │ Model A ││
│ └───────────┘│ │ └───────────┘│ │ └───────────┘│
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ GPU x1 │ │ GPU x1 │ │ GPU x1 │
└─────────────┘ └─────────────┘ └─────────────┘
┌───────────────────────────────────────────┐
│ Monitoring: Prometheus + Grafana │
│ Autoscaling: HPA / KEDA │
│ Storage: PVC for model cache │
└───────────────────────────────────────────┘
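Assuming the manifests introduced in the rest of this chapter are saved under a local k8s/ directory (the directory layout is only an illustration, not prescribed by this guide), a minimal bring-up flow looks like this:
# Create the namespace used throughout this chapter
kubectl create namespace llm
# Apply the core resources (file names follow the examples below)
kubectl apply -f k8s/configmap.yaml -f k8s/secret.yaml -f k8s/pvc.yaml
kubectl apply -f k8s/deployment.yaml -f k8s/service.yaml -f k8s/ingress.yaml
# Watch the Pods come up (model loading can take several minutes)
kubectl get pods -n llm -w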
12.2 Installing the GPU Operator
12.2.1 Install the NVIDIA GPU Operator
# Add the Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set dcgm.enabled=true \
--set dcgmExporter.enabled=true
12.2.2 Verify GPU Availability
# Check for GPU nodes
kubectl get nodes -l nvidia.com/gpu.present=true
# Check allocatable GPU resources
kubectl describe node <gpu-node> | grep -A5 "Allocatable"
# nvidia.com/gpu: 8
# Run a throwaway Pod to test GPU access
kubectl run gpu-test --rm -it \
--image=nvidia/cuda:12.4.0-base-ubuntu22.04 \
--limits=nvidia.com/gpu=1 \
-- nvidia-smi
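Recent kubectl releases have removed the --limits flag from kubectl run, so on newer clusters the same check can be done with a small throwaway Pod manifest instead (a minimal sketch):
# gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
kubectl apply -f gpu-test.yaml
kubectl logs gpu-test      # should print the familiar nvidia-smi table
kubectl delete pod gpu-test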
12.3 Basic Kubernetes Deployment
12.3.1 ConfigMap
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: vllm-config
namespace: llm
data:
MODEL_NAME: "Qwen/Qwen2.5-7B-Instruct"
MAX_MODEL_LEN: "4096"
GPU_MEMORY_UTILIZATION: "0.9"
DTYPE: "auto"
TRUST_REMOTE_CODE: "true"
12.3.2 Deployment
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
namespace: llm
labels:
app: vllm-server
spec:
replicas: 2
selector:
matchLabels:
app: vllm-server
template:
metadata:
labels:
app: vllm-server
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8000"
prometheus.io/path: "/metrics"
spec:
      # Node selector: pin Pods to GPU nodes
nodeSelector:
nvidia.com/gpu.product: "NVIDIA-A100-80GB-PCIe"
      # Tolerate the taint on GPU nodes
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
      # Anti-affinity: spread Pods across different nodes
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- vllm-server
topologyKey: kubernetes.io/hostname
containers:
- name: vllm
image: vllm/vllm-openai:latest
        # GPU resource requests and limits
resources:
limits:
nvidia.com/gpu: "1"
memory: "64Gi"
cpu: "8"
requests:
nvidia.com/gpu: "1"
memory: "32Gi"
cpu: "4"
        # Startup command
command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
args:
- "--model"
- "$(MODEL_NAME)"
- "--max-model-len"
- "$(MAX_MODEL_LEN)"
- "--gpu-memory-utilization"
- "$(GPU_MEMORY_UTILIZATION)"
- "--served-model-name"
- "qwen-7b"
- "--trust-remote-code"
        # Environment variables
envFrom:
- configMapRef:
name: vllm-config
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
        # Ports
ports:
- containerPort: 8000
name: http
protocol: TCP
        # Health checks
livenessProbe:
httpGet:
path: /health
port: 8000
          initialDelaySeconds: 300  # model loading takes time
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 10
timeoutSeconds: 5
        # Shared memory and model cache mounts
volumeMounts:
- name: shm
mountPath: /dev/shm
- name: model-cache
mountPath: /root/.cache/huggingface
      # Volumes: shared memory and model cache
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: "16Gi"
- name: model-cache
persistentVolumeClaim:
claimName: model-cache-pvc
      # Time allowed for a graceful shutdown
terminationGracePeriodSeconds: 60
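After applying the Deployment it is worth watching the rollout and the startup logs, because the first start includes the model download and weight loading:
kubectl rollout status deployment/vllm-server -n llm
# Follow the vLLM startup logs (model download + weight loading)
kubectl logs -f deployment/vllm-server -n llm
# Smoke-test the OpenAI-compatible API without going through the Service/Ingress
kubectl port-forward deploy/vllm-server 8000:8000 -n llm
# in a second terminal:
curl http://localhost:8000/v1/models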
12.3.3 Service
# service.yaml
apiVersion: v1
kind: Service
metadata:
name: vllm-service
namespace: llm
labels:
app: vllm-server
spec:
type: ClusterIP
ports:
- port: 8000
targetPort: 8000
protocol: TCP
name: http
selector:
app: vllm-server
12.3.4 Secret (HF Token)
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
name: hf-token
namespace: llm
type: Opaque
data:
token: <base64-encoded-hf-token>
# Create the Secret with kubectl (alternative to the YAML above)
kubectl create secret generic hf-token \
--namespace llm \
--from-literal=token=hf_xxxxxxxxxxxxx
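If the declarative secret.yaml form above is preferred, the token value has to be base64-encoded first:
echo -n 'hf_xxxxxxxxxxxxx' | base64
# paste the output into the data.token field of secret.yaml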
12.3.5 PVC (Model Cache)
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache-pvc
namespace: llm
spec:
accessModes:
    - ReadWriteOnce  # use ReadWriteMany (if the storage class supports it) when several replicas on different nodes share this cache
  storageClassName: fast-ssd  # use an SSD-backed storage class
resources:
requests:
storage: 200Gi
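With multiple replicas it helps to pre-populate the cache PVC once, instead of having every Pod download the model on first start. A sketch of a one-off prefetch Job (the image and download command are assumptions, not prescribed by vLLM):
# model-prefetch-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-prefetch
  namespace: llm
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: prefetch
        image: python:3.11-slim
        command: ["/bin/sh", "-c"]
        args:
        - pip install -q huggingface_hub && huggingface-cli download Qwen/Qwen2.5-7B-Instruct
        env:
        - name: HF_HOME
          value: /root/.cache/huggingface
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
Note that with a ReadWriteOnce PVC this only works if the Job lands on the same node as the serving Pods, which is another reason to prefer ReadWriteMany for a shared cache.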
12.4 Helm Chart
12.4.1 Chart Structure
vllm-chart/
├── Chart.yaml
├── values.yaml
├── templates/
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── ingress.yaml
│ ├── hpa.yaml
│ ├── configmap.yaml
│ ├── secret.yaml
│ ├── pvc.yaml
│ └── _helpers.tpl
└── README.md
12.4.2 values.yaml
# values.yaml
replicaCount: 2
image:
repository: vllm/vllm-openai
tag: latest
pullPolicy: IfNotPresent
model:
name: "Qwen/Qwen2.5-7B-Instruct"
servedName: "qwen-7b"
maxModelLen: 4096
gpuMemoryUtilization: 0.9
trustRemoteCode: true
dtype: "auto"
gpu:
count: 1
type: "NVIDIA-A100-80GB-PCIe"
resources:
limits:
memory: "64Gi"
cpu: "8"
requests:
memory: "32Gi"
cpu: "4"
service:
type: ClusterIP
port: 8000
ingress:
enabled: true
className: nginx
annotations:
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
nginx.ingress.kubernetes.io/proxy-buffering: "off"
hosts:
- host: llm.example.com
paths:
- path: /
pathType: Prefix
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetGPUUtilizationPercentage: 80
monitoring:
enabled: true
serviceMonitor:
enabled: true
interval: 10s
shmSize: "16Gi"
modelCache:
enabled: true
size: 200Gi
storageClass: fast-ssd
hfToken:
enabled: true
existingSecret: ""
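The templates simply map these values onto the manifests from section 12.3. An illustrative fragment of templates/deployment.yaml (the fullname helper is assumed to be defined in _helpers.tpl):
# templates/deployment.yaml (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "vllm-chart.fullname" . }}
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
      - name: vllm
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        args:
        - "--model"
        - {{ .Values.model.name | quote }}
        - "--served-model-name"
        - {{ .Values.model.servedName | quote }}
        - "--max-model-len"
        - {{ .Values.model.maxModelLen | quote }}
        - "--gpu-memory-utilization"
        - {{ .Values.model.gpuMemoryUtilization | quote }}
        resources:
          limits:
            nvidia.com/gpu: {{ .Values.gpu.count | quote }}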
12.4.3 Installing the Helm Chart
# Add your custom chart repository
helm repo add vllm https://your-registry.com/charts
helm repo update
# Install
helm install vllm-service vllm/vllm-chart \
--namespace llm \
--create-namespace \
--values values.yaml
# Upgrade
helm upgrade vllm-service vllm/vllm-chart \
--namespace llm \
--values values.yaml
# Check status
helm status vllm-service -n llm
12.5 Autoscaling
12.5.1 HPA (CPU and Custom Metrics)
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
namespace: llm
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-server
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: vllm:num_requests_waiting
target:
type: AverageValue
averageValue: "100"
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Pods
value: 2
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 1
periodSeconds: 120
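The second metric above (type: Pods, vllm:num_requests_waiting) is only visible to the HPA if something exposes it through the custom metrics API, typically prometheus-adapter. A minimal rule sketch for the prometheus-adapter Helm values (the layout is an assumption; adjust to your adapter version):
# prometheus-adapter values excerpt
rules:
  custom:
  - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'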
12.5.2 KEDA (Custom-Metric Based)
KEDA provides more flexible scaling policies:
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: vllm-scaledobject
namespace: llm
spec:
scaleTargetRef:
name: vllm-server
minReplicaCount: 1
maxReplicaCount: 20
cooldownPeriod: 300
pollingInterval: 15
triggers:
  # Scale on the length of the waiting queue
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: vllm_num_requests_waiting
query: avg(vllm:num_requests_waiting{model="qwen-7b"})
threshold: "50"
  # Scale on GPU KV-cache usage
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: vllm_gpu_cache_usage
query: avg(vllm:gpu_cache_usage_perc{model="qwen-7b"})
threshold: "0.85"
  # Scale on a schedule (cron)
- type: cron
metadata:
timezone: Asia/Shanghai
      start: "0 9 * * *"   # scale up at 09:00 every day
      end: "0 22 * * *"    # scale back down at 22:00 every day
desiredReplicas: "5"
12.5.3 Installing KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
--namespace keda \
--create-namespace
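Once KEDA is running and the ScaledObject has been applied, KEDA drives scaling through an HPA it creates on your behalf; a quick sanity check:
kubectl get pods -n keda
kubectl get scaledobject vllm-scaledobject -n llm
# KEDA manages replicas through a generated HPA (name prefixed with keda-hpa-)
kubectl get hpa -n llm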
12.6 GPU Scheduling Strategies
12.6.1 GPU Allocation Modes
| Mode | Description | Typical Use Case |
|---|---|---|
| Whole-GPU allocation | nvidia.com/gpu: 1 | A single Pod exclusively owns a GPU |
| MIG partitioning | MIG instances on A100/H100 | Several small models sharing one card |
| GPU time-slicing | GPU sharing via device-plugin time-slicing (see the sketch after this table) | Dev/test environments |
| vGPU | NVIDIA vGPU software | Enterprise-grade sharing |
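Time-slicing is configured through the NVIDIA device plugin. A sketch of the sharing config delivered as a ConfigMap and referenced by the GPU Operator (key names follow the NVIDIA documentation, but treat the exact wiring as an assumption to verify against your operator version):
# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4    # each physical GPU is advertised as 4 schedulable GPUs
# Point the GPU Operator's device plugin at the config
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'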
12.6.2 MIG (Multi-Instance GPU) Configuration
# Request a MIG instance (a 1g.10gb slice of an A100 80GB)
resources:
limits:
nvidia.com/mig-1g.10gb: "1"
# Enable MIG mode (run on the GPU node itself)
sudo nvidia-smi -i 0 -mig 1
# Create seven 1g.10gb GPU instances (plus compute instances) on GPU 0
sudo nvidia-smi mig -i 0 -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C
# Using MIG from Kubernetes
# requires the GPU Operator with a MIG strategy enabled
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set mig.strategy=mixed
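After the node has been reconfigured, the MIG slices should be visible both to nvidia-smi and as Kubernetes node resources:
# On the node: list GPUs and their MIG devices
nvidia-smi -L
# In the cluster: the node should now advertise nvidia.com/mig-* resources
kubectl describe node <gpu-node> | grep nvidia.com/mig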
12.7 Networking
12.7.1 Ingress Configuration
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: vllm-ingress
namespace: llm
annotations:
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
nginx.ingress.kubernetes.io/proxy-buffering: "off"
nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
nginx.ingress.kubernetes.io/server-snippet: |
proxy_cache off;
cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
ingressClassName: nginx
tls:
- hosts:
- llm.example.com
secretName: llm-tls
rules:
- host: llm.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: vllm-service
port:
number: 8000
Key point: streaming responses require proxy_buffering to be disabled in Nginx; otherwise SSE events are held back in the proxy buffer instead of being delivered as they are generated.
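A quick way to confirm that streaming survives the Ingress end to end (hostname and served model name follow the examples above):
curl -N https://llm.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
# With buffering disabled, the "data: ..." SSE chunks should arrive incrementally, not all at once.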
12.8 Multi-Model Deployment
12.8.1 Separate Deployments
# Multi-model serving: one Deployment per model
# Model A
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen-7b
  namespace: llm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-qwen-7b
  template:
    metadata:
      labels:
        app: vllm-qwen-7b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: ["--model", "Qwen/Qwen2.5-7B-Instruct", "--served-model-name", "qwen-7b"]
        resources:
          limits:
            nvidia.com/gpu: "1"
# Model B
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen-coder
  namespace: llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen-coder
  template:
    metadata:
      labels:
        app: vllm-qwen-coder
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: ["--model", "Qwen/Qwen2.5-Coder-7B-Instruct", "--served-model-name", "qwen-coder"]
        resources:
          limits:
            nvidia.com/gpu: "1"
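Each Deployment also needs its own Service so that clients or an upstream gateway can address the two models separately; a minimal sketch reusing the labels from the manifests above:
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen-7b
  namespace: llm
spec:
  selector:
    app: vllm-qwen-7b
  ports:
  - port: 8000
    targetPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen-coder
  namespace: llm
spec:
  selector:
    app: vllm-qwen-coder
  ports:
  - port: 8000
    targetPort: 8000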
12.9 High Availability Configuration
12.9.1 Pod Disruption Budget
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: vllm-pdb
namespace: llm
spec:
  minAvailable: 1  # keep at least one Pod running at all times
selector:
matchLabels:
app: vllm-server
12.9.2 Graceful Restarts
# Ensure zero downtime during rolling updates
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      terminationGracePeriodSeconds: 120  # long enough for a graceful shutdown
      containers:
      - name: vllm
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 30"]  # give the load balancer time to switch traffic away
12.10 Caveats
Shared memory: vLLM needs a large amount of shared memory (/dev/shm). The Kubernetes default of 64 MB is not enough, so it must be enlarged with a memory-backed emptyDir as shown in the Deployment above.
Model load time: loading a large model can take 5-15 minutes. Set initialDelaySeconds high enough that the Pod is not wrongly marked unhealthy while the model is still loading.
GPU scheduling: make sure the cluster has enough GPU capacity, and request GPUs through the nvidia.com/gpu resource rather than relying on generic CPU requests.
Storage performance: model load speed is limited by storage throughput. Prefer local SSDs or a high-performance NFS/CSI driver.
Networking: with tensor parallelism (multiple GPUs per replica), GPU-to-GPU communication inside a Pod needs NVLink, and tensor parallelism across nodes requires a high-speed interconnect.
12.11 Further Reading
Previous chapter: 11 - Monitoring and Observability | Next chapter: 13 - Docker Containerized Deployment