Deploying LLMs on Kubernetes
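---
title: "Deploying LLMs on Kubernetes"
description: "Deploy and manage large language models on a Kubernetes cluster, covering GPU node scheduling, HPA autoscaling, model serving, and other practical techniques"
tags: ["Kubernetes", "K8s", "GPU nodes", "HPA autoscaling"]
category: "llm"
icon: "🧠"
---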
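Why Deploy LLMs on Kubernetes
As large language models keep growing, a single-machine deployment can no longer meet production requirements. Kubernetes, the de facto standard for container orchestration, provides elastic scaling, rolling updates, and health checks, which makes it a good fit for running LLM services.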
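GPU Node Management
Deploying an LLM on Kubernetes starts with preparing the GPU nodes. The NVIDIA Device Plugin advertises each node's GPUs to Kubernetes as the schedulable resource nvidia.com/gpu. The example below labels a GPU node and taints it so that only Pods that tolerate the taint are scheduled onto it: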
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
  labels:
    accelerator: nvidia-a100
spec:
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
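When deploying a GPU workload, declare the GPU resources it needs in the Pod spec, and add a node selector and toleration that match the GPU node's label and taint: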
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference-pod
spec:
  containers:
  - name: llm-server
    image: myregistry/llm-server:latest
    resources:
      limits:
        nvidia.com/gpu: 4
        memory: "64Gi"
      requests:
        cpu: "8"
        memory: "32Gi"
  nodeSelector:
    accelerator: nvidia-a100
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
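Managing the LLM Service with a Deployment
A Deployment keeps the desired number of replicas running and supports rolling updates, which gives the LLM service high availability: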
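apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-deployment
  namespace: ai-services
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
      - name: llm
        image: myregistry/llm-7b:latest
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 2
          requests:
            # A CPU request is required for CPU-utilization autoscaling;
            # the value here is an assumption, size it for your workload.
            cpu: "4"
        # Restart the container if the server stops responding.
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
        # Only route traffic once the model is ready to serve.
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5
      # Match the label and tolerate the taint set on the GPU nodes.
      nodeSelector:
        accelerator: nvidia-a100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule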
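When no update strategy is set, a Deployment performs a rolling update with maxSurge and maxUnavailable both at 25%. On a cluster with no spare GPUs, surge Pods may stay Pending, and a large model may need longer to load than the liveness probe's initial delay allows. The fragment below is a minimal sketch of one way to handle both cases; the numeric values and the reuse of /health for the startup probe are assumptions, not settings from the original manifest.

```yaml
# Fragment to merge into the llm-deployment manifest (illustrative values).
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0        # do not create extra Pods that would need extra GPUs
      maxUnavailable: 1  # replace one replica at a time
  template:
    spec:
      containers:
      - name: llm
        # Liveness and readiness checks are held back until this probe succeeds.
        startupProbe:
          httpGet:
            path: /health   # assumed to return success once weights are loaded
            port: 8080
          periodSeconds: 10
          failureThreshold: 60   # allows up to 60 x 10 s = 600 s for model loading
```

Setting maxSurge to 0 trades a brief drop in capacity during updates for not needing idle GPUs; if spare GPU nodes are available, a positive maxSurge keeps full capacity instead.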
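HPA Autoscaling
To handle fluctuating traffic, configure a Horizontal Pod Autoscaler (HPA). The manifest below scales on two metrics, and the HPA uses whichever metric yields the larger replica count. CPU utilization is computed relative to the containers' CPU requests and requires metrics-server; the gpu_utilization Pods metric is not built in and must be exposed through a custom metrics adapter: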
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
  namespace: ai-services  # must be in the same namespace as the target Deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "80"
  behavior:
    # Scale up quickly: add at most 2 Pods per minute.
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    # Scale down slowly: remove at most 10% of Pods per minute.
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
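The gpu_utilization metric used by the HPA has to come from somewhere. One possible setup, sketched here as an assumption rather than as the configuration behind this article, is to export GPU metrics with NVIDIA's DCGM exporter, scrape them with Prometheus, and map them to a per-Pod metric with prometheus-adapter. The rule below assumes the series DCGM_FI_DEV_GPU_UTIL carries namespace and pod labels identifying the workload Pod; depending on the scrape configuration these labels may be named differently (for example exported_namespace and exported_pod).

```yaml
# prometheus-adapter rule (sketch). Label names are assumptions about the scrape setup.
rules:
- seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^DCGM_FI_DEV_GPU_UTIL$"
    as: "gpu_utilization"   # the metric name referenced by the HPA
  # Average across the GPUs attached to each Pod.
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

With a rule like this, the averageValue of "80" in the HPA corresponds to 80 percent average GPU utilization per Pod.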
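Exposing the LLM Service
Use a Service and an Ingress to expose the LLM service to external users. The annotations raise the request body limit and the read timeout, since inference requests can run for a long time: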
apiVersion: v1
kind: Service
metadata:
  name: llm-service
  namespace: ai-services  # same namespace as the Pods it selects
spec:
  selector:
    app: llm-server
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress
  namespace: ai-services  # same namespace as the backend Service
  annotations:
    # These annotations are specific to the ingress-nginx controller.
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  rules:
  - host: llm.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: llm-service
            port:
              number: 80
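Model Storage and Persistence
Mount the model weights from a PersistentVolumeClaim so they are not downloaded again on every start. Note that a ReadWriteOnce volume can be mounted by only one node at a time; if replicas run on several nodes, use ReadOnlyMany or ReadWriteMany where the storage class supports it: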
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage-pvc
  namespace: ai-services  # a PVC can only be mounted by Pods in its own namespace
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 500Gi
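The claim only reserves storage; the Pods still have to mount it. The fragment below is a minimal sketch of how the claim could be attached to the Deployment's Pod template. The volume name model-weights and the mount path /models are hypothetical and should match wherever the serving image expects its weights.

```yaml
# Fragment to merge into the llm-deployment manifest (names and path are assumptions).
spec:
  template:
    spec:
      containers:
      - name: llm
        volumeMounts:
        - name: model-weights
          mountPath: /models   # hypothetical path read by the model server
          readOnly: true       # serving only needs to read the weights
      volumes:
      - name: model-weights
        persistentVolumeClaim:
          claimName: model-storage-pvc
          readOnly: true
```

Mounting the volume read-only protects the weights from accidental modification by a serving Pod; the weights would need to be written to the volume once beforehand by a separate job.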
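Resource Monitoring and Scheduling Policies
Use Prometheus and Grafana to monitor GPU utilization and inference latency. Configure a PodDisruptionBudget to keep a minimum number of Pods available during voluntary disruptions such as node drains; Deployment rolling updates are governed by the Deployment's own update strategy rather than by the PDB: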
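apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-pdb
  namespace: ai-services  # same namespace as the Pods it protects
spec:
  # With the HPA floor at 2 replicas, this blocks evictions
  # whenever only 2 Pods are running.
  minAvailable: 2
  selector:
    matchLabels:
      app: llm-server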
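With well-chosen Kubernetes resource management and scheduling policies, you can build a stable, efficient LLM inference platform that supports large-scale online serving.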