GPU Kubernetes Operations
Run resilient and cost-efficient GPU clusters for production AI workloads.
When to Use This Skill
- Setting up GPU node pools in Kubernetes for AI inference or training
- Configuring NVIDIA device plugin and GPU operator
- Implementing MIG partitioning to share GPUs across workloads
- Building GPU-aware autoscaling policies
- Monitoring GPU health with DCGM and Prometheus
- Troubleshooting GPU scheduling, driver, or OOM issues
Prerequisites
- Kubernetes 1.28+ cluster with GPU-capable nodes
- NVIDIA GPUs (A10, L4, A100, H100, or similar)
- NVIDIA drivers installed on nodes (535+ recommended)
- Helm 3 for operator and plugin installation
- Prometheus stack for metrics collection
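A quick preflight pass over these prerequisites catches most setup surprises early (a minimal sketch; run the nvidia-smi line on a GPU node itself):

```bash
# Kubernetes server version should be 1.28+
kubectl version | grep -i 'server version'

# Helm 3 should be on the PATH
helm version --short

# On each GPU node: driver 535+ and the expected GPU model
nvidia-smi --query-gpu=driver_version,name --format=csv
```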
NVIDIA GPU Operator Installation
The GPU Operator automates driver, toolkit, device plugin, and DCGM deployment.
```bash
# Add NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true \
  --set nodeStatusExporter.enabled=true \
  --version v24.3.0

# Verify installation
kubectl get pods -n gpu-operator
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
```
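Pod status alone can hide a half-finished rollout; the operator aggregates overall readiness in its ClusterPolicy resource (the name cluster-policy matches the chart default patched later in this doc):

```bash
# Prints "ready" once driver, toolkit, device plugin, and DCGM are all up
kubectl get clusterpolicy cluster-policy -o jsonpath='{.status.state}'
```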
NVIDIA Device Plugin (Standalone)
If not using the GPU Operator, deploy the device plugin directly.
```yaml
# nvidia-device-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
          securityContext:
            privileged: true
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
            - name: DEVICE_SPLIT_COUNT
              value: "1"
            - name: DEVICE_LIST_STRATEGY
              value: "envvar"
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```
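Whichever path you choose (operator or standalone plugin), a throwaway pod that runs nvidia-smi verifies scheduling, the driver, and the container toolkit end to end. A minimal sketch; the CUDA image tag is an assumption, any CUDA base image works:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# The pod should print the nvidia-smi table and exit cleanly
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-smoke-test --timeout=5m
kubectl logs gpu-smoke-test && kubectl delete pod gpu-smoke-test
```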
MIG (Multi-Instance GPU) Partitioning
MIG allows a single A100 or H100 to be split into up to seven isolated GPU instances.
```yaml
# mig-config.yaml - ConfigMap for MIG Manager
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # 7 small instances for inference microservices
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      # 3 medium instances for mid-size models
      all-2g.20gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.20gb": 3
      # Mixed: 1 large + 4 small
      mixed-inference:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "1g.10gb": 4
      # Full GPU for training (no partitioning)
      all-disabled:
        - devices: all
          mig-enabled: false
```

```bash
# Apply MIG profile to a node
kubectl label nodes gpu-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite

# Verify MIG instances
kubectl exec -it nvidia-device-plugin-xxxxx -n kube-system -- nvidia-smi mig -lgi

# Check available MIG resources
kubectl get nodes gpu-node-01 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com")))'
```
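Switching profiles reconfigures the GPU and kills any CUDA process still on it, so drain the node first. A sketch of the safe sequence (node name is a placeholder; the mig.config.state label is set by mig-manager as it reconfigures):

```bash
# Move workloads off before reconfiguring
kubectl cordon gpu-node-01
kubectl drain gpu-node-01 --ignore-daemonsets --delete-emptydir-data

# Switch the profile and wait for mig-manager to report success
kubectl label nodes gpu-node-01 nvidia.com/mig.config=mixed-inference --overwrite
kubectl wait node/gpu-node-01 \
  --for=jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'=success \
  --timeout=10m
kubectl uncordon gpu-node-01
```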
Requesting MIG Slices in Pods
```yaml
# pod-with-mig.yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-small
spec:
  containers:
    - name: model
      image: registry.internal/vllm-server:latest
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
          # For medium slice:
          # nvidia.com/mig-2g.20gb: 1
          # For large slice:
          # nvidia.com/mig-3g.40gb: 1
```
GPU Time-Slicing
For GPUs that do not support MIG (A10, L4), use time-slicing to share a GPU.
```yaml
# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

```bash
# Apply time-slicing config
kubectl patch clusterpolicy/cluster-policy \
  --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

# After applying, each physical GPU appears as 4 virtual GPUs
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
# Output: "4" per physical GPU
```
DCGM Monitoring
```yaml
# dcgm-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      interval: 15s
      path: /metrics
```
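With the ServiceMonitor in place, per-node utilization is one query away (a sketch; the Prometheus URL is an assumption for your stack):

```bash
# Average GPU utilization per node; the DCGM exporter attaches a Hostname label
curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)' | jq '.data.result'
```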
Key DCGM Metrics and Alert Rules
```yaml
# gpu-alerts.yaml
groups:
  - name: gpu-health
    rules:
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} temperature above 85C on {{ $labels.node }}"
      - alert: GPUMemoryPressure
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory above 90% on {{ $labels.node }} GPU {{ $labels.gpu }}"
      - alert: GPUECCErrors
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Double-bit ECC errors detected on {{ $labels.node }} GPU {{ $labels.gpu }}"
      - alert: GPUXidErrors
        expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Xid error on {{ $labels.node }} GPU {{ $labels.gpu }}: {{ $labels.xid }}"
      - alert: GPULowUtilization
        expr: DCGM_FI_DEV_GPU_UTIL < 10 and on(pod) kube_pod_status_phase{phase="Running"} == 1
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "GPU underutilized on {{ $labels.node }} - consider rightsizing"
      - alert: GPUDriverMismatch
        expr: count(count by (driver_version) (DCGM_FI_DRIVER_VERSION)) > 1
        labels:
          severity: warning
        annotations:
          summary: "Multiple GPU driver versions detected across cluster"
```
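Before loading these rules, promtool (bundled with Prometheus) catches YAML and PromQL mistakes cheaply:

```bash
promtool check rules gpu-alerts.yaml
```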
GPU Node Pool Configuration
```yaml
# gpu-nodepool.yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    gpu-type: a100
    gpu-memory: "80gb"
    gpu-mig-capable: "true"
    node-role: gpu-inference
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
---
# Inference deployment with GPU scheduling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ai-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        gpu-type: a100
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: llm-inference
                topologyKey: kubernetes.io/hostname
      containers:
        - name: vllm
          image: registry.internal/vllm-server:0.4.1
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "32Gi"
            limits:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: "64Gi"
          # Note: do not set CUDA_VISIBLE_DEVICES here; the device plugin
          # injects the allocated device into the container automatically.
```
GPU Autoscaling
```yaml
# gpu-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "75"
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
---
# Cluster Autoscaler config for GPU node pools
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-config
  namespace: kube-system
data:
  config: |
    expander: priority
    scale-down-delay-after-add: 10m
    scale-down-unneeded-time: 10m
    skip-nodes-with-local-storage: false
    balance-similar-node-groups: true
    expendable-pods-priority-cutoff: -10
    gpu-total:
      - min: 2
        max: 16
        gpu: nvidia.com/gpu
```
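The HPA above reads DCGM_FI_DEV_GPU_UTIL and inference_queue_depth through the custom metrics API, which requires an adapter such as prometheus-adapter with matching rules; that wiring is assumed here. A quick sanity check that the metrics are actually served:

```bash
# Both metric names should appear once the adapter rules are in place
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r '.resources[].name' | grep -Ei 'dcgm|queue'
```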
Scheduling Patterns
- Use node affinity by GPU type (A10/L4/A100/H100).
- Separate latency-critical inference from batch training.
- Pin model replicas with anti-affinity for availability.
- Reserve headroom for failover and rolling updates (see the placeholder sketch after this list).
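One way to implement the headroom item is low-priority placeholder pods that hold spare GPU capacity and are preempted the moment real workloads need it. A sketch, assuming you pair the class with a small placeholder Deployment that requests nvidia.com/gpu; the value sits below the expendable-pods-priority-cutoff used earlier, so Cluster Autoscaler also treats the placeholders as expendable:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-placeholder
value: -100
globalDefault: false
description: "Preemptible placeholders that hold spare GPU headroom"
EOF
```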
Cost Optimization
- Prefer MIG slices for smaller inference services.
- Schedule batch jobs in off-peak windows.
- Route low-priority traffic to cheaper model tiers.
- Use spot/preemptible instances for training workloads.
- Monitor GPU utilization and rightsize deployments (see the audit query after this list).
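For the utilization item, a recurring audit query surfaces consolidation candidates (a sketch; the Prometheus URL and the 20% threshold are assumptions):

```bash
# Nodes averaging under 20% GPU utilization over the past 7 days
curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=avg_over_time((avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL))[7d:5m]) < 20'
```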
Troubleshooting
| Symptom | Check | Fix |
|---|---|---|
| Pod stuck in Pending | `kubectl describe pod` scheduling events | Verify node has allocatable GPUs, check taints/tolerations |
| CUDA OOM during inference | Model memory footprint vs. GPU memory | Reduce batch size, use quantization, or use a MIG slice |
| DCGM metrics missing | ServiceMonitor selector matches the exporter Service labels | Verify DCGM exporter pod is running and check scrape config |
| Driver mismatch after upgrade | `nvidia-smi` driver version on each node | Cordon node, drain, upgrade driver, uncordon |
| GPU not detected | Device plugin pod logs | Restart device plugin, check NVIDIA Container Toolkit |
| Time-slicing not working | ConfigMap applied but no extra GPUs advertised | Restart device plugin pods after the config change |
| ECC errors increasing | `nvidia-smi -q -d ECC` error counters | Schedule node drain and hardware replacement |
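A few first-response commands mapped to the table (pod and node names are placeholders; the device plugin label assumes GPU Operator defaults):

```bash
# Pod stuck in Pending: read the scheduler events
kubectl describe pod <pod-name> | sed -n '/Events:/,$p'

# GPU not detected: device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=50

# Driver / ECC state on a node (via SSH or a debug pod)
nvidia-smi --query-gpu=driver_version,ecc.errors.uncorrected.volatile.total --format=csv
```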
Related Skills
- llm-inference-scaling - Autoscale inference workloads
- model-serving-kubernetes - Production model serving patterns
- gpu-server-management - Host-level GPU management fundamentals
- multi-tenant-llm-hosting - Multi-tenant GPU sharing
- llm-cost-optimization - Cost optimization strategies