GPU Kubernetes Operations
Run resilient and cost-efficient GPU clusters for production AI workloads.
When to Use This Skill
- Setting up GPU node pools in Kubernetes for AI inference or training
- Configuring NVIDIA device plugin and GPU operator
- Implementing MIG partitioning to share GPUs across workloads
- Building GPU-aware autoscaling policies
- Monitoring GPU health with DCGM and Prometheus
- Troubleshooting GPU scheduling, driver, or OOM issues
Prerequisites
- Kubernetes 1.28+ cluster with GPU-capable nodes
- NVIDIA GPUs (A10, L4, A100, H100, or similar)
- NVIDIA drivers installed on nodes (535+ recommended)
- Helm 3 for operator and plugin installation
- Prometheus stack for metrics collection
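A quick preflight pass over these prerequisites catches most setup surprises early (a minimal sketch; run the nvidia-smi line on a GPU node itself):

```bash
# Kubernetes server version should be 1.28+
kubectl version | grep -i 'server version'

# Helm 3 should be on the PATH
helm version --short

# On each GPU node: driver 535+ and the expected GPU model
nvidia-smi --query-gpu=driver_version,name --format=csv
```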
NVIDIA GPU Operator Installation
The GPU Operator automates driver, toolkit, device plugin, and DCGM deployment.
```bash
# Add NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true \
  --set nodeStatusExporter.enabled=true \
  --version v24.3.0

# Verify installation
kubectl get pods -n gpu-operator
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
```
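Pod status alone can hide a half-finished rollout; the operator aggregates overall readiness in its ClusterPolicy resource (the name cluster-policy matches the chart default patched later in this doc):

```bash
# Prints "ready" once driver, toolkit, device plugin, and DCGM are all up
kubectl get clusterpolicy cluster-policy -o jsonpath='{.status.state}'
```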
NVIDIA Device Plugin (Standalone)
If not using the GPU Operator, deploy the device plugin directly.
```yaml
# nvidia-device-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
          securityContext:
            privileged: true
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
            - name: DEVICE_SPLIT_COUNT
              value: "1"
            - name: DEVICE_LIST_STRATEGY
              value: "envvar"
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```
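Whichever path you choose (operator or standalone plugin), a throwaway pod that runs nvidia-smi verifies scheduling, the driver, and the container toolkit end to end. A minimal sketch; the CUDA image tag is an assumption, any CUDA base image works:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# The pod should print the nvidia-smi table and exit cleanly
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-smoke-test --timeout=5m
kubectl logs gpu-smoke-test && kubectl delete pod gpu-smoke-test
```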
MIG (Multi-Instance GPU) Partitioning
MIG allows a single A100 or H100 to be split into up to seven isolated GPU instances.
```yaml
# mig-config.yaml - ConfigMap for MIG Manager
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # 7 small instances for inference microservices
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      # 3 medium instances for mid-size models
      all-2g.20gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.20gb": 3
      # Mixed: 1 large + 4 small
      mixed-inference:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "1g.10gb": 4
      # Full GPU for training (no partitioning)
      all-disabled:
        - devices: all
          mig-enabled: false
```

```bash
# Apply MIG profile to a node
kubectl label nodes gpu-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite

# Verify MIG instances
kubectl exec -it nvidia-device-plugin-xxxxx -n kube-system -- nvidia-smi mig -lgi

# Check available MIG resources
kubectl get nodes gpu-node-01 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com")))'
```
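Switching profiles reconfigures the GPU and kills any CUDA process still on it, so drain the node first. A sketch of the safe sequence (node name is a placeholder; the mig.config.state label is set by mig-manager as it reconfigures):

```bash
# Move workloads off before reconfiguring
kubectl cordon gpu-node-01
kubectl drain gpu-node-01 --ignore-daemonsets --delete-emptydir-data

# Switch the profile and wait for mig-manager to report success
kubectl label nodes gpu-node-01 nvidia.com/mig.config=mixed-inference --overwrite
kubectl wait node/gpu-node-01 \
  --for=jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'=success \
  --timeout=10m
kubectl uncordon gpu-node-01
```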
Requesting MIG Slices in Pods
```yaml
# pod-with-mig.yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-small
spec:
  containers:
    - name: model
      image: registry.internal/vllm-server:latest
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
          # For medium slice:
          # nvidia.com/mig-2g.20gb: 1
          # For large slice:
          # nvidia.com/mig-3g.40gb: 1
```
GPU Time-Slicing
For GPUs that do not support MIG (A10, L4), use time-slicing to share a GPU.
```yaml
# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

```bash
# Apply time-slicing config
kubectl patch clusterpolicy/cluster-policy \
  --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

# After applying, each physical GPU appears as 4 virtual GPUs
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
# Output: "4" per physical GPU
```
DCGM Monitoring
```yaml
# dcgm-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      interval: 15s
      path: /metrics
```
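With the ServiceMonitor in place, per-node utilization is one query away (a sketch; the Prometheus URL is an assumption for your stack):

```bash
# Average GPU utilization per node; the DCGM exporter attaches a Hostname label
curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)' | jq '.data.result'
```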
Key DCGM Metrics and Alert Rules
```yaml
# gpu-alerts.yaml
groups:
  - name: gpu-health
    rules:
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} temperature above 85C on {{ $labels.node }}"
      - alert: GPUMemoryPressure
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory above 90% on {{ $labels.node }} GPU {{ $labels.gpu }}"
      - alert: GPUECCErrors
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Double-bit ECC errors detected on {{ $labels.node }} GPU {{ $labels.gpu }}"
      - alert: GPUXidErrors
        expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Xid error on {{ $labels.node }} GPU {{ $labels.gpu }}: {{ $labels.xid }}"
      - alert: GPULowUtilization
        expr: DCGM_FI_DEV_GPU_UTIL < 10 and on(pod) kube_pod_status_phase{phase="Running"} == 1
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "GPU underutilized on {{ $labels.node }} - consider rightsizing"
      - alert: GPUDriverMismatch
        expr: count(count by (driver_version) (DCGM_FI_DRIVER_VERSION)) > 1
        labels:
          severity: warning
        annotations:
          summary: "Multiple GPU driver versions detected across cluster"
```
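Before loading these rules, promtool (bundled with Prometheus) catches YAML and PromQL mistakes cheaply:

```bash
promtool check rules gpu-alerts.yaml
```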
GPU Node Pool Configuration
```yaml
# gpu-nodepool.yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    gpu-type: a100
    gpu-memory: "80gb"
    gpu-mig-capable: "true"
    node-role: gpu-inference
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
---
# Inference deployment with GPU scheduling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ai-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        gpu-type: a100
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: llm-inference
                topologyKey: kubernetes.io/hostname
      containers:
        - name: vllm
          image: registry.internal/vllm-server:0.4.1
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "32Gi"
            limits:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: "64Gi"
          # Note: do not set CUDA_VISIBLE_DEVICES here; the device plugin
          # injects the allocated device into the container automatically.
```
GPU Autoscaling
```yaml
# gpu-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "75"
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
---
# Cluster Autoscaler config for GPU node pools
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-config
  namespace: kube-system
data:
  config: |
    expander: priority
    scale-down-delay-after-add: 10m
    scale-down-unneeded-time: 10m
    skip-nodes-with-local-storage: false
    balance-similar-node-groups: true
    expendable-pods-priority-cutoff: -10
    gpu-total:
      - min: 2
        max: 16
        gpu: nvidia.com/gpu
```
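The HPA above reads DCGM_FI_DEV_GPU_UTIL and inference_queue_depth through the custom metrics API, which requires an adapter such as prometheus-adapter with matching rules; that wiring is assumed here. A quick sanity check that the metrics are actually served:

```bash
# Both metric names should appear once the adapter rules are in place
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r '.resources[].name' | grep -Ei 'dcgm|queue'
```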
Scheduling Patterns
- Use node affinity by GPU type (A10/L4/A100/H100).
- Separate latency-critical inference from batch training.
- Pin model replicas with anti-affinity for availability.
- Reserve headroom for failover and rolling updates (see the placeholder sketch after this list).
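One way to implement the headroom item is low-priority placeholder pods that hold spare GPU capacity and are preempted the moment real workloads need it. A sketch, assuming you pair the class with a small placeholder Deployment that requests nvidia.com/gpu; the value sits below the expendable-pods-priority-cutoff used earlier, so Cluster Autoscaler also treats the placeholders as expendable:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-placeholder
value: -100
globalDefault: false
description: "Preemptible placeholders that hold spare GPU headroom"
EOF
```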
Cost Optimization
- Prefer MIG slices for smaller inference services.
- Schedule batch jobs in off-peak windows.
- Route low-priority traffic to cheaper model tiers.
- Use spot/preemptible instances for training workloads.
- Monitor GPU utilization and rightsize deployments (see the audit query after this list).
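For the utilization item, a recurring audit query surfaces consolidation candidates (a sketch; the Prometheus URL and the 20% threshold are assumptions):

```bash
# Nodes averaging under 20% GPU utilization over the past 7 days
curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=avg_over_time((avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL))[7d:5m]) < 20'
```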
Troubleshooting
| Symptom | Check | Fix |
|---|---|---|
| Pod stuck in Pending | `kubectl describe pod` scheduling events | Verify node has allocatable GPUs, check taints/tolerations |
| CUDA OOM during inference | Model memory footprint vs. GPU memory | Reduce batch size, use quantization, or use a MIG slice |
| DCGM metrics missing | ServiceMonitor selector matches the exporter Service labels | Verify DCGM exporter pod is running and check scrape config |
| Driver mismatch after upgrade | `nvidia-smi` driver version on each node | Cordon node, drain, upgrade driver, uncordon |
| GPU not detected | Device plugin pod logs | Restart device plugin, check NVIDIA Container Toolkit |
| Time-slicing not working | ConfigMap applied but no extra GPUs advertised | Restart device plugin pods after the config change |
| ECC errors increasing | `nvidia-smi -q -d ECC` error counters | Schedule node drain and hardware replacement |
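A few first-response commands mapped to the table (pod and node names are placeholders; the device plugin label assumes GPU Operator defaults):

```bash
# Pod stuck in Pending: read the scheduler events
kubectl describe pod <pod-name> | sed -n '/Events:/,$p'

# GPU not detected: device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=50

# Driver / ECC state on a node (via SSH or a debug pod)
nvidia-smi --query-gpu=driver_version,ecc.errors.uncorrected.volatile.total --format=csv
```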
Related Skills
- llm-inference-scaling - Autoscale inference workloads
- model-serving-kubernetes - Production model serving patterns
- gpu-server-management - Host-level GPU management fundamentals
- multi-tenant-llm-hosting - Multi-tenant GPU sharing
- llm-cost-optimization - Cost optimization strategies