gpu-kubernetes-operations


GPU Kubernetes Operations


Run resilient and cost-efficient GPU clusters for production AI workloads.

When to Use This Skill


  • Setting up GPU node pools in Kubernetes for AI inference or training
  • Configuring NVIDIA device plugin and GPU operator
  • Implementing MIG partitioning to share GPUs across workloads
  • Building GPU-aware autoscaling policies
  • Monitoring GPU health with DCGM and Prometheus
  • Troubleshooting GPU scheduling, driver, or OOM issues

Prerequisites


  • Kubernetes 1.28+ cluster with GPU-capable nodes
  • NVIDIA GPUs (A10, L4, A100, H100, or similar)
  • NVIDIA drivers installed on nodes (535+ recommended)
  • Helm 3 for operator and plugin installation
  • Prometheus stack for metrics collection

NVIDIA GPU Operator Installation


The GPU Operator automates driver, toolkit, device plugin, and DCGM deployment.
```bash
# Add NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true \
  --set nodeStatusExporter.enabled=true \
  --version v24.3.0

# Verify installation
kubectl get pods -n gpu-operator
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
```
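Before moving on, it is worth confirming that a pod can schedule onto a GPU end to end. A minimal smoke test, assuming your cluster can pull a public CUDA base image (the image tag here is illustrative, not part of this skill):

```bash
# Throwaway GPU smoke-test pod; the CUDA image tag is an assumption -
# substitute any CUDA base image your cluster can pull.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
# Once the pod completes, the logs should show the nvidia-smi table.
kubectl logs pod/gpu-smoke-test
kubectl delete pod gpu-smoke-test
```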

NVIDIA Device Plugin (Standalone)


If not using the GPU Operator, deploy the device plugin directly.
```yaml
# nvidia-device-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin
  template:
    metadata:
      labels:
        name: nvidia-device-plugin
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      priorityClassName: system-node-critical
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
          securityContext:
            privileged: true
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
            - name: DEVICE_SPLIT_COUNT
              value: "1"
            - name: DEVICE_LIST_STRATEGY
              value: "envvar"
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```
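Roll it out and wait for the DaemonSet to settle; GPUs should then appear in node allocatable:

```bash
kubectl apply -f nvidia-device-plugin.yaml
kubectl -n kube-system rollout status daemonset/nvidia-device-plugin
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
```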

MIG (Multi-Instance GPU) Partitioning


MIG allows a single A100 or H100 to be split into isolated GPU instances.
```yaml
# mig-config.yaml - ConfigMap for MIG Manager
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # 7 small instances for inference microservices
      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7

      # 3 medium instances for mid-size models
      all-2g.20gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "2g.20gb": 3

      # Mixed: 1 large + 4 small
      mixed-inference:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "1g.10gb": 4

      # Full GPU for training (no partitioning)
      all-disabled:
        - devices: all
          mig-enabled: false
```

```bash
# Apply MIG profile to a node
kubectl label nodes gpu-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite

# Verify MIG instances
kubectl exec -it nvidia-device-plugin-xxxxx -n kube-system -- nvidia-smi mig -lgi

# Check available MIG resources
kubectl get nodes gpu-node-01 -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com")))'
```
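Profile names differ by GPU model (an A100 40GB exposes `1g.5gb`, not `1g.10gb`), so if in doubt, list what the hardware supports before editing the ConfigMap:

```bash
# List the GPU instance profiles this node's GPUs support.
kubectl exec -it nvidia-device-plugin-xxxxx -n kube-system -- nvidia-smi mig -lgip
```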

Requesting MIG Slices in Pods


```yaml
# pod-with-mig.yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-small
spec:
  containers:
    - name: model
      image: registry.internal/vllm-server:latest
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
          # For medium slice:
          # nvidia.com/mig-2g.20gb: 1
          # For large slice:
          # nvidia.com/mig-3g.40gb: 1
```
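To confirm the container actually received a MIG slice rather than a full GPU, list the visible devices from inside the pod (this assumes the image ships `nvidia-smi`):

```bash
# Only the assigned MIG instance should be visible to the container.
kubectl exec inference-small -- nvidia-smi -L
# Expected shape (UUIDs will differ):
#   GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-...)
#     MIG 1g.10gb Device 0: (UUID: MIG-...)
```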

GPU Time-Slicing


For GPUs that do not support MIG (A10, L4), use time-slicing to share a GPU.
```yaml
# time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

```bash
# Apply time-slicing config
kubectl patch clusterpolicy/cluster-policy \
  --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

# After applying, each physical GPU appears as 4 virtual GPUs
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
# Output: "4" per physical GPU
```
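With `renameByDefault: false`, pods consume time-sliced replicas through the ordinary `nvidia.com/gpu` resource name. A minimal sketch (the image is the hypothetical registry from earlier examples); keep in mind that, unlike MIG, time-slicing provides no memory or fault isolation, so co-scheduled pods can OOM each other:

```yaml
# Four pods like this can share one physical GPU under replicas: 4.
# No memory isolation: the sharers contend for the full framebuffer.
apiVersion: v1
kind: Pod
metadata:
  name: timesliced-inference
spec:
  containers:
    - name: model
      image: registry.internal/vllm-server:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```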

DCGM Monitoring


```yaml
# dcgm-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      interval: 15s
      path: /metrics
```
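With scraping in place, it can help to pre-aggregate the raw DCGM gauges into per-node rollups for dashboards. A sketch of Prometheus recording rules, assuming dcgm-exporter's default metric names; the `node` label is an assumption that depends on your relabeling config:

```yaml
# Recording rules rolling DCGM gauges up to per-node averages.
groups:
  - name: gpu-usage-rollups
    rules:
      - record: node:gpu_utilization:avg
        expr: avg by (node) (DCGM_FI_DEV_GPU_UTIL)
      - record: node:gpu_fb_used_ratio:avg
        expr: avg by (node) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))
```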

Key DCGM Metrics and Alert Rules


```yaml
# gpu-alerts.yaml
groups:
  - name: gpu-health
    rules:
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} temperature above 85C on {{ $labels.node }}"
      - alert: GPUMemoryPressure
        # used / (used + free) is the fraction of framebuffer in use
        expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory above 90% on {{ $labels.node }} GPU {{ $labels.gpu }}"
      - alert: GPUECCErrors
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Double-bit ECC errors detected on {{ $labels.node }} GPU {{ $labels.gpu }}"
      - alert: GPUXidErrors
        expr: increase(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Xid error on {{ $labels.node }} GPU {{ $labels.gpu }}: {{ $labels.xid }}"
      - alert: GPULowUtilization
        expr: DCGM_FI_DEV_GPU_UTIL < 10 and on(pod) kube_pod_status_phase{phase="Running"} == 1
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "GPU underutilized on {{ $labels.node }} - consider rightsizing"
      - alert: GPUDriverMismatch
        expr: count(count by (driver_version) (DCGM_FI_DRIVER_VERSION)) > 1
        labels:
          severity: warning
        annotations:
          summary: "Multiple GPU driver versions detected across cluster"
```

GPU Node Pool Configuration


```yaml
# gpu-nodepool.yaml
apiVersion: v1
kind: Node
metadata:
  labels:
    gpu-type: a100
    gpu-memory: "80gb"
    gpu-mig-capable: "true"
    node-role: gpu-inference
spec:
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
---
# Inference deployment with GPU scheduling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ai-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        gpu-type: a100
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: llm-inference
                topologyKey: kubernetes.io/hostname
      containers:
        - name: vllm
          image: registry.internal/vllm-server:0.4.1
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "32Gi"
            limits:
              nvidia.com/gpu: 1
              cpu: "8"
              memory: "64Gi"
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "all"
```
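After the rollout, a quick check that the anti-affinity actually spread the replicas across nodes (assuming at least three GPU nodes are available):

```bash
# Each replica should land on a distinct node when capacity allows.
kubectl get pods -n ai-serving -l app=llm-inference -o wide
```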

GPU Autoscaling


```yaml
# gpu-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ai-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "75"
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
---
# Cluster Autoscaler config for GPU node pools
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-config
  namespace: kube-system
data:
  config: |
    expander: priority
    scale-down-delay-after-add: 10m
    scale-down-unneeded-time: 10m
    skip-nodes-with-local-storage: false
    balance-similar-node-groups: true
    expendable-pods-priority-cutoff: -10
    gpu-total:
      - min: 2
        max: 16
        gpu: nvidia.com/gpu
```
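Note that `DCGM_FI_DEV_GPU_UTIL` is not available to the HPA out of the box; a Pods metric has to be served through the custom metrics API, typically by prometheus-adapter. A minimal sketch of an adapter rule mapping the DCGM gauge onto pods; the `exported_namespace`/`exported_pod` label names are assumptions that depend on your scrape config:

```yaml
# prometheus-adapter Helm values fragment (label names are assumptions -
# match them to how your dcgm-exporter metrics are relabeled).
rules:
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_pod!=""}'
      resources:
        overrides:
          exported_namespace: { resource: "namespace" }
          exported_pod: { resource: "pod" }
      name:
        as: "DCGM_FI_DEV_GPU_UTIL"
      metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```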

Scheduling Patterns


  • Use node affinity by GPU type (A10/L4/A100/H100); see the sketch after this list.
  • Separate latency-critical inference from batch training.
  • Pin model replicas with anti-affinity for availability.
  • Reserve headroom for failover and rolling updates.
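A minimal sketch of per-GPU-type node affinity, assuming nodes carry the `gpu-type` label used in the node pool section above:

```yaml
# Pod spec fragment: require nodes labeled with one of the listed GPU types.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu-type
              operator: In
              values: ["a10", "l4"]
```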

Cost Optimization


  • Prefer MIG slices for smaller inference services.
  • Schedule batch jobs in off-peak windows.
  • Route low-priority traffic to cheaper model tiers.
  • Use spot/preemptible instances for training workloads (see the sketch after this list).
  • Monitor GPU utilization and rightsize deployments.
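As an example of the spot pattern, a training pod can tolerate a spot taint and select spot-labeled nodes. The taint and label names here follow GKE's convention and are cloud-specific assumptions; adjust for your provider:

```yaml
# Training pod spec fragment targeting spot capacity (GKE naming assumed).
tolerations:
  - key: cloud.google.com/gke-spot
    operator: Equal
    value: "true"
    effect: NoSchedule
nodeSelector:
  cloud.google.com/gke-spot: "true"
```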

Troubleshooting


| Symptom | Check | Fix |
| --- | --- | --- |
| Pod stuck in Pending | `kubectl describe pod` for GPU resource events | Verify node has allocatable GPUs; check taints/tolerations |
| CUDA OOM during inference | Model too large for GPU memory | Reduce batch size, use quantization, or use a MIG slice |
| DCGM metrics missing | ServiceMonitor label matching | Verify DCGM exporter pod is running and check scrape config |
| Driver mismatch after upgrade | `nvidia-smi` on each node | Cordon node, drain, upgrade driver, uncordon |
| GPU not detected | Device plugin pod logs | Restart device plugin; check NVIDIA container toolkit |
| Time-slicing not working | ConfigMap applied but no extra GPUs | Restart device plugin pods after config change |
| ECC errors increasing | `nvidia-smi -q -d ECC` | Schedule node drain and hardware replacement |
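A quick triage sequence for the most common case (a Pending GPU pod); `<pod-name>` is a placeholder, and the device plugin label assumes the GPU Operator install (standalone installs use the kube-system DaemonSet above):

```bash
# 1. Why is the pod unschedulable? Look for GPU resource events.
kubectl describe pod <pod-name> | grep -A10 Events
# 2. Does any node actually advertise allocatable GPUs?
kubectl get nodes -o json | jq -r '.items[] | [.metadata.name, .status.allocatable["nvidia.com/gpu"]] | @tsv'
# 3. Is the device plugin healthy?
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
```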

Related Skills


  • llm-inference-scaling - Autoscale inference workloads
  • model-serving-kubernetes - Production model serving patterns
  • gpu-server-management - Host-level GPU management fundamentals
  • multi-tenant-llm-hosting - Multi-tenant GPU sharing
  • llm-cost-optimization - Cost optimization strategies