gpu-optimizer


GPU Optimizer

Expert GPU optimization for consumer GPUs with 8–24GB VRAM. Evidence-based patterns only.

Hardware Profile

Fill in your hardware before applying optimizations:

| Property | Your Value |
| --- | --- |
| GPU model | (e.g., RTX 4080 Mobile, RTX 3090, RTX 4090) |
| VRAM | (e.g., 12GB, 16GB, 24GB) |
| CUDA version | (`nvidia-smi` → top-right) |
| TDP / power limit | (laptop vs desktop affects sustained throughput) |
| Driver version | (`nvidia-smi` → top-left) |

Key constraint: VRAM capacity determines which strategies apply. Patterns below are annotated with minimum VRAM requirements where relevant.
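Most of the profile fields can be collected programmatically. A minimal sketch (the `query_gpu` helper is illustrative, not part of any library) that uses `nvidia-smi`'s CSV query interface and degrades gracefully on machines without an NVIDIA driver:

```python
import shutil
import subprocess

def query_gpu(fields="name,memory.total,driver_version"):
    """Return one CSV row per visible GPU, or None when nvidia-smi is absent/fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

On a machine with an RTX-class GPU this yields rows like `"NVIDIA GeForce RTX 3090, 24576 MiB, 550.54.14"`; the CUDA version still comes from the top of plain `nvidia-smi` output.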

Optimization Categories

1. XGBoost GPU Acceleration

**DMatrix vs QuantileDMatrix:**

```python
import numpy as np
import xgboost as xgb

# GPU-optimized: QuantileDMatrix is 1.8x faster for training
dtrain = xgb.QuantileDMatrix(X_train.astype(np.float32))
dval = xgb.QuantileDMatrix(X_val.astype(np.float32))

# Standard: DMatrix (use for inference only)
dtest = xgb.DMatrix(X_test.astype(np.float32))
```

**Critical Parameters:**

```python
params = {
    'tree_method': 'hist',        # Histogram algorithm (runs on GPU with device='cuda')
    'device': 'cuda:0',           # Explicit GPU device (XGBoost >= 2.0)
    'max_bin': 256,               # Higher bins = better splits (VRAM permitting)
    'grow_policy': 'depthwise',   # vs 'lossguide' for imbalanced data
    # Note: 'predictor': 'gpu_predictor' is the legacy (<2.0) flag;
    # 'device' now controls GPU inference as well.
}
```


Training with explicit device:

```python
model = xgb.train(params, dtrain, num_boost_round=100)
```

**GPU Verification (fail-fast):**

```python
def verify_gpu():
    """Verify XGBoost GPU availability. Raises if unavailable."""
    import subprocess
    import xgboost as xgb

    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError("nvidia-smi failed - no GPU available")
    except FileNotFoundError:
        raise RuntimeError("nvidia-smi not found - no GPU available")

    build_info = xgb.build_info()
    if not build_info.get("USE_CUDA"):
        raise RuntimeError("XGBoost not compiled with CUDA support")
```
**Memory Management:**

```python
# Single-pass training (reuse the QuantileDMatrix across slots)
dtrain = xgb.QuantileDMatrix(X_train.astype(np.float32))
for slot_idx in range(num_slots):
    dtrain.set_label(y_train[:, slot_idx])  # Reuse matrix, swap labels
    model = xgb.train(params, dtrain, num_boost_round=100)
```

2. PyTorch Mixed Precision

**BF16 (preferred) vs FP16:**

```python
import torch
from torch.amp import autocast, GradScaler

# Auto-detect best precision
if torch.cuda.is_bf16_supported():
    amp_dtype = torch.bfloat16  # Ampere+ GPUs support BF16
else:
    amp_dtype = torch.float16

# Training step
scaler = GradScaler('cuda') if amp_dtype == torch.float16 else None
with autocast('cuda', dtype=amp_dtype):
    output = model(input_ids, attention_mask)
    loss = criterion(output, targets)

# Backward with loss scaling (FP16 only)
if scaler:
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
else:
    loss.backward()
    optimizer.step()
```

**Why BF16 > FP16:**

- Same exponent range as FP32 (overflow/underflow is rare in practice)
- No GradScaler needed (simpler code)
- Ampere and later GPUs have native BF16 Tensor Cores
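The "same exponent range as FP32" claim can be checked with plain float arithmetic from each format's bit layout (FP16: 5 exponent/10 mantissa bits; BF16: 8/7; FP32: 8/23):

```python
# Max finite value of a binary float format = (2 - 2**-mantissa_bits) * 2**max_exponent
fp16_max = (2 - 2**-10) * 2.0**15    # 65504.0: activations/grads overflow easily
bf16_max = (2 - 2**-7) * 2.0**127    # ~3.39e38: same exponent range as FP32
fp32_max = (2 - 2**-23) * 2.0**127   # ~3.40e38

assert fp16_max == 65504.0
```

BF16 trades mantissa precision for FP32's exponent range, which is why it avoids the loss-scaling machinery FP16 needs.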

3. VRAM Management

**Gradient Checkpointing:**

```python
# Saves ~40% VRAM, adds ~20% compute time
model.gradient_checkpointing_enable()

# For wrapped transformer models (e.g., a PEFT-wrapped base model):
model.base_model.model.gradient_checkpointing_enable()
```

**VRAM Monitoring:**

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... training ...

peak_vram_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_vram_gb:.2f} GB")

# Clear cache between experiments
torch.cuda.empty_cache()
```

**Gradient Accumulation:**

```python
# Simulate a larger batch size without OOM
grad_accum_steps = max(1, target_batch_size // actual_batch_size)
for i, batch in enumerate(dataloader):
    loss = model(batch) / grad_accum_steps
    loss.backward()
    if (i + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
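The `1/grad_accum_steps` scaling makes the accumulated gradient match the large-batch gradient exactly for a mean-reduced loss over equal-sized micro-batches. A quick arithmetic check with toy per-sample losses:

```python
# Four micro-batches of 2 samples vs one batch of 8 (mean-reduced loss)
micro_batches = [[1.0, 3.0], [2.0, 2.0], [4.0, 0.0], [1.0, 1.0]]
grad_accum_steps = len(micro_batches)

accumulated = sum(
    (sum(mb) / len(mb)) / grad_accum_steps  # micro-batch mean, scaled down
    for mb in micro_batches
)
full_batch = sum(x for mb in micro_batches for x in mb) / 8

assert abs(accumulated - full_batch) < 1e-12
```

Note the equivalence assumes equal micro-batch sizes; a ragged final micro-batch introduces a small bias.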

**DoE for VRAM Optimization:**

```python
EXPERIMENTS = [
    {"batch_size": 2,  "seq_len": 128, "grad_ckpt": True,  "amp": "bf16"},
    {"batch_size": 4,  "seq_len": 256, "grad_ckpt": True,  "amp": "bf16"},
    {"batch_size": 8,  "seq_len": 512, "grad_ckpt": False, "amp": "bf16"},
    {"batch_size": 16, "seq_len": 256, "grad_ckpt": False, "amp": "bf16"},
]
```
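A sweep over such configs needs to survive the ones that OOM. One way to drive it (a sketch: `train_fn` is a stand-in for your training entry point; PyTorch surfaces CUDA OOM as a `RuntimeError` whose message contains "out of memory"):

```python
def run_experiments(experiments, train_fn):
    """Run each config, recording success or OOM instead of crashing the sweep."""
    results = []
    for cfg in experiments:
        try:
            results.append({**cfg, "status": "ok", "result": train_fn(**cfg)})
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise  # only swallow OOM; real bugs should still surface
            results.append({**cfg, "status": "oom"})
    return results

# Demo with a stub that pretends to OOM above batch size 8
def stub_train(batch_size, **_):
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory")
    return batch_size

results = run_experiments([{"batch_size": 4}, {"batch_size": 16}], stub_train)
```

In a real sweep, also call `torch.cuda.empty_cache()` after an OOM so the next config starts from a clean allocator state.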

4. Aggressive Vectorization

**Tensor Lookups (not Python loops):**

```python
# Slow: Python loop
for i, token_id in enumerate(input_ids):
    type_id = token_to_type[token_id]
    embeddings[i] = type_embeddings[type_id]

# Fast: vectorized
type_ids = token_to_type[input_ids]     # Broadcast lookup
embeddings = type_embeddings[type_ids]  # Single GPU kernel
```
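The same fancy-indexing pattern works on CPU with NumPy, which makes it easy to sanity-check that the two versions agree on toy data before switching the GPU path over:

```python
import numpy as np

token_to_type = np.array([0, 1, 1, 2])            # toy vocab of 4 token ids
type_embeddings = np.array([[0.0], [1.0], [2.0]])
input_ids = np.array([3, 0, 2])

# Slow path: explicit loop
looped = np.stack([type_embeddings[token_to_type[t]] for t in input_ids])

# Fast path: one broadcast lookup, then one gather
vectorized = type_embeddings[token_to_type[input_ids]]

assert (looped == vectorized).all()
```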

**Registered Buffers (persistent GPU data):**

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Build lookup tensors once
        type_ids = torch.zeros(vocab_size, dtype=torch.long)
        self.register_buffer('_type_ids', type_ids)  # Moves to GPU with the model

    def forward(self, input_ids):
        return self._type_ids[input_ids]  # Vectorized lookup
```
**Batch Operations:**

```python
# Slow: per-sample processing
outputs = [model(x.unsqueeze(0)) for x in batch]

# Fast: batched
outputs = model(batch)  # Single forward pass
```

5. CuPy Migration (NumPy → GPU)

**When to Use CuPy:**
  • Large array operations (>1M elements)
  • Repeated NumPy calls in tight loops
  • Preprocessing pipelines before PyTorch/XGBoost

**Migration Pattern:**

```python
import cupy as cp
import numpy as np

# NumPy (CPU)
x = np.random.randn(10000, 1000)
y = np.dot(x, x.T)

# CuPy (GPU) - same API
x_gpu = cp.random.randn(10000, 1000)
y_gpu = cp.dot(x_gpu, x_gpu.T)

# Transfer back if needed
y_cpu = cp.asnumpy(y_gpu)
```

**Interop with PyTorch:**

```python
import cupy as cp
import torch

# CuPy → PyTorch (zero-copy)
x_cupy = cp.random.randn(1000, 1000)
x_torch = torch.as_tensor(x_cupy, device='cuda')

# PyTorch → CuPy (zero-copy)
x_torch = torch.randn(1000, 1000, device='cuda')
x_cupy = cp.asarray(x_torch)
```

**Install:**

```bash
uv pip install cupy-cuda12x  # For CUDA 12.x
```

6. cuDF Migration (Pandas → GPU)

**When to Use cuDF:**
  • DataFrames >1GB
  • Groupby/aggregation on large data
  • ETL pipelines before model training

**Migration Pattern:**

```python
import cudf
import pandas as pd

# Pandas (CPU)
df = pd.read_csv('large.csv')
grouped = df.groupby('category')['value'].mean()

# cuDF (GPU) - same API
df_gpu = cudf.read_csv('large.csv')
grouped_gpu = df_gpu.groupby('category')['value'].mean()

# Transfer back
grouped_cpu = grouped_gpu.to_pandas()
```
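Because the APIs match, a pipeline can fall back to pandas when the RAPIDS stack is not installed. A minimal sketch (the `read_frame` helper is illustrative, and assumes only the shared `read_csv` entry point):

```python
def read_frame(path):
    """Read a CSV with cuDF when available, otherwise pandas (same read_csv API)."""
    try:
        import cudf as frame_lib       # GPU path
    except ImportError:
        import pandas as frame_lib     # CPU fallback
    return frame_lib.read_csv(path)
```

This keeps development machines without a GPU on the identical code path; only the import resolves differently.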

**XGBoost Integration:**

```python
import cudf
import xgboost as xgb

# Load data on GPU
df = cudf.read_csv('train.csv')
X = df[feature_cols]
y = df['target']

# Create DMatrix directly from cuDF (no CPU copy)
dtrain = xgb.DMatrix(X, label=y)
```

**Install:**

```bash
# RAPIDS (includes cuDF, cuML, cuGraph)
uv pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com
```

7. PyTorch Compilation & Optimization

**Fused Optimizer:**

```python
import torch

# Check availability (the 'fused' kwarg exists in recent PyTorch)
use_fused = (
    torch.cuda.is_available()
    and "fused" in torch.optim.AdamW.__init__.__code__.co_varnames
)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    fused=use_fused,  # Single GPU kernel (2-3x faster optimizer step)
)
```

**Torch Compile:**

```python
# PyTorch 2.0+ compile
if hasattr(torch, "compile"):
    model = torch.compile(model, mode="reduce-overhead")
```

**cuDNN Benchmarking:**

```python
# Auto-tune kernels (slower startup, faster training)
torch.backends.cudnn.benchmark = True

# For deterministic runs, disable benchmarking instead:
# torch.backends.cudnn.benchmark = False
# torch.backends.cudnn.deterministic = True
```

8. Advanced Loss Functions

**Weighted Slot Loss:**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedSlotLoss(nn.Module):
    def __init__(self, slot_weights):
        super().__init__()
        self.slot_weights = torch.tensor(slot_weights)

    def forward(self, logits_list, targets):
        weighted_losses = []
        for i, logits in enumerate(logits_list):
            loss = F.cross_entropy(logits, targets[:, i])
            weighted_losses.append(loss * self.slot_weights[i])
        return torch.stack(weighted_losses).sum() / self.slot_weights.sum()
```

**Focal Loss (hard example mining):**

```python
class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):
        ce_loss = F.cross_entropy(logits, targets, reduction='none')
        pt = torch.exp(-ce_loss)  # Probability of the true class
        focal_loss = ((1 - pt) ** self.gamma) * ce_loss
        return focal_loss.mean()
```
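The focal factor is easy to verify with scalar arithmetic: for a per-example cross-entropy `ce`, `pt = exp(-ce)` recovers the probability the model assigned to the true class, and `(1 - pt) ** gamma` shrinks toward zero for confident (easy) examples:

```python
import math

def focal_weight(ce_loss, gamma=2.0):
    """Per-example downweighting factor applied by the focal loss."""
    pt = math.exp(-ce_loss)       # probability assigned to the true class
    return (1 - pt) ** gamma

easy = focal_weight(0.01)   # confident correct prediction: weight ~1e-4
hard = focal_weight(2.30)   # wrong/uncertain prediction: weight ~0.81

assert easy < 1e-3 < hard
```

So with `gamma=2`, an easy example contributes roughly four orders of magnitude less gradient than a hard one, which is the "hard example mining" effect.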

9. Caching & Precomputation

**Position Embedding Cache:**

```python
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self._pos_cache = {}  # {seq_len: positions}

    def forward(self, x):
        T = x.size(1)
        if T not in self._pos_cache:
            self._pos_cache[T] = torch.arange(T, device=x.device)
            # Limit cache size
            if len(self._pos_cache) > 10:
                self._pos_cache.pop(next(iter(self._pos_cache)))
        return self.pos_embed(self._pos_cache[T])
```

**Attention Mask Cache:**

```python
def _create_causal_mask(self, T, device):
    if T not in self._mask_cache:  # self._mask_cache = {} in __init__
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        self._mask_cache[T] = mask.to(device)
    return self._mask_cache[T]
```
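Both caches rely on Python dicts preserving insertion order, so `next(iter(cache))` evicts the oldest-inserted entry (FIFO, not LRU). The eviction behaviour in isolation, with a smaller bound for illustration:

```python
cache = {}

def cached(key, build, max_size=3):
    """FIFO-bounded memo: dict insertion order makes next(iter(...)) the oldest key."""
    if key not in cache:
        cache[key] = build(key)
        if len(cache) > max_size:
            cache.pop(next(iter(cache)))  # evict the oldest-inserted entry
    return cache[key]

for seq_len in [128, 256, 512, 1024]:
    cached(seq_len, lambda t: list(range(t)))
```

A true LRU (e.g., `functools.lru_cache`) would also refresh an entry's position on hits; for a handful of sequence lengths the simpler FIFO above is usually sufficient.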

Quick Diagnostics

**Check GPU Utilization:**

```bash
watch -n 1 nvidia-smi  # Monitor in real-time
```

**Profile PyTorch:**

```python
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA],
    with_stack=True,
) as prof:
    model(batch)

print(prof.key_averages().table(sort_by="cuda_time_total"))
```

**Bottleneck Detection:**

```bash
python -m torch.utils.bottleneck script.py
```
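CUDA kernels launch asynchronously, so a naive wall-clock timer measures launch overhead rather than execution time. A small timer that takes an optional synchronization hook (pass `sync=torch.cuda.synchronize` when timing GPU work; the helper itself is framework-agnostic):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, sync=None):
    """Wall-clock timer; call sync() around the region for honest GPU numbers."""
    if sync:
        sync()                    # flush pending GPU work before starting
    start = time.perf_counter()
    yield
    if sync:
        sync()                    # wait for the timed kernels to finish
    print(f"{label}: {time.perf_counter() - start:.4f}s")

with timed("cpu demo"):
    total = sum(range(100_000))
```

This is also why the anti-pattern list below warns against *unnecessary* `torch.cuda.synchronize()` calls: synchronize only inside measurement code, never in the hot training loop.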

Migration Checklist

  • XGBoost: Use `QuantileDMatrix`, set `device='cuda:0'`
  • PyTorch: Enable BF16/FP16, fused optimizer, torch.compile
  • VRAM: Gradient checkpointing if approaching the VRAM limit
  • NumPy→CuPy: For preprocessing >1M elements
  • Pandas→cuDF: For DataFrames >1GB
  • Vectorization: Replace Python loops with tensor ops
  • Caching: Precompute positions, masks, embeddings
  • Monitor: Track VRAM usage, profile GPU kernels

Anti-Patterns

Avoid:
  • Using `.cpu()` in the training loop (kills the GPU pipeline)
  • Creating tensors on CPU then moving to GPU (create on GPU directly)
  • Using Python loops over tensors (vectorize)
  • Ignoring VRAM monitoring (leads to OOM crashes)
  • Using FP32 when BF16/FP16 works (wastes bandwidth)
  • Calling `torch.cuda.synchronize()` unnecessarily (breaks async execution)

Error Handling

  • CUDA not available at runtime: run `nvidia-smi` first to confirm the GPU is visible; if the command fails, verify the driver installation with `sudo nvidia-smi` or reinstall drivers before proceeding.
  • XGBoost raises `RuntimeError: XGBoost not compiled with CUDA support`: install the CUDA build via `uv pip install xgboost` from a CUDA-enabled environment, or build from source with `-DUSE_CUDA=ON`.
  • OOM during training: reduce the batch size first (halve it), then enable gradient checkpointing; if OOM persists after both, enable gradient accumulation to simulate the original batch size.
  • CuPy import failure (`ImportError` or version mismatch): verify the CUDA toolkit version with `nvcc --version` and install the matching CuPy wheel (e.g., `cupy-cuda12x` for CUDA 12.x).
  • cuDF install fails or produces CUDA version errors: use the NVIDIA PyPI index (`--extra-index-url=https://pypi.nvidia.com`) and match the `cudf-cu12` suffix to your CUDA major version.
  • `torch.compile` produces incorrect results or crashes: skip the `torch.compile` call to isolate the problem; it is known to fail on some custom ops, so fall back to eager mode for those layers.
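The OOM recipe (halve the batch, retry) can be automated. A sketch, assuming the training callable raises PyTorch's usual `RuntimeError` whose message contains "out of memory":

```python
def fit_with_backoff(train_fn, batch_size, min_batch=1):
    """Halve batch_size on CUDA OOM until training fits or min_batch is passed."""
    while batch_size >= min_batch:
        try:
            return batch_size, train_fn(batch_size)
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise
            batch_size //= 2  # then use gradient accumulation to compensate

    raise RuntimeError("out of memory even at the minimum batch size")

# Demo with a stub that pretends to OOM above batch size 8
def stub_train(batch_size):
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory")
    return "trained"

fitted_batch, _ = fit_with_backoff(stub_train, 32)
```

Returning the surviving batch size lets the caller set `grad_accum_steps = target_batch_size // fitted_batch` to restore the original effective batch.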

Limitations

  • NVIDIA GPUs only — AMD (ROCm) and Intel Arc GPUs are not covered by these patterns.
  • Assumes a single-GPU setup; multi-GPU (DDP, FSDP) requires additional configuration not covered here.
  • Patterns are calibrated for consumer GPUs (8–24GB VRAM); datacenter GPUs (A100, H100) have different memory hierarchies and may benefit from different strategies.
  • Framework coverage: PyTorch, XGBoost, and RAPIDS (CuPy/cuDF) only — JAX, TensorFlow, and MXNet are out of scope.
  • Laptop GPU TDP limits sustained throughput; power-throttled performance can differ significantly from desktop benchmarks even at the same VRAM capacity.

Output Format

Each optimization recommendation includes a before/after code pair showing the original pattern and the GPU-optimized equivalent. Performance gain estimates are provided as ranges (e.g., "1.8x faster", "~40% VRAM reduction") based on typical consumer GPU benchmarks — actual gains depend on workload and hardware. Where a change introduces a trade-off (e.g., gradient checkpointing adds compute time), the trade-off is stated explicitly inline.