perf-cpu-offloading
CPU Offloading
References
- Stable docs: @docs/training/cpu-offloading.md
- Structured metadata: @skills/perf-cpu-offloading/card.yaml
What It Is
Two independent mechanisms to move data from GPU to CPU memory:
| Mechanism | Config namespace | What gets offloaded | PP restriction |
|---|---|---|---|
| Activation offloading | `cfg.model` | Activations (and optionally weights) per transformer layer | PP must be 1 |
| Optimizer offloading | `cfg.optimizer` | Adam optimizer states (momentum + variance) via `HybridDeviceOptimizer` | None |
Quick Decision
| Situation | Recommendation |
|---|---|
| Large MoE model (30B+), needs PP > 1 | Optimizer offloading; activation offloading requires PP=1 |
| Small/medium model, PP=1 fits, activation memory dominates | Activation offloading |
| Want tunable memory-speed tradeoff | Optimizer offloading with a fractional `optimizer_offload_fraction` |
| Throughput is top priority | Don't enable; offloading always adds overhead |
| CUDA graphs are needed | Optimizer offloading only; activation offloading is incompatible |
| Memory pressure is moderate | Optimizer offload at 25–50% fraction for best efficiency |
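The decision table can be read as a short priority list. The sketch below is one possible encoding; the `Workload` type, its field names, and the order in which the rules are checked are assumptions made for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of a training setup (illustrative fields only)."""
    needs_pipeline_parallel: bool      # PP > 1 is required to fit the model
    needs_cuda_graphs: bool
    throughput_is_top_priority: bool
    activation_memory_dominates: bool


def recommend_offloading(w: Workload) -> str:
    """Return 'none', 'optimizer', or 'activation' following the table."""
    if w.throughput_is_top_priority:
        return "none"  # offloading always adds overhead
    if w.needs_pipeline_parallel or w.needs_cuda_graphs:
        # Activation offloading requires PP=1 and is incompatible with CUDA graphs.
        return "optimizer"
    if w.activation_memory_dominates:
        return "activation"
    # Default to the tunable option (optimizer_offload_fraction).
    return "optimizer"
```

For example, `recommend_offloading(Workload(True, False, False, False))` returns `"optimizer"`, matching the large-MoE row.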
Enablement
Optimizer CPU offloading (recommended for large models)
```python
cfg.optimizer.optimizer_cpu_offload = True          # master switch
cfg.optimizer.optimizer_offload_fraction = 1.0      # offload all optimizer states
cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True  # overlap transfers with compute
```
CLI overrides:
```bash
optimizer.optimizer_cpu_offload=True \
optimizer.optimizer_offload_fraction=0.5 \
optimizer.overlap_cpu_optimizer_d2h_h2d=True
```
Activation CPU offloading (small/medium models only)
```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 16
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = False
cfg.model.pipeline_model_parallel_size = 1   # required: PP must be 1
cfg.model.recompute_granularity = None       # required: no activation recompute
cfg.model.cuda_graph_impl = "none"           # required: no CUDA graphs
```
Config Parameter Reference
Optimizer offloading
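| Parameter | Default | Description |
|---|---|---|
| `optimizer_cpu_offload` | `False` | Master switch |
| `optimizer_offload_fraction` | `0.0` | Fraction of optimizer states on CPU (0.0–1.0) |
| `overlap_cpu_optimizer_d2h_h2d` | `False` | Overlap GPU↔CPU transfers with compute |
| `use_torch_optimizer_for_cpu_offload` | `False` | Use a PyTorch optimizer for the CPU portion |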
Activation offloading
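| Parameter | Default | Description |
|---|---|---|
| `cpu_offloading` | `False` | Master switch |
| `cpu_offloading_num_layers` | `0` | Number of transformer layers to offload (0 to num_layers-1) |
| `cpu_offloading_activations` | `True` | Offload activations |
| `cpu_offloading_weights` | `False` | Offload weights |
| `cpu_offloading_double_buffering` | `False` | Double-buffer across layers while reloading |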
Compatibility And Constraints
Activation offloading
- `pipeline_model_parallel_size` must be 1
- `recompute_granularity` must be `None`
- Cannot combine with `fine_grained_activation_offloading`
- Cannot combine with CUDA graphs
- `cpu_offloading_num_layers` must be in `[0, num_layers - 1]` (inclusive)
Optimizer offloading
- Requires `use_distributed_optimizer = True` (default in most recipes)
- No PP, recompute, or CUDA graph restrictions
- `optimizer_offload_fraction` must be in `[0.0, 1.0]`
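The constraints in this section can be checked before launching a job. The following is a minimal sketch, not the real validation code: `OffloadSettings` is a hypothetical stand-in that only borrows the field names used in this document, and treating any `cuda_graph_impl` other than `"none"` as "CUDA graphs enabled" is an assumption. The layer-count bound follows the MCore check quoted under Code Anchors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OffloadSettings:
    """Illustrative stand-in for the relevant config fields (not the real config class)."""
    num_layers: int
    pipeline_model_parallel_size: int = 1
    recompute_granularity: Optional[str] = None
    cuda_graph_impl: str = "none"
    fine_grained_activation_offloading: bool = False
    cpu_offloading: bool = False
    cpu_offloading_num_layers: int = 0
    optimizer_cpu_offload: bool = False
    optimizer_offload_fraction: float = 0.0
    use_distributed_optimizer: bool = True


def find_violations(s: OffloadSettings) -> List[str]:
    """Return human-readable constraint violations (empty list if none)."""
    problems: List[str] = []
    if s.cpu_offloading:
        if s.pipeline_model_parallel_size != 1:
            problems.append("pipeline_model_parallel_size must be 1")
        if s.recompute_granularity is not None:
            problems.append("recompute_granularity must be None")
        if s.fine_grained_activation_offloading:
            problems.append("cannot combine with fine_grained_activation_offloading")
        if s.cuda_graph_impl != "none":
            problems.append("cannot combine with CUDA graphs")
        if not 0 <= s.cpu_offloading_num_layers < s.num_layers:
            problems.append("cpu_offloading_num_layers out of range")
    if s.optimizer_cpu_offload:
        if not s.use_distributed_optimizer:
            problems.append("use_distributed_optimizer must be True")
        if not 0.0 <= s.optimizer_offload_fraction <= 1.0:
            problems.append("optimizer_offload_fraction must be in [0.0, 1.0]")
    return problems
```

Returning a list rather than raising on the first problem lets a pre-flight script report every conflict at once.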
Practical: large MoE models
Activation offloading is blocked for Qwen3-30B-A3B and similar large MoE
models. The PP=1 constraint means each GPU holds all 48 layers; model
weights + optimizer states alone (~70 GB) take up almost all of an H100's
80 GB before any activations are stored.
Minimal Working Config
Optimizer offload (50%, balanced)
```python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 0.5
```
Optimizer offload (100% + overlap, max savings)
```python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 1.0
cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True
```
Activation offload (small model, PP=1)
```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 16
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = False
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
```
Weight offload only (small model, PP=1)
```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 8
cfg.model.cpu_offloading_activations = False
cfg.model.cpu_offloading_weights = True
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
```
Both activations and weights (small model, PP=1)
```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 8
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = True
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
```
Weight offloading and activation offloading share the same constraints (PP=1,
no recompute, no CUDA graphs). Weight offloading has not been tested in
the Qwen3-30B-A3B experiments; the measured results cover optimizer
offloading only.
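The three layer-level variants above differ only in the layer count and the two activation/weight flags. A small helper can make that explicit. This is a sketch under assumptions: `layer_offload_settings` and `apply_to_model_cfg` are hypothetical names, and the `SimpleNamespace` object merely stands in for a real recipe config.

```python
from types import SimpleNamespace
from typing import Any, Dict


def layer_offload_settings(num_layers: int, activations: bool, weights: bool) -> Dict[str, Any]:
    """Build the cfg.model fields for one layer-level offloading variant."""
    if not (activations or weights):
        raise ValueError("enable at least one of activations or weights")
    return {
        "cpu_offloading": True,
        "cpu_offloading_num_layers": num_layers,
        "cpu_offloading_activations": activations,
        "cpu_offloading_weights": weights,
        # Shared constraints: PP must be 1 and recompute must be off.
        "pipeline_model_parallel_size": 1,
        "recompute_granularity": None,
    }


def apply_to_model_cfg(cfg: Any, settings: Dict[str, Any]) -> Any:
    """Copy each setting onto cfg.model and return cfg."""
    for name, value in settings.items():
        setattr(cfg.model, name, value)
    return cfg


# Stand-in config object; a real run would start from a recipe config instead.
cfg = SimpleNamespace(model=SimpleNamespace())
apply_to_model_cfg(cfg, layer_offload_settings(8, activations=False, weights=True))
```

After this call, `cfg.model` carries the same six fields as the "Weight offload only" config.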
Minimal Runnable Command
```bash
uv run python scripts/training/run_recipe.py \
  --recipe qwen3_30b_a3b_pretrain_config \
  optimizer.optimizer_cpu_offload=True \
  optimizer.optimizer_offload_fraction=0.5 \
  train.train_iters=20 \
  train.global_batch_size=8 \
  train.micro_batch_size=1
```
Verification
Unit tests
```bash
uv run python -m pytest \
  tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py \
  tests/unit_tests/peft/test_utils.py \
  -k "cpu_offload" -q
```
Success criteria
- Config validation passes for the selected offloading mode
- Training completes without OOM or NCCL errors
- Loss matches the non-offloaded baseline (max delta < 0.001)
- Memory usage drops proportionally to offload fraction
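The loss and memory criteria can be checked mechanically once per-iteration losses and peak memory have been collected from both runs. The sketch below assumes those values are already available as plain Python numbers; the function names are hypothetical, and the memory check only tests for a decrease, not for proportionality.

```python
from typing import Sequence


def max_abs_loss_delta(baseline: Sequence[float], offloaded: Sequence[float]) -> float:
    """Largest per-iteration absolute loss difference between two runs."""
    if not baseline or len(baseline) != len(offloaded):
        raise ValueError("loss curves must be non-empty and of equal length")
    return max(abs(b - o) for b, o in zip(baseline, offloaded))


def passes_success_criteria(
    baseline_loss: Sequence[float],
    offloaded_loss: Sequence[float],
    baseline_mem_gb: float,
    offloaded_mem_gb: float,
    tol: float = 1e-3,
) -> bool:
    """True if loss matches within tol and peak memory went down."""
    loss_ok = max_abs_loss_delta(baseline_loss, offloaded_loss) < tol
    memory_ok = offloaded_mem_gb < baseline_mem_gb
    return loss_ok and memory_ok
```

The default `tol` mirrors the "max delta < 0.001" criterion; any numbers passed in come from the reader's own runs.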
Code Anchors
MCore activation offload constraints
1296
```python
if self.cpu_offloading and (
    self.cpu_offloading_num_layers < 0 or self.cpu_offloading_num_layers >= self.num_layers
):
    raise ValueError(...)
if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
    raise ValueError(
        "Currently there is no support for Pipeline parallelism with CPU offloading"
    )
if self.cpu_offloading and self.recompute_granularity is not None:
    raise ValueError(
        "CPU offloading does not work when activation recomputation is enabled"
    )
```
MCore CUDA graph incompatibility
1943
```python
if self.cpu_offloading:
    raise ValueError("CUDA graphs not supported with CPU offloading.")
```
MCore fine-grained offloading mutual exclusion
1427
```python
if self.fine_grained_activation_offloading:
    assert (
        not self.cpu_offloading
    ), "fine_grained_activation_offloading cannot be enabled with cpu_offloading."
```
MCore HybridDeviceOptimizer instantiation
480
```python
if config.optimizer_cpu_offload:
    # ... setup cpu/gpu optimizer classes ...
    optimizer = HybridDeviceOptimizer(
        param_groups,
        offload_fraction=config.optimizer_offload_fraction,
        cpu_optimizer_cls=cpu_optimizer_cls,
        gpu_optimizer_cls=gpu_optimizer_cls,
        overlap_cpu_optimizer_d2h_h2d=config.overlap_cpu_optimizer_d2h_h2d,
        pin_cpu_grads=config.pin_cpu_grads,
        pin_cpu_params=config.pin_cpu_params,
    )
```
Bridge CUDA graph guard
232
```python
assert not config.cpu_offloading and config.recompute_granularity is None, "Cudagraphs not supported"
```
Bridge activation offloading in PEFT
621
```python
if self.config.cpu_offloading and self.config.cpu_offloading_activations:
    x.activation_offloading = True
x, _ = self.linear_in(x)
x = self.activation(x)
if self.config.cpu_offloading and self.config.cpu_offloading_activations:
    x.activation_offloading = True
x, _ = self.linear_out(x)
```
MCore model_parallel_config fields
3rdparty
```python
cpu_offloading: bool = False
cpu_offloading_num_layers: int = 0
cpu_offloading_activations: bool = True
cpu_offloading_weights: bool = False
cpu_offloading_double_buffering: bool = False
cpu_offloading_retain_pinned_cpu_buffers: bool = False
```
MCore optimizer offload config
3rdparty
```python
optimizer_cpu_offload: bool = False
optimizer_offload_fraction: float = 0.0
use_torch_optimizer_for_cpu_offload: bool = False
overlap_cpu_optimizer_d2h_h2d: bool = False
```
Failure Diagnosis
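| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| `Currently there is no support for Pipeline parallelism with CPU offloading` | Activation offload + PP > 1 | Check `pipeline_model_parallel_size` | Set PP=1 or use optimizer offloading |
| `CPU offloading does not work when activation recomputation is enabled` | Activation offload + recompute | Check `recompute_granularity` | Set `recompute_granularity = None` |
| `fine_grained_activation_offloading cannot be enabled with cpu_offloading.` | Both offloading modes enabled | Check both flags | Use one or the other |
| `CUDA graphs not supported with CPU offloading.` | CUDA graphs + activation offload | Check `cuda_graph_impl` | Set `cuda_graph_impl = "none"` |
| OOM with activation offloading | Model too large for PP=1 | Check allocated memory vs 80 GB | Use optimizer offloading with PP > 1 |
| Extreme slowdown (>4x) | 100% optimizer offload, CPU Adam bottleneck | Compare iter time at different fractions | Reduce fraction or enable `overlap_cpu_optimizer_d2h_h2d` |
| OOM at partial optimizer offload | Insufficient offload for this config | Check memory at different fractions | Increase fraction or add PP |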
Known Limitations
- Activation offloading requires PP=1, making it impractical for large models (30B+ MoE) that need pipeline parallelism.
- Optimizer offloading throughput penalty scales roughly linearly with the offload fraction (~1.9x at 25%, ~4.2x at 100% for Qwen3-30B-A3B).
- D2H/H2D overlap provides only ~7% speedup because CPU Adam compute is the dominant bottleneck.
- `fine_grained_activation_offloading` is a separate module-level approach that works with PP > 1 but cannot be combined with layer-level `cpu_offloading`.
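To get a rough feel for intermediate fractions, the two slowdown figures quoted above can be linearly interpolated. This is an estimate under the stated linear-scaling assumption, not a measurement; `estimate_slowdown` is a hypothetical helper, and the result only applies to the Qwen3-30B-A3B setup those two figures come from.

```python
from typing import Tuple

# The two figures quoted in Known Limitations: (offload fraction, slowdown factor).
MEASURED: Tuple[Tuple[float, float], Tuple[float, float]] = ((0.25, 1.9), (1.0, 4.2))


def estimate_slowdown(fraction: float) -> float:
    """Linearly interpolate the slowdown factor between the two quoted points."""
    (f0, s0), (f1, s1) = MEASURED
    if not f0 <= fraction <= f1:
        raise ValueError("fraction is outside the quoted range; refusing to extrapolate")
    return s0 + (s1 - s0) * (fraction - f0) / (f1 - f0)
```

Under this assumption a 50% fraction lands at roughly 2.7x, which should be confirmed by timing an actual run.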