perf-cpu-offloading

CPU Offloading

References

Stable docs: @docs/training/cpu-offloading.md
Structured metadata: @skills/perf-cpu-offloading/card.yaml

What It Is

Two independent mechanisms to move data from GPU to CPU memory:

Mechanism	Config namespace	What gets offloaded	PP restriction
Activation offloading	`model.cpu_offloading*`	Activations (and optionally weights) per transformer layer	PP must be 1
Optimizer offloading	`optimizer.optimizer_cpu_offload`	Adam optimizer states (momentum + variance) via `HybridDeviceOptimizer`	None

Quick Decision

Situation	Recommendation
Large MoE model (30B+), needs PP > 1	Optimizer offloading; activation offloading requires PP=1, so it is unavailable
Small/medium model, PP=1 fits, activation memory dominates	Activation offloading
Want tunable memory-speed tradeoff	Optimizer offloading with fractional `optimizer_offload_fraction`
Throughput is top priority	Don't enable either; offloading always adds overhead
CUDA graphs are needed	Optimizer offloading only; activation offloading is incompatible
Memory pressure is moderate	Optimizer offload at 25–50% fraction for best efficiency
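
The table above can be read as a small decision procedure. The sketch below is one possible encoding of it; the `Situation` fields, the function name, and the order in which the rows are checked are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Inputs to the decision table; all field names are hypothetical."""
    needs_pp_gt_1: bool = False
    needs_cuda_graphs: bool = False
    throughput_is_top_priority: bool = False
    activation_memory_dominates: bool = False


def recommend_offloading(s: Situation) -> str:
    """Return 'none', 'optimizer', or 'activation' following the table."""
    if s.throughput_is_top_priority:
        return "none"  # offloading always adds overhead
    if s.needs_pp_gt_1 or s.needs_cuda_graphs:
        return "optimizer"  # activation offloading needs PP=1 and no CUDA graphs
    if s.activation_memory_dominates:
        return "activation"
    return "optimizer"  # tunable through optimizer_offload_fraction


print(recommend_offloading(Situation(needs_pp_gt_1=True)))
```

Checking throughput first is a judgment call: it treats "don't enable" as overriding every other row.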

Enablement

Optimizer CPU offloading (recommended for large models)

```python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 1.0
cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True
```

CLI overrides:

```bash
optimizer.optimizer_cpu_offload=True \
optimizer.optimizer_offload_fraction=0.5 \
optimizer.overlap_cpu_optimizer_d2h_h2d=True
```

Activation CPU offloading (small/medium models only)

```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 16
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = False

# Settings that activation offloading requires (see Compatibility And Constraints)
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
cfg.model.cuda_graph_impl = "none"
```

Config Parameter Reference

Optimizer offloading

Parameter	Default	Description
`optimizer_cpu_offload`	`False`	Master switch
`optimizer_offload_fraction`	`0.0`	Fraction of optimizer states on CPU (0.0–1.0)
`overlap_cpu_optimizer_d2h_h2d`	`False`	Overlap GPU↔CPU transfers with compute
`use_torch_optimizer_for_cpu_offload`	`False`	Use `torch.optim` instead of fused optimizer for CPU portion
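
To get a feel for what `optimizer_offload_fraction` buys, here is a back-of-the-envelope sketch. It assumes the offloaded state is exactly Adam's momentum and variance stored in fp32 (4 bytes each), and that the fraction applies uniformly to the parameters a rank owns. Master weights, gradients, and allocator overhead are ignored, and the function names are hypothetical, so treat the result as a rough estimate only.

```python
GIB = 1024 ** 3


def adam_state_bytes(params_on_rank: int, bytes_per_value: int = 4) -> int:
    """Momentum + variance: two values per parameter (fp32 assumed by default)."""
    return 2 * params_on_rank * bytes_per_value


def gpu_bytes_freed(params_on_rank: int, offload_fraction: float) -> float:
    """Estimated GPU bytes moved to CPU for a given offload fraction."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be in [0.0, 1.0]")
    return offload_fraction * adam_state_bytes(params_on_rank)


# Hypothetical shard of 1 billion parameters owned by this rank.
print(f"{gpu_bytes_freed(1_000_000_000, 0.5) / GIB:.2f} GiB")
```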

Activation offloading

Parameter	Default	Description
`cpu_offloading`	`False`	Master switch
`cpu_offloading_num_layers`	`0`	Number of transformer layers to offload (0 to num_layers-1)
`cpu_offloading_activations`	`True`	Offload activations
`cpu_offloading_weights`	`False`	Offload weights
`cpu_offloading_double_buffering`	`False`	Double-buffer across layers while reloading

Compatibility And Constraints

Activation offloading

`pipeline_model_parallel_size` must be 1
`recompute_granularity` must be `None`
Cannot combine with `fine_grained_activation_offloading`
Cannot combine with CUDA graphs
`cpu_offloading_num_layers` must be in `[0, num_layers)`, that is, 0 to `num_layers-1` inclusive
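
These constraints can be checked before launching a job. The sketch below is a hypothetical standalone pre-flight helper, not the MCore validation itself: it reads the same field names from any config-like object and collects every violation instead of raising on the first. The upper bound on `cpu_offloading_num_layers` follows the MCore check quoted under Code Anchors, which rejects values `>= num_layers`. Treating any `cuda_graph_impl` other than `"none"` as "CUDA graphs enabled" is an assumption.

```python
from types import SimpleNamespace


def check_activation_offload(model_cfg, num_layers: int) -> list[str]:
    """Return a list of constraint violations (empty when the config is fine)."""
    if not getattr(model_cfg, "cpu_offloading", False):
        return []
    problems = []
    if getattr(model_cfg, "pipeline_model_parallel_size", 1) != 1:
        problems.append("pipeline_model_parallel_size must be 1")
    if getattr(model_cfg, "recompute_granularity", None) is not None:
        problems.append("recompute_granularity must be None")
    if getattr(model_cfg, "fine_grained_activation_offloading", False):
        problems.append("cannot combine with fine_grained_activation_offloading")
    if getattr(model_cfg, "cuda_graph_impl", "none") != "none":
        problems.append("cannot combine with CUDA graphs")
    n = getattr(model_cfg, "cpu_offloading_num_layers", 0)
    if not 0 <= n < num_layers:
        problems.append(f"cpu_offloading_num_layers must be in [0, {num_layers})")
    return problems


# Stand-in for cfg.model; PP=2 violates the first constraint.
cfg_model = SimpleNamespace(
    cpu_offloading=True,
    cpu_offloading_num_layers=16,
    pipeline_model_parallel_size=2,
    recompute_granularity=None,
    cuda_graph_impl="none",
)
print(check_activation_offload(cfg_model, num_layers=48))
```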

Optimizer offloading

Requires `use_distributed_optimizer = True` (default in most recipes)
No PP, recompute, or CUDA graph restrictions
`optimizer_offload_fraction` must be in `[0.0, 1.0]`
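
The optimizer-side rules are short enough to check in a few lines. This is a minimal sketch with a hypothetical helper name; it assumes `use_distributed_optimizer` is readable from the same object as the offload fields, which may not match where a given recipe stores it.

```python
from types import SimpleNamespace


def check_optimizer_offload(opt_cfg) -> list[str]:
    """Return a list of constraint violations (empty when the config is fine)."""
    if not getattr(opt_cfg, "optimizer_cpu_offload", False):
        return []
    problems = []
    if not getattr(opt_cfg, "use_distributed_optimizer", False):
        problems.append("use_distributed_optimizer must be True")
    fraction = getattr(opt_cfg, "optimizer_offload_fraction", 0.0)
    if not 0.0 <= fraction <= 1.0:
        problems.append("optimizer_offload_fraction must be in [0.0, 1.0]")
    return problems


# Stand-in for cfg.optimizer with an out-of-range fraction.
cfg_optimizer = SimpleNamespace(
    optimizer_cpu_offload=True,
    use_distributed_optimizer=True,
    optimizer_offload_fraction=1.5,
)
print(check_optimizer_offload(cfg_optimizer))
```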

Practical: large MoE models

Activation offloading is effectively blocked for Qwen3-30B-A3B and similar large MoE models. The PP=1 constraint means each GPU holds all 48 layers, and model weights + optimizer states alone (~70 GB) consume most of an H100's 80 GB before any activations are stored.

Minimal Working Config

Optimizer offload (50%, balanced)

```python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 0.5
```

Optimizer offload (100% + overlap, max savings)

```python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 1.0
cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True
```

Activation offload (small model, PP=1)

```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 16
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = False
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
```

Weight offload only (small model, PP=1)

```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 8
cfg.model.cpu_offloading_activations = False
cfg.model.cpu_offloading_weights = True
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
```

Both activations and weights (small model, PP=1)

```python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 8
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = True
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
```

Weight offloading and activation offloading share the same constraints (PP=1, no recompute, no CUDA graphs). Weight offloading was not tested in the Qwen3-30B-A3B experiments; the measured results cover optimizer offloading only.

Minimal Runnable Command

```bash
uv run python scripts/training/run_recipe.py \
  --recipe qwen3_30b_a3b_pretrain_config \
  optimizer.optimizer_cpu_offload=True \
  optimizer.optimizer_offload_fraction=0.5 \
  train.train_iters=20 \
  train.global_batch_size=8 \
  train.micro_batch_size=1
```
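
To compare memory and iteration time across offload fractions, the same command can be generated once per fraction. The sketch below only builds and prints the argument lists; the helper name and the chosen fractions are illustrative, and the script path, recipe name, and overrides are copied from the command above.

```python
import shlex

BASE = [
    "uv", "run", "python", "scripts/training/run_recipe.py",
    "--recipe", "qwen3_30b_a3b_pretrain_config",
]
COMMON = [
    "train.train_iters=20",
    "train.global_batch_size=8",
    "train.micro_batch_size=1",
]


def sweep_commands(fractions):
    """Build one argv list per offload fraction; 0.0 is the non-offloaded baseline."""
    commands = []
    for f in fractions:
        if not 0.0 <= f <= 1.0:
            raise ValueError(f"fraction {f} is outside [0.0, 1.0]")
        overrides = []
        if f > 0.0:
            overrides = [
                "optimizer.optimizer_cpu_offload=True",
                f"optimizer.optimizer_offload_fraction={f}",
            ]
        commands.append(BASE + overrides + COMMON)
    return commands


for argv in sweep_commands([0.0, 0.25, 0.5, 1.0]):
    # To execute instead of print: subprocess.run(argv, check=True)
    print(shlex.join(argv))
```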

Verification

Unit tests

```bash
uv run python -m pytest \
  tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py -k "cpu_offload" \
  tests/unit_tests/peft/test_utils.py -k "cpu_offload" -q
```

Success criteria

Config validation passes for the selected offloading mode
Training completes without OOM or NCCL errors
Loss matches the non-offloaded baseline (max delta < 0.001)
Memory usage drops proportionally to offload fraction
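
The loss criterion can be checked mechanically once per-step losses from both runs are available. This is a minimal sketch: the function names are hypothetical, how the losses are extracted from logs is left out, and the numbers in the example are placeholders rather than measured results.

```python
def max_loss_delta(baseline, offloaded):
    """Largest absolute per-step difference between two loss curves."""
    if not baseline or len(baseline) != len(offloaded):
        raise ValueError("loss curves must be non-empty and the same length")
    return max(abs(b - o) for b, o in zip(baseline, offloaded))


def losses_match(baseline, offloaded, tol=1e-3):
    """True when the max delta is below the tolerance from the criteria above."""
    return max_loss_delta(baseline, offloaded) < tol


# Placeholder values for illustration only.
baseline = [10.50, 9.80, 9.10]
offloaded = [10.5002, 9.7997, 9.1001]
print(losses_match(baseline, offloaded))
```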

Code Anchors

MCore activation offload constraints

1296

        if self.cpu_offloading and (
            self.cpu_offloading_num_layers < 0 or self.cpu_offloading_num_layers >= self.num_layers
        ):
            raise ValueError(...)

        if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
            raise ValueError(
                "Currently there is no support for Pipeline parallelism with CPU offloading"
            )

        if self.cpu_offloading and self.recompute_granularity is not None:
            raise ValueError(
                "CPU offloading does not work when activation recomputation is enabled"
            )

MCore CUDA graph incompatibility

1943

            if self.cpu_offloading:
                raise ValueError("CUDA graphs not supported with CPU offloading.")

MCore fine-grained offloading mutual exclusion

1427

        if self.fine_grained_activation_offloading:
            assert (
                not self.cpu_offloading
            ), "fine_grained_activation_offloading cannot be enabled with cpu_offloading."

MCore HybridDeviceOptimizer instantiation

480

        if config.optimizer_cpu_offload:
            # ... setup cpu/gpu optimizer classes ...
            optimizer = HybridDeviceOptimizer(
                param_groups,
                offload_fraction=config.optimizer_offload_fraction,
                cpu_optimizer_cls=cpu_optimizer_cls,
                gpu_optimizer_cls=gpu_optimizer_cls,
                overlap_cpu_optimizer_d2h_h2d=config.overlap_cpu_optimizer_d2h_h2d,
                pin_cpu_grads=config.pin_cpu_grads,
                pin_cpu_params=config.pin_cpu_params,
            )

Bridge CUDA graph guard

232

        assert not config.cpu_offloading and config.recompute_granularity is None, "Cudagraphs not supported"

Bridge activation offloading in PEFT

621

        if self.config.cpu_offloading and self.config.cpu_offloading_activations:
            x.activation_offloading = True
        x, _ = self.linear_in(x)
        x = self.activation(x)
        if self.config.cpu_offloading and self.config.cpu_offloading_activations:
            x.activation_offloading = True
        x, _ = self.linear_out(x)

MCore model_parallel_config fields

3rdparty

    cpu_offloading: bool = False
    cpu_offloading_num_layers: int = 0
    cpu_offloading_activations: bool = True
    cpu_offloading_weights: bool = False
    cpu_offloading_double_buffering: bool = False
    cpu_offloading_retain_pinned_cpu_buffers: bool = False

MCore optimizer offload config

3rdparty

    optimizer_cpu_offload: bool = False
    optimizer_offload_fraction: float = 0.0
    use_torch_optimizer_for_cpu_offload: bool = False
    overlap_cpu_optimizer_d2h_h2d: bool = False

Failure Diagnosis

Symptom	Likely Cause	How To Confirm	Fix
`Currently there is no support for Pipeline parallelism with CPU offloading`	Activation offload + PP > 1	Check `pipeline_model_parallel_size`	Set PP=1 or use optimizer offloading
`CPU offloading does not work when activation recomputation is enabled`	Activation offload + recompute	Check `recompute_granularity`	Set `recompute_granularity=null`
`fine_grained_activation_offloading cannot be enabled with cpu_offloading`	Both offloading modes enabled	Check both flags	Use one or the other
`CUDA graphs not supported with CPU offloading`	CUDA graphs + activation offload	Check `cuda_graph_impl`	Set `cuda_graph_impl="none"`
OOM with activation offloading	Model too large for PP=1	Check allocated memory vs 80 GB	Use optimizer offloading with PP > 1
Extreme slowdown (>4x)	100% optimizer offload, CPU Adam bottleneck	Compare iter time at different fractions	Reduce fraction or enable `overlap_cpu_optimizer_d2h_h2d`
OOM at partial optimizer offload	Insufficient offload for this config	Check memory at different fractions	Increase fraction or add PP
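
For the rows whose symptom is an exact error message, the lookup can be automated against a captured log. The sketch below is a hypothetical helper that matches the message substrings from the table and returns the corresponding fix. The OOM and slowdown rows are left out because they need measurements rather than string matching.

```python
# (message substring, fix) pairs taken from the table above.
KNOWN_ERRORS = [
    ("no support for Pipeline parallelism with CPU offloading",
     "Set PP=1 or use optimizer offloading"),
    ("CPU offloading does not work when activation recomputation is enabled",
     "Set recompute_granularity=null"),
    ("fine_grained_activation_offloading cannot be enabled with cpu_offloading",
     "Use one offloading mode or the other"),
    ("CUDA graphs not supported with CPU offloading",
     'Set cuda_graph_impl="none"'),
]


def suggest_fix(log_text: str):
    """Return the fix for the first known message found in log_text, else None."""
    for needle, fix in KNOWN_ERRORS:
        if needle in log_text:
            return fix
    return None


print(suggest_fix("ValueError: CUDA graphs not supported with CPU offloading."))
```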

Known Limitations

Activation offloading requires PP=1, making it impractical for large models (30B+ MoE) that need pipeline parallelism.
The slowdown from optimizer offloading grows roughly linearly with the offload fraction (~1.9x at 25%, ~4.2x at 100% for Qwen3-30B-A3B).
D2H/H2D overlap provides only ~7% speedup because CPU Adam compute is the dominant bottleneck.
`fine_grained_activation_offloading` is a separate module-level approach that works with PP > 1 but cannot be combined with layer-level `cpu_offloading`.
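
If a rough slowdown figure is needed for a fraction between the two quoted points, one option is to interpolate linearly between them. This is a sketch under a strong assumption: it takes the two Qwen3-30B-A3B figures above at face value and assumes linearity in between, so its output is an estimate, not a measurement, and it may not carry over to other models or hardware. The function name is hypothetical.

```python
# The two points quoted above for Qwen3-30B-A3B: (offload fraction, slowdown factor).
MEASURED = [(0.25, 1.9), (1.0, 4.2)]


def estimate_slowdown(fraction: float) -> float:
    """Linearly interpolate between the two quoted points (rough estimate only)."""
    (f0, s0), (f1, s1) = MEASURED
    if not f0 <= fraction <= f1:
        raise ValueError("estimate is only defined between the quoted fractions")
    return s0 + (s1 - s0) * (fraction - f0) / (f1 - f0)


print(f"estimated slowdown at 50%: {estimate_slowdown(0.5):.2f}x")
```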
