perf-cpu-offloading

CPU Offloading

References

  • Stable docs: @docs/training/cpu-offloading.md
  • Structured metadata: @skills/perf-cpu-offloading/card.yaml

What It Is

Two independent mechanisms to move data from GPU to CPU memory:

| Mechanism | Config namespace | What gets offloaded | PP restriction |
|---|---|---|---|
| Activation offloading | `model.cpu_offloading*` | Activations (and optionally weights) per transformer layer | PP must be 1 |
| Optimizer offloading | `optimizer.optimizer_cpu_offload` | Adam optimizer states (momentum + variance) via `HybridDeviceOptimizer` | None |

Quick Decision

| Situation | Recommendation |
|---|---|
| Large MoE model (30B+), needs PP > 1 | Optimizer offloading; activation offloading requires PP=1 |
| Small/medium model, PP=1 fits, activation memory dominates | Activation offloading |
| Want tunable memory-speed tradeoff | Optimizer offloading with fractional `optimizer_offload_fraction` |
| Throughput is top priority | Don't enable; offloading always adds overhead |
| CUDA graphs are needed | Only optimizer offloading; activation offloading is incompatible |
| Memory pressure is moderate | Optimizer offload at 25–50% fraction for best efficiency |
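
The table above can be read as a small decision procedure. The sketch below encodes it in Python. The function name `recommend_offloading` and its boolean arguments are hypothetical and exist only for illustration; they are not part of any recipe or library API.

```python
def recommend_offloading(
    needs_pp: bool,
    needs_cuda_graphs: bool,
    throughput_first: bool,
    activation_memory_dominates: bool,
) -> str:
    """Return a recommendation that mirrors the Quick Decision table.

    Hypothetical helper for illustration; not part of the training code.
    """
    if throughput_first:
        # Offloading always adds overhead, so skip it when speed matters most.
        return "none"
    if needs_pp or needs_cuda_graphs:
        # Activation offloading requires PP=1 and is incompatible with CUDA graphs.
        return "optimizer"
    if activation_memory_dominates:
        return "activation"
    # Otherwise fall back to optimizer offloading with a tunable fraction.
    return "optimizer"


print(recommend_offloading(True, False, False, True))   # optimizer
print(recommend_offloading(False, False, False, True))  # activation
```

The order of the checks matters: the hard incompatibilities (PP > 1, CUDA graphs) are tested before the memory profile, so activation offloading is only suggested when it can actually be enabled.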

Enablement

Optimizer CPU offloading (recommended for large models)

python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 1.0
cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True
CLI overrides:
bash
optimizer.optimizer_cpu_offload=True \
optimizer.optimizer_offload_fraction=0.5 \
optimizer.overlap_cpu_optimizer_d2h_h2d=True

Activation CPU offloading (small/medium models only)

python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 16
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = False

cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
cfg.model.cuda_graph_impl = "none"

Config Parameter Reference

Optimizer offloading

| Parameter | Default | Description |
|---|---|---|
| `optimizer_cpu_offload` | `False` | Master switch |
| `optimizer_offload_fraction` | `0.0` | Fraction of optimizer states on CPU (0.0–1.0) |
| `overlap_cpu_optimizer_d2h_h2d` | `False` | Overlap GPU↔CPU transfers with compute |
| `use_torch_optimizer_for_cpu_offload` | `False` | Use `torch.optim` instead of fused optimizer for CPU portion |
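
To get a feel for what `optimizer_offload_fraction` buys, the following back-of-the-envelope sketch estimates how many bytes of Adam state leave each GPU. It assumes two fp32 states (momentum and variance, 4 bytes each) per parameter and an even split of states across data-parallel ranks by the distributed optimizer. Both are simplifying assumptions, and the function name is hypothetical. Real savings depend on how `HybridDeviceOptimizer` partitions parameters.

```python
def adam_state_bytes_offloaded(
    num_params: int,
    offload_fraction: float,
    data_parallel_size: int = 1,
    bytes_per_state: int = 4,  # assumption: fp32 momentum and variance
) -> float:
    """Estimate per-GPU bytes of Adam state moved to CPU (illustrative only)."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be in [0.0, 1.0]")
    # Two states per parameter: momentum + variance.
    per_gpu_state_bytes = 2 * bytes_per_state * num_params / data_parallel_size
    return offload_fraction * per_gpu_state_bytes


# Example: 1B parameters sharded over 8 data-parallel ranks, 50% offloaded.
gib = adam_state_bytes_offloaded(1_000_000_000, 0.5, data_parallel_size=8) / 2**30
print(f"{gib:.2f} GiB moved to CPU per GPU")
```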

Activation offloading

| Parameter | Default | Description |
|---|---|---|
| `cpu_offloading` | `False` | Master switch |
| `cpu_offloading_num_layers` | `0` | Number of transformer layers to offload (0 to num_layers-1) |
| `cpu_offloading_activations` | `True` | Offload activations |
| `cpu_offloading_weights` | `False` | Offload weights |
| `cpu_offloading_double_buffering` | `False` | Double-buffer across layers while reloading |

Compatibility And Constraints

Activation offloading

  • `pipeline_model_parallel_size` must be 1
  • `recompute_granularity` must be `None`
  • Cannot combine with `fine_grained_activation_offloading`
  • Cannot combine with CUDA graphs
  • `cpu_offloading_num_layers` must be in `[0, num_layers-1]` (the MCore check rejects values >= `num_layers`)
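
The constraints above can be checked before launching a run. The sketch below is a stand-alone approximation: `OffloadSettings` and `check_activation_offload` are hypothetical names, the field names follow the config keys listed in this document, and the upper bound on `cpu_offloading_num_layers` follows the MCore check quoted under Code Anchors (values >= `num_layers` are rejected). The authoritative validation lives in MCore.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OffloadSettings:
    """Hypothetical stand-in for the relevant cfg.model fields."""
    num_layers: int
    cpu_offloading: bool = False
    cpu_offloading_num_layers: int = 0
    pipeline_model_parallel_size: int = 1
    recompute_granularity: Optional[str] = None
    cuda_graph_impl: str = "none"
    fine_grained_activation_offloading: bool = False


def check_activation_offload(s: OffloadSettings) -> List[str]:
    """Return a list of violated constraints (empty if the settings look valid)."""
    problems: List[str] = []
    if not s.cpu_offloading:
        return problems  # nothing to check when the master switch is off
    if s.pipeline_model_parallel_size != 1:
        problems.append("pipeline_model_parallel_size must be 1")
    if s.recompute_granularity is not None:
        problems.append("recompute_granularity must be None")
    if s.fine_grained_activation_offloading:
        problems.append("cannot combine with fine_grained_activation_offloading")
    if s.cuda_graph_impl != "none":
        problems.append("cannot combine with CUDA graphs")
    if not 0 <= s.cpu_offloading_num_layers < s.num_layers:
        problems.append("cpu_offloading_num_layers out of range")
    return problems


ok = OffloadSettings(num_layers=32, cpu_offloading=True, cpu_offloading_num_layers=16)
bad = OffloadSettings(num_layers=32, cpu_offloading=True, pipeline_model_parallel_size=2)
print(check_activation_offload(ok))   # []
print(check_activation_offload(bad))  # ['pipeline_model_parallel_size must be 1']
```

Collecting all violations in a list, rather than raising on the first one, makes it easier to fix a config in a single pass.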

Optimizer offloading

  • Requires `use_distributed_optimizer = True` (default in most recipes)
  • No PP, recompute, or CUDA graph restrictions
  • `optimizer_offload_fraction` must be in `[0.0, 1.0]`

Practical: large MoE models

Activation offloading is impractical for Qwen3-30B-A3B and similar large MoE models. The PP=1 constraint means each GPU holds all 48 layers, and model weights plus optimizer states alone (~70 GB) leave almost no headroom on an 80 GB H100.

Minimal Working Config

Optimizer offload (50%, balanced)

python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 0.5

Optimizer offload (100% + overlap, max savings)

python
cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 1.0
cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True

Activation offload (small model, PP=1)

python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 16
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = False
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None

Weight offload only (small model, PP=1)

python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 8
cfg.model.cpu_offloading_activations = False
cfg.model.cpu_offloading_weights = True
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None

Both activations and weights (small model, PP=1)

python
cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 8
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = True
cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
Weight offloading and activation offloading share the same constraints (PP=1, no recompute, no CUDA graphs). Weight offloading was not tested in the Qwen3-30B-A3B experiments; the measured results cover optimizer offloading only.

Minimal Runnable Command

bash
uv run python scripts/training/run_recipe.py \
  --recipe qwen3_30b_a3b_pretrain_config \
  optimizer.optimizer_cpu_offload=True \
  optimizer.optimizer_offload_fraction=0.5 \
  train.train_iters=20 \
  train.global_batch_size=8 \
  train.micro_batch_size=1
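
When comparing memory and iteration time across offload fractions, it can help to generate the override lists programmatically. The sketch below only builds argument lists for the command shown above. The helper name `sweep_commands` and the chosen fractions are illustrative assumptions, and nothing is executed.

```python
import shlex
from typing import List


def sweep_commands(fractions: List[float]) -> List[List[str]]:
    """Build one run_recipe.py argument list per offload fraction (illustrative)."""
    base = [
        "uv", "run", "python", "scripts/training/run_recipe.py",
        "--recipe", "qwen3_30b_a3b_pretrain_config",
    ]
    commands = []
    for fraction in fractions:
        overrides = [
            "optimizer.optimizer_cpu_offload=True",
            f"optimizer.optimizer_offload_fraction={fraction}",
            "train.train_iters=20",
            "train.global_batch_size=8",
            "train.micro_batch_size=1",
        ]
        commands.append(base + overrides)
    return commands


for cmd in sweep_commands([0.25, 0.5, 1.0]):
    # Print only; pass `cmd` to subprocess.run to actually launch a run.
    print(shlex.join(cmd))
```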

Verification

Unit tests

bash
uv run python -m pytest \
  tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py \
  tests/unit_tests/peft/test_utils.py \
  -k "cpu_offload" -q

Success criteria

  • Config validation passes for the selected offloading mode
  • Training completes without OOM or NCCL errors
  • Loss matches the non-offloaded baseline (max delta < 0.001)
  • Memory usage drops proportionally to offload fraction
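
The loss criterion above can be checked mechanically once per-iteration losses from both runs are available. The sketch assumes the losses have already been parsed into two equal-length lists. The function name and the sample values are made up for illustration.

```python
from typing import Sequence


def max_loss_delta(baseline: Sequence[float], offloaded: Sequence[float]) -> float:
    """Largest absolute per-iteration difference between two loss curves."""
    if not baseline or len(baseline) != len(offloaded):
        raise ValueError("loss curves must be non-empty and cover the same iterations")
    return max(abs(b - o) for b, o in zip(baseline, offloaded))


# Made-up sample values, not measured results.
baseline_losses = [10.80, 10.42, 10.11]
offloaded_losses = [10.80, 10.4203, 10.1098]

delta = max_loss_delta(baseline_losses, offloaded_losses)
assert delta < 0.001, f"loss diverged from baseline: max delta {delta:.6f}"
print(f"max delta {delta:.6f}")
```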

Code Anchors

MCore activation offload constraints

1296
        if self.cpu_offloading and (
            self.cpu_offloading_num_layers < 0 or self.cpu_offloading_num_layers >= self.num_layers
        ):
            raise ValueError(...)

        if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
            raise ValueError(
                "Currently there is no support for Pipeline parallelism with CPU offloading"
            )

        if self.cpu_offloading and self.recompute_granularity is not None:
            raise ValueError(
                "CPU offloading does not work when activation recomputation is enabled"
            )

MCore CUDA graph incompatibility

1943
            if self.cpu_offloading:
                raise ValueError("CUDA graphs not supported with CPU offloading.")

MCore fine-grained offloading mutual exclusion

1427
        if self.fine_grained_activation_offloading:
            assert (
                not self.cpu_offloading
            ), "fine_grained_activation_offloading cannot be enabled with cpu_offloading."

MCore HybridDeviceOptimizer instantiation

480
        if config.optimizer_cpu_offload:
            # ... setup cpu/gpu optimizer classes ...
            optimizer = HybridDeviceOptimizer(
                param_groups,
                offload_fraction=config.optimizer_offload_fraction,
                cpu_optimizer_cls=cpu_optimizer_cls,
                gpu_optimizer_cls=gpu_optimizer_cls,
                overlap_cpu_optimizer_d2h_h2d=config.overlap_cpu_optimizer_d2h_h2d,
                pin_cpu_grads=config.pin_cpu_grads,
                pin_cpu_params=config.pin_cpu_params,
            )
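
To illustrate what an offload fraction means at the parameter level, here is a conceptual toy that assigns parameters to the CPU until roughly the requested share of elements is covered. It is not the `HybridDeviceOptimizer` algorithm; the function name, the greedy rule, and the sample sizes are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def split_by_fraction(
    param_sizes: Dict[str, int], offload_fraction: float
) -> Tuple[List[str], List[str]]:
    """Greedily assign parameters to CPU while staying within the element budget.

    Conceptual toy only; not the HybridDeviceOptimizer implementation.
    """
    budget = offload_fraction * sum(param_sizes.values())
    cpu: List[str] = []
    gpu: List[str] = []
    used = 0
    for name, size in param_sizes.items():
        if used + size <= budget:
            cpu.append(name)
            used += size
        else:
            gpu.append(name)
    return cpu, gpu


sizes = {"embed": 400, "layer0": 300, "layer1": 300}
print(split_by_fraction(sizes, 0.5))  # (['embed'], ['layer0', 'layer1'])
```

Because whole parameters are assigned to one device, the share that actually lands on the CPU in this toy can fall below the requested fraction.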

Bridge CUDA graph guard

232
        assert not config.cpu_offloading and config.recompute_granularity is None, "Cudagraphs not supported"

Bridge activation offloading in PEFT

621
        if self.config.cpu_offloading and self.config.cpu_offloading_activations:
            x.activation_offloading = True
        x, _ = self.linear_in(x)
        x = self.activation(x)
        if self.config.cpu_offloading and self.config.cpu_offloading_activations:
            x.activation_offloading = True
        x, _ = self.linear_out(x)

MCore model_parallel_config fields

3rdparty
    cpu_offloading: bool = False
    cpu_offloading_num_layers: int = 0
    cpu_offloading_activations: bool = True
    cpu_offloading_weights: bool = False
    cpu_offloading_double_buffering: bool = False
    cpu_offloading_retain_pinned_cpu_buffers: bool = False

MCore optimizer offload config

3rdparty
    optimizer_cpu_offload: bool = False
    optimizer_offload_fraction: float = 0.0
    use_torch_optimizer_for_cpu_offload: bool = False
    overlap_cpu_optimizer_d2h_h2d: bool = False

Failure Diagnosis

| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| `Currently there is no support for Pipeline parallelism with CPU offloading` | Activation offload + PP > 1 | Check `pipeline_model_parallel_size` | Set PP=1 or use optimizer offloading |
| `CPU offloading does not work when activation recomputation is enabled` | Activation offload + recompute | Check `recompute_granularity` | Set `recompute_granularity=null` |
| `fine_grained_activation_offloading cannot be enabled with cpu_offloading` | Both offloading modes enabled | Check both flags | Use one or the other |
| `CUDA graphs not supported with CPU offloading` | CUDA graphs + activation offload | Check `cuda_graph_impl` | Set `cuda_graph_impl="none"` |
| OOM with activation offloading | Model too large for PP=1 | Check allocated memory vs 80 GB | Use optimizer offloading with PP > 1 |
| Extreme slowdown (>4x) | 100% optimizer offload, CPU Adam bottleneck | Compare iter time at different fractions | Reduce fraction or enable `overlap_cpu_optimizer_d2h_h2d` |
| OOM at partial optimizer offload | Insufficient offload for this config | Check memory at different fractions | Increase fraction or add PP |
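
For the four symptoms that are literal error messages, the lookup in the table above is easy to automate when scanning a training log. The sketch below is a hypothetical helper (`suggest_fix` and `KNOWN_FIXES` are not part of any library). The error substrings and fixes are copied from the table.

```python
from typing import Optional

# Error substrings copied from the failure table, mapped to the suggested fix.
KNOWN_FIXES = {
    "no support for Pipeline parallelism with CPU offloading":
        "Set PP=1 or use optimizer offloading",
    "CPU offloading does not work when activation recomputation is enabled":
        "Set recompute_granularity=null",
    "fine_grained_activation_offloading cannot be enabled with cpu_offloading":
        "Use one offloading mode or the other",
    "CUDA graphs not supported with CPU offloading":
        'Set cuda_graph_impl="none"',
}


def suggest_fix(log_text: str) -> Optional[str]:
    """Return the fix for the first known error found in `log_text`, else None."""
    for needle, fix in KNOWN_FIXES.items():
        if needle in log_text:
            return fix
    return None


print(suggest_fix("ValueError: CUDA graphs not supported with CPU offloading."))
```

The OOM and slowdown rows are left out on purpose: they have no fixed error string and need the memory or timing comparisons described in the table.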

Known Limitations

  • Activation offloading requires PP=1, making it impractical for large models (30B+ MoE) that need pipeline parallelism.
  • The optimizer offloading slowdown grows roughly linearly with the offload fraction (~1.9x at 25%, ~4.2x at 100% for Qwen3-30B-A3B).
  • D2H/H2D overlap provides only ~7% speedup because CPU Adam compute is the dominant bottleneck.
  • `fine_grained_activation_offloading` is a separate module-level approach that works with PP > 1 but cannot be combined with layer-level `cpu_offloading`.
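
As a rough planning aid, the two slowdown figures reported above for Qwen3-30B-A3B can be interpolated to guess the cost of an intermediate fraction. The sketch below uses piecewise-linear interpolation and adds the no-offload baseline (1.0x at 0%) as a third point. The function name is hypothetical, and the result is an estimate, not a measurement.

```python
# Reported points for Qwen3-30B-A3B: (offload fraction, slowdown factor).
# The (0.0, 1.0) point is the no-offload baseline by definition.
POINTS = [(0.0, 1.0), (0.25, 1.9), (1.0, 4.2)]


def estimate_slowdown(fraction: float) -> float:
    """Piecewise-linear interpolation between the reported points (rough estimate)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0.0, 1.0]")
    for (x0, y0), (x1, y1) in zip(POINTS, POINTS[1:]):
        if x0 <= fraction <= x1:
            return y0 + (y1 - y0) * (fraction - x0) / (x1 - x0)
    raise AssertionError("unreachable: POINTS covers [0.0, 1.0]")


print(f"{estimate_slowdown(0.5):.2f}x")
```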