Validate and use CPU offloading in Megatron Bridge, including layer-level activation offloading and fractional optimizer state offloading with HybridDeviceOptimizer.
    npx skill4agent add nvidia/skills perf-cpu-offloading

| Mechanism | Config namespace | What gets offloaded | PP restriction |
|---|---|---|---|
| Activation offloading | `cfg.model` | Activations (and optionally weights) per transformer layer | PP must be 1 |
| Optimizer offloading | `cfg.optimizer` | Adam optimizer states (momentum + variance) via `HybridDeviceOptimizer` | None |

| Situation | Recommendation |
|---|---|
| Large MoE model (30B+), needs PP > 1 | Optimizer offloading; activation offloading requires PP=1, so it is unavailable here |
| Small/medium model, PP=1 fits, activation memory dominates | Activation offloading |
| Want tunable memory-speed tradeoff | Optimizer offloading with a fractional `optimizer_offload_fraction` |
| Throughput is top priority | Don't enable offloading; it always adds overhead |
| CUDA graphs are needed | Optimizer offloading only; activation offloading is incompatible with CUDA graphs |
| Memory pressure is moderate | Optimizer offload at 25–50% fraction for best efficiency |
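
The decision table above can be encoded as a small helper. This is a minimal sketch: `Situation` and `recommend_offloading` are hypothetical names introduced for illustration and are not part of Megatron Bridge. The fallback to optimizer offloading when no row matches is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical description of a training setup (not a Megatron Bridge type)."""

    pipeline_parallel_size: int = 1
    needs_cuda_graphs: bool = False
    throughput_is_top_priority: bool = False
    activation_memory_dominates: bool = False


def recommend_offloading(s: Situation) -> str:
    """Return "none", "optimizer", or "activation", following the table rows."""
    if s.throughput_is_top_priority:
        # Offloading always adds overhead, so skip it entirely.
        return "none"
    if s.pipeline_parallel_size > 1 or s.needs_cuda_graphs:
        # Activation offloading needs PP=1 and is incompatible with CUDA graphs.
        return "optimizer"
    if s.activation_memory_dominates:
        return "activation"
    # Assumed default: optimizer offloading with a tunable fraction.
    return "optimizer"


print(recommend_offloading(Situation(pipeline_parallel_size=4)))
print(recommend_offloading(Situation(activation_memory_dominates=True)))
```

The checks are ordered so that hard incompatibilities (PP > 1, CUDA graphs) are evaluated before the preference for activation offloading.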
Optimizer offloading, Python config:

    cfg.optimizer.optimizer_cpu_offload = True
    cfg.optimizer.optimizer_offload_fraction = 1.0
    cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True

Optimizer offloading, CLI overrides:

    optimizer.optimizer_cpu_offload=True \
    optimizer.optimizer_offload_fraction=0.5 \
    optimizer.overlap_cpu_optimizer_d2h_h2d=True

Activation offloading, Python config:

    cfg.model.cpu_offloading = True
    cfg.model.cpu_offloading_num_layers = 16
    cfg.model.cpu_offloading_activations = True
    cfg.model.cpu_offloading_weights = False
    cfg.model.pipeline_model_parallel_size = 1
    cfg.model.recompute_granularity = None
    cfg.model.cuda_graph_impl = "none"

| Parameter | Default | Description |
|---|---|---|
| `optimizer_cpu_offload` | `False` | Master switch |
| `optimizer_offload_fraction` | `0.0` | Fraction of optimizer states on CPU (0.0–1.0) |
| `overlap_cpu_optimizer_d2h_h2d` | `False` | Overlap GPU↔CPU transfers with compute |
| `use_torch_optimizer_for_cpu_offload` | `False` | Use a `torch.optim` optimizer for the CPU-offloaded states |

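
To get a feel for what `optimizer_offload_fraction` buys, the arithmetic below estimates how many bytes of Adam state stay on each GPU. It is a rough sketch under stated assumptions: two fp32 states per parameter (momentum and variance, 4 bytes each), states sharded evenly across `dp_size` data-parallel ranks, and an exact split by fraction. The function name and its arguments are hypothetical, and real memory use will differ because the optimizer splits at parameter granularity.

```python
def adam_state_bytes_on_gpu(
    num_params: int,
    offload_fraction: float,
    dp_size: int = 1,
    bytes_per_state: int = 4,
) -> float:
    """Estimate per-rank Adam state bytes left on the GPU (assumptions in the text)."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be in [0.0, 1.0]")
    if dp_size < 1:
        raise ValueError("dp_size must be >= 1")
    # Two states per parameter: momentum and variance.
    total_per_rank = 2 * bytes_per_state * num_params / dp_size
    return total_per_rank * (1.0 - offload_fraction)


# 30B parameters, 8 data-parallel ranks, half of the states offloaded.
gb = adam_state_bytes_on_gpu(30_000_000_000, 0.5, dp_size=8) / 1e9
print(f"{gb:.1f} GB of Adam state per GPU")  # prints "15.0 GB of Adam state per GPU"
```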
| Parameter | Default | Description |
|---|---|---|
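| `cpu_offloading` | `False` | Master switch |
| `cpu_offloading_num_layers` | `0` | Number of transformer layers to offload (0 to num_layers-1) |
| `cpu_offloading_activations` | `True` | Offload activations |
| `cpu_offloading_weights` | `False` | Offload weights |
| `cpu_offloading_double_buffering` | `False` | Double-buffer across layers while reloading |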

Constraints:

- Activation offloading requires `pipeline_model_parallel_size` = 1.
- Activation offloading requires `recompute_granularity` = `None`.
- `cpu_offloading` cannot be combined with `fine_grained_activation_offloading`.
- `cpu_offloading_num_layers` must be in `[0, num_layers)`.
- Optimizer offloading expects `use_distributed_optimizer = True`.
- `optimizer_offload_fraction` must be in `[0.0, 1.0]`.

Optimizer offloading, 50% of states on CPU:

    cfg.optimizer.optimizer_cpu_offload = True
    cfg.optimizer.optimizer_offload_fraction = 0.5

Optimizer offloading, 100% of states on CPU with transfer overlap:

    cfg.optimizer.optimizer_cpu_offload = True
    cfg.optimizer.optimizer_offload_fraction = 1.0
    cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True

Activation offloading only (16 layers):

    cfg.model.cpu_offloading = True
    cfg.model.cpu_offloading_num_layers = 16
    cfg.model.cpu_offloading_activations = True
    cfg.model.cpu_offloading_weights = False
    cfg.model.pipeline_model_parallel_size = 1
    cfg.model.recompute_granularity = None

Weight offloading only (8 layers):

    cfg.model.cpu_offloading = True
    cfg.model.cpu_offloading_num_layers = 8
    cfg.model.cpu_offloading_activations = False
    cfg.model.cpu_offloading_weights = True
    cfg.model.pipeline_model_parallel_size = 1
    cfg.model.recompute_granularity = None

Activations and weights together (8 layers):

    cfg.model.cpu_offloading = True
    cfg.model.cpu_offloading_num_layers = 8
    cfg.model.cpu_offloading_activations = True
    cfg.model.cpu_offloading_weights = True
    cfg.model.pipeline_model_parallel_size = 1
    cfg.model.recompute_granularity = None

Short validation run (20 iterations) with 50% optimizer offloading:

    uv run python scripts/training/run_recipe.py \
      --recipe qwen3_30b_a3b_pretrain_config \
      optimizer.optimizer_cpu_offload=True \
      optimizer.optimizer_offload_fraction=0.5 \
      train.train_iters=20 \
      train.global_batch_size=8 \
      train.micro_batch_size=1

Unit tests:

    uv run python -m pytest \
      tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py -k "cpu_offload" \
      tests/unit_tests/peft/test_utils.py -k "cpu_offload" -q

Activation offloading validation checks:

    if self.cpu_offloading and (
        self.cpu_offloading_num_layers < 0 or self.cpu_offloading_num_layers >= self.num_layers
    ):
        raise ValueError(...)
    if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
        raise ValueError(
            "Currently there is no support for Pipeline parallelism with CPU offloading"
        )
    if self.cpu_offloading and self.recompute_granularity is not None:
        raise ValueError(
            "CPU offloading does not work when activation recomputation is enabled"
        )

CUDA graph incompatibility:

    if self.cpu_offloading:
        raise ValueError("CUDA graphs not supported with CPU offloading.")

Mutual exclusion with fine-grained activation offloading:

    if self.fine_grained_activation_offloading:
        assert (
            not self.cpu_offloading
        ), "fine_grained_activation_offloading cannot be enabled with cpu_offloading."

Optimizer construction with `HybridDeviceOptimizer`:

    if config.optimizer_cpu_offload:
        # ... setup cpu/gpu optimizer classes ...
        optimizer = HybridDeviceOptimizer(
            param_groups,
            offload_fraction=config.optimizer_offload_fraction,
            cpu_optimizer_cls=cpu_optimizer_cls,
            gpu_optimizer_cls=gpu_optimizer_cls,
            overlap_cpu_optimizer_d2h_h2d=config.overlap_cpu_optimizer_d2h_h2d,
            pin_cpu_grads=config.pin_cpu_grads,
            pin_cpu_params=config.pin_cpu_params,
        )

CUDA graph assertion:

    assert not config.cpu_offloading and config.recompute_granularity is None, "Cudagraphs not supported"

Tagging activations for offload around the linear layers:

    if self.config.cpu_offloading and self.config.cpu_offloading_activations:
        x.activation_offloading = True
    x, _ = self.linear_in(x)
    x = self.activation(x)
    if self.config.cpu_offloading and self.config.cpu_offloading_activations:
        x.activation_offloading = True
    x, _ = self.linear_out(x)

Activation offloading config defaults:

    cpu_offloading: bool = False
    cpu_offloading_num_layers: int = 0
    cpu_offloading_activations: bool = True
    cpu_offloading_weights: bool = False
    cpu_offloading_double_buffering: bool = False
    cpu_offloading_retain_pinned_cpu_buffers: bool = False

Optimizer offloading config defaults:

    optimizer_cpu_offload: bool = False
    optimizer_offload_fraction: float = 0.0
    use_torch_optimizer_for_cpu_offload: bool = False
    overlap_cpu_optimizer_d2h_h2d: bool = False

| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
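| `ValueError: Currently there is no support for Pipeline parallelism with CPU offloading` | Activation offload + PP > 1 | Check `pipeline_model_parallel_size` | Set PP=1 or use optimizer offloading |
| `ValueError: CPU offloading does not work when activation recomputation is enabled` | Activation offload + recompute | Check `recompute_granularity` | Set `recompute_granularity = None` |
| `AssertionError: fine_grained_activation_offloading cannot be enabled with cpu_offloading.` | Both offloading modes enabled | Check both flags | Use one or the other |
| `ValueError: CUDA graphs not supported with CPU offloading.` | CUDA graphs + activation offload | Check `cuda_graph_impl` | Set `cuda_graph_impl = "none"` |
| OOM with activation offloading | Model too large for PP=1 | Check allocated memory vs 80 GB | Use optimizer offloading with PP > 1 |
| Extreme slowdown (>4x) | 100% optimizer offload, CPU Adam bottleneck | Compare iter time at different fractions | Reduce fraction or enable `overlap_cpu_optimizer_d2h_h2d` |
| OOM at partial optimizer offload | Insufficient offload for this config | Check memory at different fractions | Increase fraction or add PP |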
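
Several rows above suggest comparing iteration time or memory at different offload fractions. One way to prepare such a sweep is to generate one launch command per fraction. The sketch below reuses the `run_recipe.py` invocation and override names quoted in this document. `sweep_commands` and the chosen fractions are hypothetical, and the commands are only printed, not executed.

```python
import shlex

# Base command taken from the validation run shown in this document.
BASE_CMD = [
    "uv", "run", "python", "scripts/training/run_recipe.py",
    "--recipe", "qwen3_30b_a3b_pretrain_config",
    "train.train_iters=20",
    "train.global_batch_size=8",
    "train.micro_batch_size=1",
]


def sweep_commands(fractions=(0.25, 0.5, 0.75, 1.0), overlap=True):
    """Build one shell command string per optimizer offload fraction."""
    commands = []
    for fraction in fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"fraction {fraction} is outside [0.0, 1.0]")
        overrides = [
            "optimizer.optimizer_cpu_offload=True",
            f"optimizer.optimizer_offload_fraction={fraction}",
        ]
        if overlap:
            overrides.append("optimizer.overlap_cpu_optimizer_d2h_h2d=True")
        commands.append(shlex.join(BASE_CMD + overrides))
    return commands


for command in sweep_commands():
    print(command)
```

Running each command and recording iteration time and peak memory per fraction gives the comparison the table asks for.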

`fine_grained_activation_offloading` and `cpu_offloading` are mutually exclusive.
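
Most of the configuration failures listed above can be caught before a job is launched. The following is a minimal pre-flight sketch that mirrors the checks quoted in this document. `OffloadSettings` and `preflight` are hypothetical plain-Python stand-ins, not Megatron Bridge or Megatron Core classes, and the sketch collects messages instead of raising.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OffloadSettings:
    """Hypothetical mirror of the relevant config fields (illustration only)."""

    num_layers: int
    cpu_offloading: bool = False
    cpu_offloading_num_layers: int = 0
    pipeline_model_parallel_size: int = 1
    recompute_granularity: Optional[str] = None
    cuda_graph_impl: str = "none"
    fine_grained_activation_offloading: bool = False
    optimizer_cpu_offload: bool = False
    optimizer_offload_fraction: float = 0.0


def preflight(s: OffloadSettings) -> List[str]:
    """Return a list of human-readable problems; an empty list means no conflict found."""
    problems = []
    if s.cpu_offloading:
        if not 0 <= s.cpu_offloading_num_layers < s.num_layers:
            problems.append("cpu_offloading_num_layers must be in [0, num_layers)")
        if s.pipeline_model_parallel_size > 1:
            problems.append("activation offloading requires PP=1")
        if s.recompute_granularity is not None:
            problems.append("activation offloading requires recompute_granularity=None")
        if s.cuda_graph_impl != "none":
            problems.append("activation offloading is incompatible with CUDA graphs")
        if s.fine_grained_activation_offloading:
            problems.append("cpu_offloading conflicts with fine_grained_activation_offloading")
    if s.optimizer_cpu_offload and not 0.0 <= s.optimizer_offload_fraction <= 1.0:
        problems.append("optimizer_offload_fraction must be in [0.0, 1.0]")
    return problems


settings = OffloadSettings(num_layers=32, cpu_offloading=True,
                           cpu_offloading_num_layers=16,
                           pipeline_model_parallel_size=2)
print(preflight(settings))
```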