Validate and use selective and full activation recompute in Megatron Bridge to reduce GPU memory usage at the cost of extra compute.
npx skill4agent add nvidia/skills perf-activation-recompute

| Granularity | What you specify | What gets recomputed | Memory savings | Compute cost |
|---|---|---|---|---|
| `selective` | `recompute_modules` | specific submodules within each layer | moderate (module-dependent) | low to high |
| `full` | `recompute_method`, `recompute_num_layers` | entire transformer layers (N layers) | strongest | highest |
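The pairing of granularity and settings in the table can be sketched as a small pre-flight check. This is a minimal illustration, not Megatron Bridge's own validation: `check_recompute_settings` is a hypothetical helper, and the rules it encodes (full recompute needs a method and a layer count, selective recompute does not take a layer count) are assumptions drawn from the examples in this section.

```python
from typing import List, Optional


def check_recompute_settings(
    granularity: Optional[str],
    method: Optional[str] = None,
    num_layers: Optional[int] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the settings look consistent.

    Hypothetical helper: the rules below are assumptions, not the library's checks.
    """
    problems: List[str] = []
    if granularity is None:
        return problems  # recompute disabled, nothing to check
    if granularity == "selective":
        # Selective recompute is driven by recompute_modules, not a layer count.
        if num_layers is not None:
            problems.append("selective: recompute_num_layers applies to full recompute only")
    elif granularity == "full":
        # Full recompute is driven by recompute_method and recompute_num_layers.
        if method not in ("uniform", "block"):
            problems.append("full: set recompute_method to 'uniform' or 'block'")
        if num_layers is None:
            problems.append("full: set recompute_num_layers")
    else:
        problems.append(f"unknown recompute_granularity: {granularity!r}")
    return problems


print(check_recompute_settings("full", method="uniform", num_layers=4))
print(check_recompute_settings("full"))
```

Running a check like this before launching a job surfaces a missing `recompute_method` or `recompute_num_layers` without waiting for model construction.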
`recompute_num_layers`, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, `recompute_granularity=selective`, `recompute_modules=[core_attn]`, `layernorm`, `mlp`, `recompute_granularity=full`, `cpu_offloading=True`

cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn"]

cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn", "layernorm"]  # or ["mlp"] or ["mlp", "core_attn"]

cfg.model.recompute_granularity = "full"
cfg.model.recompute_method = "uniform"
cfg.model.recompute_num_layers = 4

| Module | What it recomputes | Compute cost | Memory savings |
|---|---|---|---|
| `core_attn` | attention softmax/dropout/QKV dot product | low (Flash Attention already recomputes internally) | moderate |
| `layernorm` | layer normalization | negligible (~0%) | negligible |
| `mlp` | full FFN block | high (~16% on Llama3 70B, hidden=28672) | ~3 GB |
| `moe` | MoE expert dispatch | varies | varies |
| `moe_act` | MoE activation functions | low | small |
| `shared_experts` | shared expert layers | moderate | moderate |
| `mla_up_proj` | Multi-Latent Attention up projection | moderate | moderate |
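The compute-cost column above can be turned into a quick lookup that flags which modules in a proposed `recompute_modules` list deserve a throughput measurement before adoption. This is a sketch only: `modules_to_measure` is a hypothetical helper, the cost labels are transcribed from the table, and the module names are assumed to match the `recompute_modules` option strings.

```python
from typing import Dict, List

# Qualitative compute cost per module, transcribed from the table above.
# Module names are assumed to match the recompute_modules option strings.
MODULE_COMPUTE_COST: Dict[str, str] = {
    "core_attn": "low",
    "layernorm": "negligible",
    "mlp": "high",
    "moe": "varies",
    "moe_act": "low",
    "shared_experts": "moderate",
    "mla_up_proj": "moderate",
}

# Costs treated as safe to enable without a dedicated benchmark (an assumption).
CHEAP_COSTS = {"negligible", "low"}


def modules_to_measure(modules: List[str]) -> List[str]:
    """Return the modules whose recompute cost is not clearly cheap.

    Unknown module names are also returned so they get a closer look.
    """
    return [m for m in modules if MODULE_COMPUTE_COST.get(m) not in CHEAP_COSTS]


print(modules_to_measure(["core_attn", "layernorm"]))
print(modules_to_measure(["mlp", "core_attn"]))
```

The first call flags nothing, while the second singles out `mlp`, which is the module the table lists with a high compute cost.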
python scripts/performance/run_performance_workload.py \
--recompute_granularity selective \
--recompute_modules core_attn layernorm \
...

`recompute_granularity=selective`, `recompute_modules`, `recompute_granularity=full`, `recompute_method`, `recompute_num_layers`, `recompute_granularity="full"`, `recompute_num_layers`, `attn`, `mlp`, `moe_router`, `LLAMA3_70B_SFT_CONFIG_H100_FP8_CS_V1`, `cuda_graph_impl="transformer_engine"`, `cuda_graph_scope="mlp"`, `recompute_granularity="selective"`, `recompute_modules`, `cuda_graph_impl="none"`, `cuda_graph_impl="local"`, `cuda_graph_scope="full_iteration"`, `distribute_saved_activations=True`, `sequence_parallel=True`, `mlp`, `core_attn`, `mlp`

| Experiment | recompute_modules | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
|---|---|---|---|---|---|
| Baseline | [core_attn] | ~704 | -0.8% | 58.8 (OOM rank0) | OOM |
| Exp 1 | [mlp] | 593.6 | -16.4% | 55.6 | Perf regression |
| Exp 2 | [mlp, core_attn] | 586.8 | -17.3% | 55.6 | Perf regression |
| Exp 3 | [core_attn, layernorm] | ~702 | -1.1% | 59.6 (OOM rank0) | OOM |
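The throughput and memory columns can be cross-checked with a little arithmetic. The sketch below back-computes the golden throughput implied by each row and the peak-memory difference between the baseline and `mlp` recompute. The golden value itself is not stated in the table, so the result is an inferred figure, and the baseline and Exp 3 throughputs are approximate ("~") in the source.

```python
# (TFLOP/s/GPU, fractional change vs golden, peak memory in GB), from the table above.
experiments = {
    "Baseline": (704.0, -0.008, 58.8),
    "Exp 1": (593.6, -0.164, 55.6),
    "Exp 2": (586.8, -0.173, 55.6),
    "Exp 3": (702.0, -0.011, 59.6),
}


def implied_golden(tflops: float, delta: float) -> float:
    """Invert 'tflops = golden * (1 + delta)' to recover the golden throughput."""
    return tflops / (1.0 + delta)


golden_estimates = {
    name: implied_golden(tflops, delta)
    for name, (tflops, delta, _peak) in experiments.items()
}
for name, golden in golden_estimates.items():
    print(f"{name}: implied golden ~ {golden:.1f} TFLOP/s/GPU")

# Peak-memory difference between core_attn-only (baseline) and mlp recompute (Exp 1).
mlp_saving_gb = experiments["Baseline"][2] - experiments["Exp 1"][2]
print(f"mlp recompute lowers peak memory by {mlp_saving_gb:.1f} GB")
```

All four rows imply a golden throughput of roughly 710 TFLOP/s/GPU, so the percentages are mutually consistent. The 3.2 GB difference agrees with the "~3 GB" listed for `mlp`, with the caveat that the baseline peak was recorded on a run that hit OOM.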
`layernorm`, `mlp`, `mlp`, `core_attn`, `mlp`, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`

# 3rdparty/Megatron-LM/megatron/core/transformer/transformer_block.py
# _checkpointed_forward() applies selective recompute based on recompute_modules

# 3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py
# Validates recompute_granularity, recompute_method, recompute_num_layers

# Memory saving (recompute & offloading)
cfg.model.recompute_granularity = None
cfg.model.recompute_modules = None
cfg.model.fine_grained_activation_offloading = False
cfg.model.offload_modules = None

if self.recompute_granularity:
    if self.recompute_granularity != "selective":
        assert self.cuda_graph_scope == [
            CudaGraphScope.full_iteration
        ], "full recompute is only supported with full iteration CUDA graph."

if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
    raise ValueError(
        "Currently there is no support for Pipeline parallelism with CPU offloading"
    )

| Symptom | Cause | Confirm | Fix |
|---|---|---|---|
| >15% GPU utilization drop | mlp recompute on large FFN | check | check |
| Still OOM after adding layernorm | layernorm activations are too small | compare peak memory before/after | add mlp recompute or check |
| | layer-level recompute (`recompute_granularity=full`) | check | use submodule recompute (`recompute_granularity=selective`) |
| ValueError: PP + CPU offloading | | check PP config | disable CPU offloading or set PP=1 |
| mlp+core_attn worse than mlp alone | double recompute overhead | compare Exp 1 vs Exp 2 | use mlp alone |
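Two of the rows above correspond to hard failures (the full-recompute CUDA graph assertion and the pipeline-parallel CPU offloading `ValueError`), and both can be caught before a job is launched. The sketch below mirrors those two checks on a plain dataclass. `RecomputeSettings` and `preflight` are hypothetical names, the scope is modeled as a list of strings rather than the `CudaGraphScope` enum, and applying the CUDA graph check only when `cuda_graph_impl` is not `"none"` is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecomputeSettings:
    """Hypothetical stand-in for the relevant fields of the model config."""

    recompute_granularity: Optional[str] = None
    cuda_graph_impl: str = "none"
    cuda_graph_scope: List[str] = field(default_factory=list)
    cpu_offloading: bool = False
    pipeline_model_parallel_size: int = 1


def preflight(s: RecomputeSettings) -> List[str]:
    """Return human-readable issues; an empty list means no known conflict."""
    issues: List[str] = []
    graphs_enabled = s.cuda_graph_impl != "none"  # assumption: check applies only with CUDA graphs
    if (
        graphs_enabled
        and s.recompute_granularity is not None
        and s.recompute_granularity != "selective"
        and s.cuda_graph_scope != ["full_iteration"]
    ):
        issues.append(
            "full recompute needs a full-iteration CUDA graph: "
            "switch to selective recompute or change cuda_graph_scope"
        )
    if s.cpu_offloading and s.pipeline_model_parallel_size > 1:
        issues.append(
            "CPU offloading is not supported with pipeline parallelism: "
            "disable cpu_offloading or set PP=1"
        )
    return issues


risky = RecomputeSettings(
    recompute_granularity="full",
    cuda_graph_impl="transformer_engine",
    cuda_graph_scope=["mlp"],
    cpu_offloading=True,
    pipeline_model_parallel_size=2,
)
for issue in preflight(risky):
    print(issue)
```

Returning a list instead of raising lets a launcher report every conflict at once, which is convenient when several settings are changed together.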
`layernorm`

uv run python -m pytest \
tests/unit_tests/training/test_config.py -k "recompute" -q