nemo-mbridge-perf-activation-recompute
Activation Recompute
Stable docs: @docs/training/activation-recomputation.md
Card: @skills/nemo-mbridge-perf-activation-recompute/card.yaml
What It Is
Activation recompute trades GPU compute for memory by discarding intermediate
activations during the forward pass and recomputing them during backward.
Megatron Bridge supports two granularities:
| Granularity | What you specify | What gets recomputed | Memory savings | Compute cost |
|---|---|---|---|---|
| `selective` | `recompute_modules` | specific submodules within each layer | moderate (module-dependent) | low to high |
| `full` | `recompute_num_layers` | entire transformer layers (N layers) | strongest | highest |
Note: MCore names these "selective" (submodule-level) vs "full" (layer-level).
"Full" means recomputing full layers, not the full model; you still choose
how many layers via `recompute_num_layers`.
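To make the trade concrete, here is a toy sketch in plain Python. It is not MCore's implementation: the scalar `tanh` "layers" and all function names are illustrative assumptions. A chain of layers is differentiated once while saving every layer input, and once while saving only the input of each group of `chunk` layers and rebuilding the rest during backward. `chunk` is loosely analogous to `recompute_num_layers`.

```python
import math


def layer_forward(w, x):
    """One toy 'layer': y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_backward(w, x_in, grad_out):
    """Gradient w.r.t. the layer input; it needs the layer's input x_in."""
    y = math.tanh(w * x_in)
    return grad_out * w * (1.0 - y * y)


def grad_save_all(weights, x):
    """Baseline: keep every layer input alive until backward."""
    saved = []
    for w in weights:
        saved.append(x)
        x = layer_forward(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(saved)):
        grad = layer_backward(w, x_in, grad)
    return grad, len(saved)


def grad_with_recompute(weights, x, chunk):
    """Keep only each chunk's input; rebuild the other inputs during backward."""
    starts = list(range(0, len(weights), chunk))
    checkpoints = []
    for s in starts:
        checkpoints.append(x)                 # the only value kept for this chunk
        for w in weights[s:s + chunk]:
            x = layer_forward(w, x)           # intermediate inputs are discarded
    grad, extra_forwards, peak_saved = 1.0, 0, 0
    for i in range(len(starts) - 1, -1, -1):
        ws = weights[starts[i]:starts[i] + chunk]
        x_in, inputs = checkpoints[i], []
        for w in ws:                          # recompute this chunk's forward pass
            inputs.append(x_in)
            x_in = layer_forward(w, x_in)
            extra_forwards += 1
        # Live values: checkpoints of chunks 0..i plus the rebuilt inputs
        # (inputs[0] is the same value as checkpoints[i], so count it once).
        peak_saved = max(peak_saved, i + len(inputs))
        for w, xi in zip(reversed(ws), reversed(inputs)):
            grad = layer_backward(w, xi, grad)
    return grad, peak_saved, extra_forwards


weights = [0.5 + 0.1 * i for i in range(16)]
g_ref, saved_ref = grad_save_all(weights, 0.3)
g_rc, peak_rc, extra = grad_with_recompute(weights, 0.3, chunk=4)
print(math.isclose(g_ref, g_rc), saved_ref, peak_rc, extra)  # True 16 7 16
```

In this toy setting the gradient is unchanged, the peak number of live layer inputs drops from 16 to 7, and the price is 16 extra layer forward calls.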
Quick Decision
- Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first: most borderline OOMs are caused by memory fragmentation, not capacity. This fixes it at zero cost. See @skills/nemo-mbridge-perf-memory-tuning/SKILL.md.
- Start with `recompute_granularity=selective`, `recompute_modules=[core_attn]` (often already the default in recipes).
- Add `layernorm` to recompute modules: nearly free compute-wise but saves negligible memory. Only helps in extremely borderline cases.
- Add `mlp` as a last resort: saves ~3 GB but costs ~16% GPU utilization on large dense models (Llama3 70B).
- Use `recompute_granularity=full` only when selective recompute still does not fit.
CPU offloading (`cpu_offloading=True`) is an alternative that avoids recompute
cost entirely, but it is incompatible with PP > 1.
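The escalation order above can be written down as data, which makes it easy to step through during an OOM investigation. This is a minimal sketch: `ESCALATION_STEPS`, `enable_expandable_segments`, and `apply_step` are hypothetical helpers, not part of Megatron Bridge, and the `mlp` step uses `["mlp"]` alone because the measured results in this document show that combining it with `core_attn` is slightly worse. The allocator variable generally has to be in the environment before PyTorch initializes CUDA.

```python
import os
from types import SimpleNamespace

# Cheapest first, mirroring the order of the list above.
ESCALATION_STEPS = [
    {"recompute_granularity": "selective", "recompute_modules": ["core_attn"]},
    {"recompute_granularity": "selective", "recompute_modules": ["core_attn", "layernorm"]},
    {"recompute_granularity": "selective", "recompute_modules": ["mlp"]},
    # recompute_num_layers=4 is a placeholder value, not a recommendation.
    {"recompute_granularity": "full", "recompute_method": "uniform", "recompute_num_layers": 4},
]


def enable_expandable_segments(env=None):
    """Step 0: set the allocator option, leaving any existing value untouched."""
    env = os.environ if env is None else env
    env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    return env["PYTORCH_CUDA_ALLOC_CONF"]


def apply_step(model_cfg, step_index):
    """Copy one step's overrides onto a config object (cfg.model in a recipe)."""
    for key, value in ESCALATION_STEPS[step_index].items():
        setattr(model_cfg, key, value)
    return model_cfg


# Usage with a stand-in object; a real run would pass cfg.model instead.
model_cfg = apply_step(SimpleNamespace(), 0)
print(model_cfg.recompute_granularity, model_cfg.recompute_modules)
```

Keeping the steps in one ordered list makes it harder to jump straight to full-layer recompute before the cheaper options have been tried.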
Enablement
Selective recompute (default for most recipes)
```python
cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn"]
```

Selective recompute with additional modules
```python
cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn", "layernorm"]  # or ["mlp"] or ["mlp", "core_attn"]
```

Full-layer recompute
```python
cfg.model.recompute_granularity = "full"
cfg.model.recompute_method = "uniform"
cfg.model.recompute_num_layers = 4
```

Available recompute_modules
| Module | What it recomputes | Compute cost | Memory savings |
|---|---|---|---|
| `core_attn` | attention softmax/dropout/QKV dot product | low (Flash Attention already recomputes internally) | moderate |
| `layernorm` | layer normalization | negligible (~0%) | negligible |
| `mlp` | full FFN block | high (~16% on Llama3 70B, hidden=28672) | ~3 GB |
| `moe` | MoE expert dispatch | varies | varies |
| `moe_act` | MoE activation functions | low | small |
| `shared_experts` | shared expert layers | moderate | moderate |
| `mla_up_proj` | Multi-Latent Attention up projection | moderate | moderate |
Performance harness CLI
```bash
python scripts/performance/run_performance_workload.py \
  --recompute_granularity selective \
  --recompute_modules core_attn layernorm \
  ...
```

Compatibility and Constraints
- `recompute_granularity=selective` requires a non-empty `recompute_modules` list
- `recompute_granularity=full` requires `recompute_method` and `recompute_num_layers`
- Layer-level recompute (`recompute_granularity="full"` + `recompute_num_layers`) is incompatible with TE-scoped CUDA graphs. MCore calls this "full" granularity: the name refers to recomputing full transformer layers, not the full model. Even though you select how many layers to recompute, MCore treats it differently from submodule recompute. Any TE-scoped scope (`attn`, `mlp`, `moe_router`, etc.) will assert. This commonly hits FP8 configs that enable TE-scoped graphs by default (e.g. `LLAMA3_70B_SFT_CONFIG_H100_FP8_CS_V1` sets `cuda_graph_impl="transformer_engine"`, `cuda_graph_scope="mlp"`). Options:
  - use submodule recompute (`recompute_granularity="selective"` + `recompute_modules`), which is compatible with TE-scoped graphs
  - disable CUDA graphs (`cuda_graph_impl="none"`) and use layer-level recompute
  - switch to `cuda_graph_impl="local"`, `cuda_graph_scope="full_iteration"`
- `distribute_saved_activations=True` cannot be combined with `sequence_parallel=True`
- Combining `mlp` + `core_attn` recompute is slightly worse than `mlp` alone due to double recompute overhead
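The constraints above can be checked before launching a job. The following is a minimal sketch of such a pre-flight check over a plain dict. `check_recompute_config` is a hypothetical helper, not an MCore or Megatron Bridge API; the real validation lives in MCore's config code and may cover more cases. The key names follow the settings named in this document.

```python
def check_recompute_config(cfg):
    """Return a list of human-readable problems for a dict of settings."""
    problems = []
    granularity = cfg.get("recompute_granularity")
    modules = cfg.get("recompute_modules") or []

    if granularity == "selective" and not modules:
        problems.append("selective recompute needs a non-empty recompute_modules list")

    if granularity == "full":
        if not cfg.get("recompute_method") or not cfg.get("recompute_num_layers"):
            problems.append("full recompute needs recompute_method and recompute_num_layers")
        # Assumption: TE-scoped graphs are identified by this impl value.
        if cfg.get("cuda_graph_impl") == "transformer_engine":
            problems.append("full recompute is incompatible with TE-scoped CUDA graphs")

    if cfg.get("distribute_saved_activations") and cfg.get("sequence_parallel"):
        problems.append("distribute_saved_activations cannot be combined with sequence_parallel")

    if cfg.get("cpu_offloading") and cfg.get("pipeline_model_parallel_size", 1) > 1:
        problems.append("cpu_offloading is not supported with PP > 1")

    if {"mlp", "core_attn"} <= set(modules):
        problems.append("mlp + core_attn is slightly worse than mlp alone")

    return problems


fp8_like = {
    "recompute_granularity": "full",
    "recompute_method": "uniform",
    "recompute_num_layers": 4,
    "cuda_graph_impl": "transformer_engine",
    "cuda_graph_scope": "mlp",
}
for problem in check_recompute_config(fp8_like):
    print(problem)
```

Returning a list rather than raising on the first problem lets one run surface every conflict in a config at once.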
Measured Results
Llama3 70B SFT on 32x H100 80GB, FP8 (Current Scaling):
- Baseline: TP=4, PP=4, VPP=5, DP=2, MBS=1, GBS=32, seq_len=4096
- Golden GPU utilization: 709.93 TFLOP/s/GPU
- Regression threshold: 5%
| Experiment | recompute_modules | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
|---|---|---|---|---|---|
| Baseline | [core_attn] | ~704 | -0.8% | 58.8 (OOM rank0) | OOM |
| Exp 1 | [mlp] | 593.6 | -16.4% | 55.6 | Perf regression |
| Exp 2 | [mlp, core_attn] | 586.8 | -17.3% | 55.6 | Perf regression |
| Exp 3 | [core_attn, layernorm] | ~702 | -1.1% | 59.6 (OOM rank0) | OOM |
Key takeaways:
- `layernorm` recompute is nearly free compute-wise but saves negligible memory
- `mlp` recompute saves ~3 GB peak but costs ~16% because the Llama3 70B FFN (hidden=28672) is expensive to recompute
- Combining `mlp` + `core_attn` is slightly worse than `mlp` alone
- For this workload, the actual OOM fix was `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` (memory fragmentation, not capacity). See @skills/nemo-mbridge-perf-memory-tuning/SKILL.md.
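The "vs Golden" column and the regression verdicts follow from simple arithmetic, sketched below as a sanity check. The throughput values are copied from the table (the `~704` and `~702` entries are approximate in the source). The helper name and the derived microbatch count are illustrative, and the microbatch formula GBS / (MBS × DP) is the usual convention rather than something the source states.

```python
GOLDEN_TFLOPS = 709.93   # golden TFLOP/s/GPU from the baseline description
THRESHOLD = 0.05         # 5% regression threshold

# TFLOP/s/GPU per experiment, as reported in the table above.
runs = {"Baseline": 704.0, "Exp 1": 593.6, "Exp 2": 586.8, "Exp 3": 702.0}


def delta_vs_golden(tflops, golden=GOLDEN_TFLOPS):
    """Relative change against the golden value (negative means slower)."""
    return (tflops - golden) / golden


for name, tflops in runs.items():
    delta = delta_vs_golden(tflops)
    verdict = "regression" if -delta > THRESHOLD else "within threshold"
    print(f"{name}: {delta:+.1%} ({verdict})")

# Parallel layout check: TP * PP * DP should match the 32 GPUs used.
tp, pp, dp, mbs, gbs = 4, 4, 2, 1, 32
world_size = tp * pp * dp
microbatches_per_step = gbs // (mbs * dp)
print(world_size, microbatches_per_step)
```

This prints -0.8%, -16.4%, -17.3%, and -1.1%, matching the table. Only the two `mlp` experiments cross the 5% threshold; the baseline and Exp 3 are within it on throughput but still fail on memory.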
Code Anchors
Recompute modules enum and selective checkpoint logic
```python
# 3rdparty/Megatron-LM/megatron/core/transformer/transformer_block.py
# _checkpointed_forward() applies selective recompute based on recompute_modules
```

Recompute config validation
```python
# 3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py
# Validates recompute_granularity, recompute_method, recompute_num_layers
```

Llama3 recipe defaults
```python
# Memory saving (recompute & offloading)
cfg.model.recompute_granularity = None
cfg.model.recompute_modules = None
cfg.model.fine_grained_activation_offloading = False
cfg.model.offload_modules = None
```

Full recompute + CUDA graph assertion (MCore)
```python
if self.recompute_granularity:
    if self.recompute_granularity != "selective":
        assert self.cuda_graph_scope == [
            CudaGraphScope.full_iteration
        ], "full recompute is only supported with full iteration CUDA graph."
```

CPU offloading PP incompatibility (MCore)
```python
if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
    raise ValueError(
        "Currently there is no support for Pipeline parallelism with CPU offloading"
    )
```

Failure Diagnosis
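| Symptom | Cause | Confirm | Fix |
|---|---|---|---|
| >15% GPU utilization drop | `mlp` recompute on large FFN | check `recompute_modules` | check `PYTORCH_CUDA_ALLOC_CONF` |
| Still OOM after adding `layernorm` | layernorm activations are too small | compare peak memory before/after | add `mlp` recompute or check `PYTORCH_CUDA_ALLOC_CONF` |
| `AssertionError: full recompute is only supported with full iteration CUDA graph` | layer-level recompute (`recompute_granularity="full"`) with TE-scoped CUDA graphs | check `cuda_graph_impl` and `cuda_graph_scope` | use submodule recompute (`recompute_granularity="selective"`) |
| `ValueError`: PP + CPU offloading | `cpu_offloading=True` with PP > 1 | check PP config | disable CPU offloading or set PP=1 |
| `mlp` + `core_attn` worse than `mlp` alone | double recompute overhead | compare Exp 1 vs Exp 2 | use `mlp` alone |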
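For the two failures that surface as explicit error messages, a log scan can point directly at the fix. This is a small sketch: `KNOWN_ERRORS` and `suggest_fix` are hypothetical names, the message strings are the ones quoted under Code Anchors, and the fixes are the ones listed in this document.

```python
# (message fragment, suggested fix) pairs taken from the sections above.
KNOWN_ERRORS = [
    (
        "full recompute is only supported with full iteration CUDA graph",
        'use recompute_granularity="selective", or set cuda_graph_impl="none", '
        'or use cuda_graph_impl="local" with cuda_graph_scope="full_iteration"',
    ),
    (
        "Currently there is no support for Pipeline parallelism with CPU offloading",
        "disable CPU offloading or set PP=1",
    ),
]


def suggest_fix(log_text):
    """Return the first matching fix for a log excerpt, or None if nothing matches."""
    for fragment, fix in KNOWN_ERRORS:
        if fragment in log_text:
            return fix
    return None


log = "ValueError: Currently there is no support for Pipeline parallelism with CPU offloading"
print(suggest_fix(log))  # disable CPU offloading or set PP=1
```

The throughput and OOM symptoms in the table do not produce a distinctive message, so they still need the manual comparisons listed in the Confirm column.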
Known Limitations
- Per-module memory savings vary significantly by model architecture and hidden dimension
- No automatic module selection — users must choose which modules to recompute
- `layernorm` recompute is almost never worth it as a standalone fix
- CPU offloading (the zero-compute-cost alternative) is blocked when PP > 1
Verification
```bash
uv run python -m pytest \
  tests/unit_tests/training/test_config.py -k "recompute" -q
```

Success criteria:
- Unit tests pass for recompute config validation
- No assertion errors from config validation