nemo-mbridge-perf-activation-recompute

Activation Recompute

Stable docs: @docs/training/activation-recomputation.md
Card: @skills/nemo-mbridge-perf-activation-recompute/card.yaml

What It Is

Activation recompute trades GPU compute for memory by discarding intermediate activations during the forward pass and recomputing them during backward. Megatron Bridge supports two granularities:

| Granularity | What you specify | What gets recomputed | Memory savings | Compute cost |
| --- | --- | --- | --- | --- |
| `selective` | `recompute_modules` list (e.g. `core_attn`, `mlp`) | specific submodules within each layer | moderate (module-dependent) | low to high |
| `full` | `recompute_num_layers` + `recompute_method` | entire transformer layers (N layers) | strongest | highest |

Note: MCore names these "selective" (submodule-level) and "full" (layer-level). "Full" means recomputing full layers, not the full model; you still choose how many layers via `recompute_num_layers`.

Quick Decision

- Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first. Most borderline OOMs are caused by memory fragmentation, not capacity, and this fixes it at zero cost. See @skills/nemo-mbridge-perf-memory-tuning/SKILL.md.
- Start with `recompute_granularity=selective` and `recompute_modules=[core_attn]` (often already the default in recipes).
- Add `layernorm` to the recompute modules. It is nearly free compute-wise but saves negligible memory, so it only helps in extremely borderline cases.
- Add `mlp` as a last resort. It saves ~3 GB but costs ~16% GPU utilization on large dense models (Llama3 70B).
- Use `recompute_granularity=full` only when selective recompute still does not fit.

CPU offloading (`cpu_offloading=True`) is an alternative that avoids recompute cost entirely, but it is incompatible with PP > 1.
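
The escalation order above can be written down as data, which makes it easy to step through one setting at a time when chasing an OOM. This is a minimal sketch, not part of Megatron Bridge: `RECOMPUTE_LADDER` and `next_step` are hypothetical names, and `recompute_num_layers=4` is only a placeholder borrowed from the enablement example. It also assumes the allocator variable is set before the first CUDA allocation, which is simplest to guarantee by setting it before importing `torch`.

```python
import os

# Set the allocator option before any CUDA allocation happens.
# setdefault() keeps a value the user has already exported.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# Hypothetical ladder: cheapest setting first, most expensive last.
RECOMPUTE_LADDER = [
    {"recompute_granularity": "selective", "recompute_modules": ["core_attn"]},
    {"recompute_granularity": "selective", "recompute_modules": ["core_attn", "layernorm"]},
    {"recompute_granularity": "selective", "recompute_modules": ["mlp"]},
    # Placeholder layer count; tune it for the model at hand.
    {"recompute_granularity": "full", "recompute_method": "uniform", "recompute_num_layers": 4},
]


def next_step(current_index):
    """Return the next, more aggressive setting, or None if the ladder is exhausted."""
    nxt = current_index + 1
    return RECOMPUTE_LADDER[nxt] if nxt < len(RECOMPUTE_LADDER) else None
```

Each returned dict can be applied with `setattr(cfg.model, key, value)` for every item, assuming `cfg.model` exposes the attribute names used elsewhere in this document.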

Enablement

Selective recompute (default for most recipes)

```python
cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn"]
```

Selective recompute with additional modules

```python
cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn", "layernorm"]  # or ["mlp"] or ["mlp", "core_attn"]
```

Full-layer recompute

```python
cfg.model.recompute_granularity = "full"
cfg.model.recompute_method = "uniform"
cfg.model.recompute_num_layers = 4
```

Available recompute_modules

| Module | What it recomputes | Compute cost | Memory savings |
| --- | --- | --- | --- |
| `core_attn` | attention softmax/dropout/QKV dot product | low (Flash Attention already recomputes internally) | moderate |
| `layernorm` | layer normalization | negligible (~0%) | negligible |
| `mlp` | full FFN block | high (~16% on Llama3 70B, hidden=28672) | ~3 GB |
| `moe` | MoE expert dispatch | varies | varies |
| `moe_act` | MoE activation functions | low | small |
| `shared_experts` | shared expert layers | moderate | moderate |
| `mla_up_proj` | Multi-Latent Attention up projection | moderate | moderate |

Performance harness CLI

```bash
python scripts/performance/run_performance_workload.py \
  --recompute_granularity selective \
  --recompute_modules core_attn layernorm \
  ...
```

Compatibility and Constraints

- `recompute_granularity=selective` requires a non-empty `recompute_modules` list.
- `recompute_granularity=full` requires `recompute_method` and `recompute_num_layers`.
- Layer-level recompute (`recompute_granularity="full"` + `recompute_num_layers`) is incompatible with TE-scoped CUDA graphs. MCore calls this "full" granularity; the name refers to recomputing full transformer layers, not the full model. Even though you select how many layers to recompute, MCore treats it differently from submodule recompute. Any TE graph scope (`attn`, `mlp`, `moe_router`, etc.) will trigger the assertion. This commonly hits FP8 configs that enable TE-scoped graphs by default (e.g. `LLAMA3_70B_SFT_CONFIG_H100_FP8_CS_V1` sets `cuda_graph_impl="transformer_engine"`, `cuda_graph_scope="mlp"`). Options:
  - use submodule recompute (`recompute_granularity="selective"` + `recompute_modules`), which is compatible with TE-scoped graphs
  - disable CUDA graphs (`cuda_graph_impl="none"`) and use layer-level recompute
  - switch to `cuda_graph_impl="local"`, `cuda_graph_scope="full_iteration"`
- `distribute_saved_activations=True` cannot be combined with `sequence_parallel=True`.
- Combining `mlp` + `core_attn` recompute is slightly worse than `mlp` alone due to double recompute overhead.
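
The constraints above can be checked before launching a job. The following is a rough sketch over a plain dict, not the MCore validation itself: `check_recompute_config` is a hypothetical helper, the key names mirror the config fields named in this document, and the CUDA graph check is simplified to "any `transformer_engine` graph implementation" rather than the exact scope comparison MCore performs.

```python
def check_recompute_config(cfg):
    """Return a list of human-readable constraint violations (empty if none)."""
    problems = []
    granularity = cfg.get("recompute_granularity")

    if granularity == "selective" and not cfg.get("recompute_modules"):
        problems.append("selective requires a non-empty recompute_modules list")

    if granularity == "full":
        if cfg.get("recompute_method") is None or cfg.get("recompute_num_layers") is None:
            problems.append("full requires recompute_method and recompute_num_layers")
        # Simplification: treat every TE graph implementation as TE-scoped.
        if cfg.get("cuda_graph_impl") == "transformer_engine":
            problems.append("full recompute is incompatible with TE-scoped CUDA graphs")

    if cfg.get("distribute_saved_activations") and cfg.get("sequence_parallel"):
        problems.append("distribute_saved_activations cannot be combined with sequence_parallel")

    if cfg.get("cpu_offloading") and cfg.get("pipeline_model_parallel_size", 1) > 1:
        problems.append("cpu_offloading is not supported with PP > 1")

    return problems


# Example: an FP8-style config that switches to layer-level recompute.
issues = check_recompute_config({
    "recompute_granularity": "full",
    "recompute_method": "uniform",
    "recompute_num_layers": 4,
    "cuda_graph_impl": "transformer_engine",
})
print(issues)
```

Returning a list instead of raising lets a launcher report every conflict at once rather than stopping at the first one.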

Measured Results

Llama3 70B SFT on 32x H100 80GB, FP8 (Current Scaling):

Baseline: TP=4, PP=4, VPP=5, DP=2, MBS=1, GBS=32, seq_len=4096
Golden GPU utilization: 709.93 TFLOP/s/GPU
Regression threshold: 5%

| Experiment | recompute_modules | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
| --- | --- | --- | --- | --- | --- |
| Baseline | [core_attn] | ~704 | -0.8% | 58.8 (OOM rank0) | OOM |
| Exp 1 | [mlp] | 593.6 | -16.4% | 55.6 | Perf regression |
| Exp 2 | [mlp, core_attn] | 586.8 | -17.3% | 55.6 | Perf regression |
| Exp 3 | [core_attn, layernorm] | ~702 | -1.1% | 59.6 (OOM rank0) | OOM |

Key takeaways:

- `layernorm` recompute is nearly free compute-wise but saves negligible memory
- `mlp` recompute saves ~3 GB peak but costs ~16% because the Llama3 70B FFN (hidden=28672) is expensive to recompute
- Combining `mlp` + `core_attn` is slightly worse than `mlp` alone
- For this workload, the actual OOM fix was `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` (memory fragmentation, not capacity). See @skills/nemo-mbridge-perf-memory-tuning/SKILL.md.
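
The "vs Golden" column and the "Perf regression" label in the results above follow from the golden value and the 5% threshold. A small sketch of that arithmetic is shown here; `vs_golden` and `is_perf_regression` are illustrative names rather than functions from the performance harness, and the baseline and Exp 3 throughputs are the approximate values (~704, ~702) reported in the table.

```python
GOLDEN_TFLOPS = 709.93       # golden TFLOP/s/GPU for this workload
REGRESSION_THRESHOLD = 0.05  # 5% regression threshold


def vs_golden(tflops, golden=GOLDEN_TFLOPS):
    """Relative throughput change against the golden value (negative = slower)."""
    return (tflops - golden) / golden


def is_perf_regression(tflops, threshold=REGRESSION_THRESHOLD):
    """True when throughput falls more than `threshold` below golden."""
    return vs_golden(tflops) < -threshold


measured = {"Baseline": 704.0, "Exp 1": 593.6, "Exp 2": 586.8, "Exp 3": 702.0}
for name, tflops in measured.items():
    print(f"{name}: {vs_golden(tflops):+.1%} regression={is_perf_regression(tflops)}")
```

Only the two `mlp` runs cross the threshold, which matches the Result column: the other two runs are within tolerance on throughput and fail on memory instead.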

Code Anchors

Recompute modules enum and selective checkpoint logic

3rdparty/Megatron-LM/megatron/core/transformer/transformer_block.py

_checkpointed_forward() applies selective recompute based on recompute_modules

Recompute config validation

3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py

Validates recompute_granularity, recompute_method, recompute_num_layers

Llama3 recipe defaults

```python
# Memory saving (recompute & offloading)
cfg.model.recompute_granularity = None
cfg.model.recompute_modules = None
cfg.model.fine_grained_activation_offloading = False
cfg.model.offload_modules = None
```

Full recompute + CUDA graph assertion (MCore)

```python
if self.recompute_granularity:
    if self.recompute_granularity != "selective":
        assert self.cuda_graph_scope == [
            CudaGraphScope.full_iteration
        ], "full recompute is only supported with full iteration CUDA graph."
```

CPU offloading PP incompatibility (MCore)

```python
if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
    raise ValueError(
        "Currently there is no support for Pipeline parallelism with CPU offloading"
    )
```

Failure Diagnosis

| Symptom | Cause | Confirm | Fix |
| --- | --- | --- | --- |
| >15% GPU utilization drop | `mlp` recompute on large FFN | check `recompute_modules` includes `mlp` | check `expandable_segments:True` is set; consider reducing MBS |
| Still OOM after adding `layernorm` | layernorm activations are too small | compare peak memory before/after | add `mlp` recompute or check `expandable_segments:True` |
| `AssertionError: full recompute is only supported with full iteration CUDA graph` | layer-level recompute (`recompute_granularity=full` + `recompute_num_layers`) with TE-scoped graphs. FP8 CS configs default to `cuda_graph_impl=transformer_engine`, `scope=mlp`. | check `cuda_graph_impl` and `cuda_graph_scope` | use submodule recompute (`selective` + `recompute_modules`), or `cuda_graph_impl=none`, or `local` + `full_iteration` |
| ValueError: PP + CPU offloading | `cpu_offloading=True` with `pipeline_model_parallel_size > 1` | check PP config | disable CPU offloading or set PP=1 |
| `mlp` + `core_attn` worse than `mlp` alone | double recompute overhead | compare Exp 1 vs Exp 2 | use `mlp` alone |
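
For the two rows above that surface as exceptions, the message text is enough to pick the fix. The sketch below is a hypothetical triage helper (`KNOWN_FAILURES` and `suggest_fix` are not part of any library); the message fragments are copied from the MCore snippets under Code Anchors, and the fixes are the ones listed in the table.

```python
# Hypothetical lookup: error-message fragment -> fix from the diagnosis table.
KNOWN_FAILURES = [
    (
        "full recompute is only supported with full iteration CUDA graph",
        "use selective recompute, or cuda_graph_impl=none, or local + full_iteration",
    ),
    (
        "no support for Pipeline parallelism with CPU offloading",
        "disable CPU offloading or set PP=1",
    ),
]


def suggest_fix(error_text):
    """Return the documented fix for a known error message, or None if unrecognized."""
    for fragment, fix in KNOWN_FAILURES:
        if fragment in error_text:
            return fix
    return None


print(suggest_fix("AssertionError: full recompute is only supported with full iteration CUDA graph."))
```

Throughput drops and persistent OOMs have no message to match on, so those rows still need the manual checks in the Confirm column.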

Known Limitations

- Per-module memory savings vary significantly by model architecture and hidden dimension
- No automatic module selection; users must choose which modules to recompute
- `layernorm` recompute is almost never worth it as a standalone fix
- CPU offloading (the zero-compute-cost alternative) is blocked when PP > 1

Verification

```bash
uv run python -m pytest \
  tests/unit_tests/training/test_config.py -k "recompute" -q
```

Success criteria:

- Unit tests pass for recompute config validation
- No assertion errors from config validation
