perf-activation-recompute

Activation Recompute

Stable docs: @docs/training/activation-recomputation.md Card: @skills/perf-activation-recompute/card.yaml

What It Is

Activation recompute trades GPU compute for memory by discarding intermediate activations during the forward pass and recomputing them during the backward pass. Megatron Bridge supports two granularities:

| Granularity | What you specify | What gets recomputed | Memory savings | Compute cost |
|---|---|---|---|---|
| selective | `recompute_modules` list (e.g. `core_attn`, `mlp`) | specific submodules within each layer | moderate (module-dependent) | low to high |
| full | `recompute_num_layers` + `recompute_method` | entire transformer layers (N layers) | strongest | highest |

Note: MCore names these "selective" (submodule-level) vs "full" (layer-level). "Full" means recomputing full layers, not the full model; you still choose how many layers via `recompute_num_layers`.
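The trade-off can be illustrated without any framework. The following is a toy sketch in plain Python, not Megatron Bridge or MCore code: the scalar `tanh` "layers" and the helper names (`grad_store_all`, `grad_recompute`) are made up for illustration. It shows that saving only one input per chunk of layers and re-running the forward pass inside backward yields the same gradient while keeping fewer activations alive.

```python
import math

def layer_forward(w, x):
    # Toy "layer": y = tanh(w * x). Its backward pass needs the layer input x.
    return math.tanh(w * x)

def layer_backward(w, x, grad_out):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)

def grad_store_all(weights, x0):
    """Baseline: keep every layer input alive until backward."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer_forward(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad = layer_backward(w, x_in, grad)
    return grad, len(inputs)

def grad_recompute(weights, x0, chunk):
    """Recompute: save only each chunk's input, rebuild the rest during backward."""
    starts = list(range(0, len(weights), chunk))
    saved, x = [], x0
    for s in starts:
        saved.append(x)                  # the only activation kept for this chunk
        for w in weights[s:s + chunk]:
            x = layer_forward(w, x)      # intermediate inputs are discarded
    grad = 1.0
    for s, x_in0 in zip(reversed(starts), reversed(saved)):
        ws, inputs, x = weights[s:s + chunk], [], x_in0
        for w in ws:                     # second forward pass: the compute cost
            inputs.append(x)
            x = layer_forward(w, x)
        for w, x_in in zip(reversed(ws), reversed(inputs)):
            grad = layer_backward(w, x_in, grad)
    return grad, len(saved)

weights = [0.5, -1.2, 0.8, 1.1, -0.7, 0.9, 1.3, -0.4]
g_full, kept_full = grad_store_all(weights, 0.3)
g_rc, kept_rc = grad_recompute(weights, 0.3, chunk=4)
```

With eight layers and a chunk size of four, the baseline keeps eight layer inputs between forward and backward while the recompute variant keeps two, at the price of one extra forward pass per chunk.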

Quick Decision

  1. Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first. Most borderline OOMs are caused by memory fragmentation, not capacity, and this setting fixes fragmentation at zero cost. See @skills/perf-memory-tuning/SKILL.md.
  2. Start with `recompute_granularity=selective`, `recompute_modules=[core_attn]` (often already the default in recipes).
  3. Add `layernorm` to the recompute modules. It is nearly free compute-wise but saves negligible memory, so it only helps in extremely borderline cases.
  4. Add `mlp` as a last resort. It saves ~3 GB but costs ~16% GPU utilization on large dense models (Llama3 70B).
  5. Use `recompute_granularity=full` only when selective recompute still does not fit.

CPU offloading (`cpu_offloading=True`) is an alternative that avoids recompute cost entirely, but it is incompatible with PP > 1.
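Step 1 is usually done in the launch environment, but it can also be applied from Python. The sketch below assumes the variable is set before the first CUDA allocation (safest: before importing `torch`), since the allocator reads it when it initializes. The helper name `with_expandable_segments` is hypothetical; it merges the option into any comma-separated value that is already present instead of overwriting it.

```python
import os
from typing import Optional

def with_expandable_segments(current: Optional[str]) -> str:
    """Return an allocator config string with expandable_segments forced to True."""
    # PYTORCH_CUDA_ALLOC_CONF is a comma-separated list of option:value pairs.
    opts = [o.strip() for o in (current or "").split(",") if o.strip()]
    # Drop any existing expandable_segments entry so the new one wins.
    opts = [o for o in opts if not o.startswith("expandable_segments:")]
    opts.append("expandable_segments:True")
    return ",".join(opts)

# Run this before importing torch or touching CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = with_expandable_segments(
    os.environ.get("PYTORCH_CUDA_ALLOC_CONF")
)
```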

Enablement

Selective recompute (default for most recipes)

```python
cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn"]
```

Selective recompute with additional modules

```python
cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn", "layernorm"]  # or ["mlp"] or ["mlp", "core_attn"]
```

Full-layer recompute

```python
cfg.model.recompute_granularity = "full"
cfg.model.recompute_method = "uniform"
cfg.model.recompute_num_layers = 4
```
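How `recompute_method` and `recompute_num_layers` interact is not spelled out here. The sketch below is a rough illustration that assumes the semantics described in the Megatron-LM argument help: `uniform` splits the layers of a pipeline stage into chunks of `recompute_num_layers` and recomputes every chunk from its saved input, while `block` recomputes only the first `recompute_num_layers` layers, one layer at a time. The function `full_recompute_plan` is hypothetical and is not part of MCore.

```python
def full_recompute_plan(num_layers, method, recompute_num_layers):
    """Return (start, end) layer ranges that are checkpointed as one unit.

    num_layers is the number of transformer layers on one pipeline stage.
    Layers not covered by any range keep their activations (no recompute).
    """
    if recompute_num_layers < 1:
        raise ValueError("recompute_num_layers must be >= 1")
    if method == "uniform":
        # Every layer is recomputed; the value sets the chunk size.
        return [(s, min(s + recompute_num_layers, num_layers))
                for s in range(0, num_layers, recompute_num_layers)]
    if method == "block":
        # Only the first N layers are recomputed, each as its own unit.
        return [(i, i + 1) for i in range(min(recompute_num_layers, num_layers))]
    raise ValueError(f"unknown recompute_method: {method!r}")

uniform_plan = full_recompute_plan(8, "uniform", 4)
block_plan = full_recompute_plan(8, "block", 4)
```

Under these assumptions, `uniform` with a value of 4 on an 8-layer stage gives two 4-layer chunks, whereas `block` with the same value recomputes layers 0 to 3 individually and leaves layers 4 to 7 untouched.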

Available recompute_modules

| Module | What it recomputes | Compute cost | Memory savings |
|---|---|---|---|
| `core_attn` | attention softmax/dropout/QKV dot product | low (Flash Attention already recomputes internally) | moderate |
| `layernorm` | layer normalization | negligible (~0%) | negligible |
| `mlp` | full FFN block | high (~16% on Llama3 70B, hidden=28672) | ~3 GB |
| `moe` | MoE expert dispatch | varies | varies |
| `moe_act` | MoE activation functions | low | small |
| `shared_experts` | shared expert layers | moderate | moderate |
| `mla_up_proj` | Multi-head Latent Attention (MLA) up projection | moderate | moderate |
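A typo in a module name is easy to miss in a config file. As a minimal sketch, the names from the table above can be kept in a set and checked before launch. The set and the helper `unknown_recompute_modules` are illustrative only; the authoritative list lives in MCore and may differ between versions.

```python
# Module names copied from the table above (may lag behind MCore).
KNOWN_RECOMPUTE_MODULES = {
    "core_attn", "layernorm", "mlp", "moe",
    "moe_act", "shared_experts", "mla_up_proj",
}

def unknown_recompute_modules(modules):
    """Return the requested module names that are not in the table, sorted."""
    return sorted(set(modules) - KNOWN_RECOMPUTE_MODULES)

typos = unknown_recompute_modules(["core_attn", "layer_norm"])
```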

Performance harness CLI

```bash
python scripts/performance/run_performance_workload.py \
  --recompute_granularity selective \
  --recompute_modules core_attn layernorm \
  ...
```

Compatibility and Constraints

  • `recompute_granularity=selective` requires a non-empty `recompute_modules` list.
  • `recompute_granularity=full` requires `recompute_method` and `recompute_num_layers`.
  • Layer-level recompute (`recompute_granularity="full"` + `recompute_num_layers`) is incompatible with TE-scoped CUDA graphs. MCore calls this "full" granularity; the name refers to recomputing full transformer layers, not the full model. Even though you select how many layers to recompute, MCore treats it differently from submodule recompute. Any TE-scoped scope (`attn`, `mlp`, `moe_router`, etc.) will assert. This commonly hits FP8 configs that enable TE-scoped graphs by default (e.g. `LLAMA3_70B_SFT_CONFIG_H100_FP8_CS_V1` sets `cuda_graph_impl="transformer_engine"`, `cuda_graph_scope="mlp"`). Options:
    • use submodule recompute (`recompute_granularity="selective"` + `recompute_modules`), which is compatible with TE-scoped graphs
    • disable CUDA graphs (`cuda_graph_impl="none"`) and use layer-level recompute
    • switch to `cuda_graph_impl="local"`, `cuda_graph_scope="full_iteration"`
  • `distribute_saved_activations=True` cannot be combined with `sequence_parallel=True`.
  • Combining `mlp` + `core_attn` recompute is slightly worse than `mlp` alone due to double recompute overhead.
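The hard constraints above can be checked before a job is submitted. The following is a minimal sketch over a plain dict, not the MCore validation itself: the function name `check_recompute_config` is hypothetical, and `cuda_graph_scope` is simplified to a single string, whereas MCore compares against a list of `CudaGraphScope` values.

```python
def check_recompute_config(cfg):
    """Return a list of human-readable constraint violations (empty if none)."""
    problems = []
    granularity = cfg.get("recompute_granularity")

    if granularity == "selective" and not cfg.get("recompute_modules"):
        problems.append("selective requires a non-empty recompute_modules list")

    if granularity == "full":
        if not cfg.get("recompute_method") or not cfg.get("recompute_num_layers"):
            problems.append("full requires recompute_method and recompute_num_layers")
        # Layer-level recompute only works without CUDA graphs or with
        # full-iteration graphs (simplified: scope is one string here).
        graphs_on = cfg.get("cuda_graph_impl", "none") != "none"
        if graphs_on and cfg.get("cuda_graph_scope") != "full_iteration":
            problems.append("full recompute needs a full_iteration CUDA graph scope")

    if cfg.get("distribute_saved_activations") and cfg.get("sequence_parallel"):
        problems.append("distribute_saved_activations conflicts with sequence_parallel")

    return problems

fp8_te_graphs = {
    "recompute_granularity": "full",
    "recompute_method": "uniform",
    "recompute_num_layers": 4,
    "cuda_graph_impl": "transformer_engine",
    "cuda_graph_scope": "mlp",
}
violations = check_recompute_config(fp8_te_graphs)
```

The example config mirrors the FP8 case described above and yields one violation; switching it to `"selective"` with a `recompute_modules` list, or to `cuda_graph_impl="local"` with `cuda_graph_scope="full_iteration"`, clears it.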

Measured Results

Llama3 70B SFT on 32x H100 80GB, FP8 (Current Scaling):
  • Baseline: TP=4, PP=4, VPP=5, DP=2, MBS=1, GBS=32, seq_len=4096
  • Golden GPU utilization: 709.93 TFLOP/s/GPU
  • Regression threshold: 5%

| Experiment | recompute_modules | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
|---|---|---|---|---|---|
| Baseline | [core_attn] | ~704 | -0.8% | 58.8 (OOM rank0) | OOM |
| Exp 1 | [mlp] | 593.6 | -16.4% | 55.6 | Perf regression |
| Exp 2 | [mlp, core_attn] | 586.8 | -17.3% | 55.6 | Perf regression |
| Exp 3 | [core_attn, layernorm] | ~702 | -1.1% | 59.6 (OOM rank0) | OOM |

Key takeaways:
  • `layernorm` recompute is nearly free compute-wise but saves negligible memory
  • `mlp` recompute saves ~3 GB peak but costs ~16% because the Llama3 70B FFN (hidden=28672) is expensive to recompute
  • Combining `mlp` + `core_attn` is slightly worse than `mlp` alone
  • For this workload, the actual OOM fix was `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` (memory fragmentation, not capacity). See @skills/perf-memory-tuning/SKILL.md.
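The "vs Golden" column and the regression verdicts follow directly from the golden value and the 5% threshold. A short sketch of that arithmetic, using the throughput figures from the table (the baseline and Exp 3 values are the approximate ~704 and ~702 reported above; the helper names are illustrative):

```python
GOLDEN_TFLOPS = 709.93       # golden TFLOP/s/GPU from the setup above
THRESHOLD_PCT = 5.0          # regression threshold

def vs_golden_pct(tflops):
    """Relative difference to the golden value, in percent."""
    return (tflops - GOLDEN_TFLOPS) / GOLDEN_TFLOPS * 100.0

def is_regression(tflops):
    """True when throughput drops by more than the threshold."""
    return vs_golden_pct(tflops) < -THRESHOLD_PCT

runs = {"Baseline": 704.0, "Exp 1": 593.6, "Exp 2": 586.8, "Exp 3": 702.0}
deltas = {name: round(vs_golden_pct(t), 1) for name, t in runs.items()}
regressions = [name for name, t in runs.items() if is_regression(t)]
```

Only the two `mlp` experiments exceed the threshold; the baseline and Exp 3 stay within it but fail for a different reason (OOM).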

Code Anchors

Recompute modules enum and selective checkpoint logic

3rdparty/Megatron-LM/megatron/core/transformer/transformer_block.py

_checkpointed_forward() applies selective recompute based on recompute_modules

Recompute config validation

3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py

Validates recompute_granularity, recompute_method, recompute_num_layers

Llama3 recipe defaults

```python
# Memory saving (recompute & offloading)
cfg.model.recompute_granularity = None
cfg.model.recompute_modules = None
cfg.model.fine_grained_activation_offloading = False
cfg.model.offload_modules = None
```

Full recompute + CUDA graph assertion (MCore)

```python
if self.recompute_granularity:
    if self.recompute_granularity != "selective":
        assert self.cuda_graph_scope == [
            CudaGraphScope.full_iteration
        ], "full recompute is only supported with full iteration CUDA graph."
```

CPU offloading PP incompatibility (MCore)

```python
if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
    raise ValueError(
        "Currently there is no support for Pipeline parallelism with CPU offloading"
    )
```

Failure Diagnosis

| Symptom | Cause | Confirm | Fix |
|---|---|---|---|
| >15% GPU utilization drop | `mlp` recompute on large FFN | check `recompute_modules` includes `mlp` | check `expandable_segments:True` is set; consider reducing MBS |
| Still OOM after adding `layernorm` | `layernorm` activations are too small | compare peak memory before/after | add `mlp` recompute or check `expandable_segments:True` |
| `AssertionError: full recompute is only supported with full iteration CUDA graph` | layer-level recompute (`recompute_granularity=full` + `recompute_num_layers`) with TE-scoped graphs. FP8 CS configs default to `cuda_graph_impl=transformer_engine`, `scope=mlp`. | check `cuda_graph_impl` and `cuda_graph_scope` | use submodule recompute (`selective` + `recompute_modules`), or `cuda_graph_impl=none`, or `local` + `full_iteration` |
| `ValueError`: PP + CPU offloading | `cpu_offloading=True` with `pipeline_model_parallel_size > 1` | check PP config | disable CPU offloading or set PP=1 |
| `mlp` + `core_attn` worse than `mlp` alone | double recompute overhead | compare Exp 1 vs Exp 2 | use `mlp` alone |
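The two error-message rows lend themselves to an automated log check. The sketch below matches the message strings quoted under Code Anchors and returns the fix from the table; the list `KNOWN_FAILURES` and the function `suggest_fix` are hypothetical helpers, not part of Megatron Bridge.

```python
from typing import Optional

# (substring of the error message, suggested fix from the table above)
KNOWN_FAILURES = [
    ("full recompute is only supported with full iteration CUDA graph",
     "use selective recompute with recompute_modules, or set cuda_graph_impl=none, "
     "or use cuda_graph_impl=local with cuda_graph_scope=full_iteration"),
    ("no support for Pipeline parallelism with CPU offloading",
     "disable CPU offloading or set PP=1"),
]

def suggest_fix(log_text: str) -> Optional[str]:
    """Return the suggested fix for the first known message found in the log."""
    for needle, fix in KNOWN_FAILURES:
        if needle in log_text:
            return fix
    return None

hint = suggest_fix(
    "ValueError: Currently there is no support for Pipeline parallelism with CPU offloading"
)
```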

Known Limitations

  • Per-module memory savings vary significantly by model architecture and hidden dimension
  • No automatic module selection — users must choose which modules to recompute
  • `layernorm` recompute is almost never worth it as a standalone fix
  • CPU offloading (the zero-compute-cost alternative) is blocked when PP > 1

Verification

```bash
uv run python -m pytest \
  tests/unit_tests/training/test_config.py -k "recompute" -q
```

Success criteria:
  • Unit tests pass for recompute config validation
  • No assertion errors from config validation