perf-cuda-graphs


CUDA Graphs

Stable docs: @docs/training/cuda-graphs.md Card: @skills/perf-cuda-graphs/card.yaml

What It Is

CUDA graphs capture GPU operations once and replay them with minimal host-driver overhead. Bridge supports two implementations:
| cuda_graph_impl | Mechanism | Scope support |
| --- | --- | --- |
| `"local"` | MCore `FullCudaGraphWrapper` wrapping entire fwd+bwd | `full_iteration` |
| `"transformer_engine"` | TE `make_graphed_callables()` per layer | `attn`, `mlp`, `moe`, `moe_router`, `moe_preprocess`, `mamba` |

Quick Decision

Start with TE-scoped graphs for most training workloads, then verify replay timing against eager on the same dispatcher, layout, and container:
  • dense models: `attn`, then optionally `mlp`
  • dropless MoE: `attn moe_router moe_preprocess`
  • VLMs: the same dropless-MoE scope, but only after the real-data path is stable
Use `local` + `full_iteration` only when you specifically want full-iteration capture and can satisfy the tighter constraints.
For recompute-heavy workloads:
  • TE-scoped graphs pair naturally with selective recompute
  • full recompute usually pushes you toward `local` full-iteration graphs or away from graphs entirely
Related docs:
  • @docs/training/cuda-graphs.md
  • @docs/training/activation-recomputation.md
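
The starting points above can be written down as a small lookup. This is a minimal sketch: `RECOMMENDED_TE_SCOPES` and `starting_point` are hypothetical names used for illustration and are not part of Bridge.

```python
from typing import Dict, List

# Hypothetical table mirroring the recommended TE-scoped starting points.
RECOMMENDED_TE_SCOPES: Dict[str, List[str]] = {
    "dense": ["attn"],
    "dropless_moe": ["attn", "moe_router", "moe_preprocess"],
    "vlm": ["attn", "moe_router", "moe_preprocess"],  # same scope as dropless MoE
}


def starting_point(model_kind: str, widen_dense: bool = False) -> dict:
    """Return a candidate impl/scope pair for a first TE-scoped graph run."""
    if model_kind not in RECOMMENDED_TE_SCOPES:
        raise ValueError(f"unknown model kind: {model_kind!r}")
    scope = list(RECOMMENDED_TE_SCOPES[model_kind])
    if model_kind == "dense" and widen_dense:
        scope.append("mlp")  # optional second step for dense models
    return {"cuda_graph_impl": "transformer_engine", "cuda_graph_scope": scope}


print(starting_point("dense"))
print(starting_point("dropless_moe"))
```

The result is only a candidate: replay timing still has to be checked against eager on the same dispatcher, layout, and container.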

Enablement

Local full-iteration graph

python
cfg.model.cuda_graph_impl = "local"
cfg.model.cuda_graph_scope = ["full_iteration"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True
cfg.rerun_state_machine.check_for_nan_in_loss = False
cfg.ddp.check_for_nan_in_grad = False

TE scoped graph (dense model)

python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["attn"]           # or ["attn", "mlp"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True

TE scoped graph (MoE model)

python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["attn", "moe_router", "moe_preprocess"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True

Performance harness CLI

bash
uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  --cuda_graph_impl transformer_engine \
  --cuda_graph_scope attn,moe_router,moe_preprocess \
  ...
Valid CLI values live in `scripts/performance/argument_parser.py`:
  • `VALID_CUDA_GRAPH_IMPLS`: `["none", "local", "transformer_engine"]`
  • `VALID_CUDA_GRAPH_SCOPES`: `["full_iteration", "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba"]`
The performance harness takes a comma-separated `--cuda_graph_scope` value and auto-enables `model.use_te_rng_tracker` plus `rng.te_rng_tracker` when `--cuda_graph_impl` is not `none`.
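
The CLI behaviour described above can be sketched in plain Python. This is an illustration, not the harness's actual code: `parse_cuda_graph_args` is a hypothetical name, and only the two value lists are taken from the text.

```python
from typing import List

VALID_CUDA_GRAPH_IMPLS = ["none", "local", "transformer_engine"]
VALID_CUDA_GRAPH_SCOPES = [
    "full_iteration", "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba",
]


def parse_cuda_graph_args(impl: str, scope_arg: str) -> dict:
    """Validate the two CLI values and derive the matching config overrides."""
    if impl not in VALID_CUDA_GRAPH_IMPLS:
        raise ValueError(f"invalid --cuda_graph_impl: {impl!r}")
    # Split the comma-separated scope string, ignoring empty pieces.
    scopes: List[str] = [s.strip() for s in scope_arg.split(",") if s.strip()]
    unknown = [s for s in scopes if s not in VALID_CUDA_GRAPH_SCOPES]
    if unknown:
        raise ValueError(f"invalid --cuda_graph_scope entries: {unknown}")
    # Both RNG tracker flags are switched on whenever graphs are enabled.
    enable_rng_tracker = impl != "none"
    return {
        "model.cuda_graph_impl": impl,
        "model.cuda_graph_scope": scopes,
        "model.use_te_rng_tracker": enable_rng_tracker,
        "rng.te_rng_tracker": enable_rng_tracker,
    }


overrides = parse_cuda_graph_args("transformer_engine", "attn,moe_router,moe_preprocess")
for key, value in overrides.items():
    print(f"{key} = {value}")
```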

Required constraints

  • `use_te_rng_tracker = True` (enforced in `gpt_provider.py`)
  • `full_iteration` scope only with `cuda_graph_impl = "local"`
  • `full_iteration` scope requires `check_for_nan_in_loss = False`
  • Do not combine `moe` scope and `moe_router` scope
  • Tensor shapes must be static (fixed seq_length, fixed micro_batch_size)
  • Token-dropless MoE routing limits the graphable scope to dense modules (`attn`, `moe_router`, `moe_preprocess`); the full expert dispatch cannot be graphed
  • With `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, set `NCCL_GRAPH_REGISTER=0` (MCore enforces this for the local impl on arch < sm_100; the TE impl asserts unconditionally)
  • CPU offloading is incompatible with CUDA graphs
  • `moe_preprocess` scope requires `moe_router` scope to also be set
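
Most of these constraints can be checked before launching a job. The sketch below is a standalone pre-flight check under stated assumptions: `cuda_graph_violations` and its parameters are hypothetical, it does not call Bridge or MCore, and it flags the NCCL rule on every architecture, which is stricter than the local impl requires.

```python
import os
from typing import List, Mapping, Optional, Sequence


def cuda_graph_violations(
    impl: str,
    scope: Sequence[str],
    *,
    use_te_rng_tracker: bool,
    check_for_nan_in_loss: bool,
    cpu_offloading: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return human-readable descriptions of violated constraints."""
    env = os.environ if env is None else env
    problems: List[str] = []
    if impl == "none":
        return problems  # graphs disabled, nothing to check
    scopes = set(scope)
    if not use_te_rng_tracker:
        problems.append("use_te_rng_tracker must be True")
    if "full_iteration" in scopes and impl != "local":
        problems.append('full_iteration requires cuda_graph_impl="local"')
    if "full_iteration" in scopes and check_for_nan_in_loss:
        problems.append("full_iteration requires check_for_nan_in_loss=False")
    if {"moe", "moe_router"} <= scopes:
        problems.append("moe and moe_router scopes cannot be combined")
    if "moe_preprocess" in scopes and "moe_router" not in scopes:
        problems.append("moe_preprocess requires moe_router")
    if cpu_offloading:
        problems.append("CPU offloading is incompatible with CUDA graphs")
    # Conservative: flagged regardless of GPU architecture.
    alloc_conf = env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    if "expandable_segments:True" in alloc_conf and env.get("NCCL_GRAPH_REGISTER") != "0":
        problems.append("set NCCL_GRAPH_REGISTER=0 with expandable_segments:True")
    return problems


issues = cuda_graph_violations(
    "transformer_engine",
    ["attn", "moe_preprocess"],
    use_te_rng_tracker=True,
    check_for_nan_in_loss=True,
    env={},
)
print(issues)
```

Static tensor shapes are not covered here because they depend on the data pipeline rather than on configuration values.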

Practical bring-up order

  1. Stabilize the eager run first.
  2. Fix sequence length and micro-batch size.
  3. Enable the narrowest useful graph scope.
  4. Confirm replay is active and memory is still acceptable.
  5. Compare eager against graph replay iterations after warmup and capture; do not include the capture step in steady-state timing.
  6. Only then widen scope or combine with overlap features.
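
Step 5 is easy to get wrong when the capture iteration is left in the average. The sketch below assumes per-iteration wall-clock times are already collected in a list and that exactly one iteration absorbs the capture cost; the function names and the sample numbers are illustrative only.

```python
from statistics import mean
from typing import Sequence


def steady_state_mean(times: Sequence[float], warmup_steps: int, capture_steps: int = 1) -> float:
    """Average iteration time after dropping warmup and capture iterations."""
    skip = warmup_steps + capture_steps
    steady = list(times[skip:])
    if not steady:
        raise ValueError("no steady-state iterations left after skipping")
    return mean(steady)


def relative_change(graph_mean: float, eager_mean: float) -> float:
    """Positive means the graphed run is slower than eager."""
    return (graph_mean - eager_mean) / eager_mean


# Made-up timings in seconds: three warmup steps, one slow capture step, then replay.
graph_times = [10.0, 10.0, 10.0, 17.0, 9.0, 9.0]
eager_times = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]

graph_mean = steady_state_mean(graph_times, warmup_steps=3)
eager_mean = steady_state_mean(eager_times, warmup_steps=3)
print(f"graph {graph_mean:.2f} s, eager {eager_mean:.2f} s, "
      f"change {relative_change(graph_mean, eager_mean):+.1%}")
```

Skipping the same iterations in the eager run keeps the two averages comparable.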

Code Anchors

Bridge config and validation

        # CUDA graph scope validation: check_for_nan_in_loss must be disabled with full_iteration graph
        if self.model.cuda_graph_impl == "local" and CudaGraphScope.full_iteration in self.model.cuda_graph_scope:
            assert not self.rerun_state_machine.check_for_nan_in_loss, (
                "check_for_nan_in_loss must be disabled when using full_iteration CUDA graph. "
                "Set rerun_state_machine.check_for_nan_in_loss=False."
            )
        if self.model.cuda_graph_impl == "none":
            self.model.cuda_graph_scope = []

TE RNG tracker requirement

        if self.cuda_graph_impl != "none":
            assert getattr(self, "use_te_rng_tracker", False), (
                "Transformer engine's RNG tracker is required for cudagraphs, it can be "
                "enabled with use_te_rng_tracker=True'."
            )

Graph creation and capture in training loop

    # Capture CUDA Graphs.
    cuda_graph_helper = None
    if model_config.cuda_graph_impl == "transformer_engine":
        cuda_graph_helper = TECudaGraphHelper(...)
    # ...
    if config.model.cuda_graph_impl == "local" and CudaGraphScope.full_iteration in config.model.cuda_graph_scope:
        forward_backward_func = FullCudaGraphWrapper(
            forward_backward_func, cuda_graph_warmup_steps=config.model.cuda_graph_warmup_steps
        )

TE graph capture after warmup

        # Capture CUDA Graphs after warmup.
        if (
            model_config.cuda_graph_impl == "transformer_engine"
            and cuda_graph_helper is not None
            and not cuda_graph_helper.graphs_created()
            and global_state.train_state.step - start_iteration == model_config.cuda_graph_warmup_steps
        ):
            if model_config.cuda_graph_warmup_steps > 0 and should_toggle_forward_pre_hook:
                disable_forward_pre_hook(model, param_sync=False)
            cuda_graph_helper.create_cudagraphs()
            if model_config.cuda_graph_warmup_steps > 0 and should_toggle_forward_pre_hook:
                enable_forward_pre_hook(model)
                cuda_graph_helper.cuda_graph_set_manual_hooks()
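
The capture condition counts warmup steps relative to `start_iteration`, so a resumed run warms up again before capturing. The simulation below illustrates that timing only: `capture_schedule` is a hypothetical helper, and it assumes the check runs before the training step of the same iteration.

```python
from typing import List, Tuple


def capture_schedule(start_iteration: int, warmup_steps: int, num_steps: int) -> List[Tuple[int, str]]:
    """List (step, event) pairs for a run that begins at start_iteration."""
    graphs_created = False
    events: List[Tuple[int, str]] = []
    for step in range(start_iteration, start_iteration + num_steps):
        # Same comparison as in the excerpt above.
        if not graphs_created and step - start_iteration == warmup_steps:
            graphs_created = True
            events.append((step, "capture"))
        events.append((step, "graphed" if graphs_created else "eager"))
    return events


# A run resumed from iteration 100 with cuda_graph_warmup_steps = 3.
for step, event in capture_schedule(start_iteration=100, warmup_steps=3, num_steps=5):
    print(step, event)
```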

RNG initialization

        _set_random_seed(
            rng_config.seed,
            rng_config.data_parallel_random_init,
            rng_config.te_rng_tracker,
            rng_config.inference_rng_tracker,
            use_cudagraphable_rng=(model_config.cuda_graph_impl != "none"),
            pg_collection=pg_collection,
        )

Delayed wgrad + CUDA graph interaction

            cuda_graph_scope = getattr(model_cfg, "cuda_graph_scope", []) or []
            # ... scope parsing ...
            if wgrad_in_graph_scope:
                assert is_te_min_version("2.12.0"), ...
                assert model_cfg.gradient_accumulation_fusion, ...
                if attn_scope_enabled:
                    assert not model_cfg.add_bias_linear and not model_cfg.add_qkv_bias, ...

Perf harness override helper

def _set_cuda_graph_overrides(
    recipe, cuda_graph_impl=None, cuda_graph_scope=None
):
    # Sets impl, scope, and auto-enables te_rng_tracker
    ...  # body elided in this excerpt

Graph cleanup

def _delete_cuda_graphs(cuda_graph_helper):
    # Deletes FullCudaGraphWrapper and TE graph objects to free NCCL buffers
    ...  # body elided in this excerpt
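
One way to make sure cleanup also runs on early exits is a `try`/`finally` wrapper. This is a generic pattern sketch, not Bridge code: `cleanup_on_exit` is a hypothetical name, and `delete_fn` stands in for a cleanup function such as `_delete_cuda_graphs`.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def cleanup_on_exit(cuda_graph_helper: Any, delete_fn: Callable[[Any], None]) -> Iterator[Any]:
    """Run delete_fn(cuda_graph_helper) on normal exit and on exceptions."""
    try:
        yield cuda_graph_helper
    finally:
        delete_fn(cuda_graph_helper)


# Stand-ins for demonstration: a string as the helper, list.append as cleanup.
released = []
try:
    with cleanup_on_exit("helper", released.append):
        raise RuntimeError("simulated early exit")
except RuntimeError:
    pass

print(released)
```

The cleanup callable is passed in explicitly so the pattern can be tried without a GPU.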

MCore classes (in 3rdparty/Megatron-LM)

  • `CudaGraphManager`: `megatron/core/transformer/cuda_graphs.py`
  • `TECudaGraphHelper`: `megatron/core/transformer/cuda_graphs.py`
  • `FullCudaGraphWrapper`: `megatron/core/full_cuda_graph.py`
  • `CudaGraphScope` enum: `megatron/core/transformer/enums.py`

Positive recipe anchors

  • scripts/performance/configs/deepseek/deepseek_workload_base_configs.py
  • scripts/performance/configs/qwen/qwen3_workload_base_configs.py
  • scripts/performance/configs/gpt_oss/gpt_oss_workload_base_configs.py

Tests

| File | Coverage |
| --- | --- |
| `tests/unit_tests/training/test_config.py` | `full_iteration` NaN-check constraint |
| `tests/unit_tests/training/test_comm_overlap.py` | `delay_wgrad` + CUDA graph interaction |
| `tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py` | TE autocast with CUDA graphs |
| `tests/functional_tests/recipes/test_llama_recipes_pretrain_cuda_graphs.py` | End-to-end local and TE graph smoke tests |
| `tests/unit_tests/recipes/kimi/test_kimi_k2.py` | TE + CUDA graph recipe config |
| `tests/unit_tests/recipes/gpt/test_gpt3_175b.py` | TE + CUDA graph recipe config |
| `tests/unit_tests/recipes/qwen_vl/test_qwen25_vl_recipes.py` | VLM CUDA graph settings |

Pitfalls

  1. TE RNG tracker is mandatory: Setting `cuda_graph_impl` to anything other than `"none"` without `use_te_rng_tracker=True` and `rng.te_rng_tracker=True` will assert in the provider.
  2. `full_iteration` requires NaN checks disabled: The entire fwd+bwd is captured, so loss-NaN checking cannot inspect intermediate values.
  3. MoE scope restrictions: `moe` scope and `moe_router` scope are mutually exclusive. Token-dropless MoE can only graph `moe_router` and `moe_preprocess`, not the full expert dispatch.
  4. Memory overhead: CUDA graphs pin all intermediate buffers for the graph's lifetime (no memory reuse). TE scoped graphs add a few GB; full-iteration graphs can increase peak memory by 1.5–2×. `PP > 1` compounds the overhead since each stage holds its own graph.
  5. Delayed wgrad interaction: When `delay_wgrad_compute=True` and attention or MoE router is in `cuda_graph_scope`, additional constraints apply: TE >= 2.12.0, `gradient_accumulation_fusion=True`, and no attention bias.
  6. Variable-length sequences break graphs: Sequence lengths must be constant across steps. Use padded packed sequences if packing is needed.
  7. Graph cleanup is required: CUDA graph objects hold NCCL buffer references. Bridge handles this in `_delete_cuda_graphs()` at the end of training, but early exits must call it explicitly.
  8. Older GPU architectures: On GPUs with compute capability < 10.0 (pre-Blackwell), set `NCCL_GRAPH_REGISTER=0` when using `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. Enforced in MCore `CudaGraphManager` (cuda_graphs.py:1428) and `TECudaGraphHelper` (cuda_graphs.py:1697). The TE impl asserts unconditionally regardless of arch.
  9. CPU offloading incompatible: CUDA graphs cannot be used with CPU offloading. Enforced in MCore `transformer_config.py:1907`.
  10. MoE recompute + moe_router scope: MoE recompute is not supported with `moe_router` CUDA graph scope when using `cuda_graph_impl = "transformer_engine"`. Enforced in MCore `transformer_config.py:1977`.
  11. Layer-level recompute requires `full_iteration` scope: Using `recompute_granularity="full"` with `recompute_num_layers` (recompute N whole transformer layers) is incompatible with TE-scoped graphs. MCore calls this "full" granularity even though you're selecting how many layers; the name refers to recomputing the full layer, not the full model. Any TE-scoped scope (`attn`, `mlp`, `moe_router`, etc.) will assert: `AssertionError: full recompute is only supported with full iteration CUDA graph.` This commonly hits FP8 configs that default to TE-scoped graphs (e.g. `LLAMA3_70B_SFT_CONFIG_H100_FP8_CS_V1` uses `cuda_graph_impl="transformer_engine"`, `cuda_graph_scope="mlp"`). Fix: use submodule recompute (`recompute_granularity="selective"` + `recompute_modules`), disable CUDA graphs, or switch to `local` + `full_iteration`. Enforced in MCore `transformer_config.py:2001-2005`. See also @skills/perf-activation-recompute/SKILL.md.
  12. Benchmark numbers are workload-specific: graph wins are usually real when host overhead is visible, but the exact gain depends on batch shape, PP depth, recompute, dispatcher backend, and whether the eager baseline was already optimized.
  13. A successful capture is not a speedup guarantee: On 2026-05-18, Qwen3 30B A3B H100 BF16 pretrain with the all-to-all dispatcher captured TE-scoped `attn,moe_router,moe_preprocess` graphs successfully (48 graphable layers, about 6.9 s capture time on rank 0), but replay iterations 5-8 averaged 42.00 s versus 41.36 s for eager, about 1.5% slower. Treat scoped graphs as a bring-up candidate and validate on the target stack.
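
For pitfall 6, the idea of padding a packed sample to a constant length can be sketched as follows. This is a simplified illustration on Python lists: `pack_and_pad` and `pad_id` are assumptions, and real packing also needs per-document boundary information, which is omitted here.

```python
from typing import List, Sequence, Tuple


def pack_and_pad(docs: Sequence[Sequence[int]], seq_length: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Concatenate token lists and pad to seq_length.

    Returns (tokens, loss_mask) where loss_mask is 0 on padding positions,
    so every step sees the same tensor shape.
    """
    tokens: List[int] = []
    for doc in docs:
        tokens.extend(doc)
    if len(tokens) > seq_length:
        raise ValueError(f"packed length {len(tokens)} exceeds seq_length {seq_length}")
    n_pad = seq_length - len(tokens)
    loss_mask = [1] * len(tokens) + [0] * n_pad
    return tokens + [pad_id] * n_pad, loss_mask


tokens, loss_mask = pack_and_pad([[5, 6, 7], [8, 9]], seq_length=8)
print(tokens)
print(loss_mask)
```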

Verification

Unit tests

bash
uv run python -m pytest \
  tests/unit_tests/training/test_config.py \
  tests/unit_tests/training/test_comm_overlap.py \
  tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py \
  -k "cuda_graph" -q

Functional smoke test (requires GPU)

bash
uv run python -m pytest \
  tests/functional_tests/recipes/test_llama_recipes_pretrain_cuda_graphs.py -q

Success criteria

  • Unit tests pass, covering config validation for both `local` and `transformer_engine` implementations.
  • Functional test completes training steps with both CUDA graph implementations.
  • No NCCL errors or illegal memory access in logs.
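
The last criterion can be checked with a simple log scan. The sketch below is a heuristic: the two regular expressions are assumptions about how such failures are worded in a log, `find_suspect_lines` is a hypothetical helper, and the sample lines are invented for the demonstration.

```python
import re
from typing import Iterable, List, Tuple

# Assumed wording for the two failure classes; adjust to the real log format.
SUSPECT_PATTERNS = (
    re.compile(r"nccl\b.*\berror", re.IGNORECASE),
    re.compile(r"illegal memory access", re.IGNORECASE),
)


def find_suspect_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line matching a suspect pattern."""
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in SUSPECT_PATTERNS):
            hits.append((lineno, line.rstrip("\n")))
    return hits


sample_log = [
    "iteration 5 | loss 2.31",
    "RuntimeError: CUDA error: an illegal memory access was encountered",
    "NCCL error in: made-up location",
    "NCCL_GRAPH_REGISTER=0",
]
for lineno, text in find_suspect_lines(sample_log):
    print(f"line {lineno}: {text}")
```

An empty result only means none of the assumed patterns matched; it does not replace reading the log after a failed run.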