perf-torch-cuda-graphs

CUDA Graphs for PyTorch

CUDA Graphs capture a sequence of GPU operations once and replay them with minimal CPU overhead. This skill guides applying CUDA Graphs to PyTorch training and inference workloads using native PyTorch APIs, Transformer Engine, and Megatron-LM.

When to Use

Reach for this skill when you encounter:
  • Triggers: User wants to optimize with CUDA Graphs, reduce kernel launch overhead, or speed up training/inference loops
  • Symptoms: Low GPU utilization (<80%), many small kernel launches (<50 us each), CPU-bound training, high kernel launch latency visible in Nsight Systems profiles
  • Keywords: "CUDA graph", "torch.cuda.graph", "make_graphed_callables", "reduce-overhead", "graph capture", "graph replay", "kernel launch overhead", "CudaGraphManager", "FullCudaGraphWrapper", "full-iteration graph", "stream capture"
Do NOT use this skill for:
  • General PyTorch performance tuning unrelated to kernel launch overhead
  • CUDA kernel development or custom CUDA C++ code
  • Host-device sync elimination only (use perf-torch-sync-free skill instead)
  • Nsight Systems profiling (use perf-nsight-systems skill)
  • TensorFlow/JAX graph compilation (different APIs entirely)

Requirements

| Dependency | Version | Notes |
| --- | --- | --- |
| PyTorch | >= 1.10 | `torch.cuda.graph()` available |
| CUDA | >= 11.0 | Graph update APIs |
| GPU | NVIDIA (any) | Required for CUDA |
| Nsight Systems | any | Optional, for profiling |
| APEX | any | Optional, for capturable optimizers |
| Transformer Engine | >= 2.2 | Optional, for FP8-aware graphing |
| Megatron-LM | core >= 0.14.0 | Optional, for CudaGraphManager / FullCudaGraphWrapper |

API Selection Guide

Choose the API based on your framework and performance needs.
| Situation | API | Workflow |
| --- | --- | --- |
| Quick experiment, unknown graph boundaries | `torch.compile(mode="reduce-overhead")` | Workflow 2 |
| Training, need autograd, no FP8/PP | `torch.cuda.make_graphed_callables()` | Workflow 3 |
| Any PyTorch model, FP8 or PP support | TE `make_graphed_callables` | Workflow 4 |
| Megatron-LM, per-layer, automatic | MCore `CudaGraphManager` | Workflow 5 |
| Maximum perf, full-iteration capture | MCore `FullCudaGraphWrapper` | Workflow 6 |
| Full manual control, custom pipelines | `torch.cuda.graph()` | Workflow 7 |
Decision flowchart:
  1. Using Megatron-LM with FP8/PP?
    • Yes, want maximum perf with static workload --> Workflow 6 (FullCudaGraphWrapper)
    • Yes, want per-layer automatic graphing --> Workflow 5 (CudaGraphManager)
    • Yes, want manual control over what gets graphed --> Workflow 4 (TE make_graphed_callables)
  2. Using Transformer Engine without Megatron?
    • Yes, need FP8 or PP --> Workflow 4 (TE make_graphed_callables)
  3. General PyTorch?
    • Want zero effort, okay with fragmented graphs --> Workflow 2 (torch.compile)
    • Want autograd support, training loop --> Workflow 3 (PyTorch make_graphed_callables)
    • Want full manual control --> Workflow 7 (torch.cuda.graph)
Strategy: Start with the highest-level API available for your framework. Move to lower-level APIs only if you need more control, hit limitations, or do not achieve the expected performance improvement.
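The flowchart above can also be written as a small lookup, which may help when scripting the choice. This is only a sketch: the function name `choose_workflow` and the string labels used for `framework` and `goal` are hypothetical and do not come from any library.

```python
def choose_workflow(framework, needs_fp8_or_pp, goal):
    """Return the workflow number suggested by the decision flowchart.

    framework: "megatron", "transformer_engine", or "pytorch" (hypothetical labels)
    goal: "max_perf", "per_layer", "zero_effort", "autograd", or "manual"
    """
    megatron_goals = {"max_perf": 6, "per_layer": 5, "manual": 4}
    pytorch_goals = {"zero_effort": 2, "autograd": 3, "manual": 7}

    if needs_fp8_or_pp:
        # Megatron-LM with FP8/PP has three options; everything else
        # that needs FP8 or PP goes through TE make_graphed_callables.
        if framework == "megatron" and goal in megatron_goals:
            return megatron_goals[goal]
        return 4

    if goal in pytorch_goals:
        return pytorch_goals[goal]
    raise ValueError(f"no workflow for framework={framework!r}, goal={goal!r}")


print(choose_workflow("megatron", True, "per_layer"))  # prints 5
```

Unknown combinations raise rather than guess, so a gap in the flowchart surfaces early.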

Workflows

Workflow 1: Profile and Decide Whether Graphs Help

Goal: Determine if CUDA Graphs will benefit your workload before investing effort.
  1. Profile with Nsight Systems:
    bash
    nsys profile --cuda-graph-trace=graph python train.py
  2. Check GPU utilization -- if already >95%, graphs won't help much.
  3. Look for gaps between kernel launches (CPU overhead) and many small kernels (<50 us each). These are the targets for graphing.
  4. Annotate regions of interest to correlate idle GPU time with code:
    python
    with torch.cuda.nvtx.range("forward"):
        output = model(input)
  5. Estimate benefit: count kernels per iteration. Workloads with hundreds of small kernels and <80% GPU utilization are strong candidates.
Expected result: Identified bottleneck regions where the GPU sits idle between kernels. Proceed to the appropriate workflow from the API Selection Guide.
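The thresholds in the steps above can be turned into a quick triage helper once kernel durations have been exported from a profile. The sketch below is illustrative only: `graph_candidate` is a hypothetical name, and the cutoff of 200 small kernels is an assumed stand-in for "hundreds".

```python
def graph_candidate(kernel_durations_us, gpu_utilization_pct,
                    small_kernel_us=50.0, min_small_kernels=200):
    """Classify a workload as "skip", "strong", or "maybe" for CUDA Graphs.

    kernel_durations_us: per-kernel durations for one iteration, in microseconds.
    min_small_kernels: assumed cutoff for "hundreds of small kernels".
    """
    if gpu_utilization_pct > 95.0:
        return "skip"  # already GPU-bound, graphs won't help much
    small = sum(1 for d in kernel_durations_us if d < small_kernel_us)
    if small >= min_small_kernels and gpu_utilization_pct < 80.0:
        return "strong"
    return "maybe"


print(graph_candidate([12.0] * 400, 62.0))  # prints strong
```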

Workflow 2: torch.compile(mode="reduce-overhead")

Goal: Automatic CUDA Graph capture with zero manual effort.
When to use: Quick experiment, unknown graph boundaries, already using `torch.compile`.
Steps:
  1. Decorate the training step with `@torch.compile(mode="reduce-overhead")`:
    python
    @torch.compile(mode="reduce-overhead")
    def train_step(model, x, target, criterion):
        output = model(x)
        loss = criterion(output, target)
        loss.backward()
        return loss
  2. Run the training loop normally -- graphs are captured automatically.
  3. Profile with Nsight Systems to see captured graphs:
    bash
    nsys profile --cuda-graph-trace=graph python train.py
  4. If you see too many small graphs (graph fragmentation), check for graph breaks: `.item()`, `print()`, data-dependent control flow. Fix these or escalate to Workflow 3+.
Trade-offs:
  • Zero effort, but may create fragmented small graphs.
  • Limited control over what gets graphed.
  • Graph fragmentation limits performance gains compared to manual approaches.

Workflow 3: torch.cuda.make_graphed_callables()

Goal: Training with autograd support. Separate forward/backward graphs.
When to use: Training with custom loops, non-FP8, need autograd.
Steps:
  1. Prepare sample inputs matching training batch shape:
    python
    sample_input = torch.randn(batch_size, seq_len, hidden_size, device="cuda")
  2. Create the graphed model:
    python
    graphed_model = torch.cuda.make_graphed_callables(
        model, (sample_input,), num_warmup_iters=3
    )
  3. Use `graphed_model` as a drop-in replacement in the training loop:
    python
    for data, target in dataloader:
        optimizer.zero_grad()
        output = graphed_model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
  4. If using AMP, set `cache_enabled=False`:
    python
    for data, target in dataloader:
        optimizer.zero_grad()
        with torch.amp.autocast("cuda", cache_enabled=False):
            output = graphed_model(data)
            loss = criterion(output, target)
        loss.backward()
        optimizer.step()
  5. If using DDP, construct DDP on a side stream and use 11 warmup iters:
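    python
    import os
    import torch
    from torch.nn.parallel import DistributedDataParallel

    # Disable the NCCL async error-handling watchdog
    os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "0"

    # Construct DDP on a side stream, then rejoin the current stream
    s = torch.cuda.Stream()
    with torch.cuda.stream(s):
        model = DistributedDataParallel(model)
    torch.cuda.current_stream().wait_stream(s)

    graphed_model = torch.cuda.make_graphed_callables(
        model, (sample_input,), num_warmup_iters=11
    )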
Limitations:
  • No double backward (higher-order gradients).
  • No module hooks during capture.
  • Module structure is frozen after graphing (no add/remove parameters).
  • Argument signature must match `sample_args` exactly.

Workflow 4: TE make_graphed_callables

Goal: Per-callable graphing with FP8 support and pipeline parallelism.
When to use: FP8 training, PP with manual scheduling, non-Megatron models needing FP8, or any PyTorch model that needs FP8-aware CUDA Graphs.
Steps:
  1. Import and configure:
    python
    from transformer_engine.pytorch.graph import make_graphed_callables
    from transformer_engine.pytorch.fp8 import fp8_autocast
  2. Prepare sample inputs (one per callable per microbatch per chunk):
    python
    sample_args = tuple(
        (torch.randn(batch_size, seq_len, hidden_size, device="cuda"),)
        for _ in range(num_callables * num_microbatches)
    )
  3. Define pipeline schedule if using PP (1-indexed chunk IDs, positive=fwd, negative=bwd):
    python
    # Example: 2 chunks, 3 microbatches
    layer_order = [1, 2, 1, 2, 1, 2, -2, -1, -2, -1, -2, -1]
  4. Wrap layers in CUDA Graphs:
    python
    graphed_layers = make_graphed_callables(
        tuple(layers),
        sample_args=sample_args,
        fp8_enabled=True,
        fp8_recipe=fp8_recipe,
        fp8_weight_caching=True,
        _order=layer_order,  # None for no PP
    )
  5. Training loop -- wrap with `fp8_autocast` during replay:
    python
    with fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        for layer in graphed_layers[start:end]:
            x = layer(x, is_first_microbatch=(mb_idx == 0))
    # FP8 scaling auto-updated on fp8_autocast exit
    optimizer.step()
Key points:
  • AOT capture: Graphs are captured before the training loop, when you call `make_graphed_callables()`.
  • Replay order must match `_order`: The training loop must execute graphs in the same interleaved order as specified during capture.
  • `fp8_autocast` required during replay: Without it, FP8 state is not properly configured.
  • Weight caching: `fp8_weight_caching=True` caches FP8 weight quantization across microbatches; pass the `is_first_microbatch` kwarg to control when weights are requantized.
For full API details, see `references/api-te-megatron.md`.
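The `_order` list from step 3 can be generated rather than typed by hand. The helper below is a hypothetical sketch (`build_layer_order` is not a Transformer Engine function) that reproduces only the simple pattern of the example above: all forwards first, then all backwards in reverse chunk order. Real interleaved schedules such as 1F1B will need a different ordering.

```python
def build_layer_order(num_chunks, num_microbatches):
    """Build a schedule with 1-indexed chunk IDs: positive = fwd, negative = bwd."""
    forward = [
        chunk
        for _ in range(num_microbatches)
        for chunk in range(1, num_chunks + 1)
    ]
    backward = [
        -chunk
        for _ in range(num_microbatches)
        for chunk in range(num_chunks, 0, -1)
    ]
    return forward + backward


layer_order = build_layer_order(num_chunks=2, num_microbatches=3)
print(layer_order)  # prints [1, 2, 1, 2, 1, 2, -2, -1, -2, -1, -2, -1]
```

Whatever generates the list, the training loop has to replay graphs in exactly that order.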

Workflow 5: MCore CudaGraphManager (Per-Layer)

Goal: Automatic per-layer graphing for Megatron-LM training.
When to use: Megatron-LM training, especially with PP > 1. Default choice for Megatron users.
Steps:
  1. Enable via CLI flags (no code changes needed):
    bash
    python pretrain_gpt.py \
        --enable-cuda-graph \
        --cuda-graph-warmup-steps 3
  2. Or enable via Python config:
    python
    config = TransformerConfig(
        enable_cuda_graph=True,
        cuda_graph_warmup_steps=3,
    )
  3. Training loop is unchanged -- graphs are captured automatically after warmup iterations.
Key points:
  • Megatron layers only: Works with `TransformerLayer` and `MambaLayer`.
  • JIT capture: Records execution order during warmup, captures graphs after warmup completes, then replays on subsequent iterations.
  • Automatic FP8 handling: Uses `fp8_autocast(..., _graph=True)` to skip per-layer amax reduction; reduction happens once after all backward graphs.
  • Automatic PP support: Handles microbatch interleaving automatically.
  • Memory savings: Set `cuda_graph_share_io_buffers=True` to share I/O buffers between layers (requires no operations between layers).
  • Memory pool strategy: Default uses separate pools per microbatch for graph reuse. Set `cuda_graph_use_single_mempool=True` for shared pool (higher graph count but may reduce fragmentation).

Workflow 6: MCore FullCudaGraphWrapper (Full-Iteration)

Goal: Maximum performance. Captures forward+backward for all microbatches as a single graph.
When to use: Maximum performance priority, static workloads, Megatron-LM training.
Steps:
  1. Enable via CLI flags:
    bash
    python pretrain_gpt.py \
        --enable-cuda-graph \
        --cuda-graph-scope full_iteration \
        --cuda-graph-warmup-steps 1 \
        --te-rng-tracker \
        --no-check-for-nan-in-loss-and-grad
  2. Ensure all forward+backward code is capturable (no `.item()`, no NaN check, no dynamic control flow).
  3. Optimizer remains in eager mode by default (outside the graph). Can be included inside the graph for maximum performance.
Key points:
  • Only 2 graphs total: One for training, one for validation.
  • `--te-rng-tracker` required: Standard RNG uses CPU scalars that cannot be captured; TE RNG uses device tensors compatible with graphs.
  • `--no-check-for-nan-in-loss-and-grad` mandatory: NaN checking uses `.item()`, which requires CPU-GPU sync and is forbidden during capture.
  • StaticBufferLoader: Pre-allocates input buffers for all microbatches during warmup.
  • Optimizer in/out of graph: Inside = maximum performance (all optimizer kernels captured). Outside = more flexible (can change optimizer/LR without recapture).
  • JIT capture: Graph captured during training at iteration `warmup_steps + 1`.

Workflow 7: torch.cuda.graph() (Manual)

Goal: Full control over capture and replay. Custom pipelines, full-iteration capture without Megatron.
When to use: Need fine-grained control, non-Megatron full-iteration capture, custom pipelines.
Inference pattern:
  1. Pre-allocate static input/output tensors:
    python
    static_input = torch.randn(batch_size, *shape, device="cuda")
  2. Warmup on a side stream (3 iterations, 11 for DDP):
    python
    s = torch.cuda.Stream()
    with torch.cuda.stream(s):
        for _ in range(3):
            _ = model(static_input)
    torch.cuda.current_stream().wait_stream(s)
  3. Capture the graph:
    python
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)
  4. Replay loop -- update inputs via `.copy_()`, clone outputs:
    python
    for data in loader:
        static_input.copy_(data)
        g.replay()
        result = static_output.clone()
Full training pattern (forward + backward in one graph; the optimizer step runs eagerly after each replay):

```python
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

static_input = torch.randn(batch_size, *shape, device="cuda")
static_target = torch.randint(0, num_classes, (batch_size,), device="cuda")

# Warmup
s = torch.cuda.Stream()
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad()
        with torch.amp.autocast("cuda", cache_enabled=False):
            out = model(static_input)
            loss = criterion(out, static_target)
        loss.backward()
torch.cuda.current_stream().wait_stream(s)

# Capture
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    optimizer.zero_grad()
    with torch.amp.autocast("cuda", cache_enabled=False):
        static_output = model(static_input)
        static_loss = criterion(static_output, static_target)
    static_loss.backward()

# Replay loop
for data, target in loader:
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
    optimizer.step()
```

**DDP setup:**

```python
import os
from torch.nn.parallel import DistributedDataParallel

os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "0"

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    model = DistributedDataParallel(model)

# 11 warmup iterations for DDP
with torch.cuda.stream(s):
    for _ in range(11):
        out = model(static_input)
        out.sum().backward()
torch.cuda.current_stream().wait_stream(s)

# Capture on the same side stream
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, stream=s):
    static_output = model(static_input)
```

**Memory pool sharing for multiple graphs:**

```python
g1 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g1):
    out1 = model_a(static_in_a)

# Second graph shares first graph's memory pool
g2 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g2, pool=g1.pool()):
    out2 = model_b(static_in_b)
```

**Custom RNG registration:**

```python
gen = torch.cuda.default_generators[0]
g = torch.cuda.CUDAGraph()
g.register_generator_state(gen)
with torch.cuda.graph(g):
    out = model(static_input)  # RNG state properly captured
```

Navigating Between Workflows

  • torch.compile gives insufficient speedup --> escalate to `make_graphed_callables` (Workflow 3) for larger, fewer graphs.
  • make_graphed_callables can't handle FP8/PP --> TE `make_graphed_callables` (Workflow 4).
  • Need Megatron per-layer automatic --> CudaGraphManager (Workflow 5).
  • Want maximum perf --> FullCudaGraphWrapper (Workflow 6) or manual full-iteration capture (Workflow 7).
  • Something too hard to graph --> partial capture (graph what you can, leave the rest in eager mode).
  • User wants best absolute perf --> skip directly to Workflow 6 (Megatron) or Workflow 7 (manual).
  • Start small, expand progressively: Begin with one module/layer. Verify correctness. Then expand to more layers, full forward pass, add backward, and eventually full iteration with optimizer.

Making Code Graph-Compatible

These principles apply to all workflows. Code inside the captured region must satisfy three constraints.

Principle 1: GPU-Only

Only GPU operations are captured. CPU-side code (Python logic, I/O, logging) executes during capture but is eliminated during replay.
Violations:
  • File I/O: `data = torch.load("file.pt")` won't reload on replay
  • CPU preprocessing: `tokens = tokenizer.encode(text)` won't re-tokenize
  • Logging: `print(f"Step {i}")` won't print during replay
  • CPU RNG: `random.randint(0, 10)` won't regenerate
  • CPU bookkeeping: `buffer.append(tensor)` won't populate during replay
Fix: Move all CPU-side operations outside the graphed region.
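A toy model can make this behavior concrete without a GPU. The `ToyGraph` class below is purely illustrative and is not how CUDA Graphs are implemented: it records "device" operations during a capture pass and replays only those, so the host-side `append` runs once while the recorded operation runs on every replay.

```python
class ToyGraph:
    """Toy stand-in for a CUDA graph: records ops at capture, runs them on replay."""

    def __init__(self):
        self.ops = []

    def launch(self, op):
        # Capture only records the "kernel"; nothing executes yet.
        self.ops.append(op)

    def replay(self):
        for op in self.ops:
            op()


device_counter = [0]  # stands in for a tensor in GPU memory
host_log = []         # stands in for CPU-side bookkeeping


def increment():
    device_counter[0] += 1


def step(graph):
    host_log.append("step ran")  # plain Python: runs once, during capture
    graph.launch(increment)      # "GPU work": recorded for replay


g = ToyGraph()
step(g)             # capture pass
for _ in range(3):
    g.replay()      # replays only the recorded op

print(device_counter[0], len(host_log))  # prints "3 1"
```

The host log holds a single entry after three replays, which is the same reason logging or `buffer.append(tensor)` inside a captured region goes silent.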

Principle 2: Sync-Free

No CPU-GPU synchronization inside the graph. The CPU queues work continuously without waiting for GPU results.
Violations:
  • `.item()` to get scalar values
  • `.cpu()` to move tensors for inspection
  • `torch.cuda.synchronize()` or `stream.synchronize()`
  • `print(tensor)` (implicitly syncs)
Fix: Invoke the perf-torch-sync-free skill for systematic detection and elimination of sync points. Use `torch.cuda.set_sync_debug_mode("warn")` to find hidden syncs.
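Before running anything, a rough static pass over the source can flag the explicit violations listed above. This is a hedged sketch using only the standard `ast` module; `find_sync_calls` is a hypothetical helper, it matches on method names alone (so it can report false positives), and it cannot see implicit syncs such as `print(tensor)`.

```python
import ast

# Method names that commonly force a host-device sync (name-based heuristic).
SYNC_METHODS = {"item", "cpu", "numpy", "synchronize"}


def find_sync_calls(source):
    """Return sorted (line, method_name) pairs for suspicious method calls."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr in SYNC_METHODS:
                hits.append((node.lineno, node.func.attr))
    return sorted(hits)


sample = (
    "loss = model(x)\n"
    "if loss.item() > 1.0:\n"
    "    torch.cuda.synchronize()\n"
    "y = x.relu()\n"
)
print(find_sync_calls(sample))  # prints [(2, 'item'), (3, 'synchronize')]
```

The source is only parsed, never executed, so the scan works on files that import libraries not installed locally.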

Principle 3: Static

All operations, control flow, memory addresses, and shapes must be fixed across all replays.
Violations and fixes:

| Dynamic aspect | Fix |
| --- | --- |
| `if loss > threshold:` | `torch.where(condition, a, b)` |
| `input = new_tensor` (address changes) | Pre-allocate + `.copy_()` |
| Python scalars (lr, temperature) | GPU tensor + `.fill_()` |
| Variable batch size / sequence length | Padding or bucketing |
| MoE / dynamic routing | Partial graphing |

For detailed patterns, see `references/patterns-dynamic.md`.
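For variable sequence lengths, bucketing usually means capturing one graph per allowed length and padding each input up to the nearest bucket. The sketch below shows only the bucket selection and padding logic on plain Python lists; the helper names, the bucket sizes, and `pad_id=0` are assumptions for illustration.

```python
import bisect


def pick_bucket(length, buckets):
    """Return the smallest bucket size that fits `length`."""
    ordered = sorted(buckets)
    index = bisect.bisect_left(ordered, length)
    if index == len(ordered):
        raise ValueError(f"length {length} exceeds largest bucket {ordered[-1]}")
    return ordered[index]


def pad_to_bucket(tokens, buckets, pad_id=0):
    """Pad a token list so its length matches one of the fixed bucket sizes."""
    target = pick_bucket(len(tokens), buckets)
    return tokens + [pad_id] * (target - len(tokens))


print(pick_bucket(129, [128, 256, 512]))     # prints 256
print(pad_to_bucket([7, 8, 9], [4, 8]))      # prints [7, 8, 9, 0]
```

Fewer buckets mean fewer graphs to capture and hold in memory, at the cost of more wasted compute on padding.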

Compatibility Checklist

Verify every item before attempting capture:
  • No `.item()`, `.cpu()`, `.numpy()`, `print(tensor)` inside graph
  • No `torch.cuda.synchronize()` or `stream.synchronize()`
  • No `if tensor_value:` -- use `torch.where()` instead
  • All inputs pre-allocated, updated via `.copy_()`
  • All shapes fixed (use padding or bucketing for variable sizes)
  • Python scalars --> GPU tensors with `.fill_()`
  • Output tensors `.clone()`d before next replay
  • `cache_enabled=False` with `torch.amp.autocast`
  • Custom RNG generators registered with `graph.register_generator_state()`
  • Use `graphsafe_get_state()` / `graphsafe_set_state()` for RNG
  • Warmup completed (3 standard, 11 for DDP)
  • DDP: `TORCH_NCCL_ASYNC_ERROR_HANDLING=0`, construct on side stream
  • DDP: NCCL >= 2.9.6 for full graph capture
  • Libraries/extensions use `torch.cuda.current_stream()`, not default stream
  • No pinned memory allocation during capture (triggers hidden event query)
  • `activation_checkpointing`: `preserve_rng_state=False`
  • Global tensors used in graph kept alive (not deleted/reassigned)
  • No `torch.compile` functions inside manual capture without prior warmup
  • Gradient clipping uses sync-free `clip_grad_norm_` (PyTorch >= 1.13)
For the complete checklist with references, see `references/patterns-compatibility.md`.

Output Formats

Success indicators:
  • `g.replay()` completes without errors
  • Outputs match eager mode within tolerance (`torch.allclose`)
  • Nsight Systems profile shows single graph launch replacing many kernels
  • GPU utilization increases, training/inference latency decreases
Key metrics:

| Metric | How to Check |
| --- | --- |
| Correctness | `torch.allclose(eager, graphed, rtol=1e-5)` |
| Speedup | Wall-clock time comparison |
| GPU utilization | `nvidia-smi` or Nsight Systems timeline |
| Memory overhead | `torch.cuda.memory_summary()` |
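For the wall-clock comparison, a small harness that warms up and takes the median of several runs gives steadier numbers than a single timing. This is a generic sketch with hypothetical helper names. When timing GPU work, each callable is assumed to wait for the device (for example by ending with `torch.cuda.synchronize()`); otherwise only launch time is measured.

```python
import statistics
import time


def median_seconds(fn, iters=20, warmup=3):
    """Run `fn` a few times untimed, then return the median of `iters` timed runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(eager_fn, graphed_fn, iters=20, warmup=3):
    """Ratio of eager to graphed median time; values above 1.0 mean graphs are faster."""
    eager = median_seconds(eager_fn, iters, warmup)
    graphed = median_seconds(graphed_fn, iters, warmup)
    return eager / max(graphed, 1e-12)  # guard against a zero-length sample


# Stand-in callables; replace with the eager and graphed steps being compared.
ratio = speedup(lambda: time.sleep(0.002), lambda: None, iters=5, warmup=1)
print(ratio > 1.0)  # prints True
```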

Error Handling

| Error | Cause | Fix |
| --- | --- | --- |
| StreamCaptureUnsupported (900) | Sync op during capture (.item(), .cpu()) | Move sync outside graph |
| StreamCaptureInvalidated (901) | Background thread (e.g., pin_memory) | capture_error_mode="thread_local" |
| StreamCaptureUnjoined (904) | Side stream didn't rejoin capture stream | capture_stream.wait_stream(side_stream) |
| StreamCaptureImplicit (906) | AccumulateGrad on default stream | Warmup on side stream before capture |
| Illegal memory access | Input tensor freed/reassigned | Keep persistent ref, use .copy_() |
| Wrong numerical results | Dynamic behavior frozen at capture | See references/patterns-compatibility.md |
| OOM with multiple graphs | Pools can't share memory | pool=g1.pool() for sequential graphs |
| No speedup | Already GPU-bound or wrong capture scope | Profile with nsys first (Workflow 1) |
| FP8 scaling corruption | TE without fp8_autocast during replay | Wrap with fp8_autocast(enabled=True) |
| PP replay order mismatch | Wrong execution order during replay | Match _order / capture sequence exactly |
| FullCudaGraphWrapper capture fail | NaN check or sync enabled | --no-check-for-nan-in-loss-and-grad |
| RNG failure with FullCudaGraphWrapper | Standard RNG not capturable | --te-rng-tracker |
| DDP capture failure | Async error handling watchdog | TORCH_NCCL_ASYNC_ERROR_HANDLING=0 |
| DDP AccumulateGrad on default stream | DDP constructed on default stream | Construct DDP in side stream context |
| Autocast cache invalidation | Cached cast tensors freed on exit | cache_enabled=False |
For detailed troubleshooting, see references/troubleshooting.md.
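As an illustration of the "OOM with multiple graphs" row, the sketch below captures two graphs that always run back to back and lets the second reuse the first graph's memory pool via pool=g1.pool(). The helper name capture_sequential_graphs and the toy lambdas are hypothetical. Sharing a pool is only safe when the graphs are replayed in capture order and never concurrently.

```python
import torch


def capture_sequential_graphs(fn1, fn2, static_x, warmup_iters=3):
    """Capture fn1 then fn2 as two graphs that share one memory pool."""
    # Warm up on a side stream before capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(warmup_iters):
            fn2(fn1(static_x))
    torch.cuda.current_stream().wait_stream(side)

    g1 = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g1):
        mid = fn1(static_x)

    # Reuse g1's pool instead of reserving a second private pool.
    g2 = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g2, pool=g1.pool()):
        out = fn2(mid)

    # Return mid as well so the intermediate tensor stays referenced.
    return g1, g2, mid, out


if __name__ == "__main__" and torch.cuda.is_available():
    x = torch.zeros(1024, device="cuda")
    g1, g2, mid, out = capture_sequential_graphs(lambda t: t * 2, lambda t: t + 1, x)
    x.copy_(torch.ones(1024, device="cuda"))
    g1.replay()  # replay in the same order as capture
    g2.replay()
    torch.cuda.synchronize()
    print(out[:4])
```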

Finding More Information


Use this 3-tier lookup hierarchy -- start at Tier 1 and escalate only when needed.

Tier 1: This File (SKILL.md)


You are reading it now. The workflows, compatibility checklist, and error table above cover the most common tasks. Search this file first before going deeper.

Tier 2: references/ Directory


The references/ directory beside this file contains distilled reference material -- API details, patterns, and troubleshooting pages.
How to search:
  1. Grep for your keyword across references/ -- headers are designed to be grep-friendly.
  2. Read only the file that grep points you to. Do not read every file.
Available references:
  • references/api-pytorch.md -- PyTorch CUDA Graph APIs (torch.cuda.graph, make_graphed_callables, torch.compile reduce-overhead)
  • references/api-te-megatron.md -- TE make_graphed_callables, CudaGraphManager, FullCudaGraphWrapper implementations
  • references/patterns-compatibility.md -- GPU-only, sync-free, and static principles with full checklist
  • references/patterns-dynamic.md -- Dynamic control flow, tensors, scalars, shapes: workarounds and patterns
  • references/troubleshooting.md -- Capture failures, numerical errors, memory issues, performance issues
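Where a shell grep is not at hand, the same keyword search can be approximated in a few lines of Python. This is a small sketch, assuming the references live in a references/ directory of .md files relative to the working directory; grep_references is a hypothetical helper name.

```python
from pathlib import Path


def grep_references(keyword, root="references"):
    """Return (file_name, line_number, line) for each case-insensitive match."""
    hits = []
    needle = keyword.lower()
    for path in sorted(Path(root).glob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if needle in line.lower():
                hits.append((path.name, number, line.strip()))
    return hits


if __name__ == "__main__":
    for name, number, line in grep_references("make_graphed_callables"):
        print(f"{name}:{number}: {line}")
```

The output points to the file and line to open, which matches the rule of reading only the file the search points to.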

Tier 3: Original Documentation


If Tiers 1-2 do not answer the question, consult the original sources:
  • NVIDIA guide: https://docs.nvidia.com/dl-cuda-graph/latest/index.html
  • PyTorch docs: https://docs.pytorch.org/docs/stable/notes/cuda.html (CUDA Graphs section)
  • TE docs: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
  • Megatron Core docs: https://docs.nvidia.com/megatron-core/developer-guide/latest/index.html
Return to Tier 2 afterward and consider whether the answer should be distilled into the references directory for next time.