perf-torch-cuda-graphs

CUDA Graphs for PyTorch

CUDA Graphs capture a sequence of GPU operations once and replay them with minimal CPU overhead. This skill guides applying CUDA Graphs to PyTorch training and inference workloads using native PyTorch APIs, Transformer Engine, and Megatron-LM.

When to Use

Reach for this skill when you encounter:

Triggers: User wants to optimize with CUDA Graphs, reduce kernel launch overhead, or speed up training/inference loops
Symptoms: Low GPU utilization (<80%), many small kernel launches (<50 us each), CPU-bound training, high kernel launch latency visible in Nsight Systems profiles
Keywords: "CUDA graph", "torch.cuda.graph", "make_graphed_callables", "reduce-overhead", "graph capture", "graph replay", "kernel launch overhead", "CudaGraphManager", "FullCudaGraphWrapper", "full-iteration graph", "stream capture"

Do NOT use this skill for:

General PyTorch performance tuning unrelated to kernel launch overhead
CUDA kernel development or custom CUDA C++ code
Host-device sync elimination only (use perf-torch-sync-free skill instead)
Nsight Systems profiling (use perf-nsight-systems skill)
TensorFlow/JAX graph compilation (different APIs entirely)

Requirements

Dependency	Version	Notes
PyTorch	>= 1.10	`torch.cuda.graph()` available
CUDA	>= 11.0	Graph update APIs
GPU	NVIDIA (any)	Required for CUDA
Nsight Systems	any	Optional, for profiling
APEX	any	Optional, for capturable optimizers
Transformer Engine	>= 2.2	Optional, for FP8-aware graphing
Megatron-LM	core >= 0.14.0	Optional, for CudaGraphManager / FullCudaGraphWrapper

API Selection Guide

Choose the API based on your framework and performance needs.

Situation	API	Workflow
Quick experiment, unknown graph boundaries	`torch.compile(mode="reduce-overhead")`	Workflow 2
Training, need autograd, no FP8/PP	`torch.cuda.make_graphed_callables()`	Workflow 3
Any PyTorch model, FP8 or PP support	TE `make_graphed_callables`	Workflow 4
Megatron-LM, per-layer, automatic	MCore `CudaGraphManager`	Workflow 5
Maximum perf, full-iteration capture	MCore `FullCudaGraphWrapper`	Workflow 6
Full manual control, custom pipelines	`torch.cuda.graph()`	Workflow 7

Decision flowchart:

Using Megatron-LM with FP8/PP?
- Yes, want maximum perf with static workload --> Workflow 6 (FullCudaGraphWrapper)
- Yes, want per-layer automatic graphing --> Workflow 5 (CudaGraphManager)
- Yes, want manual control over what gets graphed --> Workflow 4 (TE make_graphed_callables)
Using Transformer Engine without Megatron?
- Yes, need FP8 or PP --> Workflow 4 (TE make_graphed_callables)
General PyTorch?
- Want zero effort, okay with fragmented graphs --> Workflow 2 (torch.compile)
- Want autograd support, training loop --> Workflow 3 (PyTorch make_graphed_callables)
- Want full manual control --> Workflow 7 (torch.cuda.graph)

Strategy: Start with the highest-level API available for your framework. Move to lower-level APIs only if you need more control, hit limitations, or do not achieve the expected performance improvement.
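The decision flowchart above can be written down as a small lookup, which may help when the choice has to be made inside a script. This is only a sketch: the function name, the `framework` values, and the `goal` labels are hypothetical and simply mirror the flowchart branches.

```python
# Hypothetical encoding of the decision flowchart; the labels are illustrative only.
FLOWCHART = {
    "megatron": {"max_perf": 6, "per_layer": 5, "manual": 4},
    "pytorch": {"zero_effort": 2, "autograd": 3, "manual": 7},
}


def choose_workflow(framework, goal=None, needs_fp8_or_pp=False):
    """Return the workflow number suggested by the flowchart.

    framework: "megatron", "te" (Transformer Engine without Megatron), or "pytorch".
    goal: one of the keys in FLOWCHART for the chosen framework.
    """
    if framework in ("te", "pytorch"):
        if needs_fp8_or_pp:
            return 4  # FP8 or PP outside Megatron -> TE make_graphed_callables
        framework = "pytorch"  # otherwise fall through to the general PyTorch branch
    try:
        return FLOWCHART[framework][goal]
    except KeyError:
        raise ValueError(f"no flowchart branch for {framework!r} / {goal!r}") from None


print(choose_workflow("megatron", "per_layer"))
print(choose_workflow("pytorch", "zero_effort"))
```

In line with the strategy above, the result is a starting point: move to a lower-level workflow only if the suggested one falls short.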

Workflows

Workflow 1: Profile and Decide Whether Graphs Help

Goal: Determine if CUDA Graphs will benefit your workload before investing effort.

Profile with Nsight Systems:

bash

nsys profile --cuda-graph-trace=graph python train.py

Check GPU utilization -- if already >95%, graphs won't help much.
Look for gaps between kernel launches (CPU overhead) and many small kernels (<50 us each). These are the targets for graphing.

Annotate regions of interest to correlate idle GPU time with code:

python

with torch.cuda.nvtx.range("forward"):
    output = model(input)

Estimate benefit: count kernels per iteration. Workloads with hundreds of small kernels and <80% GPU utilization are strong candidates.

Expected result: Identified bottleneck regions with low GPU occupancy between kernels. Proceed to the appropriate workflow from the API Selection Guide.
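Once kernel durations for one iteration have been exported from a profile, the "estimate benefit" step can be reduced to a few numbers. The helper below is a hypothetical sketch: the function name, the input format (a plain list of durations in microseconds), and the cutoff of 100 small kernels are assumptions, while the 50 us and 80% thresholds come from the text above.

```python
def summarize_kernels(durations_us, iteration_time_us, small_threshold_us=50.0):
    """Summarize one iteration's kernels (durations in microseconds)."""
    busy_us = sum(durations_us)
    num_small = sum(1 for d in durations_us if d < small_threshold_us)
    busy_fraction = busy_us / iteration_time_us
    return {
        "num_kernels": len(durations_us),
        "num_small_kernels": num_small,
        "gpu_busy_fraction": busy_fraction,
        # Assumed reading of "hundreds of small kernels and <80% utilization".
        "strong_candidate": num_small >= 100 and busy_fraction < 0.80,
    }


# Example: 400 kernels of 10 us each inside an 8 ms iteration.
stats = summarize_kernels([10.0] * 400, iteration_time_us=8000.0)
print(stats)
```

A low busy fraction combined with many small kernels suggests the GPU is waiting on CPU-side launches, which is the case graphs target.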

Workflow 2: torch.compile(mode="reduce-overhead")

Goal: Automatic CUDA Graph capture with zero manual effort.

When to use: Quick experiment, unknown graph boundaries, or already using `torch.compile`.

Steps:

Decorate the training step with `@torch.compile(mode="reduce-overhead")`:

python

@torch.compile(mode="reduce-overhead")
def train_step(model, x, target, criterion):
    output = model(x)
    loss = criterion(output, target)
    loss.backward()
    return loss

Run the training loop normally -- graphs are captured automatically.
Profile with Nsight Systems to see captured graphs:

bash

nsys profile --cuda-graph-trace=graph python train.py

If you see too many small graphs (graph fragmentation), check for graph breaks: `.item()`, `print()`, and data-dependent control flow. Fix these or escalate to Workflow 3+.

Trade-offs:

Zero effort, but may create fragmented small graphs.
Limited control over what gets graphed.
Graph fragmentation limits performance gains compared to manual approaches.

Workflow 3: torch.cuda.make_graphed_callables()

Goal: Training with autograd support. Separate forward/backward graphs.

When to use: Training with custom loops, non-FP8, need autograd.

Steps:

Prepare sample inputs matching training batch shape:

python

sample_input = torch.randn(batch_size, seq_len, hidden_size, device="cuda")

Create the graphed model:

python

graphed_model = torch.cuda.make_graphed_callables(
    model, (sample_input,), num_warmup_iters=3
)

Use `graphed_model` as a drop-in replacement in the training loop:

python

for data, target in dataloader:
    optimizer.zero_grad()
    output = graphed_model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

If using AMP, set `cache_enabled=False`:

python

for data, target in dataloader:
    optimizer.zero_grad()
    with torch.amp.autocast("cuda", cache_enabled=False):
        output = graphed_model(data)
        loss = criterion(output, target)
    loss.backward()
    optimizer.step()

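If using DDP, construct DDP on a side stream and use 11 warmup iterations:

python

import os

import torch
from torch.nn.parallel import DistributedDataParallel

# Disable NCCL async error handling before wrapping the model
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "0"

# Construct DDP on a side stream, then let the current stream wait for it
s = torch.cuda.Stream()
with torch.cuda.stream(s):
    model = DistributedDataParallel(model)
torch.cuda.current_stream().wait_stream(s)

# DDP needs 11 warmup iterations instead of the usual 3
graphed_model = torch.cuda.make_graphed_callables(
    model, (sample_input,), num_warmup_iters=11
)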

Limitations:

No double backward (higher-order gradients).
No module hooks during capture.
Module structure is frozen after graphing (no add/remove parameters).
Argument signature must match `sample_args` exactly.
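The steps above can be combined into one small script. The sketch below is illustrative rather than definitive: `graph_if_cuda` is a hypothetical helper that falls back to the eager model when no GPU is present, and the tiny MLP, shapes, and SGD optimizer are placeholders. Note that the sample input has the same shape and `requires_grad` state as the real batches.

```python
import torch
import torch.nn.functional as F


def graph_if_cuda(model, sample_input, num_warmup_iters=3):
    """Return a graphed callable on CUDA, or the eager model otherwise (hypothetical helper)."""
    if not torch.cuda.is_available():
        return model
    return torch.cuda.make_graphed_callables(
        model, (sample_input,), num_warmup_iters=num_warmup_iters
    )


device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4)
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# The sample input fixes the shape that every later batch must match.
sample_input = torch.randn(8, 16, device=device)
graphed_model = graph_if_cuda(model, sample_input)

for _ in range(3):
    data = torch.randn(8, 16, device=device)
    target = torch.randint(0, 4, (8,), device=device)
    optimizer.zero_grad(set_to_none=True)
    loss = F.cross_entropy(graphed_model(data), target)  # loss stays in eager mode
    loss.backward()
    optimizer.step()
```

Only the module's forward and backward are graphed here; the loss and the optimizer step still run eagerly, which is what keeps this workflow a drop-in change.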

Workflow 4: TE make_graphed_callables

Goal: Per-callable graphing with FP8 support and pipeline parallelism.

When to use: FP8 training, PP with manual scheduling, non-Megatron models needing FP8, or any PyTorch model that needs FP8-aware CUDA Graphs.

Steps:

Import and configure:

python

from transformer_engine.pytorch.graph import make_graphed_callables
from transformer_engine.pytorch.fp8 import fp8_autocast

Prepare sample inputs (one per callable per microbatch per chunk):

python

sample_args = tuple(
    (torch.randn(batch_size, seq_len, hidden_size, device="cuda"),)
    for _ in range(num_callables * num_microbatches)
)

Define pipeline schedule if using PP (1-indexed chunk IDs, positive=fwd, negative=bwd):

python

# Example: 2 chunks, 3 microbatches
layer_order = [1, 2, 1, 2, 1, 2, -2, -1, -2, -1, -2, -1]

Wrap layers in CUDA Graphs:

python

graphed_layers = make_graphed_callables(
    tuple(layers),
    sample_args=sample_args,
    fp8_enabled=True,
    fp8_recipe=fp8_recipe,
    fp8_weight_caching=True,
    _order=layer_order,  # None for no PP
)

Training loop -- wrap with `fp8_autocast` during replay:

python

with fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    for layer in graphed_layers[start:end]:
        x = layer(x, is_first_microbatch=(mb_idx == 0))
# FP8 scaling auto-updated on fp8_autocast exit
optimizer.step()

Key points:

AOT capture: Graphs are captured before the training loop, when you call `make_graphed_callables()`.
Replay order must match `_order`: The training loop must execute graphs in the same interleaved order as specified during capture.
`fp8_autocast` required during replay: Without it, FP8 state is not properly configured.
Weight caching: `fp8_weight_caching=True` caches FP8 weight quantization across microbatches; pass the `is_first_microbatch` kwarg to control when weights are requantized.

For full API details, see `references/api-te-megatron.md`.
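Writing the `_order` list by hand is error-prone, so it can help to generate and sanity-check it. The sketch below reproduces only the pattern of the example above (all forwards, then all backwards in reverse chunk order); real pipeline schedules may interleave differently. Both function names are hypothetical, and the validity rule (every backward has an earlier unmatched forward of the same chunk) is an assumed sanity check, not a documented Transformer Engine requirement.

```python
def build_layer_order(num_chunks, num_microbatches):
    """All forward passes first, then all backward passes (1-indexed chunk IDs)."""
    forward = [c for _ in range(num_microbatches) for c in range(1, num_chunks + 1)]
    backward = [-c for _ in range(num_microbatches) for c in range(num_chunks, 0, -1)]
    return forward + backward


def is_balanced_order(order, num_chunks):
    """Check that each backward entry follows an unmatched forward of the same chunk."""
    pending = [0] * (num_chunks + 1)
    for entry in order:
        chunk = abs(entry)
        if chunk == 0 or chunk > num_chunks:
            return False
        if entry > 0:
            pending[chunk] += 1
        elif pending[chunk] == 0:
            return False
        else:
            pending[chunk] -= 1
    return all(count == 0 for count in pending)


layer_order = build_layer_order(num_chunks=2, num_microbatches=3)
print(layer_order)
print(is_balanced_order(layer_order, num_chunks=2))
```

Because the replay order must match the capture order, the same list can also drive the training loop that replays the graphs.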

Workflow 5: MCore CudaGraphManager (Per-Layer)

Goal: Automatic per-layer graphing for Megatron-LM training.

When to use: Megatron-LM training, especially with PP > 1. Default choice for Megatron users.

Steps:

Enable via CLI flags (no code changes needed):

bash

python pretrain_gpt.py \
    --enable-cuda-graph \
    --cuda-graph-warmup-steps 3

Or enable via Python config:

python

config = TransformerConfig(
    enable_cuda_graph=True,
    cuda_graph_warmup_steps=3,
)

Training loop is unchanged -- graphs are captured automatically after warmup iterations.

Key points:

Megatron layers only: Works with `TransformerLayer` and `MambaLayer`.
JIT capture: Records execution order during warmup, captures graphs after warmup completes, then replays on subsequent iterations.
Automatic FP8 handling: Uses `fp8_autocast(..., _graph=True)` to skip per-layer amax reduction; reduction happens once after all backward graphs.
Automatic PP support: Handles microbatch interleaving automatically.
Memory savings: Set `cuda_graph_share_io_buffers=True` to share I/O buffers between layers (requires no operations between layers).
Memory pool strategy: By default, each microbatch gets a separate pool so graphs can be reused. Set `cuda_graph_use_single_mempool=True` for a shared pool (higher graph count, but may reduce fragmentation).

Workflow 6: MCore FullCudaGraphWrapper (Full-Iteration)

Goal: Maximum performance. Captures forward+backward for all microbatches as a single graph.

When to use: Maximum performance priority, static workloads, Megatron-LM training.

Steps:

Enable via CLI flags:

bash

python pretrain_gpt.py \
    --enable-cuda-graph \
    --cuda-graph-scope full_iteration \
    --cuda-graph-warmup-steps 1 \
    --te-rng-tracker \
    --no-check-for-nan-in-loss-and-grad

Ensure all forward+backward code is capturable (no `.item()`, no NaN check, no dynamic control flow).
Optimizer remains in eager mode by default (outside the graph). It can be included inside the graph for maximum performance.

Key points:

Only 2 graphs total: One for training, one for validation.
`--te-rng-tracker` required: Standard RNG uses CPU scalars that cannot be captured; TE RNG uses device tensors compatible with graphs.
`--no-check-for-nan-in-loss-and-grad` mandatory: NaN checking uses `.item()`, which requires a CPU-GPU sync and is forbidden during capture.
StaticBufferLoader: Pre-allocates input buffers for all microbatches during warmup.
Optimizer in/out of graph: Inside = maximum performance (all optimizer kernels captured). Outside = more flexible (can change optimizer/LR without recapture).
JIT capture: Graph captured during training at iteration `warmup_steps + 1`.

Workflow 7: torch.cuda.graph() (Manual)

Goal: Full control over capture and replay. Custom pipelines, full-iteration capture without Megatron.

When to use: Need fine-grained control, non-Megatron full-iteration capture, custom pipelines.

Inference pattern:

Pre-allocate static input/output tensors:

python

static_input = torch.randn(batch_size, *shape, device="cuda")

Warmup on a side stream (3 iterations, 11 for DDP):

python

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    for _ in range(3):
        _ = model(static_input)
torch.cuda.current_stream().wait_stream(s)

Capture the graph:

python

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

Replay loop -- update inputs via `.copy_()`, clone outputs:

python

for data in loader:
    static_input.copy_(data)
    g.replay()
    result = static_output.clone()
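The four inference steps can be bundled into one object so callers never touch the static buffers directly. The class below is a hypothetical sketch, not part of PyTorch: `GraphedInference` and its attributes are names chosen for illustration, it assumes every batch has exactly the shape of `example_input`, and it falls back to eager execution when the input is not on a CUDA device.

```python
import torch


class GraphedInference:
    """Warm up, capture once, then replay with static input/output buffers (sketch)."""

    def __init__(self, model, example_input, warmup_iters=3):
        self.model = model
        self.graph = None
        self.static_input = example_input.clone()
        self.static_output = None
        if not example_input.is_cuda:
            return  # no CUDA: __call__ runs the model eagerly
        with torch.no_grad():
            # Warm up on a side stream before capture.
            side = torch.cuda.Stream()
            side.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(side):
                for _ in range(warmup_iters):
                    model(self.static_input)
            torch.cuda.current_stream().wait_stream(side)
            # Capture one forward pass into the graph.
            self.graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(self.graph):
                self.static_output = model(self.static_input)

    def __call__(self, batch):
        if self.graph is None:
            with torch.no_grad():
                return self.model(batch)
        self.static_input.copy_(batch)      # same address, new data
        self.graph.replay()
        return self.static_output.clone()   # detach the result from the static buffer


device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 2).to(device).eval()
runner = GraphedInference(model, torch.randn(4, 8, device=device))
batch = torch.randn(4, 8, device=device)
result = runner(batch)
```

Cloning the output costs one extra copy per call, but it keeps earlier results valid after the next replay overwrites the static buffer.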

Full training pattern (fwd+bwd in one graph, optimizer step in eager mode):

python

model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

static_input = torch.randn(batch_size, *shape, device="cuda")
static_target = torch.randint(0, num_classes, (batch_size,), device="cuda")

# Warmup
s = torch.cuda.Stream()
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad()
        with torch.amp.autocast("cuda", cache_enabled=False):
            out = model(static_input)
            loss = criterion(out, static_target)
        loss.backward()
torch.cuda.current_stream().wait_stream(s)

# Capture
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    optimizer.zero_grad()
    with torch.amp.autocast("cuda", cache_enabled=False):
        static_output = model(static_input)
        static_loss = criterion(static_output, static_target)
    static_loss.backward()

# Replay loop
for data, target in loader:
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
    optimizer.step()

DDP setup:

python

os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "0"

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    model = DistributedDataParallel(model)

# 11 warmup iterations for DDP
with torch.cuda.stream(s):
    for _ in range(11):
        out = model(static_input)
        out.sum().backward()
torch.cuda.current_stream().wait_stream(s)

# Capture on the same side stream
with torch.cuda.graph(g):
    static_output = model(static_input)

Memory pool sharing for multiple graphs:

python

g1 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g1):
    out1 = model_a(static_in_a)

# Second graph shares first graph's memory pool
g2 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g2, pool=g1.pool()):
    out2 = model_b(static_in_b)

Custom RNG registration:

python

gen = torch.cuda.default_generators[0]
g = torch.cuda.CUDAGraph()
g.register_generator_state(gen)
with torch.cuda.graph(g):
    out = model(static_input)  # RNG state properly captured

Navigating Between Workflows

`torch.compile` gives insufficient speedup --> escalate to `make_graphed_callables` (Workflow 3) for larger, fewer graphs.
`make_graphed_callables` can't handle FP8/PP --> TE `make_graphed_callables` (Workflow 4).
Need Megatron per-layer automatic --> CudaGraphManager (Workflow 5).
Want maximum perf --> FullCudaGraphWrapper (Workflow 6) or manual full-iteration capture (Workflow 7).
Something too hard to graph --> partial capture (graph what you can, leave the rest in eager mode).
User wants best absolute perf --> skip directly to Workflow 6 (Megatron) or Workflow 7 (manual).
Start small, expand progressively: Begin with one module/layer. Verify correctness. Then expand to more layers, full forward pass, add backward, and eventually full iteration with optimizer.

Making Code Graph-Compatible

These principles apply to all workflows. Code inside the captured region must satisfy three constraints.

Principle 1: GPU-Only

Only GPU operations are captured. CPU-side code (Python logic, I/O, logging) executes during capture but is eliminated during replay.

Violations:

File I/O: `data = torch.load("file.pt")` won't reload on replay
CPU preprocessing: `tokens = tokenizer.encode(text)` won't re-tokenize
Logging: `print(f"Step {i}")` won't print during replay
CPU RNG: `random.randint(0, 10)` won't regenerate
CPU bookkeeping: `buffer.append(tensor)` won't populate during replay

Fix: Move all CPU-side operations outside the graphed region.

Principle 2: Sync-Free

No CPU-GPU synchronization inside the graph. The CPU queues work continuously without waiting for GPU results.

Violations:

`.item()` to get scalar values
`.cpu()` to move tensors for inspection
`torch.cuda.synchronize()` or `stream.synchronize()`
`print(tensor)` (implicitly syncs)

Fix: Invoke the perf-torch-sync-free skill for systematic detection and elimination of sync points. Use `torch.cuda.set_sync_debug_mode("warn")` to find hidden syncs.
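A common source of per-step syncs is loss logging. One way to keep the graphed region sync-free, sketched below under simple assumptions, is to accumulate the loss in a device tensor and call `.item()` only once every few steps, outside the graph. The loop, the `log_every` interval, and the stand-in loss values are all illustrative.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

running_loss = torch.zeros((), device=device)  # accumulator stays on the device
log_every = 4
logged = []

for step in range(1, 9):
    # Stand-in for the loss produced by a real (possibly graphed) training step.
    loss = torch.tensor(float(step), device=device)

    running_loss += loss.detach()  # no host-device sync here

    if step % log_every == 0:
        # One sync per log_every steps, performed outside any captured region.
        logged.append(running_loss.item() / log_every)
        running_loss.zero_()

print(logged)
```

The same idea applies to NaN checks and metrics: keep the reduction on the GPU and read it back only at a point where a sync is acceptable.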

Principle 3: Static

All operations, control flow, memory addresses, and shapes must be fixed across all replays.

Violations and fixes:

Dynamic aspect	Fix
`if loss > threshold:`	`torch.where(condition, a, b)`
`input = new_tensor` (address changes)	Pre-allocate + `.copy_()`
Python scalars (lr, temperature)	GPU tensor + `.fill_()`
Variable batch size / sequence length	Padding or bucketing
MoE / dynamic routing	Partial graphing

For detailed patterns, see `references/patterns-dynamic.md`.
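Three of the fixes in the table can be shown in a few lines. This is a minimal sketch with made-up names and values: `scale_loss`, `pad_to_bucket`, the bucket sizes, and the threshold are assumptions for illustration. It runs on CPU as well, although the patterns only matter inside a captured region.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# Constants and scalars live in device tensors allocated once, before capture.
threshold = torch.tensor(2.0, device=device)
damped = torch.tensor(0.5, device=device)
unchanged = torch.tensor(1.0, device=device)
lr = torch.empty((), device=device)


def scale_loss(loss):
    """Replace `if loss > threshold:` with a branch-free torch.where."""
    return loss * torch.where(loss > threshold, damped, unchanged)


# A Python scalar such as the learning rate is written in place between replays.
lr.fill_(1e-3)

BUCKETS = (128, 256, 512)  # hypothetical sequence-length buckets


def pad_to_bucket(x, buckets=BUCKETS):
    """Pad a [batch, seq, hidden] tensor along seq to the smallest fitting bucket."""
    seq_len = x.shape[1]
    for bucket in buckets:
        if seq_len <= bucket:
            # F.pad pairs run from the last dim backwards: (hidden, hidden, seq, seq).
            return F.pad(x, (0, 0, 0, bucket - seq_len)), bucket
    raise ValueError(f"sequence length {seq_len} exceeds the largest bucket")


scaled = scale_loss(torch.tensor(3.0, device=device))
padded, bucket = pad_to_bucket(torch.randn(2, 100, 16, device=device))
```

With bucketing, one graph is typically captured per bucket size, and each batch is routed to the graph that matches its padded shape.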

Compatibility Checklist

Verify every item before attempting capture:

No `.item()`, `.cpu()`, `.numpy()`, `print(tensor)` inside graph
No `torch.cuda.synchronize()` or `stream.synchronize()`
No `if tensor_value:` -- use `torch.where()` instead
All inputs pre-allocated, updated via `.copy_()`
All shapes fixed (use padding or bucketing for variable sizes)
Python scalars --> GPU tensors with `.fill_()`
Output tensors `.clone()`d before next replay
`cache_enabled=False` with `torch.amp.autocast`
Custom RNG generators registered with `graph.register_generator_state()`
Use `graphsafe_get_state()` / `graphsafe_set_state()` for RNG
Warmup completed (3 standard, 11 for DDP)
DDP: `TORCH_NCCL_ASYNC_ERROR_HANDLING=0`, construct on side stream
DDP: NCCL >= 2.9.6 for full graph capture
Libraries/extensions use `torch.cuda.current_stream()`, not default stream
No pinned memory allocation during capture (triggers hidden event query)
Activation checkpointing: `preserve_rng_state=False`
Global tensors used in graph kept alive (not deleted/reassigned)
No `torch.compile` functions inside manual capture without prior warmup
Gradient clipping uses sync-free `clip_grad_norm_` (PyTorch >= 1.13)

For the complete checklist with references, see `references/patterns-compatibility.md`.
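The first two checklist items lend themselves to a quick textual pre-check before attempting capture. The scanner below is a hypothetical helper, not an existing tool: it only greps source text for the call patterns listed above, so it can miss syncs hidden inside libraries and can flag harmless lines. Treat its output as candidates to inspect.

```python
import re

# (label, regex) pairs taken from the first two checklist items.
SYNC_PATTERNS = [
    (".item()", re.compile(r"\.item\(\)")),
    (".cpu()", re.compile(r"\.cpu\(\)")),
    (".numpy()", re.compile(r"\.numpy\(\)")),
    ("synchronize()", re.compile(r"\.synchronize\(")),
    ("print()", re.compile(r"\bprint\(")),
]


def find_sync_candidates(source):
    """Return (line_number, label) for each line that matches a sync-prone pattern."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for label, pattern in SYNC_PATTERNS:
            if pattern.search(line):
                hits.append((line_number, label))
    return hits


sample = "\n".join([
    "out = model(x)",
    "if out.sum().item() > 0:",
    "    print(out)",
    "y = out * 2",
])
hits = find_sync_candidates(sample)
print(hits)
```

For syncs that a text scan cannot see, `torch.cuda.set_sync_debug_mode("warn")` remains the more reliable check at runtime.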

Output Formats

Success indicators:

`g.replay()` completes without errors
Outputs match eager mode within tolerance (`torch.allclose`)
Nsight Systems profile shows single graph launch replacing many kernels
GPU utilization increases, training/inference latency decreases

Key metrics:

Metric	How to Check
Correctness	`torch.allclose(eager, graphed, rtol=1e-5)`
Speedup	Wall-clock time comparison
GPU utilization	`nvidia-smi` or Nsight Systems timeline
Memory overhead	`torch.cuda.memory_summary()`
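The correctness and speedup rows can be checked with two small helpers. This is a sketch under stated assumptions: `mean_latency` and `outputs_match` are hypothetical names, the model is a placeholder, and `graphed_step` is simply aliased to the eager function so the script runs anywhere; substitute your own replay function to get a meaningful comparison.

```python
import time

import torch


def _sync():
    # Wait for queued GPU work so wall-clock timings are meaningful.
    if torch.cuda.is_available():
        torch.cuda.synchronize()


def mean_latency(fn, iters=20, warmup=3):
    """Average wall-clock seconds per call, measured after a short warmup."""
    for _ in range(warmup):
        fn()
    _sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    _sync()
    return (time.perf_counter() - start) / iters


def outputs_match(eager_out, graphed_out, rtol=1e-5, atol=1e-8):
    """Return (allclose result, max absolute difference)."""
    max_abs_diff = (eager_out - graphed_out).abs().max().item()
    return torch.allclose(eager_out, graphed_out, rtol=rtol, atol=atol), max_abs_diff


device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(32, 32).to(device).eval()
x = torch.randn(16, 32, device=device)


def eager_step():
    with torch.no_grad():
        return model(x)


graphed_step = eager_step  # placeholder: replace with your graph replay function

ok, max_abs_diff = outputs_match(eager_step(), graphed_step())
speedup = mean_latency(eager_step) / mean_latency(graphed_step)
print(f"match={ok} max_abs_diff={max_abs_diff:.3e} speedup={speedup:.2f}x")
```

Reporting the maximum absolute difference alongside the boolean makes it easier to judge whether a failed `allclose` is a tolerance issue or a real divergence.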

Error Handling

Error	Cause	Fix
`StreamCaptureUnsupported` (900)	Sync op during capture ( `.item()` , `.cpu()` )	Move sync outside graph
`StreamCaptureInvalidated` (901)	Background thread (e.g., pin_memory)	`capture_error_mode="thread_local"`
`StreamCaptureUnjoined` (904)	Side stream didn't rejoin capture stream	`capture_stream.wait_stream(side_stream)`
`StreamCaptureImplicit` (906)	AccumulateGrad on default stream	Warmup on side stream before capture
Illegal memory access	Input tensor freed/reassigned	Keep persistent ref, use `.copy_()`
Wrong numerical results	Dynamic behavior frozen at capture	See `references/patterns-compatibility.md`
OOM with multiple graphs	Pools can't share memory	`pool=g1.pool()` for sequential graphs
No speedup	Already GPU-bound or wrong capture scope	Profile with nsys first (Workflow 1)
FP8 scaling corruption	TE without `fp8_autocast` during replay	Wrap with `fp8_autocast(enabled=True)`
PP replay order mismatch	Wrong execution order during replay	Match `_order` / capture sequence exactly
FullCudaGraphWrapper capture fail	NaN check or sync enabled	`--no-check-for-nan-in-loss-and-grad`
RNG failure with FullCudaGraphWrapper	Standard RNG not capturable	`--te-rng-tracker`
DDP capture failure	Async error handling watchdog	`TORCH_NCCL_ASYNC_ERROR_HANDLING=0`
DDP AccumulateGrad on default stream	DDP constructed on default stream	Construct DDP in side stream context
Autocast cache invalidation	Cached cast tensors freed on exit	`cache_enabled=False`

For detailed troubleshooting, see `references/troubleshooting.md`.
