perf-torch-cuda-graphs
CUDA Graphs for PyTorch
CUDA Graphs capture a sequence of GPU operations once and replay them with
minimal CPU overhead. This skill guides applying CUDA Graphs to PyTorch
training and inference workloads using native PyTorch APIs, Transformer
Engine, and Megatron-LM.
When to Use
Reach for this skill when you encounter:
- Triggers: User wants to optimize with CUDA Graphs, reduce kernel launch overhead, or speed up training/inference loops
- Symptoms: Low GPU utilization (<80%), many small kernel launches (<50 us each), CPU-bound training, high kernel launch latency visible in Nsight Systems profiles
- Keywords: "CUDA graph", "torch.cuda.graph", "make_graphed_callables", "reduce-overhead", "graph capture", "graph replay", "kernel launch overhead", "CudaGraphManager", "FullCudaGraphWrapper", "full-iteration graph", "stream capture"
Do NOT use this skill for:
- General PyTorch performance tuning unrelated to kernel launch overhead
- CUDA kernel development or custom CUDA C++ code
- Host-device sync elimination only (use perf-torch-sync-free skill instead)
- Nsight Systems profiling (use perf-nsight-systems skill)
- TensorFlow/JAX graph compilation (different APIs entirely)
Requirements
| Dependency | Version | Notes |
|---|---|---|
| PyTorch | >= 1.10 | |
| CUDA | >= 11.0 | Graph update APIs |
| GPU | NVIDIA (any) | Required for CUDA |
| Nsight Systems | any | Optional, for profiling |
| APEX | any | Optional, for capturable optimizers |
| Transformer Engine | >= 2.2 | Optional, for FP8-aware graphing |
| Megatron-LM | core >= 0.14.0 | Optional, for CudaGraphManager / FullCudaGraphWrapper |
API Selection Guide
Choose the API based on your framework and performance needs.
| Situation | API | Workflow |
|---|---|---|
| Quick experiment, unknown graph boundaries | `torch.compile(mode="reduce-overhead")` | Workflow 2 |
| Training, need autograd, no FP8/PP | `torch.cuda.make_graphed_callables()` | Workflow 3 |
| Any PyTorch model, FP8 or PP support | TE `make_graphed_callables` | Workflow 4 |
| Megatron-LM, per-layer, automatic | MCore `CudaGraphManager` | Workflow 5 |
| Maximum perf, full-iteration capture | MCore `FullCudaGraphWrapper` | Workflow 6 |
| Full manual control, custom pipelines | `torch.cuda.graph()` | Workflow 7 |
Decision flowchart:
- Using Megatron-LM with FP8/PP?
  - Yes, want maximum perf with static workload --> Workflow 6 (FullCudaGraphWrapper)
  - Yes, want per-layer automatic graphing --> Workflow 5 (CudaGraphManager)
  - Yes, want manual control over what gets graphed --> Workflow 4 (TE make_graphed_callables)
- Using Transformer Engine without Megatron?
  - Yes, need FP8 or PP --> Workflow 4 (TE make_graphed_callables)
- General PyTorch?
  - Want zero effort, okay with fragmented graphs --> Workflow 2 (torch.compile)
  - Want autograd support, training loop --> Workflow 3 (PyTorch make_graphed_callables)
  - Want full manual control --> Workflow 7 (torch.cuda.graph)
Strategy: Start with the highest-level API available for your framework.
Move to lower-level APIs only if you need more control, hit limitations, or
do not achieve the expected performance improvement.
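The decision flowchart above can be written down as a small lookup, which is handy when the choice has to be made programmatically. This is only a sketch: the function name `choose_workflow` and the string labels for `framework` and `goal` are hypothetical, and combinations the flowchart does not cover raise an error rather than guess.

```python
def choose_workflow(framework, goal, needs_fp8_or_pp=False):
    """Return the workflow number suggested by the decision flowchart.

    framework: "megatron", "te", or "pytorch" (hypothetical labels)
    goal: "max_perf", "per_layer", "manual", "zero_effort", or "autograd"
    """
    if framework == "megatron" and needs_fp8_or_pp:
        table = {"max_perf": 6, "per_layer": 5, "manual": 4}
    elif framework == "te" and needs_fp8_or_pp:
        return 4  # TE make_graphed_callables is the only branch listed
    elif framework == "pytorch":
        table = {"zero_effort": 2, "autograd": 3, "manual": 7}
    else:
        raise ValueError("combination not covered by the flowchart")
    if goal not in table:
        raise ValueError(f"goal {goal!r} not listed for {framework!r}")
    return table[goal]


print(choose_workflow("megatron", "per_layer", needs_fp8_or_pp=True))
print(choose_workflow("pytorch", "autograd"))
```

Keeping the mapping in plain dictionaries makes it easy to compare against the flowchart line by line.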
Workflows
Workflow 1: Profile and Decide Whether Graphs Help
Goal: Determine if CUDA Graphs will benefit your workload before investing
effort.
- Profile with Nsight Systems:
  ```bash
  nsys profile --cuda-graph-trace=graph python train.py
  ```
- Check GPU utilization -- if already >95%, graphs won't help much.
- Look for gaps between kernel launches (CPU overhead) and many small kernels (<50 us each). These are the targets for graphing.
- Annotate regions of interest to correlate idle GPU time with code:
  ```python
  with torch.cuda.nvtx.range("forward"):
      output = model(input)
  ```
- Estimate benefit: count kernels per iteration. Workloads with hundreds of small kernels and <80% GPU utilization are strong candidates.
Expected result: Identified bottleneck regions with low GPU occupancy between
kernels. Proceed to the appropriate workflow from the API Selection Guide.
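The thresholds in the steps above can be applied to per-kernel durations exported from a profile. The sketch below is a rough triage only. It assumes kernels run back to back on one stream (no overlap), that durations are in microseconds, and it uses 100 kernels as a stand-in for "hundreds". The name `triage` and the verdict strings are hypothetical.

```python
SMALL_KERNEL_US = 50       # "small kernel" cutoff from the text
MANY_SMALL_KERNELS = 100   # assumption: stand-in for "hundreds"


def triage(kernel_durations_us, iteration_time_us):
    """Classify one profiled iteration; assumes kernels do not overlap."""
    utilization = sum(kernel_durations_us) / iteration_time_us
    small = sum(1 for d in kernel_durations_us if d < SMALL_KERNEL_US)
    if utilization > 0.95:
        verdict = "graphs unlikely to help"
    elif utilization < 0.80 and small >= MANY_SMALL_KERNELS:
        verdict = "strong candidate"
    else:
        verdict = "inspect further"
    return utilization, small, verdict


# 300 kernels of 10 us each inside a 6 ms iteration: the GPU is idle half the time.
print(triage([10.0] * 300, 6000.0))
```

A verdict of "inspect further" means the numbers alone are inconclusive and the timeline gaps should be examined by hand.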
Workflow 2: torch.compile(mode="reduce-overhead")
Goal: Automatic CUDA Graph capture with zero manual effort.
When to use: Quick experiment, unknown graph boundaries, already using `torch.compile`.
Steps:
- Decorate the training step with `@torch.compile(mode="reduce-overhead")`:
  ```python
  @torch.compile(mode="reduce-overhead")
  def train_step(model, x, target, criterion):
      output = model(x)
      loss = criterion(output, target)
      loss.backward()
      return loss
  ```
- Run the training loop normally -- graphs are captured automatically.
- Profile with Nsight Systems to see captured graphs:
  ```bash
  nsys profile --cuda-graph-trace=graph python train.py
  ```
- If you see too many small graphs (graph fragmentation), check for graph breaks: `.item()`, `print()`, data-dependent control flow. Fix these or escalate to Workflow 3+.
Trade-offs:
- Zero effort, but may create fragmented small graphs.
- Limited control over what gets graphed.
- Graph fragmentation limits performance gains compared to manual approaches.
Workflow 3: torch.cuda.make_graphed_callables()
Goal: Training with autograd support. Separate forward/backward graphs.
When to use: Training with custom loops, non-FP8, need autograd.
Steps:
- Prepare sample inputs matching training batch shape:
  ```python
  sample_input = torch.randn(batch_size, seq_len, hidden_size, device="cuda")
  ```
- Create the graphed model:
  ```python
  graphed_model = torch.cuda.make_graphed_callables(
      model, (sample_input,), num_warmup_iters=3
  )
  ```
- Use `graphed_model` as a drop-in replacement in the training loop:
  ```python
  for data, target in dataloader:
      optimizer.zero_grad()
      output = graphed_model(data)
      loss = criterion(output, target)
      loss.backward()
      optimizer.step()
  ```
- If using AMP, set `cache_enabled=False`:
  ```python
  for data, target in dataloader:
      optimizer.zero_grad()
      with torch.amp.autocast("cuda", cache_enabled=False):
          output = graphed_model(data)
          loss = criterion(output, target)
      loss.backward()
      optimizer.step()
  ```
- If using DDP, construct DDP on a side stream and use 11 warmup iters:
  ```python
  import os
  from torch.nn.parallel import DistributedDataParallel

  os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "0"
  s = torch.cuda.Stream()
  with torch.cuda.stream(s):
      model = DistributedDataParallel(model)
  torch.cuda.current_stream().wait_stream(s)
  graphed_model = torch.cuda.make_graphed_callables(
      model, (sample_input,), num_warmup_iters=11
  )
  ```
Limitations:
- No double backward (higher-order gradients).
- No module hooks during capture.
- Module structure is frozen after graphing (no add/remove parameters).
- Argument signature must match `sample_args` exactly.
Workflow 4: TE make_graphed_callables
Goal: Per-callable graphing with FP8 support and pipeline parallelism.
When to use: FP8 training, PP with manual scheduling, non-Megatron models
needing FP8, or any PyTorch model that needs FP8-aware CUDA Graphs.
Steps:
- Import and configure:
  ```python
  from transformer_engine.pytorch.graph import make_graphed_callables
  from transformer_engine.pytorch.fp8 import fp8_autocast
  ```
- Prepare sample inputs (one per callable per microbatch per chunk):
  ```python
  sample_args = tuple(
      (torch.randn(batch_size, seq_len, hidden_size, device="cuda"),)
      for _ in range(num_callables * num_microbatches)
  )
  ```
- Define pipeline schedule if using PP (1-indexed chunk IDs, positive=fwd, negative=bwd):
  ```python
  # Example: 2 chunks, 3 microbatches
  layer_order = [1, 2, 1, 2, 1, 2, -2, -1, -2, -1, -2, -1]
  ```
- Wrap layers in CUDA Graphs:
  ```python
  graphed_layers = make_graphed_callables(
      tuple(layers),
      sample_args=sample_args,
      fp8_enabled=True,
      fp8_recipe=fp8_recipe,
      fp8_weight_caching=True,
      _order=layer_order,  # None for no PP
  )
  ```
- Training loop -- wrap with `fp8_autocast` during replay:
  ```python
  with fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
      for layer in graphed_layers[start:end]:
          x = layer(x, is_first_microbatch=(mb_idx == 0))
  # FP8 scaling auto-updated on fp8_autocast exit
  optimizer.step()
  ```
Key points:
- AOT capture: Graphs captured before the training loop when you call `make_graphed_callables()`.
- Replay order must match `_order`: The training loop must execute graphs in the same interleaved order as specified during capture.
- `fp8_autocast` required during replay: Without it, FP8 state is not properly configured.
- Weight caching: `fp8_weight_caching=True` caches FP8 weight quantization across microbatches; pass `is_first_microbatch` kwarg to control when weights are requantized.
For full API details, see `references/api-te-megatron.md`.
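Writing the `_order` list by hand gets error-prone as the number of chunks and microbatches grows. The sketch below generates the list for the simple pattern used in the example above (all forwards for every microbatch, then all backwards with chunks reversed) and checks that every backward entry has a matching earlier forward. The helper names `build_order` and `is_valid_order` are hypothetical and not part of Transformer Engine, and real interleaved schedules may order entries differently.

```python
def build_order(num_chunks, num_microbatches):
    """All forwards (chunks 1..N per microbatch), then all backwards (N..1)."""
    forward = [c for _ in range(num_microbatches)
               for c in range(1, num_chunks + 1)]
    backward = [-c for _ in range(num_microbatches)
                for c in range(num_chunks, 0, -1)]
    return forward + backward


def is_valid_order(order):
    """True if each backward entry follows an unmatched forward of its chunk."""
    pending = {}
    for entry in order:
        chunk = abs(entry)
        if entry > 0:
            pending[chunk] = pending.get(chunk, 0) + 1
        elif pending.get(chunk, 0) > 0:
            pending[chunk] -= 1
        else:
            return False  # backward without a pending forward (or a zero entry)
    return all(count == 0 for count in pending.values())


layer_order = build_order(num_chunks=2, num_microbatches=3)
print(layer_order)
print(is_valid_order(layer_order))
```

Running the same validity check on the order the training loop actually executes is a cheap way to catch a replay-order mismatch before it corrupts results.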
Workflow 5: MCore CudaGraphManager (Per-Layer)
Goal: Automatic per-layer graphing for Megatron-LM training.
When to use: Megatron-LM training, especially with PP > 1. Default choice
for Megatron users.
Steps:
- Enable via CLI flags (no code changes needed):
  ```bash
  python pretrain_gpt.py \
      --enable-cuda-graph \
      --cuda-graph-num-warmup-steps 3
  ```
- Or enable via Python config:
  ```python
  config = TransformerConfig(
      enable_cuda_graph=True,
      cuda_graph_num_warmup_steps=3,
  )
  ```
- Training loop is unchanged -- graphs are captured automatically after warmup iterations.
Key points:
- Megatron layers only: Works with `TransformerLayer` and `MambaLayer`.
- JIT capture: Records execution order during warmup, captures graphs after warmup completes, then replays on subsequent iterations.
- Automatic FP8 handling: Uses `fp8_autocast(..., _graph=True)` to skip per-layer amax reduction; reduction happens once after all backward graphs.
- Automatic PP support: Handles microbatch interleaving automatically.
- Memory savings: Set `cuda_graph_share_io_buffers=True` to share I/O buffers between layers (requires no operations between layers).
- Memory pool strategy: Default uses separate pools per microbatch for graph reuse. Set `cuda_graph_use_single_mempool=True` for shared pool (higher graph count but may reduce fragmentation).
Workflow 6: MCore FullCudaGraphWrapper (Full-Iteration)
Goal: Maximum performance. Captures forward+backward for all microbatches
as a single graph.
When to use: Maximum performance priority, static workloads, Megatron-LM
training.
Steps:
- Enable via CLI flags:
  ```bash
  python pretrain_gpt.py \
      --enable-cuda-graph \
      --cuda-graph-scope full_iteration \
      --cuda-graph-warmup-steps 1 \
      --te-rng-tracker \
      --no-check-for-nan-in-loss-and-grad
  ```
- Ensure all forward+backward code is capturable (no `.item()`, no NaN check, no dynamic control flow).
- Optimizer remains in eager mode by default (outside the graph). Can be included inside the graph for maximum performance.
Key points:
- Only 2 graphs total: One for training, one for validation.
- `--te-rng-tracker` required: Standard RNG uses CPU scalars that cannot be captured; TE RNG uses device tensors compatible with graphs.
- `--no-check-for-nan-in-loss-and-grad` mandatory: NaN checking uses `.item()` which requires CPU-GPU sync, forbidden during capture.
- StaticBufferLoader: Pre-allocates input buffers for all microbatches during warmup.
- Optimizer in/out of graph: Inside = maximum performance (all optimizer kernels captured). Outside = more flexible (can change optimizer/LR without recapture).
- JIT capture: Graph captured during training at iteration `warmup_steps + 1`.
Workflow 7: torch.cuda.graph() (Manual)
Goal: Full control over capture and replay. Custom pipelines, full-iteration
capture without Megatron.
When to use: Need fine-grained control, non-Megatron full-iteration capture,
custom pipelines.
Inference pattern:
- Pre-allocate static input/output tensors:
  ```python
  static_input = torch.randn(batch_size, *shape, device="cuda")
  ```
- Warmup on a side stream (3 iterations, 11 for DDP):
  ```python
  s = torch.cuda.Stream()
  with torch.cuda.stream(s):
      for _ in range(3):
          _ = model(static_input)
  torch.cuda.current_stream().wait_stream(s)
  ```
- Capture the graph:
  ```python
  g = torch.cuda.CUDAGraph()
  with torch.cuda.graph(g):
      static_output = model(static_input)
  ```
- Replay loop -- update inputs via `.copy_()`, clone outputs:
  ```python
  for data in loader:
      static_input.copy_(data)
      g.replay()
      result = static_output.clone()
  ```
Full training pattern (fwd+bwd captured in one graph; `optimizer.step()` runs eagerly after each replay):
```python
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
static_input = torch.randn(batch_size, *shape, device="cuda")
static_target = torch.randint(0, num_classes, (batch_size,), device="cuda")

# Warmup
s = torch.cuda.Stream()
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad()
        with torch.amp.autocast("cuda", cache_enabled=False):
            out = model(static_input)
            loss = criterion(out, static_target)
        loss.backward()
torch.cuda.current_stream().wait_stream(s)

# Capture
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    optimizer.zero_grad()
    with torch.amp.autocast("cuda", cache_enabled=False):
        static_output = model(static_input)
        static_loss = criterion(static_output, static_target)
    static_loss.backward()

# Replay loop
for data, target in loader:
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
    optimizer.step()
```
**DDP setup:**
```python
import os
from torch.nn.parallel import DistributedDataParallel

os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "0"
s = torch.cuda.Stream()
with torch.cuda.stream(s):
    model = DistributedDataParallel(model)

# 11 warmup iterations for DDP
with torch.cuda.stream(s):
    for _ in range(11):
        out = model(static_input)
        out.sum().backward()
torch.cuda.current_stream().wait_stream(s)

# Capture on the same side stream
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, stream=s):
    static_output = model(static_input)
```
**Memory pool sharing for multiple graphs:**
```python
g1 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g1):
    out1 = model_a(static_in_a)

# Second graph shares first graph's memory pool
g2 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g2, pool=g1.pool()):
    out2 = model_b(static_in_b)
```
**Custom RNG registration:**
```python
gen = torch.cuda.default_generators[0]
g = torch.cuda.CUDAGraph()
g.register_generator_state(gen)
with torch.cuda.graph(g):
    out = model(static_input)  # RNG state properly captured
```
Navigating Between Workflows
- `torch.compile` gives insufficient speedup --> escalate to `make_graphed_callables` (Workflow 3) for larger, fewer graphs.
- `make_graphed_callables` can't handle FP8/PP --> TE `make_graphed_callables` (Workflow 4).
- Need Megatron per-layer automatic --> CudaGraphManager (Workflow 5).
- Want maximum perf --> FullCudaGraphWrapper (Workflow 6) or manual full-iteration capture (Workflow 7).
- Something too hard to graph --> partial capture (graph what you can, leave the rest in eager mode).
- User wants best absolute perf --> skip directly to Workflow 6 (Megatron) or Workflow 7 (manual).
- Start small, expand progressively: Begin with one module/layer. Verify correctness. Then expand to more layers, full forward pass, add backward, and eventually full iteration with optimizer.
Making Code Graph-Compatible
These principles apply to all workflows. Code inside the captured region must
satisfy three constraints.
Principle 1: GPU-Only
Only GPU operations are captured. CPU-side code (Python logic, I/O, logging)
executes during capture but is eliminated during replay.
Violations:
- File I/O: `data = torch.load("file.pt")` won't reload on replay
- CPU preprocessing: `tokens = tokenizer.encode(text)` won't re-tokenize
- Logging: `print(f"Step {i}")` won't print during replay
- CPU RNG: `random.randint(0, 10)` won't regenerate
- CPU bookkeeping: `buffer.append(tensor)` won't populate during replay
Fix: Move all CPU-side operations outside the graphed region.
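The behavior described above can be imitated without a GPU. The toy model below is not how CUDA Graphs are implemented. `ToyGraph`, `launch`, and `scale_in_place` are hypothetical stand-ins in which `launch` plays the role of a kernel launch that is recorded during capture, and ordinary Python statements play the role of CPU-side code.

```python
class ToyGraph:
    """Records 'device ops' during capture; replay re-runs only those ops."""

    def __init__(self):
        self._ops = []

    def launch(self, fn, *args):
        self._ops.append((fn, args))  # recorded, not executed, during capture

    def replay(self):
        for fn, args in self._ops:
            fn(*args)


log = []     # CPU-side bookkeeping
buf = [0.0]  # stands in for a static device buffer


def scale_in_place(buffer, factor):
    buffer[0] *= factor


def step(graph):
    log.append("step ran")                  # CPU side effect: runs once, at capture
    graph.launch(scale_in_place, buf, 2.0)  # "GPU op": recorded into the graph


g = ToyGraph()
step(g)      # capture
buf[0] = 3.0  # update the static buffer in place
g.replay()
g.replay()
print(buf[0], len(log))
```

After two replays the buffer has been scaled twice, while the log still holds the single entry written at capture time. This mirrors the bookkeeping violation in the list above.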
Principle 2: Sync-Free
No CPU-GPU synchronization inside the graph. The CPU queues work continuously
without waiting for GPU results.
Violations:
- `.item()` to get scalar values
- `.cpu()` to move tensors for inspection
- `torch.cuda.synchronize()` or `stream.synchronize()`
- `print(tensor)` (implicitly syncs)
Fix: Invoke the perf-torch-sync-free skill for systematic detection and elimination of sync points. Use `torch.cuda.set_sync_debug_mode("warn")` to find hidden syncs.
Principle 3: Static
All operations, control flow, memory addresses, and shapes must be fixed
across all replays.
Violations and fixes:
| Dynamic aspect | Fix |
|---|---|
| `if tensor_value:` | `torch.where()` |
| | Pre-allocate + `.copy_()` |
| Python scalars (lr, temperature) | GPU tensor + `.fill_()` |
| Variable batch size / sequence length | Padding or bucketing |
| MoE / dynamic routing | Partial graphing |
For detailed patterns, see `references/patterns-dynamic.md`.
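The padding-or-bucketing row can be illustrated with a small helper that rounds each sequence up to the nearest pre-chosen length, so only a handful of static shapes ever reach the graphed region. This is a sketch under assumptions: the bucket sizes, the pad id, and the names `pick_bucket` and `pad_to_bucket` are all hypothetical.

```python
import bisect

BUCKETS = (128, 256, 512, 1024)  # hypothetical bucket lengths, sorted ascending


def pick_bucket(seq_len, buckets=BUCKETS):
    """Return the smallest bucket that fits seq_len."""
    i = bisect.bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"sequence length {seq_len} exceeds largest bucket")
    return buckets[i]


def pad_to_bucket(tokens, pad_id=0, buckets=BUCKETS):
    """Right-pad a token list to its bucket length."""
    target = pick_bucket(len(tokens), buckets)
    return tokens + [pad_id] * (target - len(tokens))


print(pick_bucket(129))
print(len(pad_to_bucket([7] * 130)))
```

One graph would then typically be captured per bucket length and selected at replay time. Fewer buckets mean fewer graphs but more wasted padding.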
Compatibility Checklist
Verify every item before attempting capture:
- No `.item()`, `.cpu()`, `.numpy()`, `print(tensor)` inside graph
- No `torch.cuda.synchronize()` or `stream.synchronize()`
- No `if tensor_value:` -- use `torch.where()` instead
- All inputs pre-allocated, updated via `.copy_()`
- All shapes fixed (use padding or bucketing for variable sizes)
- Python scalars --> GPU tensors with `.fill_()`
- Output tensors `.clone()`d before next replay
- `cache_enabled=False` with `torch.amp.autocast`
- Custom RNG generators registered with `graph.register_generator_state()`
- Use `graphsafe_get_state()` / `graphsafe_set_state()` for RNG
- Warmup completed (3 standard, 11 for DDP)
- DDP: `TORCH_NCCL_ASYNC_ERROR_HANDLING=0`, construct on side stream
- DDP: NCCL >= 2.9.6 for full graph capture
- Libraries/extensions use `torch.cuda.current_stream()`, not default stream
- No pinned memory allocation during capture (triggers hidden event query)
- `activation_checkpointing`: `preserve_rng_state=False`
- Global tensors used in graph kept alive (not deleted/reassigned)
- No `torch.compile` functions inside manual capture without prior warmup
- Gradient clipping uses sync-free `clip_grad_norm_` (PyTorch >= 1.13)
For the complete checklist with references, see `references/patterns-compatibility.md`.
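The first two checklist items lend themselves to a quick static pre-check before attempting capture. The sketch below walks a source string with the standard `ast` module and reports method calls whose names appear in a small deny-list. It is name-based only, so it can both over-report and miss syncs hidden inside libraries. The function name `find_sync_calls` and the deny-list are hypothetical.

```python
import ast

SYNC_METHODS = {"item", "cpu", "numpy", "synchronize"}  # hypothetical deny-list


def find_sync_calls(source):
    """Return sorted (line, method) pairs for calls to deny-listed methods."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr in SYNC_METHODS:
                hits.append((node.lineno, node.func.attr))
    return sorted(hits)


snippet = (
    "loss = model(x)\n"
    "if loss.item() > 0:\n"
    "    torch.cuda.synchronize()\n"
    "y = x.copy_(z)\n"
)
for line, name in find_sync_calls(snippet):
    print(f"line {line}: .{name}() may force a CPU-GPU sync")
```

A static scan like this complements runtime detection rather than replacing it, since it only sees the code it is given.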
Output Formats
Success indicators:
- `g.replay()` completes without errors
- Outputs match eager mode within tolerance (`torch.allclose`)
- Nsight Systems profile shows single graph launch replacing many kernels
- GPU utilization increases, training/inference latency decreases
Key metrics:
| Metric | How to Check |
|---|---|
| Correctness | `torch.allclose` against eager-mode outputs |
| Speedup | Wall-clock time comparison |
| GPU utilization | |
| Memory overhead | |
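For the wall-clock comparison in the table, one plausible approach is to time a number of iterations in each mode and compare medians, which are less sensitive to occasional slow iterations than means. In the sketch below, `time_iterations` and `speedup` are hypothetical helpers. The `sync` argument is a placeholder for whatever device barrier the workload needs (for GPU work that would typically be `torch.cuda.synchronize`, placed outside any captured region), and it defaults to a no-op so the example runs anywhere.

```python
import statistics
import time


def time_iterations(step, iters=20, sync=lambda: None):
    """Return per-iteration wall-clock times in milliseconds."""
    samples = []
    for _ in range(iters):
        sync()  # drain pending work before starting the clock
        start = time.perf_counter()
        step()
        sync()  # wait for this iteration's work to finish
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


def speedup(eager_ms, graphed_ms):
    """Ratio of median eager time to median graphed time."""
    return statistics.median(eager_ms) / statistics.median(graphed_ms)


# Illustrative numbers only, not measurements.
print(speedup([10.0, 12.0, 11.0], [5.0, 5.5, 6.0]))
```

Without a device barrier around the timed region, the measurement reflects only how fast the CPU enqueues work, which is exactly the quantity CUDA Graphs change, so it would overstate the gain.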
Error Handling
| Error | Cause | Fix |
|---|---|---|
| | Sync op during capture | Move sync outside graph |
| | Background thread (e.g., pin_memory) | |
| | Side stream didn't rejoin capture stream | |
| | AccumulateGrad on default stream | Warmup on side stream before capture |
| Illegal memory access | Input tensor freed/reassigned | Keep persistent ref, use `.copy_()` |
| Wrong numerical results | Dynamic behavior frozen at capture | See `references/patterns-dynamic.md` |
| OOM with multiple graphs | Pools can't share memory | `pool=g1.pool()` for sequential graphs |
| No speedup | Already GPU-bound or wrong capture scope | Profile with nsys first (Workflow 1) |
| FP8 scaling corruption | TE without `fp8_autocast` during replay | Wrap with `fp8_autocast` |
| PP replay order mismatch | Wrong execution order during replay | Match `_order` |
| FullCudaGraphWrapper capture fail | NaN check or sync enabled | `--no-check-for-nan-in-loss-and-grad` |
| RNG failure with FullCudaGraphWrapper | Standard RNG not capturable | `--te-rng-tracker` |
| DDP capture failure | Async error handling watchdog | `TORCH_NCCL_ASYNC_ERROR_HANDLING=0` |
| DDP AccumulateGrad on default stream | DDP constructed on default stream | Construct DDP in side stream context |
| Autocast cache invalidation | Cached cast tensors freed on exit | `cache_enabled=False` |
For detailed troubleshooting, see `references/troubleshooting.md`.
Finding More Information
Use this 3-tier lookup hierarchy -- start at Tier 1 and escalate only when
needed.
Tier 1: This File (SKILL.md)
You are reading it now. The workflows, compatibility checklist, and error
table above cover the most common tasks. Search this file first before going
deeper.
Tier 2: references/ Directory
The `references/` directory beside this file contains distilled reference material -- API details, patterns, and troubleshooting pages.
How to search:
- Grep for your keyword across `references/` -- headers are designed to be grep-friendly.
- Read only the file that grep points you to. Do not read every file.
Available references:
- `references/api-pytorch.md` -- PyTorch CUDA Graph APIs (`torch.cuda.graph`, `make_graphed_callables`, `torch.compile` reduce-overhead)
- `references/api-te-megatron.md` -- TE `make_graphed_callables`, CudaGraphManager, FullCudaGraphWrapper implementations
- `references/patterns-compatibility.md` -- GPU-only, sync-free, and static principles with full checklist
- `references/patterns-dynamic.md` -- Dynamic control flow, tensors, scalars, shapes: workarounds and patterns
- `references/troubleshooting.md` -- Capture failures, numerical errors, memory issues, performance issues
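Where a shell `grep` is not available, the keyword search described above can be approximated in a few lines of Python. This is a minimal sketch: the helper name `grep_references` is hypothetical, matching is a case-insensitive substring test, and only top-level `.md` files under the given directory are scanned.

```python
from pathlib import Path


def grep_references(keyword, root="references"):
    """Return (file name, line number, line) for each case-insensitive match."""
    needle = keyword.lower()
    hits = []
    for path in sorted(Path(root).glob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle in line.lower():
                hits.append((path.name, lineno, line.strip()))
    return hits


for name, lineno, line in grep_references("register_generator_state"):
    print(f"{name}:{lineno}: {line}")
```

The file names in the results indicate which single reference to open next.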
Tier 3: Original Documentation
If Tiers 1-2 do not answer the question, consult the original sources:
- NVIDIA guide: https://docs.nvidia.com/dl-cuda-graph/latest/index.html
- PyTorch docs: https://docs.pytorch.org/docs/stable/notes/cuda.html (CUDA Graphs section)
- TE docs: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
- Megatron Core docs: https://docs.nvidia.com/megatron-core/developer-guide/latest/index.html
Return to Tier 2 afterward and consider whether the answer should be distilled
into the references directory for next time.