perf-torch-sync-free

Writing Sync-Free PyTorch Code

Sync-free code means the CPU continuously queues work to the GPU without waiting for GPU operations to complete. When host-device synchronizations are eliminated, the GPU can stay busy instead of idling while it waits for the CPU to queue more work.
Every host-device synchronization ultimately calls one of three CUDA driver APIs that block the CPU thread:
  • `cuEventSynchronize` -- CPU waits until a specific GPU event completes
  • `cuStreamSynchronize` -- CPU waits until all work on a stream finishes
  • `cuCtxSynchronize` -- CPU waits until all work across all streams finishes
Profiler timelines usually show these under their CUDA runtime names: `cudaEventSynchronize`, `cudaStreamSynchronize`, and `cudaDeviceSynchronize`.

When to Use

Reach for this skill when you encounter:
  • Triggers: User wants to remove host-device synchronizations, eliminate CPU stalls from GPU waits, make code async/sync-free, remove `.item()` or `.cpu()` calls that block the CPU, or understand why specific PyTorch operations cause synchronization
  • Symptoms: Frequent `cudaStreamSynchronize` in nsys profiles, warnings from `torch.cuda.set_sync_debug_mode`, training throughput limited by CPU-GPU round-trips, `.item()` or `.cpu()` calls in hot loops
  • Keywords: "sync-free", "synchronization", ".item()", ".cpu()", "host-device sync", "eliminate syncs", "CPU stall", "non_blocking", "set_sync_debug_mode", "cudaStreamSynchronize", "cudaEventSynchronize", "remove syncs", "async GPU", "CPU waiting on GPU"
Do NOT use this skill for:
  • Applying CUDA Graphs or reducing kernel launch overhead (use `perf-torch-cuda-graphs` instead)
  • Profiling GPU kernels, system timelines, or finding GPU idle time (use `perf-nsight-compute-analysis` or `perf-nsight-systems`)
  • Kernel optimization or code generation (use `kernel-triton-writing`)
  • Optimizing NCCL communication or distributed training collective operations
  • Reducing GPU memory usage or gradient checkpointing
  • General model compilation with `torch.compile`

Requirements

| Dependency | Version | Notes |
|---|---|---|
| PyTorch | >=2.0 | With CUDA support |
| NVIDIA GPU | Any | CUDA-capable |
| Nsight Systems | Optional | For comprehensive sync detection via `nsys` |

Workflow

Step 1: Detect Synchronizations

Use one or both methods to find sync points in the code.

**Quick detection** -- PyTorch sync debug mode prints a warning with a stack trace on every synchronization:

```python
import torch

# Enable at the start of the region you want to check
torch.cuda.set_sync_debug_mode('warn')     # prints warning + stack trace
# torch.cuda.set_sync_debug_mode('error')  # raises exception on sync

# Run your training step / forward pass here
train_step(model, batch)
torch.cuda.set_sync_debug_mode(0)  # disable
```

This mode only detects syncs going through PyTorch's wrapped
`cuStreamSynchronize`. Third-party libraries calling CUDA sync APIs
directly are not detected.
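
One way to keep the debug mode from leaking past the region under test is to wrap it in a context manager that restores the previous setting. The following is a minimal sketch, not something PyTorch ships: `sync_debug` is a hypothetical helper name, and it relies on `torch.cuda.get_sync_debug_mode()` to read the current mode. On a machine without CUDA it simply runs the block unchanged.

```python
import contextlib

import torch


@contextlib.contextmanager
def sync_debug(mode="warn"):
    """Hypothetical helper: enable sync debug mode for a block, then restore it."""
    if not torch.cuda.is_available():
        # No GPU, so there is nothing to synchronize with; run the block as-is.
        yield
        return
    previous = torch.cuda.get_sync_debug_mode()  # integer code of the current mode
    torch.cuda.set_sync_debug_mode(mode)
    try:
        yield
    finally:
        torch.cuda.set_sync_debug_mode(previous)


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, device=device)

with sync_debug("warn"):
    y = (x * 2).sum()   # stays on the device: no warning expected
    value = y.item()    # copies a scalar to the host: warns on a CUDA device
```

Restoring the previous mode in `finally` means an exception raised under `'error'` does not leave the whole process in error mode.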

**Comprehensive detection** -- Nsight Systems captures all sync calls
including those from extensions and libraries:

```bash
nsys profile --capture-range=cudaProfilerApi \
             --python-sampling=true \
             --backtrace=dwarf \
             python your_script.py
```

In the Nsight Systems GUI, check the CUDA API timeline row and search for `cudaStreamSynchronize`, `cudaEventSynchronize`, or `cudaDeviceSynchronize`. The call stack panel shows which Python line triggered each sync.
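
With `--capture-range=cudaProfilerApi`, nsys records only between calls to the CUDA profiler start and stop APIs, so the script has to mark the region of interest itself. A rough sketch of one way to do that, assuming a hypothetical helper named `profiled_region` built on `torch.cuda.profiler.start()` / `stop()` and an NVTX range (the model and batch here are placeholders):

```python
import contextlib

import torch


@contextlib.contextmanager
def profiled_region(name):
    """Hypothetical helper: mark a region for `nsys --capture-range=cudaProfilerApi`."""
    if not torch.cuda.is_available():
        yield  # nothing to profile without a GPU
        return
    torch.cuda.profiler.start()        # calls cudaProfilerStart
    torch.cuda.nvtx.range_push(name)   # labels the region on the timeline
    try:
        yield
    finally:
        # Deliberate sync at the region boundary so queued GPU work is captured.
        torch.cuda.synchronize()
        torch.cuda.nvtx.range_pop()
        torch.cuda.profiler.stop()     # calls cudaProfilerStop


device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 4).to(device)   # placeholder model
batch = torch.randn(8, 16, device=device)   # placeholder batch

with profiled_region("train_step"):
    out = model(batch)
```

The `torch.cuda.synchronize()` at the end will show up as one `cudaDeviceSynchronize` in the profile; it sits at the edge of the region and is not one of the syncs being hunted.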

Step 2: Classify -- False vs True Dependencies

After detecting syncs, classify each one before deciding how to fix it.
**False dependencies (avoidable)** -- CPU does not actually need the GPU result. These can be eliminated without changing program logic:
  • Debug prints left in hot paths (`print(loss.item())`)
  • Unnecessary `.item()` calls for logging that could be deferred
  • Using `.cuda()` instead of `.to('cuda', non_blocking=True)`
  • Using `.type(torch.LongTensor)` instead of `.type(torch.long)`
  • Creating tensors from Python objects directly on CUDA
**True dependencies (require restructuring)** -- CPU genuinely needs the GPU value to proceed:
  • Control flow dependency: `if loss.item() > threshold:` -- CPU branches on a GPU-computed value
  • Dynamic memory allocation: `output = x[mask]` -- output size depends on GPU computation
  • CPU computation using GPU values: computing statistics for logging, updating learning rates from metrics
True dependencies require restructuring: move logic to GPU (`torch.where()`), delay to end of iteration, or accept that those parts stay outside any CUDA Graph capture region.
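
To make the dynamic-allocation case concrete, here is a small sketch of restructuring a masked mean so that every intermediate has a fixed shape. The tensor values are made up for illustration, and the fixed-shape form is only one possible rewrite; it assumes the selected elements are needed solely for a reduction.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.tensor([1.0, -2.0, 3.0, -4.0, 5.0]).to(device)
mask = x > 0

# Dynamic-shape version: the size of x[mask] is only known after the GPU has
# evaluated the mask, so the host must wait for it.
mean_dynamic = x[mask].mean()

# Fixed-shape version: zero out the unselected entries and divide by the count.
# Every intermediate has a shape known up front, so nothing is read back.
selected = torch.where(mask, x, torch.zeros_like(x))
count = mask.sum().clamp(min=1)  # avoid division by zero when nothing is selected
mean_fixed = selected.sum() / count
```

The two versions agree whenever at least one element is selected. For an empty mask the dynamic version yields NaN while this sketch yields 0, so pick the behavior the surrounding code expects.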

Step 3: Eliminate Systematically

Apply fixes in order of increasing difficulty. Start with easy wins.

**1. Remove redundancy** -- Delete operations that do not need to happen:
  • Remove debug prints and logging from hot loops
  • Delete unnecessary `.item()` calls
  • Eliminate duplicate synchronizations

**2. Use `non_blocking=True`** -- Make transfers async where CPU does not immediately use the result:

```python
# Before (syncs)
x_gpu = x_cpu.cuda()
x_cpu = x_gpu.cpu()

# After (async, no sync)
x_gpu = x_cpu.to('cuda', non_blocking=True)
x_cpu = x_gpu.to('cpu', non_blocking=True)  # only if CPU does not use x_cpu immediately
```

For host-to-device copies, the transfer is only asynchronous when the source
tensor is in pinned (page-locked) memory.
Only use `non_blocking=True` for GPU-to-CPU when the CPU does not
immediately read the result. Otherwise the CPU may operate on incomplete
data.
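
When the CPU does need the value later, one option is to record a CUDA event right after the copy and wait on that event only at the point of use. The sketch below assumes a hypothetical helper named `start_async_download`; the pinned destination buffer and the event are the parts that matter, and on a CPU-only machine the helper just clones the tensor.

```python
import torch


def start_async_download(src, dst=None):
    """Hypothetical helper: begin a device-to-host copy without blocking the CPU.

    Returns the destination tensor and an event (None on CPU) that marks
    when the copy has finished.
    """
    if not src.is_cuda:
        return src.clone(), None  # CPU fallback: data is already on the host
    if dst is None:
        # Pinned (page-locked) host memory lets the copy run asynchronously.
        dst = torch.empty(src.shape, dtype=src.dtype, device="cpu", pin_memory=True)
    dst.copy_(src, non_blocking=True)  # queued on the current stream
    done = torch.cuda.Event()
    done.record()                      # completes once the copy above has finished
    return dst, done


device = "cuda" if torch.cuda.is_available() else "cpu"
x_gpu = torch.arange(6, dtype=torch.float32, device=device)

x_host, done = start_async_download(x_gpu)
# ... queue more GPU work here; do not read x_host yet ...
if done is not None:
    done.synchronize()  # a single event wait, placed where the value is needed
total = x_host.sum().item()
```

This still synchronizes once, but the wait covers only the work queued before the event rather than everything on the stream, and it can be placed as late as the program allows.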

**3. Switch to sync-free API alternatives** -- See the Quick Reference
Table below for a condensed mapping of common patterns.

**4. Delay synchronization to end of iteration** -- Move logging and
validation to after the optimizer step rather than mid-forward/backward:

```python
# Before: sync mid-iteration
loss = model(batch)
print(f"Loss: {loss.item()}")  # cuStreamSynchronize
loss.backward()

# After: delay to end of iteration
loss = model(batch)
loss.backward()
optimizer.step()
print(f"Loss: {loss.item()}")  # sync is outside the hot path
```
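
Delaying can go further than the end of one iteration: if the loss is only needed for logging, it can be accumulated on the device and read back every N steps. This is an illustrative sketch with a toy model; `LOG_EVERY`, the model, and the synthetic batches are all assumptions made for the example.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
model = torch.nn.Linear(4, 1).to(device)          # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

LOG_EVERY = 5  # illustrative value
running_loss = torch.zeros((), device=device)     # accumulator lives on the device
logged = []

for step in range(1, 11):
    batch = torch.randn(8, 4, device=device)      # synthetic data
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    running_loss += loss.detach()  # queued on the device, no host read

    if step % LOG_EVERY == 0:
        # The only host read: one scalar every LOG_EVERY steps.
        logged.append(running_loss.item() / LOG_EVERY)
        running_loss.zero_()
```

Ten steps with `LOG_EVERY = 5` produce two host reads instead of ten.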

**5. Coalesce multiple syncs into one** -- If you need several GPU values
on CPU, gather them and transfer once:

```python
# Before: 3 separate syncs
loss_val = loss.item()         # cuStreamSynchronize
acc_val = accuracy.item()      # cuStreamSynchronize
gnorm_val = grad_norm.item()   # cuStreamSynchronize

# After: 1 sync
metrics = torch.stack([loss, accuracy, grad_norm])
vals = metrics.cpu()           # single cuStreamSynchronize
loss_val, acc_val, gnorm_val = vals.tolist()
```
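
The same idea can be packaged so that call sites stay readable. A minimal sketch, assuming a hypothetical helper named `fetch_scalars` that accepts 0-dim tensors by keyword; the cast to float is there because `torch.stack` needs a common dtype.

```python
import torch


def fetch_scalars(**tensors):
    """Hypothetical helper: read several 0-dim tensors with a single transfer."""
    names = list(tensors)
    # Detach and cast so values with different dtypes can share one buffer.
    stacked = torch.stack([tensors[n].detach().float() for n in names])
    values = stacked.tolist()  # the one device-to-host copy
    return dict(zip(names, values))


device = "cuda" if torch.cuda.is_available() else "cpu"
loss = torch.full((), 0.25, device=device)
accuracy = torch.full((), 0.5, device=device)
num_correct = torch.full((), 16, device=device)  # integer metric, cast to float above

metrics = fetch_scalars(loss=loss, accuracy=accuracy, num_correct=num_correct)
```

Integer metrics come back as floats with this approach, which is usually fine for logging but worth knowing if exact integer types matter downstream.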

**6. Offload logic to GPU** -- Replace CPU-side logic with GPU-native ops:

```python
# Before: CPU control flow (syncs)
if loss.item() > threshold:
    result = a
else:
    result = b

# After: GPU-side selection (no sync)
result = torch.where(loss > threshold, a, b)

# Before: Python max (syncs)
val = max(x_gpu[0, 0], x_gpu[0, 1])

# After: torch.max (no sync)
val = torch.max(x_gpu[0, 0], x_gpu[0, 1])
```
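
A common place this pattern pays off is skipping an update when the loss is not finite. The sketch below is one possible formulation, not a drop-in replacement for an optimizer: `guarded_sgd_step` is a hypothetical helper, and the parameter, gradient, and learning rate are toy values.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"


def guarded_sgd_step(param, grad, loss, lr=0.1):
    """Hypothetical helper: apply an SGD update only when the loss is finite."""
    ok = torch.isfinite(loss)  # 0-dim bool tensor, stays on the device
    # Select the real update or a zero update on the device; the host never
    # looks at `ok`, so there is nothing to wait for.
    update = torch.where(ok, lr * grad, torch.zeros_like(grad))
    param.sub_(update)
    return param


param = torch.ones(3, device=device)
grad = torch.full((3,), 2.0, device=device)

guarded_sgd_step(param, grad, loss=torch.full((), 0.5, device=device))
after_good = param.clone()  # 1 - 0.1 * 2 = 0.8 in every entry

guarded_sgd_step(param, grad, loss=torch.full((), float("nan"), device=device))
after_bad = param.clone()   # unchanged, because the loss was NaN
```

The trade-off is that the update arithmetic always runs, even when it is discarded. That is usually cheaper than stalling the CPU on `.item()`.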

**7. Exclude unavoidable syncs from capture range** (last resort) -- If a
sync cannot be eliminated, keep it outside the CUDA Graph capture region
and graph only the sync-free sections. Partial graphing is better than no
graphing.

Step 4: Verify

Re-run detection to confirm syncs are eliminated:

```python
torch.cuda.set_sync_debug_mode('error')  # will raise if any sync remains
train_step(model, batch)
torch.cuda.set_sync_debug_mode(0)
```

Or re-profile with Nsight Systems and confirm no `cudaStreamSynchronize` / `cudaEventSynchronize` / `cudaDeviceSynchronize` calls appear in the target region.

Quick Reference Table

| Sync-Inducing Pattern | Sync-Free Alternative |
|---|---|
| **Device Transfers** | |
| `.cpu()` or `.to('cpu')` | `.to('cpu', non_blocking=True)` (fire-and-forget only) |
| `.cuda()` or `.to('cuda')` | `.to('cuda', non_blocking=True)` |
| `.type(torch.LongTensor)` | `.type(torch.long)` (dtype conversion, stays on GPU) |
| **Tensor Creation** | |
| `torch.tensor(obj, device='cuda')` | Create on CPU, then `.to('cuda', non_blocking=True)` |
| `torch.tensor(0, device='cuda')` | `torch.zeros(1, device='cuda', dtype=...).squeeze()` |
| `torch.as_tensor(arr, device='cuda')` | Create on CPU, then `.to('cuda', non_blocking=True)` |
| `torch.cuda.BoolTensor(list)` | `torch.tensor(list, device='cpu').to('cuda', non_blocking=True)` |
| **Control Flow** | |
| `.item()` in conditionals | `torch.where()` or move outside critical region |
| `if gpu_tensor:` | Keep logic on GPU with `torch.where()` |
| Python `max(a, b)` on GPU tensors | `torch.max(a, b)` |
| `torch.is_nonzero(t)` | Avoid; use GPU-side comparisons |
| **Indexing** | |
| `x_gpu[idx_cpu]` or `x_gpu[idx_list]` | `x_gpu[idx_gpu]` (keep indices on same device) |
| `x_gpu[idx] = 0` (scalar assignment) | `x_gpu[idx] = zero_gpu` (GPU tensor value) |
| `x[i:j]` with CUDA tensor bounds | `x[:, s]` with `s = torch.arange(i, j, device='cuda')` |
| **Dynamic Shapes** | |
| `x_gpu[mask_gpu]` (masked selection) | `torch.where(mask_gpu, x_gpu, 0)` (fixed shape) |
| `torch.nonzero(mask)` | `torch.where()` or move outside critical region |
| `torch.masked_select(x, mask)` | `torch.where(mask, x, 0)` |
| `torch.unique(x)` | Avoid in hot path; precompute if possible |
| `torch.repeat_interleave(x, r)` | Specify `output_size=N` if known |
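
Two of the rows above are easy to get wrong in practice, so here is a short sketch of each: `repeat_interleave` with a known `output_size`, and indexing with an index tensor that already lives on the same device. The values are made up, and the `output_size` of 6 assumes the caller knows the sum of `repeats` ahead of time.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build inputs on the CPU first, then move them, as the table suggests.
x = torch.tensor([10, 20, 30]).to(device, non_blocking=True)
repeats = torch.tensor([1, 2, 3]).to(device, non_blocking=True)

# Without output_size the result length must be computed from `repeats`,
# which lives on the device. Passing it in (here 1 + 2 + 3 = 6, known to the
# caller in advance) lets the output be allocated without reading it back.
expanded = torch.repeat_interleave(x, repeats, output_size=6)

# Indexing with an index tensor that is already on the same device.
idx = torch.tensor([2, 0]).to(device, non_blocking=True)
picked = x[idx]
```

If `output_size` does not match the actual sum of `repeats`, the result is not meaningful, so only pass it when the total is genuinely known on the host.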

Finding More Information

  • Tier 1 (this file): Workflow, classification, elimination strategies, and quick reference table
  • Tier 2 (`references/sync-patterns.md`): Comprehensive pattern catalog with 9 categories, full code examples showing sync-inducing and sync-free versions, and the specific CUDA driver API triggered by each pattern