perf-torch-sync-free

Writing Sync-Free PyTorch Code

Sync-free code means the CPU continuously queues work to the GPU without waiting for GPU operations to complete. When host-device synchronizations are eliminated, the GPU works continuously without idle stalls.

Every host-device synchronization ultimately calls one of three CUDA driver APIs that block the CPU thread:

- `cuEventSynchronize` -- CPU waits until a specific GPU event completes
- `cuStreamSynchronize` -- CPU waits until all work on a stream finishes
- `cuCtxSynchronize` -- CPU waits until all work across all streams finishes
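
To make the queue-versus-wait distinction concrete, the following is a minimal sketch that times how long the CPU spends enqueuing a chain of matrix multiplications compared with how long it takes until a value is actually read back. The function name `enqueue_vs_wait` and the problem sizes are illustrative assumptions, not part of the skill; on a machine without CUDA the sketch falls back to the CPU, where both timings are roughly equal because every operation runs eagerly.

```python
import time

import torch


def enqueue_vs_wait(n=1024, iters=10):
    """Return (enqueue_seconds, total_seconds) for a chain of matmuls."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Scale by 1/sqrt(n) so repeated products keep a stable magnitude.
    a = torch.randn(n, n, device=device) / n ** 0.5
    if device == "cuda":
        torch.cuda.synchronize()  # start the measurement from an idle GPU

    t0 = time.perf_counter()
    b = a
    for _ in range(iters):
        b = b @ a  # on CUDA this queues a kernel and returns to the CPU
    t_enqueue = time.perf_counter() - t0

    _ = b.sum().item()  # .item() blocks until all queued work has finished
    t_total = time.perf_counter() - t0
    return t_enqueue, t_total


if __name__ == "__main__":
    queued, total = enqueue_vs_wait()
    print(f"enqueue: {queued * 1e3:.2f} ms, until value read: {total * 1e3:.2f} ms")
```

On a GPU, the gap between the two numbers is the time the CPU would have spent stalled if a blocking call had sat inside the loop.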

When to Use

Reach for this skill when you encounter:

- **Triggers:** User wants to remove host-device synchronizations, eliminate CPU stalls from GPU waits, make code async/sync-free, remove `.item()` or `.cpu()` calls that block the CPU, or understand why specific PyTorch operations cause synchronization
- **Symptoms:** Frequent `cudaStreamSynchronize` in nsys profiles, warnings from `torch.cuda.set_sync_debug_mode`, training throughput limited by CPU-GPU round-trips, `.item()` or `.cpu()` calls in hot loops
- **Keywords:** "sync-free", "synchronization", ".item()", ".cpu()", "host-device sync", "eliminate syncs", "CPU stall", "non_blocking", "set_sync_debug_mode", "cudaStreamSynchronize", "cudaEventSynchronize", "remove syncs", "async GPU", "CPU waiting on GPU"

Do NOT use this skill for:

- Applying CUDA Graphs or reducing kernel launch overhead (use `perf-torch-cuda-graphs` instead)
- Profiling GPU kernels, system timelines, or finding GPU idle time (use `perf-nsight-compute-analysis` or `perf-nsight-systems`)
- Kernel optimization or code generation (use `kernel-triton-writing`)
- Optimizing NCCL communication or distributed training collective operations
- Reducing GPU memory usage or gradient checkpointing
- General model compilation with `torch.compile`

Requirements

| Dependency | Version | Notes |
|---|---|---|
| PyTorch | >=2.0 | With CUDA support |
| NVIDIA GPU | Any | CUDA-capable |
| Nsight Systems | Optional | For comprehensive sync detection via `nsys` |

Workflow

Step 1: Detect Synchronizations

Use one or both methods to find sync points in the code.

**Quick detection** -- PyTorch sync debug mode prints a warning with a stack trace on every synchronization:

```python
import torch

# Enable at the start of the region you want to check.
# Pick one mode; a later call overrides an earlier one.
torch.cuda.set_sync_debug_mode('warn')     # prints warning + stack trace
# torch.cuda.set_sync_debug_mode('error')  # alternative: raises exception on sync

# Run your training step / forward pass here
train_step(model, batch)

torch.cuda.set_sync_debug_mode(0)  # disable
```

This mode only detects syncs going through PyTorch's wrapped `cuStreamSynchronize`. Third-party libraries calling CUDA sync APIs directly are not detected.

**Comprehensive detection** -- Nsight Systems captures all sync calls, including those from extensions and libraries:

```bash
nsys profile --capture-range=cudaProfilerApi \
             --python-sampling=true \
             --backtrace=dwarf \
             python your_script.py
```

In the Nsight Systems GUI, check the CUDA API timeline row and search for `cudaStreamSynchronize`, `cudaEventSynchronize`, or `cudaDeviceSynchronize`. The call stack panel shows which Python line triggered each sync.
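
With `--capture-range=cudaProfilerApi`, Nsight Systems only records between `cudaProfilerStart` and `cudaProfilerStop`, so the script has to mark the region of interest itself. The sketch below shows one way to do that from PyTorch; the wrapper name `profiled_step` and the default label are assumptions made for illustration. It falls back to a plain call when no CUDA device is present.

```python
import torch


def profiled_step(step_fn, *args, label="train_step"):
    """Run step_fn inside a cudaProfilerStart/Stop window with an NVTX range."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.profiler.start()        # opens the nsys capture range
        torch.cuda.nvtx.range_push(label)  # names the region on the timeline
    try:
        return step_fn(*args)
    finally:
        if use_cuda:
            torch.cuda.nvtx.range_pop()
            # Deliberate sync: lets queued GPU work finish inside the capture.
            torch.cuda.synchronize()
            torch.cuda.profiler.stop()     # closes the capture range
```

The single `torch.cuda.synchronize()` at the end will show up once in the profile as `cudaDeviceSynchronize`; any other sync inside the NVTX range comes from the step being examined.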

Step 2: Classify -- False vs True Dependencies

After detecting syncs, classify each one before deciding how to fix it.

**False dependencies (avoidable)** -- The CPU does not actually need the GPU result. These can be eliminated without changing program logic:

- Debug prints left in hot paths (`print(loss.item())`)
- Unnecessary `.item()` calls for logging that could be deferred
- Using `.cuda()` instead of `.to('cuda', non_blocking=True)`
- Using `.type(torch.LongTensor)` instead of `.type(torch.long)`
- Creating tensors from Python objects directly on CUDA

**True dependencies (require restructuring)** -- The CPU genuinely needs the GPU value to proceed:

- Control flow dependency: `if loss.item() > threshold:` -- CPU branches on a GPU-computed value
- Dynamic memory allocation: `output = x[mask]` -- output size depends on GPU computation
- CPU computation using GPU values: computing statistics for logging, updating learning rates from metrics

True dependencies require restructuring: move logic to GPU (`torch.where()`), delay to end of iteration, or accept that those parts stay outside any CUDA Graph capture region.
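
One of the false dependencies listed above is easy to miss because it looks like a plain cast. The following small sketch illustrates why `.type(torch.LongTensor)` behaves differently from `.type(torch.long)`; the variable names are illustrative, and the device is chosen at run time so the snippet also executes on a CPU-only machine.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.arange(6, dtype=torch.float32, device=device)

# torch.LongTensor names the CPU tensor type, so this cast also moves the
# data to the host (a blocking device-to-host copy when x lives on CUDA).
via_tensor_type = x.type(torch.LongTensor)

# torch.long is only a dtype, so the result stays on x's device.
via_dtype = x.type(torch.long)

print(via_tensor_type.device)  # cpu, regardless of where x lives
print(via_dtype.device)        # same device as x
```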

Step 3: Eliminate Systematically

Apply fixes in order of increasing difficulty. Start with easy wins.

**1. Remove redundancy** -- Delete operations that do not need to happen:

- Remove debug prints and logging from hot loops
- Delete unnecessary `.item()` calls
- Eliminate duplicate synchronizations

**2. Use `non_blocking=True`** -- Make transfers async where the CPU does not immediately use the result:

```python
# Before (syncs)
x_gpu = x_cpu.cuda()
x_cpu = x_gpu.cpu()

# After (async, no sync)
x_gpu = x_cpu.to('cuda', non_blocking=True)
x_cpu = x_gpu.to('cpu', non_blocking=True)  # only if CPU does not use x_cpu immediately
```

Only use `non_blocking=True` for GPU-to-CPU when the CPU does not immediately read the result. Otherwise the CPU may operate on incomplete data.
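
When the CPU does eventually need the copied data, one option is to record a CUDA event after the copy and wait on that event only at the point of use. The sketch below illustrates this under two assumptions that are not spelled out in the text above: host buffers are placed in pinned (page-locked) memory, which is what allows the copy to overlap with CPU work, and the helper name `double_on_device` is made up for the example. It falls back to a CPU computation when CUDA is unavailable.

```python
import torch


def double_on_device(x_cpu: torch.Tensor) -> torch.Tensor:
    """Copy to the GPU, double, and copy back; wait only right before the read."""
    if not torch.cuda.is_available():
        return x_cpu * 2  # CPU-only fallback so the sketch runs anywhere

    src = x_cpu.pin_memory()                   # page-locked source buffer
    x_gpu = src.to("cuda", non_blocking=True)  # queued host-to-device copy
    y_gpu = x_gpu * 2

    dst = torch.empty(x_cpu.shape, dtype=x_cpu.dtype, pin_memory=True)
    dst.copy_(y_gpu, non_blocking=True)        # queued copy; dst is NOT valid yet

    copy_done = torch.cuda.Event()
    copy_done.record()                         # marks the point after the copy

    # ... unrelated CPU work can run here while the GPU finishes ...

    copy_done.synchronize()                    # wait for this one event, then read
    return dst
```

The event wait is still a synchronization, but it is narrow (one event rather than the whole stream or device) and sits exactly where the value is consumed.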

**3. Switch to sync-free API alternatives** -- See the Quick Reference Table below for a condensed mapping of common patterns.

**4. Delay synchronization to end of iteration** -- Move logging and validation to after the optimizer step rather than mid-forward/backward:

```python
# Before: sync mid-iteration
loss = model(batch)
print(f"Loss: {loss.item()}")  # cuStreamSynchronize
loss.backward()

# After: delay to end of iteration
loss = model(batch)
loss.backward()
optimizer.step()
print(f"Loss: {loss.item()}")  # sync is outside the hot path
```
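
Delaying can be taken one step further by reading the loss only every few iterations. The sketch below keeps a running sum on the device and performs one host read per logging interval; the function name, the `log_every` parameter, and the squared-output placeholder loss are all assumptions for illustration rather than part of the workflow described here.

```python
import torch


def train_with_deferred_logging(model, optimizer, batches, log_every=50):
    """Train over batches, reading the loss on the host once per log_every steps."""
    device = next(model.parameters()).device
    running = torch.zeros((), device=device)  # accumulator stays on the device
    logged = []
    for step, batch in enumerate(batches, start=1):
        optimizer.zero_grad(set_to_none=True)
        loss = model(batch).pow(2).mean()     # placeholder loss for the sketch
        loss.backward()
        optimizer.step()
        running += loss.detach()              # queued on the device, no host read
        if step % log_every == 0:
            logged.append(running.item() / log_every)  # the only sync point
            running.zero_()
    return logged
```

With `log_every=50`, a loop that previously synchronized on every step synchronizes on one step in fifty, at the cost of coarser loss curves.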


**5. Coalesce multiple syncs into one** -- If you need several GPU values on CPU, gather them and transfer once:

```python
# Before: 3 separate syncs
loss_val = loss.item()        # cuStreamSynchronize
acc_val = accuracy.item()     # cuStreamSynchronize
gnorm_val = grad_norm.item()  # cuStreamSynchronize

# After: 1 sync
metrics = torch.stack([loss, accuracy, grad_norm])
vals = metrics.cpu()  # single cuStreamSynchronize
loss_val, acc_val, gnorm_val = vals.tolist()
```
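
The same idea can be wrapped in a small helper so that call sites stay readable. This is a sketch only: `fetch_scalars` is a hypothetical name, and it casts every value to `float32` before stacking, which is an assumption that suits typical metrics but would lose precision for large integers or `float64` values.

```python
import torch


def fetch_scalars(**tensors):
    """Read several single-element tensors with one device-to-host transfer."""
    names = list(tensors)
    stacked = torch.stack(
        [tensors[name].detach().float().reshape(()) for name in names]
    )
    values = stacked.tolist()  # the only host read
    return dict(zip(names, values))
```

A call such as `fetch_scalars(loss=loss, acc=accuracy, gnorm=grad_norm)` then returns a plain dictionary of Python floats.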


**6. Offload logic to GPU** -- Replace CPU-side logic with GPU-native ops:

```python
# Before: CPU control flow (syncs)
if loss.item() > threshold:
    result = a
else:
    result = b

# After: GPU-side selection (no sync)
result = torch.where(loss > threshold, a, b)

# Before: Python max (syncs)
val = max(x_gpu[0, 0], x_gpu[0, 1])

# After: torch.max (no sync)
val = torch.max(x_gpu[0, 0], x_gpu[0, 1])
```
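
A common case of CPU control flow is skipping an update when the loss is not finite. The sketch below expresses that decision with `torch.where` so that no value is read on the host; `guarded_sgd_step` and its plain SGD update rule are illustrative assumptions, not an API of PyTorch or of this skill.

```python
import torch


def guarded_sgd_step(param, grad, loss, lr=0.1):
    """Apply an SGD update only when loss is finite, without a host-side branch."""
    ok = torch.isfinite(loss)  # 0-dim bool tensor that stays on the device
    step = torch.where(ok, lr * grad, torch.zeros_like(grad))
    return param - step
```

Both branches are computed every time, which is the usual trade-off of moving a condition onto the GPU: a little extra arithmetic in exchange for never stalling the CPU.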


**7. Exclude unavoidable syncs from capture range** (last resort) -- If a sync cannot be eliminated, keep it outside the CUDA Graph capture region and graph only the sync-free sections. Partial graphing is better than no graphing.

Step 4: Verify

Re-run detection to confirm syncs are eliminated:

```python
torch.cuda.set_sync_debug_mode('error')  # will raise if any sync remains
train_step(model, batch)
torch.cuda.set_sync_debug_mode(0)
```

Or re-profile with Nsight Systems and confirm no `cudaStreamSynchronize`, `cudaEventSynchronize`, or `cudaDeviceSynchronize` calls appear in the target region.
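
For a regression check that can live in a test suite, the debug-mode pattern can be wrapped so the previous mode is always restored. The sketch below is one possible shape; `is_sync_free` is a hypothetical helper, and it treats any `RuntimeError` as a detected sync, which is an assumption that could misreport unrelated runtime failures. Without a CUDA device it reports `True` because there is nothing to check.

```python
import torch


def is_sync_free(fn, *args, **kwargs):
    """Return True if fn runs without tripping PyTorch's sync debug mode."""
    if not torch.cuda.is_available():
        return True  # nothing to check without a CUDA device
    previous = torch.cuda.get_sync_debug_mode()
    torch.cuda.set_sync_debug_mode("error")
    try:
        fn(*args, **kwargs)
        return True
    except RuntimeError:
        return False  # assumed to be a sync; inspect the message to be sure
    finally:
        torch.cuda.set_sync_debug_mode(previous)
```

As with the debug mode itself, this only covers syncs issued through PyTorch, so a passing check does not replace an Nsight Systems profile.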

Quick Reference Table

| Sync-Inducing Pattern | Sync-Free Alternative |
|---|---|
| **Device Transfers** | |
| `.cpu()` or `.to('cpu')` | `.to('cpu', non_blocking=True)` (fire-and-forget only) |
| `.cuda()` or `.to('cuda')` | `.to('cuda', non_blocking=True)` |
| `.type(torch.LongTensor)` | `.type(torch.long)` (dtype conversion, stays on GPU) |
| **Tensor Creation** | |
| `torch.tensor(obj, device='cuda')` | Create on CPU, then `.to('cuda', non_blocking=True)` |
| `torch.tensor(0, device='cuda')` | `torch.zeros(1, device='cuda', dtype=...).squeeze()` |
| `torch.as_tensor(arr, device='cuda')` | Create on CPU, then `.to('cuda', non_blocking=True)` |
| `torch.cuda.BoolTensor(list)` | `torch.tensor(list, device='cpu').to('cuda', non_blocking=True)` |
| **Control Flow** | |
| `.item()` in conditionals | `torch.where()` or move outside critical region |
| `if gpu_tensor:` | Keep logic on GPU with `torch.where()` |
| Python `max(a, b)` on GPU tensors | `torch.max(a, b)` |
| `torch.is_nonzero(t)` | Avoid; use GPU-side comparisons |
| **Indexing** | |
| `x_gpu[idx_cpu]` or `x_gpu[idx_list]` | `x_gpu[idx_gpu]` (keep indices on same device) |
| `x_gpu[idx] = 0` (scalar assignment) | `x_gpu[idx] = zero_gpu` (GPU tensor value) |
| `x[i:j]` with CUDA tensor bounds | `x[:, s]` with `s = torch.arange(i, j, device='cuda')` |
| **Dynamic Shapes** | |
| `x_gpu[mask_gpu]` (masked selection) | `torch.where(mask_gpu, x_gpu, 0)` (fixed shape) |
| `torch.nonzero(mask)` | `torch.where()` or move outside critical region |
| `torch.masked_select(x, mask)` | `torch.where(mask, x, 0)` |
| `torch.unique(x)` | Avoid in hot path; precompute if possible |
| `torch.repeat_interleave(x, r)` | Specify `output_size=N` if known |
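
As an illustration of the Dynamic Shapes rows, the sketch below computes a masked mean without ever materializing `x[mask]`, so the output shape never depends on a GPU-computed value. The helper name `masked_mean_fixed_shape` and the choice to return 0 for an all-false mask are assumptions made for the example.

```python
import torch


def masked_mean_fixed_shape(x, mask):
    """Mean of x where mask is True, computed without x[mask]."""
    kept = torch.where(mask, x, torch.zeros_like(x))  # same shape as x
    count = mask.sum().clamp(min=1)                   # stays on device; avoids 0/0
    return kept.sum() / count
```

The result matches `x[mask].mean()` whenever at least one element is selected, while every intermediate tensor keeps a shape known ahead of time.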
