perf-torch-sync-free
Writing Sync-Free PyTorch Code
Sync-free code means the CPU continuously queues work to the GPU without
waiting for GPU operations to complete. When host-device synchronizations
are eliminated, the GPU works continuously without idle stalls.
Every host-device synchronization ultimately calls one of three CUDA driver
APIs that block the CPU thread:
- `cuEventSynchronize` -- CPU waits until a specific GPU event completes
- `cuStreamSynchronize` -- CPU waits until all work on a stream finishes
- `cuCtxSynchronize` -- CPU waits until all work across all streams finishes
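For orientation, the sketch below lists PyTorch-level calls that block the CPU at each of these three scopes. It assumes the usual correspondence through the CUDA runtime (`cudaEventSynchronize`, `cudaStreamSynchronize`, `cudaDeviceSynchronize`); the function name `demo_blocking_waits` is hypothetical, and the code skips itself when no CUDA device is present.

```python
import torch


def demo_blocking_waits():
    """Issue one wait at event, stream, and device scope (illustrative only)."""
    if not torch.cuda.is_available():
        return "skipped: no CUDA device"

    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x  # queued on the current stream; the CPU does not wait here

    done = torch.cuda.Event()
    done.record()       # mark a point on the current stream
    done.synchronize()  # event scope: wait until that point is reached

    torch.cuda.current_stream().synchronize()  # stream scope: wait for this stream
    torch.cuda.synchronize()                   # device scope: wait for all streams
    return "ok"


status = demo_blocking_waits()
```

Each of these calls is an explicit wait. The rest of this skill is mostly about implicit ones, such as `.item()`, that trigger the same blocking behavior without naming it.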
When to Use
Reach for this skill when you encounter:
- Triggers: User wants to remove host-device synchronizations, eliminate CPU stalls from GPU waits, make code async/sync-free, remove `.item()` or `.cpu()` calls that block the CPU, or understand why specific PyTorch operations cause synchronization
- Symptoms: Frequent `cudaStreamSynchronize` in nsys profiles, warnings from `torch.cuda.set_sync_debug_mode`, training throughput limited by CPU-GPU round-trips, `.item()` or `.cpu()` calls in hot loops
- Keywords: "sync-free", "synchronization", ".item()", ".cpu()", "host-device sync", "eliminate syncs", "CPU stall", "non_blocking", "set_sync_debug_mode", "cudaStreamSynchronize", "cudaEventSynchronize", "remove syncs", "async GPU", "CPU waiting on GPU"
Do NOT use this skill for:
- Applying CUDA Graphs or reducing kernel launch overhead (use `perf-torch-cuda-graphs` instead)
- Profiling GPU kernels, system timelines, or finding GPU idle time (use `perf-nsight-compute-analysis` or `perf-nsight-systems`)
- Kernel optimization or code generation (use `kernel-triton-writing`)
- Optimizing NCCL communication or distributed training collective operations
- Reducing GPU memory usage or gradient checkpointing
- General model compilation with `torch.compile`
Requirements
| Dependency | Version | Notes |
|---|---|---|
| PyTorch | >=2.0 | With CUDA support |
| NVIDIA GPU | Any | CUDA-capable |
| Nsight Systems | Optional | For comprehensive sync detection (see Step 1) |
Workflow
Step 1: Detect Synchronizations
Use one or both methods to find sync points in the code.
**Quick detection** -- PyTorch sync debug mode prints a warning with stack
trace on every synchronization:
```python
import torch

# Enable at the start of the region you want to check
torch.cuda.set_sync_debug_mode('warn')     # prints warning + stack trace
# torch.cuda.set_sync_debug_mode('error')  # alternative: raises exception on sync

# Run your training step / forward pass here
train_step(model, batch)

torch.cuda.set_sync_debug_mode(0)  # disable
```
This mode only detects syncs going through PyTorch's wrapped
`cuStreamSynchronize`. Third-party libraries calling CUDA sync APIs
directly are not detected.
**Comprehensive detection** -- Nsight Systems captures all sync calls
including those from extensions and libraries:
```bash
nsys profile --capture-range=cudaProfilerApi \
  --python-sampling=true \
  --backtrace=dwarf \
  python your_script.py
```
With `--capture-range=cudaProfilerApi`, collection starts only once the
script calls `cudaProfilerStart` (for example via
`torch.cuda.profiler.start()`).
In the Nsight Systems GUI, check the CUDA API timeline row and search
for `cudaStreamSynchronize`, `cudaEventSynchronize`, or
`cudaDeviceSynchronize`. The call stack panel shows which Python line
triggered each sync.
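Both detection methods are applied to a bounded region of code, so it can help to wrap that region once. The following is a minimal sketch, not part of PyTorch: `sync_check_region` is a hypothetical helper that sets the sync debug mode, optionally brackets the region with `torch.cuda.profiler.start()` / `stop()` for an nsys capture range, and restores the previous mode even if an exception is raised. It does nothing on machines without CUDA.

```python
import contextlib

import torch


@contextlib.contextmanager
def sync_check_region(mode="warn", profile=False):
    """Hypothetical helper: flag host-device syncs inside the `with` body."""
    if not torch.cuda.is_available():
        yield  # nothing to check on CPU-only machines
        return
    previous = torch.cuda.get_sync_debug_mode()
    torch.cuda.set_sync_debug_mode(mode)
    if profile:
        torch.cuda.profiler.start()  # opens the cudaProfilerApi capture range
    try:
        yield
    finally:
        if profile:
            torch.cuda.profiler.stop()
        torch.cuda.set_sync_debug_mode(previous)  # restore even on error


device = "cuda" if torch.cuda.is_available() else "cpu"
with sync_check_region("warn"):
    # Stand-in for train_step(model, batch): device-side work only
    checked = (torch.ones(4, device=device) * 2).sum()
```

Restoring the previous mode in `finally` matters most with `'error'`, where the region is expected to raise on the first remaining sync.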
Step 2: Classify -- False vs True Dependencies
After detecting syncs, classify each one before deciding how to fix it.
False dependencies (avoidable) -- CPU does not actually need the GPU
result. These can be eliminated without changing program logic:
- Debug prints left in hot paths (`print(loss.item())`)
- Unnecessary `.item()` calls for logging that could be deferred
- Using `.cuda()` instead of `.to('cuda', non_blocking=True)`
- Using `.type(torch.LongTensor)` instead of `.type(torch.long)`
- Creating tensors from Python objects directly on CUDA
True dependencies (require restructuring) -- CPU genuinely needs the
GPU value to proceed:
- Control flow dependency: `if loss.item() > threshold:` -- CPU branches on a GPU-computed value
- Dynamic memory allocation: `output = x[mask]` -- output size depends on GPU computation
- CPU computation using GPU values: computing statistics for logging, updating learning rates from metrics
True dependencies require restructuring: move logic to GPU
(`torch.where()`), delay to end of iteration, or accept that those parts
stay outside any CUDA Graph capture region.
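To make the "dynamic memory allocation" case concrete, here is a small sketch contrasting boolean-mask indexing with a fixed-shape rewrite. It assumes the downstream computation (a sum, in this illustration) can tolerate zeros in place of the dropped elements; that is a property of this example, not a general rule. The tensor values are made up.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.arange(-2.0, 3.0, device=device)  # [-2, -1, 0, 1, 2]
mask = x > 0

# True dependency: the output length depends on how many mask entries are
# set, so on CUDA the host has to wait to learn the size to allocate.
selected = x[mask]
total_dynamic = selected.sum()

# Fixed-shape rewrite: same shape as x, rejected entries replaced by zero.
masked = torch.where(mask, x, torch.zeros_like(x))
total_fixed = masked.sum()
```

When the consumer really needs the compacted tensor, the dependency is genuine and the sync stays, ideally moved out of the hot path.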
Step 3: Eliminate Systematically
Apply fixes in order of increasing difficulty. Start with easy wins.
**1. Remove redundancy** -- Delete operations that do not need to happen:
- Remove debug prints and logging from hot loops
- Delete unnecessary `.item()` calls
- Eliminate duplicate synchronizations
**2. Use `non_blocking=True`** -- Make transfers async where CPU does not
immediately use the result:
```python
# Before (syncs)
x_gpu = x_cpu.cuda()
x_cpu = x_gpu.cpu()

# After (async, no sync)
x_gpu = x_cpu.to('cuda', non_blocking=True)
x_cpu = x_gpu.to('cpu', non_blocking=True)  # only if CPU does not use x_cpu immediately
```
Only use `non_blocking=True` for GPU-to-CPU when the CPU does not
immediately read the result. Otherwise the CPU may operate on incomplete
data.
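A minimal sketch of both transfer directions follows. Two details are assumptions added here rather than taken from the text above: host-to-device copies are generally only asynchronous when the source tensor lives in pinned (page-locked) memory, and recording a CUDA event after a device-to-host copy is one way to know when the CPU tensor is safe to read. The code falls back to plain CPU tensors when no GPU is available.

```python
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

x_cpu = torch.arange(8, dtype=torch.float32)
if use_cuda:
    x_cpu = x_cpu.pin_memory()  # page-locked source for an async H2D copy

x_gpu = x_cpu.to(device, non_blocking=True)  # CPU continues immediately
y_gpu = x_gpu * 2

y_cpu = y_gpu.to("cpu", non_blocking=True)   # contents not guaranteed yet
if use_cuda:
    copy_done = torch.cuda.Event()
    copy_done.record()       # marks the point after the D2H copy
    # ... unrelated CPU work can overlap with the copy here ...
    copy_done.synchronize()  # wait only when the values are needed

total = y_cpu.sum().item()   # safe: the copy has completed
```

The event wait is still a synchronization, but it is placed where the value is consumed, so the CPU can queue further GPU work in between.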
**3. Switch to sync-free API alternatives** -- See the Quick Reference
Table below for a condensed mapping of common patterns.
**4. Delay synchronization to end of iteration** -- Move logging and
validation to after the optimizer step rather than mid-forward/backward:
```python
# Before: sync mid-iteration
loss = model(batch)
print(f"Loss: {loss.item()}")  # cuStreamSynchronize
loss.backward()

# After: delay to end of iteration
loss = model(batch)
loss.backward()
optimizer.step()
print(f"Loss: {loss.item()}")  # sync is outside the hot path
```
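Delaying can be taken one step further by logging every few iterations. The sketch below is an illustration under assumptions, not the method prescribed above: the toy `Linear` model, the squared-output loss, and the `log_every` interval are all made up. The running loss stays on the device and `.item()` is called once per logging interval.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

model = torch.nn.Linear(4, 1).to(device)  # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

log_every = 5  # hypothetical logging interval
running_loss = torch.zeros((), device=device)
logged = []

for step in range(1, 11):
    batch = torch.randn(8, 4, device=device)
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    running_loss += loss.detach()  # accumulates on the device, no host read
    if step % log_every == 0:
        logged.append((running_loss / log_every).item())  # one sync per interval
        running_loss.zero_()
```

The trade-off is coarser logging: individual step losses are no longer visible on the host, only the interval average.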
**5. Coalesce multiple syncs into one** -- If you need several GPU values
on CPU, gather them and transfer once:
```python
# Before: 3 separate syncs
loss_val = loss.item()        # cuStreamSynchronize
acc_val = accuracy.item()     # cuStreamSynchronize
gnorm_val = grad_norm.item()  # cuStreamSynchronize

# After: 1 sync
metrics = torch.stack([loss, accuracy, grad_norm])
vals = metrics.cpu()  # single cuStreamSynchronize
loss_val, acc_val, gnorm_val = vals.tolist()
```
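The same idea can be packaged as a small utility. `fetch_scalars` below is a hypothetical helper, not a PyTorch API; it assumes every argument is a one-element tensor on the same device and casts to float so that mixed dtypes stack cleanly.

```python
import torch


def fetch_scalars(**tensors):
    """Hypothetical helper: move several one-element tensors to the host at once."""
    names = list(tensors)
    stacked = torch.stack(
        [tensors[name].detach().float().reshape(()) for name in names]
    )
    return dict(zip(names, stacked.tolist()))  # the only device-to-host read


device = "cuda" if torch.cuda.is_available() else "cpu"
metrics = fetch_scalars(
    loss=torch.full((), 0.25, device=device),
    accuracy=torch.full((), 0.5, device=device),
    grad_norm=torch.full((), 2.0, device=device),
)
```

The example values are created with `torch.full` on the device rather than from Python scalars, in keeping with the tensor-creation advice in this document.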
**6. Offload logic to GPU** -- Replace CPU-side logic with GPU-native ops:
```python
# Before: CPU control flow (syncs)
if loss.item() > threshold:
    result = a
else:
    result = b

# After: GPU-side selection (no sync)
result = torch.where(loss > threshold, a, b)

# Before: Python max (syncs)
val = max(x_gpu[0, 0], x_gpu[0, 1])

# After: torch.max (no sync)
val = torch.max(x_gpu[0, 0], x_gpu[0, 1])
```
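A runnable version of the two rewrites, with made-up values for `loss`, `threshold`, `a`, and `b`, might look like the sketch below. One caveat worth stating as an assumption of the technique: `torch.where` selects between two tensors that both already exist, so both branches are computed regardless of the condition.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

threshold = 1.0                             # plain Python float
loss = torch.full((), 1.5, device=device)   # stand-in for a computed loss
a = torch.ones(3, device=device)
b = torch.zeros(3, device=device)

# The comparison yields a 0-dim bool tensor on the device; no host read
result = torch.where(loss > threshold, a, b)

x_gpu = torch.arange(6, dtype=torch.float32, device=device).reshape(2, 3)
# Elementwise maximum of two 0-dim tensors; the result stays on the device
val = torch.max(x_gpu[0, 0], x_gpu[0, 1])
```

Python's built-in `max` forces a sync because it compares the two tensors and then needs the boolean result on the host to pick one.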
**7. Exclude unavoidable syncs from capture range** (last resort) -- If a
sync cannot be eliminated, keep it outside the CUDA Graph capture region
and graph only the sync-free sections. Partial graphing is better than no
graphing.
Step 4: Verify
Re-run detection to confirm syncs are eliminated:
```python
import torch

torch.cuda.set_sync_debug_mode('error')  # will raise if any sync remains
try:
    train_step(model, batch)
finally:
    torch.cuda.set_sync_debug_mode(0)  # restore the default even if a sync raised
```
Or re-profile with Nsight Systems and confirm no `cudaStreamSynchronize` /
`cudaEventSynchronize` / `cudaDeviceSynchronize` calls appear in the
target region.
Quick Reference Table
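| Sync-Inducing Pattern | Sync-Free Alternative |
|---|---|
| Device Transfers | |
| `.cuda()` | `.to('cuda', non_blocking=True)` |
| `.cpu()` | `.to('cpu', non_blocking=True)` |
| `.type(torch.LongTensor)` | `.type(torch.long)` |
| Tensor Creation | |
| Tensor created from Python objects directly on CUDA | Create on CPU, then transfer with `.to('cuda', non_blocking=True)` |
| Control Flow | |
| `.item()` in conditionals | Keep logic on GPU with `torch.where()` |
| Python `max()` on GPU tensors | `torch.max()` |
| … | Avoid; use GPU-side comparisons |
| Indexing | |
| … | … |
| Dynamic Shapes | |
| … | Avoid in hot path; precompute if possible |
| … | Specify the output size if it is known |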
Finding More Information
- Tier 1 (this file): Workflow, classification, elimination strategies, and quick reference table
- Tier 2 (`references/sync-patterns.md`): Comprehensive pattern catalog with 9 categories, full code examples showing sync-inducing and sync-free versions, and the specific CUDA driver API triggered by each pattern