Writing Sync-Free PyTorch Code
Sync-free code means the CPU continuously queues work to the GPU without
waiting for GPU operations to complete. When host-device synchronizations
are eliminated, the GPU works continuously without idle stalls.
Every host-device synchronization ultimately calls one of three CUDA driver
APIs that block the CPU thread:
- -- CPU waits until a specific GPU event completes
- -- CPU waits until all work on a stream finishes
- -- CPU waits until all work across all streams finishes
When to Use
Reach for this skill when you encounter:
- Triggers: User wants to remove host-device synchronizations, eliminate
CPU stalls from GPU waits, make code async/sync-free, remove or
calls that block the CPU, or understand why specific PyTorch
operations cause synchronization
- Symptoms: Frequent in nsys profiles,
warnings from
torch.cuda.set_sync_debug_mode
, training throughput
limited by CPU-GPU round-trips, or calls in hot loops
- Keywords: "sync-free", "synchronization", ".item()", ".cpu()",
"host-device sync", "eliminate syncs", "CPU stall", "non_blocking",
"set_sync_debug_mode", "cudaStreamSynchronize", "cudaEventSynchronize",
"remove syncs", "async GPU", "CPU waiting on GPU"
Do NOT use this skill for:
- Applying CUDA Graphs or reducing kernel launch overhead (use
instead)
- Profiling GPU kernels, system timelines, or finding GPU idle time (use
perf-nsight-compute-analysis
or )
- Kernel optimization or code generation (use )
- Optimizing NCCL communication or distributed training collective
operations
- Reducing GPU memory usage or gradient checkpointing
- General model compilation with
Requirements
| Dependency | Version | Notes |
|---|
| PyTorch | >=2.0 | With CUDA support |
| NVIDIA GPU | Any | CUDA-capable |
| Nsight Systems | Optional | For comprehensive sync detection via |
Workflow
Step 1: Detect Synchronizations
Use one or both methods to find sync points in the code.
Quick detection -- PyTorch sync debug mode prints a warning with stack
trace on every synchronization:
python
import torch
# Enable at the start of the region you want to check
torch.cuda.set_sync_debug_mode('warn') # prints warning + stack trace
# torch.cuda.set_sync_debug_mode('error') # raises exception on sync
# Run your training step / forward pass here
train_step(model, batch)
torch.cuda.set_sync_debug_mode(0) # disable
This mode only detects syncs going through PyTorch's wrapped
. Third-party libraries calling CUDA sync APIs
directly are not detected.
Comprehensive detection -- Nsight Systems captures all sync calls
including those from extensions and libraries:
bash
nsys profile --capture-range=cudaProfilerApi \
--python-sampling=true \
--backtrace=dwarf \
python your_script.py
In the Nsight Systems GUI, check the
CUDA API timeline row and search
for
,
, or
. The call stack panel shows which Python line
triggered each sync.
Step 2: Classify -- False vs True Dependencies
After detecting syncs, classify each one before deciding how to fix it.
False dependencies (avoidable) -- CPU does not actually need the GPU
result. These can be eliminated without changing program logic:
- Debug prints left in hot paths ()
- Unnecessary calls for logging that could be deferred
- Using instead of
.to('cuda', non_blocking=True)
- Using instead of
- Creating tensors from Python objects directly on CUDA
True dependencies (require restructuring) -- CPU genuinely needs the
GPU value to proceed:
- Control flow dependency:
if loss.item() > threshold:
-- CPU
branches on a GPU-computed value
- Dynamic memory allocation: -- output size depends
on GPU computation
- CPU computation using GPU values: computing statistics for logging,
updating learning rates from metrics
True dependencies require restructuring: move logic to GPU
(
), delay to end of iteration, or accept that those parts
stay outside any CUDA Graph capture region.
Step 3: Eliminate Systematically
Apply fixes in order of increasing difficulty. Start with easy wins.
1. Remove redundancy -- Delete operations that do not need to happen:
- Remove debug prints and logging from hot loops
- Delete unnecessary calls
- Eliminate duplicate synchronizations
2. Use -- Make transfers async where CPU does not
immediately use the result:
python
# Before (syncs)
x_gpu = x_cpu.cuda()
x_cpu = x_gpu.cpu()
# After (async, no sync)
x_gpu = x_cpu.to('cuda', non_blocking=True)
x_cpu = x_gpu.to('cpu', non_blocking=True) # only if CPU does not use x_cpu immediately
Only use
for GPU-to-CPU when the CPU does not
immediately read the result. Otherwise the CPU may operate on incomplete
data.
3. Switch to sync-free API alternatives -- See the Quick Reference
Table below for a condensed mapping of common patterns.
4. Delay synchronization to end of iteration -- Move logging and
validation to after the optimizer step rather than mid-forward/backward:
python
# Before: sync mid-iteration
loss = model(batch)
print(f"Loss: {loss.item()}") # cuStreamSynchronize
loss.backward()
# After: delay to end of iteration
loss = model(batch)
loss.backward()
optimizer.step()
print(f"Loss: {loss.item()}") # sync is outside the hot path
5. Coalesce multiple syncs into one -- If you need several GPU values
on CPU, gather them and transfer once:
python
# Before: 3 separate syncs
loss_val = loss.item() # cuStreamSynchronize
acc_val = accuracy.item() # cuStreamSynchronize
gnorm_val = grad_norm.item() # cuStreamSynchronize
# After: 1 sync
metrics = torch.stack([loss, accuracy, grad_norm])
vals = metrics.cpu() # single cuStreamSynchronize
loss_val, acc_val, gnorm_val = vals.tolist()
6. Offload logic to GPU -- Replace CPU-side logic with GPU-native ops:
python
# Before: CPU control flow (syncs)
if loss.item() > threshold:
result = a
else:
result = b
# After: GPU-side selection (no sync)
result = torch.where(loss > threshold, a, b)
# Before: Python max (syncs)
val = max(x_gpu[0, 0], x_gpu[0, 1])
# After: torch.max (no sync)
val = torch.max(x_gpu[0, 0], x_gpu[0, 1])
7. Exclude unavoidable syncs from capture range (last resort) -- If a
sync cannot be eliminated, keep it outside the CUDA Graph capture region
and graph only the sync-free sections. Partial graphing is better than no
graphing.
Step 4: Verify
Re-run detection to confirm syncs are eliminated:
python
torch.cuda.set_sync_debug_mode('error') # will raise if any sync remains
train_step(model, batch)
torch.cuda.set_sync_debug_mode(0)
Or re-profile with Nsight Systems and confirm no
/
/
calls appear in the
target region.
Quick Reference Table
| Sync-Inducing Pattern | Sync-Free Alternative |
|---|
| Device Transfers | |
| or | .to('cpu', non_blocking=True)
(fire-and-forget only) |
| or | .to('cuda', non_blocking=True)
|
| (dtype conversion, stays on GPU) |
| Tensor Creation | |
torch.tensor(obj, device='cuda')
| Create on CPU, then .to('cuda', non_blocking=True)
|
torch.tensor(0, device='cuda')
| torch.zeros(1, device='cuda', dtype=...).squeeze()
|
torch.as_tensor(arr, device='cuda')
| Create on CPU, then .to('cuda', non_blocking=True)
|
torch.cuda.BoolTensor(list)
| torch.tensor(list, device='cpu').to('cuda', non_blocking=True)
|
| Control Flow | |
| in conditionals | or move outside critical region |
| Keep logic on GPU with |
| Python on GPU tensors | |
| Avoid; use GPU-side comparisons |
| Indexing | |
| or | (keep indices on same device) |
| (scalar assignment) | (GPU tensor value) |
| with CUDA tensor bounds | with s = torch.arange(i, j, device='cuda')
|
| Dynamic Shapes | |
| (masked selection) | torch.where(mask_gpu, x_gpu, 0)
(fixed shape) |
| or move outside critical region |
torch.masked_select(x, mask)
| |
| Avoid in hot path; precompute if possible |
torch.repeat_interleave(x, r)
| Specify if known |
Finding More Information
- Tier 1 (this file): Workflow, classification, elimination strategies,
and quick reference table
- Tier 2 (
references/sync-patterns.md
): Comprehensive pattern catalog
with 9 categories, full code examples showing sync-inducing and sync-free
versions, and the specific CUDA driver API triggered by each pattern