Writing Sync-Free PyTorch Code

Sync-free code means the CPU keeps queuing work for the GPU without waiting for earlier GPU operations to complete. With host-device synchronizations removed, the GPU can run queued kernels back to back instead of idling while the CPU catches up.

Every host-device synchronization ultimately calls one of three CUDA driver APIs that block the CPU thread:

- `cuEventSynchronize` -- CPU waits until a specific GPU event completes
- `cuStreamSynchronize` -- CPU waits until all work on a stream finishes
- `cuCtxSynchronize` -- CPU waits until all work across all streams finishes

When to Use

Reach for this skill when you encounter:

- Triggers: User wants to remove host-device synchronizations, eliminate CPU stalls from GPU waits, make code async/sync-free, remove `.item()` or `.cpu()` calls that block the CPU, or understand why specific PyTorch operations cause synchronization
- Symptoms: Frequent `cudaStreamSynchronize` in nsys profiles, warnings from `torch.cuda.set_sync_debug_mode`, training throughput limited by CPU-GPU round-trips, `.item()` or `.cpu()` calls in hot loops
- Keywords: "sync-free", "synchronization", ".item()", ".cpu()", "host-device sync", "eliminate syncs", "CPU stall", "non_blocking", "set_sync_debug_mode", "cudaStreamSynchronize", "cudaEventSynchronize", "remove syncs", "async GPU", "CPU waiting on GPU"

Do NOT use this skill for:

- Applying CUDA Graphs or reducing kernel launch overhead (use `perf-torch-cuda-graphs` instead)
- Profiling GPU kernels, system timelines, or finding GPU idle time (use `perf-nsight-compute-analysis` or `perf-nsight-systems`)
- Kernel optimization or code generation (use `kernel-triton-writing`)
- Optimizing NCCL communication or distributed training collective operations
- Reducing GPU memory usage or gradient checkpointing
- General model compilation with `torch.compile`

Requirements

| Dependency | Version | Notes |
|---|---|---|
| PyTorch | >=2.0 | With CUDA support |
| NVIDIA GPU | Any | CUDA-capable |
| Nsight Systems | Optional | For comprehensive sync detection via `nsys` |

Workflow

Step 1: Detect Synchronizations

Use one or both methods to find sync points in the code.

Quick detection -- PyTorch sync debug mode prints a warning with stack trace on every synchronization:

```python
import torch

# Enable at the start of the region you want to check
torch.cuda.set_sync_debug_mode('warn')   # prints warning + stack trace
# torch.cuda.set_sync_debug_mode('error')  # raises exception on sync

# Run your training step / forward pass here
train_step(model, batch)

torch.cuda.set_sync_debug_mode(0)  # disable
```

This mode only detects syncs going through PyTorch's wrapped `cuStreamSynchronize`. Third-party libraries calling CUDA sync APIs directly are not detected.
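To keep the debug mode from leaking past the region under test, the enable/disable pair can be wrapped in a context manager. The following is a minimal sketch; the helper name `sync_debug` is hypothetical and not part of PyTorch, and it assumes that doing nothing is acceptable when no CUDA device is present.

```python
import contextlib

import torch


@contextlib.contextmanager
def sync_debug(mode="warn"):
    """Temporarily set PyTorch's sync debug mode (hypothetical helper)."""
    if not torch.cuda.is_available():
        # No CUDA device: there is nothing to detect, so run the body as-is.
        yield
        return
    previous = torch.cuda.get_sync_debug_mode()
    torch.cuda.set_sync_debug_mode(mode)
    try:
        yield
    finally:
        # Restore whatever mode was active before, even if the body raised.
        torch.cuda.set_sync_debug_mode(previous)


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.ones(8, device=device)

with sync_debug("warn"):
    total = x.sum().item()  # on CUDA this line should trigger a sync warning
```

Restoring the previous mode rather than hard-coding `0` lets the helper nest inside an outer region that already has checking enabled.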

Comprehensive detection -- Nsight Systems captures all sync calls including those from extensions and libraries:

```bash
nsys profile --capture-range=cudaProfilerApi \
             --python-sampling=true \
             --backtrace=dwarf \
             python your_script.py
```

In the Nsight Systems GUI, check the CUDA API timeline row and search for `cudaStreamSynchronize`, `cudaEventSynchronize`, or `cudaDeviceSynchronize`. The call stack panel shows which Python line triggered each sync.
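With `--capture-range=cudaProfilerApi`, nsys only records between the CUDA profiler start and stop calls, so the script has to mark the region itself. One way to do that from Python is sketched below. The helper name `nsys_capture` and the range label are illustrative assumptions; the NVTX range is optional and only makes the region easier to find on the timeline.

```python
import contextlib

import torch


@contextlib.contextmanager
def nsys_capture(label="train_step"):
    """Mark a region for nsys --capture-range=cudaProfilerApi (hypothetical helper)."""
    if not torch.cuda.is_available():
        # Without CUDA there is no profiler API to call; just run the body.
        yield
        return
    torch.cuda.profiler.start()          # cudaProfilerStart
    torch.cuda.nvtx.range_push(label)    # named range shown in the timeline
    try:
        yield
    finally:
        torch.cuda.nvtx.range_pop()
        torch.cuda.profiler.stop()       # cudaProfilerStop


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.ones(4, device=device)

with nsys_capture("demo_region"):
    y = x * 2  # stand-in for the training step being profiled
```

Skipping a few warm-up iterations before entering the region usually keeps one-time initialization syncs out of the capture.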

Step 2: Classify -- False vs True Dependencies

After detecting syncs, classify each one before deciding how to fix it.

False dependencies (avoidable) -- CPU does not actually need the GPU result. These can be eliminated without changing program logic:

- Debug prints left in hot paths (`print(loss.item())`)
- Unnecessary `.item()` calls for logging that could be deferred
- Using `.cuda()` instead of `.to('cuda', non_blocking=True)`
- Using `.type(torch.LongTensor)` instead of `.type(torch.long)`
- Creating tensors from Python objects directly on CUDA

True dependencies (require restructuring) -- CPU genuinely needs the GPU value to proceed:

- Control flow dependency: `if loss.item() > threshold:` -- CPU branches on a GPU-computed value
- Dynamic memory allocation: `output = x[mask]` -- output size depends on GPU computation
- CPU computation using GPU values: computing statistics for logging, updating learning rates from metrics

True dependencies require restructuring: move logic to GPU (`torch.where()`), delay to end of iteration, or accept that those parts stay outside any CUDA Graph capture region.
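As an illustration of restructuring a dynamic-allocation dependency, the sketch below computes the mean of the positive entries of a tensor without ever materializing `x[mask]`. The tensor values are made up for the example, and clamping the count to at least 1 is an assumption about how an empty selection should be handled.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Setup outside the hot path: build on CPU, then move to the device.
x = torch.tensor([1.0, -2.0, 3.0, -4.0]).to(device)
mask = x > 0

# Dynamic-shape version (output size depends on GPU data):
#   mean_pos = x[mask].mean()

# Fixed-shape version: zero out rejected entries and divide by the count.
kept = torch.where(mask, x, torch.zeros_like(x))
count = mask.sum().clamp(min=1)  # assumed policy: avoid dividing by zero
mean_pos = kept.sum() / count    # stays on the device as a 0-d tensor
```

Every intermediate has a shape known in advance, so the CPU never has to ask the GPU how many elements passed the mask.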

Step 3: Eliminate Systematically

Apply fixes in order of increasing difficulty. Start with easy wins.

1. Remove redundancy -- Delete operations that do not need to happen:

- Remove debug prints and logging from hot loops
- Delete unnecessary `.item()` calls
- Eliminate duplicate synchronizations
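When per-step logging cannot be dropped entirely, one option is to accumulate the value on the GPU and read it back only every N steps. This is a sketch under assumptions: the `losses` tensor stands in for real per-step losses, and `log_every = 4` is an arbitrary interval.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for per-step losses that would normally come from the model.
losses = torch.arange(1, 9, dtype=torch.float32, device=device)

log_every = 4                              # assumed logging interval
running = torch.zeros((), device=device)   # accumulator lives on the device
logged = []

for step in range(1, 9):
    loss = losses[step - 1]
    running += loss.detach()               # queued on the GPU, no host wait
    if step % log_every == 0:
        # One .item() per window instead of one per step.
        logged.append(running.item() / log_every)
        running.zero_()
```

Here the eight steps cost two synchronizations rather than eight, and the logged values are window averages instead of per-step readings.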

2. Use `non_blocking=True` -- Make transfers async where CPU does not immediately use the result:

```python
# Before (syncs)
x_gpu = x_cpu.cuda()
x_cpu = x_gpu.cpu()

# After (async, no sync)
x_gpu = x_cpu.to('cuda', non_blocking=True)
x_cpu = x_gpu.to('cpu', non_blocking=True)   # only if CPU does not use x_cpu immediately
```

Only use `non_blocking=True` for GPU-to-CPU when the CPU does not immediately read the result. Otherwise the CPU may operate on incomplete data.
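Two details are worth spelling out. Host-to-device copies are only asynchronous with respect to the host when the source tensor is in pinned (page-locked) memory, and a non-blocking device-to-host copy needs some explicit point where the CPU waits before reading. The sketch below shows one possible arrangement using a pinned staging buffer and a CUDA event; the variable names are illustrative, and on a machine without CUDA it degrades to ordinary CPU copies.

```python
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

# Pinned host memory lets the host-to-device copy overlap with CPU work.
x_cpu = torch.arange(4, dtype=torch.float32)
if use_cuda:
    x_cpu = x_cpu.pin_memory()
x_gpu = x_cpu.to(device, non_blocking=True)

y_gpu = x_gpu * 2

# Device-to-host: copy into a preallocated (pinned, if possible) buffer.
y_cpu = torch.empty(y_gpu.shape, dtype=y_gpu.dtype, pin_memory=use_cuda)
y_cpu.copy_(y_gpu, non_blocking=True)

if use_cuda:
    copy_done = torch.cuda.Event()
    copy_done.record()       # marks the point after the copy on the current stream
    # ... unrelated CPU work can run here ...
    copy_done.synchronize()  # single explicit wait, just before the first read

result = y_cpu.tolist()      # safe to read only after the wait above
```

The event wait is still a synchronization, but it is one that is placed deliberately, at the latest point where the data is needed.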

3. Switch to sync-free API alternatives -- See the Quick Reference Table below for a condensed mapping of common patterns.

4. Delay synchronization to end of iteration -- Move logging and validation to after the optimizer step rather than mid-forward/backward:

```python
# Before: sync mid-iteration
loss = model(batch)
print(f"Loss: {loss.item()}")    # cuStreamSynchronize
loss.backward()

# After: delay to end of iteration
loss = model(batch)
loss.backward()
optimizer.step()
print(f"Loss: {loss.item()}")    # sync is outside the hot path
```

5. Coalesce multiple syncs into one -- If you need several GPU values on CPU, gather them and transfer once:

```python
# Before: 3 separate syncs
loss_val = loss.item()           # cuStreamSynchronize
acc_val = accuracy.item()        # cuStreamSynchronize
gnorm_val = grad_norm.item()     # cuStreamSynchronize

# After: 1 sync
metrics = torch.stack([loss, accuracy, grad_norm])
vals = metrics.cpu()             # single cuStreamSynchronize
loss_val, acc_val, gnorm_val = vals.tolist()
```
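The same idea can be packaged as a small helper that takes named scalar tensors and returns plain Python floats after a single transfer. `fetch_metrics` is a hypothetical name, and the sketch assumes every metric is a one-element tensor that can be cast to `float32`.

```python
import torch


def fetch_metrics(metrics):
    """Move several scalar tensors to the host in one transfer (hypothetical helper)."""
    names = list(metrics)
    # Cast and reshape so tensors of differing dtype or shape (1,) can be stacked.
    stacked = torch.stack(
        [metrics[name].detach().float().reshape(()) for name in names]
    )
    values = stacked.cpu().tolist()  # the only host-device sync in this function
    return dict(zip(names, values))


device = "cuda" if torch.cuda.is_available() else "cpu"
raw = torch.tensor([0.5, 0.75, 2.0]).to(device)  # stand-in metric values

report = fetch_metrics({"loss": raw[0], "accuracy": raw[1], "grad_norm": raw[2]})
```

Calling this once at the end of an iteration combines coalescing with the delayed-sync step described earlier.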

6. Offload logic to GPU -- Replace CPU-side logic with GPU-native ops:

```python
# Before: CPU control flow (syncs)
if loss.item() > threshold:
    result = a
else:
    result = b

# After: GPU-side selection (no sync)
result = torch.where(loss > threshold, a, b)

# Before: Python max (syncs)
val = max(x_gpu[0, 0], x_gpu[0, 1])

# After: torch.max (no sync)
val = torch.max(x_gpu[0, 0], x_gpu[0, 1])
```
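A common case of CPU control flow is skipping a parameter update when the loss is not finite. The sketch below keeps that decision on the device with `torch.where`; `guarded_step`, the learning rate, and the tensor values are illustrative assumptions, not an optimizer API.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Setup outside the hot path: build on CPU, then move to the device.
param = torch.tensor([1.0, 2.0]).to(device)
grad = torch.tensor([0.5, 0.5]).to(device)
lr = 0.1  # assumed learning rate


def guarded_step(param, grad, loss):
    """Apply an SGD-style update only if loss is finite, without a host branch."""
    ok = torch.isfinite(loss)            # 0-d bool tensor, stays on the device
    # Instead of: if not math.isfinite(loss.item()): skip the update
    return torch.where(ok, param - lr * grad, param)


good_loss = torch.tensor(0.3).to(device)
bad_loss = torch.tensor(float("nan")).to(device)

updated = guarded_step(param, grad, good_loss)
skipped = guarded_step(param, grad, bad_loss)
```

Both branches are computed and one is discarded, which trades a little extra arithmetic for the removal of a host round-trip.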

7. Exclude unavoidable syncs from capture range (last resort) -- If a sync cannot be eliminated, keep it outside the CUDA Graph capture region and graph only the sync-free sections. Partial graphing is better than no graphing.

Step 4: Verify

Re-run detection to confirm syncs are eliminated:

```python
torch.cuda.set_sync_debug_mode('error')  # will raise if any sync remains
train_step(model, batch)
torch.cuda.set_sync_debug_mode(0)
```

Or re-profile with Nsight Systems and confirm no `cudaStreamSynchronize`, `cudaEventSynchronize`, or `cudaDeviceSynchronize` calls appear in the target region.

Quick Reference Table

| Sync-Inducing Pattern | Sync-Free Alternative |
|---|---|
| **Device Transfers** | |
| `.cpu()` or `.to('cpu')` | `.to('cpu', non_blocking=True)` (fire-and-forget only) |
| `.cuda()` or `.to('cuda')` | `.to('cuda', non_blocking=True)` |
| `.type(torch.LongTensor)` | `.type(torch.long)` (dtype conversion, stays on GPU) |
| **Tensor Creation** | |
| `torch.tensor(obj, device='cuda')` | Create on CPU, then `.to('cuda', non_blocking=True)` |
| `torch.tensor(0, device='cuda')` | `torch.zeros(1, device='cuda', dtype=...).squeeze()` |
| `torch.as_tensor(arr, device='cuda')` | Create on CPU, then `.to('cuda', non_blocking=True)` |
| `torch.cuda.BoolTensor(list)` | `torch.tensor(list, device='cpu').to('cuda', non_blocking=True)` |
| **Control Flow** | |
| `.item()` in conditionals | `torch.where()` or move outside critical region |
| `if gpu_tensor:` | Keep logic on GPU with `torch.where()` |
| Python `max(a, b)` on GPU tensors | `torch.max(a, b)` |
| `torch.is_nonzero(t)` | Avoid; use GPU-side comparisons |
| **Indexing** | |
| `x_gpu[idx_cpu]` or `x_gpu[idx_list]` | `x_gpu[idx_gpu]` (keep indices on same device) |
| `x_gpu[idx] = 0` (scalar assignment) | `x_gpu[idx] = zero_gpu` (GPU tensor value) |
| `x[i:j]` with CUDA tensor bounds | `x[:, s]` with `s = torch.arange(i, j, device='cuda')` |
| **Dynamic Shapes** | |
| `x_gpu[mask_gpu]` (masked selection) | `torch.where(mask_gpu, x_gpu, 0)` (fixed shape) |
| `torch.nonzero(mask)` | `torch.where()` or move outside critical region |
| `torch.masked_select(x, mask)` | `torch.where(mask, x, 0)` |
| `torch.unique(x)` | Avoid in hot path; precompute if possible |
| `torch.repeat_interleave(x, r)` | Specify `output_size=N` if known |
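As a concrete instance of the last row, `torch.repeat_interleave` with a tensor of repeat counts normally has to learn the total output length from the GPU before it can allocate the result. Passing `output_size` supplies that length up front. The values below are made up, and the sketch assumes the caller knows the sum of the repeats ahead of time (here 1 + 2 + 3 = 6).

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Setup outside the hot path: build on CPU, then move to the device.
x = torch.tensor([10, 20, 30]).to(device)
repeats = torch.tensor([1, 2, 3]).to(device)

# Without output_size, the output length must be read back from the device.
# With it, the allocation size is known on the host in advance.
out = torch.repeat_interleave(x, repeats, output_size=6)
```

If `output_size` does not match the true sum of `repeats`, the result is not reliable, so this only fits cases where the total is fixed by construction.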

Finding More Information

- Tier 1 (this file): Workflow, classification, elimination strategies, and quick reference table
- Tier 2 (`references/sync-patterns.md`): Comprehensive pattern catalog with 9 categories, full code examples showing sync-inducing and sync-free versions, and the specific CUDA driver API triggered by each pattern
