perf-workload-profiling

Workload Profiling

Quick Reference

Pick ONE path based on the workload type:

| Workload | Approach | Section |
| --- | --- | --- |
| Training loop | Manual `torch.cuda.synchronize()` + `time.perf_counter()` with warmup | Loop Workloads — Manual Timing |
| Single kernel or op | Write CUDA event benchmark (pre-allocate, warmup, event pairs) | Non-Loop Workloads — CUDA Event Benchmarking |
| Add timeline labels for nsys | Use `@nvtx.annotate` decorator or context manager | NVTX Reference |

Principles

- **Measure, don't guess.** Every performance claim must trace back to profiler output or structured measurement data. Never invent metrics.
- **Isolate steady-state.** Warmup costs (CUDA context init, cuDNN autotuning, JIT compilation) distort measurements. Always exclude warmup iterations before collecting data.
- **Use hardware timing.** CUDA events measure GPU time precisely. CPU timers (`time.perf_counter()`) include host overhead and, without an explicit synchronize, miss asynchronous execution.
- **No sync inside measurement loops.** Each `torch.cuda.synchronize()` adds 10-50us overhead. Record CUDA events asynchronously, then sync once at the end.
- **Pre-allocate everything.** Tensors, events, and compiled kernels are all created before the timing loop. For CuTe DSL kernels, pre-compile with `cute.compile()`.
- **Minimize profiler interference.** Start with lightweight measurement (manual timing for latency/throughput) and escalate to heavier tools (Kineto, nsys, ncu) only when lighter tools cannot answer the question.

Loop Workloads — Manual Timing

For training loops and iterative workloads, use manual `torch.cuda.synchronize()` + `time.perf_counter()` timing with warmup to measure per-iteration latency, throughput, and data load time.

Injection Template

Read the user's training script, understand the dataloader and loop structure, then inject timing code.

```python
import statistics
import time

import torch

WARMUP = 5
NUM_ITERS = 30
BATCH_SIZE = 128  # global batch size for throughput calculation

iter_times = []
data_times = []
prev_iter_end = None  # end time of the previous iteration, if any

for i, batch in enumerate(dataloader):
    if i >= WARMUP + NUM_ITERS:
        break

    # The batch has just been fetched, so this marks the end of data loading.
    t_data_end = time.perf_counter()

    torch.cuda.synchronize()
    t_start = time.perf_counter()

    # ... existing training loop body ...

    torch.cuda.synchronize()
    t_end = time.perf_counter()

    if i >= WARMUP:
        iter_ms = (t_end - t_start) * 1000
        iter_times.append(iter_ms)
        if prev_iter_end is not None:
            data_times.append((t_data_end - prev_iter_end) * 1000)
        print(f"[{i:04d}]: iter {iter_ms:.2f} ms, fps {BATCH_SIZE / (iter_ms / 1000):.2f}")

    prev_iter_end = t_end

mean_iter_ms = statistics.mean(iter_times)
print(f"Average: iter {mean_iter_ms:.2f} ms, "
      f"fps {BATCH_SIZE / (mean_iter_ms / 1000):.2f}")
if data_times:
    print(f"Average: data {statistics.mean(data_times):.2f} ms")
```

Interpreting Results

- iter (ms): Wall-clock time per iteration (compute + communication, excluding data loading).
- data (ms): Time spent in the dataloader between iterations. If `data / iter > 0.2`, data loading is a bottleneck.
- fps: Global throughput in samples/second. Use with known FLOPs-per-sample to compute MFU.
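
The two derived quantities above can be sketched in a few lines. This is a minimal illustration, not part of the template: the function names, the timing numbers, the FLOPs-per-sample figure, and the peak FLOPs value are all hypothetical placeholders, and MFU is taken here in its usual sense of achieved FLOPs/s divided by the aggregate theoretical peak of the GPUs in use.

```python
def data_loading_ratio(mean_data_ms: float, mean_iter_ms: float) -> float:
    """Fraction of data-loading time relative to iteration time (data / iter)."""
    return mean_data_ms / mean_iter_ms


def mfu(samples_per_sec: float, flops_per_sample: float,
        peak_flops_per_gpu: float, num_gpus: int = 1) -> float:
    """Achieved FLOPs/s divided by aggregate peak FLOPs/s."""
    achieved_flops_per_sec = samples_per_sec * flops_per_sample
    return achieved_flops_per_sec / (peak_flops_per_gpu * num_gpus)


# Hypothetical measurements, standing in for the template's averages.
mean_iter_ms = 250.0
mean_data_ms = 60.0
batch_size = 128

fps = batch_size / (mean_iter_ms / 1000)           # global samples/second
ratio = data_loading_ratio(mean_data_ms, mean_iter_ms)
is_data_bound = ratio > 0.2                        # threshold from the list above

# Hypothetical model cost and hardware peak; substitute real values.
utilization = mfu(fps, flops_per_sample=6e11, peak_flops_per_gpu=1e15, num_gpus=1)

print(f"fps {fps:.1f}, data/iter {ratio:.2f}, data-bound {is_data_bound}, MFU {utilization:.1%}")
```

Because `fps` is a global figure, the peak in the denominator has to cover every GPU that contributes to it, which is why `num_gpus` is a parameter.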

Limitations

Manual timing reports aggregate iteration timing, not a per-sub-phase breakdown (forward, backward, optimizer). When the user asks where time is spent within compute, either:

- Add `torch.cuda.synchronize()` + `time.perf_counter()` around each sub-phase for a one-off diagnosis, or
- Add NVTX annotations and run with `nsys profile` for timeline visualization.
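
For the first option, one possible shape is a small context manager that synchronizes before and after each phase and accumulates the elapsed time. This is only a sketch: `timed_phase`, `phase_ms`, and `fake_work` are hypothetical names, and the `sync` argument is left as a plain callable so the example runs without a GPU. In a real training script it would be `torch.cuda.synchronize`.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


@contextmanager
def timed_phase(name, totals, sync=None):
    """Accumulate wall-clock milliseconds for one phase into totals[name]."""
    if sync is not None:
        sync()  # drain earlier GPU work so it is not charged to this phase
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()  # wait for this phase's GPU work to finish
        totals[name] += (time.perf_counter() - start) * 1000


def fake_work(seconds):
    """Stand-in for forward/backward/optimizer work (hypothetical)."""
    time.sleep(seconds)


phase_ms = defaultdict(float)

# In a real loop, pass sync=torch.cuda.synchronize to each timed_phase call.
with timed_phase("forward", phase_ms):
    fake_work(0.01)
with timed_phase("backward", phase_ms):
    fake_work(0.02)
with timed_phase("optimizer", phase_ms):
    fake_work(0.005)

for name, ms in phase_ms.items():
    print(f"{name}: {ms:.2f} ms")
```

The extra synchronizations serialize the phases, so the numbers are useful for a one-off breakdown but should not be left in place when measuring steady-state throughput.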

Non-Loop Workloads — CUDA Event Benchmarking

For single kernels, one-shot inference, or standalone operations, write CUDA event benchmarking code directly.

PyTorch: Simple (Mean Only)

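```python
import torch

def benchmark(fn, warmup=50, iters=100):
    # Warm up so one-time costs (JIT, autotuning, clock ramp-up) are excluded.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # One event pair around the whole loop; no sync inside it.
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # single sync so both events have completed

    return start.elapsed_time(end) / iters  # ms per iteration
```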

PyTorch: Detailed (Per-Iteration Stats)

```python
import torch
import statistics

def benchmark_detailed(fn, warmup=50, iters=100):
    # Warm up so one-time costs are excluded from the measured iterations.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    # Pre-allocate one event pair per iteration, outside the timing loop.
    starts = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]
    ends = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]

    # Events are recorded asynchronously; no sync inside the loop.
    for i in range(iters):
        starts[i].record()
        fn()
        ends[i].record()

    torch.cuda.synchronize()  # single sync before reading any elapsed time
    times = [starts[i].elapsed_time(ends[i]) for i in range(iters)]

    return {
        "mean_ms": statistics.mean(times),
        "median_ms": statistics.median(times),
        "std_ms": statistics.stdev(times) if len(times) > 1 else 0,
        "min_ms": min(times),
        "max_ms": max(times),
    }
```

Anti-Patterns

| Anti-Pattern | Problem |
| --- | --- |
| `torch.cuda.synchronize()` before AND after each iteration | Adds ~10-50us overhead per iteration |
| `time.perf_counter()` for GPU timing | Measures CPU time, misses async GPU execution |
| Missing warmup | First iterations include JIT, clock ramp-up, context init |
| Allocating tensors inside measurement loop | Allocation overhead pollutes timing |
| Reporting only mean | Hides variance, outliers, bimodal distributions |

For additional benchmarking templates (CUDA Graph, CuTe DSL, Triton, Raw CUDA), see references/benchmarking-patterns.md.
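
To make the "reporting only mean" row concrete, the sketch below summarizes a list of per-iteration times with tail percentiles alongside the mean. It is an illustration only: `summarize` is a hypothetical helper, the sample data is invented to be bimodal, and the percentile method (`statistics.quantiles` with `method="inclusive"`, Python 3.8+) is one reasonable choice among several.

```python
import statistics


def summarize(times_ms):
    """Summarize per-iteration timings, including tail percentiles."""
    ordered = sorted(times_ms)
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean_ms": statistics.mean(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
    }


# Hypothetical bimodal sample: 90 fast iterations and 10 slow ones.
times = [1.0] * 90 + [5.0] * 10
stats = summarize(times)

for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

With this sample the mean lands between the two modes and matches no iteration that actually ran, while the median and p95 together show that most iterations are fast and a minority are several times slower.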

NVTX Reference

NVTX (NVIDIA Tools Extension) adds named annotations to profiler timelines. Use NVTX to label phases (forward, backward, optimizer) for readability in nsys — not for measurement.

```python
import nvtx

# Decorator: annotates every call
@nvtx.annotate("training_step", color="blue")
def training_step():
    ...

# Context manager: annotates a code block
with nvtx.annotate("data_loading", color="green"):
    batch = next(dataloader)
```

- **Do** annotate training phases (forward, backward, optimizer, data loading) for nsys timeline clarity.
- **Do not** annotate for measurement — use CUDA events or manual timing instead.
- **Do not** over-annotate — too many fine-grained ranges add visual clutter and minor overhead.

For NVTX domains, categories, payloads, and legacy API details, see [references/nvtx-api.md](references/nvtx-api.md).

References

- references/benchmarking-patterns.md: CUDA Graph, CuTe DSL, Triton, Raw CUDA templates; warmup guidance; GPU hardware properties; reporting format
- references/nvtx-api.md: Domains, categories, payloads, legacy push/pop API
- references/pytorch-profiler-api.md: PyTorch 2.0+ profiler API changes (`device_time` vs deprecated `cuda_time`)
