perf-workload-profiling

Workload Profiling

Quick Reference

Pick ONE path based on the workload type:

| Workload | Approach | Section |
|---|---|---|
| Training loop | Manual `torch.cuda.synchronize()` + `time.perf_counter()` with warmup | Loop Workloads — Manual Timing |
| Single kernel or op | Write CUDA event benchmark (pre-allocate, warmup, event pairs) | Non-Loop Workloads — CUDA Event Benchmarking |
| Add timeline labels for nsys | Use `@nvtx.annotate` decorator or context manager | NVTX Reference |

Principles

  • Measure, don't guess. Every performance claim must trace back to profiler output or structured measurement data. Never invent metrics.
  • Isolate steady-state. Warmup costs (CUDA context init, cuDNN autotuning, JIT compilation) distort measurements. Always exclude warmup iterations before collecting data.
  • Use hardware timing. CUDA events measure GPU time precisely. CPU timers (`time.perf_counter()`) include host overhead and, unless bracketed by synchronization, miss asynchronous execution.
  • No sync inside measurement loops. Each `torch.cuda.synchronize()` adds 10-50 µs of overhead. Record CUDA events asynchronously and sync once at the end.
  • Pre-allocate everything. Tensors, events, and compiled kernels should all exist before the timing loop starts. For CuTe DSL kernels, pre-compile with `cute.compile()`.
  • Minimize profiler interference. Start with lightweight measurement (manual timing for latency/throughput) and escalate to heavier tools (Kineto, nsys, ncu) only when lighter tools cannot answer the question.

Loop Workloads — Manual Timing

For training loops and iterative workloads, use manual `torch.cuda.synchronize()` + `time.perf_counter()` timing with warmup to measure per-iteration latency, throughput, and data load time.

Injection Template

Read the user's training script, understand the dataloader and loop structure, then inject timing code.
```python
import statistics
import time

import torch

WARMUP = 5
NUM_ITERS = 30
BATCH_SIZE = 128  # global batch size for throughput calculation

iter_times = []  # per-iteration compute time (ms), warmup excluded
data_times = []  # dataloader wait between iterations (ms), warmup excluded
prev_iter_end = None

for i, batch in enumerate(dataloader):
    if i >= WARMUP + NUM_ITERS:
        break

    # The dataloader has just returned a batch: the data wait ends here.
    t_data_end = time.perf_counter()

    torch.cuda.synchronize()
    t_start = time.perf_counter()

    # ... existing training loop body ...

    # Wait for queued GPU work so the CPU timer covers the whole iteration.
    torch.cuda.synchronize()
    t_end = time.perf_counter()

    if i >= WARMUP:
        iter_ms = (t_end - t_start) * 1000
        iter_times.append(iter_ms)
        if prev_iter_end is not None:
            data_times.append((t_data_end - prev_iter_end) * 1000)
        print(f"[{i:04d}]: iter {iter_ms:.2f} ms, fps {BATCH_SIZE / (iter_ms / 1000):.2f}")

    prev_iter_end = t_end

mean_iter_ms = statistics.mean(iter_times)
print(f"Average: iter {mean_iter_ms:.2f} ms, "
      f"fps {BATCH_SIZE / (mean_iter_ms / 1000):.2f}")
if data_times:
    print(f"Average: data {statistics.mean(data_times):.2f} ms")
```

Interpreting Results

  • iter (ms): Wall-clock time per iteration (compute + communication, excluding data loading).
  • data (ms): Time spent in the dataloader between iterations. If `data / iter > 0.2`, data loading is a bottleneck.
  • fps: Global throughput in samples/second. Use with known FLOPs-per-sample to compute MFU.

Limitations

Manual timing reports aggregate iteration timing, not a per-sub-phase breakdown (forward, backward, optimizer). When the user asks where time is spent within compute:
  1. Add `torch.cuda.synchronize()` + `time.perf_counter()` around each sub-phase for a one-off diagnosis, OR
  2. Add NVTX annotations and run with `nsys profile` for timeline visualization.
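
Option 1 could look roughly like the sketch below. The `timed_phase` helper and the `phase_ms` dictionary are hypothetical names introduced for illustration, and the synchronize call is skipped when PyTorch or CUDA is unavailable so the snippet still runs on a CPU-only machine. Because every phase is bracketed by synchronization, this serializes the GPU pipeline and is suited only to one-off diagnosis.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None
    _HAS_CUDA = False

phase_ms = defaultdict(list)  # phase name -> list of durations in ms


def _sync():
    """Wait for queued GPU work; a no-op when CUDA is unavailable."""
    if _HAS_CUDA:
        torch.cuda.synchronize()


@contextmanager
def timed_phase(name):
    """Synchronize, time the enclosed block, synchronize again, record."""
    _sync()
    t0 = time.perf_counter()
    try:
        yield
    finally:
        _sync()
        phase_ms[name].append((time.perf_counter() - t0) * 1000)


# Stand-in workload so the sketch is runnable as-is.
with timed_phase("demo"):
    sum(range(1000))

# Inside a real iteration (model, loss_fn, optimizer are placeholders):
# with timed_phase("forward"):
#     loss = loss_fn(model(x), y)
# with timed_phase("backward"):
#     loss.backward()
# with timed_phase("optimizer"):
#     optimizer.step()
```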

Non-Loop Workloads — CUDA Event Benchmarking

For single kernels, one-shot inference, or standalone operations, write CUDA event benchmarking code directly.

PyTorch: Simple (Mean Only)

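```python
import torch

def benchmark(fn, warmup=50, iters=100):
    """Return the mean GPU time of fn() in milliseconds."""
    # Warmup: exclude context init, autotuning, and JIT from the measurement.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # One event pair around the whole loop; no sync inside the loop.
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()

    return start.elapsed_time(end) / iters  # ms per iteration
```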

PyTorch: Detailed (Per-Iteration Stats)

```python
import statistics

import torch

def benchmark_detailed(fn, warmup=50, iters=100):
    """Return per-iteration GPU timing statistics for fn() in milliseconds."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    # Pre-allocate one event pair per iteration, before the timing loop.
    starts = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]
    ends = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]

    # Events are recorded asynchronously; no sync inside the loop.
    for i in range(iters):
        starts[i].record()
        fn()
        ends[i].record()

    # Single sync at the end, then read back the elapsed times.
    torch.cuda.synchronize()
    times = [starts[i].elapsed_time(ends[i]) for i in range(iters)]

    return {
        "mean_ms": statistics.mean(times),
        "median_ms": statistics.median(times),
        "std_ms": statistics.stdev(times) if len(times) > 1 else 0,
        "min_ms": min(times),
        "max_ms": max(times),
    }
```

Anti-Patterns

| Anti-Pattern | Problem |
|---|---|
| `torch.cuda.synchronize()` before AND after each iteration | Adds ~10-50us overhead per iteration |
| `time.perf_counter()` for GPU timing | Measures CPU time, misses async GPU execution |
| Missing warmup | First iterations include JIT, clock ramp-up, context init |
| Allocating tensors inside measurement loop | Allocation overhead pollutes timing |
| Reporting only mean | Hides variance, outliers, bimodal distributions |

For additional benchmarking templates (CUDA Graph, CuTe DSL, Triton, Raw CUDA), see references/benchmarking-patterns.md.
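
To illustrate the last anti-pattern, the sketch below summarizes a list of per-iteration times with tail percentiles alongside the mean. The `summarize` helper and the synthetic sample are hypothetical; in practice the input would be the `times` list collected by a per-iteration event benchmark.

```python
import statistics

def summarize(times_ms):
    """Return distribution statistics for per-iteration times (ms)."""
    if len(times_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    pct = statistics.quantiles(times_ms, n=100, method="inclusive")
    return {
        "mean_ms": statistics.mean(times_ms),
        "median_ms": statistics.median(times_ms),
        "std_ms": statistics.stdev(times_ms),
        "p95_ms": pct[94],
        "p99_ms": pct[98],
        "min_ms": min(times_ms),
        "max_ms": max(times_ms),
    }

# Synthetic bimodal sample: most iterations near 1 ms, a few near 5 ms.
sample = [1.0] * 90 + [5.0] * 10
stats = summarize(sample)
for key, value in stats.items():
    print(f"{key}: {value:.3f}")
```

In this sample the mean sits between the two modes and matches neither, while the median and the tail percentiles make the slow iterations visible.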

NVTX Reference

NVTX (NVIDIA Tools Extension) adds named annotations to profiler timelines. Use NVTX to label phases (forward, backward, optimizer) for readability in nsys — not for measurement.
```python
import nvtx

# Decorator: annotates every call
@nvtx.annotate("training_step", color="blue")
def training_step():
    ...

# Context manager: annotates a code block
# (data_iter is assumed to be iter(dataloader))
with nvtx.annotate("data_loading", color="green"):
    batch = next(data_iter)
```

- **Do** annotate training phases (forward, backward, optimizer, data loading) for nsys timeline clarity.
- **Do not** annotate for measurement — use CUDA events or manual timing instead.
- **Do not** over-annotate — too many fine-grained ranges add visual clutter and minor overhead.

For NVTX domains, categories, payloads, and legacy API details, see [references/nvtx-api.md](references/nvtx-api.md).
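
One way to keep annotated code runnable on machines without the `nvtx` package is a no-op fallback, sketched below. The fallback class and the `forward`/`step` functions are hypothetical stand-ins; the fallback mirrors only the `message` and `color` arguments used above.

```python
from contextlib import ContextDecorator

try:
    import nvtx  # real annotations when the nvtx package is installed
    annotate = nvtx.annotate
except ImportError:
    class annotate(ContextDecorator):
        """No-op stand-in usable as a decorator or a context manager."""

        def __init__(self, message=None, color=None, **kwargs):
            self.message = message

        def __enter__(self):
            return self

        def __exit__(self, *exc):
            return False  # never swallow exceptions


@annotate("forward", color="blue")
def forward(x):
    return x * 2


def step(x):
    with annotate("optimizer", color="red"):
        return forward(x) + 1


result = step(3)
```

The annotated functions behave the same with or without the package; the ranges appear in the timeline only when the script runs under `nsys profile` with `nvtx` installed.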

References

  • references/benchmarking-patterns.md: CUDA Graph, CuTe DSL, Triton, Raw CUDA templates; warmup guidance; GPU hardware properties; reporting format
  • references/nvtx-api.md: Domains, categories, payloads, legacy push/pop API
  • references/pytorch-profiler-api.md: PyTorch 2.0+ profiler API changes (`device_time` vs deprecated `cuda_time`)