perf-workload-profiling
Workload Profiling
Quick Reference
Pick ONE path based on the workload type:
| Workload | Approach | Section |
|---|---|---|
| Training loop | Manual `torch.cuda.synchronize()` + `time.perf_counter()` timing with warmup | Loop Workloads — Manual Timing |
| Single kernel or op | Write CUDA event benchmark (pre-allocate, warmup, event pairs) | Non-Loop Workloads — CUDA Event Benchmarking |
| Add timeline labels for nsys | Use `nvtx.annotate` | NVTX Reference |
Principles
- Measure, don't guess. Every performance claim must trace back to profiler output or structured measurement data. Never invent metrics.
- Isolate steady-state. Warmup costs (CUDA context init, cuDNN autotuning, JIT compilation) distort measurements. Always exclude warmup iterations before collecting data.
- Use hardware timing. CUDA events measure GPU time precisely. CPU timers (`time.perf_counter()`) include host overhead and miss asynchronous execution.
- No sync inside measurement loops. Each `torch.cuda.synchronize()` adds 10-50us overhead. Record CUDA events asynchronously, sync once at the end.
- Pre-allocate everything. Tensors, events, compiled kernels: create all of them before the timing loop. For CuTe DSL kernels, pre-compile with `cute.compile()`.
- Minimize profiler interference. Start with lightweight measurement (manual timing for latency/throughput) and escalate to heavier tools (Kineto, nsys, ncu) only when lighter tools cannot answer the question.
Loop Workloads — Manual Timing
For training loops and iterative workloads, use manual `torch.cuda.synchronize()` + `time.perf_counter()` timing with warmup to measure per-iteration latency, throughput, and data load time.
Injection Template
Read the user's training script, understand the dataloader and loop structure, then inject timing code.
```python
import statistics
import time

import torch

WARMUP = 5
NUM_ITERS = 30
BATCH_SIZE = 128  # global batch size for throughput calculation

iter_times = []  # per-iteration time in ms, warmup excluded
data_times = []  # time spent waiting on the dataloader in ms, warmup excluded
prev_iter_end = None

for i, batch in enumerate(dataloader):
    if i >= WARMUP + NUM_ITERS:
        break
    t_data_end = time.perf_counter()  # the batch is now available on the host

    torch.cuda.synchronize()
    t_start = time.perf_counter()

    # ... existing training loop body ...

    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    t_end = time.perf_counter()

    if i >= WARMUP:
        iter_ms = (t_end - t_start) * 1000
        iter_times.append(iter_ms)
        if prev_iter_end is not None:
            data_times.append((t_data_end - prev_iter_end) * 1000)
        print(f"[{i:04d}]: iter {iter_ms:.2f} ms, fps {BATCH_SIZE / (iter_ms / 1000):.2f}")
    prev_iter_end = t_end

mean_iter_ms = statistics.mean(iter_times)
mean_data_ms = statistics.mean(data_times) if data_times else 0.0
print(f"Average: iter {mean_iter_ms:.2f} ms, data {mean_data_ms:.2f} ms, "
      f"fps {BATCH_SIZE / (mean_iter_ms / 1000):.2f}")
```
Interpreting Results
- iter (ms): Wall-clock time per iteration (compute + communication, excluding data loading).
- data (ms): Time spent in the dataloader between iterations. If `data / iter > 0.2`, data loading is a bottleneck.
- fps: Global throughput in samples/second. Use with known FLOPs-per-sample to compute MFU.
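As a rough sketch of how these three numbers relate, the helper below turns the `iter_times` and `data_times` lists from the injection template into a summary. The function name `summarize_run` and the `flops_per_sample` and `peak_flops` parameters are illustrative assumptions, not part of the original template. MFU is computed here as achieved FLOPs per second divided by peak FLOPs per second, so `peak_flops` should be the aggregate peak of all GPUs that contribute to the global batch.

```python
from statistics import mean


def summarize_run(iter_times_ms, data_times_ms, batch_size,
                  flops_per_sample=None, peak_flops=None):
    """Summarize timing lists collected by a manual timing loop.

    All names here are hypothetical; adapt them to the actual script.
    """
    iter_ms = mean(iter_times_ms)
    data_ms = mean(data_times_ms) if data_times_ms else 0.0
    fps = batch_size / (iter_ms / 1000.0)  # samples per second
    summary = {
        "iter_ms": iter_ms,
        "data_ms": data_ms,
        "data_ratio": data_ms / iter_ms,
        # Rule of thumb from the text: data / iter > 0.2 suggests a data bottleneck.
        "data_bound": data_ms / iter_ms > 0.2,
        "fps": fps,
    }
    if flops_per_sample is not None and peak_flops is not None:
        # MFU = achieved FLOP/s divided by peak FLOP/s (aggregate over all GPUs).
        summary["mfu"] = fps * flops_per_sample / peak_flops
    return summary
```

Passing the hardware peak explicitly keeps the helper free of any hard-coded GPU specifications, which would otherwise be a guess.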
Limitations
Manual timing reports aggregate iteration timing, not a per-sub-phase breakdown (forward, backward, optimizer). When the user asks where time is spent within compute:
- Add `torch.cuda.synchronize()` + `time.perf_counter()` around each sub-phase for a one-off diagnosis, OR
- Add NVTX annotations and run with `nsys profile` for timeline visualization.
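The first option can be sketched as a small context manager that synchronizes before and after each sub-phase. `PhaseTimer` and its `sync` parameter are hypothetical names introduced for illustration. In a GPU script you would pass `torch.cuda.synchronize` as `sync`. The default is a no-op so the sketch also runs without a GPU.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock time per named sub-phase (one-off diagnosis only)."""

    def __init__(self, sync=None):
        # `sync` is called before and after each phase, e.g. torch.cuda.synchronize.
        self._sync = sync if sync is not None else (lambda: None)
        self.totals_ms = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        self._sync()  # drain earlier GPU work so it is not charged to this phase
        start = time.perf_counter()
        try:
            yield
        finally:
            self._sync()  # wait for this phase's GPU work to finish
            self.totals_ms[name] += (time.perf_counter() - start) * 1000.0
            self.counts[name] += 1

    def report(self):
        """Return the mean time in ms for each phase seen so far."""
        return {n: self.totals_ms[n] / self.counts[n] for n in self.totals_ms}


# Hypothetical usage inside a training loop:
#   timer = PhaseTimer(sync=torch.cuda.synchronize)
#   with timer.phase("forward"):
#       loss = model(batch)
#   with timer.phase("backward"):
#       loss.backward()
```

The extra synchronization serializes host and device, so treat the numbers as a diagnostic breakdown and remove the timer before measuring end-to-end throughput.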
Non-Loop Workloads — CUDA Event Benchmarking
For single kernels, one-shot inference, or standalone operations, write CUDA event benchmarking code directly.
PyTorch: Simple (Mean Only)
```python
import torch


def benchmark(fn, warmup=50, iters=100):
    # Warmup: context init, autotuning, and JIT happen outside the timed region.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # single sync after all work has been queued

    return start.elapsed_time(end) / iters  # ms per iteration
```
PyTorch: Detailed (Per-Iteration Stats)
```python
import statistics

import torch


def benchmark_detailed(fn, warmup=50, iters=100):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()

    # Pre-allocate one event pair per iteration so nothing is created in the timed loop.
    starts = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]
    ends = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]

    for i in range(iters):
        starts[i].record()
        fn()
        ends[i].record()
    torch.cuda.synchronize()  # sync once, after all events are recorded

    times = [starts[i].elapsed_time(ends[i]) for i in range(iters)]
    return {
        "mean_ms": statistics.mean(times),
        "median_ms": statistics.median(times),
        "std_ms": statistics.stdev(times) if len(times) > 1 else 0,
        "min_ms": min(times),
        "max_ms": max(times),
    }
```
Anti-Patterns
| Anti-Pattern | Problem |
|---|---|
| Calling `torch.cuda.synchronize()` around every iteration | Adds ~10-50us overhead per iteration |
| Using `time.perf_counter()` (a CPU timer) | Measures CPU time, misses async GPU execution |
| Missing warmup | First iterations include JIT, clock ramp-up, context init |
| Allocating tensors inside measurement loop | Allocation overhead pollutes timing |
| Reporting only mean | Hides variance, outliers, bimodal distributions |
For additional benchmarking templates (CUDA Graph, CuTe DSL, Triton, Raw CUDA), see references/benchmarking-patterns.md.
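To illustrate the last row of the table, here is a minimal sketch that reports tail percentiles alongside the mean for a list of per-iteration times, such as the `times` list built inside `benchmark_detailed`. The helper name `summarize_latencies` and the choice of p95/p99 are assumptions made for this example, not a reporting format prescribed by the source.

```python
import statistics


def summarize_latencies(times_ms):
    """Report the mean together with spread and tail statistics (all in ms)."""
    if len(times_ms) < 2:
        raise ValueError("need at least two samples to report spread")
    ordered = sorted(times_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is p95 and index 98 is p99.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean_ms": statistics.mean(ordered),
        "median_ms": statistics.median(ordered),
        "std_ms": statistics.stdev(ordered),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
    }


# Synthetic example: 95 fast iterations and 5 slow ones.
# The mean alone would not reveal that some iterations are five times slower.
example = summarize_latencies([1.0] * 95 + [5.0] * 5)
```

A large gap between the median and p99 is a hint to look for clock changes, background work, or a bimodal distribution before trusting the mean.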
NVTX Reference
NVTX (NVIDIA Tools Extension) adds named annotations to profiler timelines. Use NVTX to label phases (forward, backward, optimizer) for readability in nsys — not for measurement.
```python
import nvtx

# Decorator: annotates every call
@nvtx.annotate("training_step", color="blue")
def training_step():
    ...

# Context manager: annotates a code block
# (assumes `dataloader` is an iterator, e.g. the result of iter(...))
with nvtx.annotate("data_loading", color="green"):
    batch = next(dataloader)
```

- **Do** annotate training phases (forward, backward, optimizer, data loading) for nsys timeline clarity.
- **Do not** annotate for measurement; use CUDA events or manual timing instead.
- **Do not** over-annotate: too many fine-grained ranges add visual clutter and minor overhead.

For NVTX domains, categories, payloads, and legacy API details, see [references/nvtx-api.md](references/nvtx-api.md).
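As a sketch of the first recommendation, the step function below labels the forward, backward, and optimizer phases with one range each. The `train_step` function and its three callables are hypothetical placeholders for the real training code. The `ImportError` fallback is an addition made for this example so the script still runs where the `nvtx` package is not installed.

```python
from contextlib import nullcontext

try:
    import nvtx
    annotate = nvtx.annotate
except ImportError:
    # Fallback (an assumption for this sketch): a no-op stand-in with the same call shape.
    def annotate(message=None, color=None):
        return nullcontext()


def train_step(forward, backward, optimizer_step):
    """Run one training step with one NVTX range per phase.

    The three arguments are hypothetical callables wrapping the real code.
    """
    with annotate("forward", color="blue"):
        loss = forward()
    with annotate("backward", color="red"):
        backward(loss)
    with annotate("optimizer", color="green"):
        optimizer_step()
    return loss
```

Three coarse ranges per step are usually enough to read the nsys timeline. The ranges label the timeline only and do not produce timing numbers themselves.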
References
- references/benchmarking-patterns.md: CUDA Graph, CuTe DSL, Triton, Raw CUDA templates; warmup guidance; GPU hardware properties; reporting format
- references/nvtx-api.md: Domains, categories, payloads, legacy push/pop API
- references/pytorch-profiler-api.md: PyTorch 2.0+ profiler API changes (`device_time` vs deprecated `cuda_time`)