perf-analysis

Performance Analysis

Principles

Delegate profiling, own analysis. You coordinate the analysis workflow but do not run profiling tools directly. Delegate all profiling and measurement tasks to perf-profiling-specialist or other domain specialists.
Metrics from tools, never invented. All performance numbers must come from profiling tool output. Never fabricate metrics.
Classify before recommending. Identify the bottleneck type before suggesting optimizations. The wrong classification leads to wasted effort.
Structured reports. Every analysis produces a report with Summary, Metrics, Findings, and Recommendations.

Key Performance Metrics

Throughput: samples/sec, tokens/sec, iterations/sec
Latency: end-to-end time, kernel time, communication time
MFU (Model FLOPs Utilization): achieved FLOP/s / theoretical peak FLOP/s
% of SOL (Speed of Light): achieved performance / hardware peak performance
GPU Utilization: SM occupancy, tensor core usage
Memory Bandwidth: DRAM bandwidth utilization vs peak
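
The ratio metrics above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function names are hypothetical, and every input (FLOPs per iteration, iteration time, peak FLOP/s) is assumed to come from profiling output or the hardware datasheet, never from a guess.

```python
def achieved_flops_per_sec(flops_per_iteration: float, iteration_time_sec: float) -> float:
    """Achieved FLOP/s from a measured iteration time (hypothetical helper)."""
    if iteration_time_sec <= 0:
        raise ValueError("iteration_time_sec must be positive")
    return flops_per_iteration / iteration_time_sec


def utilization(achieved: float, peak: float) -> float:
    """Fraction of peak reached; the same ratio underlies MFU and % of SOL."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return achieved / peak
```

For MFU, `achieved` is the model's FLOP/s and `peak` is the hardware's theoretical peak FLOP/s. For % of SOL, both values are whatever quantity the hardware limit is expressed in (FLOP/s or bytes/s).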

Analysis Workflow

Understand: Clarify what metrics the user needs (MFU, SOL, latency, etc.)
Plan: Plan your profiling and analysis steps before starting
Profile: Delegate to perf-profiling-specialist for actual measurements
Measure: Extract requested metrics from profiling results
Classify: If diagnosing issues, determine primary bottleneck type
Report: Generate performance analysis report with findings

Bottleneck Classification

When diagnosing performance issues, classify the primary bottleneck:

| Type | Indicator | Description |
| --- | --- | --- |
| Compute-bound | High GPU utilization, low memory bandwidth usage | Limited by compute capacity (FLOPs) |
| Memory-bound | High memory bandwidth, low compute utilization | Limited by DRAM throughput |
| Launch-overhead | Many small kernels, high CPU time | CPU becoming bottleneck from kernel launch overhead |
| Communication-bound | Significant time in collective operations | Limited by inter-GPU or inter-node communication |
| Sync-bound | Excessive CPU-GPU synchronization points | Stalls from unnecessary synchronization |
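
The classification above can be expressed as a small decision function. This is only a sketch under stated assumptions: the `ProfileSummary` fields and both thresholds are hypothetical and not taken from any tool, and in practice the inputs would be percentages that a specialist reports from profiler output.

```python
from dataclasses import dataclass


@dataclass
class ProfileSummary:
    """Hypothetical summary of specialist-reported measurements (all in percent)."""
    compute_util_pct: float
    memory_bw_util_pct: float
    collective_time_pct: float
    launch_overhead_pct: float
    sync_stall_pct: float


HIGH = 70.0         # assumed threshold for "high" utilization
SIGNIFICANT = 30.0  # assumed threshold for a "significant" share of time


def classify(p: ProfileSummary) -> str:
    """Return the primary bottleneck type, or 'unclassified' if no rule fires."""
    overheads = {
        "communication-bound": p.collective_time_pct,
        "launch-overhead": p.launch_overhead_pct,
        "sync-bound": p.sync_stall_pct,
    }
    # Time lost outside kernels is checked first: pick the largest share.
    name, share = max(overheads.items(), key=lambda kv: kv[1])
    if share >= SIGNIFICANT:
        return name
    if p.memory_bw_util_pct >= HIGH and p.compute_util_pct < HIGH:
        return "memory-bound"
    if p.compute_util_pct >= HIGH and p.memory_bw_util_pct < HIGH:
        return "compute-bound"
    return "unclassified"
```

Returning `unclassified` rather than forcing a label keeps the function consistent with the principle of classifying before recommending: an ambiguous profile calls for more measurement, not a guess.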

Delegation Guidelines

When delegating to specialists, describe the desired outcome -- not the tool methodology.

DO include:

The workload: file path, code snippet, or command to profile
Problem context: dimensions, dtypes, FLOPs calculations, batch sizes
Desired metrics: SOL%, MFU, throughput, occupancy, bottleneck classification
Any constraints: specific kernel to target, profiling region markers

DO NOT include:

Specific tool flags or command patterns (e.g., `--set=full`, `--section SpeedOfLight`)
Step-by-step tool usage instructions
Fallback strategies for tool failures
Example commands
Output file paths or artifact locations (specialists create their own workspace artifacts)

Specialists have their own skills that encode best practices for tool usage and their own workspace artifacts for output. Prescribing commands in the delegation overrides their skills and may lead to suboptimal profiling strategies (e.g., collecting 8000+ metrics with `--set=full` when a targeted section analysis would be faster and more surgical).
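
One way to catch tool flags before a delegation goes out is a simple text check. The sketch below is an assumption-laden illustration, not part of any specialist's tooling: `find_cli_flags` is a hypothetical helper, and the regular expression only detects tokens that look like long command-line options.

```python
import re

# Matches long options such as --set or --target-processes at a word boundary.
# A bare "--" used as a dash in prose is not matched.
_FLAG_PATTERN = re.compile(r"(?<!\S)--[A-Za-z][\w-]*")


def find_cli_flags(delegation_prompt: str) -> list[str]:
    """Return every long CLI flag found in the prompt, in order of appearance."""
    return _FLAG_PATTERN.findall(delegation_prompt)


prompt = "Run NCU with --set=full --profile-from-start off --target-processes all."
print(find_cli_flags(prompt))
```

An empty result does not prove the prompt is outcome-oriented; it only rules out the most obvious form of prescribing tool usage.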

Good Example

Profile the batched GEMM kernel in bmm_workload.py with NCU.
The workload uses cudaProfilerStart/Stop markers to isolate the region of interest.
Collect kernel-level metrics: SOL%, compute/memory throughput, DRAM bandwidth,
tensor core utilization, occupancy, warp stall reasons, and roofline classification.
The batched GEMM performs 68.72 GFLOP per call (B=32, M=512, N=1024, K=2048, FP16).
Calculate MFU against the GPU's peak FP16 tensor core TFLOP/s.
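
The 68.72 GFLOP figure in the prompt above follows from the usual count of 2 FLOPs (one multiply, one add) per inner-product term, giving 2·B·M·N·K for a batched matrix multiply. A short check, with `batched_gemm_flops` as a hypothetical helper name:

```python
def batched_gemm_flops(batch: int, m: int, n: int, k: int) -> int:
    """FLOPs for a batched (M x K) @ (K x N) matmul, counting a multiply-add as 2."""
    return 2 * batch * m * n * k


flops = batched_gemm_flops(batch=32, m=512, n=1024, k=2048)
print(f"{flops / 1e9:.2f} GFLOP per call")  # 68.72 GFLOP per call
```

Turning this into MFU still requires the measured kernel time and the GPU's peak FP16 tensor core rate, both of which come from the specialist's profiling results.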

Bad Example

Run NCU with --set=full --profile-from-start off --target-processes all.
If --set=full fails, try --set=detailed. Parse the CSV output for
sm__throughput.avg.pct_of_peak_sustained_elapsed.
Save raw NCU output to /workspace/.../ncu_output.txt.

Remote Profiling

When profiling on a remote SLURM cluster, include the Remote Execution Context block in the delegation prompt with the SSH+srun wrapper for the target cluster. The perf-profiling-specialist will prefix its commands (nsys, ncu, nvidia-smi) with this wrapper.

The perf-profiling-specialist does not need the `remote-slurm` skill; the context block provides everything it needs to execute remotely.

Available Specialists

Delegate profiling and domain-specific analysis to these specialists:

perf-profiling-specialist: Runs nvidia-smi, nsys, ncu, torch.profiler. Use for ALL profiling tasks.
perf-torch-cuda-graph-specialist: Analyzes CUDA Graph compatibility and applies capture workflows

Report Format

Structure every analysis report with these four sections:

Summary: High-level performance status or bottleneck classification
Metrics: Key performance numbers from profiling
Findings: Detailed observations with evidence
Recommendations: Prioritized list of optimizations (if applicable)
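
The four-section structure could be captured in a small container so that no section is forgotten. This is a minimal sketch: `AnalysisReport` and its `render` method are hypothetical names, and the plain-text layout is just one possible rendering.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisReport:
    """Hypothetical container for the four report sections."""
    summary: str
    metrics: dict[str, str]  # metric name -> value copied from profiling output
    findings: list[str]
    recommendations: list[str] = field(default_factory=list)  # optional section

    def render(self) -> str:
        lines = ["Summary", "", self.summary, "", "Metrics", ""]
        lines += [f"{name}: {value}" for name, value in self.metrics.items()]
        lines += ["", "Findings", ""]
        lines += self.findings
        # Recommendations appear only when there is something to recommend.
        if self.recommendations:
            lines += ["", "Recommendations", ""]
            lines += self.recommendations
        return "\n".join(lines)
```

Keeping metric values as strings copied from tool output, rather than recomputed numbers, is a deliberate choice here: it makes it harder for an invented figure to slip into the report.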

Example Report

Summary

Training at 42% MFU, memory-bound due to large attention tensors.

Metrics

Throughput: 1,247 samples/sec
MFU: 42% (vs 65% theoretical for this model)
% of SOL: 58% (room for 1.7x improvement)
GPU Utilization: 45%
Memory Bandwidth: 850 GB/s (89% of peak)
Kernel Count: 1,247 per iteration

Findings

Self-attention consumes 60% of memory bandwidth
Optimizer step has 3 unnecessary synchronizations
Batch size could be increased by 2x

Recommendations

Enable FlashAttention (expected: +15% MFU)
Remove synchronizations in optimizer (expected: +5% throughput)
Increase batch size to improve GPU utilization
