perf-nsight-compute-analysis

Nsight Compute Analysis

NVIDIA Nsight Compute (`ncu`) profiles individual CUDA kernels to determine why they are slow and what to optimize. It measures GPU throughput as a percentage of theoretical peak (Speed of Light / SOL%), enabling systematic bottleneck classification and targeted optimization.

When to Use

Reach for this skill when you encounter:

Triggers: User wants to profile a CUDA kernel, analyze `ncu` output, interpret `.ncu-rep` reports, or optimize GPU kernel performance
Symptoms: Kernel running slower than expected, low GPU utilization, need to classify compute-bound vs memory-bound, occupancy issues
Keywords: "ncu", "nsight compute", "SOL%", "speed of light", "kernel profiling", "compute-bound", "memory-bound", "latency-bound", "occupancy", "roofline", "warp stalls", "cache hit rate", "ncu-rep"

Do NOT use this skill for:

System-level profiling (use Nsight Systems / `nsys` instead)
CUDA API tracing or CPU-GPU timeline analysis (use `nsys`)
GPU monitoring without profiling (use `nvidia-smi`)

Requirements

Dependency	Version	Notes
CUDA Toolkit	>=11.0	Includes `ncu`
`ncu` binary	Match CUDA version	Or set `$NCU` env var
NVIDIA GPU	Kepler+	Volta+ recommended

Permissions: `ncu` may require `sudo`, `CAP_SYS_ADMIN`, or `--privileged` in containers. Check with `ncu -v` first.

Principles

Data Integrity

This is a data-driven analysis system. Every number you present must have an authoritative source. Follow these rules without exception:

Quote before you interpret. When presenting metrics from ncu output, always show the actual ncu command you ran AND the relevant raw output (CSV lines, metric values) before stating any numeric conclusion.
Never fabricate metrics. If ncu fails, returns unexpected output, or you cannot run it, say so explicitly. Do not invent plausible-looking numbers. An honest "profiling failed" is better than fabricated data.
Attribute every value. For each metric you cite (SOL%, duration, occupancy, throughput), the reader must be able to trace it back to a specific line in the raw ncu output you showed.
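
One way to make the "quote before you interpret" rule mechanical is to capture the command line and its raw output together, before any number is extracted. The sketch below is only an illustration: `run_and_quote` is a hypothetical helper name, and it assumes the profiler is launched as an ordinary subprocess.

```python
import shlex
import subprocess


def run_and_quote(cmd):
    """Run a command and return the exact command line plus its raw stdout.

    Raises RuntimeError on a non-zero exit code instead of returning
    partial data, so a failed run is reported rather than papered over.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    quoted = shlex.join(cmd)
    if proc.returncode != 0:
        raise RuntimeError(f"profiling failed: {quoted}\n{proc.stderr.strip()}")
    return {"command": quoted, "raw_output": proc.stdout}
```

A report would print `command` and the relevant lines of `raw_output` first, and only then state conclusions drawn from them. If the binary is missing, `subprocess.run` raises `FileNotFoundError`, which should also be surfaced as-is.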

SOL% Mental Model

Speed of Light (SOL%) measures how close a kernel runs to the GPU's theoretical peak:

Compute SOL% = actual compute throughput / peak compute throughput
Memory SOL% = actual memory throughput / peak memory throughput

A kernel cannot saturate both simultaneously. The higher metric reveals the bottleneck type. Use this as the primary classification signal.
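
As a worked instance of the two ratios above, the following sketch computes a SOL% value from an achieved and a peak throughput. The numbers are hypothetical and chosen only to show the arithmetic; in practice ncu reports the percentage directly.

```python
def sol_percent(actual, peak):
    """Return achieved throughput as a percentage of theoretical peak."""
    if peak <= 0:
        raise ValueError("peak throughput must be positive")
    return 100.0 * actual / peak


# Hypothetical kernel: 620 GB/s achieved on a device with a 900 GB/s peak.
memory_sol = sol_percent(620.0, 900.0)
print(f"Memory SOL% = {memory_sol:.1f}")
```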

Classification Thresholds

Compute %	Memory %	Bottleneck	Next Step
>60	<40	Compute-bound	ComputeWorkloadAnalysis section
<40	>60	Memory-bound	MemoryWorkloadAnalysis section
<40	<40	Latency-bound	LaunchStats + Occupancy sections
40-60	40-60	Balanced	Profile deeper with detailed sections

Additional signals:

Duration <10us with many launches -> Launch-overhead bound (use nsys first)
Both <40% but occupancy >50% -> Instruction-bound (check InstructionStats)
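
The thresholds and the occupancy signal above can be written down as a small function. This is a minimal sketch rather than part of ncu: `classify_bottleneck` is a hypothetical name, and the `"unclassified"` fallback is an assumption for combinations the table does not cover (for example 70% compute with 50% memory).

```python
def classify_bottleneck(compute_pct, memory_pct, occupancy_pct=None):
    """Classify a kernel from its Compute and Memory SOL% values."""
    if compute_pct > 60 and memory_pct < 40:
        return "compute-bound"
    if memory_pct > 60 and compute_pct < 40:
        return "memory-bound"
    if compute_pct < 40 and memory_pct < 40:
        # Additional signal: low SOL% despite healthy occupancy.
        if occupancy_pct is not None and occupancy_pct > 50:
            return "instruction-bound"
        return "latency-bound"
    if 40 <= compute_pct <= 60 and 40 <= memory_pct <= 60:
        return "balanced"
    # Assumption: the table leaves the remaining combinations unspecified.
    return "unclassified"
```

The launch-overhead signal is left out because it depends on duration and launch counts, which are better gathered with nsys.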

SOL% Performance Levels

SOL%	Level	Action
>80%	Excellent	Minor tuning only
60-80%	Good	Targeted optimization
40-60%	Fair	Significant optimization needed
<40%	Poor	Major rework needed
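
The level table can be applied programmatically as sketched below. `sol_level` is a hypothetical helper, and the handling of the exact boundary values is an assumption: 80 and 60 are treated as Good and 40 as Fair, since the table lists these values in two adjacent rows.

```python
def sol_level(sol_pct):
    """Map a SOL% value to the level and action from the table above."""
    if sol_pct > 80:
        return "Excellent", "Minor tuning only"
    if sol_pct >= 60:
        return "Good", "Targeted optimization"
    if sol_pct >= 40:
        return "Fair", "Significant optimization needed"
    return "Poor", "Major rework needed"


level, action = sol_level(78.5)
print(f"{level}: {action}")
```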

Section-First Profiling

Always use targeted `--section` flags instead of bulk `--set` collection. Individual sections are faster and more surgical. Only escalate to `--set basic` or `--set detailed` when broad exploration is needed.

ncu vs nsys

Tool	Scope	Overhead	Purpose
nsys	System-level	5-10%	Find which kernels to optimize
ncu	Kernel-level	10-100x slower	Understand why a kernel is slow

Use nsys first to identify top kernels by GPU time, then ncu for deep analysis of those specific kernels.

Workflow

Choose your path based on the request:

Knowledge query (what metrics to use, --section vs --set, how to filter kernels): Answer directly from Principles, Command Reference, and References below. Do NOT run ncu.
Quick diagnosis (classify bottleneck, check SOL%): Step 1 only. Escalate if user wants more.
Specific diagnosis (bank conflicts, register pressure, occupancy): Quick SOL% check (Step 1), then go directly to the relevant section in Step 2.
Deep analysis (detailed report, optimization recommendations): Full Steps 1-5. Present the complete structured report with all key metrics (SOL%, duration, occupancy) in your final response — do not split the report across messages or replace it with a brief summary.

Step 0: Verify ncu

bash

ncu -v
# Or: $NCU -v

If not found, ensure the CUDA Toolkit is installed or set the `NCU` env var to the binary path.

Step 1: SOL% Diagnosis

Always start with SpeedOfLight to classify the bottleneck:

bash

ncu --section SpeedOfLight --csv \
    --kernel-name regex:"KERNEL" \
    --launch-skip 5 --launch-count 3 \
    -- COMMAND

Read `Compute (SM) Throughput` and `Memory Throughput` from the output. Classify using the thresholds above.
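
When the Step 1 command is launched from a script, it can help to assemble the argument list in one place. The sketch below uses only the flags shown in this document; `build_ncu_command` and its parameter names are hypothetical, not part of any ncu API.

```python
import shlex


def build_ncu_command(target, sections=("SpeedOfLight",), kernel_regex=None,
                      launch_skip=None, launch_count=None, ncu="ncu"):
    """Assemble an ncu argument list from the flags used in this document."""
    cmd = [ncu]
    for section in sections:
        cmd += ["--section", section]
    cmd.append("--csv")
    if kernel_regex is not None:
        cmd += ["--kernel-name", f"regex:{kernel_regex}"]
    if launch_skip is not None:
        cmd += ["--launch-skip", str(launch_skip)]
    if launch_count is not None:
        cmd += ["--launch-count", str(launch_count)]
    # Everything after "--" is the application being profiled.
    return cmd + ["--"] + list(target)


cmd = build_ncu_command(["python", "train.py"], kernel_regex="gemm",
                        launch_skip=5, launch_count=3)
print(shlex.join(cmd))
```

Because the result is an argument list rather than a shell string, the regex needs no shell quoting when it is passed to `subprocess.run`.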

Step 2: Escalate with Targeted Sections

Based on Step 1 classification, add sections:

Classification	Sections to Add
Compute-bound	`ComputeWorkloadAnalysis`
Memory-bound	`MemoryWorkloadAnalysis`
Latency-bound	`LaunchStats` , `Occupancy`
Warp stalls	`WarpStateStats` , `SchedulerStats`
Need instruction breakdown	`InstructionStats`

Always include `LaunchStats` and `Occupancy` when diagnosing latency-bound kernels. These reveal register pressure, shared memory limits, and block size issues.
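
The escalation table can be kept as a lookup so that the Step 1 classification selects the follow-up sections. This is a sketch under assumptions: the dictionary keys are hypothetical labels chosen for this example, while the section names are the ones listed in the table.

```python
SECTIONS_BY_CLASSIFICATION = {
    "compute-bound": ["ComputeWorkloadAnalysis"],
    "memory-bound": ["MemoryWorkloadAnalysis"],
    "latency-bound": ["LaunchStats", "Occupancy"],
    "warp-stalls": ["WarpStateStats", "SchedulerStats"],
    "instruction-breakdown": ["InstructionStats"],
}


def sections_for(classification):
    """Return SpeedOfLight plus the sections to add for a classification."""
    extra = SECTIONS_BY_CLASSIFICATION.get(classification, [])
    # SpeedOfLight stays first so every run can be re-classified.
    return ["SpeedOfLight"] + extra


print(sections_for("latency-bound"))
```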

Example -- memory-bound deep dive:

bash

ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
    --kernel-name regex:"embedding_lookup" \
    --launch-count 3 \
    -- python script.py

Example -- compute-bound deep dive:

bash

ncu --section SpeedOfLight --section ComputeWorkloadAnalysis --csv \
    --kernel-name regex:"gemm" \
    --launch-count 3 -- python script.py

Example -- occupancy investigation:

bash

ncu --section SpeedOfLight --section LaunchStats --section Occupancy --csv \
    --kernel-name regex:"small_kernel" \
    -- python script.py

Step 3: Roofline Analysis (Optional)

For visual understanding of compute vs memory balance:

bash

ncu --section SpeedOfLight_RooflineChart \
    --kernel-name regex:"KERNEL" -- COMMAND

For precision-specific hierarchical roofline:

bash

# FP16 kernels
ncu --section SpeedOfLight_HierarchicalHalfRooflineChart \
    --kernel-name regex:"KERNEL" -- COMMAND

# Tensor core kernels
ncu --section SpeedOfLight_HierarchicalTensorRooflineChart \
    --kernel-name regex:"KERNEL" -- COMMAND

Interpretation: kernel left of ridge point = memory-bound; right = compute-bound; far below both roofs = latency/occupancy issue. See `references/roofline-analysis.md`.
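
The ridge-point rule follows from the textbook roofline model, which the sketch below works through with hypothetical device peaks. It is meant to illustrate the interpretation only; it is not a description of how ncu computes its charts, and `roofline_position` is a made-up helper name.

```python
def roofline_position(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Place a kernel relative to the ridge point of a simple roofline."""
    intensity = flops / bytes_moved               # FLOP per byte
    ridge = peak_flops_per_s / peak_bytes_per_s   # intensity at the ridge point
    # Attainable performance is capped by whichever roof is lower.
    attainable = min(peak_flops_per_s, intensity * peak_bytes_per_s)
    side = "memory-bound" if intensity < ridge else "compute-bound"
    return intensity, ridge, attainable, side


# Hypothetical device: 10 TFLOP/s peak compute, 1 TB/s peak bandwidth.
# Hypothetical kernel: 2 GFLOP of work over 1 GB of memory traffic.
intensity, ridge, attainable, side = roofline_position(2e9, 1e9, 10e12, 1e12)
print(f"intensity={intensity} ridge={ridge} side={side}")
```

A kernel whose measured performance sits far below `attainable` is the "far below both roofs" case, which points to latency or occupancy rather than either roof.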

Step 4: Interpret and Optimize

Identify the dominant bottleneck from SOL% classification
Look up detailed analysis and optimization strategies in `references/bottleneck-guide.md`
Apply highest-impact optimization first
Re-profile to validate improvement and detect bottleneck shifts

Step 5: Validate

Re-profile the same kernel after optimization:

bash

ncu --section SpeedOfLight --csv \
    --kernel-name regex:"optimized_kernel" \
    --launch-count 3 \
    -- python optimized_script.py

Compare: Did throughput % increase? Did duration decrease? Did the bottleneck type shift?
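
The three comparison questions can be answered mechanically once the before and after metrics have been read from the raw output. In this sketch the dictionary keys are hypothetical names, and "throughput" is taken to mean the higher of the two SOL% values, which is an assumption made for the example.

```python
def compare_runs(before, after):
    """Answer the three validation questions for one kernel."""
    def bottleneck(metrics):
        # The higher SOL% value indicates the bottleneck type.
        if metrics["compute_pct"] >= metrics["memory_pct"]:
            return "compute"
        return "memory"

    def top_sol(metrics):
        return max(metrics["compute_pct"], metrics["memory_pct"])

    return {
        "throughput_increased": top_sol(after) > top_sol(before),
        "duration_decreased": after["duration_ns"] < before["duration_ns"],
        "bottleneck_shifted": bottleneck(before) != bottleneck(after),
    }


# Hypothetical values for illustration only, not measured data.
before = {"compute_pct": 30.0, "memory_pct": 72.0, "duration_ns": 1000}
after = {"compute_pct": 81.0, "memory_pct": 50.0, "duration_ns": 700}
print(compare_runs(before, after))
```

A shifted bottleneck means the next optimization pass should start again from the Step 1 classification.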

Profiling JIT-Compiled Kernels (Triton/cuTile/CuTeDSL)

JIT-compiled kernels trigger autotuning on first invocation. Isolate the actual execution:

Warmup first: Run the kernel 3-5 times to complete JIT compilation and autotuning, then call `torch.cuda.synchronize()`.
Use profiler markers: Bracket the measured region with `cudaProfilerStart()` / `cudaProfilerStop()`.
Use `--profile-from-start off` so ncu only captures the marked region:

python

import torch

# Warmup (JIT + autotuning); `kernel` and `inputs` stand in for your own code
for _ in range(5):
    result = kernel(inputs)
torch.cuda.synchronize()

# Profile only steady-state
torch.cuda.cudart().cudaProfilerStart()
for _ in range(3):
    result = kernel(inputs)
torch.cuda.synchronize()
torch.cuda.cudart().cudaProfilerStop()

bash

ncu --profile-from-start off --section SpeedOfLight --csv \
    --kernel-name regex:"target_kernel" \
    --launch-count 3 -- python script.py

Alternative: use `--launch-skip N` to skip autotuning launches. See `references/advanced-profiling.md` for NVTX range and replay mode alternatives.

Programmatic Report Analysis

Extract metrics from `.ncu-rep` files using the `ncu_report` Python module (in `extras/python/` of the Nsight Compute installation):

python

import ncu_report

ctx = ncu_report.load_report("report.ncu-rep")
for rng in ctx:
    for action in rng:
        name = action.name()
        compute = action["sm__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        memory = action["dram__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        duration = action["gpu__time_duration.sum"].as_uint64()

        if compute > 60:
            classification = "compute-bound"
        elif memory > 60:
            classification = "memory-bound"
        else:
            classification = "latency-bound"

        print(f"{name}: {classification} (compute={compute:.1f}%, mem={memory:.1f}%, {duration}ns)")

See `references/python-report-api.md` for the full API (IContext, IRange, IAction, IMetric classes).

Output Formats

CSV output (for scripting and automated analysis):

bash

ncu --csv --section SpeedOfLight --kernel-name regex:"KERNEL" -- COMMAND
ncu --csv --page raw --section SpeedOfLight -- COMMAND   # All metrics flat

Report files (for later analysis):

bash

ncu -o report --section SpeedOfLight -- COMMAND
ncu --import report.ncu-rep --csv --page raw            # Export to CSV

Key CSV columns:

Column	Meaning
`Kernel Name`	CUDA kernel function name
`Duration`	Execution time (nanoseconds)
`Compute (SM) Throughput`	% of peak compute
`Memory Throughput`	% of peak memory bandwidth
`Achieved Occupancy`	Active warps / max warps (%)

Success indicators:

SOL% values present in output -> profiling succeeded
Duration values reasonable (not 0 or extremely large)
Multiple launches captured when `--launch-count > 1`
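
The success indicators can be checked before any interpretation starts. The sketch below assumes each captured launch has already been parsed into a dictionary keyed by the column names in the table above; `check_profile` is a hypothetical helper, and it does not test for "extremely large" durations because no threshold is defined here.

```python
def check_profile(rows, launch_count=1):
    """Return a list of problems; an empty list means the run looks usable."""
    if not rows:
        return ["no kernels captured"]
    problems = []
    for row in rows:
        name = row.get("Kernel Name", "<unknown>")
        for column in ("Compute (SM) Throughput", "Memory Throughput"):
            if row.get(column) in (None, ""):
                problems.append(f"{name}: {column} missing")
        if float(row.get("Duration") or 0) <= 0:
            problems.append(f"{name}: Duration is zero or missing")
    if launch_count > 1 and len(rows) < launch_count:
        problems.append(f"expected {launch_count} launches, captured {len(rows)}")
    return problems
```

Any non-empty result should be reported as a failed or partial profiling run rather than worked around.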

Examples

Example: Classify a GEMM Kernel

bash

ncu --section SpeedOfLight --csv \
    --kernel-name regex:"gemm" \
    --launch-skip 5 --launch-count 3 \
    -- python train.py

Output:

"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"
"ampere_fp16_gemm",1250000,78.5,35.2

Interpretation: compute-bound (78.5% compute, 35.2% memory). Next step: check tensor core usage with `--section ComputeWorkloadAnalysis`.
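
To keep each quoted value traceable to a raw line, the output shown above can be parsed with the standard `csv` module. This sketch assumes the simplified four-column layout of this example and input that contains only the header and data rows; real ncu CSV output may carry more columns or a different layout and would need adapting. `parse_sol_csv` is a hypothetical helper.

```python
import csv
import io

# The two lines quoted in the example output above.
SAMPLE = '''"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"
"ampere_fp16_gemm",1250000,78.5,35.2
'''


def parse_sol_csv(text):
    """Parse the simplified SpeedOfLight CSV layout used in this example."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "kernel": row["Kernel Name"],
            "duration_ns": int(row["Duration"]),
            "compute_pct": float(row["Compute (SM) Throughput"]),
            "memory_pct": float(row["Memory Throughput"]),
        })
    return rows


for r in parse_sol_csv(SAMPLE):
    print(f'{r["kernel"]}: compute={r["compute_pct"]}%, memory={r["memory_pct"]}%')
```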

Example: Diagnose a Memory-Bound Embedding Kernel

bash

ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
    --kernel-name regex:"embedding" \
    --launch-count 3 -- python train.py

Check L1/L2 cache hit rates and coalescing efficiency in output. Low hit rates suggest poor data locality; low coalescing efficiency suggests scattered access.

Error Handling

Error	Cause	Fix
`ncu: command not found`	Not in PATH	`export PATH=$PATH:/usr/local/cuda/bin` or set `$NCU`
`Permission denied`	Needs elevated privileges	`sudo ncu ...` or `--cap-add=SYS_ADMIN` in containers
No kernels captured	Name regex doesn't match	Run without `--kernel-name` first to see actual names
Profiling extremely slow	Using `--set full` or many sections	Use `--section SpeedOfLight` only; reduce `--launch-count`
Autotuning pollutes results	JIT kernel warmup captured	Use `--profile-from-start off` with profiler markers
Metrics show 0% tensor cores	Kernel doesn't use tensor cores	Check with `--section InstructionStats` ; verify dimensions align to 8/16
Report file too large	`--set full` with many kernels	Use targeted sections; limit with `--kernel-name` and `--launch-count`
Out-of-range metric values	Async GPU activity or short kernels	Profile on isolated GPU; increase workload size
`ncu` hangs on MPI app	Dependent kernels across ranks	Use `--communicator=tcp --lockstep-kernel-launch`
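
For automated runs, the two rows of the table that quote a literal error message can be turned into a lookup on the captured stderr. This is a minimal sketch: `suggest_fix` is a hypothetical helper, the substring matching is an assumption about how the messages appear, and the remaining rows describe symptoms that need human judgment.

```python
# (substring to look for, fix from the table above)
KNOWN_FIXES = [
    ("command not found", "export PATH=$PATH:/usr/local/cuda/bin or set $NCU"),
    ("permission denied", "sudo ncu ... or --cap-add=SYS_ADMIN in containers"),
]


def suggest_fix(stderr_text):
    """Return the table's fix for a recognized error message, else None."""
    lowered = stderr_text.lower()
    for needle, fix in KNOWN_FIXES:
        if needle in lowered:
            return fix
    return None


print(suggest_fix("bash: ncu: command not found"))
```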

Finding More Information

Tier 1: This File (SKILL.md)

You are reading it now. The section-first workflow and error table above cover the most common profiling tasks. Search this file first.

Tier 2: references/ Directory

Grep for keywords across `references/` -- headers are grep-friendly:

`references/cli-reference.md` -- Complete CLI options, filtering, output formats
`references/metrics-guide.md` -- Hardware model, metric naming, key metrics
`references/sections-guide.md` -- All `--section` names, when to use each
`references/bottleneck-guide.md` -- Per-bottleneck root causes and optimization
`references/memory-analysis.md` -- Memory hierarchy, cache analysis, coalescing
`references/roofline-analysis.md` -- Roofline charts and interpretation
`references/advanced-profiling.md` -- Replay modes, MPI, CUDA graphs, PM sampling, customization
`references/python-report-api.md` -- `ncu_report` Python module API

How to search:

`Grep` for your keyword across `references/`
`Read` only the file that Grep points to

Tier 3: Official Documentation

If Tiers 1-2 don't answer:

Profiling Guide -- Metrics, hardware model, analysis concepts
CLI Reference -- Full CLI options
Python Report Interface -- `ncu_report` API
Customization Guide -- Section files, rules

WebFetch or WebSearch these URLs for the latest content. Consider distilling new findings back into `references/`.
