perf-nsight-compute-analysis

Nsight Compute Analysis

NVIDIA Nsight Compute (`ncu`) profiles individual CUDA kernels to determine why they are slow and what to optimize. It measures GPU throughput as a percentage of theoretical peak (Speed of Light / SOL%), enabling systematic bottleneck classification and targeted optimization.

When to Use

Reach for this skill when you encounter:
  • Triggers: User wants to profile a CUDA kernel, analyze `ncu` output, interpret `.ncu-rep` reports, or optimize GPU kernel performance
  • Symptoms: Kernel running slower than expected, low GPU utilization, need to classify compute-bound vs memory-bound, occupancy issues
  • Keywords: "ncu", "nsight compute", "SOL%", "speed of light", "kernel profiling", "compute-bound", "memory-bound", "latency-bound", "occupancy", "roofline", "warp stalls", "cache hit rate", "ncu-rep"
Do NOT use this skill for:
  • System-level profiling (use Nsight Systems / `nsys` instead)
  • CUDA API tracing or CPU-GPU timeline analysis (use `nsys`)
  • GPU monitoring without profiling (use `nvidia-smi`)

Requirements

| Dependency | Version | Notes |
|---|---|---|
| CUDA Toolkit | >=11.0 | Includes `ncu` |
| `ncu` binary | Match CUDA version | Or set `$NCU` env var |
| NVIDIA GPU | Kepler+ | Volta+ recommended |

Permissions: `ncu` may require `sudo`, `CAP_SYS_ADMIN`, or `--privileged` in containers. Check with `ncu -v` first.

Principles

Data Integrity

This is a data-driven analysis system. Every number you present must have an authoritative source. Follow these rules without exception:
  1. Quote before you interpret. When presenting metrics from ncu output, always show the actual ncu command you ran AND the relevant raw output (CSV lines, metric values) before stating any numeric conclusion.
  2. Never fabricate metrics. If ncu fails, returns unexpected output, or you cannot run it, say so explicitly. Do not invent plausible-looking numbers. An honest "profiling failed" is better than fabricated data.
  3. Attribute every value. For each metric you cite (SOL%, duration, occupancy, throughput), the reader must be able to trace it back to a specific line in the raw ncu output you showed.

SOL% Mental Model

Speed of Light (SOL%) measures how close a kernel runs to the GPU's theoretical peak:
  • Compute SOL% = actual compute throughput / peak compute throughput
  • Memory SOL% = actual memory throughput / peak memory throughput
A kernel rarely saturates both at the same time, so the higher of the two values usually reveals the bottleneck type. Use it as the primary classification signal.

Classification Thresholds

| Compute % | Memory % | Bottleneck | Next Step |
|---|---|---|---|
| >60 | <40 | Compute-bound | ComputeWorkloadAnalysis section |
| <40 | >60 | Memory-bound | MemoryWorkloadAnalysis section |
| <40 | <40 | Latency-bound | LaunchStats + Occupancy sections |
| 40-60 | 40-60 | Balanced | Profile deeper with detailed sections |

Additional signals:
  • Duration <10us with many launches -> Launch-overhead bound (use nsys first)
  • Both <40% but occupancy >50% -> Instruction-bound (check InstructionStats)
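
The threshold table can be sketched as a small Python helper. This is an illustrative sketch, not part of ncu: the name `classify_bottleneck` is hypothetical, and the fallback for combinations the table does not list (pick whichever metric is higher, following the SOL% mental model) is an assumption.

```python
def classify_bottleneck(compute_pct: float, memory_pct: float) -> str:
    """Classify a kernel from its Compute and Memory SOL% values.

    Hypothetical helper that mirrors the Classification Thresholds table.
    """
    if compute_pct < 40 and memory_pct < 40:
        return "latency-bound"
    if compute_pct > 60 and memory_pct < 40:
        return "compute-bound"
    if memory_pct > 60 and compute_pct < 40:
        return "memory-bound"
    if 40 <= compute_pct <= 60 and 40 <= memory_pct <= 60:
        return "balanced"
    # Assumption: combinations the table does not list (e.g. 75% / 50%)
    # fall back to the higher metric, per the SOL% mental model.
    return "compute-bound" if compute_pct >= memory_pct else "memory-bound"


# Values taken from the GEMM example in this document.
print(classify_bottleneck(78.5, 35.2))  # prints: compute-bound
```

The additional signals (launch overhead, instruction-bound) need duration and occupancy as well, so they are deliberately left out of this sketch.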

SOL% Performance Levels

| SOL% | Level | Action |
|---|---|---|
| >80% | Excellent | Minor tuning only |
| 60-80% | Good | Targeted optimization |
| 40-60% | Fair | Significant optimization needed |
| <40% | Poor | Major rework needed |

Section-First Profiling

Default to targeted `--section` flags instead of bulk `--set` collection. Individual sections are faster and more surgical. Escalate to `--set basic` or `--set detailed` only when broad exploration is needed.

ncu vs nsys

| Tool | Scope | Overhead | Purpose |
|---|---|---|---|
| nsys | System-level | 5-10% | Find which kernels to optimize |
| ncu | Kernel-level | 10-100x slower | Understand why a kernel is slow |

Use nsys first to identify top kernels by GPU time, then ncu for deep analysis of those specific kernels.

Workflow

Choose your path based on the request:
  • Knowledge query (what metrics to use, --section vs --set, how to filter kernels): Answer directly from Principles, Command Reference, and References below. Do NOT run ncu.
  • Quick diagnosis (classify bottleneck, check SOL%): Step 1 only. Escalate if user wants more.
  • Specific diagnosis (bank conflicts, register pressure, occupancy): Quick SOL% check (Step 1), then go directly to the relevant section in Step 2.
  • Deep analysis (detailed report, optimization recommendations): Full Steps 1-5. Present the complete structured report with all key metrics (SOL%, duration, occupancy) in your final response — do not split the report across messages or replace it with a brief summary.

Step 0: Verify ncu

```bash
ncu -v
# Or: $NCU -v
```

If not found, ensure the CUDA Toolkit is installed or set the `NCU` env var to the binary path.
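
When the check is scripted, the same lookup can be sketched in Python. This is a minimal sketch under stated assumptions: the function name `find_ncu` is hypothetical, and giving `$NCU` precedence over the `PATH` lookup is an assumed ordering, not something ncu itself defines.

```python
import os
import shutil
from typing import Optional


def find_ncu() -> Optional[str]:
    """Return a path to the ncu binary, or None if it cannot be located."""
    # Assumption: an explicit $NCU override takes precedence over PATH lookup.
    override = os.environ.get("NCU")
    if override:
        return override
    return shutil.which("ncu")


ncu_path = find_ncu()
if ncu_path is None:
    print("ncu not found: install the CUDA Toolkit or set the NCU env var")
else:
    print(f"using ncu at {ncu_path}")
```

The sketch only locates the binary. Running `ncu -v` with the resolved path is still the actual verification step.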

Step 1: SOL% Diagnosis

Always start with SpeedOfLight to classify the bottleneck:

```bash
ncu --section SpeedOfLight --csv \
    --kernel-name regex:"KERNEL" \
    --launch-skip 5 --launch-count 3 \
    -- COMMAND
```

Read `Compute (SM) Throughput` and `Memory Throughput` from the output. Classify using the thresholds above.

Step 2: Escalate with Targeted Sections

Based on Step 1 classification, add sections:

| Classification | Sections to Add |
|---|---|
| Compute-bound | `ComputeWorkloadAnalysis` |
| Memory-bound | `MemoryWorkloadAnalysis` |
| Latency-bound | `LaunchStats`, `Occupancy` |
| Warp stalls | `WarpStateStats`, `SchedulerStats` |
| Need instruction breakdown | `InstructionStats` |

Always include `LaunchStats` and `Occupancy` when diagnosing latency-bound kernels. These reveal register pressure, shared memory limits, and block size issues.

Example -- memory-bound deep dive:

```bash
ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
    --kernel-name regex:"embedding_lookup" \
    --launch-count 3 \
    -- python script.py
```

Example -- compute-bound deep dive:

```bash
ncu --section SpeedOfLight --section ComputeWorkloadAnalysis --csv \
    --kernel-name regex:"gemm" \
    --launch-count 3 -- python script.py
```

Example -- occupancy investigation:

```bash
ncu --section SpeedOfLight --section LaunchStats --section Occupancy --csv \
    --kernel-name regex:"small_kernel" \
    -- python script.py
```
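
The classification-to-sections mapping in Step 2 can be sketched as a lookup that emits the `--section` flags. The dictionary keys and the function name `section_flags` are hypothetical labels chosen for this sketch. The section names themselves come from the table in Step 2.

```python
from typing import Dict, List

# Mirrors the "Sections to Add" table; the keys are hypothetical labels.
SECTIONS_BY_CLASSIFICATION: Dict[str, List[str]] = {
    "compute-bound": ["ComputeWorkloadAnalysis"],
    "memory-bound": ["MemoryWorkloadAnalysis"],
    "latency-bound": ["LaunchStats", "Occupancy"],
    "warp-stalls": ["WarpStateStats", "SchedulerStats"],
    "instruction-breakdown": ["InstructionStats"],
}


def section_flags(classification: str) -> List[str]:
    """Return --section flags for Step 2, keeping SpeedOfLight first."""
    sections = ["SpeedOfLight", *SECTIONS_BY_CLASSIFICATION[classification]]
    flags: List[str] = []
    for name in sections:
        flags += ["--section", name]
    return flags


print(" ".join(section_flags("latency-bound")))
# prints: --section SpeedOfLight --section LaunchStats --section Occupancy
```

An unknown label raises `KeyError` rather than silently profiling with the wrong sections, which seems the safer default for a diagnosis workflow.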

Step 3: Roofline Analysis (Optional)

For visual understanding of compute vs memory balance:

```bash
ncu --section SpeedOfLight_RooflineChart \
    --kernel-name regex:"KERNEL" -- COMMAND
```

For precision-specific hierarchical roofline:

```bash
# FP16 kernels
ncu --section SpeedOfLight_HierarchicalHalfRooflineChart \
    --kernel-name regex:"KERNEL" -- COMMAND

# Tensor core kernels
ncu --section SpeedOfLight_HierarchicalTensorRooflineChart \
    --kernel-name regex:"KERNEL" -- COMMAND
```

Interpretation: kernel left of ridge point = memory-bound; right = compute-bound; far below both roofs = latency/occupancy issue. See `references/roofline-analysis.md`.

Step 4: Interpret and Optimize

  1. Identify the dominant bottleneck from SOL% classification
  2. Look up detailed analysis and optimization strategies in `references/bottleneck-guide.md`
  3. Apply highest-impact optimization first
  4. Re-profile to validate improvement and detect bottleneck shifts

Step 5: Validate

Re-profile the same kernel after optimization:

```bash
ncu --section SpeedOfLight --csv \
    --kernel-name regex:"optimized_kernel" \
    --launch-count 3 \
    -- python optimized_script.py
```

Compare: Did throughput % increase? Did duration decrease? Did the bottleneck type shift?
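
The three comparison questions can be answered mechanically once both runs' values have been copied from raw ncu output. The following is a sketch under assumptions: `SolSample` and `compare_runs` are hypothetical names, and "bottleneck shift" is simplified to "the higher of the two SOL% values changed sides".

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class SolSample:
    """SOL% values for one kernel, copied from raw ncu output."""
    compute_pct: float
    memory_pct: float
    duration_ns: float

    @property
    def dominant(self) -> str:
        # Simplification: the higher SOL% names the dominant side.
        return "compute" if self.compute_pct >= self.memory_pct else "memory"


def compare_runs(before: SolSample, after: SolSample) -> Dict[str, Union[float, bool]]:
    """Answer the three validation questions for a before/after pair."""
    return {
        "compute_delta_pct": after.compute_pct - before.compute_pct,
        "memory_delta_pct": after.memory_pct - before.memory_pct,
        "speedup": before.duration_ns / after.duration_ns,
        "dominant_shifted": before.dominant != after.dominant,
    }
```

No sample numbers are shown on purpose: populate both `SolSample` instances only with values that appear in the ncu output you quote alongside the comparison.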

Profiling JIT-Compiled Kernels (Triton/cuTile/CuTeDSL)

JIT-compiled kernels trigger autotuning on first invocation. Isolate the actual execution:
  1. Warmup first: Run the kernel 3-5 times to complete JIT compilation and autotuning, then call `torch.cuda.synchronize()`.
  2. Use profiler markers: Bracket the measured region with `cudaProfilerStart()` / `cudaProfilerStop()`.
  3. Use `--profile-from-start off` so ncu only captures the marked region:

```python
import torch

# `kernel` and `inputs` are placeholders for your JIT-compiled kernel and its arguments.

# Warmup (JIT + autotuning)
for _ in range(5):
    result = kernel(inputs)
torch.cuda.synchronize()

# Profile only steady-state
torch.cuda.cudart().cudaProfilerStart()
for _ in range(3):
    result = kernel(inputs)
torch.cuda.synchronize()
torch.cuda.cudart().cudaProfilerStop()
```

```bash
ncu --profile-from-start off --section SpeedOfLight --csv \
    --kernel-name regex:"target_kernel" \
    --launch-count 3 -- python script.py
```

Alternative: use `--launch-skip N` to skip autotuning launches. See `references/advanced-profiling.md` for NVTX range and replay mode alternatives.

Programmatic Report Analysis

Extract metrics from `.ncu-rep` files using the `ncu_report` Python module (in `extras/python/` of the Nsight Compute installation):

```python
import ncu_report

ctx = ncu_report.load_report("report.ncu-rep")
for rng in ctx:
    for action in rng:
        name = action.name()
        # Throughput metrics are percentages of peak; duration is in nanoseconds.
        compute = action["sm__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        memory = action["dram__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        duration = action["gpu__time_duration.sum"].as_uint64()

        # Simplified classification: the "balanced" row of the
        # Classification Thresholds table is not distinguished here.
        if compute > 60:
            classification = "compute-bound"
        elif memory > 60:
            classification = "memory-bound"
        else:
            classification = "latency-bound"

        print(f"{name}: {classification} (compute={compute:.1f}%, mem={memory:.1f}%, {duration}ns)")
```

See `references/python-report-api.md` for the full API (IContext, IRange, IAction, IMetric classes).

Output Formats

CSV output (for scripting and automated analysis):

```bash
ncu --csv --section SpeedOfLight --kernel-name regex:"KERNEL" -- COMMAND
ncu --csv --page raw --section SpeedOfLight -- COMMAND   # All metrics flat
```

Report files (for later analysis):

```bash
ncu -o report --section SpeedOfLight -- COMMAND
ncu --import report.ncu-rep --csv --page raw            # Export to CSV
```

Key CSV columns:

| Column | Meaning |
|---|---|
| `Kernel Name` | CUDA kernel function name |
| `Duration` | Execution time (nanoseconds) |
| `Compute (SM) Throughput` | % of peak compute |
| `Memory Throughput` | % of peak memory bandwidth |
| `Achieved Occupancy` | Active warps / max warps (%) |

Success indicators:
  • SOL% values present in output -> profiling succeeded
  • Duration values reasonable (not 0 or extremely large)
  • Multiple launches captured when `--launch-count > 1`
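
The success indicators can be turned into a quick sanity check over parsed CSV rows. This is a sketch under stated assumptions: it expects one dict per captured launch keyed by the column names in the table above, the name `check_profile_rows` is hypothetical, and the 60-second ceiling for "extremely large" is an arbitrary placeholder to adjust for your workload.

```python
from typing import Dict, List

SOL_COLUMNS = ("Compute (SM) Throughput", "Memory Throughput")
MAX_PLAUSIBLE_NS = 60 * 1_000_000_000  # assumption: flag durations over 60 s


def check_profile_rows(rows: List[Dict[str, str]], launch_count: int = 1) -> List[str]:
    """Return a list of problems; an empty list means the indicators pass."""
    problems: List[str] = []
    if len(rows) < launch_count:
        problems.append(f"expected {launch_count} launches, captured {len(rows)}")
    for i, row in enumerate(rows):
        for col in SOL_COLUMNS:
            if not (row.get(col) or "").strip():
                problems.append(f"row {i}: missing {col}")
        try:
            duration = float(row.get("Duration") or "")
        except ValueError:
            problems.append(f"row {i}: Duration is not numeric")
            continue
        if duration <= 0 or duration > MAX_PLAUSIBLE_NS:
            problems.append(f"row {i}: implausible Duration {duration}")
    return problems
```

If the returned list is non-empty, report the profiling run as failed rather than interpreting the numbers.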

Examples

Example: Classify a GEMM Kernel

```bash
ncu --section SpeedOfLight --csv \
    --kernel-name regex:"gemm" \
    --launch-skip 5 --launch-count 3 \
    -- python train.py
```

Output:

```
"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"
"ampere_fp16_gemm",1250000,78.5,35.2
```

Interpretation: compute-bound (78.5% compute, 35.2% memory). Next step: check tensor core usage with `--section ComputeWorkloadAnalysis`.
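
The two output lines of the GEMM example can be parsed with the standard `csv` module. This is a sketch that assumes the simplified four-column layout shown in that example. Real `ncu --csv` output may contain more columns or a different layout, so treat the column handling as an assumption. The function name `parse_sol_csv` is hypothetical.

```python
import csv
import io
from typing import Dict, List, Union


def parse_sol_csv(text: str) -> List[Dict[str, Union[str, float]]]:
    """Parse the simplified SpeedOfLight CSV layout used in this example."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "kernel": row["Kernel Name"],
            "duration_ms": int(row["Duration"]) / 1e6,  # Duration is in nanoseconds
            "compute_pct": float(row["Compute (SM) Throughput"]),
            "memory_pct": float(row["Memory Throughput"]),
        })
    return rows


# The two CSV lines from the example output, copied verbatim.
EXAMPLE = '''"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"
"ampere_fp16_gemm",1250000,78.5,35.2
'''

for r in parse_sol_csv(EXAMPLE):
    side = "compute" if r["compute_pct"] > r["memory_pct"] else "memory"
    print(f'{r["kernel"]}: {r["duration_ms"]:.2f} ms, higher SOL% = {side}')
# prints: ampere_fp16_gemm: 1.25 ms, higher SOL% = compute
```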

Example: Diagnose a Memory-Bound Embedding Kernel

```bash
ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
    --kernel-name regex:"embedding" \
    --launch-count 3 -- python train.py
```

Check L1/L2 cache hit rates and coalescing efficiency in the output. Low hit rates suggest poor data locality; low coalescing efficiency suggests scattered access.

Error Handling

| Error | Cause | Fix |
|---|---|---|
| `ncu: command not found` | Not in PATH | `export PATH=$PATH:/usr/local/cuda/bin` or set `$NCU` |
| `Permission denied` | Needs elevated privileges | `sudo ncu ...` or `--cap-add=SYS_ADMIN` in containers |
| No kernels captured | Name regex doesn't match | Run without `--kernel-name` first to see actual names |
| Profiling extremely slow | Using `--set full` or many sections | Use `--section SpeedOfLight` only; reduce `--launch-count` |
| Autotuning pollutes results | JIT kernel warmup captured | Use `--profile-from-start off` with profiler markers |
| Metrics show 0% tensor cores | Kernel doesn't use tensor cores | Check with `--section InstructionStats`; verify dimensions align to 8/16 |
| Report file too large | `--set full` with many kernels | Use targeted sections; limit with `--kernel-name` and `--launch-count` |
| Out-of-range metric values | Async GPU activity or short kernels | Profile on isolated GPU; increase workload size |
| `ncu` hangs on MPI app | Dependent kernels across ranks | Use `--communicator=tcp --lockstep-kernel-launch` |

Finding More Information

Tier 1: This File (SKILL.md)

You are reading it now. The section-first workflow and error table above cover the most common profiling tasks. Search this file first.

Tier 2: references/ Directory

Grep for keywords across `references/` -- headers are grep-friendly:
  • `references/cli-reference.md` -- Complete CLI options, filtering, output formats
  • `references/metrics-guide.md` -- Hardware model, metric naming, key metrics
  • `references/sections-guide.md` -- All `--section` names, when to use each
  • `references/bottleneck-guide.md` -- Per-bottleneck root causes and optimization
  • `references/memory-analysis.md` -- Memory hierarchy, cache analysis, coalescing
  • `references/roofline-analysis.md` -- Roofline charts and interpretation
  • `references/advanced-profiling.md` -- Replay modes, MPI, CUDA graphs, PM sampling, customization
  • `references/python-report-api.md` -- `ncu_report` Python module API
How to search:
  1. `Grep` for your keyword across `references/`
  2. `Read` only the file that Grep points to

Tier 3: Official Documentation

If Tiers 1-2 don't answer:
WebFetch or WebSearch these URLs for the latest content. Consider distilling new findings back into `references/`.