perf-nsight-compute-analysis
Nsight Compute Analysis
NVIDIA Nsight Compute (`ncu`) profiles individual CUDA kernels to determine
why they are slow and what to optimize. It measures GPU throughput as a
percentage of theoretical peak (Speed of Light / SOL%), enabling systematic
bottleneck classification and targeted optimization.
When to Use
Reach for this skill when you encounter:
- Triggers: User wants to profile a CUDA kernel, analyze `ncu` output, interpret `.ncu-rep` reports, or optimize GPU kernel performance.
- Symptoms: Kernel running slower than expected, low GPU utilization, need to classify compute-bound vs memory-bound, occupancy issues
- Keywords: "ncu", "nsight compute", "SOL%", "speed of light", "kernel profiling", "compute-bound", "memory-bound", "latency-bound", "occupancy", "roofline", "warp stalls", "cache hit rate", "ncu-rep"
Do NOT use this skill for:
- System-level profiling (use Nsight Systems / `nsys` instead)
- CUDA API tracing or CPU-GPU timeline analysis (use `nsys`)
- GPU monitoring without profiling (use `nvidia-smi`)
Requirements
| Dependency | Version | Notes |
|---|---|---|
| CUDA Toolkit | >=11.0 | Includes `ncu` |
| `ncu` | Match CUDA version | Or set `NCU` env var |
| NVIDIA GPU | Kepler+ | Volta+ recommended |
Permissions: `ncu` may require `sudo`, `CAP_SYS_ADMIN`, or `--privileged`
in containers. Check with `ncu -v` first.
Principles
Data Integrity
This is a data-driven analysis system. Every number you present must have
an authoritative source. Follow these rules without exception:
- Quote before you interpret. When presenting metrics from ncu output, always show the actual ncu command you ran AND the relevant raw output (CSV lines, metric values) before stating any numeric conclusion.
- Never fabricate metrics. If ncu fails, returns unexpected output, or you cannot run it, say so explicitly. Do not invent plausible-looking numbers. An honest "profiling failed" is better than fabricated data.
- Attribute every value. For each metric you cite (SOL%, duration, occupancy, throughput), the reader must be able to trace it back to a specific line in the raw ncu output you showed.
SOL% Mental Model
Speed of Light (SOL%) measures how close a kernel runs to the GPU's theoretical peak:
- Compute SOL% = actual compute throughput / peak compute throughput
- Memory SOL% = actual memory throughput / peak memory throughput
A kernel rarely saturates both at once, so the higher of the two metrics reveals the bottleneck type. Use this as the primary classification signal.
Classification Thresholds
| Compute % | Memory % | Bottleneck | Next Step |
|---|---|---|---|
| >60 | <40 | Compute-bound | ComputeWorkloadAnalysis section |
| <40 | >60 | Memory-bound | MemoryWorkloadAnalysis section |
| <40 | <40 | Latency-bound | LaunchStats + Occupancy sections |
| 40-60 | 40-60 | Balanced | Profile deeper with detailed sections |
Additional signals:
- Duration <10us with many launches -> Launch-overhead bound (use nsys first)
- Both <40% but occupancy >50% -> Instruction-bound (check InstructionStats)
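The threshold table and the occupancy signal above can be sketched as a small function. This is an illustrative sketch, not part of ncu: `classify_bottleneck` is a hypothetical helper, and the `"unclassified"` fallback for combinations the table does not list (for example 70/50) is an assumption.

```python
def classify_bottleneck(compute_pct, memory_pct, occupancy_pct=None):
    """Map SOL% values to the bottleneck classes in the threshold table."""
    if compute_pct > 60 and memory_pct < 40:
        return "compute-bound"
    if memory_pct > 60 and compute_pct < 40:
        return "memory-bound"
    if compute_pct < 40 and memory_pct < 40:
        # Additional signal: both SOL% values low despite occupancy above 50%.
        if occupancy_pct is not None and occupancy_pct > 50:
            return "instruction-bound"
        return "latency-bound"
    if 40 <= compute_pct <= 60 and 40 <= memory_pct <= 60:
        return "balanced"
    # Assumption: the table does not cover every combination, so say so
    # explicitly instead of guessing a class.
    return "unclassified"
```

The launch-overhead signal (duration under 10us with many launches) is left out because it depends on launch counts rather than SOL% values.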
SOL% Performance Levels
| SOL% | Level | Action |
|---|---|---|
| >80% | Excellent | Minor tuning only |
| 60-80% | Good | Targeted optimization |
| 40-60% | Fair | Significant optimization needed |
| <40% | Poor | Major rework needed |
Section-First Profiling
Always use targeted `--section` flags instead of bulk `--set` collection. Individual sections are faster and more surgical. Only escalate to `--set basic` or `--set detailed` when broad exploration is needed.
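One way to keep collection targeted is to assemble the command line from an explicit list of sections. A minimal sketch: `build_ncu_command` is a hypothetical helper, and it uses only flags that already appear in this document.

```python
import shlex


def build_ncu_command(app_cmd, sections, kernel_regex=None, ncu="ncu"):
    """Build an ncu argv list that collects only the named sections."""
    argv = [ncu]
    for section in sections:
        argv += ["--section", section]
    argv.append("--csv")
    if kernel_regex:
        argv += ["--kernel-name", f"regex:{kernel_regex}"]
    # Everything after "--" is the application being profiled.
    argv.append("--")
    argv += list(app_cmd)
    return argv


cmd = build_ncu_command(
    ["python", "script.py"],
    ["SpeedOfLight", "Occupancy"],
    kernel_regex="gemm",
)
print(shlex.join(cmd))
```

Returning a list rather than a string lets the caller pass it to `subprocess.run` without extra shell quoting.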
ncu vs nsys
| Tool | Scope | Overhead | Purpose |
|---|---|---|---|
| nsys | System-level | 5-10% | Find which kernels to optimize |
| ncu | Kernel-level | 10-100x slower | Understand why a kernel is slow |
Use nsys first to identify top kernels by GPU time, then ncu for deep analysis of those specific kernels.
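The "top kernels by GPU time" step amounts to summing durations per kernel and sorting. A sketch under assumptions: the input is a list of `(kernel_name, duration_ns)` pairs that you have already exported from an nsys kernel summary. The export step is not shown, and `top_kernels_by_gpu_time` is a hypothetical helper.

```python
from collections import defaultdict


def top_kernels_by_gpu_time(launches, n=3):
    """Return the n kernels with the largest total GPU time, largest first."""
    totals = defaultdict(int)
    for name, duration_ns in launches:
        totals[name] += duration_ns
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]
```

The names returned here are the ones to pass to `--kernel-name` when moving on to ncu.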
Workflow
Choose your path based on the request:
- Knowledge query (what metrics to use, --section vs --set, how to filter kernels): Answer directly from Principles, Command Reference, and References below. Do NOT run ncu.
- Quick diagnosis (classify bottleneck, check SOL%): Step 1 only. Escalate if user wants more.
- Specific diagnosis (bank conflicts, register pressure, occupancy): Quick SOL% check (Step 1), then go directly to the relevant section in Step 2.
- Deep analysis (detailed report, optimization recommendations): Full Steps 1-5. Present the complete structured report with all key metrics (SOL%, duration, occupancy) in your final response — do not split the report across messages or replace it with a brief summary.
Step 0: Verify ncu
```bash
ncu -v
# Or: $NCU -v
```
If `ncu` is not found, ensure the CUDA Toolkit is installed or set the `NCU` env var to the binary path.
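The same lookup order can be expressed in a script before launching a profile. A minimal sketch: `resolve_ncu` is a hypothetical helper, and preferring `NCU` over `PATH` is an assumption about precedence.

```python
import os
import shutil


def resolve_ncu():
    """Return the ncu binary to use, or None if none can be located."""
    override = os.environ.get("NCU")
    if override:
        # Assumption: an explicit NCU setting takes priority over PATH.
        return override
    return shutil.which("ncu")
```

A `None` result corresponds to the "not found" case above and should be reported rather than worked around.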
Step 1: SOL% Diagnosis
Always start with SpeedOfLight to classify the bottleneck:
```bash
ncu --section SpeedOfLight --csv \
  --kernel-name regex:"KERNEL" \
  --launch-skip 5 --launch-count 3 \
  -- COMMAND
```
Read `Compute (SM) Throughput` and `Memory Throughput` from the output. Classify using the thresholds above.
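Reading the two throughput values can be scripted with the standard `csv` module. This sketch assumes the simplified one-row-per-launch layout used in this document's GEMM example; real `ncu --csv` output carries additional columns and may need a different reader. `read_sol_rows` is a hypothetical helper, and the sample text repeats that illustrative example rather than a fresh measurement.

```python
import csv
import io


def read_sol_rows(csv_text):
    """Extract kernel name and SOL% values from one-row-per-launch CSV text."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "kernel": row["Kernel Name"],
            "compute_pct": float(row["Compute (SM) Throughput"]),
            "memory_pct": float(row["Memory Throughput"]),
        })
    return rows


# Illustrative sample copied from the GEMM example in this document.
sample = (
    '"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"\n'
    '"ampere_fp16_gemm",1250000,78.5,35.2\n'
)
sol_rows = read_sol_rows(sample)
```

Keeping the raw CSV text next to the parsed values makes it easier to quote the source lines, as the Data Integrity rules require.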
Step 2: Escalate with Targeted Sections
Based on Step 1 classification, add sections:
| Classification | Sections to Add |
|---|---|
| Compute-bound | `ComputeWorkloadAnalysis` |
| Memory-bound | `MemoryWorkloadAnalysis` |
| Latency-bound | `LaunchStats`, `Occupancy` |
| Warp stalls | `WarpStateStats` |
| Need instruction breakdown | `InstructionStats` |
Always include `LaunchStats` and `Occupancy` when diagnosing latency-bound kernels. These reveal register pressure, shared memory limits, and block size issues.
Example -- memory-bound deep dive:
```bash
ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
  --kernel-name regex:"embedding_lookup" \
  --launch-count 3 \
  -- python script.py
```
Example -- compute-bound deep dive:
```bash
ncu --section SpeedOfLight --section ComputeWorkloadAnalysis --csv \
  --kernel-name regex:"gemm" \
  --launch-count 3 -- python script.py
```
Example -- occupancy investigation:
```bash
ncu --section SpeedOfLight --section LaunchStats --section Occupancy --csv \
  --kernel-name regex:"small_kernel" \
  -- python script.py
```
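The escalation rule can be kept as a lookup from classification to extra sections. A sketch covering only the three classes from the Classification Thresholds table: `SECTIONS_BY_CLASS` and `sections_for` are hypothetical names, and always keeping `SpeedOfLight` mirrors the three examples above.

```python
# Extra sections per bottleneck class, taken from the threshold table.
SECTIONS_BY_CLASS = {
    "compute-bound": ["ComputeWorkloadAnalysis"],
    "memory-bound": ["MemoryWorkloadAnalysis"],
    "latency-bound": ["LaunchStats", "Occupancy"],
}


def sections_for(classification):
    """Return the section list for a follow-up run, SpeedOfLight first."""
    return ["SpeedOfLight"] + SECTIONS_BY_CLASS.get(classification, [])
```

An unknown classification falls back to `SpeedOfLight` alone, which repeats Step 1 instead of guessing at sections.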
Step 3: Roofline Analysis (Optional)
For visual understanding of compute vs memory balance:
```bash
ncu --section SpeedOfLight_RooflineChart \
  --kernel-name regex:"KERNEL" -- COMMAND
```
For precision-specific hierarchical roofline:
```bash
# FP16 kernels
ncu --section SpeedOfLight_HierarchicalHalfRooflineChart \
  --kernel-name regex:"KERNEL" -- COMMAND

# Tensor core kernels
ncu --section SpeedOfLight_HierarchicalTensorRooflineChart \
  --kernel-name regex:"KERNEL" -- COMMAND
```
Interpretation: kernel left of ridge point = memory-bound; right = compute-bound;
far below both roofs = latency/occupancy issue. See `references/roofline-analysis.md`.
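The left/right-of-ridge rule is plain arithmetic: the ridge point is peak compute divided by peak bandwidth, and a kernel's position is its arithmetic intensity (FLOP per byte). A sketch with hypothetical inputs: `roofline_position` is not an ncu API, and the peak values must come from your own hardware data.

```python
def roofline_position(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Return (arithmetic intensity, ridge point, side of the ridge)."""
    intensity = flops / bytes_moved              # FLOP per byte moved
    ridge = peak_flops_per_s / peak_bytes_per_s  # FLOP per byte at the ridge
    side = "memory-bound" if intensity < ridge else "compute-bound"
    return intensity, ridge, side
```

This covers only the horizontal position. Whether the kernel sits far below both roofs still has to be read from the achieved throughput.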
Step 4: Interpret and Optimize
- Identify the dominant bottleneck from SOL% classification
- Look up detailed analysis and optimization strategies in `references/bottleneck-guide.md`
- Apply highest-impact optimization first
- Re-profile to validate improvement and detect bottleneck shifts
Step 5: Validate
Re-profile the same kernel after optimization:
```bash
ncu --section SpeedOfLight --csv \
  --kernel-name regex:"optimized_kernel" \
  --launch-count 3 \
  -- python optimized_script.py
```
Compare: Did throughput % increase? Did duration decrease? Did the bottleneck type shift?
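The three comparison questions can be answered mechanically once both profiles are reduced to a few fields. A sketch under assumptions: `compare_profiles` is a hypothetical helper, and the dictionary keys are illustrative names to be filled from real ncu output for each run.

```python
def compare_profiles(before, after):
    """Summarize throughput change, speedup, and bottleneck shift."""
    return {
        "compute_delta": after["compute_pct"] - before["compute_pct"],
        "memory_delta": after["memory_pct"] - before["memory_pct"],
        # Values above 1.0 mean the optimized kernel is faster.
        "speedup": before["duration_ns"] / after["duration_ns"],
        "bottleneck_shifted": before["bottleneck"] != after["bottleneck"],
    }
```

A shifted bottleneck means the next optimization round should restart from Step 1 with the new classification.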
Profiling JIT-Compiled Kernels (Triton/cuTile/CuTeDSL)
JIT-compiled kernels trigger autotuning on first invocation. Isolate the actual execution:
- Warmup first: Run the kernel 3-5 times to complete JIT compilation and autotuning, then call `torch.cuda.synchronize()`.
- Use profiler markers: Bracket the measured region with `cudaProfilerStart()` / `cudaProfilerStop()`.
- Use `--profile-from-start off` so ncu only captures the marked region:
```python
import torch

# `kernel` and `inputs` are placeholders for the JIT-compiled kernel and its arguments.

# Warmup (JIT + autotuning)
for _ in range(5):
    result = kernel(inputs)
torch.cuda.synchronize()

# Profile only steady-state
torch.cuda.cudart().cudaProfilerStart()
for _ in range(3):
    result = kernel(inputs)
torch.cuda.synchronize()
torch.cuda.cudart().cudaProfilerStop()
```
```bash
ncu --profile-from-start off --section SpeedOfLight --csv \
  --kernel-name regex:"target_kernel" \
  --launch-count 3 -- python script.py
```
Alternative: use `--launch-skip N` to skip autotuning launches. See
`references/advanced-profiling.md` for NVTX range and replay mode alternatives.
Programmatic Report Analysis
Extract metrics from `.ncu-rep` files using the `ncu_report` Python module
(in `extras/python/` of the Nsight Compute installation):
```python
import ncu_report

ctx = ncu_report.load_report("report.ncu-rep")
for rng in ctx:
    for action in rng:  # one action per profiled kernel launch
        name = action.name()
        # These metrics are only present if the report collected them.
        compute = action["sm__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        memory = action["dram__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
        duration = action["gpu__time_duration.sum"].as_uint64()
        # Coarse classification; see Classification Thresholds for the full table.
        if compute > 60:
            classification = "compute-bound"
        elif memory > 60:
            classification = "memory-bound"
        else:
            classification = "latency-bound"
        print(f"{name}: {classification} (compute={compute:.1f}%, mem={memory:.1f}%, {duration}ns)")
```
See `references/python-report-api.md` for the full API (IContext, IRange, IAction, IMetric classes).
Output Formats
CSV output (for scripting and automated analysis):
```bash
ncu --csv --section SpeedOfLight --kernel-name regex:"KERNEL" -- COMMAND
ncu --csv --page raw --section SpeedOfLight -- COMMAND  # All metrics flat
```
Report files (for later analysis):
```bash
ncu -o report --section SpeedOfLight -- COMMAND
ncu --import report.ncu-rep --csv --page raw  # Export to CSV
```
Key CSV columns:
| Column | Meaning |
|---|---|
| `Kernel Name` | CUDA kernel function name |
| `Duration` | Execution time (nanoseconds) |
| `Compute (SM) Throughput` | % of peak compute |
| `Memory Throughput` | % of peak memory bandwidth |
| `Achieved Occupancy` | Active warps / max warps (%) |
Success indicators:
- SOL% values present in output -> profiling succeeded
- Duration values reasonable (not 0 or extremely large)
- Multiple launches captured when `--launch-count > 1`
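The success indicators can be checked in code before any number is interpreted. A sketch under assumptions: `check_profile_rows` is a hypothetical helper, the row keys are illustrative names for already-parsed values, and the 60-second ceiling for "extremely large" is an arbitrary choice.

```python
def check_profile_rows(rows, expected_launches=1, max_duration_ns=60 * 10**9):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if len(rows) < expected_launches:
        problems.append(f"captured {len(rows)} launches, expected {expected_launches}")
    for index, row in enumerate(rows):
        if row.get("compute_pct") is None or row.get("memory_pct") is None:
            problems.append(f"row {index}: SOL% value missing")
        duration = row.get("duration_ns")
        # Assumption: anything above max_duration_ns counts as "extremely large".
        if duration is None or duration <= 0 or duration > max_duration_ns:
            problems.append(f"row {index}: suspicious duration {duration}")
    return problems
```

A non-empty result should be reported as a failed or partial profile, not patched over with estimated values.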
Examples
Example: Classify a GEMM Kernel
```bash
ncu --section SpeedOfLight --csv \
  --kernel-name regex:"gemm" \
  --launch-skip 5 --launch-count 3 \
  -- python train.py
```
Output:
```
"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"
"ampere_fp16_gemm",1250000,78.5,35.2
```
Interpretation: compute-bound (78.5% compute, 35.2% memory). Next step:
check tensor core usage with `--section ComputeWorkloadAnalysis`.
Example: Diagnose a Memory-Bound Embedding Kernel
```bash
ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
  --kernel-name regex:"embedding" \
  --launch-count 3 -- python train.py
```
Check L1/L2 cache hit rates and coalescing efficiency in output. Low hit rates
suggest poor data locality; low coalescing efficiency suggests scattered access.
Error Handling
| Error | Cause | Fix |
|---|---|---|
| `ncu` not found | Not in PATH | Install the CUDA Toolkit or set the `NCU` env var to the binary path |
| Permission error (e.g. `ERR_NVGPUCTRPERM`) | Needs elevated privileges | Run with `sudo`, grant `CAP_SYS_ADMIN`, or use `--privileged` in containers |
| No kernels captured | Name regex doesn't match | Run without `--kernel-name` first to see the actual kernel names |
| Profiling extremely slow | Using bulk `--set` collection | Use targeted `--section` flags |
| Autotuning pollutes results | JIT kernel warmup captured | Use `--launch-skip N` or `--profile-from-start off` |
| Metrics show 0% tensor cores | Kernel doesn't use tensor cores | Check with `--section ComputeWorkloadAnalysis` |
| Report file too large | Bulk `--set` collection | Use targeted sections; limit with `--launch-count` |
| Out-of-range metric values | Async GPU activity or short kernels | Profile on isolated GPU; increase workload size |
| | Dependent kernels across ranks | See `references/advanced-profiling.md` (replay modes, MPI) |
Finding More Information
Tier 1: This File (SKILL.md)
You are reading it now. The section-first workflow and error table above cover
the most common profiling tasks. Search this file first.
Tier 2: references/ Directory
Grep for keywords across `references/` -- headers are grep-friendly:
- `references/cli-reference.md` -- Complete CLI options, filtering, output formats
- `references/metrics-guide.md` -- Hardware model, metric naming, key metrics
- `references/sections-guide.md` -- All `--section` names, when to use each
- `references/bottleneck-guide.md` -- Per-bottleneck root causes and optimization
- `references/memory-analysis.md` -- Memory hierarchy, cache analysis, coalescing
- `references/roofline-analysis.md` -- Roofline charts and interpretation
- `references/advanced-profiling.md` -- Replay modes, MPI, CUDA graphs, PM sampling, customization
- `references/python-report-api.md` -- `ncu_report` Python module API
How to search:
- `Grep` for your keyword across `references/`
- `Read` only the file that Grep points to
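Where a `Grep` tool is not available, the same keyword search can be approximated with the standard library. A sketch only: `grep_references` is a hypothetical helper, and the demo builds a throwaway directory with made-up file content instead of reading the real `references/` files.

```python
import tempfile
from pathlib import Path


def grep_references(keyword, root="references"):
    """Return (file name, line number, line) for case-insensitive matches."""
    hits = []
    for path in sorted(Path(root).glob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if keyword.lower() in line.lower():
                hits.append((path.name, lineno, line.strip()))
    return hits


# Demo against a temporary directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "memory-analysis.md"
    demo_file.write_text("## Coalescing\nScattered access lowers efficiency.\n", encoding="utf-8")
    demo_hits = grep_references("coalescing", root=tmp)
```

The file names in the hits tell you which single reference to open next.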
Tier 3: Official Documentation
If Tiers 1-2 don't answer:
- Profiling Guide -- Metrics, hardware model, analysis concepts
- CLI Reference -- Full CLI options
- Python Report Interface -- `ncu_report` API
- Customization Guide -- Section files, rules
WebFetch or WebSearch these URLs for the latest content. Consider distilling
new findings back into `references/`.