Nsight Compute Analysis
NVIDIA Nsight Compute (
) profiles individual CUDA kernels to determine
why they are slow and what to optimize. It measures GPU throughput as a
percentage of theoretical peak (Speed of Light / SOL%), enabling systematic
bottleneck classification and targeted optimization.
When to Use
Reach for this skill when you encounter:
- Triggers: User wants to profile a CUDA kernel, analyze output,
interpret reports, or optimize GPU kernel performance
- Symptoms: Kernel running slower than expected, low GPU utilization,
need to classify compute-bound vs memory-bound, occupancy issues
- Keywords: "ncu", "nsight compute", "SOL%", "speed of light", "kernel
profiling", "compute-bound", "memory-bound", "latency-bound", "occupancy",
"roofline", "warp stalls", "cache hit rate", "ncu-rep"
Do NOT use this skill for:
- System-level profiling (use Nsight Systems / instead)
- CUDA API tracing or CPU-GPU timeline analysis (use )
- GPU monitoring without profiling (use )
Requirements
| Dependency | Version | Notes |
|---|
| CUDA Toolkit | >=11.0 | Includes |
| binary | Match CUDA version | Or set env var |
| NVIDIA GPU | Kepler+ | Volta+ recommended |
Permissions:
may require
,
, or
in containers. Check with
first.
Principles
Data Integrity
This is a data-driven analysis system. Every number you present must have
an authoritative source. Follow these rules without exception:
- Quote before you interpret. When presenting metrics from ncu output,
always show the actual ncu command you ran AND the relevant raw output
(CSV lines, metric values) before stating any numeric conclusion.
- Never fabricate metrics. If ncu fails, returns unexpected output, or
you cannot run it, say so explicitly. Do not invent plausible-looking
numbers. An honest "profiling failed" is better than fabricated data.
- Attribute every value. For each metric you cite (SOL%, duration,
occupancy, throughput), the reader must be able to trace it back to a
specific line in the raw ncu output you showed.
SOL% Mental Model
Speed of Light (SOL%) measures how close a kernel runs to the GPU's theoretical peak:
- Compute SOL% = actual compute throughput / peak compute throughput
- Memory SOL% = actual memory throughput / peak memory throughput
A kernel cannot saturate both simultaneously. The higher metric reveals the bottleneck type. Use this as the primary classification signal.
Classification Thresholds
| Compute % | Memory % | Bottleneck | Next Step |
|---|
| >60 | <40 | Compute-bound | ComputeWorkloadAnalysis section |
| <40 | >60 | Memory-bound | MemoryWorkloadAnalysis section |
| <40 | <40 | Latency-bound | LaunchStats + Occupancy sections |
| 40-60 | 40-60 | Balanced | Profile deeper with detailed sections |
Additional signals:
- Duration <10us with many launches -> Launch-overhead bound (use nsys first)
- Both <40% but occupancy >50% -> Instruction-bound (check InstructionStats)
SOL% Performance Levels
| SOL% | Level | Action |
|---|
| >80% | Excellent | Minor tuning only |
| 60-80% | Good | Targeted optimization |
| 40-60% | Fair | Significant optimization needed |
| <40% | Poor | Major rework needed |
Section-First Profiling
Always use targeted
flags instead of bulk
collection. Individual sections are faster and more surgical. Only escalate to
or
when broad exploration is needed.
ncu vs nsys
| Tool | Scope | Overhead | Purpose |
|---|
| nsys | System-level | 5-10% | Find which kernels to optimize |
| ncu | Kernel-level | 10-100x slower | Understand why a kernel is slow |
Use nsys first to identify top kernels by GPU time, then ncu for deep analysis of those specific kernels.
Workflow
Choose your path based on the request:
- Knowledge query (what metrics to use, --section vs --set, how to filter kernels):
Answer directly from Principles, Command Reference, and References below. Do NOT run ncu.
- Quick diagnosis (classify bottleneck, check SOL%): Step 1 only. Escalate if user wants more.
- Specific diagnosis (bank conflicts, register pressure, occupancy): Quick SOL% check (Step 1),
then go directly to the relevant section in Step 2.
- Deep analysis (detailed report, optimization recommendations): Full Steps 1-5.
Present the complete structured report with all key metrics (SOL%, duration,
occupancy) in your final response — do not split the report across messages
or replace it with a brief summary.
Step 0: Verify ncu
If not found, ensure CUDA toolkit is installed or set
env var to the binary path.
Step 1: SOL% Diagnosis
Always start with SpeedOfLight to classify the bottleneck:
bash
ncu --section SpeedOfLight --csv \
--kernel-name regex:"KERNEL" \
--launch-skip 5 --launch-count 3 \
-- COMMAND
Read
and
from the output. Classify using the thresholds above.
Step 2: Escalate with Targeted Sections
Based on Step 1 classification, add sections:
| Classification | Sections to Add |
|---|
| Compute-bound | |
| Memory-bound | |
| Latency-bound | , |
| Warp stalls | , |
| Need instruction breakdown | |
Always include
and
when diagnosing latency-bound kernels. These reveal register pressure, shared memory limits, and block size issues.
Example -- memory-bound deep dive:
bash
ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
--kernel-name regex:"embedding_lookup" \
--launch-count 3 \
-- python script.py
Example -- compute-bound deep dive:
bash
ncu --section SpeedOfLight --section ComputeWorkloadAnalysis --csv \
--kernel-name regex:"gemm" \
--launch-count 3 -- python script.py
Example -- occupancy investigation:
bash
ncu --section SpeedOfLight --section LaunchStats --section Occupancy --csv \
--kernel-name regex:"small_kernel" \
-- python script.py
Step 3: Roofline Analysis (Optional)
For visual understanding of compute vs memory balance:
bash
ncu --section SpeedOfLight_RooflineChart \
--kernel-name regex:"KERNEL" -- COMMAND
For precision-specific hierarchical roofline:
bash
# FP16 kernels
ncu --section SpeedOfLight_HierarchicalHalfRooflineChart \
--kernel-name regex:"KERNEL" -- COMMAND
# Tensor core kernels
ncu --section SpeedOfLight_HierarchicalTensorRooflineChart \
--kernel-name regex:"KERNEL" -- COMMAND
Interpretation: kernel left of ridge point = memory-bound; right = compute-bound;
far below both roofs = latency/occupancy issue. See
references/roofline-analysis.md
.
Step 4: Interpret and Optimize
- Identify the dominant bottleneck from SOL% classification
- Look up detailed analysis and optimization strategies in
references/bottleneck-guide.md
- Apply highest-impact optimization first
- Re-profile to validate improvement and detect bottleneck shifts
Step 5: Validate
Re-profile the same kernel after optimization:
bash
ncu --section SpeedOfLight --csv \
--kernel-name regex:"optimized_kernel" \
--launch-count 3 \
-- python optimized_script.py
Compare: Did throughput % increase? Did duration decrease? Did the bottleneck type shift?
Profiling JIT-Compiled Kernels (Triton/cuTile/CuTeDSL)
JIT-compiled kernels trigger autotuning on first invocation. Isolate the actual execution:
- Warmup first: Run the kernel 3-5 times to complete JIT compilation and autotuning, then .
- Use profiler markers: Bracket the measured region with /.
- Use so ncu only captures the marked region:
python
# Warmup (JIT + autotuning)
for _ in range(5):
result = kernel(inputs)
torch.cuda.synchronize()
# Profile only steady-state
torch.cuda.cudart().cudaProfilerStart()
for _ in range(3):
result = kernel(inputs)
torch.cuda.synchronize()
torch.cuda.cudart().cudaProfilerStop()
bash
ncu --profile-from-start off --section SpeedOfLight --csv \
--kernel-name regex:"target_kernel" \
--launch-count 3 -- python script.py
Alternative: use
to skip autotuning launches. See
references/advanced-profiling.md
for NVTX range and replay mode alternatives.
Programmatic Report Analysis
Extract metrics from
files using the
Python module
(in
of the Nsight Compute installation):
python
import ncu_report
ctx = ncu_report.load_report("report.ncu-rep")
for rng in ctx:
for action in rng:
name = action.name()
compute = action["sm__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
memory = action["dram__throughput.avg.pct_of_peak_sustained_elapsed"].as_double()
duration = action["gpu__time_duration.sum"].as_uint64()
if compute > 60:
classification = "compute-bound"
elif memory > 60:
classification = "memory-bound"
else:
classification = "latency-bound"
print(f"{name}: {classification} (compute={compute:.1f}%, mem={memory:.1f}%, {duration}ns)")
See
references/python-report-api.md
for the full API (IContext, IRange, IAction, IMetric classes).
Output Formats
CSV output (for scripting and automated analysis):
bash
ncu --csv --section SpeedOfLight --kernel-name regex:"KERNEL" -- COMMAND
ncu --csv --page raw --section SpeedOfLight -- COMMAND # All metrics flat
Report files (for later analysis):
bash
ncu -o report --section SpeedOfLight -- COMMAND
ncu --import report.ncu-rep --csv --page raw # Export to CSV
Key CSV columns:
| Column | Meaning |
|---|
| CUDA kernel function name |
| Execution time (nanoseconds) |
| % of peak compute |
| % of peak memory bandwidth |
| Active warps / max warps (%) |
Success indicators:
- SOL% values present in output -> profiling succeeded
- Duration values reasonable (not 0 or extremely large)
- Multiple launches captured when
Examples
Example: Classify a GEMM Kernel
bash
ncu --section SpeedOfLight --csv \
--kernel-name regex:"gemm" \
--launch-skip 5 --launch-count 3 \
-- python train.py
Output:
"Kernel Name","Duration","Compute (SM) Throughput","Memory Throughput"
"ampere_fp16_gemm",1250000,78.5,35.2
Interpretation: compute-bound (78.5% compute, 35.2% memory). Next step:
check tensor core usage with
--section ComputeWorkloadAnalysis
.
Example: Diagnose a Memory-Bound Embedding Kernel
bash
ncu --section SpeedOfLight --section MemoryWorkloadAnalysis --csv \
--kernel-name regex:"embedding" \
--launch-count 3 -- python train.py
Check L1/L2 cache hit rates and coalescing efficiency in output. Low hit rates
suggest poor data locality; low coalescing efficiency suggests scattered access.
Error Handling
| Error | Cause | Fix |
|---|
| Not in PATH | export PATH=$PATH:/usr/local/cuda/bin
or set |
| Needs elevated privileges | or in containers |
| No kernels captured | Name regex doesn't match | Run without first to see actual names |
| Profiling extremely slow | Using or many sections | Use only; reduce |
| Autotuning pollutes results | JIT kernel warmup captured | Use with profiler markers |
| Metrics show 0% tensor cores | Kernel doesn't use tensor cores | Check with --section InstructionStats
; verify dimensions align to 8/16 |
| Report file too large | with many kernels | Use targeted sections; limit with and |
| Out-of-range metric values | Async GPU activity or short kernels | Profile on isolated GPU; increase workload size |
| hangs on MPI app | Dependent kernels across ranks | Use --communicator=tcp --lockstep-kernel-launch
|
Finding More Information
Tier 1: This File (SKILL.md)
You are reading it now. The section-first workflow and error table above cover
the most common profiling tasks. Search this file first.
Tier 2: references/ Directory
Grep for keywords across
-- headers are grep-friendly:
references/cli-reference.md
-- Complete CLI options, filtering, output formats
references/metrics-guide.md
-- Hardware model, metric naming, key metrics
references/sections-guide.md
-- All names, when to use each
references/bottleneck-guide.md
-- Per-bottleneck root causes and optimization
references/memory-analysis.md
-- Memory hierarchy, cache analysis, coalescing
references/roofline-analysis.md
-- Roofline charts and interpretation
references/advanced-profiling.md
-- Replay modes, MPI, CUDA graphs, PM sampling, customization
references/python-report-api.md
-- Python module API
How to search:
- for your keyword across
- only the file that Grep points to
Tier 3: Official Documentation
If Tiers 1-2 don't answer:
- Profiling Guide -- Metrics, hardware model, analysis concepts
- CLI Reference -- Full CLI options
- Python Report Interface -- API
- Customization Guide -- Section files, rules
WebFetch or WebSearch these URLs for the latest content. Consider distilling
new findings back into
.