Query NVIDIA PTX ISA 9.1, CUDA Runtime API 13.1, Driver API 13.1, Programming Guide v13.1, Best Practices Guide, Nsight Compute, Nsight Systems local documentation. Debug and optimize GPU kernels with nsys/ncu/compute-sanitizer workflows. Use when writing, debugging, or optimizing CUDA code, GPU kernels, PTX instructions, inline PTX, TensorCore operations (WMMA, WGMMA, TMA, tcgen05), or when the user mentions CUDA API functions, error codes, device properties, memory management, profiling, GPU performance, compute capabilities, CUDA Graphs, Cooperative Groups, Unified Memory, dynamic parallelism, or CUDA programming model concepts.
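npx skill4agent add slowlyc/agent-gpu-skills cuda-skill

All documentation lives under references/. Depending on where the skill is installed, that directory is one of:
~/.cursor/skills/cuda-skill/references/
~/.claude/skills/cuda-skill/references/
~/.agents/skills/cuda-skill/references/

To resolve the location automatically:
CUDA_REFS="$(dirname "$(find ~/.cursor/skills ~/.claude/skills ~/.agents/skills -name 'cuda-skill' -type d 2>/dev/null | head -1)")/cuda-skill/references"

The rg examples below use ~/.cursor/skills/cuda-skill/references/; substitute the path that matches your install.

references/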
├── ptx-docs/ # PTX ISA 9.1 full spec (405 files, 2.3MB)
├── ptx-simple/ # PTX condensed quick-ref (13 files, 149KB)
├── cuda-runtime-docs/ # CUDA Runtime API 13.1 (107 files, 0.9MB)
├── cuda-driver-docs/ # CUDA Driver API 13.1 (128 files, 0.8MB)
├── cuda-guide/ # CUDA Programming Guide v13.1 (39 pages, 1.6MB)
│ ├── 01-introduction/ # Programming model, CUDA platform
│ ├── 02-basics/ # CUDA C++, kernels, async, memory, nvcc
│ ├── 03-advanced/ # Advanced APIs, kernel programming, driver API, multi-GPU
│ ├── 04-special-topics/ # Graphs, Unified Memory, Coop Groups, TMA, etc.
│ ├── 05-appendices/ # Compute Capabilities, C++ extensions, math funcs
│ └── INDEX.md
├── best-practices-guide/ # CUDA C++ Best Practices Guide
├── ncu-docs/ # Nsight Compute full docs (ProfilingGuide, CLI, etc.)
├── nsys-docs/ # Nsight Systems full docs (UserGuide, etc.)
├── ptx-isa.md # PTX search guide
├── cuda-runtime.md # Runtime API search guide
├── cuda-driver.md # Driver API search guide
├── nsys-guide.md # Nsight Systems quick reference
├── ncu-guide.md # Nsight Compute quick reference
├── debugging-tools.md # compute-sanitizer, cuda-gdb
├── nvtx-patterns.md # NVTX instrumentation
└── performance-traps.md # Bank conflicts, coalescing

ptx-simple/
├── ptx-isa-arithmetic.md # add, sub, mul, mad, fma, div, min, max
├── ptx-isa-data-types.md # Types, cvt, rounding, pack
├── ptx-isa-memory-spaces.md # .reg, .global, .shared, fences
├── ptx-isa-load-store.md # ld, st, prefetch
├── ptx-isa-control-flow.md # @p, setp, bra, call, ret, exit
├── ptx-isa-tensor-cores.md # mma.sync, ldmatrix, wgmma
├── ptx-isa-async-copy.md # cp.async, cp.async.bulk, TMA
├── ptx-isa-barriers.md # bar.sync, mbarrier
├── ptx-isa-warp-ops.md # shfl, vote, match, redux
├── ptx-isa-cache-hints.md # Cache control
├── ptx-isa-sm90-hopper.md # Hopper-specific (sm_90)
├── ptx-isa-sm100-blackwell.md # Blackwell-specific (sm_100, tcgen05)
└── ptx-isa-misc.md # Other instructions

# Find specific instruction
rg "mbarrier.init" ~/.cursor/skills/cuda-skill/references/ptx-docs/9-instruction-set/
# Find WGMMA register fragments
rg "register fragment" ~/.cursor/skills/cuda-skill/references/ptx-docs/9-instruction-set/ | rg -i wgmma
# Find TMA swizzling modes
rg "swizzle_mode" ~/.cursor/skills/cuda-skill/references/ptx-docs/
# Quick PTX syntax lookup (condensed)
rg "wgmma" ~/.cursor/skills/cuda-skill/references/ptx-simple/ptx-isa-tensor-cores.md

# Error code meaning
rg "cudaErrorInvalidValue" ~/.cursor/skills/cuda-skill/references/cuda-runtime-docs/
# Function documentation
rg -A 20 "cudaStreamSynchronize" ~/.cursor/skills/cuda-skill/references/cuda-runtime-docs/modules/group__cudart__stream.md
# Struct fields
rg "" ~/.cursor/skills/cuda-skill/references/cuda-runtime-docs/data-structures/structcudadeviceprop.md

# Context management
rg -A 20 "cuCtxCreate" ~/.cursor/skills/cuda-skill/references/cuda-driver-docs/modules/group__cuda__ctx.md
# Module loading
rg "cuModuleLoad" ~/.cursor/skills/cuda-skill/references/cuda-driver-docs/modules/group__cuda__module.md
# Virtual memory
rg "cuMemMap" ~/.cursor/skills/cuda-skill/references/cuda-driver-docs/modules/group__cuda__va.md

# Compute Capabilities table
rg -A 5 "sm_90" ~/.cursor/skills/cuda-skill/references/cuda-guide/05-appendices/compute-capabilities.md
# CUDA Graphs usage
rg "cudaGraph" ~/.cursor/skills/cuda-skill/references/cuda-guide/04-special-topics/cuda-graphs.md
# Cooperative Groups
rg "cooperative" ~/.cursor/skills/cuda-skill/references/cuda-guide/04-special-topics/cooperative-groups.md
# Unified Memory behavior
rg "managed" ~/.cursor/skills/cuda-skill/references/cuda-guide/04-special-topics/unified-memory.md
# Thread Block Clusters (Hopper+)
rg "cluster" ~/.cursor/skills/cuda-skill/references/cuda-guide/01-introduction/programming-model.md
# Programming Guide index (discover all topics)
cat ~/.cursor/skills/cuda-skill/references/cuda-guide/INDEX.md

# Memory coalescing best practices
rg -i "coalescing" ~/.cursor/skills/cuda-skill/references/best-practices-guide/
# Occupancy optimization
rg -i "occupancy" ~/.cursor/skills/cuda-skill/references/best-practices-guide/
# Shared memory usage patterns
rg -i "shared memory" ~/.cursor/skills/cuda-skill/references/best-practices-guide/

# Metric meanings and collection
rg -i "metric" ~/.cursor/skills/cuda-skill/references/ncu-docs/ProfilingGuide.md
# CLI usage and options
rg -i "section" ~/.cursor/skills/cuda-skill/references/ncu-docs/NsightComputeCli.md
# Roofline analysis
rg -i "roofline" ~/.cursor/skills/cuda-skill/references/ncu-docs/ProfilingGuide.md

# CLI profiling options
rg -i "nsys profile" ~/.cursor/skills/cuda-skill/references/nsys-docs/UserGuide.md
# CUDA trace analysis
rg -i "cuda.*trace" ~/.cursor/skills/cuda-skill/references/nsys-docs/UserGuide.md

| Need | Source | Path shorthand |
|---|---|---|
| PTX instruction syntax/semantics | Full PTX docs | |
| Quick PTX syntax check | Condensed PTX | |
| State spaces, data types | Full PTX docs | |
| Memory consistency model | Full PTX docs | |
| Special registers (%tid, etc.) | Full PTX docs | |
| Directives (.version, .target) | Full PTX docs | |
| CUDA Runtime functions | Runtime docs | |
| CUDA structs (cudaDeviceProp) | Runtime docs | |
| Driver API (cuCtx, cuModule) | Driver docs | |
| sm_90 / Hopper specifics | Condensed PTX | |
| sm_100 / Blackwell / tcgen05 | Condensed PTX | |
| CUDA C++ programming concepts | Programming Guide | |
| Thread/block/grid model | Programming Guide | |
| Compute Capabilities table | Programming Guide | |
| CUDA Graphs usage | Programming Guide | |
| Unified Memory | Programming Guide | |
| Cooperative Groups | Programming Guide | |
| Async barriers/pipelines (C++) | Programming Guide | |
| L2 cache control | Programming Guide | |
| Dynamic parallelism | Programming Guide | |
| C++ language extensions | Programming Guide | |
| Math functions (device) | Programming Guide | |
| Multi-GPU programming | Programming Guide | |
| Environment variables | Programming Guide | |
| Memory optimization practices | Best Practices | |
| Performance profiling strategy | Best Practices | |
| ncu metrics, sections, roofline | Nsight Compute | |
| ncu CLI options and workflows | Nsight Compute | |
| nsys profiling and tracing | Nsight Systems | |
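
The lookups in the table above all follow the same pattern: find the installed references/ directory, then search one subtree. A minimal sketch of that pattern follows. The function names `cuda_refs_dir` and `cuda_ref_search` are hypothetical and not part of the skill; the sketch assumes one of the three install locations listed earlier and falls back to `grep` when `rg` is not installed.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not shipped with the skill): print the first
# references/ directory found among the documented install locations.
cuda_refs_dir() {
  local base
  for base in "$HOME/.cursor/skills" "$HOME/.claude/skills" "$HOME/.agents/skills"; do
    if [ -d "$base/cuda-skill/references" ]; then
      printf '%s\n' "$base/cuda-skill/references"
      return 0
    fi
  done
  echo "cuda-skill references not found" >&2
  return 1
}

# Hypothetical helper: search one documentation subtree for a literal string.
# Usage: cuda_ref_search <subdir> <pattern>
cuda_ref_search() {
  local refs
  refs="$(cuda_refs_dir)" || return 1
  if command -v rg >/dev/null 2>&1; then
    # Treat the pattern literally so dots in names like mbarrier.init match only a dot.
    rg --fixed-strings --line-number -- "$2" "$refs/$1"
  else
    grep -rnF -- "$2" "$refs/$1"
  fi
}
```

For example, `cuda_ref_search ptx-docs "mbarrier.init"` would search the full PTX spec, and `cuda_ref_search cuda-runtime-docs "cudaErrorInvalidValue"` the Runtime API docs. Literal matching is a deliberate choice here: most lookups are for exact instruction or function names.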
if (idx == 0) printf(...)

compute-sanitizer --tool memcheck ./program
compute-sanitizer --tool racecheck ./program

cuda-gdb -batch -ex "run" -ex "bt" ./program

~/.cursor/skills/cuda-skill/references/debugging-tools.md

nsys profile -o report ./program
nsys stats report.nsys-rep --report cuda_gpu_kern_sum

ncu --kernel-name "myKernel" --set full -o report ./program

| Symptom | Likely Cause | Tool |
|---|---|---|
| Low GPU utilization | Launch overhead, CPU bottleneck | nsys timeline |
| Memory bound | Poor coalescing, low cache hit | ncu memory section |
| Compute bound but slow | Low occupancy, register pressure | ncu occupancy |
| High sectors/request (>4) | Poor coalescing | ncu memory metrics |
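
The symptom table implies an ordering: look at the whole-application timeline with nsys first, then drill into a single kernel with ncu. The sketch below prints that sequence as a dry run, using only the flags shown in this document. The function name `profile_plan` and its arguments are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical dry-run helper: echo the profiling commands in order,
# timeline first (nsys), then per-kernel detail (ncu).
# Nothing is executed; pipe the output to `sh` to actually run the commands.
# Usage: profile_plan <program> <kernel-name> [output-prefix]
profile_plan() {
  local program="$1" kernel="$2" out="${3:-report}"
  printf 'nsys profile -o %s %s\n' "$out" "$program"
  printf 'nsys stats %s.nsys-rep --report cuda_gpu_kern_sum\n' "$out"
  printf 'ncu --kernel-name "%s" --set full -o %s %s\n' "$kernel" "$out" "$program"
}
```

Calling `profile_plan ./program myKernel` prints the three commands from the profiling section above. Printing rather than executing keeps the plan reviewable before a long profiling run starts.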
~/.cursor/skills/cuda-skill/references/nsys-guide.md
~/.cursor/skills/cuda-skill/references/ncu-guide.md
~/.cursor/skills/cuda-skill/references/performance-traps.md
~/.cursor/skills/cuda-skill/references/ncu-docs/ProfilingGuide.md
~/.cursor/skills/cuda-skill/references/nsys-docs/UserGuide.md
~/.cursor/skills/cuda-skill/references/best-practices-guide/

# Debug
nvcc -g -G -O0 program.cu -o program_debug # -G already includes line info and overrides -lineinfo
# Release with line info (always use -lineinfo for profiling)
nvcc -O3 -lineinfo program.cu -o program
# Target architecture
nvcc -arch=sm_80 program.cu # Ampere
nvcc -arch=sm_90 program.cu # Hopper
nvcc -arch=sm_100 program.cu # Blackwell
# Generate PTX / inspect binary
nvcc -ptx program.cu
cuobjdump -ptx ./program
cuobjdump -sass ./program
nvcc --ptxas-options=-v program.cu # Register usage

__device__ int myAdd(int a, int b) {
    int result;
    asm("add.s32 %0, %1, %2;"
        : "=r"(result)
        : "r"(a), "r"(b));
    return result;
}
// Constraint codes: r=32b reg, l=64b reg, f=f32, d=f64, n=immediate

ptx-docs/
├── 1-introduction/
├── 2-programming-model/ # Thread hierarchy, memory
├── 3-ptx-machine-model/ # SIMT architecture
├── 4-syntax/ # PTX syntax rules
├── 5-state-spaces-types-and-variables/ # Memory spaces, data types
├── 6-instruction-operands/ # Operand types
├── 7-abstracting-the-abi/ # Functions, calling conventions
├── 8-memory-consistency-model/ # Memory ordering, atomics
├── 9-instruction-set/ # 186 instruction files
│ ├── 9.7.1-* Integer arithmetic
│ ├── 9.7.3-* Floating point
│ ├── 9.7.9-* Data movement (includes TMA)
│ ├── 9.7.14-* WMMA (sm_70+)
│ ├── 9.7.15-* WGMMA (sm_90a)
│ └── 9.7.16-* TensorCore Gen5 (sm_100+)
├── 10-special-registers/ # %tid, %ctaid, %clock64
├── 11-directives/ # .version, .target, .entry
├── 12-descriptions-ofpragmastrings/
└── 13-release-notes/

cd /path/to/cursor-gpu-skills
# Update everything
uv run scrape_docs.py all --force
# Or update individually:
uv run scrape_docs.py ptx-simple --force # Condensed PTX from triton repo
uv run scrape_docs.py ptx # Full PTX ISA from NVIDIA
uv run scrape_docs.py runtime # CUDA Runtime API
uv run scrape_docs.py driver # CUDA Driver API
uv run scrape_docs.py guide --force # CUDA Programming Guide v13.1
uv run scrape_docs.py best-practices --force # CUDA C++ Best Practices Guide
uv run scrape_docs.py ncu-docs --force # Nsight Compute docs
uv run scrape_docs.py nsys-docs --force # Nsight Systems docs

~/.cursor/skills/cuda-skill/references/ptx-isa.md
~/.cursor/skills/cuda-skill/references/cuda-runtime.md
~/.cursor/skills/cuda-skill/references/cuda-driver.md
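
After refreshing the docs with scrape_docs.py, it can help to confirm that every documentation subtree was actually populated. The following is a minimal sketch of such a check. The function name `check_refs` is hypothetical, and the subtree names are taken from the references/ layout shown earlier in this document.

```shell
#!/usr/bin/env bash

# Hypothetical post-update check: print the file count of each expected
# documentation subtree and return non-zero if any is missing or empty.
# Usage: check_refs <path-to-references-dir>
check_refs() {
  local refs="$1" status=0 dir count
  for dir in ptx-docs ptx-simple cuda-runtime-docs cuda-driver-docs \
             cuda-guide best-practices-guide ncu-docs nsys-docs; do
    if [ -d "$refs/$dir" ]; then
      count="$(find "$refs/$dir" -type f | wc -l | tr -d ' ')"
    else
      count=0
    fi
    printf '%-22s %s files\n' "$dir" "$count"
    [ "$count" -gt 0 ] || status=1
  done
  return "$status"
}
```

A typical call would be `check_refs ~/.cursor/skills/cuda-skill/references`. The counts can then be compared by eye against the sizes noted in the directory tree, for example 405 files for ptx-docs.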