CUDA Programming Skill
Measure before guessing. GPU performance is deeply counterintuitive. Profile first, hypothesize second, change third, verify fourth.
Small, isolated changes. CUDA bugs compound. Make one change, test it, commit it. Resist the urge to "fix everything at once."
printf is your strongest tool. When debuggers fail, when tools produce inscrutable output, printf in device code reveals truth. Don't be embarrassed to use it extensively.
Sometimes, stare at the diff. Inscrutable segfaults are common. Tools often don't help. The human approach: minimize the diff, read it carefully, see the bug. This is legitimate and often faster than tooling.
First Response to a Bug
- Reproduce minimally — Isolate the failing kernel with smallest possible input
- Add printf — Before any tool, add in device code to trace execution
- Run compute-sanitizer — Catch memory errors non-interactively:
```bash
compute-sanitizer --tool memcheck ./your_program
compute-sanitizer --tool racecheck ./your_program   # for race conditions
compute-sanitizer --tool initcheck ./your_program   # uninitialized memory
```
- If still stuck, try cuda-gdb non-interactively for backtrace:
```bash
cuda-gdb -batch -ex "run" -ex "bt" ./your_program
```
- When tools fail — Minimize the diff between working and broken code. Read it. The bug is in the diff.
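The steps above can be sketched as a minimal reproduction harness. This is a hedged sketch, not the guide's own code: `myKernel` and the launch configuration are illustrative stand-ins for the failing kernel, reduced to the smallest input that still fails, with a printf at entry and every API call checked so errors surface immediately.

```cuda
// Minimal repro sketch (names and sizes are illustrative).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 0) printf("kernel entered, n=%d\n", n);  // confirm launch
    if (idx < n) data[idx] *= 2.0f;
}

int main() {
    const int n = 4;  // smallest input that still reproduces the failure
    float* d = nullptr;
    cudaError_t err = cudaMalloc((void**)&d, n * sizeof(float));
    if (err != cudaSuccess) { fprintf(stderr, "malloc: %s\n", cudaGetErrorString(err)); return 1; }

    myKernel<<<1, 32>>>(d, n);
    err = cudaGetLastError();              // catches launch-configuration errors
    if (err != cudaSuccess) { fprintf(stderr, "launch: %s\n", cudaGetErrorString(err)); return 1; }

    err = cudaDeviceSynchronize();         // catches async kernel errors, flushes printf
    if (err != cudaSuccess) { fprintf(stderr, "sync: %s\n", cudaGetErrorString(err)); return 1; }

    cudaFree(d);
    return 0;
}
```

Checking both `cudaGetLastError()` (launch errors) and `cudaDeviceSynchronize()` (asynchronous kernel errors) matters: a kernel fault is otherwise reported by whichever later API call happens to observe it.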
printf in Device Code
```cuda
__global__ void myKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 0) {  // Limit output
        printf("Kernel launched, n=%d, data[0]=%f\n", n, data[0]);
    }
    // ... kernel logic ...
    if (idx < 10) {  // Sample a few threads
        printf("Thread %d: result=%f\n", idx, someValue);
    }
}
```
Key patterns:
- Guard with `idx == 0` or a small index range to avoid output flood
- Print at kernel entry to confirm launch
- Print intermediate values at suspected failure points
- Output is flushed automatically at kernel completion; it becomes visible once the host synchronizes
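One practical caveat when printing from many threads: device printf writes into a fixed-size FIFO, and overflowing it silently drops lines. A sketch of enlarging it with the runtime's `cudaDeviceSetLimit` (the 16 MB size is an illustrative choice, not a recommendation from this guide):

```cuda
// Enlarge the device printf FIFO before the first kernel launch (sketch).
#include <cuda_runtime.h>

int main() {
    // Default FIFO is small (~1 MB); heavy printf debugging can overflow it
    // and drop output without warning.
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 16 * 1024 * 1024);

    // ... launch kernels that printf heavily ...

    cudaDeviceSynchronize();  // printf output becomes visible at sync points
    return 0;
}
```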
compute-sanitizer Quick Reference
Common gotcha: an "Invalid shared write... out of bounds" error usually means insufficient dynamic shared memory was allocated at kernel launch, not wrong array indexing. Check the third launch parameter in `<<<grid, block, smem_size>>>`.
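A sketch of that gotcha: an `extern __shared__` array gets its size from the launch parameter, not from its declaration, so a too-small `smem_size` produces the out-of-bounds report above even though the indexing is correct. The kernel name and reduction are illustrative.

```cuda
// Dynamic shared memory sketch: `tile`'s size comes from the launch, not here.
__global__ void reduceKernel(const float* in, float* out, int n) {
    extern __shared__ float tile[];                  // sized by smem_size at launch
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // This write is out of bounds if smem_size < blockDim.x * sizeof(float),
    // even though the index itself is valid.
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();
    // ... reduction over tile ...
}

// Launch: the third parameter must cover every element the kernel touches.
// reduceKernel<<<grid, block, block.x * sizeof(float)>>>(in, out, n);
```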
Memory errors (most common)
```bash
compute-sanitizer --tool memcheck ./program
```
Other tools: racecheck, initcheck, synccheck
For detailed options, see references/debugging-tools.md
cuda-gdb Non-Interactive
Get backtrace on crash
```bash
cuda-gdb -batch -ex "run" -ex "bt" ./program
```
For breakpoints and thread inspection, see references/debugging-tools.md
**Compile with debug info:**
```bash
nvcc -g -G -lineinfo program.cu -o program
```
cuobjdump for Binary Inspection
Dump PTX and SASS
```bash
cuobjdump -ptx ./program
cuobjdump -sass ./program
```
For resource usage and symbol listing, see references/debugging-tools.md
**For complete debugging tool reference:** See `references/debugging-tools.md` for detailed compute-sanitizer options, cuda-gdb workflows, and cuobjdump analysis patterns.
Performance Optimization Workflow
Never optimize without profiling first. Intuition about GPU bottlenecks is almost always wrong. The profile → fix → verify loop is the actual optimization work, not a preliminary step.
Performance Investigation Steps
- Establish baseline — Time the operation, record it
- Profile with nsys — Get timeline, identify which kernels matter
- Deep-dive with ncu — Analyze specific bottleneck kernels
- Hypothesize — Based on metrics, form specific hypothesis
- Change one thing — Make a single targeted change
- Verify — Re-profile, confirm improvement
- Repeat
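Step 1 ("establish baseline") can be sketched with CUDA events, which time GPU work without a profiler attached. This is a hedged sketch: `workKernel`, the launch configuration, and the iteration count are illustrative, not from this guide.

```cuda
// Baseline timing sketch with CUDA events (names and sizes are illustrative).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void workKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = data[idx] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    workKernel<<<(n + 255) / 256, 256>>>(d, n);   // warm-up launch

    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i)                 // average over many launches
        workKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("baseline: %.4f ms per launch\n", ms / 100.0f);  // record this number

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```

Recording the baseline before step 2 gives the later verify step a concrete number to beat.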
nsys (Nsight Systems) — Timeline Profiling
Use nsys for: "Where is time being spent?" — CPU/GPU interaction, kernel launch patterns, memory transfers, overall timeline.
```bash
nsys profile -o report ./program
nsys stats report.nsys-rep --report cuda_gpu_kern_sum
```
With NVTX markers
```bash
nsys profile --trace=cuda,nvtx -o report ./program
```
Key reports: cuda_gpu_kern_sum, cuda_api_sum, cuda_gpu_mem_time_sum, nvtx_sum
For detailed usage, see references/nsys-guide.md
**For detailed nsys analysis patterns:** See `references/nsys-guide.md` for timeline interpretation, identifying common bottlenecks, and analysis workflows.
ncu (Nsight Compute) — Kernel Analysis
Use ncu for: "Why is this kernel slow?" — Detailed metrics, roofline, memory analysis, occupancy.
Profile specific kernel
```bash
ncu --kernel-name "myKernel" -o report ./program
```
Quick summary to stdout
```bash
ncu --set basic ./program
```
Sets: basic, full, memory, launch, roofline
Sections: ComputeWorkloadAnalysis, MemoryWorkloadAnalysis, Occupancy
For detailed metrics and interpretation, see references/ncu-guide.md
**Warning:** ncu expert system recommendations can be misleading. Always verify with actual metrics and experiments.
**Scale matters:** Optimizations that help at large scale can hurt at small scale. Always profile at your actual problem size, not theoretical maximums.
**For detailed ncu metric interpretation:** See `references/ncu-guide.md` for understanding roofline analysis, memory bottlenecks, occupancy limits, and warp scheduling.
NVTX for Custom Instrumentation
When you need finer granularity than kernel-level, use NVTX:
```cuda
#include <nvtx3/nvToolsExt.h>

nvtxRangePush("Operation Name");
// ... code to profile ...
nvtxRangePop();
```
Compile: link with `-lnvToolsExt`
Profile: `nsys profile --trace=cuda,nvtx`
For complete patterns: see `references/nvtx-patterns.md` for nested ranges, colors, and analysis workflows.
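One common refinement on the push/pop pattern above is a scope-based wrapper, so ranges stay balanced even on early returns or exceptions. This is a sketch (the `NvtxRange` struct name is illustrative, not part of the NVTX API):

```cuda
// Scope-based NVTX range (sketch): pop happens automatically at scope exit.
#include <nvtx3/nvToolsExt.h>

struct NvtxRange {
    explicit NvtxRange(const char* name) { nvtxRangePush(name); }
    ~NvtxRange() { nvtxRangePop(); }
    NvtxRange(const NvtxRange&) = delete;             // ranges are not copyable
    NvtxRange& operator=(const NvtxRange&) = delete;
};

void step() {
    NvtxRange r("step");   // pushed here, popped when `r` goes out of scope
    // ... work to profile ...
}                          // pop fires even on early return
```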
Common Performance Patterns
| Symptom | Likely Cause | Investigation |
|---|---|---|
| Low GPU utilization | Kernel launch overhead, CPU bottleneck | nsys timeline, look for gaps |
| Memory bound | Poor access patterns, low cache hit | ncu memory section, check coalescing |
| Compute bound but slow | Low occupancy, register pressure | ncu occupancy, reduce registers |
| Lots of small kernels | Launch overhead dominates | nsys timeline, consider fusion |
| High memcpy time | Excessive H2D/D2H transfers | nsys cuda_gpu_mem, batch transfers |
| Most cycles stalled | Bank conflicts, memory stalls | ncu SchedulerStatistics, check shared memory |
| High sectors/request | Poor coalescing (>4 sectors/req) | ncu memory metrics, use vectorized loads |
Critical traps: Bank conflicts and memory coalescing issues often dominate performance but aren't obvious without profiling. See `references/performance-traps.md` for detailed diagnosis and fixes.
Reality check: Budget 80% of optimization time for problems you didn't predict. Profile-driven iteration discovers the real bottlenecks.
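The "use vectorized loads" fix from the high sectors/request row can be sketched as follows. Loading `float4` instead of four scalar floats issues fewer, wider transactions per warp; the kernel name is illustrative, and it assumes the pointers are 16-byte aligned and the element count is a multiple of 4.

```cuda
// Vectorized-load sketch: one 16-byte transaction per thread instead of four
// 4-byte ones. Assumes 16-byte-aligned pointers and n divisible by 4.
__global__ void scaleVec4(const float4* in, float4* out, float s, int n4) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n4) {                  // n4 = n / 4
        float4 v = in[idx];          // one coalesced 16-byte load
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        out[idx] = v;                // one coalesced 16-byte store
    }
}
```

Whether this actually helps depends on the measured sectors/request metric, which loops back to the verify step: re-profile after the change.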
Compilation Reference
```bash
nvcc -g -G -lineinfo -O0 program.cu -o program_debug   # debug build
nvcc -O3 -lineinfo program.cu -o program               # optimized build
```
Specific architecture
```bash
nvcc -arch=sm_80 program.cu -o program   # Ampere
nvcc -arch=sm_89 program.cu -o program   # Ada Lovelace
nvcc -arch=sm_90 program.cu -o program   # Hopper
```
Generate PTX (inspect it): `nvcc -ptx program.cu`
Verbose compilation (see register usage)
```bash
nvcc --ptxas-options=-v program.cu
```
Link NVTX for instrumentation
```bash
nvcc program.cu -lnvToolsExt -o program
```
**Always compile with `-lineinfo` for production profiling** — minimal overhead, enables source correlation.
Local API Documentation
Complete reference documentation available for grep-based search:
PTX ISA 9.1 — (405 files, 2.3MB)
- Search guide:
- Use for: Instruction-level optimization, inline PTX, TensorCore operations (WMMA, WGMMA, TMA), memory swizzling
CUDA Runtime API 13.1 — references/cuda-runtime-docs/ (107 files, 0.9MB)
- Search guide: references/cuda-runtime.md
- Use for: Error codes, API parameters, device properties, memory management, stream behavior
CUDA Driver API 13.1 — references/cuda-driver-docs/ (128 files, 0.8MB)
- Search guide: references/cuda-driver.md
- Use for: Context management, module loading, virtual memory, driver errors, advanced features
Each search guide contains grep examples, documentation structure, and common usage patterns.
Search strategy: Use grep/ripgrep to search directly in the documentation directories. The search guides provide navigation patterns and common queries.
Additional References
- references/performance-traps.md — Bank conflicts, memory coalescing, scale-dependent optimizations
- references/debugging-tools.md — compute-sanitizer, cuda-gdb, cuobjdump detailed usage
- references/nsys-guide.md — nsys timeline analysis and bottleneck identification
- references/ncu-guide.md — ncu metrics, roofline, occupancy interpretation
- references/nvtx-patterns.md — NVTX instrumentation and profiling patterns
Checklist Before Optimizing