cuda-auto-tune
NCU-driven iterative kernel optimization (CUDA / CUTLASS / Triton / CuTe DSL)
GATE CHECK (enforce before any optimization)
STOP — Do you have NCU profile data for this kernel?
NO → Go to Step 1. Do NOT touch any kernel code.
YES → Go to Step 2.

Hard rules — violation of any rule invalidates the entire optimization:
- NEVER change kernel code, launch config, or template parameters without NCU data.
- ALL recommendations MUST cite specific NCU metric values as evidence.
- Each iteration MUST cover at minimum: roofline, memory hierarchy, warp stalls, occupancy.
- The optimization playbook MUST match the kernel implementation type.
- After EVERY code change, re-profile and compare with `--diff`.
- Stop iterating when improvements plateau or metrics approach the hardware ceiling.
Mandatory optimization loop
┌─────────────────────────────────────────────────────────────────────┐
│ Step 1: Profile (NCU --set full) │
│ ↓ │
│ Step 2: Multi-dimensional analysis + identify kernel type │
│ ↓ │
│ Step 3: Apply type-specific playbook (one change per iteration) │
│ ↓ │
│ Step 4: Re-profile + diff → improved? → loop or stop │
│ ↑ │ │
│ └───────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Step 1: Profile with NCU (REQUIRED — no data = no optimization)
Option A: Profiling script (recommended)
```bash
# Native CUDA / CUTLASS binaries
bash cuda-auto-tune/scripts/ncu_profile.sh ./kernel report_v1

# Triton / Python
bash cuda-auto-tune/scripts/ncu_profile.sh "python your_kernel.py" report_v1

# CuTe DSL / Python
bash cuda-auto-tune/scripts/ncu_profile.sh "python your_cutedsl_kernel.py" report_v1
```

The script collects `--set full` → exports CSV → runs deep analysis → generates reports.

Option B: Manual profiling
```bash
ncu --set full -o report_v1 --target-processes all ./your_kernel
ncu --import report_v1.ncu-rep --page raw --csv > report_v1.csv
python3 cuda-auto-tune/scripts/ncu_analyse.py report_v1.csv
```

Kernel-name filters (reduce noise)
```bash
# CUTLASS only
ncu --set full -o report_v1 --target-processes all \
    --kernel-name "cutlass_|sm90_|ampere_" ./cutlass_program

# Triton only
ncu --set full -o report_v1 --target-processes all \
    --kernel-name "triton_" "python triton_kernel.py"

# CuTe DSL (kernel name often generic — use --type override in analysis)
python3 cuda-auto-tune/scripts/ncu_analyse.py report_v1.csv --type cutedsl
```

Expected outputs
```
ncu_reports/
├── report_v1.ncu-rep        # Full binary report
├── report_v1.csv            # Raw metrics CSV
├── report_v1_analysis.md    # Deep analysis report
└── report_v1_summary.txt    # Per-kernel summary
```

Step 2: Multi-dimensional analysis
2.1 Identify implementation type
Determine the kernel type from NCU "Function Name" and source context:

| Type | Detection signals |
|---|---|
| Native CUDA | No library prefix; hand-written |
| CUTLASS | `cutlass_`, `sm80_`/`sm90_`, `ampere_`/`hopper_` name prefixes |
| Triton | `triton_` name prefix (`triton_poi_`/`triton_red_`/`triton_per_` for Inductor) |
| CuTe DSL | Generic names from JIT compilation; pass `--type cutedsl` to the analysis script |
| Library | Vendor kernel names (e.g. cuBLAS/cuDNN) |
2.2 Common diagnostics (ALL kernel types — always run)
2.2 通用诊断(所有内核类型 — 必须执行)
| Dimension | Key NCU metrics | Output |
|---|---|---|
| Roofline | SM throughput, memory throughput | compute-bound / memory-bound / latency-bound / balanced |
| Memory hierarchy | L1/L2 hit rate, coalescing ratio, DRAM throughput | cache efficiency + bandwidth sub-bottleneck (DRAM/L2/L1) |
| Warp stalls | PC sampling stall reasons (long_scoreboard, wait, barrier, ...) | top stall reasons with percentages |
| Instruction mix | pipe FMA/ALU/LSU/Tensor utilization | pipeline imbalance, Tensor Core usage |
| Occupancy | active warps %, limiter breakdown (register/smem/warp/block) | limiting factor + register count + smem size |
| Memory hazards | bank conflicts, register spills (local store sectors) | severity and root cause |
| Divergence | avg threads executed vs avg threads active (true) | divergence percentage |
2.3 Type-specific focus
| Type | Key focus areas |
|---|---|
| Native CUDA | launch config (block size, grid), memory access patterns, async copy (cp.async/TMA), Tensor Core opportunity |
| CUTLASS | ThreadblockShape, WarpShape, stages, alignment, schedule policy, epilogue fusion, CTA swizzle |
| Triton | num_warps, num_stages, BLOCK_* tile sizes, stride hints, `tl.dot` Tensor Core usage |
| CuTe DSL | threads per CTA, elements per thread, CopyAtom vector width, smem staging / pipeline depth |
2.4 Bottleneck classification decision tree
SM% > MEM% + 20 → COMPUTE_BOUND
MEM% > SM% + 20 → MEMORY_BOUND
├─ DRAM throughput > 70% → DRAM-Bound (near HBM ceiling)
├─ L2 hit < 50%, DRAM > 40% → DRAM-Bound (L2 miss driven)
├─ L1 hit < 20%, L2 hit >= 50% → L2-Bound
└─ L1 hit < 20% → L1-Bound
SM% < 40 AND MEM% < 40 → LATENCY_BOUND
SM% > 60 AND MEM% > 60 → BALANCED (near peak)
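The decision tree above can be sketched as a small helper. Thresholds are the ones listed; the function name and return strings are illustrative, not part of any NCU API.

```python
def classify_bottleneck(sm_pct, mem_pct, dram_pct=0.0, l1_hit=100.0, l2_hit=100.0):
    """Classify a kernel from NCU throughput percentages (decision tree above)."""
    if sm_pct > mem_pct + 20:
        return "COMPUTE_BOUND"
    if mem_pct > sm_pct + 20:
        # Refine the memory-bound case into a bandwidth sub-bottleneck.
        if dram_pct > 70:
            return "MEMORY_BOUND (DRAM, near HBM ceiling)"
        if l2_hit < 50 and dram_pct > 40:
            return "MEMORY_BOUND (DRAM, L2 miss driven)"
        if l1_hit < 20 and l2_hit >= 50:
            return "MEMORY_BOUND (L2)"
        if l1_hit < 20:
            return "MEMORY_BOUND (L1)"
        return "MEMORY_BOUND"
    if sm_pct < 40 and mem_pct < 40:
        return "LATENCY_BOUND"
    if sm_pct > 60 and mem_pct > 60:
        return "BALANCED"
    return "UNCLASSIFIED"

print(classify_bottleneck(30, 80, dram_pct=75))  # MEMORY_BOUND (DRAM, near HBM ceiling)
```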
2.5 Conclusion template (REQUIRED after every analysis)
=== Conclusion ===
Kernel: {kernel_name}
Type: {Native CUDA | CUTLASS | Triton | CuTe DSL}
Arch: SM_{arch}
Overall: {COMPUTE_BOUND | MEMORY_BOUND | LATENCY_BOUND | BALANCED}
Duration: {duration_us} us
Roofline: SM {sm}%, MEM {mem}%, DRAM {dram}%
Occupancy: {occ}% (theoretical: {theo}%), limited by {limiter}
Regs/Thread: {regs}, Smem/Block: {smem} KB
Findings (sorted by severity):
[CRITICAL] {finding}: {NCU evidence with numbers} -> {specific action}
[WARNING] {finding}: {NCU evidence with numbers} -> {specific action}
[INFO] {finding}: {NCU evidence with numbers}
Optimization priorities:
1. {highest_priority} (expected gain: Nx, evidence: {metric}={value})
2. {second_priority} (expected gain: Nx, evidence: {metric}={value})
3. {third_priority} (expected gain: Nx, evidence: {metric}={value})

Step 3: Apply type-specific playbook
No intuition-only edits. Every change MUST directly address an NCU finding.
Apply ONE change per iteration, then re-profile (Step 4).
3.1 Playbook: Native CUDA
3.1.1 Launch configuration
| NCU finding | Action | Code pattern |
|---|---|---|
| Occupancy < 50%, block size < 128 | Increase block size to 128–256 | `dim3 block(256);` |
| Registers are occupancy limiter | Cap registers via `__launch_bounds__` or `-maxrregcount` | `__global__ void __launch_bounds__(256) k(...)` |
| Grid too small (< SM count) | Ensure enough blocks for full SM coverage | Grid-stride loop with grid >= SM count |
| Occupancy low, blocks limiter | Reduce block size to fit more blocks per SM | Try 128 instead of 256 |
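For the "grid too small" row, the block count implied by a 1D launch can be checked against the SM count. A minimal sketch; the 108-SM figure below is A100's and is only an example.

```python
def blocks_for(n_elements, block_size):
    """Number of thread blocks a 1D launch needs to cover n_elements."""
    return (n_elements + block_size - 1) // block_size

def covers_all_sms(n_elements, block_size, num_sms, waves=1):
    """True if the grid supplies at least `waves` blocks per SM."""
    return blocks_for(n_elements, block_size) >= num_sms * waves

# Example: 1M elements, 256 threads/block on a 108-SM GPU (A100)
print(blocks_for(1_000_000, 256))           # 3907
print(covers_all_sms(1_000_000, 256, 108))  # True
```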
3.1.2 Memory access optimization
| NCU finding | Action | Code pattern |
|---|---|---|
| Load coalescing ratio > 8 | Ensure warp-contiguous addressing, AoS→SoA | `data[base + threadIdx.x]` |
| Store coalescing ratio > 8 | Use shared memory staging for scatter writes | Write to smem first, then coalesced writeback |
| L1 hit rate < 20% | Use shared memory tiling | Tile into shared memory with `__shared__` buffers |
| L2 hit rate < 50% | Use L2 persistence hints (Ampere+) | `cudaStreamSetAttribute` with an `accessPolicyWindow` over the hot range |
| DRAM throughput > 80% | Reduce data movement: mixed precision, compression | FP16/BF16 for bandwidth-bound ops |
| Bank conflicts > 100K | Pad shared memory or swizzle layout | `__shared__ float tile[N][M + 1];` |
| Register spills > 0 | Reduce per-thread state | Simplify accumulators, split into sub-kernels |
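The coalescing ratios in the table are sectors-per-request values. Given the NCU sector and request counts, the ratio and a severity label can be derived as below (thresholds from the table; function names are illustrative).

```python
def coalescing_ratio(sectors, requests):
    """Average 32-byte sectors fetched per memory request (ideal: 4 for 32-bit warp loads)."""
    return sectors / requests if requests else 0.0

def coalescing_severity(ratio):
    # 4 sectors/request = fully coalesced 32-bit accesses; 32 = fully scattered.
    if ratio > 8:
        return "CRITICAL: uncoalesced access, restructure addressing (AoS->SoA)"
    if ratio > 4:
        return "WARNING: partially coalesced"
    return "OK"

print(coalescing_severity(coalescing_ratio(32_000_000, 1_000_000)))  # CRITICAL: ...
```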
3.1.3 Latency hiding and pipelining
| NCU finding | Action | Code pattern |
|---|---|---|
| stall_long_scoreboard > 30% (SM>=80) | Use `cp.async` double buffering | `__pipeline_memcpy_async` / `cuda::memcpy_async` |
| stall_long_scoreboard > 30% (SM>=90) | Use TMA for bulk async transfers | Bulk tensor copies (`cp.async.bulk.tensor`) |
| stall_barrier > 25% | Reduce sync frequency, use warp primitives | `__shfl_down_sync` instead of smem + `__syncthreads` |
| stall_wait > 30%, long_scoreboard < 15% | Pipeline over-buffered, reduce depth | Remove one buffer stage |
| stall_math_pipe_throttle > 20% | Compute saturated (positive signal) | Consider Tensor Core or reduce FLOPs |
3.1.4 Tensor Core utilization
| NCU finding | Action |
|---|---|
| pipe_tensor < 5%, FP16/BF16 workload with GEMM-like pattern | Use WMMA (`nvcuda::wmma`) or a CUTLASS template |
| pipe_tensor < 5%, but data is FP32 | Use the TF32 Tensor Core path (SM>=80) |
| pipe_fma_fp16 > 10%, pipe_tensor < 5% | Switch from scalar FP16 FMA to Tensor Core path |
3.1.5 Vectorized memory access
```cuda
// NCU evidence: coalescing ratio > 4 for 32-bit loads
// Before: scalar loads
float val = input[idx];

// After: vectorized 128-bit load (4x float)
float4 val = reinterpret_cast<const float4*>(input)[idx / 4];
```

3.2 Playbook: CUTLASS
3.2.1 Kernel config parsing
CUTLASS kernel names encode configuration. Extract:
- Architecture: `sm80_`, `sm90_`, `ampere_`, `hopper_`
- Compute type: `tensorop` vs `simt`
- Tile shape: `128x128x32`, `256x128x64`
- Pipeline stages: trailing `x3`, `x5`
- Alignment: `align8`
- Schedule (3.x): `WarpSpecialized`, `WarpSpecializedCooperative`, `WarpSpecializedPingpong`
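A rough best-effort parser for the fields above. The sample name is made up for illustration; real CUTLASS names vary between 2.x and 3.x, so treat this as a sketch.

```python
import re

def parse_cutlass_name(name):
    """Best-effort extraction of config fields encoded in a CUTLASS kernel name."""
    fields = {}
    m = re.search(r'(sm\d+|ampere|hopper)_', name)
    if m:
        fields['arch'] = m.group(1)
    fields['op_class'] = 'tensorop' if 'tensorop' in name else ('simt' if 'simt' in name else None)
    m = re.search(r'(\d+x\d+)_(\d+)x(\d+)', name)  # tile MxN, then K x stages
    if m:
        fields['tile'] = f"{m.group(1)}x{m.group(2)}"
        fields['stages'] = int(m.group(3))
    m = re.search(r'align(\d+)', name)
    if m:
        fields['align'] = int(m.group(1))
    return fields

# Hypothetical 2.x-style name
info = parse_cutlass_name("cutlass_sm80_tensorop_f16_s16816gemm_128x128_32x3_align8")
print(info)
```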
3.2.2 Tile shape and occupancy
| NCU finding | Action |
|---|---|
| Occupancy < 40%, smem is limiter | Reduce ThreadblockShape (e.g., 256x128→128x128) or reduce stages |
| Occupancy < 40%, registers are limiter | Use smaller WarpShape (e.g., 64x64→32x32) to reduce per-thread regs |
| SM throughput < 30%, grid is small | Increase ThreadblockShape to process more elements per CTA |
| SM throughput > 80%, MEM < 40% | Already compute-bound; increase pipeline stages for more overlap |
3.2.3 Pipeline stages
| NCU finding | Action |
|---|---|
| stall_long_scoreboard > 30% | Increase stages (Ampere: 3→5, Hopper: 2→3) |
| stall_wait > 30%, long_scoreboard < 15% | Pipeline over-buffered; reduce stages to save smem |
| Smem limiter + stages > 3 | Reduce stages to free smem for higher occupancy |
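Whether a stage count fits in shared memory can be estimated from the threadblock tile: each stage stages one A tile (BM×BK) and one B tile (BK×BN). The 164 KB budget below is roughly Ampere's per-SM smem; the numbers are illustrative and ignore epilogue smem.

```python
def stage_smem_bytes(bm, bn, bk, dtype_bytes=2):
    """Smem for one pipeline stage: an A tile (BM x BK) plus a B tile (BK x BN)."""
    return (bm * bk + bk * bn) * dtype_bytes

def max_stages(bm, bn, bk, smem_budget_bytes, dtype_bytes=2):
    """Largest stage count whose staging buffers fit the smem budget."""
    per_stage = stage_smem_bytes(bm, bn, bk, dtype_bytes)
    return smem_budget_bytes // per_stage

# 128x128x32 FP16 tile against a ~164 KB (Ampere-class) budget
print(stage_smem_bytes(128, 128, 32))         # 16384 bytes per stage
print(max_stages(128, 128, 32, 164 * 1024))   # 10
```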
3.2.4 Alignment and vectorization
| NCU finding | Action |
|---|---|
| Load coalescing > 4, alignment < 8 | Increase CUTLASS alignment to 8 (128-bit vectors); pad matrix leading dims to multiples of the alignment |
| SIMT path used but data supports TensorOp | Switch to `arch::OpClassTensorOp` |
| TensorOp configured but pipe_tensor < 5% | Check alignment requirements — the leading dimension must be a multiple of InstructionShape::kK |
3.2.5 Schedule and architecture
| NCU finding | Action |
|---|---|
| CUTLASS 2.x on SM>=90 | Upgrade to CUTLASS 3.x with WarpSpecialized + TMA (1.2–1.5x gain) |
| L2 hit rate < 50% on large GEMM | Add ThreadblockSwizzle (2.x: swizzle template argument; 3.x: tile scheduler) to improve L2 locality |
| stall_long_scoreboard > 30% on Hopper | Switch to the `WarpSpecializedPingpong` schedule |
3.2.6 Epilogue fusion
| NCU finding | Action |
|---|---|
| Multiple CUTLASS kernels back-to-back (e.g., GEMM + bias + activation) | Fuse into single kernel via CUTLASS epilogue visitor tree |
| High DRAM traffic (read+write GB > expected) | Move post-GEMM ops into epilogue to eliminate intermediate tensors |
3.3 Playbook: Triton
3.3.1 Kernel classification
Triton kernel subtypes (from the kernel name):
- `triton_poi_`: Inductor pointwise (auto-generated)
- `triton_red_`: Inductor reduction (auto-generated)
- `triton_per_`: Inductor persistent reduction (auto-generated)
- Custom `@triton.jit`: hand-written kernel (fully tunable)

Inductor-generated kernels: optimize at the PyTorch level (`torch._inductor.config`), or rewrite as a custom `@triton.jit` kernel if this is a hot path.

3.3.2 num_warps tuning
| NCU finding | Action |
|---|---|
| Registers >= 128, num_warps >= 8 | CRITICAL: reduce num_warps (try 4 or 2) |
| Registers >= 64, num_warps >= 8 | Reduce num_warps to 4 |
| Occupancy < 40%, register-limited | Reduce num_warps AND/OR reduce BLOCK_* tile sizes |
| SM throughput < 30%, few warps | Increase num_warps to improve latency hiding |
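The register rows above come down to register-file arithmetic: an SM register file holds 64K 32-bit registers on recent architectures, and registers/thread × threads/CTA bounds how many CTAs can be resident. A sketch (the 65536 figure is an assumption typical of recent SMs):

```python
def ctas_per_sm_by_regs(regs_per_thread, num_warps, regfile=65536):
    """How many Triton CTAs fit per SM given register usage (32 threads/warp)."""
    threads = num_warps * 32
    regs_per_cta = regs_per_thread * threads
    return regfile // regs_per_cta

# 128 regs/thread, num_warps=8: 2 CTAs x 8 warps = 16 resident warps
print(ctas_per_sm_by_regs(128, 8))  # 2
# num_warps=4: 4 CTAs x 4 warps = same 16 warps, but smaller CTAs reduce
# allocation granularity and barrier scope
print(ctas_per_sm_by_regs(128, 4))  # 4
```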
3.3.3 num_stages tuning
| NCU finding | Action |
|---|---|
| stall_long_scoreboard > 30% | Increase num_stages (2→3→4 on Ampere, 2→3 on Hopper) |
| stall_wait > 30%, long_scoreboard < 15% | Decrease num_stages (over-buffered) or increase tile work |
| Smem is occupancy limiter | Decrease num_stages (each extra stage adds another smem tile buffer) |
| On Hopper + long_scoreboard high | Also consider TMA-backed loads, if your Triton version supports them |
3.3.4 BLOCK_* tile size tuning
| NCU finding | Action |
|---|---|
| Register pressure high | Reduce BLOCK_M, BLOCK_N, or BLOCK_K |
| SM throughput low, compute-bound opportunity | Increase BLOCK_M/BLOCK_N for more compute per tile |
| DRAM bandwidth near ceiling | Increase BLOCK_K for more data reuse before writeback |
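Tile-size choices can be reasoned about via arithmetic intensity: a BM×BN×BK GEMM tile performs 2·BM·BN·BK FLOPs while loading (BM·BK + BK·BN) elements. In this simple model intensity grows with BLOCK_M/BLOCK_N (BK cancels); larger BLOCK_K instead amortizes accumulator writeback and loop overhead. A sketch:

```python
def tile_arithmetic_intensity(bm, bn, bk, dtype_bytes=2):
    """FLOPs per byte of global load traffic for one BM x BN x BK GEMM tile."""
    flops = 2 * bm * bn * bk
    bytes_moved = (bm * bk + bk * bn) * dtype_bytes
    return flops / bytes_moved

print(tile_arithmetic_intensity(64, 64, 32))    # 32.0 FLOPs/byte
print(tile_arithmetic_intensity(128, 128, 32))  # 64.0 -> larger M/N tiles reuse more data
```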
3.3.5 Memory access optimization
| NCU finding | Action | Code pattern |
|---|---|---|
| Load coalescing > 8 | Add stride hints | `tl.multiple_of` / `tl.max_contiguous` on offsets |
| Uncoalesced on transposed input | Use structured pointers | `tl.make_block_ptr` with the actual strides |
| L1 hit rate low | Verify access pattern continuity | Ensure innermost dim stride == 1 |
3.3.6 Tensor Core utilization
| NCU finding | Action |
|---|---|
| pipe_tensor < 5%, kernel uses `tl.dot` | 1) enable TF32 for fp32 inputs; 2) cast inputs to fp16/bf16 where accuracy allows; 3) keep M/N/K block dims >= 16 |
| pipe_tensor < 5%, no `tl.dot` | GEMM-like pattern missing — restructure to use `tl.dot` where possible |
3.3.7 Triton autotune integration
```python
@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 32}, num_warps=4, num_stages=3),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64, 'BLOCK_K': 64}, num_warps=4, num_stages=4),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64, 'BLOCK_K': 32}, num_warps=8, num_stages=3),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def kernel(...):
    ...
```

When NCU reveals the bottleneck, narrow the autotune configs to the promising region instead of searching blindly.
3.4 Playbook: CuTe DSL
3.4.1 Key tuning parameters
| Parameter | Effect | Typical range |
|---|---|---|
| Threads per CTA | Warps per CTA; affects occupancy, barrier cost, reduce cost | 128–512 |
| Elements per thread | Affects register pressure, data reuse | 4–32 |
| CopyAtom width (bits) | Affects vectorized load/store bandwidth | 32, 64, 128 |
| Smem staging buffer | Pipeline depth × tile size; affects smem footprint | Minimize for occupancy |
3.4.2 Occupancy optimization
| NCU finding | Action |
|---|---|
| Occupancy < 40%, registers are limiter | Reduce elements per thread (smaller per-thread tile) |
| Occupancy < 40%, smem is limiter | Reduce the smem staging buffer (fewer pipeline stages or smaller tiles) |
| Registers >= 128, warps >= 8 | CRITICAL: reduce threads per CTA or elements per thread |
3.4.3 Memory access (TiledCopy)
| NCU finding | Action |
|---|---|
| Load coalescing > 8 | 1) Increase CopyAtom width (32→128-bit); 2) make the copy layout warp-contiguous |
| stall_long_scoreboard > 30% | 1) Increase pipeline depth (more smem staging); 2) use async copies (cp.async/TMA) |
| stall_wait > 30%, long_scoreboard < 15% | Pipeline over-buffered; increase per-stage work or reduce depth |
3.4.4 Synchronization and reduction
| NCU finding | Action |
|---|---|
| stall_barrier > 25% | 1) Reduce threads per CTA; 2) replace smem reductions with warp-shuffle reductions |
| High barrier + small reduction | Use warp-only reduce without smem for small element counts |
| Multiple sync_threads per iteration | Minimize sync points; use async pipeline commit/wait patterns |
3.4.5 Thread divergence
| NCU finding | Action |
|---|---|
| Divergence > 20% | Adjust the tiling so all threads in a warp do uniform work |
| Predicated copies show high divergence | Ensure N is divisible by the copy vector width (pad if necessary) |
3.4.6 Compute optimization
| NCU finding | Action |
|---|---|
| pipe_tensor < 5%, FP16 GEMM-like ops | Use CuTe MMA atoms (Tensor Core path) instead of scalar FMA |
| pipe_fma high but pipe_tensor low (non-GEMM ops like RMSNorm/LayerNorm) | Tensor Core not applicable for reductions — focus on memory and barrier optimization instead |
3.4.7 Cache invalidation for re-profiling
CuTe DSL compiles Python to CUDA via JIT. After code changes:

```bash
# Clear compilation cache to ensure re-compilation
rm -rf __pycache__/ .cache/ /tmp/cutlass_cute_cache/

# Then re-profile
bash cuda-auto-tune/scripts/ncu_profile.sh "python your_cutedsl_kernel.py" report_v2
```

---

Step 4: Re-profile and verify (REQUIRED after every change)
4.1 Re-profile
```bash
# Clear JIT caches first
rm -rf ~/.triton/cache       # Triton
rm -rf __pycache__/ .cache/  # CuTe DSL

# Profile the updated version
bash cuda-auto-tune/scripts/ncu_profile.sh ./kernel_v2 report_v2
# or
bash cuda-auto-tune/scripts/ncu_profile.sh "python kernel_v2.py" report_v2
```

4.2 Compare against baseline
```bash
python3 cuda-auto-tune/scripts/ncu_analyse.py ncu_reports/report_v2.csv --diff ncu_reports/report_v1.csv
```

4.3 Verification checklist
| Check | Criteria |
|---|---|
| Duration improved? | Kernel duration decreased vs the baseline report |
| Target bottleneck improved? | The specific metric that triggered the change improved |
| No new bottlenecks? | No new CRITICAL findings in the diff report |
| At hardware ceiling? | SM throughput > 80% or DRAM throughput > 85% means near peak |
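The checklist comparison can be scripted over two per-run metric dicts (as parsed from the CSVs); the metric names here are illustrative.

```python
def diff_metrics(baseline, candidate):
    """Percentage change per metric between two profiling runs (negative = decrease)."""
    out = {}
    for key, before in baseline.items():
        after = candidate.get(key)
        if after is None or before == 0:
            continue
        out[key] = 100.0 * (after - before) / before
    return out

v1 = {"duration_us": 120.0, "sm_pct": 35.0, "l2_hit_pct": 48.0}
v2 = {"duration_us": 90.0, "sm_pct": 47.0, "l2_hit_pct": 61.0}
delta = diff_metrics(v1, v2)
print(f"duration: {delta['duration_us']:+.1f}%")  # duration: -25.0%
```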
4.4 Iteration log template
Track each iteration for accountability:
=== Iteration {N} ===
Change: {what was changed and why}
NCU evidence: {metric}={before_value} -> {finding}
Report: report_v{N}.csv
Result:
Duration: {before} us -> {after} us ({delta}%)
Target metric: {metric}={before} -> {after}
New findings: {any new issues introduced}
Decision: {CONTINUE to next bottleneck | STOP — at ceiling | ROLLBACK — regression}
Quick reference: high-signal NCU metrics
Metric names below follow recent Nsight Compute versions; verify exact names on your installation with `ncu --query-metrics`.

| Metric | NCU key |
|---|---|
| Duration | `gpu__time_duration.sum` |
| SM throughput | `sm__throughput.avg.pct_of_peak_sustained_elapsed` |
| Memory throughput | `gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed` |
| DRAM throughput | `gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed` |
| L1 hit rate | `l1tex__t_sector_hit_rate.pct` |
| L2 hit rate | `lts__t_sector_hit_rate.pct` |
| Load coalescing | `l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio` |
| Bank conflicts | `l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum` (and `_op_st`) |
| Register spills | `l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum` (local store sectors) |
| Occupancy | `sm__warps_active.avg.pct_of_peak_sustained_active` |
| Warp eligibility | `smsp__warps_eligible.avg.per_cycle_active` |
| Registers/thread | `launch__registers_per_thread` |
| Smem/block | `launch__shared_mem_per_block_static_size` + `launch__shared_mem_per_block_dynamic_size` |
Summary
This skill enforces a strict profile → analyze → change → verify loop.
No NCU data = no optimization. No metric evidence = no code change.
Each kernel type (Native CUDA / CUTLASS / Triton / CuTe DSL) has a dedicated playbook
with NCU-metric-to-action mappings. Every change is tracked and verified by re-profiling.