cuda-auto-tune

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

NCU-driven iterative kernel optimization (CUDA / CUTLASS / Triton / CuTe DSL)

GATE CHECK (enforce before any optimization)

STOP — Do you have NCU profile data for this kernel?
  NO  → Go to Step 1. Do NOT touch any kernel code.
  YES → Go to Step 2.

Hard rules — violating any rule invalidates the entire optimization:
  • NEVER change kernel code, launch config, or template parameters without NCU data.
  • ALL recommendations MUST cite specific NCU metric values as evidence.
  • Each iteration MUST cover at minimum: roofline, memory hierarchy, warp stalls, occupancy.
  • The optimization playbook MUST match the kernel implementation type.
  • After EVERY code change, re-profile and compare with `--diff`.
  • Stop iterating when improvements plateau or metrics approach the hardware ceiling.

Mandatory optimization loop

┌─────────────────────────────────────────────────────────────────────┐
│  Step 1: Profile (NCU --set full)                                   │
│      ↓                                                              │
│  Step 2: Multi-dimensional analysis + identify kernel type          │
│      ↓                                                              │
│  Step 3: Apply type-specific playbook (one change per iteration)    │
│      ↓                                                              │
│  Step 4: Re-profile + diff → improved? → loop or stop               │
│      ↑                                           │                  │
│      └───────────────────────────────────────────┘                  │
└─────────────────────────────────────────────────────────────────────┘

Step 1: Profile with NCU (REQUIRED — no data = no optimization)

Option A: Profiling script (recommended)

```bash
# Native CUDA / CUTLASS binaries
bash cuda-auto-tune/scripts/ncu_profile.sh ./kernel report_v1

# Triton / Python
bash cuda-auto-tune/scripts/ncu_profile.sh "python your_kernel.py" report_v1

# CuTe DSL / Python
bash cuda-auto-tune/scripts/ncu_profile.sh "python your_cutedsl_kernel.py" report_v1
```

The script collects `--set full` → exports CSV → runs deep analysis → generates reports.

Option B: Manual profiling

```bash
ncu --set full -o report_v1 --target-processes all ./your_kernel
ncu --import report_v1.ncu-rep --page raw --csv > report_v1.csv
python3 cuda-auto-tune/scripts/ncu_analyse.py report_v1.csv
```

Kernel-name filters (reduce noise)

```bash
# CUTLASS only
ncu --set full -o report_v1 --target-processes all \
    --kernel-name "cutlass_|sm90_|ampere_" ./cutlass_program

# Triton only
ncu --set full -o report_v1 --target-processes all \
    --kernel-name "triton_" "python triton_kernel.py"

# CuTe DSL (kernel name often generic — use --type override in analysis)
python3 cuda-auto-tune/scripts/ncu_analyse.py report_v1.csv --type cutedsl
```

Expected outputs

ncu_reports/
├── report_v1.ncu-rep           # Full binary report
├── report_v1.csv               # Raw metrics CSV
├── report_v1_analysis.md       # Deep analysis report
└── report_v1_summary.txt       # Per-kernel summary

Step 2: Multi-dimensional analysis

2.1 Identify implementation type

Determine the kernel type from the NCU "Function Name" and source context:

| Type | Detection signals |
|---|---|
| Native CUDA | No library prefix; hand-written `__global__` functions |
| CUTLASS | `cutlass_` prefix, `smXX_xmma_`, contains `tensorop` or `cutlass` |
| Triton | `triton_` prefix, contains `triton`, encoded suffixes (e.g. `_0d1d...e`) |
| CuTe DSL | Generic names from `@cute.kernel`; confirm via source imports (`cutlass.cute`, `cute.compile`) or `--type cutedsl` |
| Library | `cublas*`, `cudnn*` — baseline/reference only, not optimizable |

2.2 Common diagnostics (ALL kernel types — always run)

| Dimension | Key NCU metrics | Output |
|---|---|---|
| Roofline | SM throughput, memory throughput | compute-bound / memory-bound / latency-bound / balanced |
| Memory hierarchy | L1/L2 hit rate, coalescing ratio, DRAM throughput | cache efficiency + bandwidth sub-bottleneck (DRAM/L2/L1) |
| Warp stalls | PC sampling stall reasons (long_scoreboard, wait, barrier, ...) | top stall reasons with percentages |
| Instruction mix | pipe FMA/ALU/LSU/Tensor utilization | pipeline imbalance, Tensor Core usage |
| Occupancy | active warps %, limiter breakdown (register/smem/warp/block) | limiting factor + register count + smem size |
| Memory hazards | bank conflicts, register spills (local store sectors) | severity and root cause |
| Divergence | avg threads executed vs avg threads active (true) | divergence percentage |

2.3 Type-specific focus

| Type | Key focus areas |
|---|---|
| Native CUDA | launch config (block size, grid), memory access patterns, async copy (cp.async/TMA), Tensor Core opportunity |
| CUTLASS | ThreadblockShape, WarpShape, stages, alignment, schedule policy, epilogue fusion, CTA swizzle |
| Triton | `num_warps`, `num_stages`, `BLOCK_*` sizes, compiler hints (`tl.multiple_of`, `tl.max_contiguous`), `tl.dot` config |
| CuTe DSL | `threads_per_cta`, `elems_per_thread`, CopyAtom (`num_bits_per_copy`), `tiled_copy` layout, smem staging, `cta_reduce` pattern |

2.4 Bottleneck classification decision tree

SM% > MEM% + 20  →  COMPUTE_BOUND
MEM% > SM% + 20  →  MEMORY_BOUND
  ├─ DRAM throughput > 70%        → DRAM-Bound (near HBM ceiling)
  ├─ L2 hit < 50%, DRAM > 40%    → DRAM-Bound (L2 miss driven)
  ├─ L1 hit < 20%, L2 hit >= 50% → L2-Bound
  └─ L1 hit < 20%                → L1-Bound
SM% < 40 AND MEM% < 40           →  LATENCY_BOUND
SM% > 60 AND MEM% > 60           →  BALANCED (near peak)
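
The decision tree above can be expressed as a small helper for post-processing NCU CSV data — a sketch, with thresholds copied verbatim from the tree and branches evaluated in the listed order:

```python
# Sketch of the 2.4 decision tree. Inputs are roofline percentages from
# NCU; branches are checked in the order the tree lists them.
def classify_bottleneck(sm: float, mem: float, dram: float,
                        l1_hit: float, l2_hit: float) -> str:
    if sm > mem + 20:
        return "COMPUTE_BOUND"
    if mem > sm + 20:
        if dram > 70:
            return "MEMORY_BOUND: DRAM-Bound (near HBM ceiling)"
        if l2_hit < 50 and dram > 40:
            return "MEMORY_BOUND: DRAM-Bound (L2 miss driven)"
        if l1_hit < 20 and l2_hit >= 50:
            return "MEMORY_BOUND: L2-Bound"
        if l1_hit < 20:
            return "MEMORY_BOUND: L1-Bound"
        return "MEMORY_BOUND"
    if sm < 40 and mem < 40:
        return "LATENCY_BOUND"
    if sm > 60 and mem > 60:
        return "BALANCED (near peak)"
    return "UNCLASSIFIED (mixed)"
```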

2.5 Conclusion template (REQUIRED after every analysis)

=== Conclusion ===
Kernel:    {kernel_name}
Type:      {Native CUDA | CUTLASS | Triton | CuTe DSL}
Arch:      SM_{arch}
Overall:   {COMPUTE_BOUND | MEMORY_BOUND | LATENCY_BOUND | BALANCED}
Duration:  {duration_us} us
Roofline:  SM {sm}%, MEM {mem}%, DRAM {dram}%
Occupancy: {occ}% (theoretical: {theo}%), limited by {limiter}
Regs/Thread: {regs}, Smem/Block: {smem} KB

Findings (sorted by severity):
  [CRITICAL] {finding}: {NCU evidence with numbers} -> {specific action}
  [WARNING]  {finding}: {NCU evidence with numbers} -> {specific action}
  [INFO]     {finding}: {NCU evidence with numbers}

Optimization priorities:
  1. {highest_priority} (expected gain: Nx, evidence: {metric}={value})
  2. {second_priority}  (expected gain: Nx, evidence: {metric}={value})
  3. {third_priority}   (expected gain: Nx, evidence: {metric}={value})

Step 3: Apply type-specific playbook

No intuition-only edits. Every change MUST directly address an NCU finding. Apply ONE change per iteration, then re-profile (Step 4).

3.1 Playbook: Native CUDA

3.1.1 Launch configuration

| NCU finding | Action | Code pattern |
|---|---|---|
| Occupancy < 50%, block size < 128 | Increase block size to 128–256 | `kernel<<<grid, 256>>>` |
| Registers are occupancy limiter | Cap registers via `__launch_bounds__` | `__global__ void __launch_bounds__(256, 2) kernel()` |
| Grid too small (< SM count) | Ensure enough blocks for full SM coverage | `grid = (N + block - 1) / block` with sufficient N |
| Occupancy low, blocks limiter | Reduce block size to fit more blocks per SM | Try 128 instead of 256 |
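
The occupancy reasoning behind these rules can be sanity-checked offline. A sketch — the per-SM limits below are assumed A100-class values (substitute your GPU's limits), and register allocation granularity is ignored:

```python
# Rough occupancy-limiter estimate for a launch configuration.
# SM_LIMITS is an assumption for an A100-class GPU (SM80).
SM_LIMITS = {"regs": 65536, "smem_bytes": 164 * 1024, "threads": 2048, "blocks": 32}

def occupancy_limiter(block_size: int, regs_per_thread: int, smem_per_block: int):
    by_regs = SM_LIMITS["regs"] // (regs_per_thread * block_size)
    by_smem = SM_LIMITS["smem_bytes"] // smem_per_block if smem_per_block else SM_LIMITS["blocks"]
    by_threads = SM_LIMITS["threads"] // block_size
    candidates = [("registers", by_regs), ("smem", by_smem),
                  ("threads", by_threads), ("blocks", SM_LIMITS["blocks"])]
    limiter, blocks = min(candidates, key=lambda kv: kv[1])
    occupancy_pct = 100.0 * blocks * block_size / SM_LIMITS["threads"]
    return limiter, occupancy_pct
```

For example, 256 threads/block at 128 registers/thread caps the SM at 2 blocks, i.e. 25% theoretical occupancy — the "registers are occupancy limiter" row above.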

3.1.2 Memory access optimization

| NCU finding | Action | Code pattern |
|---|---|---|
| Load coalescing ratio > 8 | Ensure warp-contiguous addressing, AoS→SoA | `data[threadIdx.x + blockIdx.x * blockDim.x]` |
| Store coalescing ratio > 8 | Use shared memory staging for scatter writes | Write to smem first, then coalesced writeback |
| L1 hit rate < 20% | Use `__shared__` for frequently reused data | Tile into shared memory with `__syncthreads()` |
| L2 hit rate < 50% | Use L2 persistence hints (Ampere+) | `cudaAccessPolicyWindow` for hot data ranges |
| DRAM throughput > 80% | Reduce data movement: mixed precision, compression | `half` / `__nv_bfloat16` for bandwidth-sensitive ops |
| Bank conflicts > 100K | Pad shared memory or swizzle layout | `__shared__ float smem[32][33];` (pad +1) |
| Register spills > 0 | Reduce per-thread state, use `__launch_bounds__` | Simplify accumulators, split into sub-kernels |

3.1.3 Latency hiding and pipelining

| NCU finding | Action | Code pattern |
|---|---|---|
| stall_long_scoreboard > 30% (SM>=80) | Use `cp.async` + double buffering | `__pipeline_memcpy_async(&smem, &gmem, size)` |
| stall_long_scoreboard > 30% (SM>=90) | Use TMA for bulk async transfers | `cute::copy(tma_load, ...)` or CuTe TMA atoms |
| stall_barrier > 25% | Reduce sync frequency, use warp primitives | `__shfl_sync()`, `cooperative_groups` |
| stall_wait > 30%, long_scoreboard < 15% | Pipeline over-buffered, reduce depth | Remove one buffer stage |
| stall_math_pipe_throttle > 20% | Compute saturated (positive signal) | Consider Tensor Core or reduce FLOPs |

3.1.4 Tensor Core utilization

| NCU finding | Action |
|---|---|
| pipe_tensor < 5%, FP16/BF16 workload with GEMM-like pattern | Use WMMA (`wmma::mma_sync`) or inline PTX (`mma.sync`) |
| pipe_tensor < 5%, but data is FP32 | Use TF32 path via `wmma::mma_sync` with `nvcuda::wmma::precision::tf32` |
| pipe_fma_fp16 > 10%, pipe_tensor < 5% | Switch from scalar FP16 FMA to Tensor Core path |

3.1.5 Vectorized memory access

```cuda
// NCU evidence: coalescing ratio > 4 for 32-bit loads
// Before: scalar loads
float val = input[idx];

// After: vectorized 128-bit load (4x float)
float4 val = reinterpret_cast<const float4*>(input)[idx / 4];
```

3.2 Playbook: CUTLASS

3.2.1 Kernel config parsing

CUTLASS kernel names encode configuration. Extract:
  • Architecture: `sm80_`, `sm90_`, `ampere_`, `hopper_`
  • Compute type: `tensorop` vs `simt`
  • Tile shape: `128x128x32`, `256x128x64`
  • Pipeline stages: trailing `x3`, `x5`
  • Alignment: `align8`
  • Schedule (3.x): `WarpSpecialized`, `WarpSpecializedCooperative`, `WarpSpecializedPingpong`
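
This extraction can be automated. A sketch — the field regexes and the sample kernel name below are illustrative only; CUTLASS naming varies across versions, so treat any unmatched field as unknown:

```python
import re

# Illustrative field patterns for the naming conventions listed above.
# A None field means "could not be determined from the name".
FIELD_PATTERNS = {
    "arch":    r"(sm\d+|ampere|hopper)",
    "compute": r"(tensorop|simt)",
    "tile":    r"(\d+x\d+x\d+)",
    "stages":  r"x(\d)(?:_|$)",      # trailing stage count, e.g. "...x3"
    "align":   r"align(\d+)",
}

def parse_cutlass_name(name: str) -> dict:
    """Best-effort parse of an NCU-reported CUTLASS kernel name."""
    return {field: (m.group(1) if (m := re.search(pat, name)) else None)
            for field, pat in FIELD_PATTERNS.items()}

# Hypothetical kernel name, shaped after the conventions above:
cfg = parse_cutlass_name("cutlass_sm80_tensorop_s16816gemm_128x128x32_stages_x3_align8")
```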

3.2.2 Tile shape and occupancy

| NCU finding | Action |
|---|---|
| Occupancy < 40%, smem is limiter | Reduce ThreadblockShape (e.g., 256x128→128x128) or reduce stages |
| Occupancy < 40%, registers are limiter | Use smaller WarpShape (e.g., 64x64→32x32) to reduce per-thread regs |
| SM throughput < 30%, grid is small | Increase ThreadblockShape to process more elements per CTA |
| SM throughput > 80%, MEM < 40% | Already compute-bound; increase pipeline stages for more overlap |

3.2.3 Pipeline stages

| NCU finding | Action |
|---|---|
| stall_long_scoreboard > 30% | Increase stages (Ampere: 3→5, Hopper: 2→3) |
| stall_wait > 30%, long_scoreboard < 15% | Pipeline over-buffered; reduce stages to save smem |
| Smem limiter + stages > 3 | Reduce stages to free smem for higher occupancy |

3.2.4 Alignment and vectorization

| NCU finding | Action |
|---|---|
| Load coalescing > 4, alignment < 8 | Increase CUTLASS alignment to 8 elements (128-bit vector access for fp16); pad matrix leading dimensions to multiples of the alignment |
| SIMT path used but data supports TensorOp | Switch to a `tensorop` CUTLASS configuration (2–8x speedup) |
| TensorOp configured but pipe_tensor < 5% | Check alignment requirements — the leading dimension must be a multiple of InstructionShape::kK |

3.2.5 Schedule and architecture

| NCU finding | Action |
|---|---|
| CUTLASS 2.x on SM>=90 | Upgrade to CUTLASS 3.x with WarpSpecialized + TMA (1.2–1.5x gain) |
| L2 hit rate < 50% on large GEMM | Add ThreadblockSwizzle (2.x: `GemmIdentityThreadblockSwizzle<N>`, 3.x: `StreamK` or tile swizzle) |
| stall_long_scoreboard > 30% on Hopper | Switch to `WarpSpecializedCooperative` schedule with TMA loads |

3.2.6 Epilogue fusion

| NCU finding | Action |
|---|---|
| Multiple CUTLASS kernels back-to-back (e.g., GEMM + bias + activation) | Fuse into a single kernel via the CUTLASS epilogue visitor tree |
| High DRAM traffic (read+write GB > expected) | Move post-GEMM ops into the epilogue to eliminate intermediate tensors |

3.3 Playbook: Triton

3.3.1 Kernel classification

Triton kernel subtypes (from kernel name):
  • `triton_poi_`: Inductor pointwise (auto-generated)
  • `triton_red_`: Inductor reduction (auto-generated)
  • `triton_per_`: Inductor persistent reduction (auto-generated)
  • Custom `@triton.jit`: hand-written kernel (fully tunable)

Inductor-generated kernels: optimize at the PyTorch level (`torch._inductor.config`), or rewrite as a custom `@triton.jit` kernel if this is a hot path.
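
A minimal classifier for these subtypes, useful when triaging an NCU report programmatically (a sketch; the subtype labels are our own):

```python
# Map NCU kernel names to the Triton subtypes listed above, so the right
# tuning strategy (PyTorch-level vs. direct @triton.jit tuning) is chosen.
PREFIXES = {
    "triton_poi_": "inductor-pointwise",
    "triton_red_": "inductor-reduction",
    "triton_per_": "inductor-persistent-reduction",
}

def classify_triton_kernel(name: str) -> str:
    for prefix, subtype in PREFIXES.items():
        if name.startswith(prefix):
            return subtype
    return "custom-jit"  # hand-written @triton.jit — fully tunable
```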

3.3.2 num_warps tuning

| NCU finding | Action |
|---|---|
| Registers >= 128, num_warps >= 8 | CRITICAL: reduce num_warps (try 4 or 2) |
| Registers >= 64, num_warps >= 8 | Reduce num_warps to 4 |
| Occupancy < 40%, register-limited | Reduce num_warps AND/OR reduce BLOCK_* tile sizes |
| SM throughput < 30%, few warps | Increase num_warps to improve latency hiding |
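
These rules can be folded into a next-step suggestion, e.g. for seeding an autotune loop — a sketch, with thresholds mirroring the table; they are heuristics, not guarantees:

```python
# Suggest the next num_warps to try, given NCU evidence for the current
# launch. Returns the current value when no rule in the table fires.
def suggest_num_warps(registers_per_thread: int, num_warps: int,
                      occupancy_pct: float, sm_throughput_pct: float) -> int:
    if registers_per_thread >= 128 and num_warps >= 8:
        return 4                      # CRITICAL register pressure (then try 2)
    if registers_per_thread >= 64 and num_warps >= 8:
        return 4
    if occupancy_pct < 40 and registers_per_thread >= 64:
        return max(num_warps // 2, 1) # register-limited occupancy
    if sm_throughput_pct < 30 and num_warps <= 2:
        return num_warps * 2          # too few warps to hide latency
    return num_warps
```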

3.3.3 num_stages tuning

| NCU finding | Action |
|---|---|
| stall_long_scoreboard > 30% | Increase num_stages (2→3→4 on Ampere, 2→3 on Hopper) |
| stall_wait > 30%, long_scoreboard < 15% | Decrease num_stages (over-buffered) or increase tile work |
| Smem is occupancy limiter | Decrease num_stages (staging smem grows linearly with the stage count) |
| On Hopper + long_scoreboard high | Also consider `tl.make_block_ptr()` for TMA-based loads |

3.3.4 BLOCK_* tile size tuning

| NCU finding | Action |
|---|---|
| Register pressure high | Reduce BLOCK_M, BLOCK_N, or BLOCK_K |
| SM throughput low, compute-bound opportunity | Increase BLOCK_M/BLOCK_N for more compute per tile |
| DRAM bandwidth near ceiling | Increase BLOCK_K for more data reuse before writeback |

3.3.5 Memory access optimization

| NCU finding | Action | Code pattern |
|---|---|---|
| Load coalescing > 8 | Add stride hints | `tl.multiple_of(stride, 16)` and `tl.max_contiguous(offsets, BLOCK)` |
| Uncoalesced on transposed input | Use structured pointers | `tl.make_block_ptr(base, shape, strides, offsets, block_shape, order)` |
| L1 hit rate low | Verify access pattern contiguity | Ensure innermost dim stride == 1 |

3.3.6 Tensor Core utilization

| NCU finding | Action |
|---|---|
| pipe_tensor < 5%, kernel uses `tl.dot` | 1) `allow_tf32=True` for fp32; 2) BLOCK_K multiple of 16; 3) check dtypes are fp16/bf16/tf32/fp8 |
| pipe_tensor < 5%, no `tl.dot` in code | GEMM-like pattern missing — restructure to use `tl.dot` |

3.3.7 Triton autotune integration

```python
@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 32}, num_warps=4, num_stages=3),
        triton.Config({'BLOCK_M': 64,  'BLOCK_N': 64,  'BLOCK_K': 64}, num_warps=4, num_stages=4),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64,  'BLOCK_K': 32}, num_warps=8, num_stages=3),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def kernel(...):
    ...
```

When NCU reveals the bottleneck, narrow the autotune configs to the promising region instead of searching blindly.

3.4 Playbook: CuTe DSL

3.4.1 Key tuning parameters

| Parameter | Effect | Typical range |
|---|---|---|
| `threads_per_cta` | Warps per CTA; affects occupancy, barrier cost, reduce cost | 128–512 |
| `elems_per_thread` | Elements per thread; affects register pressure, data reuse | 4–32 |
| `num_bits_per_copy` | CopyAtom width; affects vectorized load/store bandwidth | 32, 64, 128 |
| Smem staging buffer | Pipeline depth × tile size; affects smem footprint | Minimize for occupancy |
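
Before profiling, the staging footprint implied by these parameters can be estimated offline — a sketch, where the per-SM smem budget is an assumed A100-class value:

```python
# Staging smem footprint: pipeline_depth × threads_per_cta ×
# elems_per_thread × dtype size, compared against an assumed per-SM
# budget to see how many CTAs can co-reside.
SMEM_PER_SM = 164 * 1024  # assumption: A100-class smem budget, bytes

def staging_smem_bytes(threads_per_cta: int, elems_per_thread: int,
                       dtype_bytes: int, pipeline_depth: int) -> int:
    return pipeline_depth * threads_per_cta * elems_per_thread * dtype_bytes

buf = staging_smem_bytes(256, 8, 2, 3)   # 3-deep fp16 pipeline → 12 KB
ctas_per_sm = SMEM_PER_SM // buf         # staging-limited CTAs per SM
```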

3.4.2 Occupancy optimization

| NCU finding | Action |
|---|---|
| Occupancy < 40%, registers are limiter | Reduce `elems_per_thread` or `threads_per_cta`; add `--maxrregcount=128` to `cute.compile()` |
| Occupancy < 40%, smem is limiter | Reduce `threads_per_cta` (fewer warps → smaller reduce buffer) or reduce the staging buffer count |
| Registers >= 128, warps >= 8 | CRITICAL: reduce `threads_per_cta` to 128 or 256 |

3.4.3 Memory access (TiledCopy)

| NCU finding | Action |
|---|---|
| Load coalescing > 8 | 1) Increase `num_bits_per_copy` to 128; 2) verify `t_layout` distributes threads along contiguous addresses; 3) ensure `from_dlpack()` uses `assumed_align=16` |
| stall_long_scoreboard > 30% | 1) Increase `num_bits_per_copy` to 128; 2) increase `elems_per_thread` for more reuse; 3) on SM>=80 use the CpAsyncOp copy atom; 4) add double-buffering |
| stall_wait > 30%, long_scoreboard < 15% | Pipeline over-buffered; increase `elems_per_thread` for more compute per stage or reduce pipeline depth |

3.4.4 Synchronization and reduction

| NCU finding | Action |
|---|---|
| stall_barrier > 25% | 1) Reduce `threads_per_cta` (fewer warps at the barrier); 2) replace the second `sync_threads` with a shuffle broadcast (if warps <= 32); 3) merge multiple `cta_reduce` calls |
| High barrier + small reduction | Use warp-only reduce without smem for small element counts |
| Multiple sync_threads per iteration | Minimize sync points; use async pipeline commit/wait patterns |

3.4.5 Thread divergence

| NCU finding | Action |
|---|---|
| Divergence > 20% | Adjust `threads_per_cta * elems_per_thread` to closely match problem dimension N, reducing predicated-off threads |
| Predicated copies show high divergence | Ensure N is divisible by `threads_per_cta * elems_per_thread` or use a tail-handling strategy |
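
The tail waste these rows describe is easy to predict before launching — a sketch:

```python
import math

# Fraction of the covered element range that is predicated off for a
# given (threads_per_cta, elems_per_thread, N). A large fraction
# predicts the divergence finding above; 0.0 means a perfect fit.
def predicated_off_fraction(threads_per_cta: int, elems_per_thread: int, n: int) -> float:
    tile = threads_per_cta * elems_per_thread   # elements covered per CTA
    num_ctas = math.ceil(n / tile)
    covered = num_ctas * tile
    return (covered - n) / covered
```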

3.4.6 Compute optimization

| NCU finding | Action |
|---|---|
| pipe_tensor < 5%, FP16 GEMM-like ops | Use `cute.make_mma_atom()` with an MmaOp for the Tensor Core path |
| pipe_fma high but pipe_tensor low (non-GEMM ops like RMSNorm/LayerNorm) | Tensor Core not applicable for reductions — focus on memory and barrier optimization instead |

3.4.7 Cache invalidation for re-profiling

CuTe DSL compiles Python to CUDA via JIT. After code changes:

```bash
# Clear compilation caches to force re-compilation
rm -rf __pycache__/ .cache/ /tmp/cutlass_cute_cache/

# Then re-profile
bash cuda-auto-tune/scripts/ncu_profile.sh "python your_cutedsl_kernel.py" report_v2
```

---

Step 4: Re-profile and verify (REQUIRED after every change)

4.1 Re-profile

```bash
# Clear JIT caches first
rm -rf ~/.triton/cache          # Triton
rm -rf __pycache__/ .cache/     # CuTe DSL

# Profile the updated version
bash cuda-auto-tune/scripts/ncu_profile.sh ./kernel_v2 report_v2
# or
bash cuda-auto-tune/scripts/ncu_profile.sh "python kernel_v2.py" report_v2
```

4.2 Compare against baseline

```bash
python3 cuda-auto-tune/scripts/ncu_analyse.py ncu_reports/report_v2.csv --diff ncu_reports/report_v1.csv
```

4.3 Verification checklist

| Check | Criteria |
|---|---|
| Duration improved? | `gpu__time_duration.sum` decreased |
| Target bottleneck improved? | The specific metric that triggered the change improved |
| No new bottlenecks? | No new CRITICAL findings in the diff report |
| At hardware ceiling? | SM throughput > 80% or DRAM throughput > 85% means near peak |
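
The checklist can be encoded as a decision function over two runs — a sketch, where `before`/`after` are metric dicts keyed by the NCU names used in this document and the thresholds mirror the table:

```python
# Decide CONTINUE / STOP / ROLLBACK for an optimization iteration from
# baseline and updated metric dicts plus the diff report's new findings.
def iteration_decision(before: dict, after: dict, new_critical_findings: int) -> str:
    if (after["gpu__time_duration.sum"] > before["gpu__time_duration.sum"]
            or new_critical_findings > 0):
        return "ROLLBACK"   # regression, or a new CRITICAL bottleneck appeared
    at_ceiling = (
        after.get("sm__throughput.avg.pct_of_peak_sustained_elapsed", 0) > 80
        or after.get("gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed", 0) > 85
    )
    if at_ceiling:
        return "STOP"       # near hardware peak — further tuning plateaus
    return "CONTINUE"       # improved and headroom remains
```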

4.4 Iteration log template

Track each iteration for accountability:
=== Iteration {N} ===
Change:  {what was changed and why}
NCU evidence: {metric}={before_value} -> {finding}
Report: report_v{N}.csv

Result:
  Duration: {before} us -> {after} us ({delta}%)
  Target metric: {metric}={before} -> {after}
  New findings: {any new issues introduced}

Decision: {CONTINUE to next bottleneck | STOP — at ceiling | ROLLBACK — regression}


Quick reference: high-signal NCU metrics

| Metric | NCU key |
|---|---|
| Duration | `gpu__time_duration.sum [us]` |
| SM throughput | `sm__throughput.avg.pct_of_peak_sustained_elapsed [%]` |
| Memory throughput | `gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed [%]` |
| DRAM throughput | `gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed [%]` |
| L1 hit rate | `l1tex__t_sector_hit_rate.pct [%]` |
| L2 hit rate | `lts__t_sector_hit_rate.pct [%]` |
| Load coalescing | `l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum / l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum` |
| Bank conflicts | `l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum` |
| Register spills | `l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum [sector]` |
| Occupancy | `sm__warps_active.avg.pct_of_peak_sustained_active [%]` |
| Warp eligibility | `smsp__warps_eligible.avg.per_cycle_active [warp]` |
| Registers/thread | `launch__registers_per_thread [register/thread]` |
| Smem/block | `launch__shared_mem_per_block [Kbyte/block]` |
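
The load-coalescing ratio is the one derived metric in this table; computing it from a raw-CSV row looks like this (a sketch — NCU counts 32-byte sectors, so 4 sectors per request is the ideal for 32-bit loads from a fully coalesced warp, and ratios well above that indicate scattered access):

```python
# Derive the load-coalescing ratio (sectors per global-load request)
# from the two NCU counters in the table above.
def load_coalescing_ratio(row: dict) -> float:
    sectors = row["l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum"]
    requests = row["l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum"]
    return sectors / requests if requests else 0.0
```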

Summary

This skill enforces a strict profile → analyze → change → verify loop. No NCU data = no optimization. No metric evidence = no code change. Each kernel type (Native CUDA / CUTLASS / Triton / CuTe DSL) has a dedicated playbook with NCU-metric-to-action mappings. Every change is tracked and verified by re-profiling.