cuda-auto-tune

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

NCU-driven iterative kernel optimization (CUDA / CUTLASS / Triton / CuTe DSL)

GATE CHECK (enforce before any optimization)

STOP — Do you have NCU profile data for this kernel?
  NO  → Go to Step 1. Do NOT touch any kernel code.
  YES → Go to Step 2.

Hard rules — violating any rule invalidates the entire optimization:
  • NEVER change kernel code, launch config, or template parameters without NCU data.
  • ALL recommendations MUST cite specific NCU metric values as evidence.
  • Each iteration MUST cover at minimum: roofline, memory hierarchy, warp stalls, occupancy.
  • The optimization playbook MUST match the kernel implementation type.
  • After EVERY code change, re-profile and compare with `--diff`.
  • Stop iterating when improvements plateau or metrics approach the hardware ceiling.

Mandatory optimization loop

┌─────────────────────────────────────────────────────────────────────┐
│  Step 1: Profile (NCU --set full)                                   │
│      ↓                                                              │
│  Step 2: Multi-dimensional analysis + identify kernel type          │
│      ↓                                                              │
│  Step 3: Apply type-specific playbook (one change per iteration)    │
│      ↓                                                              │
│  Step 4: Re-profile + diff → improved? → loop or stop               │
│      ↑                                           │                  │
│      └───────────────────────────────────────────┘                  │
└─────────────────────────────────────────────────────────────────────┘

Step 1: Profile with NCU (REQUIRED — no data = no optimization)

Option A: Profiling script (recommended)

```bash
# Native CUDA / CUTLASS binaries
bash cuda-auto-tune/scripts/ncu_profile.sh ./kernel report_v1

# Triton / Python
bash cuda-auto-tune/scripts/ncu_profile.sh "python your_kernel.py" report_v1

# CuTe DSL / Python
bash cuda-auto-tune/scripts/ncu_profile.sh "python your_cutedsl_kernel.py" report_v1
```

The script collects `--set full` → exports CSV → runs deep analysis → generates reports.

Option B: Manual profiling

```bash
ncu --set full -o report_v1 --target-processes all ./your_kernel
ncu --import report_v1.ncu-rep --page raw --csv > report_v1.csv
python3 cuda-auto-tune/scripts/ncu_analyse.py report_v1.csv
```

Kernel-name filters (reduce noise)

```bash
# CUTLASS only
ncu --set full -o report_v1 --target-processes all \
    --kernel-name "cutlass_|sm90_|ampere_" ./cutlass_program

# Triton only
ncu --set full -o report_v1 --target-processes all \
    --kernel-name "triton_" "python triton_kernel.py"

# CuTe DSL (kernel name often generic — use --type override in analysis)
python3 cuda-auto-tune/scripts/ncu_analyse.py report_v1.csv --type cutedsl
```

Expected outputs

ncu_reports/
├── report_v1.ncu-rep           # Full binary report
├── report_v1.csv               # Raw metrics CSV
├── report_v1_analysis.md       # Deep analysis report
└── report_v1_summary.txt       # Per-kernel summary

Step 2: Multi-dimensional analysis

2.1 Identify implementation type

Determine the kernel type from the NCU "Function Name" and source context:

| Type | Detection signals |
|---|---|
| Native CUDA | No library prefix; hand-written `__global__` functions |
| CUTLASS | `cutlass_` prefix, `smXX_xmma_`, contains `tensorop` or `cutlass` |
| Triton | `triton_` prefix, contains `triton`, encoded suffixes (e.g. `_0d1d...e`) |
| CuTe DSL | Generic names from `@cute.kernel`; confirm via source imports (`cutlass.cute`, `cute.compile`) or `--type cutedsl` |
| Library | `cublas*`, `cudnn*` — baseline/reference only, not optimizable |

2.2 Common diagnostics (ALL kernel types — always run)

| Dimension | Key NCU metrics | Output |
|---|---|---|
| Roofline | SM throughput, memory throughput | compute-bound / memory-bound / latency-bound / balanced |
| Memory hierarchy | L1/L2 hit rate, coalescing ratio, DRAM throughput | cache efficiency + bandwidth sub-bottleneck (DRAM/L2/L1) |
| Warp stalls | PC sampling stall reasons (long_scoreboard, wait, barrier, ...) | top stall reasons with percentages |
| Instruction mix | pipe FMA/ALU/LSU/Tensor utilization | pipeline imbalance, Tensor Core usage |
| Occupancy | active warps %, limiter breakdown (register/smem/warp/block) | limiting factor + register count + smem size |
| Memory hazards | bank conflicts, register spills (local store sectors) | severity and root cause |
| Divergence | avg threads executed vs avg threads active (true) | divergence percentage |

2.3 Type-specific focus

| Type | Key focus areas |
|---|---|
| Native CUDA | launch config (block size, grid), memory access patterns, async copy (cp.async/TMA), Tensor Core opportunity |
| CUTLASS | ThreadblockShape, WarpShape, stages, alignment, schedule policy, epilogue fusion, CTA swizzle |
| Triton | `num_warps`, `num_stages`, `BLOCK_*` sizes, compiler hints (`tl.multiple_of`, `tl.max_contiguous`), `tl.dot` config |
| CuTe DSL | `threads_per_cta`, `elems_per_thread`, CopyAtom (`num_bits_per_copy`), `tiled_copy` layout, smem staging, `cta_reduce` pattern |

2.4 Bottleneck classification decision tree

SM% > MEM% + 20  →  COMPUTE_BOUND
MEM% > SM% + 20  →  MEMORY_BOUND
  ├─ DRAM throughput > 70%        → DRAM-Bound (near HBM ceiling)
  ├─ L2 hit < 50%, DRAM > 40%    → DRAM-Bound (L2 miss driven)
  ├─ L1 hit < 20%, L2 hit >= 50% → L2-Bound
  └─ L1 hit < 20%                → L1-Bound
SM% < 40 AND MEM% < 40           →  LATENCY_BOUND
SM% > 60 AND MEM% > 60           →  BALANCED (near peak)
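
The decision tree above can be expressed as a small helper for post-processing NCU CSV data — a sketch, with thresholds copied verbatim from the tree and branches evaluated in the listed order:

```python
# Sketch of the 2.4 decision tree. Inputs are roofline percentages from
# NCU; branches are checked in the order the tree lists them.
def classify_bottleneck(sm: float, mem: float, dram: float,
                        l1_hit: float, l2_hit: float) -> str:
    if sm > mem + 20:
        return "COMPUTE_BOUND"
    if mem > sm + 20:
        if dram > 70:
            return "MEMORY_BOUND: DRAM-Bound (near HBM ceiling)"
        if l2_hit < 50 and dram > 40:
            return "MEMORY_BOUND: DRAM-Bound (L2 miss driven)"
        if l1_hit < 20 and l2_hit >= 50:
            return "MEMORY_BOUND: L2-Bound"
        if l1_hit < 20:
            return "MEMORY_BOUND: L1-Bound"
        return "MEMORY_BOUND"
    if sm < 40 and mem < 40:
        return "LATENCY_BOUND"
    if sm > 60 and mem > 60:
        return "BALANCED (near peak)"
    return "UNCLASSIFIED (mixed)"
```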

2.5 Conclusion template (REQUIRED after every analysis)

=== Conclusion ===
Kernel:    {kernel_name}
Type:      {Native CUDA | CUTLASS | Triton | CuTe DSL}
Arch:      SM_{arch}
Overall:   {COMPUTE_BOUND | MEMORY_BOUND | LATENCY_BOUND | BALANCED}
Duration:  {duration_us} us
Roofline:  SM {sm}%, MEM {mem}%, DRAM {dram}%
Occupancy: {occ}% (theoretical: {theo}%), limited by {limiter}
Regs/Thread: {regs}, Smem/Block: {smem} KB

Findings (sorted by severity):
  [CRITICAL] {finding}: {NCU evidence with numbers} -> {specific action}
  [WARNING]  {finding}: {NCU evidence with numbers} -> {specific action}
  [INFO]     {finding}: {NCU evidence with numbers}

Optimization priorities:
  1. {highest_priority} (expected gain: Nx, evidence: {metric}={value})
  2. {second_priority}  (expected gain: Nx, evidence: {metric}={value})
  3. {third_priority}   (expected gain: Nx, evidence: {metric}={value})

Step 3: Apply type-specific playbook

No intuition-only edits. Every change MUST directly address an NCU finding. Apply ONE change per iteration, then re-profile (Step 4).

3.1 Playbook: Native CUDA

3.1.1 Launch configuration

| NCU finding | Action | Code pattern |
|---|---|---|
| Occupancy < 50%, block size < 128 | Increase block size to 128–256 | `kernel<<<grid, 256>>>` |
| Registers are occupancy limiter | Cap registers via `__launch_bounds__` | `__global__ void __launch_bounds__(256, 2) kernel()` |
| Grid too small (< SM count) | Ensure enough blocks for full SM coverage | `grid = (N + block - 1) / block` with sufficient N |
| Occupancy low, blocks limiter | Reduce block size to fit more blocks per SM | Try 128 instead of 256 |
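
The occupancy reasoning behind these rules can be sanity-checked offline. A sketch — the per-SM limits below are assumed A100-class values (substitute your GPU's limits), and register allocation granularity is ignored:

```python
# Rough occupancy-limiter estimate for a launch configuration.
# SM_LIMITS is an assumption for an A100-class GPU (SM80).
SM_LIMITS = {"regs": 65536, "smem_bytes": 164 * 1024, "threads": 2048, "blocks": 32}

def occupancy_limiter(block_size: int, regs_per_thread: int, smem_per_block: int):
    by_regs = SM_LIMITS["regs"] // (regs_per_thread * block_size)
    by_smem = SM_LIMITS["smem_bytes"] // smem_per_block if smem_per_block else SM_LIMITS["blocks"]
    by_threads = SM_LIMITS["threads"] // block_size
    candidates = [("registers", by_regs), ("smem", by_smem),
                  ("threads", by_threads), ("blocks", SM_LIMITS["blocks"])]
    limiter, blocks = min(candidates, key=lambda kv: kv[1])
    occupancy_pct = 100.0 * blocks * block_size / SM_LIMITS["threads"]
    return limiter, occupancy_pct
```

For example, 256 threads/block at 128 registers/thread caps the SM at 2 blocks, i.e. 25% theoretical occupancy — the "registers are occupancy limiter" row above.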

3.1.2 Memory access optimization

| NCU finding | Action | Code pattern |
|---|---|---|
| Load coalescing ratio > 8 | Ensure warp-contiguous addressing, AoS→SoA | `data[threadIdx.x + blockIdx.x * blockDim.x]` |
| Store coalescing ratio > 8 | Use shared memory staging for scatter writes | Write to smem first, then coalesced writeback |
| L1 hit rate < 20% | Use `__shared__` for frequently reused data | Tile into shared memory with `__syncthreads()` |
| L2 hit rate < 50% | Use L2 persistence hints (Ampere+) | `cudaAccessPolicyWindow` for hot data ranges |
| DRAM throughput > 80% | Reduce data movement: mixed precision, compression | `half` / `__nv_bfloat16` for bandwidth-sensitive ops |
| Bank conflicts > 100K | Pad shared memory or swizzle layout | `__shared__ float smem[32][33];` (pad +1) |
| Register spills > 0 | Reduce per-thread state, use `__launch_bounds__` | Simplify accumulators, split into sub-kernels |

3.1.3 Latency hiding and pipelining

| NCU finding | Action | Code pattern |
|---|---|---|
| stall_long_scoreboard > 30% (SM>=80) | Use `cp.async` + double buffering | `__pipeline_memcpy_async(&smem, &gmem, size)` |
| stall_long_scoreboard > 30% (SM>=90) | Use TMA for bulk async transfers | `cute::copy(tma_load, ...)` or CuTe TMA atoms |
| stall_barrier > 25% | Reduce sync frequency, use warp primitives | `__shfl_sync()`, `cooperative_groups` |
| stall_wait > 30%, long_scoreboard < 15% | Pipeline over-buffered, reduce depth | Remove one buffer stage |
| stall_math_pipe_throttle > 20% | Compute saturated (positive signal) | Consider Tensor Core or reduce FLOPs |

3.1.4 Tensor Core utilization

| NCU finding | Action |
|---|---|
| pipe_tensor < 5%, FP16/BF16 workload with GEMM-like pattern | Use WMMA (`wmma::mma_sync`) or inline PTX (`mma.sync`) |
| pipe_tensor < 5%, but data is FP32 | Use TF32 path via `wmma::mma_sync` with `nvcuda::wmma::precision::tf32` |
| pipe_fma_fp16 > 10%, pipe_tensor < 5% | Switch from scalar FP16 FMA to Tensor Core path |

3.1.5 Vectorized memory access

```cuda
// NCU evidence: coalescing ratio > 4 for 32-bit loads
// Before: scalar loads
float val = input[idx];

// After: vectorized 128-bit load (4x float)
float4 val = reinterpret_cast<const float4*>(input)[idx / 4];
```

3.2 Playbook: CUTLASS

3.2.1 Kernel config parsing

CUTLASS kernel names encode configuration. Extract:
  • Architecture: `sm80_`, `sm90_`, `ampere_`, `hopper_`
  • Compute type: `tensorop` vs `simt`
  • Tile shape: `128x128x32`, `256x128x64`
  • Pipeline stages: trailing `x3`, `x5`
  • Alignment: `align8`
  • Schedule (3.x): `WarpSpecialized`, `WarpSpecializedCooperative`, `WarpSpecializedPingpong`
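
This extraction can be automated. A sketch — the field regexes and the sample kernel name below are illustrative only; CUTLASS naming varies across versions, so treat any unmatched field as unknown:

```python
import re

# Illustrative field patterns for the naming conventions listed above.
# A None field means "could not be determined from the name".
FIELD_PATTERNS = {
    "arch":    r"(sm\d+|ampere|hopper)",
    "compute": r"(tensorop|simt)",
    "tile":    r"(\d+x\d+x\d+)",
    "stages":  r"x(\d)(?:_|$)",      # trailing stage count, e.g. "...x3"
    "align":   r"align(\d+)",
}

def parse_cutlass_name(name: str) -> dict:
    """Best-effort parse of an NCU-reported CUTLASS kernel name."""
    return {field: (m.group(1) if (m := re.search(pat, name)) else None)
            for field, pat in FIELD_PATTERNS.items()}

# Hypothetical kernel name, shaped after the conventions above:
cfg = parse_cutlass_name("cutlass_sm80_tensorop_s16816gemm_128x128x32_stages_x3_align8")
```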

3.2.2 Tile shape and occupancy

| NCU finding | Action |
|---|---|
| Occupancy < 40%, smem is limiter | Reduce ThreadblockShape (e.g., 256x128→128x128) or reduce stages |
| Occupancy < 40%, registers are limiter | Use smaller WarpShape (e.g., 64x64→32x32) to reduce per-thread regs |
| SM throughput < 30%, grid is small | Increase ThreadblockShape to process more elements per CTA |
| SM throughput > 80%, MEM < 40% | Already compute-bound; increase pipeline stages for more overlap |

3.2.3 Pipeline stages

| NCU finding | Action |
|---|---|
| stall_long_scoreboard > 30% | Increase stages (Ampere: 3→5, Hopper: 2→3) |
| stall_wait > 30%, long_scoreboard < 15% | Pipeline over-buffered; reduce stages to save smem |
| Smem limiter + stages > 3 | Reduce stages to free smem for higher occupancy |

3.2.4 Alignment and vectorization

| NCU finding | Action |
|---|---|
| Load coalescing > 4, alignment < 8 | Increase CUTLASS alignment to 8 elements (128-bit vector access for fp16); pad matrix leading dimensions to multiples of the alignment |
| SIMT path used but data supports TensorOp | Switch to a `tensorop` CUTLASS configuration (2–8x speedup) |
| TensorOp configured but pipe_tensor < 5% | Check alignment requirements — the leading dimension must be a multiple of InstructionShape::kK |

3.2.5 Schedule and architecture

| NCU finding | Action |
|---|---|
| CUTLASS 2.x on SM>=90 | Upgrade to CUTLASS 3.x with WarpSpecialized + TMA (1.2–1.5x gain) |
| L2 hit rate < 50% on large GEMM | Add ThreadblockSwizzle (2.x: `GemmIdentityThreadblockSwizzle<N>`, 3.x: `StreamK` or tile swizzle) |
| stall_long_scoreboard > 30% on Hopper | Switch to `WarpSpecializedCooperative` schedule with TMA loads |

3.2.6 Epilogue fusion

| NCU finding | Action |
|---|---|
| Multiple CUTLASS kernels back-to-back (e.g., GEMM + bias + activation) | Fuse into a single kernel via the CUTLASS epilogue visitor tree |
| High DRAM traffic (read+write GB > expected) | Move post-GEMM ops into the epilogue to eliminate intermediate tensors |

3.3 Playbook: Triton

3.3.1 Kernel classification

Triton kernel subtypes (from kernel name):
  • `triton_poi_`: Inductor pointwise (auto-generated)
  • `triton_red_`: Inductor reduction (auto-generated)
  • `triton_per_`: Inductor persistent reduction (auto-generated)
  • Custom `@triton.jit`: hand-written kernel (fully tunable)

Inductor-generated kernels: optimize at the PyTorch level (`torch._inductor.config`), or rewrite as a custom `@triton.jit` kernel if this is a hot path.
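
A minimal classifier for these subtypes, useful when triaging an NCU report programmatically (a sketch; the subtype labels are our own):

```python
# Map NCU kernel names to the Triton subtypes listed above, so the right
# tuning strategy (PyTorch-level vs. direct @triton.jit tuning) is chosen.
PREFIXES = {
    "triton_poi_": "inductor-pointwise",
    "triton_red_": "inductor-reduction",
    "triton_per_": "inductor-persistent-reduction",
}

def classify_triton_kernel(name: str) -> str:
    for prefix, subtype in PREFIXES.items():
        if name.startswith(prefix):
            return subtype
    return "custom-jit"  # hand-written @triton.jit — fully tunable
```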

3.3.2 num_warps tuning

| NCU finding | Action |
|---|---|
| Registers >= 128, num_warps >= 8 | CRITICAL: reduce num_warps (try 4 or 2) |
| Registers >= 64, num_warps >= 8 | Reduce num_warps to 4 |
| Occupancy < 40%, register-limited | Reduce num_warps AND/OR reduce BLOCK_* tile sizes |
| SM throughput < 30%, few warps | Increase num_warps to improve latency hiding |
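
These rules can be folded into a next-step suggestion, e.g. for seeding an autotune loop — a sketch, with thresholds mirroring the table; they are heuristics, not guarantees:

```python
# Suggest the next num_warps to try, given NCU evidence for the current
# launch. Returns the current value when no rule in the table fires.
def suggest_num_warps(registers_per_thread: int, num_warps: int,
                      occupancy_pct: float, sm_throughput_pct: float) -> int:
    if registers_per_thread >= 128 and num_warps >= 8:
        return 4                      # CRITICAL register pressure (then try 2)
    if registers_per_thread >= 64 and num_warps >= 8:
        return 4
    if occupancy_pct < 40 and registers_per_thread >= 64:
        return max(num_warps // 2, 1) # register-limited occupancy
    if sm_throughput_pct < 30 and num_warps <= 2:
        return num_warps * 2          # too few warps to hide latency
    return num_warps
```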

3.3.3 num_stages tuning

| NCU finding | Action |
|---|---|
| stall_long_scoreboard > 30% | Increase num_stages (2→3→4 on Ampere, 2→3 on Hopper) |
| stall_wait > 30%, long_scoreboard < 15% | Decrease num_stages (over-buffered) or increase tile work |
| Smem is occupancy limiter | Decrease num_stages (staging smem grows linearly with the stage count) |
| On Hopper + long_scoreboard high | Also consider `tl.make_block_ptr()` for TMA-based loads |

3.3.4 BLOCK_* tile size tuning

| NCU finding | Action |
|---|---|
| Register pressure high | Reduce BLOCK_M, BLOCK_N, or BLOCK_K |
| SM throughput low, compute-bound opportunity | Increase BLOCK_M/BLOCK_N for more compute per tile |
| DRAM bandwidth near ceiling | Increase BLOCK_K for more data reuse before writeback |

3.3.5 Memory access optimization

| NCU finding | Action | Code pattern |
|---|---|---|
| Load coalescing > 8 | Add stride hints | `tl.multiple_of(stride, 16)` and `tl.max_contiguous(offsets, BLOCK)` |
| Uncoalesced on transposed input | Use structured pointers | `tl.make_block_ptr(base, shape, strides, offsets, block_shape, order)` |
| L1 hit rate low | Verify access pattern contiguity | Ensure innermost dim stride == 1 |

3.3.6 Tensor Core utilization

| NCU finding | Action |
|---|---|
| pipe_tensor < 5%, kernel uses `tl.dot` | 1) `allow_tf32=True` for fp32; 2) BLOCK_K multiple of 16; 3) check dtypes are fp16/bf16/tf32/fp8 |
| pipe_tensor < 5%, no `tl.dot` in code | GEMM-like pattern missing — restructure to use `tl.dot` |

3.3.7 Triton autotune integration

```python
@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'BLOCK_K': 32}, num_warps=4, num_stages=3),
        triton.Config({'BLOCK_M': 64,  'BLOCK_N': 64,  'BLOCK_K': 64}, num_warps=4, num_stages=4),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64,  'BLOCK_K': 32}, num_warps=8, num_stages=3),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def kernel(...):
    ...
```

When NCU reveals the bottleneck, narrow the autotune configs to the promising region instead of searching blindly.

3.4 Playbook: CuTe DSL

3.4.1 Key tuning parameters

| Parameter | Effect | Typical range |
|---|---|---|
| `threads_per_cta` | Warps per CTA; affects occupancy, barrier cost, reduce cost | 128–512 |
| `elems_per_thread` | Elements per thread; affects register pressure, data reuse | 4–32 |
| `num_bits_per_copy` | CopyAtom width; affects vectorized load/store bandwidth | 32, 64, 128 |
| Smem staging buffer | Pipeline depth × tile size; affects smem footprint | Minimize for occupancy |
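
Before profiling, the staging footprint implied by these parameters can be estimated offline — a sketch, where the per-SM smem budget is an assumed A100-class value:

```python
# Staging smem footprint: pipeline_depth × threads_per_cta ×
# elems_per_thread × dtype size, compared against an assumed per-SM
# budget to see how many CTAs can co-reside.
SMEM_PER_SM = 164 * 1024  # assumption: A100-class smem budget, bytes

def staging_smem_bytes(threads_per_cta: int, elems_per_thread: int,
                       dtype_bytes: int, pipeline_depth: int) -> int:
    return pipeline_depth * threads_per_cta * elems_per_thread * dtype_bytes

buf = staging_smem_bytes(256, 8, 2, 3)   # 3-deep fp16 pipeline → 12 KB
ctas_per_sm = SMEM_PER_SM // buf         # staging-limited CTAs per SM
```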

3.4.2 Occupancy optimization

| NCU finding | Action |
|---|---|
| Occupancy < 40%, registers are limiter | Reduce `elems_per_thread` or `threads_per_cta`; add `--maxrregcount=128` to `cute.compile()` |
| Occupancy < 40%, smem is limiter | Reduce `threads_per_cta` (fewer warps → smaller reduce buffer) or reduce the staging buffer count |
| Registers >= 128, warps >= 8 | CRITICAL: reduce `threads_per_cta` to 128 or 256 |

3.4.3 Memory access (TiledCopy)

| NCU finding | Action |
|---|---|
| Load coalescing > 8 | 1) Increase `num_bits_per_copy` to 128; 2) verify `t_layout` distributes threads along contiguous addresses; 3) ensure `from_dlpack()` uses `assumed_align=16` |
| stall_long_scoreboard > 30% | 1) Increase `num_bits_per_copy` to 128; 2) increase `elems_per_thread` for more reuse; 3) on SM>=80 use the CpAsyncOp copy atom; 4) add double-buffering |
| stall_wait > 30%, long_scoreboard < 15% | Pipeline over-buffered; increase `elems_per_thread` for more compute per stage or reduce pipeline depth |

3.4.4 Synchronization and reduction

| NCU finding | Action |
|---|---|
| stall_barrier > 25% | 1) Reduce `threads_per_cta` (fewer warps at the barrier); 2) replace the second `sync_threads` with a shuffle broadcast (if warps <= 32); 3) merge multiple `cta_reduce` calls |
| High barrier + small reduction | Use warp-only reduce without smem for small element counts |
| Multiple sync_threads per iteration | Minimize sync points; use async pipeline commit/wait patterns |

3.4.5 Thread divergence

| NCU finding | Action |
|---|---|
| Divergence > 20% | Adjust `threads_per_cta * elems_per_thread` to closely match problem dimension N, reducing predicated-off threads |
| Predicated copies show high divergence | Ensure N is divisible by `threads_per_cta * elems_per_thread` or use a tail-handling strategy |
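
The tail waste these rows describe is easy to predict before launching — a sketch:

```python
import math

# Fraction of the covered element range that is predicated off for a
# given (threads_per_cta, elems_per_thread, N). A large fraction
# predicts the divergence finding above; 0.0 means a perfect fit.
def predicated_off_fraction(threads_per_cta: int, elems_per_thread: int, n: int) -> float:
    tile = threads_per_cta * elems_per_thread   # elements covered per CTA
    num_ctas = math.ceil(n / tile)
    covered = num_ctas * tile
    return (covered - n) / covered
```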

3.4.6 Compute optimization

| NCU finding | Action |
|---|---|
| pipe_tensor < 5%, FP16 GEMM-like ops | Use `cute.make_mma_atom()` with an MmaOp for the Tensor Core path |
| pipe_fma high but pipe_tensor low (non-GEMM ops like RMSNorm/LayerNorm) | Tensor Core not applicable for reductions — focus on memory and barrier optimization instead |

3.4.7 Cache invalidation for re-profiling

CuTe DSL compiles Python to CUDA via JIT. After code changes:

```bash
# Clear compilation caches to force re-compilation
rm -rf __pycache__/ .cache/ /tmp/cutlass_cute_cache/

# Then re-profile
bash cuda-auto-tune/scripts/ncu_profile.sh "python your_cutedsl_kernel.py" report_v2
```

---

Step 4: Re-profile and verify (REQUIRED after every change)

4.1 Re-profile

```bash
# Clear JIT caches first
rm -rf ~/.triton/cache          # Triton
rm -rf __pycache__/ .cache/     # CuTe DSL

# Profile the updated version
bash cuda-auto-tune/scripts/ncu_profile.sh ./kernel_v2 report_v2
# or
bash cuda-auto-tune/scripts/ncu_profile.sh "python kernel_v2.py" report_v2
```

4.2 Compare against baseline

```bash
python3 cuda-auto-tune/scripts/ncu_analyse.py ncu_reports/report_v2.csv --diff ncu_reports/report_v1.csv
```

4.3 Verification checklist

| Check | Criteria |
|---|---|
| Duration improved? | `gpu__time_duration.sum` decreased |
| Target bottleneck improved? | The specific metric that triggered the change improved |
| No new bottlenecks? | No new CRITICAL findings in the diff report |
| At hardware ceiling? | SM throughput > 80% or DRAM throughput > 85% means near peak |
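
The checklist can be encoded as a decision function over two runs — a sketch, where `before`/`after` are metric dicts keyed by the NCU names used in this document and the thresholds mirror the table:

```python
# Decide CONTINUE / STOP / ROLLBACK for an optimization iteration from
# baseline and updated metric dicts plus the diff report's new findings.
def iteration_decision(before: dict, after: dict, new_critical_findings: int) -> str:
    if (after["gpu__time_duration.sum"] > before["gpu__time_duration.sum"]
            or new_critical_findings > 0):
        return "ROLLBACK"   # regression, or a new CRITICAL bottleneck appeared
    at_ceiling = (
        after.get("sm__throughput.avg.pct_of_peak_sustained_elapsed", 0) > 80
        or after.get("gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed", 0) > 85
    )
    if at_ceiling:
        return "STOP"       # near hardware peak — further tuning plateaus
    return "CONTINUE"       # improved and headroom remains
```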

4.4 Iteration log template

Track each iteration for accountability:
=== Iteration {N} ===
Change:  {what was changed and why}
NCU evidence: {metric}={before_value} -> {finding}
Report: report_v{N}.csv

Result:
  Duration: {before} us -> {after} us ({delta}%)
  Target metric: {metric}={before} -> {after}
  New findings: {any new issues introduced}

Decision: {CONTINUE to next bottleneck | STOP — at ceiling | ROLLBACK — regression}


Quick reference: high-signal NCU metrics

| Metric | NCU key |
|---|---|
| Duration | `gpu__time_duration.sum [us]` |
| SM throughput | `sm__throughput.avg.pct_of_peak_sustained_elapsed [%]` |
| Memory throughput | `gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed [%]` |
| DRAM throughput | `gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed [%]` |
| L1 hit rate | `l1tex__t_sector_hit_rate.pct [%]` |
| L2 hit rate | `lts__t_sector_hit_rate.pct [%]` |
| Load coalescing | `l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum / l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum` |
| Bank conflicts | `l1tex__data_bank_conflicts_pipe_lsu_mem_shared.sum` |
| Register spills | `l1tex__t_sectors_pipe_lsu_mem_local_op_st.sum [sector]` |
| Occupancy | `sm__warps_active.avg.pct_of_peak_sustained_active [%]` |
| Warp eligibility | `smsp__warps_eligible.avg.per_cycle_active [warp]` |
| Registers/thread | `launch__registers_per_thread [register/thread]` |
| Smem/block | `launch__shared_mem_per_block [Kbyte/block]` |
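
The load-coalescing ratio is the one derived metric in this table; computing it from a raw-CSV row looks like this (a sketch — NCU counts 32-byte sectors, so 4 sectors per request is the ideal for 32-bit loads from a fully coalesced warp, and ratios well above that indicate scattered access):

```python
# Derive the load-coalescing ratio (sectors per global-load request)
# from the two NCU counters in the table above.
def load_coalescing_ratio(row: dict) -> float:
    sectors = row["l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum"]
    requests = row["l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum"]
    return sectors / requests if requests else 0.0
```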

Summary

This skill enforces a strict profile → analyze → change → verify loop. No NCU data = no optimization. No metric evidence = no code change. Each kernel type (Native CUDA / CUTLASS / Triton / CuTe DSL) has a dedicated playbook with NCU-metric-to-action mappings. Every change is tracked and verified by re-profiling.