design-cute-dsl-kernel

CuTe Python DSL Kernel Design


Always also load `/design-kernel` for shared naming, versioning, and workflow. Also load `/cute-dsl-ref` for API reference, execution model, and architecture operations.

When To Use CuTe Python DSL


Use CuTe Python DSL (`cute-dsl`) when cuTile's public control surface is no longer sufficient, but a Python-authored kernel workflow is still appropriate.

Suitability Gate


CuTe DSL is the right choice when the next optimization requires any of these controls that cuTile does not expose:
  • Thread/warp/warpgroup identity -- explicit control over which threads do what
  • Intra-CTA synchronization -- barriers, named barriers, arrive/wait patterns
  • Warpgroup scheduling -- producer/consumer warpgroup roles, persistent warpgroup loops
  • TMA pipeline control -- explicit multi-stage async copy pipelines with barrier synchronization
  • Cluster programming -- cross-CTA shared memory access, distributed shared memory
  • Register-level data movement -- explicit register-to-register shuffles, warp-level primitives
  • Custom epilogues -- fused post-processing with fine-grained control

When to stay in cuTile instead


If the optimization is still expressible through tile sizes, CTA remapping, `occupancy`/`num_ctas` hints, latency hints, or `allow_tma` flags, stay in cuTile. CuTe DSL adds complexity -- use it only when that complexity is load-bearing.

Hard design constraint


When profiling shows 1 CTA/SM, low eligible warps, and the fix requires explicit warpgroup or barrier scheduling, cuTile cannot close the gap. This is the canonical trigger to switch to CuTe DSL.

Naming And Layout


  • Public language key: `cute-dsl`
  • Python package path: `cute_python`
  • Kernel layout: `src/mla_var3/kernel/cute_python/<layer>/<design>/<design>[_vN]/`
  • Module: `<design>[_vN].py`
  • Wrapper: `CuteKernel`

CuteKernel Runtime Pattern


The runtime wrapper lives at `src/mla_var3/runtime/cute_kernel.py`. CuTe DSL kernels use a two-level host/device pattern:
  1. A `@cute.jit` host function sets up TMA descriptors, computes the grid, and launches the kernel
  2. A `@cute.kernel` device function contains the GPU code

The `CuteKernel` dataclass wraps this pattern:

```python
from mla_var3.runtime.cute_kernel import CuteKernel
```
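The host function's grid computation is typically just a ceiling division of the problem dimensions by the tile shape. A minimal sketch, assuming a trimmed `Tiling` with only the fields used here (`compute_grid` and `ceil_div` are illustrative helpers, not part of the runtime API):

```python
from dataclasses import dataclass


@dataclass
class Tiling:
    tile_m: int
    tile_n: int


def ceil_div(a: int, b: int) -> int:
    # Round up so partial edge tiles still get a CTA
    return (a + b - 1) // b


def compute_grid(m: int, n: int, tiling: Tiling) -> tuple[int, int, int]:
    # One CTA per (tile_m x tile_n) output tile; z dimension unused here
    return (ceil_div(m, tiling.tile_m), ceil_div(n, tiling.tile_n), 1)
```

For example, `compute_grid(4096, 4096, Tiling(128, 256))` yields a `(32, 16, 1)` grid.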

In KernelPlan.plan():


```python
def plan(self, *inputs):
    # Build a closure that captures inputs and calls the host function
    def launch_fn(tiling):
        # Convert tensors, set up TMA descriptors, compute grid
        host_fn(tiling, *converted_inputs)

    return CuteKernel(
        kernel_fn=device_kernel,       # The @cute.kernel function (for naming)
        launch_fn=launch_fn,           # Closure: (tiling) -> launches kernel
        input_tensors=list(inputs),
        output_tensors=[output],
        tiling=self.tiling,
        autotune_configs=self._autotune_configs(),
        algorithmic_flops_bytes_fn=self._algorithmic_flops_bytes,
    )
```

Key differences from CtKernel


| Aspect | CtKernel (cuTile) | CuteKernel (CuTe DSL) |
| --- | --- | --- |
| Launch mechanism | `grid_fn` + `args_fn` + cuTile compiler | `launch_fn` closure wrapping `@cute.jit` host |
| Compilation | cuTile bytecode + MLIR | CuTe DSL JIT (cached automatically) |
| Autotuning | `autotune_launch()` from `cuda.tile_experimental` | `triton.testing.do_bench_cudagraph` per config |
| Grid setup | `grid_fn(cfg) -> (x, y, z)` | Inside `launch_fn` / host function |
| TMA descriptors | Implicit (cuTile handles) | Explicit setup in host function |

Autotuning


CuteKernel iterates over `autotune_configs`, benchmarks each via `do_bench_cudagraph`, and selects the fastest. Failed configs are caught and skipped.
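The selection loop can be sketched as follows. Here `bench_fn` stands in for `triton.testing.do_bench_cudagraph` (a callable that times a closure and returns milliseconds), and the config and launch types are illustrative:

```python
from typing import Any, Callable, Sequence


def pick_fastest(
    configs: Sequence[Any],
    launch: Callable[[Any], None],
    bench_fn: Callable[[Callable[[], None]], float],
):
    """Benchmark each config, skipping ones that fail; return (best_config, best_ms)."""
    best, best_ms = None, float("inf")
    for cfg in configs:
        try:
            ms = bench_fn(lambda: launch(cfg))  # e.g. do_bench_cudagraph
        except Exception:
            continue  # failed configs are caught and skipped
        if ms < best_ms:
            best, best_ms = cfg, ms
    return best, best_ms
```

Any config whose launch raises (invalid tiling, compile failure) is silently dropped, mirroring the wrapper's behavior.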

Compile


CuTe DSL JIT caches artifacts automatically. The `compile()` method creates the output directory but does not extract explicit artifacts (it can be extended per-kernel).

Tiling Dataclass Guidance


CuTe DSL tilings typically include fields that cuTile tilings do not:

```python
from dataclasses import dataclass


@dataclass
class Tiling:
    # Standard tile dimensions
    tile_m: int
    tile_n: int
    tile_k: int

    # CuTe DSL-specific fields
    num_warpgroups: int          # Warpgroups per CTA (typically 2-4)
    num_pipeline_stages: int     # Async copy pipeline depth
    num_tma_buffers: int         # TMA double/triple buffering
    cluster_m: int = 1           # Cluster shape (M dimension)
    cluster_n: int = 1           # Cluster shape (N dimension)

    def validate(self, pd) -> bool:
        # Validate against problem dimensions and device limits
        ...
```

The exact fields depend on the kernel design. The `validate` method should check that tile dimensions divide evenly into problem dimensions and that resource usage (registers, SMEM) stays within device limits.
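A hedged sketch of such a `validate` method, assuming a simple fp16 double-tile SMEM cost model per pipeline stage and an illustrative `ProblemDims` stand-in for `pd` -- the real checks are kernel-specific:

```python
from dataclasses import dataclass


@dataclass
class ProblemDims:  # illustrative stand-in for `pd`
    m: int
    n: int
    k: int


@dataclass
class Tiling:  # trimmed to the fields used in this sketch
    tile_m: int
    tile_n: int
    tile_k: int
    num_pipeline_stages: int
    smem_limit_bytes: int = 227 * 1024  # SM100 per-block opt-in limit

    def smem_bytes(self) -> int:
        # Each stage buffers an A tile and a B tile (fp16 = 2 bytes/element)
        per_stage = 2 * (self.tile_m * self.tile_k + self.tile_n * self.tile_k)
        return self.num_pipeline_stages * per_stage

    def validate(self, pd: ProblemDims) -> bool:
        divides = (pd.m % self.tile_m == 0
                   and pd.n % self.tile_n == 0
                   and pd.k % self.tile_k == 0)
        return divides and self.smem_bytes() <= self.smem_limit_bytes
```

Note that the same tiling can pass the divisibility check yet fail on SMEM once the stage count grows, which is why both checks belong in one gate.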

Architecture Compatibility Check


Kernels may target architecture-specific instructions that require a specific SM version. The "Blackwell" marketing name spans multiple SM versions with different instruction sets:
  • SM100 (B200, B300) -- datacenter GPUs, supports `tcgen05` MMA, TMEM, full cluster features
  • SM120 (GeForce RTX 5090, RTX 5080) -- consumer Blackwell, uses `sm_120a`, does not support `tcgen05` ops

A kernel using `tcgen05` ops requires SM100 and will not run on an RTX 5090 (SM120) despite both being "Blackwell".

Before running, compiling, or profiling a CuTe DSL kernel, always verify the available device:

```bash
nvidia-smi --query-gpu=name --format=csv,noheader
```

Then match the GPU name to its SM version. If the kernel's target SM version does not match, skip the run/profiling step. Do not attempt execution -- it will either fail at compile time or produce misleading results. Record the architecture requirement in the kernel's devlog and note the skip.
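This gate can be sketched as a name-to-SM lookup over the `nvidia-smi` output. The mapping table below is a small illustrative subset (extend it for the devices you actually target), and `query_gpu_name` shells out to the same command shown above:

```python
import subprocess

# Illustrative subset -- extend with the devices you actually target
GPU_NAME_TO_SM = {
    "NVIDIA B200": 100,
    "NVIDIA B300": 100,
    "NVIDIA GeForce RTX 5090": 120,
    "NVIDIA GeForce RTX 5080": 120,
}


def query_gpu_name() -> str:
    # First GPU only; multi-GPU hosts would need per-device handling
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()[0]


def should_skip(kernel_target_sm: int, gpu_name: str) -> bool:
    # Skip run/profiling when the device's SM version is unknown or mismatched
    sm = GPU_NAME_TO_SM.get(gpu_name)
    return sm is None or sm != kernel_target_sm
```

Unknown names conservatively skip, which matches the "do not attempt execution" rule above.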

Common Pitfalls


  • Never hallucinate CuTe DSL APIs -- verify against the `/cute-dsl-ref` skill, `docs/cute-dsl/` documentation, or CUTLASS example kernels
  • The `@cute.kernel` function name MUST match the module filename (same rule as cuTile)
  • TMA descriptor setup is host-side only -- do not attempt TMA operations inside `@cute.kernel` without proper `@cute.jit` host setup
  • Register budget: 255 max/thread; validate against the active device's SM limits via `docs/devices/` and `src/mla_var3/conf/devices.json`
  • Shared memory varies by SM version: SM100 (B200/B300) offers up to 228 KB/SM and 227 KB/block opt-in; SM120 (RTX 5090) offers 96 KB -- do not assume "Blackwell" means datacenter SMEM limits
  • Pipeline stage count affects SMEM usage (each stage needs its own buffer) -- validate total SMEM before increasing stages
  • Barrier synchronization errors are silent and cause incorrect results, not crashes -- always test with `--check`
  • Cluster programming requires the launch to use cluster-compatible grid dimensions
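The pipeline-stage pitfall reduces to simple arithmetic: each stage owns its own buffer, so the stage count is capped by the per-block SMEM budget. A sketch with illustrative tile sizes (fp16 tiles; the 227 KB and 96 KB figures are the per-block limits quoted above):

```python
def max_pipeline_stages(per_stage_bytes: int, smem_budget_bytes: int) -> int:
    # Stages scale SMEM linearly, so the budget caps the stage count
    return smem_budget_bytes // per_stage_bytes


# Example: 128x64 A tile + 128x64 B tile per stage, fp16 (2 bytes/element)
PER_STAGE = 2 * (128 * 64 + 128 * 64)  # 32768 bytes (32 KB)

sm100_stages = max_pipeline_stages(PER_STAGE, 227 * 1024)  # datacenter opt-in limit
sm120_stages = max_pipeline_stages(PER_STAGE, 96 * 1024)   # consumer limit
```

With these numbers SM100 fits 7 stages where SM120 fits only 3, which is exactly why "Blackwell" SMEM limits must not be assumed.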

Reference Resources


  • API reference: Load `/cute-dsl-ref` for the core API table, execution model, and architecture operations
  • Official documentation: `docs/cute-dsl/` (CuTe Python DSL) and `docs/cutlass-cpp/cute/` (CuTe C++ concepts)
  • Example kernels and learning paths: Load `/learn-cute-dsl` for a categorized index of CUTLASS example kernels

Knowledge Links


  • Shared optimization knowledge: `docs/knowledge/optimizations/`
  • Shared anti-patterns: `docs/knowledge/anti-patterns/`
  • CuTe Python DSL overlays: `docs/knowledge/languages/cute-dsl/`

Development Log Entry


Use `docs/kernels/<kernel>.md` and record the implementation location using the Python package path form:

```text
src/mla_var3/kernel/cute_python/mla/<kernel>/<kernel>[_vN]/
```