design-cute-dsl-kernel

CuTe Python DSL Kernel Design


Always also load `/design-kernel` for shared naming, versioning, and workflow. Also load `/cute-dsl-ref` for API reference, execution model, and architecture operations.

When To Use CuTe Python DSL


Use CuTe Python DSL (`cute-dsl`) when cuTile's public control surface is no longer sufficient, but a Python-authored kernel workflow is still appropriate.

Suitability Gate


CuTe DSL is the right choice when the next optimization requires any of these controls that cuTile does not expose:
  • Thread/warp/warpgroup identity -- explicit control over which threads do what
  • Intra-CTA synchronization -- barriers, named barriers, arrive/wait patterns
  • Warpgroup scheduling -- producer/consumer warpgroup roles, persistent warpgroup loops
  • TMA pipeline control -- explicit multi-stage async copy pipelines with barrier synchronization
  • Cluster programming -- cross-CTA shared memory access, distributed shared memory
  • Register-level data movement -- explicit register-to-register shuffles, warp-level primitives
  • Custom epilogues -- fused post-processing with fine-grained control

When to stay in cuTile instead


If the optimization is still expressible through tile sizes, CTA remapping, `occupancy`/`num_ctas` hints, latency hints, or `allow_tma` flags, stay in cuTile. CuTe DSL adds complexity -- use it only when that complexity is load-bearing.

Hard design constraint


When profiling shows 1 CTA/SM, low eligible warps, and the fix requires explicit warpgroup or barrier scheduling, cuTile cannot close the gap. This is the canonical trigger to switch to CuTe DSL.

Naming And Layout


  • Public language key: `cute-dsl`
  • Python package path: `cute_python`
  • Kernel layout: `src/mla_var3/kernel/cute_python/<layer>/<design>/<design>[_vN]/`
  • Module: `<design>[_vN].py`
  • Wrapper: `CuteKernel`

CuteKernel Runtime Pattern


The runtime wrapper lives at `src/mla_var3/runtime/cute_kernel.py`. CuTe DSL kernels use a two-level host/device pattern:
  1. A `@cute.jit` host function sets up TMA descriptors, computes the grid, and launches the kernel
  2. A `@cute.kernel` device function contains the GPU code

The `CuteKernel` dataclass wraps this pattern:

```python
from mla_var3.runtime.cute_kernel import CuteKernel
```
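The host function's grid computation is typically just a ceiling division of the problem dimensions by the tile shape. A minimal sketch, assuming a trimmed `Tiling` with only the fields used here (`compute_grid` and `ceil_div` are illustrative helpers, not part of the runtime API):

```python
from dataclasses import dataclass


@dataclass
class Tiling:
    tile_m: int
    tile_n: int


def ceil_div(a: int, b: int) -> int:
    # Round up so partial edge tiles still get a CTA
    return (a + b - 1) // b


def compute_grid(m: int, n: int, tiling: Tiling) -> tuple[int, int, int]:
    # One CTA per (tile_m x tile_n) output tile; z dimension unused here
    return (ceil_div(m, tiling.tile_m), ceil_div(n, tiling.tile_n), 1)
```

For example, `compute_grid(4096, 4096, Tiling(128, 256))` yields a `(32, 16, 1)` grid.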

In KernelPlan.plan():


```python
def plan(self, *inputs):
    # Build a closure that captures inputs and calls the host function
    def launch_fn(tiling):
        # Convert tensors, set up TMA descriptors, compute grid
        host_fn(tiling, *converted_inputs)

    return CuteKernel(
        kernel_fn=device_kernel,       # The @cute.kernel function (for naming)
        launch_fn=launch_fn,           # Closure: (tiling) -> launches kernel
        input_tensors=list(inputs),
        output_tensors=[output],
        tiling=self.tiling,
        autotune_configs=self._autotune_configs(),
        algorithmic_flops_bytes_fn=self._algorithmic_flops_bytes,
    )
```

Key differences from CtKernel


| Aspect | CtKernel (cuTile) | CuteKernel (CuTe DSL) |
| --- | --- | --- |
| Launch mechanism | `grid_fn` + `args_fn` + cuTile compiler | `launch_fn` closure wrapping `@cute.jit` host |
| Compilation | cuTile bytecode + MLIR | CuTe DSL JIT (cached automatically) |
| Autotuning | `autotune_launch()` from `cuda.tile_experimental` | `triton.testing.do_bench_cudagraph` per config |
| Grid setup | `grid_fn(cfg) -> (x, y, z)` | Inside `launch_fn` / host function |
| TMA descriptors | Implicit (cuTile handles) | Explicit setup in host function |

Autotuning


CuteKernel iterates over `autotune_configs`, benchmarks each via `do_bench_cudagraph`, and selects the fastest. Failed configs are caught and skipped.
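The selection loop can be sketched as follows. Here `bench_fn` stands in for `triton.testing.do_bench_cudagraph` (a callable that times a closure and returns milliseconds), and the config and launch types are illustrative:

```python
from typing import Any, Callable, Sequence


def pick_fastest(
    configs: Sequence[Any],
    launch: Callable[[Any], None],
    bench_fn: Callable[[Callable[[], None]], float],
):
    """Benchmark each config, skipping ones that fail; return (best_config, best_ms)."""
    best, best_ms = None, float("inf")
    for cfg in configs:
        try:
            ms = bench_fn(lambda: launch(cfg))  # e.g. do_bench_cudagraph
        except Exception:
            continue  # failed configs are caught and skipped
        if ms < best_ms:
            best, best_ms = cfg, ms
    return best, best_ms
```

Any config whose launch raises (invalid tiling, compile failure) is silently dropped, mirroring the wrapper's behavior.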

Compile


CuTe DSL JIT caches artifacts automatically. The `compile()` method creates the output directory but does not extract explicit artifacts (it can be extended per-kernel).

Tiling Dataclass Guidance


CuTe DSL tilings typically include fields that cuTile tilings do not:

```python
from dataclasses import dataclass


@dataclass
class Tiling:
    # Standard tile dimensions
    tile_m: int
    tile_n: int
    tile_k: int

    # CuTe DSL-specific fields
    num_warpgroups: int          # Warpgroups per CTA (typically 2-4)
    num_pipeline_stages: int     # Async copy pipeline depth
    num_tma_buffers: int         # TMA double/triple buffering
    cluster_m: int = 1           # Cluster shape (M dimension)
    cluster_n: int = 1           # Cluster shape (N dimension)

    def validate(self, pd) -> bool:
        # Validate against problem dimensions and device limits
        ...
```

The exact fields depend on the kernel design. The `validate` method should check that tile dimensions divide evenly into problem dimensions and that resource usage (registers, SMEM) stays within device limits.
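A hedged sketch of such a `validate` method, assuming a simple fp16 double-tile SMEM cost model per pipeline stage and an illustrative `ProblemDims` stand-in for `pd` -- the real checks are kernel-specific:

```python
from dataclasses import dataclass


@dataclass
class ProblemDims:  # illustrative stand-in for `pd`
    m: int
    n: int
    k: int


@dataclass
class Tiling:  # trimmed to the fields used in this sketch
    tile_m: int
    tile_n: int
    tile_k: int
    num_pipeline_stages: int
    smem_limit_bytes: int = 227 * 1024  # SM100 per-block opt-in limit

    def smem_bytes(self) -> int:
        # Each stage buffers an A tile and a B tile (fp16 = 2 bytes/element)
        per_stage = 2 * (self.tile_m * self.tile_k + self.tile_n * self.tile_k)
        return self.num_pipeline_stages * per_stage

    def validate(self, pd: ProblemDims) -> bool:
        divides = (pd.m % self.tile_m == 0
                   and pd.n % self.tile_n == 0
                   and pd.k % self.tile_k == 0)
        return divides and self.smem_bytes() <= self.smem_limit_bytes
```

Note that the same tiling can pass the divisibility check yet fail on SMEM once the stage count grows, which is why both checks belong in one gate.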

Architecture Compatibility Check


Kernels may target architecture-specific instructions that require a specific SM version. The "Blackwell" marketing name spans multiple SM versions with different instruction sets:
  • SM100 (B200, B300) -- datacenter GPUs, supports `tcgen05` MMA, TMEM, full cluster features
  • SM120 (GeForce RTX 5090, RTX 5080) -- consumer Blackwell, uses `sm_120a`, does not support `tcgen05` ops

A kernel using `tcgen05` ops requires SM100 and will not run on an RTX 5090 (SM120) despite both being "Blackwell".

Before running, compiling, or profiling a CuTe DSL kernel, always verify the available device:

```bash
nvidia-smi --query-gpu=name --format=csv,noheader
```

Then match the GPU name to its SM version. If the kernel's target SM version does not match, skip the run/profiling step. Do not attempt execution -- it will either fail at compile time or produce misleading results. Record the architecture requirement in the kernel's devlog and note the skip.
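This gate can be sketched as a name-to-SM lookup over the `nvidia-smi` output. The mapping table below is a small illustrative subset (extend it for the devices you actually target), and `query_gpu_name` shells out to the same command shown above:

```python
import subprocess

# Illustrative subset -- extend with the devices you actually target
GPU_NAME_TO_SM = {
    "NVIDIA B200": 100,
    "NVIDIA B300": 100,
    "NVIDIA GeForce RTX 5090": 120,
    "NVIDIA GeForce RTX 5080": 120,
}


def query_gpu_name() -> str:
    # First GPU only; multi-GPU hosts would need per-device handling
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()[0]


def should_skip(kernel_target_sm: int, gpu_name: str) -> bool:
    # Skip run/profiling when the device's SM version is unknown or mismatched
    sm = GPU_NAME_TO_SM.get(gpu_name)
    return sm is None or sm != kernel_target_sm
```

Unknown names conservatively skip, which matches the "do not attempt execution" rule above.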

Common Pitfalls


  • Never hallucinate CuTe DSL APIs -- verify against the `/cute-dsl-ref` skill, `docs/cute-dsl/` documentation, or CUTLASS example kernels
  • The `@cute.kernel` function name MUST match the module filename (same rule as cuTile)
  • TMA descriptor setup is host-side only -- do not attempt TMA operations inside `@cute.kernel` without proper `@cute.jit` host setup
  • Register budget: 255 max/thread; validate against the active device's SM limits via `docs/devices/` and `src/mla_var3/conf/devices.json`
  • Shared memory varies by SM version: SM100 (B200/B300) offers up to 228 KB/SM and 227 KB/block opt-in; SM120 (RTX 5090) offers 96 KB -- do not assume "Blackwell" means datacenter SMEM limits
  • Pipeline stage count affects SMEM usage (each stage needs its own buffer) -- validate total SMEM before increasing stages
  • Barrier synchronization errors are silent and cause incorrect results, not crashes -- always test with `--check`
  • Cluster programming requires the launch to use cluster-compatible grid dimensions
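The pipeline-stage pitfall reduces to simple arithmetic: each stage owns its own buffer, so the stage count is capped by the per-block SMEM budget. A sketch with illustrative tile sizes (fp16 tiles; the 227 KB and 96 KB figures are the per-block limits quoted above):

```python
def max_pipeline_stages(per_stage_bytes: int, smem_budget_bytes: int) -> int:
    # Stages scale SMEM linearly, so the budget caps the stage count
    return smem_budget_bytes // per_stage_bytes


# Example: 128x64 A tile + 128x64 B tile per stage, fp16 (2 bytes/element)
PER_STAGE = 2 * (128 * 64 + 128 * 64)  # 32768 bytes (32 KB)

sm100_stages = max_pipeline_stages(PER_STAGE, 227 * 1024)  # datacenter opt-in limit
sm120_stages = max_pipeline_stages(PER_STAGE, 96 * 1024)   # consumer limit
```

With these numbers SM100 fits 7 stages where SM120 fits only 3, which is exactly why "Blackwell" SMEM limits must not be assumed.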

Reference Resources


  • API reference: Load `/cute-dsl-ref` for the core API table, execution model, and architecture operations
  • Official documentation: `docs/cute-dsl/` (CuTe Python DSL) and `docs/cutlass-cpp/cute/` (CuTe C++ concepts)
  • Example kernels and learning paths: Load `/learn-cute-dsl` for a categorized index of CUTLASS example kernels

Knowledge Links


  • Shared optimization knowledge: `docs/knowledge/optimizations/`
  • Shared anti-patterns: `docs/knowledge/anti-patterns/`
  • CuTe Python DSL overlays: `docs/knowledge/languages/cute-dsl/`

Development Log Entry


Use `docs/kernels/<kernel>.md` and record the implementation location using the Python package path form:

```text
src/mla_var3/kernel/cute_python/mla/<kernel>/<kernel>[_vN]/
```