design-cute-dsl-kernel
# CuTe Python DSL Kernel Design
Always also load `/design-kernel` for shared naming, versioning, and workflow. Also load `/cute-dsl-ref` for API reference, execution model, and architecture operations.

## When To Use CuTe Python DSL
何时使用CuTe Python DSL
Use CuTe Python DSL (`cute-dsl`) when cuTile's public control surface is no longer sufficient, but a Python-authored kernel workflow is still appropriate.

### Suitability Gate
CuTe DSL is the right choice when the next optimization requires any of these controls that cuTile does not expose:
- Thread/warp/warpgroup identity -- explicit control over which threads do what
- Intra-CTA synchronization -- barriers, named barriers, arrive/wait patterns
- Warpgroup scheduling -- producer/consumer warpgroup roles, persistent warpgroup loops
- TMA pipeline control -- explicit multi-stage async copy pipelines with barrier synchronization
- Cluster programming -- cross-CTA shared memory access, distributed shared memory
- Register-level data movement -- explicit register-to-register shuffles, warp-level primitives
- Custom epilogues -- fused post-processing with fine-grained control
### When to stay in cuTile instead
If the optimization is still expressible through tile sizes, CTA remapping, `occupancy`/`num_ctas` hints, latency hints, or `allow_tma` flags, stay in cuTile. CuTe DSL adds complexity -- use it only when that complexity is load-bearing.

### Hard design constraint

When profiling shows 1 CTA/SM, low eligible warps, and the fix requires explicit warpgroup or barrier scheduling, cuTile cannot close the gap. This is the canonical trigger to switch to CuTe DSL.
## Naming And Layout

- Public language key: `cute-dsl`
- Python package path: `cute_python`
- Kernel layout: `src/mla_var3/kernel/cute_python/<layer>/<design>/<design>[_vN]/`
- Module: `<design>[_vN].py`
- Wrapper: `CuteKernel`
## CuteKernel Runtime Pattern

The runtime wrapper lives at `src/mla_var3/runtime/cute_kernel.py`. CuTe DSL kernels use a two-level host/device pattern:

- A host function decorated with `@cute.jit` sets up TMA descriptors, computes the grid, and launches the kernel
- A device function decorated with `@cute.kernel` contains the GPU code

The `CuteKernel` dataclass wraps this pattern:

```python
from mla_var3.runtime.cute_kernel import CuteKernel
```

In `KernelPlan.plan()`:
```python
def plan(self, *inputs):
    # Build a closure that captures inputs and calls the host function
    def launch_fn(tiling):
        # Convert tensors, set up TMA descriptors, compute grid
        host_fn(tiling, *converted_inputs)

    return CuteKernel(
        kernel_fn=device_kernel,  # The @cute.kernel function (for naming)
        launch_fn=launch_fn,      # Closure: (tiling) -> launches kernel
        input_tensors=list(inputs),
        output_tensors=[output],
        tiling=self.tiling,
        autotune_configs=self._autotune_configs(),
        algorithmic_flops_bytes_fn=self._algorithmic_flops_bytes,
    )
```
### Key differences from CtKernel
| Aspect | CtKernel (cuTile) | CuteKernel (CuTe DSL) |
|---|---|---|
| Launch mechanism | | |
| Compilation | cuTile bytecode + MLIR | CuTe DSL JIT (cached automatically) |
| Autotuning | via `cuda.tile_experimental` | `do_bench_cudagraph` per config |
| Grid setup | | Inside the `@cute.jit` host function |
| TMA descriptors | Implicit (cuTile handles) | Explicit setup in host function |
### Autotuning

CuteKernel iterates over `autotune_configs`, benchmarks each via `do_bench_cudagraph`, and selects the fastest. Failed configs are caught and skipped.
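The selection loop can be sketched in plain Python; here `bench` stands in for the `do_bench_cudagraph` call and the config type is left opaque, so this is an illustrative shape rather than the actual CuteKernel code:

```python
def pick_fastest(configs, bench):
    """Benchmark each config, skip failures, return the fastest (sketch)."""
    best_cfg, best_ms = None, float("inf")
    for cfg in configs:
        try:
            ms = bench(cfg)  # stand-in for do_bench_cudagraph
        except Exception:
            # Failed configs are caught and skipped
            continue
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg
```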
### Compile

CuTe DSL JIT caches artifacts automatically. The `compile()` method creates the output directory but does not extract explicit artifacts (can be extended per-kernel).
## Tiling Dataclass Guidance

CuTe DSL tilings typically include fields that cuTile tilings do not:

```python
from dataclasses import dataclass

@dataclass
class Tiling:
    # Standard tile dimensions
    tile_m: int
    tile_n: int
    tile_k: int
    # CuTe DSL-specific fields
    num_warpgroups: int       # Warpgroups per CTA (typically 2-4)
    num_pipeline_stages: int  # Async copy pipeline depth
    num_tma_buffers: int      # TMA double/triple buffering
    cluster_m: int = 1        # Cluster shape (M dimension)
    cluster_n: int = 1        # Cluster shape (N dimension)

    def validate(self, pd) -> bool:
        # Validate against problem dimensions and device limits
        ...
```

The exact fields depend on the kernel design. The `validate` method should check that tile dimensions divide evenly into problem dimensions and that resource usage (registers, SMEM) stays within device limits.
## Architecture Compatibility Check

Kernels may target architecture-specific instructions that require a specific SM version. The "Blackwell" marketing name spans multiple SM versions with different instruction sets:

- SM100 (B200, B300) -- datacenter GPUs, supports `tcgen05` MMA, TMEM, full cluster features
- SM120 (GeForce RTX 5090, RTX 5080) -- consumer Blackwell, uses `sm_120a` ops, does not support `tcgen05`

A kernel using `tcgen05` ops requires SM100 and will not run on an RTX 5090 (SM120) despite both being "Blackwell".

Before running, compiling, or profiling a CuTe DSL kernel, always verify the available device:

```bash
nvidia-smi --query-gpu=name --format=csv,noheader
```

Then match the GPU name to its SM version. If the kernel's target SM version does not match, skip the run/profiling step. Do not attempt execution -- it will either fail at compile time or produce misleading results. Record the architecture requirement in the kernel's devlog and note the skip.
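The matching step can be sketched as a small lookup. The exact name strings reported by `nvidia-smi` vary by driver, so treat this table as an assumption covering only the GPUs named above:

```python
# Illustrative name-to-SM table; extend per deployment, do not treat as complete.
_GPU_SM = {
    "NVIDIA B200": 100,
    "NVIDIA B300": 100,
    "NVIDIA GeForce RTX 5090": 120,
    "NVIDIA GeForce RTX 5080": 120,
}

def should_run(gpu_name: str, kernel_target_sm: int) -> bool:
    """True only when the detected GPU matches the kernel's SM target."""
    sm = _GPU_SM.get(gpu_name.strip())
    # Unknown GPU or SM mismatch: skip rather than risk misleading results
    return sm is not None and sm == kernel_target_sm
```

Feeding the `nvidia-smi` output line into `should_run` before launch implements the skip rule: a `tcgen05` kernel targeting SM100 is refused on an RTX 5090.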
## Common Pitfalls

- Never hallucinate CuTe DSL APIs -- verify against the `/cute-dsl-ref` skill, `docs/cute-dsl/` documentation, or CUTLASS example kernels
- The `@cute.kernel` function name MUST match the module filename (same rule as cuTile)
- TMA descriptor setup is host-side only -- do not attempt TMA operations inside `@cute.kernel` without proper `@cute.jit` host setup
- Register budget: 255 max/thread; validate against the active device's SM limits via `docs/devices/` and `src/mla_var3/conf/devices.json`
- Shared memory varies by SM version: SM100 (B200/B300) offers up to 228 KB/SM with 227 KB/block opt-in; SM120 (RTX 5090) offers 96 KB -- do not assume "Blackwell" means datacenter SMEM limits
- Pipeline stage count affects SMEM usage (each stage needs its own buffer) -- validate total SMEM before increasing stages
- Barrier synchronization errors are silent and cause incorrect results, not crashes -- always test with `--check`
- Cluster programming requires the launch to use cluster-compatible grid dimensions
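The pipeline-stage pitfall can be quantified with a quick bound. The per-stage cost model here (one A-tile plus one B-tile per stage) is an illustrative assumption; substitute the kernel's real buffer layout:

```python
def max_pipeline_stages(tile_m: int, tile_n: int, tile_k: int,
                        elem_bytes: int, smem_limit_bytes: int) -> int:
    """Largest pipeline depth whose stage buffers fit in SMEM (sketch)."""
    # Assumed per-stage cost: one A-tile (tile_m x tile_k) + one B-tile (tile_n x tile_k)
    stage_bytes = (tile_m + tile_n) * tile_k * elem_bytes
    return smem_limit_bytes // stage_bytes
```

With 128x128x64 fp16 tiles this gives 7 stages under SM100's 227 KB/block opt-in limit but only 3 under SM120's 96 KB, which is exactly why "Blackwell" SMEM assumptions must be checked per SM version.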
## Reference Resources

- API reference: Load `/cute-dsl-ref` for the core API table, execution model, and architecture operations
- Official documentation: `docs/cute-dsl/` (CuTe Python DSL) and `docs/cutlass-cpp/cute/` (CuTe C++ concepts)
- Example kernels and learning paths: Load `/learn-cute-dsl` for a categorized index of CUTLASS example kernels
## Knowledge Links

- Shared optimization knowledge: `docs/knowledge/optimizations/`
- Shared anti-patterns: `docs/knowledge/anti-patterns/`
- CuTe Python DSL overlays: `docs/knowledge/languages/cute-dsl/`
## Development Log Entry

Use `docs/kernels/<kernel>.md` and record the implementation location using the Python package path form:

```text
src/mla_var3/kernel/cute_python/mla/<kernel>/<kernel>[_vN]/
```