# design-cutile-dsl-kernel


## cuTile Python DSL — Language-Specific Guidance

Prerequisite: Load `/design-kernel` for shared naming, versioning, KernelPlan structure, composition patterns, clone workflow, devlog template, and designer output contract. This skill covers only cuTile-specific runtime patterns and constraints.

## When To Use cuTile Python DSL

Stay in cuTile Python DSL (`cutile-dsl`) when the next optimization is still expressible through:

- tile sizes and tensor layout choices,
- CTA remapping and work partitioning via `ct.bid()` / `ct.num_blocks()`,
- occupancy and cluster-size hints (`num_ctas`, `occupancy`),
- compiler guidance such as `latency` / `allow_tma` hints,
- multi-stage composition through `KernelPipeline` / `ConcurrentKernels`.
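The CTA-remapping bullet is the kind of optimization that stays expressible in cuTile. As a sketch, here is a grouped block-id swizzle in plain Python; inside a kernel the linear id would come from `ct.bid(0)`, and `remap_grouped` plus its grouping scheme are illustrative, not cuTile APIs:

```python
# Grouped "swizzle" remapping of a linear block id, the kind of CTA
# remapping cuTile can express with ct.bid() / ct.num_blocks().
# Pure-Python stand-in for illustration only.

def remap_grouped(bid: int, num_m: int, num_n: int, group_m: int) -> tuple[int, int]:
    """Map a linear block id to (m, n) tile coordinates, walking the
    grid in column-major groups of `group_m` rows to improve L2 reuse."""
    blocks_per_group = group_m * num_n
    group = bid // blocks_per_group
    first_m = group * group_m
    rows = min(num_m - first_m, group_m)  # last group may be short
    m = first_m + (bid % blocks_per_group) % rows
    n = (bid % blocks_per_group) // rows
    return m, n

# Every block id maps to a unique tile of a 6 x 4 grid:
tiles = {remap_grouped(b, 6, 4, 2) for b in range(6 * 4)}
assert tiles == {(m, n) for m in range(6) for n in range(4)}
```

Grouping a few consecutive rows before moving across columns keeps blocks that share operand tiles resident in L2 at the same time.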

## Suitability Gate

Do not expect cuTile to express explicit intra-CTA scheduling. The public execution model does not expose thread, warp, or warpgroup identity, and does not allow explicit synchronization or communication within a block. Likewise, `num_ctas` does not give you a public cluster programming model: there is no exposed cluster rank, cluster barrier, DSM primitive, or cluster memory scope.

This becomes a hard design constraint when profiling shows the kernel is pinned to one CTA per SM. In that regime, the usual recovery path in state-of-the-art kernels is explicit scheduling inside the CTA: warpgroup seesaw schedules, producer/consumer pipelines, barrier choreography, or cluster-cooperative overlap. If the optimization you need falls in that category, stop iterating inside cuTile and switch to CuTe DSL (`/design-cute-dsl-kernel`).

The FlashMLA devlog (`docs/kernels/flash-mla.md`) is the concrete example: the official FlashMLA design remains productive at one resident block per SM because it uses explicit scheduling; cuTile FlashMLA versions cannot reproduce that behavior.

## Naming And Layout

- Public language key: `cutile-dsl`
- Python package path: `cutile`
- Kernel layout: `src/mla_var3/kernel/cutile/<layer>/<design>/<design>[_vN]/`
- Module file: `<design>[_vN].py`
- Main runtime wrapper: `CtKernel`

## CtKernel Runtime Pattern

`CtKernel(Kernel)` wraps cuTile kernels with `grid_fn` and `args_fn` callables:

```python
import math
from dataclasses import dataclass, field

import torch

from mla_var3.runtime import CtKernel, KernelPlan, Tiling, ConstInt, ConstBool
# `ct` is the cuTile module (import path omitted; see /cutile-dsl-ref)

@ct.kernel
def my_kernel(Tensor, ..., Bm: ConstInt, Bn: ConstInt, EVEN_N: ConstBool):
  bid_x = ct.bid(0)
  # block-level operations: ct.load, ct.mma, ct.store, ct.reshape, ...

@dataclass
class CtMyKernel(KernelPlan):
  b: int = 64; s: int = 1; t: int = 4096
  h: int = 16  # head count, referenced by grid_fn below
  tiling: MyTiling = field(default_factory=MyTiling)

  def plan(self, *inputs) -> CtKernel:
    Out = torch.empty_like(inputs[0])

    def grid_fn(cfg):
      return (math.ceil(self.s / cfg.Bm), math.ceil(self.h / cfg.Bh), self.b)

    def args_fn(cfg):
      return (inputs[0], Out, cfg.Bm, cfg.Bn, (self.t % cfg.Bn) == 0)

    return CtKernel(
      input_tensors=inputs,
      output_tensors=(Out,),
      kernel_fn=my_kernel,
      grid_fn=grid_fn,
      args_fn=args_fn,
      tiling=self.tiling,
      autotune_configs=self._autotune_configs(),
      algorithmic_flops_bytes_fn=self._algorithmic_flops_bytes,
    )
```
Key fields on `CtKernel`:

| Field | Type | Purpose |
| --- | --- | --- |
| `kernel_fn` | `@ct.kernel` function | The cuTile kernel function |
| `grid_fn` | `Callable[[Tiling], tuple]` | Maps tiling config to 3D grid dimensions |
| `args_fn` | `Callable[[Tiling], tuple]` | Maps tiling config to kernel arguments |
| `input_tensors` | `tuple[torch.Tensor]` | For autotuning cache key |
| `output_tensors` | `tuple[torch.Tensor]` | Returned by `__call__` |
| `tiling` | `Tiling` | Current tiling config |
| `autotune_configs` | `list[Tiling]` | Search space for autotuning |
| `algorithmic_flops_bytes_fn` | `Callable` | For roofline analysis |
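To make the `grid_fn` contract concrete, here is a hand evaluation with a `SimpleNamespace` standing in for a tiling config; the problem sizes are illustrative, not taken from the project:

```python
# A tiling config maps to a 3D grid: (query tiles, head tiles, batch).
import math
from types import SimpleNamespace

def grid_fn(cfg, s=1, h=16, b=64):
    # mirrors the grid_fn sketched in plan() above
    return (math.ceil(s / cfg.Bm), math.ceil(h / cfg.Bh), b)

cfg = SimpleNamespace(Bm=16, Bh=8)
assert grid_fn(cfg) == (1, 2, 64)            # decode-like: one query tile
assert grid_fn(cfg, s=4096) == (256, 2, 64)  # prefill-like: many tiles
```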

## Autotuning

`CtKernel` autotuning uses `autotune_launch()` from `cuda.tile_experimental`, which handles occupancy/`num_ctas` hints automatically. The `_apply_hints()` method re-applies hints after tuning.

## Compilation

`CtKernel.compile()` uses `compile_tile()` to emit `.bytecode` and optionally translates it to `.mlir` via `cuda-tile-translate`.

## Tiling Dataclass

cuTile tilings typically include tile dimensions plus optional compiler hints:

```python
from dataclasses import dataclass

from mla_var3.runtime import Tiling

@dataclass
class MyTiling(Tiling):
  Bm: int = 16       # query tile size
  Bn: int = 64       # KV tile size
  Bh: int = 8        # heads per block
  num_ctas: int | None = None   # CGA size (None = auto)
  occupancy: int | None = None  # occupancy hint (None = auto)

  def validate(self, pd: "CtMyKernel") -> bool:
    return self.Bm <= pd.s and self.Bn <= pd.t and self.Bh <= pd.h
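A common way to produce the `autotune_configs` search space is a cross-product of candidate tile sizes filtered by `validate()`. A self-contained sketch, with plain dataclasses standing in for the project's `Tiling` and `KernelPlan` (all names and candidate values here are illustrative):

```python
# Build an autotune search space from a tiling dataclass: enumerate
# candidate tile sizes, then keep only configs valid for the problem.
from dataclasses import dataclass
from itertools import product

@dataclass
class MyTiling:
    Bm: int = 16
    Bn: int = 64
    Bh: int = 8

    def validate(self, pd) -> bool:
        return self.Bm <= pd.s and self.Bn <= pd.t and self.Bh <= pd.h

@dataclass
class Problem:  # stand-in for the KernelPlan's problem dimensions
    s: int = 32
    t: int = 4096
    h: int = 16

def autotune_configs(pd: Problem) -> list[MyTiling]:
    """Cross-product of candidate tile sizes, filtered by validate()."""
    cands = [
        MyTiling(Bm, Bn, Bh)
        for Bm, Bn, Bh in product((16, 32, 64), (64, 128), (8, 16))
    ]
    return [c for c in cands if c.validate(pd)]

configs = autotune_configs(Problem())
assert all(c.Bm <= 32 for c in configs)  # Bm=64 > s=32 is filtered out
```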

## cuTile Constant Types

```python
from mla_var3.runtime import ConstInt, ConstBool, ConstFloat, INV_LOG_2
```

- `ConstInt = ct.Constant[int]` — compile-time integer
- `ConstBool = ct.Constant[bool]` — compile-time boolean
- `ConstFloat = ct.Constant[float]` — compile-time float
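The `EVEN_N: ConstBool` parameter in the runtime pattern above shows the intended use: fold a divisibility check into a compile-time flag at `args_fn` time so the compiler can specialize away tail masking. The check itself is just:

```python
# The ConstBool pattern: a divisibility check computed on the host
# becomes a compile-time flag that lets the kernel skip bounds masking
# when the context length divides evenly into KV tiles.
def even_flag(t: int, Bn: int) -> bool:
    return (t % Bn) == 0

assert even_flag(4096, 64)       # full tiles: no tail masking needed
assert not even_flag(4000, 64)   # ragged tail: masked loads required
```

Because the flag is a `ct.Constant`, each distinct value compiles a separate kernel specialization rather than branching at runtime.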

## Common Pitfalls

- Never hallucinate cuTile APIs — always verify with `/cutile-dsl-ref` or the official docs.
- The `@ct.kernel` function name MUST match the module filename.
- Register budget: 255 max per thread; aim for <128 for good occupancy (validate against the active device's SM limits via `docs/devices/` and `src/mla_var3/conf/devices.json`).
- Shared memory: 48 KB/block by default, 99 KB opt-in, 100 KB/SM total.
- When reducing tile sizes for occupancy, verify that compute efficiency doesn't degrade.
- Do not assume `num_ctas` unlocks explicit cluster-cooperative algorithms; in cuTile it is a hint, not a cluster programming interface.
- If NCU shows one CTA per SM and low eligible warps, and the missing fix is explicit warpgroup or barrier scheduling, further cuTile-only tuning is unlikely to close the gap.
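The register and shared-memory budgets above combine into a quick back-of-envelope occupancy estimate. A sketch: the 64 K registers/SM figure is an assumption typical of recent NVIDIA SMs, the shared-memory numbers come from the bullets, and `resident_ctas` is not a project API:

```python
# Estimate resident CTAs per SM from the register and shared-memory
# limits; the real limiter is the minimum of the two.

def resident_ctas(regs_per_thread: int, threads_per_cta: int,
                  smem_per_cta: int,
                  regs_per_sm: int = 65536,
                  smem_per_sm: int = 100 * 1024) -> int:
    """CTAs per SM allowed by register-file and shared-memory budgets."""
    by_regs = regs_per_sm // (regs_per_thread * threads_per_cta)
    by_smem = smem_per_sm // smem_per_cta if smem_per_cta else by_regs
    return min(by_regs, by_smem)

# At 255 regs/thread, a 256-thread CTA is pinned to 1 CTA/SM by the
# register file; below 128 regs/thread, 2 CTAs fit.
assert resident_ctas(255, 256, 48 * 1024) == 1
assert resident_ctas(120, 256, 48 * 1024) == 2
# Opting into 99 KB of shared memory makes smem the limiter instead.
assert resident_ctas(32, 128, 99 * 1024) == 1
```

This is why the register-budget bullet pairs with the one-CTA-per-SM warning: once either budget pins the kernel to one resident CTA, only explicit intra-CTA scheduling recovers overlap, and that is outside cuTile's model.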

## Knowledge Links

- Shared optimization knowledge: `docs/knowledge/optimizations/`
- Shared anti-patterns: `docs/knowledge/anti-patterns/`
- cuTile-specific optimization overlays: `docs/knowledge/languages/cutile-dsl/`
- cuTile-specific optimization catalog: load `/optimization-catalog-cutile-dsl`

## References

- Runtime API details: see the `/design-kernel` skill's runtime-api reference for the complete `CtKernel` API documentation.
- Kernel templates: see the `/design-kernel` skill's kernel-templates reference for copyable template code.
- FlashMLA scheduling limit: see `docs/kernels/flash-mla.md` for the diagnosis comparing cuTile FlashMLA with the official, explicitly scheduled implementation.