# design-cutile-dsl-kernel


## cuTile Python DSL — Language-Specific Guidance

Prerequisite: Load `/design-kernel` for shared naming, versioning, KernelPlan structure, composition patterns, clone workflow, devlog template, and designer output contract. This skill covers only cuTile-specific runtime patterns and constraints.

## When To Use cuTile Python DSL

Stay in cuTile Python DSL (`cutile-dsl`) when the next optimization is still expressible through:

- tile sizes and tensor layout choices,
- CTA remapping and work partitioning via `ct.bid()` / `ct.num_blocks()`,
- occupancy and cluster-size hints (`num_ctas`, `occupancy`),
- compiler guidance such as `latency` / `allow_tma` hints,
- multi-stage composition through `KernelPipeline` / `ConcurrentKernels`.
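The CTA-remapping bullet is the kind of optimization that stays expressible in cuTile. As a sketch, here is a grouped block-id swizzle in plain Python; inside a kernel the linear id would come from `ct.bid(0)`, and `remap_grouped` plus its grouping scheme are illustrative, not cuTile APIs:

```python
# Grouped "swizzle" remapping of a linear block id, the kind of CTA
# remapping cuTile can express with ct.bid() / ct.num_blocks().
# Pure-Python stand-in for illustration only.

def remap_grouped(bid: int, num_m: int, num_n: int, group_m: int) -> tuple[int, int]:
    """Map a linear block id to (m, n) tile coordinates, walking the
    grid in column-major groups of `group_m` rows to improve L2 reuse."""
    blocks_per_group = group_m * num_n
    group = bid // blocks_per_group
    first_m = group * group_m
    rows = min(num_m - first_m, group_m)  # last group may be short
    m = first_m + (bid % blocks_per_group) % rows
    n = (bid % blocks_per_group) // rows
    return m, n

# Every block id maps to a unique tile of a 6 x 4 grid:
tiles = {remap_grouped(b, 6, 4, 2) for b in range(6 * 4)}
assert tiles == {(m, n) for m in range(6) for n in range(4)}
```

Grouping a few consecutive rows before moving across columns keeps blocks that share operand tiles resident in L2 at the same time.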

## Suitability Gate

Do not expect cuTile to express explicit intra-CTA scheduling. The public execution model does not expose thread, warp, or warpgroup identity, and does not allow explicit synchronization or communication within a block. Likewise, `num_ctas` does not give you a public cluster programming model: there is no exposed cluster rank, cluster barrier, DSM primitive, or cluster memory scope.

This becomes a hard design constraint when profiling shows the kernel is pinned to one CTA per SM. In that regime, the usual recovery path in state-of-the-art kernels is explicit scheduling inside the CTA: warpgroup seesaw schedules, producer/consumer pipelines, barrier choreography, or cluster-cooperative overlap. If the optimization you need falls in that category, stop iterating inside cuTile and switch to CuTe DSL (`/design-cute-dsl-kernel`).

The FlashMLA devlog (`docs/kernels/flash-mla.md`) is the concrete example: the official FlashMLA design remains productive at one resident block per SM because it uses explicit scheduling; cuTile FlashMLA versions cannot reproduce that behavior.

## Naming And Layout

- Public language key: `cutile-dsl`
- Python package path: `cutile`
- Kernel layout: `src/mla_var3/kernel/cutile/<layer>/<design>/<design>[_vN]/`
- Module file: `<design>[_vN].py`
- Main runtime wrapper: `CtKernel`

## CtKernel Runtime Pattern

`CtKernel(Kernel)` wraps cuTile kernels with `grid_fn` and `args_fn` callables:

```python
import math
from dataclasses import dataclass, field

import torch

from mla_var3.runtime import CtKernel, KernelPlan, Tiling, ConstInt, ConstBool
# `ct` is the cuTile module (import path omitted; see /cutile-dsl-ref)

@ct.kernel
def my_kernel(Tensor, ..., Bm: ConstInt, Bn: ConstInt, EVEN_N: ConstBool):
  bid_x = ct.bid(0)
  # block-level operations: ct.load, ct.mma, ct.store, ct.reshape, ...

@dataclass
class CtMyKernel(KernelPlan):
  b: int = 64; s: int = 1; t: int = 4096
  h: int = 16  # head count, referenced by grid_fn below
  tiling: MyTiling = field(default_factory=MyTiling)

  def plan(self, *inputs) -> CtKernel:
    Out = torch.empty_like(inputs[0])

    def grid_fn(cfg):
      return (math.ceil(self.s / cfg.Bm), math.ceil(self.h / cfg.Bh), self.b)

    def args_fn(cfg):
      return (inputs[0], Out, cfg.Bm, cfg.Bn, (self.t % cfg.Bn) == 0)

    return CtKernel(
      input_tensors=inputs,
      output_tensors=(Out,),
      kernel_fn=my_kernel,
      grid_fn=grid_fn,
      args_fn=args_fn,
      tiling=self.tiling,
      autotune_configs=self._autotune_configs(),
      algorithmic_flops_bytes_fn=self._algorithmic_flops_bytes,
    )
```
Key fields on `CtKernel`:

| Field | Type | Purpose |
| --- | --- | --- |
| `kernel_fn` | `@ct.kernel` function | The cuTile kernel function |
| `grid_fn` | `Callable[[Tiling], tuple]` | Maps tiling config to 3D grid dimensions |
| `args_fn` | `Callable[[Tiling], tuple]` | Maps tiling config to kernel arguments |
| `input_tensors` | `tuple[torch.Tensor]` | For autotuning cache key |
| `output_tensors` | `tuple[torch.Tensor]` | Returned by `__call__` |
| `tiling` | `Tiling` | Current tiling config |
| `autotune_configs` | `list[Tiling]` | Search space for autotuning |
| `algorithmic_flops_bytes_fn` | `Callable` | For roofline analysis |
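To make the `grid_fn` contract concrete, here is a hand evaluation with a `SimpleNamespace` standing in for a tiling config; the problem sizes are illustrative, not taken from the project:

```python
# A tiling config maps to a 3D grid: (query tiles, head tiles, batch).
import math
from types import SimpleNamespace

def grid_fn(cfg, s=1, h=16, b=64):
    # mirrors the grid_fn sketched in plan() above
    return (math.ceil(s / cfg.Bm), math.ceil(h / cfg.Bh), b)

cfg = SimpleNamespace(Bm=16, Bh=8)
assert grid_fn(cfg) == (1, 2, 64)            # decode-like: one query tile
assert grid_fn(cfg, s=4096) == (256, 2, 64)  # prefill-like: many tiles
```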

## Autotuning

`CtKernel` autotuning uses `autotune_launch()` from `cuda.tile_experimental`, which handles occupancy/`num_ctas` hints automatically. The `_apply_hints()` method re-applies hints after tuning.

## Compilation

`CtKernel.compile()` uses `compile_tile()` to emit `.bytecode` and optionally translates it to `.mlir` via `cuda-tile-translate`.

## Tiling Dataclass

cuTile tilings typically include tile dimensions plus optional compiler hints:

```python
from dataclasses import dataclass

from mla_var3.runtime import Tiling

@dataclass
class MyTiling(Tiling):
  Bm: int = 16       # query tile size
  Bn: int = 64       # KV tile size
  Bh: int = 8        # heads per block
  num_ctas: int | None = None   # CGA size (None = auto)
  occupancy: int | None = None  # occupancy hint (None = auto)

  def validate(self, pd: "CtMyKernel") -> bool:
    return self.Bm <= pd.s and self.Bn <= pd.t and self.Bh <= pd.h
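A common way to produce the `autotune_configs` search space is a cross-product of candidate tile sizes filtered by `validate()`. A self-contained sketch, with plain dataclasses standing in for the project's `Tiling` and `KernelPlan` (all names and candidate values here are illustrative):

```python
# Build an autotune search space from a tiling dataclass: enumerate
# candidate tile sizes, then keep only configs valid for the problem.
from dataclasses import dataclass
from itertools import product

@dataclass
class MyTiling:
    Bm: int = 16
    Bn: int = 64
    Bh: int = 8

    def validate(self, pd) -> bool:
        return self.Bm <= pd.s and self.Bn <= pd.t and self.Bh <= pd.h

@dataclass
class Problem:  # stand-in for the KernelPlan's problem dimensions
    s: int = 32
    t: int = 4096
    h: int = 16

def autotune_configs(pd: Problem) -> list[MyTiling]:
    """Cross-product of candidate tile sizes, filtered by validate()."""
    cands = [
        MyTiling(Bm, Bn, Bh)
        for Bm, Bn, Bh in product((16, 32, 64), (64, 128), (8, 16))
    ]
    return [c for c in cands if c.validate(pd)]

configs = autotune_configs(Problem())
assert all(c.Bm <= 32 for c in configs)  # Bm=64 > s=32 is filtered out
```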

## cuTile Constant Types

```python
from mla_var3.runtime import ConstInt, ConstBool, ConstFloat, INV_LOG_2
```

- `ConstInt = ct.Constant[int]` — compile-time integer
- `ConstBool = ct.Constant[bool]` — compile-time boolean
- `ConstFloat = ct.Constant[float]` — compile-time float
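The `EVEN_N: ConstBool` parameter in the runtime pattern above shows the intended use: fold a divisibility check into a compile-time flag at `args_fn` time so the compiler can specialize away tail masking. The check itself is just:

```python
# The ConstBool pattern: a divisibility check computed on the host
# becomes a compile-time flag that lets the kernel skip bounds masking
# when the context length divides evenly into KV tiles.
def even_flag(t: int, Bn: int) -> bool:
    return (t % Bn) == 0

assert even_flag(4096, 64)       # full tiles: no tail masking needed
assert not even_flag(4000, 64)   # ragged tail: masked loads required
```

Because the flag is a `ct.Constant`, each distinct value compiles a separate kernel specialization rather than branching at runtime.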

## Common Pitfalls

- Never hallucinate cuTile APIs — always verify with `/cutile-dsl-ref` or the official docs.
- The `@ct.kernel` function name MUST match the module filename.
- Register budget: 255 max per thread; aim for <128 for good occupancy (validate against the active device's SM limits via `docs/devices/` and `src/mla_var3/conf/devices.json`).
- Shared memory: 48 KB/block by default, 99 KB opt-in, 100 KB/SM total.
- When reducing tile sizes for occupancy, verify that compute efficiency doesn't degrade.
- Do not assume `num_ctas` unlocks explicit cluster-cooperative algorithms; in cuTile it is a hint, not a cluster programming interface.
- If NCU shows one CTA per SM and low eligible warps, and the missing fix is explicit warpgroup or barrier scheduling, further cuTile-only tuning is unlikely to close the gap.
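The register and shared-memory budgets above combine into a quick back-of-envelope occupancy estimate. A sketch: the 64 K registers/SM figure is an assumption typical of recent NVIDIA SMs, the shared-memory numbers come from the bullets, and `resident_ctas` is not a project API:

```python
# Estimate resident CTAs per SM from the register and shared-memory
# limits; the real limiter is the minimum of the two.

def resident_ctas(regs_per_thread: int, threads_per_cta: int,
                  smem_per_cta: int,
                  regs_per_sm: int = 65536,
                  smem_per_sm: int = 100 * 1024) -> int:
    """CTAs per SM allowed by register-file and shared-memory budgets."""
    by_regs = regs_per_sm // (regs_per_thread * threads_per_cta)
    by_smem = smem_per_sm // smem_per_cta if smem_per_cta else by_regs
    return min(by_regs, by_smem)

# At 255 regs/thread, a 256-thread CTA is pinned to 1 CTA/SM by the
# register file; below 128 regs/thread, 2 CTAs fit.
assert resident_ctas(255, 256, 48 * 1024) == 1
assert resident_ctas(120, 256, 48 * 1024) == 2
# Opting into 99 KB of shared memory makes smem the limiter instead.
assert resident_ctas(32, 128, 99 * 1024) == 1
```

This is why the register-budget bullet pairs with the one-CTA-per-SM warning: once either budget pins the kernel to one resident CTA, only explicit intra-CTA scheduling recovers overlap, and that is outside cuTile's model.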

## Knowledge Links

- Shared optimization knowledge: `docs/knowledge/optimizations/`
- Shared anti-patterns: `docs/knowledge/anti-patterns/`
- cuTile-specific optimization overlays: `docs/knowledge/languages/cutile-dsl/`
- cuTile-specific optimization catalog: load `/optimization-catalog-cutile-dsl`

## References

- Runtime API details: see the `/design-kernel` skill's runtime-api reference for the complete `CtKernel` API documentation.
- Kernel templates: see the `/design-kernel` skill's kernel-templates reference for copyable template code.
- FlashMLA scheduling limit: see `docs/kernels/flash-mla.md` for the diagnosis comparing cuTile FlashMLA with the official, explicitly scheduled implementation.