design-cutile-dsl-kernel
cuTile Python DSL — Language-Specific Guidance
Prerequisite: Load `/design-kernel` for shared naming, versioning, KernelPlan structure, composition patterns, clone workflow, devlog template, and designer output contract. This skill covers only cuTile-specific runtime patterns and constraints.
When To Use cuTile Python DSL
Stay in cuTile Python DSL (`cutile-dsl`) when the next optimization is still expressible through:

- tile sizes and tensor layout choices,
- CTA remapping and work partitioning via `ct.bid()` / `ct.num_blocks()`,
- occupancy and cluster-size hints (`num_ctas`, `occupancy`),
- compiler guidance such as `latency` hints / `allow_tma`,
- multi-stage composition through `KernelPipeline` / `ConcurrentKernels`.
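One item above, CTA remapping, is pure index arithmetic and can be prototyped outside the kernel. Below is a sketch of a grouped swizzle; the helper name and shapes are hypothetical, and inside a real kernel the linear block id would come from `ct.bid()`:

```python
# Hypothetical sketch: remap a linear block id into a "grouped" (m, n) tile
# order so that consecutive CTAs share the same row group, improving L2 reuse.
# This mirrors the index math an in-kernel ct.bid()-based remap would perform.

def grouped_remap(bid: int, num_m: int, num_n: int, group_m: int):
    """Map a linear bid -> (pid_m, pid_n) in group-major order."""
    blocks_per_group = group_m * num_n
    group = bid // blocks_per_group
    first_m = group * group_m
    size_m = min(num_m - first_m, group_m)  # the last group may be short
    pid_m = first_m + (bid % blocks_per_group) % size_m
    pid_n = (bid % blocks_per_group) // size_m
    return pid_m, pid_n

# Every (m, n) tile is covered exactly once, even with an uneven last group.
cover = {grouped_remap(b, 6, 4, 2) for b in range(6 * 4)}
assert cover == {(m, n) for m in range(6) for n in range(4)}
assert len({grouped_remap(b, 5, 4, 2) for b in range(5 * 4)}) == 20
```

Prototyping the remap this way lets you verify full, duplicate-free tile coverage before paying a compile cycle.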
Suitability Gate
Do not expect cuTile to express explicit intra-CTA scheduling. The public execution model does not expose thread, warp, or warpgroup identity, and does not allow explicit synchronization or communication within a block. Likewise, `num_ctas` does not give you a public cluster programming model: there is no exposed cluster rank, cluster barrier, DSM primitive, or cluster memory scope.

This becomes a hard design constraint when profiling shows the kernel is pinned to one CTA per SM. In that regime, the usual recovery path in state-of-the-art kernels is explicit scheduling inside the CTA: warpgroup seesaw schedules, producer/consumer pipelines, barrier choreography, or cluster-cooperative overlap. If the optimization you need falls in that category, stop iterating inside cuTile and switch to CuTe DSL (`/design-cute-dsl-kernel`).

The FlashMLA devlog (`docs/kernels/flash-mla.md`) is the concrete example: the official FlashMLA design remains productive at one resident block per SM because it uses explicit scheduling; cuTile FlashMLA versions cannot reproduce that behavior.
Naming And Layout
- Public language key: `cutile-dsl`
- Python package path: `cutile`
- Kernel layout: `src/mla_var3/kernel/cutile/<layer>/<design>/<design>[_vN]/`
- Module file: `<design>[_vN].py`
- Main runtime wrapper: `CtKernel`
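The layout convention above can be mechanized. The helper below is purely illustrative (the repo may derive paths differently), and the `"attn"` / `"flash_mla"` example values are hypothetical:

```python
from pathlib import Path
from typing import Optional

# Illustrative helper encoding the layout convention:
#   src/mla_var3/kernel/cutile/<layer>/<design>/<design>[_vN]/<design>[_vN].py
def kernel_module_path(layer: str, design: str, version: Optional[int] = None) -> Path:
    d = f"{design}_v{version}" if version is not None else design
    return Path("src/mla_var3/kernel/cutile") / layer / design / d / f"{d}.py"

assert kernel_module_path("attn", "flash_mla", 3).as_posix() == \
    "src/mla_var3/kernel/cutile/attn/flash_mla/flash_mla_v3/flash_mla_v3.py"
assert kernel_module_path("attn", "decode").name == "decode.py"
```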
CtKernel Runtime Pattern
`CtKernel(Kernel)` is constructed with `grid_fn` and `args_fn` callables:

```python
import math
from dataclasses import dataclass, field

import torch

from mla_var3.runtime import CtKernel, KernelPlan, Tiling, ConstInt, ConstBool

# `ct` is the cuTile module; see /cutile-dsl-ref for the canonical import.

@ct.kernel
def my_kernel(Tensor, ..., Bm: ConstInt, Bn: ConstInt, EVEN_N: ConstBool):
    bid_x = ct.bid(0)
    # block-level operations: ct.load, ct.mma, ct.store, ct.reshape, ...

@dataclass
class CtMyKernel(KernelPlan):
    b: int = 64
    s: int = 1
    t: int = 4096
    h: int = 128  # head count; referenced by grid_fn below
    tiling: MyTiling = field(default_factory=MyTiling)

    def plan(self, *inputs) -> CtKernel:
        Out = torch.empty_like(inputs[0])

        def grid_fn(cfg):
            return (math.ceil(self.s / cfg.Bm), math.ceil(self.h / cfg.Bh), self.b)

        def args_fn(cfg):
            return (inputs[0], Out, cfg.Bm, cfg.Bn, (self.t % cfg.Bn) == 0)

        return CtKernel(
            input_tensors=inputs,
            output_tensors=(Out,),
            kernel_fn=my_kernel,
            grid_fn=grid_fn,
            args_fn=args_fn,
            tiling=self.tiling,
            autotune_configs=self._autotune_configs(),
            algorithmic_flops_bytes_fn=self._algorithmic_flops_bytes,
        )
```

Key fields on `CtKernel`:
| Field | Purpose |
|---|---|
| `kernel_fn` | The cuTile kernel function |
| `grid_fn` | Maps tiling config to 3D grid dimensions |
| `args_fn` | Maps tiling config to kernel arguments |
| `input_tensors` | For autotuning cache key |
| `output_tensors` | Returned by … |
| `tiling` | Current tiling config |
| `autotune_configs` | Search space for autotuning |
| `algorithmic_flops_bytes_fn` | For roofline analysis |
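Because `grid_fn` and `args_fn` are plain Python, their output can be sanity-checked without a GPU. A minimal sketch with stand-in problem sizes (the numbers below are hypothetical examples, not defaults from the runtime):

```python
import math
from types import SimpleNamespace

# Stand-in for a tiling config; in the real flow this is the Tiling dataclass.
cfg = SimpleNamespace(Bm=16, Bn=64, Bh=8)

s, h, b, t = 1, 128, 64, 4096  # hypothetical problem sizes

def grid_fn(cfg):
    # Same shape as the plan() example: (query tiles, head tiles, batch).
    return (math.ceil(s / cfg.Bm), math.ceil(h / cfg.Bh), b)

grid = grid_fn(cfg)
assert grid == (1, 16, 64)   # ceil(1/16)=1, ceil(128/8)=16, batch=64
assert (t % cfg.Bn) == 0     # the EVEN_N specialization flag would be True
```

Asserting the grid and the `EVEN_N` flag up front catches tiling/problem-size mismatches before they surface as silent partial coverage in the kernel.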
Autotuning
`CtKernel` autotunes via `autotune_launch()`, with config hints applied through `cuda.tile_experimental`'s `_apply_hints()`.

Compilation
`CtKernel.compile()` wraps `compile_tile()`; the emitted `.bytecode` can be translated to `.mlir` with `cuda-tile-translate`.

Tiling Dataclass
cuTile tilings typically include tile dimensions plus optional compiler hints:

```python
from dataclasses import dataclass
from typing import Optional

from mla_var3.runtime import Tiling

@dataclass
class MyTiling(Tiling):
    Bm: int = 16                     # query tile size
    Bn: int = 64                     # KV tile size
    Bh: int = 8                      # heads per block
    num_ctas: Optional[int] = None   # CGA size (None = auto)
    occupancy: Optional[int] = None  # occupancy hint (None = auto)

    def validate(self, pd: "CtMyKernel") -> bool:
        return self.Bm <= pd.s and self.Bn <= pd.t and self.Bh <= pd.h
```
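The search space handed to autotuning is typically just a cross product of candidate tile sizes filtered through `validate()`. A self-contained sketch of that pattern, using stand-in classes rather than the runtime's `Tiling`/`KernelPlan`:

```python
import itertools
from dataclasses import dataclass

# Stand-in config class, to show only the generate-then-filter pattern.
@dataclass
class Cfg:
    Bm: int
    Bn: int

    def validate(self, s: int, t: int) -> bool:
        # Tiles larger than the problem are wasted work; reject them.
        return self.Bm <= s and self.Bn <= t

def autotune_configs(s: int, t: int):
    candidates = (Cfg(bm, bn)
                  for bm, bn in itertools.product((16, 32, 64), (64, 128, 256)))
    return [c for c in candidates if c.validate(s, t)]

cfgs = autotune_configs(s=32, t=128)
assert all(c.validate(32, 128) for c in cfgs)
assert len(cfgs) == 4  # Bm in {16, 32} x Bn in {64, 128}
```

Filtering at generation time keeps invalid configs out of the autotuner entirely, rather than letting them fail (or silently underperform) at launch.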
cuTile Constant Types
```python
from mla_var3.runtime import ConstInt, ConstBool, ConstFloat, INV_LOG_2
```

- `ConstInt = ct.Constant[int]` — compile-time integer
- `ConstBool = ct.Constant[bool]` — compile-time boolean
- `ConstFloat = ct.Constant[float]` — compile-time float
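Because these arguments are compile-time constants, each distinct value tuple produces its own compiled variant. The memoization sketch below is hypothetical (not the runtime's actual cache), showing only the keying behavior:

```python
# Hypothetical sketch: compile-time arguments become part of the variant key,
# so each (Bm, Bn, EVEN_N) tuple is compiled (here: recorded) exactly once.
compiled = {}

def get_variant(Bm: int, Bn: int, even_n: bool) -> str:
    key = (Bm, Bn, even_n)
    if key not in compiled:
        compiled[key] = f"my_kernel[Bm={Bm},Bn={Bn},EVEN_N={even_n}]"
    return compiled[key]

a = get_variant(16, 64, True)
b = get_variant(16, 64, True)    # cache hit: same variant
c = get_variant(16, 128, False)  # new specialization
assert a is b and a != c
assert len(compiled) == 2
```

This is why a wide autotuning sweep over constant arguments carries a compilation cost: every new constant tuple is a new kernel.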
Common Pitfalls
- Never hallucinate cuTile APIs — always verify with `/cutile-dsl-ref` or the official docs
- The `@ct.kernel` function name MUST match the module filename
- Register budget: 255 max/thread, aim <128 for good occupancy (validate against the active device's SM limits via `docs/devices/` and `src/mla_var3/conf/devices.json`)
- Shared memory: 48KB/block default, 99KB opt-in, 100KB/SM total
- When reducing tile sizes for occupancy, verify compute efficiency doesn't degrade
- Do not assume `num_ctas` unlocks explicit cluster-cooperative algorithms; in cuTile it is a hint, not a cluster programming interface
- If NCU shows one CTA per SM, low eligible warps, and the missing fix is explicit warpgroup or barrier scheduling, further cuTile-only tuning is unlikely to close the gap
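The budgets above can be turned into a quick feasibility check. The 64K-registers-per-SM figure below is an assumption (common on recent NVIDIA SMs) and must be confirmed against `docs/devices/` for the active device:

```python
# Rough residency estimate from the budgets listed above.
REGS_PER_SM = 64 * 1024        # ASSUMPTION: verify against docs/devices/
SMEM_PER_SM = 100 * 1024       # 100KB/SM total (from this section)
SMEM_OPT_IN_MAX = 99 * 1024    # 99KB/block opt-in ceiling

def blocks_per_sm(regs_per_thread: int, threads_per_block: int,
                  smem_per_block: int) -> int:
    assert regs_per_thread <= 255            # hard per-thread cap
    assert smem_per_block <= SMEM_OPT_IN_MAX
    by_regs = REGS_PER_SM // (regs_per_thread * threads_per_block)
    by_smem = SMEM_PER_SM // smem_per_block if smem_per_block else by_regs
    return min(by_regs, by_smem)

# At 255 regs/thread and 256 threads, only one block fits per SM:
assert blocks_per_sm(255, 256, 48 * 1024) == 1
# Dropping below 128 regs/thread doubles residency:
assert blocks_per_sm(127, 256, 48 * 1024) == 2
```

This makes the "aim <128 regs/thread" guidance concrete: at 256 threads per block, crossing that line is exactly what allows a second resident block.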
Knowledge Links
- Shared optimization knowledge: `docs/knowledge/optimizations/`
- Shared anti-patterns: `docs/knowledge/anti-patterns/`
- cuTile-specific optimization overlays: `docs/knowledge/languages/cutile-dsl/`
- cuTile-specific optimization catalog: load `/optimization-catalog-cutile-dsl`
References
- Runtime API details: see the `/design-kernel` skill's runtime-api reference for the complete CtKernel API documentation
- Kernel templates: see the `/design-kernel` skill's kernel-templates reference for copyable template code
- FlashMLA scheduling limit: see `docs/kernels/flash-mla.md` for the diagnosis comparing cuTile FlashMLA with the official explicitly scheduled implementation