design-kernel

Kernel Design — Shared Workflow

This skill contains everything that is common across all supported DSLs. Once the implementation language is known, also load the matching language-specific design skill for DSL-specific runtime patterns, pitfalls, and API guidance.

Language Selection

| Language key | Python package path | Design skill | API reference skill | Use when |
|---|---|---|---|---|
| `cutile-dsl` | `cutile` | `/design-cutile-dsl-kernel` | `/cutile-dsl-ref` | Block-level control, tiling, CTA remapping, and compiler hints are sufficient |
| `cute-dsl` | `cute_python` | `/design-cute-dsl-kernel` | `/cute-dsl-ref` | Explicit thread/warp scheduling, TMA pipelines, or shared memory control is needed |

Naming Conventions

  • Public-facing names, docs, skills, and knowledge-base entries use kebab-case language keys (e.g., `cute-dsl`).
  • Python packages use underscores where required (e.g., `cute_python`).
  • Kernel package nesting: `src/mla_var3/kernel/<lang_pkg>/mla/<design>/...`.
  • The shortcut CLI `python -m mla_var3.kernel <kernel> [<version>]` is the preferred user-facing entry point.
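
The naming rules above can be sketched as a small lookup. The mapping comes from the Language Selection table; the helper names `kernel_package_dir` and `shortcut_command` are hypothetical and are not part of the repository as far as this document shows.

```python
from pathlib import PurePosixPath
from typing import Optional

# Kebab-case language key -> Python package name (from the Language Selection table).
LANG_KEY_TO_PKG = {
    "cutile-dsl": "cutile",
    "cute-dsl": "cute_python",
}


def kernel_package_dir(lang_key: str, design: str) -> PurePosixPath:
    """Return the source directory of a design (hypothetical helper)."""
    lang_pkg = LANG_KEY_TO_PKG[lang_key]  # raises KeyError for an unknown key
    return PurePosixPath("src/mla_var3/kernel") / lang_pkg / "mla" / design


def shortcut_command(design: str, version: Optional[str] = None) -> str:
    """Build the shortcut CLI invocation for a design and optional version."""
    parts = ["python", "-m", "mla_var3.kernel", design]
    if version is not None:
        parts.append(version)
    return " ".join(parts)
```

Keeping the kebab-case key and the package name in one table avoids mixing the two spellings in paths and docs.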

Kernel Structure and Versioning

We use a nested structure for kernel packages inside the `kernel` sub-package:
  1. Language level: which DSL the kernel is implemented in (e.g., `kernel.cutile`, `kernel.cute_python`).
  2. Layer level: which model layer the kernel targets (e.g., `kernel.cutile.mla`).
  3. Design level: the kernel design (e.g., `kernel.cutile.mla.flash_mla`).
  4. Version level: the kernel version (e.g., `kernel.cutile.mla.flash_mla.flash_mla_v2`). The first version has no suffix and is called the "base version".
  5. Entry module: inside the version package, a module named after the version's full name (e.g., `flash_mla_v2.py`) must contain the `KernelPlan` subclass.

Package Nesting Pattern

kernel.<lang_pkg>.mla.<design>.<design>[_v<N>].<design>[_v<N>].py

Rules

  • Base version: `<design>/<design>/` (no suffix, aliased as v0)
  • Version N: `<design>/<design>_vN/`
  • Kernel entry function name MUST match the module filename
  • Each version is a sibling package under the design package
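
As a rough illustration of these rules, the version package name and the dotted path of the entry module can be derived from the design name and version number. The function names below are hypothetical; only the naming pattern itself comes from this document.

```python
import re


def version_package(design: str, n: int) -> str:
    """Package name for version n; version 0 is the unsuffixed base version."""
    if n < 0:
        raise ValueError("version number must be non-negative")
    return design if n == 0 else f"{design}_v{n}"


def entry_module(lang_pkg: str, design: str, n: int) -> str:
    """Dotted path of the entry module, which is named after the version."""
    version = version_package(design, n)
    return f"kernel.{lang_pkg}.mla.{design}.{version}.{version}"


def version_number(design: str, version_pkg: str) -> int:
    """Inverse of version_package: recover n from a version package name."""
    if version_pkg == design:
        return 0
    match = re.fullmatch(rf"{re.escape(design)}_v(\d+)", version_pkg)
    if match is None:
        raise ValueError(f"{version_pkg!r} is not a version of {design!r}")
    return int(match.group(1))
```

For example, `entry_module("cutile", "flash_mla", 2)` yields `kernel.cutile.mla.flash_mla.flash_mla_v2.flash_mla_v2`, which corresponds to the file `flash_mla_v2.py` inside the `flash_mla_v2` package.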

CLI Usage

```bash
# Full path
python -m mla_var3.kernel.<lang_pkg>.mla.<design> [<version>] [args]

# Shortcut (discovers across all languages)
python -m mla_var3.kernel <design> [<version>] [args]

# Examples
python -m mla_var3.kernel.cutile.mla.mla_var6_plus v4 --b=32 --s=16 --t=4096
python -m mla_var3.kernel mla_var6_plus v4 --b=32 --s=16 --t=4096
```
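
This document does not show how the shortcut CLI finds a design across languages. One plausible approach, sketched here purely as an assumption, is to scan every language package for a matching `mla/<design>` directory. The function `discover_design` and the `demo` driver are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def discover_design(kernel_root: Path, design: str) -> List[str]:
    """Return the dotted package path of every <lang_pkg>/mla/<design> found."""
    matches = []
    for lang_dir in sorted(p for p in kernel_root.iterdir() if p.is_dir()):
        if (lang_dir / "mla" / design).is_dir():
            matches.append(f"mla_var3.kernel.{lang_dir.name}.mla.{design}")
    return matches


def demo() -> List[str]:
    """Build a throwaway directory tree and run the discovery on it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "cutile" / "mla" / "mla_var6_plus").mkdir(parents=True)
        (root / "cute_python" / "mla" / "flash_mla").mkdir(parents=True)
        return discover_design(root, "mla_var6_plus")
```

If more than one language package contains the same design name, a real CLI would need a rule for choosing between them; this sketch simply returns all matches.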

Version Creation Checklist

  1. Clone the previous version:
     ```bash
     source .venv/bin/activate
     python ./scripts/clone-kernel.py <kernel_full_name> <new_suffix>
     ```
  2. The script rewrites versioned symbols automatically:
    • Python module filenames
    • Decorated kernel function names (`@ct.kernel`, `@cute.kernel`, `@cute.jit`)
    • `KernelPlan` subclass names
    • `Tiling` subclass names
    • Intra-package imports
    • Quoted forward references and embedded package-name string literals
  3. Modify the cloned files to implement the new optimization.
  4. Manual fallback if the script is unavailable:
    • Copy the latest version directory
    • Rename the versioned module files
    • Update class names, function names, imports, and pipeline name strings by hand
  5. Verify correctness:
     ```bash
     source .venv/bin/activate
     python -m mla_var3.kernel.<lang_pkg>.mla.<design> <version> --prof_type=disabled --check
     ```
  6. Update devlog with "What changed" section.
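
To make the rewriting step (and the manual fallback) concrete, here is a minimal sketch of renaming versioned symbols in a source string. It is not the actual `clone-kernel.py`; it assumes class names are the CamelCase form of the snake_case version name (for example `FlashMlaV2` for `flash_mla_v2`), which this document does not confirm.

```python
import re


def to_camel(snake: str) -> str:
    """flash_mla_v2 -> FlashMlaV2 (assumed class-name convention)."""
    return "".join(part.capitalize() for part in snake.split("_"))


def rewrite_versioned_symbols(source: str, old: str, new: str) -> str:
    """Rename snake_case and CamelCase occurrences of a version name."""
    # Do not match inside a longer identifier, and do not match when a digit
    # follows (so flash_mla_v2 is not found inside flash_mla_v20).
    snake = re.compile(rf"(?<![A-Za-z0-9_]){re.escape(old)}(?![0-9])")
    camel = re.compile(rf"(?<![A-Za-z0-9]){re.escape(to_camel(old))}(?![0-9])")
    source = snake.sub(lambda _: new, source)
    return camel.sub(lambda _: to_camel(new), source)
```

Because the match is textual, it also covers quoted forward references and package-name string literals, which is why those appear in the list of rewritten symbols. Always rerun `--check` after any rename, scripted or manual.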

KernelPlan Structure

Every kernel version must implement a `KernelPlan` subclass. The `plan()` method returns a DSL-specific runtime wrapper (see the language-specific skill for the concrete type).

```python
from dataclasses import dataclass, field

# KernelPlan, BenchmarkFn, and MyTiling are project types; their import
# paths depend on the repository layout and are not shown here.


@dataclass
class MyKernel(KernelPlan):
    # Problem dimensions
    b: int = 64
    s: int = 1
    t: int = 4096
    tiling: MyTiling = field(default_factory=MyTiling)

    def prepare_inputs(self, device) -> tuple:
        # Allocate and return input tensors
        ...

    def reference_fn(self, *inputs) -> tuple:
        # Reference implementation for --check
        ...

    def _autotune_configs(self) -> list[MyTiling]:
        # Candidate tiling configs for autotuner search
        ...

    def _algorithmic_flops_bytes(self, tiling) -> tuple[int, int]:
        # Analytical (FLOPs, bytes) for roofline
        ...

    def plan(self, *inputs) -> BenchmarkFn:
        # Build executable runtime object (DSL-specific)
        ...

    def plan_empty(self, peak_tflops, peak_gbps) -> BenchmarkFn:
        # Roofline-only prediction (no real tensors)
        ...
```
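
The pairing of `_algorithmic_flops_bytes` with `plan_empty(peak_tflops, peak_gbps)` suggests a standard roofline estimate. The sketch below shows that arithmetic only; it assumes 1 TFLOP/s = 1e12 FLOP/s and 1 GB/s = 1e9 bytes/s, and the function names are hypothetical rather than part of `KernelPlan`.

```python
def roofline_time_s(flops: int, nbytes: int,
                    peak_tflops: float, peak_gbps: float) -> float:
    """Lower-bound runtime: the slower of the compute and memory limits."""
    compute_s = flops / (peak_tflops * 1e12)  # time if purely compute-bound
    memory_s = nbytes / (peak_gbps * 1e9)     # time if purely bandwidth-bound
    return max(compute_s, memory_s)


def is_memory_bound(flops: int, nbytes: int,
                    peak_tflops: float, peak_gbps: float) -> bool:
    """True when arithmetic intensity falls below the machine's ridge point."""
    intensity = flops / nbytes                          # FLOPs per byte
    ridge = (peak_tflops * 1e12) / (peak_gbps * 1e9)    # FLOPs per byte
    return intensity < ridge
```

With 2e12 FLOPs, 1e9 bytes, and peaks of 1000 TFLOP/s and 1000 GB/s, the compute limit is 2 ms and the memory limit is 1 ms, so the estimate is 2 ms and the kernel is compute-bound.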

Tiling Dataclass

```python
@dataclass
class MyTiling(Tiling):
    # DSL-specific fields; see the language-specific skill for examples

    def validate(self, pd: "MyKernel") -> bool:
        # Return True if this tiling is valid for the given problem dimensions
        ...
```
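
As an illustration of what `validate` might check, the sketch below uses stand-in classes so that it runs on its own. `ProblemDims`, `ExampleTiling`, the `tile_t` field, and the divisibility rule are all assumptions; real tilings subclass `Tiling` and carry DSL-specific fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProblemDims:
    """Stand-in for the dimension fields of a KernelPlan subclass."""
    b: int = 64
    s: int = 1
    t: int = 4096


@dataclass(frozen=True)
class ExampleTiling:
    """Stand-in tiling with one hypothetical field."""
    tile_t: int = 128

    def validate(self, pd: ProblemDims) -> bool:
        # Assumed rule: the tile must evenly divide the t dimension.
        return self.tile_t > 0 and pd.t % self.tile_t == 0


def valid_configs(candidates: List[ExampleTiling],
                  pd: ProblemDims) -> List[ExampleTiling]:
    """Keep only the candidates that are valid for these dimensions."""
    return [c for c in candidates if c.validate(pd)]
```

Filtering the candidate list this way before an autotuner search keeps invalid tilings from being compiled or launched at all.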

Composition Patterns

Sequential pipeline

```python
def plan(self, *inputs) -> KernelPipeline:
    stage1 = stage1_plan.plan(...)
    stage2 = stage2_plan.plan(...)
    return KernelPipeline(_name="my_pipeline", stages=[stage1, stage2])
```

Concurrent stages + sequential combine

```python
def plan(self, *inputs) -> KernelPipeline:
    a = plan_a.plan(...)
    b = plan_b.plan(...)
    concurrent = ConcurrentKernels(
        _name="overlap_group", concurrent_kernels=[a, b],
        validate_joint_tiling_fn=validate_fn,
    )
    combine = combine_plan.plan(...)
    return KernelPipeline(_name="pipeline", stages=[concurrent, combine])
```

Implementation Workflow

  1. Read current kernel source — understand the existing implementation
  2. Read optimization instructions from orchestrator (specific optimizations to apply)
  3. Load the DSL API reference skill — do not hallucinate APIs
  4. Check the DSL suitability gate (in the language-specific skill) before implementing
  5. Clone the current version with `python ./scripts/clone-kernel.py`
  6. Read optimization detail files from `docs/knowledge/` for implementation patterns
  7. Implement changes in the new version files
  8. Test correctness: `--prof_type=disabled --check`
  9. Update devlog with "What changed" and "High-level description"
  10. If committing, use a Conventional Commits message
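
For the last step, a commit header can be checked against the Conventional Commits shape `<type>[(scope)][!]: <description>` before committing. This is a simplified sketch: it accepts any lowercase type and ignores the body and footers, and the helper name `parse_commit_header` is hypothetical.

```python
import re
from typing import Optional

# type, optional (scope), optional "!" for a breaking change, then ": description"
HEADER_RE = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_commit_header(header: str) -> Optional[dict]:
    """Return the parsed parts of a header, or None if it does not conform."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    return {
        "type": match.group("type"),
        "scope": match.group("scope"),
        "breaking": match.group("breaking") is not None,
        "description": match.group("description"),
    }
```

An illustrative header such as `feat(cutile): add flash_mla_v3 entry module` parses with type `feat` and scope `cutile`, while a free-form message like `added new kernel` is rejected.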

Knowledge Base Links

  • Put reusable algorithmic/device/hardware findings in `docs/knowledge/optimizations/` or `docs/knowledge/anti-patterns/`.
  • Put DSL-specific implementation findings in `docs/knowledge/languages/<language>/...`.
  • Use the language-specific optimization catalog skill together with the language-specific design skill.

Development Log Entry Template

Add to `docs/kernels/<kernel>.md` under `## Development log`:

V<N>: [Brief Description]

Location: `src/mla_var3/kernel/<lang_pkg>/mla/<kernel>/<kernel>_v<N>/`
What changed:
  • [Bullet list of changes]
High-level description of main code changes:
  • [Description of optimizations and how they relate to profiling insights]

Performance metrics, bottleneck analysis, issues, and insights are filled by the profiler agent after profiling.
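
The template can also be filled programmatically. The sketch below renders an entry from its fields; the `###` heading level and `-` bullet markers are assumptions (the entry sits under `## Development log`, but the exact markup is not shown here), and `render_devlog_entry` is a hypothetical helper.

```python
from typing import List


def render_devlog_entry(lang_pkg: str, kernel: str, n: int, title: str,
                        changes: List[str], descriptions: List[str]) -> str:
    """Render one development-log entry for version n of a kernel."""
    location = f"src/mla_var3/kernel/{lang_pkg}/mla/{kernel}/{kernel}_v{n}/"
    lines = [
        f"### V{n}: {title}",  # heading level is an assumption
        "",
        f"Location: `{location}`",
        "",
        "What changed:",
    ]
    lines += [f"- {change}" for change in changes]
    lines += ["", "High-level description of main code changes:"]
    lines += [f"- {description}" for description in descriptions]
    return "\n".join(lines) + "\n"
```

The performance sections are deliberately absent from the rendered text, since the profiler agent fills those in after profiling.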

Designer Output Contract

Return results to the orchestrator in this format:

New Version: [kernel] [version]

Changes Applied

  1. [change + rationale]

Files

  • Created: [paths]
  • Modified: [paths]

Correctness: [PASS/FAIL]

Devlog Entry Written: [path]


References

  • Optimization patterns: `docs/knowledge/optimizations/`
  • Anti-patterns: `docs/knowledge/anti-patterns/`
  • Kernel devlogs: `docs/kernels/<kernel>.md`