design-kernel
Kernel Design: Shared Workflow
This skill contains everything that is common across all supported DSLs. Once the implementation language is known, also load the matching language-specific design skill for DSL-specific runtime patterns, pitfalls, and API guidance.
Language Selection
| Language key | Python package path | Design skill | API reference skill | Use when |
|---|---|---|---|---|
| | | | | Block-level control, tiling, CTA remapping, and compiler hints are sufficient |
| | | | | Explicit thread/warp scheduling, TMA pipelines, or shared memory control is needed |
Naming Conventions
- Public-facing names, docs, skills, and knowledge-base entries use kebab-case language keys (e.g., `cute-dsl`).
- Python packages use underscores where required (e.g., `cute_python`).
- Kernel package nesting: `src/mla_var3/kernel/<lang_pkg>/mla/<design>/...`.
- The shortcut CLI `python -m mla_var3.kernel <kernel> [<version>]` is the preferred user-facing entry point.
Kernel Structure and Versioning
We use a nested structure for kernel packages inside the `kernel` sub-package:
- Language level: which DSL the kernel is implemented in (e.g., `kernel.cutile`, `kernel.cute_python`).
- Layer level: which model layer the kernel targets (e.g., `kernel.cutile.mla`).
- Design level: the kernel design (e.g., `kernel.cutile.mla.flash_mla`).
- Version level: the kernel version (e.g., `kernel.cutile.mla.flash_mla.flash_mla_v2`). The first version has no suffix and is called the "base version".
- Entry module: inside the version package, a module named after the version's full name (e.g., `flash_mla_v2.py`) must contain the `KernelPlan` subclass.
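The four levels above can be read straight off a dotted package name. The following is a minimal sketch, assuming the name always has exactly five components starting with `kernel`; `KernelLevels` and `split_kernel_package` are hypothetical helpers for illustration and are not part of the project.

```python
from typing import NamedTuple


class KernelLevels(NamedTuple):
    language: str  # e.g. "cutile"
    layer: str     # e.g. "mla"
    design: str    # e.g. "flash_mla"
    version: str   # e.g. "flash_mla_v2"; equals the design name for the base version


def split_kernel_package(dotted: str) -> KernelLevels:
    """Split 'kernel.<lang>.<layer>.<design>.<version>' into its levels."""
    parts = dotted.split(".")
    if len(parts) != 5 or parts[0] != "kernel":
        raise ValueError(f"expected kernel.<lang>.<layer>.<design>.<version>, got {dotted!r}")
    _, language, layer, design, version = parts
    # A version package is named after its design, optionally with a suffix.
    if not version.startswith(design):
        raise ValueError(f"version package {version!r} is not derived from design {design!r}")
    return KernelLevels(language, layer, design, version)
```

For example, `split_kernel_package("kernel.cutile.mla.flash_mla.flash_mla_v2")` yields the language `cutile`, layer `mla`, design `flash_mla`, and version `flash_mla_v2`.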
Package Nesting Pattern
```
kernel.<lang_pkg>.mla.<design>.<design>[_v<N>].<design>[_v<N>].py
```
Rules
- Base version: `<design>/<design>/` (no suffix, aliased as v0)
- Version N: `<design>/<design>_vN/`
- Kernel entry function name MUST match the module filename
- Each version is a sibling package under the design package
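The rules above are mechanical enough to express as code. The sketch below builds the entry module path for a given version and checks the "function name matches module filename" rule. The helper names are hypothetical, and the `src/mla_var3/kernel` prefix is taken from the naming conventions in this skill.

```python
from pathlib import Path


def version_package(design: str, version: int = 0) -> str:
    """Base version (v0) has no suffix; version N is '<design>_vN'."""
    return design if version == 0 else f"{design}_v{version}"


def entry_module_file(lang_pkg: str, design: str, version: int = 0) -> Path:
    """Path of the entry module: <design>/<version pkg>/<version pkg>.py."""
    pkg = version_package(design, version)
    return Path("src/mla_var3/kernel") / lang_pkg / "mla" / design / pkg / f"{pkg}.py"


def entry_name_matches(module_file: Path, function_name: str) -> bool:
    """The kernel entry function must be named after the module file."""
    return module_file.stem == function_name
```

A check like `entry_name_matches` could run in a pre-commit hook or a unit test, so a renamed module with a stale function name is caught before profiling.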
CLI Usage
```bash
# Full path
python -m mla_var3.kernel.<lang_pkg>.mla.<design> [<version>] [args]

# Shortcut (discovers across all languages)
python -m mla_var3.kernel <design> [<version>] [args]

# Examples
python -m mla_var3.kernel.cutile.mla.mla_var6_plus v4 --b=32 --s=16 --t=4096
python -m mla_var3.kernel mla_var6_plus v4 --b=32 --s=16 --t=4096
```
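To make "discovers across all languages" concrete, here is one way such a lookup could work: scan every language package under the kernel root for a matching design and version directory. This is only a sketch under that assumption. The real shortcut CLI's discovery logic is not shown in this skill, and `discover` and `version_dir_name` are hypothetical names.

```python
from pathlib import Path
from typing import List, Optional


def version_dir_name(design: str, version: Optional[str]) -> str:
    """Map a CLI version token (e.g. 'v4') to a package directory name."""
    if version in (None, "v0"):  # the base version has no suffix and is aliased as v0
        return design
    return f"{design}_{version}"


def discover(kernel_root: Path, design: str, version: Optional[str] = None) -> List[Path]:
    """Return every <lang_pkg>/mla/<design>/<version dir> that exists under kernel_root."""
    target = version_dir_name(design, version)
    matches = []
    for lang_dir in sorted(p for p in kernel_root.iterdir() if p.is_dir()):
        candidate = lang_dir / "mla" / design / target
        if candidate.is_dir():
            matches.append(candidate)
    return matches
```

If the same design exists in more than one language, a lookup like this returns several matches, and the caller would need a tie-breaking rule or the full-path form.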
Version Creation Checklist
- Clone the previous version:
  ```bash
  source .venv/bin/activate
  python ./scripts/clone-kernel.py <kernel_full_name> <new_suffix>
  ```
- The script rewrites versioned symbols automatically:
  - Python module filenames
  - Decorated kernel function names (`@ct.kernel`, `@cute.kernel`, `@cute.jit`)
  - `KernelPlan` subclass names
  - `Tiling` subclass names
  - Intra-package imports
  - Quoted forward references and embedded package-name string literals
- Modify the cloned files to implement the new optimization.
- Manual fallback if the script is unavailable:
  - Copy the latest version directory
  - Rename the versioned module files
  - Update class names, function names, imports, and pipeline name strings by hand
- Verify correctness:
  ```bash
  source .venv/bin/activate
  python -m mla_var3.kernel.<lang_pkg>.mla.<design> <version> --prof_type=disabled --check
  ```
- Update devlog with "What changed" section.
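For the manual fallback, the symbol renames can be approximated with plain text substitution. The sketch below is not the logic of `clone-kernel.py`, which is not shown here. It assumes class names are the CamelCase form of the snake_case version name (e.g. `flash_mla_v2` becomes `FlashMlaV2`), and both helper names are hypothetical.

```python
def camel(name: str) -> str:
    """snake_case -> CamelCase, e.g. 'flash_mla_v2' -> 'FlashMlaV2'."""
    return "".join(part.capitalize() for part in name.split("_"))


def rewrite_versioned_symbols(source: str, old: str, new: str) -> str:
    """Rewrite snake_case and CamelCase occurrences of a versioned name.

    Plain substring replacement: this also touches string literals and quoted
    forward references, but it cannot tell 'x_v1' apart from the prefix of
    'x_v10', so review the diff after running it.
    """
    source = source.replace(old, new)
    return source.replace(camel(old), camel(new))
```

Cloning from a base version (where the old name equals the design name) needs more care, because the design package name itself would also match the substitution.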
KernelPlan Structure
Every kernel version must implement a `KernelPlan` subclass. The `plan()` method returns a DSL-specific runtime wrapper (see the language-specific skill for the concrete type).
```python
from dataclasses import dataclass, field

# KernelPlan, BenchmarkFn, and MyTiling come from the project;
# their import paths are not shown in this skill.

@dataclass
class MyKernel(KernelPlan):
    b: int = 64; s: int = 1; t: int = 4096  # problem dimensions
    tiling: MyTiling = field(default_factory=MyTiling)

    def prepare_inputs(self, device) -> tuple:
        ...  # Allocate and return input tensors

    def reference_fn(self, *inputs) -> tuple:
        ...  # Reference implementation for --check

    def _autotune_configs(self) -> list[MyTiling]:
        ...  # Candidate tiling configs for autotuner search

    def _algorithmic_flops_bytes(self, tiling) -> tuple[int, int]:
        ...  # Analytical (FLOPs, bytes) for roofline

    def plan(self, *inputs) -> BenchmarkFn:
        ...  # Build executable runtime object (DSL-specific)

    def plan_empty(self, peak_tflops, peak_gbps) -> BenchmarkFn:
        ...  # Roofline-only prediction (no real tensors)
```
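The `(FLOPs, bytes)` pair from `_algorithmic_flops_bytes` is what a roofline estimate needs. As a rough sketch of the standard roofline bound (not the project's `plan_empty` implementation, which is not shown here), the predicted time is the larger of the compute time and the memory time. The helpers below assume `peak_tflops` is in TFLOP/s and `peak_gbps` is in GB/s; both function names are hypothetical.

```python
def roofline_time_s(flops: int, bytes_moved: int, peak_tflops: float, peak_gbps: float) -> float:
    """Lower bound on runtime: whichever of compute or memory takes longer."""
    compute_s = flops / (peak_tflops * 1e12)    # assumes TFLOP/s
    memory_s = bytes_moved / (peak_gbps * 1e9)  # assumes GB/s (bytes, not bits)
    return max(compute_s, memory_s)


def bound_by(flops: int, bytes_moved: int, peak_tflops: float, peak_gbps: float) -> str:
    """Report which resource limits the kernel under the roofline model."""
    compute_s = flops / (peak_tflops * 1e12)
    memory_s = bytes_moved / (peak_gbps * 1e9)
    return "compute" if compute_s >= memory_s else "memory"
```

A kernel that reports `"memory"` here gains little from more math throughput, so tiling changes that cut bytes moved are the ones worth trying first.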
Tiling Dataclass
```python
from dataclasses import dataclass

# Tiling is the project's base class; its import path is not shown in this skill.

@dataclass
class MyTiling(Tiling):
    # DSL-specific fields: see the language-specific skill for examples

    def validate(self, pd: "MyKernel") -> bool:
        # Return True if this tiling is valid for the given problem dimensions
        ...
```
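To illustrate what a `validate` method might check, the sketch below uses standalone stand-ins rather than the project's `Tiling` and `KernelPlan` base classes. The `tile_t` field and the divisibility constraint are assumptions for illustration; real fields and constraints are DSL-specific.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Problem:
    """Stand-in for the problem dimensions held by a KernelPlan subclass."""
    b: int = 64
    s: int = 1
    t: int = 4096


@dataclass(frozen=True)
class ExampleTiling:
    tile_t: int = 128  # hypothetical field: tile size along the t dimension

    def validate(self, pd: Problem) -> bool:
        # Illustrative constraint: the tile must fit in t and divide it evenly.
        return 0 < self.tile_t <= pd.t and pd.t % self.tile_t == 0


def valid_configs(candidates: List[ExampleTiling], pd: Problem) -> List[ExampleTiling]:
    """Keep only candidates that are valid for this problem."""
    return [c for c in candidates if c.validate(pd)]
```

Filtering candidates this way before an autotuner search avoids launching configurations that cannot work for the given problem dimensions.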
Composition Patterns
Sequential pipeline
```python
def plan(self, *inputs) -> KernelPipeline:
    stage1 = stage1_plan.plan(...)
    stage2 = stage2_plan.plan(...)
    return KernelPipeline(_name="my_pipeline", stages=[stage1, stage2])
```
Concurrent stages + sequential combine
```python
def plan(self, *inputs) -> KernelPipeline:
    a = plan_a.plan(...)
    b = plan_b.plan(...)
    concurrent = ConcurrentKernels(
        _name="overlap_group", concurrent_kernels=[a, b],
        validate_joint_tiling_fn=validate_fn,
    )
    combine = combine_plan.plan(...)
    return KernelPipeline(_name="pipeline", stages=[concurrent, combine])
```
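The ordering guarantee behind this pattern is that the combine stage starts only after every member of the concurrent group has finished. The toy model below shows that guarantee with CPU threads. `ToyConcurrent` and `ToyPipeline` are hypothetical stand-ins, not the project's `ConcurrentKernels` or `KernelPipeline`, whose GPU-side implementation is not shown in this skill.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyConcurrent:
    name: str
    members: List[Callable[[], None]]

    def __call__(self) -> None:
        # Launch all members together, then wait for every one of them.
        with ThreadPoolExecutor(max_workers=len(self.members)) as pool:
            futures = [pool.submit(member) for member in self.members]
            for future in futures:
                future.result()  # blocks until done and re-raises failures


@dataclass
class ToyPipeline:
    name: str
    stages: List[Callable[[], None]]

    def __call__(self) -> None:
        # Stages run strictly one after another.
        for stage in self.stages:
            stage()


def run_demo() -> List[str]:
    log: List[str] = []
    group = ToyConcurrent("overlap_group", [lambda: log.append("a"), lambda: log.append("b")])
    pipeline = ToyPipeline("pipeline", [group, lambda: log.append("combine")])
    pipeline()
    return log
```

In `run_demo`, the entries `a` and `b` may appear in either order, but `combine` is always last.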
Implementation Workflow
- Read current kernel source — understand the existing implementation
- Read optimization instructions from orchestrator (specific optimizations to apply)
- Load the DSL API reference skill — do not hallucinate APIs
- Check the DSL suitability gate (in the language-specific skill) before implementing
- Clone the current version with `python ./scripts/clone-kernel.py`
- Read optimization detail files from `docs/knowledge/` for implementation patterns
- Implement changes in the new version files
- Test correctness: `--prof_type=disabled --check`
- Update devlog with "What changed" and "High-level description"
- If committing, use a Conventional Commits message
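For the last step, a Conventional Commits subject line has the shape `<type>[optional scope]: <description>`. The sketch below formats and checks such a subject line. The helper names are hypothetical, and the choice of scope (for example the language package) is an assumption, not a project rule.

```python
import re
from typing import Optional

# <type>[(scope)][!]: <description>
_SUBJECT = re.compile(r"^[a-z]+(\([^()\s]+\))?!?: \S.*$")


def commit_subject(type_: str, description: str, scope: Optional[str] = None) -> str:
    """Format a Conventional Commits subject line."""
    prefix = f"{type_}({scope})" if scope else type_
    return f"{prefix}: {description}"


def is_conventional(subject: str) -> bool:
    """Check only the subject-line shape, not the body or footers."""
    return _SUBJECT.match(subject) is not None
```

For instance, `commit_subject("feat", "add flash_mla_v3", scope="cutile")` produces `feat(cutile): add flash_mla_v3`.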
Knowledge Base Links
- Put reusable algorithmic/device/hardware findings in `docs/knowledge/optimizations/` or `docs/knowledge/anti-patterns/`.
- Put DSL-specific implementation findings in `docs/knowledge/languages/<language>/...`.
- Use the language-specific optimization catalog skill together with the language-specific design skill.
Development Log Entry Template
Add to `docs/kernels/<kernel>.md` under `## Development log`:
```markdown
V<N>: [Brief Description]

Location: `src/mla_var3/kernel/<lang_pkg>/mla/<kernel>/<kernel>_v<N>/`

What changed:
- [Bullet list of changes]

High-level description of main code changes:
- [Description of optimizations and how they relate to profiling insights]

Performance metrics, bottleneck analysis, issues, and insights are filled by the profiler agent after profiling.
```
Designer Output Contract
Return results to the orchestrator in this format:
```markdown
New Version: [kernel] [version]

Changes Applied
- [change + rationale]

Files
- Created: [paths]
- Modified: [paths]

Correctness: [PASS/FAIL]
Devlog Entry Written: [path]
```
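If the report is assembled programmatically, a small dataclass keeps the fields of the contract explicit. This is a sketch only: `DesignerResult` is a hypothetical helper, and because the heading markup of the template is not visible in this text, the renderer emits plain lines.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DesignerResult:
    kernel: str
    version: str
    changes: List[str]   # each entry: "change + rationale"
    created: List[str]
    modified: List[str]
    correct: bool        # outcome of the --check run
    devlog_path: str

    def render(self) -> str:
        """Render the report in the field order of the output contract."""
        lines = [f"New Version: {self.kernel} {self.version}", "", "Changes Applied"]
        lines += [f"- {change}" for change in self.changes]
        lines += [
            "",
            "Files",
            f"- Created: {', '.join(self.created) or 'none'}",
            f"- Modified: {', '.join(self.modified) or 'none'}",
            "",
            f"Correctness: {'PASS' if self.correct else 'FAIL'}",
            f"Devlog Entry Written: {self.devlog_path}",
        ]
        return "\n".join(lines)
```

Deriving `PASS`/`FAIL` from a boolean keeps the correctness field tied to the actual check result rather than to free-form text.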
References
- Optimization patterns: `docs/knowledge/optimizations/`
- Anti-patterns: `docs/knowledge/anti-patterns/`
- Kernel devlogs: `docs/kernels/<kernel>.md`