Shared kernel design workflow across all supported languages and DSLs. Provides language selection table, naming conventions, versioning rules, KernelPlan structure, composition patterns, clone workflow, implementation workflow, devlog template, and designer output contract. Use when: (1) choosing which language-specific kernel design skill to load, (2) the intended implementation language is not fixed yet, (3) you need naming or versioning guidance before selecting a DSL, (4) you are implementing any kernel regardless of DSL, (5) you are updating docs that refer to kernel design skills.
```bash
npx skill4agent add pepperu96/hyper-mla design-kernel
```

| Language key | Python package path | Design skill | API reference skill | Use when |
|---|---|---|---|---|
| `cutile` | `mla_var3.kernel.cutile` | | | Block-level control, tiling, CTA remapping, compiler hints are sufficient |
| `cute-dsl` | `mla_var3.kernel.cute_python` | | | Explicit thread/warp scheduling, TMA pipelines, shared memory control needed |
Note the mapping: the `cutile` language key uses the `cutile` package, while the `cute-dsl` key uses the `cute_python` package.

## Naming conventions

Kernel code lives under `src/mla_var3/kernel/<lang_pkg>/mla/<design>/...` and runs via `python -m mla_var3.kernel <kernel> [<version>]`. The package hierarchy nests as `kernel` → `kernel.cutile` / `kernel.cute_python` → `kernel.cutile.mla` → `kernel.cutile.mla.flash_mla` → `kernel.cutile.mla.flash_mla.flash_mla_v2` (file `flash_mla_v2.py`).

Each version's `KernelPlan` subclass is importable as `kernel.<lang_pkg>.mla.<design>.<design>[_v<N>]`, defined in `<design>[_v<N>].py`. The base version lives in the `<design>/` directory; version N lives in a sibling `<design>_vN/` directory.

## Running kernels

```bash
# Full path
python -m mla_var3.kernel.<lang_pkg>.mla.<design> [<version>] [args]

# Shortcut (discovers across all languages)
python -m mla_var3.kernel <design> [<version>] [args]

# Examples
python -m mla_var3.kernel.cutile.mla.mla_var6_plus v4 --b=32 --s=16 --t=4096
python -m mla_var3.kernel mla_var6_plus v4 --b=32 --s=16 --t=4096
```

## Clone workflow

```bash
source .venv/bin/activate
python ./scripts/clone-kernel.py <kernel_full_name> <new_suffix>
```

After cloning, update the kernel entry points for the target DSL (`@ct.kernel` for CuTile; `@cute.kernel` / `@cute.jit` for CuTe DSL) along with the `KernelPlan` and `Tiling` subclasses, then verify correctness:

```bash
source .venv/bin/activate
python -m mla_var3.kernel.<lang_pkg>.mla.<design> <version> --prof_type=disabled --check
```
## KernelPlan structure

Every kernel defines a `KernelPlan` subclass whose `plan()` method builds the executable runtime object:

```python
@dataclass
class MyKernel(KernelPlan):
    b: int = 64; s: int = 1; t: int = 4096  # problem dimensions
    tiling: MyTiling = field(default_factory=MyTiling)

    def prepare_inputs(self, device) -> tuple:
        # Allocate and return input tensors
        ...

    def reference_fn(self, *inputs) -> tuple:
        # Reference implementation for --check
        ...

    def _autotune_configs(self) -> list[MyTiling]:
        # Candidate tiling configs for autotuner search
        ...

    def _algorithmic_flops_bytes(self, tiling) -> tuple[int, int]:
        # Analytical (FLOPs, bytes) for roofline
        ...

    def plan(self, *inputs) -> BenchmarkFn:
        # Build executable runtime object (DSL-specific)
        ...

    def plan_empty(self, peak_tflops, peak_gbps) -> BenchmarkFn:
        # Roofline-only prediction (no real tensors)
        ...
```
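To make the `plan_empty` contract concrete, here is a self-contained toy showing how a roofline prediction falls out of `_algorithmic_flops_bytes`. `VecAddPlan` is invented for illustration, and it returns a plain float (seconds) rather than the repo's `BenchmarkFn`:

```python
from dataclasses import dataclass

@dataclass
class VecAddPlan:
    """Simplified stand-in for a KernelPlan subclass (toy, not the repo API)."""
    n: int = 1 << 24  # problem size (elements)

    def _algorithmic_flops_bytes(self) -> tuple[int, int]:
        # One add per element; read two fp32 inputs, write one fp32 output.
        return self.n, 3 * self.n * 4

    def plan_empty(self, peak_tflops: float, peak_gbps: float) -> float:
        # Roofline prediction: runtime is bounded by whichever of compute
        # and memory traffic takes longer at peak rates.
        flops, nbytes = self._algorithmic_flops_bytes()
        t_compute = flops / (peak_tflops * 1e12)
        t_memory = nbytes / (peak_gbps * 1e9)
        return max(t_compute, t_memory)

plan = VecAddPlan(n=1 << 24)
# Vector add is memory-bound, so the byte term dominates on a realistic GPU.
t = plan.plan_empty(peak_tflops=100.0, peak_gbps=2000.0)
```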
Each kernel also defines a `Tiling` subclass:

```python
@dataclass
class MyTiling(Tiling):
    # DSL-specific fields — see the language-specific skill for examples

    def validate(self, pd: "MyKernel") -> bool:
        # Return True if this tiling is valid for the given problem dimensions
        ...
```
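What `validate` typically checks can be sketched with a toy tiling. The field names (`block_s`, `num_stages`), the shared-memory budget, and the per-row byte count here are all hypothetical, chosen only to illustrate divisibility and resource checks:

```python
from dataclasses import dataclass

@dataclass
class BlockTiling:
    """Toy Tiling stand-in with hypothetical fields (not the repo's API)."""
    block_s: int = 128   # tile size along the sequence dimension
    num_stages: int = 2  # pipeline depth

    def validate(self, t: int, smem_budget: int = 228 * 1024) -> bool:
        # A tiling is usable only if the tile evenly covers the problem
        # dimension and the staged buffers fit in shared memory.
        if t % self.block_s != 0:
            return False
        bytes_per_stage = self.block_s * 576 * 2  # e.g. one fp16 KV-latent row
        return self.num_stages * bytes_per_stage <= smem_budget

assert BlockTiling(block_s=128, num_stages=1).validate(t=4096)
assert not BlockTiling(block_s=96).validate(t=4096)  # 4096 % 96 != 0
```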
## Composition patterns

Multi-stage kernels compose sequentially through `KernelPipeline`:

```python
def plan(self, *inputs) -> KernelPipeline:
    stage1 = stage1_plan.plan(...)
    stage2 = stage2_plan.plan(...)
    return KernelPipeline(_name="my_pipeline", stages=[stage1, stage2])
```
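The sequential pattern can be exercised with toy stand-ins. `Stage`, and the simplified `KernelPipeline` below, are invented for this sketch; the real stages are `plan()` outputs and the real pipeline runs on the GPU:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    """Toy stage; real stages are plan() outputs (BenchmarkFn-like objects)."""
    name: str
    fn: Callable

    def run(self, x):
        return self.fn(x)

@dataclass
class KernelPipeline:
    """Simplified stand-in: runs stages back to back, feeding outputs forward."""
    _name: str
    stages: list

    def run(self, x):
        for stage in self.stages:
            x = stage.run(x)
        return x

pipe = KernelPipeline(_name="my_pipeline",
                      stages=[Stage("stage1", lambda x: x + 1),
                              Stage("stage2", lambda x: x * 2)])
print(pipe.run(10))  # stage2(stage1(10)) = (10 + 1) * 2 = 22
```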
Independent kernels can overlap via `ConcurrentKernels` before a combining stage:

```python
def plan(self, *inputs) -> KernelPipeline:
    a = plan_a.plan(...)
    b = plan_b.plan(...)
    concurrent = ConcurrentKernels(
        _name="overlap_group", concurrent_kernels=[a, b],
        validate_joint_tiling_fn=validate_fn,
    )
    combine = combine_plan.plan(...)
    return KernelPipeline(_name="pipeline", stages=[concurrent, combine])
```

## Implementation workflow

1. Clone the closest existing version with `python ./scripts/clone-kernel.py`.
2. Review relevant notes under `docs/knowledge/` before editing.
3. Verify correctness with `--prof_type=disabled --check`.
4. Record new findings in `docs/knowledge/optimizations/` and `docs/knowledge/anti-patterns/`; language-specific notes live under `docs/knowledge/languages/<language>/...`.
5. Update the kernel's devlog in `docs/kernels/<kernel>.md`.

## Development log

Append an entry to `docs/kernels/<kernel>.md` using this template:

### V<N>: [Brief Description]
**Location**: `src/mla_var3/kernel/<lang_pkg>/mla/<kernel>/<kernel>_v<N>/`
**What changed**:
- [Bullet list of changes]
**High-level description of main code changes**:
- [Description of optimizations and how they relate to profiling insights]

## New Version: [kernel] [version]
### Changes Applied
1. [change + rationale]
### Files
- Created: [paths]
- Modified: [paths]
### Correctness: [PASS/FAIL]
### Devlog Entry Written: [path]

Also note any updates to `docs/knowledge/optimizations/`, `docs/knowledge/anti-patterns/`, and `docs/kernels/<kernel>.md`.