Compatibility router for the shared optimization knowledge base and the language-specific optimization catalog skills. Use when: (1) selecting which optimization catalog skill to load, (2) the implementation language is not fixed yet, (3) a workflow still references the legacy optimization-catalog skill name, (4) deciding whether a finding is shared or language-specific, (5) updating the generalized knowledge-base structure.
## Installation

Install with: `npx skill4agent add pepperu96/hyper-mla optimization-catalog`

## Knowledge base layout

Shared knowledge lives in `docs/knowledge/optimizations/` and `docs/knowledge/anti-patterns/`; language-specific knowledge lives under `docs/knowledge/languages/<language>/...`.

## Language routing

| Language key | Load this skill | Language-specific knowledge root |
|---|---|---|
| | |
| | |
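The routing decision above can be sketched as a small lookup. This is a minimal sketch: the table's real language keys and skill names were not preserved in this copy, so the entries below are illustrative placeholders only.

```python
from typing import Optional, Tuple

# Hypothetical routing table; keys and skill names are placeholders,
# not the actual entries from the (lost) table above.
SKILL_ROUTES = {
    "cuda":   ("optimization-catalog-cuda",   "docs/knowledge/languages/cuda/"),
    "triton": ("optimization-catalog-triton", "docs/knowledge/languages/triton/"),
}

def route(language_key: Optional[str]) -> Tuple[str, str]:
    """Return (skill to load, knowledge root). Fall back to the shared
    catalog when the implementation language is not fixed yet."""
    if language_key in SKILL_ROUTES:
        return SKILL_ROUTES[language_key]
    return ("optimization-catalog", "docs/knowledge/")
```

The fallback branch covers use case (2) from the description: when no language is fixed, only the shared knowledge base is loaded.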
Shared entries stay in `docs/knowledge/optimizations/` and `docs/knowledge/anti-patterns/`; each language root under `docs/knowledge/languages/` mirrors the same layout with its own `optimizations/` and `anti-patterns/` directories.

## Shared optimization catalog

| ID | Optimization | Trigger Conditions | Est. Impact | Pitfalls | Detail File |
|---|---|---|---|---|---|
| O1 | Split-KV proportional block allocation | Dual-path kernel has strongly unbalanced latent vs decompressed work; equal block split leaves one path starved | Very High | Requires a reduction kernel and intermediate tensors | |
| O2 | Head grouping for shared-operand reuse | Same operand is reused across heads (for example a shared latent operand) | Very High | Raises register pressure; not appropriate for per-head-only operands like the decompressed ones | |
| O3 | Independent path scheduling | After Split-KV, the latent and decompressed paths still want different streaming granularities or load scheduling | Low-Med | Gains are modest if register pressure and occupancy do not improve | |
| O4 | Swizzle scheduling | Tiled kernel reuses one operand across neighboring CTAs and block order measurably affects L2 locality | Low | Compute-bound kernels may see only marginal benefit; 1D remap adds index overhead | |
| O5 | Latency hints | Load/compute overlap tuning is needed after the kernel structure is already sound, and the per-path DRAM pressure differs enough that compiler guidance may help scheduling | Low | Difficult to isolate; hints are suggestions rather than guarantees | |
| O6 | CGA thread block clusters | Hopper+ kernel has genuine cross-CTA data sharing opportunity or needs to reason correctly about cluster-level launch and sharing semantics | Medium | Hardware-specific and easy to misuse when there is no real sharing benefit | |
| O7 | Fast math for online softmax | Online-softmax loop is heavy on special-function work such as `exp` | High | Not a substitute for sane tiling; gains shrink if spills or bandwidth dominate | |
| O8 | Causal K-loop split | Causal attention has many fully unmasked K-tiles, only diagonal/tail tiles need mask logic, and future tiles can be skipped | Medium-High | Can be neutral or negative if masking is no longer material on the target bottleneck | |
| O9 | Causal ProgramId remap | Triangular work causes wave-tail imbalance across CTAs and a simple launch-order reversal is available | Low-Med | Small gain if work distribution is already uniform or the grid is tiny | |
| O10 | Adaptive tile and occupancy autotuning | Best tile/occupancy point changes with sequence length or device envelope; fixed manual choice underfills short shapes or overcommits long ones | High | Search space can explode if not curated; cannot rescue a structurally broken kernel | |
| O11 | Pipeline-driven low-occupancy scheduling | Large unavoidable per-CTA state forces one CTA/SM or very low occupancy, and the kernel must win via overlap and scheduling rather than more resident warps | High | Requires architecture/DSL support for fine-grained scheduling; cannot be faked with a monolithic CTA | |
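O1's core idea, splitting a resident-block budget in proportion to per-path work instead of 50/50, can be sketched as plain arithmetic. This is a minimal sketch under assumptions: the work estimates are hypothetical inputs, whereas a real kernel would use measured per-path costs.

```python
def proportional_split(total_blocks: int,
                       latent_work: float,
                       decompressed_work: float) -> tuple:
    """Split a block budget across two heterogeneous paths in proportion
    to estimated work (O1), avoiding the naive equal split (anti-pattern
    A1). Each path is guaranteed at least one block."""
    total = latent_work + decompressed_work
    latent_blocks = max(1, round(total_blocks * latent_work / total))
    latent_blocks = min(latent_blocks, total_blocks - 1)
    return latent_blocks, total_blocks - latent_blocks
```

For example, with 132 blocks and a 3:1 latent-to-decompressed work ratio, the latent path gets 99 blocks rather than 66, so neither path starves the other.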
## Shared anti-pattern catalog

| ID | Anti-Pattern | Failure Mode | Source | Detail File |
|---|---|---|---|---|
| A1 | Equal block split across heterogeneous paths | 50/50 block allocation for unequal latent vs decompressed work produces sequential bottlenecks instead of overlap | MLA-var6+ V0 | |
| A2 | Single-row combine GEMMs | Single-row combine GEMMs disable Tensor Cores and leave a persistent tail at small problem sizes | MLA-var6+ V0/V2/V3 | |
| A3 | Underfilled persistent concurrency | Resident block budgets below the SM count leave the GPU idle and make overlap ineffective | MLA-var6+ V4 | |
| A4 | Register-ceiling persistent stage | Persistent kernel lands at the register ceiling (for example 255 registers per thread), so any added pressure spills to local memory | MLA-var6+ V4 | |
| A5 | Blind large-tile port | Copying a tile shape from another device without revalidating registers, SMEM, local-memory traffic, and grid size crosses a resource cliff | FlashAttention learning chain | |
| A6 | Uniform causal masking | Paying full causal-mask logic on every K-tile wastes work even though most tiles are fully valid or fully skipped | FlashAttention learning chain | |
| A7 | Sub-16 MMA dimension | Shrinking any effective MMA dimension (M, N, or K) below 16 pads Tensor Core fragments and wastes throughput | FlashMLA V1 manual probes + Tensor Core shape constraints | |
| A8 | Blind dataflow flip | Rewriting a resource-sensitive Tensor Core kernel into an algebraically equivalent operand orientation changes codegen enough to reintroduce spills or worsen SMEM behavior | FlashMLA V1 → V2 | |
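A7 lends itself to a preflight guard before a candidate tile is ever benchmarked. A minimal sketch, assuming the sub-16 threshold stated in the table (the function name and signature are illustrative, not from the catalog):

```python
def mma_dims_ok(m: int, n: int, k: int, minimum: int = 16) -> bool:
    """Reject candidate tiles that drive any effective MMA dimension
    below the stated minimum (anti-pattern A7): sub-16 dimensions pad
    Tensor Core fragments and waste throughput."""
    return all(dim >= minimum for dim in (m, n, k))
```

Running this check inside an autotuner's search-space filter (see O10) prunes doomed configurations before they consume benchmark time.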
## Diagnostic routing

| Profiling Metric State | Likely Catalog Entry |
|---|---|
| Equal-duration kernels with obviously unequal work | O1 (Split-KV proportional block allocation) |
| Same operand is re-fetched once per head, indicating a cross-head reuse opportunity | O2 (Head grouping) |
| Dual-path kernel plateaus near the ridge point after Split-KV | O3 (Independent path scheduling) |
| L2 hit < 90% and neighboring CTAs reload the same operand tiles | O4 (Swizzle scheduling) |
| Same kernel structure but different path-level memory pressure suggests compiler scheduling changes may matter | O5 (Latency hints) |
| Cross-CTA data sharing on Hopper+ or confusion about what a thread block cluster actually shares | O6 (CGA thread block clusters) |
| Tensor Core/MMA structure is sound, but online-softmax special functions dominate the hot loop | O7 (Fast math for online softmax) |
| Causal kernel still spends meaningful time in mask/control flow because many K-tiles are fully valid and future tiles are fully skipped | O8 (Causal K-loop split) |
| Causal/triangular workload shows tail imbalance across CTAs or waves finish unevenly | O9 (Causal ProgramId remap) |
| Short shapes underfill the GPU while long shapes prefer different tiles or occupancy hints | O10 (Adaptive tile and occupancy autotuning) |
| Registers/SMEM force one CTA/SM, eligible warps stay very low, and the kernel is compute-bound but still under-utilizes Tensor Cores | O11 (Pipeline-driven low-occupancy scheduling) |
| Combine stage has single-row GEMMs and a persistent tail | A2 (Single-row combine GEMMs) |
| Persistent resident blocks < SM count or waves/SM < 1 | A3 (Underfilled persistent concurrency) |
| Persistent stage sits at the register ceiling with spills or depressed occupancy | A4 (Register-ceiling persistent stage) |
| Imported large tile causes register/SMEM cliffs, local-memory traffic, or abrupt grid shrinkage on the target device | A5 (Blind large-tile port) |
| Causal masking logic is paid uniformly across all K-tiles despite obvious fully-valid and fully-skipped regions | A6 (Uniform causal masking) |
| Candidate tile drives any effective MMA dimension below 16 | A7 (Sub-16 MMA dimension) |
| Proposed optimization is only an algebraic operand/dataflow flip, especially near a register or SMEM cliff | A8 (Blind dataflow flip) |
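The routing table above is mechanical enough to encode as a rule list. A minimal sketch covering two of its rows (O4 and A3); the `Profile` field names are hypothetical stand-ins for whatever the profiler actually reports:

```python
from dataclasses import dataclass

@dataclass
class Profile:
    # Hypothetical profiling snapshot; field names are illustrative.
    l2_hit_rate: float    # fraction of L2 accesses that hit (0.0-1.0)
    resident_blocks: int  # persistent blocks actually resident
    sm_count: int         # SMs on the target device

def candidates(p: Profile) -> list:
    """Map metric states to catalog entries, mirroring two routing rows:
    L2 hit < 90% -> O4, resident blocks < SM count -> A3."""
    out = []
    if p.l2_hit_rate < 0.90:
        out.append("O4")  # swizzle scheduling for L2 locality
    if p.resident_blocks < p.sm_count:
        out.append("A3")  # underfilled persistent concurrency
    return out
```

Extending the rule list row by row gives a deterministic first-pass triage before a human reads the profile.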
/orchestration-workflow