optimization-catalog

Optimization Catalog Router

Use this skill as a dispatcher. The shared root knowledge base remains the canonical location for algorithmic patterns that transfer across implementation languages. Language-specific overlays capture implementation details tied to a specific DSL or programming model.

Load Order

  1. Read the shared root knowledge under `docs/knowledge/optimizations/` and `docs/knowledge/anti-patterns/` when the pattern is algorithmic.
  2. Load the language-specific optimization catalog skill for the chosen implementation language.
  3. Read language overlays in `docs/knowledge/languages/<language>/...` only when the implementation details depend on that surface.
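
The three-step load order above can be sketched as a tiny resolver. This is an illustrative sketch, not a real API: `resolve_load_order` and its parameters are hypothetical names, while the paths and skill names come from this document.

```python
from pathlib import Path

# Hypothetical resolver sketch (not a real API) showing the load order:
# shared roots first, then the language catalog skill, then overlays.
SHARED_ROOTS = [
    Path("docs/knowledge/optimizations"),
    Path("docs/knowledge/anti-patterns"),
]
LANGUAGE_ROOT = Path("docs/knowledge/languages")

def resolve_load_order(language_key: str,
                       algorithmic: bool = True,
                       needs_overlay: bool = False) -> list:
    """Return knowledge sources in the order they should be read."""
    order: list = []
    if algorithmic:                       # step 1: shared root catalogs
        order.extend(SHARED_ROOTS)
    order.append(f"/optimization-catalog-{language_key}")  # step 2: catalog skill
    if needs_overlay:                     # step 3: language overlay
        order.append(LANGUAGE_ROOT / language_key)
    return order

order = resolve_load_order("cutile-dsl", algorithmic=True, needs_overlay=True)
```

For `cute-dsl` the same resolver would yield `/optimization-catalog-cute-dsl` and `docs/knowledge/languages/cute-dsl/`.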

Language-Specific Catalog Skills

| Language key | Load this skill | Language-specific knowledge root |
|---|---|---|
| `cutile-dsl` | `/optimization-catalog-cutile-dsl` | `docs/knowledge/languages/cutile-dsl/` |
| `cute-dsl` | `/optimization-catalog-cute-dsl` | `docs/knowledge/languages/cute-dsl/` |

Classification Rule

  • Shared root catalogs are for algorithmic patterns, workload-shape rules, and anti-patterns that transfer across DSLs.
  • Language overlays are for implementation details that depend on a specific API, compiler behavior, code-generation surface, or scheduling model.
  • When in doubt, keep the reusable mechanism in the shared root and add a language-specific overlay only for the implementation surface.

Current Migration State

  • The shared root catalogs in `docs/knowledge/optimizations/` and `docs/knowledge/anti-patterns/` remain the compatibility-preserving baseline.
  • New language-specific overlays are being added under `docs/knowledge/languages/`.
  • Existing references to the legacy root paths continue to work during the migration.

Knowledge Base Principles

The goal of this catalog is to build reusable cuTile kernel design knowledge, not to accumulate device folklore.
Every entry must therefore be written as:
  • a pattern or anti-pattern that can transfer across kernels,
  • with a clear context describing when it applies,
  • with explicit performance metrics affected,
  • with evidence separated into local validation and, when relevant, a generalized takeaway.
Device and architecture facts are encouraged when they sharpen the rule:
  • architecture families such as Blackwell or Hopper,
  • architecture capabilities such as Tensor Cores, TMA, thread block clusters, scheduler behavior, or shared-memory limits,
  • device-level specs such as peak FP16/BF16 TFLOP/s, peak memory bandwidth, ridge point, SM count, registers/SM, or SMEM capacity.
But those facts must be converted into reusable guidance. Entries should avoid collapsing into "kernel X on device Y liked config Z" unless that observation is only being used as evidence for a broader pattern.
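
For example, the ridge point mentioned above falls straight out of device specs. A minimal sketch of the roofline arithmetic, where the TFLOP/s and bandwidth values are illustrative placeholders rather than figures from this catalog:

```python
# Roofline ridge-point arithmetic. The device numbers below are
# illustrative placeholders, not specs quoted from this catalog.
peak_tflops = 989.0    # peak FP16/BF16 TFLOP/s (hypothetical device)
peak_bw_gbs = 3350.0   # peak memory bandwidth in GB/s (hypothetical device)

# Ridge point: the arithmetic intensity (FLOP/byte) at which a kernel
# transitions from bandwidth-bound to compute-bound.
ridge_point = (peak_tflops * 1e12) / (peak_bw_gbs * 1e9)

def likely_bound(intensity_flop_per_byte: float) -> str:
    """Classify a kernel's likely bottleneck from its arithmetic intensity."""
    if intensity_flop_per_byte >= ridge_point:
        return "compute-bound"
    return "bandwidth-bound"
```

A catalog entry citing these specs should state the rule in this transferable form (intensity vs. ridge point), not as a per-device observation.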

How This Catalog Works

  1. The index below maps trigger conditions → optimization → detail file.
  2. The orchestrator reads the index to pick optimizations based on profiling results.
  3. The kernel designer reads detail files to implement specific optimizations.
  4. New optimizations: create a detail file in `docs/knowledge/optimizations/` and add a row to the index.
  5. Failed optimizations: create an anti-pattern file in `docs/knowledge/anti-patterns/` and add a row below.

Optimization Index

| ID | Optimization | Trigger Conditions | Est. Impact | Pitfalls | Detail File |
|---|---|---|---|---|---|
| O1 | Split-KV proportional block allocation | Dual-path kernel has strongly unbalanced latent vs decompressed work; equal block split leaves one path starved | Very High | Requires a reduction kernel and intermediate tensors | `docs/knowledge/optimizations/split-kv.md` |
| O2 | Head grouping for shared-operand reuse | Same operand is reused across heads (for example latent `Z`) and `Bh=1` produces skinny per-head MMAs or near-zero TC utilization | Very High | Raises register pressure; not appropriate for per-head-only operands like decompressed `K/V` | `docs/knowledge/optimizations/head-grouping.md` |
| O3 | Independent path scheduling | After Split-KV, the latent and decompressed paths still want different streaming granularities or load scheduling | Low-Med | Gains are modest if register pressure and occupancy do not improve | `docs/knowledge/optimizations/independent-tiles.md` |
| O4 | Swizzle scheduling | Tiled kernel reuses one operand across neighboring CTAs and block order measurably affects L2 locality | Low | Compute-bound kernels may see only marginal benefit; 1D remap adds index overhead | `docs/knowledge/optimizations/swizzle.md` |
| O5 | Latency hints | Load/compute overlap tuning is needed after the kernel structure is already sound, and the per-path DRAM pressure differs enough that compiler guidance may help scheduling | Low | Difficult to isolate; hints are suggestions rather than guarantees | `docs/knowledge/optimizations/latency-hints.md` |
| O6 | CGA thread block clusters | Hopper+ kernel has genuine cross-CTA data sharing opportunity or needs to reason correctly about `num_ctas` semantics and cluster launch behavior | Medium | Hardware-specific and easy to misuse when there is no real sharing benefit | `docs/knowledge/optimizations/cga.md` |
| O7 | Fast math for online softmax | Online-softmax loop is heavy on `exp2`/division, larger tiles collapse compute throughput, and reduced-precision inference math is acceptable | High | Not a substitute for sane tiling; gains shrink if spills or bandwidth dominate | `docs/knowledge/optimizations/fast-math-online-softmax.md` |
| O8 | Causal K-loop split | Causal attention has many fully unmasked K-tiles, only diagonal/tail tiles need mask logic, and future tiles can be skipped | Medium-High | Can be neutral or negative if masking is no longer material on the target bottleneck | `docs/knowledge/optimizations/causal-k-loop-split.md` |
| O9 | Causal ProgramId remap | Triangular work causes wave-tail imbalance across CTAs and a simple launch-order reversal is available | Low-Med | Small gain if work distribution is already uniform or the grid is tiny | `docs/knowledge/optimizations/causal-program-id-remap.md` |
| O10 | Adaptive tile and occupancy autotuning | Best tile/occupancy point changes with sequence length or device envelope; fixed manual choice underfills short shapes or overcommits long ones | High | Search space can explode if not curated; cannot rescue a structurally broken kernel | `docs/knowledge/optimizations/adaptive-tile-autotuning.md` |
| O11 | Pipeline-driven low-occupancy scheduling | Large unavoidable per-CTA state forces one CTA/SM or very low occupancy, and the kernel must win via overlap and scheduling rather than more resident warps | High | Requires architecture/DSL support for fine-grained scheduling; cannot be faked with a monolithic CTA | `docs/knowledge/optimizations/pipeline-driven-low-occupancy.md` |
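
As a concrete illustration of what O7 targets, the online-softmax recurrence can be written with base-2 exponentials, which is how fast-math implementations typically lower `exp`. This NumPy sketch is reference math for the recurrence, not kernel code:

```python
import numpy as np

LOG2E = 1.4426950408889634  # log2(e): exp(x) becomes exp2(x * LOG2E)

def online_softmax(score_tiles):
    """Streaming softmax statistics over a sequence of 1-D score tiles.

    Maintains a running max m and running normalizer l, rescaling l when
    the max grows. This is the exp2-heavy inner loop O7 targets.
    """
    m = float("-inf")
    l = 0.0
    for tile in score_tiles:
        m_new = max(m, float(tile.max()))
        # Rescale the old normalizer to the new max, then add this tile's
        # contribution; exp is expressed via exp2 to mirror fast-math codegen.
        l = l * np.exp2((m - m_new) * LOG2E) \
            + float(np.exp2((tile - m_new) * LOG2E).sum())
        m = m_new
    return m, l
```

The recurrence yields the same `m` and `l` as a single-pass softmax over the concatenated scores, which is why tiles can be streamed in any granularity.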

Anti-Pattern Index

| ID | Anti-Pattern | Failure Mode | Source | Detail File |
|---|---|---|---|---|
| A1 | Equal block split across heterogeneous paths | 50/50 block allocation for unequal latent vs decompressed work produces sequential bottlenecks instead of overlap | MLA-var6+ V0 | `docs/knowledge/anti-patterns/equal-block-split.md` |
| A2 | `TILE_M=1` combine/decompression | Single-row combine GEMMs disable Tensor Cores and leave a persistent tail at small `s` | MLA-var6+ V0/V2/V3 | `docs/knowledge/anti-patterns/tile-m-one-combine.md` |
| A3 | Underfilled persistent concurrency | Resident block budgets below the SM count leave the GPU idle and make overlap ineffective | MLA-var6+ V4 | `docs/knowledge/anti-patterns/underfilled-persistent.md` |
| A4 | Register-ceiling persistent stage | Persistent kernel lands at the register ceiling (255 regs/thread), collapses occupancy, and becomes latency-bound | MLA-var6+ V4 | `docs/knowledge/anti-patterns/spill-heavy-persistent.md` |
| A5 | Blind large-tile port | Copying a tile shape from another device without revalidating registers, SMEM, local-memory traffic, and grid size crosses a resource cliff | FlashAttention learning chain | `docs/knowledge/anti-patterns/blind-large-tile-port.md` |
| A6 | Uniform causal masking | Paying full causal-mask logic on every K-tile wastes work even though most tiles are fully valid or fully skipped | FlashAttention learning chain | `docs/knowledge/anti-patterns/uniform-causal-masking.md` |
| A7 | Sub-16 MMA dimension | Shrinking any effective MMA dimension (`M`, `N`, or `K`) below 16 usually breaks Tensor Core coverage and explodes latency instead of fixing occupancy | FlashMLA V1 manual probes + Tensor Core shape constraints | `docs/knowledge/anti-patterns/sub-16-mma-dimension.md` |
| A8 | Blind dataflow flip | Rewriting a resource-sensitive Tensor Core kernel into an algebraically equivalent operand orientation changes codegen enough to reintroduce spills or worsen SMEM behavior | FlashMLA V1 → V2 | `docs/knowledge/anti-patterns/blind-dataflow-flip.md` |
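
A6 and O8 are two sides of the same observation: in a causal kernel only the K-tiles straddling the diagonal need mask logic. A host-side sketch of the tile classification, with hypothetical function and parameter names:

```python
def classify_k_tiles(q_tile_idx, num_k_tiles, tile_m, tile_n):
    """Split the K-tiles seen by one causal query tile into three regions:
    fully valid (no mask needed), diagonal (mask needed), and skipped.

    Uses the causal rule k <= q for query rows [q_start, q_end] and
    key columns [k_start, k_end]. Names are illustrative, not a real API.
    """
    q_start = q_tile_idx * tile_m
    q_end = q_start + tile_m - 1          # last query row in this tile
    full, diag, skip = [], [], []
    for k in range(num_k_tiles):
        k_start = k * tile_n
        k_end = k_start + tile_n - 1      # last key column in this tile
        if k_end <= q_start:
            full.append(k)                # every (q, k) pair is unmasked
        elif k_start > q_end:
            skip.append(k)                # entirely in the future: skip it
        else:
            diag.append(k)                # straddles the diagonal: apply mask
    return full, diag, skip
```

A6 is the failure mode of running the `diag` mask path over all three regions; O8 is the optimization of running it only where this classification demands it.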

Cross-Referencing: Metrics → Optimizations

| Profiling Metric State | Likely Optimizations |
|---|---|
| Equal-duration kernels with obviously unequal work | O1 (Split-KV proportional block allocation) |
| `Bh=1` or per-head skinny MMAs on a shared operand | O2 (Head grouping) |
| Dual-path kernel plateaus near the ridge point after Split-KV | O3 (Independent path scheduling) |
| L2 hit < 90% and neighboring CTAs reload the same operand tiles | O4 (Swizzle scheduling) |
| Same kernel structure but different path-level memory pressure suggests compiler scheduling changes may matter | O5 (Latency hints) |
| Cross-CTA data sharing on Hopper+ or confusion about what `num_ctas` actually controls | O6 (CGA thread block clusters) |
| Tensor Core/MMA structure is sound, but online-softmax special functions dominate the hot loop | O7 (Fast math for online softmax) |
| Causal kernel still spends meaningful time in mask/control flow because many K-tiles are fully valid and future tiles are fully skipped | O8 (Causal K-loop split) |
| Causal/triangular workload shows tail imbalance across CTAs or waves finish unevenly | O9 (Causal ProgramId remap) |
| Short shapes underfill the GPU while long shapes prefer different tiles or occupancy hints | O10 (Adaptive tile and occupancy autotuning) |
| Registers/SMEM force one CTA/SM, eligible warps stay very low, and the kernel is compute-bound but still under-utilizes Tensor Cores | O11 (Pipeline-driven low-occupancy scheduling) |
| Combine stage has `TILE_M=1` and TC Util = 0% | A2 (`TILE_M=1` combine/decompression) |
| Persistent resident blocks < SM count or waves/SM < 1 | A3 (Underfilled persistent concurrency) |
| Persistent stage at 255 regs/thread with single-digit occupancy | A4 (Register-ceiling persistent stage) |
| Imported large tile causes register/SMEM cliffs, local-memory traffic, or abrupt grid shrinkage on the target device | A5 (Blind large-tile port) |
| Causal masking logic is paid uniformly across all K-tiles despite obvious fully-valid and fully-skipped regions | A6 (Uniform causal masking) |
| Candidate tile drives any effective MMA dimension below 16, or Tensor Core FLOPs collapse far below algorithmic FLOPs after a "smaller tile" change | A7 (Sub-16 MMA dimension) |
| Proposed optimization is only an algebraic operand/dataflow flip, especially near a register or SMEM cliff | A8 (Blind dataflow flip) |
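
The table above reads as an ordered rule list, so a toy router over a few of its rows might look like the following. The profiling metric names (the keys of `profile`) are hypothetical; the IDs match the catalog:

```python
# Toy router over a few rows of the metrics table. Metric names are
# hypothetical placeholders; a real orchestrator would map its profiler's
# fields onto these predicates.
RULES = [
    ("O4", lambda p: p.get("l2_hit_rate", 1.0) < 0.90),
    ("A2", lambda p: p.get("tile_m") == 1 and p.get("tc_util", 1.0) == 0.0),
    ("A3", lambda p: p.get("resident_blocks", float("inf")) < p.get("sm_count", 0)),
    ("A4", lambda p: p.get("regs_per_thread", 0) >= 255),
]

def route(profile: dict) -> list:
    """Return catalog IDs whose trigger conditions match the profile."""
    return [rule_id for rule_id, predicate in RULES if predicate(profile)]

candidates = route({"l2_hit_rate": 0.82, "regs_per_thread": 255,
                    "resident_blocks": 96, "sm_count": 132})
```

Each matched ID points the orchestrator at the corresponding detail file under `docs/knowledge/optimizations/` or `docs/knowledge/anti-patterns/`.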

Knowledge Base Update Protocol

For the full update protocol, detail file template, and step-by-step instructions, load the `/orchestration-workflow` skill.