optimization-catalog
Optimization Catalog Router
Use this skill as a dispatcher. The shared root knowledge base remains the canonical location for algorithmic patterns that transfer across implementation languages. Language-specific overlays capture implementation details tied to a specific DSL or programming model.
Load Order
- Read the shared root knowledge under `docs/knowledge/optimizations/` and `docs/knowledge/anti-patterns/` when the pattern is algorithmic.
- Load the language-specific optimization catalog skill for the chosen implementation language.
- Read language overlays in `docs/knowledge/languages/<language>/...` only when the implementation details depend on that surface.
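As a rough illustration, the load order can be sketched as a small path-resolution helper. The function name is hypothetical; the directory layout follows this document:

```python
def knowledge_load_order(language=None):
    # Shared root catalogs come first: algorithmic patterns and
    # anti-patterns that transfer across implementation languages.
    paths = [
        "docs/knowledge/optimizations/",
        "docs/knowledge/anti-patterns/",
    ]
    # The language overlay is loaded only when implementation details
    # depend on a specific DSL or programming model.
    if language is not None:
        paths.append(f"docs/knowledge/languages/{language}/")
    return paths
```

The overlay stays optional by design: an algorithmic query never needs to touch a language directory.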
Language-Specific Catalog Skills
| Language key | Load this skill | Language-specific knowledge root |
|---|---|---|
| | | |
| | | |
Classification Rule
- Shared root catalogs are for algorithmic patterns, workload-shape rules, and anti-patterns that transfer across DSLs.
- Language overlays are for implementation details that depend on a specific API, compiler behavior, code-generation surface, or scheduling model.
- When in doubt, keep the reusable mechanism in the shared root and add a language-specific overlay only for the implementation surface.
Current Migration State
- The shared root catalogs in `docs/knowledge/optimizations/` and `docs/knowledge/anti-patterns/` remain the compatibility-preserving baseline.
- New language-specific overlays are being added under `docs/knowledge/languages/`.
- Existing references to the legacy root paths continue to work during the migration.
Knowledge Base Principles
The goal of this catalog is to build reusable cuTile kernel design knowledge, not to accumulate device folklore.
Every entry must therefore be written as:
- a pattern or anti-pattern that can transfer across kernels,
- with a clear context describing when it applies,
- with explicit performance metrics affected,
- with evidence separated into local validation and, when relevant, a generalized takeaway.
Device and architecture facts are encouraged when they sharpen the rule:
- architecture families such as Blackwell or Hopper,
- architecture capabilities such as Tensor Cores, TMA, thread block clusters, scheduler behavior, or shared-memory limits,
- device-level specs such as peak FP16/BF16 TFLOP/s, peak memory bandwidth, ridge point, SM count, registers/SM, or SMEM capacity.
But those facts must be converted into reusable guidance. Entries should avoid collapsing into "kernel X on device Y liked config Z" unless that observation is only being used as evidence for a broader pattern.
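One device-level spec listed above, the ridge point, is plain roofline arithmetic: peak compute divided by peak memory bandwidth gives the FLOP/byte intensity at which a kernel stops being memory-bound. A minimal sketch with illustrative round numbers only (not a claim about any specific device):

```python
def ridge_point(peak_tflops, peak_bw_gb_s):
    """Roofline ridge point in FLOP/byte: below this arithmetic
    intensity a kernel is memory-bound, above it compute-bound."""
    return (peak_tflops * 1e12) / (peak_bw_gb_s * 1e9)

# Illustrative: 1000 TFLOP/s peak compute and 2000 GB/s peak bandwidth
# give a ridge point of 500 FLOP/byte.
```

This is the kind of conversion the principle above asks for: the raw spec becomes a reusable rule about where a kernel sits on the roofline.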
How This Catalog Works
- Index below maps trigger conditions → optimization → detail file
- Orchestrator reads index to pick optimizations based on profiling results
- Kernel designer reads detail files to implement specific optimizations
- New optimizations: create detail file in `docs/knowledge/optimizations/` + add row to index
- Failed optimizations: create anti-pattern in `docs/knowledge/anti-patterns/` + add row below
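A sketch of the mapping the orchestrator consumes. The dict shape and the detail filename are hypothetical; the ID and trigger wording mirror the Optimization Index in this catalog:

```python
# Illustrative shape of one index row as an orchestrator might parse it.
INDEX = {
    "O1": {
        "name": "Split-KV proportional block allocation",
        "trigger": "dual-path kernel with strongly unbalanced work",
        # Hypothetical filename; real detail files live under this root.
        "detail_file": "docs/knowledge/optimizations/O1-split-kv.md",
    },
}

def detail_file_for(opt_id):
    # The orchestrator picks an optimization ID from profiling results;
    # the kernel designer then loads the matching detail file.
    return INDEX[opt_id]["detail_file"]
```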
Optimization Index
| ID | Optimization | Trigger Conditions | Est. Impact | Pitfalls | Detail File |
|---|---|---|---|---|---|
| O1 | Split-KV proportional block allocation | Dual-path kernel has strongly unbalanced latent vs decompressed work; equal block split leaves one path starved | Very High | Requires a reduction kernel and intermediate tensors | |
| O2 | Head grouping for shared-operand reuse | Same operand is reused across heads (for example latent | Very High | Raises register pressure; not appropriate for per-head-only operands like decompressed | |
| O3 | Independent path scheduling | After Split-KV, the latent and decompressed paths still want different streaming granularities or load scheduling | Low-Med | Gains are modest if register pressure and occupancy do not improve | |
| O4 | Swizzle scheduling | Tiled kernel reuses one operand across neighboring CTAs and block order measurably affects L2 locality | Low | Compute-bound kernels may see only marginal benefit; 1D remap adds index overhead | |
| O5 | Latency hints | Load/compute overlap tuning is needed after the kernel structure is already sound, and the per-path DRAM pressure differs enough that compiler guidance may help scheduling | Low | Difficult to isolate; hints are suggestions rather than guarantees | |
| O6 | CGA thread block clusters | Hopper+ kernel has genuine cross-CTA data sharing opportunity or needs to reason correctly about | Medium | Hardware-specific and easy to misuse when there is no real sharing benefit | |
| O7 | Fast math for online softmax | Online-softmax loop is heavy on | High | Not a substitute for sane tiling; gains shrink if spills or bandwidth dominate | |
| O8 | Causal K-loop split | Causal attention has many fully unmasked K-tiles, only diagonal/tail tiles need mask logic, and future tiles can be skipped | Medium-High | Can be neutral or negative if masking is no longer material on the target bottleneck | |
| O9 | Causal ProgramId remap | Triangular work causes wave-tail imbalance across CTAs and a simple launch-order reversal is available | Low-Med | Small gain if work distribution is already uniform or the grid is tiny | |
| O10 | Adaptive tile and occupancy autotuning | Best tile/occupancy point changes with sequence length or device envelope; fixed manual choice underfills short shapes or overcommits long ones | High | Search space can explode if not curated; cannot rescue a structurally broken kernel | |
| O11 | Pipeline-driven low-occupancy scheduling | Large unavoidable per-CTA state forces one CTA/SM or very low occupancy, and the kernel must win via overlap and scheduling rather than more resident warps | High | Requires architecture/DSL support for fine-grained scheduling; cannot be faked with a monolithic CTA | |
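To make O1 concrete: proportional allocation is a weighted division of the block budget, in contrast to the 50/50 split called out as anti-pattern A1. A minimal sketch, with the per-path work estimates as assumed inputs:

```python
def proportional_split(total_blocks, latent_work, decompressed_work):
    # O1: allocate Split-KV blocks in proportion to per-path work so
    # neither path is starved (a 50/50 split of unequal work is A1).
    total_work = latent_work + decompressed_work
    latent_blocks = max(1, round(total_blocks * latent_work / total_work))
    # Keep at least one block on each path.
    latent_blocks = min(latent_blocks, total_blocks - 1)
    return latent_blocks, total_blocks - latent_blocks

# A 3:1 work imbalance over 8 blocks yields a 6/2 split instead of 4/4.
```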
Anti-Pattern Index
| ID | Anti-Pattern | Failure Mode | Source | Detail File |
|---|---|---|---|---|
| A1 | Equal block split across heterogeneous paths | 50/50 block allocation for unequal latent vs decompressed work produces sequential bottlenecks instead of overlap | MLA-var6+ V0 | |
| A2 | | Single-row combine GEMMs disable Tensor Cores and leave a persistent tail at small | MLA-var6+ V0/V2/V3 | |
| A3 | Underfilled persistent concurrency | Resident block budgets below the SM count leave the GPU idle and make overlap ineffective | MLA-var6+ V4 | |
| A4 | Register-ceiling persistent stage | Persistent kernel lands at the register ceiling (255 registers/thread), so occupancy collapses and the kernel becomes latency-bound | MLA-var6+ V4 | |
| A5 | Blind large-tile port | Copying a tile shape from another device without revalidating registers, SMEM, local-memory traffic, and grid size crosses a resource cliff | FlashAttention learning chain | |
| A6 | Uniform causal masking | Paying full causal-mask logic on every K-tile wastes work even though most tiles are fully valid or fully skipped | FlashAttention learning chain | |
| A7 | Sub-16 MMA dimension | Shrinking any effective MMA dimension below 16 leaves Tensor Core FLOPs far below algorithmic FLOPs | FlashMLA V1 manual probes + Tensor Core shape constraints | |
| A8 | Blind dataflow flip | Rewriting a resource-sensitive Tensor Core kernel into an algebraically equivalent operand orientation changes codegen enough to reintroduce spills or worsen SMEM behavior | FlashMLA V1 → V2 | |
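Anti-pattern A7 can be caught with a simple guard before a candidate tile is accepted; the function name is hypothetical, and the 16-element minimum per MMA dimension follows this catalog's Sub-16 rule:

```python
def tile_ok_for_mma(m, n, k, min_dim=16):
    # A7 guard: reject any candidate tile whose effective MMA dimensions
    # drop below 16, which would leave Tensor Cores underutilized.
    return min(m, n, k) >= min_dim
```

A check like this belongs in the autotuner's search-space curation (see O10) rather than in the kernel itself.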
Cross-Referencing: Metrics → Optimizations
| Profiling Metric State | Likely Optimizations |
|---|---|
| Equal-duration kernels with obviously unequal work | O1 (Split-KV proportional block allocation) |
| | O2 (Head grouping) |
| Dual-path kernel plateaus near the ridge point after Split-KV | O3 (Independent path scheduling) |
| L2 hit < 90% and neighboring CTAs reload the same operand tiles | O4 (Swizzle scheduling) |
| Same kernel structure but different path-level memory pressure suggests compiler scheduling changes may matter | O5 (Latency hints) |
| Cross-CTA data sharing on Hopper+ or confusion about what | O6 (CGA thread block clusters) |
| Tensor Core/MMA structure is sound, but online-softmax special functions dominate the hot loop | O7 (Fast math for online softmax) |
| Causal kernel still spends meaningful time in mask/control flow because many K-tiles are fully valid and future tiles are fully skipped | O8 (Causal K-loop split) |
| Causal/triangular workload shows tail imbalance across CTAs or waves finish unevenly | O9 (Causal ProgramId remap) |
| Short shapes underfill the GPU while long shapes prefer different tiles or occupancy hints | O10 (Adaptive tile and occupancy autotuning) |
| Registers/SMEM force one CTA/SM, eligible warps stay very low, and the kernel is compute-bound but still under-utilizes Tensor Cores | O11 (Pipeline-driven low-occupancy scheduling) |
| Combine stage has | A2 ( |
| Persistent resident blocks < SM count or waves/SM < 1 | A3 (Underfilled persistent concurrency) |
| Persistent stage at 255 registers/thread with single-digit occupancy | A4 (Register-ceiling persistent stage) |
| Imported large tile causes register/SMEM cliffs, local-memory traffic, or abrupt grid shrinkage on the target device | A5 (Blind large-tile port) |
| Causal masking logic is paid uniformly across all K-tiles despite obvious fully-valid and fully-skipped regions | A6 (Uniform causal masking) |
| Candidate tile drives any effective MMA dimension below 16, or a "smaller tile" change leaves Tensor Core FLOPs far below algorithmic FLOPs | A7 (Sub-16 MMA dimension) |
| Proposed optimization is only an algebraic operand/dataflow flip, especially near a register or SMEM cliff | A8 (Blind dataflow flip) |
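This metric-to-ID mapping is what the orchestrator mechanizes. A heavily simplified dispatcher sketch covering a few rows of the cross-reference table; the metric keys and thresholds are illustrative, taken from the table's own wording:

```python
def likely_optimizations(metrics):
    """Map a dict of profiling observations to catalog IDs, mirroring
    a few rows of the cross-reference table (not exhaustive)."""
    picks = []
    # O4 row: L2 hit < 90% plus neighboring CTAs reloading the same tiles.
    if metrics.get("l2_hit_rate", 1.0) < 0.90 and metrics.get("neighbor_reload", False):
        picks.append("O4")  # Swizzle scheduling
    # A3 row: persistent resident blocks below the SM count.
    if metrics.get("resident_blocks", 0) < metrics.get("sm_count", 0):
        picks.append("A3")  # Underfilled persistent concurrency
    # A4 row: persistent stage pinned at the 255 registers/thread ceiling.
    if metrics.get("regs_per_thread", 0) >= 255:
        picks.append("A4")  # Register-ceiling persistent stage
    return picks
```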
Knowledge Base Update Protocol
For the full update protocol, detail file template, and step-by-step instructions, load the `/orchestration-workflow` skill.