optimization-catalog

Optimization Catalog Router

Use this skill as a dispatcher. The shared root knowledge base remains the canonical location for algorithmic patterns that transfer across implementation languages. Language-specific overlays capture implementation details tied to a specific DSL or programming model.

Load Order

  1. Read the shared root knowledge under `docs/knowledge/optimizations/` and `docs/knowledge/anti-patterns/` when the pattern is algorithmic.
  2. Load the language-specific optimization catalog skill for the chosen implementation language.
  3. Read language overlays in `docs/knowledge/languages/<language>/...` only when the implementation details depend on that surface.
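
The three-step load order above can be sketched as a tiny resolver. This is an illustrative sketch, not a real API: `resolve_load_order` and its parameters are hypothetical names, while the paths and skill names come from this document.

```python
from pathlib import Path

# Hypothetical resolver sketch (not a real API) showing the load order:
# shared roots first, then the language catalog skill, then overlays.
SHARED_ROOTS = [
    Path("docs/knowledge/optimizations"),
    Path("docs/knowledge/anti-patterns"),
]
LANGUAGE_ROOT = Path("docs/knowledge/languages")

def resolve_load_order(language_key: str,
                       algorithmic: bool = True,
                       needs_overlay: bool = False) -> list:
    """Return knowledge sources in the order they should be read."""
    order: list = []
    if algorithmic:                       # step 1: shared root catalogs
        order.extend(SHARED_ROOTS)
    order.append(f"/optimization-catalog-{language_key}")  # step 2: catalog skill
    if needs_overlay:                     # step 3: language overlay
        order.append(LANGUAGE_ROOT / language_key)
    return order

order = resolve_load_order("cutile-dsl", algorithmic=True, needs_overlay=True)
```

For `cute-dsl` the same resolver would yield `/optimization-catalog-cute-dsl` and `docs/knowledge/languages/cute-dsl/`.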

Language-Specific Catalog Skills

| Language key | Load this skill | Language-specific knowledge root |
|---|---|---|
| `cutile-dsl` | `/optimization-catalog-cutile-dsl` | `docs/knowledge/languages/cutile-dsl/` |
| `cute-dsl` | `/optimization-catalog-cute-dsl` | `docs/knowledge/languages/cute-dsl/` |

Classification Rule

  • Shared root catalogs are for algorithmic patterns, workload-shape rules, and anti-patterns that transfer across DSLs.
  • Language overlays are for implementation details that depend on a specific API, compiler behavior, code-generation surface, or scheduling model.
  • When in doubt, keep the reusable mechanism in the shared root and add a language-specific overlay only for the implementation surface.

Current Migration State

  • The shared root catalogs in `docs/knowledge/optimizations/` and `docs/knowledge/anti-patterns/` remain the compatibility-preserving baseline.
  • New language-specific overlays are being added under `docs/knowledge/languages/`.
  • Existing references to the legacy root paths continue to work during the migration.

Knowledge Base Principles

The goal of this catalog is to build reusable cuTile kernel design knowledge, not to accumulate device folklore.
Every entry must therefore be written as:
  • a pattern or anti-pattern that can transfer across kernels,
  • with a clear context describing when it applies,
  • with explicit performance metrics affected,
  • with evidence separated into local validation and, when relevant, a generalized takeaway.
Device and architecture facts are encouraged when they sharpen the rule:
  • architecture families such as Blackwell or Hopper,
  • architecture capabilities such as Tensor Cores, TMA, thread block clusters, scheduler behavior, or shared-memory limits,
  • device-level specs such as peak FP16/BF16 TFLOP/s, peak memory bandwidth, ridge point, SM count, registers/SM, or SMEM capacity.
But those facts must be converted into reusable guidance. Entries should avoid collapsing into "kernel X on device Y liked config Z" unless that observation is only being used as evidence for a broader pattern.
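
For example, the ridge point mentioned above falls straight out of device specs. A minimal sketch of the roofline arithmetic, where the TFLOP/s and bandwidth values are illustrative placeholders rather than figures from this catalog:

```python
# Roofline ridge-point arithmetic. The device numbers below are
# illustrative placeholders, not specs quoted from this catalog.
peak_tflops = 989.0    # peak FP16/BF16 TFLOP/s (hypothetical device)
peak_bw_gbs = 3350.0   # peak memory bandwidth in GB/s (hypothetical device)

# Ridge point: the arithmetic intensity (FLOP/byte) at which a kernel
# transitions from bandwidth-bound to compute-bound.
ridge_point = (peak_tflops * 1e12) / (peak_bw_gbs * 1e9)

def likely_bound(intensity_flop_per_byte: float) -> str:
    """Classify a kernel's likely bottleneck from its arithmetic intensity."""
    if intensity_flop_per_byte >= ridge_point:
        return "compute-bound"
    return "bandwidth-bound"
```

A catalog entry citing these specs should state the rule in this transferable form (intensity vs. ridge point), not as a per-device observation.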

How This Catalog Works

  1. The index below maps trigger conditions → optimization → detail file.
  2. The orchestrator reads the index to pick optimizations based on profiling results.
  3. The kernel designer reads detail files to implement specific optimizations.
  4. New optimizations: create a detail file in `docs/knowledge/optimizations/` and add a row to the index.
  5. Failed optimizations: create an anti-pattern file in `docs/knowledge/anti-patterns/` and add a row below.

Optimization Index

| ID | Optimization | Trigger Conditions | Est. Impact | Pitfalls | Detail File |
|---|---|---|---|---|---|
| O1 | Split-KV proportional block allocation | Dual-path kernel has strongly unbalanced latent vs decompressed work; equal block split leaves one path starved | Very High | Requires a reduction kernel and intermediate tensors | `docs/knowledge/optimizations/split-kv.md` |
| O2 | Head grouping for shared-operand reuse | Same operand is reused across heads (for example latent `Z`) and `Bh=1` produces skinny per-head MMAs or near-zero TC utilization | Very High | Raises register pressure; not appropriate for per-head-only operands like decompressed `K/V` | `docs/knowledge/optimizations/head-grouping.md` |
| O3 | Independent path scheduling | After Split-KV, the latent and decompressed paths still want different streaming granularities or load scheduling | Low-Med | Gains are modest if register pressure and occupancy do not improve | `docs/knowledge/optimizations/independent-tiles.md` |
| O4 | Swizzle scheduling | Tiled kernel reuses one operand across neighboring CTAs and block order measurably affects L2 locality | Low | Compute-bound kernels may see only marginal benefit; 1D remap adds index overhead | `docs/knowledge/optimizations/swizzle.md` |
| O5 | Latency hints | Load/compute overlap tuning is needed after the kernel structure is already sound, and the per-path DRAM pressure differs enough that compiler guidance may help scheduling | Low | Difficult to isolate; hints are suggestions rather than guarantees | `docs/knowledge/optimizations/latency-hints.md` |
| O6 | CGA thread block clusters | Hopper+ kernel has genuine cross-CTA data sharing opportunity or needs to reason correctly about `num_ctas` semantics and cluster launch behavior | Medium | Hardware-specific and easy to misuse when there is no real sharing benefit | `docs/knowledge/optimizations/cga.md` |
| O7 | Fast math for online softmax | Online-softmax loop is heavy on `exp2`/division, larger tiles collapse compute throughput, and reduced-precision inference math is acceptable | High | Not a substitute for sane tiling; gains shrink if spills or bandwidth dominate | `docs/knowledge/optimizations/fast-math-online-softmax.md` |
| O8 | Causal K-loop split | Causal attention has many fully unmasked K-tiles, only diagonal/tail tiles need mask logic, and future tiles can be skipped | Medium-High | Can be neutral or negative if masking is no longer material on the target bottleneck | `docs/knowledge/optimizations/causal-k-loop-split.md` |
| O9 | Causal ProgramId remap | Triangular work causes wave-tail imbalance across CTAs and a simple launch-order reversal is available | Low-Med | Small gain if work distribution is already uniform or the grid is tiny | `docs/knowledge/optimizations/causal-program-id-remap.md` |
| O10 | Adaptive tile and occupancy autotuning | Best tile/occupancy point changes with sequence length or device envelope; fixed manual choice underfills short shapes or overcommits long ones | High | Search space can explode if not curated; cannot rescue a structurally broken kernel | `docs/knowledge/optimizations/adaptive-tile-autotuning.md` |
| O11 | Pipeline-driven low-occupancy scheduling | Large unavoidable per-CTA state forces one CTA/SM or very low occupancy, and the kernel must win via overlap and scheduling rather than more resident warps | High | Requires architecture/DSL support for fine-grained scheduling; cannot be faked with a monolithic CTA | `docs/knowledge/optimizations/pipeline-driven-low-occupancy.md` |
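
As a concrete illustration of what O7 targets, the online-softmax recurrence can be written with base-2 exponentials, which is how fast-math implementations typically lower `exp`. This NumPy sketch is reference math for the recurrence, not kernel code:

```python
import numpy as np

LOG2E = 1.4426950408889634  # log2(e): exp(x) becomes exp2(x * LOG2E)

def online_softmax(score_tiles):
    """Streaming softmax statistics over a sequence of 1-D score tiles.

    Maintains a running max m and running normalizer l, rescaling l when
    the max grows. This is the exp2-heavy inner loop O7 targets.
    """
    m = float("-inf")
    l = 0.0
    for tile in score_tiles:
        m_new = max(m, float(tile.max()))
        # Rescale the old normalizer to the new max, then add this tile's
        # contribution; exp is expressed via exp2 to mirror fast-math codegen.
        l = l * np.exp2((m - m_new) * LOG2E) \
            + float(np.exp2((tile - m_new) * LOG2E).sum())
        m = m_new
    return m, l
```

The recurrence yields the same `m` and `l` as a single-pass softmax over the concatenated scores, which is why tiles can be streamed in any granularity.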

Anti-Pattern Index

| ID | Anti-Pattern | Failure Mode | Source | Detail File |
|---|---|---|---|---|
| A1 | Equal block split across heterogeneous paths | 50/50 block allocation for unequal latent vs decompressed work produces sequential bottlenecks instead of overlap | MLA-var6+ V0 | `docs/knowledge/anti-patterns/equal-block-split.md` |
| A2 | `TILE_M=1` combine/decompression | Single-row combine GEMMs disable Tensor Cores and leave a persistent tail at small `s` | MLA-var6+ V0/V2/V3 | `docs/knowledge/anti-patterns/tile-m-one-combine.md` |
| A3 | Underfilled persistent concurrency | Resident block budgets below the SM count leave the GPU idle and make overlap ineffective | MLA-var6+ V4 | `docs/knowledge/anti-patterns/underfilled-persistent.md` |
| A4 | Register-ceiling persistent stage | Persistent kernel lands at the register ceiling (255 regs/thread), collapses occupancy, and becomes latency-bound | MLA-var6+ V4 | `docs/knowledge/anti-patterns/spill-heavy-persistent.md` |
| A5 | Blind large-tile port | Copying a tile shape from another device without revalidating registers, SMEM, local-memory traffic, and grid size crosses a resource cliff | FlashAttention learning chain | `docs/knowledge/anti-patterns/blind-large-tile-port.md` |
| A6 | Uniform causal masking | Paying full causal-mask logic on every K-tile wastes work even though most tiles are fully valid or fully skipped | FlashAttention learning chain | `docs/knowledge/anti-patterns/uniform-causal-masking.md` |
| A7 | Sub-16 MMA dimension | Shrinking any effective MMA dimension (`M`, `N`, or `K`) below 16 usually breaks Tensor Core coverage and explodes latency instead of fixing occupancy | FlashMLA V1 manual probes + Tensor Core shape constraints | `docs/knowledge/anti-patterns/sub-16-mma-dimension.md` |
| A8 | Blind dataflow flip | Rewriting a resource-sensitive Tensor Core kernel into an algebraically equivalent operand orientation changes codegen enough to reintroduce spills or worsen SMEM behavior | FlashMLA V1 → V2 | `docs/knowledge/anti-patterns/blind-dataflow-flip.md` |
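
A6 and O8 are two sides of the same observation: in a causal kernel only the K-tiles straddling the diagonal need mask logic. A host-side sketch of the tile classification, with hypothetical function and parameter names:

```python
def classify_k_tiles(q_tile_idx, num_k_tiles, tile_m, tile_n):
    """Split the K-tiles seen by one causal query tile into three regions:
    fully valid (no mask needed), diagonal (mask needed), and skipped.

    Uses the causal rule k <= q for query rows [q_start, q_end] and
    key columns [k_start, k_end]. Names are illustrative, not a real API.
    """
    q_start = q_tile_idx * tile_m
    q_end = q_start + tile_m - 1          # last query row in this tile
    full, diag, skip = [], [], []
    for k in range(num_k_tiles):
        k_start = k * tile_n
        k_end = k_start + tile_n - 1      # last key column in this tile
        if k_end <= q_start:
            full.append(k)                # every (q, k) pair is unmasked
        elif k_start > q_end:
            skip.append(k)                # entirely in the future: skip it
        else:
            diag.append(k)                # straddles the diagonal: apply mask
    return full, diag, skip
```

A6 is the failure mode of running the `diag` mask path over all three regions; O8 is the optimization of running it only where this classification demands it.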

Cross-Referencing: Metrics → Optimizations

| Profiling Metric State | Likely Optimizations |
|---|---|
| Equal-duration kernels with obviously unequal work | O1 (Split-KV proportional block allocation) |
| `Bh=1` or per-head skinny MMAs on a shared operand | O2 (Head grouping) |
| Dual-path kernel plateaus near the ridge point after Split-KV | O3 (Independent path scheduling) |
| L2 hit < 90% and neighboring CTAs reload the same operand tiles | O4 (Swizzle scheduling) |
| Same kernel structure but different path-level memory pressure suggests compiler scheduling changes may matter | O5 (Latency hints) |
| Cross-CTA data sharing on Hopper+ or confusion about what `num_ctas` actually controls | O6 (CGA thread block clusters) |
| Tensor Core/MMA structure is sound, but online-softmax special functions dominate the hot loop | O7 (Fast math for online softmax) |
| Causal kernel still spends meaningful time in mask/control flow because many K-tiles are fully valid and future tiles are fully skipped | O8 (Causal K-loop split) |
| Causal/triangular workload shows tail imbalance across CTAs or waves finish unevenly | O9 (Causal ProgramId remap) |
| Short shapes underfill the GPU while long shapes prefer different tiles or occupancy hints | O10 (Adaptive tile and occupancy autotuning) |
| Registers/SMEM force one CTA/SM, eligible warps stay very low, and the kernel is compute-bound but still under-utilizes Tensor Cores | O11 (Pipeline-driven low-occupancy scheduling) |
| Combine stage has `TILE_M=1` and TC Util = 0% | A2 (`TILE_M=1` combine/decompression) |
| Persistent resident blocks < SM count or waves/SM < 1 | A3 (Underfilled persistent concurrency) |
| Persistent stage at 255 regs/thread with single-digit occupancy | A4 (Register-ceiling persistent stage) |
| Imported large tile causes register/SMEM cliffs, local-memory traffic, or abrupt grid shrinkage on the target device | A5 (Blind large-tile port) |
| Causal masking logic is paid uniformly across all K-tiles despite obvious fully-valid and fully-skipped regions | A6 (Uniform causal masking) |
| Candidate tile drives any effective MMA dimension below 16, or Tensor Core FLOPs collapse far below algorithmic FLOPs after a "smaller tile" change | A7 (Sub-16 MMA dimension) |
| Proposed optimization is only an algebraic operand/dataflow flip, especially near a register or SMEM cliff | A8 (Blind dataflow flip) |
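
The table above reads as an ordered rule list, so a toy router over a few of its rows might look like the following. The profiling metric names (the keys of `profile`) are hypothetical; the IDs match the catalog:

```python
# Toy router over a few rows of the metrics table. Metric names are
# hypothetical placeholders; a real orchestrator would map its profiler's
# fields onto these predicates.
RULES = [
    ("O4", lambda p: p.get("l2_hit_rate", 1.0) < 0.90),
    ("A2", lambda p: p.get("tile_m") == 1 and p.get("tc_util", 1.0) == 0.0),
    ("A3", lambda p: p.get("resident_blocks", float("inf")) < p.get("sm_count", 0)),
    ("A4", lambda p: p.get("regs_per_thread", 0) >= 255),
]

def route(profile: dict) -> list:
    """Return catalog IDs whose trigger conditions match the profile."""
    return [rule_id for rule_id, predicate in RULES if predicate(profile)]

candidates = route({"l2_hit_rate": 0.82, "regs_per_thread": 255,
                    "resident_blocks": 96, "sm_count": 132})
```

Each matched ID points the orchestrator at the corresponding detail file under `docs/knowledge/optimizations/` or `docs/knowledge/anti-patterns/`.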

Knowledge Base Update Protocol

For the full update protocol, detail file template, and step-by-step instructions, load the `/orchestration-workflow` skill.