tilegym-cutile-autotuning

CuTile Autotuning

Add autotuning to CuTile kernels using the `exhaustive_search` API with the tune-once/cache/direct-launch pattern.

Instructions

Follow the decision tree to classify the kernel, design a search space, implement the tune-once/cache/launch pattern, and validate performance.

1. **Classify**: use the Decision Tree to determine the search dimensions (occupancy-only vs. full tile search).
2. **Design search space**: select the matching template from `references/kernel-type-templates.md`; prune to ≤ 30 configs in the final code via arch filters (directed exploration probes may temporarily exceed this; see Design Philosophy).
3. **Implement**: add `exhaustive_search` + cache + `ct.launch` following the Step-by-Step Workflow; handle in-place writes with split-buffer if needed.
4. **Test**: run correctness tests with autotune enabled and with `DISABLE_AUTOTUNE=1`.
5. **Validate**: A/B benchmark against the fixed best-known config; see `references/search-strategies.md`.
6. **Shrink**: prune dead-weight configs that never win, targeting ≤ 8 configs per architecture to minimize compilation cost (Step 10).
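The Validate step can be sketched as a small helper that turns paired timings into a verdict. This is a minimal illustration, not the methodology from `references/search-strategies.md`: the function names and the sample timings are hypothetical, and the only value taken from this guide is the [0.98, 1.02] noise band quoted under "When to stop tuning".

```python
from statistics import mean

NOISE_BAND = (0.98, 1.02)  # noise floor quoted in this guide


def speedup(baseline_ms, tuned_ms):
    """Speedup of the tuned kernel over the fixed-config baseline (>1 is faster)."""
    return baseline_ms / tuned_ms


def classify_ab(baseline_ms, tuned_ms):
    """Compare paired timings (one pair per shape) and label the mean speedup."""
    ratios = [speedup(b, t) for b, t in zip(baseline_ms, tuned_ms)]
    avg = mean(ratios)
    lo, hi = NOISE_BAND
    if avg > hi:
        verdict = "speedup"
    elif avg < lo:
        verdict = "regression"
    else:
        verdict = "noise"
    return avg, verdict


# Hypothetical timings in milliseconds for three shapes.
baseline = [1.00, 2.00, 4.00]
tuned = [0.80, 1.60, 3.20]
avg, verdict = classify_ab(baseline, tuned)
print(f"{avg:.2f}x -> {verdict}")
```

A "noise" verdict is a prompt to check whether the search space covers the key dimensions for the kernel type, not proof that no config can help.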

Task Router — Jump to What You Need

| What are you trying to do? | Go to |
| --- | --- |
| Add autotune to a new kernel (most common) | Quick Reference below → Workflow: Adding Autotune → `references/kernel-type-templates.md` (pick by kernel type: T1=elementwise, T2=in-place, T3=matmul, T4=persistent, T5=FMHA, T6=FP8, T7=grouped GEMM, T8=varlen attention, T9=dual-GEMM fusion) |
| Debug: data corruption / wrong results after first run | Pitfall #1 (In-Place Kernel) |
| Debug: autotune taking 5+ minutes | Pitfall #2 (Compilation Timeout) |
| Debug: search space generator returning zero configs | Pitfall #5 first; also check arch filters, size guards, and `num_ctas` constraints |
| Optimize an existing autotune config | Workflow: Optimizing an Existing Config |

Quick Reference — Occupancy-Only Autotune (Tune-Once/Cache/Launch)

Most CuTile kernels (elementwise, reduction, LayerNorm) need only occupancy tuning. Copy this pattern:

```python
from types import SimpleNamespace
from cuda.tile.tune import exhaustive_search
import cuda.tile as ct
import torch

def _my_autotune_configs():
    for occ in [1, 2, 4, 8]:
        yield SimpleNamespace(occupancy=occ)

# Module-level cache: tune once, launch fast forever after
_autotune_cache = {}

# `my_kernel` is your CuTile kernel, defined elsewhere;
# the `...` in the argument tuples stands for its remaining arguments.
def my_op(x, output):
    stream = torch.cuda.current_stream()
    NUM_SM = torch.cuda.get_device_properties(x.device).multi_processor_count
    M = x.shape[0]  # assumption: work-item count is the leading dimension; adapt to your kernel

    # Cache key: anything that affects optimal config (use str() for device)
    cache_key = (x.shape, x.dtype, str(x.device))

    if cache_key not in _autotune_cache:
        configs = list(_my_autotune_configs())
        result = exhaustive_search(
            configs,
            stream,
            grid_fn=lambda cfg: (min(NUM_SM * cfg.occupancy, M), 1, 1),
            kernel=my_kernel,
            args_fn=lambda cfg: (x, output, ...),
            hints_fn=lambda cfg: {"occupancy": cfg.occupancy},
        )
        best_cfg = result.best.config
        tuned_kernel = my_kernel.replace_hints(occupancy=best_cfg.occupancy)
        _autotune_cache[cache_key] = (best_cfg, tuned_kernel)  # cache BOTH

    # Hot path: reuse the cached config and kernel object, no search or recompile
    cfg, tuned_kernel = _autotune_cache[cache_key]
    grid = (min(NUM_SM * cfg.occupancy, M), 1, 1)
    ct.launch(stream, grid, tuned_kernel, (x, output, ...))
```

Key rules:
- **Tune once, cache, launch directly** — `exhaustive_search` runs only on first call per shape; subsequent calls use cached config + `ct.launch` with zero overhead
- For in-place kernels use split-buffer during search (separate input/output tensors)
- Keep ≤ 30 configs in final code (see Design Philosophy for temporary directed probes)
- `exhaustive_search` requires a `Sequence` (list/tuple) — convert generators with `list()`
- **Search space must include the original fixed config** — this guarantees autotuning never makes performance worse

**When to use this pattern**: Kernel has fixed block size (not tile-size tunable). Includes: elementwise (SwiGLU, GeGLU), reduction (RMSNorm, LayerNorm), RoPE, and persistent kernels with heuristic block sizes (grouped GEMM).

For complex kernels (matmul with tile sizes, FMHA, FP8 with num_ctas), read the full guide below + [`kernel-type-templates.md`](references/kernel-type-templates.md).

> **⚠️ Three pitfalls catch almost everyone — check before submitting:**
> - **`replace_hints` on hot path?** → Cache BOTH config AND kernel object from `exhaustive_search`. Calling `replace_hints()` every invocation recompiles (100–500× slower) → Pitfall #7
> - **In-place kernel** (writes back to input tensor)? → MUST use split-buffer pattern during search → Pitfall #1
> - **Search space empty?** → Check arch filters and `num_ctas` constraints → Pitfall #5

> **Minimum coverage**: On sm100+, FMHA/matmul/varlen search spaces must include both `num_ctas=1` and `num_ctas=2`. For core dimensions (tile sizes, occupancy), keep at least 2 distinct values even if unsure which is better — let `exhaustive_search` decide.

> **When to stop tuning**: A mean speedup in [0.98, 1.02] means your *current* search space isn't helping — but doesn't mean no config will help. Before stopping, check whether you've covered the key dimensions for this kernel type (consult `references/kernel-type-templates.md`). If the search space already covers the template's recommended dimensions and the best result is still noise-floor, then stop — further micro-adjustments won't help. If key dimensions are missing (e.g., never tried `num_ctas=2` for a dual-GEMM kernel), expand the search space rather than giving up.
>
> Once correctness tests pass and the autotuned kernel shows speedup over the fixed-config baseline, **stop — do not re-run to "confirm".** GPU kernel timing fluctuates ±5–10 % between invocations due to clock scaling and OS scheduling; a subsequent timing dip does not mean your code is wrong.
>
> To improve speedup, only modify the autotune search space (configs, tile sizes, occupancy, num_ctas). Do not modify other code (Python wrapper, stream management, etc.) to chase speedup — kernel performance is determined by the config selection, not by host-side code.
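The key rules and the minimum-coverage note can be checked mechanically before a search space is committed. The sketch below is a hypothetical helper, not part of CuTile: `check_search_space` and its parameters are invented for illustration, and it only encodes the rules stated in this section (a `Sequence`, ≤ 30 configs, the original fixed config included, at least two values per core dimension, and both `num_ctas` values where required).

```python
from types import SimpleNamespace

MAX_FINAL_CONFIGS = 30  # limit stated in the key rules


def check_search_space(configs, baseline, core_dims, needs_num_ctas_pair=False):
    """Return a list of rule violations; an empty list means the space passes."""
    problems = []
    if not isinstance(configs, (list, tuple)):
        problems.append("configs must be a list or tuple, not a generator")
        configs = list(configs)
    if not configs:
        problems.append("search space is empty")
    if len(configs) > MAX_FINAL_CONFIGS:
        problems.append(f"{len(configs)} configs exceeds {MAX_FINAL_CONFIGS}")
    if not any(vars(c) == vars(baseline) for c in configs):
        problems.append("original fixed config is missing")
    for dim in core_dims:
        values = {getattr(c, dim) for c in configs}
        if len(values) < 2:
            problems.append(f"dimension {dim!r} has fewer than 2 distinct values")
    if needs_num_ctas_pair:
        seen = {getattr(c, "num_ctas", None) for c in configs}
        if not {1, 2} <= seen:
            problems.append("num_ctas must include both 1 and 2")
    return problems


baseline = SimpleNamespace(occupancy=2)  # hypothetical original fixed config
configs = [SimpleNamespace(occupancy=o) for o in (1, 2, 4, 8)]
print(check_search_space(configs, baseline, core_dims=["occupancy"]))  # prints []
```

Running a check like this in a unit test keeps a later pruning pass from silently dropping the baseline config.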


Reading Guide

- **Occupancy-only kernels** (elementwise, reduction, persistent with fixed block sizes): Quick Reference + Pitfall Checklist is sufficient; skip the `references/` docs. For in-place kernels, also read Pitfall #1.
- **Complex kernels** (matmul with tunable tile sizes, FMHA, FP8 with `num_ctas`): Quick Reference → Decision Tree → API Reference → Step-by-Step Workflow → relevant `references/` docs.

**5-step summary**: Classify kernel → Design search space (`parameter-space-design.md`) → Implement using template (`kernel-type-templates.md`) → Validate with A/B test → Check Pitfall Checklist.

**Reading references**: Read only the reference relevant to your kernel type. For example, for FMHA, read the Template 5 section in `references/kernel-type-templates.md`; for hardware constraints, read only the target architecture's section. Avoid reading all references end-to-end when a targeted lookup suffices.

Design Philosophy

Build a small, precise search space bottom-up rather than trimming down a large one. CuTile compilation is much heavier than Triton's (~0.5-1 s per config), so the final code should contain ≤ 30 configs. The approach: classify the kernel type first, then construct only the configs relevant to that type and architecture.

**Directed exploration during development**: If the initial template configs yield a speedup < 1.0, you may run a temporary larger probe (30–100 configs) via `bash + python3 -c` to identify which dimensions matter. The probe must be directional, not a blind Cartesian product. Use the kernel type classification to decide which dimensions to vary (e.g., for dual-GEMM, probe `num_ctas × occupancy` while fixing tile sizes; for FMHA, probe `TILE_M × num_ctas` while fixing `TILE_N`). Once the probe identifies the winning region, lock the final code's search space to ≤ 8 top candidates. Do NOT write the large probe into the source file; it is a one-shot diagnostic tool.
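A directed probe and the "lock to the top candidates" step might look like the sketch below. Everything here is illustrative: the helper names, the default tile sizes of 128, and the placeholder timings are assumptions, and a real probe would time each config on the GPU instead of using made-up numbers.

```python
from itertools import product
from types import SimpleNamespace


def dual_gemm_probe(tile_m=128, tile_n=128):
    """Directed probe: vary num_ctas x occupancy, keep tile sizes fixed."""
    return [
        SimpleNamespace(TILE_M=tile_m, TILE_N=tile_n, num_ctas=c, occupancy=o)
        for c, o in product([1, 2], [1, 2, 4, 8])
    ]


def top_candidates(timings_ms, keep=8):
    """Keep the `keep` fastest configs from a list of (config, time_ms) pairs."""
    ranked = sorted(timings_ms, key=lambda pair: pair[1])
    return [cfg for cfg, _ in ranked[:keep]]


probe = dual_gemm_probe()
# Placeholder timings standing in for real measurements.
fake_timings = [(cfg, 1.0 + 0.1 * i) for i, cfg in enumerate(probe)]
final = top_candidates(fake_timings, keep=4)
print(len(probe), len(final))  # prints: 8 4
```

Only the list returned by `top_candidates` would be written into the source file; the probe itself stays a throwaway script.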

Decision Tree: What Search Dimensions Does This Kernel Need?


All kernels should have autotuning added. The question is not whether to autotune, but what dimensions to search:

What type of kernel is this?
├── Compute-bound (matmul, GEMM, FMHA) → Does it have multiple tunable dimensions (tile sizes)?
│   ├── YES → Is it a fused multi-GEMM kernel (dual-GEMM, e.g. Linear+GLUAct)?
│   │   ├── YES → Template 9: low occupancy (1–2), conservative tiles (2× SHMEM/register pressure)
│   │   └── NO  → Full search: TILE_M × TILE_N × (TILE_K) × occupancy × num_ctas
│   │             (see matmul/FMHA templates in kernel-type-templates.md)
│   └── NO  → Occupancy-only search: [1, 2, 4, 8]
│             (see Quick Reference above)
├── Balanced (LayerNorm, reduction + compute) →
│   Occupancy-only search: [1, 2, 4, 8]
│   Expected benefit: 2-15%
└── Memory-bound (CE Loss, pure elementwise) →
    Occupancy-only search: [1, 2, 4, 8]
    Expected benefit: 0-15% (varies by kernel; zero-cost after tuning)

Why memory-bound kernels only search occupancy (not `num_ctas` or tile sizes):

- **`num_ctas` has zero benefit**: `num_ctas > 1` enables TMA multicast, where multiple CTAs share tile data in shared memory (e.g., matmul A/B tiles reused across CTAs). Memory-bound kernels use per-element `ct.gather`/`ct.scatter` with no tile reuse, so multi-CTA cooperation adds overhead with no data-sharing benefit.
- **Tile sizes are pre-determined**: `BLOCK_SIZE` for memory-bound kernels is determined by an offline sweep (e.g., 1024 is globally optimal on B200 across [256, 512, 1024, 2048, 4096, 8192]). This is a constant, not a runtime tunable.
- **Occupancy is the only effective knob**: Higher occupancy lets the GPU hide memory latency by switching to another CTA while one is stalled on a memory request.

**Evidence (CE Loss experiment)**: A 12-config search (occupancy × `num_ctas`) on Cross-Entropy Loss yielded only a 2.5% gain (0.79x → 0.81x vs Triton). The `num_ctas` dimension contributed nothing; the result was reverted because compilation cost outweighed the marginal benefit. An occupancy-only search (4 configs) achieves the same result in one third of the compilation time.

**Note on memory-bound kernels**: Adding occupancy-only autotune is always worthwhile because:

- The tune-once/cache/launch pattern has zero runtime overhead after the first call.
- The search space is tiny (4 configs, ~2-4 s compilation).
- Even small improvements have value at scale.
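The decision tree can be written down as a small lookup, which also makes the compile-cost arithmetic explicit. This is a sketch under stated assumptions: the function names and the string labels for kernel kinds are hypothetical, and the per-config cost range is the ~0.5-1 s figure quoted in Design Philosophy.

```python
def search_dimensions(kind, tunable_tiles=False, fused_multi_gemm=False):
    """Map a kernel classification to the dimensions worth searching."""
    if kind == "compute_bound" and tunable_tiles:
        if fused_multi_gemm:
            # Template 9: low occupancy, conservative tiles.
            return {"template": 9, "occupancy": [1, 2]}
        return {"dims": ["TILE_M", "TILE_N", "TILE_K", "occupancy", "num_ctas"]}
    if kind in ("compute_bound", "balanced", "memory_bound"):
        return {"dims": ["occupancy"], "occupancy": [1, 2, 4, 8]}
    raise ValueError(f"unknown kernel kind: {kind!r}")


def compile_seconds(n_configs, per_config=(0.5, 1.0)):
    """Rough compile-time range for a search space of n_configs."""
    low, high = per_config
    return n_configs * low, n_configs * high


print(search_dimensions("memory_bound"))
print(compile_seconds(4))   # occupancy-only search: (2.0, 4.0)
print(compile_seconds(12))  # occupancy x num_ctas search: (6.0, 12.0)
```

The two `compile_seconds` calls reproduce the comparison from the CE Loss experiment: dropping the `num_ctas` dimension cuts the search from 12 configs to 4, and the estimated compile time by the same factor of three.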

Occupancy Selection Guide

Occupancy controls how many CTAs run concurrently per SM. Use this as a starting point when designing the occupancy search space:

| Occupancy Range | Best For | Example Kernels |
| --- | --- | --- |
| 1–4 | Compute-bound (heavy math) | Complex transforms, matmul |
| 4–8 | Balanced (GEMM, TMA) | Matrix multiply, FMHA |
| 8–16 | Memory-bound (reductions) | Softmax, LayerNorm |
| 16–32 | Very light (copies, casts) | Type conversions, elementwise |

Use these ranges to seed your initial search space. For occupancy-only kernels, `[1, 2, 4, 8]` covers most cases; see Quick Reference above.
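One way to turn the occupancy table into seed values is to take the powers of two inside each range. The sketch below assumes that power-of-two stepping, which the table does not prescribe, and uses hypothetical category keys.

```python
# Ranges copied from the occupancy table; the keys are illustrative labels.
OCCUPANCY_RANGES = {
    "compute_bound": (1, 4),
    "balanced": (4, 8),
    "memory_bound": (8, 16),
    "very_light": (16, 32),
}


def seed_occupancies(category):
    """Powers of two inside the table's range for the given category."""
    low, high = OCCUPANCY_RANGES[category]
    seeds, value = [], 1
    while value <= high:
        if value >= low:
            seeds.append(value)
        value *= 2
    return seeds


for name in OCCUPANCY_RANGES:
    print(name, seed_occupancies(name))
```

For the occupancy-only pattern the guide still recommends `[1, 2, 4, 8]`; the seeded ranges are a starting point when that default shows no gain.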

exhaustive_search API Reference

See `references/api-reference.md` for the full `exhaustive_search` API surface: current signature, `TuningResult`, the tune-once/cache/launch pattern, `replace_hints`, kernel hints, `search_space` design, and `grid_fn` patterns.

Step-by-Step Workflow

See `references/workflow.md` for the end-to-end workflow: adding autotune to a new kernel, handling existing multi-architecture configs, integration with `torch.autograd.Function`, cross-backend config transfer (Triton → CuTile), and optimizing an existing config.

Pitfall Checklist

See `references/pitfalls.md` for the full list of common pitfalls: in-place data corruption, compilation timeout, cold-cache performance skew, NCU profiling interference, `search_space` generator exhaustion, FP8 precision loss, and `replace_hints` recompilation on hot paths.
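The generator-exhaustion pitfall is easy to reproduce in plain Python. The sketch below is one plausible reading of that pitfall, not a quote from `references/pitfalls.md`: the arch filter threshold and the `materialize` helper are hypothetical, and the point is only that a generator yields its configs once and that an over-strict filter yields none.

```python
from types import SimpleNamespace


def occupancy_configs(arch_major):
    """Hypothetical generator with an arch filter (sm90+ only in this sketch)."""
    if arch_major < 9:
        return  # filter rejects everything: the generator yields nothing
    for occ in [1, 2, 4, 8]:
        yield SimpleNamespace(occupancy=occ)


def materialize(gen):
    """Turn a generator into a list and fail loudly if it is empty."""
    configs = list(gen)
    if not configs:
        raise ValueError("search space is empty; check arch filters and guards")
    return configs


gen = occupancy_configs(arch_major=10)
first_pass = list(gen)
second_pass = list(gen)  # the generator is already exhausted
print(len(first_pass), len(second_pass))  # prints: 4 0

configs = materialize(occupancy_configs(arch_major=10))
print([c.occupancy for c in configs])  # prints: [1, 2, 4, 8]
```

Calling `list()` once and passing the resulting list around avoids both failure modes.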

Scope and Boundaries

This skill covers only autotune configuration: search space design, `exhaustive_search` invocation, caching, and `ct.launch` with tuned hints. It does not modify kernel code.

In scope (autotune config):

- Search space generator functions
- `exhaustive_search()` calls and result handling
- `kernel.replace_hints()` for applying tuned hints
- Cache logic (key design, dict management)
- `ct.launch()` with tuned kernel
- `DISABLE_AUTOTUNE` fallback path

Out of scope (kernel code modifications; do NOT make these changes):

- Math flags (`flush_to_zero`, `rounding_mode`)
- Performance hints (`slice_hint`, `buffer_depth`, `copy_config`)
- Memory access patterns (2D→1D gather/scatter conversion)
- Codegen optimizations (`safe_offs` → `padding_value`)
- Algorithm changes (K-loop split, load balancing)
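The cache logic and the `DISABLE_AUTOTUNE` fallback path from the in-scope list can be sketched without a GPU. This is an illustration under assumptions: `DEFAULT_CONFIG`, `pick_config`, and `fake_search` are hypothetical names, the stub stands in for an `exhaustive_search` call, and the guide only states that the variable is set to `1`, so other accepted values are not modeled.

```python
import os
from types import SimpleNamespace

DEFAULT_CONFIG = SimpleNamespace(occupancy=2)  # hypothetical fixed best-known config


def autotune_disabled(env=None):
    """True when DISABLE_AUTOTUNE is set to 1 in the environment."""
    env = os.environ if env is None else env
    return env.get("DISABLE_AUTOTUNE", "0") == "1"


def pick_config(cache, cache_key, run_search, env=None):
    """Return the fixed config when disabled; otherwise tune once and cache."""
    if autotune_disabled(env):
        return DEFAULT_CONFIG
    if cache_key not in cache:
        cache[cache_key] = run_search()
    return cache[cache_key]


calls = []


def fake_search():
    """Stub standing in for the real search; records how often it runs."""
    calls.append(1)
    return SimpleNamespace(occupancy=4)


cache = {}
key = ((1024, 1024), "float16", "cuda:0")
first = pick_config(cache, key, fake_search, env={})
second = pick_config(cache, key, fake_search, env={})
fallback = pick_config(cache, key, fake_search, env={"DISABLE_AUTOTUNE": "1"})
print(first.occupancy, second.occupancy, fallback.occupancy, len(calls))  # prints: 4 4 2 1
```

The search stub runs once for two tuned calls, which is the tune-once behaviour the correctness tests should exercise in both modes.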

Further Optimization Suggestions

After adding autotuning, the following kernel-level optimizations may yield additional gains. These are outside the scope of this skill: mention them to the user as potential next steps, but do not implement them as part of autotuning.

- **Math flags**: `flush_to_zero=True` + `rounding_mode=APPROX` can provide a 34-72% improvement for FMHA-class kernels (set via the environment variables `TILEIR_ENABLE_FTZ=1 TILEIR_ENABLE_APPROX=1` or in kernel code). Causal chain: larger tiles initially decrease performance by 18-43% due to subnormal handling overhead; enabling FTZ+APPROX rescues this and flips the result to +34-72%. Math flags are therefore a prerequisite for large-tile configs to be effective on FMHA-class kernels.
- **Performance hints**: `slice_hint`, `buffer_depth`, `copy_config`; these require modifying kernel IR code.
- **Memory access patterns**: using TMA loads (`ct.load`) instead of `ct.gather`; removing unnecessary bounds checks (`check_bounds=False` when safe).
- **Codegen quality**: using the `padding_value` parameter instead of manual `ct.where` masking; removing `safe_offs`.
- **Algorithm restructuring**: K-loop split, load balancing, algebraic simplification.

Differences from Triton Autotune

Key differences:

- Triton uses the `@triton.autotune` decorator with `Config(...)` objects; CuTile uses `exhaustive_search()` with `SimpleNamespace` configs + a separate cache + `ct.launch`.
- CuTile has no `num_warps`/`num_stages` (the compiler decides); the only knobs are tile sizes + `occupancy`/`num_ctas`.
- CuTile compilation is heavier (keep ≤ 30 configs in final code).
- The CuTile cache is user-managed and in-memory (no automatic persistence).
- CuTile separates `args_fn` (kernel args) from `hints_fn` (compiler hints).
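Because CuTile has no `num_warps` or `num_stages`, carrying a Triton config over mostly means dropping those keys and adding the CuTile-side knobs. The sketch below is a rough illustration only, not the transfer procedure from `references/workflow.md`: the plain-dict input, the key names, and the `triton_to_cutile` helper are all assumptions.

```python
from types import SimpleNamespace

TRITON_ONLY_KEYS = {"num_warps", "num_stages"}  # no CuTile counterpart


def triton_to_cutile(triton_cfg, occupancies=(1, 2)):
    """Drop Triton-only knobs and fan the remaining tile sizes out over occupancy."""
    kept = {k: v for k, v in triton_cfg.items() if k not in TRITON_ONLY_KEYS}
    return [SimpleNamespace(**kept, occupancy=occ) for occ in occupancies]


# Hypothetical Triton-style config expressed as a plain dict.
triton_cfg = {"TILE_M": 128, "TILE_N": 64, "num_warps": 8, "num_stages": 3}
cutile_cfgs = triton_to_cutile(triton_cfg)
for cfg in cutile_cfgs:
    print(cfg)
```

The resulting configs are only seeds; they still need to pass through `exhaustive_search` because the best tile sizes may differ between backends.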

Reference Documents

| Category | Document | Content |
| --- | --- | --- |
| API Reference | `api-reference.md` | `exhaustive_search` signature, `TuningResult`, tune-once/cache/launch pattern, `replace_hints`, kernel hints, `search_space` design, `grid_fn` patterns |
| Workflow | `workflow.md` | End-to-end workflow: adding autotune to a new kernel, multi-architecture configs, `torch.autograd.Function` integration, Triton→CuTile transfer, optimizing existing configs |
| Pitfalls | `pitfalls.md` | Common pitfalls: in-place corruption, compilation timeout, cold-cache skew, NCU interference, `search_space` exhaustion, FP8 precision, `replace_hints` recompilation |
| Parameter Design | `parameter-space-design.md` | Per-kernel-type parameter spaces, cross-arch patterns, `grid_fn` patterns, pruning rules |
| Search Strategies | `search-strategies.md` | Exhaustive search, A/B test methodology, `DISABLE_AUTOTUNE` pattern |
| Templates | `kernel-type-templates.md` | Copy-paste autotune templates for 9 kernel types (T1–T9) |
| Hardware | `hardware-constraints.md` | Per-architecture constraints, tile size ranges, `num_ctas` rules, TMA requirements |

Source Code References

Key files:

- `ops/cutile/matmul.py` (matmul autotune)
- `ops/cutile/attention.py` (FMHA autotune)
- `suites/unsloth/cutile/ct_ops.py` (shared `autotune_configs()`, occupancy=[1,2,4,8])
- `suites/unsloth/cutile/swiglu.py` (elementwise example)
- `suites/unsloth/cutile/rope_embedding.py` (split-buffer pattern)
- `suites/unsloth/cutile/grouped_gemm.py` (persistent GEMM, occupancy-only)

Worked Examples

Each example shows the before → after pattern: `fixed_launch.py` (hardcoded `ct.launch`) and `autotuned_launch.py` (refactored to tune-once/cache/launch).

| Directory | Kernel | Autotune Pattern | Complexity | Key Teaching Point |
| --- | --- | --- | --- | --- |
| `assets/examples/01_rmsnorm_occupancy_only/` | RMSNorm (reduction) | Occupancy-only `[1,2,4,8]` | Low | Most common pattern: no tile tuning, just find the best occupancy. Grid = `NUM_SM * cfg.occupancy`. Not in-place. |
| `assets/examples/02_matmul_full_search/` | GEMM C=A@B | Full: `TILE_M/N/K` + `occupancy` + `num_ctas` (sm90+) | High | Compute-bound kernel with multiple tunable dimensions. `args_fn` passes tile sizes as `ct.Constant[int]`. `grid_fn` depends on `cfg`. ≤ 30 configs. |
| `assets/examples/03_rope_inplace_splitbuffer/` | RoPE embedding (in-place) | Occupancy-only, with split-buffer | Medium | In-place kernel MUST use split-buffer during search to avoid corruption. Search writes to scratch; final `ct.launch` uses real in-place args. |
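Why the third example needs a split buffer can be seen with a toy model in plain Python. The sketch below does not use CuTile: `scale_inplace` is a hypothetical stand-in for an in-place kernel, and each loop iteration plays the role of one timed trial during the search.

```python
def scale_inplace(src, dst, factor=2.0):
    """Toy stand-in for an in-place kernel: dst[i] = src[i] * factor."""
    for i, value in enumerate(src):
        dst[i] = value * factor


def search_inplace(x, n_configs):
    """Unsafe: every trial overwrites x, so later trials see corrupted input."""
    for _ in range(n_configs):
        scale_inplace(x, x)


def search_split_buffer(x, n_configs):
    """Safe: trials read x and write to a scratch buffer of the same size."""
    scratch = [0.0] * len(x)
    for _ in range(n_configs):
        scale_inplace(x, scratch)


corrupted = [1.0, 2.0]
search_inplace(corrupted, n_configs=4)

intact = [1.0, 2.0]
search_split_buffer(intact, n_configs=4)
scale_inplace(intact, intact)  # final launch uses the real in-place arguments

print(corrupted, intact)  # prints: [16.0, 32.0] [2.0, 4.0]
```

With four trials the in-place search has already applied the operation four times before the real launch, which matches the "wrong results after first run" symptom listed in the Task Router.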