tilegym-cutile-autotuning
CuTile Autotuning
Add autotuning to CuTile kernels using the `exhaustive_search` API with the tune-once/cache/direct-launch pattern.
Instructions
Follow the decision tree to classify the kernel, design a search space, implement the tune-once/cache/launch pattern, and validate performance.
- Classify: use the Decision Tree to determine search dimensions (occupancy-only vs full tile search)
- Design search space: select the matching template from `references/kernel-type-templates.md`; prune to ≤ 30 configs in the final code via arch filters (directed exploration probes may temporarily exceed this; see Design Philosophy)
- Implement: add `exhaustive_search` + cache + `ct.launch` following the Step-by-Step Workflow; handle in-place writes with split-buffer if needed
- Test: run correctness with autotune enabled and with `DISABLE_AUTOTUNE=1`
- Validate: A/B benchmark against fixed best-known config; see `references/search-strategies.md`
- Shrink: prune dead-weight configs that never win, targeting ≤ 8 configs per architecture to minimize compilation cost (Step 10)
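
The Test step relies on a `DISABLE_AUTOTUNE=1` switch, but this page does not show how the wrapper reads it. The following is only a minimal sketch, assuming the wrapper checks the environment variable on each call and falls back to a fixed config. `FIXED_CONFIG`, `autotune_disabled`, `pick_config`, and the `tune` callback are hypothetical names, not part of the CuTile API.

```python
import os
from types import SimpleNamespace

# Hypothetical best-known fixed config used when autotuning is disabled.
FIXED_CONFIG = SimpleNamespace(occupancy=2)


def autotune_disabled() -> bool:
    """Return True when DISABLE_AUTOTUNE=1 is set (assumed semantics)."""
    return os.environ.get("DISABLE_AUTOTUNE", "0") == "1"


def pick_config(tune, fixed=FIXED_CONFIG):
    """Use the fixed config when autotune is disabled, otherwise call tune().

    `tune` stands in for the tune-once/cache step built around exhaustive_search.
    """
    if autotune_disabled():
        return fixed
    return tune()
```

Running the correctness suite once with the variable unset and once with it set to `1` exercises both branches of `pick_config`.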
Task Router — Jump to What You Need
| What are you trying to do? | Go to |
|---|---|
| Add autotune to a new kernel (most common) | Quick Reference below → Workflow: Adding Autotune → |
| Debug: data corruption / wrong results after first run | Pitfall #1 (In-Place Kernel) |
| Debug: autotune taking 5+ minutes | Pitfall #2 (Compilation Timeout) |
| Debug: search space generator returning zero configs | Pitfall #5 first; also check arch filters, size guards, and |
| Optimize an existing autotune config | Workflow: Optimizing an Existing Config |
Quick Reference — Occupancy-Only Autotune (Tune-Once/Cache/Launch)
Most CuTile kernels (elementwise, reduction, LayerNorm) need only occupancy tuning. Copy this pattern:
```python
from types import SimpleNamespace
from cuda.tile.tune import exhaustive_search
import cuda.tile as ct
import torch


def _my_autotune_configs():
    for occ in [1, 2, 4, 8]:
        yield SimpleNamespace(occupancy=occ)


# Module-level cache: tune once, launch fast forever after
_autotune_cache = {}


def my_op(x, output):
    # `my_kernel` is your CuTile kernel; `...` stands for its remaining arguments.
    stream = torch.cuda.current_stream()
    NUM_SM = torch.cuda.get_device_properties(x.device).multi_processor_count
    M = x.shape[0]  # assumption: M is the leading (row) dimension that caps the grid

    # Cache key: anything that affects optimal config (use str() for device)
    cache_key = (x.shape, x.dtype, str(x.device))
    if cache_key not in _autotune_cache:
        configs = list(_my_autotune_configs())
        result = exhaustive_search(
            configs,
            stream,
            grid_fn=lambda cfg: (min(NUM_SM * cfg.occupancy, M), 1, 1),
            kernel=my_kernel,
            args_fn=lambda cfg: (x, output, ...),
            hints_fn=lambda cfg: {"occupancy": cfg.occupancy},
        )
        best_cfg = result.best.config
        tuned_kernel = my_kernel.replace_hints(occupancy=best_cfg.occupancy)
        _autotune_cache[cache_key] = (best_cfg, tuned_kernel)  # cache BOTH

    cfg, tuned_kernel = _autotune_cache[cache_key]
    grid = (min(NUM_SM * cfg.occupancy, M), 1, 1)
    ct.launch(stream, grid, tuned_kernel, (x, output, ...))
```
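
The caching behaviour of this pattern can be checked without a GPU. The sketch below is a pure-Python simulation, not CuTile code: `FakeKernel`, `fake_search`, and `launch_grid` are hypothetical stand-ins, the default of 132 SMs is a placeholder, and treating `shape[0]` as the row count is an assumption. It models each `replace_hints` call as one recompilation and shows that the second call with the same key reuses the cached kernel object.

```python
from types import SimpleNamespace


class FakeKernel:
    """Stand-in for a CuTile kernel object; counts simulated recompilations."""

    compile_count = 0

    def __init__(self, occupancy=1):
        self.occupancy = occupancy

    def replace_hints(self, occupancy):
        # In this sketch every replace_hints call models one recompilation.
        FakeKernel.compile_count += 1
        return FakeKernel(occupancy)


def fake_search(configs):
    """Stand-in for exhaustive_search: pretend occupancy=4 is always fastest."""
    return min(configs, key=lambda cfg: abs(cfg.occupancy - 4))


_cache = {}


def launch_grid(shape, dtype, device, kernel, num_sm=132):
    """Tune on the first call per cache key, then reuse the cached pair."""
    key = (tuple(shape), dtype, str(device))
    if key not in _cache:
        configs = [SimpleNamespace(occupancy=o) for o in (1, 2, 4, 8)]
        best = fake_search(configs)
        _cache[key] = (best, kernel.replace_hints(occupancy=best.occupancy))
    cfg, tuned = _cache[key]
    rows = shape[0]
    # Same grid formula as the Quick Reference: cap the CTA count at the row count.
    return (min(num_sm * cfg.occupancy, rows), 1, 1), tuned
```

Calling `launch_grid` twice with the same shape, dtype, and device increments `FakeKernel.compile_count` only once; a new shape triggers one more simulated compilation.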
Key rules:
- **Tune once, cache, launch directly** — `exhaustive_search` runs only on first call per shape; subsequent calls use cached config + `ct.launch` with zero overhead
- For in-place kernels use split-buffer during search (separate input/output tensors)
- Keep ≤ 30 configs in final code (see Design Philosophy for temporary directed probes)
- `exhaustive_search` requires a `Sequence` (list/tuple) — convert generators with `list()`
- **Search space must include the original fixed config** — this guarantees autotuning never makes performance worse
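
Several of these rules can be enforced in one small helper. This is a sketch under stated assumptions: `build_configs`, `occupancy_configs`, and the `keep` filter are hypothetical names, and the filter merely stands in for whatever arch or size guards your generator applies. It materializes the generator into a list, always keeps the original fixed config, and rejects spaces above the 30-config cap.

```python
from types import SimpleNamespace

MAX_CONFIGS = 30  # final-code cap stated in this guide


def build_configs(candidates, baseline, keep=lambda cfg: True):
    """Materialize a config generator, filter it, and guarantee the baseline.

    candidates: any iterable (for example a generator) of SimpleNamespace configs
    baseline:   the original fixed config, always kept in the result
    keep:       hypothetical arch/size filter; returns True for usable configs
    """
    configs = [cfg for cfg in candidates if keep(cfg)]  # generator -> list
    if baseline not in configs:
        configs.insert(0, baseline)
    if len(configs) > MAX_CONFIGS:
        raise ValueError(f"{len(configs)} configs exceeds the cap of {MAX_CONFIGS}")
    return configs


def occupancy_configs():
    for occ in (1, 2, 4, 8):
        yield SimpleNamespace(occupancy=occ)
```

Because the baseline is inserted after filtering, an over-strict filter degrades to a one-config search space rather than an empty one.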
**When to use this pattern**: Kernel has fixed block size (not tile-size tunable). Includes: elementwise (SwiGLU, GeGLU), reduction (RMSNorm, LayerNorm), RoPE, and persistent kernels with heuristic block sizes (grouped GEMM).
For complex kernels (matmul with tile sizes, FMHA, FP8 with num_ctas), read the full guide below + [`kernel-type-templates.md`](references/kernel-type-templates.md).
> **⚠️ Three pitfalls catch almost everyone — check before submitting:**
> - **`replace_hints` on hot path?** → Cache BOTH config AND kernel object from `exhaustive_search`. Calling `replace_hints()` every invocation recompiles (100–500× slower) → Pitfall #7
> - **In-place kernel** (writes back to input tensor)? → MUST use split-buffer pattern during search → Pitfall #1
> - **Search space empty?** → Check arch filters and `num_ctas` constraints → Pitfall #5
> **Minimum coverage**: On sm100+, FMHA/matmul/varlen search spaces must include both `num_ctas=1` and `num_ctas=2`. For core dimensions (tile sizes, occupancy), keep at least 2 distinct values even if unsure which is better — let `exhaustive_search` decide.
> **When to stop tuning**: A mean speedup in [0.98, 1.02] means your *current* search space isn't helping — but doesn't mean no config will help. Before stopping, check whether you've covered the key dimensions for this kernel type (consult `references/kernel-type-templates.md`). If the search space already covers the template's recommended dimensions and the best result is still noise-floor, then stop — further micro-adjustments won't help. If key dimensions are missing (e.g., never tried `num_ctas=2` for a dual-GEMM kernel), expand the search space rather than giving up.
>
> Once correctness tests pass and the autotuned kernel shows speedup over the fixed-config baseline, **stop — do not re-run to "confirm".** GPU kernel timing fluctuates ±5–10 % between invocations due to clock scaling and OS scheduling; a subsequent timing dip does not mean your code is wrong.
>
> To improve speedup, only modify the autotune search space (configs, tile sizes, occupancy, num_ctas). Do not modify other code (Python wrapper, stream management, etc.) to chase speedup; kernel performance is determined by the config selection, not by host-side code.
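
The stopping rule above can be written down as a small decision helper. This is a sketch, not part of the skill: `mean_speedup` and `next_step` are hypothetical names, and defining the mean speedup as the average of per-shape `baseline / tuned` time ratios is an assumption. The "run directed probe" branch follows the Design Philosophy guidance for speedups below 1.0.

```python
from statistics import mean

NOISE_BAND = (0.98, 1.02)  # noise-floor interval quoted above


def mean_speedup(baseline_ms, tuned_ms):
    """Mean of per-shape speedups; each speedup is baseline time / tuned time."""
    return mean(b / t for b, t in zip(baseline_ms, tuned_ms))


def next_step(baseline_ms, tuned_ms, covers_template_dims):
    """Map an A/B result to the action described in the guidance above."""
    s = mean_speedup(baseline_ms, tuned_ms)
    lo, hi = NOISE_BAND
    if lo <= s <= hi:
        # Noise floor: stop only if the key dimensions were already searched.
        return "stop" if covers_template_dims else "expand search space"
    return "keep autotune" if s > hi else "run directed probe"
```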
Reading Guide
- Occupancy-only kernels (elementwise, reduction, persistent with fixed block sizes): Quick Reference + Pitfall Checklist is sufficient; skip the `references/` docs. For in-place kernels, also read Pitfall #1.
- Complex kernels (matmul with tunable tile sizes, FMHA, FP8 with num_ctas): Quick Reference → Decision Tree → API Reference → Step-by-Step Workflow → relevant `references/` docs.

5-step summary: Classify kernel → Design search space (`parameter-space-design.md`) → Implement using template (`kernel-type-templates.md`) → Validate with A/B test → Check Pitfall Checklist.

Reading references: Read only the reference relevant to your kernel type. For example, for FMHA, read the Template 5 section in `references/kernel-type-templates.md`; for hardware constraints, read only the target architecture's section. Avoid reading all references end-to-end when a targeted lookup suffices.
Design Philosophy
Build a small, precise search space bottom-up — not a large space trimmed down. CuTile compilation is much heavier than Triton (~0.5-1s per config), so the final code should contain ≤ 30 configs. The approach is: classify the kernel type first, then construct only the relevant configs for that type and architecture.
Directed exploration during development: If the initial template configs yield speedup < 1.0, you may run a temporary larger probe (30–100 configs) via `bash + python3 -c` to identify which dimensions matter, but this probe must be directional, not a blind cartesian product. Use the kernel type classification to decide which dimensions to vary (e.g. for dual-GEMM, probe `num_ctas × occupancy` while fixing tile sizes; for FMHA, probe `TILE_M × num_ctas` while fixing TILE_N). Once the probe identifies the winning region, lock the final code's search space to ≤ 8 top candidates. Do NOT write the large probe into the source file; it is a one-shot diagnostic tool.
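
To make the difference between a directed probe and a blind cartesian product concrete, here is a small sketch. The tile and occupancy values are illustrative only and are not taken from the templates; `blind_product`, `dual_gemm_probe`, and `compile_seconds` are hypothetical helpers. The compile-time estimate uses the ~0.5-1 s per config figure quoted in this section.

```python
from itertools import product
from types import SimpleNamespace

TILE_M = (64, 128, 256)  # illustrative values, not taken from the templates
TILE_N = (64, 128, 256)
OCCUPANCY = (1, 2, 4, 8)
NUM_CTAS = (1, 2)


def blind_product():
    """Every combination of every dimension: the pattern to avoid."""
    return [SimpleNamespace(TILE_M=m, TILE_N=n, occupancy=o, num_ctas=c)
            for m, n, o, c in product(TILE_M, TILE_N, OCCUPANCY, NUM_CTAS)]


def dual_gemm_probe(tile_m=128, tile_n=128):
    """Directed probe: vary num_ctas x occupancy with tile sizes held fixed."""
    return [SimpleNamespace(TILE_M=tile_m, TILE_N=tile_n, occupancy=o, num_ctas=c)
            for c, o in product(NUM_CTAS, OCCUPANCY)]


def compile_seconds(n_configs, per_config=(0.5, 1.0)):
    """Rough compile-time range from the per-config cost quoted in this section."""
    return n_configs * per_config[0], n_configs * per_config[1]
```

With these illustrative values the blind product has 72 configs, while the directed dual-GEMM probe has 8 and compiles in roughly 4 to 8 seconds.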
Decision Tree: What Search Dimensions Does This Kernel Need?
All kernels should have autotuning added. The question is not whether to autotune, but what dimensions to search:
```
What type of kernel is this?
├── Compute-bound (matmul, GEMM, FMHA) → Does it have multiple tunable dimensions (tile sizes)?
│   ├── YES → Is it a fused multi-GEMM kernel (dual-GEMM, e.g. Linear+GLUAct)?
│   │   ├── YES → Template 9: low occupancy (1–2), conservative tiles (2× SHMEM/register pressure)
│   │   └── NO  → Full search: TILE_M × TILE_N × (TILE_K) × occupancy × num_ctas
│   │             (see matmul/FMHA templates in kernel-type-templates.md)
│   └── NO → Occupancy-only search: [1, 2, 4, 8]
│            (see Quick Reference above)
├── Balanced (LayerNorm, reduction + compute) →
│   Occupancy-only search: [1, 2, 4, 8]
│   Expected benefit: 2-15%
└── Memory-bound (CE Loss, pure elementwise) →
    Occupancy-only search: [1, 2, 4, 8]
    Expected benefit: 0-15% (varies by kernel; zero-cost after tuning)
```

Why memory-bound kernels only search occupancy (not num_ctas or tile sizes):
- `num_ctas` has zero benefit: `num_ctas > 1` enables TMA multicast, where multiple CTAs share tile data in shared memory (e.g., matmul A/B tiles reused across CTAs). Memory-bound kernels use per-element `ct.gather`/`ct.scatter` with no tile reuse, so multi-CTA cooperation adds overhead with no data sharing benefit.
- Tile sizes are pre-determined: BLOCK_SIZE for memory-bound kernels is determined by offline sweep (e.g., 1024 is globally optimal on B200 across [256, 512, 1024, 2048, 4096, 8192]). This is a constant, not a runtime tunable.
- Occupancy is the only effective knob: Higher occupancy lets the GPU hide memory latency by switching to another CTA while one is stalled on a memory request.

Evidence (CE Loss experiment): A 12-config search (occupancy × num_ctas) on Cross-Entropy Loss yielded only a 2.5% gain (0.79x → 0.81x vs Triton). The `num_ctas` dimension contributed nothing; the result was reverted because compilation cost outweighed the marginal benefit. Occupancy-only (4 configs) achieves the same result in roughly one third of the compilation time.
Note on memory-bound kernels: Adding occupancy-only autotune is always worthwhile because:
- The tune-once/cache/launch pattern has zero runtime overhead after the first call
- The search space is tiny (4 configs, ~2-4s compilation)
- Even small improvements have value at scale
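
The decision tree in this section can be restated as a lookup function. This is a sketch for illustration: `search_dimensions` and the `"compute"` / `"balanced"` / `"memory"` labels are hypothetical, and the Template 9 tile values are not listed on this page, so the function only marks them as conservative.

```python
def search_dimensions(kind, tunable_tiles=False, fused_multi_gemm=False):
    """Return the search dimensions for a kernel, following the decision tree.

    kind: "compute", "balanced", or "memory" (hypothetical labels).
    """
    occupancy_only = {"occupancy": [1, 2, 4, 8]}
    if kind in ("balanced", "memory"):
        return occupancy_only
    if kind != "compute":
        raise ValueError(f"unknown kernel kind: {kind!r}")
    if not tunable_tiles:
        return occupancy_only
    if fused_multi_gemm:
        # Template 9: low occupancy, conservative tiles (values not listed here).
        return {"occupancy": [1, 2], "tiles": "conservative"}
    return {"dims": ["TILE_M", "TILE_N", "TILE_K", "occupancy", "num_ctas"]}
```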
Occupancy Selection Guide
Occupancy controls how many CTAs run concurrently per SM. Use this as a starting point when designing the occupancy search space:
| Occupancy Range | Best For | Example Kernels |
|---|---|---|
| 1–4 | Compute-bound (heavy math) | Complex transforms, matmul |
| 4–8 | Balanced (GEMM, TMA) | Matrix multiply, FMHA |
| 8–16 | Memory-bound (reductions) | Softmax, LayerNorm |
| 16–32 | Very light (copies, casts) | Type conversions, elementwise |
Use these ranges to seed your initial search space. For occupancy-only kernels, `[1, 2, 4, 8]` covers most cases; see Quick Reference above.
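
One way to turn the table into seed values is sketched below. The ranges are copied from the table; the class labels used as dictionary keys and the choice to sample only powers of two inside each range are assumptions made for this example.

```python
# Occupancy ranges copied from the table above; the keys are hypothetical labels.
OCCUPANCY_RANGES = {
    "compute_bound": (1, 4),
    "balanced": (4, 8),
    "memory_bound": (8, 16),
    "very_light": (16, 32),
}


def seed_occupancies(kernel_class):
    """Powers of two inside the table's range for the given class (an assumption)."""
    lo, hi = OCCUPANCY_RANGES[kernel_class]
    values, occ = [], 1
    while occ <= hi:
        if occ >= lo:
            values.append(occ)
        occ *= 2
    return values
```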
exhaustive_search API Reference
See references/api-reference.md for the full `exhaustive_search` API surface: current signature, `TuningResult`, the tune-once/cache/launch pattern, `replace_hints`, kernel hints, `search_space` design, and `grid_fn` patterns.
Step-by-Step Workflow
See references/workflow.md for the end-to-end workflow: adding autotune to a new kernel, handling existing multi-architecture configs, integration with `torch.autograd.Function`, cross-backend config transfer (Triton → CuTile), and optimizing an existing config.
Pitfall Checklist
See references/pitfalls.md for the full list of common pitfalls: in-place data corruption, compilation timeout, cold-cache performance skew, NCU profiling interference, `search_space` generator exhaustion, FP8 precision loss, and `replace_hints` recompilation on hot paths.
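
The in-place corruption pitfall is easy to reproduce without a GPU. The sketch below is a pure-Python toy, not CuTile code: `scale_kernel`, `search_in_place`, and `search_split_buffer` are hypothetical names. It shows why a search that runs an in-place kernel once per config compounds its writes, and how a split-buffer search leaves the caller's tensor untouched.

```python
def scale_kernel(src, dst, factor=2.0):
    """Toy stand-in for a kernel launch: dst[i] = src[i] * factor."""
    for i, value in enumerate(src):
        dst[i] = value * factor


def search_in_place(x, n_configs):
    """Unsafe: every trial writes back into x, so trials compound."""
    for _ in range(n_configs):
        scale_kernel(x, x)


def search_split_buffer(x, n_configs):
    """Safe: trials read a private copy and write to scratch; x is untouched."""
    src, scratch = list(x), [0.0] * len(x)
    for _ in range(n_configs):
        scale_kernel(src, scratch)
```

After the split-buffer search, a single real launch on `x` produces the expected result, whereas the in-place search has already scaled `x` once per trial.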
Scope and Boundaries
This skill covers only autotune configuration: search space design, `exhaustive_search` invocation, caching, and `ct.launch` with tuned hints. It does not modify kernel code.
In scope (autotune config):
- Search space generator functions
- `exhaustive_search()` calls and result handling
- `kernel.replace_hints()` for applying tuned hints
- Cache logic (key design, dict management)
- `ct.launch()` with tuned kernel
- `DISABLE_AUTOTUNE` fallback path
Out of scope (kernel code modifications — do NOT make these changes):
- Math flags (flush_to_zero, rounding_mode)
- Performance Hints (slice_hint, buffer_depth, copy_config)
- Memory access patterns (2D→1D gather/scatter conversion)
- Codegen optimizations (safe_offs → padding_value)
- Algorithm changes (K-loop split, load balancing)
Further Optimization Suggestions
After adding autotuning, the following kernel-level optimizations may yield additional gains. These are outside the scope of this skill — mention them to the user as potential next steps, but do not implement them as part of autotuning:
- Math flags: `flush_to_zero=True` + `rounding_mode=APPROX` can provide 34-72% improvement for FMHA-class kernels (set via environment variables `TILEIR_ENABLE_FTZ=1 TILEIR_ENABLE_APPROX=1` or in kernel code). Causal chain: larger tiles initially decrease performance by 18-43% due to subnormal handling overhead; enabling FTZ+APPROX rescues this and flips the result to +34-72%. Math flags are therefore a prerequisite for large-tile configs to be effective on FMHA-class kernels.
- Performance Hints: `slice_hint`, `buffer_depth`, `copy_config` (requires modifying kernel IR code)
- Memory access patterns: Using TMA loads (`ct.load`) instead of `ct.gather`; removing unnecessary bounds checks (`check_bounds=False` when safe)
- Codegen quality: Using the `padding_value` parameter instead of manual `ct.where` masking; removing `safe_offs`
- Algorithm restructuring: K-loop split, load balancing, algebraic simplification
Differences from Triton Autotune
Key differences: Triton uses the `@triton.autotune` decorator with `Config(...)` objects; CuTile uses `exhaustive_search()` with `SimpleNamespace` configs + separate cache + `ct.launch`. CuTile has no `num_warps` / `num_stages` (the compiler decides), only tile sizes + `occupancy` + `num_ctas`. CuTile compilation is heavier (keep ≤30 configs in final code). CuTile cache is user-managed in-memory (no automatic persistence). CuTile separates `args_fn` (kernel args) from `hints_fn` (compiler hints).
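
When porting an existing Triton search space, the differences above suggest a mechanical first pass: drop the knobs CuTile does not expose and carry the tile sizes over. The sketch below illustrates that idea only. `triton_to_cutile` is a hypothetical helper, the input is a plain dict rather than a real `triton.Config`, and the `BLOCK_*` to `TILE_*` renaming is an assumed naming convention.

```python
from types import SimpleNamespace

# Triton launch knobs with no CuTile counterpart (the compiler decides them).
TRITON_ONLY = ("num_warps", "num_stages")


def triton_to_cutile(triton_cfg, occupancy=1, num_ctas=1):
    """Convert a Triton-style config dict into a CuTile-style SimpleNamespace.

    The BLOCK_* -> TILE_* renaming is an assumed convention for this sketch.
    """
    fields = {"occupancy": occupancy, "num_ctas": num_ctas}
    for name, value in triton_cfg.items():
        if name in TRITON_ONLY:
            continue  # dropped: not tunable in CuTile
        fields[name.replace("BLOCK_", "TILE_")] = value
    return SimpleNamespace(**fields)
```

The translated configs are only a starting point; `occupancy` and `num_ctas` still need their own search values.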
Reference Documents
| Category | Document | Content |
|---|---|---|
| API Reference | `references/api-reference.md` | |
| Workflow | `references/workflow.md` | End-to-end workflow: adding autotune to a new kernel, multi-architecture configs, |
| Pitfalls | `references/pitfalls.md` | Common pitfalls: in-place corruption, compilation timeout, cold-cache skew, NCU interference, |
| Parameter Design | `parameter-space-design.md` | Per-kernel-type parameter spaces, cross-arch patterns, grid_fn patterns, pruning rules |
| Search Strategies | `references/search-strategies.md` | Exhaustive search, A/B test methodology, DISABLE_AUTOTUNE pattern |
| Templates | `references/kernel-type-templates.md` | Copy-paste autotune templates for 8 kernel types |
| Hardware | | Per-architecture constraints, tile size ranges, num_ctas rules, TMA requirements |
Source Code References
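Key files: `ops/cutile/matmul.py` (matmul autotune), `ops/cutile/attention.py` (FMHA autotune), `suites/unsloth/cutile/ct_ops.py` (shared `autotune_configs()` occupancy=[1,2,4,8]), `suites/unsloth/cutile/swiglu.py` (elementwise example), `suites/unsloth/cutile/rope_embedding.py` (split-buffer pattern), `suites/unsloth/cutile/grouped_gemm.py` (persistent GEMM, occupancy-only).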
Worked Examples
Each example shows the before → after pattern: `fixed_launch.py` (hardcoded `ct.launch`) and `autotuned_launch.py` (refactored to tune-once/cache/launch).

| Directory | Kernel | Autotune Pattern | Complexity | Key Teaching Point |
|---|---|---|---|---|
| | RMSNorm (reduction) | Occupancy-only | Low | Most common pattern: no tile tuning, just find best occupancy. Grid = |
| | GEMM C=A@B | Full: | High | Compute-bound kernel with multiple tunable dimensions. |
| | RoPE embedding (in-place) | Occupancy-only, with split-buffer | Medium | In-place kernel MUST use split-buffer during search to avoid corruption. Search writes to scratch; final |