cutile-autotuning
CuTile Autotuning
Add autotuning to CuTile kernels using the `exhaustive_search` API with the tune-once/cache/direct-launch pattern.
Instructions
Follow the decision tree to classify the kernel, design a search space, implement the tune-once/cache/launch pattern, and validate performance.
- Classify: use the Decision Tree to determine search dimensions (occupancy-only vs full tile search)
- Design search space: select the matching template from `references/kernel-type-templates.md`; prune to ≤ 30 configs via arch filters
- Implement: add `exhaustive_search` + cache + `ct.launch` following the Step-by-Step Workflow; handle in-place writes with split-buffer if needed
- Test: run correctness with autotune enabled and with `DISABLE_AUTOTUNE=1`
- Validate: A/B benchmark against fixed best-known config; see `references/search-strategies.md`
Task Router — Jump to What You Need
| What are you trying to do? | Go to |
|---|---|
| Add autotune to a new kernel (most common) | Quick Reference below → Workflow: Adding Autotune → |
| Debug: data corruption / wrong results after first run | Pitfall #1 (In-Place Kernel) |
| Debug: autotune taking 5+ minutes | Pitfall #2 (Compilation Timeout) |
| Debug: search space generator returning zero configs | Pitfall #5 first; also check arch filters, size guards, and `num_ctas` constraints |
| Optimize an existing autotune config | Workflow: Optimizing an Existing Config |
Quick Reference — Occupancy-Only Autotune (Tune-Once/Cache/Launch)
Most CuTile kernels (elementwise, reduction, LayerNorm) need only occupancy tuning. Copy this pattern:
```python
from types import SimpleNamespace

import torch
import cuda.tile as ct
from cuda.tile.tune import exhaustive_search


def _my_autotune_configs():
    for occ in [1, 2, 4, 8]:
        yield SimpleNamespace(occupancy=occ)


# Module-level cache: tune once, launch fast forever after
_autotune_cache = {}


def my_op(x, output):
    stream = torch.cuda.current_stream()
    NUM_SM = torch.cuda.get_device_properties(x.device).multi_processor_count
    M = x.shape[0]  # assumption: number of rows the kernel iterates over; adapt to your kernel

    # Cache key: anything that affects optimal config (use str() for device)
    cache_key = (x.shape, x.dtype, str(x.device))
    if cache_key not in _autotune_cache:
        configs = list(_my_autotune_configs())
        result = exhaustive_search(
            configs,
            stream,
            grid_fn=lambda cfg: (min(NUM_SM * cfg.occupancy, M), 1, 1),
            kernel=my_kernel,
            args_fn=lambda cfg: (x, output, ...),
            hints_fn=lambda cfg: {"occupancy": cfg.occupancy},
        )
        best_cfg = result.best.config
        tuned_kernel = my_kernel.replace_hints(occupancy=best_cfg.occupancy)
        _autotune_cache[cache_key] = (best_cfg, tuned_kernel)  # cache BOTH

    cfg, tuned_kernel = _autotune_cache[cache_key]
    grid = (min(NUM_SM * cfg.occupancy, M), 1, 1)
    ct.launch(stream, grid, tuned_kernel, (x, output, ...))
```
Key rules:
- **Tune once, cache, launch directly** — `exhaustive_search` runs only on first call per shape; subsequent calls use cached config + `ct.launch` with zero overhead
- For in-place kernels use split-buffer during search (separate input/output tensors)
- Keep ≤ 30 configs total
- `exhaustive_search` requires a `Sequence` (list/tuple) — convert generators with `list()`
- **Search space must include the original fixed config** — this guarantees autotuning never makes performance worse
**When to use this pattern**: Kernel has fixed block size (not tile-size tunable). Includes: elementwise (SwiGLU, GeGLU), reduction (RMSNorm, LayerNorm), RoPE, and persistent kernels with heuristic block sizes (grouped GEMM).
For complex kernels (matmul with tile sizes, FMHA, FP8 with num_ctas), read the full guide below + [`kernel-type-templates.md`](references/kernel-type-templates.md).
> **⚠️ Two pitfalls catch almost everyone — check before submitting:**
> - **In-place kernel** (writes back to input tensor)? → MUST use split-buffer pattern during search → Pitfall #1
> - **Search space empty?** → Check arch filters and `num_ctas` constraints → Pitfall #5
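To see why the split-buffer rule matters, recall that the search executes the kernel repeatedly as trial runs. The sketch below imitates that with plain Python lists instead of GPU tensors. The names `scale_in_place`, `scale_split`, and `run_trials` are hypothetical stand-ins invented for this illustration; they are not CuTile APIs.

```python
def scale_in_place(buf):
    """In-place 'kernel': doubles every element of buf."""
    for i in range(len(buf)):
        buf[i] *= 2


def scale_split(src, dst):
    """Split-buffer 'kernel': reads src, writes doubled values into dst."""
    for i, value in enumerate(src):
        dst[i] = value * 2


def run_trials(fn, num_trials):
    """Stand-in for a benchmark loop that launches the same kernel several times."""
    for _ in range(num_trials):
        fn()


# In-place: every trial run compounds on the previous one.
data = [1.0, 2.0, 3.0]
run_trials(lambda: scale_in_place(data), num_trials=3)
corrupted = list(data)  # doubled three times instead of once

# Split buffer: the input stays intact, so every trial sees the same data.
source = [1.0, 2.0, 3.0]
scratch = [0.0] * len(source)
run_trials(lambda: scale_split(source, scratch), num_trials=3)
```

With the split buffer, the scratch output can simply be discarded after the search, and the production launch then writes to the real tensor once.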
Reading Guide
- Occupancy-only kernels (elementwise, reduction, persistent with fixed block sizes): Quick Reference + Pitfall Checklist is sufficient; skip the `references/` docs. For in-place kernels, also read Pitfall #1.
- Complex kernels (matmul with tunable tile sizes, FMHA, FP8 with num_ctas): Quick Reference → Decision Tree → API Reference → Step-by-Step Workflow → relevant `references/` docs.

5-step summary: Classify kernel → Design search space (`parameter-space-design.md`) → Implement using template (`kernel-type-templates.md`) → Validate with A/B test → Check Pitfall Checklist.
Design Philosophy
Build a small, precise search space bottom-up — not a large space trimmed down. CuTile compilation is much heavier than Triton (~0.5-1s per config), so 30 configs is the hard upper limit. The approach is: classify the kernel type first, then construct only the relevant configs for that type and architecture. Never start with a large cartesian product and prune — start with the minimum viable space and expand only if data shows it's needed.
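As a rough illustration of the bottom-up approach, the sketch below builds an occupancy-only space, forces the existing fixed config into it, enforces the 30-config limit, and estimates compile time from the 0.5-1 s per-config figure quoted above. The helper names (`build_occupancy_space`, `check_budget`, `estimated_compile_seconds`) are made up for this example and are not part of CuTile.

```python
from types import SimpleNamespace

MAX_CONFIGS = 30                 # hard upper limit stated in this guide
SECONDS_PER_CONFIG = (0.5, 1.0)  # rough per-config compile cost quoted above


def build_occupancy_space(baseline_occupancy, candidates=(1, 2, 4, 8)):
    """Start from the minimum viable space and make sure the baseline is in it."""
    occupancies = sorted(set(candidates) | {baseline_occupancy})
    return [SimpleNamespace(occupancy=occ) for occ in occupancies]


def check_budget(configs):
    """Reject spaces that are empty or exceed the config budget."""
    if not configs:
        raise ValueError("search space is empty")
    if len(configs) > MAX_CONFIGS:
        raise ValueError(f"{len(configs)} configs exceeds the limit of {MAX_CONFIGS}")
    return configs


def estimated_compile_seconds(num_configs):
    """Return a (low, high) estimate of total compile time in seconds."""
    low, high = SECONDS_PER_CONFIG
    return (num_configs * low, num_configs * high)


space = check_budget(build_occupancy_space(baseline_occupancy=2))
print(len(space), estimated_compile_seconds(len(space)))
```

Expanding the space later means adding candidates deliberately and re-running `check_budget`, rather than generating a cartesian product and filtering it.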
Decision Tree: What Search Dimensions Does This Kernel Need?
All kernels should have autotuning added. The question is not whether to autotune, but what dimensions to search:
```
What type of kernel is this?
├── Compute-bound (matmul, GEMM, FMHA) → Does it have multiple tunable dimensions (tile sizes)?
│   ├── YES → Is it a fused multi-GEMM kernel (dual-GEMM, e.g. Linear+GLUAct)?
│   │   ├── YES → Template 9: low occupancy (1–2), conservative tiles (2× SHMEM/register pressure)
│   │   └── NO  → Full search: TILE_M × TILE_N × (TILE_K) × occupancy × num_ctas
│   │             (see matmul/FMHA templates in kernel-type-templates.md)
│   └── NO → Occupancy-only search: [1, 2, 4, 8]
│            (see Quick Reference above)
├── Balanced (LayerNorm, reduction + compute) →
│   Occupancy-only search: [1, 2, 4, 8]
│   Expected benefit: 2-15%
└── Memory-bound (CE Loss, pure elementwise) →
    Occupancy-only search: [1, 2, 4, 8]
    Expected benefit: 0-15% (varies by kernel; zero-cost after tuning)
```
Why memory-bound kernels only search occupancy (not num_ctas or tile sizes):
- `num_ctas` has zero benefit: `num_ctas > 1` enables TMA multicast, where multiple CTAs share tile data in shared memory (e.g., matmul A/B tiles reused across CTAs). Memory-bound kernels use per-element `ct.gather`/`ct.scatter` with no tile reuse, so multi-CTA cooperation adds overhead with no data-sharing benefit.
- Tile sizes are pre-determined: BLOCK_SIZE for memory-bound kernels is determined by offline sweep (e.g., 1024 is globally optimal on B200 across [256, 512, 1024, 2048, 4096, 8192]). This is a constant, not a runtime tunable.
- Occupancy is the only effective knob: Higher occupancy lets the GPU hide memory latency by switching to another CTA while one is stalled on a memory request.

Evidence (CE Loss experiment): A 12-config search (occupancy × num_ctas) on Cross-Entropy Loss yielded only a 2.5% gain (0.79x → 0.81x vs Triton). The `num_ctas` dimension contributed nothing; the result was reverted because compilation cost outweighed the marginal benefit. Occupancy-only (4 configs) achieves the same result in one third of the compilation time.
Note on memory-bound kernels: Adding occupancy-only autotune is always worthwhile because:
- The tune-once/cache/launch pattern has zero runtime overhead after the first call
- The search space is tiny (4 configs, ~2-4s compilation)
- Even small improvements have value at scale
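The decision tree can also be written down as a small lookup function, which may help when several kernels need classifying at once. This is only a sketch: the class labels (`"compute"`, `"balanced"`, `"memory"`), the function name `search_dimensions`, and the returned dictionary layout are assumptions chosen for the example.

```python
def search_dimensions(kernel_class, has_tunable_tiles=False, fused_multi_gemm=False):
    """Map a kernel classification to the dimensions worth searching."""
    if kernel_class not in ("compute", "balanced", "memory"):
        raise ValueError(f"unknown kernel class: {kernel_class!r}")

    if kernel_class == "compute" and has_tunable_tiles:
        if fused_multi_gemm:
            # Fused multi-GEMM: low occupancy, conservative tiles
            return {"kind": "fused-multi-gemm", "occupancy": [1, 2]}
        # Full search over tile sizes and both hints
        return {
            "kind": "full",
            "dims": ["TILE_M", "TILE_N", "TILE_K", "occupancy", "num_ctas"],
        }

    # Everything else: occupancy is the only knob
    return {"kind": "occupancy-only", "occupancy": [1, 2, 4, 8]}


print(search_dimensions("memory"))
print(search_dimensions("compute", has_tunable_tiles=True))
```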
Occupancy Selection Guide
Occupancy controls how many CTAs run concurrently per SM. Use this as a starting point when designing the occupancy search space:
| Occupancy Range | Best For | Example Kernels |
|---|---|---|
| 1–4 | Compute-bound (heavy math) | Complex transforms, matmul |
| 4–8 | Balanced (GEMM, TMA) | Matrix multiply, FMHA |
| 8–16 | Memory-bound (reductions) | Softmax, LayerNorm |
| 16–32 | Very light (copies, casts) | Type conversions, elementwise |
Use these ranges to seed your initial search space. For occupancy-only kernels, `[1, 2, 4, 8]` covers most cases; see Quick Reference above.
exhaustive_search API Reference
⚠️ Deprecated API: `cuda.tile_experimental.autotune_launch()` (aka `ct_experimental.autotune_launch`) is deprecated and should NOT be used. It combines search + launch in one call with random sampling, which produces less reproducible results and worse config selection compared to `exhaustive_search`. Always use `cuda.tile.tune.exhaustive_search` (the current API below) with explicit caching and `ct.launch`.
Current API (`cuda.tile.tune`)
```python
from cuda.tile.tune import exhaustive_search, TuningResult

result: TuningResult = exhaustive_search(
    search_space,   # Sequence[T]: list or tuple of configs (NOT a generator)
    stream,         # torch.cuda.current_stream()
    grid_fn,        # callable(cfg) → tuple[int, ...]
    kernel,         # @ct.kernel decorated function
    args_fn,        # callable(cfg) → tuple of kernel args
    hints_fn=None,  # callable(cfg) → {"occupancy": int, "num_ctas": int}
    *,
    quiet=False     # suppress output
)
```
TuningResult
```python
@dataclass
class TuningResult[T]:
    best: Measurement                       # best config + timing (mean_us, error_margin_us, num_samples)
    successes: Sequence[Measurement]        # all successful configs (sorted by performance)
    failures: Sequence[tuple[T, str, str]]  # (config, exception_type, message)
```
Key properties:
- Exhaustive: evaluates ALL configs in order, with no random sampling and no skipped configs
- Search only: does not perform the final production launch. It executes trial runs internally for benchmarking, but you call `ct.launch` separately for the actual production invocation
- No built-in cache: you manage caching explicitly (see tune-once/cache/launch pattern)
- Deterministic: same search space always produces the same evaluation order
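A small reporting helper can make the `TuningResult` fields easier to inspect after a search. The sketch below relies only on the attributes listed above (`best`, `successes`, `failures`, and `config` / `mean_us` on each measurement). Because a real result requires a GPU run, it is exercised here with `SimpleNamespace` stand-ins whose timings are invented purely for illustration; `summarize_result` is a hypothetical helper, not part of `cuda.tile.tune`.

```python
from types import SimpleNamespace


def summarize_result(result):
    """Build a short human-readable report from a TuningResult-like object."""
    lines = [f"best: {vars(result.best.config)} at {result.best.mean_us:.1f} us"]
    for m in result.successes:
        lines.append(f"ok:   {vars(m.config)} {m.mean_us:.1f} us")
    for config, exc_type, message in result.failures:
        lines.append(f"fail: {vars(config)} {exc_type}: {message}")
    return lines


# Stand-in data (made-up numbers), shaped like the documented fields.
fast = SimpleNamespace(config=SimpleNamespace(occupancy=2), mean_us=10.5)
slow = SimpleNamespace(config=SimpleNamespace(occupancy=1), mean_us=14.0)
fake_result = SimpleNamespace(
    best=fast,
    successes=[fast, slow],
    failures=[(SimpleNamespace(occupancy=8), "RuntimeError", "out of resources")],
)

report = summarize_result(fake_result)
print("\n".join(report))
```

Logging the failures is worthwhile when a search space unexpectedly shrinks, since failed configs are otherwise silent.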
Tune-Once / Cache / Launch Pattern
This is the recommended pattern for all autotuned kernels. It ensures:
- First call: runs `exhaustive_search` to find the best config (~2-30s depending on space size)
- Subsequent calls: uses the cached config with `ct.launch`, with zero overhead (identical to a fixed `ct.launch`)
```python
_cache = {}


def run_kernel_autotuned(x, *kernel_args):
    stream = torch.cuda.current_stream()
    cache_key = (x.shape, x.dtype, str(x.device))
    if cache_key not in _cache:
        configs = list(_my_autotune_configs())
        result = exhaustive_search(
            configs, stream,
            grid_fn=lambda cfg: ...,  # placeholder: compute the launch grid from cfg
            kernel=my_kernel,
            args_fn=lambda cfg: ...,  # placeholder: build the kernel argument tuple
            hints_fn=lambda cfg: {"occupancy": cfg.occupancy},
        )
        best_cfg = result.best.config
        tuned_kernel = my_kernel.replace_hints(occupancy=best_cfg.occupancy)
        _cache[cache_key] = (best_cfg, tuned_kernel)  # cache BOTH config and compiled kernel

    cfg, tuned_kernel = _cache[cache_key]
    grid = compute_grid(cfg)  # placeholder for your grid computation
    ct.launch(stream, grid, tuned_kernel, (x, *kernel_args))
```
Why this pattern matters: The `ct.launch` call in the fast path is identical to what you'd write for a fixed-config kernel. There is no lock and no lambda invocation on the fast path; the only added per-call cost is the Python dict lookup for `_cache[cache_key]`.
⚠️ Critical: always cache the tuned kernel object, not just the config. `replace_hints()` returns a new kernel object with its own independent JIT cache. Calling it on every invocation triggers recompilation each time, degrading performance by 100–500×. Call `replace_hints()` once after `exhaustive_search`, store the returned kernel in the cache alongside the config, and reuse it directly on the fast path. See Pitfall #7.
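The cost of re-creating the tuned kernel on every call can be demonstrated without a GPU. The sketch below uses a `FakeKernel` class, invented for this example, that mimics only the behavior described above: each object returned by `replace_hints` starts with a cold compile state. It is not the CuTile kernel class, and the compile counter is a simulation.

```python
class FakeKernel:
    """Stand-in for a JIT-compiled kernel; NOT the CuTile kernel class."""

    compile_count = 0  # total number of simulated compilations

    def __init__(self, hints=None):
        self.hints = dict(hints or {})
        self._compiled = False  # each instance has its own JIT state

    def replace_hints(self, **hints):
        # Returns a new object, so its JIT state starts cold.
        return FakeKernel({**self.hints, **hints})

    def launch(self):
        if not self._compiled:
            FakeKernel.compile_count += 1
            self._compiled = True


base = FakeKernel()

# Anti-pattern: replace_hints on every call means one compilation per call.
FakeKernel.compile_count = 0
for _ in range(5):
    base.replace_hints(occupancy=4).launch()
compiles_without_cache = FakeKernel.compile_count

# Recommended: create the tuned kernel once, store it, reuse it.
FakeKernel.compile_count = 0
cache = {"key": base.replace_hints(occupancy=4)}
for _ in range(5):
    cache["key"].launch()
compiles_with_cache = FakeKernel.compile_count
```

In the simulation, five launches cost five compilations without the cache and one with it, which is the effect the cached `(best_cfg, tuned_kernel)` tuple is meant to guarantee.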
replace_hints
After finding the best config, use `kernel.replace_hints()` to create a kernel variant with the optimal hints:
```python
# For occupancy-only:
tuned_kernel = my_kernel.replace_hints(occupancy=cfg.occupancy)

# For occupancy + num_ctas:
tuned_kernel = my_kernel.replace_hints(occupancy=cfg.occupancy, num_ctas=cfg.num_ctas)
```
`replace_hints` accepts only `occupancy` and `num_ctas`; these are the only compiler hints controllable via the autotune API.
**`ByTarget` wrapping for cross-architecture portability**: When creating tuned kernel variants via `ct.kernel()`, prefer wrapping hint values in `ct.ByTarget` for portability across GPU architectures:
```python
# Preferred: explicit architecture targeting (portable)
tuned_kernel = ct.kernel(
    my_kernel._pyfunc,
    occupancy=ct.ByTarget(sm_100=best_cfg.occupancy),
    num_ctas=ct.ByTarget(sm_100=best_cfg.num_ctas, default=1),
)

# Also acceptable: plain integers (when targeting a single architecture)
tuned_kernel = ct.kernel(my_kernel._pyfunc, occupancy=best_cfg.occupancy)
```
When targeting only the current GPU (the common case in autotuning), plain integers work fine. Use `ByTarget` when the code may run on multiple architectures or when following production conventions (TileGym production code consistently uses `ByTarget`).
Kernel Hints
CuTile kernel performance is controlled by two compile-time hints:
- `occupancy`: Number of CTAs per SM. Higher occupancy = more parallelism but less shared memory per CTA.
- `num_ctas`: Number of CTAs in a CGA (Cooperative Grid Array). Used for multi-CTA cooperation (e.g., TMA multicast). Only supported on sm90+.

Three ways to set hints:
```python
# 1. Fixed value in decorator (no autotune needed)
@ct.kernel(occupancy=2, num_ctas=1)
def my_kernel(...): ...

# 2. Architecture-specific fixed value (no autotune needed)
@ct.kernel(num_ctas=ct.ByTarget(sm_100=2, sm_120=1, default=1))
def my_kernel(...): ...

# 3. Runtime autotune via exhaustive_search + replace_hints
# IMPORTANT: Remove fixed hints from decorator first!
@ct.kernel
def my_kernel(...): ...

# Then in the host wrapper:
tuned_kernel = my_kernel.replace_hints(occupancy=best_occ, num_ctas=best_ctas)
ct.launch(stream, grid, tuned_kernel, args)
```
**Important**: `replace_hints` correctly overrides decorator hints (it uses `dataclasses.replace()` internally). However, if you forget to call `replace_hints`, the decorator's fixed values are used instead of the autotuned values. To avoid this confusion, always remove fixed hints from the `@ct.kernel(...)` decorator before adding autotuning. This makes it explicit that hints come only from the autotune path.
search_space Design
The search space is a list of `SimpleNamespace` objects. Each namespace holds config fields that `grid_fn`, `args_fn`, and `hints_fn` can read.
```python
from types import SimpleNamespace

# Occupancy-only (elementwise kernels)
def autotune_configs():
    for occ in [1, 2, 4, 8]:
        yield SimpleNamespace(occupancy=occ)

# Full matmul search space: see parameter-space-design.md for complete per-architecture configs
# Pattern: yield SimpleNamespace(TILE_SIZE_M=..., TILE_SIZE_N=..., TILE_SIZE_K=..., num_ctas=..., occupancy=...)
```
**Note**: `exhaustive_search` requires a `Sequence` (list/tuple), not a generator. Always convert with `list()`:
```python
configs = list(autotune_configs())
result = exhaustive_search(configs, ...)
```
grid_fn Patterns
```python
from math import ceil

import torch

# Pattern A: Simple tile coverage (matmul, elementwise)
grid_fn=lambda cfg: (ceil(M / cfg.TILE_SIZE_M) * ceil(N / cfg.TILE_SIZE_N), 1, 1)

# Pattern B: Persistent matmul (static_persistent_matmul_kernel)
NUM_SMS = torch.cuda.get_device_properties("cuda").multi_processor_count
grid_fn=lambda cfg: (
    min(NUM_SMS // cfg.num_ctas, ceil(M / cfg.TILE_M) * ceil(N / cfg.TILE_N)) * cfg.occupancy,
    1, 1,
)

# Pattern C: 2D grid (FMHA: one dim for seq tiles, one for batch*heads)
grid_fn=lambda cfg: (ceil(q_len / cfg.TILE_M), batch_size * num_heads, 1)

# Pattern D: 1D elementwise (cdiv = math.ceil(a/b), from ct_ops.py)
grid_fn=lambda cfg: (cdiv(n_elements, BLOCK_SIZE),)

# Pattern E: Grouped GEMM persistent (grid fixed at NUM_SMS, occupancy via hints_fn only)
grid_fn=lambda cfg: (NUM_SMS, 1, 1)
```
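A worked example with concrete numbers may make Patterns A and B easier to check by hand. The sketch below wraps the two formulas in plain functions so they run without a GPU. The problem sizes, tile sizes, and the SM count of 148 are example values only; on real hardware the SM count should be queried from the device as shown above.

```python
from math import ceil
from types import SimpleNamespace


def tile_coverage_grid(M, N, cfg):
    """Pattern A: one CTA per output tile."""
    return (ceil(M / cfg.TILE_SIZE_M) * ceil(N / cfg.TILE_SIZE_N), 1, 1)


def persistent_grid(M, N, cfg, num_sms):
    """Pattern B: cap the grid at the SM budget, then scale by occupancy."""
    tiles = ceil(M / cfg.TILE_M) * ceil(N / cfg.TILE_N)
    return (min(num_sms // cfg.num_ctas, tiles) * cfg.occupancy, 1, 1)


M, N = 4096, 4096
NUM_SMS = 148  # example value only

cfg_a = SimpleNamespace(TILE_SIZE_M=128, TILE_SIZE_N=256)
cfg_b = SimpleNamespace(TILE_M=128, TILE_N=256, num_ctas=2, occupancy=1)

grid_a = tile_coverage_grid(M, N, cfg_a)        # 32 * 16 output tiles
grid_b = persistent_grid(M, N, cfg_b, NUM_SMS)  # min(148 // 2, 512) * 1

# For a small problem the tile count, not the SM budget, is the limit.
grid_small = persistent_grid(256, 256, cfg_b, NUM_SMS)
```

Because the grid depends on the config in both patterns, the production launch has to recompute it from the cached config rather than reuse a grid from a different config.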
Step-by-Step Workflow
Adding Autotune to a New Kernel
1. Classify the kernel using the decision tree above.
   - VERIFY: You know whether this is occupancy-only or requires tile-size tuning.
2. Remove hardcoded hints from decorator (strongly recommended): If the kernel currently has hardcoded hints in its decorator (e.g. `@ct.kernel(occupancy=2, num_ctas=1)`), remove those fixed hints and change to bare `@ct.kernel` before adding autotuning. While `replace_hints` does correctly override decorator values at runtime, leaving them creates a silent fallback trap: if any code path (e.g., `DISABLE_AUTOTUNE`, error handling, or a future refactor) skips `replace_hints`, the decorator's fixed hints are used instead of the autotuned values, and this produces no error, just silently worse performance. Removing them makes the failure mode explicit (missing hints → compiler defaults) rather than silent (wrong fixed hints used).
   - VERIFY: The `@ct.kernel` decorator has no `occupancy=` or `num_ctas=` arguments before proceeding. Use bare `@ct.kernel` instead.
3. Check for in-place writes: If the kernel modifies input tensors in-place, you MUST use the split-buffer pattern during `exhaustive_search`; see Pitfall #1.
   - VERIFY: Either the kernel is not in-place, or you have added a split-buffer scratch tensor for the search phase.
4. Select the template from `kernel-type-templates.md` based on kernel type.
5. Design the search space following `parameter-space-design.md`:
   - Start from reference configs, not from scratch. Clone configs from existing production kernels of the same type (e.g., `ops/cutile/matmul.py` for GEMM) and adapt. For GEMM-class kernels, `nvMatmulHeuristics` can suggest 8-16 high-quality candidates that reach 96-99% peak performance; see `parameter-space-design.md` for details.
   - Detect the current GPU architecture with `torch.cuda.get_device_capability()`.
   - Target one architecture at a time. Generate configs only for the detected arch. Do NOT add branches for other architectures: they cannot be tested on this machine and untested code paths are unreliable. If multi-arch support is needed later, add it in a separate pass on the appropriate hardware.
   - Identify tunable parameters (tile sizes, occupancy, num_ctas)
   - Ensure the search space includes the original fixed config (or an equivalent). This guarantees that the autotuned result is at least as good as the original, up to measurement noise.
   - If the generated set exceeds 30, apply tile size filters and pruning rules to reduce it to ≤ 30
   - VERIFY: Total configs ≤ 30 (hard limit: CuTile compilation is heavy, >30 configs will timeout).
6. Implement the tune-once/cache/launch pattern:
   - Define a `_cache` dict at module level
   - Define a cache key that captures all parameters affecting optimal config (shapes, dtypes, device, any flags like `is_causal`). ⚠️ Use `str(x.device)` not `x.device` in the cache key: `torch.device` objects are not reliably hashable and can cause `TypeError: unhashable type` at runtime. Always convert to string: `cache_key = (..., x.dtype, str(x.device))`. Tip: For GEMM-class kernels, round dimensions to the next power of 2 in the cache key (e.g., `cache_key = (next_pow2(M), next_pow2(N), next_pow2(K), dtype, str(device))`) to reduce unique key count and avoid re-tuning for similar shapes.
   - Call `exhaustive_search(list(configs), ...)` only when cache misses
   - Store `result.best.config` in cache, together with the tuned kernel object (see Pitfall #7)
   - Use `kernel.replace_hints(...)` to create the tuned kernel variant
   - Use `ct.launch()` for the actual kernel invocation
   - `grid_fn` correctly computes grid from config
   - `args_fn` passes all kernel arguments including tile sizes as `ct.Constant[int]`
   - `hints_fn` passes `occupancy` and/or `num_ctas` from config
   - VERIFY: `exhaustive_search` receives a `list()` of configs, not a raw generator.
7. (MANDATORY) Add DISABLE_AUTOTUNE support for CI and profiling: check `os.environ.get("DISABLE_AUTOTUNE", "0") == "1"`. When set, skip `exhaustive_search` entirely and fall back to `ct.launch` with the first valid config. This is required for:
   - CI determinism (autotune adds variable wall time)
   - NCU profiling (prevents autotune trial runs from cluttering the trace; see Pitfall #4)
   - Debugging (isolates kernel correctness from autotune behavior)

   Place the check before the cache lookup so that `DISABLE_AUTOTUNE=1` bypasses all autotune logic. Provide a hardcoded fallback config in case the generator yields zero configs.
   - VERIFY: Running with `DISABLE_AUTOTUNE=1` produces correct results and does not call `exhaustive_search`.
8. Test: Run correctness tests first (`pytest -k "test_op and cutile"`), then benchmark.
   - VERIFY: Correctness passes with autotune enabled AND with `DISABLE_AUTOTUNE=1`.
9. Validate with A/B test: Compare autotune version vs fixed best-known config. See `search-strategies.md` for methodology.
   - VERIFY: Autotune version ≥ baseline (or within noise). If worse, check that the search space includes the original fixed config, and that `replace_hints` is being used correctly.
10. (MANDATORY) Run the test and verify performance before submitting. Execute the provided test script (e.g. `ENABLE_TILE=1 python3 test.py`) and check:
    - `correctness: PASS`
    - `speedup_over_fixed >= 1.0` (autotuned must not be slower than fixed baseline)

    If `speedup_over_fixed < 1.0`:
    - Check that the search space includes the original fixed config (this guarantees no regression)
    - Check if `replace_hints` is being called on every code path; revisit Step 2 (if any path skips `replace_hints`, the decorator's fixed hints are used instead of autotuned values)
    - Expand search space if all configs perform similarly (see `references/parameter-space-design.md` → "Adapting Search Space")

    ⚠️ DO NOT submit without running the test at least once. Writing correct-looking code is not sufficient: autotuning bugs (silent hint override, split-buffer omission) are only caught at runtime.
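The host-side pieces of the workflow above (the `DISABLE_AUTOTUNE` check, the power-of-two cache key, and the fallback config) can be sketched as small pure-Python helpers. The helper names `autotune_disabled`, `next_pow2`, `gemm_cache_key`, and `pick_config` are assumptions made for this illustration; only the environment variable name and the key layout come from the steps above.

```python
import os


def autotune_disabled(environ=None):
    """True when DISABLE_AUTOTUNE=1 is set (check this before the cache lookup)."""
    environ = os.environ if environ is None else environ
    return environ.get("DISABLE_AUTOTUNE", "0") == "1"


def next_pow2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1 << (n - 1).bit_length()


def gemm_cache_key(M, N, K, dtype, device):
    """Cache key with dimensions bucketed to powers of two and the device as a string."""
    return (next_pow2(M), next_pow2(N), next_pow2(K), dtype, str(device))


def pick_config(configs, fallback):
    """With autotune disabled: first valid config, or a hardcoded fallback if none exist."""
    return configs[0] if configs else fallback


# Two nearby shapes land in the same bucket, so only one search is needed.
key_a = gemm_cache_key(1000, 4096, 500, "float16", "cuda:0")
key_b = gemm_cache_key(1024, 4096, 512, "float16", "cuda:0")
```

Bucketing trades some tuning precision for fewer searches: the config tuned for the first shape in a bucket is reused for every other shape that rounds to the same key.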
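The workflow's cache-key advice (round GEMM dimensions up to the next power of two and store the device as a string) can be sketched like this. The helper names are hypothetical, and plain strings stand in for the real `dtype` and `torch.device` objects so that the example runs without a GPU.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def gemm_cache_key(M: int, N: int, K: int, dtype, device) -> tuple:
    # str(device) follows the workflow's advice to stringify the device.
    return (next_pow2(M), next_pow2(N), next_pow2(K), dtype, str(device))


# Nearby shapes collapse onto one key, so they share a single tuning run.
key_a = gemm_cache_key(1000, 4096, 500, "float16", "cuda:0")
key_b = gemm_cache_key(1024, 4096, 512, "float16", "cuda:0")
print(key_a == key_b)
```

The trade-off is that shapes sharing a bucket reuse a config tuned on only one of them, which is normally acceptable for GEMM-like kernels but worth re-checking if a bucket spans very different aspect ratios.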
Integration with torch.autograd.Function
When the kernel is used inside a `torch.autograd.Function`:
- Place the tune-once/cache/launch logic in `forward()` only. The cached config is reused across calls.
- In `backward()`, using `ct.launch` with a fixed or cached config is often sufficient. However, if backward has its own independent search space (e.g. grouped GEMM dX and dW have separate optimal configs), autotuning is appropriate there too.
- Example: in `rope_embedding.py`, forward uses `exhaustive_search` + cache with split-buffer, and backward uses `ct.launch` with same-buffer (Q_in=Q_out).
Cross-Backend Config Transfer (Triton → CuTile)
Use `src/tilegym/autotune.py`: it maps `BLOCK_SIZE_M/N/K` → `TILE_SIZE_M/N/K`; `num_warps` / `num_stages` have no CuTile equivalent.
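A minimal sketch of that mapping, assuming the Triton configs arrive as plain dicts. The helper `triton_to_cutile_config` and the two lookup tables are hypothetical illustrations, not the API of `src/tilegym/autotune.py`.

```python
from types import SimpleNamespace

# Hypothetical rename table: Triton block sizes become CuTile tile sizes.
_TRITON_TO_CUTILE = {
    "BLOCK_SIZE_M": "TILE_SIZE_M",
    "BLOCK_SIZE_N": "TILE_SIZE_N",
    "BLOCK_SIZE_K": "TILE_SIZE_K",
}
# Triton-only knobs with no CuTile equivalent; they are dropped.
_DROPPED = {"num_warps", "num_stages"}


def triton_to_cutile_config(triton_cfg: dict) -> SimpleNamespace:
    """Rename block-size keys and discard knobs CuTile does not expose."""
    out = {}
    for key, value in triton_cfg.items():
        if key in _DROPPED:
            continue
        out[_TRITON_TO_CUTILE.get(key, key)] = value
    return SimpleNamespace(**out)


cfg = triton_to_cutile_config(
    {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 32,
     "num_warps": 8, "num_stages": 3}
)
print(cfg)
```

A transferred config is only a starting candidate; it still has to go through the normal search and A/B validation on CuTile.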
Optimizing an Existing Autotune Config
- Profile first: Use NCU (set `DISABLE_AUTOTUNE=1`).
- Expand (too narrow): add tile sizes, `num_ctas` (sm90+), `swap_ab`.
- Prune (too slow): remove suboptimal configs, use arch-conditional yield, add size filters.
- Re-validate: A/B test to confirm improvement.
Pitfall Checklist
Before submitting code with autotune, verify these:
Pitfall #1: In-Place Kernel Data Corruption
Problem: `exhaustive_search` runs the kernel multiple times to benchmark. If the kernel modifies input tensors in-place, the data is corrupted after the first trial run.
Solution: Split-buffer pattern. Use a separate read-only input and write-only output during search:
```python
# During exhaustive_search: use separate output buffer
Q_scratch = torch.empty_like(Q)
configs = list(_rope_autotune_configs())
result = exhaustive_search(
    configs, stream,
    grid_fn=...,
    kernel=rope_kernel,
    args_fn=lambda cfg: (Q, Q_scratch, ...),  # Q_in != Q_out
    hints_fn=...,
)
# After search: launch with in-place args using tuned config
cfg = result.best.config
tuned_kernel = rope_kernel.replace_hints(occupancy=cfg.occupancy)
ct.launch(stream, grid, tuned_kernel, (Q, Q, ...))  # Q_in == Q_out (in-place)
```
**Real example**: `rope_embedding.py`: search uses split-buffer, final launch uses same-buffer.
**Also wrong**: Using `Q.clone()` in `args_fn`. This adds ~4us per clone, which is fatal for small kernels (~5us). The clone+copy pattern caused 0.48x performance in RoPE.
**Tip: isolating output buffers in `args_fn`**: For kernels that write to a dedicated output tensor (not in-place), use `c.clone()` inside `args_fn` to prevent trial runs from overwriting the final output buffer:
```python
# Output tensor c will be overwritten by each trial; clone it so trials don't
# corrupt the buffer the caller expects to use after exhaustive_search returns.
result = exhaustive_search(
    configs, stream,
    grid_fn=...,
    kernel=my_kernel,
    args_fn=lambda cfg: (a, b, c.clone()),  # each trial gets a fresh output
    hints_fn=...,
)
```
This is safe because the clone cost (~4us) is negligible relative to compute-bound kernel execution time (~50us+). Only avoid `clone()` for very small, memory-bound kernels where 4us is a significant fraction of runtime; in that case, pre-allocate a single scratch buffer outside `args_fn` (as in the split-buffer pattern above).
Pitfall #2: Compilation Timeout
Problem: More than 30 configs causes compilation to exceed 5 minutes. CuTile compilation is heavier than Triton's.
Solution:
- Keep the total search space ≤ 30 configs: apply arch filters, tile size filters, and pruning rules until you are under the limit
- Use architecture-conditional yield to generate only the configs relevant to the detected GPU
Real example: Grouped GEMM expanded from 4 to 32 configs → all backward tests timed out. Reverted to occupancy-only (4 configs) with no performance loss.
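One way to keep a GEMM-style search space under the limit is sketched below. Everything here is an assumption for illustration: the tile values, the function names, and the choice to pass the compute capability in as a tuple (in real code it would come from `torch.cuda.get_device_capability()`). The cap simply truncates the list, which is cruder than ranking candidates.

```python
from itertools import product
from types import SimpleNamespace

MAX_CONFIGS = 30  # limit stated in the guidance above


def gemm_autotune_configs(M, N, K, capability):
    """Yield candidate configs; the tile values are illustrative only."""
    major, _minor = capability
    # Arch-conditional yield: only offer num_ctas=2 on sm90+ (sketch assumption).
    num_ctas_options = (1, 2) if major >= 9 else (1,)
    for tm, tn, tk, occ, nc in product(
        (64, 128, 256), (64, 128, 256), (32, 64), (1, 2), num_ctas_options
    ):
        # Size filter: skip tiles larger than the problem itself.
        if tm > M or tn > N or tk > K:
            continue
        yield SimpleNamespace(
            TILE_SIZE_M=tm, TILE_SIZE_N=tn, TILE_SIZE_K=tk, occupancy=occ, num_ctas=nc
        )


def pruned_configs(M, N, K, capability, baseline, limit=MAX_CONFIGS):
    """Return at most `limit` configs, always keeping the original fixed config."""
    candidates = [c for c in gemm_autotune_configs(M, N, K, capability) if c != baseline]
    return [baseline] + candidates[: limit - 1]


baseline = SimpleNamespace(
    TILE_SIZE_M=128, TILE_SIZE_N=128, TILE_SIZE_K=64, occupancy=2, num_ctas=1
)
sm90 = pruned_configs(128, 128, 64, (9, 0), baseline)
sm80 = pruned_configs(128, 128, 64, (8, 0), baseline)
print(len(sm90), len(sm80))
```

Placing the baseline first means the original fixed config survives any truncation, so the search cannot pick something worse than what the kernel already used.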
Pitfall #3: Cold-Cache Performance Skew
Problem: The first process run is slower due to cold driver/JIT caches, which can cause the wrong config to be selected.
Solution: Always warm up before measuring. `exhaustive_search` has built-in warmup, but first-process cold start is unavoidable. Re-run if you suspect the initial result was affected.
Pitfall #4: NCU Profiling Interference
Problem: NCU profiles autotune trial runs, cluttering the trace.
Solution: Set `DISABLE_AUTOTUNE=1` before profiling, or use `ncu --launch-skip N`.
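The `DISABLE_AUTOTUNE=1` bypass might look roughly like this. It is a sketch under stated assumptions: `pick_config`, `FALLBACK`, and the `search_fn` parameter are hypothetical names, and `search_fn` stands in for the real `exhaustive_search` call so the example runs without a GPU.

```python
import os
from types import SimpleNamespace

_cache = {}
FALLBACK = SimpleNamespace(occupancy=1)  # hypothetical hard-coded fallback config


def autotune_disabled() -> bool:
    return os.environ.get("DISABLE_AUTOTUNE", "0") == "1"


def pick_config(key, configs, search_fn):
    """Return the config to launch with; search_fn stands in for exhaustive_search."""
    if autotune_disabled():
        # Checked before the cache lookup so no autotune logic runs at all.
        return configs[0] if configs else FALLBACK
    if key not in _cache:
        _cache[key] = search_fn(configs)
    return _cache[key]


search_calls = []


def fake_search(cfgs):
    search_calls.append(1)
    return cfgs[-1]  # pretend the last candidate won


candidates = [SimpleNamespace(occupancy=o) for o in (1, 2, 4)]
os.environ["DISABLE_AUTOTUNE"] = "1"
profiled = pick_config("k", candidates, fake_search)
print(profiled.occupancy, len(search_calls))
```

With the variable set, NCU sees only the real launch, because no trial runs are issued.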
Pitfall #5: search_space as Generator (Exhaustion)
Problem: `exhaustive_search` requires a `Sequence` (list/tuple), not a generator. Passing a generator directly will fail or produce unexpected results.
Solution: Always convert to list:
```python
# CORRECT: convert generator to list
configs = list(_matmul_autotune_configs())
result = exhaustive_search(configs, ...)
# WRONG: passing generator directly
result = exhaustive_search(_matmul_autotune_configs(), ...)
```
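The underlying Python behavior can be seen without CuTile at all. In this small demonstration, `occupancy_configs` is a hypothetical search-space generator.

```python
from types import SimpleNamespace


def occupancy_configs():
    """Hypothetical search-space generator."""
    for occ in (1, 2, 4, 8):
        yield SimpleNamespace(occupancy=occ)


gen = occupancy_configs()
first_pass = [c.occupancy for c in gen]
second_pass = [c.occupancy for c in gen]  # the generator is already consumed
print(first_pass, second_pass)  # [1, 2, 4, 8] []

# A generator also has no length, which Sequence-based code typically relies on.
try:
    len(occupancy_configs())
    has_len = True
except TypeError:
    has_len = False

configs = list(occupancy_configs())  # safe: reusable and sized
print(has_len, len(configs))  # False 4
```

Any code that iterates the search space more than once, or asks for its length, silently sees an empty or unusable sequence when handed a generator.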
Pitfall #6: FP8 Precision Loss
Problem: Hardware `/` breaks FP8 quantization bucket boundaries.
Solution: Use `ct.truediv(x, y, rounding_mode=RoundingMode.FULL)` for IEEE-compliant division in FP8 kernels. Never use the `/` operator for FP8 scale computation.
Pitfall #7: `replace_hints` on Hot Path (Recompilation)
Problem: `replace_hints()` returns a new kernel object with its own JIT cache (internally it uses `dataclasses.replace()`, which creates a fresh instance). Calling it on every kernel invocation, even with the same arguments, triggers recompilation every time. This is the most common autotune performance bug: `cutile_ms` jumps from ~0.04ms to 16–39ms (100–500× slower).
Incorrect (recompiles on every call):
```python
# Cache miss: only the config is stored
_cache[key] = result.best.config  # only stores config
# Every call:
cfg = _cache[key]
tuned = my_kernel.replace_hints(occupancy=cfg.occupancy)  # NEW kernel each time!
ct.launch(stream, grid, tuned, ...)
```
Correct (compile once, reuse forever):
```python
# Cache miss: build the tuned kernel once and store it with the config
best_cfg = result.best.config
tuned = my_kernel.replace_hints(occupancy=best_cfg.occupancy)  # compile ONCE
_cache[key] = (best_cfg, tuned)  # cache both
# Every call:
cfg, tuned = _cache[key]
ct.launch(stream, grid, tuned, ...)  # reuse compiled kernel
```
Rule: Call `replace_hints` exactly once per config (immediately after `exhaustive_search`), cache the returned kernel object, and never call `replace_hints` again on the fast path.
Scope and Boundaries
This skill covers only autotune configuration: search space design, `exhaustive_search` invocation, caching, and `ct.launch` with tuned hints. It does not modify kernel code.
In scope (autotune config):
- Search space generator functions
- `exhaustive_search()` calls and result handling
- `kernel.replace_hints()` for applying tuned hints
- Cache logic (key design, dict management)
- `ct.launch()` with tuned kernel
- `DISABLE_AUTOTUNE` fallback path
Out of scope (kernel code modifications; do NOT make these changes):
- Math flags (flush_to_zero, rounding_mode)
- Performance Hints (slice_hint, buffer_depth, copy_config)
- Memory access patterns (2D→1D gather/scatter conversion)
- Codegen optimizations (safe_offs → padding_value)
- Algorithm changes (K-loop split, load balancing)
Further Optimization Suggestions
After adding autotuning, the following kernel-level optimizations may yield additional gains. These are outside the scope of this skill: mention them to the user as potential next steps, but do not implement them as part of autotuning:
- Math flags: `flush_to_zero=True` + `rounding_mode=APPROX` can provide 34-72% improvement for FMHA-class kernels (set via environment variables `TILEIR_ENABLE_FTZ=1 TILEIR_ENABLE_APPROX=1` or in kernel code). Causal chain: larger tiles initially decrease performance by 18-43% due to subnormal handling overhead; enabling FTZ+APPROX rescues this and flips the result to +34-72%. Math flags are therefore a prerequisite for large-tile configs to be effective on FMHA-class kernels.
- Performance Hints: `slice_hint`, `buffer_depth`, `copy_config` (requires modifying kernel IR code)
- Memory access patterns: Using TMA loads (`ct.load`) instead of `ct.gather`; removing unnecessary bounds checks (`check_bounds=False` when safe)
- Codegen quality: Using the `padding_value` parameter instead of manual `ct.where` masking; removing `safe_offs`
- Algorithm restructuring: K-loop split, load balancing, algebraic simplification
Differences from Triton Autotune
Key differences: Triton uses the `@triton.autotune` decorator with `Config(...)` objects; CuTile uses `exhaustive_search()` with `SimpleNamespace` configs + separate cache + `ct.launch`. CuTile has no `num_warps` / `num_stages` (compiler decides), only tile sizes + `occupancy` + `num_ctas`. CuTile compilation is heavier (keep ≤30 configs). CuTile cache is user-managed in-memory (no automatic persistence). CuTile separates `args_fn` (kernel args) from `hints_fn` (compiler hints).
Reference Documents
| Category | Document | Content |
|---|---|---|
| Parameter Design | `references/parameter-space-design.md` | Per-kernel-type parameter spaces, cross-arch patterns, grid_fn patterns, pruning rules |
| Search Strategies | `references/search-strategies.md` | Exhaustive search, A/B test methodology, DISABLE_AUTOTUNE pattern |
| Templates | `references/kernel-type-templates.md` | Copy-paste autotune templates for 8 kernel types |
| Hardware | | Per-architecture constraints, tile size ranges, num_ctas rules, TMA requirements |
Source Code References
Key files: `ops/cutile/matmul.py` (matmul autotune), `ops/cutile/attention.py` (FMHA autotune), `suites/unsloth/cutile/ct_ops.py` (shared `autotune_configs()` occupancy=[1,2,4,8]), `suites/unsloth/cutile/swiglu.py` (elementwise example), `suites/unsloth/cutile/rope_embedding.py` (split-buffer pattern), `suites/unsloth/cutile/grouped_gemm.py` (persistent GEMM, occupancy-only).
Worked Examples
Each example shows the before → after pattern: `fixed_launch.py` (hardcoded `ct.launch`) and `autotuned_launch.py` (refactored to tune-once/cache/launch).
| Directory | Kernel | Autotune Pattern | Complexity | Key Teaching Point |
|---|---|---|---|---|
| | RMSNorm (reduction) | Occupancy-only | Low | Most common pattern: no tile tuning, just find best occupancy. Grid = |
| | GEMM C=A@B | Full: | High | Compute-bound kernel with multiple tunable dimensions. |
| | RoPE embedding (in-place) | Occupancy-only, with split-buffer | Medium | In-place kernel MUST use split-buffer during search to avoid corruption. Search writes to scratch; final |