improve-cutile-kernel-perf

Iterative cuTile Kernel Performance Optimization


Systematically profile, diagnose bottlenecks, and iteratively tune a cuTile kernel's performance in the TileGym repository.


Instructions


Follow the three phases in order: set up the environment and baseline, run the Experimentation loop with a tracked log, then iterate the experiment loop until performance goals are met or further gains plateau.


Setup


Work with the user to prepare the optimization environment:

Create a fresh git branch: Propose a branch name, e.g., `cutile-perf-<kernel_name>-<date>`, based on the current branch, then check it out with `git checkout -b <branch name>`.
Locate the target kernel:
- cuTile kernels live under `src/tilegym/suites/<suite>/cutile/` or `src/tilegym/ops/cutile/`
- Read the kernel file and identify: the `@ct.kernel` decorated function(s), the launch wrapper (`ct.launch()` or `ct_experimental.autotune_launch()`), the `@register_impl` registration, and current autotune configs (if any)
Classify the kernel:
- Arithmetic Intensity < 10 -> Memory-bound
- Arithmetic Intensity 10-50 -> Balanced
- Arithmetic Intensity > 50 -> Compute-bound
Note: classification is only used to pick the optimization priority order in the experiment loop. The core metric is always `latency (ms)`.
Check GPU environment:
- Ensure a GPU node (Blackwell or Ampere GPU) is available
- All subsequent benchmark commands should run on the GPU node
Study related references:
- `references/optimization-playbook.md`: Step-by-step recipes for each optimization (A through J) with before/after code examples
- `references/perf-knobs-catalog.md`: Complete catalog of all tunable parameters (TMA, persistent scheduling, occupancy, tile sizes, latency hints, etc.)
- `references/cutile-api-reference.md`: cuTile API reference and 18 critical rules
- `references/performance-model.md`: Roofline/performance model, bottleneck diagnosis, autotuning
- `references/ir-dump-guide.md`: IR dump, analysis, and error diagnosis
- `references/cutile-patterns-reference.md`: Common cuTile patterns and conversion quick-reference
Create `@sandbox/perf_results.md` to track progress. The first run will write the baseline.
Confirm and go: Once the user confirms, kick off the experimentation.
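
The classification thresholds in the Setup steps can be sketched as a small helper. This is a minimal illustration, not part of TileGym: it assumes arithmetic intensity is measured in FLOPs per byte of memory traffic, and both the name `classify_kernel` and the choice to treat the boundary values 10 and 50 as "balanced" are assumptions.

```python
def classify_kernel(flops: float, bytes_moved: float) -> str:
    """Classify a kernel by arithmetic intensity (assumed FLOPs per byte moved)."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    intensity = flops / bytes_moved
    if intensity < 10:
        return "memory-bound"
    if intensity <= 50:
        return "balanced"
    return "compute-bound"


# Example: an fp32 elementwise add over N elements does N FLOPs and moves
# 12 * N bytes (two 4-byte loads and one 4-byte store per element).
n = 1 << 20
print(classify_kernel(flops=n, bytes_moved=12 * n))  # memory-bound
```

The result only orders the optimization priorities; latency remains the metric that decides whether a change is kept.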


Experimentation


Every experiment iteration applies ONE optimization to the target kernel, verifies correctness, re-benchmarks, and records results. Each iteration must finish within 10 minutes.
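
One way to respect the 10-minute budget is to run each test or benchmark command under a timeout. The sketch below uses the standard-library `subprocess.run` with its `timeout` argument; the helper name `run_with_budget` and its status strings are assumptions for illustration, and the budget here applies to a single command rather than the whole iteration.

```python
import subprocess
import sys

ITERATION_BUDGET_S = 600  # 10 minutes


def run_with_budget(cmd, budget_s=ITERATION_BUDGET_S):
    """Run cmd; return (status, stdout) where status is 'ok', 'fail', or 'timeout'."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=budget_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "timeout", ""
    return ("ok" if proc.returncode == 0 else "fail"), proc.stdout


# Stand-in command; in practice this would be the pytest command line.
status, out = run_with_budget([sys.executable, "-c", "print('done')"])
print(status, out.strip())
```

A `timeout` result could reasonably be logged with status `crash` and followed by a revert.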


The goal


Improve the core metric: reduce `latency (ms)`.
Subject to the core constraints: correctness shall not regress, so every optimization MUST preserve numerical correctness. `latency (ms)` shall not regress by more than 2% compared to the baseline.


What you can change


The target kernel file under `src/tilegym/suites/<suite>/cutile/` or `src/tilegym/ops/cutile/`: kernel body, tile sizes, occupancy, num_ctas, TMA usage, latency hints, flush_to_zero, autotune configs, persistent scheduling, and other cuTile-specific parameters
The kernel's launch wrapper: grid computation, autotune config space
`@sandbox/`: Feel free to add new files or modify files you created, but do not commit them to git


What you can NOT change


Kernel functional semantics (inputs, outputs, and numerical behavior within tolerance)
Test infrastructure and benchmark harness
Anything not listed above


What to expect from experiment outputs


Correctness test:


```bash
python -m pytest tests/suites/.../test_<kernel_name>.py -k "test_ and cutile and not test_perf" -v
```


Performance benchmark:


For each iteration:

Run the pytest benchmark with `python -m pytest ... --print-record` and extract the latency (ms)
Record the latency in perf_results.md

Benchmark command line:

```bash
python -m pytest tests/suites/.../test_<kernel_name>.py -k "test_perf and cutile" --print-record -v
```

Latency sample:

```
Cutile: {'forward': {'mean': 3.7903138461538455, 'std': 0.0016941310873207053, 'rel_std': 0.044696327430505396, 'median': 3.789880999999999, 'min': 3.7883389999999992, 'max': 3.7941230000000004, 'nrep': 13, 'peak_mem_mb': 913}} ms
```
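
Extracting the latency from a record line like the sample can be sketched as follows. The record format is inferred from that single sample, and choosing the `mean` of the `forward` stage as the value to log is an assumption; `parse_latency_ms` is a hypothetical helper, not a TileGym function.

```python
import ast
import re

SAMPLE = (
    "Cutile: {'forward': {'mean': 3.7903138461538455, 'std': 0.0016941310873207053, "
    "'rel_std': 0.044696327430505396, 'median': 3.789880999999999, "
    "'min': 3.7883389999999992, 'max': 3.7941230000000004, 'nrep': 13, "
    "'peak_mem_mb': 913}} ms"
)


def parse_latency_ms(line: str, stage: str = "forward", stat: str = "mean") -> float:
    """Extract one latency statistic (in ms) from a --print-record line."""
    match = re.search(r"Cutile:\s*(\{.*\})\s*ms", line)
    if match is None:
        raise ValueError("no Cutile record found in line")
    record = ast.literal_eval(match.group(1))  # safe parse of the dict literal
    return float(record[stage][stat])


print(f"{parse_latency_ms(SAMPLE):.6f}")  # 3.790314
```

Passing `stat="median"` instead selects the median, which may be more robust when run-to-run noise is high.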


Track experiment progress


Use `@sandbox/perf_results.md` to record each iteration's results. It should contain only a Markdown table with 5 columns:

- `iteration`: iteration number, starting from 0 (baseline)
- `optimization`: what was applied (e.g., "baseline", "TMA replace gather", "persistent scheduling")
- `latency_ms`: kernel latency in milliseconds, to six decimal places
- `correctness`: PASS or FAIL
- `status`: whether this iteration was `keep`, `revert`, or `crash`

Example content:

```markdown
| iteration | optimization       | latency_ms | correctness | status |
|----------:|:-------------------|-----------:|:------------|-------:|
| 0         | baseline           |   0.820000 | PASS        | keep   |
| 1         | TMA replace gather |   0.390000 | PASS        | keep   |
```

Create the table header if the file is empty. Append one row for each iteration.
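
The logging rule (write the header when the file is empty, then append one row per iteration) could look like the sketch below. `append_result` is a hypothetical helper, and the example writes to a temporary directory rather than the real sandbox path.

```python
import tempfile
from pathlib import Path

HEADER = (
    "| iteration | optimization | latency_ms | correctness | status |\n"
    "|----------:|:-------------|-----------:|:------------|-------:|\n"
)


def append_result(path, iteration, optimization, latency_ms, correctness, status):
    """Append one row to the results table, writing the header if the file is empty."""
    if correctness not in ("PASS", "FAIL"):
        raise ValueError("correctness must be PASS or FAIL")
    if status not in ("keep", "revert", "crash"):
        raise ValueError("status must be keep, revert, or crash")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_empty = not path.exists() or path.stat().st_size == 0
    row = f"| {iteration} | {optimization} | {latency_ms:.6f} | {correctness} | {status} |\n"
    with path.open("a", encoding="utf-8") as f:
        if is_empty:
            f.write(HEADER)
        f.write(row)
    return row


with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "perf_results.md"
    append_result(results, 0, "baseline", 0.82, "PASS", "keep")
    append_result(results, 1, "TMA replace gather", 0.39, "PASS", "keep")
    print(results.read_text(encoding="utf-8"))
```

Formatting with `:.6f` keeps the six-decimal convention regardless of how the benchmark prints the value.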


The baseline


The first iteration (iteration 0) does not change any code; it simply runs the correctness test and the performance benchmark. Its results are recorded in the first row as the baseline.


The experiment loop


The core methodology is to apply ONE optimization per iteration from the playbook, verify correctness, benchmark, and decide whether to keep or revert. Try one optimization at a time and keep clean experiment records.

LOOP:

Check git status: note the current git branch and commit
Select and apply ONE optimization from `references/optimization-playbook.md`
Verify correctness; if it fails, revert immediately. Common causes: `flush_to_zero` / `rounding_mode=APPROX` changed results, tile size OOB, `allow_tma=False` semantics, persistent loop bound error
Re-benchmark and compare against the current baseline
Git commit
Record results to `@sandbox/perf_results.md`

Decision rules:

| Outcome | Action |
|:--------|:-------|
| Improvement (`latency (ms)`) >= 5% | Accept as new baseline, continue |
| Improvement 2-5% | Accept, lower priority for next iteration |
| Improvement < 2% | Accept but stop unless user wants more |
| Regression on any config | Revert immediately, try next optimization |
| No improvement after 2 consecutive iterations | Stop |
| Root cause is `scheduling` or `unknown` | Escalate to user |

If keeping, advance the baseline numbers and continue the loop.
If reverting, git reset back to where the iteration started and try the next optimization in priority order.
Repeat UNTIL: all attempts are finished, or more than 25 iterations have occurred, or the user interrupts.
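
The latency-based rows of the decision rules can be expressed as a small function. This is a sketch under simplifying assumptions: a single benchmark config, improvement computed relative to the current baseline, and hypothetical action names. The "2 consecutive iterations" and root-cause rows need loop state and human judgement, so they are left out.

```python
def decide(baseline_ms: float, new_ms: float, correct: bool) -> str:
    """Map one iteration's result to an action from the decision table."""
    if not correct:
        return "revert"  # correctness failures are always reverted
    improvement_pct = (baseline_ms - new_ms) / baseline_ms * 100.0
    if improvement_pct < 0:
        return "revert"  # regression: revert and try the next optimization
    if improvement_pct >= 5:
        return "keep-new-baseline"
    if improvement_pct >= 2:
        return "keep-lower-priority"
    return "keep-and-stop"  # < 2%: accept, stop unless the user wants more


# Using the numbers from the example results table: 0.82 ms -> 0.39 ms.
print(decide(0.82, 0.39, correct=True))  # keep-new-baseline
```

With several benchmark configs, the same check would be applied per config and any single regression would trigger the revert.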

Be autonomous: Ask the user for clarifications during the setup phase. Once inside the experiment loop, do not pause to ask for user feedback: use your best judgement for decision making, consult the optimization playbook and perf knobs catalog promptly, and think harder if stuck.
