improve-cutile-kernel-perf
Iterative cuTile Kernel Performance Optimization
Systematically profile, diagnose bottlenecks, and iteratively tune a cuTile kernel's performance in the TileGym repository.
Instructions
Follow the three phases in order: set up the environment and baseline, run the experimentation loop with a tracked log, then iterate the experiment loop until performance goals are met or further gains plateau.
Setup
Work with the user to prepare the optimization environment:
- Create a fresh git branch: Propose a branch name, e.g., `cutile-perf-<kernel_name>-<date>`, from the current branch. Check it out with `git checkout -b <branch name>`
- Locate the target kernel:
  - cuTile kernels live under `src/tilegym/suites/<suite>/cutile/` or `src/tilegym/ops/cutile/`
  - Read the kernel file and identify: the `@ct.kernel` decorated function(s), the launch wrapper (`ct.launch()` or `ct_experimental.autotune_launch()`), the `@register_impl` registration, and current autotune configs (if any)
- Classify the kernel:
  - Arithmetic Intensity < 10 -> Memory-bound
  - Arithmetic Intensity 10-50 -> Balanced
  - Arithmetic Intensity > 50 -> Compute-bound

  Note: classification is only used to pick the optimization priority order in the experiment loop. The core metric is always `latency (ms)`.
- Check GPU environment:
  - Ensure a GPU node (Blackwell or Ampere GPU) is available
  - All subsequent benchmark commands should run on the GPU node
- Study related references:
  - `references/optimization-playbook.md`: Step-by-step recipes for each optimization (A through J) with before/after code examples
  - `references/perf-knobs-catalog.md`: Complete catalog of all tunable parameters (TMA, persistent scheduling, occupancy, tile sizes, latency hints, etc.)
  - `references/cutile-api-reference.md`: cuTile API reference and 18 critical rules
  - `references/performance-model.md`: Roofline/performance model, bottleneck diagnosis, autotuning
  - `references/ir-dump-guide.md`: IR dump, analysis, and error diagnosis
  - `references/cutile-patterns-reference.md`: Common cuTile patterns and conversion quick-reference
- Create @sandbox/perf_results.md to track progress. The first run will write a baseline
- Confirm and go: Once you get confirmation, kick off the experimentation
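
The classification step in the Setup list can be sketched in a few lines of Python. This is a minimal illustration, assuming arithmetic intensity is measured in FLOPs per byte of global-memory traffic (the unit is not stated above); the helper names and the two example kernels are hypothetical and not part of TileGym.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte of global-memory traffic (assumed unit)."""
    return flops / bytes_moved


def classify_kernel(intensity: float) -> str:
    """Map arithmetic intensity onto the three classes from the Setup list."""
    if intensity < 10:
        return "memory-bound"
    if intensity <= 50:
        return "balanced"
    return "compute-bound"


# Hypothetical example 1: elementwise add of two fp16 vectors, n elements.
# One FLOP per element; two loads and one store of 2 bytes each.
n = 1 << 20
vec_add_ai = arithmetic_intensity(flops=n, bytes_moved=3 * 2 * n)

# Hypothetical example 2: square fp16 GEMM with M = N = K = 4096.
# 2*M*N*K FLOPs; A, B, and C each touched once in the ideal case.
m = 4096
gemm_ai = arithmetic_intensity(flops=2 * m**3, bytes_moved=3 * 2 * m * m)

print(f"vector add: {vec_add_ai:.3f} -> {classify_kernel(vec_add_ai)}")
print(f"GEMM 4096:  {gemm_ai:.1f} -> {classify_kernel(gemm_ai)}")
```

The byte counts assume ideal reuse, so a real kernel's intensity is usually lower than this estimate. The result only sets the priority order for the experiment loop; latency remains the metric that decides.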
Experimentation
Every experiment iteration applies ONE optimization to the target kernel, verifies correctness, re-benchmarks, and records results. Each iteration must finish within 10 minutes.
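
One way to enforce the 10-minute limit is to run each test or benchmark command under a timeout. The sketch below is an assumption about how that could be done, not part of the TileGym harness: `run_with_budget` is a hypothetical helper, and the two commands are stand-ins so the example runs anywhere.

```python
import subprocess
import sys

ITERATION_BUDGET_S = 10 * 60  # the 10-minute limit per iteration


def run_with_budget(cmd, budget_s=ITERATION_BUDGET_S):
    """Run cmd; return (status, output), status being 'ok', 'fail', or 'timeout'."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=budget_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "fail"
    return status, proc.stdout + proc.stderr


# Stand-in commands; in practice cmd would be the pytest invocation.
fast_status, _ = run_with_budget([sys.executable, "-c", "print('done')"])
slow_status, _ = run_with_budget(
    [sys.executable, "-c", "import time; time.sleep(5)"], budget_s=0.5
)
print(fast_status, slow_status)
```

Note that the budget here applies to each command separately; to cap a whole iteration, subtract the elapsed time before launching the next command.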
The goal
- Improve the core metric: reduce `latency (ms)`
- Subject to the core constraint: Correctness shall not regress: every optimization MUST preserve numerical correctness. `latency (ms)` shall not regress > 2% compared to baseline.
What you can change
- The target kernel file under `src/tilegym/suites/<suite>/cutile/` or `src/tilegym/ops/cutile/`: kernel body, tile sizes, occupancy, num_ctas, TMA usage, latency hints, flush_to_zero, autotune configs, persistent scheduling, and other cuTile-specific parameters
- The kernel's launch wrapper: grid computation, autotune config space
- @sandbox/: Feel free to add new files or modify files you created, but don't commit them to git
What you can NOT change
- Kernel functional semantics (inputs, outputs, and numerical behavior within tolerance)
- Test infrastructure and benchmark harness
- Anything not listed above
What to expect from experiment outputs
Correctness test:
```bash
python -m pytest tests/suites/.../test_<kernel_name>.py -k "test_ and cutile and not test_perf" -v
```
Performance benchmark:
For each iteration:
- Run pytest benchmark: `python -m pytest ... --print-record` → extract latency (ms)
- Record latency in perf_results.md

Benchmark command line:
```bash
python -m pytest tests/suites/.../test_<kernel_name>.py -k "test_perf and cutile" --print-record -v
```
Latency sample:
```
Cutile: {'forward': {'mean': 3.7903138461538455, 'std': 0.0016941310873207053, 'rel_std': 0.044696327430505396, 'median': 3.789880999999999, 'min': 3.7883389999999992, 'max': 3.7941230000000004, 'nrep': 13, 'peak_mem_mb': 913}} ms
```
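
Extracting the latency from a record line like the sample above can be sketched as follows. This is a minimal parser that assumes the record always has the shape `Cutile: {...} ms` with a Python-literal dict. `parse_latency_record` is a hypothetical helper, and reporting `mean` rather than `median` is an assumption, since the text does not say which statistic to use.

```python
import ast
import re

# Record line copied from the latency sample.
sample = (
    "Cutile: {'forward': {'mean': 3.7903138461538455, 'std': 0.0016941310873207053, "
    "'rel_std': 0.044696327430505396, 'median': 3.789880999999999, "
    "'min': 3.7883389999999992, 'max': 3.7941230000000004, 'nrep': 13, "
    "'peak_mem_mb': 913}} ms"
)


def parse_latency_record(line: str, backend: str = "Cutile") -> dict:
    """Return the stats dict from a '<backend>: {...} ms' record line."""
    pattern = re.escape(backend) + r":\s*(\{.*\})\s*ms"
    match = re.search(pattern, line)
    if match is None:
        raise ValueError(f"no {backend} record found in: {line!r}")
    # literal_eval only accepts Python literals, so it is safe on log text.
    return ast.literal_eval(match.group(1))


record = parse_latency_record(sample)
latency_ms = record["forward"]["mean"]  # assumption: mean is the reported stat
formatted = f"{latency_ms:.6f}"
print(formatted)
```

The six-decimal formatting matches the `latency_ms` column expected in perf_results.md.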
Track experiment progress
Use @sandbox/perf_results.md to record each iteration's results. It should only contain a Markdown table with 5 columns:
- `iteration`: iteration number, starting from 0 (baseline)
- `optimization`: what was applied (e.g., "baseline", "TMA replace gather", "persistent scheduling")
- `latency_ms`: kernel latency in milliseconds, six decimal places
- `correctness`: PASS or FAIL
- `status`: whether this iteration was `keep`, `revert`, or `crash`

Example content:
```markdown
| iteration | optimization       | latency_ms | correctness | status |
|----------:|:-------------------|-----------:|:------------|-------:|
|         0 | baseline           |   0.820000 | PASS        |   keep |
|         1 | TMA replace gather |   0.390000 | PASS        |   keep |
```
Create the table header if the file is empty. Append one line for each iteration.
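
The append-only bookkeeping described above could look like the sketch below. `append_result` is a hypothetical helper, and the demo writes to a temporary directory instead of the real @sandbox/ location so that it runs anywhere.

```python
import tempfile
from pathlib import Path

HEADER = (
    "| iteration | optimization | latency_ms | correctness | status |\n"
    "|----------:|:-------------------|-----------:|:------------|-------:|\n"
)


def append_result(path, iteration, optimization, latency_ms, correctness, status):
    """Append one table row, writing the header first if the file is empty."""
    if correctness not in {"PASS", "FAIL"}:
        raise ValueError(f"correctness must be PASS or FAIL, got {correctness!r}")
    if status not in {"keep", "revert", "crash"}:
        raise ValueError(f"status must be keep, revert, or crash, got {status!r}")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_empty = not path.exists() or path.stat().st_size == 0
    with path.open("a", encoding="utf-8") as f:
        if is_empty:
            f.write(HEADER)
        f.write(
            f"| {iteration} | {optimization} | {latency_ms:.6f} "
            f"| {correctness} | {status} |\n"
        )


# Demo in a temporary directory (stand-in for the sandbox folder).
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "sandbox" / "perf_results.md"
    append_result(results, 0, "baseline", 0.82, "PASS", "keep")
    append_result(results, 1, "TMA replace gather", 0.39, "PASS", "keep")
    table = results.read_text(encoding="utf-8")

print(table)
```

Opening the file in append mode means earlier rows are never rewritten, which keeps the log a faithful history even when an iteration is reverted.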
The baseline
The first iteration (iteration 0) changes no code; it simply runs the correctness test and the performance benchmark. Its results are recorded in the first row as the baseline.
The experiment loop
The core methodology is to apply ONE optimization per iteration from the playbook, verify correctness, benchmark, and decide whether to keep or revert. Try one optimization at a time, and keep clean experiment records.

LOOP:
- Check git status: confirm the current git branch/commit
- Select and apply ONE optimization from `references/optimization-playbook.md`
- Verify correctness: if it fails, revert immediately. Common causes: `flush_to_zero`/`rounding_mode=APPROX` changed results, tile size OOB, `allow_tma=False` semantics, persistent loop bound error
- Re-benchmark and compare against the current baseline
- Git commit
- Record results to @sandbox/perf_results.md
- Decision rules:

  | Outcome | Action |
  |:--------|:-------|
  | Improvement (`latency (ms)`) >= 5% | Accept as new baseline, continue |
  | Improvement 2-5% | Accept, lower priority for next iteration |
  | Improvement < 2% | Accept but stop unless user wants more |
  | Regression on any config | Revert immediately, try next optimization |
  | No improvement after 2 consecutive iterations | Stop |
  | Root cause is `scheduling` or `unknown` | Escalate to user |
- If keeping, advance the baseline numbers and continue the loop
- If reverting, git reset back to where you started and try the next optimization in priority order

UNTIL: all attempts are finished, or more than 25 iterations have occurred, or the user interrupts
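
The latency rows of the decision table can be expressed as a small function. This is a sketch under simplifying assumptions: it compares one baseline latency with one new latency, so "regression on any config" would require calling it per config. The `decide` function and its return labels are hypothetical, and the "2 consecutive iterations" and root-cause rows are left to the caller.

```python
def decide(baseline_ms: float, new_ms: float, correctness_ok: bool) -> str:
    """Apply the latency rows of the decision table to a single measurement."""
    if not correctness_ok:
        return "revert"  # a correctness failure always reverts
    improvement_pct = (baseline_ms - new_ms) / baseline_ms * 100.0
    if improvement_pct < 0:
        return "revert"  # regression: revert, try the next optimization
    if improvement_pct >= 5:
        return "accept_continue"  # becomes the new baseline
    if improvement_pct >= 2:
        return "accept_deprioritize"  # keep, lower priority next iteration
    return "accept_stop"  # keep, stop unless the user wants more


# Latencies from the example table: 0.82 ms -> 0.39 ms.
print(decide(0.82, 0.39, correctness_ok=True))
print(decide(1.00, 0.97, correctness_ok=True))
print(decide(1.00, 0.99, correctness_ok=True))
print(decide(1.00, 1.01, correctness_ok=True))
```

Keeping the rule in one pure function makes each keep or revert decision reproducible from the numbers recorded in perf_results.md.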
Be autonomous: Ask the user for clarifications during the setup phase. Once in the experiment loop, do not pause to ask for user feedback: use your best judgement for decisions, consult the optimization playbook and perf knobs catalog as needed, and think harder if stuck.