tilegym-converting-cutile-to-triton

cuTile → Triton Conversion

Convert `@ct.kernel` kernels to `@triton.jit`. API mapping: references/api-mapping.md (cuTile → Triton).

In this skill’s Markdown, Triton launch syntax `kernel［grid］(…)` uses Unicode brackets so link checkers do not parse `[grid](…)` as a hyperlink; use normal ASCII brackets in real Triton code.
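
When a launch snippet is copied out of this Markdown into real code, the fullwidth brackets have to be swapped back to ASCII. A minimal sketch of that step in Python; the helper name `to_ascii_brackets` is hypothetical and not part of the skill:

```python
# Hypothetical helper: map the fullwidth brackets used in this Markdown
# (U+FF3B and U+FF3D) back to ASCII so the launch syntax is valid Python.
_BRACKETS = str.maketrans({"\uff3b": "[", "\uff3d": "]"})


def to_ascii_brackets(snippet: str) -> str:
    """Return the snippet with fullwidth square brackets replaced by ASCII ones."""
    return snippet.translate(_BRACKETS)


converted = to_ascii_brackets("kernel\uff3bgrid\uff3d(args)")
print(converted)
```

Only the two bracket characters are touched, so the rest of the snippet passes through unchanged.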

Instructions

Follow the phase-gated workflow in translations/workflow.md. Every conversion should go through analyze → convert → validate → test → benchmark, with explicit gates before moving on. Use the documents in Workflow Selection when the task matches a special case (errors, layout flags, perf).

Optimization strategy (perf-sensitive / attention): If the op is attention, FMHA, sliding window, soft cap, or GQA (e.g. Gemma `gemma_attention`), read references/optimization-strategy.md before converting the inner loop, then apply the §4 Gemma FMHA checklist. For other GEMM/BMM/attention-adjacent kernels, still skim §2–§3 of that file after TMA is done.
Select path: For an existing TileGym op, use standard mode in `translations/workflow.md`. If the cuTile source uses `transpose` / `transpose_v`, dual layouts, or MLA-style paths, read translations/advanced-patterns.md before writing Triton (two kernels + `META` grid, not one kernel + `tl.trans`).
Pre-flight: Run the Pre-flight Analysis grep commands on the cuTile source. Count `@ct.kernel` definitions; note TMA-relevant `ct.load` / `ct.store`, `ct.launch`, `Constant`, and layout flags.
Read mapping: Keep references/api-mapping.md open for cuTile → Triton API pairs. For runtime failures (illegal address, dtype, strides), use references/debugging.md.
Convert: Copy the Conversion Checklist into a todo list and execute in order. Structure and file placement: translations/file-structure.md. Mandatory: any 2D+ block-shaped tile load/store uses `tl.make_tensor_descriptor` (TMA), not raw `tl.load(ptr+offs, mask=…)` for full tiles; skipping this is the most common source of large regressions. Host side: Triton bracket launch `kernel［grid］(args)` with a tuple or `lambda META: (…)` for autotune; no `ct.launch`.
Validate: Syntax-check the new Triton module; run the relevant TileGym pytest targets for the op: `pytest tests/ops/test_<op>.py -k "triton" -vs`. Fix failures before benchmarking.
Benchmark: Compare Triton vs cuTile on perf tests. If Triton is clearly slower, follow PERFORMANCE ANALYSIS (Phase c2t-5) in translations/workflow.md and references/optimizing-reference.md for GEMM/BMM/attention; use references/optimization-strategy.md as the ordered checklist. If you see 10–50× slowdowns, read CRITICAL PERFORMANCE PATTERNS in that same workflow file first.

Execution rules (MUST):

Create and track the conversion checklist (e.g. TodoWrite) before editing kernel code; complete steps in order—do not skip pre-flight or TMA decisions.
For attention / FMHA / Gemma / GQA / soft cap / sliding window: read references/optimization-strategy.md and apply §4 before treating the conversion as optimized.
Do not ship raw pointer+mask 2D+ tile loads where TMA applies; document any intentional exception.
If tests or benchmarks fail a gate, stop and fix before declaring the conversion done—do not stack unverified changes.
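
The rule against raw pointer+mask tile loads can be spot-checked mechanically before review. The sketch below is a line-based heuristic in Python, in the same spirit as a `grep "tl.load.*mask"` check; the function name `find_raw_masked_loads` and the sample source are hypothetical. It flags every masked `tl.load`, including legitimate 1D ones, so each hit still needs a human decision.

```python
import re

# Matches a tl.load(...) call that passes mask= on the same line.
# This is a heuristic, not a parser: multi-line calls are not detected.
RAW_MASKED_LOAD = re.compile(r"tl\.load\(.*mask\s*=")


def find_raw_masked_loads(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for every raw masked tl.load."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if RAW_MASKED_LOAD.search(line):
            hits.append((lineno, line.strip()))
    return hits


# Hypothetical Triton source text, used only as input to the scan.
sample = """\
x = tl.load(x_ptr + offs, mask=offs < n)
a = a_desc.load([off_m, off_k])
"""

hits = find_raw_masked_loads(sample)
for lineno, line in hits:
    print(f"line {lineno}: {line}")
```

Descriptor loads such as `a_desc.load([...])` are not flagged, since they do not go through `tl.load`.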

Workflow Selection

Existing TileGym op → Standard Mode: translations/workflow.md
Errors (`cudaErrorIllegalAddress`, shape mismatch, numerical mismatch) → references/debugging.md
Advanced patterns (TMA, dual layout flags `transpose`, autotune + `META` grid, Array.slice, ct.gather().item()) → translations/advanced-patterns.md (MLA-style two kernels, avoid 3–15× regression on `transpose=False`).
Performance (Triton kernel slower than cuTile, autotuning, profiling) → translations/workflow.md (section PERFORMANCE ANALYSIS (Phase c2t-5))
Optimization strategy hub (ordered checklist: advanced-patterns + optimizing-reference) → references/optimization-strategy.md — read first for attention/FMHA/Gemma; then drill into the two source docs as needed
Optimizing GEMM/BMM/attention (after TMA, or Triton 10–20% slower) → references/optimizing-reference.md — EVEN_K fast path, transpose via pointer arithmetic, grid layout, autotune breadth, epilogue subtile; use these patterns during conversion and before perf sign-off (summarized in optimization-strategy §2–§3)
Gemma attention / GQA FMHA conversion → references/optimization-strategy.md §4
Blackwell optimization (complex kernels with iterative algorithms, register pressure, loop unrolling) → references/optimizing-reference.md §9: TMA descriptors, `loop_unroll_factor`, occupancy autotuning, TMEM-friendly block sizes, slab allocator, dual-path kernel design
⚠️ 10-50x REGRESSION (catastrophic slowdown after conversion) → translations/workflow.md — section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION)
⚠️ Good perf on `transpose=True` only, collapse on `transpose=False` (or opposite) → translations/advanced-patterns.md, §1 Dual layout flag; two `@triton.jit` kernels + `grid = lambda META: (... META["BLOCK_H"] ...)`

Pre-flight Analysis (Run BEFORE converting)

```bash
# Count kernels (each becomes a @triton.jit kernel; in Triton, device helpers
# called from a kernel also need @triton.jit)
grep "@ct\.kernel" source.py | wc -l

# Check for patterns needing special handling
# (-E enables alternation with |; \. matches a literal dot)
grep -E "ct\.transpose|ct\.permute" source.py    # → use tl.trans/tl.permute
grep "ct\.astype" source.py                      # → use .to(dtype)
grep -E "ct\.load|ct\.store" source.py           # → TMA for 2D+ (tl.make_tensor_descriptor), NOT raw tl.load(ptr+offs)
grep "ct\.launch" source.py                      # → bracket launch: kernel then [grid] then (args)
grep -E "ct\.Constant|ct\.ConstInt" source.py    # → tl.constexpr
grep "ct\.cdiv" source.py                        # → triton.cdiv (host) or Python (a+b-1)//b
grep -E "ct\.bid|ct\.num_blocks" source.py       # → tl.program_id/tl.num_programs
grep -E "1 << .*\.bit_length" source.py          # → triton.next_power_of_2 if needed
grep -E "transpose|transpose_v" source.py        # → if hit, read translations/advanced-patterns.md (dual kernels + META grid)
```
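
The same pre-flight scan can be scripted so the counts land in one place. This is a rough sketch in Python rather than part of the skill: the function name `preflight_scan`, the note strings, and the sample text are all hypothetical, and the sample is scan input only, not a claim about valid cuTile syntax.

```python
import re

# Pattern -> short note, mirroring the pre-flight grep commands.
# Dots are escaped so that "ct.load" does not match unrelated text.
PREFLIGHT_PATTERNS = {
    r"@ct\.kernel": "kernel definition",
    r"ct\.(transpose|permute)": "tl.trans / tl.permute",
    r"ct\.astype": ".to(dtype)",
    r"ct\.(load|store)": "TMA for 2D+ tiles",
    r"ct\.launch": "bracket launch",
    r"ct\.(Constant|ConstInt)": "tl.constexpr",
    r"ct\.cdiv": "triton.cdiv on the host",
    r"ct\.(bid|num_blocks)": "tl.program_id / tl.num_programs",
}


def preflight_scan(source: str) -> dict[str, int]:
    """Count matching lines per pattern (the equivalent of grep ... | wc -l)."""
    lines = source.splitlines()
    return {
        note: sum(1 for line in lines if re.search(pattern, line))
        for pattern, note in PREFLIGHT_PATTERNS.items()
    }


# Hypothetical source text, used only as input to the scan.
sample = """\
@ct.kernel
def add(x, y, out, n: ct.Constant[int]):
    bid = ct.bid(0)
    tile = ct.load(x, index=(bid,), shape=(128,))
"""

report = preflight_scan(sample)
for note, count in report.items():
    print(f"{count:3d}  {note}")
```

A non-zero count for "TMA for 2D+ tiles" is the cue to plan descriptor-based loads before touching the kernel body.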

Conversion Checklist

Copy this checklist and track progress:

Conversion Progress:
 [ ] Step 0 (attention / Gemma FMHA / GQA / soft cap / sliding window): Read [references/optimization-strategy.md](./references/optimization-strategy.md) and apply §4 checklist before inner-loop Triton
 [ ] Step 1: Pre-flight — run grep commands above, note special patterns and 2D+ loads (→ TMA)
 [ ] Step 2: Analyze source cuTile kernel (identify patterns, shapes, dtypes)
 [ ] Step 3: Create Triton file with correct structure (see translations/file-structure.md)
 [ ] Step 4: Convert kernel signature (tensor args → pointer args, Constant → constexpr)
 [ ] Step 4b: TMA (MANDATORY for 2D+ loads) — use tl.make_tensor_descriptor for every 2D+ tile load/store; do NOT ship raw tl.load(ptr+offs,mask) for block-shaped access (see workflow.md § TMA OPTIMIZATION)
 [ ] Step 5: Convert kernel body (apply gotchas table below + API mapping)
 [ ] Step 6: Convert host wrapper (grid tuple/lambda, bracket-style launch: kernel, grid, then arguments; no ct.launch); call triton.set_allocator(alloc_fn) if using TMA
 [ ] Step 7: Validate — run pytest or syntax check on Triton file
 [ ] Step 8: Test — run pytest, verify X passed 0 failed
 [ ] Step 9: If test fails → fix → re-validate → re-test (loop until green)
 [ ] Step 10: Benchmark — run perf test, compare vs cuTile (see workflow.md § PERFORMANCE ANALYSIS)
 [ ] Step 10b: If GEMM/BMM/attention and Triton >20% slower → walk [references/optimization-strategy.md](./references/optimization-strategy.md) §2–§3 then [references/optimizing-reference.md](./references/optimizing-reference.md) (EVEN_K, transpose, grid, autotune, epilogue subtile), then re-benchmark
 [ ] Step 10c: If op has `transpose` / layout flag → read [translations/advanced-patterns.md](./translations/advanced-patterns.md); verify **separate kernels** per layout (not transpose-kernel + `tl.trans`); **autotuned** launches use `lambda META: (triton.cdiv(..., META["BLOCK_H"]), ...)` — no fixed `BLOCK_H`/`BLOCK_N` through `apply()` unless autotune is disabled
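
Step 10c's point about autotuned launches is easier to see with numbers. The sketch below is plain Python with a local `cdiv` standing in for `triton.cdiv`, so it runs without Triton installed; the sizes `H` and `N` and the two configs are made up for illustration. Because the grid is a function of `META`, it is recomputed for whichever block sizes the autotuner picks.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division; a stand-in for triton.cdiv so the sketch runs anywhere."""
    return (a + b - 1) // b


H, N = 1000, 512  # hypothetical problem sizes

# Autotuned launch: the grid depends on META, so it always matches the
# BLOCK_H / BLOCK_N of the config being tried.
grid = lambda META: (cdiv(H, META["BLOCK_H"]), cdiv(N, META["BLOCK_N"]))

configs = [
    {"BLOCK_H": 64, "BLOCK_N": 64},
    {"BLOCK_H": 128, "BLOCK_N": 256},
]
for config in configs:
    print(config, "->", grid(config))
```

A grid computed once from fixed block sizes would launch the wrong number of programs as soon as the autotuner tried a different config, which is why the checklist asks for the lambda form.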

Post-conversion Verification (TMA is mandatory for 2D+ loads):
 [ ] TMA: All 2D+ tile loads use tl.make_tensor_descriptor(...).load([...]); no raw ptr+mask for block-shaped 2D+ access (else 5x-20x regression)
 [ ] Grid uses a tuple or lambda (Triton does not require a 3-tuple, unlike cuTile)
 [ ] Triton autotune added if cuTile op used kernel_configs/autotune (see workflow § PERFORMANCE ANALYSIS)
 [ ] Host grid uses triton.cdiv where appropriate (rather than only hand-written (a+b-1)//b)
 [ ] Pointer/offset indexing: Triton uses element offsets (ptr + offs), not block index in tl.load (or use TMA descriptor)
 [ ] ct.astype(x, dtype) → x.to(dtype) in Triton
 [ ] ct.mma(a, b, acc=acc) → tl.dot(a, b, acc) (no keyword in Triton)
 [ ] Optional/None args: Triton allows None in kernel args if desired (cuTile required dummy+flag)
 [ ] Masking applied when BLOCK_SIZE > actual dimension (same as cuTile); with TMA, masks can often be removed for full tiles
 [ ] Reduction divisor uses actual_size, NOT BLOCK_SIZE
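 [ ] fp32/tf32: on NVIDIA GPUs tl.dot uses TF32 for fp32 inputs by default (allow_tf32=True, spelled input_precision="tf32" in newer Triton releases); match cuTile behavior if you had explicit tf32 cast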
 [ ] If any 2D+ load uses raw ptr+mask (exception only): document WHY TMA was not used
 [ ] tl.assume() alignment hints added for strides and pointers
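
The "reduction divisor" item is the kind of bug that passes shape checks but fails numerically. A small plain-Python illustration, with made-up values and a list standing in for a masked tile load:

```python
BLOCK_SIZE = 8  # padded tile width (power of two)
row = [2.0, 4.0, 6.0, 8.0, 10.0]
actual_size = len(row)  # 5 valid elements

# Emulate a masked load: lanes past actual_size are filled with 0.0.
tile = [row[i] if i < actual_size else 0.0 for i in range(BLOCK_SIZE)]

wrong_mean = sum(tile) / BLOCK_SIZE   # divides by the padded width
right_mean = sum(tile) / actual_size  # divides by the real element count

print("divide by BLOCK_SIZE: ", wrong_mean)
print("divide by actual_size:", right_mean)
```

The padded lanes contribute zero to the sum but still inflate the divisor, so the mean only comes out right when the real element count is used.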

Gotchas (Most Common Translation Errors) {#gotchas-most-common-translation-errors}

Comprehensive table of patterns that frequently break or regress when porting `@ct.kernel` to `@triton.jit`: mma accumulator, type cast, grid, TMA usage, dtype handling, layout flags, batched matmul, etc.

See: references/gotchas.md — read this BEFORE writing the Triton kernel.

Performance Gotchas (10-50x Regression Risk) {#performance-gotchas-10-50x-regression-risk}

⚠️ These cause CATASTROPHIC slowdowns. Check BEFORE benchmarking.

Patterns and their impact: TMA vs raw ptr+mask (5-20×), autotune vs fixed tile sizes (2-3×), `broadcast_to + tl.dot` (10-50×), `extract_slice` chains (2-5×), and more.

See: references/performance-gotchas.md — full regression-risk table.

Full details: translations/workflow.md — section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION).

Full API mapping: references/api-mapping.md.

Triton math dtype (erf/erfc/exp/log/sqrt) and the "don't substitute erf with tanh" pattern: references/debugging.md — section Triton Math Function Dtype Requirements (CRITICAL).

Optimization strategy (hub)

File: references/optimization-strategy.md

Summarizes translations/advanced-patterns.md (layout flags, dual kernels, autotune + `META`, batched launch, Blackwell pointers) and references/optimizing-reference.md (post-TMA micro-opts, §9) into §1–§3 plus a mandatory §4 Gemma FMHA checklist.

Rule: For attention / FMHA / Gemma-style conversions, open optimization-strategy in the same session as workflow — do not rely on TMA alone for perf sign-off.

Reference Documents {#reference-documents}

Read from cuTile → Triton perspective. Core files live in this skill under ``.

| Category | Document | Content |
|----------|----------|---------|
| Strategy | optimization-strategy.md | Ordered hub: advanced-patterns + optimizing-reference; §4 Gemma FMHA mandatory checklist |
| Workflows | translations/workflow.md | Standard c2t conversion (phases + checklist) |
| | translations/file-structure.md | Where to place Triton files when converting from cuTile |
| | translations/advanced-patterns.md | Dual layout flags (transpose), autotune + `META` grid, MLA-style two kernels |
| API | api-mapping.md | cuTile → Triton mapping |
| | optimizing-reference.md | GEMM/BMM/attention optimizations (EVEN_K, transpose, grid, autotune, epilogue subtile) |
| Gotchas | gotchas.md | Common cuTile→Triton translation errors (mma, dtype, grid, TMA, layout flags) |
| | performance-gotchas.md | 10-50× regression-risk table (TMA vs ptr+mask, broadcast_to, extract_slice chains, autotune) |
| Testing & errors | references/debugging.md | Triton runtime errors (cudaErrorIllegalAddress, pointer type, stride overflow) |

Worked Examples

Use cutile_kernel.py as source and triton_kernel.py as target:

| Example | Directory | Complexity |
|---------|-----------|------------|
| Vector Add | examples/01_vector_add/ | Basic |
| Softmax | examples/02_softmax/ | Intermediate |
| LayerNorm | examples/03_layernorm/ | Intermediate |
| MatMul | examples/04_matmul/ | Advanced |
| Attention | examples/05_attention/ | Advanced |

Read `cutile_kernel.py` first, then `triton_kernel.py`, to see the inverse mapping.

⚠️ MANDATORY COMPLETION CHECKLIST (DO NOT SKIP)

A conversion is NOT COMPLETE until ALL items are checked. Copy and complete:

MANDATORY COMPLETION GATES:
 [ ] 1. CORRECTNESS: pytest passes with 0 failures
     Command: python -m pytest {test_path} -k "test_op and triton" -vs --tb=short
     Gate: "X passed, 0 failed"

 [ ] 2. TMA OPTIMIZATION: All 2D+ tile loads use tl.make_tensor_descriptor
     Verify: grep -n "tl.load.*mask" triton_file.py | wc -l  # Should be 0 for 2D+ ops
     Skip = 5-20x performance regression

 [ ] 3. PERFORMANCE TEST: Triton within 20% of cuTile baseline
     Command: python -m pytest {test_path} -k "test_perf" --print-record -v
     OR: Run benchmark script: cd tests/benchmark && python bench_{op}.py
     Gate: Triton TFLOPS >= 0.8 * CuTile TFLOPS

 [ ] 4. PERFORMANCE COMPARISON RECORDED:
     Document results:
     | Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |
     |--------|-----------------|-----------------|-------|
     | [fill] | [fill]          | [fill]          | [fill]|

CONVERSION COMPLETE: All 4 gates passed? → YES / NO

Why this matters:

Gate 1 catches functional bugs
Gate 2 prevents catastrophic 5-20x regressions (most common mistake)
Gate 3 validates that optimization was effective
Gate 4 creates accountability record

If any gate fails: Fix and re-verify before declaring complete.
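
Gates 3 and 4 reduce to a ratio check plus a filled-in table, which can be sketched in a few lines of Python. The function name `perf_gate`, the config names, and the TFLOPS figures below are hypothetical placeholders, not measured results.

```python
def perf_gate(triton_tflops: float, cutile_tflops: float, threshold: float = 0.8):
    """Return (ratio, passed) for Gate 3: Triton TFLOPS >= threshold * CuTile TFLOPS."""
    ratio = triton_tflops / cutile_tflops
    return ratio, triton_tflops >= threshold * cutile_tflops


# Hypothetical measurements, for illustration only (not real benchmark data).
measurements = [("config_a", 90.0, 100.0), ("config_b", 70.0, 100.0)]

print("| Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |")
print("|--------|-----------------|-----------------|-------|")
for name, triton_tf, cutile_tf in measurements:
    ratio, passed = perf_gate(triton_tf, cutile_tf)
    status = "PASS" if passed else "FAIL"
    print(f"| {name} | {triton_tf:.1f} | {cutile_tf:.1f} | {ratio:.2f} ({status}) |")
```

Recording the ratio per config, rather than a single average, keeps a collapse on one shape or layout from hiding behind good numbers elsewhere.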
