tilegym-converting-cutile-to-triton
cuTile → Triton Conversion
Convert `@ct.kernel` kernels to `@triton.jit`. API mapping: references/api-mapping.md (cuTile → Triton).
In this skill's Markdown, Triton launch syntax `kernel[grid](…)` uses Unicode brackets so link checkers do not parse `[grid](…)` as a hyperlink; use normal ASCII brackets in real Triton code.
Instructions
Follow the phase-gated workflow in translations/workflow.md. Every conversion should go through analyze → convert → validate → test → benchmark, with explicit gates before moving on. Use the documents in Workflow Selection when the task matches a special case (errors, layout flags, perf).
- Optimization strategy (perf-sensitive / attention): If the op is attention, FMHA, sliding window, soft cap, or GQA (e.g. Gemma `gemma_attention`), read references/optimization-strategy.md before converting the inner loop, then apply the §4 Gemma FMHA checklist. For other GEMM/BMM/attention-adjacent kernels, still skim §2–§3 of that file after TMA is done.
- Select path: For an existing TileGym op, use standard mode in `translations/workflow.md`. If the cuTile source uses `transpose`/`transpose_v`, dual layouts, or MLA-style paths, read translations/advanced-patterns.md before writing Triton (two kernels + `META` grid, not one kernel + `tl.trans`).
- Pre-flight: Run the Pre-flight Analysis grep commands on the cuTile source. Count `@ct.kernel` definitions; note TMA-relevant `ct.load`/`ct.store`, `ct.launch`, `Constant`, and layout flags.
- Read mapping: Keep references/api-mapping.md open for cuTile → Triton API pairs. For runtime failures (illegal address, dtype, strides), use references/debugging.md.
- Convert: Copy the Conversion Checklist into a todo list and execute in order. Structure and file placement: translations/file-structure.md. Mandatory: any 2D+ block-shaped tile load/store uses `tl.make_tensor_descriptor` (TMA), not raw `tl.load(ptr+offs, mask=…)` for full tiles; skipping this is the most common source of large regressions. Host side: Triton bracket launch `kernel[grid](args)` with a tuple, or `lambda META: (…)` for autotune; no `ct.launch`.
- Validate: Syntax-check the new Triton module; run the relevant TileGym pytest targets for the op: `pytest tests/ops/test_<op>.py -k "triton" -vs`. Fix failures before benchmarking.
- Benchmark: Compare Triton vs cuTile on perf tests. If Triton is clearly slower, follow PERFORMANCE ANALYSIS (Phase c2t-5) in translations/workflow.md and references/optimizing-reference.md for GEMM/BMM/attention; use references/optimization-strategy.md as the ordered checklist. If you see 10–50× slowdowns, read CRITICAL PERFORMANCE PATTERNS in that same workflow file first.
Execution rules (MUST):
- Create and track the conversion checklist (e.g. TodoWrite) before editing kernel code; complete steps in order—do not skip pre-flight or TMA decisions.
- For attention / FMHA / Gemma / GQA / soft cap / sliding window: read references/optimization-strategy.md and apply §4 before treating the conversion as optimized.
- Do not ship raw pointer+mask 2D+ tile loads where TMA applies; document any intentional exception.
- If tests or benchmarks fail a gate, stop and fix before declaring the conversion done—do not stack unverified changes.
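The host-side launch rule in the steps above (a tuple grid, or `lambda META: (…)` under autotune) can be sketched without a GPU. The block below is a minimal illustration, not the skill's own code: it assumes a hypothetical 2D problem of size `M` by `N` and tile names `BLOCK_M`/`BLOCK_N`, uses a local `cdiv` as a stand-in for `triton.cdiv`, and uses plain dicts as stand-ins for the autotune configs that Triton would pass as `META`.

```python
def cdiv(a, b):
    """Ceiling division; same result as (a + b - 1) // b for positive b."""
    return -(-a // b)

M, N = 1000, 512  # hypothetical problem size

# Fixed tile sizes: the grid can be a plain tuple.
BLOCK_M, BLOCK_N = 128, 64
fixed_grid = (cdiv(M, BLOCK_M), cdiv(N, BLOCK_N))

# Autotuned tile sizes: the grid is a callable evaluated per chosen config.
grid = lambda META: (cdiv(M, META["BLOCK_M"]), cdiv(N, META["BLOCK_N"]))

configs = [{"BLOCK_M": 128, "BLOCK_N": 64}, {"BLOCK_M": 64, "BLOCK_N": 128}]
grids = [grid(cfg) for cfg in configs]

print("fixed:", fixed_grid)
for cfg, g in zip(configs, grids):
    print(cfg, "->", g)
```

In real Triton code the same tuple or callable goes between the brackets of the launch, so a fixed tuple would silently keep the wrong program count once autotune changes the block sizes.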
Workflow Selection
- Existing TileGym op → Standard Mode: translations/workflow.md
- Errors (`cudaErrorIllegalAddress`, shape mismatch, numerical mismatch) → references/debugging.md
- Advanced patterns (TMA, dual layout flags `transpose`, autotune + `META` grid, Array.slice, ct.gather().item()) → translations/advanced-patterns.md (MLA-style two kernels, avoid 3–15× regression on `transpose=False`).
- Performance (Triton kernel slower than cuTile, autotuning, profiling) → translations/workflow.md (section PERFORMANCE ANALYSIS (Phase c2t-5))
- Optimization strategy hub (ordered checklist: advanced-patterns + optimizing-reference) → references/optimization-strategy.md; read first for attention/FMHA/Gemma, then drill into the two source docs as needed
- Optimizing GEMM/BMM/attention (after TMA, or Triton 10–20% slower) → references/optimizing-reference.md: EVEN_K fast path, transpose via pointer arithmetic, grid layout, autotune breadth, epilogue subtile; use these patterns during conversion and before perf sign-off (summarized in optimization-strategy §2–§3)
- Gemma attention / GQA FMHA conversion → references/optimization-strategy.md §4
- Blackwell optimization (complex kernels with iterative algorithms, register pressure, loop unrolling) → references/optimizing-reference.md §9: TMA descriptors, `loop_unroll_factor`, occupancy autotuning, TMEM-friendly block sizes, slab allocator, dual-path kernel design
- ⚠️ 10-50x REGRESSION (catastrophic slowdown after conversion) → translations/workflow.md, section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION)
- ⚠️ Good perf on `transpose=True` only, collapse on `transpose=False` (or opposite) → translations/advanced-patterns.md, §1 Dual layout flag; two `@triton.jit` kernels + `grid = lambda META: (... META["BLOCK_H"] ...)`
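The dual-layout entry above amounts to a host-side dispatch: one kernel per layout, each launched with a `META`-driven grid. The following is only a sketch of that shape under stated assumptions: the kernel names are hypothetical placeholders for two separately written `@triton.jit` kernels, `H` is a hypothetical problem dimension, and the local `cdiv` stands in for `triton.cdiv`.

```python
def cdiv(a, b):
    """Ceiling division, standing in for triton.cdiv."""
    return -(-a // b)

def pick_launch(transpose, H):
    """Choose one of two layout-specific kernels and build its grid callable.

    The names returned here are hypothetical; in a real conversion they would
    be two distinct kernels, not one kernel that applies tl.trans internally.
    """
    kernel_name = "kernel_transposed" if transpose else "kernel_plain"
    # The grid depends on the autotuned BLOCK_H, so it is a callable, not a fixed tuple.
    grid = lambda META: (cdiv(H, META["BLOCK_H"]),)
    return kernel_name, grid

name_t, grid_t = pick_launch(True, H=1000)
name_f, grid_f = pick_launch(False, H=1000)
print(name_t, grid_t({"BLOCK_H": 64}))
print(name_f, grid_f({"BLOCK_H": 128}))
```

Because each layout gets its own kernel and its own grid callable, the autotuner can settle on different block sizes per layout, which a single shared kernel cannot do.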
Pre-flight Analysis (Run BEFORE converting)
```bash
# Count kernels (only the main kernel gets @triton.jit; helpers stay plain def)
grep "@ct\.kernel" source.py | wc -l

# Check for patterns needing special handling
# (-E enables alternation with |; dots are escaped so they match literally)
grep -E "ct\.transpose|ct\.permute" source.py   # → use tl.trans/tl.permute
grep "ct\.astype" source.py                     # → use .to(dtype)
grep -E "ct\.load|ct\.store" source.py          # → TMA for 2D+ (tl.make_tensor_descriptor), NOT raw tl.load(ptr+offs)
grep "ct\.launch" source.py                     # → bracket launch: kernel then [grid] then (args)
grep -E "ct\.Constant|ct\.ConstInt" source.py   # → tl.constexpr
grep "ct\.cdiv" source.py                       # → triton.cdiv (host) or Python (a+b-1)//b
grep -E "ct\.bid|ct\.num_blocks" source.py      # → tl.program_id/tl.num_programs
grep -E "1 << .*\.bit_length" source.py         # → triton.next_power_of_2 if needed
grep -E "transpose|transpose_v" source.py       # → if hit, read translations/advanced-patterns.md (dual kernels + META grid)
```
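The same pre-flight scan can be run from Python when a per-pattern count is easier to track than raw grep output. This is a minimal sketch, not part of the skill: the pattern list mirrors the grep commands above, and the sample source is a hypothetical cuTile-like snippet used only as text (its call signatures are illustrative and are never executed).

```python
import re

# (regex, conversion hint) pairs mirroring the grep commands.
PREFLIGHT_PATTERNS = [
    (r"@ct\.kernel", "kernel definitions"),
    (r"ct\.transpose|ct\.permute", "tl.trans / tl.permute"),
    (r"ct\.astype", ".to(dtype)"),
    (r"ct\.load|ct\.store", "TMA for 2D+ (tl.make_tensor_descriptor)"),
    (r"ct\.launch", "bracket launch"),
    (r"ct\.Constant|ct\.ConstInt", "tl.constexpr"),
    (r"ct\.cdiv", "triton.cdiv (host)"),
    (r"ct\.bid|ct\.num_blocks", "tl.program_id / tl.num_programs"),
]

def preflight(source_text):
    """Return {regex: number of matching lines}, like `grep PATTERN | wc -l`."""
    lines = source_text.splitlines()
    return {
        pat: sum(1 for line in lines if re.search(pat, line))
        for pat, _ in PREFLIGHT_PATTERNS
    }

# Hypothetical input text; the argument names are illustrative only.
SAMPLE = '''
@ct.kernel
def add(a, b, c, N: ct.Constant[int]):
    i = ct.bid(0)
    x = ct.load(a, index=(i,), shape=(N,))
    y = ct.load(b, index=(i,), shape=(N,))
    ct.store(c, index=(i,), tile=ct.astype(x + y, ct.float32))
'''

counts = preflight(SAMPLE)
for pat, hint in PREFLIGHT_PATTERNS:
    if counts[pat]:
        print(f"{counts[pat]:2d}  {pat:30s} -> {hint}")
```

Counting matching lines rather than matches keeps the numbers comparable with the `wc -l` output of the shell version.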
Conversion Checklist
Copy this checklist and track progress:
Conversion Progress:
[ ] Step 0 (attention / Gemma FMHA / GQA / soft cap / sliding window): Read [references/optimization-strategy.md](./references/optimization-strategy.md) and apply §4 checklist before inner-loop Triton
[ ] Step 1: Pre-flight — run grep commands above, note special patterns and 2D+ loads (→ TMA)
[ ] Step 2: Analyze source cuTile kernel (identify patterns, shapes, dtypes)
[ ] Step 3: Create Triton file with correct structure (see translations/file-structure.md)
[ ] Step 4: Convert kernel signature (tensor args → pointer args, Constant → constexpr)
[ ] Step 4b: TMA (MANDATORY for 2D+ loads) — use tl.make_tensor_descriptor for every 2D+ tile load/store; do NOT ship raw tl.load(ptr+offs,mask) for block-shaped access (see workflow.md § TMA OPTIMIZATION)
[ ] Step 5: Convert kernel body (apply gotchas table below + API mapping)
[ ] Step 6: Convert host wrapper (grid tuple/lambda, bracket-style launch: kernel, grid, then arguments; no ct.launch); call triton.set_allocator(alloc_fn) if using TMA
[ ] Step 7: Validate — run pytest or syntax check on Triton file
[ ] Step 8: Test — run pytest, verify X passed 0 failed
[ ] Step 9: If test fails → fix → re-validate → re-test (loop until green)
[ ] Step 10: Benchmark — run perf test, compare vs cuTile (see workflow.md § PERFORMANCE ANALYSIS)
[ ] Step 10b: If GEMM/BMM/attention and Triton >20% slower → walk [references/optimization-strategy.md](./references/optimization-strategy.md) §2–§3 then [references/optimizing-reference.md](./references/optimizing-reference.md) (EVEN_K, transpose, grid, autotune, epilogue subtile), then re-benchmark
[ ] Step 10c: If op has `transpose` / layout flag → read [translations/advanced-patterns.md](./translations/advanced-patterns.md); verify **separate kernels** per layout (not transpose-kernel + `tl.trans`); **autotuned** launches use `lambda META: (triton.cdiv(..., META["BLOCK_H"]), ...)` — no fixed `BLOCK_H`/`BLOCK_N` through `apply()` unless autotune is disabled
Post-conversion Verification (TMA is mandatory for 2D+ loads):
[ ] TMA: All 2D+ tile loads use tl.make_tensor_descriptor(...).load([...]); no raw ptr+mask for block-shaped 2D+ access (else 5x-20x regression)
[ ] Grid uses tuple or lambda (not 3-tuple required like cuTile)
[ ] Triton autotune added if cuTile op used kernel_configs/autotune (see workflow § PERFORMANCE ANALYSIS)
[ ] Host grid uses triton.cdiv where appropriate (not (a+b-1)//b only)
[ ] Pointer/offset indexing: Triton uses element offsets (ptr + offs), not block index in tl.load (or use TMA descriptor)
[ ] ct.astype(x, dtype) → x.to(dtype) in Triton
[ ] ct.mma(a, b, acc=acc) → tl.dot(a, b, acc) (no keyword in Triton)
[ ] Optional/None args: Triton allows None in kernel args if desired (cuTile required dummy+flag)
[ ] Masking applied when BLOCK_SIZE > actual dimension (same as cuTile); with TMA, masks can often be removed for full tiles
[ ] Reduction divisor uses actual_size, NOT BLOCK_SIZE
[ ] fp32/tf32: Triton defaults allow_tf32=True; match cuTile behavior if you had explicit tf32 cast
[ ] If any 2D+ load uses raw ptr+mask (exception only): document WHY TMA was not used
[ ] tl.assume() alignment hints added for strides and pointers
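The reduction-divisor item in the verification list is easy to get wrong once a tile is padded. The pure-Python sketch below illustrates the arithmetic only; it assumes a hypothetical row of 5 elements placed into a tile of `BLOCK_SIZE = 8`, with out-of-range lanes filled with 0 as a masked load with a zero fill value would do.

```python
BLOCK_SIZE = 8                      # tile width, larger than the real row
row = [2.0, 4.0, 6.0, 8.0, 10.0]    # hypothetical data
actual_size = len(row)              # 5

# Emulate a masked load: lanes at or beyond actual_size read as 0.0.
tile = [row[i] if i < actual_size else 0.0 for i in range(BLOCK_SIZE)]

total = sum(tile)                   # padding lanes add nothing to the sum
wrong_mean = total / BLOCK_SIZE     # divides by the padded width
right_mean = total / actual_size    # divides by the real element count

print("sum:", total)
print("mean with BLOCK_SIZE :", wrong_mean)
print("mean with actual_size:", right_mean)
```

The sum is unaffected by the zero padding, so the only error source is the divisor; the same reasoning applies to variance and other normalized reductions.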
Gotchas (Most Common Translation Errors) {#gotchas-most-common-translation-errors}
Comprehensive table of patterns that frequently break or regress when porting `@ct.kernel` to `@triton.jit`: mma accumulator, type cast, grid, TMA usage, dtype handling, layout flags, batched matmul, etc.
See: references/gotchas.md; read this BEFORE writing the Triton kernel.
Performance Gotchas (10-50x Regression Risk) {#performance-gotchas-10-50x-regression-risk}
⚠️ These cause CATASTROPHIC slowdowns. Check BEFORE benchmarking.
Patterns and their impact: TMA vs raw ptr+mask (5-20×), autotune vs fixed tile sizes (2-3×), `broadcast_to + tl.dot` (10-50×), `extract_slice` chains (2-5×), and more.
See: references/performance-gotchas.md for the full regression-risk table.
Full details: translations/workflow.md — section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION).
Full API mapping: references/api-mapping.md.
Triton math dtype (erf/erfc/exp/log/sqrt) and the "don't substitute erf with tanh" pattern: references/debugging.md — section Triton Math Function Dtype Requirements (CRITICAL).
Optimization strategy (hub)
File: references/optimization-strategy.md
Summarizes translations/advanced-patterns.md (layout flags, dual kernels, autotune + `META`, batched launch, Blackwell pointers) and references/optimizing-reference.md (post-TMA micro-opts, §9) into §1–§3 plus a mandatory §4 Gemma FMHA checklist.
Rule: For attention / FMHA / Gemma-style conversions, open optimization-strategy in the same session as workflow; do not rely on TMA alone for perf sign-off.
Reference Documents {#reference-documents}
Read from the cuTile → Triton perspective. Core files live in this skill's own directory.
| Category | Document | Content |
|---|---|---|
| Strategy | optimization-strategy.md | Ordered hub: advanced-patterns + optimizing-reference; §4 Gemma FMHA mandatory checklist |
| Workflows | translations/workflow.md | Standard c2t conversion (phases + checklist) |
| | translations/file-structure.md | Where to place Triton files when converting from cuTile |
| | translations/advanced-patterns.md | Dual layout flags (transpose), autotune + `META` |
| API | api-mapping.md | cuTile → Triton mapping |
| | optimizing-reference.md | GEMM/BMM/attention optimizations (EVEN_K, transpose, grid, autotune, epilogue subtile) |
| Gotchas | gotchas.md | Common cuTile→Triton translation errors (mma, dtype, grid, TMA, layout flags) |
| | performance-gotchas.md | 10-50× regression-risk table (TMA vs ptr+mask, broadcast_to, extract_slice chains, autotune) |
| Testing & errors | references/debugging.md | Triton runtime errors (cudaErrorIllegalAddress, pointer type, stride overflow) |
Worked Examples
Use cutile_kernel.py as source and triton_kernel.py as target:
| Example | Directory | Complexity |
|---|---|---|
| Vector Add | examples/01_vector_add/ | Basic |
| Softmax | examples/02_softmax/ | Intermediate |
| LayerNorm | examples/03_layernorm/ | Intermediate |
| MatMul | examples/04_matmul/ | Advanced |
| Attention | examples/05_attention/ | Advanced |
Read `cutile_kernel.py` first, then `triton_kernel.py`, to see the inverse mapping.
⚠️ MANDATORY COMPLETION CHECKLIST (DO NOT SKIP)
A conversion is NOT COMPLETE until ALL items are checked. Copy and complete:
MANDATORY COMPLETION GATES:
[ ] 1. CORRECTNESS: pytest passes with 0 failures
Command: python -m pytest {test_path} -k "test_op and triton" -vs --tb=short
Gate: "X passed, 0 failed"
[ ] 2. TMA OPTIMIZATION: All 2D+ tile loads use tl.make_tensor_descriptor
Verify: grep -n "tl.load.*mask" triton_file.py | wc -l # Should be 0 for 2D+ ops
Skip = 5-20x performance regression
[ ] 3. PERFORMANCE TEST: Triton within 20% of cuTile baseline
Command: python -m pytest {test_path} -k "test_perf" --print-record -v
OR: Run benchmark script: cd tests/benchmark && python bench_{op}.py
Gate: Triton TFLOPS >= 0.8 * CuTile TFLOPS
[ ] 4. PERFORMANCE COMPARISON RECORDED:
Document results:
| Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |
|--------|-----------------|-----------------|-------|
| [fill] | [fill] | [fill] | [fill]|
CONVERSION COMPLETE: All 4 gates passed? → YES / NO
Why this matters:
- Gate 1 catches functional bugs
- Gate 2 prevents catastrophic 5-20x regressions (most common mistake)
- Gate 3 validates that optimization was effective
- Gate 4 creates accountability record
If any gate fails: Fix and re-verify before declaring complete.
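Gates 3 and 4 can be checked and recorded in one pass. The helper below is a minimal sketch under stated assumptions: the function name, the config labels, and the TFLOPS figures are all hypothetical placeholders (not measurements), and the threshold encodes the gate `Triton TFLOPS >= 0.8 * CuTile TFLOPS` from the checklist.

```python
def perf_gate(rows, threshold=0.8):
    """Build the Gate 4 table and evaluate Gate 3 for every config.

    rows: iterable of (config_label, triton_tflops, cutile_tflops).
    Returns (markdown_table, all_configs_passed).
    """
    lines = [
        "| Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |",
        "|--------|-----------------|-----------------|-------|",
    ]
    all_pass = True
    for config, triton_tflops, cutile_tflops in rows:
        ratio = triton_tflops / cutile_tflops
        # Gate 3: Triton must reach at least `threshold` of the cuTile baseline.
        all_pass = all_pass and triton_tflops >= threshold * cutile_tflops
        lines.append(
            f"| {config} | {triton_tflops:.1f} | {cutile_tflops:.1f} | {ratio:.2f} |"
        )
    return "\n".join(lines), all_pass

# Placeholder numbers for illustration only.
placeholder_rows = [("config A", 410.0, 500.0), ("config B", 150.0, 200.0)]
table, gate3_passed = perf_gate(placeholder_rows)
print(table)
print("Gate 3 passed:", gate3_passed)
```

Requiring every config to clear the threshold, rather than the average, keeps a single collapsed shape (for example one layout flag) from hiding behind good results elsewhere.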