Loading...
Loading...
Converts cuTile GPU kernels (@ct.kernel) to Triton (@triton.jit). Handles standard in-repo conversion, debugging (cudaErrorIllegalAddress, shape mismatch, numerical mismatch), and mapping cuTile idioms (ct.load/ct.store, ct.Constant, ct.launch) to Triton equivalents. Covers dual-kernel layout flags (e.g. transpose=True/False + autotune grid via META) per translations/advanced-patterns.md. Use when converting, porting, or translating cuTile kernels to Triton, or debugging existing Triton translations.
npx skill4agent add nvidia/skills converting-cutile-to-triton@ct.kernel@triton.jitkernel[grid](…)[grid](…)gemma_attentiontranslations/workflow.mdtransposetranspose_vMETAtl.trans@ct.kernelct.loadct.storect.launchConstanttl.make_tensor_descriptortl.load(ptr+offs, mask=…)lambda META: (…)ct.launchpytest tests/ops/test_<op>.py -k "triton" -vscudaErrorIllegalAddresstransposeMETAtranspose=Falseloop_unroll_factortranspose=Truetranspose=False@triton.jitgrid = lambda META: (... META["BLOCK_H"] ...)# Count kernels (only main kernel gets @triton.jit, helpers stay plain def)
grep "@ct\.kernel" source.py | wc -l
# Check for patterns needing special handling
grep "ct\.transpose\|ct\.permute" source.py # → use tl.trans/tl.permute
grep "ct\.astype" source.py # → use .to(dtype)
grep "ct\.load\|ct\.store" source.py # → TMA for 2D+ (tl.make_tensor_descriptor), NOT raw tl.load(ptr+offs)
grep "ct\.launch" source.py # → bracket launch: kernel then [grid] then (args)
grep "ct\.Constant\|ct\.ConstInt" source.py # → tl.constexpr
grep "ct\.cdiv" source.py # → triton.cdiv (host) or Python (a+b-1)//b
grep "ct\.bid\|ct\.num_blocks" source.py # → tl.program_id/tl.num_programs
grep "1 << .*\.bit_length" source.py # → triton.next_power_of_2 if needed
grep "transpose\|transpose_v" source.py # → if hit, read translations/advanced-patterns.md (dual kernels + META grid)Conversion Progress:
[ ] Step 0 (attention / Gemma FMHA / GQA / soft cap / sliding window): Read [references/optimization-strategy.md](./references/optimization-strategy.md) and apply §4 checklist before inner-loop Triton
[ ] Step 1: Pre-flight — run grep commands above, note special patterns and 2D+ loads (→ TMA)
[ ] Step 2: Analyze source cuTile kernel (identify patterns, shapes, dtypes)
[ ] Step 3: Create Triton file with correct structure (see translations/file-structure.md)
[ ] Step 4: Convert kernel signature (tensor args → pointer args, Constant → constexpr)
[ ] Step 4b: TMA (MANDATORY for 2D+ loads) — use tl.make_tensor_descriptor for every 2D+ tile load/store; do NOT ship raw tl.load(ptr+offs,mask) for block-shaped access (see workflow.md § TMA OPTIMIZATION)
[ ] Step 5: Convert kernel body (apply gotchas table below + API mapping)
[ ] Step 6: Convert host wrapper (grid tuple/lambda, bracket-style launch: kernel, grid, then arguments; no ct.launch); call triton.set_allocator(alloc_fn) if using TMA
[ ] Step 7: Validate — run pytest or syntax check on Triton file
[ ] Step 8: Test — run pytest, verify X passed 0 failed
[ ] Step 9: If test fails → fix → re-validate → re-test (loop until green)
[ ] Step 10: Benchmark — run perf test, compare vs cuTile (see workflow.md § PERFORMANCE ANALYSIS)
[ ] Step 10b: If GEMM/BMM/attention and Triton >20% slower → walk [references/optimization-strategy.md](./references/optimization-strategy.md) §2–§3 then [references/optimizing-reference.md](./references/optimizing-reference.md) (EVEN_K, transpose, grid, autotune, epilogue subtile), then re-benchmark
[ ] Step 10c: If op has `transpose` / layout flag → read [translations/advanced-patterns.md](./translations/advanced-patterns.md); verify **separate kernels** per layout (not transpose-kernel + `tl.trans`); **autotuned** launches use `lambda META: (triton.cdiv(..., META["BLOCK_H"]), ...)` — no fixed `BLOCK_H`/`BLOCK_N` through `apply()` unless autotune is disabled
Post-conversion Verification (TMA is mandatory for 2D+ loads):
[ ] TMA: All 2D+ tile loads use tl.make_tensor_descriptor(...).load([...]); no raw ptr+mask for block-shaped 2D+ access (else 5x-20x regression)
[ ] Grid uses tuple or lambda (not 3-tuple required like cuTile)
[ ] Triton autotune added if cuTile op used kernel_configs/autotune (see workflow § PERFORMANCE ANALYSIS)
[ ] Host grid uses triton.cdiv where appropriate (not (a+b-1)//b only)
[ ] Pointer/offset indexing: Triton uses element offsets (ptr + offs), not block index in tl.load (or use TMA descriptor)
[ ] ct.astype(x, dtype) → x.to(dtype) in Triton
[ ] ct.mma(a, b, acc=acc) → tl.dot(a, b, acc) (no keyword in Triton)
[ ] Optional/None args: Triton allows None in kernel args if desired (cuTile required dummy+flag)
[ ] Masking applied when BLOCK_SIZE > actual dimension (same as cuTile); with TMA, masks can often be removed for full tiles
[ ] Reduction divisor uses actual_size, NOT BLOCK_SIZE
[ ] fp32/tf32: Triton defaults allow_tf32=True; match cuTile behavior if you had explicit tf32 cast
[ ] If any 2D+ load uses raw ptr+mask (exception only): document WHY TMA was not used
[ ] tl.assume() alignment hints added for strides and pointers@ct.kernel@triton.jitbroadcast_to + tl.dotextract_sliceMETA| Category | Document | Content |
|---|---|---|
| Strategy | optimization-strategy.md | Ordered hub: advanced-patterns + optimizing-reference; §4 Gemma FMHA mandatory checklist |
| Workflows | translations/workflow.md | Standard c2t conversion (phases + checklist) |
| translations/file-structure.md | Where to place Triton files when converting from cuTile | |
| translations/advanced-patterns.md | Dual layout flags (transpose), autotune + | |
| API | api-mapping.md | cuTile → Triton mapping |
| optimizing-reference.md | GEMM/BMM/attention optimizations (EVEN_K, transpose, grid, autotune, epilogue subtile) | |
| Gotchas | gotchas.md | Common cuTile→Triton translation errors (mma, dtype, grid, TMA, layout flags) |
| performance-gotchas.md | 10-50× regression-risk table (TMA vs ptr+mask, broadcast_to, extract_slice chains, autotune) | |
| Testing & errors | references/debugging.md | Triton runtime errors (cudaErrorIllegalAddress, pointer type, stride overflow) |
| Example | Directory | Complexity |
|---|---|---|
| Vector Add | examples/01_vector_add/ | Basic |
| Softmax | examples/02_softmax/ | Intermediate |
| LayerNorm | examples/03_layernorm/ | Intermediate |
| MatMul | examples/04_matmul/ | Advanced |
| Attention | examples/05_attention/ | Advanced |
cutile_kernel.pytriton_kernel.pyMANDATORY COMPLETION GATES:
[ ] 1. CORRECTNESS: pytest passes with 0 failures
Command: python -m pytest {test_path} -k "test_op and triton" -vs --tb=short
Gate: "X passed, 0 failed"
[ ] 2. TMA OPTIMIZATION: All 2D+ tile loads use tl.make_tensor_descriptor
Verify: grep -n "tl.load.*mask" triton_file.py | wc -l # Should be 0 for 2D+ ops
Skip = 5-20x performance regression
[ ] 3. PERFORMANCE TEST: Triton within 20% of cuTile baseline
Command: python -m pytest {test_path} -k "test_perf" --print-record -v
OR: Run benchmark script: cd tests/benchmark && python bench_{op}.py
Gate: Triton TFLOPS >= 0.8 * CuTile TFLOPS
[ ] 4. PERFORMANCE COMPARISON RECORDED:
Document results:
| Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |
|--------|-----------------|-----------------|-------|
| [fill] | [fill] | [fill] | [fill]|
CONVERSION COMPLETE: All 4 gates passed? → YES / NO