Triton TileIR Optimization

Optimize EXISTING Triton kernels for NVIDIA's TileIR backend on Blackwell GPUs. This skill does NOT write kernels from scratch -- that is the Triton Specialist's job.

Principles

TileIR vs PTX Backend

TileIR is NVIDIA's compiler backend for Triton that generates optimized CUDA code using CGA-level (Cooperative Grid Array) tile representations. Critical differences:

| Parameter | PTX Backend | TileIR Backend |
|-----------|-------------|----------------|
| `num_warps` | Strict directive | Ignored (compiler decides) |
| `num_stages` | Strict directive | Cost hint (compiler optimizes) |
| `occupancy` | Not available | Critical tuning param (1-32) |
| `num_ctas` | Limited | 2CTA mode for Blackwell |
| Block sizes | Smaller often better | Larger often better |
| TMA | Not available | Required for dot kernels |

Key implication: Do not tune `num_warps` for TileIR -- focus on `occupancy` instead.

Triton Package Landscape

Three packages share `import triton`:

| Package | Source | Use Case |
|---------|--------|----------|
| `pytorch-triton` | PyTorch wheel | `torch.compile`, standard kernels |
| `triton` | OpenAI PyPI | Official Triton from triton-lang.org |
| `nvtriton` | Triton-to-tile-IR | TileIR backend for Blackwell |

Only one triton package should be installed at a time. "Converting to TileIR" means adding TileIR-specific configs, NOT changing imports. TileIR activates via `ENABLE_TILE=1`.
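Because all three distributions install the same `triton` module, one way to confirm that only one is present is to query the installed distribution metadata. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_triton_packages` and its injectable `lookup` parameter are hypothetical, and the distribution names are taken from the package table above, so adjust them if your wheels are named differently.

```python
from importlib import metadata

# Distribution names as listed in the package table (assumed to match pip names).
TRITON_DISTRIBUTIONS = ("pytorch-triton", "triton", "nvtriton")


def installed_triton_packages(lookup=metadata.version):
    """Return {distribution: version} for each Triton distribution found."""
    found = {}
    for name in TRITON_DISTRIBUTIONS:
        try:
            found[name] = lookup(name)
        except metadata.PackageNotFoundError:
            continue  # this distribution is not installed
    return found


if __name__ == "__main__":
    found = installed_triton_packages()
    if len(found) > 1:
        print("Warning: multiple Triton distributions installed:", found)
    else:
        print("Installed Triton distribution:", found or "none")
```

The `lookup` parameter exists only so the helper can be exercised without changing the local environment.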

When TileIR Applies

TileIR targets Blackwell (sm_100+). Without nvtriton or Blackwell hardware, the specialist still adds TileIR-optimized configs that standard triton safely ignores, enabling future deployment.

Expected speedups (with nvtriton on Blackwell):

| Kernel Type | Speedup | Key Lever |
|-------------|---------|-----------|
| Dot-Related (GEMM, Attention) | 1.2-2.0x | TMA + 2CTA |
| Norm-Like (LayerNorm, Softmax) | 2.0-5.0x | High occupancy |
| Element-Wise (ReLU, Add, Exp) | 1.5-3.0x | Occupancy + num_stages |
| Reduction (Sum, Mean, Max) | 1.8-4.0x | High occupancy |

Workflow

Five-phase workflow: compatibility, classify, transform, validate, benchmark.

Phase 1: Compatibility Test (ENABLE_TILE=0)

Verify the kernel works in PTX mode before applying TileIR optimizations.

```bash
python scripts/tileir_check.py
```

Then use the kernel-triton-writing skill's `verify_kernel.py` to verify with `ENABLE_TILE=0`:

```bash
python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
```
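One way to make the backend selection explicit is to launch the verification script in a child process with `ENABLE_TILE` set in its environment. This is a minimal sketch, not part of the skill's scripts: `backend_env` and `run_verification` are hypothetical helper names, and `script_args` stands for whatever arguments `verify_kernel.py` accepts, as in the command above.

```python
import os
import subprocess
import sys


def backend_env(tile_enabled, base=None):
    """Return a copy of the environment with ENABLE_TILE set to "1" or "0"."""
    env = dict(os.environ if base is None else base)
    env["ENABLE_TILE"] = "1" if tile_enabled else "0"
    return env


def run_verification(script_args, tile_enabled):
    """Run scripts/verify_kernel.py in a child process for the chosen backend."""
    cmd = [sys.executable, "scripts/verify_kernel.py", *script_args]
    # check=False: the caller inspects returncode instead of catching an exception.
    return subprocess.run(cmd, env=backend_env(tile_enabled), check=False)
```

Calling `run_verification([...], tile_enabled=False)` would cover this phase; the same helper with `tile_enabled=True` would cover the TileIR validation phase.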

Phase 2: Classify Kernel

Determine kernel type to select the optimization strategy.

```bash
python scripts/classify_kernel.py --file kernel.py
```

Classification decision tree:

Contains tl.dot()?
  YES --> dot-related: TMA + 2CTA + occupancy + larger blocks
  NO  --> Has reduction + normalization?
            YES --> norm-like: high occupancy (2, 4) + num_warps (4, 8)
            NO  --> Point-wise only?
                      YES --> element-wise: occupancy (1-16) + num_stages (2-4)
                      NO  --> reduction: high occupancy + num_warps
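The decision tree above can be sketched as a small function over the kernel's source text. This is an illustrative approximation, not the logic of `classify_kernel.py`: the substring heuristics, the op lists, and the name `classify_kernel_source` are all assumptions made for the example.

```python
# Assumption: these calls indicate a reduction over a tile.
REDUCTION_OPS = ("tl.sum(", "tl.max(", "tl.min(")
# Assumption: a reduction combined with one of these ops suggests normalization
# (softmax uses exp; LayerNorm and RMSNorm use sqrt or rsqrt).
NORMALIZATION_OPS = ("tl.exp(", "tl.sqrt(", "tl.rsqrt(")


def classify_kernel_source(source):
    """Apply the decision tree to kernel source text and return the kernel type."""
    if "tl.dot(" in source:
        return "dot-related"
    has_reduction = any(op in source for op in REDUCTION_OPS)
    has_normalization = any(op in source for op in NORMALIZATION_OPS)
    if has_reduction and has_normalization:
        return "norm-like"
    if not has_reduction:
        return "element-wise"  # point-wise only
    return "reduction"
```

String matching is deliberately crude here; a kernel that calls these ops through another namespace or a helper function would need a more careful analysis.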

Phase 3: Apply Transformations

Classify and apply optimizations in one step:

```bash
python scripts/classify_kernel.py --file kernel.py --apply-optimizations
```

The output JSON includes `optimized_code` and `changes_applied` fields.

Type-specific transformations:

Dot-related (highest priority):

- Convert `tl.load`/`tl.store` to TMA descriptors (MANDATORY). See `references/tma-conversion.md`.
- Add 2CTA configs (`num_ctas=2`) with SM oversubscription guard in pre-hook.
- Add occupancy (1, 2, 4) and extended num_stages (4, 6).
- Use larger block sizes (256x256, 256x128).

Norm-like (LayerNorm, Softmax, RMSNorm):

Add occupancy (2, 4), num_warps (4, 8). No TMA needed.

Element-wise (ReLU, GELU, Add, Mul, Exp):

Add occupancy (1, 2, 4, 16), num_stages (2, 3, 4). Include extreme configs for small inputs.

Reduction (Sum, Mean, Max):

Same strategy as norm-like: high occupancy (2, 4), num_warps (4, 8).
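The per-type values above can be expanded into a grid of candidate configurations. The sketch below builds plain dictionaries rather than `triton.Config` objects, because this document does not show how nvtriton expects `occupancy` to be passed; the names `SEARCH_SPACE` and `candidate_configs`, the `block` key, and the inclusion of `num_ctas=1` alongside `num_ctas=2` are assumptions made for illustration.

```python
from itertools import product

# Search spaces copied from the type-specific lists above.
SEARCH_SPACE = {
    "dot-related": {
        "occupancy": (1, 2, 4),
        "num_stages": (4, 6),
        "block": ((256, 256), (256, 128)),
        "num_ctas": (1, 2),  # assumption: keep 1-CTA variants next to 2CTA ones
    },
    "norm-like": {"occupancy": (2, 4), "num_warps": (4, 8)},
    "element-wise": {"occupancy": (1, 2, 4, 16), "num_stages": (2, 3, 4)},
    "reduction": {"occupancy": (2, 4), "num_warps": (4, 8)},
}


def candidate_configs(kernel_type, pre_hook=None):
    """Expand the search space for one kernel type into a list of dicts."""
    space = SEARCH_SPACE[kernel_type]
    keys = list(space)
    configs = []
    for values in product(*(space[key] for key in keys)):
        config = dict(zip(keys, values))
        config["pre_hook"] = pre_hook  # accepted even when unused
        configs.append(config)
    return configs
```

Each dictionary would still need to be translated into whatever config object the installed Triton package accepts.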

Gate TileIR-specific configs for sm_100+:

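```python
import torch

# get_baseline_configs() and get_tileir_specific_configs() are the kernel's
# own config helpers; they are not defined in this snippet.
def get_configs_with_gating(pre_hook=None):
    # Baseline configs are used on every GPU.
    configs = get_baseline_configs()
    # Compute capability major version 10 corresponds to sm_100.
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10:
        configs.extend(get_tileir_specific_configs(pre_hook))
    return configs
```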

See `references/config-templates.md` for complete config templates per kernel type.

Phase 4: TileIR Validation (ENABLE_TILE=1)

Use the kernel-triton-writing skill's `verify_kernel.py` to verify the optimized kernel with the TileIR backend:

```bash
python scripts/verify_kernel.py --kernel path/to/optimized_kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
```

Set `ENABLE_TILE=1` before running. Check: numerical correctness, no compilation errors, TMA/2CTA patterns compile successfully.

Phase 5: Benchmark

Use `triton.testing.do_bench()` (as documented in the perf-workload-profiling skill) to compare PTX (`ENABLE_TILE=0`) vs TileIR (`ENABLE_TILE=1`).

Benchmark across multiple input sizes (128, 1024, 8192) -- performance varies by size.
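A comparison across sizes might be organized as below. This is a sketch under two assumptions: that the backend is selected per process through `ENABLE_TILE`, so each backend is timed in its own run, and that the installed Triton's `do_bench` returns a single time in milliseconds (its default in recent releases). `bench_sizes`, `speedups`, and the `make_fn` callback are hypothetical names.

```python
def bench_sizes(make_fn, sizes=(128, 1024, 8192)):
    """Return {size: time in ms}, timing make_fn(size) with do_bench."""
    # Imported lazily so speedups() below works without Triton or a GPU.
    from triton.testing import do_bench

    return {n: do_bench(make_fn(n)) for n in sizes}


def speedups(ptx_ms, tileir_ms):
    """Return {size: PTX time / TileIR time}; values above 1.0 favor TileIR."""
    return {n: ptx_ms[n] / tileir_ms[n] for n in ptx_ms if n in tileir_ms}
```

Here `make_fn(n)` is expected to return a zero-argument callable that launches the kernel on an input of size `n`. Each process could write its `bench_sizes` result to a file, after which `speedups` combines the two results.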

Scripts

tileir_check.py

Check TileIR availability (nvtriton, ENABLE_TILE, Blackwell GPU):

```bash
python scripts/tileir_check.py
```

Returns JSON: `nvtriton_installed`, `tileir_active`, `blackwell_gpu`, `gpu_capability`, `recommendation`.
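To illustrate how these fields relate to each other, the sketch below assembles a status dictionary with the same field names. It is not the implementation of `tileir_check.py`: the function name `tileir_status`, the value types, the recommendation strings, and the rule that TileIR counts as active only when nvtriton, a Blackwell GPU, and `ENABLE_TILE=1` are all present are assumptions.

```python
import json
import os


def tileir_status(nvtriton_installed, capability, env=None):
    """Build a status dict using the field names reported by tileir_check.py."""
    env = os.environ if env is None else env
    # capability is a (major, minor) tuple, or None when no GPU is visible.
    blackwell = capability is not None and capability[0] >= 10
    # Assumption: all three conditions must hold for TileIR to be active.
    active = bool(nvtriton_installed and blackwell and env.get("ENABLE_TILE") == "1")
    if active:
        recommendation = "TileIR active: validate and benchmark with ENABLE_TILE=1"
    elif nvtriton_installed and blackwell:
        recommendation = "Set ENABLE_TILE=1 to activate TileIR"
    else:
        recommendation = "Add gated TileIR configs; skip ENABLE_TILE testing"
    return {
        "nvtriton_installed": bool(nvtriton_installed),
        "tileir_active": active,
        "blackwell_gpu": blackwell,
        "gpu_capability": list(capability) if capability else None,
        "recommendation": recommendation,
    }


if __name__ == "__main__":
    print(json.dumps(tileir_status(False, None), indent=2))
```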

classify_kernel.py

Classify kernel type and optionally apply TileIR optimizations:

```bash
# Classify only
python scripts/classify_kernel.py --file kernel.py

# Classify + apply optimizations
python scripts/classify_kernel.py --file kernel.py --apply-optimizations

# From inline code
python scripts/classify_kernel.py --code '<kernel_code>'
```

Returns JSON: `classification`, `confidence`, `indicators`, `recommendations`. With `--apply-optimizations`: adds `optimized_code` and `changes_applied`.
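A caller might consume that JSON as follows. This is a minimal sketch: `extract_optimized_kernel` is a hypothetical helper, the `_tileir.py` suffix follows the output file name used in the Output Format section, and treating `changes_applied` as a list is an assumption about the script's output.

```python
import json


def extract_optimized_kernel(result_json, kernel_name):
    """Return (output filename, optimized code, changes) from the classifier JSON."""
    result = json.loads(result_json)
    code = result.get("optimized_code")
    if code is None:
        # The field is only present when --apply-optimizations was passed.
        raise KeyError("optimized_code missing from classifier output")
    return f"{kernel_name}_tileir.py", code, result.get("changes_applied", [])
```

The returned code can then be written out with `pathlib.Path(filename).write_text(code)` and passed to the validation phase.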

Error Handling

Common Pitfalls

TMA descriptor errors (dot-related kernels):

- Always pass `pre_hook=tma_set_block_size_hook` to config generation -- without it, TMA descriptors keep dummy block sizes, causing runtime errors or wrong results.
- For GEMM: pass `b.T.contiguous()` in the wrapper and use `tl.dot(a, b.T, accumulator)` in the kernel. A transposition mismatch silently produces incorrect results.

2CTA oversubscription:

Adjust the SM count in the pre-hook when using `num_ctas=2`:

```python
if "NUM_SMS" in nargs and "NUM_CTAS" in nargs:
    nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]
```
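Both pitfalls can be handled in a single pre-hook. The sketch below shows one possible shape for `tma_set_block_size_hook`; the argument names (`a_desc`, `b_desc`, `c_desc`, `BLOCK_SIZE_M`, `BLOCK_SIZE_N`, `BLOCK_SIZE_K`) and the assumption that each descriptor exposes a writable `block_shape` attribute are illustrative, so match them to the kernel's actual signature and descriptor type.

```python
def tma_set_block_size_hook(nargs):
    """Pre-hook sketch: set TMA descriptor block shapes and guard the SM count."""
    block_m = nargs["BLOCK_SIZE_M"]
    block_n = nargs["BLOCK_SIZE_N"]
    block_k = nargs["BLOCK_SIZE_K"]
    # Replace the dummy block shapes the descriptors were created with.
    nargs["a_desc"].block_shape = [block_m, block_k]
    nargs["b_desc"].block_shape = [block_n, block_k]  # B is passed transposed
    nargs["c_desc"].block_shape = [block_m, block_n]
    # 2CTA oversubscription guard, as in the snippet above.
    if "NUM_SMS" in nargs and "NUM_CTAS" in nargs:
        nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]
```

The hook mutates `nargs` in place, which is why it has to run before every launch of a given config rather than once at descriptor creation.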

Config function signatures:

ALL config helper functions MUST accept `pre_hook=None`, even if unused. Without it:

TypeError: get_autotune_configs() takes 0 positional arguments
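The failure is ordinary Python argument checking, which the following self-contained sketch reproduces. The function bodies are placeholders that return empty lists, and `get_autotune_configs_broken` and `noop_hook` are hypothetical names used only for the demonstration.

```python
def get_autotune_configs_broken():
    """Takes no parameters, so passing a pre-hook fails."""
    return []


def get_autotune_configs(pre_hook=None):
    """Accepts pre_hook even though this placeholder ignores it."""
    return []


def noop_hook(nargs):
    """Stand-in for a real pre-hook."""


try:
    get_autotune_configs_broken(noop_hook)
except TypeError as exc:
    error_message = str(exc)  # the message names the function and argument count

configs = get_autotune_configs(noop_hook)  # works: the hook is accepted
```

Keeping the parameter on every helper lets the caller pass the hook uniformly without checking which helpers need it.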

Hardware gating:

Gate TileIR configs with `torch.cuda.get_device_capability()[0] >= 10`. TMA/2CTA on pre-Blackwell GPUs causes runtime crashes.

API availability:

Use `1.0 / (1.0 + tl.exp(-x))` instead of `tl.sigmoid(x)` -- `tl.sigmoid` is not available in all Triton versions, including some nvtriton builds.

Performance tuning:

- Do not over-tune `num_warps` -- TileIR ignores it. Focus on `occupancy`.
- Use larger block sizes (256x256, 256x128) for TileIR, not PTX-tuned small blocks.
- Benchmark across small/medium/large inputs; one-size configs underperform.

For exp/log heavy kernels, enable approximate math:

```bash
export TILEIR_ENABLE_APPROX=1
export TILEIR_ENABLE_FTZ=1
```

When to Abort

Stop and report if:

- No triton installed -- cannot proceed.
- Compatibility test fails -- kernel has syntax/runtime errors before optimization.
- TileIR validation fails -- optimized kernel produces wrong results.
- No speedup -- TileIR version is slower than PTX baseline (with nvtriton).
- Not Blackwell GPU -- still add configs for future deployment, but skip ENABLE_TILE testing and benchmarking.

Output Format

After optimization, return:

## TileIR Optimization: kernel_name

### Classification
- Kernel type: [dot-related | norm-like | element-wise | reduction]
- Strategy: [TMA + 2CTA | High occupancy | Occupancy + num_stages]

### Compatibility Check (ENABLE_TILE=0)
[PASSED | FAILED] — Max difference: X.Xe-Y

### Transformations Applied
- [List of transformations]

### TileIR Validation (ENABLE_TILE=1)
[PASSED | FAILED] — Max difference: X.Xe-Y

### Benchmark Comparison
| Backend | Time (ms) | Speedup |
|---------|-----------|---------|
| PTX (ENABLE_TILE=0) | X.XXX | 1.0x |
| TileIR (ENABLE_TILE=1) | X.XXX | Y.Yx |

### Output
File: kernel_name_tileir.py
