Loading...
Loading...
Optimize existing Triton kernels for NVIDIA TileIR backend on Blackwell GPUs (sm_100+). Adds TileIR-specific autotune configs: occupancy, num_ctas, TMA descriptors. Covers kernel classification (dot-related, norm-like, elementwise, reduction), type-specific transformations, and PTX-vs-TileIR benchmarking. Triggered by: "optimize for TileIR", "add TileIR configs", "Blackwell optimization", "TMA descriptors", "2CTA mode", "occupancy tuning". Kernels use standard `import triton`; TileIR activates via ENABLE_TILE=1 when nvtriton is installed.
npx skill4agent add nvidia/skills kernel-tileir-optimization| Parameter | PTX Backend | TileIR Backend |
|---|---|---|
| Strict directive | Ignored (compiler decides) |
| Strict directive | Cost hint (compiler optimizes) |
| Not available | Critical tuning param (1-32) |
| Limited | 2CTA mode for Blackwell |
| Block sizes | Smaller often better | Larger often better |
| TMA | Not available | Required for dot kernels |
num_warpsoccupancyimport triton| Package | Source | Use Case |
|---|---|---|
| PyTorch wheel | |
| OpenAI PyPI | Official Triton from triton-lang.org |
| nvtriton | Triton-to-tile-IR | TileIR backend for Blackwell |
ENABLE_TILE=1| Kernel Type | Speedup | Key Lever |
|---|---|---|
| Dot-Related (GEMM, Attention) | 1.2-2.0x | TMA + 2CTA |
| Norm-Like (LayerNorm, Softmax) | 2.0-5.0x | High occupancy |
| Element-Wise (ReLU, Add, Exp) | 1.5-3.0x | Occupancy + num_stages |
| Reduction (Sum, Mean, Max) | 1.8-4.0x | High occupancy |
python scripts/tileir_check.pyverify_kernel.pyENABLE_TILE=0python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'python scripts/classify_kernel.py --file kernel.pyContains tl.dot()?
YES --> dot-related: TMA + 2CTA + occupancy + larger blocks
NO --> Has reduction + normalization?
YES --> norm-like: high occupancy (2, 4) + num_warps (4, 8)
NO --> Point-wise only?
YES --> element-wise: occupancy (1-16) + num_stages (2-4)
NO --> reduction: high occupancy + num_warpspython scripts/classify_kernel.py --file kernel.py --apply-optimizationsoptimized_codechanges_appliedtl.loadtl.storereferences/tma-conversion.mdnum_ctas=2import torch
def get_configs_with_gating(pre_hook=None):
configs = get_baseline_configs()
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10:
configs.extend(get_tileir_specific_configs(pre_hook))
return configsreferences/config-templates.mdverify_kernel.pypython scripts/verify_kernel.py --kernel path/to/optimized_kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'ENABLE_TILE=1triton.testing.do_bench()ENABLE_TILE=0ENABLE_TILE=1python scripts/tileir_check.pynvtriton_installedtileir_activeblackwell_gpugpu_capabilityrecommendation# Classify only
python scripts/classify_kernel.py --file kernel.py
# Classify + apply optimizations
python scripts/classify_kernel.py --file kernel.py --apply-optimizations
# From inline code
python scripts/classify_kernel.py --code '<kernel_code>'classificationconfidenceindicatorsrecommendations--apply-optimizationsoptimized_codechanges_appliedpre_hook=tma_set_block_size_hookb.T.contiguous()tl.dot(a, b.T, accumulator)num_ctas=2if "NUM_SMS" in nargs and "NUM_CTAS" in nargs:
nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]pre_hook=NoneTypeError: get_autotune_configs() takes 0 positional argumentstorch.cuda.get_device_capability()[0] >= 101.0 / (1.0 + tl.exp(-x))tl.sigmoid(x)num_warpsoccupancyexport TILEIR_ENABLE_APPROX=1
export TILEIR_ENABLE_FTZ=1## TileIR Optimization: kernel_name
### Classification
- Kernel type: [dot-related | norm-like | element-wise | reduction]
- Strategy: [TMA + 2CTA | High occupancy | Occupancy + num_stages]
### Compatibility Check (ENABLE_TILE=0)
[PASSED | FAILED] — Max difference: X.Xe-Y
### Transformations Applied
- [List of transformations]
### TileIR Validation (ENABLE_TILE=1)
[PASSED | FAILED] — Max difference: X.Xe-Y
### Benchmark Comparison
| Backend | Time (ms) | Speedup |
|---------|-----------|---------|
| PTX (ENABLE_TILE=0) | X.XXX | 1.0x |
| TileIR (ENABLE_TILE=1) | X.XXX | Y.Yx |
### Output
File: kernel_name_tileir.py