kernel-tileir-optimization

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Triton TileIR Optimization

Triton TileIR优化

Optimize EXISTING Triton kernels for NVIDIA's TileIR backend on Blackwell GPUs. This skill does NOT write kernels from scratch -- that is the Triton Specialist's job.

针对Blackwell GPU优化现有Triton内核以适配NVIDIA的TileIR后端。本技能不会从零编写内核——那是Triton专家的工作。

Principles

原则

TileIR vs PTX Backend

TileIR与PTX后端对比

TileIR is NVIDIA's compiler backend for Triton that generates optimized CUDA code using CGA-level (Cooperative Grid Array) tile representations. Critical differences:

Parameter	PTX Backend	TileIR Backend
`num_warps`	Strict directive	Ignored (compiler decides)
`num_stages`	Strict directive	Cost hint (compiler optimizes)
`occupancy`	Not available	Critical tuning param (1-32)
`num_ctas`	Limited	2CTA mode for Blackwell
Block sizes	Smaller often better	Larger often better
TMA	Not available	Required for dot kernels

Key implication: Do not tune

num_warps

for TileIR -- focus on

occupancy

instead.

TileIR是NVIDIA为Triton打造的编译器后端，它利用CGA级（协作网格数组）tile表示生成优化后的CUDA代码。核心差异如下：

参数	PTX后端	TileIR后端
`num_warps`	严格指令	被忽略（由编译器决定）
`num_stages`	严格指令	成本提示（由编译器优化）
`occupancy`	不可用	关键调优参数（1-32）
`num_ctas`	受限	Blackwell支持2CTA模式
块大小	通常越小越好	通常越大越好
TMA	不可用	点积内核必需

核心结论：不要为TileIR调优

num_warps

——应专注于

occupancy

。

Triton Package Landscape

Triton包生态

Three packages share

import triton

Package	Source	Use Case
`pytorch-triton`	PyTorch wheel	`torch.compile` , standard kernels
`triton`	OpenAI PyPI	Official Triton from triton-lang.org
nvtriton	Triton-to-tile-IR	TileIR backend for Blackwell

Only one triton package should be installed at a time. "Converting to TileIR" means adding TileIR-specific configs, NOT changing imports. TileIR activates via

ENABLE_TILE=1

三个包均使用

import triton

：

包	来源	使用场景
`pytorch-triton`	PyTorch安装包	`torch.compile` 、标准内核
`triton`	OpenAI PyPI	triton-lang.org官方Triton
nvtriton	Triton-to-tile-IR	适配Blackwell的TileIR后端

同一时间应仅安装一个triton包。"转换为TileIR"指添加TileIR专属配置，而非修改导入语句。可通过

ENABLE_TILE=1

激活TileIR。

When TileIR Applies

TileIR适用场景

TileIR targets Blackwell (sm_100+). Without nvtriton or Blackwell hardware, the specialist still adds TileIR-optimized configs that standard triton safely ignores, enabling future deployment.

Expected speedups (with nvtriton on Blackwell):

Kernel Type	Speedup	Key Lever
Dot-Related (GEMM, Attention)	1.2-2.0x	TMA + 2CTA
Norm-Like (LayerNorm, Softmax)	2.0-5.0x	High occupancy
Element-Wise (ReLU, Add, Exp)	1.5-3.0x	Occupancy + num_stages
Reduction (Sum, Mean, Max)	1.8-4.0x	High occupancy

TileIR针对Blackwell（sm_100+）设计。即使没有nvtriton或Blackwell硬件，专家仍会添加TileIR优化配置，标准triton会自动忽略这些配置，为未来部署做好准备。

预期加速比（在Blackwell上使用nvtriton）：

内核类型	加速比	关键手段
点积相关（GEMM、注意力机制）	1.2-2.0倍	TMA + 2CTA
归一化类（LayerNorm、Softmax）	2.0-5.0倍	高occupancy
逐元素类（ReLU、Add、Exp）	1.5-3.0倍	Occupancy + num_stages
归约类（Sum、Mean、Max）	1.8-4.0倍	高occupancy

Workflow

工作流程

Five-phase workflow: compatibility, classify, transform, validate, benchmark.

分为五个阶段：兼容性测试、分类、转换、验证、基准测试。

Phase 1: Compatibility Test (ENABLE_TILE=0)

阶段1：兼容性测试（ENABLE_TILE=0）

Verify the kernel works in PTX mode before applying TileIR optimizations.

bash

python scripts/tileir_check.py

Then use the kernel-triton-writing skill's

verify_kernel.py

to verify with

ENABLE_TILE=0

bash

python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'

在应用TileIR优化前，先验证内核在PTX模式下可正常运行。

bash

python scripts/tileir_check.py

然后使用kernel-triton-writing技能的

verify_kernel.py

，在

ENABLE_TILE=0

的情况下验证：

bash

python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'

Phase 2: Classify Kernel

阶段2：内核分类

Determine kernel type to select the optimization strategy.

bash

python scripts/classify_kernel.py --file kernel.py

Classification decision tree:

Contains tl.dot()?
  YES --> dot-related: TMA + 2CTA + occupancy + larger blocks
  NO  --> Has reduction + normalization?
            YES --> norm-like: high occupancy (2, 4) + num_warps (4, 8)
            NO  --> Point-wise only?
                      YES --> element-wise: occupancy (1-16) + num_stages (2-4)
                      NO  --> reduction: high occupancy + num_warps

确定内核类型以选择优化策略。

bash

python scripts/classify_kernel.py --file kernel.py

分类决策树：

是否包含tl.dot()？
  是 --> 点积相关类：TMA + 2CTA + occupancy + 更大块大小
  否  --> 是否包含归约+归一化？
            是 --> 归一化类：高occupancy（2,4）+ num_warps（4,8）
            否  --> 是否仅为逐元素操作？
                      是 --> 逐元素类：occupancy（1-16）+ num_stages（2-4）
                      否  --> 归约类：高occupancy + num_warps

Phase 3: Apply Transformations

阶段3：应用转换

Classify and apply optimizations in one step:

bash

python scripts/classify_kernel.py --file kernel.py --apply-optimizations

Output JSON includes

optimized_code

and

changes_applied

fields.

Type-specific transformations:

Dot-related (highest priority):

Convert
```
tl.load
```
/
```
tl.store
```
to TMA descriptors (MANDATORY). See
```
references/tma-conversion.md
```
.
Add 2CTA configs (
```
num_ctas=2
```
) with SM oversubscription guard in pre-hook.
Add occupancy (1, 2, 4) and extended num_stages (4, 6).
Use larger block sizes (256x256, 256x128).

Norm-like (LayerNorm, Softmax, RMSNorm):

Add occupancy (2, 4), num_warps (4, 8). No TMA needed.

Element-wise (ReLU, GELU, Add, Mul, Exp):

Add occupancy (1, 2, 4, 16), num_stages (2, 3, 4). Include extreme configs for small inputs.

Reduction (Sum, Mean, Max):

Same strategy as norm-like: high occupancy (2, 4), num_warps (4, 8).

Gate TileIR-specific configs for sm_100+:

python

import torch

def get_configs_with_gating(pre_hook=None):
    configs = get_baseline_configs()
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10:
        configs.extend(get_tileir_specific_configs(pre_hook))
    return configs

See

references/config-templates.md

for complete config templates per kernel type.

一步完成分类与优化应用：

bash

python scripts/classify_kernel.py --file kernel.py --apply-optimizations

输出JSON包含

optimized_code

和

changes_applied

字段。

特定类型转换：

点积相关类（最高优先级）：

将
```
tl.load
```
/
```
tl.store
```
转换为TMA描述符（必填）。详见
```
references/tma-conversion.md
```
。
添加2CTA配置（
```
num_ctas=2
```
），并在预钩子中加入SM超订阅防护。
添加occupancy（1,2,4）和扩展的num_stages（4,6）。
使用更大的块大小（256x256、256x128）。

归一化类（LayerNorm、Softmax、RMSNorm）：

添加occupancy（2,4）、num_warps（4,8）。无需TMA。

逐元素类（ReLU、GELU、Add、Mul、Exp）：

添加occupancy（1,2,4,16）、num_stages（2,3,4）。为小输入场景添加极端配置。

归约类（Sum、Mean、Max）：

与归一化类策略相同：高occupancy（2,4）、num_warps（4,8）。

为sm_100+设备添加TileIR专属配置的门控：

python

import torch

def get_configs_with_gating(pre_hook=None):
    configs = get_baseline_configs()
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10:
        configs.extend(get_tileir_specific_configs(pre_hook))
    return configs

各内核类型的完整配置模板详见

references/config-templates.md

。

Phase 4: TileIR Validation (ENABLE_TILE=1)

阶段4：TileIR验证（ENABLE_TILE=1）

Use the kernel-triton-writing skill's

verify_kernel.py

to verify the optimized kernel with TileIR backend:

bash

python scripts/verify_kernel.py --kernel path/to/optimized_kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'

Set

ENABLE_TILE=1

before running. Check: numerical correctness, no compilation errors, TMA/2CTA patterns compile successfully.

使用kernel-triton-writing技能的

verify_kernel.py

验证优化后的内核在TileIR后端的运行情况：

bash

python scripts/verify_kernel.py --kernel path/to/optimized_kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'

运行前设置

ENABLE_TILE=1

。检查：数值正确性、无编译错误、TMA/2CTA模式编译成功。

Phase 5: Benchmark

阶段5：基准测试

Use

triton.testing.do_bench()

(as documented in the perf-workload-profiling skill) to compare PTX (

ENABLE_TILE=0

) vs TileIR (

ENABLE_TILE=1

Benchmark across multiple input sizes (128, 1024, 8192) -- performance varies by size.

使用

triton.testing.do_bench()

（详见perf-workload-profiling技能文档）对比PTX（

ENABLE_TILE=0

）与TileIR（

ENABLE_TILE=1

）的性能。

需在多种输入尺寸（128、1024、8192）下进行基准测试——性能会随尺寸变化。

Scripts

脚本

tileir_check.py

Check TileIR availability (nvtriton, ENABLE_TILE, Blackwell GPU):

bash

python scripts/tileir_check.py

Returns JSON:

nvtriton_installed

tileir_active

blackwell_gpu

gpu_capability

recommendation

检查TileIR可用性（nvtriton、ENABLE_TILE、Blackwell GPU）：

bash

python scripts/tileir_check.py

返回JSON：

nvtriton_installed

、

tileir_active

、

blackwell_gpu

、

gpu_capability

、

recommendation

。

classify_kernel.py

Classify kernel type and optionally apply TileIR optimizations:

bash

undefined

分类内核类型并可选应用TileIR优化：

bash

undefined

Classify only

仅分类

python scripts/classify_kernel.py --file kernel.py

Classify + apply optimizations

分类+应用优化

python scripts/classify_kernel.py --file kernel.py --apply-optimizations

From inline code

针对内嵌代码

python scripts/classify_kernel.py --code '<kernel_code>'


Returns JSON: `classification`, `confidence`, `indicators`, `recommendations`.
With `--apply-optimizations`: adds `optimized_code` and `changes_applied`.

python scripts/classify_kernel.py --code '<kernel_code>'


返回JSON：`classification`、`confidence`、`indicators`、`recommendations`。
使用`--apply-optimizations`时：新增`optimized_code`和`changes_applied`字段。

Error Handling

错误处理

Common Pitfalls

常见陷阱

TMA descriptor errors (dot-related kernels):

Always pass
```
pre_hook=tma_set_block_size_hook
```
to config generation -- without it, TMA descriptors keep dummy block sizes, causing runtime errors or wrong results.
For GEMM: pass
```
b.T.contiguous()
```
in wrapper and use
```
tl.dot(a, b.T, accumulator)
```
in kernel. Transposition mismatch produces incorrect results silently.

2CTA oversubscription:

Adjust SM count in pre-hook when using

num_ctas=2

python

if "NUM_SMS" in nargs and "NUM_CTAS" in nargs:
    nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]

Config function signatures:

ALL config helper functions MUST accept

pre_hook=None

, even if unused. Without it:

TypeError: get_autotune_configs() takes 0 positional arguments

Hardware gating:

Gate TileIR configs with
```
torch.cuda.get_device_capability()[0] >= 10
```
. TMA/2CTA on pre-Blackwell GPUs causes runtime crashes.

API availability:

Use
```
1.0 / (1.0 + tl.exp(-x))
```
instead of
```
tl.sigmoid(x)
```
-- not available in all Triton versions including some nvtriton builds.

Performance tuning:

Do not over-tune
```
num_warps
```
-- TileIR ignores it. Focus on
```
occupancy
```
.
Use larger block sizes (256x256, 256x128) for TileIR, not PTX-tuned small blocks.
Benchmark across small/medium/large inputs; one-size configs underperform.

For exp/log heavy kernels, enable approximate math:

bash

export TILEIR_ENABLE_APPROX=1
export TILEIR_ENABLE_FTZ=1

TMA描述符错误（点积相关内核）：

配置生成时务必传入
```
pre_hook=tma_set_block_size_hook
```
——若未传入，TMA描述符会保留虚拟块大小，导致运行时错误或结果异常。
对于GEMM：在包装器中传入
```
b.T.contiguous()
```
，并在内核中使用
```
tl.dot(a, b.T, accumulator)
```
。转置不匹配会静默产生错误结果。

2CTA超订阅：

使用

num_ctas=2

时，需在预钩子中调整SM数量：

python

if "NUM_SMS" in nargs and "NUM_CTAS" in nargs:
    nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]

配置函数签名：

所有配置辅助函数必须接受
```
pre_hook=None
```
参数，即使未使用。若未添加，会报错：
```
TypeError: get_autotune_configs() takes 0 positional arguments
```
。

硬件门控：

使用
```
torch.cuda.get_device_capability()[0] >= 10
```
为TileIR配置添加门控。在Blackwell之前的GPU上使用TMA/2CTA会导致运行时崩溃。

API可用性：

使用
```
1.0 / (1.0 + tl.exp(-x))
```
替代
```
tl.sigmoid(x)
```
——并非所有Triton版本（包括部分nvtriton构建）都支持后者。

性能调优：

不要过度调优
```
num_warps
```
——TileIR会忽略该参数。应专注于
```
occupancy
```
。
为TileIR使用更大的块大小（256x256、256x128），而非PTX调优的小块大小。
在小/中/大输入下均进行基准测试；单一尺寸配置的性能表现不佳。
对于大量使用exp/log的内核，启用近似数学运算：
bash
```
export TILEIR_ENABLE_APPROX=1
export TILEIR_ENABLE_FTZ=1
```

When to Abort

终止优化的场景

Stop and report if:

No triton installed -- cannot proceed.
Compatibility test fails -- kernel has syntax/runtime errors before optimization.
TileIR validation fails -- optimized kernel produces wrong results.
No speedup -- TileIR version is slower than PTX baseline (with nvtriton).
Not Blackwell GPU -- still add configs for future deployment, but skip ENABLE_TILE testing and benchmarking.

出现以下情况时，停止优化并上报：

未安装triton——无法继续。
兼容性测试失败——优化前内核存在语法/运行时错误。
TileIR验证失败——优化后的内核产生错误结果。
无性能提升——TileIR版本比PTX基线（使用nvtriton）更慢。
非Blackwell GPU——仍需添加配置以支持未来部署，但跳过ENABLE_TILE测试与基准测试。

Output Format

输出格式

After optimization, return:

undefined

优化完成后，返回如下格式：

undefined

TileIR Optimization: kernel_name

TileIR优化：kernel_name

Classification

分类结果

Kernel type: [dot-related | norm-like | element-wise | reduction]
Strategy: [TMA + 2CTA | High occupancy | Occupancy + num_stages]

内核类型：[点积相关 | 归一化类 | 逐元素类 | 归约类]
优化策略：[TMA + 2CTA | 高occupancy | Occupancy + num_stages]

Compatibility Check (ENABLE_TILE=0)

兼容性检查（ENABLE_TILE=0）

[PASSED | FAILED] — Max difference: X.Xe-Y

[通过 | 失败] —— 最大差值：X.Xe-Y

Transformations Applied

已应用的转换

[List of transformations]

[转换列表]

TileIR Validation (ENABLE_TILE=1)

TileIR验证（ENABLE_TILE=1）

[PASSED | FAILED] — Max difference: X.Xe-Y

[通过 | 失败] —— 最大差值：X.Xe-Y

Benchmark Comparison

基准测试对比

Backend	Time (ms)	Speedup
PTX (ENABLE_TILE=0)	X.XXX	1.0x
TileIR (ENABLE_TILE=1)	X.XXX	Y.Yx

后端	耗时（ms）	加速比
PTX（ENABLE_TILE=0）	X.XXX	1.0倍
TileIR（ENABLE_TILE=1）	X.XXX	Y.Y倍

Output

输出文件

File: kernel_name_tileir.py

undefined

文件：kernel_name_tileir.py

undefined