kernel-tileir-optimization

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Triton TileIR Optimization

Triton TileIR优化

Optimize EXISTING Triton kernels for NVIDIA's TileIR backend on Blackwell GPUs. This skill does NOT write kernels from scratch -- that is the Triton Specialist's job.
针对Blackwell GPU优化现有Triton内核以适配NVIDIA的TileIR后端。 本技能不会从零编写内核——那是Triton专家的工作。

Principles

原则

TileIR vs PTX Backend

TileIR与PTX后端对比

TileIR is NVIDIA's compiler backend for Triton that generates optimized CUDA code using CGA-level (Cooperative Grid Array) tile representations. Critical differences:
ParameterPTX BackendTileIR Backend
num_warps
Strict directiveIgnored (compiler decides)
num_stages
Strict directiveCost hint (compiler optimizes)
occupancy
Not availableCritical tuning param (1-32)
num_ctas
Limited2CTA mode for Blackwell
Block sizesSmaller often betterLarger often better
TMANot availableRequired for dot kernels
Key implication: Do not tune
num_warps
for TileIR -- focus on
occupancy
instead.
TileIR是NVIDIA为Triton打造的编译器后端,它利用CGA级(协作网格数组)tile表示生成优化后的CUDA代码。核心差异如下:
参数PTX后端TileIR后端
num_warps
严格指令被忽略(由编译器决定)
num_stages
严格指令成本提示(由编译器优化)
occupancy
不可用关键调优参数(1-32)
num_ctas
受限Blackwell支持2CTA模式
块大小通常越小越好通常越大越好
TMA不可用点积内核必需
核心结论:不要为TileIR调优
num_warps
——应专注于
occupancy

Triton Package Landscape

Triton包生态

Three packages share
import triton
:
PackageSourceUse Case
pytorch-triton
PyTorch wheel
torch.compile
, standard kernels
triton
OpenAI PyPIOfficial Triton from triton-lang.org
nvtritonTriton-to-tile-IRTileIR backend for Blackwell
Only one triton package should be installed at a time. "Converting to TileIR" means adding TileIR-specific configs, NOT changing imports. TileIR activates via
ENABLE_TILE=1
.
三个包均使用
import triton
来源使用场景
pytorch-triton
PyTorch安装包
torch.compile
、标准内核
triton
OpenAI PyPItriton-lang.org官方Triton
nvtritonTriton-to-tile-IR适配Blackwell的TileIR后端
同一时间应仅安装一个triton包。"转换为TileIR"指添加TileIR专属配置,而非修改导入语句。可通过
ENABLE_TILE=1
激活TileIR。

When TileIR Applies

TileIR适用场景

TileIR targets Blackwell (sm_100+). Without nvtriton or Blackwell hardware, the specialist still adds TileIR-optimized configs that standard triton safely ignores, enabling future deployment.
Expected speedups (with nvtriton on Blackwell):
Kernel TypeSpeedupKey Lever
Dot-Related (GEMM, Attention)1.2-2.0xTMA + 2CTA
Norm-Like (LayerNorm, Softmax)2.0-5.0xHigh occupancy
Element-Wise (ReLU, Add, Exp)1.5-3.0xOccupancy + num_stages
Reduction (Sum, Mean, Max)1.8-4.0xHigh occupancy
TileIR针对Blackwell(sm_100+)设计。即使没有nvtriton或Blackwell硬件,专家仍会添加TileIR优化配置,标准triton会自动忽略这些配置,为未来部署做好准备。
预期加速比(在Blackwell上使用nvtriton):
内核类型加速比关键手段
点积相关(GEMM、注意力机制)1.2-2.0倍TMA + 2CTA
归一化类(LayerNorm、Softmax)2.0-5.0倍高occupancy
逐元素类(ReLU、Add、Exp)1.5-3.0倍Occupancy + num_stages
归约类(Sum、Mean、Max)1.8-4.0倍高occupancy

Workflow

工作流程

Five-phase workflow: compatibility, classify, transform, validate, benchmark.
分为五个阶段:兼容性测试、分类、转换、验证、基准测试。

Phase 1: Compatibility Test (ENABLE_TILE=0)

阶段1:兼容性测试(ENABLE_TILE=0)

Verify the kernel works in PTX mode before applying TileIR optimizations.
bash
python scripts/tileir_check.py
Then use the kernel-triton-writing skill's
verify_kernel.py
to verify with
ENABLE_TILE=0
:
bash
python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
在应用TileIR优化前,先验证内核在PTX模式下可正常运行。
bash
python scripts/tileir_check.py
然后使用kernel-triton-writing技能的
verify_kernel.py
,在
ENABLE_TILE=0
的情况下验证:
bash
python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'

Phase 2: Classify Kernel

阶段2:内核分类

Determine kernel type to select the optimization strategy.
bash
python scripts/classify_kernel.py --file kernel.py
Classification decision tree:
Contains tl.dot()?
  YES --> dot-related: TMA + 2CTA + occupancy + larger blocks
  NO  --> Has reduction + normalization?
            YES --> norm-like: high occupancy (2, 4) + num_warps (4, 8)
            NO  --> Point-wise only?
                      YES --> element-wise: occupancy (1-16) + num_stages (2-4)
                      NO  --> reduction: high occupancy + num_warps
确定内核类型以选择优化策略。
bash
python scripts/classify_kernel.py --file kernel.py
分类决策树:
是否包含tl.dot()?
  是 --> 点积相关类:TMA + 2CTA + occupancy + 更大块大小
  否  --> 是否包含归约+归一化?
            是 --> 归一化类:高occupancy(2,4)+ num_warps(4,8)
            否  --> 是否仅为逐元素操作?
                      是 --> 逐元素类:occupancy(1-16)+ num_stages(2-4)
                      否  --> 归约类:高occupancy + num_warps

Phase 3: Apply Transformations

阶段3:应用转换

Classify and apply optimizations in one step:
bash
python scripts/classify_kernel.py --file kernel.py --apply-optimizations
Output JSON includes
optimized_code
and
changes_applied
fields.
Type-specific transformations:
Dot-related (highest priority):
  1. Convert
    tl.load
    /
    tl.store
    to TMA descriptors (MANDATORY). See
    references/tma-conversion.md
    .
  2. Add 2CTA configs (
    num_ctas=2
    ) with SM oversubscription guard in pre-hook.
  3. Add occupancy (1, 2, 4) and extended num_stages (4, 6).
  4. Use larger block sizes (256x256, 256x128).
Norm-like (LayerNorm, Softmax, RMSNorm):
  • Add occupancy (2, 4), num_warps (4, 8). No TMA needed.
Element-wise (ReLU, GELU, Add, Mul, Exp):
  • Add occupancy (1, 2, 4, 16), num_stages (2, 3, 4). Include extreme configs for small inputs.
Reduction (Sum, Mean, Max):
  • Same strategy as norm-like: high occupancy (2, 4), num_warps (4, 8).
Gate TileIR-specific configs for sm_100+:
python
import torch

def get_configs_with_gating(pre_hook=None):
    configs = get_baseline_configs()
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10:
        configs.extend(get_tileir_specific_configs(pre_hook))
    return configs
See
references/config-templates.md
for complete config templates per kernel type.
一步完成分类与优化应用:
bash
python scripts/classify_kernel.py --file kernel.py --apply-optimizations
输出JSON包含
optimized_code
changes_applied
字段。
特定类型转换:
点积相关类(最高优先级):
  1. tl.load
    /
    tl.store
    转换为TMA描述符(必填)。详见
    references/tma-conversion.md
  2. 添加2CTA配置(
    num_ctas=2
    ),并在预钩子中加入SM超订阅防护。
  3. 添加occupancy(1,2,4)和扩展的num_stages(4,6)。
  4. 使用更大的块大小(256x256、256x128)。
归一化类(LayerNorm、Softmax、RMSNorm):
  • 添加occupancy(2,4)、num_warps(4,8)。无需TMA。
逐元素类(ReLU、GELU、Add、Mul、Exp):
  • 添加occupancy(1,2,4,16)、num_stages(2,3,4)。为小输入场景添加极端配置。
归约类(Sum、Mean、Max):
  • 与归一化类策略相同:高occupancy(2,4)、num_warps(4,8)。
为sm_100+设备添加TileIR专属配置的门控:
python
import torch

def get_configs_with_gating(pre_hook=None):
    configs = get_baseline_configs()
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10:
        configs.extend(get_tileir_specific_configs(pre_hook))
    return configs
各内核类型的完整配置模板详见
references/config-templates.md

Phase 4: TileIR Validation (ENABLE_TILE=1)

阶段4:TileIR验证(ENABLE_TILE=1)

Use the kernel-triton-writing skill's
verify_kernel.py
to verify the optimized kernel with TileIR backend:
bash
python scripts/verify_kernel.py --kernel path/to/optimized_kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
Set
ENABLE_TILE=1
before running. Check: numerical correctness, no compilation errors, TMA/2CTA patterns compile successfully.
使用kernel-triton-writing技能的
verify_kernel.py
验证优化后的内核在TileIR后端的运行情况:
bash
python scripts/verify_kernel.py --kernel path/to/optimized_kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'
运行前设置
ENABLE_TILE=1
。检查:数值正确性、无编译错误、TMA/2CTA模式编译成功。

Phase 5: Benchmark

阶段5:基准测试

Use
triton.testing.do_bench()
(as documented in the perf-workload-profiling skill) to compare PTX (
ENABLE_TILE=0
) vs TileIR (
ENABLE_TILE=1
).
Benchmark across multiple input sizes (128, 1024, 8192) -- performance varies by size.
使用
triton.testing.do_bench()
(详见perf-workload-profiling技能文档)对比PTX(
ENABLE_TILE=0
)与TileIR(
ENABLE_TILE=1
)的性能。
需在多种输入尺寸(128、1024、8192)下进行基准测试——性能会随尺寸变化。

Scripts

脚本

tileir_check.py

tileir_check.py

Check TileIR availability (nvtriton, ENABLE_TILE, Blackwell GPU):
bash
python scripts/tileir_check.py
Returns JSON:
nvtriton_installed
,
tileir_active
,
blackwell_gpu
,
gpu_capability
,
recommendation
.
检查TileIR可用性(nvtriton、ENABLE_TILE、Blackwell GPU):
bash
python scripts/tileir_check.py
返回JSON:
nvtriton_installed
tileir_active
blackwell_gpu
gpu_capability
recommendation

classify_kernel.py

classify_kernel.py

Classify kernel type and optionally apply TileIR optimizations:
bash
undefined
分类内核类型并可选应用TileIR优化:
bash
undefined

Classify only

仅分类

python scripts/classify_kernel.py --file kernel.py
python scripts/classify_kernel.py --file kernel.py

Classify + apply optimizations

分类+应用优化

python scripts/classify_kernel.py --file kernel.py --apply-optimizations
python scripts/classify_kernel.py --file kernel.py --apply-optimizations

From inline code

针对内嵌代码

python scripts/classify_kernel.py --code '<kernel_code>'

Returns JSON: `classification`, `confidence`, `indicators`, `recommendations`.
With `--apply-optimizations`: adds `optimized_code` and `changes_applied`.
python scripts/classify_kernel.py --code '<kernel_code>'

返回JSON:`classification`、`confidence`、`indicators`、`recommendations`。
使用`--apply-optimizations`时:新增`optimized_code`和`changes_applied`字段。

Error Handling

错误处理

Common Pitfalls

常见陷阱

TMA descriptor errors (dot-related kernels):
  • Always pass
    pre_hook=tma_set_block_size_hook
    to config generation -- without it, TMA descriptors keep dummy block sizes, causing runtime errors or wrong results.
  • For GEMM: pass
    b.T.contiguous()
    in wrapper and use
    tl.dot(a, b.T, accumulator)
    in kernel. Transposition mismatch produces incorrect results silently.
2CTA oversubscription:
  • Adjust SM count in pre-hook when using
    num_ctas=2
    :
    python
    if "NUM_SMS" in nargs and "NUM_CTAS" in nargs:
        nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]
Config function signatures:
  • ALL config helper functions MUST accept
    pre_hook=None
    , even if unused. Without it:
    TypeError: get_autotune_configs() takes 0 positional arguments
    .
Hardware gating:
  • Gate TileIR configs with
    torch.cuda.get_device_capability()[0] >= 10
    . TMA/2CTA on pre-Blackwell GPUs causes runtime crashes.
API availability:
  • Use
    1.0 / (1.0 + tl.exp(-x))
    instead of
    tl.sigmoid(x)
    -- not available in all Triton versions including some nvtriton builds.
Performance tuning:
  • Do not over-tune
    num_warps
    -- TileIR ignores it. Focus on
    occupancy
    .
  • Use larger block sizes (256x256, 256x128) for TileIR, not PTX-tuned small blocks.
  • Benchmark across small/medium/large inputs; one-size configs underperform.
  • For exp/log heavy kernels, enable approximate math:
    bash
    export TILEIR_ENABLE_APPROX=1
    export TILEIR_ENABLE_FTZ=1
TMA描述符错误(点积相关内核):
  • 配置生成时务必传入
    pre_hook=tma_set_block_size_hook
    ——若未传入,TMA描述符会保留虚拟块大小,导致运行时错误或结果异常。
  • 对于GEMM:在包装器中传入
    b.T.contiguous()
    ,并在内核中使用
    tl.dot(a, b.T, accumulator)
    。转置不匹配会静默产生错误结果。
2CTA超订阅
  • 使用
    num_ctas=2
    时,需在预钩子中调整SM数量:
    python
    if "NUM_SMS" in nargs and "NUM_CTAS" in nargs:
        nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]
配置函数签名
  • 所有配置辅助函数必须接受
    pre_hook=None
    参数,即使未使用。 若未添加,会报错:
    TypeError: get_autotune_configs() takes 0 positional arguments
硬件门控
  • 使用
    torch.cuda.get_device_capability()[0] >= 10
    为TileIR配置添加门控。 在Blackwell之前的GPU上使用TMA/2CTA会导致运行时崩溃。
API可用性
  • 使用
    1.0 / (1.0 + tl.exp(-x))
    替代
    tl.sigmoid(x)
    ——并非所有Triton版本(包括部分nvtriton构建)都支持后者。
性能调优
  • 不要过度调优
    num_warps
    ——TileIR会忽略该参数。应专注于
    occupancy
  • 为TileIR使用更大的块大小(256x256、256x128),而非PTX调优的小块大小。
  • 在小/中/大输入下均进行基准测试;单一尺寸配置的性能表现不佳。
  • 对于大量使用exp/log的内核,启用近似数学运算:
    bash
    export TILEIR_ENABLE_APPROX=1
    export TILEIR_ENABLE_FTZ=1

When to Abort

终止优化的场景

Stop and report if:
  1. No triton installed -- cannot proceed.
  2. Compatibility test fails -- kernel has syntax/runtime errors before optimization.
  3. TileIR validation fails -- optimized kernel produces wrong results.
  4. No speedup -- TileIR version is slower than PTX baseline (with nvtriton).
  5. Not Blackwell GPU -- still add configs for future deployment, but skip ENABLE_TILE testing and benchmarking.
出现以下情况时,停止优化并上报:
  1. 未安装triton——无法继续。
  2. 兼容性测试失败——优化前内核存在语法/运行时错误。
  3. TileIR验证失败——优化后的内核产生错误结果。
  4. 无性能提升——TileIR版本比PTX基线(使用nvtriton)更慢。
  5. 非Blackwell GPU——仍需添加配置以支持未来部署,但跳过ENABLE_TILE测试与基准测试。

Output Format

输出格式

After optimization, return:
undefined
优化完成后,返回如下格式:
undefined

TileIR Optimization: kernel_name

TileIR优化:kernel_name

Classification

分类结果

  • Kernel type: [dot-related | norm-like | element-wise | reduction]
  • Strategy: [TMA + 2CTA | High occupancy | Occupancy + num_stages]
  • 内核类型:[点积相关 | 归一化类 | 逐元素类 | 归约类]
  • 优化策略:[TMA + 2CTA | 高occupancy | Occupancy + num_stages]

Compatibility Check (ENABLE_TILE=0)

兼容性检查(ENABLE_TILE=0)

[PASSED | FAILED] — Max difference: X.Xe-Y
[通过 | 失败] —— 最大差值:X.Xe-Y

Transformations Applied

已应用的转换

  • [List of transformations]
  • [转换列表]

TileIR Validation (ENABLE_TILE=1)

TileIR验证(ENABLE_TILE=1)

[PASSED | FAILED] — Max difference: X.Xe-Y
[通过 | 失败] —— 最大差值:X.Xe-Y

Benchmark Comparison

基准测试对比

BackendTime (ms)Speedup
PTX (ENABLE_TILE=0)X.XXX1.0x
TileIR (ENABLE_TILE=1)X.XXXY.Yx
后端耗时(ms)加速比
PTX(ENABLE_TILE=0)X.XXX1.0倍
TileIR(ENABLE_TILE=1)X.XXXY.Y倍

Output

输出文件

File: kernel_name_tileir.py
undefined
文件:kernel_name_tileir.py
undefined