kernel-tileir-optimization
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseTriton TileIR Optimization
Triton TileIR优化
Optimize EXISTING Triton kernels for NVIDIA's TileIR backend on Blackwell GPUs.
This skill does NOT write kernels from scratch -- that is the Triton Specialist's job.
针对Blackwell GPU优化现有Triton内核以适配NVIDIA的TileIR后端。
本技能不会从零编写内核——那是Triton专家的工作。
Principles
原则
TileIR vs PTX Backend
TileIR与PTX后端对比
TileIR is NVIDIA's compiler backend for Triton that generates optimized CUDA code
using CGA-level (Cooperative Grid Array) tile representations. Critical differences:
| Parameter | PTX Backend | TileIR Backend |
|---|---|---|
| Strict directive | Ignored (compiler decides) |
| Strict directive | Cost hint (compiler optimizes) |
| Not available | Critical tuning param (1-32) |
| Limited | 2CTA mode for Blackwell |
| Block sizes | Smaller often better | Larger often better |
| TMA | Not available | Required for dot kernels |
Key implication: Do not tune for TileIR -- focus on instead.
num_warpsoccupancyTileIR是NVIDIA为Triton打造的编译器后端,它利用CGA级(协作网格数组)tile表示生成优化后的CUDA代码。核心差异如下:
| 参数 | PTX后端 | TileIR后端 |
|---|---|---|
| 严格指令 | 被忽略(由编译器决定) |
| 严格指令 | 成本提示(由编译器优化) |
| 不可用 | 关键调优参数(1-32) |
| 受限 | Blackwell支持2CTA模式 |
| 块大小 | 通常越小越好 | 通常越大越好 |
| TMA | 不可用 | 点积内核必需 |
核心结论:不要为TileIR调优——应专注于。
num_warpsoccupancyTriton Package Landscape
Triton包生态
Three packages share :
import triton| Package | Source | Use Case |
|---|---|---|
| PyTorch wheel | |
| OpenAI PyPI | Official Triton from triton-lang.org |
| nvtriton | Triton-to-tile-IR | TileIR backend for Blackwell |
Only one triton package should be installed at a time. "Converting to TileIR" means
adding TileIR-specific configs, NOT changing imports. TileIR activates via .
ENABLE_TILE=1三个包均使用:
import triton| 包 | 来源 | 使用场景 |
|---|---|---|
| PyTorch安装包 | |
| OpenAI PyPI | triton-lang.org官方Triton |
| nvtriton | Triton-to-tile-IR | 适配Blackwell的TileIR后端 |
同一时间应仅安装一个triton包。"转换为TileIR"指添加TileIR专属配置,而非修改导入语句。可通过激活TileIR。
ENABLE_TILE=1When TileIR Applies
TileIR适用场景
TileIR targets Blackwell (sm_100+). Without nvtriton or Blackwell hardware, the
specialist still adds TileIR-optimized configs that standard triton safely ignores,
enabling future deployment.
Expected speedups (with nvtriton on Blackwell):
| Kernel Type | Speedup | Key Lever |
|---|---|---|
| Dot-Related (GEMM, Attention) | 1.2-2.0x | TMA + 2CTA |
| Norm-Like (LayerNorm, Softmax) | 2.0-5.0x | High occupancy |
| Element-Wise (ReLU, Add, Exp) | 1.5-3.0x | Occupancy + num_stages |
| Reduction (Sum, Mean, Max) | 1.8-4.0x | High occupancy |
TileIR针对Blackwell(sm_100+)设计。即使没有nvtriton或Blackwell硬件,专家仍会添加TileIR优化配置,标准triton会自动忽略这些配置,为未来部署做好准备。
预期加速比(在Blackwell上使用nvtriton):
| 内核类型 | 加速比 | 关键手段 |
|---|---|---|
| 点积相关(GEMM、注意力机制) | 1.2-2.0倍 | TMA + 2CTA |
| 归一化类(LayerNorm、Softmax) | 2.0-5.0倍 | 高occupancy |
| 逐元素类(ReLU、Add、Exp) | 1.5-3.0倍 | Occupancy + num_stages |
| 归约类(Sum、Mean、Max) | 1.8-4.0倍 | 高occupancy |
Workflow
工作流程
Five-phase workflow: compatibility, classify, transform, validate, benchmark.
分为五个阶段:兼容性测试、分类、转换、验证、基准测试。
Phase 1: Compatibility Test (ENABLE_TILE=0)
阶段1:兼容性测试(ENABLE_TILE=0)
Verify the kernel works in PTX mode before applying TileIR optimizations.
bash
python scripts/tileir_check.pyThen use the kernel-triton-writing skill's to verify with :
verify_kernel.pyENABLE_TILE=0bash
python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'在应用TileIR优化前,先验证内核在PTX模式下可正常运行。
bash
python scripts/tileir_check.py然后使用kernel-triton-writing技能的,在的情况下验证:
verify_kernel.pyENABLE_TILE=0bash
python scripts/verify_kernel.py --kernel path/to/kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'Phase 2: Classify Kernel
阶段2:内核分类
Determine kernel type to select the optimization strategy.
bash
python scripts/classify_kernel.py --file kernel.pyClassification decision tree:
Contains tl.dot()?
YES --> dot-related: TMA + 2CTA + occupancy + larger blocks
NO --> Has reduction + normalization?
YES --> norm-like: high occupancy (2, 4) + num_warps (4, 8)
NO --> Point-wise only?
YES --> element-wise: occupancy (1-16) + num_stages (2-4)
NO --> reduction: high occupancy + num_warps确定内核类型以选择优化策略。
bash
python scripts/classify_kernel.py --file kernel.py分类决策树:
是否包含tl.dot()?
是 --> 点积相关类:TMA + 2CTA + occupancy + 更大块大小
否 --> 是否包含归约+归一化?
是 --> 归一化类:高occupancy(2,4)+ num_warps(4,8)
否 --> 是否仅为逐元素操作?
是 --> 逐元素类:occupancy(1-16)+ num_stages(2-4)
否 --> 归约类:高occupancy + num_warpsPhase 3: Apply Transformations
阶段3:应用转换
Classify and apply optimizations in one step:
bash
python scripts/classify_kernel.py --file kernel.py --apply-optimizationsOutput JSON includes and fields.
optimized_codechanges_appliedType-specific transformations:
Dot-related (highest priority):
- Convert /
tl.loadto TMA descriptors (MANDATORY). Seetl.store.references/tma-conversion.md - Add 2CTA configs () with SM oversubscription guard in pre-hook.
num_ctas=2 - Add occupancy (1, 2, 4) and extended num_stages (4, 6).
- Use larger block sizes (256x256, 256x128).
Norm-like (LayerNorm, Softmax, RMSNorm):
- Add occupancy (2, 4), num_warps (4, 8). No TMA needed.
Element-wise (ReLU, GELU, Add, Mul, Exp):
- Add occupancy (1, 2, 4, 16), num_stages (2, 3, 4). Include extreme configs for small inputs.
Reduction (Sum, Mean, Max):
- Same strategy as norm-like: high occupancy (2, 4), num_warps (4, 8).
Gate TileIR-specific configs for sm_100+:
python
import torch
def get_configs_with_gating(pre_hook=None):
configs = get_baseline_configs()
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10:
configs.extend(get_tileir_specific_configs(pre_hook))
return configsSee for complete config templates per kernel type.
references/config-templates.md一步完成分类与优化应用:
bash
python scripts/classify_kernel.py --file kernel.py --apply-optimizations输出JSON包含和字段。
optimized_codechanges_applied特定类型转换:
点积相关类(最高优先级):
- 将/
tl.load转换为TMA描述符(必填)。详见tl.store。references/tma-conversion.md - 添加2CTA配置(),并在预钩子中加入SM超订阅防护。
num_ctas=2 - 添加occupancy(1,2,4)和扩展的num_stages(4,6)。
- 使用更大的块大小(256x256、256x128)。
归一化类(LayerNorm、Softmax、RMSNorm):
- 添加occupancy(2,4)、num_warps(4,8)。无需TMA。
逐元素类(ReLU、GELU、Add、Mul、Exp):
- 添加occupancy(1,2,4,16)、num_stages(2,3,4)。为小输入场景添加极端配置。
归约类(Sum、Mean、Max):
- 与归一化类策略相同:高occupancy(2,4)、num_warps(4,8)。
为sm_100+设备添加TileIR专属配置的门控:
python
import torch
def get_configs_with_gating(pre_hook=None):
configs = get_baseline_configs()
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10:
configs.extend(get_tileir_specific_configs(pre_hook))
return configs各内核类型的完整配置模板详见。
references/config-templates.mdPhase 4: TileIR Validation (ENABLE_TILE=1)
阶段4:TileIR验证(ENABLE_TILE=1)
Use the kernel-triton-writing skill's to verify the optimized kernel with TileIR backend:
verify_kernel.pybash
python scripts/verify_kernel.py --kernel path/to/optimized_kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'Set before running. Check: numerical correctness, no compilation errors,
TMA/2CTA patterns compile successfully.
ENABLE_TILE=1使用kernel-triton-writing技能的验证优化后的内核在TileIR后端的运行情况:
verify_kernel.pybash
python scripts/verify_kernel.py --kernel path/to/optimized_kernel.py --reference 'torch reference' --shapes '{"x": [32, 512, 4096]}' --dtypes '{"x": "bfloat16"}'运行前设置。检查:数值正确性、无编译错误、TMA/2CTA模式编译成功。
ENABLE_TILE=1Phase 5: Benchmark
阶段5:基准测试
Use (as documented in the perf-workload-profiling skill) to compare PTX () vs TileIR ().
triton.testing.do_bench()ENABLE_TILE=0ENABLE_TILE=1Benchmark across multiple input sizes (128, 1024, 8192) -- performance varies by size.
使用(详见perf-workload-profiling技能文档)对比PTX()与TileIR()的性能。
triton.testing.do_bench()ENABLE_TILE=0ENABLE_TILE=1需在多种输入尺寸(128、1024、8192)下进行基准测试——性能会随尺寸变化。
Scripts
脚本
tileir_check.py
tileir_check.py
Check TileIR availability (nvtriton, ENABLE_TILE, Blackwell GPU):
bash
python scripts/tileir_check.pyReturns JSON: , , , , .
nvtriton_installedtileir_activeblackwell_gpugpu_capabilityrecommendation检查TileIR可用性(nvtriton、ENABLE_TILE、Blackwell GPU):
bash
python scripts/tileir_check.py返回JSON:、、、、。
nvtriton_installedtileir_activeblackwell_gpugpu_capabilityrecommendationclassify_kernel.py
classify_kernel.py
Classify kernel type and optionally apply TileIR optimizations:
bash
undefined分类内核类型并可选应用TileIR优化:
bash
undefinedClassify only
仅分类
python scripts/classify_kernel.py --file kernel.py
python scripts/classify_kernel.py --file kernel.py
Classify + apply optimizations
分类+应用优化
python scripts/classify_kernel.py --file kernel.py --apply-optimizations
python scripts/classify_kernel.py --file kernel.py --apply-optimizations
From inline code
针对内嵌代码
python scripts/classify_kernel.py --code '<kernel_code>'
Returns JSON: `classification`, `confidence`, `indicators`, `recommendations`.
With `--apply-optimizations`: adds `optimized_code` and `changes_applied`.python scripts/classify_kernel.py --code '<kernel_code>'
返回JSON:`classification`、`confidence`、`indicators`、`recommendations`。
使用`--apply-optimizations`时:新增`optimized_code`和`changes_applied`字段。Error Handling
错误处理
Common Pitfalls
常见陷阱
TMA descriptor errors (dot-related kernels):
- Always pass to config generation -- without it, TMA descriptors keep dummy block sizes, causing runtime errors or wrong results.
pre_hook=tma_set_block_size_hook - For GEMM: pass in wrapper and use
b.T.contiguous()in kernel. Transposition mismatch produces incorrect results silently.tl.dot(a, b.T, accumulator)
2CTA oversubscription:
- Adjust SM count in pre-hook when using :
num_ctas=2pythonif "NUM_SMS" in nargs and "NUM_CTAS" in nargs: nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]
Config function signatures:
- ALL config helper functions MUST accept , even if unused. Without it:
pre_hook=None.TypeError: get_autotune_configs() takes 0 positional arguments
Hardware gating:
- Gate TileIR configs with . TMA/2CTA on pre-Blackwell GPUs causes runtime crashes.
torch.cuda.get_device_capability()[0] >= 10
API availability:
- Use instead of
1.0 / (1.0 + tl.exp(-x))-- not available in all Triton versions including some nvtriton builds.tl.sigmoid(x)
Performance tuning:
- Do not over-tune -- TileIR ignores it. Focus on
num_warps.occupancy - Use larger block sizes (256x256, 256x128) for TileIR, not PTX-tuned small blocks.
- Benchmark across small/medium/large inputs; one-size configs underperform.
- For exp/log heavy kernels, enable approximate math:
bash
export TILEIR_ENABLE_APPROX=1 export TILEIR_ENABLE_FTZ=1
TMA描述符错误(点积相关内核):
- 配置生成时务必传入——若未传入,TMA描述符会保留虚拟块大小,导致运行时错误或结果异常。
pre_hook=tma_set_block_size_hook - 对于GEMM:在包装器中传入,并在内核中使用
b.T.contiguous()。转置不匹配会静默产生错误结果。tl.dot(a, b.T, accumulator)
2CTA超订阅:
- 使用时,需在预钩子中调整SM数量:
num_ctas=2pythonif "NUM_SMS" in nargs and "NUM_CTAS" in nargs: nargs["NUM_SMS"] = nargs["NUM_SMS"] // nargs["NUM_CTAS"]
配置函数签名:
- 所有配置辅助函数必须接受参数,即使未使用。 若未添加,会报错:
pre_hook=None。TypeError: get_autotune_configs() takes 0 positional arguments
硬件门控:
- 使用为TileIR配置添加门控。 在Blackwell之前的GPU上使用TMA/2CTA会导致运行时崩溃。
torch.cuda.get_device_capability()[0] >= 10
API可用性:
- 使用替代
1.0 / (1.0 + tl.exp(-x))——并非所有Triton版本(包括部分nvtriton构建)都支持后者。tl.sigmoid(x)
性能调优:
- 不要过度调优——TileIR会忽略该参数。应专注于
num_warps。occupancy - 为TileIR使用更大的块大小(256x256、256x128),而非PTX调优的小块大小。
- 在小/中/大输入下均进行基准测试;单一尺寸配置的性能表现不佳。
- 对于大量使用exp/log的内核,启用近似数学运算:
bash
export TILEIR_ENABLE_APPROX=1 export TILEIR_ENABLE_FTZ=1
When to Abort
终止优化的场景
Stop and report if:
- No triton installed -- cannot proceed.
- Compatibility test fails -- kernel has syntax/runtime errors before optimization.
- TileIR validation fails -- optimized kernel produces wrong results.
- No speedup -- TileIR version is slower than PTX baseline (with nvtriton).
- Not Blackwell GPU -- still add configs for future deployment, but skip ENABLE_TILE testing and benchmarking.
出现以下情况时,停止优化并上报:
- 未安装triton——无法继续。
- 兼容性测试失败——优化前内核存在语法/运行时错误。
- TileIR验证失败——优化后的内核产生错误结果。
- 无性能提升——TileIR版本比PTX基线(使用nvtriton)更慢。
- 非Blackwell GPU——仍需添加配置以支持未来部署,但跳过ENABLE_TILE测试与基准测试。
Output Format
输出格式
After optimization, return:
undefined优化完成后,返回如下格式:
undefinedTileIR Optimization: kernel_name
TileIR优化:kernel_name
Classification
分类结果
- Kernel type: [dot-related | norm-like | element-wise | reduction]
- Strategy: [TMA + 2CTA | High occupancy | Occupancy + num_stages]
- 内核类型:[点积相关 | 归一化类 | 逐元素类 | 归约类]
- 优化策略:[TMA + 2CTA | 高occupancy | Occupancy + num_stages]
Compatibility Check (ENABLE_TILE=0)
兼容性检查(ENABLE_TILE=0)
[PASSED | FAILED] — Max difference: X.Xe-Y
[通过 | 失败] —— 最大差值:X.Xe-Y
Transformations Applied
已应用的转换
- [List of transformations]
- [转换列表]
TileIR Validation (ENABLE_TILE=1)
TileIR验证(ENABLE_TILE=1)
[PASSED | FAILED] — Max difference: X.Xe-Y
[通过 | 失败] —— 最大差值:X.Xe-Y
Benchmark Comparison
基准测试对比
| Backend | Time (ms) | Speedup |
|---|---|---|
| PTX (ENABLE_TILE=0) | X.XXX | 1.0x |
| TileIR (ENABLE_TILE=1) | X.XXX | Y.Yx |
| 后端 | 耗时(ms) | 加速比 |
|---|---|---|
| PTX(ENABLE_TILE=0) | X.XXX | 1.0倍 |
| TileIR(ENABLE_TILE=1) | X.XXX | Y.Y倍 |
Output
输出文件
File: kernel_name_tileir.py
undefined文件:kernel_name_tileir.py
undefined