triton-skill


Triton & Gluon Kernel Development


Source Code Locations


The Triton source lives under repos/triton/ in this skill's install directory. The actual path depends on the tool in use:
  • Cursor: ~/.cursor/skills/triton-skill/repos/triton/
  • Claude Code: ~/.claude/skills/triton-skill/repos/triton/
  • Codex: ~/.agents/skills/triton-skill/repos/triton/
TRITON_REPO: the examples below use ~/.cursor/skills/triton-skill/repos/triton/ as a placeholder; replace it with the actual path.
If that path does not exist, run bash update-repos.sh triton in the project directory.
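
A quick existence check before searching (a minimal sketch; adjust the prefix to match your tool as listed above):

TRITON_REPO="$HOME/.cursor/skills/triton-skill/repos/triton"
if [ ! -d "$TRITON_REPO" ]; then
  # Run from the project directory to (re)fetch the sources
  bash update-repos.sh triton
fi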

Triton Tutorials (beginner to advanced)

TRITON_REPO/python/tutorials/
├── 01-vector-add.py             # Triton basics: @triton.jit, program_id, load/store
├── 02-fused-softmax.py          # Kernel fusion, reductions, tl.max/tl.sum/tl.exp
├── 03-matrix-multiplication.py  # Block matmul, L2 cache reuse, @triton.autotune
├── 04-low-memory-dropout.py     # Parallel PRNG, tl.rand, seed-based dropout
├── 05-layer-norm.py             # Backward pass, atomic ops, tl.atomic_cas
├── 06-fused-attention.py        # Flash Attention v2, causal mask, FP8, warp specialization
├── 07-extern-functions.py       # Calling external libdevice functions
├── 08-grouped-gemm.py           # Grouped GEMM, TMA, tensor descriptors
├── 09-persistent-matmul.py      # Persistent kernels, TMA, warp specialization, FP8
├── 10-block-scaled-matmul.py    # FP4/FP8, MXFP4, tl.dot_scaled
└── 11-programmatic-dependent-launch.py  # PDL, gdc_wait, gdc_launch_dependents

Gluon Tutorials (low-level GPU programming)

TRITON_REPO/python/tutorials/gluon/
├── 01-intro.py                  # Gluon vs Triton, tile-based SPMD, @gluon.jit
├── 02-layouts.py                # BlockedLayout, size_per_thread, warps_per_cta
├── 03-async-copy.py             # cp.async, pipelining, shared memory
├── 04-tma.py                    # Tensor Memory Accelerator, tensor descriptors, mbarrier
├── 05-wgmma.py                  # Warp-group MMA, Hopper Tensor Cores, async MMA
├── 06-tcgen05.py                # Blackwell Tensor Cores, Tensor Memory, tcgen05_mma
├── 07-persistence.py            # Persistent kernels, work assignment, multi-stage pipelines
├── 08-warp-specialization.py    # Warp specialization, overlapping tasks
├── 09-tma-gather-scatter.py     # Native TMA gather/scatter (Blackwell)
├── 10-tcgen05-copy.py           # tcgen05_copy, shared → tensor memory
├── 11-tcgen05-mma-scaled.py     # tcgen05_mma_scaled, nvfp4/mxfp4/mxfp8
└── 12-cluster-launch-control.py # CLC, dynamic work distribution

Gluon Examples (complete implementations)

TRITON_REPO/python/examples/gluon/
└── 01-attention-forward.py      # Flash Attention forward (Blackwell)
                                 # Full producer/consumer pipeline, TMA, tcgen05_mma

Triton Kernels (production-grade reference implementations)

TRITON_REPO/python/triton_kernels/triton_kernels/
├── matmul.py                    # Matmul API (fused activation / MoE / ragged)
├── matmul_details/
│   ├── _matmul.py               # Dense GEMM kernel (TMA, mxfp4/8)
│   ├── _p_matmul.py             # Persistent GEMM kernel (ragged TMA)
│   └── _common.py               # Offset computation, XCD swizzle
├── reduce.py                    # Reduction kernel (mask/scale/mxfp/flexpoint)
├── topk.py                      # Top-K selection (forward/backward, bitmatrix)
├── swiglu.py                    # SwiGLU activation kernel
├── compaction.py                # Masked compaction kernel
├── numerics.py                  # FP8/MXFP numerics configuration
├── numerics_details/
│   ├── mxfp.py                  # MXFP quantization/dequantization
│   └── flexpoint.py             # Flexpoint scaling
├── tensor.py                    # Tensor/Layout abstractions (TMA descriptors)
├── tensor_details/
│   └── layout_details/          # Blackwell/Hopper/CDNA4 MX layouts
├── distributed.py               # Distributed MoE, SymmetricMemory
├── testing.py                   # Test utilities (assert_close, compute_sanitizer)
└── roofline.py                  # Roofline performance analysis

Triton Language Source

TRITON_REPO/python/triton/language/           # Definitions and semantics of tl.* ops
TRITON_REPO/python/triton/experimental/gluon/ # Definitions of gluon.* ops
TRITON_REPO/python/triton/runtime/            # JIT compilation, caching, interpreter
TRITON_REPO/python/triton/compiler/           # Code generation
TRITON_REPO/python/triton/tools/              # Tensor descriptor utilities

C++ Compiler (IR Definitions and Passes)

TRITON_REPO/include/triton/
├── Dialect/
│   ├── Triton/           # Triton IR dialect definitions (.td, .h)
│   ├── TritonGPU/        # TritonGPU dialect (layouts, encodings)
│   ├── TritonNvidiaGPU/  # NVIDIA-specific ops (wgmma, tma, tcgen05)
│   └── Gluon/            # Gluon dialect
├── Conversion/           # IR lowering passes (TritonGPU -> LLVM)
├── Analysis/             # Alias, Allocation, AxisInfo, Membar
└── Tools/                # Utilities

TRITON_REPO/lib/
├── Dialect/
│   ├── Triton/           # Triton op implementations, canonicalization
│   ├── TritonGPU/        # GPU layout optimization passes
│   ├── TritonNvidiaGPU/  # NVIDIA lowering
│   └── Gluon/            # Gluon op implementations
├── Conversion/           # Lowering pass implementations (TritonGPU -> LLVM IR)
├── Analysis/             # Analysis pass implementations
└── Target/               # Code generation targets

Search Strategy


Search with the Grep tool; do not load entire files. Determine the actual TRITON_REPO path first, then search with absolute paths.

Triton API Usage

# Set the path variable (replace with the actual path)
TRITON_REPO="$HOME/.cursor/skills/triton-skill/repos/triton"

# Find usages of tl.dot
rg "tl.dot" $TRITON_REPO/python/tutorials/

# Find autotune configuration examples
rg "@triton.autotune" $TRITON_REPO/python/tutorials/

# Find tensor descriptor creation
rg "TensorDescriptor" $TRITON_REPO/python/tutorials/

# Find the definitions of specific tl ops
rg "def (load|store|dot)" $TRITON_REPO/python/triton/language/

Gluon API Usage

# Find gluon.jit usage
rg "@gluon.jit" $TRITON_REPO/python/tutorials/gluon/

# Find wgmma usage
rg "wgmma" $TRITON_REPO/python/tutorials/gluon/05-wgmma.py

# Find tcgen05 usage (Blackwell)
rg "tcgen05" $TRITON_REPO/python/tutorials/gluon/

# Find TMA async-copy patterns
rg "async_copy" $TRITON_REPO/python/tutorials/gluon/

# Find mbarrier usage
rg "mbarrier" $TRITON_REPO/python/tutorials/gluon/

Compiler IR and Passes

# Find Triton IR op definitions (TableGen)
rg "def.*Op" $TRITON_REPO/include/triton/Dialect/Triton/IR/

# Find TritonGPU layout encodings
rg "Encoding" $TRITON_REPO/include/triton/Dialect/TritonGPU/IR/

# Find NVIDIA-specific ops (wgmma, tma)
rg "wgmma|tma|tcgen05" $TRITON_REPO/include/triton/Dialect/TritonNvidiaGPU/

# Find lowering pass implementations
rg "Pattern|Rewrite" $TRITON_REPO/lib/Conversion/TritonGPUToLLVM/

# Find Gluon dialect ops
rg "def.*Op" $TRITON_REPO/include/triton/Dialect/Gluon/

# Find a specific pass (e.g. coalesce, pipeline, prefetch)
rg "coalesce|pipeline|prefetch" $TRITON_REPO/lib/Dialect/TritonGPU/Transforms/

Production Kernel References

# Find TMA usage in the matmul kernels
rg "tma" $TRITON_REPO/python/triton_kernels/triton_kernels/matmul_details/

# Find the MXFP quantization implementation
rg "mxfp" $TRITON_REPO/python/triton_kernels/triton_kernels/numerics_details/

# Find the persistent-kernel pattern
rg "persistent" $TRITON_REPO/python/triton_kernels/triton_kernels/matmul_details/_p_matmul.py

# Find layout swizzling
rg "swizzle" $TRITON_REPO/python/triton_kernels/triton_kernels/tensor_details/layout_details/

When to Use Each Source


| Need | Source | Path |
| --- | --- | --- |
| Triton basic syntax and patterns | Tutorials 01-05 | python/tutorials/01-*.py ~ 05-*.py |
| Matmul optimization | Tutorials 03, 09, 10 | python/tutorials/03-*.py, 09-*.py, 10-*.py |
| Attention kernels | Tutorial 06, Gluon example | python/tutorials/06-*.py, python/examples/gluon/ |
| Gluon basics | Gluon tutorials 01-02 | python/tutorials/gluon/01-intro.py, 02-layouts.py |
| TMA async copy | Gluon tutorials 03-04 | python/tutorials/gluon/03-*.py, 04-*.py |
| WGMMA (Hopper) | Gluon tutorial 05 | python/tutorials/gluon/05-wgmma.py |
| tcgen05 (Blackwell) | Gluon tutorials 06, 10, 11 | python/tutorials/gluon/06-*.py, 10-*.py, 11-*.py |
| Persistent kernel patterns | Gluon tutorial 07 | python/tutorials/gluon/07-persistence.py |
| Warp specialization patterns | Gluon tutorial 08 | python/tutorials/gluon/08-warp-specialization.py |
| FP4/FP8/MXFP quantization | Tutorial 10, triton_kernels | python/tutorials/10-*.py, triton_kernels/numerics_details/ |
| Production-grade GEMM | triton_kernels matmul | triton_kernels/triton_kernels/matmul_details/ |
| MoE / ragged tensors | triton_kernels | triton_kernels/triton_kernels/distributed.py, tensor_details/ragged_tensor.py |
| Top-K / SwiGLU kernels | triton_kernels | triton_kernels/triton_kernels/topk.py, swiglu.py |
| Roofline analysis | triton_kernels | triton_kernels/triton_kernels/roofline.py |
| tl.* op semantics/signatures | Language source | python/triton/language/ |
| Layouts and swizzling | triton_kernels layouts | triton_kernels/triton_kernels/tensor_details/layout_details/ |
| Triton IR op definitions | include | include/triton/Dialect/Triton/IR/ |
| GPU layout encodings | include | include/triton/Dialect/TritonGPU/IR/ |
| NVIDIA ops (wgmma/tma) | include | include/triton/Dialect/TritonNvidiaGPU/ |
| Gluon dialect definitions | include | include/triton/Dialect/Gluon/ |
| Compiler pass implementations | lib | lib/Dialect/TritonGPU/Transforms/ |
| IR lowering (GPU -> LLVM) | lib | lib/Conversion/TritonGPUToLLVM/ |

Triton Kernel Writing Patterns

Basic pattern

import triton
import triton.language as tl

@triton.jit
def kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = x * 2  # compute
    tl.store(y_ptr + offsets, y, mask=mask)

# Launch
grid = lambda meta: (triton.cdiv(n, meta['BLOCK_SIZE']),)
kernel[grid](x, y, n, BLOCK_SIZE=1024)
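
An end-to-end invocation, for context. This is a minimal sketch; the tensor size and BLOCK_SIZE are illustrative values, not requirements:

import torch

n = 98432
x = torch.rand(n, device='cuda')
y = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(n, meta['BLOCK_SIZE']),)
kernel[grid](x, y, n, BLOCK_SIZE=1024)
assert torch.allclose(y, x * 2)  # the kernel doubled x into y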

Autotune pattern

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_stages=3, num_warps=8),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 256, 'BLOCK_K': 32}, num_stages=4, num_warps=4),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K, ...):
    ...

See python/tutorials/03-matrix-multiplication.py for a complete autotuned matmul example.
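
The launch grid can read the tuned meta-parameters, so no block size is hard-coded at the call site. A sketch, assuming a 1D grid over the (M, N) output tiles as in tutorial 03:

grid = lambda meta: (triton.cdiv(M, meta['BLOCK_M']) * triton.cdiv(N, meta['BLOCK_N']),)
matmul_kernel[grid](a, b, c, M, N, K)
# BLOCK_M/BLOCK_N/BLOCK_K are chosen by the autotuner, cached per (M, N, K) key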

Gluon basic pattern

from triton.experimental import gluon

@gluon.jit
def kernel(x: gluon.tensor[M, N, tl.float16]):
    # operate directly on tiles
    y = x + 1.0
    return y

See python/tutorials/gluon/01-intro.py for an introductory Gluon example.

Troubleshooting

| Problem | Likely cause | Where to look |
| --- | --- | --- |
| tl.dot gives wrong results | Input dtype mismatch (needs float16/bfloat16) | rg "tl.dot" tutorials/03-*.py |
| CUDA OOM | BLOCK_SIZE too large, too many num_stages | rg "num_stages" tutorials/09-*.py |
| autotune has no effect | key params do not track the dimensions that actually vary | rg "key=" tutorials/03-*.py |
| TMA descriptor errors | Tensor not contiguous, or shape mismatch | rg "TensorDescriptor" tutorials/gluon/04-*.py |
| wgmma precision issues | Needs a float32 accumulator | rg "accumulator" tutorials/gluon/05-wgmma.py |
| Low performance | No persistent kernel or warp specialization | tutorials/gluon/07-persistence.py, 08-warp-specialization.py |
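
The first and fifth rows share one fix: keep tl.dot inputs in float16/bfloat16 and accumulate in float32, downcasting only on store. A minimal single-tile sketch (the kernel name, pointer layout, and row-major strides are illustrative assumptions):

import triton
import triton.language as tl

@triton.jit
def dot_tile(a_ptr, b_ptr, c_ptr,
             BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a = tl.load(a_ptr + offs_m[:, None] * BLOCK_K + offs_k[None, :])  # float16 tile
    b = tl.load(b_ptr + offs_k[:, None] * BLOCK_N + offs_n[None, :])  # float16 tile
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)  # float32 accumulator
    acc = tl.dot(a, b, acc)
    tl.store(c_ptr + offs_m[:, None] * BLOCK_N + offs_n[None, :], acc.to(tl.float16))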

Triton Compilation and Debugging

# Print autotuning decisions (TRITON_PRINT_AUTOTUNING reports the selected configs)
TRITON_PRINT_AUTOTUNING=1 python your_script.py
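
To inspect the generated PTX itself, one option is to read the compiled kernel's asm dict; this sketch assumes a recent Triton where launching returns the compiled-kernel handle (the attribute layout can differ across versions):

compiled = kernel[grid](x, y, n, BLOCK_SIZE=1024)
print(compiled.asm["ptx"])  # other stages: "ttir", "ttgir", "llir"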

# Enable IR dumps
MLIR_ENABLE_DUMP=1 python your_script.py

# Use the compute-sanitizer test helper
from triton_kernels.testing import compute_sanitizer

Reference: python/triton_kernels/triton_kernels/testing.py

# Performance analysis
from triton_kernels.roofline import compute_roofline

Reference: python/triton_kernels/triton_kernels/roofline.py

Updating the Triton Source

# From the cursor-gpu-skills project directory
bash update-repos.sh triton

Additional References


  • Triton official documentation: https://triton-lang.org
  • Triton Language API: TRITON_REPO/python/triton/language/
  • Gluon experimental API: TRITON_REPO/python/triton/experimental/gluon/