triton-skill
Triton & Gluon Kernel Development
Source Code Locations
The Triton source lives under `repos/triton/` in this skill's install directory. The actual path depends on the tool:

- Cursor: `~/.cursor/skills/triton-skill/repos/triton/`
- Claude Code: `~/.claude/skills/triton-skill/repos/triton/`
- Codex: `~/.agents/skills/triton-skill/repos/triton/`

The examples below use `TRITON_REPO` as a placeholder; replace it with the actual path. If the path does not exist, run `bash update-repos.sh triton` in the project directory.
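A small POSIX-shell helper (hypothetical, not shipped with the skill) can probe the three install roots listed above and set `TRITON_REPO` automatically:

```shell
# Hypothetical helper: print the first BASE/skills/triton-skill/repos/triton
# that exists among the given base directories.
find_triton_repo() {
  for base in "$@"; do
    if [ -d "$base/skills/triton-skill/repos/triton" ]; then
      printf '%s\n' "$base/skills/triton-skill/repos/triton"
      return 0
    fi
  done
  return 1  # no checkout found
}

# Typical call across the Cursor, Claude Code, and Codex roots:
TRITON_REPO=$(find_triton_repo "$HOME/.cursor" "$HOME/.claude" "$HOME/.agents") \
  || echo "not found - run: bash update-repos.sh triton"
```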
Triton Tutorials (Beginner to Advanced)
```
TRITON_REPO/python/tutorials/
├── 01-vector-add.py                    # Triton basics: @triton.jit, program_id, load/store
├── 02-fused-softmax.py                 # Kernel fusion, reductions, tl.max/tl.sum/tl.exp
├── 03-matrix-multiplication.py         # Block matmul, L2 cache, @triton.autotune
├── 04-low-memory-dropout.py            # Parallel PRNG, tl.rand, seed-based dropout
├── 05-layer-norm.py                    # Backward pass, atomic ops, tl.atomic_cas
├── 06-fused-attention.py               # Flash Attention v2, causal mask, FP8, warp spec
├── 07-extern-functions.py              # libdevice external function calls
├── 08-grouped-gemm.py                  # Grouped GEMM, TMA, tensor descriptors
├── 09-persistent-matmul.py             # Persistent kernels, TMA, warp specialization, FP8
├── 10-block-scaled-matmul.py           # FP4/FP8, MXFP4, tl.dot_scaled
└── 11-programmatic-dependent-launch.py # PDL, gdc_wait, gdc_launch_dependents
```
Gluon Tutorials (Low-Level GPU Programming)
```
TRITON_REPO/python/tutorials/gluon/
├── 01-intro.py                  # Gluon vs Triton, tile-based SPMD, @gluon.jit
├── 02-layouts.py                # BlockedLayout, size_per_thread, warps_per_cta
├── 03-async-copy.py             # cp.async, pipelining, shared memory
├── 04-tma.py                    # Tensor Memory Accelerator, tensor descriptors, mbarrier
├── 05-wgmma.py                  # Warp-Group MMA, Hopper Tensor Cores, async MMA
├── 06-tcgen05.py                # Blackwell Tensor Cores, Tensor Memory, tcgen05_mma
├── 07-persistence.py            # Persistent kernels, work assignment, multi-stage pipelines
├── 08-warp-specialization.py    # Warp specialization, task overlap
├── 09-tma-gather-scatter.py     # Native TMA gather/scatter (Blackwell)
├── 10-tcgen05-copy.py           # tcgen05_copy, shared → tensor memory
├── 11-tcgen05-mma-scaled.py     # tcgen05_mma_scaled, nvfp4/mxfp4/mxfp8
└── 12-cluster-launch-control.py # CLC, dynamic work distribution
```
Gluon Examples (Complete Implementations)
```
TRITON_REPO/python/examples/gluon/
└── 01-attention-forward.py # Flash Attention forward (Blackwell)
                            # full producer/consumer pipeline, TMA, tcgen05_mma
```
Triton Kernels (Production-Grade Reference Implementations)
```
TRITON_REPO/python/triton_kernels/triton_kernels/
├── matmul.py              # Matmul API (fused activation / MoE / ragged)
├── matmul_details/
│   ├── _matmul.py         # Dense GEMM kernel (TMA, mxfp4/8)
│   ├── _p_matmul.py       # Persistent GEMM kernel (ragged TMA)
│   └── _common.py         # Offset computation, XCD swizzle
├── reduce.py              # Reduction kernel (mask/scale/mxfp/flexpoint)
├── topk.py                # Top-K selection (forward/backward, bitmatrix)
├── swiglu.py              # SwiGLU activation kernel
├── compaction.py          # Masked compaction kernel
├── numerics.py            # FP8/MXFP numerics configuration
├── numerics_details/
│   ├── mxfp.py            # MXFP quantization/dequantization
│   └── flexpoint.py       # Flexpoint scaling
├── tensor.py              # Tensor/Layout abstractions (TMA descriptors)
├── tensor_details/
│   └── layout_details/    # Blackwell/Hopper/CDNA4 MX layouts
├── distributed.py         # Distributed MoE, SymmetricMemory
├── testing.py             # Test utilities (assert_close, compute_sanitizer)
└── roofline.py            # Roofline performance analysis
```
Triton Language Source
```
TRITON_REPO/python/triton/language/           # Definitions and semantics of tl.* ops
TRITON_REPO/python/triton/experimental/gluon/ # Definitions of gluon.* ops
TRITON_REPO/python/triton/runtime/            # JIT compilation, caching, interpreter
TRITON_REPO/python/triton/compiler/           # Code generation
TRITON_REPO/python/triton/tools/              # Tensor descriptor utilities
```
C++ Compiler (IR Definitions and Passes)
```
TRITON_REPO/include/triton/
├── Dialect/
│   ├── Triton/          # Triton IR dialect definitions (.td, .h)
│   ├── TritonGPU/       # TritonGPU dialect (layouts, encodings)
│   ├── TritonNvidiaGPU/ # NVIDIA-specific ops (wgmma, tma, tcgen05)
│   └── Gluon/           # Gluon dialect
├── Conversion/          # IR lowering passes (TritonGPU -> LLVM)
├── Analysis/            # Alias, Allocation, AxisInfo, Membar
└── Tools/               # Utilities

TRITON_REPO/lib/
├── Dialect/
│   ├── Triton/          # Triton op implementations, canonicalization
│   ├── TritonGPU/       # GPU layout optimization passes
│   ├── TritonNvidiaGPU/ # NVIDIA lowering
│   └── Gluon/           # Gluon op implementations
├── Conversion/          # Lowering pass implementations (TritonGPU -> LLVM IR)
├── Analysis/            # Analysis pass implementations
└── Target/              # Codegen targets
```
Search Strategy
Search with the Grep tool; do not load entire files. First determine the actual TRITON_REPO path, then search with absolute paths.
Triton API Usage
```bash
# Set the path variable (replace with your actual path)
TRITON_REPO="$HOME/.cursor/skills/triton-skill/repos/triton"

# Find usages of tl.dot
rg "tl.dot" $TRITON_REPO/python/tutorials/

# Find autotune configuration examples
rg "@triton.autotune" $TRITON_REPO/python/tutorials/

# Find tensor descriptor creation
rg "TensorDescriptor" $TRITON_REPO/python/tutorials/

# Find the definitions of specific tl ops
rg "def (load|store|dot)" $TRITON_REPO/python/triton/language/
```
Gluon API Usage
```bash
# Find gluon.jit usage
rg "@gluon.jit" $TRITON_REPO/python/tutorials/gluon/

# Find wgmma usage
rg "wgmma" $TRITON_REPO/python/tutorials/gluon/05-wgmma.py

# Find tcgen05 usage (Blackwell)
rg "tcgen05" $TRITON_REPO/python/tutorials/gluon/

# Find TMA async-copy patterns
rg "async_copy" $TRITON_REPO/python/tutorials/gluon/

# Find mbarrier usage
rg "mbarrier" $TRITON_REPO/python/tutorials/gluon/
```
Compiler IR and Passes
```bash
# Find Triton IR op definitions (TableGen)
rg "def.*Op" $TRITON_REPO/include/triton/Dialect/Triton/IR/

# Find TritonGPU layout encodings
rg "Encoding" $TRITON_REPO/include/triton/Dialect/TritonGPU/IR/

# Find NVIDIA-specific ops (wgmma, tma)
rg "wgmma|tma|tcgen05" $TRITON_REPO/include/triton/Dialect/TritonNvidiaGPU/

# Find lowering pass implementations
rg "Pattern|Rewrite" $TRITON_REPO/lib/Conversion/TritonGPUToLLVM/

# Find Gluon dialect ops
rg "def.*Op" $TRITON_REPO/include/triton/Dialect/Gluon/

# Find a specific pass (e.g. coalesce, pipeline, prefetch)
rg "coalesce|pipeline|prefetch" $TRITON_REPO/lib/Dialect/TritonGPU/Transforms/
```
Production Kernel Reference
```bash
# Find TMA usage in the matmul kernels
rg "tma" $TRITON_REPO/python/triton_kernels/triton_kernels/matmul_details/

# Find the MXFP quantization implementation
rg "mxfp" $TRITON_REPO/python/triton_kernels/triton_kernels/numerics_details/

# Find the persistent-kernel pattern
rg "persistent" $TRITON_REPO/python/triton_kernels/triton_kernels/matmul_details/_p_matmul.py

# Find layout swizzling
rg "swizzle" $TRITON_REPO/python/triton_kernels/triton_kernels/tensor_details/layout_details/
```
When to Use Each Source
| Need | Source | Path |
|---|---|---|
| Triton basic syntax and patterns | Tutorials 01-05 | |
| Matmul optimization | Tutorials 03, 09, 10 | |
| Attention kernels | Tutorial 06, Gluon example | |
| Getting started with Gluon | Gluon tutorials 01-02 | |
| TMA async copy | Gluon tutorials 03-04 | |
| WGMMA (Hopper) | Gluon tutorial 05 | |
| tcgen05 (Blackwell) | Gluon tutorials 06, 10, 11 | |
| Persistent kernel pattern | Gluon tutorial 07 | |
| Warp specialization pattern | Gluon tutorial 08 | |
| FP4/FP8/MXFP quantization | Tutorial 10, triton_kernels | |
| Production-grade GEMM | triton_kernels matmul | |
| MoE / ragged tensors | triton_kernels | |
| Top-K / SwiGLU kernels | triton_kernels | |
| Roofline analysis | triton_kernels | |
| tl.* op semantics/signatures | Language source | |
| Layouts and swizzling | triton_kernels layouts | |
| Triton IR op definitions | include | |
| GPU layout encodings | include | |
| NVIDIA ops (wgmma/tma) | include | |
| Gluon dialect definitions | include | |
| Compilation pass implementations | lib | |
| IR lowering (GPU -> LLVM) | lib | |
Triton Kernel Writing Patterns
Basic Pattern
```python
import triton
import triton.language as tl

@triton.jit
def kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = x * 2  # compute
    tl.store(y_ptr + offsets, y, mask=mask)
```

Launch
```python
grid = lambda meta: (triton.cdiv(n, meta['BLOCK_SIZE']),)
kernel[grid](x, y, n, BLOCK_SIZE=1024)
```
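The grid lambda above relies on ceiling division (`triton.cdiv`) so that the final partial block is still launched; here is the same arithmetic sketched in plain Python, showing why the kernel's `mask` is required:

```python
def cdiv(n, block):
    """Ceiling division: number of blocks needed to cover n elements."""
    return (n + block - 1) // block

# 1000 elements with BLOCK_SIZE=256 need 4 program instances;
# the 4th covers offsets 768..1023, and mask = offsets < 1000 disables the tail.
assert cdiv(1000, 256) == 4
assert cdiv(1024, 256) == 4  # exact multiple: no tail block
```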
undefinedAutotune 模式
Autotune模式
```python
@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 256, 'BLOCK_K': 64}, num_stages=3, num_warps=8),
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 256, 'BLOCK_K': 32}, num_stages=4, num_warps=4),
    ],
    key=['M', 'N', 'K'],
)
@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K, ...):
    ...
```

See `python/tutorials/03-matrix-multiplication.py` for a complete autotuned matmul example.

Gluon Basic Pattern
```python
from triton.experimental import gluon

@gluon.jit
def kernel(x: gluon.tensor[M, N, tl.float16]):
    # operate directly on tiles
    y = x + 1.0
    return y
```

See `python/tutorials/gluon/01-intro.py` for an introductory Gluon example.

Troubleshooting
| Problem | Likely Cause | Where to Look |
|---|---|---|
| Wrong tl.dot results | Input dtype mismatch (needs float16/bfloat16) | |
| CUDA OOM | BLOCK_SIZE too large, too many num_stages | |
| Autotune has no effect | key does not match the dimensions that actually vary | |
| TMA descriptor errors | Tensor non-contiguous or dimension mismatch | |
| wgmma precision issues | Needs a float32 accumulator | |
| Low performance | Not using persistent kernels or warp specialization | |
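The "Autotune has no effect" row can be made concrete. Below is a pure-Python sketch of the caching idea (a hypothetical helper, not Triton's actual implementation): tuned configs are cached per `key` tuple, so any dimension missing from `key` never triggers re-tuning.

```python
# Hypothetical sketch of per-key config caching (not Triton's real code).
_cache = {}

def autotune_lookup(key_names, args, tune_fn):
    """Return the cached best config for this key tuple, tuning once per new tuple."""
    key = tuple(args[name] for name in key_names)
    if key not in _cache:
        _cache[key] = tune_fn(args)  # the expensive benchmark runs only here
    return _cache[key]

# If key omits 'K', kernels with very different K silently reuse one config:
best = autotune_lookup(['M', 'N'], {'M': 1024, 'N': 1024, 'K': 32}, lambda a: 'cfg_small_K')
same = autotune_lookup(['M', 'N'], {'M': 1024, 'N': 1024, 'K': 4096}, lambda a: 'cfg_large_K')
# same is 'cfg_small_K': the stale config was reused because K is not in key.
```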
Triton Compilation and Debugging
```bash
# Print autotuning results
TRITON_PRINT_AUTOTUNING=1 python your_script.py

# Enable IR dumps
MLIR_ENABLE_DUMP=1 python your_script.py
```

```python
# Triton's compute-sanitizer helper
# (see python/triton_kernels/triton_kernels/testing.py)
from triton_kernels.testing import compute_sanitizer

# Roofline performance analysis
# (see python/triton_kernels/triton_kernels/roofline.py)
from triton_kernels.roofline import compute_roofline
```
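For context on what a roofline analysis computes (a generic sketch with illustrative peak numbers, not the `roofline.py` API): arithmetic intensity is FLOPs per byte moved, and a kernel is memory-bound when its intensity falls below the machine's FLOP/s-to-bandwidth ratio.

```python
def matmul_roofline(M, N, K, bytes_per_elem=2, peak_flops=989e12, peak_bw=3.35e12):
    """Roofline estimate for an MxNxK GEMM. Peak numbers are illustrative (H100-class)."""
    flops = 2 * M * N * K                                    # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)   # read A and B, write C (ideal caching)
    intensity = flops / bytes_moved                          # FLOPs per byte
    attainable = min(peak_flops, intensity * peak_bw)        # the roofline itself
    return intensity, attainable

intensity, attainable = matmul_roofline(4096, 4096, 4096)
# large square fp16 GEMMs are compute-bound: intensity = 4096/3 ~ 1365 FLOP/B
```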
Updating the Triton Source
```bash
# In the cursor-gpu-skills project directory
bash update-repos.sh triton
```
Additional References
- Triton documentation: https://triton-lang.org
- Triton Language API: `TRITON_REPO/python/triton/language/`
- Gluon experimental API: `TRITON_REPO/python/triton/experimental/gluon/`