kernel-cute-writing
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseCuTe DSL
CuTe DSL
CuTe DSL is a Python-based domain-specific language for GPU kernel development,
part of CUTLASS 4.x. It provides Python abstractions over CUTLASS C++ templates
with JIT compilation to optimized CUDA kernels via MLIR and ptxas.
CuTe DSL是一款基于Python的GPU内核开发领域特定语言,属于CUTLASS 4.x的一部分。它通过MLIR和ptxas实现JIT编译,为CUTLASS C++模板提供Python抽象,生成优化的CUDA内核。
When to Use
使用场景
Triggers:
- Writing CUDA kernels in Python (element-wise, GEMM, custom ops)
- Optimizing GPU memory access patterns (vectorized loads, TMA, shared memory)
- Building tensor core (MMA) kernels for Ampere/Hopper/Blackwell
- Integrating custom GPU kernels with PyTorch or JAX
- Prototyping high-performance kernels without C++ metaprogramming
Symptoms (wrong tool otherwise):
- Need shared memory coordination or tensor core MMA → use CuTe DSL (not Triton for complex patterns)
- Need simple element-wise ops with no shared memory → CuTe DSL or Triton both work
- Need to call existing CUTLASS C++ kernels → use CUTLASS C++ APIs instead
- Need reductions, scans, or non-GEMM collective ops → consider CUB/Thrust
Keywords: cute, cutlass, cute.jit, cute.kernel, from_dlpack, zipped_divide,
TiledMMA, TiledCopy, TMA, WGMMA, tcgen05, pipeline, mbarrier
触发条件:
- 使用Python编写CUDA内核(逐元素、GEMM、自定义算子)
- 优化GPU内存访问模式(向量化加载、TMA、共享内存)
- 为Ampere/Hopper/Blackwell架构构建张量核心(MMA)内核
- 将自定义GPU内核与PyTorch或JAX集成
- 无需C++元编程即可快速原型化高性能内核
不适用场景:
- 需要共享内存协调或张量核心MMA → 使用CuTe DSL(复杂模式下不推荐Triton)
- 无需共享内存的简单逐元素操作 → CuTe DSL或Triton均可
- 需要调用现有CUTLASS C++内核 → 使用CUTLASS C++ API
- 需要归约、扫描或非GEMM集合操作 → 考虑使用CUB/Thrust
关键词: cute, cutlass, cute.jit, cute.kernel, from_dlpack, zipped_divide,
TiledMMA, TiledCopy, TMA, WGMMA, tcgen05, pipeline, mbarrier
Requirements
环境要求
| Requirement | Detail |
|---|---|
| Platform | Linux x86_64 only |
| Python | 3.10–3.13 |
| GPU | NVIDIA Ampere+ (SM80, SM90, SM100) |
| CUDA Driver | ≥ 575.51.03 (Toolkit 12.9 compat) |
| Install | |
| Optional | |
| 要求项 | 详细说明 |
|---|---|
| 平台 | 仅支持Linux x86_64 |
| Python版本 | 3.10–3.13 |
| GPU | NVIDIA Ampere及以上(SM80, SM90, SM100) |
| CUDA驱动 | ≥ 575.51.03(兼容Toolkit 12.9) |
| 安装方式 | |
| 可选依赖 | |
Workflows
工作流程
Workflow 0: Starting from Examples (Recommended)
流程0:从示例开始(推荐)
For any non-trivial kernel (GEMM, attention, pipelined, fused ops), start by
finding the most similar existing example to use as a starting point — study
its structure, then rework it for your use case. Do not copy examples verbatim;
they target specific dtypes, architectures, and problem shapes that likely differ.
-
Pick the closest example from the index below. Prefer examples matching the target GPU architecture (check with) when the operation is similar.
torch.cuda.get_device_capability()Fetch viawith base URLweb_fetchhttps://raw.githubusercontent.com/NVIDIA/cutlass/main/examples/python/CuTeDSLOperation Arch Example path (append to base URL) Element-wise add SM80 ampere/elementwise_add.pyElement-wise + autotune SM80 ampere/elementwise_add_autotune.pyElement-wise apply SM80 ampere/elementwise_apply.pySGEMM (scalar) SM80 ampere/sgemm.pyTensor-core GEMM SM80 ampere/tensorop_gemm.pyFlash Attention v2 SM80 ampere/flash_attention_v2.pyHSTU Attention SM80 ampere/hstu_attention.pyShared memory allocator SM80 ampere/smem_allocator.pyCTA norm (LayerNorm) SM90 hopper/cta_norm.pyDense GEMM SM90 hopper/dense_gemm.pyDense GEMM persistent SM90 hopper/dense_gemm_persistent.pyFlash MHA SM90 hopper/fmha.pyDense GEMM SM100 blackwell/dense_gemm.pyDense GEMM persistent SM100 blackwell/dense_gemm_persistent.pyDense GEMM + alpha/beta SM100 blackwell/dense_gemm_alpha_beta_persistent.pyRMSNorm SM100 blackwell/rmsnorm.pyReduce SM100 blackwell/reduce.pyFlash MHA SM100 blackwell/fmha.pyGrouped GEMM SM100 blackwell/grouped_gemm.pyMamba2 SSD SM100 blackwell/mamba2_ssd/GEMM tutorial (notebook) SM100 notebooks/tour_to_sol_gemm.ipynbExample: To fetch the Hopper dense GEMM:bashweb_fetch https://raw.githubusercontent.com/NVIDIA/cutlass/main/examples/python/CuTeDSL/hopper/dense_gemm.py -
Read reference materials first — before diving into example code, read the relevantdocs to understand the patterns and APIs:
references/- For GEMM: (3-level tiling, epilogue fusion,
references/patterns-gemm.mdwithcute.compile, shared memory layouts)mark_layout_dynamic - For reductions: (warp reductions,
references/patterns-reduction.mdcache pattern)cute.compile - For element-wise: (variations A–E)
references/patterns-elementwise.md - Always: (available APIs, arch-specific caveats)
references/api-arch.md
This gives you the conceptual foundation so you can rework the example intelligently rather than trying to copy-paste complex pipelines. - For GEMM:
-
Fetch and study the example source — read for structure, not to copy:
- Identify: decorators, tiling strategy, shared memory usage, mainloop flow
- Note which dtype/arch it targets (many examples are fp16/bf16-specific)
- Check if it uses APIs tied to a specific arch (TMA → SM90+, tcgen05 → SM100)
-
Rework for the user's workload (do not copy-paste):
- Change shapes, data types, tile sizes to match requirements
- Replace compute logic (epilogue, activation fusion) as needed
- If dtype differs (e.g., example is fp16, need fp32), expect vectorization
and layout changes — the scalar-loop patterns in may be a better starting point than adapting a vectorized example
references/ - Runtime wrapper must be lightweight: should only call
kernel_fn()+ the compiled kernel. Never allocate intermediate tensors, copy data, or re-compile per call — these belong in one-time setupfrom_dlpack() - Apply optimizations from this skill's reference docs
⛔ Blackwell/Hopper GEMM + extra tensors — STOP: If the target GPU is SM90+ (Hopper/Blackwell) and the GEMM requires extra tensors beyond A, B, C in the epilogue (e.g., bias vector, activation inputs), do not attempt it. These examples use TMA descriptors for all data movement — adding tensors requires modifying TMA descriptor setup, which is prohibitively complex. Instead, tell the user this limitation and suggest a two-kernel approach: run the GEMM kernel as-is, then apply bias + activation in a separate element-wise kernel (Workflow 1). Plain GEMM (just A×B→C with scalar alpha/beta) on Hopper/Blackwell is fine. -
Validate and benchmark using companion scripts:bash
python scripts/verify_kernel.py kernel.py --rtol 1e-3 --atol 1e-3 python scripts/benchmark_kernel.py kernel.pyThe kernel file must export,kernel_fn, andreference_fn.get_inputs()
When to skip examples: Pure element-wise operations (Workflow 1) have
complete patterns in — no need to fetch
external examples.
references/patterns-elementwise.mdReduction kernels (softmax, layernorm, RMSNorm): Use
which provides complete, proven patterns
for float32 reductions using scalar loops + butterfly shuffle + shared memory.
references/patterns-reduction.md对于任何非 trivial 的内核(GEMM、注意力机制、流水线、融合算子),先找到最相似的现有示例作为起点——研究其结构,再根据需求修改。不要直接复制示例,因为它们针对的特定数据类型、架构和问题形状可能与你的需求不同。
-
选择最接近的示例,从下方索引中挑选。 当操作类型相似时,优先选择与目标GPU架构匹配的示例(可通过查看)。
torch.cuda.get_device_capability()通过工具获取,基础URL为web_fetchhttps://raw.githubusercontent.com/NVIDIA/cutlass/main/examples/python/CuTeDSL操作类型 架构 示例路径(追加到基础URL后) 逐元素加法 SM80 ampere/elementwise_add.py逐元素+自动调优 SM80 ampere/elementwise_add_autotune.py逐元素应用 SM80 ampere/elementwise_apply.pySGEMM(标量) SM80 ampere/sgemm.py张量核心GEMM SM80 ampere/tensorop_gemm.pyFlash Attention v2 SM80 ampere/flash_attention_v2.pyHSTU注意力 SM80 ampere/hstu_attention.py共享内存分配器 SM80 ampere/smem_allocator.pyCTA归一化(LayerNorm) SM90 hopper/cta_norm.py稠密GEMM SM90 hopper/dense_gemm.py持久化稠密GEMM SM90 hopper/dense_gemm_persistent.pyFlash MHA SM90 hopper/fmha.py稠密GEMM SM100 blackwell/dense_gemm.py持久化稠密GEMM SM100 blackwell/dense_gemm_persistent.py带alpha/beta的稠密GEMM SM100 blackwell/dense_gemm_alpha_beta_persistent.pyRMSNorm SM100 blackwell/rmsnorm.py归约操作 SM100 blackwell/reduce.pyFlash MHA SM100 blackwell/fmha.py分组GEMM SM100 blackwell/grouped_gemm.pyMamba2 SSD SM100 blackwell/mamba2_ssd/GEMM教程(笔记本) SM100 notebooks/tour_to_sol_gemm.ipynb示例: 获取Hopper架构的稠密GEMM示例:bashweb_fetch https://raw.githubusercontent.com/NVIDIA/cutlass/main/examples/python/CuTeDSL/hopper/dense_gemm.py -
先阅读参考资料——在深入示例代码之前,阅读相关文档以理解模式和API:
references/- 针对GEMM:(三级分块、尾端融合、
references/patterns-gemm.md配合cute.compile、共享内存布局)mark_layout_dynamic - 针对归约操作:(归约 warp、
references/patterns-reduction.md缓存模式)cute.compile - 针对逐元素操作:(变体A–E)
references/patterns-elementwise.md - 必看:(可用API、架构特定注意事项)
references/api-arch.md
这能为你打下概念基础,让你可以智能地修改示例,而非尝试复制粘贴复杂流水线。 - 针对GEMM:
-
获取并研究示例源码——关注结构而非复制:
- 识别:装饰器、分块策略、共享内存使用、主循环流程
- 注意它针对的数据类型/架构(许多示例是fp16/bf16专用)
- 检查是否使用了特定架构的API(TMA → SM90+,tcgen05 → SM100)
-
根据用户工作负载修改(不要复制粘贴):
- 修改形状、数据类型、分块大小以匹配需求
- 根据需要替换计算逻辑(尾端、激活融合)
- 如果数据类型不同(例如示例是fp16,需要fp32),预期会有向量化和布局变化——中的标量循环模式可能比修改向量化示例更适合作为起点
references/ - 运行时包装器必须轻量化:应仅调用
kernel_fn()+ 编译后的内核。绝不要在每次调用时分配中间张量、复制数据或重新编译——这些操作应放在一次性初始化中from_dlpack() - 应用本技能参考文档中的优化方法
⛔ Blackwell/Hopper GEMM + 额外张量——注意: 如果目标GPU是SM90+(Hopper/Blackwell)且GEMM需要在尾端使用A、B、C之外的额外张量(例如偏置向量、激活输入),请勿尝试实现。这些示例使用TMA描述符处理所有数据移动——添加张量需要修改TMA描述符设置,这极其复杂。请告知用户此限制,并建议采用双内核方案:按原样运行GEMM内核,然后在单独的逐元素内核中应用偏置+激活(流程1)。Hopper/Blackwell上的纯GEMM(仅A×B→C带标量alpha/beta)是可行的。 -
使用配套脚本验证和基准测试:bash
python scripts/verify_kernel.py kernel.py --rtol 1e-3 --atol 1e-3 python scripts/benchmark_kernel.py kernel.py内核文件必须导出、kernel_fn和reference_fn。get_inputs()
何时跳过示例: 纯逐元素操作(流程1)在中有完整模式——无需获取外部示例。
references/patterns-elementwise.md归约内核(softmax、layernorm、RMSNorm):使用,其中提供了使用标量循环+蝶形洗牌+共享内存的完整、经过验证的float32归约模式。
references/patterns-reduction.mdWorkflow 1: Element-wise Kernel
流程1:逐元素内核
For unary/binary/in-place operations that map inputs to outputs 1:1.
- Determine kernel structure: inputs/outputs count, tensor rank, target arch
- Select pattern from (Variations A–E)
references/patterns-elementwise.md - Write kernel applying all four invariant principles:
- P1: for vector loads
from_dlpack(tensor, assumed_align=16) - P2: Derive from
vec_sizeelement_type.width - P3: for coalesced access
cute.zipped_divide(mA, tiler) - P4: for bounds
cutlass.dynamic_expr(thread_idx < total)
- P1:
- Critical rules: No early return, no (use
a * 2), noa + acute.math.sigmoid - Pre-compile with : Always pre-compile the kernel once using
cute.compile()so thatcute.compile()calls the compiled object, notkernel_fndirectly. Without pre-compilation, every call recompiles (~20-50ms overhead). Use@cute.jitso a single compiled kernel handles arbitrary input shapes without recompilation:.mark_layout_dynamic()python# Compile once with dynamic layouts — works for any shape fake_x = from_dlpack(torch.empty(1, 1, dtype=torch.float16, device="cuda"), assumed_align=16).mark_layout_dynamic() fake_out = from_dlpack(torch.empty(1, 1, dtype=torch.float16, device="cuda"), assumed_align=16).mark_layout_dynamic() compiled_kernel = cute.compile(host_fn, fake_x, fake_out) def kernel_fn(x): out = torch.empty_like(x) compiled_kernel(from_dlpack(x, assumed_align=16).mark_layout_dynamic(), from_dlpack(out, assumed_align=16).mark_layout_dynamic()) return out - Verify correctness using companion script:
The kernel file must exportbash
python scripts/verify_kernel.py kernel.py --rtol 1e-3 --atol 1e-3,kernel_fn, andreference_fn.get_inputs() - Benchmark using companion script:
bash
python scripts/benchmark_kernel.py kernel.py
用于输入与输出1:1映射的一元/二元/原地操作。
- 确定内核结构:输入/输出数量、张量秩、目标架构
- 从中选择模式(变体A–E)
references/patterns-elementwise.md - 编写内核并应用四项不变原则:
- P1:使用进行向量加载
from_dlpack(tensor, assumed_align=16) - P2:从推导
element_type.widthvec_size - P3:使用实现合并访问
cute.zipped_divide(mA, tiler) - P4:使用处理边界
cutlass.dynamic_expr(thread_idx < total)
- P1:使用
- 关键规则:禁止提前返回,禁止使用(改用
a * 2),禁止使用a + acute.math.sigmoid - 使用预编译:始终使用
cute.compile()预编译内核一次,使cute.compile()调用编译后的对象,而非直接调用kernel_fn。如果不预编译,每次调用都会重新编译(约20-50ms开销)。使用@cute.jit使单个编译后的内核可处理任意输入形状而无需重新编译:.mark_layout_dynamic()python# 编译一次并启用动态布局——适用于任意形状 fake_x = from_dlpack(torch.empty(1, 1, dtype=torch.float16, device="cuda"), assumed_align=16).mark_layout_dynamic() fake_out = from_dlpack(torch.empty(1, 1, dtype=torch.float16, device="cuda"), assumed_align=16).mark_layout_dynamic() compiled_kernel = cute.compile(host_fn, fake_x, fake_out) def kernel_fn(x): out = torch.empty_like(x) compiled_kernel(from_dlpack(x, assumed_align=16).mark_layout_dynamic(), from_dlpack(out, assumed_align=16).mark_layout_dynamic()) return out - 使用配套脚本验证正确性:
内核文件必须导出bash
python scripts/verify_kernel.py kernel.py --rtol 1e-3 --atol 1e-3、kernel_fn和reference_fn。get_inputs() - 使用配套脚本进行基准测试:
bash
python scripts/benchmark_kernel.py kernel.py
Workflow 2: GEMM Kernel
流程2:GEMM内核
For matrix multiplication with tiling, shared memory, and tensor cores.
- Define problem: shapes (M, N, K), data types, target architecture
- Choose tiling: CTA tile (bM, bN, bK), pipeline stages, cluster shape
- Three-level partitioning (see ):
references/patterns-gemm.md- Level 1: CTA tiling with
local_tile() - Level 2: Copy partitioning (global → shared) with
TiledCopy - Level 3: Compute partitioning (shared → register) with
TiledMMA
- Level 1: CTA tiling with
- Shared memory: Use swizzled layouts () to avoid bank conflicts
make_smem_layout_atom - Mainloop: K-tile loop with copy → sync → MMA → sync
- Pipeline: Use (Hopper) or
PipelineTmaAsync(Blackwell). ⚠️ TMA-based pipelines manage data movement via TMA descriptors — adding extra tensors (bias, activation inputs) to the epilogue requires modifying descriptor setup, which is prohibitively complex. See the stop condition in Workflow 0 step 4.PipelineTmaUmma - Epilogue: Predicated store with alpha/beta scaling
- Pre-compile with : Always pre-compile the GEMM kernel so
cute.compile()calls the compiled object, notkernel_fndirectly. Without pre-compilation, every call recompiles (~20-50ms overhead).@cute.jit - Autotune: Search over tile sizes, cluster shapes, pipeline depths
用于带分块、共享内存和张量核心的矩阵乘法。
- 定义问题:形状(M, N, K)、数据类型、目标架构
- 选择分块策略:CTA分块(bM, bN, bK)、流水线阶段、集群形状
- 三级分区(见):
references/patterns-gemm.md- 第一级:使用进行CTA分块
local_tile() - 第二级:使用进行复制分区(全局→共享)
TiledCopy - 第三级:使用进行计算分区(共享→寄存器)
TiledMMA
- 第一级:使用
- 共享内存:使用混洗布局()避免 bank 冲突
make_smem_layout_atom - 主循环:K分块循环,流程为复制→同步→MMA→同步
- 流水线:使用(Hopper)或
PipelineTmaAsync(Blackwell)。 ⚠️ 基于TMA的流水线通过TMA描述符管理数据移动——在尾端添加额外张量(偏置、激活输入)需要修改描述符设置,这极其复杂。请查看流程0步骤4中的注意事项。PipelineTmaUmma - 尾端:带alpha/beta缩放的谓词存储
- 使用预编译:始终预编译GEMM内核,使
cute.compile()调用编译后的对象,而非直接调用kernel_fn。如果不预编译,每次调用都会重新编译(约20-50ms开销)。@cute.jit - 自动调优:搜索分块大小、集群形状、流水线深度
Workflow 3: Framework Integration
流程3:框架集成
For wrapping CuTe DSL kernels as PyTorch/JAX custom operators.
- Write kernel using Workflow 1 or 2
- Create wrapper: Accept , convert via
torch.Tensor, call host fnfrom_dlpack - For production: Compile with TVM FFI for zero-overhead tensor passing:
python
compiled = cute.compile(host_fn, *fake_tensors, options="--enable-tvm-ffi") compiled(torch_a, torch_b) # Direct torch.Tensor, no from_dlpack - For deployment: Use AOT compilation → export to → load at runtime
.o
用于将CuTe DSL内核包装为PyTorch/JAX自定义算子。
- 使用流程1或2编写内核
- 创建包装器:接收,通过
torch.Tensor转换,调用宿主函数from_dlpack - 生产环境:使用TVM FFI编译以实现零开销张量传递:
python
compiled = cute.compile(host_fn, *fake_tensors, options="--enable-tvm-ffi") compiled(torch_a, torch_b) # 直接传入torch.Tensor,无需from_dlpack - 部署:使用AOT编译→导出为文件→运行时加载
.o
Workflow 4: Debugging & Profiling
流程4:调试与性能分析
- Set environment: ,
CUTE_DSL_PRINT_IR=1CUTE_DSL_KEEP_PTX=1 - Use for runtime values (not Python
cute.printf())print - Inspect generated code: ,
compiled.__ptx__compiled.__mlir__ - Profile: Enable , use Nsight Compute/Systems
CUTE_DSL_LINEINFO=1 - Debug memory: Run with
compute-sanitizer python script.py
- 设置环境变量:,
CUTE_DSL_PRINT_IR=1CUTE_DSL_KEEP_PTX=1 - **使用**输出运行时值(不要使用Python的
cute.printf())print - 检查生成的代码:,
compiled.__ptx__compiled.__mlir__ - 性能分析:启用,使用Nsight Compute/Systems
CUTE_DSL_LINEINFO=1 - 内存调试:运行
compute-sanitizer python script.py
Output Formats
输出格式
A typical CuTe DSL kernel project:
kernel_dir/
kernel.py # @cute.kernel + @cute.jit functions
test_kernel.py # Correctness test vs PyTorch reference
bench_kernel.py # Benchmark with cute.compile() setupSuccess indicators:
- Correctness test passes ()
torch.testing.assert_close - Nsight shows vector loads (LDG.128/LDG.256), not scalar loads
- For GEMM: tensor core utilization > 80% in Nsight Compute
典型的CuTe DSL内核项目结构:
kernel_dir/
kernel.py # @cute.kernel + @cute.jit函数
test_kernel.py # 与PyTorch参考实现对比的正确性测试
bench_kernel.py # 基于cute.compile()的基准测试设置成功指标:
- 正确性测试通过()
torch.testing.assert_close - Nsight显示向量加载(LDG.128/LDG.256),而非标量加载
- 对于GEMM:Nsight Compute中张量核心利用率>80%
Companion Script Contract
配套脚本约定
Kernel files used with and
must export three names:
scripts/verify_kernel.pyscripts/benchmark_kernel.py- — the CuTe DSL kernel wrapper (calls
kernel_fn(*inputs)+ runs kernel)cute.compile - — PyTorch reference implementation (same signature)
reference_fn(*inputs) - — returns a list of CUDA tensors for testing
get_inputs()
python
undefined与和配合使用的内核文件必须导出三个名称:
scripts/verify_kernel.pyscripts/benchmark_kernel.py- — CuTe DSL内核包装器(调用
kernel_fn(*inputs)并运行内核)cute.compile - — PyTorch参考实现(签名一致)
reference_fn(*inputs) - — 返回用于测试的CUDA张量列表
get_inputs()
python
undefinedExample kernel.py contract
示例kernel.py约定
import torch
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack
def kernel_fn(x):
out = torch.empty_like(x)
# ... call compiled cute kernel ...
return out
def reference_fn(x):
return torch.nn.functional.gelu(x)
def get_inputs():
return [torch.randn(1024, 512, dtype=torch.float16, device="cuda")]
undefinedimport torch
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack
def kernel_fn(x):
out = torch.empty_like(x)
# ... 调用编译后的cute内核 ...
return out
def reference_fn(x):
return torch.nn.functional.gelu(x)
def get_inputs():
return [torch.randn(1024, 512, dtype=torch.float16, device="cuda")]
undefinedExamples
示例
Example: 2D Unary Element-wise (ReLU)
示例:2D一元逐元素(ReLU)
python
import torch, cutlass, cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack
@cute.kernel
def relu_kernel(gA: cute.Tensor, gC: cute.Tensor):
tidx, _, _ = cute.arch.thread_idx()
bidx, _, _ = cute.arch.block_idx()
bdim, _, _ = cute.arch.block_dim()
idx = bidx * bdim + tidx
m, n = gA.shape[1]
total = m * n
if cutlass.dynamic_expr(idx < total):
a = gA[(None, (idx // n, idx % n))].load()
gC[(None, (idx // n, idx % n))] = cute.where(a > 0, a, 0)
@cute.jit
def relu_host(mA: cute.Tensor, mC: cute.Tensor):
vec = 16 // (mA.element_type.width // 8)
gA = cute.zipped_divide(mA, (1, vec))
gC = cute.zipped_divide(mC, (1, vec))
T = 256
N = cute.size(gA.shape[1])
relu_kernel(gA, gC).launch(grid=((N+T-1)//T,1,1), block=(T,1,1))
x = torch.randn(1024, 512, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)
relu_host(from_dlpack(x, assumed_align=16), from_dlpack(out, assumed_align=16))python
import torch, cutlass, cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack
@cute.kernel
def relu_kernel(gA: cute.Tensor, gC: cute.Tensor):
tidx, _, _ = cute.arch.thread_idx()
bidx, _, _ = cute.arch.block_idx()
bdim, _, _ = cute.arch.block_dim()
idx = bidx * bdim + tidx
m, n = gA.shape[1]
total = m * n
if cutlass.dynamic_expr(idx < total):
a = gA[(None, (idx // n, idx % n))].load()
gC[(None, (idx // n, idx % n))] = cute.where(a > 0, a, 0)
@cute.jit
def relu_host(mA: cute.Tensor, mC: cute.Tensor):
vec = 16 // (mA.element_type.width // 8)
gA = cute.zipped_divide(mA, (1, vec))
gC = cute.zipped_divide(mC, (1, vec))
T = 256
N = cute.size(gA.shape[1])
relu_kernel(gA, gC).launch(grid=((N+T-1)//T,1,1), block=(T,1,1))
x = torch.randn(1024, 512, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)
relu_host(from_dlpack(x, assumed_align=16), from_dlpack(out, assumed_align=16))Error Handling
错误处理
| Error | Cause | Fix |
|---|---|---|
| Called @kernel from Python | Launch via @cute.jit host function |
| Early return in @kernel | Use |
| Type mismatch on store | | Use |
| Kernel in | Write to file and import |
| Scalar loads in Nsight | Missing alignment hint | Add |
| Not all @jit params passed | Pass ALL declared parameters |
| No | Use |
See for the full error table and limitations.
references/troubleshooting.mdDebugging rule: Never delete kernel.py during debugging. Use
to save a checkpoint, then to iterate. If stuck, to
restore the backup. A partially-working kernel is always better than no kernel.
backup_fileedit_filerevert_file| 错误信息 | 原因 | 修复方法 |
|---|---|---|
| 从Python直接调用@kernel | 通过@cute.jit宿主函数启动 |
| @kernel中存在提前返回 | 使用 |
| 存储时类型不匹配 | | 使用 |
| 内核位于 | 写入文件后再导入 |
| Nsight显示标量加载 | 缺少对齐提示 | 在 |
| 未传递所有@jit参数 | 传递所有声明的参数 |
| 不存在 | 使用 |
完整的错误表和限制请查看。
references/troubleshooting.md调试规则: 调试期间绝不要删除kernel.py。使用保存检查点,然后使用迭代修改。如果遇到瓶颈,使用恢复备份。一个部分可用的内核总比没有内核好。
backup_fileedit_filerevert_fileFinding More Information
获取更多信息
Tier 1: This File (SKILL.md)
一级:本文件(SKILL.md)
Workflows above cover element-wise kernels, GEMM, framework integration, and
debugging. Search this file first for procedural questions.
上述工作流程涵盖了逐元素内核、GEMM、框架集成和调试。程序性问题请首先搜索本文件。
Tier 2: references/ Directory
二级:references/目录
Grep for keywords across . Headers are grep-friendly.
references/| File | Content |
|---|---|
| Core abstractions, terminology, compilation pipeline |
| Layout algebra: composition, complement, divide, swizzle |
| Tensor types, partitioning, tiling, predication |
| MMA atoms, TiledMMA, per-architecture tensor core ops |
| Installation, decorators, first kernel walkthrough |
| Invariant principles, pattern variations, reference impl |
| 3-level tiling, shared memory, pipelining, autotuning |
| from_dlpack, TMA, cp.async, TMEM, copy atoms |
| Control flow, JIT caching, TVM FFI, AOT compilation |
| Producer-consumer, pipeline classes, barriers, warp specialization |
| cute module: layouts, tensors, math, copy, gemm, printing |
| cute.arch: thread indexing, sync, atomics, memory ops |
| cute.nvgpu: warp/warpgroup/cpasync/tcgen05 MMA and copy |
| Runtime: from_dlpack, fake tensors, utils, schedulers |
| Debugging, env vars, common errors, limitations, FAQ |
How to search: Grep for your keyword across . Read only the
file and section that Grep points to.
references/在目录中搜索关键词。文件标题便于搜索。
references/| 文件 | 内容 |
|---|---|
| 核心抽象、术语、编译流水线 |
| 布局代数:组合、补集、划分、混洗 |
| 张量类型、分区、分块、谓词 |
| MMA原子、TiledMMA、各架构张量核心操作 |
| 安装、装饰器、第一个内核入门指南 |
| 不变原则、模式变体、参考实现 |
| 三级分块、共享内存、流水线、自动调优 |
| from_dlpack、TMA、cp.async、TMEM、复制原子 |
| 控制流、JIT缓存、TVM FFI、AOT编译 |
| 生产者-消费者、流水线类、屏障、Warp特化 |
| cute模块:布局、张量、数学、复制、gemm、打印 |
| cute.arch:线程索引、同步、原子操作、内存操作 |
| cute.nvgpu:warp/warpgroup/cpasync/tcgen05 MMA和复制 |
| 运行时:from_dlpack、伪张量、工具、调度器 |
| 调试、环境变量、常见错误、限制、常见问题 |
搜索方法: 在目录中搜索关键词。仅阅读Grep指向的文件和章节。
references/Tier 3: Original Documentation
三级:官方文档
If Tiers 1–2 don't answer, consult the source:
- Web: https://docs.nvidia.com/cutlass/latest/
- GitHub: https://github.com/NVIDIA/cutlass
- Fetch specific doc pages or search for "CUTLASS CuTe DSL <topic>"
- Consider distilling the answer back into references/
如果一级和二级无法解答问题,请查阅官方资料:
- 网页:https://docs.nvidia.com/cutlass/latest/
- GitHub:https://github.com/NVIDIA/cutlass
- 获取特定文档页面或搜索"CUTLASS CuTe DSL <主题>"
- 考虑将答案提炼后补充到references/目录中