CuTe DSL
CuTe DSL is a Python-based domain-specific language for GPU kernel development,
part of CUTLASS 4.x. It provides Python abstractions over CUTLASS C++ templates
with JIT compilation to optimized CUDA kernels via MLIR and ptxas.
When to Use
Triggers:
- Writing CUDA kernels in Python (element-wise, GEMM, custom ops)
- Optimizing GPU memory access patterns (vectorized loads, TMA, shared memory)
- Building tensor core (MMA) kernels for Ampere/Hopper/Blackwell
- Integrating custom GPU kernels with PyTorch or JAX
- Prototyping high-performance kernels without C++ metaprogramming
Symptoms (wrong tool otherwise):
- Need shared memory coordination or tensor core MMA → use CuTe DSL (not Triton for complex patterns)
- Need simple element-wise ops with no shared memory → CuTe DSL or Triton both work
- Need to call existing CUTLASS C++ kernels → use CUTLASS C++ APIs instead
- Need reductions, scans, or non-GEMM collective ops → consider CUB/Thrust
Keywords: cute, cutlass, cute.jit, cute.kernel, from_dlpack, zipped_divide,
TiledMMA, TiledCopy, TMA, WGMMA, tcgen05, pipeline, mbarrier
Requirements
| Requirement | Detail |
|---|
| Platform | Linux x86_64 only |
| Python | 3.10–3.13 |
| GPU | NVIDIA Ampere+ (SM80, SM90, SM100) |
| CUDA Driver | ≥ 575.51.03 (Toolkit 12.9 compat) |
| Install | pip install nvidia-cutlass-dsl
|
| Optional | , |
Workflows
Workflow 0: Starting from Examples (Recommended)
For any non-trivial kernel (GEMM, attention, pipelined, fused ops), start by
finding the most similar existing example to use as a starting point — study
its structure, then rework it for your use case. Do not copy examples verbatim;
they target specific dtypes, architectures, and problem shapes that likely differ.
-
Pick the closest example from the index below.
Prefer examples matching the target GPU architecture (check with
torch.cuda.get_device_capability()
) when the operation is similar.
Fetch via
with base URL
https://raw.githubusercontent.com/NVIDIA/cutlass/main/examples/python/CuTeDSL
| Operation | Arch | Example path (append to base URL) |
|---|
| Element-wise add | SM80 | ampere/elementwise_add.py
|
| Element-wise + autotune | SM80 | ampere/elementwise_add_autotune.py
|
| Element-wise apply | SM80 | ampere/elementwise_apply.py
|
| SGEMM (scalar) | SM80 | |
| Tensor-core GEMM | SM80 | |
| Flash Attention v2 | SM80 | ampere/flash_attention_v2.py
|
| HSTU Attention | SM80 | |
| Shared memory allocator | SM80 | |
| CTA norm (LayerNorm) | SM90 | |
| Dense GEMM | SM90 | |
| Dense GEMM persistent | SM90 | hopper/dense_gemm_persistent.py
|
| Flash MHA | SM90 | |
| Dense GEMM | SM100 | |
| Dense GEMM persistent | SM100 | blackwell/dense_gemm_persistent.py
|
| Dense GEMM + alpha/beta | SM100 | blackwell/dense_gemm_alpha_beta_persistent.py
|
| RMSNorm | SM100 | |
| Reduce | SM100 | |
| Flash MHA | SM100 | |
| Grouped GEMM | SM100 | blackwell/grouped_gemm.py
|
| Mamba2 SSD | SM100 | |
| GEMM tutorial (notebook) | SM100 | notebooks/tour_to_sol_gemm.ipynb
|
Example: To fetch the Hopper dense GEMM:
bash
web_fetch https://raw.githubusercontent.com/NVIDIA/cutlass/main/examples/python/CuTeDSL/hopper/dense_gemm.py
-
Read reference materials first — before diving into example code, read
the relevant
docs to understand the patterns and APIs:
- For GEMM:
references/patterns-gemm.md
(3-level tiling, epilogue fusion,
with , shared memory layouts)
- For reductions:
references/patterns-reduction.md
(warp reductions,
cache pattern)
- For element-wise:
references/patterns-elementwise.md
(variations A–E)
- Always: (available APIs, arch-specific caveats)
This gives you the conceptual foundation so you can rework the example
intelligently rather than trying to copy-paste complex pipelines.
-
Fetch and study the example source — read for structure, not to copy:
- Identify: decorators, tiling strategy, shared memory usage, mainloop flow
- Note which dtype/arch it targets (many examples are fp16/bf16-specific)
- Check if it uses APIs tied to a specific arch (TMA → SM90+, tcgen05 → SM100)
-
Rework for the user's workload (do not copy-paste):
- Change shapes, data types, tile sizes to match requirements
- Replace compute logic (epilogue, activation fusion) as needed
- If dtype differs (e.g., example is fp16, need fp32), expect vectorization
and layout changes — the scalar-loop patterns in may be a
better starting point than adapting a vectorized example
- Runtime wrapper must be lightweight: should only call
+ the compiled kernel. Never allocate intermediate tensors,
copy data, or re-compile per call — these belong in one-time setup
- Apply optimizations from this skill's reference docs
⛔ Blackwell/Hopper GEMM + extra tensors — STOP:
If the target GPU is SM90+ (Hopper/Blackwell) and the GEMM requires
extra tensors beyond A, B, C in the epilogue (e.g., bias vector, activation
inputs), do not attempt it. These examples use TMA descriptors for all
data movement — adding tensors requires modifying TMA descriptor setup,
which is prohibitively complex. Instead, tell the user this limitation and
suggest a two-kernel approach: run the GEMM kernel as-is, then apply
bias + activation in a separate element-wise kernel (Workflow 1).
Plain GEMM (just A×B→C with scalar alpha/beta) on Hopper/Blackwell is fine.
-
Validate and benchmark using companion scripts:
bash
python scripts/verify_kernel.py kernel.py --rtol 1e-3 --atol 1e-3
python scripts/benchmark_kernel.py kernel.py
The kernel file must export
,
, and
.
When to skip examples: Pure element-wise operations (Workflow 1) have
complete patterns in
references/patterns-elementwise.md
— no need to fetch
external examples.
Reduction kernels (softmax, layernorm, RMSNorm): Use
references/patterns-reduction.md
which provides complete, proven patterns
for float32 reductions using scalar loops + butterfly shuffle + shared memory.
Workflow 1: Element-wise Kernel
For unary/binary/in-place operations that map inputs to outputs 1:1.
- Determine kernel structure: inputs/outputs count, tensor rank, target arch
- Select pattern from
references/patterns-elementwise.md
(Variations A–E)
- Write kernel applying all four invariant principles:
- P1:
from_dlpack(tensor, assumed_align=16)
for vector loads
- P2: Derive from
- P3:
cute.zipped_divide(mA, tiler)
for coalesced access
- P4:
cutlass.dynamic_expr(thread_idx < total)
for bounds
- Critical rules: No early return, no (use ), no
- Pre-compile with : Always pre-compile the kernel once
using so that calls the compiled object, not
directly. Without pre-compilation, every call recompiles
(~20-50ms overhead). Use so a single compiled
kernel handles arbitrary input shapes without recompilation:
python
# Compile once with dynamic layouts — works for any shape
fake_x = from_dlpack(torch.empty(1, 1, dtype=torch.float16, device="cuda"),
assumed_align=16).mark_layout_dynamic()
fake_out = from_dlpack(torch.empty(1, 1, dtype=torch.float16, device="cuda"),
assumed_align=16).mark_layout_dynamic()
compiled_kernel = cute.compile(host_fn, fake_x, fake_out)
def kernel_fn(x):
out = torch.empty_like(x)
compiled_kernel(from_dlpack(x, assumed_align=16).mark_layout_dynamic(),
from_dlpack(out, assumed_align=16).mark_layout_dynamic())
return out
- Verify correctness using companion script:
bash
python scripts/verify_kernel.py kernel.py --rtol 1e-3 --atol 1e-3
The kernel file must export , , and .
- Benchmark using companion script:
bash
python scripts/benchmark_kernel.py kernel.py
Workflow 2: GEMM Kernel
For matrix multiplication with tiling, shared memory, and tensor cores.
- Define problem: shapes (M, N, K), data types, target architecture
- Choose tiling: CTA tile (bM, bN, bK), pipeline stages, cluster shape
- Three-level partitioning (see
references/patterns-gemm.md
):
- Level 1: CTA tiling with
- Level 2: Copy partitioning (global → shared) with
- Level 3: Compute partitioning (shared → register) with
- Shared memory: Use swizzled layouts () to avoid bank conflicts
- Mainloop: K-tile loop with copy → sync → MMA → sync
- Pipeline: Use (Hopper) or (Blackwell).
⚠️ TMA-based pipelines manage data movement via TMA descriptors — adding
extra tensors (bias, activation inputs) to the epilogue requires modifying
descriptor setup, which is prohibitively complex. See the stop condition in
Workflow 0 step 4.
- Epilogue: Predicated store with alpha/beta scaling
- Pre-compile with : Always pre-compile the GEMM kernel
so calls the compiled object, not directly.
Without pre-compilation, every call recompiles (~20-50ms overhead).
- Autotune: Search over tile sizes, cluster shapes, pipeline depths
Workflow 3: Framework Integration
For wrapping CuTe DSL kernels as PyTorch/JAX custom operators.
- Write kernel using Workflow 1 or 2
- Create wrapper: Accept , convert via , call host fn
- For production: Compile with TVM FFI for zero-overhead tensor passing:
python
compiled = cute.compile(host_fn, *fake_tensors, options="--enable-tvm-ffi")
compiled(torch_a, torch_b) # Direct torch.Tensor, no from_dlpack
- For deployment: Use AOT compilation → export to → load at runtime
Workflow 4: Debugging & Profiling
- Set environment: ,
- Use for runtime values (not Python )
- Inspect generated code: ,
- Profile: Enable , use Nsight Compute/Systems
- Debug memory: Run with
compute-sanitizer python script.py
Output Formats
A typical CuTe DSL kernel project:
kernel_dir/
kernel.py # @cute.kernel + @cute.jit functions
test_kernel.py # Correctness test vs PyTorch reference
bench_kernel.py # Benchmark with cute.compile() setup
Success indicators:
- Correctness test passes (
torch.testing.assert_close
)
- Nsight shows vector loads (LDG.128/LDG.256), not scalar loads
- For GEMM: tensor core utilization > 80% in Nsight Compute
Companion Script Contract
Kernel files used with
and
scripts/benchmark_kernel.py
must export three names:
- — the CuTe DSL kernel wrapper (calls + runs kernel)
- — PyTorch reference implementation (same signature)
- — returns a list of CUDA tensors for testing
python
# Example kernel.py contract
import torch
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack
def kernel_fn(x):
out = torch.empty_like(x)
# ... call compiled cute kernel ...
return out
def reference_fn(x):
return torch.nn.functional.gelu(x)
def get_inputs():
return [torch.randn(1024, 512, dtype=torch.float16, device="cuda")]
Examples
Example: 2D Unary Element-wise (ReLU)
python
import torch, cutlass, cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack
@cute.kernel
def relu_kernel(gA: cute.Tensor, gC: cute.Tensor):
tidx, _, _ = cute.arch.thread_idx()
bidx, _, _ = cute.arch.block_idx()
bdim, _, _ = cute.arch.block_dim()
idx = bidx * bdim + tidx
m, n = gA.shape[1]
total = m * n
if cutlass.dynamic_expr(idx < total):
a = gA[(None, (idx // n, idx % n))].load()
gC[(None, (idx // n, idx % n))] = cute.where(a > 0, a, 0)
@cute.jit
def relu_host(mA: cute.Tensor, mC: cute.Tensor):
vec = 16 // (mA.element_type.width // 8)
gA = cute.zipped_divide(mA, (1, vec))
gC = cute.zipped_divide(mC, (1, vec))
T = 256
N = cute.size(gA.shape[1])
relu_kernel(gA, gC).launch(grid=((N+T-1)//T,1,1), block=(T,1,1))
x = torch.randn(1024, 512, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)
relu_host(from_dlpack(x, assumed_align=16), from_dlpack(out, assumed_align=16))
Error Handling
| Error | Cause | Fix |
|---|
MLIR function requires a Context
| Called @kernel from Python | Launch via @cute.jit host function |
| on return | Early return in @kernel | Use if cutlass.dynamic_expr(cond):
|
| Type mismatch on store | promotes FP16→FP32 | Use or |
could not get source code
| Kernel in context | Write to file and import |
| Scalar loads in Nsight | Missing alignment hint | Add to |
Missing required argument
| Not all @jit params passed | Pass ALL declared parameters |
| No | Use 1.0/(1.0+cute.math.exp(-x))
|
See
references/troubleshooting.md
for the full error table and limitations.
Debugging rule: Never delete kernel.py during debugging. Use
to save a checkpoint, then
to iterate. If stuck,
to
restore the backup. A partially-working kernel is always better than no kernel.
Finding More Information
Tier 1: This File (SKILL.md)
Workflows above cover element-wise kernels, GEMM, framework integration, and
debugging. Search this file first for procedural questions.
Tier 2: references/ Directory
Grep for keywords across
. Headers are grep-friendly.
| File | Content |
|---|
| Core abstractions, terminology, compilation pipeline |
| Layout algebra: composition, complement, divide, swizzle |
| Tensor types, partitioning, tiling, predication |
| MMA atoms, TiledMMA, per-architecture tensor core ops |
patterns-getting-started.md
| Installation, decorators, first kernel walkthrough |
| Invariant principles, pattern variations, reference impl |
| 3-level tiling, shared memory, pipelining, autotuning |
| from_dlpack, TMA, cp.async, TMEM, copy atoms |
| Control flow, JIT caching, TVM FFI, AOT compilation |
| Producer-consumer, pipeline classes, barriers, warp specialization |
| cute module: layouts, tensors, math, copy, gemm, printing |
| cute.arch: thread indexing, sync, atomics, memory ops |
| cute.nvgpu: warp/warpgroup/cpasync/tcgen05 MMA and copy |
| Runtime: from_dlpack, fake tensors, utils, schedulers |
| Debugging, env vars, common errors, limitations, FAQ |
How to search: Grep for your keyword across
. Read only the
file and section that Grep points to.
Tier 3: Original Documentation
If Tiers 1–2 don't answer, consult the source:
- Web: https://docs.nvidia.com/cutlass/latest/
- GitHub: https://github.com/NVIDIA/cutlass
- Fetch specific doc pages or search for "CUTLASS CuTe DSL <topic>"
- Consider distilling the answer back into references/