# Learn CuTe DSL from CUTLASS Examples
A structured workflow for building CuTe Python DSL fluency by reading CUTLASS Blackwell
examples, optionally importing them into the project's runtime infrastructure, profiling
them, and distilling reusable patterns into the knowledge base.
## Example Source

Vast examples live under `third_party/cutlass/examples/python/CuTeDSL/`. Documentation is in `docs/cute-dsl/` (CuTe Python DSL) and `docs/cutlass-cpp/cute/` (CuTe C++ concepts).

| Category | Examples | Subdirectory |
|---|---|---|
| MLA decode | | |
| Attention | | (top-level) |
| Mixed attention | | |
| GEMM (basic) | | (top-level) |
| GEMM (pipelined) | | (top-level) |
| GEMM (persistent) | | (top-level) |
| GEMM (dynamic persistent) | | (top-level) |
| GEMM tutorials | | |
| Block-scaled GEMM | | (top-level) |
| Mixed-input GEMM | | |
| Epilogues | | |
| Reduction | | (top-level) |
| Dependent launch | | (top-level) |
| Mamba2 | | |
| Blockwise GEMM | | |
Other examples can be provided by the user directly.
## Workflow
```text
1. SELECT   -> Pick examples based on learning goals
2. READ     -> Read and annotate, identify CuTe DSL patterns
3. IMPORT   -> (Optional) Wrap in CuteKernel + KernelPlan for profiling
4. PROFILE  -> Profile imported kernels, compare with existing cuTile versions
5. EXTRACT  -> Distill reusable CuTe DSL patterns and anti-patterns
6. DOCUMENT -> Update knowledge base (docs/knowledge/languages/cute-dsl/)
```

## Step 1: SELECT
Pick examples based on what you need to learn.
| Learning goal | Start with |
|---|---|
| CuTe DSL basics (grid, TMA, MMA) | |
| Software pipelining | |
| Persistent kernel structure | |
| Warpgroup specialization | |
| TMA pipelining for attention | |
| MLA decode (direct reference) | |
| MLA decode with FP8 | |
| Cluster programming | |
| Custom epilogues | |
| Dependent kernel launch | |
Before reading, write down specific questions. Examples:
- "How does the MLA decode kernel set up TMA descriptors for the KV cache?"
- "How are warpgroups assigned producer/consumer roles in persistent GEMM?"
- "What barrier pattern does FMHA use for the K-loop pipeline?"
## Step 2: READ
For each selected example, read the full source and annotate these CuTe DSL-specific aspects:
### Host side (`@cute.jit`)

- TMA descriptor creation -- which tensors get TMA descriptors, what are the tile shapes?
- Grid and cluster dimensions -- how is the grid computed, is clustering used?
- Shared memory size -- how is the SMEM budget calculated from pipeline stages and tile shapes?
- Launch parameters -- `max_register_count`, cluster shape, SMEM carveout
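The SMEM-budget arithmetic is worth checking on paper before reading the kernel's launch code. A minimal sketch in plain Python (not CuTe DSL; the stage count and tile shapes below are hypothetical):

```python
def smem_bytes(stages, tile_m, tile_n, tile_k, dtype_bytes=2):
    """Shared memory for a multi-stage GEMM pipeline: each stage
    buffers one A tile (M x K) and one B tile (N x K)."""
    per_stage = (tile_m * tile_k + tile_n * tile_k) * dtype_bytes
    return stages * per_stage

# Hypothetical config: 4 stages of 128x256x64 FP16 tiles.
total = smem_bytes(stages=4, tile_m=128, tile_n=256, tile_k=64)
print(total // 1024, "KiB")  # 192 KiB -- compare against the SM's SMEM capacity
```

If the total exceeds the device's per-SM shared memory, the example must be trimming stages or tile sizes somewhere; that is exactly the knob to look for in the host code.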
### Device side (`@cute.kernel`)

- Thread/warp/warpgroup identity -- how are roles assigned?
- Pipeline structure -- number of stages, barrier initialization, producer/consumer loops
- TMA operations -- `cute.copy(tma_desc, ...)`, arrival patterns
- MMA operations -- accumulator setup, MMA loop body, epilogue
- Synchronization -- `cute.barrier_arrive()`, `cute.barrier_wait()`, `cute.cluster_barrier_*`
- Shared memory layout -- allocation, swizzling, partitioning across warpgroups
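The producer/consumer loops almost always rotate a stage index plus a phase bit; the phase tells a barrier wait which "generation" of a buffer slot to expect. A plain-Python sketch of that bookkeeping (illustrative only, not the CuTe DSL API):

```python
def pipeline_states(num_iters, num_stages):
    """For each main-loop iteration, return (stage, phase): the circular
    buffer slot to use and the parity bit the barrier wait must match.
    The phase flips each time the stage index wraps around."""
    states = []
    stage, phase = 0, 0
    for _ in range(num_iters):
        states.append((stage, phase))
        stage += 1
        if stage == num_stages:
            stage, phase = 0, phase ^ 1
    return states

print(pipeline_states(6, 3))
# -> [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

When annotating a kernel, check that every barrier wait in the consumer loop tracks this phase bit; a missed flip is a classic source of hangs.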
### Pattern checklist
While reading, flag instances of:
- TMA multi-stage pipeline (how many stages, barrier per stage)
- Warpgroup specialization (which warpgroups produce, which consume)
- Persistent kernel main loop (work tile scheduling, exit condition)
- Online softmax or streaming accumulation
- Register-level accumulator management
- Shared memory swizzle patterns
- Cluster-level communication (distributed shared memory)
- Epilogue pattern (in-register, through SMEM, or fused)
- Boundary handling (predication, masking for partial tiles)
- Pipeline drain (how the last pipeline stages are flushed)
### Cross-reference

Check whether patterns already exist in the knowledge base:

- `docs/knowledge/optimizations/` -- shared patterns (e.g., pipeline-driven-low-occupancy)
- `docs/knowledge/languages/cute-dsl/` -- existing CuTe DSL overlays
Note whether the example confirms, refines, or contradicts existing knowledge.
## Step 3: IMPORT (Optional)
Import a CUTLASS example into the project to make it runnable through the standard CLI and profiling infrastructure.
### When to import
- You want to profile the example on your hardware
- You want a starting point for a custom CuTe DSL kernel
- The example covers a workload comparable to existing cuTile kernels (e.g., MLA decode)
### When NOT to import
- You only need to understand the pattern, not benchmark it
- The example has complex dependencies or helper infrastructure that doesn't map cleanly
### Import procedure

1. Create the kernel package: `src/mla_var3/kernel/cute_python/<layer>/<design>/<design>/`. Example for CUTLASS MLA decode FP16: `src/mla_var3/kernel/cute_python/mla/cutlass_mla_decode/cutlass_mla_decode/`
2. Create the KernelPlan subclass (`cutlass_mla_decode.py`):
   - `prepare_inputs()` -- allocate tensors matching the example's expected shapes (Q, KV cache, output)
   - `reference_fn()` -- PyTorch reference for correctness
   - `_autotune_configs()` -- tiling parameters from the example
   - `_algorithmic_flops_bytes()` -- roofline analysis
   - `plan()` -- return a `CuteKernel` wrapping the example's host/device functions
3. Create `__main__.py`:
   ```python
   from .cutlass_mla_decode import CutlassMlaDecodeKernel

   if __name__ == "__main__":
       kernel_plan = CutlassMlaDecodeKernel()
       kernel_plan.benchmark_kernel_argparse()
   ```
4. Create `__init__.py` at each package level.
5. Handle helper modules: if the example uses helpers (e.g., `mla_helpers.py`), either:
   - Copy them into the kernel package and adjust imports
   - Import from the CUTLASS path (fragile but faster for exploration)
6. Test correctness:
   ```bash
   source .venv/bin/activate && python -m mla_var3.kernel cutlass_mla_decode --prof_type=disabled --check
   ```
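The numbers `_algorithmic_flops_bytes()` should return can be derived on paper before wiring up the subclass. A hedged sketch for a plain multi-head attention decode step (one query token against a length-`t` KV cache; the function name and formula are illustrative, not the project's API, and MLA's latent-cache layout would change the byte term):

```python
def attention_decode_flops_bytes(b, h, t, d, dtype_bytes=2):
    """Algorithmic cost of single-token attention decode.
    FLOPs: QK^T and PV each cost b*h*t*d multiply-adds (2 FLOPs per MAC).
    Bytes: dominated by streaming K and V once from DRAM."""
    flops = 4 * b * h * t * d
    bytes_moved = 2 * b * h * t * d * dtype_bytes
    return flops, bytes_moved

f, by = attention_decode_flops_bytes(b=32, h=16, t=4096, d=128)
print(f / by)  # arithmetic intensity of 1.0 FLOP/byte -> firmly memory-bound
```

An intensity this far below the device's FLOP/byte balance point is why decode kernels are judged mainly on DRAM throughput rather than tensor-core utilization.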
## Step 4: PROFILE
### Architecture compatibility check
Before profiling, verify the available device matches the kernel's target architecture:

```bash
nvidia-smi --query-gpu=name --format=csv,noheader
```

Many CUTLASS examples (especially those in `mla/`, and any using `tcgen05` MMA ops) target SM100 (B200, B300 datacenter GPUs) specifically. Be aware that "Blackwell" is a marketing name spanning multiple SM versions with different instruction sets:

- SM100 (B200, B300) -- datacenter GPUs, supports `tcgen05` MMA, TMEM, full cluster features
- SM120 (GeForce RTX 5090, RTX 5080) -- consumer Blackwell, uses `sm_120a` ops, does not support `tcgen05`

A kernel using `tcgen05` ops will not run on an RTX 5090 (SM120) despite both being marketed as "Blackwell". Match the GPU name to its actual SM version, not the marketing generation.

If the kernel's target SM version does not match the available device, skip the PROFILE step entirely: execution will either fail at compile time or produce misleading results. In that case, proceed directly to Step 5 (EXTRACT) with insights from reading the source code only, and note the architecture mismatch in any documentation.
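This check is easy to automate as a pre-flight step. A sketch using a hand-maintained name-to-SM table (the mapping below is an assumption covering only the devices named above; extend it for your fleet):

```python
import subprocess

# Assumed mapping from GPU-name substrings to SM versions (not exhaustive).
KNOWN_SM = {
    "B200": 100,      # datacenter Blackwell: tcgen05, TMEM
    "B300": 100,
    "RTX 5090": 120,  # consumer Blackwell: sm_120a, no tcgen05
    "RTX 5080": 120,
}

def detect_sm(gpu_name):
    """Return the SM version for a GPU name string, or None if unknown."""
    for key, sm in KNOWN_SM.items():
        if key in gpu_name:
            return sm
    return None

def require_sm(target_sm):
    """Abort before profiling if the device does not match the kernel's target."""
    name = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"], text=True
    ).strip()
    if detect_sm(name) != target_sm:
        raise SystemExit(f"{name!r} is not SM{target_sm}: skip PROFILE, go to EXTRACT")
```

Calling `require_sm(100)` before a `tcgen05` kernel turns a confusing compile-time failure into an explicit skip.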
### Profiling sequence
1. Lock GPU clocks:
   ```bash
   bash scripts/lock-gpu-clock.sh
   ```
2. Annotation mode first:
   ```bash
   source .venv/bin/activate && python -m mla_var3.kernel <kernel> \
       --b=32 --s=1 --t=4096 --prof_type=annotation
   ```
3. NCU deep dive if annotation reveals interesting patterns:
   ```bash
   source .venv/bin/activate && python -m mla_var3.kernel <kernel> \
       --b=32 --s=1 --t=4096 --prof_type=ncu
   ```
### Focus areas
Since the goal is learning (not just benchmarking), focus on understanding how the kernel achieves its performance:
- Pipeline efficiency: Are TMA copies fully overlapped with compute? Check for idle bubbles in the nsys timeline.
- Warpgroup utilization: Are all warpgroups busy? Check scheduler statistics in NCU.
- Register pressure vs occupancy tradeoff: How many registers per thread? Is low occupancy intentional?
- SMEM usage: How much of the available SMEM budget is used? How many pipeline stages fit?
- MMA throughput: What fraction of peak TC utilization is achieved?
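The register-pressure/occupancy question in particular reduces to simple arithmetic. A back-of-the-envelope sketch (assumes a 64K-register file per SM, 32-thread warps, and an 8-register allocation granularity, which are typical but device-dependent):

```python
def max_warps_per_sm(regs_per_thread, regfile=65536, warp_size=32,
                     hw_warp_limit=64, granularity=8):
    """Warps that fit in one SM's register file; per-thread allocation
    is rounded up to the assumed hardware granularity."""
    regs = -(-regs_per_thread // granularity) * granularity  # ceil to multiple
    return min(hw_warp_limit, regfile // (regs * warp_size))

# A 255-register kernel, common after register reallocation in
# warpgroup-specialized designs:
print(max_warps_per_sm(255))  # -> 8 warps: low occupancy, and intentional
print(max_warps_per_sm(32))   # -> 64 warps: full occupancy
```

If the profiled occupancy matches this bound, low occupancy is a design choice (latency hidden by pipelining) rather than a defect.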
### Compare with cuTile versions
If the imported kernel covers a workload that exists in cuTile (e.g., MLA decode), compare:
| Metric | cuTile version | CUTLASS CuTe DSL | Delta | Insight |
|---|---|---|---|---|
| Duration (us) | | | | |
| TC% | | | | |
| DRAM% | | | | |
| Occupancy | | | | |
| Registers/thread | | | | |
| SMEM/block (KB) | | | | |
Differences reveal what explicit control (warpgroup scheduling, TMA pipelining) buys over cuTile's block-level abstraction.
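Filling the Delta column is mechanical once both profiles exist. A small helper (metric names and values below are hypothetical):

```python
def metric_deltas(cutile, cute_dsl):
    """Rows of (metric, cuTile value, CuTe DSL value, % delta),
    with the delta taken relative to the cuTile baseline."""
    rows = []
    for metric, base in cutile.items():
        other = cute_dsl[metric]
        rows.append((metric, base, other, round(100.0 * (other - base) / base, 1)))
    return rows

rows = metric_deltas({"duration_us": 120.0, "tc_pct": 45.0},
                     {"duration_us": 96.0, "tc_pct": 58.0})
print(rows)  # duration -20.0%, TC% +28.9%
```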
## Step 5: EXTRACT
Distill observations into reusable CuTe DSL patterns.
### Classification
| Type | Destination |
|---|---|
| Shared algorithmic/hardware pattern | |
| Shared failure mode | |
| CuTe DSL API pattern or implementation detail | |
| CuTe DSL trap or pitfall | |
Most patterns from CUTLASS examples will be CuTe DSL-specific (the API patterns, TMA setup idioms, barrier choreography). Shared patterns (algorithmic insights like pipeline-driven scheduling) may already exist from cuTile work.
### Extraction criteria
Document a pattern if it:
- Transfers to other CuTe DSL kernels (not just this one example)
- Is non-obvious from the CuTe DSL documentation alone
- Has performance implications visible in profiling or architecturally clear
- Refines an existing shared pattern with CuTe DSL-specific implementation details
### Pattern template
````markdown
## [Pattern Name]

### When to Apply
- [CuTe DSL kernel type, workload shape, bottleneck condition]

### Mechanism
[How the pattern works in CuTe DSL -- reference specific APIs]

### Affected Metrics
- [Metric 1]
- [Metric 2]

### Implementation
```python
# CuTe DSL code snippet, generalized from the CUTLASS example
```

### Source
[CUTLASS example name and path]

### Performance Evidence
Source: [example name, device, configuration]

| Config | Metric | Value | Context |
|---|---|---|---|

### Generalization
[Device-agnostic takeaway]

### Pitfalls
- [CuTe DSL-specific failure modes]

### Interactions
- [How this interacts with other CuTe DSL patterns]
````

---

## Step 6: DOCUMENT
- Write pattern files to `docs/knowledge/languages/cute-dsl/optimizations/` or `anti-patterns/`
- Update the index in `/optimization-catalog-cute-dsl`
- If a pattern is shared (not DSL-specific), write to `docs/knowledge/optimizations/` and update `/optimization-catalog`
- If you imported and profiled a kernel, write a devlog entry in `docs/kernels/`:

````markdown
## [Kernel Name] (Imported from CUTLASS)

### Overview
[What this kernel does and why it was imported for study]

### Source
Original: `third_party/cutlass/examples/python/CuTeDSL/<path>`
Imported: `src/mla_var3/kernel/cute_python/mla/<design>/`

### Performance
[Profiling results table]

### Patterns Extracted
- [Pattern] -> `docs/knowledge/languages/cute-dsl/optimizations/<name>.md`

### Insights
[What was learned that informs future CuTe DSL kernel design]
````

---

## Recommended Learning Paths
### Path A: CuTe DSL fundamentals
For building basic fluency before writing any MLA kernel:
```text
1. tutorial_gemm/fp16_gemm_0.py    -> Minimal kernel: grid, TMA load, MMA, store
2. tutorial_gemm/fp16_gemm_3.py    -> Software pipelining basics
3. tutorial_gemm/fp16_gemm_6.py    -> Full persistent GEMM with warpgroup specialization
4. dense_gemm_software_pipeline.py -> Production-quality pipelining
5. dense_gemm_persistent.py        -> Persistent scheduling pattern
```
### Path B: MLA-specific patterns
For directly studying how CUTLASS implements MLA:
```text
1. mla/mla_helpers.py     -> Shared utilities, tensor layouts, TMA setup
2. mla/mla_decode_fp16.py -> FP16 MLA decode with full pipeline
3. mla/mla_decode_fp8.py  -> FP8 variant (quantization handling)
4. fmha.py                -> Attention forward (different design choices)
```
### Path C: Advanced patterns
For specific optimization techniques:
```text
1. dense_gemm_persistent_prefetch.py      -> Prefetching in persistent kernels
2. dense_gemm_persistent_dynamic.py       -> Dynamic tile scheduling + clusters
3. epilogue/custom_epilogue_dense_gemm.py -> Fused epilogue patterns
4. programmatic_dependent_launch.py       -> Dependent kernel launch
```