learn-cute-dsl

# Learn CuTe DSL from CUTLASS Examples

A structured workflow for building CuTe Python DSL fluency by reading CUTLASS Blackwell examples, optionally importing them into the project's runtime infrastructure, profiling them, and distilling reusable patterns into the knowledge base.

## Example Source

A large set of examples lives under `third_party/cutlass/examples/python/CuTeDSL`. Documentation is in `docs/cute-dsl/` (CuTe Python DSL) and `docs/cutlass-cpp/cute/` (CuTe C++ concepts).
| Category | Examples | Subdirectory |
| --- | --- | --- |
| MLA decode | `mla_decode_fp16.py`, `mla_decode_fp8.py`, `mla_helpers.py` | `mla/` |
| Attention | `fmha.py`, `fmha_bwd.py` | (top-level) |
| Mixed attention | `mixed_input_fmha_decode.py`, `mixed_input_fmha_prefill_*.py` | `mixed_input_fmha/` |
| GEMM (basic) | `dense_gemm.py` | (top-level) |
| GEMM (pipelined) | `dense_gemm_software_pipeline.py` | (top-level) |
| GEMM (persistent) | `dense_gemm_persistent.py`, `dense_gemm_persistent_prefetch.py` | (top-level) |
| GEMM (dynamic persistent) | `dense_gemm_persistent_dynamic.py` | (top-level) |
| GEMM tutorials | `fp16_gemm_0.py` through `fp16_gemm_6.py` | `tutorial_gemm/` |
| Block-scaled GEMM | `dense_blockscaled_gemm_persistent.py`, `*_prefetch.py`, `*_amax.py` | (top-level) |
| Mixed-input GEMM | `mixed_input_gemm.py`, `grouped_mixed_input_gemm.py` | `mixed_input_gemm/` |
| Epilogues | `custom_epilogue_dense_gemm.py`, `activation_custom_epilogue_dense_gemm.py` | `epilogue/` |
| Reduction | `reduce.py`, `rmsnorm.py` | (top-level) |
| Dependent launch | `programmatic_dependent_launch.py` | (top-level) |
| Mamba2 | `mamba2_ssd.py` | `mamba2_ssd/` |
| Blockwise GEMM | `blockwise_gemm.py`, `contiguous_grouped_gemm.py` | `blockwise_gemm/` |

Other examples can be provided by the user directly.

## Workflow

```text
1. SELECT    -> Pick examples based on learning goals
2. READ      -> Read and annotate, identify CuTe DSL patterns
3. IMPORT    -> (Optional) Wrap in CuteKernel + KernelPlan for profiling
4. PROFILE   -> Profile imported kernels, compare with existing cuTile versions
5. EXTRACT   -> Distill reusable CuTe DSL patterns and anti-patterns
6. DOCUMENT  -> Update knowledge base (docs/knowledge/languages/cute-dsl/)
```

## Step 1: SELECT
Pick examples based on what you need to learn.

| Learning goal | Start with |
| --- | --- |
| CuTe DSL basics (grid, TMA, MMA) | `tutorial_gemm/fp16_gemm_0.py` through `fp16_gemm_6.py` |
| Software pipelining | `dense_gemm_software_pipeline.py` |
| Persistent kernel structure | `dense_gemm_persistent.py` |
| Warpgroup specialization | `dense_gemm_persistent_prefetch.py` |
| TMA pipelining for attention | `fmha.py` |
| MLA decode (direct reference) | `mla/mla_decode_fp16.py` + `mla/mla_helpers.py` |
| MLA decode with FP8 | `mla/mla_decode_fp8.py` |
| Cluster programming | `dense_gemm_persistent_dynamic.py` |
| Custom epilogues | `epilogue/custom_epilogue_dense_gemm.py` |
| Dependent kernel launch | `programmatic_dependent_launch.py` |

Before reading, write down specific questions. Examples:

- "How does the MLA decode kernel set up TMA descriptors for the KV cache?"
- "How are warpgroups assigned producer/consumer roles in persistent GEMM?"
- "What barrier pattern does FMHA use for the K-loop pipeline?"


## Step 2: READ

For each selected example, read the full source and annotate these CuTe DSL-specific aspects:

### Host side (`@cute.jit`)

1. TMA descriptor creation -- which tensors get TMA descriptors, and what are the tile shapes?
2. Grid and cluster dimensions -- how is the grid computed, and is clustering used?
3. Shared memory size -- how is the SMEM budget calculated from pipeline stages and tile shapes?
4. Launch parameters -- `max_register_count`, cluster shape, SMEM carveout
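The SMEM arithmetic in item 3 can be sketched in plain Python. This is an illustrative model only: the tile shapes, element width, and 228 KB carveout below are hypothetical values, not taken from any particular example.

```python
def smem_bytes(tile_m, tile_n, tile_k, stages, elem_bytes=2):
    """Estimate shared memory for a multi-stage GEMM pipeline:
    each stage buffers one A tile (tile_m x tile_k) and one
    B tile (tile_n x tile_k)."""
    a_tile = tile_m * tile_k * elem_bytes
    b_tile = tile_n * tile_k * elem_bytes
    return stages * (a_tile + b_tile)

def max_stages(tile_m, tile_n, tile_k, smem_budget, elem_bytes=2):
    """How many pipeline stages fit in a given SMEM budget."""
    per_stage = smem_bytes(tile_m, tile_n, tile_k, 1, elem_bytes)
    return smem_budget // per_stage

# Example: 128x256x64 FP16 tiles against a 228 KB SMEM carveout
print(smem_bytes(128, 256, 64, stages=4))   # -> 196608
print(max_stages(128, 256, 64, 228 * 1024)) # -> 4
```

When annotating a real example, recompute this from the example's actual tile shapes and compare against the SMEM size it requests at launch.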

### Device side (`@cute.kernel`)

1. Thread/warp/warpgroup identity -- how are roles assigned?
2. Pipeline structure -- number of stages, barrier initialization, producer/consumer loops
3. TMA operations -- `cute.copy(tma_desc, ...)`, arrival patterns
4. MMA operations -- accumulator setup, MMA loop body, epilogue
5. Synchronization -- `cute.barrier_arrive()`, `cute.barrier_wait()`, `cute.cluster_barrier_*`
6. Shared memory layout -- allocation, swizzling, partitioning across warpgroups
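The stage/phase bookkeeping behind item 2 can be modeled outside the DSL. This is a conceptual sketch of ring-buffer indexing, not CuTe DSL API; the names are illustrative.

```python
def stage_sequence(num_k_tiles, num_stages):
    """Model the circular stage index and phase bit a consumer
    tracks across a K-loop: the stage wraps modulo num_stages,
    and the phase bit flips each time the ring buffer wraps."""
    out = []
    for k in range(num_k_tiles):
        stage = k % num_stages
        phase = (k // num_stages) % 2  # flips on each wrap
        out.append((stage, phase))
    return out

print(stage_sequence(6, 3))
# -> [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

When reading a kernel, look for this same modulo/phase pattern in the barrier wait conditions to identify how many stages the pipeline has.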

### Pattern checklist

While reading, flag instances of:

- TMA multi-stage pipeline (how many stages, barrier per stage)
- Warpgroup specialization (which warpgroups produce, which consume)
- Persistent kernel main loop (work tile scheduling, exit condition)
- Online softmax or streaming accumulation
- Register-level accumulator management
- Shared memory swizzle patterns
- Cluster-level communication (distributed shared memory)
- Epilogue pattern (in-register, through SMEM, or fused)
- Boundary handling (predication, masking for partial tiles)
- Pipeline drain (how the last pipeline stages are flushed)
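Of these, online softmax is easy to illustrate outside the DSL. A minimal Python sketch of the algorithmic core (real kernels do this per tile, in registers, with the accumulator rescaled alongside the denominator):

```python
import math

def online_softmax(scores):
    """Single-pass softmax: keep a running max m and running
    denominator d, rescaling d whenever the max increases."""
    m = float("-inf")
    d = 0.0
    for x in scores:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in scores]

probs = online_softmax([1.0, 2.0, 3.0])
print(sum(probs))  # -> 1.0 (up to floating point)
```

Recognizing this running max/denominator pair in a kernel's register variables is the fastest way to spot the streaming-accumulation pattern in FMHA-style code.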

### Cross-reference

Check whether patterns already exist in the knowledge base:

- `docs/knowledge/optimizations/` -- shared patterns (e.g., pipeline-driven-low-occupancy)
- `docs/knowledge/languages/cute-dsl/` -- existing CuTe DSL overlays

Note whether the example confirms, refines, or contradicts existing knowledge.

## Step 3: IMPORT (Optional)

Import a CUTLASS example into the project to make it runnable through the standard CLI and profiling infrastructure.

### When to import

- You want to profile the example on your hardware
- You want a starting point for a custom CuTe DSL kernel
- The example covers a workload comparable to existing cuTile kernels (e.g., MLA decode)

### When NOT to import

- You only need to understand the pattern, not benchmark it
- The example has complex dependencies or helper infrastructure that doesn't map cleanly

### Import procedure

1. Create the kernel package:
   `src/mla_var3/kernel/cute_python/<layer>/<design>/<design>/`
   Example for CUTLASS MLA decode FP16:
   `src/mla_var3/kernel/cute_python/mla/cutlass_mla_decode/cutlass_mla_decode/`
2. Create the KernelPlan subclass (`cutlass_mla_decode.py`):
   - `prepare_inputs()` -- allocate tensors matching the example's expected shapes (Q, KV cache, output)
   - `reference_fn()` -- PyTorch reference for correctness
   - `_autotune_configs()` -- tiling parameters from the example
   - `_algorithmic_flops_bytes()` -- roofline analysis
   - `plan()` -- return a `CuteKernel` wrapping the example's host/device functions
3. Create `__main__.py`:
   ```python
   from .cutlass_mla_decode import CutlassMlaDecodeKernel

   if __name__ == "__main__":
       kernel_plan = CutlassMlaDecodeKernel()
       kernel_plan.benchmark_kernel_argparse()
   ```
4. Create `__init__.py` at each package level
5. Handle helper modules: if the example uses helpers (e.g., `mla_helpers.py`), either:
   - Copy them into the kernel package and adjust imports
   - Import from the CUTLASS path (fragile, but faster for exploration)
6. Test correctness:
   ```bash
   source .venv/bin/activate && python -m mla_var3.kernel cutlass_mla_decode --prof_type=disabled --check
   ```
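For item 2's `_algorithmic_flops_bytes()`, the following sketch shows the kind of roofline arithmetic such a method returns, using a plain dense GEMM as the example. The function name and accounting here are illustrative assumptions, not the project's actual signature; the MLA-specific accounting would differ.

```python
def gemm_flops_bytes(m, n, k, in_bytes=2, out_bytes=2):
    """Algorithmic work for C[m,n] = A[m,k] @ B[k,n]:
    2*m*n*k FLOPs, plus reading A and B and writing C once."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n) * in_bytes + m * n * out_bytes
    return flops, bytes_moved

flops, nbytes = gemm_flops_bytes(4096, 4096, 4096)
print(flops / nbytes)  # arithmetic intensity (FLOPs per byte)
```

Comparing this algorithmic minimum against the bytes actually moved (from NCU) shows how much redundant traffic the kernel incurs.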


## Step 4: PROFILE

### Architecture compatibility check

Before profiling, verify that the available device matches the kernel's target architecture:

```bash
nvidia-smi --query-gpu=name --format=csv,noheader
```

Many CUTLASS examples (especially those in `mla/`, and any using `tcgen05` MMA ops) specifically target SM100 (B200, B300 datacenter GPUs). Be aware that "Blackwell" is a marketing name spanning multiple SM versions with different instruction sets:

- SM100 (B200, B300) -- datacenter GPUs; supports `tcgen05` MMA, TMEM, and full cluster features
- SM120 (GeForce RTX 5090, RTX 5080) -- consumer Blackwell; uses `sm_120a` and does not support `tcgen05` ops

A kernel using `tcgen05` ops will not run on an RTX 5090 (SM120) despite both being marketed as "Blackwell". Match the GPU name to its actual SM version, not the marketing generation.

If the kernel's target SM version does not match the available device, skip the PROFILE step entirely -- execution will either fail at compile time or produce misleading results. In that case, proceed directly to Step 5 (EXTRACT) with insights from reading the source code only, and note the architecture mismatch in any documentation.
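The gating logic above can be sketched as a small helper. The name-to-SM table covers only the devices mentioned in this section and is an assumption to extend for your own fleet; it is not an exhaustive or authoritative mapping.

```python
# Map device names (as reported by nvidia-smi) to SM versions.
# Partial table, assumed for illustration -- extend as needed.
GPU_SM = {
    "NVIDIA B200": 100,
    "NVIDIA B300": 100,
    "NVIDIA GeForce RTX 5090": 120,
    "NVIDIA GeForce RTX 5080": 120,
}

def can_profile(gpu_name, kernel_target_sm):
    """Skip PROFILE unless the device's SM version matches the
    kernel's target (tcgen05 kernels require SM100)."""
    return GPU_SM.get(gpu_name) == kernel_target_sm

print(can_profile("NVIDIA GeForce RTX 5090", 100))  # -> False
print(can_profile("NVIDIA B200", 100))              # -> True
```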

### Profiling sequence

1. Lock GPU clocks:
   ```bash
   bash scripts/lock-gpu-clock.sh
   ```
2. Annotation mode first:
   ```bash
   source .venv/bin/activate && python -m mla_var3.kernel <kernel> \
       --b=32 --s=1 --t=4096 --prof_type=annotation
   ```
3. NCU deep dive if annotation reveals interesting patterns:
   ```bash
   source .venv/bin/activate && python -m mla_var3.kernel <kernel> \
       --b=32 --s=1 --t=4096 --prof_type=ncu
   ```

### Focus areas

Since the goal is learning (not just benchmarking), focus on understanding how the kernel achieves its performance:

- Pipeline efficiency: are TMA copies fully overlapped with compute? Check for idle bubbles in the nsys timeline.
- Warpgroup utilization: are all warpgroups busy? Check scheduler statistics in NCU.
- Register pressure vs. occupancy tradeoff: how many registers per thread? Is low occupancy intentional?
- SMEM usage: how much of the available SMEM budget is used? How many pipeline stages fit?
- MMA throughput: what fraction of peak tensor core utilization is achieved?

### Compare with cuTile versions

If the imported kernel covers a workload that exists in cuTile (e.g., MLA decode), compare:

| Metric | cuTile version | CUTLASS CuTe DSL | Delta | Insight |
| --- | --- | --- | --- | --- |
| Duration (us) | | | | |
| TC% | | | | |
| DRAM% | | | | |
| Occupancy | | | | |
| Registers/thread | | | | |
| SMEM/block (KB) | | | | |

Differences reveal what explicit control (warpgroup scheduling, TMA pipelining) buys over cuTile's block-level abstraction.
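A small helper for filling the delta column might look like this. The metric keys and values below are made up for illustration and do not correspond to any measured run.

```python
def compare_metrics(cutile, cutlass):
    """Per-metric delta between two profiling runs
    (positive delta = CUTLASS value is higher)."""
    return {k: cutlass[k] - cutile[k] for k in cutile if k in cutlass}

cutile = {"duration_us": 42.0, "tc_pct": 35.0, "occupancy": 0.50}
cutlass = {"duration_us": 31.0, "tc_pct": 48.0, "occupancy": 0.25}
print(compare_metrics(cutile, cutlass))
# -> {'duration_us': -11.0, 'tc_pct': 13.0, 'occupancy': -0.25}
```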

## Step 5: EXTRACT

Distill observations into reusable CuTe DSL patterns.

### Classification

| Type | Destination |
| --- | --- |
| Shared algorithmic/hardware pattern | `docs/knowledge/optimizations/<name>.md` |
| Shared failure mode | `docs/knowledge/anti-patterns/<name>.md` |
| CuTe DSL API pattern or implementation detail | `docs/knowledge/languages/cute-dsl/optimizations/<name>.md` |
| CuTe DSL trap or pitfall | `docs/knowledge/languages/cute-dsl/anti-patterns/<name>.md` |

Most patterns from CUTLASS examples will be CuTe DSL-specific (API patterns, TMA setup idioms, barrier choreography). Shared patterns (algorithmic insights such as pipeline-driven scheduling) may already exist from cuTile work.

### Extraction criteria

Document a pattern if it:

- Transfers to other CuTe DSL kernels (not just this one example)
- Is non-obvious from the CuTe DSL documentation alone
- Has performance implications that are visible in profiling or architecturally clear
- Refines an existing shared pattern with CuTe DSL-specific implementation details

### Pattern template

````markdown
# [Pattern Name]

## When to Apply

- [CuTe DSL kernel type, workload shape, bottleneck condition]

## Mechanism

[How the pattern works in CuTe DSL -- reference specific APIs]

## Affected Metrics

- [Metric 1]
- [Metric 2]

## Implementation

```python
# CuTe DSL code snippet, generalized from the CUTLASS example
```

## Source

[CUTLASS example name and path]

## Performance Evidence

Source: [example name, device, configuration]

| Config | Metric | Value | Context |
| --- | --- | --- | --- |

## Generalization

[Device-agnostic takeaway]

## Pitfalls

- [CuTe DSL-specific failure modes]

## Interactions

- [How this interacts with other CuTe DSL patterns]
````

## Step 6: DOCUMENT

1. Write pattern files to `docs/knowledge/languages/cute-dsl/optimizations/` or `anti-patterns/`
2. Update the index in `/optimization-catalog-cute-dsl`
3. If a pattern is shared (not DSL-specific), write to `docs/knowledge/optimizations/` and update `/optimization-catalog`
4. If you imported and profiled a kernel, write a devlog entry in `docs/kernels/`:

```markdown
# [Kernel Name] (Imported from CUTLASS)

## Overview

[What this kernel does and why it was imported for study]

## Source

Original: `third_party/cutlass/examples/python/CuTeDSL/<path>`
Imported: `src/mla_var3/kernel/cute_python/mla/<design>/`

## Performance

[Profiling results table]

## Patterns Extracted

- [Pattern] -> `docs/knowledge/languages/cute-dsl/optimizations/<name>.md`

## Insights

[What was learned that informs future CuTe DSL kernel design]
```

## Recommended Learning Paths

### Path A: CuTe DSL fundamentals

For building basic fluency before writing any MLA kernel:

```text
1. tutorial_gemm/fp16_gemm_0.py  -> Minimal kernel: grid, TMA load, MMA, store
2. tutorial_gemm/fp16_gemm_3.py  -> Software pipelining basics
3. tutorial_gemm/fp16_gemm_6.py  -> Full persistent GEMM with warpgroup specialization
4. dense_gemm_software_pipeline.py -> Production-quality pipelining
5. dense_gemm_persistent.py      -> Persistent scheduling pattern
```

### Path B: MLA-specific patterns

For directly studying how CUTLASS implements MLA:

```text
1. mla/mla_helpers.py            -> Shared utilities, tensor layouts, TMA setup
2. mla/mla_decode_fp16.py        -> FP16 MLA decode with full pipeline
3. mla/mla_decode_fp8.py         -> FP8 variant (quantization handling)
4. fmha.py                       -> Attention forward (different design choices)
```

### Path C: Advanced patterns

For specific optimization techniques:

```text
1. dense_gemm_persistent_prefetch.py  -> Prefetching in persistent kernels
2. dense_gemm_persistent_dynamic.py   -> Dynamic tile scheduling + clusters
3. epilogue/custom_epilogue_dense_gemm.py -> Fused epilogue patterns
4. programmatic_dependent_launch.py   -> Dependent kernel launch
```