# Learn CuTe DSL from CUTLASS Examples
A structured workflow for building CuTe Python DSL fluency by reading CUTLASS Blackwell
examples, optionally importing them into the project's runtime infrastructure, profiling
them, and distilling reusable patterns into the knowledge base.
## Example Source

Vast examples live under `third_party/cutlass/examples/python/CuTeDSL/`. Documentation is in `docs/cute-dsl/` (CuTe Python DSL) and `docs/cutlass-cpp/cute/` (CuTe C++ concepts).

| Category | Examples | Subdirectory |
|---|---|---|
| MLA decode | | |
| Attention | | (top-level) |
| Mixed attention | | |
| GEMM (basic) | | (top-level) |
| GEMM (pipelined) | | (top-level) |
| GEMM (persistent) | | (top-level) |
| GEMM (dynamic persistent) | | (top-level) |
| GEMM tutorials | | |
| Block-scaled GEMM | | (top-level) |
| Mixed-input GEMM | | |
| Epilogues | | |
| Reduction | | (top-level) |
| Dependent launch | | (top-level) |
| Mamba2 | | |
| Blockwise GEMM | | |
Other examples can be provided by the user directly.
## Workflow
```text
1. SELECT   -> Pick examples based on learning goals
2. READ     -> Read and annotate, identify CuTe DSL patterns
3. IMPORT   -> (Optional) Wrap in CuteKernel + KernelPlan for profiling
4. PROFILE  -> Profile imported kernels, compare with existing cuTile versions
5. EXTRACT  -> Distill reusable CuTe DSL patterns and anti-patterns
6. DOCUMENT -> Update knowledge base (docs/knowledge/languages/cute-dsl/)
```

## Step 1: SELECT
Pick examples based on what you need to learn.
| Learning goal | Start with |
|---|---|
| CuTe DSL basics (grid, TMA, MMA) | |
| Software pipelining | |
| Persistent kernel structure | |
| Warpgroup specialization | |
| TMA pipelining for attention | |
| MLA decode (direct reference) | |
| MLA decode with FP8 | |
| Cluster programming | |
| Custom epilogues | |
| Dependent kernel launch | |
Before reading, write down specific questions. Examples:
- "How does the MLA decode kernel set up TMA descriptors for the KV cache?"
- "How are warpgroups assigned producer/consumer roles in persistent GEMM?"
- "What barrier pattern does FMHA use for the K-loop pipeline?"
## Step 2: READ
For each selected example, read the full source and annotate these CuTe DSL-specific aspects:
### Host side (`@cute.jit`)

- TMA descriptor creation -- which tensors get TMA descriptors, what are the tile shapes?
- Grid and cluster dimensions -- how is the grid computed, is clustering used?
- Shared memory size -- how is the SMEM budget calculated from pipeline stages and tile shapes?
- Launch parameters -- `max_register_count`, cluster shape, SMEM carveout
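The SMEM-budget arithmetic is worth checking on paper before reading the kernel's launch code. A minimal sketch in plain Python (not CuTe DSL; the stage count and tile shapes below are hypothetical):

```python
def smem_bytes(stages, tile_m, tile_n, tile_k, dtype_bytes=2):
    """Shared memory for a multi-stage GEMM pipeline: each stage
    buffers one A tile (M x K) and one B tile (N x K)."""
    per_stage = (tile_m * tile_k + tile_n * tile_k) * dtype_bytes
    return stages * per_stage

# Hypothetical config: 4 stages of 128x256x64 FP16 tiles.
total = smem_bytes(stages=4, tile_m=128, tile_n=256, tile_k=64)
print(total // 1024, "KiB")  # 192 KiB -- compare against the SM's SMEM capacity
```

If the total exceeds the device's per-SM shared memory, the example must be trimming stages or tile sizes somewhere; that is exactly the knob to look for in the host code.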
### Device side (`@cute.kernel`)

- Thread/warp/warpgroup identity -- how are roles assigned?
- Pipeline structure -- number of stages, barrier initialization, producer/consumer loops
- TMA operations -- `cute.copy(tma_desc, ...)`, arrival patterns
- MMA operations -- accumulator setup, MMA loop body, epilogue
- Synchronization -- `cute.barrier_arrive()`, `cute.barrier_wait()`, `cute.cluster_barrier_*`
- Shared memory layout -- allocation, swizzling, partitioning across warpgroups
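The producer/consumer loops almost always rotate a stage index plus a phase bit; the phase tells a barrier wait which "generation" of a buffer slot to expect. A plain-Python sketch of that bookkeeping (illustrative only, not the CuTe DSL API):

```python
def pipeline_states(num_iters, num_stages):
    """For each main-loop iteration, return (stage, phase): the circular
    buffer slot to use and the parity bit the barrier wait must match.
    The phase flips each time the stage index wraps around."""
    states = []
    stage, phase = 0, 0
    for _ in range(num_iters):
        states.append((stage, phase))
        stage += 1
        if stage == num_stages:
            stage, phase = 0, phase ^ 1
    return states

print(pipeline_states(6, 3))
# -> [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

When annotating a kernel, check that every barrier wait in the consumer loop tracks this phase bit; a missed flip is a classic source of hangs.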
### Pattern checklist
While reading, flag instances of:
- TMA multi-stage pipeline (how many stages, barrier per stage)
- Warpgroup specialization (which warpgroups produce, which consume)
- Persistent kernel main loop (work tile scheduling, exit condition)
- Online softmax or streaming accumulation
- Register-level accumulator management
- Shared memory swizzle patterns
- Cluster-level communication (distributed shared memory)
- Epilogue pattern (in-register, through SMEM, or fused)
- Boundary handling (predication, masking for partial tiles)
- Pipeline drain (how the last pipeline stages are flushed)
### Cross-reference

Check whether patterns already exist in the knowledge base:

- `docs/knowledge/optimizations/` -- shared patterns (e.g., pipeline-driven-low-occupancy)
- `docs/knowledge/languages/cute-dsl/` -- existing CuTe DSL overlays
Note whether the example confirms, refines, or contradicts existing knowledge.
## Step 3: IMPORT (Optional)
Import a CUTLASS example into the project to make it runnable through the standard CLI and profiling infrastructure.
### When to import
- You want to profile the example on your hardware
- You want a starting point for a custom CuTe DSL kernel
- The example covers a workload comparable to existing cuTile kernels (e.g., MLA decode)
### When NOT to import
- You only need to understand the pattern, not benchmark it
- The example has complex dependencies or helper infrastructure that doesn't map cleanly
### Import procedure

1. Create the kernel package: `src/mla_var3/kernel/cute_python/<layer>/<design>/<design>/`. Example for CUTLASS MLA decode FP16: `src/mla_var3/kernel/cute_python/mla/cutlass_mla_decode/cutlass_mla_decode/`
2. Create the KernelPlan subclass (`cutlass_mla_decode.py`):
   - `prepare_inputs()` -- allocate tensors matching the example's expected shapes (Q, KV cache, output)
   - `reference_fn()` -- PyTorch reference for correctness
   - `_autotune_configs()` -- tiling parameters from the example
   - `_algorithmic_flops_bytes()` -- roofline analysis
   - `plan()` -- return a `CuteKernel` wrapping the example's host/device functions
3. Create `__main__.py`:
   ```python
   from .cutlass_mla_decode import CutlassMlaDecodeKernel

   if __name__ == "__main__":
       kernel_plan = CutlassMlaDecodeKernel()
       kernel_plan.benchmark_kernel_argparse()
   ```
4. Create `__init__.py` at each package level.
5. Handle helper modules: if the example uses helpers (e.g., `mla_helpers.py`), either:
   - Copy them into the kernel package and adjust imports
   - Import from the CUTLASS path (fragile but faster for exploration)
6. Test correctness:
   ```bash
   source .venv/bin/activate && python -m mla_var3.kernel cutlass_mla_decode --prof_type=disabled --check
   ```
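The numbers `_algorithmic_flops_bytes()` should return can be derived on paper before wiring up the subclass. A hedged sketch for a plain multi-head attention decode step (one query token against a length-`t` KV cache; the function name and formula are illustrative, not the project's API, and MLA's latent-cache layout would change the byte term):

```python
def attention_decode_flops_bytes(b, h, t, d, dtype_bytes=2):
    """Algorithmic cost of single-token attention decode.
    FLOPs: QK^T and PV each cost b*h*t*d multiply-adds (2 FLOPs per MAC).
    Bytes: dominated by streaming K and V once from DRAM."""
    flops = 4 * b * h * t * d
    bytes_moved = 2 * b * h * t * d * dtype_bytes
    return flops, bytes_moved

f, by = attention_decode_flops_bytes(b=32, h=16, t=4096, d=128)
print(f / by)  # arithmetic intensity of 1.0 FLOP/byte -> firmly memory-bound
```

An intensity this far below the device's FLOP/byte balance point is why decode kernels are judged mainly on DRAM throughput rather than tensor-core utilization.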
## Step 4: PROFILE
### Architecture compatibility check
Before profiling, verify the available device matches the kernel's target architecture:

```bash
nvidia-smi --query-gpu=name --format=csv,noheader
```

Many CUTLASS examples (especially those in `mla/`, and any using `tcgen05` MMA ops) target SM100 (B200, B300 datacenter GPUs) specifically. Be aware that "Blackwell" is a marketing name spanning multiple SM versions with different instruction sets:

- SM100 (B200, B300) -- datacenter GPUs, supports `tcgen05` MMA, TMEM, full cluster features
- SM120 (GeForce RTX 5090, RTX 5080) -- consumer Blackwell, uses `sm_120a` ops, does not support `tcgen05`

A kernel using `tcgen05` ops will not run on an RTX 5090 (SM120) despite both being marketed as "Blackwell". Match the GPU name to its actual SM version, not the marketing generation.

If the kernel's target SM version does not match the available device, skip the PROFILE step entirely: execution will either fail at compile time or produce misleading results. In that case, proceed directly to Step 5 (EXTRACT) with insights from reading the source code only, and note the architecture mismatch in any documentation.
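This check is easy to automate as a pre-flight step. A sketch using a hand-maintained name-to-SM table (the mapping below is an assumption covering only the devices named above; extend it for your fleet):

```python
import subprocess

# Assumed mapping from GPU-name substrings to SM versions (not exhaustive).
KNOWN_SM = {
    "B200": 100,      # datacenter Blackwell: tcgen05, TMEM
    "B300": 100,
    "RTX 5090": 120,  # consumer Blackwell: sm_120a, no tcgen05
    "RTX 5080": 120,
}

def detect_sm(gpu_name):
    """Return the SM version for a GPU name string, or None if unknown."""
    for key, sm in KNOWN_SM.items():
        if key in gpu_name:
            return sm
    return None

def require_sm(target_sm):
    """Abort before profiling if the device does not match the kernel's target."""
    name = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"], text=True
    ).strip()
    if detect_sm(name) != target_sm:
        raise SystemExit(f"{name!r} is not SM{target_sm}: skip PROFILE, go to EXTRACT")
```

Calling `require_sm(100)` before a `tcgen05` kernel turns a confusing compile-time failure into an explicit skip.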
### Profiling sequence
1. Lock GPU clocks:
   ```bash
   bash scripts/lock-gpu-clock.sh
   ```
2. Annotation mode first:
   ```bash
   source .venv/bin/activate && python -m mla_var3.kernel <kernel> \
       --b=32 --s=1 --t=4096 --prof_type=annotation
   ```
3. NCU deep dive if annotation reveals interesting patterns:
   ```bash
   source .venv/bin/activate && python -m mla_var3.kernel <kernel> \
       --b=32 --s=1 --t=4096 --prof_type=ncu
   ```
### Focus areas
Since the goal is learning (not just benchmarking), focus on understanding how the kernel achieves its performance:
- Pipeline efficiency: Are TMA copies fully overlapped with compute? Check for idle bubbles in the nsys timeline.
- Warpgroup utilization: Are all warpgroups busy? Check scheduler statistics in NCU.
- Register pressure vs occupancy tradeoff: How many registers per thread? Is low occupancy intentional?
- SMEM usage: How much of the available SMEM budget is used? How many pipeline stages fit?
- MMA throughput: What fraction of peak TC utilization is achieved?
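The register-pressure/occupancy question in particular reduces to simple arithmetic. A back-of-the-envelope sketch (assumes a 64K-register file per SM, 32-thread warps, and an 8-register allocation granularity, which are typical but device-dependent):

```python
def max_warps_per_sm(regs_per_thread, regfile=65536, warp_size=32,
                     hw_warp_limit=64, granularity=8):
    """Warps that fit in one SM's register file; per-thread allocation
    is rounded up to the assumed hardware granularity."""
    regs = -(-regs_per_thread // granularity) * granularity  # ceil to multiple
    return min(hw_warp_limit, regfile // (regs * warp_size))

# A 255-register kernel, common after register reallocation in
# warpgroup-specialized designs:
print(max_warps_per_sm(255))  # -> 8 warps: low occupancy, and intentional
print(max_warps_per_sm(32))   # -> 64 warps: full occupancy
```

If the profiled occupancy matches this bound, low occupancy is a design choice (latency hidden by pipelining) rather than a defect.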
### Compare with cuTile versions
If the imported kernel covers a workload that exists in cuTile (e.g., MLA decode), compare:
| Metric | cuTile version | CUTLASS CuTe DSL | Delta | Insight |
|---|---|---|---|---|
| Duration (us) | | | | |
| TC% | | | | |
| DRAM% | | | | |
| Occupancy | | | | |
| Registers/thread | | | | |
| SMEM/block (KB) | | | | |
Differences reveal what explicit control (warpgroup scheduling, TMA pipelining) buys over cuTile's block-level abstraction.
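Filling the Delta column is mechanical once both profiles exist. A small helper (metric names and values below are hypothetical):

```python
def metric_deltas(cutile, cute_dsl):
    """Rows of (metric, cuTile value, CuTe DSL value, % delta),
    with the delta taken relative to the cuTile baseline."""
    rows = []
    for metric, base in cutile.items():
        other = cute_dsl[metric]
        rows.append((metric, base, other, round(100.0 * (other - base) / base, 1)))
    return rows

rows = metric_deltas({"duration_us": 120.0, "tc_pct": 45.0},
                     {"duration_us": 96.0, "tc_pct": 58.0})
print(rows)  # duration -20.0%, TC% +28.9%
```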
## Step 5: EXTRACT
Distill observations into reusable CuTe DSL patterns.
### Classification
| Type | Destination |
|---|---|
| Shared algorithmic/hardware pattern | |
| Shared failure mode | |
| CuTe DSL API pattern or implementation detail | |
| CuTe DSL trap or pitfall | |
Most patterns from CUTLASS examples will be CuTe DSL-specific (the API patterns, TMA setup idioms, barrier choreography). Shared patterns (algorithmic insights like pipeline-driven scheduling) may already exist from cuTile work.
### Extraction criteria
Document a pattern if it:
- Transfers to other CuTe DSL kernels (not just this one example)
- Is non-obvious from the CuTe DSL documentation alone
- Has performance implications visible in profiling or architecturally clear
- Refines an existing shared pattern with CuTe DSL-specific implementation details
### Pattern template
````markdown
## [Pattern Name]

### When to Apply
- [CuTe DSL kernel type, workload shape, bottleneck condition]

### Mechanism
[How the pattern works in CuTe DSL -- reference specific APIs]

### Affected Metrics
- [Metric 1]
- [Metric 2]

### Implementation
```python
# CuTe DSL code snippet, generalized from the CUTLASS example
```

### Source
[CUTLASS example name and path]

### Performance Evidence
Source: [example name, device, configuration]

| Config | Metric | Value | Context |
|---|---|---|---|

### Generalization
[Device-agnostic takeaway]

### Pitfalls
- [CuTe DSL-specific failure modes]

### Interactions
- [How this interacts with other CuTe DSL patterns]
````

---

## Step 6: DOCUMENT
- Write pattern files to `docs/knowledge/languages/cute-dsl/optimizations/` or `anti-patterns/`
- Update the index in `/optimization-catalog-cute-dsl`
- If a pattern is shared (not DSL-specific), write to `docs/knowledge/optimizations/` and update `/optimization-catalog`
- If you imported and profiled a kernel, write a devlog entry in `docs/kernels/`:

````markdown
## [Kernel Name] (Imported from CUTLASS)

### Overview
[What this kernel does and why it was imported for study]

### Source
Original: `third_party/cutlass/examples/python/CuTeDSL/<path>`
Imported: `src/mla_var3/kernel/cute_python/mla/<design>/`

### Performance
[Profiling results table]

### Patterns Extracted
- [Pattern] -> `docs/knowledge/languages/cute-dsl/optimizations/<name>.md`

### Insights
[What was learned that informs future CuTe DSL kernel design]
````

---

## Recommended Learning Paths
### Path A: CuTe DSL fundamentals
For building basic fluency before writing any MLA kernel:
```text
1. tutorial_gemm/fp16_gemm_0.py    -> Minimal kernel: grid, TMA load, MMA, store
2. tutorial_gemm/fp16_gemm_3.py    -> Software pipelining basics
3. tutorial_gemm/fp16_gemm_6.py    -> Full persistent GEMM with warpgroup specialization
4. dense_gemm_software_pipeline.py -> Production-quality pipelining
5. dense_gemm_persistent.py        -> Persistent scheduling pattern
```
### Path B: MLA-specific patterns
For directly studying how CUTLASS implements MLA:
```text
1. mla/mla_helpers.py     -> Shared utilities, tensor layouts, TMA setup
2. mla/mla_decode_fp16.py -> FP16 MLA decode with full pipeline
3. mla/mla_decode_fp8.py  -> FP8 variant (quantization handling)
4. fmha.py                -> Attention forward (different design choices)
```
### Path C: Advanced patterns
For specific optimization techniques:
```text
1. dense_gemm_persistent_prefetch.py      -> Prefetching in persistent kernels
2. dense_gemm_persistent_dynamic.py       -> Dynamic tile scheduling + clusters
3. epilogue/custom_epilogue_dense_gemm.py -> Fused epilogue patterns
4. programmatic_dependent_launch.py       -> Dependent kernel launch
```