cutlass-skill

CUTLASS & CuTeDSL Development

Source Code Locations

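The CUTLASS source lives in repos/cutlass/ under this skill's installation directory. The actual path depends on the tool you use:
  • Cursor:
    ~/.cursor/skills/cutlass-skill/repos/cutlass/
  • Claude Code:
    ~/.claude/skills/cutlass-skill/repos/cutlass/
  • Codex:
    ~/.agents/skills/cutlass-skill/repos/cutlass/
CUTLASS_REPO: the examples below use ~/.cursor/skills/cutlass-skill/repos/cutlass/ as a placeholder; replace it with your actual path.
If that path does not exist, run the following from the project directory:
bash update-repos.sh cutlass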
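
The lookup described above can also be done programmatically. The following is a minimal sketch that probes the three install locations listed above in order; the function name `find_cutlass_repo` and the `home` parameter are hypothetical and exist only for illustration.

```python
from pathlib import Path
from typing import Optional

# Tool-specific roots from the list above, relative to the home directory.
SKILL_ROOTS = (".cursor", ".claude", ".agents")


def find_cutlass_repo(home: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing repos/cutlass/ checkout, or None if absent."""
    home = Path.home() if home is None else Path(home)
    for root in SKILL_ROOTS:
        candidate = home / root / "skills" / "cutlass-skill" / "repos" / "cutlass"
        if candidate.is_dir():
            return candidate
    # Nothing found: the caller would run `bash update-repos.sh cutlass`.
    return None
```

Calling `find_cutlass_repo()` with no arguments checks the current user's home directory; a `None` result corresponds to the case where update-repos.sh still needs to be run.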

CuTeDSL (Python DSL for GPU Kernels)

CUTLASS_REPO/python/CuTeDSL/
├── cutlass/
│   ├── base_dsl/       # DSL foundations: types, variables, functions, PTX emit
│   ├── cute/           # CuTe Python bindings: Layout, Tensor, TiledMMA, TiledCopy
│   ├── cutlass_dsl/    # CUTLASS DSL: GEMM builder, epilogue, pipeline
│   ├── pipeline/       # Pipeline abstractions: MainloopPipeline, PipelineAsync
│   ├── jax/            # JAX integration
│   ├── utils/          # Compilation utilities, profiler, tensor utilities
│   └── torch.py        # PyTorch integration

CuTeDSL examples:

CUTLASS_REPO/examples/python/CuTeDSL/
├── ampere/             # Ampere: sgemm, tensorop_gemm, flash_attention_v2
├── hopper/             # Hopper: TMA gemm, FP8, grouped GEMM
├── blackwell/          # Blackwell: blockwise_gemm
├── blackwell_geforce/  # GeForce Blackwell
├── cute/               # CuTe tutorials (Python)
├── distributed/        # Distributed GEMM
├── experimental/       # Experimental features
├── jax/                # JAX examples
├── notebooks/          # Jupyter notebooks (English)
├── notebooks-zh/       # Jupyter notebooks (Chinese)
└── advanced_compiler_control/  # Advanced compiler control

CUTLASS C++ Examples (by architecture)

CUTLASS_REPO/examples/
├── 00-47:  Ampere and earlier architectures
├── 48-69:  Hopper (sm_90), plus 88_hopper_fmha
│   ├── 48_hopper_warp_specialized_gemm
│   ├── 49_hopper_gemm_with_collective_builder
│   ├── 54_hopper_fp8_warp_specialized_gemm
│   ├── 55_hopper_mixed_dtype_gemm
│   ├── 57_hopper_grouped_gemm
│   ├── 62_hopper_sparse_gemm
│   ├── 67_hopper_fp8..._blockwise_scaling
│   ├── 88_hopper_fmha
│   └── ...
├── 70-93:  Blackwell (sm_100)
│   ├── 70_blackwell_gemm
│   ├── 71_blackwell_gemm_with_collective_builder
│   ├── 72_blackwell_narrow_precision_gemm
│   ├── 77_blackwell_fmha
│   ├── 81_blackwell_gemm_blockwise
│   ├── 83_blackwell_sparse_gemm
│   ├── 92_blackwell_moe_gemm
│   ├── 93_blackwell_low_latency_gqa
│   └── ...
└── cute/tutorial/      # CuTe C++ tutorials (sgemm, tiled_copy, hopper, blackwell)
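
The numbering convention in the tree above can be turned into a small lookup. This is a sketch only: `classify_example` is a hypothetical helper, the ranges are the ones stated in the tree, and a name keyword takes priority over the number because 88_hopper_fmha is a Hopper example whose number falls in the Blackwell range.

```python
import re

# (low, high, label) ranges exactly as stated in the tree above.
RANGES = (
    (0, 47, "Ampere and earlier"),
    (48, 69, "Hopper (sm_90)"),
    (70, 93, "Blackwell (sm_100)"),
)

# Keywords in the directory name override the numeric range.
KEYWORDS = (("hopper", "Hopper (sm_90)"), ("blackwell", "Blackwell (sm_100)"))


def classify_example(dirname: str) -> str:
    """Guess the target architecture of an examples/ subdirectory."""
    lowered = dirname.lower()
    for keyword, arch in KEYWORDS:
        if keyword in lowered:
            return arch
    match = re.match(r"(\d+)_", dirname)
    if match:
        number = int(match.group(1))
        for low, high, arch in RANGES:
            if low <= number <= high:
                return arch
    return "unknown"
```

For instance, `classify_example("88_hopper_fmha")` resolves through the keyword rather than the number, while `classify_example("47_ampere_gemm_universal_streamk")` falls back to the numeric range.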

CuTe C++ Headers

CUTLASS_REPO/include/cute/
├── layout.hpp          # Layout core: Shape, Stride, composition
├── tensor.hpp          # Tensor: make_tensor, local_tile, partition
├── swizzle.hpp         # Swizzle patterns
├── algorithm/          # copy, gemm, fill, clear
├── arch/               # Architecture-specific: copy_sm90, mma_sm90, copy_sm100
├── atom/               # MMA atom and Copy atom definitions
│   ├── mma_atom.hpp
│   ├── copy_atom.hpp
│   └── mma_traits_sm90_gmma.hpp  # WGMMA traits
├── numeric/            # Numeric types
└── container/          # tuple, array

CUTLASS C++ Headers

CUTLASS_REPO/include/cutlass/
├── gemm/               # GEMM device layer, collective, kernel
│   ├── collective/     # CollectiveMainloop, CollectiveEpilogue
│   ├── kernel/         # GemmUniversal
│   └── device/         # Device launch interface
├── epilogue/           # Epilogue: bias, activation, scaling
├── conv/               # Convolution
├── arch/               # MMA instruction wrappers (mma_sm90.h, mma_sm100.h)
├── pipeline/           # Pipeline: PipelineTmaAsync, PipelineAsync
├── experimental/       # Experimental APIs
└── detail/             # Internal implementation details

pycute (Python CuTe bindings)

CUTLASS_REPO/python/pycute/
├── layout.py           # Layout, make_layout, complement, coalesce
├── int_tuple.py        # IntTuple operations
├── swizzle.py          # Swizzle
└── typing.py           # Type definitions
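
As background for the Layout files listed above, here is a minimal pure-Python sketch of the core idea: a layout pairs a shape with strides and maps a coordinate to a linear offset. It handles flat (non-nested) shapes only, and `coord_to_offset` and `index_to_coord` are hypothetical names for illustration, not the pycute API.

```python
from typing import Sequence, Tuple


def coord_to_offset(coord: Sequence[int], shape: Sequence[int],
                    stride: Sequence[int]) -> int:
    """Map a coordinate to an offset: the inner product of coord and stride."""
    if not (len(coord) == len(shape) == len(stride)):
        raise ValueError("coord, shape and stride must have the same rank")
    for c, extent in zip(coord, shape):
        if not 0 <= c < extent:
            raise IndexError(f"coordinate {c} outside extent {extent}")
    return sum(c * d for c, d in zip(coord, stride))


def index_to_coord(index: int, shape: Sequence[int]) -> Tuple[int, ...]:
    """Unflatten a 1-D index so that the leftmost mode varies fastest."""
    coord = []
    for extent in shape:
        coord.append(index % extent)
        index //= extent
    return tuple(coord)


# A 4x8 shape with two different stride choices.
shape = (4, 8)
column_major = (1, 4)   # stepping along mode 0 moves by 1 element
row_major = (8, 1)      # stepping along mode 1 moves by 1 element

print(coord_to_offset((2, 3), shape, column_major))  # 2*1 + 3*4 = 14
print(coord_to_offset((2, 3), shape, row_major))     # 2*8 + 3*1 = 19
print(index_to_coord(14, shape))                     # (2, 3)
```

The same shape with different strides yields different memory offsets for the same logical coordinate, which is the distinction the Layout sources express in their own, more general form.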

Search Strategy

Search with the Grep tool; do not load entire files.
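
Where no grep tool is at hand, the same strategy can be sketched in plain Python: stream each file line by line and yield only the matching lines, so no file is ever held in memory whole. The helper name `grep_tree` and the default suffix list are assumptions made for this illustration.

```python
import re
from pathlib import Path
from typing import Iterator, Sequence, Tuple


def grep_tree(pattern: str, root: Path,
              suffixes: Sequence[str] = (".py", ".hpp", ".h", ".cu"),
              ) -> Iterator[Tuple[Path, int, str]]:
    """Yield (path, line number, line) for every line matching pattern."""
    regex = re.compile(pattern)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # Iterating over the handle reads one line at a time.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path, lineno, line.rstrip("\n")
```

Because the function is a generator, a caller can stop after the first few hits without scanning the rest of the tree.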

CuTeDSL Usage

bash
CUTLASS_REPO="$HOME/.cursor/skills/cutlass-skill/repos/cutlass"

# Find CuTeDSL GEMM examples (the pattern matches both @jit and @cute.jit)
rg '@(cute\.)?jit' "$CUTLASS_REPO/examples/python/CuTeDSL/"

# Find TiledMMA usage
rg "TiledMMA|tiled_mma" "$CUTLASS_REPO/python/CuTeDSL/cutlass/cute/"

# Find pipeline usage
rg "MainloopPipeline|PipelineAsync" "$CUTLASS_REPO/python/CuTeDSL/cutlass/pipeline/"

# Find Blackwell CuTeDSL examples
rg "sm_100|blackwell" "$CUTLASS_REPO/examples/python/CuTeDSL/blackwell/"

CuTe C++ Usage

bash
# Find Layout operations
rg "make_layout|composition|complement" "$CUTLASS_REPO/include/cute/layout.hpp"

# Find TiledCopy usage
rg "TiledCopy|make_tiled_copy" "$CUTLASS_REPO/include/cute/"

# Find MMA atom traits
rg "MMA_Traits" "$CUTLASS_REPO/include/cute/atom/"

# Find Hopper WGMMA
rg "SM90_64x" "$CUTLASS_REPO/include/cute/atom/mma_traits_sm90_gmma.hpp"

# Find TMA copy
rg "SM90_TMA" "$CUTLASS_REPO/include/cute/arch/"

CUTLASS Collective Builder

bash
# Find CollectiveBuilder usage
rg "CollectiveBuilder" "$CUTLASS_REPO/examples/49_hopper_gemm_with_collective_builder/"

# Find the Collective Mainloop
rg "CollectiveMainloop" "$CUTLASS_REPO/include/cutlass/gemm/collective/"

# Find epilogue fusion
rg "fusion|EVT" "$CUTLASS_REPO/include/cutlass/epilogue/"

# Find kernel launch templates
rg "GemmUniversal" "$CUTLASS_REPO/include/cutlass/gemm/device/"

GEMM Example Search

bash
# Find FP8 GEMM configuration
rg "float_e4m3|float_e5m2|fp8" "$CUTLASS_REPO/examples/54_hopper_fp8_warp_specialized_gemm/"

# Find blockwise scaling
rg "blockwise|block_scale" "$CUTLASS_REPO/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/"

# Find grouped GEMM
rg "grouped|ProblemShape::Group" "$CUTLASS_REPO/examples/57_hopper_grouped_gemm/"

# Find sparse GEMM
rg "sparse|Sparse" "$CUTLASS_REPO/examples/62_hopper_sparse_gemm/"

# Find StreamK
rg "StreamK|stream_k" "$CUTLASS_REPO/examples/47_ampere_gemm_universal_streamk/"

When to Use Each Source

| Need | Source | Path |
| --- | --- | --- |
| Getting started with CuTeDSL | CuTeDSL examples | examples/python/CuTeDSL/ampere/ |
| CuTeDSL Hopper GEMM | CuTeDSL examples | examples/python/CuTeDSL/hopper/ |
| CuTeDSL Blackwell GEMM | CuTeDSL examples | examples/python/CuTeDSL/blackwell/ |
| CuTeDSL API definitions | CuTeDSL source | python/CuTeDSL/cutlass/ |
| CuTe Layout semantics | CuTe headers | include/cute/layout.hpp |
| CuTe Tensor operations | CuTe headers | include/cute/tensor.hpp |
| MMA atom traits | CuTe atom | include/cute/atom/ |
| TMA copy (architecture-specific) | CuTe arch | include/cute/arch/copy_sm90* |
| CUTLASS GEMM templates | CUTLASS examples | examples/48-93_* |
| Collective Builder | CUTLASS examples | examples/49_hopper_gemm_with_collective_builder/ |
| Collective Mainloop | CUTLASS headers | include/cutlass/gemm/collective/ |
| Epilogue fusion | CUTLASS headers | include/cutlass/epilogue/ |
| Pipeline abstractions | CUTLASS headers | include/cutlass/pipeline/ |
| pycute Layout operations | pycute | python/pycute/ |
| Hopper FMHA | CUTLASS examples | examples/88_hopper_fmha/ |
| Blackwell FMHA | CUTLASS examples | examples/77_blackwell_fmha/ |
| MoE GEMM | CUTLASS examples | examples/92_blackwell_moe_gemm/ |

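CuTeDSL Writing Patterns

Basic Elementwise Kernel

python
# Schematic only: the import path and decorator are as given in this skill.
# Check the referenced example for the exact form used in your checkout.
from cutlass import jit, Float32

@jit
def add_kernel(x: Float32, y: Float32) -> Float32:
    return x + y

See examples/python/CuTeDSL/ampere/elementwise_add.py for the complete example.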

CuTeDSL GEMM

See examples/python/CuTeDSL/ampere/sgemm.py for a basic SGEMM. See examples/python/CuTeDSL/hopper/ for the Hopper TMA GEMM.

CuTe C++ GEMM Patterns

See examples/cute/tutorial/sgemm_1.cu through sgemm_sm80.cu for the progressive CuTe SGEMM tutorial.

Compilation Reference

bash
# Build a CUTLASS example
cd "$CUTLASS_REPO" && mkdir -p build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=90a  # Hopper
cmake --build . --target 49_hopper_gemm_with_collective_builder

# Build for specific architectures
cmake .. -DCUTLASS_NVCC_ARCHS="80;90a;100a"

# Run CuTeDSL (from the CUTLASS_REPO root, not from build/)
cd "$CUTLASS_REPO"
pip install -e python/CuTeDSL/
python examples/python/CuTeDSL/ampere/sgemm.py
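
When the build is driven from a script, the multi-architecture flag is easy to get wrong because the semicolons in the architecture list need shell quoting. A minimal sketch, assuming hypothetical helper names `cmake_configure_args` and `cmake_build_args`, builds the argument lists instead, so no shell quoting is involved:

```python
from typing import List, Sequence


def cmake_configure_args(archs: Sequence[str], source_dir: str = "..") -> List[str]:
    """Build the configure command for the given architecture list."""
    if not archs:
        raise ValueError("at least one architecture is required")
    # The flag value is a semicolon-separated list, e.g. "80;90a;100a".
    return ["cmake", source_dir, "-DCUTLASS_NVCC_ARCHS=" + ";".join(archs)]


def cmake_build_args(target: str, build_dir: str = ".") -> List[str]:
    """Build the command that compiles a single example target."""
    return ["cmake", "--build", build_dir, "--target", target]


print(cmake_configure_args(["80", "90a", "100a"]))
print(cmake_build_args("49_hopper_gemm_with_collective_builder"))
```

Each list can be handed to `subprocess.run(args, cwd=build_dir, check=True)`; passing a list rather than a string keeps the semicolons from being interpreted as command separators.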

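Troubleshooting

| Problem | Likely cause | Where to look |
| --- | --- | --- |
| GEMM results are inaccurate | Epilogue is not configured with the correct accumulator type | rg "ElementAccumulator" examples/49_* |
| TMA errors | Tensor alignment does not satisfy 128B | rg -e Alignment -e alignment examples/48_* |
| Collective Builder fails to compile | Architecture mismatch or incorrect pipeline configuration | rg -e KernelSchedule -e EpilogueSchedule examples/49_* |
| CuTeDSL type errors | DSL type does not match the CUDA type | rg -e dtype -e element_type examples/python/CuTeDSL/ |
| Layout swizzle errors | Swizzle pattern is incompatible with the data layout | include/cute/swizzle.hpp |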
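
To see why the accumulator type matters for accuracy, the sketch below emulates a dot product whose running sum is rounded to half or single precision after every step. It uses only the standard `struct` module and does not model how any CUTLASS kernel rounds; `round_to` and `dot` are hypothetical helpers for illustration.

```python
import struct


def round_to(fmt: str, value: float) -> float:
    """Round a Python float to the precision of a struct format code."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def dot(a, b, acc_fmt: str) -> float:
    """Dot product whose accumulator is rounded to acc_fmt after each step."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_to(acc_fmt, acc + x * y)
    return acc


ones = [1.0] * 4096
half_acc = dot(ones, ones, "e")    # "e": IEEE half-precision accumulator
single_acc = dot(ones, ones, "f")  # "f": IEEE single-precision accumulator

print(half_acc)    # 2048.0: adding 1 to 2048 rounds back to 2048 in half precision
print(single_acc)  # 4096.0
```

The half-precision sum stalls at 2048 because 2049 is not representable in that format, so every later addition is rounded away. A reduction that is wrong in this way is a hint to check the accumulator type before looking elsewhere.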

Updating the CUTLASS Source

bash
# From the cursor-gpu-skills project directory
bash update-repos.sh cutlass

Additional References

  • Official CUTLASS documentation: https://github.com/NVIDIA/cutlass
  • CuTe documentation:
    CUTLASS_REPO/media/docs/cute/
    (if installed in --full mode)
  • CuTeDSL notebooks:
    examples/python/CuTeDSL/notebooks/