cutlass-skill
CUTLASS & CuTeDSL Development
Source Code Locations
The CUTLASS source lives under `repos/cutlass/` in this skill's install directory. The actual path depends on the tool you use:
- Cursor: `~/.cursor/skills/cutlass-skill/repos/cutlass/`
- Claude Code: `~/.claude/skills/cutlass-skill/repos/cutlass/`
- Codex: `~/.agents/skills/cutlass-skill/repos/cutlass/`

`CUTLASS_REPO`: the examples below use this placeholder, set to `~/.cursor/skills/cutlass-skill/repos/cutlass/`; replace it with your actual path. If that path does not exist, run `bash update-repos.sh cutlass` from the project directory.
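The path lookup described above can be scripted. The following is a minimal sketch, assuming the three install locations listed above are the only candidates; the function name `find_cutlass_repo` is hypothetical and not part of the skill.

```bash
# Hypothetical helper: print the first install location that exists on disk.
find_cutlass_repo() {
  for base in "$HOME/.cursor" "$HOME/.claude" "$HOME/.agents"; do
    candidate="$base/skills/cutlass-skill/repos/cutlass"
    if [ -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Export the placeholder used by later commands, or explain how to fetch the source.
if CUTLASS_REPO="$(find_cutlass_repo)"; then
  export CUTLASS_REPO
else
  echo "CUTLASS source not found; run: bash update-repos.sh cutlass" >&2
fi
```

Probing in a fixed order keeps the result deterministic when more than one tool is installed; reorder the list if a different tool should take precedence.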
CuTeDSL (Python DSL for GPU Kernels)
CUTLASS_REPO/python/CuTeDSL/
├── cutlass/
│   ├── base_dsl/ # DSL foundations: types, variables, functions, PTX emit
│   ├── cute/ # CuTe Python bindings: Layout, Tensor, TiledMMA, TiledCopy
│   ├── cutlass_dsl/ # CUTLASS DSL: GEMM builder, epilogue, pipeline
│   ├── pipeline/ # Pipeline abstractions: MainloopPipeline, PipelineAsync
│   ├── jax/ # JAX integration
│   ├── utils/ # Compilation utilities, profiler, tensor utilities
│   └── torch.py # PyTorch integration
CuTeDSL examples:
CUTLASS_REPO/examples/python/CuTeDSL/
├── ampere/ # Ampere: sgemm, tensorop_gemm, flash_attention_v2
├── hopper/ # Hopper: TMA gemm, FP8, grouped GEMM
├── blackwell/ # Blackwell: blockwise_gemm
├── blackwell_geforce/ # GeForce Blackwell
├── cute/ # CuTe tutorials (Python)
├── distributed/ # Distributed GEMM
├── experimental/ # Experimental features
├── jax/ # JAX examples
├── notebooks/ # Jupyter notebooks (English)
├── notebooks-zh/ # Jupyter notebooks (Chinese)
└── advanced_compiler_control/ # Advanced compiler control
CUTLASS C++ Examples (by architecture)
CUTLASS_REPO/examples/
├── 00-47: Ampere and earlier architectures
├── 48-69: Hopper (sm_90)
│   ├── 48_hopper_warp_specialized_gemm
│   ├── 49_hopper_gemm_with_collective_builder
│   ├── 54_hopper_fp8_warp_specialized_gemm
│   ├── 55_hopper_mixed_dtype_gemm
│   ├── 57_hopper_grouped_gemm
│   ├── 62_hopper_sparse_gemm
│   ├── 67_hopper_fp8..._blockwise_scaling
│   ├── 88_hopper_fmha # Hopper example numbered outside the 48-69 range
│   └── ...
├── 70-93: Blackwell (sm_100)
│   ├── 70_blackwell_gemm
│   ├── 71_blackwell_gemm_with_collective_builder
│   ├── 72_blackwell_narrow_precision_gemm
│   ├── 77_blackwell_fmha
│   ├── 81_blackwell_gemm_blockwise
│   ├── 83_blackwell_sparse_gemm
│   ├── 92_blackwell_moe_gemm
│   ├── 93_blackwell_low_latency_gqa
│   └── ...
└── cute/tutorial/ # CuTe C++ tutorials (sgemm, tiled_copy, hopper, blackwell)
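Because the numbering is only a rough guide to architecture (`88_hopper_fmha` is listed under Hopper although its number falls in the Blackwell range), filtering example directories by name can be more reliable than filtering by number. A minimal sketch, assuming example directories follow the `NN_name` pattern shown above; the helper name `examples_for_arch` is hypothetical.

```bash
# Hypothetical helper: list numbered example directories whose name contains a keyword.
# usage: examples_for_arch REPO_DIR KEYWORD   (e.g. KEYWORD=hopper or blackwell)
examples_for_arch() (
  cd "$1/examples" || exit 1
  for d in [0-9]*_*"$2"*/; do
    # An unmatched glob stays literal, so confirm the directory exists.
    if [ -d "$d" ]; then
      printf '%s\n' "${d%/}"
    fi
  done
)
```

For example, `examples_for_arch "$CUTLASS_REPO" hopper` prints one directory name per line. The function body runs in a subshell, so the `cd` does not affect the calling shell.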
CuTe C++ Headers
CUTLASS_REPO/include/cute/
├── layout.hpp # Layout core: Shape, Stride, composition
├── tensor.hpp # Tensor: make_tensor, local_tile, partition
├── swizzle.hpp # Swizzle patterns
├── algorithm/ # copy, gemm, fill, clear
├── arch/ # Architecture-specific: copy_sm90, mma_sm90, copy_sm100
├── atom/ # MMA atom and Copy atom definitions
│   ├── mma_atom.hpp
│   ├── copy_atom.hpp
│   └── mma_traits_sm90_gmma.hpp # WGMMA traits
├── numeric/ # Numeric types
└── container/ # tuple, array
CUTLASS C++ Headers
CUTLASS_REPO/include/cutlass/
├── gemm/ # GEMM device layer, collective, kernel
│   ├── collective/ # CollectiveMainloop, CollectiveEpilogue
│   ├── kernel/ # GemmUniversal
│   └── device/ # Device launch interface
├── epilogue/ # Epilogue: bias, activation, scaling
├── conv/ # Convolution
├── arch/ # MMA instruction wrappers (mma_sm90.h, mma_sm100.h)
├── pipeline/ # Pipeline: PipelineTmaAsync, PipelineAsync
├── experimental/ # Experimental APIs
└── detail/ # Internal implementation details
pycute (Python CuTe bindings)
CUTLASS_REPO/python/pycute/
├── layout.py # Layout, make_layout, complement, coalesce
├── int_tuple.py # IntTuple operations
├── swizzle.py # Swizzle
└── typing.py # Type definitions
Search Strategy
Search with the Grep tool; do not load entire files.
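One way to follow this rule from a shell is to search first with a bounded result count and then print only the matching region of a file. The sketch below is illustrative; the helper names `search_src` and `show_lines` are hypothetical, and the fallback to `grep` is an assumption for machines without ripgrep.

```bash
# Hypothetical helper: list at most five matching lines per file as path:line:text.
# usage: search_src PATTERN DIR
search_src() {
  if command -v rg >/dev/null 2>&1; then
    rg --no-heading --line-number --max-count 5 -- "$1" "$2"
  else
    grep -rnE -m 5 -- "$1" "$2"   # fallback when ripgrep is not installed
  fi
}

# Hypothetical helper: print only lines START..END of one file.
# usage: show_lines FILE START END
show_lines() {
  sed -n "${2},${3}p" "$1"
}
```

A typical sequence is `search_src "TiledMMA|tiled_mma" "$CUTLASS_REPO/python/CuTeDSL/cutlass/cute/"` to locate a hit, followed by `show_lines` on the reported file with a window of a few dozen lines around the reported line number.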
CuTeDSL Usage
bash
CUTLASS_REPO="$HOME/.cursor/skills/cutlass-skill/repos/cutlass"
# Find CuTeDSL GEMM examples (the pattern matches both @jit and @cute.jit)
rg "@(cute\.)?jit" "$CUTLASS_REPO/examples/python/CuTeDSL/"
# Find TiledMMA usage
rg "TiledMMA|tiled_mma" "$CUTLASS_REPO/python/CuTeDSL/cutlass/cute/"
# Find pipeline usage
rg "MainloopPipeline|PipelineAsync" "$CUTLASS_REPO/python/CuTeDSL/cutlass/pipeline/"
# Find Blackwell CuTeDSL examples
rg "sm_100|blackwell" "$CUTLASS_REPO/examples/python/CuTeDSL/blackwell/"
CuTe C++ Usage
bash
# Find Layout operations
rg "make_layout|composition|complement" "$CUTLASS_REPO/include/cute/layout.hpp"
# Find TiledCopy usage
rg "TiledCopy|make_tiled_copy" "$CUTLASS_REPO/include/cute/"
# Find MMA atom traits
rg "MMA_Traits" "$CUTLASS_REPO/include/cute/atom/"
# Find Hopper WGMMA
rg "SM90_64x" "$CUTLASS_REPO/include/cute/atom/mma_traits_sm90_gmma.hpp"
# Find TMA copy
rg "SM90_TMA" "$CUTLASS_REPO/include/cute/arch/"
CUTLASS Collective Builder
bash
# Find CollectiveBuilder usage
rg "CollectiveBuilder" "$CUTLASS_REPO/examples/49_hopper_gemm_with_collective_builder/"
# Find Collective Mainloop
rg "CollectiveMainloop" "$CUTLASS_REPO/include/cutlass/gemm/collective/"
# Find epilogue fusion
rg "fusion|EVT" "$CUTLASS_REPO/include/cutlass/epilogue/"
# Find the kernel launch template
rg "GemmUniversal" "$CUTLASS_REPO/include/cutlass/gemm/device/"
GEMM Example Search
bash
# Find FP8 GEMM configuration
rg "float_e4m3|float_e5m2|fp8" "$CUTLASS_REPO/examples/54_hopper_fp8_warp_specialized_gemm/"
# Find blockwise scaling
rg "blockwise|block_scale" "$CUTLASS_REPO/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/"
# Find grouped GEMM
rg "grouped|ProblemShape::Group" "$CUTLASS_REPO/examples/57_hopper_grouped_gemm/"
# Find sparse GEMM
rg "sparse|Sparse" "$CUTLASS_REPO/examples/62_hopper_sparse_gemm/"
# Find StreamK
rg "StreamK|stream_k" "$CUTLASS_REPO/examples/47_ampere_gemm_universal_streamk/"
When to Use Each Source
| Need | Source | Path |
|---|---|---|
| Getting started with CuTeDSL | CuTeDSL examples | |
| CuTeDSL Hopper GEMM | CuTeDSL examples | |
| CuTeDSL Blackwell GEMM | CuTeDSL examples | |
| CuTeDSL API definitions | CuTeDSL source | |
| CuTe Layout semantics | CuTe headers | |
| CuTe Tensor operations | CuTe headers | |
| MMA atom traits | CuTe atom | |
| TMA copy architecture | CuTe arch | |
| CUTLASS GEMM templates | CUTLASS examples | |
| Collective Builder | CUTLASS examples | |
| Collective Mainloop | CUTLASS headers | |
| Epilogue fusion | CUTLASS headers | |
| Pipeline abstractions | CUTLASS headers | |
| pycute Layout operations | pycute | |
| Hopper FMHA | CUTLASS examples | |
| Blackwell FMHA | CUTLASS examples | |
| MoE GEMM | CUTLASS examples | |
CuTeDSL Authoring Patterns
Basic Elementwise Kernel
python
# Schematic only: the exact imports and decorator names depend on the installed
# CuTeDSL version, so treat the example file referenced below as authoritative.
from cutlass import jit, Int32, Float32
@jit
def add_kernel(x: Float32, y: Float32) -> Float32:
    return x + y
See `examples/python/CuTeDSL/ampere/elementwise_add.py` for a complete example.
CuTeDSL GEMM
See `examples/python/CuTeDSL/ampere/sgemm.py` for a basic SGEMM.
See `examples/python/CuTeDSL/hopper/` for Hopper TMA GEMMs.
CuTe C++ GEMM Patterns
See `examples/cute/tutorial/sgemm_1.cu` through `sgemm_sm80.cu` for a progressive CuTe SGEMM tutorial.
Compilation Reference
bash
# Build a CUTLASS example
cd "$CUTLASS_REPO" && mkdir -p build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=90a # Hopper
cmake --build . --target 49_hopper_gemm_with_collective_builder
# Build for specific architectures
cmake .. -DCUTLASS_NVCC_ARCHS="80;90a;100a"
# Run CuTeDSL (paths are relative to the repository root)
cd "$CUTLASS_REPO"
pip install -e python/CuTeDSL/
python examples/python/CuTeDSL/ampere/sgemm.py
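The `CUTLASS_NVCC_ARCHS` value can be derived from a GPU's compute capability instead of being typed by hand. The sketch below is a hypothetical helper (`cc_to_arch` is not part of CUTLASS). It adds the `a` suffix only for the 90 and 100 targets, mirroring the values used above, and passes every other capability through unchanged, which is an assumption that may not suit every architecture.

```bash
# Hypothetical helper: convert a compute capability such as "9.0" into a
# CUTLASS_NVCC_ARCHS value, adding the "a" suffix for the 90 and 100 targets.
cc_to_arch() {
  cc="$(printf '%s' "$1" | tr -d '.')"   # "9.0" -> "90"
  case "$cc" in
    90|100) printf '%sa\n' "$cc" ;;
    *)      printf '%s\n' "$cc" ;;
  esac
}

# Example use (assumes nvidia-smi is installed and supports the compute_cap field):
#   cc="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
#   cmake .. -DCUTLASS_NVCC_ARCHS="$(cc_to_arch "$cc")"
```

Keeping the mapping in one small function makes it easy to adjust if a different suffix is needed for a given target.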
Troubleshooting
| Problem | Likely cause | Where to look |
|---|---|---|
| GEMM results are inaccurate | Epilogue is not configured with the correct accumulator type | |
| TMA errors | Tensor alignment does not satisfy the 128B requirement | |
| Collective Builder fails to compile | Architecture mismatch or incorrect pipeline configuration | |
| CuTeDSL type errors | DSL type does not match the CUDA type | |
| Layout swizzle errors | Swizzle pattern is incompatible with the data layout | |
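For the alignment row, a quick arithmetic check can rule out the most common mistake before digging into kernel code. The sketch below is illustrative only: the table does not say which quantity must be aligned (base address, stride, or shared-memory destination), so checking the row pitch is an assumed example, and `is_aligned` is a hypothetical helper.

```bash
# Hypothetical helper: succeed when a byte count is a multiple of the alignment.
# usage: is_aligned BYTES [ALIGNMENT]   (ALIGNMENT defaults to 128, from the table)
is_aligned() {
  [ $(( $1 % ${2:-128} )) -eq 0 ]
}

# Example: row pitch of a row-major matrix with 1000 two-byte (half precision) columns.
cols=1000
elem_bytes=2
if is_aligned $(( cols * elem_bytes )); then
  echo "row pitch is 128B aligned"
else
  echo "row pitch is NOT 128B aligned; pad to $(( (cols * elem_bytes + 127) / 128 * 128 )) bytes"
fi
```

With these numbers the pitch is 2000 bytes, which is not a multiple of 128, so the script suggests padding each row to 2048 bytes.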
Updating the CUTLASS Source
bash
# Run from the cursor-gpu-skills project directory
bash update-repos.sh cutlass
Additional References
- CUTLASS official documentation: https://github.com/NVIDIA/cutlass
- CuTe documentation: `CUTLASS_REPO/media/docs/cute/` (if installed in `--full` mode)
- CuTeDSL notebooks: `examples/python/CuTeDSL/notebooks/`