Found 2 Skills
Write, debug, and optimize CUTLASS and CuTeDSL GPU kernels using local source code, examples, and header references. Use when the user mentions CUTLASS, CuTe, CuTeDSL, cute::Layout, cute::Tensor, TiledMMA, TiledCopy, CollectiveMainloop, CollectiveEpilogue, GEMM kernel, grouped GEMM, sparse GEMM, flash attention CUTLASS, blackwell GEMM, hopper GEMM, FP8 GEMM, blockwise scaling, MoE GEMM, StreamK, warp specialization CUTLASS, TMA CUTLASS, or asks about writing high-performance CUDA kernels with CUTLASS/CuTe templates.
CUDA/GPU computing guardrails, patterns, and best practices for AI-assisted development. Use when working with CUDA files (.cu, .cuh), or when the user mentions CUDA/GPU programming. Provides kernel design patterns, memory hierarchy guidelines, and occupancy optimization specific to this project's coding standards.