Loading...
Loading...
Found 2 Skills
Modify, build, test, debug, and contribute to NVIDIA cuOpt (C++/CUDA, Python, server, CI). Use for solver internals, PRs, DCO, and code conventions.
Write, debug, and optimize CUTLASS and CuTeDSL GPU kernels using local source code, examples, and header references. Use when the user mentions CUTLASS, CuTe, CuTeDSL, cute::Layout, cute::Tensor, TiledMMA, TiledCopy, CollectiveMainloop, CollectiveEpilogue, GEMM kernel, grouped GEMM, sparse GEMM, flash attention CUTLASS, blackwell GEMM, hopper GEMM, FP8 GEMM, blockwise scaling, MoE GEMM, StreamK, warp specialization CUTLASS, TMA CUTLASS, or asks about writing high-performance CUDA kernels with CUTLASS/CuTe templates.