Loading...
Loading...
Found 6 Skills
Shared optimization guidance plus CuTe Python DSL overlays. Use when: (1) selecting optimizations for a CuTe Python DSL kernel, (2) deciding whether a finding is shared or cute-dsl-specific, (3) recording CuTe Python DSL implementation notes, (4) reviewing the knowledge layout for cute-dsl work, (5) mapping shared patterns to a CuTe Python DSL implementation surface.
CuTe Python DSL API reference and implementation patterns for NVIDIA GPU kernel programming. Provides execution model, core API table, key constraints, common patterns, and documentation index. Use when: (1) writing or modifying CuTe DSL kernel code, (2) looking up CuTe DSL API syntax, (3) implementing attention/GEMM/MLA patterns in CuTe DSL, (4) understanding CuTe DSL execution model and compilation pipeline, (5) checking what CuTe DSL can and cannot do.
Workflow for learning CuTe Python DSL by reading, importing, profiling, and extracting reusable patterns from CUTLASS Blackwell example kernels. Use when: (1) studying CUTLASS CuTe DSL reference implementations, (2) importing CUTLASS examples into the project runtime infrastructure, (3) building CuTe DSL knowledge base entries from profiling experiments, (4) understanding CuTe DSL API patterns, TMA pipelining, warpgroup scheduling, or persistent kernel structure.
CuTe Python DSL kernel workflow, CuteKernel runtime wrapper, suitability gate, tiling guidance, and CuTe-specific pitfalls. Use when: (1) planning or implementing a kernel in the CuTe Python DSL, (2) the optimization needs more explicit control than cuTile exposes but should remain in a Python-driven workflow, (3) defining package naming for cute-dsl kernels, (4) documenting CuTe Python DSL design choices, (5) recording language-specific knowledge for CuTe Python DSL.
Performance optimization coordination playbook. Contains specialist routing table, TileIR two-step pipeline, kernel generation specialist selection, prioritization criteria, and safe modification workflow. Use when the user asks to apply optimizations, write kernels, or improve performance. Covers both user-specified optimization and autopilot-driven iterative optimization.
Write and implement GPU kernels using NVIDIA CuTe DSL (CUTLASS 4.x Python API) — NOT for Triton, CUDA C++, or conceptual explanations. Trigger only when the user wants to write or implement a kernel, not when asking questions about CuTe DSL concepts or layouts. CuTe DSL uses cute.jit/cute.kernel decorators and cutlass.cute imports. Covers element-wise kernels, GEMM patterns, reductions, memory hierarchy (global/shared/register/TMA), MMA tensor core operations, software pipelining, and framework integration.