Found 2 Skills
Use when adding, modifying, optimizing, or debugging cuTile autotuning code. Trigger signals: `exhaustive_search` / `replace_hints` / `hints_fn` / `cuda.tile.tune` in code, `autotune` in filenames, or correctness/performance issues in autotuned cuTile kernels. Covers: tune-once/cache/launch pattern, per-architecture configs (sm80–sm120), parameter space design (tile sizes, occupancy, num_ctas), and 7 common pitfalls with solutions.
CuTe Python DSL kernel workflow, CuteKernel runtime wrapper, suitability gate, tiling guidance, and CuTe-specific pitfalls. Use when: (1) planning or implementing a kernel in the CuTe Python DSL, (2) the optimization needs more explicit control than cuTile exposes but should remain in a Python-driven workflow, (3) defining package naming for cute-dsl kernels, (4) documenting CuTe Python DSL design choices, (5) recording language-specific knowledge for CuTe Python DSL.