Loading...
Loading...
Found 2 Skills
NCU-driven iterative optimization workflow for CUDA/CUTLASS/Triton/CuTe DSL kernels. MANDATORY: every optimization MUST start with NCU profiling, followed by multi-dimensional analysis, then targeted code modification, then re-profiling to verify. Supports roofline, memory hierarchy, warp stalls, instruction mix, occupancy, divergence analysis. Provides implementation-specific code modifications: Native CUDA (launch config, memory patterns, async copy, Tensor Core), CUTLASS (ThreadblockShape, stages, epilogue, schedule policy, alignment), Triton (autotune params, compiler hints, tl.* API patterns), CuTe DSL (threads_per_cta, elems_per_thread, tiled_copy, copy atom, shared memory, warp/cta reduce). Use when optimizing any CUDA kernel performance.
GPU kernel profiling workflow across supported kernel implementation languages. Provides commands for all 4 profiling modes (annotation, event, ncu, nsys), metric interpretation tables, bottleneck identification rules, and the output contract for returning compact results to the orchestrator. Use when: (1) profiling a kernel version, (2) interpreting profiling artifacts/reports, (3) comparing kernel versions, (4) identifying bottlenecks and optimization opportunities, (5) documenting performance in the development log.