Search: cutile - AI Agent Skills

AI & Machine Learning | nvidia/skills

tilegym-cutile-python

Expert cuTile programming assistant. Write high-performance GPU kernels using cuTile's tile-based programming model with proper validation and optimization. Supports deep agent orchestration for complex multi-kernel tasks.

19

11 scripts/Attention

AI & Machine Learning | nvidia/skills

tilegym-adding-cutile-kernel

Add a new cuTile GPU kernel operator to TileGym. Covers dispatch registration in ops.py, cuTile backend implementation, __init__.py exports, test creation, and benchmark in tests/benchmark. Use when adding, creating, or implementing a new cuTile operator/kernel in TileGym, or when asking how to register a new cuTile op.

18

AI & Machine Learning | nvidia/skills

tilegym-improve-cutile-kernel-perf

Iteratively optimize cuTile kernel performance through systematic profiling, bottleneck analysis, IR comparison, and targeted tuning. Covers tile sizes, occupancy, autotune configs, TMA, latency hints, persistent scheduling, num_ctas, flush_to_zero, and IR-level debugging. Use when asked to "optimize cutile kernel", "improve kernel perf", "tune cutile performance", "make kernel faster", or iteratively benchmark and refine a cuTile GPU kernel in the TileGym project.

17

Backend Development | nvidia/skills

tilegym-converting-cutile-to-julia

Converts cuTile Python GPU kernels (@ct.kernel) to cuTile.jl Julia equivalents. Handles kernel syntax translation, 0-indexed to 1-indexed conversion, broadcasting differences, memory layout (row-major to column-major), type system mapping, and launch API differences. Use when converting, porting, or translating cuTile Python kernels to Julia cuTile.jl, or debugging/optimizing existing Julia cuTile translations.

17

4 scripts/Attention

Backend Development | promptingcompany/nv-skill...

tilegym-cutile-autotuning

Use when adding, modifying, optimizing, or debugging cuTile autotuning code. Trigger signals: `exhaustive_search` / `replace_hints` / `hints_fn` / `cuda.tile.tune` in code, `autotune` in filenames, or correctness/performance issues in autotuned cuTile kernels. Covers: tune-once/cache/launch pattern, per-architecture configs (sm80–sm120), parameter space design (tile sizes, occupancy, num_ctas), and 7 common pitfalls with solutions.

13

6 scripts/Checked

AI & Machine Learning | pepperu96/hyper-mla

design-cutile-dsl-kernel

cuTile Python DSL kernel implementation patterns, CtKernel runtime wrapper, suitability gate, and cuTile-specific pitfalls. Use when: (1) creating or modifying a cuTile Python DSL kernel version, (2) implementing an optimization that still fits within cuTile's exposed control surface, (3) deciding whether cuTile is still the right DSL, (4) reviewing cuTile-specific runtime patterns. Always also load /design-kernel for shared naming, versioning, and workflow.

10

AI & Machine Learning | pepperu96/hyper-mla

optimization-catalog-cutile-dsl

Shared optimization guidance plus cuTile Python DSL-specific overlays. Use when: (1) selecting optimizations for a cuTile Python DSL kernel, (2) checking cuTile-specific implementation traps, (3) deciding whether a profiling finding belongs in shared knowledge or a cuTile overlay, (4) updating cuTile Python DSL optimization docs, (5) reviewing how a shared pattern maps to cuTile.

9

AI & Machine Learning | nvidia/skills

improve-cutile-kernel-perf

Iteratively optimize cuTile kernel performance through systematic profiling, bottleneck analysis, IR comparison, and targeted tuning. Covers tile sizes, occupancy, autotune configs, TMA, latency hints, persistent scheduling, num_ctas, flush_to_zero, and IR-level debugging. Use when asked to "optimize cutile kernel", "improve kernel perf", "tune cutile performance", "make kernel faster", or iteratively benchmark and refine a cuTile GPU kernel in the TileGym project.

9

Backend Development | nvidia/skills

cutile-autotuning

Use when adding, modifying, optimizing, or debugging cuTile autotuning code. Trigger signals: `exhaustive_search` / `replace_hints` / `hints_fn` / `cuda.tile.tune` in code, `autotune` in filenames, or correctness/performance issues in autotuned cuTile kernels. Covers: tune-once/cache/launch pattern, per-architecture configs (sm80–sm120), parameter space design (tile sizes, occupancy, num_ctas), and 7 common pitfalls with solutions.

9

6 scripts/Checked

AI & Machine Learning | promptingcompany/nv-skill...

tilegym-converting-cutile-to-triton

Converts cuTile GPU kernels (@ct.kernel) to Triton (@triton.jit). Handles standard in-repo conversion, debugging (cudaErrorIllegalAddress, shape mismatch, numerical mismatch), and mapping cuTile idioms (ct.load/ct.store, ct.Constant, ct.launch) to Triton equivalents. Covers dual-kernel layout flags (e.g. transpose=True/False + autotune grid via META) per translations/advanced-patterns.md. Use when converting, porting, or translating cuTile kernels to Triton, or debugging existing Triton translations.

9

10 scripts/Checked

Tools & Utilities | nvidia/skills

converting-cutile-to-julia

Converts cuTile Python GPU kernels (@ct.kernel) to cuTile.jl Julia equivalents. Handles kernel syntax translation, 0-indexed to 1-indexed conversion, broadcasting differences, memory layout (row-major to column-major), type system mapping, and launch API differences. Use when converting, porting, or translating cuTile Python kernels to Julia cuTile.jl, or debugging/optimizing existing Julia cuTile translations.

8

4 scripts/Attention

AI & Machine Learning | nvidia/skills

converting-cutile-to-triton

Converts cuTile GPU kernels (@ct.kernel) to Triton (@triton.jit). Handles standard in-repo conversion, debugging (cudaErrorIllegalAddress, shape mismatch, numerical mismatch), and mapping cuTile idioms (ct.load/ct.store, ct.Constant, ct.launch) to Triton equivalents. Covers dual-kernel layout flags (e.g. transpose=True/False + autotune grid via META) per translations/advanced-patterns.md. Use when converting, porting, or translating cuTile kernels to Triton, or debugging existing Triton translations.

7

10 scripts/Checked

Search Results: cutile

tilegym-cutile-python

tilegym-adding-cutile-kernel

tilegym-improve-cutile-kernel-perf

tilegym-converting-cutile-to-julia

tilegym-cutile-autotuning

design-cutile-dsl-kernel

optimization-catalog-cutile-dsl

improve-cutile-kernel-perf

cutile-autotuning

tilegym-converting-cutile-to-triton

converting-cutile-to-julia

converting-cutile-to-triton
