All Skills

Total 39,389 skills

AI & Machine Learning | pepperu96/hyper-mla

mla-analysis

MLA (Multi-head Latent Attention) cost models, regime analysis, and kernel selection guide. Use when: (1) reasoning about which kernel approach to use for a given regime, (2) understanding cost model tradeoffs between FlashMLA, FlashAttention, and MLAvar6+, (3) analyzing roofline behavior across decode/speculative/prefill regimes, (4) setting optimization targets, (5) understanding the MLA math and the absorption trick.

AI & Machine Learning | bbuf/sglang-auto-driven-s...

sglang-prod-incident-triage

Replay-first debug flow for SGLang serving problems. Use when a live or recent server shows health-check failures, latency or throughput regressions, queue growth, timeouts, distributed stalls, crash dumps, wrong outputs after deploys, or PD/EP/HiCache issues, and the job is to turn the problem into a replay plus the right next debug tool.

2 scripts / Attention

AI & Machine Learning | kiterlin/intelligent-dete...

training-llms-megatron

Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use when training models with more than 1B parameters, when you need maximum GPU efficiency (47% MFU on H100), or when you require tensor/pipeline/sequence/context/expert parallelism. Production-ready framework used for Nemotron, LLaMA, and DeepSeek.

AI & Machine Learning | pepperu96/hyper-mla

optimization-catalog-cute-dsl

Shared optimization guidance plus CuTe Python DSL overlays. Use when: (1) selecting optimizations for a CuTe Python DSL kernel, (2) deciding whether a finding is shared or cute-dsl-specific, (3) recording CuTe Python DSL implementation notes, (4) reviewing the knowledge layout for cute-dsl work, (5) mapping shared patterns to a CuTe Python DSL implementation surface.

AI & Machine Learning | pepperu96/hyper-mla

design-cutile-dsl-kernel

cuTile Python DSL kernel implementation patterns, CtKernel runtime wrapper, suitability gate, and cuTile-specific pitfalls. Use when: (1) creating or modifying a cuTile Python DSL kernel version, (2) implementing an optimization that still fits within cuTile's exposed control surface, (3) deciding whether cuTile is still the right DSL, (4) reviewing cuTile-specific runtime patterns. Always also load /design-kernel for shared naming, versioning, and workflow.

Tools & Utilities | ultimatile/cuda-x-skills

cuda-webdoc-search

Search CUDA-X library documentation (cuBLAS, cuTENSOR, cuTensorNet, cuSOLVER, etc.) to find API symbols, functions, and types. Use when you need to look up CUDA library APIs, discover available functions, or find documentation URLs for specific operations.

12 scripts / Attention

AI & Machine Learning | pepperu96/hyper-mla

optimization-catalog

Compatibility router for the shared optimization knowledge base and the language-specific optimization catalog skills. Use when: (1) selecting which optimization catalog skill to load, (2) the implementation language is not fixed yet, (3) a workflow still references the legacy optimization-catalog skill name, (4) deciding whether a finding is shared or language-specific, (5) updating the generalized knowledge-base structure.

AI & Machine Learning | bbuf/sglang-auto-driven-s...

h100

SSH into host `h100_sglang`, enter Docker container `sglang_bbuf`, work in `/sgl-workspace/sglang`, and use the ready H100 remote environment for SGLang development and validation. Use when a task needs remote CUDA work, GPU-backed smoke tests, diffusion checks, or a safe remote copy instead of local-only execution.

Backend Development | pepperu96/hyper-mla

design-kernel

Shared kernel design workflow across all supported languages and DSLs. Provides language selection table, naming conventions, versioning rules, KernelPlan structure, composition patterns, clone workflow, implementation workflow, devlog template, and designer output contract. Use when: (1) choosing which language-specific kernel design skill to load, (2) the intended implementation language is not fixed yet, (3) you need naming or versioning guidance before selecting a DSL, (4) you are implementing any kernel regardless of DSL, (5) you are updating docs that refer to kernel design skills.

AI & Machine Learning | kiterlin/intelligent-dete...

ml-paper-writing

Write publication-ready ML/AI papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Use when drafting papers from research repos, structuring arguments, verifying citations, or preparing camera-ready submissions. Includes LaTeX templates, reviewer guidelines, and citation verification workflows.

AI & Machine Learning | bbuf/sglang-auto-driven-s...

sglang-torch-profiler-analysis

Compact SGLang torch-profiler triage skill. Use when Codex should inspect an existing `trace.json(.gz)` or profile directory, trigger `sglang.profiler` against a live server, and return one compact report with kernel, overlap-opportunity, and fuse-pattern tables. Single-trace triage is enough for quick diagnosis; mapping+formal two-trace triage gives stronger overlap conclusions.

4 scripts / Checked

AI & Machine Learning | kiterlin/intelligent-dete...

weights-and-biases

Track ML experiments with automatic logging, visualize training in real time, optimize hyperparameters with sweeps, and manage a model registry with W&B, a collaborative MLOps platform.
