All Skills

Total 39,389 skills

AI & Machine Learning | vllm-project/vllm-skills

vllm-bench-serve

Benchmark vLLM or OpenAI-compatible serving endpoints using vllm bench serve. Supports multiple datasets (random, sharegpt, sonnet, HF), backends (openai, openai-chat, vllm-pooling, embeddings), throughput/latency testing with request-rate control, and result saving. Use when benchmarking LLM serving performance, measuring TTFT/TPOT, or load testing inference APIs.

AI & Machine Learning | vllm-project/vllm-skills

vllm-prefix-cache-bench

Benchmarks the efficiency of automatic prefix caching in vLLM using fixed prompts, real-world datasets, or synthetic prefix/suffix patterns. Use when the user asks to benchmark prefix caching hit rate, caching efficiency, or repeated-prompt performance in vLLM.

AI & Machine Learning | kiterlin/intelligent-dete...

torchforge-rl-training

Provides guidance for PyTorch-native agentic RL using torchforge, Meta's library that separates infrastructure from algorithms. Use when you want clean RL abstractions, easy algorithm experimentation, or scalable training with Monarch and TorchTitan.

AI & Machine Learning | kiterlin/intelligent-dete...

pytorch-lightning

High-level PyTorch framework with a Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), a callback system, and minimal boilerplate. Scales from laptop to supercomputer with the same code. Use when you want clean training loops with built-in best practices.

AI & Machine Learning | kiterlin/intelligent-dete...

miles-rl-training

Provides guidance for enterprise-grade RL training using miles, a production-ready fork of slime. Use when training large MoE models with FP8/INT4, when you need train-inference alignment, or when you require speculative RL for maximum throughput.

AI & Machine Learning | pepperu96/hyper-mla

optimization-catalog-cutile-dsl

Shared optimization guidance plus cuTile Python DSL-specific overlays. Use when: (1) selecting optimizations for a cuTile Python DSL kernel, (2) checking cuTile-specific implementation traps, (3) deciding whether a profiling finding belongs in shared knowledge or a cuTile overlay, (4) updating cuTile Python DSL optimization docs, (5) reviewing how a shared pattern maps to cuTile.

AI & Machine Learning | kiterlin/intelligent-dete...

grpo-rl-training

Expert guidance for GRPO/RL fine-tuning with TRL, for reasoning and task-specific model training.

AI & Machine Learning | kiterlin/intelligent-dete...

verl-rl-training

Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.

AI & Machine Learning | kiterlin/intelligent-dete...

ray-train

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

AI & Machine Learning | kiterlin/intelligent-dete...

ray-data

Scalable data processing for ML workloads. Streaming execution across CPU/GPU, supports Parquet/CSV/JSON/images. Integrates with Ray Train, PyTorch, TensorFlow. Scales from single machine to 100s of nodes. Use for batch inference, data preprocessing, multi-modal data loading, or distributed ETL pipelines.

AI & Machine Learning | kiterlin/intelligent-dete...

nemo-evaluator-sdk

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. NVIDIA's enterprise-grade platform, with a container-first architecture for reproducible benchmarking. Use when you need scalable evaluation on local Docker, Slurm HPC, or cloud platforms.

AI & Machine Learning | kiterlin/intelligent-dete...

pytorch-fsdp2

Adds PyTorch FSDP2 (fully_shard) to training scripts with correct init, sharding, mixed precision/offload config, and distributed checkpointing. Use when models exceed single-GPU memory or when you need DTensor-based sharding with DeviceMesh.
