Loading...
Loading...
Found 105 Skills
Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.
Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
Instrument Python LLM apps, build golden datasets, write eval-based tests, run them, and root-cause failures — covering the full eval-driven development cycle. Make sure to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM, even if they don't say "evals" explicitly. Use for making sure an AI app works correctly, catching regressions after prompt changes, debugging why an agent started behaving differently, or validating output quality before shipping.
Live Google Search Console analytics — fetches real SEO data (clicks, impressions, CTR, rankings) and delivers actionable insights with CTR benchmarking and opportunity detection. Zero dependencies. Use when the user asks about GSC, Google Search Console, SEO performance, search performance, keywords, rankings, organic traffic, top pages, top queries, "how is my site performing in Google", "check rankings", or "search console report".
Orchestrate Xcode build optimization by benchmarking first, running the specialist analysis skills, prioritizing findings, requesting explicit approval, delegating approved fixes to xcode-build-fixer, and re-benchmarking after changes. Use when a developer wants an end-to-end build optimization workflow, asks to speed up Xcode builds, wants a full build audit, or needs a recommend-first optimization pass covering compilation, project settings, and packages.
Research-driven code review and validation at multiple levels of abstraction. Two modes: (1) Session review — after making changes, review and verify work using parallel reviewers that research-validate every assumption; (2) Full codebase audit — deep end-to-end evaluation using parallel teams of subagent-spawning reviewers. Use when reviewing changes, verifying work quality, auditing a codebase, validating correctness, checking assumptions, finding defects, reducing complexity. NOT for writing new code, explaining code, or benchmarking.
Optimizes Python library performance through profiling (cProfile, PyInstrument), memory analysis (memray, tracemalloc), benchmarking (pytest-benchmark), and optimization strategies. Use when analyzing performance bottlenecks, finding memory leaks, or setting up performance regression testing.
Set up performance benchmarks and CodSpeed harness for a project. Use this skill whenever the user wants to create benchmarks, add performance tests, set up CodSpeed, configure codspeed.yml, integrate a benchmarking framework (criterion, divan, pytest-benchmark, vitest bench, go test -bench, google benchmark), or when the user says 'add benchmarks', 'set up perf tests', 'create a benchmark', 'benchmark this', or wants to measure performance of their code for the first time. Also trigger when the optimize skill needs benchmarks that don't exist yet.
Optimize existing Triton kernels for NVIDIA TileIR backend on Blackwell GPUs (sm_100+). Adds TileIR-specific autotune configs: occupancy, num_ctas, TMA descriptors. Covers kernel classification (dot-related, norm-like, elementwise, reduction), type-specific transformations, and PTX-vs-TileIR benchmarking. Triggered by: "optimize for TileIR", "add TileIR configs", "Blackwell optimization", "TMA descriptors", "2CTA mode", "occupancy tuning". Kernels use standard `import triton`; TileIR activates via ENABLE_TILE=1 when nvtriton is installed.
Filesystem RAG benchmarks: corpus/, train.json, evaluate_rag.py (RAGAS quality). Not for prod monitoring, latency/throughput benchmarking (use rag-perf), or evals outside this repo layout.
Challenge an outbound campaign copy by benchmarking it against the user's existing campaigns — what worked, what didn't, what the winners do differently — and return a concrete verdict plus prioritized fixes. Use whenever the user wants to know if a campaign or sequence is good, compare a draft to past campaigns, audit campaign copy against real performance, pressure-test a sequence before launch, validate a sequence before going live, or asks 'is this campaign as good as my best ones'. Triggers on: 'challenge this campaign', 'benchmark this sequence', 'is this campaign good', 'audit my copy', 'pressure-test before launch', 'compare to my best campaigns', 'should I launch this'. Pulls existing campaign performance from the La Growth Machine MCP when connected; otherwise works from stats and copy the user pastes; falls back to a best-practice baseline when there is no campaign history. For SDR, RevOps, Growth, Head of Sales/Marketing, founders launching outbound. Maintained by La Growth Machine.
Spatial indexing and world streaming for Three.js building games with thousands of pieces. Use when optimizing building games, implementing spatial queries, chunk loading, or profiling performance. Includes spatial hash grids, octrees, chunk managers, and benchmarking tools.