Search Results: ai-evaluation

Found 23 Skills

AI & Machine Learning | exploreomni/omni-agent-sk...

omni-ai-eval

Evaluate Omni AI query generation accuracy by running test prompts through the Omni CLI, comparing generated query JSON against expected results, and scoring accuracy. Use this skill whenever someone wants to evaluate Omni AI, benchmark Blobby, run regression tests, compare AI output across branches or configurations, test prompt variations, measure AI quality, run A/B tests on model changes, assess impact of context changes, or any variant of "run evals", "test Blobby", "benchmark query generation", "compare AI results", "regression test", "how accurate is the AI", or "measure the impact of my changes".

AI & Machine Learning | coval-ai/coval-external-s...

get-results

Retrieve and analyze simulation results from a Coval run. Use when user wants to review evaluation outcomes or debug agent behavior.

AI & Machine Learning | rysweet/amplihack

eval-recipes-runner

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents. Auto-activates when testing improvements, running evals, or benchmarking changes.

AI & Machine Learning | yonatangross/orchestkit

golden-dataset-curation

Use when creating or improving golden datasets for AI evaluation. Defines quality criteria, curation workflows, and multi-agent analysis patterns for test data.

AI & Machine Learning | wandb/skills

wandb-primary

Comprehensive primary skill for agents working with Weights & Biases. Covers both the W&B SDK (training runs, metrics, artifacts, sweeps) and the Weave SDK (GenAI traces, evaluations, scorers). Includes helper libraries, gotcha tables, and data analysis patterns. Use this skill whenever the user asks about W&B runs, Weave traces, evaluations, training metrics, loss curves, model comparisons, or any Weights & Biases data — even if they don't say "W&B" explicitly.

2 scripts | Checked

AI & Machine Learning | keyvaluesoftwaresystems/n...

netra-best-practices

Code-first Netra best-practices playbook covering setup, instrumentation, context tracking, custom spans/metrics, integration patterns, evaluation, simulation, and troubleshooting.

AI & Machine Learning | vercel/vercel-plugin

benchmark-agents

Advanced AI agent benchmark scenarios that push Vercel's cutting-edge platform features — Workflow DevKit, AI Gateway, MCP, Chat SDK, Queues, Flags, Sandbox, and multi-agent orchestration. Designed to stress-test skill injection for complex, multi-system builds.

AI & Machine Learning | neolabhq/context-engineer...

judge-with-debate

Evaluate solutions through multi-round debate between independent judges until they reach consensus.

AI & Machine Learning | coval-ai/coval-external-s...

onboard

Interactively set up a first Coval AI evaluation. Guides users through installing the CLI, connecting an agent, creating personas, building test cases, selecting metrics, and launching their first eval run. Use when user says "onboard", "get started", "set up evaluation", "first eval", "new to coval", or wants help creating their first test run.

Testing & QA | yonatangross/orchestkit

testing-llm

LLM and AI testing patterns — mock responses, evaluation with DeepEval/RAGAS, structured output validation, and agentic test patterns (generator, healer, planner). Use when testing AI features, validating LLM outputs, or building evaluation pipelines.

AI & Machine Learning | affaan-m/everything-claud...

skill-stocktake

Use when auditing Claude skills and commands for quality. Supports Quick Scan (changed skills only) and Full Stocktake modes with sequential subagent batch evaluation.

3 scripts | Attention