AI Evals

Scope

Covers

Designing evaluation (“evals”) for LLM/AI features as an execution contract: what “good” means and how it’s measured
Converting failures into a golden test set + error taxonomy + rubric
Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook
Producing decision-ready results and an iteration loop (every bug becomes a new test)

When to use

“Design evals for this LLM feature so we can ship with confidence.”
“Create a rubric + golden set + benchmark for our AI assistant/copilot.”
“We’re seeing flaky quality—do error analysis and turn it into a repeatable eval.”
“Compare prompts/models safely with a clear acceptance threshold.”

When NOT to use

You need to decide what to build (use

problem-definition

building-with-llms

, or

ai-product-strategy

You’re primarily doing traditional non-LLM software testing (use your standard eng QA/unit/integration tests).
You want model training research or infra design (this skill assumes API/model usage; delegate to ML/infra).
You only want vendor/model selection with no defined task + data (use
```
evaluating-new-technology
```
first, then come back with a concrete use case).

Inputs

Minimum required

System under test (SUT): what the AI does, for whom, in what workflow (inputs → outputs)
The decision the eval must support (ship/no-ship, compare options, regression gate)
What “good” means: 3–10 target behaviors + top failure modes
Constraints: privacy/compliance, safety policy, languages, cost/latency budgets, timeline

Missing-info strategy

Ask up to 5 questions from references/INTAKE.md (3–5 at a time).
If details remain missing, proceed with explicit assumptions and provide 2–3 viable options (judge type, scoring scheme, dataset size).
If asked to run code or generate datasets from sensitive sources, request confirmation and apply least privilege (no secrets; redact/anonymize).

Outputs (deliverables)

Produce an AI Evals Pack (in chat; or as files if requested), in this order:

Eval PRD (evaluation requirements): decision, scope, target behaviors, success metrics, acceptance thresholds
Test set spec + initial golden set: schema, coverage plan, and a starter set of cases (tagged by scenario/risk)
Error taxonomy (from error analysis + open coding): failure modes, severity, examples
Rubric + judging guide: dimensions, scoring scale, definitions, examples, tie-breakers
Judge + harness plan: human vs LLM-as-judge vs automated checks, prompts/instructions, calibration, runbook, cost/time estimate
Reporting + iteration loop: baseline results format, regression policy, how new bugs become new tests
Risks / Open questions / Next steps (always included)

Templates: references/TEMPLATES.md

Workflow (7 steps)

1) Define the decision and write the Eval PRD

Inputs: SUT description, stakeholders, decision to support.
Actions: Define the decision (ship/no-ship, compare A vs B), scope/non-goals, target behaviors, acceptance thresholds, and what must never happen.
Outputs: Draft Eval PRD (template in references/TEMPLATES.md).
Checks: A stakeholder can restate what is being measured, why, and what “pass” means.

2) Draft the golden set structure + coverage plan

Inputs: User workflows, edge cases, safety risks, data availability.
Actions: Specify the test case schema, tagging, and coverage targets (happy paths, tricky paths, adversarial/safety, long-tail). Create an initial starter set (small but high-signal).
Outputs: Test set spec + initial golden set.
Checks: Every target behavior has at least 2 test cases; high-severity risks are explicitly represented.

3) Run error analysis and open coding to build a taxonomy

Inputs: Known failures, logs, stakeholder anecdotes, initial golden set.
Actions: Review failures, label them with open coding, consolidate into a taxonomy, and assign severity/impact. Identify likely root causes (prompting, missing context, tool misuse, formatting, policy).
Outputs: Error taxonomy + “top failure modes” list.
Checks: Taxonomy is mutually understandable by PM/eng; each category has 1–2 concrete examples.

4) Convert taxonomy → rubric + scoring rules

Inputs: Taxonomy, target behaviors, output formats.
Actions: Define scoring dimensions and scales; write clear judge instructions and tie-breakers; add examples and disallowed behaviors. Decide absolute scoring vs pairwise comparisons.
Outputs: Rubric + judging guide.
Checks: Two independent judges would likely score the same case similarly (instructions are specific, not vibes).

5) Choose the judging approach + harness/runbook

Inputs: Constraints (time/cost), required reliability, privacy/safety constraints.
Actions: Pick judge type(s): human, LLM-as-judge, automated checks. Define calibration (gold examples, inter-rater checks), sampling, and how results are stored. Write a runbook with estimated runtime/cost.
Outputs: Judge + harness plan.
Checks: The plan is repeatable (versioned prompts/models, deterministic settings where possible, clear data handling).

6) Define reporting, thresholds, and the iteration loop

Inputs: Stakeholder needs, release cadence.
Actions: Specify report format (overall + per-tag metrics), regression rules, and what changes require re-running evals. Define the iteration loop: every discovered failure becomes a new test + taxonomy update.
Outputs: Reporting + iteration loop.
Checks: A reader can make a decision from the report without additional meetings; regressions are detectable.

7) Quality gate + finalize

Inputs: Full draft pack.
Actions: Run references/CHECKLISTS.md and score with references/RUBRIC.md. Fix missing coverage, vague rubric language, or non-repeatable harness steps. Always include Risks / Open questions / Next steps.
Outputs: Final AI Evals Pack.
Checks: The eval definition functions as a product requirement: clear, testable, and actionable.

Quality gate (required)

Use references/CHECKLISTS.md and references/RUBRIC.md.
Always include: Risks, Open questions, Next steps.

Examples

Example 1 (answer quality + safety): “Use

ai-evals

to design evals for a customer-support reply drafting assistant. Constraints: no PII leakage, must cite KB articles, and must refuse unsafe requests. Output: AI Evals Pack.”

Example 2 (structured extraction): “Use

ai-evals

to create a rubric + golden set for an LLM that extracts invoice fields to JSON. Constraints: must always return valid JSON; prioritize recall for

amount

and

due_date

. Output: AI Evals Pack.”

Boundary example: “We don’t know what the AI feature should do yet—just ‘add AI’ and pick a model.”
Response: out of scope; first define the job/spec and success metrics (use

problem-definition

building-with-llms

), then return to

ai-evals

with a concrete SUT.

ai-evals

NPX Install

Tags

SKILL.md Content