llm-as-a-judge
Build, validate, and deploy LLM-as-Judge evaluators for automated quality assessment of LLM pipeline outputs. Use this skill whenever the user wants to: create an automated evaluator for subjective or nuanced failure modes, write a judge prompt for Pass/Fail assessment, split labeled data for judge development, measure judge alignment (TPR/TNR), estimate true success rates with bias correction, or set up CI evaluation pipelines. Also trigger when the user mentions "judge prompt", "automated eval", "LLM evaluator", "grading prompt", "alignment metrics", "true positive rate", or wants to move from manual trace review to automated evaluation. This skill covers the full lifecycle: prompt design → data splitting → iterative refinement → success rate estimation.
NPX Install
npx skill4agent add maragudk/evals-skills llm-as-a-judge

LLM-as-a-Judge
When to Use LLM-as-Judge vs. Code
- Use code-based checks for objective criteria: JSON/SQL syntax validity, regex/string matching, structural constraints, execution errors, logical checks (see the sketch below).
- These are fast, cheap, deterministic, and interpretable.
- Use LLM-as-Judge for subjective or nuanced criteria: tone appropriateness, summary faithfulness, response helpfulness, explanation clarity, creative quality.
- These require a separate LLM (distinct from the application) to judge outputs.
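For contrast, a code-based evaluator is just a deterministic function over the output string. A minimal sketch (both check functions and the ORD- order-ID pattern are illustrative, not part of the skill):

```python
import json
import re

def check_json_validity(output: str) -> bool:
    """Code-based check: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def check_order_id_format(output: str) -> bool:
    """Code-based check: does the output contain an order ID like ORD-12345?"""
    return re.search(r"ORD-\d{5}", output) is not None
```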
The Full Workflow
1. Write Prompt Template
2. Split Labeled Data (Train / Dev / Test)
3. Iteratively Refine Prompt (measure TPR/TNR on Dev)
4. Estimate & Correct Success Rate (on Test + Unlabeled)
Step 1: Write the Judge Prompt
See references/prompt-template.md for the full template. A strong judge prompt has four components:
1. Clear Task and Evaluation Criterion
- ❌ "Is this email good?"
- ✅ "Is the tone appropriate for a luxury buyer persona?"
2. Precise Pass/Fail Definitions
3. Few-Shot Examples
- Use clear-cut cases, not edge cases, for initial examples.
- For binary judgments, include at least one Pass and one Fail example.
- If using finer-grained scales (e.g., 1–3 severity), include examples for every point on the scale.
4. Structured Output Format
```json
{
  "reasoning": "1-2 sentence explanation for the decision.",
  "answer": "Pass"
}
```
Put the reasoning field before the answer so the judge articulates its rationale before committing to a verdict.
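Structured output makes the verdict trivially machine-readable. A minimal parsing sketch, assuming the raw model reply is already in hand (the Pass/Fail string convention matches the template above):

```python
import json

def parse_judgment(raw: str) -> bool | None:
    """Parse the judge's JSON reply into Pass (True) / Fail (False).

    Returns None for malformed replies so callers can count them separately
    instead of silently treating them as Fails.
    """
    try:
        data = json.loads(raw)
        return data["answer"].strip().lower() == "pass"
    except (json.JSONDecodeError, KeyError, AttributeError):
        return None
```

Step 2: Split Labeled Data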
| Set | Purpose | Typical Allocation |
|---|---|---|
| Training | Pool of candidates for few-shot examples in the prompt | 10–20% |
| Dev | Iteratively refine the prompt; measure agreement with human labels | 40–45% |
| Test | Final, unbiased measurement of judge accuracy (TPR/TNR) | 40–45% |
- Dev examples must never appear in the prompt; this ensures you measure generalization rather than recall of in-context examples.
- Test examples are held out until the prompt is finalized. Never look at them during development.
- In-context learning typically saturates after 1–8 well-chosen examples. Allocate more data to evaluation.
- Both Dev and Test should contain enough Pass and Fail examples—ideally 30–50 of each.
- Reusing examples across splits leads to overfitting and inflated accuracy. (A splitting sketch follows this list.)
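A minimal splitting sketch under the allocations above; the exact 15/42.5/42.5 proportions and the "pass"/"fail" label strings are illustrative choices:

```python
import random

def split_labeled_data(traces: list[tuple[str, str]], seed: int = 42):
    """Stratified split so Train, Dev, and Test each keep Pass and Fail examples.

    traces: (trace, label) pairs, where label is "pass" or "fail".
    """
    rng = random.Random(seed)
    by_label: dict[str, list] = {"pass": [], "fail": []}
    for trace, label in traces:
        by_label[label].append((trace, label))

    train, dev, test = [], [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_train = max(1, int(0.15 * len(items)))   # ~10-20% few-shot pool
        n_dev = (len(items) - n_train) // 2        # split the remainder evenly
        train += items[:n_train]
        dev += items[n_train:n_train + n_dev]
        test += items[n_train + n_dev:]
    return train, dev, test
```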
Step 3: Iteratively Refine the Prompt
The Refinement Loop
- Write a baseline prompt using the four components above, with a few examples from the Training set.
- Run the judge on the Dev set. Compare each judgment to human ground truth.
- Measure agreement using TPR and TNR:
- TPR = (actual Passes correctly judged Pass) / (total actual Passes)
- TNR = (actual Fails correctly judged Fail) / (total actual Fails)
- Inspect disagreements. Review false passes (judge said Pass, human said Fail) and false fails. Identify ambiguous criteria or missing edge cases.
- Refine the prompt: Clarify Pass/Fail definitions, swap in better few-shot examples from Training, add representative edge cases.
- Repeat until TPR and TNR stabilize at acceptable levels. (A sketch of the metric computation follows this list.)
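A minimal sketch of the agreement computation, assuming you have collected (human_label, judge_label) boolean pairs for each Dev example, with True meaning Pass:

```python
def alignment_metrics(pairs: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Compute (TPR, TNR) from (human_label, judge_label) pairs; True = Pass."""
    tp = sum(1 for human, judge in pairs if human and judge)
    fn = sum(1 for human, judge in pairs if human and not judge)
    tn = sum(1 for human, judge in pairs if not human and not judge)
    fp = sum(1 for human, judge in pairs if not human and judge)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr
```

The fp pairs are the false passes and the fn pairs the false fails; inspecting those disagreements is what feeds the refinement step.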
Why TPR and TNR (Not Precision/Recall)
TPR and TNR are each conditioned on the true class, so neither depends on how common failures happen to be. Precision, by contrast, shifts with the Pass/Fail base rate, which differs between your labeled data and production traffic. Because TPR and TNR are prevalence-independent, they transfer to unlabeled data and are exactly the quantities the bias correction in Step 4 requires.
When to Stop
Stop when TPR and TNR plateau on the Dev set: once successive prompt revisions no longer move either metric, you have likely hit the ceiling for this model and criterion.
If Alignment Stalls
- Use a more capable LLM — a larger model may resolve subtle errors.
- Decompose the criterion — break a complex failure into smaller, atomic checks.
- Improve labeled data — add diverse, high-quality examples, especially edge cases.
- Verify label quality — sometimes the issue is inconsistent or incorrect human labels.
Step 4: Estimate True Success Rates
See references/success-rate-estimation.md for the full procedure.
Quick Reference
- Measure judge accuracy on Test set → TPR, TNR
- Observe raw success rate on unlabeled data → p_obs = k/m
- Correct for bias using the Rogan-Gladen formula:
  θ̂ = (p_obs + TNR - 1) / (TPR + TNR - 1), clipped to [0, 1]
- Bootstrap a confidence interval: resample Test-set labels B times, recompute the corrected rate each time, and take the 2.5th/97.5th percentiles. (Both steps are sketched below.)
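A minimal sketch of the correction and the bootstrap. It assumes test_pairs is a list of (human_label, judge_label) booleans from the Test set and reuses the alignment_metrics helper sketched in Step 3; both names are illustrative:

```python
import random

def corrected_success_rate(p_obs: float, tpr: float, tnr: float) -> float:
    """Rogan-Gladen correction of an observed pass rate, clipped to [0, 1]."""
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction is undefined")
    return min(1.0, max(0.0, (p_obs + tnr - 1) / denom))

def bootstrap_ci(p_obs: float, test_pairs: list, B: int = 2000, seed: int = 0):
    """95% CI: resample the Test set, recompute TPR/TNR and the corrected rate."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(B):
        sample = [rng.choice(test_pairs) for _ in test_pairs]
        tpr, tnr = alignment_metrics(sample)  # helper sketched in Step 3
        if tpr + tnr > 1:  # skip degenerate resamples
            estimates.append(corrected_success_rate(p_obs, tpr, tnr))
    estimates.sort()
    return estimates[int(0.025 * len(estimates))], estimates[int(0.975 * len(estimates))]
```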
Key Insight
The judge's raw pass rate is a biased estimate of the true success rate because the judge makes both false-pass and false-fail errors. Having measured TPR and TNR on held-out labels, you can invert that bias rather than report the raw number.
Common Pitfalls
- Omitting examples from the prompt. Without concrete examples, the judge lacks grounding. This is the most common mistake.
- Evaluating multiple criteria in a single prompt. Break complex metrics into narrower, specific prompts for better alignment and diagnosability.
- Skipping alignment validation. Don't assume the judge "just works." Domain-specific criteria require prompt refinement and human-labeled validation.
- Overfitting to labeled traces. If few-shot examples also appear in the evaluation set, TPR/TNR will be inflated. Any trace used in the prompt must be excluded from Dev and Test.
- Never revisiting the judge. Production data drifts, new failure modes emerge, and LLM updates shift behavior. Periodically re-validate.
- Not pinning the judge model version. In CI pipelines, pin the exact model version (e.g., claude-sonnet-4-5-20250929) to prevent results from fluctuating due to unannounced updates.
Long-Document Considerations
- Don't feed the full document into the judge — use only the relevant portion (e.g., the source paragraph a summary came from).
- Consider chunk-level evaluation with aggregated per-chunk judgments (sketched after this list).
- Make rubrics especially clear about what "correct" means since the judge won't see the full context.
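A minimal sketch of chunk-level evaluation. judge_chunk is a hypothetical per-chunk LLM judge returning True for Pass, and the all-chunks-must-pass aggregation is one illustrative choice (a pass-fraction threshold is another):

```python
def chunk_text(document: str, max_chars: int = 2000) -> list[str]:
    """Naive fixed-size chunking; swap in a structure-aware splitter as needed."""
    return [document[i:i + max_chars] for i in range(0, len(document), max_chars)]

def judge_long_document(document: str, summary: str) -> bool:
    """Judge the summary against each chunk, then aggregate per-chunk verdicts."""
    verdicts = [judge_chunk(chunk, summary) for chunk in chunk_text(document)]
    return all(verdicts)  # strict aggregation: every chunk must pass
```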
CI Integration
- Run all golden inputs through the pipeline.
- Evaluate outputs with your suite of automated evaluators (code-based + LLM-as-Judge).
- Pin the judge model version to prevent CI flicker.
- Include examples covering core features, known failure modes, and edge cases.
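A minimal sketch of a CI gate tying these points together. run_pipeline and run_judges are hypothetical helpers, and the threshold is illustrative; calibrate it against your own baseline:

```python
import sys

JUDGE_MODEL = "claude-sonnet-4-5-20250929"  # pinned to prevent CI flicker
PASS_RATE_THRESHOLD = 0.90                  # illustrative; set from your baseline

def ci_gate(golden_inputs: list[str]) -> None:
    """Run golden inputs through the pipeline, judge outputs, fail CI on regression."""
    outputs = [run_pipeline(x) for x in golden_inputs]                  # hypothetical
    verdicts = [run_judges(out, model=JUDGE_MODEL) for out in outputs]  # code + LLM judges
    pass_rate = sum(verdicts) / len(verdicts)
    print(f"golden-set pass rate: {pass_rate:.2%}")
    if pass_rate < PASS_RATE_THRESHOLD:
        sys.exit(1)  # fail the build
```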