# agent-evaluation

LLM-as-judge evaluation framework with a 5-dimension rubric (accuracy, groundedness, coherence, completeness, helpfulness) for scoring AI-generated content quality, with weighted composite scores and evidence citations.
Install:

```shell
npx skill4agent add oimiragieo/agent-studio agent-evaluation
```

## Scoring Dimensions

| Dimension | Weight | What It Measures |
|---|---|---|
| Accuracy | 30% | Factual correctness; no hallucinations; claims are verifiable |
| Groundedness | 25% | Claims are supported by citations, file references, or evidence from the codebase |
| Coherence | 15% | Logical flow; internally consistent; no contradictions |
| Completeness | 20% | All required aspects addressed; no critical gaps |
| Helpfulness | 10% | Actionable; provides concrete next steps; reduces ambiguity |
## Scoring Scale

Each dimension is scored from 1 to 5:

| Score | Meaning |
|---|---|
| 5 | Excellent — fully meets the dimension's criteria with no gaps |
| 4 | Good — meets criteria with minor gaps |
| 3 | Adequate — partially meets criteria; some gaps present |
| 2 | Poor — significant gaps or errors in this dimension |
| 1 | Failing — does not meet the dimension's criteria |
## What Can Be Evaluated

- Agent response (text)
- Plan document (file path)
- Code review output (text/file)
- Skill invocation result (text)
- Task completion claim (TaskGet metadata)

## Dimension Checklists

### Accuracy (30%)

Checklist:
- [ ] Claims are factually correct (verify against codebase if possible)
- [ ] No hallucinated file paths, function names, or API calls
- [ ] Numbers and counts are accurate
- [ ] No contradictions with existing documentation

### Groundedness (25%)

Checklist:
- [ ] Claims cite specific files, line numbers, or task IDs
- [ ] Recommendations reference observable evidence
- [ ] No unsupported assertions ("this is probably X")
- [ ] Code examples use actual project patterns

### Coherence (15%)

Checklist:
- [ ] Logical flow from problem → analysis → recommendation
- [ ] No internal contradictions
- [ ] Terminology is consistent throughout
- [ ] Steps are in a rational order

### Completeness (20%)

Checklist:
- [ ] All required aspects of the task are addressed
- [ ] Edge cases are mentioned (if relevant)
- [ ] No critical gaps that would block action
- [ ] Follow-up steps are included

### Helpfulness (10%)

Checklist:
- [ ] Provides actionable next steps (not just observations)
- [ ] Concrete enough to act on without further clarification
- [ ] Reduces ambiguity rather than adding it
- [ ] Appropriate for the intended audience

## Composite Score

```
composite = (accuracy × 0.30) + (groundedness × 0.25) + (completeness × 0.20) + (coherence × 0.15) + (helpfulness × 0.10)
```

## Verdict Thresholds

| Composite Score | Verdict | Action |
|---|---|---|
| 4.5 – 5.0 | EXCELLENT | Approve; proceed |
| 3.5 – 4.4 | GOOD | Approve with minor notes |
| 2.5 – 3.4 | ADEQUATE | Request targeted improvements |
| 1.5 – 2.4 | POOR | Reject; requires significant rework |
| 1.0 – 1.4 | FAILING | Reject; restart task |
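As a sketch, the composite formula and the verdict thresholds above can be combined in a small helper. This is hypothetical illustration code, not part of the skill itself:

```js
// Weights from the rubric: each dimension's 1–5 score contributes proportionally.
const WEIGHTS = {
  accuracy: 0.30,
  groundedness: 0.25,
  completeness: 0.20,
  coherence: 0.15,
  helpfulness: 0.10,
};

// scores: { accuracy: 1–5, groundedness: 1–5, ... }
function compositeScore(scores) {
  return Object.entries(WEIGHTS).reduce(
    (sum, [dim, w]) => sum + scores[dim] * w,
    0,
  );
}

// Map a composite score onto the five verdict tiers.
function verdict(composite) {
  if (composite >= 4.5) return 'EXCELLENT';
  if (composite >= 3.5) return 'GOOD';
  if (composite >= 2.5) return 'ADEQUATE';
  if (composite >= 1.5) return 'POOR';
  return 'FAILING';
}

const scores = { accuracy: 4, groundedness: 5, completeness: 4, coherence: 3, helpfulness: 4 };
const c = compositeScore(scores);
// 4×0.30 + 5×0.25 + 4×0.20 + 3×0.15 + 4×0.10 = 4.10 → GOOD
console.log(c.toFixed(2), verdict(c));
```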
## Evaluation Verdict
**Output Evaluated**: [Brief description of what was evaluated]
**Evaluator**: [Agent name / task ID]
**Date**: [ISO 8601 date]
### Dimension Scores
| Dimension | Score | Weight | Weighted Score |
| ------------- | ----- | ------ | -------------- |
| Accuracy | X/5 | 30% | X.X |
| Groundedness | X/5 | 25% | X.X |
| Completeness | X/5 | 20% | X.X |
| Coherence | X/5 | 15% | X.X |
| Helpfulness | X/5 | 10% | X.X |
| **Composite** | | | **X.X / 5.0** |
### Evidence Citations
**Accuracy (X/5)**:
> [Direct quote or file:line reference]
> Rationale: [Why this score]
**Groundedness (X/5)**:
> [Direct quote or file:line reference]
> Rationale: [Why this score]
**Completeness (X/5)**:
> [Direct quote or file:line reference]
> Rationale: [Why this score]
**Coherence (X/5)**:
> [Direct quote or file:line reference]
> Rationale: [Why this score]
**Helpfulness (X/5)**:
> [Direct quote or file:line reference]
> Rationale: [Why this score]
### Verdict: [EXCELLENT | GOOD | ADEQUATE | POOR | FAILING]
**Summary**: [1-2 sentence overall assessment]
**Required Actions** (if verdict is ADEQUATE or worse):
1. [Specific improvement needed]
2. [Specific improvement needed]

## Usage Examples

### Evaluate a plan document

```js
// Load plan document
Read({ file_path: '.claude/context/plans/auth-design-plan-2026-02-21.md' });

// Evaluate against the 5-dimension rubric
Skill({ skill: 'agent-evaluation' });
// Provide the plan content as the output to evaluate
```

### Self-evaluate before marking a task complete

```js
// Agent generates an implementation summary.
// Before marking the task complete, evaluate the summary quality.
Skill({ skill: 'agent-evaluation' });
// If composite < 3.5, request improvements before TaskUpdate(completed)
```

### Evaluate a code review

```js
// After code-reviewer runs, evaluate the review quality
Skill({ skill: 'agent-evaluation' });
// Ensures the review is grounded in actual code evidence, not assertions
```

### Compare two candidate outputs

```js
// Evaluate output A → save verdict A
// Evaluate output B → save verdict B
// Compare composites → choose the higher-scoring output
```

### Full quality-gate workflow

```js
// Step 1: Do the work
// Step 2: Evaluate with agent-evaluation
Skill({ skill: 'agent-evaluation' });
// If verdict is POOR or FAILING → rework before proceeding
// If verdict is ADEQUATE or better → proceed to verification

// Step 3: Final gate
Skill({ skill: 'verification-before-completion' });

// Step 4: Mark complete
TaskUpdate({ taskId: 'X', status: 'completed' });
```

## Anti-Patterns

| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Skipping dimensions to save time | Each dimension catches different failures | Always score all 5 dimensions |
| No evidence citation per dimension | Assertions without grounding are invalid | Quote specific text or file:line for every score |
| Using simple average for composite | Accuracy (30%) matters more than helpfulness (10%) | Use the weighted composite formula |
| Only checking EXCELLENT vs FAILING | ADEQUATE outputs need targeted improvements, not full rework | Use all 5 verdict tiers with appropriate action per tier |
| Evaluating before work is done | Incomplete outputs score falsely low | Evaluate completed outputs only |
| Treating evaluation as binary gate | Quality is a spectrum; binary pass/fail loses nuance | Use composite score + per-dimension breakdown together |
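To make the weighted-average row concrete: with hypothetical scores that are weak on accuracy and groundedness but strong elsewhere, a simple average crosses into GOOD while the weighted composite stays ADEQUATE. A minimal sketch:

```js
const WEIGHTS = { accuracy: 0.30, groundedness: 0.25, completeness: 0.20, coherence: 0.15, helpfulness: 0.10 };
// Hypothetical output: shaky facts, but polished and helpful-sounding.
const scores = { accuracy: 2, groundedness: 2, completeness: 4, coherence: 5, helpfulness: 5 };

// Simple (unweighted) mean of the five scores.
const simpleAverage = Object.values(scores).reduce((a, b) => a + b, 0) / 5;
// Weighted composite per the rubric formula.
const weighted = Object.entries(WEIGHTS).reduce((sum, [dim, w]) => sum + scores[dim] * w, 0);

console.log(simpleAverage.toFixed(2)); // 3.60 → GOOD under a simple average
console.log(weighted.toFixed(2));      // 3.15 → ADEQUATE under the weighted composite
```

The weighted composite penalizes the low accuracy and groundedness scores, which is exactly the behavior the rubric intends.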
Related agents: `qa`, `code-reviewer`, `reflection-agent`

Memory files (review with `cat .claude/context/memory/learnings.md`):

- `.claude/context/memory/learnings.md`
- `.claude/context/memory/issues.md`
- `.claude/context/memory/decisions.md`

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.