# Advanced Evaluation
LLM-as-a-Judge techniques for evaluating AI outputs. This is not a single technique but a family of approaches; choosing the right one and mitigating its biases is the core competency.
## When to Activate
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards
- Debugging inconsistent evaluation results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
## Core Concepts
### Evaluation Taxonomy
Direct Scoring: Single LLM rates one response on a defined scale.
- Best for: Objective criteria (factual accuracy, instruction following, toxicity)
- Reliability: Moderate to high for well-defined criteria
Pairwise Comparison: LLM compares two responses and selects the better one.
- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences
### Known Biases
| Bias | Description | Mitigation |
|---|---|---|
| Position | First-position preference | Swap positions, check consistency |
| Length | Longer = higher scores | Explicit prompting, length-normalized scoring |
| Self-Enhancement | Models rate own outputs higher | Use different model for evaluation |
| Verbosity | Unnecessary detail rated higher | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |
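Among these mitigations, length-normalized scoring is mechanical enough to sketch. A minimal version, assuming raw judge scores on a 1-5 scale; the function name and the logarithmic penalty with its default weight are illustrative choices, not a standard formula:

```python
import math

def length_normalized_score(raw_score: float, response_chars: int,
                            reference_chars: int, penalty: float = 0.5) -> float:
    """Dampen the score advantage of responses far longer than a reference length.

    `penalty` is a hypothetical tuning weight; calibrate it against
    human judgments before relying on the adjusted scores.
    """
    if response_chars <= reference_chars:
        return raw_score          # no penalty for concise responses
    # Logarithmic penalty: grows slowly, so moderate extra length costs little.
    excess_ratio = response_chars / reference_chars
    return max(1.0, raw_score - penalty * math.log(excess_ratio))
```

Pairing this with the explicit-prompting mitigation (telling the judge that length is not a virtue) attacks the bias from both sides.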
### Decision Framework
Is there an objective ground truth?
├── Yes → Direct Scoring (factual accuracy, format compliance)
└── No → Pairwise Comparison (tone, style, creativity)

## Quick Reference
### Direct Scoring Requirements
- Clear criteria definitions
- Calibrated scale (1-5 recommended)
- Chain-of-thought: justification BEFORE score (improves reliability 15-25%)
### Pairwise Comparison Protocol
- First pass: A in first position
- Second pass: B in first position (swap)
- Consistency check: If passes disagree → TIE
- Final verdict: Consistent winner with averaged confidence
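The four steps above can be sketched as a wrapper around any judge function. Here `judge` is a placeholder taking (prompt, first_response, second_response) and returning which position won plus a confidence; both the name and the return shape are assumptions about your setup:

```python
from typing import Callable, Tuple

# judge(prompt, first, second) -> ("first" | "second", confidence in [0, 1])
Judge = Callable[[str, str, str], Tuple[str, float]]

def pairwise_verdict(judge: Judge, prompt: str, a: str, b: str) -> Tuple[str, float]:
    """Run both orderings; return ('A' | 'B' | 'TIE', averaged confidence)."""
    pick1, conf1 = judge(prompt, a, b)          # first pass: A in first position
    pick2, conf2 = judge(prompt, b, a)          # second pass: B in first position
    winner1 = "A" if pick1 == "first" else "B"
    winner2 = "B" if pick2 == "first" else "A"  # account for the swap
    if winner1 != winner2:
        return "TIE", 0.0                       # passes disagree -> tie
    return winner1, (conf1 + conf2) / 2         # consistent winner, averaged confidence
```

A judge that always prefers the first position produces a TIE here rather than a false winner, which is exactly the point of the swap.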
### Rubric Components
- Level descriptions with clear boundaries
- Observable characteristics per level
- Edge case guidance
- Strictness calibration (lenient/balanced/strict)
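These components map naturally onto a small data structure that renders into a judge prompt. A sketch only; the class and field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class RubricLevel:
    score: int
    description: str            # level description with a clear boundary
    indicators: list[str]       # observable characteristics at this level

@dataclass
class Rubric:
    criterion: str
    levels: list[RubricLevel]
    edge_cases: list[str] = field(default_factory=list)
    strictness: str = "balanced"        # "lenient" | "balanced" | "strict"

    def to_prompt(self) -> str:
        """Render the rubric as text for inclusion in a judge prompt."""
        lines = [f"Criterion: {self.criterion} (grading: {self.strictness})"]
        for level in sorted(self.levels, key=lambda lv: -lv.score):
            lines.append(f"{level.score}: {level.description}")
            lines.extend(f"  - {ind}" for ind in level.indicators)
        if self.edge_cases:
            lines.append("Edge cases: " + "; ".join(self.edge_cases))
        return "\n".join(lines)
```

Keeping the rubric as data rather than prose makes strictness calibration a one-field change instead of a prompt rewrite.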
## Integration
Works with:
- context-fundamentals - Effective context structure
- tool-design - Evaluation tool schemas
- evaluation (foundational) - Core evaluation concepts
For detailed implementation patterns, prompt templates, examples, and metrics: references/full-guide.md

See also: references/implementation-patterns.md, references/bias-mitigation.md, references/metrics-guide.md