Advanced Evaluation


This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.
Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

When to Activate


Activate this skill when:
  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards across evaluation teams
  • Debugging evaluation systems that show inconsistent results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation
  • Analyzing correlation between automated and human judgments

Core Concepts


The Evaluation Taxonomy


Evaluation approaches fall into two primary categories with distinct reliability profiles:
Direct Scoring: A single LLM rates one response on a defined scale.
  • Best for: Objective criteria (factual accuracy, instruction following, toxicity)
  • Reliability: Moderate to high for well-defined criteria
  • Failure mode: Score calibration drift, inconsistent scale interpretation
Pairwise Comparison: An LLM compares two responses and selects the better one.
  • Best for: Subjective preferences (tone, style, persuasiveness)
  • Reliability: Higher than direct scoring for preferences
  • Failure mode: Position bias, length bias
Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.

The Bias Landscape


LLM judges exhibit systematic biases that must be actively mitigated:
Position Bias: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.
Length Bias: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.
Self-Enhancement Bias: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or explicitly acknowledge the limitation when that is impossible.
Verbosity Bias: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.
Authority Bias: Confident, authoritative tone rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer.

Metric Selection Framework


Choose metrics based on the evaluation task structure:

| Task Type | Primary Metrics | Secondary Metrics |
| --- | --- | --- |
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |

The critical insight: the pattern of disagreement matters more than the absolute agreement rate. A judge that systematically disagrees with humans on specific criteria is more problematic than one whose errors are random noise.
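As a concrete instance of the agreement metrics above, Cohen's κ for a binary pass/fail judge fits in a few lines. A minimal pure-Python sketch (production pipelines would more likely call scikit-learn's `cohen_kappa_score`; the function name here is illustrative):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two label sequences."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independent raters with these marginals
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(rater_a) | set(rater_b)) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

human = ["pass", "pass", "fail", "pass"]
judge = ["pass", "fail", "fail", "pass"]
print(round(cohens_kappa(human, judge), 2))  # raw agreement 0.75, kappa 0.5
```

Note how κ (0.5) is far below the raw 75% agreement: chance-corrected metrics expose judges that look good only because one label dominates.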

Evaluation Approaches


Direct Scoring Implementation


Direct scoring requires three components: clear criteria, a calibrated scale, and a structured output format.
Criteria Definition Pattern:

```
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
```

Scale Calibration:
  • 1-3 scales: binary with a neutral option; lowest cognitive load
  • 1-5 scales: standard Likert; a good balance of granularity and reliability
  • 1-10 scales: high granularity but harder to calibrate; use only with detailed rubrics
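Once per-criterion scores come back, the weights from the criteria definitions combine them into a single number. A minimal sketch (function and key names are illustrative, not a fixed API):

```python
def weighted_score(scores, weights):
    """Combine per-criterion scores using relative weights (0-1 each).

    Weights need not sum to 1; they are normalized here so the result
    stays on the same scale as the individual scores.
    """
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

scores = {"factual_accuracy": 5, "clarity": 3}
weights = {"factual_accuracy": 1.0, "clarity": 0.5}
print(round(weighted_score(scores, weights), 2))  # (5*1.0 + 3*0.5) / 1.5 = 4.33
```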
Prompt Structure for Direct Scoring:

```markdown
You are an expert evaluator assessing response quality.

## Task
Evaluate the following response against each criterion.

## Original Prompt
{prompt}

## Response to Evaluate
{response}

## Criteria
{for each criterion: name, description, weight}

## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement

## Output Format
Respond with structured JSON containing scores, justifications, and summary.
```

**Chain-of-Thought Requirement**: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.
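The structured-output requirement is only useful if replies are actually validated. A minimal sketch of a parser that enforces the score range and the presence of a justification (function and field names are assumptions mirroring the example outputs in this document, not a fixed schema):

```python
import json

def parse_judgment(raw, max_score=5):
    """Validate a single-criterion judgment; raise on malformed output."""
    data = json.loads(raw)
    score = data["score"]
    if not (isinstance(score, int) and 1 <= score <= max_score):
        raise ValueError(f"score {score!r} outside 1-{max_score}")
    if not str(data.get("justification", "")).strip():
        # Reject score-only replies: the CoT requirement demands evidence
        raise ValueError("missing justification")
    return data

raw = '{"criterion": "Factual Accuracy", "score": 5, "justification": "axial tilt correctly identified"}'
print(parse_judgment(raw)["score"])  # 5
```

Rejecting malformed replies (and retrying the judge call) is usually cheaper than silently coercing out-of-range scores.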

Pairwise Comparison Implementation


Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.
Position Bias Mitigation Protocol:
  1. First pass: Response A in first position, Response B in second
  2. Second pass: Response B in first position, Response A in second
  3. Consistency check: If passes disagree, return TIE with reduced confidence
  4. Final verdict: Consistent winner with averaged confidence
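The four-step protocol above reduces to a small aggregation function. A sketch, assuming each pass returns a (winner, confidence) pair and that the second pass was run with the responses swapped:

```python
def aggregate_pairwise(pass1, pass2_raw):
    """Combine two judge passes; pass 2 was run with A/B contents swapped.

    Each pass is a (winner, confidence) tuple, winner in {"A", "B", "TIE"}.
    """
    swap = {"A": "B", "B": "A", "TIE": "TIE"}
    pass2 = (swap[pass2_raw[0]], pass2_raw[1])  # map labels back to originals
    if pass1[0] == pass2[0]:
        return {"winner": pass1[0],
                "confidence": round((pass1[1] + pass2[1]) / 2, 3)}
    return {"winner": "TIE", "confidence": 0.5}  # inconsistent passes -> tie

print(aggregate_pairwise(("B", 0.8), ("A", 0.6)))
# {'winner': 'B', 'confidence': 0.7}
```

This is the same arithmetic as Example 2 later in this document: the raw "A" verdict from the swapped pass maps back to B, the passes agree, and the confidences average to 0.7.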
Prompt Structure for Pairwise Comparison:

```markdown
You are an expert evaluator comparing two AI responses.

## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent

## Original Prompt
{prompt}

## Response A
{response_a}

## Response B
{response_b}

## Comparison Criteria
{criteria list}

## Instructions
1. Analyze each response independently first
2. Compare them on each criterion
3. Determine overall winner with confidence level

## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.
```

**Confidence Calibration**: Confidence scores should reflect position consistency:
- Both passes agree: confidence = average of individual confidences
- Passes disagree: confidence = 0.5, verdict = TIE

Rubric Generation


Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.
Rubric Components:
  1. Level descriptions: Clear boundaries for each score level
  2. Characteristics: Observable features that define each level
  3. Examples: Representative text for each level (optional but valuable)
  4. Edge cases: Guidance for ambiguous situations
  5. Scoring guidelines: General principles for consistent application
Strictness Calibration:
  • Lenient: Lower bar for passing scores, appropriate for encouraging iteration
  • Balanced: Fair, typical expectations for production use
  • Strict: High standards, appropriate for safety-critical or high-stakes evaluation
Domain Adaptation: Rubrics should use domain-specific terminology. A "code readability" rubric mentions variables, functions, and comments. A "medical accuracy" rubric references clinical terminology and evidence standards.
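A rubric in this shape is easy to sanity-check before use. A small sketch that validates the level structure shown in Example 3 (the key names match that example but are assumptions about your own rubric format):

```python
def validate_rubric(rubric, scale_max=5):
    """Check that rubric levels are in range, ordered, and described."""
    scores = [level["score"] for level in rubric["levels"]]
    assert scores == sorted(scores), "levels must be in ascending score order"
    assert all(1 <= s <= scale_max for s in scores), "score outside scale"
    for level in rubric["levels"]:
        assert level["description"].strip(), "level lacks a description"
        assert level["characteristics"], "level lacks observable characteristics"
    return True

rubric = {"levels": [
    {"score": 1, "description": "Poor", "characteristics": ["no naming"]},
    {"score": 3, "description": "Adequate", "characteristics": ["basic comments"]},
    {"score": 5, "description": "Excellent", "characteristics": ["modular"]},
]}
print(validate_rubric(rubric))  # True
```

Catching a malformed rubric at load time is far cheaper than discovering, after thousands of judge calls, that one level was never reachable.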

Practical Guidance


Evaluation Pipeline Design


Production evaluation systems require multiple layers:
```
┌─────────────────────────────────────────────────┐
│                 Evaluation Pipeline              │
├─────────────────────────────────────────────────┤
│                                                   │
│  Input: Response + Prompt + Context               │
│           │                                       │
│           ▼                                       │
│  ┌─────────────────────┐                         │
│  │   Criteria Loader   │ ◄── Rubrics, weights    │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Primary Scorer    │ ◄── Direct or Pairwise  │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Bias Mitigation   │ ◄── Position swap, etc. │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │ Confidence Scoring  │ ◄── Calibration         │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  Output: Scores + Justifications + Confidence     │
│                                                   │
└─────────────────────────────────────────────────┘
```
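The layered diagram maps naturally onto a chain of small functions. A structural sketch only: the scorer below is a stub standing in for a real LLM call, and every name is illustrative.

```python
def load_criteria():
    """Criteria Loader stage: rubrics and weights (hard-coded stub)."""
    return [{"name": "accuracy", "weight": 1.0}]

def primary_score(prompt, response, criteria):
    """Primary Scorer stage: a real implementation would call an LLM judge."""
    return {c["name"]: {"score": 4, "justification": "stubbed"} for c in criteria}

def mitigate_bias(scores):
    """Bias Mitigation stage: e.g. position swap or length normalization."""
    return scores

def attach_confidence(scores):
    """Confidence Scoring stage: calibrate and package the final output."""
    return {"scores": scores, "confidence": 0.9}

def evaluate(prompt, response):
    criteria = load_criteria()
    scores = primary_score(prompt, response, criteria)
    scores = mitigate_bias(scores)
    return attach_confidence(scores)

result = evaluate("What causes seasons?", "Earth's axial tilt.")
print(result["confidence"])  # 0.9
```

The value of this shape is that each layer can be tested and swapped independently: the bias-mitigation stage for pairwise evaluation, for instance, is exactly the position-swap aggregation sketched earlier.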

Common Anti-Patterns


Anti-pattern: Scoring without justification
  • Problem: Scores lack grounding, difficult to debug or improve
  • Solution: Always require evidence-based justification before score
Anti-pattern: Single-pass pairwise comparison
  • Problem: Position bias corrupts results
  • Solution: Always swap positions and check consistency
Anti-pattern: Overloaded criteria
  • Problem: Criteria measuring multiple things are unreliable
  • Solution: One criterion = one measurable aspect
Anti-pattern: Missing edge case guidance
  • Problem: Evaluators handle ambiguous cases inconsistently
  • Solution: Include edge cases in rubrics with explicit guidance
Anti-pattern: Ignoring confidence calibration
  • Problem: High-confidence wrong judgments are worse than low-confidence ones
  • Solution: Calibrate confidence to position consistency and evidence strength

Decision Framework: Direct vs. Pairwise


Use this decision tree:
```
Is there an objective ground truth?
├── Yes → Direct Scoring
│   └── Examples: factual accuracy, instruction following, format compliance
└── No → Is it a preference or quality judgment?
    ├── Yes → Pairwise Comparison
    │   └── Examples: tone, style, persuasiveness, creativity
    └── No → Consider reference-based evaluation
        └── Examples: summarization (compare to source), translation (compare to reference)
```
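The tree is mechanical enough to encode directly; a sketch, with return values naming the three approaches from the tree:

```python
def choose_method(has_ground_truth, is_preference_judgment):
    """Select an evaluation approach per the decision tree above."""
    if has_ground_truth:
        return "direct_scoring"       # factual accuracy, format compliance
    if is_preference_judgment:
        return "pairwise_comparison"  # tone, style, persuasiveness
    return "reference_based"          # summarization, translation

print(choose_method(True, False))   # direct_scoring
print(choose_method(False, True))   # pairwise_comparison
print(choose_method(False, False))  # reference_based
```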

Scaling Evaluation


For high-volume evaluation:
  1. Panel of LLMs (PoLL): Use multiple models as judges, aggregate votes
    • Reduces individual model bias
    • More expensive but more reliable for high-stakes decisions
  2. Hierarchical evaluation: a fast, cheap model for screening, an expensive model for edge cases
    • Cost-effective for large volumes
    • Requires calibration of the screening threshold
  3. Human-in-the-loop: automated evaluation for clear cases, human review for low-confidence ones
    • Best reliability for critical applications
    • Design a feedback loop to improve the automated evaluation
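For the PoLL pattern, vote aggregation needs a rule for split panels. A minimal sketch that returns TIE whenever no strict plurality winner exists:

```python
from collections import Counter

def poll_verdict(votes):
    """Aggregate pairwise verdicts ("A"/"B"/"TIE") from a panel of judges."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "TIE"  # top verdicts are tied: no strict plurality
    return counts[0][0]

print(poll_verdict(["A", "A", "B"]))  # A
print(poll_verdict(["A", "B"]))       # TIE
```

Odd panel sizes avoid most ties; with even panels or explicit TIE votes, the TIE path routes the case to a human or a stronger judge, matching the hierarchical and human-in-the-loop patterns above.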

Examples


Example 1: Direct Scoring for Accuracy


Input:

```
Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5
```

Output:

```json
{
  "criterion": "Factual Accuracy",
  "score": 5,
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct scientific reasoning. Both the axial tilt and its effect on sunlight distribution are correctly described.",
  "improvement": "Could add the specific tilt angle (23.5°) for completeness."
}
```

Example 2: Pairwise Comparison with Position Swap


Input:

```
Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]
```

First Pass (A first):

```json
{ "winner": "B", "confidence": 0.8 }
```

Second Pass (B first):

```json
{ "winner": "A", "confidence": 0.6 }
```

(Note: positions are swapped in this pass, so the raw verdict "A" refers to original Response B.)

Mapped Second Pass:

```json
{ "winner": "B", "confidence": 0.6 }
```

Final Result:

```json
{
  "winner": "B",
  "confidence": 0.7,
  "positionConsistency": {
    "consistent": true,
    "firstPassWinner": "B",
    "secondPassWinner": "B"
  }
}
```

Example 3: Rubric Generation


Input:

```
criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"
domain: "software engineering"
scale: "1-5"
strictness: "balanced"
```

Output (abbreviated):

```json
{
  "levels": [
    {
      "score": 1,
      "label": "Poor",
      "description": "Code is difficult to understand without significant effort",
      "characteristics": [
        "No meaningful variable or function names",
        "No comments or documentation",
        "Deeply nested or convoluted logic"
      ]
    },
    {
      "score": 3,
      "label": "Adequate",
      "description": "Code is understandable with some effort",
      "characteristics": [
        "Most variables have meaningful names",
        "Basic comments present for complex sections",
        "Logic is followable but could be cleaner"
      ]
    },
    {
      "score": 5,
      "label": "Excellent",
      "description": "Code is immediately clear and maintainable",
      "characteristics": [
        "All names are descriptive and consistent",
        "Comprehensive documentation",
        "Clean, modular structure"
      ]
    }
  ],
  "edgeCases": [
    {
      "situation": "Code is well-structured but uses domain-specific abbreviations",
      "guidance": "Score based on readability for domain experts, not general audience"
    }
  ]
}
```

Guidelines


  1. Always require justification before scores - Chain-of-thought prompting improves reliability by 15-25%
  2. Always swap positions in pairwise comparison - Single-pass comparison is corrupted by position bias
  3. Match scale granularity to rubric specificity - Don't use 1-10 without detailed level descriptions
  4. Separate objective and subjective criteria - Use direct scoring for objective, pairwise for subjective
  5. Include confidence scores - Calibrate to position consistency and evidence strength
  6. Define edge cases explicitly - Ambiguous situations cause the most evaluation variance
  7. Use domain-specific rubrics - Generic rubrics produce generic (less useful) evaluations
  8. Validate against human judgments - Automated evaluation is only valuable if it correlates with human assessment
  9. Monitor for systematic bias - Track disagreement patterns by criterion, response type, model
  10. Design for iteration - Evaluation systems improve with feedback loops

Integration


This skill integrates with:
  • context-fundamentals - Evaluation prompts require effective context structure
  • tool-design - Evaluation tools need proper schemas and error handling
  • context-optimization - Evaluation prompts can be optimized for token efficiency
  • evaluation (foundational) - This skill extends the foundational evaluation concepts

References


Internal reference:
  • LLM-as-Judge Implementation Patterns
  • Bias Mitigation Techniques
  • Metric Selection Guide
External research:
Related skills in this collection:
  • evaluation - Foundational evaluation concepts
  • context-fundamentals - Context structure for evaluation prompts
  • tool-design - Building evaluation tools


Skill Metadata


Created: 2024-12-24 Last Updated: 2024-12-24 Author: Muratcan Koylan Version: 1.0.0