
Agentic Evaluation Patterns


Patterns for self-improvement through iterative evaluation and refinement.

Overview


Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.
Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use


  • Quality-critical generation: Code, reports, analysis requiring high accuracy
  • Tasks with clear evaluation criteria: Defined success metrics exist
  • Content requiring specific standards: Style guides, compliance, formatting


Pattern 1: Basic Reflection


The agent evaluates and improves its own output through self-critique.

```python
import json

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with a reflection loop. Assumes `llm()` returns the model's text response."""
    output = llm(f"Complete this task:\n{task}")

    for _ in range(max_iterations):
        # Self-critique: rate the output against each criterion
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)

        critique_data = json.loads(critique)
        if all(c["status"] == "PASS" for c in critique_data.values()):
            return output

        # Refine based only on the failing criteria
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")

    return output
```

Key insight: Use structured JSON output for reliable parsing of critique results.
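The loop above expects the critique JSON to map each criterion to a status and feedback string; a hypothetical example of the shape it can parse (the criterion names here are illustrative):

```python
import json

# Hypothetical critique payload in the shape reflect_and_refine() expects
critique = json.loads("""
{
  "accuracy": {"status": "PASS", "feedback": ""},
  "clarity":  {"status": "FAIL", "feedback": "Define terms before using them."}
}
""")

failed = {k: v["feedback"] for k, v in critique.items() if v["status"] == "FAIL"}
print(failed)  # {'clarity': 'Define terms before using them.'}
```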


Pattern 2: Evaluator-Optimizer


Separate generation and evaluation into distinct components for clearer responsibilities.
```python
import json

class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold

    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")

    def evaluate(self, output: str, task: str) -> dict:
        # Structured scores make the stop condition checkable
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))

    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")

    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output
```
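To trace the control flow without a live model, the same loop can be exercised with stubbed components; the canned scores below are purely illustrative:

```python
class StubEvaluatorOptimizer:
    """Same run() loop as EvaluatorOptimizer, with canned components for illustration."""
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold
        self.scores = iter([0.5, 0.7, 0.9])  # pretend quality improves each round

    def generate(self, task: str) -> str:
        return "draft v1"

    def evaluate(self, output: str, task: str) -> dict:
        return {"overall_score": next(self.scores)}

    def optimize(self, output: str, feedback: dict) -> str:
        return output + " (revised)"

    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break  # good enough: 0.9 >= 0.8 on the third check
            output = self.optimize(output, evaluation)
        return output

print(StubEvaluatorOptimizer().run("write a summary"))  # draft v1 (revised) (revised)
```

Two rounds of optimization run before the score crosses the threshold, so the loop exits without a third revision.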


Pattern 3: Code-Specific Reflection


Test-driven refinement loop for code generation.
```python
class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")

        for _ in range(max_iterations):
            result = run_tests(code, tests)  # execute the generated tests against the code
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code
```
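`run_tests` is left undefined above. A minimal in-process sketch is shown below; this is an assumption, not a prescribed implementation, and a real harness should isolate execution (subprocess, container) rather than `exec` untrusted generated code:

```python
def run_tests(code: str, tests: str) -> dict:
    """Execute generated code, then assert-based tests, in a shared namespace.
    WARNING: no sandboxing; for illustration only."""
    namespace: dict = {}
    try:
        exec(code, namespace)    # define the functions under test
        exec(tests, namespace)   # plain `assert` statements raise on failure
        return {"success": True, "error": None}
    except Exception as e:
        return {"success": False, "error": f"{type(e).__name__}: {e}"}

print(run_tests("def add(a, b):\n    return a + b", "assert add(2, 2) == 4"))
# {'success': True, 'error': None}
```

The returned `error` string feeds straight back into the "Fix error: ..." prompt in the loop above.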


Evaluation Strategies


Outcome-Based


Evaluate whether output achieves the expected result.
```python
def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")
```

LLM-as-Judge


Use LLM to compare and rank outputs.
python
def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    return llm(f"Compare outputs A and B for {criteria}. Which is better and why?")
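LLM judges tend to favor whichever output appears first (position bias). A common mitigation is to judge both orderings and accept only a verdict that survives the swap; in this sketch, `judge(first, second)` is a hypothetical wrapper that returns `"first"` or `"second"`:

```python
def debiased_judge(output_a: str, output_b: str, judge) -> str:
    """Run the comparison in both orders; disagreement counts as a tie."""
    v1 = judge(output_a, output_b)  # A shown in first position
    v2 = judge(output_b, output_a)  # B shown in first position
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"  # the verdict flipped with position, so don't trust it
```

With a consistent judge the winner is stable; with a judge that always prefers the first slot, the result degrades to a tie instead of a false win.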

Rubric-Based


Score outputs against weighted dimensions.
```python
import json

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}\nOutput: {output}"))
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5  # normalize to 0-1
```
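With the rubric above and hypothetical per-dimension scores, the weighted calculation works out as follows:

```python
RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}
scores = {"accuracy": 4, "clarity": 5, "completeness": 3}  # hypothetical 1-5 ratings

# (4*0.4 + 5*0.3 + 3*0.3) / 5 = 4.0 / 5 = 0.8
weighted = sum(scores[d] * RUBRIC[d]["weight"] for d in RUBRIC) / 5
print(round(weighted, 3))  # 0.8
```

Dividing by 5 maps the 1-5 rubric scale onto the 0-1 range used by the score threshold in the patterns above.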


Best Practices


| Practice | Rationale |
| --- | --- |
| Clear criteria | Define specific, measurable evaluation criteria upfront |
| Iteration limits | Set max iterations (3-5) to prevent infinite loops |
| Convergence check | Stop if the output score isn't improving between iterations |
| Log history | Keep the full trajectory for debugging and analysis |
| Structured output | Use JSON for reliable parsing of evaluation results |
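The convergence check can be a small guard over the logged score history; the scores and `min_delta` threshold below are illustrative:

```python
def has_converged(scores: list[float], min_delta: float = 0.02) -> bool:
    """True once the latest iteration stopped improving meaningfully."""
    return len(scores) >= 2 and scores[-1] - scores[-2] < min_delta

history: list[float] = []
for score in [0.55, 0.70, 0.71]:  # hypothetical evaluation scores per iteration
    history.append(score)
    if has_converged(history):
        break  # exit the refinement loop early instead of burning iterations

print(len(history))  # 3
```

Here the third iteration improves by only 0.01, under the 0.02 threshold, so the loop stops even though the max-iteration budget isn't exhausted.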


Quick Start Checklist



Evaluation Implementation Checklist


Setup


  • Define evaluation criteria/rubric
  • Set score threshold for "good enough"
  • Configure max iterations (default: 3)

Implementation


  • Implement generate() function
  • Implement evaluate() function with structured output
  • Implement optimize() function
  • Wire up the refinement loop

Safety


  • Add convergence detection
  • Log all iterations for debugging
  • Handle evaluation parse failures gracefully
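For the last item, a tolerant parser keeps one malformed evaluation from crashing the whole loop. This is a sketch; the fence-stripping heuristic and the failing fallback score are assumptions, not a fixed recipe:

```python
import json

def parse_evaluation(raw: str) -> dict:
    """Parse an evaluation response, tolerating markdown code fences around the
    JSON; fall back to a failing score instead of raising, so the refinement
    loop simply runs another iteration."""
    candidates = (raw, raw.strip().strip("`").removeprefix("json"))
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return {"overall_score": 0.0, "error": "unparseable evaluation"}
```

For example, `parse_evaluation('```json\n{"overall_score": 0.9}\n```')` recovers the fenced payload, while garbage input yields a zero score that the loop treats as "needs another pass".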