Agent Evaluation Methods


Agent evaluation requires different approaches than traditional software testing. Agents are non-deterministic, may take different but equally valid paths, and often lack a single correct answer.

Key Finding: 95% Performance Drivers


Research on BrowseComp found three factors explain 95% of variance:
| Factor | Variance explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Implications: upgrading the model beats simply spending more tokens, and the dominance of token usage validates multi-agent architectures, which expand the effective token budget.

Multi-Dimensional Rubric


| Dimension | Excellent | Good | Acceptable | Failed |
|---|---|---|---|---|
| Factual accuracy | All correct | Minor errors | Some errors | Wrong |
| Completeness | All aspects | Most aspects | Key aspects | Missing |
| Citation accuracy | All match | Most match | Some match | Wrong |
| Tool efficiency | Optimal | Good | Adequate | Wasteful |
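Grades from the rubric can be reduced to a single number for pass/fail decisions. The sketch below is a hypothetical scoring helper: the numeric grade mapping and the equal dimension weights are assumptions, not from the text.

```python
# Hypothetical mapping from rubric grades to numbers; the values and the
# default equal weights are assumptions, tune them for your own rubric.
GRADE_SCORES = {"excellent": 1.0, "good": 0.75, "acceptable": 0.5, "failed": 0.0}
DIMENSIONS = ("factual_accuracy", "completeness", "citation_accuracy", "tool_efficiency")

def rubric_score(grades, weights=None):
    """Map per-dimension grades to numbers and return a weighted overall score."""
    weights = weights or {d: 1 / len(DIMENSIONS) for d in DIMENSIONS}
    return sum(GRADE_SCORES[grades[d]] * weights[d] for d in DIMENSIONS)
```

Weighting citation accuracy higher for research agents, say, is a one-line change to `weights`.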

LLM-as-Judge


```python
evaluation_prompt = """
Task: {task_description}
Agent Output: {agent_output}
Ground Truth: {ground_truth}

Evaluate on:
1. Factual accuracy (0-1)
2. Completeness (0-1)
3. Citation accuracy (0-1)
4. Tool efficiency (0-1)

Provide scores and reasoning.
"""
```

Test Set Design


```python
test_set = [
    {"name": "simple", "complexity": "simple",
     "input": "What is the capital of France?"},
    {"name": "medium", "complexity": "medium",
     "input": "Compare Apple and Microsoft revenue"},
    {"name": "complex", "complexity": "complex",
     "input": "Analyze Q1-Q4 sales trends"},
    {"name": "very_complex", "complexity": "very_complex",
     "input": "Research AI tech, evaluate impact, recommend strategy"}
]
```
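A cheap pre-flight check keeps the test set honest about stratification. The helper below is a hypothetical addition (names assumed); it reports any complexity level the set fails to cover.

```python
# Hypothetical coverage check: the level names follow the complexity
# stratification used in this document.
COMPLEXITY_LEVELS = ["simple", "medium", "complex", "very_complex"]

def coverage_gaps(test_set):
    """Return the complexity levels that have no test case."""
    present = {t["complexity"] for t in test_set}
    return [level for level in COMPLEXITY_LEVELS if level not in present]
```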

Evaluation Pipeline


```python
def evaluate_agent(agent, test_set):
    """Run each test case through the agent and score it with the LLM judge."""
    results = []
    for test in test_set:
        output = agent.run(test["input"])
        scores = llm_judge(output, test)  # multi-dimensional rubric scores
        results.append({
            "test": test["name"],
            "scores": scores,
            "passed": scores["overall"] >= 0.7,  # overall pass threshold
        })
    return results
```
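The per-test results usually feed a summary step. The aggregation below is a hypothetical addition, not from the text: it condenses `evaluate_agent()` output into a pass rate plus the list of failures worth inspecting.

```python
# Hypothetical aggregation over evaluate_agent() results.
def summarize(results):
    """Return the overall pass rate and the names of failed tests."""
    passed = sum(1 for r in results if r["passed"])
    return {
        "pass_rate": passed / len(results) if results else 0.0,
        "failed_tests": [r["test"] for r in results if not r["passed"]],
    }
```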

Complexity Stratification


| Level | Characteristics |
|---|---|
| Simple | Single tool call |
| Medium | Multiple tool calls |
| Complex | Many calls, ambiguity |
| Very complex | Extended interaction, deep reasoning |

Context Engineering Evaluation


Test context strategies systematically:
  1. Run agents with different strategies on same tests
  2. Compare quality scores, token usage, efficiency
  3. Identify degradation cliffs at different context sizes
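The three steps above can be sketched as a small comparison harness. Everything here is a hypothetical shape: `make_agent`, the strategy names, and the `tokens_used` attribute are placeholders for your own setup.

```python
# Hypothetical harness: run the same tests under each context strategy and
# tabulate average judge score against token cost.
def compare_strategies(make_agent, strategies, test_set, judge):
    """Return per-strategy average quality score and total token usage."""
    report = {}
    for strategy in strategies:
        agent = make_agent(strategy)  # e.g. compaction vs. truncation vs. none
        scores, tokens = [], 0
        for test in test_set:
            output = agent.run(test["input"])
            scores.append(judge(output, test)["overall"])
            tokens += agent.tokens_used
        report[strategy] = {
            "avg_score": sum(scores) / len(scores),
            "total_tokens": tokens,
        }
    return report
```

Plotting `avg_score` against context size per strategy is how degradation cliffs show up.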

Continuous Evaluation


  • Run evaluations on all agent changes
  • Track metrics over time
  • Set alerts for quality drops
  • Sample production interactions
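A minimal version of the "track metrics, alert on drops" bullets might look like the check below. The threshold and window are assumptions to tune, not values from the text.

```python
# Hypothetical regression alert: flag when the newest overall score falls
# more than `tolerance` below the trailing average of `window` earlier runs.
def quality_alert(history, tolerance=0.05, window=5):
    """history: chronological overall scores. True if the latest run regressed."""
    if len(history) < window + 1:
        return False  # not enough data to form a baseline
    baseline = sum(history[-window - 1:-1]) / window
    return history[-1] < baseline - tolerance
```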

Avoiding Pitfalls


| Pitfall | Solution |
|---|---|
| Path overfitting | Evaluate outcomes, not steps |
| Ignoring edge cases | Include diverse scenarios |
| Single metric | Use multi-dimensional rubrics |
| Ignoring context | Test realistic context sizes |
| No human review | Supplement automated evaluation |

Best Practices


  1. Use multi-dimensional rubrics
  2. Evaluate outcomes, not specific paths
  3. Cover complexity levels
  4. Test with realistic context sizes
  5. Run evaluations continuously
  6. Supplement LLM judging with human review
  7. Track metrics for trends
  8. Set clear pass/fail thresholds