Agent Evaluation Methods


Agent evaluation requires different approaches than traditional software testing. Agents are non-deterministic, may take different but equally valid paths, and often lack a single correct answer.

Key Finding: 95% Performance Drivers


Research on BrowseComp found three factors explain 95% of variance:
| Factor | Variance explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Implications: upgrading the model beats simply spending more tokens, and the dominance of token usage validates multi-agent architectures, which expand the effective token budget.

Multi-Dimensional Rubric


| Dimension | Excellent | Good | Acceptable | Failed |
|---|---|---|---|---|
| Factual accuracy | All correct | Minor errors | Some errors | Wrong |
| Completeness | All aspects | Most aspects | Key aspects | Missing |
| Citation accuracy | All match | Most match | Some match | Wrong |
| Tool efficiency | Optimal | Good | Adequate | Wasteful |
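Grades from the rubric can be reduced to a single number for pass/fail decisions. The sketch below is a hypothetical scoring helper: the numeric grade mapping and the equal dimension weights are assumptions, not from the text.

```python
# Hypothetical mapping from rubric grades to numbers; the values and the
# default equal weights are assumptions, tune them for your own rubric.
GRADE_SCORES = {"excellent": 1.0, "good": 0.75, "acceptable": 0.5, "failed": 0.0}
DIMENSIONS = ("factual_accuracy", "completeness", "citation_accuracy", "tool_efficiency")

def rubric_score(grades, weights=None):
    """Map per-dimension grades to numbers and return a weighted overall score."""
    weights = weights or {d: 1 / len(DIMENSIONS) for d in DIMENSIONS}
    return sum(GRADE_SCORES[grades[d]] * weights[d] for d in DIMENSIONS)
```

Weighting citation accuracy higher for research agents, say, is a one-line change to `weights`.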

LLM-as-Judge


```python
evaluation_prompt = """
Task: {task_description}
Agent Output: {agent_output}
Ground Truth: {ground_truth}

Evaluate on:
1. Factual accuracy (0-1)
2. Completeness (0-1)
3. Citation accuracy (0-1)
4. Tool efficiency (0-1)

Provide scores and reasoning.
"""
```

Test Set Design


```python
test_set = [
    {"name": "simple", "complexity": "simple",
     "input": "What is the capital of France?"},
    {"name": "medium", "complexity": "medium",
     "input": "Compare Apple and Microsoft revenue"},
    {"name": "complex", "complexity": "complex",
     "input": "Analyze Q1-Q4 sales trends"},
    {"name": "very_complex", "complexity": "very_complex",
     "input": "Research AI tech, evaluate impact, recommend strategy"}
]
```
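A cheap pre-flight check keeps the test set honest about stratification. The helper below is a hypothetical addition (names assumed); it reports any complexity level the set fails to cover.

```python
# Hypothetical coverage check: the level names follow the complexity
# stratification used in this document.
COMPLEXITY_LEVELS = ["simple", "medium", "complex", "very_complex"]

def coverage_gaps(test_set):
    """Return the complexity levels that have no test case."""
    present = {t["complexity"] for t in test_set}
    return [level for level in COMPLEXITY_LEVELS if level not in present]
```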

Evaluation Pipeline


```python
def evaluate_agent(agent, test_set):
    """Run each test case through the agent and score it with the LLM judge."""
    results = []
    for test in test_set:
        output = agent.run(test["input"])
        scores = llm_judge(output, test)  # multi-dimensional rubric scores
        results.append({
            "test": test["name"],
            "scores": scores,
            "passed": scores["overall"] >= 0.7,  # overall pass threshold
        })
    return results
```
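The per-test results usually feed a summary step. The aggregation below is a hypothetical addition, not from the text: it condenses `evaluate_agent()` output into a pass rate plus the list of failures worth inspecting.

```python
# Hypothetical aggregation over evaluate_agent() results.
def summarize(results):
    """Return the overall pass rate and the names of failed tests."""
    passed = sum(1 for r in results if r["passed"])
    return {
        "pass_rate": passed / len(results) if results else 0.0,
        "failed_tests": [r["test"] for r in results if not r["passed"]],
    }
```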

Complexity Stratification


| Level | Characteristics |
|---|---|
| Simple | Single tool call |
| Medium | Multiple tool calls |
| Complex | Many calls, ambiguity |
| Very complex | Extended interaction, deep reasoning |

Context Engineering Evaluation


Test context strategies systematically:
  1. Run agents with different strategies on same tests
  2. Compare quality scores, token usage, efficiency
  3. Identify degradation cliffs at different context sizes
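The three steps above can be sketched as a small comparison harness. Everything here is a hypothetical shape: `make_agent`, the strategy names, and the `tokens_used` attribute are placeholders for your own setup.

```python
# Hypothetical harness: run the same tests under each context strategy and
# tabulate average judge score against token cost.
def compare_strategies(make_agent, strategies, test_set, judge):
    """Return per-strategy average quality score and total token usage."""
    report = {}
    for strategy in strategies:
        agent = make_agent(strategy)  # e.g. compaction vs. truncation vs. none
        scores, tokens = [], 0
        for test in test_set:
            output = agent.run(test["input"])
            scores.append(judge(output, test)["overall"])
            tokens += agent.tokens_used
        report[strategy] = {
            "avg_score": sum(scores) / len(scores),
            "total_tokens": tokens,
        }
    return report
```

Plotting `avg_score` against context size per strategy is how degradation cliffs show up.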

Continuous Evaluation


  • Run evaluations on all agent changes
  • Track metrics over time
  • Set alerts for quality drops
  • Sample production interactions
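A minimal version of the "track metrics, alert on drops" bullets might look like the check below. The threshold and window are assumptions to tune, not values from the text.

```python
# Hypothetical regression alert: flag when the newest overall score falls
# more than `tolerance` below the trailing average of `window` earlier runs.
def quality_alert(history, tolerance=0.05, window=5):
    """history: chronological overall scores. True if the latest run regressed."""
    if len(history) < window + 1:
        return False  # not enough data to form a baseline
    baseline = sum(history[-window - 1:-1]) / window
    return history[-1] < baseline - tolerance
```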

Avoiding Pitfalls


| Pitfall | Solution |
|---|---|
| Path overfitting | Evaluate outcomes, not steps |
| Ignoring edge cases | Include diverse scenarios |
| Single metric | Use multi-dimensional rubrics |
| Ignoring context | Test realistic context sizes |
| No human review | Supplement automated evaluation |

Best Practices


  1. Use multi-dimensional rubrics
  2. Evaluate outcomes, not specific paths
  3. Cover complexity levels
  4. Test with realistic context sizes
  5. Run evaluations continuously
  6. Supplement LLM judging with human review
  7. Track metrics for trends
  8. Set clear pass/fail thresholds