# Agent Evaluation Methods
Agent evaluation requires different approaches than traditional software testing. Agents are non-deterministic, may take different but equally valid paths, and often lack a single correct answer.
## Key Finding: 95% Performance Drivers
Research on the BrowseComp benchmark found that three factors explain 95% of the variance in agent performance:
| Factor | Variance explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Implications: upgrading the model beats simply raising token budgets, and the outsized role of token usage helps validate multi-agent architectures, which spend more tokens in parallel.
## Multi-Dimensional Rubric
| Dimension | Excellent | Good | Acceptable | Failed |
|---|---|---|---|---|
| Factual accuracy | All correct | Minor errors | Some errors | Wrong |
| Completeness | All aspects | Most aspects | Key aspects | Missing |
| Citation accuracy | All match | Most match | Some match | Wrong |
| Tool efficiency | Optimal | Good | Adequate | Wasteful |
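One way to operationalize the rubric is to map each categorical grade to a numeric score and average across dimensions. The numeric values below are an illustrative assumption, not part of the rubric itself:

```python
# Map rubric grades to numeric scores (illustrative values, not canonical).
GRADE_SCORES = {"excellent": 1.0, "good": 0.75, "acceptable": 0.5, "failed": 0.0}

def rubric_score(grades: dict) -> float:
    """Average the per-dimension grades into a single 0-1 score."""
    return sum(GRADE_SCORES[g] for g in grades.values()) / len(grades)

grades = {
    "factual_accuracy": "excellent",
    "completeness": "good",
    "citation_accuracy": "good",
    "tool_efficiency": "acceptable",
}
print(rubric_score(grades))  # 0.75
```

Weighting the dimensions unequally (e.g. factual accuracy counting double) is a natural extension once you know which failures matter most for your use case.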
## LLM-as-Judge
```python
evaluation_prompt = """
Task: {task_description}
Agent Output: {agent_output}
Ground Truth: {ground_truth}
Evaluate on:
1. Factual accuracy (0-1)
2. Completeness (0-1)
3. Citation accuracy (0-1)
4. Tool efficiency (0-1)
Provide scores and reasoning.
"""
```

## Test Set Design
```python
test_set = [
    {"name": "simple", "complexity": "simple",
     "input": "What is the capital of France?"},
    {"name": "medium", "complexity": "medium",
     "input": "Compare Apple and Microsoft revenue"},
    {"name": "complex", "complexity": "complex",
     "input": "Analyze Q1-Q4 sales trends"},
    {"name": "very_complex", "complexity": "very_complex",
     "input": "Research AI tech, evaluate impact, recommend strategy"}
]
```

## Evaluation Pipeline
```python
def evaluate_agent(agent, test_set):
    results = []
    for test in test_set:
        output = agent.run(test["input"])
        scores = llm_judge(output, test)
        results.append({
            "test": test["name"],
            "scores": scores,
            "passed": scores["overall"] >= 0.7
        })
    return results
```
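The pipeline assumes an `llm_judge` helper. A minimal sketch, assuming the judge model is reached through a `call_llm(prompt)` function and replies with JSON scores — both the stub below and the response format are assumptions, not part of the original:

```python
import json

# Condensed version of the evaluation prompt from the LLM-as-Judge section.
EVAL_PROMPT = """Task: {task_description}
Agent Output: {agent_output}
Ground Truth: {ground_truth}
Score factual accuracy, completeness, citation accuracy,
and tool efficiency (each 0-1). Reply as JSON."""

def call_llm(prompt):
    """Stub for a real judge-model call; returns a canned JSON reply."""
    return json.dumps({"factual_accuracy": 0.9, "completeness": 0.8,
                       "citation_accuracy": 1.0, "tool_efficiency": 0.7})

def llm_judge(agent_output, test):
    """Fill the rubric prompt and reduce the judge's scores to an overall value."""
    prompt = EVAL_PROMPT.format(task_description=test["input"],
                                agent_output=agent_output,
                                ground_truth=test.get("ground_truth", "N/A"))
    scores = json.loads(call_llm(prompt))
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores
```

In production, `call_llm` would wrap your actual LLM client, and the JSON reply should be validated before use, since judge models do not always return well-formed output.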
## Complexity Stratification
| Level | Characteristics |
|---|---|
| Simple | Single tool call |
| Medium | Multiple tool calls |
| Complex | Many calls, ambiguity |
| Very Complex | Extended interaction, deep reasoning |
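One practical use of the tiers is attaching a tool-call budget to each level, so the efficiency dimension can be checked mechanically. The budget numbers below are hypothetical and should be tuned to your own agent:

```python
# Hypothetical tool-call budgets per complexity level (tune for your agent).
TOOL_CALL_BUDGET = {"simple": 1, "medium": 5, "complex": 15, "very_complex": 50}

def within_budget(complexity: str, tool_calls: int) -> bool:
    """Flag runs that exceed the expected exploration budget for their tier."""
    return tool_calls <= TOOL_CALL_BUDGET[complexity]

print(within_budget("simple", 1))   # True
print(within_budget("medium", 12))  # False
```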
## Context Engineering Evaluation
Test context strategies systematically:
- Run agents with different strategies on same tests
- Compare quality scores, token usage, efficiency
- Identify degradation cliffs at different context sizes
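The steps above can be sketched as a single comparison loop. Everything here is scaffolding: `agent_factory`, the strategy names, and the assumption that `agent.run` reports token usage alongside its output are all placeholders for your own harness.

```python
def compare_context_strategies(agent_factory, strategies, test_set, judge):
    """Run the same test set under each context strategy; tally quality vs. token cost."""
    report = {}
    for strategy in strategies:
        agent = agent_factory(strategy)  # build an agent configured with this strategy
        total_score, total_tokens = 0.0, 0
        for test in test_set:
            output, tokens_used = agent.run(test["input"])  # assumes run() reports tokens
            total_score += judge(output, test)["overall"]
            total_tokens += tokens_used
        report[strategy] = {"avg_score": total_score / len(test_set),
                            "total_tokens": total_tokens}
    return report
```

Plotting `avg_score` against `total_tokens` per strategy is what exposes the degradation cliffs: quality that holds steady up to some context size, then drops sharply.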
## Continuous Evaluation
- Run evaluations on all agent changes
- Track metrics over time
- Set alerts for quality drops
- Sample production interactions
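A minimal regression gate for the alerting step might look like the following; the window size and drop threshold are arbitrary example values, not recommendations.

```python
def check_regression(history, window=5, drop_threshold=0.05):
    """Alert when the latest average quality score drops below the recent baseline."""
    if len(history) < window + 1:
        return False  # not enough data to establish a baseline
    baseline = sum(history[-(window + 1):-1]) / window
    return history[-1] < baseline - drop_threshold

scores = [0.82, 0.84, 0.83, 0.85, 0.84, 0.71]
print(check_regression(scores))  # True: 0.71 is well below the ~0.836 baseline
```

Hooked into CI or a production sampler, a `True` result would page whoever owns the agent rather than silently logging.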
## Avoiding Pitfalls
| Pitfall | Solution |
|---|---|
| Path overfitting | Evaluate outcomes, not steps |
| Ignoring edge cases | Include diverse scenarios |
| Single metric | Multi-dimensional rubrics |
| Ignoring context | Test realistic context sizes |
| No human review | Supplement automated eval |
## Best Practices
- Use multi-dimensional rubrics
- Evaluate outcomes, not specific paths
- Cover complexity levels
- Test with realistic context sizes
- Run evaluations continuously
- Supplement LLM with human review
- Track metrics for trends
- Set clear pass/fail thresholds