# Agent Evaluation (AI Agent Evals)

Based on Anthropic's "Demystifying evals for AI agents".
## When to use this skill
- Designing evaluation systems for AI agents
- Building benchmarks for coding, conversational, or research agents
- Creating graders (code-based, model-based, human)
- Implementing production monitoring for AI systems
- Setting up CI/CD pipelines with automated evals
- Debugging agent performance issues
- Measuring agent improvement over time
## Core Concepts
### Eval Evolution: Single-turn → Multi-turn → Agentic
| Type | Turns | State | Grading | Complexity |
|---|---|---|---|---|
| Single-turn | 1 | None | Simple | Low |
| Multi-turn | N | Conversation | Per-turn | Medium |
| Agentic | N | World + History | Outcome | High |
### 7 Key Terms
| Term | Definition |
|---|---|
| Task | Single test case (prompt + expected outcome) |
| Trial | One agent run on a task |
| Grader | Scoring function (code/model/human) |
| Transcript | Full record of agent actions |
| Outcome | Final state for grading |
| Harness | Infrastructure running evals |
| Suite | Collection of related tasks |
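These terms map naturally onto a small data model. The sketch below is illustrative only; the class names, fields, and harness function are assumptions, not definitions from the article:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    """Single test case: prompt plus expected outcome."""
    id: str
    prompt: str
    expected_outcome: dict

@dataclass
class Trial:
    """One agent run on a task, with its transcript and final outcome."""
    task_id: str
    transcript: list = field(default_factory=list)
    outcome: dict = field(default_factory=dict)

# A grader is just a scoring function over a task and one trial
Grader = Callable[[Task, Trial], float]

def run_suite(suite: list[Task], run_agent, grader: Grader) -> dict[str, float]:
    """Harness: run every task in the suite once and grade each outcome."""
    return {task.id: grader(task, run_agent(task)) for task in suite}
```

In a real harness, `run_agent` would launch the agent in an isolated environment and record its transcript; here it is just a callable so the data flow is visible.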
## Instructions
### Step 1: Understand Grader Types
#### Code-based Graders (Recommended for Coding Agents)
- Pros: Fast, objective, reproducible
- Cons: Requires clear success criteria
- Best for: Coding agents, structured outputs
```python
# Example: Code-based grader
def grade_task(outcome: dict) -> float:
    """Grade a coding task by the fraction of tests passed."""
    tests_passed = outcome.get("tests_passed", 0)
    total_tests = outcome.get("total_tests", 1)
    return tests_passed / total_tests
```
```python
# SWE-bench style grader
import subprocess

def grade_swe_bench(repo_path: str, test_spec: dict) -> bool:
    """Run tests and check whether the patch resolves the issue."""
    result = subprocess.run(
        ["pytest", test_spec["test_file"]],
        cwd=repo_path,
        capture_output=True,
    )
    return result.returncode == 0
```
#### Model-based Graders (LLM-as-Judge)
- Pros: Flexible, handles nuance
- Cons: Requires calibration, can be inconsistent
- Best for: Conversational agents, open-ended tasks
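A judge model scores each rubric dimension separately; combining those per-dimension scores with the rubric weights can be sketched as follows (the function and argument shapes are illustrative assumptions; the judge call itself is omitted):

```python
def weighted_rubric_score(rubric: dict, dimension_scores: dict) -> float:
    """Combine per-dimension judge scores (1-5) into one weighted score.

    `rubric` holds a list of dimensions, each with a name and a weight;
    the weights are assumed to sum to 1.0.
    """
    total = 0.0
    for dim in rubric["dimensions"]:
        total += dim["weight"] * dimension_scores[dim["name"]]
    return total
```

For example, with weights 0.3/0.5/0.2 and scores of 5, 4, and 3, the combined score is 4.1.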
```yaml
# Example: LLM rubric for a customer support agent
rubric:
  dimensions:
    - name: empathy
      weight: 0.3
      scale: 1-5
      criteria: |
        5: Acknowledges emotions, uses warm language
        3: Polite but impersonal
        1: Cold or dismissive
    - name: resolution
      weight: 0.5
      scale: 1-5
      criteria: |
        5: Fully resolves issue
        3: Partial resolution
        1: No resolution
    - name: efficiency
      weight: 0.2
      scale: 1-5
      criteria: |
        5: Resolved in minimal turns
        3: Reasonable turns
        1: Excessive back-and-forth
```
#### Human Graders
- Pros: Highest accuracy, catches edge cases
- Cons: Expensive, slow, not scalable
- Best for: Final validation, ambiguous cases
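Human grading is usually operationalized as a review queue: sample ambiguous trials and export them for annotation. A minimal sketch, where the 0.3-0.7 "ambiguous score" band and the CSV fields are illustrative assumptions:

```python
import csv
import json

def export_for_human_review(trials: list[dict], path: str,
                            max_items: int = 50) -> int:
    """Write ambiguous trials to a CSV that human graders can annotate.

    A trial is treated as ambiguous when its automated score falls in
    the middle of the range: neither a clear pass nor a clear fail.
    Returns the number of trials exported.
    """
    ambiguous = [t for t in trials if 0.3 <= t["score"] <= 0.7]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["task_id", "score", "transcript", "human_grade"])
        for t in ambiguous[:max_items]:
            # Leave the human_grade column blank for the annotator
            writer.writerow([t["task_id"], t["score"],
                             json.dumps(t["transcript"]), ""])
    return len(ambiguous[:max_items])
```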
### Step 2: Choose Strategy by Agent Type
#### 2.1 Coding Agents
Benchmarks:
- SWE-bench Verified: Real GitHub issues (40% → 80%+ achievable)
- Terminal-Bench: Complex terminal tasks
- Custom test suites with your codebase
Grading Strategy:
```python
def grade_coding_agent(task: dict, outcome: dict) -> dict:
    return {
        "tests_passed": run_test_suite(outcome["code"]),
        "lint_score": run_linter(outcome["code"]),
        "builds": check_build(outcome["code"]),
        "matches_spec": compare_to_reference(task["spec"], outcome["code"]),
    }
```

Key Metrics:
- Test passage rate
- Build success
- Lint/style compliance
- Diff size (smaller is better)
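These signals can be folded into one score for dashboards or CI gates. A minimal sketch, where the weights and the diff-size penalty are illustrative assumptions rather than recommended values:

```python
def aggregate_coding_score(metrics: dict, diff_lines: int) -> float:
    """Weighted blend of test passage, build success, and lint score,
    with a mild penalty for very large diffs (smaller is better)."""
    score = (0.6 * metrics["tests_passed"]
             + 0.2 * (1.0 if metrics["builds"] else 0.0)
             + 0.2 * metrics["lint_score"])
    penalty = min(diff_lines / 1000, 0.1)  # cap the diff penalty at 0.1
    return max(score - penalty, 0.0)
```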
#### 2.2 Conversational Agents
Benchmarks:
- τ2-Bench: Multi-domain conversation
- Custom domain-specific suites
Grading Strategy (Multi-dimensional):
```yaml
success_criteria:
  - empathy_score: ">= 4.0"
  - resolution_rate: ">= 0.9"
  - avg_turns: "<= 5"
  - escalation_rate: "<= 0.1"
```

Key Metrics:
- Task resolution rate
- Customer satisfaction proxy
- Turn efficiency
- Escalation rate
#### 2.3 Research Agents
Grading Dimensions:
- Grounding: Claims backed by sources
- Coverage: All aspects addressed
- Source Quality: Authoritative sources used
```python
def grade_research_agent(task: dict, outcome: dict) -> dict:
    return {
        "grounding": check_citations(outcome["report"]),
        "coverage": check_topic_coverage(task["topics"], outcome["report"]),
        "source_quality": score_sources(outcome["sources"]),
        "factual_accuracy": verify_claims(outcome["claims"]),
    }
```

#### 2.4 Computer Use Agents
Benchmarks:
- WebArena: Web navigation tasks
- OSWorld: Desktop environment tasks
Grading Strategy:
```python
def grade_computer_use(task: dict, outcome: dict) -> dict:
    return {
        "ui_state": verify_ui_state(outcome["screenshot"]),
        "db_state": verify_database(task["expected_db_state"]),
        "file_state": verify_files(task["expected_files"]),
        "success": all_conditions_met(task, outcome),
    }
```

### Step 3: Follow the 8-Step Roadmap
#### Step 0: Start Early (20-50 Tasks)
```bash
# Create initial eval suite structure
mkdir -p evals/{tasks,results,graders}

# Start with representative tasks:
# - Common use cases (60%)
# - Edge cases (20%)
# - Failure modes (20%)
```

#### Step 1: Convert Manual Tests
```python
# Transform existing QA tests into eval tasks
def convert_qa_to_eval(qa_case: dict) -> dict:
    return {
        "id": qa_case["id"],
        "prompt": qa_case["input"],
        "expected_outcome": qa_case["expected"],
        "grader": "code" if qa_case["has_tests"] else "model",
        "tags": qa_case.get("tags", []),
    }
```

#### Step 2: Ensure Clarity + Reference Solutions
```yaml
# Good task definition
task:
  id: "api-design-001"
  prompt: |
    Design a REST API for user management with:
    - CRUD operations
    - Authentication via JWT
    - Rate limiting
  reference_solution: "./solutions/api-design-001/"
  success_criteria:
    - "All endpoints documented"
    - "Auth middleware present"
    - "Rate limit config exists"
```

#### Step 3: Balance Positive/Negative Cases
```python
# Ensure eval suite balance
suite_composition = {
    "positive_cases": 0.5,  # Should succeed
    "negative_cases": 0.3,  # Should fail gracefully
    "edge_cases": 0.2,      # Boundary conditions
}
```

#### Step 4: Isolate Environments
```yaml
# Docker-based isolation for coding evals
eval_environment:
  type: docker
  image: "eval-sandbox:latest"
  timeout: 300s
  resources:
    memory: "4g"
    cpu: "2"
  network: isolated
  cleanup: always
```

#### Step 5: Focus on Outcomes, Not Paths
```python
# GOOD: Outcome-focused grader
def grade_outcome(expected: dict, actual: dict) -> float:
    return compare_final_states(expected, actual)

# BAD: Path-focused grader (too brittle)
def grade_path(expected_steps: list, actual_steps: list) -> float:
    return step_by_step_match(expected_steps, actual_steps)
```

#### Step 6: Always Read Transcripts
```python
# Transcript analysis for debugging
def analyze_transcript(transcript: list) -> dict:
    return {
        "total_steps": len(transcript),
        "tool_usage": count_tool_calls(transcript),
        "errors": extract_errors(transcript),
        "decision_points": find_decision_points(transcript),
        "recovery_attempts": find_recovery_patterns(transcript),
    }
```

#### Step 7: Monitor Eval Saturation
```python
# Detect when evals are no longer useful
def check_saturation(results: list, window: int = 10) -> dict:
    recent = results[-window:]
    saturated = all(r["passed"] for r in recent)
    return {
        "pass_rate": sum(r["passed"] for r in recent) / len(recent),
        "variance": calculate_variance(recent),
        "is_saturated": saturated,
        "recommendation": "Add harder tasks" if saturated else "Continue",
    }
```

#### Step 8: Long-term Maintenance
```yaml
# Eval suite maintenance checklist
maintenance:
  weekly:
    - Review failed evals for false negatives
    - Check for flaky tests
  monthly:
    - Add new edge cases from production issues
    - Retire saturated evals
    - Update reference solutions
  quarterly:
    - Full benchmark recalibration
    - Team contribution review
```

### Step 4: Integrate with Production
#### CI/CD Integration
```yaml
# GitHub Actions example
name: Agent Evals
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evals
        run: |
          python run_evals.py --suite=core --mode=compact
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
```

#### Production Monitoring
```python
# Real-time eval sampling
import random

class ProductionMonitor:
    def __init__(self, sample_rate: float = 0.1, threshold: float = 0.5):
        self.sample_rate = sample_rate
        self.threshold = threshold  # alert below this score

    async def monitor(self, request, response):
        if random.random() < self.sample_rate:
            eval_result = await self.run_eval(request, response)
            self.log_result(eval_result)
            if eval_result["score"] < self.threshold:
                self.alert("Low quality response detected")
```

#### A/B Testing
```python
# Compare agent versions
def run_ab_test(suite: str, versions: list) -> dict:
    results = {}
    for version in versions:
        results[version] = run_eval_suite(suite, agent_version=version)
    return {
        "comparison": compare_results(results),
        "winner": determine_winner(results),
        "confidence": calculate_confidence(results),
    }
```

## Best Practices
### Do's ✅
- Start with 20-50 representative tasks
- Use code-based graders when possible
- Focus on outcomes, not paths
- Read transcripts for debugging
- Monitor for eval saturation
- Balance positive/negative cases
- Isolate eval environments
- Version your eval suites
### Don'ts ❌
- Don't over-rely on model-based graders without calibration
- Don't ignore failed evals (false negatives exist)
- Don't grade on intermediate steps
- Don't skip transcript analysis
- Don't use production data without sanitization
- Don't let eval suites become stale
## Success Patterns
### Pattern 1: Graduated Eval Complexity
```
Level 1: Unit evals (single capability)
Level 2: Integration evals (combined capabilities)
Level 3: End-to-end evals (full workflows)
Level 4: Adversarial evals (edge cases)
```

### Pattern 2: Eval-Driven Development
```
1. Write eval task for new feature
2. Run eval (expect failure)
3. Implement feature
4. Run eval (expect pass)
5. Add to regression suite
```

### Pattern 3: Continuous Calibration
```
Weekly: Review grader accuracy
Monthly: Update rubrics based on feedback
Quarterly: Full grader audit with human baseline
```

## Troubleshooting
**Problem:** Eval scores at 100%
**Solution:** Add harder tasks; check for eval saturation (Step 7)
**Problem:** Inconsistent model-based grader scores
**Solution:** Add more examples to the rubric, use structured output, ensemble graders
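The "ensemble graders" fix can be as simple as running the judge several times and taking the median, which is robust to a single outlier judgment in a way the mean is not. A minimal sketch, assuming a `judge` callable that returns a numeric score for a transcript:

```python
import statistics

def ensemble_grade(judge, transcript, n: int = 5) -> float:
    """Call a (possibly noisy) judge n times and return the median score."""
    scores = [judge(transcript) for _ in range(n)]
    return statistics.median(scores)
```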
**Problem:** Evals too slow for CI
**Solution:** Use compact mode, parallelize, sample a subset for PR checks
**Problem:** Agent passes evals but fails in production
**Solution:** Add production failure cases to the eval suite; increase diversity
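Folding production failures back into the suite can be as simple as converting each failure log entry into a regression task. A minimal sketch, where the log field names are illustrative assumptions:

```python
def failure_to_eval_task(log_entry: dict) -> dict:
    """Turn a logged production failure into a regression eval task.

    If a human supplied a corrected output, a code-based grader can
    compare against it; otherwise fall back to a model-based grader.
    """
    corrected = log_entry.get("corrected_output")
    return {
        "id": f"prod-regression-{log_entry['request_id']}",
        "prompt": log_entry["user_input"],
        "expected_outcome": corrected or {},
        "grader": "code" if corrected else "model",
        "tags": ["production-regression"],
    }
```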
## Examples
### Example 1: Simple Coding Agent Eval
```python
# Task definition
task = {
    "id": "fizzbuzz-001",
    "prompt": "Write a fizzbuzz function in Python",
    "test_cases": [
        {"input": 3, "expected": "Fizz"},
        {"input": 5, "expected": "Buzz"},
        {"input": 15, "expected": "FizzBuzz"},
        {"input": 7, "expected": "7"},
    ],
}

# Grader
def grade(task, outcome):
    namespace = {}
    exec(outcome["code"], namespace)  # Run only inside a sandbox
    fizzbuzz = namespace["fizzbuzz"]
    for tc in task["test_cases"]:
        if fizzbuzz(tc["input"]) != tc["expected"]:
            return 0.0
    return 1.0
```

### Example 2: Conversational Agent Eval with LLM Rubric
```yaml
task:
  id: "support-refund-001"
  scenario: |
    Customer wants refund for damaged product.
    Product: Laptop, Order: #12345, Damage: Screen crack
  expected_actions:
    - Acknowledge issue
    - Verify order
    - Offer resolution options
  max_turns: 5
grader:
  type: model
  model: claude-3-5-sonnet-20241022
  rubric: |
    Score 1-5 on each dimension:
    - Empathy: Did agent acknowledge customer frustration?
    - Resolution: Was a clear solution offered?
    - Efficiency: Was issue resolved in reasonable turns?
```