agent-evaluation


Agent Evaluation (AI Agent Evals)


Based on Anthropic's "Demystifying evals for AI agents"

When to use this skill


  • Designing evaluation systems for AI agents
  • Building benchmarks for coding, conversational, or research agents
  • Creating graders (code-based, model-based, human)
  • Implementing production monitoring for AI systems
  • Setting up CI/CD pipelines with automated evals
  • Debugging agent performance issues
  • Measuring agent improvement over time

Core Concepts


Eval Evolution: Single-turn → Multi-turn → Agentic


| Type | Turns | State | Grading | Complexity |
|------|-------|-------|---------|------------|
| Single-turn | 1 | None | Simple | Low |
| Multi-turn | N | Conversation | Per-turn | Medium |
| Agentic | N | World + History | Outcome | High |

7 Key Terms


| Term | Definition |
|------|------------|
| Task | Single test case (prompt + expected outcome) |
| Trial | One agent run on a task |
| Grader | Scoring function (code/model/human) |
| Transcript | Full record of agent actions |
| Outcome | Final state for grading |
| Harness | Infrastructure running evals |
| Suite | Collection of related tasks |
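These terms map naturally onto a few small types; a minimal sketch in Python (illustrative names, not any particular harness's API):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    """Single test case: prompt plus expected outcome."""
    id: str
    prompt: str
    expected_outcome: dict

@dataclass
class Trial:
    """One agent run on a task."""
    task: Task
    transcript: list = field(default_factory=list)  # full record of agent actions
    outcome: dict = field(default_factory=dict)     # final state handed to the grader

# A grader is just a scoring function; a suite is a collection of related tasks.
Grader = Callable[[Task, dict], float]
Suite = list[Task]
```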

Instructions


Step 1: Understand Grader Types


Code-based Graders (Recommended for Coding Agents)


  • Pros: Fast, objective, reproducible
  • Cons: Requires clear success criteria
  • Best for: Coding agents, structured outputs

Example: Code-based grader


```python
def grade_task(outcome: dict) -> float:
    """Grade coding task by test passage."""
    tests_passed = outcome.get("tests_passed", 0)
    total_tests = outcome.get("total_tests", 1)
    return tests_passed / total_tests
```

SWE-bench style grader


```python
import subprocess

def grade_swe_bench(repo_path: str, test_spec: dict) -> bool:
    """Run tests and check whether the patch resolves the issue."""
    result = subprocess.run(
        ["pytest", test_spec["test_file"]],
        cwd=repo_path,
        capture_output=True,
    )
    return result.returncode == 0
```

Model-based Graders (LLM-as-Judge)


  • Pros: Flexible, handles nuance
  • Cons: Requires calibration, can be inconsistent
  • Best for: Conversational agents, open-ended tasks

Example: LLM Rubric for Customer Support Agent


```yaml
rubric:
  dimensions:
    - name: empathy
      weight: 0.3
      scale: 1-5
      criteria: |
        5: Acknowledges emotions, uses warm language
        3: Polite but impersonal
        1: Cold or dismissive

    - name: resolution
      weight: 0.5
      scale: 1-5
      criteria: |
        5: Fully resolves issue
        3: Partial resolution
        1: No resolution

    - name: efficiency
      weight: 0.2
      scale: 1-5
      criteria: |
        5: Resolved in minimal turns
        3: Reasonable turns
        1: Excessive back-and-forth
```
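Once a judge model returns per-dimension scores, the rubric's weights collapse them into one number. A minimal sketch (dimension names and weights taken from the rubric above):

```python
def weighted_rubric_score(scores: dict, weights: dict) -> float:
    """Combine per-dimension 1-5 judge scores into one weighted score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(scores[name] * w for name, w in weights.items())

weights = {"empathy": 0.3, "resolution": 0.5, "efficiency": 0.2}
scores = {"empathy": 4, "resolution": 5, "efficiency": 3}
print(round(weighted_rubric_score(scores, weights), 2))  # 4.3
```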

Human Graders


  • Pros: Highest accuracy, catches edge cases
  • Cons: Expensive, slow, not scalable
  • Best for: Final validation, ambiguous cases

Step 2: Choose Strategy by Agent Type


2.1 Coding Agents


Benchmarks:
  • SWE-bench Verified: Real GitHub issues (40% → 80%+ achievable)
  • Terminal-Bench: Complex terminal tasks
  • Custom test suites with your codebase
Grading Strategy:
```python
def grade_coding_agent(task: dict, outcome: dict) -> dict:
    return {
        "tests_passed": run_test_suite(outcome["code"]),
        "lint_score": run_linter(outcome["code"]),
        "builds": check_build(outcome["code"]),
        "matches_spec": compare_to_reference(task["spec"], outcome["code"])
    }
```
Key Metrics:
  • Test passage rate
  • Build success
  • Lint/style compliance
  • Diff size (smaller is better)
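The grader returns a dict of raw metrics; how they combine into a single score is a separate policy decision. One hedged sketch, with illustrative weights (not from the source):

```python
def aggregate_coding_score(metrics: dict) -> float:
    """Collapse per-metric results (each normalized to 0-1) into one score."""
    weights = {"tests_passed": 0.5, "builds": 0.2,      # illustrative weights
               "lint_score": 0.1, "matches_spec": 0.2}
    return sum(w * float(metrics[k]) for k, w in weights.items())

print(aggregate_coding_score(
    {"tests_passed": 1.0, "builds": 1.0, "lint_score": 0.8, "matches_spec": 1.0}))
```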

2.2 Conversational Agents


Benchmarks:
  • τ2-Bench: Multi-domain conversation
  • Custom domain-specific suites
Grading Strategy (Multi-dimensional):
```yaml
success_criteria:
  - empathy_score: >= 4.0
  - resolution_rate: >= 0.9
  - avg_turns: <= 5
  - escalation_rate: <= 0.1
```
Key Metrics:
  • Task resolution rate
  • Customer satisfaction proxy
  • Turn efficiency
  • Escalation rate
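The success criteria above can be checked mechanically per run; a minimal sketch (metric names and thresholds follow the YAML):

```python
def meets_criteria(metrics: dict) -> bool:
    """Check a trial's aggregate metrics against the success criteria above."""
    return (
        metrics["empathy_score"] >= 4.0
        and metrics["resolution_rate"] >= 0.9
        and metrics["avg_turns"] <= 5
        and metrics["escalation_rate"] <= 0.1
    )

print(meets_criteria({"empathy_score": 4.2, "resolution_rate": 0.93,
                      "avg_turns": 4, "escalation_rate": 0.05}))  # True
```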

2.3 Research Agents


Grading Dimensions:
  1. Grounding: Claims backed by sources
  2. Coverage: All aspects addressed
  3. Source Quality: Authoritative sources used
```python
def grade_research_agent(task: dict, outcome: dict) -> dict:
    return {
        "grounding": check_citations(outcome["report"]),
        "coverage": check_topic_coverage(task["topics"], outcome["report"]),
        "source_quality": score_sources(outcome["sources"]),
        "factual_accuracy": verify_claims(outcome["claims"])
    }
```

2.4 Computer Use Agents


Benchmarks:
  • WebArena: Web navigation tasks
  • OSWorld: Desktop environment tasks
Grading Strategy:
```python
def grade_computer_use(task: dict, outcome: dict) -> dict:
    return {
        "ui_state": verify_ui_state(outcome["screenshot"]),
        "db_state": verify_database(task["expected_db_state"]),
        "file_state": verify_files(task["expected_files"]),
        "success": all_conditions_met(task, outcome)
    }
```

Step 3: Follow the 8-Step Roadmap


Step 0: Start Early (20-50 Tasks)


```bash
# Create initial eval suite structure
mkdir -p evals/{tasks,results,graders}

# Start with representative tasks:
# - Common use cases (60%)
# - Edge cases (20%)
# - Failure modes (20%)
```

Step 1: Convert Manual Tests


```python
# Transform existing QA tests into eval tasks
def convert_qa_to_eval(qa_case: dict) -> dict:
    return {
        "id": qa_case["id"],
        "prompt": qa_case["input"],
        "expected_outcome": qa_case["expected"],
        "grader": "code" if qa_case["has_tests"] else "model",
        "tags": qa_case.get("tags", [])
    }
```

Step 2: Ensure Clarity + Reference Solutions


```yaml
# Good task definition
task:
  id: "api-design-001"
  prompt: |
    Design a REST API for user management with:
    - CRUD operations
    - Authentication via JWT
    - Rate limiting
  reference_solution: "./solutions/api-design-001/"
  success_criteria:
    - "All endpoints documented"
    - "Auth middleware present"
    - "Rate limit config exists"
```

Step 3: Balance Positive/Negative Cases


```python
# Ensure eval suite balance
suite_composition = {
    "positive_cases": 0.5,  # Should succeed
    "negative_cases": 0.3,  # Should fail gracefully
    "edge_cases": 0.2       # Boundary conditions
}
```
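A quick way to verify a suite actually matches the target composition — a sketch assuming each task carries a `kind` tag (hypothetical field name):

```python
from collections import Counter

def check_balance(tasks: list, targets: dict, tol: float = 0.05) -> dict:
    """Report whether each case kind's share is within tol of its target."""
    counts = Counter(t["kind"] for t in tasks)
    total = len(tasks)
    return {kind: abs(counts[kind] / total - share) <= tol
            for kind, share in targets.items()}

tasks = ([{"kind": "positive_cases"}] * 5 + [{"kind": "negative_cases"}] * 3
         + [{"kind": "edge_cases"}] * 2)
print(check_balance(tasks, {"positive_cases": 0.5,
                            "negative_cases": 0.3, "edge_cases": 0.2}))
```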

Step 4: Isolate Environments


```yaml
# Docker-based isolation for coding evals
eval_environment:
  type: docker
  image: "eval-sandbox:latest"
  timeout: 300s
  resources:
    memory: "4g"
    cpu: "2"
  network: isolated
  cleanup: always
```

Step 5: Focus on Outcomes, Not Paths


```python
# GOOD: Outcome-focused grader
def grade_outcome(expected: dict, actual: dict) -> float:
    return compare_final_states(expected, actual)

# BAD: Path-focused grader (too brittle)
def grade_path(expected_steps: list, actual_steps: list) -> float:
    return step_by_step_match(expected_steps, actual_steps)
```

Step 6: Always Read Transcripts


```python
# Transcript analysis for debugging
def analyze_transcript(transcript: list) -> dict:
    return {
        "total_steps": len(transcript),
        "tool_usage": count_tool_calls(transcript),
        "errors": extract_errors(transcript),
        "decision_points": find_decision_points(transcript),
        "recovery_attempts": find_recovery_patterns(transcript)
    }
```

Step 7: Monitor Eval Saturation

步骤7:监控评估套件饱和情况

```python
# Detect when evals are no longer useful
def check_saturation(results: list, window: int = 10) -> dict:
    recent = results[-window:]
    is_saturated = all(r["passed"] for r in recent)
    return {
        "pass_rate": sum(r["passed"] for r in recent) / len(recent),
        "variance": calculate_variance(recent),
        "is_saturated": is_saturated,
        "recommendation": "Add harder tasks" if is_saturated else "Continue"
    }
```

Step 8: Long-term Maintenance


```yaml
# Eval suite maintenance checklist
maintenance:
  weekly:
    - Review failed evals for false negatives
    - Check for flaky tests
  monthly:
    - Add new edge cases from production issues
    - Retire saturated evals
    - Update reference solutions
  quarterly:
    - Full benchmark recalibration
    - Team contribution review
```

Step 4: Integrate with Production


CI/CD Integration


```yaml
# GitHub Actions example
name: Agent Evals
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evals
        run: |
          python run_evals.py --suite=core --mode=compact
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
```

Production Monitoring


```python
# Real-time eval sampling
import random

class ProductionMonitor:
    def __init__(self, sample_rate: float = 0.1, threshold: float = 0.7):
        self.sample_rate = sample_rate
        self.threshold = threshold  # minimum acceptable score (tune per suite)

    async def monitor(self, request, response):
        if random.random() < self.sample_rate:
            eval_result = await self.run_eval(request, response)
            self.log_result(eval_result)
            if eval_result["score"] < self.threshold:
                self.alert("Low quality response detected")
```

A/B Testing


```python
# Compare agent versions
def run_ab_test(suite: str, versions: list) -> dict:
    results = {}
    for version in versions:
        results[version] = run_eval_suite(suite, agent_version=version)
    return {
        "comparison": compare_results(results),
        "winner": determine_winner(results),
        "confidence": calculate_confidence(results)
    }
```
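How confidence is computed is left abstract in the snippet above; one simple stand-in is a normal-approximation interval on the pass-rate difference (a sketch, assuming equal trial counts per version):

```python
import math

def pass_rate_diff_ci(p_a: float, p_b: float, n: int, z: float = 1.96) -> tuple:
    """Approximate 95% CI on the pass-rate difference B - A over n trials each."""
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    return diff - z * se, diff + z * se

low, high = pass_rate_diff_ci(0.62, 0.74, n=200)
print(low > 0)  # if True, version B's improvement is unlikely to be noise
```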

Best Practices


Do's ✅


  1. Start with 20-50 representative tasks
  2. Use code-based graders when possible
  3. Focus on outcomes, not paths
  4. Read transcripts for debugging
  5. Monitor for eval saturation
  6. Balance positive/negative cases
  7. Isolate eval environments
  8. Version your eval suites

Don'ts ❌


  1. Don't over-rely on model-based graders without calibration
  2. Don't ignore failed evals (false negatives exist)
  3. Don't grade on intermediate steps
  4. Don't skip transcript analysis
  5. Don't use production data without sanitization
  6. Don't let eval suites become stale

Success Patterns


Pattern 1: Graduated Eval Complexity


Level 1: Unit evals (single capability)
Level 2: Integration evals (combined capabilities)
Level 3: End-to-end evals (full workflows)
Level 4: Adversarial evals (edge cases)

Pattern 2: Eval-Driven Development


1. Write eval task for new feature
2. Run eval (expect failure)
3. Implement feature
4. Run eval (expect pass)
5. Add to regression suite

Pattern 3: Continuous Calibration


Weekly: Review grader accuracy
Monthly: Update rubrics based on feedback
Quarterly: Full grader audit with human baseline

Troubleshooting


Problem: Eval scores at 100%


Solution: Add harder tasks, check for eval saturation (Step 7)

Problem: Inconsistent model-based grader scores


Solution: Add more examples to rubric, use structured output, ensemble graders
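Ensembling is cheap to sketch: run several independent judges and take the median so one noisy score cannot swing the grade (illustrative, not a specific framework's API):

```python
import statistics

def ensemble_grade(graders, task, outcome) -> float:
    """Median across independent graders damps single-judge variance."""
    return statistics.median(g(task, outcome) for g in graders)

judges = [lambda t, o: 0.9, lambda t, o: 0.4, lambda t, o: 0.85]
print(ensemble_grade(judges, task={}, outcome={}))  # 0.85
```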

Problem: Evals too slow for CI


Solution: Use compact mode, parallelize, sample a subset for PR checks

Problem: Agent passes evals but fails in production


Solution: Add production failure cases to eval suite, increase diversity


Examples


Example 1: Simple Coding Agent Eval


```python
# Task definition
task = {
    "id": "fizzbuzz-001",
    "prompt": "Write a fizzbuzz function in Python",
    "test_cases": [
        {"input": 3, "expected": "Fizz"},
        {"input": 5, "expected": "Buzz"},
        {"input": 15, "expected": "FizzBuzz"},
        {"input": 7, "expected": "7"}
    ]
}

# Grader
def grade(task, outcome):
    namespace = {}
    exec(outcome["code"], namespace)  # Run inside a sandbox in practice
    fizzbuzz = namespace["fizzbuzz"]
    for tc in task["test_cases"]:
        if fizzbuzz(tc["input"]) != tc["expected"]:
            return 0.0
    return 1.0
```
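For reference, one implementation the fizzbuzz grader would score 1.0 (any correct fizzbuzz passes; this is not the only valid answer):

```python
def fizzbuzz(n: int) -> str:
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

print([fizzbuzz(n) for n in (3, 5, 15, 7)])  # ['Fizz', 'Buzz', 'FizzBuzz', '7']
```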

Example 2: Conversational Agent Eval with LLM Rubric


```yaml
task:
  id: "support-refund-001"
  scenario: |
    Customer wants refund for damaged product.
    Product: Laptop, Order: #12345, Damage: Screen crack
  expected_actions:
    - Acknowledge issue
    - Verify order
    - Offer resolution options
  max_turns: 5

grader:
  type: model
  model: claude-3-5-sonnet-20241022
  rubric: |
    Score 1-5 on each dimension:
    - Empathy: Did agent acknowledge customer frustration?
    - Resolution: Was a clear solution offered?
    - Efficiency: Was issue resolved in reasonable turns?
```