agent-evaluation


Agent Evaluation (AI Agent Evals)


Based on Anthropic's "Demystifying evals for AI agents"

When to use this skill


  • Designing evaluation systems for AI agents
  • Building benchmarks for coding, conversational, or research agents
  • Creating graders (code-based, model-based, human)
  • Implementing production monitoring for AI systems
  • Setting up CI/CD pipelines with automated evals
  • Debugging agent performance issues
  • Measuring agent improvement over time

Core Concepts


Eval Evolution: Single-turn → Multi-turn → Agentic


| Type | Turns | State | Grading | Complexity |
|------|-------|-------|---------|------------|
| Single-turn | 1 | None | Simple | Low |
| Multi-turn | N | Conversation | Per-turn | Medium |
| Agentic | N | World + History | Outcome | High |

7 Key Terms


| Term | Definition |
|------|------------|
| Task | Single test case (prompt + expected outcome) |
| Trial | One agent run on a task |
| Grader | Scoring function (code/model/human) |
| Transcript | Full record of agent actions |
| Outcome | Final state for grading |
| Harness | Infrastructure running evals |
| Suite | Collection of related tasks |
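These terms map naturally onto a few small types; a minimal sketch in Python (illustrative names, not any particular harness's API):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    """Single test case: prompt plus expected outcome."""
    id: str
    prompt: str
    expected_outcome: dict

@dataclass
class Trial:
    """One agent run on a task."""
    task: Task
    transcript: list = field(default_factory=list)  # full record of agent actions
    outcome: dict = field(default_factory=dict)     # final state handed to the grader

# A grader is just a scoring function; a suite is a collection of related tasks.
Grader = Callable[[Task, dict], float]
Suite = list[Task]
```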

Instructions


Step 1: Understand Grader Types


Code-based Graders (Recommended for Coding Agents)


  • Pros: Fast, objective, reproducible
  • Cons: Requires clear success criteria
  • Best for: Coding agents, structured outputs

Example: Code-based grader


```python
def grade_task(outcome: dict) -> float:
    """Grade coding task by test passage."""
    tests_passed = outcome.get("tests_passed", 0)
    total_tests = outcome.get("total_tests", 1)
    return tests_passed / total_tests
```

SWE-bench style grader


```python
import subprocess

def grade_swe_bench(repo_path: str, test_spec: dict) -> bool:
    """Run tests and check whether the patch resolves the issue."""
    result = subprocess.run(
        ["pytest", test_spec["test_file"]],
        cwd=repo_path,
        capture_output=True,
    )
    return result.returncode == 0
```

Model-based Graders (LLM-as-Judge)


  • Pros: Flexible, handles nuance
  • Cons: Requires calibration, can be inconsistent
  • Best for: Conversational agents, open-ended tasks

Example: LLM Rubric for Customer Support Agent


```yaml
rubric:
  dimensions:
    - name: empathy
      weight: 0.3
      scale: 1-5
      criteria: |
        5: Acknowledges emotions, uses warm language
        3: Polite but impersonal
        1: Cold or dismissive

    - name: resolution
      weight: 0.5
      scale: 1-5
      criteria: |
        5: Fully resolves issue
        3: Partial resolution
        1: No resolution

    - name: efficiency
      weight: 0.2
      scale: 1-5
      criteria: |
        5: Resolved in minimal turns
        3: Reasonable turns
        1: Excessive back-and-forth
```
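Once a judge model returns per-dimension scores, the rubric's weights collapse them into one number. A minimal sketch (dimension names and weights taken from the rubric above):

```python
def weighted_rubric_score(scores: dict, weights: dict) -> float:
    """Combine per-dimension 1-5 judge scores into one weighted score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(scores[name] * w for name, w in weights.items())

weights = {"empathy": 0.3, "resolution": 0.5, "efficiency": 0.2}
scores = {"empathy": 4, "resolution": 5, "efficiency": 3}
print(round(weighted_rubric_score(scores, weights), 2))  # 4.3
```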

Human Graders


  • Pros: Highest accuracy, catches edge cases
  • Cons: Expensive, slow, not scalable
  • Best for: Final validation, ambiguous cases

Step 2: Choose Strategy by Agent Type


2.1 Coding Agents


Benchmarks:
  • SWE-bench Verified: Real GitHub issues (40% → 80%+ achievable)
  • Terminal-Bench: Complex terminal tasks
  • Custom test suites with your codebase
Grading Strategy:
```python
def grade_coding_agent(task: dict, outcome: dict) -> dict:
    return {
        "tests_passed": run_test_suite(outcome["code"]),
        "lint_score": run_linter(outcome["code"]),
        "builds": check_build(outcome["code"]),
        "matches_spec": compare_to_reference(task["spec"], outcome["code"])
    }
```
Key Metrics:
  • Test passage rate
  • Build success
  • Lint/style compliance
  • Diff size (smaller is better)
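The grader returns a dict of raw metrics; how they combine into a single score is a separate policy decision. One hedged sketch, with illustrative weights (not from the source):

```python
def aggregate_coding_score(metrics: dict) -> float:
    """Collapse per-metric results (each normalized to 0-1) into one score."""
    weights = {"tests_passed": 0.5, "builds": 0.2,      # illustrative weights
               "lint_score": 0.1, "matches_spec": 0.2}
    return sum(w * float(metrics[k]) for k, w in weights.items())

print(aggregate_coding_score(
    {"tests_passed": 1.0, "builds": 1.0, "lint_score": 0.8, "matches_spec": 1.0}))
```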

2.2 Conversational Agents


Benchmarks:
  • τ2-Bench: Multi-domain conversation
  • Custom domain-specific suites
Grading Strategy (Multi-dimensional):
```yaml
success_criteria:
  - empathy_score: >= 4.0
  - resolution_rate: >= 0.9
  - avg_turns: <= 5
  - escalation_rate: <= 0.1
```
Key Metrics:
  • Task resolution rate
  • Customer satisfaction proxy
  • Turn efficiency
  • Escalation rate
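The success criteria above can be checked mechanically per run; a minimal sketch (metric names and thresholds follow the YAML):

```python
def meets_criteria(metrics: dict) -> bool:
    """Check a trial's aggregate metrics against the success criteria above."""
    return (
        metrics["empathy_score"] >= 4.0
        and metrics["resolution_rate"] >= 0.9
        and metrics["avg_turns"] <= 5
        and metrics["escalation_rate"] <= 0.1
    )

print(meets_criteria({"empathy_score": 4.2, "resolution_rate": 0.93,
                      "avg_turns": 4, "escalation_rate": 0.05}))  # True
```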

2.3 Research Agents


Grading Dimensions:
  1. Grounding: Claims backed by sources
  2. Coverage: All aspects addressed
  3. Source Quality: Authoritative sources used
```python
def grade_research_agent(task: dict, outcome: dict) -> dict:
    return {
        "grounding": check_citations(outcome["report"]),
        "coverage": check_topic_coverage(task["topics"], outcome["report"]),
        "source_quality": score_sources(outcome["sources"]),
        "factual_accuracy": verify_claims(outcome["claims"])
    }
```

2.4 Computer Use Agents


Benchmarks:
  • WebArena: Web navigation tasks
  • OSWorld: Desktop environment tasks
Grading Strategy:
```python
def grade_computer_use(task: dict, outcome: dict) -> dict:
    return {
        "ui_state": verify_ui_state(outcome["screenshot"]),
        "db_state": verify_database(task["expected_db_state"]),
        "file_state": verify_files(task["expected_files"]),
        "success": all_conditions_met(task, outcome)
    }
```

Step 3: Follow the 8-Step Roadmap


Step 0: Start Early (20-50 Tasks)


```bash
# Create initial eval suite structure
mkdir -p evals/{tasks,results,graders}

# Start with representative tasks:
# - Common use cases (60%)
# - Edge cases (20%)
# - Failure modes (20%)
```

Step 1: Convert Manual Tests


```python
# Transform existing QA tests into eval tasks
def convert_qa_to_eval(qa_case: dict) -> dict:
    return {
        "id": qa_case["id"],
        "prompt": qa_case["input"],
        "expected_outcome": qa_case["expected"],
        "grader": "code" if qa_case["has_tests"] else "model",
        "tags": qa_case.get("tags", [])
    }
```

Step 2: Ensure Clarity + Reference Solutions


```yaml
# Good task definition
task:
  id: "api-design-001"
  prompt: |
    Design a REST API for user management with:
    - CRUD operations
    - Authentication via JWT
    - Rate limiting
  reference_solution: "./solutions/api-design-001/"
  success_criteria:
    - "All endpoints documented"
    - "Auth middleware present"
    - "Rate limit config exists"
```

Step 3: Balance Positive/Negative Cases


```python
# Ensure eval suite balance
suite_composition = {
    "positive_cases": 0.5,  # Should succeed
    "negative_cases": 0.3,  # Should fail gracefully
    "edge_cases": 0.2       # Boundary conditions
}
```
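A quick way to verify a suite actually matches the target composition — a sketch assuming each task carries a `kind` tag (hypothetical field name):

```python
from collections import Counter

def check_balance(tasks: list, targets: dict, tol: float = 0.05) -> dict:
    """Report whether each case kind's share is within tol of its target."""
    counts = Counter(t["kind"] for t in tasks)
    total = len(tasks)
    return {kind: abs(counts[kind] / total - share) <= tol
            for kind, share in targets.items()}

tasks = ([{"kind": "positive_cases"}] * 5 + [{"kind": "negative_cases"}] * 3
         + [{"kind": "edge_cases"}] * 2)
print(check_balance(tasks, {"positive_cases": 0.5,
                            "negative_cases": 0.3, "edge_cases": 0.2}))
```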

Step 4: Isolate Environments


```yaml
# Docker-based isolation for coding evals
eval_environment:
  type: docker
  image: "eval-sandbox:latest"
  timeout: 300s
  resources:
    memory: "4g"
    cpu: "2"
  network: isolated
  cleanup: always
```

Step 5: Focus on Outcomes, Not Paths


```python
# GOOD: Outcome-focused grader
def grade_outcome(expected: dict, actual: dict) -> float:
    return compare_final_states(expected, actual)

# BAD: Path-focused grader (too brittle)
def grade_path(expected_steps: list, actual_steps: list) -> float:
    return step_by_step_match(expected_steps, actual_steps)
```

Step 6: Always Read Transcripts


```python
# Transcript analysis for debugging
def analyze_transcript(transcript: list) -> dict:
    return {
        "total_steps": len(transcript),
        "tool_usage": count_tool_calls(transcript),
        "errors": extract_errors(transcript),
        "decision_points": find_decision_points(transcript),
        "recovery_attempts": find_recovery_patterns(transcript)
    }
```

Step 7: Monitor Eval Saturation

步骤7:监控评估套件饱和情况

```python
# Detect when evals are no longer useful
def check_saturation(results: list, window: int = 10) -> dict:
    recent = results[-window:]
    is_saturated = all(r["passed"] for r in recent)
    return {
        "pass_rate": sum(r["passed"] for r in recent) / len(recent),
        "variance": calculate_variance(recent),
        "is_saturated": is_saturated,
        "recommendation": "Add harder tasks" if is_saturated else "Continue"
    }
```

Step 8: Long-term Maintenance


```yaml
# Eval suite maintenance checklist
maintenance:
  weekly:
    - Review failed evals for false negatives
    - Check for flaky tests
  monthly:
    - Add new edge cases from production issues
    - Retire saturated evals
    - Update reference solutions
  quarterly:
    - Full benchmark recalibration
    - Team contribution review
```

Step 4: Integrate with Production


CI/CD Integration


```yaml
# GitHub Actions example
name: Agent Evals
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evals
        run: |
          python run_evals.py --suite=core --mode=compact
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
```

Production Monitoring


```python
# Real-time eval sampling
import random

class ProductionMonitor:
    def __init__(self, sample_rate: float = 0.1, threshold: float = 0.7):
        self.sample_rate = sample_rate
        self.threshold = threshold  # minimum acceptable score (tune per suite)

    async def monitor(self, request, response):
        if random.random() < self.sample_rate:
            eval_result = await self.run_eval(request, response)
            self.log_result(eval_result)
            if eval_result["score"] < self.threshold:
                self.alert("Low quality response detected")
```

A/B Testing


```python
# Compare agent versions
def run_ab_test(suite: str, versions: list) -> dict:
    results = {}
    for version in versions:
        results[version] = run_eval_suite(suite, agent_version=version)
    return {
        "comparison": compare_results(results),
        "winner": determine_winner(results),
        "confidence": calculate_confidence(results)
    }
```
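How confidence is computed is left abstract in the snippet above; one simple stand-in is a normal-approximation interval on the pass-rate difference (a sketch, assuming equal trial counts per version):

```python
import math

def pass_rate_diff_ci(p_a: float, p_b: float, n: int, z: float = 1.96) -> tuple:
    """Approximate 95% CI on the pass-rate difference B - A over n trials each."""
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    return diff - z * se, diff + z * se

low, high = pass_rate_diff_ci(0.62, 0.74, n=200)
print(low > 0)  # if True, version B's improvement is unlikely to be noise
```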

Best Practices


Do's ✅


  1. Start with 20-50 representative tasks
  2. Use code-based graders when possible
  3. Focus on outcomes, not paths
  4. Read transcripts for debugging
  5. Monitor for eval saturation
  6. Balance positive/negative cases
  7. Isolate eval environments
  8. Version your eval suites

Don'ts ❌


  1. Don't over-rely on model-based graders without calibration
  2. Don't ignore failed evals (false negatives exist)
  3. Don't grade on intermediate steps
  4. Don't skip transcript analysis
  5. Don't use production data without sanitization
  6. Don't let eval suites become stale

Success Patterns


Pattern 1: Graduated Eval Complexity


Level 1: Unit evals (single capability)
Level 2: Integration evals (combined capabilities)
Level 3: End-to-end evals (full workflows)
Level 4: Adversarial evals (edge cases)

Pattern 2: Eval-Driven Development


1. Write eval task for new feature
2. Run eval (expect failure)
3. Implement feature
4. Run eval (expect pass)
5. Add to regression suite

Pattern 3: Continuous Calibration


Weekly: Review grader accuracy
Monthly: Update rubrics based on feedback
Quarterly: Full grader audit with human baseline

Troubleshooting


Problem: Eval scores at 100%


Solution: Add harder tasks, check for eval saturation (Step 7)

Problem: Inconsistent model-based grader scores


Solution: Add more examples to rubric, use structured output, ensemble graders
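Ensembling is cheap to sketch: run several independent judges and take the median so one noisy score cannot swing the grade (illustrative, not a specific framework's API):

```python
import statistics

def ensemble_grade(graders, task, outcome) -> float:
    """Median across independent graders damps single-judge variance."""
    return statistics.median(g(task, outcome) for g in graders)

judges = [lambda t, o: 0.9, lambda t, o: 0.4, lambda t, o: 0.85]
print(ensemble_grade(judges, task={}, outcome={}))  # 0.85
```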

Problem: Evals too slow for CI


Solution: Use compact mode, parallelize, sample a subset for PR checks

Problem: Agent passes evals but fails in production


Solution: Add production failure cases to eval suite, increase diversity


Examples


Example 1: Simple Coding Agent Eval


```python
# Task definition
task = {
    "id": "fizzbuzz-001",
    "prompt": "Write a fizzbuzz function in Python",
    "test_cases": [
        {"input": 3, "expected": "Fizz"},
        {"input": 5, "expected": "Buzz"},
        {"input": 15, "expected": "FizzBuzz"},
        {"input": 7, "expected": "7"}
    ]
}

# Grader
def grade(task, outcome):
    namespace = {}
    exec(outcome["code"], namespace)  # Run inside a sandbox in practice
    fizzbuzz = namespace["fizzbuzz"]
    for tc in task["test_cases"]:
        if fizzbuzz(tc["input"]) != tc["expected"]:
            return 0.0
    return 1.0
```
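For reference, one implementation the fizzbuzz grader would score 1.0 (any correct fizzbuzz passes; this is not the only valid answer):

```python
def fizzbuzz(n: int) -> str:
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

print([fizzbuzz(n) for n in (3, 5, 15, 7)])  # ['Fizz', 'Buzz', 'FizzBuzz', '7']
```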

Example 2: Conversational Agent Eval with LLM Rubric


```yaml
task:
  id: "support-refund-001"
  scenario: |
    Customer wants refund for damaged product.
    Product: Laptop, Order: #12345, Damage: Screen crack
  expected_actions:
    - Acknowledge issue
    - Verify order
    - Offer resolution options
  max_turns: 5

grader:
  type: model
  model: claude-3-5-sonnet-20241022
  rubric: |
    Score 1-5 on each dimension:
    - Empathy: Did agent acknowledge customer frustration?
    - Resolution: Was a clear solution offered?
    - Efficiency: Was issue resolved in reasonable turns?
```