agent-eval
Agent Eval Skill
A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.
When to Activate
- Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
- Measuring agent performance before adopting a new tool or model
- Running regression checks when an agent updates its model or tooling
- Producing data-backed agent selection decisions for a team
Installation
```bash
# pinned to v0.1.0, latest stable commit
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b
```
Core Concepts
YAML Task Definitions
Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:
```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
```
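As a rough illustration of how a definition like this might be consumed, the sketch below models a parsed task as plain Python data and checks that the fields from the example are present. The `Judge` and `Task` classes, the `task_from_dict` helper, and the choice of which keys are required are all assumptions made for illustration, not agent-eval's actual internals. The input dictionary stands in for whatever a YAML parser would return.

```python
from dataclasses import dataclass, field


@dataclass
class Judge:
    type: str                                     # e.g. "pytest", "grep", "command", "llm"
    options: dict = field(default_factory=dict)   # remaining keys of the judge entry


@dataclass
class Task:
    name: str
    repo: str
    prompt: str
    commit: str
    files: list
    judges: list


def task_from_dict(raw: dict) -> Task:
    """Build a Task from parsed YAML, failing loudly on missing keys.

    Hypothetical helper: the required-key list is an assumption.
    """
    required = ("name", "repo", "prompt", "judge", "commit")
    missing = [key for key in required if key not in raw]
    if missing:
        raise ValueError(f"task is missing fields: {', '.join(missing)}")
    judges = [
        Judge(type=entry["type"],
              options={k: v for k, v in entry.items() if k != "type"})
        for entry in raw["judge"]
    ]
    return Task(name=raw["name"], repo=raw["repo"], prompt=raw["prompt"],
                commit=raw["commit"], files=list(raw.get("files", [])),
                judges=judges)
```

Failing at load time keeps a typo in a task file from surfacing only after an agent has already spent time and money on the run.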
Git Worktree Isolation
Each agent run gets its own git worktree, so no Docker is required. The worktree isolates runs from one another: agents cannot interfere with each other or corrupt the base repo, and every run starts from the same checked-out state, which keeps results reproducible.
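The isolation described above can be approximated with two standard git commands, `git worktree add --detach` and `git worktree remove --force`. The sketch below wraps them in Python. The function names, the `scratch_dir` layout, and the `run_id` naming are hypothetical, and this is not necessarily how agent-eval itself invokes git.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: str, path: str, commit: str) -> list:
    """git command that checks out `commit` into a new detached worktree."""
    return ["git", "-C", repo, "worktree", "add", "--detach", path, commit]


def worktree_remove_command(repo: str, path: str) -> list:
    """git command that deletes the worktree, discarding the agent's edits."""
    return ["git", "-C", repo, "worktree", "remove", "--force", path]


def create_worktree(repo: str, scratch_dir: str, run_id: str, commit: str) -> Path:
    """Create one isolated checkout per run and return its path.

    The path is made absolute because `git -C repo` would otherwise
    resolve a relative path against the repo directory.
    """
    path = Path(scratch_dir).resolve() / run_id
    subprocess.run(worktree_add_command(repo, str(path), commit), check=True)
    return path
```

Checking out with `--detach` avoids creating a branch per run, so repeated trials leave no branch clutter in the base repo.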
Metrics Collected
| Metric | What It Measures |
|---|---|
| Pass rate | Did the agent produce code that passes the judge? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |
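To make the four metrics concrete, here is a minimal sketch of how repeated runs of one agent on one task could be aggregated. `RunResult` and `summarize` are hypothetical names, and averaging cost and time across runs is an assumption, since the table does not say how those columns are combined.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunResult:
    passed: bool
    seconds: float
    cost_usd: Optional[float] = None   # None when the agent reports no cost


def summarize(runs: list) -> dict:
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("need at least one run")
    passes = sum(1 for r in runs if r.passed)
    costs = [r.cost_usd for r in runs if r.cost_usd is not None]
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "consistency_pct": round(100 * passes / len(runs)),
        "mean_seconds": mean(r.seconds for r in runs),
        # Cost is only "when available", so it may be absent entirely.
        "mean_cost_usd": mean(costs) if costs else None,
    }
```

Two passes out of three runs yields a pass rate of `2/3` and a consistency of 67, matching the 3/3 = 100% convention in the table.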
Workflow
1. Define Tasks
Create a `tasks/` directory with YAML files, one per task:
```bash
mkdir tasks
# Write task definitions (see template above)
```
2. Run Agents
Execute agents against your tasks:
```bash
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
```
Each run:
- Creates a fresh git worktree from the specified commit
- Hands the prompt to the agent
- Runs the judge criteria
- Records pass/fail, cost, and time
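The four steps above can be sketched as a single function. This is an illustrative outline only: `run_once` and its callable parameters are hypothetical, and the workspace factory, agent, and judges are injected so the flow can be exercised without a real agent. Requiring every judge to pass is also an assumption.

```python
import time
from typing import Callable, Iterable, Optional


def run_once(prompt: str,
             make_workspace: Callable[[], str],
             agent: Callable[[str, str], Optional[float]],
             judges: Iterable[Callable[[str], bool]]) -> dict:
    """One trial: fresh workspace, agent edit, judge verdicts, record."""
    workspace = make_workspace()                 # 1. fresh worktree at the pinned commit
    started = time.perf_counter()
    cost = agent(prompt, workspace)              # 2. agent works on the prompt, returns cost or None
    seconds = time.perf_counter() - started      # wall-clock time of the agent step only
    passed = all(judge(workspace) for judge in judges)   # 3. assumed: all judges must pass
    return {"passed": passed, "cost_usd": cost, "seconds": seconds}   # 4. record
```

Timing only the agent step keeps slow test suites from inflating the Time metric, though timing the whole trial would be an equally defensible choice.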
3. Compare Results
Generate a comparison report:
```bash
agent-eval report --format table
```
```
Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │ 67%         │
└──────────────┴───────────┴────────┴────────┴─────────────┘
```
Judge Types
Code-Based (deterministic)
```yaml
judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
```
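A deterministic judge of this kind reduces to "run the command in the workspace and check the exit status". The sketch below shows one way to do that. `command_judge` and its `timeout_s` parameter are hypothetical, and treating a timeout or a missing executable as a failure is an assumption rather than documented agent-eval behavior.

```python
import shlex
import subprocess


def command_judge(command: str, cwd: str, timeout_s: float = 600) -> bool:
    """Pass when the command exits with status 0 inside the workspace."""
    try:
        result = subprocess.run(shlex.split(command), cwd=cwd,
                                capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False   # a hang or a missing executable counts as a failure
    return result.returncode == 0
```

Splitting the command with `shlex` instead of passing it to a shell means pipes and redirects in a judge command would not work, but it avoids shell-injection surprises when task files come from other people.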
Pattern-Based
```yaml
judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
```
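For intuition, a pattern judge can be approximated with a regular-expression search over the files a glob selects. In this sketch `grep_judge` is a hypothetical name, and passing as soon as any one matching file contains the pattern is an assumption. The real tool may instead require a match in every file or use different glob rules.

```python
import re
from pathlib import Path


def grep_judge(pattern: str, files: str, root: str) -> bool:
    """Pass when any file selected by the glob contains the regex."""
    regex = re.compile(pattern)
    for path in Path(root).glob(files):   # "**" also matches nested directories
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if regex.search(text):
            return True
    return False
```

A pattern match only shows that certain text exists, not that the code works, which is why it pairs well with a test-based judge.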
Model-Based (LLM-as-judge)
```yaml
judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.
```
Best Practices
- Start with 3-5 tasks that represent your real workload, not toy examples
- Run at least 3 trials per agent to capture variance — agents are non-deterministic
- Pin the commit in your task YAML so results are reproducible across days/weeks
- Include at least one deterministic judge (tests, build) per task — LLM judges add noise
- Track cost alongside pass rate — a 95% agent at 10x the cost may not be the right choice
- Version your task definitions — they are test fixtures, treat them as code
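One way to weigh cost against pass rate is to estimate what a passing result costs once failed attempts are paid for too. The sketch below does that arithmetic. `cost_per_success` is a hypothetical helper, and it assumes the reported cost is a mean per run and that failed runs cost about as much as successful ones.

```python
def cost_per_success(mean_cost_usd: float, passes: int, runs: int) -> float:
    """Expected spend to obtain one passing result, retrying on failure."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if passes == 0:
        return float("inf")   # never passes: no amount of retrying helps
    # Total spend across all runs, divided by the runs that passed.
    return mean_cost_usd * runs / passes
```

Under those assumptions, the sample report's figures put both agents at roughly $0.12 per passing result: claude-code at $0.12 with 3/3, and aider at $0.08 with 2/3. The cheaper per-run price disappears once the failed run is counted.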