agent-eval

Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it.

When to Activate

Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
Measuring agent performance before adopting a new tool or model
Running regression checks when an agent updates its model or tooling
Producing data-backed agent selection decisions for a team

Installation

bash

# pinned to v0.1.0 (latest stable commit)
pip install git+https://github.com/joaquinhuigomez/agent-eval.git@6d062a2f5cda6ea443bf5d458d361892c04e749b

Core Concepts

YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

yaml

name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
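
Before handing a task file to any agent, it can help to check that it has the fields the template above uses. The following is a minimal sketch, not part of agent-eval itself: `validate_task` and the choice of which keys count as required are assumptions made for illustration, and the task is written as a plain dict (the shape that a YAML loader such as PyYAML's `yaml.safe_load` would return for the file above).

```python
# Hypothetical validation helper; not part of agent-eval's documented API.
REQUIRED_KEYS = ("name", "repo", "prompt", "judge")  # assumed required set


def validate_task(task: dict) -> list[str]:
    """Return human-readable problems; an empty list means the task looks usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in task]
    judges = task.get("judge", [])
    if not isinstance(judges, list) or not judges:
        problems.append("judge must be a non-empty list")
    else:
        for i, judge in enumerate(judges):
            if not isinstance(judge, dict) or "type" not in judge:
                problems.append(f"judge[{i}] has no type")
    if "commit" not in task:
        problems.append("no commit pinned; results may not be reproducible")
    return problems


# The same task as the YAML template, expressed as a dict.
task = {
    "name": "add-retry-logic",
    "repo": "./my-project",
    "files": ["src/http_client.py"],
    "prompt": "Add retry logic with exponential backoff to all HTTP requests.",
    "judge": [
        {"type": "pytest", "command": "pytest tests/test_http_client.py -v"},
        {"type": "grep", "pattern": "exponential_backoff|retry",
         "files": "src/http_client.py"},
    ],
    "commit": "abc1234",
}
problems = validate_task(task)
print(problems)
```

Treating a missing `commit` as a problem reflects the reproducibility comment in the template; a real loader may well treat it as optional.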

Git Worktree Isolation

Each agent run gets its own git worktree, so no Docker is required. The isolation keeps runs reproducible: agents cannot interfere with each other or corrupt the base repo.
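
To make the isolation concrete, here is a rough sketch of how a per-run worktree could be created and discarded with plain `git worktree` commands. How agent-eval does this internally is not documented here; the helper names and the example paths are hypothetical.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: str, dest: str, commit: str) -> list[str]:
    """Build the git command that checks out `commit` into a new detached worktree."""
    return ["git", "-C", repo, "worktree", "add", "--detach", dest, commit]


def create_worktree(repo: str, dest: str, commit: str) -> Path:
    """Create the worktree; raises CalledProcessError if git fails."""
    # Resolve first: `git -C repo` would otherwise read a relative dest
    # as relative to the repo rather than to the current directory.
    dest_path = Path(dest).resolve()
    subprocess.run(worktree_add_command(repo, str(dest_path), commit), check=True)
    return dest_path


def remove_worktree(repo: str, dest: str) -> None:
    """Delete the worktree, discarding whatever the agent changed in it."""
    dest_path = Path(dest).resolve()
    subprocess.run(
        ["git", "-C", repo, "worktree", "remove", "--force", str(dest_path)],
        check=True,
    )


# Hypothetical paths; only the command is built here, nothing is executed.
cmd = worktree_add_command("./my-project", "/tmp/agent-eval/claude-code-run1", "abc1234")
print(" ".join(cmd))
```

A detached checkout avoids creating a branch per run, which keeps the base repo's branch list clean after many trials.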

Metrics Collected

Metric	What It Measures
Pass rate	Did the agent produce code that passes the judge?
Cost	API spend per task (when available)
Time	Wall-clock seconds to completion
Consistency	Pass rate across repeated runs (e.g., 3/3 = 100%)
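
The metrics in the table can all be derived from a list of per-run records. The sketch below shows one way to aggregate them; `RunResult` and `summarize` are illustrative names rather than agent-eval types, and the run data is made up for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunResult:
    passed: bool
    seconds: float
    cost_usd: Optional[float] = None  # None when the agent reports no cost


def summarize(runs: list[RunResult]) -> dict:
    """Aggregate repeated runs of one agent on one task."""
    if not runs:
        raise ValueError("need at least one run")
    passes = sum(r.passed for r in runs)
    costs = [r.cost_usd for r in runs if r.cost_usd is not None]
    return {
        "pass_rate": f"{passes}/{len(runs)}",
        "consistency_pct": round(100 * passes / len(runs)),
        "mean_seconds": mean([r.seconds for r in runs]),
        "mean_cost_usd": mean(costs) if costs else None,
    }


# Invented data: three runs, one of which failed the judge.
runs = [
    RunResult(passed=True, seconds=40.0, cost_usd=0.08),
    RunResult(passed=False, seconds=36.0, cost_usd=0.08),
    RunResult(passed=True, seconds=38.0, cost_usd=0.08),
]
summary = summarize(runs)
print(summary["pass_rate"], f'{summary["consistency_pct"]}%')  # 2/3 67%
```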

Workflow

1. Define Tasks

Create a tasks/ directory with YAML files, one per task:

bash

mkdir tasks
# Write task definitions (see template above)

2. Run Agents

Execute agents against your tasks:

bash

agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3

Each run:

Creates a fresh git worktree from the specified commit
Hands the prompt to the agent
Runs the judge criteria
Records pass/fail, cost, and time
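
The four steps of a run can be sketched as a small harness. This is an illustration only: the callables passed in (`setup`, `invoke_agent`, the judges) are stand-ins, and the choices to time only the agent invocation and to require every judge to pass are assumptions, not documented agent-eval behavior. Cost is left out because it depends on what each agent reports.

```python
import time
from typing import Callable


def evaluate_run(
    setup: Callable[[], str],
    invoke_agent: Callable[[str], None],
    judges: list[Callable[[str], bool]],
) -> dict:
    """Run one trial: prepare a workspace, let the agent work, then judge it."""
    workdir = setup()                       # 1. fresh worktree at the pinned commit
    start = time.perf_counter()
    invoke_agent(workdir)                   # 2. hand the prompt to the agent
    seconds = time.perf_counter() - start
    passed = all(judge(workdir) for judge in judges)  # 3. run the judge criteria
    return {"passed": passed, "seconds": seconds}     # 4. record the outcome


# Stub callables so the sketch runs without a real agent or repository.
result = evaluate_run(
    setup=lambda: "/tmp/agent-eval/run-1",
    invoke_agent=lambda workdir: None,
    judges=[lambda workdir: True, lambda workdir: False],
)
print(result["passed"])  # False, because one judge rejected the run
```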

3. Compare Results

Generate a comparison report:

bash

agent-eval report --format table

Task: add-retry-logic (3 runs each)
┌──────────────┬───────────┬────────┬────────┬─────────────┐
│ Agent        │ Pass Rate │ Cost   │ Time   │ Consistency │
├──────────────┼───────────┼────────┼────────┼─────────────┤
│ claude-code  │ 3/3       │ $0.12  │ 45s    │ 100%        │
│ aider        │ 2/3       │ $0.08  │ 38s    │  67%        │
└──────────────┴───────────┴────────┴────────┴─────────────┘

Judge Types

Code-Based (deterministic)

yaml

judge:
  - type: pytest
    command: pytest tests/ -v
  - type: command
    command: npm run build
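
A code-based judge is deterministic because it reduces to an exit status. As a rough sketch, assuming a judge passes when its command exits with status 0 (the usual convention for test runners and builds), the check could look like this. `run_command_judge` and its timeout default are hypothetical.

```python
import shlex
import subprocess
import sys


def run_command_judge(command: str, cwd: str = ".", timeout: float = 600) -> bool:
    """Pass if the command exits with status 0; timeouts and missing tools fail."""
    try:
        completed = subprocess.run(
            shlex.split(command), cwd=cwd, capture_output=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return completed.returncode == 0


# Stand-in commands that use the current interpreter instead of pytest or npm.
python = shlex.quote(sys.executable)
ok = run_command_judge(f"{python} -c 'raise SystemExit(0)'")
failed = run_command_judge(f"{python} -c 'raise SystemExit(1)'")
print(ok, failed)
```

In practice `cwd` would be the run's worktree, so that `pytest tests/ -v` or `npm run build` sees the agent's changes rather than the base repo.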

Pattern-Based

yaml

judge:
  - type: grep
    pattern: "class.*Retry"
    files: src/**/*.py
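
A pattern-based judge can be approximated with a regex search over the files matched by the glob. The sketch below assumes the judge passes when at least one matching file contains the pattern; whether agent-eval uses this rule, or shells out to `grep`, is not stated in this document.

```python
import re
import tempfile
from pathlib import Path


def run_grep_judge(pattern: str, files: str, root: str = ".") -> bool:
    """Pass if any file matching the glob contains a regex match for `pattern`."""
    regex = re.compile(pattern)
    for path in Path(root).glob(files):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if regex.search(text):  # "." stops at newlines, so matches stay on one line
            return True
    return False


# Build a throwaway tree to exercise the judge.
with tempfile.TemporaryDirectory() as root:
    target = Path(root, "src", "pkg", "retry.py")
    target.parent.mkdir(parents=True)
    target.write_text("class HttpRetry:\n    max_attempts = 3\n", encoding="utf-8")
    found = run_grep_judge("class.*Retry", "src/**/*.py", root)
    missing = run_grep_judge("class.*Circuit", "src/**/*.py", root)

print(found, missing)  # True False
```

A pattern match only shows that certain text exists, not that the code works, which is why it pairs well with a test-based judge.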

Model-Based (LLM-as-judge)

yaml

judge:
  - type: llm
    prompt: |
      Does this implementation correctly handle exponential backoff?
      Check for: max retries, increasing delays, jitter.

Best Practices

Start with 3-5 tasks that represent your real workload, not toy examples
Run at least 3 trials per agent to capture variance — agents are non-deterministic
Pin the commit in your task YAML so results are reproducible across days/weeks
Include at least one deterministic judge (tests, build) per task — LLM judges add noise
Track cost alongside pass rate — a 95% agent at 10x the cost may not be the right choice
Version your task definitions — they are test fixtures, treat them as code
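
One way to weigh cost against pass rate is to compute the spend per passing run. The sketch below uses the figures from the example report and assumes its Cost column is the average spend per run; that reading, and the `cost_per_success` helper, are assumptions for illustration.

```python
def cost_per_success(passes: int, runs: int, cost_per_run_usd: float) -> float:
    """Total spend across all runs divided by the number of passing runs."""
    if passes == 0:
        return float("inf")  # the agent never succeeded, so no spend buys a pass
    return runs * cost_per_run_usd / passes


claude_code = cost_per_success(passes=3, runs=3, cost_per_run_usd=0.12)
aider = cost_per_success(passes=2, runs=3, cost_per_run_usd=0.08)
print(f"claude-code: ${claude_code:.2f} per passing run")
print(f"aider:       ${aider:.2f} per passing run")
```

Under these assumptions the cheaper-per-run agent is no cheaper per passing run, because its failed attempt still has to be paid for.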

Links

Repository: github.com/joaquinhuigomez/agent-eval
