# eval-recipes Runner Skill

## Purpose

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.

## When to Use

- User asks to "test with eval-recipes"
- User says "run the evals" or "benchmark this change"
- User wants to validate improvements against codex/claude_code
- Testing a PR branch to prove it improves scores

## Capabilities

I can run eval-recipes benchmarks to:

1. Test specific amplihack branches
2. Compare against baseline agents (codex, claude_code)
3. Run specific tasks (linkedin_drafting, email_drafting, etc.)
4. Compare before/after scores for PRs
5. Generate reports with score improvements
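Capabilities 4 and 5 (before/after comparison and a score report) could be sketched as follows. This is a hypothetical helper, not part of eval-recipes: the score format (one float per task) is an assumption, not the actual output schema.

```python
# Hypothetical report helper: compare baseline vs. candidate scores per task.
# The dict-of-floats score format is an assumption for illustration.
def compare_scores(baseline: dict, candidate: dict) -> str:
    """Render a per-task before/after table with deltas."""
    rows = [f"{'task':<25} {'before':>7} {'after':>7} {'delta':>7}"]
    for task in sorted(baseline):
        before = baseline[task]
        after = candidate.get(task, 0.0)
        rows.append(
            f"{task:<25} {before:>7.1f} {after:>7.1f} {after - before:>+7.1f}"
        )
    return "\n".join(rows)


print(compare_scores(
    {"linkedin_drafting": 6.5, "email_drafting": 26.0},
    {"linkedin_drafting": 35.2, "email_drafting": 45.0},
))
```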

## How It Works

### Setup (One-Time)

```bash
# Clone eval-recipes from Microsoft
git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes

# Copy our agent configs (run from the amplihack repo root)
cp -r $(pwd)/.claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

# Install dependencies
cd ~/eval-recipes
uv sync
```

### Running Benchmarks

**Test a specific branch:**

```bash
# Update install.dockerfile to use the specific branch, then run the benchmark
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
```

**Compare before/after:**

```bash
# Test baseline (main)
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting

# Test PR branch (edit install.dockerfile to checkout PR branch)
uv run eval_recipes/main.py --agent amplihack_pr1443 --task linkedin_drafting

# Compare scores
```

## Available Tasks

Common tasks from eval-recipes:

- `linkedin_drafting` - Create tool for LinkedIn posts (scored 6.5/100 before PR #1443)
- `email_drafting` - Create CLI tool for emails (scored 26/100 before)
- `arxiv_paper_summarizer` - Research tool
- `github_docs_extractor` - Documentation tool
- Many more in `~/eval-recipes/data/tasks/`

## Typical Workflow

When a user says "test this change with eval-recipes":

1. Identify the branch/PR to test
2. Update agent config to use that branch:

   ```dockerfile
   # In .claude/agents/eval-recipes/amplihack/install.dockerfile
   RUN git clone https://github.com/rysweet/...git /tmp/amplihack && \
       cd /tmp/amplihack && \
       git checkout BRANCH_NAME && \
       pip install -e .
   ```

3. Copy to eval-recipes:

   ```bash
   cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
   ```

4. Run benchmark:

   ```bash
   cd ~/eval-recipes
   uv run eval_recipes/main.py --agent amplihack --task TASK_NAME --trials 3
   ```

5. Report scores and compare with baseline
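Steps 2-4 above can be tied together in one wrapper. This is a hypothetical sketch: `run_eval` is not an existing command, and the `sed` edit assumes install.dockerfile contains a single `git checkout ...` continuation line, which may not match your copy exactly.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for the workflow above (sketch).
# Usage: run_eval BRANCH [TASK]   -- run from the amplihack repo root.
run_eval() {
  local branch="$1" task="${2:-linkedin_drafting}"
  # Step 2: point the agent config at the branch under test
  # (assumes a single "git checkout ... && \" line in the dockerfile)
  sed -i.bak "s|git checkout .*|git checkout ${branch} \\&\\& \\\\|" \
    .claude/agents/eval-recipes/amplihack/install.dockerfile
  # Step 3: copy configs into the eval-recipes checkout
  cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
  # Step 4: run the benchmark with 3 trials
  (cd ~/eval-recipes && uv run eval_recipes/main.py \
      --agent amplihack --task "$task" --trials 3)
}
```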

## Expected Scores

Baseline (main branch):

- Overall: 40.6/100
- LinkedIn: 6.5/100
- Email: 26/100

With PR #1443 (task classification):

- Expected: 55-60/100 (+15-20 points)
- LinkedIn: 30-40/100 (creates actual tool)
- Email: 45/100 (consistent execution)

## Example Usage

User says: "Test PR #1443 with eval-recipes on the LinkedIn task"

I do:

1. Update install.dockerfile to checkout `feat/issue-1435-task-classification`
2. Copy to eval-recipes:

   ```bash
   cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
   ```

3. Run:

   ```bash
   cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
   ```

4. Report results: "Score: 35.2/100 (up from 6.5 baseline)"

## Prerequisites

- eval-recipes cloned to `~/eval-recipes`
- API key in environment: `export ANTHROPIC_API_KEY=sk-ant-...`
- Docker installed (for containerized runs)
- uv installed: `curl -LsSf https://astral.sh/uv/install.sh | sh`
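A quick preflight check for the prerequisites above might look like this. It is a sketch (the `preflight` function is hypothetical, not a provided tool); it reports each requirement instead of failing hard.

```shell
#!/bin/sh
# Sketch: report each prerequisite as OK or MISSING.
preflight() {
  status=0
  check() {
    if eval "$2" >/dev/null 2>&1; then
      echo "OK      $1"
    else
      echo "MISSING $1"
      status=1
    fi
  }
  check "eval-recipes clone at ~/eval-recipes" '[ -d "$HOME/eval-recipes" ]'
  check "ANTHROPIC_API_KEY set"                '[ -n "$ANTHROPIC_API_KEY" ]'
  check "docker on PATH"                       'command -v docker'
  check "uv on PATH"                           'command -v uv'
  return $status
}

preflight || echo "Install the missing items before running benchmarks"
```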

## Notes

- Benchmarks take 2-15 minutes per task depending on complexity
- Multiple trials (3-5) give more reliable averages
- Docker builds can be cached for speed
- Results saved to `.benchmark_results/` in the eval-recipes repo

## Automation

For fully autonomous testing:

```bash
# Test suite for a PR
tasks="linkedin_drafting email_drafting arxiv_paper_summarizer"
for task in $tasks; do
  uv run eval_recipes/main.py --agent amplihack --task $task --trials 3
done

# Compare results
cat .benchmark_results/*/amplihack/*/score.txt
```
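To average scores across trials, the per-trial `score.txt` files could be aggregated with a small helper. This is a hypothetical sketch: it assumes each `score.txt` holds a single number, which may not match the real file format.

```python
# Hypothetical aggregator for per-trial score files.
# Assumption: each score.txt contains one bare number.
from pathlib import Path
from statistics import mean


def average_score(paths) -> float:
    """Average the numeric contents of the given score.txt files."""
    scores = [float(Path(p).read_text().strip()) for p in paths]
    return mean(scores) if scores else 0.0
```

For example, the files matched by `glob.glob(".benchmark_results/*/amplihack/*/score.txt")` could be passed in to get one averaged number per agent.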