# eval-recipes Runner Skill

## Purpose

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.

## When to Use

- User asks to "test with eval-recipes"
- User says "run the evals" or "benchmark this change"
- User wants to validate improvements against codex/claude_code
- Testing a PR branch to prove it improves scores

## Capabilities

I can run eval-recipes benchmarks to:

1. Test specific amplihack branches
2. Compare against baseline agents (codex, claude_code)
3. Run specific tasks (linkedin_drafting, email_drafting, etc.)
4. Compare before/after scores for PRs
5. Generate reports with score improvements
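Capabilities 4 and 5 (before/after comparison and a score report) could be sketched as follows. This is a hypothetical helper, not part of eval-recipes: the score format (one float per task) is an assumption, not the actual output schema.

```python
# Hypothetical report helper: compare baseline vs. candidate scores per task.
# The dict-of-floats score format is an assumption for illustration.
def compare_scores(baseline: dict, candidate: dict) -> str:
    """Render a per-task before/after table with deltas."""
    rows = [f"{'task':<25} {'before':>7} {'after':>7} {'delta':>7}"]
    for task in sorted(baseline):
        before = baseline[task]
        after = candidate.get(task, 0.0)
        rows.append(
            f"{task:<25} {before:>7.1f} {after:>7.1f} {after - before:>+7.1f}"
        )
    return "\n".join(rows)


print(compare_scores(
    {"linkedin_drafting": 6.5, "email_drafting": 26.0},
    {"linkedin_drafting": 35.2, "email_drafting": 45.0},
))
```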

## How It Works

### Setup (One-Time)

```bash
# Clone eval-recipes from Microsoft
git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes

# Copy our agent configs (run from the amplihack repo root)
cp -r $(pwd)/.claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

# Install dependencies
cd ~/eval-recipes
uv sync
```

### Running Benchmarks

**Test a specific branch:**

```bash
# Update install.dockerfile to use the specific branch, then run the benchmark
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
```

**Compare before/after:**

```bash
# Test baseline (main)
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting

# Test PR branch (edit install.dockerfile to checkout PR branch)
uv run eval_recipes/main.py --agent amplihack_pr1443 --task linkedin_drafting

# Compare scores
```

## Available Tasks

Common tasks from eval-recipes:

- `linkedin_drafting` - Create tool for LinkedIn posts (scored 6.5/100 before PR #1443)
- `email_drafting` - Create CLI tool for emails (scored 26/100 before)
- `arxiv_paper_summarizer` - Research tool
- `github_docs_extractor` - Documentation tool
- Many more in `~/eval-recipes/data/tasks/`

## Typical Workflow

When a user says "test this change with eval-recipes":

1. Identify the branch/PR to test
2. Update agent config to use that branch:

   ```dockerfile
   # In .claude/agents/eval-recipes/amplihack/install.dockerfile
   RUN git clone https://github.com/rysweet/...git /tmp/amplihack && \
       cd /tmp/amplihack && \
       git checkout BRANCH_NAME && \
       pip install -e .
   ```

3. Copy to eval-recipes:

   ```bash
   cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
   ```

4. Run benchmark:

   ```bash
   cd ~/eval-recipes
   uv run eval_recipes/main.py --agent amplihack --task TASK_NAME --trials 3
   ```

5. Report scores and compare with baseline
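Steps 2-4 above can be tied together in one wrapper. This is a hypothetical sketch: `run_eval` is not an existing command, and the `sed` edit assumes install.dockerfile contains a single `git checkout ...` continuation line, which may not match your copy exactly.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for the workflow above (sketch).
# Usage: run_eval BRANCH [TASK]   -- run from the amplihack repo root.
run_eval() {
  local branch="$1" task="${2:-linkedin_drafting}"
  # Step 2: point the agent config at the branch under test
  # (assumes a single "git checkout ... && \" line in the dockerfile)
  sed -i.bak "s|git checkout .*|git checkout ${branch} \\&\\& \\\\|" \
    .claude/agents/eval-recipes/amplihack/install.dockerfile
  # Step 3: copy configs into the eval-recipes checkout
  cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
  # Step 4: run the benchmark with 3 trials
  (cd ~/eval-recipes && uv run eval_recipes/main.py \
      --agent amplihack --task "$task" --trials 3)
}
```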

## Expected Scores

Baseline (main branch):

- Overall: 40.6/100
- LinkedIn: 6.5/100
- Email: 26/100

With PR #1443 (task classification):

- Expected: 55-60/100 (+15-20 points)
- LinkedIn: 30-40/100 (creates actual tool)
- Email: 45/100 (consistent execution)

## Example Usage

User says: "Test PR #1443 with eval-recipes on the LinkedIn task"

I do:

1. Update install.dockerfile to checkout `feat/issue-1435-task-classification`
2. Copy to eval-recipes:

   ```bash
   cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
   ```

3. Run:

   ```bash
   cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
   ```

4. Report results: "Score: 35.2/100 (up from 6.5 baseline)"

## Prerequisites

- eval-recipes cloned to `~/eval-recipes`
- API key in environment: `export ANTHROPIC_API_KEY=sk-ant-...`
- Docker installed (for containerized runs)
- uv installed: `curl -LsSf https://astral.sh/uv/install.sh | sh`
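A quick preflight check for the prerequisites above might look like this. It is a sketch (the `preflight` function is hypothetical, not a provided tool); it reports each requirement instead of failing hard.

```shell
#!/bin/sh
# Sketch: report each prerequisite as OK or MISSING.
preflight() {
  status=0
  check() {
    if eval "$2" >/dev/null 2>&1; then
      echo "OK      $1"
    else
      echo "MISSING $1"
      status=1
    fi
  }
  check "eval-recipes clone at ~/eval-recipes" '[ -d "$HOME/eval-recipes" ]'
  check "ANTHROPIC_API_KEY set"                '[ -n "$ANTHROPIC_API_KEY" ]'
  check "docker on PATH"                       'command -v docker'
  check "uv on PATH"                           'command -v uv'
  return $status
}

preflight || echo "Install the missing items before running benchmarks"
```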

## Notes

- Benchmarks take 2-15 minutes per task depending on complexity
- Multiple trials (3-5) give more reliable averages
- Docker builds can be cached for speed
- Results saved to `.benchmark_results/` in the eval-recipes repo

## Automation

For fully autonomous testing:

```bash
# Test suite for a PR
tasks="linkedin_drafting email_drafting arxiv_paper_summarizer"
for task in $tasks; do
  uv run eval_recipes/main.py --agent amplihack --task $task --trials 3
done

# Compare results
cat .benchmark_results/*/amplihack/*/score.txt
```
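To average scores across trials, the per-trial `score.txt` files could be aggregated with a small helper. This is a hypothetical sketch: it assumes each `score.txt` holds a single number, which may not match the real file format.

```python
# Hypothetical aggregator for per-trial score files.
# Assumption: each score.txt contains one bare number.
from pathlib import Path
from statistics import mean


def average_score(paths) -> float:
    """Average the numeric contents of the given score.txt files."""
    scores = [float(Path(p).read_text().strip()) for p in paths]
    return mean(scores) if scores else 0.0
```

For example, the files matched by `glob.glob(".benchmark_results/*/amplihack/*/score.txt")` could be passed in to get one averaged number per agent.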