# eval-recipes Runner Skill
## Purpose

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.
## When to Use

- User asks to "test with eval-recipes"
- User says "run the evals" or "benchmark this change"
- User wants to validate improvements against codex/claude_code
- Testing a PR branch to prove it improves scores
## Capabilities

I can run eval-recipes benchmarks to:

- Test specific amplihack branches
- Compare against baseline agents (codex, claude_code)
- Run specific tasks (linkedin_drafting, email_drafting, etc.)
- Compare before/after scores for PRs
- Generate reports with score improvements
## How It Works

### Setup (One-Time)

```bash
# Clone eval-recipes from Microsoft
git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes

# Copy our agent configs (run from the amplihack repo root)
cp -r "$(pwd)"/.claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

# Install dependencies
cd ~/eval-recipes
uv sync
```
### Running Benchmarks

Test a specific branch:

```bash
# Update install.dockerfile to use the specific branch,
# then run the benchmark
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
```

**Compare before/after:**

```bash
# Test baseline (main)
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting

# Test PR branch (edit install.dockerfile to check out the PR branch)
uv run eval_recipes/main.py --agent amplihack_pr1443 --task linkedin_drafting

# Compare scores
```
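The `# Compare scores` step above is left open. One way to fill it in is a small helper that averages the per-trial scores of a run; the `avg_score` name is hypothetical, and the assumption that each trial writes its numeric score to a `score.txt` file under the results directory is inferred from the `score.txt` paths used elsewhere in this document:

```shell
# Hypothetical helper: average all per-trial score.txt values
# found under a results directory (layout is an assumption).
avg_score() {
  find "$1" -name score.txt -exec cat {} + |
    awk '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# Example (path is illustrative):
# avg_score ~/eval-recipes/.benchmark_results
```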
## Available Tasks

Common tasks from eval-recipes:

- linkedin_drafting - Create a tool for LinkedIn posts (scored 6.5/100 before PR #1443)
- email_drafting - Create a CLI tool for emails (scored 26/100 before)
- arxiv_paper_summarizer - Research tool
- github_docs_extractor - Documentation tool
- Many more in `~/eval-recipes/data/tasks/`
## Typical Workflow

When the user says "test this change with eval-recipes":

- Identify the branch/PR to test
- Update the agent config to use that branch:

  ```dockerfile
  # In .claude/agents/eval-recipes/amplihack/install.dockerfile
  RUN git clone https://github.com/rysweet/...git /tmp/amplihack && \
      cd /tmp/amplihack && \
      git checkout BRANCH_NAME && \
      pip install -e .
  ```

- Copy to eval-recipes:

  ```bash
  cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
  ```

- Run the benchmark:

  ```bash
  cd ~/eval-recipes
  uv run eval_recipes/main.py --agent amplihack --task TASK_NAME --trials 3
  ```

- Report scores and compare with the baseline
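The dockerfile edit in the workflow above can be scripted. This `set_branch` helper is a hypothetical sketch; it assumes the dockerfile contains the literal `git checkout BRANCH_NAME` placeholder shown in the snippet:

```shell
# Hypothetical helper: substitute the branch under test into the
# install dockerfile, assuming a literal BRANCH_NAME placeholder.
set_branch() {
  branch="$1"
  file="$2"
  sed "s|git checkout BRANCH_NAME|git checkout ${branch}|" "$file" > "${file}.tmp" &&
    mv "${file}.tmp" "$file"
}

# Example (paths are illustrative):
# set_branch feat/issue-1435-task-classification \
#   .claude/agents/eval-recipes/amplihack/install.dockerfile
```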
## Expected Scores

Baseline (main branch):

- Overall: 40.6/100
- LinkedIn: 6.5/100
- Email: 26/100

With PR #1443 (task classification):

- Expected: 55-60/100 (+15-20 points)
- LinkedIn: 30-40/100 (creates an actual tool)
- Email: 45/100 (consistent execution)
## Example Usage

User says: "Test PR #1443 with eval-recipes on the LinkedIn task"

I do:

- Update install.dockerfile to check out `feat/issue-1435-task-classification`
- Copy to eval-recipes: `cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/`
- Run: `cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3`
- Report results: "Score: 35.2/100 (up from 6.5 baseline)"
## Prerequisites

- eval-recipes cloned to `~/eval-recipes`
- API key in the environment: `export ANTHROPIC_API_KEY=sk-ant-...`
- Docker installed (for containerized runs)
- uv installed: `curl -LsSf https://astral.sh/uv/install.sh | sh`
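The prerequisites above can be verified with a small preflight function before a run; `check_prereqs` is an illustrative name, and the checks mirror only the items listed here:

```shell
# Hypothetical preflight check for the prerequisites listed above.
check_prereqs() {
  [ -d "$HOME/eval-recipes" ] || { echo "missing: ~/eval-recipes clone"; return 1; }
  [ -n "$ANTHROPIC_API_KEY" ] || { echo "missing: ANTHROPIC_API_KEY"; return 1; }
  command -v docker >/dev/null 2>&1 || { echo "missing: docker"; return 1; }
  command -v uv >/dev/null 2>&1 || { echo "missing: uv"; return 1; }
  echo "ok"
}

# Example:
# check_prereqs && echo "ready to benchmark"
```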
## Notes

- Benchmarks take 2-15 minutes per task depending on complexity
- Multiple trials (3-5) give more reliable averages
- Docker builds can be cached for speed
- Results are saved to `.benchmark_results/` in the eval-recipes repo
## Automation

For fully autonomous testing:

```bash
# Test suite for a PR
tasks="linkedin_drafting email_drafting arxiv_paper_summarizer"
for task in $tasks; do
  uv run eval_recipes/main.py --agent amplihack --task $task --trials 3
done

# Compare results
cat .benchmark_results/*/amplihack/*/score.txt
```
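For the final comparison, a small helper can turn two score files into a point delta, matching the "up from 6.5 baseline" style of reporting used earlier; `score_delta` is a hypothetical name and assumes each file holds a single numeric score:

```shell
# Hypothetical helper: report the signed point difference between
# a baseline score file and a PR score file (one number per file).
score_delta() {
  awk -v a="$(cat "$1")" -v b="$(cat "$2")" 'BEGIN { printf "%+.1f\n", b - a }'
}

# Example (paths are illustrative):
# score_delta baseline/score.txt pr1443/score.txt
```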