model-evaluation-benchmark
Model Evaluation Benchmark Skill
Purpose: Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3 reference implementation.
Auto-activates when: User requests model benchmarking, comparison evaluation, or performance testing between AI models in agentic workflows.
Skill Description
This skill orchestrates end-to-end model evaluation benchmarks that measure:
- Efficiency: Duration, turns, cost, tool calls
- Quality: Code quality scores via reviewer agents
- Workflow Adherence: Subagent calls, skills used, workflow step compliance
- Artifacts: GitHub issues, PRs, documentation generated
The skill automates the entire benchmark workflow from execution through cleanup, following the v3 reference implementation.
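For concreteness, a single run's measurements along these four dimensions might be captured in a per-run record like the sketch below. The result.json schema is not documented here, so every field name is an assumption:

```python
# A hypothetical per-run benchmark record covering the four measured
# dimensions; all field names are assumptions, not the real result.json schema.
example_result = {
    "model": "opus",
    "task": 1,
    # Efficiency
    "duration_s": 412.7,
    "turns": 18,
    "cost_usd": 3.42,
    "tool_calls": 57,
    # Quality (1-5 scale, scored by a reviewer agent)
    "code_quality": 4.0,
    # Workflow adherence
    "subagent_calls": ["reviewer"],
    "skills_used": ["model-evaluation-benchmark"],
    "workflow_steps_completed": 13,
    # Artifacts
    "issues_created": 1,
    "prs_created": 1,
}
```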
When to Use
✅ Use when:
- Comparing AI models (Opus vs Sonnet, etc.)
- Measuring workflow adherence
- Generating comprehensive benchmark reports
- Need reproducible benchmarking
❌ Don't use when:
- Simple code reviews (use reviewer)
- Performance profiling (use optimizer)
- Architecture decisions (use architect)
Execution Instructions
When this skill is invoked, follow these steps:
Phase 1: Setup
- Read tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md
- Identify models to benchmark (default: Opus 4.5, Sonnet 4.5)
- Create TodoWrite list with all phases
Phase 2: Execute Benchmarks
For each task × model:

```bash
cd tests/benchmarks/benchmark_suite_v3
python run_benchmarks.py --model {opus|sonnet} --tasks 1,2,3,4
```
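As a sketch, the task × model matrix can be expanded into one runner invocation per model. Nothing here is executed; it only assembles the command lines shown above (the model list and task IDs mirror the Phase 1 defaults):

```python
import shlex

MODELS = ["opus", "sonnet"]  # Phase 1 defaults
TASKS = "1,2,3,4"            # task IDs passed straight to the runner

def build_commands(models=MODELS, tasks=TASKS):
    """Return one run_benchmarks.py command line per model (dry run)."""
    return [
        f"python run_benchmarks.py --model {shlex.quote(m)} --tasks {tasks}"
        for m in models
    ]

for cmd in build_commands():
    print(cmd)
# → python run_benchmarks.py --model opus --tasks 1,2,3,4
# → python run_benchmarks.py --model sonnet --tasks 1,2,3,4
```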
Phase 3: Analyze Results
- Read all result files: ~/.amplihack/.claude/runtime/benchmarks/suite_v3/*/result.json
- Launch parallel Task tool calls with subagent_type="reviewer" to:
  - Analyze trace logs for tool/agent/skill usage
  - Score code quality (1-5 scale)
  - Synthesize findings
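Loading and rolling up the result files can be sketched as follows. The glob pattern comes from the step above, but the per-run fields (model, cost_usd, code_quality) are assumed names rather than a documented schema:

```python
import glob
import json
import os

RESULTS_GLOB = "~/.amplihack/.claude/runtime/benchmarks/suite_v3/*/result.json"

def load_results(pattern=RESULTS_GLOB):
    """Load every result.json matched by the (tilde-expanded) glob."""
    results = []
    for path in glob.glob(os.path.expanduser(pattern)):
        with open(path) as f:
            results.append(json.load(f))
    return results

def summarize(results):
    """Roll assumed per-run metrics up by model for the comparison report."""
    summary = {}
    for r in results:
        s = summary.setdefault(r["model"], {"runs": 0, "cost_usd": 0.0, "quality": []})
        s["runs"] += 1
        s["cost_usd"] += r.get("cost_usd", 0.0)
        if "code_quality" in r:
            s["quality"].append(r["code_quality"])
    return summary
```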
Phase 4: Generate Report
- Create markdown report following the BENCHMARK_REPORT_V3.md structure
- Create GitHub issue with report
- Archive artifacts to GitHub release
- Update issue with release link
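The issue and release steps map onto a few gh invocations. The sketch below only assembles the command strings (dry run); the report filename, issue title, and release tag are illustrative placeholders, and the issue number cannot be known until the issue exists:

```python
def report_commands(report="BENCHMARK_REPORT.md",
                    tag="benchmark-suite-v3-artifacts"):
    """Assemble the Phase 4 gh commands without executing anything.
    The report path, title, and tag are hypothetical placeholders."""
    return [
        f"gh issue create --title 'Benchmark Suite V3 results' --body-file {report}",
        f"gh release create {tag} --notes 'Benchmark artifacts' {report}",
        # The issue number is only known after creation; {issue} stands in.
        f"gh issue comment {{issue}} --body 'Artifacts archived under release {tag}'",
    ]

for cmd in report_commands():
    print(cmd)
```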
Phase 5: Cleanup (MANDATORY)
- Close all benchmark PRs: gh pr close {numbers}
- Close all benchmark issues: gh issue close {numbers}
- Remove worktrees: git worktree remove worktrees/bench-*
- Verify cleanup complete

See tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md for detailed cleanup instructions.
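The three cleanup steps can be collected into a single dry-run helper so nothing is missed; the PR/issue numbers and worktree path in the usage line are made-up examples:

```python
def cleanup_commands(pr_numbers, issue_numbers, worktrees):
    """Assemble every Phase 5 cleanup command (dry run, nothing executed)."""
    cmds = [f"gh pr close {n}" for n in pr_numbers]
    cmds += [f"gh issue close {n}" for n in issue_numbers]
    cmds += [f"git worktree remove {w}" for w in worktrees]
    return cmds

# Hypothetical numbers and paths, purely for illustration:
for cmd in cleanup_commands([1234, 1235], [1230], ["worktrees/bench-opus"]):
    print(cmd)
```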
Example Usage
User: "Run model evaluation benchmark"

Assistant: I'll run the complete benchmark suite following the v3 reference implementation.

[Executes phases 1-5 above]

Final Report: See GitHub Issue #XXXX
Artifacts: https://github.com/.../releases/tag/benchmark-suite-v3-artifacts
References
- Reference Report: tests/benchmarks/benchmark_suite_v3/BENCHMARK_REPORT_V3.md
- Task Definitions: tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md
- Cleanup Guide: tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md
- Runner Script: tests/benchmarks/benchmark_suite_v3/run_benchmarks.py
Last Updated: 2025-11-26
Reference Implementation: Benchmark Suite V3
GitHub Issue Example: #1698