model-evaluation-benchmark

Model Evaluation Benchmark Skill

Purpose: Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3 reference implementation.
Auto-activates when: User requests model benchmarking, comparison evaluation, or performance testing between AI models in agentic workflows.

Skill Description

This skill orchestrates end-to-end model evaluation benchmarks that measure:
  • Efficiency: Duration, turns, cost, tool calls
  • Quality: Code quality scores via reviewer agents
  • Workflow Adherence: Subagent calls, skills used, workflow step compliance
  • Artifacts: GitHub issues, PRs, documentation generated
The skill automates the entire benchmark workflow from execution through cleanup, following the v3 reference implementation.

When to Use

Use when:
  • Comparing AI models (Opus vs Sonnet, etc.)
  • Measuring workflow adherence
  • Generating comprehensive benchmark reports
  • Reproducible benchmarking is needed
Don't use when:
  • Simple code reviews (use reviewer)
  • Performance profiling (use optimizer)
  • Architecture decisions (use architect)

Execution Instructions

When this skill is invoked, follow these steps:

Phase 1: Setup

  1. Read tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md
  2. Identify models to benchmark (default: Opus 4.5, Sonnet 4.5)
  3. Create a TodoWrite list with all phases

Phase 2: Execute Benchmarks

For each task × model:
```bash
cd tests/benchmarks/benchmark_suite_v3
python run_benchmarks.py --model {opus|sonnet} --tasks 1,2,3,4
```
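The per-model loop above can be sketched as a small driver. This assumes `run_benchmarks.py` accepts the `--model` and `--tasks` flags exactly as shown; the command-building helper itself is hypothetical:

```python
MODELS = ["opus", "sonnet"]   # defaults from Phase 1
TASKS = [1, 2, 3, 4]          # task IDs from BENCHMARK_TASKS.md

def benchmark_commands(models=MODELS, tasks=TASKS):
    """Build one run_benchmarks.py invocation per model.

    The runner takes every task in a single --tasks flag, so we emit
    one command per model rather than one per task x model pair.
    """
    task_arg = ",".join(str(t) for t in tasks)
    return [
        ["python", "run_benchmarks.py", "--model", m, "--tasks", task_arg]
        for m in models
    ]

for cmd in benchmark_commands():
    print(" ".join(cmd))
```

Running the commands (rather than printing them) would be a `subprocess.run(cmd, check=True)` per entry from inside the suite directory.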

Phase 3: Analyze Results

  1. Read all result files: ~/.amplihack/.claude/runtime/benchmarks/suite_v3/*/result.json
  2. Launch parallel Task tool calls with subagent_type="reviewer" to:
    • Analyze trace logs for tool/agent/skill usage
    • Score code quality (1-5 scale)
  3. Synthesize findings
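Step 1's collection and aggregation can be sketched as below. The result.json field names (`model`, `cost_usd`, `turns`) are assumptions for illustration; adjust them to the real schema:

```python
import json
from pathlib import Path

def load_results(root):
    """Load every */result.json under the benchmark output directory."""
    return [
        json.loads(p.read_text())
        for p in sorted(Path(root).glob("*/result.json"))
    ]

def summarize(results):
    """Aggregate efficiency metrics per model (field names assumed)."""
    summary = {}
    for r in results:
        s = summary.setdefault(r["model"], {"runs": 0, "cost_usd": 0.0, "turns": 0})
        s["runs"] += 1
        s["cost_usd"] += r["cost_usd"]
        s["turns"] += r["turns"]
    return summary

sample = [
    {"model": "opus", "cost_usd": 1.2, "turns": 14},
    {"model": "opus", "cost_usd": 0.8, "turns": 10},
    {"model": "sonnet", "cost_usd": 0.4, "turns": 9},
]
print(summarize(sample))
```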

Phase 4: Generate Report

阶段4:生成报告

  1. Create a markdown report following the BENCHMARK_REPORT_V3.md structure
  2. Create a GitHub issue with the report
  3. Archive artifacts to a GitHub release
  4. Update the issue with the release link
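Steps 2-3 map onto `gh issue create` and `gh release create`. A minimal sketch that only builds the argument vectors; the issue title, report path, and artifact file names here are hypothetical placeholders, not values from the suite:

```python
def publish_commands(report_path="report.md",
                     release_tag="benchmark-suite-v3-artifacts",
                     artifacts=("artifacts.tar.gz",)):
    """Return the gh invocations for the report issue and artifact release.

    report_path and artifacts are placeholders; the real files come from
    the Phase 4 report-generation step.
    """
    return [
        ["gh", "issue", "create",
         "--title", "Model Evaluation Benchmark: Suite V3 Results",
         "--body-file", report_path],
        ["gh", "release", "create", release_tag, *artifacts],
    ]

for cmd in publish_commands():
    print(" ".join(cmd))
```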

Phase 5: Cleanup (MANDATORY)

  1. Close all benchmark PRs: gh pr close {numbers}
  2. Close all benchmark issues: gh issue close {numbers}
  3. Remove worktrees: git worktree remove worktrees/bench-*
  4. Verify cleanup is complete
See tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md for detailed cleanup instructions.
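Step 4's verification can be sketched by parsing `git worktree list` output for leftover bench-* worktrees. This parsing helper is an assumption for illustration, not part of the suite:

```python
def leftover_bench_worktrees(worktree_list_output):
    """Return worktree paths still matching bench-* from `git worktree list`."""
    leftovers = []
    for line in worktree_list_output.splitlines():
        if not line.strip():
            continue
        path = line.split()[0]  # first column of `git worktree list` is the path
        if "/bench-" in path:
            leftovers.append(path)
    return leftovers

# Example `git worktree list` output before cleanup:
sample = """\
/repo                              1a2b3c [main]
/repo/worktrees/bench-opus-task1   4d5e6f [bench-opus-task1]
/repo/worktrees/bench-sonnet-task1 7a8b9c [bench-sonnet-task1]"""
print(leftover_bench_worktrees(sample))
```

An empty list from the real `git worktree list` output means the worktree cleanup is complete.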

Example Usage

User: "Run model evaluation benchmark"
Assistant: I'll run the complete benchmark suite following the v3 reference implementation.

[Executes phases 1-5 above]

Final Report: See GitHub Issue #XXXX
Artifacts: https://github.com/.../releases/tag/benchmark-suite-v3-artifacts

References

  • Reference Report: tests/benchmarks/benchmark_suite_v3/BENCHMARK_REPORT_V3.md
  • Task Definitions: tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md
  • Cleanup Guide: tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md
  • Runner Script: tests/benchmarks/benchmark_suite_v3/run_benchmarks.py

Last Updated: 2025-11-26
Reference Implementation: Benchmark Suite V3
GitHub Issue Example: #1698