model-evaluation-benchmark

Model Evaluation Benchmark Skill

Purpose: Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3 reference implementation.
Auto-activates when: User requests model benchmarking, comparison evaluation, or performance testing between AI models in agentic workflows.

Skill Description

This skill orchestrates end-to-end model evaluation benchmarks that measure:
  • Efficiency: Duration, turns, cost, tool calls
  • Quality: Code quality scores via reviewer agents
  • Workflow Adherence: Subagent calls, skills used, workflow step compliance
  • Artifacts: GitHub issues, PRs, documentation generated
The skill automates the entire benchmark workflow from execution through cleanup, following the v3 reference implementation.

When to Use

Use when:
  • Comparing AI models (Opus vs Sonnet, etc.)
  • Measuring workflow adherence
  • Generating comprehensive benchmark reports
  • Reproducible benchmarking is needed
Don't use when:
  • Simple code reviews (use reviewer)
  • Performance profiling (use optimizer)
  • Architecture decisions (use architect)

Execution Instructions

When this skill is invoked, follow these steps:

Phase 1: Setup

  1. Read tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md
  2. Identify models to benchmark (default: Opus 4.5, Sonnet 4.5)
  3. Create a TodoWrite list with all phases

Phase 2: Execute Benchmarks

For each task × model:
```bash
cd tests/benchmarks/benchmark_suite_v3
python run_benchmarks.py --model {opus|sonnet} --tasks 1,2,3,4
```
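The per-model loop above can be sketched as a small driver. This assumes `run_benchmarks.py` accepts the `--model` and `--tasks` flags exactly as shown; the command-building helper itself is hypothetical:

```python
MODELS = ["opus", "sonnet"]   # defaults from Phase 1
TASKS = [1, 2, 3, 4]          # task IDs from BENCHMARK_TASKS.md

def benchmark_commands(models=MODELS, tasks=TASKS):
    """Build one run_benchmarks.py invocation per model.

    The runner takes every task in a single --tasks flag, so we emit
    one command per model rather than one per task x model pair.
    """
    task_arg = ",".join(str(t) for t in tasks)
    return [
        ["python", "run_benchmarks.py", "--model", m, "--tasks", task_arg]
        for m in models
    ]

for cmd in benchmark_commands():
    print(" ".join(cmd))
```

Running the commands (rather than printing them) would be a `subprocess.run(cmd, check=True)` per entry from inside the suite directory.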

Phase 3: Analyze Results

  1. Read all result files: ~/.amplihack/.claude/runtime/benchmarks/suite_v3/*/result.json
  2. Launch parallel Task tool calls with subagent_type="reviewer" to:
    • Analyze trace logs for tool/agent/skill usage
    • Score code quality (1-5 scale)
  3. Synthesize findings
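Step 1's collection and aggregation can be sketched as below. The result.json field names (`model`, `cost_usd`, `turns`) are assumptions for illustration; adjust them to the real schema:

```python
import json
from pathlib import Path

def load_results(root):
    """Load every */result.json under the benchmark output directory."""
    return [
        json.loads(p.read_text())
        for p in sorted(Path(root).glob("*/result.json"))
    ]

def summarize(results):
    """Aggregate efficiency metrics per model (field names assumed)."""
    summary = {}
    for r in results:
        s = summary.setdefault(r["model"], {"runs": 0, "cost_usd": 0.0, "turns": 0})
        s["runs"] += 1
        s["cost_usd"] += r["cost_usd"]
        s["turns"] += r["turns"]
    return summary

sample = [
    {"model": "opus", "cost_usd": 1.2, "turns": 14},
    {"model": "opus", "cost_usd": 0.8, "turns": 10},
    {"model": "sonnet", "cost_usd": 0.4, "turns": 9},
]
print(summarize(sample))
```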

Phase 4: Generate Report

阶段4:生成报告

  1. Create a markdown report following the BENCHMARK_REPORT_V3.md structure
  2. Create a GitHub issue with the report
  3. Archive artifacts to a GitHub release
  4. Update the issue with the release link
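Steps 2-3 map onto `gh issue create` and `gh release create`. A minimal sketch that only builds the argument vectors; the issue title, report path, and artifact file names here are hypothetical placeholders, not values from the suite:

```python
def publish_commands(report_path="report.md",
                     release_tag="benchmark-suite-v3-artifacts",
                     artifacts=("artifacts.tar.gz",)):
    """Return the gh invocations for the report issue and artifact release.

    report_path and artifacts are placeholders; the real files come from
    the Phase 4 report-generation step.
    """
    return [
        ["gh", "issue", "create",
         "--title", "Model Evaluation Benchmark: Suite V3 Results",
         "--body-file", report_path],
        ["gh", "release", "create", release_tag, *artifacts],
    ]

for cmd in publish_commands():
    print(" ".join(cmd))
```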

Phase 5: Cleanup (MANDATORY)

  1. Close all benchmark PRs: gh pr close {numbers}
  2. Close all benchmark issues: gh issue close {numbers}
  3. Remove worktrees: git worktree remove worktrees/bench-*
  4. Verify cleanup is complete
See tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md for detailed cleanup instructions.
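Step 4's verification can be sketched by parsing `git worktree list` output for leftover bench-* worktrees. This parsing helper is an assumption for illustration, not part of the suite:

```python
def leftover_bench_worktrees(worktree_list_output):
    """Return worktree paths still matching bench-* from `git worktree list`."""
    leftovers = []
    for line in worktree_list_output.splitlines():
        if not line.strip():
            continue
        path = line.split()[0]  # first column of `git worktree list` is the path
        if "/bench-" in path:
            leftovers.append(path)
    return leftovers

# Example `git worktree list` output before cleanup:
sample = """\
/repo                              1a2b3c [main]
/repo/worktrees/bench-opus-task1   4d5e6f [bench-opus-task1]
/repo/worktrees/bench-sonnet-task1 7a8b9c [bench-sonnet-task1]"""
print(leftover_bench_worktrees(sample))
```

An empty list from the real `git worktree list` output means the worktree cleanup is complete.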

Example Usage

User: "Run model evaluation benchmark"
Assistant: I'll run the complete benchmark suite following the v3 reference implementation.

[Executes phases 1-5 above]

Final Report: See GitHub Issue #XXXX
Artifacts: https://github.com/.../releases/tag/benchmark-suite-v3-artifacts

References

  • Reference Report: tests/benchmarks/benchmark_suite_v3/BENCHMARK_REPORT_V3.md
  • Task Definitions: tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md
  • Cleanup Guide: tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md
  • Runner Script: tests/benchmarks/benchmark_suite_v3/run_benchmarks.py

Last Updated: 2025-11-26
Reference Implementation: Benchmark Suite V3
GitHub Issue Example: #1698