# model-evaluation-benchmark
Automated reproduction of comprehensive model-evaluation benchmarks following Benchmark Suite V3. Auto-activates for model benchmarking, comparative evaluation, or performance testing between AI models.
## Installation

```bash
npx skill4agent add rysweet/amplihack model-evaluation-benchmark
```

## Agents Used

- reviewer
- optimizer
- architect

## Workflow

### Phase 1: Setup

Read the task definitions in `tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md`, then change into the suite directory:

```bash
cd tests/benchmarks/benchmark_suite_v3
```
### Phase 2: Run Benchmarks

Run the suite once per model under test:

```bash
python run_benchmarks.py --model {opus|sonnet} --tasks 1,2,3,4
```

### Phase 3: Collect Results

Each run writes its results to `~/.amplihack/.claude/runtime/benchmarks/suite_v3/*/result.json` (see the aggregation sketch after this workflow).

### Phase 4: Review and Report

Have the reviewer agent (`subagent_type="reviewer"`) evaluate the collected results and compile the findings into `BENCHMARK_REPORT_V3.md`.

### Phase 5: Cleanup

Close any PRs and issues opened during the run and remove the benchmark worktrees:

```bash
gh pr close {numbers}
gh issue close {numbers}
git worktree remove worktrees/bench-*
```

Follow `tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md` for the full cleanup checklist.
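After a full run, the per-run `result.json` files can be rolled up into a quick per-model summary. The sketch below is illustrative only: the field names `model` and `passed` are assumptions about the result schema, not confirmed by the suite.

```python
# Sketch: aggregate per-run result.json files into per-model pass counts.
# The "model" and "passed" fields are assumed (hypothetical) schema keys;
# adjust them to match what run_benchmarks.py actually emits.
import glob
import json
import os
from collections import defaultdict

RESULTS_GLOB = os.path.expanduser(
    "~/.amplihack/.claude/runtime/benchmarks/suite_v3/*/result.json"
)

def aggregate(results_glob: str = RESULTS_GLOB) -> dict:
    """Group pass/total counts by model across all collected runs."""
    tally = defaultdict(lambda: {"passed": 0, "total": 0})
    for path in glob.glob(results_glob):
        with open(path) as f:
            result = json.load(f)
        model = result.get("model", "unknown")
        tally[model]["total"] += 1
        if result.get("passed"):
            tally[model]["passed"] += 1
    return dict(tally)

if __name__ == "__main__":
    for model, counts in sorted(aggregate().items()):
        print(f"{model}: {counts['passed']}/{counts['total']} tasks passed")
```

A summary like this is a starting point for the reviewer pass, not a replacement for it; the report in `BENCHMARK_REPORT_V3.md` should still come from the reviewer agent.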
## Example

User: "Run model evaluation benchmark"

Assistant: I'll run the complete benchmark suite following the v3 reference implementation.

[Executes phases 1-5 above]
Final Report: See GitHub Issue #XXXX
Artifacts: https://github.com/.../releases/tag/benchmark-suite-v3-artifacts

## Files

- `tests/benchmarks/benchmark_suite_v3/BENCHMARK_REPORT_V3.md`
- `tests/benchmarks/benchmark_suite_v3/BENCHMARK_TASKS.md`
- `tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md`
- `tests/benchmarks/benchmark_suite_v3/run_benchmarks.py`
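For repeated comparisons, the two per-model runs in Phase 2 can be driven from a small wrapper instead of being typed by hand. A minimal sketch, assuming it is run from `tests/benchmarks/benchmark_suite_v3` and that `run_benchmarks.py` accepts exactly the flags shown above:

```python
# Sketch: run the suite once per model using only the documented CLI flags.
import subprocess

MODELS = ["opus", "sonnet"]  # the two models compared in Phase 2
TASKS = "1,2,3,4"

for model in MODELS:
    subprocess.run(
        ["python", "run_benchmarks.py", "--model", model, "--tasks", TASKS],
        check=True,  # stop early if a run fails
    )
```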