gaia-submission
GAIA Submission Skill
Walk Claude Code through every step needed to go from a clean environment to a
signed, HAL-compatible submission package ready to upload to the Princeton
GAIA leaderboard.
When to use
When the user wants to:
- Run a benchmark and submit results to the HAL leaderboard
- Package an existing results file into a submission archive
- Confirm their environment is ready for a benchmark run
Prerequisites
Before starting, confirm these are available:
| Requirement | Check |
|---|---|
| |
| |
| Node.js 20+ | |
| CLI built | |
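The Node.js 20+ requirement can be checked with a few lines of shell. This is a minimal sketch, not the skill's own check: the function name `node_version_ok` is hypothetical, and it assumes the version string has the usual `node --version` shape (for example `v20.11.0`).

```bash
# Hypothetical helper: returns 0 when a `node --version` style string
# (e.g. "v20.11.0") has a major version of 20 or newer.
node_version_ok() {
  local version major
  version="${1#v}"          # strip the leading "v"
  major="${version%%.*}"    # keep everything before the first dot
  case "$major" in
    ''|*[!0-9]*) return 1 ;;  # reject empty or non-numeric input
  esac
  [ "$major" -ge 20 ]
}

# Usage: only runs the check when node is on PATH.
if command -v node >/dev/null 2>&1; then
  if node_version_ok "$(node --version)"; then
    echo "Node.js OK"
  else
    echo "Node.js 20+ required" >&2
  fi
fi
```

Parsing the version string in a separate function keeps the check testable on machines where Node.js is not installed.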
Phase 1 — Validate environment
```bash
# Run all pre-flight checks
/gaia validate
```

If any check fails, resolve it before continuing.
Phase 2 — Estimate cost and confirm
Ask the user for their configuration:
- Level (default: 1)
- Question limit (default: 53 for a quick run, 165 for the full L1 set)
- Models (default: `claude-sonnet-4-6`)
- Self-consistency voting (default: 1; use 3 for L2/L3)
```bash
/gaia cost --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING
```

If the projected cost exceeds $5, show the estimate and ask: "This run will cost approximately $X. Proceed? (y/N)"
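The confirmation rule above could be sketched in shell as follows. The function names `needs_confirmation` and `confirm_run` are hypothetical, and the sketch assumes the projected cost is already available as a plain decimal number; how `/gaia cost` reports it is not specified here.

```bash
# Threshold in USD, taken from the text above.
COST_THRESHOLD_USD=5

# Returns 0 when the projected cost is strictly greater than the threshold.
# awk does the comparison because POSIX test cannot compare decimals.
needs_confirmation() {
  awk -v cost="$1" -v limit="$COST_THRESHOLD_USD" \
    'BEGIN { exit !(cost + 0 > limit + 0) }'
}

# Prompts only when needed. Returns 0 to proceed and 1 to abort.
# Anything other than y or Y counts as "no", matching the (y/N) default.
confirm_run() {
  local cost="$1" answer
  needs_confirmation "$cost" || return 0
  printf 'This run will cost approximately $%s. Proceed? (y/N) ' "$cost"
  read -r answer
  case "$answer" in
    y|Y) return 0 ;;
    *)   return 1 ;;
  esac
}
```

A call such as `confirm_run 7.50 && echo "starting run"` would prompt first, while `confirm_run 1.20` proceeds without asking.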
Phase 3 — Run the benchmark
```bash
/gaia run --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING
```

While running, progress is reported every 5 questions:

```
[12/53] 22.7% (5 passed of 22 scored) — est. remaining: $0.18
```

Store the run summary in memory for history tracking:

```bash
# Double quotes let the shell expand the variables inside the JSON value
npx @claude-flow/cli@latest memory store \
  --namespace gaia-runs \
  --key "run-$(date +%Y%m%d-%H%M)" \
  --value "{\"level\":$LEVEL,\"model\":\"$MODEL\",\"total\":$TOTAL,\"passed\":$PASSED,\"pass_rate\":$RATE,\"est_cost_usd\":$COST}"
```
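One way to assemble the JSON value for the summary without hand-escaping quotes is a small `printf` helper. This is a sketch under stated assumptions: `pass_rate` and `run_summary_json` are hypothetical names, the pass rate is assumed to be a percentage with one decimal place (as in the progress line), and the model name is assumed to contain no characters that need JSON escaping.

```bash
# Hypothetical helper: prints the pass rate as a percentage with one decimal.
pass_rate() {
  awk -v passed="$1" -v scored="$2" 'BEGIN {
    if (scored + 0 == 0) { print "0.0"; exit }   # avoid division by zero
    printf "%.1f\n", 100 * passed / scored
  }'
}

# Hypothetical helper: prints a one-line JSON summary using the same keys
# as the memory store command above.
run_summary_json() {
  local level="$1" model="$2" total="$3" passed="$4" cost="$5"
  printf '{"level":%s,"model":"%s","total":%s,"passed":%s,"pass_rate":%s,"est_cost_usd":%s}\n' \
    "$level" "$model" "$total" "$passed" "$(pass_rate "$passed" "$total")" "$cost"
}
```

The result can then be passed as a single quoted argument, for example `--value "$(run_summary_json 1 claude-sonnet-4-6 22 5 0.42)"`.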
Phase 4 — Package for submission
```bash
/gaia submit --results=~/.cache/ruflo/gaia/results-latest.json
```

This produces:

```
submission-<date>-<sha>/
├── results.jsonl       ← HAL-compatible, one JSON per line
├── trajectories.jsonl  ← full agent traces
├── metadata.json       ← harness info, model, tool catalogue
├── manifest.md.json    ← Ed25519-signed witness
└── README.md           ← human summary + leaderboard comparison
```
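Before uploading, it may be worth confirming that the package directory contains every file in the listing. The sketch below only checks that the files exist; it does not validate their contents or the signature. The function name `check_submission_dir` is hypothetical.

```bash
# File names copied from the listing above.
EXPECTED_FILES="results.jsonl trajectories.jsonl metadata.json manifest.md.json README.md"

# Hypothetical helper: reports each missing file on stderr.
# Returns 0 when all files are present, 1 otherwise.
check_submission_dir() {
  local dir="$1" name missing=0
  for name in $EXPECTED_FILES; do
    if [ ! -f "$dir/$name" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Usage would look like `check_submission_dir ./submission-<date>-<sha> && echo "package complete"`, with the directory name filled in from the actual run.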
Phase 5 — Compare and report
```bash
/gaia leaderboard --level=$LEVEL
/gaia history
```

Interpret the gap between ruflo's score and the leaderboard top-10.
Identify the primary failure mode (tool gap, reasoning miss, extraction bug)
using the `/gaia-debugging` skill if needed.
Phase 6 — Persist learnings
```bash
npx @claude-flow/cli@latest hooks post-task \
  --task-id "gaia-submission-$(date +%Y%m%d)" \
  --success true \
  --train-neural true
```

Store any discovered patterns:

```bash
npx @claude-flow/cli@latest memory store \
  --namespace gaia-patterns \
  --key "submission-notes-$(date +%Y%m%d)" \
  --value "Level $LEVEL, $MODEL: $NOTES"
```
Extensibility note
This skill is intentionally structured to be benchmark-agnostic. The phase
headers (validate → estimate → run → package → compare → learn) apply to
SWE-bench, WebArena, and HumanEval with only phase 3-4 details changing.
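The idea of shared phases with benchmark-specific overrides can be illustrated with a small dispatcher. This is only a sketch of the structure: every function name here is hypothetical, the handlers just print a label, and nothing below reflects how the skill itself is implemented.

```bash
# Phase names taken from the note above.
PHASES="validate estimate run package compare learn"

# Hypothetical generic handler used when a benchmark defines no override.
default_phase() { echo "generic:$1"; }

# Hypothetical overrides: only phases 3-4 differ for this benchmark.
swebench_run()     { echo "swebench:run"; }
swebench_package() { echo "swebench:package"; }

# Runs every phase in order, preferring "<benchmark>_<phase>" when defined.
run_phases() {
  local benchmark="$1" phase handler
  for phase in $PHASES; do
    handler="${benchmark}_${phase}"
    if command -v "$handler" >/dev/null 2>&1; then
      "$handler"
    else
      default_phase "$phase"
    fi
  done
}
```

Calling `run_phases swebench` walks all six phases and uses the overrides only for `run` and `package`.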