gaia-submission

GAIA Submission Skill

Walk Claude Code through every step needed to go from a clean environment to a signed, HAL-compatible submission package ready to upload to the Princeton GAIA leaderboard.

When to use

When the user wants to:
  • Run a benchmark and submit results to the HAL leaderboard
  • Package an existing results file into a submission archive
  • Confirm their environment is ready for a benchmark run

Prerequisites

Before starting, confirm these are available:
Requirement          Check
ANTHROPIC_API_KEY    echo ${ANTHROPIC_API_KEY:0:8}… (should show sk-ant-…)
HF_TOKEN             echo ${HF_TOKEN:0:5}… (should show hf_…)
Node.js 20+          node --version
CLI built            node v3/@claude-flow/cli/bin/cli.js --version
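
The checks in the table can be bundled into one script. The following is a minimal sketch, not part of the skill itself: the function names check_prefix, node_major_at_least, and preflight are hypothetical, and the prefixes are the ones shown in the expected output above. It tests only the prefix of each secret and never prints its value.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeed if the variable named $1 is set and starts with prefix $2.
check_prefix() {
  local name="$1" prefix="$2"
  local value="${!name:-}"          # indirect expansion; empty if the variable is unset
  [[ -n "$value" && "$value" == "$prefix"* ]]
}

# Hypothetical helper: succeed if a version string such as "v20.11.1" has major >= $2.
node_major_at_least() {
  local version="${1#v}"            # strip the leading "v"
  local major="${version%%.*}"      # keep the text before the first dot
  [[ "$major" =~ ^[0-9]+$ ]] && (( major >= $2 ))
}

# Hypothetical wrapper: report every failed prerequisite, return non-zero if any failed.
preflight() {
  local failed=0
  check_prefix ANTHROPIC_API_KEY "sk-ant-" || { echo "ANTHROPIC_API_KEY missing or malformed"; failed=1; }
  check_prefix HF_TOKEN "hf_"              || { echo "HF_TOKEN missing or malformed"; failed=1; }
  node_major_at_least "$(node --version 2>/dev/null)" 20 || { echo "Node.js 20+ not found"; failed=1; }
  node v3/@claude-flow/cli/bin/cli.js --version >/dev/null 2>&1 || { echo "CLI not built"; failed=1; }
  return "$failed"
}
```

Calling preflight prints one line per failed prerequisite, so all problems surface in a single pass rather than one at a time.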

Phase 1 — Validate environment

bash
# Run all pre-flight checks
/gaia validate

If any check fails, resolve it before continuing.

Phase 2 — Estimate cost and confirm

Ask the user for their configuration:
  • Level (default: 1)
  • Question limit (default: 53 for a quick run, 165 for the full L1 set)
  • Models (default: claude-sonnet-4-6)
  • Self-consistency voting (default: 1; use 3 for L2/L3)
bash
/gaia cost --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING
If the projected cost exceeds $5, show the estimate and ask: "This run will cost approximately $X. Proceed? (y/N)"
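
The confirmation rule above can be sketched in shell as follows. This is an illustration under stated assumptions: exceeds_threshold and confirm_run are hypothetical names, and the cost is assumed to arrive as a plain decimal number of dollars (the output format of /gaia cost is not specified here). awk handles the comparison because bash arithmetic is integer-only.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeed if cost $1 (USD, decimal) is strictly above limit $2 (default 5).
exceeds_threshold() {
  awk -v cost="$1" -v limit="${2:-5}" 'BEGIN { exit !(cost + 0 > limit + 0) }'
}

# Hypothetical helper: return 0 to proceed, non-zero to abort.
# Runs at or below the threshold proceed without a prompt.
# Anything other than "y" or "Y" counts as "no", matching the "(y/N)" default.
confirm_run() {
  local cost="$1" reply
  if exceeds_threshold "$cost"; then
    read -r -p "This run will cost approximately \$${cost}. Proceed? (y/N) " reply
    [[ "$reply" =~ ^[Yy]$ ]]
  fi
}
```

A caller might write `confirm_run "$ESTIMATE" || exit 0` before starting the run, where ESTIMATE is whatever value was read from the cost estimate.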

Phase 3 — Run the benchmark

bash
/gaia run --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING
While running, progress is reported every 5 questions:
[12/53] 22.7% (5 passed of 22 scored) — est. remaining: $0.18
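
In the sample line, the percentage appears to be the number passed divided by the number scored (5 of 22), not divided by the 53 questions in the run. Assuming that reading is correct, the figure can be reproduced with a small helper; pass_rate is a hypothetical name, not a command provided by the harness.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print passed/scored as a percentage with one decimal place.
# Prints 0.0 when nothing has been scored yet, to avoid dividing by zero.
pass_rate() {
  local passed="$1" scored="$2"
  if (( scored == 0 )); then
    echo "0.0"
    return
  fi
  awk -v p="$passed" -v s="$scored" 'BEGIN { printf "%.1f\n", 100 * p / s }'
}

pass_rate 5 22   # prints 22.7
```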
Store the run summary in memory for history tracking:
bash
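# Double quotes let the shell expand the variables; inside single quotes the
# literal text $LEVEL, $MODEL, ... would be stored instead of the values.
npx @claude-flow/cli@latest memory store \
  --namespace gaia-runs \
  --key "run-$(date +%Y%m%d-%H%M)" \
  --value "{\"level\":$LEVEL,\"model\":\"$MODEL\",\"total\":$TOTAL,\"passed\":$PASSED,\"pass_rate\":$RATE,\"est_cost_usd\":$COST}"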
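
The JSON value for the run summary can also be assembled with printf, which keeps the quoting readable. This is a sketch only: run_summary_json is a hypothetical helper, the field names are the ones used in the command above, and the numbers in the example call are made up for illustration. It assumes the model name contains no double quotes or backslashes, since nothing is escaped.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the run summary as a single-line JSON object.
# Arguments: level, model, total, passed, pass_rate, est_cost_usd.
run_summary_json() {
  local level="$1" model="$2" total="$3" passed="$4" rate="$5" cost="$6"
  printf '{"level":%s,"model":"%s","total":%s,"passed":%s,"pass_rate":%s,"est_cost_usd":%s}\n' \
    "$level" "$model" "$total" "$passed" "$rate" "$cost"
}

# Illustrative values only, not real results.
run_summary_json 1 claude-sonnet-4-6 53 12 22.6 0.42
```

The output can then be passed as `--value "$(run_summary_json ...)"` to the memory store command.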

Phase 4 — Package for submission

bash
/gaia submit --results=~/.cache/ruflo/gaia/results-latest.json
This produces:
submission-<date>-<sha>/
├── results.jsonl        ← HAL-compatible, one JSON per line
├── trajectories.jsonl   ← full agent traces
├── metadata.json        ← harness info, model, tool catalogue
├── manifest.md.json     ← Ed25519-signed witness
└── README.md            ← human summary + leaderboard comparison
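
Before uploading, the package layout above can be sanity-checked with a short script. This is a sketch under assumptions: check_submission_dir and looks_like_jsonl are hypothetical helpers, the file list is copied from the tree above, and the JSONL test is deliberately rough (it only checks that each non-blank line starts with `{` and ends with `}`, it does not parse JSON or verify the signature).

```bash
#!/usr/bin/env bash

# Hypothetical helper: report every expected file that is absent from directory $1.
check_submission_dir() {
  local dir="$1" missing=0 file
  for file in results.jsonl trajectories.jsonl metadata.json manifest.md.json README.md; do
    if [[ ! -f "$dir/$file" ]]; then
      echo "missing: $file"
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: succeed if every non-blank line of file $1 looks like a JSON object.
# An empty file passes, so pair this with a size check if that matters.
looks_like_jsonl() {
  awk 'NF && !/^[{].*[}][ \t\r]*$/ { bad = 1 } END { exit bad + 0 }' "$1"
}
```

For example, `check_submission_dir "$DIR" && looks_like_jsonl "$DIR/results.jsonl"` succeeds only when all five files exist and the results file has the one-object-per-line shape.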

Phase 5 — Compare and report

bash
/gaia leaderboard --level=$LEVEL
/gaia history
Interpret the gap between ruflo's score and the leaderboard top-10. Identify the primary failure mode (tool gap, reasoning miss, extraction bug) using the /gaia-debugging skill if needed.
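
Once each failed question has been labelled, picking the primary failure mode is a simple tally. The sketch below assumes a hypothetical input file with one label per line (for example tool_gap, reasoning_miss, or extraction_bug); neither that file format nor the primary_failure_mode helper comes from the harness.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the most frequent label in file $1.
# Ties are broken alphabetically by label.
primary_failure_mode() {
  grep -v '^[ \t]*$' "$1" | sort | uniq -c | sort -k1,1nr -k2 | awk 'NR == 1 { print $2 }'
}
```

The count alone does not explain why a category dominates, so it is a starting point for the interpretation step rather than a replacement for it.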

Phase 6 — Persist learnings

bash
npx @claude-flow/cli@latest hooks post-task \
  --task-id "gaia-submission-$(date +%Y%m%d)" \
  --success true \
  --train-neural true
Store any discovered patterns:
bash
npx @claude-flow/cli@latest memory store \
  --namespace gaia-patterns \
  --key "submission-notes-$(date +%Y%m%d)" \
  --value "Level $LEVEL, $MODEL: $NOTES"

Extensibility note

This skill is intentionally structured to be benchmark-agnostic. The phase sequence (validate → estimate → run → package → compare → learn) also applies to SWE-bench, WebArena, and HumanEval; only the details of Phases 3 and 4 change.
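
One way to picture that structure is a driver that walks the six phases in order and lets each benchmark override only the phases that differ. Everything in this sketch is hypothetical: the function names, the default_ and swebench_ prefixes, and the echo stubs stand in for real phase logic and are not part of any existing tool.

```bash
#!/usr/bin/env bash

PHASES=(validate estimate run package compare learn)

# Shared stubs used when a benchmark does not override a phase.
default_validate() { echo "validate"; }
default_estimate() { echo "estimate"; }
default_run()      { echo "run"; }
default_package()  { echo "package"; }
default_compare()  { echo "compare"; }
default_learn()    { echo "learn"; }

# Example overrides: only the run and package phases are benchmark-specific.
swebench_run()     { echo "run (swebench override)"; }
swebench_package() { echo "package (swebench override)"; }

# Run every phase in order for benchmark $1, stopping at the first failure.
run_pipeline() {
  local benchmark="$1" phase handler
  for phase in "${PHASES[@]}"; do
    handler="${benchmark}_${phase}"
    # Fall back to the shared stub when no benchmark-specific function exists.
    declare -F "$handler" >/dev/null || handler="default_${phase}"
    "$handler" || { echo "phase '$phase' failed for $benchmark" >&2; return 1; }
  done
}
```

Running `run_pipeline swebench` uses the two overrides and the four shared stubs; a benchmark with no overrides at all falls through to the defaults for every phase.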