gaia-submission

GAIA Submission Skill

Walk Claude Code through every step needed to go from a clean environment to a signed, HAL-compatible submission package ready to upload to the Princeton GAIA leaderboard.

When to use

When the user wants to:

Run a benchmark and submit results to the HAL leaderboard
Package an existing results file into a submission archive
Confirm their environment is ready for a benchmark run

Prerequisites

Before starting, confirm these are available:

Requirement	Check
`ANTHROPIC_API_KEY`	`echo ${ANTHROPIC_API_KEY:0:8}…` (should show `sk-ant-…`)
`HF_TOKEN`	`echo ${HF_TOKEN:0:5}…` (should show `hf_…`)
Node.js 20+	`node --version`
CLI built	`node v3/@claude-flow/cli/bin/cli.js --version`
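The checks in the table can also be scripted so that no part of a secret is echoed to the terminal. The following is a minimal sketch, not part of the skill itself: `check_prefix` and `node_major` are hypothetical helper names, and the prefixes are taken from the table above.

```bash
# Sketch only: check_prefix and node_major are hypothetical helpers.

check_prefix() {
  # $1 = environment variable name, $2 = expected prefix
  # Reads the variable indirectly and reports its status without printing it.
  eval "cp_value=\${$1:-}"
  case "$cp_value" in
    "")    echo "MISSING $1"; return 1 ;;
    "$2"*) echo "OK $1" ;;
    *)     echo "UNEXPECTED $1"; return 1 ;;
  esac
}

node_major() {
  # Turns a version string such as "v20.11.1" into its major number "20".
  nm_version="${1#v}"
  echo "${nm_version%%.*}"
}

# Example usage (commented out so the sketch has no side effects):
#   check_prefix ANTHROPIC_API_KEY sk-ant-
#   check_prefix HF_TOKEN hf_
#   [ "$(node_major "$(node --version)")" -ge 20 ] || echo "Node.js 20+ required"
```

Only a status word and the variable name are printed, so the output is safe to paste into logs.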

Phase 1 — Validate environment

bash

# Run all pre-flight checks
/gaia validate

If any check fails, resolve it before continuing.


Phase 2 — Estimate cost and confirm

Ask the user for their configuration:

Level (default: 1)
Question limit (default: 53 for a quick run, 165 for the full L1 set)
Models (default: `claude-sonnet-4-6`)
Self-consistency voting (default: 1; use 3 for L2/L3)

bash

/gaia cost --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING

If the projected cost exceeds $5, show the estimate and ask: "This run will cost approximately $X. Proceed? (y/N)"
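The confirmation gate can be sketched as a pair of shell functions. This is an illustration under assumptions: `needs_confirmation` and `confirm_run` are hypothetical names, and the projected cost is assumed to be already available as a plain number, since the output format of `/gaia cost` is not specified here.

```bash
# Sketch only: both function names are hypothetical.

needs_confirmation() {
  # $1 = projected cost in USD, $2 = threshold in USD (defaults to 5)
  # awk handles the floating-point comparison; exit status 0 means "ask first".
  awk -v c="$1" -v t="${2:-5}" 'BEGIN { exit !(c > t) }'
}

confirm_run() {
  # Returns 0 if the run may proceed, 1 if the user declined.
  if needs_confirmation "$1"; then
    printf 'This run will cost approximately $%s. Proceed? (y/N) ' "$1"
    read -r confirm_reply || return 1
    case "$confirm_reply" in
      y|Y) return 0 ;;
      *)   return 1 ;;
    esac
  fi
  return 0
}
```

Anything other than `y` or `Y` counts as a refusal, matching the `(y/N)` default in the prompt. A cost of exactly $5 does not trigger the question, because the rule is "greater than".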

Phase 3 — Run the benchmark

bash

/gaia run --level=$LEVEL --limit=$LIMIT --models=$MODELS --voting=$VOTING

While running, progress is reported every 5 questions:

[12/53] 22.7% (5 passed of 22 scored) — est. remaining: $0.18
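In the sample line, 22.7% is 5 passed divided by 22 scored, so the percentage is computed over scored questions. A small sketch of that arithmetic follows; `pass_rate` is a hypothetical helper name, not a command provided by the skill.

```bash
# Sketch only: pass_rate is a hypothetical helper.

pass_rate() {
  # $1 = questions passed, $2 = questions scored
  # Prints the pass rate as a percentage with one decimal place.
  awk -v p="$1" -v s="$2" 'BEGIN {
    if (s == 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * p / s
  }'
}

pass_rate 5 22   # prints 22.7
```

The zero check avoids a division error before any question has been scored.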

Store the run summary in memory for history tracking:

bash

npx @claude-flow/cli@latest memory store \
  --namespace gaia-runs \
  --key "run-$(date +%Y%m%d-%H%M)" \
  --value "{\"level\":$LEVEL,\"model\":\"$MODEL\",\"total\":$TOTAL,\"passed\":$PASSED,\"pass_rate\":$RATE,\"est_cost_usd\":$COST}"
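Shell variables are not expanded inside single quotes, so the JSON value has to be assembled in a way that lets the run figures through. One option, shown here as a sketch, is a small `printf` wrapper; `run_summary_json` is a hypothetical helper, and the field names are the ones used in the command above.

```bash
# Sketch only: run_summary_json is a hypothetical helper.

run_summary_json() {
  # Args: level model total passed pass_rate est_cost_usd
  # Numeric fields are inserted as-is; the model name is wrapped in quotes.
  # Assumes the model name contains no quotes or backslashes (no JSON escaping is done).
  printf '{"level":%s,"model":"%s","total":%s,"passed":%s,"pass_rate":%s,"est_cost_usd":%s}\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example usage (commented out so the sketch has no side effects):
#   --value "$(run_summary_json "$LEVEL" "$MODEL" "$TOTAL" "$PASSED" "$RATE" "$COST")"
```

Keeping the template in single quotes and passing values as arguments avoids backslash-escaping every inner double quote.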

Phase 4 — Package for submission

bash

/gaia submit --results=~/.cache/ruflo/gaia/results-latest.json

This produces:

submission-<date>-<sha>/
├── results.jsonl        ← HAL-compatible, one JSON per line
├── trajectories.jsonl   ← full agent traces
├── metadata.json        ← harness info, model, tool catalogue
├── manifest.md.json     ← Ed25519-signed witness
└── README.md            ← human summary + leaderboard comparison
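Before uploading, it can be worth confirming that all five files listed above are present. The sketch below checks only for file existence; `check_package` is a hypothetical helper, and it does not verify the Ed25519 signature or the JSONL contents.

```bash
# Sketch only: check_package is a hypothetical helper.

check_package() {
  # $1 = path to a submission-<date>-<sha>/ directory
  # Prints one line per missing file and returns non-zero if any is absent.
  pkg_missing=0
  for pkg_file in results.jsonl trajectories.jsonl metadata.json manifest.md.json README.md; do
    if [ ! -f "$1/$pkg_file" ]; then
      echo "missing: $pkg_file"
      pkg_missing=1
    fi
  done
  return "$pkg_missing"
}
```

A silent run with exit status 0 means the directory has the expected layout.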

Phase 5 — Compare and report

bash

/gaia leaderboard --level=$LEVEL
/gaia history

Interpret the gap between ruflo's score and the leaderboard top-10. Identify the primary failure mode (tool gap, reasoning miss, extraction bug) using the `/gaia-debugging` skill if needed.

Phase 6 — Persist learnings

bash

npx @claude-flow/cli@latest hooks post-task \
  --task-id "gaia-submission-$(date +%Y%m%d)" \
  --success true \
  --train-neural true

Store any discovered patterns:

bash

npx @claude-flow/cli@latest memory store \
  --namespace gaia-patterns \
  --key "submission-notes-$(date +%Y%m%d)" \
  --value "Level $LEVEL, $MODEL: $NOTES"

Extensibility note

This skill is intentionally structured to be benchmark-agnostic. The phase sequence (validate → estimate → run → package → compare → learn) applies equally to SWE-bench, WebArena, and HumanEval; only the details of Phases 3 and 4 change.
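The shared phase sequence could be expressed as a generic driver that takes the benchmark name and a per-phase handler. This is a sketch of the idea only: `run_phases` and `print_phase` are hypothetical names, and a real handler would dispatch to benchmark-specific commands.

```bash
# Sketch only: run_phases and print_phase are hypothetical.

PHASES="validate estimate run package compare learn"

run_phases() {
  # $1 = benchmark name, $2 = handler function taking (benchmark, phase)
  # Runs each phase in order and stops at the first one that fails.
  for rp_phase in $PHASES; do
    "$2" "$1" "$rp_phase" || { echo "stopped at $rp_phase" >&2; return 1; }
  done
}

print_phase() {
  # Placeholder handler: only reports which phase would run.
  echo "$1:$2"
}

run_phases gaia print_phase
```

Swapping `print_phase` for a handler that branches on the benchmark name in the run and package phases would leave the other four phases untouched.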
