gaia-architecture-comparison
GAIA Architecture Comparison Skill
Compare ruflo's GAIA benchmark harness against the Princeton HAL reference
implementation and other open-source harnesses to understand capability gaps
and prioritize improvements.
When to use
- Planning the next iteration of GAIA work
- Evaluating which architectural change has the highest pass-rate ROI
- Onboarding a new contributor to the benchmark codebase
Architecture overview
ruflo harness (current)
gaia-bench run
└─ gaia-loader.ts    HF dataset download + cache
└─ gaia-agent.ts     multi-turn Anthropic Messages loop
└─ gaia-tools/       web_search, file_read, web_browse,
                     image_describe, python_exec
└─ gaia-voting.ts    Track A self-consistency (N attempts → majority vote)
└─ gaia-hardness/    Track Q difficulty predictor (ADR-136)
└─ gaia-judge.ts     two-stage LLM-as-judge scorer
HAL reference (Princeton)
HAL uses a similar loop but with:
- OpenAI function calling as the tool interface
- BrowserBase / Playwright for real browser automation
- Code interpreter sandbox (Jupyter kernel)
- Larger token budget per turn (4096+)
- Full 300-question evaluation set
Key differences
| Dimension | ruflo | HAL reference | Gap |
|---|---|---|---|
| Question count | 53 (partial L1) | 300 (full L1) | Use |
| Web search | DuckDuckGo / Google CSE | BrowserBase live | Add Playwright or Browserless |
| Code execution | python_exec stub | Real Jupyter kernel | Implement real sandbox |
| Image OCR | image_describe (Gemini) | GPT-4V / Gemini | Functionally equivalent |
| File handling | file_read | Full PDF/XLSX/ZIP parser | Expand file_read |
| Self-consistency | voting.ts (Track A) | Not in reference | ruflo advantage |
| Hardness routing | predictor.ts (Track Q) | Not in reference | ruflo advantage |
| Memory | AgentDB HNSW | None | ruflo advantage |
| Pass-rate L1 | ~20.8% (iter 23) | 74.6% (HAL Sonnet 4.5) | ~54 pp gap |
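The pass-rate row can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the count of 11 correct answers is an assumption chosen because it reproduces the quoted ~20.8% on 53 questions; the source does not state the raw count.

```typescript
// Pass rate as a percentage, rounded to one decimal place.
function passRatePct(correct: number, total: number): number {
  if (total <= 0) throw new Error("total must be positive");
  return Math.round((correct / total) * 1000) / 10;
}

// Gap between a reference pass rate and our own, in percentage points (pp).
function gapPp(referencePct: number, ownPct: number): number {
  return Math.round((referencePct - ownPct) * 10) / 10;
}

// ASSUMPTION: 11 correct out of 53; the source only reports "~20.8%".
const rufloPct = passRatePct(11, 53);
// 74.6 is the HAL Sonnet 4.5 figure quoted in the table.
const gap = gapPp(74.6, rufloPct);

console.log(`${rufloPct}% vs 74.6%: ${gap} pp gap`);
```

Under that assumption the computed gap is 53.8 pp, which the table rounds to ~54 pp.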
Gap analysis
Primary gaps (high impact)
- Real code execution: many L2/L3 questions require running Python to compute a numerical answer. The current python_exec tool is a stub. Implementing a real sandbox (E2B, Pyodide, or subprocess) is the single highest-ROI change.
- Full question set: running only 53/300 L1 questions does not give a representative pass rate because the first 53 skew easier. Run --limit 165 (full L1) for a score comparable to HAL's.
- Real browser: web_browse currently fetches raw HTML. Replacing it with Playwright/Browserless for JavaScript-rendered pages would unlock many web navigation questions.
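Of the sandbox options listed for python_exec, the subprocess route is the simplest to sketch. The following is a minimal illustration, not the ruflo implementation: the `pythonExec` name, the `ExecOptions` fields, and the assumption that `python3` is on PATH are all hypothetical. A bare subprocess also provides no real isolation (no filesystem or network restrictions), so it should be read as a stepping stone toward something like E2B rather than a finished sandbox.

```typescript
import { spawnSync } from "node:child_process";

interface ExecResult {
  ok: boolean;       // true when the process exited with status 0
  stdout: string;
  stderr: string;
  timedOut: boolean; // true when the timeout killed the process
}

interface ExecOptions {
  interpreter?: string; // ASSUMPTION: "python3" is available on PATH
  codeFlag?: string;    // flag that makes the interpreter run a code string
  timeoutMs?: number;   // wall-clock limit for one execution
}

// Hypothetical replacement for the python_exec stub: run a code string in a
// child process with a timeout and a capped output buffer.
function pythonExec(code: string, opts: ExecOptions = {}): ExecResult {
  const { interpreter = "python3", codeFlag = "-c", timeoutMs = 10_000 } = opts;
  const res = spawnSync(interpreter, [codeFlag, code], {
    encoding: "utf8",
    timeout: timeoutMs,
    maxBuffer: 1024 * 1024, // cap captured output at 1 MiB
  });
  const timedOut =
    res.error !== undefined &&
    (res.error as { code?: string }).code === "ETIMEDOUT";
  return {
    ok: res.status === 0 && res.error === undefined,
    stdout: res.stdout ?? "",
    stderr: res.stderr ?? "",
    timedOut,
  };
}
```

A call such as `pythonExec("print(6 * 7)")` would return the captured stdout for the agent to read. The interpreter is parameterized so the same wrapper can later point at a containerized or remote runner.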
Secondary gaps (medium impact)
- Structured file parsing: PDF, XLSX, and ZIP attachments require dedicated parsers. file_read currently handles plain text and images only.
- Turn budget: 12 turns may be insufficient for complex multi-step questions. HAL uses up to 20 turns for L3.
- System prompt tuning: HAL's system prompt is more elaborate and explicitly instructs the model to use tools before answering.
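One way to approach the structured file parsing gap is to dispatch on the attachment's extension before choosing a parser. This is a sketch under assumptions: `ParserKind`, `EXTENSION_PARSERS`, and `pickParser` are hypothetical names, the extension list is illustrative, and the actual parsing libraries are left out entirely.

```typescript
// Parser families named in the gap analysis, plus the two file_read handles today.
type ParserKind = "pdf" | "xlsx" | "zip" | "image" | "text";

// ASSUMPTION: illustrative extension list, not the harness's real mapping.
const EXTENSION_PARSERS: Record<string, ParserKind> = {
  ".pdf": "pdf",
  ".xlsx": "xlsx",
  ".zip": "zip",
  ".png": "image",
  ".jpg": "image",
  ".jpeg": "image",
};

// Choose a parser family from the file name; unknown types fall back to text.
function pickParser(filename: string): ParserKind {
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot).toLowerCase() : "";
  return EXTENSION_PARSERS[ext] ?? "text";
}
```

Falling back to plain text keeps the current file_read behavior for anything the table does not recognize.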
ruflo advantages
- Self-consistency voting (Track A): running N attempts per question and taking the majority answer reduces variance on borderline questions. HAL does not implement this.
- Hardness routing (Track Q): routing each question to an appropriate model and turn budget based on predicted difficulty. This reduces cost on easy questions while providing more resources for hard ones.
- AgentDB memory: storing patterns across runs enables the agent to recall successful strategies for similar question types.
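The self-consistency idea (N attempts, then a majority vote) can be sketched as follows. This is not the code in gaia-voting.ts: the names are hypothetical, and both the answer normalization and the tie-breaking rule (earliest answer wins) are assumptions made for the example.

```typescript
interface VoteResult {
  answer: string; // normalized winning answer
  votes: number;  // number of attempts that agreed on it
  total: number;  // number of attempts considered
}

// ASSUMPTION: answers match after trimming, lowercasing, and collapsing whitespace.
function normalize(answer: string): string {
  return answer.trim().toLowerCase().replace(/\s+/g, " ");
}

// Count normalized answers and return the most frequent one.
function majorityVote(attempts: string[]): VoteResult {
  if (attempts.length === 0) throw new Error("need at least one attempt");
  const counts = new Map<string, number>();
  for (const raw of attempts) {
    const key = normalize(raw);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let best = { answer: "", votes: 0 };
  // Map preserves insertion order, so a tie goes to the earliest answer seen.
  counts.forEach((votes, answer) => {
    if (votes > best.votes) best = { answer, votes };
  });
  return { answer: best.answer, votes: best.votes, total: attempts.length };
}

console.log(majorityVote(["Paris", " paris", "London"]));
```

The ratio `votes / total` could also serve as a rough confidence signal, for example to decide whether a question deserves extra attempts.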
Improvement roadmap
| Priority | Change | Expected Lift | Effort |
|---|---|---|---|
| P0 | Real python_exec sandbox (E2B) | +15-25 pp | High |
| P0 | Full 165-Q L1 evaluation | Accurate baseline | Low |
| P1 | Playwright-based web_browse | +5-10 pp | Medium |
| P1 | PDF/XLSX file parser | +3-8 pp | Medium |
| P2 | Increase max-turns to 20 for L2/L3 | +2-5 pp | Low |
| P2 | System prompt tuning (iter 30 research) | +2-5 pp | Low |
| P3 | Google Grounding via Gemini (iter 32) | +3-7 pp | Medium |
| P3 | Multi-provider routing (Gemini Flash for cheap Q's) | Cost reduction | Medium |
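The routing rows in the roadmap combine with the hardness predictor roughly as sketched below. Everything here is an assumption for illustration: the predictor is taken to return a score in [0, 1], the thresholds are arbitrary, and the model identifiers are placeholders rather than real provider names. Only the 12-turn and 20-turn budgets come from the text above.

```typescript
interface Route {
  model: string;    // placeholder identifier, not a real provider model name
  maxTurns: number; // turn budget for the agent loop
}

// ASSUMPTION: placeholder tiers standing in for a cheap and a strong model.
const CHEAP_MODEL = "cheap-model";
const STRONG_MODEL = "strong-model";

// Map a predicted difficulty in [0, 1] to a model and turn budget.
function routeByHardness(predicted: number): Route {
  if (!(predicted >= 0 && predicted <= 1)) {
    throw new RangeError("predicted difficulty must be in [0, 1]");
  }
  // ASSUMPTION: thresholds are illustrative and would need tuning on real runs.
  if (predicted < 0.33) return { model: CHEAP_MODEL, maxTurns: 12 };
  if (predicted < 0.66) return { model: STRONG_MODEL, maxTurns: 12 };
  return { model: STRONG_MODEL, maxTurns: 20 };
}

console.log(routeByHardness(0.1), routeByHardness(0.9));
```

Keeping the policy in one pure function makes it easy to compare cost and pass rate across threshold settings without touching the agent loop.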
Loading context from past research
bash
npx @claude-flow/cli@latest memory search \
--namespace gaia-patterns \
--query "architecture comparison HAL benchmark"
Storing comparison findings
bash
npx @claude-flow/cli@latest memory store \
--namespace gaia-patterns \
--key "architecture-comparison-$(date +%Y%m%d)" \
--value "HAL gap: 54pp. Primary: python_exec stub. Secondary: browser, file parsing."