gaia-architecture-comparison

GAIA Architecture Comparison Skill

Compare ruflo's GAIA benchmark harness against the Princeton HAL reference implementation and other open-source harnesses to understand capability gaps and prioritize improvements.

When to use

  • Planning the next iteration of GAIA work
  • Evaluating which architectural change has the highest pass-rate ROI
  • Onboarding a new contributor to the benchmark codebase

Architecture overview

ruflo harness (current)

gaia-bench run
  └─ gaia-loader.ts      — HF dataset download + cache
  └─ gaia-agent.ts       — multi-turn Anthropic Messages loop
       └─ gaia-tools/    — web_search, file_read, web_browse,
                           image_describe, python_exec
  └─ gaia-voting.ts      — Track A self-consistency (N attempts → majority vote)
  └─ gaia-hardness/      — Track Q difficulty predictor (ADR-136)
  └─ gaia-judge.ts       — two-stage LLM-as-judge scorer

HAL reference (Princeton)

HAL uses a similar loop but with:
  • OpenAI function calling as the tool interface
  • BrowserBase / Playwright for real browser automation
  • Code interpreter sandbox (Jupyter kernel)
  • Larger token budget per turn (4096+)
  • Full 300-question evaluation set

Key differences

| Dimension | ruflo | HAL reference | Gap |
|---|---|---|---|
| Question count | 53 (partial L1) | 300 (full L1) | Use --limit 165 for full L1 |
| Web search | DuckDuckGo / Google CSE | BrowserBase live | Add Playwright or Browserless |
| Code execution | python_exec stub | Real Jupyter kernel | Implement real sandbox |
| Image OCR | image_describe (Gemini) | GPT-4V / Gemini | Functionally equivalent |
| File handling | file_read | Full PDF/XLSX/ZIP parser | Expand file_read |
| Self-consistency | voting.ts (Track A) | Not in reference | ruflo advantage |
| Hardness routing | predictor.ts (Track Q) | Not in reference | ruflo advantage |
| Memory | AgentDB HNSW | None | ruflo advantage |
| Pass-rate L1 | ~20.8% (iter 23) | 74.6% (HAL Sonnet 4.5) | ~54 pp gap |
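
The ~20.8% figure comes from a small sample, so its uncertainty is wide. The sketch below puts a Wilson score interval around it. It assumes that 20.8% corresponds to 11 correct answers out of 53 (the source reports only the percentage), and the helper name is illustrative rather than part of the ruflo harness.

```typescript
// Wilson score interval for a binomial pass rate.
// Assumption: 11 passes out of 53 questions (~20.8%); the source gives only the percentage.
function wilsonInterval(passes: number, total: number, z = 1.96): [number, number] {
  if (total <= 0) throw new Error("total must be positive");
  const p = passes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return [center - half, center + half];
}

const [lo, hi] = wilsonInterval(11, 53);
console.log(`95% interval: ${(lo * 100).toFixed(1)}% to ${(hi * 100).toFixed(1)}%`);
```

Under that assumption the interval spans roughly 12% to 33%, which is one reason a larger evaluation set is needed before comparing against the HAL number.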

Gap analysis

Primary gaps (high impact)

  1. Real code execution: many L2/L3 questions require running Python to compute a numerical answer. The current python_exec tool is a stub. Implementing a real sandbox (E2B, Pyodide, or subprocess) is the single highest-ROI change.
  2. Full question set: running 53/300 L1 questions misstates the true pass rate because the first 53 skew easier, which would tend to inflate it. Run --limit 165 (full L1) for a score comparable to HAL's.
  3. Real browser: web_browse currently fetches raw HTML. Replacing it with Playwright/Browserless for JavaScript-rendered pages would unlock many web navigation questions.
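
Of the three sandbox options named above, the subprocess route is the simplest to sketch. The example below uses Node's `spawnSync` with a hard timeout and an output cap. `runSnippet` and `ExecResult` are hypothetical names, not the actual python_exec interface, and a bare child process gives process isolation only, not the security guarantees of E2B or a container.

```typescript
import { spawnSync } from "node:child_process";

interface ExecResult {
  stdout: string;
  stderr: string;
  exitCode: number | null; // null when the process was killed or never started
  timedOut: boolean;
}

// Hypothetical helper: runs a code snippet in a child process with a hard timeout.
// Assumption: the interpreter accepts the snippet as its final argument (e.g. python3 -c).
function runSnippet(
  interpreter: string,
  args: string[],
  code: string,
  timeoutMs = 10_000,
): ExecResult {
  const result = spawnSync(interpreter, [...args, code], {
    encoding: "utf8",
    timeout: timeoutMs,
    maxBuffer: 1024 * 1024, // cap captured output at 1 MiB
  });
  const err = result.error as NodeJS.ErrnoException | undefined;
  return {
    // stdout/stderr are null if the interpreter could not be started
    stdout: result.stdout ?? "",
    stderr: result.stderr ?? "",
    exitCode: result.status,
    timedOut: err?.code === "ETIMEDOUT",
  };
}

// Example call, assuming python3 is on the PATH.
const out = runSnippet("python3", ["-c"], "print(6 * 7)");
console.log(out.stdout.trim());
```

Returning stderr and a timeout flag instead of throwing lets the agent loop pass the failure back to the model as a tool result, so it can retry with corrected code.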

Secondary gaps (medium impact)

  1. Structured file parsing: PDF, XLSX, and ZIP attachments require dedicated parsers. file_read currently handles plain text and images only.
  2. Turn budget: 12 turns may be insufficient for complex multi-step questions. HAL uses up to 20 turns for L3.
  3. System prompt tuning: HAL's system prompt is more elaborate and explicitly instructs the model to use tools before answering.
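
The turn-budget point can be expressed as a small per-level lookup. This is a sketch only: the names `MAX_TURNS_BY_LEVEL` and `maxTurnsFor` are hypothetical, and the values come from this document (12 as the current budget, 20 for L2/L3) rather than from any measured optimum.

```typescript
type GaiaLevel = 1 | 2 | 3;

// Assumed values: 12 is the current budget mentioned above; 20 mirrors the
// HAL ceiling for L3 and is applied to L2 as well.
const MAX_TURNS_BY_LEVEL: Record<GaiaLevel, number> = { 1: 12, 2: 20, 3: 20 };

function maxTurnsFor(level: GaiaLevel, override?: number): number {
  // An explicit override (for example from a CLI flag) wins over the per-level default.
  if (override !== undefined) {
    if (!Number.isInteger(override) || override < 1) {
      throw new RangeError("max turns must be a positive integer");
    }
    return override;
  }
  return MAX_TURNS_BY_LEVEL[level];
}

console.log(maxTurnsFor(3));
```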

ruflo advantages

  1. Self-consistency voting (Track A): running N attempts per question and taking the majority answer reduces variance on borderline questions. HAL does not implement this.
  2. Hardness routing (Track Q): each question is routed to an appropriate model and turn budget based on its predicted difficulty. This reduces cost on easy questions while giving hard ones more resources.
  3. AgentDB memory: storing patterns across runs lets the agent recall successful strategies for similar question types.
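
The majority-vote step of Track A can be sketched as follows. This is an illustration, not the contents of gaia-voting.ts: the normalization rule (trim, lowercase, collapse whitespace) and the tie-break (earliest answer wins) are assumptions.

```typescript
// Normalize answers so trivially different strings count as the same vote.
// Assumption: trim + lowercase + collapse whitespace; the real harness may differ.
function normalizeAnswer(answer: string): string {
  return answer.trim().toLowerCase().replace(/\s+/g, " ");
}

// Returns the most frequent normalized answer, or undefined for no attempts.
function majorityVote(attempts: string[]): string | undefined {
  const counts = new Map<string, number>();
  for (const attempt of attempts) {
    const key = normalizeAnswer(attempt);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let best: string | undefined;
  let bestCount = 0;
  // Map iterates in insertion order, so the strict ">" keeps the earliest answer on ties.
  counts.forEach((count, key) => {
    if (count > bestCount) {
      best = key;
      bestCount = count;
    }
  });
  return best;
}

console.log(majorityVote(["Paris", "paris ", "Lyon"])); // prints "paris"
```

A Map is used instead of a plain object because object keys that look like integers are iterated in numeric order, which would break the first-seen tie-break for numeric answers.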

Improvement roadmap

| Priority | Change | Expected lift | Effort |
|---|---|---|---|
| P0 | Real python_exec sandbox (E2B) | +15-25 pp | High |
| P0 | Full 165-Q L1 evaluation | Accurate baseline | Low |
| P1 | Playwright-based web_browse | +5-10 pp | Medium |
| P1 | PDF/XLSX file parser | +3-8 pp | Medium |
| P2 | Increase max-turns to 20 for L2/L3 | +2-5 pp | Low |
| P2 | System prompt tuning (iter 30 research) | +2-5 pp | Low |
| P3 | Google Grounding via Gemini (iter 32) | +3-7 pp | Medium |
| P3 | Multi-provider routing (Gemini Flash for cheap Q's) | Cost reduction | Medium |
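
To see what the roadmap implies in aggregate, the numeric lifts can be summed against the current baseline. This is back-of-the-envelope arithmetic under two strong assumptions that the source does not establish: the lifts add independently, and the ~20.8% baseline carries over to the full question set. `projectRange` is a hypothetical helper.

```typescript
// Lift ranges (percentage points) copied from the roadmap rows that list a numeric lift.
const lifts: Array<[string, number, number]> = [
  ["Real python_exec sandbox (E2B)", 15, 25],
  ["Playwright-based web_browse", 5, 10],
  ["PDF/XLSX file parser", 3, 8],
  ["Increase max-turns to 20 for L2/L3", 2, 5],
  ["System prompt tuning", 2, 5],
  ["Google Grounding via Gemini", 3, 7],
];

// Assumption: lifts are additive and independent, which likely overstates the total.
function projectRange(
  baselinePct: number,
  rows: Array<[string, number, number]>,
): [number, number] {
  let low = baselinePct;
  let high = baselinePct;
  for (const [, minLift, maxLift] of rows) {
    low += minLift;
    high += maxLift;
  }
  // A pass rate cannot exceed 100%.
  return [Math.min(low, 100), Math.min(high, 100)];
}

const [low, high] = projectRange(20.8, lifts);
console.log(`projected pass rate: ${low.toFixed(1)}% to ${high.toFixed(1)}%`);
```

The listed lifts sum to +30 pp at the low end and +60 pp at the high end, so the additive projection is about 50.8% to 80.8%. Treat it as an optimistic ceiling rather than a forecast.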

Loading context from past research

bash
npx @claude-flow/cli@latest memory search \
  --namespace gaia-patterns \
  --query "architecture comparison HAL benchmark"

Storing comparison findings

bash
npx @claude-flow/cli@latest memory store \
  --namespace gaia-patterns \
  --key "architecture-comparison-$(date +%Y%m%d)" \
  --value "HAL gap: 54pp. Primary: python_exec stub. Secondary: browser, file parsing."