gaia-architecture-comparison

GAIA Architecture Comparison Skill

Compare ruflo's GAIA benchmark harness against the Princeton HAL reference implementation and other open-source harnesses to understand capability gaps and prioritize improvements.

When to use

Planning the next iteration of GAIA work
Evaluating which architectural change has the highest pass-rate ROI
Onboarding a new contributor to the benchmark codebase

Architecture overview

ruflo harness (current)

gaia-bench run
  └─ gaia-loader.ts      — HF dataset download + cache
  └─ gaia-agent.ts       — multi-turn Anthropic Messages loop
       └─ gaia-tools/    — web_search, file_read, web_browse,
                           image_describe, python_exec
  └─ gaia-voting.ts      — Track A self-consistency (N attempts → majority vote)
  └─ gaia-hardness/      — Track Q difficulty predictor (ADR-136)
  └─ gaia-judge.ts       — two-stage LLM-as-judge scorer
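
The Track A step in the tree above (N attempts, then a majority vote) can be illustrated with a minimal sketch. This is not the contents of `gaia-voting.ts`; the `normalize` and `majorityVote` names and the tie-breaking rule are assumptions made for illustration.

```typescript
// Hypothetical helper, not taken from gaia-voting.ts.
// Normalizes answers so trivial formatting differences do not split the vote.
function normalize(answer: string): string {
  return answer.trim().toLowerCase().replace(/\s+/g, " ");
}

// Returns the most frequent normalized answer.
// Ties go to the answer that appeared first (Map preserves insertion order).
function majorityVote(attempts: string[]): string | undefined {
  const counts = new Map<string, number>();
  for (const raw of attempts) {
    const key = normalize(raw);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let best: string | undefined;
  let bestCount = 0;
  for (const [key, count] of counts) {
    if (count > bestCount) {
      best = key;
      bestCount = count;
    }
  }
  return best;
}
```

With three attempts such as `["Paris", " paris", "London"]`, the two spellings of Paris collapse into one bucket and win the vote. An empty attempt list returns `undefined`, which a caller would need to treat as "no answer".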

HAL reference (Princeton)

HAL uses a similar loop but with:

OpenAI function calling as the tool interface
BrowserBase / Playwright for real browser automation
Code interpreter sandbox (Jupyter kernel)
Larger token budget per turn (4096+)
Full 300-question evaluation set

Key differences

Dimension	ruflo	HAL reference	Gap
Question count	53 (partial L1)	300 (full L1)	Use `--limit 165` for full L1
Web search	DuckDuckGo / Google CSE	BrowserBase live	Add Playwright or Browserless
Code execution	python_exec stub	Real Jupyter kernel	Implement real sandbox
Image OCR	image_describe (Gemini)	GPT-4V / Gemini	Functionally equivalent
File handling	file_read	Full PDF/XLSX/ZIP parser	Expand file_read
Self-consistency	voting.ts (Track A)	Not in reference	ruflo advantage
Hardness routing	predictor.ts (Track Q)	Not in reference	ruflo advantage
Memory	AgentDB HNSW	None	ruflo advantage
Pass-rate L1	~20.8% (iter 23)	74.6% (HAL Sonnet 4.5)	~54 pp gap

Gap analysis

Primary gaps (high impact)

Real code execution — many L2/L3 questions require running Python to compute a numerical answer. The current `python_exec` tool is a stub. Implementing a real sandbox (E2B, Pyodide, or subprocess) is the single highest-ROI change.
Full question set — running 53/300 L1 questions underestimates true pass-rate because the first 53 skew easier. Run `--limit 165` (full L1) for a comparable HAL score.
Real browser — `web_browse` currently fetches raw HTML. Replacing it with Playwright/Browserless for JavaScript-rendered pages would unlock many web navigation questions.
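
Of the three sandbox options named above, the subprocess route is the simplest to sketch. The following is a minimal illustration, not ruflo's implementation: `ExecResult`, `runWithTimeout`, and `pythonExec` are hypothetical names, the 10-second timeout and 1 MiB output cap are arbitrary, and it assumes a `python3` binary on `PATH`. A plain subprocess gives no filesystem or network isolation, so it is only a starting point compared with E2B or Pyodide.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical result shape for a python_exec-style tool.
interface ExecResult {
  stdout: string;
  stderr: string;
  exitCode: number | null; // null when the process was killed or never started
  timedOut: boolean;
}

// Runs a command, feeds `stdin` to it, and enforces a wall-clock timeout.
function runWithTimeout(
  command: string,
  args: string[],
  stdin: string,
  timeoutMs = 10_000,
): ExecResult {
  const result = spawnSync(command, args, {
    input: stdin,
    encoding: "utf8",
    timeout: timeoutMs,
    maxBuffer: 1024 * 1024, // cap captured output at 1 MiB
  });
  const err = result.error as NodeJS.ErrnoException | undefined;
  return {
    stdout: result.stdout ?? "",
    stderr: result.stderr ?? "",
    exitCode: result.status,
    timedOut: err?.code === "ETIMEDOUT",
  };
}

// Assumption: python3 is installed; "-" makes it read the program from stdin.
function pythonExec(code: string): ExecResult {
  return runWithTimeout("python3", ["-"], code);
}
```

Returning `stderr` and the timeout flag to the model, rather than throwing, lets the agent loop see a failed run and retry with corrected code.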

Secondary gaps (medium impact)

Structured file parsing — PDF, XLSX, and ZIP attachments require dedicated parsers. `file_read` currently handles plain text and images only.
Turn budget — 12 turns may be insufficient for complex multi-step questions. HAL uses up to 20 turns for L3.
System prompt tuning — HAL's system prompt is more elaborate and explicitly instructs the model to use tools before answering.
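
One way to expand `file_read` for the structured formats listed above is to dispatch on file extension before choosing a parser. The sketch below only shows that dispatch step; the `ParserKind` type, the extension table, and `pickParser` are hypothetical, and the actual PDF, XLSX, and ZIP parsing would come from libraries not shown here.

```typescript
// Hypothetical parser categories for an expanded file_read tool.
type ParserKind = "text" | "image" | "pdf" | "spreadsheet" | "archive" | "unsupported";

// Assumed mapping; the extensions a real harness needs may differ.
const PARSER_BY_EXTENSION: Record<string, ParserKind> = {
  txt: "text",
  md: "text",
  csv: "text",
  json: "text",
  png: "image",
  jpg: "image",
  jpeg: "image",
  pdf: "pdf",
  xlsx: "spreadsheet",
  zip: "archive",
};

// Picks a parser category from the file name, case-insensitively.
function pickParser(fileName: string): ParserKind {
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return "unsupported"; // no extension, or a dotfile such as ".env"
  const ext = fileName.slice(dot + 1).toLowerCase();
  return PARSER_BY_EXTENSION[ext] ?? "unsupported";
}
```

Returning an explicit `"unsupported"` category lets the tool tell the model that the attachment could not be read, instead of passing raw bytes into the context.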

ruflo advantages

Self-consistency voting (Track A) — running N attempts per question and taking the majority answer reduces variance on borderline questions. HAL does not implement this.
Hardness routing (Track Q) — routing each question to an appropriate model and turn budget based on predicted difficulty. This reduces cost on easy questions while providing more resources for hard ones.
AgentDB memory — storing patterns across runs enables the agent to recall successful strategies for similar question types.
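
The hardness-routing idea can be sketched as a function from a predicted difficulty score to a run plan. Everything specific here is an assumption: the score range of 0 to 1, the 0.5 threshold, the placeholder model names, and the attempt counts are illustrative only. The 12 and 20 turn budgets are the figures quoted in this document.

```typescript
// Hypothetical plan produced by a Track Q style router.
interface RoutePlan {
  model: string;    // placeholder identifiers, not real model names
  maxTurns: number; // turn budget for the agent loop
  attempts: number; // N for Track A self-consistency voting
}

// Maps a predicted difficulty in [0, 1] to a plan.
// The 0.5 threshold is an arbitrary illustration, not a tuned value.
function routeByDifficulty(predicted: number): RoutePlan {
  if (!(predicted >= 0 && predicted <= 1)) {
    throw new RangeError(`difficulty must be in [0, 1], got ${predicted}`);
  }
  if (predicted < 0.5) {
    return { model: "cheap-model", maxTurns: 12, attempts: 1 };
  }
  return { model: "strong-model", maxTurns: 20, attempts: 3 };
}
```

Easy questions get one attempt on the cheaper model, while hard ones get the larger turn budget plus several attempts that feed the majority vote.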

Improvement roadmap

Priority	Change	Expected Lift	Effort
P0	Real python_exec sandbox (E2B)	+15-25 pp	High
P0	Full 165-Q L1 evaluation	Accurate baseline	Low
P1	Playwright-based web_browse	+5-10 pp	Medium
P1	PDF/XLSX file parser	+3-8 pp	Medium
P2	Increase max-turns to 20 for L2/L3	+2-5 pp	Low
P2	System prompt tuning (iter 30 research)	+2-5 pp	Low
P3	Google Grounding via Gemini (iter 32)	+3-7 pp	Medium
P3	Multi-provider routing (Gemini Flash for cheap Q's)	Cost reduction	Medium
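
As a rough sanity check, the expected-lift ranges in the table can be summed and compared with the ~54 pp gap. The sketch below assumes the lifts are independent and additive, which is optimistic since several items target overlapping questions; the variable names are illustrative.

```typescript
// [change, low pp, high pp], copied from the roadmap rows that state a numeric lift.
const expectedLiftPp: Array<[string, number, number]> = [
  ["Real python_exec sandbox (E2B)", 15, 25],
  ["Playwright-based web_browse", 5, 10],
  ["PDF/XLSX file parser", 3, 8],
  ["Increase max-turns to 20 for L2/L3", 2, 5],
  ["System prompt tuning", 2, 5],
  ["Google Grounding via Gemini", 3, 7],
];

// Sum the low and high ends separately (assumes additive, independent lifts).
const totalLow = expectedLiftPp.reduce((sum, [, low]) => sum + low, 0);
const totalHigh = expectedLiftPp.reduce((sum, [, , high]) => sum + high, 0);

console.log(`Combined expected lift: +${totalLow} to +${totalHigh} pp`);
// prints "Combined expected lift: +30 to +60 pp"
```

Under that assumption, closing the full gap would require most items to land near the top of their ranges.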

Loading context from past research

```bash
npx @claude-flow/cli@latest memory search \
  --namespace gaia-patterns \
  --query "architecture comparison HAL benchmark"
```

Storing comparison findings

存储对比结果

```bash
npx @claude-flow/cli@latest memory store \
  --namespace gaia-patterns \
  --key "architecture-comparison-$(date +%Y%m%d)" \
  --value "HAL gap: 54pp. Primary: python_exec stub. Secondary: browser, file parsing."
```
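
If the store step is driven from a TypeScript script rather than the shell, the dated key can be built without relying on `$(date +%Y%m%d)`. This is a sketch under assumptions: `comparisonKey` and `memoryStoreArgs` are hypothetical helpers, and the argument list only mirrors the flags used in the command above.

```typescript
// Mirrors `architecture-comparison-$(date +%Y%m%d)` using local time,
// which is what date(1) uses by default.
function comparisonKey(now: Date = new Date()): string {
  const yyyy = now.getFullYear().toString().padStart(4, "0");
  const mm = (now.getMonth() + 1).toString().padStart(2, "0");
  const dd = now.getDate().toString().padStart(2, "0");
  return `architecture-comparison-${yyyy}${mm}${dd}`;
}

// Builds the argument vector for `npx`, reusing only the flags shown above.
function memoryStoreArgs(value: string, now?: Date): string[] {
  return [
    "@claude-flow/cli@latest",
    "memory",
    "store",
    "--namespace",
    "gaia-patterns",
    "--key",
    comparisonKey(now),
    "--value",
    value,
  ];
}
```

Passing the result to `execFileSync("npx", memoryStoreArgs(...))` from `node:child_process` avoids shell quoting problems when the stored value contains quotes or colons.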
