Side-by-side comparison of ruflo vs HAL vs other GAIA harnesses — capability gaps, design decisions, and improvement roadmap
npx skill4agent add ruvnet/ruflo gaia-architecture-comparison

gaia-bench run
└─ gaia-loader.ts — HF dataset download + cache
└─ gaia-agent.ts — multi-turn Anthropic Messages loop
└─ gaia-tools/ — web_search, file_read, web_browse,
image_describe, python_exec
└─ gaia-voting.ts — Track A self-consistency (N attempts → majority vote)
└─ gaia-hardness/ — Track Q difficulty predictor (ADR-136)
└─ gaia-judge.ts — two-stage LLM-as-judge scorer

| Dimension | ruflo | HAL reference | Gap |
|---|---|---|---|
| Question count | 53 (partial L1) | 300 (full L1) | Run the full 165-Q evaluation |
| Web search | DuckDuckGo / Google CSE | BrowserBase live | Add Playwright or Browserless |
| Code execution | python_exec stub | Real Jupyter kernel | Implement real sandbox |
| Image OCR | image_describe (Gemini) | GPT-4V / Gemini | Functionally equivalent |
| File handling | file_read | Full PDF/XLSX/ZIP parser | Expand file_read |
| Self-consistency | voting.ts (Track A) | Not in reference | ruflo advantage |
| Hardness routing | predictor.ts (Track Q) | Not in reference | ruflo advantage |
| Memory | AgentDB HNSW | None | ruflo advantage |
| Pass-rate L1 | ~20.8% (iter 23) | 74.6% (HAL Sonnet 4.5) | ~54 pp gap |
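
The self-consistency row (Track A, N attempts → majority vote) can be illustrated with a minimal sketch. This is not the contents of gaia-voting.ts: the names `normalizeAnswer` and `majorityVote`, the normalization rule (trim, lowercase, collapse whitespace), and the tie-break (the earliest-seen answer wins) are all assumptions made for illustration.

```typescript
// Hypothetical sketch of N-attempt majority voting; not taken from gaia-voting.ts.

interface Tally {
  answer: string; // normalized answer text
  count: number;  // number of attempts that produced it
}

/** Normalize an answer so trivially different strings count as one vote (assumed rule). */
function normalizeAnswer(answer: string): string {
  return answer.trim().toLowerCase().replace(/\s+/g, " ");
}

/**
 * Return the most frequent normalized answer among the attempts.
 * Ties go to the answer that appeared first; an empty list yields null.
 */
function majorityVote(attempts: string[]): string | null {
  const tallies: Tally[] = [];
  for (const attempt of attempts) {
    const key = normalizeAnswer(attempt);
    let existing: Tally | undefined;
    for (const t of tallies) {
      if (t.answer === key) {
        existing = t;
        break;
      }
    }
    if (existing) {
      existing.count += 1;
    } else {
      tallies.push({ answer: key, count: 1 });
    }
  }
  let best: Tally | null = null;
  // Strict ">" keeps the earliest-seen answer when counts are equal.
  for (const t of tallies) {
    if (best === null || t.count > best.count) {
      best = t;
    }
  }
  return best === null ? null : best.answer;
}

// Example: three attempts at the same question.
console.log(majorityVote(["Paris", " paris ", "Lyon"])); // prints "paris"
```

A linear scan over the tallies is used instead of a hash map because N is small (a handful of attempts per question) and it keeps first-seen ordering explicit for the tie-break.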
python_exec, --limit 165, web_browse, file_read

| Priority | Change | Expected Lift | Effort |
|---|---|---|---|
| P0 | Real python_exec sandbox (E2B) | +15-25 pp | High |
| P0 | Full 165-Q L1 evaluation | Accurate baseline | Low |
| P1 | Playwright-based web_browse | +5-10 pp | Medium |
| P1 | PDF/XLSX file parser | +3-8 pp | Medium |
| P2 | Increase max-turns to 20 for L2/L3 | +2-5 pp | Low |
| P2 | System prompt tuning (iter 30 research) | +2-5 pp | Low |
| P3 | Google Grounding via Gemini (iter 32) | +3-7 pp | Medium |
| P3 | Multi-provider routing (Gemini Flash for cheap Q's) | Cost reduction | Medium |
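
To see what the roadmap implies in aggregate, the expected-lift ranges in the table can be summed against the ~20.8% baseline. The sketch below assumes the lifts are independent and purely additive, which is an optimistic simplification and not something the source claims. The `RoadmapItem` type and `projectPassRate` function are hypothetical names. Rows without a percentage-point estimate (the 165-Q evaluation and multi-provider routing) are left out.

```typescript
// Illustrative arithmetic only: treats the roadmap lifts as additive (an assumption).

interface RoadmapItem {
  change: string;
  minLiftPp: number; // lower bound of expected lift, in percentage points
  maxLiftPp: number; // upper bound of expected lift, in percentage points
}

// Ranges copied from the roadmap table; rows without a pp estimate are omitted.
const roadmap: RoadmapItem[] = [
  { change: "Real python_exec sandbox (E2B)", minLiftPp: 15, maxLiftPp: 25 },
  { change: "Playwright-based web_browse", minLiftPp: 5, maxLiftPp: 10 },
  { change: "PDF/XLSX file parser", minLiftPp: 3, maxLiftPp: 8 },
  { change: "Increase max-turns to 20 for L2/L3", minLiftPp: 2, maxLiftPp: 5 },
  { change: "System prompt tuning", minLiftPp: 2, maxLiftPp: 5 },
  { change: "Google Grounding via Gemini", minLiftPp: 3, maxLiftPp: 7 },
];

/** Add every lower and upper bound to the baseline, capping the result at 100%. */
function projectPassRate(
  baselinePct: number,
  items: RoadmapItem[]
): { low: number; high: number } {
  let low = baselinePct;
  let high = baselinePct;
  for (const item of items) {
    low += item.minLiftPp;
    high += item.maxLiftPp;
  }
  return { low: Math.min(low, 100), high: Math.min(high, 100) };
}

const projection = projectPassRate(20.8, roadmap);
console.log(projection.low.toFixed(1), projection.high.toFixed(1)); // prints "50.8 80.8"
```

Under these assumptions the projected range is 50.8% to 80.8%. Reaching the 74.6% HAL reference would take about 53.8 pp of the 60 pp maximum, so nearly every item would need to land near the top of its range.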
npx @claude-flow/cli@latest memory search \
--namespace gaia-patterns \
--query "architecture comparison HAL benchmark"

npx @claude-flow/cli@latest memory store \
--namespace gaia-patterns \
--key "architecture-comparison-$(date +%Y%m%d)" \
--value "HAL gap: 54pp. Primary: python_exec stub. Secondary: browser, file parsing."
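
When the store command is issued from a script rather than a shell, the dated key produced by `$(date +%Y%m%d)` has to be built by hand. A minimal sketch follows. The helper names `pad2` and `comparisonKey` are hypothetical, and the code only constructs the key string; it does not call the CLI.

```typescript
// Hypothetical helper mirroring the shell expression architecture-comparison-$(date +%Y%m%d).

/** Left-pad a month or day number to two digits. */
function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

/** Build the dated memory key from the local calendar date, as `date +%Y%m%d` does. */
function comparisonKey(now: Date): string {
  const stamp =
    String(now.getFullYear()) + pad2(now.getMonth() + 1) + pad2(now.getDate());
  return "architecture-comparison-" + stamp;
}

// Example: 5 January 2025 in local time (months are zero-based in the Date constructor).
console.log(comparisonKey(new Date(2025, 0, 5))); // prints "architecture-comparison-20250105"
```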