Loading...
Loading...
Diagnose why a GAIA question failed — extract trace, classify failure mode, and propose a fix
npx skill4agent add ruvnet/ruflo gaia-debuggingtask_id| Code | Mode | Symptom | Fix direction |
|---|---|---|---|
| TG | Tool Gap | Agent lacks a required tool (no image OCR, no PDF reader) | Add tool to catalogue |
| RM | Reasoning Miss | Agent has the right data but draws wrong conclusion | Improve system prompt, add CoT instruction |
| EB | Extraction Bug | Answer is in the trace but | Fix answer extraction pattern |
| LI | Loop Issue | Agent loops (re-asks same tool call) and hits turn limit | Increase max-turns or add loop-detection |
| DS | Dataset Shift | Ground truth differs from what web currently shows | Flag for HAL dataset audit |
| AT | API Timeout | Tool call times out; agent never gets the result | Increase per-turn timeout |
# Find the result for the task_id in the latest run
RESULTS=~/.cache/ruflo/gaia/results-latest.json
node -e "
const r = JSON.parse(require('fs').readFileSync('$RESULTS'));
const q = r.results.find(x => x.task_id === '$TASK_ID');
console.log(JSON.stringify(q, null, 2));
"node v3/@claude-flow/cli/bin/cli.js gaia-bench run \
--level 1 --limit 1 \
--task-id $TASK_ID \
--models claude-sonnet-4-6 \
--max-turns 20 \
--output json| Failure | Action |
|---|---|
| TG — missing web_browse | Verify |
| TG — missing image OCR | Add |
| RM — reasoning | Add a system prompt instruction: "Before answering, list all facts you have gathered" |
| EB — extraction | Test the |
| LI — loop | Add a tool-call deduplication guard in |
| AT — timeout | Set |
# Re-run the single question
node … gaia-bench run --task-id $TASK_ID --models $MODEL --output json
# If now passing, store the pattern
npx @claude-flow/cli@latest memory store \
--namespace gaia-debug-patterns \
--key "fix-$FAILURE_CODE-$(date +%Y%m%d)" \
--value "task_id=$TASK_ID, mode=$FAILURE_CODE, fix=$FIX_DESCRIPTION"node -e "
const { createDefaultToolCatalogue } = require('./v3/@claude-flow/cli/src/benchmarks/gaia-tools/index.js');
const cat = createDefaultToolCatalogue({});
console.log('Tools registered:', cat.definitions.map(t => t.name));
"web_searchfile_readweb_browseimage_describepython_execnpx @claude-flow/cli@latest memory store \
--namespace gaia-debug-patterns \
--key "session-$(date +%Y%m%d-%H%M)" \
--value '{"task_id":"$TASK_ID","failure_mode":"$CODE","fix":"$FIX","verified":true}'npx @claude-flow/cli@latest memory search \
--namespace gaia-debug-patterns \
--query "extraction bug final answer regex"