gaia-debugging

GAIA Debugging Skill

When a GAIA question fails, systematically diagnose the root cause and propose a targeted fix.

When to use

  • A specific task_id returns the wrong answer or times out
  • Pass-rate dropped between two runs and you need to find the regression
  • You want to understand why a particular question class is consistently failing
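
For the pass-rate regression case, one way to narrow the search is to diff two result files and list the tasks that passed before and fail now. This is a minimal sketch: only `results` and `task_id` appear in this document, while the `correct` field and the `findRegressions` helper are hypothetical names chosen for illustration.

```javascript
// Assumed result shape: { results: [{ task_id: string, correct: boolean }] }.
// `correct` is a hypothetical field name; adjust it to the real results schema.
function findRegressions(before, after) {
  const passedBefore = new Set(
    before.results.filter((r) => r.correct).map((r) => r.task_id)
  );
  // A regression is a task that passed in the earlier run and fails in the later one.
  return after.results
    .filter((r) => !r.correct && passedBefore.has(r.task_id))
    .map((r) => r.task_id);
}

const runA = { results: [
  { task_id: 'a', correct: true },
  { task_id: 'b', correct: true },
  { task_id: 'c', correct: false },
] };
const runB = { results: [
  { task_id: 'a', correct: true },
  { task_id: 'b', correct: false },
  { task_id: 'c', correct: false },
] };
console.log(findRegressions(runA, runB)); // prints [ 'b' ]
```

Task `c` is not reported because it was already failing in the earlier run, which keeps the list focused on new breakage.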

Failure mode taxonomy

| Code | Mode | Symptom | Fix direction |
|------|------|---------|---------------|
| TG | Tool Gap | Agent lacks a required tool (no image OCR, no PDF reader) | Add tool to catalogue |
| RM | Reasoning Miss | Agent has the right data but draws wrong conclusion | Improve system prompt, add CoT instruction |
| EB | Extraction Bug | Answer is in the trace but FINAL_ANSWER: regex fails | Fix answer extraction pattern |
| LI | Loop Issue | Agent loops (re-asks same tool call) and hits turn limit | Increase max-turns or add loop-detection |
| DS | Dataset Shift | Ground truth differs from what web currently shows | Flag for HAL dataset audit |
| AT | API Timeout | Tool call times out; agent never gets the result | Increase per-turn timeout |
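
The loop-detection fix for LI could be sketched as a small guard that counts identical tool calls. The `{ name, input }` call shape, the `createLoopGuard` name, and the repeat threshold are assumptions made for illustration, not the actual agent interface.

```javascript
// Sketch of a tool-call deduplication guard.
// Assumed call shape: { name: string, input: object } (hypothetical).
function createLoopGuard(maxRepeats = 2) {
  const seen = new Map();
  return function isLooping(call) {
    // Same tool name plus same serialized arguments counts as the same call.
    // Note: objects with the same keys in a different order serialize differently.
    const key = `${call.name}:${JSON.stringify(call.input)}`;
    const count = (seen.get(key) || 0) + 1;
    seen.set(key, count);
    return count > maxRepeats;
  };
}

const isLooping = createLoopGuard(2);
const call = { name: 'web_search', input: { query: 'capital of France' } };
console.log(isLooping(call)); // false (first occurrence)
console.log(isLooping(call)); // false (second occurrence)
console.log(isLooping(call)); // true  (third identical call)
```

When the guard fires, the agent loop could stop early or inject a hint, rather than spending the remaining turns on the same call.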

Diagnostic workflow

Step 1 — Load the question trace

bash
# Find the result for the task_id in the latest run
# RESULTS is set on its own line so that $RESULTS expands inside the script below
RESULTS=~/.cache/ruflo/gaia/results-latest.json
node -e "
  const r = JSON.parse(require('fs').readFileSync('$RESULTS'));
  const q = r.results.find(x => x.task_id === '$TASK_ID');
  console.log(JSON.stringify(q, null, 2));
"
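
If the task_id is not in the file, the one-liner prints `undefined` with no further hint. A slightly more defensive lookup might look like the sketch below. The `findResult` name is hypothetical, and it assumes the same `{ results: [{ task_id, ... }] }` shape used in this step.

```javascript
// Returns the result entry for a task_id, or throws a descriptive error.
// Assumes the results file parses to { results: [{ task_id, ... }] }.
function findResult(run, taskId) {
  if (!run || !Array.isArray(run.results)) {
    throw new Error('results file has no "results" array');
  }
  const entry = run.results.find((x) => x.task_id === taskId);
  if (!entry) {
    throw new Error(`task_id ${taskId} not found in ${run.results.length} results`);
  }
  return entry;
}

const sampleRun = { results: [{ task_id: 'abc-123', answer: '42' }] };
console.log(findResult(sampleRun, 'abc-123').answer); // prints 42
```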

Step 2 — Classify the failure

Look at the trace output:
  1. No tools called at all → RM or configuration issue
  2. Tool called but returned error → TG or AT
  3. Tool returned data, wrong answer → RM or EB
  4. Correct answer in trace but marked wrong → EB
  5. max-turns hit → LI or question too hard for current model
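
The checklist above can be turned into a first-pass triage function. This is only a sketch: the trace fields (`toolCalls`, `hitMaxTurns`, `text`, `expected`) are hypothetical names, the order of the checks is one possible choice, and the result is a list of candidate codes rather than a verdict.

```javascript
// Assumed trace shape (hypothetical):
// { toolCalls: [{ name, error? }], hitMaxTurns: boolean, text: string, expected: string }
function classifyFailure(trace) {
  // Rule 5: turn limit reached (could also mean the question is too hard for the model).
  if (trace.hitMaxTurns) return ['LI'];
  // Rule 4: the expected answer appears in the trace but the run was marked wrong.
  if (trace.expected && trace.text.includes(trace.expected)) return ['EB'];
  // Rule 1: no tools were called (could also be a configuration issue).
  if (trace.toolCalls.length === 0) return ['RM'];
  // Rule 2: at least one tool call returned an error.
  if (trace.toolCalls.some((c) => c.error)) return ['TG', 'AT'];
  // Rule 3: tools returned data but the answer is still wrong.
  return ['RM', 'EB'];
}

console.log(classifyFailure({
  hitMaxTurns: false,
  toolCalls: [{ name: 'web_search', error: 'timeout' }],
  text: 'Searching...',
  expected: '42',
})); // prints [ 'TG', 'AT' ]
```

A human still needs to read the trace to pick between the candidates; the function only narrows where to look.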

Step 3 — Re-run with extended logging

bash
node v3/@claude-flow/cli/bin/cli.js gaia-bench run \
  --level 1 --limit 1 \
  --task-id $TASK_ID \
  --models claude-sonnet-4-6 \
  --max-turns 20 \
  --output json

Step 4 — Apply targeted fix

| Failure | Action |
|---------|--------|
| TG: missing web_browse | Verify gaia-tools/index.ts exports web_browse; check tool registration |
| TG: missing image OCR | Add image_describe tool call; verify GOOGLE_AI_API_KEY |
| RM: reasoning | Add a system prompt instruction: "Before answering, list all facts you have gathered" |
| EB: extraction | Test the FINAL_ANSWER_RE regex against the trace manually |
| LI: loop | Add a tool-call deduplication guard in gaia-agent.ts |
| AT: timeout | Set DEFAULT_PER_TURN_TIMEOUT_MS higher or use --max-turns flag |
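
For the EB row, testing the extraction regex by hand can be as simple as running it over a few trace endings. The pattern below is a stand-in written for this example; the project's real FINAL_ANSWER_RE is not shown in this document and may differ, so substitute it before drawing conclusions.

```javascript
// Stand-in pattern (assumption): the real FINAL_ANSWER_RE may be different.
const ASSUMED_FINAL_ANSWER_RE = /FINAL_ANSWER:\s*(.+?)\s*$/m;

// Returns the captured answer, or null when the pattern does not match.
function extractAnswer(traceText) {
  const match = ASSUMED_FINAL_ANSWER_RE.exec(traceText);
  return match ? match[1] : null;
}

const samples = [
  'I checked two sources.\nFINAL_ANSWER: 42',
  'FINAL_ANSWER:   Paris  \nThanks!',
  'Final answer: 42', // different casing and no underscore, so this pattern misses it
];
for (const s of samples) console.log(JSON.stringify(extractAnswer(s)));
// prints "42", then "Paris", then null
```

The third sample shows the typical EB signature: the answer is visibly in the trace, but the marker is formatted in a way the pattern does not accept.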

Step 5 — Verify fix and store pattern

bash
# Re-run the single question
node … gaia-bench run --task-id $TASK_ID --models $MODEL --output json

# If now passing, store the pattern
npx @claude-flow/cli@latest memory store \
  --namespace gaia-debug-patterns \
  --key "fix-$FAILURE_CODE-$(date +%Y%m%d)" \
  --value "task_id=$TASK_ID, mode=$FAILURE_CODE, fix=$FIX_DESCRIPTION"

Quick reference: tool catalogue check

bash
node -e "
  const { createDefaultToolCatalogue } = require('./v3/@claude-flow/cli/src/benchmarks/gaia-tools/index.js');
  const cat = createDefaultToolCatalogue({});
  console.log('Tools registered:', cat.definitions.map(t => t.name));
"
Expected: web_search, file_read, web_browse, image_describe, python_exec
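
Comparing the printed names against the expected list by eye is error-prone, so a small set difference can report exactly what is missing. In this sketch the expected names are taken from the list above, while the `missingTools` helper is a hypothetical name; in the real project its input would come from `cat.definitions.map(t => t.name)`.

```javascript
// Expected tool names, copied from the list in this section.
const EXPECTED_TOOLS = ['web_search', 'file_read', 'web_browse', 'image_describe', 'python_exec'];

// Returns the expected tools that are absent from the registered names.
function missingTools(registeredNames, expected = EXPECTED_TOOLS) {
  const registered = new Set(registeredNames);
  return expected.filter((name) => !registered.has(name));
}

console.log(missingTools(['web_search', 'file_read', 'python_exec']));
// prints [ 'web_browse', 'image_describe' ]
```

An empty array means the catalogue matches the expected list; any name returned is a candidate TG failure.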

Pattern storage

After resolving a debugging session, store the finding:
bash
npx @claude-flow/cli@latest memory store \
  --namespace gaia-debug-patterns \
  --key "session-$(date +%Y%m%d-%H%M)" \
  --value "{\"task_id\":\"$TASK_ID\",\"failure_mode\":\"$CODE\",\"fix\":\"$FIX\",\"verified\":true}"
Search for similar past failures:
bash
npx @claude-flow/cli@latest memory search \
  --namespace gaia-debug-patterns \
  --query "extraction bug final answer regex"
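
Fix descriptions often contain quotes or backslashes, which are awkward to embed in a shell string. One option is to build the key and JSON value in code and pass the results to the store command. The sketch below assumes the same field names as the stored JSON in this section; `buildPatternRecord` is a hypothetical helper, not part of the CLI.

```javascript
// Builds the --key and --value strings for the pattern store.
// Field names mirror the JSON stored in this section.
function buildPatternRecord({ taskId, failureMode, fix, verified = true }, now = new Date()) {
  const pad = (n) => String(n).padStart(2, '0');
  // Same layout as `date +%Y%m%d-%H%M`, using local time.
  const stamp =
    `${now.getFullYear()}${pad(now.getMonth() + 1)}${pad(now.getDate())}` +
    `-${pad(now.getHours())}${pad(now.getMinutes())}`;
  return {
    key: `session-${stamp}`,
    // JSON.stringify escapes quotes and backslashes inside the fix description.
    value: JSON.stringify({ task_id: taskId, failure_mode: failureMode, fix, verified }),
  };
}

const record = buildPatternRecord(
  { taskId: 'abc-123', failureMode: 'EB', fix: 'accept "Final answer:" as a marker' },
  new Date(2025, 0, 5, 9, 7)
);
console.log(record.key); // prints session-20250105-0907
console.log(JSON.parse(record.value).fix); // prints accept "Final answer:" as a marker
```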