gaia-debugging

GAIA Debugging Skill

When a GAIA question fails, systematically diagnose the root cause and propose a targeted fix.

When to use

A specific `task_id` returns the wrong answer or times out
The pass rate dropped between two runs and you need to find the regression
You want to understand why a particular question class is consistently failing
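
For the regression case, one way to narrow the search is to diff two result files and list the questions that passed before and fail now. The sketch below is only an illustration, not part of the GAIA tooling: `findRegressions` is a hypothetical helper, and the boolean `correct` field is an assumed name (the `results` array and `task_id` follow the trace lookup in Step 1).

```javascript
// Sketch: list task_ids that passed in the older run but fail in the newer one.
// ASSUMPTION: each entry looks like { task_id: string, correct: boolean };
// the `correct` field name is hypothetical, adjust it to your results schema.
function findRegressions(oldRun, newRun) {
  const passedBefore = new Set(
    oldRun.results.filter((r) => r.correct).map((r) => r.task_id)
  );
  return newRun.results
    .filter((r) => !r.correct && passedBefore.has(r.task_id))
    .map((r) => r.task_id);
}

// Inline example data; in practice, JSON.parse two saved results files.
const before = { results: [
  { task_id: 'a1', correct: true },
  { task_id: 'b2', correct: true },
  { task_id: 'c3', correct: false },
] };
const after = { results: [
  { task_id: 'a1', correct: true },
  { task_id: 'b2', correct: false },
  { task_id: 'c3', correct: false },
] };

console.log(findRegressions(before, after)); // [ 'b2' ]
```

Each returned `task_id` can then be taken through the diagnostic workflow individually.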

Failure mode taxonomy

Code	Mode	Symptom	Fix direction
TG	Tool Gap	Agent lacks a required tool (no image OCR, no PDF reader)	Add tool to catalogue
RM	Reasoning Miss	Agent has the right data but draws the wrong conclusion	Improve system prompt, add CoT instruction
EB	Extraction Bug	Answer is in the trace but the `FINAL_ANSWER:` regex fails to match	Fix answer extraction pattern
LI	Loop Issue	Agent loops (re-asks same tool call) and hits turn limit	Increase max-turns or add loop-detection
DS	Dataset Shift	Ground truth differs from what web currently shows	Flag for HAL dataset audit
AT	API Timeout	Tool call times out; agent never gets the result	Increase per-turn timeout
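
To make the EB row concrete, the sketch below runs an extraction pattern over a few trace snippets so that a non-matching format stands out. The pattern is an assumption made up for illustration; the project's real `FINAL_ANSWER_RE` is not shown in this document and may differ. `extractFinalAnswer` is likewise a hypothetical helper.

```javascript
// ASSUMPTION: illustrative pattern only; the real FINAL_ANSWER_RE may differ.
// Flags: i = case-insensitive, m = `$` matches at the end of each line.
const HYPOTHETICAL_FINAL_ANSWER_RE = /FINAL_ANSWER:\s*(.+?)\s*$/im;

// Returns the captured answer, or null when the pattern does not match.
function extractFinalAnswer(traceText) {
  const match = HYPOTHETICAL_FINAL_ANSWER_RE.exec(traceText);
  return match ? match[1] : null;
}

const samples = [
  'Reasoning...\nFINAL_ANSWER: 42',        // standard form
  'final_answer:   Paris  \nextra note',   // odd case and padding
  'The answer is 42',                      // marker missing entirely
];
for (const s of samples) {
  console.log(JSON.stringify(extractFinalAnswer(s)));
}
// prints "42", then "Paris", then null
```

A `null` for a trace that visibly contains the right answer points at EB rather than RM.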

Diagnostic workflow

Step 1 — Load the question trace

bash

# Find the result for the task_id in the latest run
RESULTS=~/.cache/ruflo/gaia/results-latest.json
node -e "
  const r = JSON.parse(require('fs').readFileSync('$RESULTS'));
  const q = r.results.find(x => x.task_id === '$TASK_ID');
  console.log(JSON.stringify(q, null, 2));
"

Step 2 — Classify the failure

Look at the trace output:

No tools called at all → RM or configuration issue
Tool called but returned error → TG or AT
Tool returned data, wrong answer → RM or EB
Correct answer in trace but marked wrong → EB
max-turns hit → LI or question too hard for current model
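
The rules above can be sketched as a small function that maps a failed question's trace to candidate codes. This is a rough illustration under stated assumptions: the trace fields (`toolCalls`, `turnsUsed`, `maxTurns`, `expected`, `rawText`) are hypothetical names, not the actual results schema, and checking the turn limit first is a judgment call.

```javascript
// ASSUMPTION: the trace shape used here is hypothetical.
// Call this only for questions already marked as failed.
function candidateFailureModes(trace) {
  const calls = trace.toolCalls ?? [];
  const text = trace.rawText ?? '';
  // max-turns hit → LI or question too hard for current model
  if (trace.turnsUsed >= trace.maxTurns) return ['LI', 'too hard for current model'];
  // No tools called at all → RM or configuration issue
  if (calls.length === 0) return ['RM', 'configuration issue'];
  // Tool called but returned error → TG or AT
  if (calls.some((c) => c.error)) return ['TG', 'AT'];
  // Correct answer in trace but marked wrong → EB
  if (trace.expected && text.includes(trace.expected)) return ['EB'];
  // Tool returned data, wrong answer → RM or EB
  return ['RM', 'EB'];
}

console.log(candidateFailureModes({
  toolCalls: [{ name: 'web_search' }],
  turnsUsed: 4,
  maxTurns: 20,
  expected: '1969',
  rawText: 'Search says 1969. FINAL_ANSWER - 1969',
})); // [ 'EB' ]
```

The output is a shortlist to guide manual inspection, not a verdict.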

Step 3 — Re-run with extended logging

bash

node v3/@claude-flow/cli/bin/cli.js gaia-bench run \
  --level 1 --limit 1 \
  --task-id $TASK_ID \
  --models claude-sonnet-4-6 \
  --max-turns 20 \
  --output json

Step 4 — Apply targeted fix

Failure	Action
TG — missing web_browse	Verify `gaia-tools/index.ts` exports `web_browse` ; check tool registration
TG — missing image OCR	Add `image_describe` tool call; verify `GOOGLE_AI_API_KEY`
RM — reasoning	Add a system prompt instruction: "Before answering, list all facts you have gathered"
EB — extraction	Test the `FINAL_ANSWER_RE` regex against the trace manually
LI — loop	Add a tool-call deduplication guard in `gaia-agent.ts`
AT — timeout	Set `DEFAULT_PER_TURN_TIMEOUT_MS` higher or use `--max-turns` flag
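
As an illustration of the LI row, a tool-call deduplication guard could look roughly like the sketch below. It is not the actual `gaia-agent.ts` code; `createDedupGuard` and its `maxRepeats` parameter are hypothetical names, and how the agent reacts when a call is refused is left open.

```javascript
// Sketch of a tool-call deduplication guard (hypothetical helper).
// It counts identical (tool name, arguments) pairs and refuses a call
// once the same pair has been seen more than `maxRepeats` times.
function createDedupGuard(maxRepeats = 2) {
  const seen = new Map();
  return function shouldAllow(toolName, args) {
    // Note: JSON.stringify is key-order sensitive, so {a:1,b:2} and
    // {b:2,a:1} count as different calls in this simple version.
    const key = `${toolName}:${JSON.stringify(args)}`;
    const count = (seen.get(key) ?? 0) + 1;
    seen.set(key, count);
    return count <= maxRepeats;
  };
}

const allow = createDedupGuard(2);
console.log(allow('web_search', { q: 'gaia' }));  // true
console.log(allow('web_search', { q: 'gaia' }));  // true
console.log(allow('web_search', { q: 'gaia' }));  // false
console.log(allow('web_search', { q: 'other' })); // true
```

When the guard returns `false`, the agent loop could skip the call and tell the model that it already has that result, which tends to break the repetition before the turn limit is reached.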

Step 5 — Verify fix and store pattern

bash

# Re-run the single question
node … gaia-bench run --task-id $TASK_ID --models $MODEL --output json

# If now passing, store the pattern
npx @claude-flow/cli@latest memory store \
  --namespace gaia-debug-patterns \
  --key "fix-$FAILURE_CODE-$(date +%Y%m%d)" \
  --value "task_id=$TASK_ID, mode=$FAILURE_CODE, fix=$FIX_DESCRIPTION"

Quick reference: tool catalogue check

bash

node -e "
  const { createDefaultToolCatalogue } = require('./v3/@claude-flow/cli/src/benchmarks/gaia-tools/index.js');
  const cat = createDefaultToolCatalogue({});
  console.log('Tools registered:', cat.definitions.map(t => t.name));
"

Expected: `web_search`, `file_read`, `web_browse`, `image_describe`, `python_exec`

Pattern storage

After resolving a debugging session, store the finding:

bash

npx @claude-flow/cli@latest memory store \
  --namespace gaia-debug-patterns \
  --key "session-$(date +%Y%m%d-%H%M)" \
  --value "{\"task_id\":\"$TASK_ID\",\"failure_mode\":\"$CODE\",\"fix\":\"$FIX\",\"verified\":true}"
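
Hand-quoting JSON in the shell is easy to get wrong, especially when the fix description itself contains quotes. One possible alternative, sketched below, builds the argument list in Node and lets `JSON.stringify` do the escaping. `buildStoreArgs` is a hypothetical helper; the flags and the key format simply mirror the command above.

```javascript
// Sketch: build the argument list for `npx ... memory store` with a
// JSON payload produced by JSON.stringify (hypothetical helper).
function buildStoreArgs({ taskId, code, fix, now = new Date() }) {
  const pad = (n) => String(n).padStart(2, '0');
  // Same shape as `date +%Y%m%d-%H%M`, using local time.
  const stamp =
    `${now.getFullYear()}${pad(now.getMonth() + 1)}${pad(now.getDate())}` +
    `-${pad(now.getHours())}${pad(now.getMinutes())}`;
  return [
    '@claude-flow/cli@latest', 'memory', 'store',
    '--namespace', 'gaia-debug-patterns',
    '--key', `session-${stamp}`,
    '--value', JSON.stringify({
      task_id: taskId,
      failure_mode: code,
      fix,
      verified: true,
    }),
  ];
}

const args = buildStoreArgs({
  taskId: 'example-task',
  code: 'EB',
  fix: 'allow "quoted" answers in the regex',
});
console.log(args.join(' '));
```

The resulting array can be passed to `require('child_process').execFileSync('npx', args)`, which bypasses shell quoting altogether.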

Search for similar past failures:

bash

npx @claude-flow/cli@latest memory search \
  --namespace gaia-debug-patterns \
  --query "extraction bug final answer regex"
