gaia-debugging
GAIA Debugging Skill
When a GAIA question fails, systematically diagnose the root cause and propose
a targeted fix.
When to use
- A specific `task_id` returns the wrong answer or times out
- Pass-rate dropped between two runs and you need to find the regression
- You want to understand why a particular question class is consistently failing
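
For the pass-rate regression case, one way to narrow the search is to diff two result files by `task_id`. The sketch below assumes each file has the `results` array used in Step 1 and that each entry carries a boolean `correct` field. That field name is an assumption and may differ in the actual output, and `findRegressions` is a hypothetical helper.

```javascript
// Sketch: list task_ids that passed in the baseline run but fail in the current run.
// Assumption: each entry looks like { task_id: string, correct: boolean }.
function findRegressions(baseline, current) {
  const passedBefore = new Set(
    baseline.results.filter((r) => r.correct).map((r) => r.task_id)
  );
  return current.results
    .filter((r) => !r.correct && passedBefore.has(r.task_id))
    .map((r) => r.task_id);
}

// Inline sample data standing in for two parsed result files.
const baselineRun = { results: [
  { task_id: 'a1', correct: true },
  { task_id: 'b2', correct: true },
  { task_id: 'c3', correct: false },
] };
const currentRun = { results: [
  { task_id: 'a1', correct: true },
  { task_id: 'b2', correct: false },
  { task_id: 'c3', correct: false },
] };

console.log(findRegressions(baselineRun, currentRun)); // [ 'b2' ]
```

Each returned `task_id` can then go through the diagnostic workflow one at a time.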
Failure mode taxonomy
| Code | Mode | Symptom | Fix direction |
|---|---|---|---|
| TG | Tool Gap | Agent lacks a required tool (no image OCR, no PDF reader) | Add tool to catalogue |
| RM | Reasoning Miss | Agent has the right data but draws wrong conclusion | Improve system prompt, add CoT instruction |
| EB | Extraction Bug | Answer is in the trace but is marked wrong because extraction misses it | Fix answer extraction pattern |
| LI | Loop Issue | Agent loops (re-asks same tool call) and hits turn limit | Increase max-turns or add loop-detection |
| DS | Dataset Shift | Ground truth differs from what web currently shows | Flag for HAL dataset audit |
| AT | API Timeout | Tool call times out; agent never gets the result | Increase per-turn timeout |
Diagnostic workflow
Step 1 — Load the question trace
```bash
# Find the result for the task_id in the latest run
RESULTS=~/.cache/ruflo/gaia/results-latest.json
node -e "
const r = JSON.parse(require('fs').readFileSync('$RESULTS'));
const q = r.results.find(x => x.task_id === '$TASK_ID');
console.log(JSON.stringify(q, null, 2));
"
```

Step 2 — Classify the failure
Look at the trace output:
- No tools called at all → RM or configuration issue
- Tool called but returned error → TG or AT
- Tool returned data, wrong answer → RM or EB
- Correct answer in trace but marked wrong → EB
- max-turns hit → LI or question too hard for current model
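
The rules above can be sketched as a small function. The trace shape used here (`toolCalls`, `hitMaxTurns`, `answerInTrace`) is hypothetical and would need to be mapped onto whatever the result JSON actually contains. The order in which the rules are checked is a judgment call, not something the workflow prescribes.

```javascript
// Sketch of the Step 2 rules; the first matching rule wins.
// Hypothetical trace shape:
// { toolCalls: [{ name, error? }], hitMaxTurns: boolean, answerInTrace: boolean }
function classifyFailure(trace) {
  // max-turns hit -> LI or question too hard for current model
  if (trace.hitMaxTurns) return ['LI', 'question too hard for current model'];
  // No tools called at all -> RM or configuration issue
  if (trace.toolCalls.length === 0) return ['RM', 'configuration issue'];
  // Tool called but returned error -> TG or AT
  if (trace.toolCalls.some((call) => call.error)) return ['TG', 'AT'];
  // Correct answer in trace but marked wrong -> EB
  if (trace.answerInTrace) return ['EB'];
  // Tool returned data, wrong answer -> RM or EB
  return ['RM', 'EB'];
}

console.log(classifyFailure({ toolCalls: [], hitMaxTurns: false, answerInTrace: false }));
// [ 'RM', 'configuration issue' ]
console.log(classifyFailure({
  toolCalls: [{ name: 'web_search', error: 'request failed' }],
  hitMaxTurns: false,
  answerInTrace: false,
}));
// [ 'TG', 'AT' ]
```

The function returns candidate codes rather than a single verdict, since most rules leave two possibilities to separate by reading the trace.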
Step 3 — Re-run with extended logging
```bash
node v3/@claude-flow/cli/bin/cli.js gaia-bench run \
  --level 1 --limit 1 \
  --task-id $TASK_ID \
  --models claude-sonnet-4-6 \
  --max-turns 20 \
  --output json
```

Step 4 — Apply targeted fix
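| Failure | Action |
|---|---|
| TG — missing web_browse | Verify that `web_browse` is registered in the tool catalogue |
| TG — missing image OCR | Add an image OCR tool to the catalogue |
| RM — reasoning | Add a system prompt instruction: "Before answering, list all facts you have gathered" |
| EB — extraction | Test the answer extraction pattern manually against the trace |
| LI — loop | Add a tool-call deduplication guard, or increase max-turns |
| AT — timeout | Increase the per-turn timeout |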
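
For the LI row, a deduplication guard might look like the following. This is a minimal sketch, not the project's implementation: `createLoopGuard` is a hypothetical helper, and the `{ name, args }` call shape is an assumption about how tool calls are represented.

```javascript
// Sketch of a tool-call deduplication guard (hypothetical helper, not project code).
// A call counts as a repeat when the tool name and JSON-serialized arguments match.
function createLoopGuard(maxRepeats = 2) {
  const seen = new Map(); // call signature -> number of times seen
  return function shouldBlock(call) {
    const signature = `${call.name}:${JSON.stringify(call.args)}`;
    const count = (seen.get(signature) || 0) + 1;
    seen.set(signature, count);
    return count > maxRepeats;
  };
}

const shouldBlock = createLoopGuard(2);
const call = { name: 'web_search', args: { query: 'GAIA benchmark' } };
console.log(shouldBlock(call)); // false (first occurrence)
console.log(shouldBlock(call)); // false (second occurrence)
console.log(shouldBlock(call)); // true  (third identical call)
```

Because the signature relies on `JSON.stringify`, two calls whose arguments differ only in key order are treated as different. A stricter guard would normalize the arguments first.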
Step 5 — Verify fix and store pattern
```bash
# Re-run the single question
node … gaia-bench run --task-id $TASK_ID --models $MODEL --output json

# If now passing, store the pattern
npx @claude-flow/cli@latest memory store \
  --namespace gaia-debug-patterns \
  --key "fix-$FAILURE_CODE-$(date +%Y%m%d)" \
  --value "task_id=$TASK_ID, mode=$FAILURE_CODE, fix=$FIX_DESCRIPTION"
```

Quick reference: tool catalogue check
```bash
node -e "
const { createDefaultToolCatalogue } = require('./v3/@claude-flow/cli/src/benchmarks/gaia-tools/index.js');
const cat = createDefaultToolCatalogue({});
console.log('Tools registered:', cat.definitions.map(t => t.name));
"
```

Expected: `web_search`, `file_read`, `web_browse`, `image_describe`, `python_exec`
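
To turn the catalogue check into a pass/fail test, the registered names can be compared against the expected list. The sketch below works on a plain array of names, so it runs without the project checkout. `missingTools` is a hypothetical helper, and the sample input stands in for the output of `cat.definitions.map(t => t.name)`.

```javascript
// Sketch: report which expected tools are missing from a list of registered names.
const EXPECTED_TOOLS = ['web_search', 'file_read', 'web_browse', 'image_describe', 'python_exec'];

function missingTools(registeredNames, expected = EXPECTED_TOOLS) {
  const registered = new Set(registeredNames);
  return expected.filter((name) => !registered.has(name));
}

// Sample input standing in for the names printed by the catalogue check.
console.log(missingTools(['web_search', 'file_read', 'python_exec']));
// [ 'web_browse', 'image_describe' ]
```

A non-empty result points to a TG failure before any question is run.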
Pattern storage
After resolving a debugging session, store the finding:
```bash
# Double quotes so the shell expands $TASK_ID, $CODE, and $FIX inside the JSON
npx @claude-flow/cli@latest memory store \
  --namespace gaia-debug-patterns \
  --key "session-$(date +%Y%m%d-%H%M)" \
  --value "{\"task_id\":\"$TASK_ID\",\"failure_mode\":\"$CODE\",\"fix\":\"$FIX\",\"verified\":true}"
```

Search for similar past failures:

```bash
npx @claude-flow/cli@latest memory search \
  --namespace gaia-debug-patterns \
  --query "extraction bug final answer regex"
```
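
JSON assembled by hand in a shell string breaks as soon as the fix description contains a quote. One alternative is to build the arguments in Node with `JSON.stringify`. The sketch below only constructs the argument list, using the flags from the commands above. `buildStoreArgs` is a hypothetical helper and the sample values are made up.

```javascript
// Sketch: build the arguments for `memory store` without shell quoting.
function buildStoreArgs({ taskId, code, fix, now = new Date() }) {
  const pad = (n) => String(n).padStart(2, '0');
  // Mirrors `date +%Y%m%d-%H%M` in local time.
  const stamp =
    `${now.getFullYear()}${pad(now.getMonth() + 1)}${pad(now.getDate())}` +
    `-${pad(now.getHours())}${pad(now.getMinutes())}`;
  // JSON.stringify escapes any quotes inside the fix description.
  const value = JSON.stringify({ task_id: taskId, failure_mode: code, fix, verified: true });
  return [
    'memory', 'store',
    '--namespace', 'gaia-debug-patterns',
    '--key', `session-${stamp}`,
    '--value', value,
  ];
}

const args = buildStoreArgs({
  taskId: 'example-task', // made-up sample values
  code: 'EB',
  fix: 'relaxed the "final answer" pattern',
  now: new Date(2025, 0, 15, 9, 30),
});
console.log(args.join(' '));
```

The array can be passed to `child_process.execFile('npx', ['@claude-flow/cli@latest', ...args])`, which hands each argument to the process directly and avoids shell quoting altogether.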