Researcher Skill

Sharded Deep Research for analyzing large codebases. Uses LangGraph with Map-Plan-Loop-Synthesize architecture to handle repositories that exceed LLM context limits.

Architecture

┌─────────┐     ┌──────────────┐     ┌────────────────┐     ┌──────────────┐
│  Setup  │ --> │  Architect   │ --> │ Process Shard  │ --> │ Synthesize  │
│  Clone  │     │   (Plan)     │     │    (Loop)      │     │   Index.md   │
└─────────┘     └──────────────┘     └────────────────┘     └──────────────┘
     │                  │                    │
     │              3-5 shards          compress
     │              defined by           + analyze
     │              LLM                  each shard
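The pipeline above can be sketched as a plain loop (a minimal illustration only; the actual skill wires these stages as a LangGraph StateGraph with conditional edges, and `plan_shards` / `compress` / `analyze` here are hypothetical stand-ins for the architect LLM, repomix, and the analysis model):

```python
# Minimal sketch of the Map-Plan-Loop-Synthesize flow.
# plan_shards / compress / analyze are hypothetical stand-ins.

def run_research(file_tree, plan_shards, compress, analyze):
    shards = plan_shards(file_tree)      # Architect: 3-5 logical shards
    analyses = []
    while shards:                        # Loop: one shard per iteration
        shard = shards.pop(0)
        packed = compress(shard)         # repomix, shard-specific config
        analyses.append((shard["name"], analyze(packed)))
    # Synthesize: index.md body linking each shard analysis
    return "\n".join(f"- [{name}](./shards/{i:02d}_{name}.md)"
                     for i, (name, _) in enumerate(analyses, 1))
```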

Commands

run_research_graph

[CORE] Execute the Sharded Deep Research Workflow.
This autonomously:
  1. Clones the repository to a temporary workspace
  2. Maps the file structure (god view)
  3. Plans 3-5 logical analysis shards (subsystems) via LLM
  4. Iterates through each shard:
    • Compress with repomix (shard-specific config)
    • Analyze with LLM
    • Save shard analysis to `shards/<id>_<name>.md`
  5. Synthesizes `index.md` linking all shard analyses
Parameters:
  • `repo_url` (string, required): Git repository URL to analyze
  • `request` (string, optional): Research goal/focus (default: "Analyze the architecture")
  • `visualize` (bool, optional): If true, return the workflow diagram only
  • `chunked` (bool, optional): If true, use step-by-step actions (like knowledge recall)
  • `action` (string, optional): When chunked: `"start"` | `"shard"` | `"synthesize"`
  • `session_id` (string, optional): When chunked: required for `shard` and `synthesize` (returned from `start`)
  • `chunk_id` (string, optional): When chunked + `action="shard"`, run one specific chunk (e.g. `c1`)
  • `chunk_ids` (list[string], optional): When chunked + `action="shard"`, run multiple chunks in parallel in one call
  • `max_concurrent` (int, optional): Max concurrent shard LLM calls; null = unbounded. Set to 6–8 if hitting API rate limits (429). Falls back to `researcher.max_concurrent` in settings.
Chunked mode (step-by-step):
  1. Call with `chunked=true`, `action="start"` → returns `session_id`, `chunk_plan` (`c1`, `c2`, ...), `next_action`.
  2. Call with `chunked=true`, `action="shard"`, `session_id=<from start>`, and either `chunk_id=<cx>` for one chunk or `chunk_ids=[...]` for parallel chunk execution. If omitted, all pending chunks are executed in parallel in that call.
  3. Call with `chunked=true`, `action="synthesize"`, `session_id=<same>` after all chunks complete.
State is persisted in the checkpoint store under workflow type `research_chunked`.
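The chunked call sequence can be driven like this (a sketch only; `call_tool` is a hypothetical stand-in for however your MCP client invokes `run_research_graph`):

```python
# Sketch of driving chunked mode: start → shard(s) → synthesize.
# call_tool is a stand-in for the MCP client's tool invocation.

def run_chunked(call_tool):
    started = call_tool(chunked=True, action="start")
    sid = started["session_id"]
    # Omitting chunk_id/chunk_ids runs all pending chunks in parallel;
    # alternatively pass chunk_ids=started["chunk_plan"] explicitly.
    call_tool(chunked=True, action="shard", session_id=sid)
    return call_tool(chunked=True, action="synthesize", session_id=sid)
```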
Returns:
```json
{
  "success": true,
  "harvest_dir": "/path/to/.data/harvested/<owner>/<repo_name>/",
  "shards_analyzed": 4,
  "revision": "abc1234",
  "shard_summaries": [
    "- **[Core Kernel](./shards/01_core_kernel.md)**: Main business logic",
    "- **[API Layer](./shards/02_api_layer.md)**: HTTP handlers"
  ],
  "summary": "Research Complete!..."
}
```
Output Location:
.data/harvested/<owner>/<repo_name>/
├── index.md                    # Master index with YAML frontmatter (includes revision)
└── shards/
    ├── 01_core_kernel.md       # Shard 1 analysis
    ├── 02_api_layer.md         # Shard 2 analysis
    └── ...
index.md Frontmatter:
```yaml
---
title: "Research Analysis: <repo_name>"
source: <repo_url>
revision: <git_hash>
revision_date: <YYYY-MM-DD HH:MM:SS TZ>
generated: <YYYY-MM-DD>
shards: <count>
---
```
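The frontmatter layout above can be rendered with a small helper (a hypothetical sketch; `render_frontmatter` is not part of the skill's API, and the field names simply follow the layout shown):

```python
# Sketch: render the index.md frontmatter shown above.
from datetime import datetime, timezone

def render_frontmatter(repo_name, repo_url, revision, shard_count):
    now = datetime.now(timezone.utc)
    return "\n".join([
        "---",
        f'title: "Research Analysis: {repo_name}"',  # quoted: value contains a colon
        f"source: {repo_url}",
        f"revision: {revision}",
        f"revision_date: {now:%Y-%m-%d %H:%M:%S %Z}",
        f"generated: {now:%Y-%m-%d}",
        f"shards: {shard_count}",
        "---",
    ])
```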

Usage Example

Analyze a repository's security patterns:

```python
await researcher.run_research_graph(
    repo_url="https://github.com/example/large-repo",
    request="Analyze security patterns and vulnerability surfaces"
)
```

Result: Multiple shard analyses saved to `.data/harvested/`

Technical Details

  • Repomix: Used directly (not via npx) for code compression
  • Sharding: LLM (architect) proposes subsystems; normalization enforces efficient bounds
  • Loop: Conditional edges in LangGraph process shards until the queue is empty
  • Checkpoint: MemorySaver enables resumption of interrupted workflows
  • Chunked API: Same workflow type as knowledge recall; one step per MCP call via `action` and `session_id`

Efficient sharding design

To avoid timeouts and unbalanced work, sharding is constrained and normalized:
  1. Architect prompt limits
    • At most 5 files per shard, total files ≤ 25 across all shards.
    • 4–6 subsystems; explicit “stay under limits” so the LLM does not propose oversized shards.
  2. Post-architect normalization (`_normalize_shards`)
    • Split: Any shard with > 5 files is split into multiple shards (e.g. “Core (1)”, “Core (2)”).
    • Cap: Total files across all shards are capped at 30; excess is trimmed from the end.
    • Merge: Consecutive shards with ≤ 2 files each are merged into one shard (up to 5 files) to reduce round-trips and balance size.
  3. Per-shard processing limits
    • Repomix output per shard capped at 32k chars; subprocess timeout 120s; run in an executor so the heartbeat can run.
    • LLM input 28k chars, output 4096 tokens.
Result: each `action=shard` call runs on a bounded amount of code and stays within the MCP idle/total timeout when heartbeat is used.
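The split/cap/merge rules can be sketched as follows (an illustration of the bounds described above; the real `_normalize_shards` lives inside the skill and may differ in detail):

```python
# Sketch of the normalization bounds described above:
# split shards > 5 files, cap total files at 30, merge tiny neighbours.
MAX_FILES_PER_SHARD = 5
MAX_TOTAL_FILES = 30
MERGE_THRESHOLD = 2

def normalize_shards(shards):
    # Split: break oversized shards into <= 5-file pieces.
    split = []
    for s in shards:
        files = s["files"]
        if len(files) <= MAX_FILES_PER_SHARD:
            split.append(dict(s))
        else:
            for i in range(0, len(files), MAX_FILES_PER_SHARD):
                part = i // MAX_FILES_PER_SHARD + 1
                split.append({"name": f'{s["name"]} ({part})',
                              "files": files[i:i + MAX_FILES_PER_SHARD]})
    # Cap: trim excess files from the end of the plan.
    total, capped = 0, []
    for s in split:
        room = MAX_TOTAL_FILES - total
        if room <= 0:
            break
        s["files"] = s["files"][:room]
        total += len(s["files"])
        capped.append(s)
    # Merge: fold consecutive tiny shards together (up to 5 files).
    merged = []
    for s in capped:
        if (merged and len(merged[-1]["files"]) <= MERGE_THRESHOLD
                and len(s["files"]) <= MERGE_THRESHOLD
                and len(merged[-1]["files"]) + len(s["files"]) <= MAX_FILES_PER_SHARD):
            merged[-1]["files"] += s["files"]
            merged[-1]["name"] += f' + {s["name"]}'
        else:
            merged.append(s)
    return merged
```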

Performance & timeouts

Shard processing is tuned and uses a progress-aware timeout:
  • Idle timeout (`mcp.idle_timeout`, default 120s): Cancel only when there is no progress for this long. The researcher calls `heartbeat()` every 10s during repomix and LLM calls, so the runner does not kill the tool while it is still working.
  • Total timeout (`mcp.timeout`, default 180s): Hard cap (wall-clock); 0 = disable.
  • Repomix: Output capped at 32k chars per shard; subprocess timeout 120s; run in an executor so the heartbeat can run.
  • LLM: Input 28k chars, output 4096 tokens; architect prefers 4–6 shards with 3–6 files each.
To allow longer runs without changing behaviour, increase the timeouts in settings:
```yaml
mcp:
  timeout: 300 # Hard cap (seconds); 0 = disable
  idle_timeout: 120 # Cancel only after no heartbeat for this long; 0 = use only timeout
```
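The progress-aware cancellation logic can be sketched in isolation (an illustration only, not the skill's implementation; `IdleWatchdog` is a hypothetical name):

```python
# Sketch of a progress-aware timeout: cancel only when no heartbeat
# arrived for idle_timeout seconds, and unconditionally at the total
# timeout. monotonic clocks avoid wall-clock jumps.
import time

class IdleWatchdog:
    def __init__(self, idle_timeout=120.0, total_timeout=180.0):
        self.idle_timeout = idle_timeout    # 0 disables the idle check
        self.total_timeout = total_timeout  # 0 disables the hard cap
        self.start = time.monotonic()
        self.last_beat = self.start

    def heartbeat(self):
        # Called every ~10s while repomix / the LLM is still making progress.
        self.last_beat = time.monotonic()

    def should_cancel(self):
        now = time.monotonic()
        if self.total_timeout and now - self.start >= self.total_timeout:
            return True   # hard wall-clock cap
        if self.idle_timeout and now - self.last_beat >= self.idle_timeout:
            return True   # no progress for too long
        return False
```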