Researcher Skill

Sharded Deep Research for analyzing large codebases. Uses LangGraph with Map-Plan-Loop-Synthesize architecture to handle repositories that exceed LLM context limits.

Architecture

┌─────────┐     ┌──────────────┐     ┌────────────────┐     ┌──────────────┐
│  Setup  │ --> │  Architect   │ --> │ Process Shard  │ --> │ Synthesize  │
│  Clone  │     │   (Plan)     │     │    (Loop)      │     │   Index.md   │
└─────────┘     └──────────────┘     └────────────────┘     └──────────────┘
     │                  │                    │
     │              3-5 shards          compress
     │              defined by           + analyze
     │              LLM                  each shard
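The pipeline above can be sketched as a plain loop (a minimal illustration only; the actual skill wires these stages as a LangGraph StateGraph with conditional edges, and `plan_shards` / `compress` / `analyze` here are hypothetical stand-ins for the architect LLM, repomix, and the analysis model):

```python
# Minimal sketch of the Map-Plan-Loop-Synthesize flow.
# plan_shards / compress / analyze are hypothetical stand-ins.

def run_research(file_tree, plan_shards, compress, analyze):
    shards = plan_shards(file_tree)      # Architect: 3-5 logical shards
    analyses = []
    while shards:                        # Loop: one shard per iteration
        shard = shards.pop(0)
        packed = compress(shard)         # repomix, shard-specific config
        analyses.append((shard["name"], analyze(packed)))
    # Synthesize: index.md body linking each shard analysis
    return "\n".join(f"- [{name}](./shards/{i:02d}_{name}.md)"
                     for i, (name, _) in enumerate(analyses, 1))
```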

Commands

run_research_graph

[CORE] Execute the Sharded Deep Research Workflow.
This autonomously:
  1. Clones the repository to a temporary workspace
  2. Maps the file structure (god view)
  3. Plans 3-5 logical analysis shards (subsystems) via LLM
  4. Iterates through each shard:
    • Compress with repomix (shard-specific config)
    • Analyze with LLM
    • Save shard analysis to `shards/<id>_<name>.md`
  5. Synthesizes `index.md` linking all shard analyses
Parameters:
  • `repo_url` (string, required): Git repository URL to analyze
  • `request` (string, optional): Research goal/focus (default: "Analyze the architecture")
  • `visualize` (bool, optional): If true, return the workflow diagram only
  • `chunked` (bool, optional): If true, use step-by-step actions (like knowledge recall)
  • `action` (string, optional): When chunked: `"start"` | `"shard"` | `"synthesize"`
  • `session_id` (string, optional): When chunked: required for `shard` and `synthesize` (returned from `start`)
  • `chunk_id` (string, optional): When chunked + `action="shard"`, run one specific chunk (e.g. `c1`)
  • `chunk_ids` (list[string], optional): When chunked + `action="shard"`, run multiple chunks in parallel in one call
  • `max_concurrent` (int, optional): Max concurrent shard LLM calls; null = unbounded. Set to 6–8 if hitting API rate limits (429). Falls back to `researcher.max_concurrent` in settings.
Chunked mode (step-by-step):
  1. Call with `chunked=true`, `action="start"` → returns `session_id`, `chunk_plan` (`c1`, `c2`, ...), `next_action`.
  2. Call with `chunked=true`, `action="shard"`, `session_id=<from start>`, and either `chunk_id=<cx>` for one chunk or `chunk_ids=[...]` for parallel chunk execution. If omitted, all pending chunks are executed in parallel in that call.
  3. Call with `chunked=true`, `action="synthesize"`, `session_id=<same>` after all chunks complete.
State is persisted in the checkpoint store under workflow type `research_chunked`.
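The chunked call sequence can be driven like this (a sketch only; `call_tool` is a hypothetical stand-in for however your MCP client invokes `run_research_graph`):

```python
# Sketch of driving chunked mode: start → shard(s) → synthesize.
# call_tool is a stand-in for the MCP client's tool invocation.

def run_chunked(call_tool):
    started = call_tool(chunked=True, action="start")
    sid = started["session_id"]
    # Omitting chunk_id/chunk_ids runs all pending chunks in parallel;
    # alternatively pass chunk_ids=started["chunk_plan"] explicitly.
    call_tool(chunked=True, action="shard", session_id=sid)
    return call_tool(chunked=True, action="synthesize", session_id=sid)
```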
Returns:
```json
{
  "success": true,
  "harvest_dir": "/path/to/.data/harvested/<owner>/<repo_name>/",
  "shards_analyzed": 4,
  "revision": "abc1234",
  "shard_summaries": [
    "- **[Core Kernel](./shards/01_core_kernel.md)**: Main business logic",
    "- **[API Layer](./shards/02_api_layer.md)**: HTTP handlers"
  ],
  "summary": "Research Complete!..."
}
```
Output Location:
.data/harvested/<owner>/<repo_name>/
├── index.md                    # Master index with YAML frontmatter (includes revision)
└── shards/
    ├── 01_core_kernel.md       # Shard 1 analysis
    ├── 02_api_layer.md         # Shard 2 analysis
    └── ...
index.md Frontmatter:
```yaml
---
title: "Research Analysis: <repo_name>"
source: <repo_url>
revision: <git_hash>
revision_date: <YYYY-MM-DD HH:MM:SS TZ>
generated: <YYYY-MM-DD>
shards: <count>
---
```
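The frontmatter layout above can be rendered with a small helper (a hypothetical sketch; `render_frontmatter` is not part of the skill's API, and the field names simply follow the layout shown):

```python
# Sketch: render the index.md frontmatter shown above.
from datetime import datetime, timezone

def render_frontmatter(repo_name, repo_url, revision, shard_count):
    now = datetime.now(timezone.utc)
    return "\n".join([
        "---",
        f'title: "Research Analysis: {repo_name}"',  # quoted: value contains a colon
        f"source: {repo_url}",
        f"revision: {revision}",
        f"revision_date: {now:%Y-%m-%d %H:%M:%S %Z}",
        f"generated: {now:%Y-%m-%d}",
        f"shards: {shard_count}",
        "---",
    ])
```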

Usage Example

Analyze a repository's security patterns:

```python
await researcher.run_research_graph(
    repo_url="https://github.com/example/large-repo",
    request="Analyze security patterns and vulnerability surfaces"
)
```

Result: Multiple shard analyses saved to `.data/harvested/`

Technical Details

  • Repomix: Used directly (not via npx) for code compression
  • Sharding: LLM (architect) proposes subsystems; normalization enforces efficient bounds
  • Loop: Conditional edges in LangGraph process shards until the queue is empty
  • Checkpoint: MemorySaver enables resumption of interrupted workflows
  • Chunked API: Same workflow type as knowledge recall; one step per MCP call via `action` and `session_id`

Efficient sharding design

To avoid timeouts and unbalanced work, sharding is constrained and normalized:
  1. Architect prompt limits
    • At most 5 files per shard, total files ≤ 25 across all shards.
    • 4–6 subsystems; explicit “stay under limits” so the LLM does not propose oversized shards.
  2. Post-architect normalization (`_normalize_shards`)
    • Split: Any shard with > 5 files is split into multiple shards (e.g. “Core (1)”, “Core (2)”).
    • Cap: Total files across all shards are capped at 30; excess is trimmed from the end.
    • Merge: Consecutive shards with ≤ 2 files each are merged into one shard (up to 5 files) to reduce round-trips and balance size.
  3. Per-shard processing limits
    • Repomix output per shard capped at 32k chars; subprocess timeout 120s; run in an executor so the heartbeat can run.
    • LLM input 28k chars, output 4096 tokens.
Result: each `action=shard` call runs on a bounded amount of code and stays within the MCP idle/total timeout when heartbeat is used.
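The split/cap/merge rules can be sketched as follows (an illustration of the bounds described above; the real `_normalize_shards` lives inside the skill and may differ in detail):

```python
# Sketch of the normalization bounds described above:
# split shards > 5 files, cap total files at 30, merge tiny neighbours.
MAX_FILES_PER_SHARD = 5
MAX_TOTAL_FILES = 30
MERGE_THRESHOLD = 2

def normalize_shards(shards):
    # Split: break oversized shards into <= 5-file pieces.
    split = []
    for s in shards:
        files = s["files"]
        if len(files) <= MAX_FILES_PER_SHARD:
            split.append(dict(s))
        else:
            for i in range(0, len(files), MAX_FILES_PER_SHARD):
                part = i // MAX_FILES_PER_SHARD + 1
                split.append({"name": f'{s["name"]} ({part})',
                              "files": files[i:i + MAX_FILES_PER_SHARD]})
    # Cap: trim excess files from the end of the plan.
    total, capped = 0, []
    for s in split:
        room = MAX_TOTAL_FILES - total
        if room <= 0:
            break
        s["files"] = s["files"][:room]
        total += len(s["files"])
        capped.append(s)
    # Merge: fold consecutive tiny shards together (up to 5 files).
    merged = []
    for s in capped:
        if (merged and len(merged[-1]["files"]) <= MERGE_THRESHOLD
                and len(s["files"]) <= MERGE_THRESHOLD
                and len(merged[-1]["files"]) + len(s["files"]) <= MAX_FILES_PER_SHARD):
            merged[-1]["files"] += s["files"]
            merged[-1]["name"] += f' + {s["name"]}'
        else:
            merged.append(s)
    return merged
```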

Performance & timeouts

Shard processing is tuned and uses a progress-aware timeout:
  • Idle timeout (`mcp.idle_timeout`, default 120s): Cancel only when there is no progress for this long. The researcher calls `heartbeat()` every 10s during repomix and LLM calls, so the runner does not kill the tool while it is still working.
  • Total timeout (`mcp.timeout`, default 180s): Hard cap (wall-clock); 0 = disable.
  • Repomix: Output capped at 32k chars per shard; subprocess timeout 120s; run in an executor so the heartbeat can run.
  • LLM: Input 28k chars, output 4096 tokens; architect prefers 4–6 shards with 3–6 files each.
To allow longer runs without changing behaviour, increase the timeouts in settings:
```yaml
mcp:
  timeout: 300 # Hard cap (seconds); 0 = disable
  idle_timeout: 120 # Cancel only after no heartbeat for this long; 0 = use only timeout
```
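The progress-aware cancellation logic can be sketched in isolation (an illustration only, not the skill's implementation; `IdleWatchdog` is a hypothetical name):

```python
# Sketch of a progress-aware timeout: cancel only when no heartbeat
# arrived for idle_timeout seconds, and unconditionally at the total
# timeout. monotonic clocks avoid wall-clock jumps.
import time

class IdleWatchdog:
    def __init__(self, idle_timeout=120.0, total_timeout=180.0):
        self.idle_timeout = idle_timeout    # 0 disables the idle check
        self.total_timeout = total_timeout  # 0 disables the hard cap
        self.start = time.monotonic()
        self.last_beat = self.start

    def heartbeat(self):
        # Called every ~10s while repomix / the LLM is still making progress.
        self.last_beat = time.monotonic()

    def should_cancel(self):
        now = time.monotonic()
        if self.total_timeout and now - self.start >= self.total_timeout:
            return True   # hard wall-clock cap
        if self.idle_timeout and now - self.last_beat >= self.idle_timeout:
            return True   # no progress for too long
        return False
```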