# Researcher Skill
Sharded Deep Research for analyzing large codebases. Uses LangGraph with Map-Plan-Loop-Synthesize architecture to handle repositories that exceed LLM context limits.
## Architecture
```
┌─────────┐     ┌──────────────┐     ┌────────────────┐     ┌──────────────┐
│  Setup  │ --> │  Architect   │ --> │ Process Shard  │ --> │  Synthesize  │
│  Clone  │     │   (Plan)     │     │    (Loop)      │     │   Index.md   │
└─────────┘     └──────────────┘     └────────────────┘     └──────────────┘
                        │                     │
                   3-5 shards            compress
                   defined by            + analyze
                   LLM                   each shard
```

## Commands
### run_research_graph
[CORE] Execute the Sharded Deep Research Workflow.

This autonomously:
- Clones the repository to a temporary workspace
- Maps the file structure (god view)
- Plans 3-5 logical analysis shards (subsystems) via LLM
- Iterates through each shard:
  - Compress with repomix (shard-specific config)
  - Analyze with LLM
  - Save shard analysis to `shards/<id>_<name>.md`
- Synthesizes `index.md` linking all shard analyses
Parameters:
- `repo_url` (string, required): Git repository URL to analyze
- `request` (string, optional): Research goal/focus (default: "Analyze the architecture")
- `visualize` (bool, optional): If true, return workflow diagram only
- `chunked` (bool, optional): If true, use step-by-step actions (like knowledge recall)
- `action` (string, optional): When chunked: `"start"` | `"shard"` | `"synthesize"`
- `session_id` (string, optional): When chunked: required for `shard` and `synthesize` (returned from `start`)
- `chunk_id` (string, optional): When chunked + `action="shard"`, run one specific chunk (e.g. `c1`)
- `chunk_ids` (list[string], optional): When chunked + `action="shard"`, run multiple chunks in parallel in one call
- `max_concurrent` (int, optional): Max concurrent shard LLM calls; null = unbounded. Set to 6–8 if the API rate-limits (429). Falls back to `researcher.max_concurrent` in settings.
Chunked mode (step-by-step):
- Call with `chunked=true`, `action="start"` → returns `session_id`, `chunk_plan` (`c1`, `c2`, ...), `next_action`
- Call with `chunked=true`, `action="shard"`, `session_id=<from start>`, and either `chunk_id=<cx>` for one chunk, or `chunk_ids=[...]` for parallel chunk execution; if omitted, all pending chunks are executed in parallel in that call
- Call with `chunked=true`, `action="synthesize"`, `session_id=<same>` after all chunks complete
State is persisted in the checkpoint store under workflow type `research_chunked`.

Returns:

```json
{
  "success": true,
  "harvest_dir": "/path/to/.data/harvested/<owner>/<repo_name>/",
  "shards_analyzed": 4,
  "revision": "abc1234",
  "shard_summaries": [
    "- **[Core Kernel](./shards/01_core_kernel.md)**: Main business logic",
    "- **[API Layer](./shards/02_api_layer.md)**: HTTP handlers"
  ],
  "summary": "Research Complete!..."
}
```

Output Location:
```
.data/harvested/<owner>/<repo_name>/
├── index.md                # Master index with YAML frontmatter (includes revision)
└── shards/
    ├── 01_core_kernel.md   # Shard 1 analysis
    ├── 02_api_layer.md     # Shard 2 analysis
    └── ...
```

`index.md` Frontmatter:
```yaml
---
title: Research Analysis: <repo_name>
source: <repo_url>
revision: <git_hash>
revision_date: <YYYY-MM-DD HH:MM:SS TZ>
generated: <YYYY-MM-DD>
shards: <count>
---
```
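The frontmatter can be read back without a YAML dependency. A minimal sketch, assuming only the `---`-delimited layout shown above; the helper name is hypothetical:

```python
# Illustrative frontmatter reader for a generated index.md.
# Assumes the file opens with "---" and the block closes with "---".
def read_frontmatter(text: str) -> dict:
    lines = text.splitlines()
    assert lines[0] == "---", "expected frontmatter to open the file"
    meta = {}
    for line in lines[1:]:
        if line == "---":  # closing delimiter ends the frontmatter
            break
        key, _, value = line.partition(": ")
        meta[key] = value
    return meta
```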
## Usage Example
```python
# Analyze a repository's security patterns
await researcher.run_research_graph(
    repo_url="https://github.com/example/large-repo",
    request="Analyze security patterns and vulnerability surfaces"
)
```

Result: Multiple shard analyses saved to `.data/harvested/`

## Technical Details
- Repomix: Used directly (not via npx) for code compression
- Sharding: LLM (architect) proposes subsystems; normalization enforces efficient bounds
- Loop: Conditional edges in LangGraph process shards until the queue is empty
- Checkpoint: MemorySaver enables resumption of interrupted workflows
- Chunked API: Same workflow type as knowledge recall; one step per MCP call via `action` and `session_id`
## Efficient sharding design
To avoid timeouts and unbalanced work, sharding is constrained and normalized:
- Architect prompt limits
  - At most 5 files per shard, total files ≤ 25 across all shards.
  - 4–6 subsystems; explicit “stay under limits” so the LLM does not propose oversized shards.
- Post-architect normalization (`_normalize_shards`)
  - Split: Any shard with > 5 files is split into multiple shards (e.g. “Core (1)”, “Core (2)”).
  - Cap: Total files across all shards are capped at 30; excess is trimmed from the end.
  - Merge: Consecutive shards with ≤ 2 files each are merged into one shard (up to 5 files) to reduce round-trips and balance size.
- Per-shard processing limits
  - Repomix output per shard capped at 32k chars; subprocess timeout 120s; run in an executor so the heartbeat can run.
  - LLM input 28k chars, output 4096 tokens.

Result: each `action="shard"` call runs on a bounded amount of code and stays within the MCP idle/total timeout when the heartbeat is used.
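As a sketch of the split → cap → merge rules above: the function and parameter names mirror the docs, but the exact ordering and tie-breaking inside the real `_normalize_shards` are assumptions.

```python
# Illustrative normalization of (name, files) shards: split oversized
# shards, cap the total file count, then merge consecutive tiny shards.
def normalize_shards(shards, max_files=5, total_cap=30, merge_threshold=2):
    # Split: any shard with more than max_files files becomes several,
    # named like "Core (1)", "Core (2)".
    split = []
    for name, files in shards:
        if len(files) <= max_files:
            split.append((name, list(files)))
        else:
            parts = [files[i:i + max_files] for i in range(0, len(files), max_files)]
            split += [(f"{name} ({n})", p) for n, p in enumerate(parts, 1)]

    # Cap: trim total files across all shards from the end.
    budget, capped = total_cap, []
    for name, files in split:
        kept = files[:budget]
        budget -= len(kept)
        if kept:
            capped.append((name, kept))

    # Merge: fold consecutive small shards together, up to max_files files.
    merged = []
    for name, files in capped:
        if (merged and len(files) <= merge_threshold
                and len(merged[-1][1]) <= merge_threshold
                and len(merged[-1][1]) + len(files) <= max_files):
            merged[-1] = (merged[-1][0], merged[-1][1] + files)
        else:
            merged.append((name, list(files)))
    return merged
```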
## Performance & timeouts
Shard processing is tuned and uses a progress-aware timeout:
- Idle timeout (`mcp.idle_timeout`, default 120s): Cancel only when there is no progress for this long. The researcher calls `heartbeat()` every 10s during repomix and LLM, so the runner does not kill the tool while it is still working.
- Total timeout (`mcp.timeout`, default 180s): Hard cap (wall-clock); 0 = disable.
- Repomix: Output capped at 32k chars per shard; subprocess timeout 120s; run in an executor so the heartbeat can run.
- LLM: Input 28k chars, output 4096 tokens; the architect prefers 4–6 shards with 3–6 files each.

To allow longer runs without changing behaviour, increase the timeouts in settings:
```yaml
mcp:
  timeout: 300       # Hard cap (seconds); 0 = disable
  idle_timeout: 120  # Cancel only after no heartbeat for this long; 0 = use only timeout
```
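The idle-versus-total distinction can be sketched with asyncio. Modeling `heartbeat()` as a callback handed to the work coroutine is an assumption about the real MCP runner; the watchdog below is illustrative only:

```python
import asyncio
import time

# Progress-aware watchdog sketch: cancel after idle_timeout seconds with
# no heartbeat, or after total_timeout seconds wall-clock (0 = disabled).
async def run_with_watchdog(work, idle_timeout=120, total_timeout=180):
    start = time.monotonic()
    last_beat = start

    def heartbeat():  # the tool calls this periodically (~every 10s)
        nonlocal last_beat
        last_beat = time.monotonic()

    task = asyncio.ensure_future(work(heartbeat))
    while not task.done():
        await asyncio.sleep(0.05)
        now = time.monotonic()
        if now - last_beat > idle_timeout:
            task.cancel()
            raise TimeoutError("idle timeout: no progress")
        if total_timeout and now - start > total_timeout:
            task.cancel()
            raise TimeoutError("total timeout: wall-clock cap")
    return task.result()
```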