# research


General-purpose deep research with multi-source synthesis and confidence-scored findings. Auto-classifies complexity from quick lookup to exhaustive investigation. Cross-validates across independent sources with anti-hallucination verification, contradiction detection, and bias auditing. Produces synthesis products with evidence chains and provenance. Resumable journal sessions. Use when investigating technical topics, academic questions, market analysis, competitive intelligence, architecture decisions, technology evaluation, fact-checking, literature review, or trend analysis. NOT for code review (use honest-review), strategic decisions (use wargame), multi-perspective debate (use host-panel), or simple factual Q&A answerable in one search.


## NPX Install

```sh
npx skill4agent add wyattowalsh/agents research
```


## Deep Research

General-purpose deep research with multi-source synthesis, confidence scoring, and anti-hallucination verification. Adopts SOTA patterns from OpenAI Deep Research (multi-agent triage pipeline), Google Gemini Deep Research (user-reviewable plans), STORM (perspective-guided conversations), Perplexity (source confidence ratings), and LangChain ODR (supervisor-researcher with reflection).

## Vocabulary

| Term | Definition |
|------|------------|
| query | The user's research question or topic; the unit of investigation |
| claim | A discrete assertion to be verified; extracted from sources or user input |
| source | A specific origin of information: URL, document, database record, or API response |
| evidence | A source-backed datum supporting or contradicting a claim; always has provenance |
| provenance | The chain from evidence to source: tool used, URL, access timestamp, excerpt |
| confidence | Score 0.0-1.0 per claim, based on evidence strength and cross-validation |
| cross-validation | Verifying a claim across 2+ independent sources; the core anti-hallucination mechanism |
| triangulation | Confirming a finding using 3+ methodologically diverse sources |
| contradiction | Two credible sources asserting incompatible claims; must be surfaced explicitly |
| synthesis | The final research product: not a summary but a novel integration of evidence with analysis |
| journal | The saved markdown record of a research session, stored in `~/.claude/research/` |
| sweep | Wave 1: broad parallel search across multiple tools and sources |
| deep dive | Wave 2: targeted follow-up on specific leads from the sweep |
| lead | A promising source or thread identified during the sweep, warranting deeper investigation |
| tier | Complexity classification: Quick (0-2), Standard (3-5), Deep (6-8), Exhaustive (9-10) |
| finding | A verified claim with evidence chain, confidence score, and provenance; the atomic unit of output |
| gap | An identified area where evidence is insufficient, contradictory, or absent |
| bias marker | An explicit flag on a finding indicating potential bias (recency, authority, LLM prior, etc.) |
| degraded mode | Operation when research tools are unavailable; confidence ceilings applied |

## Dispatch

| `$ARGUMENTS` | Action |
|--------------|--------|
| Question or topic text (has verb or `?`) | Investigate — classify complexity, execute wave pipeline |
| Vague input (<5 words, no verb, no `?`) | Intake — ask 2-3 clarifying questions, then classify |
| `check <claim>` or `verify <claim>` | Fact-check — verify claim against 3+ search engines |
| `compare <A> vs <B> [vs <C>...]` | Compare — structured comparison with decision matrix output |
| `survey <field or topic>` | Survey — landscape mapping, annotated bibliography |
| `track <topic>` | Track — load prior journal, search for updates since last session |
| `resume [number or keyword]` | Resume — resume a saved research session |
| `list [active\|domain\|tier]` | List — show journal metadata table |
| `archive` | Archive — move journals older than 90 days |
| `delete <N>` | Delete — delete journal N with confirmation |
| `export [N]` | Export — render HTML dashboard for journal N (default: current) |
| Empty | Gallery — show topic examples + "ask me anything" prompt |

## Auto-Detection Heuristic

If no mode keyword matches:

1. Ends with `?` or starts with a question word (who/what/when/where/why/how/is/are/can/does/should/will) → Investigate
2. Contains `vs`, `versus`, `compared to`, or `between` noun phrases → Compare
3. Declarative statement with factual claim, no question syntax → Fact-check
4. Broad field name with no specific question → ask: "Investigate a specific question, or survey the entire field?"
5. Ambiguous → ask: "Would you like me to investigate this question, verify this claim, or survey this field?"
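
Sketched in Python, the rules above form a small ordered classifier. The mode names, the regex, and the word-count proxy for "declarative factual claim" are illustrative assumptions, not the skill's actual implementation:

```python
import re

QUESTION_WORDS = {"who", "what", "when", "where", "why", "how",
                  "is", "are", "can", "does", "should", "will"}

def detect_mode(text: str) -> str:
    """Classify free-form input per the heuristic above; 'ask' means clarify first."""
    t = text.strip().lower()
    words = t.split()
    if not words:
        return "gallery"          # empty arguments -> gallery
    if t.endswith("?") or words[0] in QUESTION_WORDS:
        return "investigate"      # rule 1: question syntax
    if re.search(r"\b(vs\.?|versus|compared to|between)\b", t):
        return "compare"          # rule 2: comparison markers
    if len(words) >= 5:
        return "fact-check"       # rule 3: declarative claim (word count as a crude proxy)
    return "ask"                  # rules 4-5: too vague -> clarify
```

Rule order matters: question syntax wins over comparison markers, so "Should I use Postgres vs MySQL?" is an Investigate, not a Compare.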

## Gallery (Empty Arguments)

Present research examples spanning domains:

| # | Domain | Example | Likely Tier |
|---|--------|---------|-------------|
| 1 | Technology | "What are the current best practices for LLM agent architectures?" | Deep |
| 2 | Academic | "What is the state of evidence on intermittent fasting for longevity?" | Standard |
| 3 | Market | "How does the competitive landscape for vector databases compare?" | Deep |
| 4 | Fact-check | "Is it true that 90% of startups fail within the first year?" | Standard |
| 5 | Architecture | "When should you choose event sourcing over CRUD?" | Standard |
| 6 | Trends | "What emerging programming languages gained traction in 2025-2026?" | Standard |

Pick a number, paste your own question, or type `guide me`.

## Skill Awareness

Before starting research, check whether another skill is a better fit:

| Signal | Redirect |
|--------|----------|
| Code review, PR review, diff analysis | Suggest `/honest-review` |
| Strategic decision with adversaries, game theory | Suggest `/wargame` |
| Multi-perspective expert debate | Suggest `/host-panel` |
| Prompt optimization, model-specific prompting | Suggest `/prompt-engineer` |

If the user confirms they want general research, proceed.

## Complexity Classification

Score the query on 5 dimensions (0-2 each, total 0-10):

| Dimension | 0 | 1 | 2 |
|-----------|---|---|---|
| Scope breadth | Single fact/definition | Multi-faceted, 2-3 domains | Cross-disciplinary, 4+ domains |
| Source difficulty | Top search results suffice | Specialized databases or multiple source types | Paywalled, fragmented, or conflicting sources |
| Temporal sensitivity | Stable/historical | Evolving field (months matter) | Fast-moving (days/weeks matter), active controversy |
| Verification complexity | Easily verifiable (official docs) | 2-3 independent sources needed | Contested claims, expert disagreement, no consensus |
| Synthesis demand | Answer is a fact or list | Compare/contrast viewpoints | Novel integration of conflicting threads |

| Total | Tier | Strategy |
|-------|------|----------|
| 0-2 | Quick | Inline, 1-2 searches, fire-and-forget |
| 3-5 | Standard | Subagent wave, 3-5 parallel searchers, report delivered |
| 6-8 | Deep | Agent team (TeamCreate), 3-5 teammates, interactive session |
| 9-10 | Exhaustive | Agent team, 4-6 teammates + nested subagent waves, interactive |

Present the scoring to the user. The user can override the tier with `--depth <tier>`.
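
The score-to-tier arithmetic can be sketched as follows; only the band boundaries and the five-dimension sum come from the rubric above, while the dimension keys are illustrative:

```python
def classify_tier(scores: dict[str, int]) -> tuple[int, str]:
    """Sum the five 0-2 dimension scores and map the total to a tier."""
    assert len(scores) == 5 and all(0 <= v <= 2 for v in scores.values())
    total = sum(scores.values())
    if total <= 2:
        tier = "Quick"
    elif total <= 5:
        tier = "Standard"
    elif total <= 8:
        tier = "Deep"
    else:
        tier = "Exhaustive"
    return total, tier

# Example: a multi-faceted, fast-moving query scores 1+1+2+1+1 = 6 -> Deep
total, tier = classify_tier({
    "scope": 1, "sources": 1, "temporal": 2,
    "verification": 1, "synthesis": 1,
})
```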

## Wave Pipeline

All non-Quick research follows this 5-wave pipeline; Quick merges Waves 0, 1, and 4 inline.

### Wave 0: Triage (always inline, never parallelized)

1. Run `!uv run python skills/research/scripts/research-scanner.py "$ARGUMENTS"` for a deterministic pre-scan
2. Decompose the query into 2-5 sub-questions
3. Score complexity on the 5-dimension rubric
4. Check tool availability — probe key MCP tools; set degraded-mode flags and confidence ceilings per `references/source-selection.md`
5. Select tools per domain signals — read `references/source-selection.md`
6. Check for existing journals — if `track` or `resume`, load prior state
7. Present triage to the user — show the complexity score, sub-questions, planned strategy, and estimated tier. The user may override.

### Wave 1: Broad Sweep (parallel)

Scale by tier:

**Quick (inline):** 1-2 tool calls sequentially. No subagents.

**Standard (subagent wave):** dispatch 3-5 parallel subagents via the Task tool:

- Subagent A → brave-search + duckduckgo-search for sub-question 1
- Subagent B → exa + g-search for sub-question 2
- Subagent C → context7 / deepwiki / arxiv / semantic-scholar for technical specifics
- Subagent D → wikipedia / wikidata for factual grounding
- [Subagent E → PubMed / openalex if academic domain detected]

**Deep (agent team):** TeamCreate `"research-{slug}"`:

```
Lead: triage (Wave 0), orchestrate, judge reconcile (Wave 3), synthesize (Wave 4)
  |-- web-researcher:       brave-search, duckduckgo-search, exa, g-search
  |-- tech-researcher:      context7, deepwiki, arxiv, semantic-scholar, package-version
  |-- content-extractor:    fetcher, trafilatura, docling, wikipedia, wayback
  |-- [academic-researcher: arxiv, semantic-scholar, openalex, crossref, PubMed]
  |-- [adversarial-reviewer: devil's advocate — counter-search all emerging findings]
```

Spawn academic-researcher if domain signals include academic/scientific. Spawn adversarial-reviewer for the Exhaustive tier or if verification complexity >= 2.

**Exhaustive:** Deep team + each teammate runs nested subagent waves internally.

Each subagent/teammate returns structured findings:

```json
{
  "sub_question": "...",
  "findings": [{"claim": "...", "source_url": "...", "source_tool": "...", "excerpt": "...", "confidence_raw": 0.6}],
  "leads": ["url1", "url2"],
  "gaps": ["could not find data on X"]
}
```
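
Before admitting a payload to the finding pool, the lead can sanity-check it against this contract. A minimal sketch (field names come from the JSON above; the validation rules themselves are assumptions):

```python
REQUIRED_FINDING_KEYS = {"claim", "source_url", "source_tool", "excerpt", "confidence_raw"}

def validate_payload(payload: dict) -> list[str]:
    """Return a list of problems with a subagent's structured findings; empty means OK."""
    problems = []
    for key in ("sub_question", "findings", "leads", "gaps"):
        if key not in payload:
            problems.append(f"missing top-level key: {key}")
    for i, finding in enumerate(payload.get("findings", [])):
        missing = REQUIRED_FINDING_KEYS - finding.keys()
        if missing:
            problems.append(f"finding {i} missing: {sorted(missing)}")
        c = finding.get("confidence_raw")
        if not isinstance(c, (int, float)) or not 0.0 <= c <= 1.0:
            problems.append(f"finding {i} has invalid confidence_raw: {c!r}")
    return problems
```

Rejecting malformed payloads at the wave boundary keeps the Accounting Rule honest: N dispatched means N validated payloads accounted for.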

### Wave 1.5: Perspective Expansion (Deep/Exhaustive only)

STORM-style perspective-guided conversation. Spawn 2-4 perspective subagents:

| Perspective | Focus | Question Style |
|-------------|-------|----------------|
| Skeptic | What could be wrong? What's missing? | "What evidence would disprove this?" |
| Domain Expert | Technical depth, nuance, edge cases | "What do practitioners actually encounter?" |
| Practitioner | Real-world applicability, trade-offs | "What matters when you actually build this?" |
| Theorist | First principles, abstractions, frameworks | "What underlying model explains this?" |

Each perspective agent reviews Wave 1 findings and generates 2-3 additional sub-questions from its viewpoint. These sub-questions feed into Wave 2.

### Wave 2: Deep Dive (parallel, targeted)

1. Rank leads from Wave 1 by potential value (citation frequency, source authority, relevance)
2. Dispatch deep-read subagents — use fetcher/trafilatura/docling to extract full content from top leads
3. Follow citation chains — if a source cites another, fetch the original
4. Fill gaps — for each gap identified in Wave 1, dispatch targeted searches
5. Use thinking MCPs:
   - `cascade-thinking` for multi-perspective analysis of complex findings
   - `structured-thinking` for tracking evidence chains and contradictions
   - `think-strategies` for complex question decomposition (Standard+ only)

### Wave 3: Cross-Validation (parallel)

The anti-hallucination wave. Read `references/confidence-rubric.md` and `references/self-verification.md`.

For every claim surviving Waves 1-2:

1. Independence check — are supporting sources truly independent? Sources citing each other are NOT independent.
2. Counter-search — explicitly search for evidence AGAINST each major claim using a different search engine
3. Freshness check — verify sources are current (flag if >1 year old for time-sensitive topics)
4. Contradiction scan — read `references/contradiction-protocol.md`; identify and classify disagreements
5. Confidence scoring — assign 0.0-1.0 per `references/confidence-rubric.md`
6. Bias sweep — check each finding against 10 bias categories (7 core + 3 LLM-specific) per `references/bias-detection.md`

Self-verification (when 3+ findings survive): spawn a devil's-advocate subagent per `references/self-verification.md`:

For each finding, attempt to disprove it. Search for counterarguments. Check whether evidence is outdated. Verify that claims actually follow from the cited evidence. Flag LLM confabulations.

Adjust confidence: survives +0.05, weakened -0.10, disproven set to 0.0. Adjustments are subject to hard caps — single-source claims remain capped at 0.60 even after a survival adjustment.
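
The adjustment arithmetic with the hard caps re-applied afterwards, as a sketch (the outcome labels mirror the three cases above; the rounding step is an assumption):

```python
SINGLE_SOURCE_CAP = 0.60

def adjust_confidence(score: float, outcome: str, independent_sources: int) -> float:
    """Apply the devil's-advocate adjustment, then re-apply hard caps."""
    delta = {"survives": +0.05, "weakened": -0.10, "disproven": None}[outcome]
    if delta is None:
        return 0.0  # disproven findings drop to zero outright
    score = max(0.0, min(1.0, score + delta))
    if independent_sources < 2:
        score = min(score, SINGLE_SOURCE_CAP)  # the single-source cap survives the bonus
    return round(score, 2)
```

The cap ordering is the point: a single-source claim at 0.60 that survives the devil's advocate stays at 0.60 rather than rising to 0.65.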

### Wave 4: Synthesis (always inline, lead only)

Produce the final research product. Read `references/output-formats.md` for templates.

The synthesis is NOT a summary. It must:

1. Answer directly — answer the user's question clearly
2. Map evidence — all verified findings with confidence and citations
3. Surface contradictions — where sources disagree, with analysis of why
4. Show the confidence landscape — what is known confidently, what is uncertain, what is unknown
5. Audit biases — biases detected during research
6. Identify gaps — what evidence is missing, what further research would help
7. Distill takeaways — 3-7 numbered key findings
8. Cite sources — full bibliography with provenance

Output format adapts to mode:

- Investigate → Research Brief (Standard) or Deep Report (Deep/Exhaustive)
- Fact-check → Quick Answer with verdict + evidence
- Compare → Decision Matrix
- Survey → Annotated Bibliography
- The user can override with `--format brief|deep|bib|matrix`

## Confidence Scoring

| Score | Basis |
|-------|-------|
| 0.9-1.0 | Official docs + 2 independent sources agree, no contradictions |
| 0.7-0.8 | 2+ independent sources agree, minor qualifications |
| 0.5-0.6 | Single authoritative source, or 2 sources with partial agreement |
| 0.3-0.4 | Single non-authoritative source, or conflicting evidence |
| 0.2-0.3 | Multiple non-authoritative sources with partial agreement, or single source with significant caveats |
| 0.1-0.2 | LLM reasoning only, no external evidence found |
| 0.0 | Actively contradicted by evidence |

Hard rules:

- No claim reported at >= 0.7 unless supported by 2+ independent sources
- Single-source claims cap at 0.6 regardless of source authority
- Degraded mode (all research tools unavailable): max confidence 0.4, all findings labeled "unverified"

Merged confidence (for claims supported by multiple sources): `c_merged = 1 - (1-c1)(1-c2)...(1-cN)`, capped at 0.99.
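
The merge formula treats each source's confidence as an independent probability of being correct, so it can be computed directly; a minimal sketch:

```python
from functools import reduce

def merge_confidence(scores: list[float]) -> float:
    """c_merged = 1 - (1 - c1)(1 - c2)...(1 - cN), capped at 0.99."""
    residual = reduce(lambda acc, c: acc * (1.0 - c), scores, 1.0)
    return min(1.0 - residual, 0.99)
```

Two moderate sources reinforce each other (0.6 and 0.5 merge to 0.80), while the 0.99 cap keeps even many strong sources short of certainty.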

## Evidence Chain Structure

Every finding carries this structure:

```
FINDING RR-{seq:03d}: [claim statement]
  CONFIDENCE: [0.0-1.0]
  EVIDENCE:
    1. [source_tool] [url] [access_timestamp] — [relevant excerpt, max 100 words]
    2. [source_tool] [url] [access_timestamp] — [relevant excerpt, max 100 words]
  CROSS-VALIDATION: [agrees|contradicts|partial] across [N] independent sources
  BIAS MARKERS: [none | list of detected biases with category]
  GAPS: [none | what additional evidence would strengthen this finding]
```

Use `!uv run python skills/research/scripts/finding-formatter.py --format markdown` to normalize.
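
As an illustration of what normalization should produce (the authoritative logic lives in `finding-formatter.py`; the function below is a hypothetical sketch that only follows the template's field layout):

```python
def format_finding(seq, claim, confidence, evidence, cross_validation="agrees",
                   bias_markers=None, gaps=None):
    """Render one finding in the evidence-chain template above.

    evidence: list of (source_tool, url, access_timestamp, excerpt) tuples.
    """
    lines = [f"FINDING RR-{seq:03d}: {claim}",
             f"  CONFIDENCE: {confidence:.2f}",
             "  EVIDENCE:"]
    for i, (tool, url, ts, excerpt) in enumerate(evidence, start=1):
        lines.append(f"    {i}. [{tool}] [{url}] [{ts}] — {excerpt}")
    lines.append(f"  CROSS-VALIDATION: [{cross_validation}] across "
                 f"[{len(evidence)}] independent sources")
    lines.append(f"  BIAS MARKERS: [{', '.join(bias_markers) if bias_markers else 'none'}]")
    lines.append(f"  GAPS: [{', '.join(gaps) if gaps else 'none'}]")
    return "\n".join(lines)
```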

## Source Selection

Read `references/source-selection.md` during Wave 0 for the full tool-to-domain mapping. Summary:

| Domain Signal | Primary Tools | Secondary Tools |
|---------------|---------------|-----------------|
| Library/API docs | context7, deepwiki, package-version | brave-search |
| Academic/scientific | arxiv, semantic-scholar, PubMed, openalex | crossref, brave-search |
| Current events/trends | brave-search, exa, duckduckgo-search, g-search | fetcher, trafilatura |
| GitHub repos/OSS | deepwiki, repomix | brave-search |
| General knowledge | wikipedia, wikidata, brave-search | fetcher |
| Historical content | wayback, brave-search | fetcher |
| Fact-checking | 3+ search engines mandatory | wikidata for structured claims |
| PDF/document analysis | docling | trafilatura |

Multi-engine protocol: for any claim requiring verification, use a minimum of 2 different search engines. Different engines have different indices and biases; agreement across engines increases confidence.

## Bias Detection

Check every finding against 10 bias categories. Read `references/bias-detection.md` for full detection signals and mitigation strategies.

| Bias | Detection Signal | Mitigation |
|------|------------------|------------|
| LLM prior | Matches common training patterns, lacks fresh evidence | Flag; require fresh source confirmation |
| Recency | Overweighting recent results, ignoring historical context | Search for historical perspective |
| Authority | Uncritically accepting prestigious sources | Cross-validate even authoritative claims |
| Confirmation | Queries constructed to confirm the initial hypothesis | Use neutral queries; search for counterarguments |
| Survivorship | Only finding successful examples | Search for failures/counterexamples |
| Selection | Search-engine bubble, English-only coverage | Use multiple engines; note coverage limitations |
| Anchoring | First source disproportionately shapes interpretation | Document the first source separately; seek contrast |

## State Management

- Journal path: `~/.claude/research/`
- Archive path: `~/.claude/research/archive/`
- Filename convention: `{YYYY-MM-DD}-{domain}-{slug}.md`
  - `{domain}`: `tech`, `academic`, `market`, `policy`, `factcheck`, `compare`, `survey`, `track`, `general`
  - `{slug}`: 3-5 word semantic summary, kebab-case
  - Collision: append `-v2`, `-v3`
- Format: YAML frontmatter + markdown body + `<!-- STATE -->` blocks
Save protocol:

- Quick: save once at the end with `status: Complete`
- Standard/Deep/Exhaustive: save after Wave 1 with `status: In Progress`, update after each wave, finalize after synthesis

Resume protocol:

1. `resume` (no args): find `status: In Progress` journals. One → auto-resume. Multiple → show list.
2. `resume N`: Nth journal from `list` output (reverse chronological).
3. `resume keyword`: search frontmatter `query` and `domain_tags` for a match.

Use `!uv run python skills/research/scripts/journal-store.py` for all journal operations.
State snapshot (appended after each wave save):

```html
<!-- STATE
wave_completed: 2
findings_count: 12
leads_pending: ["url1", "url2"]
gaps: ["topic X needs more sources"]
contradictions: 1
next_action: "Wave 3: cross-validate top 8 findings"
-->
```

## In-Session Commands (Deep/Exhaustive)

Available during active research sessions:

| Command | Effect |
|---------|--------|
| `drill <finding #>` | Deep dive into a specific finding with more sources |
| `pivot <new angle>` | Redirect research to a new sub-question |
| `counter <finding #>` | Explicitly search for evidence against a finding |
| `export` | Render HTML dashboard |
| `status` | Show current research state without advancing |
| `sources` | List all sources consulted so far |
| `confidence` | Show confidence distribution across findings |
| `gaps` | List identified knowledge gaps |
| `?` | Show command menu |

Read `references/session-commands.md` for full protocols.

## Reference File Index

| File | Content | Read When |
|------|---------|-----------|
| `references/source-selection.md` | Tool-to-domain mapping, multi-engine protocol, degraded mode | Wave 0 (selecting tools) |
| `references/confidence-rubric.md` | Scoring rubric, cross-validation rules, independence checks | Wave 3 (assigning confidence) |
| `references/evidence-chain.md` | Finding template, provenance format, citation standards | Any wave (structuring evidence) |
| `references/bias-detection.md` | 10 bias categories (7 core + 3 LLM-specific), detection signals, mitigation strategies | Wave 3 (bias audit) |
| `references/contradiction-protocol.md` | 4 contradiction types, resolution framework | Wave 3 (contradiction detection) |
| `references/self-verification.md` | Devil's advocate protocol, hallucination detection | Wave 3 (self-verification) |
| `references/output-formats.md` | Templates for all 5 output formats | Wave 4 (formatting output) |
| `references/team-templates.md` | Team archetypes, subagent prompts, perspective agents | Wave 0 (designing team) |
| `references/session-commands.md` | In-session command protocols | When user issues in-session command |
| `references/dashboard-schema.md` | JSON data contract for HTML dashboard | `export` command |

Loading rule: load ONE reference at a time per the "Read When" column. Do not preload.

## Critical Rules

1. No claim >= 0.7 unless supported by 2+ independent sources — single-source claims cap at 0.6
2. Never fabricate citations — if a URL, author, title, or date cannot be verified, use vague attribution ("a study in this tradition") rather than inventing specifics
3. Always surface contradictions explicitly — never silently resolve disagreements; present both sides with evidence
4. Always present triage scoring before executing research — the user must see, and can override, the complexity tier
5. Save the journal after every wave in Deep/Exhaustive mode — enables resume after interruption
6. Never skip Wave 3 (cross-validation) for Standard/Deep/Exhaustive tiers — it is the anti-hallucination mechanism
7. Multi-engine search is mandatory for fact-checking — use a minimum of 2 different search tools (e.g., brave-search + duckduckgo-search)
8. Apply the Accounting Rule after every parallel dispatch — N dispatched = N accounted for before proceeding to the next wave
9. Distinguish facts from interpretations in all output — factual claims carry evidence; interpretive claims are explicitly labeled as analysis
10. Flag all LLM-prior findings — claims matching common training data but lacking fresh evidence must carry a bias marker
11. Max confidence 0.4 in degraded mode — when all research tools are unavailable, report all findings as "unverified — based on training knowledge"
12. Load ONE reference file at a time — do not preload all references into context
13. Track mode must load the prior journal before searching — avoid re-researching what is already known
14. The synthesis is not a summary — it must integrate findings into novel analysis, identify patterns across sources, and surface emergent insights not present in any single source
15. The PreToolUse Edit hook is non-negotiable — the research skill never modifies source files; it only creates/updates journals in `~/.claude/research/`