Deep Research (Deep Research Orchestration Workflow)
Treat "deep research" as a reusable, parallelizable production process. The main controller is responsible for clarifying goals, splitting sub-goals, scheduling child processes, and aggregating and refining; child processes are responsible for collecting, extracting, and locally analyzing material and outputting structured Markdown; the final deliverable must be a standalone finished file rather than a chat post.
Key Constraints (Must Be Followed)
- Keep default model and configuration unchanged: Do not explicitly override the default model/inference settings or pass extra overrides for them; adjust these configurations only when explicitly authorized by the user.
- Default minimum permissions: Child processes run in `--sandbox workspace-write` by default; enable permissions such as network access only when necessary. If a sub-task must execute commands that require shell networking (such as /), add `-c sandbox_workspace_write.network_access=true` to the invocation.
- Network access prioritizes skills, then MCP: Prefer installed skills first; if MCP must be used, prioritize , then ; consider / only when the above truly cannot meet the requirements.
- Non-interactive friendly: Child processes do not use the plan tool and do not wait for user confirmation or feedback; they focus on writing files to disk and keeping traceable logs.
- File delivery first: The final deliverable must be saved as an independent file; it is prohibited to post the complete draft in chat.
- Output decision and progress logs at each step: Especially during splitting, scheduling, aggregation, refinement, and before delivery.
Task Objectives
- Derive a set of parallel sub-goals from the user's high-level goals (such as link lists, data shards, module lists, time slices, etc.).
- Launch independent child processes for each sub-goal and assign appropriate permissions (default sandbox; enable network access if necessary).
- Execute in parallel and produce sub-reports (natural language Markdown, which can include sections/tables/lists); output error descriptions with reasons and follow-up suggestions if failures occur.
- Aggregate sub-outputs in order using scripts to generate a unified draft.
- Conduct sanity checks and minimal fixes on the draft, then provide the final artifact path and a summary of key findings.
Delivery Standards
- Deliverables must be structured, insight-driven overall finished products; it is prohibited to directly splice sub-task Markdown as the final draft.
- When it is necessary to retain the original text of sub-tasks, save it as an internal file (e.g., `.research/<name>/aggregated_raw.md`), and only absorb key insights/evidence into the finished product.
- Refinement and revision should be iterated chapter by chapter and paragraph by paragraph; do not delete the entire draft and rewrite it at once; check references, data and context after each modification to ensure traceability.
- Deliver detailed, in-depth analytical reports by default.
- Conduct "double inspection" before delivery:
- Check whether it is truly produced through "chapter-by-chapter, multi-round integration"; if it is generated in one go, return it to rewrite by chapters.
- Evaluate whether it is detailed enough; if it is too thin, first judge whether the cause is "insufficient material from sub-tasks" or "over-compression during finalization": the former calls for supplementary or additional research, while the latter means continuing to expand and refine the existing material until it meets the detail standard.
End-to-End Process (Strictly Follow the Order)
1. Pre-execution Planning and Assessment (Mandatory; Completed by the Main Controller)
- First clarify goals, risks, resource/permission constraints, and identify core dimensions of subsequent diffusion dependencies (theme clusters, people/organizations, regions, time slices, etc.).
- If public directories/indexes (tab pages, API lists, etc.) exist, crawl and cache them in a minimal way and count entries; if not, conduct "desk research" to obtain real samples (news, materials, datasets, etc.), record sources/time/key points as evidence.
- Show at least one representative sample from real retrieval or browsing before forming the list; relying solely on speculation from experience does not count as completing the assessment.
- During the assessment phase, must obtain real samples through a "traceable toolchain" at least once and record references: prioritize using installed skills; if MCP is needed, prioritize , then ; if neither is available, record the reason and choose an alternative solution (downgrade to minimal direct network crawling only when necessary).
- Output an initial (or draft) list: list the discovered dimensions, options and samples mastered in each dimension, scale estimation, and mark uncertainties/gaps. If no real samples have been obtained yet, complete the research first and prohibit proceeding to the next step.
- Based on the above structure, complete the executable plan (splitting, scripts/tools, output format, permissions, timeout strategy, etc.), report the dimension statistics and plan content in the user's language, and wait until a clear "execute/start" response is received before proceeding.
2. Initialization and Overall Planning
- Clarify goals, expected output format and evaluation criteria.
- Generate a semantic and non-repetitive name for the current task (recommended format: `<YYYYMMDD>-<short-title>-<random-suffix>`, all lowercase, hyphen-separated, no spaces).
- Create a running directory , and save all products in this directory (subdirectories such as , , , , , ).
- Keep default model and configuration unchanged; obtain user consent first when needing to adjust any model/inference/permission-related settings, and note the reason for the change and scope of impact in the logs.
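The naming and directory setup above can be sketched as follows; `ai-chips` is a hypothetical topic slug, and the random suffix is drawn from `/dev/urandom`:

```shell
# Build a run name of the form <YYYYMMDD>-<short-title>-<random-suffix>
short_title="ai-chips"                              # hypothetical topic slug
suffix="$(od -An -N3 -tx1 /dev/urandom | tr -d ' \n')"   # 6 random hex chars
run_name="$(date +%Y%m%d)-${short_title}-${suffix}"
mkdir -p ".research/${run_name}"                    # running directory for all products
echo "$run_name"
```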
3. Sub-goal Identification
- Extract or construct a list of sub-goals through scripts/commands.
- When source data is insufficient (e.g., the page only provides two main links), record the reason truthfully, and then the main process directly takes over to complete the remaining work.
4. Generate Scheduling Script
- Create a scheduling script (e.g., `.research/<name>/run_children.sh`) that must:
- Receive the list of sub-goals (can be stored in JSON/CSV) and schedule them one by one.
- Construct calls for each sub-goal, recommended key points:
- Recommended form: `codex exec --full-auto --sandbox workspace-write ...` (refer to for details).
- State in the prompt: all networking requirements prioritize installed skills (skill priority); if MCP must be used, prioritize , then ; use / only when truly unavoidable; do not use the plan tool or wait for manual interaction.
- Do not pass unless required by the user, and do not use additional overrides for the default model/inference settings; consider adjusting them only when explicitly authorized by the user and the result quality is genuinely insufficient.
- Specify the output path for sub-results (e.g., `.research/<name>/child_outputs/<id>.md`).
- Explicitly prohibit the use of deprecated parameters (such as , , ), and remind to run first to get the latest instructions. The following call template can be referenced (it only demonstrates parameters and does not involve parallelism):

```bash
timeout 600 codex exec --full-auto --sandbox workspace-write \
  --output-last-message "$output_file" \
  - <"$prompt_file"
```
- If child processes are allowed to execute commands that require shell networking (such as /), append `-c sandbox_workspace_write.network_access=true` to the call.
- Set timeouts by task scale: start with 5 minutes () for small tasks and relax to at most 15 minutes () for larger ones, with an external command as a fallback. On the first 5-minute timeout, decide based on the actual task whether to split it or adjust parameters and retry; if it still cannot finish within 15 minutes, treat the prompt/process as needing investigation.
- For small-scale tasks (<8), use loops plus background jobs (or queue control) to achieve parallelism, avoiding failures caused by command-line length limits; for large-scale tasks, use /GNU Parallel, but verify parameter expansion at a small scale first. The default parallelism is 8, adjustable according to hardware or quotas.
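A minimal sketch of the loop-plus-background-jobs pattern, assuming bash. Here `run_child` is a placeholder standing in for the real `timeout 600 codex exec --full-auto --sandbox workspace-write ...` call, and the goal ids and paths are hypothetical:

```shell
#!/usr/bin/env bash
out_dir=".research/demo-dispatch/child_outputs"
mkdir -p "$out_dir"
max_jobs=8

run_child() {
  # placeholder: a real child would read a prompt file and write a Markdown report
  printf '# report for %s\n' "$1" > "$out_dir/$1.md"
}

for id in goal-1 goal-2 goal-3 goal-4; do
  # throttle: sleep while the number of running background jobs is at the cap
  while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
    sleep 0.2    # simple throttle; xargs -P or GNU Parallel scale better
  done
  run_child "$id" &
done
wait             # block until all remaining children finish
ls "$out_dir"
```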
- Do not replace parallelism with serial one-by-one execution; do not bypass the established process with shortcuts such as having the main process search on its own.
- Capture the exit code of each child process and write logs to the running directory; use methods like `stdbuf -oL -eL codex exec … | tee .research/<name>/logs/<id>.log` to ensure real-time flushing, which makes progress easy to observe with .
- Note that does not provide parameters such as and ; files must be written through pipes, and with multiple pipe stages the exit code must be read at the correct index (e.g., bash's `PIPESTATUS`). Review available parameters with before running.
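A short sketch of reading the first pipe stage's exit code in bash; a subshell stands in for the real `codex exec` call, and the log path is illustrative:

```shell
# After `cmd | tee log`, $? reflects tee, not cmd; use PIPESTATUS instead.
mkdir -p .research/demo-logs
( echo "child says hello"; exit 3 ) | tee .research/demo-logs/demo.log
rc=${PIPESTATUS[0]}          # exit code of the first stage, not of tee
echo "child exit code: $rc"
```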
- When data volume is sufficient, the main controller should avoid taking on heavy tasks such as downloading/parsing itself; assign these to child processes, and let the main controller focus on prompt, template and environment preparation.
5. Design Child Process Prompt
- Dynamically generate a prompt template, which must include at least:
- Sub-goal description, input data, constraint boundaries.
- Limit the total number of rounds of network retrieval/extraction during the planning phase to no more than X (selected according to complexity; usually 10 is recommended), and converge when information is sufficient; tool priority: skills → MCP ( → ) → minimal direct crawling.
- Output results in natural language Markdown: including conclusions, key evidence lists, reference links; provide error descriptions and follow-up suggestions in Markdown format if errors occur.
- When generating actual prompt files, prefer /line-by-line writing to inject variables, avoiding the known Bash 3.2 issue of truncating variables in multi-byte character scenarios.
- Write the template to a file (e.g., `.research/<name>/child_prompt_template.md`) for auditing and reuse.
- Before starting the scheduling script, quickly review the generated prompt files one by one (e.g., `cat .research/<name>/prompts/<id>.md`), and dispatch tasks only after confirming that variable substitution is correct and the instructions are complete.
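One way to render a prompt file line by line, injecting each variable in its own `printf`; the sub-goal text and paths are hypothetical. This avoids expanding a large template in a single shell substitution, which is where old bash versions can truncate multi-byte content:

```shell
goal="summarize module A"                 # hypothetical sub-goal
out=".research/demo-prompts/prompts/module-a.md"
mkdir -p "$(dirname "$out")"
{
  printf '# Sub-goal: %s\n' "$goal"
  printf 'Input: %s\n' "data/module_a.json"
  printf 'Output: write a Markdown report to %s\n' ".research/demo-prompts/child_outputs/module-a.md"
} > "$out"
cat "$out"
```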
6. Parallel Execution and Monitoring
- Run the scheduling script.
- Record the start/end time, duration and status of each child process.
- Make clear decisions on failed/timed-out child processes: mark, retry, or explain them in the final report; when the 15-minute limit is hit, record that the prompt/process needs investigation. During long tasks, users can be pointed to `tail -f .research/<name>/logs/<id>.log` to track real-time output.
7. Programmatic Aggregation (Generate Draft)
- Use a script (e.g., `.research/<name>/aggregate.py`) to read all Markdown files under `.research/<name>/child_outputs/` and aggregate them in the preset order into an initial main document (e.g., `.research/<name>/final_report.md`).
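A shell sketch of the aggregation step, concatenating child outputs in a preset order (not glob order) and inserting a placeholder section for any missing file. The goal ids are hypothetical and the seeding lines exist only for this demo:

```shell
base=".research/demo-aggregate"
mkdir -p "$base/child_outputs"
printf '## goal-1\nfindings A\n' > "$base/child_outputs/goal-1.md"   # demo input
printf '## goal-2\nfindings B\n' > "$base/child_outputs/goal-2.md"   # demo input

: > "$base/final_report.md"               # start the draft empty
for cid in goal-1 goal-2 goal-3; do       # preset order; goal-3 is deliberately missing
  f="$base/child_outputs/$cid.md"
  if [ -s "$f" ]; then
    cat "$f" >> "$base/final_report.md"
  else
    printf '## %s\n(missing: see logs)\n' "$cid" >> "$base/final_report.md"
  fi
done
```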
8. Interpret Aggregation Results and Design Structure
- Read through `.research/<name>/final_report.md` and the key sub-outputs.
- Design the chapter outline of the refined report and a "material mapping" (e.g., `.research/<name>/polish_outline.md`), clarifying the target audience, chapter order, and each chapter's core arguments.
9. Chapter-by-Chapter Refinement and Finalization
- Create a refined draft (e.g., `.research/<name>/polished_report.md`) and write it chapter by chapter according to the outline; self-check facts, references and language requirements immediately after each chapter, tracing back to sub-drafts for verification when necessary.
- Avoid rewriting the entire draft at once; adhere to "chapter-by-chapter iteration" to maintain consistency and reduce the risk of omissions, while recording the highlights, problems and handling methods of each chapter.
- Consolidate duplicate information, unify citation formats, and collect items to be confirmed, while retaining core facts and quantitative data.
10. Delivery
- Confirm that the refined draft meets the delivery standards (complete structure, unified tone, accurate references), and use this finished product as the external report.
- The final deliverable must be saved as an independent file (located in ); report to the user by providing the file path and necessary summary, and it is prohibited to post the complete draft in chat.
- Outline core conclusions and actionable suggestions in the final reply; supplement follow-up methods for items to be confirmed if necessary.
- Do not attach intermediate drafts or internal notes externally to ensure that users see high-quality finished products.
Notes
- Keep the process idempotent: Generate a new name for each run to avoid overwriting old files.
- All structured outputs must be valid UTF-8 text.
- Elevate permissions only when authorized or truly necessary; avoid `--dangerously-bypass-approvals-and-sandbox`.
- Be cautious when cleaning up temporary resources to ensure that logs and outputs are traceable.
- Provide graceful degradation for failed processes: Attempt crawling tasks at least twice; if they still fail, add a "Failure Reasons/Follow-up Suggestions" section in Markdown so that aggregation has no blanks.
- Cache first: Raw materials obtained through skills/MCP should be written to cache directories such as first, and local cache should be prioritized for subsequent processing to reduce repeated requests.
- Understand completely before summarizing: Process the complete original text before summarizing/extracting; do not mechanically truncate to a fixed length (e.g., the first 500 characters). You can write scripts for full-text parsing, key sentence extraction or key point generation, but do not rely on "hard truncation".
- Temporary directory isolation: Intermediate products (script logs, parsing results, cache, debugging outputs, etc.) are placed in subdirectories such as , , , and can be cleaned up as needed after the process ends.
- Search service priority: Prioritize using installed skills for networking operations; if MCP is needed, first check available MCP servers (e.g., run ), and prioritize , then ; fall back to minimal direct crawling capability when MCP is unavailable.
- MCP parameter and output control: For tools that may return excessively large results, avoid requesting fields like "raw full text" to prevent response bloat; if necessary, extract in segments, list directories first and then delve into details as needed.
- Image retrieval: If MCP supports image search/description, enable it and present image clues together with text evidence unless the user explicitly requires "plain text only".
General Experience and Best Practices
- Verify environment assumptions first: Before writing the scheduling script, use / to confirm that key paths (such as , resource directories) exist; if necessary, derive the repository root path with and pass it in as a parameter to avoid hardcoding.
- Make extraction logic configurable: Do not assume that web pages share the same DOM; parsing scripts should provide configurable selectors/boundary conditions/readability parsers, and only need to modify configurations when reused across sites.
- Run through a small scale before parallelizing: Before full parallelism, run 1–2 sub-goals serially to verify the agent configuration, the skills/MCP toolchain, and the output paths; increase concurrency only after the pipeline is confirmed stable, so errors remain visible instead of surfacing mid-flight.
- Hierarchical logs for easy tracing: The scheduler writes to `.research/<name>/dispatcher.log`; each sub-task writes to its own `.research/<name>/logs/<id>.log`, and when a failure occurs, inspect the corresponding log directly to locate MCP/call details.
- Failure isolation and retry: When parallel failures occur, first record the failed ID and logs, and prioritize retrying individual failed tasks; maintain a list and uniformly prompt follow-up suggestions during the final stage.
- Avoid repeated crawling: Before retrying, check whether `.research/<name>/child_outputs/<id>.md` already exists and is valid; skip it if so, to reduce quota consumption and repeated access.
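A minimal retry-guard sketch: skip a sub-goal whose output file already exists and is non-empty. The path is illustrative, and the seeded file simulates a previous successful run:

```shell
out=".research/demo-retry/child_outputs/goal-1.md"
mkdir -p "$(dirname "$out")"
printf '# existing report\n' > "$out"   # pretend a previous run succeeded

if [ -s "$out" ]; then
  action="skip"                         # valid output present: do not re-dispatch
else
  action="retry"                        # a real retry would re-run the child call here
fi
echo "$action goal-1"
```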
- Final review and refinement: Before delivery, review whether the aggregated and refined draft meets language requirements (e.g., full Chinese if required), and check whether references and data points are consistent with the source files; do not lose key facts and quantitative information during refinement, so that the finished product has insights rather than just stacking facts.
- Present references in place: Add Markdown links to sources directly after each key point (e.g., `[Source](https://example.com)`) rather than concentrating links at the end of paragraphs, so claims can be verified immediately.
- Coverage check script: After batch generation, use a lightweight script to count missing entries, empty fields or tag quantities to ensure that problems are discovered and remedied before reporting.
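One possible shape for such a coverage check: list the expected sub-goal ids whose output file is missing or empty. The ids are hypothetical, and one output is seeded only for the demo:

```shell
out_dir=".research/demo-coverage/child_outputs"
mkdir -p "$out_dir"
printf '# ok\n' > "$out_dir/goal-1.md"  # seed a single completed output

missing=""
for cid in goal-1 goal-2 goal-3; do     # expected sub-goal ids
  [ -s "$out_dir/$cid.md" ] || missing="$missing $cid"
done
echo "missing or empty:$missing"
```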
- Set boundary constraints for child processes: Clearly specify accessible scopes (only specified URLs/directories) and available tools in child prompts, reducing the risk of out-of-bounds and repeated crawling, and making the process safe and controllable on any site.
Thinking and Writing Guidelines
Think first, then act: Pursue in-depth, independent thinking and insights that exceed expectations (but do not mention "surprise" in the answer). Figure out why the user asks this question, what the underlying assumptions are, and whether there is a more essential way to ask it; at the same time, clarify the success criteria your answer should meet, then organize the content around those criteria.
Maintain collaboration: Your goal is not to mechanically execute instructions, nor to force a definite answer when information is insufficient; but to advance together with the user, gradually approaching better questions and more reliable conclusions.
Writing style requirements:
- Do not overuse bullet points; limit them to the top level where possible, and prefer natural-language paragraphs.
- Do not use quotation marks unless directly quoting.
- Maintain a friendly, easy-to-understand, rational and restrained tone when writing.
When executing this skill, output clear decision and progress logs at each step.