# dag-fleet


A skill for running parallel `claude -p` or `codex exec` workers in tmux with budgets and DAG dependencies. Supports both Claude and Codex providers — set per-fleet or per-worker. Operator owns all kill / steer / re-direction — there is no auto-restart, no auto-verify, no babysitter loop.

## When to use this skill (and when NOT to)

FIRST: prefer Claude Code's built-in Agent tool when any of these are true. It's simpler, faster, and avoids the fleet machinery entirely.

- The work fits inside the current conversation
- All sub-agents will finish in under 10 minutes
- You'll synthesize the results in the same session
- You don't need budget caps or dependency ordering

Reach for THIS skill only when ≥1 of these is true:

- **Persistence:** the run will outlive the parent `claude` process (e.g. multi-hour fleet, user closes laptop)
- **Per-worker budgets:** you need `max_budget_usd: N` enforced per worker
- **DAG dependencies:** worker D must wait for A, B, C to finish first
- **Mixed models per worker:** Sonnet researchers + Haiku validators in the same fleet
- **Tmux pane visibility:** the user wants to attach to individual workers and watch them stream

If none of those apply, stop reading this skill and use the Agent tool.

## What this skill is NOT

- Not an auto-recovery system. If a worker fails or hangs, the operator decides what to do.
- Not a babysitter. There is no `orchestrate.sh`, no stuck-detection that kills, no mid-flight steering.
- Not for "spawn 3 quick lookups in parallel" — that's the Agent tool's job.

## Available scripts

| Script | When to call | Args |
|---|---|---|
| `launch.sh` | Start a new fleet from a `fleet.json` you generated | `<fleet-root>` |
| `status.sh` | Show what's running, what's done, live cost, last message per worker | `<fleet-name-or-root> [-v] [--watch] [--json]` |
| `kill.sh` | Stop one worker or the entire fleet (operator's hard stop) | `<fleet-name-or-root> <worker-id>\|all [--force]` |
| `relaunch-worker.sh` | After editing one worker's `prompt.md`, re-run just that worker | `<fleet-name-or-root> <worker-id>` |
| `report.sh` | Generate a markdown summary when the fleet is done | `<fleet-root>` |
| `view.sh` | Capture a single worker's tmux pane content | `<fleet-name-or-root> <worker-id>` |
| `feed.sh` | Stream a unified event feed across all workers | `<fleet-name-or-root> [--agent <id>]` |

Utilities (in `lib/`):

| Utility | Purpose | Usage |
|---|---|---|
| `dag-viz.py` | Visualize fleet DAG structure (ASCII or mermaid) | `python3 ${CLAUDE_SKILL_DIR}/lib/dag-viz.py <fleet.json> [--mermaid]` |

All scripts accept either an absolute fleet-root path or a fleet name (resolved via `~/.claude/fleet-registry.json`, populated automatically by `launch.sh`).

## Launch procedure (MUST follow exactly)

When the user asks you to launch a fleet:

1. Set `FLEET_ROOT` to the user's specified directory. Default to cwd if unspecified. Use absolute paths only.
2. `mkdir -p $FLEET_ROOT/workers`
3. Generate `$FLEET_ROOT/fleet.json` — see `references/fleet-json-schema.md` for the full schema. Required top-level fields: `fleet_name`, `config`, `workers[]`. Each worker needs `id`, `type`, `task`, `model`, `max_turns`, `max_budget_usd`. Use `depends_on: [...]` for DAG ordering.
4. For each worker, create `$FLEET_ROOT/workers/{id}/prompt.md`. The prompt MUST include this line verbatim: `Save ALL output files to $FLEET_ROOT/workers/{id}/output/ — use absolute paths.` (Substitute the real fleet root and worker id.)
5. Run: `bash ${CLAUDE_SKILL_DIR}/scripts/launch.sh $FLEET_ROOT`
6. Do NOT write your own tmux/claude commands. `launch.sh` handles topo sort, tmux session creation, per-worker spawning, budgets, and the registry.
7. ALWAYS tell the user the exact status command so they can monitor manually: `bash ${CLAUDE_SKILL_DIR}/scripts/status.sh <fleet-name-or-root>`. This is mandatory after every launch. The user must be able to check status without asking you.
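A minimal `fleet.json` sketch for step 3, assembled only from fields named in this document. The worker ids, tasks, dollar amounts, and the placement of `max_concurrent` under `config` are illustrative assumptions; `references/fleet-json-schema.md` is authoritative.

```json
{
  "fleet_name": "demo-research",
  "config": {
    "max_concurrent": 2,
    "max_budget_fleet": 10
  },
  "workers": [
    { "id": "researcher-01", "type": "research", "task": "Survey source set A", "model": "sonnet", "max_turns": 30, "max_budget_usd": 2 },
    { "id": "researcher-02", "type": "research", "task": "Survey source set B", "model": "sonnet", "max_turns": 30, "max_budget_usd": 2 },
    { "id": "synthesizer", "type": "write", "task": "Merge findings into synthesis.md", "model": "sonnet", "max_turns": 20, "max_budget_usd": 3, "depends_on": ["researcher-01", "researcher-02"] }
  ]
}
```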

## Re-running ONE worker (the addendum workflow)

The user has a finished fleet and wants to add 1-2 sources / change one worker's instructions:

1. Edit `$FLEET_ROOT/workers/{id}/prompt.md` (add the new sources / instructions)
2. Run `bash ${CLAUDE_SKILL_DIR}/scripts/relaunch-worker.sh <fleet-name> {id}`
3. The worker's old `session.jsonl` is rotated to `.bak`, a fresh tmux window spawns, other workers are untouched
4. The fleet's tmux session must still exist. If it's been killed, the user must `launch.sh --force-relaunch` the whole fleet — `relaunch-worker.sh` only works against a live fleet session.

If the user wants to re-run multiple workers, do it one at a time. There is no batch re-run; that's intentional.
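Step 3's rotation can be pictured with a small Python sketch. This is an illustration only; the real logic lives in `relaunch-worker.sh`, and the exact backup filename used here is an assumption:

```python
from pathlib import Path

def rotate_session(worker_dir: str) -> None:
    """Set the worker's old session.jsonl aside before a fresh run.
    Backup name (session.jsonl.bak) is a guess at the ".bak" rotation."""
    session = Path(worker_dir) / "session.jsonl"
    if session.exists():
        # Atomic rename; the fresh worker starts with a clean transcript.
        session.replace(session.parent / "session.jsonl.bak")
```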

## Killing

There are two operator-initiated kill paths and no automatic kills:

- `kill.sh <fleet> <worker-id>` — kill one worker. Sweeps subprocess descendants. Use this when you've decided a single worker is going down the wrong path.
- `kill.sh <fleet> all --force` — tear down the entire fleet, kill all tmux windows, sweep every orphan subprocess, mark workers KILLED, unregister from the registry.

There is no `steer.sh`. There is no mid-flight redirection. The intentional workflow for "I want this worker to take a different direction" is: `kill.sh` it, edit `prompt.md`, `relaunch-worker.sh`. Three steps, fully under operator control.

## Worker types

The `type` field on each worker controls the `--disallowed-tools` set passed to claude. Pick one:

- `read-only` — disallows: Bash, Edit, Write, Agent, WebFetch, WebSearch. Cannot write files. Only use for pure analysis where output is captured from assistant messages in session.jsonl.
- `write` — disallows: Bash, Agent, WebFetch, WebSearch. Use for synthesizers and any worker that writes output files.
- `code-run` — disallows: Agent, WebFetch, WebSearch (the typical default for build/test workers)
- `research` — disallows: Bash, Edit, Agent (web access enabled). Use for researchers, not `read-only`.
- `reviewer` — disallows: Bash, Edit, Agent, WebFetch, WebSearch. Has Read + Write only. Use for reviewers that write verdict files.
- `orchestrator` — disallows: Agent, WebFetch, WebSearch, Edit

WARNING: `read-only` cannot write files. If a worker needs to save output (findings.md, synthesis.md, etc.), use `write`, `research`, `reviewer`, or `code-run`. Setting a synthesizer to `read-only` will burn its entire budget trying to find a Write tool.

See `references/worker-types.md` for the full permission matrix.
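The list above is pure data, so the WARNING can be made checkable with a short Python sketch. The mapping is transcribed from the bullets; the helper names are hypothetical, and the real flag assembly happens inside `launch.sh`:

```python
# Worker type -> tools passed to --disallowed-tools (transcribed from above).
DISALLOWED = {
    "read-only":    ["Bash", "Edit", "Write", "Agent", "WebFetch", "WebSearch"],
    "write":        ["Bash", "Agent", "WebFetch", "WebSearch"],
    "code-run":     ["Agent", "WebFetch", "WebSearch"],
    "research":     ["Bash", "Edit", "Agent"],
    "reviewer":     ["Bash", "Edit", "Agent", "WebFetch", "WebSearch"],
    "orchestrator": ["Agent", "WebFetch", "WebSearch", "Edit"],
}

def disallowed_tools_flag(worker_type: str) -> str:
    """Render the comma-separated value for --disallowed-tools."""
    return ",".join(DISALLOWED[worker_type])

def can_write_files(worker_type: str) -> bool:
    """True iff the type leaves the Write tool enabled."""
    return "Write" not in DISALLOWED[worker_type]
```

Note `can_write_files("read-only")` is `False`, which is exactly why a `read-only` synthesizer burns its budget hunting for a Write tool.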

## Provider support (Claude + Codex)

Workers can run on either `claude` (default) or `codex` (OpenAI Codex CLI). Set at fleet level or per-worker:

```json
{
  "config": {
    "provider": "codex",
    "model": "gpt-5.4",
    "reasoning_effort": "medium"
  },
  "workers": [
    { "id": "researcher", "type": "research", "provider": "codex", "model": "gpt-5.4", "reasoning_effort": "medium" },
    { "id": "writer", "type": "write", "provider": "claude", "model": "sonnet" }
  ]
}
```

### Codex-specific fields

| Field | Values | Default | Scope |
|---|---|---|---|
| `provider` | `"claude"` \| `"codex"` | `"claude"` | config + per-worker |
| `reasoning_effort` | `"low"` \| `"medium"` \| `"high"` | (none) | config + per-worker, codex only |

### Codex model aliases

| Model | Use case |
|---|---|
| `gpt-5.4` | Flagship — strongest reasoning, recommended default |
| `gpt-5.4-mini` | Fast/cheap — validators, simple tasks |
| `gpt-5.3-codex` | Coding-focused (migrating to gpt-5.4) |

### Codex limitations vs Claude

- No `--max-budget-usd` — codex has no per-worker budget cap. Fleet-level cost tracking still works (estimated from token counts).
- No `--fallback-model` — codex has no automatic model fallback.
- No per-tool disabling — codex uses sandbox modes (`read-only`, `workspace-write`) instead of `--disallowed-tools`. Worker types are mapped automatically.
- Web search — research workers get `-c 'web_search="live"'` automatically.
- All output workers need `workspace-write` — codex `read-only` sandbox blocks ALL file writes including output.

## DAG dependencies

```json
{
  "id": "synthesizer",
  "depends_on": ["researcher-01", "researcher-02"]
}
```

`launch.sh` uses the shared `lib/dag.sh` primitives (Kahn's BFS-layered topo-sort) to order workers and waits for dependencies to emit a terminal `result` event before starting dependents. Cycles are detected before any tmux state is created — fleet exits 2 with `CYCLE:a,b,...` on stderr. Workers within a layer run in parallel up to `max_concurrent`. Use `dag-viz.py` to preview the DAG structure before launch.
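The layering described above can be sketched in Python. This is an illustration of Kahn's BFS-layered topo-sort and the `CYCLE:...` failure mode, not a transcription of the shell code in `lib/dag.sh`; the in-layer sort is a determinism choice of this sketch:

```python
def dag_layers(workers):
    """Return lists of worker ids that may run concurrently, in launch
    order; raise ValueError("CYCLE:...") on a dependency cycle."""
    deps = {w["id"]: set(w.get("depends_on", [])) for w in workers}
    indegree = {wid: len(d) for wid, d in deps.items()}
    dependents = {wid: [] for wid in deps}
    for wid, d in deps.items():
        for dep in d:
            dependents[dep].append(wid)
    layer = [wid for wid, n in indegree.items() if n == 0]
    layers = []
    while layer:
        layers.append(sorted(layer))  # sorted only for deterministic output
        nxt = []
        for wid in layer:
            for child in dependents[wid]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    nxt.append(child)
        layer = nxt
    if sum(len(l) for l in layers) != len(deps):
        remaining = sorted(wid for wid, n in indegree.items() if n > 0)
        raise ValueError("CYCLE:" + ",".join(remaining))
    return layers
```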

## Budgets

- `worker.max_budget_usd: N` — per-worker hard cap, passed to `claude --max-budget-usd`
- `config.max_budget_fleet: N` — total fleet cap; `launch.sh` stops launching new workers once this is exceeded (already-running workers are not killed, the cap is "no new spending")
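The fleet cap's "no new spending" semantics can be sketched as follows. This is an illustration under the assumption that launches are gated on realized spend, not a transcription of `launch.sh`:

```python
def gate_launches(queue, costs, max_budget_fleet):
    """Walk the pending-worker queue in order, launching until cumulative
    spend exceeds the fleet cap, then stop launching new workers.
    `costs` maps worker id -> realized cost after that worker runs."""
    spent, launched, skipped = 0.0, [], []
    for wid in queue:
        if spent > max_budget_fleet:  # cap exceeded: no NEW spending
            skipped.append(wid)
            continue
        launched.append(wid)
        spent += costs[wid]           # running workers are never killed
    return launched, skipped, spent
```

Note that total spend can overshoot the cap: the worker that crosses the line runs to completion, and only subsequent launches are blocked.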

## STRICT RULES

1. ALWAYS use the scripts above for EVERY operation. Never write your own tmux / claude commands.
2. NEVER use the `--bare` flag with `claude` — causes auth failures.
3. Fleet root = user's directory. Default to cwd. ALL fleet files go inside `$FLEET_ROOT`.
4. Worker output paths must be absolute: `$FLEET_ROOT/workers/{id}/output/`. Tell the worker this in its prompt.md.
5. `launch.sh` is the only way to start workers. `relaunch-worker.sh` is the only way to selectively re-run one. There is no other path.
6. Operator owns kill and direction changes. Do not auto-kill, do not auto-restart, do not auto-redirect. If a worker is misbehaving, surface it to the user and let them decide.
7. Do NOT invent missing scripts. If you find yourself wanting `steer.sh`, `verify.sh`, `add-worker.sh`, or `orchestrate.sh` — they were intentionally removed. Use the operator-owned workflow above instead.

## Rationalizations to reject

| Agent says | Rebuttal |
|---|---|
| "The task is small enough that I can write the tmux commands myself" | The skill exists to prevent the 15 things you'll forget (unset CLAUDECODE, `--disallowed-tools`, session naming, registry, topo sort). Use `launch.sh`. |
| "I'll use `relaunch-worker.sh` to restart all stuck workers at once" | One at a time, intentional. Batch restart is how experiment 001 burned $20 — cache rebuilds on every worker compounded. |
| "The worker seems stuck — I should kill and restart it" | Long thinking blocks look like hangs. Check `status.sh` or `view.sh` first. Only the operator kills workers. |
| "I should add a verify step after each worker finishes" | There is no verify step. The operator reads output and decides. Auto-verify was removed after it caused more harm than the failures it caught. |
| "I'll just add `--bare` to speed things up" | `--bare` causes auth failures. Never use it. This is STRICT RULE #2. |

## When to give up on this skill

If the user asks for behavior that requires auto-recovery, mid-flight steering, or per-worker validation loops, tell them this skill no longer does those things by design. Suggest:

- For auto-recovery → they should run a watcher script themselves and call `kill.sh` + `relaunch-worker.sh` from it
- For mid-flight steering → kill + edit prompt.md + relaunch-worker
- For validation → they read the output files themselves and decide

The skill's surface area was deliberately reduced after experiments where automated behavior caused more harm than the failures it was trying to recover from.

$ARGUMENTS