dag-fleet
Fleet
A skill for running parallel `claude -p` or `codex exec` workers in tmux with budgets and DAG dependencies. It supports both Claude and Codex providers, set per-fleet or per-worker. The operator owns every kill, steer, and re-direction decision: there is no auto-restart, no auto-verify, and no babysitter loop.
When to use this skill (and when NOT to)
FIRST: prefer Claude Code's built-in Agent tool when any of these are true. It's simpler, faster, and avoids the fleet machinery entirely.
- The work fits inside the current conversation
- All sub-agents will finish in under 10 minutes
- You'll synthesize the results in the same session
- You don't need budget caps or dependency ordering
Reach for THIS skill only when ≥1 of these is true:
- Persistence: the run will outlive the parent `claude` process (e.g. a multi-hour fleet, or the user closes their laptop)
- Per-worker budgets: you need `max_budget_usd: N` enforced per worker
- DAG dependencies: worker D must wait for A, B, and C to finish first
- Mixed models per worker: Sonnet researchers + Haiku validators in the same fleet
- Tmux pane visibility: the user wants to attach to individual workers and watch them stream
If none of those apply, stop reading this skill and use the Agent tool.
What this skill is NOT
- Not an auto-recovery system. If a worker fails or hangs, the operator decides what to do.
- Not a babysitter. There is no `orchestrate.sh`, no stuck-detection that kills, and no mid-flight steering.
- Not for "spawn 3 quick lookups in parallel". That's the Agent tool's job.
Available scripts
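| Script | When to call | Args |
|---|---|---|
| `launch.sh` | Start a new fleet from a `fleet.json` you generated | `<fleet-root>` |
| `status.sh` | Show what's running, what's done, live cost, and the last message per worker | `<fleet-name-or-root>` |
| `kill.sh` | Stop one worker or the entire fleet (the operator's hard stop) | `<fleet> <worker-id>` or `<fleet> all --force` |
| `relaunch-worker.sh` | After editing one worker's `prompt.md`, re-run only that worker | `<fleet-name> <id>` |
| | Generate a markdown summary when the fleet is done | |
| | Capture a single worker's tmux pane content | |
| | Stream a unified event feed across all workers | |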
Utilities (in `lib/`):
| Utility | Purpose | Usage |
|---|---|---|
| `dag-viz.py` | Visualize fleet DAG structure (ASCII or mermaid) | |
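The real `dag-viz.py` interface is not documented on this page. As a rough sketch of what a mermaid rendering involves, assuming only the `workers[].id` and `depends_on` fields of `fleet.json`, the hypothetical helper `to_mermaid` below turns the dependency list into a `graph TD` flowchart:

```python
import re


def _node(worker_id: str) -> str:
    # Mermaid IDs are safest as word characters; keep the real id as the label.
    safe = re.sub(r"\W", "_", worker_id)
    return f'{safe}["{worker_id}"]'


def to_mermaid(fleet: dict) -> str:
    """Render fleet.json dependencies as a mermaid flowchart (hypothetical helper)."""
    lines = ["graph TD"]
    for worker in fleet["workers"]:
        deps = worker.get("depends_on", [])
        if not deps:
            # A root worker has no incoming edge, so declare the node on its own.
            lines.append(f"    {_node(worker['id'])}")
        for dep in deps:
            # Edges run from the prerequisite to the worker that waits on it.
            lines.append(f"    {_node(dep)} --> {_node(worker['id'])}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid({"workers": [
        {"id": "researcher-01"},
        {"id": "researcher-02"},
        {"id": "synthesizer", "depends_on": ["researcher-01", "researcher-02"]},
    ]}))
```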
All scripts accept either an absolute fleet-root path or a fleet name (resolved via `~/.claude/fleet-registry.json`, which `launch.sh` populates automatically).
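The registry's on-disk layout is not described here. Assuming, purely for illustration, that it holds a JSON object mapping fleet names to absolute roots, name-or-root resolution could be sketched like this (`resolve_fleet_root` is a hypothetical name):

```python
import json
from pathlib import Path

# Path taken from this document; the {name: root} layout below is an assumption.
REGISTRY = Path.home() / ".claude" / "fleet-registry.json"


def resolve_fleet_root(name_or_root: str, registry: Path = REGISTRY) -> Path:
    """Return an absolute fleet root for either an absolute path or a fleet name."""
    candidate = Path(name_or_root)
    if candidate.is_absolute():
        return candidate  # already a fleet root; no lookup needed
    entries = json.loads(registry.read_text()) if registry.exists() else {}
    if name_or_root not in entries:
        raise KeyError(f"unknown fleet {name_or_root!r}; was it started with launch.sh?")
    return Path(entries[name_or_root])
```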
Launch procedure (MUST follow exactly)
When the user asks you to launch a fleet:
- Set `FLEET_ROOT` to the user's specified directory. Default to cwd if unspecified. Use absolute paths only.
- `mkdir -p $FLEET_ROOT/workers`
- Generate `$FLEET_ROOT/fleet.json`; see `references/fleet-json-schema.md` for the full schema. Required top-level fields: `fleet_name`, `config`, `workers[]`. Each worker needs `id`, `type`, `task`, `model`, `max_turns`, `max_budget_usd`. Use `depends_on: [...]` for DAG ordering.
- For each worker, create `$FLEET_ROOT/workers/{id}/prompt.md`. The prompt MUST include this line verbatim (substitute the real fleet root and worker id):
  ```
  Save ALL output files to $FLEET_ROOT/workers/{id}/output/ — use absolute paths.
  ```
- Run:
  ```bash
  bash ${CLAUDE_SKILL_DIR}/scripts/launch.sh $FLEET_ROOT
  ```
- Do NOT write your own tmux/claude commands. `launch.sh` handles topo sort, tmux session creation, per-worker spawning, budgets, and the registry.
- ALWAYS tell the user the exact status command so they can monitor manually:
  ```bash
  bash ${CLAUDE_SKILL_DIR}/scripts/status.sh <fleet-name-or-root>
  ```
  This is mandatory after every launch. The user must be able to check status without asking you.
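Everything before running `launch.sh` is plain file preparation. The sketch below covers only that part, assuming the caller already has each worker's prompt as a string. `prepare_fleet` is a hypothetical helper that checks only the required fields listed above (the full schema lives in `references/fleet-json-schema.md`), and it deliberately stops short of launching so that `launch.sh` stays the only thing that starts workers.

```python
import json
from pathlib import Path

# Required fields as listed in the launch procedure; the full schema may require more.
TOP_LEVEL = ("fleet_name", "config", "workers")
PER_WORKER = ("id", "type", "task", "model", "max_turns", "max_budget_usd")

# The line every prompt.md must contain verbatim (\u2014 is the em dash it uses).
OUTPUT_LINE = "Save ALL output files to {root}/workers/{wid}/output/ \u2014 use absolute paths."


def prepare_fleet(fleet_root: Path, fleet: dict, prompts: dict[str, str]) -> Path:
    """Write fleet.json and workers/{id}/prompt.md; does NOT launch anything."""
    root = fleet_root.resolve()  # absolute paths only
    missing = [k for k in TOP_LEVEL if k not in fleet]
    if missing:
        raise ValueError(f"fleet.json missing top-level fields: {missing}")
    for worker in fleet["workers"]:
        gaps = [k for k in PER_WORKER if k not in worker]
        if gaps:
            raise ValueError(f"worker {worker.get('id')!r} missing fields: {gaps}")
    (root / "workers").mkdir(parents=True, exist_ok=True)  # mkdir -p $FLEET_ROOT/workers
    for worker in fleet["workers"]:
        wid = worker["id"]
        wdir = root / "workers" / wid
        wdir.mkdir(parents=True, exist_ok=True)
        body = prompts[wid].rstrip() + "\n\n" + OUTPUT_LINE.format(root=root, wid=wid) + "\n"
        (wdir / "prompt.md").write_text(body)
    fleet_json = root / "fleet.json"
    fleet_json.write_text(json.dumps(fleet, indent=2) + "\n")
    return fleet_json
```

Afterwards, run `launch.sh` on the same root and give the user the `status.sh` command, exactly as the procedure requires.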
Re-running ONE worker (the addendum workflow)
The user has a finished fleet and wants to add 1-2 sources or change one worker's instructions:
- Edit `$FLEET_ROOT/workers/{id}/prompt.md` (add the new sources / instructions)
- Run `bash ${CLAUDE_SKILL_DIR}/scripts/relaunch-worker.sh <fleet-name> {id}`
- The worker's old `session.jsonl` is rotated to `.bak`, a fresh tmux window spawns, and the other workers are untouched
- The fleet's tmux session must still exist. If it has been killed, the user must `launch.sh --force-relaunch` the whole fleet; `relaunch-worker.sh` only works against a live fleet session
If the user wants to re-run multiple workers, do it one at a time. There is no batch re-run; that's intentional.
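The rotation step amounts to a single rename. This is a sketch under assumptions, not `relaunch-worker.sh` itself: the location `workers/{id}/session.jsonl` and the backup name `session.jsonl.bak` are guesses, and the real script also spawns the new tmux window, which is omitted here. `rotate_session` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def rotate_session(worker_dir: Path) -> Optional[Path]:
    """Move the old transcript aside so the relaunched worker starts fresh."""
    session = worker_dir / "session.jsonl"  # assumed location
    if not session.exists():
        return None  # nothing to rotate, e.g. the worker never started
    backup = session.with_name(session.name + ".bak")  # assumed backup name
    session.replace(backup)  # overwrites any older .bak from a previous relaunch
    return backup
```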
Killing
There are two operator-initiated kill paths and no automatic kills:
- `kill.sh <fleet> <worker-id>`: kill one worker. Sweeps subprocess descendants. Use this when you've decided a single worker is going down the wrong path.
- `kill.sh <fleet> all --force`: tear down the entire fleet, kill all tmux windows, sweep every orphan subprocess, mark workers KILLED, and unregister the fleet from the registry.

There is no `steer.sh`. There is no mid-flight redirection. The intentional workflow for "I want this worker to take a different direction" is: `kill.sh` it, edit `prompt.md`, then `relaunch-worker.sh`. Three steps, fully under operator control.
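"Sweeps subprocess descendants" means following the process tree below the worker's top-level process, not just killing that one PID. A minimal sketch of the tree walk, assuming a POSIX `ps`; the helper names are hypothetical and this is not `kill.sh`'s code:

```python
import subprocess
from collections import deque


def process_table() -> dict[int, int]:
    """Snapshot {pid: ppid} for every process, via POSIX `ps -A -o pid= -o ppid=`."""
    out = subprocess.run(["ps", "-A", "-o", "pid=", "-o", "ppid="],
                         capture_output=True, text=True, check=True).stdout
    table = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 2:
            table[int(fields[0])] = int(fields[1])
    return table


def descendants(root_pid: int, parent_of: dict[int, int]) -> list[int]:
    """Every transitive child of root_pid, breadth-first (parents before children)."""
    children: dict[int, list[int]] = {}
    for pid, ppid in parent_of.items():
        children.setdefault(ppid, []).append(pid)
    seen, found, queue = {root_pid}, [], deque([root_pid])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:  # guard against odd entries such as pid == ppid
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found
```

A sweep would then send a signal to each returned PID (for example with `os.kill`); that part is left out because the signals `kill.sh` actually sends are not documented here.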
Worker types
The `type` field on each worker controls the `--disallowed-tools` set passed to `claude`. Pick one:
- `read-only`: disallows Bash, Edit, Write, Agent, WebFetch, WebSearch. Cannot write files. Only use for pure analysis where output is captured from assistant messages in `session.jsonl`.
- `write`: disallows Bash, Agent, WebFetch, WebSearch. Use for synthesizers and any worker that writes output files.
- `code-run`: disallows Agent, WebFetch, WebSearch (the typical default for build/test workers).
- `research`: disallows Bash, Edit, Agent (web access enabled). Use for researchers, not `read-only`.
- `reviewer`: disallows Bash, Edit, Agent, WebFetch, WebSearch. Has Read + Write only. Use for reviewers that write verdict files.
- `orchestrator`: disallows Agent, WebFetch, WebSearch, Edit.
WARNING: `read-only` cannot write files. If a worker needs to save output (findings.md, synthesis.md, etc.), use `write`, `research`, `reviewer`, or `code-run`. Setting a synthesizer to `read-only` will burn its entire budget trying to find a Write tool.

See `references/worker-types.md` for the full permission matrix.
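Since the mistake in that warning is easy to make when writing `fleet.json` by hand, it can help to encode the list above as data and check each worker's type before launching. This is a transcription of the bullets, not `launch.sh`'s source; `references/worker-types.md` remains authoritative, and the helper names are hypothetical.

```python
# Transcribed from the worker-type list; references/worker-types.md is authoritative.
DISALLOWED_TOOLS = {
    "read-only": ["Bash", "Edit", "Write", "Agent", "WebFetch", "WebSearch"],
    "write": ["Bash", "Agent", "WebFetch", "WebSearch"],
    "code-run": ["Agent", "WebFetch", "WebSearch"],
    "research": ["Bash", "Edit", "Agent"],
    "reviewer": ["Bash", "Edit", "Agent", "WebFetch", "WebSearch"],
    "orchestrator": ["Agent", "WebFetch", "WebSearch", "Edit"],
}


def disallowed_tools(worker_type: str) -> list[str]:
    """Tools the given worker type may not use; rejects unknown types."""
    if worker_type not in DISALLOWED_TOOLS:
        raise ValueError(f"unknown worker type {worker_type!r}; pick one of {sorted(DISALLOWED_TOOLS)}")
    return DISALLOWED_TOOLS[worker_type]


def check_writer(worker: dict) -> None:
    """Fail fast if a worker expected to save files was given a type that cannot write."""
    if "Write" in disallowed_tools(worker["type"]):
        raise ValueError(f"worker {worker['id']!r} is {worker['type']!r} and cannot write output files")
```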
Provider support (Claude + Codex)
Workers can run on either `claude` (default) or `codex` (OpenAI Codex CLI). Set the provider at fleet level or per worker:

```json
{
  "config": {
    "provider": "codex",
    "model": "gpt-5.4",
    "reasoning_effort": "medium"
  },
  "workers": [
    { "id": "researcher", "type": "research", "provider": "codex", "model": "gpt-5.4", "reasoning_effort": "medium" },
    { "id": "writer", "type": "write", "provider": "claude", "model": "sonnet" }
  ]
}
```
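The example implies a simple precedence: a key set on the worker wins, otherwise the `config` value applies, and the provider falls back to `claude`. A minimal sketch of that merge, assuming exactly those rules; `resolve_worker` is hypothetical and is not known to be how `launch.sh` does it.

```python
# Keys that may be set fleet-wide in "config" and overridden per worker.
PROVIDER_KEYS = ("provider", "model", "reasoning_effort")


def resolve_worker(config: dict, worker: dict) -> dict:
    """Effective provider settings for one worker: worker value, else config value."""
    merged = {k: worker.get(k, config.get(k)) for k in PROVIDER_KEYS}
    merged["provider"] = merged["provider"] or "claude"  # claude is the default provider
    if merged["provider"] != "codex":
        merged.pop("reasoning_effort")  # codex-only field; dropped for claude workers
    return {k: v for k, v in merged.items() if v is not None}
```

A worker that switches provider should also set its own `model`, as `writer` does in the example; otherwise it would inherit the fleet-level codex model name.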
Codex-specific fields
| Field | Values | Default | Scope |
|---|---|---|---|
| `provider` | `claude`, `codex` | `claude` | config + per-worker |
| `reasoning_effort` | | (none) | config + per-worker, codex only |
Codex model aliases
| Model | Use case |
|---|---|
| `gpt-5.4` | Flagship: strongest reasoning, recommended default |
| | Fast/cheap: validators, simple tasks |
| | Coding-focused (migrating to `gpt-5.4`) |
Codex limitations vs Claude
- No `--max-budget-usd`: codex has no per-worker budget cap. Fleet-level cost tracking still works (estimated from token counts).
- No `--fallback-model`: codex has no automatic model fallback.
- No per-tool disabling: codex uses sandbox modes (`read-only`, `workspace-write`) instead of `--disallowed-tools`. Worker types are mapped automatically.
- Web search: research workers get `-c 'web_search="live"'` automatically.
- All output workers need `workspace-write`: the codex `read-only` sandbox blocks ALL file writes, including output.
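The bullets above imply a per-type translation from Claude-style tool restrictions to Codex flags. The sketch below is one reading of them, not the mapping `launch.sh` actually uses: it assumes only the `read-only` type keeps the `read-only` sandbox and every other type gets `workspace-write`. `--sandbox`, `-m`, and `-c` are `codex exec` options; `codex_flags` itself is hypothetical.

```python
def codex_flags(worker_type: str, model: str) -> list[str]:
    """Build the `codex exec` options for one worker (sketch; prompt not included)."""
    # Assumed mapping: the read-only type stays in the read-only sandbox; every
    # other type may need to write output files, so it gets workspace-write.
    sandbox = "read-only" if worker_type == "read-only" else "workspace-write"
    flags = ["--sandbox", sandbox, "-m", model]
    if worker_type == "research":
        # Research workers get live web search, per the list above.
        flags += ["-c", 'web_search="live"']
    # Nothing is emitted for budgets or fallback models: codex has neither.
    return flags
```

Inside this skill `launch.sh` builds these commands; the sketch is only a way to reason about, or sanity-check, a `fleet.json` by hand.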
DAG dependencies
```json
{
  "id": "synthesizer",
  "depends_on": ["researcher-01", "researcher-02"]
}
```

`launch.sh`, `lib/dag.sh`, `result`, `CYCLE:a,b,...`, `max_concurrent`, `dag-viz.py`
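`depends_on` turns the workers into a DAG that must be ordered before anything starts. One standard way to do that is a layered variant of Kahn's algorithm, sketched below. It is not `lib/dag.sh`: the `CYCLE:` message only imitates the format named above, and here it lists every worker that sits on or behind a cycle. `launch_waves` is a hypothetical name.

```python
def launch_waves(workers: list[dict]) -> list[list[str]]:
    """Group worker ids into waves; each wave depends only on earlier waves."""
    deps = {w["id"]: set(w.get("depends_on", [])) for w in workers}
    unknown = set().union(*deps.values()) - set(deps)
    if unknown:
        raise ValueError(f"depends_on names unknown workers: {sorted(unknown)}")
    waves: list[list[str]] = []
    done: set[str] = set()
    while len(done) < len(deps):
        # Ready = not yet scheduled and every prerequisite already scheduled.
        ready = sorted(w for w, pre in deps.items() if w not in done and pre <= done)
        if not ready:
            # No progress possible: the rest is on, or waiting behind, a cycle.
            raise ValueError("CYCLE:" + ",".join(sorted(set(deps) - done)))
        waves.append(ready)
        done.update(ready)
    return waves
```

For the `synthesizer` example this yields `[["researcher-01", "researcher-02"], ["synthesizer"]]`.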
Budgets
- `worker.max_budget_usd: N`: per-worker hard cap, passed to `claude --max-budget-usd`
- `config.max_budget_fleet: N`: total fleet cap; `launch.sh` stops launching new workers once this is exceeded (already-running workers are not killed; the cap means "no new spending")
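The two caps behave differently: the per-worker cap is handed to `claude` and enforced there, while the fleet cap only gates new launches. A minimal sketch of both rules as stated above. How `launch.sh` measures fleet spend is not shown here, so `spent_usd` is simply an input, and both function names are hypothetical.

```python
from typing import Optional


def budget_flags(worker: dict, provider: str) -> list[str]:
    """Per-worker hard cap as CLI flags; codex has no equivalent, so nothing is passed."""
    cap = worker.get("max_budget_usd")
    if cap is None or provider == "codex":
        return []
    return ["--max-budget-usd", str(cap)]


def may_launch_more(spent_usd: float, max_budget_fleet: Optional[float]) -> bool:
    """Fleet cap means "no new spending": stop launching once spend exceeds it.

    Workers that are already running are never killed by this check.
    """
    return max_budget_fleet is None or spent_usd <= max_budget_fleet
```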
STRICT RULES
- ALWAYS use the scripts above for EVERY operation. Never write your own tmux / claude commands.
- NEVER use the `--bare` flag with `claude`: it causes auth failures.
- Fleet root = the user's directory. Default to cwd. ALL fleet files go inside `$FLEET_ROOT`.
- Worker output paths must be absolute: `$FLEET_ROOT/workers/{id}/output/`. Tell the worker this in its prompt.md.
- `launch.sh` is the only way to start workers. `relaunch-worker.sh` is the only way to selectively re-run one. There is no other path.
- The operator owns kill and direction changes. Do not auto-kill, do not auto-restart, do not auto-redirect. If a worker is misbehaving, surface it to the user and let them decide.
- Do NOT invent missing scripts. If you find yourself wanting `steer.sh`, `verify.sh`, `add-worker.sh`, or `orchestrate.sh`: they were intentionally removed. Use the operator-owned workflow above instead.
Rationalizations to reject
| Agent says | Rebuttal |
|---|---|
| "The task is small enough that I can write the tmux commands myself" | The skill exists to prevent the 15 things you'll forget (unset `CLAUDECODE`, `--disallowed-tools`, session naming, registry, topo sort). Use `launch.sh`. |
| "I'll use | One at a time, intentional. Batch restart is how experiment 001 burned $20: cache rebuilds on every worker compounded. |
| "The worker seems stuck, so I should kill and restart it" | Long thinking blocks look like hangs. Check |
| "I should add a verify step after each worker finishes" | There is no verify step. The operator reads output and decides. Auto-verify was removed after it caused more harm than the failures it caught. |
| "I'll just add | |
When to give up on this skill
If the user asks for behavior that requires auto-recovery, mid-flight steering, or per-worker validation loops, tell them this skill no longer does those things by design. Suggest:
- For auto-recovery → they should run a watcher script themselves and call `kill.sh` + `relaunch-worker.sh` from it
- For mid-flight steering → kill + edit prompt.md + relaunch-worker
- For validation → they read the output files themselves and decide
The skill's surface area was deliberately reduced after experiments where automated behavior caused more harm than the failures it was trying to recover from.
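For users who do want auto-recovery, the suggested watcher might look roughly like this. It is user-owned code that lives outside the skill. Deciding that a worker has failed is left to an `is_failed` callback because this page defines no failure signal, and `watch` / `recover` are hypothetical names. Each worker is retried at most once and workers are handled one at a time, in line with the no-batch-restart rule.

```python
import os
import subprocess
import time
from typing import Callable, Iterable, Optional

# ${CLAUDE_SKILL_DIR}/scripts, as used throughout this document.
SCRIPTS = os.path.join(os.environ.get("CLAUDE_SKILL_DIR", "."), "scripts")


def _run_checked(cmd: list[str]) -> None:
    subprocess.run(cmd, check=True)  # raise if a script fails, so we never relaunch blindly


def recover(fleet: str, worker_id: str, run: Callable[[list[str]], object] = _run_checked) -> None:
    """The documented manual fix, scripted: kill one worker, then relaunch it."""
    run(["bash", os.path.join(SCRIPTS, "kill.sh"), fleet, worker_id])
    run(["bash", os.path.join(SCRIPTS, "relaunch-worker.sh"), fleet, worker_id])


def watch(fleet: str, worker_ids: Iterable[str], is_failed: Callable[[str], bool],
          run: Callable[[list[str]], object] = _run_checked,
          interval_s: float = 60.0, rounds: Optional[int] = None) -> set[str]:
    """Poll workers; retry each failed worker at most once. Returns the retried ids."""
    ids = list(worker_ids)
    retried: set[str] = set()
    n = 0
    while rounds is None or n < rounds:
        if n:
            time.sleep(interval_s)
        for wid in ids:
            if wid not in retried and is_failed(wid):
                recover(fleet, wid, run)  # sequential: one worker at a time
                retried.add(wid)
        n += 1
    return retried
```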
$ARGUMENTS