autoresearch-fleet
Autoresearch Fleet
Autonomous research loop inspired by karpathy/autoresearch. One mutable file, one immutable eval harness, git as state machine, and a "NEVER STOP" directive. The agent modifies code, evaluates the result, keeps improvements, discards regressions, and repeats indefinitely.
Open-world extension: when the agent plateaus (N consecutive discards), the orchestrator injects a web-search prompt, breaking through knowledge ceilings the LLM can't cross alone.
When to use
- Optimizing a single metric (latency, accuracy, loss, score)
- The problem has a fast, deterministic eval harness
- You want autonomous overnight runs (100+ experiments while you sleep)
- The search space is too large for manual exploration
How it works
┌─────────────────────────────────────────────────┐
│ orchestrator.sh │
│ │
│ for each iteration: │
│ 1. Count trailing discards in results.tsv │
│ 2. If >= plateau_threshold → search prompt │
│ 3. Spawn agent (claude -p or codex exec) │
│ 4. Agent reads program.md, edits file, evals │
│ 5. Agent updates results.tsv, keeps/reverts │
│ 6. Check stop conditions (iter/cost/plateau) │
│ 7. Loop │
└─────────────────────────────────────────────────┘
The agent handles everything: reading files, editing code, running eval, committing, updating results.tsv, and reverting on failure. The orchestrator just loops, switches prompts on plateau, and enforces stop conditions.
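That division of labor can be sketched as a minimal loop. This is illustrative only — the real orchestrator.sh is generated by launch.sh, and the function and variable names below are assumptions:

```shell
# Hypothetical sketch of the orchestrator's control flow; plateau counting
# and agent spawning are stubbed out.
run_loop() {
  local max_iter="$1" plateau_threshold="$2"
  local i
  for ((i = 1; i <= max_iter; i++)); do
    # .paused sentinel: pause at the iteration boundary
    [ -e .paused ] && { echo "paused at iter $i"; return 0; }
    # TRAILING_DISCARDS would be counted from results.tsv in bash
    if [ "${TRAILING_DISCARDS:-0}" -ge "$plateau_threshold" ]; then
      echo "iter $i: search prompt"
    else
      echo "iter $i: normal prompt"
    fi
    # spawn the agent here (claude -p or codex exec), then check the
    # remaining stop conditions (cost cap, external kill)
  done
}
```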
Directory structure
$FLEET_ROOT/ # The problem directory
fleet.json # Fleet config
program.md # Agent instructions (you write this)
eval.py # Immutable eval harness (you write this)
solution.py # Mutable file (agent edits this)
results.tsv # Experiment log (agent updates, git-untracked)
orchestrator.sh # Generated by launch.sh
.orch-state.json # Iteration state
.paused # Sentinel (touch to pause)
logs/
session-iter-1.jsonl # Per-iteration session logs
session-iter-2.jsonl
...
fleet.json schema
```json
{
  "fleet_name": "optimize-api-latency",
  "type": "autoresearch",
  "config": {
    "model": "sonnet",
    "fallback_model": "haiku",
    "provider": "claude",
    "budget_per_iter": 2.00,
    "max_turns": 0
  },
  "problem": {
    "workdir": "/home/user/my-project",
    "eval_command": "make benchmark",
    "metric_regex": "^p99_latency_ms:\\s*([0-9.]+)",
    "metric_direction": "minimize",
    "results_file": "results.tsv",
    "program_md": "program.md"
  },
  "stop_when": {
    "max_iterations": 30,
    "cost_cap_usd": 15.0
  },
  "search": {
    "enabled": true,
    "plateau_threshold": 3
  }
}
```
Config fields
配置字段
| Field | Default | Description |
|---|---|---|
| `model` | | Agent model |
| `fallback_model` | | Fallback model (must differ from model) |
| `provider` | | Model provider |
| `budget_per_iter` | | Max USD per iteration |
| `max_turns` | | Max agent turns (0 = unlimited) |
| `workdir` | fleet root | Working directory — the repo/dir the agent operates in. Fleet root stores config + logs only. |
| `eval_command` | required | Command to run evaluation |
| `metric_regex` | (optional) | Regex to extract metric from eval output. Must have one capture group. Omit if eval prints a single number. |
| `metric_direction` | | Metric direction (`minimize` or `maximize`) |
| `results_file` | | TSV log file (in workdir) |
| `program_md` | | Agent instructions file (checked in workdir first, then fleet root) |
| `max_iterations` | | Hard iteration limit |
| `cost_cap_usd` | | Total cost limit (0 = no limit) |
| `search.enabled` | | Enable plateau-triggered web search |
| `search.plateau_threshold` | | Consecutive discards before search |
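Before a long run, it's worth sanity-checking that `metric_regex` actually matches your eval output. A small sketch — the `sed` invocation here only emulates the single capture group, and the orchestrator's real parsing may differ:

```shell
# Extract the metric from sample eval output, using the capture group of
# the regex from the example config above (\s rewritten as [[:space:]]).
eval_output="p99_latency_ms: 42.7
requests_served: 1000"
metric=$(printf '%s\n' "$eval_output" \
  | sed -nE 's/^p99_latency_ms:[[:space:]]*([0-9.]+).*/\1/p')
echo "$metric"   # prints 42.7
```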
Required inputs
You need 3 things in a fleet root directory:
- `fleet.json` — points `problem.workdir` at the target repo, sets `eval_command`
- `program.md` — agent instructions. Must say NEVER STOP.
- An eval command — anything that outputs a metric
fleet-root/ ← fleet.json + program.md + logs
fleet.json
program.md
logs/ ← created automatically
your-repo/ ← workdir (agent operates here)
src/...
results.tsv ← created automatically
When the user doesn't specify a benchmark
If the user gives you a repo and a goal but no eval command:
- Check for existing benchmarks: look for `Makefile` targets (`make benchmark`, `make perf`), `package.json` scripts (`npm run bench`, `yarn test`), pytest markers (`pytest -m benchmark`), or `bench/` directories.
- If found: use it as `eval_command`. Set `metric_regex` if it doesn't print a single number.
- If not found: write a benchmark script (`bench.sh` or `bench.py`) in the workdir that:
  - Runs the relevant operation (API call, function invocation, build, test suite)
  - Measures the metric the user cares about (latency, pass rate, bundle size, etc.)
  - Prints a single number to stdout
  - Exits 0 on success, non-zero on crash
- Set `eval_command` to run this script.
Example: user says "optimize API latency in my Express app":
```bash
#!/usr/bin/env bash
# bench.sh — measure p99 latency
npm start &>/dev/null &
PID=$!
sleep 3
RESULT=$(curl -s -o /dev/null -w '%{time_total}' http://localhost:3000/api/health)
kill $PID 2>/dev/null
echo "$RESULT"
```
Then set `"eval_command": "bash bench.sh"` in fleet.json.
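Note that the script above times a single request; for a true p99 you would time many requests and take the 99th percentile. A hedged sketch of a helper (assumed usage: pipe one latency per line into it):

```shell
# p99: read one latency per line on stdin, print the 99th-percentile value.
p99() {
  sort -n | awk '{ v[NR] = $1 } END { i = int(NR * 0.99); if (i < 1) i = 1; print v[i] }'
}
# e.g. run the curl above 100 times and pipe the times through p99
```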
Setup
- Create fleet root with `fleet.json` + `program.md` (+ `bench.sh` if you wrote one)
- `bash ${CLAUDE_SKILL_DIR}/scripts/launch.sh <fleet-root>` (git init + results.tsv auto-created)
- `bash ${CLAUDE_SKILL_DIR}/scripts/status.sh <fleet-root>` to monitor
Available scripts
| Script | Purpose |
|---|---|
| `launch.sh` | Generate orchestrator.sh, spawn in tmux with monitor |
| `status.sh` | Show iteration, best metric, results.tsv, cost, plateau |
| | View parsed session events for a specific iteration |
| | Generate markdown summary after run completes |
| | Pause at next iteration boundary |
| | Resume paused fleet |
| | Hard stop: kill tmux, sweep orphans |
program.md template
Your `program.md` should follow this structure (adapt to your problem):

```markdown
autoresearch: <problem description>

Setup
- Explore the codebase to understand the architecture.
- Read results.tsv for prior experiment history.
- Run <eval_command> to establish a baseline.

Rules
- Goal: <minimize|maximize> the metric.
- Make ONE change per experiment. Keep changes focused.
- <any constraints: don't touch tests, don't modify config, etc.>

The experiment loop
LOOP FOREVER:
- Read results.tsv for context on what's been tried.
- Make ONE change to the codebase.
- git add -A && git commit -m "short description"
- Run: <eval_command>
- Record in results.tsv (tab-separated): commit  metric  status  description
- If metric improved: keep the commit.
- If worse or crash: git reset --hard HEAD~1 and log as discard/crash.
- Go to step 1.

NEVER STOP. Run until manually interrupted.
```
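The keep/revert decision in that loop can be sketched as a small comparison helper. This is a sketch: `improved` is a hypothetical name, and on a regression the agent would follow it with `git reset --hard HEAD~1`:

```shell
# improved METRIC BEST DIRECTION → exit 0 if METRIC beats BEST.
# awk handles the floating-point comparison, which [ ] cannot.
improved() {
  if [ "$3" = "minimize" ]; then
    awk -v m="$1" -v b="$2" 'BEGIN { exit !(m < b) }'
  else
    awk -v m="$1" -v b="$2" 'BEGIN { exit !(m > b) }'
  fi
}
```

Usage might look like `if improved "$metric" "$best" minimize; then status=keep; else status=discard; git reset --hard HEAD~1; fi`.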
Key design principles (from Karpathy)
- Git as state machine — improvement = advance branch; regression = reset
- Fixed eval — makes all experiments comparable
- results.tsv as shared memory — agent reads history to avoid repeating failures
- NEVER STOP — agent runs autonomously until killed
- Simplicity criterion — a small gain with ugly complexity is not worth it
Open-world search (the extension)
When `search.enabled` is true, the orchestrator counts trailing discards in results.tsv. If the count exceeds `search.plateau_threshold`, the next iteration gets a search-augmented prompt telling the agent to use WebSearch before coding.
This is validated: in experiment 009, search found Winograd's Strassen variant (15 additions vs 18) — a technique not in the LLM's training data — breaking through a plateau where vanilla autoresearch was stuck.
Critical: plateau detection is done in bash (deterministic), not by the LLM. The agent miscounted consecutive discards in early experiments, hallucinating plateaus. External counting is mandatory.
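A sketch of that external counter (illustrative; the awk in the generated orchestrator may differ):

```shell
# Count trailing discard/crash rows in results.tsv; a "keep" resets the streak.
count_trailing_discards() {
  awk -F'\t' '
    $3 == "keep"                     { n = 0; next }   # improvement breaks the plateau
    $3 == "discard" || $3 == "crash" { n++ }
    END { print n + 0 }
  ' "$1"
}
```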
Rationalizations to reject
| Agent says | Rebuttal |
|---|---|
| "The agent should search every iteration for best results" | Search-on-plateau beats always-search. Most early searches are redundant and add latency. Only search when stuck (3+ discards). |
| "I should manage git from the orchestrator" | The agent handles git. It can fix commit messages, handle edge cases, and revert intelligently. The orchestrator just loops. |
| "The eval harness can reuse the same inputs" | Reusing inputs is gameable. The agent will discover identity-based memoization and optimize for the benchmark, not the problem. Use fresh seeded inputs per timed run. |
| "I should use iterative-fleet for this" | Iterative-fleet has a reviewer. Autoresearch has no reviewer — the eval script IS the quality gate. Different pattern, different skill. |
$ARGUMENTS