autoresearch-fleet


# Autoresearch Fleet


Autonomous research loop inspired by karpathy/autoresearch. One mutable file, one immutable eval harness, git as state machine, and a "NEVER STOP" directive. The agent modifies code, evaluates the result, keeps improvements, discards regressions, and repeats indefinitely.
Open-world extension: when the agent plateaus (N consecutive discards), the orchestrator injects a web-search prompt, breaking through knowledge ceilings the LLM can't cross alone.

## When to use


  • Optimizing a single metric (latency, accuracy, loss, score)
  • The problem has a fast, deterministic eval harness
  • You want autonomous overnight runs (100+ experiments while you sleep)
  • The search space is too large for manual exploration

## How it works


```
┌─────────────────────────────────────────────────┐
│                 orchestrator.sh                  │
│                                                  │
│  for each iteration:                             │
│    1. Count trailing discards in results.tsv     │
│    2. If >= plateau_threshold → search prompt    │
│    3. Spawn agent (claude -p or codex exec)      │
│    4. Agent reads program.md, edits file, evals  │
│    5. Agent updates results.tsv, keeps/reverts   │
│    6. Check stop conditions (iter/cost/plateau)  │
│    7. Loop                                       │
└─────────────────────────────────────────────────┘
```
The agent handles everything: reading files, editing code, running eval, committing, updating results.tsv, and reverting on failure. The orchestrator just loops, switches prompts on plateau, and enforces stop conditions.
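That control flow can be sketched in shell. This is a simplified illustration with the agent spawn stubbed out as an `echo`; the real `orchestrator.sh` generated by `launch.sh` also tracks cost, honors the `.paused` sentinel, and writes `.orch-state.json`:

```shell
#!/usr/bin/env bash
# Simplified sketch of the orchestrator loop above. The agent spawn
# is replaced by an echo; the real orchestrator runs `claude -p` or
# `codex exec` and enforces the stop conditions from fleet.json.
set -euo pipefail

MAX_ITERATIONS=3
PLATEAU_THRESHOLD=3
: > results.tsv   # the agent appends tab-separated rows here

# Step 1: count trailing discard/crash rows in results.tsv.
trailing_discards() {
  awk -F'\t' '{ if ($3 == "discard" || $3 == "crash") n++; else n = 0 }
              END { print n + 0 }' results.tsv
}

for ((iter = 1; iter <= MAX_ITERATIONS; iter++)); do
  # Step 2: switch to the search prompt on plateau.
  if (( $(trailing_discards) >= PLATEAU_THRESHOLD )); then
    prompt="Plateaued: use WebSearch before coding, then follow program.md."
  else
    prompt="Follow program.md."
  fi
  # Step 3: spawn the agent (stubbed here).
  echo "iter $iter: $prompt"
done
```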

## Directory structure


```
$FLEET_ROOT/                    # The problem directory
  fleet.json                    # Fleet config
  program.md                    # Agent instructions (you write this)
  eval.py                       # Immutable eval harness (you write this)
  solution.py                   # Mutable file (agent edits this)
  results.tsv                   # Experiment log (agent updates, git-untracked)
  orchestrator.sh               # Generated by launch.sh
  .orch-state.json              # Iteration state
  .paused                       # Sentinel (touch to pause)
  logs/
    session-iter-1.jsonl        # Per-iteration session logs
    session-iter-2.jsonl
    ...
```

## fleet.json schema


```json
{
  "fleet_name": "optimize-api-latency",
  "type": "autoresearch",
  "config": {
    "model": "sonnet",
    "fallback_model": "haiku",
    "provider": "claude",
    "budget_per_iter": 2.00,
    "max_turns": 0
  },
  "problem": {
    "workdir": "/home/user/my-project",
    "eval_command": "make benchmark",
    "metric_regex": "^p99_latency_ms:\\s*([0-9.]+)",
    "metric_direction": "minimize",
    "results_file": "results.tsv",
    "program_md": "program.md"
  },
  "stop_when": {
    "max_iterations": 30,
    "cost_cap_usd": 15.0
  },
  "search": {
    "enabled": true,
    "plateau_threshold": 3
  }
}
```
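As a quick sanity check, the documented defaults can be applied while reading a config like this with `jq`. This assumes `jq` is installed; the field names come from the schema above, but the snippet itself is only an illustration, not part of the tooling:

```shell
#!/usr/bin/env bash
# Read fleet.json fields with jq, falling back to the documented
# defaults when a key is absent (// is jq's alternative operator).
set -euo pipefail

cat > fleet.json <<'EOF'
{
  "problem": { "eval_command": "make benchmark" },
  "search": { "plateau_threshold": 3 }
}
EOF

eval_cmd=$(jq -r '.problem.eval_command' fleet.json)
direction=$(jq -r '.problem.metric_direction // "minimize"' fleet.json)
max_iter=$(jq -r '.stop_when.max_iterations // 50' fleet.json)

echo "$eval_cmd / $direction / $max_iter"   # make benchmark / minimize / 50
```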

## Config fields


| Field | Default | Description |
|---|---|---|
| `config.model` | `sonnet` | Agent model |
| `config.fallback_model` | `haiku` | Fallback model (must differ from `model`) |
| `config.provider` | `claude` | `claude` or `codex` |
| `config.budget_per_iter` | `1.00` | Max USD per iteration |
| `config.max_turns` | `0` | Max agent turns (0 = unlimited) |
| `problem.workdir` | fleet root | Working directory: the repo/dir the agent operates in. Fleet root stores config + logs only. |
| `problem.eval_command` | required | Command to run evaluation (`python3 eval.py`, `make benchmark`, `pytest --tb=short`, etc.) |
| `problem.metric_regex` | (optional) | Regex to extract metric from eval output. Must have one capture group. Omit if eval prints a single number. |
| `problem.metric_direction` | `minimize` | `minimize` or `maximize` |
| `problem.results_file` | `results.tsv` | TSV log file (in workdir) |
| `problem.program_md` | `program.md` | Agent instructions file (checked in workdir first, then fleet root) |
| `stop_when.max_iterations` | `50` | Hard iteration limit |
| `stop_when.cost_cap_usd` | `0` | Total cost limit (0 = no limit) |
| `search.enabled` | `true` | Enable plateau-triggered web search |
| `search.plateau_threshold` | `3` | Consecutive discards before search |

## Required inputs


You need 3 things in a fleet root directory:
  1. `fleet.json`: points `problem.workdir` at the target repo, sets `eval_command`
  2. `program.md`: agent instructions. Must say NEVER STOP.
  3. An eval command: anything that outputs a metric

```
fleet-root/              ← fleet.json + program.md + logs
  fleet.json
  program.md
  logs/                  ← created automatically
your-repo/               ← workdir (agent operates here)
  src/...
  results.tsv            ← created automatically
```

## When the user doesn't specify a benchmark


If the user gives you a repo and a goal but no eval command:
  1. Check for existing benchmarks: look for `Makefile` targets (`make benchmark`, `make perf`), `package.json` scripts (`npm run bench`, `yarn test`), `pytest` markers (`pytest -m benchmark`), or `bench/` directories.
  2. If found: use it as `eval_command`. Set `metric_regex` if it doesn't print a single number.
  3. If not found: write a benchmark script (`bench.sh` or `bench.py`) in the workdir that:
    • Runs the relevant operation (API call, function invocation, build, test suite)
    • Measures the metric the user cares about (latency, pass rate, bundle size, etc.)
    • Prints a single number to stdout
    • Exits 0 on success, non-zero on crash
  4. Set `eval_command` to run this script.
Example: user says "optimize API latency in my Express app":
```bash
#!/usr/bin/env bash
# bench.sh: measure p99 latency
npm start &>/dev/null &
PID=$!
sleep 3
RESULT=$(curl -s -o /dev/null -w '%{time_total}' http://localhost:3000/api/health)
kill $PID 2>/dev/null
echo "$RESULT"
```

Then set `"eval_command": "bash bench.sh"` in fleet.json.

## Setup


  1. Create fleet root with `fleet.json` + `program.md` (+ `bench.sh` if you wrote one)
  2. `bash ${CLAUDE_SKILL_DIR}/scripts/launch.sh <fleet-root>` (git init + results.tsv auto-created)
  3. `bash ${CLAUDE_SKILL_DIR}/scripts/status.sh <fleet-root>` to monitor

## Available scripts


| Script | Purpose |
|---|---|
| `launch.sh <fleet-root> [--dry-run]` | Generate orchestrator.sh, spawn in tmux with monitor |
| `status.sh <fleet-root> [--watch]` | Show iteration, best metric, results.tsv, cost, plateau |
| `view.sh <fleet-root> <iter\|latest> [--follow]` | View parsed session events for a specific iteration |
| `report.sh <fleet-root> [--output file.md]` | Generate markdown summary after run completes |
| `pause.sh <fleet-root>` | Pause at next iteration boundary |
| `resume.sh <fleet-root>` | Resume paused fleet |
| `kill.sh <fleet-root>` | Hard stop: kill tmux, sweep orphans |

## program.md template


Your `program.md` should follow this structure (adapt to your problem):

```markdown
# autoresearch: <problem description>

## Setup

1. Explore the codebase to understand the architecture.
2. Read results.tsv for prior experiment history.
3. Run <eval_command> to establish a baseline.

## Rules

- Goal: <minimize|maximize> the metric.
- Make ONE change per experiment. Keep changes focused.
- <any constraints: don't touch tests, don't modify config, etc.>

## The experiment loop

LOOP FOREVER:
  1. Read results.tsv for context on what's been tried.
  2. Make ONE change to the codebase.
  3. git add -A && git commit -m "short description"
  4. Run: <eval_command>
  5. Record in results.tsv (tab-separated): commit  metric  status  description
  6. If metric improved: keep the commit.
  7. If worse or crash: git reset --hard HEAD~1 and log as discard/crash.
  8. Go to step 1.

NEVER STOP. Run until manually interrupted.
```

## Key design principles (from Karpathy)


  1. Git as state machine — improvement = advance branch; regression = reset
  2. Fixed eval — makes all experiments comparable
  3. results.tsv as shared memory — agent reads history to avoid repeating failures
  4. NEVER STOP — agent runs autonomously until killed
  5. Simplicity criterion — a small gain with ugly complexity is not worth it
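Principle 3 in practice: because results.tsv is a flat tab-separated log (columns: commit, metric, status, description), history queries are one-liners. A sketch with sample data, assuming a minimize objective:

```shell
#!/usr/bin/env bash
# Query the shared-memory log: what is the best kept metric so far?
# Columns: commit, metric, status, description; objective: minimize.
set -euo pipefail

printf 'c1\t2.41\tkeep\tbaseline\nc2\t2.10\tkeep\tconnection pooling\nc3\t2.55\tdiscard\tbigger batch size\n' > results.tsv

# Keep only "keep" rows, sort numerically, take the smallest metric.
best=$(awk -F'\t' '$3 == "keep" { print $2 }' results.tsv | sort -n | head -n 1)
echo "$best"   # 2.10
```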

## Open-world search (the extension)


When `search.enabled` is true, the orchestrator counts trailing discards in results.tsv. If the count reaches `search.plateau_threshold`, the next iteration gets a search-augmented prompt telling the agent to use WebSearch before coding.

This is validated: in experiment 009, search found Winograd's Strassen variant (15 additions vs 18), a technique not in the LLM's training data, breaking through a plateau where vanilla autoresearch was stuck.

Critical: plateau detection is done in bash (deterministic), not by the LLM. The agent miscounted consecutive discards in early experiments, hallucinating plateaus. External counting is mandatory.
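That deterministic counter can be as small as a single awk program. This is a sketch of the idea on sample data, not the exact code in the generated orchestrator.sh:

```shell
#!/usr/bin/env bash
# Count consecutive discard/crash rows at the tail of results.tsv.
# A "keep" anywhere resets the streak; only the trailing run counts.
set -euo pipefail

printf 'c1\t2.41\tkeep\tbaseline\nc2\t2.55\tdiscard\tcaching\nc3\t2.60\tdiscard\tbatching\nc4\t2.58\tcrash\tasync rewrite\n' > results.tsv

trailing=$(awk -F'\t' '{ if ($3 == "discard" || $3 == "crash") n++; else n = 0 }
                       END { print n + 0 }' results.tsv)
echo "$trailing"   # 3 -> meets the default plateau_threshold of 3
```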

## Rationalizations to reject


| Agent says | Rebuttal |
|---|---|
| "The agent should search every iteration for best results" | Search-on-plateau beats always-search. Most early searches are redundant and add latency. Only search when stuck (3+ discards). |
| "I should manage git from the orchestrator" | The agent handles git. It can fix commit messages, handle edge cases, and revert intelligently. The orchestrator just loops. |
| "The eval harness can reuse the same inputs" | Reusing inputs is gameable. The agent will discover identity-based memoization and optimize for the benchmark, not the problem. Use fresh seeded inputs per timed run. |
| "I should use iterative-fleet for this" | Iterative-fleet has a reviewer. Autoresearch has no reviewer: the eval script IS the quality gate. Different pattern, different skill. |
$ARGUMENTS