Karpathy-inspired autonomous research loop. Agent edits one file, evals, keeps or discards, repeats. Plateau-triggered web search breaks through ceilings. Git as state machine. Runs until stopped or budget exhausted.
```shell
npx skill4agent add quickcall-dev/skills autoresearch-fleet
```

```
┌─────────────────────────────────────────────────┐
│ orchestrator.sh                                 │
│                                                 │
│ for each iteration:                             │
│   1. Count trailing discards in results.tsv     │
│   2. If >= plateau_threshold → search prompt    │
│   3. Spawn agent (claude -p or codex exec)      │
│   4. Agent reads program.md, edits file, evals  │
│   5. Agent updates results.tsv, keeps/reverts   │
│   6. Check stop conditions (iter/cost/plateau)  │
│   7. Loop                                       │
└─────────────────────────────────────────────────┘
```

```
$FLEET_ROOT/                # The problem directory
  fleet.json                # Fleet config
  program.md                # Agent instructions (you write this)
  eval.py                   # Immutable eval harness (you write this)
  solution.py               # Mutable file (agent edits this)
  results.tsv               # Experiment log (agent updates, git-untracked)
  orchestrator.sh           # Generated by launch.sh
  .orch-state.json          # Iteration state
  .paused                   # Sentinel (touch to pause)
  logs/
    session-iter-1.jsonl    # Per-iteration session logs
    session-iter-2.jsonl
    ...
```

```json
{
  "fleet_name": "optimize-api-latency",
  "type": "autoresearch",
  "config": {
    "model": "sonnet",
    "fallback_model": "haiku",
    "provider": "claude",
    "budget_per_iter": 2.00,
    "max_turns": 0
  },
  "problem": {
    "workdir": "/home/user/my-project",
    "eval_command": "make benchmark",
    "metric_regex": "^p99_latency_ms:\\s*([0-9.]+)",
    "metric_direction": "minimize",
    "results_file": "results.tsv",
    "program_md": "program.md"
  },
  "stop_when": {
    "max_iterations": 30,
    "cost_cap_usd": 15.0
  },
  "search": {
    "enabled": true,
    "plateau_threshold": 3
  }
}
```

| Field | Default | Description |
|---|---|---|
| `model` | | Agent model |
| `fallback_model` | | Fallback model (must differ from model) |
| `provider` | | Agent provider (`claude` or `codex`) |
| `budget_per_iter` | | Max USD per iteration |
| `max_turns` | | Max agent turns (0 = unlimited) |
| `workdir` | fleet root | Working directory: the repo/dir the agent operates in. Fleet root stores config + logs only. |
| `eval_command` | required | Command to run evaluation |
| `metric_regex` | (optional) | Regex to extract the metric from eval output. Must have one capture group. Omit if eval prints a single number. |
| `metric_direction` | | `minimize` or `maximize` |
| `results_file` | | TSV log file (in workdir) |
| `program_md` | | Agent instructions file (checked in workdir first, then fleet root) |
| `max_iterations` | | Hard iteration limit |
| `cost_cap_usd` | | Total cost limit (0 = no limit) |
| `search.enabled` | | Enable plateau-triggered web search |
| `search.plateau_threshold` | | Consecutive discards before search |
Keep the fleet root separate from the workdir: `fleet.json` and `program.md` live in the fleet root, while `problem.workdir` points at the repo the agent edits and where `eval_command` runs.

```
fleet-root/          ← fleet.json + program.md + logs
  fleet.json
  program.md
  logs/              ← created automatically
your-repo/           ← workdir (agent operates here)
  src/...
  results.tsv        ← created automatically
```

Pick an `eval_command` from what the repo already provides: a `Makefile` (`make benchmark`, `make perf`), a `package.json` (`npm run bench`, `yarn test`), or `pytest` (`pytest -m benchmark`). If nothing fits, drop a `bench.sh` or `bench.py` into `bench/`, point `eval_command` at it, and pair it with `metric_regex` to extract the number:

```shell
#!/usr/bin/env bash
# bench.sh — crude latency probe: time a single request
npm start &>/dev/null &
PID=$!
sleep 3
RESULT=$(curl -s -o /dev/null -w '%{time_total}' http://localhost:3000/api/health)
kill $PID 2>/dev/null
echo "$RESULT"
```

Then in `fleet.json`:

```json
"eval_command": "bash bench.sh"
```

With `fleet.json`, `program.md`, and `bench.sh` in place, launch and monitor:

```shell
bash ${CLAUDE_SKILL_DIR}/scripts/launch.sh <fleet-root>
bash ${CLAUDE_SKILL_DIR}/scripts/status.sh <fleet-root>
```

| Script | Purpose |
|---|---|
| `launch.sh` | Generate orchestrator.sh, spawn in tmux with monitor |
| `status.sh` | Show iteration, best metric, results.tsv, cost, plateau |
| | View parsed session events for a specific iteration |
| | Generate markdown summary after run completes |
| | Pause at next iteration boundary |
| | Resume paused fleet |
| | Hard stop: kill tmux, sweep orphans |
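The plateau count that status reports (consecutive trailing discards in `results.tsv`) can be sketched like this. The helper name is illustrative, not the skill's actual code:

```python
def trailing_discards(tsv_text: str) -> int:
    """Count trailing discard/crash rows in results.tsv.

    Rows are tab-separated: commit, metric, status, description.
    """
    count = 0
    for line in reversed(tsv_text.strip().splitlines()):
        status = line.split("\t")[2]
        if status in ("discard", "crash"):
            count += 1
        else:
            break  # last kept experiment ends the streak
    return count

log = (
    "abc1\t150.0\tkeep\tbaseline\n"
    "abc2\t160.0\tdiscard\tadd cache\n"
    "abc3\t0\tcrash\tasync rewrite\n"
)
print(trailing_discards(log))  # 2
```

With `plateau_threshold: 3`, this log would not yet trigger a search prompt; one more discard would.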
A minimal `program.md` template:

```
# autoresearch: <problem description>

## Setup
1. Explore the codebase to understand the architecture.
2. Read `results.tsv` for prior experiment history.
3. Run `<eval_command>` to establish a baseline.

## Rules
- Goal: <minimize|maximize> the metric.
- Make ONE change per experiment. Keep changes focused.
- <any constraints: don't touch tests, don't modify config, etc.>

## The experiment loop
LOOP FOREVER:
1. Read results.tsv for context on what's been tried.
2. Make ONE change to the codebase.
3. `git add -A && git commit -m "short description"`
4. Run: `<eval_command>`
5. Record in results.tsv (tab-separated): `commit metric status description`
6. If metric improved: keep the commit.
7. If worse or crash: `git reset --hard HEAD~1` and log as discard/crash.
8. Go to step 1.

**NEVER STOP.** Run until manually interrupted.
```

Plateau-triggered search is governed by `search.enabled` and `search.plateau_threshold` in `fleet.json`.

| Agent says | Rebuttal |
|---|---|
| "The agent should search every iteration for best results" | Search-on-plateau beats always-search. Most early searches are redundant and add latency. Only search when stuck (3+ discards). |
| "I should manage git from the orchestrator" | The agent handles git. It can fix commit messages, handle edge cases, and revert intelligently. The orchestrator just loops. |
| "The eval harness can reuse the same inputs" | Reusing inputs is gameable. The agent will discover identity-based memoization and optimize for the benchmark, not the problem. Use fresh seeded inputs per timed run. |
| "I should use iterative-fleet for this" | Iterative-fleet has a reviewer. Autoresearch has no reviewer — the eval script IS the quality gate. Different pattern, different skill. |
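The fresh-seeded-inputs point above can be sketched as a small eval harness: every iteration gets different (but reproducible) inputs, so memoizing results for a fixed benchmark buys the agent nothing. All names here are illustrative assumptions, not the skill's code:

```python
import random
import time

def make_inputs(seed: int, n: int = 1000) -> list[int]:
    # Seeded RNG: reproducible for a given iteration, fresh across iterations.
    rng = random.Random(seed)
    return [rng.randint(0, 10**6) for _ in range(n)]

def evaluate(solution, iteration: int) -> float:
    """Time the solution on inputs derived from the iteration number."""
    inputs = make_inputs(seed=iteration)  # different data every iteration
    start = time.perf_counter()
    for x in inputs:
        solution(x)
    return (time.perf_counter() - start) * 1000.0  # elapsed ms

print(evaluate(lambda x: x * x, iteration=1) >= 0.0)  # True
```

Seeding from the iteration number keeps runs reproducible for debugging while still denying the agent a fixed input set to overfit.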