# Autoresearch
Autonomous experiment loop: try ideas, measure results, keep what works, discard what doesn't, never stop.
Works for any optimization target: test speed, bundle size, LLM training, build times, Lighthouse scores, binary size, latency, memory usage.
## Setup

If `autoresearch.md` already exists in the working directory, skip setup and resume the loop — read `autoresearch.md`, `autoresearch.jsonl`, and `git log`, then continue experimenting.

Otherwise:

- Gather context: Ask (or infer from `$ARGUMENTS` and conversation) the Goal, Command to benchmark, Primary metric (name + direction), Files in scope, and Constraints.
- Create branch: `git checkout -b autoresearch/<goal>-<date>` (e.g. `autoresearch/test-speed-2026-03-21`).
- Read source files: Understand the workload deeply before writing anything. Read every file in scope.
- Write session files: Create `autoresearch.md` and `autoresearch.sh` (see templates below). If constraints require correctness validation (tests must pass, types must check), also create `autoresearch.checks.sh`. Commit all.
- Run baseline: Execute the first experiment with no changes to establish the baseline metric.
- Start looping: Begin the experiment loop immediately after the baseline is logged.
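As a rough sketch, the branch, session-file, and baseline steps condense to the following (illustrative only: a throwaway repo stands in for the real project, and the benchmark script just echoes a fixed metric — neither is part of the skill):

```shell
#!/bin/bash
set -euo pipefail

# Throwaway repo standing in for the real project checkout.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@local commit -q --allow-empty -m "init"

# Step: create the session branch.
git checkout -q -b "autoresearch/test-speed-2026-03-21"

# Step: write a (here trivial) benchmark script and commit the session files.
cat > autoresearch.sh <<'EOF'
#!/bin/bash
set -euo pipefail
echo "METRIC total_µs=1234"
EOF
chmod +x autoresearch.sh

git add autoresearch.sh
git -c user.name=demo -c user.email=demo@local commit -q -m "autoresearch: session setup"

# Step: baseline run with no code changes — log this as run 1.
./autoresearch.sh
```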
## autoresearch.md

The heart of the session. A fresh agent with no context should be able to read this file alone and run the loop effectively. Invest time making it excellent.

```markdown
# Autoresearch: <goal>

## Objective

<Specific description of what we're optimizing and the workload.>

## Metrics

- Primary: <name> (<unit>, lower/higher is better)
- Secondary: <name>, <name>, ...

## How to Run

`./autoresearch.sh` — outputs `METRIC name=value` lines.

## Files in Scope

<Every file the agent may modify, with a brief note on what it does.>

## Off Limits

<What must NOT be touched — evaluation harness, data prep, etc.>

## Constraints

<Hard rules: tests must pass, no new deps, fixed time budget, etc.>

## What's Been Tried

<Update this section as experiments accumulate. Note key wins, dead ends,
and architectural insights so the agent doesn't repeat failed approaches.>
```

Update `autoresearch.md` periodically — especially "What's Been Tried" — so resuming agents have full context.

## autoresearch.sh

Bash script that runs the benchmark and outputs structured metrics.
bash
#!/bin/bash
set -euo pipefail用于运行基准测试并输出结构化指标的Bash脚本。
bash
#!/bin/bash
set -euo pipefailPre-checks (fast, <1s — catch syntax errors early)
预检查(快速,耗时<1秒——提前捕获语法错误)
python3 -c "import ast; ast.parse(open('train.py').read())"
python3 -c "import ast; ast.parse(open('train.py').read())"
Run benchmark
运行基准测试
uv run train.py > /tmp/autoresearch-output.log 2>&1
uv run train.py > /tmp/autoresearch-output.log 2>&1
Extract and output metrics as METRIC lines
提取并输出指标,格式为METRIC行
val_bpb=$(grep "^val_bpb:" /tmp/autoresearch-output.log | awk '{print $2}')
echo "METRIC val_bpb=$val_bpb"
Rules:
- Use `set -euo pipefail`.
- Output `METRIC name=value` lines to stdout (one per metric). The primary metric name must match what's documented in `autoresearch.md`.
- Metric names: word chars, dots, or `µ` (e.g. `val_bpb`, `total_µs`, `bundle.size_kb`).
- Keep the script fast — every second is multiplied by hundreds of runs.
- For fast/noisy benchmarks (<5s), run multiple times inside the script and report the median.
- Update the script during the loop as needed.val_bpb=$(grep "^val_bpb:" /tmp/autoresearch-output.log | awk '{print $2}')
echo "METRIC val_bpb=$val_bpb"
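The median rule above can be sketched like this (illustrative: `bench` is a stand-in for the real workload, and `date +%s%N` assumes GNU date):

```shell
#!/bin/bash
set -euo pipefail

# Stand-in workload; replace with the real benchmark command.
bench() { sleep 0.01; }

times=()
for _ in 1 2 3 4 5; do
  start=$(date +%s%N)                    # nanoseconds since epoch (GNU date)
  bench
  end=$(date +%s%N)
  times+=($(( (end - start) / 1000 )))   # elapsed microseconds
done

# Median = middle element of the sorted five samples.
median=$(printf '%s\n' "${times[@]}" | sort -n | sed -n '3p')
echo "METRIC total_µs=$median"
```

Five runs is usually enough to shake off one outlier while keeping the script fast.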
## autoresearch.checks.sh (optional)

Backpressure checks: tests, types, lint. Only create when constraints require correctness validation.
bash
#!/bin/bash
set -euo pipefail
pnpm test --run --reporter=dot 2>&1 | tail -50
pnpm typecheck 2>&1 | grep -i error || trueWhen this file exists:
- Run it after every passing benchmark (exit 0).
- If checks fail, log the experiment as and revert.
checks_failed - Check execution time does NOT affect the primary metric.
- Keep output minimal — suppress verbose progress, only show errors.
When this file does not exist, skip checks entirely.
## The Experiment Loop

LOOP FOREVER. Never ask "should I continue?" — the user expects autonomous work.

Each iteration:

- Formulate hypothesis: Based on prior results, source code understanding, and any ideas in `autoresearch.ideas.md`, choose what to try next.
- Edit code: Modify the in-scope files. Make a single, focused change per experiment.
- Commit: `git add -A && git commit -m "<short description of what this experiment tries>"`
- Run benchmark: `timeout 600 ./autoresearch.sh > run.log 2>&1`. If the command times out or crashes, treat it as a failure.
- Parse metrics: Extract `METRIC` lines from the output with `grep '^METRIC ' run.log`. If no METRIC lines are found, the run crashed — read `tail -50 run.log` for the error.
- Run checks (if `autoresearch.checks.sh` exists and the benchmark passed): `timeout 300 ./autoresearch.checks.sh > checks.log 2>&1`
- Evaluate and log:
  - Improved (primary metric better than best so far) → status `keep`. The commit stays.
  - Worse or equal → status `discard`. Revert: stage autoresearch files first, then reset.
  - Crash (benchmark failed) → status `crash`. Fix if trivial, otherwise revert and move on.
  - Checks failed → status `checks_failed`. Revert.
- Log to JSONL: Append one line to `autoresearch.jsonl`:

  ```json
  {"run":1,"commit":"a1b2c3d","metric":0.9979,"metrics":{"val_bpb":0.9979,"peak_vram_mb":45060.2},"status":"keep","description":"baseline","timestamp":1711036800000,"confidence":null}
  ```

- On discard/crash/checks_failed, revert code changes:

  ```bash
  # Preserve autoresearch session files, revert everything else
  git add autoresearch.jsonl autoresearch.md autoresearch.sh autoresearch.ideas.md autoresearch.checks.sh 2>/dev/null || true
  git checkout -- .
  git clean -fd
  ```

- Check confidence: After 3+ runs, run the confidence script from the skill's installation directory, or locate it via the skill path and run it directly: `bash "$(dirname "$(readlink -f "$0")")/scripts/confidence.sh"`. Interpret the score:
  - >= 2.0x: Improvement is likely real (green).
  - 1.0-2.0x: Above noise but marginal (yellow).
  - < 1.0x: Within noise — consider re-running to confirm (red).
- Update session: Periodically update the "What's Been Tried" section of `autoresearch.md` and run the summary script to review progress.

Repeat forever until interrupted.
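The contents of `scripts/confidence.sh` aren't shown here, but a MAD-based score of this kind can be sketched as improvement divided by run-to-run noise — all numbers below are invented samples, and the real script may compute something different:

```shell
#!/bin/bash
set -euo pipefail

# Invented noise samples: metrics from discarded runs near the old best.
samples="1.001 0.999 1.002 0.998 1.000"
baseline=1.000
best=0.992

# Median of the five samples, then the median absolute deviation (MAD).
median=$(printf '%s\n' $samples | sort -n | sed -n '3p')
mad=$(printf '%s\n' $samples \
  | awk -v m="$median" '{ d = $1 - m; if (d < 0) d = -d; print d }' \
  | sort -n | sed -n '3p')

# Score = |best - baseline| / MAD; >= 2.0 means the win clears the noise floor.
score=$(awk -v a="$baseline" -v b="$best" -v mad="$mad" \
  'BEGIN { d = a - b; if (d < 0) d = -d; printf "%.1f", d / mad }')
echo "confidence=${score}x"    # prints confidence=8.0x for these samples
```

MAD is used instead of standard deviation because a single crashed or pathological run shouldn't inflate the noise estimate.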
## JSONL Schema

Each line in `autoresearch.jsonl` is a JSON object:

| Field | Type | Description |
|---|---|---|
| `run` | number | 1-indexed experiment count |
| `commit` | string | Short git SHA (7 chars) |
| `metric` | number | Primary metric value |
| `metrics` | object | All metrics dict (primary + secondary) |
| `status` | string | One of `keep`, `discard`, `crash`, `checks_failed` |
| `description` | string | What this experiment tried |
| `timestamp` | number | Unix timestamp (ms) |
| `confidence` | number or null | MAD-based confidence score (null if <3 runs) |
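A record matching this schema can be appended safely with `jq` (assuming `jq` is available — hand-rolled `printf` escaping breaks on quotes in descriptions; the path and values below are illustrative):

```shell
#!/bin/bash
set -euo pipefail

log=/tmp/autoresearch-demo.jsonl   # illustrative; the real file is ./autoresearch.jsonl

# -n: no input; -c: compact one-line output, as JSONL requires.
jq -nc \
  --argjson run 4 \
  --arg commit "a1b2c3d" \
  --argjson metric 0.9871 \
  --arg desc "illustrative experiment description" \
  '{run: $run, commit: $commit, metric: $metric,
    metrics: {val_bpb: $metric},
    status: "keep", description: $desc,
    timestamp: (now * 1000 | floor), confidence: null}' \
  >> "$log"

tail -1 "$log"
```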
## Resuming

When `autoresearch.md` exists in the working directory:

- Read `autoresearch.md` for full context (objective, what's been tried, constraints).
- Read `autoresearch.jsonl` to reconstruct state (best metric, run count, last segment).
- Read `git log --oneline -20` for recent commit history.
- Check `autoresearch.ideas.md` if it exists — prune stale entries, experiment with promising ones.
- Continue the loop from where it left off. Do not re-run the baseline.
## Ideas Backlog

When you discover complex but promising optimizations you won't pursue right now, append them as bullets to `autoresearch.ideas.md`. Don't let good ideas get lost.

On resume, check this file — prune stale/tried entries, experiment with the rest. When all paths are exhausted, delete the file and write a final summary to `autoresearch.md`.

## Loop Rules
See `references/loop-rules.md` for the full reference. Key rules:

- Primary metric is king. Improved → keep. Worse/equal → discard.
- Simpler is better. Remove code for equal perf = keep. Ugly complexity for tiny gain = discard.
- Don't thrash. Repeatedly reverting the same idea? Try something structurally different.
- Think longer when stuck. Re-read source files, reason about what the CPU/compiler/runtime is actually doing. Deep understanding beats random variation.
- Crashes: fix if trivial (typo, missing import), otherwise log and move on. Don't over-invest.
- NEVER STOP. The user may be away for hours. Keep going until interrupted.
## User Messages During Experiments
If the user sends a message while an experiment is running, finish the current run-evaluate-log cycle first, then incorporate their feedback in the next iteration.