# Autoresearch

Autonomous experiment loop: try ideas, measure results, keep what works, discard what doesn't, never stop.
Works for any optimization target: test speed, bundle size, LLM training, build times, Lighthouse scores, binary size, latency, memory usage.

## Setup

If `autoresearch.md` already exists in the working directory, skip setup and resume the loop: read `autoresearch.md`, `autoresearch.jsonl`, and `git log`, then continue experimenting.

Otherwise:

1. Gather context: Ask (or infer from `$ARGUMENTS` and conversation) the Goal, Command to benchmark, Primary metric (name + direction), Files in scope, and Constraints.
2. Create branch: `git checkout -b autoresearch/<goal>-<date>` (e.g. `autoresearch/test-speed-2026-03-21`).
3. Read source files: Understand the workload deeply before writing anything. Read every file in scope.
4. Write session files: Create `autoresearch.md` and `autoresearch.sh` (see templates below). If constraints require correctness validation (tests must pass, types must check), also create `autoresearch.checks.sh`. Commit all.
5. Run baseline: Execute the first experiment with no changes to establish the baseline metric.
6. Start looping: Begin the experiment loop immediately after the baseline is logged.
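The setup-or-resume branch at the top of this section can be sketched as a plain existence check. The `decide` helper and the scratch directory are illustrative only, not part of the skill:

```shell
#!/bin/bash
set -euo pipefail

# Work in a scratch directory so the sketch is self-contained.
cd "$(mktemp -d)"

# Resume when a session file exists; otherwise run setup steps 1-6.
decide() { [ -f autoresearch.md ] && echo resume || echo setup; }

first=$(decide)        # no session file yet
touch autoresearch.md
second=$(decide)       # session file present
echo "$first -> $second"
```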

## autoresearch.md

The heart of the session. A fresh agent with no context should be able to read this file alone and run the loop effectively. Invest time making it excellent.

```markdown
# Autoresearch: <goal>

## Objective

<Specific description of what we're optimizing and the workload.>

## Metrics

- Primary: <name> (<unit>, lower/higher is better)
- Secondary: <name>, <name>, ...

## How to Run

`./autoresearch.sh` outputs `METRIC name=value` lines.

## Files in Scope

<Every file the agent may modify, with a brief note on what it does.>

## Off Limits

<What must NOT be touched: evaluation harness, data prep, etc.>

## Constraints

<Hard rules: tests must pass, no new deps, fixed time budget, etc.>

## What's Been Tried

<Update this section as experiments accumulate. Note key wins, dead ends, and architectural insights so the agent doesn't repeat failed approaches.>
```

Update `autoresearch.md` periodically, especially "What's Been Tried", so resuming agents have full context.

## autoresearch.sh

Bash script that runs the benchmark and outputs structured metrics.

```bash
#!/bin/bash
set -euo pipefail

# Pre-checks (fast, <1s: catch syntax errors early)
python3 -c "import ast; ast.parse(open('train.py').read())"

# Run benchmark
uv run train.py > /tmp/autoresearch-output.log 2>&1

# Extract and output metrics as METRIC lines
val_bpb=$(grep "^val_bpb:" /tmp/autoresearch-output.log | awk '{print $2}')
echo "METRIC val_bpb=$val_bpb"
```

Rules:

- Use `set -euo pipefail`.
- Output `METRIC name=value` lines to stdout (one per metric). The primary metric name must match what's documented in `autoresearch.md`.
- Metric names: word chars, dots, or `µ` (e.g. `val_bpb`, `total_µs`, `bundle.size_kb`).
- Keep the script fast: every second is multiplied by hundreds of runs.
- For fast/noisy benchmarks (<5s), run multiple times inside the script and report the median.
- Update the script during the loop as needed.
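The median rule for fast, noisy benchmarks might be sketched like this, with made-up sample values standing in for repeated timed runs of the real workload:

```shell
#!/bin/bash
set -euo pipefail

# Five noisy samples from the same benchmark (illustrative values, in ms).
samples=(104 99 187 101 103)   # note the 187 outlier

# Median = 3rd of 5 sorted values; robust to the outlier, unlike the mean.
median=$(printf '%s\n' "${samples[@]}" | sort -n | sed -n '3p')
echo "METRIC total_ms=$median"
```

Reporting the median rather than the mean keeps a single slow run (GC pause, cold cache) from masquerading as a regression.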

## autoresearch.checks.sh (optional)

Backpressure checks: tests, types, lint. Only create when constraints require correctness validation.

```bash
#!/bin/bash
set -euo pipefail
pnpm test --run --reporter=dot 2>&1 | tail -50
pnpm typecheck 2>&1 | grep -i error || true
```

When this file exists:

- Run it after every passing benchmark (exit 0).
- If checks fail, log the experiment as `checks_failed` and revert.
- Check execution time does NOT affect the primary metric.
- Keep output minimal: suppress verbose progress, only show errors.

When this file does not exist, skip checks entirely.
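The exists-then-run contract can be sketched as below. The `check_status` helper is hypothetical, and the failing checks file is fabricated just to exercise the `checks_failed` path:

```shell
#!/bin/bash
set -euo pipefail
cd "$(mktemp -d)"   # scratch dir so the sketch is self-contained

# Returns the status for a run whose benchmark already passed.
check_status() {
  if [ -x autoresearch.checks.sh ]; then
    timeout 300 ./autoresearch.checks.sh > checks.log 2>&1 || { echo checks_failed; return; }
  fi
  echo keep
}

no_file=$(check_status)                 # no checks file: skip entirely
printf '#!/bin/bash\nexit 1\n' > autoresearch.checks.sh
chmod +x autoresearch.checks.sh
failing=$(check_status)                 # failing checks: checks_failed
echo "$no_file / $failing"
```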

## The Experiment Loop

LOOP FOREVER. Never ask "should I continue?" The user expects autonomous work.

Each iteration:

1. Formulate hypothesis: Based on prior results, source code understanding, and any ideas in `autoresearch.ideas.md`, choose what to try next.
2. Edit code: Modify the in-scope files. Make a single, focused change per experiment.
3. Commit: `git add -A && git commit -m "<short description of what this experiment tries>"`
4. Run benchmark:

   ```bash
   timeout 600 ./autoresearch.sh > run.log 2>&1
   ```

   If the command times out or crashes, treat it as a failure.
5. Parse metrics: Extract `METRIC` lines from the output:

   ```bash
   grep '^METRIC ' run.log
   ```

   If no METRIC lines are found, the run crashed; read `tail -50 run.log` for the error.
6. Run checks (if `autoresearch.checks.sh` exists and the benchmark passed):

   ```bash
   timeout 300 ./autoresearch.checks.sh > checks.log 2>&1
   ```

7. Evaluate and log:
   - Improved (primary metric better than best so far) → status `keep`. The commit stays.
   - Worse or equal → status `discard`. Revert: stage autoresearch files first, then reset.
   - Crash (benchmark failed) → status `crash`. Fix if trivial, otherwise revert and move on.
   - Checks failed → status `checks_failed`. Revert.
8. Log to JSONL: Append one line to `autoresearch.jsonl`:

   ```json
   {"run":1,"commit":"a1b2c3d","metric":0.9979,"metrics":{"val_bpb":0.9979,"peak_vram_mb":45060.2},"status":"keep","description":"baseline","timestamp":1711036800000,"confidence":null}
   ```

9. On discard/crash/checks_failed, revert code changes:

   ```bash
   # Preserve autoresearch session files, revert everything else
   git add autoresearch.jsonl autoresearch.md autoresearch.sh autoresearch.ideas.md autoresearch.checks.sh 2>/dev/null || true
   git checkout -- .
   git clean -fd
   ```

10. Check confidence: After 3+ runs, run the confidence script from the skill's installation directory:

    ```bash
    bash "$(dirname "$(readlink -f "$0")")/scripts/confidence.sh"
    ```

    Or locate it via the skill path and run it directly. Interpret the score:
    - >= 2.0x: Improvement is likely real (green).
    - 1.0-2.0x: Above noise but marginal (yellow).
    - < 1.0x: Within noise; consider re-running to confirm (red).
11. Update session: Periodically update the "What's Been Tried" section of `autoresearch.md` and run the summary script to review progress.

Repeat forever until interrupted.
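Steps 5 and 8 can be sketched together. The `run.log` content below is faked for illustration, and the appended record uses a trimmed subset of the schema fields:

```shell
#!/bin/bash
set -euo pipefail
cd "$(mktemp -d)"   # scratch dir so the sketch is self-contained

# Fake benchmark output; in the loop this comes from ./autoresearch.sh
printf 'epoch done\nMETRIC val_bpb=0.9979\n' > run.log

# Step 5: parse the primary metric from its METRIC line
metric=$(grep '^METRIC val_bpb=' run.log | cut -d= -f2)

# Step 8: append one single-line JSON record
printf '{"run":%d,"commit":"%s","metric":%s,"status":"%s","description":"%s"}\n' \
  1 a1b2c3d "$metric" keep baseline >> autoresearch.jsonl

tail -1 autoresearch.jsonl
```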

## JSONL Schema

Each line in `autoresearch.jsonl` is a JSON object:

| Field | Type | Description |
| --- | --- | --- |
| `run` | number | 1-indexed experiment count |
| `commit` | string | Short git SHA (7 chars) |
| `metric` | number | Primary metric value |
| `metrics` | object | All metrics dict (primary + secondary) |
| `status` | string | `keep`, `discard`, `crash`, or `checks_failed` |
| `description` | string | What this experiment tried |
| `timestamp` | number | Unix timestamp (ms) |
| `confidence` | number or null | MAD-based confidence score (null if <3 runs) |
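Because downstream tooling reads this file line by line, every appended record must be valid single-line JSON. A quick sanity check, assuming `python3` is available as in the pre-check example:

```shell
#!/bin/bash
set -euo pipefail

line='{"run":1,"commit":"a1b2c3d","metric":0.9979,"metrics":{"val_bpb":0.9979,"peak_vram_mb":45060.2},"status":"keep","description":"baseline","timestamp":1711036800000,"confidence":null}'

# json.tool exits nonzero on malformed input, so set -e aborts on a bad record.
echo "$line" | python3 -m json.tool > /dev/null
echo "valid"
```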

## Resuming

When `autoresearch.md` exists in the working directory:

1. Read `autoresearch.md` for full context (objective, what's been tried, constraints).
2. Read `autoresearch.jsonl` to reconstruct state (best metric, run count, last segment).
3. Read `git log --oneline -20` for recent commit history.
4. Check `autoresearch.ideas.md` if it exists: prune stale entries, experiment with promising ones.
5. Continue the loop from where it left off. Do not re-run the baseline.
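Step 2's state reconstruction can be sketched with standard tools, assuming a lower-is-better primary metric and trimmed illustrative records:

```shell
#!/bin/bash
set -euo pipefail
cd "$(mktemp -d)"   # scratch dir so the sketch is self-contained

# Illustrative session log with three runs
cat > autoresearch.jsonl <<'EOF'
{"run":1,"metric":0.9979,"status":"keep","description":"baseline"}
{"run":2,"metric":1.0012,"status":"discard","description":"fuse kernels"}
{"run":3,"metric":0.9871,"status":"keep","description":"cache tokenizer"}
EOF

run_count=$(wc -l < autoresearch.jsonl)
# Best = lowest metric among kept runs; grep -o pulls out the numeric field
best=$(grep '"status":"keep"' autoresearch.jsonl \
  | grep -o '"metric":[0-9.]*' | cut -d: -f2 | sort -g | head -1)
echo "runs=$run_count best=$best"
```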

## Ideas Backlog

When you discover complex but promising optimizations you won't pursue right now, append them as bullets to `autoresearch.ideas.md`. Don't let good ideas get lost.

On resume, check this file: prune stale/tried entries, experiment with the rest. When all paths are exhausted, delete the file and write a final summary to `autoresearch.md`.

## Loop Rules

See `references/loop-rules.md` for the full reference. Key rules:

- Primary metric is king. Improved → keep. Worse/equal → discard.
- Simpler is better. Removing code at equal perf = keep. Ugly complexity for a tiny gain = discard.
- Don't thrash. Repeatedly reverting the same idea? Try something structurally different.
- Think longer when stuck. Re-read source files, reason about what the CPU/compiler/runtime is actually doing. Deep understanding beats random variation.
- Crashes: fix if trivial (typo, missing import), otherwise log and move on. Don't over-invest.
- NEVER STOP. The user may be away for hours. Keep going until interrupted.

## User Messages During Experiments

If the user sends a message while an experiment is running, finish the current run-evaluate-log cycle first, then incorporate their feedback in the next iteration.