# Autoresearch

Autonomous experiment loop: try ideas, measure results, keep what works, discard what doesn't, never stop.
Works for any optimization target: test speed, bundle size, LLM training, build times, Lighthouse scores, binary size, latency, memory usage.

## Setup

If `autoresearch.md` already exists in the working directory, skip setup and resume the loop: read `autoresearch.md`, `autoresearch.jsonl`, and `git log`, then continue experimenting.

Otherwise:

1. Gather context: Ask (or infer from `$ARGUMENTS` and conversation) the Goal, Command to benchmark, Primary metric (name + direction), Files in scope, and Constraints.
2. Create branch: `git checkout -b autoresearch/<goal>-<date>` (e.g. `autoresearch/test-speed-2026-03-21`).
3. Read source files: Understand the workload deeply before writing anything. Read every file in scope.
4. Write session files: Create `autoresearch.md` and `autoresearch.sh` (see templates below). If constraints require correctness validation (tests must pass, types must check), also create `autoresearch.checks.sh`. Commit all.
5. Run baseline: Execute the first experiment with no changes to establish the baseline metric.
6. Start looping: Begin the experiment loop immediately after the baseline is logged.
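The setup-or-resume branch at the top of this section can be sketched as a plain existence check. The `decide` helper and the scratch directory are illustrative only, not part of the skill:

```shell
#!/bin/bash
set -euo pipefail

# Work in a scratch directory so the sketch is self-contained.
cd "$(mktemp -d)"

# Resume when a session file exists; otherwise run setup steps 1-6.
decide() { [ -f autoresearch.md ] && echo resume || echo setup; }

first=$(decide)        # no session file yet
touch autoresearch.md
second=$(decide)       # session file present
echo "$first -> $second"
```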

## autoresearch.md

The heart of the session. A fresh agent with no context should be able to read this file alone and run the loop effectively. Invest time making it excellent.

```markdown
# Autoresearch: <goal>

## Objective

<Specific description of what we're optimizing and the workload.>

## Metrics

- Primary: <name> (<unit>, lower/higher is better)
- Secondary: <name>, <name>, ...

## How to Run

`./autoresearch.sh` outputs `METRIC name=value` lines.

## Files in Scope

<Every file the agent may modify, with a brief note on what it does.>

## Off Limits

<What must NOT be touched: evaluation harness, data prep, etc.>

## Constraints

<Hard rules: tests must pass, no new deps, fixed time budget, etc.>

## What's Been Tried

<Update this section as experiments accumulate. Note key wins, dead ends, and architectural insights so the agent doesn't repeat failed approaches.>
```

Update `autoresearch.md` periodically, especially "What's Been Tried", so resuming agents have full context.

## autoresearch.sh

Bash script that runs the benchmark and outputs structured metrics.

```bash
#!/bin/bash
set -euo pipefail

# Pre-checks (fast, <1s: catch syntax errors early)
python3 -c "import ast; ast.parse(open('train.py').read())"

# Run benchmark
uv run train.py > /tmp/autoresearch-output.log 2>&1

# Extract and output metrics as METRIC lines
val_bpb=$(grep "^val_bpb:" /tmp/autoresearch-output.log | awk '{print $2}')
echo "METRIC val_bpb=$val_bpb"
```

Rules:

- Use `set -euo pipefail`.
- Output `METRIC name=value` lines to stdout (one per metric). The primary metric name must match what's documented in `autoresearch.md`.
- Metric names: word chars, dots, or `µ` (e.g. `val_bpb`, `total_µs`, `bundle.size_kb`).
- Keep the script fast: every second is multiplied by hundreds of runs.
- For fast/noisy benchmarks (<5s), run multiple times inside the script and report the median.
- Update the script during the loop as needed.
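The median rule for fast, noisy benchmarks might be sketched like this, with made-up sample values standing in for repeated timed runs of the real workload:

```shell
#!/bin/bash
set -euo pipefail

# Five noisy samples from the same benchmark (illustrative values, in ms).
samples=(104 99 187 101 103)   # note the 187 outlier

# Median = 3rd of 5 sorted values; robust to the outlier, unlike the mean.
median=$(printf '%s\n' "${samples[@]}" | sort -n | sed -n '3p')
echo "METRIC total_ms=$median"
```

Reporting the median rather than the mean keeps a single slow run (GC pause, cold cache) from masquerading as a regression.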

## autoresearch.checks.sh (optional)

Backpressure checks: tests, types, lint. Only create when constraints require correctness validation.

```bash
#!/bin/bash
set -euo pipefail
pnpm test --run --reporter=dot 2>&1 | tail -50
pnpm typecheck 2>&1 | grep -i error || true
```

When this file exists:

- Run it after every passing benchmark (exit 0).
- If checks fail, log the experiment as `checks_failed` and revert.
- Check execution time does NOT affect the primary metric.
- Keep output minimal: suppress verbose progress, only show errors.

When this file does not exist, skip checks entirely.
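The exists-then-run contract can be sketched as below. The `check_status` helper is hypothetical, and the failing checks file is fabricated just to exercise the `checks_failed` path:

```shell
#!/bin/bash
set -euo pipefail
cd "$(mktemp -d)"   # scratch dir so the sketch is self-contained

# Returns the status for a run whose benchmark already passed.
check_status() {
  if [ -x autoresearch.checks.sh ]; then
    timeout 300 ./autoresearch.checks.sh > checks.log 2>&1 || { echo checks_failed; return; }
  fi
  echo keep
}

no_file=$(check_status)                 # no checks file: skip entirely
printf '#!/bin/bash\nexit 1\n' > autoresearch.checks.sh
chmod +x autoresearch.checks.sh
failing=$(check_status)                 # failing checks: checks_failed
echo "$no_file / $failing"
```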

## The Experiment Loop

LOOP FOREVER. Never ask "should I continue?" The user expects autonomous work.

Each iteration:

1. Formulate hypothesis: Based on prior results, source code understanding, and any ideas in `autoresearch.ideas.md`, choose what to try next.
2. Edit code: Modify the in-scope files. Make a single, focused change per experiment.
3. Commit: `git add -A && git commit -m "<short description of what this experiment tries>"`
4. Run benchmark:

   ```bash
   timeout 600 ./autoresearch.sh > run.log 2>&1
   ```

   If the command times out or crashes, treat it as a failure.
5. Parse metrics: Extract `METRIC` lines from the output:

   ```bash
   grep '^METRIC ' run.log
   ```

   If no METRIC lines are found, the run crashed; read `tail -50 run.log` for the error.
6. Run checks (if `autoresearch.checks.sh` exists and the benchmark passed):

   ```bash
   timeout 300 ./autoresearch.checks.sh > checks.log 2>&1
   ```

7. Evaluate and log:
   - Improved (primary metric better than best so far) → status `keep`. The commit stays.
   - Worse or equal → status `discard`. Revert: stage autoresearch files first, then reset.
   - Crash (benchmark failed) → status `crash`. Fix if trivial, otherwise revert and move on.
   - Checks failed → status `checks_failed`. Revert.
8. Log to JSONL: Append one line to `autoresearch.jsonl`:

   ```json
   {"run":1,"commit":"a1b2c3d","metric":0.9979,"metrics":{"val_bpb":0.9979,"peak_vram_mb":45060.2},"status":"keep","description":"baseline","timestamp":1711036800000,"confidence":null}
   ```

9. On discard/crash/checks_failed, revert code changes:

   ```bash
   # Preserve autoresearch session files, revert everything else
   git add autoresearch.jsonl autoresearch.md autoresearch.sh autoresearch.ideas.md autoresearch.checks.sh 2>/dev/null || true
   git checkout -- .
   git clean -fd
   ```

10. Check confidence: After 3+ runs, run the confidence script from the skill's installation directory:

    ```bash
    bash "$(dirname "$(readlink -f "$0")")/scripts/confidence.sh"
    ```

    Or locate it via the skill path and run it directly. Interpret the score:
    - >= 2.0x: Improvement is likely real (green).
    - 1.0-2.0x: Above noise but marginal (yellow).
    - < 1.0x: Within noise; consider re-running to confirm (red).
11. Update session: Periodically update the "What's Been Tried" section of `autoresearch.md` and run the summary script to review progress.

Repeat forever until interrupted.
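Steps 5 and 8 can be sketched together. The `run.log` content below is faked for illustration, and the appended record uses a trimmed subset of the schema fields:

```shell
#!/bin/bash
set -euo pipefail
cd "$(mktemp -d)"   # scratch dir so the sketch is self-contained

# Fake benchmark output; in the loop this comes from ./autoresearch.sh
printf 'epoch done\nMETRIC val_bpb=0.9979\n' > run.log

# Step 5: parse the primary metric from its METRIC line
metric=$(grep '^METRIC val_bpb=' run.log | cut -d= -f2)

# Step 8: append one single-line JSON record
printf '{"run":%d,"commit":"%s","metric":%s,"status":"%s","description":"%s"}\n' \
  1 a1b2c3d "$metric" keep baseline >> autoresearch.jsonl

tail -1 autoresearch.jsonl
```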

## JSONL Schema

Each line in `autoresearch.jsonl` is a JSON object:

| Field | Type | Description |
| --- | --- | --- |
| `run` | number | 1-indexed experiment count |
| `commit` | string | Short git SHA (7 chars) |
| `metric` | number | Primary metric value |
| `metrics` | object | All metrics dict (primary + secondary) |
| `status` | string | `keep`, `discard`, `crash`, or `checks_failed` |
| `description` | string | What this experiment tried |
| `timestamp` | number | Unix timestamp (ms) |
| `confidence` | number or null | MAD-based confidence score (null if <3 runs) |
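Because downstream tooling reads this file line by line, every appended record must be valid single-line JSON. A quick sanity check, assuming `python3` is available as in the pre-check example:

```shell
#!/bin/bash
set -euo pipefail

line='{"run":1,"commit":"a1b2c3d","metric":0.9979,"metrics":{"val_bpb":0.9979,"peak_vram_mb":45060.2},"status":"keep","description":"baseline","timestamp":1711036800000,"confidence":null}'

# json.tool exits nonzero on malformed input, so set -e aborts on a bad record.
echo "$line" | python3 -m json.tool > /dev/null
echo "valid"
```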

## Resuming

When `autoresearch.md` exists in the working directory:

1. Read `autoresearch.md` for full context (objective, what's been tried, constraints).
2. Read `autoresearch.jsonl` to reconstruct state (best metric, run count, last segment).
3. Read `git log --oneline -20` for recent commit history.
4. Check `autoresearch.ideas.md` if it exists: prune stale entries, experiment with promising ones.
5. Continue the loop from where it left off. Do not re-run the baseline.
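Step 2's state reconstruction can be sketched with standard tools, assuming a lower-is-better primary metric and trimmed illustrative records:

```shell
#!/bin/bash
set -euo pipefail
cd "$(mktemp -d)"   # scratch dir so the sketch is self-contained

# Illustrative session log with three runs
cat > autoresearch.jsonl <<'EOF'
{"run":1,"metric":0.9979,"status":"keep","description":"baseline"}
{"run":2,"metric":1.0012,"status":"discard","description":"fuse kernels"}
{"run":3,"metric":0.9871,"status":"keep","description":"cache tokenizer"}
EOF

run_count=$(wc -l < autoresearch.jsonl)
# Best = lowest metric among kept runs; grep -o pulls out the numeric field
best=$(grep '"status":"keep"' autoresearch.jsonl \
  | grep -o '"metric":[0-9.]*' | cut -d: -f2 | sort -g | head -1)
echo "runs=$run_count best=$best"
```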

## Ideas Backlog

When you discover complex but promising optimizations you won't pursue right now, append them as bullets to `autoresearch.ideas.md`. Don't let good ideas get lost.

On resume, check this file: prune stale/tried entries, experiment with the rest. When all paths are exhausted, delete the file and write a final summary to `autoresearch.md`.

## Loop Rules

See `references/loop-rules.md` for the full reference. Key rules:

- Primary metric is king. Improved → keep. Worse/equal → discard.
- Simpler is better. Removing code at equal perf = keep. Ugly complexity for a tiny gain = discard.
- Don't thrash. Repeatedly reverting the same idea? Try something structurally different.
- Think longer when stuck. Re-read source files, reason about what the CPU/compiler/runtime is actually doing. Deep understanding beats random variation.
- Crashes: fix if trivial (typo, missing import), otherwise log and move on. Don't over-invest.
- NEVER STOP. The user may be away for hours. Keep going until interrupted.

## User Messages During Experiments

If the user sends a message while an experiment is running, finish the current run-evaluate-log cycle first, then incorporate their feedback in the next iteration.