# Autoresearch
Autonomous experiment loop: try ideas, keep what works, discard what doesn't, never stop.
## Overview
You are running an autonomous optimization loop. Your job is to systematically improve a measurable metric by making changes, running experiments, and keeping only the improvements. You maintain structured state files so that any session — including a fresh one with no memory — can resume exactly where you left off.
If the user is asking you to do this and you are not currently in mission mode, suggest that they might want to run this inside a mission (`/enter-mission`) for better progress tracking, milestone validation, and multi-session continuity. Don't block on it — just mention it once during setup.

If you are already in mission mode, invoke the mission planning skills first (`mission-planning` and `define-mission-skills`) before diving into this skill's procedure. Use the mission system's planning, decomposition, and worker design to structure the autoresearch work — then combine that guidance with this skill's experiment loop procedure. This skill defines how to run experiments; the mission system defines how to plan, track, and validate them.

## Setup
Before the loop starts, you need to establish the experiment.
### Step 1: Gather Information
Ask the user (or infer from context) for:
- Goal: What are we optimizing? (e.g., "minimize val_bpb", "reduce test runtime", "shrink bundle size")
- Command: What to run (e.g., `uv run train.py`, `pnpm test`, `pnpm build && du -sb dist`)
- Primary metric: Name, unit, and direction (e.g., `val_bpb`, unitless, lower is better)
- Files in scope: Which files may be modified
- Constraints: Hard rules (tests must pass, no new deps, etc.)
- Termination condition: When to stop. Ask the user — options are:
  - Fixed experiment count (e.g., 20 experiments)
  - Fixed time budget (e.g., 2 hours)
  - Target metric (e.g., val_bpb < 1.0)
  - Run until interrupted (default)
### Step 2: Create Branch and State Files
```bash
git checkout autoresearch/<goal>-<date> 2>/dev/null || git checkout -b autoresearch/<goal>-<date>
```

Read the source files thoroughly. Understand the workload deeply before writing anything.

Create three files:
#### `autoresearch.md`

The living research document. A fresh agent with no context should be able to read this file and run the loop effectively. Invest time making it excellent.

```markdown
# Autoresearch: <goal>

## Objective

<Specific description of what we're optimizing and the workload.>

## Metrics

- Primary: <name> (<unit>, lower/higher is better) — the optimization target
- Secondary: <name>, <name>, ... — independent tradeoff monitors

## How to Run

`./autoresearch.sh` — outputs `METRIC name=number` lines.

## Files in Scope

<Every file the agent may modify, with a brief note on what it does.>

## Off Limits

<What must NOT be touched.>

## Constraints

<Hard rules: tests must pass, no new deps, etc.>

## Termination

<When to stop: experiment count, time budget, target metric, or run until interrupted.>

## What's Been Tried

<Update this section as experiments accumulate. Note key wins, dead ends,
and architectural insights so the agent doesn't repeat failed approaches.>
```
#### `autoresearch.sh`

Bash script (`set -euo pipefail`) that: pre-checks fast (syntax errors in <1s), runs the benchmark, and outputs structured `METRIC name=value` lines to stdout. Keep the script fast.

For fast, noisy benchmarks (< 5s), run the workload multiple times inside the script and report the median. Slow workloads (ML training, large builds) don't need this.

Example:

```bash
#!/bin/bash
set -euo pipefail

# Pre-check: syntax validation
python3 -c "import ast; ast.parse(open('train.py').read())" 2>&1 || { echo "SYNTAX ERROR"; exit 1; }

# Run the workload
output=$(uv run train.py 2>&1)

# Extract and output metrics
val_bpb=$(echo "$output" | grep -oP 'val_bpb=\K[0-9.]+' | tail -1)
echo "METRIC val_bpb=$val_bpb"
```
#### `autoresearch.checks.sh` (optional)

Only create this when the user's constraints require correctness validation (e.g., "tests must pass", "types must check"). Bash script (`set -euo pipefail`) for backpressure checks.

```bash
#!/bin/bash
set -euo pipefail
pnpm test --run --reporter=dot 2>&1 | tail -50
pnpm typecheck 2>&1 | grep -i error || true
```

### Step 3: Initialize JSONL and Commit State Files
Initialize the experiment log:

```bash
python3 autoresearch_helper.py init --jsonl autoresearch.jsonl --name '<goal>' --metric-name '<metric_name>' --direction <lower|higher>
```

Commit all state files:

```bash
git add autoresearch.md autoresearch.sh autoresearch.jsonl
git commit -m "autoresearch: initialize experiment session"
```

### Step 4: Run Baseline
Run the benchmark and record the baseline result:

```bash
bash autoresearch.sh
```

Parse the METRIC lines from the output, then log the baseline as a keep:

```bash
python3 autoresearch_helper.py log --jsonl autoresearch.jsonl \
  --commit $(git rev-parse --short=7 HEAD) \
  --metric <baseline_value> \
  --status keep \
  --description "baseline" \
  --asi '{"hypothesis": "baseline measurement"}'
```

This is experiment #1 — it establishes the starting point for all future comparisons.
## The Experiment Loop
LOOP FOREVER. Never ask "should I continue?" — the user expects autonomous work. Only stop when:
- The termination condition from setup is met
- The user interrupts
- You detect you're running low on context (see Context Management below)
### For Each Experiment:
#### 1. Choose What to Try
Read `autoresearch.md` (especially "What's Been Tried") and `autoresearch.ideas.md` (if it exists) to pick the next hypothesis. Think about what the data tells you. The best ideas come from deep understanding, not random variations.

#### 2. Make Changes
Edit the files in scope. Keep changes focused — one hypothesis per experiment.
#### 3. Run the Experiment
Execute the benchmark:

```bash
timeout 600 bash autoresearch.sh
```

Capture the full output. Parse `METRIC name=value` lines from the output.

If the run crashes or times out, log it as a crash and revert.

If `autoresearch.checks.sh` exists and the benchmark passed, run it:

```bash
timeout 300 bash autoresearch.checks.sh
```

If checks fail, log as `checks_failed` and revert.

#### 4. Evaluate Results
Compare the primary metric against the current best (or baseline if no keeps yet) using the helper script:

```bash
python3 autoresearch_helper.py evaluate --jsonl autoresearch.jsonl --metric <value> --direction <lower|higher>
```

This outputs whether to keep or discard, the confidence score, and delta from baseline.

Decision rules:
- Primary metric improved -> `keep`
- Primary metric worse or unchanged -> `discard`
- Simpler code for equal performance -> `keep` (removing code for same perf is a win)
- Ugly complexity for tiny gain -> probably `discard`
- Secondary metrics rarely affect the keep/discard decision. Only discard a primary improvement if a secondary metric degraded catastrophically.
#### 5. Record Results
On keep:

Log to JSONL first (so the entry is included in the commit):

```bash
python3 autoresearch_helper.py log --jsonl autoresearch.jsonl \
  --commit $(git rev-parse --short=7 HEAD) \
  --metric <value> \
  --status keep \
  --description "<what was tried>" \
  --asi '{"hypothesis": "<what you tried>"}' \
  --direction <lower|higher>
  # optionally add: --metrics '{"compile_us": <value>, "render_us": <value>}' for secondary metrics
```

Then commit all changes (including the JSONL entry):

```bash
git add -A
git commit -m "<description>
Result: {\"status\": \"keep\", \"<metric_name>\": <value>}"
```

On discard/crash/checks_failed:

Log to JSONL first (before reverting, so the entry is preserved):

```bash
python3 autoresearch_helper.py log --jsonl autoresearch.jsonl \
  --commit "0000000" \
  --metric <value_or_0> \
  --status <discard|crash|checks_failed> \
  --description "<what was tried>" \
  --asi '{"hypothesis": "<what you tried>", "rollback_reason": "<why it failed>"}' \
  --direction <lower|higher>
  # optionally add: --metrics '{"compile_us": <value>, "render_us": <value>}' for secondary metrics
```

Then revert changes, backing up state files so `git clean -fd` doesn't destroy them:

```bash
# Backup state files
cp autoresearch.jsonl autoresearch.jsonl.bak 2>/dev/null || true
cp autoresearch.md autoresearch.md.bak 2>/dev/null || true
cp autoresearch.ideas.md autoresearch.ideas.md.bak 2>/dev/null || true

# Revert all changes
git checkout -- .
git clean -fd 2>/dev/null

# Restore state files
cp autoresearch.jsonl.bak autoresearch.jsonl 2>/dev/null || true
cp autoresearch.md.bak autoresearch.md 2>/dev/null || true
cp autoresearch.ideas.md.bak autoresearch.ideas.md 2>/dev/null || true
rm -f autoresearch.jsonl.bak autoresearch.md.bak autoresearch.ideas.md.bak
```

#### 6. Update Research Journal
After every few experiments (or after significant findings), update the "What's Been Tried" section in `autoresearch.md`. Include:
- What worked and why
- What didn't work and why
- Dead ends to avoid
- Current best result and how it was achieved

#### 7. Maintain Ideas Backlog
When you discover promising but deferred optimizations, append them as bullet points to `autoresearch.ideas.md`. Don't let good ideas get lost. Prune stale or tried entries.

#### 8. Loop

Go back to step 1.
## State Files Reference

| File | Format | Purpose |
|---|---|---|
| `autoresearch.jsonl` | JSON Lines | Append-only experiment log. One JSON object per line. |
| `autoresearch.md` | Markdown | Living research document. Objective, what's been tried, current best. |
| `autoresearch.ideas.md` | Markdown | Hypothesis backlog. Bullet points of promising ideas to try. |
| `autoresearch.sh` | Bash | Benchmark script. Outputs `METRIC name=value` lines. |
| `autoresearch.checks.sh` | Bash | Optional correctness checks (tests, types, lint). |
## JSONL Schema
Each line in `autoresearch.jsonl` is either a config header or an experiment result:

Config header (first line, or on re-init):

```json
{"type": "config", "name": "...", "metricName": "...", "metricUnit": "...", "bestDirection": "lower|higher"}
```

Experiment result:

```json
{
  "run": 1,
  "commit": "abc1234",
  "metric": 1.234,
  "metrics": {"compile_us": 4200, "render_us": 9800},
  "status": "keep|discard|crash|checks_failed",
  "description": "what was tried",
  "timestamp": 1711600000000,
  "segment": 0,
  "confidence": 2.1,
  "asi": {"hypothesis": "...", "rollback_reason": "...", "next_action_hint": "..."}
}
```

## ASI (Actionable Side Information)
Always record ASI with every experiment. At minimum: `{"hypothesis": "what you tried"}`. On discard/crash, also include `rollback_reason` and `next_action_hint`. Add any other key/value pairs that capture what you learned — dead ends, surprising findings, error details, bottlenecks.

ASI is the only structured memory that survives reverts. Without it, future iterations waste time re-discovering the same dead ends.
## Confidence Scoring
After 3+ experiments, the helper script computes a confidence score using Median Absolute Deviation (MAD):
| Confidence | Meaning |
|---|---|
| >= 2.0x | Improvement is likely real |
| 1.0-2.0x | Above noise but marginal |
| < 1.0x | Within noise — consider re-running to confirm |
The score is advisory — it never auto-discards. If confidence is below 1.0x, consider re-running the same experiment to confirm before keeping.
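The helper computes the score internally; the following is a plausible MAD-based sketch of the idea, an assumption about the formula rather than the helper's actual code:

```python
import statistics

def confidence(delta: float, kept_metrics: list[float]) -> float:
    """Ratio of an improvement to the noise floor, estimated via MAD."""
    med = statistics.median(kept_metrics)
    mad = statistics.median(abs(x - med) for x in kept_metrics)
    return abs(delta) / max(mad, 1e-12)  # guard against a zero noise floor
```

MAD is used rather than standard deviation because a single outlier run (a cold cache, a noisy neighbor) should not inflate the noise estimate.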
## Context Management

Droid sessions have finite context. To handle this gracefully:
- Track experiment count in the current session. After ~15 experiments, context is getting heavy.
- Save state proactively — all state lives in files (jsonl, md), so a new session can resume immediately.
- When context is getting exhausted: update `autoresearch.md` with current findings, commit state files, and stop. The next session reads the files and continues.
- On resume: read `autoresearch.md`, `autoresearch.jsonl`, and `git log --oneline -20` to understand where things stand. Check current status:

```bash
python3 autoresearch_helper.py status --jsonl autoresearch.jsonl
```

## Loop Rules Summary
- LOOP FOREVER. Never ask "should I continue?"
- Primary metric is king. Improved -> keep. Worse/equal -> discard.
- Annotate every run with ASI. Record what you learned, not just what you did.
- Watch the confidence score. < 1.0x means within noise — re-run to confirm.
- Simpler is better. Removing code for equal perf = keep.
- Don't thrash. Repeatedly reverting the same idea? Try something structurally different.
- Crashes: fix if trivial, otherwise log and move on.
- Think longer when stuck. Re-read source files, study the data, reason about what's actually happening. The best ideas come from deep understanding.
- Resuming: read autoresearch.md + git log, continue looping.
## Finalization
When the experiment loop ends (termination condition met, user interrupts, or context exhausted), finalize the results into clean, reviewable branches. This is the last phase of an autoresearch session.
### Step 1: Summarize Results

```bash
python3 autoresearch_helper.py summary --jsonl autoresearch.jsonl
```

Review the git log for actual commits:

```bash
git log --oneline --stat $(git merge-base HEAD main)..HEAD
```

### Step 2: Group Changes
Group kept experiments into logical changesets. Each group should:
- Represent a single coherent optimization or change
- Not share modified files with other groups (so branches can merge independently)
- Have a clear description of what it achieves and the metric improvement

Present the proposed grouping to the user for approval:

```
Group 1: "Reduce model depth from 8 to 6"
  Files: train.py (DEPTH, HEAD_DIM, N_EMBED)
  Metric improvement: val_bpb 1.15 -> 1.08 (-6.1%)
  Experiments: #3, #7, #12

Group 2: "Switch to cosine LR schedule"
  Files: train.py (lr_schedule, warmup_steps)
  Metric improvement: val_bpb 1.08 -> 1.05 (-2.8%)
  Experiments: #15, #18
```

Wait for user confirmation before proceeding. In mission worker mode, proceed with the best grouping without waiting for confirmation.
### Step 3: Resolve File Conflicts
If groups share files, resolve before creating branches:
- Merge the groups into one (if changes are related)
- Split the file changes more carefully (if they're truly independent modifications to different parts)
- Ask the user which group gets priority
Groups must not share files — each branch must be independently mergeable. If all changes touch the same file and can't be separated, create a single finalized branch with all improvements combined.
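Checking that no two groups touch the same file is a pairwise set-intersection test. A minimal Python sketch (names illustrative):

```python
def find_conflicts(groups: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return every pair of groups that modify a common file."""
    names = sorted(groups)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if groups[a] & groups[b]]  # shared files -> not independently mergeable
```

An empty result means every group can become its own branch; any returned pair must be merged, re-split, or prioritized before branch creation.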
### Step 4: Create Clean Branches

For each group:

```bash
merge_base=$(git merge-base HEAD main)
git checkout -b autoresearch/finalize/<group-name> $merge_base
git checkout autoresearch/<session-branch> -- <file1> <file2> ...
git commit -m "<group description>

Autoresearch results:
- Metric: <name> improved from <baseline> to <best> (<delta>%)
- Confidence: <score>x noise floor
- Experiments: <count> total, <kept> kept"
```

### Step 5: Verify and Report
For each finalized branch, run the benchmark to confirm the improvement holds, run any checks if applicable, and verify it merges cleanly with main.

Present a summary to the user:

```
Created 2 clean branches from 20 experiments:

autoresearch/finalize/reduce-depth
  val_bpb: 1.15 -> 1.08 (-6.1%)
  Ready for review

autoresearch/finalize/cosine-schedule
  val_bpb: 1.08 -> 1.05 (-2.8%)
  Ready for review

Original experiment branch preserved: autoresearch/<session-branch>
```

The original experiment branch is always preserved — finalization creates new branches.
## Mission Worker Mode
When running as a mission worker, the feature description specifies the optimization goal, termination condition, files in scope, and constraints. Read it carefully, follow the same loop procedure above, and respect the termination condition. When the condition is met, run finalization and report results in the handoff.