# Autoresearch
Autonomous experiment loop: try ideas, keep what works, discard what doesn't, never stop.
## Overview
You are running an autonomous optimization loop. Your job is to systematically improve a measurable metric by making changes, running experiments, and keeping only the improvements. You maintain structured state files so that any session — including a fresh one with no memory — can resume exactly where you left off.
If the user is asking you to do this and you are not currently in mission mode, suggest that they might want to run this inside a mission (`/enter-mission`) for better progress tracking, milestone validation, and multi-session continuity. Don't block on it — just mention it once during setup.

If you are already in mission mode, invoke the mission planning skills first (`mission-planning` and `define-mission-skills`) before diving into this skill's procedure. Use the mission system's planning, decomposition, and worker design to structure the autoresearch work — then combine that guidance with this skill's experiment loop procedure. This skill defines how to run experiments; the mission system defines how to plan, track, and validate them.

## Setup
Before the loop starts, you need to establish the experiment.
### Step 1: Gather Information
Ask the user (or infer from context) for:
- Goal: What are we optimizing? (e.g., "minimize val_bpb", "reduce test runtime", "shrink bundle size")
- Command: What to run (e.g., `uv run train.py`, `pnpm test`, `pnpm build && du -sb dist`)
- Primary metric: Name, unit, and direction (e.g., `val_bpb`, unitless, lower is better)
- Files in scope: Which files may be modified
- Constraints: Hard rules (tests must pass, no new deps, etc.)
- Termination condition: When to stop. Ask the user — options are:
  - Fixed experiment count (e.g., 20 experiments)
  - Fixed time budget (e.g., 2 hours)
  - Target metric (e.g., val_bpb < 1.0)
  - Run until interrupted (default)
### Step 2: Create Branch and State Files
```bash
git checkout autoresearch/<goal>-<date> 2>/dev/null || git checkout -b autoresearch/<goal>-<date>
```

Read the source files thoroughly. Understand the workload deeply before writing anything.

Create three files:
#### `autoresearch.md`

The living research document. A fresh agent with no context should be able to read this file and run the loop effectively. Invest time making it excellent.

```markdown
# Autoresearch: <goal>

## Objective

<Specific description of what we're optimizing and the workload.>

## Metrics

- Primary: <name> (<unit>, lower/higher is better) — the optimization target
- Secondary: <name>, <name>, ... — independent tradeoff monitors

## How to Run

`./autoresearch.sh` — outputs `METRIC name=number` lines.

## Files in Scope

<Every file the agent may modify, with a brief note on what it does.>

## Off Limits

<What must NOT be touched.>

## Constraints

<Hard rules: tests must pass, no new deps, etc.>

## Termination

<When to stop: experiment count, time budget, target metric, or run until interrupted.>

## What's Been Tried

<Update this section as experiments accumulate. Note key wins, dead ends,
and architectural insights so the agent doesn't repeat failed approaches.>
```
#### `autoresearch.sh`

Bash script (`set -euo pipefail`) that: pre-checks fast (syntax errors in <1s), runs the benchmark, and outputs structured `METRIC name=value` lines to stdout. Keep the script fast.

For fast, noisy benchmarks (< 5s), run the workload multiple times inside the script and report the median. Slow workloads (ML training, large builds) don't need this.

Example:

```bash
#!/bin/bash
set -euo pipefail

# Pre-check: syntax validation
python3 -c "import ast; ast.parse(open('train.py').read())" 2>&1 || { echo "SYNTAX ERROR"; exit 1; }

# Run the workload
output=$(uv run train.py 2>&1)

# Extract and output metrics
val_bpb=$(echo "$output" | grep -oP 'val_bpb=\K[0-9.]+' | tail -1)
echo "METRIC val_bpb=$val_bpb"
```
#### `autoresearch.checks.sh` (optional)

Only create this when the user's constraints require correctness validation (e.g., "tests must pass", "types must check"). Bash script (`set -euo pipefail`) for backpressure checks.

```bash
#!/bin/bash
set -euo pipefail
pnpm test --run --reporter=dot 2>&1 | tail -50
pnpm typecheck 2>&1 | grep -i error || true
```

### Step 3: Initialize JSONL and Commit State Files
Initialize the experiment log:

```bash
python3 autoresearch_helper.py init --jsonl autoresearch.jsonl --name '<goal>' --metric-name '<metric_name>' --direction <lower|higher>
```

Commit all state files:

```bash
git add autoresearch.md autoresearch.sh autoresearch.jsonl
git commit -m "autoresearch: initialize experiment session"
```

### Step 4: Run Baseline
Run the benchmark and record the baseline result:

```bash
bash autoresearch.sh
```

Parse the METRIC lines from the output, then log the baseline as a keep:

```bash
python3 autoresearch_helper.py log --jsonl autoresearch.jsonl \
  --commit $(git rev-parse --short=7 HEAD) \
  --metric <baseline_value> \
  --status keep \
  --description "baseline" \
  --asi '{"hypothesis": "baseline measurement"}'
```

This is experiment #1 — it establishes the starting point for all future comparisons.
## The Experiment Loop
LOOP FOREVER. Never ask "should I continue?" — the user expects autonomous work. Only stop when:
- The termination condition from setup is met
- The user interrupts
- You detect you're running low on context (see Context Management below)
### For Each Experiment:
#### 1. Choose What to Try
Read `autoresearch.md` (especially "What's Been Tried") and `autoresearch.ideas.md` (if it exists) to pick the next hypothesis. Think about what the data tells you. The best ideas come from deep understanding, not random variations.

#### 2. Make Changes
Edit the files in scope. Keep changes focused — one hypothesis per experiment.
#### 3. Run the Experiment
Execute the benchmark:

```bash
timeout 600 bash autoresearch.sh
```

Capture the full output. Parse `METRIC name=value` lines from the output.

If the run crashes or times out, log it as a crash and revert.

If `autoresearch.checks.sh` exists and the benchmark passed, run it:

```bash
timeout 300 bash autoresearch.checks.sh
```

If checks fail, log as `checks_failed` and revert.

#### 4. Evaluate Results
Compare the primary metric against the current best (or baseline if no keeps yet) using the helper script:

```bash
python3 autoresearch_helper.py evaluate --jsonl autoresearch.jsonl --metric <value> --direction <lower|higher>
```

This outputs whether to keep or discard, the confidence score, and delta from baseline.

Decision rules:
- Primary metric improved -> `keep`
- Primary metric worse or unchanged -> `discard`
- Simpler code for equal performance -> `keep` (removing code for same perf is a win)
- Ugly complexity for tiny gain -> probably `discard`
- Secondary metrics rarely affect the keep/discard decision. Only discard a primary improvement if a secondary metric degraded catastrophically.
#### 5. Record Results
On keep:

Log to JSONL first (so the entry is included in the commit):

```bash
python3 autoresearch_helper.py log --jsonl autoresearch.jsonl \
  --commit $(git rev-parse --short=7 HEAD) \
  --metric <value> \
  --status keep \
  --description "<what was tried>" \
  --asi '{"hypothesis": "<what you tried>"}' \
  --direction <lower|higher>
  # optionally add: --metrics '{"compile_us": <value>, "render_us": <value>}' for secondary metrics
```

Then commit all changes (including the JSONL entry):

```bash
git add -A
git commit -m "<description>
Result: {\"status\": \"keep\", \"<metric_name>\": <value>}"
```

On discard/crash/checks_failed:

Log to JSONL first (before reverting, so the entry is preserved):

```bash
python3 autoresearch_helper.py log --jsonl autoresearch.jsonl \
  --commit "0000000" \
  --metric <value_or_0> \
  --status <discard|crash|checks_failed> \
  --description "<what was tried>" \
  --asi '{"hypothesis": "<what you tried>", "rollback_reason": "<why it failed>"}' \
  --direction <lower|higher>
  # optionally add: --metrics '{"compile_us": <value>, "render_us": <value>}' for secondary metrics
```

Then revert changes, backing up state files so `git clean -fd` doesn't destroy them:

```bash
# Backup state files
cp autoresearch.jsonl autoresearch.jsonl.bak 2>/dev/null || true
cp autoresearch.md autoresearch.md.bak 2>/dev/null || true
cp autoresearch.ideas.md autoresearch.ideas.md.bak 2>/dev/null || true

# Revert all changes
git checkout -- .
git clean -fd 2>/dev/null

# Restore state files
cp autoresearch.jsonl.bak autoresearch.jsonl 2>/dev/null || true
cp autoresearch.md.bak autoresearch.md 2>/dev/null || true
cp autoresearch.ideas.md.bak autoresearch.ideas.md 2>/dev/null || true
rm -f autoresearch.jsonl.bak autoresearch.md.bak autoresearch.ideas.md.bak
```

#### 6. Update Research Journal
After every few experiments (or after significant findings), update the "What's Been Tried" section in `autoresearch.md`. Include:
- What worked and why
- What didn't work and why
- Dead ends to avoid
- Current best result and how it was achieved

#### 7. Maintain Ideas Backlog
When you discover promising but deferred optimizations, append them as bullet points to `autoresearch.ideas.md`. Don't let good ideas get lost. Prune stale or tried entries.

#### 8. Loop

Go back to step 1.
## State Files Reference

| File | Format | Purpose |
|---|---|---|
| `autoresearch.jsonl` | JSON Lines | Append-only experiment log. One JSON object per line. |
| `autoresearch.md` | Markdown | Living research document. Objective, what's been tried, current best. |
| `autoresearch.ideas.md` | Markdown | Hypothesis backlog. Bullet points of promising ideas to try. |
| `autoresearch.sh` | Bash | Benchmark script. Outputs `METRIC name=value` lines. |
| `autoresearch.checks.sh` | Bash | Optional correctness checks (tests, types, lint). |
## JSONL Schema
Each line in `autoresearch.jsonl` is either a config header or an experiment result:

Config header (first line, or on re-init):

```json
{"type": "config", "name": "...", "metricName": "...", "metricUnit": "...", "bestDirection": "lower|higher"}
```

Experiment result:

```json
{
  "run": 1,
  "commit": "abc1234",
  "metric": 1.234,
  "metrics": {"compile_us": 4200, "render_us": 9800},
  "status": "keep|discard|crash|checks_failed",
  "description": "what was tried",
  "timestamp": 1711600000000,
  "segment": 0,
  "confidence": 2.1,
  "asi": {"hypothesis": "...", "rollback_reason": "...", "next_action_hint": "..."}
}
```

## ASI (Actionable Side Information)
Always record ASI with every experiment. At minimum: `{"hypothesis": "what you tried"}`. On discard/crash, also include `rollback_reason` and `next_action_hint`. Add any other key/value pairs that capture what you learned — dead ends, surprising findings, error details, bottlenecks.

ASI is the only structured memory that survives reverts. Without it, future iterations waste time re-discovering the same dead ends.
## Confidence Scoring
After 3+ experiments, the helper script computes a confidence score using Median Absolute Deviation (MAD):
| Confidence | Meaning |
|---|---|
| >= 2.0x | Improvement is likely real |
| 1.0-2.0x | Above noise but marginal |
| < 1.0x | Within noise — consider re-running to confirm |
The score is advisory — it never auto-discards. If confidence is below 1.0x, consider re-running the same experiment to confirm before keeping.
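The helper computes the score internally; the following is a plausible MAD-based sketch of the idea, an assumption about the formula rather than the helper's actual code:

```python
import statistics

def confidence(delta: float, kept_metrics: list[float]) -> float:
    """Ratio of an improvement to the noise floor, estimated via MAD."""
    med = statistics.median(kept_metrics)
    mad = statistics.median(abs(x - med) for x in kept_metrics)
    return abs(delta) / max(mad, 1e-12)  # guard against a zero noise floor
```

MAD is used rather than standard deviation because a single outlier run (a cold cache, a noisy neighbor) should not inflate the noise estimate.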
## Context Management

Droid sessions have finite context. To handle this gracefully:
- Track experiment count in the current session. After ~15 experiments, context is getting heavy.
- Save state proactively — all state lives in files (jsonl, md), so a new session can resume immediately.
- When context is getting exhausted: update `autoresearch.md` with current findings, commit state files, and stop. The next session reads the files and continues.
- On resume: read `autoresearch.md`, `autoresearch.jsonl`, and `git log --oneline -20` to understand where things stand. Check current status:

```bash
python3 autoresearch_helper.py status --jsonl autoresearch.jsonl
```

## Loop Rules Summary
- LOOP FOREVER. Never ask "should I continue?"
- Primary metric is king. Improved -> keep. Worse/equal -> discard.
- Annotate every run with ASI. Record what you learned, not just what you did.
- Watch the confidence score. < 1.0x means within noise — re-run to confirm.
- Simpler is better. Removing code for equal perf = keep.
- Don't thrash. Repeatedly reverting the same idea? Try something structurally different.
- Crashes: fix if trivial, otherwise log and move on.
- Think longer when stuck. Re-read source files, study the data, reason about what's actually happening. The best ideas come from deep understanding.
- Resuming: read autoresearch.md + git log, continue looping.
## Finalization
When the experiment loop ends (termination condition met, user interrupts, or context exhausted), finalize the results into clean, reviewable branches. This is the last phase of an autoresearch session.
### Step 1: Summarize Results

```bash
python3 autoresearch_helper.py summary --jsonl autoresearch.jsonl
```

Review the git log for actual commits:

```bash
git log --oneline --stat $(git merge-base HEAD main)..HEAD
```

### Step 2: Group Changes
Group kept experiments into logical changesets. Each group should:
- Represent a single coherent optimization or change
- Not share modified files with other groups (so branches can merge independently)
- Have a clear description of what it achieves and the metric improvement

Present the proposed grouping to the user for approval:

```
Group 1: "Reduce model depth from 8 to 6"
  Files: train.py (DEPTH, HEAD_DIM, N_EMBED)
  Metric improvement: val_bpb 1.15 -> 1.08 (-6.1%)
  Experiments: #3, #7, #12

Group 2: "Switch to cosine LR schedule"
  Files: train.py (lr_schedule, warmup_steps)
  Metric improvement: val_bpb 1.08 -> 1.05 (-2.8%)
  Experiments: #15, #18
```

Wait for user confirmation before proceeding. In mission worker mode, proceed with the best grouping without waiting for confirmation.
### Step 3: Resolve File Conflicts
If groups share files, resolve before creating branches:
- Merge the groups into one (if changes are related)
- Split the file changes more carefully (if they're truly independent modifications to different parts)
- Ask the user which group gets priority
Groups must not share files — each branch must be independently mergeable. If all changes touch the same file and can't be separated, create a single finalized branch with all improvements combined.
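Checking that no two groups touch the same file is a pairwise set-intersection test. A minimal Python sketch (names illustrative):

```python
def find_conflicts(groups: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return every pair of groups that modify a common file."""
    names = sorted(groups)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if groups[a] & groups[b]]  # shared files -> not independently mergeable
```

An empty result means every group can become its own branch; any returned pair must be merged, re-split, or prioritized before branch creation.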
### Step 4: Create Clean Branches

For each group:

```bash
merge_base=$(git merge-base HEAD main)
git checkout -b autoresearch/finalize/<group-name> $merge_base
git checkout autoresearch/<session-branch> -- <file1> <file2> ...
git commit -m "<group description>

Autoresearch results:
- Metric: <name> improved from <baseline> to <best> (<delta>%)
- Confidence: <score>x noise floor
- Experiments: <count> total, <kept> kept"
```

### Step 5: Verify and Report
For each finalized branch, run the benchmark to confirm the improvement holds, run any checks if applicable, and verify it merges cleanly with main.

Present a summary to the user:

```
Created 2 clean branches from 20 experiments:

autoresearch/finalize/reduce-depth
  val_bpb: 1.15 -> 1.08 (-6.1%)
  Ready for review

autoresearch/finalize/cosine-schedule
  val_bpb: 1.08 -> 1.05 (-2.8%)
  Ready for review

Original experiment branch preserved: autoresearch/<session-branch>
```

The original experiment branch is always preserved — finalization creates new branches.
## Mission Worker Mode
When running as a mission worker, the feature description specifies the optimization goal, termination condition, files in scope, and constraints. Read it carefully, follow the same loop procedure above, and respect the termination condition. When the condition is met, run finalization and report results in the handoff.