experiment

Experiment

Run experiments reproducibly. Log everything to `RESEARCH.md`. If `RESEARCH.md` includes `## Supervision Policy`, apply it for notifications, approvals, resource boundaries, and stop limits. If absent, preserve the defaults below.

Before Running

1. Use an isolated virtual environment. If none exists, create one first. Escape hatch: if the user has already activated a venv, skip this step.
2. Validate the target script before execution:
   ```bash
   python -m py_compile path/to/script.py
   ```
3. Set a fixed time budget per run. Default: 5 minutes. Do not exceed it without explicit user approval, even if the supervision preset is permissive.
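
The pre-run checks above could be scripted roughly as follows. This is a minimal sketch, not a prescribed implementation: the function names and the `ok` / `failed` / `timeout` labels are hypothetical, and the venv check relies on the standard `sys.prefix` versus `sys.base_prefix` comparison.

```python
import py_compile
import subprocess
import sys

DEFAULT_BUDGET_S = 5 * 60  # default time budget from the text: 5 minutes


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv (prefix differs from base)."""
    return sys.prefix != sys.base_prefix


def validate_script(path: str) -> bool:
    """Byte-compile the script, mirroring `python -m py_compile path`."""
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError:
        return False


def run_with_budget(argv: list[str], budget_s: float = DEFAULT_BUDGET_S) -> str:
    """Run a command under a wall-clock budget.

    Returns 'ok', 'failed', or 'timeout' (hypothetical status labels).
    """
    try:
        proc = subprocess.run(argv, timeout=budget_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired
        return "timeout"
    return "ok" if proc.returncode == 0 else "failed"
```

A caller would typically check `in_virtualenv()` first, then `validate_script(...)`, and only then call `run_with_budget([sys.executable, "train.py"])`.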

Single Run

4. Capture stdout and stderr to a log file.
5. Watch for: `NaN`/`Inf` in loss, OOM failures, silent hangs (>60s no output). Terminate and record on hang.
6. Failure: patch once, retry once. On second failure: stop with reason `error_threshold`, log to `RESEARCH.md` Context (script path, hash, error, patch, outcome), surface to user.
7. Success: extract key metrics, append structured block to `RESEARCH.md` Context (date, script hash, metrics, notes).
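
One way to combine log capture with the NaN/Inf and hang checks is sketched below. It assumes loss lines contain the word `loss` followed by the value, which the text does not specify; the status labels and the `monitored_run` name are hypothetical. OOM detection is left out because the error message depends on the framework.

```python
import queue
import re
import subprocess
import threading

BAD_VALUE = re.compile(r"\b(nan|inf)\b", re.IGNORECASE)


def monitored_run(argv, log_path, hang_s=60.0):
    """Run argv, write merged stdout/stderr to log_path, return a status string.

    Status is one of 'ok', 'failed', 'bad_value', 'hang' (hypothetical labels).
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1
    )
    lines: queue.Queue = queue.Queue()

    def pump():
        # Forward child output line by line so the main loop can time out on silence.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)  # sentinel: output stream closed

    threading.Thread(target=pump, daemon=True).start()
    status = "ok"
    with open(log_path, "w") as log:
        while True:
            try:
                line = lines.get(timeout=hang_s)
            except queue.Empty:
                proc.kill()  # no output for hang_s seconds: treat as a silent hang
                status = "hang"
                break
            if line is None:
                break
            log.write(line)
            # Assumed log format: the loss value appears on a line mentioning "loss".
            if "loss" in line.lower() and BAD_VALUE.search(line):
                status = "bad_value"  # flagged; the run is left to finish
    proc.wait()
    if status == "ok" and proc.returncode != 0:
        status = "failed"
    return status
```

The child process should flush its output regularly (for Python scripts, `python -u`), otherwise buffered output can look like a hang.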

BFS Mode — opt-in

Activate when the user asks to "explore", "autoresearch", or "find the best config". Requires two preconditions:

- A single target file the agent is allowed to modify (e.g., `train.py`).
- A verifiable scalar metric to minimize or maximize (e.g., `val_bpb`, `val_acc`). Must be extractable from run output with a grep/awk one-liner.
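
The second precondition can be checked before entering the loop by trying the extraction on an existing log. The sketch below is a Python counterpart to a `grep "^<key>:" | awk '{print $2}'` one-liner; it assumes the metric is logged as `<key>: <number>` at the start of a line and, as a further assumption, takes the last occurrence. The `extract_metric` name is hypothetical.

```python
import re
from typing import Optional


def extract_metric(log_text: str, key: str) -> Optional[float]:
    """Return the last value logged as '<key>: <number>' at line start, else None."""
    # [ \t]+ rather than \s+ so the match cannot spill onto the next line.
    pattern = re.compile(rf"^{re.escape(key)}:[ \t]+(\S+)", re.MULTILINE)
    matches = pattern.findall(log_text)
    if not matches:
        return None
    try:
        return float(matches[-1])
    except ValueError:
        return None  # the field after the key was not numeric
```

If this returns `None` on a known-good log, the metric is not extractable as required and BFS mode should not start.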

Apply supervision policy: notify on `experiment-start` if configured; require approval before entering the loop if `bfs-start` is in `Approve`; in `wild`, proceed only within already granted resource and budget boundaries.

Loop (runs autonomously until budget or N experiments exhausted):

8. Design the next hypothesis: one focused change to the target file (e.g., change learning rate schedule, add residual scaling, modify attention pattern). State the hypothesis in one line before modifying.
9. `git commit -am "hypothesis: <one-line description>"`
10. Run the script for the fixed time budget. Extract the metric:
    ```bash
    grep "^<metric_key>:" run.log | awk '{print $2}'
    ```
11. Compare to the current best:
    - Improvement → log as `keep` in `results.tsv`, update best.
    - No improvement or failure → `git reset --hard HEAD~1`, log as `discard`.
12. Append a row to `results.tsv`: `commit_hash  metric_value  status   description`
13. Loop to step 8. Do not pause to ask the user between iterations unless a configured stop target or hard limit has been reached, or a forbidden resource / approval boundary blocks further progress.
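
The keep/discard decision and the `results.tsv` bookkeeping in the loop could look roughly like this. It is a sketch under assumptions: `Best`, `decide`, `append_row`, and `revert_last_commit` are hypothetical names, improvement is taken to mean strictly better, and a failed run is represented by a metric of `None`.

```python
import csv
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class Best:
    value: Optional[float] = None
    commit: Optional[str] = None


def decide(best: Best, commit: str, value: Optional[float], minimize: bool = True) -> str:
    """Return 'keep' and update best on strict improvement, else 'discard'."""
    if value is None:  # run failed or metric missing
        return "discard"
    improved = best.value is None or (
        value < best.value if minimize else value > best.value
    )
    if improved:
        best.value, best.commit = value, commit
        return "keep"
    return "discard"


def revert_last_commit() -> None:
    """Drop the hypothesis commit, as in the discard branch of the loop."""
    subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)


def append_row(tsv_path: str, commit: str, value: Optional[float],
               status: str, description: str) -> None:
    """Append one tab-separated row: commit_hash, metric_value, status, description."""
    with open(tsv_path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow([commit, "" if value is None else value, status, description])
```

Because `git reset --hard` also discards uncommitted changes to tracked files, this sketch works best when `results.tsv` is untracked, so discarded runs stay logged.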

End of batch: write a summary table to `RESEARCH.md` Context; surface best commit and top 3. Apply stop policy (`target_reached`) or wild continuation as configured, otherwise ask: "Best result is `<hash>` (`<metric>=<value>`). Continue exploring or proceed to writing?"
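
The end-of-batch summary can be derived from `results.tsv` alone. The sketch below ranks every row that has a numeric metric and renders the top three as a Markdown table. The table layout and the `summarize` name are assumptions; the text only asks for a summary table, the best commit, and the top 3.

```python
import csv
import io
from typing import Optional


def _metric(row: list) -> Optional[float]:
    """Parse the metric_value column; None for blanks, headers, or garbage."""
    try:
        return float(row[1])
    except ValueError:
        return None


def summarize(tsv_text: str, minimize: bool = True, top_n: int = 3) -> str:
    """Return a Markdown table of the top_n runs, best first."""
    rows = [
        r for r in csv.reader(io.StringIO(tsv_text), delimiter="\t")
        if len(r) == 4 and _metric(r) is not None
    ]
    rows.sort(key=_metric, reverse=not minimize)
    lines = [
        "| commit | metric | status | description |",
        "| --- | --- | --- | --- |",
    ]
    for commit, value, status, desc in rows[:top_n]:
        lines.append(f"| {commit} | {value} | {status} | {desc} |")
    return "\n".join(lines)
```

The first data row of the returned table is the best commit, which is the `<hash>` and `<metric>=<value>` to report in the closing question.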

Constraints: only modify the target file; one focused change per hypothesis; log every run including discards; supervision presets never override target-file, metric, budget, or resource boundaries.

Example

Single: `python train.py --lr 0.01`, budget=5min → appends `2026-03-29 14:32 — val_acc=0.923, hash=abc1234` to RESEARCH.md Context.
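
An entry in the style of the single-run example could be produced as sketched below. The text does not say how the script hash is computed, so the truncated SHA-256 of the file contents used here is an assumption, as are the function names. Appending to the end of the file also assumes the Context section is the last section of `RESEARCH.md`.

```python
import hashlib
from datetime import datetime

SEP = " \u2014 "  # separator used in the example entry (an em dash)


def script_hash(path: str, length: int = 7) -> str:
    """Short SHA-256 hex digest of the script contents (assumed hashing scheme)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:length]


def format_entry(when: datetime, metrics: dict, digest: str) -> str:
    """Format one log line: '<date> <time> <sep> k=v, ..., hash=<digest>'."""
    stamp = when.strftime("%Y-%m-%d %H:%M")
    parts = ", ".join(f"{k}={v}" for k, v in metrics.items())
    return f"{stamp}{SEP}{parts}, hash={digest}"


def append_entry(research_path: str, entry: str) -> None:
    """Append the entry as a new line at the end of the research log."""
    with open(research_path, "a", encoding="utf-8") as f:
        f.write(entry + "\n")
```

For instance, `format_entry(datetime(2026, 3, 29, 14, 32), {"val_acc": 0.923}, "abc1234")` reproduces the entry shown in the example.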

BFS: target=`train.py`, metric=`val_bpb` (minimize), budget=5min/run, N=10 → runs 10 hypothesis variants autonomously, keeps 3 improvements, produces `results.tsv` + summary table in RESEARCH.md Context.
