Experiment

Run experiments reproducibly. Log everything to `RESEARCH.md`. If `RESEARCH.md` includes `## Supervision Policy`, apply it for notifications, approvals, resource boundaries, and stop limits. If absent, preserve the defaults below.

Before Running

  1. Use an isolated virtual environment. If none exists, create one first. Escape hatch: If the user has already activated a venv, skip this step.
  2. Validate the target script before execution:
    bash
    python -m py_compile path/to/script.py
  3. Set a fixed time budget per run. Default: 5 minutes. Do not exceed without explicit user approval, even if the supervision preset is permissive.
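The pre-run checks above can be sketched in Python. This is a minimal illustration, not a prescribed implementation: the function names and the `DEFAULT_BUDGET_SECONDS` constant are hypothetical, and the venv check only inspects the interpreter that runs the sketch.

```python
import py_compile
import sys
import tempfile
from pathlib import Path

# Default time budget from the text: 5 minutes per run.
DEFAULT_BUDGET_SECONDS = 5 * 60


def in_virtualenv() -> bool:
    """Return True when the current interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # points at the base installation.
    return sys.prefix != sys.base_prefix


def script_compiles(script: Path) -> bool:
    """Syntax-check a script without executing it (like `python -m py_compile`)."""
    try:
        # Write the bytecode to a throwaway directory so the check leaves
        # no __pycache__ entry next to the script.
        with tempfile.TemporaryDirectory() as tmp:
            py_compile.compile(
                str(script),
                cfile=str(Path(tmp) / "check.pyc"),
                doraise=True,
            )
    except (py_compile.PyCompileError, OSError):
        # Syntax errors and unreadable or missing files both count as failures.
        return False
    return True
```

A caller would typically refuse to start a run when `script_compiles` returns `False`, and create a virtual environment first when `in_virtualenv()` is `False`.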

Single Run

  1. Capture stdout and stderr to a log file.
  2. Watch for `NaN`/`Inf` in loss, OOM failures, and silent hangs (>60s with no output). On a hang, terminate the run and record it.
  3. Failure: patch once, retry once. On a second failure, stop with reason `error_threshold`, log to `RESEARCH.md` Context (script path, hash, error, patch, outcome), and surface to the user.
  4. Success: extract key metrics and append a structured block to `RESEARCH.md` Context (date, script hash, metrics, notes).
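The monitoring in steps 1 and 2 could look roughly like the sketch below. It is an illustration under assumptions: the regular expressions guess at how a training script prints its loss and OOM errors, the status strings are hypothetical, and killing the process once the time budget is exceeded is one possible reading of the budget rule.

```python
import re
import subprocess
import threading
import time
from typing import Optional

# Assumed log formats, e.g. "loss: nan" or "val_loss=inf".
BAD_LOSS = re.compile(r"loss\w*\s*[:=]?\s*[-+]?(nan|inf)\b", re.IGNORECASE)
OOM = re.compile(r"out of memory|MemoryError", re.IGNORECASE)


def classify_line(line: str) -> Optional[str]:
    """Return a problem label for a log line, or None if it looks healthy."""
    if BAD_LOSS.search(line):
        return "nan_or_inf"
    if OOM.search(line):
        return "oom"
    return None


def run_monitored(cmd, log_path, budget_s=300.0, hang_s=60.0) -> str:
    """Run cmd, tee stdout+stderr to log_path, and stop early on trouble."""
    state = {"last_output": time.monotonic(), "problem": None}
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        text=True, bufsize=1,
    )

    def pump():
        # Copy every output line to the log and scan it as it arrives.
        with open(log_path, "w") as log:
            for line in proc.stdout:
                log.write(line)
                log.flush()
                state["last_output"] = time.monotonic()
                if state["problem"] is None:
                    state["problem"] = classify_line(line)

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    start = time.monotonic()
    while proc.poll() is None:
        now = time.monotonic()
        if state["problem"] is None and now - state["last_output"] > hang_s:
            state["problem"] = "hang"  # no output for longer than hang_s
        if state["problem"] is None and now - start > budget_s:
            state["problem"] = "budget_exceeded"
        if state["problem"] is not None:
            proc.kill()
            break
        time.sleep(0.05)
    proc.wait()
    reader.join()  # make sure the tail of the output was scanned
    if state["problem"] is not None:
        return state["problem"]
    return "ok" if proc.returncode == 0 else "failed"
```

Any return value other than `"ok"` would feed the patch-once, retry-once rule in step 3.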

BFS Mode — opt-in

Activate when the user asks to "explore", "autoresearch", or "find the best config". Requires two preconditions:
  • A single target file the agent is allowed to modify (e.g., `train.py`).
  • A verifiable scalar metric to minimize or maximize (e.g., `val_bpb`, `val_acc`). It must be extractable from run output with a grep/awk one-liner.
Apply supervision policy: notify on `experiment-start` if configured; require approval before entering the loop if `bfs-start` is in `Approve`; in `wild`, proceed only within already granted resource and budget boundaries.
Loop (runs autonomously until budget or N experiments exhausted):
  1. Design the next hypothesis — one focused change to the target file (e.g., change learning rate schedule, add residual scaling, modify attention pattern). State the hypothesis in one line before modifying.
  2. git commit -am "hypothesis: <one-line description>"
  3. Run the script for the fixed time budget. Extract the metric:
    bash
    grep "^<metric_key>:" run.log | awk '{print $2}'
  4. Compare to the current best:
    • Improvement → log as `keep` in `results.tsv`, update best.
    • No improvement or failure → `git reset --hard HEAD~1`, log as `discard`.
  5. Append a row to `results.tsv`:
    commit_hash  metric_value  status   description
  6. Loop to step 1. Do not pause to ask the user between iterations unless a configured stop target or hard limit has been reached, or a forbidden resource / approval boundary blocks further progress.
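The metric extraction and keep/discard decision from steps 3 to 5 can be sketched in Python. The helper names are hypothetical, and two choices are assumptions rather than rules from the text: when the metric is printed several times, the last value wins, and a non-finite or missing metric is always discarded.

```python
import csv
import io
import math
from typing import Optional, Tuple


def extract_metric(log_text: str, metric_key: str) -> Optional[float]:
    """Python analogue of: grep "^<metric_key>:" run.log | awk '{print $2}'."""
    value = None
    for line in log_text.splitlines():
        if not line.startswith(metric_key + ":"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            try:
                value = float(fields[1])  # assumption: last occurrence wins
            except ValueError:
                pass
    return value


def decide(value: Optional[float], best: Optional[float],
           minimize: bool) -> Tuple[str, Optional[float]]:
    """Return (status, new_best) for one finished run."""
    if value is None or not math.isfinite(value):
        return "discard", best  # failed run or unusable metric
    if best is None:
        return "keep", value  # first valid result becomes the baseline
    improved = value < best if minimize else value > best
    return ("keep", value) if improved else ("discard", best)


def tsv_row(commit_hash: str, value: Optional[float],
            status: str, description: str) -> str:
    """Format one results.tsv row: commit_hash, metric_value, status, description."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([commit_hash, "" if value is None else str(value),
                     status, description])
    return buffer.getvalue()
```

A `discard` status is the point at which the loop would run `git reset --hard HEAD~1` before appending the row.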
End of batch: write a summary table to `RESEARCH.md` Context; surface the best commit and the top 3. Apply the stop policy (`target_reached`) or wild continuation as configured; otherwise ask: "Best result is `<hash>` (`<metric>=<value>`). Continue exploring or proceed to writing?"
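One way to build the end-of-batch summary is sketched below. It rests on assumptions the text does not settle: `results.tsv` starts with a header row naming the four columns, only `keep` rows are ranked, and the summary is rendered as a Markdown pipe table. The function names are hypothetical.

```python
import csv
import io


def top_results(tsv_text: str, minimize: bool, n: int = 3):
    """Return the n best kept runs as (metric_value, commit_hash, description)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        if row["status"] != "keep":
            continue  # discarded commits were reset and are not ranked
        try:
            value = float(row["metric_value"])
        except (TypeError, ValueError):
            continue  # skip rows without a usable metric
        kept.append((value, row["commit_hash"], row["description"]))
    # Smallest first when minimizing, largest first when maximizing.
    kept.sort(key=lambda item: item[0], reverse=not minimize)
    return kept[:n]


def summary_table(results, metric_name: str) -> str:
    """Render ranked results as a Markdown table for RESEARCH.md."""
    lines = [
        f"| rank | commit | {metric_name} | description |",
        "| --- | --- | --- | --- |",
    ]
    for rank, (value, commit, description) in enumerate(results, start=1):
        lines.append(f"| {rank} | {commit} | {value} | {description} |")
    return "\n".join(lines)
```

The first entry of `top_results` supplies the `<hash>` and `<metric>=<value>` for the closing question to the user.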
Constraints: only modify the target file; one focused change per hypothesis; log every run including discards; supervision presets never override target-file, metric, budget, or resource boundaries.

Example

Single: `python train.py --lr 0.01`, budget=5min → appends `2026-03-29 14:32 — val_acc=0.923, hash=abc1234` to RESEARCH.md Context.
BFS: target=`train.py`, metric=`val_bpb` (minimize), budget=5min/run, N=10 → runs 10 hypothesis variants autonomously, keeps 3 improvements, produces `results.tsv` + summary table in RESEARCH.md Context.