experiment
Experiment
Run experiments reproducibly. Log everything to `RESEARCH.md`.
If `RESEARCH.md` includes `## Supervision Policy`, apply it for notifications, approvals, resource boundaries, and stop limits. If absent, preserve the defaults below.
Before Running
- Use an isolated virtual environment. If none exists, create one first. Escape hatch: if the user has already activated a venv, skip this step.
- Validate the target script before execution:
```bash
python -m py_compile path/to/script.py
```
- Set a fixed time budget per run. Default: 5 minutes. Do not exceed it without explicit user approval, even if the supervision preset is permissive.
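The pre-run checks above can be sketched in Python. This is a minimal illustration, not a prescribed implementation: the function names and the `DEFAULT_BUDGET_SECONDS` constant are hypothetical, and the venv check assumes a standard `venv`/`virtualenv` layout.

```python
import py_compile
import sys

# Default per-run time budget from the text: 5 minutes.
DEFAULT_BUDGET_SECONDS = 5 * 60


def in_virtualenv() -> bool:
    """True when the running interpreter lives inside a venv/virtualenv."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def script_compiles(path: str) -> bool:
    """Syntax-check a script without running it (like `python -m py_compile`)."""
    try:
        py_compile.compile(path, doraise=True)
        return True
    except (py_compile.PyCompileError, OSError):
        # PyCompileError: syntax error; OSError: missing or unreadable file.
        return False
```

If `in_virtualenv()` returns `False` and no environment exists yet, `python -m venv .venv` is the standard way to create one before continuing.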
Single Run
- Capture stdout and stderr to a log file.
- Watch for: `NaN`/`Inf` in loss, OOM failures, silent hangs (>60s no output). Terminate and record on hang.
- Failure: patch once, retry once. On second failure, stop with reason `error_threshold`, log to `RESEARCH.md` Context (script path, hash, error, patch, outcome), and surface to user.
- Success: extract key metrics, append structured block to `RESEARCH.md` Context (date, script hash, metrics, notes).
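One way to implement the monitoring described above is a small wrapper that tees merged stdout/stderr into a log file and stops the run on a bad loss value, an OOM message, a silent hang, or an exhausted budget. This is a sketch under assumptions: `run_monitored`, `classify_line`, the status strings, and the two regular expressions are hypothetical, and real training scripts may report these conditions with different wording.

```python
import queue
import re
import subprocess
import threading
import time

BAD_VALUE = re.compile(r"\b(nan|inf)\b", re.IGNORECASE)
OOM = re.compile(r"out of memory|MemoryError", re.IGNORECASE)


def classify_line(line):
    """Return 'oom', 'nan_or_inf', or None for a single line of output."""
    if OOM.search(line):
        return "oom"
    if "loss" in line.lower() and BAD_VALUE.search(line):
        return "nan_or_inf"
    return None


def run_monitored(cmd, log_path, budget_s=300.0, hang_s=60.0):
    """Run cmd, write its output to log_path, and return a status string."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = queue.Queue()

    def pump():
        # Reader thread: forward each output line, then a sentinel at EOF.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()
    start = time.monotonic()
    status = None
    with open(log_path, "w") as log:
        while status is None:
            remaining = budget_s - (time.monotonic() - start)
            if remaining <= 0:
                status = "budget_exceeded"
                break
            hang_first = hang_s <= remaining
            try:
                line = lines.get(timeout=min(hang_s, remaining))
            except queue.Empty:
                status = "hang" if hang_first else "budget_exceeded"
                break
            if line is None:  # stream closed: process is finishing
                break
            log.write(line)
            status = classify_line(line)
    if status is not None:
        proc.kill()  # terminate on any detected problem
    code = proc.wait()
    if status is None:
        status = "ok" if code == 0 else "failed"
    return status
```

When the target is itself a Python script, running it unbuffered (`python -u`) keeps output flowing line by line; otherwise pipe buffering can make a healthy run look like a silent hang.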
BFS Mode — opt-in
Activate when the user asks to "explore", "autoresearch", or "find the best config". Requires two preconditions:
- A single target file the agent is allowed to modify (e.g., `train.py`).
- A verifiable scalar metric to minimize or maximize (e.g., `val_bpb`, `val_acc`). Must be extractable from run output with a grep/awk one-liner.
Apply supervision policy: notify on `experiment-start` if configured; require approval before entering the loop if `bfs-start` is in `Approve`; in `wild`, proceed only within already granted resource and budget boundaries.
Loop (runs autonomously until budget or N experiments exhausted):
- Design the next hypothesis: one focused change to the target file (e.g., change learning rate schedule, add residual scaling, modify attention pattern). State the hypothesis in one line before modifying.
- `git commit -am "hypothesis: <one-line description>"`
- Run the script for the fixed time budget. Extract the metric:
```bash
grep "^<metric_key>:" run.log | awk '{print $2}'
```
- Compare to the current best:
  - Improvement → log as `keep` in `results.tsv`, update best.
  - No improvement or failure → `git reset --hard HEAD~1`, log as `discard`.
- Append a row to `results.tsv`: `commit_hash metric_value status description`
- Loop back to the first step (design the next hypothesis). Do not pause to ask the user between iterations unless a configured stop target or hard limit has been reached, or a forbidden resource / approval boundary blocks further progress.
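The extract, compare, and log steps of the loop could look roughly like this in Python. It is a sketch, not the prescribed implementation: the function names are hypothetical, the metric is assumed to appear as `<metric_key>: <value>` at the start of a line (mirroring the grep/awk one-liner), and the git commands are left to the caller.

```python
import csv
import math
from typing import Optional


def extract_metric(log_text: str, key: str) -> Optional[float]:
    """Mirror `grep "^<key>:" run.log | awk '{print $2}'`, keeping the last match."""
    value = None
    for line in log_text.splitlines():
        if not line.startswith(key + ":"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            try:
                value = float(fields[1])
            except ValueError:
                pass  # second field is not numeric; ignore this line
    return value


def decide(value: Optional[float], best: Optional[float], minimize: bool) -> str:
    """Return 'keep' for a strict improvement, 'discard' otherwise."""
    if value is None or math.isnan(value):
        return "discard"  # failed run or unusable metric
    if best is None:
        return "keep"  # first valid result becomes the baseline
    improved = value < best if minimize else value > best
    return "keep" if improved else "discard"


def append_result(tsv_path, commit_hash, value, status, description):
    """Append one row: commit_hash, metric_value, status, description."""
    with open(tsv_path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(
            [commit_hash, "" if value is None else value, status, description]
        )
```

On `discard`, the caller would then run `git reset --hard HEAD~1` as described in the loop; on `keep`, it updates the stored best value.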
End of batch: write a summary table to `RESEARCH.md` Context; surface best commit and top 3. Apply stop policy (`target_reached`) or wild continuation as configured, otherwise ask: "Best result is `<hash>` (`<metric>=<value>`). Continue exploring or proceed to writing?"
Constraints: only modify the target file; one focused change per hypothesis; log every run including discards; supervision presets never override target-file, metric, budget, or resource boundaries.
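The end-of-batch summary can be derived directly from `results.tsv`. The sketch below makes two assumptions that the text does not state: only rows with status `keep` are ranked (discarded commits have been reset away), and the summary is rendered as a Markdown table. The function names are hypothetical.

```python
import csv
from typing import List, Tuple

Row = Tuple[str, float, str]  # (commit_hash, metric_value, description)


def top_kept(tsv_path: str, minimize: bool, k: int = 3) -> List[Row]:
    """Return the k best kept rows, best first."""
    rows: List[Row] = []
    with open(tsv_path, newline="") as f:
        for rec in csv.reader(f, delimiter="\t"):
            if len(rec) != 4 or rec[2] != "keep":
                continue
            try:
                rows.append((rec[0], float(rec[1]), rec[3]))
            except ValueError:
                continue  # header row or missing metric
    rows.sort(key=lambda r: r[1], reverse=not minimize)
    return rows[:k]


def summary_table(rows: List[Row], metric: str) -> str:
    """Render rows as a Markdown table for the RESEARCH.md Context section."""
    lines = [f"| commit | {metric} | description |", "| --- | --- | --- |"]
    lines += [f"| {h} | {v:g} | {d} |" for h, v, d in rows]
    return "\n".join(lines)
```

The first row returned by `top_kept` is the best commit to surface to the user.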
Example
Single: `python train.py --lr 0.01`, budget=5min → appends `2026-03-29 14:32 — val_acc=0.923, hash=abc1234` to RESEARCH.md Context.
BFS: target=`train.py`, metric=`val_bpb` (minimize), budget=5min/run, N=10 → runs 10 hypothesis variants autonomously, keeps 3 improvements, produces `results.tsv` + summary table in RESEARCH.md Context.
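The Context entry in the single-run example can be produced by a small formatter. This is only a sketch: `format_entry` and `short_hash` are hypothetical names, and the hashing scheme (first 7 hex characters of a SHA-256 over the script bytes) is an assumption, since the text does not say how the script hash is computed.

```python
import hashlib
from datetime import datetime
from typing import Dict


def short_hash(data: bytes) -> str:
    """Assumed scheme: first 7 hex chars of the SHA-256 of the script bytes."""
    return hashlib.sha256(data).hexdigest()[:7]


def format_entry(when: datetime, metrics: Dict[str, float], script_hash: str) -> str:
    """Build one Context line in the shape shown in the example above."""
    stamp = when.strftime("%Y-%m-%d %H:%M")
    pairs = ", ".join(f"{name}={value}" for name, value in metrics.items())
    return f"{stamp} — {pairs}, hash={script_hash}"
```

Calling `format_entry(datetime(2026, 3, 29, 14, 32), {"val_acc": 0.923}, "abc1234")` reproduces the entry from the single-run example.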