# autoresearch

"The researcher's job shifts from writing Python to writing Markdown." — Andrej Karpathy
Autoresearch is an autonomous ML experimentation framework. An AI agent iteratively modifies
train.py
, runs fixed 5-minute GPU experiments, evaluates with a single metric (
val_bpb
), and commits only improvements via git ratcheting. The result: wake up to 100+ experiments logged and a monotonically better model.
"研究者的工作从编写Python代码转变为编写Markdown文档。" — Andrej Karpathy
autoresearch是一个自动化机器学习实验框架。AI Agent会迭代修改
train.py
,运行固定时长为5分钟的GPU实验,通过单一指标
val_bpb
进行评估,并仅通过git棘轮机制提交有性能提升的修改。最终结果:一觉醒来就能看到100+次已记录的实验,以及性能持续优化的模型。

## When to use this skill

- Setting up autoresearch on a GPU machine for the first time
- Writing or refining `program.md` research directives for the agent
- Launching an overnight autonomous experiment loop
- Interpreting `results.tsv` to understand what the agent found
- Configuring the system for constrained hardware (limited VRAM)
- Understanding the ratcheting mechanism and git workflow
- Porting to Apple Silicon (MLX) or Windows RTX

## Core Architecture

```
Human authors program.md
Agent reads program.md + train.py
Agent modifies train.py → git commit
uv run train.py  (exactly 300 seconds)
Extract val_bpb + peak_vram_mb
  ┌────┴────┐
improved?   no improvement
  │              │
keep commit   git reset HEAD~1
  │              │
  └──────┬───────┘
   log to results.tsv
    repeat ∞
```
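The loop above can be sketched as one Python iteration. This is only an illustration of the control flow under stated assumptions: the log line format (`val_bpb:` / `peak_vram_mb:`) matches the metrics shown elsewhere in this document, but `parse_run`, `decide`, and `step` are hypothetical helpers, not the project's actual driver.

```python
import re
import subprocess

def parse_run(stdout):
    """Pull (val_bpb, peak_vram_mb) out of a train.py log, or None if missing."""
    bpb = re.search(r"^val_bpb: ([\d.]+)", stdout, re.MULTILINE)
    vram = re.search(r"^peak_vram_mb: (\d+)", stdout, re.MULTILINE)
    return (float(bpb.group(1)), int(vram.group(1))) if bpb and vram else None

def decide(best_bpb, result):
    """Ratchet rule: keep strict improvements only; crashes and ties are reverted."""
    if result is None:
        return "crash"
    return "keep" if result[0] < best_bpb else "discard"

def step(best_bpb):
    """One iteration: run the fixed 300-second experiment, then keep or rewind HEAD."""
    proc = subprocess.run(["uv", "run", "train.py"], capture_output=True, text=True)
    status = decide(best_bpb, parse_run(proc.stdout))
    if status != "keep":
        # the ratchet: a non-improving commit is removed from the branch
        subprocess.run(["git", "reset", "--hard", "HEAD~1"])
    return status
```

The decision is kept separate from the git side effect so that it can be tested without touching the repository.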

## Mutable vs. Immutable Files

| File | Agent access | Purpose |
| --- | --- | --- |
| `train.py` | Read + Write | Model, optimizer, training loop (~630 lines) |
| `program.md` | Read-only | Human research directives |
| `prepare.py` | Read-only | Data pipeline + `evaluate_bpb()` harness |
| `constants.py` | Read-only | `TIME_BUDGET=300`, `MAX_SEQ_LEN`, `EVAL_TOKENS` |
| `pyproject.toml` | Read-only | Locked dependencies (no new packages) |
| `results.tsv` | Append | All experiments: kept and discarded |

## Instructions

### Step 1: Install Prerequisites

```bash
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository and install locked dependencies
git clone <autoresearch-repo-url>   # repository URL not shown in this document
cd autoresearch
uv sync
```

### Step 2: Prepare Data (One-Time, ~2 Minutes)

```bash
# Downloads FineWeb-Edu parquet shards and trains a BPE tokenizer.
# The last shard is reserved for validation and is never seen during training.
uv run prepare.py
```

For constrained hardware, edit `prepare.py` before running:

```python
# Lower MAX_SEQ_LEN for GPUs with limited VRAM
MAX_SEQ_LEN = 256  # default: 2048
```

### Step 3: Run a Baseline Experiment

```bash
# Single 5-minute experiment to verify setup
uv run train.py > run.log 2>&1

# Extract key metrics (-E enables the | alternation)
grep -E "^val_bpb:|^peak_vram_mb:" run.log
```

Expected output:

```
val_bpb: 0.9979
peak_vram_mb: 38420
```

### Step 4: Author program.md

`program.md` is the human-written research charter the agent reads at the start of every loop iteration. Write it as precise Markdown instructions:

```markdown
# Research Program

## Goal
Minimize val_bpb on the FineWeb-Edu validation set within the 300-second budget.

## Current Baseline
val_bpb: 0.9979 (depth-12 GPT, Muon + AdamW optimizer)

## Directions to Explore
1. Attention variants: MLA, GQA, sliding window, local-global hybrid
2. Layer types: MoE FFN layers, SwiGLU activations
3. Optimizer tuning: Muon momentum, AdamW β values, learning rate schedule
4. Architectural depth/width tradeoffs within VRAM budget

## Constraints
- Must complete within 300 seconds
- Peak VRAM must stay under 39GB
- No new packages (use only what is in pyproject.toml)
- Do not modify prepare.py or constants.py

## Notes from Previous Runs
- Depth-12 improvements transfer to depth-24 (scale-invariant gains)
- RoPE positional encoding outperformed learned embeddings (+0.008 val_bpb)
```

**Effective program.md principles:**
- Be specific about what to explore — vague directives waste experiments
- Record what has already been tried (prevents redundant experiments)
- Note hardware constraints explicitly
- Use the current best `val_bpb` as a reference point

### Step 5: Run the Autonomous Agent Loop

Point your AI agent (Claude Code, Codex, etc.) at the repository with `program.md` as its research context. The agent will:

1. Read `program.md` + current `train.py`
2. Hypothesize an improvement
3. Modify `train.py` + commit
4. Execute `uv run train.py` (300 seconds)
5. Extract `val_bpb`; keep or revert via git
6. Append to `results.tsv`
7. Repeat

**With Claude Code (OMC):** from inside `autoresearch/`, give Claude the context: "Run the autoresearch loop following program.md"


**With Claude Code CLI directly:**

```bash
claude "Follow program.md. Run autonomous research loop on train.py.
Execute: uv run train.py, extract val_bpb, keep improvements, revert failures.
Log everything to results.tsv. Do not stop until I say so."
```

### Step 6: Monitor Results

```bash
# Live monitoring during a run
watch -n 30 "tail -20 results.tsv"

# Count kept vs. discarded
awk -F'\t' '{print $4}' results.tsv | sort | uniq -c

# Find the best experiment
sort -t$'\t' -k2 -n results.tsv | head -5

# Check current best val_bpb
git log --oneline -5
```

### Step 7: Interpret results.tsv

```
commit    val_bpb    memory_gb    status     description
a3f2c91   0.9697     37.2         keep       SwiGLU activation + depth-12
b8e1d04   0.9821     38.1         discard    MoE 4-expert: marginal gain
c1a5f30   crash      —            crash      OOM: sequence length 4096
```

| Status | Meaning |
| --- | --- |
| `keep` | `val_bpb` improved; commit retained on branch |
| `discard` | No improvement; `git reset HEAD~1` applied |
| `crash` | OOM, syntax error, or timeout; always reverted |
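The status accounting above can be reproduced with a few lines of Python. This is a minimal sketch assuming the five-column TSV layout shown in this step (`commit`, `val_bpb`, `memory_gb`, `status`, `description`); `summarize` is a hypothetical helper, not a project script.

```python
import csv
from collections import Counter
from io import StringIO

def summarize(tsv_text):
    """Count statuses and return the best completed row from a results.tsv."""
    rows = list(csv.reader(StringIO(tsv_text), delimiter="\t"))[1:]  # skip header
    counts = Counter(row[3] for row in rows)
    completed = [r for r in rows if r[3] != "crash"]  # crashed rows have no val_bpb
    best = min(completed, key=lambda r: float(r[1])) if completed else None
    return counts, best

# The three example rows from the table above:
tsv = (
    "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
    "a3f2c91\t0.9697\t37.2\tkeep\tSwiGLU activation + depth-12\n"
    "b8e1d04\t0.9821\t38.1\tdiscard\tMoE 4-expert: marginal gain\n"
    "c1a5f30\tcrash\t-\tcrash\tOOM: sequence length 4096\n"
)
counts, best = summarize(tsv)
print(dict(counts))      # {'keep': 1, 'discard': 1, 'crash': 1}
print(best[0], best[1])  # a3f2c91 0.9697
```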

## Examples

### Example 1: Overnight Run Summary

```
Session summary: 126 experiments, 18 improvements
Best val_bpb: 0.9697 (started: 0.9979)
Top improvements:
- SwiGLU activation: -0.012 val_bpb
- GQA with 4 KV heads: -0.009 val_bpb
- Muon momentum 0.92→0.95: -0.006 val_bpb
```

### Example 2: Low-VRAM Configuration (6GB GPU)

```python
# In prepare.py; edit before running `uv run prepare.py`
MAX_SEQ_LEN = 256          # was 2048
EVAL_TOKENS = 2_097_152    # was 20_971_520 (scale down proportionally)
```
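If you are unsure which value to pick, the hardware table later in this document suggests rough tiers. Here is a hypothetical helper encoding those tiers; the thresholds are rough starting points taken from that table, not measured limits.

```python
def suggested_seq_len(vram_gb):
    """Map available VRAM (GB) to a starting MAX_SEQ_LEN for prepare.py.
    Tiers follow the hardware table in this document; tune from there."""
    if vram_gb >= 40:
        return 2048  # H100 / A100 class: default config
    if vram_gb >= 24:
        return 512   # RTX 4090 class
    return 256       # 6GB-class GPUs

print(suggested_seq_len(80))  # 2048
print(suggested_seq_len(24))  # 512
print(suggested_seq_len(6))   # 256
```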

### Example 3: Extract Experiments by Category

```bash
# Find all attention-related experiments (-E enables the | alternation)
grep -iE "attention|GQA|MLA|MHA" results.tsv

# List only improvements sorted by gain
awk -F'\t' '$4=="keep"' results.tsv | sort -t$'\t' -k2 -n
```

## Available scripts

Run from inside the autoresearch repository directory:

| Script | Purpose | Usage |
| --- | --- | --- |
| `setup.sh` | One-time environment setup | `bash scripts/setup.sh [--seq-len 512]` |
| `run-experiment.sh` | Single 5-min experiment + metric extraction | `bash scripts/run-experiment.sh` |
| `run-loop.sh` | Autonomous loop: run → keep/revert → repeat | `bash scripts/run-loop.sh [--max 20]` |
| `show-results.sh` | Human-readable results.tsv report | `bash scripts/show-results.sh [--top 10]` |
| `check-hardware.sh` | GPU/CUDA/uv availability check (JSON output) | `bash scripts/check-hardware.sh` |

### Typical overnight session

```bash
bash scripts/check-hardware.sh
bash scripts/setup.sh --seq-len 512   # adjust for your VRAM

# Edit program.md with your research directives

bash scripts/run-loop.sh --max 100 --desc "session-1"
bash scripts/show-results.sh --kept-only
```

## References

Detailed documentation in `references/`:

| File | Contents |
| --- | --- |
| `references/architecture.md` | System design, immutability contract, git ratcheting, key design decisions |
| `references/program-md-guide.md` | How to write effective `program.md` directives; full template + principles |
| `references/hardware-config.md` | VRAM settings by GPU, memory optimization techniques, troubleshooting |

## Best practices

1. **Write program.md before running** — the agent is only as good as its directives; vague programs waste compute
2. **Start with the baseline first** — always run `uv run train.py` manually before launching the loop to confirm the setup works
3. **Keep `MAX_SEQ_LEN` in `prepare.py` consistent** — changing it mid-run invalidates val_bpb comparisons
4. **Never modify `prepare.py` or `constants.py`** — the evaluation harness must stay fixed for results to be meaningful
5. **Scale improvements before committing** — test that a depth-12 improvement also holds at depth-24 before treating it as a fundamental gain
6. **Commit `program.md` updates** — version-control your research directives alongside `results.tsv` for reproducibility
7. **Monitor VRAM** — add `peak_vram_mb` constraints in `program.md` for your GPU's headroom
8. **No new dependencies** — the agent cannot `pip install`; it can only use what is in `pyproject.toml`

## Hardware Requirements

| Hardware | Status | Notes |
| --- | --- | --- |
| H100 80GB | Recommended | Default config, full MAX_SEQ_LEN=2048 |
| A100 40GB | Supported | Lower MAX_SEQ_LEN if needed |
| RTX 4090 24GB | Community | Reduce MAX_SEQ_LEN to 512 |
| GTX 1660 Ti 6GB | Community fork | MAX_SEQ_LEN=256, reduced EVAL_TOKENS |
| Apple Silicon (M-series) | MLX port | Community fork; different optimizer API |
| Windows RTX | Community | WSL2 + CUDA recommended |

## Key Metrics Reference

| Metric | Direction | Description |
| --- | --- | --- |
| `val_bpb` | Lower = better | Validation bits-per-byte; vocabulary-size-independent |
| `peak_vram_mb` | Lower = more headroom | Peak GPU memory during the training run |
| Experiments/hour | Higher = faster search | ~12 at TIME_BUDGET=300 |
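Bits-per-byte normalizes cross-entropy by the raw byte count of the validation text, which is why the score does not depend on the tokenizer's vocabulary size. A sketch of the conversion, assuming the loss is mean cross-entropy in nats per token; `bits_per_byte` is an illustrative helper, not the project's `evaluate_bpb()`.

```python
import math

def bits_per_byte(mean_loss_nats, total_tokens, total_bytes):
    """Convert mean cross-entropy (nats/token) into bits per byte of raw text."""
    total_bits = mean_loss_nats * total_tokens / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# A mean loss of ln(2) nats/token is exactly 1 bit/token; at an average of
# one byte per token, that works out to 1.0 bits per byte:
print(round(bits_per_byte(math.log(2), 4_000_000, 4_000_000), 6))  # 1.0
```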
