# autoresearch
> "The researcher's job shifts from writing Python to writing Markdown." — Andrej Karpathy

Autoresearch is an autonomous ML experimentation framework. An AI agent iteratively modifies `train.py`, runs fixed 5-minute GPU experiments, evaluates with a single metric (`val_bpb`), and commits only improvements via git ratcheting. The result: wake up to 100+ experiments logged and a monotonically better model.

## When to use this skill
- Setting up autoresearch on a GPU machine for the first time
- Writing or refining research directives (`program.md`) for the agent
- Launching an overnight autonomous experiment loop
- Interpreting `results.tsv` to understand what the agent found
- Configuring the system for constrained hardware (limited VRAM)
- Understanding the ratcheting mechanism and git workflow
- Porting to Apple Silicon (MLX) or Windows RTX
## Core Architecture
```
Human authors program.md
          │
          ▼
Agent reads program.md + train.py
          │
          ▼
Agent modifies train.py → git commit
          │
          ▼
uv run train.py (exactly 300 seconds)
          │
          ▼
Extract val_bpb + peak_vram_mb
          │
     ┌────┴────┐
 improved?   no improvement
     │            │
keep commit   git reset HEAD~1
     │            │
     └──────┬─────┘
            │
   log to results.tsv
            │
            ▼
        repeat ∞
```

## Mutable vs. Immutable Files
| File | Agent access | Purpose |
|---|---|---|
| `train.py` | Read + Write | Model, optimizer, training loop (~630 lines) |
| `program.md` | Read-only | Human research directives |
| `prepare.py` | Read-only | Data pipeline + tokenizer |
| `constants.py` | Read-only | Configuration constants |
| `pyproject.toml` | Read-only | Locked dependencies (no new packages) |
| `results.tsv` | Append | All experiments: kept and discarded |
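The access contract above is easy to verify mechanically. A minimal sketch of such a check, assuming the file roles in the table (the helper names are hypothetical; the repo itself may not ship this script):

```python
import subprocess

# Per the access table: the agent may write train.py and append to results.tsv.
MUTABLE = {"train.py"}
APPEND_ONLY = {"results.tsv"}

def contract_violations(changed_files: list[str]) -> list[str]:
    """Return changed paths the agent was not allowed to touch."""
    return [f for f in changed_files if f not in MUTABLE | APPEND_ONLY]

def changed_since_head() -> list[str]:
    """Files modified in the working tree (requires a git checkout)."""
    out = subprocess.run(["git", "diff", "--name-only", "HEAD"],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()

# e.g. contract_violations(["train.py", "constants.py"]) -> ["constants.py"]
```

Running `contract_violations(changed_since_head())` after each agent turn would flag any edit that strays outside `train.py`.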
## Instructions

### Step 1: Install Prerequisites
```bash
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/karpathy/autoresearch
cd autoresearch

# Install locked dependencies
uv sync
```
### Step 2: Prepare Data (One-Time, ~2 Minutes)
```bash
# Downloads FineWeb-Edu parquet shards, trains BPE tokenizer.
# The last shard is reserved for validation and is never seen during training.
uv run prepare.py
```

For constrained hardware, edit `prepare.py` before running:

```python
# Lower MAX_SEQ_LEN for GPUs with limited VRAM
MAX_SEQ_LEN = 256  # default: 2048
```
### Step 3: Run a Baseline Experiment
```bash
# Single 5-minute experiment to verify setup
uv run train.py > run.log 2>&1

# Extract key metrics (-E enables the | alternation)
grep -E "^val_bpb:|^peak_vram_mb:" run.log
```

Expected output:

```
val_bpb: 0.9979
peak_vram_mb: 38420
```
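The same extraction can be done in Python when shell tools aren't convenient. A minimal sketch, assuming only the `name: value` log format shown above (the helper name is hypothetical):

```python
import re

def extract_metrics(log_text: str) -> dict[str, float]:
    """Parse 'val_bpb: 0.9979'-style lines from a train.py run log."""
    pattern = re.compile(r"^(val_bpb|peak_vram_mb):\s*([0-9.]+)", re.MULTILINE)
    return {name: float(value) for name, value in pattern.findall(log_text)}

log = "step 500 done\nval_bpb: 0.9979\npeak_vram_mb: 38420\n"
# extract_metrics(log) -> {'val_bpb': 0.9979, 'peak_vram_mb': 38420.0}
```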
### Step 4: Author program.md
Example `program.md`:

```markdown
# Research Program

## Goal

Minimize val_bpb on the FineWeb-Edu validation set within the 300-second budget.

## Current Baseline

val_bpb: 0.9979 (depth-12 GPT, Muon + AdamW optimizer)

## Directions to Explore

- Attention variants: MLA, GQA, sliding window, local-global hybrid
- Layer types: MoE FFN layers, SwiGLU activations
- Optimizer tuning: Muon momentum, AdamW β values, learning rate schedule
- Architectural depth/width tradeoffs within VRAM budget

## Constraints

- Must complete within 300 seconds
- Peak VRAM must stay under 39GB
- No new packages (use only what is in pyproject.toml)
- Do not modify prepare.py or constants.py

## Notes from Previous Runs

- Depth-12 improvements transfer to depth-24 (scale-invariant gains)
- RoPE positional encoding outperformed learned embeddings (+0.008 val_bpb)
```

**Effective program.md principles:**

- Be specific about what to explore; vague directives waste experiments
- Record what has already been tried (prevents redundant experiments)
- Note hardware constraints explicitly
- Use the current best `val_bpb` as a reference point

### Step 5: Run the Autonomous Agent Loop
Point your AI agent (Claude Code, Codex, etc.) at the repository with `program.md` as its research context. The agent will:

- Read `program.md` + current `train.py`
- Hypothesize an improvement
- Modify `train.py` + commit
- Execute `uv run train.py` (300 seconds)
- Extract `val_bpb`; keep or revert via git
- Append to `results.tsv`
- Repeat

**With Claude Code (OMC):**

```bash
# From inside autoresearch/
# Give Claude the context: "Run the autoresearch loop following program.md"
```

**With Claude Code CLI directly:**

```bash
claude "Follow program.md. Run autonomous research loop on train.py.
Execute: uv run train.py, extract val_bpb, keep improvements, revert failures.
Log everything to results.tsv. Do not stop until I say so."
```

### Step 6: Monitor Results
```bash
# Live monitoring during a run
watch -n 30 "tail -20 results.tsv"

# Count kept vs. discarded
awk -F'\t' '{print $4}' results.tsv | sort | uniq -c

# Find the best experiment
sort -t$'\t' -k2 -n results.tsv | head -5

# Check current best val_bpb
git log --oneline -5
```

### Step 7: Interpret results.tsv
```
commit   val_bpb  memory_gb  status   description
a3f2c91  0.9697   37.2       keep     SwiGLU activation + depth-12
b8e1d04  0.9821   38.1       discard  MoE 4-expert: marginal gain
c1a5f30  crash    —          crash    OOM: sequence length 4096
```

| Status | Meaning |
|---|---|
| `keep` | `val_bpb` improved; commit retained |
| `discard` | No improvement; reverted via `git reset HEAD~1` |
| `crash` | OOM, syntax error, or timeout; always reverted |
## Examples

### Example 1: Overnight Run Summary

```
Session summary: 126 experiments, 18 improvements
Best val_bpb: 0.9697 (started: 0.9979)
Top improvements:
- SwiGLU activation: -0.012 val_bpb
- GQA with 4 KV heads: -0.009 val_bpb
- Muon momentum 0.92→0.95: -0.006 val_bpb
```

### Example 2: Low-VRAM Configuration (6GB GPU)
```python
# In prepare.py: edit before running `uv run prepare.py`
MAX_SEQ_LEN = 256        # was 2048
EVAL_TOKENS = 2_097_152  # was 20_971_520 (scale down proportionally)
```
### Example 3: Extract Experiments by Category
```bash
# Find all attention-related experiments (-E enables the | alternation)
grep -iE "attention|GQA|MLA|MHA" results.tsv

# List only improvements sorted by gain
awk -F'\t' '$4=="keep"' results.tsv | sort -t$'\t' -k2 -n
```
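For richer analysis than grep/awk, the TSV can be loaded directly. A minimal sketch, assuming the five-column layout shown in Step 7 (commit, val_bpb, memory_gb, status, description; the real file may also carry a header row):

```python
import csv
from io import StringIO

def summarize(tsv_text: str) -> dict:
    """Count experiments and find the best kept val_bpb in a results.tsv."""
    rows = list(csv.reader(StringIO(tsv_text), delimiter="\t"))
    kept = [r for r in rows if r[3] == "keep"]          # column 4 = status
    best = min((float(r[1]) for r in kept), default=None)
    return {"total": len(rows), "kept": len(kept), "best_val_bpb": best}

sample = (
    "a3f2c91\t0.9697\t37.2\tkeep\tSwiGLU activation + depth-12\n"
    "b8e1d04\t0.9821\t38.1\tdiscard\tMoE 4-expert: marginal gain\n"
)
# summarize(sample) -> {'total': 2, 'kept': 1, 'best_val_bpb': 0.9697}
```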
## Available scripts

Run from inside the autoresearch repository directory:

| Script | Purpose | Usage |
|---|---|---|
| `scripts/setup.sh` | One-time environment setup | `bash scripts/setup.sh --seq-len 512` |
| | Single 5-min experiment + metric extraction | |
| `scripts/run-loop.sh` | Autonomous loop: run → keep/revert → repeat | `bash scripts/run-loop.sh --max 100 --desc "session-1"` |
| `scripts/show-results.sh` | Human-readable results.tsv report | `bash scripts/show-results.sh --kept-only` |
| `scripts/check-hardware.sh` | GPU/CUDA/uv availability check (JSON output) | `bash scripts/check-hardware.sh` |
**Typical overnight session:**

```bash
bash scripts/check-hardware.sh
bash scripts/setup.sh --seq-len 512   # adjust for your VRAM

# Edit program.md with your research directives

bash scripts/run-loop.sh --max 100 --desc "session-1"
bash scripts/show-results.sh --kept-only
```
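The pre-flight check is easy to approximate in Python. A rough sketch of what a hardware probe like `scripts/check-hardware.sh` might report; the field names and format here are assumptions, not the real script's output:

```python
import json
import shutil
import subprocess

def check_hardware() -> str:
    """Report uv / nvidia-smi availability and visible GPUs as JSON."""
    info = {
        "uv": shutil.which("uv") is not None,
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
        "gpus": [],
    }
    if info["nvidia_smi"]:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        info["gpus"] = out.stdout.strip().splitlines()
    return json.dumps(info, indent=2)
```

On a machine without an NVIDIA driver, `gpus` stays empty and `nvidia_smi` is `false`, which is a useful signal before launching an overnight loop.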
undefinedReferences
参考文档
Detailed documentation in :
references/| File | Contents |
|---|---|
| System design, immutability contract, git ratcheting, key design decisions |
| How to write effective |
| VRAM settings by GPU, memory optimization techniques, troubleshooting |
详细文档位于目录:
references/| 文件 | 内容 |
|---|---|
| 系统设计、不可变性约定、git棘轮机制、核心设计决策 |
| 如何编写高效的 |
| 不同GPU的显存设置、内存优化技巧、故障排查 |
## Best practices

- Write `program.md` before running: the agent is only as good as its directives; vague programs waste compute
- Start with the baseline first: always run `uv run train.py` manually before launching the loop to confirm the setup works
- Keep `MAX_SEQ_LEN` in `prepare.py` consistent: changing it mid-run invalidates `val_bpb` comparisons
- Never modify `prepare.py` or `constants.py`: the evaluation harness must stay fixed for results to be meaningful
- Scale improvements before committing: test that a depth-12 improvement also holds at depth-24 before treating it as a fundamental gain
- Commit `program.md` updates: version-control your research directives alongside `results.tsv` for reproducibility
- Monitor `peak_vram_mb`: add constraints in `program.md` for your GPU's headroom
- No new dependencies: the agent cannot `pip install`; it can only use what is in `pyproject.toml`
## Hardware Requirements

| Hardware | Status | Notes |
|---|---|---|
| H100 80GB | Recommended | Default config, full MAX_SEQ_LEN=2048 |
| A100 40GB | Supported | Lower MAX_SEQ_LEN if needed |
| RTX 4090 24GB | Community | Reduce MAX_SEQ_LEN to 512 |
| GTX 1660 Ti 6GB | Community fork | MAX_SEQ_LEN=256, reduced EVAL_TOKENS |
| Apple Silicon (M-series) | MLX port | Community fork; different optimizer API |
| Windows RTX | Community | WSL2 + CUDA recommended |
## Key Metrics Reference

| Metric | Direction | Description |
|---|---|---|
| `val_bpb` | Lower = better | Validation bits-per-byte; vocabulary-size-independent |
| `peak_vram_mb` | Lower = more headroom | Peak GPU memory during the training run |
| Experiments/hour | Higher = faster search | ~12 at TIME_BUDGET=300 |
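`val_bpb` is vocabulary-size-independent because it normalizes loss by raw bytes of text rather than by tokens. The conversion below is the standard bits-per-byte formula; the exact accounting inside `train.py` may differ in detail:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats, over all validation tokens)
    to bits per byte of the underlying text: nats / (ln 2 * bytes)."""
    return total_nll_nats / (math.log(2) * total_bytes)

# e.g. 2.0 nats of total loss over 3 bytes of text:
# bits_per_byte(2.0, 3) == 2.0 / (math.log(2) * 3) ≈ 0.9618
```

Two tokenizers with different vocabularies produce different per-token losses, but the same underlying text has the same byte count, so `val_bpb` stays comparable across tokenizer changes.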