autoresearch
Original:🇺🇸 English
Translated
5 scripts
Autonomous ML experimentation framework by Andrej Karpathy. AI agent autonomously modifies train.py, runs 5-minute GPU experiments, evaluates with val_bpb, and commits only improvements via git ratcheting — so you wake up to 100+ experiments and a better model. Use when setting up autoresearch, writing program.md directives, interpreting results, configuring hardware, or running overnight autonomous ML experiments. Triggers on: autoresearch, autonomous ml experiments, overnight gpu experiments, karpathy autoresearch, train.py experiments, val_bpb, program.md research directives, ai runs experiments.
8installs
Added on
NPX Install
npx skill4agent add supercent-io/skills-template autoresearchTags
Translated version includes tags in frontmatterSKILL.md Content
View Translation Comparison →autoresearch
"The researcher's job shifts from writing Python to writing Markdown." — Andrej Karpathy
Autoresearch is an autonomous ML experimentation framework. An AI agent iteratively modifies , runs fixed 5-minute GPU experiments, evaluates with a single metric (), and commits only improvements via git ratcheting. The result: wake up to 100+ experiments logged and a monotonically better model.
train.pyval_bpbWhen to use this skill
- Setting up autoresearch on a GPU machine for the first time
- Writing or refining research directives for the agent
program.md - Launching an overnight autonomous experiment loop
- Interpreting to understand what the agent found
results.tsv - Configuring the system for constrained hardware (limited VRAM)
- Understanding the ratcheting mechanism and git workflow
- Porting to Apple Silicon (MLX) or Windows RTX
Core Architecture
Human authors program.md
│
▼
Agent reads program.md + train.py
│
▼
Agent modifies train.py → git commit
│
▼
uv run train.py (exactly 300 seconds)
│
▼
Extract val_bpb + peak_vram_mb
│
┌────┴────┐
improved? no improvement
│ │
keep commit git reset HEAD~1
│ │
└──────┬───────┘
│
log to results.tsv
│
▼
repeat ∞Mutable vs. Immutable Files
| File | Agent access | Purpose |
|---|---|---|
| Read + Write | Model, optimizer, training loop (~630 lines) |
| Read-only | Human research directives |
| Read-only | Data pipeline + |
| Read-only | |
| Read-only | Locked dependencies (no new packages) |
| Append | All experiments: kept and discarded |
Instructions
Step 1: Install Prerequisites
bash
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone the repository
git clone https://github.com/karpathy/autoresearch
cd autoresearch
# Install locked dependencies
uv syncStep 2: Prepare Data (One-Time, ~2 Minutes)
bash
# Downloads FineWeb-Edu parquet shards, trains BPE tokenizer
# Last shard is reserved for validation — never seen during training
uv run prepare.pyFor constrained hardware, edit before running:
prepare.pypython
# Lower MAX_SEQ_LEN for GPUs with limited VRAM
MAX_SEQ_LEN = 256 # default: 2048Step 3: Run a Baseline Experiment
bash
# Single 5-minute experiment to verify setup
uv run train.py > run.log 2>&1
# Extract key metrics
grep "^val_bpb:\|^peak_vram_mb:" run.logExpected output:
val_bpb: 0.9979
peak_vram_mb: 38420Step 4: Author program.md
program.mdmarkdown
# Research Program
## Goal
Minimize val_bpb on the FineWeb-Edu validation set within the 300-second budget.
## Current Baseline
val_bpb: 0.9979 (depth-12 GPT, Muon + AdamW optimizer)
## Directions to Explore
1. Attention variants: MLA, GQA, sliding window, local-global hybrid
2. Layer types: MoE FFN layers, SwiGLU activations
3. Optimizer tuning: Muon momentum, AdamW β values, learning rate schedule
4. Architectural depth/width tradeoffs within VRAM budget
## Constraints
- Must complete within 300 seconds
- Peak VRAM must stay under 39GB
- No new packages (use only what is in pyproject.toml)
- Do not modify prepare.py or constants.py
## Notes from Previous Runs
- Depth-12 improvements transfer to depth-24 (scale-invariant gains)
- RoPE positional encoding outperformed learned embeddings (+0.008 val_bpb)Effective program.md principles:
- Be specific about what to explore — vague directives waste experiments
- Record what has already been tried (prevents redundant experiments)
- Note hardware constraints explicitly
- Use the current best as a reference point
val_bpb
Step 5: Run the Autonomous Agent Loop
Point your AI agent (Claude Code, Codex, etc.) at the repository with as its research context. The agent will:
program.md- Read + current
program.mdtrain.py - Hypothesize an improvement
- Modify + commit
train.py - Execute (300 seconds)
uv run train.py - Extract ; keep or revert via git
val_bpb - Append to
results.tsv - Repeat
With Claude Code (OMC):
bash
# From inside autoresearch/
# Give Claude the context: "Run the autoresearch loop following program.md"With Claude Code CLI directly:
bash
claude "Follow program.md. Run autonomous research loop on train.py.
Execute: uv run train.py, extract val_bpb, keep improvements, revert failures.
Log everything to results.tsv. Do not stop until I say so."Step 6: Monitor Results
bash
# Live monitoring during a run
watch -n 30 "tail -20 results.tsv"
# Count kept vs. discarded
awk -F'\t' '{print $4}' results.tsv | sort | uniq -c
# Find the best experiment
sort -t$'\t' -k2 -n results.tsv | head -5
# Check current best val_bpb
git log --oneline -5Step 7: Interpret results.tsv
commit val_bpb memory_gb status description
a3f2c91 0.9697 37.2 keep SwiGLU activation + depth-12
b8e1d04 0.9821 38.1 discard MoE 4-expert: marginal gain
c1a5f30 crash — crash OOM: sequence length 4096| Status | Meaning |
|---|---|
| |
| No improvement; |
| OOM, syntax error, or timeout; always reverted |
Examples
Example 1: Overnight Run Summary
Session summary: 126 experiments, 18 improvements
Best val_bpb: 0.9697 (started: 0.9979)
Top improvements:
- SwiGLU activation: -0.012 val_bpb
- GQA with 4 KV heads: -0.009 val_bpb
- Muon momentum 0.92→0.95: -0.006 val_bpbExample 2: Low-VRAM Configuration (6GB GPU)
python
# In prepare.py — edit before uv run prepare.py
MAX_SEQ_LEN = 256 # was 2048
EVAL_TOKENS = 2_097_152 # was 20_971_520 (scale down proportionally)Example 3: Extract Experiments by Category
bash
# Find all attention-related experiments
grep -i "attention\|GQA\|MLA\|MHA" results.tsv
# List only improvements sorted by gain
awk -F'\t' '$4=="keep"' results.tsv | sort -t$'\t' -k2 -nAvailable scripts
Run from inside the autoresearch repository directory:
| Script | Purpose | Usage |
|---|---|---|
| One-time environment setup | |
| Single 5-min experiment + metric extraction | |
| Autonomous loop: run → keep/revert → repeat | |
| Human-readable results.tsv report | |
| GPU/CUDA/uv availability check (JSON output) | |
bash
# Typical overnight session
bash scripts/check-hardware.sh
bash scripts/setup.sh --seq-len 512 # adjust for your VRAM
# Edit program.md with your research directives
bash scripts/run-loop.sh --max 100 --desc "session-1"
bash scripts/show-results.sh --kept-onlyReferences
Detailed documentation in :
references/| File | Contents |
|---|---|
| System design, immutability contract, git ratcheting, key design decisions |
| How to write effective |
| VRAM settings by GPU, memory optimization techniques, troubleshooting |
Best practices
- Write program.md before running — the agent is only as good as its directives; vague programs waste compute
- Start with the baseline first — always manually before launching the loop to confirm the setup works
uv run train.py - Keep in
MAX_SEQ_LENconsistent — changing it mid-run invalidates val_bpb comparisonsprepare.py - Never modify or
prepare.py— the evaluation harness must stay fixed for results to be meaningfulconstants.py - Scale improvements before committing — test that a depth-12 improvement also holds at depth-24 before treating it as a fundamental gain
- Commit updates — version-control your research directives alongside
program.mdfor reproducibilityresults.tsv - Monitor VRAM — add constraints in
peak_vram_mbfor your GPU's headroomprogram.md - No new dependencies — the agent cannot ; it can only use what is in
pip installpyproject.toml
Hardware Requirements
| Hardware | Status | Notes |
|---|---|---|
| H100 80GB | Recommended | Default config, full MAX_SEQ_LEN=2048 |
| A100 40GB | Supported | Lower MAX_SEQ_LEN if needed |
| RTX 4090 24GB | Community | Reduce MAX_SEQ_LEN to 512 |
| GTX 1660 Ti 6GB | Community fork | MAX_SEQ_LEN=256, reduced EVAL_TOKENS |
| Apple Silicon (M-series) | MLX port | Community fork; different optimizer API |
| Windows RTX | Community | WSL2 + CUDA recommended |
Key Metrics Reference
| Metric | Direction | Description |
|---|---|---|
| Lower = better | Validation bits-per-byte; vocabulary-size-independent |
| Lower = more headroom | Peak GPU memory during the training run |
| Experiments/hour | Higher = faster search | ~12 at TIME_BUDGET=300 |