Autonomous LLM training optimization with GPU support. Runs 5-minute training experiments, measures val_bpb, keeps improvements or reverts — repeat forever. Use this skill when the user asks to "train a model autonomously", "optimize LLM training", "run ML experiments", "autoresearch with GPU", "optimize val_bpb", "autonomous ML training", "LLM pretraining loop", "setup ML autoresearch", "GPU training experiments", "pretrain from scratch", "speed up training", "lower my loss", "GPU optimization", "CUDA training", or mentions "train.py", "prepare.py", "bits per byte", "val_bpb", "NVIDIA GPU training", "RTX training", "H100 training", "autonomous model training", "consumer GPU training", "low VRAM training". Always use this skill when the user wants to autonomously optimize any ML training metric.
## Installation

```bash
npx skill4agent add proyecto26/autoresearch-ai-plugin autoresearch-ml
```

## Setup

Copy the bundled assets into your project, install dependencies, and prepare the data. `train.py` is the program under optimization; `val_bpb` (validation bits per byte) is the metric.

```bash
cp ${CLAUDE_SKILL_DIR}/assets/prepare.py .
cp ${CLAUDE_SKILL_DIR}/assets/train.py .
cp ${CLAUDE_SKILL_DIR}/assets/pyproject.toml .
cp ${CLAUDE_SKILL_DIR}/assets/program.md .

uv sync             # Install dependencies
uv run prepare.py   # Download data shards, train tokenizer (~2 min)
```

Verify the GPU is visible:

```bash
nvidia-smi
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name()}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
```

Work on a dedicated branch so failed experiments can be undone with `git revert`, and keep session files out of version control:

```bash
git checkout -b autoresearch/<tag>-<date>
echo -e "autoresearch.jsonl\nrun.log" >> .gitignore
git add .gitignore && git commit -m "autoresearch: add session files to gitignore"
```

## Experiment Loop

`prepare.py` readies the data, `train.py` is the benchmark, `autoresearch.md` holds session notes, and `autoresearch.sh` runs one experiment. `bash autoresearch.sh` prints `METRIC name=value` lines; results accumulate in `autoresearch.jsonl`, which begins with a config record:

```json
{"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}
```

Each iteration makes one change to `train.py`:

1. Read current git state and autoresearch.md
2. Choose an experimental change to train.py (informed by past results and ASI notes)
3. Edit train.py (the ONLY editable file)
4. git add train.py && git commit -m "experiment: <description>"
5. Run: bash autoresearch.sh > run.log 2>&1
6. Parse METRIC lines from output
7. If output is empty (crash): tail -n 50 run.log to read the stack trace
8. Decide: keep or discard
9. Log result to autoresearch.jsonl (include ASI annotations)
10. If discard/crash: git revert $(git rev-parse HEAD) --no-edit
11. Update autoresearch.md with learnings (every few experiments)
12. Repeat

Keep the change only if the metric improves; on `discard` (or a crash), undo it with `git revert $(git rev-parse HEAD) --no-edit`. Only `train.py` is editable; never touch `prepare.py` or `pyproject.toml`.

## Logging Results

Each experiment appends one JSON line to `autoresearch.jsonl`, for example:

```json
{"run":2,"commit":"def5678","metric":0.993,"metrics":{"peak_memory_mb":44200,"mfu_percent":39.8},"status":"keep","description":"increase LR to 0.04","timestamp":1700000000,"segment":0,"confidence":null,"asi":{"hypothesis":"higher LR converges faster","arch_change":"MATRIX_LR 0.03→0.04"}}
```

Use the bundled helper to append a record:

```bash
bash ${CLAUDE_SKILL_DIR}/scripts/log-experiment.sh \
  --run 2 \
  --commit "$(git rev-parse --short HEAD)" \
  --metric 0.993 \
  --status keep \
  --description "increase LR to 0.04" \
  --metrics '{"peak_memory_mb":44200,"mfu_percent":39.8}' \
  --segment 0 \
  --asi '{"hypothesis":"higher LR converges faster"}'
```

To run a benchmark and extract metrics in one step:

```bash
bash autoresearch.sh 2>&1 | bash ${CLAUDE_SKILL_DIR}/scripts/parse-metrics.sh
```

Valid `status` values are `keep`, `discard`, `crash`, and `checks_failed`.

## ASI Annotations

Each record's `asi` object captures the hypothesis behind the change, the outcome, and what to try next:

```json
{
  "hypothesis": "Deeper model with fewer steps should compress better",
  "arch_change": "DEPTH 8→12, DEVICE_BATCH_SIZE 128→64",
  "result": "val_bpb improved 0.998→0.992, but 2x VRAM",
  "next_action_hint": "Try intermediate DEPTH=10 for better VRAM tradeoff"
}
```

Raw annotations live in `autoresearch.jsonl`; distilled learnings go into `autoresearch.md`. On restart, re-read `autoresearch.md` and `autoresearch.jsonl` to recover session context.

## What You May Change in train.py

The evaluation contract is fixed: do not modify `evaluate_bpb()`, `MAX_SEQ_LEN = 2048`, `TIME_BUDGET = 300`, `EVAL_TOKENS = 40 * 524288`, or `VOCAB_SIZE = 8192`. Hyperparameters such as `ASPECT_RATIO`, `DEPTH`, `WINDOW_PATTERN`, and `TOTAL_BATCH_SIZE` are fair game.

## GPU Compatibility

| Tier | GPUs | VRAM | Notes |
|---|---|---|---|
| Consumer | GTX 1080 Ti, RTX 2080 Ti | 11GB | fp32 fallback, gradient checkpointing required |
| Consumer+ | RTX 3090, RTX 4090 | 24GB | Great for experiments |
| Enthusiast | RTX 5090 | 32GB | Excellent — larger models possible |
| Datacenter | A100, H100 | 40-80GB | Original development target |
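The tier table can be turned into a small lookup. This helper and its thresholds are illustrative only, not part of the skill's scripts:

```python
def vram_tier(vram_gb: float) -> str:
    """Map reported VRAM (GB) to the tier names in the table above."""
    if vram_gb < 12:
        return "Consumer"      # fp32 fallback, gradient checkpointing required
    if vram_gb < 32:
        return "Consumer+"     # RTX 3090 / RTX 4090 class
    if vram_gb < 40:
        return "Enthusiast"    # RTX 5090 class
    return "Datacenter"        # A100 / H100

# On a CUDA machine, feed it the detected VRAM:
# vram_tier(torch.cuda.get_device_properties(0).total_memory / 1e9)
```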
## Consumer GPU Adaptations

`train.py` targets datacenter GPUs: it loads FlashAttention 3 with `from kernels import get_kernel` and calls `fa3.flash_attn_func()`. On consumer cards, swap that call in `CausalSelfAttention.forward()` for `torch.nn.functional.scaled_dot_product_attention`, remove `kernels` from `pyproject.toml`, and re-run `uv sync`. If memory is still tight, enable gradient checkpointing via `torch.utils.checkpoint.checkpoint()` with `use_reentrant=False`, and reduce `DEPTH` and `DEVICE_BATCH_SIZE`. Suggested starting points by VRAM budget:

| VRAM Budget | DEPTH | n_embd | Batch Size | Seq Length | ~Params |
|---|---|---|---|---|---|
| 4GB | 2 | 128 | 4 | 512 | ~1M |
| 8GB | 4 | 256 | 8 | 1024 | ~5M |
| 12GB | 6 | 384 | 16 | 1024 | ~14M |
| 16GB | 8 | 512 | 32 | 2048 | ~25M |
| 24GB | 8 | 512 | 128 | 2048 | ~50M |
| 32GB | 12 | 768 | 128 | 2048 | ~85M |
| 80GB | 16 | 1024 | 128 | 2048 | ~200M |
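The ~Params column follows the usual non-embedding estimate for a GPT-style stack, roughly 12·DEPTH·n_embd² (attention ≈ 4d², MLP ≈ 8d² per layer), plus VOCAB_SIZE·n_embd for the embedding table. A quick sanity-check helper, illustrative only:

```python
def approx_params(depth: int, n_embd: int, vocab_size: int = 8192) -> int:
    """Back-of-envelope parameter count for a GPT-style transformer."""
    blocks = 12 * depth * n_embd ** 2   # attention + MLP weights per layer
    embeddings = vocab_size * n_embd    # token embedding table
    return blocks + embeddings

approx_params(12, 768)  # ~91M with embeddings, ~85M without
```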
When resizing, keep `n_embd` divisible by `HEAD_DIM`, and size `DEVICE_BATCH_SIZE` and `MAX_SEQ_LEN` to your VRAM budget. See `references/gpu-training-guide.md` for details.

## autoresearch.sh

Create `autoresearch.sh` as the benchmark entry point:

```bash
#!/usr/bin/env bash
set -euo pipefail
uv run train.py > run.log 2>&1
val_bpb=$(grep "^val_bpb:" run.log | tail -1 | awk '{print $2}' || echo "0")
memory=$(grep "^peak_vram_mb:" run.log | tail -1 | awk '{print $2}' || echo "0")
mfu=$(grep "^mfu_percent:" run.log | tail -1 | awk '{print $2}' || echo "0")
echo "METRIC val_bpb=$val_bpb"
echo "METRIC peak_memory_mb=$memory"
echo "METRIC mfu_percent=$mfu"
```

## Session Files

| File | Purpose |
|---|---|
| `autoresearch.md` | Living session document — goal, metrics, scope, learnings |
| `autoresearch.sh` | Benchmark script — outputs `METRIC name=value` lines |
| `autoresearch.jsonl` | Append-only experiment log with ASI (survives restarts) |
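The bundled `scripts/parse-metrics.sh` extracts these lines in shell; an equivalent sketch in Python (illustrative, not the bundled script):

```python
import re

def parse_metric_lines(output: str) -> dict:
    """Collect `METRIC name=value` lines emitted by autoresearch.sh."""
    pattern = re.compile(r"^METRIC (\w+)=([-+.\deE]+)$", re.MULTILINE)
    return {name: float(value) for name, value in pattern.findall(output)}

parse_metric_lines("METRIC val_bpb=0.993\nMETRIC mfu_percent=39.8")
# {'val_bpb': 0.993, 'mfu_percent': 39.8}
```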
## Bundled Files

- `references/gpu-training-guide.md`
- `scripts/parse-metrics.sh`
- `scripts/log-experiment.sh`
- `assets/prepare.py`
- `assets/train.py`
- `assets/program.md`
- `assets/pyproject.toml`