Autonomous LLM training optimization with GPU support. Runs 5-minute training experiments, measures val_bpb, keeps improvements or reverts — repeat forever. Use this skill when the user asks to "train a model autonomously", "optimize LLM training", "run ML experiments", "autoresearch with GPU", "optimize val_bpb", "autonomous ML training", "LLM pretraining loop", "setup ML autoresearch", "GPU training experiments", "pretrain from scratch", "speed up training", "lower my loss", "GPU optimization", "CUDA training", or mentions "train.py", "prepare.py", "bits per byte", "val_bpb", "NVIDIA GPU training", "RTX training", "H100 training", "autonomous model training", "consumer GPU training", "low VRAM training". Always use this skill when the user wants to autonomously optimize any ML training metric.
## Installation

```bash
npx skill4agent add proyecto26/autoresearch-ai-plugin autoresearch-ml
```

## Setup

Copy the bundled assets into your project, install dependencies, and prepare the data. `train.py` is the program under optimization; `val_bpb` (validation bits per byte) is the metric.

```bash
cp ${CLAUDE_SKILL_DIR}/assets/prepare.py .
cp ${CLAUDE_SKILL_DIR}/assets/train.py .
cp ${CLAUDE_SKILL_DIR}/assets/pyproject.toml .
cp ${CLAUDE_SKILL_DIR}/assets/program.md .

uv sync             # Install dependencies
uv run prepare.py   # Download data shards, train tokenizer (~2 min)
```

Verify the GPU is visible:

```bash
nvidia-smi
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name()}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
```

Work on a dedicated branch so failed experiments can be undone with `git revert`, and keep session files out of version control:

```bash
git checkout -b autoresearch/<tag>-<date>
echo -e "autoresearch.jsonl\nrun.log" >> .gitignore
git add .gitignore && git commit -m "autoresearch: add session files to gitignore"
```

## Experiment Loop

`prepare.py` readies the data, `train.py` is the benchmark, `autoresearch.md` holds session notes, and `autoresearch.sh` runs one experiment. `bash autoresearch.sh` prints `METRIC name=value` lines; results accumulate in `autoresearch.jsonl`, which begins with a config record:

```json
{"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}
```

Each iteration makes one change to `train.py`:

1. Read current git state and autoresearch.md
2. Choose an experimental change to train.py (informed by past results and ASI notes)
3. Edit train.py (the ONLY editable file)
4. git add train.py && git commit -m "experiment: <description>"
5. Run: bash autoresearch.sh > run.log 2>&1
6. Parse METRIC lines from output
7. If output is empty (crash): tail -n 50 run.log to read the stack trace
8. Decide: keep or discard
9. Log result to autoresearch.jsonl (include ASI annotations)
10. If discard/crash: git revert $(git rev-parse HEAD) --no-edit
11. Update autoresearch.md with learnings (every few experiments)
12. Repeat

Keep the change only if the metric improves; on `discard` (or a crash), undo it with `git revert $(git rev-parse HEAD) --no-edit`. Only `train.py` is editable; never touch `prepare.py` or `pyproject.toml`.

## Logging Results

Each experiment appends one JSON line to `autoresearch.jsonl`, for example:

```json
{"run":2,"commit":"def5678","metric":0.993,"metrics":{"peak_memory_mb":44200,"mfu_percent":39.8},"status":"keep","description":"increase LR to 0.04","timestamp":1700000000,"segment":0,"confidence":null,"asi":{"hypothesis":"higher LR converges faster","arch_change":"MATRIX_LR 0.03→0.04"}}
```

Use the bundled helper to append a record:

```bash
bash ${CLAUDE_SKILL_DIR}/scripts/log-experiment.sh \
  --run 2 \
  --commit "$(git rev-parse --short HEAD)" \
  --metric 0.993 \
  --status keep \
  --description "increase LR to 0.04" \
  --metrics '{"peak_memory_mb":44200,"mfu_percent":39.8}' \
  --segment 0 \
  --asi '{"hypothesis":"higher LR converges faster"}'
```

To run a benchmark and extract metrics in one step:

```bash
bash autoresearch.sh 2>&1 | bash ${CLAUDE_SKILL_DIR}/scripts/parse-metrics.sh
```

Valid `status` values are `keep`, `discard`, `crash`, and `checks_failed`.

## ASI Annotations

Each record's `asi` object captures the hypothesis behind the change, the outcome, and what to try next:

```json
{
  "hypothesis": "Deeper model with fewer steps should compress better",
  "arch_change": "DEPTH 8→12, DEVICE_BATCH_SIZE 128→64",
  "result": "val_bpb improved 0.998→0.992, but 2x VRAM",
  "next_action_hint": "Try intermediate DEPTH=10 for better VRAM tradeoff"
}
```

Raw annotations live in `autoresearch.jsonl`; distilled learnings go into `autoresearch.md`. On restart, re-read `autoresearch.md` and `autoresearch.jsonl` to recover session context.

## What You May Change in train.py

The evaluation contract is fixed: do not modify `evaluate_bpb()`, `MAX_SEQ_LEN = 2048`, `TIME_BUDGET = 300`, `EVAL_TOKENS = 40 * 524288`, or `VOCAB_SIZE = 8192`. Hyperparameters such as `ASPECT_RATIO`, `DEPTH`, `WINDOW_PATTERN`, and `TOTAL_BATCH_SIZE` are fair game.

## GPU Compatibility

| Tier | GPUs | VRAM | Notes |
|---|---|---|---|
| Consumer | GTX 1080 Ti, RTX 2080 Ti | 11GB | fp32 fallback, gradient checkpointing required |
| Consumer+ | RTX 3090, RTX 4090 | 24GB | Great for experiments |
| Enthusiast | RTX 5090 | 32GB | Excellent — larger models possible |
| Datacenter | A100, H100 | 40-80GB | Original development target |
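The tier table can be turned into a small lookup. This helper and its thresholds are illustrative only, not part of the skill's scripts:

```python
def vram_tier(vram_gb: float) -> str:
    """Map reported VRAM (GB) to the tier names in the table above."""
    if vram_gb < 12:
        return "Consumer"      # fp32 fallback, gradient checkpointing required
    if vram_gb < 32:
        return "Consumer+"     # RTX 3090 / RTX 4090 class
    if vram_gb < 40:
        return "Enthusiast"    # RTX 5090 class
    return "Datacenter"        # A100 / H100

# On a CUDA machine, feed it the detected VRAM:
# vram_tier(torch.cuda.get_device_properties(0).total_memory / 1e9)
```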
## Consumer GPU Adaptations

`train.py` targets datacenter GPUs: it loads FlashAttention 3 with `from kernels import get_kernel` and calls `fa3.flash_attn_func()`. On consumer cards, swap that call in `CausalSelfAttention.forward()` for `torch.nn.functional.scaled_dot_product_attention`, remove `kernels` from `pyproject.toml`, and re-run `uv sync`. If memory is still tight, enable gradient checkpointing via `torch.utils.checkpoint.checkpoint()` with `use_reentrant=False`, and reduce `DEPTH` and `DEVICE_BATCH_SIZE`. Suggested starting points by VRAM budget:

| VRAM Budget | DEPTH | n_embd | Batch Size | Seq Length | ~Params |
|---|---|---|---|---|---|
| 4GB | 2 | 128 | 4 | 512 | ~1M |
| 8GB | 4 | 256 | 8 | 1024 | ~5M |
| 12GB | 6 | 384 | 16 | 1024 | ~14M |
| 16GB | 8 | 512 | 32 | 2048 | ~25M |
| 24GB | 8 | 512 | 128 | 2048 | ~50M |
| 32GB | 12 | 768 | 128 | 2048 | ~85M |
| 80GB | 16 | 1024 | 128 | 2048 | ~200M |
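The ~Params column follows the usual non-embedding estimate for a GPT-style stack, roughly 12·DEPTH·n_embd² (attention ≈ 4d², MLP ≈ 8d² per layer), plus VOCAB_SIZE·n_embd for the embedding table. A quick sanity-check helper, illustrative only:

```python
def approx_params(depth: int, n_embd: int, vocab_size: int = 8192) -> int:
    """Back-of-envelope parameter count for a GPT-style transformer."""
    blocks = 12 * depth * n_embd ** 2   # attention + MLP weights per layer
    embeddings = vocab_size * n_embd    # token embedding table
    return blocks + embeddings

approx_params(12, 768)  # ~91M with embeddings, ~85M without
```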
When resizing, keep `n_embd` divisible by `HEAD_DIM`, and size `DEVICE_BATCH_SIZE` and `MAX_SEQ_LEN` to your VRAM budget. See `references/gpu-training-guide.md` for details.

## autoresearch.sh

Create `autoresearch.sh` as the benchmark entry point:

```bash
#!/usr/bin/env bash
set -euo pipefail
uv run train.py > run.log 2>&1
val_bpb=$(grep "^val_bpb:" run.log | tail -1 | awk '{print $2}' || echo "0")
memory=$(grep "^peak_vram_mb:" run.log | tail -1 | awk '{print $2}' || echo "0")
mfu=$(grep "^mfu_percent:" run.log | tail -1 | awk '{print $2}' || echo "0")
echo "METRIC val_bpb=$val_bpb"
echo "METRIC peak_memory_mb=$memory"
echo "METRIC mfu_percent=$mfu"
```

## Session Files

| File | Purpose |
|---|---|
| `autoresearch.md` | Living session document — goal, metrics, scope, learnings |
| `autoresearch.sh` | Benchmark script — outputs `METRIC name=value` lines |
| `autoresearch.jsonl` | Append-only experiment log with ASI (survives restarts) |
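The bundled `scripts/parse-metrics.sh` extracts these lines in shell; an equivalent sketch in Python (illustrative, not the bundled script):

```python
import re

def parse_metric_lines(output: str) -> dict:
    """Collect `METRIC name=value` lines emitted by autoresearch.sh."""
    pattern = re.compile(r"^METRIC (\w+)=([-+.\deE]+)$", re.MULTILINE)
    return {name: float(value) for name, value in pattern.findall(output)}

parse_metric_lines("METRIC val_bpb=0.993\nMETRIC mfu_percent=39.8")
# {'val_bpb': 0.993, 'mfu_percent': 39.8}
```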
## Bundled Files

- `references/gpu-training-guide.md`
- `scripts/parse-metrics.sh`
- `scripts/log-experiment.sh`
- `assets/prepare.py`
- `assets/train.py`
- `assets/program.md`
- `assets/pyproject.toml`