autoresearch-ml


# Autoresearch ML: Autonomous LLM Training Optimization

An autonomous experiment loop for single-GPU LLM pretraining. Edit `train.py` → commit → run 5-minute training → measure `val_bpb` → keep improvement or revert → repeat forever. This skill is self-contained — it includes everything needed to set up and run the loop.

## Setup Phase

### 1. Copy Template Assets

Copy the bundled training template to the project directory:

```bash
cp ${CLAUDE_SKILL_DIR}/assets/prepare.py .
cp ${CLAUDE_SKILL_DIR}/assets/train.py .
cp ${CLAUDE_SKILL_DIR}/assets/pyproject.toml .
cp ${CLAUDE_SKILL_DIR}/assets/program.md .
```

### 2. Install and Prepare

```bash
uv sync                    # Install dependencies
uv run prepare.py          # Download data shards, train tokenizer (~2 min)
```

### 3. Verify GPU

```bash
nvidia-smi
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name()}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
```

### 4. Initialize the Experiment Session

1. Create a branch: `git checkout -b autoresearch/<tag>-<date>`
2. Ensure session files are gitignored (critical — `git revert` will fail if tracked):
   ```bash
   echo -e "autoresearch.jsonl\nrun.log" >> .gitignore
   git add .gitignore && git commit -m "autoresearch: add session files to gitignore"
   ```
3. Read `prepare.py` and `train.py` thoroughly to understand the codebase
4. Write `autoresearch.md` — a living session document recording goal, metrics, files in scope, constraints, and learnings
5. Write `autoresearch.sh` — the benchmark script (see the Benchmark Script section below)
6. Commit session files
7. Run baseline: `bash autoresearch.sh`
8. Parse metrics from output (lines matching `METRIC name=value`)
9. Record baseline in `autoresearch.jsonl`:
   - First write a config header:
     `{"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}`
   - Then record the baseline result
10. Begin the experiment loop
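Step 9 can be sketched concretely. This is a minimal illustration: the commit hash and metric value below are placeholders, and the record schema follows the config header above and the per-experiment line described in the Logging section:

```shell
# Seed autoresearch.jsonl: config header first, then the baseline record
# (run 0). The hash and metric values here are placeholders.
printf '%s\n' \
  '{"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}' \
  '{"run":0,"commit":"abc1234","metric":1.012,"status":"keep","description":"baseline"}' \
  > autoresearch.jsonl
wc -l < autoresearch.jsonl    # 2: one config line + one baseline line
```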

## The Experiment Loop

**LOOP FOREVER. Never ask "should I continue?" — just keep going.**
The user might be asleep, away from the computer, or expecting you to work indefinitely. Each experiment takes ~5 minutes, so you can run ~12/hour, ~100 overnight. The loop runs until the user interrupts you, period. If you run out of ideas, think harder — re-read `train.py` for new angles, try combining previous near-misses, try more radical architectural changes.

Each iteration:

1. Read the current git state and `autoresearch.md`
2. Choose an experimental change to `train.py` (informed by past results and ASI notes)
3. Edit `train.py` (the ONLY editable file)
4. `git add train.py && git commit -m "experiment: <description>"`
5. Run: `bash autoresearch.sh > run.log 2>&1`
6. Parse `METRIC` lines from the output
7. If the output is empty (crash): `tail -n 50 run.log` to read the stack trace
8. Decide: keep or discard
9. Log the result to `autoresearch.jsonl` (include ASI annotations)
10. If discard/crash: `git revert $(git rev-parse HEAD) --no-edit`
11. Update `autoresearch.md` with learnings (every few experiments)
12. Repeat
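Step 7 hinges on the absence of `METRIC` lines. A minimal sketch, with a fabricated `run.log` standing in for a crashed run:

```shell
# Simulate a crashed run: run.log holds only a traceback, no METRIC lines.
printf 'Traceback (most recent call last):\n  File "train.py", line 1\nRuntimeError: CUDA out of memory\n' > run.log
if ! grep -q '^METRIC ' run.log; then
    echo "crash detected"
    tail -n 50 run.log        # read the stack trace before reverting
fi
```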

## Decision Rules

- **val_bpb improved (lower)** → **keep** (commit stays, branch advances)
- **val_bpb equal or worse** → **discard** (run `git revert $(git rev-parse HEAD) --no-edit`)
- **Crash (OOM, CUDA error, NaN loss)** → **discard** (revert). If it's a simple fix (typo, import), fix and re-run. If the idea is fundamentally broken, log as crash and move on.
- **Simpler code for equal val_bpb** → **keep** (removing complexity is a win)
- **Catastrophic VRAM increase** → consider **discard** even if val_bpb improved slightly
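Shell arithmetic can't compare floats, so a small awk test is one way to apply the first two rules (the metric values here are hypothetical):

```shell
best=1.012      # current best val_bpb
new=0.993       # this experiment's val_bpb
# Lower is better: keep only on strict improvement (+0 forces numeric compare).
if awk -v a="$new" -v b="$best" 'BEGIN{exit !(a+0 < b+0)}'; then
    echo "keep"
else
    echo "discard"   # then: git revert $(git rev-parse HEAD) --no-edit
fi
```

An equal value falls through to `discard`, matching the "equal or worse" rule.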

## Simplicity Criterion

All else being equal, simpler is better. A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 improvement from deleting code? Definitely keep. Equal val_bpb with much simpler code? Keep.

## Constraints

- **Fixed 5-minute time budget.** All experiments are directly comparable — the wall clock is the equalizer.
- **Single file modification.** Only `train.py` changes; `prepare.py` is immutable. This ensures fair comparison (same data, same evaluation).
- **VRAM is a soft constraint.** Using more VRAM is acceptable but note the trade-off (larger model = fewer training steps in 5 minutes).
- **No new packages.** You can only use what's already in `pyproject.toml`.
- **Timeout.** If a run exceeds 10 minutes, kill it and treat as a crash.
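The timeout rule can be enforced with coreutils `timeout`, which exits with status 124 when it kills the command. A sketch wrapped in a helper (the short durations are only for demonstration; in the loop you would cap `bash autoresearch.sh` at 600 seconds):

```shell
run_with_cap() {               # run_with_cap <seconds> <command>
    timeout "$1" bash -c "$2"
    if [ $? -eq 124 ]; then    # 124 = killed by timeout: treat as a crash
        echo "status=crash"
    else
        echo "status=ok"
    fi
}
run_with_cap 1 'sleep 3'       # exceeds the cap
run_with_cap 5 'true'          # finishes in time
```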

## Don't Thrash

If 3 consecutive experiments fail or get discarded, stop and think about why. Re-read `train.py` for new angles. Try a fundamentally different approach.

## Handling User Messages

If the user sends a message while the loop is running: finish the current cycle, address the feedback, then resume immediately — do not wait for permission.

## Logging to autoresearch.jsonl

Each experiment appends one JSON line:

```json
{"run":2,"commit":"def5678","metric":0.993,"metrics":{"peak_memory_mb":44200,"mfu_percent":39.8},"status":"keep","description":"increase LR to 0.04","timestamp":1700000000,"segment":0,"confidence":null,"asi":{"hypothesis":"higher LR converges faster","arch_change":"MATRIX_LR 0.03→0.04"}}
```

Use the shared logging script:

```bash
bash ${CLAUDE_SKILL_DIR}/scripts/log-experiment.sh \
  --run 2 \
  --commit "$(git rev-parse --short HEAD)" \
  --metric 0.993 \
  --status keep \
  --description "increase LR to 0.04" \
  --metrics '{"peak_memory_mb":44200,"mfu_percent":39.8}' \
  --segment 0 \
  --asi '{"hypothesis":"higher LR converges faster"}'
```

Parse metrics from benchmark output:

```bash
bash autoresearch.sh 2>&1 | bash ${CLAUDE_SKILL_DIR}/scripts/parse-metrics.sh
```

Valid statuses: `keep`, `discard`, `crash`, `checks_failed`
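The exact behavior of `scripts/parse-metrics.sh` lives in that script; a rough stand-in, assuming it simply extracts the `METRIC` lines, is:

```shell
# Pull name=value pairs out of mixed benchmark output (sample input inlined).
printf 'step 100 loss 2.10\nMETRIC val_bpb=0.993\nMETRIC peak_memory_mb=44200\n' \
  | sed -n 's/^METRIC //p'
```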

## ASI (Actionable Side Information)

ASI is structured annotation per experiment that survives reverts. When code changes are discarded, only the description and ASI remain — the only structured memory of what happened.

Record ASI for every experiment:

```json
{
  "hypothesis": "Deeper model with fewer steps should compress better",
  "arch_change": "DEPTH 8→12, DEVICE_BATCH_SIZE 128→64",
  "result": "val_bpb improved 0.998→0.992, but 2x VRAM",
  "next_action_hint": "Try intermediate DEPTH=10 for better VRAM tradeoff"
}
```

## Resuming After Context Reset

If `autoresearch.jsonl` and `autoresearch.md` exist in the working directory:

1. Read `autoresearch.md` for full context (goal, metrics, files, constraints, learnings)
2. Read `autoresearch.jsonl` to see all past experiments, current best, and ASI annotations
3. Check the git log to verify the current branch state matches the expected state
4. If the git state is dirty (unclean shutdown), revert uncommitted changes
5. Resume the loop from where it left off — no re-setup needed
6. Resume immediately — do not ask "should I continue?"
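Step 4 can be sketched in a throwaway repo. The paths and file contents below are illustrative; in a real session the dirty file would be `train.py` with uncommitted experiment edits:

```shell
# Set up a demo repo with one committed file, then dirty it.
cd "$(mktemp -d)" && git init -q . && git config user.email ci@local && git config user.name ci
echo "DEPTH = 8" > train.py && git add train.py && git commit -qm baseline
echo "DEPTH = 12" >> train.py                  # simulate an unclean shutdown
if [ -n "$(git status --porcelain)" ]; then    # dirty tree?
    git checkout -- train.py                   # drop the uncommitted changes
fi
git status --porcelain | wc -l                 # 0: clean again
```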

## Confidence Scoring

After 3+ experiments, assess whether improvements are real or noise:

- Compute the median absolute deviation (MAD) of all metric values as a noise floor
- Confidence = |best improvement| / MAD
- ≥2.0× → likely real improvement
- 1.0–2.0× → marginal, could be noise
- <1.0× → within noise floor

ML training with fixed seeds is mostly deterministic, so the noise floor is typically very low.
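With only POSIX tools, the MAD and confidence ratio can be computed like this. The val_bpb series is invented for illustration (1.012 baseline, 0.993 best, so the best improvement is 0.019):

```shell
vals='1.012 0.998 0.996 0.997 0.993'
# median of whitespace-separated numbers read from stdin
median() { tr ' ' '\n' | sort -n | awk '{a[NR]=$1} END{m=(NR%2)?a[(NR+1)/2]:(a[NR/2]+a[NR/2+1])/2; print m}'; }
med=$(echo $vals | median)
# MAD: median of absolute deviations from the median
mad=$(for v in $vals; do awk -v v="$v" -v m="$med" 'BEGIN{d=v-m; print (d<0)?-d:d}'; done | median)
awk -v best=0.019 -v mad="$mad" 'BEGIN{printf "confidence=%.1fx\n", best/mad}'   # |best improvement| / MAD
```

Here the 0.019 improvement is 19× the 0.001 noise floor, well above the 2.0× threshold.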

## Template Architecture

### `prepare.py` (FIXED — never modify)

- **Data download:** fetches parquet shards from HuggingFace (climbmix-400b-shuffle)
- **Tokenizer training:** BPE tokenizer (8192 vocab) using rustbpe/tiktoken
- **Dataloader:** best-fit document packing with 100% token utilization, BOS-aligned
- **Evaluation:** `evaluate_bpb()` computes bits-per-byte (vocab-size-independent metric)

Key constants: `MAX_SEQ_LEN = 2048`, `TIME_BUDGET = 300`, `EVAL_TOKENS = 40 * 524288`, `VOCAB_SIZE = 8192`

### `train.py` (MODIFIED BY AGENT — the only editable file)

- **Model:** GPT with RoPE, sliding window attention, value embeddings, Flash Attention 3
- **Optimizer:** hybrid MuonAdamW (Muon for matrices, AdamW for everything else)
- **Training:** gradient accumulation, LR schedules (warmup/flat/warmdown), fixed time budget

Editable: `ASPECT_RATIO`, `DEPTH`, `WINDOW_PATTERN`, `TOTAL_BATCH_SIZE`, learning rates, LR schedule phases, and the full model architecture.

## GPU Requirements

### Supported GPU Tiers

| Tier | GPUs | VRAM | Notes |
|------|------|------|-------|
| Consumer | GTX 1080 Ti, RTX 2080 Ti | 11GB | fp32 fallback, gradient checkpointing required |
| Consumer+ | RTX 3090, RTX 4090 | 24GB | Great for experiments |
| Enthusiast | RTX 5090 | 32GB | Excellent — larger models possible |
| Datacenter | A100, H100 | 40-80GB | Original development target |

### Consumer GPU Adaptations

For GPUs with limited VRAM (< 16GB), apply these changes to `train.py` during the first experiment:

1. **Remove the Flash Attention 3 import and dependency** — the top-level `from kernels import get_kernel` block (lines 20-24) runs unconditionally at startup and will fail on non-Hopper GPUs. Replace the entire block and the `fa3.flash_attn_func()` call in `CausalSelfAttention.forward()` with `torch.nn.functional.scaled_dot_product_attention`. Also remove `kernels` from `pyproject.toml` and run `uv sync` again.
2. **Enable gradient checkpointing** — use `torch.utils.checkpoint.checkpoint()` with `use_reentrant=False` to trade ~30% compute for ~50% VRAM savings.
3. **Auto-scale model size** — reduce `DEPTH` and `DEVICE_BATCH_SIZE` to fit the VRAM budget (see the table below).
4. **Cap evaluation steps** — scale the eval batch count by available VRAM (30-100 steps).
5. **fp32 fallback** — use fp32 instead of bf16 for Pascal GPUs (compute capability < 7.5). Change the autocast dtype and disable bf16-specific optimizations.

### VRAM Auto-Scaling Guide

| VRAM Budget | DEPTH | n_embd | Batch Size | Seq Length | ~Params |
|-------------|-------|--------|------------|------------|---------|
| 4GB | 2 | 128 | 4 | 512 | ~1M |
| 8GB | 4 | 256 | 8 | 1024 | ~5M |
| 12GB | 6 | 384 | 16 | 1024 | ~14M |
| 16GB | 8 | 512 | 32 | 2048 | ~25M |
| 24GB | 8 | 512 | 128 | 2048 | ~50M |
| 32GB | 12 | 768 | 128 | 2048 | ~85M |
| 80GB | 16 | 1024 | 128 | 2048 | ~200M |

Note: `n_embd` must be a multiple of `HEAD_DIM` (default 128). Config search: start with the largest depth that fits; if OOM, reduce `DEVICE_BATCH_SIZE` first, then `MAX_SEQ_LEN`.
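As a rough sanity check (my approximation, not from the template), the ~Params column tracks the standard transformer estimate of about 12 · DEPTH · n_embd² (attention plus MLP weights, ignoring embeddings) for most rows, which gives a quick way to size a candidate config:

```shell
# e.g. the 32GB row: DEPTH=12, n_embd=768 → roughly 85M parameters
awk -v depth=12 -v n_embd=768 'BEGIN{printf "~%.0fM params\n", 12*depth*n_embd*n_embd/1e6}'
```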

## Experiment Strategies

1. **Architecture:** layer count, attention patterns, embedding dimensions, activation functions
2. **Optimizer:** learning rates (per-parameter), schedule phases, momentum, weight decay
3. **Attention:** window sizes, sliding window configs, full vs. local attention
4. **Batch size:** trade-off between gradient quality and steps-per-budget
5. **Initialization:** weight init schemes, residual scaling parameters
6. **Advanced:** value embeddings, softcapped logits, GQA

## Metric: Bits Per Byte (BPB)

How well the model compresses text, normalized by byte count. Vocabulary-size-independent — all architectures are directly comparable. Lower is better. See `references/gpu-training-guide.md` for the formula and interpretation table.
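The full formula is in the reference guide; in essence, bpb converts the mean cross-entropy loss (nats per token) to bits and rescales by the token-to-byte ratio. A sketch with invented numbers:

```shell
# bpb = (loss / ln 2) * (tokens / bytes); the loss, token, and byte
# counts below are made up for illustration.
awk -v loss=2.45 -v tokens=524288 -v bytes=1850000 \
    'BEGIN{printf "val_bpb=%.3f\n", (loss/log(2)) * tokens/bytes}'
```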

## Benchmark Script

Use this as `autoresearch.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

uv run train.py > run.log 2>&1

val_bpb=$(grep "^val_bpb:" run.log | tail -1 | awk '{print $2}' || echo "0")
memory=$(grep "^peak_vram_mb:" run.log | tail -1 | awk '{print $2}' || echo "0")
mfu=$(grep "^mfu_percent:" run.log | tail -1 | awk '{print $2}' || echo "0")

echo "METRIC val_bpb=$val_bpb"
echo "METRIC peak_memory_mb=$memory"
echo "METRIC mfu_percent=$mfu"
```

## Session Files

| File | Purpose |
|------|---------|
| `autoresearch.md` | Living session document — goal, metrics, scope, learnings |
| `autoresearch.sh` | Benchmark script — outputs `METRIC name=value` lines |
| `autoresearch.jsonl` | Append-only experiment log with ASI (survives restarts) |

## Additional Resources

- `references/gpu-training-guide.md` — detailed GPU setup, CUDA configuration, OOM troubleshooting, BPB formula, and performance tuning
- `scripts/parse-metrics.sh` — extract METRIC lines from benchmark output
- `scripts/log-experiment.sh` — append experiment results to autoresearch.jsonl
- `assets/prepare.py` — data preparation (download, tokenizer, dataloader, evaluation)
- `assets/train.py` — model architecture and training loop
- `assets/program.md` — self-contained agent instructions for the ML loop
- `assets/pyproject.toml` — Python dependencies (PyTorch, Flash Attention, etc.)