autoresearch-ml
# Autoresearch ML: Autonomous LLM Training Optimization
An autonomous experiment loop for single-GPU LLM pretraining. Edit → commit → run 5-minute training → measure → keep improvement or revert → repeat forever.
The loop edits `train.py` and measures `val_bpb`. This skill is self-contained — it includes everything needed to set up and run the loop.
## Setup Phase

### 1. Copy Template Assets
Copy the bundled training template to the project directory:
```bash
cp ${CLAUDE_SKILL_DIR}/assets/prepare.py .
cp ${CLAUDE_SKILL_DIR}/assets/train.py .
cp ${CLAUDE_SKILL_DIR}/assets/pyproject.toml .
cp ${CLAUDE_SKILL_DIR}/assets/program.md .
```

### 2. Install and Prepare
```bash
uv sync              # Install dependencies
uv run prepare.py    # Download data shards, train tokenizer (~2 min)
```

### 3. Verify GPU
```bash
nvidia-smi
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name()}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
```

### 4. Initialize the Experiment Session
- Create a branch: `git checkout -b autoresearch/<tag>-<date>`
- Ensure session files are gitignored (critical — `git revert` will fail if they are tracked):

  ```bash
  echo -e "autoresearch.jsonl\nrun.log" >> .gitignore
  git add .gitignore && git commit -m "autoresearch: add session files to gitignore"
  ```

- Read `prepare.py` and `train.py` thoroughly to understand the codebase
- Write `autoresearch.md` — a living session document recording goal, metrics, files in scope, constraints, and learnings
- Write `autoresearch.sh` — the benchmark script (see Benchmark Script section below)
- Commit session files
- Run baseline: `bash autoresearch.sh`
- Parse metrics from output (lines matching `METRIC name=value`)
- Record baseline in `autoresearch.jsonl`:
  - First write a config header: `{"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}`
  - Then record the baseline result
- Begin the experiment loop
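The config header and later results are plain JSON Lines appends. A minimal sketch in Python (the `append_jsonl` helper name is hypothetical; the file name and record fields come from this skill):

```python
import json

def append_jsonl(path: str, record: dict) -> None:
    """Append one JSON object per line, the autoresearch.jsonl format."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Config header: written once, before the baseline result
append_jsonl("autoresearch.jsonl", {
    "type": "config", "name": "Optimize val_bpb",
    "metricName": "val_bpb", "metricUnit": "bpb", "bestDirection": "lower",
})
```

In practice the bundled `log-experiment.sh` handles result records; this only illustrates the file format.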
## The Experiment Loop
LOOP FOREVER. Never ask "should I continue?" — just keep going.
The user might be asleep, away from the computer, or expects you to work indefinitely. Each experiment takes ~5 minutes, so you can run ~12/hour, ~100 overnight. The loop runs until the user interrupts you, period. If you run out of ideas, think harder — re-read `train.py` for new angles, try combining previous near-misses, try more radical architectural changes.

Each iteration:
1. Read current git state and `autoresearch.md`
2. Choose an experimental change to `train.py` (informed by past results and ASI notes)
3. Edit `train.py` (the ONLY editable file)
4. `git add train.py && git commit -m "experiment: <description>"`
5. Run: `bash autoresearch.sh > run.log 2>&1`
6. Parse `METRIC` lines from output
7. If output is empty (crash): `tail -n 50 run.log` to read the stack trace
8. Decide: keep or discard
9. Log result to `autoresearch.jsonl` (include ASI annotations)
10. If discard/crash: `git revert $(git rev-parse HEAD) --no-edit`
11. Update `autoresearch.md` with learnings (every few experiments)
12. Repeat

## Decision Rules
- val_bpb improved (lower) → **keep** (commit stays, branch advances)
- val_bpb equal or worse → **discard** (run `git revert $(git rev-parse HEAD) --no-edit`)
- Crash (OOM, CUDA error, NaN loss) → **discard** (revert). If it's a simple fix (typo, import), fix and re-run. If the idea is fundamentally broken, log as crash and move on.
- Simpler code for equal val_bpb → **keep** (removing complexity is a win)
- Catastrophic VRAM increase → consider **discard** even if val_bpb improved slightly
## Simplicity Criterion
All else being equal, simpler is better. A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 improvement from deleting code? Definitely keep. Equal val_bpb with much simpler code? Keep.
## Constraints
- Fixed 5-minute time budget. All experiments are directly comparable — the wall clock is the equalizer.
- Single file modification. Only `train.py` changes; `prepare.py` is immutable. This ensures fair comparison (same data, same evaluation).
- VRAM is a soft constraint. Using more VRAM is acceptable but note the trade-off (larger model = fewer training steps in 5 minutes).
- No new packages. You can only use what's already in `pyproject.toml`.
- Timeout: If a run exceeds 10 minutes, kill it and treat as a crash.
## Don't Thrash
If 3 consecutive experiments fail or get discarded, stop and think about why. Re-read `train.py` for new angles. Try a fundamentally different approach.

## Handling User Messages
If the user sends a message while the loop is running: finish the current cycle, address the feedback, then resume immediately — do not wait for permission.
## Logging to `autoresearch.jsonl`
Each experiment appends one JSON line:
```json
{"run":2,"commit":"def5678","metric":0.993,"metrics":{"peak_memory_mb":44200,"mfu_percent":39.8},"status":"keep","description":"increase LR to 0.04","timestamp":1700000000,"segment":0,"confidence":null,"asi":{"hypothesis":"higher LR converges faster","arch_change":"MATRIX_LR 0.03→0.04"}}
```

Use the shared logging script:
```bash
bash ${CLAUDE_SKILL_DIR}/scripts/log-experiment.sh \
  --run 2 \
  --commit "$(git rev-parse --short HEAD)" \
  --metric 0.993 \
  --status keep \
  --description "increase LR to 0.04" \
  --metrics '{"peak_memory_mb":44200,"mfu_percent":39.8}' \
  --segment 0 \
  --asi '{"hypothesis":"higher LR converges faster"}'
```

Parse metrics from benchmark output:
```bash
bash autoresearch.sh 2>&1 | bash ${CLAUDE_SKILL_DIR}/scripts/parse-metrics.sh
```

Valid statuses: `keep`, `discard`, `crash`, `checks_failed`
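`parse-metrics.sh` is the canonical extractor; as an illustration of the `METRIC name=value` line format, a Python equivalent might look like this (a sketch, not the script's exact behavior):

```python
import re

# One METRIC line per metric, e.g. "METRIC val_bpb=0.993"
METRIC_RE = re.compile(r"^METRIC (\w+)=(\S+)$", re.MULTILINE)

def parse_metrics(output: str) -> dict[str, float]:
    """Collect all METRIC name=value pairs from benchmark output."""
    return {name: float(value) for name, value in METRIC_RE.findall(output)}
```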
## ASI (Actionable Side Information)
ASI is a structured annotation attached to each experiment that survives reverts. When code changes are discarded, only the description and ASI remain — they are the only structured memory of what happened.
Record ASI for every experiment:
json
{
"hypothesis": "Deeper model with fewer steps should compress better",
"arch_change": "DEPTH 8→12, DEVICE_BATCH_SIZE 128→64",
"result": "val_bpb improved 0.998→0.992, but 2x VRAM",
"next_action_hint": "Try intermediate DEPTH=10 for better VRAM tradeoff"
}ASI是每个实验的结构化注解,在代码回退后仍会保留。当代码改动被丢弃时,只有改动描述和ASI会留存——这是过往实验仅有的结构化记忆。
为每个实验记录ASI:
json
{
"hypothesis": "Deeper model with fewer steps should compress better",
"arch_change": "DEPTH 8→12, DEVICE_BATCH_SIZE 128→64",
"result": "val_bpb improved 0.998→0.992, but 2x VRAM",
"next_action_hint": "Try intermediate DEPTH=10 for better VRAM tradeoff"
}Resuming After Context Reset
上下文重置后恢复运行
If `autoresearch.jsonl` and `autoresearch.md` exist in the working directory:

- Read `autoresearch.md` for full context (goal, metrics, files, constraints, learnings)
- Read `autoresearch.jsonl` to see all past experiments, current best, and ASI annotations
- Check the git log to verify the current branch state matches the expected state
- If the git state is dirty (unclean shutdown), revert uncommitted changes
- Resume the loop from where it left off — no re-setup needed
- Resume immediately — do not ask "should I continue?"
## Confidence Scoring
After 3+ experiments, assess whether improvements are real or noise:
- Compute the Median Absolute Deviation (MAD) of all metric values as a noise floor
- Confidence = |best improvement| / MAD
- ≥2.0× → likely real improvement
- 1.0–2.0× → marginal, could be noise
- <1.0× → within noise floor
ML training with fixed seeds is mostly deterministic, so the noise floor is typically very low.
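The computation above, sketched in Python (the `confidence` function name is hypothetical; a zero MAD from fully deterministic runs is treated as infinite confidence):

```python
import statistics

def confidence(metric_values: list[float], best_improvement: float) -> float:
    """|best improvement| divided by the MAD noise floor of all runs."""
    med = statistics.median(metric_values)
    mad = statistics.median(abs(v - med) for v in metric_values)
    if mad == 0:
        return float("inf")  # deterministic runs: any improvement is real
    return abs(best_improvement) / mad
```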
## Template Architecture
### `prepare.py` (FIXED — never modify)
- Data download: Fetches parquet shards from HuggingFace (climbmix-400b-shuffle)
- Tokenizer training: BPE tokenizer (8192 vocab) using rustbpe/tiktoken
- Dataloader: Best-fit document packing with 100% token utilization, BOS-aligned
- Evaluation: `evaluate_bpb()` computes bits-per-byte (vocab-size-independent metric)

Key constants: `MAX_SEQ_LEN = 2048`, `TIME_BUDGET = 300`, `EVAL_TOKENS = 40 * 524288`, `VOCAB_SIZE = 8192`

### `train.py` (MODIFIED BY AGENT — the only editable file)
- Model: GPT with RoPE, sliding window attention, value embeddings, Flash Attention 3
- Optimizer: Hybrid MuonAdamW (Muon for matrices, AdamW for everything else)
- Training: Gradient accumulation, LR schedules (warmup/flat/warmdown), fixed time budget
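The warmup/flat/warmdown shape can be sketched as a step-dependent LR multiplier (the phase fractions here are illustrative assumptions, not `train.py`'s actual values):

```python
def lr_scale(step: int, total_steps: int,
             warmup_frac: float = 0.1, warmdown_frac: float = 0.2) -> float:
    """Warmup/flat/warmdown multiplier applied to the base learning rate."""
    warmup = max(int(total_steps * warmup_frac), 1)
    warmdown = max(int(total_steps * warmdown_frac), 1)
    if step < warmup:
        return step / warmup                     # linear warmup from 0
    if step >= total_steps - warmdown:
        return (total_steps - step) / warmdown   # linear warmdown to 0
    return 1.0                                   # flat phase
```

Shifting where the warmdown begins is one of the cheapest experiments available, since it touches no model code.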
Editable: `ASPECT_RATIO`, `DEPTH`, `WINDOW_PATTERN`, `TOTAL_BATCH_SIZE`, learning rates, LR schedule phases, and the full model architecture.

## GPU Requirements
### Supported GPU Tiers
| Tier | GPUs | VRAM | Notes |
|---|---|---|---|
| Consumer | GTX 1080 Ti, RTX 2080 Ti | 11GB | fp32 fallback, gradient checkpointing required |
| Consumer+ | RTX 3090, RTX 4090 | 24GB | Great for experiments |
| Enthusiast | RTX 5090 | 32GB | Excellent — larger models possible |
| Datacenter | A100, H100 | 40-80GB | Original development target |
### Consumer GPU Adaptations
For GPUs with limited VRAM (< 16GB), apply these changes to `train.py` during the first experiment:

- Remove Flash Attention 3 import and dependency — the top-level `from kernels import get_kernel` block (lines 20-24) runs unconditionally at startup and will fail on non-Hopper GPUs. Replace the entire block and the `fa3.flash_attn_func()` call in `CausalSelfAttention.forward()` with `torch.nn.functional.scaled_dot_product_attention`. Also remove `kernels` from `pyproject.toml` and run `uv sync` again.
- Enable gradient checkpointing — use `torch.utils.checkpoint.checkpoint()` with `use_reentrant=False` to trade ~30% compute for ~50% VRAM savings
- Auto-scale model size — reduce `DEPTH` and `DEVICE_BATCH_SIZE` to fit VRAM budget (see table below)
- Cap evaluation steps — scale eval batch count by available VRAM (30-100 steps)
- fp32 fallback — use fp32 instead of bf16 for Pascal GPUs (compute capability < 7.5). Change the autocast dtype and disable bf16-specific optimizations.
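The gradient-checkpointing pattern from the second bullet, as a minimal sketch (`TinyBlock` is a stand-in for `train.py`'s real transformer block):

```python
import torch
from torch.utils.checkpoint import checkpoint

class TinyBlock(torch.nn.Module):
    """Stand-in for a transformer block; only the checkpointing pattern matters."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.relu(self.ff(x))

def forward_checkpointed(blocks, x):
    # Activations are recomputed during backward: ~30% extra compute, ~50% VRAM saved
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x
```

In `train.py` this would wrap each block call inside the model's forward pass.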
### VRAM Auto-Scaling Guide
| VRAM Budget | DEPTH | n_embd | Batch Size | Seq Length | ~Params |
|---|---|---|---|---|---|
| 4GB | 2 | 128 | 4 | 512 | ~1M |
| 8GB | 4 | 256 | 8 | 1024 | ~5M |
| 12GB | 6 | 384 | 16 | 1024 | ~14M |
| 16GB | 8 | 512 | 32 | 2048 | ~25M |
| 24GB | 8 | 512 | 128 | 2048 | ~50M |
| 32GB | 12 | 768 | 128 | 2048 | ~85M |
| 80GB | 16 | 1024 | 128 | 2048 | ~200M |
Note: `n_embd` must be a multiple of `HEAD_DIM` (default 128). Config search: start with the largest depth that fits; if OOM, reduce `DEVICE_BATCH_SIZE` first, then `MAX_SEQ_LEN`.
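The ~Params column roughly follows the standard GPT estimate of 12·depth·n_embd² matrix parameters plus token embeddings. A sanity-check sketch (the table's figures are approximate, and `train.py` details like value embeddings shift the real count):

```python
def approx_params(depth: int, n_embd: int, vocab_size: int = 8192) -> int:
    """Rough transformer parameter count: attention+MLP matrices plus embeddings."""
    return 12 * depth * n_embd**2 + vocab_size * n_embd
```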
## Experiment Strategies
- Architecture: Layer count, attention patterns, embedding dimensions, activation functions
- Optimizer: Learning rates (per-parameter), schedule phases, momentum, weight decay
- Attention: Window sizes, sliding window configs, full vs. local attention
- Batch size: Trade-off between gradient quality and steps-per-budget
- Initialization: Weight init schemes, residual scaling parameters
- Advanced: Value embeddings, softcapped logits, GQA
## Metric: Bits Per Byte (BPB)
How well the model compresses text, normalized by byte count. Vocabulary-size-independent — all architectures are directly comparable. Lower is better. See `references/gpu-training-guide.md` for the formula and interpretation table.
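The reference doc is authoritative; the usual definition converts summed token negative log-likelihood in nats to bits and normalizes by the raw byte count of the evaluated text:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Cross-entropy in bits, per byte of evaluated text (vocab-independent)."""
    return total_nll_nats / (math.log(2) * total_bytes)
```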
## Benchmark Script

Use this as `autoresearch.sh`:
```bash
#!/usr/bin/env bash
set -euo pipefail
uv run train.py > run.log 2>&1
val_bpb=$(grep "^val_bpb:" run.log | tail -1 | awk '{print $2}' || echo "0")
memory=$(grep "^peak_vram_mb:" run.log | tail -1 | awk '{print $2}' || echo "0")
mfu=$(grep "^mfu_percent:" run.log | tail -1 | awk '{print $2}' || echo "0")
echo "METRIC val_bpb=$val_bpb"
echo "METRIC peak_memory_mb=$memory"
echo "METRIC mfu_percent=$mfu"
```

## Session Files
| File | Purpose |
|---|---|
| `autoresearch.md` | Living session document — goal, metrics, scope, learnings |
| `autoresearch.sh` | Benchmark script — outputs `METRIC name=value` lines |
| `autoresearch.jsonl` | Append-only experiment log with ASI (survives restarts) |
## Additional Resources
- `references/gpu-training-guide.md` — Detailed GPU setup, CUDA configuration, OOM troubleshooting, BPB formula, and performance tuning
- `scripts/parse-metrics.sh` — Extract METRIC lines from benchmark output
- `scripts/log-experiment.sh` — Append experiment results to autoresearch.jsonl
- `assets/prepare.py` — Data preparation (download, tokenizer, dataloader, evaluation)
- `assets/train.py` — Model architecture and training loop
- `assets/program.md` — Self-contained agent instructions for the ML loop
- `assets/pyproject.toml` — Python dependencies (PyTorch, Flash Attention, etc.)