autoresearch-ml


# Autoresearch ML: Autonomous LLM Training Optimization

An autonomous experiment loop for single-GPU LLM pretraining. Edit `train.py` → commit → run 5-minute training → measure `val_bpb` → keep improvement or revert → repeat forever. This skill is self-contained — it includes everything needed to set up and run the loop.

## Setup Phase

### 1. Copy Template Assets

Copy the bundled training template to the project directory:

```bash
cp ${CLAUDE_SKILL_DIR}/assets/prepare.py .
cp ${CLAUDE_SKILL_DIR}/assets/train.py .
cp ${CLAUDE_SKILL_DIR}/assets/pyproject.toml .
cp ${CLAUDE_SKILL_DIR}/assets/program.md .
```

### 2. Install and Prepare

```bash
uv sync                    # Install dependencies
uv run prepare.py          # Download data shards, train tokenizer (~2 min)
```

### 3. Verify GPU

```bash
nvidia-smi
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name()}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')"
```

### 4. Initialize the Experiment Session

1. Create a branch: `git checkout -b autoresearch/<tag>-<date>`
2. Ensure session files are gitignored (critical — `git revert` will fail if tracked):
   ```bash
   echo -e "autoresearch.jsonl\nrun.log" >> .gitignore
   git add .gitignore && git commit -m "autoresearch: add session files to gitignore"
   ```
3. Read `prepare.py` and `train.py` thoroughly to understand the codebase
4. Write `autoresearch.md` — a living session document recording goal, metrics, files in scope, constraints, and learnings
5. Write `autoresearch.sh` — the benchmark script (see the Benchmark Script section below)
6. Commit session files
7. Run baseline: `bash autoresearch.sh`
8. Parse metrics from output (lines matching `METRIC name=value`)
9. Record baseline in `autoresearch.jsonl`:
   - First write a config header:
     `{"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}`
   - Then record the baseline result
10. Begin the experiment loop
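Step 9 can be sketched concretely. This is a minimal illustration: the commit hash and metric value below are placeholders, and the record schema follows the config header above and the per-experiment line described in the Logging section:

```shell
# Seed autoresearch.jsonl: config header first, then the baseline record
# (run 0). The hash and metric values here are placeholders.
printf '%s\n' \
  '{"type":"config","name":"Optimize val_bpb","metricName":"val_bpb","metricUnit":"bpb","bestDirection":"lower"}' \
  '{"run":0,"commit":"abc1234","metric":1.012,"status":"keep","description":"baseline"}' \
  > autoresearch.jsonl
wc -l < autoresearch.jsonl    # 2: one config line + one baseline line
```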

## The Experiment Loop

**LOOP FOREVER. Never ask "should I continue?" — just keep going.**
The user might be asleep, away from the computer, or expecting you to work indefinitely. Each experiment takes ~5 minutes, so you can run ~12/hour, ~100 overnight. The loop runs until the user interrupts you, period. If you run out of ideas, think harder — re-read `train.py` for new angles, try combining previous near-misses, try more radical architectural changes.

Each iteration:

1. Read the current git state and `autoresearch.md`
2. Choose an experimental change to `train.py` (informed by past results and ASI notes)
3. Edit `train.py` (the ONLY editable file)
4. `git add train.py && git commit -m "experiment: <description>"`
5. Run: `bash autoresearch.sh > run.log 2>&1`
6. Parse `METRIC` lines from the output
7. If the output is empty (crash): `tail -n 50 run.log` to read the stack trace
8. Decide: keep or discard
9. Log the result to `autoresearch.jsonl` (include ASI annotations)
10. If discard/crash: `git revert $(git rev-parse HEAD) --no-edit`
11. Update `autoresearch.md` with learnings (every few experiments)
12. Repeat
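Step 7 hinges on the absence of `METRIC` lines. A minimal sketch, with a fabricated `run.log` standing in for a crashed run:

```shell
# Simulate a crashed run: run.log holds only a traceback, no METRIC lines.
printf 'Traceback (most recent call last):\n  File "train.py", line 1\nRuntimeError: CUDA out of memory\n' > run.log
if ! grep -q '^METRIC ' run.log; then
    echo "crash detected"
    tail -n 50 run.log        # read the stack trace before reverting
fi
```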

## Decision Rules

- **val_bpb improved (lower)** → **keep** (commit stays, branch advances)
- **val_bpb equal or worse** → **discard** (run `git revert $(git rev-parse HEAD) --no-edit`)
- **Crash (OOM, CUDA error, NaN loss)** → **discard** (revert). If it's a simple fix (typo, import), fix and re-run. If the idea is fundamentally broken, log as crash and move on.
- **Simpler code for equal val_bpb** → **keep** (removing complexity is a win)
- **Catastrophic VRAM increase** → consider **discard** even if val_bpb improved slightly
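Shell arithmetic can't compare floats, so a small awk test is one way to apply the first two rules (the metric values here are hypothetical):

```shell
best=1.012      # current best val_bpb
new=0.993       # this experiment's val_bpb
# Lower is better: keep only on strict improvement (+0 forces numeric compare).
if awk -v a="$new" -v b="$best" 'BEGIN{exit !(a+0 < b+0)}'; then
    echo "keep"
else
    echo "discard"   # then: git revert $(git rev-parse HEAD) --no-edit
fi
```

An equal value falls through to `discard`, matching the "equal or worse" rule.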

## Simplicity Criterion

All else being equal, simpler is better. A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 improvement from deleting code? Definitely keep. Equal val_bpb with much simpler code? Keep.

## Constraints

- **Fixed 5-minute time budget.** All experiments are directly comparable — the wall clock is the equalizer.
- **Single file modification.** Only `train.py` changes; `prepare.py` is immutable. This ensures fair comparison (same data, same evaluation).
- **VRAM is a soft constraint.** Using more VRAM is acceptable but note the trade-off (larger model = fewer training steps in 5 minutes).
- **No new packages.** You can only use what's already in `pyproject.toml`.
- **Timeout.** If a run exceeds 10 minutes, kill it and treat as a crash.
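The timeout rule can be enforced with coreutils `timeout`, which exits with status 124 when it kills the command. A sketch wrapped in a helper (the short durations are only for demonstration; in the loop you would cap `bash autoresearch.sh` at 600 seconds):

```shell
run_with_cap() {               # run_with_cap <seconds> <command>
    timeout "$1" bash -c "$2"
    if [ $? -eq 124 ]; then    # 124 = killed by timeout: treat as a crash
        echo "status=crash"
    else
        echo "status=ok"
    fi
}
run_with_cap 1 'sleep 3'       # exceeds the cap
run_with_cap 5 'true'          # finishes in time
```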

## Don't Thrash

If 3 consecutive experiments fail or get discarded, stop and think about why. Re-read `train.py` for new angles. Try a fundamentally different approach.

## Handling User Messages

If the user sends a message while the loop is running: finish the current cycle, address the feedback, then resume immediately — do not wait for permission.

## Logging to autoresearch.jsonl

Each experiment appends one JSON line:

```json
{"run":2,"commit":"def5678","metric":0.993,"metrics":{"peak_memory_mb":44200,"mfu_percent":39.8},"status":"keep","description":"increase LR to 0.04","timestamp":1700000000,"segment":0,"confidence":null,"asi":{"hypothesis":"higher LR converges faster","arch_change":"MATRIX_LR 0.03→0.04"}}
```

Use the shared logging script:

```bash
bash ${CLAUDE_SKILL_DIR}/scripts/log-experiment.sh \
  --run 2 \
  --commit "$(git rev-parse --short HEAD)" \
  --metric 0.993 \
  --status keep \
  --description "increase LR to 0.04" \
  --metrics '{"peak_memory_mb":44200,"mfu_percent":39.8}' \
  --segment 0 \
  --asi '{"hypothesis":"higher LR converges faster"}'
```

Parse metrics from benchmark output:

```bash
bash autoresearch.sh 2>&1 | bash ${CLAUDE_SKILL_DIR}/scripts/parse-metrics.sh
```

Valid statuses: `keep`, `discard`, `crash`, `checks_failed`
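The exact behavior of `scripts/parse-metrics.sh` lives in that script; a rough stand-in, assuming it simply extracts the `METRIC` lines, is:

```shell
# Pull name=value pairs out of mixed benchmark output (sample input inlined).
printf 'step 100 loss 2.10\nMETRIC val_bpb=0.993\nMETRIC peak_memory_mb=44200\n' \
  | sed -n 's/^METRIC //p'
```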

## ASI (Actionable Side Information)

ASI is structured annotation per experiment that survives reverts. When code changes are discarded, only the description and ASI remain — the only structured memory of what happened.

Record ASI for every experiment:

```json
{
  "hypothesis": "Deeper model with fewer steps should compress better",
  "arch_change": "DEPTH 8→12, DEVICE_BATCH_SIZE 128→64",
  "result": "val_bpb improved 0.998→0.992, but 2x VRAM",
  "next_action_hint": "Try intermediate DEPTH=10 for better VRAM tradeoff"
}
```

## Resuming After Context Reset

If `autoresearch.jsonl` and `autoresearch.md` exist in the working directory:

1. Read `autoresearch.md` for full context (goal, metrics, files, constraints, learnings)
2. Read `autoresearch.jsonl` to see all past experiments, current best, and ASI annotations
3. Check the git log to verify the current branch state matches the expected state
4. If the git state is dirty (unclean shutdown), revert uncommitted changes
5. Resume the loop from where it left off — no re-setup needed
6. Resume immediately — do not ask "should I continue?"
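Step 4 can be sketched in a throwaway repo. The paths and file contents below are illustrative; in a real session the dirty file would be `train.py` with uncommitted experiment edits:

```shell
# Set up a demo repo with one committed file, then dirty it.
cd "$(mktemp -d)" && git init -q . && git config user.email ci@local && git config user.name ci
echo "DEPTH = 8" > train.py && git add train.py && git commit -qm baseline
echo "DEPTH = 12" >> train.py                  # simulate an unclean shutdown
if [ -n "$(git status --porcelain)" ]; then    # dirty tree?
    git checkout -- train.py                   # drop the uncommitted changes
fi
git status --porcelain | wc -l                 # 0: clean again
```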

## Confidence Scoring

After 3+ experiments, assess whether improvements are real or noise:

- Compute the median absolute deviation (MAD) of all metric values as a noise floor
- Confidence = |best improvement| / MAD
- ≥2.0× → likely real improvement
- 1.0–2.0× → marginal, could be noise
- <1.0× → within noise floor

ML training with fixed seeds is mostly deterministic, so the noise floor is typically very low.
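With only POSIX tools, the MAD and confidence ratio can be computed like this. The val_bpb series is invented for illustration (1.012 baseline, 0.993 best, so the best improvement is 0.019):

```shell
vals='1.012 0.998 0.996 0.997 0.993'
# median of whitespace-separated numbers read from stdin
median() { tr ' ' '\n' | sort -n | awk '{a[NR]=$1} END{m=(NR%2)?a[(NR+1)/2]:(a[NR/2]+a[NR/2+1])/2; print m}'; }
med=$(echo $vals | median)
# MAD: median of absolute deviations from the median
mad=$(for v in $vals; do awk -v v="$v" -v m="$med" 'BEGIN{d=v-m; print (d<0)?-d:d}'; done | median)
awk -v best=0.019 -v mad="$mad" 'BEGIN{printf "confidence=%.1fx\n", best/mad}'   # |best improvement| / MAD
```

Here the 0.019 improvement is 19× the 0.001 noise floor, well above the 2.0× threshold.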

## Template Architecture

### `prepare.py` (FIXED — never modify)

- **Data download:** fetches parquet shards from HuggingFace (climbmix-400b-shuffle)
- **Tokenizer training:** BPE tokenizer (8192 vocab) using rustbpe/tiktoken
- **Dataloader:** best-fit document packing with 100% token utilization, BOS-aligned
- **Evaluation:** `evaluate_bpb()` computes bits-per-byte (vocab-size-independent metric)

Key constants: `MAX_SEQ_LEN = 2048`, `TIME_BUDGET = 300`, `EVAL_TOKENS = 40 * 524288`, `VOCAB_SIZE = 8192`

### `train.py` (MODIFIED BY AGENT — the only editable file)

- **Model:** GPT with RoPE, sliding window attention, value embeddings, Flash Attention 3
- **Optimizer:** hybrid MuonAdamW (Muon for matrices, AdamW for everything else)
- **Training:** gradient accumulation, LR schedules (warmup/flat/warmdown), fixed time budget

Editable: `ASPECT_RATIO`, `DEPTH`, `WINDOW_PATTERN`, `TOTAL_BATCH_SIZE`, learning rates, LR schedule phases, and the full model architecture.

## GPU Requirements

### Supported GPU Tiers

| Tier | GPUs | VRAM | Notes |
|------|------|------|-------|
| Consumer | GTX 1080 Ti, RTX 2080 Ti | 11GB | fp32 fallback, gradient checkpointing required |
| Consumer+ | RTX 3090, RTX 4090 | 24GB | Great for experiments |
| Enthusiast | RTX 5090 | 32GB | Excellent — larger models possible |
| Datacenter | A100, H100 | 40-80GB | Original development target |

### Consumer GPU Adaptations

For GPUs with limited VRAM (< 16GB), apply these changes to `train.py` during the first experiment:

1. **Remove the Flash Attention 3 import and dependency** — the top-level `from kernels import get_kernel` block (lines 20-24) runs unconditionally at startup and will fail on non-Hopper GPUs. Replace the entire block and the `fa3.flash_attn_func()` call in `CausalSelfAttention.forward()` with `torch.nn.functional.scaled_dot_product_attention`. Also remove `kernels` from `pyproject.toml` and run `uv sync` again.
2. **Enable gradient checkpointing** — use `torch.utils.checkpoint.checkpoint()` with `use_reentrant=False` to trade ~30% compute for ~50% VRAM savings.
3. **Auto-scale model size** — reduce `DEPTH` and `DEVICE_BATCH_SIZE` to fit the VRAM budget (see the table below).
4. **Cap evaluation steps** — scale the eval batch count by available VRAM (30-100 steps).
5. **fp32 fallback** — use fp32 instead of bf16 for Pascal GPUs (compute capability < 7.5). Change the autocast dtype and disable bf16-specific optimizations.

### VRAM Auto-Scaling Guide

| VRAM Budget | DEPTH | n_embd | Batch Size | Seq Length | ~Params |
|-------------|-------|--------|------------|------------|---------|
| 4GB | 2 | 128 | 4 | 512 | ~1M |
| 8GB | 4 | 256 | 8 | 1024 | ~5M |
| 12GB | 6 | 384 | 16 | 1024 | ~14M |
| 16GB | 8 | 512 | 32 | 2048 | ~25M |
| 24GB | 8 | 512 | 128 | 2048 | ~50M |
| 32GB | 12 | 768 | 128 | 2048 | ~85M |
| 80GB | 16 | 1024 | 128 | 2048 | ~200M |

Note: `n_embd` must be a multiple of `HEAD_DIM` (default 128). Config search: start with the largest depth that fits; if OOM, reduce `DEVICE_BATCH_SIZE` first, then `MAX_SEQ_LEN`.
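As a rough sanity check (my approximation, not from the template), the ~Params column tracks the standard transformer estimate of about 12 · DEPTH · n_embd² (attention plus MLP weights, ignoring embeddings) for most rows, which gives a quick way to size a candidate config:

```shell
# e.g. the 32GB row: DEPTH=12, n_embd=768 → roughly 85M parameters
awk -v depth=12 -v n_embd=768 'BEGIN{printf "~%.0fM params\n", 12*depth*n_embd*n_embd/1e6}'
```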

## Experiment Strategies

1. **Architecture:** layer count, attention patterns, embedding dimensions, activation functions
2. **Optimizer:** learning rates (per-parameter), schedule phases, momentum, weight decay
3. **Attention:** window sizes, sliding window configs, full vs. local attention
4. **Batch size:** trade-off between gradient quality and steps-per-budget
5. **Initialization:** weight init schemes, residual scaling parameters
6. **Advanced:** value embeddings, softcapped logits, GQA

## Metric: Bits Per Byte (BPB)

How well the model compresses text, normalized by byte count. Vocabulary-size-independent — all architectures are directly comparable. Lower is better. See `references/gpu-training-guide.md` for the formula and interpretation table.
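The full formula is in the reference guide; in essence, bpb converts the mean cross-entropy loss (nats per token) to bits and rescales by the token-to-byte ratio. A sketch with invented numbers:

```shell
# bpb = (loss / ln 2) * (tokens / bytes); the loss, token, and byte
# counts below are made up for illustration.
awk -v loss=2.45 -v tokens=524288 -v bytes=1850000 \
    'BEGIN{printf "val_bpb=%.3f\n", (loss/log(2)) * tokens/bytes}'
```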

## Benchmark Script

Use this as `autoresearch.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

uv run train.py > run.log 2>&1

val_bpb=$(grep "^val_bpb:" run.log | tail -1 | awk '{print $2}' || echo "0")
memory=$(grep "^peak_vram_mb:" run.log | tail -1 | awk '{print $2}' || echo "0")
mfu=$(grep "^mfu_percent:" run.log | tail -1 | awk '{print $2}' || echo "0")

echo "METRIC val_bpb=$val_bpb"
echo "METRIC peak_memory_mb=$memory"
echo "METRIC mfu_percent=$mfu"
```

## Session Files

| File | Purpose |
|------|---------|
| `autoresearch.md` | Living session document — goal, metrics, scope, learnings |
| `autoresearch.sh` | Benchmark script — outputs `METRIC name=value` lines |
| `autoresearch.jsonl` | Append-only experiment log with ASI (survives restarts) |

## Additional Resources

- `references/gpu-training-guide.md` — detailed GPU setup, CUDA configuration, OOM troubleshooting, BPB formula, and performance tuning
- `scripts/parse-metrics.sh` — extract METRIC lines from benchmark output
- `scripts/log-experiment.sh` — append experiment results to autoresearch.jsonl
- `assets/prepare.py` — data preparation (download, tokenizer, dataloader, evaluation)
- `assets/train.py` — model architecture and training loop
- `assets/program.md` — self-contained agent instructions for the ML loop
- `assets/pyproject.toml` — Python dependencies (PyTorch, Flash Attention, etc.)