# nanochat LLM Training

Skill by ara.so — Daily 2026 Skills collection.

nanochat is Karpathy's minimal, hackable harness for training LLMs end-to-end on a single GPU node. It covers tokenization, pretraining, SFT finetuning, RL, evaluation (DCLM CORE score), inference with a KV cache, and a ChatGPT-like web UI. A single complexity dial (--depth) auto-configures all other hyperparameters (width, heads, LR, training horizon, weight decay) for compute-optimal training. You can reproduce GPT-2 capability (~$43,000 in 2019) for ~$48 on an 8×H100 node (~2 hours).
## Installation

nanochat uses uv for dependency management:

```bash
git clone https://github.com/karpathy/nanochat.git
cd nanochat

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create venv and install deps
uv sync
source .venv/bin/activate
```
## Key Commands

### Full GPT-2 Speedrun (8×H100 node, ~2–3 hours, ~$48)

```bash
# Run the reference pipeline: data download, pretraining, SFT, eval, chat
bash runs/speedrun.sh
```
### Pretraining (distributed)

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
  --depth=26 \
  --run="d26_run" \
  --model-tag="d26"
```

### Pretraining (single GPU)

```bash
python -m scripts.base_train -- \
  --depth=26 \
  --run="d26_single"
```

### Quick Research Iteration (~5 min, GPT-1 scale)
```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
  --depth=12 \
  --run="d12_exp" \
  --model-tag="d12" \
  --core-metric-every=999999 \
  --sample-every=-1 \
  --save-every=-1
```

### CPU / Apple Silicon (tiny model, ~minutes)

```bash
bash runs/runcpu.sh
```

### Serve Chat UI
```bash
# After training completes
source .venv/bin/activate
python -m scripts.chat_web
# Visit http://<your-server-ip>:8000/
```

### CLI Chat
```bash
python -m scripts.chat_cli -p "hello"
```

### Scaling Laws / Miniseries

```bash
bash runs/scaling_laws.sh   # sweep depths for scaling law data
bash runs/miniseries.sh     # train full compute-optimal miniseries
```

## The Depth Dial

--depth is the single most important parameter. Everything else is derived automatically:
| --depth | Approximate model scale | Notes |
|---|---|---|
| 6–8 | Tiny (toy) | CPU/MPS feasible |
| 12 | GPT-1 size | ~5 min on 8×H100, great for research iteration |
| 16 | Medium | ~15 min on 8×H100 |
| 24–26 | GPT-2 size | ~2 hrs on 8×H100, ~$48 |

```bash
# Smaller/faster experiments
python -m scripts.base_train -- --depth=12 --run="quick_test"

# Full GPT-2 grade
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --run="gpt2_repro"
```
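To illustrate the "single dial" idea, a config function can derive every other hyperparameter from depth alone. The formulas and names below are hypothetical stand-ins, not nanochat's actual derivation rules:

```python
# Hypothetical sketch of deriving a config from a single depth dial.
# Formulas and names are illustrative, NOT nanochat's actual rules.
def derive_config(depth: int) -> dict:
    width = depth * 64                        # width grows with depth
    return {
        "depth": depth,
        "width": width,
        "n_heads": max(1, width // 128),      # assume a fixed head dim of 128
        "lr": 0.02 * (width / 768) ** -0.5,   # scale LR down with width
    }

cfg = derive_config(26)  # GPT-2-scale run
```

The point is that one integer pins down a consistent, compute-optimal family of models, which is what makes depth sweeps (scaling_laws.sh) meaningful.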
## Precision / dtype Configuration
nanochat uses explicit dtype management via COMPUTE_DTYPE in nanochat/common.py. No torch.amp.autocast.

| Hardware | Default | Override |
|---|---|---|
| CUDA SM 80+ (A100, H100) | bfloat16 | NANOCHAT_DTYPE=float16 / float32 |
| CUDA SM < 80 (V100, T4) | float32 | NANOCHAT_DTYPE=float16 |
| CPU / MPS | float32 | — |

```bash
# Force fp32 for inference
NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"

# Force bf16 for training
NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train

# float16 training (enables GradScaler automatically)
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train
```

**How it works:** Weights are stored in fp32 (optimizer precision), a custom `Linear` casts to `COMPUTE_DTYPE` in the forward pass, and embeddings are stored directly in `COMPUTE_DTYPE` to save memory.
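A minimal sketch of that pattern (fp32 master weights, casting inside forward), with a stand-in COMPUTE_DTYPE; this is not nanochat's actual `Linear` implementation:

```python
import torch
import torch.nn.functional as F

COMPUTE_DTYPE = torch.bfloat16  # stand-in; nanochat sets this in nanochat/common.py

class CastLinear(torch.nn.Linear):
    """Weights stay fp32 for the optimizer; forward computes in COMPUTE_DTYPE."""
    def forward(self, x):
        return F.linear(
            x.to(COMPUTE_DTYPE),
            self.weight.to(COMPUTE_DTYPE),
            None if self.bias is None else self.bias.to(COMPUTE_DTYPE),
        )

layer = CastLinear(8, 4)
out = layer(torch.randn(2, 8))
assert layer.weight.dtype == torch.float32  # master weights untouched
assert out.dtype == torch.bfloat16          # activations in compute dtype
```

Keeping master weights in fp32 preserves optimizer precision while the matmuls run at the cheaper compute dtype.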
## Key Python Modules
```
nanochat/
├── gpt.py                  # GPT nn.Module Transformer
├── engine.py               # Efficient inference with KV cache
├── dataloader.py           # Tokenizing distributed data loader
├── dataset.py              # Download/read utils for pretraining data
├── optim.py                # AdamW + Muon optimizer (single-GPU and distributed)
├── core_eval.py            # DCLM CORE score evaluation
├── loss_eval.py            # Bits-per-byte evaluation
├── checkpoint_manager.py   # Save/load checkpoints
├── common.py               # Utilities, COMPUTE_DTYPE
└── execution.py            # Python code execution tool for the LLM
scripts/
├── base_train.py           # Pretraining entry point
├── chat_web.py             # Web chat UI server
└── chat_cli.py             # CLI chat interface
runs/
├── speedrun.sh             # Reference full pipeline (GPT-2 speedrun)
├── scaling_laws.sh         # Scaling law sweeps
├── miniseries.sh           # Full compute-optimal miniseries
└── runcpu.sh               # CPU/MPS example
```

## Real Code Examples
### Load and Run Inference on a Trained Model

```python
import torch
from nanochat.gpt import GPT
from nanochat.engine import InferenceEngine
from nanochat.checkpoint_manager import CheckpointManager

# Load checkpoint
ckpt_manager = CheckpointManager("checkpoints/d26")
model, config = ckpt_manager.load()
model.eval()

# Run inference with KV cache
engine = InferenceEngine(model)
output = engine.generate(
    prompt="Once upon a time",
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
)
print(output)
```
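To see what the engine's KV cache buys you (independent of nanochat's actual engine.py internals), here is a toy single-head decode loop that appends each new token's key/value instead of recomputing attention inputs from scratch:

```python
import torch

def attend(q, K, V):
    # Single-head scaled dot-product attention, no projections (toy version)
    scores = q @ K.T / K.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ V

dim = 16
K_cache = torch.empty(0, dim)
V_cache = torch.empty(0, dim)
for step in range(5):  # decode 5 tokens autoregressively
    k, v, q = torch.randn(3, dim).chunk(3)  # new token's key/value/query
    K_cache = torch.cat([K_cache, k])       # append, don't recompute
    V_cache = torch.cat([V_cache, v])
    out = attend(q, K_cache, V_cache)       # one query vs. all cached keys
assert K_cache.shape == (5, dim)
assert out.shape == (1, dim)
```

Each decode step is linear in the sequence so far, rather than re-running attention over the whole prefix.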
### Custom Training Script with Depth Dial
```python
import os
import subprocess

def train_model(depth: int, run_name: str, nproc: int = 8):
    """Launch a compute-optimal training run for the given depth."""
    cmd = [
        "torchrun",
        "--standalone",
        f"--nproc_per_node={nproc}",
        "-m", "scripts.base_train",
        "--",
        f"--depth={depth}",
        f"--run={run_name}",
        f"--model-tag={run_name}",
    ]
    subprocess.run(cmd, env={"OMP_NUM_THREADS": "1", **os.environ})

# Quick research iteration
train_model(depth=12, run_name="my_experiment_d12")

# Full GPT-2 grade
train_model(depth=26, run_name="my_gpt2_repro")
```

### Adjust Device Batch Size for Lower VRAM
```bash
# Default device_batch_size=32 needs ~80GB VRAM per GPU
# Reduce for smaller GPUs (gradient accumulation handles the rest)
torchrun --standalone --nproc_per_node=4 -m scripts.base_train -- \
  --depth=12 \
  --device_batch_size=16 \
  --run="low_vram_run"

# Even smaller
python -m scripts.base_train -- \
  --depth=8 \
  --device_batch_size=4 \
  --run="single_gpu_small"
```
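The gradient-accumulation trade that keeps the effective batch size constant can be sketched as follows; the variable names are illustrative, not nanochat's:

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

target_batch, device_batch = 32, 8
accum_steps = target_batch // device_batch   # 4 micro-batches per optimizer step

opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(device_batch, 10)
    loss = model(x).pow(2).mean() / accum_steps  # scale so grads average correctly
    loss.backward()                              # grads accumulate in .grad
opt.step()                                       # one update at the full effective batch
```

Halving device_batch_size doubles accum_steps, so the optimizer sees the same effective batch at the cost of wall-clock time.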
### Monitoring Key Metrics in wandb
nanochat logs to wandb automatically. Key metrics to watch:

- val_bpb: validation loss in bits-per-byte (vocab-size-invariant), tracked as a function of step, total_training_time, and total_training_flops
- core_metric: DCLM CORE score (target > 0.2565 to beat GPT-2)
- train/mfu: model FLOPS utilization
- train/tok_per_sec: training throughput

```python
# Set the wandb project via env var before training
import os
os.environ["WANDB_PROJECT"] = "my-nanochat-runs"
```
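Since val_bpb is the headline metric, here is how a bits-per-byte number relates to ordinary per-token cross-entropy, as a small illustrative conversion (not nanochat's loss_eval.py code):

```python
import math

def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) to bits-per-byte.

    Normalizing by bytes rather than tokens makes the number comparable
    across tokenizers with different vocabulary sizes.
    """
    bits_per_token = mean_loss_nats / math.log(2)   # nats -> bits
    return bits_per_token * total_tokens / total_bytes

# e.g. 3.0 nats/token with ~4 bytes per token on average
bpb = bits_per_byte(3.0, total_tokens=1000, total_bytes=4000)
```

A tokenizer with a bigger vocabulary packs more bytes per token, so raw per-token loss is not comparable across runs, while bpb is.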
### Synthetic Data for SFT Personality
```bash
# dev/gen_synthetic_data.py — generate identity/personality data
# Then mix it into the SFT stage per the guide

# Example: generate data and point SFT at it
python dev/gen_synthetic_data.py --output data/identity_sft.jsonl
# Then reference it in your SFT script configuration
```

## Common Patterns

### Research Iteration Loop
1. Make a code change in nanochat/
2. Run a quick d12 training to validate:

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
  --depth=12 --run="test_my_change" \
  --core-metric-every=999999 --sample-every=-1 --save-every=-1
```

3. Check wandb: val_bpb vs step/time/flops
4. If promising, test at d16 or d26

### FP8 Training (H100 only, for speedrun)
FP8 is used in the speedrun for an additional speedup. See runs/speedrun.sh for the exact invocation.

```bash
bash runs/speedrun.sh
```

### Evaluate CORE Score Only
```bash
python -m nanochat.core_eval --checkpoint checkpoints/d26/latest
```

### Serve on Lambda / Remote Machine
```bash
# On the remote machine, after training:
source .venv/bin/activate
python -m scripts.chat_web
# Access via: http://<PUBLIC_IP>:8000/

# Use screen or tmux to keep the server alive
screen -S nanochat
python -m scripts.chat_web
# Ctrl+A, D to detach
```

## Troubleshooting

### OOM / Out of VRAM
Reduce --device_batch_size (default 32). The code uses gradient accumulation to maintain the effective batch size.

```bash
--device_batch_size=16  # Try 16, 8, 4, 2, 1
```

### Single GPU is 8× Slower
This is expected. Omit torchrun and run `python -m scripts.base_train` directly; gradient accumulation kicks in automatically to maintain the equivalent total batch size.

### Running on Non-CUDA Hardware

MPS (Apple Silicon) or CPU — use runs/runcpu.sh as a template. Results will be weak; this is for development/debugging only.

```bash
bash runs/runcpu.sh
```

### float16 Gradient Underflow
nanochat auto-enables GradScaler when NANOCHAT_DTYPE=float16.

```bash
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12
```

Note: RL scripts do NOT support float16 (SFT and base_train do).

### V100 / T4 (SM < 80) — No bf16
The default falls back to float32; optionally use float16.

```bash
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12
```

### Chat UI Not Accessible
Ensure the port (default 8000) is open in your cloud provider's firewall/security group, and use the public IP, not localhost:

http://<PUBLIC_IP>:8000/

## Resources
- DeepWiki Q&A: https://deepwiki.com/karpathy/nanochat
- Discussions: https://github.com/karpathy/nanochat/discussions
- Discord: #nanochat channel on Karpathy's Discord
- Leaderboard docs: dev/LEADERBOARD.md
- Beating GPT-2 guide: https://github.com/karpathy/nanochat/discussions/481
- Miniseries v1: https://github.com/karpathy/nanochat/discussions/420
- Adding abilities guide: https://github.com/karpathy/nanochat/discussions/164