nanochat LLM Training


Skill by ara.so — Daily 2026 Skills collection.

nanochat is Karpathy's minimal, hackable harness for training LLMs end-to-end on a single GPU node. It covers tokenization, pretraining, SFT finetuning, RL, evaluation (DCLM CORE score), inference with a KV cache, and a ChatGPT-like web UI. A single complexity dial (`--depth`) auto-configures all other hyperparameters (width, heads, LR, training horizon, weight decay) for compute-optimal training. You can reproduce GPT-2-level capability (~$43,000 to train in 2019) for ~$48 on an 8×H100 node (~2 hours).
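As a sanity check on the ~$48 figure, here is the back-of-envelope arithmetic; the per-GPU-hour rate is an assumption for illustration, not a quoted price:

```python
# Back-of-envelope for the ~$48 / ~2 hours claim above.
# Assumes roughly $3 per H100-hour, a typical cloud ballpark (assumption).
gpus = 8
hours = 2
usd_per_gpu_hour = 3.0  # assumed rate

cost = gpus * hours * usd_per_gpu_hour
print(cost)  # 48.0
```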

Installation


nanochat uses `uv` for dependency management:

```bash
git clone https://github.com/karpathy/nanochat.git
cd nanochat

# Install uv if needed

# Create venv and install deps
uv sync
source .venv/bin/activate
```

Key Commands


Full GPT-2 Speedrun (8×H100 node, ~2–3 hours, ~$48)


```bash
# Run the reference pipeline: data download, pretraining, SFT, eval, chat
bash runs/speedrun.sh
```

Pretraining (distributed)


```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=26 \
    --run="d26_run" \
    --model-tag="d26"
```

Pretraining (single GPU)


```bash
python -m scripts.base_train -- \
    --depth=26 \
    --run="d26_single"
```

Quick Research Iteration (~5 min, GPT-1 scale)


```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 \
    --run="d12_exp" \
    --model-tag="d12" \
    --core-metric-every=999999 \
    --sample-every=-1 \
    --save-every=-1
```

CPU / Apple Silicon (tiny model, ~minutes)


```bash
bash runs/runcpu.sh
```

Serve Chat UI


```bash
# After training completes
source .venv/bin/activate
python -m scripts.chat_web
```

Visit http://<your-server-ip>:8000/


CLI Chat


```bash
python -m scripts.chat_cli -p "hello"
```

Scaling Laws / Miniseries


```bash
bash runs/scaling_laws.sh   # sweep depths for scaling law data
bash runs/miniseries.sh     # train full compute-optimal miniseries
```

The Depth Dial


The single most important parameter. Everything else is derived automatically:

| `--depth` | Approximate model scale | Notes |
|---|---|---|
| 6–8 | Tiny (toy) | CPU/MPS feasible |
| 12 | GPT-1 size | ~5 min on 8×H100, great for research iteration |
| 16 | Medium | ~15 min on 8×H100 |
| 24–26 | GPT-2 size | ~2 hrs on 8×H100, ~$48 |

```bash
# Smaller/faster experiments
python -m scripts.base_train -- --depth=12 --run="quick_test"

# Full GPT-2 grade
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --run="gpt2_repro"
```
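To make the single-dial idea concrete, here is a hypothetical sketch of deriving width and head count from depth. The aspect ratio of 64 and head size of 64 are illustrative assumptions, not nanochat's actual constants; the real rules (including LR, horizon, and weight decay) live in the training scripts:

```python
# Hypothetical depth-dial sketch (assumed constants, for intuition only).
def derive_config(depth: int, aspect_ratio: int = 64, head_dim: int = 64) -> dict:
    model_dim = depth * aspect_ratio   # width grows linearly with depth
    num_heads = model_dim // head_dim  # keep a fixed per-head dimension
    return {"depth": depth, "model_dim": model_dim, "num_heads": num_heads}

print(derive_config(12))  # {'depth': 12, 'model_dim': 768, 'num_heads': 12}
```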

Precision / dtype Configuration


nanochat uses explicit dtype management via `COMPUTE_DTYPE` in `nanochat/common.py`. No `torch.amp.autocast`.

| Hardware | Default | Override |
|---|---|---|
| CUDA SM 80+ (A100, H100) | `bfloat16` | `NANOCHAT_DTYPE=float32` |
| CUDA SM < 80 (V100, T4) | `float32` | `NANOCHAT_DTYPE=float16` |
| CPU / MPS | `float32` | |

```bash
# Force fp32 for inference
NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"

# Force bf16 for training
NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train

# float16 training (enables GradScaler automatically)
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train
```

**How it works:** Weights are stored in fp32 (optimizer precision), a custom `Linear` casts to `COMPUTE_DTYPE` in the forward pass, and embeddings are stored directly in `COMPUTE_DTYPE` to save memory.
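A toy numeric picture of why the master copy stays in fp32: tiny updates that round away in low precision still accumulate in a high-precision master weight. This is a pure-Python stand-in using fp16 rounding, not nanochat code:

```python
import struct

# Round-trip a float through IEEE half precision to mimic a low-precision cast.
def as_fp16(x: float) -> float:
    return struct.unpack("e", struct.pack("e", x))[0]

master = 1.0
low = 1.0
for _ in range(1000):
    master += 1e-4              # high-precision master accumulates updates
    low = as_fp16(low + 1e-4)   # update rounds away: 1e-4 < fp16 spacing at 1.0

print(round(master, 4), low)  # 1.1 1.0
```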

Key Python Modules


```
nanochat/
├── gpt.py              # GPT nn.Module Transformer
├── engine.py           # Efficient inference with KV cache
├── dataloader.py       # Tokenizing distributed data loader
├── dataset.py          # Download/read utils for pretraining data
├── optim.py            # AdamW + Muon optimizer (single-GPU and distributed)
├── core_eval.py        # DCLM CORE score evaluation
├── loss_eval.py        # Bits-per-byte evaluation
├── checkpoint_manager.py  # Save/load checkpoints
├── common.py           # Utilities, COMPUTE_DTYPE
└── execution.py        # Python code execution tool for the LLM

scripts/
├── base_train.py       # Pretraining entry point
├── chat_web.py         # Web chat UI server
└── chat_cli.py         # CLI chat interface

runs/
├── speedrun.sh         # Reference full pipeline (GPT-2 speedrun)
├── scaling_laws.sh     # Scaling law sweeps
├── miniseries.sh       # Full compute-optimal miniseries
└── runcpu.sh           # CPU/MPS example
```

Real Code Examples


Load and Run Inference on a Trained Model


```python
import torch
from nanochat.gpt import GPT
from nanochat.engine import InferenceEngine
from nanochat.checkpoint_manager import CheckpointManager

# Load checkpoint
ckpt_manager = CheckpointManager("checkpoints/d26")
model, config = ckpt_manager.load()
model.eval()

# Run inference with KV cache
engine = InferenceEngine(model)
output = engine.generate(
    prompt="Once upon a time",
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
)
print(output)
```

Custom Training Script with Depth Dial


```python
import os
import subprocess

def train_model(depth: int, run_name: str, nproc: int = 8):
    """Launch a compute-optimal training run for a given depth."""
    cmd = [
        "torchrun",
        "--standalone",
        f"--nproc_per_node={nproc}",
        "-m", "scripts.base_train",
        "--",
        f"--depth={depth}",
        f"--run={run_name}",
        f"--model-tag={run_name}",
    ]
    subprocess.run(cmd, env={"OMP_NUM_THREADS": "1", **os.environ})

# Quick research iteration
train_model(depth=12, run_name="my_experiment_d12")

# Full GPT-2 grade
train_model(depth=26, run_name="my_gpt2_repro")
```

Adjust Device Batch Size for Lower VRAM


```bash
# Default device_batch_size=32 needs ~80GB VRAM per GPU
# Reduce for smaller GPUs (gradient accumulation handles the rest)
torchrun --standalone --nproc_per_node=4 -m scripts.base_train -- \
    --depth=12 \
    --device_batch_size=16 \
    --run="low_vram_run"

# Even smaller
python -m scripts.base_train -- \
    --depth=8 \
    --device_batch_size=4 \
    --run="single_gpu_small"
```

Monitoring Key Metrics in wandb

nanochat logs to wandb automatically. Key metrics to watch:

- `val_bpb`: validation loss in bits-per-byte (vocab-size-invariant), as a function of step, total training time, and total training FLOPS
- `core_metric`: DCLM CORE score (target > 0.2565 to beat GPT-2)
- `train/mfu`: model FLOPS utilization
- `train/tok_per_sec`: training throughput

```python
# Set wandb project via env var before training
import os
os.environ["WANDB_PROJECT"] = "my-nanochat-runs"
```
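For intuition on a bits-per-byte metric like val_bpb, the conversion from mean cross-entropy (nats per token) can be sketched as follows. This is an illustrative formula, not nanochat's exact accounting:

```python
import math

# Total nats over the split, converted to bits (divide by ln 2) and
# normalized by the raw byte count, making the metric vocab-size-invariant.
def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    total_nats = mean_loss_nats * total_tokens
    return total_nats / (math.log(2) * total_bytes)

# e.g. 3.0 nats/token with ~4.3 bytes per token
print(round(bits_per_byte(3.0, total_tokens=1000, total_bytes=4300), 4))  # 1.0065
```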

Synthetic Data for SFT Personality

`dev/gen_synthetic_data.py` — generate identity/personality data, then mix it into the SFT stage per the guide:

```bash
# Example: generate data and point SFT to it
python dev/gen_synthetic_data.py --output data/identity_sft.jsonl
# Then reference it in your SFT script configuration
```
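A minimal sketch of reading such a JSONL file and mixing it into an SFT example list. The record layout and the mixing step are assumptions for illustration; nanochat's SFT configuration handles this its own way:

```python
import json
import os
import tempfile

# Read a JSONL file of chat examples (one JSON object per line).
def load_jsonl(path: str) -> list:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a stand-in file playing the role of data/identity_sft.jsonl.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write(json.dumps({"user": "Who are you?", "assistant": "I am nanochat."}) + "\n")
    path = f.name

base_examples = [{"user": "What is 2+2?", "assistant": "4"}]
mixed = base_examples + load_jsonl(path)
os.unlink(path)
print(len(mixed))  # 2
```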

Common Patterns

Research Iteration Loop

1. Make a code change in nanochat/
2. Run a quick d12 to validate:

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 --run="test_my_change" \
    --core-metric-every=999999 --sample-every=-1 --save-every=-1
```

3. Check wandb: val_bpb vs step/time/flops
4. If promising, test at d16 or d26

FP8 Training (H100 only, for speedrun)

FP8 is used in the speedrun for an additional speedup. See runs/speedrun.sh for the exact invocation:

```bash
bash runs/speedrun.sh
```

Evaluate CORE Score Only

```bash
python -m nanochat.core_eval --checkpoint checkpoints/d26/latest
```

Serve on Lambda / Remote Machine

On the remote machine after training:

```bash
source .venv/bin/activate
python -m scripts.chat_web
```

Access via: http://<PUBLIC_IP>:8000/

Use `screen` or `tmux` to keep the server alive:

```bash
screen -S nanochat python -m scripts.chat_web
# Ctrl+A, D to detach
```


Troubleshooting


OOM / Out of VRAM

Reduce `--device_batch_size` (default 32); the code uses gradient accumulation to maintain the effective batch size:

```bash
--device_batch_size=16   # Try 16, 8, 4, 2, 1
```
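The arithmetic behind maintaining the effective batch size via gradient accumulation can be sketched as follows; the effective batch of 512 sequences is an assumed figure for illustration:

```python
# Accumulation steps needed so that
# device_batch * world_size * accum == total effective batch.
def accum_steps(total_batch: int, device_batch: int, world_size: int) -> int:
    per_step = device_batch * world_size
    assert total_batch % per_step == 0, "batch sizes must divide evenly"
    return total_batch // per_step

print(accum_steps(512, device_batch=32, world_size=8))  # 2
print(accum_steps(512, device_batch=16, world_size=8))  # 4
print(accum_steps(512, device_batch=16, world_size=1))  # 32
```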

Single GPU is 8× Slower

This is expected. Omit `torchrun` and use `python -m scripts.base_train` directly; gradient accumulation kicks in automatically to maintain the equivalent total batch size.

Running on Non-CUDA Hardware

MPS (Apple Silicon) or CPU — use runcpu.sh as a template:

```bash
bash runs/runcpu.sh
```

Results will be weak; this is for development/debugging only


float16 Gradient Underflow

nanochat auto-enables GradScaler when NANOCHAT_DTYPE=float16:

```bash
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12
```

Note: RL scripts do NOT support float16 (SFT and base_train do)
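What the auto-enabled GradScaler does, in toy form: scale the loss so fp16 gradients stay above the underflow threshold, then unscale before the optimizer step. This is a pure-Python sketch; the real mechanism is PyTorch's `torch.amp.GradScaler`:

```python
# Toy loss-scaling sketch. Gradients of (loss * scale) are (grad * scale),
# so values that would flush to zero in fp16 survive; divide back before
# the optimizer update.
class ToyScaler:
    def __init__(self, scale: float = 2.0 ** 16):
        self.scale = scale

    def scale_loss(self, loss: float) -> float:
        return loss * self.scale

    def unscale_grad(self, grad: float) -> float:
        return grad / self.scale

scaler = ToyScaler()
tiny_grad = 1e-8                     # below fp16's smallest subnormal (~6e-8)
scaled = tiny_grad * scaler.scale    # gradient of the scaled loss
print(scaler.unscale_grad(scaled) == tiny_grad)  # True
```

Power-of-two scale factors make the round trip exact, which is why real scalers use them.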


V100 / T4 (SM < 80) — No bf16

The default falls back to float32; optionally use float16:

```bash
NANOCHAT_DTYPE=float16 torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=12
```

Chat UI Not Accessible


Ensure the port (default 8000) is open in your cloud provider's firewall/security group


Use the public IP, not localhost:

http://<PUBLIC_IP>:8000/


Resources