TorchTitan - PyTorch Native Distributed LLM Pretraining
Quick start
TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.
**Installation**:
```bash
# From PyPI (stable)
pip install torchtitan

# From source (latest features, requires PyTorch nightly)
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
```
**Download tokenizer**:
```bash
# Get HF token from https://huggingface.co/settings/tokens
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
```
**Start training on 8 GPUs**:
```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
```

Common workflows
Workflow 1: Pretrain Llama 3.1 8B on single node
Copy this checklist:
Single Node Pretraining:
- [ ] Step 1: Download tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint

**Step 1: Download tokenizer**
```bash
python scripts/download_hf_assets.py \
  --repo_id meta-llama/Llama-3.1-8B \
  --assets tokenizer \
  --hf_token=YOUR_HF_TOKEN
```

**Step 2: Configure training**
Edit or create a TOML config file:
```toml
# llama3_8b_custom.toml
[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"

[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"

[optimizer]
name = "AdamW"
lr = 3e-4

[lr_scheduler]
warmup_steps = 200

[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0
steps = 1000
dataset = "c4"

[parallelism]
data_parallel_shard_degree = -1  # Use all GPUs for FSDP

[activation_checkpoint]
mode = "selective"
selective_ac_option = "op"

[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
```
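A quick way to sanity-check what this config will train: the token budget per optimizer step is the local batch times the sequence length times the data-parallel degree. A minimal sketch of that arithmetic (not part of TorchTitan), assuming a single 8-GPU node so `data_parallel_shard_degree = -1` resolves to 8:

```python
# Sketch only: estimates the token budget implied by the config above.
# Assumes a single 8-GPU node; with data_parallel_shard_degree = -1 the
# effective data-parallel degree is the full world size of 8.
num_gpus = 8            # assumption: one node with 8 GPUs
local_batch_size = 2    # [training].local_batch_size
seq_len = 8192          # [training].seq_len
steps = 1000            # [training].steps

tokens_per_step = num_gpus * local_batch_size * seq_len
total_tokens = tokens_per_step * steps
print(f"{tokens_per_step:,} tokens/step, {total_tokens:,} tokens total")
# 131,072 tokens/step, 131,072,000 tokens total
```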
**Step 3: Launch training**
```bash
# 8 GPUs on single node
CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh

# Or explicitly with torchrun
torchrun --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_8b_custom.toml
```
**Step 4: Monitor and checkpoint**
TensorBoard logs are saved to `./outputs/tb/`:
```bash
tensorboard --logdir ./outputs/tb
```

Workflow 2: Multi-node training with SLURM
Multi-Node Training:
- [ ] Step 1: Configure parallelism for scale
- [ ] Step 2: Set up SLURM script
- [ ] Step 3: Submit job
- [ ] Step 4: Resume from checkpoint

**Step 1: Configure parallelism for scale**
For a 70B model on 256 GPUs (32 nodes):
```toml
[parallelism]
data_parallel_shard_degree = 32   # FSDP across 32 ranks
tensor_parallel_degree = 8        # TP within node
pipeline_parallel_degree = 1      # No PP for 70B
context_parallel_degree = 1       # Increase for long sequences
```
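The configured degrees have to multiply out to the world size you launch with; dimensions left at 1 do not change the product. A quick standalone check (illustrative, not a TorchTitan utility):

```python
import math

# Sanity check: the product of parallelism degrees must equal the world size.
degrees = {"dp_shard": 32, "tp": 8, "pp": 1, "cp": 1}  # from [parallelism] above
nodes, gpus_per_node = 32, 8

world_size = nodes * gpus_per_node
product = math.prod(degrees.values())
assert product == world_size, f"{product} ranks configured, {world_size} launched"
print(f"OK: {product} == {world_size}")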
**Step 2: Set up SLURM script**
```bash
#!/bin/bash
#SBATCH --job-name=llama70b
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

srun torchrun \
  --nnodes=32 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  -m torchtitan.train \
  --job.config_file ./llama3_70b.toml
```

**Step 3: Submit job**
```bash
sbatch multinode_trainer.slurm
```

**Step 4: Resume from checkpoint**

Training auto-resumes if a checkpoint exists in the configured folder.
Workflow 3: Enable Float8 training for H100s
Float8 provides a 30-50% speedup on H100 GPUs.
Float8 Training:
- [ ] Step 1: Install torchao
- [ ] Step 2: Configure Float8
- [ ] Step 3: Launch with compile

**Step 1: Install torchao**
```bash
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
```

**Step 2: Configure Float8**
Add to your TOML config:
toml
[model]
converters = ["quantize.linear.float8"]
[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output"] # Exclude output layer
[compile]
enable = true
components = ["model", "loss"]Step 3: Launch with compile
**Step 3: Launch with compile**
```bash
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
  --model.converters="quantize.linear.float8" \
  --quantize.linear.float8.enable_fsdp_float8_all_gather \
  --compile.enable
```

Workflow 4: 4D parallelism for 405B models
4D Parallelism (FSDP + TP + PP + CP):
- [ ] Step 1: Create seed checkpoint
- [ ] Step 2: Configure 4D parallelism
- [ ] Step 3: Launch on 512 GPUs

**Step 1: Create seed checkpoint**
Required for consistent initialization across PP stages:
```bash
NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
  --checkpoint.enable \
  --checkpoint.create_seed_checkpoint \
  --parallelism.data_parallel_shard_degree 1 \
  --parallelism.tensor_parallel_degree 1 \
  --parallelism.pipeline_parallel_degree 1
```

**Step 2: Configure 4D parallelism**
```toml
[parallelism]
data_parallel_shard_degree = 8   # FSDP
tensor_parallel_degree = 8       # TP within node
pipeline_parallel_degree = 8     # PP across nodes
context_parallel_degree = 1      # CP for long sequences

[training]
local_batch_size = 32
seq_len = 8192
```
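As with any layout, the degrees should multiply out to the number of GPUs being launched; here 8 × 8 × 8 × 1 = 512, matching 64 nodes of 8. A tiny illustrative check (only the data-parallel dimension multiplies the batch):

```python
import math

# 4D degrees from [parallelism] above: FSDP x TP x PP x CP must cover 64 x 8 GPUs.
assert math.prod([8, 8, 8, 1]) == 64 * 8 == 512

# Tokens per optimizer step: only the data-parallel dimension sees distinct data.
print(8 * 32 * 8192)  # 2,097,152 tokens/step
```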
**Step 3: Launch on 512 GPUs**
```bash
# 64 nodes x 8 GPUs = 512 GPUs
srun torchrun --nnodes=64 --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_405b.toml
```

When to use vs alternatives
Use TorchTitan when:
- Pretraining LLMs from scratch (8B to 405B+)
- Need PyTorch-native solution without third-party dependencies
- Require composable 4D parallelism (FSDP2, TP, PP, CP)
- Training on H100s with Float8 support
- Want interoperable checkpoints with torchtune/HuggingFace
Use alternatives instead:
- Megatron-LM: Maximum performance for NVIDIA-only deployments
- DeepSpeed: Broader ZeRO optimization ecosystem, inference support
- Axolotl/TRL: Fine-tuning rather than pretraining
- LitGPT: Educational, smaller-scale training
Common issues
**Issue: Out of memory on large models**

Enable activation checkpointing and reduce batch size:
```toml
[activation_checkpoint]
mode = "full"  # Instead of "selective"

[training]
local_batch_size = 1
```

Or use gradient accumulation:
```toml
[training]
local_batch_size = 1
global_batch_size = 32  # Accumulates gradients
```
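With both fields set, the trainer reaches the global batch by accumulating microbatches on each data-parallel rank. A sketch of the arithmetic (assuming 8 FSDP ranks; the exact bookkeeping is internal to TorchTitan):

```python
# Sketch: how many microbatches each rank accumulates before an optimizer step.
dp_degree = 8            # assumption: 8 data-parallel (FSDP) ranks
local_batch_size = 1     # [training].local_batch_size
global_batch_size = 32   # [training].global_batch_size

accum_steps, rem = divmod(global_batch_size, local_batch_size * dp_degree)
assert rem == 0, "global batch must be divisible by local_batch_size * dp_degree"
print(accum_steps)  # 4 microbatches per rank per optimizer step
```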
**Issue: TP causes high memory with async collectives**

Set the environment variable:
```bash
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
```

**Issue: Float8 training not faster**
Float8 only benefits large GEMMs. Filter small layers:
```toml
[quantize.linear.float8]
filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
```

**Issue: Checkpoint loading fails after parallelism change**

Use DCP's resharding capability:
```bash
# Convert sharded checkpoint to single file
python -m torch.distributed.checkpoint.format_utils dcp_to_torch checkpoint/step-1000 checkpoint.pt
```
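Recent PyTorch releases also expose this conversion as a Python API; a short sketch, assuming `dcp_to_torch_save` is available in your build's `torch.distributed.checkpoint.format_utils`:

```python
# Convert a sharded DCP checkpoint to a single torch.save file, then load it.
import torch
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

dcp_to_torch_save("checkpoint/step-1000", "checkpoint.pt")
state_dict = torch.load("checkpoint.pt", map_location="cpu")
print(len(state_dict))  # number of top-level entries in the checkpoint
```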
**Issue: Pipeline parallelism initialization**
Create a seed checkpoint first (see Workflow 4, Step 1).

Supported models
| Model | Sizes | Status |
|---|---|---|
| Llama 3.1 | 8B, 70B, 405B | Production |
| Llama 4 | Various | Experimental |
| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
| GPT-OSS | 20B, 120B (MoE) | Experimental |
| Qwen 3 | Various | Experimental |
| Flux | Diffusion | Experimental |
Performance benchmarks (H100)
| Model | GPUs | Parallelism | TPS/GPU | Techniques |
|---|---|---|---|---|
| Llama 8B | 8 | FSDP | 5,762 | Baseline |
| Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
| Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
| Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |
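For capacity planning it helps to turn per-GPU throughput into aggregate numbers; illustrative arithmetic over the rows above (wall-clock estimates only, ignoring restarts and evaluation):

```python
# Aggregate throughput and daily token budget derived from the table above.
rows = [("Llama 8B", 8, 8532), ("Llama 70B", 256, 876), ("Llama 405B", 512, 128)]
for model, gpus, tps_per_gpu in rows:
    total_tps = gpus * tps_per_gpu
    print(f"{model}: {total_tps:,} tokens/s, ~{total_tps * 86_400 / 1e9:.1f}B tokens/day")
# Llama 8B: 68,256 tokens/s, ~5.9B tokens/day
# Llama 70B: 224,256 tokens/s, ~19.4B tokens/day
# Llama 405B: 65,536 tokens/s, ~5.7B tokens/day
```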
Advanced topics
FSDP2 configuration: See references/fsdp.md for detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.
Float8 training: See references/float8.md for tensorwise vs rowwise scaling recipes.
Checkpointing: See references/checkpoint.md for HuggingFace conversion and async checkpointing.
Adding custom models: See references/custom-models.md for TrainSpec protocol.
Resources
- GitHub: https://github.com/pytorch/torchtitan
- Paper: https://arxiv.org/abs/2410.06511
- ICLR 2025: https://iclr.cc/virtual/2025/poster/29620
- PyTorch Forum: https://discuss.pytorch.org/c/distributed/torchtitan/44