distributed-llm-pretraining-torchtitan


TorchTitan - PyTorch Native Distributed LLM Pretraining


Quick start


TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.
Installation:

```bash
# From PyPI (stable)
pip install torchtitan

# From source (latest features, requires PyTorch nightly)
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
```

**Download tokenizer**:

```bash
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
```

**Start training on 8 GPUs**:

```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
```

Common workflows


Workflow 1: Pretrain Llama 3.1 8B on single node


Copy this checklist:

Single Node Pretraining:
- [ ] Step 1: Download tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint

**Step 1: Download tokenizer**

```bash
python scripts/download_hf_assets.py \
  --repo_id meta-llama/Llama-3.1-8B \
  --assets tokenizer \
  --hf_token=YOUR_HF_TOKEN
```

**Step 2: Configure training**

Edit or create a TOML config file:

```toml
# llama3_8b_custom.toml

[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"

[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"

[optimizer]
name = "AdamW"
lr = 3e-4

[lr_scheduler]
warmup_steps = 200

[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0
steps = 1000
dataset = "c4"

[parallelism]
data_parallel_shard_degree = -1  # Use all GPUs for FSDP

[activation_checkpoint]
mode = "selective"
selective_ac_option = "op"

[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
```

**Step 3: Launch training**

```bash
# 8 GPUs on single node
CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh

# Or explicitly with torchrun
torchrun --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_8b_custom.toml
```

**Step 4: Monitor and checkpoint**

TensorBoard logs are saved to `./outputs/tb/`:

```bash
tensorboard --logdir ./outputs/tb
```
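As a sanity check on the config above, the token budget per optimizer step follows from plain arithmetic: local batch size times data-parallel degree times sequence length. A minimal sketch (no TorchTitan imports; `dp_degree = 8` assumes `data_parallel_shard_degree = -1` resolves to all 8 GPUs):

```python
# Tokens consumed per optimizer step = local batch * DP degree * sequence length.
# Values mirror llama3_8b_custom.toml above.
def tokens_per_step(local_batch_size: int, dp_degree: int, seq_len: int) -> int:
    return local_batch_size * dp_degree * seq_len

per_step = tokens_per_step(local_batch_size=2, dp_degree=8, seq_len=8192)
print(per_step)         # 131072 tokens per step
print(per_step * 1000)  # total tokens over steps = 1000
```

Useful when budgeting a run: multiply by `steps` to estimate total training tokens before launching.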

Workflow 2: Multi-node training with SLURM


Multi-Node Training:
- [ ] Step 1: Configure parallelism for scale
- [ ] Step 2: Set up SLURM script
- [ ] Step 3: Submit job
- [ ] Step 4: Resume from checkpoint

**Step 1: Configure parallelism for scale**

For a 70B model on 256 GPUs (32 nodes):

```toml
[parallelism]
data_parallel_shard_degree = 32  # FSDP across 32 ranks
tensor_parallel_degree = 8       # TP within node
pipeline_parallel_degree = 1     # No PP for 70B
context_parallel_degree = 1      # Increase for long sequences
```

**Step 2: Set up SLURM script**

```bash
#!/bin/bash
#SBATCH --job-name=llama70b
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

srun torchrun \
  --nnodes=32 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  -m torchtitan.train \
  --job.config_file ./llama3_70b.toml
```

**Step 3: Submit job**

```bash
sbatch multinode_trainer.slurm
```

**Step 4: Resume from checkpoint**

Training auto-resumes if a checkpoint exists in the configured folder.
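The parallelism degrees must multiply to the world size that torchrun launches (here 32 nodes x 8 GPUs = 256), or startup fails. A small pre-submit check; this is plain arithmetic, not a TorchTitan API:

```python
def check_world_size(dp: int, tp: int, pp: int, cp: int,
                     nnodes: int, gpus_per_node: int) -> None:
    # The device mesh size is the product of all parallelism degrees;
    # it must equal the number of launched ranks.
    mesh = dp * tp * pp * cp
    world = nnodes * gpus_per_node
    if mesh != world:
        raise ValueError(f"parallelism degrees multiply to {mesh}, "
                         f"but world size is {world}")

# The 70B config above: 32 * 8 * 1 * 1 == 32 nodes * 8 GPUs == 256
check_world_size(dp=32, tp=8, pp=1, cp=1, nnodes=32, gpus_per_node=8)
```

Running this locally before `sbatch` catches mismatched TOML/SLURM edits without burning a queue slot.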

Workflow 3: Enable Float8 training for H100s


Float8 provides 30-50% speedup on H100 GPUs.
Float8 Training:
- [ ] Step 1: Install torchao
- [ ] Step 2: Configure Float8
- [ ] Step 3: Launch with compile

**Step 1: Install torchao**

```bash
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
```

**Step 2: Configure Float8**

Add to your TOML config:

```toml
[model]
converters = ["quantize.linear.float8"]

[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output"]  # Exclude output layer

[compile]
enable = true
components = ["model", "loss"]
```

**Step 3: Launch with compile**

```bash
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
  --model.converters="quantize.linear.float8" \
  --quantize.linear.float8.enable_fsdp_float8_all_gather \
  --compile.enable
```
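`filter_fqns` keeps the listed modules in their original precision, matched against each module's fully-qualified name. A rough illustration of that kind of name-based filtering, assuming simple substring matching; the exact matching rules (and the special `auto_filter_small_kn` dimension-based token) live in torchao/TorchTitan, so treat this as a sketch only:

```python
def is_excluded(fqn: str, filter_fqns: list[str]) -> bool:
    # Substring match against each configured pattern. "auto_filter_small_kn"
    # is handled by GEMM dimensions rather than by name, so skip it here.
    return any(pat in fqn for pat in filter_fqns if pat != "auto_filter_small_kn")

filters = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
print(is_excluded("layers.0.attention.wk", filters))     # True  -> stays bf16
print(is_excluded("layers.0.feed_forward.w1", filters))  # False -> converted to float8
```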

Workflow 4: 4D parallelism for 405B models


4D Parallelism (FSDP + TP + PP + CP):
- [ ] Step 1: Create seed checkpoint
- [ ] Step 2: Configure 4D parallelism
- [ ] Step 3: Launch on 512 GPUs

**Step 1: Create seed checkpoint**

Required for consistent initialization across PP stages:

```bash
NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
  --checkpoint.enable \
  --checkpoint.create_seed_checkpoint \
  --parallelism.data_parallel_shard_degree 1 \
  --parallelism.tensor_parallel_degree 1 \
  --parallelism.pipeline_parallel_degree 1
```

**Step 2: Configure 4D parallelism**

```toml
[parallelism]
data_parallel_shard_degree = 8   # FSDP
tensor_parallel_degree = 8       # TP within node
pipeline_parallel_degree = 8     # PP across nodes
context_parallel_degree = 1      # CP for long sequences

[training]
local_batch_size = 32
seq_len = 8192
```

**Step 3: Launch on 512 GPUs**

```bash
# 64 nodes x 8 GPUs = 512 GPUs
srun torchrun --nnodes=64 --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_405b.toml
```
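With the degrees above, the 512 ranks form an 8x8x8 device mesh. For intuition about the layout, here is a sketch of one possible mapping from a flat rank to (pp, dp, tp) coordinates, with tp varying fastest so TP groups stay within a node; the real assignment is owned by PyTorch's DeviceMesh, so this is illustrative rather than TorchTitan's actual code:

```python
def rank_to_coords(rank: int, dp: int = 8, tp: int = 8, pp: int = 8):
    # Row-major layout: tp is the innermost (fastest-varying) dimension,
    # then dp, then pp across groups of nodes.
    pp_idx, rem = divmod(rank, dp * tp)
    dp_idx, tp_idx = divmod(rem, tp)
    return pp_idx, dp_idx, tp_idx

print(rank_to_coords(0))    # (0, 0, 0)
print(rank_to_coords(511))  # (7, 7, 7)
```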

When to use vs alternatives


Use TorchTitan when:
  • Pretraining LLMs from scratch (8B to 405B+)
  • Need PyTorch-native solution without third-party dependencies
  • Require composable 4D parallelism (FSDP2, TP, PP, CP)
  • Training on H100s with Float8 support
  • Want interoperable checkpoints with torchtune/HuggingFace
Use alternatives instead:
  • Megatron-LM: Maximum performance for NVIDIA-only deployments
  • DeepSpeed: Broader ZeRO optimization ecosystem, inference support
  • Axolotl/TRL: Fine-tuning rather than pretraining
  • LitGPT: Educational, smaller-scale training

Common issues


**Issue: Out of memory on large models**

Enable activation checkpointing and reduce batch size:

```toml
[activation_checkpoint]
mode = "full"  # Instead of "selective"

[training]
local_batch_size = 1
```

Or use gradient accumulation:

```toml
[training]
local_batch_size = 1
global_batch_size = 32  # Accumulates gradients
```
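With `global_batch_size` set, the number of accumulation microbatches follows from the local batch size and the data-parallel degree. The arithmetic, assuming `global_batch_size` counts samples per optimizer step across all DP ranks:

```python
def accumulation_steps(global_batch_size: int, local_batch_size: int,
                       dp_degree: int) -> int:
    # Samples processed per forward/backward across all DP ranks.
    per_micro_step = local_batch_size * dp_degree
    if global_batch_size % per_micro_step:
        raise ValueError("global_batch_size must be divisible by "
                         "local_batch_size * dp_degree")
    return global_batch_size // per_micro_step

# The config above on 8 GPUs: 32 / (1 * 8) = 4 microbatches per optimizer step
print(accumulation_steps(32, 1, 8))  # 4
```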
**Issue: TP causes high memory with async collectives**

Set environment variable:

```bash
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
```

**Issue: Float8 training not faster**

Float8 only benefits large GEMMs. Filter small layers:

```toml
[quantize.linear.float8]
filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
```

**Issue: Checkpoint loading fails after parallelism change**

Use DCP's resharding capability:

```bash
# Convert sharded checkpoint to single file
python -m torch.distributed.checkpoint.format_utils \
  dcp_to_torch checkpoint/step-1000 checkpoint.pt
```

**Issue: Pipeline parallelism initialization**

Create a seed checkpoint first (see Workflow 4, Step 1).
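Auto-resume loads the newest checkpoint in the configured folder. A minimal sketch of the "find the latest step" part of that logic, assuming the `step-<N>` directory naming shown above (`checkpoint/step-1000`); TorchTitan's own resume code is the source of truth:

```python
from pathlib import Path

def latest_step_dir(folder):
    # Pick the step-<N> subdirectory with the highest numeric suffix.
    # Numeric comparison matters: lexically, "step-900" > "step-1000".
    steps = [p for p in Path(folder).glob("step-*")
             if p.name.split("-")[-1].isdigit()]
    return max(steps, key=lambda p: int(p.name.split("-")[-1]), default=None)

# e.g. latest_step_dir("checkpoint") might return Path("checkpoint/step-1000")
```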

Supported models


| Model | Sizes | Status |
|-------|-------|--------|
| Llama 3.1 | 8B, 70B, 405B | Production |
| Llama 4 | Various | Experimental |
| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
| GPT-OSS | 20B, 120B (MoE) | Experimental |
| Qwen 3 | Various | Experimental |
| Flux | Diffusion | Experimental |

Performance benchmarks (H100)


| Model | GPUs | Parallelism | TPS/GPU | Techniques |
|-------|------|-------------|---------|------------|
| Llama 8B | 8 | FSDP | 5,762 | Baseline |
| Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
| Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
| Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |
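The "+48%" entry follows directly from the TPS/GPU column; verifying the arithmetic:

```python
# Llama 8B rows from the table above.
baseline_tps, fp8_tps = 5762, 8532
speedup = fp8_tps / baseline_tps - 1
print(f"{speedup:.0%}")  # 48%
```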

Advanced topics


FSDP2 configuration: See references/fsdp.md for detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.
Float8 training: See references/float8.md for tensorwise vs rowwise scaling recipes.
Checkpointing: See references/checkpoint.md for HuggingFace conversion and async checkpointing.
Adding custom models: See references/custom-models.md for TrainSpec protocol.

Resources
