distributed-llm-pretraining-torchtitan


TorchTitan - PyTorch Native Distributed LLM Pretraining


Quick start


TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.

**Installation**:
```bash
# From PyPI (stable)
pip install torchtitan

# From source (latest features, requires PyTorch nightly)
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
```

**Download tokenizer**:
```bash
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
```

**Start training on 8 GPUs**:
```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
```

Common workflows


Workflow 1: Pretrain Llama 3.1 8B on single node


Copy this checklist:

Single Node Pretraining:
- [ ] Step 1: Download tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint

**Step 1: Download tokenizer**

```bash
python scripts/download_hf_assets.py \
  --repo_id meta-llama/Llama-3.1-8B \
  --assets tokenizer \
  --hf_token=YOUR_HF_TOKEN
```

**Step 2: Configure training**

Edit or create a TOML config file:
```toml
# llama3_8b_custom.toml

[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"

[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"

[optimizer]
name = "AdamW"
lr = 3e-4

[lr_scheduler]
warmup_steps = 200

[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0
steps = 1000
dataset = "c4"

[parallelism]
data_parallel_shard_degree = -1  # Use all GPUs for FSDP

[activation_checkpoint]
mode = "selective"
selective_ac_option = "op"

[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
```

**Step 3: Launch training**

```bash
# 8 GPUs on single node
CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh

# Or explicitly with torchrun
torchrun --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_8b_custom.toml
```

**Step 4: Monitor and checkpoint**

TensorBoard logs are saved to `./outputs/tb/`:
```bash
tensorboard --logdir ./outputs/tb
```
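
As a rough sanity check on the Workflow 1 config, the sketch below works out how many tokens one optimizer step consumes on 8 GPUs and at which steps a checkpoint lands. It assumes `local_batch_size` is counted per GPU (pure FSDP, so every GPU is a data-parallel rank) and that a checkpoint is written at every multiple of `interval`. The helper names are illustrative and are not part of TorchTitan.

```python
# Values copied from llama3_8b_custom.toml; NUM_GPUS matches the 8-GPU launch.
NUM_GPUS = 8
LOCAL_BATCH_SIZE = 2
SEQ_LEN = 8192
STEPS = 1000
CHECKPOINT_INTERVAL = 500


def tokens_per_step(num_gpus: int, local_batch_size: int, seq_len: int) -> int:
    """Tokens consumed by one optimizer step when every GPU is a data-parallel rank."""
    return num_gpus * local_batch_size * seq_len


def checkpoint_steps(steps: int, interval: int) -> list[int]:
    """Steps that are multiples of the checkpoint interval (assumed save policy)."""
    return [s for s in range(1, steps + 1) if s % interval == 0]


per_step = tokens_per_step(NUM_GPUS, LOCAL_BATCH_SIZE, SEQ_LEN)
print(f"tokens per step: {per_step:,}")
print(f"tokens over the run: {per_step * STEPS:,}")
print(f"checkpoints at steps: {checkpoint_steps(STEPS, CHECKPOINT_INTERVAL)}")
```

Under these assumptions the 1000-step run sees roughly 131M tokens, which is useful for judging whether a short run is a smoke test or a meaningful experiment.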

Workflow 2: Multi-node training with SLURM


Multi-Node Training:
- [ ] Step 1: Configure parallelism for scale
- [ ] Step 2: Set up SLURM script
- [ ] Step 3: Submit job
- [ ] Step 4: Resume from checkpoint

**Step 1: Configure parallelism for scale**

For a 70B model on 256 GPUs (32 nodes):
```toml
[parallelism]
data_parallel_shard_degree = 32  # FSDP across 32 ranks
tensor_parallel_degree = 8        # TP within node
pipeline_parallel_degree = 1      # No PP for 70B
context_parallel_degree = 1       # Increase for long sequences
```

**Step 2: Set up SLURM script**

```bash
#!/bin/bash
#SBATCH --job-name=llama70b
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=1   # one torchrun launcher per node; it spawns the 8 workers
#SBATCH --gpus-per-node=8

# Use the first allocated node as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500

srun torchrun \
  --nnodes=32 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
  -m torchtitan.train \
  --job.config_file ./llama3_70b.toml
```

**Step 3: Submit job**

```bash
sbatch multinode_trainer.slurm
```

**Step 4: Resume from checkpoint**

Training resumes automatically if a checkpoint exists in the configured folder.
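
Before submitting a multi-node job it can help to confirm that the parallelism degrees actually multiply to the number of GPUs requested. The following is a minimal sketch of that check, assuming no replicated data parallelism (HSDP) and that `-1` means "use whatever GPUs remain" for FSDP; `resolve_dp_shard` is a hypothetical helper, not a TorchTitan function.

```python
def resolve_dp_shard(world_size: int, dp_shard: int, tp: int, pp: int, cp: int) -> int:
    """Return the FSDP degree, filling in -1 with whatever GPUs remain."""
    other = tp * pp * cp
    if world_size % other != 0:
        raise ValueError(f"tp*pp*cp={other} does not divide world size {world_size}")
    if dp_shard == -1:
        return world_size // other
    if dp_shard * other != world_size:
        raise ValueError(
            f"degrees multiply to {dp_shard * other}, expected {world_size}"
        )
    return dp_shard


# 32 nodes x 8 GPUs, using the 70B settings from Step 1.
world_size = 32 * 8
explicit = resolve_dp_shard(world_size, dp_shard=32, tp=8, pp=1, cp=1)
inferred = resolve_dp_shard(world_size, dp_shard=-1, tp=8, pp=1, cp=1)
print(explicit, inferred)
```

Both calls resolve to an FSDP degree of 32, matching the `[parallelism]` table in Step 1; a mismatched setting such as `dp_shard=16` raises a `ValueError` instead of failing later at launch.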

Workflow 3: Enable Float8 training for H100s


Float8 provides a 30-50% speedup on H100 GPUs.

Float8 Training:
- [ ] Step 1: Install torchao
- [ ] Step 2: Configure Float8
- [ ] Step 3: Launch with compile

**Step 1: Install torchao**

```bash
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
```

**Step 2: Configure Float8**

Add to your TOML config:
```toml
[model]
converters = ["quantize.linear.float8"]

[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output"]  # Exclude output layer

[compile]
enable = true
components = ["model", "loss"]
```

**Step 3: Launch with compile**

```bash
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
  --model.converters="quantize.linear.float8" \
  --quantize.linear.float8.enable_fsdp_float8_all_gather \
  --compile.enable
```
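
To make the effect of `filter_fqns` concrete, here is a minimal sketch of how such a filter could select layers. It assumes a layer is skipped when any filter entry appears as a substring of its fully qualified name; the real converter may apply further rules, and the module names below are hypothetical examples in the style of a Llama-like model.

```python
def should_convert(fqn: str, filter_fqns: list[str]) -> bool:
    """True if the linear layer named `fqn` should be swapped to Float8."""
    return not any(pattern in fqn for pattern in filter_fqns)


# Hypothetical module names; a real model exposes them via named_modules().
linear_layers = [
    "layers.0.attention.wq",
    "layers.0.attention.wk",
    "layers.0.feed_forward.w1",
    "output",
]
filter_fqns = ["output"]  # same value as in the TOML config above

converted = [name for name in linear_layers if should_convert(name, filter_fqns)]
print(converted)
```

With the single `"output"` filter, only the final projection stays in high precision while the attention and feed-forward linears are converted.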

Workflow 4: 4D parallelism for 405B models


4D Parallelism (FSDP + TP + PP + CP):
- [ ] Step 1: Create seed checkpoint
- [ ] Step 2: Configure 4D parallelism
- [ ] Step 3: Launch on 512 GPUs

**Step 1: Create seed checkpoint**

Required for consistent initialization across PP stages:
```bash
NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
  --checkpoint.enable \
  --checkpoint.create_seed_checkpoint \
  --parallelism.data_parallel_shard_degree 1 \
  --parallelism.tensor_parallel_degree 1 \
  --parallelism.pipeline_parallel_degree 1
```

**Step 2: Configure 4D parallelism**

```toml
[parallelism]
data_parallel_shard_degree = 8   # FSDP
tensor_parallel_degree = 8       # TP within node
pipeline_parallel_degree = 8     # PP across nodes
context_parallel_degree = 1      # CP for long sequences

[training]
local_batch_size = 32
seq_len = 8192
```

**Step 3: Launch on 512 GPUs**

```bash
# 64 nodes x 8 GPUs = 512 GPUs
srun torchrun --nnodes=64 --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_405b.toml
```
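
The comments "TP within node" and "PP across nodes" follow from how global ranks map onto the device mesh. The sketch below illustrates that mapping under two assumptions that are not stated in the text above: the mesh is laid out row-major with PP outermost and TP innermost, and ranks are assigned to nodes contiguously (8 per node). The function names are illustrative only.

```python
PP, DP, TP = 8, 8, 8          # degrees from the Workflow 4 [parallelism] table
GPUS_PER_NODE = 8


def mesh_coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (pp, dp, tp) indices in a row-major mesh."""
    pp_idx, rest = divmod(rank, DP * TP)
    dp_idx, tp_idx = divmod(rest, TP)
    return pp_idx, dp_idx, tp_idx


def tp_group(rank: int) -> list[int]:
    """Global ranks that share a tensor-parallel group with `rank`."""
    start = rank - rank % TP
    return list(range(start, start + TP))


for rank in (0, 7, 8, 64, 511):
    print(rank, mesh_coords(rank), "node", rank // GPUS_PER_NODE)
```

Because TP is the innermost dimension, each tensor-parallel group is 8 consecutive ranks and therefore sits on one node, where the fastest interconnect is available. Each pipeline stage spans 64 ranks, or 8 nodes, so stage-to-stage traffic crosses node boundaries.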

When to use vs alternatives


Use TorchTitan when:
  • Pretraining LLMs from scratch (8B to 405B+)
  • Need PyTorch-native solution without third-party dependencies
  • Require composable 4D parallelism (FSDP2, TP, PP, CP)
  • Training on H100s with Float8 support
  • Want interoperable checkpoints with torchtune/HuggingFace
Use alternatives instead:
  • Megatron-LM: Maximum performance for NVIDIA-only deployments
  • DeepSpeed: Broader ZeRO optimization ecosystem, inference support
  • Axolotl/TRL: Fine-tuning rather than pretraining
  • LitGPT: Educational, smaller-scale training

Common issues


**Issue: Out of memory on large models**

Enable full activation checkpointing and reduce the batch size:
```toml
[activation_checkpoint]
mode = "full"  # Instead of "selective"

[training]
local_batch_size = 1
```

Or use gradient accumulation:
```toml
[training]
local_batch_size = 1
global_batch_size = 32  # Accumulates gradients
```

**Issue: TP causes high memory with async collectives**

Set environment variable:
```bash
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
```

**Issue: Float8 training not faster**

Float8 only benefits large GEMMs. Filter small layers:
```toml
[quantize.linear.float8]
filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
```

**Issue: Checkpoint loading fails after parallelism change**

Use DCP's resharding capability:
```bash
# Convert sharded checkpoint to single file
python -m torch.distributed.checkpoint.format_utils \
  dcp_to_torch checkpoint/step-1000 checkpoint.pt
```

**Issue: Pipeline parallelism initialization**

Create seed checkpoint first (see Workflow 4, Step 1).
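
For the gradient accumulation setting above, the number of micro-steps per optimizer step follows from the batch sizes and the data-parallel degree. This is a small sketch of that arithmetic, assuming 8 data-parallel ranks (for example, single-node FSDP) and that the global batch must divide evenly; `grad_accum_steps` is a hypothetical helper, not a TorchTitan API.

```python
def grad_accum_steps(global_batch_size: int, local_batch_size: int, dp_degree: int) -> int:
    """Forward/backward passes accumulated before each optimizer step."""
    per_pass = local_batch_size * dp_degree  # sequences processed in one pass
    if global_batch_size % per_pass != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not a multiple of {per_pass}"
        )
    return global_batch_size // per_pass


# local_batch_size = 1 and global_batch_size = 32 from the config above,
# with an assumed data-parallel degree of 8.
steps = grad_accum_steps(global_batch_size=32, local_batch_size=1, dp_degree=8)
print(steps)
```

Here each rank runs 4 micro-batches of one sequence before the optimizer steps, which keeps peak activation memory at the `local_batch_size = 1` level while preserving the effective batch of 32 sequences.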

Supported models


| Model | Sizes | Status |
|---|---|---|
| Llama 3.1 | 8B, 70B, 405B | Production |
| Llama 4 | Various | Experimental |
| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
| GPT-OSS | 20B, 120B (MoE) | Experimental |
| Qwen 3 | Various | Experimental |
| Flux | Diffusion | Experimental |

Performance benchmarks (H100)

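| Model | GPUs | Parallelism | TPS/GPU | Techniques |
|---|---|---|---|---|
| Llama 8B | 8 | FSDP | 5,762 | Baseline |
| Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
| Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
| Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |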
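
The per-GPU figures are easier to compare once converted to cluster-wide throughput. The sketch below reproduces the +48% figure from the two 8B rows and multiplies each TPS/GPU value by its GPU count; it uses only the numbers in the table and assumes TPS/GPU is an average over all GPUs in the job.

```python
# (label, GPUs, tokens per second per GPU) taken from the benchmark table.
rows = [
    ("Llama 8B, FSDP", 8, 5762),
    ("Llama 8B, FSDP+compile+FP8", 8, 8532),
    ("Llama 70B, FSDP+TP+AsyncTP", 256, 876),
    ("Llama 405B, FSDP+TP+PP", 512, 128),
]

# Relative gain of the compiled Float8 run over the FSDP baseline.
speedup = rows[1][2] / rows[0][2] - 1
print(f"8B speedup over baseline: {speedup:.0%}")

# Aggregate throughput across the whole job.
cluster_tps = {label: gpus * tps for label, gpus, tps in rows}
for label, total in cluster_tps.items():
    print(f"{label}: {total:,} tokens/s")
```

On these numbers the 70B run processes about 224K tokens per second across 256 GPUs, while the 405B run processes about 66K tokens per second across 512 GPUs.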

Advanced topics

FSDP2 configuration: See references/fsdp.md for detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.
Float8 training: See references/float8.md for tensorwise vs rowwise scaling recipes.
Checkpointing: See references/checkpoint.md for HuggingFace conversion and async checkpointing.
Adding custom models: See references/custom-models.md for TrainSpec protocol.

Resources