distributed-llm-pretraining-torchtitan

TorchTitan - PyTorch Native Distributed LLM Pretraining

Quick start

TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.

**Installation**:
```bash
# From PyPI (stable)
pip install torchtitan

# From source (latest features, requires PyTorch nightly)
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
```

**Download tokenizer**:
```bash
# Get HF token from https://huggingface.co/settings/tokens
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
```

**Start training on 8 GPUs**:
```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
```

Common workflows

Workflow 1: Pretrain Llama 3.1 8B on single node

Copy this checklist:

Single Node Pretraining:
- [ ] Step 1: Download tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint

**Step 1: Download tokenizer**

```bash
python scripts/download_hf_assets.py \
  --repo_id meta-llama/Llama-3.1-8B \
  --assets tokenizer \
  --hf_token=YOUR_HF_TOKEN
```

**Step 2: Configure training**

Edit or create a TOML config file:

```toml
# llama3_8b_custom.toml
[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"

[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"

[optimizer]
name = "AdamW"
lr = 3e-4

[lr_scheduler]
warmup_steps = 200

[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0
steps = 1000
dataset = "c4"

[parallelism]
data_parallel_shard_degree = -1  # Use all GPUs for FSDP

[activation_checkpoint]
mode = "selective"
selective_ac_option = "op"

[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
```

**Step 3: Launch training**

```bash
# 8 GPUs on single node
CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh

# Or explicitly with torchrun
torchrun --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_8b_custom.toml
```

**Step 4: Monitor and checkpoint**

TensorBoard logs are saved to `./outputs/tb/`:
```bash
tensorboard --logdir ./outputs/tb
```
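
It can help to estimate the token budget of this workflow before launching. The sketch below is plain arithmetic on the Step 2 values. It assumes that `data_parallel_shard_degree = -1` resolves to all 8 GPUs and that each GPU consumes its own `local_batch_size` sequences per step; neither assumption is checked against TorchTitan itself.

```python
# Values copied from the Step 2 config of this workflow.
num_gpus = 8              # single node, 8 GPUs
local_batch_size = 2      # sequences per GPU per step
seq_len = 8192            # tokens per sequence
steps = 1000
checkpoint_interval = 500

# Assumption: with data_parallel_shard_degree = -1 every GPU is a
# data-parallel rank, so the per-step batch is num_gpus * local_batch_size.
tokens_per_step = num_gpus * local_batch_size * seq_len
total_tokens = tokens_per_step * steps

# Steps at which a checkpoint would be written for the given interval.
checkpoint_steps = list(range(checkpoint_interval, steps + 1, checkpoint_interval))

print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,}")
print(f"checkpoints at:  {checkpoint_steps}")
```

Under these assumptions the run covers about 131M tokens, which is a smoke test rather than a full pretraining run.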

Workflow 2: Multi-node training with SLURM

Multi-Node Training:
- [ ] Step 1: Configure parallelism for scale
- [ ] Step 2: Set up SLURM script
- [ ] Step 3: Submit job
- [ ] Step 4: Resume from checkpoint

**Step 1: Configure parallelism for scale**

For 70B model on 256 GPUs (32 nodes):

```toml
[parallelism]
data_parallel_shard_degree = 32  # FSDP across 32 ranks
tensor_parallel_degree = 8        # TP within node
pipeline_parallel_degree = 1      # No PP for 70B
context_parallel_degree = 1       # Increase for long sequences
```
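
The parallelism degrees have to multiply to the total GPU count (here 32 x 8 x 1 x 1 = 256). A minimal sketch of that check follows; `resolve_degrees` is a hypothetical helper written for illustration, not a TorchTitan function, and treating `-1` as "use whatever is left" is an assumption based on the config comment in Workflow 1.

```python
from math import prod


def resolve_degrees(world_size, dp_shard, tp, pp=1, cp=1):
    """Resolve dp_shard == -1 and verify the degrees multiply to world_size.

    Hypothetical helper for illustration; not part of TorchTitan.
    """
    fixed = tp * pp * cp
    if dp_shard == -1:
        if world_size % fixed != 0:
            raise ValueError(f"world_size {world_size} is not divisible by tp*pp*cp = {fixed}")
        dp_shard = world_size // fixed

    degrees = {"dp_shard": dp_shard, "tp": tp, "pp": pp, "cp": cp}
    if prod(degrees.values()) != world_size:
        raise ValueError(f"degrees {degrees} do not multiply to world_size {world_size}")
    return degrees


# 32 nodes x 8 GPUs with the 70B settings from this step.
print(resolve_degrees(world_size=32 * 8, dp_shard=32, tp=8))
```

Running the same check with a mismatched degree (for example `dp_shard=16`) raises a `ValueError`, which is cheaper to hit locally than on a 32-node allocation.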

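**Step 2: Set up SLURM script**

```bash
#!/bin/bash
#SBATCH --job-name=llama70b
#SBATCH --nodes=32
# One torchrun launcher per node; torchrun itself spawns the 8 GPU workers.
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Rendezvous on the first allocated node. The port is an assumption; use any free port.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun torchrun \
  --nnodes=32 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  -m torchtitan.train \
  --job.config_file ./llama3_70b.toml
```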

**Step 3: Submit job**

```bash
sbatch multinode_trainer.slurm
```

**Step 4: Resume from checkpoint**

Training resumes automatically if a checkpoint exists in the configured checkpoint folder.
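
To see which step a resumed job would most likely pick up, the checkpoint folder can be inspected by hand. The sketch below assumes checkpoints are stored as `step-N` subfolders, a layout inferred from the `checkpoint/step-1000` path used elsewhere in this guide. `latest_checkpoint_step` is a hypothetical helper; the real resume logic lives inside TorchTitan and may differ.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint_step(folder):
    """Return the largest N among `step-N` subfolders, or None if there are none.

    Hypothetical helper; the `step-N` layout is an assumption.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return None
    steps = []
    for child in folder.iterdir():
        match = re.fullmatch(r"step-(\d+)", child.name)
        if child.is_dir() and match:
            steps.append(int(match.group(1)))
    return max(steps, default=None)


# Demo on a throwaway directory that mimics a checkpoint folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("step-500", "step-1000", "notes"):
        (Path(tmp) / name).mkdir()
    print(latest_checkpoint_step(tmp))
```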

Workflow 3: Enable Float8 training for H100s

Float8 provides 30-50% speedup on H100 GPUs.

Float8 Training:
- [ ] Step 1: Install torchao
- [ ] Step 2: Configure Float8
- [ ] Step 3: Launch with compile

**Step 1: Install torchao**

```bash
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
```

**Step 2: Configure Float8**

Add to your TOML config:

```toml
[model]
converters = ["quantize.linear.float8"]

[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output"]  # Exclude output layer

[compile]
enable = true
components = ["model", "loss"]
```

**Step 3: Launch with compile**

```bash
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
  --model.converters="quantize.linear.float8" \
  --quantize.linear.float8.enable_fsdp_float8_all_gather \
  --compile.enable
```

Workflow 4: 4D parallelism for 405B models

4D Parallelism (FSDP + TP + PP + CP):
- [ ] Step 1: Create seed checkpoint
- [ ] Step 2: Configure 4D parallelism
- [ ] Step 3: Launch on 512 GPUs

**Step 1: Create seed checkpoint**

Required for consistent initialization across PP stages:

```bash
NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
  --checkpoint.enable \
  --checkpoint.create_seed_checkpoint \
  --parallelism.data_parallel_shard_degree 1 \
  --parallelism.tensor_parallel_degree 1 \
  --parallelism.pipeline_parallel_degree 1
```

**Step 2: Configure 4D parallelism**

```toml
[parallelism]
data_parallel_shard_degree = 8   # FSDP
tensor_parallel_degree = 8       # TP within node
pipeline_parallel_degree = 8     # PP across nodes
context_parallel_degree = 1      # CP for long sequences

[training]
local_batch_size = 32
seq_len = 8192
```
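
The comments "TP within node" and "PP across nodes" can be made concrete by mapping a global rank to its position in the device mesh. The sketch below assumes a row-major mesh ordered (PP, FSDP, TP) with TP innermost, CP omitted because its degree is 1, and ranks assigned to nodes in contiguous blocks of 8. That ordering is an assumption about how TorchTitan builds its mesh, so verify it against the installed version; `mesh_coords` is a hypothetical helper.

```python
def mesh_coords(rank, pp=8, dp_shard=8, tp=8):
    """Map a global rank to (pp, dp_shard, tp) indices, with TP varying fastest.

    Hypothetical helper; the (pp, dp_shard, tp) ordering is an assumption.
    """
    world_size = pp * dp_shard * tp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside 0..{world_size - 1}")
    tp_idx = rank % tp
    dp_idx = (rank // tp) % dp_shard
    pp_idx = rank // (tp * dp_shard)
    return pp_idx, dp_idx, tp_idx


gpus_per_node = 8
for rank in (0, 7, 8, 64, 511):
    node = rank // gpus_per_node
    print(f"rank {rank:3d} on node {node:2d} -> (pp, dp_shard, tp) = {mesh_coords(rank)}")
```

Under this layout the 8 ranks of a TP group share one node, while consecutive pipeline stages sit 64 ranks (8 nodes) apart.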

**Step 3: Launch on 512 GPUs**

```bash
# 64 nodes x 8 GPUs = 512 GPUs
# Rendezvous flags are omitted here; reuse the ones from the SLURM script in Workflow 2.
srun torchrun --nnodes=64 --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_405b.toml
```

When to use vs alternatives

Use TorchTitan when:

- Pretraining LLMs from scratch (8B to 405B+)
- Need PyTorch-native solution without third-party dependencies
- Require composable 4D parallelism (FSDP2, TP, PP, CP)
- Training on H100s with Float8 support
- Want interoperable checkpoints with torchtune/HuggingFace

Use alternatives instead:

- Megatron-LM: Maximum performance for NVIDIA-only deployments
- DeepSpeed: Broader ZeRO optimization ecosystem, inference support
- Axolotl/TRL: Fine-tuning rather than pretraining
- LitGPT: Educational, smaller-scale training

Common issues

**Issue: Out of memory on large models**

Switch to full activation checkpointing and reduce the local batch size:

```toml
[activation_checkpoint]
mode = "full"  # Instead of "selective"

[training]
local_batch_size = 1
```

To keep the effective batch size while using a small local batch, use gradient accumulation:

```toml
[training]
local_batch_size = 1
global_batch_size = 32  # Accumulates gradients
```
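
The number of accumulation steps implied by these two settings depends on the data-parallel degree. The sketch below assumes the relation `global_batch_size = local_batch_size x data-parallel degree x accumulation steps`. `grad_accum_steps` is a hypothetical helper, and the 8-GPU data-parallel degree is an example value, not something stated in this section.

```python
def grad_accum_steps(global_batch_size, local_batch_size, dp_degree):
    """Micro-steps to accumulate before each optimizer step.

    Hypothetical helper; assumes
    global_batch_size = local_batch_size * dp_degree * accumulation_steps.
    """
    per_micro_step = local_batch_size * dp_degree
    if global_batch_size % per_micro_step != 0:
        raise ValueError(
            f"global_batch_size {global_batch_size} is not a multiple of "
            f"local_batch_size * dp_degree = {per_micro_step}"
        )
    return global_batch_size // per_micro_step


# Example: the settings above on 8 data-parallel GPUs.
print(grad_accum_steps(global_batch_size=32, local_batch_size=1, dp_degree=8))  # 4
```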

**Issue: TP causes high memory with async collectives**

Set environment variable:

```bash
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
```

**Issue: Float8 training not faster**

Float8 only benefits large GEMMs. Filter small layers:

```toml
[quantize.linear.float8]
filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
```
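
To reason about which layers a `filter_fqns` list leaves in Float8, the sketch below treats each entry as a substring matched against a module's fully qualified name. That matching rule is an assumption rather than a confirmed description of TorchTitan's behavior, the module names are illustrative Llama-style names, and `auto_filter_small_kn` is left out because it appears to be a special switch rather than a name pattern.

```python
def is_converted(fqn, filter_fqns):
    """Return True if a module would still be converted to Float8.

    Hypothetical helper; assumes a module is skipped when any filter
    entry occurs as a substring of its fully qualified name.
    """
    return not any(pattern in fqn for pattern in filter_fqns)


filter_fqns = ["attention.wk", "attention.wv", "output"]

# Illustrative module names, not read from a real model.
module_names = [
    "layers.0.attention.wq",
    "layers.0.attention.wk",
    "layers.0.attention.wv",
    "layers.0.feed_forward.w1",
    "output",
]

for name in module_names:
    status = "float8" if is_converted(name, filter_fqns) else "skipped"
    print(f"{name:28s} {status}")
```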

**Issue: Checkpoint loading fails after parallelism change**

Use DCP's resharding capability:

```bash
# Convert sharded checkpoint to single file
python -m torch.distributed.checkpoint.format_utils \
  dcp_to_torch checkpoint/step-1000 checkpoint.pt
```

**Issue: Pipeline parallelism initialization**

Create seed checkpoint first (see Workflow 4, Step 1).

Supported models

| Model | Sizes | Status |
|-------|-------|--------|
| Llama 3.1 | 8B, 70B, 405B | Production |
| Llama 4 | Various | Experimental |
| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
| GPT-OSS | 20B, 120B (MoE) | Experimental |
| Qwen 3 | Various | Experimental |
| Flux | Diffusion | Experimental |

Performance benchmarks (H100)

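| Model | GPUs | Parallelism | TPS/GPU | Techniques |
|-------|------|-------------|---------|------------|
| Llama 8B | 8 | FSDP | 5,762 | Baseline |
| Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
| Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
| Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |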
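
The +48% figure and the job-level throughput can be recomputed from the table. The sketch below reads TPS as tokens per second, which is an assumption, and obtains aggregate throughput by simply multiplying the per-GPU number by the GPU count.

```python
# (configuration, GPU count, TPS per GPU) copied from the table above.
rows = [
    ("Llama 8B, FSDP", 8, 5762),
    ("Llama 8B, FSDP+compile+FP8", 8, 8532),
    ("Llama 70B, FSDP+TP+AsyncTP", 256, 876),
    ("Llama 405B, FSDP+TP+PP", 512, 128),
]

# Relative gain of the compile+FP8 run over the plain FSDP baseline.
baseline_tps = rows[0][2]
fp8_tps = rows[1][2]
speedup_pct = (fp8_tps / baseline_tps - 1) * 100
print(f"8B gain from compile+FP8: {speedup_pct:.1f}%")

# Aggregate throughput, assuming per-GPU numbers add up across the job.
aggregate = {name: gpus * tps for name, gpus, tps in rows}
for name, total in aggregate.items():
    print(f"{name}: {total:,} TPS across all GPUs")
```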

Advanced topics

FSDP2 configuration: See references/fsdp.md for a detailed FSDP2 vs. FSDP1 comparison and ZeRO equivalents.

Float8 training: See references/float8.md for tensorwise vs. rowwise scaling recipes.

Checkpointing: See references/checkpoint.md for HuggingFace conversion and async checkpointing.

Adding custom models: See references/custom-models.md for the TrainSpec protocol.

Resources

- GitHub: https://github.com/pytorch/torchtitan
- Paper: https://arxiv.org/abs/2410.06511
- ICLR 2025: https://iclr.cc/virtual/2025/poster/29620
- PyTorch Forum: https://discuss.pytorch.org/c/distributed/torchtitan/44
