training-llms-megatron

Megatron-Core - Large-Scale LLM Training

Quick start

Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization (MFU) on H100 GPUs through advanced parallelism strategies.

**Installation**:
```bash
# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

# Or pip
pip install megatron-core
```

**Simple distributed training**:
```bash
# Train with 2 GPUs using data parallelism
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py

# Or LLaMA-3 8B training
./examples/llama/train_llama3_8b_fp8.sh
```

Common workflows

Workflow 1: Train LLaMA-style model with 3D parallelism

Copy this checklist:

LLaMA Training Setup:
- [ ] Step 1: Choose parallelism configuration
- [ ] Step 2: Configure training hyperparameters
- [ ] Step 3: Launch distributed training
- [ ] Step 4: Monitor performance metrics

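**Step 1: Choose parallelism configuration**

Model size determines parallelism strategy. The GPU count must equal Tensor Parallel × Pipeline Parallel × Data Parallel × Context Parallel:

| Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel |
|------------|------|-----------------|-------------------|---------------|------------------|
| 7B | 8 | 1 | 1 | 8 | 1 |
| 13B | 8 | 2 | 1 | 4 | 1 |
| 70B | 64 | 4 | 4 | 4 | 1 |
| 405B | 128 | 8 | 8 | 1 | 2 |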
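The data-parallel column follows from the other columns: it is whatever remains after the GPUs are divided among tensor, pipeline, and context parallelism. A minimal sketch of that arithmetic, where `data_parallel_size` is a hypothetical helper written for illustration and not a Megatron-Core command:

```bash
# Derive the data-parallel size implied by a GPU count and the model-parallel sizes.
# data_parallel_size is a hypothetical helper, not part of Megatron-Core.
data_parallel_size() {
  local gpus=$1 tp=$2 pp=$3 cp=$4
  local model_parallel=$(( tp * pp * cp ))
  # The GPU count must split evenly into model-parallel groups.
  if (( gpus % model_parallel != 0 )); then
    echo "error: $gpus GPUs is not divisible by TP*PP*CP=$model_parallel" >&2
    return 1
  fi
  echo $(( gpus / model_parallel ))
}

data_parallel_size 8 2 1 1    # 13B row: 8 GPUs, TP=2, PP=1, CP=1
data_parallel_size 64 4 4 1   # 70B row: 64 GPUs, TP=4, PP=4, CP=1
```

Running the check before launching catches layouts that cannot be mapped onto the available GPUs, such as TP=4 and PP=4 on a single 8-GPU node.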

**Step 2: Configure training hyperparameters**

```bash
#!/bin/bash
# train_llama_70b.sh

GPUS_PER_NODE=8
NNODES=8  # 64 GPUs total
TP=4      # Tensor parallel
PP=4      # Pipeline parallel
CP=1      # Context parallel

# LLaMA 70B configuration
MODEL_SIZE=70  # Billion parameters
HIDDEN_SIZE=8192
NUM_LAYERS=80
NUM_HEADS=64
SEQ_LENGTH=4096

# Training hyperparameters
MICRO_BATCH=1
GLOBAL_BATCH=1024
LR=3e-4

torchrun \
  --nproc_per_node=$GPUS_PER_NODE \
  --nnodes=$NNODES \
  pretrain_gpt.py \
  --tensor-model-parallel-size $TP \
  --pipeline-model-parallel-size $PP \
  --context-parallel-size $CP \
  --sequence-parallel \
  --num-layers $NUM_LAYERS \
  --hidden-size $HIDDEN_SIZE \
  --num-attention-heads $NUM_HEADS \
  --seq-length $SEQ_LENGTH \
  --max-position-embeddings $SEQ_LENGTH \
  --micro-batch-size $MICRO_BATCH \
  --global-batch-size $GLOBAL_BATCH \
  --lr $LR \
  --train-iters 100000 \
  --lr-decay-style cosine \
  --lr-warmup-iters 2000 \
  --weight-decay 0.1 \
  --clip-grad 1.0 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
```

**Step 3: Launch distributed training**

```bash
# Single node (8 GPUs): requires NNODES=1 and TP × PP × CP ≤ 8 in the script
bash train_llama_70b.sh

# Multi-node with SLURM
sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh
```

**Step 4: Monitor performance metrics**

Key metrics to track:

- Model FLOP Utilization (MFU): Target >40% on H100
- Throughput: Tokens/sec/GPU
- Memory usage: <80GB per GPU for 70B model
- Loss: Should decrease steadily
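MFU can be estimated from throughput alone. The sketch below uses the common approximation of about 6 FLOPs per parameter per token (forward plus backward), which ignores attention FLOPs and any activation recomputation. The helper name `mfu_percent`, the 1000 tokens/sec/GPU input, and the 989e12 FLOP/s dense BF16 peak for an H100 SXM are all assumptions chosen for illustration, not measured results:

```bash
# Rough MFU estimate: achieved model FLOP/s per GPU divided by peak FLOP/s per GPU.
# mfu_percent is a hypothetical helper, not part of Megatron-Core.
mfu_percent() {
  local tokens_per_sec_per_gpu=$1 params=$2 peak_flops=$3
  # awk handles the floating-point arithmetic that bash lacks.
  awk -v t="$tokens_per_sec_per_gpu" -v n="$params" -v p="$peak_flops" \
    'BEGIN { printf "%.1f\n", 100 * 6 * n * t / p }'
}

# Illustrative inputs only: 70e9 parameters, a made-up 1000 tokens/sec/GPU,
# and an assumed 989e12 FLOP/s peak.
mfu_percent 1000 70e9 989e12
```

Substitute the throughput reported in your own training logs and the peak figure for your GPU and precision.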


Workflow 2: Configure Mixture of Experts (MoE) training

For sparse MoE models like Mixtral.

MoE Training:
- [ ] Step 1: Configure expert parallelism
- [ ] Step 2: Set MoE hyperparameters
- [ ] Step 3: Launch training with EP

**Step 1: Configure expert parallelism**

```bash
# Mixtral 8x7B example
TENSOR_PARALLEL=2
PIPELINE_PARALLEL=1
EXPERT_PARALLEL=4  # Split 8 experts across 4 GPUs
DATA_PARALLEL=4

TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * EXPERT_PARALLEL * DATA_PARALLEL))
# = 2 * 1 * 4 * 4 = 32 GPUs
```


**Step 2: Set MoE hyperparameters**

```bash
torchrun \
  --nproc_per_node=8 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 1 \
  --expert-model-parallel-size 4 \
  --num-experts 8 \
  --moe-router-topk 2 \
  --moe-router-load-balancing-type aux_loss \
  --moe-aux-loss-coeff 0.01 \
  --hidden-size 4096 \
  --num-layers 32 \
  --num-attention-heads 32 \
  --seq-length 4096 \
  --max-position-embeddings 4096 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
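  --merge-file /path/to/merges.txt
```

**Step 3: Launch training with EP**

Expert parallelism distributes different experts across GPUs, reducing per-GPU memory while maintaining model capacity.

- Expert weights without EP: 8 experts × 7B parameters = 56B parameters per GPU (about 56GB at 1 byte per parameter)
- Expert weights with EP=4: 2 experts × 7B parameters = 14B parameters per GPU (about 14GB at 1 byte per parameter)
- Savings: 75% reduction in expert weight memory per GPU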
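The 75% figure is just the share of experts that no longer live on each GPU. A minimal sketch of that calculation, assuming experts are split evenly across the expert-parallel group; `expert_shard` is a hypothetical helper, and the result covers expert weights only, not attention layers, optimizer state, or activations:

```bash
# Experts held per GPU and the resulting reduction in expert weights per GPU.
# expert_shard is a hypothetical helper, not part of Megatron-Core.
expert_shard() {
  local num_experts=$1 ep=$2
  # Experts must divide evenly across the expert-parallel group.
  if (( ep < 1 || num_experts % ep != 0 )); then
    echo "error: $num_experts experts cannot be split evenly across EP=$ep" >&2
    return 1
  fi
  local per_gpu=$(( num_experts / ep ))
  local saved_pct=$(( 100 - 100 * per_gpu / num_experts ))
  echo "$per_gpu experts per GPU, $saved_pct% fewer expert weights per GPU"
}

expert_shard 8 4   # --num-experts 8 with --expert-model-parallel-size 4
```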



Workflow 3: Optimize for maximum throughput

Target up to 47% MFU on H100.

Performance Optimization:
- [ ] Step 1: Enable Flash Attention
- [ ] Step 2: Use FP8 precision (H100)
- [ ] Step 3: Optimize micro-batch size
- [ ] Step 4: Tune parallelism degrees

**Step 1: Enable optimizations**

```bash
--use-mcore-models  # Use Megatron Core models
--transformer-impl transformer_engine  # Use Transformer Engine
--sequence-parallel  # Reduce activation memory (use with TP)
```

**Step 2: Use FP8 precision (FP8-capable GPUs, e.g. H100)**

```bash
--fp8-format hybrid  # FP8 mixed precision training (older releases used --fp8-hybrid)
# Transformer Engine handles FP8 automatically
```

Result: up to 1.5-2x speedup on H100 vs BF16.

**Step 3: Optimize micro-batch size**

Find the largest micro-batch that fits in memory:

```bash
# Start with 1, increase until OOM
for MBS in 1 2 4 8; do
  echo "Testing micro-batch-size=$MBS"
  torchrun ... --micro-batch-size $MBS
done
```

Typical values:
- 7B model: 4-8
- 70B model: 1-2
- 405B model: 1
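The sweep is easier to automate if it stops at the first failing size and reports the last one that worked. The following is a sketch under stated assumptions: `try_micro_batch` is a placeholder that pretends sizes above 4 run out of memory, and you would replace its body with a short real launch that exits non-zero on OOM; `largest_fitting_mbs` is likewise a hypothetical helper:

```bash
# Placeholder trial: replace the body with a short real training launch.
# Here it simulates an out-of-memory failure for any size above 4.
try_micro_batch() {
  local mbs=$1
  (( mbs <= 4 ))
}

# Try candidate sizes in increasing order and keep the last one that succeeded.
largest_fitting_mbs() {
  local best=0 mbs
  for mbs in "$@"; do
    if try_micro_batch "$mbs"; then
      best=$mbs
    else
      break   # larger sizes will not fit either
    fi
  done
  echo "$best"
}

largest_fitting_mbs 1 2 4 8
```

A result of 0 means even the smallest candidate failed, which points to adding model parallelism or activation recomputation rather than shrinking the micro-batch further.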

**Step 4: Tune parallelism degrees**

Rules of thumb:

- Tensor Parallel: Use ≤8 (limited by NVLink within node)
- Pipeline Parallel: Use for >70B models
- Context Parallel: Use for sequences >8K tokens
- Data Parallel: Fill remaining GPUs

Example 405B on 128 H100s:

- TP=8 (1 node)
- PP=8 (across nodes)
- CP=2 (long sequences)
- DP=1
- Total = 8 × 8 × 2 × 1 = 128 GPUs
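Two of these rules of thumb are easy to check mechanically before a launch. The sketch below is a hypothetical `lint_parallelism` helper that only prints advice; the thresholds (GPUs per node for TP, 8192 tokens for CP) are taken from the rules above and are guidelines, not limits enforced by Megatron-Core:

```bash
# Print advice when a layout breaks the tensor-parallel or context-parallel rule of thumb.
# lint_parallelism is a hypothetical helper, not part of Megatron-Core.
lint_parallelism() {
  local tp=$1 gpus_per_node=$2 seq_len=$3 cp=$4
  if (( tp > gpus_per_node )); then
    echo "warn: TP=$tp exceeds $gpus_per_node GPUs per node, so tensor-parallel traffic crosses nodes"
  fi
  if (( seq_len > 8192 && cp == 1 )); then
    echo "hint: sequence length $seq_len is above 8K tokens; consider CP>1"
  fi
}

lint_parallelism 16 8 4096 1    # TP too large for an 8-GPU node
lint_parallelism 8 8 32768 1    # long sequences without context parallelism
```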


When to use vs alternatives

Use Megatron-Core when:

- Training models >10B parameters
- Need maximum efficiency (target >40% MFU)
- Using NVIDIA GPUs (A100, H100)
- Production training at scale
- Want fine-grained parallelism control

Use alternatives instead:

- PyTorch FSDP: Models <70B, simpler API, PyTorch native
- DeepSpeed: Easier setup, good for <100B models
- HuggingFace Accelerate: Prototyping, simpler workflows
- LitGPT: Educational, single-file implementations

Common issues

**Issue: Low GPU utilization (<30% MFU)**

Causes:

- Micro-batch too small
- Too much parallelism overhead
- Not using Flash Attention

Fixes:

```bash
# Increase micro-batch
--micro-batch-size 4  # Was 1

# Enable optimizations
--use-flash-attn
--sequence-parallel

# Reduce TP if >8
--tensor-model-parallel-size 4  # Was 16
```


**Issue: Out of memory**

Reduce memory with:
```bash
--tensor-model-parallel-size 2  # Split model across GPUs
--recompute-granularity full  # Activation recomputation (gradient checkpointing)
--recompute-method block  # Recompute a fixed number of layers per pipeline stage
--recompute-num-layers 1  # Layers recomputed per stage; increase to save more memory
```

Or use CPU/NVMe offloading:

```bash
--cpu-optimizer  # Offload optimizer to CPU
--cpu-optimizer-type ADAM  # CPU Adam variant
```

**Issue: Training slower than expected**

Check:

1. Network bottleneck: Ensure InfiniBand/NVLink enabled
2. Pipeline bubbles: Use interleaved pipeline schedule
   ```bash
   --num-layers-per-virtual-pipeline-stage 2
   ```
3. Data loading: Use fast data loader
   ```bash
   --dataloader-type cyclic
   ```

**Issue: Diverging loss**

Stabilize training:

```bash
--lr-warmup-iters 2000  # Longer warmup
--clip-grad 1.0  # Gradient clipping
--init-method-std 0.006  # Smaller init
--attention-dropout 0.0  # No dropout in attention
--hidden-dropout 0.0  # No dropout in FFN
```

Advanced topics

Parallelism strategies: See references/parallelism-guide.md for detailed comparison of TP/PP/DP/CP/EP with performance analysis and when to use each.

Performance benchmarks: See references/benchmarks.md for MFU numbers across different model sizes and GPU configurations.

Production configurations: See references/production-examples.md for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.

Training recipes: See references/training-recipes.md for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.

Hardware requirements

- GPU: NVIDIA Ampere+ (A100, H100, B200)
  - Turing works but slower
  - FP8 requires Hopper/Ada/Blackwell
- Network: InfiniBand or 400Gb+ Ethernet for multi-node
- Memory per GPU:
  - 7B model: 40GB+
  - 70B model: 80GB (with TP=4)
  - 405B model: 80GB (with TP=8, PP=8)
- Storage: Fast NVMe for checkpoints (1TB+ for 70B+ models)
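For a first-order sense of where per-GPU memory goes, model state can be estimated from the parameter count and the model-parallel sizes. The sketch below assumes about 16 bytes per parameter for mixed-precision Adam (BF16 weights and gradients plus FP32 master weights and two optimizer moments), no sharding of optimizer state across data-parallel ranks, and no activation memory; `model_state_gb` is a hypothetical helper and the 16-byte figure is a rule of thumb, not a Megatron-Core constant:

```bash
# Approximate weights + gradients + optimizer state per GPU, in GB.
# model_state_gb is a hypothetical helper; bytes per parameter defaults to an assumed 16.
model_state_gb() {
  local params_billions=$1 tp=$2 pp=$3 bytes_per_param=${4:-16}
  # Billions of parameters x bytes per parameter gives GB; TP and PP both split the model.
  echo $(( params_billions * bytes_per_param / (tp * pp) ))
}

model_state_gb 70 4 4   # 70B model with TP=4 and PP=4
```

Actual usage also depends on activations, sequence length, micro-batch size, and whether optimizer state is sharded or offloaded, so treat the output as a lower bound for planning.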

Resources

- Docs: https://docs.nvidia.com/megatron-core/
- GitHub: https://github.com/NVIDIA/Megatron-LM
- Papers:
  - "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (2019)
  - "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (2021)
- NeMo Framework: https://docs.nvidia.com/nemo-framework/ (built on Megatron-Core)
