training-llms-megatron


Megatron-Core - Large-Scale LLM Training


Quick start


Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization (MFU) on H100 GPUs through advanced parallelism strategies.

**Installation**:

```bash
# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

# Or pip
pip install megatron-core
```

**Simple distributed training**:

```bash
# Train with 2 GPUs using data parallelism
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py

# Or LLaMA-3 8B training
./examples/llama/train_llama3_8b_fp8.sh
```

Common workflows


Workflow 1: Train LLaMA-style model with 3D parallelism


Copy this checklist:

LLaMA Training Setup:
- [ ] Step 1: Choose parallelism configuration
- [ ] Step 2: Configure training hyperparameters
- [ ] Step 3: Launch distributed training
- [ ] Step 4: Monitor performance metrics

**Step 1: Choose parallelism configuration**

Model size determines the parallelism strategy:

| Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel |
|------------|------|-----------------|-------------------|---------------|------------------|
| 7B         | 8    | 1               | 1                 | 8             | 1                |
| 13B        | 8    | 2               | 1                 | 4             | 1                |
| 70B        | 64   | 4               | 4                 | 4             | 1                |
| 405B       | 128  | 8               | 8                 | 1             | 2                |

**Step 2: Configure training hyperparameters**

```bash
#!/bin/bash
# train_llama_70b.sh

GPUS_PER_NODE=8
NNODES=8   # 64 GPUs total
TP=4       # Tensor parallel
PP=4       # Pipeline parallel
CP=1       # Context parallel

# LLaMA 70B configuration
MODEL_SIZE=70   # Billion parameters
HIDDEN_SIZE=8192
NUM_LAYERS=80
NUM_HEADS=64
SEQ_LENGTH=4096

# Training hyperparameters
MICRO_BATCH=1
GLOBAL_BATCH=1024
LR=3e-4

torchrun \
  --nproc_per_node=$GPUS_PER_NODE \
  --nnodes=$NNODES \
  pretrain_gpt.py \
  --tensor-model-parallel-size $TP \
  --pipeline-model-parallel-size $PP \
  --context-parallel-size $CP \
  --sequence-parallel \
  --num-layers $NUM_LAYERS \
  --hidden-size $HIDDEN_SIZE \
  --num-attention-heads $NUM_HEADS \
  --seq-length $SEQ_LENGTH \
  --max-position-embeddings $SEQ_LENGTH \
  --micro-batch-size $MICRO_BATCH \
  --global-batch-size $GLOBAL_BATCH \
  --lr $LR \
  --train-iters 100000 \
  --lr-decay-style cosine \
  --lr-warmup-iters 2000 \
  --weight-decay 0.1 \
  --clip-grad 1.0 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
```
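As a sanity check, the batch settings above are tied together by standard global-batch accounting: the number of gradient-accumulation steps is implied by the global batch, micro batch, and data-parallel size. A minimal sketch with the values from this script:

```shell
# Derive data-parallel size and gradient-accumulation steps from the
# configuration above (standard global-batch accounting).
GPUS=$((8 * 8))                  # GPUS_PER_NODE * NNODES = 64
TP=4; PP=4; CP=1
DP=$((GPUS / (TP * PP * CP)))    # data-parallel replicas
MICRO_BATCH=1
GLOBAL_BATCH=1024
ACCUM=$((GLOBAL_BATCH / (MICRO_BATCH * DP)))   # accumulation steps per iteration
echo "DP=$DP grad_accum=$ACCUM"  # DP=4 grad_accum=256
```

If `GLOBAL_BATCH` is not divisible by `MICRO_BATCH * DP`, the configuration is invalid, so this arithmetic is worth running before a multi-node launch.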

**Step 3: Launch distributed training**

```bash
# Single node (8 GPUs)
bash train_llama_70b.sh

# Multi-node with SLURM
sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh
```

**Step 4: Monitor performance metrics**

Key metrics to track:
- Model FLOP Utilization (MFU): target >40% on H100
- Throughput: tokens/sec/GPU
- Memory usage: <80GB per GPU for a 70B model
- Loss: should decrease steadily
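MFU can be estimated from measured throughput with the common 6·N FLOPs-per-token approximation. A rough sketch (the throughput number here is illustrative, not a measured value):

```shell
# Estimate MFU from observed throughput using the ~6*N FLOPs/token rule.
# Illustrative numbers; H100 BF16 dense peak is ~989 TFLOPS.
PARAMS_B=70            # parameters, in billions
TOKENS_PER_SEC=940     # per-GPU throughput (example value)
PEAK_TFLOPS=989
# achieved TFLOPS = 6 * params(B) * tokens/s / 1000; MFU = achieved / peak
MFU_PCT=$(( 6 * PARAMS_B * TOKENS_PER_SEC * 100 / (PEAK_TFLOPS * 1000) ))
echo "MFU ~= ${MFU_PCT}%"   # ~39%, near the >40% target
```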

Workflow 2: Configure Mixture of Experts (MoE) training


For sparse MoE models like Mixtral.

MoE Training:
- [ ] Step 1: Configure expert parallelism
- [ ] Step 2: Set MoE hyperparameters
- [ ] Step 3: Launch training with EP

**Step 1: Configure expert parallelism**

```bash
# Mixtral 8x7B example
TENSOR_PARALLEL=2
PIPELINE_PARALLEL=1
EXPERT_PARALLEL=4   # Split 8 experts across 4 GPUs
DATA_PARALLEL=4

TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * EXPERT_PARALLEL * DATA_PARALLEL))
# = 2 * 1 * 4 * 4 = 32 GPUs
```
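One constraint worth checking before launch: the expert count must divide evenly across the expert-parallel group. A quick sketch:

```shell
# num-experts must be divisible by expert-model-parallel-size,
# since each EP rank hosts num_experts / EP experts.
NUM_EXPERTS=8
EP=4
if [ $((NUM_EXPERTS % EP)) -ne 0 ]; then
  echo "invalid: $NUM_EXPERTS experts not divisible by EP=$EP" >&2
  exit 1
fi
LOCAL_EXPERTS=$((NUM_EXPERTS / EP))
echo "each EP rank hosts $LOCAL_EXPERTS experts"   # 2
```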


**Step 2: Set MoE hyperparameters**

```bash
torchrun \
  --nproc_per_node=8 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 1 \
  --expert-model-parallel-size 4 \
  --num-experts 8 \
  --moe-router-topk 2 \
  --moe-router-load-balancing-type aux_loss \
  --moe-aux-loss-coeff 0.01 \
  --hidden-size 4096 \
  --num-layers 32 \
  --num-attention-heads 32 \
  --seq-length 4096 \
  --max-position-embeddings 4096 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
```

**Step 3: Launch training with EP**

Expert parallelism distributes different experts across GPUs, reducing per-GPU memory while maintaining model capacity:

```
Memory without EP: 8 experts × 7B = 56GB per GPU
Memory with EP=4:  2 experts × 7B = 14GB per GPU
Savings: 75% memory reduction
```
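The memory arithmetic above can be reproduced directly (a sketch using this section's rough per-expert figure of 7 GB; real footprints also include activations and optimizer state):

```shell
# Expert-weight memory per GPU, with and without expert parallelism.
NUM_EXPERTS=8
GB_PER_EXPERT=7      # rough per-expert weight footprint from the text above
EP=4
NO_EP_GB=$((NUM_EXPERTS * GB_PER_EXPERT))            # all experts resident
WITH_EP_GB=$((NUM_EXPERTS / EP * GB_PER_EXPERT))     # only local experts resident
SAVINGS_PCT=$((100 - 100 * WITH_EP_GB / NO_EP_GB))
echo "no EP: ${NO_EP_GB}GB, EP=$EP: ${WITH_EP_GB}GB, savings: ${SAVINGS_PCT}%"
```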

Workflow 3: Optimize for maximum throughput


Achieve 47% MFU on H100.

Performance Optimization:
- [ ] Step 1: Enable Flash Attention
- [ ] Step 2: Use FP8 precision (H100)
- [ ] Step 3: Optimize micro-batch size
- [ ] Step 4: Tune parallelism degrees

**Step 1: Enable optimizations**

```bash
--use-mcore-models  # Use Megatron Core models
--transformer-impl transformer_engine  # Use Transformer Engine
--sequence-parallel  # Reduce activation memory (use with TP)
```

**Step 2: Use FP8 precision (H100 only)**

```bash
--fp8-hybrid  # FP8 mixed precision training
# Transformer Engine handles FP8 automatically
```

Result: 1.5-2x speedup on H100 vs BF16.

**Step 3: Optimize micro-batch size**

Find the largest micro-batch size that fits in memory:

```bash
# Start with 1, increase until OOM
for MBS in 1 2 4 8; do
  echo "Testing micro-batch-size=$MBS"
  torchrun ... --micro-batch-size $MBS
done
```

Typical values:
- 7B model: 4-8
- 70B model: 1-2
- 405B model: 1

**Step 4: Tune parallelism degrees**

Rules of thumb:
- Tensor Parallel: use ≤8 (limited by NVLink within a node)
- Pipeline Parallel: use for >70B models
- Context Parallel: use for sequences >8K tokens
- Data Parallel: fill the remaining GPUs

Example: 405B on 128 H100s:

```
TP=8  (1 node)
PP=8  (across nodes)
CP=2  (long sequences)
DP=1
Total = 8 × 8 × 2 × 1 = 128 GPUs
```
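The rules of thumb above fold into a small layout check (a sketch; the thresholds are this section's heuristics, not hard Megatron limits):

```shell
# Sanity-check a parallelism layout against the heuristics above.
TP=8; PP=8; CP=2; DP=1
WORLD=$((TP * PP * CP * DP))
# TP beyond 8 crosses the single-node NVLink domain and usually hurts.
[ "$TP" -le 8 ] || echo "warn: TP=$TP exceeds the single-node NVLink domain" >&2
echo "world size = $WORLD GPUs"   # 128
```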

When to use vs alternatives


Use Megatron-Core when:
  • Training models >10B parameters
  • Need maximum efficiency (target >40% MFU)
  • Using NVIDIA GPUs (A100, H100)
  • Production training at scale
  • Want fine-grained parallelism control
Use alternatives instead:
  • PyTorch FSDP: Models <70B, simpler API, PyTorch native
  • DeepSpeed: Easier setup, good for <100B models
  • HuggingFace Accelerate: Prototyping, simpler workflows
  • LitGPT: Educational, single-file implementations

Common issues


**Issue: Low GPU utilization (<30% MFU)**

Causes:
  1. Micro-batch too small
  2. Too much parallelism overhead
  3. Not using Flash Attention

Fixes:

```bash
# Increase micro-batch
--micro-batch-size 4  # Was 1

# Enable optimizations
--use-flash-attn --sequence-parallel

# Reduce TP if >8
--tensor-model-parallel-size 4  # Was 16
```

**Issue: Out of memory**

Reduce memory with:

```bash
--tensor-model-parallel-size 2  # Split model across GPUs
--recompute-granularity full  # Gradient checkpointing
--recompute-method block  # Checkpoint transformer blocks
--recompute-num-layers 1  # Checkpoint every layer
```

Or use CPU/NVMe offloading:

```bash
--cpu-optimizer  # Offload optimizer to CPU
--cpu-optimizer-type ADAM  # CPU Adam variant
```

**Issue: Training slower than expected**

Check:
  1. Network bottleneck: ensure InfiniBand/NVLink is enabled
  2. Pipeline bubbles: use an interleaved pipeline schedule:
     ```bash
     --num-layers-per-virtual-pipeline-stage 2
     ```
  3. Data loading: use the fast data loader:
     ```bash
     --dataloader-type cyclic
     ```

**Issue: Diverging loss**

Stabilize training:

```bash
--lr-warmup-iters 2000  # Longer warmup
--clip-grad 1.0  # Gradient clipping
--init-method-std 0.006  # Smaller init
--attention-dropout 0.0  # No dropout in attention
--hidden-dropout 0.0  # No dropout in FFN
```
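When diagnosing OOM, a back-of-the-envelope model-state estimate helps decide between more model parallelism and recompute/offload. A sketch (assumes ~18 bytes/param for mixed-precision Adam: bf16 weights, fp32 grads, fp32 master weights, and Adam moments; activations excluded):

```shell
# Rough per-GPU model-state memory for a 70B model at TP=4, PP=4.
PARAMS_B=70
BYTES_PER_PARAM=18   # 2 bf16 weights + 4 fp32 grads + 4 fp32 master + 8 Adam m,v
TP=4; PP=4
MEM_GB=$((PARAMS_B * BYTES_PER_PARAM / (TP * PP)))
echo "~${MEM_GB}GB model state per GPU before activations"   # ~78GB
```

This lands near the 80GB ceiling of an H100, which is why the 70B recipe also needs activation recompute or a distributed optimizer to fit comfortably.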

Advanced topics


Parallelism strategies: See references/parallelism-guide.md for detailed comparison of TP/PP/DP/CP/EP with performance analysis and when to use each.
Performance benchmarks: See references/benchmarks.md for MFU numbers across different model sizes and GPU configurations.
Production configurations: See references/production-examples.md for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.
Training recipes: See references/training-recipes.md for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.

Hardware requirements


  • GPU: NVIDIA Ampere+ (A100, H100, B200)
    • Turing works but slower
    • FP8 requires Hopper/Ada/Blackwell
  • Network: InfiniBand or 400Gb+ Ethernet for multi-node
  • Memory per GPU:
    • 7B model: 40GB+
    • 70B model: 80GB (with TP=4)
    • 405B model: 80GB (with TP=8, PP=8)
  • Storage: Fast NVMe for checkpoints (1TB+ for 70B+ models)

Resources
