training-llms-megatron
Megatron-Core - Large-Scale LLM Training
Quick start
Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization on H100 GPUs through advanced parallelism strategies.
Installation:

```bash
# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

# Or pip
pip install megatron-core
```
**Simple distributed training**:

```bash
# Train with 2 GPUs using data parallelism
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py

# Or LLaMA-3 8B training
./examples/llama/train_llama3_8b_fp8.sh
```
Common workflows
Workflow 1: Train LLaMA-style model with 3D parallelism
Copy this checklist:

LLaMA Training Setup:
- [ ] Step 1: Choose parallelism configuration
- [ ] Step 2: Configure training hyperparameters
- [ ] Step 3: Launch distributed training
- [ ] Step 4: Monitor performance metrics

**Step 1: Choose parallelism configuration**
Model size determines parallelism strategy:
| Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel |
|---|---|---|---|---|---|
| 7B | 8 | 1 | 1 | 8 | 1 |
| 13B | 8 | 2 | 1 | 4 | 1 |
| 70B | 64 | 4 | 4 | 4 | 1 |
| 405B | 128 | 8 | 8 | 1 | 2 |
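The table above encodes one invariant worth checking before launch: the product of all parallel degrees must equal the total GPU count. A minimal sketch of that check (the `CONFIGS` dict and helper are illustrative, not part of Megatron-Core):

```python
# Parallelism layouts from the table above.
# Invariant: TP * PP * CP * DP == total GPUs.
CONFIGS = {
    "7B":   {"gpus": 8,   "tp": 1, "pp": 1, "cp": 1},
    "13B":  {"gpus": 8,   "tp": 2, "pp": 1, "cp": 1},
    "70B":  {"gpus": 64,  "tp": 4, "pp": 4, "cp": 1},
    "405B": {"gpus": 128, "tp": 8, "pp": 8, "cp": 2},
}

def data_parallel_degree(model_size: str) -> int:
    """Derive DP so the product of all degrees equals the GPU count."""
    c = CONFIGS[model_size]
    model_gpus = c["tp"] * c["pp"] * c["cp"]
    assert c["gpus"] % model_gpus == 0, "GPU count must divide evenly"
    return c["gpus"] // model_gpus

print(data_parallel_degree("70B"))  # 4, matching the table
```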
**Step 2: Configure training hyperparameters**

```bash
#!/bin/bash
# train_llama_70b.sh

GPUS_PER_NODE=8
NNODES=8   # 64 GPUs total
TP=4       # Tensor parallel
PP=4       # Pipeline parallel
CP=1       # Context parallel

# LLaMA 70B configuration
MODEL_SIZE=70   # Billion parameters
HIDDEN_SIZE=8192
NUM_LAYERS=80
NUM_HEADS=64
SEQ_LENGTH=4096

# Training hyperparameters
MICRO_BATCH=1
GLOBAL_BATCH=1024
LR=3e-4

torchrun \
  --nproc_per_node=$GPUS_PER_NODE \
  --nnodes=$NNODES \
  pretrain_gpt.py \
  --tensor-model-parallel-size $TP \
  --pipeline-model-parallel-size $PP \
  --context-parallel-size $CP \
  --sequence-parallel \
  --num-layers $NUM_LAYERS \
  --hidden-size $HIDDEN_SIZE \
  --num-attention-heads $NUM_HEADS \
  --seq-length $SEQ_LENGTH \
  --max-position-embeddings $SEQ_LENGTH \
  --micro-batch-size $MICRO_BATCH \
  --global-batch-size $GLOBAL_BATCH \
  --lr $LR \
  --train-iters 100000 \
  --lr-decay-style cosine \
  --lr-warmup-iters 2000 \
  --weight-decay 0.1 \
  --clip-grad 1.0 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
```

**Step 3: Launch distributed training**

```bash
# Single node (8 GPUs)
bash train_llama_70b.sh

# Multi-node with SLURM
sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh
```
**Step 4: Monitor performance metrics**

Key metrics to track:
- Model FLOP Utilization (MFU): Target >40% on H100
- Throughput: Tokens/sec/GPU
- Memory usage: <80GB per GPU for 70B model
- Loss: Should decrease steadily
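To turn logged throughput into an MFU estimate, a common approximation for dense transformers is ~6 FLOPs per parameter per token for a combined forward and backward pass, divided by hardware peak. The helper below is a sketch; the 989 TFLOP/s figure assumes H100 BF16 dense peak, and the 1100 tokens/sec/GPU input is illustrative:

```python
def model_flop_utilization(params_b: float, tokens_per_sec_per_gpu: float,
                           peak_tflops: float = 989.0) -> float:
    """Estimate MFU: achieved FLOP/s over hardware peak FLOP/s.

    Uses the common ~6 * params FLOPs-per-token approximation
    (forward + backward) for dense transformers.
    """
    achieved_flops = 6 * params_b * 1e9 * tokens_per_sec_per_gpu
    return achieved_flops / (peak_tflops * 1e12)

# Example: a 70B model at ~1100 tokens/sec/GPU on H100
print(f"{model_flop_utilization(70, 1100):.1%}")  # 46.7%
```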
Workflow 2: Configure Mixture of Experts (MoE) training
For sparse MoE models like Mixtral.

MoE Training:
- [ ] Step 1: Configure expert parallelism
- [ ] Step 2: Set MoE hyperparameters
- [ ] Step 3: Launch training with EP

**Step 1: Configure expert parallelism**

```bash
# Mixtral 8x7B example
TENSOR_PARALLEL=2
PIPELINE_PARALLEL=1
EXPERT_PARALLEL=4   # Split 8 experts across 4 GPUs
DATA_PARALLEL=4
TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * EXPERT_PARALLEL * DATA_PARALLEL))
# = 2 * 1 * 4 * 4 = 32 GPUs
```
**Step 2: Set MoE hyperparameters**
```bash
torchrun \
--nproc_per_node=8 \
pretrain_gpt.py \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 1 \
--expert-model-parallel-size 4 \
--num-experts 8 \
--moe-router-topk 2 \
--moe-router-load-balancing-type aux_loss \
--moe-aux-loss-coeff 0.01 \
--hidden-size 4096 \
--num-layers 32 \
--num-attention-heads 32 \
--seq-length 4096 \
--max-position-embeddings 4096 \
--bf16 \
--use-mcore-models \
--transformer-impl transformer_engine \
--data-path /path/to/data \
--vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt
```

**Step 3: Launch training with EP**
Expert parallelism distributes different experts across GPUs, reducing memory while maintaining capacity.
Memory without EP: 8 experts × 7B = 56GB per GPU
Memory with EP=4: 2 experts × 7B = 14GB per GPU
Savings: 75% memory reduction
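The arithmetic above can be sketched as follows. Note the figures in the text count roughly 1 byte per parameter; bf16 weights alone would double these numbers, and optimizer state and activations are excluded:

```python
def expert_memory_gb(num_experts: int, params_per_expert_b: float,
                     ep_degree: int, bytes_per_param: float = 1.0) -> float:
    """Expert weight memory per GPU under expert parallelism (EP)."""
    experts_per_gpu = num_experts // ep_degree  # experts are sharded, not replicated
    return experts_per_gpu * params_per_expert_b * bytes_per_param

no_ep = expert_memory_gb(8, 7, ep_degree=1)    # 56 GB: all 8 experts on one GPU
with_ep = expert_memory_gb(8, 7, ep_degree=4)  # 14 GB: 2 experts per GPU
print(f"savings: {1 - with_ep / no_ep:.0%}")   # savings: 75%
```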
Workflow 3: Optimize for maximum throughput
Achieve 47% MFU on H100.
Performance Optimization:
- [ ] Step 1: Enable Flash Attention
- [ ] Step 2: Use FP8 precision (H100)
- [ ] Step 3: Optimize micro-batch size
- [ ] Step 4: Tune parallelism degrees

**Step 1: Enable optimizations**

```bash
--use-mcore-models                      # Use Megatron Core models
--transformer-impl transformer_engine   # Use Transformer Engine
--sequence-parallel                     # Reduce activation memory (use with TP)
```

**Step 2: Use FP8 precision (H100 only)**

```bash
--fp8-hybrid   # FP8 mixed precision training
# Transformer Engine handles FP8 automatically
```
Result: 1.5-2x speedup on H100 vs BF16.
**Step 3: Optimize micro-batch size**
Find largest micro-batch that fits in memory:
```bash
# Start with 1, increase until OOM
for MBS in 1 2 4 8; do
    echo "Testing micro-batch-size=$MBS"
    torchrun ... --micro-batch-size $MBS
done
```
Typical values:
- 7B model: 4-8
- 70B model: 1-2
- 405B model: 1
**Step 4: Tune parallelism degrees**
Rules of thumb:
- Tensor Parallel: Use ≤8 (limited by NVLink within node)
- Pipeline Parallel: Use for >70B models
- Context Parallel: Use for sequences >8K tokens
- Data Parallel: Fill remaining GPUs

Example 405B on 128 H100s:
- TP=8 (1 node)
- PP=8 (across nodes)
- CP=2 (long sequences)
- DP=1
- Total = 8 × 8 × 2 × 1 = 128 GPUs
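The rules of thumb above lend themselves to a quick sanity check before launching. This is a hypothetical helper, not a Megatron-Core API:

```python
def check_layout(world_size: int, tp: int, pp: int, cp: int, dp: int,
                 seq_len: int = 4096) -> list:
    """Check a parallelism layout against the rules of thumb; return issues."""
    issues = []
    if tp * pp * cp * dp != world_size:
        issues.append(f"degree product {tp * pp * cp * dp} != world size {world_size}")
    if tp > 8:
        issues.append("TP > 8 exceeds the typical NVLink domain within a node")
    if seq_len > 8192 and cp == 1:
        issues.append("consider context parallelism for sequences > 8K")
    return issues

# The 405B example above: TP=8, PP=8, CP=2, DP=1 on 128 GPUs
print(check_layout(128, tp=8, pp=8, cp=2, dp=1))  # [] — no issues
```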
When to use vs alternatives
Use Megatron-Core when:
- Training models >10B parameters
- Need maximum efficiency (target >40% MFU)
- Using NVIDIA GPUs (A100, H100)
- Production training at scale
- Want fine-grained parallelism control
Use alternatives instead:
- PyTorch FSDP: Models <70B, simpler API, PyTorch native
- DeepSpeed: Easier setup, good for <100B models
- HuggingFace Accelerate: Prototyping, simpler workflows
- LitGPT: Educational, single-file implementations
Common issues
**Issue: Low GPU utilization (<30% MFU)**

Causes:
- Micro-batch too small
- Too much parallelism overhead
- Not using Flash Attention

Fixes:

```bash
# Increase micro-batch
--micro-batch-size 4   # Was 1

# Enable optimizations
--use-flash-attn
--sequence-parallel

# Reduce TP if >8
--tensor-model-parallel-size 4   # Was 16
```
**Issue: Out of memory**

Reduce memory with:

```bash
--tensor-model-parallel-size 2   # Split model across GPUs
--recompute-granularity full     # Gradient checkpointing
--recompute-method block         # Checkpoint transformer blocks
--recompute-num-layers 1         # Checkpoint every layer
```

Or use CPU/NVMe offloading:

```bash
--cpu-optimizer             # Offload optimizer to CPU
--cpu-optimizer-type ADAM   # CPU Adam variant
```

**Issue: Training slower than expected**

Check:
- Network bottleneck: Ensure InfiniBand/NVLink enabled
- Pipeline bubbles: Use interleaved pipeline schedule (`--num-layers-per-virtual-pipeline-stage 2`)
- Data loading: Use fast data loader (`--dataloader-type cyclic`)
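On why the out-of-memory fixes above help: with bf16 weights and gradients plus fp32 master weights and Adam moments, static state costs roughly 16 bytes per parameter, so either sharding it across GPUs or offloading the 12 bytes/param of fp32 optimizer state dominates the savings. A rough sketch (activations excluded; the byte accounting is the usual mixed-precision Adam breakdown, stated here as an assumption):

```python
def static_memory_gb(params_b: float, tp: int = 1, pp: int = 1,
                     offload_optimizer: bool = False) -> float:
    """Weights + gradients + optimizer state per GPU, in GB.

    16 bytes/param = bf16 weight (2) + bf16 grad (2) + fp32 master (4)
    + fp32 Adam momentum (4) + fp32 Adam variance (4).
    Offloading moves the 12 bytes/param of fp32 state off the GPU.
    """
    params_per_gpu_b = params_b / (tp * pp)
    bytes_per_param = 4 if offload_optimizer else 16
    return params_per_gpu_b * bytes_per_param

print(static_memory_gb(70, tp=4, pp=4))                          # 70.0 GB
print(static_memory_gb(70, tp=4, pp=4, offload_optimizer=True))  # 17.5 GB
```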
**Issue: Diverging loss**

Stabilize training:

```bash
--lr-warmup-iters 2000    # Longer warmup
--clip-grad 1.0           # Gradient clipping
--init-method-std 0.006   # Smaller init
--attention-dropout 0.0   # No dropout in attention
--hidden-dropout 0.0      # No dropout in FFN
```
Advanced topics
Parallelism strategies: See references/parallelism-guide.md for detailed comparison of TP/PP/DP/CP/EP with performance analysis and when to use each.
Performance benchmarks: See references/benchmarks.md for MFU numbers across different model sizes and GPU configurations.
Production configurations: See references/production-examples.md for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.
Training recipes: See references/training-recipes.md for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.
Hardware requirements
- GPU: NVIDIA Ampere+ (A100, H100, B200)
- Turing works but slower
- FP8 requires Hopper/Ada/Blackwell
- Network: InfiniBand or 400Gb+ Ethernet for multi-node
- Memory per GPU:
- 7B model: 40GB+
- 70B model: 80GB (with TP=4)
- 405B model: 80GB (with TP=8, PP=8)
- Storage: Fast NVMe for checkpoints (1TB+ for 70B+ models)
Resources
- Docs: https://docs.nvidia.com/megatron-core/
- GitHub: https://github.com/NVIDIA/Megatron-LM
- Papers:
- "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (2019)
- "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (2021)
- NeMo Framework: https://docs.nvidia.com/nemo-framework/ (built on Megatron-Core)