mlm-bridge-training

MLM vs Bridge Training

For how they differ, the arg mapping tables, gotchas, and translation script, see:
  • @docs/megatron-lm-to-megatron-bridge.md

Correlation Testing

Use vanilla_gpt_pretrain_config for loss-correlation testing. This recipe uses bare GPTModelProvider defaults (LayerNorm, GeLU, learned_absolute position embeddings, vocab_size inherited from tokenizer), matching MLM pretrain_gpt.py defaults with no args.

MLM Correlation Run (2L/256H, 1 GPU)

bash
PYTHONPATH=3rdparty/Megatron-LM:$PYTHONPATH \
uv run python -m torch.distributed.run --nproc_per_node=1 \
  3rdparty/Megatron-LM/pretrain_gpt.py \
  --num-layers 2 --hidden-size 256 --num-attention-heads 4 \
  --ffn-hidden-size 1024 --seq-length 512 --max-position-embeddings 512 \
  --micro-batch-size 4 --global-batch-size 32 \
  --train-iters 10 --eval-iters 2 --eval-interval 10 \
  --mock-data --bf16 --use-mcore-models \
  --tokenizer-type NullTokenizer --vocab-size 32000 \
  --lr 3e-4 --min-lr 3e-5 --seed 1234 --log-interval 1

Bridge Correlation Run (same config, 1 GPU)

bash
rm -rf nemo_experiments && \
uv run python -m torch.distributed.run --nproc_per_node=1 \
  scripts/training/run_recipe.py \
  --recipe vanilla_gpt_pretrain_config \
  model.num_layers=2 model.hidden_size=256 \
  model.num_attention_heads=4 model.ffn_hidden_size=1024 \
  model.seq_length=512 dataset.sequence_length=512 \
  train.train_iters=10 train.global_batch_size=32 train.micro_batch_size=4 \
  validation.eval_interval=10 validation.eval_iters=2 \
  optimizer.lr=3e-4 optimizer.min_lr=3e-5 \
  scheduler.lr_warmup_iters=1 scheduler.lr_decay_iters=10 \
  rng.seed=1234 logger.log_interval=1
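
The two commands above pair each MLM flag with a Bridge override. As a rough illustration, that pairing can be written down as a lookup table. This is a minimal sketch, not the translation script referenced in @docs/megatron-lm-to-megatron-bridge.md: the function name mlm_to_bridge and both table names are hypothetical, and the table covers only the flag/override pairs that appear in the commands in this document.

```python
"""Sketch: translate a few MLM flags into Bridge CLI overrides."""

# MLM flag -> Bridge override keys, read off the paired commands in this
# document only. This is an illustrative subset, not a complete mapping.
FLAG_TO_OVERRIDES = {
    "--num-layers": ["model.num_layers"],
    "--hidden-size": ["model.hidden_size"],
    "--num-attention-heads": ["model.num_attention_heads"],
    "--ffn-hidden-size": ["model.ffn_hidden_size"],
    # The sequence length is set on both the model and the dataset.
    "--seq-length": ["model.seq_length", "dataset.sequence_length"],
    "--micro-batch-size": ["train.micro_batch_size"],
    "--global-batch-size": ["train.global_batch_size"],
    "--train-iters": ["train.train_iters"],
    "--eval-iters": ["validation.eval_iters"],
    "--eval-interval": ["validation.eval_interval"],
    "--lr": ["optimizer.lr"],
    "--min-lr": ["optimizer.min_lr"],
    "--seed": ["rng.seed"],
    "--log-interval": ["logger.log_interval"],
    "--tensor-model-parallel-size": ["model.tensor_model_parallel_size"],
}
# Value-less switches that become "<key>=true".
BOOL_FLAG_TO_OVERRIDE = {"--sequence-parallel": "model.sequence_parallel"}


def mlm_to_bridge(argv):
    """Return (overrides, unmapped) for a list of MLM CLI tokens."""
    overrides, unmapped = [], []
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok in FLAG_TO_OVERRIDES and i + 1 < len(argv):
            value = argv[i + 1]
            overrides += [f"{key}={value}" for key in FLAG_TO_OVERRIDES[tok]]
            i += 2
        elif tok in BOOL_FLAG_TO_OVERRIDE:
            overrides.append(f"{BOOL_FLAG_TO_OVERRIDE[tok]}=true")
            i += 1
        else:
            # Anything not in the tables is reported, never guessed.
            unmapped.append(tok)
            i += 1
    return overrides, unmapped


if __name__ == "__main__":
    ov, rest = mlm_to_bridge("--num-layers 2 --seq-length 512 --mock-data".split())
    print(" ".join(ov))
    print("unmapped:", rest)
```

Flags with no entry in the table (for example --mock-data or --tokenizer-type) are returned in the unmapped list, so they can be checked by hand against the mapping tables in the linked doc.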

Verification

With matched parameters the LM losses should be nearly identical at each iteration. Compare lm loss values from both logs; they should agree to within BF16 rounding.
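
The comparison can be automated with a small script. The following is a sketch under stated assumptions: it assumes each training log line contains "iteration N/ M" followed later by "lm loss: X" (adjust the regular expression if your logs differ). The default tolerance of 2^-7 (BF16 machine epsilon) is an assumed, loose threshold, not a value taken from either codebase. The names parse_lm_losses and compare_losses are hypothetical.

```python
import re

# Assumed log shape: " iteration   3/  10 | ... | lm loss: 1.046E+01 | ..."
# The "/" after the iteration number is what separates training lines
# from other lines that happen to mention an iteration.
LINE_RE = re.compile(r"iteration\s+(\d+)\s*/\s*\d+.*?lm loss:\s*([0-9.eE+-]+)")


def parse_lm_losses(log_text):
    """Map iteration number -> lm loss for every matching log line."""
    losses = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            losses[int(match.group(1))] = float(match.group(2))
    return losses


def compare_losses(mlm_log, bridge_log, rel_tol=2 ** -7):
    """Return (iteration, mlm, bridge, rel_diff) rows that exceed rel_tol."""
    mlm = parse_lm_losses(mlm_log)
    bridge = parse_lm_losses(bridge_log)
    mismatches = []
    # Only iterations present in both logs are compared.
    for it in sorted(mlm.keys() & bridge.keys()):
        denom = max(abs(mlm[it]), abs(bridge[it]), 1e-12)
        rel = abs(mlm[it] - bridge[it]) / denom
        if rel > rel_tol:
            mismatches.append((it, mlm[it], bridge[it], rel))
    return mismatches
```

To use it, read the two captured logs into strings (for example with pathlib.Path(...).read_text()) and pass them to compare_losses; an empty list means every shared iteration agreed within the chosen tolerance.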

Multi-GPU Examples

MLM 2-GPU with TP=2

bash
PYTHONPATH=3rdparty/Megatron-LM:$PYTHONPATH \
uv run python -m torch.distributed.run --nproc_per_node=2 \
  3rdparty/Megatron-LM/pretrain_gpt.py \
  --tensor-model-parallel-size 2 --sequence-parallel \
  --num-layers 4 --hidden-size 256 --num-attention-heads 4 \
  --seq-length 1024 --max-position-embeddings 1024 \
  --micro-batch-size 2 --global-batch-size 16 \
  --train-iters 10 --eval-iters 2 --eval-interval 10 \
  --mock-data --bf16 --use-mcore-models \
  --tokenizer-type NullTokenizer --vocab-size 1024 \
  --lr 1e-4 --log-interval 1

Bridge 2-GPU with TP=2

bash
rm -rf nemo_experiments && \
uv run python -m torch.distributed.run --nproc_per_node=2 \
  scripts/training/run_recipe.py \
  --recipe vanilla_gpt_pretrain_config \
  model.tensor_model_parallel_size=2 model.sequence_parallel=true \
  model.num_layers=4 model.hidden_size=256 \
  model.num_attention_heads=4 model.ffn_hidden_size=1024 \
  model.seq_length=1024 dataset.sequence_length=1024 \
  train.train_iters=10 train.global_batch_size=16 train.micro_batch_size=2 \
  validation.eval_interval=10 validation.eval_iters=2 \
  scheduler.lr_warmup_iters=2 scheduler.lr_decay_iters=10 \
  logger.log_interval=1

Available Recipes

Common recipes (use with --recipe):
  • vanilla_gpt_pretrain_config: Minimal GPT (bare GPTModelProvider defaults, ideal for correlation testing and custom configs)
  • llama32_1b_pretrain_config: Llama 3.2 1B (16L, 2048H, GBS=512, seq=8192)
  • llama3_8b_pretrain_config: Llama 3 8B
  • qwen3_8b_pretrain_config: Qwen3 8B
  • deepseek_v2_lite_pretrain_config: DeepSeek-V2-Lite 16B MoE
SFT/PEFT variants use the _sft_config / _peft_config suffix.

Megatron-Core Submodule

For what the submodule is and why two versions exist, see @docs/megatron-lm-to-megatron-bridge.md.

Check current version

bash
./scripts/switch_mcore.sh status

Switch to dev for testing newer MCore features

bash
./scripts/switch_mcore.sh dev

Run uv sync without --locked, since the lockfile is for main

bash
uv sync

Switch back to main

bash
./scripts/switch_mcore.sh main

After pulling latest main

When you pull the latest Bridge main branch, the submodule pointer may have been updated. Re-sync the submodule:
bash
git submodule update --init 3rdparty/Megatron-LM

Pitfalls

  1. Always rm -rf nemo_experiments before a fresh correlation run. Bridge auto-resumes from stale checkpoints silently.
  2. uv run required: Always use uv run python -m torch.distributed.run (not bare torchrun or python).
  3. MLM PYTHONPATH: Must include 3rdparty/Megatron-LM so gpt_builders.py is importable.
  4. Scheduler overrides: When overriding train.train_iters to a small value, also set scheduler.lr_warmup_iters and scheduler.lr_decay_iters or you get an assertion error.
  5. Use dataset.sequence_length in CLI overrides, not dataset.seq_length.
  6. MoE OOM: Large MoE models require full activation recomputation and typically multi-node EP. TP does NOT reduce per-GPU expert memory.
  7. uv sync --locked fails after switching to dev: The lockfile is generated against the main MCore commit. Use uv sync (without --locked) when on dev.
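
Pitfalls 4 and 5 can be caught before launching a run by checking the list of key=value overrides. The following is a minimal sketch: check_overrides is a hypothetical helper, not part of Bridge, and it encodes only the two rules stated above. It flags any train.train_iters override without the scheduler overrides, whether or not the value is small.

```python
def check_overrides(overrides):
    """Return warnings for a list of 'key=value' Bridge override strings."""
    # Only the keys matter for these two checks.
    keys = {item.split("=", 1)[0] for item in overrides}
    warnings = []

    # Pitfall 5: the dataset key is sequence_length, not seq_length.
    if "dataset.seq_length" in keys:
        warnings.append("use dataset.sequence_length, not dataset.seq_length")

    # Pitfall 4: overriding train_iters should come with both scheduler keys.
    if "train.train_iters" in keys:
        for needed in ("scheduler.lr_warmup_iters", "scheduler.lr_decay_iters"):
            if needed not in keys:
                warnings.append(
                    f"train.train_iters is overridden but {needed} is not"
                )
    return warnings


if __name__ == "__main__":
    for message in check_overrides(["train.train_iters=10", "dataset.seq_length=512"]):
        print("warning:", message)
```

An empty return value means neither rule was triggered; it does not validate the values themselves.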