recipe-recommender

Auto Recipe — Recipe Index & Recommendation

This skill indexes every shipped recipe and helps users pick the right starting config, adjust parallelism, and avoid common pitfalls.

How to Use This Skill

  1. Ask the user for: model name/size, GPU count & type, training goal (pretrain / SFT / PEFT), and sequence length (if non-default).
  2. Look up the best-match recipe in the index below.
  3. Recommend the recipe function name + entry-point command.
  4. Provide adjustment advice (parallelism resizing, batch tuning, pitfalls).

Entry Points

Library recipes (functional training)

```bash
# Pretrain with mock data
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-pretrain-mock

# SFT with SQuAD
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-finetune

# Override any field via CLI
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.tensor_model_parallel_size=2' \
    'training.global_batch_size=64'
```
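When the launch command has to be generated from a script or notebook, it can help to assemble it as an argument list rather than by string concatenation. The sketch below is only an illustration: `build_run_recipe_command` is a hypothetical helper, not part of the repository, and it uses only the flags and override syntax shown in the commands above.

```python
import shlex


def build_run_recipe_command(recipe, dataset, nproc_per_node=8, overrides=None):
    """Assemble the run_recipe.py launch command as an argv list.

    Hypothetical helper: it mirrors the commands shown above and
    does not call into the library itself.
    """
    cmd = [
        "uv", "run", "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        "scripts/training/run_recipe.py",
        "--recipe", recipe,
        "--dataset", dataset,
    ]
    # Overrides are passed as trailing `dotted.key=value` arguments.
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


cmd = build_run_recipe_command(
    "llama3_8b_pretrain_config",
    "llm-pretrain-mock",
    overrides={
        "model.tensor_model_parallel_size": 2,
        "training.global_batch_size": 64,
    },
)
# shlex.join produces a shell-safe string for logging or copy-paste.
print(shlex.join(cmd))
```

Passing the list directly to `subprocess.run(cmd)` avoids shell quoting issues with values such as bracketed lists.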

Performance recipes (throughput benchmarks)

```bash
python scripts/performance/run_script.py \
    --recipe <model_family> \
    --gpu_type h100 \
    --num_gpus 64 \
    --data mock
```
Note: Perf recipes are NOT fully validated for correctness. Most testing and validation were done on mock data. They are designed for upper-bound throughput measurement, not production training. Always validate loss curves and convergence independently.

Recipe Unification (Coming Soon — PR #2803)

PR #2803 is unifying performance recipes into the same Python function format used by library recipes. Key changes:
  • Perf recipes move from `scripts/performance/configs/` to `src/megatron/bridge/recipes/<family>/<model>_perf.py`.
  • Each perf recipe becomes a self-contained Python function (e.g. `llama3_8b_h100_bf16_pretrain_config()`).
  • The old pipeline (`WorkloadBaseConfig`, `set_workload_base_configs`, `get_perf_optimized_recipe`) is removed.
  • Shared helpers: `_benchmark_common()` (50 iters, timing, TE RNG) and `_perf_precision()` (bf16 / fp8_cs / fp8_mx / nvfp4).
Why Python, not YAML? Previous YAML-based approaches had problems: recipe logic was split across multiple indirection layers, configs were not self-contained, and the two-level pipeline made maintenance and debugging difficult. Python functions are explicit, greppable, and composable.
After #2803 lands, both library and perf recipes will be invocable through the same `run_recipe.py` entry point.

Library Recipe Index

All recipes live under `src/megatron/bridge/recipes/`. Each function returns a `ConfigContainer` with model, training, optimizer, and data settings.

Llama

| Recipe | Mode | TP | PP | CP | SP | GPUs (min) | Seq Len |
|---|---|---|---|---|---|---|---|
| `llama2_7b_pretrain_config` | Pretrain | 2 | 1 | | | 2 | 4K |
| `llama3_8b_pretrain_config` | Pretrain | 2 | 1 | | | 2 | 8K |
| `llama3_8b_16k_pretrain_config` | Pretrain | 2 | 1 | 2 | | 4 | 16K |
| `llama3_8b_64k_pretrain_config` | Pretrain | 2 | 1 | 4 | | 8 | 64K |
| `llama3_8b_128k_pretrain_config` | Pretrain | 2 | 1 | 8 | | 16 | 128K |
| `llama3_70b_pretrain_config` | Pretrain | 8 | 4 | | | 32 | 8K |
| `llama3_70b_16k_pretrain_config` | Pretrain | 8 | 4 | 2 | | 64 | 16K |
| `llama3_70b_64k_pretrain_config` | Pretrain | 8 | 4 | 4 | | 128 | 64K |
| `llama31_405b_pretrain_config` | Pretrain | 8 | 16 | | | 128 | 8K |
| `llama3_8b_sft_config` | SFT | 2 | 1 | | | 2 | 8K |
| `llama3_70b_sft_config` | SFT | 4 | 4 | | | 16 | 8K |
| `llama31_405b_sft_config` | SFT | 8 | 8 | | | 64 | 8K |
| `llama3_8b_peft_config` | PEFT | 1 | 1 | | | 1 | 8K |
| `llama3_70b_peft_config` | PEFT | 2 | 4 | | | 8 | 8K |
| `llama31_405b_peft_config` | PEFT | 4 | 8 | | | 32 | 8K |
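The minimum GPU counts in the Llama table follow from the parallelism sizes: with DP=1, the smallest world size is TP × PP × CP. The sketch below checks that arithmetic for the pretrain rows and also shows how many tokens each context-parallel rank holds for the long-context variants. It assumes a blank CP cell means CP=1 and that "8K" means 8192 tokens; both are assumptions, not statements from the table.

```python
# (recipe, TP, PP, CP, min GPUs, sequence length in tokens), copied from the
# Llama table. Assumption: a blank CP cell is read as CP=1.
LLAMA_ROWS = [
    ("llama3_8b_pretrain_config", 2, 1, 1, 2, 8 * 1024),
    ("llama3_8b_16k_pretrain_config", 2, 1, 2, 4, 16 * 1024),
    ("llama3_8b_64k_pretrain_config", 2, 1, 4, 8, 64 * 1024),
    ("llama3_8b_128k_pretrain_config", 2, 1, 8, 16, 128 * 1024),
    ("llama3_70b_pretrain_config", 8, 4, 1, 32, 8 * 1024),
    ("llama3_70b_16k_pretrain_config", 8, 4, 2, 64, 16 * 1024),
    ("llama3_70b_64k_pretrain_config", 8, 4, 4, 128, 64 * 1024),
]


def min_gpus(tp: int, pp: int, cp: int) -> int:
    """Smallest world size for a dense model: one data-parallel replica."""
    return tp * pp * cp


def tokens_per_cp_rank(seq_length: int, cp: int) -> int:
    """Context parallelism splits the sequence dimension across CP ranks."""
    return seq_length // cp


for name, tp, pp, cp, gpus, seq in LLAMA_ROWS:
    # Every row should agree with TP x PP x CP.
    assert min_gpus(tp, pp, cp) == gpus, name
    print(f"{name}: {gpus} GPUs, {tokens_per_cp_rank(seq, cp)} tokens per CP rank")
```

Under these assumptions, the long-context variants keep the per-rank sequence at 8K or 16K tokens, which is why CP grows with sequence length.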

Qwen2 / Qwen2.5

| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| `qwen2_*_{pretrain,sft,peft}_config` | All | 1–8 | 1–4 | 500M, 1.5B, 7B, 14B, 32B, 72B |
| `qwen25_*_{pretrain,sft,peft}_config` | All | 1–8 | 1–4 | 500M, 1.5B, 3B, 7B, 14B, 32B, 72B |

Qwen3 (Dense)

| Recipe | Mode | TP | PP | CP | Sizes |
|---|---|---|---|---|---|
| `qwen3_*_pretrain_config` | Pretrain | 1–8 | 1–2 | | 600M–32B |
| `qwen3_*_sft_config` | SFT | 1–8 | 1–2 | | 600M–32B |
| `qwen3_600m_sft_128k_config` | SFT | 1 | 1 | 8 | 600M (128K seq) |
| `qwen3_*_peft_config` | PEFT | 1 | 1 | | 600M–32B |

Qwen3 MoE

| Recipe | Mode | TP | PP | EP | CP | GPUs |
|---|---|---|---|---|---|---|
| `qwen3_30b_a3b_pretrain_config` | Pretrain | 1 | 1 | 8 | | 8 |
| `qwen3_30b_a3b_sft_config` | SFT | 1 | 1 | 8 | | 8 |
| `qwen3_30b_a3b_peft_config` | PEFT | 1 | 1 | 1 | | 1 |
| `qwen3_235b_a22b_pretrain_config` | Pretrain | 4 | 16 | 8 | 2 | 512+ |
| `qwen3_235b_a22b_sft_config` | SFT | 4 | 8 | 8 | | 256 |
| `qwen3_235b_a22b_peft_config` | PEFT | 1 | 4 | 4 | | 16 |

Qwen3-Next

| Recipe | Mode | TP | PP | EP |
|---|---|---|---|---|
| `qwen3_next_80b_a3b_pretrain_config` | Pretrain | 1 | 4 | 8 |
| `qwen3_next_80b_a3b_sft_config` | SFT | 1 | 2 | 8 |
| `qwen3_next_80b_a3b_peft_config` | PEFT | 1 | 1 | 4 |

DeepSeek

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| `deepseek_v2_lite_pretrain_config` | Pretrain | 1 | 1 | 8 | 8 |
| `deepseek_v2_pretrain_config` | Pretrain | 1 | 4 | 32 | 128 |
| `deepseek_v3_pretrain_config` | Pretrain | 2 | 16 | 64 | 2048 |
| `deepseek_v3_pretrain_config_32nodes` | Pretrain | 2 | 8 | 32 | 256 |

GLM-4.5

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| `glm45_355b_pretrain_config` | Pretrain | 2 | 8 | 16 | 256 |
| `glm45_air_106b_pretrain_config` | Pretrain | 1 | 4 | 8 | 32 |
| `glm45_355b_sft_config` | SFT | 2 | 8 | 16 | 256 |
| `glm45_air_106b_sft_config` | SFT | 1 | 4 | 8 | 32 |
| `glm45_355b_peft_config` | PEFT | 2 | 4 | 4 | 32 |
| `glm45_air_106b_peft_config` | PEFT | 1 | 2 | 4 | 8 |

Gemma

| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| `gemma2_*_{pretrain,sft,peft}_config` | All | 2–8 | 1–2 | 2B, 9B, 27B |
| `gemma3_1b_{pretrain,sft,peft}_config` | All | 1 | 1 | 1B (32K seq) |

NemotronH / Nemotron

| Recipe | Mode | TP | PP | EP | Notes |
|---|---|---|---|---|---|
| `nemotronh_{4b,8b,47b,56b}_*_config` | P/S/PEFT | 1–8 | 1–4 | | Dense SSM-hybrid |
| `nemotron_3_nano_*_config` | P/S/PEFT | varies | 1 | 8 | MoE + Mamba |
| `nemotron_3_super_*_config` | P/S/PEFT | 4 | 1 | 8 | MoE + Mamba, ~40% CUDA graph gain |
| `nemotron_nano_{9b,12b}_v2_*_config` | P/S/PEFT | varies | 1 | | Dense |

Other Models

| Recipe | Mode | Notes |
|---|---|---|
| `moonlight_16b_{pretrain,sft,peft}_config` | All | MoE EP=8 |
| `olmoe_7b_{pretrain,sft,peft}_config` | All | MoE EP=8 |
| `ministral3_{3b,8b,14b}_{sft,peft}_config` | SFT/PEFT | Dense |
| `gpt_oss_20b_*_config` | All | MoE + FP8/MXFP8 variants |
| `gpt_oss_120b_*_config` | All | MoE |
| `vanilla_gpt_pretrain_config` | Pretrain | MLM/Bridge parity baseline |
| `gpt3_175b_pretrain_config` | Pretrain | TP=4, PP=8, VP=6 |
| `kimi_k2_pretrain_config` | Pretrain | 1T MoE, TP=2 PP=16 EP=32 |

VLM Recipes

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| `gemma3_vl_{4b,12b,27b}_{sft,peft}_config` | SFT/PEFT | 1–8 | 1–2 | | 1–16 |
| `qwen25_vl_{3b,7b,32b,72b}_{sft,peft}_config` | SFT/PEFT | 1–8 | 1–4 | | 1–32 |
| `qwen3_vl_{8b,30b_a3b,235b_a22b}_{sft,peft}_config` | SFT/PEFT | 1–4 | 1–8 | 1–32 | 1–512 |
| `qwen35_vl_*_{sft,peft}_config` | SFT/PEFT | varies | varies | varies | varies |
| `glm_45v_{sft,peft}_config` | SFT/PEFT | 1 | 8 | 4–16 | 64–512 |
| `nemotron_nano_v2_vl_12b_{sft,peft}_config` | SFT/PEFT | 2–4 | 1 | | 8 |

Diffusion Recipes

| Recipe | Mode | TP | CP |
|---|---|---|---|
| `wan_1_3B_{pretrain,sft}_config` | P/SFT | 1 | 8 |
| `wan_14B_{pretrain,sft}_config` | P/SFT | 2 | 4 |
| `flux_12b_{pretrain,sft}_config` | P/SFT | 2 | 1 |

Performance Recipe Index

All perf recipes live under `scripts/performance/`. They are invoked via `run_script.py` and use `WorkloadBaseConfig` presets per GPU type.
Important: Perf recipes are designed for upper-bound throughput benchmarks, not production training. They run 50 iterations on mock data by default. Throughput numbers are aspirational targets, not validated convergence configs.

Llama 3 / 3.1

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Llama 3 8B | 8 | H100, B200, B300, GB200, GB300, R100 | CUDA graphs (local), FSDP on GB variants |
| Llama 3 70B | 64 | H100, B200, B300, GB200, GB300 | TP comm overlap (userbuffers), FSDP, CUDA graphs |
| Llama 3.1 405B | 128–1024 | H100, B200, B300, GB200, GB300 | TP+CP comm overlap (userbuffers), FSDP, heavy PP/VP |

SFT/LoRA variants also exist (e.g. 8B SFT with packed sequences, 70B SFT on 32 GPUs).

DeepSeek V3

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| DeepSeek V3 (671B MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | HybridEP dispatcher, MLA recompute, CUDA graphs (TE scoped) |

Qwen3 MoE

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | MoE alltoall/flex dispatcher |
| Qwen3 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | TP comm overlap, CUDA graphs, MoE a2a overlap |
| Qwen3-Next 80B-A3B | 64–128 | H100, B200, B300, GB200, GB300 | EP 64–128 |

Qwen3-VL

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3-VL 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | VLM + MoE |
| Qwen3-VL 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | VLM + MoE, TP comm overlap |

Kimi K2

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Kimi K2 (1T MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | Muon/Adam optimizer, HybridEP, pipeline layout helpers |

NemotronH

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Nemotron 3 Nano (30B MoE+Mamba) | 8–16 | H100, B200, B300, GB200, GB300 | TE CUDA graphs (attn+mamba+moe), HybridEP |
| Nemotron 3 Super | 64 | H100, B200, B300, GB200, GB300 | TE CUDA graphs, EP=64 |
| NemotronH 56B | 64 | H100, B200, B300 | TP=2–8, TE graphs (mamba+attn) |

GPT-OSS

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| GPT-OSS 120B | 64 | H100, B200, GB200 | EP=64, HybridEP on GB200 |

Recommendation Decision Tree

```text
User wants to train a model
├─ Know the model name?
│   ├─ Yes → Look up in Library Recipe Index above
│   │   ├─ Has a recipe for their size + mode? → Use it directly
│   │   └─ No exact match? → Use closest size, adjust parallelism
│   └─ No → Ask for model name, size, and HF model ID
├─ What's the training goal?
│   ├─ Pretrain → Use *_pretrain_config
│   ├─ SFT (full fine-tune) → Use *_sft_config
│   └─ PEFT (LoRA/DoRA) → Use *_peft_config (lowest GPU requirement)
├─ How many GPUs?
│   ├─ 1 GPU → Only PEFT recipes work (TP=1, PP=1)
│   ├─ 8 GPUs (1 node) → Most 8B–16B models, small MoE (EP=8)
│   ├─ 16–64 GPUs → 70B dense, medium MoE
│   └─ 128+ GPUs → 405B+, large MoE (DeepSeek V3, Kimi K2)
├─ Want throughput benchmarks?
│   ├─ Yes → Use perf recipes (scripts/performance/)
│   │   └─ ⚠️ These run on mock data for upper-bound perf only
│   └─ No → Use library recipes (scripts/training/run_recipe.py)
└─ Long context?
    ├─ > 8K → Need CP (context parallelism), check *_16k / *_64k / *_128k variants
    └─ ≤ 8K → Default recipes work
```
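The goal, GPU-count, and context-length branches of the tree can be sketched as a small function. This is an illustration only: `recommend` is a hypothetical name, the tree does not say what to do for GPU counts between its listed tiers (this sketch rounds down to the nearest tier, which is an assumption), and "8K" is taken to mean 8192 tokens.

```python
def recommend(goal: str, num_gpus: int, seq_length: int = 8192) -> dict:
    """Walk the decision tree and return hints, not a final recipe name."""
    suffixes = {
        "pretrain": "*_pretrain_config",
        "sft": "*_sft_config",
        "peft": "*_peft_config",
    }
    if goal not in suffixes:
        raise ValueError(f"unknown training goal: {goal!r}")

    notes = []
    # Tier boundaries between the listed GPU counts are an assumption.
    if num_gpus == 1:
        tier = "PEFT recipes only (TP=1, PP=1)"
        if goal != "peft":
            notes.append("1 GPU: only PEFT recipes work")
    elif num_gpus < 16:
        tier = "most 8B-16B models, small MoE (EP=8)"
    elif num_gpus < 128:
        tier = "70B dense, medium MoE"
    else:
        tier = "405B+, large MoE (DeepSeek V3, Kimi K2)"

    # Sequences longer than 8K need context parallelism.
    if seq_length > 8192:
        notes.append("needs CP; check *_16k / *_64k / *_128k variants")

    return {"recipe_pattern": suffixes[goal], "tier": tier, "notes": notes}


print(recommend("pretrain", num_gpus=8, seq_length=32768))
```

The function returns a pattern and hints, so the final recipe name still has to be looked up in the index for the user's model family.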

Adjustment Advice (When Recommending)

Parallelism Resizing Rules

When the user's GPU count differs from the recipe default:
  1. TP must divide `num_key_value_heads` (GQA constraint). E.g. if `num_key_value_heads=8`, valid TP = {1, 2, 4, 8}.
  2. TP should stay within a single node (NVLink). TP > 8 requires inter-node NVLink (e.g., GB200 NVL72).
  3. PP adds pipeline bubbles. Minimize PP; only increase when TP alone can't fit the model. Use VP (virtual pipeline) to mitigate bubble overhead.
  4. EP doesn't reduce dense-layer memory. Only expert parameters shard with EP. Shared attention/embeddings are replicated. For "OOM with MoE", increase EP first, not TP.
  5. SP should be True whenever TP > 1. It eliminates redundant activation copies and is essentially free.
  6. CP requires all-to-all or ring attention. Check `cp_comm_type`. For GQA models, `a2a+p2p` hierarchical CP allows CP > num_kv_heads.
  7. world_size = DP × TP × PP × CP × EP. DP is implicit. Make sure the product of explicit parallelisms divides your total GPU count.
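Rules 1, 2, 5 and 7 above are mechanical enough to check before launching a job. The following is a minimal sketch, assuming the world-size formula exactly as stated in rule 7 and 8 GPUs per node; `Layout` and `check_layout` are hypothetical names, not library APIs.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    tp: int = 1
    pp: int = 1
    cp: int = 1
    ep: int = 1
    sp: bool = False


def check_layout(layout, world_size, num_kv_heads, gpus_per_node=8):
    """Return (dp, problems); dp is None when the layout does not fit."""
    problems = []
    # Rule 1: TP must divide num_key_value_heads.
    if num_kv_heads % layout.tp != 0:
        problems.append(
            f"TP={layout.tp} does not divide num_key_value_heads={num_kv_heads}"
        )
    # Rule 2: TP should stay within one node.
    if layout.tp > gpus_per_node:
        problems.append(f"TP={layout.tp} spans more than one {gpus_per_node}-GPU node")
    # Rule 5: SP should be on whenever TP > 1.
    if layout.tp > 1 and not layout.sp:
        problems.append("SP should be enabled when TP > 1")
    # Rule 7: the explicit parallelisms must divide the GPU count.
    explicit = layout.tp * layout.pp * layout.cp * layout.ep
    if world_size % explicit != 0:
        problems.append(f"TP*PP*CP*EP={explicit} does not divide {world_size} GPUs")
        return None, problems
    return world_size // explicit, problems


dp, problems = check_layout(Layout(tp=2, sp=True), world_size=8, num_kv_heads=8)
print(dp, problems)  # prints: 4 []
```

A non-empty `problems` list is a reason to adjust the overrides before submitting the job, since several of these conditions otherwise surface only as runtime crashes.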

Batch Size Tuning

  • Start with the recipe's `micro_batch_size`. If OOM, reduce to 1.
  • `global_batch_size` determines learning dynamics. Scale with DP: `GBS = micro_batch_size × DP × gradient_accumulation_steps`.
  • For MoE, `micro_batch_size=1` is typical at scale.
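The batch-size relation above can be inverted to find how many gradient accumulation steps a given setup implies. A short sketch, with a hypothetical helper name:

```python
def gradient_accumulation_steps(global_batch_size: int, micro_batch_size: int, dp: int) -> int:
    """Solve GBS = micro_batch_size * DP * gradient_accumulation_steps for the steps."""
    samples_per_step = micro_batch_size * dp
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not a multiple of "
            f"micro_batch_size*DP={samples_per_step}"
        )
    return global_batch_size // samples_per_step


# Doubling DP at a fixed global batch size halves the accumulation steps.
print(gradient_accumulation_steps(64, micro_batch_size=1, dp=4))  # prints: 16
print(gradient_accumulation_steps(64, micro_batch_size=1, dp=8))  # prints: 8
```

Keeping `global_batch_size` fixed while DP changes preserves the learning dynamics; only the number of accumulation steps per optimizer update moves.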

Common Pitfalls to Warn About

| Pitfall | Symptom | Fix |
|---|---|---|
| TP > num_kv_heads | Crash: "TP must divide num_query_groups" | Reduce TP to a divisor of num_kv_heads |
| PP without VP | Poor throughput (large bubble) | Set `virtual_pipeline_model_parallel_size` |
| EP too low for large MoE | OOM on expert params | Increase EP; each EP rank holds num_experts/EP experts |
| CUDA graphs + packed sequences | Assert: "CUDA graph accepts only Tensor inputs" | Disable packing or use `local` full-iteration graphs |
| CUDA graphs + full recompute | Assert: "full recompute only with full iteration CUDA graph" | Disable recompute or switch to `local` impl |
| `use_te_rng_tracker` not set | Assert on provider init when CUDA graphs enabled | Set `cfg.model.use_te_rng_tracker = True` and `cfg.rng.te_rng_tracker = True` |
| FSDP + TP > 1 on H100 | Possible comm bottleneck | Prefer FSDP with TP=1 or TP=2 on H100; FSDP shines on GB/B-series |
| Long context without CP | OOM on activations | Add CP=2/4/8; use `*_16k`, `*_64k`, or `*_128k` recipe variants |
| MoE `overlap_grad_reduce` on H100 | May hurt perf (False in many H100 presets) | Set `overlap_grad_reduce=False` for MoE on H100 |
| VLM SFT missing image data | Runs but produces garbage | Provide actual multimodal dataset or use mock VLM data |
| Qwen35-VL MoE FSDP | Tested on Blackwell only | May not work on H100; validate first |
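To give a feel for the "PP without VP" pitfall, the sketch below uses the commonly cited estimate for pipeline idle time relative to ideal compute time: (PP − 1) / m for a non-interleaved schedule with m microbatches, reduced by a factor of VP with an interleaved (virtual pipeline) schedule. This is a rough model under those assumptions, not a measurement of any recipe, and `bubble_fraction` is a hypothetical helper.

```python
def bubble_fraction(pp: int, num_microbatches: int, vp: int = 1) -> float:
    """Idle (bubble) time relative to ideal compute time for one batch.

    Estimate only: (pp - 1) / (vp * num_microbatches).
    """
    if pp < 1 or vp < 1 or num_microbatches < 1:
        raise ValueError("pp, vp and num_microbatches must be positive")
    return (pp - 1) / (vp * num_microbatches)


# Same pipeline depth and microbatch count, increasing virtual pipeline size.
for vp in (1, 2, 4):
    frac = bubble_fraction(pp=8, num_microbatches=32, vp=vp)
    print(f"PP=8, 32 microbatches, VP={vp}: bubble = {frac:.1%}")
```

The estimate also shows why a larger number of microbatches (more gradient accumulation) helps when VP cannot be raised further.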

Recipe Override Examples

```bash
# Scale Llama3 8B from 2 GPUs to 8 GPUs (increase DP)
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock

# Reduce parallelism for Qwen3-MoE 30B to fit on 4 GPUs
uv run python -m torch.distributed.run --nproc_per_node=4 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_sft_config \
    --dataset llm-finetune \
    'model.expert_model_parallel_size=4'

# Add long context to an existing recipe
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.seq_length=32768' \
    'model.context_parallel_size=4'

# Enable CUDA graphs on any recipe
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.cuda_graph_impl=transformer_engine' \
    'model.cuda_graph_scope=[attn,moe_router,moe_preprocess]' \
    'model.use_te_rng_tracker=True' \
    'rng.te_rng_tracker=True'
```
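Conceptually, each `section.field=value` argument addresses one field of the nested config. The sketch below shows that idea on a plain dictionary. It is not how `run_recipe.py` parses overrides (that implementation is not shown here); `parse_override` and `apply_overrides` are hypothetical names, and the dictionary stands in for the real config object.

```python
import ast


def parse_override(arg: str):
    """Split 'a.b.c=value' into (['a', 'b', 'c'], value)."""
    key, sep, raw = arg.partition("=")
    if not key or not sep:
        raise ValueError(f"expected key=value, got {arg!r}")
    try:
        # Turns '4' into 4 and 'True' into True.
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Bare words such as transformer_engine stay strings.
        value = raw
    return key.split("."), value


def apply_overrides(config: dict, args: list) -> dict:
    """Write each override into the nested dict, creating sections as needed."""
    for arg in args:
        path, value = parse_override(arg)
        node = config
        for name in path[:-1]:
            node = node.setdefault(name, {})
        node[path[-1]] = value
    return config


cfg = apply_overrides(
    {"model": {"seq_length": 8192}},
    [
        "model.seq_length=32768",
        "model.context_parallel_size=4",
        "model.cuda_graph_impl=transformer_engine",
        "rng.te_rng_tracker=True",
    ],
)
print(cfg)
```

In this sketch a bracketed value such as `[attn,moe_router,moe_preprocess]` is kept as a raw string; how the real entry point interprets list values is not covered here.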

Quick Reference: Which Recipe for My Situation?

| I want to... | Start with | GPUs needed |
|---|---|---|
| Try Bridge for the first time | `llama3_8b_sft_config` + mock data | 2 |
| Fine-tune a 7-8B model | `llama3_8b_sft_config` or `qwen3_8b_sft_config` | 2–8 |
| LoRA on 1 GPU | `llama3_8b_peft_config` or `qwen3_8b_peft_config` | 1 |
| Pretrain a dense 70B | `llama3_70b_pretrain_config` | 32–64 |
| Train a small MoE | `qwen3_30b_a3b_pretrain_config` | 8 |
| Train a large MoE (235B+) | `qwen3_235b_a22b_pretrain_config` | 256–512 |
| Benchmark throughput | Perf recipes via `run_script.py` | Varies |
| Long-context training | `llama3_8b_128k_pretrain_config` or add CP override | 16+ |
| VLM fine-tuning | `qwen3_vl_8b_sft_config` or `gemma3_vl_*_sft_config` | 4–8 |
| Diffusion training | `wan_1_3B_pretrain_config` or `flux_12b_pretrain_config` | 8 |

Code Anchors

| What | Path |
|---|---|
| Library recipes root | `src/megatron/bridge/recipes/` |
| Recipe `__init__.py` (all exports) | `src/megatron/bridge/recipes/__init__.py` |
| Common recipe helpers | `src/megatron/bridge/recipes/common.py` |
| Training entry point | `scripts/training/run_recipe.py` |
| Perf recipes root | `scripts/performance/` |
| Perf entry point | `scripts/performance/run_script.py` |
| Perf workload configs | `scripts/performance/configs/<family>/` |
| Perf overrides (benchmark defaults) | `scripts/performance/utils/overrides.py` |