recipe-recommender

Auto Recipe — Recipe Index & Recommendation

This skill indexes every shipped recipe and helps users pick the right starting config, adjust parallelism, and avoid common pitfalls.

How to Use This Skill

Ask the user for: model name/size, GPU count & type, training goal (pretrain / SFT / PEFT), and sequence length (if non-default).
Look up the best-match recipe in the index below.
Recommend the recipe function name + entry-point command.
Provide adjustment advice (parallelism resizing, batch tuning, pitfalls).
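
The lookup step above can be sketched in Python. This is a minimal illustration and not part of Bridge: `RecipeEntry`, `INDEX`, and `pick_recipe` are hypothetical names, and the few rows are copied from the Llama table in the index, reading 8K as 8192 tokens and 16K as 16384.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecipeEntry:
    """One row of the recipe index (hypothetical structure, not a Bridge class)."""
    name: str
    family: str
    mode: str      # "pretrain", "sft", or "peft"
    min_gpus: int
    seq_len: int   # tokens


# A few rows copied from the Llama table in the recipe index.
INDEX = [
    RecipeEntry("llama3_8b_pretrain_config", "llama3_8b", "pretrain", 2, 8192),
    RecipeEntry("llama3_8b_16k_pretrain_config", "llama3_8b", "pretrain", 4, 16384),
    RecipeEntry("llama3_8b_sft_config", "llama3_8b", "sft", 2, 8192),
    RecipeEntry("llama3_8b_peft_config", "llama3_8b", "peft", 1, 8192),
    RecipeEntry("llama3_70b_sft_config", "llama3_70b", "sft", 16, 8192),
]


def pick_recipe(family: str, mode: str, gpus: int, seq_len: int = 8192) -> Optional[RecipeEntry]:
    """Return the shortest-context recipe that fits the request, or None."""
    candidates = [
        r for r in INDEX
        if r.family == family and r.mode == mode
        and r.min_gpus <= gpus and r.seq_len >= seq_len
    ]
    return min(candidates, key=lambda r: r.seq_len, default=None)


match = pick_recipe("llama3_8b", "pretrain", gpus=8, seq_len=16384)
print(match.name if match else "no match")
```

Returning `None` rather than the nearest miss keeps the "no shipped recipe fits" case explicit, which is the point at which adjustment advice is needed.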

Entry Points

Library recipes (functional training)

bash

# Pretrain with mock data
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-pretrain-mock

# SFT with SQuAD
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-finetune

# Override any field via CLI
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.tensor_model_parallel_size=2' \
    'training.global_batch_size=64'
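
When the launch command is generated by a script, the same invocation can be assembled as an argument list. The sketch below is a hypothetical helper (`build_run_recipe_cmd` is not part of Bridge); it only reproduces the flags and the `dotted.key=value` override form shown in the commands above.

```python
import shlex


def build_run_recipe_cmd(recipe, dataset, nproc_per_node=8, overrides=None):
    """Assemble the run_recipe.py launch command as an argument list."""
    cmd = [
        "uv", "run", "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        "scripts/training/run_recipe.py",
        "--recipe", recipe,
        "--dataset", dataset,
    ]
    # Each override becomes one 'dotted.key=value' token, as in the CLI example.
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


cmd = build_run_recipe_cmd(
    "llama3_8b_pretrain_config",
    "llm-pretrain-mock",
    overrides={
        "model.tensor_model_parallel_size": 2,
        "training.global_batch_size": 64,
    },
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` avoids shell quoting issues with values such as `[attn,moe_router]`.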

Performance recipes (throughput benchmarks)

bash

python scripts/performance/run_script.py \
    --recipe <model_family> \
    --gpu_type h100 \
    --num_gpus 64 \
    --data mock

Perf recipes are NOT fully validated for correctness. Most testing and validation was done on mock data. They are designed for upper-bound throughput measurement, not production training. Always validate loss curves and convergence independently.

Recipe Unification (Coming Soon — PR #2803)

PR #2803 is unifying performance recipes into the same Python function format used by library recipes. Key changes:

Perf recipes move from `scripts/performance/configs/` to `src/megatron/bridge/recipes/<family>/<model>_perf.py`.
Each perf recipe becomes a self-contained Python function (e.g. `llama3_8b_h100_bf16_pretrain_config()`).
The old `WorkloadBaseConfig` → `set_workload_base_configs` → `get_perf_optimized_recipe` pipeline is removed.
Shared helpers: `_benchmark_common()` (50 iters, timing, TE RNG) and `_perf_precision()` (bf16 / fp8_cs / fp8_mx / nvfp4).

Why Python, not YAML? Previous YAML-based approaches had problems: recipe logic was split across multiple indirection layers, configs were not self-contained, and the two-level pipeline made maintenance and debugging difficult. Python functions are explicit, greppable, and composable.

After #2803 lands, both library and perf recipes will be invocable through the same `run_recipe.py` entry point.
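
To illustrate what "self-contained and composable" could look like, here is a toy sketch. Everything in it is hypothetical: `SketchConfig` is a stand-in and not the Bridge `ConfigContainer`, and `benchmark_common` / `perf_precision` only imitate the roles described for the shared helpers. Their real signatures are not given in this document.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a recipe config; not the Bridge ConfigContainer."""
    tensor_model_parallel_size: int = 1
    precision: str = "bf16"
    train_iters: int = 1000
    use_mock_data: bool = False


def benchmark_common(cfg: SketchConfig) -> SketchConfig:
    """Hypothetical shared helper: short 50-iteration run on mock data."""
    return replace(cfg, train_iters=50, use_mock_data=True)


def perf_precision(cfg: SketchConfig, precision: str) -> SketchConfig:
    """Hypothetical shared helper: select one of the listed precisions."""
    allowed = {"bf16", "fp8_cs", "fp8_mx", "nvfp4"}
    if precision not in allowed:
        raise ValueError(f"unknown precision: {precision}")
    return replace(cfg, precision=precision)


def example_8b_h100_bf16_pretrain_config() -> SketchConfig:
    """Everything the recipe sets is visible in this one function."""
    cfg = SketchConfig(tensor_model_parallel_size=2)
    cfg = benchmark_common(cfg)
    return perf_precision(cfg, "bf16")


print(example_8b_h100_bf16_pretrain_config())
```

Because each recipe is an ordinary function, searching for its name leads directly to every value it sets, with no intermediate preset lookup.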

Library Recipe Index

All recipes live under `src/megatron/bridge/recipes/`. Each function returns a `ConfigContainer` with model, training, optimizer, and data settings.

Llama

Recipe	Mode	TP	PP	CP	SP	GPUs (min)	Seq Len
`llama2_7b_pretrain_config`	Pretrain	2	1	—	—	2	4K
`llama3_8b_pretrain_config`	Pretrain	2	1	—	✓	2	8K
`llama3_8b_16k_pretrain_config`	Pretrain	2	1	2	✓	4	16K
`llama3_8b_64k_pretrain_config`	Pretrain	2	1	4	✓	8	64K
`llama3_8b_128k_pretrain_config`	Pretrain	2	1	8	✓	16	128K
`llama3_70b_pretrain_config`	Pretrain	8	4	—	✓	32	8K
`llama3_70b_16k_pretrain_config`	Pretrain	8	4	2	✓	64	16K
`llama3_70b_64k_pretrain_config`	Pretrain	8	4	4	✓	128	64K
`llama31_405b_pretrain_config`	Pretrain	8	16	—	✓	128	8K
`llama3_8b_sft_config`	SFT	2	1	—	✓	2	8K
`llama3_70b_sft_config`	SFT	4	4	—	✓	16	8K
`llama31_405b_sft_config`	SFT	8	8	—	✓	64	8K
`llama3_8b_peft_config`	PEFT	1	1	—	—	1	8K
`llama3_70b_peft_config`	PEFT	2	4	—	✓	8	8K
`llama31_405b_peft_config`	PEFT	4	8	—	✓	32	8K
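
For these dense recipes, the "GPUs (min)" column equals TP × PP × CP, i.e. one data-parallel replica. The short check below verifies that against a few rows copied from the table, reading "—" in the CP column as 1. `min_gpus` and `ROWS` are illustrative names, not Bridge code.

```python
def min_gpus(tp: int, pp: int = 1, cp: int = 1) -> int:
    """Smallest GPU count for a dense recipe: one data-parallel replica."""
    return tp * pp * cp


# (recipe, TP, PP, CP, listed GPUs (min)) copied from the table; "—" read as 1.
ROWS = [
    ("llama3_8b_pretrain_config", 2, 1, 1, 2),
    ("llama3_8b_128k_pretrain_config", 2, 1, 8, 16),
    ("llama3_70b_64k_pretrain_config", 8, 4, 4, 128),
    ("llama31_405b_pretrain_config", 8, 16, 1, 128),
    ("llama3_70b_peft_config", 2, 4, 1, 8),
]

for name, tp, pp, cp, listed in ROWS:
    assert min_gpus(tp, pp, cp) == listed, name
    print(f"{name}: {tp} x {pp} x {cp} = {listed} GPUs")
```

Any GPU count that is a multiple of this minimum adds data-parallel replicas without changing the recipe's model-parallel layout.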

Qwen2 / Qwen2.5

Recipe	Mode	TP	PP	Sizes
`qwen2_*_{pretrain,sft,peft}_config`	All	1–8	1–4	500M, 1.5B, 7B, 14B, 32B, 72B
`qwen25_*_{pretrain,sft,peft}_config`	All	1–8	1–4	500M, 1.5B, 3B, 7B, 14B, 32B, 72B

Qwen3 (Dense)

Recipe	Mode	TP	PP	CP	Sizes
`qwen3_*_pretrain_config`	Pretrain	1–8	1–2	—	600M–32B
`qwen3_*_sft_config`	SFT	1–8	1–2	—	600M–32B
`qwen3_600m_sft_128k_config`	SFT	1	1	8	600M (128K seq)
`qwen3_*_peft_config`	PEFT	1	1	—	600M–32B

Qwen3 MoE

Recipe	Mode	TP	PP	EP	CP	GPUs
`qwen3_30b_a3b_pretrain_config`	Pretrain	1	1	8	—	8
`qwen3_30b_a3b_sft_config`	SFT	1	1	8	—	8
`qwen3_30b_a3b_peft_config`	PEFT	1	1	1	—	1
`qwen3_235b_a22b_pretrain_config`	Pretrain	4	16	8	2	512+
`qwen3_235b_a22b_sft_config`	SFT	4	8	8	—	256
`qwen3_235b_a22b_peft_config`	PEFT	1	4	4	—	16

Qwen3-Next

Recipe	Mode	TP	PP	EP
`qwen3_next_80b_a3b_pretrain_config`	Pretrain	1	4	8
`qwen3_next_80b_a3b_sft_config`	SFT	1	2	8
`qwen3_next_80b_a3b_peft_config`	PEFT	1	1	4

DeepSeek

Recipe	Mode	TP	PP	EP	GPUs
`deepseek_v2_lite_pretrain_config`	Pretrain	1	1	8	8
`deepseek_v2_pretrain_config`	Pretrain	1	4	32	128
`deepseek_v3_pretrain_config`	Pretrain	2	16	64	2048
`deepseek_v3_pretrain_config_32nodes`	Pretrain	2	8	32	256

GLM-4.5

Recipe	Mode	TP	PP	EP	GPUs
`glm45_355b_pretrain_config`	Pretrain	2	8	16	256
`glm45_air_106b_pretrain_config`	Pretrain	1	4	8	32
`glm45_355b_sft_config`	SFT	2	8	16	256
`glm45_air_106b_sft_config`	SFT	1	4	8	32
`glm45_355b_peft_config`	PEFT	2	4	4	32
`glm45_air_106b_peft_config`	PEFT	1	2	4	8

Gemma

Recipe	Mode	TP	PP	Sizes
`gemma2_*_{pretrain,sft,peft}_config`	All	2–8	1–2	2B, 9B, 27B
`gemma3_1b_{pretrain,sft,peft}_config`	All	1	1	1B (32K seq)

NemotronH / Nemotron

Recipe	Mode	TP	PP	EP	Notes
`nemotronh_{4b,8b,47b,56b}_*_config`	P/S/PEFT	1–8	1–4	—	Dense SSM-hybrid
`nemotron_3_nano_*_config`	P/S/PEFT	varies	1	8	MoE + Mamba
`nemotron_3_super_*_config`	P/S/PEFT	4	1	8	MoE + Mamba, ~40% CUDA graph gain
`nemotron_nano_{9b,12b}_v2_*_config`	P/S/PEFT	varies	1	—	Dense

Other Models

Recipe	Mode	Notes
`moonlight_16b_{pretrain,sft,peft}_config`	All	MoE EP=8
`olmoe_7b_{pretrain,sft,peft}_config`	All	MoE EP=8
`ministral3_{3b,8b,14b}_{sft,peft}_config`	SFT/PEFT	Dense
`gpt_oss_20b_*_config`	All	MoE + FP8/MXFP8 variants
`gpt_oss_120b_*_config`	All	MoE
`vanilla_gpt_pretrain_config`	Pretrain	MLM/Bridge parity baseline
`gpt3_175b_pretrain_config`	Pretrain	TP=4, PP=8, VP=6
`kimi_k2_pretrain_config`	Pretrain	1T MoE, TP=2 PP=16 EP=32

VLM Recipes

Recipe	Mode	TP	PP	EP	GPUs
`gemma3_vl_{4b,12b,27b}_{sft,peft}_config`	SFT/PEFT	1–8	1–2	—	1–16
`qwen25_vl_{3b,7b,32b,72b}_{sft,peft}_config`	SFT/PEFT	1–8	1–4	—	1–32
`qwen3_vl_{8b,30b_a3b,235b_a22b}_{sft,peft}_config`	SFT/PEFT	1–4	1–8	1–32	1–512
`qwen35_vl_*_{sft,peft}_config`	SFT/PEFT	varies	varies	varies	varies
`glm_45v_{sft,peft}_config`	SFT/PEFT	1	8	4–16	64–512
`nemotron_nano_v2_vl_12b_{sft,peft}_config`	SFT/PEFT	2–4	1	—	8

Diffusion Recipes

Recipe	Mode	TP	CP
`wan_1_3B_{pretrain,sft}_config`	P/SFT	1	8
`wan_14B_{pretrain,sft}_config`	P/SFT	2	4
`flux_12b_{pretrain,sft}_config`	P/SFT	2	1

Performance Recipe Index

All perf recipes live under `scripts/performance/`. They are invoked via `run_script.py` and use `WorkloadBaseConfig` presets per GPU type.

Important: Perf recipes are designed for upper-bound throughput benchmarks, not production training. They run 50 iterations on mock data by default. Throughput numbers are aspirational targets, not validated convergence configs.

Llama 3 / 3.1

Model	GPUs	GPU Types	Key Features
Llama 3 8B	8	H100, B200, B300, GB200, GB300, R100	CUDA graphs (local), FSDP on GB variants
Llama 3 70B	64	H100, B200, B300, GB200, GB300	TP comm overlap (userbuffers), FSDP, CUDA graphs
Llama 3.1 405B	128–1024	H100, B200, B300, GB200, GB300	TP+CP comm overlap (userbuffers), FSDP, heavy PP/VP

SFT/LoRA variants also exist (e.g. 8B SFT with packed sequences, 70B SFT on 32 GPUs).

DeepSeek V3

Model	GPUs	GPU Types	Key Features
DeepSeek V3 (671B MoE)	256–1024	H100, B200, B300, GB200, GB300	HybridEP dispatcher, MLA recompute, CUDA graphs (TE scoped)

Qwen3 MoE

Model	GPUs	GPU Types	Key Features
Qwen3 30B-A3B	8–16	H100, B200, B300, GB200, GB300	MoE alltoall/flex dispatcher
Qwen3 235B-A22B	64–256	H100, B200, B300, GB200, GB300	TP comm overlap, CUDA graphs, MoE a2a overlap
Qwen3-Next 80B-A3B	64–128	H100, B200, B300, GB200, GB300	EP 64–128

Qwen3-VL

Model	GPUs	GPU Types	Key Features
Qwen3-VL 30B-A3B	8–16	H100, B200, B300, GB200, GB300	VLM + MoE
Qwen3-VL 235B-A22B	64–256	H100, B200, B300, GB200, GB300	VLM + MoE, TP comm overlap

Kimi K2

Model	GPUs	GPU Types	Key Features
Kimi K2 (1T MoE)	256–1024	H100, B200, B300, GB200, GB300	Muon/Adam optimizer, HybridEP, pipeline layout helpers

NemotronH

Model	GPUs	GPU Types	Key Features
Nemotron 3 Nano (30B MoE+Mamba)	8–16	H100, B200, B300, GB200, GB300	TE CUDA graphs (attn+mamba+moe), HybridEP
Nemotron 3 Super	64	H100, B200, B300, GB200, GB300	TE CUDA graphs, EP=64
NemotronH 56B	64	H100, B200, B300	TP=2–8, TE graphs (mamba+attn)

GPT-OSS

Model	GPUs	GPU Types	Key Features
GPT-OSS 120B	64	H100, B200, GB200	EP=64, HybridEP on GB200

Recommendation Decision Tree

Adjustment Advice (When Recommending)

Parallelism Resizing Rules

When the user's GPU count differs from the recipe default:

TP must divide `num_key_value_heads` (GQA constraint). E.g. if `num_key_value_heads=8`, valid TP = {1, 2, 4, 8}.
TP should stay within a single node (NVLink). TP > 8 requires inter-node NVLink (e.g., GB200 NVL72).
PP adds pipeline bubbles. Minimize PP; only increase when TP alone can't fit the model. Use VP (virtual pipeline) to mitigate bubble overhead.
EP doesn't reduce dense-layer memory. Only expert parameters shard with EP. Shared attention/embeddings are replicated. For "OOM with MoE", increase EP first, not TP.
SP should be True whenever TP > 1. It eliminates redundant activation copies and is essentially free.
CP requires all-to-all or ring attention. Check `cp_comm_type`. For GQA models, `a2a+p2p` hierarchical CP allows CP > num_kv_heads.
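world_size = DP × TP × PP × CP. DP is implicit, so TP × PP × CP must divide your total GPU count. EP is not a further multiplier: expert-parallel groups reuse ranks from the other dimensions, which is why `deepseek_v3_pretrain_config_32nodes` (TP=2, PP=8, EP=32) is listed at 256 GPUs rather than 512.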
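
The TP and divisibility rules above can be checked before launching. The sketch below covers dense models only (no EP) and uses hypothetical helper names; the default of 8 GPUs per node is an assumption that matches the "TP > 8 requires inter-node NVLink" rule.

```python
def valid_tp_sizes(num_kv_heads: int, gpus_per_node: int = 8) -> list[int]:
    """TP values that divide num_key_value_heads and fit inside one node."""
    return [tp for tp in range(1, gpus_per_node + 1) if num_kv_heads % tp == 0]


def data_parallel_size(world_size: int, tp: int, pp: int = 1, cp: int = 1) -> int:
    """Implicit DP for a dense model; raises if the layout does not fit."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"TP*PP*CP = {model_parallel} does not divide world_size = {world_size}"
        )
    return world_size // model_parallel


print(valid_tp_sizes(8))                   # TP options for 8 KV heads
print(data_parallel_size(64, tp=8, pp=4))  # DP for TP=8, PP=4 on 64 GPUs
```

With `num_key_value_heads=8` this yields the set {1, 2, 4, 8} from the first rule, and TP=8, PP=4 on 64 GPUs leaves two data-parallel replicas.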

Batch Size Tuning

Start with the recipe's `micro_batch_size`. If OOM, reduce to 1.
`global_batch_size` determines learning dynamics. Scale with DP: GBS = micro_batch_size × DP × gradient_accumulation_steps
For MoE, `micro_batch_size=1` is typical at scale.
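
The GBS relation above can be solved for the number of gradient accumulation steps. `grad_accum_steps` is an illustrative helper, not a Bridge function; it simply rearranges the formula and rejects combinations that do not divide evenly.

```python
def grad_accum_steps(global_batch_size: int, micro_batch_size: int, dp: int) -> int:
    """Solve GBS = micro_batch_size * DP * gradient_accumulation_steps."""
    per_step = micro_batch_size * dp
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not a multiple of "
            f"micro_batch_size * DP = {per_step}"
        )
    return global_batch_size // per_step


# Same GBS at two scales: more DP ranks means fewer accumulation steps.
print(grad_accum_steps(global_batch_size=64, micro_batch_size=1, dp=4))
print(grad_accum_steps(global_batch_size=64, micro_batch_size=1, dp=16))
```

Keeping `global_batch_size` fixed while DP changes preserves the learning dynamics; only the accumulation count moves.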

Common Pitfalls to Warn About

Pitfall	Symptom	Fix
TP > num_kv_heads	Crash: "TP must divide num_query_groups"	Reduce TP to a divisor of num_kv_heads
PP without VP	Poor throughput (large bubble)	Set `virtual_pipeline_model_parallel_size`
EP too low for large MoE	OOM on expert params	Increase EP; each EP rank holds num_experts / EP experts
CUDA graphs + packed sequences	Assert: "CUDA graph accepts only Tensor inputs"	Disable packing or use `local` full-iteration graphs
CUDA graphs + full recompute	Assert: "full recompute only with full iteration CUDA graph"	Disable recompute or switch to `local` impl
`use_te_rng_tracker` not set	Assert on provider init when CUDA graphs enabled	Set `cfg.model.use_te_rng_tracker = True` and `cfg.rng.te_rng_tracker = True`
FSDP + TP > 1 on H100	Possible comm bottleneck	Prefer FSDP with TP=1 or TP=2 on H100; FSDP shines on GB/B-series
Long context without CP	OOM on activations	Add CP=2/4/8; use the `*_16k`, `*_64k`, or `*_128k` recipe variants
MoE `overlap_grad_reduce` on H100	May hurt perf (False in many H100 presets)	Set `overlap_grad_reduce=False` for MoE on H100
VLM SFT missing image data	Runs but produces garbage	Provide actual multimodal dataset or use mock VLM data
Qwen35-VL MoE FSDP	Tested on Blackwell only	May not work on H100; validate first
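
A few of these pitfalls can be caught with a plain pre-flight check on the override values. The sketch below is hypothetical and operates on an ordinary dict rather than a Bridge config object. Key names follow the override fields shown in this document where available; `pipeline_model_parallel_size` and the 8192-token threshold for "long context" are assumptions.

```python
def preflight_warnings(cfg: dict) -> list[str]:
    """Return warnings for a few pitfalls from the table above."""
    warnings = []
    tp = cfg.get("tensor_model_parallel_size", 1)
    pp = cfg.get("pipeline_model_parallel_size", 1)  # assumed key name
    kv_heads = cfg.get("num_key_value_heads")

    if kv_heads is not None and kv_heads % tp != 0:
        warnings.append(f"TP={tp} does not divide num_key_value_heads={kv_heads}")
    if pp > 1 and cfg.get("virtual_pipeline_model_parallel_size") is None:
        warnings.append("PP > 1 without VP: expect a large pipeline bubble")
    if cfg.get("cuda_graph_impl") and not cfg.get("use_te_rng_tracker", False):
        warnings.append("CUDA graphs enabled but use_te_rng_tracker is not set")
    # 8192 is an assumed cutoff: the 8K recipes in the index run without CP.
    if cfg.get("seq_length", 0) > 8192 and cfg.get("context_parallel_size", 1) == 1:
        warnings.append("long sequence without CP: activations may OOM")
    return warnings


example = {
    "tensor_model_parallel_size": 16,
    "num_key_value_heads": 8,
    "pipeline_model_parallel_size": 4,
    "seq_length": 32768,
}
for message in preflight_warnings(example):
    print(message)
```

The check returns messages instead of raising so that all problems are reported in one pass.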

Recipe Override Examples

bash

# Scale Llama3 8B from 2 GPUs to 8 GPUs (increase DP)
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock

# Reduce parallelism for Qwen3-MoE 30B to fit on 4 GPUs
uv run python -m torch.distributed.run --nproc_per_node=4 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_sft_config \
    --dataset llm-finetune \
    'model.expert_model_parallel_size=4'

# Add long context to an existing recipe
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.seq_length=32768' \
    'model.context_parallel_size=4'

# Enable CUDA graphs on any recipe
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.cuda_graph_impl=transformer_engine' \
    'model.cuda_graph_scope=[attn,moe_router,moe_preprocess]' \
    'model.use_te_rng_tracker=True' \
    'rng.te_rng_tracker=True'

---

Quick Reference: Which Recipe for My Situation?

I want to...	Start with	GPUs needed
Try Bridge for the first time	`llama3_8b_sft_config` + mock data	2
Fine-tune a 7-8B model	`llama3_8b_sft_config` or `qwen3_8b_sft_config`	2–8
LoRA on 1 GPU	`llama3_8b_peft_config` or `qwen3_8b_peft_config`	1
Pretrain a dense 70B	`llama3_70b_pretrain_config`	32–64
Train a small MoE	`qwen3_30b_a3b_pretrain_config`	8
Train a large MoE (235B+)	`qwen3_235b_a22b_pretrain_config`	256–512
Benchmark throughput	Perf recipes via `run_script.py`	Varies
Long-context training	`llama3_8b_128k_pretrain_config` or add CP override	16+
VLM fine-tuning	`qwen3_vl_8b_sft_config` or `gemma3_vl_*_sft_config`	4–8
Diffusion training	`wan_1_3B_pretrain_config` or `flux_12b_pretrain_config`	8

Code Anchors

What	Path
Library recipes root	`src/megatron/bridge/recipes/`
Recipe `__init__.py` (all exports)	`src/megatron/bridge/recipes/__init__.py`
Common recipe helpers	`src/megatron/bridge/recipes/common.py`
Training entry point	`scripts/training/run_recipe.py`
Perf recipes root	`scripts/performance/`
Perf entry point	`scripts/performance/run_script.py`
Perf workload configs	`scripts/performance/configs/<family>/`
Perf overrides (benchmark defaults)	`scripts/performance/utils/overrides.py`
