recipe-recommender
Auto Recipe: Recipe Index & Recommendation
This skill indexes every shipped recipe and helps users pick the right starting
config, adjust parallelism, and avoid common pitfalls.
How to Use This Skill
- Ask the user for: model name/size, GPU count & type, training goal (pretrain / SFT / PEFT), and sequence length (if non-default).
- Look up the best-match recipe in the index below.
- Recommend the recipe function name + entry-point command.
- Provide adjustment advice (parallelism resizing, batch tuning, pitfalls).
Entry Points
Library recipes (functional training)
```bash
# Pretrain with mock data
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-pretrain-mock

# SFT with SQuAD
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-finetune

# Override any field via CLI
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.tensor_model_parallel_size=2' \
    'training.global_batch_size=64'
```
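When the recommendation is produced programmatically, the launch commands above can be assembled from their parts. The following is a minimal sketch, not part of the repository: the helper name `build_run_recipe_command` and its parameters are assumptions made for illustration, and only the command shape is taken from the examples above.

```python
import shlex
from typing import List, Mapping, Optional


def build_run_recipe_command(
    recipe: str,
    dataset: str,
    nproc_per_node: int = 8,
    overrides: Optional[Mapping[str, object]] = None,
) -> List[str]:
    """Assemble the library-recipe launch command as an argv list.

    Hypothetical helper: it only mirrors the command shape shown above.
    """
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be >= 1")
    argv = [
        "uv", "run", "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        "scripts/training/run_recipe.py",
        "--recipe", recipe,
        "--dataset", dataset,
    ]
    # Dotted-key overrides are appended as trailing key=value tokens.
    for key, value in (overrides or {}).items():
        argv.append(f"{key}={value}")
    return argv


cmd = build_run_recipe_command(
    "llama3_8b_pretrain_config",
    "llm-pretrain-mock",
    overrides={
        "model.tensor_model_parallel_size": 2,
        "training.global_batch_size": 64,
    },
)
print(shlex.join(cmd))  # shell-safe single-line form of the command
```

Returning an argv list rather than a string keeps quoting concerns out of the helper; `shlex.join` is used only for display.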
Performance recipes (throughput benchmarks)
```bash
python scripts/performance/run_script.py \
    --recipe <model_family> \
    --gpu_type h100 \
    --num_gpus 64 \
    --data mock
```
Perf recipes are NOT fully validated for correctness. Most validation and testing used mock data. They are designed for upper-bound throughput measurement, not production training. Always validate loss curves and convergence independently.
Recipe Unification (Coming Soon — PR #2803)
PR #2803 is unifying performance recipes into the same Python function format used by library recipes. Key changes:
- Perf recipes move from `scripts/performance/configs/` → `src/megatron/bridge/recipes/<family>/<model>_perf.py`
- Each perf recipe becomes a self-contained Python function (e.g. `llama3_8b_h100_bf16_pretrain_config()`)
- The old `WorkloadBaseConfig` → `set_workload_base_configs` → `get_perf_optimized_recipe` pipeline is removed
- Shared helpers: `_benchmark_common()` (50 iters, timing, TE RNG), `_perf_precision()` (bf16 / fp8_cs / fp8_mx / nvfp4)

Why Python, not YAML? Previous YAML-based approaches had problems: recipe logic was split across multiple indirection layers, configs were not self-contained, and the two-level pipeline made maintenance and debugging difficult. Python functions are explicit, greppable, and composable.

After #2803 lands, both library and perf recipes will be invocable through the same `run_recipe.py` entry point.
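To make the "explicit, greppable, and composable" point concrete, here is a rough sketch of what a self-contained recipe function could look like. Everything in it is hypothetical: the dataclasses are stand-ins rather than the library's `ConfigContainer`, and the function names, field names, and default values are invented for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class ModelSettings:  # hypothetical stand-in, not a library class
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    seq_length: int = 8192


@dataclass
class TrainSettings:  # hypothetical stand-in, not a library class
    global_batch_size: int = 128
    micro_batch_size: int = 1
    train_iters: int = 1000


@dataclass
class RecipeConfig:  # hypothetical stand-in for a config container
    model: ModelSettings = field(default_factory=ModelSettings)
    train: TrainSettings = field(default_factory=TrainSettings)
    precision: str = "bf16"


def example_8b_pretrain_config() -> RecipeConfig:
    """Base recipe: every setting is visible in one function body."""
    return RecipeConfig(model=ModelSettings(tensor_model_parallel_size=2))


def example_8b_h100_fp8_pretrain_config() -> RecipeConfig:
    """Perf variant composed from the base recipe plus explicit tweaks."""
    cfg = example_8b_pretrain_config()
    cfg.precision = "fp8_cs"
    cfg.train.train_iters = 50  # short benchmark run
    return cfg


perf_cfg = example_8b_h100_fp8_pretrain_config()
print(perf_cfg.precision, perf_cfg.train.train_iters)
```

Because the perf variant is an ordinary function that calls the base recipe, a search for a field name finds every place it is set, with no indirection layer in between.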
Library Recipe Index
All recipes live under `src/megatron/bridge/recipes/`. Each function returns a `ConfigContainer` with model, training, optimizer, and data settings.
Llama
| Recipe | Mode | TP | PP | CP | SP | GPUs (min) | Seq Len |
|---|---|---|---|---|---|---|---|
| | Pretrain | 2 | 1 | — | — | 2 | 4K |
| | Pretrain | 2 | 1 | — | ✓ | 2 | 8K |
| | Pretrain | 2 | 1 | 2 | ✓ | 4 | 16K |
| | Pretrain | 2 | 1 | 4 | ✓ | 8 | 64K |
| | Pretrain | 2 | 1 | 8 | ✓ | 16 | 128K |
| | Pretrain | 8 | 4 | — | ✓ | 32 | 8K |
| | Pretrain | 8 | 4 | 2 | ✓ | 64 | 16K |
| | Pretrain | 8 | 4 | 4 | ✓ | 128 | 64K |
| | Pretrain | 8 | 16 | — | ✓ | 128 | 8K |
| | SFT | 2 | 1 | — | ✓ | 2 | 8K |
| | SFT | 4 | 4 | — | ✓ | 16 | 8K |
| | SFT | 8 | 8 | — | ✓ | 64 | 8K |
| | PEFT | 1 | 1 | — | — | 1 | 8K |
| | PEFT | 2 | 4 | — | ✓ | 8 | 8K |
| | PEFT | 4 | 8 | — | ✓ | 32 | 8K |
Qwen2 / Qwen2.5
| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| | All | 1–8 | 1–4 | 500M, 1.5B, 7B, 14B, 32B, 72B |
| | All | 1–8 | 1–4 | 500M, 1.5B, 3B, 7B, 14B, 32B, 72B |
Qwen3 (Dense)
| Recipe | Mode | TP | PP | CP | Sizes |
|---|---|---|---|---|---|
| | Pretrain | 1–8 | 1–2 | — | 600M–32B |
| | SFT | 1–8 | 1–2 | — | 600M–32B |
| | SFT | 1 | 1 | 8 | 600M (128K seq) |
| | PEFT | 1 | 1 | — | 600M–32B |
Qwen3 MoE
| Recipe | Mode | TP | PP | EP | CP | GPUs |
|---|---|---|---|---|---|---|
| | Pretrain | 1 | 1 | 8 | — | 8 |
| | SFT | 1 | 1 | 8 | — | 8 |
| | PEFT | 1 | 1 | 1 | — | 1 |
| | Pretrain | 4 | 16 | 8 | 2 | 512+ |
| | SFT | 4 | 8 | 8 | — | 256 |
| | PEFT | 1 | 4 | 4 | — | 16 |
Qwen3-Next
| Recipe | Mode | TP | PP | EP |
|---|---|---|---|---|
| | Pretrain | 1 | 4 | 8 |
| | SFT | 1 | 2 | 8 |
| | PEFT | 1 | 1 | 4 |
DeepSeek
| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| | Pretrain | 1 | 1 | 8 | 8 |
| | Pretrain | 1 | 4 | 32 | 128 |
| | Pretrain | 2 | 16 | 64 | 2048 |
| | Pretrain | 2 | 8 | 32 | 256 |
GLM-4.5
| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| | Pretrain | 2 | 8 | 16 | 256 |
| | Pretrain | 1 | 4 | 8 | 32 |
| | SFT | 2 | 8 | 16 | 256 |
| | SFT | 1 | 4 | 8 | 32 |
| | PEFT | 2 | 4 | 4 | 32 |
| | PEFT | 1 | 2 | 4 | 8 |
Gemma
| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| | All | 2–8 | 1–2 | 2B, 9B, 27B |
| | All | 1 | 1 | 1B (32K seq) |
NemotronH / Nemotron
| Recipe | Mode | TP | PP | EP | Notes |
|---|---|---|---|---|---|
| | P/S/PEFT | 1–8 | 1–4 | — | Dense SSM-hybrid |
| | P/S/PEFT | varies | 1 | 8 | MoE + Mamba |
| | P/S/PEFT | 4 | 1 | 8 | MoE + Mamba, ~40% CUDA graph gain |
| | P/S/PEFT | varies | 1 | — | Dense |
Other Models
| Recipe | Mode | Notes |
|---|---|---|
| | All | MoE EP=8 |
| | All | MoE EP=8 |
| | SFT/PEFT | Dense |
| | All | MoE + FP8/MXFP8 variants |
| | All | MoE |
| | Pretrain | MLM/Bridge parity baseline |
| | Pretrain | TP=4, PP=8, VP=6 |
| | Pretrain | 1T MoE, TP=2 PP=16 EP=32 |
VLM Recipes
| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| | SFT/PEFT | 1–8 | 1–2 | — | 1–16 |
| | SFT/PEFT | 1–8 | 1–4 | — | 1–32 |
| | SFT/PEFT | 1–4 | 1–8 | 1–32 | 1–512 |
| | SFT/PEFT | varies | varies | varies | varies |
| | SFT/PEFT | 1 | 8 | 4–16 | 64–512 |
| | SFT/PEFT | 2–4 | 1 | — | 8 |
Diffusion Recipes
| Recipe | Mode | TP | CP |
|---|---|---|---|
| | P/SFT | 1 | 8 |
| | P/SFT | 2 | 4 |
| | P/SFT | 2 | 1 |
Performance Recipe Index
All perf recipes live under `scripts/performance/`. They are invoked via `run_script.py` and use `WorkloadBaseConfig` presets per GPU type.

Important: Perf recipes are designed for upper-bound throughput benchmarks, not production training. They run 50 iterations on mock data by default. Throughput numbers are aspirational targets, not validated convergence configs.
Llama 3 / 3.1
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Llama 3 8B | 8 | H100, B200, B300, GB200, GB300, R100 | CUDA graphs (local), FSDP on GB variants |
| Llama 3 70B | 64 | H100, B200, B300, GB200, GB300 | TP comm overlap (userbuffers), FSDP, CUDA graphs |
| Llama 3.1 405B | 128–1024 | H100, B200, B300, GB200, GB300 | TP+CP comm overlap (userbuffers), FSDP, heavy PP/VP |
SFT/LoRA variants also exist (e.g. 8B SFT with packed sequences, 70B SFT on 32 GPUs).
DeepSeek V3
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| DeepSeek V3 (671B MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | HybridEP dispatcher, MLA recompute, CUDA graphs (TE scoped) |
Qwen3 MoE
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | MoE alltoall/flex dispatcher |
| Qwen3 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | TP comm overlap, CUDA graphs, MoE a2a overlap |
| Qwen3-Next 80B-A3B | 64–128 | H100, B200, B300, GB200, GB300 | EP 64–128 |
Qwen3-VL
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3-VL 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | VLM + MoE |
| Qwen3-VL 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | VLM + MoE, TP comm overlap |
Kimi K2
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Kimi K2 (1T MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | Muon/Adam optimizer, HybridEP, pipeline layout helpers |
NemotronH
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Nemotron 3 Nano (30B MoE+Mamba) | 8–16 | H100, B200, B300, GB200, GB300 | TE CUDA graphs (attn+mamba+moe), HybridEP |
| Nemotron 3 Super | 64 | H100, B200, B300, GB200, GB300 | TE CUDA graphs, EP=64 |
| NemotronH 56B | 64 | H100, B200, B300 | TP=2–8, TE graphs (mamba+attn) |
GPT-OSS
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| GPT-OSS 120B | 64 | H100, B200, GB200 | EP=64, HybridEP on GB200 |
Recommendation Decision Tree
```text
User wants to train a model
│
├─ Know the model name?
│   ├─ Yes → Look up in Library Recipe Index above
│   │   ├─ Has a recipe for their size + mode? → Use it directly
│   │   └─ No exact match? → Use closest size, adjust parallelism
│   └─ No → Ask for model name, size, and HF model ID
│
├─ What's the training goal?
│   ├─ Pretrain → Use *_pretrain_config
│   ├─ SFT (full fine-tune) → Use *_sft_config
│   └─ PEFT (LoRA/DoRA) → Use *_peft_config (lowest GPU requirement)
│
├─ How many GPUs?
│   ├─ 1 GPU → Only PEFT recipes work (TP=1, PP=1)
│   ├─ 8 GPUs (1 node) → Most 8B–16B models, small MoE (EP=8)
│   ├─ 16–64 GPUs → 70B dense, medium MoE
│   └─ 128+ GPUs → 405B+, large MoE (DeepSeek V3, Kimi K2)
│
├─ Want throughput benchmarks?
│   ├─ Yes → Use perf recipes (scripts/performance/)
│   │   └─ ⚠️ These run on mock data for upper-bound perf only
│   └─ No → Use library recipes (scripts/training/run_recipe.py)
│
└─ Long context?
    ├─ > 8K → Need CP (context parallelism), check *_16k / *_64k / *_128k variants
    └─ ≤ 8K → Default recipes work
```
Adjustment Advice (When Recommending)
Parallelism Resizing Rules
When the user's GPU count differs from the recipe default:
- TP must divide `num_key_value_heads` (GQA constraint). E.g. if `num_key_value_heads=8`, valid TP = {1, 2, 4, 8}.
- TP should stay within a single node (NVLink). TP > 8 requires inter-node NVLink (e.g., GB200 NVL72).
- PP adds pipeline bubbles. Minimize PP; only increase when TP alone can't fit the model. Use VP (virtual pipeline) to mitigate bubble overhead.
- EP doesn't reduce dense-layer memory. Only expert parameters shard with EP. Shared attention/embeddings are replicated. For "OOM with MoE", increase EP first, not TP.
- SP should be True whenever TP > 1. It eliminates redundant activation copies and is essentially free.
- CP requires all-to-all or ring attention. Check `cp_comm_type`. For GQA models, `a2a+p2p` hierarchical CP allows CP > num_kv_heads.
- world_size = DP × TP × PP × CP × EP. DP is implicit. Make sure the product of explicit parallelisms divides your total GPU count.
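Two of the rules above, the TP divisibility check and the implicit DP size, are easy to verify before launching a job. The sketch below covers dense models only (TP, PP, CP); EP is deliberately left out, since how expert parallelism composes with the other dimensions depends on the framework's rank layout. The class and function names are hypothetical, not library APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenseLayout:
    """Explicit parallel sizes for a dense (non-MoE) model. Hypothetical type."""
    tp: int = 1
    pp: int = 1
    cp: int = 1


def data_parallel_size(world_size: int, layout: DenseLayout, num_kv_heads: int) -> int:
    """Return the implicit DP size, or raise if a sizing rule is violated."""
    if num_kv_heads % layout.tp != 0:
        raise ValueError(
            f"TP={layout.tp} must divide num_key_value_heads={num_kv_heads}"
        )
    model_parallel = layout.tp * layout.pp * layout.cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"TP*PP*CP={model_parallel} must divide world_size={world_size}"
        )
    return world_size // model_parallel


def sequence_parallel_recommended(layout: DenseLayout) -> bool:
    """SP is recommended whenever TP > 1."""
    return layout.tp > 1


# A TP=2 recipe moved from 2 to 8 GPUs: the extra GPUs become data parallelism.
print(data_parallel_size(8, DenseLayout(tp=2), num_kv_heads=8))  # 4
# Adding CP=4 on the same 8 GPUs uses all of them for model parallelism.
print(data_parallel_size(8, DenseLayout(tp=2, cp=4), num_kv_heads=8))  # 1
```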
Batch Size Tuning
- Start with the recipe's `micro_batch_size`. If OOM, reduce to 1.
- `global_batch_size` determines learning dynamics. Scale with DP: `GBS = micro_batch_size × DP × gradient_accumulation_steps`.
- For MoE, `micro_batch_size=1` is typical at scale.
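Rearranging the GBS relation above gives the number of gradient accumulation steps for a given layout. A small sketch follows; the function name is hypothetical and the numbers are illustrative only.

```python
def gradient_accumulation_steps(global_batch_size: int, micro_batch_size: int, dp: int) -> int:
    """Solve GBS = micro_batch_size * DP * steps for steps."""
    samples_per_step = micro_batch_size * dp
    if samples_per_step <= 0:
        raise ValueError("micro_batch_size and dp must be positive")
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not a multiple of "
            f"micro_batch_size*DP={samples_per_step}"
        )
    return global_batch_size // samples_per_step


# GBS=64 with micro_batch_size=1 on DP=4 needs 16 accumulation steps.
print(gradient_accumulation_steps(64, 1, 4))  # 16
```

Keeping `global_batch_size` fixed while DP changes means only the accumulation step count moves, which leaves the learning dynamics unchanged.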
Common Pitfalls to Warn About
| Pitfall | Symptom | Fix |
|---|---|---|
| TP > num_kv_heads | Crash: "TP must divide num_query_groups" | Reduce TP to a divisor of num_kv_heads |
| PP without VP | Poor throughput (large bubble) | Set |
| EP too low for large MoE | OOM on expert params | Increase EP; each EP rank holds num_experts/EP experts |
| CUDA graphs + packed sequences | Assert: "CUDA graph accepts only Tensor inputs" | Disable packing or use |
| CUDA graphs + full recompute | Assert: "full recompute only with full iteration CUDA graph" | Disable recompute or switch to |
| | Assert on provider init when CUDA graphs enabled | Set |
| FSDP + TP > 1 on H100 | Possible comm bottleneck | Prefer FSDP with TP=1 or TP=2 on H100; FSDP shines on GB/B-series |
| Long context without CP | OOM on activations | Add CP=2/4/8; use |
| MoE | May hurt perf (False in many H100 presets) | Set |
| VLM SFT missing image data | Runs but produces garbage | Provide actual multimodal dataset or use mock VLM data |
| Qwen35-VL MoE FSDP | Tested on Blackwell only | May not work on H100; validate first |
Recipe Override Examples
```bash
# Scale Llama3 8B from 2 GPUs to 8 GPUs (increase DP)
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock

# Reduce parallelism for Qwen3-MoE 30B to fit on 4 GPUs
uv run python -m torch.distributed.run --nproc_per_node=4 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_sft_config \
    --dataset llm-finetune \
    'model.expert_model_parallel_size=4'

# Add long context to an existing recipe
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.seq_length=32768' \
    'model.context_parallel_size=4'

# Enable CUDA graphs on any recipe
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.cuda_graph_impl=transformer_engine' \
    'model.cuda_graph_scope=[attn,moe_router,moe_preprocess]' \
    'model.use_te_rng_tracker=True' \
    'rng.te_rng_tracker=True'
```
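The dotted `section.field=value` tokens in the commands above address fields inside a nested config. How `run_recipe.py` actually parses them is not shown in this document, so the following is only an illustration of the idea on plain dictionaries; the function name and the value-parsing rule are assumptions.

```python
import ast
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply 'a.b.c=value' overrides to a nested dict in place (illustrative only)."""
    for item in overrides:
        dotted_key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(name, {})
        try:
            # Parse Python-style literals such as 2, 32768, or True.
            node[leaf] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Keep anything else (e.g. transformer_engine) as a plain string.
            node[leaf] = raw
    return config


config = {"model": {"seq_length": 8192, "context_parallel_size": 1}}
apply_overrides(
    config,
    [
        "model.seq_length=32768",
        "model.context_parallel_size=4",
        "model.cuda_graph_impl=transformer_engine",
        "rng.te_rng_tracker=True",
    ],
)
print(config)
```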
---
Quick Reference: Which Recipe for My Situation?
| I want to... | Start with | GPUs needed |
|---|---|---|
| Try Bridge for the first time | | 2 |
| Fine-tune a 7-8B model | | 2–8 |
| LoRA on 1 GPU | | 1 |
| Pretrain a dense 70B | | 32–64 |
| Train a small MoE | | 8 |
| Train a large MoE (235B+) | | 256–512 |
| Benchmark throughput | Perf recipes via | Varies |
| Long-context training | | 16+ |
| VLM fine-tuning | | 4–8 |
| Diffusion training | | 8 |
Code Anchors
| What | Path |
|---|---|
| Library recipes root | |
| Recipe | |
| Common recipe helpers | |
| Training entry point | |
| Perf recipes root | |
| Perf entry point | |
| Perf workload configs | |
| Perf overrides (benchmark defaults) | |