perf-moe-vlm-training
MoE VLM Training
Stable docs: @docs/training/moe-optimization.md
Card: @skills/perf-moe-vlm-training/card.yaml
FSDP vs 3D Parallel
| Approach | Strength | Best fit |
|---|---|---|
| FSDP | Simplest path to a working multimodal run | first bring-up, memory-first tuning, awkward PP boundaries |
| 3D parallel | Higher ceiling after tuning | stable models with a clean PP layout and time for deeper sweeps |
For MoE VLMs, the practical workflow is usually:
- get the first reliable run with FSDP
- stabilize real-data input, recompute, and memory behavior
- move to 3D parallel only if the throughput headroom is worth the extra work
Rounded Findings From Recent VLM Runs
Qwen3-VL class models
The main patterns were consistent across the tracker:
- FSDP on GB200-class systems can already reach healthy high-teens utilization with a comparatively simple setup
- B200 FSDP runs are viable, but more sensitive to recompute choice and frozen vision settings
- 3D parallel can recover to a similar or better operating point, but only after tuning MBS, recompute, and the real vision path together
Real data vs mock data
Mock-data VLM runs are not trustworthy performance proxies. In the experiments,
image-free mock runs looked closer to "roughly twice as fast" than "slightly
optimistic" when compared with real multimodal input.
Use real or realistic image payloads before drawing any conclusion about VLM
throughput.
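To make the gap concrete, here is a minimal sketch of how one might quantify how much a mock-data run overstates throughput. The function name and the sample numbers are hypothetical illustrations, not measurements from the runs described above.

```python
def mock_overstatement(mock_tokens_per_s: float, real_tokens_per_s: float) -> float:
    """Return how many times faster the mock run looks than the real-data run."""
    if mock_tokens_per_s <= 0 or real_tokens_per_s <= 0:
        raise ValueError("throughput values must be positive")
    return mock_tokens_per_s / real_tokens_per_s


# Hypothetical numbers chosen only to illustrate the "roughly twice as fast" pattern.
ratio = mock_overstatement(mock_tokens_per_s=20_000.0, real_tokens_per_s=10_000.0)
print(f"mock run looks {ratio:.1f}x faster than the real-data run")
```

A ratio near 1.0 would mean the mock run is a fair proxy; a ratio near 2.0 matches the pattern described above and means mock results should not be used for planning.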
Smaller multimodal MoE runs
The smaller Qwen3.5-style multimodal experiments reinforce the same lessons:
- HybridEP is a solid default on GB200
- TE-scoped CUDA graphs help once the training loop is stable
- larger MBS can pay off, but only if the vision encoder does not become the next bottleneck
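The last point can be checked with a simple time-share calculation. The sketch below assumes you already have per-component step timings from your own profiler; the component names and the millisecond values are hypothetical and only illustrate how the bottleneck can shift as MBS grows.

```python
def time_shares(component_ms: dict[str, float]) -> dict[str, float]:
    """Return each component's fraction of the total measured step time."""
    total = sum(component_ms.values())
    if total <= 0:
        raise ValueError("total time must be positive")
    return {name: ms / total for name, ms in component_ms.items()}


def bottleneck(component_ms: dict[str, float]) -> str:
    """Return the name of the component with the largest measured time."""
    return max(component_ms, key=component_ms.get)


# Hypothetical per-step timings in milliseconds at two micro-batch sizes.
mbs1 = {"vision_encoder": 120.0, "projector": 10.0, "decoder": 270.0}
mbs2 = {"vision_encoder": 300.0, "projector": 20.0, "decoder": 280.0}

for label, timings in (("MBS=1", mbs1), ("MBS=2", mbs2)):
    shares = time_shares(timings)
    print(label, "bottleneck:", bottleneck(timings), f"({shares[bottleneck(timings)]:.0%})")
```

In this made-up example the decoder dominates at the smaller MBS and the vision encoder takes over at the larger one, which is the situation where a larger MBS stops paying off.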
Decision Guide
Choose FSDP when
- you are bringing up a new VLM for the first time
- the model has awkward stage boundaries across embedding, vision, and decoder
- memory fit matters more than absolute throughput
- you may freeze the vision stack during decoder-focused tuning
Choose 3D parallel when
- the model is already stable under FSDP
- the PP layout is clear and repeatable
- you can sweep MBS, recompute, and CUDA-graph scope together
- the goal is best steady-state throughput, not easiest bring-up
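The two checklists can be read as a small decision function. The sketch below is one possible encoding, not an official rule: `RunContext`, its field names, and the choice to fall back to FSDP when the 3D-parallel conditions are only partly met are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunContext:
    """Hypothetical summary of where a VLM training effort currently stands."""
    first_bring_up: bool
    memory_fit_over_throughput: bool
    stable_under_fsdp: bool
    clean_pp_layout: bool
    can_sweep_jointly: bool  # MBS, recompute, and CUDA-graph scope swept together


def choose_parallelism(ctx: RunContext) -> str:
    """Return "FSDP" or "3D parallel" following the checklists above."""
    # Bring-up and memory-first work stay on the simpler path.
    if ctx.first_bring_up or ctx.memory_fit_over_throughput:
        return "FSDP"
    # 3D parallel is only worth it when every precondition holds.
    if ctx.stable_under_fsdp and ctx.clean_pp_layout and ctx.can_sweep_jointly:
        return "3D parallel"
    # Assumption: when in doubt, prefer the easier path.
    return "FSDP"


new_model = RunContext(True, True, False, False, False)
tuned_model = RunContext(False, False, True, True, True)
print(choose_parallelism(new_model))
print(choose_parallelism(tuned_model))
```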
Key Tuning Knobs
- Freeze the vision stack when appropriate: if the work is decoder-focused, freezing the vision side often gives a small but real throughput gain and reduces memory pressure.
- Sweep MBS aggressively: VLMs are more MBS-sensitive than text-only MoE runs because the vision path changes the compute-to-overhead balance.
- Prefer selective recompute once the model fits: full recompute is a useful bring-up tool, but selective recompute is usually the better steady state.
- Match CUDA-graph scope to the workload: `attn moe_router moe_preprocess` is the safer MoE default, while narrower scopes can still be useful for controlled experiments.
- Use ETP only when EP alone is insufficient: it can unlock a layout, but it also introduces more communication and more tuning surface.
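Because these knobs interact, it can help to enumerate the sweep explicitly rather than tuning one knob at a time. The following is a minimal sketch: the dictionary keys, the MBS values, and the empty scope meaning "no CUDA graphs" are hypothetical and do not correspond to flags of any particular framework; only the scope names `attn`, `moe_router`, and `moe_preprocess` come from the text above.

```python
from itertools import product

# Hypothetical sweep values; only the scope names come from the text above.
MBS_VALUES = (1, 2, 4)
RECOMPUTE_MODES = ("full", "selective")
CUDA_GRAPH_SCOPES = (
    (),                                        # assumed to mean "no CUDA graphs"
    ("attn",),                                 # a narrower scope for controlled experiments
    ("attn", "moe_router", "moe_preprocess"),  # the safer MoE default named above
)
FREEZE_VISION = (False, True)


def sweep_grid():
    """Enumerate every combination of the knobs as plain dicts."""
    return [
        {"mbs": mbs, "recompute": recompute, "cuda_graph_scope": scope, "freeze_vision": freeze}
        for mbs, recompute, scope, freeze in product(
            MBS_VALUES, RECOMPUTE_MODES, CUDA_GRAPH_SCOPES, FREEZE_VISION
        )
    ]


grid = sweep_grid()
print(len(grid), "candidate configs")
print(grid[0])
```

In practice the full grid is usually pruned, for example by dropping combinations that are already known not to fit in memory, before any runs are launched.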
Representative Config Families
FSDP-first GB200 path
```text
TP=1 CP=1 PP=1
EP sized to the expert topology, often large
Dispatcher: HybridEP on GB200-class systems
Recompute: start with full, then relax toward selective recompute
```
3D-parallel GB200 path
```text
TP=1 CP=1 PP=1 or modest PP
EP and ETP sized to the expert topology
Dispatcher: HybridEP
CUDA Graph: start narrow, then widen only after the real-data path is stable
```
Compatibility
| Feature | FSDP | 3D parallel |
|---|---|---|
| HybridEP on GB200 | strong default | strong default once topology is stable |
| CUDA graphs | useful after bring-up | useful, but more scope-sensitive |
| Freeze vision | natural fit | possible, but less often used as the headline perf path |
| Selective recompute | recommended | recommended |
Pitfalls
- Mock multimodal data is misleading: it can make the decoder look much healthier than the real end-to-end VLM path.
- The vision encoder can dominate unexpectedly: profile encoder, projector, and decoder separately before attributing everything to the dispatcher.
- Do not compare FSDP and 3D-parallel runs with different effective work: normalize by useful tokens and workload shape, not only by step time.
- ETP is not free: use it as a fit or topology tool, not as the default.
- Recompute and CUDA-graph choices are coupled: the setting that gets the model to fit is often not the setting that gives the best steady-state speed.
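The normalization pitfall is easy to illustrate with arithmetic. The sketch below assumes "useful tokens" means tokens that actually contribute to training (for example, excluding padding); the function name and all numbers are hypothetical.

```python
def tokens_per_s_per_gpu(useful_tokens_per_step: float, step_time_s: float, num_gpus: int) -> float:
    """Normalize throughput by useful work and hardware, not by step time alone."""
    if step_time_s <= 0 or num_gpus <= 0:
        raise ValueError("step_time_s and num_gpus must be positive")
    return useful_tokens_per_step / (step_time_s * num_gpus)


# Hypothetical runs: run B has the shorter step but processes fewer useful tokens per step.
run_a = tokens_per_s_per_gpu(useful_tokens_per_step=1_000_000, step_time_s=10.0, num_gpus=64)
run_b = tokens_per_s_per_gpu(useful_tokens_per_step=500_000, step_time_s=6.0, num_gpus=64)

print(f"run A: {run_a:.1f} tokens/s/GPU")
print(f"run B: {run_b:.1f} tokens/s/GPU")
```

Here run B looks faster if only step time is compared (6 s against 10 s), yet run A does more useful work per GPU-second, which is why step time alone is not a fair basis for comparing FSDP and 3D-parallel runs.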