perf-moe-vlm-training

MoE VLM Training

Stable docs: @docs/training/moe-optimization.md
Card: @skills/perf-moe-vlm-training/card.yaml

FSDP vs 3D Parallel

| Approach | Strength | Best fit |
| --- | --- | --- |
| FSDP | Simplest path to a working multimodal run | First bring-up, memory-first tuning, awkward PP boundaries |
| 3D parallel | Higher ceiling after tuning | Stable models with a clean PP layout and time for deeper sweeps |

For MoE VLMs, the practical workflow is usually:

1. Get the first reliable run with FSDP.
2. Stabilize real-data input, recompute, and memory behavior.
3. Move to 3D parallel only if the throughput headroom is worth the extra work.

Rounded Findings From Recent VLM Runs

Qwen3-VL class models

The main patterns were consistent across the tracked runs:

- FSDP on GB200-class systems can already reach healthy utilization in the high teens (percent) with a comparatively simple setup.
- B200 FSDP runs are viable, but more sensitive to the recompute choice and to whether the vision stack is frozen.
- 3D parallel can recover to a similar or better operating point, but only after tuning micro-batch size (MBS), recompute, and the real vision path together.

Real data vs mock data

Mock-data VLM runs are not trustworthy performance proxies. In the experiments, image-free mock runs looked closer to "roughly twice as fast" than "slightly optimistic" when compared with real multimodal input.

Use real or realistic image payloads before drawing any conclusion about VLM throughput.

Smaller multimodal MoE runs

The smaller Qwen3.5-style multimodal experiments reinforce the same lessons:

- HybridEP is a solid default on GB200.
- TE-scoped CUDA graphs help once the training loop is stable.
- Larger MBS can pay off, but only if the vision encoder does not become the next bottleneck.
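
To check whether the vision encoder has become the next bottleneck, it helps to time each stage separately. The sketch below is one minimal way to do that in plain Python. The `StageTimer` class and the stage names are hypothetical and not part of any training framework. For GPU work, the optional `sync` callable is assumed to be something like `torch.cuda.synchronize`, so that asynchronous kernels finish before each timestamp is taken.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self, sync=None):
        # `sync` runs before each clock read. For GPU stages, pass a device
        # synchronization function so queued kernels are included in the time.
        self._sync = sync or (lambda: None)
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        self._sync()
        start = time.perf_counter()
        try:
            yield
        finally:
            self._sync()
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}

    def bottleneck(self):
        """Return the stage with the largest accumulated time, or None."""
        if not self.totals:
            return None
        return max(self.totals, key=self.totals.get)
```

A typical use wraps the encoder, projector, and decoder calls in `with timer.stage("vision_encoder"):` style blocks over several real-data steps, then compares `timer.shares()` before and after raising MBS.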

Decision Guide

Choose FSDP when

- you are bringing up a new VLM for the first time
- the model has awkward stage boundaries across embedding, vision, and decoder
- memory fit matters more than absolute throughput
- you may freeze the vision stack during decoder-focused tuning

Choose 3D parallel when

- the model is already stable under FSDP
- the PP layout is clear and repeatable
- you can sweep MBS, recompute, and CUDA-graph scope together
- the goal is best steady-state throughput, not easiest bring-up
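
The two checklists above can be read as a small decision rule. The sketch below encodes one possible reading. The `RunContext` fields and the precedence are assumptions made for illustration: any FSDP trigger wins, and 3D parallel requires all of its conditions at once. The guide itself does not state a precedence, and the vision-freezing bullet is left out because it is a preference rather than a hard condition.

```python
from dataclasses import dataclass


@dataclass
class RunContext:
    """Hypothetical summary of where a project stands."""
    first_bring_up: bool
    awkward_stage_boundaries: bool
    memory_fit_is_priority: bool
    stable_under_fsdp: bool
    clear_pp_layout: bool
    can_sweep_jointly: bool  # MBS, recompute, and CUDA-graph scope together
    goal_is_peak_throughput: bool


def choose_strategy(ctx: RunContext) -> str:
    """Return "fsdp" or "3d_parallel" under the assumed precedence."""
    # Assumed rule: any single FSDP trigger is enough to stay on FSDP.
    if (ctx.first_bring_up
            or ctx.awkward_stage_boundaries
            or ctx.memory_fit_is_priority):
        return "fsdp"
    # Assumed rule: 3D parallel needs every precondition to hold.
    ready_for_3d = (ctx.stable_under_fsdp
                    and ctx.clear_pp_layout
                    and ctx.can_sweep_jointly
                    and ctx.goal_is_peak_throughput)
    return "3d_parallel" if ready_for_3d else "fsdp"
```

Defaulting to FSDP when the 3D preconditions are incomplete matches the workflow of moving to 3D parallel only once the extra tuning work is justified.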

Key Tuning Knobs

- Freeze the vision stack when appropriate: if the work is decoder-focused, freezing the vision side often gives a small but real throughput gain and reduces memory pressure.
- Sweep MBS aggressively: VLMs are more MBS-sensitive than text-only MoE runs because the vision path changes the compute-to-overhead balance.
- Prefer selective recompute once the model fits: full recompute is a useful bring-up tool, but selective recompute is usually the better steady state.
- Match CUDA-graph scope to the workload: `attn moe_router moe_preprocess` is the safer MoE default, while narrower scopes can still be useful for controlled experiments.
- Use ETP only when EP alone is insufficient: it can unlock a layout, but it also introduces more communication and more tuning surface.
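
Because these knobs interact, a joint sweep is easier to manage when the grid is written down explicitly. The sketch below enumerates such a grid in plain Python. The specific MBS values, the `("attn",)` narrow scope, and the dictionary keys are hypothetical choices for illustration, not settings taken from the experiments or from any framework's flag names.

```python
from itertools import product

# Hypothetical sweep points; replace with values that fit the model.
MBS_VALUES = [1, 2, 4]
RECOMPUTE_MODES = ["full", "selective"]
CUDA_GRAPH_SCOPES = [
    ("attn",),                                  # assumed narrower scope
    ("attn", "moe_router", "moe_preprocess"),   # safer MoE default from the text
]
FREEZE_VISION = [False, True]


def build_sweep():
    """Enumerate every knob combination as one dict per trial."""
    trials = []
    for mbs, recompute, scope, freeze in product(
            MBS_VALUES, RECOMPUTE_MODES, CUDA_GRAPH_SCOPES, FREEZE_VISION):
        trials.append({
            "mbs": mbs,
            "recompute": recompute,
            "cuda_graph_scope": " ".join(scope),
            "freeze_vision": freeze,
        })
    return trials


if __name__ == "__main__":
    for trial in build_sweep():
        print(trial)
```

Each trial should still be run on real multimodal data, since mock runs can overstate throughput.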

Representative Config Families

FSDP-first GB200 path

TP=1  CP=1  PP=1
EP sized to the expert topology, often large
Dispatcher: HybridEP on GB200-class systems
Recompute: start with full, then relax toward selective recompute

3D-parallel GB200 path

TP=1  CP=1  PP=1 or modest PP
EP and ETP sized to the expert topology
Dispatcher: HybridEP
CUDA Graph: start narrow, then widen only after the real-data path is stable
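
The config families above can be captured in a small structure that also sanity-checks a layout against the GPU count. This is a sketch under stated assumptions: `ParallelLayout` is a hypothetical helper, and the divisibility rule for EP (that EP must divide the data-parallel size) is a simplification. The real constraint depends on the framework and on how ETP is laid out, so treat the check as a first filter only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical container for the parallelism settings listed above."""
    tp: int = 1
    cp: int = 1
    pp: int = 1
    ep: int = 1
    etp: int = 1
    dispatcher: str = "HybridEP"
    recompute: str = "full"

    def data_parallel_size(self, world_size: int) -> int:
        """GPUs left for data parallelism after TP, CP, and PP."""
        model_parallel = self.tp * self.cp * self.pp
        if world_size % model_parallel != 0:
            raise ValueError(
                f"world_size={world_size} is not divisible by "
                f"TP*CP*PP={model_parallel}")
        return world_size // model_parallel

    def check(self, world_size: int) -> None:
        """Raise ValueError if the layout fails the simplified rules."""
        dp = self.data_parallel_size(world_size)
        # Simplifying assumption: expert-parallel groups are carved out of
        # the data-parallel ranks. ETP is not validated here.
        if dp % self.ep != 0:
            raise ValueError(
                f"EP={self.ep} does not divide data-parallel size {dp}")


# Illustrative values only; EP=64 and world_size=64 are assumptions.
fsdp_first = ParallelLayout(ep=64, recompute="full")
fsdp_first.check(world_size=64)
```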

Compatibility

| Feature | FSDP | 3D parallel |
| --- | --- | --- |
| HybridEP on GB200 | Strong default | Strong default once topology is stable |
| CUDA graphs | Useful after bring-up | Useful, but more scope-sensitive |
| Freeze vision | Natural fit | Possible, but less often used as the headline perf path |
| Selective recompute | Recommended | Recommended |

Pitfalls

- Mock multimodal data is misleading: it can make the decoder look much healthier than the real end-to-end VLM path.
- The vision encoder can dominate unexpectedly: profile encoder, projector, and decoder separately before attributing everything to the dispatcher.
- Do not compare FSDP and 3D-parallel runs with different effective work: normalize by useful tokens and workload shape, not only by step time.
- ETP is not free: use it as a fit or topology tool, not as the default.
- Recompute and CUDA-graph choices are coupled: the setting that gets the model to fit is often not the setting that gives the best steady-state speed.
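
The normalization pitfall is easy to illustrate with arithmetic. The sketch below compares two runs by useful tokens per second per GPU instead of raw step time. The function name, the definition of "useful tokens" (non-padding text plus vision tokens), and all numbers are hypothetical and chosen only to show how the ranking can flip.

```python
def useful_tokens_per_second_per_gpu(useful_tokens_per_step, step_time_s, num_gpus):
    """Throughput normalized by useful work and by GPU count.

    `useful_tokens_per_step` is assumed to count only non-padding text and
    vision tokens processed in one optimizer step.
    """
    if step_time_s <= 0 or num_gpus <= 0:
        raise ValueError("step_time_s and num_gpus must be positive")
    return useful_tokens_per_step / step_time_s / num_gpus


# Made-up runs: the second has the shorter step but does less useful work.
fsdp_run = useful_tokens_per_second_per_gpu(400_000, step_time_s=2.0, num_gpus=64)
three_d_run = useful_tokens_per_second_per_gpu(240_000, step_time_s=1.5, num_gpus=64)

print(f"FSDP:        {fsdp_run:.0f} tokens/s/GPU")     # 3125
print(f"3D parallel: {three_d_run:.0f} tokens/s/GPU")  # 2500
```

In this made-up case the 3D-parallel run has the faster step but the lower normalized throughput, which is the kind of mismatch that a step-time-only comparison hides.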
