nemo-mbridge-perf-moe-hardware-configs

MoE Hardware Configuration Reference

Stable docs: @docs/training/moe-optimization.md
Card: @skills/nemo-mbridge-perf-moe-hardware-configs/card.yaml

Quick Platform Playbook

| Platform | Typical MoE strategy | What usually matters most |
| --- | --- | --- |
| H100 | DeepEP + stronger PP + moderate TP | communication overlap and PP efficiency |
| B200 | DeepEP + MXFP8 + careful PP layout | container quality and tuned comm settings |
| GB200 | HybridEP + partial CUDA graphs + CPU cleanup | host overhead, topology-aware dispatch, memory headroom |
| GB300 | HybridEP + newer FP8 and kernel stack | same GB200 playbook, usually with a higher ceiling |

Rounded Performance Bands

These are intentionally rounded so the document stays durable as the tracker moves. Treat them as planning ranges, not exact promises.

| Workload family | Hardware | Typical band | Representative shape |
| --- | --- | --- | --- |
| DSV3, large-scale | H100 | low-to-mid hundreds TFLOPS/GPU, high-teens MFU | TP2, EP64, PP8, DeepEP |
| DSV3, large-scale | B200 | high-hundreds TFLOPS/GPU, mid-teens MFU | TP1, EP32, PP8, DeepEP |
| DSV3, large-scale | GB200 | around 1K TFLOPS/GPU, low-20s MFU | TP1, EP64, PP4, HybridEP |
| DSV3, large-scale | GB300 | above the GB200 band, often mid-20s MFU | TP1, EP64, PP4, HybridEP |
| Qwen3 235B | H100 | low-300s TFLOPS/GPU, around 30% MFU | TP2, EP32, PP8, DeepEP |
| Qwen3 235B | GB200 | high-hundreds TFLOPS/GPU in tuned runs | TP1 or TP2, EP32-64, PP4, HybridEP |
| Qwen3 30B | H100 | low-200s TFLOPS/GPU | TP1, EP8, PP1, DeepEP |
| Qwen3-Next 80B | GB200 | low-300s TFLOPS/GPU in BF16-class runs | TP1, EP32, PP2, HybridEP |

Representative Config Families

DSV3 on H100

```text
Dispatcher: DeepEP
TP=2  EP=64  PP=8  VPP=4
Routing: force balance
Recompute: light-to-moderate selective recompute
Priority: overlap communication and keep PP efficient
```

DSV3 on B200

```text
Dispatcher: DeepEP
TP=1  EP=32  PP=8  VPP=2 or similar
Precision: MXFP8-class
Recompute: selective recompute around MLA up-projection and MLP-side modules
Priority: container quality, PP layout, and DeepEP SMS tuning
```

DSV3 on GB200 or GB300

```text
Dispatcher: HybridEP
TP=1  EP=64  PP=4  VPP=4
Precision: MXFP8-class
CUDA Graph: attn + moe_router + moe_preprocess
Priority: HybridEP, CPU optimization, and graph-friendly static shapes
```

Qwen3 235B on H100

```text
Dispatcher: DeepEP
TP=2  EP=32  PP=8  VPP=4
Recompute: norm and activation-side selective recompute
Priority: communication overlap and router-path cleanup
```

Qwen3 235B on GB200

```text
Dispatcher: HybridEP
TP=1 or 2  EP=32 to 64  PP=4
CUDA Graph: attn + moe_router + moe_preprocess
Recompute: moe_act, mlp, or norm depending on memory pressure
Priority: balance throughput against memory headroom
```

Qwen3-Next 80B on GB200

```text
Dispatcher: HybridEP
TP=1  EP=32  PP=2  VPP around 4
CUDA Graph: attn + moe_router + moe_preprocess
Priority: pipeline layout and grouped GEMM quality
```
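
Before launching any of the shapes above, it can help to check that the TP/EP/PP combination actually tiles the job's GPU count. The sketch below assumes the Megatron-style decomposition in which the dense layers satisfy world = TP x CP x DP x PP and the expert layers satisfy world = ETP x EP x EDP x PP. The helper names, the default ETP=1 and CP=1, and the 1024-GPU job size are illustrative assumptions, not values taken from the tracker.

```python
from math import lcm


def min_world_size(tp: int, ep: int, pp: int, etp: int = 1, cp: int = 1) -> int:
    """Smallest GPU count on which both the dense and expert groupings fit."""
    return pp * lcm(tp * cp, etp * ep)


def parallel_layout(world_size: int, tp: int, ep: int, pp: int,
                    etp: int = 1, cp: int = 1) -> dict:
    """Derive data-parallel (dp) and expert-data-parallel (edp) sizes.

    Assumes world = TP*CP*DP*PP for dense layers and
    world = ETP*EP*EDP*PP for expert layers.
    """
    dense_group = tp * cp * pp
    expert_group = etp * ep * pp
    if world_size % dense_group or world_size % expert_group:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"dense group {dense_group} and expert group {expert_group}"
        )
    return {"dp": world_size // dense_group, "edp": world_size // expert_group}


if __name__ == "__main__":
    # DSV3-on-H100 shape from above: TP=2, EP=64, PP=8.
    print(min_world_size(tp=2, ep=64, pp=8))          # 512
    # Hypothetical 1024-GPU job with the same shape.
    print(parallel_layout(1024, tp=2, ep=64, pp=8))   # {'dp': 64, 'edp': 2}
```

A shape that fails this check needs a different EP or PP before any dispatcher or recompute tuning is worth attempting.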

Cross-Cutting Patterns

PP layout

  • E = embedding
  • t = transformer
  • m = MTP
  • L = loss
  • | = stage boundary
The biggest platform difference is usually not just the dispatcher. It is the combination of dispatcher, PP shape, and whether VPP keeps each stage balanced.
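
One way to check stage balance is to count what each stage holds in a layout string written with the symbols above. The sketch below is a minimal example that assumes a flat layout string with no repetition shorthand. The example string and the function names are hypothetical and are not taken from a real recipe.

```python
LAYOUT_SYMBOLS = "EtmL"  # embedding, transformer, MTP, loss


def stage_contents(layout: str) -> list[dict[str, int]]:
    """Count each symbol per stage; stages are separated by '|'."""
    counts = []
    for stage in layout.split("|"):
        unknown = set(stage) - set(LAYOUT_SYMBOLS)
        if unknown:
            raise ValueError(f"unknown layout symbols: {sorted(unknown)}")
        counts.append({sym: stage.count(sym) for sym in LAYOUT_SYMBOLS})
    return counts


def transformer_imbalance(layout: str) -> int:
    """Difference between the heaviest and lightest stage, in transformer layers."""
    per_stage = [stage["t"] for stage in stage_contents(layout)]
    return max(per_stage) - min(per_stage)


if __name__ == "__main__":
    layout = "Ett|ttt|ttt|ttmL"  # hypothetical 4-stage layout
    for index, stage in enumerate(stage_contents(layout)):
        print(index, stage)
    print("imbalance:", transformer_imbalance(layout))  # imbalance: 1
```

Counting transformer layers alone understates the cost of the first and last stages, which also carry the embedding, MTP, and loss work, so a small nonzero imbalance in their favor can be intentional.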

Recompute strategy

| Memory pressure | Starting point |
| --- | --- |
| low | none or a very narrow selective set |
| moderate | moe_act, mlp, norm, or similar selective modules |
| high | model-specific up-projection plus selective MoE and MLP modules |
| extreme or long-context | full recompute only if the selective path still does not fit |
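
The table can be encoded as a small lookup so that recipes pick a starting point consistently. This is a sketch only: the function name, the returned keys, and the `up_proj_module` argument are hypothetical, and the module names are the ones listed in the table rather than a verified list of valid framework options.

```python
from typing import Optional


def recompute_plan(pressure: str, up_proj_module: Optional[str] = None) -> dict:
    """Map a memory-pressure level to a starting recompute setting."""
    if pressure == "low":
        return {"granularity": "none", "modules": []}
    if pressure == "moderate":
        return {"granularity": "selective", "modules": ["moe_act", "mlp", "norm"]}
    if pressure == "high":
        if up_proj_module is None:
            # The up-projection module is model-specific, so the caller supplies it.
            raise ValueError("high pressure needs a model-specific up-projection module")
        return {"granularity": "selective",
                "modules": [up_proj_module, "moe_act", "mlp"]}
    if pressure == "extreme":
        # Last resort: only after the selective path has been shown not to fit.
        return {"granularity": "full", "modules": []}
    raise ValueError(f"unknown memory pressure level: {pressure!r}")


if __name__ == "__main__":
    print(recompute_plan("moderate"))
    print(recompute_plan("high", up_proj_module="my_up_proj"))  # placeholder name
```

Treat the result as the first configuration to try, then narrow the module list once memory headroom is confirmed.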

Environment variables

```bash
# The two CUDA_DEVICE_MAX_CONNECTIONS lines are alternatives; set only one per run.
CUDA_DEVICE_MAX_CONNECTIONS=1
CUDA_DEVICE_MAX_CONNECTIONS=32   # common when EP overlap and CUDA graphs are combined
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_GRAPH_REGISTER=0
```
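
A launcher script can choose between the two `CUDA_DEVICE_MAX_CONNECTIONS` values in one place. The sketch below assumes the choice depends only on whether EP overlap and CUDA graphs are used together, as the comment above suggests. The function names are hypothetical, and the variables should be set before the process initializes CUDA.

```python
import os


def moe_env(ep_overlap_with_cuda_graphs: bool) -> dict[str, str]:
    """Build the environment block from the settings listed above."""
    return {
        "CUDA_DEVICE_MAX_CONNECTIONS": "32" if ep_overlap_with_cuda_graphs else "1",
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "NCCL_GRAPH_REGISTER": "0",
    }


def apply_env(env: dict[str, str]) -> None:
    """Apply the settings without overriding values the user already exported."""
    for key, value in env.items():
        os.environ.setdefault(key, value)


if __name__ == "__main__":
    apply_env(moe_env(ep_overlap_with_cuda_graphs=True))
    print(os.environ["CUDA_DEVICE_MAX_CONNECTIONS"])
```

Using `setdefault` keeps an explicit export in the job script authoritative, which makes per-run experiments easier to trace.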

CPU-side tuning

On GB200 and GB300, CPU affinity and general host-overhead cleanup can move the needle almost as much as a dispatcher swap. Treat them as first-class tuning work, not as afterthoughts.
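
As a starting point for affinity work, each local rank can be pinned to its own slice of cores. The sketch below splits the available cores into contiguous, equal slices, which is a simplifying assumption: it ignores NUMA topology, and a mapping that keeps each rank on the cores closest to its GPU is likely to do better. The function names are hypothetical, and `os.sched_setaffinity` is available on Linux only.

```python
import os
from typing import Optional


def cores_for_rank(local_rank: int, ranks_per_node: int, cores: list[int]) -> set[int]:
    """Return the contiguous slice of cores assigned to one local rank."""
    per_rank = len(cores) // ranks_per_node
    if per_rank == 0:
        raise ValueError("fewer cores than ranks on this node")
    start = local_rank * per_rank
    return set(cores[start:start + per_rank])


def pin_current_process(local_rank: int, ranks_per_node: int) -> Optional[set[int]]:
    """Pin the calling process; returns the chosen cores, or None if unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # non-Linux platform
    available = sorted(os.sched_getaffinity(0))
    chosen = cores_for_rank(local_rank, ranks_per_node, available)
    os.sched_setaffinity(0, chosen)
    return chosen


if __name__ == "__main__":
    # Pure mapping example: 8 cores shared by 4 local ranks.
    print(cores_for_rank(local_rank=1, ranks_per_node=4, cores=list(range(8))))  # {2, 3}
```

Measure step time before and after pinning, since a poor mapping can be worse than no pinning at all.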

Pitfalls

  1. Do not cargo-cult a tracker row: the winning config usually depends on routing mode, container, and PP layout as much as on hardware name.
  2. Container quality matters: large regressions can come from the software stack rather than the model recipe.
  3. VPP must be intentional: a bad VPP split can erase the gain from a better dispatcher.
  4. Compare absolute throughput, not only MFU: MFU can mislead when switching between BF16, FP8, and other precision modes.
  5. Force-balance routing is the safer benchmark default: keep routing mode fixed when comparing hardware or dispatcher stacks.
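
Pitfall 4 is easy to see with arithmetic: MFU is achieved throughput divided by a peak, and the peak changes with the precision mode. The sketch below uses nominal dense peaks for H100 SXM (roughly 989 TFLOPS for BF16 and 1979 TFLOPS for FP8) as assumptions. Check the datasheet for the exact SKU, and note that the 330 TFLOPS figure is illustrative rather than a tracker result.

```python
# Nominal dense peak TFLOPS per GPU; assumed values, verify for your SKU.
H100_PEAK_TFLOPS = {"bf16": 989.0, "fp8": 1979.0}


def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs utilization as a fraction of the chosen peak."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return achieved_tflops / peak_tflops


if __name__ == "__main__":
    achieved = 330.0  # illustrative TFLOPS/GPU
    for precision, peak in H100_PEAK_TFLOPS.items():
        print(f"{precision}: {mfu(achieved, peak):.1%}")
    # The same throughput reads as about 33% against the BF16 peak
    # and about 17% against the FP8 peak.
```

Because the denominator moves with the precision mode, report TFLOPS/GPU alongside MFU whenever a comparison crosses BF16 and FP8 runs.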