nemo-mbridge-perf-moe-hardware-configs

MoE Hardware Configuration Reference

Stable docs: @docs/training/moe-optimization.md Card: @skills/nemo-mbridge-perf-moe-hardware-configs/card.yaml

Quick Platform Playbook

| Platform | Typical MoE strategy | What usually matters most |
| --- | --- | --- |
| H100 | DeepEP + stronger PP + moderate TP | communication overlap and PP efficiency |
| B200 | DeepEP + MXFP8 + careful PP layout | container quality and tuned comm settings |
| GB200 | HybridEP + partial CUDA graphs + CPU cleanup | host overhead, topology-aware dispatch, memory headroom |
| GB300 | HybridEP + newer FP8 and kernel stack | same GB200 playbook, usually with a higher ceiling |

Rounded Performance Bands

These are intentionally rounded so the document stays durable as the tracker moves. Treat them as planning ranges, not exact promises.

| Workload family | Hardware | Typical band | Representative shape |
| --- | --- | --- | --- |
| DSV3, large-scale | H100 | low-to-mid hundreds TFLOPS/GPU, high-teens MFU | TP2, EP64, PP8, DeepEP |
| DSV3, large-scale | B200 | high-hundreds TFLOPS/GPU, mid-teens MFU | TP1, EP32, PP8, DeepEP |
| DSV3, large-scale | GB200 | around 1K TFLOPS/GPU, low-20s MFU | TP1, EP64, PP4, HybridEP |
| DSV3, large-scale | GB300 | above the GB200 band, often mid-20s MFU | TP1, EP64, PP4, HybridEP |
| Qwen3 235B | H100 | low-300s TFLOPS/GPU, around 30% MFU | TP2, EP32, PP8, DeepEP |
| Qwen3 235B | GB200 | high-hundreds TFLOPS/GPU in tuned runs | TP1 or TP2, EP32-64, PP4, HybridEP |
| Qwen3 30B | H100 | low-200s TFLOPS/GPU | TP1, EP8, PP1, DeepEP |
| Qwen3-Next 80B | GB200 | low-300s TFLOPS/GPU in BF16-class runs | TP1, EP32, PP2, HybridEP |

Representative Config Families

DSV3 on H100

```text
Dispatcher: DeepEP
TP=2  EP=64  PP=8  VPP=4
Routing: force balance
Recompute: light-to-moderate selective recompute
Priority: overlap communication and keep PP efficient
```

DSV3 on B200

```text
Dispatcher: DeepEP
TP=1  EP=32  PP=8  VPP=2 or similar
Precision: MXFP8-class
Recompute: selective recompute around MLA up-projection and MLP-side modules
Priority: container quality, PP layout, and DeepEP SM-count tuning
```

DSV3 on GB200 or GB300

```text
Dispatcher: HybridEP
TP=1  EP=64  PP=4  VPP=4
Precision: MXFP8-class
CUDA Graph: attn + moe_router + moe_preprocess
Priority: HybridEP, CPU optimization, and graph-friendly static shapes
```

Qwen3 235B on H100

```text
Dispatcher: DeepEP
TP=2  EP=32  PP=8  VPP=4
Recompute: norm and activation-side selective recompute
Priority: communication overlap and router-path cleanup
```

Qwen3 235B on GB200

```text
Dispatcher: HybridEP
TP=1 or 2  EP=32 to 64  PP=4
CUDA Graph: attn + moe_router + moe_preprocess
Recompute: moe_act, mlp, or norm depending on memory pressure
Priority: balance throughput against memory headroom
```

Qwen3-Next 80B on GB200

```text
Dispatcher: HybridEP
TP=1  EP=32  PP=2  VPP around 4
CUDA Graph: attn + moe_router + moe_preprocess
Priority: pipeline layout and grouped GEMM quality
```
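
The TP, EP, and PP values in the families above imply a minimum GPU count. The sketch below works through that arithmetic. It assumes Megatron-style folding, in which the attention ranks (TP × CP × DP) and the expert ranks (ETP × EP × EDP) of one pipeline stage cover the same GPUs. The defaults `etp=1` and `cp=1` are assumptions rather than values from this document, and the function names are hypothetical.

```python
from math import lcm


def min_gpus(tp: int, ep: int, pp: int, etp: int = 1, cp: int = 1) -> int:
    """Smallest GPU count that satisfies both the attention and expert layouts.

    Assumption: each pipeline stage must be divisible by tp*cp (attention side)
    and by etp*ep (expert side), so the smallest stage is their lcm.
    """
    per_stage = lcm(tp * cp, etp * ep)
    return per_stage * pp


def data_parallel_sizes(world: int, tp: int, ep: int, pp: int,
                        etp: int = 1, cp: int = 1) -> tuple[int, int]:
    """Return (attention DP, expert DP) for a given world size."""
    attn_group = tp * cp * pp
    expert_group = etp * ep * pp
    if world % attn_group or world % expert_group:
        raise ValueError(f"world size {world} does not fit TP/EP/PP layout")
    return world // attn_group, world // expert_group


if __name__ == "__main__":
    # Shapes taken from the config families above.
    print("DSV3 on H100:", min_gpus(tp=2, ep=64, pp=8))
    print("DSV3 on B200:", min_gpus(tp=1, ep=32, pp=8))
    print("DSV3 on GB200:", min_gpus(tp=1, ep=64, pp=4))
    print("DP sizes at 1024 GPUs:", data_parallel_sizes(1024, tp=2, ep=64, pp=8))
```

Under these assumptions, a shape with a smaller EP × PP product fits on fewer GPUs, which is one reason the same model uses different shapes on different platforms.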

Cross-Cutting Patterns

PP layout

Layout symbols: `E` = embedding, `t` = transformer, `m` = MTP, `L` = loss, `|` = stage boundary

The biggest platform difference is usually not just the dispatcher. It is the combination of dispatcher, PP shape, and whether VPP keeps each stage balanced.
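
One way to check whether a layout keeps stages balanced is to count the symbols in each stage. The sketch below does this for a flat layout string built only from the symbols `E`, `t`, `m`, `L`, and `|`. The example string is hypothetical, any repetition shorthand a real layout syntax may support is not handled, and each `|`-separated chunk is assumed to be one stage (or one virtual stage when VPP is used).

```python
from collections import Counter

ALLOWED = set("EtmL")


def stage_summary(layout: str) -> list[dict[str, int]]:
    """Count layer symbols in each '|'-separated stage of a flat layout string."""
    stages = []
    for chunk in layout.split("|"):
        unknown = set(chunk) - ALLOWED
        if unknown:
            raise ValueError(f"unsupported symbols in layout: {sorted(unknown)}")
        stages.append(dict(Counter(chunk)))
    return stages


def transformer_imbalance(layout: str) -> int:
    """Difference between the largest and smallest transformer-layer count per stage."""
    counts = [stage.get("t", 0) for stage in stage_summary(layout)]
    return max(counts) - min(counts)


if __name__ == "__main__":
    layout = "Ett|ttt|ttt|ttmL"  # hypothetical 4-stage layout
    for index, stage in enumerate(stage_summary(layout)):
        print(f"stage {index}: {stage}")
    print("transformer-layer imbalance:", transformer_imbalance(layout))
```

In this hypothetical layout the first and last stages carry one fewer transformer layer to offset the embedding, MTP, and loss work. Whether that trade is right for a given model is something to measure, not assume.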

Recompute strategy

| Memory pressure | Starting point |
| --- | --- |
| low | none or a very narrow selective set |
| moderate | `moe_act`, `mlp`, `norm`, or similar selective modules |
| high | model-specific up-projection plus selective MoE and MLP modules |
| extreme or long-context | full recompute only if the selective path still does not fit |
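
The table can be read as a small decision rule. The sketch below encodes it, using a hypothetical helper. `<model_up_proj>` is a placeholder for the model-specific up-projection module, whose real name is not given here, and the module lists are candidates to choose from, not sets to enable wholesale.

```python
# Candidate selective-recompute modules per pressure level, from the table above.
# "<model_up_proj>" is a placeholder, not a real module name.
STARTING_POINTS = {
    "low": [],
    "moderate": ["moe_act", "mlp", "norm"],
    "high": ["<model_up_proj>", "moe_act", "mlp"],
}


def recompute_plan(pressure: str, selective_fits: bool = True) -> tuple[str, list[str]]:
    """Return (mode, candidate modules) as a starting point for a pressure level."""
    if pressure == "extreme":
        # Full recompute is the fallback only when the selective path does not fit.
        if not selective_fits:
            return ("full", [])
        pressure = "high"
    if pressure not in STARTING_POINTS:
        raise ValueError(f"unknown pressure level: {pressure!r}")
    modules = STARTING_POINTS[pressure]
    return ("selective" if modules else "none", list(modules))


if __name__ == "__main__":
    for level in ("low", "moderate", "high", "extreme"):
        print(level, "->", recompute_plan(level))
    print("extreme, selective does not fit ->", recompute_plan("extreme", selective_fits=False))
```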

Environment variables

```bash
CUDA_DEVICE_MAX_CONNECTIONS=1    # set one of these two values, not both
CUDA_DEVICE_MAX_CONNECTIONS=32   # common when EP overlap and CUDA graphs are combined
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_GRAPH_REGISTER=0
```
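
A launcher script can choose between the two `CUDA_DEVICE_MAX_CONNECTIONS` values programmatically. The sketch below assumes the two values are alternatives, with `32` used when EP overlap and CUDA graphs are combined and `1` otherwise. That reading, and the helper name `build_env`, are assumptions rather than something this document states.

```python
import os
from typing import Mapping, Optional


def build_env(ep_overlap_with_cuda_graphs: bool,
              base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return a copy of the environment with the variables listed above applied."""
    env = dict(os.environ if base is None else base)
    # Assumption: 32 when EP overlap and CUDA graphs are combined, 1 otherwise.
    env["CUDA_DEVICE_MAX_CONNECTIONS"] = "32" if ep_overlap_with_cuda_graphs else "1"
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    env["NCCL_GRAPH_REGISTER"] = "0"
    return env


if __name__ == "__main__":
    env = build_env(ep_overlap_with_cuda_graphs=True, base={})
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as `env=` to `subprocess.run` when starting the training command, which keeps the settings out of the parent shell.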

CPU-side tuning

On GB200 and GB300, CPU affinity and general host-overhead cleanup can move the needle almost as much as a dispatcher swap. Treat them as first-class tuning work, not as afterthoughts.
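
As one illustration of CPU affinity work, the sketch below pins each training process to its own slice of the cores visible to it. It is Linux-only (`os.sched_setaffinity`), splits cores evenly, and ignores NUMA topology, so treat it as a starting point: a real GB200 or GB300 node may need NUMA-aware binding through the launcher or `numactl`. The helper names are hypothetical.

```python
import os
from typing import Iterable


def cores_for_rank(local_rank: int, ranks_per_node: int, cores: Iterable[int]) -> set[int]:
    """Give each local rank an equal, contiguous slice of the available cores."""
    ordered = sorted(cores)
    per_rank = len(ordered) // ranks_per_node
    if per_rank == 0:
        raise ValueError("fewer cores than ranks on this node")
    if not 0 <= local_rank < ranks_per_node:
        raise ValueError("local_rank out of range")
    start = local_rank * per_rank
    return set(ordered[start:start + per_rank])


def pin_current_process(local_rank: int, ranks_per_node: int) -> set[int]:
    """Restrict the calling process to its slice of cores (Linux only)."""
    available = os.sched_getaffinity(0)
    chosen = cores_for_rank(local_rank, ranks_per_node, available)
    os.sched_setaffinity(0, chosen)
    return chosen


if __name__ == "__main__":
    # Pure calculation only; call pin_current_process() inside each worker.
    print(cores_for_rank(local_rank=1, ranks_per_node=4, cores=range(16)))
```

When launching with `torchrun`, the `LOCAL_RANK` environment variable can supply `local_rank`.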

Pitfalls

Do not cargo-cult a tracker row: the winning config usually depends on routing mode, container, and PP layout as much as on hardware name.
Container quality matters: large regressions can come from the software stack rather than the model recipe.
VPP must be intentional: a bad VPP split can erase the gain from a better dispatcher.
Compare absolute throughput, not only MFU: MFU can mislead when switching between BF16, FP8, and other precision modes.
Force-balance routing is the safer benchmark default: keep routing mode fixed when comparing hardware or dispatcher stacks.
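
The MFU pitfall comes from the denominator: MFU is achieved throughput divided by the peak for the precision in use, and FP8 peaks are roughly twice BF16 peaks. The sketch below uses hypothetical run numbers and assumed H100 dense peaks of about 989 TFLOPS (BF16) and 1979 TFLOPS (FP8). Check the datasheet for the actual part before relying on them.

```python
# Assumed dense peak TFLOPS per GPU, for illustration only.
PEAK_TFLOPS = {"bf16": 989.0, "fp8": 1979.0}


def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs utilization as a fraction of the given peak."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return achieved_tflops / peak_tflops


if __name__ == "__main__":
    # Hypothetical runs of the same model on the same hardware.
    runs = {"bf16": 400.0, "fp8": 550.0}
    for precision, tflops in runs.items():
        ratio = mfu(tflops, PEAK_TFLOPS[precision])
        print(f"{precision}: {tflops:.0f} TFLOPS/GPU, MFU {ratio:.1%}")
```

With these numbers the FP8 run is faster in absolute terms yet reports the lower MFU, which is why the comparison should start from TFLOPS/GPU.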
