perf-moe-hardware-configs

MoE Hardware Configuration Reference

Stable docs: @docs/training/moe-optimization.md
Card: @skills/perf-moe-hardware-configs/card.yaml

Quick Platform Playbook

| Platform | Typical MoE strategy | What usually matters most |
| --- | --- | --- |
| H100 | DeepEP + stronger PP + moderate TP | communication overlap and PP efficiency |
| B200 | DeepEP + MXFP8 + careful PP layout | container quality and tuned comm settings |
| GB200 | HybridEP + partial CUDA graphs + CPU cleanup | host overhead, topology-aware dispatch, memory headroom |
| GB300 | HybridEP + newer FP8 and kernel stack | same GB200 playbook, usually with a higher ceiling |

Rounded Performance Bands

These are intentionally rounded so the document stays durable as the tracker moves. Treat them as planning ranges, not exact promises.

| Workload family | Hardware | Typical band | Representative shape |
| --- | --- | --- | --- |
| DSV3, large-scale | H100 | low-to-mid hundreds TFLOPS/GPU, high-teens MFU | TP2, EP64, PP8, DeepEP |
| DSV3, large-scale | B200 | high-hundreds TFLOPS/GPU, mid-teens MFU | TP1, EP32, PP8, DeepEP |
| DSV3, large-scale | GB200 | around 1K TFLOPS/GPU, low-20s MFU | TP1, EP64, PP4, HybridEP |
| DSV3, large-scale | GB300 | above the GB200 band, often mid-20s MFU | TP1, EP64, PP4, HybridEP |
| Qwen3 235B | H100 | low-300s TFLOPS/GPU, around 30% MFU | TP2, EP32, PP8, DeepEP |
| Qwen3 235B | GB200 | high-hundreds TFLOPS/GPU in tuned runs | TP1 or TP2, EP32-64, PP4, HybridEP |
| Qwen3 30B | H100 | low-200s TFLOPS/GPU | TP1, EP8, PP1, DeepEP |
| Qwen3-Next 80B | GB200 | low-300s TFLOPS/GPU in BF16-class runs | TP1, EP32, PP2, HybridEP |

Representative Config Families

DSV3 on H100

```text
Dispatcher: DeepEP
TP=2  EP=64  PP=8  VPP=4
Routing: force balance
Recompute: light-to-moderate selective recompute
Priority: overlap communication and keep PP efficient
```
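
The shape above implies a minimum GPU count and a data-parallel size. The sketch below derives both under a Megatron-Core-style assumption that this reference does not state explicitly: ranks are split by PP first, EP groups live inside one pipeline stage, and expert tensor parallelism is 1. The function name and the 1024-GPU world size are illustrative only.

```python
def check_parallel_shape(world_size: int, tp: int, ep: int, pp: int) -> dict:
    """Derive per-stage sizes for a TP/EP/PP shape, or raise if it cannot fit.

    Assumption: EP groups sit inside a single pipeline stage and expert TP is 1.
    """
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by TP * PP")
    ranks_per_stage = world_size // pp
    if ranks_per_stage % ep != 0:
        raise ValueError("ranks per pipeline stage must be divisible by EP")
    return {
        "ranks_per_stage": ranks_per_stage,
        "dp": ranks_per_stage // tp,         # data-parallel size for dense layers
        "expert_dp": ranks_per_stage // ep,  # replicas of each expert shard
    }


# Hypothetical 1024-GPU job using the DSV3-on-H100 shape.
shape = check_parallel_shape(world_size=1024, tp=2, ep=64, pp=8)
print(shape)
```

Under these assumptions, EP=64 with PP=8 needs at least 512 GPUs, since each pipeline stage must hold one full EP group.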

DSV3 on B200

```text
Dispatcher: DeepEP
TP=1  EP=32  PP=8  VPP=2 or similar
Precision: MXFP8-class
Recompute: selective recompute around MLA up-projection and MLP-side modules
Priority: container quality, PP layout, and DeepEP SM-count tuning
```

DSV3 on GB200 or GB300

```text
Dispatcher: HybridEP
TP=1  EP=64  PP=4  VPP=4
Precision: MXFP8-class
CUDA Graph: attn + moe_router + moe_preprocess
Priority: HybridEP, CPU optimization, and graph-friendly static shapes
```

Qwen3 235B on H100

```text
Dispatcher: DeepEP
TP=2  EP=32  PP=8  VPP=4
Recompute: norm and activation-side selective recompute
Priority: communication overlap and router-path cleanup
```

Qwen3 235B on GB200

```text
Dispatcher: HybridEP
TP=1 or 2  EP=32 to 64  PP=4
CUDA Graph: attn + moe_router + moe_preprocess
Recompute: moe_act, mlp, or norm depending on memory pressure
Priority: balance throughput against memory headroom
```

Qwen3-Next 80B on GB200

```text
Dispatcher: HybridEP
TP=1  EP=32  PP=2  VPP around 4
CUDA Graph: attn + moe_router + moe_preprocess
Priority: pipeline layout and grouped GEMM quality
```

Cross-Cutting Patterns

PP layout

`E` = embedding, `t` = transformer, `m` = MTP, `L` = loss, `|` = stage boundary

The biggest platform difference is usually not just the dispatcher. It is the combination of dispatcher, PP shape, and whether VPP keeps each stage balanced.
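
To make the legend concrete, here is a minimal sketch that splits a flat layout string on `|` and reports how unevenly transformer layers are spread across stages. The layout string is hypothetical, the helper names are made up for illustration, and the parser handles only the five symbols in the legend: it does not model repetition syntax or how chunks map to virtual stages.

```python
from collections import Counter

LEGEND = {"E": "embedding", "t": "transformer", "m": "MTP", "L": "loss"}


def parse_pp_layout(layout: str) -> list[Counter]:
    """Split a layout string on '|' and count the components in each stage."""
    stages = []
    for index, chunk in enumerate(layout.split("|")):
        if not chunk:
            raise ValueError(f"stage {index} is empty")
        unknown = set(chunk) - set(LEGEND)
        if unknown:
            raise ValueError(f"stage {index}: unknown symbols {sorted(unknown)}")
        stages.append(Counter(LEGEND[symbol] for symbol in chunk))
    return stages


def transformer_imbalance(stages: list[Counter]) -> int:
    """Gap between the heaviest and lightest stage, in transformer layers."""
    counts = [stage["transformer"] for stage in stages]
    return max(counts) - min(counts)


# Hypothetical 4-stage layout: the first and last stages carry one fewer
# transformer layer to offset the embedding, MTP, and loss work.
stages = parse_pp_layout("Ett|ttt|ttt|ttmL")
print(len(stages), transformer_imbalance(stages))
```

A nonzero imbalance is not automatically a problem: giving the embedding and loss stages fewer transformer layers is one way to keep wall-clock time per stage even.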

Recompute strategy

| Memory pressure | Starting point |
| --- | --- |
| low | none or a very narrow selective set |
| moderate | `moe_act`, `mlp`, `norm`, or similar selective modules |
| high | model-specific up-projection plus selective MoE and MLP modules |
| extreme or long-context | full recompute only if the selective path still does not fit |
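
The table reads as an escalation ladder: try the cheapest option first and move down only when the run does not fit. A minimal sketch of that loop follows. The rung labels, the `up_projection` placeholder, and every number in the toy memory model are assumptions for illustration, not framework flags or measurements.

```python
from typing import Callable, Optional

# Escalation ladder from the table, cheapest first. Entries are descriptive
# labels; "up_projection" stands in for the model-specific module.
LADDER = [
    ("none", []),
    ("selective", ["moe_act", "mlp", "norm"]),
    ("selective+up_proj", ["up_projection", "moe_act", "mlp", "norm"]),
    ("full", ["full"]),
]


def first_fitting(fits: Callable[[list], bool]) -> Optional[str]:
    """Return the first rung whose module set fits, or None if nothing does."""
    for name, modules in LADDER:
        if fits(modules):
            return name
    return None


# Toy memory model with made-up numbers: each recomputed module frees some GiB.
SAVINGS_GIB = {"moe_act": 6, "mlp": 5, "norm": 2, "up_projection": 4, "full": 40}


def fits_in(budget_gib: float, baseline_gib: float = 90.0) -> Callable[[list], bool]:
    return lambda modules: baseline_gib - sum(SAVINGS_GIB[m] for m in modules) <= budget_gib


print(first_fitting(fits_in(80)))
```

In practice the `fits` predicate would be a real trial run or a memory estimate rather than a lookup table.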

Environment variables

```bash
# The two CUDA_DEVICE_MAX_CONNECTIONS lines are alternatives: set one, not both.
CUDA_DEVICE_MAX_CONNECTIONS=1
CUDA_DEVICE_MAX_CONNECTIONS=32   # common when EP overlap and CUDA graphs are combined
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_GRAPH_REGISTER=0
```
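
Because `CUDA_DEVICE_MAX_CONNECTIONS` appears with two values, a launcher has to pick one. The sketch below builds the environment from a single switch. Treating `1` as the fallback whenever EP overlap and CUDA graphs are not combined is an assumption inferred from the listing, and `moe_env` is a hypothetical helper name.

```python
import os


def moe_env(ep_overlap_with_cuda_graphs: bool) -> dict:
    """Return the environment overrides listed above as a plain dict."""
    return {
        # Assumption: 32 only when EP overlap and CUDA graphs are combined.
        "CUDA_DEVICE_MAX_CONNECTIONS": "32" if ep_overlap_with_cuda_graphs else "1",
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "NCCL_GRAPH_REGISTER": "0",
    }


# Merge over the current environment; pass `env` to subprocess.run(..., env=env)
# when launching the training command.
env = {**os.environ, **moe_env(ep_overlap_with_cuda_graphs=True)}
print(env["CUDA_DEVICE_MAX_CONNECTIONS"])
```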

CPU-side tuning

On GB200 and GB300, CPU affinity and general host-overhead cleanup can move the needle almost as much as a dispatcher swap. Treat them as first-class tuning work, not as afterthoughts.
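
As one example of what affinity work can look like, the sketch below gives each local rank an equal, non-overlapping slice of the cores the process is allowed to use, then pins the process to it. It is a simplification under stated assumptions: slices are contiguous by core index and ignore NUMA topology, the helper names are hypothetical, and `os.sched_setaffinity` is available on Linux only.

```python
import os


def cores_for_local_rank(local_rank: int, ranks_per_node: int, cores: list) -> set:
    """Give each local rank an equal, contiguous, non-overlapping slice of cores."""
    if not 0 <= local_rank < ranks_per_node:
        raise ValueError("local_rank out of range")
    per_rank = len(cores) // ranks_per_node
    if per_rank == 0:
        raise ValueError("fewer cores than ranks")
    start = local_rank * per_rank
    return set(cores[start:start + per_rank])


def pin_current_process(local_rank: int, ranks_per_node: int) -> set:
    """Pin this process to its slice of the currently allowed cores (Linux only)."""
    allowed = sorted(os.sched_getaffinity(0))
    chosen = cores_for_local_rank(local_rank, ranks_per_node, allowed)
    os.sched_setaffinity(0, chosen)
    return chosen


# Dry run with 16 hypothetical cores and 4 ranks per node; nothing is pinned here.
print(sorted(cores_for_local_rank(1, 4, list(range(16)))))
```

A production setup would usually map each rank to the cores closest to its GPU rather than slicing by index.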

Pitfalls

Do not cargo-cult a tracker row: the winning config usually depends on routing mode, container, and PP layout as much as on hardware name.
Container quality matters: large regressions can come from the software stack rather than the model recipe.
VPP must be intentional: a bad VPP split can erase the gain from a better dispatcher.
Compare absolute throughput, not only MFU: MFU can mislead when switching between BF16, FP8, and other precision modes.
Force-balance routing is the safer benchmark default: keep routing mode fixed when comparing hardware or dispatcher stacks.
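
The MFU pitfall is easy to see with arithmetic: the same measured throughput yields very different MFU depending on which peak it is divided by. The sketch below assumes rounded dense peak figures for H100 SXM taken from the vendor datasheet (about 989 TFLOPS for BF16 and about 1979 TFLOPS for FP8). The achieved value is hypothetical.

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs utilization: achieved throughput over the hardware peak."""
    return achieved_tflops / peak_tflops


# Assumed dense peaks for H100 SXM, rounded from the vendor datasheet.
H100_BF16_PEAK_TFLOPS = 989.0
H100_FP8_PEAK_TFLOPS = 1979.0

achieved = 320.0  # hypothetical measured TFLOPS/GPU

print(f"vs BF16 peak: {mfu(achieved, H100_BF16_PEAK_TFLOPS):.1%}")
print(f"vs FP8 peak:  {mfu(achieved, H100_FP8_PEAK_TFLOPS):.1%}")
```

The two results differ by about a factor of two for identical throughput, which is why TFLOPS/GPU is the safer number to compare across precision modes.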
