perf-moe-long-context

MoE Long-Context Training

Stable docs: @docs/training/moe-optimization.md
Card: @skills/perf-moe-long-context/card.yaml

What Changes At Long Context

Once sequence length moves well past the 4K-class regime, attention memory and activation residency become the dominant constraints. For MoE models, that usually means you need some combination of:
  • context parallelism
  • selective recompute
  • lower precision
  • CPU offload for optimizer state
  • a dispatcher and PP layout that do not waste the smaller remaining DP budget

Rounded Scaling Patterns

DSV3 on H100

The DSV3 long-context runs show a stable pattern:
  • selective recompute works better than full recompute once you move past the shortest contexts
  • throughput stays in a fairly narrow band from mid-length through very long contexts if CP is increased appropriately
  • the binding constraint shifts from "memory fit" to "GPU-count feasibility" as CP grows
In other words, long context does not immediately collapse utilization if the layout is chosen well, but it does consume the DP budget very quickly.
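The DP squeeze is easy to see with a little arithmetic. The sketch below assumes a hypothetical 1024-GPU cluster with TP=1 and PP=8, a 4096-token shard per CP rank, and the common convention that DP = GPUs / (TP × CP × PP). None of these numbers are taken from the DSV3 runs themselves, and `remaining_dp` is an illustrative helper, not a library function.

```python
def remaining_dp(num_gpus: int, tp: int, cp: int, pp: int) -> int:
    """Data-parallel size left after TP, CP, and PP claim their share of GPUs.

    Returns 0 when the layout does not tile the cluster evenly.
    """
    model_parallel = tp * cp * pp
    if num_gpus % model_parallel != 0:
        return 0
    return num_gpus // model_parallel


# Hypothetical cluster and layout, chosen only for illustration.
NUM_GPUS, TP, PP, SHARD_TOKENS = 1024, 1, 8, 4096

dp_by_seq_len = {}
for seq_len in (16_384, 32_768, 65_536, 131_072, 262_144):
    cp = seq_len // SHARD_TOKENS  # one 4096-token shard per CP rank
    dp_by_seq_len[seq_len] = remaining_dp(NUM_GPUS, TP, cp, PP)
    print(f"seq_len={seq_len:>7}  CP={cp:>2}  DP={dp_by_seq_len[seq_len]}")
```

Under these assumptions every doubling of sequence length halves DP, from 32 at 16K down to 2 at 256K, before EP is even considered.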

Qwen3-Next on GB200

Qwen3-Next behaves more like a memory-sensitive medium-scale model:
  • 8K and 32K remain practical with moderate CP
  • 64K is possible, but the throughput drop is noticeable and memory becomes much tighter
  • pipeline layout and grouped-GEMM improvements matter almost as much as CP

Qwen3 235B on GB200

Qwen3 235B shows that long context can still be efficient on NVL72 systems when TP, CP, and HybridEP are coordinated. The best 128K-class configurations are not just "fit-only" recipes; they can remain highly efficient if routing, parallelism, and recompute are balanced.

CP Sizing Rules Of Thumb

  1. Start from a 4K shard target: a good first guess is CP ~= seq_len / 4096, then round to a practical power-of-two layout.
  2. Keep DP alive if possible: long-context scaling becomes brittle once CP, EP, TP, and PP together squeeze DP down to the floor.
  3. Prefer selective recompute: recompute modules such as up_proj, norm, moe, moe_act, or mlp before reaching for full recompute.
  4. Avoid SDPA-heavy recompute at very long context: recomputing attention internals can add a lot of work for less memory benefit than recomputing smaller MoE and MLP-side modules.
  5. Use TP as another lever on NVL72 systems: GB200 and GB300 runs can sometimes trade some CP for TP while still staying efficient.
  6. Assume GBS will need to shrink: as CP rises and DP falls, you may need to reduce the global batch size (GBS) or accept more gradient accumulation (GA) steps.
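Rules 1 and 6 can be sketched in a few lines of Python. The helper names are hypothetical. Rounding up to the next power of two is one possible reading of "practical power-of-two layout" (it keeps each shard at or below the target). The GA formula assumes the usual relation GBS = MBS × DP × GA.

```python
def suggest_cp(seq_len: int, shard_tokens: int = 4096) -> int:
    """First-guess CP size: seq_len / shard_tokens, rounded up to a power of two."""
    if seq_len <= 0 or shard_tokens <= 0:
        raise ValueError("seq_len and shard_tokens must be positive")
    ratio = -(-seq_len // shard_tokens)  # ceiling division
    return 1 << (ratio - 1).bit_length()  # next power of two >= ratio


def grad_accum_steps(gbs: int, mbs: int, dp: int) -> int:
    """Gradient-accumulation steps implied by GBS = MBS * DP * GA."""
    if gbs % (mbs * dp) != 0:
        raise ValueError("GBS must be divisible by MBS * DP")
    return gbs // (mbs * dp)


cp_128k = suggest_cp(131_072)  # 131072 / 4096 = 32
cp_96k = suggest_cp(98_304)    # 24 rounds up to 32

# Hypothetical GBS of 1024 sequences with MBS=1: shrinking DP raises GA.
ga_at_dp32 = grad_accum_steps(gbs=1024, mbs=1, dp=32)
ga_at_dp4 = grad_accum_steps(gbs=1024, mbs=1, dp=4)
```

With these inputs, dropping DP from 32 to 4 raises GA from 32 to 256 steps at the same GBS, which is the pressure rule 6 describes. This first guess is only a starting point: per rule 5, some of that CP can be traded for TP on NVL72 systems.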

Representative Config Families

DSV3 at 128K on H100

TP=1  CP=32  EP=32  PP=8  VPP=4
Precision: FP8-class
Dispatcher: DeepEP
Recompute: up_proj, norm, moe, mlp
Extra memory help: optimizer CPU offload

DSV3 at 256K on H100

TP=1  CP=64  EP=32  PP=8  EDP=2  VPP=4
Precision: FP8-class
Dispatcher: DeepEP
Recompute: up_proj, norm, moe, mlp
Extra memory help: optimizer CPU offload

Qwen3 235B at 128K on GB200

TP=4  CP=4  EP=32  PP=4  VPP=12
Precision: BF16 or MXFP8
Dispatcher: HybridEP
Recompute: moe_act, norm
CUDA Graph: attn + moe_router + moe_preprocess
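As a rough sanity check on the three families above, the following sketch estimates the smallest GPU count each layout could tile. It rests on several assumptions not stated in the recipes: the dense layers occupy TP × CP × PP ranks per replica, the expert layers occupy EP × EDP × PP ranks with expert tensor parallelism of 1, EDP defaults to 1 when unlisted, and the cluster must be a common multiple of both. The `Layout` class is purely illustrative, and `math.lcm` requires Python 3.9 or newer.

```python
from dataclasses import dataclass
from math import lcm


@dataclass(frozen=True)
class Layout:
    name: str
    tp: int
    cp: int
    ep: int
    pp: int
    edp: int = 1  # assumed to be 1 when the recipe does not list it

    def dense_footprint(self) -> int:
        """Ranks needed by one replica of the dense (attention) layers."""
        return self.tp * self.cp * self.pp

    def expert_footprint(self) -> int:
        """Ranks needed by the expert layers, assuming expert TP = 1."""
        return self.ep * self.edp * self.pp

    def min_gpus(self) -> int:
        """Smallest GPU count that both footprints divide evenly."""
        return lcm(self.dense_footprint(), self.expert_footprint())


layouts = [
    Layout("DSV3 128K / H100", tp=1, cp=32, ep=32, pp=8),
    Layout("DSV3 256K / H100", tp=1, cp=64, ep=32, pp=8, edp=2),
    Layout("Qwen3 235B 128K / GB200", tp=4, cp=4, ep=32, pp=4),
]

for layout in layouts:
    print(f"{layout.name}: dense={layout.dense_footprint()} "
          f"expert={layout.expert_footprint()} min_gpus={layout.min_gpus()}")
```

The source does not report the GPU counts actually used, so treat the printed values as lower bounds under the stated assumptions rather than as the real job sizes.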

Recompute And CUDA Graph Guidance

For long-context MoE training:
  • start with selective recompute
  • add CUDA graphs only after the shapes and routing path are stable
  • keep sequence length and micro-batch size (MBS) fixed when using CUDA graphs
  • if the run depends on highly dynamic batches, prefer eager execution
Useful references:
  • @docs/training/activation-recomputation.md
  • @skills/perf-cuda-graphs/SKILL.md
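The static-shape condition in the guidance above can be checked before a run with a small guard like the one below. This is a minimal sketch: both function names are hypothetical, and it only inspects (micro-batch size, sequence length) pairs, so it says nothing about whether the routing path is stable.

```python
from typing import Iterable, Tuple


def shapes_are_static(batch_shapes: Iterable[Tuple[int, int]]) -> bool:
    """True when every (micro_batch_size, seq_len) pair is identical."""
    return len(set(batch_shapes)) == 1


def choose_execution_mode(batch_shapes: Iterable[Tuple[int, int]]) -> str:
    """Pick 'cuda_graph' only for fixed shapes; otherwise fall back to 'eager'."""
    return "cuda_graph" if shapes_are_static(batch_shapes) else "eager"


fixed = [(1, 131_072)] * 4
variable = [(1, 131_072), (1, 98_304), (1, 131_072)]

mode_fixed = choose_execution_mode(fixed)
mode_variable = choose_execution_mode(variable)
```

An empty shape list falls back to eager as well, which is the conservative choice when nothing is known about the batches.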

Pitfalls

  1. CP does not replace EP or PP: it adds another dimension; it does not make the others disappear.
  2. A good 4K baseline can still be a bad long-context baseline: routing mode, recompute choice, and offload strategy often need to change.
  3. GPU-count feasibility becomes the real constraint: a very long context can look fine in a single recipe, then become impossible once EP and PP are accounted for across the full model.
  4. CUDA graphs need static shapes: variable-length batches and opportunistic padding strategies can silently break the path.
  5. Container and kernel support matters more at 128K+: long-context paths tend to rely on newer kernels and more recent bug fixes than short-context bring-up does.