sglang-minimax-m2-series-optimization
SGLang MiniMax M2 Series Optimization
Overview
The skill covers the full MiniMax optimization ladder: mainline history, the remaining still-open upstream PR track, and current-main validation lanes. Use it to recover, extend, or audit MiniMax-specific optimizations, or to reuse the patterns on a structurally similar MoE model.
As of 2026-04-21, refreshed against SGLang `origin/main` commit `c122d343a`, the MiniMax story is split across three sources of truth:
- mainline history already present in `main`
- still-open upstream PRs that are important for MiniMax-M2.5, but not fully landed in `main` yet
- current registered docs/tests/workflows, especially the MiniMax-M2.7 AMD accuracy and performance lanes
This skill tracks all three, but it labels them clearly. Do not assume an optimization from a PR page is already in your local tree, and do not assume MiniMax-M2.7 or M2.7-highspeed is covered by MiniMax-M2.5 validation just because the same model file is used.
The historical evidence for every stage lives in:
- references/pr-history.md: mainline and still-open PR evidence, benchmark notes, key code patterns
- references/playbook.md: symptom mapping, commands, validation order
Before You Change Anything
Record the exact serving shape first (a minimal recording sketch follows the list):
- M2, M2.1, M2.5, or M2.7
- instruct or reasoning-style launch
- native, AWQ, FP8, FP4, ModelSlim, or other quant format
- TP / DP / EP / PP topology
- DP attention enabled or not
- DeepEP, FlashInfer, Triton, or other MoE / attention backend
- piecewise CUDA graph enabled or not
- speculative decoding or Eagle3 enabled or not
- NVIDIA, AMD, NPU, or other backend
- launch parser pair: `--tool-call-parser minimax-m2` and `--reasoning-parser minimax-append-think` when tool/thinking behavior matters
- exact registered suite, workflow job, or hardware lane used for validation
- QK normalization depends on how heads are partitioned or replicated
- M2.5 scale-out performance depends on communication strategy, not only kernels
- quantized checkpoints depend on exact loader conventions
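A minimal sketch of what recording the shape can look like in practice. The field names and values below are illustrative bookkeeping, not an SGLang API:

```python
# Hypothetical shape record for a single deployment; each field mirrors one
# bullet above. None of these keys are an SGLang API -- plain bookkeeping.
serving_shape = {
    "model": "MiniMaxAI/MiniMax-M2.5",        # M2, M2.1, M2.5, or M2.7
    "launch_style": "reasoning",              # instruct or reasoning-style
    "quant": "awq",                           # native, AWQ, FP8, FP4, ModelSlim, ...
    "topology": {"tp": 8, "dp": 1, "ep": 8, "pp": 1},
    "dp_attention": False,
    "moe_attention_backend": "deepep",        # DeepEP, FlashInfer, Triton, ...
    "piecewise_cuda_graph": True,
    "speculative": None,                      # None or "eagle3"
    "hardware": "nvidia",                     # NVIDIA, AMD, NPU, ...
    "tool_call_parser": "minimax-m2",         # --tool-call-parser
    "reasoning_parser": "minimax-append-think",  # --reasoning-parser
    "validation_lane": "test/registered/...",    # exact suite or workflow job
}
```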
Core Principle
Do not treat MiniMax as a generic DeepSeek-like MoE.
- MiniMax-M2 is a QK-normalized attention plus sparse-MoE story.
- MiniMax-M2.5 adds a much heavier distributed and quantized runtime story.
- MiniMax-M2.7 currently follows the MiniMax-M2-family model path but has its own AMD accuracy/performance lanes; treat it as a separate validation target.
- MiniMax-M2.7-highspeed is currently visible through upstream docs work, not a separate current-main runtime path.
- TP QK norm/all-reduce is already a two-generation mainline optimization: #16483 keeps the older RMSNormTP all-reduce aligned, and #20673 adds the fused JIT TP QK norm path behind `SGLANG_USE_FUSED_PARALLEL_QKNORM`.
- The most important distinctions are often not "model size" but:
  - whether attention TP equals model TP
  - whether KV heads are partitioned or replicated
  - whether MoE output should all-reduce, reduce-scatter, or stay fused for the next layer
The optimization order matters:
- confirm the loader and topology contract
- fix correctness in the MiniMax-specific path
- remove generic overhead in the hot path
- only then add deeper kernel or communication specialization
- validate on the exact topology that triggered the issue
What Transfers To Similar Models
Reuse this skill on a non-MiniMax model when it shares one or more of these traits:
- QK normalization whose cost is dominated by TP communication
- KV-head replication when `num_key_value_heads < tp_size`
- sparse MoE without DeepSeek-style shared experts
- topology-dependent attention groups where attention TP is not the same as model TP
- quantized MoE checkpoints that rely on packed or fused module mappings
Reuse the order of investigation and validation discipline, not the MiniMax-specific constants.
M2 Core Evolution Path
Use this path when the target is `MiniMaxAI/MiniMax-M2*` and the problem is mostly inside the core model path already on `main`.
Stage M2-0: Basic support exists, but the path is still naive
The model can launch, but the earliest support path is not yet optimized and may still miss MiniMax-specific surfaces.
- basic model registration and weight loading
- MiniMax-specific MoE, QK norm, and tool-call integration exist
- do not confuse "supported" with "optimized"
- `python/sglang/srt/models/minimax_m2.py` exists and is the active runtime path
- later performance or correctness work has a stable MiniMax-specific home
Stage M2-1: Fix RMSNorm precision before chasing speed
MiniMax QK normalization is numerically sensitive. Before deeper optimization, the norm path must accumulate safely; a minimal fp32-accumulation sketch follows the list below.
- prefer fp32 accumulation in the norm path
- treat QK norm correctness as a prerequisite for later TP work
- the norm path no longer relies on lower-precision accumulation where MiniMax accuracy is sensitive
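A minimal sketch of the fp32-accumulation discipline, assuming a standard RMSNorm shape; this is not the in-tree implementation:

```python
import torch

def rms_norm_fp32_acc(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Upcast before accumulating: the mean-of-squares is where bf16/fp16
    # accumulation loses the precision MiniMax QK norm is sensitive to.
    x_f32 = x.float()
    variance = x_f32.pow(2).mean(dim=-1, keepdim=True)
    normed = x_f32 * torch.rsqrt(variance + eps)
    # Cast back to the input dtype only after all fp32 math is done.
    return (normed * weight.float()).to(x.dtype)
```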
Stage M2-2: Expose Eagle3 capture and embedding surfaces
MiniMax needs to expose the same capture surfaces as other spec-decoding-capable models. Without them, speculative or auxiliary-hidden-state features fail even if base generation works; a sketch of both surfaces follows the list below.
- capture intermediate hidden states for selected layers
- expose `get_embed_and_head`
- keep the speculative-decoding surface area on the MiniMax model, not on ad hoc wrappers
- `set_eagle3_layers_to_capture(...)` works
- `get_embed_and_head()` exists and downstream speculative code can call it
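A minimal sketch of the two surfaces on a generic decoder module. The method names match the bullets above; everything else is illustrative, not the MiniMax model class:

```python
import torch
from torch import nn

class SpecDecodingSurfaces(nn.Module):
    # Illustrative container -- not the MiniMax model itself.
    def __init__(self, vocab_size: int, hidden_size: int, num_layers: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.num_layers = num_layers
        self.layers_to_capture: list[int] = []

    def set_eagle3_layers_to_capture(self, layer_ids: list[int]) -> None:
        # Record which layers should stash hidden states during forward so
        # Eagle3 can read them back; the forward-time capture is omitted here.
        self.layers_to_capture = [i for i in layer_ids if 0 <= i < self.num_layers]

    def get_embed_and_head(self) -> tuple[torch.Tensor, torch.Tensor]:
        # Downstream speculative code reads both weights from one place.
        return self.embed_tokens.weight, self.lm_head.weight
```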
Stage M2-3: Make the MoE path correct before making it faster
Before tuning kernels, MiniMax needs the right MoE contract. This includes correct DeepEP forward usage and removing unnecessary router-side work; a router sketch follows the list below.
- keep the DeepEP forward path aligned with MiniMax's expert layout
- do not add shared-expert logic that MiniMax does not use
- remove unnecessary router work by specializing the top-k sigmoid path
- the DeepEP MiniMax MoE path is functionally correct
- the router no longer spends time on generic work MiniMax does not need
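A sketch of the specialized sigmoid top-k routing with no shared-expert bookkeeping. Whether the selected weights are renormalized is an assumption to confirm against the model config:

```python
import torch

def topk_sigmoid_router(router_logits: torch.Tensor, top_k: int):
    # Score experts independently with a sigmoid instead of a softmax over
    # all experts, then keep only the top-k per token.
    scores = torch.sigmoid(router_logits.float())
    topk_weights, topk_ids = torch.topk(scores, top_k, dim=-1)
    # Assumed renormalization so per-token expert weights sum to 1.
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids
```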
Stage M2-4: Specialize the QK norm hot path
For MiniMax, QK normalization is a real decode hotspot. Once correctness is solid, the next gains come from fusing the TP-aware norm path instead of doing separate generic operations; see the gating sketch after this list.
- compute Q and K norm together
- keep TP-aware reduction in the same specialized path
- preserve the custom all-reduce fast path by keeping reduction buffers aligned
- prefer the fused JIT TP QK norm path on supported CUDA launches instead of stopping at the older in-model RMSNormTP path
- check `SGLANG_USE_FUSED_PARALLEL_QKNORM`, `fused_parallel_qknorm(...)`, and `CustomAllReduceV2` before claiming a missing all-reduce optimization
- `MiniMaxM2RMSNormTP` is the active per-layer QK norm implementation
- the reduction path consistently selects the fast aligned all-reduce path
- `MiniMaxM2QKRMSNorm` can use the fused TP QK norm custom op when the JIT path is enabled and supported
- focused validation exists in `python/sglang/jit_kernel/tests/test_tp_qknorm.py` and `python/sglang/jit_kernel/benchmark/bench_tp_qknorm.py`
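A gating sketch for the dispatch decision. Only the env-flag name comes from the source; `fused_op` stands in for the real `fused_parallel_qknorm(...)` entry point:

```python
import os
import torch

def _rms_norm(x: torch.Tensor, w: torch.Tensor, eps: float) -> torch.Tensor:
    xf = x.float()
    return (xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + eps) * w.float()).to(x.dtype)

def qk_norm(q, k, q_weight, k_weight, eps=1e-6, fused_op=None):
    # Prefer the fused JIT TP QK norm when the flag is set and the kernel is
    # available; otherwise fall back to two separate fp32-accumulating norms.
    if os.environ.get("SGLANG_USE_FUSED_PARALLEL_QKNORM") == "1" and fused_op is not None:
        return fused_op(q, k, q_weight, k_weight, eps)
    return _rms_norm(q, q_weight, eps), _rms_norm(k, k_weight, eps)
```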
Stage M2-5: Harden for piecewise CUDA graph and pipeline parallelism
Once the core hot paths are in place, MiniMax needs to remain usable under graph capture and PP partitioning.
- keep piecewise CUDA graph contexts correct around MoE expert-distribution recording
- propagate `pp_proxy_tensors`
- make weight loading layer-range aware under PP (a layer-range sketch follows this list)
Family-adjacent caveat:
- #18310 is for MiniMax-M2.1 and focuses on a `torch.compile` plus CUDA-graph crash. It is not the core M2 mainline optimization ladder, but it is worth borrowing if graph tracing regresses on a MiniMax-family branch.
- MiniMax can run under PP without wrapper gaps
- piecewise CUDA graph support does not regress the MiniMax-specific path
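A minimal sketch of layer-range-aware loading under PP; the weight-name pattern is an assumption about the checkpoint layout:

```python
import re

def layer_in_pp_range(weight_name: str, start_layer: int, end_layer: int) -> bool:
    # Each PP rank should only load the transformer layers it owns. Weights
    # without a layer index (embeddings, lm_head) are handled elsewhere.
    m = re.search(r"layers\.(\d+)\.", weight_name)
    if m is None:
        return True
    return start_layer <= int(m.group(1)) < end_layer
```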
M2.5 Extension Path
Use this path when the target is `MiniMaxAI/MiniMax-M2.5` or another later MiniMax-family checkpoint. Start from the M2 core stages above, then continue here.
Stage M25-0: Audit the loader contract already on `main`
M2.5 stresses loading and quantized checkpoint conventions much harder than the early M2 path.
- preserve `packed_modules_mapping` (a mapping sketch follows this list)
- preserve KV-cache scale remapping
- keep ModelSlim-specific layer assumptions consistent with MiniMax layout
- packed qkv and gate-up modules load correctly
- KV cache scales are not silently dropped
- ModelSlim quant layers do not assume a different MoE layout
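A sketch of the packed-module convention the loader has to preserve. The entries follow the common qkv/gate-up fusion and are assumptions to verify against the model file:

```python
# Checkpoint shards are fused into single runtime modules, so weight names
# must be remapped on load instead of matched one-to-one.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

def remap_packed(name: str):
    # Return (runtime_name, shard_name) for a checkpoint weight, if packed.
    for packed, shards in packed_modules_mapping.items():
        for shard in shards:
            if shard in name:
                return name.replace(shard, packed), shard
    return None  # not a packed module; load as-is
```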
Stage M25-1: Fill the remaining quantized-loader gaps not yet in `main`
Status: tracked upstream PR work; not fully present in `origin/main` commit `c122d343a` as of 2026-04-21.
Some M2.5 quantized checkpoints use fused expert naming that the current mainline loader still does not fully cover.
- support fused expert mappings such as `w13`
- prefer explicit fused mapping before falling back to older `w1/w2/w3` logic (ordering sketched below)
- add a focused weight-loading test when you port this work
- AWQ or similar M2.5 checkpoints with fused expert weights load without local remapping hacks
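A sketch of the lookup order, with the explicit fused `w13` entry checked before the older `w1/w2/w3` fallback. The exact name fragments are assumptions about the checkpoint, not confirmed loader constants:

```python
# Checked in order: an already-fused w13 tensor loads whole, while unfused
# w1/w3 tensors pack into halves of the fused module.
FUSED_EXPERT_MAPPING = [
    ("experts.w13_weight", "experts.w13_weight", None),  # fused gate+up
    ("experts.w1_weight", "experts.w13_weight", 0),      # gate half
    ("experts.w3_weight", "experts.w13_weight", 1),      # up half
    ("experts.w2_weight", "experts.w2_weight", None),    # down projection
]

def resolve_expert_weight(name: str):
    for fragment, target, shard_id in FUSED_EXPERT_MAPPING:
        if fragment in name:
            return name.replace(fragment, target), shard_id
    return None
```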
Stage M25-2: Make the scale-out runtime contract explicit
Status: partly on `main`, partly still tracked from upstream PR work as of `origin/main` commit `c122d343a` on 2026-04-21.
For M2.5, the next bottleneck is often not a single kernel. It is the distributed contract across PP, EP, DP, and DeepEP.
- keep PP support from the mainline path
- make DeepEP runtime requirements explicit, especially hidden-size and dtype expectations (a contract-check sketch follows this list)
- treat DP support and DP-attention support as separate stages
- PP launches correctly
- DeepEP no longer fails due to unsupported hidden size or dtype mismatch
- the runtime contract is written down for the exact TP / DP / EP / PP shape you care about
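A sketch of making the DeepEP contract explicit up front. The concrete dtype and divisibility constraints here are placeholders to read from the deployed DeepEP build, not documented limits:

```python
def check_deepep_contract(hidden_size: int, dtype: str) -> None:
    # Fail at launch with a readable message instead of deep inside a kernel.
    supported_dtypes = {"bfloat16"}            # placeholder constraint
    if dtype not in supported_dtypes:
        raise ValueError(f"DeepEP path expects {supported_dtypes}, got {dtype}")
    if hidden_size % 128 != 0:                 # placeholder constraint
        raise ValueError(f"DeepEP path assumes hidden_size % 128 == 0, got {hidden_size}")
```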
Stage M25-3: Add the DP-attention and DEP communication optimizations
Status: mixed. #20067 is part of `main`; #20489 and #20975 remain tracked as upstream PR work not fully present in `origin/main` commit `c122d343a` as of 2026-04-21.
This is the biggest M2.5 scale-out gap. Performance and correctness both depend on using the attention-TP group rather than blindly reusing the model-TP group.
- use the attention-TP group and rank instead of the global TP group in MiniMax attention (a head-sharding sketch follows this list)
- allow reduce-scatter after MoE when padding or DEP makes it profitable
- support FP4 all-gather when the communication path can quantize before transport
- allow all-reduce fusion between the MoE output and the next attention preparation
- guard zero-token and empty-batch paths
- DP-attention MiniMax uses attention-TP metadata consistently
- DEP no longer performs an unnecessary all-reduce plus slice
- empty-batch or high-rank edge cases no longer crash
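A sketch of the group contract: head sharding must follow the attention-TP group, whose size can differ from the model-TP size under DP attention. Group plumbing is elided; the arguments stand in for the real distributed helpers:

```python
def attn_head_range(num_heads: int, attn_tp_rank: int, attn_tp_size: int) -> range:
    # Partition heads across the attention-TP group only. Using the global
    # TP rank/size here is exactly the bug this stage removes.
    assert num_heads % attn_tp_size == 0
    per_rank = num_heads // attn_tp_size
    return range(attn_tp_rank * per_rank, (attn_tp_rank + 1) * per_rank)
```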
Stage M25-4: Replace the older TP QK norm path with the fused JIT version
The older QK norm path was already specialized; the newer mainline path pushes it further by moving to a fused JIT kernel that reuses custom all-reduce v2 more efficiently.
- fuse TP Q and K norm into one custom op
- keep a fallback path for unsupported environments
- add a dedicated benchmark and unit test with the PR
- the MiniMax path can use the fused TP QK norm custom op
- the fallback path is still available when the JIT kernel cannot run
Stage M25-5: Fix TP16 and replicated-KV-head correctness
When `num_key_value_heads < tp_size`, multiple TP ranks can share the same KV head. That means the K norm weights and reductions must follow the replica layout, not a naive full-TP assumption.
- shard norm weights by logical head replica
- reduce only across ranks that share the same head (a replica-group sketch follows this list)
- do not assume the full TP group is the correct reduction group
- high-TP MiniMax-M2.5 runs do not produce repeated or garbled output caused by incorrect K norm sharding
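A replica-group sketch, assuming the consecutive-rank replication layout; verify the actual head-partition code before reusing the grouping:

```python
def kv_replica_groups(num_kv_heads: int, tp_size: int) -> list[list[int]]:
    # With num_kv_heads < tp_size, each KV head is replicated across
    # tp_size // num_kv_heads ranks; K-norm reductions must stay inside one
    # replica group and never span the full TP group.
    assert tp_size % num_kv_heads == 0
    replicas = tp_size // num_kv_heads
    return [list(range(h * replicas, (h + 1) * replicas)) for h in range(num_kv_heads)]

# Example: kv_replica_groups(8, 16) -> [[0, 1], [2, 3], ..., [14, 15]]
```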
Stage M25-6: Optional NVFP4 fallback for non-Blackwell GPUs
Status: mainline as #19652 by `origin/main` commit `c122d343a` on 2026-04-21; not MiniMax-specific, but directly relevant to some MiniMax-M2.5 deployments.
If the target checkpoint is an NVFP4 MiniMax variant on A100, H100, A40, or another non-Blackwell GPU, the real blocker may be the generic FP4 Marlin fallback rather than MiniMax model code.
- keep weights compressed in FP4
- route unsupported native FP4 cases to the Marlin fallback (a dispatch sketch follows this list)
- preserve both linear and MoE fallback paths
- NVFP4 MiniMax-family checkpoints can run coherently on non-Blackwell GPUs without decompression hacks
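A dispatch sketch for the fallback decision, assuming Blackwell means compute capability 10.x; the backend names are illustrative:

```python
import torch

def fp4_backend() -> str:
    # Native FP4 tensor cores arrive with Blackwell (sm_100+); older GPUs
    # keep weights compressed in FP4 and route GEMMs through Marlin.
    major, _ = torch.cuda.get_device_capability()  # requires a CUDA device
    return "native-fp4" if major >= 10 else "marlin-fp4"
```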
M2.7 Current-Main Validation Path
Use this path when the target is `MiniMaxAI/MiniMax-M2.7`, `MiniMaxAI/MiniMax-M2.7-highspeed`, or when an AMD MiniMax change might affect the currently registered M2.7 lanes. Current main has explicit AMD accuracy and performance coverage for M2.7, while first-class M2.7 and M2.7-highspeed docs are tracked by upstream PR #20873.
Stage M27-0: Treat M2.7 as a separate later-family validation target
M2.7 currently reuses the MiniMax-M2-family runtime code, but the active registered tests are not just copies of M2.5. They launch `MiniMaxAI/MiniMax-M2.7` on AMD with TP8+EP8 and the aiter attention backend.
- keep the model-file assumptions from the M2/M2.5 ladder unless current code proves M2.7 has a new runtime path
- validate M2.7 separately from M2.5 on AMD when changing attention, MoE communication, loader, or aiter-related behavior
- use the registered M2.7 model path override `MINIMAX_M27_MODEL_PATH` for local mirrors
- preserve launch details from the current tests: `--tp 8`, `--ep-size 8`, `--attention-backend aiter`, `--mem-fraction-static 0.85`, `SGLANG_USE_AITER=1`, multithread loading, and a long watchdog timeout (a launch sketch follows this list)
- inspect both MI30x/MI325 and MI35x lanes because they use distinct registered suites
- test/registered/amd/accuracy/mi30x/test_minimax_m27_eval_amd.py
- test/registered/amd/perf/mi30x/test_minimax_m27_perf_amd.py
- test/registered/amd/accuracy/mi35x/test_minimax_m27_eval_mi35x.py
- test/registered/amd/perf/mi35x/test_minimax_m27_perf_mi35x.py
- .github/workflows/nightly-test-amd.yml
- .github/workflows/nightly-test-amd-rocm720.yml
- M2.7 accuracy and performance suites pass on the target AMD lane
- M2.5 and M2.7 failures are triaged independently
- docs do not become the only source of truth for M2.7 until a first-class M2.7 usage doc exists
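A minimal launch sketch assembled from the registered test details above; the registered suites additionally configure multithread loading and a long watchdog timeout, which are omitted here:

```python
import os
import subprocess

# Honor the registered model-path override for local mirrors.
model = os.environ.get("MINIMAX_M27_MODEL_PATH", "MiniMaxAI/MiniMax-M2.7")
cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", model,
    "--tp", "8",
    "--ep-size", "8",
    "--attention-backend", "aiter",
    "--mem-fraction-static", "0.85",
]
subprocess.run(cmd, env={**os.environ, "SGLANG_USE_AITER": "1"}, check=True)
```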
Open PR Radar
Check these active upstream tracks before designing a new MiniMax skill or declaring a gap:
- #22934: M2.5 EPLB bugfix for missing `routed_experts_weights_of_layer` on `MiniMaxM2ForCausalLM`.
- #22744: NVIDIA `--enable-tf32-matmul` support aimed at reducing FP32 gate GEMM cost for MiniMax-M2.5 decode.
- #22300: FP8 GEMM/deepgemm scale-format fix for fp16 MiniMax-M2.5 models.
- #23301: token-by-token streaming of MiniMax-M2 string tool-call parameters.
- #20873: docs-only M2.7 and M2.7-highspeed support.
- #22432 and #23190: NPU split-QKV, TP RMSNorm, RoPE, Eagle3, and DP-attention fixes for MiniMax2.
- #17826, #19468, #20031, #20489, and #20975: remaining distributed, DeepEP, quant-loader, and DP-attention cleanup tracks carried from the earlier M2.5 radar.
Investigation Order
When debugging a MiniMax issue, prefer this order:
- classify the exact runtime shape
- check whether the relevant optimization already exists on `main` (a history-check sketch follows this list)
- if not, check whether it lives in a still-open upstream PR
- only then decide whether to port, reimplement, or defer it
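A small history-check sketch for the second step, using plain git against the pinned remote; the grep pattern is whatever identifies the optimization (a PR number, an env flag, a kernel name):

```python
import subprocess

def on_main(pattern: str) -> bool:
    # True if any origin/main commit message mentions the pattern -- a cheap
    # first check before assuming the work only lives in an open PR.
    log = subprocess.run(
        ["git", "log", "--oneline", "origin/main", "--grep", pattern],
        capture_output=True, text=True, check=True,
    ).stdout
    return bool(log.strip())

# Example: on_main("SGLANG_USE_FUSED_PARALLEL_QKNORM") or on_main("#20673")
```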
For the supporting evidence and commands, use:
- references/playbook.md
- references/pr-history.md
Anti-Patterns
- Do not treat attention TP and model TP as automatically identical for MiniMax-M2.5.
- Do not optimize a generic MoE kernel first if the real problem is MiniMax loader or topology plumbing.
- Do not assume a TP-only text launch proves PP, DP, DP-attention, or DeepEP correctness.
- Do not bypass `packed_modules_mapping` or KV-scale remapping just to make one checkpoint load.
- Do not copy still-open upstream PR behavior into production without noting that it is not on `main` yet.
- Do not assume MiniMax-M2.7 is validated by passing only MiniMax-M2.5 tests; use the M2.7 AMD lanes when the change can affect that checkpoint.
- Do not omit `--tool-call-parser minimax-m2` or `--reasoning-parser minimax-append-think` when validating tool or reasoning behavior from the serving docs.
- Do not miss `SGLANG_USE_FUSED_PARALLEL_QKNORM` when benchmarking or diagnosing the current TP QK norm path.