sglang-minimax-m2-series-optimization

SGLang MiniMax M2 Series Optimization

Overview

The skill covers the full MiniMax optimization ladder: mainline history, the remaining still-open upstream PR track, and current-main validation lanes. Use it to recover, extend, or audit MiniMax-specific optimizations, or to reuse the patterns on a structurally similar MoE model.
As of 2026-04-21, refreshed against SGLang origin/main commit c122d343a, the MiniMax story is split across three sources of truth:
  • mainline history already present in main
  • still-open upstream PRs that are important for MiniMax-M2.5 but not fully landed in main yet
  • current registered docs/tests/workflows, especially the MiniMax-M2.7 AMD accuracy and performance lanes
This skill tracks all three, but it labels them clearly. Do not assume an optimization from a PR page is already in your local tree, and do not assume MiniMax-M2.7 or M2.7-highspeed is covered by MiniMax-M2.5 validation just because the same model file is used.
The historical evidence for every stage lives in:
  • references/pr-history.md: mainline and still-open PR evidence, benchmark notes, key code patterns
  • references/playbook.md: symptom mapping, commands, validation order

Before You Change Anything

Record the exact serving shape first:
  • M2, M2.1, M2.5, or M2.7
  • instruct or reasoning-style launch
  • native, AWQ, FP8, FP4, ModelSlim, or other quant format
  • TP / DP / EP / PP topology
  • DP attention enabled or not
  • DeepEP, FlashInfer, Triton, or other MoE / attention backend
  • piecewise CUDA graph enabled or not
  • speculative decoding or Eagle3 enabled or not
  • NVIDIA, AMD, NPU, or other backend
  • launch parser pair: --tool-call-parser minimax-m2 and --reasoning-parser minimax-append-think when tool/thinking behavior matters
  • exact registered suite, workflow job, or hardware lane used for validation
Why the shape matters:
  • QK normalization depends on how heads are partitioned or replicated
  • M2.5 scale-out performance depends on communication strategy, not only kernels
  • quantized checkpoints depend on exact loader conventions
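To make the record concrete, here is a minimal sketch of capturing the shape as data before touching code; the ServingShape dataclass and every field name in it are illustrative, not an SGLang API:

```python
from dataclasses import dataclass, field

# Hypothetical record type for the checklist above; not an SGLang API.
@dataclass
class ServingShape:
    model: str                      # "M2", "M2.1", "M2.5", or "M2.7"
    launch_style: str               # "instruct" or "reasoning"
    quant: str                      # "native", "AWQ", "FP8", "FP4", "ModelSlim", ...
    tp: int = 1
    dp: int = 1
    ep: int = 1
    pp: int = 1
    dp_attention: bool = False
    moe_backend: str = "triton"     # "deepep", "flashinfer", "triton", ...
    piecewise_cuda_graph: bool = False
    speculative: str | None = None  # e.g. "eagle3"
    hardware: str = "nvidia"        # "nvidia", "amd", "npu", ...
    extra_flags: list[str] = field(default_factory=list)

# Example: a tool-calling M2 launch, including the parser pair the serving
# docs require when tool/thinking behavior matters.
shape = ServingShape(
    model="M2",
    launch_style="instruct",
    quant="native",
    tp=8,
    moe_backend="deepep",
    extra_flags=[
        "--tool-call-parser", "minimax-m2",
        "--reasoning-parser", "minimax-append-think",
    ],
)
print(shape)
```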

Core Principle

Do not treat MiniMax as a generic DeepSeek-like MoE.
  • MiniMax-M2 is a QK-normalized attention plus sparse-MoE story.
  • MiniMax-M2.5 adds a much heavier distributed and quantized runtime story.
  • MiniMax-M2.7 currently follows the MiniMax-M2-family model path but has its own AMD accuracy/performance lanes; treat it as a separate validation target.
  • MiniMax-M2.7-highspeed is currently visible through upstream docs work, not a separate current-main runtime path.
  • TP QK norm/all-reduce is already a two-generation mainline optimization: #16483 keeps the older RMSNormTP all-reduce aligned, and #20673 adds the fused JIT TP QK norm path behind SGLANG_USE_FUSED_PARALLEL_QKNORM.
  • The most important distinctions are often not "model size" but the following (a KV-head layout sketch follows the numbered list below):
    • whether attention TP equals model TP
    • whether KV heads are partitioned or replicated
    • whether MoE output should all-reduce, reduce-scatter, or stay fused for the next layer
The optimization order matters:
  1. confirm the loader and topology contract
  2. fix correctness in the MiniMax-specific path
  3. remove generic overhead in the hot path
  4. only then add deeper kernel or communication specialization
  5. validate on the exact topology that triggered the issue
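For the KV-head distinction in particular, the layout falls out of two config numbers. A minimal sketch with a hypothetical helper, not SGLang code:

```python
# Illustrative only: how "partitioned vs replicated" falls out of two numbers.
def kv_head_layout(num_key_value_heads: int, tp_size: int) -> str:
    if num_key_value_heads >= tp_size:
        # Each rank owns a disjoint slice of KV heads.
        assert num_key_value_heads % tp_size == 0
        return f"partitioned: {num_key_value_heads // tp_size} KV head(s) per rank"
    # Fewer KV heads than ranks: each head is replicated across a group of
    # ranks, and any K-norm reduction must stay inside that replica group.
    assert tp_size % num_key_value_heads == 0
    return f"replicated: {tp_size // num_key_value_heads} rank(s) share each KV head"

print(kv_head_layout(8, 8))    # partitioned
print(kv_head_layout(8, 16))   # replicated; the TP16 case Stage M25-5 fixes
```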

What Transfers To Similar Models

Reuse this skill on a non-MiniMax model when it shares one or more of these traits:
  • QK normalization whose cost is dominated by TP communication
  • KV-head replication when num_key_value_heads < tp_size
  • sparse MoE without DeepSeek-style shared experts
  • topology-dependent attention groups where attention TP is not the same as model TP
  • quantized MoE checkpoints that rely on packed or fused module mappings
Reuse the order of investigation and validation discipline, not the MiniMax-specific constants.

M2 Core Evolution Path

Use this path when the target is MiniMaxAI/MiniMax-M2* and the problem is mostly inside the core model path already on main.

Stage M2-0: Basic support exists, but the path is still naive

The model can launch, but the earliest support path is not yet optimized and may still miss MiniMax-specific surfaces.
  • basic model registration and weight loading
  • MiniMax-specific MoE, QK norm, and tool-call integration exist
  • do not confuse "supported" with "optimized"
  • python/sglang/srt/models/minimax_m2.py exists and is the active runtime path
  • later performance or correctness work has a stable MiniMax-specific home

Stage M2-1: Fix RMSNorm precision before chasing speed

MiniMax QK normalization is numerically sensitive. Before deeper optimization, the norm path must accumulate safely.
  • prefer fp32 accumulation in the norm path
  • treat QK norm correctness as a prerequisite for later TP work
  • the norm path no longer relies on lower-precision accumulation where MiniMax accuracy is sensitive
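A minimal sketch of the accumulation discipline in generic PyTorch, not the SGLang kernel itself:

```python
import torch

# Minimal fp32-accumulating RMSNorm sketch. The point is the upcast: variance
# is computed in fp32 even for bf16/fp16 inputs, then cast back at the end.
def rms_norm_fp32_accum(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    orig_dtype = x.dtype
    x32 = x.to(torch.float32)
    variance = x32.pow(2).mean(dim=-1, keepdim=True)
    x32 = x32 * torch.rsqrt(variance + eps)
    # Cast back only after the precision-sensitive accumulation is done.
    return (x32 * weight.to(torch.float32)).to(orig_dtype)

q = torch.randn(4, 128, dtype=torch.bfloat16)
w = torch.ones(128, dtype=torch.bfloat16)
print(rms_norm_fp32_accum(q, w).dtype)  # torch.bfloat16
```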

Stage M2-2: Expose Eagle3 capture and embedding surfaces

MiniMax needs to expose the same capture surfaces as other spec-decoding-capable models. Without them, speculative or auxiliary-hidden-state features fail even if base generation works.
  • capture intermediate hidden states for selected layers
  • expose get_embed_and_head
  • keep the speculative-decoding surface area on the MiniMax model, not on ad hoc wrappers
  • set_eagle3_layers_to_capture(...) works
  • get_embed_and_head() exists and downstream speculative code can call it
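A shape-only sketch of what exposing those surfaces can look like on a model class; the real implementation lives in python/sglang/srt/models/minimax_m2.py, and these signatures are illustrative rather than copied from it:

```python
import torch
from torch import nn

# Toy model exposing the two capture surfaces named above; not SGLang code.
class MiniMaxLikeModel(nn.Module):
    def __init__(self, vocab: int = 32, hidden: int = 16, layers: int = 4):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(layers)])
        self.capture_layers: set[int] = set()
        self.captured: dict[int, torch.Tensor] = {}

    def set_eagle3_layers_to_capture(self, layer_ids: list[int]) -> None:
        self.capture_layers = set(layer_ids)

    def get_embed_and_head(self):
        # Downstream speculative code needs both tensors from the base model.
        return self.embed_tokens.weight, self.lm_head.weight

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.embed_tokens(ids)
        for i, layer in enumerate(self.layers):
            h = layer(h)
            if i in self.capture_layers:
                self.captured[i] = h  # auxiliary hidden states for Eagle3
        return self.lm_head(h)

m = MiniMaxLikeModel()
m.set_eagle3_layers_to_capture([1, 3])
_ = m(torch.tensor([[1, 2, 3]]))
print(sorted(m.captured))  # [1, 3]
```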

Stage M2-3: Make the MoE path correct before making it faster

Before tuning kernels, MiniMax needs the right MoE contract. This includes correct DeepEP forward usage and removing unnecessary router-side work.
  • keep the DeepEP forward path aligned with MiniMax's expert layout
  • do not add shared-expert logic that MiniMax does not use
  • remove unnecessary router work by specializing the top-k sigmoid path
  • the DeepEP MiniMax MoE path is functionally correct
  • the router no longer spends time on generic work MiniMax does not need
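A minimal sketch of a specialized sigmoid top-k router, shown to illustrate what the router can skip; this is not SGLang's kernel:

```python
import torch

# Sigmoid top-k routing: per-expert scores, no softmax pass over all experts,
# renormalization only over the selected experts, no shared-expert branch.
def topk_sigmoid_router(router_logits: torch.Tensor, top_k: int):
    scores = torch.sigmoid(router_logits)
    topk_scores, topk_ids = torch.topk(scores, top_k, dim=-1)
    topk_weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids

logits = torch.randn(4, 64)            # [tokens, num_experts]
w, ids = topk_sigmoid_router(logits, top_k=8)
print(w.shape, ids.shape)              # torch.Size([4, 8]) twice
```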

Stage M2-4: Specialize the QK norm hot path

For MiniMax, QK normalization is a real decode hotspot. Once correctness is solid, the next gains come from fusing the TP-aware norm path instead of doing separate generic operations.
  • compute Q and K norm together
  • keep TP-aware reduction in the same specialized path
  • preserve the custom all-reduce fast path by keeping reduction buffers aligned
  • prefer the fused JIT TP QK norm path on supported CUDA launches instead of stopping at the older in-model RMSNormTP path
  • check SGLANG_USE_FUSED_PARALLEL_QKNORM, fused_parallel_qknorm(...), and CustomAllReduceV2 before claiming a missing all-reduce optimization
  • MiniMaxM2RMSNormTP is the active per-layer QK norm implementation
  • the reduction path consistently selects the fast aligned all-reduce path
  • MiniMaxM2QKRMSNorm can use the fused TP QK norm custom op when the JIT path is enabled and supported
  • focused validation exists in python/sglang/jit_kernel/tests/test_tp_qknorm.py and python/sglang/jit_kernel/benchmark/bench_tp_qknorm.py
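A dispatch-shape sketch of the env-gated fused path with a fallback; the fused_parallel_qknorm stub below stands in for the real JIT op, and SGLang's actual gating and support checks are more detailed:

```python
import os
import torch

def fused_parallel_qknorm(q, k, q_w, k_w, eps):  # stand-in stub, not the JIT op
    raise RuntimeError("JIT kernel unavailable in this sketch")

def _naive_qknorm(q, k, q_w, k_w, eps):
    # Two separate generic norm ops: the pattern the fused path replaces.
    def rms(x, w):
        v = x.float().pow(2).mean(-1, keepdim=True)
        return (x.float() * torch.rsqrt(v + eps) * w.float()).to(x.dtype)
    return rms(q, q_w), rms(k, k_w)

def qknorm(q, k, q_w, k_w, eps=1e-6):
    if os.environ.get("SGLANG_USE_FUSED_PARALLEL_QKNORM") == "1":
        try:
            return fused_parallel_qknorm(q, k, q_w, k_w, eps)
        except RuntimeError:
            pass                     # keep a fallback path alive
    return _naive_qknorm(q, k, q_w, k_w, eps)

q, k = torch.randn(2, 8, 64), torch.randn(2, 2, 64)
qn, kn = qknorm(q, k, torch.ones(64), torch.ones(64))
print(qn.shape, kn.shape)
```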

Stage M2-5: Harden for piecewise CUDA graph and pipeline parallelism

Once the core hot paths are in place, MiniMax needs to remain usable under graph capture and PP partitioning.
  • keep piecewise CUDA graph contexts correct around MoE expert-distribution recording
  • propagate pp_proxy_tensors
  • make weight loading layer-range aware under PP (a sketch follows this stage's notes)
Family-adjacent caveat:
  • #18310 is for MiniMax-M2.1 and focuses on a torch.compile plus CUDA-graph crash. It is not the core M2 mainline optimization ladder, but it is worth borrowing if graph tracing regresses on a MiniMax-family branch.
  • MiniMax can run under PP without wrapper gaps
  • piecewise CUDA graph support does not regress the MiniMax-specific path
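A minimal sketch of layer-range-aware loading under PP, with a hypothetical helper name:

```python
import re

# Under PP, each rank should only materialize the layers it owns instead of
# loading everything and dropping the rest. Hypothetical helper, not the
# SGLang loader.
def owned_by_this_rank(weight_name: str, start_layer: int, end_layer: int) -> bool:
    m = re.match(r"model\.layers\.(\d+)\.", weight_name)
    if m is None:
        return True      # embeddings, final norm, lm_head: handled separately
    return start_layer <= int(m.group(1)) < end_layer

names = [f"model.layers.{i}.mlp.gate.weight" for i in range(8)]
# A PP rank owning layers [4, 8) skips the first half entirely.
print([n for n in names if owned_by_this_rank(n, 4, 8)])
```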

M2.5 Extension Path

Use this path when the target is MiniMaxAI/MiniMax-M2.5 or another later MiniMax-family checkpoint. Start from the M2 core stages above, then continue here.

Stage M25-0: Audit the loader contract already on main

M2.5 stresses loading and quantized checkpoint conventions much harder than the early M2 path.
  • preserve packed_modules_mapping (its shape is sketched after this list)
  • preserve KV-cache scale remapping
  • keep ModelSlim-specific layer assumptions consistent with MiniMax layout
  • packed qkv and gate-up modules load correctly
  • KV cache scales are not silently dropped
  • ModelSlim quant layers do not assume a different MoE layout
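A sketch of the packed-modules shape this contract protects; the qkv/gate-up keys below are the common fused-checkpoint pattern shown illustratively, and the exact MiniMax entries live in the model file:

```python
# Mapping from fused module names to the unfused names a quant config may use.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

def expand_packed(module_name: str) -> list[str]:
    # A quant config written against unfused names must be translated through
    # this mapping, not bypassed, or scales silently stop matching modules.
    return packed_modules_mapping.get(module_name, [module_name])

print(expand_packed("qkv_proj"))  # ['q_proj', 'k_proj', 'v_proj']
```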

Stage M25-1: Fill the remaining quantized-loader gaps not yet in main

Status: Tracked upstream PR work; not fully present in origin/main commit c122d343a as of 2026-04-21.
Some M2.5 quantized checkpoints use fused expert naming that the current mainline loader still does not fully cover.
  • support fused expert mappings such as w13 (name-mapping sketch after this list)
  • prefer explicit fused mapping before falling back to older w1/w2/w3 logic
  • add a focused weight-loading test when you port this work
  • AWQ or similar M2.5 checkpoints with fused expert weights load without local remapping hacks
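A name-mapping sketch of the explicit-fused-first rule, under assumed checkpoint naming; real M2.5 AWQ checkpoints may differ in detail:

```python
# Try the explicit fused w13 mapping first, then fall back to per-matrix names.
FUSED_MAP = {"w13": ("w1", "w3")}   # gate and up projections fused in one tensor

def map_expert_weight(name: str):
    for fused, parts in FUSED_MAP.items():
        if f".{fused}." in name:
            return [(name.replace(f".{fused}.", f".{p}."), p) for p in parts]
    return [(name, None)]            # already unfused: the older w1/w2/w3 path

print(map_expert_weight("experts.0.w13.qweight"))
print(map_expert_weight("experts.0.w2.qweight"))
```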

Stage M25-2: Make the scale-out runtime contract explicit

Status: Partly on main, partly still tracked from upstream PR work as of origin/main commit c122d343a on 2026-04-21.
For M2.5, the next bottleneck is often not a single kernel. It is the distributed contract across PP, EP, DP, and DeepEP.
  • keep PP support from the mainline path
  • make DeepEP runtime requirements explicit, especially hidden-size and dtype expectations (a fail-fast sketch follows this list)
  • treat DP support and DP-attention support as separate stages
  • PP launches correctly
  • DeepEP no longer fails due to unsupported hidden size or dtype mismatch
  • the runtime contract is written down for the exact TP / DP / EP / PP shape you care about
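One way to make the contract explicit is a fail-fast check with named reasons. The specific constraints below are placeholders; write down the real hidden-size and dtype requirements for your DeepEP build before relying on this:

```python
import torch

# Contract-check sketch: fail with a named reason at launch, not deep inside a
# kernel. The supported dtypes and the hidden-size multiple are assumptions.
def check_deepep_contract(hidden_size: int, dtype: torch.dtype,
                          supported_dtypes=(torch.bfloat16,),
                          hidden_multiple: int = 128) -> None:
    if dtype not in supported_dtypes:
        raise ValueError(f"DeepEP path: unsupported dtype {dtype}")
    if hidden_size % hidden_multiple != 0:
        raise ValueError(
            f"DeepEP path: hidden_size {hidden_size} is not a multiple of {hidden_multiple}"
        )

check_deepep_contract(3072, torch.bfloat16)   # passes under these assumptions
```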

Stage M25-3: Add the DP-attention and DEP communication optimizations

Status: Mixed. #20067 is part of main; #20489 and #20975 remain tracked as upstream PR work, not fully present in origin/main commit c122d343a as of 2026-04-21.
This is the biggest M2.5 scale-out gap. Performance and correctness both depend on using the attention-TP group rather than blindly reusing the model-TP group.
  • use attention TP group and rank instead of global TP group in MiniMax attention
  • allow reduce-scatter after MoE when padding or DEP makes it profitable
  • support FP4 all-gather when the communication path can quantize before transport
  • allow all-reduce fusion between the MoE output and the next attention preparation
  • guard zero-token and empty-batch paths
  • DP-attention MiniMax uses attention-TP metadata consistently
  • DEP no longer performs an unnecessary all-reduce plus slice
  • empty-batch or high-rank edge cases no longer crash
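The group arithmetic behind the first bullet, as a hypothetical helper rather than SGLang's parallel-state code:

```python
# With model TP = 8 split into dp_size = 4 attention groups, attention TP is 2,
# and rank metadata must come from the attention group, not the global group.
def attn_tp_metadata(global_tp_rank: int, global_tp_size: int, dp_size: int):
    attn_tp_size = global_tp_size // dp_size
    attn_dp_rank = global_tp_rank // attn_tp_size   # which attention group
    attn_tp_rank = global_tp_rank % attn_tp_size    # rank inside that group
    return attn_dp_rank, attn_tp_rank, attn_tp_size

# Using global_tp_rank where attn_tp_rank is required is exactly the kind of
# bug this stage guards against.
for r in range(8):
    print(r, attn_tp_metadata(r, global_tp_size=8, dp_size=4))
```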

Stage M25-4: Replace the older TP QK norm path with the fused JIT version

Status: Mainline as #20673 by origin/main commit c122d343a on 2026-04-21.
The older QK norm path was already specialized; the newer mainline path pushes it further by moving to a fused JIT kernel that reuses custom all-reduce v2 more efficiently.
  • fuse TP Q and K norm into one custom op
  • keep a fallback path for unsupported environments
  • add a dedicated benchmark and unit test with the PR
  • the MiniMax path can use the fused TP QK norm custom op
  • the fallback path is still available when the JIT kernel cannot run

Stage M25-5: Fix TP16 and replicated-KV-head correctness

Status: Mainline as #20967 by origin/main commit c122d343a on 2026-04-21.
When num_key_value_heads < tp_size, multiple TP ranks can share the same KV head. That means the K norm weights and reductions must follow the replica layout, not a naive full-TP assumption.
  • shard norm weights by logical head replica
  • reduce only across ranks that share the same head
  • do not assume the full TP group is the correct reduction group
  • high-TP MiniMax-M2.5 runs do not produce repeated or garbled output caused by incorrect K norm sharding
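The replica-group arithmetic, sketched with a hypothetical helper:

```python
# Ranks that share a KV head form the K-norm reduction group; reducing over
# the full TP group mixes different heads and garbles output.
def knorm_reduction_group(tp_rank: int, tp_size: int, num_kv_heads: int):
    replicas = tp_size // num_kv_heads          # ranks sharing one KV head
    head = tp_rank // replicas                  # logical head this rank holds
    group = list(range(head * replicas, (head + 1) * replicas))
    return head, group

# TP16 with 8 KV heads: ranks 0-1 share head 0, ranks 2-3 share head 1, ...
for r in (0, 1, 2, 15):
    print(r, knorm_reduction_group(r, tp_size=16, num_kv_heads=8))
```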

Stage M25-6: Optional NVFP4 fallback for non-Blackwell GPUs

Status: Mainline as #19652 by origin/main commit c122d343a on 2026-04-21; not MiniMax-specific, but directly relevant to some MiniMax-M2.5 deployments.
If the target checkpoint is an NVFP4 MiniMax variant on A100, H100, A40, or another non-Blackwell GPU, the real blocker may be the generic FP4 Marlin fallback rather than MiniMax model code.
  • keep weights compressed in FP4
  • route unsupported native FP4 cases to Marlin fallback
  • preserve both linear and MoE fallback paths
  • NVFP4 MiniMax-family checkpoints can run coherently on non-Blackwell GPUs without decompression hacks
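A routing sketch by device capability; the sm_100 threshold for native FP4 and the backend labels are assumptions for illustration, not SGLang's exact dispatch:

```python
import torch

# Keep FP4 weights compressed and pick a kernel backend by device capability.
def pick_fp4_backend() -> str:
    if not torch.cuda.is_available():
        return "cpu-unsupported"
    major, _minor = torch.cuda.get_device_capability()
    if major >= 10:                  # assumed Blackwell-class native FP4 path
        return "native-fp4"
    return "marlin-fp4-fallback"     # A100/H100/A40-class: Marlin kernels

print(pick_fp4_backend())
```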

M2.7 Current-Main Validation Path

Use this path when the target is MiniMaxAI/MiniMax-M2.7, MiniMaxAI/MiniMax-M2.7-highspeed, or when an AMD MiniMax change might affect the currently registered M2.7 lanes. Current main has explicit AMD accuracy and performance coverage for M2.7, while first-class M2.7 and M2.7-highspeed docs are tracked by upstream PR #20873.

Stage M27-0: Treat M2.7 as a separate later-family validation target

M2.7 currently reuses the MiniMax-M2-family runtime code, but the active registered tests are not just copies of M2.5. They launch MiniMaxAI/MiniMax-M2.7 on AMD with TP8+EP8 and the aiter attention backend.
  • keep the model-file assumptions from the M2/M2.5 ladder unless current code proves M2.7 has a new runtime path
  • validate M2.7 separately from M2.5 on AMD when changing attention, MoE communication, loader, or aiter-related behavior
  • use the registered M2.7 model path override MINIMAX_M27_MODEL_PATH for local mirrors
  • preserve launch details from the current tests: --tp 8, --ep-size 8, --attention-backend aiter, SGLANG_USE_AITER=1, --mem-fraction-static 0.85, multithread loading, and a long watchdog timeout
  • inspect both MI30x/MI325 and MI35x lanes because they use distinct registered suites
  • test/registered/amd/accuracy/mi30x/test_minimax_m27_eval_amd.py
  • test/registered/amd/perf/mi30x/test_minimax_m27_perf_amd.py
  • test/registered/amd/accuracy/mi35x/test_minimax_m27_eval_mi35x.py
  • test/registered/amd/perf/mi35x/test_minimax_m27_perf_mi35x.py
  • .github/workflows/nightly-test-amd.yml
  • .github/workflows/nightly-test-amd-rocm720.yml
  • M2.7 accuracy and performance suites pass on the target AMD lane
  • M2.5 and M2.7 failures are triaged independently
  • docs do not become the only source of truth for M2.7 until a first-class M2.7 usage doc exists
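A sketch assembling those launch details into a command, assuming the standard sglang.launch_server entrypoint; verify the flag spellings against the registered tests in your tree:

```python
import os

# MINIMAX_M27_MODEL_PATH is the override the registered tests honor for
# local mirrors; everything else mirrors the launch details listed above.
model = os.environ.get("MINIMAX_M27_MODEL_PATH", "MiniMaxAI/MiniMax-M2.7")
env = {**os.environ, "SGLANG_USE_AITER": "1"}
cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", model,
    "--tp", "8",
    "--ep-size", "8",
    "--attention-backend", "aiter",
    "--mem-fraction-static", "0.85",
]
print(" ".join(cmd))
# subprocess.Popen(cmd, env=env) would launch it; keep the watchdog timeout long.
```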

Open PR Radar

Check these active upstream tracks before designing a new MiniMax skill or declaring a gap:
  • #22934: M2.5 EPLB bugfix for missing routed_experts_weights_of_layer on MiniMaxM2ForCausalLM.
  • #22744: NVIDIA --enable-tf32-matmul support aimed at reducing FP32 gate GEMM cost for MiniMax-M2.5 decode.
  • #22300: FP8 GEMM/deepgemm scale-format fix for fp16 MiniMax-M2.5 models.
  • #23301: token-by-token streaming of MiniMax-M2 string tool-call parameters.
  • #20873: docs-only M2.7 and M2.7-highspeed support.
  • #22432 and #23190: NPU split-QKV, TP RMSNorm, RoPE, Eagle3, and DP-attention fixes for MiniMax2.
  • #17826, #19468, #20031, #20489, and #20975: remaining distributed, DeepEP, quant-loader, and DP-attention cleanup tracks carried from the earlier M2.5 radar.

Investigation Order

When debugging a MiniMax issue, prefer this order:
  1. classify the exact runtime shape
  2. check whether the relevant optimization already exists on main
  3. if not, check whether it lives in a still-open upstream PR
  4. only then decide whether to port, reimplement, or defer it
For the supporting evidence and commands, use:
  • references/playbook.md
  • references/pr-history.md

Anti-Patterns

  • Do not treat attention TP and model TP as automatically identical for MiniMax-M2.5.
  • Do not optimize a generic MoE kernel first if the real problem is MiniMax loader or topology plumbing.
  • Do not assume a TP-only text launch proves PP, DP, DP-attention, or DeepEP correctness.
  • Do not bypass packed_modules_mapping or KV-scale remapping just to make one checkpoint load.
  • Do not copy still-open upstream PR behavior into production without noting that it is not on main yet.
  • Do not assume MiniMax-M2.7 is validated by passing only MiniMax-M2.5 tests; use the M2.7 AMD lanes when the change can affect that checkpoint.
  • Do not omit --tool-call-parser minimax-m2 or --reasoning-parser minimax-append-think when validating tool or reasoning behavior from the serving docs.
  • Do not miss SGLANG_USE_FUSED_PARALLEL_QKNORM when benchmarking or diagnosing the current TP QK norm path.