sglang-kimi-k2-k25-optimization


SGLang Kimi K2/K2.5 Optimization

Overview

The skill is an optimization ladder. Identify which stage the current code is at, apply the next missing optimization, and only move deeper after the earlier stage is satisfied.
Current-main snapshot: This skill was refreshed against SGLang origin/main commit c122d343a on 2026-04-21. Since the older PR ladder was written, current main has added a Kimi-K2.5 usage doc, parser and OpenAI-serving tests for kimi_k2, Kimi-K2.5 LoRA regression coverage, AMD/GB300 validation lanes, and a Kimi-K2-Thinking stress test. Treat those as part of the active validation surface, not as optional CI trivia. Active open PRs now also define several next likely skill updates: W4AFP8 loading, W4A16 DeepEP low-latency, Kimi-K2.5 multimodal processor fixes, ROCm fused QK RMSNorm, and JIT migration of the older K2 fused gate path. One important non-open gap is Kimi-K2-Thinking DeepEP plus int4/Marlin: #13789 tried to support it but was closed unmerged after hitting an illegal memory access in the fused_marlin_moe path. Do not mark that combination as mainline-supported just because the generic Marlin JIT work in #19181 landed.
The historical evidence for every stage lives in:
  • references/pr-history.md: merged PR evidence, benchmark tables, key code
  • references/playbook.md: symptom mapping, commands, validation order

Before You Change Anything

Record the exact serving shape first:
  • K2 or K2.5
  • thinking or instruct
  • text-only or multimodal
  • native or quantized weights
  • TP / DP / EP / PP / PD / EPLB topology
  • speculative decoding enabled or not
  • NVIDIA / AMD / other backend
  • launch parser pair:
    --tool-call-parser kimi_k2
    and
    --reasoning-parser kimi_k2
  • LoRA enabled or not
  • current SGLang commit and the matching registered-test lane
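The checklist above can be captured as a small record so every optimization or bug report pins the exact serving shape. This is a minimal sketch: the `ServingShape` class and its field names are illustrative, not SGLang API.

```python
from dataclasses import dataclass, field

@dataclass
class ServingShape:
    """Illustrative record of the serving shape to capture before any change."""
    model_family: str                 # "K2" or "K2.5"
    mode: str                         # "thinking" or "instruct"
    modality: str                     # "text" or "multimodal"
    weights: str                      # "native" or "quantized"
    topology: dict = field(default_factory=dict)  # e.g. {"tp": 8, "ep": 1}
    speculative_decoding: bool = False
    backend: str = "nvidia"           # "nvidia", "amd", ...
    parsers: tuple = ("kimi_k2", "kimi_k2")  # (--tool-call-parser, --reasoning-parser)
    lora: bool = False
    sglang_commit: str = ""

# Example: the shape a quantized multimodal K2.5 deployment would record.
shape = ServingShape(
    model_family="K2.5", mode="instruct", modality="multimodal",
    weights="quantized", topology={"tp": 8, "dp": 1},
    sglang_commit="c122d343a",
)
```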

Core Principle

Do not treat K2 and K2.5 as one optimization problem.
  • K2 is mainly a
    384 experts
    router and MoE hot-path story.
  • K2 thinking adds a separate quantized Marlin MoE story.
  • K2.5 is much more wrapper-heavy:
    text_config
    , quant mapping, PP/PD/EPLB, multimodal DP encoder, Eagle3, and PCG compatibility all matter.
  • Current K2/K2.5 serving also has a parser contract: tool calls and thinking output are expected to go through
    kimi_k2
    .
  • Current open PRs split into three useful future tracks: quantized K2.5 loading/execution, multimodal wrapper correctness, and backend-specific fused norm or MoE kernel work.
For non-Kimi models, first decide which family they resemble more:
  • a K2-like router and MoE hot-path problem
  • a K2 thinking-like quantized Marlin MoE problem
  • or a K2.5-like wrapper, multimodal, and runtime-plumbing problem
The optimization order also matters:
  1. enable the right shape or wrapper contract
  2. remove generic overhead
  3. add or select the Kimi-specific fast path
  4. micro-optimize the hot kernel
  5. harden PCG or distributed correctness
  6. tune per-device fused MoE configs

What Transfers To Similar Models

Reuse this skill on a non-Kimi model when it shares one or more of these traits:
  • a sparse MoE router whose real expert count or top-k shape breaks generic DeepSeek-era assumptions
  • a wrapper that hides text-model metadata inside
    text_config
    or wrapper-local quant fields
  • a multimodal tower whose throughput depends on DP encoder or DP attention execution details
  • speculative decoding, PP, PD, EPLB, or PCG features that depend on wrapper surface area rather than only kernel speed
Reuse the optimization order and validation discipline, not the literal Kimi constants or filenames.

Open PR Radar

Check these active upstream tracks before designing a new Kimi skill or declaring a gap:
  • #22806:
    KimiW4AFp8Config
    and W4AFP8 model-loading support for Kimi-K2.5.
  • #22496: Kimi-K2.5 W4A16 DeepEP low-latency path with JIT Marlin and direct DeepEP MoE work.
  • #22964:
    KimiGPUProcessorWrapper._cpu_call
    output fix after grid metadata changed from
    grid_thws
    toward
    image_grid_thw
    .
  • #23186: AMD/ROCm BF16 fused QK RMSNorm path for
    Kimi-K2.5-MXFP4
    .
  • #19703: migrate
    kimi_k2_moe_fused_gate
    from AOT
    sgl-kernel
    into the JIT kernel module.
  • #22488: generalize the Kimi2 fused MoE gate JIT path to GLM-5-style
    256
    -expert shapes.
  • #22208: AMD gfx950 small-M fused MoE config tuning for Kimi-K2.5
    int4_w4a16
    .
  • #21741: generic compressed-tensors W4AFP8 MoE support that underpins some Kimi-K2.5 quantized loading work.
Known closed gap to remember:
  • #13789: attempted Kimi-K2-Thinking DeepEP support with int4/Marlin, but remained unmerged after an illegal memory access in
    fused_marlin_moe
    .

K2 Evolution Path

Use this path when the target is moonshotai/Kimi-K2*.

Stage K2-0: Basic support but not yet K2-shaped

The code may "support K2" in a broad sense, but still inherit DeepSeek assumptions that silently cap performance or break specialized kernels.
  • hidden assumptions that
    num_experts == 256
  • generic grouped-topk path still used for K2
  • no dedicated K2 router or gate dispatch
  • router-side code can legally run with
    384
    experts
  • tests and benchmarks no longer hardcode
    256
    for the K2 path
  • reference PR: #8013

Stage K2-1: Remove the 256 experts router assumption

Before K2 can benefit from any later hot-path work, the router GEMM path has to accept 384 experts.
  • make
    dsv3_router_gemm
    dispatch by runtime expert count
  • keep the specialized unrolled kernel structure instead of falling back to a generic slow path
  • preserve the existing hidden-size and output-dtype contract
  • sgl-kernel/csrc/gemm/dsv3_router_gemm_entry.cu
  • the K2 path uses the specialized router kernel with
    384
    experts
  • there is no hidden fallback caused by a hardcoded
    256
  • reference PR: #8013
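The dispatch idea can be sketched in a few lines: pick the specialized kernel instantiation from the runtime expert count instead of a hardcoded 256. The function and kernel-name strings below are hypothetical stand-ins for the dispatch inside dsv3_router_gemm_entry.cu, not its actual symbols.

```python
# Hypothetical dispatch sketch: select a specialized router-GEMM instantiation
# by runtime expert count instead of assuming num_experts == 256.
SPECIALIZED_EXPERT_COUNTS = {256, 384}  # shapes with unrolled kernels

def select_router_kernel(num_experts: int) -> str:
    if num_experts in SPECIALIZED_EXPERT_COUNTS:
        return f"dsv3_router_gemm_e{num_experts}"  # keep the unrolled structure
    return "generic_router_gemm"  # slow fallback; the K2 path must never land here
```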

Stage K2-2: Remove generic grouped-topk overhead

K2 thinking has 384 experts and num_expert_group == 1. The generic grouped-topk path wastes time on masking and grouping logic that K2 does not need.
  • add a narrow K2-specific topk implementation
  • keep semantics identical: renormalization, routed scaling, expert remap, padded-token masking
  • optimize the dispatch condition before touching CUDA
  • python/sglang/srt/layers/moe/topk.py
  • K2 uses a dedicated biased-topk path instead of the generic grouped implementation
  • the router hotspot is smaller even before adding a fused CUDA op
  • reference PR: #13150
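The gate semantics that must stay identical can be written as a tiny per-token reference in pure Python: sigmoid scores, an additive bias used only for selection, top-k, then renormalization of the unbiased scores and routed scaling. The bias-only-for-selection detail follows DeepSeek-style gates and is an assumption here; verify it against the actual topk.py semantics before relying on it.

```python
import math

def k2_gate_reference(logits, bias, top_k, routed_scaling=1.0, renormalize=True):
    """Per-token reference for a biased-sigmoid top-k gate (illustrative)."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]   # sigmoid activation
    biased = [s + b for s, b in zip(scores, bias)]          # selection score only
    idx = sorted(range(len(biased)), key=lambda i: biased[i], reverse=True)[:top_k]
    weights = [scores[i] for i in idx]                      # unbiased routing weights
    if renormalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return idx, [w * routed_scaling for w in weights]
```

A narrow fused implementation must match this reference bit-for-bit on renormalization, scaling, and deterministic tie-breaking (Python's stable sort keeps the lower expert index on ties).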

Stage K2-3: Fuse sigmoid, bias, top-k, and renorm into one CUDA op

When the Python or torch.compile topk path is still hot, K2 needs a model-specific fused gate kernel to cut launch overhead and reduce temporary traffic.
  • build a dedicated
    kimi_k2_moe_fused_gate
    op for
    384 experts
  • separate small-token and large-token execution strategies
  • fuse
    sigmoid
    ,
    + bias
    , top-k selection, writeback, and optional renormalize/scaling
  • add a dedicated benchmark and unit test at the same time
  • sgl-kernel/csrc/moe/kimi_k2_moe_fused_gate.cu
  • sgl-kernel/tests/test_kimi_k2_moe_fused_gate.py
  • sgl-kernel/benchmark/bench_kimi_k2_moe_fused_gate.py
  • the K2 router hotspot moves from generic topk to the dedicated fused gate
  • benchmark coverage exists for the exact K2 shape
  • reference PR: #13287

Stage K2-4: Wire runtime dispatch to the best available K2 fast path

Adding a fast kernel is not enough; the runtime must actually choose the best maintained path for the K2 shape.
  • dispatch by K2 shape in
    topk.py
  • use
    kimi_k2_moe_fused_gate
    when it is the active best path for the current backend
  • prefer
    fused_topk_deepseek
    when backend constraints are satisfied and it supersedes the older gate path
  • keep the generic implementation alive for non-K2 shapes and non-CUDA backends
  • treat dispatch order as part of the optimization contract, not as an incidental detail
  • python/sglang/srt/layers/moe/topk.py
  • CUDA K2 requests no longer route through the old compiled topk path
  • kernel selection is deterministic for the active backend and shape
  • reference PRs: #13332, #15347, #17325
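The dispatch contract above can be sketched as an ordered predicate chain: the order is explicit, deterministic, and part of the optimization contract. The kernel names mirror the paths discussed above; the exact conditions are assumptions for illustration, not the real predicates in topk.py.

```python
def select_topk_impl(num_experts, num_expert_group, backend, deepseek_topk_ok):
    """Deterministic, ordered kernel selection for the K2 router shape (sketch)."""
    is_k2_shape = num_experts == 384 and num_expert_group == 1
    if backend == "cuda" and is_k2_shape:
        if deepseek_topk_ok:                 # newer path supersedes the gate op
            return "fused_topk_deepseek"
        return "kimi_k2_moe_fused_gate"      # dedicated fused gate fast path
    return "grouped_topk_generic"            # non-K2 shapes and non-CUDA backends
```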

Stage K2-5: Micro-optimize the K2 fused gate kernel when it is still the hot path

After the fused gate exists, and when it is still the active router fast path, the next gains come from simplifying the kernel and improving memory behavior rather than inventing another algorithm.
  • simplify dtype support to the dtype the hot path actually uses
  • vectorize large-token loads with
    float4
  • reduce
    __syncthreads()
    in the small-token path
  • shrink shared-memory footprint by storing only selected top-k state
  • preserve deterministic tie-breaking in reductions
  • sgl-kernel/csrc/moe/kimi_k2_moe_fused_gate.cu
  • profiler shows the fused gate kernel materially smaller than the first fused version
  • kernel correctness is unchanged
  • reference PR: #13374

Stage K2-6: Harden piecewise CUDA graph correctness

Fast kernels that are not PCG-safe become unusable in exactly the high-performance serving regime K2 cares about.
  • register fake ops for shape and dtype propagation
  • ensure invalid expert selections write deterministic zero outputs and zero indices
  • treat PCG correctness bugs as first-class regressions, not edge cases
  • python/sglang/srt/layers/moe/topk.py
  • sgl-kernel/csrc/moe/kimi_k2_moe_fused_gate.cu
  • PCG capture and replay work without launch crashes or illegal instructions
  • reference PRs: #13466, #15306
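A fake op only has to reproduce the output shapes and dtypes of the real kernel so graph capture can trace through it; the real registration would go through torch.library against the actual op schema. The shape contract assumed here, [tokens, experts] scores in and [tokens, top_k] values plus indices out, is illustrative, and the logic can be checked in plain Python:

```python
def fused_gate_fake(scores_shape, top_k):
    """Shape propagation a fake op must implement for PCG tracing (sketch)."""
    tokens, num_experts = scores_shape
    assert top_k <= num_experts
    values_shape = (tokens, top_k)    # router weights per token
    indices_shape = (tokens, top_k)   # expert ids; invalid slots must be zeroed
    return values_shape, indices_shape
```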

Stage K2-7: Clean up and harden the quantized K2 thinking Marlin MoE path

For K2 thinking, a lot of remaining latency and correctness risk lives in memory traffic, scratch-buffer handling, EP plumbing, PCG compatibility, and the active Marlin implementation boundary rather than in the router itself.
  • move
    fused_marlin_moe
    into SGLang-side code so it can evolve without
    sgl-kernel
    release friction
  • choose config via
    try_get_optimal_moe_config(..., is_marlin=True)
  • reuse shared temporary buffers
  • do not zero scratch buffers unless EP actually requires it
  • only pass a real
    expert_map
    when dispatcher metadata exists
  • keep the Marlin path PCG-safe
  • optimize the active JIT-backed Marlin kernel implementation, not only the wrapper around it
  • python/sglang/srt/layers/moe/fused_moe_triton/fused_marlin_moe.py
  • python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors_moe.py
  • python/sglang/jit_kernel/moe_wna16_marlin.py
  • tp-only K2 thinking no longer pays the fake-EP zeroing penalty
  • EP-aware paths pass true expert metadata instead of placeholders
  • PCG capture works for the quantized Marlin MoE path
  • future kernel changes land in the active JIT-backed implementation instead of only in stale wrapper code

Stage K2-8: Add hardware-specific fused MoE tuning files

Even with the correct kernel path, K2 MoE throughput is device-sensitive. H20, H200, and B200 needed separate config tables.
  • treat the tuning filename as part of the optimization contract
  • match exact
    Triton version
    ,
    E
    ,
    N
    ,
    dtype
    ,
    block_shape
    , and
    device_name
  • do not reuse a config file just because the model name is also K2
  • python/sglang/srt/layers/moe/fused_moe_triton/configs/
  • the serving shape resolves to the exact per-device tuning file instead of a generic default
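The filename contract can be made executable so a mismatch is caught before serving. The builder below follows the E=...,N=... convention of the fused-MoE Triton config files; treat the exact field order and optional fields as assumptions to verify against the files already in the configs directory.

```python
def fused_moe_config_filename(E, N, device_name, dtype=None, block_shape=None):
    """Build the per-device fused-MoE tuning filename (sketch of the contract)."""
    name = f"E={E},N={N},device_name={device_name}"
    if dtype is not None:
        name += f",dtype={dtype}"
    if block_shape is not None:
        name += f",block_shape={block_shape}"
    return name + ".json"
```

A lookup that resolves to this exact name, rather than a generic default, is the exit criterion for this stage.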

K2.5 Evolution Path

Use this path when the target is moonshotai/Kimi-K2.5*.

Stage K25-0: Wrapper bring-up

K2.5 is not "just another DeepSeek model". It is a wrapped multimodal model, so optimization begins with the wrapper contract.
  • expose the language model through a K2.5 wrapper
  • keep multimodal tower and projector plumbing explicit
  • do not assume later runtime features can bypass the wrapper
  • K2.5 can launch with the wrapper and basic multimodal plumbing intact
  • reference PR: #17789

Stage K25-1: Make MoE config initialization wrapper-aware

阶段K25-1:让MoE配置初始化感知包装器

The first K2.5 bottleneck was often not kernel speed, but the wrong config being read because MoE metadata lived in text_config.
  • inspect
    hf_config.text_config
    when present
  • do not assume top-level
    hf_config
    contains the text-model MoE fields
  • python/sglang/srt/managers/scheduler.py
  • fused MoE config init sees the real text-model expert metadata
  • reference PR: #18064
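The wrapper-aware lookup is one line once stated explicitly: prefer text_config when present, fall back to the top-level config otherwise. A minimal sketch with dummy config objects; the n_routed_experts field name is illustrative.

```python
from types import SimpleNamespace

def get_text_model_config(hf_config):
    """Return the config object that actually holds the text-model MoE fields."""
    return getattr(hf_config, "text_config", None) or hf_config

# K2.5-style wrapped config: MoE metadata lives one level down.
wrapped = SimpleNamespace(text_config=SimpleNamespace(n_routed_experts=384))
# Flat config: metadata is top-level, so the fallback applies.
flat = SimpleNamespace(n_routed_experts=256)
```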

Stage K25-2: Make quantized checkpoint loading robust

Before K2.5 can be optimized, quantized checkpoints must load with the right name mapping and quant metadata propagation.
  • remap HF weight names into the SGLang wrapper layout
  • remap excluded-module patterns too, not only main weights
  • preserve
    quant_config
    on the wrapper itself
  • python/sglang/srt/models/kimi_k25.py
  • python/sglang/srt/layers/quantization/modelopt_quant.py
  • NVFP4 or related quantized K2.5 checkpoints load without wrapper-name mismatch
  • reference PRs: #18370, #18440
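The remapping is plain string rewriting from HF checkpoint names to the wrapper layout, and the same rewriting has to cover excluded-module patterns from the quant config, not only the main weights. The prefixes below are illustrative, not the exact K2.5 mapping.

```python
# Illustrative prefix remap from HF checkpoint names to a wrapper layout.
PREFIX_MAP = {
    "model.": "language_model.model.",
    "lm_head.": "language_model.lm_head.",
}

def remap_weight_name(name: str) -> str:
    for src, dst in PREFIX_MAP.items():
        if name.startswith(src):
            return dst + name[len(src):]
    return name  # vision tower / projector names pass through unchanged

def remap_exclusion_patterns(patterns):
    # Quant exclusion patterns must be remapped too, or quantized layers
    # silently lose their exclusions under the wrapper layout.
    return [remap_weight_name(p) for p in patterns]
```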

Stage K25-3: Add parallel runtime plumbing through the wrapper

K2.5 performance features often fail because the wrapper does not expose the metadata that PP, PD, EPLB, or PCG expect.
  • forward
    pp_proxy_tensors
  • expose
    start_layer
    and
    end_layer
  • expose
    self.model = self.language_model.model
  • expose
    routed_experts_weights_of_layer
  • python/sglang/srt/models/kimi_k25.py
  • python/sglang/srt/models/deepseek_v2.py
  • wrapper-level runtime features no longer need to special-case K2.5 manually
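The plumbing amounts to forwarding attributes and hooks from the inner language model so runtime features see the wrapper as a normal model. A minimal sketch; the exposed attribute names follow the bullets above, everything else is illustrative stand-in code.

```python
class InnerLanguageModel:
    """Stand-in for the wrapped text model."""
    def __init__(self):
        self.model = object()                 # the actual decoder stack
        self.start_layer, self.end_layer = 0, 61

class KimiK25WrapperSketch:
    def __init__(self):
        self.language_model = InnerLanguageModel()
        # Expose what PP / PD / EPLB / PCG expect on the top-level model.
        self.model = self.language_model.model
        self.start_layer = self.language_model.start_layer
        self.end_layer = self.language_model.end_layer

    def forward(self, batch, pp_proxy_tensors=None):
        # pp_proxy_tensors must be forwarded, not dropped, for PP to work.
        return ("ran", pp_proxy_tensors)
```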

Stage K25-4: Scale and stabilize the multimodal DP path

For multimodal K2.5, the vision path becomes part of the performance story. Local auto-batching alone was not enough, and DP-attention correctness issues could erase the gains.
  • gate the vision tower by
    mm_enable_dp_encoder
  • pass
    use_data_parallel
    through the wrapper and encoder stack
  • use the DP-sharded vision execution path when enabled
  • avoid extra reduction or mismatched behavior in the VLM DP-attention path
  • python/sglang/srt/models/kimi_k25.py
  • python/sglang/srt/layers/attention/vision.py
  • multimodal K2.5 can use the DP encoder path instead of only local vision execution
  • enabling DP attention does not introduce extra reduction or launch mismatch
  • reference PRs: #17991, #18689

Stage K25-5: Tune fused MoE for the real K2.5 quant and hardware shape

K2.5 fused MoE can look supported while still using a terrible default config. On AMD, the big gain came from teaching the tuning tools the real K2.5 wrapper and int4 shape.
  • look through
    text_config
    in tuning utilities
  • derive
    block_shape
    from
    quantization_config["config_groups"]
    when needed
  • add full
    int4_w4a16
    support to benchmark and tuning scripts
  • pass
    ServerArgs
    into worker processes so tuning matches real serving
  • save the final per-shape configs under the exact filename contract
  • benchmark/kernels/fused_moe_triton/common_utils.py
  • benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py
  • python/sglang/srt/layers/moe/fused_moe_triton/configs/
  • K2.5 no longer falls back to a default fused-MoE config for the active quant and device
  • reference PR: #19228
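Deriving block_shape from the checkpoint's quantization_config can be sketched directly. The config_groups layout and the [0, group_size] result below follow a compressed-tensors-style int4_w4a16 checkpoint and are assumptions; verify the real keys against the checkpoint being tuned.

```python
def derive_block_shape(quantization_config):
    """Pull a block shape out of config_groups (illustrative layout)."""
    groups = quantization_config.get("config_groups", {})
    for group in groups.values():
        weights = group.get("weights", {})
        group_size = weights.get("group_size")
        if group_size:                  # e.g. int4_w4a16 with group_size=128
            return [0, group_size]      # per-group quant along the input dim
    return None                         # no group quant: tune untiled
```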

Stage K25-6: Make speculative decoding wrapper-compatible

For K2.5, speculative decoding support was mostly blocked by missing wrapper hooks, not by the draft model core itself.
  • forward Eagle3 helper methods through the K2.5 wrapper
  • keep embed/head exposure consistent with other optimized model wrappers
  • python/sglang/srt/models/kimi_k25.py
  • Eagle3 runtime code can call the wrapper the same way it calls direct language-model wrappers
  • reference PR: #19689

Stage K25-7: Harden multimodal + DP attention + speculative decode correctness

Once K2.5 mixes multimodal embeddings, DP attention, and speculative decoding, correctness bugs show up in extend mode and launch paths.
  • respect
    forward_batch.mm_input_embeds
  • append only the new tail token embedding in extend mode
  • validate on the exact combined launch shape, not just TP-only text mode
  • python/sglang/srt/models/llama_eagle3.py
  • test/registered/8-gpu-models/test_kimi_k25.py
  • the combined multimodal + DP attention + MTP or Eagle3 path runs without launch crashes
  • reference PR: #21391

Stage K25-8: Keep current-main serving and validation surfaces aligned

Current main has enough Kimi-K2.5 coverage that optimization work should preserve the launch, parser, multimodal-grid, LoRA, and backend validation contracts together.
  • keep the documented Kimi-K2.5 launch contract in sync with tests:
    --tool-call-parser kimi_k2
    and
    --reasoning-parser kimi_k2
  • preserve Kimi grid metadata flow:
    grid_thws
    ,
    KimiGridMMDataMixin
    , and the GPU image preprocessing path with CPU-compatible inputs
  • if LoRA, MoE LoRA sharing, attention backend selection, or logprob paths are touched, include the Kimi-K2.5 LoRA regression
  • choose AMD validation by backend and quant shape:
    • native Kimi-K2.5 with
      aiter
      MLA uses TP4 on MI35x because 64 heads at TP8 gives only 8 heads per GPU
    • Kimi-K2.5-MXFP4 MI35x coverage uses TP8 and validates default plus FP8 KV-cache variants
  • use GB300/NVFP4 lanes when changing Blackwell-specific quant, cache, or kernel behavior
  • use the Kimi-K2-Thinking stress test when parser, long-run stability, or K2 thinking serving paths are involved
  • docs_new/docs/basic_usage/kimi_k2_5.mdx
  • python/sglang/srt/function_call/kimik2_detector.py
  • python/sglang/srt/parser/reasoning_parser.py
  • python/sglang/srt/multimodal/processors/kimi_common.py
  • python/sglang/srt/multimodal/processors/kimi_k25.py
  • test/registered/function_call/test_kimik2_detector.py
  • test/registered/lora/test_lora_kimi_k25_logprob_diff.py
  • test/registered/amd/accuracy/mi35x/test_kimi_k25_aiter_mla_eval_mi35x.py
  • test/registered/amd/accuracy/mi35x/test_kimi_k25_mxfp4_eval_mi35x.py
  • test/registered/gb300/test_kimi_k25.py
  • test/registered/gb300/test_kimi_k25_nvfp4.py
  • test/registered/stress/test_stress_kimi_k2.py
  • docs, registered launches, and stress tests agree on the parser flags
  • image token expansion still derives from grid metadata instead of placeholder counts alone
  • backend-specific tests run on the topology they were written for
  • LoRA logprob coverage is not skipped when changing K2.5 adapter or shared-expert LoRA behavior

Usage

When asked to optimize Kimi, or a structurally similar model, do not start from whatever file looks hottest. First place the current code on the ladder above.
Use this decision process:
  1. Determine whether the current problem is K2, K2.5, or which family it most closely resembles.
  2. Find the highest stage that is already satisfied.
  3. Work on the next missing stage, not a later one.
  4. Validate narrowly on the exact serving shape.
  5. Only after that, widen to more topology combinations.
Examples:
  • K2 decode is still spending time in generic grouped-topk: you are missing
    K2-2
    , so do not jump straight to fused-MoE tuning files.
  • K2 fused gate exists but PCG crashes: you are at
    K2-6
    , so focus on fake op registration or invalid-selection guards.
  • K2.5 int4 on AMD uses a default config: you are at
    K25-5
    , so tune the wrapper-aware fused-MoE path before editing the model wrapper again.
  • K2.5 PP works but Eagle3 multimodal crashes: you are at
    K25-6
    or
    K25-7
    , not at the early support stages.
  • K2.5 tool calls, thinking output, image requests, or LoRA logprobs regress after a runtime change: you are at
    K25-8
    ; check launch parser flags, grid metadata, OpenAI-serving parser tests, and the targeted backend lane before blaming the MoE kernel.
For non-Kimi models, map them by structure:
  • a new sparse MoE model with non-generic expert count or router shape: treat it like the K2 path first
  • a quantized sparse MoE model whose hot path runs through Marlin: treat it like the K2 thinking Marlin path
  • a wrapped multimodal model whose PP, PD, DP encoder, or speculative decode support keeps breaking: treat it like the K2.5 path first
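The decision process above can be mirrored as a tiny ladder walk: order the stages, mark which are satisfied, and the next work item is the first unsatisfied one. Stage names follow this document; the satisfaction checks themselves are whatever evidence you gathered earlier.

```python
K2_LADDER = ["K2-0", "K2-1", "K2-2", "K2-3", "K2-4", "K2-5", "K2-6", "K2-7", "K2-8"]

def next_stage(ladder, satisfied):
    """Return the first unsatisfied stage; never skip ahead on the ladder."""
    for stage in ladder:
        if stage not in satisfied:
            return stage
    return None  # ladder complete: widen to more topology combinations
```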

Guardrails

  • Do not hardcode
    256
    experts anywhere on the K2 fast path.
  • Do not collapse K2 and K2.5 into one generic DeepSeek optimization.
  • Do not optimize generic kernels first if K2 already has a dedicated specialization.
  • Do not bypass the K2.5 wrapper to make one feature work; later PP, PD, EPLB, or PCG features often depend on that wrapper surface.
  • Do not trust a tuning file unless
    Triton version
    ,
    E
    ,
    N
    ,
    dtype
    ,
    block_shape
    , and
    device
    all match the current shape.
  • Do not validate only TP-only text mode if the real bug involves PP, DP encoder, multimodal inputs, EP, or speculative decoding.
  • Do not omit
    --tool-call-parser kimi_k2
    or
    --reasoning-parser kimi_k2
    when validating current K2/K2.5 tool or thinking behavior.
  • Do not run native Kimi-K2.5 aiter MLA on MI35x at TP8 and treat the failure as a model regression; the registered native aiter MLA lane documents TP4 as the supported shape.
  • Do not copy Kimi-specific constants, dispatch predicates, or tuning filenames into a different model unless its serving shape actually matches.

Validation

For K2 kernel work:

```bash
pytest -q sgl-kernel/tests/test_dsv3_router_gemm.py
pytest -q sgl-kernel/tests/test_kimi_k2_moe_fused_gate.py
pytest -q test/registered/kernels/test_fused_topk_deepseek.py
python sgl-kernel/benchmark/bench_kimi_k2_moe_fused_gate.py
```

For K2 quantized Marlin MoE work:

```bash
pytest -q test/registered/quant/test_marlin_moe.py
pytest -q python/sglang/jit_kernel/tests/test_moe_wna16_marlin.py
```

For K2.5 wrapper or combined runtime work:

```bash
pytest -q test/registered/8-gpu-models/test_kimi_k25.py
```

For current parser and OpenAI-serving behavior:

```bash
pytest -q test/registered/function_call/test_kimik2_detector.py
pytest -q test/registered/unit/parser/test_reasoning_parser.py -k KimiK2
pytest -q test/registered/unit/function_call/test_function_call_parser.py -k KimiK2
pytest -q test/registered/unit/entrypoints/openai/test_serving_chat.py -k kimi_k2
```

For backend-specific or adapter-sensitive K2.5 work, run only on matching hardware:

```bash
pytest -q test/registered/lora/test_lora_kimi_k25_logprob_diff.py
pytest -q test/registered/amd/accuracy/mi35x/test_kimi_k25_aiter_mla_eval_mi35x.py
pytest -q test/registered/amd/accuracy/mi35x/test_kimi_k25_mxfp4_eval_mi35x.py
pytest -q test/registered/gb300/test_kimi_k25.py
pytest -q test/registered/gb300/test_kimi_k25_nvfp4.py
pytest -q test/registered/stress/test_stress_kimi_k2.py
```

For tuning work:
  • rerun only the relevant tuning script
  • keep the real model, quant, TP, EP, and backend
  • save output under the exact config filename contract

References

  • Historical evidence and benchmark tables: references/pr-history.md
  • Symptom mapping and validation order: references/playbook.md