sglang-prod-incident-triage
SGLang Serving Debug
Overview
Use this skill to turn a live serving problem into a debug path you can replay.
Use one loop:
- collect a baseline bundle
- save the failing request or crash dump
- replay on a clean target
- only then switch tools
Do not start with profiling.
This skill should work with more focused skills instead of re-implementing them:
- debug-cuda-crash when replay plus coredump points to a CUDA crash path
- debug-distributed-hang when the problem is clearly a TP/PP/DP/EP hang
- sglang-torch-profiler-analysis when the issue is already narrowed to a compute-side path
Three examples are included:
- TTFT spike with low queue time
- replay-first CUDA crash flow
- request-shaped distributed hang flow
Output Contract
Return:
- problem class
- what was checked
- strongest signal so far
- current best guess
- what was ruled out
- next step
- production risk
When To Use It
- /health or /health_generate is unhealthy
- latency or throughput regressed under serving load
- queue size grows while health still looks green
- one request class times out or hangs
- the server crashes only after some requests
- outputs changed after a deploy, topology change, or weight switch
- one older commit is known-good and a newer commit is known-bad
Workflow
1. Collect a baseline bundle
If a live server is reachable, collect a read-only bundle before anything more
intrusive:
```bash
python3 scripts/incident_artifact_tool.py collect-bundle \
  --base-url http://127.0.0.1:30000 \
  --outdir /tmp/incident_bundle
python3 scripts/incident_artifact_tool.py summarize-bundle \
  /tmp/incident_bundle
```

If the server is protected:

```bash
python3 scripts/incident_artifact_tool.py collect-bundle \
  --base-url http://127.0.0.1:30000 \
  --token "$SGLANG_BEARER_TOKEN" \
  --outdir /tmp/incident_bundle
```

The bundle script collects:
- /health
- /health_generate
- /model_info
- /server_info
- /v1/loads?include=all
- /v1/loads?include=core,queues,disagg,spec
- /metrics (on a best-effort basis)
- /hicache/storage-backend
Use the summary for a quick read on:
- health vs. active health state
- topology and runtime flags
- point-in-time queue and token usage
- TTFT / E2E / queue-time heuristics from Prometheus metrics
If the summary says the bundle was captured while the server was idle, recollect
it during traffic or move quickly to dump plus replay.
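The TTFT / E2E / queue-time heuristics come from Prometheus histogram series in /metrics. As a rough sketch of how such a read works, assuming standard `*_sum` / `*_count` histogram conventions (the metric names below are placeholders, not the exact SGLang series names):

```python
import re

def histogram_means(metrics_text: str) -> dict:
    """Mean latency per histogram, from Prometheus *_sum / *_count pairs."""
    sums: dict = {}
    counts: dict = {}
    pattern = re.compile(r"(\S+?)_(sum|count)(?:\{[^}]*\})?\s+([0-9.eE+-]+)")
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        m = pattern.match(line)
        if not m:
            continue
        name, kind, value = m.group(1), m.group(2), float(m.group(3))
        target = sums if kind == "sum" else counts
        target[name] = target.get(name, 0.0) + value  # aggregate across label sets
    return {name: sums[name] / counts[name] for name in sums if counts.get(name)}

# Placeholder series names; substitute whatever the real /metrics exposes.
sample = """
ttft_seconds_sum 12.0
ttft_seconds_count 48
queue_time_seconds_sum 0.6
queue_time_seconds_count 48
"""
means = histogram_means(sample)  # mean TTFT 0.25s vs mean queue time 0.0125s
```

A high TTFT mean paired with a near-zero queue-time mean is exactly the low-queue TTFT-spike pattern one of the included examples covers.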
If no live server is reachable, start from the best dump or log already available:
- crash dump
- request dump
- logs
- CUDA coredump
- OTel trace
- torch profile
2. Save the failing request
Read references/decision-tree.md only if the
problem class is still unclear:
- server down or unhealthy
- latency or throughput regression
- wrong output or behavior regression
- intermittent timeout or hang
Then preserve the request payload that actually triggers the problem:
- crash path: use --crash-dump-folder
- non-crash path: enable request dump or save the exact trigger request
Do not jump straight from a live symptom to low-level debugging without first
saving something you can replay.
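One lightweight way to satisfy the save-first rule on the non-crash path is to write the exact trigger payload to disk before experimenting further. A minimal sketch; the payload fields and output directory are illustrative, not a fixed SGLang schema:

```python
import json
import time
from pathlib import Path

def save_trigger_request(payload: dict, outdir: str = "/tmp/trigger_requests") -> Path:
    """Persist the exact request body that reproduces the symptom, for later replay."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"trigger_{int(time.time() * 1000)}.json"
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path

# Illustrative payload shape; capture the real failing request body instead.
saved = save_trigger_request({"text": "…", "sampling_params": {"max_new_tokens": 512}})
```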
3. Replay on a clean target
Read references/endpoints-and-signals.md
when you need help reading the baseline bundle or the replay target.
Read references/replay-trace-profile.md
when you need the replay, trace, profile, or bisect paths.
Standard order:
- collect baseline bundle
- capture request dump or crash dump
- restart a clean debug target if needed
- replay the same issue
- collect replay-time logs and dumps
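Before replaying against a restarted debug target, it helps to gate on the same /health endpoint the bundle script polls. A sketch of such a gate, not part of the SGLang tooling (the fetch parameter exists only to make the helper testable without a live server):

```python
import time
from urllib import request

def wait_healthy(base_url: str, timeout_s: float = 60.0, fetch=None) -> bool:
    """Poll /health until the clean debug target answers 200, then start the replay."""
    if fetch is None:
        fetch = lambda url: request.urlopen(url, timeout=5).status
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch(base_url + "/health") == 200:
                return True
        except OSError:
            pass  # server still starting; retry
        time.sleep(1.0)
    return False
```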
4. Only go deeper after replay
Replay
Use replay when:
- a crash dump exists
- a request dump exists
- the problem depends on request shape or workload mix
If a crash dump exists, summarize it first:
```bash
python3 scripts/incident_artifact_tool.py summarize-dump \
  --input-file /path/to/crash_dump.pkl
```

Then replay:

```bash
python3 /path/to/sglang/scripts/playground/replay_request_dump.py \
  --input-file /path/to/crash_dump.pkl \
  --host 127.0.0.1 \
  --port 30000 \
  --parallel 128
```

If safe_pickle_load blocks a locally captured trusted dump, use:

```bash
python3 scripts/replay_trusted_request_dump.py \
  --input-file /path/to/request_dump.pkl \
  --host 127.0.0.1 \
  --port 30000 \
  --parallel 1
```

If replay indicates a CUDA crash path, restart the same build with coredumps
enabled before reproducing again:

```bash
SGLANG_CUDA_COREDUMP=1 \
SGLANG_CUDA_COREDUMP_DIR=/tmp/sglang_cuda_coredumps \
python -m sglang.launch_server \
  --model-path ... \
  --crash-dump-folder /tmp/sglang_crash_dump \
  ...
```

Then inspect the generated coredump:

```bash
cuda-gdb "$(which python3)" \
  -ex "target cudacore /tmp/sglang_cuda_coredumps/cuda_coredump_<host>.<pid>.<ts>"
```

For a replay-first crash example, read
references/moe-shared-oob-case-study.md.
OTel trace
Use tracing when:
- request-stage timing is unclear
- router vs. worker attribution is unclear
- PD prefill/decode transfer may be implicated
If tracing was enabled at startup, you can change the level without restart:
```bash
curl "http://127.0.0.1:30000/set_trace_level?level=1"
curl "http://127.0.0.1:30000/set_trace_level?level=2"
```

Torch profile
Use profiling when:
- the issue is already narrowed to compute-side ownership
- replay already reproduces the problem
- metrics and loads do not explain the regression
At that point, switch to sglang-torch-profiler-analysis. Do not duplicate
its profiling workflow here.
For a low-noise latency example, read
references/ttft-prefill-not-queue-case-study.md.
Distributed hang
If this looks like a collective stall, save the failing request, replay it on a
clean target, collect the replay-time bundle and stacks, then switch to
debug-distributed-hang.
For an example of that flow, read
references/communication-hang-case-study.md.
5. Regression between two commits
If one commit is known-good and another is known-bad, build a deterministic
harness before doing deeper manual debugging:
- choose a stable reproducer: request replay, benchmark command, or correctness check
- make the harness return 0 on good behavior and non-zero on bad behavior
- run git bisect start <bad> <good>
- run git bisect run <harness>
- return here only after a candidate commit is isolated
Prefer replay-backed bisect when the regression depends on request shape or
long-running serving state.
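The harness contract above (exit 0 for good, non-zero for bad) can be sketched as a thin wrapper around whatever reproducer was chosen; the "CUDA error" stderr check is only an illustrative bad-behavior signal, not a fixed rule:

```python
import subprocess
import sys

def harness(repro_cmd: list) -> int:
    """Exit 0 on good behavior, 1 on bad, so `git bisect run` can drive it."""
    result = subprocess.run(repro_cmd, capture_output=True, text=True)
    bad = result.returncode != 0 or "CUDA error" in result.stderr
    return 1 if bad else 0

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(harness(sys.argv[1:]))
```

Saved as, say, bisect_harness.py (hypothetical name), it plugs directly into `git bisect run python3 bisect_harness.py <reproducer command...>`.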
6. Switch tools when the boundary is clear
Switch tools once the fault class is clear:
- sglang-torch-profiler-analysis for kernel and overlap attribution
- debug-distributed-hang for collective or rank-divergence hangs
- debug-cuda-crash for CUDA crash reproduction and kernel API logging
Do not switch tools before collecting the first bundle unless the user already has
decisive logs or dumps.
References
Load only what the current step needs:
- references/decision-tree.md
- problem classes, tool switch points, return shape
- references/endpoints-and-signals.md
- endpoint behavior, auth notes, field reading
- references/replay-trace-profile.md
- request dump, crash dump, replay, trace, profiler step, bisect
- references/moe-shared-oob-case-study.md
- example: upstream top-k corruption, downstream MoE align shared-memory OOB
- references/ttft-prefill-not-queue-case-study.md
- example: TTFT spike with low queue time, request replay, and likely prefill-side ownership
- references/communication-hang-case-study.md
- example: request-shaped TP hang with request replay and distributed-hang debug flow
Scripts
- scripts/incident_artifact_tool.py
- collect a read-only live bundle
- summarize a collected bundle into a compact debug note
- summarize a trusted request dump or crash dump before replay
- scripts/replay_trusted_request_dump.py
- replay a trusted request dump when safe_pickle_load blocks stock replay
If a live bundle was collected, include its path.
If replay, trace, or profiling was chosen, say why bundle plus dump were not enough.