sglang-torch-profiler-analysis

SGLang Torch Profiler Analysis

Overview

Use this skill for SGLang `torch.profiler` analysis.

There is only one public workflow: `triage`.

Use the unified entrypoint: `scripts/analyze_sglang_torch_profile.py`

`triage` always prints the same three tables:

- kernel table
- overlap-opportunity table
- fuse-pattern table

By default, all three tables only render rows at or above `1.0%` cumulative GPU-time share. Treat anything below that as noise unless the user explicitly asks for a lower cutoff.
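
The cutoff rule can be illustrated with a minimal Python sketch. It assumes "cumulative GPU-time share" means each row's GPU time, summed over all launches of that kernel, as a percentage of total GPU time; the actual script may compute it differently. The function name `kernel_share_rows` and the sample kernel names are hypothetical.

```python
from collections import defaultdict


def kernel_share_rows(kernels, min_share_pct=1.0):
    """Aggregate (name, duration_us) pairs and keep rows at or above the cutoff.

    Assumption: share = summed duration of one kernel name / total GPU time.
    """
    totals = defaultdict(float)
    for name, dur_us in kernels:
        totals[name] += dur_us
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    rows = [(name, dur, 100.0 * dur / grand_total) for name, dur in totals.items()]
    # Rows below the cutoff are treated as noise and dropped.
    rows = [row for row in rows if row[2] >= min_share_pct]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Hypothetical sample: two launches of one GEMM, one attention kernel, one tiny memset.
sample = [("gemm", 500.0), ("gemm", 400.0), ("attn", 95.0), ("tiny_memset", 5.0)]
rows = kernel_share_rows(sample)
for name, dur_us, share in rows:
    print(f"{name:12s} {dur_us:8.1f} us {share:5.1f}%")
```

Passing a smaller `min_share_pct` corresponds to the case where the user explicitly asks for a lower cutoff.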

The script-level fuse-pattern table should stay source-backed and deterministic. Do not build a fuzzy string-matching engine into the script for typo-tolerance.

If exact/source-backed matching is weak but the agent judges that a cluster of kernels still looks semantically close to a known pattern, add a short AI note after the table with one of these labels:

- `high`: very likely the same pattern family; naming drift or minor implementation reshaping is the main uncertainty
- `medium`: several signals line up, but one important piece is still ambiguous
- `low`: weak resemblance only; mention it only if it is still worth a human follow-up

When To Use It

- inspect an SGLang torch profiler trace or profile directory
- profile a live SGLang server and immediately analyze the output
- summarize which kernel families dominate prefill or decode
- map kernels back to Python code paths
- judge whether a code path still has overlap headroom
- check whether an already-known fusion or overlap path should have applied

Diffusion Backend Gate

For diffusion benchmark or profiling work, only analyze traces produced by the native SGLang diffusion backend.

If the run that generated the trace logs any of:

- `Falling back to diffusers backend`
- `Using diffusers backend`
- `Loaded diffusers pipeline`

stop the workflow instead of analyzing the trace. Treat it as a backend-selection issue, not as valid SGLang diffusion profiler evidence.
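
A minimal sketch of this gate in Python follows. The three marker strings come from the text above; the function name `find_backend_fallback` and the sample log lines are hypothetical, and the sketch assumes the run's log is available as plain text.

```python
FALLBACK_MARKERS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


def find_backend_fallback(log_text):
    """Return (line_number, marker) for the first fallback marker, or None."""
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for marker in FALLBACK_MARKERS:
            if marker in line:
                return lineno, marker
    return None


# Hypothetical log excerpt; only the marker substring matters.
sample_log = "server starting\nINFO Using diffusers backend\nready\n"
hit = find_backend_fallback(sample_log)
if hit is not None:
    print(f"stop: backend-selection issue at log line {hit[0]}: {hit[1]!r}")
```

Exact substring matching keeps the check deterministic; a `None` result only means none of the three markers were seen in the text that was scanned.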

Main Flows

1. Single-trace triage from an existing profile dir or trace

```bash
python3 scripts/analyze_sglang_torch_profile.py \
  --input /path/to/profile_dir_or_trace.json.gz
```

Use this when you want the fastest read on kernel share and likely fused-kernel pattern matches. The overlap table stays conservative in single-trace mode and will tell you when a mapping/formal pair is needed.
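
For orientation, here is a rough Python sketch of what a kernel-share read over one trace file involves. It is not the script's implementation. It assumes the trace is Chrome-trace JSON as written by `torch.profiler` (a `traceEvents` list whose GPU kernel events carry `"cat": "kernel"` and a `dur` in microseconds), handles a single file rather than a profile directory, and uses hypothetical function names and synthetic data.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def load_trace_events(path):
    """Load trace events from a .json or .json.gz Chrome-trace file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: a dict with a "traceEvents" list, or a bare list of events.
    return data["traceEvents"] if isinstance(data, dict) else data


def gpu_time_by_kernel(events):
    """Sum the duration of complete ("X") kernel events per kernel name."""
    totals = Counter()
    for ev in events:
        if ev.get("ph") == "X" and ev.get("cat") == "kernel":
            totals[ev["name"]] += ev.get("dur", 0)
    return totals


# Synthetic trace: two launches of one kernel plus one CPU op that is ignored.
demo = {"traceEvents": [
    {"ph": "X", "cat": "kernel", "name": "gemm", "ts": 0, "dur": 20},
    {"ph": "X", "cat": "kernel", "name": "gemm", "ts": 30, "dur": 10},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "ts": 0, "dur": 5},
]}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.trace.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        json.dump(demo, f)
    totals = gpu_time_by_kernel(load_trace_events(demo_path))
print(totals.most_common())
```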

2. Single-trace triage from a running server

```bash
python3 scripts/analyze_sglang_torch_profile.py \
  --url http://127.0.0.1:30000 \
  --num-steps 5 \
  --profile-by-stage
```

3. Two-trace triage from existing profile dirs or traces

```bash
python3 scripts/analyze_sglang_torch_profile.py triage \
  --mapping-input /path/to/graph_off_profile_dir \
  --formal-input /path/to/graph_on_profile_dir
```

Use this when you need stronger overlap conclusions and cleaner kernel-to-source attribution.

4. Two-trace triage from running servers

```bash
python3 scripts/analyze_sglang_torch_profile.py triage \
  --mapping-url http://127.0.0.1:31025 \
  --formal-url http://127.0.0.1:31026 \
  --num-steps 5 \
  --profile-by-stage
```

profile_by_stage

`profile_by_stage` is not only for PD disaggregation.

- On ordinary non-PD serving, it is still useful because prefill and decode usually have very different bottlenecks.
- On the current profile-v2 path inside SGLang, stage-based profiling is effectively the normal path.
- PD-disaggregated serving adds one extra rule: prefill workers and decode workers must be profiled separately. That is stricter than ordinary `profile_by_stage`.

How To Choose The Triage Shape

Single-trace triage

Use when you want the lowest-friction report:

- one trace is already available
- you mainly want kernel share and fusion clues
- you are comparing two runs side by side by running triage once per trace

This is the recommended default.

Two-trace triage

Use when you need:

- a stronger answer about overlap headroom
- graph-off source mapping plus graph-on final behavior
- more trustworthy overlap recommendations in the overlap-opportunity (middle) table

It takes two traces:

- mapping trace with `--disable-cuda-graph --disable-piecewise-cuda-graph`
- formal trace with the real serving optimizations enabled

Do not call the mapping pass a "fast profile". It exists to recover `kernel -> cpu_op -> python scope`.
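
The `kernel -> cpu_op` half of that chain can be sketched in Python as below. This is an illustration under assumptions, not the script's algorithm: it assumes kernel events and `cuda_runtime` launch events share an `args["correlation"]` id, and that the owning `cpu_op` is the innermost one on the launching thread whose time interval contains the launch. The function name and the synthetic events are hypothetical, and the Python-scope step is left out.

```python
def map_kernels_to_cpu_ops(events):
    """Return (kernel_name, cpu_op_name or None) pairs for each kernel event."""
    launches = {}  # correlation id -> cuda_runtime launch event
    cpu_ops, kernels = [], []
    for ev in events:
        cat = ev.get("cat")
        if cat == "cuda_runtime":
            corr = ev.get("args", {}).get("correlation")
            if corr is not None:
                launches[corr] = ev
        elif cat == "cpu_op":
            cpu_ops.append(ev)
        elif cat == "kernel":
            kernels.append(ev)

    mapping = []
    for kernel in kernels:
        launch = launches.get(kernel.get("args", {}).get("correlation"))
        owner = None
        if launch is not None:
            start, end = launch["ts"], launch["ts"] + launch["dur"]
            enclosing = [
                op for op in cpu_ops
                if op["tid"] == launch["tid"]
                and op["ts"] <= start
                and end <= op["ts"] + op["dur"]
            ]
            if enclosing:
                # The shortest enclosing op is the innermost one.
                owner = min(enclosing, key=lambda op: op["dur"])
        mapping.append((kernel["name"], owner["name"] if owner else None))
    return mapping


# Synthetic events: a nested cpu_op pair, one launch, and two kernels.
events = [
    {"cat": "cpu_op", "name": "aten::linear", "tid": 1, "ts": 0, "dur": 100},
    {"cat": "cpu_op", "name": "aten::addmm", "tid": 1, "ts": 10, "dur": 50},
    {"cat": "cuda_runtime", "name": "cudaLaunchKernel", "tid": 1, "ts": 20,
     "dur": 5, "args": {"correlation": 7}},
    {"cat": "kernel", "name": "gemm_kernel", "tid": 7, "ts": 200, "dur": 30,
     "args": {"correlation": 7}},
    {"cat": "kernel", "name": "orphan_kernel", "tid": 7, "ts": 300, "dur": 10,
     "args": {"correlation": 99}},
]
mapping = map_kernels_to_cpu_ops(events)
print(mapping)
```

With CUDA graphs enabled, kernels are typically replayed from a single graph launch rather than from per-op launches, so this kind of lookup tends to lose the originating op. That is presumably why the mapping trace is taken with graphs disabled.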

Workflow

Single-trace workflow

- If the user only wants a quick diagnosis, one trace is enough.
- Prefer rank-local `TP-0` traces over merged traces.
- For a live server, this skill can call `sglang.profiler` and automatically send a small probe request.
- Prefer `--profile-by-stage` even on standard serving unless the user explicitly wants an all-stage mixed trace.

Two-trace workflow

1. Produce a mapping trace first with graph disabled.
2. Produce a formal trace second with graph enabled and the real serving flags kept on.
3. Run `triage` for the compact three-table report.
4. Read the results in this order:
   - kernel table
   - overlap-opportunity table
   - fuse-pattern table
5. Before calling something a "new" optimization idea, compare the top rows against both `references/fuse-overlap-catalog.md` and `references/overlap-catalog.md`. Check mainline rows first, then the `PR-backed / in-flight` sections for still-moving upstream work. Prefer reporting:
   - an existing fused or overlap path that should already apply here
   - an existing path that appears disabled, unsupported, or regressed in this trace
   - an upstream pattern that is mainline elsewhere but missing locally, or still open upstream
   - a truly new opportunity only when no catalog entry fits
6. If no exact pattern fully matches but the trace still looks semantically close to a known family, add one flat `AI similarity judgment` note after the tables. Use `high`, `medium`, or `low` only. Base that note on the full pattern shape, not on one kernel name alone. Prefer semantic cues such as producer-consumer chain, source locations, CPU op names, TP context, and model-specific structure. Do not rewrite the script table itself to include these heuristic judgments.

References

Load these only when needed:

- `references/source-map.md`: upstream SGLang profiler entrypoints and trace-writing source paths
- `references/heuristics.md`: overlap labels, dependency-risk interpretation, and limits
- `references/fuse-overlap-catalog.md`: mixed source-backed catalog of existing fuse and overlap patterns, including mainline rows plus PR-backed / in-flight rows
- `references/overlap-catalog.md`: overlap-only lookup table across LLM, VLM, diffusion, disaggregation, HiSparse, and speculative scheduling

Output Contract

Return:

- trace path or generated profile path
- model/server args when available
- kernel table
- overlap-opportunity table
- fuse-pattern table
- optional `AI similarity judgment` note with `high` / `medium` / `low` when exact matching is inconclusive
- one short conclusion about what dominates the run
- whether the overlap conclusion came from single-trace triage or mapping/formal two-trace triage
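
One way to keep a report honest against this contract is a small container that rejects anything outside the allowed values. The sketch below is illustrative only: the class name `TriageReport`, its field names, and the `"single-trace"` / `"two-trace"` strings are assumptions, not part of the skill.

```python
from dataclasses import dataclass
from typing import Optional

SIMILARITY_LABELS = ("high", "medium", "low")
OVERLAP_SOURCES = ("single-trace", "two-trace")


@dataclass
class TriageReport:
    """Hypothetical container mirroring the output contract items."""
    trace_path: str
    kernel_table: str
    overlap_table: str
    fuse_pattern_table: str
    conclusion: str                         # one short line on what dominates the run
    overlap_source: str                     # "single-trace" or "two-trace"
    server_args: Optional[str] = None       # only when available
    similarity_label: Optional[str] = None  # only when exact matching is inconclusive
    similarity_note: Optional[str] = None

    def __post_init__(self):
        if self.overlap_source not in OVERLAP_SOURCES:
            raise ValueError(f"unknown overlap source: {self.overlap_source!r}")
        if (self.similarity_label is not None
                and self.similarity_label not in SIMILARITY_LABELS):
            raise ValueError(f"unknown similarity label: {self.similarity_label!r}")


report = TriageReport(
    trace_path="/path/to/profile_dir",
    kernel_table="(kernel table)",
    overlap_table="(overlap-opportunity table)",
    fuse_pattern_table="(fuse-pattern table)",
    conclusion="(what dominates the run)",
    overlap_source="single-trace",
)
print(report.overlap_source, report.similarity_label)
```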
