sglang-torch-profiler-analysis
Compact SGLang torch-profiler triage skill. Use when Codex should inspect an existing `trace.json(.gz)` or profile directory, trigger `sglang.profiler` against a live server, and return one compact report with kernel, overlap-opportunity, and fuse-pattern tables. Single-trace triage is enough for quick diagnosis; mapping+formal two-trace triage gives stronger overlap conclusions.
NPX Install
npx skill4agent add bbuf/sglang-auto-driven-skills sglang-torch-profiler-analysis
SGLang Torch Profiler Analysis
Overview
Use this skill for SGLang `torch.profiler` analysis.
There is only one public workflow:
- `triage`
Use the unified entrypoint:
- scripts/analyze_sglang_torch_profile.py
`triage` returns one compact report with three tables:
- kernel table
- overlap-opportunity table
- fuse-pattern table
By default, all three tables only render rows at or above `1.0%` cumulative GPU-time share.
Treat anything below that as noise unless the user explicitly asks for a lower cutoff.
The script-level fuse-pattern table should stay source-backed and deterministic.
Do not build a fuzzy string-matching engine into the script for typo-tolerance.
If exact/source-backed matching is weak but the agent judges that a cluster of kernels
still looks semantically close to a known pattern, add a short AI note after the table
with one of these labels:
- `high`: very likely the same pattern family; naming drift or minor implementation reshaping is the main uncertainty
- `medium`: several signals line up, but one important piece is still ambiguous
- `low`: weak resemblance only; mention it only if it is still worth a human follow-up
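The note described above can be sketched as a small formatter that rejects any label outside the three allowed values. This is only an illustration: the function name, the argument names, and the one-line note layout are assumptions, not something the skill's script defines.

```python
# Labels taken from the list above; everything else here is hypothetical.
ALLOWED_LABELS = ("high", "medium", "low")


def format_similarity_note(label, pattern_family, reason):
    """Build a one-line AI note to place after the fuse-pattern table.

    The note stays outside the script-generated table, so the table itself
    remains source-backed and deterministic.
    """
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label must be one of {ALLOWED_LABELS}, got {label!r}")
    return f"AI similarity judgment: {label} ({pattern_family}): {reason}"


if __name__ == "__main__":
    # "example_pattern_family" is a placeholder, not a real catalog entry.
    print(format_similarity_note(
        "medium",
        "example_pattern_family",
        "producer-consumer chain matches, source location is ambiguous",
    ))
```

Keeping the label check strict means a typo such as `hihg` fails loudly instead of leaking into the report.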
When To Use It
- inspect an SGLang torch profiler trace or profile directory
- profile a live SGLang server and immediately analyze the output
- summarize which kernel families dominate prefill or decode
- map kernels back to Python code paths
- judge whether a code path still has overlap headroom
- check whether an already-known fusion or overlap path should have applied
Diffusion Backend Gate
For diffusion benchmark or profiling work, only analyze traces produced by the native
SGLang diffusion backend.
If the run that generated the trace logs any of:
- `Falling back to diffusers backend`
- `Using diffusers backend`
- `Loaded diffusers pipeline`
stop the workflow instead of analyzing the trace. Treat it as a backend-selection issue,
not as valid SGLang diffusion profiler evidence.
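A minimal sketch of this gate, assuming the run log is available as plain text. The three marker strings come from the list above; the helper names are hypothetical.

```python
from __future__ import annotations

# Marker strings quoted from the gate above.
DIFFUSERS_FALLBACK_MARKERS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


def find_diffusers_fallback(log_text: str) -> list[str]:
    """Return every fallback marker that appears in the run log."""
    return [marker for marker in DIFFUSERS_FALLBACK_MARKERS if marker in log_text]


def should_analyze_trace(log_text: str) -> bool:
    """True only when no fallback marker was logged by the run."""
    return not find_diffusers_fallback(log_text)


if __name__ == "__main__":
    sample_log = "INFO Using diffusers backend\nINFO server ready\n"
    hits = find_diffusers_fallback(sample_log)
    if hits:
        print("Stop: backend-selection issue, markers found:", hits)
```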
Main Flows
1. Single-trace triage from an existing profile dir or trace
```bash
python3 scripts/analyze_sglang_torch_profile.py \
  --input /path/to/profile_dir_or_trace.json.gz
```
Use this when you want the fastest read on kernel share and likely fused-kernel pattern matches.
The overlap table stays conservative in single-trace mode and will tell you when a mapping/formal pair is needed.
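To make "kernel share" concrete, here is a rough sketch of how such a table could be derived by hand from a trace. It is not the skill's script. It assumes the trace is a Chrome-trace JSON export whose GPU kernel events carry `"ph": "X"`, `"cat": "kernel"`, and a `dur` field, and it reads "cumulative GPU-time share" as each kernel name's summed duration divided by the total kernel time.

```python
import gzip
import json
from collections import defaultdict
from pathlib import Path


def load_trace_events(path):
    """Load trace events from a .json or .json.gz Chrome-trace file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumption: events sit under "traceEvents"; a bare list is also accepted.
    return data["traceEvents"] if isinstance(data, dict) else data


def kernel_share_rows(events, min_share_pct=1.0):
    """Sum GPU kernel time per name and keep rows at or above the cutoff."""
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X" and ev.get("cat") == "kernel":
            totals[ev["name"]] += float(ev.get("dur", 0.0))
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    rows = [(name, dur, 100.0 * dur / grand_total) for name, dur in totals.items()]
    rows = [row for row in rows if row[2] >= min_share_pct]
    # Largest cumulative GPU time first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Synthetic events with made-up kernel names, for illustration only.
    demo = [
        {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "dur": 900},
        {"ph": "X", "cat": "kernel", "name": "norm_kernel", "dur": 95},
        {"ph": "X", "cat": "kernel", "name": "tiny_kernel", "dur": 5},
        {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "dur": 1000},
    ]
    for name, dur, share in kernel_share_rows(demo):
        print(f"{name:<16}{dur:>10.1f} us{share:>8.2f}%")
```

The default `min_share_pct=1.0` mirrors the `1.0%` cutoff; pass a lower value only when the user explicitly asks for one.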
2. Single-trace triage from a running server
```bash
python3 scripts/analyze_sglang_torch_profile.py \
  --url http://127.0.0.1:30000 \
  --num-steps 5 \
  --profile-by-stage
```
3. Two-trace triage from existing profile dirs or traces
```bash
python3 scripts/analyze_sglang_torch_profile.py triage \
  --mapping-input /path/to/graph_off_profile_dir \
  --formal-input /path/to/graph_on_profile_dir
```
Use this when you need stronger overlap conclusions and cleaner kernel-to-source attribution.
4. Two-trace triage from running servers
```bash
python3 scripts/analyze_sglang_torch_profile.py triage \
  --mapping-url http://127.0.0.1:31025 \
  --formal-url http://127.0.0.1:31026 \
  --num-steps 5 \
  --profile-by-stage
```
Notes on `profile_by_stage`:
- On ordinary non-PD serving, it is still useful because prefill and decode usually have very different bottlenecks.
- On the current profile-v2 path inside SGLang, stage-based profiling is effectively the normal path.
- PD-disaggregated serving adds one extra rule: prefill workers and decode workers must be profiled separately. That is stricter than ordinary `profile_by_stage`.
How To Choose The Triage Shape
Single-trace triage
Use when you want the lowest-friction report:
- one trace is already available
- you mainly want kernel share and fusion clues
- you are comparing two runs side by side by running triage once per trace
This is the recommended default.
Two-trace triage
Use when you need:
- a stronger answer about overlap headroom
- graph-off source mapping plus graph-on final behavior
- more trustworthy overlap recommendations in the middle table
- mapping trace with `--disable-cuda-graph --disable-piecewise-cuda-graph`
- formal trace with the real serving optimizations enabled
Do not call the mapping pass a "fast profile". It exists to recover `kernel -> cpu_op -> python scope`.
Workflow
Single-trace workflow
- If the user only wants a quick diagnosis, one trace is enough.
- Prefer rank-local `TP-0` traces over merged traces.
- For a live server, this skill can call `sglang.profiler` and automatically send a small probe request.
- Prefer `--profile-by-stage` even on standard serving unless the user explicitly wants an all-stage mixed trace.
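The preference for rank-local `TP-0` traces can be sketched as a filename filter. This assumes that trace file names contain the rank tag (for example `TP-0`) and end in `.json` or `.json.gz`. Both the naming convention and the helper names are assumptions for illustration, so check them against the files in your profile directory.

```python
def is_trace_file(name):
    """True for names that look like trace exports (.json or .json.gz)."""
    return name.endswith((".json", ".json.gz"))


def pick_rank_local(names, rank_tag="TP-0"):
    """Prefer traces whose name carries the rank tag; otherwise keep all traces."""
    traces = sorted(name for name in names if is_trace_file(name))
    rank_local = [name for name in traces if rank_tag in name]
    # Fall back to every trace when no file name carries the tag.
    return rank_local or traces


if __name__ == "__main__":
    # Hypothetical file names.
    names = [
        "run-TP-0.trace.json.gz",
        "run-TP-1.trace.json.gz",
        "merged.trace.json.gz",
        "notes.txt",
    ]
    print(pick_rank_local(names))
```

For a real directory, the names could come from `p.name for p in pathlib.Path(profile_dir).iterdir()`.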
Two-trace workflow
- Produce a mapping trace first with graph disabled.
- Produce a formal trace second with graph enabled and the real serving flags kept on.
- Run `triage` for the compact three-table report.
- Read the results in this order:
  - kernel table
  - overlap-opportunity table
  - fuse-pattern table
- Before calling something a "new" optimization idea, compare the top rows against both references/fuse-overlap-catalog.md and references/overlap-catalog.md. Check mainline rows first, then the `PR-backed / in-flight` sections for still-moving upstream work. Prefer reporting:
  - an existing fused or overlap path that should already apply here
  - an existing path that appears disabled, unsupported, or regressed in this trace
  - an upstream pattern that is mainline elsewhere but missing locally, or still open upstream
  - a truly new opportunity only when no catalog entry fits
- If no exact pattern fully matches but the trace still looks semantically close to a known family, add one flat `AI similarity judgment` note after the tables. Use `high`, `medium`, or `low` only. Base that note on the full pattern shape, not on one kernel name alone. Prefer semantic cues such as producer-consumer chain, source locations, CPU op names, TP context, and model-specific structure. Do not rewrite the script table itself to include these heuristic judgments.
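The first steps of the two-trace workflow can be sketched as two small helpers: one derives the mapping server's arguments from the formal server's arguments by appending the two graph-off flags, and one assembles the `triage` command shown under Main Flows. The helper names are hypothetical, and treating the mapping server as "formal arguments plus the two disable flags" is an assumption about how you launch your servers.

```python
# Flags quoted from the two-trace triage description.
MAPPING_ONLY_FLAGS = ("--disable-cuda-graph", "--disable-piecewise-cuda-graph")


def mapping_server_args(formal_args):
    """Return formal server args plus the graph-off flags for the mapping pass."""
    args = list(formal_args)
    for flag in MAPPING_ONLY_FLAGS:
        if flag not in args:
            args.append(flag)
    return args


def two_trace_triage_command(mapping_url, formal_url, num_steps=5):
    """Assemble the two-trace triage invocation against two live servers."""
    return [
        "python3", "scripts/analyze_sglang_torch_profile.py", "triage",
        "--mapping-url", mapping_url,
        "--formal-url", formal_url,
        "--num-steps", str(num_steps),
        "--profile-by-stage",
    ]


if __name__ == "__main__":
    # Placeholder args; substitute the flags your formal server really uses.
    formal = ["--port", "31026"]
    print(mapping_server_args(formal))
    print(" ".join(two_trace_triage_command(
        "http://127.0.0.1:31025", "http://127.0.0.1:31026")))
```

Deriving the mapping arguments from the formal ones keeps every other serving flag identical, so differences between the two traces come from graph capture rather than from unrelated settings.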
References
Load these only when needed:
- references/source-map.md
  - upstream SGLang profiler entrypoints and trace-writing source paths
- references/heuristics.md
  - overlap labels, dependency-risk interpretation, and limits
- references/fuse-overlap-catalog.md
  - mixed source-backed catalog of existing fuse and overlap patterns, including mainline rows plus PR-backed / in-flight rows
- references/overlap-catalog.md
  - overlap-only lookup table across LLM, VLM, diffusion, disaggregation, HiSparse, and speculative scheduling
Output Contract
Return:
- trace path or generated profile path
- model/server args when available
- kernel table
- overlap-opportunity table
- fuse-pattern table
- optional `AI similarity judgment` note with `high` / `medium` / `low` when exact matching is inconclusive
- one short conclusion about what dominates the run
- whether the overlap conclusion came from single-trace triage or mapping/formal two-trace triage
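One way to keep a response honest against this contract is to collect the pieces in a small record and check it before replying. The class, field names, and the two `overlap_source` values below are hypothetical; they only mirror the items listed above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tags for where the overlap conclusion came from.
OVERLAP_SOURCES = ("single-trace", "two-trace")


@dataclass
class TriageReport:
    trace_path: str
    kernel_table: str
    overlap_table: str
    fuse_pattern_table: str
    conclusion: str
    overlap_source: str
    server_args: Optional[str] = None      # only when available
    similarity_note: Optional[str] = None  # only when exact matching is inconclusive

    def problems(self):
        """List contract violations; an empty list means the report is complete."""
        issues = []
        for name in ("trace_path", "kernel_table", "overlap_table",
                     "fuse_pattern_table", "conclusion"):
            if not getattr(self, name).strip():
                issues.append(f"missing {name}")
        if self.overlap_source not in OVERLAP_SOURCES:
            issues.append("overlap_source must be one of " + ", ".join(OVERLAP_SOURCES))
        return issues


if __name__ == "__main__":
    report = TriageReport(
        trace_path="/path/to/profile_dir",
        kernel_table="(kernel table text)",
        overlap_table="(overlap-opportunity table text)",
        fuse_pattern_table="(fuse-pattern table text)",
        conclusion="(one short conclusion)",
        overlap_source="single-trace",
    )
    print(report.problems())
```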