llm-torch-profiler-analysis
Unified LLM torch-profiler triage skill for `sglang`, `vllm`, and `TensorRT-LLM`. Use it to inspect an existing `trace.json(.gz)` or profile directory, or to drive live profiling against a running server and return one three-table report with kernel, overlap-opportunity, and fuse-pattern tables.
NPX Install
npx skill4agent add bbuf/sglang-auto-driven-skills llm-torch-profiler-analysis
Unified LLM Torch Profiler Analysis
Overview
Use this skill for `torch.profiler` analysis across:
- `sglang`
- `vllm`
- `TensorRT-LLM`
There is only one public workflow:
- `triage`
Preferred unified entrypoint:
- scripts/analyze_llm_torch_profile.py
Backwards-compatibility shim (kept so older `docker exec ... analyze_sglang_torch_profile.py ...` calls keep working; it just forwards to the unified entrypoint):
- scripts/analyze_sglang_torch_profile.py
Markdown bundling helper:
- scripts/render_triage_markdown_bundle.py
`triage` returns three tables:
- kernel table
- overlap-opportunity table
- fuse-pattern table
By default, all three tables only render rows at or above `1.0%` cumulative GPU-time share.
Rows below that are hidden by default unless the user asks for a lower cutoff.
Keep the fuse-pattern table source-backed and deterministic.
Do not turn it into a fuzzy matcher.
If exact source-backed matching is weak but a kernel cluster is still close to a known family,
add one short note after the tables with exactly one of:
- `high`
- `medium`
- `low`
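The default row cutoff described above can be illustrated with a minimal sketch. It assumes "cumulative GPU-time share" means each kernel's summed duration divided by the total kernel time in the trace; the function name `kernel_rows_above_cutoff` and the sample events are hypothetical and are not taken from the skill's scripts.

```python
from collections import defaultdict


def kernel_rows_above_cutoff(events, cutoff_pct=1.0):
    """Sum GPU time per kernel name and keep rows at or above cutoff_pct.

    `events` is an iterable of (kernel_name, duration_us) pairs.
    Returns (name, total_us, share_pct) tuples, largest first.
    """
    totals = defaultdict(float)
    for name, dur_us in events:
        totals[name] += dur_us
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    rows = [(name, dur, 100.0 * dur / grand_total) for name, dur in totals.items()]
    rows = [row for row in rows if row[2] >= cutoff_pct]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Synthetic durations for illustration only (not measured data).
events = [("gemm", 600.0), ("attention", 300.0), ("gemm", 95.0), ("memset", 5.0)]
rows = kernel_rows_above_cutoff(events)             # default 1.0% cutoff
rows_low = kernel_rows_above_cutoff(events, 0.5)    # user asked for a lower cutoff
```

With the default cutoff the `memset` row (0.5% share) is hidden; lowering the cutoff to 0.5 brings it back.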
Capability Matrix
| Capability | SGLang | vLLM | TensorRT-LLM |
|---|---|---|---|
| Existing trace triage | yes | yes | yes |
| Single-trace live capture | yes | yes, if torch profiler is enabled on server | requires profiler control endpoints |
| Two-trace mapping+formal triage | yes | yes | yes |
| Stage-aware live capture | yes | no | no |
| `--profile-by-stage` | yes | usually ignored on HTTP profiler route | usually ignored on HTTP profiler route |
For TensorRT-LLM, live capture only works when the server exposes `/start_profile` and
`/stop_profile`, and when the deployment already provides a shared trace path plus the
required env vars.
Real H100 Validation
The current reference run is the `4x H100` matrix captured on `2026-04-23` on `h100_sglang` under:
`/data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3`
Rendered markdown bundle:
/data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3/h100_large_model_matrix_v3_bundle.md
Validated model directories:
- `mixtral_8x7b_instruct`
- `qwen2_5_32b_instruct`
- `qwen3_32b`
Each model directory contains:
- `analysis_sglang.txt`
- `analysis_vllm.txt`
- `analysis_trtllm.txt`
- framework-specific trace roots and probe artifacts
Validated matrix:
| Model | SGLang | vLLM | TensorRT-LLM | Result |
|---|---|---|---|---|
| | | | three tables rendered correctly on all three frameworks; benchmark probes returned direct, non-empty text |
| | | | three tables rendered correctly on all three frameworks; benchmark probes returned direct, non-empty text |
| | | | three tables rendered correctly on all three frameworks; vLLM and TensorRT-LLM chat probes often emitted |
Use this run as the main H100 reference.
The older single-card `2026-04-22` Qwen3 matrix is still useful for bring-up, but it is
not the default reference anymore.
Checked-in sample outputs:
references/validated_outputs/20260422_h100_qwen3_matrix/qwen3_30b_a3b
To render a validated run into one markdown document:
bash
python3 scripts/render_triage_markdown_bundle.py \
--analysis-root /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3 \
--output /data/bbuf/validate/unified_llm_profiler_skill/runs/20260423_h100_large_model_matrix_v3/h100_large_model_matrix_v3_bundle.md
The bundle groups by model and keeps the three tables for each framework.
H100 notes:
- all three frameworks now render kernel, overlap, and fuse tables with separate `extend/prefill` and `decode` sections when the trace contains a clean stage split
- SGLang live capture is validated and calls the server profiler API directly instead of shelling out to `sglang.profiler`
- SGLang trace flush can lag well beyond a few seconds, so the runner waits longer for artifacts than the earlier implementation
- SGLang kernel-site reconstruction keeps sampling disabled in the mapping path so the optimized parser does not perturb SGLang table output; equality rechecks matched for `Mixtral-8x7B-Instruct-v0.1`, `Qwen3-32B`, and `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8`
- vLLM live capture requires `--output-dir` to match the server `torch_profiler_dir`; the validated H100 flow uses `--profiler-config {"profiler":"torch","torch_profiler_dir":"..."}` and then drives `/start_profile` and `/stop_profile`
- TensorRT-LLM validation stays on `--backend pytorch`; the H100 flow writes the trace with `TLLM_TORCH_PROFILE_TRACE` and then analyzes the saved trace
- current TensorRT-LLM profiler setup still needs a `py_executor.py` `with_stack=True` override for table-quality Python locations, and the matrix runner generates that override under `/data/bbuf/validate/unified_llm_profiler_skill/overrides/trtllm`
- on this host, keep all trace roots under `/data/...`, not `/home/...`
When To Use It
- inspect a trace or profile directory from `torch.profiler`, `sglang`, `vllm`, or `TensorRT-LLM`
- profile a live serving endpoint and analyze the result
- summarize which kernel families dominate prefill or decode
- map kernels back to Python code paths
- judge whether a code path still leaves overlap opportunity
- check whether an already-known fusion or overlap path should have applied
Diffusion Backend Gate
For diffusion benchmark or profiling work, only analyze traces produced by the native
SGLang diffusion backend.
If the run that generated the trace logs any of:
- `Falling back to diffusers backend`
- `Using diffusers backend`
- `Loaded diffusers pipeline`
stop the workflow instead of analyzing the trace.
Handle it as a backend-selection issue, not as native-kernel profiler evidence.
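The gate above amounts to a substring scan of the run log before any trace analysis. The following is a minimal sketch: the three marker strings come from this section, while the function name `find_diffusers_fallback` and the sample log lines are hypothetical.

```python
# Log markers listed in this section; any one of them means the run did not
# use the native SGLang diffusion backend.
FALLBACK_MARKERS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


def find_diffusers_fallback(log_text):
    """Return the first fallback marker found in the log text, or None."""
    for line in log_text.splitlines():
        for marker in FALLBACK_MARKERS:
            if marker in line:
                return marker
    return None


# Hypothetical log excerpts for illustration.
native_log = "server started\nprofiler armed\n"
fallback_log = "server started\nWARNING: Falling back to diffusers backend\n"

native_hit = find_diffusers_fallback(native_log)
fallback_hit = find_diffusers_fallback(fallback_log)
```

A non-`None` result means the workflow should stop and the run should be treated as a backend-selection issue.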
Main Flows
1. Single-trace triage from an existing profile dir or trace
bash
python3 scripts/analyze_llm_torch_profile.py \
--input /path/to/profile_dir_or_trace.json.gz
Use this when one trace is enough.
The overlap table stays conservative in single-trace mode and will tell you when a
mapping/formal pair is needed.
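For orientation, here is a minimal sketch of reading a `trace.json` or `trace.json.gz` file and selecting GPU kernel events. It is not the skill's parser. It assumes the Chrome-trace layout that `torch.profiler` exports, with a top-level `traceEvents` list and kernel events tagged `"cat": "kernel"`; the helper names and the synthetic trace are hypothetical.

```python
import gzip
import json
import tempfile
from pathlib import Path


def load_trace_events(path):
    """Load the event list from a trace.json or trace.json.gz file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        data = json.load(handle)
    # Chrome-trace files are either an object with "traceEvents" or a bare list.
    return data["traceEvents"] if isinstance(data, dict) else data


def kernel_events(events):
    """Keep complete ("X") events that are tagged as GPU kernels."""
    return [e for e in events if e.get("ph") == "X" and e.get("cat") == "kernel"]


# Round-trip demo with a tiny synthetic trace (not real profiler output).
sample = {"traceEvents": [
    {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "dur": 120},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "dur": 150},
]}
with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "trace.json.gz"
    with gzip.open(trace_path, "wt", encoding="utf-8") as handle:
        json.dump(sample, handle)
    kernels = kernel_events(load_trace_events(trace_path))
```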
2. Single-trace live capture from SGLang
bash
python3 scripts/analyze_llm_torch_profile.py \
--framework sglang \
--url http://127.0.0.1:30000 \
--output-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example/sglang_profile_live \
--num-steps 5 \
--profile-by-stage
The script sends `POST /start_profile` to the SGLang server directly.
Keep `--output-dir` under `/data/...` so later analysis and docs can see the trace.
The script writes `server_args.json`, sends the probe requests after profiling is armed,
and waits longer for trace flush than the earlier implementation.
3. Single-trace live capture from vLLM
Launch vLLM with torch profiler enabled, for example:
bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--profiler-config '{"profiler":"torch","torch_profiler_dir":"/data/bbuf/validate/unified_llm_profiler_skill/runs/example/vllm_profile"}'
Then run:
bash
python3 scripts/analyze_llm_torch_profile.py \
--framework vllm \
--url http://127.0.0.1:8000 \
--output-dir /data/bbuf/validate/unified_llm_profiler_skill/runs/example/vllm_profile \
--num-steps 5 \
--no-profile-by-stage
For vLLM, `--output-dir` must point to the same `torch_profiler_dir` the server uses.
The current vLLM profiler config already defaults to `torch_profiler_with_stack=true`,
so the runner only needs to set `torch_profiler_dir`.
On `h100_sglang`, external vLLM containers should mount both:
- `/data/.cache/huggingface:/root/.cache/huggingface`
- `/data/bbuf/validate/unified_llm_profiler_skill:/data/bbuf/validate/unified_llm_profiler_skill`
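One way to keep the server's `torch_profiler_dir` and the analyzer's `--output-dir` in sync is to derive both flags from a single path. The sketch below does that; the helper name `vllm_profiler_args` and the example path are hypothetical, and only the flag names and JSON keys shown in this section are used.

```python
import json
import shlex


def vllm_profiler_args(trace_dir):
    """Build the server-side and analyzer-side flags from one trace directory.

    Returns (server_flag, analyzer_flag) as shell-ready strings.
    """
    config = {"profiler": "torch", "torch_profiler_dir": trace_dir}
    config_json = json.dumps(config, separators=(",", ":"))
    server_flag = "--profiler-config " + shlex.quote(config_json)
    analyzer_flag = "--output-dir " + shlex.quote(trace_dir)
    return server_flag, analyzer_flag


# Hypothetical directory for illustration.
server_flag, analyzer_flag = vllm_profiler_args("/data/example/vllm_profile")

# Parse both flags back to confirm they point at the same directory.
server_dir = json.loads(shlex.split(server_flag)[1])["torch_profiler_dir"]
analyzer_dir = shlex.split(analyzer_flag)[1]
```

Append `server_flag` to the `vllm serve` command and `analyzer_flag` to the analyzer command so the two paths cannot drift apart.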
4. Single-trace live capture from TensorRT-LLM
Use this only when the server exposes `POST /start_profile` and `POST /stop_profile`,
and the trace path is shared with the current machine.
Typical env expectations are:
- `TLLM_PROFILE_START_STOP=1`
- `TLLM_TORCH_PROFILE_TRACE=/shared/path/trace.json` or `.json.gz`
Then run:
bash
python3 scripts/analyze_llm_torch_profile.py \
--framework trtllm \
--url http://127.0.0.1:8000 \
--output-dir /shared/path \
--num-steps 5 \
--no-profile-by-stage
If the deployment does not expose the profiler control endpoints, fall back to analyzing
an existing trace instead of trying live capture.
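The prerequisites and fallback rule above can be expressed as a small preflight check. This is a sketch under stated assumptions: the function name `choose_trtllm_capture_mode` and its return labels are hypothetical, the caller is assumed to already know whether the control endpoints are exposed, and the env checks simply mirror the "typical env expectations" listed in this section.

```python
def choose_trtllm_capture_mode(env, endpoints_exposed):
    """Return "live" when the listed prerequisites hold, else "existing-trace".

    `env` is a mapping of environment variables (for example os.environ).
    `endpoints_exposed` says whether the server offers the profiler control
    endpoints; how that is detected is left to the caller.
    """
    start_stop_enabled = env.get("TLLM_PROFILE_START_STOP") == "1"
    trace_path = env.get("TLLM_TORCH_PROFILE_TRACE", "")
    trace_path_ok = trace_path.endswith((".json", ".json.gz"))
    if endpoints_exposed and start_stop_enabled and trace_path_ok:
        return "live"
    return "existing-trace"


# Hypothetical environments for illustration.
ready_env = {
    "TLLM_PROFILE_START_STOP": "1",
    "TLLM_TORCH_PROFILE_TRACE": "/shared/path/trace.json.gz",
}
mode_ready = choose_trtllm_capture_mode(ready_env, endpoints_exposed=True)
mode_no_endpoints = choose_trtllm_capture_mode(ready_env, endpoints_exposed=False)
mode_no_env = choose_trtllm_capture_mode({}, endpoints_exposed=True)
```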
On the current TensorRT-LLM mainline path, `py_executor.py` creates the torch profiler
with `record_shapes=True` and `with_modules=True` but not `with_stack=True`.
For table-quality validation, use the override generator:
bash
python3 scripts/make_trtllm_py_executor_override.py \
--source /path/to/original/py_executor.py \
--output /data/bbuf/validate/unified_llm_profiler_skill/overrides/trtllm/py_executor_with_stack.py
The matrix runner does this automatically on H100 before TensorRT-LLM capture starts.
This is the validated TensorRT-LLM flow on `h100_sglang`:
- launch `trtllm-serve` with `TLLM_TORCH_PROFILE_TRACE=/data/.../trace.json`
- run a few benchmark requests
- analyze the emitted trace with `--input /data/.../trace.json`
5. Two-trace triage from existing profile dirs or traces
bash
python3 scripts/analyze_llm_torch_profile.py triage \
--mapping-input /path/to/graph_off_profile_dir \
--formal-input /path/to/graph_on_profile_dir
Use this when you need stronger overlap attribution and kernel-to-source mapping.
6. Two-trace triage from running servers
bash
python3 scripts/analyze_llm_torch_profile.py triage \
--framework sglang \
--mapping-url http://127.0.0.1:31025 \
--formal-url http://127.0.0.1:31026 \
--num-steps 5 \
--profile-by-stage
For `vllm` or `TensorRT-LLM`, use the same shape but pass:
- `--framework vllm` or `--framework trtllm`
- `--mapping-output-dir ...`
- `--formal-output-dir ...`
- `--no-profile-by-stage`
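The per-framework differences in the two-trace command can be captured in a small argument builder. This is a sketch: the function name `two_trace_triage_argv` is hypothetical, it uses only the flags shown in this section, and it makes no claim about how the script validates them.

```python
def two_trace_triage_argv(framework, mapping_url, formal_url, num_steps=5,
                          mapping_output_dir=None, formal_output_dir=None):
    """Build the argv list for a two-trace triage run against live servers."""
    argv = [
        "python3", "scripts/analyze_llm_torch_profile.py", "triage",
        "--framework", framework,
        "--mapping-url", mapping_url,
        "--formal-url", formal_url,
        "--num-steps", str(num_steps),
    ]
    if framework == "sglang":
        argv.append("--profile-by-stage")
        return argv
    # vllm and trtllm: explicit output dirs, stage profiling disabled.
    if not mapping_output_dir or not formal_output_dir:
        raise ValueError("vllm/trtllm need both mapping and formal output dirs")
    argv += [
        "--mapping-output-dir", mapping_output_dir,
        "--formal-output-dir", formal_output_dir,
        "--no-profile-by-stage",
    ]
    return argv


sglang_argv = two_trace_triage_argv(
    "sglang", "http://127.0.0.1:31025", "http://127.0.0.1:31026")
# The output directories below are placeholders.
vllm_argv = two_trace_triage_argv(
    "vllm", "http://127.0.0.1:31025", "http://127.0.0.1:31026",
    mapping_output_dir="/data/example/mapping",
    formal_output_dir="/data/example/formal")
```

The resulting list can be passed to `subprocess.run` without shell quoting concerns.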
profile_by_stage
Notes on `profile_by_stage` (`--profile-by-stage`):
- On ordinary non-PD SGLang serving, it is still useful because prefill and decode usually have very different bottlenecks.
- On the current profile-v2 path inside SGLang, stage-based profiling is effectively the normal path.
- PD-disaggregated serving adds one extra rule: prefill workers and decode workers must be profiled separately. That is stricter than ordinary `profile_by_stage`.
- For `vllm` and `TensorRT-LLM`, disable it with `--no-profile-by-stage`.
How To Choose The Triage Shape
Single-trace triage
Use when you want the lowest-friction report:
- one trace is already available
- you mainly want kernel share and fusion clues
- you are comparing two runs side by side by running triage once per trace
Prefer this by default.
Two-trace triage
Use when you need:
- a stronger overlap answer
- graph-off source mapping plus graph-on final behavior
- more trustworthy overlap recommendations in the middle table
- mapping trace with graph disabled or with the lower-fusion / more-readable config
- formal trace with the real serving optimizations enabled
Do not call the mapping pass a "fast profile".
It exists to recover `kernel -> cpu_op -> python scope`.
Workflow
Single-trace workflow
- If the user only wants a diagnosis, one trace is enough.
- Prefer one-rank traces over merged traces whenever the profiler emitted both.
- For a live server, let the script drive the profiler only when the framework-specific prerequisites are already met.
- Prefer SGLang `--profile-by-stage` unless the user explicitly wants an all-stage mixed trace.
- When on `h100_sglang`, create or clean the target trace directory through `docker exec sglang_bbuf ...` so the path is definitely writable under `/data`.
Two-trace workflow
- Produce a mapping trace first with graph disabled or the lower-fusion configuration.
- Produce a formal trace second with the real serving optimizations enabled.
- Run `triage` for the three-table report.
- Read the results in this order:
  - kernel table
  - overlap-opportunity table
  - fuse-pattern table
- Before calling something a "new" optimization idea, compare the top rows against both references/fuse-overlap-catalog.md and references/overlap-catalog.md. Check mainline rows first, then the `PR-backed / in-flight` sections. Prefer reporting:
  - an existing fused or overlap path that should already apply here
  - an existing path that appears disabled, unsupported, or regressed in this trace
  - an upstream pattern that is mainline elsewhere but missing locally, or still open upstream
  - a truly new opportunity only when no catalog entry fits
- If no exact pattern fully matches but the trace is still close to a known family, add one flat similarity note after the tables. Use `high`, `medium`, or `low` only. Base that note on the full pattern shape, not on one kernel name alone. Prefer semantic cues such as producer-consumer chain, source locations, CPU op names, TP context, and model-specific structure. Do not rewrite the script table itself to include these heuristic judgments.
References
Load these only when needed:
- references/source-map.md
  - upstream SGLang profiler entrypoints and trace-writing paths; still most useful for SGLang-specific source follow-up
- references/heuristics.md
  - overlap labels, dependency-risk interpretation, and limits
- references/fuse-overlap-catalog.md
  - mixed source-backed catalog of existing fuse and overlap patterns, including mainline rows plus PR-backed / in-flight rows
- references/overlap-catalog.md
  - overlap-only lookup table across LLM, VLM, diffusion, disaggregation, HiSparse, and speculative scheduling
Output Contract
Return:
- trace path or generated profile path
- framework
- model/server args when available
- kernel table
- overlap-opportunity table
- fuse-pattern table
- optional similarity note with `high` / `medium` / `low` when exact matching is inconclusive
- one short summary of what dominates the run
- whether the overlap read came from single-trace triage or mapping/formal two-trace triage
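The contract above can be pictured as a small record type. The following is an illustrative sketch only: the class name `TriageReport`, its field names, and the `"single-trace"` / `"two-trace"` labels are hypothetical and are not the script's actual output structure.

```python
from dataclasses import dataclass
from typing import Optional

SIMILARITY_LEVELS = ("high", "medium", "low")
OVERLAP_SOURCES = ("single-trace", "two-trace")


@dataclass
class TriageReport:
    trace_path: str                    # trace path or generated profile path
    framework: str                     # sglang, vllm, or trtllm
    kernel_table: str
    overlap_table: str
    fuse_pattern_table: str
    summary: str                       # one short summary of what dominates
    overlap_source: str                # which triage shape produced the overlap read
    server_args: Optional[dict] = None  # model/server args when available
    similarity: Optional[str] = None    # only when exact matching is inconclusive

    def __post_init__(self):
        if self.similarity is not None and self.similarity not in SIMILARITY_LEVELS:
            raise ValueError(f"similarity must be one of {SIMILARITY_LEVELS}")
        if self.overlap_source not in OVERLAP_SOURCES:
            raise ValueError(f"overlap_source must be one of {OVERLAP_SOURCES}")


# Placeholder values for illustration.
report = TriageReport(
    trace_path="/path/to/profile_dir_or_trace.json.gz",
    framework="sglang",
    kernel_table="...",
    overlap_table="...",
    fuse_pattern_table="...",
    summary="...",
    overlap_source="single-trace",
    similarity="medium",
)
```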