AutoDeploy Layer Visualizer
Visualize a single transformer decoder layer from an AutoDeploy SSA graph dump.
Optionally overlay actual GPU kernel names and durations from an nsys trace.
Prerequisite knowledge: This skill assumes familiarity with the companion skill on AutoDeploy graph dumps, which covers how to enable dumps, file naming conventions, SSA format basics, and GraphModule section structure. Refer to that skill for context.
Inputs
- Dump directory — path to a directory containing graph dump files
- Layer number — which decoder layer to visualize (e.g. 5)
- (Optional) Dump file — specific file. If not given, pick the file with the highest numeric prefix (final transform stage).
- (Optional) Trace file: path to an nsys trace file. When provided, GPU kernel names and durations are extracted and annotated onto each node in the visualization.
Workflow
Phase 0: Ask the user
Before starting, ask the user two questions (skip any already answered in their request):
- Do you have an nsys trace file? If yes, kernel names and durations from the trace will be annotated directly on each node in the diagram.
- If yes — prefill or decode? The CUDA graph structure differs between prefill and decode phases. Knowing which phase the trace captures helps correctly segment layers and map kernels.
Proceed once both are answered (or the user says no trace).
Phase 1: Select the dump file
If the user didn't specify a file, pick the file with the highest numeric prefix (the final transform stage). See the prerequisite skill for the file naming convention and how lexicographic sort matches pipeline order.
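The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill's bundled scripts: the function name `pick_final_dump` is hypothetical, and the `.txt` suffix filter is an assumption based on the example file name `085_compile_compile_model.txt` used later in this document. The filter also keeps previously generated `*_layer<N>.json` outputs out of the candidate set.

```python
import re
from pathlib import Path


def pick_final_dump(dump_dir, suffix=".txt"):
    """Return the dump file with the highest numeric prefix, or None.

    Hypothetical helper: the ".txt" suffix is an assumption taken from the
    example file name in this document.
    """
    candidates = []
    for path in Path(dump_dir).iterdir():
        match = re.match(r"(\d+)_", path.name)
        # Skip directories, other suffixes, and names without a numeric prefix.
        if path.is_file() and path.suffix == suffix and match:
            candidates.append((int(match.group(1)), path.name, path))
    if not candidates:
        return None
    # Compare the prefix numerically; the file name breaks ties deterministically.
    return max(candidates)[2]
```

Comparing the prefix as an integer avoids relying on zero padding, although with consistently padded prefixes a plain lexicographic sort gives the same answer.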
Phase 2: Read the graph dump
Read the selected dump file. If it contains multiple GraphModule sections (delimited by section headers), pick the one with the expected label (see the prerequisite skill), or else the first or largest one with real operation nodes.
Phase 3: Extract the layer subgraph
This is the core step — you do this yourself by understanding the graph structure. Do NOT delegate to a script.
The dump is an SSA-form graph where each line is one of:
- Placeholder: a model input such as input_ids or kv_cache
- Operation:
%name = namespace.op_name(%input1, %input2, ..., const_args...) : shape : dtype
- Output:
output(%name1, %name2, ...)
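To make the operation-line shape concrete, here is a small parsing sketch. It only mirrors the template shown above and is meant as a reading aid, not as a replacement for the manual extraction in this phase. The name `parse_op_line` is hypothetical, and real dumps may contain argument forms (nested lists, keyword arguments) that this naive comma split does not handle.

```python
import re

# Mirrors the template: %name = namespace.op_name(args...) : shape : dtype
OP_LINE = re.compile(
    r"^%(?P<name>\w+)\s*=\s*(?P<op>[\w.]+)\((?P<args>.*?)\)\s*:\s*"
    r"(?P<shape>.*?)\s*:\s*(?P<dtype>.*)$"
)
KEYWORD_CONSTANTS = {"None", "True", "False"}


def parse_op_line(line):
    """Split one operation line into its parts; return None for other lines."""
    match = OP_LINE.match(line.strip())
    if match is None:
        return None
    # Naive split: assumes arguments do not themselves contain commas.
    tokens = [tok.strip() for tok in match.group("args").split(",")]
    return {
        "name": match.group("name"),
        "op": match.group("op"),
        # %-prefixed arguments are outputs of other nodes (activations).
        "ssa_inputs": [t[1:] for t in tokens if re.fullmatch(r"%\w+", t)],
        # Bare identifiers are candidate weight or buffer references.
        "bare_refs": [
            t for t in tokens
            if re.fullmatch(r"[A-Za-z_]\w*", t) and t not in KEYWORD_CONSTANTS
        ],
        "shape": match.group("shape"),
        "dtype": match.group("dtype"),
    }
```

Usage, with a line constructed for illustration from the names in this document (the argument list is an assumption, not copied from a real dump):

```python
line = ("%noaux_tc_op_default_2 = trtllm.noaux_tc_op.default("
        "%dsv3_router_gemm_op_default_2, "
        "model_layers_5_mlp_gate_e_score_correction_bias, 8)"
        " : (8x8, 8x8) : (torch.bfloat16, torch.int32)")
info = parse_op_line(line)
print(info["ssa_inputs"], info["bare_refs"])
```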
How to identify which nodes belong to layer N:
A transformer decoder layer typically contains these blocks in order:
- Input LayerNorm — consumes
model_layers_N_input_layernorm_weight
- Self-Attention (MLA/MHA/GQA) — consumes
model_layers_N_self_attn_*
weights (q_proj, k_proj, v_proj, o_proj, kv_a_proj, kv_b_proj, q_a_proj, q_b_proj, etc.)
- Post-Attention LayerNorm + Residual — consumes
model_layers_N_post_attention_layernorm_weight
- FFN / MoE — consumes weights (gate_weight, shared_experts, etc.) and fused weight references
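Because the weight names above embed the layer index, a tiny helper can read it back when you are checking which layer a reference points to. This is a sketch under the assumption that per-layer weights follow the `model_layers_N_...` naming shown in this list; the function name `layer_of_weight` is hypothetical. Fused weight references may use a different pattern and would return `None` here, so they still need manual tracing.

```python
import re

# Assumes per-layer weights are named model_layers_<N>_..., as in the list above.
LAYER_WEIGHT = re.compile(r"^model_layers_(\d+)_")


def layer_of_weight(name):
    """Return the layer index embedded in a weight name, or None if absent."""
    match = LAYER_WEIGHT.match(name)
    return int(match.group(1)) if match else None
```

Matching the full digit run followed by an underscore keeps layer 5 from being confused with layers 50 to 59.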
Extraction rules:
- Seed nodes: Any operation whose arguments include a layer-N weight reference (such as the model_layers_N_* names listed above) or a fused weight reference. These are references without the % prefix, i.e. weight parameters, not activation outputs from other ops.
- Forward follow: From seeds, follow the dataflow forward: if a node consumes the output of a seed/layer node, it belongs to this layer too. Stop when you hit a node that consumes weights from a different layer (layer M, where M ≠ N).
- Backward follow: From seeds, follow the dataflow backward to pick up intermediate ops (getitem, sub, floordiv, eq, mul, view, reshape, etc.) that feed into seed nodes. Stop when you hit:
  - A placeholder node (model input)
  - A node that belongs to a different layer (i.e., it was already identified as part of layer M ≠ N by the seed rule, or its own backward chain reaches a different layer's weights)
  - The output of a residual-add from the previous layer (this is the boundary between layers)
- Critical: Generic arithmetic ops such as sub, floordiv, eq, and mul that sit between two layers' MoE routing logic can be tricky. Trace their inputs backward: if they ultimately derive from layer M's weights/operations (not layer N's), they belong to layer M, not layer N. The suffix number on these generic ops does NOT indicate which layer they belong to; you must trace the dataflow.
- External inputs: Nodes from outside the layer that feed into layer nodes are "external inputs" (show them as input nodes in the diagram). These include:
  - Previous layer's residual output
  - Global model buffers (RoPE caches, batch metadata)
  - KV cache placeholders
Phase 4: Produce the JSON
After extraction, output a JSON file at
<dump_dir>/<dump_stem>_layer<N>.json
with this structure:
```json
{
  "layer": 5,
  "source_file": "085_compile_compile_model.txt",
  "nodes": [
    {
      "id": "noaux_tc_op_default_2",
      "op": "trtllm.noaux_tc_op.default",
      "shape": "(8x8, 8x8)",
      "dtype": "(torch.bfloat16, torch.int32)",
      "group": "moe",
      "sub_group": "moe_router",
      "inputs": ["dsv3_router_gemm_op_default_2"],
      "weight_inputs": [
        {"name": "model_layers_5_mlp_gate_e_score_correction_bias", "shape": "256", "dtype": "torch.bfloat16"}
      ]
    }
  ],
  "edges": [
    {"from": "dsv3_router_gemm_op_default_2", "to": "noaux_tc_op_default_2"}
  ],
  "external_inputs": [
    {
      "id": "trtllm_fused_allreduce_residual_rmsnorm_default_4",
      "label": "Layer 4 residual output",
      "shape": "(2x4x7168, 2x4x7168)"
    }
  ]
}
```
Node fields:
- id: the SSA name (without the % prefix)
- op: the full operation target string
- shape, dtype: output shape and dtype
- group: the block the node belongs to, such as moe or mla (see the group assignment heuristic)
- sub_group (optional): finer classification, such as moe_router
- inputs: list of node IDs that this node consumes (only nodes within the layer or external inputs)
- weight_inputs: list of weight parameters consumed (name, shape, dtype)
Edge fields:
- from, to: node IDs (both must be in nodes or external_inputs)
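Before rendering, it can help to check the two consistency rules stated above: every edge endpoint and every entry in a node's inputs must name either a node in the layer or an external input. The sketch below does that and also builds the output path described at the start of this phase. Both function names are hypothetical and not part of the bundled scripts. The abbreviated JSON example in this document would itself fail the check, since it lists only one node.

```python
from pathlib import Path


def layer_json_path(dump_file, layer):
    """Build <dump_dir>/<dump_stem>_layer<N>.json next to the dump file."""
    dump_file = Path(dump_file)
    return dump_file.with_name(f"{dump_file.stem}_layer{layer}.json")


def check_layer_json(layer):
    """Return a list of problems; an empty list means the JSON is consistent."""
    known = {node["id"] for node in layer.get("nodes", [])}
    known |= {ext["id"] for ext in layer.get("external_inputs", [])}
    problems = []
    for edge in layer.get("edges", []):
        for end in ("from", "to"):
            if edge[end] not in known:
                problems.append(
                    f"edge endpoint {edge[end]!r} is not a node or external input"
                )
    for node in layer.get("nodes", []):
        for src in node.get("inputs", []):
            if src not in known:
                problems.append(
                    f"node {node['id']!r} consumes unknown input {src!r}"
                )
    return problems
```

A typical use is to load the file with `json.load`, call `check_layer_json`, and fix any reported IDs before running the render step.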
Group assignment heuristic:
- : ops consuming or weights
- /: ops consuming weights or named , ,
- : ops consuming MoE weights (, ), router ops (, topk), and the arithmetic ops that process router outputs (sub, floordiv, eq, mul between router and MoE fused op)
- : ops consuming weights that aren't MoE (e.g., , , , )
- : everything else (getitem, view, reshape, etc.) — assign to the same group as neighbors
Phase 5: (Optional) Extract trace kernels
If the user provided a trace file, extract per-layer kernel sequences:
```bash
python <skill_dir>/scripts/extract_trace_kernels.py <trace_file> --layer <N> --output <dump_dir>/<dump_stem>_layer<N>_kernels.json
```
This script extracts a single CUDA graph replay from the trace, groups kernels by stream, and identifies which streams belong to which layer. It outputs a JSON with per-layer kernel sequences including short names, full names, durations, and stream IDs.
Phase 6: (Optional) Annotate each node with its trace kernels
When trace kernel data is available, you must map GPU kernels to individual FX graph nodes. This is the key step — the render script will display kernel names and durations directly on each node in the diagram.
How to map kernels to nodes:
Read the trace kernel JSON and the layer JSON side by side. For each FX graph node, identify which GPU kernel(s) it corresponds to based on the op type and the kernel execution order. Common mappings:
| FX graph op | Trace kernel(s) |
|---|---|
| flashinfer_mla_with_cache | |
| + + |
| + |
| |
| flashinfer_fused_add_rms_norm | |
| (input_layernorm) | |
| (post_attn) | |
| + + + |
| (top-k routing) | |
| / fused_finegrained_fp8_swiglu_mlp | (×2) + |
| or |
| or |
| triton_rope_on_interleaved_qk_inputs | |
Add a trace_kernels list to each node that has corresponding GPU kernels:
```json
{
  "id": "flashinfer_mla_with_cache_default_5",
  "op": "auto_deploy.flashinfer_mla_with_cache.default",
  "group": "mla",
  "trace_kernels": [
    {"kernel": "fmhaSm100...", "duration_us": 50.1}
  ],
  ...
}
```
Also add a top-level trace_summary for the whole layer:
```json
{
  "layer": 5,
  "trace_summary": {
    "total_duration_us": 650.9,
    "kernel_count": 66,
    "num_streams": 3
  },
  "nodes": [...]
}
```
The render script will display each node's kernels as extra lines below the node label. Nodes without trace_kernels are left unchanged.
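Once you have decided which kernels belong to which node, writing the annotations into the layer JSON is mechanical. The sketch below shows one way to do it; `annotate_with_kernels` is a hypothetical helper, and `kernels_by_node` is the mapping you produced by hand. Note one assumption: the summary here is computed only from kernels that were mapped to a node, so it can undercount if the layer's trace contains kernels you could not attribute. In that case, take the totals from the trace kernel JSON instead.

```python
def annotate_with_kernels(layer, kernels_by_node, num_streams=None):
    """Attach trace_kernels to matching nodes and add a trace_summary.

    kernels_by_node maps a node id to a list of dicts with "kernel" and
    "duration_us" keys (the per-node shape shown above).
    """
    total_us = 0.0
    count = 0
    for node in layer["nodes"]:
        kernels = kernels_by_node.get(node["id"])
        if not kernels:
            continue  # nodes without kernels are left unchanged
        node["trace_kernels"] = [
            {"kernel": k["kernel"], "duration_us": k["duration_us"]}
            for k in kernels
        ]
        total_us += sum(k["duration_us"] for k in kernels)
        count += len(kernels)
    # Summary covers mapped kernels only (assumption, see the note above).
    summary = {"total_duration_us": round(total_us, 1), "kernel_count": count}
    if num_streams is not None:
        summary["num_streams"] = num_streams
    layer["trace_summary"] = summary
    return layer
```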
Phase 7: Render
Run the bundled visualization script:
```bash
python <skill_dir>/scripts/render_layer.py <dump_dir>/<dump_stem>_layer<N>.json --output <dump_dir>/<dump_stem>_layer<N>
```
This reads the JSON and produces the rendered diagram files. If nodes have trace_kernels fields, the script renders kernel names and durations directly on each node's label (prefixed with ⚡).
Present the output paths to the user.
Notes
- The script requires ( command) to be installed for PNG rendering
- If the dump has multiple GraphModules, prefer the module or auto-select the largest
- When in doubt about a node's layer membership, trace its weight dependencies all the way back — weight parameter names are the ground truth for layer assignment
- The trace extraction script auto-converts the trace file if needed (the conversion tool must be on PATH)
- CUDA graph kernels run across multiple streams per layer; the script handles multi-stream grouping automatically