ad-layer-visualizer
AutoDeploy Layer Visualizer
Visualize a single transformer decoder layer from an AutoDeploy SSA graph dump.
Optionally overlay actual GPU kernel names and durations from an nsys trace.
Prerequisite knowledge: This skill assumes familiarity with the `ad-graph-dump` skill, which covers how to enable dumps via `AD_DUMP_GRAPHS_DIR`, file naming conventions, SSA format basics, and GraphModule section structure. Refer to `ad-graph-dump` for that context.
Inputs
- Dump directory: path to a directory containing `.txt` graph dump files
- Layer number: which decoder layer to visualize (e.g. 5)
- (Optional) Dump file: specific `.txt` file. If not given, pick the file with the highest numeric prefix (final transform stage).
- (Optional) Trace file: path to a `.nsys-rep` or `.sqlite` trace file. When provided, GPU kernel names and durations are extracted and annotated onto each node in the visualization.
Workflow
Phase 0: Ask the user
Before starting, ask the user two questions (skip any already answered in their request):
- Do you have an nsys trace file? (`.nsys-rep` or `.sqlite`) If yes, kernel names and durations from the trace will be annotated directly on each node in the diagram.
- If yes: prefill or decode? The CUDA graph structure differs between prefill and decode phases. Knowing which phase the trace captures helps correctly segment layers and map kernels.
Proceed once both are answered (or the user says no trace).
Phase 1: Select the dump file
If the user didn't specify a file, pick the file with the highest numeric prefix (the final transform stage). See `ad-graph-dump` for the file naming convention and how lexicographic sort matches pipeline order.
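The selection rule above can be sketched in a few lines of Python. This is only an illustration: the helper name `pick_final_dump` is hypothetical, and it assumes dump files are named `<digits>_<stage>.txt` (as in `085_compile_compile_model.txt`); the authoritative naming convention is the one described in `ad-graph-dump`.

```python
import re
from pathlib import Path
from typing import Optional


def pick_final_dump(dump_dir: str) -> Optional[Path]:
    """Return the .txt dump with the highest numeric prefix, or None if there is none."""
    best = None
    best_key = -1
    for path in Path(dump_dir).glob("*.txt"):
        match = re.match(r"(\d+)_", path.name)
        if match is None:
            continue  # assumption: files without a numeric prefix are not stage dumps
        key = int(match.group(1))
        if key > best_key:
            best, best_key = path, key
    return best
```

Comparing the prefix as an integer rather than as a string keeps the result correct even if the prefixes are not zero-padded to the same width.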
Phase 2: Read the graph dump
Read the selected `.txt` file. If it contains multiple `GraphModule` sections (delimited by `========` headers), pick the one labeled `monolithic.model`, or otherwise the first/largest one with real operation nodes.
Phase 3: Extract the layer subgraph
This is the core step — you do this yourself by understanding the graph structure. Do NOT delegate to a script.
The dump is an SSA-form graph where each line is one of:
- Placeholder: `%name : shape : dtype` (model inputs like input_ids, kv_cache, etc.)
- Operation: `%name = namespace.op_name(%input1, %input2, ..., const_args...) : shape : dtype`
- Output: `output(%name1, %name2, ...)`
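To make the three line shapes concrete, here is a minimal parsing sketch. It is a reading aid, not a replacement for understanding the graph by hand. The function name `parse_ssa_line` and the returned dictionary layout are hypothetical, and the regular expressions assume lines look exactly like the patterns listed above; real dumps may carry extra whitespace or annotations.

```python
import re
from typing import Optional

# Operation: %name = namespace.op_name(args...) : shape : dtype
OPERATION_RE = re.compile(
    r"^%(?P<name>\w+)\s*=\s*(?P<op>[\w.]+)\((?P<args>.*)\)"
    r"\s*:\s*(?P<shape>.+?)\s*:\s*(?P<dtype>.+)$"
)
# Placeholder: %name : shape : dtype
PLACEHOLDER_RE = re.compile(
    r"^%(?P<name>\w+)\s*:\s*(?P<shape>.+?)\s*:\s*(?P<dtype>.+)$"
)
# Output: output(%name1, %name2, ...)
OUTPUT_RE = re.compile(r"^output\((?P<args>.*)\)$")
# %-prefixed references are SSA values; bare names are weights or constants.
SSA_REF_RE = re.compile(r"%(\w+)")


def parse_ssa_line(line: str) -> Optional[dict]:
    """Classify one dump line; return None for headers and anything unrecognized."""
    text = line.strip()
    m = OPERATION_RE.match(text)
    if m:
        return {
            "kind": "operation",
            "name": m["name"],
            "op": m["op"],
            "inputs": SSA_REF_RE.findall(m["args"]),
            "args": m["args"],
            "shape": m["shape"],
            "dtype": m["dtype"],
        }
    m = PLACEHOLDER_RE.match(text)
    if m:
        return {"kind": "placeholder", "name": m["name"],
                "shape": m["shape"], "dtype": m["dtype"]}
    m = OUTPUT_RE.match(text)
    if m:
        return {"kind": "output", "inputs": SSA_REF_RE.findall(m["args"])}
    return None
```

The raw `args` string is kept on operation nodes because weight parameters appear there without a `%` prefix, which is what the seed rule later relies on.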
How to identify which nodes belong to layer N:
A transformer decoder layer typically contains these blocks in order:
- Input LayerNorm: consumes `model_layers_N_input_layernorm_weight`
- Self-Attention (MLA/MHA/GQA): consumes `model_layers_N_self_attn_*` weights (q_proj, k_proj, v_proj, o_proj, kv_a_proj, kv_b_proj, q_a_proj, q_b_proj, etc.)
- Post-Attention LayerNorm + Residual: consumes `model_layers_N_post_attention_layernorm_weight`
- FFN / MoE: consumes `model_layers_N_mlp_*` weights (gate_weight, shared_experts, etc.) and `fused_*_N_*` fused weight references
Extraction rules:
- Seed nodes: Any operation whose arguments include a weight reference matching `layers_N_` or `layers.N.`, or a fused weight pattern like `fused_*_N_*` (where these are non-`%` references, i.e. weight parameters, not activation outputs from other ops).
- Forward follow: From seeds, follow the dataflow forward. If a node consumes the output of a seed/layer node, it belongs to this layer too. Stop when you hit a node that consumes weights from a different layer (e.g. `layers_M_` where M ≠ N).
- Backward follow: From seeds, follow the dataflow backward to pick up intermediate ops (getitem, sub, floordiv, eq, mul, view, reshape, etc.) that feed into seed nodes. Stop when you hit:
  - A placeholder node (model input)
  - A node that belongs to a different layer (i.e., it was already identified as part of layer M ≠ N by the seed rule, or its own backward chain reaches a different layer's weights)
  - The output of a residual-add from the previous layer (this is the boundary between layers)
- Critical: Generic arithmetic ops like `sub_1`, `floordiv_1`, `mul_1`, `eq_1` that sit between two layers' MoE routing logic can be tricky. Trace their inputs backward: if they ultimately derive from layer M's weights/operations (not layer N's), they belong to layer M, not layer N. The suffix number on these generic ops does NOT indicate which layer they belong to; you must trace the dataflow.
- External inputs: Nodes from outside the layer that feed into layer nodes are "external inputs" (show them as input nodes in the diagram). These include:
  - Previous layer's residual output
  - Global model buffers (RoPE caches like `_ad_rotary_cos_sin_N`, batch metadata)
  - KV cache placeholders
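The seed rule has one easy-to-miss pitfall: a naive substring search for `layers_5` also hits `layers_50`. The sketch below shows one way to express the seed patterns with explicit delimiters. It only illustrates the matching rule; it does not perform the forward and backward tracing, which still has to be done by reading the dataflow. The helper name `weight_refs_for_layer` and the weight names in the usage example are hypothetical.

```python
import re
from typing import List

# Identifiers inside an argument list, optionally %-prefixed.
TOKEN_RE = re.compile(r"%?[A-Za-z_][\w.]*")


def weight_refs_for_layer(args: str, layer: int) -> List[str]:
    """Return the bare (non-%) weight references in `args` that name layer `layer`."""
    plain = re.compile(rf"layers[_.]{layer}[_.]")   # layers_N_ or layers.N.
    fused = re.compile(rf"^fused_\w*_{layer}_")     # fused_*_N_*
    hits = []
    for token in TOKEN_RE.findall(args):
        if token.startswith("%"):
            continue  # activation output of another op, never a seed weight
        if plain.search(token) or fused.search(token):
            hits.append(token)
    return hits
```

For example, `weight_refs_for_layer("%rms_norm_3, model_layers_5_self_attn_q_a_proj_weight, model_layers_50_mlp_gate_weight", 5)` returns only the `model_layers_5_...` name, because the trailing delimiter after the layer number is part of the pattern.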
Phase 4: Produce the JSON
After extraction, output a JSON file at `<dump_dir>/<dump_stem>_layer<N>.json` with this structure:
```json
{
  "layer": 5,
  "source_file": "085_compile_compile_model.txt",
  "nodes": [
    {
      "id": "noaux_tc_op_default_2",
      "op": "trtllm.noaux_tc_op.default",
      "shape": "(8x8, 8x8)",
      "dtype": "(torch.bfloat16, torch.int32)",
      "group": "moe",
      "sub_group": "moe_router",
      "inputs": ["dsv3_router_gemm_op_default_2"],
      "weight_inputs": [
        {"name": "model_layers_5_mlp_gate_e_score_correction_bias", "shape": "256", "dtype": "torch.bfloat16"}
      ]
    }
  ],
  "edges": [
    {"from": "dsv3_router_gemm_op_default_2", "to": "noaux_tc_op_default_2"}
  ],
  "external_inputs": [
    {
      "id": "trtllm_fused_allreduce_residual_rmsnorm_default_4",
      "label": "Layer 4 residual output",
      "shape": "(2x4x7168, 2x4x7168)"
    }
  ]
}
```
Node fields:
- `id`: the SSA name (without `%` prefix)
- `op`: the full operation target string
- `shape`, `dtype`: output shape and dtype
- `group`: one of `norm`, `attention`/`mla`, `moe`, `mlp`, `mamba`, `gdn`, `other`
- `sub_group` (optional): finer classification like `q_branch`, `kv_branch`, `rope`, `moe_router`, `moe_experts`, `shared_experts`
- `inputs`: list of node IDs that this node consumes (only nodes within the layer or external inputs)
- `weight_inputs`: list of weight parameters consumed (name, shape, dtype)
Edge fields:
- `from`, `to`: node IDs (both must be in `nodes` or `external_inputs`)
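The edge constraint is easy to check mechanically before rendering. The following is a small sketch of such a check; the function name `find_dangling_edges` is hypothetical and is not part of the bundled scripts.

```python
from typing import List


def find_dangling_edges(layer_json: dict) -> List[dict]:
    """Return edges whose endpoints are in neither `nodes` nor `external_inputs`."""
    known = {node["id"] for node in layer_json.get("nodes", [])}
    known |= {ext["id"] for ext in layer_json.get("external_inputs", [])}
    return [
        edge
        for edge in layer_json.get("edges", [])
        if edge["from"] not in known or edge["to"] not in known
    ]
```

An empty result means every edge endpoint is declared. A non-empty result usually points at a node that was traced during extraction but never written to `nodes`, or at an external input that was not declared.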
Group assignment heuristic:
- `norm`: ops consuming `input_layernorm` or `post_attention_layernorm` weights
- `mla` / `attention`: ops consuming `self_attn_*` weights or named `*mla*`, `*rope*`, `*sdpa*`
- `moe`: ops consuming MoE weights (`*moe*`, `*experts*`), router ops (`noaux_tc_op`, topk), and the arithmetic ops that process router outputs (sub, floordiv, eq, mul between router and MoE fused op)
- `mlp`: ops consuming `mlp_*` weights that aren't MoE (e.g., `shared_experts`, `gate_proj`, `up_proj`, `down_proj`)
- `other`: everything else (getitem, view, reshape, etc.); assign these to the same group as their neighbors
Phase 5: (Optional) Extract trace kernels
If the user provided a trace file, extract per-layer kernel sequences:
```bash
python <skill_dir>/scripts/extract_trace_kernels.py <trace_file> --layer <N> --output <dump_dir>/<dump_stem>_layer<N>_kernels.json
```
This script uses `graphNodeId` to extract a single CUDA graph replay, groups kernels by stream, and identifies which streams belong to which layer. It outputs a JSON with per-layer kernel sequences including short names, full names, durations, and stream IDs.
Phase 6: (Optional) Annotate each node with its trace kernels
When trace kernel data is available, you must map GPU kernels to individual FX graph nodes. This is the key step — the render script will display kernel names and durations directly on each node in the diagram.
How to map kernels to nodes:
Read the trace kernel JSON and the layer JSON side by side. For each FX graph node, identify which GPU kernel(s) it corresponds to based on the op type and the kernel execution order. Common mappings:
| FX graph op | Trace kernel(s) |
|---|---|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
Add a `"trace_kernels"` list to each node that has corresponding GPU kernels:
```json
{
  "id": "flashinfer_mla_with_cache_default_5",
  "op": "auto_deploy.flashinfer_mla_with_cache.default",
  "group": "mla",
  "trace_kernels": [
    {"kernel": "fmhaSm100...", "duration_us": 50.1}
  ],
  ...
}
```
Also add a top-level `trace_summary` for the whole layer:
```json
{
  "layer": 5,
  "trace_summary": {
    "total_duration_us": 650.9,
    "kernel_count": 66,
    "num_streams": 3
  },
  "nodes": [...]
}
```
The render script will display each node's kernels as `⚡ kernel_name (Xµs)` lines below the node label. Nodes without `trace_kernels` are left unchanged.
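After annotating, it can be useful to cross-check the per-node kernels against `trace_summary`. The sketch below sums the durations and counts the kernels attached to nodes. It assumes, without confirmation from the scripts, that `total_duration_us` is a plain sum of kernel durations; the helper name `summarize_node_kernels` is hypothetical.

```python
from typing import Dict, List


def summarize_node_kernels(nodes: List[dict]) -> Dict[str, float]:
    """Sum durations and count kernels across every node's `trace_kernels` list."""
    total_us = 0.0
    count = 0
    for node in nodes:
        # Nodes without a GPU kernel mapping simply have no `trace_kernels` key.
        for kernel in node.get("trace_kernels", []):
            total_us += kernel["duration_us"]
            count += 1
    return {"total_duration_us": round(total_us, 3), "kernel_count": count}
```

If the count is lower than `kernel_count` in `trace_summary`, some trace kernels were left unmapped. A large gap is a hint to revisit the mapping rather than proof of an error, since not every kernel necessarily corresponds to a single graph node.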
Phase 7: Render
Run the bundled visualization script:
```bash
python <skill_dir>/scripts/render_layer.py <dump_dir>/<dump_stem>_layer<N>.json --output <dump_dir>/<dump_stem>_layer<N>
```
This reads the JSON and produces `.dot` and `.png` files. If nodes have `trace_kernels` fields, the script renders kernel names and durations directly on each node's label (prefixed with ⚡).
Present the output paths to the user.
Notes
- The script requires `graphviz` (`dot` command) to be installed for PNG rendering
- If the dump has multiple GraphModules, prefer the `monolithic.model` module or auto-select the largest
- When in doubt about a node's layer membership, trace its weight dependencies all the way back; weight parameter names are the ground truth for layer assignment
- The trace extraction script auto-converts `.nsys-rep` to `.sqlite` if needed (requires `nsys` on PATH)
- CUDA graph kernels run across multiple streams per layer; the script handles multi-stream grouping automatically
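The two external tool requirements above can be checked up front, before any extraction work is done. This is a minimal sketch using only the standard library; the helper name `missing_tools` is hypothetical.

```python
import shutil
from typing import List


def missing_tools(need_nsys: bool = False) -> List[str]:
    """Return the required command-line tools that are not found on PATH."""
    required = ["dot"]  # graphviz, needed for PNG rendering
    if need_nsys:
        required.append("nsys")  # only needed to convert .nsys-rep to .sqlite
    return [tool for tool in required if shutil.which(tool) is None]
```

Passing `need_nsys=True` only makes sense when the user supplied a `.nsys-rep` trace; a `.sqlite` trace needs no conversion.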