ad-layer-visualizer


AutoDeploy Layer Visualizer


Visualize a single transformer decoder layer from an AutoDeploy SSA graph dump. Optionally overlay actual GPU kernel names and durations from an nsys trace.

Prerequisite knowledge: This skill assumes familiarity with the `ad-graph-dump` skill, which covers how to enable dumps via `AD_DUMP_GRAPHS_DIR`, file naming conventions, SSA format basics, and GraphModule section structure. Refer to `ad-graph-dump` for that context.


Inputs


- Dump directory: path to a directory containing graph dump `.txt` files
- Layer number: which decoder layer to visualize (e.g. 5)
- (Optional) Dump file: specific `.txt` file. If not given, pick the file with the highest numeric prefix (final transform stage).
- (Optional) Trace file: path to `.nsys-rep` or `.sqlite` trace file. When provided, GPU kernel names and durations are extracted and annotated onto each node in the visualization.


Workflow


Phase 0: Ask the user


Before starting, ask the user two questions (skip any already answered in their request):

- Do you have an nsys trace file (`.nsys-rep` or `.sqlite`)? If yes, kernel names and durations from the trace will be annotated directly on each node in the diagram.
- If yes: prefill or decode? The CUDA graph structure differs between prefill and decode phases. Knowing which phase the trace captures helps correctly segment layers and map kernels.

Proceed once both are answered (or the user says no trace).


Phase 1: Select the dump file


If the user didn't specify a file, pick the file with the highest numeric prefix (the final transform stage). See `ad-graph-dump` for the file naming convention and how lexicographic sort matches pipeline order.
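
The selection rule above can be sketched in a few lines of Python. This is only an illustration: the helper name `select_dump_file` is hypothetical, and the sketch assumes every dump file name starts with digits followed by an underscore (as in `085_compile_compile_model.txt`). It compares the prefixes numerically, so it does not depend on zero padding.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: dump files are named "<digits>_<stage>.txt".
_PREFIX = re.compile(r"^(\d+)_")


def select_dump_file(dump_dir: str) -> Optional[Path]:
    """Return the .txt dump with the highest numeric prefix, or None."""
    best: Optional[Path] = None
    best_key = -1
    for path in Path(dump_dir).glob("*.txt"):
        match = _PREFIX.match(path.name)
        if match is None:
            continue  # ignore files that do not follow the assumed naming
        key = int(match.group(1))
        if key > best_key:
            best, best_key = path, key
    return best
```

If the function returns `None`, the directory holds no file matching the assumed naming scheme, and it is safer to ask the user for an explicit dump file.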


Phase 2: Read the graph dump


Read the selected `.txt` file. If it contains multiple `GraphModule` sections (delimited by `========` headers), pick the one labeled `monolithic.model` or the first/largest one with real operation nodes.
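
As a rough sketch of that choice, the snippet below splits a dump into sections and picks one. The exact header layout is not specified here, so the code assumes each section starts with a single line beginning with `========` that also carries the section label. The helper names are hypothetical, and "largest" is interpreted as "most operation lines".

```python
from typing import List, Optional, Tuple

Section = Tuple[str, List[str]]  # (header line, body lines)


def split_sections(text: str) -> List[Section]:
    """Split dump text at lines starting with '========' (assumed header form)."""
    sections: List[Section] = []
    header, body = "", []
    for line in text.splitlines():
        if line.startswith("========"):
            if header or body:
                sections.append((header, body))
            header, body = line, []
        else:
            body.append(line)
    if header or body:
        sections.append((header, body))
    return sections


def count_ops(body: List[str]) -> int:
    """Count operation lines, i.e. lines of the form '%name = ...'."""
    return sum(1 for l in body if l.lstrip().startswith("%") and " = " in l)


def pick_section(text: str) -> Optional[Section]:
    """Prefer the section labeled monolithic.model, else the one with most ops."""
    candidates = [s for s in split_sections(text) if count_ops(s[1]) > 0]
    if not candidates:
        return None
    for section in candidates:
        if "monolithic.model" in section[0]:
            return section
    return max(candidates, key=lambda s: count_ops(s[1]))
```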


Phase 3: Extract the layer subgraph


This is the core step — you do this yourself by understanding the graph structure. Do NOT delegate to a script.

The dump is an SSA-form graph where each line is one of:

- Placeholder: `%name : shape : dtype` (model inputs like input_ids, kv_cache, etc.)
- Operation: `%name = namespace.op_name(%input1, %input2, ..., const_args...) : shape : dtype`
- Output: `output(%name1, %name2, ...)`
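
To make the three line shapes concrete, here is a minimal parsing sketch. It is meant as a reading aid rather than a replacement for inspecting the dump. The function name `parse_line` and the returned dictionary keys are hypothetical, and the code assumes the ` : ` separators and balanced parentheses shown above.

```python
import re
from typing import Optional


def _matching_paren(s: str, start: int) -> int:
    """Return the index of the ')' that closes the '(' at `start`, or -1."""
    depth = 0
    for i in range(start, len(s)):
        if s[i] == "(":
            depth += 1
        elif s[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    return -1


def parse_line(line: str) -> Optional[dict]:
    """Classify one dump line as output, operation, or placeholder."""
    line = line.strip()
    if line.startswith("output("):
        return {"kind": "output", "refs": re.findall(r"%([\w.]+)", line)}
    if not line.startswith("%"):
        return None
    if " = " in line:
        name, rhs = line.split(" = ", 1)
        open_idx = rhs.find("(")
        close_idx = _matching_paren(rhs, open_idx) if open_idx >= 0 else -1
        if close_idx < 0:
            return None
        args = rhs[open_idx + 1:close_idx]
        # What remains after the argument list is " : shape : dtype".
        rest = rhs[close_idx + 1:].strip().lstrip(":").strip()
        shape, _, dtype = rest.rpartition(" : ")
        return {
            "kind": "operation",
            "name": name[1:],
            "op": rhs[:open_idx],
            "args": args,  # raw text, includes weights and constants
            "inputs": re.findall(r"%([\w.]+)", args),  # activation inputs only
            "shape": shape,
            "dtype": dtype,
        }
    parts = line.split(" : ", 2)
    parts += [""] * (3 - len(parts))
    return {"kind": "placeholder", "name": parts[0][1:],
            "shape": parts[1], "dtype": parts[2]}
```

The `inputs` list holds only `%` references. Weight names stay in the raw `args` string, which is what the layer-membership rules look at.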

How to identify which nodes belong to layer N:

A transformer decoder layer typically contains these blocks in order:

- Input LayerNorm: consumes `model_layers_N_input_layernorm_weight`
- Self-Attention (MLA/MHA/GQA): consumes `model_layers_N_self_attn_*` weights (q_proj, k_proj, v_proj, o_proj, kv_a_proj, kv_b_proj, q_a_proj, q_b_proj, etc.)
- Post-Attention LayerNorm + Residual: consumes `model_layers_N_post_attention_layernorm_weight`
- FFN / MoE: consumes `model_layers_N_mlp_*` weights (gate_weight, shared_experts, etc.) and `fused_*_N_*` fused weight references

Extraction rules:

- Seed nodes: Any operation whose arguments include a weight reference matching `layers_N_` or `layers.N.`, or a fused weight pattern like `fused_*_N_*` (where these are non-`%` references, i.e. weight parameters, not activation outputs from other ops).
- Forward follow: From seeds, follow the dataflow forward: if a node consumes the output of a seed/layer node, it belongs to this layer too. Stop when you hit a node that consumes weights from a different layer (e.g. `layers_M_` where M ≠ N).
- Backward follow: From seeds, follow the dataflow backward to pick up intermediate ops (getitem, sub, floordiv, eq, mul, view, reshape, etc.) that feed into seed nodes. Stop when you hit:
  - A placeholder node (model input)
  - A node that belongs to a different layer (i.e., it was already identified as part of layer M ≠ N by the seed rule, or its own backward chain reaches a different layer's weights)
  - The output of a residual-add from the previous layer (this is the boundary between layers)
- Critical: Generic arithmetic ops like `sub_1`, `floordiv_1`, `mul_1`, `eq_1` that sit between two layers' MoE routing logic can be tricky. Trace their inputs backward: if they ultimately derive from layer M's weights/operations (not layer N's), they belong to layer M, not layer N. The suffix number on these generic ops does NOT indicate which layer they belong to; you must trace the dataflow.
- External inputs: Nodes from outside the layer that feed into layer nodes are "external inputs" (show them as input nodes in the diagram). These include:
  - Previous layer's residual output
  - Global model buffers (RoPE caches like `_ad_rotary_cos_sin_N`, batch metadata)
  - KV cache placeholders
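
For intuition only, the seed rule can be written as a pattern match over an operation's raw argument text. This does not replace the manual dataflow tracing described above. It only shows what "a non-`%` reference matching `layers_N_`" means in practice. The helper names are hypothetical, and the `fused_*_N_*` glob is taken literally, which may over-match if a fused weight name contains other numbers.

```python
import re
from typing import List, Set

# Identifiers that are NOT preceded by '%', i.e. weight names and constants,
# as opposed to activation outputs of other ops.
_NON_SSA_IDENT = re.compile(r"(?<![%\w.])[A-Za-z_][\w.]*")


def weight_args(arg_string: str) -> List[str]:
    """Return the non-% identifiers that appear in an argument list."""
    return _NON_SSA_IDENT.findall(arg_string)


def referenced_layers(arg_string: str) -> Set[int]:
    """Layer indices named by weights in the arguments (layers_N_ or layers.N.)."""
    found: Set[int] = set()
    for weight in weight_args(arg_string):
        found.update(int(n) for n in re.findall(r"layers[_.](\d+)[_.]", weight))
    return found


def is_seed(arg_string: str, layer: int) -> bool:
    """True if some weight argument matches layer `layer` per the seed rule."""
    n = re.escape(str(layer))
    layer_pat = re.compile(rf"layers[_.]{n}[_.]")
    fused_pat = re.compile(rf"^fused_.*_{n}_")  # literal reading of fused_*_N_*
    return any(layer_pat.search(w) or fused_pat.search(w)
               for w in weight_args(arg_string))
```

`referenced_layers` also gives a quick check for the forward-follow stop condition: a node whose arguments name a layer other than N marks the boundary.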


Phase 4: Produce the JSON


After extraction, output a JSON file at `<dump_dir>/<dump_stem>_layer<N>.json` with this structure:

```json
{
  "layer": 5,
  "source_file": "085_compile_compile_model.txt",
  "nodes": [
    {
      "id": "noaux_tc_op_default_2",
      "op": "trtllm.noaux_tc_op.default",
      "shape": "(8x8, 8x8)",
      "dtype": "(torch.bfloat16, torch.int32)",
      "group": "moe",
      "sub_group": "moe_router",
      "inputs": ["dsv3_router_gemm_op_default_2"],
      "weight_inputs": [
        {"name": "model_layers_5_mlp_gate_e_score_correction_bias", "shape": "256", "dtype": "torch.bfloat16"}
      ]
    }
  ],
  "edges": [
    {"from": "dsv3_router_gemm_op_default_2", "to": "noaux_tc_op_default_2"}
  ],
  "external_inputs": [
    {
      "id": "trtllm_fused_allreduce_residual_rmsnorm_default_4",
      "label": "Layer 4 residual output",
      "shape": "(2x4x7168, 2x4x7168)"
    }
  ]
}
```

Node fields:

- `id`: the SSA name (without `%` prefix)
- `op`: the full operation target string
- `shape`, `dtype`: output shape and dtype
- `group`: one of `norm`, `attention`, `mla`, `moe`, `mlp`, `mamba`, `gdn`, `other`
- `sub_group` (optional): finer classification like `q_branch`, `kv_branch`, `rope`, `moe_router`, `moe_experts`, `shared_experts`
- `inputs`: list of node IDs that this node consumes (only nodes within the layer or external inputs)
- `weight_inputs`: list of weight parameters consumed (name, shape, dtype)

Edge fields:

- `from`, `to`: node IDs (both must be in `nodes` or `external_inputs`)
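
Before rendering, it can help to sanity-check the JSON against the field rules above. The following is a minimal sketch: the function name `validate_layer_json` is hypothetical, and it checks only the constraints stated in this section (no `%` prefix, a known `group`, and inputs and edges that resolve to declared IDs).

```python
from typing import List

ALLOWED_GROUPS = {"norm", "attention", "mla", "moe", "mlp", "mamba", "gdn", "other"}


def validate_layer_json(doc: dict) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: List[str] = []
    nodes = doc.get("nodes", [])
    known = {n["id"] for n in nodes}
    known |= {e["id"] for e in doc.get("external_inputs", [])}
    for node in nodes:
        if node["id"].startswith("%"):
            problems.append(f"node id {node['id']!r} still has the % prefix")
        if node.get("group") not in ALLOWED_GROUPS:
            problems.append(f"node {node['id']!r} has unknown group {node.get('group')!r}")
        for src in node.get("inputs", []):
            if src not in known:
                problems.append(f"node {node['id']!r} input {src!r} is not declared")
    for edge in doc.get("edges", []):
        for end in ("from", "to"):
            if edge.get(end) not in known:
                problems.append(f"edge {end} {edge.get(end)!r} is not declared")
    return problems
```

Note that the abbreviated example JSON above would fail this check because it shows only one node. A complete layer file lists every node that appears in `inputs` and `edges`.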

Group assignment heuristic:

- `norm`: ops consuming `input_layernorm` or `post_attention_layernorm` weights
- `mla` / `attention`: ops consuming `self_attn_*` weights or named `*mla*`, `*rope*`, `*sdpa*`
- `moe`: ops consuming MoE weights (`*moe*`, `*experts*`), router ops (`noaux_tc_op`, topk), and the arithmetic ops that process router outputs (sub, floordiv, eq, mul between router and MoE fused op)
- `mlp`: ops consuming `mlp_*` weights that aren't MoE (e.g., `shared_experts`, `gate_proj`, `up_proj`, `down_proj`)
- `other`: everything else (getitem, view, reshape, etc.); assign to the same group as neighbors
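
The heuristic can be sketched as an ordered series of substring checks. Several choices below are assumptions and not part of the rules above: the order of the checks, choosing `mla` over `attention` whenever the op name contains `mla`, and testing `shared_experts` before the generic `*experts*` pattern so that shared experts land in `mlp`. The neighbor-based assignment for `other` nodes and the arithmetic ops after the router need graph context and are not covered.

```python
from typing import List


def assign_group(op: str, weight_names: List[str]) -> str:
    """Return a first-guess group for a node from its op and weight names."""
    op_l = op.lower()
    weights = [w.lower() for w in weight_names]

    def any_weight(*needles: str) -> bool:
        return any(n in w for w in weights for n in needles)

    if any_weight("input_layernorm", "post_attention_layernorm"):
        return "norm"
    if any_weight("self_attn_") or any(k in op_l for k in ("mla", "rope", "sdpa")):
        # Assumption: an op named *mla* is MLA, anything else is generic attention.
        return "mla" if "mla" in op_l else "attention"
    if "noaux_tc_op" in op_l or "topk" in op_l:
        return "moe"  # router ops
    if any_weight("shared_experts"):
        return "mlp"  # assumption: checked before the broader *experts* match
    if any_weight("moe", "experts"):
        return "moe"
    if any_weight("mlp_"):
        return "mlp"
    return "other"
```

Treat the result as a starting point and correct it by hand where the dataflow says otherwise.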


Phase 5: (Optional) Extract trace kernels


If the user provided a trace file, extract per-layer kernel sequences:

```bash
python <skill_dir>/scripts/extract_trace_kernels.py <trace_file> --layer <N> --output <dump_dir>/<dump_stem>_layer<N>_kernels.json
```

This script uses `graphNodeId` to extract a single CUDA graph replay, groups kernels by stream, and identifies which streams belong to which layer. It outputs a JSON with per-layer kernel sequences including short names, full names, durations, and stream IDs.


Phase 6: (Optional) Annotate each node with its trace kernels


When trace kernel data is available, you must map GPU kernels to individual FX graph nodes. This is the key step — the render script will display kernel names and durations directly on each node in the diagram.

How to map kernels to nodes:

Read the trace kernel JSON and the layer JSON side by side. For each FX graph node, identify which GPU kernel(s) it corresponds to based on the op type and the kernel execution order. Common mappings:

| FX graph op | Trace kernel(s) |
| --- | --- |
| `flashinfer_mla_with_cache` | `fmhaSm100...` |
| `finegrained_fp8_linear` | `fp8_blockscale` + `pack_fp32_to_ue8m0` + `deep_gemm_fp8` |
| `torch_linear_simple` | `nvjet_gemm` + `splitKreduce` |
| `flashinfer_rms_norm` | `rms_norm_reduce_fusion` |
| `flashinfer_fused_add_rms_norm` | `fused_kernel_a5fe...` |
| `mlir_fused` (input_layernorm) | `fused_kernel_1984...` |
| `mlir_fused` (post_attn) | `fused_kernel_2e06...` |
| `trtllm_moe_fused` | `bmm_E4m3...` + `moe_activation_deepseek` + `bmm_Bfloat16...` + `moe_finalize` |
| `noaux_tc_op` (top-k routing) | `deepseek_v3_topk` |
| `fused_swiglu_mlp` / `fused_finegrained_fp8_swiglu_mlp` | `deep_gemm_fp8` (×2) + `act_and_mul` |
| `trtllm_dist_all_reduce` | `nccl_allreduce_symk` or `symm_mem_allreduce` |
| `symm_mem_all_gather` | `symm_mem_allgather` or `nccl_allgather` |
| `triton_rope_on_interleaved_qk_inputs` | `mla_rope_assign_qkv` |

Add a `"trace_kernels"` list to each node that has corresponding GPU kernels:

```json
{
  "id": "flashinfer_mla_with_cache_default_5",
  "op": "auto_deploy.flashinfer_mla_with_cache.default",
  "group": "mla",
  "trace_kernels": [
    {"kernel": "fmhaSm100...", "duration_us": 50.1}
  ],
  ...
}
```

Also add top-level `trace_summary` for the whole layer:

```json
{
  "layer": 5,
  "trace_summary": {
    "total_duration_us": 650.9,
    "kernel_count": 66,
    "num_streams": 3
  },
  "nodes": [...]
}
```

The render script will display each node's kernels as `⚡ kernel_name (Xµs)` lines below the node label. Nodes without `trace_kernels` are left unchanged.
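
Once the mapping has been decided by hand, writing it into the layer JSON is mechanical. The sketch below assumes the mapping is held as a dictionary from node ID to `(kernel name, duration in µs)` pairs. Both helper names are hypothetical, and `kernel_lines` only imitates the label format described above rather than reproducing the render script's exact output. `trace_summary` is not computed here, because how the extraction script derives the layer totals is not stated.

```python
from typing import Dict, List, Tuple

KernelMap = Dict[str, List[Tuple[str, float]]]


def annotate(layer_doc: dict, kernels_by_node: KernelMap) -> dict:
    """Attach a trace_kernels list to every node present in the mapping."""
    for node in layer_doc["nodes"]:
        kernels = kernels_by_node.get(node["id"])
        if kernels:
            node["trace_kernels"] = [
                {"kernel": name, "duration_us": duration}
                for name, duration in kernels
            ]
    return layer_doc


def kernel_lines(node: dict) -> List[str]:
    """Format a node's kernels in the '⚡ kernel_name (Xµs)' style."""
    return [
        f"⚡ {k['kernel']} ({k['duration_us']}µs)"
        for k in node.get("trace_kernels", [])
    ]
```

Nodes that are absent from the mapping keep their original form, which matches the "left unchanged" behavior for nodes without `trace_kernels`.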


Phase 7: Render


Run the bundled visualization script:

```bash
python <skill_dir>/scripts/render_layer.py <dump_dir>/<dump_stem>_layer<N>.json --output <dump_dir>/<dump_stem>_layer<N>
```

This reads the JSON and produces `.dot` and `.png` files. If nodes have `trace_kernels` fields, the script renders kernel names and durations directly on each node's label (prefixed with ⚡).

Present the output paths to the user.
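
If the render step is driven from Python, a thin wrapper can check for Graphviz first and report the expected output paths. This is a sketch under assumptions: the wrapper names are hypothetical, and the output files are assumed to be the `--output` base with `.dot` and `.png` appended.

```python
import shutil
import subprocess
import sys
from pathlib import Path
from typing import List


def build_render_command(skill_dir: str, layer_json: str) -> List[str]:
    """Build the render_layer.py command line for a layer JSON file."""
    json_path = Path(layer_json)
    out_base = json_path.with_suffix("")  # drop only the trailing ".json"
    script = Path(skill_dir) / "scripts" / "render_layer.py"
    return [sys.executable, str(script), str(json_path), "--output", str(out_base)]


def render(skill_dir: str, layer_json: str) -> List[Path]:
    """Run the render script and return the assumed .dot and .png paths."""
    if shutil.which("dot") is None:
        raise RuntimeError("Graphviz 'dot' not found on PATH; PNG rendering needs it")
    cmd = build_render_command(skill_dir, layer_json)
    subprocess.run(cmd, check=True)
    out_base = cmd[-1]
    # Assumption: outputs are named <out_base>.dot and <out_base>.png.
    return [Path(out_base + ".dot"), Path(out_base + ".png")]
```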


Notes


- The render script requires `graphviz` (`dot` command) to be installed for PNG rendering
- If the dump has multiple GraphModules, prefer the `monolithic.model` module or auto-select the largest
- When in doubt about a node's layer membership, trace its weight dependencies all the way back; weight parameter names are the ground truth for layer assignment
- The trace extraction script auto-converts `.nsys-rep` to `.sqlite` if needed (requires `nsys` on PATH)
- CUDA graph kernels run across multiple streams per layer; the trace extraction script handles multi-stream grouping automatically
