llm-pipeline-analysis

LLM Pipeline Analysis

Overview

Use this when a whole-trace profiler summary is too coarse. The scripts read a Chrome-trace JSON file, find layer-boundary anchor kernels, group kernels into forward passes and layers, and print timing tables you can use for Perfetto navigation or detailed timing analysis.

When To Use It

  • when you need to know which layers contribute most
  • when the model has alternating layer types (e.g. models with `compress_ratios`, such as DeepSeek-V4 NSA)
  • when you need to compare cold-start vs steady-state forward passes
  • when you need to navigate to a specific layer in Perfetto UI
  • when you need to select representative layers for deep-dive analysis

Confirmation Required

Before running the scripts, collect or verify these inputs:

| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Model name | Determines which `config.json` to use; affects layer classification | Ask user | none (required) |
| Model profile | Determines anchor kernel, blocks-per-layer, and kernel classification rules | Ask user or auto-infer from config | Auto-inferred from config |
| `config.json` path | Provides `compress_ratios`, `num_hidden_layers`, `num_hash_layers`, etc. | Ask user or search filesystem | none (required) |
| GPU type | Optional context for reports and hardware notes | Ask user | |
| TP / EP | Parallelism config affects kernel naming and AllReduce count | Ask user or infer from trace filename (e.g. `TP-0`) | TP=8, EP=8 |
| Serving mode | Decode vs prefill changes kernel mix and FLOPs profile | Ask user | decode B=1 |

If the user cannot provide `config.json`, search common locations such as `/root/workspace/*/config.json` and the HuggingFace cache. If it is still not available, require an explicit `--profile`.
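
The filesystem search described above could look roughly like the following sketch. The helper name `find_config_candidates` is hypothetical, and the HuggingFace pattern assumes the default hub cache location under `~/.cache/huggingface/hub`; a relocated cache (for example via `HF_HOME`) would need an extra pattern.

```python
import glob
import os


def find_config_candidates(extra_patterns=()):
    """Return existing config.json paths from a few common locations."""
    patterns = [
        "/root/workspace/*/config.json",
        # Assumption: default Hugging Face hub cache layout.
        os.path.expanduser(
            "~/.cache/huggingface/hub/models--*/snapshots/*/config.json"
        ),
        *extra_patterns,
    ]
    found = []
    for pattern in patterns:
        # Sort each pattern's matches so the result is deterministic.
        found.extend(sorted(glob.glob(pattern)))
    return found
```

If the returned list is empty, fall back to asking for an explicit `--profile`, as described above.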

Model Profiles

The scripts use a ModelProfile to determine layer-boundary detection and kernel classification. Profiles are auto-inferred from `config.json` or selected via `--profile`:

| Profile | Anchor kernel | Blocks/layer | Layer structure | Auto-infer condition |
|---|---|---|---|---|
| `dsv4_csa_hca` | `mhc_post_tilelang` | 2 | attn + ffn halves | `compress_ratios` non-empty |
| `dsv3_mla` | `flash_fwd_mla_combine` | 1 | full layer | `kv_lora_rank > 0` |
| `generic` | auto-detect or `--anchor-kernel` | 1 | full layer | fallback |

Use `--profile generic --anchor-kernel YOUR_KERNEL` for models not covered by built-in profiles.
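
The auto-infer conditions in the table can be read as a small decision function. The sketch below is an illustration only: `infer_profile` is a hypothetical name, and the order in which the conditions are checked (table order) is an assumption rather than documented behavior.

```python
def infer_profile(config):
    """Pick a profile name from a parsed config.json dict."""
    # Assumption: conditions are checked in the order the table lists them.
    if config.get("compress_ratios"):  # present and non-empty
        return "dsv4_csa_hca"
    if (config.get("kv_lora_rank") or 0) > 0:
        return "dsv3_mla"
    return "generic"
```

For example, `infer_profile({"kv_lora_rank": 512})` selects `dsv3_mla`, while a config with neither field falls through to `generic`.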

Prerequisites

  • A `torch.profiler` trace in Chrome-trace JSON format (`.json` or `.json.gz`)
  • The model's `config.json` (for profile inference, `compress_ratios`, etc.)
  • The trace must contain a recognizable layer-boundary anchor kernel (auto-detected from the profile, or specified via `--anchor-kernel`)
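
As a rough sketch of what reading such a trace involves, the snippet below loads a `.json` or `.json.gz` Chrome trace and keeps the GPU kernel events. The helper names `load_trace_events` and `kernel_events` are hypothetical and not part of the scripts. It assumes the usual `torch.profiler` export layout, where events sit under `traceEvents` and GPU kernels are complete events (`ph == "X"`) with `cat == "kernel"`.

```python
import gzip
import json


def load_trace_events(path):
    """Return the event list from a Chrome-trace file (.json or .json.gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    # Chrome traces are either {"traceEvents": [...]} or a bare list of events.
    return data["traceEvents"] if isinstance(data, dict) else data


def kernel_events(events):
    """Keep complete ('X') GPU kernel events, sorted by start timestamp (us)."""
    kernels = [
        e for e in events
        if e.get("ph") == "X" and e.get("cat") == "kernel"
    ]
    return sorted(kernels, key=lambda e: e["ts"])
```

Sorting by `ts` matters because the later grouping into layers relies on kernels appearing in launch order.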

Layer Boundary Detection

The scripts use an anchor kernel as a layer-boundary marker. The anchor and layer structure are determined by the active ModelProfile.

For example, with the `dsv4_csa_hca` profile, each transformer layer produces 2 consecutive `mhc_post_tilelang` calls:

```
mhc_post_tilelang  ← end of attn half (attention + O-proj + AllReduce)
  ... ffn computation ...
mhc_post_tilelang  ← end of ffn half (MoE experts + AllReduce)
  ... next layer attn ...
mhc_post_tilelang  ← next layer's attn boundary
```

So for N layers with the `dsv4_csa_hca` profile, one forward pass has `2N` anchor blocks. With `dsv3_mla` or `generic`, each layer has 1 block.

Forward pass `P` starts at block index `P * (N * blocks_per_layer)`.
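
The block-index arithmetic above can be sketched as follows. Both function names are hypothetical. The sketch assumes blocks are numbered from 0 across the whole trace and that, within a layer, sub-block 0 is the first half (attn for `dsv4_csa_hca`) and sub-block 1 the second (ffn).

```python
def pass_start_block(fwd_pass, num_layers, blocks_per_layer):
    """Block index at which forward pass `fwd_pass` begins."""
    return fwd_pass * (num_layers * blocks_per_layer)


def locate_block(block_index, num_layers, blocks_per_layer):
    """Map a global block index to (forward pass, layer, sub-block)."""
    blocks_per_pass = num_layers * blocks_per_layer
    fwd_pass, offset = divmod(block_index, blocks_per_pass)
    layer, sub_block = divmod(offset, blocks_per_layer)
    return fwd_pass, layer, sub_block
```

With a hypothetical 61-layer model and 2 blocks per layer, forward pass 5 starts at block 610, and block 613 maps back to pass 5, layer 1, second half.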

Scripts

1. `layer_timeline_analyzer.py`: Per-layer timeline and cluster stats

```bash
# Show all forward passes summary (cold-start vs steady-state)
python3 scripts/layer_timeline_analyzer.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --show-all-passes

# Detailed per-layer breakdown for a specific forward pass
python3 scripts/layer_timeline_analyzer.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5

# Auto-select first steady-state pass
python3 scripts/layer_timeline_analyzer.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json
```

The script prints:
- Per-layer wall-clock time, sum-duration, and category breakdown (MLA, MoE, GEMM, NCCL, MHC, Hadamard)
- Layer cluster statistics grouped by type (C4_LIGHT, C128_HEAVY, HASH, etc.)
- All-passes summary showing cold-start → steady-state growth
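
The difference between wall-clock time and sum-duration can be illustrated with a short sketch. This is one plausible reading of the two metrics, not the script's confirmed definition: wall-clock is taken as the span from the first kernel start to the last kernel end, and sum-duration as the total of all kernel durations. The function name `layer_timing` is hypothetical, and `ts` and `dur` are assumed to be in microseconds as in Chrome traces.

```python
def layer_timing(kernels):
    """Return (wall_ms, sum_ms) for one layer's kernel events (ts/dur in us)."""
    start = min(k["ts"] for k in kernels)
    end = max(k["ts"] + k["dur"] for k in kernels)
    wall_ms = (end - start) / 1000.0           # first start to last end
    sum_ms = sum(k["dur"] for k in kernels) / 1000.0  # total busy time
    return wall_ms, sum_ms
```

Under this reading, a sum-duration well below wall-clock points to idle gaps between kernels, while a sum-duration above wall-clock means kernels overlapped on different streams.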

2. `layer_kernel_breakdown.py`: Per-layer kernel detail and compute flow

```bash
# Single layer kernel dump
python3 scripts/layer_kernel_breakdown.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layer 3

# Compute flow format (with model architecture summary and category column)
python3 scripts/layer_kernel_breakdown.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layer 3 --format compute-flow

# JSON export
python3 scripts/layer_kernel_breakdown.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layer 3 --format json

# Compare two layers side-by-side
python3 scripts/layer_kernel_breakdown.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layer 2 --compare-layer 3
```

Output formats:
- `--format text` (default): grouped summary + top hot kernels ranked by duration, with simplified names and percentages
- `--format compute-flow`: model architecture summary + per-kernel hotness table with `Category`, `%`, and `ts_rel(ms)` columns
- `--format json`: machine-readable per-kernel detail ranked by duration
- Kernel diff when comparing two layers (unique kernels in each)
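
The "unique kernels in each" diff amounts to a pair of set differences over kernel names. The sketch below is only an illustration of that idea; `kernel_diff` is a hypothetical helper, and the real script may normalize or simplify names before comparing.

```python
def kernel_diff(names_a, names_b):
    """Return (only_in_a, only_in_b) as sorted lists of kernel names."""
    set_a, set_b = set(names_a), set(names_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)
```

Kernels that appear in both layers drop out of the result, so what remains is the part of the kernel mix that distinguishes the two layer types.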

3. `perfetto_time_mapper.py`: Perfetto UI time navigation

```bash
# Show all forward pass time ranges in Perfetto
python3 scripts/perfetto_time_mapper.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json

# Layer-level time ranges for a specific forward pass
python3 scripts/perfetto_time_mapper.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layers 2,3,38,42
```

The script prints:
- Forward pass time ranges in Perfetto-relative seconds
- Per-layer start/end times with compress_ratio labels
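
A minimal sketch of the time mapping, under stated assumptions: Chrome-trace `ts` values are in microseconds, and the Perfetto timeline is taken to start at the earliest event timestamp in the trace (`trace_start_us`). Both function names are hypothetical and the script's actual offset handling may differ.

```python
def to_perfetto_seconds(ts_us, trace_start_us):
    """Convert an absolute trace timestamp (us) to timeline-relative seconds."""
    return (ts_us - trace_start_us) / 1_000_000.0


def layer_range_seconds(kernels, trace_start_us):
    """Return (start_s, end_s) covering all kernels of one layer."""
    start = min(k["ts"] for k in kernels)
    end = max(k["ts"] + k["dur"] for k in kernels)
    return (
        to_perfetto_seconds(start, trace_start_us),
        to_perfetto_seconds(end, trace_start_us),
    )
```

Here `trace_start_us` would typically be the minimum `ts` over every event in the file, not only the kernel events.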

Workflow

Step 1: Identify steady-state forward pass

```bash
python3 scripts/layer_timeline_analyzer.py \
  --trace $TRACE --config $CONFIG --show-all-passes
```

Read the "all-passes" table. The first pass is a cold start (few tokens). Find the first pass where the layer-0 wall-clock time stabilizes (typically pass 3-5).
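
One way to make "stabilizes" concrete is a simple tolerance check over the layer-0 wall-clock column. The heuristic below is an assumption for illustration, not the script's auto-selection logic: `first_steady_pass`, `rel_tol`, and `window` are all hypothetical names and defaults.

```python
def first_steady_pass(layer0_wall_ms, rel_tol=0.05, window=2):
    """Index of the first pass whose next `window` passes stay within rel_tol.

    Returns None when no pass satisfies the check.
    """
    for i in range(len(layer0_wall_ms) - window):
        base = layer0_wall_ms[i]
        following = layer0_wall_ms[i + 1 : i + 1 + window]
        if base > 0 and all(abs(v - base) / base <= rel_tol for v in following):
            return i
    return None
```

Pass the per-pass layer-0 wall-clock values in pass order. A `None` result suggests the trace is too short or still warming up, in which case picking the pass by eye from the table is safer.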

Step 2: Per-layer breakdown on steady-state pass

```bash
python3 scripts/layer_timeline_analyzer.py \
  --trace $TRACE --config $CONFIG --fwd-pass 5
```

Identify:
  • Which layer type dominates (C4_LIGHT vs C128_HEAVY vs HASH)
  • The MLA / MoE / GEMM / NCCL proportion per layer type
  • Which layer type is the best next target

Step 3: Compute flow for representative layer(s)

Select 1-2 representative layers (one per bottleneck type), then:

```bash
# Human-readable compute flow table
python3 scripts/layer_kernel_breakdown.py \
  --trace $TRACE --config $CONFIG \
  --fwd-pass 5 --layer 3 --format compute-flow

# JSON export
python3 scripts/layer_kernel_breakdown.py \
  --trace $TRACE --config $CONFIG \
  --fwd-pass 5 --layer 3 --format json > /tmp/layer3_detail.json
```

The `--format compute-flow` output includes:
- Model architecture summary at the top
- Per-kernel hotness table with `# | Half | Category | Simplified Name | dur(us) | % | ts_rel(ms) | Input Dims`
- Rows are ranked by `dur(us)` descending by default; use `ts_rel(ms)` to jump back to the kernel's trace location.

Step 4: Compare layer types (optional)

```bash
python3 scripts/layer_kernel_breakdown.py \
  --trace $TRACE --config $CONFIG \
  --fwd-pass 5 --layer 2 --compare-layer 3
```

This shows the exact kernel difference between the two layer types.

Step 5: Navigate in Perfetto UI (optional)

```bash
python3 scripts/perfetto_time_mapper.py \
  --trace $TRACE --config $CONFIG \
  --fwd-pass 5 --layers 2,3,38,42
```

Use the printed time ranges to navigate directly in Perfetto.

Layer Type Classification

The scripts classify layers based on `config.json` fields:

| Config field | Value | Layer Type | Description |
|---|---|---|---|
| `compress_ratios[i]` | 0 | FULL_ATTN | No NSA compression (layers 0-1) |
| `compress_ratios[i]` | 4 | C4_LIGHT | C128 sparse attention, fastest |
| `compress_ratios[i]` | 128 | C128_HEAVY | C4 attention + Hadamard + Indexer, bottleneck |
| `i >= N - num_hash_layers` | | HASH | Hash-table routing with paged MQA |
| `i == 0` | | FIRST | First layer (empty KV cache) |
| `i == N - 1` | | FINAL | Final layer (lm_head output) |
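
The rules in the table can be expressed as a small labeling function. Because the text does not say how the scripts resolve overlaps (for example, layer 0 matches both a `compress_ratios` rule and `i == 0`), this sketch returns every matching label instead of picking one. The name `layer_labels` is hypothetical.

```python
RATIO_TO_TYPE = {0: "FULL_ATTN", 4: "C4_LIGHT", 128: "C128_HEAVY"}


def layer_labels(i, config):
    """Return all layer-type labels that match layer index `i`."""
    n = config["num_hidden_layers"]
    ratios = config.get("compress_ratios") or []
    num_hash = config.get("num_hash_layers") or 0

    labels = []
    if i < len(ratios) and ratios[i] in RATIO_TO_TYPE:
        labels.append(RATIO_TO_TYPE[ratios[i]])
    if num_hash > 0 and i >= n - num_hash:
        labels.append("HASH")
    if i == 0:
        labels.append("FIRST")
    if i == n - 1:
        labels.append("FINAL")
    return labels
```

A caller that needs a single type per layer would have to choose a precedence among the returned labels; that choice is left open here on purpose.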

Kernel Categories

Kernels are classified by the active ModelProfile's rules. Categories marked DSv4 in the Profile column are specific to the `dsv4_csa_hca` profile; all profiles include the universal categories.

| Category | Match Pattern | Profile | Typical Share (DSv4) |
|---|---|---|---|
| ★ MLA Attention | `flash_fwd_splitkv_mla` | DSv4, DSv3 | 21-33% |
| ★ MoE Fused | `fused_moe_kernel` | DSv4, DSv3 | 11-17% |
| ● NCCL AllReduce | `AllReduce` | universal | 5-8% |
| GEMM fp8 | `deep_gemm` | universal | 12-25% |
| GEMM bf16 | `nvjet` | universal | 11-13% |
| Hadamard Xform | `hadamard` | DSv4 | 0-2.4% |
| Indexer Cache | `indexer` | DSv4 | 0-0.1% |
| Paged MQA | `paged_mqa_logits` | DSv4 | 0-1.8% |
| MHC | `mhc_pre_gemm_sqrsum`, `mhc_pre_big_fuse`, `mhc_post_tilelang` | DSv4 | 10-15% |
| C4/C128 Prefill | `c4_prefill`, `c128_prefill` | DSv4 | 0-0.3% |
| RMSNorm | `RMSNorm`, `rms_normalize` | universal | 1-2% |
| FP8 Quant | `quant`, `Quant` | universal | 1-2% |
| TopK | `topk` | universal | 0-0.7% |
| RoPE | `deepseek_rope`, `fused_norm_rope` | DSv4, DSv3 | 1-2% |
| Activation | `silu_mul_clamp`, `act_and_mul` | universal | 0-0.5% |
| Other | | universal | 2-5% |
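
The match patterns above behave like a substring lookup on the kernel name. The sketch below shows that idea using the patterns from the table. It assumes plain substring matching with the first match winning in table order, which may differ from the scripts' actual rules, and `categorize` is a hypothetical name.

```python
CATEGORY_PATTERNS = [
    ("MLA Attention", ("flash_fwd_splitkv_mla",)),
    ("MoE Fused", ("fused_moe_kernel",)),
    ("NCCL AllReduce", ("AllReduce",)),
    ("GEMM fp8", ("deep_gemm",)),
    ("GEMM bf16", ("nvjet",)),
    ("Hadamard Xform", ("hadamard",)),
    ("Indexer Cache", ("indexer",)),
    ("Paged MQA", ("paged_mqa_logits",)),
    ("MHC", ("mhc_pre_gemm_sqrsum", "mhc_pre_big_fuse", "mhc_post_tilelang")),
    ("C4/C128 Prefill", ("c4_prefill", "c128_prefill")),
    ("RMSNorm", ("RMSNorm", "rms_normalize")),
    ("FP8 Quant", ("quant", "Quant")),
    ("TopK", ("topk",)),
    ("RoPE", ("deepseek_rope", "fused_norm_rope")),
    ("Activation", ("silu_mul_clamp", "act_and_mul")),
]


def categorize(kernel_name):
    """Return the first category whose pattern occurs in the kernel name."""
    for category, patterns in CATEGORY_PATTERNS:
        if any(p in kernel_name for p in patterns):
            return category
    return "Other"
```

Because the first match wins, pattern order matters for names that contain more than one pattern; broad patterns such as `quant` are best kept below the more specific ones.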

Reporting Checklist

Include:
  1. Trace metadata: trace path, model config path, GPU type, TP/EP
  2. Model Architecture Summary (from `config.json`):
    • model name, num_layers, hidden_size, num_attention_heads, num_key_value_heads, head_dim
    • Attention type (e.g. csa_hca), Q/O LoRA ranks
    • MoE config: num_experts, topk, num_shared_experts, intermediate_size
    • MHC config (if applicable)
    • NSA config (if applicable): index_n_heads, index_head_dim, index_topk, qk_rope_head_dim, sliding_window
    • compress_ratios distribution (how many C4_LIGHT / C128_HEAVY / FULL_ATTN / HASH layers)
  3. Per-batch forward passes summary table (from `layer_timeline_analyzer.py --show-all-passes`):
    • Columns: Fwd#, Start(s), End(s), Duration(ms), Avg Layer(ms), First Layer(ms), Notes
    • Identifies cold-start vs steady-state passes
  4. Chosen forward pass: index and rationale (cold-start vs steady-state)
  5. Per-layer wall-clock and sum-duration table (from `layer_timeline_analyzer.py --fwd-pass N`):
    • Columns: L, c_r, Type, Wall(ms), SumDur(ms), MLA, MoE, GEMM, NCCL, MHC, Hadam, AR#, K#
    • Each row is one layer, with layer type label
  6. Layer cluster statistics table grouped by type:
    • Columns: Cluster, #, Avg Wall(ms), Avg Sum(ms), MLA%, MoE%, GEMM%, NCCL%, MHC%, Hadam%
    • Identifies bottleneck layer type and likely next target
  7. Compute Flow Table for selected representative layer(s):
    • Produced by `layer_kernel_breakdown.py --format compute-flow`
    • Columns: `# | Half | Category | Simplified Name | dur(us) | % | ts_rel(ms) | Input Dims`
    • Rows are sorted by top hot kernels (`dur(us)` descending) by default
    • Optional JSON export (`--format json`)
  8. Perfetto UI time ranges when requested
  9. One-line summary: bottleneck layer type and likely next target