llm-pipeline-analysis

LLM Pipeline Analysis

Overview

Use this when a whole-trace profiler summary is too coarse. The scripts read a Chrome-trace JSON file, find layer-boundary anchor kernels, group kernels into forward passes and layers, and print timing tables you can use for Perfetto navigation or detailed timing analysis.

When To Use It

- when you need to know which layers contribute most to total time
- when the model has alternating layer types (e.g. models with `compress_ratios`, like DeepSeek-V4 NSA)
- when you need to compare cold-start vs steady-state forward passes
- when you need to navigate to a specific layer in Perfetto UI
- when you need to select representative layers for deep-dive analysis

Confirmation Required

Before running scripts, collect or verify these inputs:

Item	Why it matters	How to obtain	Default if user skips
Model name	Determines which `config.json` to use; affects layer classification	Ask user	— (required)
Model profile	Determines anchor kernel, blocks-per-layer, and kernel classification rules	Ask user or auto-infer from config	Auto-inferred from config
`config.json` path	Provides `compress_ratios`, `num_hidden_layers`, `num_hash_layers`, etc.	Ask user or search filesystem	— (required)
GPU type	Optional context for reports and hardware notes	Ask user	—
TP / EP	Parallelism config affects kernel naming and AllReduce count	Ask user or infer from trace filename (e.g. `TP-0`)	TP=8, EP=8
Serving mode	Decode vs prefill changes kernel mix and FLOPs profile	Ask user	decode B=1

If the user cannot provide `config.json`, search common locations such as `/root/workspace/*/config.json` and the HuggingFace cache. If it is still not available, require an explicit `--profile` argument.
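
The filesystem search described above can be sketched as follows. This is a minimal illustration, not part of the scripts: `find_config_candidates` is a hypothetical helper, and the HuggingFace pattern assumes the default hub cache layout under `$HF_HOME/hub` (falling back to `~/.cache/huggingface`).

```python
import glob
import os


def find_config_candidates(patterns=None):
    """Return sorted, de-duplicated config.json paths matching the glob patterns."""
    if patterns is None:
        # Assumption: default HuggingFace cache root unless HF_HOME overrides it.
        hf_home = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
        patterns = [
            "/root/workspace/*/config.json",
            os.path.join(hf_home, "hub", "models--*", "snapshots", "*", "config.json"),
        ]
    found = []
    for pattern in patterns:
        found.extend(glob.glob(pattern))
    return sorted(set(found))
```

If the returned list is empty, fall back to asking for an explicit `--profile`; if it has several entries, ask the user which model the trace belongs to rather than guessing.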

Model Profiles

Scripts use ModelProfile to determine layer boundary detection and kernel classification. Profiles are auto-inferred from `config.json` or selected via `--profile`:

Profile	Anchor kernel	Blocks/layer	Layer structure	Auto-infer condition
`dsv4_csa_hca`	`mhc_post_tilelang`	2	attn + ffn halves	`compress_ratios` non-empty
`dsv3_mla`	`flash_fwd_mla_combine`	1	full layer	`kv_lora_rank > 0`
`generic`	auto-detect or `--anchor-kernel`	1	full layer	fallback

Use `--profile generic --anchor-kernel YOUR_KERNEL` for models not covered by built-in profiles.
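
The auto-infer conditions in the table can be read as a short decision chain. The sketch below is an illustration under assumptions: `infer_profile` is a hypothetical name, and checking `compress_ratios` before `kv_lora_rank` is a guess about precedence that the table does not state.

```python
def infer_profile(config: dict) -> str:
    """Map config.json fields to a profile name, following the table's conditions."""
    # Assumption: the DSv4 condition is checked first.
    if config.get("compress_ratios"):  # present and non-empty
        return "dsv4_csa_hca"
    if (config.get("kv_lora_rank") or 0) > 0:
        return "dsv3_mla"
    return "generic"  # fallback; pair with --anchor-kernel if auto-detect fails
```

For example, `infer_profile({"kv_lora_rank": 512})` returns `"dsv3_mla"`, while an empty config falls through to `"generic"`.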

Prerequisites

- A `torch.profiler` trace in Chrome-trace JSON format (`.json` or `.json.gz`)
- The model's `config.json` (for profile inference, `compress_ratios`, etc.)
- The trace must contain a recognizable layer-boundary anchor kernel (auto-detected from the profile, or specified via `--anchor-kernel`)
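
To check the first and third prerequisites by hand, a trace can be loaded with the standard library alone. This is a rough sketch, not the scripts' loader: `load_kernel_events` is a hypothetical helper, and it assumes GPU kernels appear as complete events (`"ph": "X"`) whose `cat` field is `kernel`, which is how `torch.profiler` exports usually label them.

```python
import gzip
import json


def load_kernel_events(path):
    """Load a Chrome-trace file (.json or .json.gz) and return kernel events by start time."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        trace = json.load(f)
    # A Chrome trace is either {"traceEvents": [...]} or a bare list of events.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    kernels = [
        e for e in events
        if e.get("ph") == "X" and str(e.get("cat", "")).lower() == "kernel"
    ]
    return sorted(kernels, key=lambda e: e["ts"])


def count_anchor_hits(kernels, anchor):
    """Count kernels whose name contains the anchor substring."""
    return sum(1 for k in kernels if anchor in k.get("name", ""))
```

A count of zero for the expected anchor (for example `mhc_post_tilelang`) suggests the profile does not match the trace and `--anchor-kernel` is needed.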

Layer Boundary Detection

The scripts use an anchor kernel as a layer-boundary marker. The anchor and layer structure are determined by the active ModelProfile.

For example, with the `dsv4_csa_hca` profile, each transformer layer produces 2 consecutive `mhc_post_tilelang` calls:

mhc_post_tilelang  ← end of attn half (attention + O-proj + AllReduce)
  ... ffn computation ...
mhc_post_tilelang  ← end of ffn half (MoE experts + AllReduce)
  ... next layer attn ...
mhc_post_tilelang  ← next layer's attn boundary

So for N layers with the `dsv4_csa_hca` profile, one forward pass has `2N` anchor blocks. With `dsv3_mla` or `generic`, each layer has 1 block.

Forward pass `P` starts at block index `P * (N * blocks_per_layer)`.
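
The index arithmetic can be written out as a small helper. `block_range` is a hypothetical function for illustration; the pass offset follows the formula above, while the per-layer offset `layer * blocks_per_layer` is an assumption that layers occupy consecutive blocks within a pass.

```python
def block_range(fwd_pass, layer, num_layers, blocks_per_layer):
    """Return the half-open [start, end) anchor-block range for one layer of one pass."""
    blocks_per_pass = num_layers * blocks_per_layer
    pass_start = fwd_pass * blocks_per_pass         # P * (N * blocks_per_layer)
    start = pass_start + layer * blocks_per_layer   # assumed layout within the pass
    return start, start + blocks_per_layer
```

With 4 layers and 2 blocks per layer, pass 2 starts at block 16, so layer 1 of that pass covers blocks 18 and 19.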

Scripts

1. layer_timeline_analyzer.py: Per-layer timeline and cluster stats

bash

# Show all forward passes summary (cold-start vs steady-state)
python3 scripts/layer_timeline_analyzer.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --show-all-passes

# Detailed per-layer breakdown for a specific forward pass
python3 scripts/layer_timeline_analyzer.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5

# Auto-select first steady-state pass
python3 scripts/layer_timeline_analyzer.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json

The script prints:
- Per-layer wall-clock time, sum-duration, and category breakdown (MLA, MoE, GEMM, NCCL, MHC, Hadamard)
- Layer cluster statistics grouped by type (C4_LIGHT, C128_HEAVY, HASH, etc.)
- All-passes summary showing cold-start → steady-state growth
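
The difference between wall-clock time and sum-duration is easy to mix up, so here is a small sketch of one plausible definition. It assumes Chrome-trace events with `ts` and `dur` in microseconds; `wall_and_sum_ms` is a hypothetical helper and may not match the script's exact computation.

```python
def wall_and_sum_ms(kernels):
    """Return (wall_clock_ms, sum_duration_ms) for a non-empty list of kernel events."""
    start = min(k["ts"] for k in kernels)
    end = max(k["ts"] + k["dur"] for k in kernels)
    wall_ms = (end - start) / 1000.0                      # first start to last end
    sum_ms = sum(k["dur"] for k in kernels) / 1000.0      # total kernel busy time
    return wall_ms, sum_ms
```

Under this definition, wall-clock exceeds sum-duration when there are idle gaps between kernels, and sum-duration exceeds wall-clock when kernels overlap on different streams.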

2. layer_kernel_breakdown.py: Per-layer kernel detail and compute flow

bash

# Single layer kernel dump
python3 scripts/layer_kernel_breakdown.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layer 3

# Compute flow format (with model architecture summary and category column)
python3 scripts/layer_kernel_breakdown.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layer 3 --format compute-flow

# JSON export
python3 scripts/layer_kernel_breakdown.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layer 3 --format json

# Compare two layers side-by-side
python3 scripts/layer_kernel_breakdown.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layer 2 --compare-layer 3

Output formats:
- `--format text` (default): grouped summary + top hot kernels ranked by duration, with simplified names and percentages
- `--format compute-flow`: model architecture summary + per-kernel hotness table with `Category`, `%`, and `ts_rel(ms)` columns
- `--format json`: machine-readable per-kernel detail ranked by duration
- Kernel diff when comparing two layers (unique kernels in each)

3. perfetto_time_mapper.py: Perfetto UI time navigation

bash

# Show all forward pass time ranges in Perfetto
python3 scripts/perfetto_time_mapper.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json

# Layer-level time ranges for a specific forward pass
python3 scripts/perfetto_time_mapper.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layers 2,3,38,42

The script prints:
- Forward pass time ranges in Perfetto-relative seconds
- Per-layer start/end times with compress_ratio labels
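
"Perfetto-relative seconds" can be understood with a short conversion sketch. It assumes that Perfetto shows time relative to the earliest timestamp in the trace and that Chrome-trace `ts`/`dur` values are in microseconds; `time_range_seconds` is a hypothetical helper, not the script's API.

```python
def time_range_seconds(kernels, trace_start_us):
    """Convert a group of kernel events to a (start_s, end_s) range relative to trace start."""
    start_us = min(k["ts"] for k in kernels)
    end_us = max(k["ts"] + k["dur"] for k in kernels)
    return (
        (start_us - trace_start_us) / 1_000_000,
        (end_us - trace_start_us) / 1_000_000,
    )
```

Here `trace_start_us` would be the minimum `ts` over every event in the trace, not only the kernels of the selected layer.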

Workflow

Step 1: Identify steady-state forward pass

bash

python3 scripts/layer_timeline_analyzer.py \
  --trace $TRACE --config $CONFIG --show-all-passes

Read the "all-passes" table. The first pass is cold-start (few tokens). Find the first pass where layer-0 wall-clock stabilizes (typically pass 3-5).
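
One way to make "stabilizes" concrete is a relative-tolerance check over the layer-0 column of the all-passes table. The sketch below is a heuristic under assumptions, not the script's auto-selection logic: `first_steady_pass`, the 5% tolerance, and the two-pass look-ahead window are all illustrative choices.

```python
def first_steady_pass(layer0_wall_ms, rel_tol=0.05, window=2):
    """Return the first pass index whose layer-0 wall time stays within rel_tol
    of each of the next `window` passes, or None if no pass qualifies."""
    for i in range(len(layer0_wall_ms) - window):
        base = layer0_wall_ms[i]
        following = layer0_wall_ms[i + 1 : i + 1 + window]
        if base > 0 and all(abs(v - base) / base <= rel_tol for v in following):
            return i
    return None
```

For layer-0 times of `[0.2, 0.5, 0.9, 1.0, 1.01, 1.02]` ms this picks pass 3, which you would then pass to `--fwd-pass`.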

Step 2: Per-layer breakdown on steady-state pass

bash

python3 scripts/layer_timeline_analyzer.py \
  --trace $TRACE --config $CONFIG --fwd-pass 5

Identify:

- Which layer type dominates (C4_LIGHT vs C128_HEAVY vs HASH)
- The MLA / MoE / GEMM / NCCL proportion per layer type
- Which layer type is the best next target

Step 3: Compute flow for representative layer(s)

Select 1-2 representative layers (one per bottleneck type), then:

bash

# Human-readable compute flow table
python3 scripts/layer_kernel_breakdown.py \
  --trace $TRACE --config $CONFIG \
  --fwd-pass 5 --layer 3 --format compute-flow

# JSON export
python3 scripts/layer_kernel_breakdown.py \
  --trace $TRACE --config $CONFIG \
  --fwd-pass 5 --layer 3 --format json > /tmp/layer3_detail.json

The `--format compute-flow` output includes:
- Model architecture summary at the top
- Per-kernel hotness table with columns `# | Half | Category | Simplified Name | dur(us) | % | ts_rel(ms) | Input Dims`
- Rows are ranked by `dur(us)` descending by default; use `ts_rel(ms)` to jump back to the kernel's trace location.

Step 4: Compare layer types (optional)

bash

python3 scripts/layer_kernel_breakdown.py \
  --trace $TRACE --config $CONFIG \
  --fwd-pass 5 --layer 2 --compare-layer 3

This shows the exact kernel difference between the two layer types.
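
Conceptually, the kernel difference is a pair of set differences over kernel names. A minimal sketch, assuming kernels are compared by name only; `kernel_diff` is a hypothetical helper and the script may also compare counts or durations.

```python
def kernel_diff(names_a, names_b):
    """Return (only_in_a, only_in_b) as sorted lists of kernel names."""
    a, b = set(names_a), set(names_b)
    return sorted(a - b), sorted(b - a)
```

For instance, if layer 3 runs a `hadamard` kernel that layer 2 does not, it shows up in the second list, and that kernel is the first place to look when explaining the gap between the two layer types.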

Step 5: Navigate in Perfetto UI (optional)

bash

python3 scripts/perfetto_time_mapper.py \
  --trace $TRACE --config $CONFIG \
  --fwd-pass 5 --layers 2,3,38,42

Use the printed time ranges to navigate directly in Perfetto.

Layer Type Classification

The scripts classify layers based on `config.json` fields:

Config field	Value	Layer Type	Description
`compress_ratios[i]`	0	FULL_ATTN	No NSA compression (layers 0-1)
`compress_ratios[i]`	4	C4_LIGHT	C128 sparse attention, fastest
`compress_ratios[i]`	128	C128_HEAVY	C4 attention + Hadamard + Indexer, bottleneck
`i >= N - num_hash_layers`	—	HASH	Hash-table routing with paged MQA
`i == 0`	—	FIRST	First layer (empty KV cache)
`i == N - 1`	—	FINAL	Final layer (lm_head output)
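
Several rows of the table can apply to the same layer (layer 0 is both FULL_ATTN and FIRST, for example), and the table does not say which label wins. The sketch below sidesteps that by returning every matching label; `layer_labels` is a hypothetical helper and is not how the scripts necessarily resolve overlaps.

```python
RATIO_NAMES = {0: "FULL_ATTN", 4: "C4_LIGHT", 128: "C128_HEAVY"}


def layer_labels(i, config):
    """Return all table labels that match layer index i, in table order."""
    n = config["num_hidden_layers"]
    ratios = config.get("compress_ratios", [])
    labels = []
    if i < len(ratios) and ratios[i] in RATIO_NAMES:
        labels.append(RATIO_NAMES[ratios[i]])
    if i >= n - config.get("num_hash_layers", 0):
        labels.append("HASH")
    if i == 0:
        labels.append("FIRST")
    if i == n - 1:
        labels.append("FINAL")
    return labels
```

Counting the first label of each layer across all `i` gives the compress_ratios distribution asked for in the reporting checklist.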

Kernel Categories

Kernels are classified by the active ModelProfile's rules. Categories whose Profile column reads DSv4 are specific to the `dsv4_csa_hca` profile; all profiles include the universal categories.

Category	Match Pattern	Profile	Typical Share (DSv4)
★ MLA Attention	`flash_fwd_splitkv_mla`	DSv4, DSv3	21-33%
★ MoE Fused	`fused_moe_kernel`	DSv4, DSv3	11-17%
● NCCL AllReduce	`AllReduce`	universal	5-8%
GEMM fp8	`deep_gemm`	universal	12-25%
GEMM bf16	`nvjet`	universal	11-13%
Hadamard Xform	`hadamard`	DSv4	0-2.4%
Indexer Cache	`indexer`	DSv4	0-0.1%
Paged MQA	`paged_mqa_logits`	DSv4	0-1.8%
MHC	`mhc_pre_gemm_sqrsum`, `mhc_pre_big_fuse`, `mhc_post_tilelang`	DSv4	10-15%
C4/C128 Prefill	`c4_prefill`, `c128_prefill`	DSv4	0-0.3%
RMSNorm	`RMSNorm`, `rms_normalize`	universal	1-2%
FP8 Quant	`quant`, `Quant`	universal	1-2%
TopK	`topk`	universal	0-0.7%
RoPE	`deepseek_rope`, `fused_norm_rope`	DSv4, DSv3	1-2%
Activation	`silu_mul_clamp`, `act_and_mul`	universal	0-0.5%
Other	—	universal	2-5%
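
A pattern table like this is typically applied as ordered substring matching. The sketch below covers a subset of the rows and rests on two assumptions that the table does not confirm: patterns are plain substrings of the kernel name, and the first matching row wins. `CATEGORY_PATTERNS` and `categorize` are illustrative names.

```python
# Subset of the table above, in table order (assumed to be the match order).
CATEGORY_PATTERNS = [
    ("MLA Attention", ["flash_fwd_splitkv_mla"]),
    ("MoE Fused", ["fused_moe_kernel"]),
    ("NCCL AllReduce", ["AllReduce"]),
    ("GEMM fp8", ["deep_gemm"]),
    ("GEMM bf16", ["nvjet"]),
    ("Hadamard Xform", ["hadamard"]),
    ("MHC", ["mhc_pre_gemm_sqrsum", "mhc_pre_big_fuse", "mhc_post_tilelang"]),
    ("RMSNorm", ["RMSNorm", "rms_normalize"]),
    ("FP8 Quant", ["quant", "Quant"]),
    ("TopK", ["topk"]),
]


def categorize(kernel_name):
    """Return the first category whose pattern occurs in the kernel name."""
    for category, patterns in CATEGORY_PATTERNS:
        if any(p in kernel_name for p in patterns):
            return category
    return "Other"
```

Order matters with substring rules: a short pattern such as `quant` can match names that belong to an earlier, more specific row, so specific patterns should be tested first.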

Reporting Checklist

Include:

- Trace metadata: trace path, model config path, GPU type, TP/EP
- Model Architecture Summary (from `config.json`):
  - model name, num_layers, hidden_size, num_attention_heads, num_key_value_heads, head_dim
  - Attention type (e.g. csa_hca), Q/O LoRA ranks
  - MoE config: num_experts, topk, num_shared_experts, intermediate_size
  - MHC config (if applicable)
  - NSA config (if applicable): index_n_heads, index_head_dim, index_topk, qk_rope_head_dim, sliding_window
  - compress_ratios distribution (how many C4_LIGHT / C128_HEAVY / FULL_ATTN / HASH layers)
- Per-batch forward passes summary table (from `layer_timeline_analyzer.py --show-all-passes`):
  - Columns: Fwd#, Start(s), End(s), Duration(ms), Avg Layer(ms), First Layer(ms), Notes
  - Identifies cold-start vs steady-state passes
- Chosen forward pass: index and rationale (cold-start vs steady-state)
- Per-layer wall-clock and sum-duration table (from `layer_timeline_analyzer.py --fwd-pass N`):
  - Columns: L, c_r, Type, Wall(ms), SumDur(ms), MLA, MoE, GEMM, NCCL, MHC, Hadam, AR#, K#
  - Each row is one layer, with layer type label
- Layer cluster statistics table grouped by type:
  - Columns: Cluster, #, Avg Wall(ms), Avg Sum(ms), MLA%, MoE%, GEMM%, NCCL%, MHC%, Hadam%
  - Identifies bottleneck layer type and likely next target
- Compute Flow Table for selected representative layer(s):
  - Produced by `layer_kernel_breakdown.py --format compute-flow`
  - Columns: `# | Half | Category | Simplified Name | dur(us) | % | ts_rel(ms) | Input Dims`
  - Rows are sorted by top hot kernels (`dur(us)` descending) by default
  - Optional JSON export (`--format json`)
- Perfetto UI time ranges when requested
- One-line summary: bottleneck layer type and likely next target
