llm-pipeline-analysis
LLM Pipeline Analysis
Overview
Use this when a whole-trace profiler summary is too coarse. The scripts read a
Chrome-trace JSON file, find layer-boundary anchor kernels, group kernels into
forward passes and layers, and print timing tables you can use for Perfetto
navigation or detailed timing analysis.
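As a rough illustration of the first step described above (reading the trace and isolating kernels), the following is a minimal sketch and not the scripts' actual code. The function names are hypothetical, and it assumes GPU kernels appear as complete (`"ph": "X"`) events with `"cat": "kernel"`, which is how `torch.profiler` Chrome-trace exports usually label them.

```python
import gzip
import json


def load_trace_events(path):
    """Load the event list from a Chrome-trace file (.json or .json.gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    # The format allows either {"traceEvents": [...]} or a bare event list.
    return data["traceEvents"] if isinstance(data, dict) else data


def kernel_events(events):
    """Keep complete ('X') events tagged as kernels, ordered by start time."""
    # Assumption: GPU kernels carry cat == "kernel" in the exported trace.
    kernels = [e for e in events if e.get("ph") == "X" and e.get("cat") == "kernel"]
    return sorted(kernels, key=lambda e: e["ts"])
```

Sorting by `ts` matters because later grouping into passes and layers relies on kernels being in launch order.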
When To Use It
- when you need to know which layers contribute most to total time
- when the model has alternating layer types (e.g. models with `compress_ratios`, like DeepSeek-V4 NSA)
- when you need to compare cold-start vs steady-state forward passes
- when you need to navigate to a specific layer in Perfetto UI
- when you need to select representative layers for deep-dive analysis
Confirmation Required
Before running scripts, collect or verify these inputs:
| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Model name | Determines which | Ask user | — (required) |
| Model profile | Determines anchor kernel, blocks-per-layer, and kernel classification rules | Ask user or auto-infer from config | Auto-inferred from config |
| `config.json` | Provides the fields used for profile inference, `compress_ratios`, etc. | Ask user or search filesystem | — (required) |
| GPU type | Optional context for reports and hardware notes | Ask user | — |
| TP / EP | Parallelism config affects kernel naming and AllReduce count | Ask user or infer from trace filename | TP=8, EP=8 |
| Serving mode | Decode vs prefill changes kernel mix and FLOPs profile | Ask user | decode B=1 |
If the user cannot provide `config.json`, search common locations such as
`/root/workspace/*/config.json` and the HuggingFace cache. If it is still not
available, require an explicit `--profile`.
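The fallback search for `config.json` could look roughly like the sketch below. The function name is hypothetical, and the HuggingFace pattern assumes the default cache layout under `~/.cache/huggingface/hub`; a cache relocated through environment variables would need a different pattern.

```python
import glob
import os


def find_config_candidates(extra_patterns=()):
    """Return candidate config.json paths from common locations, without duplicates."""
    patterns = [
        "/root/workspace/*/config.json",
        # Assumption: default HuggingFace hub cache layout.
        os.path.expanduser(
            "~/.cache/huggingface/hub/models--*/snapshots/*/config.json"
        ),
        *extra_patterns,
    ]
    found = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            if path not in found:
                found.append(path)
    return found
```

If the list comes back empty, that is the point at which an explicit `--profile` is needed.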
Model Profiles
Scripts use ModelProfile to determine layer boundary detection and kernel
classification. Profiles are auto-inferred from `config.json` or selected
via `--profile`:
| Profile | Anchor kernel | Blocks/layer | Layer structure | Auto-infer condition |
|---|---|---|---|---|
| `dsv4_csa_hca` | `mhc_post_tilelang` | 2 | attn + ffn halves | |
| `dsv3_mla` | | 1 | full layer | |
| `generic` | auto-detect or `--anchor-kernel` | 1 | full layer | fallback |
Use `--profile generic --anchor-kernel YOUR_KERNEL` for models not covered
by built-in profiles.
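To make the role of a profile concrete, here is a hedged sketch of how such a record and its fallback might be modelled. `ProfileSketch`, `PROFILES`, and `resolve_profile` are hypothetical names, not the scripts' real `ModelProfile` API, and only fields stated in this section are included (`dsv3_mla` is omitted because its anchor kernel is not given here).

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ProfileSketch:
    name: str
    anchor_kernel: Optional[str]  # None: auto-detect or pass --anchor-kernel
    blocks_per_layer: int


# Hypothetical registry holding only values stated in this document.
PROFILES = {
    "dsv4_csa_hca": ProfileSketch("dsv4_csa_hca", "mhc_post_tilelang", 2),
    "generic": ProfileSketch("generic", None, 1),
}


def resolve_profile(name, anchor_override=None):
    """Look up a profile, falling back to 'generic' for unknown names."""
    profile = PROFILES.get(name, PROFILES["generic"])
    if anchor_override is not None:
        # Mirrors the idea of --anchor-kernel overriding the built-in anchor.
        profile = replace(profile, anchor_kernel=anchor_override)
    return profile
```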
Prerequisites
- A trace in Chrome-trace JSON format (`torch.profiler` `.json` or `.json.gz`)
- The model's `config.json` (for profile inference, `compress_ratios`, etc.)
- The trace must contain a recognizable layer-boundary anchor kernel
  (auto-detected from the profile, or specified via `--anchor-kernel`)
Layer Boundary Detection
The scripts use an anchor kernel as a layer-boundary marker. The anchor and
layer structure are determined by the active ModelProfile.
For example, with the `dsv4_csa_hca` profile, each transformer layer produces
2 consecutive `mhc_post_tilelang` calls:
```
mhc_post_tilelang   ← end of attn half (attention + O-proj + AllReduce)
... ffn computation ...
mhc_post_tilelang   ← end of ffn half (MoE experts + AllReduce)
... next layer attn ...
mhc_post_tilelang   ← next layer's attn boundary
```
So for N layers with the `dsv4_csa_hca` profile, one forward pass has `2N`
anchor blocks. With `dsv3_mla` or `generic`, each layer has 1 block.
Forward pass `P` starts at block index `P * (N * blocks_per_layer)`.
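The index arithmetic above can be sketched as follows. The helper names are hypothetical, indices are assumed to be 0-based, and ranges are half-open; the per-layer helper additionally assumes that a layer's blocks are contiguous within its pass.

```python
def pass_block_range(pass_index, num_layers, blocks_per_layer):
    """Half-open [start, end) range of anchor-block indices for one forward pass."""
    per_pass = num_layers * blocks_per_layer
    start = pass_index * per_pass  # P * (N * blocks_per_layer)
    return start, start + per_pass


def layer_block_range(pass_index, layer_index, num_layers, blocks_per_layer):
    """Half-open block range for one layer inside a forward pass."""
    start, _ = pass_block_range(pass_index, num_layers, blocks_per_layer)
    first = start + layer_index * blocks_per_layer
    return first, first + blocks_per_layer
```

For a 4-layer model with 2 blocks per layer, pass 3 covers blocks 24 to 31, and layer 1 of that pass covers blocks 26 and 27.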
Scripts
1. `layer_timeline_analyzer.py` — Per-layer timeline and cluster stats
```bash
# Show all forward passes summary (cold-start vs steady-state)
python3 scripts/layer_timeline_analyzer.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --show-all-passes

# Detailed per-layer breakdown for a specific forward pass
python3 scripts/layer_timeline_analyzer.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5

# Auto-select first steady-state pass
python3 scripts/layer_timeline_analyzer.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json
```
The script prints:
- Per-layer wall-clock time, sum-duration, and category breakdown (MLA, MoE, GEMM, NCCL, MHC, Hadamard)
- Layer cluster statistics grouped by type (C4_LIGHT, C128_HEAVY, HASH, etc.)
- All-passes summary showing cold-start → steady-state growth
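The difference between wall-clock time and sum-duration is easy to misread, so here is a small sketch of one plausible definition. It assumes each kernel is a Chrome-trace event with `ts` and `dur` in microseconds; how the script itself computes these columns is not stated here, so treat this as an illustration only.

```python
def wall_and_sum_ms(kernels):
    """Return (wall_clock_ms, sum_duration_ms) for a group of kernel events."""
    if not kernels:
        return 0.0, 0.0
    start = min(k["ts"] for k in kernels)
    end = max(k["ts"] + k["dur"] for k in kernels)
    total = sum(k["dur"] for k in kernels)
    # Chrome-trace timestamps are in microseconds; convert to milliseconds.
    return (end - start) / 1000.0, total / 1000.0
```

When kernels overlap on different streams, the sum exceeds the wall-clock span; when there are idle gaps between kernels, the wall-clock span exceeds the sum.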
2. `layer_kernel_breakdown.py` — Per-layer kernel detail and compute flow
```bash
# Single layer kernel dump
python3 scripts/layer_kernel_breakdown.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layer 3

# Compute flow format (with model architecture summary and category column)
python3 scripts/layer_kernel_breakdown.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layer 3 --format compute-flow

# JSON export
python3 scripts/layer_kernel_breakdown.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layer 3 --format json

# Compare two layers side-by-side
python3 scripts/layer_kernel_breakdown.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layer 2 --compare-layer 3
```
Output formats:
- `--format text` (default): grouped summary + top hot kernels ranked by duration, with simplified names and percentages
- `--format compute-flow`: model architecture summary + per-kernel hotness table with `Category`, `%`, and `ts_rel(ms)` columns
- `--format json`: machine-readable per-kernel detail ranked by duration
- Kernel diff when comparing two layers (unique kernels in each)
3. `perfetto_time_mapper.py` — Perfetto UI time navigation
```bash
# Show all forward pass time ranges in Perfetto
python3 scripts/perfetto_time_mapper.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json

# Layer-level time ranges for a specific forward pass
python3 scripts/perfetto_time_mapper.py \
  --trace /path/to/TP-0.trace.json.gz \
  --config /path/to/config.json \
  --fwd-pass 5 --layers 2,3,38,42
```
The script prints:
- Forward pass time ranges in Perfetto-relative seconds
- Per-layer start/end times with compress_ratio labels
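"Perfetto-relative seconds" can be pictured with the sketch below. It assumes the Perfetto timeline is offset from the earliest timestamp in the trace and that event times are in microseconds; the function name and the `trace_start_us` parameter are hypothetical, and the script's exact reference point may differ.

```python
def relative_range_seconds(kernels, trace_start_us):
    """Map a group of kernel events to (start_s, end_s) relative to trace start."""
    start = min(k["ts"] for k in kernels)
    end = max(k["ts"] + k["dur"] for k in kernels)
    # Assumption: Perfetto's timeline zero is the earliest event in the trace.
    return (start - trace_start_us) / 1e6, (end - trace_start_us) / 1e6
```

The resulting pair is the kind of range you would type into Perfetto to jump to a forward pass or layer.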
Workflow
Step 1: Identify steady-state forward pass
```bash
python3 scripts/layer_timeline_analyzer.py \
  --trace $TRACE --config $CONFIG --show-all-passes
```
Read the "all-passes" table. The first pass is cold-start (few tokens).
Find the first pass where layer-0 wall-clock stabilizes (typically pass 3-5).
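If you want to pick the steady-state pass programmatically rather than by eye, one possible heuristic is sketched below. The function, the 5% tolerance, and the two-pass look-ahead window are all assumptions for illustration, not the rule the script uses for auto-selection.

```python
def first_steady_pass(layer0_ms, rel_tol=0.05, window=2):
    """Return the first pass index whose layer-0 time matches the next `window` passes.

    layer0_ms: layer-0 wall-clock time per forward pass, in pass order.
    Returns None when no pass satisfies the tolerance.
    """
    for i in range(len(layer0_ms) - window):
        ref = layer0_ms[i]
        if ref <= 0:
            continue
        following = layer0_ms[i + 1 : i + 1 + window]
        if all(abs(v - ref) / ref <= rel_tol for v in following):
            return i
    return None
```

A tighter tolerance or a longer window makes the choice more conservative, at the cost of needing more passes in the trace.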
Step 2: Per-layer breakdown on steady-state pass
```bash
python3 scripts/layer_timeline_analyzer.py \
  --trace $TRACE --config $CONFIG --fwd-pass 5
```
Identify:
- Which layer type dominates (C4_LIGHT vs C128_HEAVY vs HASH)
- The MLA / MoE / GEMM / NCCL proportion per layer type
- Which layer type is the best next target
Step 3: Compute flow for representative layer(s)
Select 1-2 representative layers (one per bottleneck type), then:
```bash
# Human-readable compute flow table
python3 scripts/layer_kernel_breakdown.py \
  --trace $TRACE --config $CONFIG \
  --fwd-pass 5 --layer 3 --format compute-flow

# JSON export
python3 scripts/layer_kernel_breakdown.py \
  --trace $TRACE --config $CONFIG \
  --fwd-pass 5 --layer 3 --format json > /tmp/layer3_detail.json
```
The `--format compute-flow` output includes:
- Model architecture summary at the top
- Per-kernel hotness table with `# | Half | Category | Simplified Name | dur(us) | % | ts_rel(ms) | Input Dims`
- Rows are ranked by `dur(us)` descending by default; use `ts_rel(ms)` to jump back to the kernel's trace location.
Step 4: Compare layer types (optional)
```bash
python3 scripts/layer_kernel_breakdown.py \
  --trace $TRACE --config $CONFIG \
  --fwd-pass 5 --layer 2 --compare-layer 3
```
This shows the exact kernel difference between the two layer types.
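Conceptually, the kernel diff is a pair of set differences over kernel names. The sketch below illustrates that idea only; the function name is hypothetical and the script's real output format is not reproduced here.

```python
def kernel_diff(names_a, names_b):
    """Return (only_in_a, only_in_b) for two layers' kernel-name lists."""
    set_a, set_b = set(names_a), set(names_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)
```

Kernels that appear in both layers drop out, so what remains points at the work that distinguishes one layer type from the other.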
Step 5: Navigate in Perfetto UI (optional)
```bash
python3 scripts/perfetto_time_mapper.py \
  --trace $TRACE --config $CONFIG \
  --fwd-pass 5 --layers 2,3,38,42
```
Use the printed time ranges to navigate directly in Perfetto.
Layer Type Classification
The scripts classify layers based on `config.json` fields:
| Config field | Value | Layer Type | Description |
|---|---|---|---|
| `compress_ratios` | 0 | FULL_ATTN | No NSA compression (layers 0-1) |
| `compress_ratios` | 4 | C4_LIGHT | C128 sparse attention, fastest |
| `compress_ratios` | 128 | C128_HEAVY | C4 attention + Hadamard + Indexer, bottleneck |
| | — | HASH | Hash-table routing with paged MQA |
| | — | FIRST | First layer (empty KV cache) |
| | — | FINAL | Final layer (lm_head output) |
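The numeric rows of the table can be sketched as a simple lookup. This assumes `compress_ratios` is a per-layer list in `config.json`; the names below are hypothetical, and the rules for HASH, FIRST, and FINAL are not given in this section, so they are deliberately left out.

```python
from collections import Counter

# Only the value-based rows from the table; other types need rules not shown here.
RATIO_TO_TYPE = {0: "FULL_ATTN", 4: "C4_LIGHT", 128: "C128_HEAVY"}


def classify_by_ratio(compress_ratios):
    """Map each layer's compress ratio to a layer-type label."""
    return [RATIO_TO_TYPE.get(r, "UNKNOWN") for r in compress_ratios]


def type_distribution(compress_ratios):
    """Count layers per type, e.g. for the compress_ratios distribution in a report."""
    return Counter(classify_by_ratio(compress_ratios))
```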
Kernel Categories
Kernels are classified by the active ModelProfile's rules. Categories marked
with (DSv4) are specific to the `dsv4_csa_hca` profile; all profiles include
the universal categories.
| Category | Match Pattern | Profile | Typical Share (DSv4) |
|---|---|---|---|
| ★ MLA Attention | | DSv4, DSv3 | 21-33% |
| ★ MoE Fused | | DSv4, DSv3 | 11-17% |
| ● NCCL AllReduce | | universal | 5-8% |
| GEMM fp8 | | universal | 12-25% |
| GEMM bf16 | | universal | 11-13% |
| Hadamard Xform | | DSv4 | 0-2.4% |
| Indexer Cache | | DSv4 | 0-0.1% |
| Paged MQA | | DSv4 | 0-1.8% |
| MHC | | DSv4 | 10-15% |
| C4/C128 Prefill | | DSv4 | 0-0.3% |
| RMSNorm | | universal | 1-2% |
| FP8 Quant | | universal | 1-2% |
| TopK | | universal | 0-0.7% |
| RoPE | | DSv4, DSv3 | 1-2% |
| Activation | | universal | 0-0.5% |
| Other | — | universal | 2-5% |
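Once each kernel has a category, a share column like the one above can be derived by summing durations per category. The sketch below assumes the share is a percentage of total kernel duration within the group being analyzed; that denominator is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict


def category_shares(kernels):
    """Return {category: percent of total duration}, largest share first.

    kernels: iterable of (category, duration) pairs.
    """
    totals = defaultdict(float)
    for category, dur in kernels:
        totals[category] += dur
    grand = sum(totals.values())
    if grand == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {category: 100.0 * total / grand for category, total in ranked}
```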
Reporting Checklist
Include:
- Trace metadata: trace path, model config path, GPU type, TP/EP
- Model Architecture Summary (from `config.json`):
  - model name, num_layers, hidden_size, num_attention_heads, num_key_value_heads, head_dim
  - Attention type (e.g. csa_hca), Q/O LoRA ranks
  - MoE config: num_experts, topk, num_shared_experts, intermediate_size
  - MHC config (if applicable)
  - NSA config (if applicable): index_n_heads, index_head_dim, index_topk, qk_rope_head_dim, sliding_window
  - compress_ratios distribution (how many C4_LIGHT / C128_HEAVY / FULL_ATTN / HASH layers)
- Per-batch forward passes summary table (from `layer_timeline_analyzer.py --show-all-passes`):
  - Columns: Fwd#, Start(s), End(s), Duration(ms), Avg Layer(ms), First Layer(ms), Notes
  - Identifies cold-start vs steady-state passes
- Chosen forward pass: index and rationale (cold-start vs steady-state)
- Per-layer wall-clock and sum-duration table (from `layer_timeline_analyzer.py --fwd-pass N`):
  - Columns: L, c_r, Type, Wall(ms), SumDur(ms), MLA, MoE, GEMM, NCCL, MHC, Hadam, AR#, K#
  - Each row is one layer, with layer type label
- Layer cluster statistics table grouped by type:
  - Columns: Cluster, #, Avg Wall(ms), Avg Sum(ms), MLA%, MoE%, GEMM%, NCCL%, MHC%, Hadam%
  - Identifies bottleneck layer type and likely next target
- Compute Flow Table for selected representative layer(s):
  - Produced by `layer_kernel_breakdown.py --format compute-flow`
  - Columns: `# | Half | Category | Simplified Name | dur(us) | % | ts_rel(ms) | Input Dims`
  - Rows are sorted by top hot kernels (`dur(us)` descending) by default
  - Optional JSON export (`--format json`)
- Perfetto UI time ranges when requested
- One-line summary: bottleneck layer type and likely next target