model-compute-simulation

Model Compute Simulation

Overview

Use this when the question is about operator order, tensor dimensions, FLOPs, MFU, or parallelism checks. The simulator loads a model config, builds the representative operator sequence, prints tensor shapes and FLOPs, and can estimate MFU from measured latency.

Confirmation Required

Before running a simulation, collect or verify these inputs:

| Item | Why it matters | How to obtain | Default if user skips |
| --- | --- | --- | --- |
| Model name | Resolves to a config in `model-config-index.json`; determines the entire architecture | Ask user or infer from trace context | None (required) |
| Config accuracy | Indexed values may differ from the actual serving config (e.g. `routed_expert_intermediate_size`, `compress_ratios`) | Ask user to provide `config.json` or verify key params against HuggingFace | Use indexed values with a caveat |
| GPU type | Determines peak FLOPS for the MFU denominator | Ask user | None (required for MFU) |
| dtype (bf16 / fp8) | Selects which peak FLOPS figure is used; the fp8 peak is treated as 2x the bf16 peak | Ask user | bf16 |
| Batch size & seq len | Directly affects FLOPs and tensor shapes | Ask user | B=1, S=1 (decode) |
| TP / DP / EP | TP splits GEMM FLOPs across GPUs; EP splits expert FLOPs | Ask user | TP=8, DP=1, EP=8 |
| Measured latency (ms) | Needed to compute achieved FLOPS (the MFU numerator); must be per-GPU forward-pass wall-clock time | Ask user or extract from a profiler trace | None (optional; no MFU without it) |

If the model is not in `model-config-index.json`, ask the user for a `config.json` path or add an indexed config before running estimates.
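The inputs and defaults from the table above can be captured in a small record before a run. This is a minimal sketch for illustration only: `SimulationInputs` and `missing_for_mfu` are hypothetical names and are not part of the simulator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimulationInputs:
    """Hypothetical container for the inputs listed in the table above."""
    model_name: str                      # required
    gpu: Optional[str] = None            # required only for MFU
    dtype: str = "bf16"                  # default when the user skips it
    batch_size: int = 1                  # default shape is decode (B=1, S=1)
    seq_len: int = 1
    tp: int = 8
    dp: int = 1
    ep: int = 8
    measured_ms: Optional[float] = None  # optional; no MFU without it


def missing_for_mfu(inputs: SimulationInputs) -> List[str]:
    """Return the names of inputs that still block an MFU estimate."""
    missing = []
    if inputs.gpu is None:
        missing.append("gpu")
    if inputs.measured_ms is None:
        missing.append("measured_ms")
    return missing


inputs = SimulationInputs(model_name="Qwen3-235B-A22B")
print(missing_for_mfu(inputs))
```

A check like this makes it explicit which questions still need to be put to the user before MFU can be reported.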

Workflow

Step 1: Load model config

Resolve the model name and load its configuration parameters:

```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "<model name>" --list-models
```

The script resolves the model name against `references/model-config-index.json`, which stores public HuggingFace config parameters (`hidden_size`, `num_experts`, MLA ranks, etc.).

If the model is not indexed, tell the user to provide a `config.json` path or request an index update.

Step 2: Generate execution flow and tensor dimensions

Run the simulator with batch size, sequence length, and parallelism configuration:

```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 1 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16
```

The simulator prints:

- Per-layer operator sequence with FLOPs and tensor shapes (`shape_in → shape_out`)
- Attention vs MoE/FFN FLOPs proportion per layer
- Total model FLOPs for a single forward pass

For decode, use `--seq-len 1`. For prefill, use `--seq-len <prompt_length>`.
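As a rough illustration of how batch size, sequence length, and TP enter the FLOPs count, the sketch below uses the standard 2·M·K·N count for a dense GEMM. The function names and the 4096×4096 projection are assumptions chosen for readability, and the even TP split is a simplification; the simulator's own accounting may differ.

```python
def gemm_flops(tokens: int, in_features: int, out_features: int) -> int:
    """FLOPs of a [tokens, in] x [in, out] matmul (one multiply and one add per MAC)."""
    return 2 * tokens * in_features * out_features


def per_gpu_gemm_flops(batch_size: int, seq_len: int,
                       in_features: int, out_features: int, tp: int) -> int:
    """Assumes TP shards one weight dimension evenly across `tp` GPUs."""
    tokens = batch_size * seq_len
    return gemm_flops(tokens, in_features, out_features) // tp


# Hypothetical projection size, not taken from any model config.
decode = per_gpu_gemm_flops(1, 1, 4096, 4096, tp=8)
prefill = per_gpu_gemm_flops(1, 8192, 4096, 4096, tp=8)
print(decode, prefill, prefill // decode)
```

Under these assumptions the prefill GEMM costs `seq_len` times the decode GEMM, which is why the same model can look very different at the two serving shapes.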

Step 3: Estimate MFU with measured latency

Provide the measured forward-pass latency to compute MFU:

```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 1 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --measured-ms 15.0
```

`MFU = theoretical_min_time / measured_time × 100%`

The simulator prints:

- Overall MFU
- Per-layer MFU (uniform layer-time assumption)
- Per-operator FLOPs proportion (for identifying which ops dominate)

GPU peak FLOPS are loaded from `references/gpu-specs.json`. The bundled hardware table includes H20, H100 SXM 80GB, H200 SXM 141GB, and B200 SXM 180GB. Use aliases such as `--gpu h100`, `--gpu h200`, or `--gpu b200` when running on that hardware.
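The MFU formula can be sketched in a few lines, assuming the theoretical minimum time is the per-GPU FLOPs divided by the GPU's peak FLOPS. The function names and both numeric constants below are placeholders for illustration; they are not real hardware specs or simulator output.

```python
def theoretical_min_ms(flops_per_gpu: float, peak_flops: float) -> float:
    """Time to execute `flops_per_gpu` at peak throughput, in milliseconds."""
    return flops_per_gpu / peak_flops * 1e3


def mfu_percent(flops_per_gpu: float, peak_flops: float, measured_ms: float) -> float:
    """MFU = theoretical_min_time / measured_time x 100%."""
    return theoretical_min_ms(flops_per_gpu, peak_flops) / measured_ms * 100.0


# Placeholder numbers only.
PEAK_FLOPS = 1.0e14      # hypothetical peak throughput, FLOP/s
FLOPS_PER_GPU = 3.0e11   # hypothetical per-GPU FLOPs for one forward pass
print(mfu_percent(FLOPS_PER_GPU, PEAK_FLOPS, measured_ms=15.0))
```

With these placeholder values the theoretical minimum is 3 ms, so a 15 ms measurement corresponds to roughly 20% MFU.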

Step 4: Per-operator MFU with kernel-level latency

When you have per-kernel measured latency, compute per-operator MFU by mapping kernel durations to the compute flow.

Method A: `--kernel-flow` (kernel-level MFU, recommended)

Provide per-kernel detail as JSON, then feed it to the simulator for kernel-level MFU analysis. This preserves every kernel row from the compute flow and adds FLOPs/MFU columns.

```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 8192 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --kernel-flow @/tmp/layer3_detail.json
```

The `--kernel-flow` parameter accepts a JSON string or an `@file` path. It produces a kernel-level MFU table that preserves all kernel rows from the compute flow and adds:

- `Mapped Op`: which operator this kernel maps to
- `FLOPs`: the operator's total FLOPs
- `Theo(us)`: theoretical minimum time
- `MFU%`: measured FLOPs utilization
- `shape_in→shape_out`: operator tensor dimensions

When `--kernel-flow` is provided, the static per-operator template is omitted because the kernel-level MFU table already carries per-kernel shape and FLOPs information. The output keeps the model summary, serving configuration, total FLOPs, and kernel-level MFU table.

Mapping rules:

- Direct-match kernels (mla, moe, mhc, rmsnorm, hadamard, rope, quant, topk, etc.): time is assigned directly to the corresponding operators
- Generic GEMM kernels (gemm_fp8, gemm_bf16): time is distributed to the remaining unassigned projection GEMM operators by FLOPs share
- Overhead kernels (allreduce, moe_align, moe_sort, other): rows are preserved, with FLOPs/MFU marked as N/A
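The FLOPs-share rule for generic GEMM kernels can be sketched as a proportional split. This is an illustrative reading of the rule, not the simulator's code: `distribute_by_flops`, the operator names, and all numbers are hypothetical.

```python
from typing import Dict


def distribute_by_flops(total_us: float, op_flops: Dict[str, float]) -> Dict[str, float]:
    """Split a measured duration across operators in proportion to their FLOPs."""
    total_flops = sum(op_flops.values())
    if total_flops == 0:
        return {name: 0.0 for name in op_flops}
    return {name: total_us * flops / total_flops for name, flops in op_flops.items()}


# Hypothetical unassigned projection GEMMs and a hypothetical generic-GEMM duration.
unassigned = {"q_proj": 3.0e9, "kv_proj": 1.0e9, "o_proj": 4.0e9}
print(distribute_by_flops(80.0, unassigned))
```

Because the split is purely proportional, every operator sharing the generic GEMM time ends up with the same implied throughput; that is the approximation this rule accepts.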

FP8 kernel MFU correction: Kernels in the categories `moe` (`fused_moe_kernel`) and `gemm_fp8` use fp8 math internally even when `--dtype bf16` is specified. For these kernels, the MFU denominator uses the GPU's fp8 peak FLOPS (2x the bf16 peak) instead of the bf16 peak. The resulting MFU is marked with a superscript `⁸` (for example, `63.7%⁸`) to show that the fp8 denominator was used. `gemm_bf16` kernels still use the bf16 peak FLOPS denominator.
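A minimal sketch of the fp8 denominator correction, assuming the fp8 peak is exactly twice the bf16 peak as stated above. The helper names and the example numbers are hypothetical and are not taken from the simulator.

```python
from typing import Tuple

# Categories the text above says run fp8 math regardless of --dtype.
FP8_CATEGORIES = {"moe", "gemm_fp8"}


def kernel_mfu(category: str, flops: float, dur_us: float,
               bf16_peak_flops: float) -> Tuple[float, bool]:
    """Return (mfu_percent, used_fp8_peak) for one kernel row."""
    use_fp8 = category in FP8_CATEGORIES
    peak = bf16_peak_flops * 2 if use_fp8 else bf16_peak_flops
    theo_us = flops / peak * 1e6
    return theo_us / dur_us * 100.0, use_fp8


def format_mfu(mfu: float, used_fp8: bool) -> str:
    """Append the superscript marker when the fp8 denominator was used."""
    return f"{mfu:.1f}%" + ("⁸" if used_fp8 else "")


# Placeholder peak and kernel numbers only.
BF16_PEAK = 1.0e14
print(format_mfu(*kernel_mfu("gemm_bf16", 1.0e10, 200.0, BF16_PEAK)))
print(format_mfu(*kernel_mfu("gemm_fp8", 1.0e10, 200.0, BF16_PEAK)))
```

For identical FLOPs and duration, the fp8 row reports half the MFU of the bf16 row, which is the intended effect of the larger denominator.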

Method B: `--kernel-detail` (operator-level MFU, legacy)

This method takes the same input as `--kernel-flow` but outputs an operator-level summary table (aggregated by operator, not per kernel). Use it when you want a compact view.

```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 8192 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --kernel-ms '{
    "mla": 4.922, "moe": 1.644, "allreduce": 0.769,
    "hadamard": 0.348, "mhc": 1.388, "gemm_fp8": 1.692,
    "gemm_bf16": 0.125, "rmsnorm": 0.227, "quant": 0.311,
    "rope": 0.209, "topk": 0.122, "activation": 0.071,
    "other": 0.437
  }'
```

The `--kernel-ms` parameter accepts a JSON object mapping kernel category names to their measured durations in milliseconds. It distributes time by FLOPs share across entire categories, which is less precise than `--kernel-detail` because the generic GEMM categories (`gemm_fp8`, `gemm_bf16`) span multiple operator categories.

Output includes:

- Model architecture summary (layers, hidden_size, attention_type, MoE config)
- Per-layer compute flow: operator sequence with tensor dimensions, FLOPs, shape_in→shape_out
- Per-operator MFU table: each operator's FLOPs, theoretical time, measured time (from trace), MFU%
- Kernel → operator mapping explanation (direct-match vs FLOPs-proportional vs overhead)
- Overall and per-layer MFU
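One way to produce a `--kernel-ms` object is to sum per-kernel durations by category. The sketch below assumes you already have `(category, dur_us)` rows from a trace; the helper name `to_kernel_ms` and the sample rows are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def to_kernel_ms(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum (category, dur_us) rows into a {category: milliseconds} dict."""
    totals_us = defaultdict(float)
    for category, dur_us in rows:
        totals_us[category] += dur_us
    return {cat: round(us / 1000.0, 3) for cat, us in totals_us.items()}


# Hypothetical per-kernel rows; real rows would come from a profiler trace.
rows = [("mla", 2400.0), ("mla", 2522.0), ("moe", 1644.0), ("allreduce", 769.0)]
print(json.dumps(to_kernel_ms(rows)))
```

The printed JSON string can then be passed as the `--kernel-ms` argument.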

When To Use It

- When you need compute-level detail for a known model or config
- When the user asks about execution flow, tensor dimensions, or FLOPs for a specific serving shape
- When the user asks about MFU and can provide measured forward-pass latency
- When comparing compute profiles across different parallelism configurations

Useful Commands

List known model IDs:

```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-models
```

List known GPU types:

```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-gpus
```

Emit JSON for automation:

```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "GLM-5" --format json
```
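For automation, the JSON output can be consumed from a small wrapper. The sketch below uses only flags that appear in this document, but it assumes that `--format json` can be combined with the shape and parallelism flags and that stdout contains nothing except the JSON document; neither is confirmed here. `build_command` and `run_simulator` are hypothetical helper names, and no output schema is assumed.

```python
import json
import subprocess

SCRIPT = "skills/model-compute-simulation/scripts/model_compute_simulator.py"


def build_command(model, batch_size=1, seq_len=1, tp=8, dp=1, ep=8):
    """Assemble the argv list using only flags shown in this document."""
    return [
        "python3", SCRIPT, model,
        "--batch-size", str(batch_size), "--seq-len", str(seq_len),
        "--tp", str(tp), "--dp", str(dp), "--ep", str(ep),
        "--format", "json",
    ]


def run_simulator(model, **kwargs):
    """Run the simulator and parse stdout as JSON (schema not assumed here)."""
    result = subprocess.run(build_command(model, **kwargs),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Only builds and prints the command; run_simulator() needs the repository checkout.
print(" ".join(build_command("GLM-5", seq_len=8192)))
```

Passing the arguments as a list avoids shell quoting problems with model names or JSON arguments.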

Reporting Checklist

Include:

- Model architecture summary: model name, config source, num_layers, hidden_size, attention_type, MoE config (num_experts, topk, shared_experts), MHC, head_dim
- Serving configuration: batch_size, seq_len, TP, DP, EP, GPU, dtype
- Per-layer compute flow (showing first representative layer in detail):
  - Operator sequence table: name, category, FLOPs, shape_in → shape_out
  - Attention vs MoE/FFN FLOPs proportion
- Total model FLOPs for a single forward pass
- Kernel-level MFU table (when `--kernel-flow` is provided):
  - Preserves ALL kernel rows from the compute flow (never deleted)
  - Per-kernel columns: `# | Half | Category | Simplified Name | dur(us) | % | Mapped Op | FLOPs | Theo(us) | MFU% | shape_in→shape_out`
  - Direct-match kernels: show mapped operator FLOPs/MFU
  - Overhead kernels: show N/A for FLOPs/MFU, row preserved
- Operator-level MFU table (when `--kernel-detail` or `--measured-ms` is provided):
  - Each operator: name, category, total FLOPs, per-GPU FLOPs, theoretical time, measured time (from trace), MFU%
  - Kernel category → operator mapping explained
- Overall MFU and per-layer MFU
- One-line summary: dominant compute category, MFU status, key bottleneck

Trace-Based Validation (extract_compute_flow_from_trace.py)

Use `scripts/extract_compute_flow_from_trace.py` to extract the real operator sequence and tensor dimensions from a torch profiler trace, then compare them against the static template for ground-truth validation.

```bash
# Extract compute flow from a trace
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
  --input /path/to/trace.json.gz --format text

# Compare trace against static template
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
  --input /path/to/trace.json.gz \
  --compare qwen3-235b-a22b \
  --batch-size 1 --seq-len 1 --tp 8 --ep 8
```
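To see what shape information a trace carries before running the extraction script, a quick inspection along these lines can help. It is a sketch that assumes the trace is a Chrome-trace JSON object with a top-level `traceEvents` list and that shape-recorded `cpu_op` events carry an `Input Dims` entry under `args`; the function names are hypothetical and this is not how the extraction script itself is implemented.

```python
import gzip
import json


def load_trace_events(path):
    """Load the `traceEvents` list from a (possibly gzipped) Chrome-trace JSON file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f).get("traceEvents", [])


def ops_with_shapes(events):
    """Yield (name, input_dims) for cpu_op events that carry `Input Dims`."""
    for ev in events:
        dims = ev.get("args", {}).get("Input Dims")
        if ev.get("cat") == "cpu_op" and dims:
            yield ev["name"], dims


# Tiny synthetic events standing in for a real trace.
events = [
    {"cat": "cpu_op", "name": "aten::mm", "args": {"Input Dims": [[1, 4096], [4096, 512]]}},
    {"cat": "cpu_op", "name": "aten::add", "args": {}},
    {"cat": "kernel", "name": "some_kernel", "args": {}},
]
print(list(ops_with_shapes(events)))
```

If this yields nothing for a real trace, the trace was probably captured without shape recording, and FLOPs cannot be derived from it.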

Compute Flow Confirmation Hierarchy

When the static template or trace extraction cannot fully confirm the compute process (e.g. ambiguous scope, missing shapes, new model architecture), follow this escalation hierarchy:

1. Static template (`model_compute_simulator.py` + `model-config-index.json`): fast, covers known models
2. Trace extraction (`extract_compute_flow_from_trace.py`): validates the template against real execution
3. Inference framework source code: when the trace is insufficient (missing `Input Dims`, CUDA Graph replay, compiled kernels without scope), read the model's forward flow directly from the serving framework source:
   - SGLang: `python/sglang/srt/models/<model_name>.py`, which contains the `forward()` method with the exact operator sequence, tensor shapes, and parallelism split logic
   - vLLM: `vllm/model_executor/models/<model_name>.py`
   - TensorRT-LLM: `cpp/tensorrt_llm/pyexecutor/py_executor.cpp` + model config files

When consulting framework source, focus on:

- The `forward()` method: operator call order and residual connections
- QKV / O projection: whether LoRA-style down/up projections are used (`q_lora_rank`, `o_lora_rank`)
- MoE routing: top-k selection, shared vs routed expert split
- TP/EP slicing: which dimensions are split and how FLOPs divide across GPUs
- Any model-specific ops not in the static template (e.g. MHC, Hadamard, indexer)

Action: If the framework source reveals discrepancies with the static template, update `model-config-index.json` and/or `build_layer_ops()` accordingly.

Limitations of Trace Extraction

| Limitation | Detail | Workaround |
| --- | --- | --- |
| `record_shapes=True` required | The trace must be captured with shape recording enabled; without it, `Input Dims` fields are absent and FLOPs cannot be computed | SGLang live capture and vLLM `torch_profiler_with_stack=true` already enable this; TensorRT-LLM requires a `py_executor.py` override adding `record_shapes=True` |
| CUDA Graph mode | During graph replay, `cpu_op` events may appear only once (at capture time); shape information for replayed iterations is not re-recorded | The script detects graph capture phases and annotates affected ops; use eager-mode traces for full coverage |
| TP-sliced dimensions | The trace shows post-TP-split dimensions (e.g. `H/TP`), not the full-model view | Use `--tp` in `--compare` mode to scale trace FLOPs back to full-model equivalents |
| Scope attribution quality | Python scope depends on `with_stack=True`; some frameworks or compiled paths may produce shallow or missing scope chains | Graceful degradation: ops with unresolved scope are categorized as "other" |
| Not a replacement for static templates | Trace extraction is a validation and discovery tool; static templates remain the primary fast-analysis path | Use trace extraction to verify templates for new models, then update `model-config-index.json` if discrepancies are found |

References

- `references/model-config-index.json`: model configuration parameters (hidden_size, expert counts, MLA ranks, etc.).
- `references/gpu-specs.json`: GPU peak FLOPS specifications for MFU calculation.
- `scripts/extract_compute_flow_from_trace.py`: trace-based compute flow extraction and template validation tool.
