model-compute-simulation
Model Compute Simulation
Overview
Use this when the question is about operator order, tensor dimensions, FLOPs,
MFU, or parallelism checks. The simulator loads a model config, builds the
representative operator sequence, prints tensor shapes and FLOPs, and can
estimate MFU from measured latency.
Confirmation Required
Before running a simulation, collect or verify these inputs:
| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Model name | Resolves to an indexed config | Ask user or infer from trace context | — (required) |
| Config accuracy | Indexed values may differ from the actual serving config | Ask user to provide the serving config | Use indexed values with a caveat |
| GPU type | Determines peak FLOPS for MFU denominator | Ask user | — (required for MFU) |
| dtype (bf16 / fp8) | Affects peak FLOPS selection; fp8 doubles peak | Ask user | bf16 |
| Batch size & seq len | Directly affects FLOPs and tensor shapes | Ask user | B=1, S=1 (decode) |
| TP / DP / EP | TP splits GEMM FLOPs across GPUs; EP splits expert FLOPs | Ask user | TP=8, DP=1, EP=8 |
| Measured latency (ms) | Required for MFU numerator; must be per-GPU forward-pass wall-clock | Ask user or extract from a profiler trace | — (optional, no MFU without it) |
If the model is not in `model-config-index.json`, ask the user for a
`config.json` path or add an indexed config before running estimates.
Workflow
Step 1: Load model config
Resolve the model name and load its configuration parameters:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "<model name>" --list-models
```
The script resolves the model name against `references/model-config-index.json`, which stores public HuggingFace config parameters (hidden_size, num_experts, MLA ranks, etc.).
If the model is not indexed, tell the user to provide a `config.json` path or request an index update.
Step 2: Generate execution flow and tensor dimensions
Run the simulator with batch size, sequence length, and parallelism configuration:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 1 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16
```
The simulator prints:
- Per-layer operator sequence with FLOPs and tensor shapes (shape_in → shape_out)
- Attention vs MoE/FFN FLOPs proportion per layer
- Total model FLOPs for a single forward pass
For decode: use `--seq-len 1`.
For prefill: use `--seq-len <prompt_length>`.
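
The difference between the decode and prefill shapes can be sanity-checked by hand. The sketch below assumes the usual 2·M·K·N FLOPs count for a dense GEMM and an even tensor-parallel split of the output dimension. The function name and the 4096 × 8192 projection are illustrative and not taken from the simulator or any model config.

```python
def gemm_flops(batch_size: int, seq_len: int, in_features: int,
               out_features: int, tp: int = 1) -> int:
    """Per-GPU FLOPs of one projection GEMM: 2 * tokens * K * (N / tp)."""
    if out_features % tp != 0:
        raise ValueError("out_features must be divisible by tp")
    tokens = batch_size * seq_len  # M dimension of the GEMM
    return 2 * tokens * in_features * (out_features // tp)


# Decode: one new token per sequence (hypothetical projection size).
decode = gemm_flops(batch_size=1, seq_len=1, in_features=4096,
                    out_features=8192, tp=8)
# Prefill: the whole prompt goes through the same GEMM at once.
prefill = gemm_flops(batch_size=1, seq_len=8192, in_features=4096,
                     out_features=8192, tp=8)
print(decode, prefill, prefill // decode)
```

Projection FLOPs scale linearly with the token count, so a prefill pass over 8192 tokens costs 8192 times the decode step for the same GEMM.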
Step 3: Estimate MFU with measured latency
Provide the measured forward-pass latency to compute MFU:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 1 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --measured-ms 15.0
```
MFU = theoretical_min_time / measured_time × 100%
The simulator prints:
- Overall MFU
- Per-layer MFU (uniform layer-time assumption)
- Per-operator FLOPs proportion (for identifying which ops dominate)
GPU peak FLOPS are loaded from `references/gpu-specs.json`. The bundled
hardware table includes H20, H100 SXM 80GB, H200 SXM 141GB, and B200 SXM
180GB. Use aliases such as `--gpu h100`, `--gpu h200`, or `--gpu b200` when
running on those local boxes.
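
The MFU formula above can be worked through in a few lines. This is a minimal sketch, assuming the theoretical minimum time is per-GPU FLOPs divided by the GPU's peak FLOPS. The function name is hypothetical, and the numbers are placeholders rather than values from `references/gpu-specs.json`.

```python
def mfu_percent(per_gpu_flops: float, peak_tflops: float,
                measured_ms: float) -> float:
    """MFU = theoretical_min_time / measured_time * 100."""
    if peak_tflops <= 0 or measured_ms <= 0:
        raise ValueError("peak_tflops and measured_ms must be positive")
    # 1 TFLOPS = 1e12 FLOP/s = 1e9 FLOP/ms
    theoretical_min_ms = per_gpu_flops / (peak_tflops * 1e9)
    return theoretical_min_ms / measured_ms * 100.0


# Placeholder inputs: 3e12 FLOPs per GPU, 100 TFLOPS peak, 60 ms measured.
print(f"{mfu_percent(3.0e12, 100.0, 60.0):.1f}%")
```

With these placeholder inputs the theoretical minimum is 30 ms, so a 60 ms measurement corresponds to 50% MFU.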
Step 4: Per-operator MFU with kernel-level latency
When you have per-kernel measured latency, compute per-operator MFU by mapping
kernel durations to the compute flow.
Method A: `--kernel-flow` (kernel-level MFU, recommended)
Provide per-kernel detail as JSON, then feed it to the simulator for
kernel-level MFU analysis. This preserves every kernel row from the compute
flow and adds FLOPs/MFU columns.
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 8192 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --kernel-flow @/tmp/layer3_detail.json
```
The `--kernel-flow` parameter accepts a JSON string or `@file` path. It produces
a kernel-level MFU table that preserves all kernel rows from the compute
flow and adds:
- `Mapped Op`: which operator this kernel maps to
- `FLOPs`: operator's total FLOPs
- `Theo(us)`: theoretical minimum time
- `MFU%`: measured FLOPs utilization
- `shape_in→shape_out`: operator tensor dimensions
When `--kernel-flow` is provided, the static per-operator template is omitted
because the kernel-level MFU table already carries per-kernel shape and FLOPs
information. The output keeps the model summary, serving configuration, total
FLOPs, and kernel-level MFU table.
Mapping rules:
- Direct-match kernels (mla, moe, mhc, rmsnorm, hadamard, rope, quant, topk, etc.): time is assigned directly to the corresponding operators
- Generic GEMM kernels (gemm_fp8, gemm_bf16): time is distributed to remaining unassigned projection GEMM operators by FLOPs share
- Overhead kernels (allreduce, moe_align, moe_sort, other): rows preserved, FLOPs/MFU marked as N/A
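
The FLOPs-share rule for generic GEMM kernels amounts to a proportional allocation. The sketch below only illustrates that idea: the operator names and numbers are hypothetical, and the simulator's actual implementation may differ.

```python
from typing import Dict


def distribute_by_flops(total_us: float,
                        op_flops: Dict[str, float]) -> Dict[str, float]:
    """Split a measured kernel duration across operators by FLOPs share."""
    total_flops = sum(op_flops.values())
    if total_flops <= 0:
        raise ValueError("operators must have positive total FLOPs")
    return {name: total_us * flops / total_flops
            for name, flops in op_flops.items()}


# Hypothetical projection GEMMs that no direct-match kernel has claimed yet.
unassigned = {"q_proj": 2.0e9, "o_proj": 1.0e9, "gate_proj": 1.0e9}
print(distribute_by_flops(total_us=400.0, op_flops=unassigned))
```

An operator with half of the remaining FLOPs receives half of the generic GEMM time, so every operator in the group ends up with the same MFU. That is the main reason this attribution is an approximation.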
FP8 kernel MFU correction: Kernels in categories `moe` (fused_moe_kernel)
and `gemm_fp8` use fp8 math internally even when `--dtype bf16` is specified.
For these kernels, the MFU denominator uses the GPU's fp8 peak FLOPS
(2x bf16 peak) instead of bf16 peak. The resulting MFU is marked with a
`⁸` superscript (for example, `63.7%⁸`) to show that the fp8 denominator was
used. `gemm_bf16` kernels still use the bf16 peak FLOPS denominator.
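
The denominator switch described above can be sketched as follows. The function name and the numeric inputs are assumptions for illustration. Only the category names, the 2x factor, and the `⁸` marker come from the text.

```python
FP8_CATEGORIES = {"moe", "gemm_fp8"}  # categories that use fp8 math internally


def kernel_mfu_label(category: str, flops: float, measured_us: float,
                     bf16_peak_tflops: float) -> str:
    """Format a kernel's MFU, using the fp8 peak (2x bf16) where it applies."""
    uses_fp8 = category in FP8_CATEGORIES
    peak_tflops = bf16_peak_tflops * (2.0 if uses_fp8 else 1.0)
    # 1 TFLOPS = 1e6 FLOP/us
    theoretical_us = flops / (peak_tflops * 1e6)
    mfu = theoretical_us / measured_us * 100.0
    return f"{mfu:.1f}%" + ("⁸" if uses_fp8 else "")


# Same FLOPs and duration, placeholder 100 TFLOPS bf16 peak.
print(kernel_mfu_label("gemm_bf16", 1.0e9, 20.0, 100.0))
print(kernel_mfu_label("gemm_fp8", 1.0e9, 20.0, 100.0))
```

For identical FLOPs and duration, the fp8-marked kernel reports half the MFU of the bf16 one because its denominator is twice as large.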
Method B: `--kernel-detail` (operator-level MFU, legacy)
Same input as `--kernel-flow` but outputs an operator-level summary table
(aggregated by operator, not per-kernel). Use when you want a compact view.
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 8192 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --kernel-ms '{
    "mla": 4.922, "moe": 1.644, "allreduce": 0.769,
    "hadamard": 0.348, "mhc": 1.388, "gemm_fp8": 1.692,
    "gemm_bf16": 0.125, "rmsnorm": 0.227, "quant": 0.311,
    "rope": 0.209, "topk": 0.122, "activation": 0.071,
    "other": 0.437
  }'
```
The `--kernel-ms` parameter accepts a JSON object mapping kernel category names
to their measured durations in milliseconds. It uses FLOPs-proportional
distribution across entire categories, which is less precise than
`--kernel-detail` because generic GEMM categories (gemm_fp8, gemm_bf16) span multiple operator categories.
Output includes:
- Model architecture summary (layers, hidden_size, attention_type, MoE config)
- Per-layer compute flow: operator sequence with tensor dimensions, FLOPs, shape_in→shape_out
- Per-operator MFU table: each operator's FLOPs, theoretical time, measured time (from trace), MFU%
- Kernel → operator mapping explanation (direct-match vs FLOPs-proportional vs overhead)
- Overall and per-layer MFU
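
The `--kernel-ms` argument shown above is a plain JSON object, so it can be assembled from per-kernel durations. The sketch below assumes some earlier step has already labelled each kernel with a category and a duration in microseconds. That `(category, duration_us)` input format and the function name are hypothetical, not something the scripts are known to emit.

```python
import json
from collections import defaultdict
from typing import Iterable, Tuple


def kernel_ms_json(kernels: Iterable[Tuple[str, float]]) -> str:
    """Sum per-kernel durations (us) into per-category totals (ms) as JSON."""
    totals = defaultdict(float)
    for category, duration_us in kernels:
        totals[category] += duration_us / 1000.0
    return json.dumps({name: round(ms, 3)
                       for name, ms in sorted(totals.items())})


# Hypothetical labelled kernels from one forward pass.
sample = [("mla", 4000.0), ("mla", 922.0), ("moe", 1644.0)]
print(kernel_ms_json(sample))
```

The resulting string can be passed as the value of `--kernel-ms`, quoted for the shell.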
When To Use It
- when you need compute-level detail for a known model or config
- when the user asks about execution flow, tensor dimensions, or FLOPs for a specific serving shape
- when the user asks about MFU and can provide measured forward-pass latency
- when comparing compute profiles across different parallelism configurations
Useful Commands
List known model IDs:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-models
```
List known GPU types:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-gpus
```
Emit JSON for automation:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "GLM-5" --format json
```
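
For automation it can help to build the command line programmatically. The sketch below only assembles the argument list from flags that appear in this document. It assumes `--format json` can be combined with the shape and parallelism flags, which has not been verified here. The structure of the JSON output is not documented in this file, so the sketch stops before parsing it.

```python
import shlex
from typing import List, Optional

SIMULATOR = "skills/model-compute-simulation/scripts/model_compute_simulator.py"


def build_command(model: str, batch_size: int = 1, seq_len: int = 1,
                  tp: int = 8, dp: int = 1, ep: int = 8,
                  gpu: str = "h20", dtype: str = "bf16",
                  measured_ms: Optional[float] = None,
                  fmt: Optional[str] = None) -> List[str]:
    """Assemble a simulator invocation as an argv list (no shell quoting needed)."""
    cmd = ["python3", SIMULATOR, model,
           "--batch-size", str(batch_size), "--seq-len", str(seq_len),
           "--tp", str(tp), "--dp", str(dp), "--ep", str(ep),
           "--gpu", gpu, "--dtype", dtype]
    if measured_ms is not None:
        cmd += ["--measured-ms", str(measured_ms)]
    if fmt is not None:
        cmd += ["--format", fmt]
    return cmd


cmd = build_command("GLM-5", fmt="json")
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(...) to execute it
```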
Reporting Checklist
Include:
- Model architecture summary: model name, config source, num_layers, hidden_size, attention_type, MoE config (num_experts, topk, shared_experts), MHC, head_dim
- Serving configuration: batch_size, seq_len, TP, DP, EP, GPU, dtype
- Per-layer compute flow (showing first representative layer in detail):
  - Operator sequence table: name, category, FLOPs, shape_in → shape_out
  - Attention vs MoE/FFN FLOPs proportion
- Total model FLOPs for a single forward pass
- Kernel-level MFU table (when `--kernel-flow` provided):
  - Preserves ALL kernel rows from the compute flow (never deleted)
  - Per-kernel columns: `# | Half | Category | Simplified Name | dur(us) | % | Mapped Op | FLOPs | Theo(us) | MFU% | shape_in→shape_out`
  - Direct-match kernels: show mapped operator FLOPs/MFU
  - Overhead kernels: show N/A for FLOPs/MFU, row preserved
- Operator-level MFU table (when `--kernel-detail` or `--measured-ms` provided):
  - Each operator: name, category, total FLOPs, per-GPU FLOPs, theoretical time, measured time (from trace), MFU%
  - Kernel category → operator mapping explained
- Overall MFU and per-layer MFU
- One-line summary: dominant compute category, MFU status, key bottleneck
Trace-Based Validation (extract_compute_flow_from_trace.py)
Use `scripts/extract_compute_flow_from_trace.py` to extract the real operator sequence and tensor dimensions from a torch profiler trace, then compare against the static template as ground-truth validation.
```bash
# Extract compute flow from a trace
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
  --input /path/to/trace.json.gz --format text

# Compare trace against static template
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
  --input /path/to/trace.json.gz \
  --compare qwen3-235b-a22b \
  --batch-size 1 --seq-len 1 --tp 8 --ep 8
```
Compute Flow Confirmation Hierarchy
When the static template or trace extraction cannot fully confirm the compute process (e.g. ambiguous scope, missing shapes, new model architecture), follow this escalation hierarchy:
- Static template (`model_compute_simulator.py` + `model-config-index.json`): fast, covers known models
- Trace extraction (`extract_compute_flow_from_trace.py`): validates template against real execution
- Inference framework source code: when the trace is insufficient (missing `Input Dims`, CUDA Graph replay, compiled kernels without scope), read the model's forward flow directly from the serving framework source:
  - SGLang: `python/sglang/srt/models/<model_name>.py`, which contains the `forward()` method with the exact operator sequence, tensor shapes, and parallelism split logic
  - vLLM: `vllm/model_executor/models/<model_name>.py`
  - TensorRT-LLM: `cpp/tensorrt_llm/pyexecutor/py_executor.cpp` + model config files
When consulting framework source, focus on:
- The `forward()` method: operator call order and residual connections
- QKV / O projection: whether LoRA-style down/up projections are used (`q_lora_rank`, `o_lora_rank`)
- MoE routing: top-k selection, shared vs routed expert split
- TP/EP slicing: which dimensions are split and how FLOPs divide across GPUs
- Any model-specific ops not in the static template (e.g. MHC, Hadamard, indexer)
Action: If the framework source reveals discrepancies with the static template, update `model-config-index.json` and/or `build_layer_ops()` accordingly.
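
Whether a projection uses a LoRA-style down/up pair changes its FLOPs count noticeably, which is why the projection check above matters. The sketch below assumes 2·M·K·N FLOPs per GEMM and ignores any normalization between the two GEMMs. The dimensions are illustrative and not taken from any model config.

```python
def dense_proj_flops(tokens: int, hidden: int, out_features: int) -> int:
    """Single GEMM: hidden -> out_features."""
    return 2 * tokens * hidden * out_features


def low_rank_proj_flops(tokens: int, hidden: int, rank: int,
                        out_features: int) -> int:
    """Down-projection hidden -> rank, then up-projection rank -> out_features."""
    return 2 * tokens * (hidden * rank + rank * out_features)


# Hypothetical sizes: hidden=4096, rank=512, out_features=8192, one token.
dense = dense_proj_flops(1, 4096, 8192)
low_rank = low_rank_proj_flops(1, 4096, 512, 8192)
print(dense, low_rank, round(dense / low_rank, 2))
```

If the static template models a low-rank projection as a single dense GEMM (or the reverse), the per-operator FLOPs and every MFU figure derived from them shift accordingly.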
Limitations of Trace Extraction
| Limitation | Detail | Workaround |
|---|---|---|
| Requires shape recording | Trace must be captured with shape recording enabled; without it, | SGLang live capture and vLLM |
| CUDA Graph mode | During graph replay, | The script detects graph capture phases and annotates affected ops; use eager-mode traces for full coverage |
| TP-sliced dimensions | Trace shows post-TP-split dimensions (e.g. | Use |
| Scope attribution quality | Python scope depends on | Graceful degradation: ops with unresolved scope are categorized as "other" |
| Not a replacement for static templates | Trace extraction is a validation and discovery tool; static templates remain the primary fast-analysis path | Use trace extraction to verify templates for new models, then update |
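
Before running the extraction script, it can be worth checking whether a trace carries shape information at all. The sketch below assumes the trace is a Chrome-trace-style JSON export with a top-level `traceEvents` list, operator events tagged `"cat": "cpu_op"`, and shapes stored under `args["Input Dims"]`. These field names reflect common torch profiler exports and may vary between versions. The helper names are hypothetical.

```python
import gzip
import json
from typing import Any, Dict, List, Tuple


def load_trace_events(path: str) -> List[Dict[str, Any]]:
    """Read a (possibly gzip-compressed) trace and return its event list."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        trace = json.load(fh)
    return trace.get("traceEvents", []) if isinstance(trace, dict) else trace


def count_shape_events(events: List[Dict[str, Any]]) -> Tuple[int, int]:
    """Return (cpu ops carrying 'Input Dims', total cpu ops)."""
    cpu_ops = [e for e in events
               if isinstance(e, dict) and e.get("cat") == "cpu_op"]
    with_shapes = [e for e in cpu_ops if "Input Dims" in e.get("args", {})]
    return len(with_shapes), len(cpu_ops)


# Usage (path is a placeholder):
#   shaped, total = count_shape_events(load_trace_events("/path/to/trace.json.gz"))
#   print(f"{shaped}/{total} cpu ops carry Input Dims")
```

A count of zero suggests the trace was captured without shape recording, so tensor dimensions cannot be recovered from it.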
References
- `references/model-config-index.json`: model configuration parameters (hidden_size, expert counts, MLA ranks, etc.).
- `references/gpu-specs.json`: GPU peak FLOPS specifications for MFU calculation.
- `scripts/extract_compute_flow_from_trace.py`: trace-based compute flow extraction and template validation tool.