Model Compute Simulation
Overview
Use this when the question is about operator order, tensor dimensions, FLOPs,
MFU, or parallelism checks. The simulator loads a model config, builds the
representative operator sequence, prints tensor shapes and FLOPs, and can
estimate MFU from measured latency.
Confirmation Required
Before running a simulation, collect or verify these inputs:
| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Model name | Resolves to a config in `references/model-config-index.json`; determines the entire architecture | Ask user or infer from trace context | None (required) |
| Config accuracy | Indexed values may differ from the actual serving config (e.g. routed_expert_intermediate_size) | Ask user to provide the serving config or verify key params against HuggingFace | Use indexed values with a caveat |
| GPU type | Determines peak FLOPS for the MFU denominator | Ask user | None (required for MFU) |
| dtype (bf16 / fp8) | Affects peak FLOPS selection; fp8 doubles peak | Ask user | bf16 |
| Batch size & seq len | Directly affect FLOPs and tensor shapes | Ask user | B=1, S=1 (decode) |
| TP / DP / EP | TP splits GEMM FLOPs across GPUs; EP splits expert FLOPs | Ask user | TP=8, DP=1, EP=8 |
| Measured latency (ms) | Sets the achieved FLOPS (MFU numerator); must be per-GPU forward-pass wall-clock | Ask user or extract from a profiler trace | None (optional; no MFU without it) |
If the model is not in `references/model-config-index.json`, ask the user for a config path or add an indexed config before running estimates.
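The inputs and defaults in the table above can be captured in a small container before any command is built. This is only a sketch: `SimulationInputs` and `missing_for_mfu` are hypothetical names for illustration and are not part of the simulator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimulationInputs:
    """Hypothetical container mirroring the confirmation table (not a simulator API)."""
    model: str                          # required: resolves to an indexed config
    gpu: Optional[str] = None           # required only when MFU is requested
    dtype: str = "bf16"                 # default if the user skips
    batch_size: int = 1                 # B=1
    seq_len: int = 1                    # S=1 means decode
    tp: int = 8
    dp: int = 1
    ep: int = 8
    measured_ms: Optional[float] = None  # optional; no MFU without it

    def missing_for_mfu(self) -> List[str]:
        """Return the inputs that still have to be collected before MFU can be estimated."""
        missing = []
        if self.gpu is None:
            missing.append("gpu")
        if self.measured_ms is None:
            missing.append("measured_ms")
        return missing


inputs = SimulationInputs(model="Qwen3-235B-A22B")
print(inputs.missing_for_mfu())
```

A non-empty result is a cue to go back to the user before running Step 3.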
Workflow
Step 1: Load model config
Resolve the model name and load its configuration parameters:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "<model name>" --list-models
```
The script resolves the model name against `references/model-config-index.json`, which stores public HuggingFace config parameters (hidden_size, num_experts, MLA ranks, etc.).
If the model is not indexed, ask the user to provide a config path or request an index update.
Step 2: Generate execution flow and tensor dimensions
Run the simulator with batch size, sequence length, and parallelism configuration:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 1 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16
```
The simulator prints:
- Per-layer operator sequence with FLOPs and tensor shapes (shape_in → shape_out)
- Attention vs MoE/FFN FLOPs proportion per layer
- Total model FLOPs for a single forward pass
For decode, use `--seq-len 1`.
For prefill, use `--seq-len <prompt_length>`.
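To see why the sequence length matters, the sketch below counts FLOPs for one dense projection with the standard 2·M·K·N rule, where M is the number of tokens (batch size × sequence length). The function name and the 4096 dimensions are placeholders, not values from any indexed config. The even split across TP ranks is an assumption; the simulator's own accounting may differ.

```python
def gemm_flops_per_gpu(tokens: int, in_features: int, out_features: int, tp: int = 1) -> float:
    """FLOPs of a [tokens, in] x [in, out] matmul, assumed to split evenly over TP ranks."""
    return 2 * tokens * in_features * out_features / tp


HIDDEN = 4096  # placeholder dimension, not taken from a real config

decode_flops = gemm_flops_per_gpu(tokens=1 * 1, in_features=HIDDEN, out_features=HIDDEN, tp=8)
prefill_flops = gemm_flops_per_gpu(tokens=1 * 8192, in_features=HIDDEN, out_features=HIDDEN, tp=8)

# Projection FLOPs scale linearly with the number of tokens processed.
print(prefill_flops / decode_flops)
```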
Step 3: Estimate MFU with measured latency
Provide the measured forward-pass latency to compute MFU:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 1 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --measured-ms 15.0
```
MFU = theoretical_min_time / measured_time × 100%
The simulator prints:
- Overall MFU
- Per-layer MFU (uniform layer-time assumption)
- Per-operator FLOPs proportion (for identifying which ops dominate)
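The MFU formula and the uniform layer-time assumption can be worked through by hand. The sketch below uses made-up numbers: the 1e14 FLOPS peak and 3e11 FLOPs workload are placeholders, not entries from `references/gpu-specs.json`, and the function names are hypothetical.

```python
def mfu_percent(flops_per_gpu: float, peak_flops_per_s: float, measured_ms: float) -> float:
    """MFU = theoretical_min_time / measured_time x 100%."""
    theoretical_ms = flops_per_gpu / peak_flops_per_s * 1e3
    return theoretical_ms / measured_ms * 100.0


def per_layer_mfu_percent(layer_flops_per_gpu: float, peak_flops_per_s: float,
                          measured_ms: float, num_layers: int) -> float:
    """Per-layer MFU, assuming every layer takes an equal share of the measured time."""
    return mfu_percent(layer_flops_per_gpu, peak_flops_per_s, measured_ms / num_layers)


PEAK = 1e14          # placeholder peak FLOPS, not a real GPU spec
TOTAL_FLOPS = 3e11   # placeholder per-GPU FLOPs for one forward pass

# 3e11 FLOPs at 1e14 FLOPS needs 3 ms at minimum; against 15 ms measured that is about 20%.
print(mfu_percent(TOTAL_FLOPS, PEAK, measured_ms=15.0))
```

Under the uniform assumption, a layer that carries more than its even share of FLOPs reports a per-layer MFU above the overall figure, and a lighter layer reports less.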
GPU peak FLOPS are loaded from `references/gpu-specs.json`. The bundled hardware table includes H20, H100 SXM 80GB, H200 SXM 141GB, and B200 SXM 180GB. Pass the matching alias to `--gpu` (for example, `h20`) when running on those local boxes; `--list-gpus` shows the known GPU types.
Step 4: Per-operator MFU with kernel-level latency
When you have per-kernel measured latency, compute per-operator MFU by mapping
kernel durations to the compute flow.
Method A: `--kernel-flow` (kernel-level MFU, recommended)
Provide per-kernel detail as JSON, then feed it to the simulator for
kernel-level MFU analysis. This preserves every kernel row from the compute
flow and adds FLOPs/MFU columns.
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 8192 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --kernel-flow @/tmp/layer3_detail.json
```
The `--kernel-flow` parameter accepts a JSON string or an `@`-prefixed file path. It produces a kernel-level MFU table that preserves all kernel rows from the compute flow and adds these columns:
- Mapped Op: which operator this kernel maps to
- FLOPs: operator's total FLOPs
- Theo(us): theoretical minimum time
- MFU%: measured FLOPs utilization
- shape_in→shape_out: operator tensor dimensions
When `--kernel-flow` is provided, the static per-operator template is omitted because the kernel-level MFU table already carries per-kernel shape and FLOPs information. The output keeps the model summary, serving configuration, total FLOPs, and kernel-level MFU table.
Mapping rules:
- Direct-match kernels (mla, moe, mhc, rmsnorm, hadamard, rope, quant, topk, etc.): time is assigned directly to the corresponding operators
- Generic GEMM kernels (gemm_fp8, gemm_bf16): time is distributed to remaining unassigned projection GEMM operators by FLOPs share
- Overhead kernels (allreduce, moe_align, moe_sort, other): rows preserved, FLOPs/MFU marked as N/A
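The three mapping rules above can be sketched as a small assignment routine. This illustrates the rules as stated and is not the simulator's actual code. The operator dictionary layout and the `"gemm"` category label are hypothetical, as is the choice to split a direct-match category by FLOPs when several operators share it.

```python
OVERHEAD = {"allreduce", "moe_align", "moe_sort", "other"}
GENERIC_GEMM = {"gemm_fp8", "gemm_bf16"}


def split_by_flops(ms, names, operators):
    """Distribute `ms` over the named operators in proportion to their FLOPs."""
    total = sum(operators[n]["flops"] for n in names)
    if total == 0:
        return {n: ms / len(names) for n in names}
    return {n: ms * operators[n]["flops"] / total for n in names}


def assign_kernel_time(kernels, operators):
    """kernels: list of (category, ms); operators: name -> {"category", "flops"}."""
    assigned = {}
    generic_ms = 0.0
    for category, ms in kernels:
        if category in OVERHEAD:
            continue  # row stays in the report, but FLOPs/MFU are N/A
        if category in GENERIC_GEMM:
            generic_ms += ms  # pooled, then spread over unassigned GEMM operators
            continue
        matches = [n for n, op in operators.items() if op["category"] == category]
        if matches:  # direct match
            for name, share in split_by_flops(ms, matches, operators).items():
                assigned[name] = assigned.get(name, 0.0) + share
    remaining = [n for n, op in operators.items()
                 if n not in assigned and op["category"] == "gemm"]
    if remaining and generic_ms:
        assigned.update(split_by_flops(generic_ms, remaining, operators))
    return assigned


ops = {
    "attention": {"category": "mla", "flops": 10},
    "q_proj": {"category": "gemm", "flops": 30},
    "o_proj": {"category": "gemm", "flops": 10},
}
timings = [("mla", 2.0), ("gemm_fp8", 3.0), ("gemm_bf16", 1.0), ("allreduce", 0.5)]
print(assign_kernel_time(timings, ops))
```

In this toy input, the 4.0 ms of generic GEMM time is shared 3:1 between the two projection operators, and the allreduce row receives no operator.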
FP8 kernel MFU correction: some kernel categories, including the one that covers fused_moe_kernel, use fp8 math internally even when `--dtype bf16` is specified. For these kernels, the MFU denominator uses the GPU's fp8 peak FLOPS (2x bf16 peak) instead of the bf16 peak. The resulting MFU is marked with a superscript to show that the fp8 denominator was used. All other kernels still use the bf16 peak FLOPS denominator.
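The effect of the fp8 correction on a single kernel row can be sketched as follows. The `uses_fp8` flag stands in for the category check, and the peak value is a placeholder rather than a figure from `references/gpu-specs.json`.

```python
def kernel_mfu_percent(flops: float, measured_us: float,
                       bf16_peak_flops_per_s: float, uses_fp8: bool) -> float:
    """Kernel-level MFU, using the fp8 peak (2x bf16) for kernels that run fp8 math."""
    peak = 2 * bf16_peak_flops_per_s if uses_fp8 else bf16_peak_flops_per_s
    theoretical_us = flops / peak * 1e6
    return theoretical_us / measured_us * 100.0


BF16_PEAK = 1e14  # placeholder peak, not a real GPU spec

# Same kernel, same measured time: the fp8 denominator halves the reported MFU.
print(kernel_mfu_percent(1e9, 100.0, BF16_PEAK, uses_fp8=False))
print(kernel_mfu_percent(1e9, 100.0, BF16_PEAK, uses_fp8=True))
```

Without the correction, an fp8 kernel measured against the bf16 peak would report twice the MFU it actually achieves.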
Method B: operator-level MFU (legacy)
Takes the same input as Method A but outputs an operator-level summary table (aggregated by operator, not per kernel). Use it when you want a compact view.
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 8192 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --kernel-ms '{
    "mla": 4.922, "moe": 1.644, "allreduce": 0.769,
    "hadamard": 0.348, "mhc": 1.388, "gemm_fp8": 1.692,
    "gemm_bf16": 0.125, "rmsnorm": 0.227, "quant": 0.311,
    "rope": 0.209, "topk": 0.122, "activation": 0.071,
    "other": 0.437
  }'
```
The `--kernel-ms` parameter accepts a JSON object mapping kernel category names to their measured durations in milliseconds. It distributes time FLOPs-proportionally across entire categories, which is less precise than per-kernel mapping because the generic GEMM categories (gemm_fp8, gemm_bf16) span multiple operator categories.
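When the category timings come from a script rather than being typed by hand, the JSON argument can be built and sanity-checked first. The sketch below reuses the timings from the example command. `time_share_percent` is a hypothetical helper, not a simulator function.

```python
import json

kernel_ms = {
    "mla": 4.922, "moe": 1.644, "allreduce": 0.769,
    "hadamard": 0.348, "mhc": 1.388, "gemm_fp8": 1.692,
    "gemm_bf16": 0.125, "rmsnorm": 0.227, "quant": 0.311,
    "rope": 0.209, "topk": 0.122, "activation": 0.071,
    "other": 0.437,
}


def time_share_percent(timings):
    """Share of total measured time per kernel category, in percent."""
    total = sum(timings.values())
    return {name: ms / total * 100.0 for name, ms in timings.items()}


shares = time_share_percent(kernel_ms)
dominant = max(shares, key=shares.get)   # category with the largest time share
kernel_ms_arg = json.dumps(kernel_ms)    # string to pass as the --kernel-ms value

print(dominant, round(sum(kernel_ms.values()), 3))
```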
Output includes:
- Model architecture summary (layers, hidden_size, attention_type, MoE config)
- Per-layer compute flow: operator sequence with tensor dimensions, FLOPs, shape_in→shape_out
- Per-operator MFU table: each operator's FLOPs, theoretical time, measured time (from trace), MFU%
- Kernel → operator mapping explanation (direct-match vs FLOPs-proportional vs overhead)
- Overall and per-layer MFU
When To Use It
- when you need compute-level detail for a known model or config
- when the user asks about execution flow, tensor dimensions, or FLOPs for a specific serving shape
- when the user asks about MFU and can provide measured forward-pass latency
- when comparing compute profiles across different parallelism configurations
Useful Commands
List known model IDs:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-models
```
List known GPU types:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-gpus
```
Emit JSON for automation:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "GLM-5" --format json
```
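For automation, the JSON mode can be driven from Python. The sketch below only uses flags that appear in this document. It assumes that `--format json` can be combined with the other flags and that stdout is then a single JSON document, neither of which is confirmed here. `build_command` and `run_simulator` are hypothetical wrappers.

```python
import json
import subprocess

SIMULATOR = "skills/model-compute-simulation/scripts/model_compute_simulator.py"


def build_command(model, batch_size=1, seq_len=1, tp=8, dp=1, ep=8,
                  gpu="h20", dtype="bf16"):
    """Assemble the simulator argv with JSON output enabled."""
    return [
        "python3", SIMULATOR, model,
        "--batch-size", str(batch_size), "--seq-len", str(seq_len),
        "--tp", str(tp), "--dp", str(dp), "--ep", str(ep),
        "--gpu", gpu, "--dtype", dtype,
        "--format", "json",
    ]


def run_simulator(model, **kwargs):
    """Run the simulator and parse stdout (assumed to be one JSON document)."""
    result = subprocess.run(build_command(model, **kwargs),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


print(" ".join(build_command("GLM-5", seq_len=8192)))
```

Passing the arguments as a list avoids shell quoting problems with model names that contain spaces.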
Reporting Checklist
Include:
- Model architecture summary: model name, config source, num_layers, hidden_size, attention_type, MoE config (num_experts, topk, shared_experts), MHC, head_dim
- Serving configuration: batch_size, seq_len, TP, DP, EP, GPU, dtype
- Per-layer compute flow (showing first representative layer in detail):
  - Operator sequence table: name, category, FLOPs, shape_in → shape_out
  - Attention vs MoE/FFN FLOPs proportion
- Total model FLOPs for a single forward pass
- Kernel-level MFU table (when `--kernel-flow` is provided):
  - Preserves ALL kernel rows from the compute flow (never deleted)
  - Per-kernel columns:
    # | Half | Category | Simplified Name | dur(us) | % | Mapped Op | FLOPs | Theo(us) | MFU% | shape_in→shape_out
  - Direct-match kernels: show mapped operator FLOPs/MFU
  - Overhead kernels: show N/A for FLOPs/MFU, row preserved
- Operator-level MFU table (when operator-level timing input such as `--kernel-ms` is provided):
  - Each operator: name, category, total FLOPs, per-GPU FLOPs, theoretical time, measured time (from trace), MFU%
  - Kernel category → operator mapping explained
- Overall MFU and per-layer MFU
- One-line summary: dominant compute category, MFU status, key bottleneck
Trace-Based Validation (extract_compute_flow_from_trace.py)
Use `scripts/extract_compute_flow_from_trace.py` to extract the real operator sequence and tensor dimensions from a torch profiler trace, then compare them against the static template as ground-truth validation.
```bash
# Extract compute flow from a trace
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
  --input /path/to/trace.json.gz --format text

# Compare trace against static template
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
  --input /path/to/trace.json.gz \
  --compare qwen3-235b-a22b \
  --batch-size 1 --seq-len 1 --tp 8 --ep 8
```
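Before running the extraction script, it can help to check whether a trace carries shape information at all. The sketch below reads a gzip-compressed trace and lists events that have recorded input shapes. It assumes the Chrome-trace layout that the torch profiler usually exports (a top-level `traceEvents` list, with shapes under `args["Input Dims"]`). It is not the extraction script's logic, and `ops_with_shapes` is a hypothetical helper.

```python
import gzip
import json


def ops_with_shapes(trace_path):
    """Return (event name, input dims) for every trace event that recorded shapes."""
    with gzip.open(trace_path, "rt", encoding="utf-8") as handle:
        trace = json.load(handle)
    # Assumed layout: {"traceEvents": [...]}; some tools emit a bare list instead.
    events = trace.get("traceEvents", []) if isinstance(trace, dict) else trace
    rows = []
    for event in events:
        dims = event.get("args", {}).get("Input Dims")  # assumed field name
        if dims:
            rows.append((event.get("name"), dims))
    return rows
```

An empty result suggests the trace was captured without shape recording, in which case FLOPs cannot be derived from it.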
Compute Flow Confirmation Hierarchy
When the static template or trace extraction cannot fully confirm the compute process (e.g. ambiguous scope, missing shapes, new model architecture), follow this escalation hierarchy:
- Static template (`model_compute_simulator.py` plus the indexed model config): fast, covers known models
- Trace extraction (`extract_compute_flow_from_trace.py`): validates the template against real execution
- Inference framework source code: when the trace is insufficient (missing shapes, CUDA Graph replay, compiled kernels without scope), read the model's forward flow directly from the serving framework source:
  - SGLang: `python/sglang/srt/models/<model_name>.py`, which contains the forward method with the exact operator sequence, tensor shapes, and parallelism split logic
  - vLLM: `vllm/model_executor/models/<model_name>.py`
  - TensorRT-LLM: `cpp/tensorrt_llm/pyexecutor/py_executor.cpp` plus model config files
When consulting framework source, focus on:
- The forward method: operator call order and residual connections
- QKV / O projection: whether LoRA-style down/up projections are used (the MLA rank parameters in the config)
- MoE routing: top-k selection, shared vs routed expert split
- TP/EP slicing: which dimensions are split and how FLOPs divide across GPUs
- Any model-specific ops not in the static template (e.g. MHC, Hadamard, indexer)
Action: If the framework source reveals discrepancies with the static template, update the static template and/or the indexed model config accordingly.
Limitations of Trace Extraction
| Limitation | Detail | Workaround |
|---|---|---|
| Shape recording required | Trace must be captured with shape recording enabled; without it, shape fields are absent and FLOPs cannot be computed | SGLang live capture and vLLM torch_profiler_with_stack=true already enable this; TensorRT-LLM requires an override that enables it |
| CUDA Graph mode | During graph replay, events may only appear once (at capture time); shape information for replayed iterations is not re-recorded | The script detects graph capture phases and annotates affected ops; use eager-mode traces for full coverage |
| TP-sliced dimensions | Trace shows post-TP-split dimensions, not the full-model view | Pass the parallelism flags (e.g. `--tp`) in `--compare` mode to scale trace FLOPs back to full-model equivalents |
| Scope attribution quality | Python scope depends on stack recording in the profiler; some frameworks or compiled paths may produce shallow or missing scope chains | Graceful degradation: ops with unresolved scope are categorized as "other" |
| Not a replacement for static templates | Trace extraction is a validation and discovery tool; static templates remain the primary fast-analysis path | Use trace extraction to verify templates for new models, then update the templates if discrepancies are found |
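The TP-sliced row in the table can be made concrete with a short sketch. It assumes an operator whose sharded dimension is split evenly across TP ranks, so that the per-GPU figures seen in a trace scale back by a factor of `tp`. The helper names are hypothetical, and replicated operators such as norms would not be scaled this way.

```python
def full_model_dim(local_dim: int, tp: int) -> int:
    """Recover the unsharded size of a dimension that was split evenly over TP ranks."""
    return local_dim * tp


def full_model_flops(per_gpu_flops: float, tp: int) -> float:
    """Scale per-GPU FLOPs of a TP-split operator back to the full-model equivalent."""
    return per_gpu_flops * tp


# A projection that shows 512 output features per GPU under TP=8 is 4096 wide overall.
print(full_model_dim(512, tp=8), full_model_flops(2 * 1 * 4096 * 512, tp=8))
```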
References
- `references/model-config-index.json`: model configuration parameters (hidden_size, expert counts, MLA ranks, etc.).
- `references/gpu-specs.json`: GPU peak FLOPS specifications for MFU calculation.
- `scripts/extract_compute_flow_from_trace.py`: trace-based compute flow extraction and template validation tool.