Model Compute Simulation

Overview

Use this when the question is about operator order, tensor dimensions, FLOPs, MFU, or parallelism checks. The simulator loads a model config, builds the representative operator sequence, prints tensor shapes and FLOPs, and can estimate MFU from measured latency.

Confirmation Required

Before running a simulation, collect or verify these inputs:

Item	Why it matters	How to obtain	Default if user skips
Model name	Resolves to config in `model-config-index.json` ; determines entire architecture	Ask user or infer from trace context	— (required)
Config accuracy	Indexed values may differ from actual serving config (e.g. `routed_expert_intermediate_size` , `compress_ratios` )	Ask user to provide `config.json` or verify key params against HuggingFace	Use indexed values with a caveat
GPU type	Determines peak FLOPS for MFU denominator	Ask user	— (required for MFU)
dtype (bf16 / fp8)	Affects peak FLOPS selection; fp8 doubles peak	Ask user	bf16
Batch size & seq len	Directly affects FLOPs and tensor shapes	Ask user	B=1, S=1 (decode)
TP / DP / EP	TP splits GEMM FLOPs across GPUs; EP splits expert FLOPs	Ask user	TP=8, DP=1, EP=8
Measured latency (ms)	Required for MFU numerator; must be per-GPU forward-pass wall-clock	Ask user or extract from a profiler trace	— (optional, no MFU without it)

If the model is not in

model-config-index.json

, ask the user for a

config.json

path or add an indexed config before running estimates.

Workflow

Step 1: Load model config

Resolve the model name and load its configuration parameters:

bash

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "<model name>" --list-models

The script resolves the model name against

references/model-config-index.json

, which stores public HuggingFace config parameters (hidden_size, num_experts, MLA ranks, etc.).

If the model is not indexed, tell the user to provide a

config.json

path or request an index update.

Step 2: Generate execution flow and tensor dimensions

Run the simulator with batch size, sequence length, and parallelism configuration:

bash

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 1 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16

The simulator prints:

Per-layer operator sequence with FLOPs and tensor shapes (shape_in → shape_out)
Attention vs MoE/FFN FLOPs proportion per layer
Total model FLOPs for a single forward pass

For decode: use

--seq-len 1

. For prefill: use

--seq-len <prompt_length>

Step 3: Estimate MFU with measured latency

Provide the measured forward-pass latency to compute MFU:

bash

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 1 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --measured-ms 15.0

MFU = theoretical_min_time / measured_time × 100%

The simulator prints:

Overall MFU
Per-layer MFU (uniform layer-time assumption)
Per-operator FLOPs proportion (for identifying which ops dominate)

GPU peak FLOPS are loaded from

references/gpu-specs.json

. The bundled hardware table includes H20, H100 SXM 80GB, H200 SXM 141GB, and B200 SXM 180GB. Use aliases such as

--gpu h100

--gpu h200

, or

--gpu b200

when running on those local boxes.

Step 4: Per-operator MFU with kernel-level latency

When you have per-kernel measured latency, compute per-operator MFU by mapping kernel durations to the compute flow.

Method A:

--kernel-flow

(kernel-level MFU, recommended)

Provide per-kernel detail as JSON, then feed it to the simulator for kernel-level MFU analysis. This preserves every kernel row from the compute flow and adds FLOPs/MFU columns.

bash

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 8192 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --kernel-flow @/tmp/layer3_detail.json

The

--kernel-flow

parameter accepts a JSON string or

@file

path. It produces a kernel-level MFU table that preserves all kernel rows from the compute flow and adds:

```
Mapped Op
```
: which operator this kernel maps to
```
FLOPs
```
: operator's total FLOPs
```
Theo(us)
```
: theoretical minimum time
```
MFU%
```
: measured FLOPs utilization
```
shape_in→shape_out
```
: operator tensor dimensions

When

--kernel-flow

is provided, the static per-operator template is omitted because the kernel-level MFU table already carries per-kernel shape and FLOPs information. The output keeps the model summary, serving configuration, total FLOPs, and kernel-level MFU table.

Mapping rules:

Direct-match kernels (mla, moe, mhc, rmsnorm, hadamard, rope, quant, topk, etc.): time is assigned directly to the corresponding operators
Generic GEMM kernels (gemm_fp8, gemm_bf16): time is distributed to remaining unassigned projection GEMM operators by FLOPs share
Overhead kernels (allreduce, moe_align, moe_sort, other): rows preserved, FLOPs/MFU marked as N/A

FP8 kernel MFU correction: Kernels in categories

moe

(fused_moe_kernel) and

gemm_fp8

use fp8 math internally even when

--dtype bf16

is specified. For these kernels, the MFU denominator uses the GPU's fp8 peak FLOPS (2x bf16 peak) instead of bf16 peak. The resulting MFU is marked with a superscript

⁸

(for example,

63.7%⁸

) to show that the fp8 denominator was used.

gemm_bf16

kernels still use the bf16 peak FLOPS denominator.

Method B:

--kernel-detail

(operator-level MFU, legacy)

Same input as

--kernel-flow

but outputs an operator-level summary table (aggregated by operator, not per-kernel). Use when you want a compact view.

bash

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 8192 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --kernel-ms '{
    "mla": 4.922, "moe": 1.644, "allreduce": 0.769,
    "hadamard": 0.348, "mhc": 1.388, "gemm_fp8": 1.692,
    "gemm_bf16": 0.125, "rmsnorm": 0.227, "quant": 0.311,
    "rope": 0.209, "topk": 0.122, "activation": 0.071,
    "other": 0.437
  }'

The

--kernel-ms

parameter accepts a JSON object mapping kernel category names to their measured durations in milliseconds. It uses FLOPs-proportional distribution across entire categories, which is less precise than

--kernel-detail

because generic GEMM categories (gemm_fp8, gemm_bf16) span multiple operator categories.

Output includes:

Model architecture summary (layers, hidden_size, attention_type, MoE config)
Per-layer compute flow: operator sequence with tensor dimensions, FLOPs, shape_in→shape_out
Per-operator MFU table: each operator's FLOPs, theoretical time, measured time (from trace), MFU%
Kernel → operator mapping explanation (direct-match vs FLOPs-proportional vs overhead)
Overall and per-layer MFU

When To Use It

when you need compute-level detail for a known model or config
when the user asks about execution flow, tensor dimensions, or FLOPs for a specific serving shape
when the user asks about MFU and can provide measured forward-pass latency
when comparing compute profiles across different parallelism configurations

Useful Commands

List known model IDs:

bash

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-models

List known GPU types:

bash

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-gpus

Emit JSON for automation:

bash

python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "GLM-5" --format json

Reporting Checklist

Include:

Model architecture summary: model name, config source, num_layers, hidden_size, attention_type, MoE config (num_experts, topk, shared_experts), MHC, head_dim
Serving configuration: batch_size, seq_len, TP, DP, EP, GPU, dtype
Per-layer compute flow (showing first representative layer in detail):
- Operator sequence table: name, category, FLOPs, shape_in → shape_out
- Attention vs MoE/FFN FLOPs proportion
Total model FLOPs for a single forward pass
Kernel-level MFU table (when
```
--kernel-flow
```
provided):
- Preserves ALL kernel rows from the compute flow (never deleted)
- Per-kernel columns:
```
# | Half | Category | Simplified Name | dur(us) | % | Mapped Op | FLOPs | Theo(us) | MFU% | shape_in→shape_out
```
- Direct-match kernels: show mapped operator FLOPs/MFU
- Overhead kernels: show N/A for FLOPs/MFU, row preserved
Operator-level MFU table (when
```
--kernel-detail
```
or
```
--measured-ms
```
provided):
- Each operator: name, category, total FLOPs, per-GPU FLOPs, theoretical time, measured time (from trace), MFU%
- Kernel category → operator mapping explained
Overall MFU and per-layer MFU
One-line summary: dominant compute category, MFU status, key bottleneck

Trace-Based Validation (extract_compute_flow_from_trace.py)

Use

scripts/extract_compute_flow_from_trace.py

to extract the real operator sequence and tensor dimensions from a torch profiler trace, then compare against the static template as ground truth validation.

bash

# Extract compute flow from a trace
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
  --input /path/to/trace.json.gz --format text

# Compare trace against static template
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
  --input /path/to/trace.json.gz \
  --compare qwen3-235b-a22b \
  --batch-size 1 --seq-len 1 --tp 8 --ep 8

Compute Flow Confirmation Hierarchy

When the static template or trace extraction cannot fully confirm the compute process (e.g. ambiguous scope, missing shapes, new model architecture), follow this escalation hierarchy:

Static template (

model_compute_simulator.py

model-config-index.json

) — fast, covers known models

Trace extraction (
```
extract_compute_flow_from_trace.py
```
) — validates template against real execution
Inference framework source code — when trace is insufficient (missing
```
Input Dims
```
, CUDA Graph replay, compiled kernels without scope), read the model's forward flow directly from the serving framework source:
- SGLang:
```
python/sglang/srt/models/<model_name>.py
```
  — contains the
```
forward()
```
  method with the exact operator sequence, tensor shapes, and parallelism split logic
- vLLM:
```
vllm/model_executor/models/<model_name>.py
```
- TensorRT-LLM:
```
cpp/tensorrt_llm/pyexecutor/py_executor.cpp
```
  + model config files
When consulting framework source, focus on:
- The
```
forward()
```
  method: operator call order and residual connections
- QKV / O projection: whether LoRA-style down/up projections are used (
```
q_lora_rank
```
  ,
```
o_lora_rank
```
  )
- MoE routing: top-k selection, shared vs routed expert split
- TP/EP slicing: which dimensions are split and how FLOPs divide across GPUs
- Any model-specific ops not in the static template (e.g. MHC, Hadamard, indexer)
Action: If the framework source reveals discrepancies with the static template, update
```
model-config-index.json
```
and/or
```
build_layer_ops()
```
accordingly.

Limitations of Trace Extraction

Limitation	Detail	Workaround
`record_shapes=True` required	Trace must be captured with shape recording enabled; without it, `Input Dims` fields are absent and FLOPs cannot be computed	SGLang live capture and vLLM `torch_profiler_with_stack=true` already enable this; TensorRT-LLM requires a `py_executor.py` override adding `record_shapes=True`
CUDA Graph mode	During graph replay, `cpu_op` events may only appear once (at capture time); shape information for replayed iterations is not re-recorded	The script detects graph capture phases and annotates affected ops; use eager-mode traces for full coverage
TP-sliced dimensions	Trace shows post-TP-split dimensions (e.g. `H/TP` ), not the full-model view	Use `--tp` in `--compare` mode to scale trace FLOPs back to full-model equivalents
Scope attribution quality	Python scope depends on `with_stack=True` ; some frameworks or compiled paths may produce shallow or missing scope chains	Graceful degradation: ops with unresolved scope are categorized as "other"
Not a replacement for static templates	Trace extraction is a validation and discovery tool; static templates remain the primary fast-analysis path	Use trace extraction to verify templates for new models, then update `model-config-index.json` if discrepancies are found

References

```
references/model-config-index.json
```
: model configuration parameters (hidden_size, expert counts, MLA ranks, etc.).
```
references/gpu-specs.json
```
: GPU peak FLOPS specifications for MFU calculation.
```
scripts/extract_compute_flow_from_trace.py
```
: trace-based compute flow extraction and template validation tool.

model-compute-simulation

NPX Install

Tags

SKILL.md Content

Model Compute Simulation

Overview

Confirmation Required

Workflow

Step 1: Load model config

Step 2: Generate execution flow and tensor dimensions

Step 3: Estimate MFU with measured latency

Step 4: Per-operator MFU with kernel-level latency

Method A:
`--kernel-flow`
(kernel-level MFU, recommended)

Method B:
`--kernel-detail`
(operator-level MFU, legacy)

When To Use It

Useful Commands

Reporting Checklist

Trace-Based Validation (extract_compute_flow_from_trace.py)

Compute Flow Confirmation Hierarchy

Limitations of Trace Extraction

References