model-compute-simulation

Model Compute Simulation

Overview

Use this when the question is about operator order, tensor dimensions, FLOPs, MFU, or parallelism checks. The simulator loads a model config, builds the representative operator sequence, prints tensor shapes and FLOPs, and can estimate MFU from measured latency.

Confirmation Required

Before running a simulation, collect or verify these inputs:

| Item | Why it matters | How to obtain | Default if user skips |
|---|---|---|---|
| Model name | Resolves to config in `model-config-index.json`; determines entire architecture | Ask user or infer from trace context | None (required) |
| Config accuracy | Indexed values may differ from actual serving config (e.g. `routed_expert_intermediate_size`, `compress_ratios`) | Ask user to provide `config.json` or verify key params against HuggingFace | Use indexed values with a caveat |
| GPU type | Determines peak FLOPS for MFU denominator | Ask user | None (required for MFU) |
| dtype (bf16 / fp8) | Affects peak FLOPS selection; fp8 doubles peak | Ask user | bf16 |
| Batch size & seq len | Directly affects FLOPs and tensor shapes | Ask user | B=1, S=1 (decode) |
| TP / DP / EP | TP splits GEMM FLOPs across GPUs; EP splits expert FLOPs | Ask user | TP=8, DP=1, EP=8 |
| Measured latency (ms) | Required for MFU numerator; must be per-GPU forward-pass wall-clock | Ask user or extract from a profiler trace | None (optional, no MFU without it) |

If the model is not in `model-config-index.json`, ask the user for a `config.json` path or add an indexed config before running estimates.
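The defaults in the input table can be captured in a small record so that missing MFU inputs are caught before a run. This is a minimal sketch: the `SimulationInputs` class and its `missing_for_mfu` helper are hypothetical names introduced here for illustration and are not part of the simulator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimulationInputs:
    """Hypothetical container mirroring the input table (not a simulator API)."""

    model_name: str                      # required, no default
    gpu: Optional[str] = None            # required only when MFU is requested
    dtype: str = "bf16"                  # default when the user skips it
    batch_size: int = 1                  # B=1 (decode default)
    seq_len: int = 1                     # S=1 (decode default)
    tp: int = 8
    dp: int = 1
    ep: int = 8
    measured_ms: Optional[float] = None  # optional; no MFU without it

    def missing_for_mfu(self) -> List[str]:
        """Return the names of inputs still needed before MFU can be estimated."""
        missing = []
        if self.gpu is None:
            missing.append("gpu")
        if self.measured_ms is None:
            missing.append("measured_ms")
        return missing


inputs = SimulationInputs(model_name="Qwen3-235B-A22B")
print(inputs.missing_for_mfu())
```

With only a model name supplied, the helper reports that both the GPU type and the measured latency still need to be collected from the user.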

Workflow

Step 1: Load model config

Resolve the model name and load its configuration parameters:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "<model name>" --list-models
```

The script resolves the model name against `references/model-config-index.json`, which stores public HuggingFace config parameters (hidden_size, num_experts, MLA ranks, etc.).

If the model is not indexed, tell the user to provide a `config.json` path or request an index update.

Step 2: Generate execution flow and tensor dimensions

Run the simulator with batch size, sequence length, and parallelism configuration:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 1 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16
```

The simulator prints:
  • Per-layer operator sequence with FLOPs and tensor shapes (shape_in → shape_out)
  • Attention vs MoE/FFN FLOPs proportion per layer
  • Total model FLOPs for a single forward pass

For decode, use `--seq-len 1`. For prefill, use `--seq-len <prompt_length>`.
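To see why batch size, sequence length, and TP all change the reported numbers, it can help to work one projection GEMM by hand. The sketch below assumes the common convention of 2 FLOPs per multiply-add and an even split of the output dimension across TP ranks; whether the simulator uses exactly this accounting is an assumption, and the layer sizes are illustrative rather than taken from any indexed config.

```python
def gemm_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul, counting a multiply-add as 2 FLOPs."""
    return 2 * m * k * n


def per_gpu_projection_flops(batch_size: int, seq_len: int,
                             in_features: int, out_features: int, tp: int) -> int:
    """Per-GPU FLOPs when the output dimension is split evenly across TP ranks."""
    tokens = batch_size * seq_len
    return gemm_flops(tokens, in_features, out_features // tp)


# Illustrative sizes only (not from model-config-index.json).
IN_FEATURES, OUT_FEATURES, TP = 4096, 8192, 8

decode = per_gpu_projection_flops(1, 1, IN_FEATURES, OUT_FEATURES, TP)
prefill = per_gpu_projection_flops(1, 8192, IN_FEATURES, OUT_FEATURES, TP)

print(f"decode  shape_in=[1, {IN_FEATURES}] -> shape_out=[1, {OUT_FEATURES // TP}]: {decode} FLOPs")
print(f"prefill shape_in=[8192, {IN_FEATURES}] -> shape_out=[8192, {OUT_FEATURES // TP}]: {prefill} FLOPs")
```

Under these assumptions the prefill case costs exactly `seq_len` times the decode case, since only the token dimension changes.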

Step 3: Estimate MFU with measured latency

Provide the measured forward-pass latency to compute MFU:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 1 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --measured-ms 15.0
```

MFU = theoretical_min_time / measured_time × 100%

The simulator prints:
  • Overall MFU
  • Per-layer MFU (uniform layer-time assumption)
  • Per-operator FLOPs proportion (for identifying which ops dominate)

GPU peak FLOPS are loaded from `references/gpu-specs.json`. The bundled hardware table includes H20, H100 SXM 80GB, H200 SXM 141GB, and B200 SXM 180GB. Use aliases such as `--gpu h100`, `--gpu h200`, or `--gpu b200` when running on those local boxes.
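The MFU formula can be checked by hand with a few lines of arithmetic. The sketch below assumes that the theoretical minimum time is per-GPU FLOPs divided by the GPU's peak FLOPS; the FLOPs count and the peak value are placeholders chosen for round numbers, not values from `references/gpu-specs.json` or from any simulator output.

```python
def theoretical_min_ms(flops_per_gpu: float, peak_tflops: float) -> float:
    """Time in ms to execute flops_per_gpu at the given peak throughput."""
    return flops_per_gpu / (peak_tflops * 1e12) * 1e3


def mfu_percent(flops_per_gpu: float, peak_tflops: float, measured_ms: float) -> float:
    """MFU = theoretical_min_time / measured_time x 100%."""
    return theoretical_min_ms(flops_per_gpu, peak_tflops) / measured_ms * 100.0


# Placeholder inputs for illustration only.
FLOPS_PER_GPU = 3e11      # hypothetical per-GPU forward-pass FLOPs
PEAK_TFLOPS = 100.0       # hypothetical peak, not a real GPU spec
MEASURED_MS = 15.0        # same latency as the --measured-ms example

theo_ms = theoretical_min_ms(FLOPS_PER_GPU, PEAK_TFLOPS)
mfu = mfu_percent(FLOPS_PER_GPU, PEAK_TFLOPS, MEASURED_MS)
print(f"theoretical min: {theo_ms:.2f} ms, MFU: {mfu:.1f}%")
```

With these placeholder values the theoretical minimum is 3 ms, so a 15 ms measurement corresponds to an MFU of 20%.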

Step 4: Per-operator MFU with kernel-level latency

When you have per-kernel measured latency, compute per-operator MFU by mapping kernel durations to the compute flow.

Method A: `--kernel-flow` (kernel-level MFU, recommended)

Provide per-kernel detail as JSON, then feed it to the simulator for kernel-level MFU analysis. This preserves every kernel row from the compute flow and adds FLOPs/MFU columns.
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 8192 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --kernel-flow @/tmp/layer3_detail.json
```

The `--kernel-flow` parameter accepts a JSON string or an `@file` path. It produces a kernel-level MFU table that preserves all kernel rows from the compute flow and adds:
  • Mapped Op: which operator this kernel maps to
  • FLOPs: the operator's total FLOPs
  • Theo(us): theoretical minimum time
  • MFU%: measured FLOPs utilization
  • shape_in→shape_out: operator tensor dimensions

When `--kernel-flow` is provided, the static per-operator template is omitted because the kernel-level MFU table already carries per-kernel shape and FLOPs information. The output keeps the model summary, serving configuration, total FLOPs, and kernel-level MFU table.

Mapping rules:
  • Direct-match kernels (mla, moe, mhc, rmsnorm, hadamard, rope, quant, topk, etc.): time is assigned directly to the corresponding operators
  • Generic GEMM kernels (gemm_fp8, gemm_bf16): time is distributed to remaining unassigned projection GEMM operators by FLOPs share
  • Overhead kernels (allreduce, moe_align, moe_sort, other): rows preserved, FLOPs/MFU marked as N/A
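
The second mapping rule, distributing a generic GEMM kernel's time by FLOPs share, can be sketched as follows. This is an illustration of proportional splitting under the stated rule, not the simulator's actual code; the function name, operator names, and numbers are hypothetical.

```python
from typing import Dict


def distribute_by_flops(kernel_ms: float, op_flops: Dict[str, float]) -> Dict[str, float]:
    """Split one measured kernel duration across operators in proportion to FLOPs."""
    total = sum(op_flops.values())
    if total == 0:
        # Nothing to weight by: assign no time rather than divide by zero.
        return {name: 0.0 for name in op_flops}
    return {name: kernel_ms * flops / total for name, flops in op_flops.items()}


# Hypothetical unassigned projection GEMMs and their FLOPs.
unassigned_ops = {"q_proj": 3e9, "kv_proj": 1e9}
assigned_ms = distribute_by_flops(2.0, unassigned_ops)

for name, ms in assigned_ms.items():
    print(f"{name}: {ms:.3f} ms")
```

An operator with three times the FLOPs receives three times the measured time, so every operator sharing the same generic kernel category ends up with the same MFU. That is the main precision limit of this rule.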

FP8 kernel MFU correction: Kernels in categories `moe` (fused_moe_kernel) and `gemm_fp8` use fp8 math internally even when `--dtype bf16` is specified. For these kernels, the MFU denominator uses the GPU's fp8 peak FLOPS (2x bf16 peak) instead of bf16 peak. The resulting MFU is marked with a superscript ⁸ (for example, `63.7%⁸`) to show that the fp8 denominator was used. `gemm_bf16` kernels still use the bf16 peak FLOPS denominator.
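
The effect of the fp8 correction on a single kernel row can be sketched as below. The category set and the 2x factor come from the description above; the function names, the peak value, and the kernel numbers are hypothetical placeholders, and the real table formatting may differ.

```python
FP8_CATEGORIES = {"moe", "gemm_fp8"}  # categories described above as using fp8 math


def peak_tflops_for(category: str, bf16_peak_tflops: float) -> float:
    """Pick the MFU denominator: fp8 peak (2x bf16) for fp8 kernels, else bf16 peak."""
    if category in FP8_CATEGORIES:
        return 2.0 * bf16_peak_tflops
    return bf16_peak_tflops


def kernel_mfu_label(category: str, flops: float, measured_us: float,
                     bf16_peak_tflops: float) -> str:
    """Format MFU% for one kernel, adding a superscript 8 when the fp8 peak was used."""
    peak_flops = peak_tflops_for(category, bf16_peak_tflops) * 1e12
    theo_us = flops / peak_flops * 1e6
    mfu = theo_us / measured_us * 100.0
    suffix = "⁸" if category in FP8_CATEGORIES else ""
    return f"{mfu:.1f}%{suffix}"


# Placeholder numbers: 5e9 FLOPs measured at 100 us on a hypothetical 100 TFLOPS bf16 peak.
bf16_label = kernel_mfu_label("gemm_bf16", 5e9, 100.0, 100.0)
fp8_label = kernel_mfu_label("gemm_fp8", 5e9, 100.0, 100.0)
print(bf16_label, fp8_label)
```

The same FLOPs and duration yield half the MFU once the fp8 peak is the denominator, which is why marked and unmarked percentages should not be compared directly.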

Method B: `--kernel-detail` (operator-level MFU, legacy)

`--kernel-detail` takes the same input as `--kernel-flow` but outputs an operator-level summary table (aggregated by operator, not per kernel). Use it when you want a compact view.

The example below uses `--kernel-ms`, which takes per-category totals rather than per-kernel detail:

```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "Qwen3-235B-A22B" \
  --batch-size 1 --seq-len 8192 \
  --tp 8 --dp 1 --ep 8 \
  --gpu h20 --dtype bf16 \
  --kernel-ms '{
    "mla": 4.922, "moe": 1.644, "allreduce": 0.769,
    "hadamard": 0.348, "mhc": 1.388, "gemm_fp8": 1.692,
    "gemm_bf16": 0.125, "rmsnorm": 0.227, "quant": 0.311,
    "rope": 0.209, "topk": 0.122, "activation": 0.071,
    "other": 0.437
  }'
```

The `--kernel-ms` parameter accepts a JSON object mapping kernel category names to their measured durations in milliseconds. It distributes time FLOPs-proportionally across entire categories, which is less precise than `--kernel-detail` because the generic GEMM categories (gemm_fp8, gemm_bf16) span multiple operator categories.

Output includes:
  • Model architecture summary (layers, hidden_size, attention_type, MoE config)
  • Per-layer compute flow: operator sequence with tensor dimensions, FLOPs, shape_in→shape_out
  • Per-operator MFU table: each operator's FLOPs, theoretical time, measured time (from trace), MFU%
  • Kernel → operator mapping explanation (direct-match vs FLOPs-proportional vs overhead)
  • Overall and per-layer MFU
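
When the per-category durations come from a script rather than being typed by hand, the `--kernel-ms` argument can be generated and sanity-checked first. The sketch below reuses the durations from the example command; building the argument with `json.dumps` and `shlex.quote` is one reasonable approach, not something the simulator requires.

```python
import json
import shlex

# Per-category durations in milliseconds, copied from the example command.
kernel_ms = {
    "mla": 4.922, "moe": 1.644, "allreduce": 0.769,
    "hadamard": 0.348, "mhc": 1.388, "gemm_fp8": 1.692,
    "gemm_bf16": 0.125, "rmsnorm": 0.227, "quant": 0.311,
    "rope": 0.209, "topk": 0.122, "activation": 0.071,
    "other": 0.437,
}

# The categories should add up to roughly the measured layer time.
total_ms = sum(kernel_ms.values())
shares = {name: ms / total_ms * 100.0 for name, ms in kernel_ms.items()}
dominant = max(shares, key=shares.get)

# Shell-safe argument text for the simulator command line.
kernel_ms_arg = shlex.quote(json.dumps(kernel_ms))

print(f"total: {total_ms:.3f} ms, dominant category: {dominant} ({shares[dominant]:.1f}%)")
print(f"--kernel-ms {kernel_ms_arg}")
```

Checking the total against the measured layer latency before running the simulator helps catch a missing category or a unit mix-up between microseconds and milliseconds.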

When To Use It

  • when you need compute-level detail for a known model or config
  • when the user asks about execution flow, tensor dimensions, or FLOPs for a specific serving shape
  • when the user asks about MFU and can provide measured forward-pass latency
  • when comparing compute profiles across different parallelism configurations

Useful Commands

List known model IDs:
```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-models
```

List known GPU types:

```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py --list-gpus
```

Emit JSON for automation:

```bash
python3 skills/model-compute-simulation/scripts/model_compute_simulator.py "GLM-5" --format json
```
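
For automation, the JSON mode can be wrapped in a small helper. This sketch assumes that `--format json` writes a single JSON document to stdout and nothing else; the output schema is not documented here, so the helper returns the parsed object without interpreting it. `build_command` and `run_simulator` are hypothetical names.

```python
import json
import subprocess
from typing import Any, List

SCRIPT = "skills/model-compute-simulation/scripts/model_compute_simulator.py"


def build_command(model: str, *extra_args: str) -> List[str]:
    """Assemble the argv list for a JSON-mode simulator run."""
    return ["python3", SCRIPT, model, *extra_args, "--format", "json"]


def run_simulator(model: str, *extra_args: str) -> Any:
    """Run the simulator and parse stdout as JSON (assumes stdout is pure JSON)."""
    result = subprocess.run(
        build_command(model, *extra_args),
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


command = build_command("GLM-5", "--tp", "8")
print(" ".join(command))
```

Calling `run_simulator("GLM-5")` only works from a directory where the relative script path resolves. `check=True` raises `subprocess.CalledProcessError` on a non-zero exit, for example when the model is not indexed.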

Reporting Checklist

Include:
  1. Model architecture summary: model name, config source, num_layers, hidden_size, attention_type, MoE config (num_experts, topk, shared_experts), MHC, head_dim
  2. Serving configuration: batch_size, seq_len, TP, DP, EP, GPU, dtype
  3. Per-layer compute flow (showing first representative layer in detail):
    • Operator sequence table: name, category, FLOPs, shape_in → shape_out
    • Attention vs MoE/FFN FLOPs proportion
  4. Total model FLOPs for a single forward pass
  5. Kernel-level MFU table (when `--kernel-flow` is provided):
    • Preserves ALL kernel rows from the compute flow (never deleted)
    • Per-kernel columns: `# | Half | Category | Simplified Name | dur(us) | % | Mapped Op | FLOPs | Theo(us) | MFU% | shape_in→shape_out`
    • Direct-match kernels: show mapped operator FLOPs/MFU
    • Overhead kernels: show N/A for FLOPs/MFU, row preserved
  6. Operator-level MFU table (when `--kernel-detail` or `--measured-ms` is provided):
    • Each operator: name, category, total FLOPs, per-GPU FLOPs, theoretical time, measured time (from trace), MFU%
    • Kernel category → operator mapping explained
  7. Overall MFU and per-layer MFU
  8. One-line summary: dominant compute category, MFU status, key bottleneck

Trace-Based Validation (extract_compute_flow_from_trace.py)

Use `scripts/extract_compute_flow_from_trace.py` to extract the real operator sequence and tensor dimensions from a torch profiler trace, then compare them against the static template as ground-truth validation.

```bash
# Extract compute flow from a trace
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
  --input /path/to/trace.json.gz --format text

# Compare trace against static template
python3 skills/model-compute-simulation/scripts/extract_compute_flow_from_trace.py \
  --input /path/to/trace.json.gz \
  --compare qwen3-235b-a22b \
  --batch-size 1 --seq-len 1 --tp 8 --ep 8
```
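
Before running the extraction script, it can be useful to confirm that a trace actually carries shape information. The sketch below assumes the usual torch profiler Chrome-trace layout: a top-level `traceEvents` list whose `cpu_op` events hold an `Input Dims` entry under `args` when `record_shapes=True` was set. It is an independent illustration, not the logic of `extract_compute_flow_from_trace.py`, and the sample events are synthetic.

```python
import gzip
import json
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def load_trace_events(path: str) -> List[Dict[str, Any]]:
    """Load the traceEvents list from a .json or .json.gz profiler trace."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle).get("traceEvents", [])


def iter_cpu_ops_with_shapes(events: Iterable[Dict[str, Any]]) -> Iterator[Tuple[str, Any]]:
    """Yield (op name, input dims) for cpu_op events that recorded shapes."""
    for event in events:
        if event.get("cat") != "cpu_op":
            continue
        dims = event.get("args", {}).get("Input Dims")
        if dims:
            yield event["name"], dims


# Synthetic events for illustration; a real trace would come from load_trace_events().
sample_events = [
    {"cat": "cpu_op", "name": "aten::mm", "args": {"Input Dims": [[1, 4096], [4096, 1024]]}},
    {"cat": "cpu_op", "name": "aten::add", "args": {}},   # no shapes recorded
    {"cat": "kernel", "name": "gemm_kernel", "args": {}},  # GPU kernel, not a cpu_op
]

ops_with_shapes = list(iter_cpu_ops_with_shapes(sample_events))
print(ops_with_shapes)
```

If this yields nothing for a real trace, the capture most likely ran without shape recording, and FLOPs cannot be derived from it.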

Compute Flow Confirmation Hierarchy

When the static template or trace extraction cannot fully confirm the compute process (e.g. ambiguous scope, missing shapes, new model architecture), follow this escalation hierarchy:
  1. Static template (`model_compute_simulator.py` + `model-config-index.json`): fast, covers known models
  2. Trace extraction (`extract_compute_flow_from_trace.py`): validates the template against real execution
  3. Inference framework source code: when the trace is insufficient (missing `Input Dims`, CUDA Graph replay, compiled kernels without scope), read the model's forward flow directly from the serving framework source:
    • SGLang: `python/sglang/srt/models/<model_name>.py`, which contains the `forward()` method with the exact operator sequence, tensor shapes, and parallelism split logic
    • vLLM: `vllm/model_executor/models/<model_name>.py`
    • TensorRT-LLM: `cpp/tensorrt_llm/pyexecutor/py_executor.cpp` + model config files
    When consulting framework source, focus on:
    • The `forward()` method: operator call order and residual connections
    • QKV / O projection: whether LoRA-style down/up projections are used (`q_lora_rank`, `o_lora_rank`)
    • MoE routing: top-k selection, shared vs routed expert split
    • TP/EP slicing: which dimensions are split and how FLOPs divide across GPUs
    • Any model-specific ops not in the static template (e.g. MHC, Hadamard, indexer)
    Action: If the framework source reveals discrepancies with the static template, update `model-config-index.json` and/or `build_layer_ops()` accordingly.

Limitations of Trace Extraction

| Limitation | Detail | Workaround |
|---|---|---|
| `record_shapes=True` required | Trace must be captured with shape recording enabled; without it, `Input Dims` fields are absent and FLOPs cannot be computed | SGLang live capture and vLLM `torch_profiler_with_stack=true` already enable this; TensorRT-LLM requires a `py_executor.py` override adding `record_shapes=True` |
| CUDA Graph mode | During graph replay, `cpu_op` events may only appear once (at capture time); shape information for replayed iterations is not re-recorded | The script detects graph capture phases and annotates affected ops; use eager-mode traces for full coverage |
| TP-sliced dimensions | Trace shows post-TP-split dimensions (e.g. `H/TP`), not the full-model view | Use `--tp` in `--compare` mode to scale trace FLOPs back to full-model equivalents |
| Scope attribution quality | Python scope depends on `with_stack=True`; some frameworks or compiled paths may produce shallow or missing scope chains | Graceful degradation: ops with unresolved scope are categorized as "other" |
| Not a replacement for static templates | Trace extraction is a validation and discovery tool; static templates remain the primary fast-analysis path | Use trace extraction to verify templates for new models, then update `model-config-index.json` if discrepancies are found |

References

  • `references/model-config-index.json`: model configuration parameters (hidden_size, expert counts, MLA ranks, etc.).
  • `references/gpu-specs.json`: GPU peak FLOPS specifications for MFU calculation.
  • `scripts/extract_compute_flow_from_trace.py`: trace-based compute flow extraction and template validation tool.