Flash-MoE Inference Engine

Skill by ara.so, part of the Daily 2026 Skills collection.
Flash-MoE is a pure C/Objective-C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397B-parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand: no Python, no ML frameworks, just C, Objective-C, and hand-tuned Metal shaders.

Requirements

  • Hardware: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free
  • OS: macOS 26+ (Darwin 25+)
  • Tools: Xcode Command Line Tools, Python 3.x (for weight extraction only)
  • Model: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace)

Installation & Build

```bash
# Clone the repo
git clone https://github.com/danveloper/flash-moe
cd flash-moe/metal_infer

# Build everything
make

# Verify build artifacts
ls infer chat main
```

The Makefile compiles `infer.m`, `chat.m`, and `main.m`, and compiles the Metal shaders in `shaders.metal`.

Weight Preparation

Step 1: Extract non-expert weights

```bash
# From the metal_infer/ directory,
# point to your downloaded Qwen3.5-397B safetensors directory
python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/

# Produces:
#   model_weights.bin   (~5.5GB, mmap'd at runtime)
#   model_weights.json  (tensor manifest)
#   vocab.bin           (vocabulary)
#   tokenizer.bin       (BPE tokenizer data)
```

Step 2: Pack expert weights (4-bit, production)

```bash
# From the repo root
python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/

# Produces the packed_experts/ directory (~209GB)
# Each expert is a separate file: layer_XX_expert_YYYY.bin
```

Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)

```bash
# Convert 4-bit experts to 2-bit (saves ~89GB; ~120GB total)
python3 metal_infer/repack_experts_2bit.py \
    metal_infer/packed_experts/ \
    metal_infer/packed_experts_2bit/
```

Key Commands

Basic inference

```bash
cd metal_infer

# 4-bit inference (production quality, tool calling works)
./infer --prompt "Explain quantum computing" --tokens 100

# 2-bit inference (faster, breaks JSON/tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit

# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
```

Interactive chat with tool calling

```bash
# Opens a TUI with full tool calling support
# Uses 4-bit experts by default
./chat
```

MoE-only benchmark (measures expert throughput)

```bash
# Runs a pure expert forward-pass benchmark
# Reports tokens/sec without attention overhead
./main
```

Project Structure

flash-moe/
├── paper/
│   └── flash_moe.pdf          # Full technical paper
├── metal_infer/
│   ├── infer.m                # Complete inference engine (~7000 lines)
│   ├── shaders.metal          # Metal compute kernels (~1200 lines)
│   ├── chat.m                 # Interactive chat TUI
│   ├── tokenizer.h            # Single-header C BPE tokenizer (449 lines)
│   ├── main.m                 # MoE-only benchmark
│   ├── Makefile
│   ├── extract_weights.py     # Safetensors → model_weights.bin
│   ├── repack_experts_2bit.py # 4-bit → 2-bit requantization
│   ├── train_predictor.py     # Expert routing prediction analysis
│   ├── model_weights.bin      # Non-expert weights (mmap'd)
│   ├── model_weights.json     # Tensor manifest
│   ├── vocab.bin
│   ├── tokenizer.bin
│   ├── packed_experts/        # 4-bit expert files (209GB)
│   └── packed_experts_2bit/   # 2-bit expert files (120GB, optional)
├── repack_experts.py          # 4-bit expert packing from safetensors
├── progress.py                # Results visualization
└── results.tsv                # Experiment log

Architecture Overview

The model has 60 transformer layers:
  • 45 GatedDeltaNet (linear attention) layers
  • 15 standard full attention layers
  • Each layer: 512 experts, K=4 activated per token + 1 shared expert
  • Hidden dimension: 4096

Per-layer pipeline (4.28ms average at 4-bit)

CMD3(prev) → CMD1: attention projections + delta-net  [1.22ms GPU]
           → CPU: flush results                       [0.01ms CPU]  
           → CMD2: o_proj + norm + routing + shared    [0.55ms GPU]
           → CPU: softmax + topK routing               [0.003ms]
           → I/O: parallel pread K=4 experts           [2.41ms SSD]
           → CMD3: expert forward + combine + norm     [0.04ms encode, DEFERRED]

Metal Shader Kernels

The `shaders.metal` file contains hand-written kernels. Key kernels:

```metal
// 4-bit dequantized matrix-vector multiply (FMA-optimized)
// Key insight: fma(nibble, scale*x, bias*x) instead of (nibble*scale + bias)*x
// Pre-compute scale*x and bias*x to fuse dequant+multiply in one FMA instruction

kernel void matvec_4bit_fma(
    device const uint8_t* weights [[buffer(0)]],
    device const float* scales    [[buffer(1)]],
    device const float* biases    [[buffer(2)]],
    device const float* x         [[buffer(3)]],
    device float* out             [[buffer(4)]],
    uint tid [[thread_position_in_threadgroup]],
    uint gid [[threadgroup_position_in_grid]])
{
    // ... tiled SIMD-reduced FMA kernel
    // 12% faster than the naive (nibble * scale + bias) * x
}

// Fused SwiGLU activation
kernel void swiglu(device float* gate [[buffer(0)]],
                   device const float* up [[buffer(1)]],
                   uint gid [[thread_position_in_grid]])
{
    float g = gate[gid];
    gate[gid] = (g / (1.0f + exp(-g))) * up[gid];
}

// RMS normalization (two-pass)
kernel void rms_norm_pass1(...) // sum-of-squares reduction
kernel void rms_norm_pass2(...) // apply normalization

// GPU RoPE (fused with Q deinterleave and K normalization)
kernel void rope_qk(...)

// MoE combine + residual + sigmoid gate (fused)
kernel void moe_combine_residual(...)
```

SSD Expert Streaming Pattern

The core innovation is loading only the K=4 active experts per layer from SSD:

```objc
// Parallel expert loading using GCD dispatch groups
// From infer.m (conceptual pattern)

dispatch_group_t group = dispatch_group_create();
dispatch_queue_t ioQueue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);

for (int k = 0; k < K_EXPERTS; k++) {
    int expert_id = top_k_indices[k];
    dispatch_group_async(group, ioQueue, ^{
        // Each expert: ~6.75MB at 4-bit
        char path[256];
        snprintf(path, sizeof(path),
                 "packed_experts/layer_%02d_expert_%04d.bin",
                 layer, expert_id);

        int fd = open(path, O_RDONLY);
        // pread() on a background queue; the OS page cache handles LRU
        pread(fd, expert_buffer[k], expert_size, 0);
        close(fd);
    });
}

dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
// GPU compute follows; the serial pipeline is hardware-optimal on Apple Silicon
```

Why `pread()` and not `mmap()`: `mmap` incurs per-page fault overhead on cold data (~5x slower). Direct `pread()` with the OS page cache naturally achieves a ~71% hit rate.

GatedDeltaNet Linear Attention (BLAS)

The recurrence update uses Accelerate BLAS, 64% faster than a scalar implementation:

```objc
// GatedDeltaNet state update per head (conceptual pattern)
// state: 128×128 float matrix, 64 heads
// From infer.m

#import <Accelerate/Accelerate.h>

for (int h = 0; h < 64; h++) {
    float* S = state + h * 128 * 128;  // 128×128 state matrix
    float* q = Q + h * 128;
    float* k = K + h * 128;
    float* v = V + h * 128;

    // Decay the previous state first: S = alpha * S
    cblas_sscal(128 * 128, alpha[h], S, 1);

    // Rank-1 update: S += beta * (k ⊗ v), i.e. S[i][j] += beta * k[i] * v[j]
    cblas_sger(CblasRowMajor, 128, 128,
               beta[h], k, 1, v, 1, S, 128);

    // Readout: o = Sᵀ @ q (the query attends against the stored keys)
    cblas_sgemv(CblasRowMajor, CblasTrans,
                128, 128, 1.0f, S, 128, q, 1, 0.0f,
                output + h * 128, 1);
}
```
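In equation form, the gated delta-rule recurrence this loop approximates is (one standard formulation; the paper's exact conventions may differ):

```latex
S_t = \alpha_t\, S_{t-1} + \beta_t\, k_t v_t^{\top},
\qquad
o_t = S_t^{\top} q_t
```

Here each head's state S is the 128×128 matrix, alpha is the decay gate, and beta the write strength: `cblas_sscal` applies the decay, `cblas_sger` the rank-1 write, and `cblas_sgemv` the readout.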

Performance Configuration

4-bit (production default)

  • Quality: Excellent — full tool calling, correct JSON
  • Speed: 4.36 tok/s
  • Disk: 209GB

2-bit (speed testing only)

  • Quality: Good, but breaks JSON/tool calling (`\name\` instead of `"name"`)
  • Speed: 5.74 tok/s (7.05 peak single token with warm cache)
  • Disk: 120GB
  • Uses the `F_NOCACHE` flag to avoid page cache thrashing
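A minimal sketch of how a 2-bit expert file might be opened with that flag (an illustration, not the actual infer.m code; `open_expert_nocache` is a hypothetical helper):

```c
// Open an expert file with page caching disabled, so streaming large 2-bit
// experts does not evict hotter cached data. F_NOCACHE is macOS-specific;
// the #ifdef keeps this sketch compiling elsewhere.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int open_expert_nocache(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
#ifdef F_NOCACHE
    if (fcntl(fd, F_NOCACHE, 1) < 0)  // reads now bypass the unified buffer cache
        perror("fcntl(F_NOCACHE)");
#endif
    return fd;
}
```

Reads through such a descriptor go straight to the caller's buffers instead of churning the page cache that the 4-bit path relies on.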

What NOT to Try (Learned from 58 Experiments)

| Approach | Why it fails |
| --- | --- |
| `mmap()` expert files | Per-page fault overhead: 5x slower than `pread()` |
| `dispatch_io` | `dispatch_data` management overhead: -70% |
| `F_RDADVISE` prefetch | SSD DMA and GPU share the memory controller; concurrent access: -73% GPU speed |
| Custom Metal LRU cache | GPU memory pressure: -38% vs OS page cache |
| LZ4 expert compression | Decompress overhead exceeds warm-cache savings: -13% |
| Temporal expert prediction | 25% hit rate, wastes SSD bandwidth: -18% |
| Speculative early routing | Cache pollution: -38% |
| MTP speculative decoding | MoE I/O scales per token (unlike dense models): break-even |
| Spin-poll GPU wait | CPU thermal throttling competes with the GPU: -23% |
| Parallel SSD + GPU overlap | Unified memory controller arbitration: net negative |

Key principle: On Apple Silicon, GPU DMA and SSD DMA share the same memory controller. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.

Troubleshooting

Build fails

```bash
# Ensure Xcode CLI tools are installed
xcode-select --install

# Check the Metal compiler is available
xcrun -sdk macosx metal --version
```

Out of memory

The engine is designed to use ~6GB of active memory:
  • 5.5GB: `model_weights.bin` (mmap'd, read-only)
  • ~200MB: Metal scratch buffers
  • Remaining ~42GB: OS page cache for expert data

If you see OOM, check for other processes consuming unified memory:

```bash
sudo memory_pressure
vm_stat
```

Slow performance

```bash
# Check SSD speed: ~17GB/s is needed for target performance
# Run with timing to identify the bottleneck
./infer --prompt "Hello" --tokens 5 --timing

# Verify packed_experts/ is on the internal SSD, not an external drive
diskutil info /
```

Wrong expert directory

Default paths expected by `infer.m`:
  • metal_infer/packed_experts/ (4-bit)
  • metal_infer/packed_experts_2bit/ (2-bit)

Ensure you run from the metal_infer/ directory:

```bash
cd metal_infer
./infer --prompt "test"
```

Tool calling broken

Use 4-bit, not 2-bit. The 2-bit quantization corrupts quote characters in JSON output, making tool calling unreliable. Always use the default 4-bit configuration for agentic workloads.

Memory Safety

The engine explicitly manages all allocations:
  • No unbounded caches
  • Expert data never accumulates in GPU memory
  • `model_weights.bin` is mmap'd read-only; the kernel manages pages
  • Expert files are opened, read, and closed on each inference step