onnx-webgpu-converter


# ONNX WebGPU Model Converter

Convert any HuggingFace model to ONNX and run it in the browser with Transformers.js + WebGPU.

## Workflow Overview

1. Check if an ONNX version already exists on HuggingFace
2. Set up a Python environment with optimum
3. Export the model to ONNX with optimum-cli
4. Quantize for the target deployment (WebGPU vs. WASM)
5. Upload to the HuggingFace Hub (optional)
6. Use in Transformers.js with WebGPU

## Step 1: Check for Existing ONNX Models

Before converting, check whether the model already has an ONNX version on the Hub. If one is found, skip straight to Step 6.
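This check can be scripted; a minimal sketch, assuming the common community naming conventions (the `onnx-community` mirror and `-ONNX` suffix patterns below are heuristics, not an official API):

```python
# Heuristic candidate repo IDs that may hold an existing ONNX export.
# The naming patterns are assumptions based on common Hub conventions.
def onnx_repo_candidates(model_id: str) -> list[str]:
    name = model_id.split("/")[-1]
    return [
        f"onnx-community/{name}-ONNX",  # onnx-community mirror convention
        f"onnx-community/{name}",
        f"{model_id}-ONNX",             # author-hosted ONNX variant
    ]

# With network access, each candidate could then be probed, e.g. via
# huggingface_hub:
#   from huggingface_hub import file_exists
#   found = [r for r in onnx_repo_candidates("Qwen/Qwen2.5-0.5B-Instruct")
#            if file_exists(r, "onnx/model.onnx")]
```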

## Step 2: Environment Setup

```bash
# Create venv (recommended)
python -m venv onnx-env && source onnx-env/bin/activate

# Install optimum with ONNX support
pip install "optimum[onnx]" onnxruntime

# For GPU-accelerated export (optional)
pip install onnxruntime-gpu
```

**Verify installation:**
```bash
optimum-cli export onnx --help
```

## Step 3: Export to ONNX

### Basic Export (auto-detect task)

```bash
optimum-cli export onnx --model <model_id_or_path> ./output_dir/
```

### With Explicit Task

```bash
optimum-cli export onnx \
  --model <model_id> \
  --task <task> \
  ./output_dir/
```

Common tasks: `text-generation`, `text-classification`, `feature-extraction`, `image-classification`, `automatic-speech-recognition`, `object-detection`, `image-segmentation`, `question-answering`, `token-classification`, `zero-shot-classification`

For decoder models, append `-with-past` for KV cache reuse (default behavior): `text-generation-with-past`, `text2text-generation-with-past`, `automatic-speech-recognition-with-past`

### Full CLI Reference

| Flag | Description |
|------|-------------|
| `-m MODEL, --model MODEL` | HuggingFace model ID or local path (required) |
| `--task TASK` | Export task (auto-detected if on Hub) |
| `--opset OPSET` | ONNX opset version (default: auto) |
| `--device DEVICE` | Export device, `cpu` (default) or `cuda` |
| `--optimize {O1,O2,O3,O4}` | ONNX Runtime optimization level |
| `--monolith` | Force single ONNX file (vs. split encoder/decoder) |
| `--no-post-process` | Skip post-processing (e.g., decoder merging) |
| `--trust-remote-code` | Allow custom model code from the Hub |
| `--pad_token_id ID` | Override pad token (needed for some models) |
| `--cache_dir DIR` | Cache directory for downloaded models |
| `--batch_size N` | Batch size for dummy inputs |
| `--sequence_length N` | Sequence length for dummy inputs |
| `--framework {pt}` | Source framework |
| `--atol ATOL` | Absolute tolerance for validation |

### Optimization Levels

| Level | Description |
|-------|-------------|
| O1 | Basic general optimizations |
| O2 | Basic + extended + transformer fusions |
| O3 | O2 + GELU approximation |
| O4 | O3 + mixed-precision fp16 (GPU only, requires `--device cuda`) |

## Step 4: Quantize for Web Deployment

### Quantization Types for Transformers.js

| dtype | Precision | Best For | Size Reduction |
|-------|-----------|----------|----------------|
| `fp32` | Full 32-bit | Maximum accuracy | None (baseline) |
| `fp16` | Half 16-bit | WebGPU default quality | ~50% |
| `q8` / `int8` | 8-bit | WASM default, good balance | ~75% |
| `q4` / `bnb4` | 4-bit | Maximum compression | ~87% |
| `q4f16` | 4-bit weights, fp16 compute | WebGPU + small size | ~87% |
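The size-reduction column follows directly from bits per weight. A back-of-envelope estimator (real ONNX files add graph and metadata overhead, so treat the numbers as lower bounds):

```python
# Approximate weight payload per dtype. Mixed formats like q4f16 are
# dominated by their 4-bit weights, so they are mapped to 4 bits here.
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "q8": 8, "int8": 8,
                   "q4": 4, "bnb4": 4, "q4f16": 4}

def approx_weight_mb(num_params: int, dtype: str) -> float:
    """Weight payload in MB, ignoring graph/metadata overhead."""
    return num_params * BITS_PER_WEIGHT[dtype] / 8 / 1e6

# A 100M-parameter model: fp32 -> 400 MB, fp16 -> 200 MB,
# q8 -> 100 MB, q4 -> 50 MB (matching the reduction column above).
```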

### Using optimum-cli Quantization

```bash
# Dynamic quantization (post-export)
optimum-cli onnxruntime quantize \
  --onnx_model ./output_dir/ \
  --avx512 \
  -o ./quantized_dir/
```

### Using the Python API for Finer Control

```python
from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model = ORTModelForSequenceClassification.from_pretrained("./output_dir/")
quantizer = ORTQuantizer.from_pretrained(model)
config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./quantized_dir/", quantization_config=config)
```

### Producing Multiple dtype Variants for Transformers.js

To provide fp32, fp16, q8, and q4 variants (like `onnx-community` models), organize the output as:

```
model_onnx/
├── onnx/
│   ├── model.onnx              # fp32
│   ├── model_fp16.onnx         # fp16
│   ├── model_quantized.onnx    # q8
│   └── model_q4.onnx           # q4
├── config.json
├── tokenizer.json
└── tokenizer_config.json
```
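Transformers.js resolves a requested `dtype` to one of these file names. A small helper capturing the layout above (the suffix mapping mirrors the tree shown and common `onnx-community` naming; verify it against the Transformers.js version you target):

```python
# dtype -> file-name suffix inside the onnx/ subfolder, matching the
# directory tree above ("" means the plain fp32 model.onnx).
DTYPE_SUFFIX = {
    "fp32": "",            # model.onnx
    "fp16": "_fp16",       # model_fp16.onnx
    "q8": "_quantized",    # model_quantized.onnx
    "q4": "_q4",           # model_q4.onnx
}

def variant_path(dtype: str, base: str = "model") -> str:
    """Relative path of the ONNX file expected for a given dtype."""
    return f"onnx/{base}{DTYPE_SUFFIX[dtype]}.onnx"
```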

## Step 5: Upload to HuggingFace Hub (Optional)

```bash
# Login
huggingface-cli login

# Upload
huggingface-cli upload <your-username>/<model-name>-onnx ./output_dir/
```

Add the `transformers.js` tag to the model card for discoverability.

## Step 6: Use in Transformers.js with WebGPU

### Install

```bash
npm install @huggingface/transformers
```

### Basic Pipeline with WebGPU

```javascript
import { pipeline } from "@huggingface/transformers";

const pipe = await pipeline("task-name", "model-id-or-path", {
  device: "webgpu",    // GPU acceleration
  dtype: "q4",         // Quantization level
});

const result = await pipe("input text");
```

### Per-Module dtypes (Encoder-Decoder Models)

Some models (Whisper, Florence-2) need different quantization per component:

```javascript
const model = await Florence2ForConditionalGeneration.from_pretrained(
  "onnx-community/Florence-2-base-ft",
  {
    dtype: {
      embed_tokens: "fp16",
      vision_encoder: "fp16",
      encoder_model: "q4",
      decoder_model_merged: "q4",
    },
    device: "webgpu",
  },
);
```

For detailed Transformers.js WebGPU usage patterns, see `references/webgpu-usage.md`.

## Troubleshooting

For conversion errors and common issues, see `references/conversion-guide.md`.

### Quick Fixes

- **"Task not found"**: Use the `--task` flag explicitly. For decoder models, try `text-generation-with-past`.
- **"trust_remote_code"**: Add the `--trust-remote-code` flag for custom model architectures.
- **Out of memory**: Use `--device cpu` and a smaller `--batch_size`.
- **Validation fails**: Try `--no-post-process` or increase `--atol`.
- **Model not supported**: Check the list of supported architectures (120+ architectures are supported).
- **WebGPU falls back to WASM**: Ensure the browser supports WebGPU (Chrome 113+, Edge 113+).

## Supported Task → Pipeline Mapping

| Task | Transformers.js Pipeline | Example Model |
|------|--------------------------|---------------|
| `text-classification` | `sentiment-analysis` | distilbert-base-uncased-finetuned-sst-2 |
| `text-generation` | `text-generation` | Qwen2.5-0.5B-Instruct |
| `feature-extraction` | `feature-extraction` | mxbai-embed-xsmall-v1 |
| `automatic-speech-recognition` | `automatic-speech-recognition` | whisper-tiny.en |
| `image-classification` | `image-classification` | mobilenetv4_conv_small |
| `object-detection` | `object-detection` | detr-resnet-50 |
| `image-segmentation` | `image-segmentation` | segformer-b0 |
| `zero-shot-image-classification` | `zero-shot-image-classification` | clip-vit-base-patch32 |
| `depth-estimation` | `depth-estimation` | depth-anything-small |
| `translation` | `translation` | nllb-200-distilled-600M |
| `summarization` | `summarization` | bart-large-cnn |