onnx-webgpu-converter


# ONNX WebGPU Model Converter

Convert any HuggingFace model to ONNX and run it in the browser with Transformers.js + WebGPU.

## Workflow Overview

1. Check if an ONNX version already exists on HuggingFace
2. Set up a Python environment with optimum
3. Export the model to ONNX with optimum-cli
4. Quantize for the target deployment (WebGPU vs. WASM)
5. Upload to the HuggingFace Hub (optional)
6. Use in Transformers.js with WebGPU

## Step 1: Check for Existing ONNX Models

Before converting, check whether the model already has an ONNX version on the Hub. If one is found, skip straight to Step 6.
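This check can be scripted; a minimal sketch, assuming the common community naming conventions (the `onnx-community` mirror and `-ONNX` suffix patterns below are heuristics, not an official API):

```python
# Heuristic candidate repo IDs that may hold an existing ONNX export.
# The naming patterns are assumptions based on common Hub conventions.
def onnx_repo_candidates(model_id: str) -> list[str]:
    name = model_id.split("/")[-1]
    return [
        f"onnx-community/{name}-ONNX",  # onnx-community mirror convention
        f"onnx-community/{name}",
        f"{model_id}-ONNX",             # author-hosted ONNX variant
    ]

# With network access, each candidate could then be probed, e.g. via
# huggingface_hub:
#   from huggingface_hub import file_exists
#   found = [r for r in onnx_repo_candidates("Qwen/Qwen2.5-0.5B-Instruct")
#            if file_exists(r, "onnx/model.onnx")]
```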

## Step 2: Environment Setup

```bash
# Create venv (recommended)
python -m venv onnx-env && source onnx-env/bin/activate

# Install optimum with ONNX support
pip install "optimum[onnx]" onnxruntime

# For GPU-accelerated export (optional)
pip install onnxruntime-gpu
```

**Verify installation:**
```bash
optimum-cli export onnx --help
```

## Step 3: Export to ONNX

### Basic Export (auto-detect task)

```bash
optimum-cli export onnx --model <model_id_or_path> ./output_dir/
```

### With Explicit Task

```bash
optimum-cli export onnx \
  --model <model_id> \
  --task <task> \
  ./output_dir/
```

Common tasks: `text-generation`, `text-classification`, `feature-extraction`, `image-classification`, `automatic-speech-recognition`, `object-detection`, `image-segmentation`, `question-answering`, `token-classification`, `zero-shot-classification`

For decoder models, append `-with-past` for KV cache reuse (default behavior): `text-generation-with-past`, `text2text-generation-with-past`, `automatic-speech-recognition-with-past`

### Full CLI Reference

| Flag | Description |
|------|-------------|
| `-m MODEL, --model MODEL` | HuggingFace model ID or local path (required) |
| `--task TASK` | Export task (auto-detected if on Hub) |
| `--opset OPSET` | ONNX opset version (default: auto) |
| `--device DEVICE` | Export device, `cpu` (default) or `cuda` |
| `--optimize {O1,O2,O3,O4}` | ONNX Runtime optimization level |
| `--monolith` | Force single ONNX file (vs. split encoder/decoder) |
| `--no-post-process` | Skip post-processing (e.g., decoder merging) |
| `--trust-remote-code` | Allow custom model code from the Hub |
| `--pad_token_id ID` | Override pad token (needed for some models) |
| `--cache_dir DIR` | Cache directory for downloaded models |
| `--batch_size N` | Batch size for dummy inputs |
| `--sequence_length N` | Sequence length for dummy inputs |
| `--framework {pt}` | Source framework |
| `--atol ATOL` | Absolute tolerance for validation |

### Optimization Levels

| Level | Description |
|-------|-------------|
| O1 | Basic general optimizations |
| O2 | Basic + extended + transformer fusions |
| O3 | O2 + GELU approximation |
| O4 | O3 + mixed-precision fp16 (GPU only, requires `--device cuda`) |

## Step 4: Quantize for Web Deployment

### Quantization Types for Transformers.js

| dtype | Precision | Best For | Size Reduction |
|-------|-----------|----------|----------------|
| `fp32` | Full 32-bit | Maximum accuracy | None (baseline) |
| `fp16` | Half 16-bit | WebGPU default quality | ~50% |
| `q8` / `int8` | 8-bit | WASM default, good balance | ~75% |
| `q4` / `bnb4` | 4-bit | Maximum compression | ~87% |
| `q4f16` | 4-bit weights, fp16 compute | WebGPU + small size | ~87% |
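The size-reduction column follows directly from bits per weight. A back-of-envelope estimator (real ONNX files add graph and metadata overhead, so treat the numbers as lower bounds):

```python
# Approximate weight payload per dtype. Mixed formats like q4f16 are
# dominated by their 4-bit weights, so they are mapped to 4 bits here.
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "q8": 8, "int8": 8,
                   "q4": 4, "bnb4": 4, "q4f16": 4}

def approx_weight_mb(num_params: int, dtype: str) -> float:
    """Weight payload in MB, ignoring graph/metadata overhead."""
    return num_params * BITS_PER_WEIGHT[dtype] / 8 / 1e6

# A 100M-parameter model: fp32 -> 400 MB, fp16 -> 200 MB,
# q8 -> 100 MB, q4 -> 50 MB (matching the reduction column above).
```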

### Using optimum-cli Quantization

```bash
# Dynamic quantization (post-export)
optimum-cli onnxruntime quantize \
  --onnx_model ./output_dir/ \
  --avx512 \
  -o ./quantized_dir/
```

### Using the Python API for Finer Control

```python
from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model = ORTModelForSequenceClassification.from_pretrained("./output_dir/")
quantizer = ORTQuantizer.from_pretrained(model)
config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./quantized_dir/", quantization_config=config)
```

### Producing Multiple dtype Variants for Transformers.js

To provide fp32, fp16, q8, and q4 variants (like `onnx-community` models), organize the output as:

```
model_onnx/
├── onnx/
│   ├── model.onnx              # fp32
│   ├── model_fp16.onnx         # fp16
│   ├── model_quantized.onnx    # q8
│   └── model_q4.onnx           # q4
├── config.json
├── tokenizer.json
└── tokenizer_config.json
```
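Transformers.js resolves a requested `dtype` to one of these file names. A small helper capturing the layout above (the suffix mapping mirrors the tree shown and common `onnx-community` naming; verify it against the Transformers.js version you target):

```python
# dtype -> file-name suffix inside the onnx/ subfolder, matching the
# directory tree above ("" means the plain fp32 model.onnx).
DTYPE_SUFFIX = {
    "fp32": "",            # model.onnx
    "fp16": "_fp16",       # model_fp16.onnx
    "q8": "_quantized",    # model_quantized.onnx
    "q4": "_q4",           # model_q4.onnx
}

def variant_path(dtype: str, base: str = "model") -> str:
    """Relative path of the ONNX file expected for a given dtype."""
    return f"onnx/{base}{DTYPE_SUFFIX[dtype]}.onnx"
```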

## Step 5: Upload to HuggingFace Hub (Optional)

```bash
# Login
huggingface-cli login

# Upload
huggingface-cli upload <your-username>/<model-name>-onnx ./output_dir/
```

Add the `transformers.js` tag to the model card for discoverability.

## Step 6: Use in Transformers.js with WebGPU

### Install

```bash
npm install @huggingface/transformers
```

### Basic Pipeline with WebGPU

```javascript
import { pipeline } from "@huggingface/transformers";

const pipe = await pipeline("task-name", "model-id-or-path", {
  device: "webgpu",    // GPU acceleration
  dtype: "q4",         // Quantization level
});

const result = await pipe("input text");
```

### Per-Module dtypes (Encoder-Decoder Models)

Some models (Whisper, Florence-2) need different quantization per component:

```javascript
const model = await Florence2ForConditionalGeneration.from_pretrained(
  "onnx-community/Florence-2-base-ft",
  {
    dtype: {
      embed_tokens: "fp16",
      vision_encoder: "fp16",
      encoder_model: "q4",
      decoder_model_merged: "q4",
    },
    device: "webgpu",
  },
);
```

For detailed Transformers.js WebGPU usage patterns, see `references/webgpu-usage.md`.

## Troubleshooting

For conversion errors and common issues, see `references/conversion-guide.md`.

### Quick Fixes

- **"Task not found"**: Use the `--task` flag explicitly. For decoder models, try `text-generation-with-past`.
- **"trust_remote_code"**: Add the `--trust-remote-code` flag for custom model architectures.
- **Out of memory**: Use `--device cpu` and a smaller `--batch_size`.
- **Validation fails**: Try `--no-post-process` or increase `--atol`.
- **Model not supported**: Check the list of supported architectures (120+ architectures are supported).
- **WebGPU falls back to WASM**: Ensure the browser supports WebGPU (Chrome 113+, Edge 113+).

## Supported Task → Pipeline Mapping

| Task | Transformers.js Pipeline | Example Model |
|------|--------------------------|---------------|
| `text-classification` | `sentiment-analysis` | distilbert-base-uncased-finetuned-sst-2 |
| `text-generation` | `text-generation` | Qwen2.5-0.5B-Instruct |
| `feature-extraction` | `feature-extraction` | mxbai-embed-xsmall-v1 |
| `automatic-speech-recognition` | `automatic-speech-recognition` | whisper-tiny.en |
| `image-classification` | `image-classification` | mobilenetv4_conv_small |
| `object-detection` | `object-detection` | detr-resnet-50 |
| `image-segmentation` | `image-segmentation` | segformer-b0 |
| `zero-shot-image-classification` | `zero-shot-image-classification` | clip-vit-base-patch32 |
| `depth-estimation` | `depth-estimation` | depth-anything-small |
| `translation` | `translation` | nllb-200-distilled-600M |
| `summarization` | `summarization` | bart-large-cnn |