onnx-webgpu-converter
ONNX WebGPU Model Converter
Convert any HuggingFace model to ONNX and run it in the browser with Transformers.js + WebGPU.
Workflow Overview
- Check if ONNX version already exists on HuggingFace
- Set up Python environment with optimum
- Export model to ONNX with optimum-cli
- Quantize for target deployment (WebGPU vs WASM)
- Upload to HuggingFace Hub (optional)
- Use in Transformers.js with WebGPU
Step 1: Check for Existing ONNX Models
Before converting, check if the model already has an ONNX version:

- Search `onnx-community/<model-name>` on HuggingFace Hub
- Check the model repo for an `onnx/` folder
- Browse https://huggingface.co/models?library=transformers.js (1200+ pre-converted models)

If found, skip to Step 6.
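This check can be scripted. A minimal sketch with hypothetical helper names — in practice you would pair it with `HfApi().list_repo_files` from `huggingface_hub` to fetch the file listing:

```python
def onnx_candidates(model_id: str) -> list[str]:
    """Repo ids worth checking for a pre-converted ONNX copy of model_id."""
    name = model_id.split("/")[-1]
    # onnx-community re-hosts many conversions under the original model name
    return [f"onnx-community/{name}", model_id]

def has_onnx_export(repo_files: list[str]) -> bool:
    """True if a repo file listing already contains exported weights in an
    onnx/ folder (meaning you can skip conversion and jump to Step 6)."""
    return any(f.startswith("onnx/") and f.endswith(".onnx") for f in repo_files)
```

For example, `has_onnx_export(["onnx/model.onnx", "config.json"])` returns `True`.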
Step 2: Environment Setup
```bash
# Create venv (recommended)
python -m venv onnx-env && source onnx-env/bin/activate

# Install optimum with ONNX support
pip install "optimum[onnx]" onnxruntime

# For GPU-accelerated export (optional)
pip install onnxruntime-gpu
```

**Verify installation:**

```bash
optimum-cli export onnx --help
```

Step 3: Export to ONNX
Basic Export (auto-detect task)
```bash
optimum-cli export onnx --model <model_id_or_path> ./output_dir/
```

With Explicit Task
```bash
optimum-cli export onnx \
  --model <model_id> \
  --task <task> \
  ./output_dir/
```

Common tasks: `text-generation`, `text-classification`, `feature-extraction`, `image-classification`, `automatic-speech-recognition`, `object-detection`, `image-segmentation`, `question-answering`, `token-classification`, `zero-shot-classification`

For decoder models, append `-with-past` for KV cache reuse (default behavior): `text-generation-with-past`, `text2text-generation-with-past`, `automatic-speech-recognition-with-past`

Full CLI Reference
| Flag | Description |
|---|---|
| `--model` | HuggingFace model ID or local path (required) |
| `--task` | Export task (auto-detected if on Hub) |
| `--opset` | ONNX opset version (default: auto) |
| `--device` | Export device (`cpu` or `cuda`) |
| `--optimize` | ONNX Runtime optimization level (`O1`-`O4`) |
| `--monolith` | Force single ONNX file (vs split encoder/decoder) |
| `--no-post-process` | Skip post-processing (e.g., decoder merging) |
| `--trust-remote-code` | Allow custom model code from Hub |
| `--pad_token_id` | Override pad token (needed for some models) |
| `--cache_dir` | Cache directory for downloaded models |
| `--batch_size` | Batch size for dummy inputs |
| `--sequence_length` | Sequence length for dummy inputs |
| `--framework` | Source framework (`pt` or `tf`) |
| `--atol` | Absolute tolerance for validation |
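The flags above can be assembled programmatically before shelling out to `optimum-cli`. A sketch with an illustrative helper name and defaults (run the result with `subprocess.run`):

```python
def build_export_cmd(model_id, out_dir, task=None, opset=None,
                     device=None, trust_remote_code=False, atol=None):
    """Assemble an optimum-cli export command from common flags."""
    cmd = ["optimum-cli", "export", "onnx", "--model", model_id]
    if task:
        cmd += ["--task", task]
    if opset is not None:
        cmd += ["--opset", str(opset)]
    if device:
        cmd += ["--device", device]
    if trust_remote_code:
        cmd.append("--trust-remote-code")
    if atol is not None:
        cmd += ["--atol", str(atol)]
    cmd.append(out_dir)  # positional output directory comes last
    return cmd
```

For example, `build_export_cmd("gpt2", "./out/", task="text-generation-with-past", opset=17)` yields the same command shown in the export examples above.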
Optimization Levels
| Level | Description |
|---|---|
| O1 | Basic general optimizations |
| O2 | Basic + extended + transformer fusions |
| O3 | O2 + GELU approximation |
| O4 | O3 + mixed precision fp16 (GPU only, requires `--device cuda`) |
Step 4: Quantize for Web Deployment
Quantization Types for Transformers.js
| dtype | Precision | Best For | Size Reduction |
|---|---|---|---|
| `fp32` | Full 32-bit | Maximum accuracy | None (baseline) |
| `fp16` | Half 16-bit | WebGPU default quality | ~50% |
| `q8` | 8-bit | WASM default, good balance | ~75% |
| `q4` | 4-bit | Maximum compression | ~87% |
| `q4f16` | 4-bit weights, fp16 compute | WebGPU + small size | ~87% |
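A rule-of-thumb picker following the trade-offs in the table. The function name and policy are illustrative, not part of Transformers.js:

```python
def pick_dtype(backend: str, prioritize_size: bool = False) -> str:
    """Suggest a Transformers.js dtype for a deployment target."""
    if backend == "webgpu":
        # fp16 for quality; q4f16 keeps fp16 compute at ~87% size reduction
        return "q4f16" if prioritize_size else "fp16"
    if backend == "wasm":
        # q8 is the WASM default balance; q4 squeezes harder
        return "q4" if prioritize_size else "q8"
    return "fp32"  # accuracy-first fallback
```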
Using optimum-cli quantization
```bash
# Dynamic quantization (post-export)
optimum-cli onnxruntime quantize \
  --onnx_model ./output_dir/ \
  --avx512 \
  -o ./quantized_dir/
```

Using Python API for finer control
```python
from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model = ORTModelForSequenceClassification.from_pretrained("./output_dir/")
quantizer = ORTQuantizer.from_pretrained(model)
config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./quantized_dir/", quantization_config=config)
```

Producing Multiple dtype Variants for Transformers.js
To provide fp32, fp16, q8, and q4 variants (like `onnx-community` models), organize output as:

```
model_onnx/
├── onnx/
│   ├── model.onnx            # fp32
│   ├── model_fp16.onnx       # fp16
│   ├── model_quantized.onnx  # q8
│   └── model_q4.onnx         # q4
├── config.json
├── tokenizer.json
└── tokenizer_config.json
```
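The naming convention can be captured in a small helper. A sketch: the suffixes follow the layout shown here, and other variants (e.g. `q4f16`) would need their own suffix entries:

```python
# dtype -> file-name suffix, per the onnx/ layout for Transformers.js variants
VARIANT_SUFFIX = {"fp32": "", "fp16": "_fp16", "q8": "_quantized", "q4": "_q4"}

def variant_filename(dtype: str, base: str = "model") -> str:
    """Relative path of the ONNX file for a given dtype variant."""
    return f"onnx/{base}{VARIANT_SUFFIX[dtype]}.onnx"
```

For example, `variant_filename("q8")` gives `onnx/model_quantized.onnx`, matching the tree above.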
Step 5: Upload to HuggingFace Hub (Optional)
```bash
# Login
huggingface-cli login

# Upload
huggingface-cli upload <your-username>/<model-name>-onnx ./output_dir/
```

Add a `transformers.js` tag to the model card for discoverability.

Step 6: Use in Transformers.js with WebGPU
Install
```bash
npm install @huggingface/transformers
```

Basic Pipeline with WebGPU
```javascript
import { pipeline } from "@huggingface/transformers";

const pipe = await pipeline("task-name", "model-id-or-path", {
  device: "webgpu", // GPU acceleration
  dtype: "q4",      // Quantization level
});

const result = await pipe("input text");
```

Per-Module dtypes (encoder-decoder models)
Some models (Whisper, Florence-2) need different quantization per component:
```javascript
import { Florence2ForConditionalGeneration } from "@huggingface/transformers";

const model = await Florence2ForConditionalGeneration.from_pretrained(
  "onnx-community/Florence-2-base-ft",
  {
    dtype: {
      embed_tokens: "fp16",
      vision_encoder: "fp16",
      encoder_model: "q4",
      decoder_model_merged: "q4",
    },
    device: "webgpu",
  },
);
```

For detailed Transformers.js WebGPU usage patterns, see references/webgpu-usage.md.
Troubleshooting
For conversion errors and common issues: See references/conversion-guide.md
Quick Fixes
- "Task not found": Pass the `--task` flag explicitly. For decoder models, try `text-generation-with-past`
- "trust_remote_code": Add the `--trust-remote-code` flag for custom model architectures
- Out of memory: Use `--device cpu` and a smaller `--batch_size`
- Validation fails: Try `--no-post-process` or increase `--atol`
- Model not supported: Check the list of supported architectures (120+ architectures supported)
- WebGPU fallback to WASM: Ensure the browser supports WebGPU (Chrome 113+, Edge 113+)
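The quick fixes above lend themselves to mechanical triage of error text. A toy sketch — the message keys and suggestion wording are illustrative, not an exhaustive error catalog:

```python
# substring of the error message -> suggested fix from the list above
FIXES = {
    "task not found": "pass --task explicitly (e.g. text-generation-with-past)",
    "trust_remote_code": "re-run with --trust-remote-code",
    "out of memory": "use --device cpu and a smaller --batch_size",
    "validation": "try --no-post-process or a larger --atol",
}

def suggest_fix(error_message: str) -> str:
    """Return the first matching quick fix, else point at the full guide."""
    msg = error_message.lower()
    for key, fix in FIXES.items():
        if key in msg:
            return fix
    return "see references/conversion-guide.md"
```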
Supported Task → Pipeline Mapping
| Task | Transformers.js Pipeline | Example Model |
|---|---|---|
| text-classification | `pipeline("text-classification")` | distilbert-base-uncased-finetuned-sst-2 |
| text-generation | `pipeline("text-generation")` | Qwen2.5-0.5B-Instruct |
| feature-extraction | `pipeline("feature-extraction")` | mxbai-embed-xsmall-v1 |
| automatic-speech-recognition | `pipeline("automatic-speech-recognition")` | whisper-tiny.en |
| image-classification | `pipeline("image-classification")` | mobilenetv4_conv_small |
| object-detection | `pipeline("object-detection")` | detr-resnet-50 |
| image-segmentation | `pipeline("image-segmentation")` | segformer-b0 |
| zero-shot-image-classification | `pipeline("zero-shot-image-classification")` | clip-vit-base-patch32 |
| depth-estimation | `pipeline("depth-estimation")` | depth-anything-small |
| translation | `pipeline("translation")` | nllb-200-distilled-600M |
| summarization | `pipeline("summarization")` | bart-large-cnn |
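Since each task id doubles as the Transformers.js pipeline id, the mapping reduces to a lookup. A small illustrative helper (example models taken from the table, shown without their org prefixes):

```python
# task -> example model from the mapping table (subset, for illustration)
EXAMPLE_MODELS = {
    "text-classification": "distilbert-base-uncased-finetuned-sst-2",
    "text-generation": "Qwen2.5-0.5B-Instruct",
    "feature-extraction": "mxbai-embed-xsmall-v1",
    "automatic-speech-recognition": "whisper-tiny.en",
}

def pipeline_snippet(task: str) -> str:
    """Emit the Transformers.js call for a task; the task id is also the pipeline id."""
    model = EXAMPLE_MODELS.get(task, "<model-id>")
    return f'await pipeline("{task}", "{model}", {{ device: "webgpu" }})'
```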