Convert HuggingFace transformer models to ONNX format for browser inference with Transformers.js and WebGPU. Use when given a HuggingFace model link to convert to ONNX, when setting up optimum-cli for ONNX export, when quantizing models (fp16, q8, q4) for web deployment, when configuring Transformers.js with WebGPU acceleration, or when troubleshooting ONNX conversion errors. Triggers on mentions of ONNX conversion, Transformers.js, WebGPU inference, optimum export, model quantization for browser, or running ML models in the browser.
```bash
npx skill4agent add jakerains/agentskills onnx-webgpu-converter
```

Converted models are conventionally published as `onnx-community/<model-name>`, with the ONNX weights in an `onnx/` subfolder.

```bash
# Create venv (recommended)
python -m venv onnx-env && source onnx-env/bin/activate

# Install optimum with ONNX support
pip install "optimum[onnx]" onnxruntime

# For GPU-accelerated export (optional)
pip install onnxruntime-gpu
```

Inspect the available export options:

```bash
optimum-cli export onnx --help
```

Basic export (the task is auto-detected for models on the Hub):

```bash
optimum-cli export onnx --model <model_id_or_path> ./output_dir/
```

Export with an explicit task:

```bash
optimum-cli export onnx \
  --model <model_id> \
  --task <task> \
  ./output_dir/
```

Common tasks: `text-generation`, `text-classification`, `feature-extraction`, `image-classification`, `automatic-speech-recognition`, `object-detection`, `image-segmentation`, `question-answering`, `token-classification`, `zero-shot-classification`. For decoder models, export the `-with-past` variants to include KV-cache support: `text-generation-with-past`, `text2text-generation-with-past`, `automatic-speech-recognition-with-past`.

| Flag | Description |
|---|---|
| `--model` | HuggingFace model ID or local path (required) |
| `--task` | Export task (auto-detected if on Hub) |
| `--opset` | ONNX opset version (default: auto) |
| `--device` | Export device, `cpu` (default) or `cuda` |
| `--optimize` | ONNX Runtime optimization level (`O1`–`O4`) |
| `--monolith` | Force single ONNX file (vs split encoder/decoder) |
| `--no-post-process` | Skip post-processing (e.g., decoder merging) |
| `--trust-remote-code` | Allow custom model code from Hub |
| `--pad_token_id` | Override pad token (needed for some models) |
| `--cache_dir` | Cache directory for downloaded models |
| `--batch_size` | Batch size for dummy inputs |
| `--sequence_length` | Sequence length for dummy inputs |
| `--framework` | Source framework (`pt` or `tf`) |
| `--atol` | Absolute tolerance for validation |
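When converting many models, it can be convenient to assemble the CLI invocation programmatically from these flags. A minimal sketch; the helper function and its defaults are illustrative, not part of optimum:

```python
import shlex

def build_export_cmd(model_id, output_dir, task=None, opset=None,
                     device="cpu", trust_remote_code=False):
    """Assemble an `optimum-cli export onnx` command line (illustrative helper)."""
    cmd = ["optimum-cli", "export", "onnx", "--model", model_id]
    if task:
        cmd += ["--task", task]
    if opset:
        cmd += ["--opset", str(opset)]
    cmd += ["--device", device]
    if trust_remote_code:
        cmd.append("--trust-remote-code")
    cmd.append(output_dir)
    return shlex.join(cmd)  # shell-safe quoting for each token

print(build_export_cmd("Qwen/Qwen2.5-0.5B-Instruct", "./output_dir/",
                       task="text-generation-with-past"))
```

The resulting string can be run directly or passed to `subprocess.run(..., shell=True)`.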

Optimization levels for `--optimize`:

| Level | Description |
|---|---|
| O1 | Basic general optimizations |
| O2 | Basic + extended + transformer fusions |
| O3 | O2 + GELU approximation |
| O4 | O3 + mixed precision fp16 (GPU only, requires `--device cuda`) |

Quantization dtypes for Transformers.js:

| dtype | Precision | Best For | Size Reduction |
|---|---|---|---|
| `fp32` | Full 32-bit | Maximum accuracy | None (baseline) |
| `fp16` | Half 16-bit | WebGPU default quality | ~50% |
| `q8` | 8-bit | WASM default, good balance | ~75% |
| `q4` | 4-bit | Maximum compression | ~87% |
| `q4f16` | 4-bit weights, fp16 compute | WebGPU + small size | ~87% |
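The size-reduction column follows directly from bytes per weight: fp32 stores 4 bytes per parameter, fp16 stores 2, q8 roughly 1, and q4 roughly 0.5 (ignoring quantization metadata overhead). A back-of-envelope sketch, assuming a hypothetical 100M-parameter model:

```python
# Approximate bytes per parameter for each dtype (metadata overhead ignored)
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "q8": 1.0, "q4": 0.5, "q4f16": 0.5}

def approx_size_mb(n_params, dtype):
    """Rough weight-file size in MB for a given parameter count and dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6

def reduction_vs_fp32(dtype):
    """Fractional size reduction relative to the fp32 baseline."""
    return 1 - BYTES_PER_PARAM[dtype] / BYTES_PER_PARAM["fp32"]

n = 100_000_000  # hypothetical 100M-parameter model
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>5}: ~{approx_size_mb(n, dtype):6.0f} MB "
          f"({reduction_vs_fp32(dtype):.0%} smaller)")
```

Real q4 files land slightly above the 87.5% theoretical reduction because scales and zero-points are stored per group.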
Quantize after export, either from the CLI:

```bash
# Dynamic quantization (post-export)
optimum-cli onnxruntime quantize \
  --onnx_model ./output_dir/ \
  --avx512 \
  -o ./quantized_dir/
```

or from Python:

```python
from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model = ORTModelForSequenceClassification.from_pretrained("./output_dir/")
quantizer = ORTQuantizer.from_pretrained(model)
config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./quantized_dir/", quantization_config=config)
```

Web-ready conversions are conventionally published under the `onnx-community` organization and follow this layout:

```
model_onnx/
├── onnx/
│   ├── model.onnx           # fp32
│   ├── model_fp16.onnx      # fp16
│   ├── model_quantized.onnx # q8
│   └── model_q4.onnx        # q4
├── config.json
├── tokenizer.json
└── tokenizer_config.json
```

```bash
# Login
huggingface-cli login

# Upload
huggingface-cli upload <your-username>/<model-name>-onnx ./output_dir/
```
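Before uploading, it can help to sanity-check that the folder matches the layout Transformers.js expects. A minimal sketch; the required-file list is an assumption based on the directory tree shown above:

```python
import tempfile
from pathlib import Path

REQUIRED = ["config.json", "tokenizer.json", "tokenizer_config.json"]

def check_export_dir(root):
    """Return a list of problems with an export folder (empty list = looks OK)."""
    root = Path(root)
    problems = [f"missing {name}" for name in REQUIRED
                if not (root / name).is_file()]
    if not list((root / "onnx").glob("*.onnx")):
        problems.append("no .onnx files under onnx/")
    return problems

# Demo against a scratch directory mimicking the expected layout
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "onnx").mkdir()
    (root / "onnx" / "model.onnx").write_bytes(b"")
    for name in REQUIRED:
        (root / name).write_text("{}")
    print(check_export_dir(root))  # → []
```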
Add the `transformers.js` tag to the model card for discoverability.

```bash
npm install @huggingface/transformers
```

```js
import { pipeline } from "@huggingface/transformers";

const pipe = await pipeline("task-name", "model-id-or-path", {
  device: "webgpu", // GPU acceleration
  dtype: "q4",      // Quantization level
});

const result = await pipe("input text");
```

Per-module dtypes can be mixed for multi-component models such as Florence-2:

```js
import { Florence2ForConditionalGeneration } from "@huggingface/transformers";

const model = await Florence2ForConditionalGeneration.from_pretrained(
  "onnx-community/Florence-2-base-ft",
  {
    dtype: {
      embed_tokens: "fp16",
      vision_encoder: "fp16",
      encoder_model: "q4",
      decoder_model_merged: "q4",
    },
    device: "webgpu",
  },
);
```

Common fixes for conversion errors:

- Wrong or failed task auto-detection: pass `--task` explicitly (e.g., `text-generation-with-past` for decoder models).
- Custom-architecture errors: add `--trust-remote-code`.
- GPU export failures or out-of-memory: fall back to `--device cpu` and/or lower `--batch_size`.
- Decoder-merge failures: retry with `--no-post-process`.
- Validation tolerance warnings: loosen `--atol`.

Task-to-pipeline mapping:

| Task | Transformers.js Pipeline | Example Model |
|---|---|---|
| text-classification | `pipeline("text-classification")` | distilbert-base-uncased-finetuned-sst-2 |
| text-generation | `pipeline("text-generation")` | Qwen2.5-0.5B-Instruct |
| feature-extraction | `pipeline("feature-extraction")` | mxbai-embed-xsmall-v1 |
| automatic-speech-recognition | `pipeline("automatic-speech-recognition")` | whisper-tiny.en |
| image-classification | `pipeline("image-classification")` | mobilenetv4_conv_small |
| object-detection | `pipeline("object-detection")` | detr-resnet-50 |
| image-segmentation | `pipeline("image-segmentation")` | segformer-b0 |
| zero-shot-image-classification | `pipeline("zero-shot-image-classification")` | clip-vit-base-patch32 |
| depth-estimation | `pipeline("depth-estimation")` | depth-anything-small |
| translation | `pipeline("translation")` | nllb-200-distilled-600M |
| summarization | `pipeline("summarization")` | bart-large-cnn |
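When scripting conversions against this mapping, decoder-style tasks should generally be exported with their KV-cache variant. A small lookup sketch; the `-with-past` set comes from the task list earlier in this document, and the helper name is illustrative:

```python
# Tasks with a `-with-past` export variant (KV cache), per the list above
WITH_PAST = {"text-generation", "text2text-generation",
             "automatic-speech-recognition"}

def export_task_for(pipeline_task):
    """Pick the optimum-cli export task for a Transformers.js pipeline task."""
    if pipeline_task in WITH_PAST:
        return pipeline_task + "-with-past"
    return pipeline_task

print(export_task_for("text-generation"))       # text-generation-with-past
print(export_task_for("image-classification"))  # image-classification
```

Note that some pipelines (e.g., `translation`, `summarization`) run seq2seq models, whose export task may differ from the pipeline name; check the model card before exporting.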