Implements high-performance local machine learning inference in the browser using ONNX Runtime Web. Use this skill when the user needs privacy-first, low-latency, or offline AI capabilities (e.g., image classification, object detection, or NLP) without server-side processing.
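For the image-classification use case mentioned above, pixel data from a canvas must first be converted into the planar float layout most vision models expect. A minimal sketch of that conversion, assuming the common ImageNet mean/std normalization constants (check your model's documented preprocessing; the helper name is illustrative, not part of any API):

```javascript
// Convert interleaved RGBA pixel data (e.g. from canvas.getImageData)
// into the NCHW float32 layout typical ONNX vision models expect.
// The mean/std values are the common ImageNet constants -- an assumption.
function rgbaToNchwFloat32(rgba, width, height) {
  const mean = [0.485, 0.456, 0.406];
  const std = [0.229, 0.224, 0.225];
  const plane = width * height;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    for (let c = 0; c < 3; c++) {
      // rgba is interleaved HWC with an alpha channel; out is planar CHW.
      out[c * plane + i] = (rgba[i * 4 + c] / 255 - mean[c]) / std[c];
    }
  }
  return out;
}
```

The result can then be wrapped as `new ort.Tensor('float32', out, [1, 3, height, width])` and fed to the session.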
## Installation

```shell
npx skill4agent add thongnt0208/browser-onnx-skills browser-onnx
npm install onnxruntime-web
```

Install `onnxruntime-web@dev` instead if you need the latest pre-release WebGPU features.

## Runtime configuration

Tune the runtime through the global `ort.env` object before creating a session:

- `ort.env.wasm.numThreads`: thread count for the WASM backend (multi-threading requires a cross-origin-isolated page).
- `ort.env.wasm.proxy = true`: run inference in a web worker so the main thread stays responsive.
- `ort.env.wasm.wasmPaths`: override where the `.wasm` binaries are loaded from (e.g. a CDN).
- `ort.env.webgpu.profiling = { mode: 'default' }`: enable WebGPU kernel profiling.

## Creating a session

```javascript
import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu', 'wasm'], // Prioritize GPU, fall back to CPU
  graphOptimizationLevel: 'all'           // Enable all graph-level optimizations
});
```

Inputs are passed as tensors, e.g. `new ort.Tensor('float32', float32Data, dims)`.

## WebGPU optimizations

- `enableGraphCapture: true`: session option that captures and replays the GPU command graph, speeding up repeated runs with static input shapes.
- `ort.Tensor.fromGpuBuffer()`: create input tensors directly from a GPU buffer, avoiding a CPU round trip.
- `preferredOutputLocation: 'gpu-buffer'`: keep outputs on the GPU instead of copying them back into an `ArrayBuffer`.

## External data files

Models whose weights are stored in separate external data files reference them at session creation:

```javascript
const session = await ort.InferenceSession.create(modelUrl, {
  externalData: [{ path: './model.data', data: dataUrl }]
});
```

Call `tensor.dispose()` to release a tensor's memory (especially GPU-backed buffers) once you are done with it.
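A classification model's output arrives as raw logits in a flat typed array (via `session.run`); small helpers can turn those into ranked probabilities. A sketch, with illustrative helper names that are not part of the onnxruntime-web API:

```javascript
// Numerically stable softmax over a flat array of logits.
function softmax(logits) {
  const max = Math.max(...logits);            // subtract max for stability
  const exps = Array.from(logits, v => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / sum);
}

// Return the k highest-probability class indices with their probabilities.
function topK(probs, k) {
  return probs
    .map((p, i) => ({ index: i, prob: p }))
    .sort((a, b) => b.prob - a.prob)
    .slice(0, k);
}
```

Typical usage, assuming hypothetical feed/output names (the real ones come from `session.inputNames` and `session.outputNames`): `const results = await session.run({ input: inputTensor }); const best = topK(softmax(results.output.data), 5);`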