browser-onnx

Browser-Based ONNX Inference


This skill provides a comprehensive workflow for executing ONNX models locally in the browser using ONNX Runtime Web (ORT-Web). Local inference offers significant advantages in data privacy, reduced server costs, and unlimited scalability as each user brings their own compute power.

1. Setup and Installation


Install the required library via npm:

```bash
npm install onnxruntime-web
```

Note: For experimental features like WebGPU or WebNN, use the nightly build, `onnxruntime-web@dev`.

2. Global Environment Configuration


Set global `ort.env` flags before creating a session to optimize the runtime environment.
  • WebAssembly (CPU): Enable multi-threading by setting `ort.env.wasm.numThreads` (the default is half of the hardware concurrency) and use a proxy worker (`ort.env.wasm.proxy = true`) to keep the UI responsive.
  • WASM Paths: If the binaries are not in the same directory as the JS bundle, manually override their location with `ort.env.wasm.wasmPaths`, pointing to local assets or a CDN.
  • WebGPU (GPU): Use `ort.env.webgpu.profiling = { mode: 'default' }` for performance diagnosis during development.
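The flags above can be combined in a short setup module. A sketch, assuming a browser context and binaries served from a CDN (the CDN path and the thread cap of 4 are illustrative choices, not requirements):

```javascript
import * as ort from 'onnxruntime-web';

// Multi-threaded WASM: cap threads to avoid oversubscribing weaker devices.
ort.env.wasm.numThreads = Math.min(4, navigator.hardwareConcurrency || 1);

// Run WASM inference off the main thread via a proxy worker.
ort.env.wasm.proxy = true;

// If the .wasm binaries are hosted away from the JS bundle,
// point the runtime at them explicitly.
ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/';

// Development only: collect WebGPU kernel timings.
ort.env.webgpu.profiling = { mode: 'default' };
```

All of these must be set before the first `InferenceSession.create()` call, since the runtime is initialized lazily on session creation.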

3. Creating an Inference Session

3. 创建推理会话

Initialize the session by choosing the appropriate Execution Provider (EP):

```javascript
import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu', 'wasm'], // Prioritize GPU, fall back to CPU
  graphOptimizationLevel: 'all' // Enable all graph-level optimizations
});
```

4. Data Preprocessing


Input data must match the model's training format (e.g., NCHW for vision models).
  • Image-to-Tensor: Use libraries like JIMP or OpenCV.js to resize, normalize (divide by 255.0), and convert RGBA to RGB.
  • Tensor Creation: Use `new ort.Tensor('float32', float32Data, dims)` (where `dims` is the model's input shape, e.g. `[1, 3, 224, 224]`) to prepare the input feeds.
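The RGBA-to-NCHW conversion is mostly index arithmetic and needs no library. A minimal sketch (`rgbaToNchw` is a hypothetical helper; it drops the alpha channel and applies the divide-by-255 normalization):

```javascript
// Convert RGBA pixel data (e.g. from a canvas ImageData or JIMP bitmap)
// into a normalized NCHW Float32Array for ort.Tensor construction.
function rgbaToNchw(rgba, width, height) {
  const plane = width * height;
  const out = new Float32Array(3 * plane); // one plane each for R, G, B
  for (let i = 0; i < plane; i++) {
    out[i] = rgba[i * 4] / 255.0;                 // R plane
    out[plane + i] = rgba[i * 4 + 1] / 255.0;     // G plane
    out[2 * plane + i] = rgba[i * 4 + 2] / 255.0; // B plane
    // rgba[i * 4 + 3] (alpha) is discarded
  }
  return out;
}
```

The result feeds directly into tensor creation, e.g. `new ort.Tensor('float32', rgbaToNchw(data, 224, 224), [1, 3, 224, 224])`.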

5. Optimized Inference Patterns


  • Graph Capture: For models with static shapes on WebGPU, enable `enableGraphCapture: true` to reduce CPU overhead by replaying recorded kernel executions.
  • IO Binding: For transformer models, keep data on the GPU by using `ort.Tensor.fromGpuBuffer()` and setting `preferredOutputLocation: 'gpu-buffer'` to avoid expensive memory copies.
  • Quantization: Prefer uint8-quantized models for CPU (WASM) inference to improve performance; avoid float16 on CPU, as it lacks native support and is slow.
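A sketch of how the WebGPU-oriented options might look together in the session options (the model path is a placeholder; graph capture additionally requires that all input shapes stay fixed across runs):

```javascript
import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu'],
  enableGraphCapture: true,              // static shapes only: replay captured kernels
  preferredOutputLocation: 'gpu-buffer'  // keep outputs on the GPU for IO binding
});
```

With `preferredOutputLocation: 'gpu-buffer'`, downstream consumers read results via the tensor's GPU buffer instead of forcing a copy back to JS memory.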

6. Large Model Handling (>2GB)


  • Platform Limits: Browsers like Chrome cap a single `ArrayBuffer` at roughly 2GB; models exceeding this must be exported with external data.
  • Loading External Data: Explicitly link the external weight files in the session options:
    ```javascript
    const session = await ort.InferenceSession.create(modelUrl, {
      externalData: [{ path: './model.data', data: dataUrl }]
    });
    ```

7. Common Edge Cases


  • Memory Management: Explicitly call `tensor.dispose()` on GPU tensors to prevent memory leaks.
  • Zero-Sized Tensors: ORT-Web treats tensors with a dimension of 0 as CPU tensors, regardless of the selected EP.
  • Thermal Throttling: Sustained inference on mobile devices may trigger frequency scaling, doubling latency; use lightweight "tiny" models to maintain thermal equilibrium.

8. Examples


Multilingual Translation


Offload heavy translation tasks to a separate Web Worker using a singleton pattern to ensure the model (e.g., NLLB-200) loads only once.
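The singleton amounts to memoizing the load promise inside the worker. A minimal sketch (`createTranslationSession` is a hypothetical placeholder for the actual ORT-Web session/model setup):

```javascript
// Inside the Web Worker: memoize the load promise so that concurrent
// messages share a single in-flight (or completed) model load.
let sessionPromise = null;

function getSession(load) {
  if (sessionPromise === null) {
    sessionPromise = load(); // e.g. () => ort.InferenceSession.create('./nllb.onnx')
  }
  return sessionPromise;
}

// Worker message handler (guarded so the module also loads outside a worker).
if (typeof self !== 'undefined' && typeof self.onmessage !== 'undefined') {
  self.onmessage = async (e) => {
    const session = await getSession(createTranslationSession); // placeholder loader
    // ... run translation with `session` and post the result back
  };
}
```

Memoizing the promise (rather than the resolved session) matters: if two messages arrive before loading finishes, both await the same in-flight load instead of starting a second one.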

Object Detection (YOLO)


Implement Non-Max Suppression (NMS). If the browser lacks support for specific NMS ops, run a separate NMS ONNX model to filter overlapping boxes locally.
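Where a pure-JS fallback is acceptable, greedy NMS is simple enough to implement directly. A sketch, assuming boxes in `[x1, y1, x2, y2]` corner format with a parallel array of confidence scores:

```javascript
// Intersection-over-union of two [x1, y1, x2, y2] boxes.
function iou(a, b) {
  const ix = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
  const iy = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
  const inter = ix * iy;
  const areaA = (a[2] - a[0]) * (a[3] - a[1]);
  const areaB = (b[2] - b[0]) * (b[3] - b[1]);
  return inter / (areaA + areaB - inter);
}

// Greedy non-max suppression: keep the highest-scoring box, suppress
// boxes overlapping it above the IoU threshold, then repeat.
function nms(boxes, scores, iouThreshold = 0.5) {
  const order = scores.map((s, i) => i).sort((i, j) => scores[j] - scores[i]);
  const keep = [];
  const suppressed = new Set();
  for (const i of order) {
    if (suppressed.has(i)) continue;
    keep.push(i);
    for (const j of order) {
      if (j !== i && !suppressed.has(j) && iou(boxes[i], boxes[j]) > iouThreshold) {
        suppressed.add(j);
      }
    }
  }
  return keep; // indices of retained boxes, best score first
}
```

This O(n²) version is fine for the few hundred candidates left after score thresholding; the dedicated NMS ONNX model remains the better choice for very large candidate sets.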