browser-onnx

Browser-Based ONNX Inference


This skill provides a comprehensive workflow for executing ONNX models locally in the browser using ONNX Runtime Web (ORT-Web). Local inference offers significant advantages in data privacy, reduced server costs, and unlimited scalability as each user brings their own compute power.

1. Setup and Installation


Install the required library via npm:

```bash
npm install onnxruntime-web
```

Note: For experimental features like WebGPU or WebNN, use the nightly build, `onnxruntime-web@dev`.

2. Global Environment Configuration


Set global `ort.env` flags before creating a session to optimize the runtime environment.
  • WebAssembly (CPU): Enable multi-threading by setting `ort.env.wasm.numThreads` (the default is half of the hardware concurrency) and use a proxy worker (`ort.env.wasm.proxy = true`) to keep the UI responsive.
  • WASM Paths: If the binaries are not in the same directory as the JS bundle, manually override their location with `ort.env.wasm.wasmPaths`, pointing to local assets or a CDN.
  • WebGPU (GPU): Use `ort.env.webgpu.profiling = { mode: 'default' }` for performance diagnosis during development.
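The flags above can be combined in a short setup module. A sketch, assuming a browser context and binaries served from a CDN (the CDN path and the thread cap of 4 are illustrative choices, not requirements):

```javascript
import * as ort from 'onnxruntime-web';

// Multi-threaded WASM: cap threads to avoid oversubscribing weaker devices.
ort.env.wasm.numThreads = Math.min(4, navigator.hardwareConcurrency || 1);

// Run WASM inference off the main thread via a proxy worker.
ort.env.wasm.proxy = true;

// If the .wasm binaries are hosted away from the JS bundle,
// point the runtime at them explicitly.
ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/';

// Development only: collect WebGPU kernel timings.
ort.env.webgpu.profiling = { mode: 'default' };
```

All of these must be set before the first `InferenceSession.create()` call, since the runtime is initialized lazily on session creation.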

3. Creating an Inference Session

3. 创建推理会话

Initialize the session by choosing the appropriate Execution Provider (EP):

```javascript
import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu', 'wasm'], // Prioritize GPU, fall back to CPU
  graphOptimizationLevel: 'all' // Enable all graph-level optimizations
});
```

4. Data Preprocessing


Input data must match the model's training format (e.g., NCHW for vision models).
  • Image-to-Tensor: Use libraries like JIMP or OpenCV.js to resize, normalize (divide by 255.0), and convert RGBA to RGB.
  • Tensor Creation: Use `new ort.Tensor('float32', float32Data, dims)` (where `dims` is the model's input shape, e.g. `[1, 3, 224, 224]`) to prepare the input feeds.
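The RGBA-to-NCHW conversion is mostly index arithmetic and needs no library. A minimal sketch (`rgbaToNchw` is a hypothetical helper; it drops the alpha channel and applies the divide-by-255 normalization):

```javascript
// Convert RGBA pixel data (e.g. from a canvas ImageData or JIMP bitmap)
// into a normalized NCHW Float32Array for ort.Tensor construction.
function rgbaToNchw(rgba, width, height) {
  const plane = width * height;
  const out = new Float32Array(3 * plane); // one plane each for R, G, B
  for (let i = 0; i < plane; i++) {
    out[i] = rgba[i * 4] / 255.0;                 // R plane
    out[plane + i] = rgba[i * 4 + 1] / 255.0;     // G plane
    out[2 * plane + i] = rgba[i * 4 + 2] / 255.0; // B plane
    // rgba[i * 4 + 3] (alpha) is discarded
  }
  return out;
}
```

The result feeds directly into tensor creation, e.g. `new ort.Tensor('float32', rgbaToNchw(data, 224, 224), [1, 3, 224, 224])`.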

5. Optimized Inference Patterns


  • Graph Capture: For models with static shapes on WebGPU, enable `enableGraphCapture: true` to reduce CPU overhead by replaying recorded kernel executions.
  • IO Binding: For transformer models, keep data on the GPU by using `ort.Tensor.fromGpuBuffer()` and setting `preferredOutputLocation: 'gpu-buffer'` to avoid expensive memory copies.
  • Quantization: Prefer uint8-quantized models for CPU (WASM) inference to improve performance; avoid float16 on CPU, as it lacks native support and is slow.
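A sketch of how the WebGPU-oriented options might look together in the session options (the model path is a placeholder; graph capture additionally requires that all input shapes stay fixed across runs):

```javascript
import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu'],
  enableGraphCapture: true,              // static shapes only: replay captured kernels
  preferredOutputLocation: 'gpu-buffer'  // keep outputs on the GPU for IO binding
});
```

With `preferredOutputLocation: 'gpu-buffer'`, downstream consumers read results via the tensor's GPU buffer instead of forcing a copy back to JS memory.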

6. Large Model Handling (>2GB)


  • Platform Limits: Browsers like Chrome cap a single `ArrayBuffer` at roughly 2GB; models exceeding this must be exported with external data.
  • Loading External Data: Explicitly link the external weight files in the session options:
    ```javascript
    const session = await ort.InferenceSession.create(modelUrl, {
      externalData: [{ path: './model.data', data: dataUrl }]
    });
    ```

7. Common Edge Cases


  • Memory Management: Explicitly call `tensor.dispose()` on GPU tensors to prevent memory leaks.
  • Zero-Sized Tensors: ORT-Web treats tensors with a dimension of 0 as CPU tensors, regardless of the selected EP.
  • Thermal Throttling: Sustained inference on mobile devices may trigger frequency scaling, doubling latency; use lightweight "tiny" models to maintain thermal equilibrium.

8. Examples


Multilingual Translation


Offload heavy translation tasks to a separate Web Worker using a singleton pattern to ensure the model (e.g., NLLB-200) loads only once.
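The singleton amounts to memoizing the load promise inside the worker. A minimal sketch (`createTranslationSession` is a hypothetical placeholder for the actual ORT-Web session/model setup):

```javascript
// Inside the Web Worker: memoize the load promise so that concurrent
// messages share a single in-flight (or completed) model load.
let sessionPromise = null;

function getSession(load) {
  if (sessionPromise === null) {
    sessionPromise = load(); // e.g. () => ort.InferenceSession.create('./nllb.onnx')
  }
  return sessionPromise;
}

// Worker message handler (guarded so the module also loads outside a worker).
if (typeof self !== 'undefined' && typeof self.onmessage !== 'undefined') {
  self.onmessage = async (e) => {
    const session = await getSession(createTranslationSession); // placeholder loader
    // ... run translation with `session` and post the result back
  };
}
```

Memoizing the promise (rather than the resolved session) matters: if two messages arrive before loading finishes, both await the same in-flight load instead of starting a second one.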

Object Detection (YOLO)


Implement Non-Max Suppression (NMS). If the browser lacks support for specific NMS ops, run a separate NMS ONNX model to filter overlapping boxes locally.
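Where a pure-JS fallback is acceptable, greedy NMS is simple enough to implement directly. A sketch, assuming boxes in `[x1, y1, x2, y2]` corner format with a parallel array of confidence scores:

```javascript
// Intersection-over-union of two [x1, y1, x2, y2] boxes.
function iou(a, b) {
  const ix = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
  const iy = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
  const inter = ix * iy;
  const areaA = (a[2] - a[0]) * (a[3] - a[1]);
  const areaB = (b[2] - b[0]) * (b[3] - b[1]);
  return inter / (areaA + areaB - inter);
}

// Greedy non-max suppression: keep the highest-scoring box, suppress
// boxes overlapping it above the IoU threshold, then repeat.
function nms(boxes, scores, iouThreshold = 0.5) {
  const order = scores.map((s, i) => i).sort((i, j) => scores[j] - scores[i]);
  const keep = [];
  const suppressed = new Set();
  for (const i of order) {
    if (suppressed.has(i)) continue;
    keep.push(i);
    for (const j of order) {
      if (j !== i && !suppressed.has(j) && iou(boxes[i], boxes[j]) > iouThreshold) {
        suppressed.add(j);
      }
    }
  }
  return keep; // indices of retained boxes, best score first
}
```

This O(n²) version is fine for the few hundred candidates left after score thresholding; the dedicated NMS ONNX model remains the better choice for very large candidate sets.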