browser-onnx

Browser-Based ONNX Inference
This skill provides a comprehensive workflow for executing ONNX models locally in the browser using ONNX Runtime Web (ORT-Web). Local inference offers significant advantages in data privacy, reduced server costs, and unlimited scalability as each user brings their own compute power.
1. Setup and Installation
Install the required library via npm:

```bash
npm install onnxruntime-web
```

Note: For experimental features like WebGPU or WebNN, use the nightly build, `onnxruntime-web@dev`.

2. Global Environment Configuration
Set global flags on `ort.env` before creating a session to optimize the runtime environment.

- WebAssembly (CPU): Enable multi-threading by setting `ort.env.wasm.numThreads` (the default is half of hardware concurrency) and use a Proxy Worker (`ort.env.wasm.proxy = true`) to keep the UI responsive.
- WASM Paths: If the WASM binaries are not in the same directory as the JS bundle, manually override their location with `ort.env.wasm.wasmPaths` to point to local assets or a CDN.
- WebGPU (GPU): Use `ort.env.webgpu.profiling = { mode: 'default' }` for performance diagnosis during development.
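Taken together, the flags above might be set like this before the first session is created (a minimal sketch; the thread count and CDN URL are illustrative assumptions, not recommendations):

```javascript
import * as ort from 'onnxruntime-web';

// CPU (WASM) backend: enable multi-threading and run heavy kernels in a
// proxy worker so the main thread stays responsive.
ort.env.wasm.numThreads = Math.max(1, Math.floor(navigator.hardwareConcurrency / 2));
ort.env.wasm.proxy = true;

// Illustrative override: serve the .wasm binaries from a CDN instead of
// the JS bundle's own directory.
ort.env.wasm.wasmPaths = 'https://cdn.example.com/ort/';

// WebGPU profiling; useful during development only.
ort.env.webgpu.profiling = { mode: 'default' };
```

Because these flags are read when a session is created, setting them afterwards has no effect on existing sessions.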
3. Creating an Inference Session
Initialize the session by choosing the appropriate Execution Provider (EP):

```javascript
import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu', 'wasm'], // Prioritize GPU, fall back to CPU
  graphOptimizationLevel: 'all'           // Enable all graph-level optimizations
});
```

4. Data Preprocessing
Input data must match the model's training format (e.g., NCHW for vision models).

- Image-to-Tensor: Use libraries like JIMP or OpenCV.js to resize, normalize (divide by 255.0), and convert RGBA to RGB.
- Tensor Creation: Use `new ort.Tensor('float32', float32Data, dims)` to prepare the input feeds.
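The RGBA-to-NCHW step can be sketched as a small helper (assuming the image has already been resized to the model's input dimensions; the function name is illustrative):

```javascript
// Convert interleaved RGBA pixel data (e.g. from a canvas) into a
// normalized, planar NCHW Float32Array suitable for
// `new ort.Tensor('float32', data, [1, 3, height, width])`.
function rgbaToNchw(rgba, width, height) {
  const plane = width * height;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    out[i]             = rgba[i * 4]     / 255.0; // R plane
    out[plane + i]     = rgba[i * 4 + 1] / 255.0; // G plane
    out[2 * plane + i] = rgba[i * 4 + 2] / 255.0; // B plane (alpha dropped)
  }
  return out;
}
```

Note that this performs only the simple divide-by-255 normalization; models trained with per-channel mean/std normalization need those constants applied here as well.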
5. Optimized Inference Patterns
- Graph Capture: For models with static shapes on WebGPU, enable `enableGraphCapture: true` to reduce CPU overhead by replaying recorded kernel executions.
- IO Binding: For transformer models, keep data on the GPU by using `ort.Tensor.fromGpuBuffer()` and setting `preferredOutputLocation: 'gpu-buffer'` to avoid expensive memory copies.
- Quantization: Prefer uint8-quantized models for CPU (WASM) inference to improve performance; avoid float16 on CPU, which lacks native support and is slow.
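A session combining these options might look like the following sketch (`myGpuBuffer` and the tensor shape are illustrative assumptions):

```javascript
import * as ort from 'onnxruntime-web';

// WebGPU session with graph capture and GPU-resident outputs.
const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu'],
  enableGraphCapture: true,              // requires static input shapes
  preferredOutputLocation: 'gpu-buffer', // keep outputs on the GPU
});

// Wrap an existing GPUBuffer as an input tensor to skip a CPU -> GPU
// upload (`myGpuBuffer` is an assumed, already-populated buffer).
const input = ort.Tensor.fromGpuBuffer(myGpuBuffer, {
  dataType: 'float32',
  dims: [1, 3, 224, 224],
});
const results = await session.run({ input });
```

With `preferredOutputLocation: 'gpu-buffer'`, reading output values on the CPU requires an explicit download (e.g. `getData()`), so only fetch them when actually needed.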
6. Large Model Handling (>2GB)
- Platform Limits: Browsers like Chrome cap a single `ArrayBuffer` at roughly 2GB. Models exceeding this must be exported with external data.
- Loading External Data: Explicitly link the external weight files in the session options:

```javascript
const session = await ort.InferenceSession.create(modelUrl, {
  externalData: [{ path: './model.data', data: dataUrl }]
});
```
7. Common Edge Cases
- Memory Management: Explicitly call `tensor.dispose()` on GPU tensors to prevent memory leaks.
- Zero-Sized Tensors: ORT-Web treats tensors with a dimension of 0 as CPU tensors regardless of the selected EP.
- Thermal Throttling: Sustained inference on mobile devices may trigger frequency scaling, doubling latency. Use lightweight "tiny" models to maintain thermal equilibrium.
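The dispose pattern can be sketched as follows (assuming an existing `session` and a GPU-backed input tensor as in the IO-binding section; `myGpuBuffer` is a hypothetical buffer):

```javascript
// Release GPU-backed tensors deterministically instead of waiting for GC.
const input = ort.Tensor.fromGpuBuffer(myGpuBuffer, {
  dataType: 'float32',
  dims: [1, 3, 224, 224],
});
try {
  const results = await session.run({ input });
  // ... consume the results ...
  for (const name of Object.keys(results)) {
    results[name].dispose(); // free each GPU output tensor
  }
} finally {
  input.dispose(); // free the input even if run() throws
}
```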
8. Examples
Multilingual Translation
Offload heavy translation tasks to a separate Web Worker using a singleton pattern to ensure the model (e.g., NLLB-200) loads only once.
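The singleton can be sketched minimally as below, assuming a hypothetical `loadModel()` that creates the session inside the worker. Concurrent callers share the same pending promise, so the model is fetched and initialized exactly once:

```javascript
// Lazily create the session exactly once; every caller (including
// concurrent ones) receives the same promise.
let sessionPromise = null;
function getSession(loadModel) {
  if (sessionPromise === null) {
    sessionPromise = loadModel(); // only the first caller triggers the load
  }
  return sessionPromise;
}

// Inside the worker, each incoming message awaits the shared session:
// self.onmessage = async (e) => {
//   const session = await getSession(() => ort.InferenceSession.create(modelUrl));
//   // ... run the translation and postMessage the result ...
// };
```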
Object Detection (YOLO)
Implement Non-Max Suppression (NMS). If the browser lacks support for specific NMS ops, run a separate NMS ONNX model to filter overlapping boxes locally.
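For reference, the filtering that NMS performs can be sketched in plain JavaScript (a minimal greedy NMS over `[x1, y1, x2, y2]` boxes; in production the separate NMS ONNX model plays this role):

```javascript
// Intersection-over-union of two [x1, y1, x2, y2] boxes.
function iou(a, b) {
  const x1 = Math.max(a[0], b[0]), y1 = Math.max(a[1], b[1]);
  const x2 = Math.min(a[2], b[2]), y2 = Math.min(a[3], b[3]);
  const inter = Math.max(0, x2 - x1) * Math.max(0, y2 - y1);
  const areaA = (a[2] - a[0]) * (a[3] - a[1]);
  const areaB = (b[2] - b[0]) * (b[3] - b[1]);
  return inter / (areaA + areaB - inter);
}

// Greedy NMS: keep the highest-scoring box, drop any box whose IoU with
// an already-kept box exceeds the threshold. Returns surviving indices.
function nms(boxes, scores, iouThreshold = 0.5) {
  const order = scores.map((_, i) => i).sort((i, j) => scores[j] - scores[i]);
  const keep = [];
  for (const i of order) {
    if (keep.every(k => iou(boxes[i], boxes[k]) <= iouThreshold)) keep.push(i);
  }
  return keep;
}
```

This O(n²) loop is fine for a few hundred candidate boxes; for the thousands of raw YOLO candidates, filter by a confidence threshold first.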