vision-ocr

Vision & OCR

Extract text from images, documents, and handwritten notes using a progressive 3-tier pipeline: local OCR (PaddleOCR / Tesseract) -> local vision models (TrOCR, Florence-2) -> cloud vision LLM (GPT-4o, Claude, Gemini).

High-Level API: `performOCR()`

For one-shot text extraction, use the top-level `performOCR()` function. It handles input resolution, pipeline lifecycle, and cleanup automatically.

```typescript
import { performOCR } from '@framers/agentos';

const result = await performOCR({
  image: '/path/to/receipt.png',   // file path, URL, base64, or Buffer
  strategy: 'progressive',         // 'progressive' | 'local-only' | 'cloud-only'
  confidenceThreshold: 0.7,        // min confidence before escalating tier
});

console.log(result.text);       // extracted text
console.log(result.confidence); // score between 0 and 1
console.log(result.tier);       // 'ocr' | 'handwriting' | 'document-ai' | 'cloud-vision'
console.log(result.provider);   // 'paddle' | 'tesseract' | 'openai' | etc.
console.log(result.regions);    // bounding boxes (when available)
```

When to use `performOCR()` vs `VisionPipeline`

| Use case | Recommendation |
| --- | --- |
| One-shot text extraction from a single image | `performOCR()`: simplest API |
| Batch processing many images | `VisionPipeline`: create once, reuse, dispose when done |
| Need CLIP embeddings or document layout | `VisionPipeline`: richer result shape |
| Quick scripts and integrations | `performOCR()`: zero boilerplate |
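
The "create once, reuse, dispose when done" pattern recommended for batches can be sketched as follows. This document does not show the `VisionPipeline` constructor or its method names, so the sketch uses a hypothetical `DisposablePipeline` interface and a hypothetical `processBatch` helper. Treat it as an illustration of the lifecycle, not as the library's API.

```typescript
// Hypothetical minimal interface: the real VisionPipeline API may differ.
interface DisposablePipeline<R> {
  process(image: string): Promise<R>;
  dispose(): Promise<void>;
}

// Create the pipeline once, reuse it for every image, and always dispose it.
async function processBatch<R>(
  createPipeline: () => Promise<DisposablePipeline<R>>,
  images: string[],
): Promise<R[]> {
  const pipeline = await createPipeline(); // created once for the whole batch
  try {
    const results: R[] = [];
    for (const image of images) {
      results.push(await pipeline.process(image)); // same instance for each image
    }
    return results;
  } finally {
    await pipeline.dispose(); // released even if one of the images throws
  }
}
```

The `try`/`finally` keeps cleanup in one place, which is the part `performOCR()` is described as handling automatically for single images.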

Progressive Tier System

The pipeline tries the cheapest/fastest tier first and only escalates when confidence is below threshold:

Tier 1 — Local OCR (PaddleOCR or Tesseract.js): Fast, free, offline. Handles printed text in documents, receipts, screenshots.
Tier 2 — Local Vision Models (TrOCR / Florence-2): Still offline. Handles handwritten notes, complex document layouts with tables and figures.
Tier 3 — Cloud Vision LLM (GPT-4o / Claude / Gemini): Best quality. Handles photographs, diagrams, mixed content, anything the local tiers can't confidently read.
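
The escalation rule described above can be sketched as a loop over tiers ordered from cheapest to most expensive. The `TierResult` and `Tier` types and the `runProgressive` function are hypothetical names for illustration, not the library's internals. Returning the most confident result when no tier reaches the threshold is also an assumption, since the document does not say what happens in that case.

```typescript
// Hypothetical shapes for illustration; not the actual @framers/agentos types.
interface TierResult {
  text: string;
  confidence: number; // score between 0 and 1
  tier: string;
}

type Tier = (image: Uint8Array) => Promise<TierResult>;

// Run tiers in order and stop at the first result that meets the threshold.
// Assumption: if no tier is confident enough, return the best result seen.
async function runProgressive(
  image: Uint8Array,
  tiers: Tier[],
  confidenceThreshold = 0.7,
): Promise<TierResult> {
  if (tiers.length === 0) {
    throw new Error('at least one tier is required');
  }
  let best: TierResult | undefined;
  for (const tier of tiers) {
    const result = await tier(image);
    if (best === undefined || result.confidence > best.confidence) {
      best = result;
    }
    if (result.confidence >= confidenceThreshold) {
      return result; // confident enough: later (costlier) tiers never run
    }
  }
  return best as TierResult;
}
```

Because later tiers are only awaited after an earlier one falls short, a clean printed receipt would never reach the cloud tier in this sketch.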

Strategy Selection

`'progressive'` (default): Start local, escalate only if needed. Best cost/quality balance for most use cases.
`'local-only'`: Never call cloud APIs. Use for air-gapped environments, privacy-sensitive data (medical records, financial docs), or when no API keys are available.
`'cloud-only'`: Skip local tiers entirely, send straight to a cloud vision LLM. Use when you need the highest quality output and cost is not a concern.
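
One way to picture the three strategies is as a mapping from strategy name to the ordered list of tiers that may run. The strategy strings come from the options above. The `TierName` labels and the `tiersForStrategy` function are hypothetical and only illustrate the selection logic.

```typescript
type Strategy = 'progressive' | 'local-only' | 'cloud-only';

// Hypothetical labels for the three tiers described in this document.
type TierName = 'local-ocr' | 'local-vision' | 'cloud-vision';

// Return the tiers a strategy is allowed to use, cheapest first.
function tiersForStrategy(strategy: Strategy = 'progressive'): TierName[] {
  switch (strategy) {
    case 'local-only':
      return ['local-ocr', 'local-vision']; // never touches a cloud API
    case 'cloud-only':
      return ['cloud-vision']; // skips both local tiers
    case 'progressive':
    default:
      return ['local-ocr', 'local-vision', 'cloud-vision'];
  }
}
```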

Input Formats

`performOCR()` accepts four input types:

File path: `'/tmp/scan.png'` (reads from disk)
URL: `'https://example.com/receipt.jpg'` (fetches via HTTP)
Base64 string: raw base64 or `data:image/png;base64,...` data URIs (decoded in memory)
Buffer: raw image bytes (passed directly to the pipeline)
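
How a single `image` argument might be sorted into these four cases can be sketched as below. The `classifyInput` function and its `InputKind` labels are hypothetical, and the raw-base64 check is a heuristic chosen for this example; the library's actual detection rules are not described in this document. `Uint8Array` is used because a Node.js `Buffer` is a subclass of it.

```typescript
type InputKind = 'buffer' | 'url' | 'data-uri' | 'base64' | 'file-path';

// Hypothetical classifier: decide how an input should be resolved.
function classifyInput(image: string | Uint8Array): InputKind {
  if (typeof image !== 'string') return 'buffer'; // raw bytes, used as-is
  if (/^https?:\/\//i.test(image)) return 'url'; // fetched over HTTP
  if (/^data:image\/[a-z0-9.+-]+;base64,/i.test(image)) return 'data-uri';
  // Heuristic (assumption): a long string made only of base64 characters,
  // with a length divisible by 4, is treated as raw base64.
  const looksLikeBase64 =
    image.length >= 64 &&
    image.length % 4 === 0 &&
    /^[A-Za-z0-9+\/]+={0,2}$/.test(image);
  if (looksLikeBase64) return 'base64';
  return 'file-path'; // everything else is read from disk
}
```

A long extensionless path could be mistaken for base64 under this heuristic, which is one reason to pass a data URI or a Buffer when the input type is already known.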

Capabilities

Printed text OCR: Extract text from documents, receipts, screenshots, PDFs
Handwriting recognition: Read handwritten notes and forms via TrOCR
Document layout understanding: Parse tables, figures, headings via Florence-2
Bounding box regions: Spatial text locations for overlay rendering
Image embeddings: Generate CLIP vectors for semantic image search (via `VisionPipeline` only)

Examples

"Read the text from this receipt"
"What does this handwritten note say?"
"Extract the table data from this PDF page"
"OCR this screenshot and return the error message"