vision-ocr
Vision & OCR
Extract text from images, documents, and handwritten notes using a progressive 3-tier pipeline: local OCR (PaddleOCR / Tesseract) -> local vision models (TrOCR, Florence-2) -> cloud vision LLM (GPT-4o, Claude, Gemini).
High-Level API: performOCR()
For one-shot text extraction, use the top-level `performOCR()` function. It handles input resolution, pipeline lifecycle, and cleanup automatically.
```typescript
import { performOCR } from '@framers/agentos';

const result = await performOCR({
  image: '/path/to/receipt.png', // file path, URL, base64, or Buffer
  strategy: 'progressive',       // 'progressive' | 'local-only' | 'cloud-only'
  confidenceThreshold: 0.7,      // min confidence before escalating tier
});

console.log(result.text);       // extracted text
console.log(result.confidence); // 0–1 score
console.log(result.tier);       // 'ocr' | 'handwriting' | 'document-ai' | 'cloud-vision'
console.log(result.provider);   // 'paddle' | 'tesseract' | 'openai' | etc.
console.log(result.regions);    // bounding boxes (when available)
```
When to use performOCR() vs VisionPipeline
| Use case | Recommendation |
|---|---|
| One-shot text extraction from a single image | `performOCR()` |
| Batch processing many images | `VisionPipeline` |
| Need CLIP embeddings or document layout | `VisionPipeline` |
| Quick scripts and integrations | `performOCR()` |
Progressive Tier System
The pipeline tries the cheapest/fastest tier first and only escalates when confidence is below threshold:
- Tier 1 — Local OCR (PaddleOCR or Tesseract.js): Fast, free, offline. Handles printed text in documents, receipts, screenshots.
- Tier 2 — Local Vision Models (TrOCR / Florence-2): Still offline. Handles handwritten notes, complex document layouts with tables and figures.
- Tier 3 — Cloud Vision LLM (GPT-4o / Claude / Gemini): Best quality. Handles photographs, diagrams, mixed content, anything the local tiers can't confidently read.
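The escalation rule described above can be sketched as a loop over tiers ordered from cheapest to most expensive. This is an illustrative sketch, not the library's implementation: `Tier`, `TierResult`, `runProgressive`, and the stub recognizers are hypothetical names introduced here, and returning the most confident result when no tier reaches the threshold is an assumption. Only the confidence-threshold behavior comes from the description above.

```typescript
// Hypothetical types for illustration; not the @framers/agentos internals.
interface TierResult {
  text: string;
  confidence: number; // 0 to 1
}

interface Tier {
  name: string;
  run: (image: Uint8Array) => Promise<TierResult>;
}

interface ProgressiveResult extends TierResult {
  tier: string;
}

// Try tiers in order (cheapest first) and stop at the first result whose
// confidence meets the threshold. If no tier meets it, return the most
// confident result seen (an assumed fallback policy).
async function runProgressive(
  image: Uint8Array,
  tiers: Tier[],
  confidenceThreshold = 0.7,
): Promise<ProgressiveResult> {
  if (tiers.length === 0) throw new Error('at least one tier is required');
  let best: ProgressiveResult | undefined;
  for (const tier of tiers) {
    const result = await tier.run(image);
    if (best === undefined || result.confidence > best.confidence) {
      best = { ...result, tier: tier.name };
    }
    if (result.confidence >= confidenceThreshold) break; // good enough, stop escalating
  }
  return best as ProgressiveResult;
}

// Stub recognizers standing in for real OCR engines and vision models.
const stubTiers: Tier[] = [
  { name: 'ocr', run: async () => ({ text: 'T0TAL 12.5O', confidence: 0.41 }) },
  { name: 'handwriting', run: async () => ({ text: 'TOTAL 12.50', confidence: 0.88 }) },
  { name: 'cloud-vision', run: async () => ({ text: 'TOTAL 12.50', confidence: 0.99 }) },
];

runProgressive(new Uint8Array(0), stubTiers).then((r) => console.log(r.tier)); // logs 'handwriting'
```

With these stubs the first tier falls below the default threshold of 0.7, the second clears it, and the cloud tier is never invoked.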
Strategy Selection
- `'progressive'` (default): Start local, escalate only if needed. Best cost/quality balance for most use cases.
- `'local-only'`: Never call cloud APIs. Use for air-gapped environments, privacy-sensitive data (medical records, financial docs), or when no API keys are available.
- `'cloud-only'`: Skip local tiers entirely, send straight to a cloud vision LLM. Use when you need the highest quality output and cost is not a concern.
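The guidance above can be turned into a small decision helper. The sketch below is only an illustration: `chooseStrategy` and the `StrategyContext` fields are hypothetical and not part of `@framers/agentos`. Only the three strategy strings come from the documentation.

```typescript
type Strategy = 'progressive' | 'local-only' | 'cloud-only';

// Hypothetical description of the calling environment.
interface StrategyContext {
  hasCloudApiKey: boolean;             // is any cloud vision API key configured?
  privacySensitive: boolean;           // e.g. medical records, financial docs
  maxQualityRegardlessOfCost: boolean; // quality matters more than cost
}

function chooseStrategy(ctx: StrategyContext): Strategy {
  // Privacy constraints and missing keys both rule out cloud calls.
  if (ctx.privacySensitive || !ctx.hasCloudApiKey) return 'local-only';
  // Cost is not a concern and quality matters most: go straight to the cloud tier.
  if (ctx.maxQualityRegardlessOfCost) return 'cloud-only';
  // Default: start local, escalate only if needed.
  return 'progressive';
}
```

The returned value could then be passed as the `strategy` option of `performOCR()`. Checking the privacy constraint first is a deliberate choice in this sketch, so that a request for maximum quality never overrides a requirement to keep data local.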
Input Formats
`performOCR()` accepts the image in any of these forms:
- File path: `'/tmp/scan.png'` (read from disk)
- URL: `'https://example.com/receipt.jpg'` (fetched via HTTP)
- Base64 string: raw base64 or a data URI such as `data:image/png;base64,...` (decoded in memory)
- Buffer: raw image bytes (passed directly to the pipeline)
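To make the distinction between these input forms concrete, here is a minimal sketch of how a caller might classify an input before handing it off. `classifyImageInput` is a hypothetical helper, not the library's resolver, and the raw-base64 check is a heuristic assumption that can misfire (for example, on a long extensionless path made only of letters, digits, and slashes).

```typescript
type ImageInputKind = 'url' | 'data-uri' | 'base64' | 'file-path' | 'buffer';

// Hypothetical helper: guess which of the documented input forms a value is.
// Node's Buffer is a Uint8Array subclass, so Buffers land in the 'buffer' branch.
function classifyImageInput(input: string | Uint8Array): ImageInputKind {
  if (typeof input !== 'string') return 'buffer';
  if (/^https?:\/\//i.test(input)) return 'url';
  if (/^data:image\/[a-z0-9.+-]+;base64,/i.test(input)) return 'data-uri';
  // Heuristic: raw base64 is long, padded to a multiple of 4, and uses only
  // the base64 alphabet. Anything else is treated as a file path.
  const looksBase64 =
    input.length >= 64 &&
    input.length % 4 === 0 &&
    /^[A-Za-z0-9+/]+={0,2}$/.test(input);
  return looksBase64 ? 'base64' : 'file-path';
}
```

Under this sketch, `'/tmp/scan.png'` is classified as `'file-path'` because the dot is outside the base64 alphabet, and `'https://example.com/receipt.jpg'` is classified as `'url'`.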
Capabilities
- Printed text OCR: Extract text from documents, receipts, screenshots, PDFs
- Handwriting recognition: Read handwritten notes and forms via TrOCR
- Document layout understanding: Parse tables, figures, headings via Florence-2
- Bounding box regions: Spatial text locations for overlay rendering
- Image embeddings: Generate CLIP vectors for semantic image search (via `VisionPipeline` only)
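Semantic image search over embedding vectors is commonly done by ranking stored vectors by cosine similarity to a query vector. The sketch below shows that ranking step only, using toy 3-dimensional vectors. How `VisionPipeline` exposes its embeddings is not covered here, and `IndexedImage`, `cosineSimilarity`, and `rankBySimilarity` are hypothetical names.

```typescript
// Cosine similarity between two equal-length vectors; returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical record pairing an image id with its embedding.
interface IndexedImage {
  id: string;
  embedding: number[];
}

// Return a new array sorted from most to least similar to the query.
function rankBySimilarity(query: number[], images: IndexedImage[]): IndexedImage[] {
  return images.slice().sort(
    (x, y) => cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding),
  );
}
```

Real CLIP embeddings have hundreds of dimensions, and large collections usually call for a vector index instead of a full sort, but the similarity measure is the same.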
Examples
- "Read the text from this receipt"
- "What does this handwritten note say?"
- "Extract the table data from this PDF page"
- "OCR this screenshot and return the error message"