vision-ocr

Vision & OCR

Extract text from images, documents, and handwritten notes using a progressive 3-tier pipeline: local OCR (PaddleOCR / Tesseract) -> local vision models (TrOCR, Florence-2) -> cloud vision LLM (GPT-4o, Claude, Gemini).

High-Level API: performOCR()

For one-shot text extraction, use the top-level performOCR() function. It handles input resolution, pipeline lifecycle, and cleanup automatically.
```typescript
import { performOCR } from '@framers/agentos';

const result = await performOCR({
  image: '/path/to/receipt.png',   // file path, URL, base64, or Buffer
  strategy: 'progressive',         // 'progressive' | 'local-only' | 'cloud-only'
  confidenceThreshold: 0.7,        // min confidence before escalating tier
});

console.log(result.text);       // extracted text
console.log(result.confidence); // 0–1 score
console.log(result.tier);       // 'ocr' | 'handwriting' | 'document-ai' | 'cloud-vision'
console.log(result.provider);   // 'paddle' | 'tesseract' | 'openai' | etc.
console.log(result.regions);    // bounding boxes (when available)
```

When to use performOCR() vs VisionPipeline

| Use case | Recommendation |
|---|---|
| One-shot text extraction from a single image | performOCR(): simplest API |
| Batch processing many images | VisionPipeline: create once, reuse, dispose when done |
| Need CLIP embeddings or document layout | VisionPipeline: richer result shape |
| Quick scripts and integrations | performOCR(): zero boilerplate |
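
The "create once, reuse, dispose when done" pattern recommended for batch work can be sketched generically. The DisposablePipeline interface and the processBatch helper below are hypothetical stand-ins and are not the VisionPipeline API; check the AgentOS reference for the real constructor and method names before adapting this.

```typescript
// Hypothetical interface: stands in for any pipeline object that is
// expensive to create and must be released explicitly.
interface DisposablePipeline<T> {
  process(image: string): Promise<T>;
  dispose(): Promise<void>;
}

// Reuse a single pipeline instance for every image, and release it
// even if one of the images throws.
async function processBatch<T>(
  pipeline: DisposablePipeline<T>,
  images: string[],
): Promise<T[]> {
  const results: T[] = [];
  try {
    for (const image of images) {
      results.push(await pipeline.process(image));
    }
  } finally {
    await pipeline.dispose();
  }
  return results;
}
```

Note that this helper takes ownership of the pipeline and disposes it at the end, so the caller should not reuse the instance afterwards.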

Progressive Tier System

The pipeline tries the cheapest, fastest tier first and escalates only when the result's confidence falls below confidenceThreshold:
  1. Tier 1 — Local OCR (PaddleOCR or Tesseract.js): Fast, free, offline. Handles printed text in documents, receipts, screenshots.
  2. Tier 2 — Local Vision Models (TrOCR / Florence-2): Still offline. Handles handwritten notes, complex document layouts with tables and figures.
  3. Tier 3 — Cloud Vision LLM (GPT-4o / Claude / Gemini): Best quality. Handles photographs, diagrams, mixed content, anything the local tiers can't confidently read.
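
The escalation described above can be sketched as a simple loop. This is an illustration under assumptions, not the AgentOS implementation: the Tier and TierResult types and the runProgressive function are hypothetical, and returning the best result seen when no tier reaches the threshold is a design choice made for this sketch.

```typescript
// Hypothetical shapes for illustration; not the AgentOS types.
interface TierResult {
  text: string;
  confidence: number; // score from 0 to 1
}

interface Tier {
  name: string;
  run: (image: Uint8Array) => Promise<TierResult>;
}

type TaggedResult = TierResult & { tier: string };

// Try each tier in order (cheapest first) and stop at the first result
// whose confidence meets the threshold. If no tier gets there, fall back
// to the highest-confidence result seen (an assumption of this sketch).
async function runProgressive(
  image: Uint8Array,
  tiers: Tier[],
  confidenceThreshold = 0.7,
): Promise<TaggedResult> {
  if (tiers.length === 0) throw new Error('at least one tier is required');
  let best: TaggedResult | undefined;
  for (const tier of tiers) {
    const result: TaggedResult = { ...(await tier.run(image)), tier: tier.name };
    if (result.confidence >= confidenceThreshold) return result;
    if (!best || result.confidence > best.confidence) best = result;
  }
  return best!;
}
```

Later tiers are never invoked once an earlier tier is confident enough, which is what keeps the progressive approach cheap for clean printed text.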

Strategy Selection

  • 'progressive' (default): Start local, escalate only if needed. Best cost/quality balance for most use cases.
  • 'local-only': Never call cloud APIs. Use for air-gapped environments, privacy-sensitive data (medical records, financial docs), or when no API keys are available.
  • 'cloud-only': Skip local tiers entirely and send straight to a cloud vision LLM. Use when you need the highest quality output and cost is not a concern.
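
One way to picture the three strategies is as a mapping from strategy to the ordered list of tiers that may run. The sketch below is an assumption-laden illustration: the TierName values and the tiersForStrategy function are hypothetical, and it assumes 'local-only' may still escalate from local OCR to local vision models.

```typescript
type Strategy = 'progressive' | 'local-only' | 'cloud-only';

// Hypothetical tier labels for this sketch; the library's own
// result.tier values may differ.
type TierName = 'local-ocr' | 'local-vision' | 'cloud-vision';

// Return the tiers a strategy is allowed to use, cheapest first.
function tiersForStrategy(strategy: Strategy = 'progressive'): TierName[] {
  switch (strategy) {
    case 'progressive':
      return ['local-ocr', 'local-vision', 'cloud-vision'];
    case 'local-only':
      // Never includes a cloud tier, so no API key is needed.
      return ['local-ocr', 'local-vision'];
    case 'cloud-only':
      return ['cloud-vision'];
  }
}
```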

Input Formats

performOCR() accepts four input types:
  • File path: '/tmp/scan.png' (read from disk)
  • URL: 'https://example.com/receipt.jpg' (fetched via HTTP)
  • Base64 string: raw base64 or a data:image/png;base64,... data URI (decoded in memory)
  • Buffer: raw image bytes (passed directly to the pipeline)
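
To make the distinction between the four input types concrete, here is a minimal sketch of how a string-or-bytes input could be classified. The classifyInput function and its heuristics are hypothetical and are not taken from the library; in particular, telling raw base64 apart from a file path is inherently a guess, so the length cutoff below is an arbitrary assumption.

```typescript
type InputKind = 'buffer' | 'url' | 'data-uri' | 'base64' | 'file-path';

// Classify an image input. Node's Buffer is a Uint8Array subclass, so
// any non-string input is treated as raw bytes.
function classifyInput(image: string | Uint8Array): InputKind {
  if (typeof image !== 'string') return 'buffer';
  if (/^https?:\/\//i.test(image)) return 'url';
  if (/^data:image\/[a-z0-9.+-]+;base64,/i.test(image)) return 'data-uri';
  // Heuristic (assumption): a long string made only of base64 characters,
  // with valid padding and length, is treated as raw base64.
  const looksLikeBase64 =
    image.length >= 64 &&
    image.length % 4 === 0 &&
    /^[A-Za-z0-9+/]+={0,2}$/.test(image);
  if (looksLikeBase64) return 'base64';
  return 'file-path';
}
```

A path that happens to contain only base64 characters and is 64 or more characters long would be misclassified by this heuristic, which is one reason to prefer data URIs or a Buffer when the input type must be unambiguous.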

Capabilities

  • Printed text OCR: Extract text from documents, receipts, screenshots, PDFs
  • Handwriting recognition: Read handwritten notes and forms via TrOCR
  • Document layout understanding: Parse tables, figures, headings via Florence-2
  • Bounding box regions: Spatial text locations for overlay rendering
  • Image embeddings: Generate CLIP vectors for semantic image search (via VisionPipeline only)

Examples

  • "Read the text from this receipt"
  • "What does this handwritten note say?"
  • "Extract the table data from this PDF page"
  • "OCR this screenshot and return the error message"