image-ocr
Image OCR Expert
Expert in extracting, processing, and structuring text from images using OCR tools and techniques.
Description
This skill provides specialized knowledge for extracting text from images, including:
- Tool and library selection by use case (Tesseract, EasyOCR, PaddleOCR, cloud APIs)
- Image preprocessing to maximize OCR accuracy
- Post-processing and structuring of extracted text
- Handling handwriting, receipts, invoices, documents, screenshots
- Multilingual OCR and special character support
- Integration into Python/Node.js/cloud pipelines
Triggers: ocr, extract text from image, image to text, read text image, optical character recognition, tesseract, easyocr, paddleocr, textract, vision api, document extraction, screenshot text, invoice ocr, receipt ocr, handwriting recognition, image text extraction
Tool Selection Guide
| Tool | Best For | Languages | Accuracy | Cost |
|---|---|---|---|---|
| Tesseract | Local, simple docs, print text | 100+ | Medium | Free |
| EasyOCR | Local, photos, multiple scripts | 80+ | High | Free |
| PaddleOCR | Local, CJK languages, tables | 80+ | Very High | Free |
| Google Vision API | Cloud, complex docs, handwriting | All | Excellent | Pay-per-use |
| AWS Textract | Cloud, forms, tables, invoices | Limited | Excellent | Pay-per-use |
| Azure Computer Vision | Cloud, general OCR | 164 | Excellent | Pay-per-use |
| Surya | Local, multilingual PDFs | 90+ | High | Free |
| Docling | Local, PDFs, structured output | Many | High | Free |
Decision Tree
Is accuracy critical and budget available?
├─ YES → Google Vision API or AWS Textract
└─ NO → Local solution
    ├─ CJK (Chinese/Japanese/Korean) or tables? → PaddleOCR
    ├─ General photos or multiple languages? → EasyOCR
    ├─ Simple printed English docs? → Tesseract
    └─ PDF documents with structure? → Docling or Surya
Python Implementations
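Before the per-engine implementations, the decision tree can be expressed as a small selector function. This is only a sketch: the function name `choose_ocr_engine` and its boolean parameters are hypothetical, and treating Tesseract as the fall-through default is an assumption, since the tree does not say which branch wins when none of the questions applies.

```python
def choose_ocr_engine(
    accuracy_critical: bool,
    has_budget: bool,
    cjk_or_tables: bool = False,
    photos_or_multilingual: bool = False,
    structured_pdf: bool = False,
) -> str:
    """Return the engine(s) suggested by the decision tree for the given traits."""
    # Top-level question: is accuracy critical AND is there budget for a cloud API?
    if accuracy_critical and has_budget:
        return "Google Vision API or AWS Textract"
    # Otherwise pick a local engine, checking branches in the tree's order.
    if cjk_or_tables:
        return "PaddleOCR"
    if photos_or_multilingual:
        return "EasyOCR"
    if structured_pdf:
        return "Docling or Surya"
    # Assumed default: simple printed documents.
    return "Tesseract"


print(choose_ocr_engine(accuracy_critical=False, has_budget=False, cjk_or_tables=True))
```

Keeping the selection in one function makes it easy to log why an engine was chosen and to adjust the branch order per project.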
Tesseract (pytesseract)
```python
import pytesseract
from PIL import Image


def extract_text_tesseract(image_path: str, lang: str = "eng") -> str:
    """Extract text using Tesseract. Best for clean printed documents."""
    image = Image.open(image_path)
    # --psm 6: assume a single uniform block of text
    # --oem 3: default OCR engine mode
    config = "--psm 6 --oem 3"
    text = pytesseract.image_to_string(image, lang=lang, config=config)
    return text.strip()


def extract_with_confidence(image_path: str) -> list[dict]:
    """Extract words with bounding boxes and confidence scores (0-100)."""
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    results = []
    for i, word in enumerate(data["text"]):
        # conf can be an int, a float, or a numeric string depending on the
        # pytesseract version, so parse it as float; -1 marks non-word rows.
        conf = float(data["conf"][i])
        if word.strip() and conf > 30:
            results.append({
                "text": word,
                "confidence": conf,
                "bbox": {
                    "x": data["left"][i],
                    "y": data["top"][i],
                    "width": data["width"][i],
                    "height": data["height"][i],
                },
            })
    return results
```
Install: pip install pytesseract pillow
System: apt install tesseract-ocr (Linux) / brew install tesseract (Mac)
EasyOCR
```python
import easyocr


def extract_text_easyocr(
    image_path: str,
    languages: list[str] | None = None,
    detail: bool = False,
) -> str | list[dict]:
    """
    Extract text using EasyOCR. Best for photos and multiple languages.
    languages: ['en'], ['en', 'es'], ['ch_sim', 'en'], etc. Defaults to ['en'].
    """
    # Use None as the default to avoid a shared mutable default argument.
    languages = languages or ["en"]
    # Creating a Reader loads the models; reuse one Reader for many images.
    reader = easyocr.Reader(languages, gpu=False)  # gpu=True if CUDA available
    results = reader.readtext(image_path)
    if not detail:
        # Each result is (bbox, text, confidence); bbox[0] is the top-left corner.
        # Sort by its y coordinate to approximate top-to-bottom reading order.
        results_sorted = sorted(results, key=lambda r: r[0][0][1])
        return "\n".join(text for _, text, conf in results_sorted if conf > 0.3)
    return [
        {
            "text": text,
            "confidence": round(float(conf), 3),
            "bbox": bbox,
        }
        for bbox, text, conf in results
    ]
```
Install: pip install easyocr
PaddleOCR (best for CJK and tables)
```python
from paddleocr import PaddleOCR


def extract_text_paddle(
    image_path: str,
    lang: str = "en",  # "en", "ch", "japan", "korean", "es", etc.
    use_angle_cls: bool = True,
) -> str:
    """Extract text using PaddleOCR. Best for CJK and structured documents."""
    # These arguments follow the PaddleOCR 2.x API; later major versions may
    # use different constructor arguments and a different result format.
    ocr = PaddleOCR(use_angle_cls=use_angle_cls, lang=lang, show_log=False)
    # Run the angle classifier only if it was loaded.
    result = ocr.ocr(image_path, cls=use_angle_cls)
    lines = []
    if result and result[0]:
        # Each item is [bbox, (text, confidence)].
        # Sort by the y position of the top-left corner (top to bottom).
        items = sorted(result[0], key=lambda item: item[0][0][1])
        lines = [item[1][0] for item in items if item[1][1] > 0.3]
    return "\n".join(lines)
```
Install: pip install paddlepaddle paddleocr
Google Vision API
```python
from google.cloud import vision


def extract_text_google_vision(image_path: str) -> dict:
    """
    Extract text using Google Vision API.
    Requires: GOOGLE_APPLICATION_CREDENTIALS env var set.
    """
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as image_file:
        content = image_file.read()
    image = vision.Image(content=content)
    # Full text detection (better for documents)
    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(f"Vision API error: {response.error.message}")
    document = response.full_text_annotation
    return {
        "text": document.text,
        "pages": [
            {
                "blocks": [
                    {
                        # Join symbols into words first, then join words with
                        # spaces; joining symbols directly would yield "H e l l o".
                        "text": " ".join(
                            "".join(symbol.text for symbol in word.symbols)
                            for para in block.paragraphs
                            for word in para.words
                        ),
                        "confidence": block.confidence,
                    }
                    for block in page.blocks
                ]
            }
            for page in document.pages
        ],
    }
```
Install: pip install google-cloud-vision
AWS Textract (best for forms and invoices)
```python
import boto3


def extract_text_textract(image_path: str, region: str = "us-east-1") -> dict:
    """
    Extract text lines and form fields using AWS Textract.
    TABLES is requested as well, but this function only parses LINE and
    KEY_VALUE_SET blocks; TABLE/CELL blocks are not parsed here.
    """
    client = boto3.client("textract", region_name=region)
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["TABLES", "FORMS"],
    )
    blocks = response["Blocks"]
    # Build the Id lookup once instead of on every helper call.
    block_map = {b["Id"]: b for b in blocks}
    # Extract raw text
    lines = [b["Text"] for b in blocks if b["BlockType"] == "LINE"]
    # Extract key-value pairs (forms)
    key_blocks = [
        b for b in blocks
        if b["BlockType"] == "KEY_VALUE_SET" and "KEY" in b.get("EntityTypes", [])
    ]
    key_values = {}
    for key_block in key_blocks:
        key_text = _get_text_from_block(key_block, block_map)
        for rel in key_block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for val_id in rel["Ids"]:
                    value_block = block_map.get(val_id)
                    if value_block and "VALUE" in value_block.get("EntityTypes", []):
                        key_values[key_text] = _get_text_from_block(value_block, block_map)
    return {
        "text": "\n".join(lines),
        "form_fields": key_values,
    }


def _get_text_from_block(block: dict, block_map: dict) -> str:
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = block_map.get(child_id)
                if child and child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)
```
Install: pip install boto3
---
Image Preprocessing
Preprocessing is often the largest single factor in OCR accuracy, especially for Tesseract and EasyOCR. Apply the steps below selectively, based on the image type, before running OCR.
```python
import cv2
import numpy as np
from PIL import Image, ImageEnhance


def preprocess_for_ocr(image_path: str, output_path: str | None = None) -> np.ndarray:
    """
    Full preprocessing pipeline for OCR.
    Apply selectively based on image type.
    """
    img = cv2.imread(image_path)
    if img is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(f"Could not read image: {image_path}")
    # 1. Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # 2. Upscale small images to about 2000 px wide (OCR works better at 300+ DPI)
    height, width = gray.shape
    if width < 1000:
        scale = 2000 / width
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)
    # 3. Deskew (fix rotation)
    gray = deskew(gray)
    # 4. Denoise
    denoised = cv2.fastNlMeansDenoising(gray, h=10)
    # 5. Binarization (choose one based on lighting)
    # Option A: Otsu (uniform lighting)
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Option B: Adaptive (uneven lighting, shadows)
    # binary = cv2.adaptiveThreshold(denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    #                                cv2.THRESH_BINARY, 11, 2)
    # 6. Morphological cleanup (remove noise dots).
    # With a 1x1 kernel this is effectively a no-op; increase to (2, 2) to remove
    # isolated specks, at the risk of thinning fine strokes.
    kernel = np.ones((1, 1), np.uint8)
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    if output_path:
        cv2.imwrite(output_path, cleaned)
    return cleaned


def deskew(image: np.ndarray) -> np.ndarray:
    """Correct small rotations using the minimum-area rectangle around text pixels."""
    # Text must be the foreground: invert and binarize so dark text is non-zero.
    _, thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # minAreaRect needs 32-bit points; np.where returns 64-bit indices.
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    if len(coords) == 0:
        return image
    angle = cv2.minAreaRect(coords)[-1]
    # The angle range returned by minAreaRect differs between OpenCV versions.
    # Normalize into [-45, 45] before negating; verify the sign on a
    # known-skewed sample for your OpenCV version.
    if angle < -45:
        angle += 90
    elif angle > 45:
        angle -= 90
    angle = -angle
    if abs(angle) < 0.5:  # Skip if nearly straight
        return image
    h, w = image.shape
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    return cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)


def enhance_contrast(image_path: str) -> Image.Image:
    """Enhance contrast using PIL; useful for faded text."""
    img = Image.open(image_path).convert("L")
    enhancer = ImageEnhance.Contrast(img)
    return enhancer.enhance(2.0)
```
Install: pip install opencv-python pillow
Preprocessing Decision Guide
| Image Problem | Solution |
|---|---|
| Rotated/skewed text | `deskew()` |
| Low resolution | Upscale 2x with `cv2.INTER_CUBIC` |
| Uneven lighting/shadows | Adaptive thresholding |
| Uniform background | Otsu thresholding |
| Noisy/grainy | `cv2.fastNlMeansDenoising()` |
| Faded text | PIL `ImageEnhance.Contrast` |
| Color background | Convert to grayscale first |
| Handwriting | Skip binarization, use cloud API |
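The guide above can be turned into a small planner that maps observed image problems to an ordered list of steps. This is a sketch under stated assumptions: the problem keys, step labels, and the function name `plan_preprocessing` are hypothetical, and the step order simply mirrors the order used in `preprocess_for_ocr`.

```python
# Hypothetical labels: problem key -> preprocessing step, following the guide.
PROBLEM_TO_STEP = {
    "color_background": "grayscale",
    "low_resolution": "upscale",
    "skewed": "deskew",
    "noisy": "denoise",
    "faded": "enhance_contrast",
    "uneven_lighting": "adaptive_threshold",
    "uniform_background": "otsu_threshold",
}

# Assumed application order when several steps are selected.
STEP_ORDER = [
    "grayscale",
    "upscale",
    "deskew",
    "denoise",
    "enhance_contrast",
    "adaptive_threshold",
    "otsu_threshold",
]


def plan_preprocessing(problems: set[str]) -> list[str]:
    """Return the preprocessing steps for the given problems, in application order."""
    unknown = problems - set(PROBLEM_TO_STEP) - {"handwriting"}
    if unknown:
        raise ValueError(f"Unknown image problems: {sorted(unknown)}")
    steps = {PROBLEM_TO_STEP[p] for p in problems if p in PROBLEM_TO_STEP}
    if "handwriting" in problems:
        # The guide recommends skipping binarization for handwriting.
        steps -= {"adaptive_threshold", "otsu_threshold"}
    return [step for step in STEP_ORDER if step in steps]


print(plan_preprocessing({"noisy", "skewed", "uniform_background"}))
```

Returning step names rather than calling OpenCV directly keeps the planning logic easy to test without image fixtures.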
PDF to Text Extraction
```python
import io

import fitz  # PyMuPDF: native text extraction and page rendering
import pytesseract
from PIL import Image


def extract_pdf_text(pdf_path: str, ocr_fallback: bool = True) -> str:
    """
    Smart PDF extraction:
    - Uses native text layer if available (fast, accurate)
    - Falls back to OCR for scanned pages
    """
    doc = fitz.open(pdf_path)
    full_text = []
    try:
        for page in doc:
            # Try native text extraction first
            text = page.get_text().strip()
            # Pages with very little native text are treated as scanned.
            if len(text) > 50:
                full_text.append(text)
            elif ocr_fallback:
                # Scanned page: render at 300 DPI and OCR it in memory,
                # avoiding hard-coded temporary file paths.
                pix = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                full_text.append(pytesseract.image_to_string(image).strip())
    finally:
        doc.close()
    return "\n\n".join(full_text)
```
Install: pip install PyMuPDF pytesseract pillow
System: apt install tesseract-ocr (Linux); poppler-utils is needed only if you render pages with pdf2image instead of PyMuPDF
---
Post-Processing Extracted Text
```python
import re


def clean_ocr_text(text: str) -> str:
    """Standard cleanup for OCR output."""
    # Remove control characters but keep newlines, tabs, and non-ASCII text
    # (stripping everything outside ASCII would destroy multilingual output).
    text = re.sub(r"[\x00-\x08\x0B-\x1F\x7F]", "", text)
    # Normalize whitespace
    text = re.sub(r" +", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Fix common OCR misreads. These are heuristics; review them against your data.
    corrections = {
        r"\b0(?=[a-zA-Z])": "O",   # letter O misread as digit 0 before a letter
        r"(?<=[a-zA-Z])0\b": "O",  # letter O misread as digit 0 after a letter
        r"\bl\b": "I",             # standalone lowercase l, usually a misread I
        # A blanket "rn" -> "m" rule is omitted: it would corrupt words such as "learn".
    }
    for pattern, replacement in corrections.items():
        text = re.sub(pattern, replacement, text)
    return text.strip()


def extract_structured_data(text: str) -> dict:
    """Extract common structured fields from OCR text."""
    patterns = {
        "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
        "phone": r"[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4,6}",
        "date": r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",
        "amount": r"\$\s?\d+(?:,\d{3})*(?:\.\d{2})?",
        "url": r"https?://[^\s]+",
    }
    return {
        field: re.findall(pattern, text)
        for field, pattern in patterns.items()
    }


def merge_multiline_words(text: str) -> str:
    """Fix hyphenated words split across lines (common in PDFs)."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
```
Node.js / TypeScript
```typescript
// Using Tesseract.js (pure JS, no native deps needed)
import Tesseract from "tesseract.js";

async function extractText(imagePath: string, lang = "eng"): Promise<string> {
  const { data } = await Tesseract.recognize(imagePath, lang, {
    logger: () => {}, // suppress progress logs
  });
  return data.text.trim();
}

// With confidence filtering (Tesseract.js reports confidence on a 0-100 scale).
// Depending on the tesseract.js version, data.words may need to be requested
// explicitly or read from data.blocks.
async function extractWithConfidence(imagePath: string) {
  const { data } = await Tesseract.recognize(imagePath, "eng");
  return data.words
    .filter((word) => word.confidence > 70)
    .map((word) => ({
      text: word.text,
      confidence: word.confidence,
      bbox: word.bbox,
    }));
}
// Install: npm install tesseract.js
```
```typescript
// Using Google Vision API from Node.js
import vision from "@google-cloud/vision";

const client = new vision.ImageAnnotatorClient();

async function extractTextCloud(imagePath: string): Promise<string> {
  const [result] = await client.documentTextDetection(imagePath);
  return result.fullTextAnnotation?.text ?? "";
}
// Install: npm install @google-cloud/vision
```
Claude Vision API for OCR
Use Claude's vision capability when you need structured extraction + understanding:
```python
import base64
import json
from pathlib import Path

import anthropic


def extract_with_claude(image_path: str, instruction: str | None = None) -> str:
    """
    Use Claude to extract and structure text from an image.
    Best when you need semantic understanding, not just raw text.
    """
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    image_data = base64.standard_b64encode(Path(image_path).read_bytes()).decode()
    ext = Path(image_path).suffix.lower()
    media_types = {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".webp": "image/webp",
    }
    # Unknown extensions fall back to JPEG.
    media_type = media_types.get(ext, "image/jpeg")
    prompt = instruction or (
        "Extract ALL text from this image exactly as it appears. "
        "Preserve the original structure, line breaks, and formatting. "
        "Return only the extracted text, nothing else."
    )
    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    )
    return message.content[0].text


# Example: structured invoice extraction
def extract_invoice(image_path: str) -> dict:
    result = extract_with_claude(
        image_path,
        instruction="""Extract all data from this invoice and return as JSON:
{
  "invoice_number": "",
  "date": "",
  "vendor": {"name": "", "address": "", "email": ""},
  "items": [{"description": "", "quantity": 0, "unit_price": 0, "total": 0}],
  "subtotal": 0,
  "tax": 0,
  "total": 0
}
Return only valid JSON, no explanation.""",
    )
    # The reply is expected to be bare JSON; json.loads raises ValueError otherwise.
    return json.loads(result)
```
When to Use Claude vs Traditional OCR
| Scenario | Use Claude | Use Traditional OCR |
|---|---|---|
| Extract + understand structure | ✅ | ❌ |
| Invoice/receipt parsing | ✅ | ❌ (Textract is also good) |
| Handwriting with context | ✅ | ❌ |
| Large volume (1000s of images) | ❌ (cost) | ✅ |
| Simple raw text extraction | ❌ (overkill) | ✅ |
| Tables with complex structure | ✅ | PaddleOCR / Textract |
| Real-time / low latency | ❌ | ✅ |
Accuracy Benchmarks by Image Type
| Image Type | Tesseract | EasyOCR | PaddleOCR | Google Vision |
|---|---|---|---|---|
| Printed documents (clean) | 95% | 97% | 97% | 99% |
| Screenshots | 90% | 95% | 95% | 98% |
| Photos of documents | 70% | 88% | 90% | 97% |
| Handwriting | 40% | 55% | 55% | 85% |
| Low res / blurry | 45% | 70% | 72% | 80% |
| Receipts / invoices | 75% | 85% | 88% | 97% |
| Chinese/Japanese/Korean | 60%* | 85% | 95% | 99% |
*Requires additional language pack installation
Common Errors and Fixes
Tesseract returns garbage text
- Cause: Image too small or too noisy
- Fix: Upscale 2x, apply denoising and binarization
EasyOCR misses text in columns
- Cause: Default layout analysis fails on multi-column
- Fix: Crop each column separately and OCR individually
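The column-cropping fix can be sketched as follows. This is a minimal illustration, assuming equal-width columns; the helper names `column_boxes` and `ocr_columns` and the `overlap` parameter are hypothetical. The OCR call is injected as a function so the splitting logic can be checked without loading a model.

```python
from typing import Callable

Box = tuple[int, int, int, int]  # (left, upper, right, lower), as used by PIL's Image.crop


def column_boxes(width: int, height: int, n_columns: int, overlap: int = 10) -> list[Box]:
    """Split an image of the given size into n vertical strips with a small overlap."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    col_width = width / n_columns
    boxes = []
    for i in range(n_columns):
        # Widen each strip slightly so characters on a boundary are not cut in half.
        left = max(0, round(i * col_width) - overlap)
        right = min(width, round((i + 1) * col_width) + overlap)
        boxes.append((left, 0, right, height))
    return boxes


def ocr_columns(boxes: list[Box], ocr_region: Callable[[Box], str]) -> str:
    """Run ocr_region on each box from left to right and join the column texts."""
    return "\n\n".join(ocr_region(box) for box in boxes)


boxes = column_boxes(width=1000, height=500, n_columns=2)
print(boxes)
```

With Pillow and EasyOCR, `ocr_region` could crop the page with `img.crop(box)`, convert the crop to a NumPy array, and pass it to `reader.readtext`. Real layouts often have unequal columns, so detecting the gutters may work better than an even split.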
PaddleOCR slow on CPU
- Cause: Large model loaded
- Fix: Use `use_gpu=True` if available, or `use_angle_cls=False` for horizontal text
Bounding boxes don't align with text
- Cause: Image was rotated before OCR
- Fix: Apply `deskew()` in preprocessing
Cloud API returns empty for some regions
- Cause: Low contrast or very small text
- Fix: Preprocess image, increase DPI, crop region of interest
PDF text layer has wrong encoding
- Cause: Non-standard font embedding
- Fix: Use `fitz.Page.get_text("rawdict")` to inspect encoding, or skip to OCR fallback
Quick Start Templates
Minimal local OCR (Python)
```bash
pip install easyocr
python -c "import easyocr; r=easyocr.Reader(['en']); print('\n'.join([t for _,t,c in r.readtext('image.png') if c>0.3]))"
```
Minimal local OCR (Node.js)
```bash
npm install tesseract.js
node -e "const T=require('tesseract.js'); T.recognize('image.png','eng').then(r=>console.log(r.data.text))"
```
Batch processing pipeline
```python
from pathlib import Path

import easyocr

# Load the model once and reuse it for every image.
reader = easyocr.Reader(["en"], gpu=False)
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def batch_ocr(folder: str, output_folder: str) -> None:
    Path(output_folder).mkdir(parents=True, exist_ok=True)
    # pathlib's glob does not expand {a,b} braces, so filter by suffix instead.
    images = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    for img_path in images:
        results = reader.readtext(str(img_path))
        text = "\n".join(t for _, t, c in results if c > 0.3)
        out_path = Path(output_folder) / f"{img_path.stem}.txt"
        out_path.write_text(text, encoding="utf-8")
        print(f"✓ {img_path.name} → {out_path.name}")
    print(f"\nProcessed {len(images)} images.")


batch_ocr("./images", "./output")
```
Rules
- Select the OCR engine based on the document type and accuracy requirements before writing code: Tesseract for local/offline simple documents, EasyOCR for photos and multilingual text, PaddleOCR for CJK and tables, cloud APIs (Google Vision, AWS Textract) for handwriting and production accuracy on structured documents
- Image preprocessing (grayscale conversion, binarization, deskew) is required before Tesseract and EasyOCR for non-ideal inputs — skipping it causes significant accuracy degradation
- OCR output must always be treated as unvalidated text — apply post-processing (regex, string normalization) before using extracted values in business logic
- Never pass sensitive document images to cloud OCR APIs without confirming data privacy and compliance requirements with the project owner
- Confidence scores from the OCR engine must be checked; results below the project-defined threshold must be flagged for human review rather than accepted automatically