ocr-document-processor
OCR Document Processor
Extract text from images, scanned PDFs, and photographs using Optical Character Recognition (OCR). Supports multiple languages, structured output formats, and intelligent document parsing.
Core Capabilities
- Image OCR: Extract text from PNG, JPEG, TIFF, BMP images
- PDF OCR: Process scanned PDFs page by page
- Multi-language: Support for 100+ languages
- Structured Output: Plain text, Markdown, JSON, or HTML
- Table Detection: Extract tabular data to CSV/JSON
- Batch Processing: Process multiple documents at once
- Quality Assessment: Confidence scoring for OCR results
Quick Start
```python
from scripts.ocr_processor import OCRProcessor

# Simple text extraction
processor = OCRProcessor("document.png")
text = processor.extract_text()
print(text)

# Extract to structured format
result = processor.extract_structured()
print(result['text'])
print(result['confidence'])
print(result['blocks'])  # Text blocks with positions
```
Core Workflow
1. Basic Text Extraction
```python
from scripts.ocr_processor import OCRProcessor

# From image
processor = OCRProcessor("scan.png")
text = processor.extract_text()

# From PDF
processor = OCRProcessor("scanned.pdf")
text = processor.extract_text()  # All pages

# Specific pages
text = processor.extract_text(pages=[1, 2, 3])
```
2. Structured Extraction
```python
# Get detailed results
result = processor.extract_structured()

# Result contains:
# - text: Full extracted text
# - blocks: Text blocks with bounding boxes
# - lines: Individual lines
# - words: Individual words with confidence
# - confidence: Overall confidence score
# - language: Detected language
```
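Block positions make it possible to put text back into reading order. The sketch below is only an illustration: it assumes each block is a dict with a `bbox` of `[x, y, width, height]` (the shape shown in the JSON export section), and `reading_order` and `line_tolerance` are hypothetical names, not part of `OCRProcessor`.

```python
def reading_order(blocks, line_tolerance=10):
    """Sort blocks top-to-bottom, then left-to-right.

    Blocks whose top edges fall into the same `line_tolerance`-pixel band
    are treated as one row. This is a crude bucketing, not layout analysis.
    """
    def sort_key(block):
        x, y, _width, _height = block["bbox"]
        return (y // line_tolerance, x)

    return sorted(blocks, key=sort_key)


# Hypothetical sample data in the assumed block shape
sample_blocks = [
    {"text": "right column", "bbox": [300, 52, 200, 40]},
    {"text": "footer", "bbox": [20, 700, 480, 30]},
    {"text": "left column", "bbox": [20, 50, 200, 40]},
]

ordered = [block["text"] for block in reading_order(sample_blocks)]
print(ordered)
```

Multi-column pages usually need real column detection; bucketing by row is only adequate for simple layouts.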
3. Export Formats
```python
# Export to Markdown
processor.export_markdown("output.md")

# Export to JSON
processor.export_json("output.json")

# Export to searchable PDF
processor.export_searchable_pdf("searchable.pdf")

# Export to HTML
processor.export_html("output.html")
```
Language Support
```python
# Specify language for better accuracy
processor = OCRProcessor("german_doc.png", lang='deu')

# Multiple languages
processor = OCRProcessor("mixed_doc.png", lang='eng+fra+deu')

# Auto-detect language
processor = OCRProcessor("document.png", lang='auto')
```
Supported Languages (Common)
| Code | Language | Code | Language |
|---|---|---|---|
| eng | English | fra | French |
| deu | German | spa | Spanish |
| ita | Italian | por | Portuguese |
| rus | Russian | chi_sim | Chinese (Simplified) |
| chi_tra | Chinese (Traditional) | jpn | Japanese |
| kor | Korean | ara | Arabic |
| hin | Hindi | nld | Dutch |
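Multi-language strings such as `'eng+fra+deu'` are easy to mistype. A small helper can validate codes against the table above before they reach the processor. This is a sketch; `COMMON_LANGS` and `build_lang` are hypothetical names, and the dictionary covers only the common codes listed here.

```python
# Codes copied from the table above; the full language list is much longer.
COMMON_LANGS = {
    "eng": "English", "fra": "French", "deu": "German", "spa": "Spanish",
    "ita": "Italian", "por": "Portuguese", "rus": "Russian",
    "chi_sim": "Chinese (Simplified)", "chi_tra": "Chinese (Traditional)",
    "jpn": "Japanese", "kor": "Korean", "ara": "Arabic",
    "hin": "Hindi", "nld": "Dutch",
}


def build_lang(codes):
    """Join language codes with '+', rejecting codes not in COMMON_LANGS."""
    unique = list(dict.fromkeys(codes))  # drop duplicates, keep order
    unknown = [code for code in unique if code not in COMMON_LANGS]
    if unknown:
        raise ValueError(f"Unknown language code(s): {', '.join(unknown)}")
    return "+".join(unique)


lang = build_lang(["eng", "fra", "deu", "eng"])
print(lang)
```

The result can then be passed as the `lang` argument, for example `OCRProcessor("mixed_doc.png", lang=lang)`.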
Image Preprocessing
Preprocessing improves OCR accuracy on low-quality images.
```python
# Enable preprocessing
processor = OCRProcessor("noisy_scan.png")
processor.preprocess(
    deskew=True,     # Fix rotation
    denoise=True,    # Remove noise
    threshold=True,  # Binarize image
    contrast=1.5     # Enhance contrast
)
text = processor.extract_text()
```
Available Preprocessing Options
| Option | Description | Default |
|---|---|---|
| deskew | Correct skewed/rotated images | False |
| denoise | Remove noise and artifacts | False |
| threshold | Convert to black/white | False |
| | 'otsu', 'adaptive', 'simple' | 'otsu' |
| contrast | Contrast factor (1.0 = no change) | 1.0 |
| | Sharpen factor (0 = none) | 0 |
| | Upscale factor for small text | 1.0 |
| | Remove shadow artifacts | False |
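The default thresholding method, 'otsu', picks the cut-off that best separates dark from light pixels. The sketch below shows the idea in pure Python on a flat list of 8-bit grayscale values. It illustrates Otsu's method in general and is not the processor's own implementation, which may well rely on OpenCV; `otsu_threshold` and `binarize` are hypothetical names.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximizes between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg = 0  # number of pixels at or below the candidate threshold
    sum_bg = 0     # sum of their intensities
    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255), the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]


# Synthetic "image": dark ink around 30-40, light paper around 200-220
pixels = [30] * 50 + [40] * 50 + [200] * 60 + [220] * 40
threshold = otsu_threshold(pixels)
binary = binarize(pixels, threshold)
print(threshold, sorted(set(binary)))
```

Otsu's method works well when the histogram has two clear peaks. Unevenly lit scans tend to do better with the 'adaptive' method, which computes thresholds per region.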
Table Extraction
```python
# Extract tables from document
tables = processor.extract_tables()

# Each table is a list of rows
for table in tables:
    for row in table:
        print(row)

# Export tables to CSV
processor.export_tables_csv("tables/")

# Export to JSON
processor.export_tables_json("tables.json")
```
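Because each table is a list of rows, it can also be serialized with the standard library when custom handling is needed. The sketch below assumes every row is a list of cell strings; `table_to_csv` is a hypothetical helper, not part of the processor.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


# Hypothetical extracted table; note the comma inside a cell
sample_table = [
    ["Item", "Qty", "Price"],
    ["Paper, A4", "2", "4.50"],
    ["Ink", "1", "19.99"],
]

csv_text = table_to_csv(sample_table)
print(csv_text)
```

The `csv` module quotes cells that contain commas, which matters for OCR output where punctuation inside cells is common.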
PDF Processing
Multi-Page PDFs
```python
# Process all pages
processor = OCRProcessor("document.pdf")
full_text = processor.extract_text()

# Process specific pages
page_3 = processor.extract_text(pages=[3])

# Get per-page results
results = processor.extract_by_page()
for page_num, text in results.items():
    print(f"Page {page_num}: {len(text)} characters")
```
Create Searchable PDF
```python
# Convert scanned PDF to searchable PDF
processor = OCRProcessor("scanned.pdf")
processor.export_searchable_pdf("searchable.pdf")
```
Batch Processing
```python
from scripts.ocr_processor import batch_ocr

# Process directory of images
results = batch_ocr(
    input_dir="scans/",
    output_dir="extracted/",
    output_format="markdown",
    lang="eng",
    recursive=True
)
print(f"Processed: {results['success']} files")
print(f"Failed: {results['failed']} files")
```
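Before running a batch it can help to preview which files would be picked up. The sketch below mirrors the `recursive` flag using `pathlib`. `find_documents` and the extension set are assumptions made for illustration; how `batch_ocr` actually selects files is not documented here.

```python
from pathlib import Path

# Assumed set of input types, based on the formats listed under Core Capabilities
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".pdf"}


def find_documents(input_dir, recursive=True):
    """Return supported files under input_dir, sorted by path."""
    root = Path(input_dir)
    pattern = "**/*" if recursive else "*"
    return sorted(
        path
        for path in root.glob(pattern)
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_documents("scans/")` returns a list of `Path` objects. Comparing its length with `results['success'] + results['failed']` is a quick sanity check after a run.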
Receipt/Document Parsing
Receipt Extraction
```python
# Parse receipt structure
processor = OCRProcessor("receipt.jpg")
receipt_data = processor.parse_receipt()

# Returns structured data:
# - vendor: Store name
# - date: Transaction date
# - items: List of items with prices
# - subtotal: Subtotal amount
# - tax: Tax amount
# - total: Total amount
```
Business Card Parsing
```python
# Extract business card info
processor = OCRProcessor("card.jpg")
contact = processor.parse_business_card()

# Returns:
# - name: Person's name
# - title: Job title
# - company: Company name
# - email: Email addresses
# - phone: Phone numbers
# - address: Physical address
# - website: Website URLs
```
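Fields such as `email` and `phone` lend themselves to pattern matching on the raw OCR text. The sketch below shows one way such extraction could work. It is not the parser behind `parse_business_card()`; the patterns are deliberately loose and `extract_contacts` is a hypothetical name.

```python
import re

# Loose patterns for illustration; real-world cards need more care
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{6,}\d")


def extract_contacts(text):
    """Pull email addresses and phone numbers out of OCR text."""
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
    }


# Hypothetical OCR output from a business card
sample_text = (
    "Jane Doe\n"
    "Senior Engineer\n"
    "Acme Corp\n"
    "jane.doe@example.com\n"
    "+1 (555) 010-2030\n"
)

contacts = extract_contacts(sample_text)
print(contacts)
```

OCR often confuses characters such as `l` and `1` or `O` and `0`, so matches taken from low-confidence regions deserve a manual check.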
Configuration
```python
processor = OCRProcessor("document.png")

# Configure OCR settings
processor.config.update({
    'psm': 3,              # Page segmentation mode
    'oem': 3,              # OCR engine mode
    'dpi': 300,            # DPI for processing
    'timeout': 30,         # Timeout in seconds
    'min_confidence': 60,  # Minimum word confidence
})
```
Page Segmentation Modes (PSM)
| Mode | Description |
|---|---|
| 0 | Orientation and script detection only |
| 1 | Automatic page segmentation with OSD |
| 3 | Fully automatic page segmentation (default) |
| 4 | Assume single column of text |
| 6 | Assume single uniform block of text |
| 7 | Treat image as single text line |
| 8 | Treat image as single word |
| 11 | Sparse text. Find as much text as possible |
| 12 | Sparse text with OSD |
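The `psm`, `oem`, and `dpi` settings correspond to Tesseract command-line flags of the same names. The sketch below shows how a settings dictionary might be turned into a flag string of the kind pytesseract accepts through its `config` argument. Whether `OCRProcessor` builds its options this way is an assumption, and `tesseract_flags` is a hypothetical helper.

```python
def tesseract_flags(settings):
    """Build a Tesseract flag string from the settings that map to CLI flags."""
    flags = []
    for key in ("psm", "oem", "dpi"):
        if key in settings:
            flags.append(f"--{key} {settings[key]}")
    return " ".join(flags)


settings = {
    "psm": 6,              # Single uniform block of text
    "oem": 3,
    "dpi": 300,
    "timeout": 30,         # Not a Tesseract flag, so it is skipped here
    "min_confidence": 60,  # Applied after recognition, not by the engine
}

print(tesseract_flags(settings))
```

For a cropped single-line field such as an invoice number, mode 7 is usually a better fit than the default mode 3.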
Quality Assessment
```python
# Get confidence scores
result = processor.extract_structured()

# Overall confidence (0-100)
print(f"Confidence: {result['confidence']}%")

# Per-word confidence
for word in result['words']:
    print(f"{word['text']}: {word['confidence']}%")

# Filter low-confidence words
high_conf_words = [w for w in result['words'] if w['confidence'] > 80]
```
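Per-word scores can be rolled up into a quick quality report, for example to decide whether a page needs preprocessing or manual review. The sketch below works on a list shaped like `result['words']`. The helper name and the threshold of 60 (borrowed from the `min_confidence` setting) are illustrative assumptions.

```python
from statistics import mean


def summarize_confidence(words, min_confidence=60):
    """Summarize word confidences and list words below the threshold."""
    scores = [word["confidence"] for word in words]
    low = [word["text"] for word in words if word["confidence"] < min_confidence]
    return {
        "mean": round(mean(scores), 1) if scores else 0.0,
        "low_confidence": low,
        "low_ratio": len(low) / len(words) if words else 0.0,
    }


# Hypothetical words in the shape returned by extract_structured()
sample_words = [
    {"text": "Invoice", "confidence": 96},
    {"text": "N0.", "confidence": 41},
    {"text": "2024-001", "confidence": 88},
    {"text": "Tota1", "confidence": 55},
]

summary = summarize_confidence(sample_words)
print(summary)
```

A high `low_ratio` on an otherwise clean scan often points to a wrong `lang` setting rather than poor image quality.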
Output Formats
Markdown Export
```python
processor.export_markdown("output.md")
```
Output includes:
- Document title (if detected)
- Structured headings
- Paragraphs
- Tables (as Markdown tables)
- Page breaks for multi-page docs
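To show what "tables as Markdown tables" means in practice, the sketch below renders a list of rows in the same pipe-table style this document uses. It assumes the first row is the header and that the table is non-empty. `to_markdown_table` is a hypothetical helper, not the exporter's actual code.

```python
def to_markdown_table(rows):
    """Render rows (first row = header) as a Markdown pipe table."""
    header, *body = rows

    def format_row(cells):
        # Escape pipes so cell text cannot break the table structure
        escaped = (str(cell).replace("|", "\\|") for cell in cells)
        return "| " + " | ".join(escaped) + " |"

    lines = [format_row(header), "|" + "---|" * len(header)]
    lines.extend(format_row(row) for row in body)
    return "\n".join(lines)


sample_rows = [
    ["Item", "Price"],
    ["Coffee", "3.50"],
    ["Bagel", "2.25"],
]

markdown = to_markdown_table(sample_rows)
print(markdown)
```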
JSON Export
```python
processor.export_json("output.json")
```
Output structure:
```json
{
  "source": "document.pdf",
  "pages": 5,
  "language": "eng",
  "confidence": 92.5,
  "text": "Full extracted text...",
  "blocks": [
    {
      "type": "paragraph",
      "text": "Block text...",
      "bbox": [x, y, width, height],
      "confidence": 95.2
    }
  ],
  "tables": [...]
}
```
HTML Export
```python
processor.export_html("output.html")
```
Creates styled HTML with:
- Preserved layout approximation
- Highlighted low-confidence regions
- Embedded images (optional)
- Print-friendly styling
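Highlighting low-confidence regions can be as simple as wrapping the affected words in a `<mark>` element. The sketch below illustrates that idea on a list shaped like `result['words']`. The markup, threshold, and function name are assumptions, not the exporter's actual output.

```python
from html import escape


def highlight_low_confidence(words, min_confidence=60):
    """Render words as one HTML paragraph, marking low-confidence ones."""
    parts = []
    for word in words:
        text = escape(word["text"])  # OCR text may contain <, >, or &
        if word["confidence"] < min_confidence:
            parts.append(
                f'<mark title="confidence {word["confidence"]}">{text}</mark>'
            )
        else:
            parts.append(text)
    return "<p>" + " ".join(parts) + "</p>"


# Hypothetical recognized words
sample_words = [
    {"text": "Total", "confidence": 93},
    {"text": "<$l2.50>", "confidence": 38},
]

html_snippet = highlight_low_confidence(sample_words)
print(html_snippet)
```

Escaping matters here because recognized text is untrusted input and can contain characters that would otherwise be read as markup.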
CLI Usage
```bash
# Basic extraction
python ocr_processor.py image.png -o output.txt

# Extract to markdown
python ocr_processor.py document.pdf -o output.md --format markdown

# Specify language
python ocr_processor.py german.png --lang deu

# Batch processing
python ocr_processor.py scans/ -o extracted/ --batch

# With preprocessing
python ocr_processor.py noisy.png --preprocess --deskew --denoise
```
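For readers wiring up a similar command line, the flags shown above map naturally onto `argparse`. The sketch below is a reconstruction from those examples only. The `--output` long name, the defaults, and the `--format` choices are assumptions, and the real script may define more options.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the CLI examples."""
    parser = argparse.ArgumentParser(prog="ocr_processor.py")
    parser.add_argument("input", help="Image, PDF, or directory (with --batch)")
    parser.add_argument("-o", "--output", help="Output file or directory")
    parser.add_argument(
        "--format",
        default="text",  # assumed default
        choices=["text", "markdown", "json", "html"],  # assumed choices
    )
    parser.add_argument("--lang", default="eng")  # assumed default
    parser.add_argument("--batch", action="store_true")
    parser.add_argument("--preprocess", action="store_true")
    parser.add_argument("--deskew", action="store_true")
    parser.add_argument("--denoise", action="store_true")
    return parser


# Parse the last example from the commands above
args = build_parser().parse_args(
    ["noisy.png", "--preprocess", "--deskew", "--denoise"]
)
print(args)
```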
Error Handling
```python
from scripts.ocr_processor import OCRProcessor, OCRError

try:
    processor = OCRProcessor("document.png")
    text = processor.extract_text()
except OCRError as e:
    print(f"OCR failed: {e}")
except FileNotFoundError:
    print("File not found")
```
Performance Tips
- Image Quality: Higher resolution (300+ DPI) improves accuracy
- Preprocessing: Use for low-quality scans
- Language: Specifying language improves speed and accuracy
- PSM Mode: Choose appropriate mode for document type
- Large Files: Process PDFs page by page for memory efficiency
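The last tip can be applied by feeding the `pages` argument in small batches instead of extracting a whole PDF at once. The sketch below generates such batches. It assumes 1-based page numbers, as the `pages=[1, 2, 3]` examples suggest, and `page_batches` is a hypothetical helper.

```python
def page_batches(total_pages, batch_size=10):
    """Yield lists of 1-based page numbers, batch_size pages at a time."""
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        yield list(range(start, end + 1))


batches = list(page_batches(25, batch_size=10))
print([(batch[0], batch[-1]) for batch in batches])

# Intended use with the processor (not executed here):
# for batch in page_batches(total_pages):
#     text = processor.extract_text(pages=batch)
```

Writing each batch's text to disk before moving on keeps peak memory roughly proportional to the batch size rather than to the document length.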
Limitations
- Handwritten text: Limited accuracy
- Complex layouts: May lose structure
- Very low quality: Preprocessing helps but has limits
- Non-Latin scripts: Require specific language packs
Dependencies
```
pytesseract>=0.3.10
Pillow>=10.0.0
PyMuPDF>=1.23.0
opencv-python>=4.8.0
numpy>=1.24.0
```
System Requirements
- Tesseract OCR engine must be installed
- Language data files for non-English languages
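Both requirements can be checked up front. The sketch below looks for the `tesseract` binary on `PATH` and asks it for its installed language data with `--list-langs`. The helper name is hypothetical, and the output parsing assumes the usual format of one header line followed by one code per line.

```python
import shutil
import subprocess


def tesseract_languages():
    """Return installed Tesseract language codes, or None if Tesseract is missing."""
    executable = shutil.which("tesseract")
    if executable is None:
        return None
    completed = subprocess.run(
        [executable, "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    if completed.returncode != 0:
        return []
    # Older releases print the list to stderr instead of stdout
    lines = (completed.stdout or completed.stderr).splitlines()
    return [line.strip() for line in lines[1:] if line.strip()]


languages = tesseract_languages()
if languages is None:
    print("Tesseract is not installed or not on PATH")
else:
    print(f"Installed language data: {', '.join(languages) or 'none found'}")
```

Running this before a batch job avoids discovering a missing language pack halfway through.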