ocr-document-processor

OCR Document Processor

Extract text from images, scanned PDFs, and photographs using Optical Character Recognition (OCR). Supports multiple languages, structured output formats, and intelligent document parsing.

Core Capabilities

  • Image OCR: Extract text from PNG, JPEG, TIFF, BMP images
  • PDF OCR: Process scanned PDFs page by page
  • Multi-language: Support for 100+ languages
  • Structured Output: Plain text, Markdown, JSON, or HTML
  • Table Detection: Extract tabular data to CSV/JSON
  • Batch Processing: Process multiple documents at once
  • Quality Assessment: Confidence scoring for OCR results

Quick Start

python
from scripts.ocr_processor import OCRProcessor

# Simple text extraction
processor = OCRProcessor("document.png")
text = processor.extract_text()
print(text)

# Extract to structured format
result = processor.extract_structured()
print(result['text'])
print(result['confidence'])
print(result['blocks'])  # Text blocks with positions

Core Workflow

1. Basic Text Extraction

python
from scripts.ocr_processor import OCRProcessor

# From image
processor = OCRProcessor("scan.png")
text = processor.extract_text()

# From PDF
processor = OCRProcessor("scanned.pdf")
text = processor.extract_text()  # All pages

# Specific pages
text = processor.extract_text(pages=[1, 2, 3])

2. Structured Extraction

python
# Get detailed results
result = processor.extract_structured()

# Result contains:
# - text: Full extracted text
# - blocks: Text blocks with bounding boxes
# - lines: Individual lines
# - words: Individual words with confidence
# - confidence: Overall confidence score
# - language: Detected language
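To make the result fields concrete, here is a minimal sketch of a dictionary with the shape listed above. The sample values, the inner `bbox` key, and the `summarize` helper are assumptions for illustration (the `bbox` layout is borrowed from the JSON export example); the actual output of `extract_structured()` may differ.

```python
from statistics import mean

# Hypothetical sample mirroring the documented fields; values are illustrative only.
sample_result = {
    "text": "Invoice 2024",
    "blocks": [{"text": "Invoice 2024", "bbox": [10, 12, 180, 24], "confidence": 93.5}],
    "lines": ["Invoice 2024"],
    "words": [
        {"text": "Invoice", "confidence": 96.0},
        {"text": "2024", "confidence": 91.0},
    ],
    "confidence": 93.5,
    "language": "eng",
}


def summarize(result: dict) -> dict:
    """Return basic counts and the mean word confidence for a result dict."""
    words = result.get("words", [])
    return {
        "n_blocks": len(result.get("blocks", [])),
        "n_lines": len(result.get("lines", [])),
        "n_words": len(words),
        "mean_word_confidence": mean(w["confidence"] for w in words) if words else None,
    }


summary = summarize(sample_result)
print(summary)
```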

3. Export Formats

python
# Export to Markdown
processor.export_markdown("output.md")

# Export to JSON
processor.export_json("output.json")

# Export to searchable PDF
processor.export_searchable_pdf("searchable.pdf")

# Export to HTML
processor.export_html("output.html")

Language Support

python
# Specify language for better accuracy
processor = OCRProcessor("german_doc.png", lang='deu')

# Multiple languages
processor = OCRProcessor("mixed_doc.png", lang='eng+fra+deu')

# Auto-detect language
processor = OCRProcessor("document.png", lang='auto')

Supported Languages (Common)

| Code | Language | Code | Language |
|------|----------|------|----------|
| eng | English | fra | French |
| deu | German | spa | Spanish |
| ita | Italian | por | Portuguese |
| rus | Russian | chi_sim | Chinese (Simplified) |
| chi_tra | Chinese (Traditional) | jpn | Japanese |
| kor | Korean | ara | Arabic |
| hin | Hindi | nld | Dutch |
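Multi-language strings such as `'eng+fra+deu'` are easy to mistype, so it can help to build them from a list. The sketch below is one possible approach; `COMMON_LANGS` and `build_lang` are hypothetical names, and the set only contains the codes from the table above, not everything a given Tesseract install may support.

```python
# Codes copied from the table above; a real install may support more or fewer.
COMMON_LANGS = {
    "eng", "fra", "deu", "spa", "ita", "por", "rus",
    "chi_sim", "chi_tra", "jpn", "kor", "ara", "hin", "nld",
}


def build_lang(codes: list[str]) -> str:
    """Join language codes with '+', rejecting codes not in COMMON_LANGS."""
    if not codes:
        raise ValueError("At least one language code is required")
    unknown = [c for c in codes if c not in COMMON_LANGS]
    if unknown:
        raise ValueError(f"Unsupported language code(s): {', '.join(unknown)}")
    # Drop duplicates while preserving the caller's order.
    return "+".join(dict.fromkeys(codes))


lang = build_lang(["eng", "fra", "deu"])
print(lang)
```

The resulting string can then be passed as the `lang` argument shown earlier.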

Image Preprocessing

Preprocessing improves OCR accuracy on low-quality images.

python
# Enable preprocessing
processor = OCRProcessor("noisy_scan.png")
processor.preprocess(
    deskew=True,     # Fix rotation
    denoise=True,    # Remove noise
    threshold=True,  # Binarize image
    contrast=1.5     # Enhance contrast
)
text = processor.extract_text()

Available Preprocessing Options

| Option | Description | Default |
|--------|-------------|---------|
| deskew | Correct skewed/rotated images | False |
| denoise | Remove noise and artifacts | False |
| threshold | Convert to black/white | False |
| threshold_method | 'otsu', 'adaptive', 'simple' | 'otsu' |
| contrast | Contrast factor (1.0 = no change) | 1.0 |
| sharpen | Sharpen factor (0 = none) | 0 |
| scale | Upscale factor for small text | 1.0 |
| remove_shadows | Remove shadow artifacts | False |
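The default `threshold_method` of `'otsu'` refers to Otsu's method, which picks the cutoff that best separates dark and light pixels. The pure-Python sketch below shows the idea on a flat list of 8-bit grayscale values. It is an illustration only, not the processor's implementation, which likely delegates to an image library such as OpenCV; `otsu_threshold` and `binarize` are hypothetical names.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the Otsu threshold (0-255) for a flat list of 8-bit grayscale values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_score = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]          # number of pixels at or below t
        if w_bg == 0:
            continue
        w_fg = total - w_bg      # number of pixels above t
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Proportional to the between-class variance, which Otsu's method maximizes.
        score = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(pixels: list[int], threshold: int) -> list[int]:
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]


sample_pixels = [10] * 5 + [200] * 5   # dark text pixels and light background pixels
t = otsu_threshold(sample_pixels)
binary = binarize(sample_pixels, t)
```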

Table Extraction

python
# Extract tables from document
tables = processor.extract_tables()

# Each table is a list of rows
for table in tables:
    for row in table:
        print(row)

# Export tables to CSV
processor.export_tables_csv("tables/")

# Export to JSON
processor.export_tables_json("tables.json")
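Because each table is a list of rows, it maps directly onto the standard `csv` module. The following sketch serializes one table to CSV text; `table_to_csv` and the sample data are hypothetical, and the files written by `export_tables_csv` may be laid out differently.

```python
import csv
import io


def table_to_csv(table: list[list[str]]) -> str:
    """Serialize one table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


# Hypothetical table, shaped like one element of extract_tables().
sample_table = [
    ["Item", "Qty", "Price"],
    ["Widget, large", "2", "9.99"],
]
csv_text = table_to_csv(sample_table)
print(csv_text)
```

The `csv` writer quotes cells that contain commas, which matters for OCR output where cell text is unpredictable.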

PDF Processing

Multi-Page PDFs

python
# Process all pages
processor = OCRProcessor("document.pdf")
full_text = processor.extract_text()

# Process specific pages
page_3 = processor.extract_text(pages=[3])

# Get per-page results
results = processor.extract_by_page()
for page_num, text in results.items():
    print(f"Page {page_num}: {len(text)} characters")
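Per-page results make it easy to spot pages where OCR produced little or no text, for example blank pages or failed scans. This sketch assumes the mapping of page number to text shown above; `find_sparse_pages` and the 20-character cutoff are arbitrary choices for illustration.

```python
def find_sparse_pages(pages: dict[int, str], min_chars: int = 20) -> list[int]:
    """Return page numbers whose stripped text is shorter than min_chars."""
    return sorted(n for n, text in pages.items() if len(text.strip()) < min_chars)


# Hypothetical per-page output, shaped like the result of extract_by_page().
sample_pages = {
    1: "Quarterly report for the finance team.",
    2: "   \n",
    3: "Fig. 1",
}
sparse = find_sparse_pages(sample_pages)
print(sparse)
```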

Create Searchable PDF

python
# Convert scanned PDF to searchable PDF
processor = OCRProcessor("scanned.pdf")
processor.export_searchable_pdf("searchable.pdf")

Batch Processing

python
from scripts.ocr_processor import batch_ocr

# Process directory of images
results = batch_ocr(
    input_dir="scans/",
    output_dir="extracted/",
    output_format="markdown",
    lang="eng",
    recursive=True
)

print(f"Processed: {results['success']} files")
print(f"Failed: {results['failed']} files")
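The `recursive` flag suggests a file-discovery step before any OCR runs. A minimal sketch of that step using only `pathlib` is shown below. The extension set is an assumption derived from the formats listed under Core Capabilities plus PDF, and `find_inputs` is a hypothetical helper, not part of `batch_ocr`.

```python
from pathlib import Path
import tempfile

# Assumed extension list: image formats from Core Capabilities, plus PDF.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".pdf"}


def find_inputs(input_dir: str, recursive: bool = True) -> list[Path]:
    """Return supported files under input_dir, sorted for a stable processing order."""
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested").mkdir()
    for name in ("a.png", "notes.txt", "nested/b.PDF"):
        (Path(tmp) / name).touch()
    found_all = [p.name for p in find_inputs(tmp)]
    found_top = [p.name for p in find_inputs(tmp, recursive=False)]
```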

Receipt/Document Parsing

Receipt Extraction

python
# Parse receipt structure
processor = OCRProcessor("receipt.jpg")
receipt_data = processor.parse_receipt()

# Returns structured data:
# - vendor: Store name
# - date: Transaction date
# - items: List of items with prices
# - subtotal: Subtotal amount
# - tax: Tax amount
# - total: Total amount
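OCR often misreads digits, so parsed receipt amounts are worth cross-checking. The sketch below verifies that the item prices add up to the subtotal and that subtotal plus tax equals the total. The per-item keys `name` and `price`, the numeric value types, and the `check_receipt` helper are assumptions; the real `parse_receipt()` output may use a different layout.

```python
from decimal import Decimal


def check_receipt(receipt: dict, tolerance: str = "0.01") -> list[str]:
    """Return a list of human-readable inconsistencies; empty if the amounts agree."""
    tol = Decimal(tolerance)
    # Convert through str() so float inputs such as 2.25 become exact decimals.
    items_sum = sum(
        (Decimal(str(item["price"])) for item in receipt["items"]), Decimal("0")
    )
    subtotal = Decimal(str(receipt["subtotal"]))
    tax = Decimal(str(receipt["tax"]))
    total = Decimal(str(receipt["total"]))

    problems = []
    if abs(items_sum - subtotal) > tol:
        problems.append(f"items sum {items_sum} != subtotal {subtotal}")
    if abs(subtotal + tax - total) > tol:
        problems.append(f"subtotal + tax {subtotal + tax} != total {total}")
    return problems


# Hypothetical parsed receipt with consistent amounts.
good_receipt = {
    "vendor": "Example Cafe",
    "date": "2024-01-15",
    "items": [{"name": "Coffee", "price": 3.50}, {"name": "Bagel", "price": 2.25}],
    "subtotal": 5.75,
    "tax": 0.46,
    "total": 6.21,
}
# Same receipt with two digits of the total transposed, as an OCR error might do.
bad_receipt = {**good_receipt, "total": 6.12}

good_problems = check_receipt(good_receipt)
bad_problems = check_receipt(bad_receipt)
```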

Business Card Parsing

python
# Extract business card info
processor = OCRProcessor("card.jpg")
contact = processor.parse_business_card()

# Returns:
# - name: Person's name
# - title: Job title
# - company: Company name
# - email: Email addresses
# - phone: Phone numbers
# - address: Physical address
# - website: Website URLs
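Fields such as email, phone, and website follow recognizable patterns, so they can be pulled from raw OCR text with regular expressions. The sketch below shows one simplified way to do that; the patterns are deliberately loose and `extract_contact_fields` is a hypothetical helper, not how `parse_business_card()` is known to work.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: optional '+', then at least 7 characters made of digits
# and common separators, starting and ending on a digit, all on one line.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{5,}\d")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s,]+", re.IGNORECASE)


def extract_contact_fields(text: str) -> dict:
    """Pull email addresses, phone numbers, and URLs out of free-form OCR text."""
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
        "website": URL_RE.findall(text),
    }


# Hypothetical OCR text from a business card.
sample_text = """Jane Doe
Senior Engineer, Example Corp
jane.doe@example.com
+1 (555) 010-2030
www.example.com"""

fields = extract_contact_fields(sample_text)
print(fields)
```

Names, titles, and addresses have no such fixed shape and usually need layout cues or heuristics beyond regular expressions.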

Configuration

python
processor = OCRProcessor("document.png")

# Configure OCR settings
processor.config.update({
    'psm': 3,              # Page segmentation mode
    'oem': 3,              # OCR engine mode
    'dpi': 300,            # DPI for processing
    'timeout': 30,         # Timeout in seconds
    'min_confidence': 60,  # Minimum word confidence
})
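The `psm`, `oem`, and `dpi` settings share their names with Tesseract's `--psm`, `--oem`, and `--dpi` command-line options. Assuming the processor forwards them to Tesseract (the source does not confirm this), the translation could look like the sketch below; `build_tesseract_config` is a hypothetical helper.

```python
def build_tesseract_config(settings: dict) -> str:
    """Translate the 'psm', 'oem', and 'dpi' settings into Tesseract CLI flags."""
    flags = []
    for key in ("psm", "oem", "dpi"):
        if key in settings:
            flags.append(f"--{key} {int(settings[key])}")
    return " ".join(flags)


settings = {"psm": 3, "oem": 3, "dpi": 300, "timeout": 30, "min_confidence": 60}
config_string = build_tesseract_config(settings)
print(config_string)
```

A string like this is what pytesseract accepts through its `config` argument. `timeout` and `min_confidence` are left out because they are not Tesseract flags: pytesseract takes `timeout` as a separate argument, and a confidence cutoff would have to be applied to the results afterwards.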

Page Segmentation Modes (PSM)

| Mode | Description |
|------|-------------|
| 0 | Orientation and script detection (OSD) only |
| 1 | Automatic page segmentation with OSD |
| 3 | Fully automatic page segmentation (default) |
| 4 | Assume single column of text |
| 6 | Assume single uniform block of text |
| 7 | Treat image as single text line |
| 8 | Treat image as single word |
| 11 | Sparse text. Find as much text as possible |
| 12 | Sparse text with OSD |
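Named constants can make PSM choices easier to read than bare integers. The sketch below encodes the modes from the table as an enum and maps a few document kinds to a starting mode. The kind labels and the mapping are assumptions meant as a starting point, not recommendations from the library.

```python
from enum import IntEnum


class PSM(IntEnum):
    """Page segmentation modes from the table above."""
    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO = 3
    SINGLE_COLUMN = 4
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    SPARSE_TEXT = 11
    SPARSE_TEXT_OSD = 12


# Hypothetical document-kind labels mapped to a reasonable starting mode.
PSM_BY_DOCUMENT_KIND = {
    "page": PSM.AUTO,
    "column": PSM.SINGLE_COLUMN,
    "paragraph": PSM.SINGLE_BLOCK,
    "caption": PSM.SINGLE_LINE,
    "label": PSM.SINGLE_WORD,
    "scattered": PSM.SPARSE_TEXT,
}


def choose_psm(kind: str) -> int:
    """Return the PSM value for a document kind, falling back to the default (3)."""
    return int(PSM_BY_DOCUMENT_KIND.get(kind, PSM.AUTO))


caption_mode = choose_psm("caption")
fallback_mode = choose_psm("unknown")
```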

Quality Assessment

python
# Get confidence scores
result = processor.extract_structured()

# Overall confidence (0-100)
print(f"Confidence: {result['confidence']}%")

# Per-word confidence
for word in result['words']:
    print(f"{word['text']}: {word['confidence']}%")

# Filter low-confidence words
high_conf_words = [w for w in result['words'] if w['confidence'] > 80]
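Per-word confidences can be rolled up into a simple quality report, for instance to decide whether a page needs preprocessing or manual review. This sketch assumes the word dictionaries shown above; `assess_quality` and the default cutoff of 60 (mirroring `min_confidence` from the Configuration section) are illustrative choices.

```python
def assess_quality(words: list[dict], min_confidence: float = 60.0) -> dict:
    """Summarize word confidences and list the words that fall below the cutoff."""
    if not words:
        return {"mean": 0.0, "low_ratio": 0.0, "low_words": []}
    low = [w["text"] for w in words if w["confidence"] < min_confidence]
    mean_conf = sum(w["confidence"] for w in words) / len(words)
    return {
        "mean": round(mean_conf, 1),
        "low_ratio": len(low) / len(words),
        "low_words": low,
    }


# Hypothetical words, shaped like result['words'].
sample_words = [
    {"text": "Total", "confidence": 95.0},
    {"text": "am0unt", "confidence": 41.0},
    {"text": "due", "confidence": 88.0},
    {"text": "2O24", "confidence": 56.0},
]
report = assess_quality(sample_words)
print(report)
```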

Output Formats

Markdown Export

python
processor.export_markdown("output.md")
Output includes:
  • Document title (if detected)
  • Structured headings
  • Paragraphs
  • Tables (as Markdown tables)
  • Page breaks for multi-page docs
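As an illustration of the "Tables (as Markdown tables)" item, the sketch below renders a list of rows as a Markdown pipe table. It treats the first row as the header, which is an assumption; `rows_to_markdown` is a hypothetical helper and the exporter's actual formatting may differ.

```python
def rows_to_markdown(rows: list[list[str]]) -> str:
    """Render rows as a Markdown pipe table, treating the first row as the header."""
    if not rows:
        return ""

    def fmt(row: list[str]) -> str:
        # Escape pipes so cell text cannot break the table structure.
        return "| " + " | ".join(cell.replace("|", "\\|") for cell in row) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])


markdown = rows_to_markdown([["Code", "Language"], ["eng", "English"], ["deu", "German"]])
print(markdown)
```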

JSON Export

python
processor.export_json("output.json")
Output structure:
json
{
  "source": "document.pdf",
  "pages": 5,
  "language": "eng",
  "confidence": 92.5,
  "text": "Full extracted text...",
  "blocks": [
    {
      "type": "paragraph",
      "text": "Block text...",
      "bbox": [x, y, width, height],
      "confidence": 95.2
    }
  ],
  "tables": [...]
}
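For readers who want to produce or consume this structure themselves, here is a minimal sketch that assembles a payload with the same keys and serializes it with the standard `json` module. `build_export` is a hypothetical helper, the overall confidence is computed as a plain mean of block confidences (an assumption), and the `bbox` values follow the `[x, y, width, height]` order shown above.

```python
import json


def build_export(source: str, pages: int, language: str, blocks: list[dict]) -> dict:
    """Assemble an export payload with the keys of the documented output structure."""
    confidences = [b["confidence"] for b in blocks]
    return {
        "source": source,
        "pages": pages,
        "language": language,
        # Assumption: overall confidence is the plain mean of block confidences.
        "confidence": round(sum(confidences) / len(confidences), 1) if confidences else 0.0,
        "text": "\n\n".join(b["text"] for b in blocks),
        "blocks": blocks,
        "tables": [],
    }


payload = build_export(
    "document.pdf",
    1,
    "eng",
    [
        {"type": "paragraph", "text": "First block.", "bbox": [40, 60, 500, 48], "confidence": 95.2},
        {"type": "paragraph", "text": "Second block.", "bbox": [40, 120, 500, 48], "confidence": 89.8},
    ],
)
serialized = json.dumps(payload, indent=2, ensure_ascii=False)
print(serialized)
```

Passing `ensure_ascii=False` keeps non-Latin OCR text readable in the output file instead of escaping it.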

HTML Export

python
processor.export_html("output.html")
Creates styled HTML with:
  • Preserved layout approximation
  • Highlighted low-confidence regions
  • Embedded images (optional)
  • Print-friendly styling
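To illustrate how low-confidence regions might be highlighted, the sketch below wraps words under a confidence cutoff in `<mark>` tags and escapes all text. The markup, the cutoff, and `words_to_html` are assumptions for illustration; the HTML that `export_html` actually produces is not documented here.

```python
from html import escape


def words_to_html(words: list[dict], min_confidence: float = 60.0) -> str:
    """Wrap low-confidence words in <mark> so reviewers can spot likely errors."""
    parts = []
    for w in words:
        text = escape(w["text"])  # OCR text may contain '<', '>' or '&'
        if w["confidence"] < min_confidence:
            parts.append(f'<mark title="confidence {w["confidence"]:.0f}%">{text}</mark>')
        else:
            parts.append(text)
    return "<p>" + " ".join(parts) + "</p>"


# Hypothetical words, shaped like result['words'].
html_snippet = words_to_html([
    {"text": "Total", "confidence": 95.0},
    {"text": "<5O>", "confidence": 41.0},
])
print(html_snippet)
```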

CLI Usage

bash
# Basic extraction
python ocr_processor.py image.png -o output.txt

# Extract to markdown
python ocr_processor.py document.pdf -o output.md --format markdown

# Specify language
python ocr_processor.py german.png --lang deu

# Batch processing
python ocr_processor.py scans/ -o extracted/ --batch

# With preprocessing
python ocr_processor.py noisy.png --preprocess --deskew --denoise
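The flags above could be declared with `argparse` roughly as follows. This is a sketch of a compatible parser, not the script's actual source. The long option `--output`, the default values, and the `--format` choices are assumptions inferred from the examples and the Structured Output list.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags used in the CLI examples."""
    parser = argparse.ArgumentParser(
        prog="ocr_processor.py", description="OCR a file or a directory."
    )
    parser.add_argument("input", help="Image, PDF, or directory (with --batch)")
    parser.add_argument("-o", "--output", help="Output file or directory")
    parser.add_argument(
        "--format", default="text", choices=["text", "markdown", "json", "html"]
    )
    parser.add_argument("--lang", default="eng", help="Language code(s), e.g. eng+fra")
    parser.add_argument("--batch", action="store_true", help="Treat input as a directory")
    parser.add_argument("--preprocess", action="store_true")
    parser.add_argument("--deskew", action="store_true")
    parser.add_argument("--denoise", action="store_true")
    return parser


args = build_parser().parse_args(["noisy.png", "--preprocess", "--deskew", "--denoise"])
print(args)
```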

Error Handling

python
from scripts.ocr_processor import OCRProcessor, OCRError

try:
    processor = OCRProcessor("document.png")
    text = processor.extract_text()
except OCRError as e:
    print(f"OCR failed: {e}")
except FileNotFoundError:
    print("File not found")
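When many files are processed, it can be convenient to turn exceptions into values so that one bad file does not stop the run. The sketch below shows that pattern. The `OCRError` class here is a local stand-in for the one exported by `scripts.ocr_processor`, and `safe_extract` and `fake_extract` are hypothetical helpers used only to exercise the error paths.

```python
from typing import Callable, Optional, Tuple


class OCRError(Exception):
    """Stand-in for the OCRError exported by scripts.ocr_processor."""


def safe_extract(
    extract: Callable[[str], str], path: str
) -> Tuple[Optional[str], Optional[str]]:
    """Call extract(path) and return (text, error_message); exactly one is None."""
    try:
        return extract(path), None
    except FileNotFoundError:
        return None, f"File not found: {path}"
    except OCRError as exc:
        return None, f"OCR failed: {exc}"


def fake_extract(path: str) -> str:
    # Hypothetical extractor that simulates the two failure modes.
    if path == "missing.png":
        raise FileNotFoundError(path)
    if path == "corrupt.png":
        raise OCRError("unreadable image data")
    return "hello"


outcomes = {p: safe_extract(fake_extract, p) for p in ("ok.png", "missing.png", "corrupt.png")}
```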

Performance Tips

  1. Image Quality: Higher resolution (300+ DPI) improves accuracy
  2. Preprocessing: Use for low-quality scans
  3. Language: Specifying language improves speed and accuracy
  4. PSM Mode: Choose appropriate mode for document type
  5. Large Files: Process PDFs page by page for memory efficiency
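The first tip can be connected to the `scale` preprocessing option: if a scan's resolution is known, the factor needed to reach roughly 300 DPI is a simple ratio. The sketch below computes it with a cap to avoid oversized images. `upscale_factor` and the cap of 4.0 are assumptions, and upscaling cannot restore detail that the scan never captured.

```python
def upscale_factor(current_dpi: int, target_dpi: int = 300, max_factor: float = 4.0) -> float:
    """Return the factor needed to reach target_dpi, clamped to [1.0, max_factor]."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return round(min(max(target_dpi / current_dpi, 1.0), max_factor), 2)


factor_150 = upscale_factor(150)   # half the target resolution
factor_72 = upscale_factor(72)     # screen-resolution image, hits the cap
factor_600 = upscale_factor(600)   # already above target, no upscaling
```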

Limitations

  • Handwritten text: Limited accuracy
  • Complex layouts: May lose structure
  • Very low quality: Preprocessing helps but has limits
  • Non-Latin scripts: Require specific language packs

Dependencies

pytesseract>=0.3.10
Pillow>=10.0.0
PyMuPDF>=1.23.0
opencv-python>=4.8.0
numpy>=1.24.0

System Requirements

  • Tesseract OCR engine must be installed
  • Language data files for non-English languages
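A quick way to check the first requirement before running any OCR is to look for the Tesseract binary on `PATH`. The sketch below uses only the standard library; `tesseract_available` is a hypothetical helper, and it assumes the executable is named `tesseract`, which is the usual name but can vary by installation.

```python
import shutil


def tesseract_available(executable: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


status = "found" if tesseract_available() else "missing"
print(f"Tesseract: {status}")
```

If the binary is present, running `tesseract --list-langs` in a shell shows which language data files are installed.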