OCR Document Processor

Extract text from images, scanned PDFs, and photographs using Optical Character Recognition (OCR). Supports multiple languages, structured output formats, and intelligent document parsing.

Core Capabilities

Image OCR: Extract text from PNG, JPEG, TIFF, BMP images
PDF OCR: Process scanned PDFs page by page
Multi-language: Support for 100+ languages
Structured Output: Plain text, Markdown, JSON, or HTML
Table Detection: Extract tabular data to CSV/JSON
Batch Processing: Process multiple documents at once
Quality Assessment: Confidence scoring for OCR results

Quick Start

python

from scripts.ocr_processor import OCRProcessor

# Simple text extraction
processor = OCRProcessor("document.png")
text = processor.extract_text()
print(text)

# Extract to structured format
result = processor.extract_structured()
print(result['text'])
print(result['confidence'])
print(result['blocks'])  # Text blocks with positions

Core Workflow

1. Basic Text Extraction

python

from scripts.ocr_processor import OCRProcessor

# From image
processor = OCRProcessor("scan.png")
text = processor.extract_text()

# From PDF
processor = OCRProcessor("scanned.pdf")
text = processor.extract_text()  # All pages

# Specific pages
text = processor.extract_text(pages=[1, 2, 3])

2. Structured Extraction

python

# Get detailed results
result = processor.extract_structured()

# Result contains:
# - text: Full extracted text
# - blocks: Text blocks with bounding boxes
# - lines: Individual lines
# - words: Individual words with confidence
# - confidence: Overall confidence score
# - language: Detected language

3. Export Formats

python

# Export to Markdown
processor.export_markdown("output.md")

# Export to JSON
processor.export_json("output.json")

# Export to searchable PDF
processor.export_searchable_pdf("searchable.pdf")

# Export to HTML
processor.export_html("output.html")

Language Support

python

# Specify language for better accuracy
processor = OCRProcessor("german_doc.png", lang='deu')

# Multiple languages
processor = OCRProcessor("mixed_doc.png", lang='eng+fra+deu')

# Auto-detect language
processor = OCRProcessor("document.png", lang='auto')

Supported Languages (Common)

Code	Language	Code	Language
eng	English	fra	French
deu	German	spa	Spanish
ita	Italian	por	Portuguese
rus	Russian	chi_sim	Chinese (Simplified)
chi_tra	Chinese (Traditional)	jpn	Japanese
kor	Korean	ara	Arabic
hin	Hindi	nld	Dutch
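
The multi-language example above joins codes with `+`. As a rough sketch of how such a string might be assembled and checked against the common codes in the table, the helper below can be used; `build_lang` and `COMMON_LANGS` are hypothetical names for illustration and are not part of `scripts.ocr_processor`.

```python
# Hypothetical helper, not part of scripts.ocr_processor.
# The set mirrors the "common" codes in the table above; it is not exhaustive.
COMMON_LANGS = {
    "eng", "fra", "deu", "spa", "ita", "por", "rus",
    "chi_sim", "chi_tra", "jpn", "kor", "ara", "hin", "nld",
}


def build_lang(codes):
    """Join language codes into an 'eng+fra' style string.

    Raises ValueError for an empty list or a code outside COMMON_LANGS.
    """
    if not codes:
        raise ValueError("At least one language code is required")
    unknown = [c for c in codes if c not in COMMON_LANGS]
    if unknown:
        raise ValueError(f"Unknown language code(s): {', '.join(unknown)}")
    # dict.fromkeys drops duplicates while keeping first-seen order
    return "+".join(dict.fromkeys(codes))


print(build_lang(["eng", "fra", "deu"]))
```

The result can be passed as the `lang` argument shown earlier. Validating up front gives a clearer error than a failure deep inside the OCR engine, at the cost of rejecting valid codes that are missing from the short list.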

Image Preprocessing

Preprocessing improves OCR accuracy on low-quality images.

python

# Enable preprocessing
processor = OCRProcessor("noisy_scan.png")
processor.preprocess(
    deskew=True,     # Fix rotation
    denoise=True,    # Remove noise
    threshold=True,  # Binarize image
    contrast=1.5     # Enhance contrast
)
text = processor.extract_text()

Available Preprocessing Options

Option	Description	Default
`deskew`	Correct skewed/rotated images	False
`denoise`	Remove noise and artifacts	False
`threshold`	Convert to black/white	False
`threshold_method`	'otsu', 'adaptive', 'simple'	'otsu'
`contrast`	Contrast factor (1.0 = no change)	1.0
`sharpen`	Sharpen factor (0 = none)	0
`scale`	Upscale factor for small text	1.0
`remove_shadows`	Remove shadow artifacts	False
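
To make the default `'otsu'` threshold method concrete, here is a minimal pure-Python sketch of Otsu's method on a flat list of 8-bit grayscale values. It illustrates the technique only: how `OCRProcessor.preprocess` implements thresholding is not stated in this document, and `otsu_threshold` and `binarize` are hypothetical names.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels <= t form one class (dark), pixels > t the other (light).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels in the dark class so far
    sum_dark = 0  # sum of intensities in the dark class so far
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map each pixel to 0 (black) or 255 (white) around threshold t."""
    return [255 if p > t else 0 for p in pixels]


# Synthetic "image": dark ink values around 10, light paper values around 200
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
print(t, binarize(pixels, t)[:3], binarize(pixels, t)[-3:])
```

Otsu's method suits scans with a clear ink/paper split; unevenly lit photographs are usually where the `'adaptive'` method is the better choice.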

Table Extraction

python

# Extract tables from document
tables = processor.extract_tables()

# Each table is a list of rows
for table in tables:
    for row in table:
        print(row)

# Export tables to CSV
processor.export_tables_csv("tables/")

# Export to JSON
processor.export_tables_json("tables.json")
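
Since each table is described as a list of rows, converting one to CSV by hand is straightforward with the standard library. The sketch below assumes each row is a list of cell strings; `table_to_csv` is a hypothetical helper, not the implementation behind `export_tables_csv`.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows of cell values) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


# Sample data standing in for one entry of processor.extract_tables()
sample = [["Item", "Price"], ["Coffee, large", "3.50"]]
print(table_to_csv(sample))
```

Using `csv.writer` rather than joining on commas matters for OCR output, because cells often contain commas or quotes that need escaping.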

PDF Processing

Multi-Page PDFs

python

# Process all pages
processor = OCRProcessor("document.pdf")
full_text = processor.extract_text()

# Process specific pages
page_3 = processor.extract_text(pages=[3])

# Get per-page results
results = processor.extract_by_page()
for page_num, text in results.items():
    print(f"Page {page_num}: {len(text)} characters")

Create Searchable PDF

python

# Convert scanned PDF to searchable PDF
processor = OCRProcessor("scanned.pdf")
processor.export_searchable_pdf("searchable.pdf")

Batch Processing

python

from scripts.ocr_processor import batch_ocr

# Process directory of images
results = batch_ocr(
    input_dir="scans/",
    output_dir="extracted/",
    output_format="markdown",
    lang="eng",
    recursive=True
)

print(f"Processed: {results['success']} files")
print(f"Failed: {results['failed']} files")
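
To preview which files a batch run with `recursive=True` would likely pick up, a small directory scan can help. This is a sketch under assumptions: the extension list is inferred from the formats named under Core Capabilities, and `find_inputs` is a hypothetical helper rather than the logic inside `batch_ocr`.

```python
from pathlib import Path

# Assumed from the supported formats listed earlier (PNG, JPEG, TIFF, BMP, PDF)
SUPPORTED = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".pdf"}


def find_inputs(input_dir, recursive=True):
    """Return a sorted list of files under input_dir with a supported extension."""
    root = Path(input_dir)
    pattern = "**/*" if recursive else "*"
    return sorted(
        p for p in root.glob(pattern)
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

Calling `find_inputs("scans/")` before the real batch run gives a file count to compare against the `success` and `failed` totals afterwards.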

Receipt/Document Parsing

Receipt Extraction

python

# Parse receipt structure
processor = OCRProcessor("receipt.jpg")
receipt_data = processor.parse_receipt()

# Returns structured data:
# - vendor: Store name
# - date: Transaction date
# - items: List of items with prices
# - subtotal: Subtotal amount
# - tax: Tax amount
# - total: Total amount
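
As an illustration of the kind of rule a receipt parser might apply to raw OCR text, the sketch below pulls the total amount with a regular expression. It is not the logic of `parse_receipt`, which this document does not describe; `find_total` and the sample text are hypothetical.

```python
import re

# Matches a line that starts with "total" (not "subtotal") and ends with an amount
TOTAL_RE = re.compile(
    r"^\s*total\b[^\d\n]*(\d+[.,]\d{2})\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def find_total(ocr_text):
    """Return the total as a float, or None if no total line is found."""
    match = TOTAL_RE.search(ocr_text)
    if match is None:
        return None
    # Accept a decimal comma as well as a decimal point
    return float(match.group(1).replace(",", "."))


sample = "ACME MART\nSubtotal 12.00\nTax 0.96\nTOTAL $12.96\n"
print(find_total(sample))
```

The pattern deliberately ignores thousands separators and currency codes after the amount; real receipts vary enough that a production parser would need more rules than this.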

Business Card Parsing

python

# Extract business card info
processor = OCRProcessor("card.jpg")
contact = processor.parse_business_card()

# Returns:
# - name: Person's name
# - title: Job title
# - company: Company name
# - email: Email addresses
# - phone: Phone numbers
# - address: Physical address
# - website: Website URLs
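
Email addresses and phone numbers are the fields most amenable to pattern matching. The following sketch shows one way they could be pulled from OCR text; it is an illustration under simplified patterns, not the implementation of `parse_business_card`, and `extract_contacts` is a hypothetical name.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional leading +, then digits mixed with spaces, dots, dashes, or parentheses.
# Only a literal space is allowed (no \s) so a match cannot span two lines.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{6,}\d")


def extract_contacts(ocr_text):
    """Return the email addresses and phone numbers found in OCR text."""
    return {
        "email": EMAIL_RE.findall(ocr_text),
        "phone": PHONE_RE.findall(ocr_text),
    }


card = "Jane Doe\nSenior Engineer\nExample Corp\njane.doe@example.com\n+1 (555) 010-2030\n"
print(extract_contacts(card))
```

Fields such as name, title, and company have no fixed shape, so they generally need layout or position cues rather than regular expressions.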

Configuration

python

processor = OCRProcessor("document.png")

# Configure OCR settings
processor.config.update({
    'psm': 3,              # Page segmentation mode
    'oem': 3,              # OCR engine mode
    'dpi': 300,            # DPI for processing
    'timeout': 30,         # Timeout in seconds
    'min_confidence': 60,  # Minimum word confidence
})

Page Segmentation Modes (PSM)

Mode	Description
0	Orientation and script detection only
1	Automatic page segmentation with OSD
3	Fully automatic page segmentation (default)
4	Assume single column of text
6	Assume single uniform block of text
7	Treat image as single text line
8	Treat image as single word
11	Sparse text. Find as much text as possible
12	Sparse text with OSD
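
The `psm`, `oem`, and `dpi` settings correspond to Tesseract command-line options of the same names (`--psm`, `--oem`, `--dpi`). The sketch below shows how a settings dict like the one above could be turned into a flag string; whether `OCRProcessor` does this internally is an assumption, and `tesseract_flags` is a hypothetical helper.

```python
def tesseract_flags(config):
    """Build a Tesseract flag string from the engine-level settings in config.

    Keys other than psm, oem, and dpi (for example timeout or min_confidence)
    are ignored here because they are not Tesseract command-line options.
    """
    flags = []
    for key in ("psm", "oem", "dpi"):
        if key in config:
            flags.append(f"--{key} {int(config[key])}")
    return " ".join(flags)


settings = {"psm": 6, "oem": 3, "dpi": 300, "timeout": 30, "min_confidence": 60}
print(tesseract_flags(settings))
```

With `pytesseract`, a string of this form is what the `config` argument of `image_to_string` accepts, while a timeout is passed separately through its `timeout` argument.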

Quality Assessment

python

# Get confidence scores
result = processor.extract_structured()

# Overall confidence (0-100)
print(f"Confidence: {result['confidence']}%")

# Per-word confidence
for word in result['words']:
    print(f"{word['text']}: {word['confidence']}%")

# Filter low-confidence words
high_conf_words = [w for w in result['words'] if w['confidence'] > 80]
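
Building on the per-word scores, a short report can separate trusted words from ones worth reviewing. The sketch assumes each word is a dict with `text` and `confidence` keys, as in the example above; `confidence_report` and the sample data are hypothetical.

```python
def confidence_report(words, min_confidence=60):
    """Split words by confidence and compute the mean score."""
    kept = [w for w in words if w["confidence"] >= min_confidence]
    dropped = [w for w in words if w["confidence"] < min_confidence]
    mean_conf = sum(w["confidence"] for w in words) / len(words) if words else 0.0
    return {
        "mean_confidence": round(mean_conf, 1),
        "kept_text": " ".join(w["text"] for w in kept),
        "needs_review": [w["text"] for w in dropped],
    }


# Sample data standing in for result['words']
sample_words = [
    {"text": "Invoice", "confidence": 96},
    {"text": "N0.", "confidence": 41},
    {"text": "1234", "confidence": 88},
]
print(confidence_report(sample_words))
```

Listing the dropped words instead of discarding them silently makes it easier to spot systematic misreads, such as the digit `0` standing in for the letter `o`.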

Output Formats

Markdown Export

python

processor.export_markdown("output.md")

Output includes:

Document title (if detected)
Structured headings
Paragraphs
Tables (as Markdown tables)
Page breaks for multi-page docs
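
For the table part of that output, the sketch below shows one way a list of rows could be rendered as a Markdown table. It treats the first row as the header, which is an assumption; `rows_to_markdown` is a hypothetical helper and not the code behind `export_markdown`.

```python
def rows_to_markdown(rows):
    """Render a list of rows as a Markdown table, using the first row as header."""
    if not rows:
        return ""

    def fmt(cells):
        # Escape pipes so cell text cannot break the column layout
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


print(rows_to_markdown([["Item", "Price"], ["Coffee", "3.50"]]))
```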

JSON Export

python

processor.export_json("output.json")

Output structure:

json

{
  "source": "document.pdf",
  "pages": 5,
  "language": "eng",
  "confidence": 92.5,
  "text": "Full extracted text...",
  "blocks": [
    {
      "type": "paragraph",
      "text": "Block text...",
      "bbox": [x, y, width, height],
      "confidence": 95.2
    }
  ],
  "tables": [...]
}
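
A file with this structure can be post-processed with the standard `json` module. The sketch below uses an inline sample with concrete values in place of the placeholders shown above and lists blocks that fall below a confidence threshold; `low_confidence_blocks` and the sample values are illustrative assumptions.

```python
import json

# Inline sample following the documented structure, with made-up values
raw = """
{
  "source": "document.pdf",
  "pages": 1,
  "language": "eng",
  "confidence": 78.6,
  "text": "Quarterly report Tota1 revenue",
  "blocks": [
    {"type": "paragraph", "text": "Quarterly report",
     "bbox": [40, 32, 310, 28], "confidence": 95.2},
    {"type": "paragraph", "text": "Tota1 revenue",
     "bbox": [40, 80, 240, 24], "confidence": 62.0}
  ],
  "tables": []
}
"""


def low_confidence_blocks(document, threshold=80.0):
    """Return the blocks whose confidence is below threshold."""
    return [b for b in document.get("blocks", []) if b["confidence"] < threshold]


document = json.loads(raw)
for block in low_confidence_blocks(document):
    x, y, width, height = block["bbox"]
    print(f"{block['text']!r} at ({x}, {y}), {width}x{height}: {block['confidence']}%")
```

To work with a real export, replace the inline string with the contents of `output.json`, for example via `json.load` on an open file.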

HTML Export

python

processor.export_html("output.html")

Creates styled HTML with:

Preserved layout approximation
Highlighted low-confidence regions
Embedded images (optional)
Print-friendly styling

CLI Usage

bash

# Basic extraction
python ocr_processor.py image.png -o output.txt

# Extract to markdown
python ocr_processor.py document.pdf -o output.md --format markdown

# Specify language
python ocr_processor.py german.png --lang deu

# Batch processing
python ocr_processor.py scans/ -o extracted/ --batch

# With preprocessing
python ocr_processor.py noisy.png --preprocess --deskew --denoise
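
For readers wiring up a similar command line, the flags used above could be declared with `argparse` roughly as follows. This is a sketch, not the script's actual parser: the long option name `--output`, the defaults, and the help texts are assumptions.

```python
import argparse


def build_parser():
    """Declare the flags that appear in the CLI examples above."""
    parser = argparse.ArgumentParser(prog="ocr_processor.py")
    parser.add_argument("input", help="Image, PDF, or directory (with --batch)")
    # "--output" as the long form of -o is an assumption
    parser.add_argument("-o", "--output", help="Output file or directory")
    parser.add_argument("--format", default="text", help="Output format, e.g. markdown")
    parser.add_argument("--lang", default="eng", help="Language code(s), e.g. deu")
    parser.add_argument("--batch", action="store_true", help="Treat input as a directory")
    parser.add_argument("--preprocess", action="store_true", help="Enable preprocessing")
    parser.add_argument("--deskew", action="store_true", help="Correct rotation")
    parser.add_argument("--denoise", action="store_true", help="Remove noise")
    return parser


args = build_parser().parse_args(["noisy.png", "--preprocess", "--deskew", "--denoise"])
print(args.input, args.lang, args.deskew, args.batch)
```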

Error Handling

python

from scripts.ocr_processor import OCRProcessor, OCRError

try:
    processor = OCRProcessor("document.png")
    text = processor.extract_text()
except OCRError as e:
    print(f"OCR failed: {e}")
except FileNotFoundError:
    print("File not found")

Performance Tips

Image Quality: Higher resolution (300+ DPI) improves accuracy
Preprocessing: Use for low-quality scans
Language: Specifying language improves speed and accuracy
PSM Mode: Choose appropriate mode for document type
Large Files: Process PDFs page by page for memory efficiency
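
The 300 DPI guideline can be turned into a value for the `scale` preprocessing option with simple arithmetic. The helper below is a hypothetical sketch that assumes `scale` is a plain linear multiplier and that the scan's DPI is known.

```python
def upscale_factor(current_dpi, target_dpi=300):
    """Return the linear scale needed to reach target_dpi, never below 1.0."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


print(upscale_factor(150))  # a 150 DPI scan needs 2x upscaling
print(upscale_factor(600))  # already above target, so no scaling
```

Upscaling only interpolates existing pixels, so rescanning at a higher resolution, when possible, is likely to help more than a large `scale` value.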

Limitations

Handwritten text: Limited accuracy
Complex layouts: May lose structure
Very low quality: Preprocessing helps but has limits
Non-Latin scripts: Require specific language packs

Dependencies

pytesseract>=0.3.10
Pillow>=10.0.0
PyMuPDF>=1.23.0
opencv-python>=4.8.0
numpy>=1.24.0

System Requirements

Tesseract OCR engine must be installed
Language data files for non-English languages
