smart-ocr


Smart OCR Skill


Overview


This skill enables intelligent text extraction from images and scanned documents using PaddleOCR - a leading OCR engine supporting 100+ languages. Extract text from photos, screenshots, scanned PDFs, and handwritten documents with high accuracy.

How to Use


  1. Provide the image or scanned document
  2. Optionally specify language(s) to detect
  3. I'll extract text with position and confidence data
Example prompts:
  • "Extract all text from this screenshot"
  • "OCR this scanned PDF document"
  • "Read the text from this business card photo"
  • "Extract Chinese and English text from this image"

Domain Knowledge


PaddleOCR Fundamentals


```python
from paddleocr import PaddleOCR

# Initialize OCR engine
ocr = PaddleOCR(use_angle_cls=True, lang='en')

# Run OCR on an image
result = ocr.ocr('image.png', cls=True)

# Result structure: [[box, (text, confidence)], ...]
for line in result[0]:
    box = line[0]      # [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
    text = line[1][0]  # Extracted text
    conf = line[1][1]  # Confidence score
    print(f"{text} ({conf:.2f})")
```

Supported Languages


```python
# Common language codes
languages = {
    'en': 'English',
    'ch': 'Chinese (Simplified)',
    'cht': 'Chinese (Traditional)',
    'japan': 'Japanese',
    'korean': 'Korean',
    'french': 'French',
    'german': 'German',
    'spanish': 'Spanish',
    'russian': 'Russian',
    'arabic': 'Arabic',
    'hindi': 'Hindi',
    'vi': 'Vietnamese',
    'th': 'Thai',
    # ... 100+ languages supported
}

# Use a specific language
ocr = PaddleOCR(lang='ch')            # Chinese
ocr = PaddleOCR(lang='japan')         # Japanese
ocr = PaddleOCR(lang='multilingual')  # Mixed-script model (availability varies by version)
```

Configuration Options


```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(
    # Detection settings
    det_model_dir=None,         # Custom detection model
    det_limit_side_len=960,     # Max side length for detection
    det_db_thresh=0.3,          # Binarization threshold
    det_db_box_thresh=0.5,      # Box score threshold

    # Recognition settings
    rec_model_dir=None,         # Custom recognition model
    rec_char_dict_path=None,    # Custom character dictionary

    # Angle classification
    use_angle_cls=True,         # Enable angle classification
    cls_model_dir=None,         # Custom classification model

    # Language
    lang='en',                  # Language code

    # Performance
    use_gpu=True,               # Use GPU if available
    gpu_mem=500,                # GPU memory limit (MB)
    enable_mkldnn=True,         # CPU optimization

    # Output
    show_log=False,             # Suppress logs
)
```

Processing Different Sources


Image Files


```python
# Single image
result = ocr.ocr('image.png')

# Multiple images
images = ['img1.png', 'img2.png', 'img3.png']
for img in images:
    result = ocr.ocr(img)
    process_result(result)  # your own handler
```

PDF Files (Scanned)


```python
import os

from pdf2image import convert_from_path

def ocr_pdf(pdf_path):
    """OCR a scanned PDF."""
    # Convert PDF pages to images
    images = convert_from_path(pdf_path)

    all_text = []
    for i, img in enumerate(images):
        # Save temp image
        temp_path = f'temp_page_{i}.png'
        img.save(temp_path)

        # OCR the image
        result = ocr.ocr(temp_path)

        # Extract text
        page_text = '\n'.join([line[1][0] for line in result[0]])
        all_text.append(f"--- Page {i+1} ---\n{page_text}")

        os.remove(temp_path)

    return '\n\n'.join(all_text)
```

URLs and Bytes


```python
from io import BytesIO

import numpy as np
import requests
from PIL import Image

# From URL: ocr.ocr() takes a file path or a NumPy array, so decode the bytes first
response = requests.get('https://example.com/image.png')
img = Image.open(BytesIO(response.content))
result = ocr.ocr(np.array(img))

# From bytes
with open('image.png', 'rb') as f:
    img_bytes = f.read()
result = ocr.ocr(np.array(Image.open(BytesIO(img_bytes))))
```

Result Processing


```python
def process_ocr_result(result):
    """Process OCR result into structured data."""

    lines = []
    for line in result[0]:
        box = line[0]
        text = line[1][0]
        confidence = line[1][1]

        # Calculate bounding box
        x_coords = [p[0] for p in box]
        y_coords = [p[1] for p in box]

        lines.append({
            'text': text,
            'confidence': confidence,
            'bbox': {
                'left': min(x_coords),
                'top': min(y_coords),
                'right': max(x_coords),
                'bottom': max(y_coords),
            },
            'raw_box': box
        })

    return lines
```

```python
# Sort by position (top to bottom, left to right)
def sort_by_position(lines):
    return sorted(lines, key=lambda x: (x['bbox']['top'], x['bbox']['left']))
```

Text Layout Reconstruction


```python
def reconstruct_layout(result, line_threshold=10):
    """Reconstruct text layout from OCR results."""

    lines = process_ocr_result(result)
    lines = sort_by_position(lines)

    # Group into logical lines
    text_lines = []
    current_line = []
    current_y = None

    for line in lines:
        y = line['bbox']['top']

        if current_y is None or abs(y - current_y) < line_threshold:
            current_line.append(line)
            current_y = y
        else:
            # New line
            text_lines.append(' '.join([l['text'] for l in current_line]))
            current_line = [line]
            current_y = y

    # Add last line
    if current_line:
        text_lines.append(' '.join([l['text'] for l in current_line]))

    return '\n'.join(text_lines)
```
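The row-grouping threshold can be exercised in isolation with synthetic y-coordinates; `group_rows` below is a hypothetical helper that mirrors the loop in `reconstruct_layout`, not part of the skill itself.

```python
def group_rows(tops, line_threshold=10):
    """Group a sorted list of top-y coordinates into rows, using the same
    adjacent-difference threshold as reconstruct_layout()."""
    rows, current = [], [tops[0]]
    for y in tops[1:]:
        if abs(y - current[-1]) < line_threshold:
            current.append(y)
        else:
            rows.append(current)
            current = [y]
    rows.append(current)
    return rows

# Three boxes near y=13, two near y=41, one at y=80
print(group_rows([12, 14, 15, 40, 43, 80]))  # → [[12, 14, 15], [40, 43], [80]]
```

Because each box is compared against the previous one rather than the row's first box, a slowly drifting baseline still lands in one row; tighten `line_threshold` if distinct lines get merged.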

Best Practices


  1. Preprocess Images: Improve quality before OCR
  2. Choose Correct Language: Specify language for better accuracy
  3. Handle Multi-column: Process columns separately
  4. Filter Low Confidence: Skip results below threshold
  5. Batch Processing: Process multiple images efficiently
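Practice 4 can be sketched in plain Python against the `[[box, (text, confidence)], ...]` result structure shown earlier; `filter_by_confidence` and the sample `result` are hypothetical illustrations, not part of PaddleOCR.

```python
def filter_by_confidence(result, threshold=0.8):
    """Keep only OCR lines whose confidence meets the threshold."""
    return [line for line in result[0] if line[1][1] >= threshold]

# Hypothetical output in PaddleOCR's [[box, (text, confidence)], ...] shape
result = [[
    [[[0, 0], [100, 0], [100, 20], [0, 20]], ('Invoice #1234', 0.97)],
    [[[0, 30], [80, 30], [80, 50], [0, 50]], ('Tota1', 0.42)],  # noisy read
]]

print([line[1][0] for line in filter_by_confidence(result)])  # → ['Invoice #1234']
```

A threshold around 0.8 is a common starting point; lower it for degraded scans where recall matters more than precision.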

Common Patterns


Image Preprocessing


```python
from PIL import Image, ImageEnhance, ImageFilter

def preprocess_image(image_path):
    """Preprocess image for better OCR."""
    img = Image.open(image_path)

    # Convert to grayscale
    img = img.convert('L')

    # Enhance contrast
    enhancer = ImageEnhance.Contrast(img)
    img = enhancer.enhance(2.0)

    # Sharpen
    img = img.filter(ImageFilter.SHARPEN)

    # Save preprocessed
    preprocessed_path = 'preprocessed.png'
    img.save(preprocessed_path)

    return preprocessed_path
```

Batch OCR with Progress


```python
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor

def batch_ocr(image_paths, max_workers=4):
    """OCR multiple images in parallel."""

    results = {}

    def process_single(img_path):
        result = ocr.ocr(img_path)
        return img_path, result

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process_single, p) for p in image_paths]

        for future in tqdm(futures, desc="Processing OCR"):
            path, result = future.result()
            results[path] = result

    return results
```

Examples


Example 1: Business Card Reader


```python
from paddleocr import PaddleOCR
import re

def read_business_card(image_path):
    """Extract contact info from business card."""

    ocr = PaddleOCR(use_angle_cls=True, lang='en')
    result = ocr.ocr(image_path)

    # Extract all text
    all_text = []
    for line in result[0]:
        all_text.append(line[1][0])

    full_text = '\n'.join(all_text)

    # Parse contact info
    contact = {
        'name': None,
        'email': None,
        'phone': None,
        'company': None,
        'title': None,
        'raw_text': full_text
    }

    # Email pattern
    email_match = re.search(r'[\w\.-]+@[\w\.-]+\.\w+', full_text)
    if email_match:
        contact['email'] = email_match.group()

    # Phone pattern
    phone_match = re.search(r'[\+\d][\d\s\-\(\)]{8,}', full_text)
    if phone_match:
        contact['phone'] = phone_match.group().strip()

    # Name is usually the largest/first text
    if all_text:
        contact['name'] = all_text[0]

    return contact

card_info = read_business_card('business_card.jpg')
print(f"Name: {card_info['name']}")
print(f"Email: {card_info['email']}")
print(f"Phone: {card_info['phone']}")
```
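The email and phone patterns in Example 1 can be sanity-checked on synthetic card text without running OCR (the contact details below are made up):

```python
import re

# Made-up text, standing in for OCR output from a card
text = "Jane Doe\nAcme Corp\njane.doe@example.com\n+1 (555) 123-4567"

email = re.search(r'[\w\.-]+@[\w\.-]+\.\w+', text)
phone = re.search(r'[\+\d][\d\s\-\(\)]{8,}', text)

print(email.group())          # → jane.doe@example.com
print(phone.group().strip())  # → +1 (555) 123-4567
```

Note the phone pattern deliberately requires at least 9 characters starting from a `+` or digit, which keeps it from matching short numeric fragments like street numbers.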

Example 2: Receipt Scanner


```python
from paddleocr import PaddleOCR
import re

def scan_receipt(image_path):
    """Extract items and total from receipt."""

    ocr = PaddleOCR(use_angle_cls=True, lang='en')
    result = ocr.ocr(image_path)

    lines = []
    for line in result[0]:
        text = line[1][0]
        y_pos = line[0][0][1]
        lines.append({'text': text, 'y': y_pos})

    # Sort by vertical position
    lines.sort(key=lambda x: x['y'])

    receipt = {
        'items': [],
        'subtotal': None,
        'tax': None,
        'total': None
    }

    for line in lines:
        text = line['text']

        # Look for total
        if 'total' in text.lower():
            amount = re.search(r'\$?([\d,]+\.?\d*)', text)
            if amount:
                if 'sub' in text.lower():
                    receipt['subtotal'] = float(amount.group(1).replace(',', ''))
                else:
                    receipt['total'] = float(amount.group(1).replace(',', ''))

        # Look for tax
        elif 'tax' in text.lower():
            amount = re.search(r'\$?([\d,]+\.?\d*)', text)
            if amount:
                receipt['tax'] = float(amount.group(1).replace(',', ''))

        # Look for items (line with price)
        else:
            item_match = re.search(r'(.+?)\s+\$?([\d,]+\.?\d+)$', text)
            if item_match:
                receipt['items'].append({
                    'name': item_match.group(1).strip(),
                    'price': float(item_match.group(2).replace(',', ''))
                })

    return receipt

receipt_data = scan_receipt('receipt.jpg')
print(f"Items: {len(receipt_data['items'])}")
print(f"Total: ${receipt_data['total']}")
```
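The item regex in Example 2 can likewise be checked against a synthetic receipt line (the sample text is made up):

```python
import re

# Made-up receipt line: item name, whitespace, trailing price
line = "Coffee beans 12oz   $14.99"

m = re.search(r'(.+?)\s+\$?([\d,]+\.?\d+)$', line)
name = m.group(1).strip()
price = float(m.group(2).replace(',', ''))

print(name, price)  # → Coffee beans 12oz 14.99
```

Anchoring the price at end-of-line (`$`) is what keeps the lazy `(.+?)` from stopping at the embedded "12" in the item name.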

Example 3: Multi-language Document


```python
from paddleocr import PaddleOCR

def ocr_multilingual(image_path, languages=['en', 'ch']):
    """OCR document with multiple languages."""

    all_results = {}

    for lang in languages:
        ocr = PaddleOCR(use_angle_cls=True, lang=lang)
        result = ocr.ocr(image_path)

        texts = []
        for line in result[0]:
            texts.append({
                'text': line[1][0],
                'confidence': line[1][1]
            })

        all_results[lang] = texts

    # Merge results, keeping highest confidence
    merged = {}
    for lang, texts in all_results.items():
        for item in texts:
            text = item['text']
            conf = item['confidence']

            if text not in merged or merged[text]['confidence'] < conf:
                merged[text] = {'confidence': conf, 'language': lang}

    return merged

result = ocr_multilingual('bilingual_document.png')
for text, info in result.items():
    print(f"[{info['language']}] {text} ({info['confidence']:.2f})")
```
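The merge step in Example 3 can be isolated and run on synthetic per-language results; `merge_best` is a hypothetical mirror of the loop above, shown here so the keep-highest-confidence rule can be verified without any OCR engine.

```python
def merge_best(all_results):
    """Per text string, keep the (language, confidence) pair with the
    highest confidence across all language passes."""
    merged = {}
    for lang, texts in all_results.items():
        for item in texts:
            t, c = item['text'], item['confidence']
            if t not in merged or merged[t]['confidence'] < c:
                merged[t] = {'confidence': c, 'language': lang}
    return merged

# Made-up per-language passes over the same bilingual image
sample = {
    'en': [{'text': 'Hello', 'confidence': 0.95}, {'text': '你好', 'confidence': 0.40}],
    'ch': [{'text': '你好', 'confidence': 0.98}],
}
print(merge_best(sample)['你好'])  # → {'confidence': 0.98, 'language': 'ch'}
```

Merging by exact text string only deduplicates identical reads; lines that the two models transcribe differently will appear twice, so position-based deduplication is a natural refinement.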

Limitations


  • Handwritten text accuracy varies
  • Very small text may not be detected
  • Complex backgrounds reduce accuracy
  • Rotated text needs angle classification
  • GPU recommended for best performance
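For the small-text limitation, upscaling before OCR often helps. The helper below is a sketch using Pillow; the `min_side` threshold of 1000 px is an assumed heuristic to tune per document set, not a PaddleOCR requirement.

```python
from PIL import Image

def upscale_for_ocr(img, min_side=1000):
    """Enlarge images whose shorter side is below min_side so small
    glyphs cover enough pixels for the text detector."""
    w, h = img.size
    if min(w, h) >= min_side:
        return img
    scale = min_side / min(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)

# Blank stand-in for a low-resolution scan
big = upscale_for_ocr(Image.new('L', (300, 120), color=255))
print(big.size)  # → (2500, 1000)
```

Pair this with the contrast/sharpen preprocessing above; upscaling a blurry image does not add detail, but it keeps the detector's downscaling (`det_limit_side_len`) from erasing already-small text.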

Installation


```bash
# CPU version
pip install paddlepaddle paddleocr

# GPU version (CUDA 11.x)
pip install paddlepaddle-gpu paddleocr

# Additional dependencies
pip install pdf2image Pillow
```
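A quick way to confirm the installs succeeded is to probe for the packages before loading any models; `check_deps` is a hypothetical helper (note that Pillow imports as `PIL`).

```python
import importlib.util

def check_deps(packages=('paddleocr', 'pdf2image', 'PIL')):
    """Report which OCR dependencies are importable in this environment."""
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in packages}

for pkg, ok in check_deps().items():
    print(f"{pkg}: {'ok' if ok else 'missing'}")
```

`find_spec` only checks importability; the first real `PaddleOCR(...)` call will additionally download the detection/recognition models.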

Resources
