pdf-processor

PDF Processor

Instructions

When processing PDF files, follow these steps based on your specific needs:

1. Identify Processing Type

Determine what you need to do with the PDF:
  • Extract text content
  • Fill form fields
  • Extract images or tables
  • Merge or split PDFs
  • Add annotations or watermarks
  • Convert to other formats

2. Text Extraction

Basic Text Extraction

python
import PyPDF2
import pdfplumber

# Method 1: Using PyPDF2
def extract_text_pypdf2(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # Guard against pages that yield no text
            text += page.extract_text() or ""
        return text

# Method 2: Using pdfplumber (better for tables)
def extract_text_pdfplumber(file_path):
    with pdfplumber.open(file_path) as pdf:
        text = ""
        for page in pdf.pages:
            text += page.extract_text() or ""
        return text

Advanced Text Extraction

  • Preserve formatting and layout
  • Handle multi-column documents
  • Extract text from specific regions
  • Process scanned PDFs with OCR
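
As one illustration of the multi-column case, the sketch below groups words into equal-width columns and reads each column top to bottom. It assumes word dictionaries with `text`, `x0`, and `top` keys, the shape that pdfplumber's `page.extract_words()` returns. The function name `words_to_column_text` and the equal-width assumption are hypothetical; real layouts may need gutter detection instead.

```python
def words_to_column_text(words, page_width, n_columns=2):
    """Group word dicts into equal-width columns and read each one top to bottom."""
    column_width = page_width / n_columns
    columns = [[] for _ in range(n_columns)]
    for word in words:
        # Assign each word to a column by the x position of its left edge.
        index = min(int(word["x0"] // column_width), n_columns - 1)
        columns[index].append(word)

    column_texts = []
    for column in columns:
        # Sort by (rounded) vertical position, then left to right within a line.
        column.sort(key=lambda w: (round(w["top"]), w["x0"]))
        lines = {}
        for word in column:
            lines.setdefault(round(word["top"]), []).append(word["text"])
        column_texts.append("\n".join(" ".join(parts) for parts in lines.values()))
    return column_texts
```

With pdfplumber this could be called as `words_to_column_text(page.extract_words(), page.width)`. Rounding `top` is a crude way to group words into lines and may split a line whose words sit at slightly different heights.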

3. Form Processing

Form Field Detection

python
def detect_form_fields(file_path):
    reader = PyPDF2.PdfReader(file_path)
    fields = {}
    # get_fields() returns None when the document has no form
    for field_name, field in (reader.get_fields() or {}).items():
        fields[field_name] = {
            'type': field.field_type,
            'value': field.value,
            # Bit 2 of the /Ff field flags marks a field as required
            'required': bool((field.flags or 0) & 2)
        }
    return fields

def fill_form_fields(file_path, output_path, field_data):
    reader = PyPDF2.PdfReader(file_path)
    writer = PyPDF2.PdfWriter()

    for page in reader.pages:
        writer.add_page(page)

    # Field values are read-only on Field objects, so update them through
    # the writer, page by page, skipping pages without form annotations
    for page in writer.pages:
        if "/Annots" in page:
            writer.update_page_form_field_values(page, field_data)

    with open(output_path, 'wb') as output_file:
        writer.write(output_file)

Common Form Types

  • Application forms
  • Invoices and receipts
  • Survey forms
  • Legal documents
  • Medical forms

4. Content Analysis

Structure Analysis

python
def analyze_pdf_structure(file_path):
    with pdfplumber.open(file_path) as pdf:
        analysis = {
            'pages': len(pdf.pages),
            'has_images': False,
            'has_tables': False,
            'has_forms': False,  # placeholder: never updated by this function
            'text_density': [],
            'sections': []
        }

        for i, page in enumerate(pdf.pages):
            # Check for images
            if page.images:
                analysis['has_images'] = True

            # Check for tables
            if page.extract_tables():
                analysis['has_tables'] = True

            # Calculate text density
            text = page.extract_text()
            if text:
                density = len(text) / (page.width * page.height)
                analysis['text_density'].append(density)

            # Detect section headers (basic heuristic)
            lines = text.split('\n') if text else []
            for line in lines:
                if line.isupper() and len(line) < 50:
                    analysis['sections'].append({
                        'page': i + 1,
                        'title': line.strip()
                    })

    return analysis

Table Extraction

python
def extract_tables(file_path):
    tables = []
    with pdfplumber.open(file_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            page_tables = page.extract_tables()
            for table in page_tables:
                tables.append({
                    'page': page_num + 1,
                    'data': table,
                    'rows': len(table),
                    'columns': len(table[0]) if table else 0
                })
    return tables

5. PDF Manipulation

Merge PDFs

python
from PyPDF2 import PdfMerger

def merge_pdfs(file_paths, output_path):
    merger = PdfMerger()
    for path in file_paths:
        merger.append(path)
    merger.write(output_path)
    merger.close()

Split PDF

python
def split_pdf(file_path, output_dir):
    reader = PyPDF2.PdfReader(file_path)
    for i, page in enumerate(reader.pages):
        writer = PyPDF2.PdfWriter()
        writer.add_page(page)
        output_path = f"{output_dir}/page_{i+1}.pdf"
        with open(output_path, 'wb') as output_file:
            writer.write(output_file)

Add Watermark

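python
import io
from reportlab.pdfgen import canvas  # third-party: pip install reportlab

def add_watermark(input_path, output_path, watermark_text):
    reader = PyPDF2.PdfReader(input_path)
    writer = PyPDF2.PdfWriter()

    for page in reader.pages:
        # Draw the text on a one-page overlay sized to match this page
        width, height = float(page.mediabox.width), float(page.mediabox.height)
        buffer = io.BytesIO()
        overlay = canvas.Canvas(buffer, pagesize=(width, height))
        overlay.setFont("Helvetica", 40)
        overlay.setFillGray(0.5, 0.3)  # mid-gray at 30% opacity
        overlay.drawCentredString(width / 2, height / 2, watermark_text)
        overlay.save()
        buffer.seek(0)

        # Stamp the overlay on top of the page content
        page.merge_page(PyPDF2.PdfReader(buffer).pages[0])
        writer.add_page(page)

    with open(output_path, 'wb') as output_file:
        writer.write(output_file)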

6. OCR for Scanned PDFs

Using Tesseract OCR

python
import pytesseract
from PIL import Image
import fitz  # PyMuPDF

def ocr_pdf(file_path, dpi=300):
    doc = fitz.open(file_path)
    text = ""

    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        # Render above the 72 dpi default; OCR accuracy depends on resolution
        pix = page.get_pixmap(dpi=dpi)
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        text += pytesseract.image_to_string(img)

    doc.close()
    return text

7. Error Handling

Common Issues

  • Password-protected PDFs
  • Corrupted files
  • Unsupported formats
  • Memory issues with large files
  • Encoding problems
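
Some of these issues can be caught before any PDF library is involved. The sketch below is a minimal pre-flight check using only the standard library: it looks for the `%PDF-` header near the start of the file and the `%%EOF` marker near the end. The function name `check_pdf_file` and the 1024-byte windows are assumptions for illustration; passing the check does not guarantee the file will parse, and it does not detect password protection.

```python
import os

def check_pdf_file(file_path):
    """Return a list of problems found by cheap, library-free checks."""
    if not os.path.isfile(file_path):
        return ["file not found"]

    with open(file_path, "rb") as f:
        head = f.read(1024)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Read at most the last 1024 bytes of the file.
        f.seek(max(0, size - 1024))
        tail = f.read()

    problems = []
    if b"%PDF-" not in head:
        problems.append("missing %PDF- header (unsupported format?)")
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF marker (truncated or corrupted?)")
    return problems
```

An empty list means the file passed both checks, so a caller might run this first and skip or log any file that returns problems.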

Error Handling Pattern

python
import logging
import os

def process_pdf_safely(file_path, processing_func):
    try:
        # Check if file exists
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        # Check file size
        file_size = os.path.getsize(file_path)
        if file_size > 100 * 1024 * 1024:  # 100MB limit
            logging.warning(f"Large file detected: {file_size} bytes")

        # Process the file
        result = processing_func(file_path)
        return result

    except Exception as e:
        logging.error(f"Error processing PDF {file_path}: {str(e)}")
        raise

8. Performance Optimization

For Large Files

  • Process pages in chunks
  • Use generators for memory efficiency
  • Implement progress tracking
  • Consider parallel processing
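
The first three points can be combined in a small, library-agnostic sketch. The names `iter_page_chunks`, `process_in_chunks`, `process_page`, and `on_progress` are hypothetical; `process_page` stands in for whatever per-page work is needed, such as extracting text from page `i`.

```python
def iter_page_chunks(total_pages, chunk_size):
    """Yield ranges of page indices, at most chunk_size pages each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total_pages, chunk_size):
        yield range(start, min(start + chunk_size, total_pages))

def process_in_chunks(total_pages, chunk_size, process_page, on_progress=None):
    """Lazily yield per-page results, reporting progress after each chunk."""
    done = 0
    for chunk in iter_page_chunks(total_pages, chunk_size):
        for page_index in chunk:
            # Only one page result is produced at a time, so memory stays flat.
            yield process_page(page_index)
        done += len(chunk)
        if on_progress is not None:
            on_progress(done, total_pages)
```

Because `process_in_chunks` is a generator, nothing runs until the caller iterates over it, and results can be written out one page at a time instead of being accumulated in a single string.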
  • 分块处理页面
  • 使用生成器提升内存效率
  • 实现进度跟踪
  • 考虑并行处理

Batch Processing

python
import concurrent.futures
import os

def batch_process_pdfs(directory, processing_func, max_workers=4):
    pdf_files = [f for f in os.listdir(directory) if f.endswith('.pdf')]

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for pdf_file in pdf_files:
            file_path = os.path.join(directory, pdf_file)
            future = executor.submit(processing_func, file_path)
            futures.append((pdf_file, future))

        results = {}
        for pdf_file, future in futures:
            try:
                results[pdf_file] = future.result()
            except Exception as e:
                results[pdf_file] = f"Error: {str(e)}"

    return results

Usage Examples

Example 1: Extract Text from Invoice

  1. Load the PDF invoice
  2. Extract all text content
  3. Parse for invoice number, date, amount
  4. Save extracted data to structured format
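
Step 3 might look like the sketch below, which runs regular expressions over text that has already been extracted. The patterns in `INVOICE_PATTERNS` are hypothetical and assume labels such as "Invoice No:", "Date:" (ISO format), and "Total:"; real invoices vary widely and will need their own patterns.

```python
import re

# Hypothetical label patterns; adjust them to the invoices being processed.
INVOICE_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "amount": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}

def parse_invoice_text(text):
    """Return a dict of invoice fields; a field is None when its pattern fails."""
    result = {}
    for name, pattern in INVOICE_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    if result["amount"] is not None:
        # Strip thousands separators before converting to a number.
        result["amount"] = float(result["amount"].replace(",", ""))
    return result
```

The returned dictionary can be passed to `json.dumps` for step 4. For accounting use, `decimal.Decimal` would be a safer choice than `float` for the amount.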

Example 2: Fill Application Form

  1. Load the application form PDF
  2. Detect all form fields
  3. Fill fields with provided data
  4. Save filled form as new PDF

Example 3: Extract Tables from Report

  1. Open multi-page report PDF
  2. Extract all tables from each page
  3. Convert tables to CSV or Excel
  4. Preserve table structure and formatting
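
For step 3, a minimal CSV export could be sketched as follows. It assumes a list of table dictionaries with `page` and `data` keys, the shape produced by the `extract_tables` function in the Table Extraction section. The function name `save_tables_as_csv` and the file naming scheme are hypothetical. CSV keeps the row and column structure but not visual formatting.

```python
import csv
import os

def save_tables_as_csv(tables, output_dir):
    """Write each table to its own CSV file and return the paths written."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for index, table in enumerate(tables, start=1):
        path = os.path.join(output_dir, f"page{table['page']}_table{index}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            for row in table["data"]:
                # pdfplumber reports empty cells as None; write them as "".
                writer.writerow(["" if cell is None else cell for cell in row])
        paths.append(path)
    return paths
```

A call such as `save_tables_as_csv(extract_tables("report.pdf"), "tables")` would then produce one file per table, where `report.pdf` and `tables` are placeholder names.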

Required Libraries

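Install the necessary Python packages (reportlab is needed only for watermarking):
bash
pip install PyPDF2 pdfplumber PyMuPDF pytesseract pillow reportlab

pytesseract is only a wrapper, so the Tesseract OCR engine itself must be installed separately.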

Tips

  • Always check if PDF is password-protected first
  • Use different libraries based on your needs (speed vs accuracy)
  • For scanned documents, OCR quality depends on image resolution
  • Consider the PDF version when working with older files
  • Test with sample pages before processing entire documents
  • Handle encoding issues for non-English text
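
For the last tip, one common cleanup step is Unicode normalization of the extracted text. The sketch below uses only the standard library; the function name `normalize_extracted_text` is hypothetical, and NFKC is one reasonable choice rather than the only one, since it also folds characters that some workflows may want to keep distinct.

```python
import unicodedata

def normalize_extracted_text(text):
    """Fold compatibility characters and drop stray control characters."""
    # NFKC maps ligatures and full-width forms to their plain equivalents.
    normalized = unicodedata.normalize("NFKC", text)
    # Remove control characters (category Cc) except newline and tab.
    return "".join(
        ch for ch in normalized
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
```

This helps when searching or parsing extracted text, for example when a PDF stores "fi" as a single ligature glyph or emits form-feed characters between pages.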