pdf-processor

PDF Processor

Instructions

When processing PDF files, follow these steps based on your specific needs:

1. Identify Processing Type

Determine what you need to do with the PDF:

Extract text content
Fill form fields
Extract images or tables
Merge or split PDFs
Add annotations or watermarks
Convert to other formats
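
The choice above can be sketched as a small keyword lookup. This is only an illustration: `ProcessingType`, `KEYWORDS`, and `identify_processing_type` are hypothetical names, and the keyword lists are assumptions you would tune for your own requests.

```python
from enum import Enum


class ProcessingType(Enum):
    EXTRACT_TEXT = "extract_text"
    FILL_FORM = "fill_form"
    EXTRACT_IMAGES_TABLES = "extract_images_tables"
    MERGE_SPLIT = "merge_split"
    ANNOTATE = "annotate"
    CONVERT = "convert"


# Hypothetical keyword lists, checked in order; more specific types come first.
KEYWORDS = {
    ProcessingType.FILL_FORM: ("fill", "form field"),
    ProcessingType.EXTRACT_IMAGES_TABLES: ("image", "table"),
    ProcessingType.MERGE_SPLIT: ("merge", "split", "combine"),
    ProcessingType.ANNOTATE: ("annotat", "watermark"),
    ProcessingType.CONVERT: ("convert", "export"),
    ProcessingType.EXTRACT_TEXT: ("text", "extract"),
}


def identify_processing_type(request):
    """Return the first ProcessingType whose keyword appears in the request."""
    lowered = request.lower()
    for processing_type, keywords in KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return processing_type
    return None  # no keyword matched; ask for clarification instead of guessing
```

Returning `None` for an unrecognized request keeps the caller from silently running the wrong kind of processing.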

2. Text Extraction

Basic Text Extraction

python

import PyPDF2
import pdfplumber

# Method 1: Using PyPDF2
def extract_text_pypdf2(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
    return text

# Method 2: Using pdfplumber (better for tables)
def extract_text_pdfplumber(file_path):
    with pdfplumber.open(file_path) as pdf:
        text = ""
        for page in pdf.pages:
            text += page.extract_text() or ""
    return text

Advanced Text Extraction

Preserve formatting and layout
Handle multi-column documents
Extract text from specific regions
Process scanned PDFs with OCR
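
As one possible approach to the multi-column and region items above, a pdfplumber page can be cropped to a bounding box before extracting text. The sketch below assumes equal-width columns, which real layouts often violate; `extract_columns` is a hypothetical helper name. It relies only on the page's `bbox`, `crop()`, and `extract_text()`.

```python
def extract_columns(page, n_columns=2):
    """Extract text column by column from a pdfplumber page.

    Assumes equal-width columns; real layouts may need measured boundaries.
    """
    x0, top, x1, bottom = page.bbox
    column_width = (x1 - x0) / n_columns
    parts = []
    for i in range(n_columns):
        left = x0 + i * column_width
        # Clamp the last column to the page edge to avoid rounding overshoot
        right = x1 if i == n_columns - 1 else left + column_width
        # crop() takes (x0, top, x1, bottom) in page coordinates
        column = page.crop((left, top, right, bottom))
        parts.append(column.extract_text() or "")
    return "\n".join(parts)
```

With a real file this would be called as `extract_columns(pdf.pages[0])` inside a `with pdfplumber.open(file_path) as pdf:` block. Passing a single bounding box to `page.crop()` in the same way extracts text from one specific region.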

3. Form Processing

Form Field Detection

python

def detect_form_fields(file_path):
    reader = PyPDF2.PdfReader(file_path)
    fields = {}
    if reader.get_fields():
        for field_name, field in reader.get_fields().items():
            fields[field_name] = {
                'type': field.field_type,
                'value': field.value,
                'required': field.required if hasattr(field, 'required') else False
            }
    return fields

def fill_form_fields(file_path, output_path, field_data):
    reader = PyPDF2.PdfReader(file_path)
    writer = PyPDF2.PdfWriter()

    for page in reader.pages:
        writer.add_page(page)

    # Set values through the form annotations on each page that has any
    for page in writer.pages:
        if '/Annots' in page:
            writer.update_page_form_field_values(page, field_data)

    with open(output_path, 'wb') as output_file:
        writer.write(output_file)

Common Form Types

Application forms
Invoices and receipts
Survey forms
Legal documents
Medical forms

4. Content Analysis

Structure Analysis

python

def analyze_pdf_structure(file_path):
    with pdfplumber.open(file_path) as pdf:
        analysis = {
            'pages': len(pdf.pages),
            'has_images': False,
            'has_tables': False,
            'has_forms': False,
            'text_density': [],
            'sections': []
        }

        for i, page in enumerate(pdf.pages):
            # Check for images
            if page.images:
                analysis['has_images'] = True

            # Check for tables
            if page.extract_tables():
                analysis['has_tables'] = True

            # Calculate text density
            text = page.extract_text()
            if text:
                density = len(text) / (page.width * page.height)
                analysis['text_density'].append(density)

            # Detect section headers (basic heuristic)
            lines = text.split('\n') if text else []
            for line in lines:
                if line.isupper() and len(line) < 50:
                    analysis['sections'].append({
                        'page': i + 1,
                        'title': line.strip()
                    })

    return analysis

Table Extraction

python

def extract_tables(file_path):
    tables = []
    with pdfplumber.open(file_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            page_tables = page.extract_tables()
            for table in page_tables:
                tables.append({
                    'page': page_num + 1,
                    'data': table,
                    'rows': len(table),
                    'columns': len(table[0]) if table else 0
                })
    return tables

5. PDF Manipulation

Merge PDFs

python

from PyPDF2 import PdfMerger

def merge_pdfs(file_paths, output_path):
    merger = PdfMerger()
    for path in file_paths:
        merger.append(path)
    merger.write(output_path)
    merger.close()

Split PDF

python

def split_pdf(file_path, output_dir):
    reader = PyPDF2.PdfReader(file_path)
    for i, page in enumerate(reader.pages):
        writer = PyPDF2.PdfWriter()
        writer.add_page(page)
        output_path = f"{output_dir}/page_{i+1}.pdf"
        with open(output_path, 'wb') as output_file:
            writer.write(output_file)
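
The function above writes one file per page. If only some pages are wanted, a page-range string can first be turned into page indices. The following is a minimal sketch; `parse_page_ranges` and the `"1-3,5"` spec syntax are assumptions for illustration, not part of PyPDF2.

```python
def parse_page_ranges(spec, total_pages):
    """Turn a 1-based spec such as "1-3,5" into sorted 0-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, _, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"Invalid page range: {part!r}")
        indices.update(range(start - 1, end))
    return sorted(indices)
```

Each returned index can then be passed to `writer.add_page(reader.pages[index])` to build a single output file containing just those pages.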

Add Watermark

python

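import io
from reportlab.pdfgen import canvas  # extra dependency: pip install reportlab

def add_watermark(input_path, output_path, watermark_text):
    reader = PyPDF2.PdfReader(input_path)
    writer = PyPDF2.PdfWriter()

    for page in reader.pages:
        # Draw the text on an in-memory page of the same size, then stamp it on
        width, height = float(page.mediabox.width), float(page.mediabox.height)
        buffer = io.BytesIO()
        overlay = canvas.Canvas(buffer, pagesize=(width, height))
        overlay.setFont('Helvetica', 40)
        overlay.setFillGray(0.5, 0.3)  # mid gray at 30% opacity
        overlay.drawCentredString(width / 2, height / 2, watermark_text)
        overlay.save()
        buffer.seek(0)
        page.merge_page(PyPDF2.PdfReader(buffer).pages[0])
        writer.add_page(page)

    with open(output_path, 'wb') as output_file:
        writer.write(output_file)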

6. OCR for Scanned PDFs

Using Tesseract OCR

python

import pytesseract
from PIL import Image
import fitz  # PyMuPDF

def ocr_pdf(file_path):
    doc = fitz.open(file_path)
    text = ""

    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        pix = page.get_pixmap()
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        text += pytesseract.image_to_string(img)

    return text
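
OCR is slow, so it may be worth running it only on pages that have no usable text layer. The sketch below assumes you already have the per-page text from one of the extraction functions; `pages_needing_ocr` and the 20-character threshold are hypothetical choices, not a rule from any library.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indices of pages whose text layer looks empty."""
    return [
        index
        for index, text in enumerate(page_texts)
        # Treat None and whitespace-only text as "no text layer"
        if len((text or "").strip()) < min_chars
    ]
```

Only the pages at the returned indices would then be rendered and passed to `pytesseract.image_to_string`.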

7. Error Handling

Common Issues

Password-protected PDFs
Corrupted files
Unsupported formats
Memory issues with large files
Encoding problems
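
Some of these issues can be caught before any PDF library is involved. As a rough first check for unsupported or badly damaged files, the sketch below looks for the `%PDF-` marker near the start of the file. `looks_like_pdf` is a hypothetical helper, and the 1024-byte search window is an assumption; a passing check does not guarantee the file is valid.

```python
def looks_like_pdf(file_path, search_bytes=1024):
    """Return True if the %PDF- marker appears near the start of the file."""
    with open(file_path, "rb") as file:
        head = file.read(search_bytes)
    return b"%PDF-" in head
```

Password protection is a separate question; with PyPDF2 it can be checked through the reader's `is_encrypted` property once the file opens.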

Error Handling Pattern

python

import logging
import os

def process_pdf_safely(file_path, processing_func):
    try:
        # Check if file exists
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        # Check file size
        file_size = os.path.getsize(file_path)
        if file_size > 100 * 1024 * 1024:  # 100MB limit
            logging.warning(f"Large file detected: {file_size} bytes")

        # Process the file
        result = processing_func(file_path)
        return result

    except Exception as e:
        logging.error(f"Error processing PDF {file_path}: {str(e)}")
        raise

8. Performance Optimization

For Large Files

Process pages in chunks
Use generators for memory efficiency
Implement progress tracking
Consider parallel processing
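
The first three points can be combined in a pair of generators. This is a minimal sketch: `iter_page_chunks`, `process_in_chunks`, and the `process_page` callback are hypothetical names, and progress is simply printed once per chunk.

```python
def iter_page_chunks(total_pages, chunk_size=50):
    """Yield ranges of 0-based page indices, chunk_size pages at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total_pages, chunk_size):
        yield range(start, min(start + chunk_size, total_pages))


def process_in_chunks(total_pages, process_page, chunk_size=50):
    """Lazily yield (page_index, result) pairs, reporting progress per chunk."""
    for chunk in iter_page_chunks(total_pages, chunk_size):
        for page_index in chunk:
            yield page_index, process_page(page_index)
        print(f"Processed {chunk.stop}/{total_pages} pages")
```

Because results are yielded one at a time, the caller can write each one to disk and discard it instead of holding the whole document's output in memory.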

Batch Processing

python

import concurrent.futures
import os

def batch_process_pdfs(directory, processing_func, max_workers=4):
    pdf_files = [f for f in os.listdir(directory) if f.endswith('.pdf')]

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for pdf_file in pdf_files:
            file_path = os.path.join(directory, pdf_file)
            future = executor.submit(processing_func, file_path)
            futures.append((pdf_file, future))

        results = {}
        for pdf_file, future in futures:
            try:
                results[pdf_file] = future.result()
            except Exception as e:
                results[pdf_file] = f"Error: {str(e)}"

    return results

Usage Examples

Example 1: Extract Text from Invoice

Load the PDF invoice
Extract all text content
Parse for invoice number, date, amount
Save extracted data to structured format
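
The parsing and saving steps might look like the sketch below. The regular expressions are hypothetical and match only one simple invoice layout (for example `Invoice No: INV-2024-001`, ISO dates, and a `Total:` line); real invoices will need their own patterns. `parse_invoice` and `save_invoice_json` are illustrative names.

```python
import json
import re

# Hypothetical patterns for one invoice layout; adjust them to your documents.
INVOICE_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "amount": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def parse_invoice(text):
    """Return the first match for each field, or None when a field is missing."""
    data = {}
    for field, pattern in INVOICE_PATTERNS.items():
        match = pattern.search(text)
        data[field] = match.group(1) if match else None
    return data


def save_invoice_json(data, output_path):
    """Write the parsed fields to a JSON file."""
    with open(output_path, "w", encoding="utf-8") as output_file:
        json.dump(data, output_file, indent=2)
```

The `text` argument would come from one of the text extraction functions earlier in this document. Fields that come back as `None` are a signal that the layout differs from what the patterns expect.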

Example 2: Fill Application Form

Load the application form PDF
Detect all form fields
Fill fields with provided data
Save filled form as new PDF
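
Between detecting the fields and filling them, it can help to check the provided data against the detected field names, so that a misspelled name is reported instead of silently ignored. A minimal sketch, where `check_field_data` is a hypothetical helper and `detected_fields` is assumed to be a dictionary keyed by field name, like the one `detect_form_fields` returns:

```python
def check_field_data(detected_fields, field_data):
    """Split field_data into values for known fields and a list of unknown names."""
    known = {
        name: value
        for name, value in field_data.items()
        if name in detected_fields
    }
    unknown = sorted(name for name in field_data if name not in detected_fields)
    return known, unknown
```

Only the `known` dictionary would then be passed on for filling, while `unknown` can be logged or shown to the user.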

Example 3: Extract Tables from Report

Open multi-page report PDF
Extract all tables from each page
Convert tables to CSV or Excel
Preserve table structure and formatting
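
For the CSV step, the standard library is enough. The sketch below assumes each table is a list of rows as returned by pdfplumber's `extract_tables()`, where empty cells may be `None`; `table_to_csv` is a hypothetical helper name.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one extracted table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Replace None (an empty cell) with an empty string
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()
```

CSV keeps the row and column structure but not visual formatting such as merged cells or fonts, so preserving those would need a spreadsheet library instead.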

Required Libraries

Install necessary Python packages:

bash

pip install PyPDF2 pdfplumber PyMuPDF pytesseract pillow
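
Two of these packages are imported under a different name than they are installed under: PyMuPDF is imported as `fitz` and pillow as `PIL`. A quick way to confirm the environment is ready is sketched below; `missing_packages` is a hypothetical helper. Note that pytesseract also needs the Tesseract OCR program itself installed on the system, which this check does not cover.

```python
import importlib.util

# pip package name -> module name used in import statements
REQUIRED_MODULES = {
    "PyPDF2": "PyPDF2",
    "pdfplumber": "pdfplumber",
    "PyMuPDF": "fitz",
    "pytesseract": "pytesseract",
    "pillow": "PIL",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]
```

An empty list from `missing_packages()` means every module was found.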

Tips

Always check if PDF is password-protected first
Use different libraries based on your needs (speed vs accuracy)
For scanned documents, OCR quality depends on image resolution
Consider the PDF version when working with older files
Test with sample pages before processing entire documents
Handle encoding issues for non-English text
