pdf-processor
PDF Processor
Instructions
When processing PDF files, follow these steps based on your specific needs:
1. Identify Processing Type
Determine what you need to do with the PDF:
- Extract text content
- Fill form fields
- Extract images or tables
- Merge or split PDFs
- Add annotations or watermarks
- Convert to other formats
2. Text Extraction
Basic Text Extraction
python
import PyPDF2
import pdfplumber

# Method 1: Using PyPDF2
def extract_text_pypdf2(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # Image-only pages may yield no text
            text += page.extract_text() or ""
    return text

# Method 2: Using pdfplumber (better for tables)
def extract_text_pdfplumber(file_path):
    with pdfplumber.open(file_path) as pdf:
        text = ""
        for page in pdf.pages:
            text += page.extract_text() or ""
    return text

Advanced Text Extraction
- Preserve formatting and layout
- Handle multi-column documents
- Extract text from specific regions
- Process scanned PDFs with OCR
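Two of the items above, multi-column documents and region extraction, can both be approached by cropping a page before extracting its text. The following is a minimal sketch, assuming equal-width columns and a page whose bounding box starts at (0, 0), which real layouts often violate. `column_bboxes` and `extract_columns` are hypothetical helper names; `page.crop` and `extract_text` are pdfplumber calls.

```python
def column_bboxes(page_width, page_height, n_columns=2):
    """Split a page into equal-width (x0, top, x1, bottom) boxes, left to right."""
    col_width = page_width / n_columns
    boxes = []
    for i in range(n_columns):
        # Pin the last column to the page edge to avoid float overshoot
        x1 = page_width if i == n_columns - 1 else (i + 1) * col_width
        boxes.append((i * col_width, 0, x1, page_height))
    return boxes


def extract_columns(file_path, n_columns=2):
    """Return page text read column by column instead of straight across."""
    import pdfplumber  # imported lazily so column_bboxes works without it

    parts = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            for bbox in column_bboxes(page.width, page.height, n_columns):
                parts.append(page.crop(bbox).extract_text() or "")
    return "\n".join(parts)
```

Passing a single hand-picked bounding box to `page.crop` in the same way extracts text from one specific region, such as a header or a signature block.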
3. Form Processing
Form Field Detection
python
import PyPDF2

def detect_form_fields(file_path):
    reader = PyPDF2.PdfReader(file_path)
    fields = {}
    # get_fields() returns None when the PDF has no interactive form
    for field_name, field in (reader.get_fields() or {}).items():
        fields[field_name] = {
            'type': field.field_type,
            'value': field.value,
            # Bit 2 of the field flags (/Ff) marks a required field
            'required': bool((field.flags or 0) & 2)
        }
    return fields

def fill_form_fields(file_path, output_path, field_data):
    reader = PyPDF2.PdfReader(file_path)
    writer = PyPDF2.PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    # Field values live in page annotations, so update each page that has any
    for page in writer.pages:
        if '/Annots' in page:
            writer.update_page_form_field_values(page, field_data)
    with open(output_path, 'wb') as output_file:
        writer.write(output_file)

Common Form Types
- Application forms
- Invoices and receipts
- Survey forms
- Legal documents
- Medical forms
4. Content Analysis
Structure Analysis
python
import pdfplumber

def analyze_pdf_structure(file_path):
    with pdfplumber.open(file_path) as pdf:
        analysis = {
            'pages': len(pdf.pages),
            'has_images': False,
            'has_tables': False,
            'has_forms': False,  # not detected here; use detect_form_fields()
            'text_density': [],
            'sections': []
        }
        for i, page in enumerate(pdf.pages):
            # Check for images
            if page.images:
                analysis['has_images'] = True
            # Check for tables
            if page.extract_tables():
                analysis['has_tables'] = True
            # Calculate text density (characters per square point)
            text = page.extract_text()
            if text:
                density = len(text) / (page.width * page.height)
                analysis['text_density'].append(density)
            # Detect section headers (basic heuristic: short all-caps lines)
            lines = text.split('\n') if text else []
            for line in lines:
                if line.isupper() and len(line) < 50:
                    analysis['sections'].append({
                        'page': i + 1,
                        'title': line.strip()
                    })
    return analysis

Table Extraction
python
def extract_tables(file_path):
    tables = []
    with pdfplumber.open(file_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            page_tables = page.extract_tables()
            for table in page_tables:
                tables.append({
                    'page': page_num + 1,
                    'data': table,
                    'rows': len(table),
                    'columns': len(table[0]) if table else 0
                })
    return tables

5. PDF Manipulation
Merge PDFs
python
from PyPDF2 import PdfMerger

def merge_pdfs(file_paths, output_path):
    merger = PdfMerger()
    for path in file_paths:
        merger.append(path)
    merger.write(output_path)
    merger.close()

Split PDF
python
import os
import PyPDF2

def split_pdf(file_path, output_dir):
    # Create the output directory if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)
    reader = PyPDF2.PdfReader(file_path)
    for i, page in enumerate(reader.pages):
        writer = PyPDF2.PdfWriter()
        writer.add_page(page)
        output_path = os.path.join(output_dir, f"page_{i+1}.pdf")
        with open(output_path, 'wb') as output_file:
            writer.write(output_file)

Add Watermark
python
import io
from reportlab.pdfgen import canvas

def add_watermark(input_path, output_path, watermark_text):
    reader = PyPDF2.PdfReader(input_path)
    writer = PyPDF2.PdfWriter()
    for page in reader.pages:
        width, height = float(page.mediabox.width), float(page.mediabox.height)
        buffer = io.BytesIO()  # in-memory overlay PDF, drawn with reportlab
        overlay = canvas.Canvas(buffer, pagesize=(width, height))
        overlay.setFillGray(0.5, 0.3)  # mid-gray at 30% opacity
        overlay.setFont("Helvetica", 40)
        overlay.drawCentredString(width / 2, height / 2, watermark_text)
        overlay.save()
        # Stamp the overlay on top of the original page content
        page.merge_page(PyPDF2.PdfReader(buffer).pages[0])
        writer.add_page(page)
    with open(output_path, 'wb') as output_file:
        writer.write(output_file)

6. OCR for Scanned PDFs
Using Tesseract OCR
python
import pytesseract
from PIL import Image
import fitz  # PyMuPDF

def ocr_pdf(file_path):
    text = ""
    with fitz.open(file_path) as doc:
        for page in doc:
            # Render at 300 DPI; the 72 DPI default is usually too coarse for OCR
            pix = page.get_pixmap(dpi=300)
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            text += pytesseract.image_to_string(img)
    return text

7. Error Handling
Common Issues
常见问题
- Password-protected PDFs
- Corrupted files
- Unsupported formats
- Memory issues with large files
- Encoding problems
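Several of these issues can be caught cheaply before any PDF library touches the file. The sketch below is a heuristic byte-level check, not a validator: `preflight_pdf` is a hypothetical helper, the 100 MB threshold is an assumed default, and the `/Encrypt` scan only looks at the end of the file, so it can miss encrypted documents. A library check such as PyPDF2's `reader.is_encrypted` remains the reliable test for password protection.

```python
import os


def preflight_pdf(file_path, max_bytes=100 * 1024 * 1024):
    """Return a list of likely problems found without parsing the PDF."""
    if not os.path.isfile(file_path):
        return ["file not found"]
    size = os.path.getsize(file_path)
    if size == 0:
        return ["file is empty"]

    problems = []
    if size > max_bytes:
        problems.append("large file; consider page-by-page processing")

    with open(file_path, "rb") as f:
        head = f.read(1024)
        f.seek(max(0, size - 2048))
        tail = f.read()

    # A PDF starts with a "%PDF-" header near the top of the file
    if b"%PDF-" not in head:
        problems.append("not a PDF (missing %PDF- header)")
    # A complete file ends with an %%EOF marker; its absence suggests truncation
    elif b"%%EOF" not in tail:
        problems.append("possibly truncated or corrupted (missing %%EOF)")
    # Heuristic only: the trailer of an encrypted PDF references /Encrypt
    if b"/Encrypt" in tail:
        problems.append("may be encrypted or password-protected")
    return problems
```

An empty list means none of these quick checks fired, not that the file is guaranteed to parse.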
Error Handling Pattern
python
import logging
import os

def process_pdf_safely(file_path, processing_func):
    try:
        # Check if file exists
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")
        # Warn about files over 100 MB; processing still proceeds
        file_size = os.path.getsize(file_path)
        if file_size > 100 * 1024 * 1024:
            logging.warning(f"Large file detected: {file_size} bytes")
        # Process the file
        result = processing_func(file_path)
        return result
    except Exception as e:
        # Log with context, then re-raise so the caller can decide what to do
        logging.error(f"Error processing PDF {file_path}: {e}")
        raise

8. Performance Optimization
For Large Files
针对大文件
- Process pages in chunks
- Use generators for memory efficiency
- Implement progress tracking
- Consider parallel processing
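The first three points can be combined: a generator yields one page at a time, and a small chunking helper groups pages so progress can be reported per batch. This is a sketch under the assumption that pdfplumber is the extraction library; `chunked` and `iter_page_text` are hypothetical names, not part of any library.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items, consuming the input lazily."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def iter_page_text(file_path):
    """Yield (page_number, text) per page instead of building one big string."""
    import pdfplumber  # imported lazily so chunked() works without it

    with pdfplumber.open(file_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            yield number, page.extract_text() or ""
            page.flush_cache()  # drop pdfplumber's cached objects for this page
```

Iterating over `chunked(iter_page_text(path), 10)` then processes ten pages at a time, and the page number of the last item in each batch serves as a simple progress indicator.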
Batch Processing
python
import concurrent.futures
import os

def batch_process_pdfs(directory, processing_func, max_workers=4):
    # Match the extension case-insensitively so "REPORT.PDF" is included
    pdf_files = [f for f in os.listdir(directory) if f.lower().endswith('.pdf')]
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = []
        for pdf_file in pdf_files:
            file_path = os.path.join(directory, pdf_file)
            future = executor.submit(processing_func, file_path)
            futures.append((pdf_file, future))
        for pdf_file, future in futures:
            try:
                results[pdf_file] = future.result()
            except Exception as e:
                # Record the failure and keep going with the other files
                results[pdf_file] = f"Error: {str(e)}"
    return results

Usage Examples
Example 1: Extract Text from Invoice
- Load the PDF invoice
- Extract all text content
- Parse for invoice number, date, amount
- Save extracted data to structured format
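The parsing and saving steps might look like the sketch below once the text has been extracted. The regular expressions are hypothetical and fit only one simple layout (an "Invoice No:" label, an ISO date, a "Total:" line); real invoices vary widely and usually need patterns tuned per vendor. `parse_invoice` and `save_invoice` are illustrative names.

```python
import json
import re

# Hypothetical patterns for one invoice layout; adjust per document source
INVOICE_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.I
    ),
    "date": re.compile(r"\bDate\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    # \b keeps "Subtotal" from matching as "Total"
    "amount": re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}


def parse_invoice(text):
    """Return the first match for each field, or None when a field is absent."""
    result = {}
    for name, pattern in INVOICE_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


def save_invoice(data, output_path):
    """Write the parsed fields as UTF-8 JSON."""
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
```

The text passed to `parse_invoice` would come from either extraction function in the Text Extraction section.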
Example 2: Fill Application Form
- Load the application form PDF
- Detect all form fields
- Fill fields with provided data
- Save filled form as new PDF
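Between detecting fields and filling them, it can help to compare the supplied data against what the form actually contains, since a misspelled field name would otherwise be ignored silently. A minimal sketch follows; `plan_form_fill` is a hypothetical helper, and it assumes the detected fields arrive as a dict keyed by field name, as `detect_form_fields` returns them.

```python
def plan_form_fill(detected_fields, field_data):
    """Compare supplied values with detected fields before filling.

    Returns (matched, unknown, unfilled):
      matched  - values whose key is a real field in the form
      unknown  - supplied keys the form does not contain
      unfilled - form fields for which no value was supplied
    """
    matched = {k: v for k, v in field_data.items() if k in detected_fields}
    unknown = sorted(k for k in field_data if k not in detected_fields)
    unfilled = sorted(k for k in detected_fields if k not in field_data)
    return matched, unknown, unfilled


# Usage with the functions from the Form Processing section (paths are examples):
#   fields = detect_form_fields("application.pdf")
#   matched, unknown, unfilled = plan_form_fill(fields, {"name": "Ada"})
#   fill_form_fields("application.pdf", "application_filled.pdf", matched)
```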
Example 3: Extract Tables from Report
- Open multi-page report PDF
- Extract all tables from each page
- Convert tables to CSV or Excel
- Preserve table structure and formatting
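The CSV conversion step can be sketched with the standard library alone. This assumes each table is a dict with `'page'` and `'data'` keys, the shape produced by `extract_tables` in the Table Extraction section, and that empty cells may be `None`. `save_tables_as_csv` is a hypothetical name. Note that CSV keeps the row and column structure but not visual formatting such as merged cells or fonts.

```python
import csv
import os


def save_tables_as_csv(tables, output_dir):
    """Write each table to its own CSV file and return the paths created."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for index, table in enumerate(tables, start=1):
        name = f"page{table['page']}_table{index}.csv"
        path = os.path.join(output_dir, name)
        # newline="" lets the csv module control line endings itself
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            for row in table["data"]:
                # Empty cells become empty strings
                writer.writerow(["" if cell is None else cell for cell in row])
        paths.append(path)
    return paths
```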
Required Libraries
Install necessary Python packages:
bash
pip install PyPDF2 pdfplumber PyMuPDF pytesseract pillow
pytesseract is only a wrapper, so the Tesseract OCR engine itself must also be installed on the system. Watermarking additionally requires reportlab.

Tips
- Always check if PDF is password-protected first
- Use different libraries based on your needs (speed vs accuracy)
- For scanned documents, OCR quality depends on image resolution
- Consider the PDF version when working with older files
- Test with sample pages before processing entire documents
- Handle encoding issues for non-English text
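For the last tip, one common source of trouble is that extracted text often contains compatibility characters, such as the "ﬁ" ligature or full-width Latin letters, that look right but break searches and comparisons. A minimal sketch using only the standard library, with hypothetical helper names, is to apply Unicode NFKC normalization and always write output as UTF-8:

```python
import unicodedata


def normalize_text(text):
    """Fold ligatures and full-width forms into their plain equivalents."""
    return unicodedata.normalize("NFKC", text)


def save_text(text, output_path):
    """Write normalized text as UTF-8 so non-English characters survive."""
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(normalize_text(text))
```

NFKC changes some characters on purpose (for example, superscript digits become ordinary digits), so it suits search and data extraction better than faithful reproduction of the original typography.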