pdf-processing
PDF Processing
Overview
Generate, manipulate, and extract data from PDF documents. This skill covers the Python PDF ecosystem: pypdf for merging/splitting/metadata, pdfplumber for text and table extraction, reportlab for generation, pytesseract for OCR, and strategies for form filling, watermarking, and complex document assembly.
Apply this skill whenever PDFs need to be created, parsed, transformed, or combined through code.
Multi-Phase Process
Phase 1: Requirements
- Determine operation type (generate, extract, manipulate)
- Identify input PDF characteristics (scanned, digital, forms)
- Define output requirements (format, quality, size)
- Plan data pipeline (source data to PDF or PDF to data)
- Assess volume and performance requirements
STOP — Do NOT select a library until the operation type and input characteristics are clear.
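A minimal sketch of how the input characteristics listed above might be gathered before any library is chosen. `PdfProfile`, its `kind` property, and `profile_pdf` are illustrative names, not part of this skill or of pypdf. The pypdf import is deferred so the classification logic can be exercised without the library installed.

```python
from dataclasses import dataclass


@dataclass
class PdfProfile:
    """Hypothetical summary of the input characteristics named in Phase 1."""
    encrypted: bool
    has_form_fields: bool
    pages: int
    pages_with_text: int

    @property
    def kind(self) -> str:
        if self.encrypted:
            return "encrypted"
        if self.has_form_fields:
            return "form"
        if self.pages_with_text == 0:
            # No extractable text layer: likely scanned (or an empty file).
            return "scanned"
        if self.pages_with_text < self.pages:
            return "mixed"
        return "digital"


def profile_pdf(path: str) -> PdfProfile:
    """Inspect a PDF with pypdf (imported lazily, assumed to be installed)."""
    from pypdf import PdfReader

    reader = PdfReader(path)
    if reader.is_encrypted:
        # Page content cannot be inspected until the file is decrypted.
        return PdfProfile(True, False, 0, 0)
    texts = [(page.extract_text() or "").strip() for page in reader.pages]
    return PdfProfile(
        encrypted=False,
        has_form_fields=bool(reader.get_fields()),
        pages=len(texts),
        pages_with_text=sum(1 for text in texts if text),
    )
```

A `scanned` or `mixed` result points toward the OCR path, while `form` points toward pypdf form filling.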
Phase 2: Implementation
- Select appropriate library for the task (see decision table)
- Implement core processing logic
- Handle edge cases (corrupted files, encrypted PDFs, mixed content)
- Add error handling and validation
- Optimize for file size and processing speed
STOP — Do NOT skip edge case handling for encrypted, rotated, or scanned PDFs.
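One way the encrypted and corrupted cases could be handled explicitly, sketched with pypdf. `unlock` and `open_pdf` are hypothetical helpers. Trying an empty password first is an assumption that suits files carrying only owner restrictions; a rejected password is reported by pypdf's `decrypt` as a zero-valued result.

```python
def unlock(reader, password=""):
    """Return True when `reader` is readable, decrypting it if needed.

    `reader` is expected to behave like pypdf.PdfReader: an `is_encrypted`
    attribute and a `decrypt(password)` method whose result is falsy (0)
    when the password is rejected.
    """
    if not reader.is_encrypted:
        return True
    return bool(reader.decrypt(password))


def open_pdf(path, password=""):
    """Open `path` with pypdf; return None for corrupted or locked files."""
    from pypdf import PdfReader  # lazy import, pypdf assumed installed
    from pypdf.errors import PdfReadError

    try:
        reader = PdfReader(path)
    except (PdfReadError, OSError):
        return None  # corrupted, truncated, or missing file
    return reader if unlock(reader, password) else None
```

Returning `None` lets a batch job log the file and skip it instead of crashing halfway through.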
Phase 3: Validation
- Verify output renders correctly in multiple PDF viewers
- Check text is selectable (not rasterized) when applicable
- Validate extracted data accuracy
- Test with edge case PDFs (large, encrypted, scanned)
- Verify accessibility (tagged PDF where needed)
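The "validate extracted data accuracy" step can start with cheap structural checks before any field-level comparison. The sketch below assumes a table in the shape pdfplumber's `extract_tables()` yields (a list of rows, each a list of strings or `None`); `table_problems` is a hypothetical helper name.

```python
def table_problems(rows, expected_columns=None):
    """Return a list of human-readable problems found in one extracted table."""
    if not rows:
        return ["table is empty"]
    # Default to the width of the first row when no column count is given.
    width = expected_columns if expected_columns is not None else len(rows[0])
    problems = []
    for index, row in enumerate(rows):
        if len(row) != width:
            problems.append(f"row {index}: expected {width} cells, got {len(row)}")
        elif all(cell is None or not str(cell).strip() for cell in row):
            problems.append(f"row {index}: all cells are empty")
    return problems
```

An empty result does not prove the data is right; it only rules out the most common extraction failures (ragged rows and blank rows).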
Library Selection Decision Table
| Task | Library | Why | Alternative |
|---|---|---|---|
| Text extraction | pdfplumber | Best accuracy, handles layouts | pypdf (simpler, less accurate) |
| Table extraction | pdfplumber | Structured table parsing | camelot (dedicated table tool) |
| PDF generation | reportlab | Full control, professional quality | weasyprint (HTML-to-PDF) |
| Merge / split | pypdf | Simple, reliable, fast | — |
| Form filling | pypdf | Reads and fills AcroForms | pdfrw (alternative API) |
| Metadata read/write | pypdf | Read/write PDF properties | — |
| OCR (scanned docs) | pytesseract + pdf2image | Scanned document text extraction | EasyOCR (deep learning) |
| Watermarking | pypdf + reportlab | Overlay pages | — |
| HTML to PDF | weasyprint | CSS-based layout, server-friendly | playwright (browser rendering) |
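If the choice needs to be made in code, the table above can be kept as plain data. This is only an illustration: `LIBRARY_CHOICES` and `choose_library` are made-up names, and the entries simply mirror the Library and Alternative columns.

```python
# Task -> (primary library, alternative or None), mirroring the decision table.
LIBRARY_CHOICES = {
    "text extraction": ("pdfplumber", "pypdf"),
    "table extraction": ("pdfplumber", "camelot"),
    "pdf generation": ("reportlab", "weasyprint"),
    "merge / split": ("pypdf", None),
    "form filling": ("pypdf", "pdfrw"),
    "metadata read/write": ("pypdf", None),
    "ocr (scanned docs)": ("pytesseract + pdf2image", "EasyOCR"),
    "watermarking": ("pypdf + reportlab", None),
    "html to pdf": ("weasyprint", "playwright"),
}


def choose_library(task, prefer_alternative=False):
    """Look up the library for a task name (case-insensitive)."""
    primary, alternative = LIBRARY_CHOICES[task.strip().lower()]
    if prefer_alternative and alternative is not None:
        return alternative
    return primary
```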
PDF Generation with ReportLab
```python
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import cm, mm
from reportlab.lib.colors import HexColor
from reportlab.platypus import (
    SimpleDocTemplate, Paragraph, Spacer, Table,
    TableStyle, Image, PageBreak
)
from reportlab.lib import colors

def generate_report(output_path, data):
    doc = SimpleDocTemplate(
        output_path,
        pagesize=A4,
        topMargin=2.5*cm,
        bottomMargin=2.5*cm,
        leftMargin=2.5*cm,
        rightMargin=2.5*cm,
    )
    styles = getSampleStyleSheet()
    styles.add(ParagraphStyle(
        name='CustomTitle',
        parent=styles['Title'],
        fontSize=24,
        textColor=HexColor('#2F5496'),
        spaceAfter=20,
    ))
    story = []

    # Title
    story.append(Paragraph(data['title'], styles['CustomTitle']))
    story.append(Spacer(1, 12))

    # Body text
    story.append(Paragraph(data['body'], styles['Normal']))
    story.append(Spacer(1, 20))

    # Table
    table_data = [['Name', 'Value', 'Status']]
    for row in data['rows']:
        table_data.append([row['name'], row['value'], row['status']])
    table = Table(table_data, colWidths=[6*cm, 4*cm, 4*cm])
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), HexColor('#2F5496')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('FONTSIZE', (0, 0), (-1, 0), 11),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
        ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, HexColor('#F0F4FA')]),
        ('TOPPADDING', (0, 0), (-1, -1), 8),
        ('BOTTOMPADDING', (0, 0), (-1, -1), 8),
    ]))
    story.append(table)
    doc.build(story)
```
Custom Page Template (Headers/Footers)
```python
from datetime import datetime

from reportlab.lib.colors import HexColor
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import cm
from reportlab.platypus import BaseDocTemplate, Frame, PageTemplate

def add_header_footer(canvas, doc):
    canvas.saveState()
    # Header
    canvas.setFont('Helvetica', 9)
    canvas.setFillColor(HexColor('#888888'))
    canvas.drawString(2.5*cm, A4[1] - 1.5*cm, 'Company Name - Confidential')
    canvas.drawRightString(A4[0] - 2.5*cm, A4[1] - 1.5*cm, f'Page {doc.page}')
    # Footer
    canvas.drawCentredString(A4[0]/2, 1.5*cm, f'Generated on {datetime.now():%Y-%m-%d}')
    canvas.restoreState()

output_path = 'report.pdf'  # placeholder output path (assumed)
doc = BaseDocTemplate(output_path, pagesize=A4)
# The frame is the content area inside 2.5 cm margins
frame = Frame(2.5*cm, 2.5*cm, A4[0]-5*cm, A4[1]-5*cm)
doc.addPageTemplates([PageTemplate(id='main', frames=[frame], onPage=add_header_footer)])
```
Text and Table Extraction
pdfplumber
```python
import pdfplumber

with pdfplumber.open('document.pdf') as pdf:
    # Extract text from all pages; guard against a None result
    # on pages without a text layer
    full_text = ''
    for page in pdf.pages:
        full_text += (page.extract_text() or '') + '\n'

    # Extract tables
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

    # Extract text from specific area
    page = pdf.pages[0]
    bbox = (50, 100, 400, 300)  # (x0, top, x1, bottom)
    cropped = page.within_bbox(bbox)
    text = cropped.extract_text()
```
Table Extraction Settings
```python
table_settings = {
    "vertical_strategy": "lines",  # or "text", "explicit"
    "horizontal_strategy": "lines",
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
}
tables = page.extract_tables(table_settings)
```
Form Filling
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader('form.pdf')
writer = PdfWriter()
writer.append(reader)

# Fill form fields
writer.update_page_form_field_values(
    writer.pages[0],
    {
        'full_name': 'Alice Johnson',
        'email': 'alice@example.com',
        'date': '2025-03-15',
        'agree_terms': '/Yes',  # Checkbox
    },
    auto_regenerate=False,
)
with open('filled_form.pdf', 'wb') as f:
    writer.write(f)
```
OCR (Scanned PDFs)
```python
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path, language='eng'):
    images = convert_from_path(pdf_path, dpi=300)
    full_text = ''
    for i, image in enumerate(images):
        text = pytesseract.image_to_string(image, lang=language)
        full_text += f'\n--- Page {i+1} ---\n{text}'
    return full_text

# For better accuracy with specific layouts:
def ocr_with_config(image):
    # --oem 3: default engine mode (uses what is available, typically LSTM)
    # --psm 6: assume a single uniform block of text
    custom_config = r'--oem 3 --psm 6'
    return pytesseract.image_to_string(image, config=custom_config)
```
Merge and Split
```python
from pypdf import PdfReader, PdfWriter

# Merge multiple PDFs
def merge_pdfs(input_paths, output_path):
    writer = PdfWriter()
    for path in input_paths:
        reader = PdfReader(path)
        for page in reader.pages:
            writer.add_page(page)
    with open(output_path, 'wb') as f:
        writer.write(f)

# Split PDF by page ranges
def split_pdf(input_path, ranges, output_dir):
    reader = PdfReader(input_path)
    for i, (start, end) in enumerate(ranges):
        writer = PdfWriter()
        for page_num in range(start - 1, min(end, len(reader.pages))):
            writer.add_page(reader.pages[page_num])
        with open(f'{output_dir}/part_{i+1}.pdf', 'wb') as f:
            writer.write(f)

# Extract specific pages
def extract_pages(input_path, page_numbers, output_path):
    reader = PdfReader(input_path)
    writer = PdfWriter()
    for num in page_numbers:
        writer.add_page(reader.pages[num - 1])
    with open(output_path, 'wb') as f:
        writer.write(f)
```
Watermarking
```python
from pypdf import PdfReader, PdfWriter
from reportlab.pdfgen import canvas as rl_canvas
from reportlab.lib.pagesizes import A4
from io import BytesIO

def create_watermark(text, opacity=0.1):
    buffer = BytesIO()
    c = rl_canvas.Canvas(buffer, pagesize=A4)
    c.setFillAlpha(opacity)
    c.setFont('Helvetica-Bold', 60)
    c.setFillColorRGB(0.5, 0.5, 0.5)
    c.translate(A4[0]/2, A4[1]/2)
    c.rotate(45)
    c.drawCentredString(0, 0, text)
    c.save()
    buffer.seek(0)
    return PdfReader(buffer)

def apply_watermark(input_path, output_path, watermark_text):
    watermark = create_watermark(watermark_text)
    reader = PdfReader(input_path)
    writer = PdfWriter()
    for page in reader.pages:
        page.merge_page(watermark.pages[0])
        writer.add_page(page)
    with open(output_path, 'wb') as f:
        writer.write(f)
```
Metadata Handling
```python
from pypdf import PdfReader, PdfWriter

# Read metadata
reader = PdfReader('document.pdf')
info = reader.metadata  # None when the PDF has no document information dictionary
if info is not None:
    print(f'Title: {info.title}')
    print(f'Author: {info.author}')
print(f'Pages: {len(reader.pages)}')

# Write metadata
writer = PdfWriter()
writer.append(reader)
writer.add_metadata({
    '/Title': 'Updated Title',
    '/Author': 'Author Name',
    '/Subject': 'Document Subject',
    '/Creator': 'My Application',
})
with open('updated.pdf', 'wb') as f:
    writer.write(f)
```
Anti-Patterns / Common Mistakes
| Anti-Pattern | Why It Fails | What To Do Instead |
|---|---|---|
| OCR on digital (text-based) PDFs | Slow and inaccurate when text is already extractable | Check if text extracts first, OCR only if empty |
| Not handling encrypted PDFs | Crashes or silent failures | Detect encryption, prompt for password or skip gracefully |
| Loading entire large PDFs into memory | Memory exhaustion on server | Stream pages or process in chunks |
| Ignoring page rotation metadata | Text extraction returns garbled results | Read and apply rotation before extraction |
| Hardcoding page dimensions | Breaks on non-A4 documents | Read dimensions from source PDF |
| Not closing file handles | Resource leaks in long-running processes | Use context managers |
| Generating without multi-viewer testing | Rendering differences across viewers | Test in Adobe Reader, Preview, and Chrome |
| Extracting tables without tuning settings | Poor column alignment, merged cells | Adjust the table extraction settings per document type |
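For the "hardcoding page dimensions" and "ignoring page rotation metadata" rows, a rough sketch of reading the size from the page itself. `visual_size` and `page_visual_size` are hypothetical helpers; the second assumes a pypdf page object, whose `mediabox` exposes `width` and `height` in points and whose `rotation` is a multiple of 90.

```python
def visual_size(width, height, rotation=0):
    """Return (width, height) in points as a viewer would display the page."""
    rotation %= 360
    if rotation not in (0, 90, 180, 270):
        raise ValueError(f"unsupported rotation: {rotation}")
    if rotation in (90, 270):
        # A quarter turn swaps the displayed width and height.
        return float(height), float(width)
    return float(width), float(height)


def page_visual_size(page):
    """Read the displayed size from a pypdf page instead of assuming A4."""
    box = page.mediabox
    return visual_size(box.width, box.height, page.rotation)
```

An overlay such as a watermark can then be drawn on a canvas of this size rather than on a fixed A4 canvas.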
Anti-Rationalization Guards
- Do NOT use OCR without first attempting direct text extraction -- check the PDF type.
- Do NOT skip encryption detection -- handle it explicitly even if "most PDFs aren't encrypted."
- Do NOT assume A4 page size -- read dimensions from the source document.
- Do NOT test in only one PDF viewer -- rendering varies across Adobe, Preview, and Chrome.
- Do NOT process large PDFs without memory-conscious patterns (streaming, chunking).
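The first and last guards can be combined in one page-at-a-time pattern: try direct extraction, and fall back to OCR only for pages that come back nearly empty. This is a sketch under assumptions: the `min_chars=20` threshold is arbitrary, and `extract_with_fallback` and `extract_pdf_text` are illustrative names. The library calls (pdfplumber, pdf2image, pytesseract) are imported lazily and assumed to be installed.

```python
def extract_with_fallback(direct_texts, ocr_page, min_chars=20):
    """Yield one text per page, calling `ocr_page(index)` only when needed.

    `direct_texts` may be any iterable (including a generator), so pages
    are handled one at a time instead of being held in memory together.
    """
    for index, text in enumerate(direct_texts):
        text = (text or "").strip()
        if len(text) >= min_chars:
            yield text
        else:
            yield ocr_page(index)


def extract_pdf_text(path, dpi=300):
    """Stream page texts from `path`, using OCR only for sparse pages."""
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    def ocr_page(index):
        # Render just the one page that needs OCR (pdf2image pages are 1-based).
        images = convert_from_path(path, dpi=dpi, first_page=index + 1, last_page=index + 1)
        return pytesseract.image_to_string(images[0])

    with pdfplumber.open(path) as pdf:
        direct = (page.extract_text() for page in pdf.pages)
        yield from extract_with_fallback(direct, ocr_page)
```

Because both functions are generators, a caller can write each page's text to disk as it arrives instead of accumulating the whole document.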
Integration Points
| Skill | How It Connects |
|---|---|
|  | DOCX-to-PDF conversion pipeline, or choosing between formats |
|  | Data from Excel populates PDF report tables |
|  | Generated PDFs attach to professional emails |
|  | Research output formatted as PDF whitepapers |
|  | Output file naming and directory structure conventions |
|  | PDF generation pipelines in server/CI environments |
Skill Type
FLEXIBLE — Select the appropriate library and approach based on the specific PDF task. ReportLab for generation, pdfplumber for extraction, pypdf for manipulation. Combine as needed.