pdf-processing

PDF Processing

Overview

Generate, manipulate, and extract data from PDF documents. This skill covers the Python PDF ecosystem: pypdf for merging/splitting/metadata, pdfplumber for text and table extraction, reportlab for generation, pytesseract for OCR, and strategies for form filling, watermarking, and complex document assembly.
Apply this skill whenever PDFs need to be created, parsed, transformed, or combined through code.

Multi-Phase Process

Phase 1: Requirements

  1. Determine operation type (generate, extract, manipulate)
  2. Identify input PDF characteristics (scanned, digital, forms)
  3. Define output requirements (format, quality, size)
  4. Plan data pipeline (source data to PDF or PDF to data)
  5. Assess volume and performance requirements
STOP — Do NOT select a library until the operation type and input characteristics are clear.
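
Steps 1 and 2 of the list above can be made concrete with a quick triage pass before any library is chosen. The sketch below is one possible approach rather than a prescribed one: `classify_pdf` and `inspect_pdf` are hypothetical names, the 20-character threshold is an arbitrary assumption, and the inspection relies on pypdf's `is_encrypted`, `get_fields()`, and `extract_text()`.

```python
def classify_pdf(is_encrypted, field_count, sampled_text_chars, min_chars=20):
    """Map a few cheap observations to an input category.

    min_chars is an assumed threshold: fewer extractable characters than
    this across the sampled pages is treated as "probably scanned".
    """
    if is_encrypted:
        return "encrypted"
    if field_count > 0:
        return "form"
    if sampled_text_chars < min_chars:
        return "scanned"
    return "digital"


def inspect_pdf(path, sample_pages=3):
    """Gather the observations with pypdf (imported lazily) and classify."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    if reader.is_encrypted:
        # Pages cannot be read until the file has been decrypted.
        return classify_pdf(True, 0, 0)

    fields = reader.get_fields() or {}  # None when there are no form fields
    chars = 0
    for index in range(min(sample_pages, len(reader.pages))):
        chars += len((reader.pages[index].extract_text() or "").strip())
    return classify_pdf(False, len(fields), chars)
```

The returned category can then drive the library choice: "scanned" points toward the OCR path, "form" toward pypdf form filling, and "digital" toward pdfplumber extraction.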

Phase 2: Implementation

  1. Select appropriate library for the task (see decision table)
  2. Implement core processing logic
  3. Handle edge cases (corrupted files, encrypted PDFs, mixed content)
  4. Add error handling and validation
  5. Optimize for file size and processing speed
STOP — Do NOT skip edge case handling for encrypted, rotated, or scanned PDFs.
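
Encrypted input is the edge case most likely to surface as a crash. A hedged sketch of handling it with pypdf follows. `unlock` and `open_reader` are hypothetical helper names, and returning `None` to mean "skip this file" is an assumed policy, not a requirement.

```python
def unlock(reader, password=None):
    """Return True when the reader's pages can be accessed.

    `reader` is expected to behave like pypdf.PdfReader: it has an
    `is_encrypted` attribute and a `decrypt(password)` method whose
    result is falsy (0 / NOT_DECRYPTED) when the password is wrong.
    """
    if not reader.is_encrypted:
        return True
    if password is None:
        return False
    return bool(reader.decrypt(password))


def open_reader(path, password=None):
    """Open a PDF, or return None so the caller can skip it gracefully."""
    from pypdf import PdfReader  # third-party: pip install pypdf
    from pypdf.errors import PdfReadError

    try:
        reader = PdfReader(path)
    except PdfReadError:
        return None  # pypdf could not parse the file
    return reader if unlock(reader, password) else None
```

A caller would typically log the skipped path or prompt for a password when `open_reader` returns `None`. AES-encrypted files may also need pypdf's optional cryptography dependency installed.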

Phase 3: Validation

  1. Verify output renders correctly in multiple PDF viewers
  2. Check text is selectable (not rasterized) when applicable
  3. Validate extracted data accuracy
  4. Test with edge case PDFs (large, encrypted, scanned)
  5. Verify accessibility (tagged PDF where needed)

Library Selection Decision Table

| Task | Library | Why | Alternative |
| --- | --- | --- | --- |
| Text extraction | pdfplumber | Best accuracy, handles layouts | pypdf (simpler, less accurate) |
| Table extraction | pdfplumber | Structured table parsing | camelot (dedicated table tool) |
| PDF generation | reportlab | Full control, professional quality | weasyprint (HTML-to-PDF) |
| Merge / split | pypdf | Simple, reliable, fast | |
| Form filling | pypdf | Reads and fills AcroForms | pdfrw (alternative API) |
| Metadata read/write | pypdf | Read/write PDF properties | |
| OCR (scanned docs) | pytesseract + pdf2image | Scanned document text extraction | EasyOCR (deep learning) |
| Watermarking | pypdf + reportlab | Overlay pages | |
| HTML to PDF | weasyprint | CSS-based layout, server-friendly | playwright (browser rendering) |

PDF Generation with ReportLab

python
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import cm, mm
from reportlab.lib.colors import HexColor
from reportlab.platypus import (
    SimpleDocTemplate, Paragraph, Spacer, Table,
    TableStyle, Image, PageBreak
)
from reportlab.lib import colors

def generate_report(output_path, data):
    doc = SimpleDocTemplate(
        output_path,
        pagesize=A4,
        topMargin=2.5*cm,
        bottomMargin=2.5*cm,
        leftMargin=2.5*cm,
        rightMargin=2.5*cm,
    )

    styles = getSampleStyleSheet()
    styles.add(ParagraphStyle(
        name='CustomTitle',
        parent=styles['Title'],
        fontSize=24,
        textColor=HexColor('#2F5496'),
        spaceAfter=20,
    ))

    story = []

    # Title
    story.append(Paragraph(data['title'], styles['CustomTitle']))
    story.append(Spacer(1, 12))

    # Body text
    story.append(Paragraph(data['body'], styles['Normal']))
    story.append(Spacer(1, 20))

    # Table
    table_data = [['Name', 'Value', 'Status']]
    for row in data['rows']:
        table_data.append([row['name'], row['value'], row['status']])

    table = Table(table_data, colWidths=[6*cm, 4*cm, 4*cm])
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), HexColor('#2F5496')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('FONTSIZE', (0, 0), (-1, 0), 11),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
        ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, HexColor('#F0F4FA')]),
        ('TOPPADDING', (0, 0), (-1, -1), 8),
        ('BOTTOMPADDING', (0, 0), (-1, -1), 8),
    ]))
    story.append(table)

    doc.build(story)
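
One caveat with `generate_report`: `Paragraph` parses its text as XML-like inline markup (`<b>`, `<i>`, `<font>` and so on), so a raw `&` or `<` in `data['title']` or `data['body']` can be misread or rejected. The sketch below shows one way to sanitize input first, assuming the text is meant to be displayed literally. `prepare_report_data` is a hypothetical helper, not part of ReportLab.

```python
from xml.sax.saxutils import escape


def prepare_report_data(data):
    """Return a copy of `data` that is safe to hand to generate_report.

    Title and body are escaped because they end up inside Paragraph
    objects. Table cells are passed as plain strings, so they are only
    converted with str() and left unescaped.
    """
    return {
        "title": escape(str(data["title"])),
        "body": escape(str(data["body"])),
        "rows": [
            {key: str(row[key]) for key in ("name", "value", "status")}
            for row in data.get("rows", [])
        ],
    }


sample = prepare_report_data({
    "title": "Q1 Results & Outlook",
    "body": "Revenue < forecast in 2 of 3 regions.",
    "rows": [{"name": "EMEA", "value": 1250, "status": "OK"}],
})
print(sample["title"])  # Q1 Results &amp; Outlook
```

If the body is supposed to carry intentional markup such as `<b>`, escaping should be skipped for that field and the markup validated another way.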

Custom Page Template (Headers/Footers)

python
from datetime import datetime

from reportlab.lib.colors import HexColor
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import cm
from reportlab.platypus import BaseDocTemplate, Frame, PageTemplate

def add_header_footer(canvas, doc):
    # Called once per page; draws in the margins outside the content frame
    canvas.saveState()
    # Header
    canvas.setFont('Helvetica', 9)
    canvas.setFillColor(HexColor('#888888'))
    canvas.drawString(2.5*cm, A4[1] - 1.5*cm, 'Company Name - Confidential')
    canvas.drawRightString(A4[0] - 2.5*cm, A4[1] - 1.5*cm, f'Page {doc.page}')
    # Footer
    canvas.drawCentredString(A4[0]/2, 1.5*cm, f'Generated on {datetime.now():%Y-%m-%d}')
    canvas.restoreState()

output_path = 'report.pdf'  # placeholder path for this example
doc = BaseDocTemplate(output_path, pagesize=A4)
frame = Frame(2.5*cm, 2.5*cm, A4[0]-5*cm, A4[1]-5*cm)
doc.addPageTemplates([PageTemplate(id='main', frames=[frame], onPage=add_header_footer)])
# Calling doc.build(story) then applies the header and footer to every page

Text and Table Extraction

pdfplumber

python
import pdfplumber

with pdfplumber.open('document.pdf') as pdf:
    # Extract text from all pages
    full_text = ''
    for page in pdf.pages:
        # Some pdfplumber versions return None for a page with no text layer
        full_text += (page.extract_text() or '') + '\n'

    # Extract tables
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

    # Extract text from specific area
    page = pdf.pages[0]
    bbox = (50, 100, 400, 300)  # (x0, top, x1, bottom)
    cropped = page.within_bbox(bbox)
    text = cropped.extract_text()

Table Extraction Settings

python
table_settings = {
    "vertical_strategy": "lines",    # or "text", "explicit"
    "horizontal_strategy": "lines",
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
}

tables = page.extract_tables(table_settings)
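
`extract_tables` returns each table as a list of rows, and each row as a list of cell strings in which empty cells appear as `None`. A small post-processing step usually makes that output easier to consume. The following is a sketch under the assumption that the first row is a header. `table_to_records` is a hypothetical helper name.

```python
def table_to_records(table):
    """Convert one extracted table (list of rows) into a list of dicts.

    Assumes row 0 is the header. None cells become empty strings, line
    breaks inside a cell are collapsed, and fully empty rows are dropped.
    """
    def clean(cell):
        return " ".join((cell or "").split())

    if not table:
        return []
    header = [clean(cell) or f"col_{i}" for i, cell in enumerate(table[0])]
    records = []
    for row in table[1:]:
        cells = [clean(cell) for cell in row]
        if any(cells):
            records.append(dict(zip(header, cells)))
    return records


raw = [
    ["Name", "Value", None],
    ["Widget", "12.50", "in\nstock"],
    [None, None, None],
]
print(table_to_records(raw))
# [{'Name': 'Widget', 'Value': '12.50', 'col_2': 'in stock'}]
```

Header cells that come back empty get a positional name (`col_2` here) so that no column is silently lost.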

Form Filling

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader('form.pdf')
writer = PdfWriter()
writer.append(reader)

# Fill form fields
writer.update_page_form_field_values(
    writer.pages[0],
    {
        'full_name': 'Alice Johnson',
        'email': 'alice@example.com',
        'date': '2025-03-15',
        'agree_terms': '/Yes',  # Checkbox
    },
    auto_regenerate=False,
)

with open('filled_form.pdf', 'wb') as f:
    writer.write(f)

OCR (Scanned PDFs)

python
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path, language='eng'):
    images = convert_from_path(pdf_path, dpi=300)
    full_text = ''
    for i, image in enumerate(images):
        text = pytesseract.image_to_string(image, lang=language)
        full_text += f'\n--- Page {i+1} ---\n{text}'
    return full_text
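
`ocr_pdf` rasterizes every page at 300 DPI before OCR begins, which can exhaust memory on long documents. One memory-conscious variation, sketched here under assumptions, converts a few pages at a time using pdf2image's `first_page` and `last_page` arguments. `page_batches` and `ocr_pdf_in_batches` are hypothetical names, and the batch size of 10 is arbitrary.

```python
def page_batches(total_pages, batch_size=10):
    """Yield (first, last) 1-based inclusive page ranges."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(pdf_path, language="eng", batch_size=10, dpi=300):
    """OCR a PDF while holding at most `batch_size` page images in memory."""
    # Third-party imports are kept local so page_batches works without them.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    parts = []
    for first, last in page_batches(total_pages, batch_size):
        images = convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last
        )
        for offset, image in enumerate(images):
            text = pytesseract.image_to_string(image, lang=language)
            parts.append(f"\n--- Page {first + offset} ---\n{text}")
    return "".join(parts)


print(list(page_batches(23, 10)))  # [(1, 10), (11, 20), (21, 23)]
```

The output format matches `ocr_pdf`, so the two functions are interchangeable for callers. Only peak memory differs.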

python
import pytesseract

# For better accuracy with specific layouts:
def ocr_with_config(image):
    # --oem 3: default engine mode (LSTM where available)
    # --psm 6: assume a single uniform block of text
    custom_config = r'--oem 3 --psm 6'
    return pytesseract.image_to_string(image, config=custom_config)

Merge and Split

python
from pypdf import PdfReader, PdfWriter

# Merge multiple PDFs
def merge_pdfs(input_paths, output_path):
    writer = PdfWriter()
    for path in input_paths:
        reader = PdfReader(path)
        for page in reader.pages:
            writer.add_page(page)
    with open(output_path, 'wb') as f:
        writer.write(f)

# Split PDF by page ranges (1-based, inclusive)
def split_pdf(input_path, ranges, output_dir):
    reader = PdfReader(input_path)
    for i, (start, end) in enumerate(ranges):
        writer = PdfWriter()
        for page_num in range(start - 1, min(end, len(reader.pages))):
            writer.add_page(reader.pages[page_num])
        with open(f'{output_dir}/part_{i+1}.pdf', 'wb') as f:
            writer.write(f)

# Extract specific pages (1-based page numbers)
def extract_pages(input_path, page_numbers, output_path):
    reader = PdfReader(input_path)
    writer = PdfWriter()
    for num in page_numbers:
        writer.add_page(reader.pages[num - 1])
    with open(output_path, 'wb') as f:
        writer.write(f)

Watermarking

python
from pypdf import PdfReader, PdfWriter
from reportlab.pdfgen import canvas as rl_canvas
from reportlab.lib.pagesizes import A4
from io import BytesIO

def create_watermark(text, opacity=0.1):
    buffer = BytesIO()
    c = rl_canvas.Canvas(buffer, pagesize=A4)
    c.setFillAlpha(opacity)
    c.setFont('Helvetica-Bold', 60)
    c.setFillColorRGB(0.5, 0.5, 0.5)
    c.translate(A4[0]/2, A4[1]/2)
    c.rotate(45)
    c.drawCentredString(0, 0, text)
    c.save()
    buffer.seek(0)
    return PdfReader(buffer)

def apply_watermark(input_path, output_path, watermark_text):
    watermark = create_watermark(watermark_text)
    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page in reader.pages:
        page.merge_page(watermark.pages[0])
        writer.add_page(page)

    with open(output_path, 'wb') as f:
        writer.write(f)

Metadata Handling

python
from pypdf import PdfReader, PdfWriter

# Read metadata
reader = PdfReader('document.pdf')
info = reader.metadata  # None when the PDF has no document information
if info is not None:
    print(f'Title: {info.title}')
    print(f'Author: {info.author}')
print(f'Pages: {len(reader.pages)}')

# Write metadata
writer = PdfWriter()
writer.append(reader)
writer.add_metadata({
    '/Title': 'Updated Title',
    '/Author': 'Author Name',
    '/Subject': 'Document Subject',
    '/Creator': 'My Application',
})
with open('updated.pdf', 'wb') as f:
    writer.write(f)

Anti-Patterns / Common Mistakes

| Anti-Pattern | Why It Fails | What To Do Instead |
| --- | --- | --- |
| OCR on digital (text-based) PDFs | Slow and inaccurate when text is already extractable | Check if text extracts first, OCR only if empty |
| Not handling encrypted PDFs | Crashes or silent failures | Detect encryption, prompt for password or skip gracefully |
| Loading entire large PDFs into memory | Memory exhaustion on server | Stream pages or process in chunks |
| Ignoring page rotation metadata | Text extraction returns garbled results | Read and apply rotation before extraction |
| Hardcoding page dimensions | Breaks on non-A4 documents | Read dimensions from source PDF |
| Not closing file handles | Resource leaks in long-running processes | Use context managers (`with` statements) |
| Generating without multi-viewer testing | Rendering differences across viewers | Test in Adobe Reader, Preview, and Chrome |
| Extracting tables without tuning settings | Poor column alignment, merged cells | Adjust `table_settings` per document type |

Anti-Rationalization Guards

  • Do NOT use OCR without first attempting direct text extraction -- check the PDF type.
  • Do NOT skip encryption detection -- handle it explicitly even if "most PDFs aren't encrypted."
  • Do NOT assume A4 page size -- read dimensions from the source document.
  • Do NOT test in only one PDF viewer -- rendering varies across Adobe, Preview, and Chrome.
  • Do NOT process large PDFs without memory-conscious patterns (streaming, chunking).
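
The first guard can be enforced in code rather than by convention. The sketch below tries direct extraction and only invokes OCR for pages that yield almost no text. It is an illustration under assumptions: `extract_with_fallback` is a hypothetical function, the two extractor callables are supplied by the caller (for instance thin wrappers around pdfplumber and pytesseract), and the 20-character cutoff is arbitrary.

```python
def extract_with_fallback(page_count, extract_text, ocr_page, min_chars=20):
    """Return (texts, ocr_pages): per-page text and the 0-based pages sent to OCR.

    extract_text(i) -> str or None   direct extraction for page i
    ocr_page(i)     -> str           OCR for page i, called only when needed
    """
    texts, ocr_pages = [], []
    for index in range(page_count):
        text = (extract_text(index) or "").strip()
        if len(text) < min_chars:
            text = ocr_page(index).strip()
            ocr_pages.append(index)
        texts.append(text)
    return texts, ocr_pages


# Stand-in extractors for demonstration; real ones would wrap a PDF library.
digital = {0: "Invoice 1042, issued to Alice Johnson.", 1: None}
texts, ocr_pages = extract_with_fallback(
    page_count=2,
    extract_text=digital.get,
    ocr_page=lambda i: f"(OCR text for page {i + 1})",
)
print(ocr_pages)  # [1]
```

Deciding per page rather than per document also covers mixed-content PDFs, where only some pages are scans.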

Integration Points

| Skill | How It Connects |
| --- | --- |
| docx-processing | DOCX-to-PDF conversion pipeline, or choosing between formats |
| xlsx-processing | Data from Excel populates PDF report tables |
| email-composer | Generated PDFs attach to professional emails |
| content-research-writer | Research output formatted as PDF whitepapers |
| file-organizer | Output file naming and directory structure conventions |
| deployment | PDF generation pipelines in server/CI environments |

Skill Type

FLEXIBLE — Select the appropriate library and approach based on the specific PDF task. ReportLab for generation, pdfplumber for extraction, pypdf for manipulation. Combine as needed.