pdf-official

PDF Processing Guide

Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see reference.md. If you need to fill out a PDF form, read forms.md and follow its instructions.

Quick Start

python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()

Python Libraries

pypdf - Basic Operations

Merge PDFs

python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

Split PDF

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
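
The loop above writes one file per page. When a document should be split into fixed-size chunks instead, one possible approach is sketched below. `chunk_ranges` and `split_pdf` are hypothetical helper names (not part of pypdf), and the output file naming is an assumption.

```python
def chunk_ranges(num_pages, chunk_size):
    """Return (start, stop) index pairs covering num_pages in chunk_size steps."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, num_pages))
            for start in range(0, num_pages, chunk_size)]


def split_pdf(path, chunk_size):
    """Write each chunk of pages to its own file (hypothetical naming scheme)."""
    # Imported here so chunk_ranges can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)
    for start, stop in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # Page numbers in the file name are 1-based and inclusive.
        with open(f"pages_{start + 1}-{stop}.pdf", "wb") as output:
            writer.write(output)
```

Calling `split_pdf("input.pdf", 5)` on a 12-page file would produce three files under this naming scheme, the last one holding pages 11 and 12.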

Extract Metadata

python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
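
Note that `reader.metadata` can be `None` when a file carries no document information, and individual fields can be `None` as well, so the attribute access above may fail on some inputs. A minimal defensive sketch follows; `metadata_summary` is a hypothetical helper, and the `SimpleNamespace` object only stands in for real metadata.

```python
from types import SimpleNamespace

FIELDS = ("title", "author", "subject", "creator")


def metadata_summary(meta):
    """Return a dict of common metadata fields, using None for anything missing."""
    if meta is None:
        return {field: None for field in FIELDS}
    return {field: getattr(meta, field, None) for field in FIELDS}


# Stand-in object for illustration; in practice pass reader.metadata.
example = SimpleNamespace(title="Report", author=None)
print(metadata_summary(example))
print(metadata_summary(None))
```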

Rotate Pages

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
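
pypdf documents the `rotate` angle as needing to be a multiple of 90. If the angle comes from user input, it may help to validate it first. The sketch below does that and rotates every page rather than only the first; `normalize_rotation` and `rotate_all_pages` are hypothetical names introduced for illustration.

```python
def normalize_rotation(angle):
    """Map a clockwise angle to 0, 90, 180, or 270; reject anything else."""
    if angle % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {angle}")
    return angle % 360


def rotate_all_pages(src, dst, angle):
    """Rotate every page of src clockwise by angle and write the result to dst."""
    # Imported here so normalize_rotation can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    angle = normalize_rotation(angle)
    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        page.rotate(angle)
        writer.add_page(page)
    with open(dst, "wb") as output:
        writer.write(output)
```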

pdfplumber - Text and Table Extraction

Extract Text with Layout

python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

Extract Tables

python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)

Advanced Table Extraction

python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Check if table is not empty
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
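
Each table from `extract_tables()` is a list of rows, and empty cells can come back as `None`. When pandas is not needed, a plain-Python conversion may be enough. The following is a sketch only: `table_to_records` is a hypothetical helper, and the `column_N` fallback for empty header cells is an assumption made for illustration.

```python
def table_to_records(table):
    """Convert a list-of-rows table into a list of dicts keyed by the header row."""
    if not table:
        return []
    header, *rows = table
    # Fall back to a positional name when a header cell is empty.
    names = [(cell or "").strip() or f"column_{i + 1}"
             for i, cell in enumerate(header)]
    return [
        {name: (cell or "").strip() for name, cell in zip(names, row)}
        for row in rows
    ]


# Hand-written sample in the same shape as an extracted table.
sample = [["Name", None], ["Ada", " 36 "], ["Linus", None]]
print(table_to_records(sample))
```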

reportlab - Create PDFs

Basic PDF Creation

python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
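
The `height - 100` expressions above exist because canvas coordinates are measured in points (1/72 inch) from the bottom-left corner of the page. A small conversion helper can make top-down layouts easier to read. This is a sketch: `from_top_left` is a hypothetical name, and the page size is hard-coded to US Letter (612 x 792 points) so the snippet runs without reportlab.

```python
POINTS_PER_INCH = 72  # PDF user-space unit: 1 point = 1/72 inch
LETTER = (612, 792)   # US Letter in points, matching reportlab's `letter`


def from_top_left(x_in, y_in, page_size=LETTER):
    """Convert inches measured from the top-left corner to canvas points."""
    _, height = page_size
    return x_in * POINTS_PER_INCH, height - y_in * POINTS_PER_INCH


x, y = from_top_left(1, 1.5)
print(x, y)
```

The result can be passed straight to the canvas, for example `c.drawString(*from_top_left(1, 1.5), "Hello World!")`.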

Create PDF with Multiple Pages

python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build PDF
doc.build(story)

Command-Line Tools

pdftotext (poppler-utils)

bash
# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
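
When these commands need to be driven from Python, building the argument list separately from running it keeps the call easy to inspect. The sketch below uses only the `-f`, `-l`, and `-layout` flags shown above; `pdftotext_args` and `run_pdftotext` are hypothetical helper names, and `pdftotext` is assumed to be on the PATH.

```python
import subprocess


def pdftotext_args(pdf_path, txt_path, first=None, last=None, layout=False):
    """Build the pdftotext argument list from the -f, -l and -layout options."""
    args = ["pdftotext"]
    if layout:
        args.append("-layout")
    if first is not None:
        args += ["-f", str(first)]
    if last is not None:
        args += ["-l", str(last)]
    return args + [pdf_path, txt_path]


def run_pdftotext(*args, **kwargs):
    """Run pdftotext and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(pdftotext_args(*args, **kwargs), check=True)
```

For example, `run_pdftotext("input.pdf", "output.txt", first=1, last=5)` corresponds to the "Pages 1-5" command above.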

qpdf

bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf

pdftk (if available)

bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf

Common Tasks

Extract Text from Scanned PDFs

python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)

Add Watermark

python
from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

Extract Images

bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# This writes images as output_prefix-000.jpg, output_prefix-001.jpg, etc.
# With -j, only JPEG-encoded images are saved as .jpg; other images are
# written as .ppm or .pbm files under the same prefix.
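
After extraction it can be useful to check what was actually written, since the output may mix file types. The sketch below counts output files by extension; `count_by_suffix` is a hypothetical helper, and the demo creates empty stand-in files in a temporary directory rather than running `pdfimages`.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_by_suffix(directory, prefix):
    """Count files named <prefix>-NNN.<ext> in directory, grouped by extension."""
    files = sorted(Path(directory).glob(f"{prefix}-*.*"))
    return Counter(path.suffix for path in files)


# Demo with empty stand-in files in place of real pdfimages output.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("output_prefix-000.jpg", "output_prefix-001.ppm", "notes.txt"):
        (Path(tmp) / name).touch()
    demo_counts = count_by_suffix(tmp, "output_prefix")

print(dict(demo_counts))
```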

Password Protection

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
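
The example above hard-codes both passwords for brevity. In a real script they would more likely come from the environment or a secrets store. The following is one possible sketch: `passwords_from_env` is a hypothetical helper, and `PDF_USER_PASSWORD` and `PDF_OWNER_PASSWORD` are made-up variable names.

```python
import os


def passwords_from_env(env=None):
    """Read the user and owner passwords from environment variables.

    PDF_USER_PASSWORD and PDF_OWNER_PASSWORD are hypothetical variable names.
    """
    env = os.environ if env is None else env
    user = env.get("PDF_USER_PASSWORD")
    if not user:
        raise ValueError("PDF_USER_PASSWORD is not set")
    # Fall back to the user password when no separate owner password is given.
    owner = env.get("PDF_OWNER_PASSWORD") or user
    return user, owner


# Illustration with a plain dict standing in for os.environ.
user_pw, owner_pw = passwords_from_env({"PDF_USER_PASSWORD": "s3cret"})
```

The returned pair can then be passed as `writer.encrypt(user_pw, owner_pw)`.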

Quick Reference

| Task | Best Tool | Command/Code |
|------|-----------|--------------|
| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | qpdf --empty --pages ... |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pdf-lib or pypdf (see forms.md) | See forms.md |

Next Steps

  • For advanced pypdfium2 usage, see reference.md
  • For JavaScript libraries (pdf-lib), see reference.md
  • If you need to fill out a PDF form, follow the instructions in forms.md
  • For troubleshooting guides, see reference.md

When to Use

Use this skill when you need to carry out the workflows or actions described in the Overview.