pdf-offline
PDF Processing Guide
Overview
This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.
Quick CLI (legacy doc_utils)
This skill bundles a simple CLI (`doc_utils.py`) copied from the older `pdf` skill, so you can do common PDF operations quickly.
```bash
# Read PDF → JSON
python3 agent/skills/pdf-offline/doc_utils.py read path/to/file.pdf

# Merge PDFs
python3 agent/skills/pdf-offline/doc_utils.py merge merged.pdf a.pdf b.pdf
```
Optional deps install helper (Python-only, no system packages):
```bash
bash agent/skills/pdf-offline/install.sh
```
Quick Start
```python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
```
Python Libraries
pypdf - Basic Operations
Merge PDFs
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```
Split PDF
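```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    # Each page gets its own writer, so each output file holds a single page
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```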
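The loop above writes one file per page. When a selection of pages should instead go into a single output file, a small range parser can help. The following is a minimal sketch, not part of the original skill: `parse_page_ranges` and `extract_pages` are hypothetical helper names, and the `"1-3,5"` range syntax is an assumption chosen for illustration.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based spec such as "1-3,5" into zero-based page indices.

    The spec syntax is an assumption made for this sketch.
    """
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, _, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"range {part!r} is outside 1-{page_count}")
        indices.extend(range(start - 1, end))
    return indices


def extract_pages(src: str, dst: str, spec: str) -> None:
    """Copy the selected pages of src into a new PDF at dst (hypothetical helper)."""
    # Imported here so the parser above stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as output:
        writer.write(output)
```

Under these assumptions, `extract_pages("input.pdf", "pages1-5.pdf", "1-5")` would write the first five pages of `input.pdf` into one file.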
Extract Metadata
```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
```
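`reader.metadata` can be `None` when a PDF carries no document information, and individual fields can be unset as well. The sketch below shows one defensive way to read them. `metadata_summary` is a hypothetical helper name, and the `"(not set)"` fallback string is an arbitrary choice.

```python
def metadata_summary(meta, fields=("title", "author", "subject", "creator")):
    """Return a dict of metadata fields, tolerating missing values.

    `meta` is expected to be the value of `PdfReader(...).metadata`,
    which may be None; any object with matching attributes also works.
    """
    summary = {}
    for name in fields:
        value = getattr(meta, name, None) if meta is not None else None
        summary[name] = value if value else "(not set)"
    return summary
```

For example, `print(metadata_summary(PdfReader("document.pdf").metadata))` prints every field without raising when the metadata is absent.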
Rotate Pages
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```
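`rotate` expects a multiple of 90 degrees, with positive values turning the page clockwise. If angles come from user input, it may help to validate and normalize them first. This is a small sketch; `normalize_rotation` is a hypothetical helper, not a pypdf function.

```python
def normalize_rotation(angle: int) -> int:
    """Map any multiple of 90 onto 0, 90, 180 or 270 (clockwise)."""
    if angle % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {angle}")
    return angle % 360
```

With it, `page.rotate(normalize_rotation(-90))` would request a 270 degree clockwise turn, which is equivalent to a quarter turn counterclockwise.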
pdfplumber - Text and Table Extraction
Extract Text with Layout
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```
Extract Tables
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```
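`extract_tables()` returns each table as a list of rows, and each row as a list of cell values in which empty cells may be `None`. The following sketch, which assumes the first row holds the column headers, turns such a table into a list of dictionaries using only the standard library. `table_to_records` and the `column_N` fallback names are hypothetical.

```python
def table_to_records(table):
    """Convert a pdfplumber-style table (list of rows) into a list of dicts.

    Assumes the first row is the header row; None cells become "".
    """
    if not table:
        return []

    def clean(cell):
        # Collapse the line breaks that wrapped cells often contain.
        return " ".join(str(cell).split()) if cell is not None else ""

    header = [clean(cell) or f"column_{i + 1}" for i, cell in enumerate(table[0])]
    return [dict(zip(header, (clean(cell) for cell in row))) for row in table[1:]]
```

Each record can then be written out with the `csv` or `json` modules when pandas is not available.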
Advanced Table Extraction
```python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Check if table is not empty
                # The first row is treated as the header row
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables (writing .xlsx requires an Excel engine such as openpyxl)
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
```
reportlab - Create PDFs
Basic PDF Creation
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
```
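The `height - 100` arithmetic above reflects reportlab's canvas coordinates: the origin is the bottom-left corner of the page and units are points (1/72 inch), so a letter page measures 612 by 792 points. A small conversion helper can make top-relative placement easier to read. This is only a sketch; `from_top` and the constant names are hypothetical and not part of reportlab.

```python
POINTS_PER_INCH = 72.0
LETTER_SIZE = (8.5 * POINTS_PER_INCH, 11 * POINTS_PER_INCH)  # (612.0, 792.0)


def from_top(inches_from_top: float, page_height: float = LETTER_SIZE[1]) -> float:
    """Return the canvas y coordinate for a distance measured down from the top edge."""
    y = page_height - inches_from_top * POINTS_PER_INCH
    if y < 0:
        raise ValueError("position falls below the bottom of the page")
    return y
```

A call such as `c.drawString(72, from_top(1.5), "Hello World!")` would then place the text 1.5 inches below the top edge and 1 inch from the left.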
Create PDF with Multiple Pages
```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build PDF
doc.build(story)
```
Command-Line Tools
pdftotext (poppler-utils)
```bash
# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
```
qpdf
```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```
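When these qpdf commands need to run from a Python script, building the argument list explicitly avoids shell quoting problems with unusual file names. The sketch below reproduces the split command shown above. `qpdf_range_args` and `split_range` are hypothetical helper names, and the code assumes `qpdf` is available on the `PATH`.

```python
import subprocess


def qpdf_range_args(src, first, last, dst):
    """Build the argument list for: qpdf SRC --pages . FIRST-LAST -- DST"""
    if first < 1 or last < first:
        raise ValueError("page range must satisfy 1 <= first <= last")
    return ["qpdf", str(src), "--pages", ".", f"{first}-{last}", "--", str(dst)]


def split_range(src, first, last, dst):
    """Run qpdf and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(qpdf_range_args(src, first, last, dst), check=True)
```

Note that `check=True` is a strict choice: qpdf also reports warnings through a non-zero exit status, so a caller may prefer to inspect the return code instead.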
pdftk (if available)
```bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
```
Common Tasks
Extract Text from Scanned PDFs
```python
# Requires: pip install pytesseract pdf2image
# (plus the Tesseract and poppler system binaries these packages call)
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
```
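OCR is slow compared with direct text extraction, so it can be worth checking first whether a PDF already has a text layer. One possible heuristic, sketched below, treats a document as scanned when the text extracted per page (for example with `page.extract_text()`) falls under a threshold. `needs_ocr` and the default of 20 characters are assumptions for illustration, not values from this guide.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is scanned from its already-extracted page texts.

    page_texts: iterable of strings (or None), one per page.
    Returns True when the average non-whitespace character count per page
    is below min_chars_per_page. The threshold is an arbitrary assumption.
    """
    texts = list(page_texts)
    if not texts:
        return True
    total = sum(len("".join((text or "").split())) for text in texts)
    return total / len(texts) < min_chars_per_page
```

A caller could pass `(page.extract_text() for page in reader.pages)` and fall back to the OCR loop above only when the function returns `True`.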
Add Watermark
```python
from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```
Extract Images
```bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# This extracts all images as output_prefix-000.jpg, output_prefix-001.jpg, etc.
# (with -j, only images stored as JPEG are saved as .jpg; others are written as .ppm or .pbm)
```
Password Protection
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```
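Reading a protected file back follows the reverse path: pypdf readers expose `is_encrypted` and `decrypt(password)`. The sketch below wraps that check. `unlock` is a hypothetical helper name, and it relies on `decrypt` returning a falsy value when the password does not match, which is an assumption worth verifying against the installed pypdf version.

```python
def unlock(reader, password):
    """Decrypt a pypdf-style reader in place if needed and return it.

    Works with any object exposing `is_encrypted` and `decrypt(password)`.
    """
    if reader.is_encrypted and not reader.decrypt(password):
        raise ValueError("wrong password or unsupported encryption")
    return reader
```

Under that assumption, `unlock(PdfReader("encrypted.pdf"), "userpassword")` would return a reader whose pages can be accessed as in the earlier examples.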
Quick Reference
| Task | Best Tool | Command/Code |
|---|---|---|
| Merge PDFs | pypdf | `writer.add_page(page)` |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | `page.extract_text()` |
| Extract tables | pdfplumber | `page.extract_tables()` |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | `qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf` |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pdf-lib or pypdf (see FORMS.md) | See FORMS.md |
Next Steps
- For advanced pypdfium2 usage, see REFERENCE.md
- For JavaScript libraries (pdf-lib), see REFERENCE.md
- If you need to fill out a PDF form, follow the instructions in FORMS.md
- For troubleshooting guides, see REFERENCE.md