pdf

PDF Processing Skill

You now have expertise in PDF manipulation. Follow these workflows:

Reading PDFs

**Option 1: Quick text extraction (preferred)**
```bash
# Using pdftotext (poppler-utils)
pdftotext input.pdf -           # Output to stdout
pdftotext input.pdf output.txt  # Output to file

# If pdftotext is not available, try:
python3 -c "
import fitz  # PyMuPDF
doc = fitz.open('input.pdf')
for page in doc:
    print(page.get_text())
"
```
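
The fallback described above (use pdftotext when it is present, otherwise PyMuPDF) can also be driven from Python. What follows is a minimal sketch rather than a definitive implementation: the function names `pick_text_extractor` and `extract_text` are hypothetical, and the code assumes `pdftotext` emits UTF-8, which is its default output encoding.

```python
import shutil
import subprocess


def pick_text_extractor(which=shutil.which):
    """Return "pdftotext" if the CLI is on PATH, otherwise "pymupdf".

    `which` is injectable only so the choice can be exercised without
    touching the real PATH (an assumption of this sketch).
    """
    return "pdftotext" if which("pdftotext") else "pymupdf"


def extract_text(pdf_path, which=shutil.which):
    """Extract all text from pdf_path with whichever tool is available."""
    if pick_text_extractor(which) == "pdftotext":
        # "-" sends the extracted text to stdout instead of a file.
        result = subprocess.run(
            ["pdftotext", pdf_path, "-"],
            capture_output=True,
            check=True,
        )
        # Decode defensively: replace bytes that are not valid UTF-8.
        return result.stdout.decode("utf-8", errors="replace")

    import fitz  # PyMuPDF, imported lazily so it is only needed as a fallback

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)
```

Calling `extract_text("input.pdf")` then works on machines with either tool installed, and raises `subprocess.CalledProcessError` if pdftotext itself fails.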

**Option 2: Page-by-page with metadata**
```python
import fitz  # pip install pymupdf

doc = fitz.open("input.pdf")
print(f"Pages: {len(doc)}")
print(f"Metadata: {doc.metadata}")

for i, page in enumerate(doc):
    text = page.get_text()
    print(f"--- Page {i+1} ---")
    print(text)
```

Creating PDFs

**Option 1: From Markdown (recommended)**
```bash
# Using pandoc
pandoc input.md -o output.pdf

# With custom styling
pandoc input.md -o output.pdf --pdf-engine=xelatex -V geometry:margin=1in
```
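
When the pandoc conversion has to run from a script, building the argument list explicitly avoids shell quoting problems. The sketch below only uses the flags shown above (`-o`, `--pdf-engine`, `-V geometry:margin=...`); the helper names `build_pandoc_command` and `markdown_to_pdf` are hypothetical.

```python
import subprocess


def build_pandoc_command(source, output, pdf_engine=None, margin=None):
    """Build the pandoc argument list for a Markdown-to-PDF conversion."""
    command = ["pandoc", source, "-o", output]
    if pdf_engine:
        command.append(f"--pdf-engine={pdf_engine}")
    if margin:
        command += ["-V", f"geometry:margin={margin}"]
    return command


def markdown_to_pdf(source, output, **options):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pandoc_command(source, output, **options), check=True)
```

For example, `markdown_to_pdf("input.md", "output.pdf", pdf_engine="xelatex", margin="1in")` runs the same command as the custom-styling line above.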

**Option 2: Programmatically**
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Hello, PDF!")
c.save()
```

**Option 3: From HTML**
```bash
# Using wkhtmltopdf
wkhtmltopdf input.html output.pdf

# Or with Python
python3 -c "
import pdfkit
pdfkit.from_file('input.html', 'output.pdf')
"
```
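
The ReportLab example above draws a single string at a fixed position. For more than one line it helps to know that ReportLab measures in points from the bottom-left corner of the page, so `y` must decrease for each new line. The following is a rough sketch under assumed layout values (`top=750`, `bottom=50`, `leading=14` are illustrative, and `layout_lines` and `write_text_pdf` are hypothetical names); it does no word wrapping.

```python
def layout_lines(lines, top=750, bottom=50, leading=14):
    """Assign each line a (page_index, y, text) position, top to bottom.

    ReportLab measures in points from the bottom-left corner, so y
    decreases as text moves down the page.
    """
    placed = []
    page, y = 0, top
    for text in lines:
        if y < bottom:  # no room left: continue at the top of a new page
            page, y = page + 1, top
        placed.append((page, y, text))
        y -= leading
    return placed


def write_text_pdf(lines, path, left=100):
    """Write lines to a letter-size PDF, adding pages as needed."""
    from reportlab.lib.pagesizes import letter  # lazy: requires reportlab
    from reportlab.pdfgen import canvas

    c = canvas.Canvas(path, pagesize=letter)
    current_page = 0
    for page, y, text in layout_lines(lines):
        if page != current_page:
            c.showPage()  # finish the current page and start the next
            current_page = page
        c.drawString(left, y, text)
    c.save()
```

Keeping the position arithmetic in `layout_lines` separate from the drawing calls makes the page-break logic easy to check without generating a file.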

Merging PDFs

```python
import fitz  # PyMuPDF

result = fitz.open()  # new, empty PDF
for pdf_path in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    doc = fitz.open(pdf_path)
    result.insert_pdf(doc)  # append every page of doc
    doc.close()
result.save("merged.pdf")
```
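
A merge that fails halfway through wastes the work already done, so it can be worth checking the inputs before opening any of them. This is a minimal sketch of that idea, assuming the inputs are local file paths; `missing_inputs` and `merge_pdfs` are hypothetical helper names, while `insert_pdf` and `save` are the same PyMuPDF calls used above.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the input paths that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).is_file()]


def merge_pdfs(paths, output):
    """Merge PDFs in order, failing early if any input is missing."""
    missing = missing_inputs(paths)
    if missing:
        raise FileNotFoundError(f"Missing input PDFs: {', '.join(missing)}")

    import fitz  # PyMuPDF, imported lazily

    result = fitz.open()  # new, empty PDF
    for path in paths:
        with fitz.open(path) as doc:
            result.insert_pdf(doc)  # append every page of doc
    result.save(output)
    result.close()
```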

Splitting PDFs

```python
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
for i in range(len(doc)):
    single = fitz.open()  # new, empty PDF for this page
    single.insert_pdf(doc, from_page=i, to_page=i)  # copy page i only
    single.save(f"page_{i+1}.pdf")
    single.close()
```
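
The same `from_page` / `to_page` arguments also allow splitting into chunks of several pages rather than single pages. Both bounds are zero-based and inclusive, which is easy to get wrong by one, so the sketch below isolates that arithmetic in a small helper. The names `page_ranges` and `split_pdf` and the `part_N.pdf` naming scheme are assumptions made for illustration.

```python
def page_ranges(total_pages, chunk_size):
    """Return inclusive, zero-based (from_page, to_page) pairs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(path, chunk_size, prefix="part"):
    """Write consecutive chunks of `chunk_size` pages to prefix_N.pdf."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(path) as doc:
        ranges = page_ranges(len(doc), chunk_size)
        for n, (first, last) in enumerate(ranges, start=1):
            part = fitz.open()  # new, empty PDF for this chunk
            part.insert_pdf(doc, from_page=first, to_page=last)
            part.save(f"{prefix}_{n}.pdf")
            part.close()
```

With `chunk_size=1` this reduces to the page-per-file split shown above.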

Key Libraries

| Task | Library | Install |
|------|---------|---------|
| Read/Write/Merge | PyMuPDF | `pip install pymupdf` |
| Create from scratch | ReportLab | `pip install reportlab` |
| HTML to PDF | pdfkit | `pip install pdfkit` + wkhtmltopdf |
| Text extraction | pdftotext | `brew install poppler` / `apt install poppler-utils` |
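
Before picking a workflow it can help to see which of these tools are actually present. The sketch below checks command-line tools with `shutil.which` and Python packages with `importlib.util.find_spec`; the function name `available_tools` is hypothetical, and the module names are the import names (`fitz` for PyMuPDF), which differ from the pip package names in the table.

```python
import importlib.util
import shutil

CLI_TOOLS = ["pdftotext", "pandoc", "wkhtmltopdf"]
PYTHON_MODULES = ["fitz", "reportlab", "pdfkit"]  # fitz is PyMuPDF's import name


def available_tools():
    """Map each tool or module name to True if it can be found."""
    status = {name: shutil.which(name) is not None for name in CLI_TOOLS}
    for module in PYTHON_MODULES:
        # find_spec locates the module without importing it.
        status[module] = importlib.util.find_spec(module) is not None
    return status
```

Printing `available_tools()` gives a quick overview; anything mapped to `False` needs the install command from the table.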

Best Practices

  1. Always check if tools are installed before using them
  2. Handle encoding issues - PDFs may contain various character encodings
  3. Large PDFs: Process page by page to avoid memory issues
  4. OCR for scanned PDFs: Use `pytesseract` if text extraction returns empty
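
The last two practices combine naturally: handle one page at a time, and fall back to OCR only for pages whose text layer is empty. The following is a hedged sketch, not a definitive implementation. It assumes Pillow and pytesseract are installed along with the Tesseract binary, and a PyMuPDF version recent enough for `get_pixmap(dpi=...)`; `needs_ocr`, `page_text_with_ocr`, and the 300 dpi default are illustrative choices.

```python
import io


def needs_ocr(text, min_chars=1):
    """Treat a page as scanned when extraction yields (almost) no text."""
    return len(text.strip()) < min_chars


def page_text_with_ocr(page, dpi=300):
    """Return the page's text layer, or OCR output if the layer is empty.

    `page` is a PyMuPDF page, e.g. one yielded by iterating fitz.open(...).
    """
    text = page.get_text()
    if not needs_ocr(text):
        return text

    # Lazy imports: Pillow and pytesseract (plus the Tesseract binary)
    # are only required for scanned pages.
    from PIL import Image
    import pytesseract

    pixmap = page.get_pixmap(dpi=dpi)  # render the page to an image
    image = Image.open(io.BytesIO(pixmap.tobytes("png")))
    return pytesseract.image_to_string(image)
```

Calling `page_text_with_ocr(page)` inside a `for page in doc:` loop keeps only one rendered page in memory at a time.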