PDF Processing Skill
You now have expertise in PDF manipulation. Follow these workflows:
Reading PDFs
**Option 1: Quick text extraction (preferred)**
```bash
# Using pdftotext (poppler-utils)
pdftotext input.pdf -            # Output to stdout
pdftotext input.pdf output.txt   # Output to file

# If pdftotext is not available, try:
python3 -c "
import fitz  # PyMuPDF
doc = fitz.open('input.pdf')
for page in doc:
    print(page.get_text())
"
```
**Option 2: Page-by-page with metadata**
```python
import fitz  # pip install pymupdf
doc = fitz.open("input.pdf")
print(f"Pages: {len(doc)}")
print(f"Metadata: {doc.metadata}")
for i, page in enumerate(doc):
    text = page.get_text()
    print(f"--- Page {i+1} ---")
    print(text)
```
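The preference order described under Reading PDFs (pdftotext first, PyMuPDF as the fallback) can be sketched as a single helper. This is a minimal illustration, not part of the original skill: the function names `available_backend` and `extract_text` are hypothetical, and the UTF-8 decoding assumes pdftotext's default output encoding.

```python
import importlib.util
import shutil
import subprocess


def available_backend():
    """Return 'pdftotext', 'pymupdf', or None, in order of preference."""
    if shutil.which("pdftotext"):
        return "pdftotext"
    if importlib.util.find_spec("fitz") is not None:
        return "pymupdf"
    return None


def extract_text(pdf_path):
    """Extract all text from pdf_path with whichever backend is installed."""
    backend = available_backend()
    if backend == "pdftotext":
        # "-" sends the extracted text to stdout instead of a file.
        # Assumption: output is UTF-8; undecodable bytes are replaced.
        result = subprocess.run(
            ["pdftotext", pdf_path, "-"],
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            check=True,
        )
        return result.stdout
    if backend == "pymupdf":
        import fitz  # PyMuPDF, imported lazily so this module loads without it

        with fitz.open(pdf_path) as doc:
            return "".join(page.get_text() for page in doc)
    raise RuntimeError("Neither pdftotext nor PyMuPDF is installed")
```

Calling `extract_text("input.pdf")` returns the document text as one string, or raises `RuntimeError` when neither tool is present, which makes the missing dependency visible instead of silently returning nothing.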
Creating PDFs
**Option 1: From Markdown (recommended)**
```bash
# Using pandoc
pandoc input.md -o output.pdf

# With custom styling
pandoc input.md -o output.pdf --pdf-engine=xelatex -V geometry:margin=1in
```
**Option 2: Programmatically**
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Hello, PDF!")
c.save()
```
**Option 3: From HTML**
```bash
# Using wkhtmltopdf
wkhtmltopdf input.html output.pdf

# Or with Python
python3 -c "
import pdfkit
pdfkit.from_file('input.html', 'output.pdf')
"
```
Merging PDFs
```python
import fitz
result = fitz.open()
for pdf_path in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    doc = fitz.open(pdf_path)
    result.insert_pdf(doc)
result.save("merged.pdf")
```
Splitting PDFs
```python
import fitz
doc = fitz.open("input.pdf")
for i in range(len(doc)):
    single = fitz.open()
    single.insert_pdf(doc, from_page=i, to_page=i)
    single.save(f"page_{i+1}.pdf")
```
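The one-page-per-file loop above generalizes to fixed-size chunks. The following is a sketch under stated assumptions: `page_chunks` and `split_into_chunks` are hypothetical helper names, and the `part_N.pdf` output naming is illustrative. The page pairs are zero-based and inclusive, matching the `from_page`/`to_page` arguments used in the splitting example.

```python
def page_chunks(total_pages, chunk_size):
    """Return inclusive, zero-based (from_page, to_page) pairs covering all pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]


def split_into_chunks(pdf_path, chunk_size, prefix="part"):
    """Write prefix_1.pdf, prefix_2.pdf, ... with up to chunk_size pages each."""
    import fitz  # PyMuPDF, imported lazily so page_chunks works without it

    written = []
    with fitz.open(pdf_path) as doc:
        chunks = page_chunks(len(doc), chunk_size)
        for n, (first, last) in enumerate(chunks, start=1):
            part = fitz.open()  # new empty PDF
            part.insert_pdf(doc, from_page=first, to_page=last)
            out_path = f"{prefix}_{n}.pdf"
            part.save(out_path)
            part.close()
            written.append(out_path)
    return written


print(page_chunks(5, 2))  # [(0, 1), (2, 3), (4, 4)]
```

Keeping the range arithmetic in its own function means the off-by-one logic can be checked without opening any PDF.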
Key Libraries
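| Task | Library | Install |
|---|---|---|
| Read/Write/Merge | PyMuPDF | `pip install pymupdf` |
| Create from scratch | ReportLab | `pip install reportlab` |
| HTML to PDF | pdfkit | `pip install pdfkit` (also needs wkhtmltopdf) |
| Text extraction | pdftotext | poppler-utils package |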
Best Practices
- Always check that tools are installed before using them
- Handle encoding issues: PDFs may contain various character encodings
- Large PDFs: process page by page to avoid memory issues
- OCR for scanned PDFs: use `pytesseract` if text extraction returns empty text
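The last two practices (page-by-page processing and OCR when extraction comes back empty) can be combined in one generator. This is a rough sketch rather than a definitive implementation: `pages_needing_ocr` and `extract_with_ocr_fallback` are hypothetical names, the 300 dpi default is an arbitrary choice, and the code assumes PyMuPDF (a version whose `get_pixmap` accepts `dpi`), Pillow, pytesseract, and the tesseract binary are all installed.

```python
def pages_needing_ocr(page_texts, min_chars=1):
    """Return zero-based indexes of pages whose extracted text is effectively empty."""
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]


def extract_with_ocr_fallback(pdf_path, dpi=300):
    """Yield (page_number, text) one page at a time, using OCR for empty pages."""
    # Lazy imports so pages_needing_ocr can be used without these packages.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            text = page.get_text()
            if pages_needing_ocr([text]):
                # Render the page to an RGB bitmap and hand it to tesseract.
                pix = page.get_pixmap(dpi=dpi)
                image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                text = pytesseract.image_to_string(image)
            yield i + 1, text


print(pages_needing_ocr(["Hello", "  \n", ""]))  # [1, 2]
```

Because the function is a generator, only one page's text and bitmap are held in memory at a time, which is the point of the large-PDF advice.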