ask-pdf-processing
<critical_constraints>
❌ NO arbitrary file writes → use provided scripts only
❌ NO loading huge PDFs into memory → process in chunks
❌ NO overwriting originals → backup first
✅ MUST use context managers (`with` statements)
✅ MUST validate PDFs before processing
✅ MUST handle encrypted PDFs with password
</critical_constraints>
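The validation and backup constraints above could be met with two small helpers. This is a minimal sketch using only the standard library; the names `looks_like_pdf` and `backup_original` and the `.bak` suffix are illustrative assumptions, not part of any provided script. The header check only confirms the file starts with the `%PDF-` marker, so it catches obviously wrong inputs but does not prove the file is well formed.

```python
import shutil
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Cheap sanity check: the file is readable and starts with %PDF-."""
    try:
        with path.open("rb") as fh:
            return fh.read(5) == b"%PDF-"
    except OSError:
        # Missing file, directory, or permission problem.
        return False


def backup_original(path: Path) -> Path:
    """Copy the file next to itself with a .bak suffix and return the copy."""
    backup = path.with_name(path.name + ".bak")
    if backup.exists():
        # Never clobber an earlier backup silently.
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.copy2(path, backup)
    return backup
```

Calling `looks_like_pdf` first and `backup_original` second, before any script modifies a file, keeps the original recoverable.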
<dependencies>
pip install pypdf pdfplumber
</dependencies>
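With pypdf installed, the encrypted-PDF constraint might be handled along these lines. The helper name `unlock` is a hypothetical choice; it relies on the `is_encrypted` attribute and `decrypt()` method of a pypdf `PdfReader`, and assumes `decrypt()` returns a falsy value when the password does not match.

```python
def unlock(reader, password=None):
    """Return the reader ready for use, decrypting it first if needed."""
    if not reader.is_encrypted:
        return reader
    if password is None:
        raise ValueError("PDF is encrypted and no password was supplied")
    # Assumption: decrypt() yields a falsy result for a wrong password.
    if not reader.decrypt(password):
        raise ValueError("password did not decrypt the PDF")
    return reader


# Possible usage (file name and password are placeholders):
#   from pypdf import PdfReader
#   reader = unlock(PdfReader("secret.pdf"), password="s3cret")
```

Raising early keeps later steps from failing with less obvious errors when they touch pages of a still-locked document.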
<operations>
Text Extraction (pdfplumber)
```python
import pdfplumber

# The context manager closes the file even if extraction fails.
with pdfplumber.open("doc.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()      # text content of the page
        tables = page.extract_tables()  # tables found on the page
```
Form Filling (pypdf)
```python
from pypdf import PdfReader, PdfWriter

writer = PdfWriter()
writer.append(PdfReader("template.pdf"))
writer.update_page_form_field_values(writer.pages[0], {"name": "John"})
# Write through a context manager so the output handle is closed.
with open("filled.pdf", "wb") as f:
    writer.write(f)
```
Discover Fields
```python
from pypdf import PdfReader
fields = PdfReader("form.pdf").get_fields()  # field names mapped to field data
```
Merge PDFs
```python
from pypdf import PdfWriter

writer = PdfWriter()
for pdf in ["a.pdf", "b.pdf"]:
    writer.append(pdf)
with open("merged.pdf", "wb") as f:
    writer.write(f)
```
</operations>
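The merge operation can be combined with the constraints on validation and not overwriting originals. The sketch below is one possible arrangement, not a provided script: `merge_pdfs` is a hypothetical name, the `%PDF-` header test is only a cheap validity check, and pypdf is imported late so the safety checks run before any PDF is opened.

```python
from pathlib import Path


def merge_pdfs(inputs, output):
    """Merge the input PDFs into a new file, refusing to touch the inputs."""
    sources = [Path(p) for p in inputs]
    target = Path(output)
    if target.resolve() in {s.resolve() for s in sources}:
        raise ValueError(f"output would overwrite an input: {target}")
    for src in sources:
        # Cheap validation: a PDF starts with the %PDF- marker.
        with src.open("rb") as fh:
            if fh.read(5) != b"%PDF-":
                raise ValueError(f"not a PDF: {src}")

    from pypdf import PdfWriter  # imported late so the checks above run first

    writer = PdfWriter()
    for src in sources:
        writer.append(str(src))
    with target.open("wb") as fh:
        writer.write(fh)
    writer.close()
    return target
```

A call such as `merge_pdfs(["a.pdf", "b.pdf"], "merged.pdf")` then fails fast on a bad input instead of producing a partial output file.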
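<troubleshooting>
| Problem | Solution |
|---|---|
| Text cannot be extracted | Image-based PDF → use OCR (pytesseract) |
| Form fields will not fill | Check field names with get_fields() |
| Output file is too large | Use writer.compress_identical_objects() |
</troubleshooting>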
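For the form-field row in the table above, a quick comparison between the names you intend to fill and the names the form actually defines often reveals the mismatch. This is a small sketch; `unknown_field_names` is a hypothetical helper, and it assumes the mapping passed in has the shape returned by `get_fields()` (field names as keys, or `None` when the PDF has no form).

```python
def unknown_field_names(available, values):
    """Return the keys of `values` that the form does not define, sorted."""
    known = set(available or {})  # tolerate None from a PDF without a form
    return sorted(name for name in values if name not in known)


# Possible usage (file name is a placeholder):
#   from pypdf import PdfReader
#   fields = PdfReader("form.pdf").get_fields()
#   print(unknown_field_names(fields, {"name": "John"}))
```

An empty result means every name you plan to fill exists in the form; anything listed is likely a typo or a differently qualified field name.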
<heuristics>
- Scanned documents → use OCR instead
- Unknown form fields → run get_fields() first
- Large numbers of PDFs → process in batches
</heuristics>
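Batch processing, as suggested in the last rule above, can be sketched with a small generator that hands out file paths a fixed number at a time, so only one batch is in flight at once. The name `batched_paths`, the batch size, and the `inbox` directory in the usage comment are illustrative assumptions.

```python
from itertools import islice


def batched_paths(paths, size):
    """Yield lists of at most `size` items from `paths`, in order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(paths)
    # islice pulls the next `size` items; an empty list ends the loop.
    while batch := list(islice(it, size)):
        yield batch


# Possible usage (directory name is a placeholder):
#   from pathlib import Path
#   for batch in batched_paths(sorted(Path("inbox").glob("*.pdf")), 20):
#       for pdf_path in batch:
#           ...  # open, process, and close each PDF before moving on
```

Because the generator is lazy, it also works with very long or streamed lists of paths without building every batch up front.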