ask-pdf-processing

<critical_constraints>
❌ NO arbitrary file writes → use provided scripts only
❌ NO loading huge PDFs into memory → process in chunks
❌ NO overwriting originals → back up first
✅ MUST use context managers (`with` statements)
✅ MUST validate PDFs before processing
✅ MUST handle encrypted PDFs with a password
</critical_constraints>

<dependencies>
pip install pypdf pdfplumber
</dependencies>

<operations>
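
Two of the constraints above (validate before processing, back up before writing) can be sketched with the standard library alone. The helper names `looks_like_pdf` and `backup_file` are hypothetical, as is the `.bak` suffix. The header check is only a cheap first filter: it confirms that the file starts with the `%PDF-` marker, not that the file actually parses.

```python
import shutil
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # a well-formed PDF starts with this marker


def looks_like_pdf(path):
    """Cheap sanity check: the file exists and starts with the PDF marker."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:  # context manager, per the constraints
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


def backup_file(path):
    """Copy path to path + '.bak' (assumed suffix) and return the backup's Path.

    Refuses to replace an existing backup so an earlier copy is never lost.
    """
    src = Path(path)
    dst = src.with_name(src.name + ".bak")
    if dst.exists():
        raise FileExistsError(f"backup already exists: {dst}")
    shutil.copy2(src, dst)  # copy2 also preserves timestamps
    return dst
```

A file that passes `looks_like_pdf` can still be corrupt or encrypted, so a full check would still need to open it with pypdf or pdfplumber.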
<操作方法>

Text Extraction (pdfplumber)

python
import pdfplumber

with pdfplumber.open("doc.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()      # may be empty for image-only pages
        tables = page.extract_tables()  # list of tables, each a list of rows
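
Each table from `extract_tables()` is a list of rows, and each row is a list of cell values in which an empty cell may be `None`. As a minimal sketch of one way to consume that shape, the hypothetical helper `table_to_csv` below serializes a single table to CSV text. The sample data is hand-written rather than taken from a real PDF.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows of cells) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table:
        # Empty cells may arrive as None; write them as empty strings.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buf.getvalue()


# Hand-written stand-in for one entry of page.extract_tables()
sample = [["Item", "Qty"], ["Widget", "2"], ["Gadget", None]]
print(table_to_csv(sample))
```

Writing one table at a time keeps memory use proportional to a single page's tables rather than to the whole document.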

Form Filling (pypdf)

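python
from pypdf import PdfReader, PdfWriter

writer = PdfWriter()
writer.append(PdfReader("template.pdf"))  # copy the template's pages
# Field names must match those reported by get_fields()
writer.update_page_form_field_values(writer.pages[0], {"name": "John"})
with open("filled.pdf", "wb") as f:  # context manager closes the file handle
    writer.write(f)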

Discover Fields

python
from pypdf import PdfReader
fields = PdfReader("form.pdf").get_fields()  # dict keyed by field name; None if the PDF has no form
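
Once the field names are known, it can help to check the values you intend to fill against them before writing anything. The sketch below uses a hypothetical helper, `split_by_known_fields`, and a hand-written list in place of real `get_fields()` output. It assumes an exact, case-sensitive name match.

```python
def split_by_known_fields(values, available_names):
    """Split values into (fillable, unknown) using the form's field names."""
    known = set(available_names)
    fillable = {k: v for k, v in values.items() if k in known}
    unknown = sorted(k for k in values if k not in known)
    return fillable, unknown


# In real use: available = (PdfReader("form.pdf").get_fields() or {}).keys()
available = ["name", "email"]  # stand-in for the keys of get_fields()
fillable, unknown = split_by_known_fields(
    {"name": "John", "Name ": "x"}, available
)
print(fillable)  # {'name': 'John'}
print(unknown)   # ['Name ']
```

Reporting the unknown names, instead of silently dropping them, makes a misspelled or differently cased field name easy to spot.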

Merge PDFs

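python
from pypdf import PdfWriter

writer = PdfWriter()
for pdf in ["a.pdf", "b.pdf"]:
    writer.append(pdf)  # append accepts a file path or a PdfReader
with open("merged.pdf", "wb") as f:  # context manager closes the file handle
    writer.write(f)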
</operations>

<troubleshooting>
| Issue | Solution |
|-------|----------|
| No text extracted | Image-based PDF → use OCR (pytesseract) |
| Fields not filling | Check names with get_fields() |
| Large output | Use writer.compress_identical_objects() |
</troubleshooting>

<heuristics>
- Scanned document → recommend OCR instead
- Form fields unknown → run get_fields() first
- Many PDFs → batch process with chunks
</heuristics>
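
The "batch process with chunks" heuristic can be sketched as a small generator that hands out file paths a few at a time. The name `chunked`, the chunk size, and the file names are all illustrative assumptions.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most size items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


paths = [f"doc_{i}.pdf" for i in range(5)]  # hypothetical file names
for chunk in chunked(paths, 2):
    print(chunk)  # process or merge this chunk, then move on to the next
```

Because the generator pulls items lazily, it also works with a directory iterator, so the full list of paths never needs to be held in memory.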
</操作方法>