ask-pdf-processing

<critical_constraints>
❌ NO arbitrary file writes → use provided scripts only
❌ NO loading huge PDFs into memory → process in chunks
❌ NO overwriting originals → back up first
✅ MUST use context managers (`with` statements)
✅ MUST validate PDFs before processing
✅ MUST handle encrypted PDFs with a password
</critical_constraints>

<dependencies>
pip install pypdf pdfplumber
</dependencies>

<operations>

<操作方法>

Text Extraction (pdfplumber)

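```python
import pdfplumber

# Open with a context manager so the file handle is always released.
with pdfplumber.open("doc.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text() or ""  # can be empty for image-only pages
        tables = page.extract_tables()    # list of tables, each a list of rows
```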
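When `extract_text()` comes back empty for most pages, the PDF is probably image-based and needs OCR. As a rough sketch, the hypothetical helper `likely_scanned` applies that check to a list of per-page strings; the `min_chars` and `ratio` thresholds are illustrative assumptions, not values taken from pdfplumber.

```python
def likely_scanned(page_texts, min_chars=20, ratio=0.8):
    """Guess whether a PDF is image-based from its per-page extracted text.

    page_texts: one string (or None) per page, e.g. from page.extract_text().
    Returns True when at least `ratio` of the pages contain fewer than
    `min_chars` non-whitespace characters.
    """
    if not page_texts:
        return True  # nothing was extracted at all
    sparse = sum(
        1
        for text in page_texts
        if len("".join((text or "").split())) < min_chars
    )
    return sparse / len(page_texts) >= ratio
```

Collecting `page.extract_text()` for each page into a list and passing it to `likely_scanned` gives a quick signal for recommending OCR before any further processing.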

Form Filling (pypdf)

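```python
from pypdf import PdfReader, PdfWriter

writer = PdfWriter()
writer.append(PdfReader("template.pdf"))
# Field names must match those reported by get_fields().
writer.update_page_form_field_values(writer.pages[0], {"name": "John"})
# Write through a context manager so the output handle is closed.
with open("filled.pdf", "wb") as f:
    writer.write(f)
```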

Discover Fields

```python
from pypdf import PdfReader
fields = PdfReader("form.pdf").get_fields()  # None if the PDF has no form fields
```
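A mismatched field name is a common reason a form appears not to fill. One way to catch this early is to compare the names you intend to set against the result of `get_fields()`. The helper below is a hypothetical sketch (the name `check_field_names` and the use of `difflib` for suggestions are assumptions, not part of pypdf).

```python
import difflib


def check_field_names(available, values):
    """Map each key of `values` that is not a known field to close matches.

    available: the dict returned by PdfReader.get_fields(), or None.
    values: the {field_name: value} dict intended for form filling.
    An empty result means every name exists in the form.
    """
    known = list(available or {})
    return {
        name: difflib.get_close_matches(name, known, n=3)
        for name in values
        if name not in known
    }
```

For example, `check_field_names(fields, {"name": "John"})` returns an empty dict when `"name"` is a real field, and otherwise lists up to three similarly spelled field names to try.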

Merge PDFs

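```python
from pypdf import PdfWriter

writer = PdfWriter()
for pdf in ["a.pdf", "b.pdf"]:
    writer.append(pdf)  # append() accepts a path, file object, or PdfReader
with open("merged.pdf", "wb") as f:
    writer.write(f)
```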

</operations>

<troubleshooting>
| Issue | Solution |
|-------|----------|
| No text extracted | Image-based PDF → use OCR (pytesseract) |
| Fields not filling | Check names with get_fields() |
| Large output | Use writer.compress_identical_objects() |
</troubleshooting>

<heuristics>
- Scanned document → recommend OCR instead
- Form fields unknown → run get_fields() first
- Many PDFs → batch process with chunks
</heuristics>
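For the "many PDFs" heuristic, batching can be as simple as slicing the list of paths into fixed-size groups so that only one group is open at a time. The generator below is a generic sketch; `chunked` is a hypothetical helper name and the batch size is whatever fits the available memory.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls the next `size` items; an empty list ends the loop.
    while chunk := list(islice(iterator, size)):
        yield chunk
```

A batch run might then loop over `chunked(sorted(paths), 25)` and process each group of files before moving on to the next.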

</操作方法>
