document-pdf

Document PDF Skill — Quick Reference

This skill enables PDF creation, extraction, manipulation, and analysis. Claude should apply these patterns when users need to generate invoices, reports, extract data from PDFs, merge documents, or work with PDF forms.
Modern Best Practices (Jan 2026):
  • PDF is a release artifact, not the editable source of truth.
  • Validate export fidelity (fonts, images, links) and accessibility where required.
  • Accessibility: if compliance matters, target a tagged/structured PDF workflow (often PDF/UA-aligned) and validate with tooling.
  • EU distribution: the European Accessibility Act (EAA, applicable since June 2025) typically implies EN 301 549 expectations for customer-facing PDFs.
  • Treat PDFs as sensitive: scrub metadata, ensure real redaction, and control distribution.

Core Decision Rules (2026)

  • First decide: born-digital PDF (selectable text) vs scanned PDF (images). Scanned PDFs usually require OCR; see references/pdf-extraction-patterns.md.
  • If the user needs accessibility/compliance, prefer generating from a source format that supports structure (DOCX/HTML + proper export) rather than “post-fixing” an untagged PDF.
  • For deterministic ops (merge/split/rotate/scrub), prefer scripts/ helpers over re-implementing ad hoc.
  • Never treat black rectangles or overlays as redaction; use real redaction and verify by copy/paste + search.
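
The first rule above (born-digital vs scanned) can be sketched as a check on how much text each page yields. This is a minimal sketch, not the skill's own implementation: the function names and the `min_chars` threshold of 20 are assumptions chosen for illustration, and `pdfplumber` is a third-party dependency that is imported only when a file is actually read.

```python
from typing import Iterable, List, Optional


def classify_pdf(page_texts: Iterable[Optional[str]], min_chars: int = 20) -> str:
    """Classify a PDF from the text extracted from each page.

    Returns "born-digital" if every page yields at least `min_chars`
    non-whitespace characters, "scanned" if no page does, else "mixed".
    The threshold is an illustrative assumption, not a standard value.
    """
    flags = [len("".join((text or "").split())) >= min_chars for text in page_texts]
    if not flags or not any(flags):
        return "scanned"  # nothing extractable: OCR is the likely next step
    if all(flags):
        return "born-digital"
    return "mixed"


def extract_page_texts(path: str) -> List[str]:
    """Read per-page text with pdfplumber (imported lazily)."""
    import pdfplumber  # third-party; only needed when reading a real file

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]
```

A typical call would be `classify_pdf(extract_page_texts("in.pdf"))`; a "mixed" result suggests running OCR only on the pages that yielded no text.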

Quick Reference

| Task | Tool/Library | Language | When to Use |
|------|--------------|----------|-------------|
| Create PDF | pdfkit | Node.js | Reports, invoices, certificates |
| Create PDF | ReportLab | Python | Complex layouts, tables |
| Create PDF | FPDF2 | Python | Simple PDFs with Unicode support |
| Create PDF | Borb | Python | Interactive elements, pure Python |
| Edit PDF | pdf-lib | Node.js | Modify existing PDFs, add pages |
| Extract text | pdfplumber | Python | OCR-free text extraction |
| OCR scanned PDF | PyMuPDF + Tesseract | Python | Scanned PDFs (no selectable text) |
| Extract tables | Camelot | Python | Tables with borders (Lattice mode) |
| Extract tables | Camelot/Tabula | Python | Tables without borders (Stream mode) |
| Parse/merge/split/rotate | pypdf | Python | Deterministic PDF manipulation |
| Fill forms | pdf-lib | Node.js | Form automation |
| HTML to PDF | Puppeteer/Playwright | Node.js | High-fidelity web page rendering |
| HTML to PDF | WeasyPrint | Python | CSS3-based, no browser needed |
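
The Lattice/Stream distinction in the table rows for Camelot could look like the sketch below. The helper names and the `has_ruling_lines` flag are hypothetical; the caller is assumed to know (or to have inspected) whether the table has drawn borders. `camelot` is a third-party dependency and is imported lazily.

```python
from typing import List


def pick_camelot_flavor(has_ruling_lines: bool) -> str:
    """Map 'table has drawn borders' to Camelot's flavor name."""
    return "lattice" if has_ruling_lines else "stream"


def extract_tables(path: str, pages: str = "1", has_ruling_lines: bool = True) -> List:
    """Return one pandas DataFrame per table Camelot detects."""
    import camelot  # third-party; only needed when reading a real file

    tables = camelot.read_pdf(
        path, pages=pages, flavor=pick_camelot_flavor(has_ruling_lines)
    )
    return [table.df for table in tables]
```

If the wrong flavor is chosen, Camelot tends to return no tables or badly split columns, so retrying with the other flavor is a reasonable fallback.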

When to Use This Skill

Claude should invoke this skill when a user asks to:
  • Generate PDFs from data (invoices, reports, certificates)
  • Extract text or tables from existing PDFs
  • Merge multiple PDFs into one document
  • Split PDFs into separate files
  • Fill PDF forms programmatically
  • Add watermarks, headers, footers
  • Convert HTML/web pages to PDF

Default Workflow

  • Create: pick pdfkit (Node) or ReportLab (Python) and start from assets/invoice-template.md or assets/report-template.md; for advanced layouts use references/pdf-generation-patterns.md.
  • Extract: use references/pdf-extraction-patterns.md (text/tables/images/metadata + OCR fallback).
  • Ship: run assets/pdf-release-checklist.md (fidelity, links, accessibility baseline, privacy).

Scripts (Deterministic Operations)

Scripts are optional helpers; they assume Python 3 plus the dependencies listed in each file.
  • Merge: python3 scripts/merge_pdfs.py merged.pdf a.pdf b.pdf
  • Split: python3 scripts/split_pdf.py in.pdf out_dir --each-page
  • Rotate: python3 scripts/rotate_pdf.py in.pdf out.pdf --degrees 90
  • Scrub metadata: python3 scripts/scrub_metadata.py in.pdf out.pdf
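
The contents of scripts/merge_pdfs.py are not shown here, so the following is only a guess at what such a helper might do, using pypdf's `PdfWriter.append`. The function names and the argument validation are assumptions; the argument order (output first, then inputs) follows the Merge command above.

```python
from typing import List, Sequence, Tuple


def parse_merge_args(argv: Sequence[str]) -> Tuple[str, List[str]]:
    """Split `merged.pdf a.pdf b.pdf` into (output, inputs)."""
    if len(argv) < 3:
        raise ValueError("usage: OUTPUT INPUT INPUT [INPUT ...]")
    output, inputs = argv[0], list(argv[1:])
    if output in inputs:
        raise ValueError("output path must not also be an input")
    return output, inputs


def merge_pdfs(output: str, inputs: Sequence[str]) -> None:
    """Concatenate the input PDFs, in order, into one output file."""
    from pypdf import PdfWriter  # third-party; imported only when merging

    writer = PdfWriter()
    for path in inputs:
        writer.append(path)  # appends every page of `path`
    with open(output, "wb") as fh:
        writer.write(fh)
```

Wired to the command line, this would be called as `merge_pdfs(*parse_merge_args(sys.argv[1:]))`.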

PDF Structure Patterns

Invoice Template

INVOICE STRUCTURE
├── Header (logo, company info, invoice #)
├── Bill To / Ship To blocks
├── Line items table
│   ├── Description | Qty | Unit Price | Total
│   └── Subtotal, Tax, Total
├── Payment terms
└── Footer (contact, thank you)
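
The line items table in the structure above can be sketched with ReportLab's Platypus classes. This is an illustrative sketch, not the content of assets/invoice-template.md: the function names, the tuple layout of `items`, and the half-up rounding to cents are all assumptions. Amounts are handled as `Decimal` to avoid float rounding in totals, and ReportLab (third-party) is imported only when rendering.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import List, Sequence, Tuple

CENT = Decimal("0.01")


def build_invoice_rows(
    items: Sequence[Tuple[str, int, str]], tax_rate: str
) -> List[List[str]]:
    """Build table rows: header, one row per item, then Subtotal/Tax/Total.

    Each item is (description, quantity, unit price as a decimal string).
    """
    rows = [["Description", "Qty", "Unit Price", "Total"]]
    subtotal = Decimal("0")
    for description, qty, unit_price in items:
        price = Decimal(unit_price)
        line_total = (Decimal(qty) * price).quantize(CENT, rounding=ROUND_HALF_UP)
        subtotal += line_total
        rows.append([description, str(qty), f"{price:.2f}", f"{line_total:.2f}"])
    tax = (subtotal * Decimal(tax_rate)).quantize(CENT, rounding=ROUND_HALF_UP)
    rows.append(["", "", "Subtotal", f"{subtotal:.2f}"])
    rows.append(["", "", "Tax", f"{tax:.2f}"])
    rows.append(["", "", "Total", f"{subtotal + tax:.2f}"])
    return rows


def render_invoice(path: str, rows: List[List[str]]) -> None:
    """Write a bare-bones invoice PDF: a title followed by the table."""
    from reportlab.lib.pagesizes import A4
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer, Table

    styles = getSampleStyleSheet()
    story = [Paragraph("INVOICE", styles["Title"]), Spacer(1, 12), Table(rows)]
    SimpleDocTemplate(path, pagesize=A4).build(story)
```

Header, Bill To / Ship To blocks, payment terms, and footer would be added to `story` as further flowables in the same way.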

Report Template

REPORT PDF STRUCTURE
├── Cover page (title, author, date)
├── Table of contents
├── Body sections with page numbers
├── Charts/images with captions
├── Appendices
└── Running header/footer

Decision Tree

PDF Task: [What do you need?]
    ├─ Create new PDF?
    │   ├─ Simple text/tables → pdfkit (Node) or ReportLab (Python)
    │   ├─ Complex layouts → ReportLab with Platypus
    │   └─ From HTML → Puppeteer/Playwright (Node) or WeasyPrint (Python)
    ├─ Extract from PDF?
    │   ├─ Text only → pdfplumber (Python)
    │   ├─ Tables → pdfplumber or camelot (Python)
    │   └─ Images → PyMuPDF/fitz (Python)
    ├─ Modify existing PDF?
    │   ├─ Add text/images → pdf-lib (Node)
    │   ├─ Merge/split → pypdf or pdf-lib
    │   └─ Fill forms → pdf-lib
    └─ Batch processing?
        └─ pypdf + pdfplumber pipeline
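
For automation, the tree above can be encoded as a plain lookup table. The sketch below is one possible encoding: the key names ("create", "simple", "merge_split", and so on) and the `recommend` function are hypothetical, and the HTML branch is shortened to the headless-browser option only.

```python
from typing import Dict, Tuple

# (task, detail) -> recommendation, mirroring the branches of the tree.
DECISION_TREE: Dict[Tuple[str, str], str] = {
    ("create", "simple"): "pdfkit (Node) or ReportLab (Python)",
    ("create", "complex"): "ReportLab with Platypus",
    ("create", "html"): "Puppeteer (headless browser)",
    ("extract", "text"): "pdfplumber (Python)",
    ("extract", "tables"): "pdfplumber or camelot (Python)",
    ("extract", "images"): "PyMuPDF/fitz (Python)",
    ("modify", "add"): "pdf-lib (Node)",
    ("modify", "merge_split"): "pypdf or pdf-lib",
    ("modify", "forms"): "pdf-lib",
    ("batch", ""): "pypdf + pdfplumber pipeline",
}


def recommend(task: str, detail: str = "") -> str:
    """Look up the recommended tool for a task, case-insensitively."""
    key = (task.strip().lower(), detail.strip().lower())
    try:
        return DECISION_TREE[key]
    except KeyError:
        raise ValueError(f"no branch for task={task!r}, detail={detail!r}") from None
```

Raising on an unknown branch, rather than returning a default, keeps an unrecognized request from silently picking the wrong library.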

Do / Avoid (Jan 2026)

Do

  • Keep a versioned source document (doc/slide/design file) alongside the PDF.
  • Verify links and reading order for long documents.
  • Use real redaction and test by copy/paste.
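
The copy/paste test for redaction can be partly automated by searching the text layer of the redacted file for the strings that were supposed to be removed. The sketch below assumes the per-page text has already been extracted by some tool (for example pdfplumber's `page.extract_text()`); the function name is hypothetical. Whitespace is stripped before matching so that a secret split across a line break is still found.

```python
import re
from typing import Dict, Iterable, List, Optional


def _normalize(text: str) -> str:
    """Drop all whitespace and fold case so wrapped text still matches."""
    return re.sub(r"\s+", "", text).casefold()


def find_redaction_leaks(
    page_texts: Iterable[Optional[str]], secrets: Iterable[str]
) -> Dict[str, List[int]]:
    """Return {secret: [1-based page numbers]} for secrets still extractable."""
    needles = {s: _normalize(s) for s in secrets if _normalize(s)}
    leaks: Dict[str, List[int]] = {}
    for number, text in enumerate(page_texts, start=1):
        haystack = _normalize(text or "")
        for secret, needle in needles.items():
            if needle in haystack:
                leaks.setdefault(secret, []).append(number)
    return leaks
```

An empty result only shows that the text layer is clean; it does not cover metadata, annotations, or text that is visible in an embedded image, so the manual check still applies.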

Avoid

  • Editing PDFs as the primary workflow when a source doc exists.
  • Shipping PDFs with broken links or illegible charts.
  • Including customer PII or secrets in PDFs without explicit approval.

What Good Looks Like

  • Fidelity: export is reproducible from a versioned source file (doc/slide/design) and looks identical across viewers.
  • Accessibility: tags/reading order are correct; links work; scanned docs are OCRed when appropriate.
  • Release hygiene: file naming includes version/date; metadata is clean; the PDF is never treated as the source of truth.
  • Security: redaction is verified (copy/paste test) and sensitive data is minimized.
  • QA: release checklist completed using assets/pdf-release-checklist.md.
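
The release-hygiene point about file naming can be made mechanical. The scheme below (`slug_vVERSION_DATE.pdf`) is an assumed convention for illustration only; the text above asks just that version and date appear in the name, and the function name is hypothetical.

```python
import re
from datetime import date


def release_filename(title: str, version: str, released: date) -> str:
    """Build a file name such as 'q3-sales-report_v1.2_2026-01-15.pdf'."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one ASCII letter or digit")
    if not re.fullmatch(r"\d+(\.\d+)*", version):
        raise ValueError("version must look like 1, 1.2 or 1.2.3")
    return f"{slug}_v{version}_{released.isoformat()}.pdf"
```

Using an ISO date keeps file names sortable in chronological order within a version.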

Optional: AI / Automation

Use only when explicitly requested and policy-compliant.
  • Automate a run of the release checklist; a human still verifies the final PDF manually.

Navigation

Resources
  • references/pdf-generation-patterns.md — Complex layouts, multi-page docs
  • references/pdf-extraction-patterns.md — Text, table, image extraction
  • data/sources.json — Library documentation links
Templates
  • assets/invoice-template.md — Invoice PDF generation
  • assets/report-template.md — Multi-page report structure
  • assets/pdf-release-checklist.md — Links, accessibility, export fidelity
Related Skills
  • ../document-docx/SKILL.md — Word document generation
  • ../document-xlsx/SKILL.md — Excel/spreadsheet workflows
  • ../document-pptx/SKILL.md — PowerPoint presentations