document-pdf

Document PDF Skill — Quick Reference

This skill enables PDF creation, extraction, manipulation, and analysis. Claude should apply these patterns when users need to generate invoices, reports, extract data from PDFs, merge documents, or work with PDF forms.

Modern Best Practices (Jan 2026):

PDF is a release artifact, not the editable source of truth.
Validate export fidelity (fonts, images, links) and accessibility where required.
Accessibility: if compliance matters, target a tagged/structured PDF workflow (often PDF/UA-aligned) and validate with tooling.
EU distribution: the European Accessibility Act (EAA, applicable since 28 June 2025) typically implies EN 301 549 expectations for customer-facing PDFs.
Treat PDFs as sensitive: scrub metadata, ensure real redaction, and control distribution.

Core Decision Rules (2026)

First decide: born-digital PDF (selectable text) vs scanned PDF (images). Scanned PDFs usually require OCR; see `references/pdf-extraction-patterns.md`.
If the user needs accessibility/compliance, prefer generating from a source format that supports structure (DOCX/HTML + proper export) rather than “post-fixing” an untagged PDF.
For deterministic ops (merge/split/rotate/scrub), prefer the `scripts/` helpers over re-implementing ad hoc.
Never treat black rectangles or overlays as redaction; use real redaction and verify by copy/paste + search.
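
The first rule above can be approximated by checking how much selectable text each page yields. A minimal sketch, assuming per-page text has already been extracted (for example with pdfplumber's `page.extract_text()`); the function names and the 20-character threshold are illustrative assumptions, not part of this skill's scripts.

```python
def classify_pdf(page_texts, min_chars=20):
    """Classify a PDF as 'born-digital', 'scanned', or 'mixed'.

    page_texts: one entry per page, the text extracted from that page
    (None or "" when the extractor found nothing).
    min_chars: hypothetical threshold; pages with fewer non-whitespace
    characters are treated as image-only.
    """
    if not page_texts:
        raise ValueError("no pages to classify")
    text_pages = sum(
        1 for text in page_texts
        if len("".join((text or "").split())) >= min_chars
    )
    if text_pages == len(page_texts):
        return "born-digital"
    if text_pages == 0:
        return "scanned"
    return "mixed"


def extract_page_texts(path):
    """Read per-page text with pdfplumber (third-party, imported lazily)."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

A "scanned" or "mixed" result suggests routing the affected pages through OCR before any text or table extraction.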

Quick Reference

| Task | Tool/Library | Language | When to Use |
| --- | --- | --- | --- |
| Create PDF | pdfkit | Node.js | Reports, invoices, certificates |
| Create PDF | ReportLab | Python | Complex layouts, tables |
| Create PDF | FPDF2 | Python | Simple PDFs with Unicode support |
| Create PDF | Borb | Python | Interactive elements, pure Python |
| Edit PDF | pdf-lib | Node.js | Modify existing PDFs, add pages |
| Extract text | pdfplumber | Python | OCR-free text extraction |
| OCR scanned PDF | PyMuPDF + Tesseract | Python | Scanned PDFs (no selectable text) |
| Extract tables | Camelot | Python | Tables with borders (Lattice mode) |
| Extract tables | Camelot/Tabula | Python | Tables without borders (Stream mode) |
| Parse/merge/split/rotate | pypdf | Python | Deterministic PDF manipulation |
| Fill forms | pdf-lib | Node.js | Form automation |
| HTML to PDF | Puppeteer/Playwright | Node.js | High-fidelity web page rendering |
| HTML to PDF | WeasyPrint | Python | CSS3-based, no browser needed |

When to Use This Skill

Claude should invoke this skill when a user asks to:

Generate PDFs from data (invoices, reports, certificates)
Extract text or tables from existing PDFs
Merge multiple PDFs into one document
Split PDFs into separate files
Fill PDF forms programmatically
Add watermarks, headers, footers
Convert HTML/web pages to PDF

Default Workflow

Create: pick `pdfkit` (Node) or `ReportLab` (Python) and start from `assets/invoice-template.md` or `assets/report-template.md`; for advanced layouts use `references/pdf-generation-patterns.md`.
Extract: use `references/pdf-extraction-patterns.md` (text/tables/images/metadata + OCR fallback).
Ship: run `assets/pdf-release-checklist.md` (fidelity, links, accessibility baseline, privacy).

Scripts (Deterministic Operations)

Scripts are optional helpers; they assume Python 3 plus the listed dependencies in each file.

Merge:

python3 scripts/merge_pdfs.py merged.pdf a.pdf b.pdf

Split:

python3 scripts/split_pdf.py in.pdf out_dir --each-page

Rotate:

python3 scripts/rotate_pdf.py in.pdf out.pdf --degrees 90

Scrub metadata:

python3 scripts/scrub_metadata.py in.pdf out.pdf
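
The helper scripts themselves are not reproduced here. As a rough sketch of what a merge helper with the command-line shape shown above could look like, assuming pypdf is installed; the function and argument names are hypothetical and may differ from the actual `scripts/merge_pdfs.py`:

```python
import argparse


def parse_args(argv=None):
    """Parse `OUTPUT INPUT [INPUT ...]`, mirroring the merge command above."""
    parser = argparse.ArgumentParser(description="Merge PDFs in the order given.")
    parser.add_argument("output", help="path of the merged PDF to write")
    parser.add_argument("inputs", nargs="+", help="PDFs to merge, in order")
    return parser.parse_args(argv)


def merge_pdfs(output, inputs):
    """Append every input PDF to one writer and save the result."""
    from pypdf import PdfWriter  # third-party, imported lazily

    writer = PdfWriter()
    for path in inputs:
        writer.append(path)  # appends all pages of that file
    with open(output, "wb") as handle:
        writer.write(handle)
```

A command-line entry point would call `parse_args()` and pass `args.output` and `args.inputs` to `merge_pdfs`. Keeping argument parsing separate from the pypdf call makes the ordering logic easy to check without touching any files.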

PDF Structure Patterns

Invoice Template

INVOICE STRUCTURE
├── Header (logo, company info, invoice #)
├── Bill To / Ship To blocks
├── Line items table
│   ├── Description | Qty | Unit Price | Total
│   └── Subtotal, Tax, Total
├── Payment terms
└── Footer (contact, thank you)
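
The line-items portion of this structure implies some arithmetic (line totals, subtotal, tax, total). A small sketch of that calculation, assuming a single flat tax rate and two-decimal rounding; the class and field names are illustrative and not taken from `assets/invoice-template.md`:

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


@dataclass
class LineItem:
    description: str
    qty: int
    unit_price: Decimal

    @property
    def total(self) -> Decimal:
        # Line total = quantity x unit price, rounded to cents.
        return (self.unit_price * self.qty).quantize(CENT, rounding=ROUND_HALF_UP)


def invoice_totals(items, tax_rate):
    """Return (subtotal, tax, total) for the line-items table."""
    subtotal = sum((item.total for item in items), Decimal("0.00"))
    tax = (subtotal * tax_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    return subtotal, tax, subtotal + tax


items = [
    LineItem("Design review", 2, Decimal("150.00")),
    LineItem("Hosting (monthly)", 1, Decimal("19.99")),
]
subtotal, tax, total = invoice_totals(items, Decimal("0.20"))
print(subtotal, tax, total)  # 319.99 64.00 383.99
```

Using `Decimal` rather than floats keeps the printed totals consistent with what a reader would compute by hand.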

Report Template

REPORT PDF STRUCTURE
├── Cover page (title, author, date)
├── Table of contents
├── Body sections with page numbers
├── Charts/images with captions
├── Appendices
└── Running header/footer

Decision Tree

PDF Task: [What do you need?]
    ├─ Create new PDF?
    │   ├─ Simple text/tables → pdfkit (Node) or ReportLab (Python)
    │   ├─ Complex layouts → ReportLab with Platypus
    │   └─ From HTML → Puppeteer/Playwright (Node) or WeasyPrint (Python)
    │
    ├─ Extract from PDF?
    │   ├─ Text only → pdfplumber (Python)
    │   ├─ Tables → pdfplumber or camelot (Python)
    │   └─ Images → PyMuPDF/fitz (Python)
    │
    ├─ Modify existing PDF?
    │   ├─ Add text/images → pdf-lib (Node)
    │   ├─ Merge/split → pypdf or pdf-lib
    │   └─ Fill forms → pdf-lib
    │
    └─ Batch processing?
        └─ pypdf + pdfplumber pipeline

Do / Avoid (Jan 2026)

Do

Keep a versioned source document (doc/slide/design file) alongside the PDF.
Verify links and reading order for long documents.
Use real redaction and test by copy/paste.
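
The copy/paste test can also be automated as a text search over what the redacted PDF still exposes. A minimal sketch, assuming the page text and metadata values have already been extracted into strings; `find_redaction_leaks` is a hypothetical helper name, and a clean result does not replace a manual check, since image-only content is not covered:

```python
def find_redaction_leaks(extracted_texts, sensitive_terms):
    """Return the sensitive terms still present in any extracted string.

    extracted_texts: strings pulled from the redacted PDF (page text,
    metadata values, annotation contents); None entries are skipped.
    sensitive_terms: strings that redaction was supposed to remove.
    Matching ignores case and whitespace so that terms split across
    line breaks are still caught.
    """
    def normalize(value):
        return "".join(value.split()).casefold()

    haystack = [normalize(text) for text in extracted_texts if text]
    leaks = []
    for term in sensitive_terms:
        needle = normalize(term)
        if needle and any(needle in text for text in haystack):
            leaks.append(term)
    return leaks
```

An empty list means none of the listed terms survived in the extracted text; any returned term indicates the redaction only covered the content visually.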

Avoid

Editing PDFs as the primary workflow when a source doc exists.
Shipping PDFs with broken links or illegible charts.
Including customer PII or secrets in PDFs without explicit approval.

What Good Looks Like

Fidelity: export is reproducible from a versioned source file (doc/slide/design) and looks identical across viewers.
Accessibility: tags/reading order are correct; links work; scanned docs are OCRed when appropriate.
Release hygiene: file naming includes version/date; metadata is clean; no “PDF as source of truth”.
Security: redaction is verified (copy/paste test) and sensitive data is minimized.
QA: release checklist completed using `assets/pdf-release-checklist.md`.
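
The naming convention under release hygiene can be enforced with a small helper. A sketch, assuming a `<name>_v<version>_<YYYY-MM-DD>.pdf` pattern; that pattern and the function name are illustrative choices, since the criteria above only require that version and date appear in the file name:

```python
import re
from datetime import date


def release_filename(title, version, released):
    """Build a PDF file name that carries both version and release date."""
    # Lowercase the title and collapse anything non-alphanumeric to hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain letters or digits")
    return f"{slug}_v{version}_{released.isoformat()}.pdf"


print(release_filename("Q4 Sales Report", "1.2", date(2026, 1, 15)))
# q4-sales-report_v1.2_2026-01-15.pdf
```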

Optional: AI / Automation

Use only when explicitly requested and policy-compliant.

Automation may generate a draft run of the release checklist; a human still verifies the final PDF manually.
