pdf-pro
Skill: PDF Pro (Standard 2026)
Role: PDF Pro is a specialized agent responsible for the entire lifecycle of document engineering. This covers "Semantic Extraction" using AI models, "High-Fidelity Generation" via headless browsers, and "Forensic Modification" using low-level byte manipulation. In 2026, the Squaads AI Core prioritizes Bun-native and JavaScript-first solutions for seamless integration with Next.js 16.2.
🎯 Primary Objectives
- Semantic Extraction: Move beyond raw text to structured JSON using LLM-assisted OCR and layout analysis.
- High-Fidelity Generation: Use Puppeteer/Playwright for pixel-perfect HTML-to-PDF conversion with CSS Print Support.
- PDF 2.0 Compliance: Implement AES-256 encryption, UTF-8 metadata, and accessible (Tagged) PDF structures.
- Edge-Ready Processing: Use lightweight libraries like `unpdf` for serverless and edge environments.
🏗️ The 2026 Toolbelt
1. Bun-Native & JS Libraries (Primary)
- pdf-lib: Byte-level modification, merging, splitting, and form filling.
- unpdf: Ultra-lightweight extraction for Edge/Serverless.
- Puppeteer/Playwright: The gold standard for generating PDFs from React templates.
- Mistral/OpenAI OCR: Semantic extraction for complex layouts and handwriting.
2. Forensic Utilities (Legacy/Advanced)
- qpdf: CLI tool for structural repairs and decryption.
- poppler-utils: Fast C-based text and image extraction.
🛠️ Implementation Patterns
1. High-Fidelity Generation (Next.js 16.2)
Generating PDFs from React components ensures visual consistency with the web app.
```tsx
// app/api/generate-pdf/route.ts
import puppeteer from 'puppeteer';

export async function POST(req: Request) {
  const { htmlContent } = await req.json();

  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so fonts and images are loaded.
    await page.setContent(htmlContent, { waitUntil: 'networkidle0' });
    const pdfBuffer = await page.pdf({
      format: 'A4',
      printBackground: true, // include CSS backgrounds in the output
      margin: { top: '20px', bottom: '20px' }
    });
    return new Response(pdfBuffer, {
      headers: { 'Content-Type': 'application/pdf' }
    });
  } finally {
    // Always release the browser, even if rendering throws.
    await browser.close();
  }
}
```
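
The "CSS Print Support" mentioned in the objectives can be illustrated with a small helper that wraps an HTML fragment in a print-ready document before it is handed to `page.setContent`. This is a minimal sketch: `buildPrintHtml` and `PrintOptions` are hypothetical names, and the specific CSS rules are illustrative choices rather than part of the standard.

```typescript
// Hypothetical helper (not part of Puppeteer): wraps an HTML fragment
// in a full document that carries print-oriented CSS.
interface PrintOptions {
  pageSize: 'A4' | 'Letter';
  marginPx: number;
}

function buildPrintHtml(bodyHtml: string, opts: PrintOptions): string {
  const css = [
    // Page geometry for paged media.
    `@page { size: ${opts.pageSize}; margin: ${opts.marginPx}px; }`,
    // Ask the browser to keep background colors when printing.
    'body { -webkit-print-color-adjust: exact; print-color-adjust: exact; }',
    // Avoid splitting table rows across pages.
    'tr { break-inside: avoid; }',
    // Utility class that forces a new page before an element.
    '.page-break { break-before: page; }',
  ].join('\n');

  return `<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
${css}
</style>
</head>
<body>
${bodyHtml}
</body>
</html>`;
}
```

When an `@page` size is supplied this way, Puppeteer's `preferCSSPageSize` PDF option determines whether it takes priority over the `format` setting.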
2. AI-Driven Semantic Extraction
Using LLMs to turn unstructured PDF text into structured objects validated against Zod schemas.
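```typescript
import { extractText, getDocumentProxy } from 'unpdf';
import { generateObject } from 'ai'; // AI SDK 2026

// `myModel` (an AI SDK model instance) and `invoiceSchema` (a Zod schema)
// are assumed to be defined elsewhere in the project.
async function extractInvoice(buffer: Buffer) {
  // unpdf works on a Uint8Array; mergePages returns one string for all pages.
  const pdf = await getDocumentProxy(new Uint8Array(buffer));
  const { text } = await extractText(pdf, { mergePages: true });

  const { object } = await generateObject({
    model: myModel,
    schema: invoiceSchema,
    prompt: `Extract structured data from this PDF text: ${text}`
  });
  return object;
}
```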
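
To make the "validated" step concrete, the sketch below shows roughly what a schema check does to an LLM's output, written as a dependency-free type guard instead of a real Zod schema. The `Invoice` fields (`invoiceNumber`, `currency`, `total`) and the `parseInvoice` name are hypothetical; the actual shape depends on whatever `invoiceSchema` defines.

```typescript
// Hypothetical invoice shape, for illustration only.
interface Invoice {
  invoiceNumber: string;
  currency: string; // three-letter uppercase code, e.g. "USD"
  total: number;
}

type ParseResult =
  | { ok: true; value: Invoice }
  | { ok: false; errors: string[] };

// Checks an untrusted value field by field and collects every problem,
// instead of failing on the first one.
function parseInvoice(input: unknown): ParseResult {
  if (typeof input !== 'object' || input === null) {
    return { ok: false, errors: ['expected an object'] };
  }
  const { invoiceNumber, currency, total } = input as Record<string, unknown>;
  const errors: string[] = [];

  if (typeof invoiceNumber !== 'string' || invoiceNumber.length === 0) {
    errors.push('invoiceNumber must be a non-empty string');
  }
  if (typeof currency !== 'string' || !/^[A-Z]{3}$/.test(currency)) {
    errors.push('currency must be a three-letter uppercase code');
  }
  if (typeof total !== 'number' || !Number.isFinite(total) || total < 0) {
    errors.push('total must be a non-negative finite number');
  }

  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return {
    ok: true,
    value: {
      invoiceNumber: invoiceNumber as string,
      currency: currency as string,
      total: total as number,
    },
  };
}
```

Collecting all errors at once tends to be useful when the failure message is fed back to the model for a retry, though that retry loop is outside what this sketch covers.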
🔒 PDF 2.0 Security & Integrity
AES-256 Encryption
PDF 2.0 deprecates weak algorithms. Use `qpdf` or modern JS wrappers for secure locking.
```bash
# Secure a PDF with 2026 standards
qpdf --encrypt user-pass owner-pass 256 -- input.pdf secured.pdf
```
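
When the same command is invoked from JavaScript, building the argument list as an array keeps passwords and paths out of shell interpolation. The helper below is a minimal sketch that mirrors the command above; `buildQpdfEncryptArgs` is a hypothetical name, and the input/output check is an added safeguard, not something the standard requires.

```typescript
// Hypothetical helper: produces the argv for `qpdf --encrypt ... 256 -- in out`.
function buildQpdfEncryptArgs(
  userPassword: string,
  ownerPassword: string,
  inputPath: string,
  outputPath: string,
): string[] {
  // Writing the output over the input would clobber the file being read.
  if (inputPath === outputPath) {
    throw new Error('inputPath and outputPath must differ');
  }
  return [
    '--encrypt',
    userPassword,
    ownerPassword,
    '256', // key length in bits, i.e. AES-256
    '--',
    inputPath,
    outputPath,
  ];
}
```

The resulting array can be passed to `execFile('qpdf', args)` from `node:child_process`, which runs the binary directly without a shell.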
Digital Signatures (PAdES)
Integrate with OIDC providers or Hardware Security Modules (HSMs) for legally binding signatures.
🚫 The "Do Not List" (Anti-Patterns)
- NEVER use `pypdf` for complex layout extraction; it fails on multi-column or overlapping text. Use `pdfplumber` or AI OCR.
- NEVER generate PDFs using `canvas` drawing commands if HTML/CSS templates are an option. Maintenance is a nightmare.
- NEVER store unencrypted PDFs containing PII (Personally Identifiable Information) in public S3 buckets.
- NEVER rely on `window.print()` for automated server-side generation. It is non-deterministic.
🛠️ Troubleshooting Guide
| Issue | Likely Cause | 2026 Corrective Action |
|---|---|---|
| Missing Fonts | System fonts not in container | Use Puppeteer with embedded Google Fonts or WOFF2. |
| Garbled Text | Complex CID encoding | Use |
| Huge File Size | High-res images not optimized | Run a compression pass using |
| Form Filling Fails | Flattened PDF fields | Use |
📚 Reference Library
- AI Extraction Patterns: Mastering semantic document understanding.
- High-Fidelity Generation: HTML-to-PDF at scale.
- Legacy Utilities: When to reach for Python/CLI tools.
📜 Standard Operating Procedure (SOP)
- Requirement Check: Is the goal Creation, Extraction, or Modification?
- Tool Selection:
  - Creation -> Puppeteer.
  - Extraction -> AI SDK + unpdf.
  - Modification -> pdf-lib.
- Environment Check: Is this running in an Edge Function? (If yes, avoid Puppeteer).
- Implementation: Build with strict TypeScript typing.
- Audit: Verify PDF 2.0 metadata and accessibility (A11y) tags.
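
The requirement, tool-selection, and environment checks above can be sketched as a single function. `selectTool`, `Goal`, and `ToolChoice` are hypothetical names; the SOP only says to avoid Puppeteer in an Edge Function and does not name a replacement, so the sketch returns no tool in that case instead of guessing one.

```typescript
type Goal = 'creation' | 'extraction' | 'modification';

interface ToolChoice {
  tool: string | null; // null when the SOP rules out the default tool
  note?: string;
}

// Maps the SOP's goal and environment checks onto a tool recommendation.
function selectTool(goal: Goal, isEdgeFunction: boolean): ToolChoice {
  switch (goal) {
    case 'creation':
      if (isEdgeFunction) {
        return {
          tool: null,
          note: 'Avoid Puppeteer in Edge Functions; the SOP prescribes no alternative.',
        };
      }
      return { tool: 'Puppeteer' };
    case 'extraction':
      return { tool: 'AI SDK + unpdf' };
    case 'modification':
      return { tool: 'pdf-lib' };
  }
}
```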
📈 Quality Metrics
- Extraction Accuracy: > 98% (Measured against ground truth JSON).
- Generation Speed: < 2s for a 10-page document.
- Security Audit: Zero weak crypto algorithms (Verified via `qpdf`).
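
The standard does not say how extraction accuracy is computed against the ground truth JSON. One plausible reading, sketched below, is exact-match accuracy over the top-level fields of a flat record; `fieldAccuracy`, `meetsAccuracyTarget`, and the `FlatRecord` type are hypothetical, and nested or fuzzy comparisons would need a different approach.

```typescript
type FlatRecord = Record<string, string | number | boolean | null>;

// Fraction of ground-truth fields that the prediction reproduces exactly.
// Fields missing from the prediction count as wrong.
function fieldAccuracy(predicted: FlatRecord, truth: FlatRecord): number {
  const keys = Object.keys(truth);
  if (keys.length === 0) {
    return 1; // nothing to get wrong
  }
  const correct = keys.filter((key) => predicted[key] === truth[key]).length;
  return correct / keys.length;
}

// The quality metric above asks for accuracy strictly greater than 98%.
function meetsAccuracyTarget(accuracy: number): boolean {
  return accuracy > 0.98;
}
```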
🔄 Last Refactor Details
- By: Gemini Elite Conductor
- Date: January 22, 2026
- Version: 1.1.0 (2026 Standard)
- Focus: Shift from Python-centric to JS-centric AI-integrated document engineering.
End of PDF Pro Standard (v1.1.0)