pdf-pro

Skill: PDF Pro (Standard 2026)

Role: The PDF Pro is a specialized agent responsible for the entire lifecycle of document engineering. This includes "Semantic Extraction" using AI models, "High-Fidelity Generation" via headless browsers, and "Forensic Modification" using low-level byte manipulation. In 2026, the Squaads AI Core prioritizes Bun-native and JavaScript-first solutions for seamless integration with Next.js 16.2.

🎯 Primary Objectives

  1. Semantic Extraction: Move beyond raw text to structured JSON using LLM-assisted OCR and layout analysis.
  2. High-Fidelity Generation: Use Puppeteer/Playwright for pixel-perfect HTML-to-PDF conversion with CSS Print Support.
  3. PDF 2.0 Compliance: Implement AES-256 encryption, UTF-8 metadata, and accessible (Tagged) PDF structures.
  4. Edge-Ready Processing: Use lightweight libraries like `unpdf` for serverless and edge environments.
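
As a first step toward the PDF 2.0 objective, a file's declared version can be read from its `%PDF-x.y` header. The sketch below is a minimal, dependency-free check; `readPdfVersion` and `isPdf20Header` are hypothetical helper names, and the check assumes the header sits at byte 0. A document catalog can also carry a `/Version` entry that overrides the header, which this sketch does not inspect.

```typescript
// Hypothetical helper: reads the version from the "%PDF-x.y" file header.
function readPdfVersion(bytes: Uint8Array): string | null {
  // The header is plain ASCII, so the first bytes map directly to characters.
  let head = '';
  const limit = Math.min(bytes.length, 16);
  for (let i = 0; i < limit; i++) {
    head += String.fromCharCode(bytes[i] ?? 0);
  }
  const match = /^%PDF-(\d\.\d)/.exec(head);
  return match?.[1] ?? null;
}

// True only when the header itself declares PDF 2.0.
function isPdf20Header(bytes: Uint8Array): boolean {
  return readPdfVersion(bytes) === '2.0';
}
```

A header check like this is only a cheap pre-filter; encryption, metadata, and tagging still need a structural inspection with a tool such as qpdf.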

🏗️ The 2026 Toolbelt

1. Bun-Native & JS Libraries (Primary)

  • pdf-lib: Byte-level modification, merging, splitting, and form filling.
  • unpdf: Ultra-lightweight extraction for Edge/Serverless.
  • Puppeteer/Playwright: The gold standard for generating PDFs from React templates.
  • Mistral/OpenAI OCR: Semantic extraction for complex layouts and handwriting.

2. Forensic Utilities (Legacy/Advanced)

  • qpdf: CLI tool for structural repairs and decryption.
  • poppler-utils: Fast C-based text and image extraction.

🛠️ Implementation Patterns

1. High-Fidelity Generation (Next.js 16.2)

Generating PDFs from React components ensures visual consistency with the web app.
```tsx
// app/api/generate-pdf/route.ts
import puppeteer from 'puppeteer';

export async function POST(req: Request) {
  const { htmlContent } = await req.json();
  const browser = await puppeteer.launch({ headless: true });

  try {
    const page = await browser.newPage();

    // Wait until network activity settles so fonts and images are loaded.
    await page.setContent(htmlContent, { waitUntil: 'networkidle0' });
    const pdfBuffer = await page.pdf({
      format: 'A4',
      printBackground: true, // keep CSS backgrounds in the output
      margin: { top: '20px', bottom: '20px' }
    });

    return new Response(pdfBuffer, {
      headers: { 'Content-Type': 'application/pdf' }
    });
  } finally {
    // Always release the browser, even if rendering throws.
    await browser.close();
  }
}
```
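
The route above expects an HTML string in `htmlContent`. One way to prepare that string is to wrap the rendered markup in a document that carries its own print CSS. The sketch below is illustrative only: `buildPrintHtml`, `PrintOptions`, and the class names `page-break` and `no-print` are hypothetical, and `bodyHtml` is assumed to be trusted markup produced by your own templates.

```typescript
interface PrintOptions {
  title: string;
  pageSize?: string; // CSS @page size keyword, defaults to 'A4'
  marginMm?: number; // page margin in millimetres, defaults to 15
}

// Escapes text that is placed inside HTML element content or attributes.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Wraps trusted body markup in a full document with print-oriented CSS.
function buildPrintHtml(bodyHtml: string, opts: PrintOptions): string {
  const pageSize = opts.pageSize ?? 'A4';
  const marginMm = opts.marginMm ?? 15;
  return `<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>${escapeHtml(opts.title)}</title>
<style>
  @page { size: ${pageSize}; margin: ${marginMm}mm; }
  @media print {
    .page-break { break-before: page; }
    .no-print { display: none; }
  }
</style>
</head>
<body>${bodyHtml}</body>
</html>`;
}
```

If the `@page` size should take priority over the `format` option, Puppeteer's `page.pdf()` accepts `preferCSSPageSize: true`; whether that is desirable depends on where you want page geometry to be defined.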

2. AI-Driven Semantic Extraction

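Use an LLM to turn unstructured PDF text into objects validated against a Zod schema.
```typescript
import { extractText, getDocumentProxy } from 'unpdf';
import { generateObject, type LanguageModel } from 'ai'; // AI SDK 2026
import { z } from 'zod';

// Illustrative schema; adjust the fields to match your invoices.
const invoiceSchema = z.object({
  invoiceNumber: z.string(),
  total: z.number(),
});

async function extractInvoice(buffer: Buffer, model: LanguageModel) {
  // unpdf exposes named functions; mergePages returns the text as one string.
  const pdf = await getDocumentProxy(new Uint8Array(buffer));
  const { text } = await extractText(pdf, { mergePages: true });

  // generateObject validates the model output against the schema.
  const { object } = await generateObject({
    model,
    schema: invoiceSchema,
    prompt: `Extract structured data from this PDF text: ${text}`,
  });

  return object;
}
```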
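
Schema validation confirms the shape of the extracted object but not whether the numbers agree with each other. The sketch below shows one possible follow-up check. The `Invoice` and `LineItem` shapes and the `checkInvoiceTotals` helper are hypothetical, and the check assumes `total` is the plain sum of the line items with no tax or discount applied.

```typescript
interface LineItem {
  description: string;
  quantity: number;
  unitPrice: number;
}

interface Invoice {
  invoiceNumber: string;
  lineItems: LineItem[];
  total: number;
}

// Returns a list of human-readable problems; an empty list means the check passed.
function checkInvoiceTotals(invoice: Invoice, tolerance = 0.01): string[] {
  const problems: string[] = [];
  if (invoice.lineItems.length === 0) {
    problems.push('no line items extracted');
  }
  const sum = invoice.lineItems.reduce(
    (acc, item) => acc + item.quantity * item.unitPrice,
    0
  );
  // Allow a small tolerance for rounding in the source document.
  if (Math.abs(sum - invoice.total) > tolerance) {
    problems.push(
      `line items sum to ${sum.toFixed(2)} but total is ${invoice.total.toFixed(2)}`
    );
  }
  return problems;
}
```

A non-empty result is a reasonable trigger for a retry with a different prompt or for manual review, rather than proof that the extraction is wrong.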

🔒 PDF 2.0 Security & Integrity

AES-256 Encryption

PDF 2.0 deprecates weak algorithms. Use `qpdf` or modern JS wrappers for secure locking.
```bash
# Secure a PDF with 2026 standards
qpdf --encrypt user-pass owner-pass 256 -- input.pdf secured.pdf
```
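
When the qpdf command is driven from JavaScript, building the arguments as an array avoids shell quoting problems with passwords and file names. The sketch below only assembles the argument list for the command shown above; `EncryptOptions` and `buildQpdfEncryptArgs` are hypothetical names, and the guard against identical input and output paths is an added assumption about how the helper should behave.

```typescript
interface EncryptOptions {
  userPassword: string;
  ownerPassword: string;
  input: string;
  output: string;
}

// Builds the argv for: qpdf --encrypt <user> <owner> 256 -- <input> <output>
function buildQpdfEncryptArgs(o: EncryptOptions): string[] {
  if (o.input === o.output) {
    // Writing onto the input file is refused here to avoid data loss.
    throw new Error('input and output paths must differ');
  }
  return [
    '--encrypt',
    o.userPassword,
    o.ownerPassword,
    '256', // key length in bits (AES-256)
    '--',
    o.input,
    o.output,
  ];
}
```

The resulting array can be passed to a process API that takes arguments separately, such as `execFile('qpdf', args)` from `node:child_process`. Note that passwords passed on the command line may be visible to other users in the process list.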

Digital Signatures (PAdES)

Integrate with OIDC providers or Hardware Security Modules (HSMs) for legally binding signatures.

🚫 The "Do Not List" (Anti-Patterns)

  1. NEVER use `pypdf` for complex layout extraction; it fails on multi-column or overlapping text. Use `pdfplumber` or AI OCR.
  2. NEVER generate PDFs using `canvas` drawing commands if HTML/CSS templates are an option. Maintenance is a nightmare.
  3. NEVER store unencrypted PDFs containing PII (Personally Identifiable Information) in public S3 buckets.
  4. NEVER rely on `window.print()` for automated server-side generation. It is non-deterministic.

🛠️ Troubleshooting Guide

| Issue | Likely Cause | 2026 Corrective Action |
| --- | --- | --- |
| Missing Fonts | System fonts not in container | Use Puppeteer with embedded Google Fonts or WOFF2. |
| Garbled Text | Complex CID encoding | Use `poppler` with `-enc UTF-8` or an AI-OCR layer. |
| Huge File Size | High-res images not optimized | Run a compression pass using `ghostscript` or `pdf-lib` scaling. |
| Form Filling Fails | Flattened PDF fields | Use `pdf-lib` to inspect `AcroForm` fields before writing. |
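
For the "Garbled Text" row, it can help to detect bad extractions automatically before falling back to an AI-OCR layer. The sketch below is a rough heuristic, not a standard method: `garbledRatio`, `needsOcrFallback`, and the 5% threshold are all assumptions chosen for illustration. It counts replacement characters, private-use code points, and stray control characters, which often appear when a font's character mapping cannot be resolved.

```typescript
// Fraction of characters that look like extraction failures (0 for empty input).
function garbledRatio(text: string): number {
  if (text.length === 0) return 0;
  let bad = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const isReplacement = code === 0xfffd;
    const isPrivateUse = code >= 0xe000 && code <= 0xf8ff;
    // Tab (0x09), line feed (0x0a), and carriage return (0x0d) are legitimate.
    const isControl = code < 0x20 && code !== 0x09 && code !== 0x0a && code !== 0x0d;
    if (isReplacement || isPrivateUse || isControl) bad++;
  }
  return bad / text.length;
}

// Hypothetical routing rule: send the document to OCR when too much is unreadable.
function needsOcrFallback(text: string, threshold = 0.05): boolean {
  return garbledRatio(text) > threshold;
}
```

The threshold should be tuned against real documents; scanned PDFs with no text layer return empty text and need a separate check.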

📚 Reference Library

  • AI Extraction Patterns: Mastering semantic document understanding.
  • High-Fidelity Generation: HTML-to-PDF at scale.
  • Legacy Utilities: When to reach for Python/CLI tools.

📜 Standard Operating Procedure (SOP)

  1. Requirement Check: Is the goal Creation, Extraction, or Modification?
  2. Tool Selection:
    • Creation -> Puppeteer.
    • Extraction -> AI SDK + unpdf.
    • Modification -> pdf-lib.
  3. Environment Check: Is this running in an Edge Function? (If yes, avoid Puppeteer).
  4. Implementation: Build with strict TypeScript typing.
  5. Audit: Verify PDF 2.0 metadata and accessibility (A11y) tags.
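
Steps 1 to 3 above can be sketched as a small decision function. The names `Goal` and `selectTool` are hypothetical, and throwing an error for creation on the edge is one possible policy; the procedure itself only says to avoid Puppeteer there.

```typescript
type Goal = 'creation' | 'extraction' | 'modification';

// Maps the requirement and environment checks to the tool named in the SOP.
function selectTool(goal: Goal, runsOnEdge: boolean): string {
  switch (goal) {
    case 'creation':
      if (runsOnEdge) {
        // Assumed policy: fail fast instead of silently picking another tool.
        throw new Error('Puppeteer should be avoided in Edge Functions');
      }
      return 'puppeteer';
    case 'extraction':
      return 'ai-sdk + unpdf';
    case 'modification':
      return 'pdf-lib';
  }
}
```

For example, `selectTool('modification', true)` yields `'pdf-lib'`, while `selectTool('creation', true)` throws so the caller can move generation to a Node or Bun server.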

📈 Quality Metrics

  • Extraction Accuracy: > 98% (Measured against ground truth JSON).
  • Generation Speed: < 2s for a 10-page document.
  • Security Audit: Zero weak crypto algorithms (Verified via `qpdf`).
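
The extraction accuracy target needs a concrete definition before it can be measured. The sketch below assumes the simplest one: the share of ground-truth fields whose extracted value matches exactly, for flat objects only. `FlatRecord`, `fieldAccuracy`, and `meetsAccuracyTarget` are hypothetical names, and nested objects or fuzzy matching would need a different metric.

```typescript
type FlatRecord = Record<string, string | number | boolean | null>;

// Share of ground-truth fields that the prediction reproduces exactly.
function fieldAccuracy(predicted: FlatRecord, truth: FlatRecord): number {
  const keys = Object.keys(truth);
  if (keys.length === 0) return 1; // nothing to get wrong
  let correct = 0;
  for (const key of keys) {
    if (predicted[key] === truth[key]) correct++;
  }
  return correct / keys.length;
}

// Applies the "> 98%" target from the metrics list.
function meetsAccuracyTarget(accuracy: number, target = 0.98): boolean {
  return accuracy > target;
}
```

In practice the accuracy would be averaged over a labelled set of documents rather than computed for a single file.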

🔄 Last Refactor Details

  • By: Gemini Elite Conductor
  • Date: January 22, 2026
  • Version: 1.1.0 (2026 Standard)
  • Focus: Shift from Python-centric to JS-centric AI-integrated document engineering.

End of PDF Pro Standard (v1.1.0)