# PaddleOCR Document Parsing Skill
## When to Use This Skill
✅ Use Document Parsing for:
- Documents with tables (invoices, financial reports, spreadsheets)
- Documents with mathematical formulas (academic papers, scientific documents)
- Documents with charts and diagrams
- Multi-column layouts (newspapers, magazines, brochures)
- Complex document structures requiring layout analysis
- Any document requiring structured understanding
❌ Use Text Recognition instead for:
- Simple text-only extraction
- Quick OCR tasks where speed is critical
- Screenshots or simple images with clear text
## How to Use This Skill
### ⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔
- ONLY use the PaddleOCR Document Parsing API - execute the script `python scripts/vl_caller.py`
- NEVER use Claude's built-in vision - do NOT parse documents yourself
- NEVER offer alternatives - do NOT suggest "I can try to analyze it" or similar
- IF the API fails - display the error message and STOP immediately
- NO fallback methods - do NOT attempt document parsing any other way
If the script execution fails (API not configured, network error, etc.):
- Show the error message to the user
- Do NOT offer to help using your vision capabilities
- Do NOT ask "Would you like me to try parsing it?"
- Simply stop and wait for the user to fix the configuration
## Basic Workflow
1. Execute document parsing:

   ```bash
   python scripts/vl_caller.py --file-url "URL provided by user"
   ```

   Or for local files:

   ```bash
   python scripts/vl_caller.py --file-path "file path"
   ```

   Save the result to a file (recommended):

   ```bash
   python scripts/vl_caller.py --file-url "URL" --output result.json --pretty
   ```

   The script will display:

   ```
   Result saved to: /absolute/path/to/result.json
   ```

   This message appears on stderr; the JSON is saved to the file. Tell the user the file path shown in the message.
2. The script returns COMPLETE JSON with all document content:
   - Headers, footers, page numbers
   - Main text content
   - Tables with structure
   - Formulas (with LaTeX)
   - Figures and charts
   - Footnotes and references
   - Seals and stamps
   - Layout and reading order

3. Extract what the user needs from the complete data based on their request.
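A minimal sketch of this workflow in Python (the flag names come from the commands above; `run_and_load` assumes `scripts/vl_caller.py` exists and the API is configured):

```python
import json
import subprocess

def build_command(file_url=None, file_path=None, output="result.json"):
    """Assemble the vl_caller.py invocation from the documented flags."""
    cmd = ["python", "scripts/vl_caller.py"]
    if file_url:
        cmd += ["--file-url", file_url]
    elif file_path:
        cmd += ["--file-path", file_path]
    else:
        raise ValueError("provide file_url or file_path")
    return cmd + ["--output", output, "--pretty"]

def run_and_load(**kwargs):
    """Run the script, then load the complete JSON it saved to disk."""
    output = kwargs.get("output", "result.json")
    proc = subprocess.run(build_command(**kwargs), capture_output=True, text=True)
    if proc.returncode != 0:
        # Per the restrictions above: surface the error and stop.
        raise RuntimeError(proc.stderr.strip())
    with open(output) as f:
        return json.load(f)
```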
## IMPORTANT: Complete Content Display
CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.
- The script returns ALL document content in a structured format
- Display the full content requested by the user; do NOT truncate or summarize
- If the user asks for "all text", show the entire `text` field
- If the user asks for "tables", show ALL tables in the document
- If the user asks for "main content", filter out headers/footers but show ALL body text

What this means:
- ✅ DO: Display complete text, all tables, all formulas as requested
- ✅ DO: Present content in reading order using the `reading_order` array
- ❌ DON'T: Truncate with "..." unless content is excessively long (>10,000 chars)
- ❌ DON'T: Summarize or provide excerpts when the user asks for full content
- ❌ DON'T: Say "Here's a preview" when the user expects complete output
Example - Correct:

```
User: "Extract all the text from this document"
Claude: I've parsed the complete document. Here's all the extracted text:
[Display entire text field or concatenated regions in reading order]

Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)
```

Example - Incorrect ❌:

```
User: "Extract all the text"
Claude: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"
```
## Understanding the JSON Response
The script returns a JSON envelope wrapping the raw API result:

```json
{
  "ok": true,
  "text": "Full markdown/HTML text extracted from all pages",
  "result": [
    {
      "prunedResult": {
        "parsing_res_list": [
          {"block_label": "text", "block_content": "Paragraph text content here...", "block_bbox": [100, 200, 500, 230], "block_id": 3},
          {"block_label": "table", "block_content": "<table>...</table>", "block_bbox": [50, 300, 900, 600], "block_id": 5},
          {"block_label": "seal", "block_content": "<img .../>", "block_bbox": [400, 50, 600, 180], "block_id": 2}
        ]
      },
      "markdown": {
        "text": "Full page content in markdown/HTML format",
        "images": {"imgs/filename.jpg": "https://..."}
      }
    }
  ],
  "error": null
}
```

Key fields:
- `text` — extracted markdown text from all pages (use this for quick text display)
- `result` — raw API result array (one object per page), for detailed block-level access
- `result[n].prunedResult.parsing_res_list` — array of detected content blocks
- `result[n].markdown.text` — full page content in markdown/HTML
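Reading this envelope safely can be sketched as follows; the sample below only mimics the schema above, it is not real API output:

```python
import json

def load_envelope(raw):
    """Parse the vl_caller.py JSON envelope, failing fast if ok is false."""
    data = json.loads(raw)
    if not data.get("ok"):
        # Per the restrictions above: show the error and stop.
        raise RuntimeError(f"Parsing failed: {data.get('error')}")
    return data

# Tiny envelope that mimics the schema above (not real API output).
sample = json.dumps({
    "ok": True,
    "text": "# Title\nBody text",
    "result": [{
        "prunedResult": {"parsing_res_list": [
            {"block_label": "text", "block_content": "Body text",
             "block_bbox": [0, 0, 10, 10], "block_id": 1},
        ]},
        "markdown": {"text": "# Title\nBody text", "images": {}},
    }],
    "error": None,
})

env = load_envelope(sample)
blocks = env["result"][0]["prunedResult"]["parsing_res_list"]
```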
## Block Labels
The `block_label` field indicates the content type:

| Label | Description |
|---|---|
| `text` | Regular text content |
| `table` | Table (content is HTML) |
| `image` | Embedded image |
| `seal` | Seal or stamp |
| `figure_title` | Figure/chart title or caption |
| `vision_footnote` | Footnote detected by vision model |
| `aside_text` | Side/margin text |
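Filtering by label can be sketched like this (the envelope literal is a toy example using labels confirmed by the schema earlier):

```python
def blocks_by_label(envelope, labels):
    """Collect blocks whose block_label is in `labels`, across all pages."""
    found = []
    for page in envelope["result"]:
        for block in page["prunedResult"]["parsing_res_list"]:
            if block["block_label"] in labels:
                found.append(block)
    return found

env = {"result": [{"prunedResult": {"parsing_res_list": [
    {"block_label": "text", "block_content": "Intro paragraph."},
    {"block_label": "table", "block_content": "<table><tr><td>1</td></tr></table>"},
    {"block_label": "seal", "block_content": "<img src='seal.png'/>"},
]}}]}

tables = blocks_by_label(env, {"table"})
```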
## Content Extraction Guidelines
Based on user intent, filter the blocks:

| User Says | What to Extract | How |
|---|---|---|
| "Extract all text" | Everything | Use the `text` field |
| "Get all tables" | table blocks only | Filter `block_label == "table"` |
| "Show main content" | text + table blocks | Filter out header/footer/page-number blocks |
| "Complete document" | Everything | Use the `text` field or all blocks in reading order |
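The "Show main content" row can be sketched as a label blacklist; the excluded label names (header, footer, page_number) come from the exclude list in Example 1 of this document:

```python
NON_BODY_LABELS = {"header", "footer", "page_number"}

def main_content(envelope):
    """Keep body blocks only: drop headers, footers, and page numbers."""
    kept = []
    for page in envelope["result"]:
        for block in page["prunedResult"]["parsing_res_list"]:
            if block["block_label"] not in NON_BODY_LABELS:
                kept.append(block["block_content"])
    return "\n\n".join(kept)

env = {"result": [{"prunedResult": {"parsing_res_list": [
    {"block_label": "header", "block_content": "Running head"},
    {"block_label": "text", "block_content": "Body paragraph."},
    {"block_label": "page_number", "block_content": "3"},
]}}]}
```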
## Usage Examples
### Example 1: Extract Main Content (default behavior)

```bash
python scripts/vl_caller.py \
  --file-url "https://example.com/paper.pdf" \
  --pretty
```

Then filter the JSON to extract core content:
- Include: text, table, formula, figure, footnote
- Exclude: header, footer, page_number
### Example 2: Extract Tables Only

```bash
python scripts/vl_caller.py \
  --file-path "./financial_report.pdf" \
  --pretty
```

Then filter the JSON:
- Only keep blocks where `block_label` is `"table"`
- Present table content in markdown format
### Example 3: Complete Document with Everything

```bash
python scripts/vl_caller.py \
  --file-url "URL" \
  --pretty
```

Then use the `text` field or present all regions in `reading_order`.
## First-Time Configuration
When the API is not configured, the error will show:

```
Configuration error: API not configured. Get your API at: https://paddleocr.com
```

Auto-configuration workflow:

1. Show the exact error message to the user (including the URL)

2. Tell the user to provide credentials:

   ```
   Please visit the URL above to get your PADDLEOCR_DOC_PARSING_API_URL and PADDLEOCR_ACCESS_TOKEN. Once you have them, send them to me and I'll configure it automatically.
   ```

3. When the user provides credentials, accept any format:

   ```
   PADDLEOCR_DOC_PARSING_API_URL=https://xxx.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...
   ```

   ```
   Here's my API: https://xxx and token: abc123
   ```

   - Copy-pasted code format
   - Any other reasonable format

4. Parse credentials from the user's message:
   - Extract the PADDLEOCR_DOC_PARSING_API_URL value (look for URLs)
   - Extract the PADDLEOCR_ACCESS_TOKEN value (long alphanumeric string, usually 40+ chars)

5. Configure automatically:

   ```bash
   python scripts/configure.py --api-url "PARSED_URL" --token "PARSED_TOKEN"
   ```

6. If configuration succeeds:
   - Inform the user: "Configuration complete! Parsing document now..."
   - Retry the original parsing task

7. If configuration fails:
   - Show the error
   - Ask the user to verify the credentials

IMPORTANT: The error message format is STRICT and must be shown exactly as provided by the script. Do not modify or paraphrase it.
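Step 4 can be sketched with regular expressions; the token pattern ("40+ URL-safe characters") is an assumption based on the hint above, not a guaranteed format:

```python
import re

def parse_credentials(message):
    """Pull an API URL and an access token out of free-form user text."""
    url = re.search(r"https?://[^\s\"',]+", message)
    token = None
    for m in re.finditer(r"[A-Za-z0-9_\-]{40,}", message):
        # Skip matches that are just part of the URL itself.
        if url and m.group(0) in url.group(0):
            continue
        token = m.group(0)
        break
    return (url.group(0) if url else None, token)

url, token = parse_credentials(
    "PADDLEOCR_DOC_PARSING_API_URL=https://xxx.example.com/layout-parsing, "
    "PADDLEOCR_ACCESS_TOKEN=" + "a1b2c3d4" * 6
)
```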
## Handling Large Files
There is no file size limit for the API. For PDFs, the maximum is 100 pages per request.
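For PDFs beyond the 100-page cap, one hedged approach is to split the page range into chunks and submit each chunk separately; `page_chunks` below is illustrative, not part of the shipped scripts:

```python
MAX_PAGES = 100  # per-request PDF cap, per this document

def page_chunks(total_pages, chunk_size=MAX_PAGES):
    """Split a long PDF into (start, end) page ranges, end-exclusive."""
    return [(start, min(start + chunk_size, total_pages))
            for start in range(0, total_pages, chunk_size)]
```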
Tips for large files:
### Use URL for Large Local Files (Recommended)
For very large local files, prefer `--file-url` over `--file-path` to avoid base64 encoding overhead:

```bash
python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf"
```

### Process Specific Pages (PDF Only)
If you only need certain pages from a large PDF, extract them first:

```bash
# Using pypdfium2 (requires: pip install pypdfium2)
python -c "
import pypdfium2 as pdfium
doc = pdfium.PdfDocument('large.pdf')

# Extract pages 0-4 (first 5 pages)
new_doc = pdfium.PdfDocument.new()
for i in range(min(5, len(doc))):
    new_doc.import_pages(doc, [i])
new_doc.save('pages_1_5.pdf')
"

# Then process the smaller file
python scripts/vl_caller.py --file-path "pages_1_5.pdf"
```

## Error Handling
Authentication failed (401/403):

```
error: Authentication failed
```

→ Token is invalid; reconfigure with correct credentials

API quota exceeded (429):

```
error: API quota exceeded
```

→ Daily API quota exhausted; inform the user to wait or upgrade

Unsupported format:

```
error: Unsupported file format
```

→ File format not supported; convert to PDF/PNG/JPG
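Routing these error strings can be sketched with a substring lookup; the guidance strings paraphrase the arrows above, and the match keys are the documented error messages:

```python
ERROR_HINTS = {
    "Authentication failed": "Token is invalid; reconfigure with correct credentials.",
    "API quota exceeded": "Daily quota exhausted; wait or upgrade.",
    "Unsupported file format": "Convert the file to PDF/PNG/JPG.",
}

def hint_for(error_message):
    """Map a script error string to next-step guidance (substring match)."""
    for key, hint in ERROR_HINTS.items():
        if key in error_message:
            return hint
    return "Show the error verbatim and stop."
```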
## Pseudo-Code: Content Extraction
Extract all text (most common):

```python
def extract_all_text(json_response):
    # Quickest: use the pre-extracted text field
    print(json_response['text'])
```

Extract tables only:

```python
def extract_tables(json_response):
    for page in json_response['result']:
        blocks = page['prunedResult']['parsing_res_list']
        tables = [b for b in blocks if b['block_label'] == 'table']
        for i, table in enumerate(tables):
            print(f"Table {i+1}:")
            print(table['block_content'])  # HTML table
```

Iterate all blocks:

```python
def extract_by_block(json_response):
    for page in json_response['result']:
        blocks = page['prunedResult']['parsing_res_list']
        for block in blocks:
            print(f"[{block['block_label']}] {block['block_content'][:100]}")
```
## Important Notes
- The script NEVER filters content - It always returns complete data
- Claude decides what to present - Based on user's specific request
- All data is always available - Can be re-interpreted for different needs
- No information is lost - Complete document structure preserved
## Reference Documentation
For in-depth understanding of the PaddleOCR Document Parsing system, refer to:
- `references/output_schema.md` - Output format specification
- `references/provider_api.md` - Provider API contract

Note: Model version and capabilities are determined by your API endpoint (PADDLEOCR_DOC_PARSING_API_URL).

Load these reference documents into context when:
- Debugging complex parsing issues
- You need to understand the output format
- Working with provider API details
## Testing the Skill

To verify the skill is working properly:

```bash
python scripts/smoke_test.py
```

This tests configuration and, optionally, API connectivity.