# PaddleOCR Document Parsing Skill
## When to Use This Skill
✅ Use Document Parsing for:
- Documents with tables (invoices, financial reports, spreadsheets)
- Documents with mathematical formulas (academic papers, scientific documents)
- Documents with charts and diagrams
- Multi-column layouts (newspapers, magazines, brochures)
- Complex document structures requiring layout analysis
- Any document requiring structured understanding
❌ Use Text Recognition instead for:
- Simple text-only extraction
- Quick OCR tasks where speed is critical
- Screenshots or simple images with clear text
## How to Use This Skill
### ⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔
- ONLY use the PaddleOCR Document Parsing API - execute the script `python scripts/vl_caller.py`
- NEVER use Claude's built-in vision - do NOT parse documents yourself
- NEVER offer alternatives - do NOT suggest "I can try to analyze it" or similar
- IF the API fails - display the error message and STOP immediately
- NO fallback methods - do NOT attempt document parsing any other way
If the script execution fails (API not configured, network error, etc.):
- Show the error message to the user
- Do NOT offer to help using your vision capabilities
- Do NOT ask "Would you like me to try parsing it?"
- Simply stop and wait for the user to fix the configuration
## Basic Workflow
1. Execute document parsing:

   ```bash
   python scripts/vl_caller.py --file-url "URL provided by user"
   ```

   Or for local files:

   ```bash
   python scripts/vl_caller.py --file-path "file path"
   ```

   Save the result to a file (recommended):

   ```bash
   python scripts/vl_caller.py --file-url "URL" --output result.json --pretty
   ```

   The script will display:

   ```
   Result saved to: /absolute/path/to/result.json
   ```

   This message appears on stderr; the JSON is saved to the file. Tell the user the file path shown in the message.
2. The script returns COMPLETE JSON with all document content:
   - Headers, footers, page numbers
   - Main text content
   - Tables with structure
   - Formulas (with LaTeX)
   - Figures and charts
   - Footnotes and references
   - Seals and stamps
   - Layout and reading order

3. Extract what the user needs from the complete data based on their request.
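A minimal sketch of this workflow in Python (the flag names come from the commands above; `run_and_load` assumes `scripts/vl_caller.py` exists and the API is configured):

```python
import json
import subprocess

def build_command(file_url=None, file_path=None, output="result.json"):
    """Assemble the vl_caller.py invocation from the documented flags."""
    cmd = ["python", "scripts/vl_caller.py"]
    if file_url:
        cmd += ["--file-url", file_url]
    elif file_path:
        cmd += ["--file-path", file_path]
    else:
        raise ValueError("provide file_url or file_path")
    return cmd + ["--output", output, "--pretty"]

def run_and_load(**kwargs):
    """Run the script, then load the complete JSON it saved to disk."""
    output = kwargs.get("output", "result.json")
    proc = subprocess.run(build_command(**kwargs), capture_output=True, text=True)
    if proc.returncode != 0:
        # Per the restrictions above: surface the error and stop.
        raise RuntimeError(proc.stderr.strip())
    with open(output) as f:
        return json.load(f)
```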
## IMPORTANT: Complete Content Display
CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.
- The script returns ALL document content in a structured format
- Display the full content requested by the user; do NOT truncate or summarize
- If the user asks for "all text", show the entire `text` field
- If the user asks for "tables", show ALL tables in the document
- If the user asks for "main content", filter out headers/footers but show ALL body text

What this means:
- ✅ DO: Display complete text, all tables, all formulas as requested
- ✅ DO: Present content in reading order using the `reading_order` array
- ❌ DON'T: Truncate with "..." unless content is excessively long (>10,000 chars)
- ❌ DON'T: Summarize or provide excerpts when the user asks for full content
- ❌ DON'T: Say "Here's a preview" when the user expects complete output
Example - Correct:

```
User: "Extract all the text from this document"
Claude: I've parsed the complete document. Here's all the extracted text:
[Display entire text field or concatenated regions in reading order]

Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)
```

Example - Incorrect ❌:

```
User: "Extract all the text"
Claude: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"
```
## Understanding the JSON Response
The script returns a JSON envelope wrapping the raw API result:

```json
{
  "ok": true,
  "text": "Full markdown/HTML text extracted from all pages",
  "result": [
    {
      "prunedResult": {
        "parsing_res_list": [
          {"block_label": "text", "block_content": "Paragraph text content here...", "block_bbox": [100, 200, 500, 230], "block_id": 3},
          {"block_label": "table", "block_content": "<table>...</table>", "block_bbox": [50, 300, 900, 600], "block_id": 5},
          {"block_label": "seal", "block_content": "<img .../>", "block_bbox": [400, 50, 600, 180], "block_id": 2}
        ]
      },
      "markdown": {
        "text": "Full page content in markdown/HTML format",
        "images": {"imgs/filename.jpg": "https://..."}
      }
    }
  ],
  "error": null
}
```

Key fields:
- `text` — extracted markdown text from all pages (use this for quick text display)
- `result` — raw API result array (one object per page), for detailed block-level access
- `result[n].prunedResult.parsing_res_list` — array of detected content blocks
- `result[n].markdown.text` — full page content in markdown/HTML
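Reading this envelope safely can be sketched as follows; the sample below only mimics the schema above, it is not real API output:

```python
import json

def load_envelope(raw):
    """Parse the vl_caller.py JSON envelope, failing fast if ok is false."""
    data = json.loads(raw)
    if not data.get("ok"):
        # Per the restrictions above: show the error and stop.
        raise RuntimeError(f"Parsing failed: {data.get('error')}")
    return data

# Tiny envelope that mimics the schema above (not real API output).
sample = json.dumps({
    "ok": True,
    "text": "# Title\nBody text",
    "result": [{
        "prunedResult": {"parsing_res_list": [
            {"block_label": "text", "block_content": "Body text",
             "block_bbox": [0, 0, 10, 10], "block_id": 1},
        ]},
        "markdown": {"text": "# Title\nBody text", "images": {}},
    }],
    "error": None,
})

env = load_envelope(sample)
blocks = env["result"][0]["prunedResult"]["parsing_res_list"]
```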
## Block Labels
The `block_label` field indicates the content type:

| Label | Description |
|---|---|
| `text` | Regular text content |
| `table` | Table (content is HTML) |
| `image` | Embedded image |
| `seal` | Seal or stamp |
| `figure_title` | Figure/chart title or caption |
| `vision_footnote` | Footnote detected by vision model |
| `aside_text` | Side/margin text |
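Filtering by label can be sketched like this (the envelope literal is a toy example using labels confirmed by the schema earlier):

```python
def blocks_by_label(envelope, labels):
    """Collect blocks whose block_label is in `labels`, across all pages."""
    found = []
    for page in envelope["result"]:
        for block in page["prunedResult"]["parsing_res_list"]:
            if block["block_label"] in labels:
                found.append(block)
    return found

env = {"result": [{"prunedResult": {"parsing_res_list": [
    {"block_label": "text", "block_content": "Intro paragraph."},
    {"block_label": "table", "block_content": "<table><tr><td>1</td></tr></table>"},
    {"block_label": "seal", "block_content": "<img src='seal.png'/>"},
]}}]}

tables = blocks_by_label(env, {"table"})
```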
## Content Extraction Guidelines
Based on user intent, filter the blocks:

| User Says | What to Extract | How |
|---|---|---|
| "Extract all text" | Everything | Use the `text` field |
| "Get all tables" | table blocks only | Filter `block_label == "table"` |
| "Show main content" | text + table blocks | Filter out header/footer/page-number blocks |
| "Complete document" | Everything | Use the `text` field or all blocks in reading order |
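The "Show main content" row can be sketched as a label blacklist; the excluded label names (header, footer, page_number) come from the exclude list in Example 1 of this document:

```python
NON_BODY_LABELS = {"header", "footer", "page_number"}

def main_content(envelope):
    """Keep body blocks only: drop headers, footers, and page numbers."""
    kept = []
    for page in envelope["result"]:
        for block in page["prunedResult"]["parsing_res_list"]:
            if block["block_label"] not in NON_BODY_LABELS:
                kept.append(block["block_content"])
    return "\n\n".join(kept)

env = {"result": [{"prunedResult": {"parsing_res_list": [
    {"block_label": "header", "block_content": "Running head"},
    {"block_label": "text", "block_content": "Body paragraph."},
    {"block_label": "page_number", "block_content": "3"},
]}}]}
```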
## Usage Examples
### Example 1: Extract Main Content (default behavior)

```bash
python scripts/vl_caller.py \
  --file-url "https://example.com/paper.pdf" \
  --pretty
```

Then filter the JSON to extract core content:
- Include: text, table, formula, figure, footnote
- Exclude: header, footer, page_number
### Example 2: Extract Tables Only

```bash
python scripts/vl_caller.py \
  --file-path "./financial_report.pdf" \
  --pretty
```

Then filter the JSON:
- Only keep blocks where `block_label` is `"table"`
- Present table content in markdown format
### Example 3: Complete Document with Everything

```bash
python scripts/vl_caller.py \
  --file-url "URL" \
  --pretty
```

Then use the `text` field or present all regions in `reading_order`.
## First-Time Configuration
When the API is not configured, the error will show:

```
Configuration error: API not configured. Get your API at: https://paddleocr.com
```

Auto-configuration workflow:

1. Show the exact error message to the user (including the URL)

2. Tell the user to provide credentials:

   ```
   Please visit the URL above to get your PADDLEOCR_DOC_PARSING_API_URL and PADDLEOCR_ACCESS_TOKEN. Once you have them, send them to me and I'll configure it automatically.
   ```

3. When the user provides credentials, accept any format:

   ```
   PADDLEOCR_DOC_PARSING_API_URL=https://xxx.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...
   ```

   ```
   Here's my API: https://xxx and token: abc123
   ```

   - Copy-pasted code format
   - Any other reasonable format

4. Parse credentials from the user's message:
   - Extract the PADDLEOCR_DOC_PARSING_API_URL value (look for URLs)
   - Extract the PADDLEOCR_ACCESS_TOKEN value (long alphanumeric string, usually 40+ chars)

5. Configure automatically:

   ```bash
   python scripts/configure.py --api-url "PARSED_URL" --token "PARSED_TOKEN"
   ```

6. If configuration succeeds:
   - Inform the user: "Configuration complete! Parsing document now..."
   - Retry the original parsing task

7. If configuration fails:
   - Show the error
   - Ask the user to verify the credentials

IMPORTANT: The error message format is STRICT and must be shown exactly as provided by the script. Do not modify or paraphrase it.
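Step 4 can be sketched with regular expressions; the token pattern ("40+ URL-safe characters") is an assumption based on the hint above, not a guaranteed format:

```python
import re

def parse_credentials(message):
    """Pull an API URL and an access token out of free-form user text."""
    url = re.search(r"https?://[^\s\"',]+", message)
    token = None
    for m in re.finditer(r"[A-Za-z0-9_\-]{40,}", message):
        # Skip matches that are just part of the URL itself.
        if url and m.group(0) in url.group(0):
            continue
        token = m.group(0)
        break
    return (url.group(0) if url else None, token)

url, token = parse_credentials(
    "PADDLEOCR_DOC_PARSING_API_URL=https://xxx.example.com/layout-parsing, "
    "PADDLEOCR_ACCESS_TOKEN=" + "a1b2c3d4" * 6
)
```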
## Handling Large Files
There is no file size limit for the API. For PDFs, the maximum is 100 pages per request.
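For PDFs beyond the 100-page cap, one hedged approach is to split the page range into chunks and submit each chunk separately; `page_chunks` below is illustrative, not part of the shipped scripts:

```python
MAX_PAGES = 100  # per-request PDF cap, per this document

def page_chunks(total_pages, chunk_size=MAX_PAGES):
    """Split a long PDF into (start, end) page ranges, end-exclusive."""
    return [(start, min(start + chunk_size, total_pages))
            for start in range(0, total_pages, chunk_size)]
```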
Tips for large files:
### Use URL for Large Local Files (Recommended)
For very large local files, prefer `--file-url` over `--file-path` to avoid base64 encoding overhead:

```bash
python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf"
```

### Process Specific Pages (PDF Only)
If you only need certain pages from a large PDF, extract them first:

```bash
# Using pypdfium2 (requires: pip install pypdfium2)
python -c "
import pypdfium2 as pdfium
doc = pdfium.PdfDocument('large.pdf')

# Extract pages 0-4 (first 5 pages)
new_doc = pdfium.PdfDocument.new()
for i in range(min(5, len(doc))):
    new_doc.import_pages(doc, [i])
new_doc.save('pages_1_5.pdf')
"

# Then process the smaller file
python scripts/vl_caller.py --file-path "pages_1_5.pdf"
```

## Error Handling
Authentication failed (401/403):

```
error: Authentication failed
```

→ Token is invalid; reconfigure with correct credentials

API quota exceeded (429):

```
error: API quota exceeded
```

→ Daily API quota exhausted; inform the user to wait or upgrade

Unsupported format:

```
error: Unsupported file format
```

→ File format not supported; convert to PDF/PNG/JPG
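Routing these error strings can be sketched with a substring lookup; the guidance strings paraphrase the arrows above, and the match keys are the documented error messages:

```python
ERROR_HINTS = {
    "Authentication failed": "Token is invalid; reconfigure with correct credentials.",
    "API quota exceeded": "Daily quota exhausted; wait or upgrade.",
    "Unsupported file format": "Convert the file to PDF/PNG/JPG.",
}

def hint_for(error_message):
    """Map a script error string to next-step guidance (substring match)."""
    for key, hint in ERROR_HINTS.items():
        if key in error_message:
            return hint
    return "Show the error verbatim and stop."
```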
## Pseudo-Code: Content Extraction
Extract all text (most common):

```python
def extract_all_text(json_response):
    # Quickest: use the pre-extracted text field
    print(json_response['text'])
```

Extract tables only:

```python
def extract_tables(json_response):
    for page in json_response['result']:
        blocks = page['prunedResult']['parsing_res_list']
        tables = [b for b in blocks if b['block_label'] == 'table']
        for i, table in enumerate(tables):
            print(f"Table {i+1}:")
            print(table['block_content'])  # HTML table
```

Iterate all blocks:

```python
def extract_by_block(json_response):
    for page in json_response['result']:
        blocks = page['prunedResult']['parsing_res_list']
        for block in blocks:
            print(f"[{block['block_label']}] {block['block_content'][:100]}")
```
## Important Notes
- The script NEVER filters content - It always returns complete data
- Claude decides what to present - Based on user's specific request
- All data is always available - Can be re-interpreted for different needs
- No information is lost - Complete document structure preserved
## Reference Documentation
For in-depth understanding of the PaddleOCR Document Parsing system, refer to:
- `references/output_schema.md` - Output format specification
- `references/provider_api.md` - Provider API contract

Note: Model version and capabilities are determined by your API endpoint (PADDLEOCR_DOC_PARSING_API_URL).

Load these reference documents into context when:
- Debugging complex parsing issues
- You need to understand the output format
- Working with provider API details
## Testing the Skill

To verify the skill is working properly:

```bash
python scripts/smoke_test.py
```

This tests configuration and, optionally, API connectivity.