doc-parser
Document Parser Skill
Overview
This skill parses documents with docling, IBM's document understanding library. It handles complex PDFs, Word documents, and images while preserving structure, extracting tables and figures, and coping with multi-column layouts.
How to Use
- Provide the document to parse
- Specify what you want to extract (text, tables, figures, etc.)
- I'll parse it and return structured data
Example prompts:
- "Parse this PDF and extract all tables"
- "Convert this academic paper to structured markdown"
- "Extract figures and captions from this document"
- "Parse this report preserving the document structure"
Domain Knowledge
docling Fundamentals
```python
from docling.document_converter import DocumentConverter

# Initialize converter
converter = DocumentConverter()

# Convert document
result = converter.convert("document.pdf")

# Access parsed content
doc = result.document
print(doc.export_to_markdown())
```
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Native and scanned |
| Word | .docx | Full structure preserved |
| PowerPoint | .pptx | Slides as sections |
| Images | .png, .jpg | OCR + layout analysis |
| HTML | .html | Structure preserved |
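Because only the formats in the table are handled, it can help to filter out other files before conversion. The following is a minimal sketch, not part of docling: the helper name `is_supported` and the extension set are assumptions for illustration, with `.pdf` assumed for the PDF row.

```python
from pathlib import Path

# Extensions taken from the table above; this set is an illustration,
# not a list exported by docling itself.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".png", ".jpg", ".html"}


def is_supported(path):
    """Return True if the file extension appears in the table (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


candidates = ["report.PDF", "slides.pptx", "notes.txt", "scan.jpg"]
to_parse = [name for name in candidates if is_supported(name)]
print(to_parse)
```

Checking the suffix is only a first filter; a file with a supported extension can still fail to convert.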
Basic Usage
```python
from docling.document_converter import DocumentConverter

# Create converter
converter = DocumentConverter()

# Convert single document
result = converter.convert("report.pdf")

# Access document
doc = result.document

# Export options
markdown = doc.export_to_markdown()
text = doc.export_to_text()
json_doc = doc.export_to_dict()
```
Advanced Configuration
```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Configure pipeline
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# Create converter with options; PDF pipeline options are passed per input format
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX],
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    },
)

result = converter.convert("document.pdf")
```
Document Structure
```python
# Document hierarchy (`result` comes from converter.convert(...))
doc = result.document

# Access metadata
print(doc.name)
print(doc.origin)

# Iterate through content; iterate_items() yields (item, level) pairs
for element, level in doc.iterate_items():
    print(f"Type: {element.label}")
    # Not every item carries text (tables and pictures do not)
    print(f"Text: {getattr(element, 'text', None)}")

    if element.label == "table":
        print(f"Rows: {element.data.num_rows}")
```
Extracting Tables
```python
from docling.document_converter import DocumentConverter
import pandas as pd  # required by export_to_dataframe()


def extract_tables(doc_path):
    """Extract all tables from document."""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    tables = []

    # iterate_items() yields (item, level) pairs
    for element, _level in doc.iterate_items():
        if element.label == "table":
            # Get table data as a pandas DataFrame
            table_data = element.export_to_dataframe()
            tables.append({
                'page': element.prov[0].page_no if element.prov else None,
                'dataframe': table_data
            })

    return tables


# Usage
tables = extract_tables("report.pdf")
for i, table in enumerate(tables):
    print(f"Table {i+1} on page {table['page']}:")
    print(table['dataframe'])
```
Extracting Figures
```python
import os

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def extract_figures(doc_path, output_dir):
    """Extract figures with captions."""
    # Picture images are only kept when the pipeline is asked to generate them
    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    result = converter.convert(doc_path)
    doc = result.document

    figures = []
    os.makedirs(output_dir, exist_ok=True)

    for element, _level in doc.iterate_items():
        if element.label == "picture":
            figure_info = {
                'caption': element.caption_text(doc) or None,
                'page': element.prov[0].page_no if element.prov else None,
            }

            # Save image if available
            image = element.get_image(doc)
            if image is not None:
                img_path = os.path.join(output_dir, f"figure_{len(figures)+1}.png")
                image.save(img_path)
                figure_info['path'] = img_path

            figures.append(figure_info)

    return figures
```
Handling Multi-column Layouts
```python
from docling.document_converter import DocumentConverter


def parse_multicolumn(doc_path):
    """Parse document with multi-column layout."""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    # docling automatically handles column detection
    # Text is returned in reading order
    structured_content = []

    for element, _level in doc.iterate_items():
        content_item = {
            'type': element.label,
            'text': getattr(element, 'text', None),
            # Heading level, present on section headers only
            'level': getattr(element, 'level', None),
        }

        # Add bounding box if available
        if element.prov:
            content_item['bbox'] = element.prov[0].bbox
            content_item['page'] = element.prov[0].page_no

        structured_content.append(content_item)

    return structured_content
```
Export Formats
```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document

# Markdown export
markdown = doc.export_to_markdown()
with open("output.md", "w") as f:
    f.write(markdown)

# Plain text
text = doc.export_to_text()

# JSON/dict format
json_doc = doc.export_to_dict()

# HTML format (if supported)
html = doc.export_to_html()
```
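The dictionary export is convenient to persist as JSON. The sketch below uses only the standard library and a stand-in dictionary; the helper name `save_as_json` is hypothetical, and it assumes the exported dictionary holds only JSON-serializable values.

```python
import json
import tempfile
from pathlib import Path


def save_as_json(data, out_path):
    """Write a dict to disk as UTF-8 JSON, keeping non-ASCII text readable."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


# Stand-in for the dictionary export; the real structure will differ.
sample = {"name": "document", "texts": [{"text": "Résumé"}]}

out_file = Path(tempfile.mkdtemp()) / "output.json"
save_as_json(sample, out_file)

# Reading the file back gives the same dictionary
restored = json.loads(out_file.read_text(encoding="utf-8"))
```

Passing an explicit `encoding="utf-8"` avoids platform-dependent defaults when the document contains non-ASCII text.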
Batch Processing
```python
from docling.document_converter import DocumentConverter
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor


def batch_parse(input_dir, output_dir, max_workers=4):
    """Parse multiple documents in parallel."""
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)

    converter = DocumentConverter()

    def process_single(doc_path):
        try:
            result = converter.convert(str(doc_path))
            md = result.document.export_to_markdown()

            out_file = output_path / f"{doc_path.stem}.md"
            with open(out_file, 'w') as f:
                f.write(md)

            return {'file': str(doc_path), 'status': 'success'}
        except Exception as e:
            return {'file': str(doc_path), 'status': 'error', 'error': str(e)}

    docs = list(input_path.glob('*.pdf')) + list(input_path.glob('*.docx'))

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_single, docs))

    return results
```
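The batch function reports one dictionary per document, with `file`, `status`, and, on failure, `error`. A short sketch of how such a list might be summarized follows; `summarize_results` is a hypothetical helper and the sample entries are made up for illustration.

```python
from collections import Counter


def summarize_results(results):
    """Count outcomes and collect error messages from batch results."""
    counts = Counter(r["status"] for r in results)
    failures = {r["file"]: r["error"] for r in results if r["status"] == "error"}
    return counts, failures


# Sample data in the shape returned per document: file, status, optional error.
sample_results = [
    {"file": "a.pdf", "status": "success"},
    {"file": "b.docx", "status": "error", "error": "conversion failed"},
    {"file": "c.pdf", "status": "success"},
]

counts, failures = summarize_results(sample_results)
print(f"{counts['success']} succeeded, {counts['error']} failed")
for name, message in failures.items():
    print(f"  {name}: {message}")
```

Collecting failures by file name makes it easy to retry only the documents that did not convert.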
Best Practices
- Use Appropriate Pipeline: Configure for your document type
- Handle Large Documents: Process in chunks if needed
- Verify Table Extraction: Complex tables may need review
- Check OCR Quality: Enable OCR for scanned documents
- Cache Results: Store parsed documents for reuse
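The caching advice can be sketched with the standard library alone. The example below is one possible approach, not a docling feature: `cached_markdown` and `fake_parse` are hypothetical names, and the parser is passed in as a callable so the cache logic can be tried without running a real conversion.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_markdown(doc_path, cache_dir, parse_fn):
    """Return Markdown for doc_path, calling parse_fn only on a cache miss.

    parse_fn is any callable that takes a path and returns a Markdown string,
    for example a thin wrapper around a docling converter.
    """
    doc_path = Path(doc_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Key on file content so a modified document is parsed again
    digest = hashlib.sha256(doc_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.md"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    markdown = parse_fn(doc_path)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown


# Demonstration with a fake parser that records how often it is called
calls = []


def fake_parse(path):
    calls.append(path)
    return f"# Parsed {path.name}\n"


workdir = Path(tempfile.mkdtemp())
source = workdir / "report.pdf"
source.write_bytes(b"%PDF-1.7 placeholder bytes")

first = cached_markdown(source, workdir / "cache", fake_parse)
second = cached_markdown(source, workdir / "cache", fake_parse)
print(len(calls))
```

Hashing the file content rather than its name means a renamed copy hits the cache while an edited file does not.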
Common Patterns
Academic Paper Parser
```python
from docling.document_converter import DocumentConverter


def parse_academic_paper(pdf_path):
    """Parse academic paper structure."""
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document

    paper = {
        'title': None,
        'abstract': None,
        'sections': [],
        'references': [],
        'tables': [],
        'figures': []
    }

    current_section = None

    # iterate_items() yields (item, level) pairs
    for element, _level in doc.iterate_items():
        text = getattr(element, 'text', '')

        if element.label == 'title':
            paper['title'] = text

        elif element.label == 'section_header':
            if 'abstract' in text.lower():
                current_section = 'abstract'
            elif 'reference' in text.lower():
                current_section = 'references'
            else:
                paper['sections'].append({
                    'title': text,
                    'content': ''
                })
                current_section = 'section'

        elif element.label in ('paragraph', 'text'):
            if current_section == 'abstract':
                # Abstracts may span several paragraphs
                paper['abstract'] = ((paper['abstract'] or '') + ' ' + text).strip()
            elif current_section == 'references':
                paper['references'].append(text)
            elif current_section == 'section' and paper['sections']:
                paper['sections'][-1]['content'] += text + '\n'

        elif element.label == 'table':
            paper['tables'].append({
                'caption': element.caption_text(doc) or None,
                'data': element.export_to_dataframe()
            })

    return paper
```
Report to Structured Data
```python
from docling.document_converter import DocumentConverter


def parse_business_report(doc_path):
    """Parse business report into structured format."""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    report = {
        'metadata': {
            'title': None,
            'date': None,
            'author': None
        },
        'executive_summary': None,
        'sections': [],
        'key_metrics': [],
        'recommendations': []
    }

    # Parse document structure
    for element, _level in doc.iterate_items():
        # Implement parsing logic based on document structure
        pass

    return report
```
Examples
Example 1: Parse Financial Report
```python
from docling.document_converter import DocumentConverter


def parse_financial_report(pdf_path):
    """Extract structured data from financial report."""
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document

    financial_data = {
        'income_statement': None,
        'balance_sheet': None,
        'cash_flow': None,
        'notes': []
    }

    # Extract tables
    tables = []
    for element, _level in doc.iterate_items():
        if element.label == 'table':
            table_df = element.export_to_dataframe()
            table_text = str(table_df).lower()

            # Identify table type by keyword; the first matching rule wins
            if 'revenue' in table_text or 'income' in table_text:
                financial_data['income_statement'] = table_df
            elif 'asset' in table_text or 'liabilities' in table_text:
                financial_data['balance_sheet'] = table_df
            elif 'cash' in table_text:
                financial_data['cash_flow'] = table_df
            else:
                tables.append(table_df)

    # Keep tables that matched no keyword instead of discarding them
    financial_data['other_tables'] = tables

    # Extract markdown for notes
    financial_data['markdown'] = doc.export_to_markdown()

    return financial_data


report = parse_financial_report('annual_report.pdf')
print("Income Statement:")
print(report['income_statement'])
```
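The keyword test that sorts tables into statement types can be tried on plain strings, without converting a document. This is a sketch of the same ordered rules; `classify_table` is a hypothetical helper and the sample strings are invented.

```python
def classify_table(table_text):
    """Map table text to a statement type using ordered keyword checks."""
    text = table_text.lower()
    rules = [
        ("income_statement", ("revenue", "income")),
        ("balance_sheet", ("asset", "liabilities")),
        ("cash_flow", ("cash",)),
    ]
    # Rules are checked in order, so the first matching rule wins
    for label, keywords in rules:
        if any(keyword in text for keyword in keywords):
            return label
    return None


print(classify_table("Total revenue 1,200"))
print(classify_table("Total assets 5,000"))
print(classify_table("Cash paid to suppliers 300"))
print(classify_table("Net income 90; cash from operations 120"))
print(classify_table("Headcount by region"))
```

The fourth sample shows a weakness of this heuristic: a cash flow table that starts from net income is labeled as an income statement, so the results are worth reviewing on real reports.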
Example 2: Technical Documentation Parser
```python
from docling.document_converter import DocumentConverter


def parse_technical_docs(doc_path):
    """Parse technical documentation."""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    documentation = {
        'title': None,
        'version': None,
        'sections': [],
        'code_blocks': [],
        'diagrams': []
    }

    current_section = None

    # iterate_items() yields (item, level) pairs
    for element, _level in doc.iterate_items():
        if element.label == 'title':
            documentation['title'] = element.text

        elif element.label == 'section_header':
            current_section = {
                'title': element.text,
                'level': getattr(element, 'level', 1),
                'content': []
            }
            documentation['sections'].append(current_section)

        elif element.label == 'code':
            if current_section:
                current_section['content'].append({
                    'type': 'code',
                    'content': element.text
                })
            documentation['code_blocks'].append(element.text)

        elif element.label == 'picture':
            documentation['diagrams'].append({
                'page': element.prov[0].page_no if element.prov else None,
                'caption': element.caption_text(doc) or None
            })

    return documentation


docs = parse_technical_docs('api_documentation.pdf')
print(f"Title: {docs['title']}")
print(f"Sections: {len(docs['sections'])}")
```
Example 3: Contract Analysis
```python
import re

from docling.document_converter import DocumentConverter


def analyze_contract(pdf_path):
    """Parse contract document for key clauses."""
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document

    contract = {
        'parties': [],
        'clauses': [],
        'dates': [],
        'amounts': [],
        'full_text': doc.export_to_text()
    }

    # Extract dates
    date_pattern = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}\b'
    contract['dates'] = re.findall(date_pattern, contract['full_text'], re.IGNORECASE)

    # Extract monetary amounts
    amount_pattern = r'\$[\d,]+(?:\.\d{2})?|\b\d+(?:,\d{3})*(?:\.\d{2})?\s*(?:USD|dollars)\b'
    contract['amounts'] = re.findall(amount_pattern, contract['full_text'], re.IGNORECASE)

    # Parse sections as clauses; iterate_items() yields (item, level) pairs
    for element, _level in doc.iterate_items():
        if element.label == 'section_header':
            contract['clauses'].append({
                'title': element.text,
                'content': ''
            })
        elif element.label in ('paragraph', 'text') and contract['clauses']:
            contract['clauses'][-1]['content'] += element.text + '\n'

    return contract


contract_data = analyze_contract('agreement.pdf')
print(f"Key dates: {contract_data['dates']}")
print(f"Amounts: {contract_data['amounts']}")
```
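The two regular expressions used for dates and amounts can be checked in isolation on a sample sentence. The sketch below copies the patterns from the contract example; the sample text and the constant names are made up for illustration.

```python
import re

DATE_PATTERN = (
    r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b'
    r'|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}\b'
)
AMOUNT_PATTERN = (
    r'\$[\d,]+(?:\.\d{2})?'
    r'|\b\d+(?:,\d{3})*(?:\.\d{2})?\s*(?:USD|dollars)\b'
)

sample = (
    "This Agreement is dated January 5, 2024 and expires on 12/31/2025. "
    "The fee is $1,250.00, plus 300 USD for setup."
)

dates = re.findall(DATE_PATTERN, sample, re.IGNORECASE)
amounts = re.findall(AMOUNT_PATTERN, sample, re.IGNORECASE)

print(dates)    # ['January 5, 2024', '12/31/2025']
print(amounts)  # ['$1,250.00', '300 USD']
```

One edge case to be aware of: because `[\d,]+` accepts commas, an amount without cents that is followed by a comma, such as `$500, due now`, is captured as `$500,` with the trailing comma, so stripping punctuation from the matches may be needed.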
Limitations
- Very large documents may require chunking
- Handwritten content needs OCR preprocessing
- Complex nested tables may need manual review
- Some PDF types (for example, encrypted files) are not supported
- A GPU is recommended for best performance
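For the chunking point, one simple approach is to split a long document into fixed-size page ranges and process them one at a time. The helper below is a hypothetical, standard-library sketch that only computes the ranges; how each range is handed to the parser (for example by splitting the PDF first, or through a page-range argument if your docling version offers one) is left open.

```python
def page_ranges(total_pages, chunk_size):
    """Yield 1-based, inclusive (start, end) page ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for start in range(1, total_pages + 1, chunk_size):
        yield start, min(start + chunk_size - 1, total_pages)


ranges = list(page_ranges(total_pages=23, chunk_size=10))
print(ranges)  # [(1, 10), (11, 20), (21, 23)]
```

Tables or paragraphs that cross a chunk boundary may be split, so boundaries are worth checking when chunking this way.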
Installation
```bash
pip install docling

# For full functionality
pip install docling[all]

# For OCR support
pip install docling[ocr]
```