doc-parser


Document Parser Skill

Overview

This skill parses documents with docling, IBM's open-source document understanding library. It handles complex PDFs, Word documents, and images while preserving document structure, extracting tables and figures, and reading multi-column layouts in the correct order.

How to Use

  1. Provide the document to parse
  2. Specify what you want to extract (text, tables, figures, etc.)
  3. I'll parse it and return structured data
Example prompts:
  • "Parse this PDF and extract all tables"
  • "Convert this academic paper to structured markdown"
  • "Extract figures and captions from this document"
  • "Parse this report preserving the document structure"

Domain Knowledge

docling Fundamentals

python
from docling.document_converter import DocumentConverter

# Initialize converter
converter = DocumentConverter()

# Convert document
result = converter.convert("document.pdf")

# Access parsed content
doc = result.document
print(doc.export_to_markdown())

Supported Formats

| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Native and scanned |
| Word | .docx | Full structure preserved |
| PowerPoint | .pptx | Slides as sections |
| Images | .png, .jpg | OCR + layout analysis |
| HTML | .html | Structure preserved |
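
The table above can serve as a quick pre-flight check before any file reaches the converter. The sketch below uses only the standard library. The names `SUPPORTED_EXTENSIONS`, `is_supported`, and `split_by_support` are illustrative, and the extension set simply mirrors the table rather than querying docling itself.

```python
from pathlib import Path

# Extensions copied from the table above (assumption: docling is not consulted)
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".png", ".jpg", ".html"}

def is_supported(path):
    """Return True if the file extension appears in the supported-formats table."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS

def split_by_support(paths):
    """Partition paths into (supported, unsupported) lists, preserving order."""
    supported, unsupported = [], []
    for p in paths:
        (supported if is_supported(p) else unsupported).append(p)
    return supported, unsupported
```

Filtering up front keeps unsupported files out of a batch run, so a conversion failure is more likely to point at a real parsing problem.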

Basic Usage

python
from docling.document_converter import DocumentConverter

# Create converter
converter = DocumentConverter()

# Convert single document
result = converter.convert("report.pdf")

# Access document
doc = result.document

# Export options
markdown = doc.export_to_markdown()
text = doc.export_to_text()
json_doc = doc.export_to_dict()

Advanced Configuration

python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Configure pipeline
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# Create converter with options; per-format settings are passed via format_options
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX],
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    },
)
result = converter.convert("document.pdf")

Document Structure

python
# Document hierarchy
doc = result.document

# Access metadata
print(doc.name)
print(doc.origin)

# Iterate through content; iterate_items() yields (item, level) pairs
for element, level in doc.iterate_items():
    print(f"Type: {element.label}")
    # Tables and pictures have no text attribute, so fall back to an empty string
    print(f"Text: {getattr(element, 'text', '')}")

    if element.label == "table":
        print(f"Rows: {element.data.num_rows}")

Extracting Tables

python
from docling.document_converter import DocumentConverter

def extract_tables(doc_path):
    """Extract all tables from document."""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    tables = []

    # iterate_items() yields (item, level) pairs; each item carries a label
    for element, _level in doc.iterate_items():
        if element.label == "table":
            # Get table data as a pandas DataFrame (pandas must be installed)
            table_data = element.export_to_dataframe()
            tables.append({
                'page': element.prov[0].page_no if element.prov else None,
                'dataframe': table_data
            })

    return tables

# Usage
tables = extract_tables("report.pdf")
for i, table in enumerate(tables):
    print(f"Table {i+1} on page {table['page']}:")
    print(table['dataframe'])

Extracting Figures

python
import os

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

def extract_figures(doc_path, output_dir):
    """Extract figures with captions."""
    # Picture images are only kept when the pipeline is asked to generate them
    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    result = converter.convert(doc_path)
    doc = result.document

    figures = []
    os.makedirs(output_dir, exist_ok=True)

    # iterate_items() yields (item, level) pairs; each item carries a label
    for element, _level in doc.iterate_items():
        if element.label == "picture":
            figure_info = {
                'caption': element.caption_text(doc) or None,
                'page': element.prov[0].page_no if element.prov else None,
            }

            # Save image if available
            image = element.get_image(doc)
            if image is not None:
                img_path = os.path.join(output_dir, f"figure_{len(figures)+1}.png")
                image.save(img_path)
                figure_info['path'] = img_path

            figures.append(figure_info)

    return figures

Handling Multi-column Layouts

python
from docling.document_converter import DocumentConverter

def parse_multicolumn(doc_path):
    """Parse document with multi-column layout."""

    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    # docling automatically handles column detection
    # Text is returned in reading order

    structured_content = []

    # iterate_items() yields (item, level) pairs; each item carries a label
    for element, _level in doc.iterate_items():
        content_item = {
            'type': element.label,
            'text': getattr(element, 'text', None),
            'level': getattr(element, 'level', None),
        }

        # Add bounding box if available
        if element.prov:
            content_item['bbox'] = element.prov[0].bbox
            content_item['page'] = element.prov[0].page_no

        structured_content.append(content_item)

    return structured_content

Export Formats

python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document

# Markdown export
markdown = doc.export_to_markdown()
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown)

# Plain text
text = doc.export_to_text()

# JSON/dict format
json_doc = doc.export_to_dict()

# HTML format (if supported)
html = doc.export_to_html()
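
The dictionary from `export_to_dict()` is typically written to disk as JSON. The sketch below shows one way to do that with the standard `json` module, assuming the dictionary is JSON-serialisable. The helper name `save_json` and the `sample` dictionary are made up for illustration and stand in for a real docling export.

```python
import json
from pathlib import Path

def save_json(doc_dict, out_path):
    """Write a dict (for example the result of doc.export_to_dict()) as UTF-8 JSON."""
    # ensure_ascii=False keeps non-ASCII text readable instead of \u escapes
    Path(out_path).write_text(
        json.dumps(doc_dict, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return out_path

# Stand-in for a real export; these keys are hypothetical
sample = {"name": "report", "texts": [{"text": "Résumé"}]}
```

Writing with an explicit UTF-8 encoding avoids platform-dependent defaults when the parsed text contains non-ASCII characters.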

Batch Processing

python
from docling.document_converter import DocumentConverter
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def batch_parse(input_dir, output_dir, max_workers=4):
    """Parse multiple documents in parallel."""

    input_path = Path(input_dir)
    output_path = Path(output_dir)
    # parents=True also creates missing intermediate directories
    output_path.mkdir(parents=True, exist_ok=True)

    converter = DocumentConverter()

    def process_single(doc_path):
        try:
            result = converter.convert(str(doc_path))
            md = result.document.export_to_markdown()

            out_file = output_path / f"{doc_path.stem}.md"
            # Explicit UTF-8 so non-ASCII text is written consistently on every platform
            with open(out_file, 'w', encoding='utf-8') as f:
                f.write(md)

            return {'file': str(doc_path), 'status': 'success'}
        except Exception as e:
            return {'file': str(doc_path), 'status': 'error', 'error': str(e)}

    docs = list(input_path.glob('*.pdf')) + list(input_path.glob('*.docx'))

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_single, docs))

    return results
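
The list of dictionaries returned by `batch_parse` can be condensed into a short report. The following is a minimal sketch; `summarize_batch` is a hypothetical helper, and the sample data is hand-written in the same shape as the result dictionaries above.

```python
from collections import Counter

def summarize_batch(results):
    """Count outcomes and collect failures from batch_parse-style result dicts."""
    counts = Counter(r['status'] for r in results)
    failures = [
        (r['file'], r.get('error', ''))
        for r in results
        if r['status'] == 'error'
    ]
    return {
        'success': counts.get('success', 0),
        'error': counts.get('error', 0),
        'failures': failures,
    }

# Hand-written sample in the same shape batch_parse returns
sample_results = [
    {'file': 'a.pdf', 'status': 'success'},
    {'file': 'b.docx', 'status': 'error', 'error': 'corrupt file'},
]
summary = summarize_batch(sample_results)
```

Keeping the failed file names next to their error messages makes it straightforward to retry only those documents.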

Best Practices

  1. Use Appropriate Pipeline: Configure for your document type
  2. Handle Large Documents: Process in chunks if needed
  3. Verify Table Extraction: Complex tables may need review
  4. Check OCR Quality: Enable OCR for scanned documents
  5. Cache Results: Store parsed documents for reuse
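
For the "Cache Results" practice, one possible approach is to key the cached Markdown on a hash of the source file, so that an unchanged document is never parsed twice. This is a sketch under assumptions: `cached_markdown` and its `convert_fn` parameter are hypothetical names, and `convert_fn` stands for any callable that takes a path string and returns Markdown.

```python
import hashlib
from pathlib import Path

def cached_markdown(doc_path, cache_dir, convert_fn):
    """Return Markdown for doc_path, reusing a cached copy when the file is unchanged.

    convert_fn is any callable taking a path string and returning Markdown, e.g.
    lambda p: DocumentConverter().convert(p).document.export_to_markdown()
    """
    doc_path = Path(doc_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Key the cache on file contents so edits to the source invalidate the entry
    digest = hashlib.sha256(doc_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.md"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    markdown = convert_fn(str(doc_path))
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

Hashing the contents rather than the file name means renamed copies of the same document share one cache entry.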

Common Patterns

Academic Paper Parser

python
from docling.document_converter import DocumentConverter

def parse_academic_paper(pdf_path):
    """Parse academic paper structure."""

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document

    paper = {
        'title': None,
        'abstract': None,
        'sections': [],
        'references': [],
        'tables': [],
        'figures': []
    }

    current_section = None

    # iterate_items() yields (item, level) pairs; each item carries a label
    for element, _level in doc.iterate_items():
        text = getattr(element, 'text', '')

        if element.label == 'title':
            paper['title'] = text

        elif element.label == 'section_header':
            if 'abstract' in text.lower():
                current_section = 'abstract'
            elif 'reference' in text.lower():
                current_section = 'references'
            else:
                paper['sections'].append({
                    'title': text,
                    'content': ''
                })
                current_section = 'section'

        elif element.label in ('text', 'paragraph'):
            if current_section == 'abstract':
                # Join multi-paragraph abstracts instead of keeping only the last paragraph
                paper['abstract'] = ((paper['abstract'] or '') + '\n' + text).strip()
            elif current_section == 'section' and paper['sections']:
                paper['sections'][-1]['content'] += text + '\n'

        elif element.label == 'table':
            paper['tables'].append({
                'caption': element.caption_text(doc) or None,
                'data': element.export_to_dataframe()
            })

    return paper

Report to Structured Data

python
from docling.document_converter import DocumentConverter

def parse_business_report(doc_path):
    """Parse business report into structured format."""

    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    report = {
        'metadata': {
            'title': None,
            'date': None,
            'author': None
        },
        'executive_summary': None,
        'sections': [],
        'key_metrics': [],
        'recommendations': []
    }

    # Parse document structure; iterate_items() yields (item, level) pairs
    for element, _level in doc.iterate_items():
        # Implement parsing logic based on document structure
        pass

    return report

Examples

Example 1: Parse Financial Report

python
from docling.document_converter import DocumentConverter

def parse_financial_report(pdf_path):
    """Extract structured data from financial report."""

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document

    financial_data = {
        'income_statement': None,
        'balance_sheet': None,
        'cash_flow': None,
        'other_tables': [],
        'notes': []
    }

    # Extract tables; iterate_items() yields (item, level) pairs
    for element, _level in doc.iterate_items():
        if element.label == 'table':
            table_df = element.export_to_dataframe()
            # to_string() renders the whole table, so keyword checks see every cell
            table_text = table_df.to_string().lower()

            # Identify table type with simple keyword matching
            if 'revenue' in table_text or 'income' in table_text:
                financial_data['income_statement'] = table_df
            elif 'asset' in table_text or 'liabilities' in table_text:
                financial_data['balance_sheet'] = table_df
            elif 'cash' in table_text:
                financial_data['cash_flow'] = table_df
            else:
                # Keep unclassified tables instead of discarding them
                financial_data['other_tables'].append(table_df)

    # Extract markdown for notes
    financial_data['markdown'] = doc.export_to_markdown()

    return financial_data

report = parse_financial_report('annual_report.pdf')
print("Income Statement:")
print(report['income_statement'])

Example 2: Technical Documentation Parser

python
from docling.document_converter import DocumentConverter

def parse_technical_docs(doc_path):
    """Parse technical documentation."""

    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document

    documentation = {
        'title': None,
        'version': None,
        'sections': [],
        'code_blocks': [],
        'diagrams': []
    }

    current_section = None

    # iterate_items() yields (item, level) pairs; each item carries a label
    for element, _level in doc.iterate_items():
        if element.label == 'title':
            documentation['title'] = element.text

        elif element.label == 'section_header':
            current_section = {
                'title': element.text,
                'level': getattr(element, 'level', 1),
                'content': []
            }
            documentation['sections'].append(current_section)

        elif element.label == 'code':
            if current_section:
                current_section['content'].append({
                    'type': 'code',
                    'content': element.text
                })
            documentation['code_blocks'].append(element.text)

        elif element.label == 'picture':
            documentation['diagrams'].append({
                'page': element.prov[0].page_no if element.prov else None,
                'caption': element.caption_text(doc) or None
            })

    return documentation

docs = parse_technical_docs('api_documentation.pdf')
print(f"Title: {docs['title']}")
print(f"Sections: {len(docs['sections'])}")

Example 3: Contract Analysis

python
import re

from docling.document_converter import DocumentConverter

def analyze_contract(pdf_path):
    """Parse contract document for key clauses."""

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    doc = result.document

    contract = {
        'parties': [],
        'clauses': [],
        'dates': [],
        'amounts': [],
        'full_text': doc.export_to_text()
    }

    # Extract dates
    date_pattern = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}\b'
    contract['dates'] = re.findall(date_pattern, contract['full_text'], re.IGNORECASE)

    # Extract monetary amounts
    amount_pattern = r'\$[\d,]+(?:\.\d{2})?|\b\d+(?:,\d{3})*(?:\.\d{2})?\s*(?:USD|dollars)\b'
    contract['amounts'] = re.findall(amount_pattern, contract['full_text'], re.IGNORECASE)

    # Parse sections as clauses; iterate_items() yields (item, level) pairs
    for element, _level in doc.iterate_items():
        if element.label == 'section_header':
            contract['clauses'].append({
                'title': element.text,
                'content': ''
            })
        elif element.label in ('text', 'paragraph') and contract['clauses']:
            contract['clauses'][-1]['content'] += element.text + '\n'

    return contract

contract_data = analyze_contract('agreement.pdf')
print(f"Key dates: {contract_data['dates']}")
print(f"Amounts: {contract_data['amounts']}")
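
To see what the two regular expressions in `analyze_contract` pick up, they can be run on a short string without docling. The sketch below reuses the patterns unchanged; the sample sentence is invented for illustration.

```python
import re

# Same patterns as in analyze_contract
date_pattern = r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}\b'
amount_pattern = r'\$[\d,]+(?:\.\d{2})?|\b\d+(?:,\d{3})*(?:\.\d{2})?\s*(?:USD|dollars)\b'

# Made-up contract text
sample = (
    "This Agreement is dated March 5, 2024 and expires on 12/31/2025. "
    "The fee is $1,250.00 plus 500 USD."
)

dates = re.findall(date_pattern, sample, re.IGNORECASE)
amounts = re.findall(amount_pattern, sample, re.IGNORECASE)

print(dates)    # ['March 5, 2024', '12/31/2025']
print(amounts)  # ['$1,250.00', '500 USD']
```

Both patterns use only non-capturing groups, so `re.findall` returns the full matched strings rather than tuples of groups.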

Limitations

  • Very large documents may need to be processed in chunks
  • Handwritten content needs OCR preprocessing
  • Complex nested tables may need manual review
  • Some PDFs, such as encrypted files, are not supported
  • A GPU is recommended for best performance
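
For the chunking limitation, a simple starting point is to split a long document into page ranges and parse one range at a time. The sketch below only computes the ranges. `page_ranges` is a hypothetical helper, and how each range is fed to the converter is left open because it depends on the docling version in use.

```python
def page_ranges(total_pages, chunk_size=50):
    """Yield inclusive, 1-based (start, end) page ranges covering total_pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(1, total_pages + 1, chunk_size):
        # The last range is clipped so it never runs past the final page
        yield start, min(start + chunk_size - 1, total_pages)

# Example: a 120-page document split into chunks of at most 50 pages
ranges = list(page_ranges(120, 50))
```

Smaller chunks lower peak memory use at the cost of more conversion calls, so the chunk size is worth tuning per document set.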

Installation

bash
pip install docling

# For full functionality (quoted so the shell does not expand the brackets)
pip install "docling[all]"

# For OCR support
pip install "docling[ocr]"

Resources
