
Data Extractor Skill

Overview

This skill enables extraction of structured data from any document format using unstructured - a unified library for processing PDFs, Word docs, emails, HTML, and more. Get consistent, structured output regardless of input format.

How to Use

  1. Provide the document to process
  2. Optionally specify extraction options
  3. I'll extract structured elements with metadata
Example prompts:
  • "Extract all text and tables from this PDF"
  • "Parse this email and get the body, attachments, and metadata"
  • "Convert this HTML page to structured elements"
  • "Extract data from these mixed-format documents"

Domain Knowledge

unstructured Fundamentals

```python
from unstructured.partition.auto import partition

# Automatically detect and process any document
elements = partition("document.pdf")

# Access extracted elements
for element in elements:
    print(f"Type: {type(element).__name__}")
    print(f"Text: {element.text}")
    print(f"Metadata: {element.metadata}")
```

Supported Formats

| Format | Function | Notes |
| --- | --- | --- |
| PDF | `partition_pdf` | Native + scanned |
| Word | `partition_docx` | Full structure |
| PowerPoint | `partition_pptx` | Slides & notes |
| Excel | `partition_xlsx` | Sheets & tables |
| Email | `partition_email` | Body & attachments |
| HTML | `partition_html` | Tags preserved |
| Markdown | `partition_md` | Structure preserved |
| Plain Text | `partition_text` | Basic parsing |
| Images | `partition_image` | OCR extraction |
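The table above amounts to a lookup from file extension to partitioner. As a sketch (illustrative only: `partition()` already performs this dispatch internally, and `partitioner_for` is a hypothetical helper, not part of the library):

```python
from pathlib import Path

# Illustrative mapping from extension to the partitioner named above
PARTITIONERS = {
    ".pdf": "partition_pdf",
    ".docx": "partition_docx",
    ".pptx": "partition_pptx",
    ".xlsx": "partition_xlsx",
    ".eml": "partition_email",
    ".html": "partition_html",
    ".md": "partition_md",
    ".txt": "partition_text",
    ".png": "partition_image",
}

def partitioner_for(filename: str) -> str:
    """Return the partition function name for a file, by extension."""
    suffix = Path(filename).suffix.lower()
    return PARTITIONERS.get(suffix, "partition")  # fall back to auto
```

Unknown extensions fall back to the auto partitioner, mirroring how `partition()` is the safe default.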

Element Types

```python
from unstructured.documents.elements import (
    Title,
    NarrativeText,
    Text,
    ListItem,
    Table,
    Image,
    Header,
    Footer,
    PageBreak,
    Address,
    EmailAddress,
)

# Elements have a consistent structure
element.text      # Raw text content
element.metadata  # Rich metadata
element.category  # Element type
element.id        # Unique identifier
```

Auto Partition

```python
from unstructured.partition.auto import partition
from unstructured.documents.elements import Title, Table

# Process any file type
elements = partition(
    filename="document.pdf",
    strategy="auto",  # or "fast", "hi_res", "ocr_only"
    include_metadata=True,
    include_page_breaks=True,
)

# Filter by type
titles = [e for e in elements if isinstance(e, Title)]
tables = [e for e in elements if isinstance(e, Table)]
```

Format-Specific Partitioning

```python
# PDF with options
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="document.pdf",
    strategy="hi_res",           # High quality extraction
    infer_table_structure=True,  # Detect tables
    include_page_breaks=True,
    languages=["en"],            # OCR language
)

# Word documents
from unstructured.partition.docx import partition_docx

elements = partition_docx(
    filename="document.docx",
    include_metadata=True,
)

# HTML
from unstructured.partition.html import partition_html

elements = partition_html(
    filename="page.html",
    include_metadata=True,
)
```

Working with Tables

```python
from unstructured.partition.auto import partition

elements = partition("report.pdf", infer_table_structure=True)

# Extract tables
for element in elements:
    if element.category == "Table":
        print("Table found:")
        print(element.text)

        # Access structured table data
        if hasattr(element, 'metadata') and element.metadata.text_as_html:
            print("HTML:", element.metadata.text_as_html)
```

Metadata Access

```python
from unstructured.partition.auto import partition

elements = partition("document.pdf")

for element in elements:
    meta = element.metadata

    # Common metadata fields
    print(f"Page: {meta.page_number}")
    print(f"Filename: {meta.filename}")
    print(f"Filetype: {meta.filetype}")
    print(f"Coordinates: {meta.coordinates}")
    print(f"Languages: {meta.languages}")
```
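Fields like `page_number` make simple aggregations straightforward. A minimal sketch, using plain dicts to stand in for parsed elements (illustration only, not the unstructured API):

```python
from collections import defaultdict

# Plain dicts standing in for parsed elements (illustration only)
elements = [
    {"text": "Intro", "page": 1},
    {"text": "Details", "page": 2},
    {"text": "More", "page": 2},
]

def group_by_page(elements):
    """Group element texts by their page number."""
    pages = defaultdict(list)
    for el in elements:
        pages[el["page"]].append(el["text"])
    return dict(pages)
```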

Chunking for AI/RAG

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# Partition document
elements = partition("document.pdf")

# Chunk by title (semantic chunks)
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200,
)

# Or basic chunking
chunks = chunk_elements(
    elements,
    max_characters=500,
    overlap=50,
)

for chunk in chunks:
    print(f"Chunk ({len(chunk.text)} chars):")
    print(chunk.text[:100] + "...")
```
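To make the `max_characters`/`overlap` semantics concrete, here is a stdlib-only sketch of character-window chunking. It is a simplification, assuming raw-character splits; the real `chunk_elements` splits on element boundaries:

```python
def chunk_text(text: str, max_characters: int = 500, overlap: int = 50):
    """Split text into windows of at most max_characters,
    each overlapping the previous one by `overlap` characters."""
    if overlap >= max_characters:
        raise ValueError("overlap must be smaller than max_characters")
    chunks, step = [], max_characters - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break
    return chunks
```

Each window shares its first `overlap` characters with the tail of the previous one, which helps keep sentences intact across chunk boundaries in RAG pipelines.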

Batch Processing

```python
from unstructured.partition.auto import partition
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def process_document(file_path):
    """Process a single document."""
    try:
        elements = partition(str(file_path))
        return {
            'file': str(file_path),
            'status': 'success',
            'elements': len(elements),
            'text': '\n\n'.join([e.text for e in elements])
        }
    except Exception as e:
        return {
            'file': str(file_path),
            'status': 'error',
            'error': str(e)
        }

def batch_process(input_dir, max_workers=4):
    """Process all documents in a directory."""
    input_path = Path(input_dir)
    files = list(input_path.glob('*'))

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_document, files))

    return results
```

Export Formats

```python
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json, elements_to_dicts

elements = partition("document.pdf")

# To JSON string
json_str = elements_to_json(elements)

# To list of dicts
dicts = elements_to_dicts(elements)

# To DataFrame
import pandas as pd

df = pd.DataFrame(dicts)
```

Best Practices

  1. Choose Strategy Wisely: "fast" for speed, "hi_res" for accuracy
  2. Enable Table Detection: turn on infer_table_structure for documents with tables
  3. Specify Language: set OCR languages for better results on non-English docs
  4. Chunk for RAG: use semantic chunking (chunk_by_title) for AI applications
  5. Handle Errors: wrap parsing in try/except so unsupported or corrupt files fail gracefully
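Practice 1 can be sketched as a small helper; `pick_strategy` is hypothetical (not part of unstructured), and the heuristic is one reasonable default, not a rule:

```python
def pick_strategy(filename: str, scanned: bool = False) -> str:
    """Pick a partition strategy (sketch, not library API).

    Heuristic: OCR-only for scanned documents, hi_res for PDFs
    (tables and layout), fast for everything else.
    """
    if scanned:
        return "ocr_only"
    if filename.lower().endswith(".pdf"):
        return "hi_res"
    return "fast"
```

The returned string would be passed as `strategy=` to `partition()`.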

Common Patterns

Document to JSON

```python
def document_to_json(file_path, output_path=None):
    """Convert document to structured JSON."""
    from unstructured.partition.auto import partition
    import json

    elements = partition(file_path)

    # Create structured output
    output = {
        'source': file_path,
        'elements': []
    }

    for element in elements:
        output['elements'].append({
            'type': type(element).__name__,
            'text': element.text,
            'metadata': {
                'page': element.metadata.page_number,
                'coordinates': element.metadata.coordinates.to_dict() if element.metadata.coordinates else None
            }
        })

    if output_path:
        with open(output_path, 'w') as f:
            json.dump(output, f, indent=2)

    return output
```

Email Parser

```python
from unstructured.partition.email import partition_email

def parse_email(email_path):
    """Extract structured data from an email."""
    elements = partition_email(email_path)

    email_data = {
        'subject': None,
        'from': None,
        'to': [],
        'date': None,
        'body': [],
        'attachments': []
    }

    for element in elements:
        meta = element.metadata

        # Extract headers from metadata
        if meta.subject:
            email_data['subject'] = meta.subject
        if meta.sent_from:
            email_data['from'] = meta.sent_from
        if meta.sent_to:
            email_data['to'] = meta.sent_to

        # Body content
        email_data['body'].append({
            'type': type(element).__name__,
            'text': element.text
        })

    return email_data
```

Examples

Example 1: Research Paper Extraction

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

def extract_paper(pdf_path):
    """Extract structured data from a research paper."""
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",
        infer_table_structure=True,
        include_page_breaks=True
    )

    paper = {
        'title': None,
        'abstract': None,
        'sections': [],
        'tables': [],
        'references': []
    }

    # Find title (usually the first Title element)
    for element in elements:
        if element.category == "Title" and not paper['title']:
            paper['title'] = element.text
            break

    # Extract tables
    for element in elements:
        if element.category == "Table":
            paper['tables'].append({
                'page': element.metadata.page_number,
                'content': element.text,
                'html': element.metadata.text_as_html if hasattr(element.metadata, 'text_as_html') else None
            })

    # Chunk into sections
    chunks = chunk_by_title(elements, max_characters=2000)

    for chunk in chunks:
        if chunk.category == "Title":
            paper['sections'].append({
                'title': chunk.text,
                'content': ''
            })
        elif paper['sections']:
            paper['sections'][-1]['content'] += chunk.text + '\n'

    return paper

paper = extract_paper('research_paper.pdf')
print(f"Title: {paper['title']}")
print(f"Tables: {len(paper['tables'])}")
print(f"Sections: {len(paper['sections'])}")
```

Example 2: Invoice Data Extraction

```python
from unstructured.partition.auto import partition
import re

def extract_invoice_data(file_path):
    """Extract key data from an invoice."""
    elements = partition(file_path, strategy="hi_res")

    # Combine all text
    full_text = '\n'.join([e.text for e in elements])

    invoice = {
        'invoice_number': None,
        'date': None,
        'total': None,
        'vendor': None,
        'line_items': [],
        'tables': []
    }

    # Extract patterns
    inv_match = re.search(r'Invoice\s*#?\s*:?\s*(\w+[-\w]*)', full_text, re.I)
    if inv_match:
        invoice['invoice_number'] = inv_match.group(1)

    date_match = re.search(r'Date\s*:?\s*(\d{1,2}[-/]\d{1,2}[-/]\d{2,4})', full_text, re.I)
    if date_match:
        invoice['date'] = date_match.group(1)

    total_match = re.search(r'Total\s*:?\s*\$?([\d,]+\.?\d*)', full_text, re.I)
    if total_match:
        invoice['total'] = float(total_match.group(1).replace(',', ''))

    # Extract tables
    for element in elements:
        if element.category == "Table":
            invoice['tables'].append(element.text)

    return invoice

invoice = extract_invoice_data('invoice.pdf')
print(f"Invoice #: {invoice['invoice_number']}")
print(f"Total: ${invoice['total']}")
```
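The regex patterns from the example can be checked against a synthetic invoice snippet (the sample text below is made up for illustration):

```python
import re

# Made-up invoice text for testing the extraction patterns
sample = "Invoice #: INV-2024-001\nDate: 03/15/2024\nTotal: $1,234.56"

inv = re.search(r'Invoice\s*#?\s*:?\s*(\w+[-\w]*)', sample, re.I)
date = re.search(r'Date\s*:?\s*(\d{1,2}[-/]\d{1,2}[-/]\d{2,4})', sample, re.I)
total = re.search(r'Total\s*:?\s*\$?([\d,]+\.?\d*)', sample, re.I)

print(inv.group(1))    # INV-2024-001
print(date.group(1))   # 03/15/2024
print(float(total.group(1).replace(',', '')))  # 1234.56
```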

Example 3: Document Corpus Builder

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from pathlib import Path
import json

def build_corpus(input_dir, output_path):
    """Build a searchable corpus from a document collection."""
    input_path = Path(input_dir)
    corpus = []

    # Support multiple formats
    patterns = ['*.pdf', '*.docx', '*.html', '*.txt', '*.md']
    files = []
    for pattern in patterns:
        files.extend(input_path.glob(pattern))

    for file in files:
        print(f"Processing: {file.name}")

        try:
            elements = partition(str(file))
            chunks = chunk_by_title(elements, max_characters=1000)

            for i, chunk in enumerate(chunks):
                corpus.append({
                    'id': f"{file.stem}_{i}",
                    'source': str(file),
                    'type': type(chunk).__name__,
                    'text': chunk.text,
                    'page': chunk.metadata.page_number
                })

        except Exception as e:
            print(f"  Error: {e}")

    # Save corpus
    with open(output_path, 'w') as f:
        json.dump(corpus, f, indent=2)

    print(f"Corpus built: {len(corpus)} chunks from {len(files)} files")
    return corpus

corpus = build_corpus('./documents', 'corpus.json')
```

Limitations

  • Complex layouts may need manual review
  • OCR quality depends on image quality
  • Large files may need chunking
  • Some proprietary formats not supported
  • API rate limits for cloud processing

Installation

```bash
# Basic installation
pip install unstructured

# With all dependencies
pip install "unstructured[all-docs]"

# For PDF processing
pip install "unstructured[pdf]"

# For specific formats
pip install "unstructured[docx,pptx,xlsx]"
```

Resources
