
Data Extractor Skill

Overview

This skill enables extraction of structured data from any document format using unstructured - a unified library for processing PDFs, Word docs, emails, HTML, and more. Get consistent, structured output regardless of input format.

How to Use

  1. Provide the document to process
  2. Optionally specify extraction options
  3. I'll extract structured elements with metadata
Example prompts:
  • "Extract all text and tables from this PDF"
  • "Parse this email and get the body, attachments, and metadata"
  • "Convert this HTML page to structured elements"
  • "Extract data from these mixed-format documents"

Domain Knowledge

unstructured Fundamentals

```python
from unstructured.partition.auto import partition

# Automatically detect and process any document
elements = partition("document.pdf")

# Access extracted elements
for element in elements:
    print(f"Type: {type(element).__name__}")
    print(f"Text: {element.text}")
    print(f"Metadata: {element.metadata}")
```

Supported Formats

| Format | Function | Notes |
| --- | --- | --- |
| PDF | `partition_pdf` | Native + scanned |
| Word | `partition_docx` | Full structure |
| PowerPoint | `partition_pptx` | Slides & notes |
| Excel | `partition_xlsx` | Sheets & tables |
| Email | `partition_email` | Body & attachments |
| HTML | `partition_html` | Tags preserved |
| Markdown | `partition_md` | Structure preserved |
| Plain Text | `partition_text` | Basic parsing |
| Images | `partition_image` | OCR extraction |
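The table above amounts to a lookup from file extension to partitioner. As a sketch (illustrative only: `partition()` already performs this dispatch internally, and `partitioner_for` is a hypothetical helper, not part of the library):

```python
from pathlib import Path

# Illustrative mapping from extension to the partitioner named above
PARTITIONERS = {
    ".pdf": "partition_pdf",
    ".docx": "partition_docx",
    ".pptx": "partition_pptx",
    ".xlsx": "partition_xlsx",
    ".eml": "partition_email",
    ".html": "partition_html",
    ".md": "partition_md",
    ".txt": "partition_text",
    ".png": "partition_image",
}

def partitioner_for(filename: str) -> str:
    """Return the partition function name for a file, by extension."""
    suffix = Path(filename).suffix.lower()
    return PARTITIONERS.get(suffix, "partition")  # fall back to auto
```

Unknown extensions fall back to the auto partitioner, mirroring how `partition()` is the safe default.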

Element Types

```python
from unstructured.documents.elements import (
    Title,
    NarrativeText,
    Text,
    ListItem,
    Table,
    Image,
    Header,
    Footer,
    PageBreak,
    Address,
    EmailAddress,
)

# Elements have a consistent structure
element.text      # Raw text content
element.metadata  # Rich metadata
element.category  # Element type
element.id        # Unique identifier
```

Auto Partition

```python
from unstructured.partition.auto import partition
from unstructured.documents.elements import Title, Table

# Process any file type
elements = partition(
    filename="document.pdf",
    strategy="auto",  # or "fast", "hi_res", "ocr_only"
    include_metadata=True,
    include_page_breaks=True,
)

# Filter by type
titles = [e for e in elements if isinstance(e, Title)]
tables = [e for e in elements if isinstance(e, Table)]
```

Format-Specific Partitioning

```python
# PDF with options
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="document.pdf",
    strategy="hi_res",           # High quality extraction
    infer_table_structure=True,  # Detect tables
    include_page_breaks=True,
    languages=["en"],            # OCR language
)

# Word documents
from unstructured.partition.docx import partition_docx

elements = partition_docx(
    filename="document.docx",
    include_metadata=True,
)

# HTML
from unstructured.partition.html import partition_html

elements = partition_html(
    filename="page.html",
    include_metadata=True,
)
```

Working with Tables

```python
from unstructured.partition.auto import partition

elements = partition("report.pdf", infer_table_structure=True)

# Extract tables
for element in elements:
    if element.category == "Table":
        print("Table found:")
        print(element.text)

        # Access structured table data
        if hasattr(element, 'metadata') and element.metadata.text_as_html:
            print("HTML:", element.metadata.text_as_html)
```

Metadata Access

```python
from unstructured.partition.auto import partition

elements = partition("document.pdf")

for element in elements:
    meta = element.metadata

    # Common metadata fields
    print(f"Page: {meta.page_number}")
    print(f"Filename: {meta.filename}")
    print(f"Filetype: {meta.filetype}")
    print(f"Coordinates: {meta.coordinates}")
    print(f"Languages: {meta.languages}")
```
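Fields like `page_number` make simple aggregations straightforward. A minimal sketch, using plain dicts to stand in for parsed elements (illustration only, not the unstructured API):

```python
from collections import defaultdict

# Plain dicts standing in for parsed elements (illustration only)
elements = [
    {"text": "Intro", "page": 1},
    {"text": "Details", "page": 2},
    {"text": "More", "page": 2},
]

def group_by_page(elements):
    """Group element texts by their page number."""
    pages = defaultdict(list)
    for el in elements:
        pages[el["page"]].append(el["text"])
    return dict(pages)
```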

Chunking for AI/RAG

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# Partition document
elements = partition("document.pdf")

# Chunk by title (semantic chunks)
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200,
)

# Or basic chunking
chunks = chunk_elements(
    elements,
    max_characters=500,
    overlap=50,
)

for chunk in chunks:
    print(f"Chunk ({len(chunk.text)} chars):")
    print(chunk.text[:100] + "...")
```
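To make the `max_characters`/`overlap` semantics concrete, here is a stdlib-only sketch of character-window chunking. It is a simplification, assuming raw-character splits; the real `chunk_elements` splits on element boundaries:

```python
def chunk_text(text: str, max_characters: int = 500, overlap: int = 50):
    """Split text into windows of at most max_characters,
    each overlapping the previous one by `overlap` characters."""
    if overlap >= max_characters:
        raise ValueError("overlap must be smaller than max_characters")
    chunks, step = [], max_characters - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break
    return chunks
```

Each window shares its first `overlap` characters with the tail of the previous one, which helps keep sentences intact across chunk boundaries in RAG pipelines.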

Batch Processing

```python
from unstructured.partition.auto import partition
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def process_document(file_path):
    """Process a single document."""
    try:
        elements = partition(str(file_path))
        return {
            'file': str(file_path),
            'status': 'success',
            'elements': len(elements),
            'text': '\n\n'.join([e.text for e in elements])
        }
    except Exception as e:
        return {
            'file': str(file_path),
            'status': 'error',
            'error': str(e)
        }

def batch_process(input_dir, max_workers=4):
    """Process all documents in a directory."""
    input_path = Path(input_dir)
    files = list(input_path.glob('*'))

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_document, files))

    return results
```

Export Formats

```python
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json, elements_to_dicts

elements = partition("document.pdf")

# To JSON string
json_str = elements_to_json(elements)

# To list of dicts
dicts = elements_to_dicts(elements)

# To DataFrame
import pandas as pd

df = pd.DataFrame(dicts)
```

Best Practices

  1. Choose Strategy Wisely: "fast" for speed, "hi_res" for accuracy
  2. Enable Table Detection: turn on infer_table_structure for documents with tables
  3. Specify Language: set OCR languages for better results on non-English docs
  4. Chunk for RAG: use semantic chunking (chunk_by_title) for AI applications
  5. Handle Errors: wrap parsing in try/except so unsupported or corrupt files fail gracefully
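Practice 1 can be sketched as a small helper; `pick_strategy` is hypothetical (not part of unstructured), and the heuristic is one reasonable default, not a rule:

```python
def pick_strategy(filename: str, scanned: bool = False) -> str:
    """Pick a partition strategy (sketch, not library API).

    Heuristic: OCR-only for scanned documents, hi_res for PDFs
    (tables and layout), fast for everything else.
    """
    if scanned:
        return "ocr_only"
    if filename.lower().endswith(".pdf"):
        return "hi_res"
    return "fast"
```

The returned string would be passed as `strategy=` to `partition()`.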

Common Patterns

Document to JSON

```python
def document_to_json(file_path, output_path=None):
    """Convert document to structured JSON."""
    from unstructured.partition.auto import partition
    import json

    elements = partition(file_path)

    # Create structured output
    output = {
        'source': file_path,
        'elements': []
    }

    for element in elements:
        output['elements'].append({
            'type': type(element).__name__,
            'text': element.text,
            'metadata': {
                'page': element.metadata.page_number,
                'coordinates': element.metadata.coordinates.to_dict() if element.metadata.coordinates else None
            }
        })

    if output_path:
        with open(output_path, 'w') as f:
            json.dump(output, f, indent=2)

    return output
```

Email Parser

```python
from unstructured.partition.email import partition_email

def parse_email(email_path):
    """Extract structured data from an email."""
    elements = partition_email(email_path)

    email_data = {
        'subject': None,
        'from': None,
        'to': [],
        'date': None,
        'body': [],
        'attachments': []
    }

    for element in elements:
        meta = element.metadata

        # Extract headers from metadata
        if meta.subject:
            email_data['subject'] = meta.subject
        if meta.sent_from:
            email_data['from'] = meta.sent_from
        if meta.sent_to:
            email_data['to'] = meta.sent_to

        # Body content
        email_data['body'].append({
            'type': type(element).__name__,
            'text': element.text
        })

    return email_data
```

Examples

Example 1: Research Paper Extraction

```python
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

def extract_paper(pdf_path):
    """Extract structured data from a research paper."""
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",
        infer_table_structure=True,
        include_page_breaks=True
    )

    paper = {
        'title': None,
        'abstract': None,
        'sections': [],
        'tables': [],
        'references': []
    }

    # Find title (usually the first Title element)
    for element in elements:
        if element.category == "Title" and not paper['title']:
            paper['title'] = element.text
            break

    # Extract tables
    for element in elements:
        if element.category == "Table":
            paper['tables'].append({
                'page': element.metadata.page_number,
                'content': element.text,
                'html': element.metadata.text_as_html if hasattr(element.metadata, 'text_as_html') else None
            })

    # Chunk into sections
    chunks = chunk_by_title(elements, max_characters=2000)

    for chunk in chunks:
        if chunk.category == "Title":
            paper['sections'].append({
                'title': chunk.text,
                'content': ''
            })
        elif paper['sections']:
            paper['sections'][-1]['content'] += chunk.text + '\n'

    return paper

paper = extract_paper('research_paper.pdf')
print(f"Title: {paper['title']}")
print(f"Tables: {len(paper['tables'])}")
print(f"Sections: {len(paper['sections'])}")
```

Example 2: Invoice Data Extraction

```python
from unstructured.partition.auto import partition
import re

def extract_invoice_data(file_path):
    """Extract key data from an invoice."""
    elements = partition(file_path, strategy="hi_res")

    # Combine all text
    full_text = '\n'.join([e.text for e in elements])

    invoice = {
        'invoice_number': None,
        'date': None,
        'total': None,
        'vendor': None,
        'line_items': [],
        'tables': []
    }

    # Extract patterns
    inv_match = re.search(r'Invoice\s*#?\s*:?\s*(\w+[-\w]*)', full_text, re.I)
    if inv_match:
        invoice['invoice_number'] = inv_match.group(1)

    date_match = re.search(r'Date\s*:?\s*(\d{1,2}[-/]\d{1,2}[-/]\d{2,4})', full_text, re.I)
    if date_match:
        invoice['date'] = date_match.group(1)

    total_match = re.search(r'Total\s*:?\s*\$?([\d,]+\.?\d*)', full_text, re.I)
    if total_match:
        invoice['total'] = float(total_match.group(1).replace(',', ''))

    # Extract tables
    for element in elements:
        if element.category == "Table":
            invoice['tables'].append(element.text)

    return invoice

invoice = extract_invoice_data('invoice.pdf')
print(f"Invoice #: {invoice['invoice_number']}")
print(f"Total: ${invoice['total']}")
```
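The regex patterns from the example can be checked against a synthetic invoice snippet (the sample text below is made up for illustration):

```python
import re

# Made-up invoice text for testing the extraction patterns
sample = "Invoice #: INV-2024-001\nDate: 03/15/2024\nTotal: $1,234.56"

inv = re.search(r'Invoice\s*#?\s*:?\s*(\w+[-\w]*)', sample, re.I)
date = re.search(r'Date\s*:?\s*(\d{1,2}[-/]\d{1,2}[-/]\d{2,4})', sample, re.I)
total = re.search(r'Total\s*:?\s*\$?([\d,]+\.?\d*)', sample, re.I)

print(inv.group(1))    # INV-2024-001
print(date.group(1))   # 03/15/2024
print(float(total.group(1).replace(',', '')))  # 1234.56
```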

Example 3: Document Corpus Builder

```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from pathlib import Path
import json

def build_corpus(input_dir, output_path):
    """Build a searchable corpus from a document collection."""
    input_path = Path(input_dir)
    corpus = []

    # Support multiple formats
    patterns = ['*.pdf', '*.docx', '*.html', '*.txt', '*.md']
    files = []
    for pattern in patterns:
        files.extend(input_path.glob(pattern))

    for file in files:
        print(f"Processing: {file.name}")

        try:
            elements = partition(str(file))
            chunks = chunk_by_title(elements, max_characters=1000)

            for i, chunk in enumerate(chunks):
                corpus.append({
                    'id': f"{file.stem}_{i}",
                    'source': str(file),
                    'type': type(chunk).__name__,
                    'text': chunk.text,
                    'page': chunk.metadata.page_number
                })

        except Exception as e:
            print(f"  Error: {e}")

    # Save corpus
    with open(output_path, 'w') as f:
        json.dump(corpus, f, indent=2)

    print(f"Corpus built: {len(corpus)} chunks from {len(files)} files")
    return corpus

corpus = build_corpus('./documents', 'corpus.json')
```

Limitations

  • Complex layouts may need manual review
  • OCR quality depends on image quality
  • Large files may need chunking
  • Some proprietary formats not supported
  • API rate limits for cloud processing

Installation

```bash
# Basic installation
pip install unstructured

# With all dependencies
pip install "unstructured[all-docs]"

# For PDF processing
pip install "unstructured[pdf]"

# For specific formats
pip install "unstructured[docx,pptx,xlsx]"
```

Resources
