data-extractor
Data Extractor Skill
Overview
This skill enables extraction of structured data from any document format using unstructured - a unified library for processing PDFs, Word docs, emails, HTML, and more. Get consistent, structured output regardless of input format.
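That uniformity is the core idea: every format comes back as a list of elements with the same shape. As a rough mental model (an illustrative sketch, not the library's actual classes), each element looks like:

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedElement:
    """Illustrative stand-in for an unstructured element."""
    text: str          # raw text content
    category: str      # e.g. "Title", "NarrativeText", "Table"
    metadata: dict = field(default_factory=dict)  # page number, filename, ...


# The same shape comes back whether the input was a PDF, DOCX, or email
el = ExtractedElement(
    text="Quarterly Report",
    category="Title",
    metadata={"page_number": 1, "filetype": "application/pdf"},
)
```

Downstream code can then branch on `category` without caring about the source format.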
How to Use
- Provide the document to process
- Optionally specify extraction options
- I'll extract structured elements with metadata
Example prompts:
- "Extract all text and tables from this PDF"
- "Parse this email and get the body, attachments, and metadata"
- "Convert this HTML page to structured elements"
- "Extract data from these mixed-format documents"
Domain Knowledge
unstructured Fundamentals
```python
from unstructured.partition.auto import partition

# Automatically detect and process any document
elements = partition("document.pdf")

# Access extracted elements
for element in elements:
    print(f"Type: {type(element).__name__}")
    print(f"Text: {element.text}")
    print(f"Metadata: {element.metadata}")
```

Supported Formats
| Format | Function | Notes |
|---|---|---|
| PDF | `partition_pdf` | Native + scanned |
| Word | `partition_docx` | Full structure |
| PowerPoint | `partition_pptx` | Slides & notes |
| Excel | `partition_xlsx` | Sheets & tables |
| Email | `partition_email` | Body & attachments |
| HTML | `partition_html` | Tags preserved |
| Markdown | `partition_md` | Structure preserved |
| Plain Text | `partition_text` | Basic parsing |
| Images | `partition_image` | OCR extraction |
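`partition()` routes each file to one of the format-specific functions above based on its detected type. A simplified sketch of that dispatch, keyed on extension only (the real library also sniffs file content; the mapping here is illustrative):

```python
from pathlib import Path

# Illustrative extension -> partitioner mapping (not the library's actual table)
PARTITIONERS = {
    ".pdf": "partition_pdf",
    ".docx": "partition_docx",
    ".pptx": "partition_pptx",
    ".xlsx": "partition_xlsx",
    ".eml": "partition_email",
    ".html": "partition_html",
    ".md": "partition_md",
    ".txt": "partition_text",
    ".png": "partition_image",
}


def pick_partitioner(filename: str) -> str:
    """Return the name of the partitioner a file would route to."""
    suffix = Path(filename).suffix.lower()
    # Fall back to plain-text parsing for unknown extensions
    return PARTITIONERS.get(suffix, "partition_text")
```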
Element Types
```python
from unstructured.documents.elements import (
    Title,
    NarrativeText,
    Text,
    ListItem,
    Table,
    Image,
    Header,
    Footer,
    PageBreak,
    Address,
    EmailAddress,
)

# Elements have a consistent structure
element.text      # Raw text content
element.metadata  # Rich metadata
element.category  # Element type
element.id        # Unique identifier
```

Auto Partition
```python
from unstructured.partition.auto import partition
from unstructured.documents.elements import Title, Table

# Process any file type
elements = partition(
    filename="document.pdf",
    strategy="auto",  # or "fast", "hi_res", "ocr_only"
    include_metadata=True,
    include_page_breaks=True,
)

# Filter by type
titles = [e for e in elements if isinstance(e, Title)]
tables = [e for e in elements if isinstance(e, Table)]
```

Format-Specific Partitioning
```python
# PDF with options
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="document.pdf",
    strategy="hi_res",           # High quality extraction
    infer_table_structure=True,  # Detect tables
    include_page_breaks=True,
    languages=["en"],            # OCR language
)

# Word documents
from unstructured.partition.docx import partition_docx

elements = partition_docx(
    filename="document.docx",
    include_metadata=True,
)

# HTML
from unstructured.partition.html import partition_html

elements = partition_html(
    filename="page.html",
    include_metadata=True,
)
```

Working with Tables
```python
from unstructured.partition.auto import partition

elements = partition("report.pdf", infer_table_structure=True)

# Extract tables
for element in elements:
    if element.category == "Table":
        print("Table found:")
        print(element.text)
        # Access structured table data
        if element.metadata.text_as_html:
            print("HTML:", element.metadata.text_as_html)
```

Metadata Access
```python
from unstructured.partition.auto import partition

elements = partition("document.pdf")

for element in elements:
    meta = element.metadata
    # Common metadata fields
    print(f"Page: {meta.page_number}")
    print(f"Filename: {meta.filename}")
    print(f"Filetype: {meta.filetype}")
    print(f"Coordinates: {meta.coordinates}")
    print(f"Languages: {meta.languages}")
```

Chunking for AI/RAG
```python
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# Partition document
elements = partition("document.pdf")

# Chunk by title (semantic chunks)
chunks = chunk_by_title(
    elements,
    max_characters=1000,
    combine_text_under_n_chars=200,
)

# Or basic chunking
chunks = chunk_elements(
    elements,
    max_characters=500,
    overlap=50,
)

for chunk in chunks:
    print(f"Chunk ({len(chunk.text)} chars):")
    print(chunk.text[:100] + "...")
```
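The effect of `overlap` can be pictured with a character-level sliding window (a simplified sketch of the idea; the library actually splits on element boundaries, not raw characters):

```python
def sliding_chunks(text, max_characters=500, overlap=50):
    """Split text into windows, each sharing `overlap` chars with the previous one."""
    step = max_characters - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break
    return chunks


text = "".join(chr(65 + i % 26) for i in range(1200))
chunks = sliding_chunks(text, max_characters=500, overlap=50)
# Consecutive chunks share 50 characters, so a retrieval hit near a
# chunk boundary still carries surrounding context.
```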
Batch Processing
```python
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

from unstructured.partition.auto import partition


def process_document(file_path):
    """Process a single document."""
    try:
        elements = partition(str(file_path))
        return {
            'file': str(file_path),
            'status': 'success',
            'elements': len(elements),
            'text': '\n\n'.join([e.text for e in elements]),
        }
    except Exception as e:
        return {
            'file': str(file_path),
            'status': 'error',
            'error': str(e),
        }


def batch_process(input_dir, max_workers=4):
    """Process all documents in a directory."""
    files = [f for f in Path(input_dir).glob('*') if f.is_file()]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process_document, files))
    return results
```

Export Formats
```python
import pandas as pd

from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json, elements_to_dicts

elements = partition("document.pdf")

# To JSON string
json_str = elements_to_json(elements)

# To list of dicts
dicts = elements_to_dicts(elements)

# To DataFrame
df = pd.DataFrame(dicts)
```

Best Practices
- Choose Strategy Wisely: use "fast" for speed, "hi_res" for accuracy
- Enable Table Detection: set `infer_table_structure=True` for documents containing tables
- Specify Language: pass OCR languages for better results on non-English documents
- Chunk for RAG: use semantic chunking (`chunk_by_title`) for AI applications
- Handle Errors: wrap partitioning in try/except so unsupported or corrupt files fail gracefully
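The strategy and error-handling points combine naturally into a fallback wrapper: try the accurate strategy first, then degrade. A minimal sketch, where the `partition_fn` argument stands in for `unstructured.partition.auto.partition`:

```python
def partition_with_fallback(filename, partition_fn, strategies=("hi_res", "fast")):
    """Try partition strategies in order; return (strategy, elements) on first success."""
    last_error = None
    for strategy in strategies:
        try:
            return strategy, partition_fn(filename, strategy=strategy)
        except Exception as e:
            last_error = e  # remember why this strategy failed
    raise RuntimeError(f"All strategies failed for {filename}") from last_error
```

Passing the partitioner in as an argument keeps the helper testable without any documents on disk.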
Common Patterns
Document to JSON
```python
import json

from unstructured.partition.auto import partition


def document_to_json(file_path, output_path=None):
    """Convert a document to structured JSON."""
    elements = partition(file_path)
    # Create structured output
    output = {
        'source': file_path,
        'elements': [],
    }
    for element in elements:
        meta = element.metadata
        output['elements'].append({
            'type': type(element).__name__,
            'text': element.text,
            'metadata': {
                'page': meta.page_number,
                'coordinates': meta.coordinates.to_dict() if meta.coordinates else None,
            },
        })
    if output_path:
        with open(output_path, 'w') as f:
            json.dump(output, f, indent=2)
    return output
```

Email Parser
```python
from unstructured.partition.email import partition_email


def parse_email(email_path):
    """Extract structured data from an email."""
    elements = partition_email(email_path)
    email_data = {
        'subject': None,
        'from': None,
        'to': [],
        'date': None,
        'body': [],
        'attachments': [],
    }
    for element in elements:
        meta = element.metadata
        # Extract headers from metadata
        if meta.subject:
            email_data['subject'] = meta.subject
        if meta.sent_from:
            email_data['from'] = meta.sent_from
        if meta.sent_to:
            email_data['to'] = meta.sent_to
        # Body content
        email_data['body'].append({
            'type': type(element).__name__,
            'text': element.text,
        })
    return email_data
```

Examples
Example 1: Research Paper Extraction
```python
from unstructured.partition.pdf import partition_pdf


def extract_paper(pdf_path):
    """Extract structured data from a research paper."""
    elements = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",
        infer_table_structure=True,
        include_page_breaks=True,
    )
    paper = {
        'title': None,
        'abstract': None,
        'sections': [],
        'tables': [],
        'references': [],
    }
    # Find title (usually the first Title element)
    for element in elements:
        if element.category == "Title":
            paper['title'] = element.text
            break
    # Extract tables
    for element in elements:
        if element.category == "Table":
            paper['tables'].append({
                'page': element.metadata.page_number,
                'content': element.text,
                'html': getattr(element.metadata, 'text_as_html', None),
            })
    # Group content into sections, starting a new section at each Title element
    for element in elements:
        if element.category == "Title":
            paper['sections'].append({
                'title': element.text,
                'content': '',
            })
        elif paper['sections']:
            paper['sections'][-1]['content'] += element.text + '\n'
    return paper


paper = extract_paper('research_paper.pdf')
print(f"Title: {paper['title']}")
print(f"Tables: {len(paper['tables'])}")
print(f"Sections: {len(paper['sections'])}")
```

Example 2: Invoice Data Extraction
```python
import re

from unstructured.partition.auto import partition


def extract_invoice_data(file_path):
    """Extract key data from an invoice."""
    elements = partition(file_path, strategy="hi_res")
    # Combine all text
    full_text = '\n'.join([e.text for e in elements])
    invoice = {
        'invoice_number': None,
        'date': None,
        'total': None,
        'vendor': None,
        'line_items': [],
        'tables': [],
    }
    # Extract patterns
    inv_match = re.search(r'Invoice\s*#?\s*:?\s*(\w+[-\w]*)', full_text, re.I)
    if inv_match:
        invoice['invoice_number'] = inv_match.group(1)
    date_match = re.search(r'Date\s*:?\s*(\d{1,2}[-/]\d{1,2}[-/]\d{2,4})', full_text, re.I)
    if date_match:
        invoice['date'] = date_match.group(1)
    total_match = re.search(r'Total\s*:?\s*\$?([\d,]+\.?\d*)', full_text, re.I)
    if total_match:
        invoice['total'] = float(total_match.group(1).replace(',', ''))
    # Extract tables
    for element in elements:
        if element.category == "Table":
            invoice['tables'].append(element.text)
    return invoice


invoice = extract_invoice_data('invoice.pdf')
print(f"Invoice #: {invoice['invoice_number']}")
print(f"Total: ${invoice['total']}")
```
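The extraction regexes are easy to sanity-check against synthetic invoice text before trusting them on real documents:

```python
import re

sample = "Invoice #: INV-2024-001\nDate: 03/15/2024\nTotal: $1,234.56"

inv = re.search(r'Invoice\s*#?\s*:?\s*(\w+[-\w]*)', sample, re.I)
date = re.search(r'Date\s*:?\s*(\d{1,2}[-/]\d{1,2}[-/]\d{2,4})', sample, re.I)
total = re.search(r'Total\s*:?\s*\$?([\d,]+\.?\d*)', sample, re.I)

# Each pattern captures the value after its label
print(inv.group(1), date.group(1), float(total.group(1).replace(",", "")))
```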
Example 3: Document Corpus Builder
```python
import json
from pathlib import Path

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title


def build_corpus(input_dir, output_path):
    """Build a searchable corpus from a document collection."""
    input_path = Path(input_dir)
    corpus = []
    # Support multiple formats
    patterns = ['*.pdf', '*.docx', '*.html', '*.txt', '*.md']
    files = []
    for pattern in patterns:
        files.extend(input_path.glob(pattern))
    for file in files:
        print(f"Processing: {file.name}")
        try:
            elements = partition(str(file))
            chunks = chunk_by_title(elements, max_characters=1000)
            for i, chunk in enumerate(chunks):
                corpus.append({
                    'id': f"{file.stem}_{i}",
                    'source': str(file),
                    'type': type(chunk).__name__,
                    'text': chunk.text,
                    'page': chunk.metadata.page_number,
                })
        except Exception as e:
            print(f"  Error: {e}")
    # Save corpus
    with open(output_path, 'w') as f:
        json.dump(corpus, f, indent=2)
    print(f"Corpus built: {len(corpus)} chunks from {len(files)} files")
    return corpus


corpus = build_corpus('./documents', 'corpus.json')
```

Limitations
- Complex layouts may need manual review
- OCR quality depends on image quality
- Large files may need chunking
- Some proprietary formats not supported
- API rate limits for cloud processing
Installation
```bash
# Basic installation
pip install unstructured

# With all dependencies
pip install "unstructured[all-docs]"

# For PDF processing
pip install "unstructured[pdf]"

# For specific formats
pip install "unstructured[docx,pptx,xlsx]"
```