table-extractor
Table Extractor
Extract tables from PDFs and images into structured data formats.
Features
- PDF Tables: Extract tables from digital PDFs
- Image Tables: OCR-based extraction from images
- Multiple Tables: Extract all tables from a document
- Format Export: CSV, Excel, JSON output
- Table Detection: Auto-detect table boundaries
- Column Alignment: Smart column detection
- Multi-Page: Process entire PDF documents
Quick Start
```python
from table_extractor import TableExtractor

extractor = TableExtractor()

# Extract from PDF
extractor.load_pdf("document.pdf")
tables = extractor.extract_all()

# Save first table to CSV
tables[0].to_csv("table.csv")

# Extract from image
extractor.load_image("scanned_table.png")
table = extractor.extract_table()
print(table)
```
CLI Usage
```bash
# Extract from PDF
python table_extractor.py --input document.pdf --output tables/

# Extract specific pages
python table_extractor.py --input document.pdf --pages 1-3 --output tables/

# Extract from image
python table_extractor.py --input scan.png --output table.csv

# Export to Excel
python table_extractor.py --input document.pdf --format xlsx --output tables.xlsx

# With OCR for scanned PDFs
python table_extractor.py --input scanned.pdf --ocr --output tables/
```
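The `--pages 1-3` option takes a range, while `load_pdf` in the API Reference accepts a list of page numbers. This README does not show how the CLI converts one into the other, so the following is only a minimal sketch of one way it could work. `parse_page_range` is a hypothetical helper, not part of table_extractor, and treating ranges as inclusive and allowing comma-separated parts are both assumptions.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec such as "1-3" or "1,4-6" into a sorted list of page numbers.

    Hypothetical helper: the real CLI may parse --pages differently.
    Ranges are treated as inclusive, which is an assumption.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_page_range("1-3"))
```

The README does not say whether page numbers are 0-based or 1-based (`extract_table` defaults to `page=0`), so an offset may be needed before passing the result to `load_pdf`.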
API Reference
TableExtractor Class
```python
from typing import Dict, List, Optional

import pandas as pd


class TableExtractor:
    def __init__(self): ...

    # Loading
    def load_pdf(self, filepath: str, pages: Optional[List[int]] = None) -> 'TableExtractor': ...
    def load_image(self, filepath: str) -> 'TableExtractor': ...

    # Extraction
    def extract_table(self, page: int = 0) -> pd.DataFrame: ...
    def extract_all(self) -> List[pd.DataFrame]: ...
    def extract_page(self, page: int) -> List[pd.DataFrame]: ...

    # Detection
    def detect_tables(self, page: int = 0) -> List[Dict]: ...
    def get_table_count(self) -> int: ...

    # Configuration
    def set_ocr(self, enabled: bool = True, lang: str = "eng") -> 'TableExtractor': ...
    def set_column_detection(self, mode: str = "auto") -> 'TableExtractor': ...

    # Export
    def to_csv(self, tables: List, output_dir: str) -> List[str]: ...
    def to_excel(self, tables: List, output: str) -> str: ...
    def to_json(self, tables: List, output: str) -> str: ...
```
Supported Formats
Input
- PDF documents (text-based and scanned)
- Images: PNG, JPEG, TIFF, BMP
- Screenshots with tables
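Because PDFs and images go through different loaders (`load_pdf` and `load_image`), a caller handling mixed input needs to choose between them. The sketch below shows one way to do that by file extension. `loader_for` is a hypothetical helper rather than part of table_extractor, and the `.jpg` and `.tif` aliases are assumptions added to the formats listed above.

```python
from pathlib import Path

# Extensions taken from the Input list; ".jpg" and ".tif" are added
# as common aliases, which is an assumption.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp"}


def loader_for(filepath: str) -> str:
    """Return the name of the TableExtractor method suited to this file."""
    suffix = Path(filepath).suffix.lower()
    if suffix == ".pdf":
        return "load_pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "load_image"
    raise ValueError(f"unsupported input format: {suffix or filepath!r}")
```

With an extractor instance at hand, the result could be used as `getattr(extractor, loader_for(path))(path)`.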
Output
- CSV (one file per table)
- Excel (multiple sheets)
- JSON (array of tables)
- Pandas DataFrame
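The README describes the JSON output only as an "array of tables" and does not specify its layout. As a rough illustration of what such an array might look like, the sketch below serializes tables given as lists of row dictionaries. The `index`, `columns`, and `data` keys are an assumed shape, and `tables_to_json` is a hypothetical function, not the library's `to_json`. A DataFrame can be turned into row dictionaries with `df.to_dict(orient="records")`.

```python
import json
from typing import Any, Dict, List

Row = Dict[str, Any]


def tables_to_json(tables: List[List[Row]]) -> str:
    """Serialize tables (each a list of row dicts) as one JSON array.

    The layout (index, columns, data) is an assumed shape, not the
    library's documented output.
    """
    payload = []
    for index, rows in enumerate(tables):
        # Column order is taken from the first row; empty tables have none.
        columns = list(rows[0].keys()) if rows else []
        payload.append({"index": index, "columns": columns, "data": rows})
    return json.dumps(payload, ensure_ascii=False, indent=2)
```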
Table Detection
```python
# Detect tables without extracting
tables_info = extractor.detect_tables(page=0)

# Returns:
# [
#     {"index": 0, "rows": 10, "cols": 5, "bbox": (x1, y1, x2, y2)},
#     {"index": 1, "rows": 8, "cols": 3, "bbox": (x1, y1, x2, y2)}
# ]
```
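Detection results in this shape can be screened before running the heavier extraction step. The following sketch drops very small detections and orders the rest by bounding-box area. `select_tables` and `bbox_area` are hypothetical helpers, the thresholds are arbitrary, and the sample coordinates are made up for illustration.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]


def bbox_area(bbox: BBox) -> float:
    """Area of an (x1, y1, x2, y2) box, regardless of corner order."""
    x1, y1, x2, y2 = bbox
    return abs(x2 - x1) * abs(y2 - y1)


def select_tables(tables_info: List[Dict], min_rows: int = 2, min_cols: int = 2) -> List[Dict]:
    """Keep detections that meet the size thresholds, largest area first."""
    kept = [t for t in tables_info if t["rows"] >= min_rows and t["cols"] >= min_cols]
    return sorted(kept, key=lambda t: bbox_area(t["bbox"]), reverse=True)


# Sample data in the documented shape; the coordinates are made up.
sample = [
    {"index": 0, "rows": 10, "cols": 5, "bbox": (50, 100, 550, 400)},
    {"index": 1, "rows": 8, "cols": 3, "bbox": (50, 450, 300, 600)},
    {"index": 2, "rows": 1, "cols": 1, "bbox": (0, 0, 20, 10)},
]
print([t["index"] for t in select_tables(sample)])  # [0, 1]
```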
Example Workflows
PDF Report Tables
```python
from table_extractor import TableExtractor

extractor = TableExtractor()
extractor.load_pdf("quarterly_report.pdf")

# Extract all tables
tables = extractor.extract_all()

# Export each to CSV
for i, table in enumerate(tables):
    table.to_csv(f"table_{i}.csv", index=False)
```
Scanned Document
```python
from table_extractor import TableExtractor

extractor = TableExtractor()
extractor.set_ocr(enabled=True, lang="eng")
extractor.load_image("scanned_form.png")
table = extractor.extract_table()
print(table)
```
Dependencies
- pdfplumber>=0.10.0
- pillow>=10.0.0
- pandas>=2.0.0
- pytesseract>=0.3.10 (for OCR)
- opencv-python>=4.8.0
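Before running an extraction, it can help to confirm that these packages are importable. The sketch below is one possible check using only the standard library. `check_dependencies` is a hypothetical helper, and it tests only whether each module can be found, not whether the installed version meets the minimums listed above.

```python
from importlib.util import find_spec
from typing import Dict

# Import names differ from the package names for Pillow and opencv-python.
REQUIRED_MODULES = {
    "pdfplumber": "pdfplumber",
    "pillow": "PIL",
    "pandas": "pandas",
    "pytesseract": "pytesseract",
    "opencv-python": "cv2",
}


def check_dependencies(modules: Dict[str, str] = REQUIRED_MODULES) -> Dict[str, bool]:
    """Map each package name to whether its module can be found."""
    return {package: find_spec(module) is not None for package, module in modules.items()}


if __name__ == "__main__":
    for package, available in check_dependencies().items():
        print(f"{package}: {'ok' if available else 'missing'}")
```

Note that pytesseract is a wrapper around the Tesseract OCR engine, which has to be installed separately from the Python package.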