Table Extractor

Extract tables from PDFs and images into structured data formats.

Features

  • PDF Tables: Extract tables from digital PDFs
  • Image Tables: OCR-based extraction from images
  • Multiple Tables: Extract all tables from a document
  • Format Export: CSV, Excel, JSON output
  • Table Detection: Auto-detect table boundaries
  • Column Alignment: Smart column detection
  • Multi-Page: Process entire PDF documents

Quick Start

python
from table_extractor import TableExtractor

extractor = TableExtractor()

# Extract from PDF
extractor.load_pdf("document.pdf")
tables = extractor.extract_all()

# Save first table to CSV
tables[0].to_csv("table.csv")

# Extract from image
extractor.load_image("scanned_table.png")
table = extractor.extract_table()
print(table)

CLI Usage

bash
# Extract from PDF
python table_extractor.py --input document.pdf --output tables/

# Extract specific pages
python table_extractor.py --input document.pdf --pages 1-3 --output tables/

# Extract from image
python table_extractor.py --input scan.png --output table.csv

# Export to Excel
python table_extractor.py --input document.pdf --format xlsx --output tables.xlsx

# With OCR for scanned PDFs
python table_extractor.py --input scanned.pdf --ocr --output tables/
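The `--pages 1-3` flag takes a range, while `load_pdf` in the API reference takes a list of integers. How the tool converts one into the other is not documented here. As a rough sketch, a hypothetical helper `parse_pages` might expand the range like this, assuming the range is inclusive and that comma-separated parts are allowed (both are assumptions, as is the page numbering base):

```python
from typing import List


def parse_pages(spec: str) -> List[int]:
    """Expand a page spec such as "1-3,5" into a sorted list of page numbers.

    Hypothetical helper: the real CLI may use a different syntax or
    zero-based numbering.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))  # inclusive on both ends
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_pages("1-3"))    # [1, 2, 3]
print(parse_pages("1-3,5"))  # [1, 2, 3, 5]
```

The resulting list could then be passed as the `pages` argument of `load_pdf`, after adjusting for whichever numbering base the library actually uses.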

API Reference

TableExtractor Class

python
from typing import Dict, List, Optional

import pandas as pd


class TableExtractor:
    def __init__(self) -> None: ...

    # Loading
    def load_pdf(self, filepath: str, pages: Optional[List[int]] = None) -> "TableExtractor": ...
    def load_image(self, filepath: str) -> "TableExtractor": ...

    # Extraction
    def extract_table(self, page: int = 0) -> pd.DataFrame: ...
    def extract_all(self) -> List[pd.DataFrame]: ...
    def extract_page(self, page: int) -> List[pd.DataFrame]: ...

    # Detection
    def detect_tables(self, page: int = 0) -> List[Dict]: ...
    def get_table_count(self) -> int: ...

    # Configuration
    def set_ocr(self, enabled: bool = True, lang: str = "eng") -> "TableExtractor": ...
    def set_column_detection(self, mode: str = "auto") -> "TableExtractor": ...

    # Export
    def to_csv(self, tables: List, output_dir: str) -> List[str]: ...
    def to_excel(self, tables: List, output: str) -> str: ...
    def to_json(self, tables: List, output: str) -> str: ...
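The loading and configuration methods are annotated as returning `'TableExtractor'`, which suggests they return the instance itself so that calls can be chained. The stand-in class below is not the library's code; `ChainableConfig` is a hypothetical name used only to illustrate the pattern under that assumption.

```python
class ChainableConfig:
    """Stand-in for illustration only; not the real TableExtractor."""

    def __init__(self) -> None:
        self.ocr_enabled = False
        self.ocr_lang = "eng"
        self.column_mode = "auto"

    def set_ocr(self, enabled: bool = True, lang: str = "eng") -> "ChainableConfig":
        self.ocr_enabled = enabled
        self.ocr_lang = lang
        return self  # returning self is what makes chaining possible

    def set_column_detection(self, mode: str = "auto") -> "ChainableConfig":
        self.column_mode = mode
        return self


config = ChainableConfig().set_ocr(enabled=True, lang="eng").set_column_detection("auto")
print(config.ocr_enabled, config.ocr_lang, config.column_mode)  # True eng auto
```

If the real class behaves the same way, configuration and loading could be written as a single chained expression instead of separate statements.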

Supported Formats

Input

  • PDF documents (text-based and scanned)
  • Images: PNG, JPEG, TIFF, BMP
  • Screenshots with tables
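Because PDFs and images go through different loading methods (`load_pdf` and `load_image`), a caller handling mixed inputs has to pick one per file. A minimal sketch of that choice follows, assuming dispatch on the file extension; `loader_name` and `IMAGE_SUFFIXES` are hypothetical names, and whether the library accepts the `.jpg` and `.tif` spellings is an assumption.

```python
from pathlib import Path

# Extensions taken from the Input list above; ".jpg" and ".tif" are assumed aliases.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def loader_name(filepath: str) -> str:
    """Return the name of the TableExtractor method that fits this file type."""
    suffix = Path(filepath).suffix.lower()
    if suffix == ".pdf":
        return "load_pdf"
    if suffix in IMAGE_SUFFIXES:
        return "load_image"
    raise ValueError(f"unsupported input type: {suffix or '(none)'}")


print(loader_name("report.PDF"))  # load_pdf
print(loader_name("scan.tiff"))   # load_image
```

With a real extractor instance, `getattr(extractor, loader_name(path))(path)` would then call the matching loader.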

Output

  • CSV (one file per table)
  • Excel (multiple sheets)
  • JSON (array of tables)
  • Pandas DataFrame
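The exact layout of the JSON output is not specified beyond "array of tables". One plausible reading, sketched below with the standard library only, is an outer array with one entry per table, each table being a list of row objects keyed by column name (the shape pandas produces with `DataFrame.to_dict(orient="records")`). The layout is an assumption and the sample rows are made up.

```python
import json

# Two small tables as lists of row dicts. Sample data is invented for illustration.
tables = [
    [{"item": "A", "qty": 2}, {"item": "B", "qty": 5}],
    [{"region": "north", "total": 10}],
]

# Serialise the whole list as one JSON array, then read it back.
payload = json.dumps(tables, indent=2)
restored = json.loads(payload)

print(len(restored))          # 2
print(restored[0][1]["qty"])  # 5
```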

Table Detection

python
# Detect tables without extracting
tables_info = extractor.detect_tables(page=0)

# Returns:
# [
#     {"index": 0, "rows": 10, "cols": 5, "bbox": (x1, y1, x2, y2)},
#     {"index": 1, "rows": 8, "cols": 3, "bbox": (x1, y1, x2, y2)}
# ]
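The metadata returned by `detect_tables` can be used to decide which tables are worth extracting. The sketch below works on sample data in the shape shown above; the bounding-box numbers are invented, their unit (points or pixels) is an assumption, and `bbox_area` and `large_tables` are hypothetical helpers rather than part of the library.

```python
from typing import Dict, List, Tuple

# Sample metadata in the documented shape; bbox values are illustrative only.
tables_info: List[Dict] = [
    {"index": 0, "rows": 10, "cols": 5, "bbox": (50, 100, 550, 400)},
    {"index": 1, "rows": 8, "cols": 3, "bbox": (50, 450, 300, 600)},
]


def bbox_area(bbox: Tuple[float, float, float, float]) -> float:
    """Area of an (x1, y1, x2, y2) box, whatever the corner ordering."""
    x1, y1, x2, y2 = bbox
    return abs(x2 - x1) * abs(y2 - y1)


def large_tables(info: List[Dict], min_rows: int = 10) -> List[int]:
    """Return the indexes of tables with at least `min_rows` rows."""
    return [t["index"] for t in info if t["rows"] >= min_rows]


print(large_tables(tables_info))          # [0]
print(bbox_area(tables_info[0]["bbox"]))  # 150000
```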

Example Workflows

PDF Report Tables

python
from table_extractor import TableExtractor

extractor = TableExtractor()
extractor.load_pdf("quarterly_report.pdf")

# Extract all tables
tables = extractor.extract_all()

# Export each to CSV
for i, table in enumerate(tables):
    table.to_csv(f"table_{i}.csv", index=False)

Scanned Document

python
from table_extractor import TableExtractor

extractor = TableExtractor()
extractor.set_ocr(enabled=True, lang="eng")
extractor.load_image("scanned_form.png")

table = extractor.extract_table()
print(table)
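OCR depends on pytesseract, which is a wrapper around the Tesseract engine; the engine itself is installed separately from the Python package. Assuming this tool relies on pytesseract for its OCR path, as the dependency list suggests, a quick pre-flight check along these lines can catch a missing binary before extraction starts (`tesseract_available` is a hypothetical helper):

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the Tesseract executable can be found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    print("Tesseract not found on PATH; OCR-based extraction will likely fail.")
```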

Dependencies

  • pdfplumber>=0.10.0
  • pillow>=10.0.0
  • pandas>=2.0.0
  • pytesseract>=0.3.10 (for OCR)
  • opencv-python>=4.8.0
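To see which of these packages are present in the current environment, something like the following sketch can help. It uses the standard library's `importlib.metadata` and only checks that each distribution is installed, not that it meets the minimum version listed above; `missing_packages` is a hypothetical helper.

```python
from importlib import metadata
from typing import Iterable, List

# Distribution names from the dependency list above.
REQUIRED = ["pdfplumber", "pillow", "pandas", "pytesseract", "opencv-python"]


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


print("Missing:", missing_packages(REQUIRED))
```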