Table Extractor

Extract tables from PDFs and images into structured data formats.

Features

PDF Tables: Extract tables from digital PDFs
Image Tables: OCR-based extraction from images
Multiple Tables: Extract all tables from a document
Format Export: CSV, Excel, JSON output
Table Detection: Auto-detect table boundaries
Column Alignment: Smart column detection
Multi-Page: Process entire PDF documents

Quick Start

python

from table_extractor import TableExtractor

extractor = TableExtractor()

# Extract from PDF
extractor.load_pdf("document.pdf")
tables = extractor.extract_all()

# Save first table to CSV
tables[0].to_csv("table.csv")

# Extract from image
extractor.load_image("scanned_table.png")
table = extractor.extract_table()
print(table)

CLI Usage

bash

# Extract from PDF
python table_extractor.py --input document.pdf --output tables/

# Extract specific pages
python table_extractor.py --input document.pdf --pages 1-3 --output tables/

# Extract from image
python table_extractor.py --input scan.png --output table.csv

# Export to Excel
python table_extractor.py --input document.pdf --format xlsx --output tables.xlsx

# With OCR for scanned PDFs
python table_extractor.py --input scanned.pdf --ocr --output tables/
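The flags used above could be parsed along the following lines. This is a minimal sketch, not the tool's actual implementation: `parse_pages` and `build_parser` are hypothetical helpers, the `csv` and `json` values for `--format` are assumed from the listed output formats (only `xlsx` appears in the commands above), and support for comma-separated page lists is an assumption. The sketch does not convert between the 1-based ranges shown on the command line and whatever page indexing the Python API uses.

```python
import argparse
from typing import List


def parse_pages(spec: str) -> List[int]:
    """Expand a page spec such as "1-3" or "1,4-5" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags shown in the CLI examples."""
    parser = argparse.ArgumentParser(prog="table_extractor.py")
    parser.add_argument("--input", required=True, help="PDF or image file to read")
    parser.add_argument("--output", required=True, help="output file or directory")
    parser.add_argument("--pages", type=parse_pages, default=None,
                        help='page selection, e.g. "1-3"')
    # Only "xlsx" is confirmed by the examples; "csv" and "json" are assumed values.
    parser.add_argument("--format", choices=["csv", "xlsx", "json"], default="csv")
    parser.add_argument("--ocr", action="store_true", help="enable OCR for scanned input")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--input", "document.pdf", "--pages", "1-3", "--output", "tables/"]
)
print(args.pages)  # [1, 2, 3]
```

Passing `parse_pages` as the `type` callable lets argparse report a malformed range as a normal usage error, because it treats a `ValueError` from the converter as an invalid argument value.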

API Reference

TableExtractor Class

python

from typing import Dict, List, Optional

import pandas as pd


class TableExtractor:
    def __init__(self) -> None: ...

    # Loading
    def load_pdf(self, filepath: str, pages: Optional[List[int]] = None) -> 'TableExtractor': ...
    def load_image(self, filepath: str) -> 'TableExtractor': ...

    # Extraction
    def extract_table(self, page: int = 0) -> pd.DataFrame: ...
    def extract_all(self) -> List[pd.DataFrame]: ...
    def extract_page(self, page: int) -> List[pd.DataFrame]: ...

    # Detection
    def detect_tables(self, page: int = 0) -> List[Dict]: ...
    def get_table_count(self) -> int: ...

    # Configuration
    def set_ocr(self, enabled: bool = True, lang: str = "eng") -> 'TableExtractor': ...
    def set_column_detection(self, mode: str = "auto") -> 'TableExtractor': ...

    # Export
    def to_csv(self, tables: List, output_dir: str) -> List[str]: ...
    def to_excel(self, tables: List, output: str) -> str: ...
    def to_json(self, tables: List, output: str) -> str: ...
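The loading and configuration methods are annotated as returning `'TableExtractor'`, which suggests they can be chained. The sketch below illustrates that pattern with a hypothetical stand-in class, `FluentConfig`; it is not the real `TableExtractor`, and whether the real methods return `self` is an assumption based only on the signatures.

```python
class FluentConfig:
    """Hypothetical stand-in that mirrors two configuration signatures from the reference."""

    def __init__(self) -> None:
        self.ocr_enabled = False
        self.ocr_lang = "eng"
        self.column_mode = "auto"

    def set_ocr(self, enabled: bool = True, lang: str = "eng") -> "FluentConfig":
        self.ocr_enabled = enabled
        self.ocr_lang = lang
        return self  # returning self is what makes chaining possible

    def set_column_detection(self, mode: str = "auto") -> "FluentConfig":
        self.column_mode = mode
        return self


# Each call returns the same object, so calls can be written as one expression.
config = FluentConfig().set_ocr(lang="eng").set_column_detection("auto")
print(config.ocr_enabled, config.ocr_lang, config.column_mode)  # True eng auto
```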

Supported Formats

Input

PDF documents (text-based and scanned)
Images: PNG, JPEG, TIFF, BMP
Screenshots with tables

Output

CSV (one file per table)
Excel (multiple sheets)
JSON (array of tables)
Pandas DataFrame
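To make the "one file per table" and "array of tables" layouts concrete, here is a rough standard-library sketch. It is not the library's exporter: the helper names are hypothetical, each table is modelled as a list of row dictionaries rather than a DataFrame, and the `table_{i}.csv` naming is borrowed from the workflow example in this document.

```python
import csv
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List

# Assumption for this sketch: one table = list of rows, each row keyed by column name.
Table = List[Dict[str, str]]


def write_csv_per_table(tables: List[Table], output_dir: str) -> List[str]:
    """Write each table to its own CSV file and return the file paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, rows in enumerate(tables):
        path = out / f"table_{i}.csv"
        fieldnames = list(rows[0].keys()) if rows else []
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        paths.append(str(path))
    return paths


def write_json_array(tables: List[Table], output: str) -> str:
    """Write all tables to a single JSON file as an array of tables."""
    with open(output, "w", encoding="utf-8") as fh:
        json.dump(tables, fh, ensure_ascii=False, indent=2)
    return output


# Demo with made-up rows, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = [[{"item": "A", "qty": "1"}], [{"name": "x"}]]
    csv_paths = write_csv_per_table(demo, os.path.join(tmp, "tables"))
    json_path = write_json_array(demo, os.path.join(tmp, "tables.json"))
    print(len(csv_paths), os.path.basename(json_path))
```

With DataFrames, `DataFrame.to_csv` and `DataFrame.to_dict(orient="records")` would play the roles of the CSV writer and the row-dictionary conversion.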

Table Detection

python

# Detect tables without extracting
tables_info = extractor.detect_tables(page=0)

# Returns:
# [
#     {"index": 0, "rows": 10, "cols": 5, "bbox": (x1, y1, x2, y2)},
#     {"index": 1, "rows": 8, "cols": 3, "bbox": (x1, y1, x2, y2)}
# ]
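Detection results in this shape can be filtered before any extraction is run. The following sketch assumes only the dictionary keys shown above (`index`, `rows`, `cols`, `bbox`); the helper names, the thresholds, and the sample coordinates are made up for illustration.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]


def bbox_area(bbox: BBox) -> float:
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = bbox
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def select_tables(tables_info: List[Dict], min_rows: int = 2, min_cols: int = 2) -> List[int]:
    """Return indices of tables that meet the size thresholds, largest area first."""
    keep = [t for t in tables_info if t["rows"] >= min_rows and t["cols"] >= min_cols]
    keep.sort(key=lambda t: bbox_area(t["bbox"]), reverse=True)
    return [t["index"] for t in keep]


# Sample data in the documented shape; the coordinates are invented.
sample = [
    {"index": 0, "rows": 10, "cols": 5, "bbox": (50, 100, 550, 400)},
    {"index": 1, "rows": 8, "cols": 3, "bbox": (50, 450, 350, 650)},
    {"index": 2, "rows": 1, "cols": 1, "bbox": (0, 0, 600, 20)},
]
print(select_tables(sample))  # [0, 1]
```

Filtering on row and column counts is a cheap way to skip single-cell detections such as page headers before running a full extraction.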

Example Workflows

PDF Report Tables

python

extractor = TableExtractor()
extractor.load_pdf("quarterly_report.pdf")

# Extract all tables
tables = extractor.extract_all()

# Export each to CSV
for i, table in enumerate(tables):
    table.to_csv(f"table_{i}.csv", index=False)

Scanned Document

python

extractor = TableExtractor()
extractor.set_ocr(enabled=True, lang="eng")
extractor.load_image("scanned_form.png")

table = extractor.extract_table()
print(table)

Dependencies

pdfplumber>=0.10.0
pillow>=10.0.0
pandas>=2.0.0
pytesseract>=0.3.10 (for OCR)
opencv-python>=4.8.0
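A quick way to see which of these packages are present in the current environment is sketched below. It uses only the standard library; `installed_versions` is a hypothetical helper, and it reports versions without comparing them against the minimums listed above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional

# Distribution names as listed in the dependency section.
REQUIRED = ["pdfplumber", "pillow", "pandas", "pytesseract", "opencv-python"]


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```

Note that `pytesseract` is a wrapper around the Tesseract OCR engine, which has to be installed separately from the Python package.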
