table-extractor

Original：🇺🇸 English

Translated

1 scriptsChecked / no sensitive code detected

Extract tables from PDFs and images to CSV or Excel. Support for scanned documents with OCR, multi-page PDFs, and complex table structures.

7installs

Sourcedkyazzentwatwa/chatgpt-skills

Added on2026-02-11

NPX Install

npx skill4agent add dkyazzentwatwa/chatgpt-skills table-extractor

SKILL.md Content

View Translation Comparison →

Table Extractor

Extract tables from PDFs and images into structured data formats.

Features

PDF Tables: Extract tables from digital PDFs
Image Tables: OCR-based extraction from images
Multiple Tables: Extract all tables from document
Format Export: CSV, Excel, JSON output
Table Detection: Auto-detect table boundaries
Column Alignment: Smart column detection
Multi-Page: Process entire PDF documents

Quick Start

python

from table_extractor import TableExtractor

extractor = TableExtractor()

# Extract from PDF
extractor.load_pdf("document.pdf")
tables = extractor.extract_all()

# Save first table to CSV
tables[0].to_csv("table.csv")

# Extract from image
extractor.load_image("scanned_table.png")
table = extractor.extract_table()
print(table)

CLI Usage

bash

# Extract from PDF
python table_extractor.py --input document.pdf --output tables/

# Extract specific pages
python table_extractor.py --input document.pdf --pages 1-3 --output tables/

# Extract from image
python table_extractor.py --input scan.png --output table.csv

# Export to Excel
python table_extractor.py --input document.pdf --format xlsx --output tables.xlsx

# With OCR for scanned PDFs
python table_extractor.py --input scanned.pdf --ocr --output tables/

API Reference

TableExtractor Class

python

class TableExtractor:
    def __init__(self)

    # Loading
    def load_pdf(self, filepath: str, pages: List[int] = None) -> 'TableExtractor'
    def load_image(self, filepath: str) -> 'TableExtractor'

    # Extraction
    def extract_table(self, page: int = 0) -> pd.DataFrame
    def extract_all(self) -> List[pd.DataFrame]
    def extract_page(self, page: int) -> List[pd.DataFrame]

    # Detection
    def detect_tables(self, page: int = 0) -> List[Dict]
    def get_table_count(self) -> int

    # Configuration
    def set_ocr(self, enabled: bool = True, lang: str = "eng") -> 'TableExtractor'
    def set_column_detection(self, mode: str = "auto") -> 'TableExtractor'

    # Export
    def to_csv(self, tables: List, output_dir: str) -> List[str]
    def to_excel(self, tables: List, output: str) -> str
    def to_json(self, tables: List, output: str) -> str

Supported Formats

Input

PDF documents (text-based and scanned)
Images: PNG, JPEG, TIFF, BMP
Screenshots with tables

Output

CSV (one file per table)
Excel (multiple sheets)
JSON (array of tables)
Pandas DataFrame

Table Detection

python

# Detect tables without extracting
tables_info = extractor.detect_tables(page=0)
# Returns:
# [
#     {"index": 0, "rows": 10, "cols": 5, "bbox": (x1, y1, x2, y2)},
#     {"index": 1, "rows": 8, "cols": 3, "bbox": (x1, y1, x2, y2)}
# ]

Example Workflows

PDF Report Tables

python

extractor = TableExtractor()
extractor.load_pdf("quarterly_report.pdf")

# Extract all tables
tables = extractor.extract_all()

# Export each to CSV
for i, table in enumerate(tables):
    table.to_csv(f"table_{i}.csv", index=False)

Scanned Document

python

extractor = TableExtractor()
extractor.set_ocr(enabled=True, lang="eng")
extractor.load_image("scanned_form.png")

table = extractor.extract_table()
print(table)

Dependencies

pdfplumber>=0.10.0
pillow>=10.0.0
pandas>=2.0.0
pytesseract>=0.3.10 (for OCR)
opencv-python>=4.8.0

table-extractor

NPX Install

Tags

SKILL.md Content

Table Extractor

Features

Quick Start

CLI Usage

API Reference

TableExtractor Class

Supported Formats

Input

Output

Table Detection

Example Workflows

PDF Report Tables

Scanned Document

Dependencies