extracting-pdf-text

Extract text from PDFs for LLM consumption. Use when processing PDFs for RAG, document analysis, or text extraction. Supports API services (Mistral OCR) and local tools (PyMuPDF, pdfplumber). Handles text-based PDFs, tables, and scanned documents with OCR.

Source: letta-ai/skills

Added on: 2026-02-07

NPX Install

npx skill4agent add letta-ai/skills extracting-pdf-text

SKILL.md Content

Extracting PDF Text for LLMs

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

Quick Decision Guide

| PDF Type | Best Approach | Script |
| --- | --- | --- |
| Simple text PDF | PyMuPDF | `scripts/extract_pymupdf.py` |
| PDF with tables | pdfplumber | `scripts/extract_pdfplumber.py` |
| Scanned/image PDF (local) | pytesseract | `scripts/extract_with_ocr.py` |
| Complex layout, highest accuracy | Mistral OCR API | `scripts/extract_mistral_ocr.py` |
| End-to-end RAG pipeline | marker-pdf | `pip install marker-pdf` |

Recommended Workflow

1. Try PyMuPDF first - it is the fastest option and handles most text-based PDFs well
2. If tables come out mangled - switch to pdfplumber
3. If the PDF is scanned or image-based - use the Mistral OCR API (best accuracy) or local OCR (free but slower)
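
The workflow above can be sketched as a small decision helper. This is a minimal illustration, not part of the skill's scripts: the function names and the `MIN_CHARS_PER_PAGE` threshold are assumptions, and the idea that a nearly empty text layer indicates a scanned PDF is a heuristic rather than a guarantee.

```python
from typing import Sequence

# Hypothetical threshold: pages with fewer extracted characters than this
# are treated as image-only. Tune it for your own documents.
MIN_CHARS_PER_PAGE = 20


def choose_extractor(chars_per_page: Sequence[int], tables_mangled: bool = False) -> str:
    """Pick an extraction approach following the three workflow steps."""
    if not chars_per_page:
        return "ocr"
    sparse = sum(1 for n in chars_per_page if n < MIN_CHARS_PER_PAGE)
    # A mostly empty text layer suggests a scanned or image-based PDF.
    if sparse > len(chars_per_page) / 2:
        return "ocr"
    if tables_mangled:
        return "pdfplumber"
    return "pymupdf"


def count_chars_per_page(pdf_path: str) -> list[int]:
    """Count text-layer characters on each page using PyMuPDF."""
    import pymupdf  # imported lazily; older releases expose this module as `fitz`

    with pymupdf.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]
```

A call such as `choose_extractor(count_chars_per_page("input.pdf"))` returns `"ocr"`, `"pdfplumber"`, or `"pymupdf"`; whether tables are mangled still has to be judged by inspecting the PyMuPDF output.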

Local Extraction (No API Required)

PyMuPDF - Fast General Extraction

Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.

bash

uv run scripts/extract_pymupdf.py input.pdf output.md

The script outputs markdown with preserved headings and paragraphs. For LLM-optimized output, it uses `pymupdf4llm`, which formats text for RAG systems.
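
For readers who want to call `pymupdf4llm` directly, a minimal sketch might look like the following. The contents of `scripts/extract_pymupdf.py` are not shown here, so this is not necessarily what that script does; `pdf_to_markdown` and `collapse_blank_lines` are hypothetical helper names, and the blank-line cleanup is an optional assumption.

```python
import re
from pathlib import Path


def collapse_blank_lines(markdown: str) -> str:
    """Squeeze runs of three or more newlines down to a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", markdown).strip() + "\n"


def pdf_to_markdown(pdf_path: str, out_path: str) -> None:
    """Convert a text-based PDF to markdown with pymupdf4llm."""
    import pymupdf4llm  # third-party: pip install pymupdf4llm

    markdown = pymupdf4llm.to_markdown(pdf_path)
    Path(out_path).write_text(collapse_blank_lines(markdown), encoding="utf-8")
```

Calling `pdf_to_markdown("input.pdf", "output.md")` mirrors the command-line usage above.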

pdfplumber - Table Extraction

Best for: PDFs with tables, financial documents, structured data.

bash

uv run scripts/extract_pdfplumber.py input.pdf output.md

Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
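
The table-to-markdown conversion can be sketched as below. This is an illustration under assumptions, not the bundled script: `table_to_markdown` and `extract_tables` are hypothetical names, and the first row of each table is assumed to be its header. The `pdfplumber` calls used (`pdfplumber.open`, `page.extract_tables()`) return each table as a list of rows, where empty cells are `None`.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[str]]


def table_to_markdown(rows: Sequence[Row]) -> str:
    """Render extracted rows as a markdown table; the first row is the header."""
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Empty cells come back as None; newlines and pipes would break a row.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    width = max(len(row) for row in rows)
    lines = []
    for i, row in enumerate(rows):
        # Pad short rows so every line has the same number of columns.
        cells = [clean(c) for c in row] + [""] * (width - len(row))
        lines.append("| " + " | ".join(cells) + " |")
        if i == 0:
            lines.append("| " + " | ".join(["---"] * width) + " |")
    return "\n".join(lines)


def extract_tables(pdf_path: str) -> list[str]:
    """Return every table found in the PDF as a markdown string."""
    import pdfplumber  # third-party: pip install pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for rows in page.extract_tables():
                tables.append(table_to_markdown(rows))
    return tables
```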

Local OCR - Scanned Documents

Best for: Scanned PDFs when API access is unavailable.

bash

uv run scripts/extract_with_ocr.py input.pdf output.txt

Requires: `pytesseract`, `pdf2image`, and Tesseract installed (`brew install tesseract` on macOS).
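
A rough sketch of the local OCR path, assuming the usual `pdf2image` plus `pytesseract` combination: each page is rasterized and passed to Tesseract. The helper names, the 300 DPI default, and the page-separator format are assumptions for illustration; note that `pdf2image` also depends on Poppler being installed.

```python
def join_pages(page_texts: list[str]) -> str:
    """Join per-page OCR text, marking page boundaries for later chunking."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        # Hypothetical separator format; any stable marker works.
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> str:
    """Rasterize each page and run Tesseract on it."""
    from pdf2image import convert_from_path  # needs Poppler installed
    import pytesseract  # needs the Tesseract binary on PATH

    images = convert_from_path(pdf_path, dpi=dpi)
    return join_pages([pytesseract.image_to_string(img, lang=lang) for img in images])
```

Higher DPI values generally improve recognition of small print at the cost of speed and memory.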

API-Based Extraction

Mistral OCR API

Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.

Pricing: ~1000 pages per dollar (very cost-effective)

bash

export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md

Features:

- Outputs clean markdown
- Preserves document structure (headings, lists, tables)
- Handles images, math equations, multilingual text
- 95%+ accuracy on complex documents

For detailed API options and other services, see references/api-services.md.
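
As a hedged sketch of calling the API from Python, the example below sends the PDF as a base64 data URL through the `mistralai` SDK. It assumes the 1.x SDK's `client.ocr.process` interface and the `mistral-ocr-latest` model alias; check the current Mistral documentation before relying on either. The helper names are hypothetical, and the bundled script may work differently.

```python
import base64
import os
from pathlib import Path


def pdf_to_data_url(pdf_bytes: bytes) -> str:
    """Encode PDF bytes as a base64 data URL."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return f"data:application/pdf;base64,{encoded}"


def mistral_ocr(pdf_path: str, model: str = "mistral-ocr-latest") -> str:
    """Send a PDF to Mistral OCR and return the pages joined as markdown."""
    from mistralai import Mistral  # third-party: pip install mistralai

    # Reads the key exported in the shell, as in the command above.
    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    response = client.ocr.process(
        model=model,
        document={
            "type": "document_url",
            "document_url": pdf_to_data_url(Path(pdf_path).read_bytes()),
        },
    )
    # Each page in the response carries its own markdown string.
    return "\n\n".join(page.markdown for page in response.pages)
```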

Output Format Recommendations

For LLM consumption, markdown is preferred:

- Preserves semantic structure (headings become context boundaries)
- Tables remain readable
- Compatible with most RAG chunking strategies
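
To illustrate how headings can act as context boundaries, here is a minimal heading-based chunker in plain Python. It is one possible chunking strategy among many, and `chunk_by_heading` is a hypothetical name; the sketch does not special-case `#` lines inside fenced code blocks.

```python
import re

# ATX-style markdown headings: one to six '#' characters, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split markdown into chunks, each starting at a heading."""
    chunks = []
    title, body = "", []

    def flush() -> None:
        # Skip an empty preamble, but keep titled sections even if empty.
        if title or any(ln.strip() for ln in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading as `title`, which can be stored as metadata or prepended to the text before embedding.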

For detailed comparisons of local tools, see references/local-tools.md.
