extracting-pdf-text
Original:🇺🇸 English
Translated
4 scriptsChecked / no sensitive code detected
Extract text from PDFs for LLM consumption. Use when processing PDFs for RAG, document analysis, or text extraction. Supports API services (Mistral OCR) and local tools (PyMuPDF, pdfplumber). Handles text-based PDFs, tables, and scanned documents with OCR.
8installs
Sourceletta-ai/skills
Added on
NPX Install
npx skill4agent add letta-ai/skills extracting-pdf-textTags
Translated version includes tags in frontmatterSKILL.md Content
View Translation Comparison →Extracting PDF Text for LLMs
This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.
Quick Decision Guide
| PDF Type | Best Approach | Script |
|---|---|---|
| Simple text PDF | PyMuPDF | |
| PDF with tables | pdfplumber | |
| Scanned/image PDF (local) | pytesseract | |
| Complex layout, highest accuracy | Mistral OCR API | |
| End-to-end RAG pipeline | marker-pdf | |
Recommended Workflow
- Try PyMuPDF first - fastest, handles most text-based PDFs well
- If tables are mangled - switch to pdfplumber
- If scanned/image-based - use Mistral OCR API (best accuracy) or local OCR (free but slower)
Local Extraction (No API Required)
PyMuPDF - Fast General Extraction
Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.
bash
uv run scripts/extract_pymupdf.py input.pdf output.mdThe script outputs markdown with preserved headings and paragraphs. For LLM-optimized output, it uses which formats text for RAG systems.
pymupdf4llmpdfplumber - Table Extraction
Best for: PDFs with tables, financial documents, structured data.
bash
uv run scripts/extract_pdfplumber.py input.pdf output.mdTables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
Local OCR - Scanned Documents
Best for: Scanned PDFs when API access is unavailable.
bash
uv run scripts/extract_with_ocr.py input.pdf output.txtRequires: , , and Tesseract installed ( on macOS).
pytesseractpdf2imagebrew install tesseractAPI-Based Extraction
Mistral OCR API
Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.
Pricing: ~1000 pages per dollar (very cost-effective)
bash
export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.mdFeatures:
- Outputs clean markdown
- Preserves document structure (headings, lists, tables)
- Handles images, math equations, multilingual text
- 95%+ accuracy on complex documents
For detailed API options and other services, see references/api-services.md.
Output Format Recommendations
For LLM consumption, markdown is preferred:
- Preserves semantic structure (headings become context boundaries)
- Tables remain readable
- Compatible with most RAG chunking strategies
For detailed comparisons of local tools, see references/local-tools.md.