extracting-pdf-text

Extracting PDF Text for LLMs

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

Quick Decision Guide

| PDF Type | Best Approach | Script |
|---|---|---|
| Simple text PDF | PyMuPDF | `scripts/extract_pymupdf.py` |
| PDF with tables | pdfplumber | `scripts/extract_pdfplumber.py` |
| Scanned/image PDF (local) | pytesseract | `scripts/extract_with_ocr.py` |
| Complex layout, highest accuracy | Mistral OCR API | `scripts/extract_mistral_ocr.py` |
| End-to-end RAG pipeline | marker-pdf | `pip install marker-pdf` |

Recommended Workflow

  1. Try PyMuPDF first - fastest, handles most text-based PDFs well
  2. If tables are mangled - switch to pdfplumber
  3. If scanned/image-based - use Mistral OCR API (best accuracy) or local OCR (free but slower)
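The three steps above can be sketched as a small decision helper. This is only an illustration, not part of the skill's scripts: the function names, the `tables_mangled` flag, and the 50-character threshold for treating a page as scanned are all assumptions made for the example.

```python
from typing import Sequence


def choose_extractor(
    chars_per_page: Sequence[int],
    tables_mangled: bool = False,
    min_chars: int = 50,  # assumed threshold, tune for your documents
) -> str:
    """Pick an approach following the workflow: PyMuPDF, then pdfplumber, then OCR."""
    # Pages with almost no embedded text are most likely scans or images.
    if not chars_per_page or all(n < min_chars for n in chars_per_page):
        return "ocr"
    # Text is present, but the first pass produced broken tables.
    if tables_mangled:
        return "pdfplumber"
    return "pymupdf"


def count_chars_per_page(pdf_path: str) -> list[int]:
    """Count embedded text characters on each page using PyMuPDF."""
    import pymupdf  # imported lazily; older releases expose the module as `fitz`

    with pymupdf.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]
```

A typical call would be `choose_extractor(count_chars_per_page("input.pdf"))`. Documents that mix scanned and text pages are treated as text PDFs here, which may not be what you want.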

Local Extraction (No API Required)

PyMuPDF - Fast General Extraction

Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.
```bash
uv run scripts/extract_pymupdf.py input.pdf output.md
```
The script outputs markdown with preserved headings and paragraphs. For LLM-optimized output, it uses `pymupdf4llm`, which formats text for RAG systems.
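The contents of `scripts/extract_pymupdf.py` are not shown here, so the following is only a minimal sketch of how a PDF might be converted with `pymupdf4llm.to_markdown`. The `pdf_to_markdown` name and the optional `converter` parameter are hypothetical, added so the function can be exercised without the library installed.

```python
from pathlib import Path
from typing import Callable, Optional


def pdf_to_markdown(
    pdf_path: str,
    out_path: str,
    converter: Optional[Callable[[str], str]] = None,
) -> int:
    """Convert a PDF to markdown, write it to out_path, return characters written."""
    if converter is None:
        # Imported lazily so the helper can be tested with a stand-in converter.
        import pymupdf4llm

        converter = pymupdf4llm.to_markdown
    md_text = converter(pdf_path)
    Path(out_path).write_text(md_text, encoding="utf-8")
    return len(md_text)
```

Calling `pdf_to_markdown("input.pdf", "output.md")` would then mirror the command above, assuming `pymupdf4llm` is installed in the environment.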

pdfplumber - Table Extraction

Best for: PDFs with tables, financial documents, structured data.
```bash
uv run scripts/extract_pdfplumber.py input.pdf output.md
```
Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
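To illustrate the table-to-markdown conversion, here is a rough sketch built on pdfplumber's `page.extract_tables()`, which returns each table as a list of rows with `None` for empty cells. The helper names are hypothetical, and treating the first row as the header is an assumption that will not hold for every document.

```python
from typing import Optional, Sequence


def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render extracted rows as a markdown pipe table (first row is the header)."""
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Empty cells arrive as None; newlines and pipes would break the row.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    header, *body = rows
    lines = [
        "| " + " | ".join(clean(c) for c in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(clean(c) for c in row) + " |" for row in body]
    return "\n".join(lines)


def extract_tables(pdf_path: str) -> list[str]:
    """Return every table found in the PDF as a markdown string."""
    import pdfplumber  # imported lazily; requires `pip install pdfplumber`

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                tables.append(table_to_markdown(table))
    return tables
```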

Local OCR - Scanned Documents

Best for: Scanned PDFs when API access is unavailable.
```bash
uv run scripts/extract_with_ocr.py input.pdf output.txt
```
Requires: `pytesseract`, `pdf2image`, and Tesseract installed (`brew install tesseract` on macOS).
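As a rough sketch of what a local OCR pass could look like with these dependencies: each page is rendered to an image with `pdf2image.convert_from_path` and passed to `pytesseract.image_to_string`. The function names, the page-marker format, and the 300 DPI default are assumptions for illustration, not the behavior of `scripts/extract_with_ocr.py`. Note that `pdf2image` also relies on Poppler being installed.

```python
from typing import Callable, Iterable


def ocr_pages(images: Iterable[object], ocr: Callable[[object], str]) -> str:
    """Run `ocr` on each page image and join the results with page markers."""
    parts = []
    for number, image in enumerate(images, start=1):
        text = ocr(image).strip()
        parts.append(f"--- Page {number} ---\n{text}")
    return "\n\n".join(parts)


def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> str:
    """Rasterize a PDF and OCR every page with Tesseract."""
    # Imported lazily; both packages need their system binaries installed.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return ocr_pages(images, lambda img: pytesseract.image_to_string(img, lang=lang))
```

Keeping the page loop separate from the OCR call makes it easy to swap in a different engine or to test the joining logic without Tesseract.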

API-Based Extraction

Mistral OCR API

Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.
Pricing: ~1000 pages per dollar (very cost-effective)
```bash
export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md
```
Features:
  • Outputs clean markdown
  • Preserves document structure (headings, lists, tables)
  • Handles images, math equations, multilingual text
  • 95%+ accuracy on complex documents
For detailed API options and other services, see references/api-services.md.
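For orientation, a request to the OCR endpoint might look roughly like the sketch below, which uses only the standard library. The endpoint URL, the `mistral-ocr-latest` model name, the base64 data-URI document form, and the `pages[].markdown` response field reflect my understanding of Mistral's public documentation and should be treated as assumptions; check them against `references/api-services.md` before relying on them. The function names are hypothetical.

```python
import base64
import json
import os
import urllib.request

OCR_URL = "https://api.mistral.ai/v1/ocr"  # assumed endpoint, verify before use


def build_payload(pdf_bytes: bytes, model: str = "mistral-ocr-latest") -> dict:
    """Build the JSON body, embedding the PDF as a base64 data URI."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return {
        "model": model,
        "document": {
            "type": "document_url",
            "document_url": f"data:application/pdf;base64,{encoded}",
        },
    }


def pages_to_markdown(response: dict) -> str:
    """Join the per-page markdown from a response into one document."""
    return "\n\n".join(page.get("markdown", "") for page in response.get("pages", []))


def ocr_pdf(pdf_path: str) -> str:
    """Send a local PDF to the OCR endpoint and return the combined markdown."""
    with open(pdf_path, "rb") as f:
        payload = build_payload(f.read())
    request = urllib.request.Request(
        OCR_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return pages_to_markdown(json.load(resp))
```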

Output Format Recommendations

For LLM consumption, markdown is preferred:
  • Preserves semantic structure (headings become context boundaries)
  • Tables remain readable
  • Compatible with most RAG chunking strategies
For detailed comparisons of local tools, see references/local-tools.md.
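To make the "headings become context boundaries" point concrete, here is a minimal sketch of heading-based chunking over extracted markdown. The `split_by_headings` name is hypothetical, and the approach is deliberately naive: a line starting with `#` inside a fenced code block would also be treated as a heading.

```python
import re

# ATX-style markdown headings: one to six '#' characters followed by a space.
HEADING = re.compile(r"^#{1,6} ")


def split_by_headings(markdown: str) -> list[str]:
    """Split markdown into chunks, starting a new chunk at every heading."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            # Close the previous chunk before starting the new section.
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contained only whitespace.
    return [chunk for chunk in chunks if chunk]
```

Each chunk keeps its heading as the first line, so downstream retrieval can use the heading as lightweight context for the text beneath it.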