extracting-pdf-text
Extracting PDF Text for LLMs
This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.
Quick Decision Guide
| PDF Type | Best Approach | Script |
|---|---|---|
| Simple text PDF | PyMuPDF | `scripts/extract_pymupdf.py` |
| PDF with tables | pdfplumber | `scripts/extract_pdfplumber.py` |
| Scanned/image PDF (local) | pytesseract | `scripts/extract_with_ocr.py` |
| Complex layout, highest accuracy | Mistral OCR API | `scripts/extract_mistral_ocr.py` |
| End-to-end RAG pipeline | marker-pdf | |
Recommended Workflow
- Try PyMuPDF first - fastest, handles most text-based PDFs well
- If tables are mangled - switch to pdfplumber
- If scanned/image-based - use Mistral OCR API (best accuracy) or local OCR (free but slower)
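The workflow above can be sketched as a small routing helper. This is only an illustration: `choose_extractor`, the `has_tables` flag, and the 50-character threshold are assumptions introduced here, not part of the skill's scripts.

```python
def choose_extractor(page_texts, has_tables=False, min_chars_per_page=50):
    """Pick an extraction route from text already pulled by a fast first pass.

    page_texts: one string per page (for example from an initial PyMuPDF run).
    min_chars_per_page: assumed threshold; tune it for your own documents.
    """
    if not page_texts:
        return "ocr"
    # Pages with almost no extractable text are probably scanned images.
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    if sparse > len(page_texts) / 2:
        return "ocr"
    # Caller-supplied hint: tables came out mangled in the first pass.
    if has_tables:
        return "pdfplumber"
    return "pymupdf"
```

A result of `"ocr"` leaves the choice between the Mistral OCR API and local OCR to the caller, since that depends on API access and budget.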
Local Extraction (No API Required)
PyMuPDF - Fast General Extraction
Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.
```bash
uv run scripts/extract_pymupdf.py input.pdf output.md
```
The script outputs markdown with preserved headings and paragraphs. For LLM-optimized output, it uses `pymupdf4llm`, which formats text for RAG systems.
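For readers who want to call the library directly, a minimal sketch might look like the following. It is not the contents of `scripts/extract_pymupdf.py`; the `pdf_to_markdown` helper and its `converter` parameter are hypothetical names used here so the function can be exercised without the package installed.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path, output_path, converter=None):
    """Convert a PDF to markdown and write it to output_path.

    converter is any callable that takes a path string and returns markdown.
    It defaults to pymupdf4llm.to_markdown (needs `pip install pymupdf4llm`).
    """
    if converter is None:
        import pymupdf4llm  # imported lazily so the helper loads without the package

        converter = pymupdf4llm.to_markdown
    markdown = converter(str(pdf_path))
    Path(output_path).write_text(markdown, encoding="utf-8")
    return markdown
```

Calling `pdf_to_markdown("input.pdf", "output.md")` would mirror the command above, assuming `pymupdf4llm` is available in the environment.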
pdfplumber - Table Extraction
Best for: PDFs with tables, financial documents, structured data.
```bash
uv run scripts/extract_pdfplumber.py input.pdf output.md
```
Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
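The table-to-markdown conversion can be illustrated roughly as below. This is a sketch under assumptions, not the script's actual code: `table_to_markdown` and `extract_tables` are hypothetical helpers, and the first row of each table is assumed to be its header.

```python
def table_to_markdown(rows):
    """Render a table (list of rows; cells may be None) as a markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # pdfplumber yields None for empty cells; newlines and pipes would break a row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    def render(row):
        # Pad short rows so every line has the same number of columns.
        cells = [clean(c) for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [render(header), "|" + "---|" * width]
    lines.extend(render(row) for row in body)
    return "\n".join(lines)


def extract_tables(pdf_path):
    """Return every table in the PDF as a markdown string, in page order."""
    import pdfplumber  # lazy import: needs `pip install pdfplumber`

    with pdfplumber.open(pdf_path) as pdf:
        return [
            table_to_markdown(table)
            for page in pdf.pages
            for table in page.extract_tables()
        ]
```

Keeping the rendering separate from the pdfplumber call makes it easy to check the markdown output on hand-written rows before running it on real documents.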
Local OCR - Scanned Documents
Best for: Scanned PDFs when API access is unavailable.
```bash
uv run scripts/extract_with_ocr.py input.pdf output.txt
```
Requires: `pytesseract`, `pdf2image`, and Tesseract installed (`brew install tesseract` on macOS).
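A local OCR pass with these dependencies could be sketched as follows. The helper names, the page-marker format, and the `dpi=300` and `lang="eng"` defaults are assumptions made for illustration, not taken from `scripts/extract_with_ocr.py`. Note that `pdf2image` also relies on Poppler being installed.

```python
def join_pages(page_texts):
    """Join per-page OCR text with a page marker so page boundaries survive."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts)


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Rasterize each page and run Tesseract on the resulting image."""
    # Lazy imports: need `pip install pytesseract pdf2image` plus the Tesseract binary.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return join_pages(pytesseract.image_to_string(image, lang=lang) for image in images)
```

Higher `dpi` values generally help with small print at the cost of speed and memory, so the default here is a starting point rather than a recommendation.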
API-Based Extraction
Mistral OCR API
Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.
Pricing: ~1000 pages per dollar (very cost-effective)
```bash
export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md
```
Features:
- Outputs clean markdown
- Preserves document structure (headings, lists, tables)
- Handles images, math equations, multilingual text
- 95%+ accuracy on complex documents
For detailed API options and other services, see references/api-services.md.
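To budget a batch before sending it to the API, the approximate pricing quoted above can be turned into a quick estimate. The helper below is a sketch: `estimate_ocr_cost` is a hypothetical name, and the default of 1000 pages per dollar is only the rough figure from this section, so check current pricing before relying on it.

```python
def estimate_ocr_cost(page_count, pages_per_dollar=1000):
    """Approximate cost in dollars, rounded up to the next cent."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    # Integer ceiling division on cents avoids floating-point rounding surprises.
    cents = -(-page_count * 100 // pages_per_dollar)
    return cents / 100


if __name__ == "__main__":
    print(estimate_ocr_cost(2500))
```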
Output Format Recommendations
For LLM consumption, markdown is preferred:
- Preserves semantic structure (headings become context boundaries)
- Tables remain readable
- Compatible with most RAG chunking strategies
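To show how headings can act as context boundaries, here is a minimal heading-based chunker. It is one possible sketch, not a prescribed strategy: `chunk_by_headings` is a hypothetical helper, it only recognizes `#`-style headings, and it does not enforce any size limit on chunks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown):
    """Split markdown into chunks, starting a new chunk at every '#' heading."""
    chunks = []
    title, lines = None, []
    in_fence = False

    def flush():
        # Skip an empty preamble; keep any titled section even if its body is empty.
        if title is not None or any(l.strip() for l in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # a '#' inside a code fence is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading as `title`, which can be stored as metadata or prepended to the text before embedding.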
For detailed comparisons of local tools, see references/local-tools.md.