ocr-and-documents

PDF & Document Extraction

For DOCX: use `python-docx` (parses actual document structure, far better than OCR). For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support). This skill covers PDFs and scanned documents.

Step 1: Remote URL Available?

If the document has a URL, always try `web_extract` first:

```
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
```

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when the file is local, `web_extract` fails, or you need batch processing.
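The Step 1 rule can be read as a small decision function. The sketch below is illustrative only: `choose_route` and its return labels are hypothetical names, not part of any tool, and it assumes that "has a URL" means an `http://` or `https://` source.

```python
from urllib.parse import urlparse


def choose_route(source, batch=False, web_extract_failed=False):
    """Return "web_extract" or "local" following the Step 1 rule (hypothetical helper)."""
    has_url = urlparse(source).scheme in ("http", "https")
    # Remote documents go to web_extract unless it already failed
    # or the job is a batch run, which is handled locally.
    if has_url and not batch and not web_extract_failed:
        return "web_extract"
    return "local"
```

For example, `choose_route("https://example.com/report.pdf")` selects `web_extract`, while a plain file name such as `document.pdf` falls through to local extraction.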

Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
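
One way to tell whether a PDF needs OCR is to check how much text its pages already carry. The following is a heuristic sketch, not part of either helper script: `looks_scanned`, `page_texts`, and the threshold values are assumptions chosen for illustration.

```python
def looks_scanned(texts, min_chars=20, max_empty_ratio=0.5):
    """Guess whether a PDF is scanned: most pages carry almost no text layer."""
    if not texts:
        return True  # nothing extractable at all
    empty = sum(1 for text in texts if len(text.strip()) < min_chars)
    return empty / len(texts) > max_empty_ratio


def page_texts(path):
    """Collect the text layer of each page with pymupdf (imported lazily)."""
    import pymupdf

    with pymupdf.open(path) as doc:
        return [page.get_text() for page in doc]
```

A caller could then pick marker-pdf when `looks_scanned(page_texts("document.pdf"))` is true and stay with pymupdf otherwise. The thresholds will likely need tuning for documents that mix scanned and text pages.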
If the user needs marker capabilities but the system lacks ~5GB free disk:
"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."

pymupdf (lightweight)

```bash
pip install pymupdf pymupdf4llm
```

**Via helper script**:
```bash
python scripts/extract_pymupdf.py document.pdf              # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
python scripts/extract_pymupdf.py document.pdf --tables      # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Specific pages
```
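
The helper script's internals are not shown here, so the sketch below is only one plausible way to get similar results directly from pymupdf. It assumes that `--pages 0-4` means zero-based, inclusive page indices, which the text does not state; `parse_page_range` and `extract` are hypothetical names.

```python
def parse_page_range(spec):
    """Turn "0-4" or "3" into a list of page indices (assumed zero-based, inclusive)."""
    start, _, end = spec.partition("-")
    first = int(start)
    last = int(end) if end else first
    if first < 0 or last < first:
        raise ValueError(f"invalid page range: {spec!r}")
    return list(range(first, last + 1))


def extract(path, pages=None):
    """Return metadata, page count, and text for the selected pages."""
    import pymupdf  # lazy import so the parser above works without it

    with pymupdf.open(path) as doc:
        indices = parse_page_range(pages) if pages else range(doc.page_count)
        return {
            "metadata": doc.metadata,  # dict with keys such as "title" and "author"
            "page_count": doc.page_count,
            "text": [doc[i].get_text() for i in indices if i < doc.page_count],
        }
```

For Markdown output, `pymupdf4llm.to_markdown("document.pdf")` returns the whole document as a single string.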

**Inline**:
```bash
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
```

marker-pdf (high-quality OCR)

```bash
# Check disk space first
python scripts/extract_marker.py --check
pip install marker-pdf
```

**Via helper script**:
```bash
python scripts/extract_marker.py document.pdf                # Markdown
python scripts/extract_marker.py document.pdf --json         # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
python scripts/extract_marker.py scanned.pdf                 # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm      # LLM-boosted accuracy
```

**CLI** (installed with marker-pdf):
```bash
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch
```
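
What `extract_marker.py --check` does internally is not shown here. As an independent sketch, a disk-space check and a thin wrapper around the `marker_single` command could look like the following. The function names are hypothetical, and the 5GB default is taken from the figure quoted in Step 2.

```python
import shutil
import subprocess


def free_gb(path="."):
    """Free disk space, in GB, on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def enough_space_for_marker(free_bytes, required_gb=5.0):
    """True if there is room for PyTorch plus the marker models."""
    return free_bytes >= required_gb * 1e9


def marker_command(pdf_path, output_dir="./output"):
    """Build the marker_single call as an argument list."""
    return ["marker_single", pdf_path, "--output_dir", output_dir]


def run_marker(pdf_path, output_dir="./output"):
    """Run marker_single, failing early if the CLI is not installed."""
    if shutil.which("marker_single") is None:
        raise RuntimeError("marker_single not found; run `pip install marker-pdf` first")
    subprocess.run(marker_command(pdf_path, output_dir), check=True)
```

Checking `enough_space_for_marker(shutil.disk_usage(".").free)` before `pip install marker-pdf` avoids starting a multi-gigabyte download that cannot finish.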

arXiv Papers

```
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
```
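
The abstract and full-paper URLs differ only in the `abs` versus `pdf` path segment, so both can be derived from a paper ID. The helper below is a hypothetical convenience, not part of `web_extract`, and it only recognizes new-style IDs such as `2402.03300`.

```python
import re

# Matches new-style arXiv IDs such as 2402.03300 or 2402.03300v2
_ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")


def arxiv_urls(ref):
    """Return (abs_url, pdf_url) for an arXiv ID or an arxiv.org URL."""
    match = _ARXIV_ID.search(ref)
    if match is None:
        raise ValueError(f"no arXiv ID found in {ref!r}")
    paper_id = match.group(0)
    return (
        f"https://arxiv.org/abs/{paper_id}",
        f"https://arxiv.org/pdf/{paper_id}",
    )
```

Either element of the returned pair can be passed in the `urls` list of a `web_extract` call.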

Split, Merge & Search

pymupdf handles these natively. Use `execute_code` or inline Python:

```python
# Split: extract pages 1-5 to a new PDF
import pymupdf

doc = pymupdf.open("report.pdf")
new = pymupdf.open()
for i in range(5):
    new.insert_pdf(doc, from_page=i, to_page=i)
new.save("pages_1-5.pdf")
```

```python
# Merge multiple PDFs
import pymupdf

result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    result.insert_pdf(pymupdf.open(path))
result.save("merged.pdf")
```

```python
# Search for text across all pages
import pymupdf

doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))
```

No extra dependencies needed: pymupdf covers split, merge, search, and text extraction in one package.
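
Because `insert_pdf` accepts an inclusive `from_page`/`to_page` range, a whole span of pages can be copied in one call. Building on that, the sketch below splits a PDF into fixed-size chunks. `chunk_ranges`, `split_pdf`, and the output naming scheme are assumptions made for illustration.

```python
def chunk_ranges(page_count, chunk_size):
    """Split page indices 0..page_count-1 into inclusive (first, last) ranges."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]


def split_pdf(path, chunk_size, prefix="part"):
    """Write each chunk of `path` to its own PDF and return the file names."""
    import pymupdf  # lazy import so chunk_ranges works without it

    written = []
    with pymupdf.open(path) as doc:
        for first, last in chunk_ranges(doc.page_count, chunk_size):
            part = pymupdf.open()  # new empty PDF
            part.insert_pdf(doc, from_page=first, to_page=last)
            name = f"{prefix}_{first + 1}-{last + 1}.pdf"  # 1-based page numbers
            part.save(name)
            part.close()
            written.append(name)
    return written
```

With a 12-page document and `chunk_size=5`, this would produce three files covering pages 1-5, 6-10, and 11-12.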

---

Notes

  • `web_extract` is always the first choice for URLs
  • pymupdf is the safe default: instant, no models, works everywhere
  • marker-pdf is for OCR, scanned docs, equations, and complex layouts; install it only when needed
  • Both helper scripts accept `--help` for full usage
  • marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use
  • For Word docs: `pip install python-docx` (better than OCR because it parses the actual structure)
  • For PowerPoint: see the `powerpoint` skill (uses python-pptx)