ocr-and-documents

PDF & Document Extraction

For DOCX: use `python-docx` (parses actual document structure, far better than OCR). For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support). This skill covers PDFs and scanned documents.

Step 1: Remote URL Available?

If the document has a URL, always try `web_extract` first:

```
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
```

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
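
The routing rule above can be sketched as a small function. This is only an illustration: `choose_route` and its parameters are hypothetical names, and treating any `http`/`https` source as "has a URL" is an assumption.

```python
from urllib.parse import urlparse


def choose_route(source: str, web_extract_failed: bool = False, batch: bool = False) -> str:
    """Return "web_extract" or "local" following the Step 1 rule (hypothetical helper)."""
    # Assumption: a source counts as remote when it has an http(s) scheme.
    is_remote = urlparse(source).scheme in ("http", "https")
    # Local extraction applies when the file is local, web_extract already
    # failed, or the job is a batch run.
    if is_remote and not web_extract_failed and not batch:
        return "web_extract"
    return "local"
```

For example, `choose_route("report.pdf")` falls through to local extraction, while an arXiv PDF URL is routed to `web_extract` unless a previous attempt failed.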

Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
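
As a rough sketch, the decision rule can be written as a set lookup. The capability labels (`"ocr"`, `"equations"`, `"forms"`, `"complex_layout"`) and the function name are made up for illustration; only the rule itself comes from the text above.

```python
from typing import Iterable

# Capabilities named in the decision rule that only marker-pdf provides.
# The label strings are hypothetical; pick whatever vocabulary suits your code.
MARKER_ONLY = frozenset({"ocr", "equations", "forms", "complex_layout"})


def choose_extractor(needs: Iterable[str]) -> str:
    """Return "marker-pdf" if any requested capability requires it, else "pymupdf"."""
    return "marker-pdf" if MARKER_ONLY & set(needs) else "pymupdf"
```

Anything not in `MARKER_ONLY` (plain text, basic tables, embedded images) stays on the lightweight path.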

If the user needs marker-pdf capabilities but the system has less than ~5GB of free disk space, tell them:

"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."

pymupdf (lightweight)

```bash
pip install pymupdf pymupdf4llm
```

**Via helper script**:
```bash
python scripts/extract_pymupdf.py document.pdf              # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
python scripts/extract_pymupdf.py document.pdf --tables      # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Specific pages
```

**Inline**:
```bash
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
```


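For Markdown output without the helper script, a minimal sketch using `pymupdf4llm.to_markdown` might look like the following. How the helper script actually interprets `--pages 0-4` is not shown here, so the parser assumes 0-based page numbers with an inclusive end; `parse_page_range` and `pdf_to_markdown` are hypothetical names.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Turn "0-4" or "3" into 0-based page indices (assumed: end is inclusive)."""
    first, _, last = spec.partition("-")
    start = int(first)
    end = int(last) if last else start
    if start < 0 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return list(range(start, end + 1))


def pdf_to_markdown(path: str, pages: str = "") -> str:
    """Convert a text-based PDF to Markdown, optionally limited to a page range."""
    import pymupdf4llm  # imported lazily so the parser above works without it

    selected = parse_page_range(pages) if pages else None
    return pymupdf4llm.to_markdown(path, pages=selected)
```

Calling `pdf_to_markdown("document.pdf", "0-4")` would convert the first five pages under these assumptions.
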
marker-pdf (high-quality OCR)

```bash
# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf
```

**Via helper script**:
```bash
python scripts/extract_marker.py document.pdf                # Markdown
python scripts/extract_marker.py document.pdf --json         # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
python scripts/extract_marker.py scanned.pdf                 # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm      # LLM-boosted accuracy
```

**CLI** (installed with marker-pdf):
```bash
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch
```
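
If you prefer to drive the CLI from Python, one possible preflight-then-run wrapper is sketched below. The function names are hypothetical, the 5GB threshold is the approximate figure quoted in this document, and only the `marker_single` command with `--output_dir` shown above is used.

```python
import shutil
import subprocess
from typing import List

MARKER_REQUIRED_GB = 5.0  # approximate install footprint quoted in this document


def free_disk_gb(path: str = ".") -> float:
    """Free space on the filesystem holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def marker_single_cmd(pdf_path: str, output_dir: str) -> List[str]:
    """Build the argv for the `marker_single` invocation shown above."""
    return ["marker_single", pdf_path, "--output_dir", output_dir]


def run_marker_single(pdf_path: str, output_dir: str = "./output") -> None:
    """Run marker_single if it is installed; otherwise explain what is missing."""
    if shutil.which("marker_single") is None:
        if free_disk_gb() < MARKER_REQUIRED_GB:
            raise RuntimeError(
                f"marker-pdf is not installed and only {free_disk_gb():.1f}GB is free "
                f"(about {MARKER_REQUIRED_GB:g}GB is needed)"
            )
        raise RuntimeError("marker_single not found; run `pip install marker-pdf` first")
    subprocess.run(marker_single_cmd(pdf_path, output_dir), check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.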

arXiv Papers

```
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
```
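
The abstract and full-paper URLs differ only in the `abs`/`pdf` path segment, so both can be derived from one identifier. The helper below is a hypothetical sketch that assumes new-style arXiv identifiers such as `2402.03300` (optionally with a version suffix); old-style identifiers are not handled.

```python
import re
from typing import Dict

# New-style arXiv identifier, e.g. 2402.03300 or 2402.03300v2 (assumption:
# old-style identifiers such as hep-th/9901001 are out of scope here).
_ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}(?:v\d+)?")


def arxiv_urls(ref: str) -> Dict[str, str]:
    """Return abstract and PDF URLs for an arXiv identifier or arXiv URL."""
    match = _ARXIV_ID.search(ref)
    if match is None:
        raise ValueError(f"no arXiv identifier found in {ref!r}")
    paper_id = match.group(0)
    return {
        "abstract": f"https://arxiv.org/abs/{paper_id}",
        "pdf": f"https://arxiv.org/pdf/{paper_id}",
    }
```

Either URL can then be passed to `web_extract`, depending on whether the abstract or the full paper is needed.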

Split, Merge & Search

pymupdf handles these natively. Use `execute_code` or inline Python:

```python
# Split: extract pages 1-5 to a new PDF
import pymupdf

doc = pymupdf.open("report.pdf")
new = pymupdf.open()  # empty in-memory PDF
new.insert_pdf(doc, from_page=0, to_page=4)  # page indices are 0-based, inclusive
new.save("pages_1-5.pdf")
```

```python
# Merge multiple PDFs
import pymupdf

result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    result.insert_pdf(pymupdf.open(path))  # append every page of each file
result.save("merged.pdf")
```

```python
# Search for text across all pages
import pymupdf

doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")  # one rectangle per match
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))
```

No extra dependencies needed: pymupdf covers split, merge, search, and text extraction in one package.
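
The same `insert_pdf` call extends to splitting a long PDF into fixed-size parts. The sketch below is one way to do it; `page_chunks`, `split_pdf`, and the output naming scheme are hypothetical choices rather than part of this skill.

```python
from typing import List, Tuple


def page_chunks(page_count: int, size: int) -> List[Tuple[int, int]]:
    """Return inclusive, 0-based (first, last) page ranges of at most `size` pages."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [
        (start, min(start + size, page_count) - 1)
        for start in range(0, page_count, size)
    ]


def split_pdf(path: str, size: int, prefix: str = "part") -> List[str]:
    """Write consecutive `size`-page PDFs and return the file names created."""
    import pymupdf  # imported lazily so page_chunks works without it

    doc = pymupdf.open(path)
    outputs = []
    for first, last in page_chunks(doc.page_count, size):
        part = pymupdf.open()  # empty PDF to receive this chunk
        part.insert_pdf(doc, from_page=first, to_page=last)
        name = f"{prefix}_{first + 1}-{last + 1}.pdf"  # 1-based pages in the name
        part.save(name)
        part.close()
        outputs.append(name)
    doc.close()
    return outputs
```

Keeping the range arithmetic in `page_chunks` makes the boundary cases (a short final chunk, an empty document) easy to check without opening a file.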

---

Notes

- `web_extract` is always the first choice for URLs.
- pymupdf is the safe default: instant, no models, works everywhere.
- marker-pdf is for OCR, scanned docs, equations, and complex layouts; install it only when needed.
- Both helper scripts accept `--help` for full usage.
- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use.
- For Word docs: `pip install python-docx` (better than OCR because it parses the actual structure).
- For PowerPoint: see the `powerpoint` skill (uses python-pptx).
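
To see how much space the downloaded models occupy, a small standard-library sketch such as the following can help. It assumes the cache location given in the note above; `dir_size_bytes` is a hypothetical helper, and symlinks are skipped on the assumption that they point at files already counted elsewhere in the cache.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of regular files under `root`; a missing directory counts as 0."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):  # skip symlinks to avoid double counting
                total += os.path.getsize(full)
    return total


cache_dir = os.path.expanduser("~/.cache/huggingface/")
print(f"{cache_dir}: {dir_size_bytes(cache_dir) / 1e9:.2f} GB")
```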