ocr-and-documents
PDF & Document Extraction
For DOCX: use `python-docx` (parses actual document structure, far better than OCR).
For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support).
This skill covers PDFs and scanned documents.
Step 1: Remote URL Available?
If the document has a URL, always try `web_extract` first:
```python
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
```
This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.
Only use local extraction when the file is local, `web_extract` fails, or you need batch processing.
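The routing rule above can be sketched as a small helper. This is only an illustration: `choose_extraction_route` and its return labels are hypothetical names, not part of `web_extract` or the helper scripts, and it assumes a source counts as remote only when it is an `http` or `https` URL.

```python
from urllib.parse import urlparse


def choose_extraction_route(source: str,
                            web_extract_failed: bool = False,
                            batch: bool = False) -> str:
    """Return "web_extract" or "local" for a document source.

    Hypothetical helper: the name and the return labels are illustrative.
    """
    parsed = urlparse(source)
    # Assumption: only http(s) URLs with a host count as remote documents.
    is_remote = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    if is_remote and not web_extract_failed and not batch:
        return "web_extract"
    # Local file, failed web_extract, or batch job: extract locally.
    return "local"


print(choose_extraction_route("https://arxiv.org/pdf/2402.03300"))
print(choose_extraction_route("document.pdf"))
```

Keeping the decision in one function makes the fallback explicit: a caller can retry with `web_extract_failed=True` after a failed remote attempt.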
Step 2: Choose Local Extractor
| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |
Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
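Whether OCR is needed is often not known up front. One possible heuristic, sketched below, is to look at how much text each page yields from plain extraction. The function name `needs_ocr` and the 50-character threshold are assumptions made for illustration; neither comes from pymupdf or marker-pdf.

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is scanned from the text extracted per page.

    Assumption: a page with fewer than `min_chars_per_page` non-whitespace
    characters is treated as image-only. The threshold is illustrative.
    """
    if not page_texts:
        return False  # no pages, nothing to OCR
    sparse_pages = sum(
        1 for text in page_texts
        if len("".join(text.split())) < min_chars_per_page
    )
    # Call the document scanned when most pages carry almost no text.
    return sparse_pages > len(page_texts) / 2


print(needs_ocr(["", " ", "\n"]))
print(needs_ocr(["x" * 200, "y" * 300]))
```

With pymupdf installed, the input could be built as `[page.get_text() for page in pymupdf.open("document.pdf")]`, which costs little because text extraction is fast.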
If the user needs marker capabilities but the system lacks ~5GB free disk:
"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."
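The free-space check behind that message can be approximated with the standard library. This is a sketch under stated assumptions: the 5GB requirement is taken from the figure quoted above, `pick_extractor` and its return labels are hypothetical, and none of this is claimed to match what `scripts/extract_marker.py --check` does internally.

```python
import shutil

MARKER_REQUIRED_GB = 5.0  # taken from the ~5GB figure quoted above


def free_disk_gb(path: str = ".") -> float:
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def pick_extractor(needs_marker: bool, free_gb: float,
                   required_gb: float = MARKER_REQUIRED_GB) -> str:
    """Hypothetical decision helper; the return labels are illustrative."""
    if not needs_marker:
        return "pymupdf"
    if free_gb >= required_gb:
        return "marker-pdf"
    # Not enough room for PyTorch and the models: let the user decide.
    return "ask-user"


print(f"{free_disk_gb():.1f} GB free")
print(pick_extractor(needs_marker=True, free_gb=free_disk_gb()))
```

The value from `free_disk_gb()` is what would fill the `[X]` placeholder in the message to the user.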
pymupdf (lightweight)
```bash
pip install pymupdf pymupdf4llm
```
**Via helper script**:
```bash
python scripts/extract_pymupdf.py document.pdf               # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
python scripts/extract_pymupdf.py document.pdf --tables      # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Specific pages
```
**Inline**:
```bash
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
```
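When working inline rather than through the helper script, a page selection such as `0-4` has to be turned into indices by hand. The sketch below shows one way to do that. It assumes 0-based, inclusive ranges, which is a guess based on the `--pages 0-4` example; `parse_page_range` is a hypothetical helper and may not match how `extract_pymupdf.py` parses the flag.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Turn a spec such as "0-4" or "7" into 0-based page indices.

    Assumptions: indices are 0-based, ranges are inclusive, and indices
    past the end of the document are silently dropped.
    """
    start_text, sep, end_text = spec.partition("-")
    start = int(start_text)
    end = int(end_text) if sep else start
    if start < 0 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return [i for i in range(start, end + 1) if i < page_count]


print(parse_page_range("0-4", page_count=10))
print(parse_page_range("8-12", page_count=10))
```

With pymupdf, the result could drive extraction as `for i in parse_page_range("0-4", doc.page_count): print(doc[i].get_text())`.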
marker-pdf (high-quality OCR)
```bash
# Check disk space first
python scripts/extract_marker.py --check
pip install marker-pdf
```
**Via helper script**:
```bash
python scripts/extract_marker.py document.pdf                   # Markdown
python scripts/extract_marker.py document.pdf --json            # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/ # Save images
python scripts/extract_marker.py scanned.pdf                    # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm         # LLM-boosted accuracy
```
**CLI** (installed with marker-pdf):
```bash
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch
```
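Before starting a large marker-pdf job, it can help to estimate how long it will take. The sketch below uses only the per-page speeds from the comparison table (about 1 to 14 s/page on CPU, about 0.2 s/page on GPU). The function name is hypothetical, and the estimate ignores model loading time and any speedup from `--workers`.

```python
# Per-page speeds (seconds) copied from the comparison table: (low, high).
SECONDS_PER_PAGE = {"cpu": (1.0, 14.0), "gpu": (0.2, 0.2)}


def estimate_seconds(pages: int, device: str = "cpu") -> tuple[float, float]:
    """Rough (low, high) processing time in seconds for `pages` pages."""
    low, high = SECONDS_PER_PAGE[device]
    return pages * low, pages * high


low, high = estimate_seconds(200, "cpu")
print(f"200 pages on CPU: about {low / 60:.0f} to {high / 60:.0f} minutes")
```

The wide CPU range is a reminder that a long scanned document may be worth routing to a GPU machine or to `web_extract` when a URL exists.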
arXiv Papers
```python
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
```
Split, Merge & Search
pymupdf handles these natively; use `execute_code` or inline Python:
```python
# Split: extract pages 1-5 to a new PDF
import pymupdf
doc = pymupdf.open("report.pdf")
new = pymupdf.open()
for i in range(5):  # page indices are 0-based
    new.insert_pdf(doc, from_page=i, to_page=i)
new.save("pages_1-5.pdf")
```
```python
# Merge multiple PDFs
import pymupdf
result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    result.insert_pdf(pymupdf.open(path))
result.save("merged.pdf")
```
```python
# Search for text across all pages
import pymupdf
doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))
```
No extra dependencies needed: pymupdf covers split, merge, search, and text extraction in one package.
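The split example can be generalized to cut a whole document into fixed-size parts. The following is a minimal sketch, not a tested utility: `chunk_ranges`, `split_pdf`, and the `part_<n>.pdf` naming are hypothetical. The pymupdf calls it relies on are `open`, `insert_pdf`, `page_count`, `save`, and `close`.

```python
def chunk_ranges(page_count: int, chunk_size: int) -> list[tuple[int, int]]:
    """Inclusive, 0-based (from_page, to_page) pairs covering every page."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]


def split_pdf(path: str, chunk_size: int, prefix: str = "part") -> list[str]:
    """Write each chunk to '<prefix>_<n>.pdf' and return the file names."""
    import pymupdf  # imported lazily so chunk_ranges works without it

    doc = pymupdf.open(path)
    names = []
    ranges = chunk_ranges(doc.page_count, chunk_size)
    for n, (first, last) in enumerate(ranges, start=1):
        part = pymupdf.open()  # new empty PDF
        part.insert_pdf(doc, from_page=first, to_page=last)
        name = f"{prefix}_{n}.pdf"
        part.save(name)
        part.close()
        names.append(name)
    doc.close()
    return names


print(chunk_ranges(12, 5))
```

Separating the range arithmetic from the file handling keeps the off-by-one logic easy to check on its own, since `insert_pdf` treats `to_page` as inclusive.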
---
Notes
- `web_extract` is always first choice for URLs
- pymupdf is the safe default: instant, no models, works everywhere
- marker-pdf is for OCR, scanned docs, equations, and complex layouts; install it only when needed
- Both helper scripts accept `--help` for full usage
- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use
- For Word docs: `pip install python-docx` (better than OCR because it parses actual structure)
- For PowerPoint: see the `powerpoint` skill (uses python-pptx)