mistral-pdf-to-markdown

Mistral PDF to Markdown Converter

Convert PDF documents to Markdown format using Mistral's OCR API. Automatically extracts text, formatting, and images.

When to Use

  • Converting research papers or documents to Markdown
  • Extracting text from scanned PDFs (OCR capability)
  • Preserving document structure with headers and formatting
  • Extracting embedded images from PDFs

Quick Start

Use the conversion script from this skill's directory.

Convert entire PDF:

    python scripts/convert_pdf_to_markdown.py input.pdf output.md

Convert specific pages (a range or a comma-separated list):

    python scripts/convert_pdf_to_markdown.py input.pdf output.md --pages "1-5"
    python scripts/convert_pdf_to_markdown.py input.pdf output.md --pages "1,3,5"
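
To make the two `--pages` forms concrete, here is a minimal sketch of how such a spec could be expanded into page numbers. The helper name `parse_pages` is hypothetical and this is not taken from the conversion script; in particular, whether the script accepts mixed specs such as "1-3,7" is an assumption.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1-5" or "1,3,5" into sorted 1-based page numbers.

    Hypothetical helper for illustration; the real script may parse differently.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_pages("1-5"))    # [1, 2, 3, 4, 5]
print(parse_pages("1,3,5"))  # [1, 3, 5]
```

Returning a sorted, de-duplicated list keeps overlapping specs predictable.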

Output Structure

Output/PDFConversions/
├── document.md          # Markdown with text and image references
└── images/
    ├── img-0.jpeg      # Extracted images
    ├── img-1.jpeg
    └── ...
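
Because the Markdown file refers to its images by relative path, it can be useful to check that every reference resolves after a conversion. The sketch below assumes the layout shown above; `find_image_refs` and `missing_images` are hypothetical helper names, not part of the skill.

```python
import re
from pathlib import Path

# Matches Markdown image syntax whose target lives under images/,
# e.g. ![img-0.jpeg](images/img-0.jpeg)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\((images/[^)\s]+)\)")


def find_image_refs(markdown_text: str) -> list[str]:
    """Return the relative image paths referenced in the Markdown text."""
    return IMAGE_REF.findall(markdown_text)


def missing_images(markdown_path: Path) -> list[str]:
    """List referenced images that do not exist next to the Markdown file."""
    text = markdown_path.read_text(encoding="utf-8")
    base = markdown_path.parent  # references are relative to the .md file
    return [ref for ref in find_image_refs(text) if not (base / ref).exists()]


sample = "Intro\n\n![img-0.jpeg](images/img-0.jpeg)\n\nText ![fig](images/img-1.jpeg)"
print(find_image_refs(sample))  # ['images/img-0.jpeg', 'images/img-1.jpeg']
```

Calling `missing_images(Path("Output/PDFConversions/document.md"))` would then report any broken references.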

Usage in Code

    from pathlib import Path
    import subprocess

    # Path to the conversion script, relative to the current working directory
    script = Path(".claude/skills/mistral-pdf-to-markdown/scripts/convert_pdf_to_markdown.py")

    # Run conversion script on pages 1-10 and capture its output as text
    result = subprocess.run(
        ["python", str(script), "input.pdf", "Output/PDFConversions/output.md", "--pages", "1-10"],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
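
When the script is called from several places, a small wrapper can keep the command construction in one spot. The following is a sketch only: `build_command` and `convert` are hypothetical names, and the check on `returncode` assumes the script exits with a non-zero status on failure, which this document does not confirm.

```python
import subprocess
import sys
from typing import Optional

SCRIPT = ".claude/skills/mistral-pdf-to-markdown/scripts/convert_pdf_to_markdown.py"


def build_command(input_pdf: str, output_md: str, pages: Optional[str] = None) -> list[str]:
    """Assemble the argument list; --pages is added only when requested."""
    command = [sys.executable, SCRIPT, input_pdf, output_md]
    if pages:
        command += ["--pages", pages]
    return command


def convert(input_pdf: str, output_md: str, pages: Optional[str] = None) -> str:
    """Run the conversion and return its stdout.

    Assumption: a non-zero exit status signals failure.
    """
    result = subprocess.run(
        build_command(input_pdf, output_md, pages),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"conversion failed: {result.stderr.strip()}")
    return result.stdout
```

Using `sys.executable` instead of the literal `"python"` runs the script with the same interpreter as the caller, which avoids surprises inside virtual environments.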

Key Features

  • Markdown formatting: Preserves headers, lists, and structure
  • Image extraction: Saves images to the images/ subfolder automatically
  • Page selection: Extract specific pages or ranges
  • Scanned PDF support: True OCR capability for image-based PDFs
  • Relative paths: Image references use ![...](images/img-X.jpeg)

Requirements

The script requires:
  • Mistral API key in Notes/.env (line 2: mistral_api_key=...)
  • Python packages: mistralai, python-dotenv, pypdf
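
Before running a conversion, it can help to confirm that the key is actually present in the `.env` file. The script presumably loads it through python-dotenv; the standard-library sketch below is only an illustrative check, with a hypothetical helper name `parse_env_value`. It looks the key up by name rather than by line position, so whether the script really depends on the key being on line 2 is not something this sketch verifies.

```python
from typing import Optional


def parse_env_value(env_text: str, name: str) -> Optional[str]:
    """Return the value of `name` from .env-style text, or None if absent or empty."""
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        if key.strip() == name:
            # Drop surrounding whitespace and optional quotes
            return value.strip().strip("'\"") or None
    return None


sample_env = "# project settings\nmistral_api_key=YOUR_KEY\n"
print(parse_env_value(sample_env, "mistral_api_key"))  # YOUR_KEY
```

For the real file, the text could be read with `Path("Notes/.env").read_text(encoding="utf-8")` and passed to the same function.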

Common Use Cases

Convert Research Paper

    python scripts/convert_pdf_to_markdown.py \
      "Data/papers/research.pdf" \
      "Notes/Paper Markdown/research.md"

Extract Specific Sections

Extract pages 10-20 (introduction and methods):

    python scripts/convert_pdf_to_markdown.py \
      "paper.pdf" \
      "Notes/Paper Markdown/intro_methods.md" \
      --pages "10-20"

Extract Figures Only

Extract pages with figures:

    python scripts/convert_pdf_to_markdown.py \
      "paper.pdf" \
      "Notes/Paper Markdown/figures.md" \
      --pages "25,27,30,35"

Error Handling

API Key Not Found:

    Error: Mistral API key not found in Notes/.env

→ Add mistral_api_key=YOUR_KEY to line 2 of Notes/.env

Page Out of Range:

    Warning: Page 100 out of range, skipping

→ Check PDF page count and adjust page selection

API Rate Limit:

→ Wait a moment and retry, or reduce page count per request
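
The "wait a moment and retry" advice for rate limits can be automated with exponential backoff. This is a generic sketch, not the script's behavior: `retry_with_backoff` is a hypothetical helper, and since this document does not say which exception signals a rate limit, the exception types to retry on are left as a parameter (defaulting to `RuntimeError` purely for illustration).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (RuntimeError,),
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on the given exceptions with doubling delays."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists so the delay can be replaced in tests. In practice `operation` would wrap a single conversion call, ideally for a small batch of pages.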

Notes

  • Images are saved as JPEG files in the images/ subfolder
  • Markdown image references are automatically updated to images/img-X.jpeg
  • Large PDFs may take longer to process due to API limits
  • For simple text extraction without OCR, consider using the pdf skill instead
  • Scanned PDFs benefit most from this skill's OCR capability
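
For large PDFs, one way to keep each request small is to convert the document in batches of pages. The sketch below generates `--pages` specs for consecutive batches. `chunk_page_specs` is a hypothetical helper, and the batch size of 10 is an arbitrary illustration rather than a documented limit.

```python
def chunk_page_specs(total_pages: int, chunk_size: int = 10) -> list[str]:
    """Split pages 1..total_pages into consecutive --pages specs."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    specs = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A single-page batch is written as "21" rather than "21-21"
        specs.append(str(start) if start == end else f"{start}-{end}")
    return specs


print(chunk_page_specs(25))  # ['1-10', '11-20', '21-25']
```

Each spec would go to a separate run of the script with its own output file, so combining the resulting Markdown files is left to the caller.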

See Also

  • pdf skill - For local PDF manipulation without API calls
  • reference.md - Additional details about the Mistral OCR API