mistral-pdf-to-markdown
Mistral PDF to Markdown Converter
Convert PDF documents to Markdown format using Mistral's OCR API. Automatically extracts text, formatting, and images.
When to Use
- Converting research papers or documents to Markdown
- Extracting text from scanned PDFs (OCR capability)
- Preserving document structure with headers and formatting
- Extracting embedded images from PDFs
Quick Start
Use the conversion script from this skill's directory:
bash
# Convert entire PDF
python scripts/convert_pdf_to_markdown.py input.pdf output.md

# Convert specific pages
python scripts/convert_pdf_to_markdown.py input.pdf output.md --pages "1-5"
python scripts/convert_pdf_to_markdown.py input.pdf output.md --pages "1,3,5"
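The two `--pages` forms shown above (a range such as "1-5" and a comma-separated list such as "1,3,5") can be illustrated with a small parser. This is only a sketch of how such a spec might be expanded; `parse_pages` is a hypothetical helper, not a function from the script, and the script's handling of mixed or malformed specs may differ.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1-5" or "1,3,5" into 1-based page numbers."""
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range like "1-5" includes both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages


print(parse_pages("1-5"))    # [1, 2, 3, 4, 5]
print(parse_pages("1,3,5"))  # [1, 3, 5]
```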
Output Structure
Output/PDFConversions/
├── document.md        # Markdown with text and image references
└── images/
    ├── img-0.jpeg     # Extracted images
    ├── img-1.jpeg
    └── ...
Usage in Code
python
from pathlib import Path
import subprocess

# Run conversion script
result = subprocess.run([
    "python",
    ".claude/skills/mistral-pdf-to-markdown/scripts/convert_pdf_to_markdown.py",
    "input.pdf",
    "Output/PDFConversions/output.md",
    "--pages", "1-10"
], capture_output=True, text=True)
print(result.stdout)
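The call above prints stdout but does not check whether the script succeeded. A minimal sketch of a wrapper that does is shown here; `build_command` and `convert` are hypothetical names, and treating a non-zero exit status as failure is an assumption about the script, not documented behavior.

```python
import subprocess
import sys

# Script path taken from the example above; adjust it to your checkout.
SCRIPT = ".claude/skills/mistral-pdf-to-markdown/scripts/convert_pdf_to_markdown.py"


def build_command(pdf_path, output_path, pages=None):
    """Assemble the argument list for the conversion script."""
    command = [sys.executable, SCRIPT, str(pdf_path), str(output_path)]
    if pages:
        command += ["--pages", pages]
    return command


def convert(pdf_path, output_path, pages=None):
    """Run the script; raise if it exits non-zero (assumed to signal failure)."""
    result = subprocess.run(
        build_command(pdf_path, output_path, pages),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"conversion failed: {result.stderr.strip()}")
    return result.stdout
```

A call such as `convert("input.pdf", "Output/PDFConversions/output.md", pages="1-10")` would then mirror the example above. Using `sys.executable` keeps the subprocess on the same interpreter as the caller, which helps when the required packages live in a virtual environment.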
Key Features
- Markdown formatting: Preserves headers, lists, and structure
- Image extraction: Saves images to the images/ subfolder automatically
- Page selection: Extract specific pages or ranges
- Scanned PDF support: True OCR capability for image-based PDFs
- Relative paths: Image references use the images/img-X.jpeg format
Requirements
The script requires:
- Mistral API key in Notes/.env (line 2: mistral_api_key=...)
- Python packages: mistralai, python-dotenv, pypdf
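Before running a conversion, it can help to confirm that both requirements are met. The following is a rough preflight sketch using only the standard library; `missing_modules` and `read_api_key` are hypothetical helpers, and the key lookup scans every line of the file rather than relying on the key being on line 2, which may be stricter or looser than what the script itself does.

```python
from importlib.util import find_spec
from pathlib import Path

# Import names for the packages listed above
# (python-dotenv is imported as "dotenv").
REQUIRED_MODULES = ["mistralai", "dotenv", "pypdf"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


def read_api_key(env_path="Notes/.env", key="mistral_api_key"):
    """Return the value for key from a KEY=VALUE file, or None if absent."""
    path = Path(env_path)
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip("'\"") or None
    return None


if __name__ == "__main__":
    print("Missing packages:", missing_modules())
    print("API key configured:", read_api_key() is not None)
```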
Common Use Cases
Convert Research Paper
bash
python scripts/convert_pdf_to_markdown.py \
  "Data/papers/research.pdf" \
  "Notes/Paper Markdown/research.md"
Extract Specific Sections
bash
# Extract pages 10-20 (introduction and methods)
python scripts/convert_pdf_to_markdown.py \
  "paper.pdf" \
  "Notes/Paper Markdown/intro_methods.md" \
  --pages "10-20"
Extract Figures Only
bash
# Extract pages with figures
python scripts/convert_pdf_to_markdown.py \
  "paper.pdf" \
  "Notes/Paper Markdown/figures.md" \
  --pages "25,27,30,35"
Error Handling
API Key Not Found:
Error: Mistral API key not found in Notes/.env
→ Add mistral_api_key=YOUR_KEY to line 2 of Notes/.env
Page Out of Range:
Warning: Page 100 out of range, skipping
→ Check PDF page count and adjust page selection
API Rate Limit:
→ Wait a moment and retry, or reduce page count per request
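The "wait a moment and retry" advice for rate limits can be automated with exponential backoff. This is a generic sketch: `retry_with_backoff` is a hypothetical helper, and it assumes the wrapped action raises `RuntimeError` when a conversion fails. How the script actually reports a rate limit is not documented here.

```python
import time


def retry_with_backoff(action, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return action()
        except RuntimeError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2 ** attempt)
```

With the defaults, the waits between attempts are 2, 4, and 8 seconds. The `sleep` parameter exists so the delay can be replaced in tests.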
Notes
- Images are saved as JPEG files in the images/ subfolder
- Markdown image references are automatically updated to images/img-X.jpeg
- Large PDFs may take longer to process due to API limits
- For simple text extraction without OCR, consider using the pdf skill instead
- Scanned PDFs benefit most from this skill's OCR capability
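Since the generated Markdown points at files under images/, one quick sanity check after a conversion is to confirm that every referenced image exists. The sketch below assumes the references use standard Markdown image syntax with an images/ relative path; `missing_images` is a hypothetical helper, and the exact alt text the script emits is not known.

```python
import re
from pathlib import Path

# Matches Markdown image links such as ![img-0.jpeg](images/img-0.jpeg)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\((images/[^)\s]+)\)")


def missing_images(markdown_path):
    """Return referenced image paths that do not exist next to the Markdown file."""
    markdown_path = Path(markdown_path)
    text = markdown_path.read_text(encoding="utf-8")
    return [
        ref for ref in IMAGE_REF.findall(text)
        if not (markdown_path.parent / ref).is_file()
    ]
```

An empty list from `missing_images("Output/PDFConversions/document.md")` would indicate that all image references resolve.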
See Also
- pdf skill - For local PDF manipulation without API calls
- reference.md - Additional details about the Mistral OCR API