extracting-mistral-ocr
Mistral OCR PDF extraction
Quick start (default)
Run the bundled script to OCR a local PDF and write Markdown + JSON outputs:

```bash
python {baseDir}/scripts/mistral_ocr_extract.py --input path/to/file.pdf --out out/ocr
```

Output directory layout:
- `combined.md` (all pages concatenated)
- `pages/page-000.md` (per-page markdown)
- `raw_response.json` (full OCR response)
- `images/` (decoded embedded images, if requested)
- `tables/` (separate tables, if requested)
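
Downstream code usually wants the per-page files back in page order. The sketch below is one way to read them, assuming the `pages/page-NNN.md` naming shown in the layout above; `load_pages` is a hypothetical helper, not part of the bundled script.

```python
from pathlib import Path


def load_pages(out_dir):
    """Return per-page markdown strings, ordered by page number.

    Assumes files are named pages/page-<number>.md, as in the layout above.
    """
    pages_dir = Path(out_dir) / "pages"
    # Sort on the numeric suffix so ordering holds even without zero padding.
    page_files = sorted(
        pages_dir.glob("page-*.md"),
        key=lambda p: int(p.stem.split("-")[1]),
    )
    return [p.read_text(encoding="utf-8") for p in page_files]
```

For example, `load_pages("out/ocr")` would return one string per page, which can then be chunked individually for retrieval.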
Workflow
1. Pick input mode
   - Local PDF (most common): upload via Files API, then OCR via `file_id`.
   - Public URL: OCR directly via `document_url`.
2. Choose output fidelity (defaults are safe for RAG)
   - Keep `table_format=inline` unless the user explicitly wants tables split out.
   - Set `--include-image-base64` when the user needs figures/diagrams extracted.
   - Use `--extract-header`/`--extract-footer` if header/footer noise hurts downstream search.
3. Run OCR
   - Use `scripts/mistral_ocr_extract.py` to produce a deterministic on-disk artefact set.
4. (Optional) Structured extraction from the whole document
   - If the user wants fields (invoice totals, contract parties, etc.), provide an annotation prompt.
   - The OCR API can return a document-level `document_annotation` in addition to page markdown.

Example:

```bash
python {baseDir}/scripts/mistral_ocr_extract.py \
  --input invoice.pdf \
  --out out/invoice \
  --annotation-prompt "Extract supplier_name, invoice_number, invoice_date (ISO-8601), currency, total_amount. Return JSON." \
  --annotation-format json_object
```
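
After a run with an annotation prompt, the extracted fields can be read back from the saved response. This is a minimal sketch, assuming `raw_response.json` is a JSON object with a top-level `document_annotation` key and that the value may be either a JSON-encoded string or an already-decoded object; `read_document_annotation` is a hypothetical helper, not part of the bundled script.

```python
import json
from pathlib import Path


def read_document_annotation(out_dir):
    """Return the document-level annotation as a Python object, or None if absent."""
    raw_path = Path(out_dir) / "raw_response.json"
    raw = json.loads(raw_path.read_text(encoding="utf-8"))
    annotation = raw.get("document_annotation")
    if annotation is None:
        return None
    # Assumption: the annotation may arrive as a JSON-encoded string;
    # decode it in that case, otherwise return it unchanged.
    if isinstance(annotation, str):
        return json.loads(annotation)
    return annotation
```

With the invoice example above, `read_document_annotation("out/invoice")` would be expected to yield a dict with keys such as `supplier_name` and `total_amount`, provided the model followed the prompt.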
Decision rules
- If the PDF is local and not publicly accessible, upload it (the script does this automatically).
- If the PDF URL is private or requires authentication, do not pass it as `document_url`; upload instead.
- If output quality is critical, prefer `table_format=html` for downstream parsing over brittle regex.
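
To illustrate the last rule, HTML tables can be walked with a real parser rather than pattern matching. The sketch below uses only the Python standard library and assumes the table arrives as ordinary `<table>`/`<tr>`/`<td>` markup; the `TableRows` class and `parse_table` function are hypothetical names, and the exact shape of the HTML the OCR API emits is not confirmed here.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect cell text from <td>/<th> elements, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Unlike a regex over pipe-delimited markdown, this keeps working when cells contain inline markup or extra whitespace.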
Common failure modes
- Missing `MISTRAL_API_KEY`: set it in the environment before running.
- URL OCR fails: the URL likely is not publicly accessible; upload the file.
- Large files: upload supports large files, but very large PDFs may need page selection (`--pages`) or batch processing.
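
The first failure mode is cheap to catch before any upload starts. A minimal preflight sketch, assuming the key is read from the `MISTRAL_API_KEY` environment variable as described above; `check_api_key` is a hypothetical helper and the bundled script may already perform an equivalent check.

```python
import os


def check_api_key(env=None):
    """Return the API key, or raise with an actionable message if it is missing."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "MISTRAL_API_KEY is not set; export it before running the OCR script."
        )
    return key
```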
References
- API + parameters: `references/mistral_ocr_api.md`
- Output mapping rules (placeholders to extracted images/tables): `references/output_mapping.md`
- Example annotation prompts for common document types: `references/annotation_prompts.md`