extracting-mistral-ocr

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Mistral OCR PDF extraction

Mistral OCR PDF提取

Quick start (default)

快速开始（默认配置）

Run the bundled script to OCR a local PDF and write Markdown + JSON outputs:

bash

python {baseDir}/scripts/mistral_ocr_extract.py --input path/to/file.pdf --out out/ocr

Output directory layout:

```
combined.md
```
(all pages concatenated)
```
pages/page-000.md
```
(per-page markdown)
```
raw_response.json
```
(full OCR response)
```
images/
```
(decoded embedded images, if requested)
```
tables/
```
(separate tables, if requested)

运行捆绑脚本对本地PDF进行OCR识别，并生成Markdown + JSON格式的输出：

bash

python {baseDir}/scripts/mistral_ocr_extract.py --input path/to/file.pdf --out out/ocr

输出目录结构：

```
combined.md
```
（所有页面内容拼接）
```
pages/page-000.md
```
（单页Markdown内容）
```
raw_response.json
```
（完整OCR响应数据）
```
images/
```
（解码后的嵌入图片，若请求提取）
```
tables/
```
（单独提取的表格，若请求提取）

Workflow

工作流程

Pick input mode
- Local PDF (most common): upload via Files API, then OCR via
```
file_id
```
  .
- Public URL: OCR directly via
```
document_url
```
  .
Choose output fidelity (defaults are safe for RAG)
- Keep
```
table_format=inline
```
  unless the user explicitly wants tables split out.
- Set
```
--include-image-base64
```
  when the user needs figures/diagrams extracted.
- Use
```
--extract-header/--extract-footer
```
  if header/footer noise hurts downstream search.
Run OCR
- Use
```
scripts/mistral_ocr_extract.py
```
  to produce a deterministic on-disk artefact set.

(Optional) Structured extraction from the whole document

If the user wants fields (invoice totals, contract parties, etc.), provide an annotation prompt.
The OCR API can return a document-level
```
document_annotation
```
in addition to page markdown.

Example:

bash

python {baseDir}/scripts/mistral_ocr_extract.py \
  --input invoice.pdf \
  --out out/invoice \
  --annotation-prompt "Extract supplier_name, invoice_number, invoice_date (ISO-8601), currency, total_amount. Return JSON." \
  --annotation-format json_object

选择输入模式
- 本地PDF（最常用）：通过Files API上传，再通过
```
file_id
```
  进行OCR识别。
- 公开URL：直接通过
```
document_url
```
  进行OCR识别。
选择输出精度（默认配置适合RAG场景）
- 除非用户明确要求拆分表格，否则保持
```
table_format=inline
```
  。
- 当用户需要提取图形/图表时，设置
```
--include-image-base64
```
  参数。
- 如果页眉/页脚的干扰信息影响后续搜索，使用
```
--extract-header/--extract-footer
```
  参数。
运行OCR识别
- 使用
```
scripts/mistral_ocr_extract.py
```
  生成确定性的磁盘文件集。

（可选）从整个文档提取结构化内容

如果用户需要提取字段（如发票总额、合同方等），提供标注提示词。
OCR API除了返回页面Markdown内容外，还可以返回文档级别的
```
document_annotation
```
。

示例：

bash

python {baseDir}/scripts/mistral_ocr_extract.py \
  --input invoice.pdf \
  --out out/invoice \
  --annotation-prompt "Extract supplier_name, invoice_number, invoice_date (ISO-8601), currency, total_amount. Return JSON." \
  --annotation-format json_object

Decision rules

决策规则

If the PDF is local and not publicly accessible, upload it (the script does this automatically).
If the PDF URL is private or requires authentication, do not pass it as
```
document_url
```
; upload instead.
If output quality is critical, prefer
```
table_format=html
```
for downstream parsing over brittle regex.

若PDF为本地文件且无法公开访问：上传文件（脚本会自动完成此操作）。
若PDF URL为私有或需要身份验证：不要将其作为
```
document_url
```
传入，而是上传文件。
若输出质量至关重要：相比易出错的正则表达式，优先选择
```
table_format=html
```
以便后续解析。

Common failure modes

常见失败场景

Missing
MISTRAL_API_KEY
: set it in the environment before running.
URL OCR fails: the URL likely is not publicly accessible; upload the file.
Large files: upload supports large files, but very large PDFs may need page selection (
```
--pages
```
) or batch processing.

缺少
MISTRAL_API_KEY
：运行前需在环境变量中设置该密钥。
URL OCR识别失败：该URL可能无法公开访问，请上传文件。
大文件处理：上传支持大文件，但超大PDF可能需要选择指定页面（
```
--pages
```
参数）或分批处理。

References

参考资料

API + parameters:
```
references/mistral_ocr_api.md
```
Output mapping rules (placeholders to extracted images/tables):
```
references/output_mapping.md
```
Example annotation prompts for common document types:
```
references/annotation_prompts.md
```

API及参数：
```
references/mistral_ocr_api.md
```
输出映射规则（占位符到提取的图片/表格）：
```
references/output_mapping.md
```
常见文档类型的标注提示词示例：
```
references/annotation_prompts.md
```