extracting-mistral-ocr

Mistral OCR PDF extraction

Quick start (default)

Run the bundled script to OCR a local PDF and write Markdown + JSON outputs:
bash
python {baseDir}/scripts/mistral_ocr_extract.py --input path/to/file.pdf --out out/ocr
Output directory layout:
  • combined.md (all pages concatenated)
  • pages/page-000.md (per-page markdown)
  • raw_response.json (full OCR response)
  • images/ (decoded embedded images, if requested)
  • tables/ (separate tables, if requested)
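
A minimal sketch of reading that layout back into memory, for example before chunking pages for RAG. The function name load_ocr_output is hypothetical and not part of the bundled script. It assumes UTF-8 files and zero-padded page names like page-000.md, so a plain sort returns pages in order.

```python
import json
from pathlib import Path


def load_ocr_output(out_dir):
    """Read the artefacts written under out_dir into a plain dict.

    Hypothetical helper: assumes the layout listed above, UTF-8 text,
    and zero-padded page file names (page-000.md, page-001.md, ...).
    """
    root = Path(out_dir)
    # Lexicographic sort matches page order only because names are zero-padded.
    page_files = sorted((root / "pages").glob("page-*.md"))
    raw_text = (root / "raw_response.json").read_text(encoding="utf-8")
    return {
        "combined": (root / "combined.md").read_text(encoding="utf-8"),
        "pages": [p.read_text(encoding="utf-8") for p in page_files],
        "raw": json.loads(raw_text),
    }
```

Usage: load_ocr_output("out/ocr")["pages"][0] would give the markdown of the first page. The images/ and tables/ directories are skipped here because they only exist when requested.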

Workflow

  1. Pick input mode
    • Local PDF (most common): upload via Files API, then OCR via file_id.
    • Public URL: OCR directly via document_url.
  2. Choose output fidelity (defaults are safe for RAG)
    • Keep table_format=inline unless the user explicitly wants tables split out.
    • Set --include-image-base64 when the user needs figures/diagrams extracted.
    • Use --extract-header/--extract-footer if header/footer noise hurts downstream search.
  3. Run OCR
    • Use scripts/mistral_ocr_extract.py to produce a deterministic on-disk artefact set.
  4. (Optional) Structured extraction from the whole document
    • If the user wants fields (invoice totals, contract parties, etc.), provide an annotation prompt.
    • The OCR API can return a document-level document_annotation in addition to page markdown.
    Example:
    bash
    python {baseDir}/scripts/mistral_ocr_extract.py \
      --input invoice.pdf \
      --out out/invoice \
      --annotation-prompt "Extract supplier_name, invoice_number, invoice_date (ISO-8601), currency, total_amount. Return JSON." \
      --annotation-format json_object
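
After a run like the one above, the annotation could be read back from raw_response.json roughly as follows. This is a sketch under assumptions: that document_annotation sits at the top level of the saved response, and that it may arrive either as a JSON-encoded string or as an already-parsed object. The helper names are hypothetical. The field list mirrors the example prompt.

```python
import json
from pathlib import Path

# Fields named in the example annotation prompt above.
INVOICE_FIELDS = (
    "supplier_name",
    "invoice_number",
    "invoice_date",
    "currency",
    "total_amount",
)


def read_document_annotation(raw_response_path):
    """Return the document-level annotation, or None if it is absent.

    Assumption: the key is top-level in raw_response.json and its value is
    either a JSON-encoded string or an already-decoded object.
    """
    raw = json.loads(Path(raw_response_path).read_text(encoding="utf-8"))
    annotation = raw.get("document_annotation")
    if isinstance(annotation, str):
        # Raises json.JSONDecodeError if the model did not return valid JSON.
        annotation = json.loads(annotation)
    return annotation


def missing_fields(annotation, required=INVOICE_FIELDS):
    """List required fields that are absent or empty in the annotation."""
    annotation = annotation or {}
    return [name for name in required if annotation.get(name) in (None, "")]
```

Checking missing_fields before handing the result downstream catches cases where the prompt was followed only partially.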

Decision rules

  • If the PDF is local and not publicly accessible, upload it (the script does this automatically).
  • If the PDF URL is private or requires authentication, do not pass it as document_url; upload the file instead.
  • If table fidelity is critical, prefer table_format=html so that downstream code can parse tables with an HTML parser rather than brittle regexes.
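
To illustrate the last rule, here is a minimal sketch of parsing an HTML table with Python's standard html.parser instead of regexes. The class and function names are hypothetical, and the exact HTML that the OCR API emits is not shown in this document, so the sketch assumes plain table/tr/th/td markup without nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in an HTML fragment.

    Assumes flat tables: nested tables and rowspan/colspan are not handled.
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def parse_table(html):
    """Return a list of rows, each a list of stripped cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Usage: parse_table applied to a two-row table yields one list per row, header first, which maps directly onto csv.writer or a DataFrame constructor.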

Common failure modes

  • Missing MISTRAL_API_KEY: set it in the environment before running.
  • URL OCR fails: the URL likely is not publicly accessible; upload the file.
  • Large files: upload supports large files, but very large PDFs may need page selection (--pages) or batch processing.
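
The first two failure modes can be caught before any request is sent. The sketch below is not part of the bundled script, and preflight and page_batches are hypothetical names. It can only tell a URL from a local path. It cannot tell whether a URL is actually public, so a failed URL OCR still means falling back to upload. The batching helper assumes 0-based page indices, and the value format that --pages expects is not specified here.

```python
import os
from urllib.parse import urlparse


def preflight(source, env=None):
    """Return (mode, problems) for an OCR input.

    mode is "document_url" for http(s) URLs and "upload" for anything else.
    problems lists configuration issues found before calling the API.
    """
    env = os.environ if env is None else env
    problems = []
    if not env.get("MISTRAL_API_KEY"):
        problems.append("MISTRAL_API_KEY is not set")
    scheme = urlparse(source).scheme.lower()
    # Local paths (including Windows drive letters) have no http(s) scheme.
    mode = "document_url" if scheme in ("http", "https") else "upload"
    return mode, problems


def page_batches(total_pages, batch_size):
    """Split page indices 0..total_pages-1 into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(range(start, min(start + batch_size, total_pages)))
        for start in range(0, total_pages, batch_size)
    ]
```

Each batch from page_batches could then drive one run of the script with a page selection, keeping individual requests small.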

References

  • API + parameters: references/mistral_ocr_api.md
  • Output mapping rules (placeholders to extracted images/tables): references/output_mapping.md
  • Example annotation prompts for common document types: references/annotation_prompts.md