docling

Docling Document Parser

Docling is a document parsing library that converts PDFs, Word documents, PowerPoint, images, and other formats into structured data with advanced layout understanding.

Quick Start

Basic document conversion:
python
from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # URL, local path, or DocumentStream
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

Core Concepts

DocumentConverter

The main entry point for document conversion. It supports a range of input formats and conversion options.
python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Basic converter (all formats enabled)
converter = DocumentConverter()

# Restricted formats
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX]
)

# Custom pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

ConversionResult

All conversion operations return a ConversionResult containing:
  • document: The parsed DoclingDocument
  • status: ConversionStatus.SUCCESS, PARTIAL_SUCCESS, or FAILURE
  • errors: List of errors encountered during conversion
  • input: Information about the source document
python
from docling.datamodel.base_models import ConversionStatus

result = converter.convert("document.pdf")

if result.status == ConversionStatus.SUCCESS:
    markdown = result.document.export_to_markdown()
    html = result.document.export_to_html()
    data = result.document.export_to_dict()

Supported Formats

Input Formats

  • Documents: PDF, DOCX, PPTX, XLSX
  • Markup: HTML, Markdown, AsciiDoc
  • Data: CSV, JSON (Docling format)
  • Images: PNG, JPEG, TIFF, BMP, WEBP
  • Audio: WAV, MP3
  • Video Text: WebVTT
  • Schema-specific: USPTO XML, JATS XML, METS-GBS
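
When a folder holds mixed files, it can help to sort them by category before handing them to the converter. The following is a minimal sketch using only the standard library. The lookup table and the `categorize` helper are hypothetical, and the extension spellings (`.jpg`, `.adoc`, `.vtt`, and so on) are assumptions derived from the list above, not Docling's own format detection.

```python
from pathlib import Path

# Hypothetical lookup table built from the format list above.
# Docling performs its own format detection; this is only a pre-filter.
CATEGORY_BY_EXTENSION = {
    ".pdf": "Documents", ".docx": "Documents",
    ".pptx": "Documents", ".xlsx": "Documents",
    ".html": "Markup", ".md": "Markup", ".adoc": "Markup",
    ".csv": "Data", ".json": "Data",
    ".png": "Images", ".jpeg": "Images", ".jpg": "Images",
    ".tiff": "Images", ".bmp": "Images", ".webp": "Images",
    ".wav": "Audio", ".mp3": "Audio",
    ".vtt": "Video Text",
}


def categorize(path):
    """Return the category for a path, or None if the extension is not listed."""
    return CATEGORY_BY_EXTENSION.get(Path(path).suffix.lower())


print(categorize("report.PDF"))  # Documents
print(categorize("notes.txt"))   # None
```

Files that map to None can be skipped or logged before any conversion is attempted.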

Output Formats

  • Markdown: export_to_markdown() or save_as_markdown()
  • HTML: export_to_html() or save_as_html()
  • JSON: export_to_dict() or save_as_json() (note: there is no export_to_json() method)
  • Text: export_to_text() or export_to_markdown(strict_text=True) or save_as_markdown(strict_text=True)
  • DocTags: export_to_doctags() or save_as_doctags()
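
One way to choose among the save methods listed above is to dispatch on the output file's extension. This is a sketch under stated assumptions: the method names come from the list, while the `save_document` helper and the extension mapping (including `.doctags`) are hypothetical and not part of Docling.

```python
from pathlib import Path

# Method names are taken from the list above; the extension mapping
# (including ".doctags") is an assumption made for illustration.
SAVE_METHOD_BY_EXTENSION = {
    ".md": "save_as_markdown",
    ".html": "save_as_html",
    ".json": "save_as_json",
    ".doctags": "save_as_doctags",
}


def save_document(doc, output_path):
    """Save `doc` with the method matching the output file extension."""
    suffix = Path(output_path).suffix.lower()
    try:
        method_name = SAVE_METHOD_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"No save method known for {suffix!r}") from None
    # Look the method up on the document object and call it with the path.
    getattr(doc, method_name)(output_path)
    return method_name
```

With a converted document in hand, `save_document(result.document, "output.md")` would call `save_as_markdown("output.md")`; an unknown extension raises ValueError instead of writing anything.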

Common Patterns

Single File Conversion

python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to different formats
markdown = result.document.export_to_markdown()
html = result.document.export_to_html()
json_data = result.document.export_to_dict()

# Or save directly to file
result.document.save_as_markdown("output.md")
result.document.save_as_html("output.html")
result.document.save_as_json("output.json")

Batch Processing

See references/batch.md for details on convert_all().

URL Conversion

python
converter = DocumentConverter()
result = converter.convert("https://example.com/document.pdf")

Binary Stream Conversion

python
from io import BytesIO
from docling.datamodel.base_models import DocumentStream

with open("document.pdf", "rb") as f:
    buf = BytesIO(f.read())

source = DocumentStream(name="document.pdf", stream=buf)
result = converter.convert(source)

Format-Specific Options

python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configure PDF-specific options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options.lang = ["en", "es"]
pipeline_options.do_table_structure = True
pipeline_options.generate_page_images = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

Resource Limits

python
converter = DocumentConverter()

# Limit file size (bytes) and page count
result = converter.convert(
    "large_document.pdf",
    max_file_size=20_971_520,  # 20 MB
    max_num_pages=100,
)
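
The 20_971_520 figure above is 20 × 1024 × 1024 bytes. A small helper makes such limits easier to read and adjust. The `megabytes_to_bytes` name is hypothetical and the function is not part of Docling.

```python
def megabytes_to_bytes(megabytes):
    """Convert binary megabytes (1024 * 1024 bytes each) to a whole number of bytes."""
    return int(megabytes * 1024 * 1024)


print(megabytes_to_bytes(20))  # 20971520
```

The result can be passed directly as the max_file_size argument shown above.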

Document Chunking

See references/chunking.md for RAG integration.

DoclingDocument Structure

The DoclingDocument is a Pydantic model representing parsed content:
python
# Access document structure
doc = result.document

# Content items (lists)
doc.texts            # TextItem instances (paragraphs, headings, etc.)
doc.tables           # TableItem instances
doc.pictures         # PictureItem instances
doc.key_value_items  # Key-value pairs

# Structure (tree nodes)
doc.body       # Main content hierarchy
doc.furniture  # Headers, footers, page numbers
doc.groups     # Lists, chapters, sections

# Iterate all elements in reading order
# (tables and pictures have no .text, so fall back to an empty preview)
for item, level in doc.iterate_items():
    text = getattr(item, "text", "")
    print(f"{' ' * level}{item.label}: {text[:50]}")
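
Not every item yielded during iteration necessarily exposes a text attribute (tables and pictures, for instance), so an outline printer is safer with a fallback. The sketch below isolates that formatting logic. The `outline_line` helper and the SimpleNamespace stand-ins are hypothetical; real items would come from doc.iterate_items().

```python
from types import SimpleNamespace


def outline_line(item, level, width=50):
    """Format one indented outline line; items without text get an empty preview."""
    preview = getattr(item, "text", "") or ""
    return f"{'  ' * level}{item.label}: {preview[:width]}"


# Stand-in objects for illustration only.
heading = SimpleNamespace(label="section_header", text="Introduction")
table = SimpleNamespace(label="table")  # deliberately has no .text attribute

print(outline_line(heading, 0))  # section_header: Introduction
print(outline_line(table, 1))
```

The same function can be called inside the iteration loop as `outline_line(item, level)`.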

Advanced Features

OCR Configuration

python
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
    TesseractOcrOptions,
    TesseractCliOcrOptions,
    OcrMacOptions,
    RapidOcrOptions,
)

# EasyOCR (default)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(lang=["en", "de"])

# Tesseract
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions(lang=["eng", "deu"])

# RapidOCR
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = RapidOcrOptions()

Table Extraction Options

python
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableFormerMode,
)

pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True

# Use cell matching (map to PDF cells)
pipeline_options.table_structure_options.do_cell_matching = True

# Or use predicted cells
pipeline_options.table_structure_options.do_cell_matching = False

# Choose accuracy mode
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

Page Images

python
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling_core.types.doc import ImageRefMode

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_page_images = True  # Needed for HTML export with images

# Export with embedded images
# (result comes from a converter configured with pipeline_options)
result.document.save_as_html(
    "output.html",
    image_mode=ImageRefMode.EMBEDDED,
)

Error Handling

python
from docling.datamodel.base_models import ConversionStatus

result = converter.convert("document.pdf")

if result.status == ConversionStatus.SUCCESS:
    print("Conversion successful")
elif result.status == ConversionStatus.PARTIAL_SUCCESS:
    print("Partial conversion:")
    for error in result.errors:
        print(f"  {error.error_message}")
else:  # FAILURE
    print("Conversion failed:")
    for error in result.errors:
        print(f"  {error.error_message}")

For batch processing with error handling:
python
# Continue processing on errors
results = converter.convert_all(
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    raises_on_error=False,
)

for result in results:
    if result.status == ConversionStatus.SUCCESS:
        result.document.save_as_markdown(f"{result.input.file.stem}.md")
    else:
        print(f"Failed: {result.input.file}")
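
For larger batches, a per-status tally is often more useful than per-file messages. The sketch below counts results by status name. `FakeStatus` is a hypothetical stand-in for ConversionStatus so the example runs without Docling installed, and `summarize_statuses` is likewise an illustrative helper, not a Docling function.

```python
from collections import Counter
from enum import Enum, auto
from types import SimpleNamespace


class FakeStatus(Enum):
    """Stand-in for ConversionStatus so the sketch runs without Docling."""
    SUCCESS = auto()
    PARTIAL_SUCCESS = auto()
    FAILURE = auto()


def summarize_statuses(results):
    """Count results per status name; works with any object exposing .status.name."""
    return Counter(result.status.name for result in results)


# Stand-in results for illustration only.
fake_results = [
    SimpleNamespace(status=FakeStatus.SUCCESS),
    SimpleNamespace(status=FakeStatus.FAILURE),
    SimpleNamespace(status=FakeStatus.SUCCESS),
]
summary = summarize_statuses(fake_results)
print(dict(summary))  # {'SUCCESS': 2, 'FAILURE': 1}
```

Because the helper consumes its argument in a single pass, collect the output of convert_all() into a list first if the individual results are needed afterwards.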

CLI Usage

bash
# Basic conversion
docling document.pdf

# Convert to specific output
docling --to md document.pdf

# With custom model path
docling --artifacts-path /path/to/models document.pdf

# Using VLM pipeline
docling --pipeline vlm --vlm-model granite_docling document.pdf
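
When the CLI is driven from a script, building the argument list programmatically avoids shell-quoting mistakes. This is a minimal sketch using only the flags shown above (`--to` and `--artifacts-path`). The `build_docling_command` helper is hypothetical, and the `json` value passed to `--to` is an assumption based on the output formats this document lists.

```python
import shlex


def build_docling_command(source, to_format=None, artifacts_path=None):
    """Build an argument list for the docling CLI from the flags shown above."""
    command = ["docling"]
    if to_format is not None:
        command += ["--to", to_format]
    if artifacts_path is not None:
        command += ["--artifacts-path", str(artifacts_path)]
    command.append(str(source))
    return command


command = build_docling_command("my report.pdf", to_format="json")
print(shlex.join(command))  # docling --to json 'my report.pdf'
```

The returned list can be handed to `subprocess.run(command, check=True)` without going through a shell.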

Reference Documentation

  • Parsing Options - DocumentConverter initialization, format-specific options, OCR configuration
  • Batch Processing - convert_all(), error handling, concurrency patterns
  • Chunking - HierarchicalChunker, HybridChunker, RAG integration
  • Output Formats - export_to_markdown(), export_to_html(), export_to_dict(), document structure

Key Types

  • DocumentConverter: Main conversion class
  • ConversionResult: Result of conversion with document and status
  • DoclingDocument: Unified document representation (Pydantic model)
  • InputFormat: Enum of supported input formats
  • ConversionStatus: SUCCESS, PARTIAL_SUCCESS, FAILURE
  • PdfPipelineOptions: Configuration for PDF pipeline
  • ImageRefMode: EMBEDDED, REFERENCED, PLACEHOLDER

Integration Examples

LangChain

python
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownTextSplitter

converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()

splitter = MarkdownTextSplitter(chunk_size=1000)
chunks = splitter.split_text(markdown)

LlamaIndex

python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from llama_index.core import Document

converter = DocumentConverter()
result = converter.convert("document.pdf")

chunker = HybridChunker()
chunks = list(chunker.chunk(result.document))

documents = [
    Document(text=chunk.text, metadata=chunk.meta.export_json_dict())
    for chunk in chunks
]

Notes

  • Docling uses a synchronous API (no native async support)
  • Models are downloaded automatically on first use (can be prefetched)
  • Supports local execution for air-gapped environments
  • Supports GPU acceleration for OCR and table detection
  • Default models run on CPU; GPU requires configuration
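
Because the API is synchronous, one common way to call it from async code is to push each blocking call onto a worker thread. The sketch below shows the pattern with the standard library's asyncio.to_thread (Python 3.9+). `blocking_convert` is a hypothetical stand-in for a call such as converter.convert(source), and whether a single DocumentConverter can be shared safely across threads is an assumption this document does not confirm.

```python
import asyncio
import time


def blocking_convert(source):
    """Stand-in for a blocking call such as converter.convert(source)."""
    time.sleep(0.1)  # simulate conversion work
    return f"converted:{source}"


async def convert_many(sources):
    # asyncio.to_thread runs each blocking call in a worker thread so the
    # event loop stays responsive; gather returns results in input order.
    tasks = [asyncio.to_thread(blocking_convert, source) for source in sources]
    return await asyncio.gather(*tasks)


results = asyncio.run(convert_many(["a.pdf", "b.pdf"]))
print(results)  # ['converted:a.pdf', 'converted:b.pdf']
```

Threads help here only if the underlying work releases the GIL or waits on I/O; for CPU-bound conversion a process pool may be the better fit.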