docling
Docling Document Parser
Docling is a document parsing library that converts PDFs, Word documents, PowerPoint, images, and other formats into structured data with advanced layout understanding.
Quick Start
Basic document conversion:
python
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # URL, Path, or BytesIO
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
Core Concepts
DocumentConverter
The main entry point for document conversion. Supports various input formats and conversion options.
python
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.document_converter import PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
# Basic converter (all formats enabled)
converter = DocumentConverter()
# Restricted formats
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX]
)
# Custom pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
ConversionResult
All conversion operations return a ConversionResult containing:
- document: The parsed DoclingDocument
- status: ConversionStatus.SUCCESS, PARTIAL_SUCCESS, or FAILURE
- errors: List of errors encountered during conversion
- input: Information about the source document
python
from docling.datamodel.base_models import ConversionStatus
result = converter.convert("document.pdf")
if result.status == ConversionStatus.SUCCESS:
    markdown = result.document.export_to_markdown()
    html = result.document.export_to_html()
    data = result.document.export_to_dict()
Supported Formats
Input Formats
- Documents: PDF, DOCX, PPTX, XLSX
- Markup: HTML, Markdown, AsciiDoc
- Data: CSV, JSON (Docling format)
- Images: PNG, JPEG, TIFF, BMP, WEBP
- Audio: WAV, MP3
- Video Text: WebVTT
- Schema-specific: USPTO XML, JATS XML, METS-GBS
Output Formats
- Markdown: export_to_markdown() or save_as_markdown()
- HTML: export_to_html() or save_as_html()
- JSON: export_to_dict() or save_as_json() (note: no export_to_json() method)
- Text: export_to_text() or export_to_markdown(strict_text=True) or save_as_markdown(strict_text=True)
- DocTags: export_to_doctags() or save_as_doctags()
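The in-memory export methods listed above can be selected by name with a small dispatch helper. This is a sketch only: EXPORT_METHODS and export_document are hypothetical names, not part of Docling, and the helper assumes the document object exposes the export methods named in the list.

```python
from typing import Any

# Hypothetical lookup table: output format name -> export method from the list above.
EXPORT_METHODS = {
    "markdown": "export_to_markdown",
    "html": "export_to_html",
    "json": "export_to_dict",
    "text": "export_to_text",
    "doctags": "export_to_doctags",
}


def export_document(document: Any, fmt: str) -> Any:
    """Call the export method matching fmt on a DoclingDocument-like object."""
    try:
        method_name = EXPORT_METHODS[fmt]
    except KeyError:
        raise ValueError(f"Unsupported output format: {fmt!r}") from None
    return getattr(document, method_name)()
```

With a real conversion result, a call such as export_document(result.document, "markdown") would return the same string as result.document.export_to_markdown().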
Common Patterns
Single File Conversion
python
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
# Export to different formats
markdown = result.document.export_to_markdown()
html = result.document.export_to_html()
json_data = result.document.export_to_dict()
# Or save directly to file
result.document.save_as_markdown("output.md")
result.document.save_as_html("output.html")
result.document.save_as_json("output.json")
Batch Processing
See references/batch.md for details on convert_all().
URL Conversion
python
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("https://example.com/document.pdf")
Binary Stream Conversion
python
from io import BytesIO
from docling.datamodel.base_models import DocumentStream
with open("document.pdf", "rb") as f:
    buf = BytesIO(f.read())
source = DocumentStream(name="document.pdf", stream=buf)
result = converter.convert(source)
Format-Specific Options
python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
# Configure PDF-specific options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options.lang = ["en", "es"]
pipeline_options.do_table_structure = True
pipeline_options.generate_page_images = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
Resource Limits
python
converter = DocumentConverter()
# Limit file size (bytes) and page count
result = converter.convert(
    "large_document.pdf",
    max_file_size=20_971_520,  # 20 MiB (20 * 1024 * 1024 bytes)
    max_num_pages=100
)
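The max_file_size value is a byte count, and 20_971_520 is 20 * 1024 * 1024. A minimal sketch of computing such a limit and checking a local file against it before conversion follows; mib and within_size_limit are hypothetical helpers, not Docling APIs.

```python
from pathlib import Path


def mib(n: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return n * 1024 * 1024


def within_size_limit(path: str, max_bytes: int) -> bool:
    """Return True if the local file at path is no larger than max_bytes."""
    return Path(path).stat().st_size <= max_bytes


MAX_FILE_SIZE = mib(20)  # same value as the max_file_size used above
```

A check such as within_size_limit("large_document.pdf", MAX_FILE_SIZE) could run before convert() to skip oversized files early, while passing max_file_size=MAX_FILE_SIZE keeps the limit enforced by the converter as well.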
Document Chunking
See references/chunking.md for RAG integration.
DoclingDocument Structure
The DoclingDocument is a Pydantic model representing parsed content:
python
# Access document structure
doc = result.document
# Content items (lists)
doc.texts            # TextItem instances (paragraphs, headings, etc.)
doc.tables           # TableItem instances
doc.pictures         # PictureItem instances
doc.key_value_items  # Key-value pairs
# Structure (tree nodes)
doc.body       # Main content hierarchy
doc.furniture  # Headers, footers, page numbers
doc.groups     # Lists, chapters, sections
# Iterate all elements in reading order
for item, level in doc.iterate_items():
    # Not every item type has a .text attribute (tables and pictures, for example)
    text = getattr(item, "text", "")
    print(f"{'  ' * level}{item.label}: {text[:50]}")
Advanced Features
OCR Configuration
python
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
    TesseractOcrOptions,
    TesseractCliOcrOptions,
    OcrMacOptions,
    RapidOcrOptions,
)
# EasyOCR (default)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(lang=["en", "de"])
# Tesseract
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions(lang=["eng", "deu"])
# RapidOCR
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = RapidOcrOptions()
Table Extraction Options
python
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableFormerMode,
)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True
# Use cell matching (map to PDF cells)
pipeline_options.table_structure_options.do_cell_matching = True
# Or use predicted cells
pipeline_options.table_structure_options.do_cell_matching = False
# Choose accuracy mode
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
Page Images
python
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling_core.types.doc import ImageRefMode
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_page_images = True  # Needed for HTML export with images
# Export with embedded images
result.document.save_as_html(
    "output.html",
    image_mode=ImageRefMode.EMBEDDED
)
Error Handling
python
from docling.datamodel.base_models import ConversionStatus
result = converter.convert("document.pdf")
if result.status == ConversionStatus.SUCCESS:
    print("Conversion successful")
elif result.status == ConversionStatus.PARTIAL_SUCCESS:
    print("Partial conversion:")
    for error in result.errors:
        print(f"  {error.error_message}")
else:  # FAILURE
    print("Conversion failed:")
    for error in result.errors:
        print(f"  {error.error_message}")
For batch processing with error handling:
python
# Continue processing on errors
results = converter.convert_all(
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    raises_on_error=False
)
for result in results:
    if result.status == ConversionStatus.SUCCESS:
        result.document.save_as_markdown(f"{result.input.file.stem}.md")
    else:
        print(f"Failed: {result.input.file}")
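For larger batches it can help to tally outcomes rather than print each one. The sketch below assumes only that every result exposes the status attribute described earlier; summarize_statuses is a hypothetical helper, and the stand-in objects replace real ConversionResult instances so the example runs without Docling installed.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Iterable


def summarize_statuses(results: Iterable) -> Counter:
    """Count results per status; enum members are reported by their name."""
    return Counter(getattr(r.status, "name", str(r.status)) for r in results)


# Stand-in results for illustration only (plain strings instead of ConversionStatus members).
fake_results = [
    SimpleNamespace(status="SUCCESS"),
    SimpleNamespace(status="FAILURE"),
    SimpleNamespace(status="SUCCESS"),
]
summary = summarize_statuses(fake_results)
print(dict(summary))
```

Since convert_all() yields results lazily, collecting them into a list first keeps them available for both the summary and any later export step.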
CLI Usage
bash
# Basic conversion
docling document.pdf
# Convert to specific output
docling --to md document.pdf
# With custom model path
docling --artifacts-path /path/to/models document.pdf
# Using VLM pipeline
docling --pipeline vlm --vlm-model granite_docling document.pdf
Reference Documentation
- Parsing Options - DocumentConverter initialization, format-specific options, OCR configuration
- Batch Processing - convert_all(), error handling, concurrency patterns
- Chunking - HierarchicalChunker, HybridChunker, RAG integration
- Output Formats - export_to_markdown(), export_to_html(), export_to_dict(), document structure
Key Types
- DocumentConverter: Main conversion class
- ConversionResult: Result of conversion with document and status
- DoclingDocument: Unified document representation (Pydantic model)
- InputFormat: Enum of supported input formats
- ConversionStatus: SUCCESS, PARTIAL_SUCCESS, FAILURE
- PdfPipelineOptions: Configuration for PDF pipeline
- ImageRefMode: EMBEDDED, REFERENCED, PLACEHOLDER
Integration Examples
LangChain
python
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownTextSplitter
converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()
splitter = MarkdownTextSplitter(chunk_size=1000)
chunks = splitter.split_text(markdown)
LlamaIndex
python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from llama_index.core import Document
converter = DocumentConverter()
result = converter.convert("document.pdf")
chunker = HybridChunker()
chunks = list(chunker.chunk(result.document))
documents = [
    Document(text=chunk.text, metadata=chunk.meta.export_json_dict())
    for chunk in chunks
]
Notes
- Docling uses a synchronous API (no native async support)
- Models are downloaded automatically on first use (can be prefetched)
- Supports local execution for air-gapped environments
- Supports GPU acceleration for OCR and table detection
- Default models run on CPU; GPU requires configuration
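Because the API is synchronous, one common way to call it from async code is to push the blocking call onto a worker thread. The sketch below uses only the standard library (Python 3.9+). convert_many is a hypothetical helper, and convert stands for any blocking callable, such as the convert method of a DocumentConverter. Whether one converter instance can safely be shared across threads is not stated here, so the default handles one document at a time.

```python
import asyncio
from typing import Any, Callable, Iterable, List


async def convert_many(
    convert: Callable[[str], Any],
    sources: Iterable[str],
    limit: int = 1,
) -> List[Any]:
    """Run a blocking convert callable in worker threads, at most limit at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def convert_one(source: str) -> Any:
        async with semaphore:
            # asyncio.to_thread keeps the event loop responsive while convert blocks.
            return await asyncio.to_thread(convert, source)

    # gather returns results in the same order as the input sources.
    return await asyncio.gather(*(convert_one(s) for s in sources))
```

A call might look like asyncio.run(convert_many(converter.convert, ["doc1.pdf", "doc2.pdf"])), where converter is an existing DocumentConverter.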