markitdown-skill
MarkItDown Skill
Microsoft's Python utility for converting various file formats to Markdown
for LLM and text analysis pipelines.
Overview
MarkItDown converts documents while preserving structure (headings, lists,
tables, links). It's optimized for LLM consumption rather than
human-readable output.
Supported Formats
| Category | Formats |
|---|---|
| Documents | PDF, Word (DOCX), PowerPoint (PPTX), Excel (XLSX, XLS) |
| Media | Images (EXIF + OCR), Audio (WAV, MP3 transcription) |
| Web | HTML, YouTube URLs, Wikipedia, RSS/Atom feeds |
| Data | CSV, JSON, XML, Jupyter notebooks (.ipynb) |
| Archives | ZIP (iterates contents), EPub |
| Email | Outlook MSG files |
Quick Start
Installation
```bash
# Full installation (recommended)
pip install 'markitdown[all]'

# Minimal with specific formats
pip install 'markitdown[pdf,docx,pptx]'

# Using uv
uv pip install 'markitdown[all]'
```
Optional Dependencies
| Extra | Description |
|---|---|
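| `[all]` | All optional dependencies |
| `[pdf]` | PDF file support |
| `[docx]` | Word documents |
| `[pptx]` | PowerPoint presentations |
| `[xlsx]` | Excel spreadsheets |
| `[xls]` | Legacy Excel files |
| `[outlook]` | Outlook MSG files |
| `[az-doc-intel]` | Azure Document Intelligence |
| `[audio-transcription]` | WAV/MP3 transcription |
| `[youtube-transcription]` | YouTube video transcripts |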
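When only a few formats are needed, the minimal install can be derived from the files at hand. The sketch below is one way to do that. The suffix-to-extra mapping and the helper name `pip_command_for` are assumptions made for illustration, so check the extra names against the installed MarkItDown release before relying on them.

```python
from pathlib import Path

# Assumption: this suffix-to-extra mapping is illustrative and not part of
# MarkItDown itself; verify the extra names against your installed release.
EXTRA_FOR_SUFFIX = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".xls": "xls",
    ".msg": "outlook",
    ".wav": "audio-transcription",
    ".mp3": "audio-transcription",
}


def pip_command_for(paths):
    """Return a pip command that installs only the extras the given files need."""
    extras = set()
    for path in paths:
        extra = EXTRA_FOR_SUFFIX.get(Path(path).suffix.lower())
        if extra:
            extras.add(extra)
    if not extras:
        # No optional format detected: the base package is enough.
        return "pip install markitdown"
    return "pip install 'markitdown[" + ",".join(sorted(extras)) + "]'"


print(pip_command_for(["report.pdf", "slides.PPTX", "notes.txt"]))
```

Suffixes are lowercased before lookup, so `slides.PPTX` and `slides.pptx` resolve to the same extra.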
Command-Line Usage
```bash
# Basic conversion
markitdown document.pdf > output.md

# Specify output file
markitdown document.pdf -o output.md

# Pipe input
cat document.pdf | markitdown > output.md

# With Azure Document Intelligence
markitdown document.pdf -o output.md -d -e "<endpoint>"
```
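The same CLI call can be driven from a script, which is convenient when the `markitdown` executable is installed in a different environment from the calling code. This is a minimal sketch: the helper names `build_command` and `convert_with_cli` and the 120-second default timeout are assumptions, and only the `-o` flag shown above is used.

```python
import subprocess
from pathlib import Path


def build_command(source, output):
    """Build the argument list for `markitdown SOURCE -o OUTPUT`."""
    return ["markitdown", str(source), "-o", str(output)]


def convert_with_cli(source, output, timeout=120):
    """Run the CLI and return the output path.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the conversion exceeds `timeout` seconds.
    """
    subprocess.run(build_command(source, output), check=True, timeout=timeout)
    return Path(output)


# Show the command that would run; call convert_with_cli(...) to execute it.
print(build_command("document.pdf", "output.md"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.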
Python API
```python
from markitdown import MarkItDown

# Basic conversion
md = MarkItDown()
result = md.convert("document.xlsx")
print(result.text_content)

# With LLM for image descriptions
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail"
)
result = md.convert("image.jpg")
print(result.text_content)

# With Azure Document Intelligence
md = MarkItDown(docintel_endpoint="<your-endpoint>")
result = md.convert("complex-document.pdf")
print(result.text_content)
```
Common Use Cases
Batch Convert Directory
```python
from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()
input_dir = Path("./documents")
output_dir = Path("./markdown")
output_dir.mkdir(exist_ok=True)

for file in input_dir.glob("*"):
    if file.is_file():
        try:
            result = md.convert(str(file))
            output_file = output_dir / f"{file.stem}.md"
            # Write UTF-8 explicitly so non-ASCII text survives on every platform
            output_file.write_text(result.text_content, encoding="utf-8")
            print(f"Converted: {file.name}")
        except Exception as e:
            # Report the failure and continue with the next file
            print(f"Failed: {file.name} - {e}")
```

Process for LLM Context
```python
from markitdown import MarkItDown

def prepare_for_llm(file_path: str) -> str:
    """Convert document to LLM-ready markdown."""
    md = MarkItDown()
    result = md.convert(file_path)

    # Add source reference
    content = f"# Source: {file_path}\n\n{result.text_content}"
    return content

# Use with your LLM
context = prepare_for_llm("report.pdf")
```
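Converted documents are often longer than a model's context window, so the Markdown usually has to be split before it is sent. The sketch below shows one possible approach: it cuts at heading lines and packs sections greedily. The function name `split_markdown` and the 4000-character default are assumptions, and character count stands in for a real token count.

```python
def split_markdown(text: str, max_chars: int = 4000) -> list[str]:
    """Split Markdown into chunks, breaking only before heading lines.

    Caveat: any line starting with '#' counts as a heading, including
    comment lines inside fenced code blocks.
    """
    # Pass 1: cut the text into sections that each start at a heading.
    sections: list[str] = []
    current: list[str] = []
    for line in text.splitlines(keepends=True):
        if line.startswith("#") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))

    # Pass 2: pack sections greedily; an oversized section stays whole.
    chunks: list[str] = []
    buffer = ""
    for section in sections:
        if buffer and len(buffer) + len(section) > max_chars:
            chunks.append(buffer)
            buffer = ""
        buffer += section
    if buffer:
        chunks.append(buffer)
    return chunks


sample = "# A\nalpha\n# B\nbeta\n"
for chunk in split_markdown(sample, max_chars=10):
    print(repr(chunk))
```

Joining the returned chunks reproduces the input exactly, so no content is lost at the boundaries.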
Extract YouTube Transcript
```bash
# CLI
markitdown "https://www.youtube.com/watch?v=VIDEO_ID" > transcript.md
```

```python
# Python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```
Image OCR with AI Description
```python
from markitdown import MarkItDown
from openai import OpenAI

# Initialize with LLM support
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"
)

# Convert image with AI description
result = md.convert("screenshot.png")
print(result.text_content)
```
Convert Jupyter Notebook
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("analysis.ipynb")
print(result.text_content)  # Code cells, outputs, markdown
```

Extract Wikipedia Content
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://en.wikipedia.org/wiki/Python")
print(result.text_content)  # Main article content only
```

Parse RSS Feed
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://example.com/feed.xml")
print(result.text_content)  # Feed entries as markdown
```

Plugin System
MarkItDown supports third-party plugins for extended functionality.
```bash
# List installed plugins
markitdown --list-plugins

# Enable plugins during conversion
markitdown --use-plugins document.pdf
```

```python
# Enable plugins in Python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")
```

> Search GitHub for `#markitdown-plugin` to find available plugins.

MCP Server Integration
MarkItDown offers an MCP (Model Context Protocol) server for integration
with LLM applications like Claude Desktop.
```bash
# Install MCP server
pip install markitdown-mcp

# Or from source
git clone https://github.com/microsoft/markitdown.git
cd markitdown/packages/markitdown-mcp
pip install -e .
```

See [markitdown-mcp][mcp-repo] for configuration details.

[mcp-repo]: https://github.com/microsoft/markitdown/tree/main/packages/markitdown-mcp

Docker Usage
```bash
# Build image
docker build -t markitdown:latest .

# Convert file
docker run --rm -i markitdown:latest < document.pdf > output.md
```
Troubleshooting
| Issue | Solution |
|---|---|
| Missing dependencies | Install with `pip install 'markitdown[all]'` |
| PDF extraction fails | Try Azure Document Intelligence for complex PDFs |
| Image text not extracted | Ensure OCR dependencies are installed or use LLM mode |
| Large file timeout | Process in chunks or use streaming |
| Plugin not found | Run `markitdown --list-plugins` to check which plugins are installed |
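The "PDF extraction fails" row suggests a fallback pattern: try the default converter first and retry with an Azure Document Intelligence converter only when the result is unusable. The sketch below is one hedged way to express that. The helper name `convert_with_fallback` and the "blank output means failure" rule are assumptions, not MarkItDown behavior. Both converters are passed in, so any object with a `convert()` method that returns something with `text_content` works.

```python
def convert_with_fallback(path, primary, fallback=None):
    """Convert with `primary`; retry with `fallback` on error or blank output.

    Assumption: blank text is treated as a failed extraction. That is a
    heuristic chosen for this sketch, not a rule defined by MarkItDown.
    """
    try:
        text = primary.convert(path).text_content
    except Exception:
        if fallback is None:
            raise  # Nothing to fall back to: surface the original error.
        text = ""
    if text.strip() or fallback is None:
        return text
    return fallback.convert(path).text_content


# Intended usage (requires markitdown and a real endpoint):
#   from markitdown import MarkItDown
#   primary = MarkItDown()
#   fallback = MarkItDown(docintel_endpoint="<your-endpoint>")
#   text = convert_with_fallback("complex-document.pdf", primary, fallback)
```

Keeping the fallback optional means the paid service is only called for documents the default converter cannot handle.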
Common Errors
```bash
# ModuleNotFoundError for specific format
pip install 'markitdown[pdf]'  # Install missing dependency

# Azure authentication
export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="<endpoint>"
export AZURE_DOCUMENT_INTELLIGENCE_KEY="<key>"
```
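This document does not state whether MarkItDown reads these variables on its own, so a cautious approach is to read the endpoint in Python and pass it explicitly through the `docintel_endpoint` argument shown earlier. The helper name `docintel_kwargs` is an assumption, and credential handling for the key is deliberately left out of this sketch.

```python
import os


def docintel_kwargs(env=os.environ):
    """Return MarkItDown keyword arguments derived from the environment.

    Yields {"docintel_endpoint": ...} when the endpoint variable is set
    and non-blank, otherwise an empty dict so the default converter is used.
    """
    endpoint = env.get("AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT", "").strip()
    return {"docintel_endpoint": endpoint} if endpoint else {}


# Intended usage (requires markitdown):
#   from markitdown import MarkItDown
#   md = MarkItDown(**docintel_kwargs())
print(docintel_kwargs({"AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT": "https://example.invalid"}))
```

Returning an empty dict when the variable is missing lets the same code path run with or without Azure configured.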
Requirements
- Python >= 3.10
- Virtual environment recommended
```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
.venv\Scripts\activate     # Windows

# Install
pip install 'markitdown[all]'
```
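Both requirements can be checked from Python before installing. The sketch below uses only the standard library, and the helper names `python_is_supported` and `in_virtualenv` are assumptions chosen for illustration.

```python
import sys

MIN_PYTHON = (3, 10)  # Minimum version stated in the requirements above


def python_is_supported(version=None):
    """Return True if `version` (default: the running interpreter) is >= 3.10."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= MIN_PYTHON


def in_virtualenv():
    """Return True when running inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


print("Python OK:", python_is_supported())
print("Virtual environment active:", in_virtualenv())
```

The `sys.prefix` comparison detects environments created with `python -m venv`; other tools may need a different check.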
References
- references/cli-reference.md - Complete CLI options
- references/api-reference.md - Python API details
- references/examples.md - Extended examples
- references/advanced-features.md - Custom converters, URI handling
- GitHub: https://github.com/microsoft/markitdown
- PyPI: https://pypi.org/project/markitdown/