markdown-tools
Markdown Tools
Convert documents to high-quality markdown with intelligent multi-tool orchestration.
Dual Mode Architecture
| Mode | Speed | Quality | Use Case |
|---|---|---|---|
| Quick (default) | Fast | Good | Drafts, simple documents |
| Heavy | Slower | Best | Final documents, complex layouts |
Quick Start
Installation
```bash
# Required: PDF/DOCX/PPTX support
uv tool install "markitdown[pdf]"
pip install pymupdf4llm
brew install pandoc
```
Basic Conversion
```bash
# Quick Mode (default) - fast, single best tool
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

# Heavy Mode - multi-tool parallel execution with merge
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md --heavy

# Check available tools
uv run scripts/convert.py --list-tools
```
Tool Selection Matrix
| Format | Quick Mode Tool | Heavy Mode Tools |
|---|---|---|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |
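The matrix can be read as a simple lookup from file type and mode to a list of tools. The sketch below illustrates that reading only; `TOOL_MATRIX` and `select_tools` are hypothetical names, keying by file extension is an assumption, and `scripts/convert.py` may organise its selection differently.

```python
from pathlib import Path

# Hypothetical lookup table mirroring the matrix; keys are file extensions
# (an assumption, since the matrix itself lists formats, not extensions).
TOOL_MATRIX = {
    ".pdf": {"quick": ["pymupdf4llm"], "heavy": ["pymupdf4llm", "markitdown"]},
    ".docx": {"quick": ["pandoc"], "heavy": ["pandoc", "markitdown"]},
    ".pptx": {"quick": ["markitdown"], "heavy": ["markitdown", "pandoc"]},
    ".xlsx": {"quick": ["markitdown"], "heavy": ["markitdown"]},
}


def select_tools(path: str, heavy: bool = False) -> list[str]:
    """Return the tool names the matrix lists for this file type and mode."""
    suffix = Path(path).suffix.lower()
    if suffix not in TOOL_MATRIX:
        raise ValueError(f"Unsupported format: {suffix or '(none)'}")
    return list(TOOL_MATRIX[suffix]["heavy" if heavy else "quick"])


print(select_tools("document.pdf"))             # ['pymupdf4llm']
print(select_tools("slides.PPTX", heavy=True))  # ['markitdown', 'pandoc']
```

For the actual set of tools available on a machine, `uv run scripts/convert.py --list-tools` remains the authoritative check.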
Tool Characteristics
- pymupdf4llm: LLM-optimized PDF conversion with native table detection and image extraction
- markitdown: Microsoft's universal converter, good for Office formats
- pandoc: Excellent structure preservation for DOCX/PPTX
Heavy Mode Workflow
Heavy Mode runs multiple tools in parallel and selects the best segments:
- Parallel Execution: Run all applicable tools simultaneously
- Segment Analysis: Parse each output into segments (tables, headings, images, paragraphs)
- Quality Scoring: Score each segment based on completeness and structure
- Intelligent Merge: Select best version of each segment across tools
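The first two steps above (parallel execution and segment analysis) could look roughly like the following. This is a minimal sketch, not the code in `scripts/convert.py`: the converters are stand-in functions so the example runs without any tools installed, and `run_parallel`, `classify`, and `parse_segments` are hypothetical names. Splitting on blank lines is an assumed segmentation rule.

```python
import re
from concurrent.futures import ThreadPoolExecutor


# Stand-in converters (hypothetical); real ones would wrap the actual tools.
def fake_pymupdf4llm(path):
    return "# Report\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\nSome text."


def fake_markitdown(path):
    return "# Report\n\n| a | b |\n|---|---|\n\nSome text, slightly longer."


def run_parallel(path, converters):
    """Step 1: run every converter at once, return {tool_name: markdown}."""
    with ThreadPoolExecutor(max_workers=len(converters)) as pool:
        futures = {name: pool.submit(fn, path) for name, fn in converters.items()}
        return {name: fut.result() for name, fut in futures.items()}


def classify(block):
    """Label one block with a segment type based on its first line."""
    first = block.lstrip().splitlines()[0]
    if first.startswith("#"):
        return "heading"
    if first.startswith("|"):
        return "table"
    if first.startswith("!["):
        return "image"
    if re.match(r"([-*+]|\d+\.)\s", first):
        return "list"
    return "paragraph"


def parse_segments(markdown):
    """Step 2: split output into (type, text) segments on blank lines."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]
    return [(classify(b), b) for b in blocks]


outputs = run_parallel(
    "document.pdf",
    {"pymupdf4llm": fake_pymupdf4llm, "markitdown": fake_markitdown},
)
for tool, text in outputs.items():
    print(tool, [kind for kind, _ in parse_segments(text)])
```

Threads are used here because real converters would spend most of their time in subprocesses or native code; a process pool would be an equally plausible choice.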
Merge Criteria
| Segment Type | Selection Criteria |
|---|---|
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
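As an illustration of the table criterion, the sketch below scores two candidate versions of the same table and keeps the better one. The scoring weights are illustrative assumptions, and `table_score` and `pick_best` are hypothetical helpers, not functions from `scripts/merge_outputs.py`.

```python
def table_score(table_md):
    """Score a markdown table: more rows/columns and a header separator win."""
    lines = [ln.strip() for ln in table_md.strip().splitlines() if ln.strip()]
    if not lines:
        return 0
    columns = len(lines[0].strip("|").split("|"))
    # A separator row consists only of pipes, dashes, colons, and spaces.
    has_separator = (
        len(lines) > 1 and set(lines[1]) <= set("|-: ") and "-" in lines[1]
    )
    rows = len(lines) - (1 if has_separator else 0)
    # Weights are illustrative assumptions, not values from the project.
    return rows * columns + (5 if has_separator else 0)


def pick_best(candidates, score):
    """Return (tool, segment) for the highest-scoring candidate."""
    tool = max(candidates, key=lambda name: score(candidates[name]))
    return tool, candidates[tool]


candidates = {
    "pymupdf4llm": "| a | b |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |",
    "markitdown": "| a | b |\n| 1 | 2 |",
}
tool, table = pick_best(candidates, table_score)
print(tool)  # pymupdf4llm
```

The same `pick_best` shape would apply to the other segment types, with a different scoring function for each row of the table.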
Image Extraction
```bash
# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets

# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md
```
Output:
- Images: `assets/img_page1_1.png`, `assets/img_page2_1.jpg`
- Metadata: `assets/images_metadata.json` (page, position, dimensions)
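If you need to build image references by hand, the extracted filenames carry enough information for basic alt text. The sketch below assumes the naming pattern `img_page<page>_<index>.<ext>`, inferred only from the two example filenames; `image_reference` is a hypothetical helper, and the output is not necessarily what `--markdown refs.md` generates.

```python
import re
from pathlib import Path

# Pattern inferred from the two example filenames (an assumption):
# img_page<page>_<index>.<ext>
IMAGE_NAME = re.compile(r"img_page(\d+)_(\d+)\.(png|jpe?g)$", re.IGNORECASE)


def image_reference(path):
    """Build a markdown image reference with alt text from an extracted path."""
    match = IMAGE_NAME.search(Path(path).name)
    if match is None:
        raise ValueError(f"Unrecognised image name: {path}")
    page, index = int(match.group(1)), int(match.group(2))
    return f"![Page {page}, image {index}]({path})"


for p in ["assets/img_page1_1.png", "assets/img_page2_1.jpg"]:
    print(image_reference(p))
# ![Page 1, image 1](assets/img_page1_1.png)
# ![Page 2, image 1](assets/img_page2_1.jpg)
```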
Quality Validation
```bash
# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md

# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html
```
Quality Metrics
| Metric | Pass | Warn | Fail |
|---|---|---|---|
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |
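The thresholds above translate directly into a small grading function. This is a sketch of the table only, not the logic inside `scripts/validate_output.py`; `THRESHOLDS` and `grade` are hypothetical names, and treating values between 99% and 100% as a warning is an assumption, since the table leaves that gap unspecified.

```python
# (pass threshold, fail threshold) in percent, taken from the metrics table.
THRESHOLDS = {
    "text": (95.0, 85.0),    # pass: >95, fail: <85
    "table": (100.0, 90.0),  # pass: 100, fail: <90
    "image": (100.0, 80.0),  # pass: 100, fail: <80
}


def grade(metric, retention):
    """Map a retention percentage to 'pass', 'warn', or 'fail'."""
    pass_at, fail_below = THRESHOLDS[metric]
    if retention < fail_below:
        return "fail"
    # Text must strictly exceed 95; tables and images must reach 100.
    passed = retention > pass_at if metric == "text" else retention >= pass_at
    return "pass" if passed else "warn"


print(grade("text", 96.2))   # pass
print(grade("text", 95.0))   # warn
print(grade("table", 99.5))  # warn
print(grade("image", 79.0))  # fail
```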
Merge Outputs Manually
```bash
# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md

# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose
```
Path Conversion (Windows/WSL)
```bash
# Windows → WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf
```
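The mapping itself is mechanical: lowercase the drive letter, place it under `/mnt/`, and flip the slashes. The sketch below shows that rule under the assumption that drives are mounted at WSL's default `/mnt/<drive>` location; `windows_to_wsl` is a hypothetical function, not the implementation in `scripts/convert_path.py`.

```python
import re


def windows_to_wsl(path):
    """Convert a Windows drive-letter path to its /mnt/<drive>/ WSL form."""
    match = re.match(r"^([A-Za-z]):[\\/]*(.*)$", path)
    if match is None:
        raise ValueError(f"Not a drive-letter path: {path}")
    drive = match.group(1).lower()
    rest = match.group(2).replace("\\", "/")
    return f"/mnt/{drive}/{rest}" if rest else f"/mnt/{drive}"


print(windows_to_wsl(r"C:\Users\name\Documents\file.pdf"))
# /mnt/c/Users/name/Documents/file.pdf
```

UNC paths (`\\server\share\...`) and custom mount roots are out of scope for this sketch.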
Common Issues
**"No conversion tools available"**
```bash
# Install all tools
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc
```
**FontBBox warnings during PDF conversion**
- Harmless font parsing warnings, output is still correct
**Images missing from output**
- Use Heavy Mode for better image preservation
- Or extract separately with `scripts/extract_pdf_images.py`
**Tables broken in output**
- Use Heavy Mode - it selects the most complete table version
- Or validate with `scripts/validate_output.py`
Bundled Scripts
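| Script | Purpose |
|---|---|
| `scripts/convert.py` | Main orchestrator with Quick/Heavy mode |
| `scripts/merge_outputs.py` | Merge multiple markdown outputs |
| `scripts/validate_output.py` | Quality validation with HTML report |
| `scripts/extract_pdf_images.py` | PDF image extraction with metadata |
| `scripts/convert_path.py` | Windows to WSL path converter |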
References
- `references/heavy-mode-guide.md` - Detailed Heavy Mode documentation
- `references/tool-comparison.md` - Tool capabilities comparison
- `references/conversion-examples.md` - Batch operation examples