Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, simple tables, image paths, CJK bold spacing, attribute noise, and code blocks. Benchmarked best-in-class (7.6/10) against Docling, MarkItDown, Pandoc raw, and Mammoth. Trigger on "convert document", "docx to markdown", "parse word", "doc to markdown", "解析word", "转换文档".
npx skill4agent add daymade/claude-code-skills doc-to-markdown

# DOCX → Markdown (one command, zero manual fixes)
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.docx -o output.md --assets-dir ./media
# PDF → Markdown
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md
# Run tests
uv run --with pytest pytest scripts/test_convert.py -v

| Mode | Speed | Quality | Use Case |
|---|---|---|---|
| Quick (default) | Fast | Good | Drafts, simple documents |
| Heavy | Slower | Best | Final documents, complex layouts |

| Format | Quick Mode | Heavy Mode |
|---|---|---|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc + post-processing | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |
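
The format and mode table above can be read as a simple lookup. The following is a minimal sketch of that lookup, not the actual logic in `scripts/convert.py`: the names `TOOLCHAIN` and `select_tools` are hypothetical, the pymupdf4llm row is treated as the PDF row, and the DOCX post-processing step is left out because it is a processing stage rather than a tool.

```python
from pathlib import Path

# Tool chains per file extension and mode, transcribed from the table above.
# Assumption: the row listing pymupdf4llm is the PDF row.
TOOLCHAIN = {
    ".pdf": {"quick": ["pymupdf4llm"], "heavy": ["pymupdf4llm", "markitdown"]},
    ".docx": {"quick": ["pandoc"], "heavy": ["pandoc", "markitdown"]},
    ".pptx": {"quick": ["markitdown"], "heavy": ["markitdown", "pandoc"]},
    ".xlsx": {"quick": ["markitdown"], "heavy": ["markitdown"]},
}


def select_tools(path: str, mode: str = "quick") -> list[str]:
    """Return the converters to run for a file (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in TOOLCHAIN:
        raise ValueError(f"unsupported format: {suffix or path}")
    if mode not in ("quick", "heavy"):
        raise ValueError(f"unknown mode: {mode}")
    # Copy the list so callers cannot mutate the shared table.
    return list(TOOLCHAIN[suffix][mode])
```

For example, `select_tools("slides.pptx", "heavy")` yields the markitdown and pandoc pair listed in the Heavy Mode column.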

| Problem | Fix | Test coverage |
|---|---|---|
| Grid tables | Single-column → blockquote, multi-column → pipe table | |
| Simple tables | Multi-column images → pipe table with captions | |
| Image path nesting | Flatten nested paths | |
| Pandoc attributes | Removed | |
| CJK bold spacing | Add space around bold markers | |
| Indented dashed code blocks | → fenced ``` with language detection | |
| Escaped brackets | Unescaped | |
| Double-bracket links | Reduced to single-bracket links | |
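
The CJK bold spacing fix listed in the table can be illustrated with a small regular-expression pass. This is a sketch under stated assumptions rather than the code in `scripts/convert.py`: the function name `space_cjk_bold` is hypothetical, and the character ranges used to detect CJK text are a non-exhaustive guess.

```python
import re

# Assumed (non-exhaustive) ranges: CJK punctuation, common ideographs,
# and full-width forms such as the full-width comma.
_CJK_CHAR = re.compile(r"[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]")
# A bold span: ** immediately followed and preceded by non-space text.
_BOLD = re.compile(r"\*\*(?=\S)(.+?)(?<=\S)\*\*")


def space_cjk_bold(text: str) -> str:
    """Insert a space between a **bold** span and an adjacent CJK character."""

    def fix(match: re.Match) -> str:
        start, end = match.span()
        before = text[start - 1] if start > 0 else ""
        after = text[end] if end < len(text) else ""
        left = " " if before and _CJK_CHAR.match(before) else ""
        right = " " if after and _CJK_CHAR.match(after) else ""
        return f"{left}{match.group(0)}{right}"

    return _BOLD.sub(fix, text)
```

Spans that already have whitespace on either side are left alone, so running the pass twice produces the same text as running it once.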

Before: `打开**飞书**,就可以` → some renderers fail to bold
After: `打开 **飞书** ,就可以` → universally renders correctly

| Segment Type | Selection Criteria |
|---|---|
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
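
To make the table criterion concrete, here is a minimal sketch of how two candidate table segments might be compared. It is not the logic in `scripts/merge_outputs.py`: the names `table_score` and `pick_table` are hypothetical, and ranking the header separator ahead of row and column counts is an assumption.

```python
import re

# Matches a pipe-table separator row such as |---|:---:|
_SEPARATOR = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$")


def table_score(segment: str) -> tuple[int, int, int]:
    """Score a markdown table as (has header separator, rows, columns)."""
    lines = [ln for ln in segment.strip().splitlines() if ln.strip()]
    has_separator = any(_SEPARATOR.match(ln) for ln in lines)
    # Count header and data rows, excluding the separator row itself.
    rows = sum(1 for ln in lines if not _SEPARATOR.match(ln))
    cols = max((len(ln.strip().strip("|").split("|")) for ln in lines), default=0)
    return (int(has_separator), rows, cols)


def pick_table(a: str, b: str) -> str:
    """Keep the candidate with the higher score; ties go to the first."""
    return a if table_score(a) >= table_score(b) else b
```

Tuples compare element by element, so a candidate with a proper separator row wins before row and column counts are considered.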
# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets
# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md

assets/img_page1_1.png
assets/img_page2_1.jpg
assets/images_metadata.json

# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md
# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html

| Metric | Pass | Warn | Fail |
|---|---|---|---|
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |
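
The thresholds in the table map directly onto a small classifier. The sketch below is illustrative only and is not taken from `scripts/validate_output.py`: `THRESHOLDS` and `classify` are hypothetical names, and values that fall between the listed bands (for example 99.5% table retention) are assumed to count as a warning.

```python
# (pass threshold, pass requires strictly greater, warn lower bound),
# all in percent, transcribed from the table above.
THRESHOLDS = {
    "text": (95.0, True, 85.0),
    "table": (100.0, False, 90.0),
    "image": (100.0, False, 80.0),
}


def classify(metric: str, retention_pct: float) -> str:
    """Return 'pass', 'warn', or 'fail' for a retention percentage."""
    pass_at, strict, warn_at = THRESHOLDS[metric]
    passed = retention_pct > pass_at if strict else retention_pct >= pass_at
    if passed:
        return "pass"
    # Assumption: anything at or above the warn bound but short of pass is a warning.
    if retention_pct >= warn_at:
        return "warn"
    return "fail"
```

Text retention passes only above 95%, while tables and images pass only at 100%, which matches the stricter expectation that no table or image is dropped.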
# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md
# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose

# Windows to WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf

# Install all tools
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc

scripts/extract_pdf_images.py
scripts/validate_output.py

| Script | Purpose |
|---|---|
| scripts/convert.py | Main orchestrator with Quick/Heavy mode + DOCX post-processing |
| scripts/test_convert.py | 31 tests covering all post-processing functions |
| scripts/merge_outputs.py | Merge multiple markdown outputs |
| scripts/validate_output.py | Quality validation with HTML report |
| scripts/extract_pdf_images.py | PDF image extraction with metadata |
| scripts/convert_path.py | Windows to WSL path converter |

references/benchmark-2026-03-22.md
references/heavy-mode-guide.md
references/tool-comparison.md
references/conversion-examples.md