pdf-to-markdown
PDF to Markdown Converter
Extract complete PDF content as structured Markdown using IBM Docling AI, preserving:
- Headers (detected by font size, converted to # tags)
- Bold, italic, monospace formatting
- Tables (high-accuracy extraction using TableFormer AI model)
- Lists (ordered and unordered)
- Multi-column layouts (correct reading order)
- Code blocks
- Images (extracted and copied next to output with relative paths)
When to Use This Skill
USE THIS when:
- User wants the "whole PDF" or "entire document" in context
- Analyzing, summarizing, or discussing PDF content
- User says "load", "read", "bring in", "extract" a PDF
- Grepping/searching would miss context or structure
- PDF has tables, formatting, or structure to preserve
Environment Setup
This skill uses a dedicated virtual environment at ~/.claude/skills/pdf-to-markdown/.venv/ to avoid polluting the user's working directory.
First-Time Setup (if .venv doesn't exist)
bash
cd ~/.claude/skills/pdf-to-markdown && uv venv .venv && uv pip install --python .venv/bin/python pymupdf docling docling-core
Verify Installation
bash
~/.claude/skills/pdf-to-markdown/.venv/bin/python -c "import pymupdf; import docling; import docling_core; print('OK')"
Quick Start
bash
# Convert PDF to markdown (always extracts images)
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf
# Output: document.md + images/ folder (next to the .md file)
Standard Workflow
When user provides a PDF and wants full content in context:
Step 1: Ensure the skill venv exists
bash
test -d ~/.claude/skills/pdf-to-markdown/.venv || (cd ~/.claude/skills/pdf-to-markdown && uv venv .venv && uv pip install --python .venv/bin/python pymupdf docling docling-core)
Step 2: Convert PDF to Markdown
bash
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py "/path/to/document.pdf"
Step 3: Read the output
bash
# Output is written to document.md in the same directory as the PDF
cat /path/to/document.md
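The three steps above can also be driven from Python. The following is a minimal sketch, not part of the skill itself: the helper names `build_command`, `expected_output`, and `convert_and_read` are hypothetical, and only the paths and the default output location (same directory as the PDF, `.md` suffix) are taken from the text above.

```python
import subprocess
from pathlib import Path

# Paths as documented for this skill.
SKILL_DIR = Path.home() / ".claude" / "skills" / "pdf-to-markdown"
VENV_PYTHON = SKILL_DIR / ".venv" / "bin" / "python"
SCRIPT = SKILL_DIR / "scripts" / "pdf_to_md.py"


def build_command(pdf_path):
    """Return the argv list for the Step 2 conversion command."""
    return [str(VENV_PYTHON), str(SCRIPT), str(pdf_path)]


def expected_output(pdf_path):
    """Default output location: next to the PDF, with a .md suffix."""
    return Path(pdf_path).with_suffix(".md")


def convert_and_read(pdf_path):
    """Run the converter (Step 2), then return the markdown text (Step 3)."""
    if not VENV_PYTHON.exists():
        # Step 1 has not been done yet; see the setup command above.
        raise FileNotFoundError(f"skill venv not found: {VENV_PYTHON}")
    subprocess.run(build_command(pdf_path), check=True)
    return expected_output(pdf_path).read_text(encoding="utf-8")
```

Passing the command as a list avoids shell quoting problems when the PDF path contains spaces.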
Caching
PDFs are aggressively cached to avoid re-processing. The first extraction is slow (~1 sec/page); every subsequent request for the same unmodified PDF is served from the cache and is effectively instant.
How It Works
- Cache location: ~/.cache/pdf-to-markdown/<cache_key>/
- Cache key: Based on file content hash
- Invalidation: Cache is invalidated when:
  - Source PDF is modified (size or mtime changes)
  - Extractor version changes (automatic re-extraction)
  - Explicitly cleared with --clear-cache or --clear-all-cache
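The cache rules above can be illustrated with a short sketch. It is not the script's actual code: the use of SHA-256, the `EXTRACTOR_VERSION` constant, the `extractor_version` metadata field, and the function names are all assumptions made for illustration. Only the content-hash key and the size, mtime, and version checks come from the list above.

```python
import hashlib
import os

EXTRACTOR_VERSION = "1"  # hypothetical version tag, not the script's real value


def cache_key(pdf_path, chunk_size=1 << 20):
    """Hash the file content in chunks (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256()
    with open(pdf_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_stale(pdf_path, metadata):
    """Return True when a cached entry no longer matches the source PDF.

    `metadata` mirrors metadata.json; the "extractor_version" field is assumed.
    """
    st = os.stat(pdf_path)
    return (
        metadata.get("size") != st.st_size
        or metadata.get("mtime") != st.st_mtime
        or metadata.get("extractor_version") != EXTRACTOR_VERSION
    )
```

Checking size and mtime first is cheap, so a design like this only needs to re-hash or re-extract when one of them has changed.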
Cache Commands
bash
# Clear cache for a specific PDF
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf --clear-cache
# Clear entire cache
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py --clear-all-cache
# Show cache statistics
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py --cache-stats
Cache Contents
~/.cache/pdf-to-markdown/<cache_key>/
├── metadata.json # source path, mtime, size, total_pages
├── full_output.md # cached full markdown
└── images/ # extracted images
Image Handling
Images are always extracted. They are:
- Cached in ~/.cache/pdf-to-markdown/<cache_key>/images/
- Copied to the images/ folder next to the output .md file
- Referenced in the markdown with relative paths (images/filename.png)
- Summarized in a table at the end of the document
Auto-View Behavior for Images
IMPORTANT: When the extracted markdown contains image references like:
**[Image: figure_1.png (1200x800, 125.3KB)]**
and the user asks about something that might be visual (charts, graphs, diagrams, figures, screenshots, layouts, designs, plots, illustrations), automatically use the Read tool to view the relevant image file(s) before answering. Don't ask the user - just look at it.
Examples of when to auto-view images:
- User: "What does the chart on page 3 show?" → Read the image file
- User: "Summarize the figures in this paper" → Read all image files
- User: "What's in the diagram?" → Read the image file
- User: "Describe the architecture shown" → Read the image file
- User: "What are the results?" (and there's a results figure) → Read it
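To decide which files to view, the image references can be located in the extracted markdown with a regular expression. This is a hedged sketch: the pattern is inferred only from the reference format shown above, and `find_image_refs` is a hypothetical helper, not something the skill provides.

```python
import re
from pathlib import Path

# Matches references such as: [Image: figure_1.png (1200x800, 125.3KB)]
IMAGE_REF = re.compile(
    r"\[Image: (?P<name>[^\s\]]+) "
    r"\((?P<w>\d+)x(?P<h>\d+), (?P<size>[\d.]+KB)\)\]"
)


def find_image_refs(markdown_text, md_path):
    """Return (image_path, width, height, size) for each reference found.

    Paths are resolved against the images/ folder next to the .md file.
    """
    images_dir = Path(md_path).parent / "images"
    return [
        (images_dir / m["name"], int(m["w"]), int(m["h"]), m["size"])
        for m in IMAGE_REF.finditer(markdown_text)
    ]
```

The returned paths are the ones to pass to the Read tool when a question touches on a figure.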
Output Format
The markdown output includes:
Header (metadata)
yaml
---
source: document.pdf
total_pages: 42
extracted_at: 2025-01-15T10:30:00
from_cache: true
images_dir: images
---
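A consumer of the output may want to separate this header from the body, for example to check `from_cache` or `total_pages`. The sketch below assumes the header is always a flat list of `key: value` lines between two `---` delimiters, as in the sample above; `parse_front_matter` is a hypothetical helper and all values are kept as strings.

```python
def parse_front_matter(markdown_text):
    """Split a leading '---' delimited header from the markdown body.

    Returns (metadata_dict, body). If there is no complete header,
    returns ({}, original_text).
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown_text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter: everything after it is the body.
            return meta, "\n".join(lines[i + 1:])
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, markdown_text  # no closing delimiter found
```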
Content with image references
markdown
# Main Title
## Section Header
Regular paragraph text with **bold**, *italic*, and `monospace` formatting.
code
**[Image: figure_1.png (800x600, 45.2KB)]**
| Column A | Column B |
|---|---|
| Data 1 | Data 2 |
Image summary table (at end)
markdown
---
## Extracted Images
| # | File | Dimensions | Size |
|---|---|---|---|
| 1 | figure_1.png | 800x600 | 45.2KB |
| 2 | chart_2.png | 1200x800 | 89.1KB |
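For illustration, a table in this shape could be produced as follows. This is only a sketch of the format, not the script's implementation: the function names are hypothetical, and treating 1 KB as 1024 bytes with one decimal place is an assumption based on the sample values.

```python
def format_size(num_bytes):
    """Render a byte count as kilobytes with one decimal (1 KB = 1024 bytes assumed)."""
    return f"{num_bytes / 1024:.1f}KB"


def image_summary_table(images):
    """Build the summary table from (filename, width, height, num_bytes) tuples."""
    rows = ["| # | File | Dimensions | Size |", "|---|---|---|---|"]
    for index, (name, width, height, num_bytes) in enumerate(images, start=1):
        rows.append(
            f"| {index} | {name} | {width}x{height} | {format_size(num_bytes)} |"
        )
    return "\n".join(rows)
```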
Script Reference
Location: ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py
Usage: pdf_to_md.py <input.pdf> [output.md] [options]
Options:
  --no-progress Disable progress indicator
Cache Options:
  --clear-cache Clear cache for this PDF and re-extract
  --clear-all-cache Clear entire cache directory and exit
  --cache-stats Show cache statistics and exit
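The usage text above maps naturally onto an `argparse` definition. The sketch below is a hypothetical reconstruction from that usage text alone; the real script may define its arguments differently. The input path is made optional here only because `--clear-all-cache` and `--cache-stats` are shown running without one.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented usage of pdf_to_md.py."""
    parser = argparse.ArgumentParser(prog="pdf_to_md.py")
    parser.add_argument("input", nargs="?", help="input PDF")
    parser.add_argument("output", nargs="?", help="output markdown path")
    parser.add_argument("--no-progress", action="store_true",
                        help="Disable progress indicator")
    cache = parser.add_argument_group("Cache Options")
    cache.add_argument("--clear-cache", action="store_true",
                       help="Clear cache for this PDF and re-extract")
    cache.add_argument("--clear-all-cache", action="store_true",
                       help="Clear entire cache directory and exit")
    cache.add_argument("--cache-stats", action="store_true",
                       help="Show cache statistics and exit")
    return parser
```

For example, `build_parser().parse_args(["document.pdf", "--clear-cache"])` yields a namespace with `input` set to `document.pdf` and `clear_cache` set to `True`.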
Performance
- First extraction: ~1 second per page (Docling AI processing)
- First run: Downloads AI models (~500MB one-time)
- Cached extraction: Instant
- High-resolution images: 4x default resolution for crisp output
Troubleshooting
"No module named docling" or venv doesn't exist
Recreate the skill's virtual environment:
bash
cd ~/.claude/skills/pdf-to-markdown && rm -rf .venv && uv venv .venv && uv pip install --python .venv/bin/python pymupdf docling docling-core
Poor extraction quality
For scanned PDFs, ensure Tesseract OCR is installed:
brew install tesseract
Tables not formatting correctly
This skill uses IBM's TableFormer AI model, which has ~93.6% accuracy on complex tables. If tables are still garbled, the PDF may have unusual formatting.