pdf-to-markdown

PDF to Markdown Converter

Extract complete PDF content as structured Markdown using IBM Docling AI, preserving:

- Headers (detected by font size, converted to # tags)
- Bold, italic, and monospace formatting
- Tables (high-accuracy extraction using TableFormer AI model)
- Lists (ordered and unordered)
- Multi-column layouts (correct reading order)
- Code blocks
- Images (extracted and copied next to output with relative paths)

When to Use This Skill

USE THIS when:

- User wants the "whole PDF" or "entire document" in context
- Analyzing, summarizing, or discussing PDF content
- User says "load", "read", "bring in", or "extract" a PDF
- Grepping/searching would miss context or structure
- PDF has tables, formatting, or structure to preserve

Environment Setup

This skill uses a dedicated virtual environment at `~/.claude/skills/pdf-to-markdown/.venv/` to avoid polluting the user's working directory.

First-Time Setup (if .venv doesn't exist)

```bash
cd ~/.claude/skills/pdf-to-markdown && uv venv .venv && uv pip install --python .venv/bin/python pymupdf docling docling-core
```

Verify Installation

```bash
~/.claude/skills/pdf-to-markdown/.venv/bin/python -c "import pymupdf; import docling; import docling_core; print('OK')"
```

Quick Start

```bash
# Convert PDF to markdown (always extracts images)
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf

# Output: document.md + images/ folder (next to the .md file)
```

Standard Workflow

When user provides a PDF and wants full content in context:

Step 1: Ensure the skill venv exists

```bash
test -d ~/.claude/skills/pdf-to-markdown/.venv || (cd ~/.claude/skills/pdf-to-markdown && uv venv .venv && uv pip install --python .venv/bin/python pymupdf docling docling-core)
```

Step 2: Convert PDF to Markdown

```bash
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py "/path/to/document.pdf"
```

Step 3: Read the output

```bash
# Output is written to document.md in the same directory as the PDF
cat /path/to/document.md
```
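
The same workflow can be driven from Python. The following is a minimal sketch, not part of the skill itself: the helper names `build_convert_command` and `expected_output_path` are hypothetical, and only the paths and the "same directory as the PDF" output convention come from the steps above.

```python
from pathlib import Path

# Paths taken from the setup section of this document.
SKILL_DIR = Path.home() / ".claude" / "skills" / "pdf-to-markdown"
VENV_PYTHON = SKILL_DIR / ".venv" / "bin" / "python"
SCRIPT = SKILL_DIR / "scripts" / "pdf_to_md.py"


def build_convert_command(pdf_path: str) -> list[str]:
    """Return the argv list that converts one PDF (Step 2)."""
    return [str(VENV_PYTHON), str(SCRIPT), str(pdf_path)]


def expected_output_path(pdf_path: str) -> Path:
    """Return where the Markdown should land: same directory, .md suffix (Step 3)."""
    return Path(pdf_path).with_suffix(".md")
```

A caller could pass the result of `build_convert_command(pdf)` to `subprocess.run(..., check=True)` and then read `expected_output_path(pdf).read_text()`.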

Caching

PDFs are aggressively cached to avoid re-processing. The first extraction is slow (~1 sec/page); every subsequent request for the same PDF is instant.

How It Works

- Cache location: `~/.cache/pdf-to-markdown/<cache_key>/`
- Cache key: based on file content hash
- Invalidation: the cache is invalidated when:
  - Source PDF is modified (size or mtime changes)
  - Extractor version changes (automatic re-extraction)
  - Explicitly cleared with `--clear-cache` or `--clear-all-cache`
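
To illustrate the scheme described above, here is a minimal sketch of a content-hash cache key plus a staleness check. It is not the skill's actual code: the choice of SHA-256, the function names, and the metadata field names (`size`, `mtime`, `extractor_version`) are all assumptions made for the example.

```python
import hashlib
import os


def cache_key(pdf_path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file contents in chunks (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256()
    with open(pdf_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_stale(pdf_path: str, metadata: dict, extractor_version: str) -> bool:
    """Return True if the cached entry should be re-extracted."""
    stat = os.stat(pdf_path)
    return (
        metadata.get("size") != stat.st_size            # source size changed
        or metadata.get("mtime") != stat.st_mtime       # source was modified
        or metadata.get("extractor_version") != extractor_version
    )
```

Comparing size and mtime first is cheap, so a check like `is_stale` can run on every request without rehashing the whole file.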

Cache Commands

```bash
# Clear cache for a specific PDF
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf --clear-cache

# Clear entire cache
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py --clear-all-cache

# Show cache statistics
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py --cache-stats
```

Cache Contents

~/.cache/pdf-to-markdown/<cache_key>/
├── metadata.json    # source path, mtime, size, total_pages
├── full_output.md   # cached full markdown
└── images/          # extracted images

Image Handling

Images are always extracted. They are:

- Cached in `~/.cache/pdf-to-markdown/<cache_key>/images/`
- Copied to an `images/` folder next to the output `.md` file
- Referenced in the markdown with relative paths (`images/filename.png`)
- Summarized in a table at the end of the document
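
The copy step can be pictured with the following sketch. The helper name `publish_images` is hypothetical and the skill's real implementation may differ; only the `images/` folder layout and the relative-path convention come from the list above.

```python
import shutil
from pathlib import Path


def publish_images(cache_dir: Path, output_md: Path) -> list[str]:
    """Copy cached images next to output_md and return relative references."""
    src = cache_dir / "images"
    dst = output_md.parent / "images"
    if not src.is_dir():
        return []  # nothing was extracted for this PDF
    shutil.copytree(src, dst, dirs_exist_ok=True)
    # Relative paths keep the Markdown valid if the output folder is moved.
    return sorted(f"images/{p.name}" for p in src.iterdir() if p.is_file())
```

Copying instead of linking into the cache means the output folder stays self-contained even after the cache is cleared.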

Auto-View Behavior for Images

IMPORTANT: When the extracted markdown contains image references like:

**[Image: figure_1.png (1200x800, 125.3KB)]**

And the user asks about something that might be visual (charts, graphs, diagrams, figures, screenshots, layouts, designs, plots, illustrations), automatically use the Read tool to view the relevant image file(s) before answering. Don't ask the user - just look at it.

Examples of when to auto-view images:

- User: "What does the chart on page 3 show?" → Read the image file
- User: "Summarize the figures in this paper" → Read all image files
- User: "What's in the diagram?" → Read the image file
- User: "Describe the architecture shown" → Read the image file
- User: "What are the results?" (and there's a results figure) → Read it
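
Finding which files to view comes down to locating the image markers in the extracted text. The sketch below is one way to do that, assuming the marker format matches the example above (dimensions as `WxH`, size in KB) and that images sit in `images/` next to the `.md` file; the names `IMAGE_MARKER` and `referenced_images` are hypothetical.

```python
import re
from pathlib import Path

# Matches markers such as **[Image: figure_1.png (1200x800, 125.3KB)]**
IMAGE_MARKER = re.compile(
    r"\[Image:\s*(?P<name>[^\s()]+)\s*"
    r"\((?P<w>\d+)x(?P<h>\d+),\s*(?P<size>[\d.]+)KB\)\]"
)


def referenced_images(markdown: str, md_path: Path) -> list[Path]:
    """Return the paths of images referenced in the Markdown, in document order."""
    images_dir = md_path.parent / "images"
    return [images_dir / m.group("name") for m in IMAGE_MARKER.finditer(markdown)]
```

Each returned path can then be opened with the Read tool before answering a visual question.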

Output Format

The markdown output includes:

Header (metadata)

```yaml
---
source: document.pdf
total_pages: 42
extracted_at: 2025-01-15T10:30:00
from_cache: true
images_dir: images
---
```
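
A header like this can be read back without a YAML library. The function below is a minimal sketch, with the hypothetical name `parse_header`, that handles only flat `key: value` lines as in the example; values stay as strings, and anything richer would call for a real YAML parser.

```python
def parse_header(markdown: str) -> dict[str, str]:
    """Parse a flat `key: value` header delimited by --- lines."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    header: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return header  # closing delimiter reached
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            header[key.strip()] = value.strip()
    return {}  # no closing delimiter found
```

For instance, checking `parse_header(text).get("from_cache") == "true"` tells a caller whether the output came from the cache.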

Content with image references

```markdown
# Main Title

## Section Header

Regular paragraph text with **bold**, *italic*, and `code` formatting.

**[Image: figure_1.png (800x600, 45.2KB)]**

| Column A | Column B |
|----------|----------|
| Data 1   | Data 2   |
```

Image summary table (at end)

```markdown
---

## Extracted Images

| # | File         | Dimensions | Size   |
|---|--------------|------------|--------|
| 1 | figure_1.png | 800x600    | 45.2KB |
| 2 | chart_2.png  | 1200x800   | 89.1KB |
```

Script Reference

Location: `~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py`

Usage: pdf_to_md.py <input.pdf> [output.md] [options]

Options:
  --no-progress     Disable progress indicator

Cache Options:
  --clear-cache        Clear cache for this PDF and re-extract
  --clear-all-cache    Clear entire cache directory and exit
  --cache-stats        Show cache statistics and exit
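
For readers who want to wrap or imitate this interface, the usage text above maps naturally onto `argparse`. The following is a hypothetical reconstruction, not the script's actual source: the function name `build_parser`, the choice to make the input optional, and the attribute names are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented usage (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pdf_to_md.py")
    # The input is optional here because the cache-wide flags run without a PDF.
    parser.add_argument("input", nargs="?", help="input PDF")
    parser.add_argument("output", nargs="?", help="output Markdown path")
    parser.add_argument("--no-progress", action="store_true",
                        help="Disable progress indicator")
    cache = parser.add_argument_group("Cache Options")
    cache.add_argument("--clear-cache", action="store_true",
                       help="Clear cache for this PDF and re-extract")
    cache.add_argument("--clear-all-cache", action="store_true",
                       help="Clear entire cache directory and exit")
    cache.add_argument("--cache-stats", action="store_true",
                       help="Show cache statistics and exit")
    return parser
```

With this sketch, `build_parser().parse_args(["document.pdf", "--clear-cache"])` yields `input="document.pdf"`, `output=None`, and `clear_cache=True`.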

Performance

- First extraction: ~1 second per page (Docling AI processing)
- First run: downloads AI models (~500MB, one-time)
- Cached extraction: instant
- High-resolution images: 4x default resolution for crisp output

Troubleshooting

"No module named docling" or venv doesn't exist

Recreate the skill's virtual environment:

```bash
cd ~/.claude/skills/pdf-to-markdown && rm -rf .venv && uv venv .venv && uv pip install --python .venv/bin/python pymupdf docling docling-core
```

Poor extraction quality

For scanned PDFs, ensure Tesseract OCR is installed (the command below assumes macOS with Homebrew):

```bash
brew install tesseract
```

Tables not formatting correctly

This skill uses IBM's TableFormer AI model, which has ~93.6% accuracy on complex tables. If tables are still garbled, the PDF may have unusual formatting.
