pdf-to-markdown

Convert PDF files to Markdown format.

Installation Required

```bash
cd .claude/skills/pdf-to-markdown
npm install
```

Dependencies: `pdf-parse`

Quick Start

```bash
# Basic conversion
node .claude/skills/pdf-to-markdown/scripts/convert.cjs \
  --file ./document.pdf

# Custom output path
node .claude/skills/pdf-to-markdown/scripts/convert.cjs \
  --file ./doc.pdf \
  --output ./output/doc.md
```

CLI Options

| Option | Required | Description |
| --- | --- | --- |
| `--file <path>` | Yes | Input PDF file |
| `--output <path>` | No | Output Markdown path (default: input name + .md) |
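
As a minimal sketch of how the two options combine, the helper below builds the argument list for a call from Node. The name `buildConvertArgs` is hypothetical and not part of the skill; only the script path and the `--file` and `--output` options come from this document.

```javascript
// Hypothetical helper (not part of the skill): builds the argument list
// for convert.cjs from the two documented options.
const SCRIPT = ".claude/skills/pdf-to-markdown/scripts/convert.cjs";

function buildConvertArgs({ file, output } = {}) {
  // --file is the only required option.
  if (!file) {
    throw new Error("--file is required");
  }
  const args = [SCRIPT, "--file", file];
  // --output is optional; when omitted, the script picks its default path.
  if (output) {
    args.push("--output", output);
  }
  return args;
}

console.log(buildConvertArgs({ file: "./doc.pdf", output: "./output/doc.md" }).join(" "));
```

The resulting array can be passed to `child_process.execFileSync(process.execPath, args)`, which avoids shell quoting problems when paths contain spaces.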

Output Format (JSON)

```json
{
  "success": true,
  "input": "/path/to/input.pdf",
  "output": "/path/to/output.md",
  "wordCount": 1523,
  "warnings": ["Tables may not be accurately converted"]
}
```
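
A caller might interpret this result along the following lines. This is a sketch under assumptions: it supposes the JSON is available as a string (for example, captured from the script's standard output), and since only the success case is documented, it treats any result without `success: true` as a generic failure. The function name `summarizeResult` is hypothetical.

```javascript
// Hypothetical helper: turns the JSON result into a short summary.
// Assumes the fields shown in the documented success example.
function summarizeResult(jsonText) {
  const result = JSON.parse(jsonText);
  const warnings = Array.isArray(result.warnings) ? result.warnings : [];
  if (!result.success) {
    // The failure shape is not documented, so report it generically.
    return { ok: false, message: "Conversion failed", warnings };
  }
  // A word count of zero suggests a scanned or image-based PDF.
  const hint = result.wordCount === 0
    ? " (no text extracted; the PDF may be scanned)"
    : "";
  return {
    ok: true,
    message: `Wrote ${result.output}: ${result.wordCount} words${hint}`,
    warnings,
  };
}

const sample = JSON.stringify({
  success: true,
  input: "/path/to/input.pdf",
  output: "/path/to/output.md",
  wordCount: 1523,
  warnings: ["Tables may not be accurately converted"],
});
console.log(summarizeResult(sample).message);
```

Surfacing `warnings` to the user is worthwhile even on success, since they flag content (such as tables) that may need manual review.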

Supported Elements

Text extraction from digital PDFs
Headings (detected by font size heuristics)
Paragraphs
Basic lists
Links (when embedded in PDF)
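
To illustrate what a font-size heading heuristic can look like, here is one possible sketch. It is not the skill's actual implementation: the `{ text, fontSize }` input shape and the function name `assignHeadingLevels` are assumptions made for the example. It treats the font size that covers the most text as body text and maps each larger size to a heading level.

```javascript
// Illustrative only: items are hypothetical { text, fontSize } objects,
// not the data structure used by the skill or by pdf-parse.
function assignHeadingLevels(items, maxLevel = 3) {
  // Weight each font size by how much text uses it; the heaviest is body text.
  const weight = new Map();
  for (const { text, fontSize } of items) {
    weight.set(fontSize, (weight.get(fontSize) || 0) + text.length);
  }
  let bodySize = 0;
  let bestWeight = -1;
  for (const [size, w] of weight) {
    if (w > bestWeight) {
      bestWeight = w;
      bodySize = size;
    }
  }
  // Distinct sizes above body size, largest first, become levels 1..maxLevel.
  const headingSizes = [...weight.keys()]
    .filter((size) => size > bodySize)
    .sort((a, b) => b - a);
  return items.map(({ text, fontSize }) => {
    const rank = headingSizes.indexOf(fontSize);
    if (rank === -1) {
      return text; // body text or smaller: leave unchanged
    }
    const level = Math.min(rank + 1, maxLevel);
    return `${"#".repeat(level)} ${text}`;
  });
}

const lines = assignHeadingLevels([
  { text: "Title", fontSize: 24 },
  { text: "Some body text that is long enough.", fontSize: 11 },
  { text: "Section", fontSize: 16 },
  { text: "More body text here.", fontSize: 11 },
]);
console.log(lines.join("\n"));
```

A heuristic of this kind can misclassify large pull quotes or captions as headings, which is one reason complex formatting may be simplified or lost.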

Known Limitations

Tables: Very limited support; may not render correctly
Multi-column layouts: Text may interleave between columns
Scanned PDFs: NOT supported (requires OCR - see alternatives below)
Images: NOT extracted (PDF images are not included in output)
Complex formatting: May be simplified or lost
Password-protected PDFs: NOT supported

Alternatives for Unsupported Cases

For scanned PDFs (OCR needed):

Use the `scribe.js-ocr` library (AGPL license)
Commercial OCR services (Google Cloud Vision, AWS Textract)

For complex tables:

Consider AI-based extraction (LLM post-processing)
Manual review and correction

For image extraction:

Use the `unpdf` library with `sharp` for image extraction
Process images separately and reference in markdown

Troubleshooting

Dependencies not found: Run `npm install` in the skill directory
Empty output: PDF may be scanned/image-based (requires OCR)
Garbled text: PDF may use embedded fonts not supported by the parser
Memory issues: Large PDFs may require the `--max-old-space-size=4096` flag
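
The heap flag is a Node.js option, so it has to appear before the script path; anything placed after the script path is passed to the script instead. The sketch below shows that ordering. The helper name `withHeapLimit` is hypothetical.

```javascript
// Hypothetical helper: prepends the V8 heap-size flag so that Node itself,
// not the script, receives it.
function withHeapLimit(scriptArgs, megabytes = 4096) {
  if (!Number.isInteger(megabytes) || megabytes <= 0) {
    throw new RangeError("megabytes must be a positive integer");
  }
  return [`--max-old-space-size=${megabytes}`, ...scriptArgs];
}

const heapArgs = withHeapLimit([
  ".claude/skills/pdf-to-markdown/scripts/convert.cjs",
  "--file",
  "./large.pdf",
]);
console.log(["node", ...heapArgs].join(" "));
```

Here `./large.pdf` is a placeholder input; the same array can be handed to `child_process.execFileSync(process.execPath, heapArgs)`.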

IMPORTANT Task Planning Notes

Always plan the work and break it into many small todo tasks
Always add a final review todo task to check the completed work and identify any fixes or enhancements needed
