pdf-to-markdown

Convert PDF files to Markdown format.

Installation Required

```bash
cd .claude/skills/pdf-to-markdown
npm install
```

Dependencies: `pdf-parse`

Quick Start

```bash
# Basic conversion
node .claude/skills/pdf-to-markdown/scripts/convert.cjs \
  --file ./document.pdf

# Custom output path
node .claude/skills/pdf-to-markdown/scripts/convert.cjs \
  --file ./doc.pdf \
  --output ./output/doc.md
```

CLI Options

| Option | Required | Description |
| --- | --- | --- |
| `--file <path>` | Yes | Input PDF file |
| `--output <path>` | No | Output Markdown path (default: input name + .md) |
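
When the converter is called from another Node script, the two options above map onto a plain argument array. The following is a minimal sketch: `buildConvertArgs` is a hypothetical helper, not part of the skill, and only the script path and the two flags come from this document.

```javascript
// Hypothetical helper: builds the argument list for convert.cjs.
const SCRIPT = '.claude/skills/pdf-to-markdown/scripts/convert.cjs';

function buildConvertArgs({ file, output } = {}) {
  // --file is the only required option.
  if (!file) {
    throw new Error('--file is required');
  }
  const args = [SCRIPT, '--file', file];
  // --output is optional; when omitted, the script applies its own default.
  if (output) {
    args.push('--output', output);
  }
  return args;
}

console.log(buildConvertArgs({ file: './doc.pdf', output: './output/doc.md' }).join(' '));
```

The resulting array could be handed to `child_process.execFile('node', args, callback)`, which avoids shell quoting problems with paths that contain spaces.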

Output Format (JSON)

```json
{
  "success": true,
  "input": "/path/to/input.pdf",
  "output": "/path/to/output.md",
  "wordCount": 1523,
  "warnings": ["Tables may not be accurately converted"]
}
```
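
A caller will usually want to check `success` and surface any `warnings`. The sketch below assumes the JSON text has already been captured (for example from the script's standard output, which this document does not confirm). The shape of a failed result is not documented here, so the hypothetical `summarizeResult` helper only reports failure generically.

```javascript
// Sketch: interpret the JSON result shown above.
// Assumption: jsonText holds the JSON object emitted by the converter.
function summarizeResult(jsonText) {
  const result = JSON.parse(jsonText);
  if (result.success !== true) {
    // The failure shape is undocumented, so no error field is assumed.
    throw new Error('Conversion did not report success');
  }
  // Tolerate a missing warnings array.
  const warnings = Array.isArray(result.warnings) ? result.warnings : [];
  return { output: result.output, wordCount: result.wordCount, warnings };
}

const sample = JSON.stringify({
  success: true,
  input: '/path/to/input.pdf',
  output: '/path/to/output.md',
  wordCount: 1523,
  warnings: ['Tables may not be accurately converted'],
});

const summary = summarizeResult(sample);
console.log(`${summary.wordCount} words written to ${summary.output}`);
for (const warning of summary.warnings) {
  console.log(`warning: ${warning}`);
}
```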

Supported Elements

  • Text extraction from digital PDFs
  • Headings (detected by font size heuristics)
  • Paragraphs
  • Basic lists
  • Links (when embedded in PDF)
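
To illustrate what a font-size heuristic for headings can look like, here is a minimal sketch. It is not the skill's actual implementation: the `{ text, fontSize }` records and the `linesToMarkdown` function are hypothetical, and the rule used (the most frequent size is body text, larger sizes become headings) is only one plausible approach.

```javascript
// Illustrative font-size heuristic; not the skill's actual code.
// Each item is a hypothetical { text, fontSize } record for one line of text.
function linesToMarkdown(items) {
  // Count how often each font size occurs.
  const counts = new Map();
  for (const { fontSize } of items) {
    counts.set(fontSize, (counts.get(fontSize) || 0) + 1);
  }

  // Treat the most frequent size as body text.
  let bodySize = 0;
  let best = 0;
  for (const [size, n] of counts) {
    if (n > best) {
      best = n;
      bodySize = size;
    }
  }

  // Sizes larger than body text become heading levels, largest first.
  const headingSizes = [...counts.keys()]
    .filter((size) => size > bodySize)
    .sort((a, b) => b - a);

  return items
    .map(({ text, fontSize }) => {
      const rank = headingSizes.indexOf(fontSize);
      if (rank === -1) return text; // body text or smaller
      const level = Math.min(rank + 1, 6); // Markdown has six heading levels
      return `${'#'.repeat(level)} ${text}`;
    })
    .join('\n\n');
}

const markdown = linesToMarkdown([
  { text: 'Report', fontSize: 24 },
  { text: 'Intro', fontSize: 16 },
  { text: 'First paragraph.', fontSize: 11 },
  { text: 'Second paragraph.', fontSize: 11 },
]);
console.log(markdown);
```

A heuristic like this depends on consistent font sizes, which is one reason documents with complex formatting may lose structure during conversion.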

Known Limitations

  • Tables: Very limited support; may not render correctly
  • Multi-column layouts: Text may interleave between columns
  • Scanned PDFs: NOT supported (requires OCR - see alternatives below)
  • Images: NOT extracted (PDF images are not included in output)
  • Complex formatting: May be simplified or lost
  • Password-protected PDFs: NOT supported

Alternatives for Unsupported Cases

For scanned PDFs (OCR needed):
  • Use the `scribe.js-ocr` library (AGPL license)
  • Commercial OCR services (Google Cloud Vision, AWS Textract)

For complex tables:
  • Consider AI-based extraction (LLM post-processing)
  • Manual review and correction

For image extraction:
  • Use the `unpdf` library with `sharp` for image extraction
  • Process images separately and reference them in the Markdown

Troubleshooting

  • Dependencies not found: Run `npm install` in the skill directory
  • Empty output: PDF may be scanned/image-based (requires OCR)
  • Garbled text: PDF may use embedded fonts not supported by the parser
  • Memory issues: Large PDFs may require the `--max-old-space-size=4096` flag
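
`--max-old-space-size` is a Node.js option rather than an option of the converter, so it has to appear before the script path. Anything after the path is passed to the script itself. The sketch below shows that ordering; `withHeapLimit` is a hypothetical helper, not part of the skill.

```javascript
// Sketch: prepend the heap-size flag to a node argument list.
// Node options must come before the script path.
function withHeapLimit(nodeArgs, megabytes = 4096) {
  if (!Number.isInteger(megabytes) || megabytes <= 0) {
    throw new RangeError('megabytes must be a positive integer');
  }
  return [`--max-old-space-size=${megabytes}`, ...nodeArgs];
}

const heapArgs = withHeapLimit([
  '.claude/skills/pdf-to-markdown/scripts/convert.cjs',
  '--file',
  './document.pdf',
]);
console.log(['node', ...heapArgs].join(' '));
```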

IMPORTANT Task Planning Notes

  • Always plan the work and break it into many small todo tasks
  • Always add a final todo task that reviews the completed work and identifies any fixes or enhancements needed