pdf-to-markdown
Convert PDF files to Markdown format.
Installation Required
```bash
cd .claude/skills/pdf-to-markdown
npm install
```
Dependencies: `pdf-parse`
Quick Start
```bash
# Basic conversion
node .claude/skills/pdf-to-markdown/scripts/convert.cjs \
  --file ./document.pdf

# Custom output path
node .claude/skills/pdf-to-markdown/scripts/convert.cjs \
  --file ./doc.pdf \
  --output ./output/doc.md
```
CLI Options
| Option | Required | Description |
|---|---|---|
| `--file` | Yes | Input PDF file |
| `--output` | No | Output Markdown path (default: input name + .md) |
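To make the required/optional split concrete, here is a minimal sketch of how a Node script could read these two options with the built-in `util.parseArgs`. The `readOptions` helper is hypothetical and is not taken from `convert.cjs`, whose actual argument handling is not shown here. The default output path is deliberately left unresolved, since the script decides it.

```javascript
const { parseArgs } = require("node:util");

// Hypothetical helper: reads --file (required) and --output (optional).
function readOptions(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      file: { type: "string" },
      output: { type: "string" },
    },
  });
  if (!values.file) {
    throw new Error("Missing required option: --file <path-to-pdf>");
  }
  // output stays undefined when omitted; the converter applies its own default.
  return { file: values.file, output: values.output };
}

console.log(readOptions(["--file", "./doc.pdf", "--output", "./output/doc.md"]));
```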
Output Format (JSON)
```json
{
  "success": true,
  "input": "/path/to/input.pdf",
  "output": "/path/to/output.md",
  "wordCount": 1523,
  "warnings": ["Tables may not be accurately converted"]
}
```
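A caller will usually want to check `success` and surface `warnings` before using the output file. The sketch below assumes the JSON is printed to stdout and captured as a string. The `readResult` helper is hypothetical, and the shape of a failed result is not documented here, so it only checks the `success` flag.

```javascript
// Hypothetical helper: parses the converter's JSON result and normalizes it.
function readResult(stdoutText) {
  const result = JSON.parse(stdoutText);
  if (result.success !== true) {
    throw new Error("pdf-to-markdown reported a failed conversion");
  }
  // Treat a missing warnings field as an empty list.
  const warnings = Array.isArray(result.warnings) ? result.warnings : [];
  return { output: result.output, wordCount: result.wordCount, warnings };
}

const sample = JSON.stringify({
  success: true,
  input: "/path/to/input.pdf",
  output: "/path/to/output.md",
  wordCount: 1523,
  warnings: ["Tables may not be accurately converted"],
});

const parsed = readResult(sample);
console.log(`Wrote ${parsed.output} (${parsed.wordCount} words)`);
parsed.warnings.forEach((w) => console.warn(`warning: ${w}`));
```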
Supported Elements
- Text extraction from digital PDFs
- Headings (detected by font size heuristics)
- Paragraphs
- Basic lists
- Links (when embedded in PDF)
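As an illustration of what a font-size heuristic for headings can look like, the sketch below compares each line's font size with the median size on the page. It is not the skill's actual implementation: the input shape (`{ text, fontSize }`), the `toMarkdownLines` name, and the ratio thresholds are all assumptions chosen for the example.

```javascript
// Median of a list of numbers (average of the two middle values when even).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical heuristic: lines much larger than the median body size
// become headings; everything else is left as plain text.
function toMarkdownLines(lines) {
  const bodySize = median(lines.map((line) => line.fontSize));
  return lines.map(({ text, fontSize }) => {
    const ratio = fontSize / bodySize;
    if (ratio >= 1.6) return `# ${text}`;
    if (ratio >= 1.3) return `## ${text}`;
    if (ratio >= 1.15) return `### ${text}`;
    return text;
  });
}

console.log(
  toMarkdownLines([
    { text: "Annual Report", fontSize: 24 },
    { text: "This year we shipped three releases.", fontSize: 12 },
    { text: "Highlights", fontSize: 16 },
    { text: "Revenue grew in every quarter.", fontSize: 12 },
    { text: "Costs stayed flat.", fontSize: 12 },
  ]).join("\n")
);
```

A heuristic like this misfires on documents that use bold text rather than larger text for headings, which is one reason heading detection is best treated as approximate.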
Known Limitations
- Tables: Very limited support; may not render correctly
- Multi-column layouts: Text may interleave between columns
- Scanned PDFs: NOT supported (requires OCR - see alternatives below)
- Images: NOT extracted (PDF images are not included in output)
- Complex formatting: May be simplified or lost
- Password-protected PDFs: NOT supported
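Several of these limitations show up in the result as a low `wordCount` or as entries in `warnings`. A rough triage step over the parsed result might look like the sketch below. The `triage` helper and the 20-word threshold are assumptions for illustration, not part of the skill.

```javascript
// Hypothetical helper: turns a parsed converter result into follow-up notes.
function triage(result, minWords = 20) {
  const notes = [];
  if (!result.success) {
    notes.push("Conversion failed: check that the PDF is not password-protected.");
    return notes;
  }
  if (result.wordCount < minWords) {
    notes.push("Very little text extracted: the PDF may be scanned and need OCR.");
  }
  for (const warning of result.warnings ?? []) {
    notes.push(`Converter warning: ${warning}`);
  }
  return notes;
}

console.log(triage({ success: true, wordCount: 3, warnings: [] }));
```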
Alternatives for Unsupported Cases
For scanned PDFs (OCR needed):
- Use the `scribe.js-ocr` library (AGPL license)
- Commercial OCR services (Google Cloud Vision, AWS Textract)
For complex tables:
- Consider AI-based extraction (LLM post-processing)
- Manual review and correction
For image extraction:
- Use the `unpdf` library with `sharp` for image extraction
- Process images separately and reference them in the Markdown
Troubleshooting
Dependencies not found: Run `npm install` in the skill directory
Empty output: PDF may be scanned/image-based (requires OCR)
Garbled text: PDF may use embedded fonts not supported by the parser
Memory issues: Large PDFs may require running Node with the `--max-old-space-size=4096` flag
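For the memory case, the heap flag has to come before the script path on the `node` command line. The sketch below shows one way to launch the converter from another Node script with a larger heap. The `nodeArgsFor` and `convertLargePdf` helpers are hypothetical, and capturing the result from stdout is an assumption.

```javascript
const { execFile } = require("node:child_process");

const SCRIPT = ".claude/skills/pdf-to-markdown/scripts/convert.cjs";

// Node flags must precede the script path; script options follow it.
function nodeArgsFor(file, heapMb) {
  const args = [];
  if (heapMb) args.push(`--max-old-space-size=${heapMb}`);
  args.push(SCRIPT, "--file", file);
  return args;
}

// Hypothetical wrapper: runs the converter with a 4 GB heap by default
// and resolves with whatever the script printed to stdout.
function convertLargePdf(file, heapMb = 4096) {
  return new Promise((resolve, reject) => {
    execFile(process.execPath, nodeArgsFor(file, heapMb), (error, stdout) => {
      if (error) return reject(error);
      resolve(stdout);
    });
  });
}

console.log(["node", ...nodeArgsFor("./large.pdf", 4096)].join(" "));
```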
IMPORTANT Task Planning Notes
- Always plan the work and break it into many small todo tasks
- Always add a final review todo task to check the completed work and find any fixes or enhancements needed