pdf-to-markdown
[IMPORTANT] Use `TaskCreate` to break ALL work into small tasks BEFORE starting, including tasks for each file read. This prevents context loss from long files. For simple tasks, AI MUST ask user whether to skip.
Quick Summary
Goal: Convert PDF files to well-formatted Markdown with auto-detection of native text vs scanned documents. Only native-text conversion is implemented; OCR is planned.
Workflow:
- Auto-Detect: Determine if the PDF has native text or needs OCR
- Convert: Run `scripts/convert.cjs` with the input path and optional mode/output flags
- Output: Returns JSON with success status, page count, and output path
Key Rules:
- Use `--mode auto` (default) to let the tool decide native vs OCR
- OCR for scanned PDFs requires additional `tesseract.js` setup
- Complex multi-column layouts may not preserve structure perfectly
pdf-to-markdown
Convert PDF files to Markdown format with automatic detection of native text vs scanned documents.
No npm install required
The skill runs with Node.js only; there is no `node_modules` in the repo. It uses npx to run `@opendocsg/pdf2md` on first use (cached by npx). Optional: run `npm install` in the skill directory for faster runs without npx.
Requirements: Node.js ≥18, `npx` (included with npm).
Quick Start
```bash
# Basic conversion (auto-detect native vs scanned)
node .agents/skills/pdf-to-markdown/scripts/convert.cjs --input ./document.pdf

# Specify output path
node .agents/skills/pdf-to-markdown/scripts/convert.cjs -i ./doc.pdf -o ./output.md

# Force native mode (skip OCR detection)
node .agents/skills/pdf-to-markdown/scripts/convert.cjs -i ./doc.pdf --mode native
```
CLI Options
| Option | Short | Description | Default |
|---|---|---|---|
| `--input` | `-i` | Input PDF file path | (required) |
| | `-o` | Output Markdown file path | |
| `--mode` | | Conversion mode: `auto` or `native` (OCR planned) | `auto` |
| | | Show help message | |
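When calling the converter from another Node.js script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of the skill: `buildConvertArgs` is a hypothetical helper, and it uses only the flags that appear in the Quick Start commands (`-i`, `-o`, `--mode`).

```javascript
// Hypothetical helper (not shipped with the skill): builds the argv
// for scripts/convert.cjs from an options object.
const SCRIPT = '.agents/skills/pdf-to-markdown/scripts/convert.cjs';

function buildConvertArgs({ input, output, mode } = {}) {
  if (!input) {
    // The input path is the only required option.
    throw new Error('input PDF path is required');
  }
  const args = [SCRIPT, '-i', input];
  if (output) args.push('-o', output); // optional output path
  if (mode) args.push('--mode', mode); // optional conversion mode
  return args;
}
```

The resulting array can be passed to `execFileSync(process.execPath, args)` from `node:child_process`, which avoids shell quoting problems with paths that contain spaces.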
Features
- Auto-Detection: Automatically determines if PDF has native text or requires OCR
- Native PDFs: Fast extraction using @opendocsg/pdf2md
- Tables: Basic table structure preservation
- Cross-Platform: Works on Windows, macOS, Linux
- No System Dependencies: Pure JavaScript implementation
Conversion Modes
Auto (Default)
Checks whether the first page of the PDF has extractable text. If text is found, native extraction is used; otherwise the skill reports that the PDF appears to be scanned and needs OCR, which is not yet implemented.
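The auto-mode decision described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the skill's actual code: `chooseMode`, the `minChars` threshold, and the `'ocr-needed'` label are all hypothetical, and the first-page text is assumed to have been extracted already.

```javascript
// Hypothetical sketch of the auto-detect decision. The real check in
// convert.cjs may differ; names and threshold are assumptions.
function chooseMode(firstPageText, minChars = 1) {
  // Ignore whitespace so a page of blank lines does not count as text.
  const visible = (firstPageText || '').replace(/\s+/g, '');
  return visible.length >= minChars ? 'native' : 'ocr-needed';
}
```

Because only the first page is inspected, a PDF with a scanned cover and native text on later pages would be classified as scanned; forcing `--mode native` is the workaround in that case.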
Native
Fast direct text extraction. Best for PDFs with selectable text (not scanned images).
OCR (Scanned PDFs) - Coming Soon
For scanned documents. Currently not implemented - the skill will notify you if a PDF appears to be scanned.
Output
Returns JSON on success:
```json
{
  "success": true,
  "input": "/path/to/input.pdf",
  "output": "/path/to/output.md",
  "stats": {
    "pages": 5,
    "mode": "native"
  }
}
```
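A caller will usually want to parse this JSON and fail loudly when the conversion did not succeed. The sketch below assumes the script prints exactly one JSON document to stdout; `parseConvertResult` is a hypothetical name, and because the failure shape is not documented here, anything other than `"success": true` is treated as a failure.

```javascript
// Hypothetical consumer of the JSON printed by convert.cjs.
// Only the success shape shown above is assumed.
function parseConvertResult(stdout) {
  let result;
  try {
    result = JSON.parse(stdout);
  } catch (err) {
    throw new Error(`convert.cjs output was not valid JSON: ${err.message}`);
  }
  if (!result || result.success !== true) {
    throw new Error('conversion did not report success');
  }
  // Return just the fields a caller typically needs.
  return {
    output: result.output,
    pages: result.stats?.pages,
    mode: result.stats?.mode,
  };
}
```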
Limitations
- Complex multi-column layouts may not preserve structure
- Scanned PDF OCR accuracy depends on image quality
- Mathematical formulas may not convert perfectly
- First-run OCR downloads language data (~15MB)
OCR Setup (Optional)
OCR mode is wired into the skill but not yet implemented. If you want to prepare your environment or extend it yourself, install the OCR dependencies so Node can resolve them:
```bash
cd .agents/skills/pdf-to-markdown
npm install tesseract.js pdfjs-dist canvas
```
Note: The `canvas` package may require build tools on some systems.
IMPORTANT Task Planning Notes (MUST FOLLOW)
- Always plan and break work into many small todo tasks
- Always add a final review todo task to verify work quality and identify fixes/enhancements