pdf-to-markdown

[IMPORTANT] Use TaskCreate to break ALL work into small tasks BEFORE starting, including tasks for each file read. This prevents context loss from long files. For simple tasks, AI MUST ask user whether to skip.

Quick Summary

Goal: Convert PDF files to well-formatted Markdown with auto-detection of native text vs scanned documents. Only native-text conversion is implemented; OCR is planned.
Workflow:
  1. Auto-Detect: Determine whether the PDF has native text or needs OCR
  2. Convert: Run scripts/convert.cjs with the input path and optional mode/output flags
  3. Output: Returns JSON with success status, page count, and output path
Key Rules:
  • Use --mode auto (default) to let the tool decide native vs OCR
  • OCR for scanned PDFs requires additional tesseract.js setup
  • Complex multi-column layouts may not preserve structure perfectly

pdf-to-markdown

Convert PDF files to Markdown format with automatic detection of native text vs scanned documents.

No npm install required

The skill runs with Node.js only; there is no node_modules directory in the repo. It uses npx to run @opendocsg/pdf2md on first use (cached by npx). Optional: run npm install in the skill directory for faster runs without npx.
Requirements: Node.js ≥18, npx (included with npm).
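The Node.js ≥18 requirement can be checked before invoking the skill. The snippet below is a minimal sketch, not part of the skill itself; the helper name meetsNodeRequirement is hypothetical.

```javascript
// Hypothetical helper (not part of the skill): checks the documented
// "Node.js >= 18" requirement against a version string.
function meetsNodeRequirement(version, minMajor = 18) {
  // Accepts strings such as "18.19.0" or "v20.11.1".
  const major = Number.parseInt(String(version).replace(/^v/, "").split(".")[0], 10);
  return Number.isInteger(major) && major >= minMajor;
}

// Warn early if the current runtime is too old.
if (!meetsNodeRequirement(process.versions.node)) {
  console.error(`Node.js >= 18 is required, found ${process.versions.node}`);
}
```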

Quick Start

bash
# Basic conversion (auto-detect native vs scanned)
node .agents/skills/pdf-to-markdown/scripts/convert.cjs --input ./document.pdf

# Specify output path
node .agents/skills/pdf-to-markdown/scripts/convert.cjs -i ./doc.pdf -o ./output.md

# Force native mode (skip OCR detection)
node .agents/skills/pdf-to-markdown/scripts/convert.cjs -i ./doc.pdf --mode native

CLI Options

| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| --input | -i | Input PDF file path | (required) |
| --output | -o | Output markdown file path | {input}.md |
| --mode | -m | Conversion mode: auto, native, ocr | auto |
| --help | -h | Show help message | |
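To make the option semantics concrete, here is a minimal sketch of how flags like these could be parsed in Node.js. It is an illustration only: parseCliArgs, ALIASES, and MODES are hypothetical names, and the real scripts/convert.cjs may parse its arguments differently. The {input}.md default for --output is deliberately left unresolved here.

```javascript
// Hypothetical sketch of parsing the documented flags; the real
// convert.cjs may behave differently.
const ALIASES = { "-i": "input", "-o": "output", "-m": "mode", "-h": "help" };
const MODES = ["auto", "native", "ocr"];

function parseCliArgs(argv) {
  const options = { mode: "auto", help: false }; // "auto" is the documented default
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    // Long flags map to their own name; short flags go through the alias table.
    const name = arg.startsWith("--") ? arg.slice(2) : ALIASES[arg];
    if (name === "help") {
      options.help = true;
    } else if (name === "input" || name === "output" || name === "mode") {
      const value = argv[++i]; // these flags take the next token as their value
      if (value === undefined) throw new Error(`Missing value for ${arg}`);
      options[name] = value;
    } else {
      throw new Error(`Unknown option: ${arg}`);
    }
  }
  if (!options.help && !options.input) throw new Error("--input is required");
  if (!MODES.includes(options.mode)) throw new Error(`Invalid mode: ${options.mode}`);
  return options;
}
```

In a script this would typically be called as parseCliArgs(process.argv.slice(2)).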

Features

  • Auto-Detection: Automatically determines if PDF has native text or requires OCR
  • Native PDFs: Fast extraction using @opendocsg/pdf2md
  • Tables: Basic table structure preservation
  • Cross-Platform: Works on Windows, macOS, Linux
  • No System Dependencies: Pure JavaScript implementation

Conversion Modes

Auto (Default)

Checks whether the PDF has extractable text on its first page. If text is found, it uses native extraction; otherwise it warns that the PDF appears to need OCR.
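The decision described above can be sketched as a small pure function. This is an assumption-laden illustration, not the skill's code: chooseConversion and the firstPageHasText input are hypothetical, and the handling of an explicit ocr mode is a guess based on OCR not being implemented yet.

```javascript
// Hypothetical sketch of the documented mode logic. How the real script
// detects first-page text is not shown here; it is passed in as a boolean.
function chooseConversion(mode, firstPageHasText) {
  if (mode === "native") {
    return { action: "native" }; // forced native: detection is skipped
  }
  if (mode === "ocr") {
    // Assumption: OCR is not implemented, so only a warning is possible.
    return { action: "warn", reason: "OCR mode is not implemented yet" };
  }
  // mode === "auto": decide from the first page.
  return firstPageHasText
    ? { action: "native" }
    : { action: "warn", reason: "PDF appears to be scanned; OCR is not implemented yet" };
}
```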

Native

Fast direct text extraction. Best for PDFs with selectable text (not scanned images).

OCR (Scanned PDFs) - Coming Soon

For scanned documents. Not yet implemented; the skill will notify you if a PDF appears to be scanned.

Output

Returns JSON on success:
json
{
    "success": true,
    "input": "/path/to/input.pdf",
    "output": "/path/to/output.md",
    "stats": {
        "pages": 5,
        "mode": "native"
    }
}
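A caller can turn this JSON into a small summary object. The sketch below assumes the JSON arrives as a string (for example, captured from the script's standard output, which the source does not state explicitly); parseConvertResult is a hypothetical helper, and the shape of a failure response is not documented, so anything other than success: true is treated as an error.

```javascript
// Hypothetical helper: parses the documented success payload.
// Assumption: any payload without "success": true is treated as a failure.
function parseConvertResult(jsonText) {
  const result = JSON.parse(jsonText);
  if (result.success !== true) {
    throw new Error("Conversion did not report success");
  }
  return {
    outputPath: result.output,
    pages: result.stats.pages,
    mode: result.stats.mode,
  };
}

// Example with the payload shown above.
const sample = JSON.stringify({
  success: true,
  input: "/path/to/input.pdf",
  output: "/path/to/output.md",
  stats: { pages: 5, mode: "native" },
});
const summary = parseConvertResult(sample);
console.log(`${summary.pages} pages written to ${summary.outputPath}`);
```

If the script does print its result to standard output, the string could be obtained with Node's child_process.execFileSync, passing { encoding: "utf8" } so the return value is a string rather than a Buffer.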

Limitations

  • Complex multi-column layouts may not preserve structure
  • Scanned PDF OCR accuracy depends on image quality
  • Mathematical formulas may not convert perfectly
  • First-run OCR downloads language data (~15MB)

OCR Setup (Optional)

OCR mode is wired into the skill but not yet implemented. If you want to prepare your environment or extend it yourself, install the OCR dependencies so Node can resolve them:
bash
cd .agents/skills/pdf-to-markdown
npm install tesseract.js pdfjs-dist canvas
Note: The canvas package may require build tools on some systems.

IMPORTANT Task Planning Notes (MUST FOLLOW)
  • Always plan and break work into many small todo tasks
  • Always add a final review todo task to verify work quality and identify fixes/enhancements