pdf-to-markdown

PDF to Markdown

Convert PDFs into structured, semantic Markdown that preserves the document's logical structure (headings, tables, lists, and reading order) rather than producing flat text. This produces significantly higher-quality output than reading a PDF directly with the `read` tool, which only extracts raw text without structure.

Usage

Before running any commands, set `SKILL_DIR` to the absolute path of the directory containing this SKILL.md file. Use `$SKILL_DIR/bin/pdf-to-markdown` in all commands below.

The `$SKILL_DIR/bin/pdf-to-markdown` wrapper automatically installs the platform-specific binary into `~/.local/share/nutrient/cli/` from the CDN. It caches the binary and checks for updates at most once every 6 hours, so subsequent runs are fast.
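A quick sanity check before the first conversion can catch a mis-set path early. The following is a minimal sketch, not part of the skill itself: the helper name `check_skill_dir` is hypothetical, and it assumes only the wrapper location described above.

```bash
# Hypothetical helper: confirm SKILL_DIR is an absolute path and that the
# wrapper script exists and is executable. Returns non-zero otherwise.
check_skill_dir() {
  case "${SKILL_DIR:-}" in
    /*) ;;  # starts with "/", so it is an absolute path
    *)
      echo "SKILL_DIR must be set to an absolute path" >&2
      return 1
      ;;
  esac
  if [ ! -x "$SKILL_DIR/bin/pdf-to-markdown" ]; then
    echo "wrapper not found or not executable: $SKILL_DIR/bin/pdf-to-markdown" >&2
    return 1
  fi
}
```

After setting `SKILL_DIR`, calling `check_skill_dir` returns 0 when the wrapper is in place and prints a short diagnostic to stderr otherwise.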

Single file

```bash
$SKILL_DIR/bin/pdf-to-markdown INPUT.pdf OUTPUT.md
```

If `OUTPUT.md` is omitted, the converter writes the Markdown to stdout instead.
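Because the converter writes to stdout when no output path is given, its result can be piped into other tools. The sketch below previews the first lines of a conversion without writing a file. The function name `preview_markdown` and the default of 20 lines are illustrative assumptions, not part of the skill.

```bash
# Hypothetical helper: print the first N lines (default 20) of the converted
# Markdown by omitting the output path and reading from stdout.
preview_markdown() {
  local input=$1
  local lines=${2:-20}
  "$SKILL_DIR/bin/pdf-to-markdown" "$input" | head -n "$lines"
}
```

Note that the pipeline's exit status is that of `head`, so this form suits quick inspection rather than checking whether the conversion succeeded.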

Batch directory (2+ files)

For multiple files, pass directories instead of individual files. The converter processes all PDFs in the input directory in parallel, which is much faster than converting one at a time.

```bash
$SKILL_DIR/bin/pdf-to-markdown INPUT_DIR/ OUTPUT_DIR/
```

Image export

To extract images from the PDF and reference them in the output Markdown, add the `--enable-image-export` flag:

```bash
$SKILL_DIR/bin/pdf-to-markdown --enable-image-export INPUT.pdf OUTPUT.md
```

Images are saved to `{output}_resources/` alongside the output file and referenced as standard Markdown image links. This is useful when feeding output to LLMs that support vision, or when image context improves downstream accuracy. The flag is off by default because it increases processing time for image-heavy documents.

Workflow

1. Choose mode: Use batch directory mode for 2+ files, single file mode otherwise.
2. Run the converter: `$SKILL_DIR/bin/pdf-to-markdown INPUT [OUTPUT]`
3. Check the exit code: Exit 0 means success. On failure, read stderr for the error message.
4. Validate the output: If the output file is empty or near-empty, see Troubleshooting below.
5. Report the output path: Tell the user where the converted file(s) are. Do NOT read the markdown back into context by default, because converted documents can be very large and will fill the context window. Only read the output if the user's task specifically requires analyzing or summarizing the content (e.g., "summarize this PDF", "what does this contract say about X").
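The run, check, validate, and report steps above can be sketched as a single shell function. This is an illustrative wrapper under stated assumptions: the name `convert_and_report` and its messages are hypothetical, and it relies only on the documented behavior that exit code 0 means success and that errors go to stderr.

```bash
# Hypothetical wrapper around the documented workflow.
# $1 = input PDF or directory, $2 = output file or directory.
convert_and_report() {
  local input=$1
  local output=$2

  # Run the converter; its own stderr is left untouched so the specific
  # error message stays visible on failure.
  if ! "$SKILL_DIR/bin/pdf-to-markdown" "$input" "$output"; then
    echo "conversion failed; see the error message above" >&2
    return 1
  fi

  # Validate single-file output: -s is true only for a non-empty file.
  if [ -f "$output" ] && [ ! -s "$output" ]; then
    echo "warning: $output is empty; the PDF may be scanned or image-only" >&2
  fi

  # Report the path only; the Markdown is deliberately not printed.
  echo "Converted output: $output"
}
```

The function prints the output location rather than the converted content, in line with the guidance not to read large documents back into context by default.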

Troubleshooting

Empty or minimal output: The PDF may be scanned or image-only and contain no extractable text.
Non-zero exit code: Read stderr for the specific error. Common causes: corrupted PDF, unsupported encryption, or network issues during first-run binary download.
First run is slow: The wrapper downloads the platform binary on first use, which typically takes a few seconds. Subsequent runs use the cached binary.
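To detect the "empty or minimal output" case automatically, one option is a byte-size check on the result. The sketch below is an assumption-laden example: the function name `is_near_empty` and the 64-byte default threshold are hypothetical and should be tuned to the documents at hand.

```bash
# Hypothetical check: succeed (exit 0) when the file is smaller than a
# threshold in bytes. The 64-byte default is an arbitrary assumption.
is_near_empty() {
  local file=$1
  local min_bytes=${2:-64}
  local size
  # wc -c may pad its output with spaces on some platforms; strip them.
  size=$(wc -c < "$file" | tr -d ' ')
  [ "$size" -lt "$min_bytes" ]
}
```

For example, `is_near_empty OUTPUT.md && echo "possibly a scanned PDF" >&2` flags suspiciously small results for manual review.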

License

Free for processing up to 1,000 documents per calendar month.

Commercial license required for:

processing over 1,000 documents/month
redistributing the binary
OEM/white-label use

Contact sales@nutrient.io for commercial licensing.
