document-converter

Purpose

Convert documents between formats and perform PDF operations using local, free, offline tools. No API key, no cost, no internet required. Supports Office formats, PDF manipulation, image-to-PDF, and OCR using LibreOffice, ghostscript, pdftk, tesseract, and imagemagick.

When to Use

- Converting Office documents (DOCX, PPTX, XLSX, ODP, ODT) to PDF or HTML
- Converting PDF pages to images (PNG, JPG, TIFF) or images to PDF
- PDF operations: merge multiple PDFs, split by page range, rotate pages, encrypt or decrypt
- OCR: extract searchable text from scanned PDFs or image files
- Any document format conversion that does not involve video or audio

Do NOT use for:

- Video or audio format conversion (use a dedicated media skill)
- Converting code between programming languages
- Simple markdown → HTML (use pandoc directly)
- Structured text extraction from PDFs (use `docling-converter` instead)

Step 0: Discovery — Detect Available Tools

Before any conversion, detect which tools are installed and identify the operating system:

```bash
# Detect OS
uname -s   # Darwin = macOS, Linux = Linux; on Windows use 'ver' or check $OS

# Check each tool
libreoffice --version 2>/dev/null || echo "NOT FOUND"
gs --version 2>/dev/null || echo "NOT FOUND"
pdftk --version 2>/dev/null || echo "NOT FOUND"
tesseract --version 2>/dev/null || echo "NOT FOUND"
convert -version 2>/dev/null || echo "NOT FOUND"   # ImageMagick
```


If a required tool is missing, show the install command for the user's OS before proceeding:

| Tool | macOS | Linux (apt) | Windows |
|------|-------|-------------|---------|
| LibreOffice | `brew install --cask libreoffice` | `sudo apt install libreoffice` | `winget install TheDocumentFoundation.LibreOffice` |
| Ghostscript | `brew install ghostscript` | `sudo apt install ghostscript` | `winget install ArtifexSoftware.GhostScript` |
| pdftk | `brew install pdftk-java` | `sudo apt install pdftk` | `winget install PDFTechnologies.PDFtk` |
| Tesseract | `brew install tesseract` | `sudo apt install tesseract-ocr` | `winget install UB-Mannheim.TesseractOCR` |
| ImageMagick | `brew install imagemagick` | `sudo apt install imagemagick` | `winget install ImageMagick.ImageMagick` |

If the user is on Windows and Microsoft Office is installed, note that LibreOffice and Office usually produce comparable results for DOCX/PPTX/XLSX, although complex layouts or missing fonts can render differently in LibreOffice. Office via COM automation is an advanced alternative.
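
The per-tool checks above can also be wrapped in a small helper that reports every tool in one pass. This is a minimal sketch rather than part of the skill: the function names `find_tool` and `report_tools` are hypothetical, and it assumes a POSIX shell. It also probes `soffice` and `magick`, which are alternative binary names used by some LibreOffice and ImageMagick 7 installs.

```bash
# Print the first candidate name found on PATH; return 1 if none is found.
find_tool() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Report every tool this skill relies on, one line per tool.
report_tools() {
  for spec in "LibreOffice:libreoffice soffice" \
              "Ghostscript:gs" \
              "pdftk:pdftk" \
              "Tesseract:tesseract" \
              "ImageMagick:magick convert"; do
    label=${spec%%:*}   # text before the colon
    names=${spec#*:}    # space-separated candidate binaries after the colon
    # $names is intentionally unquoted so it splits into separate candidates.
    if found=$(find_tool $names); then
      printf '%-12s OK (%s)\n' "$label" "$found"
    else
      printf '%-12s NOT FOUND\n' "$label"
    fi
  done
}
```

Calling `report_tools` prints one status line per tool; any line ending in `NOT FOUND` points to a row in the install table above.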

Workflow

Office Documents → PDF (most common)

Use LibreOffice headless mode. Works for DOCX, PPTX, XLSX, ODP, ODT, and any format LibreOffice opens.

```bash
# Convert a single file to PDF, written to --outdir
libreoffice --headless --convert-to pdf "/path/to/file.pptx" --outdir "/path/to/output/"

# Convert to HTML
libreoffice --headless --convert-to html "/path/to/file.docx" --outdir "/path/to/output/"

# Batch: convert all PPTX in a directory
libreoffice --headless --convert-to pdf /path/to/folder/*.pptx --outdir /path/to/output/
```


Notes:
- The output file is placed in `--outdir` with the same base name and the new extension.
- For batch jobs, run one `libreoffice` process; do NOT spawn multiple instances in parallel (LibreOffice uses a single user profile lock).
- If LibreOffice is open as a GUI app, close it first or use the `--norestore` flag.
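
Because the output name is derived from the input name, a script can predict where the converted file will land and confirm it afterwards. The following is a minimal sketch, assuming a POSIX shell; `expected_output` and `convert_office` are hypothetical helper names, not LibreOffice features.

```bash
# Predict the path LibreOffice writes: <outdir>/<input base name>.<new extension>.
expected_output() {
  input=$1 outdir=$2 ext=$3
  base=$(basename "$input")
  # Strip one trailing slash from outdir and the last extension from the base name.
  printf '%s/%s.%s\n' "${outdir%/}" "${base%.*}" "$ext"
}

# Convert one file in a single LibreOffice process, then confirm the result exists.
convert_office() {
  input=$1 outdir=$2 ext=${3:-pdf}
  mkdir -p "$outdir" || return 1
  libreoffice --headless --convert-to "$ext" "$input" --outdir "$outdir" || return 1
  target=$(expected_output "$input" "$outdir" "$ext")
  # Print the path only if the converted file is really there.
  [ -f "$target" ] && printf '%s\n' "$target"
}
```

For example, `convert_office presentation.pptx ./output` would print `./output/presentation.pdf` when the conversion succeeds, and return a non-zero status otherwise.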

PDF → Images / Images → PDF

Use ImageMagick for image-PDF interchange:

```bash
# PDF pages → PNG (one file per page)
convert -density 150 "/path/to/doc.pdf" "/path/to/output/page_%03d.png"

# PDF pages → JPG with quality control
convert -density 150 "/path/to/doc.pdf" -quality 85 "/path/to/output/page_%03d.jpg"

# Single image → PDF
convert "/path/to/image.png" "/path/to/output/document.pdf"

# Multiple images → single PDF
convert img1.png img2.jpg img3.tiff "/path/to/output/combined.pdf"
```


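Note: On some systems ImageMagick's PDF support requires ghostscript. If conversion fails with a policy error, check `/etc/ImageMagick-*/policy.xml` and make sure the PDF coder is not restricted. On ImageMagick 7 the main command is `magick`, and `convert` is kept as a legacy alias.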
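
The policy check described in the note can be scripted. The sketch below is a rough heuristic, assuming the restriction appears as a single-line `<policy domain="coder" rights="none" pattern="...PDF..." />` entry; the function name `pdf_blocked_by_policy` is hypothetical, and multi-line XML comments are not handled.

```bash
# Return 0 if the given policy file appears to deny the PDF coder, 1 otherwise.
pdf_blocked_by_policy() {
  policy_file=$1
  [ -r "$policy_file" ] || return 1
  # Skip lines that open an XML comment, then look for a coder policy
  # with rights="none" whose pattern mentions PDF.
  grep -v '<!--' "$policy_file" \
    | grep 'domain="coder"' \
    | grep 'rights="none"' \
    | grep -Eq 'pattern="[^"]*PDF[^"]*"'
}

# Usage:
#   for f in /etc/ImageMagick-*/policy.xml; do
#     pdf_blocked_by_policy "$f" && echo "PDF is restricted in $f"
#   done
```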

PDF Operations — Merge, Split, Rotate

Prefer `pdftk` when available (simpler syntax). Fall back to `ghostscript` if pdftk is not installed.

**Merge PDFs:**
```bash
# pdftk (preferred)
pdftk file1.pdf file2.pdf file3.pdf cat output merged.pdf

# ghostscript (fallback)
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=merged.pdf file1.pdf file2.pdf file3.pdf
```


**Split PDF by page range:**
```bash
# pdftk: extract pages 1-3 and 5
pdftk input.pdf cat 1-3 5 output extracted.pdf

# ghostscript: extract pages 2 to 5
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dFirstPage=2 -dLastPage=5 -sOutputFile=extracted.pdf input.pdf
```


**Rotate pages:**
```bash
# pdftk: rotate all pages 90° clockwise
pdftk input.pdf rotate 1-endeast output rotated.pdf

# pdftk: rotate a specific page (page 3) by 180°
pdftk input.pdf rotate 3south output rotated.pdf
```


**Encrypt PDF (add password):**
```bash
pdftk input.pdf output secured.pdf user_pw "userpass" owner_pw "ownerpass"
```

**Decrypt PDF (remove password):**
```bash
pdftk secured.pdf input_pw "password" output decrypted.pdf
```

OCR — Extract Text from Scanned Documents

Use Tesseract. Input must be an image (PNG, JPG, TIFF) or a PDF that ImageMagick can rasterize first.

```bash
# OCR a single image → text file
tesseract "/path/to/scan.png" "/path/to/output/result"
# Output: result.txt

# OCR with language specification (Portuguese example)
tesseract "/path/to/scan.png" output -l por

# OCR → searchable PDF (requires tesseract with pdf support)
tesseract "/path/to/scan.png" output -l eng pdf
# Output: output.pdf

# OCR a scanned PDF: rasterize first, then OCR
convert -density 300 "scan.pdf" "scan_page_%03d.tiff"
tesseract "scan_page_000.tiff" output -l eng pdf
```
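
The `-l` option only works when the matching language data is installed, so it can help to check before running OCR. A minimal sketch, assuming a Tesseract build that supports `--list-langs` and prints one language code per line; `lang_listed` is a hypothetical helper name.

```bash
# Read a language listing on stdin; return 0 if language code $1 appears
# as a whole line, 1 otherwise.
lang_listed() {
  grep -Fxq -- "$1"
}

# Usage (requires tesseract):
#   tesseract --list-langs 2>&1 | lang_listed por \
#     || echo "Portuguese language data is not installed"
```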


For multi-page scanned PDFs:
```bash
# Rasterize all pages
convert -density 300 "scan.pdf" "page_%03d.tiff"

# OCR each page and merge results
for f in page_*.tiff; do
  tesseract "$f" "${f%.tiff}" -l eng pdf
done
pdftk page_*.pdf cat output final_ocr.pdf
```

Tool Routing Decision Table

| Task | Primary tool | Fallback |
|------|--------------|----------|
| Office → PDF | LibreOffice | None (LibreOffice is the standard) |
| Office → HTML | LibreOffice | None |
| PDF → images | ImageMagick | ghostscript (`gs -sDEVICE=png16m`) |
| Images → PDF | ImageMagick | ghostscript |
| Merge PDFs | pdftk | ghostscript |
| Split PDF | pdftk | ghostscript |
| Rotate PDF | pdftk | ghostscript |
| Encrypt PDF | pdftk | None |
| Decrypt PDF | pdftk | None |
| OCR | tesseract | None |
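
The routing table above can be encoded as a lookup so a script picks tools consistently. This is only an illustrative sketch: the function name `route_task` and the task keys (`merge`, `pdf-to-images`, and so on) are hypothetical labels chosen here, not part of the skill.

```bash
# Print "<primary> <fallback>" for a task key; the fallback is "none" where
# the table lists no fallback. Returns 1 for tasks outside this skill.
route_task() {
  case $1 in
    office-to-pdf|office-to-html) printf '%s\n' "libreoffice none" ;;
    pdf-to-images|images-to-pdf)  printf '%s\n' "imagemagick ghostscript" ;;
    merge|split|rotate)           printf '%s\n' "pdftk ghostscript" ;;
    encrypt|decrypt)              printf '%s\n' "pdftk none" ;;
    ocr)                          printf '%s\n' "tesseract none" ;;
    *)                            return 1 ;;
  esac
}

# Usage:
#   set -- $(route_task merge)   # $1 = primary tool, $2 = fallback
```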

Comparison: When to Use document-converter vs Alternatives

| Tool | Best for |
|------|----------|
| `document-converter` | Office → PDF, PDF operations (merge/split/rotate/encrypt), OCR, image ↔ PDF |
| `docling-converter` | PDF/Office → structured Markdown or JSON; layout-aware content extraction |
| `pandoc` | Markdown ↔ HTML ↔ LaTeX ↔ DOCX; lightweight text format conversions |
| `pptx-translator` | Translating PowerPoint files between languages |

Critical Rules

- NEVER start a batch LibreOffice conversion with multiple parallel processes. LibreOffice uses a single user-profile lock, and parallel instances sharing that profile are likely to fail.
- ALWAYS run Step 0 Discovery to confirm the required tool is installed before invoking it.
- ALWAYS suggest the correct install command for the user's OS when a tool is missing.
- NEVER use cloudconvert, external APIs, or paid services. This skill is fully offline.
- ALWAYS prefer pdftk over ghostscript for PDF operations when both are available (simpler, safer syntax).
- ALWAYS specify the output directory explicitly to avoid writing files to unexpected locations.

Example Usage

Convert PPTX to PDF:

```bash
libreoffice --headless --convert-to pdf presentation.pptx --outdir ./output/
```

Merge 3 PDFs:

```bash
pdftk report.pdf appendix.pdf cover.pdf cat output final.pdf
```

OCR a scanned image (Portuguese):

```bash
tesseract scan.png resultado -l por
```

PDF pages to PNG images:

```bash
convert -density 150 document.pdf page_%03d.png
```

Encrypt a PDF:

```bash
pdftk sensitive.pdf output protected.pdf user_pw "secret123"
```
