document-converter


Purpose

Convert documents between formats and perform PDF operations using local, free, offline tools. No API key, no cost, no internet required. Supports Office format conversion, PDF manipulation, image-to-PDF conversion, and OCR, using LibreOffice, Ghostscript, pdftk, Tesseract, and ImageMagick.

When to Use

  • Converting Office documents (DOCX, PPTX, XLSX, ODP, ODT) to PDF or HTML
  • Converting PDF pages to images (PNG, JPG, TIFF) or images to PDF
  • PDF operations: merge multiple PDFs, split by page range, rotate pages, encrypt or decrypt
  • OCR: extract searchable text from scanned PDFs or image files
  • Any document format conversion that does not involve video or audio
Do NOT use for:
  • Video or audio format conversion (use a dedicated media skill)
  • Converting code between programming languages
  • Simple markdown → HTML (use pandoc directly)
  • Structured text extraction from PDFs (use `docling-converter` instead)

Step 0: Discovery — Detect Available Tools

Before any conversion, detect which tools are installed and identify the operating system:
```bash
# Detect OS
uname -s   # Darwin = macOS, Linux = Linux; on Windows use 'ver' or check $OS

# Check each tool
libreoffice --version 2>/dev/null || echo "NOT FOUND"   # often installed as `soffice` on macOS
gs --version 2>/dev/null || echo "NOT FOUND"
pdftk --version 2>/dev/null || echo "NOT FOUND"
tesseract --version 2>/dev/null || echo "NOT FOUND"
convert -version 2>/dev/null || echo "NOT FOUND"        # ImageMagick (version 7 names the binary `magick`)
```

If a required tool is missing, show the install command for the user's OS before proceeding:

| Tool | macOS | Linux (apt) | Windows |
|------|-------|-------------|---------|
| LibreOffice | `brew install --cask libreoffice` | `sudo apt install libreoffice` | `winget install TheDocumentFoundation.LibreOffice` |
| Ghostscript | `brew install ghostscript` | `sudo apt install ghostscript` | `winget install ArtifexSoftware.GhostScript` |
| pdftk | `brew install pdftk-java` | `sudo apt install pdftk` | `winget install PDFTechnologies.PDFtk` |
| Tesseract | `brew install tesseract` | `sudo apt install tesseract-ocr` | `winget install UB-Mannheim.TesseractOCR` |
| ImageMagick | `brew install imagemagick` | `sudo apt install imagemagick` | `winget install ImageMagick.ImageMagick` |

If the user is on Windows with Microsoft Office installed, LibreOffice still handles DOCX/PPTX/XLSX conversion and usually produces comparable output, although complex layouts or missing fonts can render differently. Driving Office through COM automation is an advanced alternative.
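
The discovery step and the install table can be combined into one small script. The following is a minimal sketch, not part of the skill itself: the function names `detect_os`, `install_hint`, and `report_tools` are hypothetical, and treating every non-Darwin, non-Linux system as Windows is an assumption.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not defined by the skill.

# Map `uname -s` to a column of the install table.
detect_os() {
  case "$(uname -s)" in
    Darwin) echo macos ;;
    Linux)  echo linux ;;
    *)      echo windows ;;   # assumption: anything else is treated as Windows
  esac
}

# Print the install command from the table for a binary name and OS.
install_hint() {
  case "$1:$2" in
    libreoffice:macos)   echo "brew install --cask libreoffice" ;;
    libreoffice:linux)   echo "sudo apt install libreoffice" ;;
    libreoffice:windows) echo "winget install TheDocumentFoundation.LibreOffice" ;;
    gs:macos)            echo "brew install ghostscript" ;;
    gs:linux)            echo "sudo apt install ghostscript" ;;
    gs:windows)          echo "winget install ArtifexSoftware.GhostScript" ;;
    pdftk:macos)         echo "brew install pdftk-java" ;;
    pdftk:linux)         echo "sudo apt install pdftk" ;;
    pdftk:windows)       echo "winget install PDFTechnologies.PDFtk" ;;
    tesseract:macos)     echo "brew install tesseract" ;;
    tesseract:linux)     echo "sudo apt install tesseract-ocr" ;;
    tesseract:windows)   echo "winget install UB-Mannheim.TesseractOCR" ;;
    convert:macos)       echo "brew install imagemagick" ;;
    convert:linux)       echo "sudo apt install imagemagick" ;;
    convert:windows)     echo "winget install ImageMagick.ImageMagick" ;;
    *) echo "no install hint for $1 on $2"; return 1 ;;
  esac
}

# Report each tool as found, or missing with its install command.
report_tools() {
  os=$(detect_os)
  for tool in libreoffice gs pdftk tesseract convert; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: NOT FOUND -> $(install_hint "$tool" "$os")"
    fi
  done
}

report_tools
```

`command -v` is used instead of running each tool with a version flag, so the check does not depend on how a given tool spells that flag.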
libreoffice --version 2>/dev/null || echo "NOT FOUND" gs --version 2>/dev/null || echo "NOT FOUND" pdftk --version 2>/dev/null || echo "NOT FOUND" tesseract --version 2>/dev/null || echo "NOT FOUND" convert -version 2>/dev/null || echo "NOT FOUND" # ImageMagick

如果缺少所需工具,请先为用户的操作系统显示对应的安装命令:

| 工具 | macOS | Linux (apt) | Windows |
|------|-------|-------------|---------|
| LibreOffice | `brew install --cask libreoffice` | `sudo apt install libreoffice` | `winget install TheDocumentFoundation.LibreOffice` |
| Ghostscript | `brew install ghostscript` | `sudo apt install ghostscript` | `winget install ArtifexSoftware.GhostScript` |
| pdftk | `brew install pdftk-java` | `sudo apt install pdftk` | `winget install PDFTechnologies.PDFtk` |
| Tesseract | `brew install tesseract` | `sudo apt install tesseract-ocr` | `winget install UB-Mannheim.TesseractOCR` |
| ImageMagick | `brew install imagemagick` | `sudo apt install imagemagick` | `winget install ImageMagick.ImageMagick` |

如果用户使用Windows且已安装Microsoft Office,请注意LibreOffice和Office对DOCX/PPTX/XLSX的转换效果相当;通过COM自动化调用Office是一种进阶替代方案。

Workflow

Office Documents → PDF (most common)

Use LibreOffice headless mode. Works for DOCX, PPTX, XLSX, ODP, ODT, and any format LibreOffice opens.
```bash
# Convert a single file to PDF (written to --outdir)
libreoffice --headless --convert-to pdf "/path/to/file.pptx" --outdir "/path/to/output/"

# Convert to HTML
libreoffice --headless --convert-to html "/path/to/file.docx" --outdir "/path/to/output/"

# Batch: convert all PPTX in a directory
libreoffice --headless --convert-to pdf /path/to/folder/*.pptx --outdir /path/to/output/
```

Notes:
- Output file is placed in `--outdir` with the same base name and new extension
- For batch, run one `libreoffice` process — do NOT spawn multiple instances in parallel (LibreOffice uses a single user profile lock)
- If LibreOffice is open as a GUI app, close it first or use `--norestore` flag
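
The single-process rule from the notes above could be wrapped in a helper like the one below. This is a sketch under assumptions: `office_to_pdf` and the `DRY_RUN` variable are hypothetical names introduced here, and the helper only passes all files to one `libreoffice` invocation rather than doing anything smarter.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert many Office files with ONE LibreOffice process.
# Usage: office_to_pdf OUTDIR FILE [FILE ...]
# Set DRY_RUN=1 to print the command instead of running it.
office_to_pdf() {
  if [ "$#" -lt 2 ]; then
    echo "usage: office_to_pdf OUTDIR FILE..." >&2
    return 2
  fi
  outdir=$1
  shift
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "libreoffice --headless --convert-to pdf --outdir $outdir $*"
    return 0
  fi
  mkdir -p "$outdir" || return 1
  # All input files go to a single process; no parallel instances are spawned.
  libreoffice --headless --convert-to pdf --outdir "$outdir" "$@"
}

# Preview the command for two example files without running LibreOffice.
DRY_RUN=1 office_to_pdf ./output deck1.pptx deck2.pptx
```

Running with `DRY_RUN=1` first is a cheap way to confirm the output directory and file list before a long batch starts.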

PDF → Images / Images → PDF

Use ImageMagick for image-PDF interchange:
```bash
# PDF pages → PNG (one file per page)
convert -density 150 "/path/to/doc.pdf" "/path/to/output/page_%03d.png"

# PDF pages → JPG with quality control
convert -density 150 "/path/to/doc.pdf" -quality 85 "/path/to/output/page_%03d.jpg"

# Single image → PDF
convert "/path/to/image.png" "/path/to/output/document.pdf"

# Multiple images → single PDF
convert img1.png img2.jpg img3.tiff "/path/to/output/combined.pdf"
```

Note: ImageMagick delegates PDF reading to Ghostscript, so `gs` must be installed for PDF input. If conversion fails with a policy error (for example "not authorized"), check `/etc/ImageMagick-*/policy.xml` and ensure PDF is not restricted.
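
When ImageMagick is missing or blocked by policy, Ghostscript can rasterize PDF pages directly with its `png16m` device. The sketch below picks whichever rasterizer is on PATH and prints the matching command for review. `pick_rasterizer` and `rasterize_cmd` are hypothetical names, and the 150 DPI default simply mirrors the examples above.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative.

# Pick the first rasterizer on PATH: ImageMagick 7 (`magick`),
# ImageMagick 6 (`convert`), then Ghostscript (`gs`).
pick_rasterizer() {
  for tool in magick convert gs; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      return 0
    fi
  done
  return 1
}

# Print the command that renders PDF pages to PNG with the given tool.
# Usage: rasterize_cmd TOOL PDF OUTDIR [DPI]
# Note: ImageMagick numbers output from page_000, Ghostscript from page_001.
rasterize_cmd() {
  tool=$1; pdf=$2; outdir=$3; dpi=${4:-150}
  case "$tool" in
    magick|convert)
      printf '%s -density %s "%s" "%s/page_%%03d.png"\n' "$tool" "$dpi" "$pdf" "$outdir" ;;
    gs)
      printf 'gs -dBATCH -dNOPAUSE -q -sDEVICE=png16m -r%s -sOutputFile="%s/page_%%03d.png" "%s"\n' "$dpi" "$outdir" "$pdf" ;;
    *)
      return 1 ;;
  esac
}

if tool=$(pick_rasterizer); then
  rasterize_cmd "$tool" "doc.pdf" "./output"
else
  echo "no rasterizer found; install ImageMagick or Ghostscript" >&2
fi
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; the printed line can be pasted into a shell once the paths look right.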

PDF Operations — Merge, Split, Rotate

Prefer `pdftk` when available (simpler syntax). Fall back to `ghostscript` if pdftk is not installed.

**Merge PDFs:**
```bash
# pdftk (preferred)
pdftk file1.pdf file2.pdf file3.pdf cat output merged.pdf

# ghostscript (fallback)
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=merged.pdf file1.pdf file2.pdf file3.pdf
```
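
The prefer-pdftk, fall-back-to-Ghostscript rule for merging can be expressed as one function. This is a minimal sketch: `merge_pdfs` is a hypothetical name, and the function does nothing beyond choosing between the two commands shown above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: merge PDFs with pdftk when present, otherwise Ghostscript.
# Usage: merge_pdfs OUTPUT.pdf INPUT.pdf [INPUT.pdf ...]
merge_pdfs() {
  if [ "$#" -lt 2 ]; then
    echo "usage: merge_pdfs OUTPUT INPUT..." >&2
    return 2
  fi
  out=$1
  shift
  if command -v pdftk >/dev/null 2>&1; then
    pdftk "$@" cat output "$out"
  elif command -v gs >/dev/null 2>&1; then
    gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile="$out" "$@"
  else
    echo "neither pdftk nor gs found; run Step 0 first" >&2
    return 127
  fi
}

# Example call (not executed here):
#   merge_pdfs merged.pdf file1.pdf file2.pdf file3.pdf
```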

**Split PDF by page range:**
```bash
# pdftk: extract pages 1-3 and 5
pdftk input.pdf cat 1-3 5 output extracted.pdf

# ghostscript: extract pages 2 to 5
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dFirstPage=2 -dLastPage=5 -sOutputFile=extracted.pdf input.pdf
```

**Rotate pages:**
```bash
# pdftk: rotate all pages 90° clockwise
pdftk input.pdf rotate 1-endeast output rotated.pdf

# pdftk: rotate specific page (page 3) 180°
pdftk input.pdf rotate 3south output rotated.pdf
```
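
pdftk encodes rotation as a compass keyword appended to the page range (`north` = 0°, `east` = 90°, `south` = 180°, `west` = 270°). These keywords set the page's absolute orientation; pdftk also accepts `left`, `right`, and `down` for relative adjustments. The sketch below builds the page spec from an angle. The helper names `rotation_keyword` and `rotate_spec` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative.

# Map an absolute angle in degrees to the pdftk direction keyword.
rotation_keyword() {
  case "$1" in
    0)   echo north ;;
    90)  echo east ;;
    180) echo south ;;
    270) echo west ;;
    *)   echo "unsupported angle: $1 (use 0, 90, 180 or 270)" >&2
         return 1 ;;
  esac
}

# Build a pdftk page spec from a page range and an angle.
# Usage: rotate_spec RANGE ANGLE
rotate_spec() {
  kw=$(rotation_keyword "$2") || return 1
  echo "$1$kw"
}

# Example call (not executed here):
#   pdftk input.pdf rotate "$(rotate_spec 1-end 90)" output rotated.pdf
rotate_spec 1-end 90
```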

**Encrypt PDF (add password):**
```bash
pdftk input.pdf output secured.pdf user_pw "userpass" owner_pw "ownerpass"
```

**Decrypt PDF (remove password):**
```bash
pdftk secured.pdf input_pw "password" output decrypted.pdf
```

OCR — Extract Text from Scanned Documents

Use Tesseract. Input must be an image (PNG, JPG, TIFF) or a PDF that ImageMagick can rasterize first.
```bash
# OCR a single image → searchable text file
tesseract "/path/to/scan.png" "/path/to/output/result"
# Output: result.txt

# OCR with language specification (Portuguese example)
tesseract "/path/to/scan.png" output -l por

# OCR → searchable PDF (requires tesseract with pdf support)
tesseract "/path/to/scan.png" output -l eng pdf
# Output: output.pdf

# OCR a scanned PDF: rasterize first, then OCR
convert -density 300 "scan.pdf" "scan_page_%03d.tiff"
tesseract "scan_page_000.tiff" output -l eng pdf
```

For multi-page scanned PDFs:
```bash
# Rasterize all pages
convert -density 300 "scan.pdf" "page_%03d.tiff"

# OCR each page and merge results
for f in page_*.tiff; do
  tesseract "$f" "${f%.tiff}" -l eng pdf
done
pdftk page_*.pdf cat output final_ocr.pdf
```
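
The `-l` option only works when the matching language data is installed, and `tesseract --list-langs` prints the installed codes one per line. A possible pre-flight check is sketched below. `has_lang` and `ocr_with_lang` are hypothetical names; the note about `tesseract-ocr-<code>` packages reflects the usual Debian/Ubuntu naming and is an assumption for other systems.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative.

# Succeed if the language code appears as a whole line on stdin.
# Reading from stdin lets the check be exercised without tesseract installed.
has_lang() {
  grep -qx -- "$1"
}

# Run OCR only when the requested language pack is installed.
# Usage: ocr_with_lang LANG IMAGE OUTPUT_BASENAME
ocr_with_lang() {
  lang=$1; image=$2; outbase=$3
  # 2>&1 because some tesseract versions print the list on stderr.
  if ! tesseract --list-langs 2>&1 | has_lang "$lang"; then
    echo "language pack '$lang' is not installed" >&2
    echo "(on Debian/Ubuntu it is typically packaged as tesseract-ocr-$lang)" >&2
    return 1
  fi
  tesseract "$image" "$outbase" -l "$lang"
}

# Example call (not executed here):
#   ocr_with_lang por scan.png resultado
```

Matching whole lines avoids false positives such as `po` matching `por`.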

Tool Routing Decision Table

| Task | Primary tool | Fallback |
|------|--------------|----------|
| Office → PDF | LibreOffice | None (LibreOffice is the standard) |
| Office → HTML | LibreOffice | None |
| PDF → images | ImageMagick | ghostscript (`gs -sDEVICE=png16m`) |
| Images → PDF | ImageMagick | ghostscript |
| Merge PDFs | pdftk | ghostscript |
| Split PDF | pdftk | ghostscript |
| Rotate PDF | pdftk | ghostscript |
| Encrypt PDF | pdftk | None |
| Decrypt PDF | pdftk | None |
| OCR | tesseract | None |
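
The routing table translates naturally into a lookup plus an availability check. The sketch below is one way to write it. The task labels (`office-to-pdf`, `merge`, and so on) and the function names are hypothetical, and ImageMagick is looked up under its version 6 binary name `convert`, which is an assumption.

```bash
#!/usr/bin/env bash
# Sketch only: task labels and helper names are illustrative.

# Primary tool (binary name) for a task, following the routing table.
primary_tool() {
  case "$1" in
    office-to-pdf|office-to-html)       echo libreoffice ;;
    pdf-to-images|images-to-pdf)        echo convert ;;
    merge|split|rotate|encrypt|decrypt) echo pdftk ;;
    ocr)                                echo tesseract ;;
    *) return 1 ;;
  esac
}

# Fallback tool for a task; fails when the table lists "None".
fallback_tool() {
  case "$1" in
    pdf-to-images|images-to-pdf|merge|split|rotate) echo gs ;;
    *) return 1 ;;
  esac
}

# Print the first installed tool for a task, primary before fallback.
route_tool() {
  task=$1
  primary=$(primary_tool "$task") || { echo "unknown task: $task" >&2; return 2; }
  if command -v "$primary" >/dev/null 2>&1; then
    echo "$primary"
    return 0
  fi
  if fb=$(fallback_tool "$task") && command -v "$fb" >/dev/null 2>&1; then
    echo "$fb"
    return 0
  fi
  echo "no tool available for $task" >&2
  return 1
}

# Example call (not executed here):
#   route_tool merge
```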

Comparison: When to Use document-converter vs Alternatives

| Tool | Best for |
|------|----------|
| `document-converter` | Office → PDF, PDF operations (merge/split/rotate/encrypt), OCR, image ↔ PDF |
| `docling-converter` | PDF/Office → structured Markdown or JSON; layout-aware content extraction |
| `pandoc` | Markdown ↔ HTML ↔ LaTeX ↔ DOCX; lightweight text format conversions |
| `pptx-translator` | Translating PowerPoint files between languages |

Critical Rules

  • NEVER start a batch LibreOffice conversion with multiple parallel processes — LibreOffice uses a single user-profile lock and parallel instances will crash
  • ALWAYS run Step 0 Discovery to confirm the required tool is installed before invoking it
  • ALWAYS suggest the correct install command for the user's OS when a tool is missing
  • NEVER use cloudconvert, external APIs, or paid services — this skill is fully offline
  • ALWAYS prefer pdftk over ghostscript for PDF operations when both are available (simpler, safer syntax)
  • ALWAYS specify output directory explicitly to avoid writing files to unexpected locations

Example Usage

Convert PPTX to PDF:
```bash
libreoffice --headless --convert-to pdf presentation.pptx --outdir ./output/
```

Merge 3 PDFs:
```bash
pdftk report.pdf appendix.pdf cover.pdf cat output final.pdf
```

OCR a scanned image (Portuguese):
```bash
tesseract scan.png resultado -l por
```

PDF pages to PNG images:
```bash
convert -density 150 document.pdf page_%03d.png
```

Encrypt a PDF:
```bash
pdftk sensitive.pdf output protected.pdf user_pw "secret123"
```