document-converter
Purpose
Convert documents between formats and perform PDF operations using local, free, offline tools. No API key, no cost, no internet required. Supports Office formats, PDF manipulation, image-to-PDF, and OCR using LibreOffice, ghostscript, pdftk, tesseract, and imagemagick.
When to Use
- Converting Office documents (DOCX, PPTX, XLSX, ODP, ODT) to PDF or HTML
- Converting PDF pages to images (PNG, JPG, TIFF) or images to PDF
- PDF operations: merge multiple PDFs, split by page range, rotate pages, encrypt or decrypt
- OCR: extract searchable text from scanned PDFs or image files
- Any document format conversion that does not involve video or audio
Do NOT use for:
- Video or audio format conversion (use a dedicated media skill)
- Converting code between programming languages
- Simple markdown → HTML (use pandoc directly)
- Structured text extraction from PDFs (use `docling-converter` instead)
Step 0: Discovery — Detect Available Tools
Before any conversion, detect which tools are installed and identify the operating system:
```bash
# Detect OS
uname -s  # Darwin = macOS, Linux = Linux; on Windows use 'ver' or check $OS

# Check each tool
libreoffice --version 2>/dev/null || echo "NOT FOUND"
gs --version 2>/dev/null || echo "NOT FOUND"
pdftk --version 2>/dev/null || echo "NOT FOUND"
tesseract --version 2>/dev/null || echo "NOT FOUND"
convert -version 2>/dev/null || echo "NOT FOUND"  # ImageMagick
```
If a required tool is missing, show the install command for the user's OS before proceeding:
| Tool | macOS | Linux (apt) | Windows |
|------|-------|-------------|---------|
| LibreOffice | `brew install --cask libreoffice` | `sudo apt install libreoffice` | `winget install TheDocumentFoundation.LibreOffice` |
| Ghostscript | `brew install ghostscript` | `sudo apt install ghostscript` | `winget install ArtifexSoftware.GhostScript` |
| pdftk | `brew install pdftk-java` | `sudo apt install pdftk` | `winget install PDFTechnologies.PDFtk` |
| Tesseract | `brew install tesseract` | `sudo apt install tesseract-ocr` | `winget install UB-Mannheim.TesseractOCR` |
| ImageMagick | `brew install imagemagick` | `sudo apt install imagemagick` | `winget install ImageMagick.ImageMagick` |
If the user is on Windows and Microsoft Office is installed, note that LibreOffice and Office produce equivalent results for DOCX/PPTX/XLSX; Office via COM automation is an advanced alternative.
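The detection and install-hint steps above could be combined into a small helper. This is a minimal sketch rather than part of the skill itself: the function names `have_tool`, `install_hint`, and `report_missing` are hypothetical, the install commands are copied from the table above, and only macOS and apt-based Linux are covered.

```bash
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of the skill.

# Return 0 if the given command is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Print the install command for a tool ($1) on an OS ($2, the value of `uname -s`).
# The commands are taken from the install table; Windows is not handled here.
install_hint() {
  case "$2:$1" in
    Darwin:libreoffice) echo "brew install --cask libreoffice" ;;
    Darwin:gs)          echo "brew install ghostscript" ;;
    Darwin:pdftk)       echo "brew install pdftk-java" ;;
    Darwin:tesseract)   echo "brew install tesseract" ;;
    Darwin:convert)     echo "brew install imagemagick" ;;
    Linux:libreoffice)  echo "sudo apt install libreoffice" ;;
    Linux:gs)           echo "sudo apt install ghostscript" ;;
    Linux:pdftk)        echo "sudo apt install pdftk" ;;
    Linux:tesseract)    echo "sudo apt install tesseract-ocr" ;;
    Linux:convert)      echo "sudo apt install imagemagick" ;;
    *) echo "no install hint for $1 on $2"; return 1 ;;
  esac
}

# Report every missing tool together with its install hint.
report_missing() {
  os=$(uname -s)
  for tool in libreoffice gs pdftk tesseract convert; do
    have_tool "$tool" || echo "$tool: NOT FOUND ($(install_hint "$tool" "$os"))"
  done
}
```

Calling `report_missing` prints one line per missing tool and nothing when all five are installed.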
Workflow
Office Documents → PDF (most common)
Use LibreOffice headless mode. Works for DOCX, PPTX, XLSX, ODP, ODT, and any format LibreOffice opens.
```bash
# Convert a single file to PDF (written to --outdir)
libreoffice --headless --convert-to pdf "/path/to/file.pptx" --outdir "/path/to/output/"

# Convert to HTML
libreoffice --headless --convert-to html "/path/to/file.docx" --outdir "/path/to/output/"

# Batch: convert all PPTX in a directory
libreoffice --headless --convert-to pdf /path/to/folder/*.pptx --outdir /path/to/output/
```
Notes:
- Output file is placed in `--outdir` with the same base name and new extension
- For batch, run one `libreoffice` process — do NOT spawn multiple instances in parallel (LibreOffice uses a single user profile lock)
- If LibreOffice is open as a GUI app, close it first or use the `--norestore` flag
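The output-naming note above can be turned into a check that a conversion actually produced a file. The following is a minimal sketch under stated assumptions: `expected_output` and `office_to_pdf` are hypothetical names, and the sketch checks for the output file instead of relying on the exit status of `libreoffice` alone.

```bash
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of the skill.

# Print the path LibreOffice is expected to write:
#   <outdir>/<input base name>.<new extension>
expected_output() {
  base=$(basename "$1")     # drop the directory part
  base=${base%.*}           # drop the last extension
  echo "${2%/}/$base.$3"    # tolerate a trailing slash on the outdir
}

# Convert one file to PDF and confirm that the expected output exists.
# Prints the output path on success, returns non-zero otherwise.
office_to_pdf() {
  src=$1
  outdir=$2
  mkdir -p "$outdir" || return 1
  libreoffice --headless --convert-to pdf "$src" --outdir "$outdir" || return 1
  out=$(expected_output "$src" "$outdir" pdf)
  [ -f "$out" ] && echo "$out"
}

# Sequential use, one LibreOffice process at a time:
#   for f in /path/to/folder/*.pptx; do office_to_pdf "$f" /path/to/output/; done
```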
PDF → Images / Images → PDF
Use ImageMagick for image-PDF interchange:
```bash
# PDF pages → PNG (one file per page)
convert -density 150 "/path/to/doc.pdf" "/path/to/output/page_%03d.png"

# PDF pages → JPG with quality control
convert -density 150 "/path/to/doc.pdf" -quality 85 "/path/to/output/page_%03d.jpg"

# Single image → PDF
convert "/path/to/image.png" "/path/to/output/document.pdf"

# Multiple images → single PDF
convert img1.png img2.jpg img3.tiff "/path/to/output/combined.pdf"
```
Note: On some systems ImageMagick's PDF support requires ghostscript. If conversion fails with a policy error, check `/etc/ImageMagick-*/policy.xml` and ensure PDF is not restricted.
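The policy check mentioned in the note can be scripted. This is a rough sketch, not a full XML parser: `pdf_blocked_by_policy` is a hypothetical name, and it assumes the restriction is written on a single line in the form commonly shipped on Debian and Ubuntu, `<policy domain="coder" rights="none" pattern="PDF" />`. A rule that is commented out on one line would still be reported.

```bash
#!/bin/sh
# Hypothetical helper; the name is illustrative and not part of the skill.

# Return 0 if the given policy file has a line that denies the PDF coder.
# Assumes pattern="PDF" and rights="none" appear on the same line.
pdf_blocked_by_policy() {
  grep 'pattern="PDF"' "$1" 2>/dev/null | grep -q 'rights="none"'
}

# Check every policy file that matches the path from the note.
for policy in /etc/ImageMagick-*/policy.xml; do
  [ -f "$policy" ] || continue   # the glob stays literal when nothing matches
  if pdf_blocked_by_policy "$policy"; then
    echo "PDF is restricted in $policy"
  fi
done
```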
PDF Operations: Merge, Split, Rotate
Prefer `pdftk` when available (simpler syntax). Fall back to `ghostscript` if pdftk is not installed.

**Merge PDFs:**
```bash
# pdftk (preferred)
pdftk file1.pdf file2.pdf file3.pdf cat output merged.pdf

# ghostscript (fallback)
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=merged.pdf file1.pdf file2.pdf file3.pdf
```
**Split PDF by page range:**
```bash
# pdftk: extract pages 1-3 and 5
pdftk input.pdf cat 1-3 5 output extracted.pdf

# ghostscript: extract pages 2 to 5
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dFirstPage=2 -dLastPage=5 -sOutputFile=extracted.pdf input.pdf
```
**Rotate pages:**
```bash
# pdftk: rotate all pages 90° clockwise
pdftk input.pdf rotate 1-endeast output rotated.pdf

# pdftk: rotate specific page (page 3) 180°
pdftk input.pdf rotate 3south output rotated.pdf
```
**Encrypt PDF (add password):**
```bash
pdftk input.pdf output secured.pdf user_pw "userpass" owner_pw "ownerpass"
```
**Decrypt PDF (remove password):**
```bash
pdftk secured.pdf input_pw "password" output decrypted.pdf
```
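The "prefer pdftk, fall back to ghostscript" rule for merging could be wrapped in one function. This is a minimal sketch: `merge_pdfs` is a hypothetical name, and the two commands inside it are the merge commands shown earlier in this section.

```bash
#!/bin/sh
# Hypothetical helper; the name is illustrative and not part of the skill.

# Merge PDFs: merge_pdfs OUTPUT INPUT...
# Uses pdftk when it is on PATH, otherwise falls back to ghostscript.
merge_pdfs() {
  if [ "$#" -lt 2 ]; then
    echo "usage: merge_pdfs OUTPUT INPUT..." >&2
    return 2
  fi
  out=$1
  shift
  if command -v pdftk >/dev/null 2>&1; then
    pdftk "$@" cat output "$out"
  elif command -v gs >/dev/null 2>&1; then
    gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile="$out" "$@"
  else
    echo "neither pdftk nor gs found" >&2
    return 1
  fi
}
```

For example, `merge_pdfs merged.pdf file1.pdf file2.pdf file3.pdf` writes `merged.pdf` with whichever tool is installed.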
OCR: Extract Text from Scanned Documents
Use Tesseract. Input must be an image (PNG, JPG, TIFF) or a PDF that ImageMagick can rasterize first.
```bash
# OCR a single image → searchable text file
tesseract "/path/to/scan.png" "/path/to/output/result"
# Output: result.txt

# OCR with language specification (Portuguese example)
tesseract "/path/to/scan.png" output -l por

# OCR → searchable PDF (requires tesseract with pdf support)
tesseract "/path/to/scan.png" output -l eng pdf
# Output: output.pdf

# OCR a scanned PDF: rasterize first, then OCR
convert -density 300 "scan.pdf" "scan_page_%03d.tiff"
tesseract "scan_page_000.tiff" output -l eng pdf
```
For multi-page scanned PDFs:
```bash
# Rasterize all pages
convert -density 300 "scan.pdf" "page_%03d.tiff"

# OCR each page and merge results
for f in page_*.tiff; do
  tesseract "$f" "${f%.tiff}" -l eng pdf
done
pdftk page_*.pdf cat output final_ocr.pdf
```
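The multi-page steps can also be run inside a temporary directory so that intermediate TIFF and PDF files do not accumulate next to the input. This is a sketch under stated assumptions: `ocr_pdf` is a hypothetical name, and it relies on `convert`, `tesseract`, and `pdftk` behaving as in the commands above.

```bash
#!/bin/sh
# Hypothetical helper; the name is illustrative and not part of the skill.

# OCR a scanned PDF into a searchable PDF: ocr_pdf INPUT.pdf OUTPUT.pdf [LANG]
ocr_pdf() {
  src=$1
  dst=$2
  lang=${3:-eng}
  work=$(mktemp -d) || return 1

  # 1. Rasterize every page at 300 DPI (page_000.tiff, page_001.tiff, ...).
  convert -density 300 "$src" "$work/page_%03d.tiff" || { rm -rf "$work"; return 1; }

  # 2. OCR each page; tesseract appends .pdf to the output base name.
  for page in "$work"/page_*.tiff; do
    tesseract "$page" "${page%.tiff}" -l "$lang" pdf || { rm -rf "$work"; return 1; }
  done

  # 3. Merge the per-page PDFs; zero-padded names keep the glob in page order.
  if pdftk "$work"/page_*.pdf cat output "$dst"; then
    status=0
  else
    status=1
  fi
  rm -rf "$work"
  return "$status"
}
```

A call such as `ocr_pdf scan.pdf final_ocr.pdf por` would run the Portuguese language data, provided it is installed for Tesseract.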
Tool Routing Decision Table
| Task | Primary tool | Fallback |
|---|---|---|
| Office → PDF | LibreOffice | None (LibreOffice is the standard) |
| Office → HTML | LibreOffice | None |
| PDF → images | ImageMagick | ghostscript |
| Images → PDF | ImageMagick | ghostscript |
| Merge PDFs | pdftk | ghostscript |
| Split PDF | pdftk | ghostscript |
| Rotate PDF | pdftk | ghostscript |
| Encrypt PDF | pdftk | None |
| Decrypt PDF | pdftk | None |
| OCR | tesseract | None |
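The routing table can be expressed as a lookup function. This is only a sketch: `route_tool` and the task keywords such as `office-to-pdf` are hypothetical labels chosen for illustration, and `none` stands for rows where the table lists no fallback.

```bash
#!/bin/sh
# Hypothetical helper; the name and task keywords are illustrative.

# Print "<primary> <fallback>" for a task, mirroring the routing table.
route_tool() {
  case "$1" in
    office-to-pdf|office-to-html) echo "libreoffice none" ;;
    pdf-to-images|images-to-pdf)  echo "imagemagick ghostscript" ;;
    merge|split|rotate)           echo "pdftk ghostscript" ;;
    encrypt|decrypt)              echo "pdftk none" ;;
    ocr)                          echo "tesseract none" ;;
    *) echo "unknown task: $1" >&2; return 1 ;;
  esac
}
```

For instance, `route_tool merge` prints `pdftk ghostscript`.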
Comparison: When to Use document-converter vs Alternatives
| Tool | Best for |
|---|---|
| `document-converter` | Office → PDF, PDF operations (merge/split/rotate/encrypt), OCR, image ↔ PDF |
| `docling-converter` | PDF/Office → structured Markdown or JSON; layout-aware content extraction |
| `pandoc` | Markdown ↔ HTML ↔ LaTeX ↔ DOCX; lightweight text format conversions |
| | Translating PowerPoint files between languages |
Critical Rules
- NEVER start a batch LibreOffice conversion with multiple parallel processes — LibreOffice uses a single user-profile lock and parallel instances will crash
- ALWAYS run Step 0 Discovery to confirm the required tool is installed before invoking it
- ALWAYS suggest the correct install command for the user's OS when a tool is missing
- NEVER use cloudconvert, external APIs, or paid services — this skill is fully offline
- ALWAYS prefer pdftk over ghostscript for PDF operations when both are available (simpler, safer syntax)
- ALWAYS specify output directory explicitly to avoid writing files to unexpected locations
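One way to guard against accidental parallel LibreOffice batches is a directory-based lock, since `mkdir` fails when the directory already exists. This is a sketch of that general technique, not something the skill prescribes: `acquire_lock`, `release_lock`, `run_exclusive`, and the lock path are hypothetical.

```bash
#!/bin/sh
# Hypothetical helpers; the names and the lock path are illustrative.

# Take the lock by creating a directory; only one caller can succeed.
acquire_lock() {
  mkdir "$1" 2>/dev/null
}

# Give the lock back by removing the directory.
release_lock() {
  rmdir "$1"
}

# Run a command while holding the lock: run_exclusive LOCKDIR COMMAND [ARGS...]
run_exclusive() {
  lockdir=$1
  shift
  if ! acquire_lock "$lockdir"; then
    echo "another conversion is already running ($lockdir exists)" >&2
    return 1
  fi
  if "$@"; then
    status=0
  else
    status=$?
  fi
  release_lock "$lockdir"
  return "$status"
}

# Example (paths are placeholders):
#   run_exclusive /tmp/lo-convert.lock \
#     libreoffice --headless --convert-to pdf /path/to/folder/*.pptx --outdir /path/to/output/
```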
Example Usage
**Convert PPTX to PDF:**
```bash
libreoffice --headless --convert-to pdf presentation.pptx --outdir ./output/
```
**Merge 3 PDFs:**
```bash
pdftk report.pdf appendix.pdf cover.pdf cat output final.pdf
```
**OCR a scanned image (Portuguese):**
```bash
tesseract scan.png resultado -l por
```
**PDF pages to PNG images:**
```bash
convert -density 150 document.pdf page_%03d.png
```
**Encrypt a PDF:**
```bash
pdftk sensitive.pdf output protected.pdf user_pw "secret123"
```