document-converter-suite

Document Converter Suite


Overview


Provide a best-effort conversion workflow between 8 document formats:

Office Formats: PDF, Word (DOCX), PowerPoint (PPTX), Excel (XLSX)
Text Formats: Plain Text (TXT), CSV, Markdown (MD), HTML

Uses pypdf, python-docx, python-pptx, openpyxl, reportlab, mistune, beautifulsoup4, and Pillow.

Prefer reliable extraction + rebuild (text, headings, bullets, basic tables) over pixel-perfect layout.


When to use


Use when the request involves:

Converting a file between .pdf / .docx / .pptx / .xlsx / .txt / .csv / .md / .html
Making a document more editable by moving its content into Office or text formats
Exporting slide text or spreadsheet cell grids to a different format
Converting Markdown/HTML documentation to Office formats or vice versa
Extracting tables from Office documents to CSV/XLSX
Batch-converting a folder of mixed documents

Supported conversion paths: 64 total (8×8 matrix) - see references/conversion_matrix.md

Avoid promising visual fidelity. Emphasize that output is clean and structured, not identical.


Workflow decision tree


Identify input and desired output (extensions matter).
Classify the user's goal:
- Editable content → proceed with this suite.
- Visually identical rendering → explain limitations; suggest external rendering tools.
Pick conversion mode:
- Single file → run `scripts/convert.py`.
- Folder/batch → run `scripts/batch_convert.py`.
Tune safety caps if needed:
- PDF: `--max-pages`, `--max-chars`
- XLSX: `--max-rows`, `--max-cols`
Run conversion, then sanity-check output size and structure.
Iterate (e.g., increase max rows/cols, split large docs, or choose a different target format).
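
The mode-selection step above can be sketched in a few lines. This is only an illustration: `build_command` and its `caps` argument are hypothetical names, not part of the bundled scripts; the script paths and flag names are the ones listed above.

```python
from pathlib import Path


def build_command(input_path, target, caps=None):
    """Return an argv list for the bundled script that fits the input (hypothetical helper)."""
    path = Path(input_path)
    # A directory goes to the batch converter; anything else to the single-file converter.
    script = "scripts/batch_convert.py" if path.is_dir() else "scripts/convert.py"
    cmd = ["python", script, str(path), "--to", target]
    # Optional safety caps, e.g. {"max-rows": 40, "max-cols": 12}.
    for flag, value in (caps or {}).items():
        cmd += [f"--{flag}", str(value)]
    return cmd
```

The returned list could be handed to `subprocess.run`, which keeps file names with spaces intact without shell quoting.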


Quick start


Single-file conversion


Run:

bash

python scripts/convert.py <input-file> --to <pdf|docx|pptx|xlsx|txt|csv|md|html>

Examples:

bash

# Office format conversions
python scripts/convert.py report.pdf --to docx
python scripts/convert.py deck.pptx --to pdf --out deck_export.pdf
python scripts/convert.py data.xlsx --to pptx --max-rows 40 --max-cols 12

# Text format conversions
python scripts/convert.py documentation.md --to docx
python scripts/convert.py data.csv --to xlsx
python scripts/convert.py report.docx --to html
python scripts/convert.py notes.txt --to md

Batch conversion


Run:

bash

python scripts/batch_convert.py <input-dir> --to <pdf|docx|pptx|xlsx|txt|csv|md|html>

Examples:

bash

python scripts/batch_convert.py ./inbox --to docx --recursive
python scripts/batch_convert.py ./inbox --to pdf --outdir ./out --recursive --overwrite
python scripts/batch_convert.py ./markdown-docs --to html --pattern "*.md"
python scripts/batch_convert.py ./data --to xlsx --pattern "*.csv"
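
To make the `--pattern`, `--recursive`, and `--outdir` options more concrete, here is a rough sketch of how a batch run might select inputs and name outputs. It is an assumption about the behavior, not the actual logic of `scripts/batch_convert.py`; `plan_batch` is a hypothetical helper, and mirroring subfolders under the output directory is a guess.

```python
from pathlib import Path


def plan_batch(input_dir, target, pattern="*", recursive=False, outdir=None):
    """Return (source, destination) pairs for a batch run (hypothetical helper)."""
    root = Path(input_dir)
    # rglob descends into subfolders; glob only looks at the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    out_root = Path(outdir) if outdir else root
    plan = []
    for src in sorted(p for p in matches if p.is_file()):
        # Assumed layout: keep the relative subfolder and swap the extension.
        rel = src.relative_to(root).with_suffix(f".{target}")
        plan.append((src, out_root / rel))
    return plan
```

Printing the plan before converting is a cheap way to confirm that the pattern matches the intended files.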


Conversion behavior


Follow these defaults (and say them out loud if the user might be expecting magic):


Office Format Conversions


PDF → (DOCX/PPTX/XLSX/TXT/MD/HTML): extract text with `pypdf`; no OCR; each page becomes a section/slide block.
DOCX → (PDF/PPTX/XLSX/TXT/CSV/MD/HTML): export paragraphs, headings, and tables.
- Improved heading detection: uses font size + bold and ALL CAPS + bold heuristics in addition to style names.
PPTX → (DOCX/PDF/XLSX/TXT/CSV/MD/HTML): export slide titles + text frames; export tables.
- Multi-table support: PPTX now creates one slide per table when multiple tables exist.
XLSX → (DOCX/PPTX/PDF/TXT/CSV/MD/HTML): export a bounded value grid per sheet (defaults: 200 rows × 50 columns).
- Truncation warnings: printed to stderr when data exceeds limits (e.g., "Sheet 'Data': Truncated 500 rows → 200 rows").
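
The bounded-grid behavior for XLSX can be illustrated without any spreadsheet library. The sketch below works on plain lists of rows; `bound_grid` is a hypothetical name, and the wording of the column warning is an assumption modeled on the row warning quoted above.

```python
import sys


def bound_grid(sheet_name, rows, max_rows=200, max_cols=50):
    """Clip a grid of cell values to the caps and warn on stderr (hypothetical helper)."""
    total_rows = len(rows)
    total_cols = max((len(r) for r in rows), default=0)
    if total_rows > max_rows:
        print(f"Sheet '{sheet_name}': Truncated {total_rows} rows → {max_rows} rows",
              file=sys.stderr)
    if total_cols > max_cols:
        # Assumed message format, mirroring the row warning.
        print(f"Sheet '{sheet_name}': Truncated {total_cols} cols → {max_cols} cols",
              file=sys.stderr)
    # Slicing never raises, so short rows and small sheets pass through unchanged.
    return [list(r[:max_cols]) for r in rows[:max_rows]]
```

Warnings go to stderr so that piping the converted output elsewhere does not mix diagnostics into the data.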


Text Format Conversions


TXT → (DOCX/PPTX/XLSX/PDF/CSV/MD/HTML): lines become paragraphs/bullets; simple structure preservation.
CSV → (XLSX/DOCX/PPTX/HTML): headers + rows mapped to tables/sheets; auto-delimiter detection.
MD → (DOCX/PPTX/XLSX/PDF/TXT/CSV/HTML): parsed with `mistune`; headings, lists, tables, code blocks preserved.
- High fidelity: Markdown ↔ HTML and Markdown ↔ DOCX maintain structure well.
HTML → (DOCX/PPTX/XLSX/PDF/TXT/CSV/MD): parsed with `beautifulsoup4`; semantic structure extracted.
- High fidelity: HTML ↔ Markdown and HTML ↔ DOCX maintain structure well.
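
Auto-delimiter detection for CSV input can be approximated with the standard library's `csv.Sniffer`. Whether the suite uses `Sniffer` internally is not stated here, so treat this as one plausible approach; `read_csv_rows` and its candidate delimiter set are hypothetical.

```python
import csv
import io


def read_csv_rows(text, candidates=",;\t|"):
    """Parse CSV text after guessing its delimiter (hypothetical helper)."""
    sample = text[:4096]  # a prefix is enough for sniffing
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    except csv.Error:
        # Sniffing can fail on tiny or irregular samples; fall back to commas.
        dialect = csv.excel
    return list(csv.reader(io.StringIO(text), dialect))
```

The first returned row can then be treated as the header and the rest as data rows when building a table or sheet.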


Quality Improvements


Multi-table PPTX: Creates one slide per table (instead of dropping extra tables)
Smart heading detection: DOCX headings detected by style, font size+bold, or ALL CAPS+bold
Data truncation warnings: XLSX conversions warn when data is truncated
Image extraction foundation: `image_handler.py` provides hash-based deduplication for future image support
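
The smart heading detection rule listed above (style, font size + bold, or ALL CAPS + bold) might look roughly like the sketch below. It takes plain values rather than python-docx objects so it stays self-contained; `looks_like_heading` and the 11 pt body-size threshold are assumptions, not the suite's actual parameters.

```python
def looks_like_heading(text, style_name="", font_size_pt=None, bold=False,
                       body_size_pt=11.0):
    """Apply the three heading rules in order (hypothetical helper)."""
    text = text.strip()
    if not text:
        return False
    # Rule 1: an explicit heading style such as "Heading 1".
    if style_name.lower().startswith("heading"):
        return True
    # Rule 2: bold text set larger than the assumed body size.
    if bold and font_size_pt is not None and font_size_pt > body_size_pt:
        return True
    # Rule 3: bold text written entirely in capitals.
    if bold and text.isupper():
        return True
    return False
```

Requiring bold in rules 2 and 3 keeps large pull quotes and capitalized acronyms in body text from being promoted to headings.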

Load extra detail from:

`references/conversion_matrix.md` - Full 8×8 conversion matrix
`references/limitations.md` - Format-specific limitations and edge cases


Guardrails and honesty rules


State "best-effort" explicitly for any conversion request.
Do not claim formatting fidelity (fonts, spacing, images, charts, animations).
Call out scanned PDFs as a likely failure mode (no OCR).
For giant spreadsheets, prefer increasing caps gradually and/or limiting to specific sheets (if user provides intent).


Bundled scripts


`scripts/convert.py`: single-file CLI converter
`scripts/batch_convert.py`: batch converter for directories
`scripts/lib/*`: internal readers/writers and conversion orchestration
