docx

DOCX Creation, Editing, and Analysis

Overview

A .docx file is essentially a ZIP archive containing XML files and other resources. Choose a workflow based on the task type.
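
Because a .docx file is a ZIP archive, its parts can be listed with a standard ZIP reader. The following is a minimal sketch using Python's `zipfile` module; `list_docx_parts` and `make_sample_archive` are hypothetical helper names, and the sample archive is a tiny stand-in rather than a valid Word document.

```python
import io
import zipfile
from typing import List


def list_docx_parts(docx_bytes: bytes) -> List[str]:
    """Return the sorted names of the parts stored inside a .docx archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(archive.namelist())


def make_sample_archive() -> bytes:
    """Build a tiny stand-in archive (illustrative only, not a valid document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<w:document/>")
    return buffer.getvalue()


print(list_docx_parts(make_sample_archive()))
```

For a real file, the same function could be fed the bytes read from disk.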

Workflow Selection

Read/Analysis Only → Text extraction or raw XML access
Create New Document → "Create New Word Document" workflow
Edit Existing Document:
- Your own document + simple modifications → Basic OOXML editing
- Documents from others / legal, academic, business, government documents → Redlining (Track Changes) Workflow (recommended or required)
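
The decision list above can be sketched as a small function. This is only an illustration: `choose_workflow` and its parameters are hypothetical names, and routing an own-document edit that is not simple to redlining is an assumption, since the list does not cover that case.

```python
def choose_workflow(task: str, own_document: bool = True,
                    simple_change: bool = True, formal: bool = False) -> str:
    """Map a task description to one of the workflows named in the list.

    `formal` stands for legal, academic, business, or government documents.
    """
    if task == "read":
        return "text extraction or raw XML access"
    if task == "create":
        return "create new Word document"
    if task == "edit":
        if own_document and simple_change and not formal:
            return "basic OOXML editing"
        # Assumption: anything else goes through tracked changes.
        return "redlining"
    raise ValueError(f"unknown task: {task!r}")


print(choose_workflow("edit", own_document=False))
```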

Reading and Analysis

Text Extraction

Convert the document to Markdown with pandoc, which can preserve tracked changes:

```bash
pandoc --track-changes=all path-to-file.docx -o output.md
```

The `--track-changes` option accepts `accept`, `reject`, or `all`.
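
When the conversion is driven from a script, the command can be assembled and validated before it runs. The sketch below assumes pandoc is on the PATH; `pandoc_command` and `convert_to_markdown` are hypothetical helper names, not part of any library.

```python
import subprocess
from typing import List

VALID_MODES = ("accept", "reject", "all")


def pandoc_command(docx_path: str, output_path: str,
                   track_changes: str = "all") -> List[str]:
    """Build the pandoc argument list for a docx-to-Markdown conversion."""
    if track_changes not in VALID_MODES:
        raise ValueError(f"track_changes must be one of {VALID_MODES}")
    return ["pandoc", f"--track-changes={track_changes}",
            docx_path, "-o", output_path]


def convert_to_markdown(docx_path: str, output_path: str,
                        track_changes: str = "all") -> None:
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(docx_path, output_path, track_changes),
                   check=True)


print(" ".join(pandoc_command("path-to-file.docx", "output.md")))
```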

Raw XML

To access comments, complex formatting, document structure, embedded media, or metadata, unpack the file and read the raw XML.

Unpack with `python ooxml/scripts/unpack.py <office_file> <output_directory>`.

Key paths: `word/document.xml`, `word/comments.xml`, and `word/media/`. Tracked changes use `<w:ins>` and `<w:del>`.
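
As a rough illustration of reading tracked changes from the raw XML, the sketch below collects the text inside `<w:ins>` and `<w:del>` elements. It uses the standard library parser so that it is self-contained; for untrusted files, the `defusedxml` package from the dependency list may be the safer choice. `tracked_changes` and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def tracked_changes(document_xml: str) -> Tuple[List[str], List[str]]:
    """Return (inserted texts, deleted texts) found in a document.xml string."""
    root = ET.fromstring(document_xml)
    inserted = ["".join(t.text or "" for t in ins.iterfind(".//w:t", NS))
                for ins in root.iter(f"{{{W_NS}}}ins")]
    # Deleted runs store their text in w:delText rather than w:t.
    deleted = ["".join(t.text or "" for t in d.iterfind(".//w:delText", NS))
               for d in root.iter(f"{{{W_NS}}}del")]
    return inserted, deleted


# Illustrative fragment only; real files carry many more attributes.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">The term is </w:t></w:r>'
    '<w:del w:id="1" w:author="A"><w:r><w:delText>30</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="A"><w:r><w:t>60</w:t></w:r></w:ins>'
    '<w:r><w:t xml:space="preserve"> days.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

print(tracked_changes(SAMPLE))
```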

Create New Document

Use docx-js (JavaScript/TypeScript). Read docx-js.md in full first, then build the document from `Document`, `Paragraph`, and `TextRun` components and export the .docx file with `Packer.toBuffer()`.

Edit Existing Document

Use the Document library (Python, operates on OOXML). Workflow:

1. Read ooxml.md in full.
2. Unpack: `unpack.py <office_file> <output_directory>`
3. Write a script that makes the edits with the Document library.
4. Pack: `pack.py <input_directory> <office_file>`
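
To make the unpack, edit, pack round trip concrete, here is a rough stand-in built on Python's `zipfile` module. It is not the actual `unpack.py` or `pack.py`, whose internals are not described here, and the plain string replacement only stands in for edits that would normally go through the Document library.

```python
import tempfile
import zipfile
from pathlib import Path


def unpack(office_file: Path, output_directory: Path) -> None:
    """Extract every part of the archive into output_directory."""
    with zipfile.ZipFile(office_file) as archive:
        archive.extractall(output_directory)


def pack(input_directory: Path, office_file: Path) -> None:
    """Zip the directory contents back into a single archive."""
    with zipfile.ZipFile(office_file, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(input_directory.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(input_directory).as_posix())


def demo() -> str:
    """Round-trip a tiny stand-in archive and return the edited part."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        source = tmp_path / "in.docx"
        with zipfile.ZipFile(source, "w") as archive:
            archive.writestr("word/document.xml", "<w:t>old</w:t>")
        unpack(source, tmp_path / "unpacked")
        part = tmp_path / "unpacked" / "word" / "document.xml"
        # Placeholder edit; real edits would use the Document library.
        text = part.read_text(encoding="utf-8").replace("old", "new")
        part.write_text(text, encoding="utf-8")
        pack(tmp_path / "unpacked", tmp_path / "out.docx")
        with zipfile.ZipFile(tmp_path / "out.docx") as archive:
            return archive.read("word/document.xml").decode("utf-8")


print(demo())
```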

Redlining (Track Changes) Workflow

1. Markdown representation: `pandoc --track-changes=all ... -o current.md`
2. Identify and batch the changes: group them by section, type, or difficulty, with about 3 to 10 changes per batch.
3. Unpack the document, read ooxml.md, and use the suggested RSID.
4. Implement in batches: use `grep` to locate the text in `word/document.xml`, apply the changes with `get_node` and related methods, then call `doc.save()`.
5. Pack: generate the .docx with `pack.py`.
6. Verify: convert to Markdown again with `pandoc --track-changes=all`, then use `grep` to confirm that every change was applied and that nothing else changed.
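
The verification step can also be scripted instead of running `grep` by hand. The sketch below simply reports which expected snippets are missing from the converted Markdown; `missing_changes` is a hypothetical helper, and the sample text only approximates how pandoc renders tracked changes as bracketed spans, so real output should be checked before relying on exact matches.

```python
from typing import List


def missing_changes(markdown_text: str, expected_snippets: List[str]) -> List[str]:
    """Return the expected snippets that do not appear in the Markdown text."""
    return [snippet for snippet in expected_snippets
            if snippet not in markdown_text]


# Approximate sample; author and date values are made up for illustration.
SAMPLE_MD = (
    'The term is [30]{.deletion author="A" date="2024-01-01T00:00:00Z"}'
    '[60]{.insertion author="A" date="2024-01-01T00:00:00Z"} days.'
)

print(missing_changes(SAMPLE_MD, ["[60]{.insertion", "[30]{.deletion", "[90]{.insertion"]))
```

An empty result means every expected change was found; it does not by itself rule out extra, unintended changes, which still need a separate check.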

Principle: mark only the text that actually changed; reuse the original `<w:r>` elements and their RSIDs for the unchanged parts.
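
One way to find the smallest span worth marking is to strip the words that the old and new sentences share at the start and at the end. The sketch below works at word level; `minimal_change` is a hypothetical helper, and a real edit would map its four parts to an unchanged run, a `<w:del>`, a `<w:ins>`, and another unchanged run.

```python
from typing import Tuple


def minimal_change(old: str, new: str) -> Tuple[str, str, str, str]:
    """Split an edit into (unchanged prefix, deleted, inserted, unchanged suffix)."""
    old_words, new_words = old.split(), new.split()
    limit = min(len(old_words), len(new_words))
    # Count matching words from the start.
    start = 0
    while start < limit and old_words[start] == new_words[start]:
        start += 1
    # Count matching words from the end, without overlapping the prefix.
    end = 0
    while end < limit - start and old_words[-1 - end] == new_words[-1 - end]:
        end += 1
    prefix = old_words[:start]
    deleted = old_words[start:len(old_words) - end]
    inserted = new_words[start:len(new_words) - end]
    suffix = old_words[len(old_words) - end:]
    return tuple(" ".join(part) for part in (prefix, deleted, inserted, suffix))


print(minimal_change("The term is 30 days.", "The term is 60 days."))
```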

Convert to Images

Convert the document to PDF with LibreOffice, then render the PDF pages as JPEG images:

```bash
soffice --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page
```

Use `-f` and `-l` to set the first and last page to convert.
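
For scripted use, the two commands and the optional page range can be assembled as argument lists. This is a sketch under assumptions: `image_commands` is a hypothetical helper, and it assumes that `soffice` writes the PDF into the current working directory under the source file's base name.

```python
from pathlib import Path
from typing import List, Optional


def image_commands(docx_path: str, prefix: str = "page", dpi: int = 150,
                   first_page: Optional[int] = None,
                   last_page: Optional[int] = None) -> List[List[str]]:
    """Build the soffice and pdftoppm argument lists for a docx file."""
    # Assumption: the PDF lands in the current directory.
    pdf_path = Path(docx_path).with_suffix(".pdf").name
    to_pdf = ["soffice", "--headless", "--convert-to", "pdf", docx_path]
    to_images = ["pdftoppm", "-jpeg", "-r", str(dpi)]
    if first_page is not None:
        to_images += ["-f", str(first_page)]
    if last_page is not None:
        to_images += ["-l", str(last_page)]
    to_images += [pdf_path, prefix]
    return [to_pdf, to_images]


for command in image_commands("document.docx", first_page=2, last_page=3):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(..., check=True)` in order.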

Dependencies

pandoc, docx (npm), LibreOffice, poppler-utils, defusedxml (pip).
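
A quick availability check for the command-line tools and the Python package might look like the sketch below. `check_dependencies` is a hypothetical helper; the executable names `soffice` and `pdftoppm` are taken from the commands used in this document, and the npm `docx` package is not covered because it lives in a Node.js project rather than on the PATH.

```python
import importlib.util
import shutil
from typing import Dict


def check_dependencies() -> Dict[str, bool]:
    """Report which tools are on the PATH and whether defusedxml is importable."""
    status = {tool: shutil.which(tool) is not None
              for tool in ("pandoc", "soffice", "pdftoppm")}
    status["defusedxml"] = importlib.util.find_spec("defusedxml") is not None
    return status


for name, available in check_dependencies().items():
    print(f"{name}: {'found' if available else 'missing'}")
```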