DOCX Creation, Editing, and Analysis

Overview

A .docx file is essentially a ZIP archive containing XML files and other resources. Choose a workflow based on the type of task.
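
Because a .docx file is a ZIP archive, its parts can be inspected with nothing more than a ZIP reader. The following is a minimal sketch using Python's standard zipfile module; the helper names list_docx_parts and read_docx_part are hypothetical and not part of any tool mentioned in this guide.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of all parts stored inside a .docx archive."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def read_docx_part(path, part_name):
    """Return the raw bytes of one part, for example 'word/document.xml'."""
    with zipfile.ZipFile(path) as archive:
        return archive.read(part_name)
```

Calling list_docx_parts on a typical Word file should show entries such as word/document.xml alongside the package-level [Content_Types].xml.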

Workflow Selection

  • Read or analyze only → Text extraction or raw XML access
  • Create a new document → "Create New Word Document" workflow
  • Edit an existing document:
    • Your own document + simple changes → Basic OOXML editing
    • Someone else's document, or a legal, academic, business, or government document → Redlining (Track Changes) workflow (recommended or required)
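
The decision list above can be written down as a small lookup function. This is only an illustrative sketch: choose_workflow and its parameters are hypothetical, and routing a complex change to your own document through redlining is an assumption, since the list does not cover that case.

```python
def choose_workflow(task, own_document=True, simple_change=True):
    """Map a task description to the workflow named in the list above.

    task is one of "read", "create", or "edit" (hypothetical labels).
    """
    if task == "read":
        return "text extraction or raw XML access"
    if task == "create":
        return "create new Word document"
    if task == "edit":
        if own_document and simple_change:
            return "basic OOXML editing"
        # Assumption: anything else, including complex edits to your own
        # document, goes through the redlining workflow.
        return "redlining (track changes)"
    raise ValueError(f"unknown task: {task!r}")
```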

Reading and Analysis

Text Extraction

Convert the document to Markdown with pandoc, which can preserve tracked changes:
bash
pandoc --track-changes=all path-to-file.docx -o output.md

Options: --track-changes=accept/reject/all
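
When the conversion is driven from a script, the same command can be assembled and run from Python. This is a sketch under the assumption that pandoc is on the PATH; build_pandoc_command and docx_to_markdown are hypothetical helper names.

```python
import subprocess

TRACK_CHANGES_MODES = ("accept", "reject", "all")


def build_pandoc_command(docx_path, output_path, track_changes="all"):
    """Build the pandoc argument list shown above for a given mode."""
    if track_changes not in TRACK_CHANGES_MODES:
        raise ValueError(f"track_changes must be one of {TRACK_CHANGES_MODES}")
    return [
        "pandoc",
        f"--track-changes={track_changes}",
        str(docx_path),
        "-o",
        str(output_path),
    ]


def docx_to_markdown(docx_path, output_path, track_changes="all"):
    """Run pandoc and raise CalledProcessError if the conversion fails."""
    command = build_pandoc_command(docx_path, output_path, track_changes)
    subprocess.run(command, check=True)
```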


Raw XML

Comments, complex formatting, document structure, embedded media, and metadata can only be read by unpacking the file and inspecting the XML.
Unpack:
python ooxml/scripts/unpack.py <office_file> <output_directory>

Key paths: word/document.xml (main document content), word/comments.xml (comments), and word/media/ (embedded media). Tracked changes are stored as <w:ins> (insertions) and <w:del> (deletions) elements.
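
To illustrate how tracked changes look in word/document.xml, the sketch below collects the text of every <w:ins> and <w:del> element. It uses the standard library's ElementTree for brevity; for untrusted files, defusedxml (listed under Dependencies) offers a compatible fromstring. The function name collect_revisions is hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def collect_revisions(document_xml):
    """Return (inserted_texts, deleted_texts) found in a document.xml string."""
    root = ET.fromstring(document_xml)
    inserted = []
    for ins in root.iter(f"{{{W_NS}}}ins"):
        # Inserted runs keep their text in ordinary <w:t> elements.
        text = "".join(t.text or "" for t in ins.iterfind(".//w:t", NS))
        if text:  # skip empty markers such as inserted paragraph marks
            inserted.append(text)
    deleted = []
    for deletion in root.iter(f"{{{W_NS}}}del"):
        # Deleted runs keep their text in <w:delText> elements.
        text = "".join(t.text or "" for t in deletion.iterfind(".//w:delText", NS))
        if text:
            deleted.append(text)
    return inserted, deleted
```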

Create New Document

Use docx-js (JavaScript/TypeScript). Read docx-js.md in full first, then build the document from Document, Paragraph, and TextRun components and export the .docx with Packer.toBuffer().

Edit Existing Document

Use the Document library (a Python library that operates on OOXML). Workflow:
  1. Read ooxml.md in full
  2. Unpack:
    unpack.py <office_file> <output_directory>
  3. Write a script that edits the unpacked XML with the Document library
  4. Pack:
    pack.py <input_directory> <office_file>
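
The unpack.py and pack.py scripts are not reproduced in this guide. As a rough stand-in for what the unpack and pack steps accomplish, the sketch below round-trips an archive with the standard zipfile module; the real scripts may well do more (for example, reformatting or validating the XML), and the function names here are hypothetical.

```python
import zipfile
from pathlib import Path


def unpack_office_file(office_file, output_directory):
    """Extract every part of the archive into output_directory."""
    out = Path(output_directory)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(office_file) as archive:
        archive.extractall(out)


def pack_office_file(input_directory, office_file):
    """Zip the directory contents back into a single Office file."""
    root = Path(input_directory)
    with zipfile.ZipFile(office_file, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Part names inside the archive always use forward slashes.
                archive.write(path, path.relative_to(root).as_posix())
```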

Redlining (Track Changes) Workflow

  1. Get a Markdown representation:
    pandoc --track-changes=all ... -o current.md
  2. Identify and batch the changes: group them by section, type, or difficulty, with roughly 3 to 10 changes per batch.
  3. Unpack the document, read ooxml.md, and use RSIDs as it recommends.
  4. Implement each batch: use grep to locate the target text in word/document.xml, apply the change with get_node and related helpers, then call doc.save().
  5. Pack: generate the .docx with pack.py.
  6. Verify: convert to Markdown again with pandoc --track-changes=all, then use grep to confirm that every intended change is present and that no unintended changes were introduced.
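
The verification in step 6 can also be scripted instead of done by hand with grep. The sketch below assumes that pandoc's Markdown output marks tracked changes as bracketed spans such as [text]{.insertion author="..." date="..."}; find_revisions and check_expected are hypothetical helpers.

```python
import re

# Assumption: tracked changes appear in the Markdown as
# [text]{.insertion ...} and [text]{.deletion ...} spans.
REVISION_SPAN = re.compile(r"\[([^\]]*)\]\{\.(insertion|deletion)\b[^}]*\}")


def find_revisions(markdown_text):
    """Return a list of (kind, text) pairs in document order."""
    return [(kind, text) for text, kind in REVISION_SPAN.findall(markdown_text)]


def check_expected(markdown_text, expected):
    """Compare found revisions with the intended ones.

    Returns (missing, unexpected): intended changes that were not found,
    and found changes that were not intended.
    """
    found = find_revisions(markdown_text)
    missing = [change for change in expected if change not in found]
    unexpected = [change for change in found if change not in expected]
    return missing, unexpected
```

Both returned lists being empty corresponds to the "complete, with no extra changes" outcome that step 6 asks for.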
Principle: mark only the text that actually changed; for unchanged text, reuse the original <w:r> elements and their RSIDs.
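
One way to apply this principle is to split an edit into an unchanged prefix, the deleted text, the inserted text, and an unchanged suffix, so that only the middle pieces are wrapped in <w:del> and <w:ins>. The sketch below does this at character level; split_minimal_change is a hypothetical helper, and a word-level split may read better in practice.

```python
def split_minimal_change(old_text, new_text):
    """Return (prefix, deleted, inserted, suffix) for a single edit.

    prefix and suffix are shared by both strings and can stay in their
    original runs; only deleted and inserted need revision markup.
    """
    limit = min(len(old_text), len(new_text))
    prefix_len = 0
    while prefix_len < limit and old_text[prefix_len] == new_text[prefix_len]:
        prefix_len += 1

    # The suffix must not overlap the prefix in either string.
    limit -= prefix_len
    suffix_len = 0
    while (suffix_len < limit
           and old_text[-1 - suffix_len] == new_text[-1 - suffix_len]):
        suffix_len += 1

    end_old = len(old_text) - suffix_len
    end_new = len(new_text) - suffix_len
    return (
        old_text[:prefix_len],
        old_text[prefix_len:end_old],
        new_text[prefix_len:end_new],
        old_text[end_old:],
    )
```

For example, changing "The term is 30 days." to "The term is 60 days." yields a deletion of "3" and an insertion of "6", with everything else left untouched.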

Convert to Images

bash
soffice --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page

Use pdftoppm's -f and -l options to restrict the page range.
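
The two commands, including the optional page range, can be assembled from Python as follows. This is a sketch that only builds the argument lists; build_image_commands is a hypothetical helper, and pdf_path must match the file name that soffice actually writes.

```python
import subprocess


def build_image_commands(docx_path, pdf_path, prefix="page", dpi=150,
                         first_page=None, last_page=None):
    """Return (to_pdf, to_jpeg) argument lists for the commands above."""
    to_pdf = ["soffice", "--headless", "--convert-to", "pdf", str(docx_path)]
    to_jpeg = ["pdftoppm", "-jpeg", "-r", str(dpi)]
    if first_page is not None:
        to_jpeg += ["-f", str(first_page)]  # first page to render
    if last_page is not None:
        to_jpeg += ["-l", str(last_page)]   # last page to render
    to_jpeg += [str(pdf_path), prefix]
    return to_pdf, to_jpeg


def convert_to_images(docx_path, pdf_path, **options):
    """Run both steps in order, stopping if either command fails."""
    for command in build_image_commands(docx_path, pdf_path, **options):
        subprocess.run(command, check=True)
```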


Dependencies

pandoc, docx (npm), LibreOffice, poppler-utils, defusedxml (pip).
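
A quick way to confirm that the command-line dependencies are available is to look them up on the PATH. The sketch below assumes the executables are named pandoc, soffice (LibreOffice), and pdftoppm (poppler-utils), as used in the commands in this guide; missing_commands is a hypothetical helper, and the npm and pip packages are not checked.

```python
import shutil

REQUIRED_COMMANDS = ("pandoc", "soffice", "pdftoppm")


def missing_commands(commands=REQUIRED_COMMANDS, which=shutil.which):
    """Return the commands that cannot be found on the PATH."""
    return [name for name in commands if which(name) is None]
```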