docx
DOCX Creation, Editing, and Analysis
Overview
A .docx file is a ZIP archive containing XML files and other resources. Choose a workflow based on the type of task.
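Because a .docx is an ordinary ZIP archive, its parts can be listed with any ZIP reader. The following is a minimal sketch using Python's standard `zipfile` module; the helper name `list_docx_parts` is hypothetical and only serves as an illustration.

```python
import zipfile


def list_docx_parts(path):
    """Return the sorted names of all parts stored in a .docx (ZIP) archive."""
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())
```

Calling `list_docx_parts("path-to-file.docx")` on a Word document should show entries such as `word/document.xml` alongside any media and metadata parts.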
Workflow Selection
- Read/analyze only → text extraction or raw XML access
- Create a new document → "Create New Word Document" workflow
- Edit an existing document:
  - Your own document + simple changes → basic OOXML editing
  - Someone else's document, or legal, academic, business, or government documents → Redlining (Track Changes) workflow (recommended or required)
Reading and Analysis
Text Extraction
Convert to markdown with pandoc, which can preserve tracked changes:
bash
pandoc --track-changes=all path-to-file.docx -o output.md
Options: --track-changes=accept/reject/all
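When the conversion is driven from a script, the command can be assembled and validated before it runs. This is a sketch, not part of the original workflow: the function names `pandoc_command` and `convert_to_markdown` are hypothetical. In pandoc, `accept` applies the tracked changes, `reject` discards them, and `all` keeps insertions, deletions, and comments in the output.

```python
import subprocess

TRACK_CHANGES_MODES = ("accept", "reject", "all")


def pandoc_command(docx_path, output_path, track_changes="all"):
    """Build the pandoc argument list for converting a .docx to markdown."""
    if track_changes not in TRACK_CHANGES_MODES:
        raise ValueError(f"track_changes must be one of {TRACK_CHANGES_MODES}")
    return [
        "pandoc",
        f"--track-changes={track_changes}",
        str(docx_path),
        "-o",
        str(output_path),
    ]


def convert_to_markdown(docx_path, output_path, track_changes="all"):
    """Run pandoc; requires pandoc to be installed and on PATH."""
    subprocess.run(pandoc_command(docx_path, output_path, track_changes), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.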
Raw XML
Comments, complex formatting, structure, media, and metadata require unpacking the file and reading the XML.
Unpack:
python ooxml/scripts/unpack.py <office_file> <output_directory>
Key paths: word/document.xml, word/comments.xml, word/media/. Tracked changes use <w:ins> and <w:del>.
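To illustrate how tracked changes appear in `word/document.xml`, the sketch below reads that part directly from the archive and collects each `<w:ins>` and `<w:del>` element. It is an illustration under stated assumptions rather than the project's own tooling: `tracked_changes` is a hypothetical helper, and it uses the standard-library XML parser for self-containment.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tracked_changes(docx_path):
    """Return (kind, author, text) tuples for each <w:ins> and <w:del> in word/document.xml."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    changes = []
    for element in root.iter():
        if element.tag == f"{{{W_NS}}}ins":
            # Inserted text is stored in <w:t> elements.
            text = "".join(t.text or "" for t in element.iter(f"{{{W_NS}}}t"))
            changes.append(("ins", element.get(f"{{{W_NS}}}author"), text))
        elif element.tag == f"{{{W_NS}}}del":
            # Deleted text is stored in <w:delText> elements.
            text = "".join(t.text or "" for t in element.iter(f"{{{W_NS}}}delText"))
            changes.append(("del", element.get(f"{{{W_NS}}}author"), text))
    return changes
```

For files from untrusted sources, `defusedxml.ElementTree.fromstring` (defusedxml is listed under Dependencies) could be swapped in for `ET.fromstring`.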
Create New Document
Use docx-js (JavaScript/TypeScript). First read docx-js.md in full, then build the document with Document / Paragraph / TextRun and export the .docx using Packer.toBuffer().
Edit Existing Document
Use the Document library (Python, operates on OOXML). Workflow:
- Read ooxml.md in full
- Unpack: unpack.py <office_file> <output_directory>
- Write a script that makes the edits using the Document library
- Pack: pack.py <input_directory> <office_file>
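Conceptually, the unpack and pack steps bracket the edit with a ZIP extraction and a ZIP rebuild. The internals of unpack.py and pack.py are not shown here, so the following is only a stand-in written with the standard library, not those scripts; the functions `unpack` and `pack` below are assumptions that mirror the argument order above.

```python
import zipfile
from pathlib import Path


def unpack(office_file, output_directory):
    """Extract every part of the archive into output_directory."""
    with zipfile.ZipFile(office_file) as archive:
        archive.extractall(output_directory)


def pack(input_directory, office_file):
    """Zip every file under input_directory back into a single archive."""
    root = Path(input_directory)
    with zipfile.ZipFile(office_file, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Part names inside the archive use forward slashes.
                archive.write(path, path.relative_to(root).as_posix())
```

In practice the provided scripts should be preferred, since they may do additional work that this sketch does not attempt.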
Redlining (Track Changes) Workflow
- Markdown representation: pandoc --track-changes=all ... -o current.md
- Identify and batch the changes: group them by section, type, or difficulty, roughly 3 to 10 changes per batch.
- Unpack, read ooxml.md, and use RSIDs as recommended.
- Implement in batches: use grep to locate the text in word/document.xml, apply the changes with get_node etc., then call doc.save().
- Pack: generate the .docx with pack.py.
- Verify: convert to markdown again with pandoc --track-changes=all and use grep to check that every change is present and nothing else changed.
Principle: Mark only the text that actually changed; reuse the original <w:r> and RSIDs for unchanged parts.
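The principle of marking only what changed can be sketched as a word-level diff: unchanged segments would keep their original runs, while only the deleted and inserted segments would be wrapped in `<w:del>` and `<w:ins>`. This is an illustration using Python's `difflib`, not the Document library's API; `change_segments` is a hypothetical helper.

```python
import difflib
import re


def change_segments(old_text, new_text):
    """Split an edit into (kind, text) segments: 'equal', 'delete', or 'insert'."""
    # Tokenize into words and whitespace runs so unchanged spacing is preserved.
    old_tokens = re.findall(r"\s+|\S+", old_text)
    new_tokens = re.findall(r"\s+|\S+", new_text)
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)

    segments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            segments.append(("equal", "".join(old_tokens[i1:i2])))
        if tag in ("delete", "replace"):
            segments.append(("delete", "".join(old_tokens[i1:i2])))
        if tag in ("insert", "replace"):
            segments.append(("insert", "".join(new_tokens[j1:j2])))
    return segments


# Only "30" -> "60" is marked; the surrounding text stays untouched.
print(change_segments("The term is 30 days.", "The term is 60 days."))
# [('equal', 'The term is '), ('delete', '30'), ('insert', '60'), ('equal', ' days.')]
```

Replacing the whole sentence instead would mark unchanged words as edited, which makes the redline harder to review.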
Convert to Images
bash
soffice --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page
Use -f / -l to specify the page range.
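The two commands can also be assembled from a script, which makes the optional page range explicit. A minimal sketch, assuming the same tools and flags as above; the function names `pdf_command` and `page_image_command` are hypothetical.

```python
def pdf_command(docx_path):
    """Build the LibreOffice command that converts a .docx to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf", str(docx_path)]


def page_image_command(pdf_path, prefix="page", dpi=150, first=None, last=None):
    """Build the pdftoppm command that renders PDF pages to JPEG files."""
    command = ["pdftoppm", "-jpeg", "-r", str(dpi)]
    if first is not None:
        command += ["-f", str(first)]  # first page to render
    if last is not None:
        command += ["-l", str(last)]  # last page to render
    return command + [str(pdf_path), prefix]
```

Each list can be passed to `subprocess.run(..., check=True)`; for example, `page_image_command("document.pdf", first=2, last=3)` renders only pages 2 and 3.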
Dependencies
pandoc, docx (npm), LibreOffice, poppler-utils, defusedxml (pip).
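A quick way to confirm that the command-line tools are available is to look them up on PATH. This sketch covers only the executables used in the commands above (`pandoc`, `soffice`, `pdftoppm`), not the npm or pip packages; `missing_tools` is a hypothetical helper.

```python
import shutil


def missing_tools(names=("pandoc", "soffice", "pdftoppm")):
    """Return the command names from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]
```

An empty list from `missing_tools()` means all three executables were found.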