anthropics-docx
DOCX creation, editing, and analysis
Overview
A .docx file is a ZIP archive containing XML files.
Quick Reference
| Task | Approach |
|---|---|
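| Read/analyze content | `pandoc`, or unpack for raw XML - see Reading Content below |
| Create new document | Use `docx-js` - see Creating New Documents below |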
| Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below |
Converting .doc to .docx
Legacy `.doc` files must be converted before editing:
bash
python scripts/office/soffice.py --headless --convert-to docx document.doc
Reading Content
bash
# Text extraction with tracked changes
pandoc --track-changes=all document.docx -o output.md
# Raw XML access
python scripts/office/unpack.py document.docx unpacked/
Converting to Images
bash
python scripts/office/soffice.py --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page
Accepting Tracked Changes
To produce a clean document with all tracked changes accepted (requires LibreOffice):
bash
python scripts/accept_changes.py input.docx output.docx
Creating New Documents
Generate .docx files with JavaScript, then validate. Install: `npm install -g docx`
Setup
javascript
const fs = require('fs'); // needed for fs.writeFileSync below
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun,
Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
InternalHyperlink, Bookmark, FootnoteReferenceRun, PositionalTab,
PositionalTabAlignment, PositionalTabRelativeTo, PositionalTabLeader,
TabStopType, TabStopPosition, Column, SectionType,
TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType,
VerticalAlign, PageNumber, PageBreak } = require('docx');
const doc = new Document({ sections: [{ children: [/* content */] }] });
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer));
Validation
After creating the file, validate it. If validation fails, unpack, fix the XML, and repack.
bash
python scripts/office/validate.py doc.docx
Page Size
javascript
// CRITICAL: docx-js defaults to A4, not US Letter
// Always set page size explicitly for consistent results
sections: [{
properties: {
page: {
size: {
width: 12240, // 8.5 inches in DXA
height: 15840 // 11 inches in DXA
},
margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } // 1 inch margins
}
},
children: [/* content */]
}]
Common page sizes (DXA units, 1440 DXA = 1 inch):
| Paper | Width | Height | Content Width (1" margins) |
|---|---|---|---|
| US Letter | 12,240 | 15,840 | 9,360 |
| A4 (default) | 11,906 | 16,838 | 9,026 |
Landscape orientation: docx-js swaps width/height internally, so pass portrait dimensions and let it handle the swap:
javascript
size: {
width: 12240, // Pass SHORT edge as width
height: 15840, // Pass LONG edge as height
orientation: PageOrientation.LANDSCAPE // docx-js swaps them in the XML
},
// Content width = 15840 - left margin - right margin (uses the long edge)
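The unit arithmetic in this section can be sketched in plain JavaScript. The helper names below (`inchesToDxa`, `contentWidthDxa`) are illustrative and not part of docx-js; the sketch assumes the page size is always given in portrait form, as the landscape note describes.

```javascript
// Hypothetical helpers (not part of docx-js) for the DXA arithmetic above.
const DXA_PER_INCH = 1440;

// Convert inches to DXA, rounded to a whole unit.
function inchesToDxa(inches) {
  return Math.round(inches * DXA_PER_INCH);
}

// Usable width between the left and right margins.
// `size` is in portrait form (width = short edge); in landscape the
// long edge (height) becomes the horizontal dimension.
function contentWidthDxa(size, margin, landscape = false) {
  const horizontal = landscape ? size.height : size.width;
  return horizontal - margin.left - margin.right;
}

const letter = { width: inchesToDxa(8.5), height: inchesToDxa(11) };
const oneInch = { left: inchesToDxa(1), right: inchesToDxa(1) };

console.log(contentWidthDxa(letter, oneInch));       // 9360
console.log(contentWidthDxa(letter, oneInch, true)); // 12960
```

The portrait result matches the Content Width column of the table; the landscape result is the width to use for full-width tables on a landscape page.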
Styles (Override Built-in Headings)
Use Arial as the default font (universally supported). Keep titles black for readability.
javascript
const doc = new Document({
styles: {
default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default
paragraphStyles: [
// IMPORTANT: Use exact IDs to override built-in styles
{ id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 32, bold: true, font: "Arial" },
paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // outlineLevel required for TOC
{ id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 28, bold: true, font: "Arial" },
paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } },
]
},
sections: [{
children: [
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }),
]
}]
});
Lists (NEVER use unicode bullets)
javascript
// ❌ WRONG - never manually insert bullet characters
new Paragraph({ children: [new TextRun("• Item")] }) // BAD
new Paragraph({ children: [new TextRun("\u2022 Item")] }) // BAD
// ✅ CORRECT - use numbering config with LevelFormat.BULLET
const doc = new Document({
numbering: {
config: [
{ reference: "bullets",
levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
{ reference: "numbers",
levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
]
},
sections: [{
children: [
new Paragraph({ numbering: { reference: "bullets", level: 0 },
children: [new TextRun("Bullet item")] }),
new Paragraph({ numbering: { reference: "numbers", level: 0 },
children: [new TextRun("Numbered item")] }),
]
}]
});
// ⚠️ Each reference creates INDEPENDENT numbering
// Same reference = continues (1,2,3 then 4,5,6)
// Different reference = restarts (1,2,3 then 1,2,3)
Tables
CRITICAL: Tables need dual widths - set both `columnWidths` on the table AND `width` on each cell. Without both, tables render incorrectly on some platforms.
javascript
// CRITICAL: Always set table width for consistent rendering
// CRITICAL: Use ShadingType.CLEAR (not SOLID) to prevent black backgrounds
const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const borders = { top: border, bottom: border, left: border, right: border };
new Table({
width: { size: 9360, type: WidthType.DXA }, // Always use DXA (percentages break in Google Docs)
columnWidths: [4680, 4680], // Must sum to table width (DXA: 1440 = 1 inch)
rows: [
new TableRow({
children: [
new TableCell({
borders,
width: { size: 4680, type: WidthType.DXA }, // Also set on each cell
shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR not SOLID
margins: { top: 80, bottom: 80, left: 120, right: 120 }, // Cell padding (internal, not added to width)
children: [new Paragraph({ children: [new TextRun("Cell")] })]
})
]
})
]
})
Table width calculation:
Always use `WidthType.DXA`; `WidthType.PERCENTAGE` breaks in Google Docs.
javascript
// Table width = sum of columnWidths = content width
// US Letter with 1" margins: 12240 - 2880 = 9360 DXA
width: { size: 9360, type: WidthType.DXA },
columnWidths: [7000, 2360] // Must sum to table width
Width rules:
- Always use `WidthType.DXA`, never `WidthType.PERCENTAGE` (incompatible with Google Docs)
- Table width must equal the sum of `columnWidths`
- Cell `width` must match the corresponding `columnWidth`
- Cell `margins` are internal padding - they reduce content area, not add to cell width
- For full-width tables: use content width (page width minus left and right margins)
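One way to keep the width rules consistent is to compute the column widths rather than type them by hand. The following is a minimal sketch; `splitColumnWidths` is a hypothetical helper, not part of docx-js, and it assumes integer DXA widths with any rounding remainder assigned to the last column.

```javascript
// Hypothetical helper (not part of docx-js): split a table width into DXA
// column widths by relative weights so the columns sum exactly to the total.
function splitColumnWidths(totalDxa, weights) {
  const weightSum = weights.reduce((a, b) => a + b, 0);
  const widths = weights.map(w => Math.floor((totalDxa * w) / weightSum));
  // Give any rounding remainder to the last column so the sum is exact.
  const used = widths.reduce((a, b) => a + b, 0);
  widths[widths.length - 1] += totalDxa - used;
  return widths;
}

const tableWidth = 9360; // US Letter with 1" margins
const columnWidths = splitColumnWidths(tableWidth, [1, 1, 1, 1, 1, 1, 1]);

console.log(columnWidths);
console.log(columnWidths.reduce((a, b) => a + b, 0)); // 9360
```

The same array can then be passed as `columnWidths` on the table, with `columnWidths[i]` used as the `width.size` of the cell in column `i`, so the two widths cannot drift apart.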
Images
javascript
// CRITICAL: type parameter is REQUIRED
new Paragraph({
children: [new ImageRun({
type: "png", // Required: png, jpg, jpeg, gif, bmp, svg
data: fs.readFileSync("image.png"),
transformation: { width: 200, height: 150 },
altText: { title: "Title", description: "Desc", name: "Name" } // All three required
})]
})
Page Breaks
javascript
// CRITICAL: PageBreak must be inside a Paragraph
new Paragraph({ children: [new PageBreak()] })
// Or use pageBreakBefore
new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] })
Hyperlinks
javascript
// External link
new Paragraph({
children: [new ExternalHyperlink({
children: [new TextRun({ text: "Click here", style: "Hyperlink" })],
link: "https://example.com",
})]
})
// Internal link (bookmark + reference)
// 1. Create bookmark at destination
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [
new Bookmark({ id: "chapter1", children: [new TextRun("Chapter 1")] }),
]})
// 2. Link to it
new Paragraph({ children: [new InternalHyperlink({
children: [new TextRun({ text: "See Chapter 1", style: "Hyperlink" })],
anchor: "chapter1",
})]})
Footnotes
javascript
const doc = new Document({
footnotes: {
1: { children: [new Paragraph("Source: Annual Report 2024")] },
2: { children: [new Paragraph("See appendix for methodology")] },
},
sections: [{
children: [new Paragraph({
children: [
new TextRun("Revenue grew 15%"),
new FootnoteReferenceRun(1),
new TextRun(" using adjusted metrics"),
new FootnoteReferenceRun(2),
],
})]
}]
});
Tab Stops
javascript
// Right-align text on same line (e.g., date opposite a title)
new Paragraph({
children: [
new TextRun("Company Name"),
new TextRun("\tJanuary 2025"),
],
tabStops: [{ type: TabStopType.RIGHT, position: TabStopPosition.MAX }],
})
// Dot leader (e.g., TOC-style)
new Paragraph({
children: [
new TextRun("Introduction"),
new TextRun({ children: [
new PositionalTab({
alignment: PositionalTabAlignment.RIGHT,
relativeTo: PositionalTabRelativeTo.MARGIN,
leader: PositionalTabLeader.DOT,
}),
"3",
]}),
],
})
Multi-Column Layouts
javascript
// Equal-width columns
sections: [{
properties: {
column: {
count: 2, // number of columns
space: 720, // gap between columns in DXA (720 = 0.5 inch)
equalWidth: true,
separate: true, // vertical line between columns
},
},
children: [/* content flows naturally across columns */]
}]
// Custom-width columns (equalWidth must be false)
sections: [{
properties: {
column: {
equalWidth: false,
children: [
new Column({ width: 5400, space: 720 }),
new Column({ width: 3240 }),
],
},
},
children: [/* content */]
}]
Force a column break with a new section using `type: SectionType.NEXT_COLUMN`.
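The column numbers in the examples above can be checked with a little arithmetic. This sketch assumes, as the custom-width example suggests (5400 + 720 + 3240 = 9360), that column widths plus gaps are meant to fill the content width; the helper names are hypothetical and not part of docx-js.

```javascript
// Hypothetical helpers (not part of docx-js) for multi-column arithmetic in DXA.

// Width of each column when `count` equal columns share `contentWidth`
// with a gap of `space` between neighbouring columns.
function equalColumnWidth(contentWidth, count, space) {
  return (contentWidth - space * (count - 1)) / count;
}

// Total horizontal extent of custom columns: every width plus every gap.
function customColumnsExtent(columns) {
  return columns.reduce((sum, col) => sum + col.width + (col.space || 0), 0);
}

console.log(equalColumnWidth(9360, 2, 720)); // 4320
console.log(customColumnsExtent([{ width: 5400, space: 720 }, { width: 3240 }])); // 9360
```

Comparing `customColumnsExtent(...)` with the content width before building the section is a cheap way to catch columns that would overflow or leave unused space.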
Table of Contents
javascript
// CRITICAL: Headings must use HeadingLevel ONLY - no custom styles
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" })
Headers/Footers
javascript
sections: [{
properties: {
page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } // 1440 = 1 inch
},
headers: {
default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] })
},
footers: {
default: new Footer({ children: [new Paragraph({
children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })]
})] })
},
children: [/* content */]
}]
Critical Rules for docx-js
- Set page size explicitly - docx-js defaults to A4; use US Letter (12240 x 15840 DXA) for US documents
- Landscape: pass portrait dimensions - docx-js swaps width/height internally; pass short edge as `width`, long edge as `height`, and set `orientation: PageOrientation.LANDSCAPE`
- Never use `\n` - use separate Paragraph elements
- Never use unicode bullets - use `LevelFormat.BULLET` with numbering config
- PageBreak must be in Paragraph - standalone creates invalid XML
- ImageRun requires `type` - always specify png/jpg/etc
- Always set table `width` with DXA - never use `WidthType.PERCENTAGE` (breaks in Google Docs)
- Tables need dual widths - `columnWidths` array AND cell `width`, both must match
- Table width = sum of columnWidths - for DXA, ensure they add up exactly
- Always add cell margins - use `margins: { top: 80, bottom: 80, left: 120, right: 120 }` for readable padding
- Use `ShadingType.CLEAR` - never SOLID for table shading
- Never use tables as dividers/rules - cells have minimum height and render as empty boxes (including in headers/footers); use `border: { bottom: { style: BorderStyle.SINGLE, size: 6, color: "2E75B6", space: 1 } }` on a Paragraph instead. For two-column footers, use tab stops (see Tab Stops section), not tables
- TOC requires HeadingLevel only - no custom styles on heading paragraphs
- Override built-in styles - use exact IDs: "Heading1", "Heading2", etc.
- Include `outlineLevel` - required for TOC (0 for H1, 1 for H2, etc.)
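The rule against `\n` usually comes up when the source text arrives as one multi-line string. A minimal sketch of the split step follows; `splitIntoParagraphTexts` is a hypothetical helper, and dropping blank lines is an assumption that may not suit every document.

```javascript
// Hypothetical helper (not part of docx-js): turn multi-line text into one
// string per paragraph instead of embedding "\n" in a single TextRun.
function splitIntoParagraphTexts(text) {
  return text
    .split(/\r?\n/)                   // handle both LF and CRLF line endings
    .map(line => line.trim())
    .filter(line => line.length > 0); // drop blank lines
}

const lines = splitIntoParagraphTexts("First line\nSecond line\r\n\nThird line");
console.log(lines); // [ 'First line', 'Second line', 'Third line' ]

// With docx loaded, each entry becomes its own paragraph:
// const children = lines.map(t => new Paragraph({ children: [new TextRun(t)] }));
```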
Editing Existing Documents
Follow all 3 steps in order.
Step 1: Unpack
bash
python scripts/office/unpack.py document.docx unpacked/
Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities (`&#x201C;` etc.) so they survive editing. Use `--merge-runs false` to skip run merging.
Step 2: Edit XML
Edit files in `unpacked/word/`. See XML Reference below for patterns.
Use "Claude" as the author for tracked changes and comments, unless the user explicitly requests use of a different name.
Use the Edit tool directly for string replacement. Do not write Python scripts. Scripts introduce unnecessary complexity. The Edit tool shows exactly what is being replaced.
CRITICAL: Use smart quotes for new content. When adding text with apostrophes or quotes, use XML entities to produce smart quotes:
xml
<!-- Use these entities for professional typography -->
<w:t>Here&#x2019;s a quote: &#x201C;Hello&#x201D;</w:t>
| Entity | Character |
|---|---|
| `&#x2018;` | ‘ (left single) |
| `&#x2019;` | ’ (right single / apostrophe) |
| `&#x201C;` | “ (left double) |
| `&#x201D;` | ” (right double) |
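If new text is drafted with literal curly quotes, it can be converted to the entity form before it is pasted into the XML. This is a sketch only: `toXmlText` is a hypothetical helper, and it also escapes `&`, `<`, and `>`, which is an assumption about what "pre-escaped" text should look like.

```javascript
// Hypothetical helper: escape text for use inside <w:t>, emitting numeric
// entities for curly quotes so they survive later string edits.
const SMART_QUOTE_ENTITIES = {
  "\u2018": "&#x2018;", // left single
  "\u2019": "&#x2019;", // right single / apostrophe
  "\u201C": "&#x201C;", // left double
  "\u201D": "&#x201D;", // right double
};

function toXmlText(text) {
  return text
    .replace(/&/g, "&amp;") // escape ampersands first, before adding entities
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/[\u2018\u2019\u201C\u201D]/g, ch => SMART_QUOTE_ENTITIES[ch]);
}

console.log(toXmlText("Here\u2019s a quote: \u201CHello\u201D"));
// Here&#x2019;s a quote: &#x201C;Hello&#x201D;
```

Ampersands are escaped before the quote entities are inserted so that the `&` in each entity is not escaped a second time.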
Adding comments: Use `comment.py` to handle boilerplate across multiple XML files (text must be pre-escaped XML):
bash
python scripts/comment.py unpacked/ 0 "Comment text with &amp; and &#x2019;"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0 # reply to comment 0
python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author" # custom author name
Then add markers to document.xml (see Comments in XML Reference).
Step 3: Pack
bash
python scripts/office/pack.py unpacked/ output.docx --original document.docx
Validates with auto-repair, condenses XML, and creates DOCX. Use `--validate false` to skip.
Auto-repair will fix:
- `durableId` >= 0x7FFFFFFF (regenerates valid ID)
- Missing `xml:space="preserve"` on `<w:t>` with whitespace
Auto-repair won't fix:
- Malformed XML, invalid element nesting, missing relationships, schema violations
Common Pitfalls
- Replace entire `<w:r>` elements: When adding tracked changes, replace the whole `<w:r>...</w:r>` block with `<w:del>...<w:ins>...` as siblings. Don't inject tracked change tags inside a run.
- Preserve `<w:rPr>` formatting: Copy the original run's `<w:rPr>` block into your tracked change runs to maintain bold, font size, etc.
XML Reference
Schema Compliance
- Element order in `<w:pPr>`: `<w:pStyle>`, `<w:numPr>`, `<w:spacing>`, `<w:ind>`, `<w:jc>`, `<w:rPr>` last
- Whitespace: Add `xml:space="preserve"` to `<w:t>` with leading/trailing spaces
- RSIDs: Must be 8-digit hex (e.g., `00AB1234`)
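The whitespace and RSID rules are easy to check mechanically before an edit is saved. A minimal sketch, with hypothetical helper names, assuming the text passed in is already XML-escaped:

```javascript
// Hypothetical helpers illustrating two of the rules above.

// True when the text starts or ends with whitespace, i.e. when <w:t>
// needs xml:space="preserve" to keep it.
function needsSpacePreserve(text) {
  return /^\s|\s$/.test(text);
}

// Build a <w:t> element; `text` must already be XML-escaped.
function wT(text) {
  const attr = needsSpacePreserve(text) ? ' xml:space="preserve"' : "";
  return `<w:t${attr}>${text}</w:t>`;
}

// RSIDs must be exactly 8 hexadecimal digits.
function isValidRsid(value) {
  return /^[0-9A-Fa-f]{8}$/.test(value);
}

console.log(wT("The term is "));      // <w:t xml:space="preserve">The term is </w:t>
console.log(wT("60"));                // <w:t>60</w:t>
console.log(isValidRsid("00AB1234")); // true
console.log(isValidRsid("AB1234"));   // false
```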
Tracked Changes
Insertion:
xml
<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:t>inserted text</w:t></w:r>
</w:ins>
Deletion:
xml
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
Inside `<w:del>`: Use `<w:delText>` instead of `<w:t>`, and `<w:delInstrText>` instead of `<w:instrText>`.
Minimal edits - only mark what changes:
xml
<!-- Change "30 days" to "60 days" -->
<w:r><w:t xml:space="preserve">The term is </w:t></w:r>
<w:del w:id="1" w:author="Claude" w:date="...">
<w:r><w:delText>30</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Claude" w:date="...">
<w:r><w:t>60</w:t></w:r>
</w:ins>
<w:r><w:t xml:space="preserve"> days.</w:t></w:r>
Deleting entire paragraphs/list items - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add `<w:del/>` inside `<w:pPr><w:rPr>`:
xml
<w:p>
<w:pPr>
<w:numPr>...</w:numPr> <!-- list numbering if present -->
<w:rPr>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/>
</w:rPr>
</w:pPr>
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>Entire paragraph content being deleted...</w:delText></w:r>
</w:del>
</w:p>
Without the `<w:del/>` in `<w:pPr><w:rPr>`, accepting changes leaves an empty paragraph/list item.
Rejecting another author's insertion - nest deletion inside their insertion:
xml
<w:ins w:author="Jane" w:id="5">
<w:del w:author="Claude" w:id="10">
<w:r><w:delText>their inserted text</w:delText></w:r>
</w:del>
</w:ins>
Restoring another author's deletion - add insertion after (don't modify their deletion):
xml
<w:del w:author="Jane" w:id="5">
<w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
<w:ins w:author="Claude" w:id="10">
<w:r><w:t>deleted text</w:t></w:r>
</w:ins>
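For reference, the minimal-edit pattern shown earlier in this section (a `<w:del>` followed by a sibling `<w:ins>`) can be generated as a string and then pasted in with the Edit tool. This is only a sketch: `trackedReplace` and its parameter names are hypothetical, the text arguments are assumed to be XML-escaped already, and `xml:space="preserve"` is not handled.

```javascript
// Hypothetical helper: build the sibling <w:del>/<w:ins> pair for a tracked
// replacement. `rPr` is the original run's <w:rPr>...</w:rPr> block, copied
// into both runs to preserve formatting (or "" if the run has none).
function trackedReplace({ oldText, newText, delId, insId, author = "Claude", date, rPr = "" }) {
  const attrs = id => `w:id="${id}" w:author="${author}" w:date="${date}"`;
  return [
    `<w:del ${attrs(delId)}>`,
    `<w:r>${rPr}<w:delText>${oldText}</w:delText></w:r>`,
    `</w:del>`,
    `<w:ins ${attrs(insId)}>`,
    `<w:r>${rPr}<w:t>${newText}</w:t></w:r>`,
    `</w:ins>`,
  ].join("\n");
}

const example = trackedReplace({
  oldText: "30",
  newText: "60",
  delId: 1,
  insId: 2,
  date: "2025-01-01T00:00:00Z",
});
console.log(example);
```

The deletion and insertion get distinct `w:id` values, matching the examples above.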
Comments
After running `comment.py` (see Step 2), add markers to document.xml. For replies, use the `--parent` flag and nest markers inside the parent's.
CRITICAL: `<w:commentRangeStart>` and `<w:commentRangeEnd>` are siblings of `<w:r>`, never inside `<w:r>`.
xml
<!-- Comment markers are direct children of w:p, never inside w:r -->
<w:commentRangeStart w:id="0"/>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
<w:r><w:delText>deleted</w:delText></w:r>
</w:del>
<w:r><w:t xml:space="preserve"> more text</w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<!-- Comment 0 with reply 1 nested inside -->
<w:commentRangeStart w:id="0"/>
<w:commentRangeStart w:id="1"/>
<w:r><w:t>text</w:t></w:r>
<w:commentRangeEnd w:id="1"/>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="1"/></w:r>
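The marker layout for a single comment (the first pattern above) can be produced mechanically. The sketch below uses a hypothetical `wrapWithComment` helper; it covers only a standalone comment, not the nested reply layout, where both range ends come before both reference runs.

```javascript
// Hypothetical helper: wrap already-built paragraph content (runs, <w:del>,
// <w:ins>) in comment range markers and append the reference run.
// The markers are emitted as siblings of the runs, never inside a <w:r>.
function wrapWithComment(innerXml, id) {
  return [
    `<w:commentRangeStart w:id="${id}"/>`,
    innerXml,
    `<w:commentRangeEnd w:id="${id}"/>`,
    `<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="${id}"/></w:r>`,
  ].join("\n");
}

const commented = wrapWithComment("<w:r><w:t>text</w:t></w:r>", 0);
console.log(commented);
```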
Images
- Add image file to `word/media/`
- Add relationship to `word/_rels/document.xml.rels`:
xml
<Relationship Id="rId5" Type=".../image" Target="media/image1.png"/>
- Add content type to `[Content_Types].xml`:
xml
<Default Extension="png" ContentType="image/png"/>
- Reference in document.xml:
xml
<w:drawing>
<wp:inline>
<wp:extent cx="914400" cy="914400"/> <!-- EMUs: 914400 = 1 inch -->
<a:graphic>
<a:graphicData uri=".../picture">
<pic:pic>
<pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:inline>
</w:drawing>
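The `cx` and `cy` values in `<wp:extent>` are in EMUs, so image sizes usually need converting first. A small sketch with hypothetical helper names; the pixel conversion assumes 96 pixels per inch, which may not match the image's actual resolution.

```javascript
// Hypothetical helpers for the EMU values used in <wp:extent>.
const EMU_PER_INCH = 914400;

function inchesToEmu(inches) {
  return Math.round(inches * EMU_PER_INCH);
}

// Assumes 96 pixels per inch by default, which gives 9525 EMU per pixel.
function pixelsToEmu(pixels, dpi = 96) {
  return Math.round((pixels * EMU_PER_INCH) / dpi);
}

// Build the <wp:extent> element for an image of the given size in inches.
function extentXml(widthInches, heightInches) {
  return `<wp:extent cx="${inchesToEmu(widthInches)}" cy="${inchesToEmu(heightInches)}"/>`;
}

console.log(extentXml(1, 1));  // <wp:extent cx="914400" cy="914400"/>
console.log(pixelsToEmu(200)); // 1905000
```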
Dependencies
- pandoc: Text extraction
- docx: `npm install -g docx` (new documents)
- LibreOffice: PDF conversion (auto-configured for sandboxed environments via `scripts/office/soffice.py`)
- Poppler: `pdftoppm` for images