anthropics-docx

DOCX creation, editing, and analysis

Overview

A .docx file is a ZIP archive containing XML files.

Quick Reference

Task | Approach
Read/analyze content | pandoc, or unpack for raw XML
Create new document | Use docx-js - see Creating New Documents below
Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below

Converting .doc to .docx

Legacy .doc files must be converted before editing:
bash
python scripts/office/soffice.py --headless --convert-to docx document.doc

Reading Content


Text extraction with tracked changes

bash
pandoc --track-changes=all document.docx -o output.md

Raw XML access

bash
python scripts/office/unpack.py document.docx unpacked/

Converting to Images

bash
python scripts/office/soffice.py --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page

Accepting Tracked Changes

To produce a clean document with all tracked changes accepted (requires LibreOffice):
bash
python scripts/accept_changes.py input.docx output.docx

Creating New Documents

Generate .docx files with JavaScript, then validate. Install:
bash
npm install -g docx

Setup

javascript
const fs = require('fs'); // needed for fs.writeFileSync below
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun,
        Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
        InternalHyperlink, Bookmark, FootnoteReferenceRun, PositionalTab,
        PositionalTabAlignment, PositionalTabRelativeTo, PositionalTabLeader,
        TabStopType, TabStopPosition, Column, SectionType,
        TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType,
        VerticalAlign, PageNumber, PageBreak } = require('docx');

const doc = new Document({ sections: [{ children: [/* content */] }] });
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer));

Validation

After creating the file, validate it. If validation fails, unpack, fix the XML, and repack.
bash
python scripts/office/validate.py doc.docx

Page Size

javascript
// CRITICAL: docx-js defaults to A4, not US Letter
// Always set page size explicitly for consistent results
sections: [{
  properties: {
    page: {
      size: {
        width: 12240,   // 8.5 inches in DXA
        height: 15840   // 11 inches in DXA
      },
      margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } // 1 inch margins
    }
  },
  children: [/* content */]
}]
Common page sizes (DXA units, 1440 DXA = 1 inch):
Paper | Width | Height | Content Width (1" margins)
US Letter | 12,240 | 15,840 | 9,360
A4 (default) | 11,906 | 16,838 | 9,026
Landscape orientation: docx-js swaps width/height internally, so pass portrait dimensions and let it handle the swap:
javascript
size: {
  width: 12240,   // Pass SHORT edge as width
  height: 15840,  // Pass LONG edge as height
  orientation: PageOrientation.LANDSCAPE  // docx-js swaps them in the XML
},
// Content width = 15840 - left margin - right margin (uses the long edge)
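The unit arithmetic above can be sketched as a pair of small helpers. This is only an illustration, not part of docx-js: the names inchesToDxa and contentWidth are hypothetical, and the sketch relies solely on 1440 DXA = 1 inch and on the landscape rule that content width is measured along the long edge.

```javascript
// Hypothetical helpers (not part of docx-js): DXA arithmetic for page layout.
const DXA_PER_INCH = 1440;

// Convert inches to DXA, rounded to a whole unit.
function inchesToDxa(inches) {
  return Math.round(inches * DXA_PER_INCH);
}

// Usable width between the left and right margins.
// For landscape, pass portrait dimensions; the long edge becomes the page width.
function contentWidth({ width, height, marginLeft, marginRight, landscape = false }) {
  const pageWidth = landscape ? Math.max(width, height) : width;
  return pageWidth - marginLeft - marginRight;
}

const letter = {
  width: inchesToDxa(8.5),
  height: inchesToDxa(11),
  marginLeft: 1440,
  marginRight: 1440,
};
console.log(contentWidth(letter));                         // 9360
console.log(contentWidth({ ...letter, landscape: true })); // 12960
```

The portrait result matches the 9,360 figure in the table; the landscape result is 15840 minus the two 1-inch margins.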

Styles (Override Built-in Headings)

Use Arial as the default font (universally supported). Keep titles black for readability.
javascript
const doc = new Document({
  styles: {
    default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default
    paragraphStyles: [
      // IMPORTANT: Use exact IDs to override built-in styles
      { id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
        run: { size: 32, bold: true, font: "Arial" },
        paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // outlineLevel required for TOC
      { id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
        run: { size: 28, bold: true, font: "Arial" },
        paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } },
    ]
  },
  sections: [{
    children: [
      new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }),
    ]
  }]
});

Lists (NEVER use unicode bullets)

javascript
// ❌ WRONG - never manually insert bullet characters
new Paragraph({ children: [new TextRun("• Item")] })  // BAD
new Paragraph({ children: [new TextRun("\u2022 Item")] })  // BAD

// ✅ CORRECT - use numbering config with LevelFormat.BULLET
const doc = new Document({
  numbering: {
    config: [
      { reference: "bullets",
        levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT,
          style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
      { reference: "numbers",
        levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
          style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
    ]
  },
  sections: [{
    children: [
      new Paragraph({ numbering: { reference: "bullets", level: 0 },
        children: [new TextRun("Bullet item")] }),
      new Paragraph({ numbering: { reference: "numbers", level: 0 },
        children: [new TextRun("Numbered item")] }),
    ]
  }]
});

// ⚠️ Each reference creates INDEPENDENT numbering
// Same reference = continues (1,2,3 then 4,5,6)
// Different reference = restarts (1,2,3 then 1,2,3)

Tables

CRITICAL: Tables need dual widths - set both columnWidths on the table AND width on each cell. Without both, tables render incorrectly on some platforms.
javascript
// CRITICAL: Always set table width for consistent rendering
// CRITICAL: Use ShadingType.CLEAR (not SOLID) to prevent black backgrounds
const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const borders = { top: border, bottom: border, left: border, right: border };

new Table({
  width: { size: 9360, type: WidthType.DXA }, // Always use DXA (percentages break in Google Docs)
  columnWidths: [4680, 4680], // Must sum to table width (DXA: 1440 = 1 inch)
  rows: [
    new TableRow({
      children: [
        new TableCell({
          borders,
          width: { size: 4680, type: WidthType.DXA }, // Also set on each cell
          shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR not SOLID
          margins: { top: 80, bottom: 80, left: 120, right: 120 }, // Cell padding (internal, not added to width)
          children: [new Paragraph({ children: [new TextRun("Cell")] })]
        })
      ]
    })
  ]
})
Table width calculation:
Always use WidthType.DXA; WidthType.PERCENTAGE breaks in Google Docs.
javascript
// Table width = sum of columnWidths = content width
// US Letter with 1" margins: 12240 - 2880 = 9360 DXA
width: { size: 9360, type: WidthType.DXA },
columnWidths: [7000, 2360]  // Must sum to table width
Width rules:
  • Always use WidthType.DXA, never WidthType.PERCENTAGE (incompatible with Google Docs)
  • Table width must equal the sum of columnWidths
  • Each cell width must match the corresponding columnWidths entry
  • Cell margins are internal padding - they reduce the content area rather than adding to the cell width
  • For full-width tables: use content width (page width minus left and right margins)
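One way to keep the "must sum exactly" rule from being broken by rounding is to compute the column widths rather than typing them by hand. The helper below is a hypothetical sketch (splitColumnWidths is not a docx-js function): it splits a table width by relative fractions and gives any rounding remainder to the last column.

```javascript
// Hypothetical helper (not part of docx-js): split a table width into
// integer DXA column widths that sum exactly to the total.
function splitColumnWidths(totalWidth, fractions) {
  const fractionSum = fractions.reduce((a, b) => a + b, 0);
  const widths = fractions.map((f) => Math.floor((totalWidth * f) / fractionSum));
  // Give the rounding remainder to the last column so the sum is exact.
  const used = widths.reduce((a, b) => a + b, 0);
  widths[widths.length - 1] += totalWidth - used;
  return widths;
}

console.log(splitColumnWidths(9360, [1, 1, 1])); // [ 3120, 3120, 3120 ]
console.log(splitColumnWidths(9026, [2, 1]));    // [ 6017, 3009 ]
```

The returned array would be passed as columnWidths on the table, and entry i reused as the width of every cell in column i, so both widths stay in agreement.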

Images

javascript
// CRITICAL: type parameter is REQUIRED
new Paragraph({
  children: [new ImageRun({
    type: "png", // Required: png, jpg, jpeg, gif, bmp, svg
    data: fs.readFileSync("image.png"),
    transformation: { width: 200, height: 150 },
    altText: { title: "Title", description: "Desc", name: "Name" } // All three required
  })]
})

Page Breaks

javascript
// CRITICAL: PageBreak must be inside a Paragraph
new Paragraph({ children: [new PageBreak()] })

// Or use pageBreakBefore
new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] })

Hyperlinks

javascript
// External link
new Paragraph({
  children: [new ExternalHyperlink({
    children: [new TextRun({ text: "Click here", style: "Hyperlink" })],
    link: "https://example.com",
  })]
})

// Internal link (bookmark + reference)
// 1. Create bookmark at destination
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [
  new Bookmark({ id: "chapter1", children: [new TextRun("Chapter 1")] }),
]})
// 2. Link to it
new Paragraph({ children: [new InternalHyperlink({
  children: [new TextRun({ text: "See Chapter 1", style: "Hyperlink" })],
  anchor: "chapter1",
})]})

Footnotes

javascript
const doc = new Document({
  footnotes: {
    1: { children: [new Paragraph("Source: Annual Report 2024")] },
    2: { children: [new Paragraph("See appendix for methodology")] },
  },
  sections: [{
    children: [new Paragraph({
      children: [
        new TextRun("Revenue grew 15%"),
        new FootnoteReferenceRun(1),
        new TextRun(" using adjusted metrics"),
        new FootnoteReferenceRun(2),
      ],
    })]
  }]
});

Tab Stops

javascript
// Right-align text on same line (e.g., date opposite a title)
new Paragraph({
  children: [
    new TextRun("Company Name"),
    new TextRun("\tJanuary 2025"),
  ],
  tabStops: [{ type: TabStopType.RIGHT, position: TabStopPosition.MAX }],
})

// Dot leader (e.g., TOC-style)
new Paragraph({
  children: [
    new TextRun("Introduction"),
    new TextRun({ children: [
      new PositionalTab({
        alignment: PositionalTabAlignment.RIGHT,
        relativeTo: PositionalTabRelativeTo.MARGIN,
        leader: PositionalTabLeader.DOT,
      }),
      "3",
    ]}),
  ],
})

Multi-Column Layouts

javascript
// Equal-width columns
sections: [{
  properties: {
    column: {
      count: 2,          // number of columns
      space: 720,        // gap between columns in DXA (720 = 0.5 inch)
      equalWidth: true,
      separate: true,    // vertical line between columns
    },
  },
  children: [/* content flows naturally across columns */]
}]

// Custom-width columns (equalWidth must be false)
sections: [{
  properties: {
    column: {
      equalWidth: false,
      children: [
        new Column({ width: 5400, space: 720 }),
        new Column({ width: 3240 }),
      ],
    },
  },
  children: [/* content */]
}]
Force a column break with a new section using type: SectionType.NEXT_COLUMN.
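The custom widths in the example above add up, together with the gap, to 9360 DXA, which is the content width of US Letter with 1-inch margins. Assuming that filling the content width is the intent, the arithmetic can be sketched as follows; both function names are hypothetical and neither is a docx-js API.

```javascript
// Hypothetical check (not a docx-js API): horizontal space used by a set of
// custom columns, counting each column's width plus the gap after it.
function columnsTotalWidth(columns) {
  return columns.reduce((sum, c) => sum + c.width + (c.space || 0), 0);
}

// Width of each column when `count` equal columns share the content width
// with `space` DXA between neighbours.
function equalColumnWidth(contentWidth, count, space) {
  return (contentWidth - space * (count - 1)) / count;
}

const columns = [{ width: 5400, space: 720 }, { width: 3240 }];
console.log(columnsTotalWidth(columns));     // 9360
console.log(equalColumnWidth(9360, 2, 720)); // 4320
```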

Table of Contents

javascript
// CRITICAL: Headings must use HeadingLevel ONLY - no custom styles
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" })

Headers/Footers

javascript
sections: [{
  properties: {
    page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } // 1440 = 1 inch
  },
  headers: {
    default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] })
  },
  footers: {
    default: new Footer({ children: [new Paragraph({
      children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })]
    })] })
  },
  children: [/* content */]
}]

Critical Rules for docx-js

  • Set page size explicitly - docx-js defaults to A4; use US Letter (12240 x 15840 DXA) for US documents
  • Landscape: pass portrait dimensions - docx-js swaps width/height internally; pass short edge as width, long edge as height, and set orientation: PageOrientation.LANDSCAPE
  • Never use \n - use separate Paragraph elements
  • Never use unicode bullets - use LevelFormat.BULLET with numbering config
  • PageBreak must be in Paragraph - standalone creates invalid XML
  • ImageRun requires type - always specify png/jpg/etc
  • Always set table width with DXA - never use WidthType.PERCENTAGE (breaks in Google Docs)
  • Tables need dual widths - columnWidths array AND cell width, both must match
  • Table width = sum of columnWidths - for DXA, ensure they add up exactly
  • Always add cell margins - use margins: { top: 80, bottom: 80, left: 120, right: 120 } for readable padding
  • Use ShadingType.CLEAR - never SOLID for table shading
  • Never use tables as dividers/rules - cells have minimum height and render as empty boxes (including in headers/footers); use border: { bottom: { style: BorderStyle.SINGLE, size: 6, color: "2E75B6", space: 1 } } on a Paragraph instead. For two-column footers, use tab stops (see Tab Stops section), not tables
  • TOC requires HeadingLevel only - no custom styles on heading paragraphs
  • Override built-in styles - use exact IDs: "Heading1", "Heading2", etc.
  • Include outlineLevel - required for TOC (0 for H1, 1 for H2, etc.)


Editing Existing Documents

Follow all 3 steps in order.

Step 1: Unpack

bash
python scripts/office/unpack.py document.docx unpacked/
Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities (&#x201C; etc.) so they survive editing. Use --merge-runs false to skip run merging.

Step 2: Edit XML

Edit files in unpacked/word/. See XML Reference below for patterns.
Use "Claude" as the author for tracked changes and comments, unless the user explicitly requests use of a different name.
Use the Edit tool directly for string replacement. Do not write Python scripts. Scripts introduce unnecessary complexity. The Edit tool shows exactly what is being replaced.
CRITICAL: Use smart quotes for new content. When adding text with apostrophes or quotes, use XML entities to produce smart quotes:
xml
<!-- Use these entities for professional typography -->
<w:t>Here&#x2019;s a quote: &#x201C;Hello&#x201D;</w:t>
Entity | Character
&#x2018; | ‘ (left single)
&#x2019; | ’ (right single / apostrophe)
&#x201C; | “ (left double)
&#x201D; | ” (right double)
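When new text starts out with straight quotes, the mapping in the table can be applied mechanically. The function below is a hypothetical sketch, not one of the skill's scripts. It assumes a simple heuristic: a quote at the start of the string or after whitespace or an opening bracket is an opening quote, and every other quote is a closing quote or apostrophe. Leading apostrophes such as in 'tis would need manual correction.

```javascript
// Hypothetical helper: escape text for a <w:t> element and convert straight
// quotes to the smart-quote entities from the table above.
function toSmartQuoteXml(text) {
  return text
    // Escape XML special characters first so the entities added later stay intact.
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    // A double quote at the start, or after whitespace or an opening bracket, opens.
    .replace(/(^|[\s(\[{])"/g, "$1&#x201C;")
    .replace(/"/g, "&#x201D;")
    // Same rule for single quotes; the rest are closing quotes or apostrophes.
    .replace(/(^|[\s(\[{])'/g, "$1&#x2018;")
    .replace(/'/g, "&#x2019;");
}

console.log(toSmartQuoteXml("Here's a quote: \"Hello\""));
// Here&#x2019;s a quote: &#x201C;Hello&#x201D;
```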
Adding comments: Use comment.py to handle boilerplate across multiple XML files (text must be pre-escaped XML):
bash
python scripts/comment.py unpacked/ 0 "Comment text with &amp; and &#x2019;"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0  # reply to comment 0
python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author"  # custom author name
Then add markers to document.xml (see Comments in XML Reference).

Step 3: Pack

bash
python scripts/office/pack.py unpacked/ output.docx --original document.docx
Validates with auto-repair, condenses XML, and creates DOCX. Use --validate false to skip validation.
Auto-repair will fix:
  • durableId >= 0x7FFFFFFF (regenerates valid ID)
  • Missing xml:space="preserve" on <w:t> with whitespace
Auto-repair won't fix:
  • Malformed XML, invalid element nesting, missing relationships, schema violations
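To make the two repairable conditions concrete, here is a rough sketch of how they could be detected. It is not the implementation in pack.py, which is not shown here; addPreserve and isValidDurableId are hypothetical names, and the regex only handles <w:t> elements that have no attributes.

```javascript
// Hypothetical sketch of the whitespace repair: add xml:space="preserve" to
// attribute-less <w:t> elements whose text starts or ends with whitespace.
function addPreserve(xml) {
  return xml.replace(
    /<w:t>(\s[^<]*|[^<]*\s)<\/w:t>/g,
    '<w:t xml:space="preserve">$1</w:t>'
  );
}

// Hypothetical check for the durableId rule: the value, written as hex,
// must stay below 0x7FFFFFFF.
function isValidDurableId(hex) {
  const value = parseInt(hex, 16);
  return Number.isInteger(value) && value < 0x7fffffff;
}

console.log(addPreserve("<w:r><w:t>The term is </w:t></w:r>"));
// <w:r><w:t xml:space="preserve">The term is </w:t></w:r>
console.log(isValidDurableId("7FFFFFFF")); // false
```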

Common Pitfalls

  • Replace entire <w:r> elements: When adding tracked changes, replace the whole <w:r>...</w:r> block with <w:del>...<w:ins>... as siblings. Don't inject tracked change tags inside a run.
  • Preserve <w:rPr> formatting: Copy the original run's <w:rPr> block into your tracked change runs to maintain bold, font size, etc.


XML Reference

Schema Compliance

  • Element order in <w:pPr>: <w:pStyle>, <w:numPr>, <w:spacing>, <w:ind>, <w:jc>, <w:rPr> last
  • Whitespace: Add xml:space="preserve" to <w:t> with leading/trailing spaces
  • RSIDs: Must be 8-digit hex (e.g., 00AB1234)
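The RSID rule is easy to check before packing. A minimal sketch, with hypothetical helper names that are not part of the skill's scripts:

```javascript
// Hypothetical validator: an RSID is exactly 8 hexadecimal digits.
function isValidRsid(value) {
  return /^[0-9A-Fa-f]{8}$/.test(value);
}

// Hypothetical generator for a fresh RSID-shaped value
// (upper-case hex, zero-padded to 8 digits).
function randomRsid() {
  return Math.floor(Math.random() * 0x100000000)
    .toString(16)
    .toUpperCase()
    .padStart(8, "0");
}

console.log(isValidRsid("00AB1234")); // true
console.log(isValidRsid("AB1234"));   // false
```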

Tracked Changes

Insertion:
xml
<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:t>inserted text</w:t></w:r>
</w:ins>
Deletion:
xml
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
Inside <w:del>: Use <w:delText> instead of <w:t>, and <w:delInstrText> instead of <w:instrText>.
Minimal edits - only mark what changes:
xml
<!-- Change "30 days" to "60 days" -->
<w:r><w:t xml:space="preserve">The term is </w:t></w:r>
<w:del w:id="1" w:author="Claude" w:date="...">
  <w:r><w:delText>30</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Claude" w:date="...">
  <w:r><w:t>60</w:t></w:r>
</w:ins>
<w:r><w:t xml:space="preserve"> days.</w:t></w:r>
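The deletion/insertion pair in the example follows a fixed shape, so it can be generated from a template when drafting an edit. The function below is a hypothetical sketch, not one of the skill's scripts. It assumes the caller supplies text that is already XML-escaped, unique w:id values, and the original run's <w:rPr> block if it has one.

```javascript
// Hypothetical helper: build sibling <w:del> and <w:ins> elements that replace
// oldText with newText. rPrXml is the original run's <w:rPr>...</w:rPr> block,
// or "" if the run had no formatting.
function trackedReplace({ oldText, newText, delId, insId, author = "Claude", date, rPrXml = "" }) {
  const attrs = (id) => `w:id="${id}" w:author="${author}" w:date="${date}"`;
  return [
    `<w:del ${attrs(delId)}>`,
    `  <w:r>${rPrXml}<w:delText>${oldText}</w:delText></w:r>`,
    `</w:del>`,
    `<w:ins ${attrs(insId)}>`,
    `  <w:r>${rPrXml}<w:t>${newText}</w:t></w:r>`,
    `</w:ins>`,
  ].join("\n");
}

console.log(trackedReplace({
  oldText: "30",
  newText: "60",
  delId: 1,
  insId: 2,
  date: "2025-01-01T00:00:00Z",
}));
```

The output is meant to be pasted in place of the original run with the Edit tool, with the unchanged text before and after kept in its own runs.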
Deleting entire paragraphs/list items - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add <w:del/> inside <w:pPr><w:rPr>:
xml
<w:p>
  <w:pPr>
    <w:numPr>...</w:numPr>  <!-- list numbering if present -->
    <w:rPr>
      <w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/>
    </w:rPr>
  </w:pPr>
  <w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
    <w:r><w:delText>Entire paragraph content being deleted...</w:delText></w:r>
  </w:del>
</w:p>
Without the <w:del/> in <w:pPr><w:rPr>, accepting changes leaves an empty paragraph/list item.
Rejecting another author's insertion - nest deletion inside their insertion:
xml
<w:ins w:author="Jane" w:id="5">
  <w:del w:author="Claude" w:id="10">
    <w:r><w:delText>their inserted text</w:delText></w:r>
  </w:del>
</w:ins>
Restoring another author's deletion - add insertion after (don't modify their deletion):
xml
<w:del w:author="Jane" w:id="5">
  <w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
<w:ins w:author="Claude" w:id="10">
  <w:r><w:t>deleted text</w:t></w:r>
</w:ins>

Comments

After running comment.py (see Step 2), add markers to document.xml. For replies, use the --parent flag and nest the reply's markers inside the parent's.
CRITICAL: <w:commentRangeStart> and <w:commentRangeEnd> are siblings of <w:r>, never inside <w:r>.
xml
<!-- Comment markers are direct children of w:p, never inside w:r -->
<w:commentRangeStart w:id="0"/>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted</w:delText></w:r>
</w:del>
<w:r><w:t xml:space="preserve"> more text</w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>

<!-- Comment 0 with reply 1 nested inside -->
<w:commentRangeStart w:id="0"/>
  <w:commentRangeStart w:id="1"/>
  <w:r><w:t>text</w:t></w:r>
  <w:commentRangeEnd w:id="1"/>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="1"/></w:r>

Images

  1. Add image file to word/media/
  2. Add relationship to word/_rels/document.xml.rels:
xml
<Relationship Id="rId5" Type=".../image" Target="media/image1.png"/>
  3. Add content type to [Content_Types].xml:
xml
<Default Extension="png" ContentType="image/png"/>
  4. Reference in document.xml:
xml
<w:drawing>
  <wp:inline>
    <wp:extent cx="914400" cy="914400"/>  <!-- EMUs: 914400 = 1 inch -->
    <a:graphic>
      <a:graphicData uri=".../picture">
        <pic:pic>
          <pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill>
        </pic:pic>
      </a:graphicData>
    </a:graphic>
  </wp:inline>
</w:drawing>
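The cx and cy values in <wp:extent> are in EMUs, so image dimensions need converting before they go into the XML. A minimal sketch with hypothetical helper names; the pixel conversion assumes a 96 DPI image unless another value is passed.

```javascript
// Hypothetical helpers: convert image dimensions to EMUs for <wp:extent>.
const EMU_PER_INCH = 914400;

function inchesToEmu(inches) {
  return Math.round(inches * EMU_PER_INCH);
}

// Assumes 96 DPI by default, so 1 pixel = 914400 / 96 = 9525 EMU.
function pixelsToEmu(pixels, dpi = 96) {
  return Math.round((pixels * EMU_PER_INCH) / dpi);
}

console.log(inchesToEmu(1));   // 914400
console.log(pixelsToEmu(200)); // 1905000
console.log(pixelsToEmu(150)); // 1428750
```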


Dependencies

  • pandoc: Text extraction
  • docx: npm install -g docx (new documents)
  • LibreOffice: PDF conversion (auto-configured for sandboxed environments via scripts/office/soffice.py)
  • Poppler: pdftoppm for images