hwpx

HWPX creation, editing, and analysis

Overview

A .hwpx file is a ZIP archive containing XML files, based on the OWPML (Open Word-Processor Markup Language) standard (KS X 6101).

Quick Reference

| Task | Approach |
|------|----------|
| Read/analyze content | hwpxjs, or unpack for raw XML |
| Create new document | Use hwpxjs - see Creating New Documents below |
| Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below |

Converting .hwp to .hwpx

Legacy .hwp files must be converted before editing:
bash
# Using hwpxjs CLI (pure TypeScript, no external dependencies)
npx hwpxjs convert:hwp document.hwp output.hwpx

# Or using LibreOffice as fallback
python scripts/office/soffice.py --headless --convert-to hwpx document.hwp
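Before choosing a conversion path, it can help to confirm which container a file actually uses rather than trusting its extension. The sketch below is a hypothetical helper (detectContainer is not part of hwpxjs) that inspects the leading bytes. It relies on a .hwpx file being a ZIP archive, and it assumes that legacy .hwp 5.x files carry the OLE compound-file signature; older or unusual files may match neither.

```javascript
// Hypothetical helper, not part of hwpxjs: guess the container format
// from a file's leading bytes.
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04", ZIP local file header
const CFB_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]; // OLE compound file

function startsWith(bytes, signature) {
  return bytes.length >= signature.length &&
    signature.every((value, i) => bytes[i] === value);
}

function detectContainer(bytes) {
  if (startsWith(bytes, ZIP_SIGNATURE)) return "zip"; // plausible .hwpx
  if (startsWith(bytes, CFB_SIGNATURE)) return "cfb"; // assumed legacy .hwp 5.x
  return "unknown";
}

// Usage with in-memory bytes; for a real file, pass fs.readFileSync(path).
console.log(detectContainer(Buffer.from([0x50, 0x4b, 0x03, 0x04, 0x14, 0x00])));
```

A result of "cfb" would suggest running one of the conversion commands above first; "unknown" is a reason to inspect the file by hand rather than to guess.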

Reading Content

bash
# Text extraction via CLI
npx hwpxjs txt document.hwpx

# HTML conversion (includes images/styles)
npx hwpxjs html document.hwpx > output.html

# Raw XML access
python scripts/unpack.py document.hwpx unpacked/

Converting to Images

bash
python scripts/office/soffice.py --headless --convert-to pdf document.hwpx
pdftoppm -jpeg -r 150 document.pdf page

Creating New Documents

Generate .hwpx files with JavaScript. Install:
npm install @ssabrojs/hwpxjs

Setup

javascript
const { HwpxWriter, HwpxReader } = require("@ssabrojs/hwpxjs");
const fs = require("fs");

// CommonJS has no top-level await, so the async calls run inside a function
async function main() {
  // Create document from plain text
  const writer = new HwpxWriter();
  const content = `문서 제목

첫 번째 문단입니다.
두 번째 문단입니다.`;

  const buffer = await writer.createFromPlainText(content);
  fs.writeFileSync("output.hwpx", buffer);
}

main().catch(console.error);

Reading Documents

javascript
const { HwpxReader } = require("@ssabrojs/hwpxjs");
const fs = require("fs");

// CommonJS has no top-level await, so the async calls run inside a function
async function main() {
  const reader = new HwpxReader();
  const fileBuffer = fs.readFileSync("document.hwpx");
  await reader.loadFromArrayBuffer(fileBuffer.buffer);

  // Extract text
  const text = await reader.extractText();
  console.log(text);

  // Get document info
  const info = await reader.getDocumentInfo();
  console.log(info);

  // List images
  const images = await reader.listImages();
  console.log(images);
  // [{ binPath: "BinData/0.jpg", width: 200, height: 150, format: "jpg" }]
}

main().catch(console.error);
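One caveat with fileBuffer.buffer: a Node.js Buffer is a view that may cover only part of its underlying ArrayBuffer, since small Buffers are often carved out of a shared pool. The sketch below uses a hypothetical helper name, toArrayBuffer, to copy exactly the viewed bytes. It is a defensive option under that assumption, not something the library is known to require.

```javascript
// Hypothetical helper: copy exactly the bytes a Buffer views into a
// standalone ArrayBuffer, ignoring any unrelated bytes that share its memory.
function toArrayBuffer(buf) {
  return buf.buffer.slice(buf.byteOffset, buf.byteOffset + buf.byteLength);
}

// Demonstration with a small in-memory Buffer standing in for file contents.
const small = Buffer.from("hwpx");
const exact = toArrayBuffer(small);

// The copy holds only the four viewed bytes; the backing store may be larger.
console.log(exact.byteLength, small.buffer.byteLength);
```

With such a helper, the load call would read await reader.loadFromArrayBuffer(toArrayBuffer(fileBuffer)).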

HTML Conversion

javascript
// Basic HTML conversion
const html = await reader.extractHtml();

// With all options
const fullHtml = await reader.extractHtml({
  paragraphTag: "p",
  tableClassName: "hwpx-table",
  renderImages: true,       // Include images
  renderTables: true,       // Include tables
  renderStyles: true,       // Apply styles (bold, italic, color)
  embedImages: true,        // Base64 embed images
  tableHeaderFirstRow: true // First row as <th>
});

HWP to HWPX Conversion

javascript
const { HwpConverter } = require("@ssabrojs/hwpxjs");

// CommonJS has no top-level await, so the async calls run inside a function
async function main() {
  const converter = new HwpConverter({ verbose: true });

  // Check availability
  if (converter.isAvailable()) {
    // Convert HWP to HWPX
    const result = await converter.convertHwpToHwpx("input.hwp", "output.hwpx");
    if (result.success) {
      console.log(`Converted: ${result.processingTime}ms`);
    }

    // Or extract text only
    const text = await converter.convertHwpToText("input.hwp");
  }
}

main().catch(console.error);

Template Processing

javascript
// hwpxjs supports {{key}} template replacement
// templateBuffer: ArrayBuffer holding the template .hwpx file
const reader = new HwpxReader();
await reader.loadFromArrayBuffer(templateBuffer);

// Apply template replacements to the extracted HTML
// (the result is an HTML string, not a new .hwpx file)
const html = await reader.extractHtml();
const result = html
  .replace(/\{\{name\}\}/g, "홍길동")
  .replace(/\{\{date\}\}/g, "2025-01-01");
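When a template has more than a couple of keys, chaining one replace call per key becomes repetitive. The sketch below generalizes the same string substitution with a hypothetical helper, fillTemplate, which is plain JavaScript and not an hwpxjs API. It assumes keys consist of word characters and leaves unknown placeholders untouched so that missing values stay visible.

```javascript
// Hypothetical helper: replace every {{key}} in a string with values[key].
// Placeholders without a matching key are left as-is.
function fillTemplate(text, values) {
  return text.replace(/\{\{(\w+)\}\}/g, (placeholder, key) =>
    Object.prototype.hasOwnProperty.call(values, key)
      ? String(values[key])
      : placeholder
  );
}

// Usage on a stand-in for the HTML returned by extractHtml().
const sampleHtml = "<p>{{name}} / {{date}} / {{missing}}</p>";
const filled = fillTemplate(sampleHtml, { name: "홍길동", date: "2025-01-01" });
console.log(filled);
```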

Critical Rules for hwpxjs

  • createFromPlainText returns Buffer - save with fs.writeFileSync(path, buffer)
  • loadFromArrayBuffer for reading - pass fileBuffer.buffer, not fileBuffer
  • Text-only creation - for tables/images, use XML editing approach below
  • HwpConverter for HWP files - pure TypeScript, no LibreOffice needed
  • extractHtml for rich content - includes styles, tables, images

Editing Existing Documents

Follow all 3 steps in order.

Step 1: Unpack

bash
python scripts/unpack.py document.hwpx unpacked/

Step 2: Edit XML

Edit files in unpacked/Contents/. See XML Reference below for patterns.
Use the Edit tool directly for string replacement of text. Do not write Python scripts for simple text edits: scripts introduce unnecessary complexity, while the Edit tool shows exactly what is being replaced.
CRITICAL: Remove <hp:linesegarray> when modifying text.
This element contains cached layout data. Leaving a stale linesegarray in place causes character overlap:
xml
<!-- BEFORE: paragraph with stale layout cache -->
<hp:p id="0" paraPrIDRef="0" styleIDRef="0">
  <hp:run charPrIDRef="19">
    <hp:t>Original text</hp:t>
  </hp:run>
  <hp:linesegarray>
    <hp:lineseg textpos="0" vertpos="0" vertsize="1000" horzsize="5000" .../>
  </hp:linesegarray>
</hp:p>

<!-- AFTER: remove linesegarray entirely -->
<hp:p id="0" paraPrIDRef="0" styleIDRef="0">
  <hp:run charPrIDRef="19">
    <hp:t>New longer text that exceeds original width</hp:t>
  </hp:run>
</hp:p>
Note: Multiple <hp:run> elements share one <hp:linesegarray>. Remove it when editing ANY run in the paragraph.
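To make the transformation concrete, the sketch below performs the same removal on a paragraph held as a string. stripLinesegarray is a hypothetical helper meant for illustration or for checking a result; it assumes the input is a single paragraph and is not a replacement for the Edit-tool workflow described above.

```javascript
// Hypothetical helper: drop the cached layout block from one paragraph's XML.
// Handles both the paired form and a self-closing <hp:linesegarray/>.
function stripLinesegarray(paragraphXml) {
  const pattern =
    /[ \t]*<hp:linesegarray\b[^>]*?(?:\/>|>[\s\S]*?<\/hp:linesegarray>)[ \t]*\r?\n?/g;
  return paragraphXml.replace(pattern, "");
}

const editedParagraph = [
  '<hp:p id="0" paraPrIDRef="0" styleIDRef="0">',
  '  <hp:run charPrIDRef="19">',
  '    <hp:t>New longer text that exceeds original width</hp:t>',
  '  </hp:run>',
  '  <hp:linesegarray>',
  '    <hp:lineseg textpos="0" vertpos="0" vertsize="1000" horzsize="5000"/>',
  '  </hp:linesegarray>',
  '</hp:p>',
].join("\n");

// The run and its charPrIDRef are untouched; only the layout cache is gone.
console.log(stripLinesegarray(editedParagraph));
```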

Step 3: Pack

bash
python scripts/pack.py unpacked/ output.hwpx

Common Pitfalls

  • Character overlap after edit: Remove <hp:linesegarray> from the edited <hp:p>. Multiple <hp:run> elements share one linesegarray, so remove it when editing ANY run.
  • Wrong table cell modified: Include <hp:cellAddr> in search pattern. CRITICAL: <hp:cellAddr> appears AFTER cell content, not before. Use grep -B20 'colAddr="2" rowAddr="0"' section0.xml.
  • Preserve charPrIDRef: Don't change charPrIDRef when editing text; it references font/size/style in header.xml.
  • File corruption from string replacement: Use lxml for structural changes (inserting elements). String replacement breaks XML parent-child relationships.
  • Page overflow from text replacement: Replacing blanks/spaces with text can cause content overflow and page breaks. Solutions: (1) Keep replacement text similar in length to original spaces, (2) Preserve charPrIDRef for underlined fields to maintain underline style, (3) Reduce unnecessary whitespace proportionally, (4) Cell/margin adjustments may be needed.
  • Image size too large (e.g., 635mm): HWP unit calculation error. 1 HWP unit = 1/7200 inch, so 1mm ≈ 283.5 HWP units.
    • ❌ Wrong: width="180000" → 635mm (too large!)
    • ✅ Correct: width="3400" → ~12mm (signature size)
    • Formula: mm × (7200 ÷ 25.4) = HWP units
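The unit formula in the last item can be sketched as a pair of conversion functions. The function names are illustrative only; the constants come directly from the 1/7200 inch definition above.

```javascript
const HWP_UNITS_PER_INCH = 7200;
const MM_PER_INCH = 25.4;

// Millimetres to HWP units, rounded to a whole unit for use in attributes.
function mmToHwpUnits(mm) {
  return Math.round((mm * HWP_UNITS_PER_INCH) / MM_PER_INCH);
}

// HWP units back to millimetres, useful for sanity-checking a width or height.
function hwpUnitsToMm(units) {
  return (units * MM_PER_INCH) / HWP_UNITS_PER_INCH;
}

console.log(mmToHwpUnits(12));                // 3402, close to the 3400 used above
console.log(hwpUnitsToMm(180000).toFixed(1)); // "635.0", the oversized case
```

Converting a suspicious attribute value back to millimetres before repacking is a quick way to catch the oversized-image mistake.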

XML Reference

Key Elements

| Element | Purpose |
|---------|---------|
| <hp:p> | Paragraph |
| <hp:run> | Text run with formatting |
| <hp:t> | Text content |
| <hp:tbl> | Table |
| <hp:tc> | Table cell |
| <hp:cellAddr> | Cell position (AFTER content) |
| <hp:pic> | Image |
| <hp:linesegarray> | Layout cache (remove when editing) |

Paragraph Structure

xml
<hp:p id="0" paraPrIDRef="0" styleIDRef="0" pageBreak="0">
  <hp:run charPrIDRef="0">
    <hp:t>Text content</hp:t>
  </hp:run>
  <hp:linesegarray>  <!-- Remove this when editing text -->
    <hp:lineseg textpos="0" vertpos="0" vertsize="1000" .../>
  </hp:linesegarray>
</hp:p>

Table Cell Structure

xml
<hp:tc borderFillIDRef="5">
  <hp:subList textDirection="HORIZONTAL" vertAlign="CENTER">
    <hp:p paraPrIDRef="20">
      <hp:run charPrIDRef="19">
        <hp:t>Cell content</hp:t>
      </hp:run>
    </hp:p>
  </hp:subList>
  <hp:cellAddr colAddr="0" rowAddr="0"/>  <!-- Position identifier -->
  <hp:cellSpan colSpan="1" rowSpan="1"/>
  <hp:cellSz width="5136" height="4179"/>
</hp:tc>
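Because the address follows the content, a lookup has to find the <hp:cellAddr> first and then walk backwards to the start of its cell. The sketch below illustrates that order with a hypothetical helper, findCellText. It assumes the attribute order shown above (colAddr before rowAddr), a self-closing <hp:cellAddr/>, and no nested tables inside the cell; real documents may violate any of these.

```javascript
// Hypothetical helper: return the text of the cell at (col, row), or null.
// Works backwards from <hp:cellAddr>, since the address follows the content.
function findCellText(sectionXml, col, row) {
  const marker = `<hp:cellAddr colAddr="${col}" rowAddr="${row}"/>`;
  const addrAt = sectionXml.indexOf(marker);
  if (addrAt === -1) return null;

  // Nearest <hp:tc opening before the address (assumes no nested tables).
  const cellStart = sectionXml.lastIndexOf("<hp:tc", addrAt);
  if (cellStart === -1) return null;

  const cellXml = sectionXml.slice(cellStart, addrAt);
  const texts = [...cellXml.matchAll(/<hp:t>([\s\S]*?)<\/hp:t>/g)];
  return texts.map((match) => match[1]).join("");
}

// Two-cell sample row in the shape shown above.
const sampleRow =
  '<hp:tr>' +
  '<hp:tc borderFillIDRef="5"><hp:subList><hp:p paraPrIDRef="20">' +
  '<hp:run charPrIDRef="19"><hp:t>Name</hp:t></hp:run></hp:p></hp:subList>' +
  '<hp:cellAddr colAddr="0" rowAddr="0"/></hp:tc>' +
  '<hp:tc borderFillIDRef="5"><hp:subList><hp:p paraPrIDRef="20">' +
  '<hp:run charPrIDRef="19"><hp:t>Date</hp:t></hp:run></hp:p></hp:subList>' +
  '<hp:cellAddr colAddr="1" rowAddr="0"/></hp:tc>' +
  '</hp:tr>';

console.log(findCellText(sampleRow, 1, 0));
```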

Images

CRITICAL: <hp:pic> MUST be inside <hp:run>, followed by an empty <hp:t/>.
  1. Add image file to BinData/
  2. Add to manifest Contents/content.hpf:
xml
<opf:item id="image1" href="BinData/image1.png" media-type="image/png" isEmbeded="1"/>
  3. Reference in section0.xml:
xml
<hp:p id="0" paraPrIDRef="38" styleIDRef="41">
  <hp:run charPrIDRef="0">
    <hp:pic id="12345" zOrder="0" numberingType="PICTURE" textWrap="TOP_AND_BOTTOM">
      <hp:orgSz width="7200" height="7200"/>  <!-- 1 inch = 7200 HWP units -->
      <hp:curSz width="3600" height="3600"/>  <!-- Display: 0.5 inch -->
      <hc:img binaryItemIDRef="image1" bright="0" contrast="0" effect="REAL_PIC" alpha="0"/>
      <hp:sz width="3600" widthRelTo="ABSOLUTE" height="3600" heightRelTo="ABSOLUTE"/>
      <hp:pos treatAsChar="1" horzRelTo="COLUMN" horzAlign="CENTER" vertRelTo="PARA" vertAlign="TOP"/>
    </hp:pic>
    <hp:t/>  <!-- REQUIRED: empty text element after hp:pic -->
  </hp:run>
</hp:p>
Size units: HWP uses 1/7200 inch units. 1mm ≈ 283.5 units (7200 ÷ 25.4)
For safe image insertion using lxml, see references/image-insertion.md.
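The size arithmetic for an inserted image can be sketched as follows. Both helpers are hypothetical names introduced for illustration: one derives a display size in HWP units from a target width in millimetres while keeping the source aspect ratio, the other renders <hp:curSz> and <hp:sz> with matching values, as in the example above. How <hp:orgSz> should be set is not covered here.

```javascript
const HWP_UNITS_PER_MM = 7200 / 25.4;

// Hypothetical helper: display size in HWP units for a target width in mm,
// keeping the aspect ratio of the source image (given in pixels).
function displaySizeForWidth(pixelWidth, pixelHeight, targetWidthMm) {
  const width = Math.round(targetWidthMm * HWP_UNITS_PER_MM);
  const height = Math.round(width * (pixelHeight / pixelWidth));
  return { width, height };
}

// Hypothetical helper: render the two display-size elements with equal values.
function sizeElements({ width, height }) {
  return [
    `<hp:curSz width="${width}" height="${height}"/>`,
    `<hp:sz width="${width}" widthRelTo="ABSOLUTE" height="${height}" heightRelTo="ABSOLUTE"/>`,
  ].join("\n");
}

// A 400x200 pixel signature shown 12 mm wide.
const signatureSize = displaySizeForWidth(400, 200, 12);
console.log(sizeElements(signatureSize));
```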

Page Break

xml
<hp:p pageBreak="1" ...>  <!-- pageBreak="1" inserts break before paragraph -->

Differences from DOCX

| Aspect | HWPX | DOCX |
|--------|------|------|
| Text element | <hp:t> | <w:t> |
| Paragraph | <hp:p> | <w:p> |
| Run | <hp:run> | <w:r> |
| Layout cache | <hp:linesegarray> | None |
| Content location | Contents/section*.xml | word/document.xml |
| Cell identifier | <hp:cellAddr> after content | implicit order |
Key difference: HWPX stores layout cache in linesegarray; DOCX doesn't. This is why editing HWPX requires removing linesegarray.
For detailed XML structures (headers/footers, lists/numbering, paragraph formatting), see references/xml-reference.md.

Dependencies

bash
npm install @ssabrojs/hwpxjs
  • hwpxjs: npm install @ssabrojs/hwpxjs - reading, writing, HTML conversion, HWP→HWPX conversion
  • pyhwp2md: Converting HWP/HWPX to Markdown (alternative)
  • LibreOffice: PDF conversion (auto-configured via scripts/office/soffice.py)
  • Poppler: pdftoppm for PDF to images