document-processing

Document Processing Guide

Work with office documents: PDF, Excel, Word, and PowerPoint.

Format Overview

Format	Extension	Structure	Best For
PDF	.pdf	Binary/text	Reports, forms, archives
Excel	.xlsx	XML in ZIP	Data, calculations, models
Word	.docx	XML in ZIP	Text documents, contracts
PowerPoint	.pptx	XML in ZIP	Presentations, slides

Key concept: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
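
As a minimal sketch of that idea, the standard-library `zipfile` module can list and read the parts of a package without unpacking it to disk. The helper names `list_package_parts` and `read_part` are illustrative, not part of any library.

```python
import zipfile


def list_package_parts(source):
    """Return the names of the parts stored inside an XLSX/DOCX/PPTX package.

    `source` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        return package.namelist()


def read_part(source, part_name):
    """Return one part (for example word/document.xml) decoded as UTF-8 text."""
    with zipfile.ZipFile(source) as package:
        return package.read(part_name).decode("utf-8")
```

For example, `read_part("contract.docx", "word/document.xml")` would return the raw XML of a Word document's main content, where `contract.docx` is a placeholder path.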

PDF Processing

PDF Tools

Task	Best Tool
Basic read/write	pypdf
Text extraction	pdfplumber
Table extraction	pdfplumber
Create PDFs	reportlab
OCR scanned PDFs	pytesseract + pdf2image
Command line	qpdf, pdftotext

Common Operations

Operation	Approach
Merge	Loop through files, add pages to writer
Split	Create new writer per page
Extract tables	Use pdfplumber, convert to DataFrame
Rotate	Call `.rotate(degrees)` on the page (pypdf accepts multiples of 90)
Encrypt	Use writer's `.encrypt()` method
OCR	Convert to images, run pytesseract

Excel Processing

Excel Tools

Task	Best Tool
Data analysis	pandas
Formulas & formatting	openpyxl
Simple CSV	pandas
Financial models	openpyxl

Critical Rule: Use Formulas

Approach	Result
Wrong: Calculate in Python, write value	Static number, breaks when data changes
Right: Write Excel formula	Dynamic, recalculates automatically
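
A small sketch of the "right" approach with openpyxl, assuming a single column of inputs. `build_totals_sheet` is an illustrative name. Note that openpyxl stores the formula text without evaluating it; the computed value appears once a spreadsheet application recalculates the workbook.

```python
from openpyxl import Workbook


def build_totals_sheet(values):
    """Write values into column A and a SUM formula in the row below them."""
    workbook = Workbook()
    sheet = workbook.active
    for row, value in enumerate(values, start=1):
        sheet.cell(row=row, column=1, value=value)
    total_row = len(values) + 1
    # Store the formula text, not a Python-computed number, so the total
    # follows later edits to the input cells.
    sheet.cell(row=total_row, column=1, value=f"=SUM(A1:A{len(values)})")
    return workbook
```

Calling `build_totals_sheet([10, 20, 30]).save("totals.xlsx")` would write the inputs to A1:A3 and the formula to A4; the file name is a placeholder.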

Financial Model Standards

Convention	Meaning
Blue text	Hardcoded inputs
Black text	Formulas
Green text	Links to other sheets
Yellow fill	Needs attention
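
One way these conventions could be applied with openpyxl is sketched below. The hex shades and the rule "a formula containing `!` is a cross-sheet link" are assumptions made for illustration; the guide only names the colors, and `apply_model_colors` and `flag_for_attention` are hypothetical helpers.

```python
from openpyxl.styles import Font, PatternFill

# Hex shades are illustrative assumptions; the guide names only the colors.
INPUT_FONT = Font(color="0000FF")    # blue: hardcoded inputs
FORMULA_FONT = Font(color="000000")  # black: formulas
LINK_FONT = Font(color="008000")     # green: links to other sheets
ATTENTION_FILL = PatternFill(fill_type="solid", start_color="FFFF00")


def apply_model_colors(cell):
    """Color a cell's text according to what it holds (heuristic sketch)."""
    value = cell.value
    if isinstance(value, str) and value.startswith("="):
        # A "!" inside a formula is treated here as a cross-sheet reference.
        cell.font = LINK_FONT if "!" in value else FORMULA_FONT
    else:
        cell.font = INPUT_FONT


def flag_for_attention(cell):
    """Give the cell a solid yellow fill to mark it for review."""
    cell.fill = ATTENTION_FILL
```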

Common Formula Errors

Error	Cause
#REF!	Invalid cell reference
#DIV/0!	Division by zero
#VALUE!	Wrong data type
#NAME?	Unknown function name

Word Processing

Word Tools

Task	Best Tool
Text extraction	pandoc
Create new	python-docx or docx-js
Simple edits	python-docx
Tracked changes	Direct XML editing

Document Structure

File	Contains
`word/document.xml`	Main content
`word/comments.xml`	Comments
`word/media/`	Images

Tracked Changes (Redlining)

Element	XML Tag
Deletion	`<w:del><w:delText>...</w:delText></w:del>`
Insertion	`<w:ins><w:t>...</w:t></w:ins>`

Key concept: For professional/legal documents, use tracked changes XML rather than replacing text directly.
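
To make the markup concrete, here is a sketch that builds a deletion/insertion pair with the standard-library `xml.etree.ElementTree`. In WordprocessingML the text elements sit inside a run (`w:r`), and each change carries `w:id`, `w:author`, and `w:date` attributes. The function name `tracked_replacement` and its parameters are illustrative; placing the elements into the right paragraph of `word/document.xml` is not shown.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def _w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def tracked_replacement(old_text, new_text, author, date, first_id=1):
    """Return [<w:del>, <w:ins>] elements recording old_text -> new_text."""
    deletion = ET.Element(
        _w("del"),
        {_w("id"): str(first_id), _w("author"): author, _w("date"): date},
    )
    del_run = ET.SubElement(deletion, _w("r"))
    ET.SubElement(del_run, _w("delText")).text = old_text

    insertion = ET.Element(
        _w("ins"),
        {_w("id"): str(first_id + 1), _w("author"): author, _w("date"): date},
    )
    ins_run = ET.SubElement(insertion, _w("r"))
    ET.SubElement(ins_run, _w("t")).text = new_text
    return [deletion, insertion]
```

Serializing either element with `ET.tostring(element, encoding="unicode")` gives XML that can be compared against what Word itself writes when a change is tracked.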

PowerPoint Processing

PowerPoint Tools

Task	Best Tool
Text extraction	markitdown
Create new	pptxgenjs (JS) or python-pptx
Edit existing	Direct XML or python-pptx

Slide Structure

Path	Contains
`ppt/slides/slide{N}.xml`	Slide content
`ppt/notesSlides/`	Speaker notes
`ppt/slideMasters/`	Master templates
`ppt/media/`	Images
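
The layout above can be explored with the standard library alone. The sketch below lists the slide parts sorted by the number in their file name; `list_slide_parts` is an illustrative helper. One caveat: the file number does not necessarily match the on-screen slide order, which is defined in `ppt/presentation.xml`.

```python
import re
import zipfile

SLIDE_PART = re.compile(r"^ppt/slides/slide(\d+)\.xml$")


def list_slide_parts(source):
    """Return slide part names sorted by their numeric suffix N.

    `source` may be a filesystem path or a binary file-like object.
    """
    numbered = []
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            match = SLIDE_PART.match(name)
            if match:
                numbered.append((int(match.group(1)), name))
    # Sort numerically so slide10 comes after slide2, not before it.
    return [name for _, name in sorted(numbered)]
```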

Design Principles

Principle	Guideline
Fonts	Use web-safe fonts: Arial, Helvetica, Georgia
Layout	Two-column preferred, avoid vertical stacking
Hierarchy	Size, weight, color for emphasis
Consistency	Repeat patterns across slides

Converting Between Formats

Conversion	Tool
Any → PDF	LibreOffice headless
PDF → Images	pdftoppm
DOCX → Markdown	pandoc
Any → Text	Format-specific extractor (e.g. pdftotext, pandoc, markitdown)
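
The first three conversions are usually driven from the command line. The sketch below only builds the argument lists, so they can be inspected before anything runs. It assumes the executables are on `PATH` under the names `libreoffice`, `pdftoppm`, and `pandoc` (on some systems LibreOffice is installed as `soffice` instead), and the function names and the 150 dpi default are illustrative.

```python
import subprocess


def to_pdf_command(source, out_dir):
    """LibreOffice headless conversion of an office document to PDF."""
    return ["libreoffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def pdf_to_png_command(pdf, prefix, dpi=150):
    """pdftoppm rendering of each PDF page to <prefix>-<page>.png."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def docx_to_markdown_command(docx, markdown):
    """pandoc conversion of a DOCX file to Markdown."""
    return ["pandoc", str(docx), "-t", "markdown", "-o", str(markdown)]


def run(command):
    """Execute a command list and raise CalledProcessError on failure."""
    subprocess.run(command, check=True)
```

For instance, `run(to_pdf_command("report.docx", "out"))` would write `out/report.pdf` if LibreOffice is available; the file names are placeholders.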

Best Practices

Practice	Why
Use formulas in Excel	Dynamic calculations
Preserve formatting on edit	Don't lose styles
Test output opens correctly	Catch corruption early
Use tracked changes for contracts	Audit trail
Extract to markdown for analysis	Easier to process

Common Packages

Language	Packages
Python	pypdf, pdfplumber, reportlab, pytesseract, pdf2image, pandas, openpyxl, python-docx, python-pptx, markitdown
JavaScript	docx, pptxgenjs
CLI	pandoc, qpdf, pdftotext, libreoffice
