document-processing

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Document Processing

文档处理

Source: This skill is adapted from Anthropic's Skills document processing skills (pdf, docx, pptx, xlsx) for Claude Code and AI agents.

Create, edit, and analyze office documents including PDFs, Word documents, PowerPoint presentations, and Excel spreadsheets.

来源： 本技能改编自**Anthropic's Skills**中面向Claude Code和AI Agent的文档处理技能（pdf、docx、pptx、xlsx）。

创建、编辑和分析各类办公文档，包括PDF、Word文档、PowerPoint演示文稿和Excel电子表格。

Quick Reference: Which Tool to Use

快速参考：选择合适的工具

Task	Document Type	Best Tool
Extract text	PDF	`pdfplumber` , `pdftotext`
Merge/split	PDF	`pypdf` , `qpdf`
Fill forms	PDF	`pdf-lib` (JS), `pypdf`
Create new	PDF	`reportlab`
OCR scanned	PDF	`pytesseract` + `pdf2image`
Extract text	DOCX	`pandoc` , `markitdown`
Create new	DOCX	`docx-js` (JS)
Edit existing	DOCX	OOXML (unpack/edit/pack)
Extract text	PPTX	`markitdown`
Create new	PPTX	`html2pptx` , `PptxGenJS`
Edit existing	PPTX	OOXML (unpack/edit/pack)
Data analysis	XLSX	`pandas`
Formulas/formatting	XLSX	`openpyxl`

任务	文档类型	推荐工具
提取文本	PDF	`pdfplumber` , `pdftotext`
合并/拆分	PDF	`pypdf` , `qpdf`
填写表单	PDF	`pdf-lib` (JS), `pypdf`
创建新文档	PDF	`reportlab`
OCR扫描件	PDF	`pytesseract` + `pdf2image`
提取文本	DOCX	`pandoc` , `markitdown`
创建新文档	DOCX	`docx-js` (JS)
编辑现有文档	DOCX	OOXML（解压/编辑/打包）
提取文本	PPTX	`markitdown`
创建新文档	PPTX	`html2pptx` , `PptxGenJS`
编辑现有文档	PPTX	OOXML（解压/编辑/打包）
数据分析	XLSX	`pandas`
公式/格式设置	XLSX	`openpyxl`

PDF Processing

PDF处理

Text Extraction

文本提取

python

import pdfplumber

python

import pdfplumber

Extract text with layout preservation

with pdfplumber.open("document.pdf") as pdf: for page in pdf.pages: text = page.extract_text() print(text)

undefined

with pdfplumber.open("document.pdf") as pdf: for page in pdf.pages: text = page.extract_text() print(text)

undefined

Table Extraction

表格提取

python

import pdfplumber
import pandas as pd

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

python

import pdfplumber
import pandas as pd

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

Combine all tables

if all_tables: combined_df = pd.concat(all_tables, ignore_index=True) combined_df.to_excel("extracted_tables.xlsx", index=False)

undefined

if all_tables: combined_df = pd.concat(all_tables, ignore_index=True) combined_df.to_excel("extracted_tables.xlsx", index=False)

undefined

Merge PDFs

合并PDF

python

from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

python

from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

Split PDF

拆分PDF

python

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)

python

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)

Rotate Pages

旋转页面

python

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)

python

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)

OCR Scanned PDFs

OCR扫描PDF

python

undefined

python

undefined

Requires: pip install pytesseract pdf2image

import pytesseract from pdf2image import convert_from_path

Convert PDF to images

images = convert_from_path('scanned.pdf')

OCR each page

text = "" for i, image in enumerate(images): text += f"Page {i+1}:\n" text += pytesseract.image_to_string(image) text += "\n\n"

print(text)

undefined

text = "" for i, image in enumerate(images): text += f"Page {i+1}:\n" text += pytesseract.image_to_string(image) text += "\n\n"

print(text)

undefined

Add Watermark

添加水印

python

from pypdf import PdfReader, PdfWriter

watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

python

from pypdf import PdfReader, PdfWriter

watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

Password Protection

密码保护

python

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)

python

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)

Create PDF with ReportLab

使用ReportLab创建PDF

python

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

python

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

Add content

title = Paragraph("Report Title", styles['Title']) story.append(title) story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal']) story.append(body) story.append(PageBreak())

title = Paragraph("Report Title", styles['Title']) story.append(title) story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal']) story.append(body) story.append(PageBreak())

Page 2

story.append(Paragraph("Page 2", styles['Heading1'])) story.append(Paragraph("Content for page 2", styles['Normal']))

doc.build(story)

undefined

story.append(Paragraph("Page 2", styles['Heading1'])) story.append(Paragraph("Content for page 2", styles['Normal']))

doc.build(story)

undefined

Command Line Tools

命令行工具

bash

undefined

bash

undefined

Extract text (poppler-utils)

pdftotext input.pdf output.txt pdftotext -layout input.pdf output.txt # Preserve layout

Merge PDFs (qpdf)

qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

Split pages

qpdf input.pdf --pages . 1-5 -- pages1-5.pdf

Rotate pages

qpdf input.pdf output.pdf --rotate=+90:1

Remove password

qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf

Extract images

pdfimages -j input.pdf output_prefix

---

pdfimages -j input.pdf output_prefix

---

Word Document (DOCX) Processing

Word文档（DOCX）处理

Text Extraction

文本提取

bash

undefined

bash

undefined

Convert to markdown with pandoc

pandoc document.docx -o output.md

With tracked changes preserved

pandoc --track-changes=all document.docx -o output.md

undefined

pandoc --track-changes=all document.docx -o output.md

undefined

Create New Document (docx-js)

创建新文档（docx-js）

javascript

import { Document, Paragraph, TextRun, HeadingLevel, Packer } from 'docx';
import * as fs from 'fs';

const doc = new Document({
  sections: [{
    properties: {},
    children: [
      new Paragraph({
        text: "Document Title",
        heading: HeadingLevel.HEADING_1,
      }),
      new Paragraph({
        children: [
          new TextRun("This is a "),
          new TextRun({
            text: "bold",
            bold: true,
          }),
          new TextRun(" word in a paragraph."),
        ],
      }),
      new Paragraph({
        text: "This is another paragraph.",
      }),
    ],
  }],
});

// Export to file
const buffer = await Packer.toBuffer(doc);
fs.writeFileSync("output.docx", buffer);

javascript

import { Document, Paragraph, TextRun, HeadingLevel, Packer } from 'docx';
import * as fs from 'fs';

const doc = new Document({
  sections: [{
    properties: {},
    children: [
      new Paragraph({
        text: "Document Title",
        heading: HeadingLevel.HEADING_1,
      }),
      new Paragraph({
        children: [
          new TextRun("This is a "),
          new TextRun({
            text: "bold",
            bold: true,
          }),
          new TextRun(" word in a paragraph."),
        ],
      }),
      new Paragraph({
        text: "This is another paragraph.",
      }),
    ],
  }],
});

// Export to file
const buffer = await Packer.toBuffer(doc);
fs.writeFileSync("output.docx", buffer);

Create Document with Tables

创建带表格的文档

javascript

import { Document, Paragraph, Table, TableRow, TableCell, Packer } from 'docx';

const table = new Table({
  rows: [
    new TableRow({
      children: [
        new TableCell({ children: [new Paragraph("Header 1")] }),
        new TableCell({ children: [new Paragraph("Header 2")] }),
        new TableCell({ children: [new Paragraph("Header 3")] }),
      ],
    }),
    new TableRow({
      children: [
        new TableCell({ children: [new Paragraph("Cell 1")] }),
        new TableCell({ children: [new Paragraph("Cell 2")] }),
        new TableCell({ children: [new Paragraph("Cell 3")] }),
      ],
    }),
  ],
});

const doc = new Document({
  sections: [{
    children: [
      new Paragraph({ text: "Table Example", heading: HeadingLevel.HEADING_1 }),
      table,
    ],
  }],
});

javascript

import { Document, Paragraph, Table, TableRow, TableCell, Packer } from 'docx';

const table = new Table({
  rows: [
    new TableRow({
      children: [
        new TableCell({ children: [new Paragraph("Header 1")] }),
        new TableCell({ children: [new Paragraph("Header 2")] }),
        new TableCell({ children: [new Paragraph("Header 3")] }),
      ],
    }),
    new TableRow({
      children: [
        new TableCell({ children: [new Paragraph("Cell 1")] }),
        new TableCell({ children: [new Paragraph("Cell 2")] }),
        new TableCell({ children: [new Paragraph("Cell 3")] }),
      ],
    }),
  ],
});

const doc = new Document({
  sections: [{
    children: [
      new Paragraph({ text: "Table Example", heading: HeadingLevel.HEADING_1 }),
      table,
    ],
  }],
});

Edit Existing Document (OOXML)

编辑现有文档（OOXML）

For complex edits, work with raw OOXML:

Unpack the document:

bash

python ooxml/scripts/unpack.py document.docx unpacked/

Edit XML files (primarily
```
word/document.xml
```
)

Validate and pack:

bash

python ooxml/scripts/validate.py unpacked/ --original document.docx
python ooxml/scripts/pack.py unpacked/ output.docx

对于复杂编辑，可直接操作原始OOXML：

解压文档：

bash

python ooxml/scripts/unpack.py document.docx unpacked/

编辑XML文件（主要是
```
word/document.xml
```
）

验证并打包：

bash

python ooxml/scripts/validate.py unpacked/ --original document.docx
python ooxml/scripts/pack.py unpacked/ output.docx

Tracked Changes Workflow

修订模式工作流

For document review with track changes:

bash

undefined

适用于带修订标记的文档审阅场景：

bash

undefined

1. Get current state

1. 获取当前状态

pandoc --track-changes=all document.docx -o current.md

2. Unpack

2. 解压

python ooxml/scripts/unpack.py document.docx unpacked/

3. Edit using tracked change patterns

3. 使用修订标记模式编辑

Use <w:ins> for insertions, <w:del> for deletions

用<w:ins>表示插入内容，<w:del>表示删除内容

4. Pack final document

4. 打包最终文档

python ooxml/scripts/pack.py unpacked/ reviewed.docx

---

python ooxml/scripts/pack.py unpacked/ reviewed.docx

---

PowerPoint (PPTX) Processing

PowerPoint（PPTX）处理

Text Extraction

文本提取

bash

python -m markitdown presentation.pptx

bash

python -m markitdown presentation.pptx

Create New Presentation (PptxGenJS)

创建新演示文稿（PptxGenJS）

javascript

import PptxGenJS from 'pptxgenjs';

const pptx = new PptxGenJS();

// Slide 1 - Title
const slide1 = pptx.addSlide();
slide1.addText("Presentation Title", {
  x: 1, y: 2, w: 8, h: 1.5,
  fontSize: 36,
  bold: true,
  color: "363636",
  align: "center",
});
slide1.addText("Subtitle goes here", {
  x: 1, y: 3.5, w: 8, h: 0.5,
  fontSize: 18,
  color: "666666",
  align: "center",
});

// Slide 2 - Content
const slide2 = pptx.addSlide();
slide2.addText("Key Points", {
  x: 0.5, y: 0.5, w: 9, h: 0.8,
  fontSize: 28,
  bold: true,
});
slide2.addText([
  { text: "• First important point\n", options: { bullet: true } },
  { text: "• Second important point\n", options: { bullet: true } },
  { text: "• Third important point\n", options: { bullet: true } },
], {
  x: 0.5, y: 1.5, w: 9, h: 3,
  fontSize: 18,
});

// Slide 3 - Chart
const slide3 = pptx.addSlide();
slide3.addChart(pptx.ChartType.bar, [
  { name: "Q1", labels: ["Jan", "Feb", "Mar"], values: [100, 200, 300] },
  { name: "Q2", labels: ["Apr", "May", "Jun"], values: [150, 250, 350] },
], {
  x: 1, y: 1, w: 8, h: 4,
  showLegend: true,
  legendPos: "b",
});

// Save
pptx.writeFile("output.pptx");

javascript

import PptxGenJS from 'pptxgenjs';

const pptx = new PptxGenJS();

// Slide 1 - Title
const slide1 = pptx.addSlide();
slide1.addText("Presentation Title", {
  x: 1, y: 2, w: 8, h: 1.5,
  fontSize: 36,
  bold: true,
  color: "363636",
  align: "center",
});
slide1.addText("Subtitle goes here", {
  x: 1, y: 3.5, w: 8, h: 0.5,
  fontSize: 18,
  color: "666666",
  align: "center",
});

// Slide 2 - Content
const slide2 = pptx.addSlide();
slide2.addText("Key Points", {
  x: 0.5, y: 0.5, w: 9, h: 0.8,
  fontSize: 28,
  bold: true,
});
slide2.addText([
  { text: "• First important point\n", options: { bullet: true } },
  { text: "• Second important point\n", options: { bullet: true } },
  { text: "• Third important point\n", options: { bullet: true } },
], {
  x: 0.5, y: 1.5, w: 9, h: 3,
  fontSize: 18,
});

// Slide 3 - Chart
const slide3 = pptx.addSlide();
slide3.addChart(pptx.ChartType.bar, [
  { name: "Q1", labels: ["Jan", "Feb", "Mar"], values: [100, 200, 300] },
  { name: "Q2", labels: ["Apr", "May", "Jun"], values: [150, 250, 350] },
], {
  x: 1, y: 1, w: 8, h: 4,
  showLegend: true,
  legendPos: "b",
});

// Save
pptx.writeFile("output.pptx");

Edit Existing Presentation (OOXML)

编辑现有演示文稿（OOXML）

bash

undefined

bash

undefined

1. Unpack

1. 解压

python ooxml/scripts/unpack.py presentation.pptx unpacked/

2. Key files:

2. 关键文件：

- ppt/slides/slide1.xml, slide2.xml, etc.

- ppt/notesSlides/ for speaker notes

- ppt/theme/ for styling

3. Validate and pack

3. 验证并打包

python ooxml/scripts/validate.py unpacked/ --original presentation.pptx python ooxml/scripts/pack.py unpacked/ output.pptx

undefined

python ooxml/scripts/validate.py unpacked/ --original presentation.pptx python ooxml/scripts/pack.py unpacked/ output.pptx

undefined

Create Thumbnail Grid

创建缩略图网格

bash

undefined

bash

undefined

Create visual overview of all slides

python scripts/thumbnail.py presentation.pptx --cols 4

undefined

python scripts/thumbnail.py presentation.pptx --cols 4

undefined

Convert Slides to Images

将幻灯片转换为图片

bash

undefined

bash

undefined

Convert to PDF first

soffice --headless --convert-to pdf presentation.pptx

Then PDF to images

pdftoppm -jpeg -r 150 presentation.pdf slide

Creates slide-1.jpg, slide-2.jpg, etc.

---

---

Excel (XLSX) Processing

Excel（XLSX）处理

Data Analysis with Pandas

使用Pandas进行数据分析

python

import pandas as pd

python

import pandas as pd

Read Excel

df = pd.read_excel('file.xlsx') # Default: first sheet all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict

Analyze

df.head() # Preview data df.info() # Column info df.describe() # Statistics

Filter and transform

filtered = df[df['Sales'] > 1000] grouped = df.groupby('Category')['Revenue'].sum()

Write Excel

df.to_excel('output.xlsx', index=False)

undefined

df.to_excel('output.xlsx', index=False)

undefined

Create Excel with Formulas (openpyxl)

使用openpyxl创建带公式的Excel

python

from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment

wb = Workbook()
sheet = wb.active

python

from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment

wb = Workbook()
sheet = wb.active

Add data

sheet['A1'] = 'Product' sheet['B1'] = 'Price' sheet['C1'] = 'Quantity' sheet['D1'] = 'Total'

Header formatting

for cell in ['A1', 'B1', 'C1', 'D1']: sheet[cell].font = Font(bold=True, color='FFFFFF') sheet[cell].fill = PatternFill('solid', start_color='4472C4') sheet[cell].alignment = Alignment(horizontal='center')

Add data rows

data = [ ('Widget A', 10.00, 5), ('Widget B', 15.00, 3), ('Widget C', 20.00, 8), ]

for row_idx, (product, price, qty) in enumerate(data, start=2): sheet[f'A{row_idx}'] = product sheet[f'B{row_idx}'] = price sheet[f'C{row_idx}'] = qty # FORMULA - not hardcoded value! sheet[f'D{row_idx}'] = f'=B{row_idx}*C{row_idx}'

data = [ ('Widget A', 10.00, 5), ('Widget B', 15.00, 3), ('Widget C', 20.00, 8), ]

Add sum formula at bottom

last_row = len(data) + 2 sheet[f'D{last_row}'] = f'=SUM(D2:D{last_row-1})'

Column width

sheet.column_dimensions['A'].width = 15 sheet.column_dimensions['B'].width = 10 sheet.column_dimensions['C'].width = 10 sheet.column_dimensions['D'].width = 10

wb.save('output.xlsx')

undefined

sheet.column_dimensions['A'].width = 15 sheet.column_dimensions['B'].width = 10 sheet.column_dimensions['C'].width = 10 sheet.column_dimensions['D'].width = 10

wb.save('output.xlsx')

undefined

Financial Model Standards

创建缩略图网格

Color Coding

—

python

from openpyxl.styles import Font

bash

undefined

Industry-standard colors

Create visual overview of all slides

BLUE = Font(color='0000FF') # Hardcoded inputs BLACK = Font(color='000000') # Formulas GREEN = Font(color='008000') # Links from other sheets RED = Font(color='FF0000') # External links

python scripts/thumbnail.py presentation.pptx --cols 4

undefined

Apply to cells

将幻灯片转换为图片

sheet['B5'].font = BLUE # User input sheet['B6'].font = BLACK # Formula

undefined

bash

undefined

Number Formatting

Convert to PDF first

python

undefined

soffice --headless --convert-to pdf presentation.pptx

Currency with thousands separator

Then PDF to images

sheet['B5'].number_format = '$#,##0'

pdftoppm -jpeg -r 150 presentation.pdf slide

Percentage with one decimal

Creates slide-1.jpg, slide-2.jpg, etc.

sheet['B6'].number_format = '0.0%'

---

Zeros as dashes

Excel（XLSX）处理

—

使用Pandas进行数据分析

sheet['B7'].number_format = '$#,##0;($#,##0);"-"'

python

import pandas as pd

Multiples

Read Excel

sheet['B8'].number_format = '0.0x'

undefined

df = pd.read_excel('file.xlsx') # Default: first sheet all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict

CRITICAL: Use Formulas, Not Hardcoded Values

Analyze

python

undefined

df.head() # Preview data df.info() # Column info df.describe() # Statistics

❌ WRONG - Hardcoding calculated values

Filter and transform

total = df['Sales'].sum() sheet['B10'] = total # Hardcodes 5000

filtered = df[df['Sales'] > 1000] grouped = df.groupby('Category')['Revenue'].sum()

✅ CORRECT - Use Excel formulas

Write Excel

sheet['B10'] = '=SUM(B2:B9)'

df.to_excel('output.xlsx', index=False)

undefined

❌ WRONG - Computing in Python

使用openpyxl创建带公式的Excel

growth = (current - previous) / previous sheet['C5'] = growth

python

from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment

wb = Workbook()
sheet = wb.active

✅ CORRECT - Excel formula

Add data

sheet['C5'] = '=(C4-C2)/C2'

undefined

sheet['A1'] = 'Product' sheet['B1'] = 'Price' sheet['C1'] = 'Quantity' sheet['D1'] = 'Total'

Edit Existing Excel

Header formatting

python

from openpyxl import load_workbook

Load with formulas preserved

Add data rows

wb = load_workbook('existing.xlsx') sheet = wb.active

data = [ ('Widget A', 10.00, 5), ('Widget B', 15.00, 3), ('Widget C', 20.00, 8), ]

Modify cells

Add sum formula at bottom

sheet['A1'] = 'New Value' sheet.insert_rows(2) sheet.delete_cols(3)

last_row = len(data) + 2 sheet[f'D{last_row}'] = f'=SUM(D2:D{last_row-1})'

Add new sheet

Column width

new_sheet = wb.create_sheet('Analysis') new_sheet['A1'] = 'Data'

wb.save('modified.xlsx')

undefined

sheet.column_dimensions['A'].width = 15 sheet.column_dimensions['B'].width = 10 sheet.column_dimensions['C'].width = 10 sheet.column_dimensions['D'].width = 10

wb.save('output.xlsx')

undefined

Recalculate Formulas

财务模型标准

—

颜色编码

After creating/modifying Excel files with formulas:

bash

undefined

python

from openpyxl.styles import Font

Recalculate all formulas using LibreOffice

Industry-standard colors

python recalc.py output.xlsx

---

BLUE = Font(color='0000FF') # Hardcoded inputs BLACK = Font(color='000000') # Formulas GREEN = Font(color='008000') # Links from other sheets RED = Font(color='FF0000') # External links

Dependencies

Apply to cells

Install as needed:

bash

undefined

sheet['B5'].font = BLUE # User input sheet['B6'].font = BLACK # Formula

undefined

PDF

数字格式

pip install pypdf pdfplumber reportlab pytesseract pdf2image

python

undefined

DOCX

Currency with thousands separator

npm install -g docx pip install "markitdown[docx]"

sheet['B5'].number_format = '$#,##0'

PPTX

Percentage with one decimal

npm install -g pptxgenjs pip install "markitdown[pptx]"

sheet['B6'].number_format = '0.0%'

XLSX

Zeros as dashes

pip install pandas openpyxl

sheet['B7'].number_format = '$#,##0;($#,##0);"-"'

Command line tools

Multiples

sudo apt-get install poppler-utils qpdf libreoffice pandoc

---

sheet['B8'].number_format = '0.0x'

undefined

Quick Task Reference

重要提示：使用公式而非硬编码值

I want to...	Command/Code
Extract PDF text	`pdfplumber.open(f).pages[0].extract_text()`
Merge PDFs	`pypdf.PdfWriter()` + loop
Split PDF	One `PdfWriter()` per page
OCR scanned PDF	`pdf2image` → `pytesseract`
Convert DOCX to MD	`pandoc doc.docx -o doc.md`
Create DOCX	`docx-js` (JavaScript)
Extract PPTX text	`python -m markitdown pres.pptx`
Create PPTX	`PptxGenJS` (JavaScript)
Analyze Excel	`pandas.read_excel()`
Excel with formulas	`openpyxl`

python

undefined

Credits & Attribution

❌ WRONG - Hardcoding calculated values

This skill is adapted from Anthropic's Skills.

Original repositories:

https://github.com/anthropics/skills (pdf, docx, pptx, xlsx)

Copyright (c) Anthropic - MIT License
Adapted by webconsulting.at for this skill collection

total = df['Sales'].sum() sheet['B10'] = total # Hardcodes 5000

—

✅ CORRECT - Use Excel formulas

—

sheet['B10'] = '=SUM(B2:B9)'

—

❌ WRONG - Computing in Python

—

growth = (current - previous) / previous sheet['C5'] = growth

—

✅ CORRECT - Excel formula

—

sheet['C5'] = '=(C4-C2)/C2'

undefined

—

编辑现有Excel文档

—

python

from openpyxl import load_workbook

—

Load with formulas preserved

—

wb = load_workbook('existing.xlsx') sheet = wb.active

—

Modify cells

—

sheet['A1'] = 'New Value' sheet.insert_rows(2) sheet.delete_cols(3)

—

Add new sheet

—

new_sheet = wb.create_sheet('Analysis') new_sheet['A1'] = 'Data'

wb.save('modified.xlsx')

undefined

—

重新计算公式

—

创建或修改带公式的Excel文档后：

bash

undefined

—

Recalculate all formulas using LibreOffice

—

python recalc.py output.xlsx

---

—

依赖安装

—

根据需要安装以下依赖：

bash

undefined

—

PDF

—

pip install pypdf pdfplumber reportlab pytesseract pdf2image

—

DOCX

—

npm install -g docx pip install "markitdown[docx]"

—

PPTX

—

npm install -g pptxgenjs pip install "markitdown[pptx]"

—

XLSX

—

pip install pandas openpyxl

—

Command line tools

—

sudo apt-get install poppler-utils qpdf libreoffice pandoc

---

—

快速任务参考

—

我想要...	命令/代码
提取PDF文本	`pdfplumber.open(f).pages[0].extract_text()`
合并PDF	`pypdf.PdfWriter()` + 循环
拆分PDF	每页创建一个 `PdfWriter()`
OCR扫描PDF	`pdf2image` → `pytesseract`
将DOCX转换为MD	`pandoc doc.docx -o doc.md`
创建DOCX	`docx-js` （JavaScript）
提取PPTX文本	`python -m markitdown pres.pptx`
创建PPTX	`PptxGenJS` （JavaScript）
分析Excel	`pandas.read_excel()`
带公式的Excel	`openpyxl`

—

致谢与版权说明

—

本技能改编自**Anthropic's Skills**。

原始仓库：

https://github.com/anthropics/skills (pdf, docx, pptx, xlsx)