document-processing

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Document Processing

文档处理

Source: This skill is adapted from Anthropic's Skills document processing skills (pdf, docx, pptx, xlsx) for Claude Code and AI agents.
Create, edit, and analyze office documents including PDFs, Word documents, PowerPoint presentations, and Excel spreadsheets.

来源: 本技能改编自**Anthropic's Skills**中面向Claude Code和AI Agent的文档处理技能(pdf、docx、pptx、xlsx)。
创建、编辑和分析各类办公文档,包括PDF、Word文档、PowerPoint演示文稿和Excel电子表格。

Quick Reference: Which Tool to Use

快速参考:选择合适的工具

TaskDocument TypeBest Tool
Extract textPDF
pdfplumber
,
pdftotext
Merge/splitPDF
pypdf
,
qpdf
Fill formsPDF
pdf-lib
(JS),
pypdf
Create newPDF
reportlab
OCR scannedPDF
pytesseract
+
pdf2image
Extract textDOCX
pandoc
,
markitdown
Create newDOCX
docx-js
(JS)
Edit existingDOCXOOXML (unpack/edit/pack)
Extract textPPTX
markitdown
Create newPPTX
html2pptx
,
PptxGenJS
Edit existingPPTXOOXML (unpack/edit/pack)
Data analysisXLSX
pandas
Formulas/formattingXLSX
openpyxl

任务文档类型推荐工具
提取文本PDF
pdfplumber
,
pdftotext
合并/拆分PDF
pypdf
,
qpdf
填写表单PDF
pdf-lib
(JS),
pypdf
创建新文档PDF
reportlab
OCR扫描件PDF
pytesseract
+
pdf2image
提取文本DOCX
pandoc
,
markitdown
创建新文档DOCX
docx-js
(JS)
编辑现有文档DOCXOOXML(解压/编辑/打包)
提取文本PPTX
markitdown
创建新文档PPTX
html2pptx
,
PptxGenJS
编辑现有文档PPTXOOXML(解压/编辑/打包)
数据分析XLSX
pandas
公式/格式设置XLSX
openpyxl

PDF Processing

PDF处理

Text Extraction

文本提取

python
import pdfplumber
python
import pdfplumber

Extract text with layout preservation

Extract text with layout preservation

with pdfplumber.open("document.pdf") as pdf: for page in pdf.pages: text = page.extract_text() print(text)
undefined
with pdfplumber.open("document.pdf") as pdf: for page in pdf.pages: text = page.extract_text() print(text)
undefined

Table Extraction

表格提取

python
import pdfplumber
import pandas as pd

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)
python
import pdfplumber
import pandas as pd

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

Combine all tables

Combine all tables

if all_tables: combined_df = pd.concat(all_tables, ignore_index=True) combined_df.to_excel("extracted_tables.xlsx", index=False)
undefined
if all_tables: combined_df = pd.concat(all_tables, ignore_index=True) combined_df.to_excel("extracted_tables.xlsx", index=False)
undefined

Merge PDFs

合并PDF

python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

Split PDF

拆分PDF

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)

Rotate Pages

旋转页面

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)

OCR Scanned PDFs

OCR扫描PDF

python
undefined
python
undefined

Requires: pip install pytesseract pdf2image

Requires: pip install pytesseract pdf2image

import pytesseract from pdf2image import convert_from_path
import pytesseract from pdf2image import convert_from_path

Convert PDF to images

Convert PDF to images

images = convert_from_path('scanned.pdf')
images = convert_from_path('scanned.pdf')

OCR each page

OCR each page

text = "" for i, image in enumerate(images): text += f"Page {i+1}:\n" text += pytesseract.image_to_string(image) text += "\n\n"
print(text)
undefined
text = "" for i, image in enumerate(images): text += f"Page {i+1}:\n" text += pytesseract.image_to_string(image) text += "\n\n"
print(text)
undefined

Add Watermark

添加水印

python
from pypdf import PdfReader, PdfWriter

watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
python
from pypdf import PdfReader, PdfWriter

watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

Password Protection

密码保护

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)

Create PDF with ReportLab

使用ReportLab创建PDF

python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []
python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

Add content

Add content

title = Paragraph("Report Title", styles['Title']) story.append(title) story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal']) story.append(body) story.append(PageBreak())
title = Paragraph("Report Title", styles['Title']) story.append(title) story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal']) story.append(body) story.append(PageBreak())

Page 2

Page 2

story.append(Paragraph("Page 2", styles['Heading1'])) story.append(Paragraph("Content for page 2", styles['Normal']))
doc.build(story)
undefined
story.append(Paragraph("Page 2", styles['Heading1'])) story.append(Paragraph("Content for page 2", styles['Normal']))
doc.build(story)
undefined

Command Line Tools

命令行工具

bash
undefined
bash
undefined

Extract text (poppler-utils)

Extract text (poppler-utils)

pdftotext input.pdf output.txt pdftotext -layout input.pdf output.txt # Preserve layout
pdftotext input.pdf output.txt pdftotext -layout input.pdf output.txt # Preserve layout

Merge PDFs (qpdf)

Merge PDFs (qpdf)

qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

Split pages

Split pages

qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf

Rotate pages

Rotate pages

qpdf input.pdf output.pdf --rotate=+90:1
qpdf input.pdf output.pdf --rotate=+90:1

Remove password

Remove password

qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf

Extract images

Extract images

pdfimages -j input.pdf output_prefix

---
pdfimages -j input.pdf output_prefix

---

Word Document (DOCX) Processing

Word文档(DOCX)处理

Text Extraction

文本提取

bash
undefined
bash
undefined

Convert to markdown with pandoc

Convert to markdown with pandoc

pandoc document.docx -o output.md
pandoc document.docx -o output.md

With tracked changes preserved

With tracked changes preserved

pandoc --track-changes=all document.docx -o output.md
undefined
pandoc --track-changes=all document.docx -o output.md
undefined

Create New Document (docx-js)

创建新文档(docx-js)

javascript
import { Document, Paragraph, TextRun, HeadingLevel, Packer } from 'docx';
import * as fs from 'fs';

const doc = new Document({
  sections: [{
    properties: {},
    children: [
      new Paragraph({
        text: "Document Title",
        heading: HeadingLevel.HEADING_1,
      }),
      new Paragraph({
        children: [
          new TextRun("This is a "),
          new TextRun({
            text: "bold",
            bold: true,
          }),
          new TextRun(" word in a paragraph."),
        ],
      }),
      new Paragraph({
        text: "This is another paragraph.",
      }),
    ],
  }],
});

// Export to file
const buffer = await Packer.toBuffer(doc);
fs.writeFileSync("output.docx", buffer);
javascript
import { Document, Paragraph, TextRun, HeadingLevel, Packer } from 'docx';
import * as fs from 'fs';

const doc = new Document({
  sections: [{
    properties: {},
    children: [
      new Paragraph({
        text: "Document Title",
        heading: HeadingLevel.HEADING_1,
      }),
      new Paragraph({
        children: [
          new TextRun("This is a "),
          new TextRun({
            text: "bold",
            bold: true,
          }),
          new TextRun(" word in a paragraph."),
        ],
      }),
      new Paragraph({
        text: "This is another paragraph.",
      }),
    ],
  }],
});

// Export to file
const buffer = await Packer.toBuffer(doc);
fs.writeFileSync("output.docx", buffer);

Create Document with Tables

创建带表格的文档

javascript
import { Document, Paragraph, Table, TableRow, TableCell, Packer } from 'docx';

const table = new Table({
  rows: [
    new TableRow({
      children: [
        new TableCell({ children: [new Paragraph("Header 1")] }),
        new TableCell({ children: [new Paragraph("Header 2")] }),
        new TableCell({ children: [new Paragraph("Header 3")] }),
      ],
    }),
    new TableRow({
      children: [
        new TableCell({ children: [new Paragraph("Cell 1")] }),
        new TableCell({ children: [new Paragraph("Cell 2")] }),
        new TableCell({ children: [new Paragraph("Cell 3")] }),
      ],
    }),
  ],
});

const doc = new Document({
  sections: [{
    children: [
      new Paragraph({ text: "Table Example", heading: HeadingLevel.HEADING_1 }),
      table,
    ],
  }],
});
javascript
import { Document, Paragraph, Table, TableRow, TableCell, Packer } from 'docx';

const table = new Table({
  rows: [
    new TableRow({
      children: [
        new TableCell({ children: [new Paragraph("Header 1")] }),
        new TableCell({ children: [new Paragraph("Header 2")] }),
        new TableCell({ children: [new Paragraph("Header 3")] }),
      ],
    }),
    new TableRow({
      children: [
        new TableCell({ children: [new Paragraph("Cell 1")] }),
        new TableCell({ children: [new Paragraph("Cell 2")] }),
        new TableCell({ children: [new Paragraph("Cell 3")] }),
      ],
    }),
  ],
});

const doc = new Document({
  sections: [{
    children: [
      new Paragraph({ text: "Table Example", heading: HeadingLevel.HEADING_1 }),
      table,
    ],
  }],
});

Edit Existing Document (OOXML)

编辑现有文档(OOXML)

For complex edits, work with raw OOXML:
  1. Unpack the document:
    bash
    python ooxml/scripts/unpack.py document.docx unpacked/
  2. Edit XML files (primarily
    word/document.xml
    )
  3. Validate and pack:
    bash
    python ooxml/scripts/validate.py unpacked/ --original document.docx
    python ooxml/scripts/pack.py unpacked/ output.docx
对于复杂编辑,可直接操作原始OOXML:
  1. 解压文档:
    bash
    python ooxml/scripts/unpack.py document.docx unpacked/
  2. 编辑XML文件(主要是
    word/document.xml
  3. 验证并打包:
    bash
    python ooxml/scripts/validate.py unpacked/ --original document.docx
    python ooxml/scripts/pack.py unpacked/ output.docx

Tracked Changes Workflow

修订模式工作流

For document review with track changes:
bash
undefined
适用于带修订标记的文档审阅场景:
bash
undefined

1. Get current state

1. 获取当前状态

pandoc --track-changes=all document.docx -o current.md
pandoc --track-changes=all document.docx -o current.md

2. Unpack

2. 解压

python ooxml/scripts/unpack.py document.docx unpacked/
python ooxml/scripts/unpack.py document.docx unpacked/

3. Edit using tracked change patterns

3. 使用修订标记模式编辑

Use <w:ins> for insertions, <w:del> for deletions

用<w:ins>表示插入内容,<w:del>表示删除内容

4. Pack final document

4. 打包最终文档

python ooxml/scripts/pack.py unpacked/ reviewed.docx

---
python ooxml/scripts/pack.py unpacked/ reviewed.docx

---

PowerPoint (PPTX) Processing

PowerPoint(PPTX)处理

Text Extraction

文本提取

bash
python -m markitdown presentation.pptx
bash
python -m markitdown presentation.pptx

Create New Presentation (PptxGenJS)

创建新演示文稿(PptxGenJS)

javascript
import PptxGenJS from 'pptxgenjs';

const pptx = new PptxGenJS();

// Slide 1 - Title
const slide1 = pptx.addSlide();
slide1.addText("Presentation Title", {
  x: 1, y: 2, w: 8, h: 1.5,
  fontSize: 36,
  bold: true,
  color: "363636",
  align: "center",
});
slide1.addText("Subtitle goes here", {
  x: 1, y: 3.5, w: 8, h: 0.5,
  fontSize: 18,
  color: "666666",
  align: "center",
});

// Slide 2 - Content
const slide2 = pptx.addSlide();
slide2.addText("Key Points", {
  x: 0.5, y: 0.5, w: 9, h: 0.8,
  fontSize: 28,
  bold: true,
});
slide2.addText([
  { text: "• First important point\n", options: { bullet: true } },
  { text: "• Second important point\n", options: { bullet: true } },
  { text: "• Third important point\n", options: { bullet: true } },
], {
  x: 0.5, y: 1.5, w: 9, h: 3,
  fontSize: 18,
});

// Slide 3 - Chart
const slide3 = pptx.addSlide();
slide3.addChart(pptx.ChartType.bar, [
  { name: "Q1", labels: ["Jan", "Feb", "Mar"], values: [100, 200, 300] },
  { name: "Q2", labels: ["Apr", "May", "Jun"], values: [150, 250, 350] },
], {
  x: 1, y: 1, w: 8, h: 4,
  showLegend: true,
  legendPos: "b",
});

// Save
pptx.writeFile("output.pptx");
javascript
import PptxGenJS from 'pptxgenjs';

const pptx = new PptxGenJS();

// Slide 1 - Title
const slide1 = pptx.addSlide();
slide1.addText("Presentation Title", {
  x: 1, y: 2, w: 8, h: 1.5,
  fontSize: 36,
  bold: true,
  color: "363636",
  align: "center",
});
slide1.addText("Subtitle goes here", {
  x: 1, y: 3.5, w: 8, h: 0.5,
  fontSize: 18,
  color: "666666",
  align: "center",
});

// Slide 2 - Content
const slide2 = pptx.addSlide();
slide2.addText("Key Points", {
  x: 0.5, y: 0.5, w: 9, h: 0.8,
  fontSize: 28,
  bold: true,
});
slide2.addText([
  { text: "• First important point\n", options: { bullet: true } },
  { text: "• Second important point\n", options: { bullet: true } },
  { text: "• Third important point\n", options: { bullet: true } },
], {
  x: 0.5, y: 1.5, w: 9, h: 3,
  fontSize: 18,
});

// Slide 3 - Chart
const slide3 = pptx.addSlide();
slide3.addChart(pptx.ChartType.bar, [
  { name: "Q1", labels: ["Jan", "Feb", "Mar"], values: [100, 200, 300] },
  { name: "Q2", labels: ["Apr", "May", "Jun"], values: [150, 250, 350] },
], {
  x: 1, y: 1, w: 8, h: 4,
  showLegend: true,
  legendPos: "b",
});

// Save
pptx.writeFile("output.pptx");

Edit Existing Presentation (OOXML)

编辑现有演示文稿(OOXML)

bash
undefined
bash
undefined

1. Unpack

1. 解压

python ooxml/scripts/unpack.py presentation.pptx unpacked/
python ooxml/scripts/unpack.py presentation.pptx unpacked/

2. Key files:

2. 关键文件:

- ppt/slides/slide1.xml, slide2.xml, etc.

- ppt/slides/slide1.xml, slide2.xml, etc.

- ppt/notesSlides/ for speaker notes

- ppt/notesSlides/ for speaker notes

- ppt/theme/ for styling

- ppt/theme/ for styling

3. Validate and pack

3. 验证并打包

python ooxml/scripts/validate.py unpacked/ --original presentation.pptx python ooxml/scripts/pack.py unpacked/ output.pptx
undefined
python ooxml/scripts/validate.py unpacked/ --original presentation.pptx python ooxml/scripts/pack.py unpacked/ output.pptx
undefined

Create Thumbnail Grid

创建缩略图网格

bash
undefined
bash
undefined

Create visual overview of all slides

Create visual overview of all slides

python scripts/thumbnail.py presentation.pptx --cols 4
undefined
python scripts/thumbnail.py presentation.pptx --cols 4
undefined

Convert Slides to Images

将幻灯片转换为图片

bash
undefined
bash
undefined

Convert to PDF first

Convert to PDF first

soffice --headless --convert-to pdf presentation.pptx
soffice --headless --convert-to pdf presentation.pptx

Then PDF to images

Then PDF to images

pdftoppm -jpeg -r 150 presentation.pdf slide
pdftoppm -jpeg -r 150 presentation.pdf slide

Creates slide-1.jpg, slide-2.jpg, etc.

Creates slide-1.jpg, slide-2.jpg, etc.


---

---

Excel (XLSX) Processing

Excel(XLSX)处理

Data Analysis with Pandas

使用Pandas进行数据分析

python
import pandas as pd
python
import pandas as pd

Read Excel

Read Excel

df = pd.read_excel('file.xlsx') # Default: first sheet all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict
df = pd.read_excel('file.xlsx') # Default: first sheet all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict

Analyze

Analyze

df.head() # Preview data df.info() # Column info df.describe() # Statistics
df.head() # Preview data df.info() # Column info df.describe() # Statistics

Filter and transform

Filter and transform

filtered = df[df['Sales'] > 1000] grouped = df.groupby('Category')['Revenue'].sum()
filtered = df[df['Sales'] > 1000] grouped = df.groupby('Category')['Revenue'].sum()

Write Excel

Write Excel

df.to_excel('output.xlsx', index=False)
undefined
df.to_excel('output.xlsx', index=False)
undefined

Create Excel with Formulas (openpyxl)

使用openpyxl创建带公式的Excel

python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment

wb = Workbook()
sheet = wb.active
python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment

wb = Workbook()
sheet = wb.active

Add data

Add data

sheet['A1'] = 'Product' sheet['B1'] = 'Price' sheet['C1'] = 'Quantity' sheet['D1'] = 'Total'
sheet['A1'] = 'Product' sheet['B1'] = 'Price' sheet['C1'] = 'Quantity' sheet['D1'] = 'Total'

Header formatting

Header formatting

for cell in ['A1', 'B1', 'C1', 'D1']: sheet[cell].font = Font(bold=True, color='FFFFFF') sheet[cell].fill = PatternFill('solid', start_color='4472C4') sheet[cell].alignment = Alignment(horizontal='center')
for cell in ['A1', 'B1', 'C1', 'D1']: sheet[cell].font = Font(bold=True, color='FFFFFF') sheet[cell].fill = PatternFill('solid', start_color='4472C4') sheet[cell].alignment = Alignment(horizontal='center')

Add data rows

Add data rows

data = [ ('Widget A', 10.00, 5), ('Widget B', 15.00, 3), ('Widget C', 20.00, 8), ]
for row_idx, (product, price, qty) in enumerate(data, start=2): sheet[f'A{row_idx}'] = product sheet[f'B{row_idx}'] = price sheet[f'C{row_idx}'] = qty # FORMULA - not hardcoded value! sheet[f'D{row_idx}'] = f'=B{row_idx}*C{row_idx}'
data = [ ('Widget A', 10.00, 5), ('Widget B', 15.00, 3), ('Widget C', 20.00, 8), ]
for row_idx, (product, price, qty) in enumerate(data, start=2): sheet[f'A{row_idx}'] = product sheet[f'B{row_idx}'] = price sheet[f'C{row_idx}'] = qty # FORMULA - not hardcoded value! sheet[f'D{row_idx}'] = f'=B{row_idx}*C{row_idx}'

Add sum formula at bottom

Add sum formula at bottom

last_row = len(data) + 2 sheet[f'D{last_row}'] = f'=SUM(D2:D{last_row-1})'
last_row = len(data) + 2 sheet[f'D{last_row}'] = f'=SUM(D2:D{last_row-1})'

Column width

Column width

sheet.column_dimensions['A'].width = 15 sheet.column_dimensions['B'].width = 10 sheet.column_dimensions['C'].width = 10 sheet.column_dimensions['D'].width = 10
wb.save('output.xlsx')
undefined
sheet.column_dimensions['A'].width = 15 sheet.column_dimensions['B'].width = 10 sheet.column_dimensions['C'].width = 10 sheet.column_dimensions['D'].width = 10
wb.save('output.xlsx')
undefined

Financial Model Standards

创建缩略图网格

Color Coding

python
from openpyxl.styles import Font
bash
undefined

Industry-standard colors

Create visual overview of all slides

BLUE = Font(color='0000FF') # Hardcoded inputs BLACK = Font(color='000000') # Formulas GREEN = Font(color='008000') # Links from other sheets RED = Font(color='FF0000') # External links
python scripts/thumbnail.py presentation.pptx --cols 4
undefined

Apply to cells

将幻灯片转换为图片

sheet['B5'].font = BLUE # User input sheet['B6'].font = BLACK # Formula
undefined
bash
undefined

Number Formatting

Convert to PDF first

python
undefined
soffice --headless --convert-to pdf presentation.pptx

Currency with thousands separator

Then PDF to images

sheet['B5'].number_format = '$#,##0'
pdftoppm -jpeg -r 150 presentation.pdf slide

Percentage with one decimal

Creates slide-1.jpg, slide-2.jpg, etc.

sheet['B6'].number_format = '0.0%'

---

Zeros as dashes

Excel(XLSX)处理

使用Pandas进行数据分析

sheet['B7'].number_format = '$#,##0;($#,##0);"-"'
python
import pandas as pd

Multiples

Read Excel

sheet['B8'].number_format = '0.0x'
undefined
df = pd.read_excel('file.xlsx') # Default: first sheet all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict

CRITICAL: Use Formulas, Not Hardcoded Values

Analyze

python
undefined
df.head() # Preview data df.info() # Column info df.describe() # Statistics

❌ WRONG - Hardcoding calculated values

Filter and transform

total = df['Sales'].sum() sheet['B10'] = total # Hardcodes 5000
filtered = df[df['Sales'] > 1000] grouped = df.groupby('Category')['Revenue'].sum()

✅ CORRECT - Use Excel formulas

Write Excel

sheet['B10'] = '=SUM(B2:B9)'
df.to_excel('output.xlsx', index=False)
undefined

❌ WRONG - Computing in Python

使用openpyxl创建带公式的Excel

growth = (current - previous) / previous sheet['C5'] = growth
python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment

wb = Workbook()
sheet = wb.active

✅ CORRECT - Excel formula

Add data

sheet['C5'] = '=(C4-C2)/C2'
undefined
sheet['A1'] = 'Product' sheet['B1'] = 'Price' sheet['C1'] = 'Quantity' sheet['D1'] = 'Total'

Edit Existing Excel

Header formatting

python
from openpyxl import load_workbook
for cell in ['A1', 'B1', 'C1', 'D1']: sheet[cell].font = Font(bold=True, color='FFFFFF') sheet[cell].fill = PatternFill('solid', start_color='4472C4') sheet[cell].alignment = Alignment(horizontal='center')

Load with formulas preserved

Add data rows

wb = load_workbook('existing.xlsx') sheet = wb.active
data = [ ('Widget A', 10.00, 5), ('Widget B', 15.00, 3), ('Widget C', 20.00, 8), ]
for row_idx, (product, price, qty) in enumerate(data, start=2): sheet[f'A{row_idx}'] = product sheet[f'B{row_idx}'] = price sheet[f'C{row_idx}'] = qty # FORMULA - not hardcoded value! sheet[f'D{row_idx}'] = f'=B{row_idx}*C{row_idx}'

Modify cells

Add sum formula at bottom

sheet['A1'] = 'New Value' sheet.insert_rows(2) sheet.delete_cols(3)
last_row = len(data) + 2 sheet[f'D{last_row}'] = f'=SUM(D2:D{last_row-1})'

Add new sheet

Column width

new_sheet = wb.create_sheet('Analysis') new_sheet['A1'] = 'Data'
wb.save('modified.xlsx')
undefined
sheet.column_dimensions['A'].width = 15 sheet.column_dimensions['B'].width = 10 sheet.column_dimensions['C'].width = 10 sheet.column_dimensions['D'].width = 10
wb.save('output.xlsx')
undefined

Recalculate Formulas

财务模型标准

颜色编码

After creating/modifying Excel files with formulas:
bash
undefined
python
from openpyxl.styles import Font

Recalculate all formulas using LibreOffice

Industry-standard colors

python recalc.py output.xlsx

---
BLUE = Font(color='0000FF') # Hardcoded inputs BLACK = Font(color='000000') # Formulas GREEN = Font(color='008000') # Links from other sheets RED = Font(color='FF0000') # External links

Dependencies

Apply to cells

Install as needed:
bash
undefined
sheet['B5'].font = BLUE # User input sheet['B6'].font = BLACK # Formula
undefined

PDF

数字格式

pip install pypdf pdfplumber reportlab pytesseract pdf2image
python
undefined

DOCX

Currency with thousands separator

npm install -g docx pip install "markitdown[docx]"
sheet['B5'].number_format = '$#,##0'

PPTX

Percentage with one decimal

npm install -g pptxgenjs pip install "markitdown[pptx]"
sheet['B6'].number_format = '0.0%'

XLSX

Zeros as dashes

pip install pandas openpyxl
sheet['B7'].number_format = '$#,##0;($#,##0);"-"'

Command line tools

Multiples

sudo apt-get install poppler-utils qpdf libreoffice pandoc

---
sheet['B8'].number_format = '0.0x'
undefined

Quick Task Reference

重要提示:使用公式而非硬编码值

I want to...Command/Code
Extract PDF text
pdfplumber.open(f).pages[0].extract_text()
Merge PDFs
pypdf.PdfWriter()
+ loop
Split PDFOne
PdfWriter()
per page
OCR scanned PDF
pdf2image
pytesseract
Convert DOCX to MD
pandoc doc.docx -o doc.md
Create DOCX
docx-js
(JavaScript)
Extract PPTX text
python -m markitdown pres.pptx
Create PPTX
PptxGenJS
(JavaScript)
Analyze Excel
pandas.read_excel()
Excel with formulas
openpyxl

python
undefined

Credits & Attribution

❌ WRONG - Hardcoding calculated values

This skill is adapted from Anthropic's Skills.
Original repositories:
Copyright (c) Anthropic - MIT License
Adapted by webconsulting.at for this skill collection
total = df['Sales'].sum() sheet['B10'] = total # Hardcodes 5000

✅ CORRECT - Use Excel formulas

sheet['B10'] = '=SUM(B2:B9)'

❌ WRONG - Computing in Python

growth = (current - previous) / previous sheet['C5'] = growth

✅ CORRECT - Excel formula

sheet['C5'] = '=(C4-C2)/C2'
undefined

编辑现有Excel文档

python
from openpyxl import load_workbook

Load with formulas preserved

wb = load_workbook('existing.xlsx') sheet = wb.active

Modify cells

sheet['A1'] = 'New Value' sheet.insert_rows(2) sheet.delete_cols(3)

Add new sheet

new_sheet = wb.create_sheet('Analysis') new_sheet['A1'] = 'Data'
wb.save('modified.xlsx')
undefined

重新计算公式

创建或修改带公式的Excel文档后:
bash
undefined

Recalculate all formulas using LibreOffice

python recalc.py output.xlsx

---

依赖安装

根据需要安装以下依赖:
bash
undefined

PDF

pip install pypdf pdfplumber reportlab pytesseract pdf2image

DOCX

npm install -g docx pip install "markitdown[docx]"

PPTX

npm install -g pptxgenjs pip install "markitdown[pptx]"

XLSX

pip install pandas openpyxl

Command line tools

sudo apt-get install poppler-utils qpdf libreoffice pandoc

---

快速任务参考

我想要...命令/代码
提取PDF文本
pdfplumber.open(f).pages[0].extract_text()
合并PDF
pypdf.PdfWriter()
+ 循环
拆分PDF每页创建一个
PdfWriter()
OCR扫描PDF
pdf2image
pytesseract
将DOCX转换为MD
pandoc doc.docx -o doc.md
创建DOCX
docx-js
(JavaScript)
提取PPTX文本
python -m markitdown pres.pptx
创建PPTX
PptxGenJS
(JavaScript)
分析Excel
pandas.read_excel()
带公式的Excel
openpyxl

致谢与版权说明

本技能改编自**Anthropic's Skills**。
原始仓库:
版权所有 (c) Anthropic - MIT许可证
由webconsulting.at改编至本技能集合