document-processing
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseDocument Processing
文档处理
Source: This skill is adapted from Anthropic's Skills document processing skills (pdf, docx, pptx, xlsx) for Claude Code and AI agents.
Create, edit, and analyze office documents including PDFs, Word documents, PowerPoint presentations,
and Excel spreadsheets.
来源: 本技能改编自**Anthropic's Skills**中面向Claude Code和AI Agent的文档处理技能(pdf、docx、pptx、xlsx)。
创建、编辑和分析各类办公文档,包括PDF、Word文档、PowerPoint演示文稿和Excel电子表格。
Quick Reference: Which Tool to Use
快速参考:选择合适的工具
| Task | Document Type | Best Tool |
|---|---|---|
| Extract text | | |
| Merge/split | | |
| Fill forms | | |
| Create new | | |
| OCR scanned | | |
| Extract text | DOCX | |
| Create new | DOCX | |
| Edit existing | DOCX | OOXML (unpack/edit/pack) |
| Extract text | PPTX | |
| Create new | PPTX | |
| Edit existing | PPTX | OOXML (unpack/edit/pack) |
| Data analysis | XLSX | |
| Formulas/formatting | XLSX | |
| 任务 | 文档类型 | 推荐工具 |
|---|---|---|
| 提取文本 | | |
| 合并/拆分 | | |
| 填写表单 | | |
| 创建新文档 | | |
| OCR扫描件 | | |
| 提取文本 | DOCX | |
| 创建新文档 | DOCX | |
| 编辑现有文档 | DOCX | OOXML(解压/编辑/打包) |
| 提取文本 | PPTX | |
| 创建新文档 | PPTX | |
| 编辑现有文档 | PPTX | OOXML(解压/编辑/打包) |
| 数据分析 | XLSX | |
| 公式/格式设置 | XLSX | |
PDF Processing
PDF处理
Text Extraction
文本提取
python
import pdfplumberpython
import pdfplumberExtract text with layout preservation
Extract text with layout preservation
with pdfplumber.open("document.pdf") as pdf:
for page in pdf.pages:
text = page.extract_text()
print(text)
undefinedwith pdfplumber.open("document.pdf") as pdf:
for page in pdf.pages:
text = page.extract_text()
print(text)
undefinedTable Extraction
表格提取
python
import pdfplumber
import pandas as pd
with pdfplumber.open("document.pdf") as pdf:
all_tables = []
for page in pdf.pages:
tables = page.extract_tables()
for table in tables:
if table:
df = pd.DataFrame(table[1:], columns=table[0])
all_tables.append(df)python
import pdfplumber
import pandas as pd
with pdfplumber.open("document.pdf") as pdf:
all_tables = []
for page in pdf.pages:
tables = page.extract_tables()
for table in tables:
if table:
df = pd.DataFrame(table[1:], columns=table[0])
all_tables.append(df)Combine all tables
Combine all tables
if all_tables:
combined_df = pd.concat(all_tables, ignore_index=True)
combined_df.to_excel("extracted_tables.xlsx", index=False)
undefinedif all_tables:
combined_df = pd.concat(all_tables, ignore_index=True)
combined_df.to_excel("extracted_tables.xlsx", index=False)
undefinedMerge PDFs
合并PDF
python
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
reader = PdfReader(pdf_file)
for page in reader.pages:
writer.add_page(page)
with open("merged.pdf", "wb") as output:
writer.write(output)python
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
reader = PdfReader(pdf_file)
for page in reader.pages:
writer.add_page(page)
with open("merged.pdf", "wb") as output:
writer.write(output)Split PDF
拆分PDF
python
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
writer = PdfWriter()
writer.add_page(page)
with open(f"page_{i+1}.pdf", "wb") as output:
writer.write(output)python
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
writer = PdfWriter()
writer.add_page(page)
with open(f"page_{i+1}.pdf", "wb") as output:
writer.write(output)Rotate Pages
旋转页面
python
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90) # Rotate 90 degrees clockwise
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
writer.write(output)python
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90) # Rotate 90 degrees clockwise
writer.add_page(page)
with open("rotated.pdf", "wb") as output:
writer.write(output)OCR Scanned PDFs
OCR扫描PDF
python
undefinedpython
undefinedRequires: pip install pytesseract pdf2image
Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path
import pytesseract
from pdf2image import convert_from_path
Convert PDF to images
Convert PDF to images
images = convert_from_path('scanned.pdf')
images = convert_from_path('scanned.pdf')
OCR each page
OCR each page
text = ""
for i, image in enumerate(images):
text += f"Page {i+1}:\n"
text += pytesseract.image_to_string(image)
text += "\n\n"
print(text)
undefinedtext = ""
for i, image in enumerate(images):
text += f"Page {i+1}:\n"
text += pytesseract.image_to_string(image)
text += "\n\n"
print(text)
undefinedAdd Watermark
添加水印
python
from pypdf import PdfReader, PdfWriter
watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
page.merge_page(watermark)
writer.add_page(page)
with open("watermarked.pdf", "wb") as output:
writer.write(output)python
from pypdf import PdfReader, PdfWriter
watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
page.merge_page(watermark)
writer.add_page(page)
with open("watermarked.pdf", "wb") as output:
writer.write(output)Password Protection
密码保护
python
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
writer.add_page(page)
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
writer.write(output)python
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
writer.add_page(page)
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
writer.write(output)Create PDF with ReportLab
使用ReportLab创建PDF
python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []Add content
Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))
body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())
Page 2
Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))
doc.build(story)
undefinedstory.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))
doc.build(story)
undefinedCommand Line Tools
命令行工具
bash
undefinedbash
undefinedExtract text (poppler-utils)
Extract text (poppler-utils)
pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt # Preserve layout
pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt # Preserve layout
Merge PDFs (qpdf)
Merge PDFs (qpdf)
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
Split pages
Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
Rotate pages
Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1
qpdf input.pdf output.pdf --rotate=+90:1
Remove password
Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
Extract images
Extract images
pdfimages -j input.pdf output_prefix
---pdfimages -j input.pdf output_prefix
---Word Document (DOCX) Processing
Word文档(DOCX)处理
Text Extraction
文本提取
bash
undefinedbash
undefinedConvert to markdown with pandoc
Convert to markdown with pandoc
pandoc document.docx -o output.md
pandoc document.docx -o output.md
With tracked changes preserved
With tracked changes preserved
pandoc --track-changes=all document.docx -o output.md
undefinedpandoc --track-changes=all document.docx -o output.md
undefinedCreate New Document (docx-js)
创建新文档(docx-js)
javascript
import { Document, Paragraph, TextRun, HeadingLevel, Packer } from 'docx';
import * as fs from 'fs';
const doc = new Document({
sections: [{
properties: {},
children: [
new Paragraph({
text: "Document Title",
heading: HeadingLevel.HEADING_1,
}),
new Paragraph({
children: [
new TextRun("This is a "),
new TextRun({
text: "bold",
bold: true,
}),
new TextRun(" word in a paragraph."),
],
}),
new Paragraph({
text: "This is another paragraph.",
}),
],
}],
});
// Export to file
const buffer = await Packer.toBuffer(doc);
fs.writeFileSync("output.docx", buffer);javascript
import { Document, Paragraph, TextRun, HeadingLevel, Packer } from 'docx';
import * as fs from 'fs';
const doc = new Document({
sections: [{
properties: {},
children: [
new Paragraph({
text: "Document Title",
heading: HeadingLevel.HEADING_1,
}),
new Paragraph({
children: [
new TextRun("This is a "),
new TextRun({
text: "bold",
bold: true,
}),
new TextRun(" word in a paragraph."),
],
}),
new Paragraph({
text: "This is another paragraph.",
}),
],
}],
});
// Export to file
const buffer = await Packer.toBuffer(doc);
fs.writeFileSync("output.docx", buffer);Create Document with Tables
创建带表格的文档
javascript
import { Document, Paragraph, Table, TableRow, TableCell, Packer } from 'docx';
const table = new Table({
rows: [
new TableRow({
children: [
new TableCell({ children: [new Paragraph("Header 1")] }),
new TableCell({ children: [new Paragraph("Header 2")] }),
new TableCell({ children: [new Paragraph("Header 3")] }),
],
}),
new TableRow({
children: [
new TableCell({ children: [new Paragraph("Cell 1")] }),
new TableCell({ children: [new Paragraph("Cell 2")] }),
new TableCell({ children: [new Paragraph("Cell 3")] }),
],
}),
],
});
const doc = new Document({
sections: [{
children: [
new Paragraph({ text: "Table Example", heading: HeadingLevel.HEADING_1 }),
table,
],
}],
});javascript
import { Document, Paragraph, Table, TableRow, TableCell, Packer } from 'docx';
const table = new Table({
rows: [
new TableRow({
children: [
new TableCell({ children: [new Paragraph("Header 1")] }),
new TableCell({ children: [new Paragraph("Header 2")] }),
new TableCell({ children: [new Paragraph("Header 3")] }),
],
}),
new TableRow({
children: [
new TableCell({ children: [new Paragraph("Cell 1")] }),
new TableCell({ children: [new Paragraph("Cell 2")] }),
new TableCell({ children: [new Paragraph("Cell 3")] }),
],
}),
],
});
const doc = new Document({
sections: [{
children: [
new Paragraph({ text: "Table Example", heading: HeadingLevel.HEADING_1 }),
table,
],
}],
});Edit Existing Document (OOXML)
编辑现有文档(OOXML)
For complex edits, work with raw OOXML:
-
Unpack the document:bash
python ooxml/scripts/unpack.py document.docx unpacked/ -
Edit XML files (primarily)
word/document.xml -
Validate and pack:bash
python ooxml/scripts/validate.py unpacked/ --original document.docx python ooxml/scripts/pack.py unpacked/ output.docx
对于复杂编辑,可直接操作原始OOXML:
-
解压文档:bash
python ooxml/scripts/unpack.py document.docx unpacked/ -
编辑XML文件(主要是)
word/document.xml -
验证并打包:bash
python ooxml/scripts/validate.py unpacked/ --original document.docx python ooxml/scripts/pack.py unpacked/ output.docx
Tracked Changes Workflow
修订模式工作流
For document review with track changes:
bash
undefined适用于带修订标记的文档审阅场景:
bash
undefined1. Get current state
1. 获取当前状态
pandoc --track-changes=all document.docx -o current.md
pandoc --track-changes=all document.docx -o current.md
2. Unpack
2. 解压
python ooxml/scripts/unpack.py document.docx unpacked/
python ooxml/scripts/unpack.py document.docx unpacked/
3. Edit using tracked change patterns
3. 使用修订标记模式编辑
Use <w:ins> for insertions, <w:del> for deletions
用<w:ins>表示插入内容,<w:del>表示删除内容
4. Pack final document
4. 打包最终文档
python ooxml/scripts/pack.py unpacked/ reviewed.docx
---python ooxml/scripts/pack.py unpacked/ reviewed.docx
---PowerPoint (PPTX) Processing
PowerPoint(PPTX)处理
Text Extraction
文本提取
bash
python -m markitdown presentation.pptxbash
python -m markitdown presentation.pptxCreate New Presentation (PptxGenJS)
创建新演示文稿(PptxGenJS)
javascript
import PptxGenJS from 'pptxgenjs';
const pptx = new PptxGenJS();
// Slide 1 - Title
const slide1 = pptx.addSlide();
slide1.addText("Presentation Title", {
x: 1, y: 2, w: 8, h: 1.5,
fontSize: 36,
bold: true,
color: "363636",
align: "center",
});
slide1.addText("Subtitle goes here", {
x: 1, y: 3.5, w: 8, h: 0.5,
fontSize: 18,
color: "666666",
align: "center",
});
// Slide 2 - Content
const slide2 = pptx.addSlide();
slide2.addText("Key Points", {
x: 0.5, y: 0.5, w: 9, h: 0.8,
fontSize: 28,
bold: true,
});
slide2.addText([
{ text: "• First important point\n", options: { bullet: true } },
{ text: "• Second important point\n", options: { bullet: true } },
{ text: "• Third important point\n", options: { bullet: true } },
], {
x: 0.5, y: 1.5, w: 9, h: 3,
fontSize: 18,
});
// Slide 3 - Chart
const slide3 = pptx.addSlide();
slide3.addChart(pptx.ChartType.bar, [
{ name: "Q1", labels: ["Jan", "Feb", "Mar"], values: [100, 200, 300] },
{ name: "Q2", labels: ["Apr", "May", "Jun"], values: [150, 250, 350] },
], {
x: 1, y: 1, w: 8, h: 4,
showLegend: true,
legendPos: "b",
});
// Save
pptx.writeFile("output.pptx");javascript
import PptxGenJS from 'pptxgenjs';
const pptx = new PptxGenJS();
// Slide 1 - Title
const slide1 = pptx.addSlide();
slide1.addText("Presentation Title", {
x: 1, y: 2, w: 8, h: 1.5,
fontSize: 36,
bold: true,
color: "363636",
align: "center",
});
slide1.addText("Subtitle goes here", {
x: 1, y: 3.5, w: 8, h: 0.5,
fontSize: 18,
color: "666666",
align: "center",
});
// Slide 2 - Content
const slide2 = pptx.addSlide();
slide2.addText("Key Points", {
x: 0.5, y: 0.5, w: 9, h: 0.8,
fontSize: 28,
bold: true,
});
slide2.addText([
{ text: "• First important point\n", options: { bullet: true } },
{ text: "• Second important point\n", options: { bullet: true } },
{ text: "• Third important point\n", options: { bullet: true } },
], {
x: 0.5, y: 1.5, w: 9, h: 3,
fontSize: 18,
});
// Slide 3 - Chart
const slide3 = pptx.addSlide();
slide3.addChart(pptx.ChartType.bar, [
{ name: "Q1", labels: ["Jan", "Feb", "Mar"], values: [100, 200, 300] },
{ name: "Q2", labels: ["Apr", "May", "Jun"], values: [150, 250, 350] },
], {
x: 1, y: 1, w: 8, h: 4,
showLegend: true,
legendPos: "b",
});
// Save
pptx.writeFile("output.pptx");Edit Existing Presentation (OOXML)
编辑现有演示文稿(OOXML)
bash
undefinedbash
undefined1. Unpack
1. 解压
python ooxml/scripts/unpack.py presentation.pptx unpacked/
python ooxml/scripts/unpack.py presentation.pptx unpacked/
2. Key files:
2. 关键文件:
- ppt/slides/slide1.xml, slide2.xml, etc.
- ppt/slides/slide1.xml, slide2.xml, etc.
- ppt/notesSlides/ for speaker notes
- ppt/notesSlides/ for speaker notes
- ppt/theme/ for styling
- ppt/theme/ for styling
3. Validate and pack
3. 验证并打包
python ooxml/scripts/validate.py unpacked/ --original presentation.pptx
python ooxml/scripts/pack.py unpacked/ output.pptx
undefinedpython ooxml/scripts/validate.py unpacked/ --original presentation.pptx
python ooxml/scripts/pack.py unpacked/ output.pptx
undefinedCreate Thumbnail Grid
创建缩略图网格
bash
undefinedbash
undefinedCreate visual overview of all slides
Create visual overview of all slides
python scripts/thumbnail.py presentation.pptx --cols 4
undefinedpython scripts/thumbnail.py presentation.pptx --cols 4
undefinedConvert Slides to Images
将幻灯片转换为图片
bash
undefinedbash
undefinedConvert to PDF first
Convert to PDF first
soffice --headless --convert-to pdf presentation.pptx
soffice --headless --convert-to pdf presentation.pptx
Then PDF to images
Then PDF to images
pdftoppm -jpeg -r 150 presentation.pdf slide
pdftoppm -jpeg -r 150 presentation.pdf slide
Creates slide-1.jpg, slide-2.jpg, etc.
Creates slide-1.jpg, slide-2.jpg, etc.
---
---Excel (XLSX) Processing
Excel(XLSX)处理
Data Analysis with Pandas
使用Pandas进行数据分析
python
import pandas as pdpython
import pandas as pdRead Excel
Read Excel
df = pd.read_excel('file.xlsx') # Default: first sheet
all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict
df = pd.read_excel('file.xlsx') # Default: first sheet
all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict
Analyze
Analyze
df.head() # Preview data
df.info() # Column info
df.describe() # Statistics
df.head() # Preview data
df.info() # Column info
df.describe() # Statistics
Filter and transform
Filter and transform
filtered = df[df['Sales'] > 1000]
grouped = df.groupby('Category')['Revenue'].sum()
filtered = df[df['Sales'] > 1000]
grouped = df.groupby('Category')['Revenue'].sum()
Write Excel
Write Excel
df.to_excel('output.xlsx', index=False)
undefineddf.to_excel('output.xlsx', index=False)
undefinedCreate Excel with Formulas (openpyxl)
使用openpyxl创建带公式的Excel
python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
wb = Workbook()
sheet = wb.activepython
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
wb = Workbook()
sheet = wb.activeAdd data
Add data
sheet['A1'] = 'Product'
sheet['B1'] = 'Price'
sheet['C1'] = 'Quantity'
sheet['D1'] = 'Total'
sheet['A1'] = 'Product'
sheet['B1'] = 'Price'
sheet['C1'] = 'Quantity'
sheet['D1'] = 'Total'
Header formatting
Header formatting
for cell in ['A1', 'B1', 'C1', 'D1']:
sheet[cell].font = Font(bold=True, color='FFFFFF')
sheet[cell].fill = PatternFill('solid', start_color='4472C4')
sheet[cell].alignment = Alignment(horizontal='center')
for cell in ['A1', 'B1', 'C1', 'D1']:
sheet[cell].font = Font(bold=True, color='FFFFFF')
sheet[cell].fill = PatternFill('solid', start_color='4472C4')
sheet[cell].alignment = Alignment(horizontal='center')
Add data rows
Add data rows
data = [
('Widget A', 10.00, 5),
('Widget B', 15.00, 3),
('Widget C', 20.00, 8),
]
for row_idx, (product, price, qty) in enumerate(data, start=2):
sheet[f'A{row_idx}'] = product
sheet[f'B{row_idx}'] = price
sheet[f'C{row_idx}'] = qty
# FORMULA - not hardcoded value!
sheet[f'D{row_idx}'] = f'=B{row_idx}*C{row_idx}'
data = [
('Widget A', 10.00, 5),
('Widget B', 15.00, 3),
('Widget C', 20.00, 8),
]
for row_idx, (product, price, qty) in enumerate(data, start=2):
sheet[f'A{row_idx}'] = product
sheet[f'B{row_idx}'] = price
sheet[f'C{row_idx}'] = qty
# FORMULA - not hardcoded value!
sheet[f'D{row_idx}'] = f'=B{row_idx}*C{row_idx}'
Add sum formula at bottom
Add sum formula at bottom
last_row = len(data) + 2
sheet[f'D{last_row}'] = f'=SUM(D2:D{last_row-1})'
last_row = len(data) + 2
sheet[f'D{last_row}'] = f'=SUM(D2:D{last_row-1})'
Column width
Column width
sheet.column_dimensions['A'].width = 15
sheet.column_dimensions['B'].width = 10
sheet.column_dimensions['C'].width = 10
sheet.column_dimensions['D'].width = 10
wb.save('output.xlsx')
undefinedsheet.column_dimensions['A'].width = 15
sheet.column_dimensions['B'].width = 10
sheet.column_dimensions['C'].width = 10
sheet.column_dimensions['D'].width = 10
wb.save('output.xlsx')
undefinedFinancial Model Standards
创建缩略图网格
Color Coding
—
python
from openpyxl.styles import Fontbash
undefinedIndustry-standard colors
Create visual overview of all slides
BLUE = Font(color='0000FF') # Hardcoded inputs
BLACK = Font(color='000000') # Formulas
GREEN = Font(color='008000') # Links from other sheets
RED = Font(color='FF0000') # External links
python scripts/thumbnail.py presentation.pptx --cols 4
undefinedApply to cells
将幻灯片转换为图片
sheet['B5'].font = BLUE # User input
sheet['B6'].font = BLACK # Formula
undefinedbash
undefinedNumber Formatting
Convert to PDF first
python
undefinedsoffice --headless --convert-to pdf presentation.pptx
Currency with thousands separator
Then PDF to images
sheet['B5'].number_format = '$#,##0'
pdftoppm -jpeg -r 150 presentation.pdf slide
Percentage with one decimal
Creates slide-1.jpg, slide-2.jpg, etc.
sheet['B6'].number_format = '0.0%'
---Zeros as dashes
Excel(XLSX)处理
—
使用Pandas进行数据分析
sheet['B7'].number_format = '$#,##0;($#,##0);"-"'
python
import pandas as pdMultiples
Read Excel
sheet['B8'].number_format = '0.0x'
undefineddf = pd.read_excel('file.xlsx') # Default: first sheet
all_sheets = pd.read_excel('file.xlsx', sheet_name=None) # All sheets as dict
CRITICAL: Use Formulas, Not Hardcoded Values
Analyze
python
undefineddf.head() # Preview data
df.info() # Column info
df.describe() # Statistics
❌ WRONG - Hardcoding calculated values
Filter and transform
total = df['Sales'].sum()
sheet['B10'] = total # Hardcodes 5000
filtered = df[df['Sales'] > 1000]
grouped = df.groupby('Category')['Revenue'].sum()
✅ CORRECT - Use Excel formulas
Write Excel
sheet['B10'] = '=SUM(B2:B9)'
df.to_excel('output.xlsx', index=False)
undefined❌ WRONG - Computing in Python
使用openpyxl创建带公式的Excel
growth = (current - previous) / previous
sheet['C5'] = growth
python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
wb = Workbook()
sheet = wb.active✅ CORRECT - Excel formula
Add data
sheet['C5'] = '=(C4-C2)/C2'
undefinedsheet['A1'] = 'Product'
sheet['B1'] = 'Price'
sheet['C1'] = 'Quantity'
sheet['D1'] = 'Total'
Edit Existing Excel
Header formatting
python
from openpyxl import load_workbookfor cell in ['A1', 'B1', 'C1', 'D1']:
sheet[cell].font = Font(bold=True, color='FFFFFF')
sheet[cell].fill = PatternFill('solid', start_color='4472C4')
sheet[cell].alignment = Alignment(horizontal='center')
Load with formulas preserved
Add data rows
wb = load_workbook('existing.xlsx')
sheet = wb.active
data = [
('Widget A', 10.00, 5),
('Widget B', 15.00, 3),
('Widget C', 20.00, 8),
]
for row_idx, (product, price, qty) in enumerate(data, start=2):
sheet[f'A{row_idx}'] = product
sheet[f'B{row_idx}'] = price
sheet[f'C{row_idx}'] = qty
# FORMULA - not hardcoded value!
sheet[f'D{row_idx}'] = f'=B{row_idx}*C{row_idx}'
Modify cells
Add sum formula at bottom
sheet['A1'] = 'New Value'
sheet.insert_rows(2)
sheet.delete_cols(3)
last_row = len(data) + 2
sheet[f'D{last_row}'] = f'=SUM(D2:D{last_row-1})'
Add new sheet
Column width
new_sheet = wb.create_sheet('Analysis')
new_sheet['A1'] = 'Data'
wb.save('modified.xlsx')
undefinedsheet.column_dimensions['A'].width = 15
sheet.column_dimensions['B'].width = 10
sheet.column_dimensions['C'].width = 10
sheet.column_dimensions['D'].width = 10
wb.save('output.xlsx')
undefinedRecalculate Formulas
财务模型标准
—
颜色编码
After creating/modifying Excel files with formulas:
bash
undefinedpython
from openpyxl.styles import FontRecalculate all formulas using LibreOffice
Industry-standard colors
python recalc.py output.xlsx
---BLUE = Font(color='0000FF') # Hardcoded inputs
BLACK = Font(color='000000') # Formulas
GREEN = Font(color='008000') # Links from other sheets
RED = Font(color='FF0000') # External links
Dependencies
Apply to cells
Install as needed:
bash
undefinedsheet['B5'].font = BLUE # User input
sheet['B6'].font = BLACK # Formula
undefined数字格式
pip install pypdf pdfplumber reportlab pytesseract pdf2image
python
undefinedDOCX
Currency with thousands separator
npm install -g docx
pip install "markitdown[docx]"
sheet['B5'].number_format = '$#,##0'
PPTX
Percentage with one decimal
npm install -g pptxgenjs
pip install "markitdown[pptx]"
sheet['B6'].number_format = '0.0%'
XLSX
Zeros as dashes
pip install pandas openpyxl
sheet['B7'].number_format = '$#,##0;($#,##0);"-"'
Command line tools
Multiples
sudo apt-get install poppler-utils qpdf libreoffice pandoc
---sheet['B8'].number_format = '0.0x'
undefinedQuick Task Reference
重要提示:使用公式而非硬编码值
| I want to... | Command/Code |
|---|---|
| Extract PDF text | |
| Merge PDFs | |
| Split PDF | One |
| OCR scanned PDF | |
| Convert DOCX to MD | |
| Create DOCX | |
| Extract PPTX text | |
| Create PPTX | |
| Analyze Excel | |
| Excel with formulas | |
python
undefinedCredits & Attribution
❌ WRONG - Hardcoding calculated values
This skill is adapted from Anthropic's Skills.
Original repositories:
- https://github.com/anthropics/skills (pdf, docx, pptx, xlsx)
Copyright (c) Anthropic - MIT License
Adapted by webconsulting.at for this skill collection
Adapted by webconsulting.at for this skill collection
total = df['Sales'].sum()
sheet['B10'] = total # Hardcodes 5000
—
✅ CORRECT - Use Excel formulas
—
sheet['B10'] = '=SUM(B2:B9)'
—
❌ WRONG - Computing in Python
—
growth = (current - previous) / previous
sheet['C5'] = growth
—
✅ CORRECT - Excel formula
—
sheet['C5'] = '=(C4-C2)/C2'
undefined—
编辑现有Excel文档
—
python
from openpyxl import load_workbook—
Load with formulas preserved
—
wb = load_workbook('existing.xlsx')
sheet = wb.active
—
Modify cells
—
sheet['A1'] = 'New Value'
sheet.insert_rows(2)
sheet.delete_cols(3)
—
Add new sheet
—
new_sheet = wb.create_sheet('Analysis')
new_sheet['A1'] = 'Data'
wb.save('modified.xlsx')
undefined—
重新计算公式
—
创建或修改带公式的Excel文档后:
bash
undefined—
Recalculate all formulas using LibreOffice
—
python recalc.py output.xlsx
---—
依赖安装
—
根据需要安装以下依赖:
bash
undefined—
—
pip install pypdf pdfplumber reportlab pytesseract pdf2image
—
DOCX
—
npm install -g docx
pip install "markitdown[docx]"
—
PPTX
—
npm install -g pptxgenjs
pip install "markitdown[pptx]"
—
XLSX
—
pip install pandas openpyxl
—
Command line tools
—
sudo apt-get install poppler-utils qpdf libreoffice pandoc
---—
快速任务参考
—
| 我想要... | 命令/代码 |
|---|---|
| 提取PDF文本 | |
| 合并PDF | |
| 拆分PDF | 每页创建一个 |
| OCR扫描PDF | |
| 将DOCX转换为MD | |
| 创建DOCX | |
| 提取PPTX文本 | |
| 创建PPTX | |
| 分析Excel | |
| 带公式的Excel | |
—
致谢与版权说明
—
本技能改编自**Anthropic's Skills**。
原始仓库:
- https://github.com/anthropics/skills (pdf, docx, pptx, xlsx)
版权所有 (c) Anthropic - MIT许可证
由webconsulting.at改编至本技能集合
由webconsulting.at改编至本技能集合