xlsx-processing
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseXLSX Processing
XLSX 处理
Overview
概述
Manipulate Excel files programmatically using openpyxl for rich formatting and pandas for data analysis. This skill covers reading/writing spreadsheets, formulas, charts, conditional formatting, data validation, pivot table generation, CSV import/export, and strategies for handling large files.
Apply this skill whenever Excel files need to be created, read, transformed, or enriched through code rather than manual editing.
使用支持丰富格式设置的 openpyxl 和用于数据分析的 pandas 以编程方式操作Excel文件。该技能涵盖电子表格读写、公式、图表、条件格式、数据验证、数据透视表生成、CSV导入/导出,以及大文件处理策略。
当你需要通过代码而非手动编辑来创建、读取、转换或丰富Excel文件时,即可应用该技能。
Multi-Phase Process
多阶段流程
Phase 1: Requirements
阶段1:需求确认
- Determine operation (read, write, transform, report)
- Identify data sources and volume
- Define formatting and formula requirements
- Plan sheet structure and naming
- Assess performance needs (row count, file size)
STOP — Do NOT begin implementation until you know the row count and whether formatting is needed (this determines library choice).
- 确定操作类型(读取、写入、转换、生成报告)
- 明确数据源和数据量
- 定义格式和公式要求
- 规划工作表结构和命名
- 评估性能需求(行数、文件大小)
停止 — 在明确行数和是否需要格式设置之前,请勿开始实现(这将决定库的选择)。
Phase 2: Implementation
阶段2:实现
- Select library (see decision table)
- Implement data loading and transformation
- Apply formatting, formulas, and validation
- Add charts and conditional formatting
- Optimize for file size and memory
STOP — Do NOT skip memory optimization for files exceeding 10,000 rows.
- 选择库(参考决策表)
- 实现数据加载和转换
- 应用格式、公式和验证规则
- 添加图表和条件格式
- 针对文件大小和内存占用进行优化
停止 — 对于行数超过10000的文件,请勿跳过内存优化步骤。
Phase 3: Validation
阶段3:验证
- Open in Excel, LibreOffice, and Google Sheets
- Verify formulas calculate correctly
- Check formatting renders consistently
- Test with edge cases (empty data, max rows)
- Validate data accuracy
- 在Excel、LibreOffice和Google Sheets中打开文件
- 验证公式计算结果正确
- 检查格式渲染一致性
- 使用边界场景测试(空数据、最大行数)
- 验证数据准确性
Library Decision Table
库选择决策表
| Scenario | Library | Why |
|---|---|---|
| Rich formatting (colors, borders, fonts) | openpyxl | Full formatting API |
| Data analysis, aggregation, pivots | pandas | DataFrame operations |
| Formatted report from data analysis | pandas + openpyxl | Combine strengths |
| Reading data only, no formatting needed | pandas | Simplest API |
| Large file (> 10K rows), write-heavy | openpyxl write_only | Streaming writes, low memory |
| Large file (> 10K rows), read-heavy | openpyxl read_only | Streaming reads, low memory |
| CSV to/from Excel conversion | pandas | One-liner operations |
| Charts in spreadsheet | openpyxl | Chart API with full control |
| 场景 | 库 | 选择原因 |
|---|---|---|
| 丰富格式设置(颜色、边框、字体) | openpyxl | 提供完整的格式设置API |
| 数据分析、聚合、数据透视 | pandas | 基于DataFrame的便捷操作 |
| 数据分析结果生成带格式的报告 | pandas + openpyxl | 结合两者优势 |
| 仅需读取数据、无需格式设置 | pandas | API最简单 |
| 大文件(>1万行)、写入为主 | openpyxl write_only 模式 | 流式写入、低内存占用 |
| 大文件(>1万行)、读取为主 | openpyxl read_only 模式 | 流式读取、低内存占用 |
| CSV与Excel格式互相转换 | pandas | 单命令即可完成操作 |
| 电子表格中添加图表 | openpyxl | 支持完全可控的图表API |
openpyxl Patterns
openpyxl 常用模式
Creating a Workbook
创建工作簿
python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
wb = Workbook()
ws = wb.active
ws.title = "Report"python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
wb = Workbook()
ws = wb.active
ws.title = "Report"Header row
Header row
headers = ['Name', 'Department', 'Revenue', 'Target', 'Achievement']
header_font = Font(name='Calibri', size=11, bold=True, color='FFFFFF')
header_fill = PatternFill(start_color='2F5496', end_color='2F5496', fill_type='solid')
header_alignment = Alignment(horizontal='center', vertical='center')
thin_border = Border(
left=Side(style='thin'),
right=Side(style='thin'),
top=Side(style='thin'),
bottom=Side(style='thin'),
)
for col, header in enumerate(headers, 1):
cell = ws.cell(row=1, column=col, value=header)
cell.font = header_font
cell.fill = header_fill
cell.alignment = header_alignment
cell.border = thin_border
headers = ['Name', 'Department', 'Revenue', 'Target', 'Achievement']
header_font = Font(name='Calibri', size=11, bold=True, color='FFFFFF')
header_fill = PatternFill(start_color='2F5496', end_color='2F5496', fill_type='solid')
header_alignment = Alignment(horizontal='center', vertical='center')
thin_border = Border(
left=Side(style='thin'),
right=Side(style='thin'),
top=Side(style='thin'),
bottom=Side(style='thin'),
)
for col, header in enumerate(headers, 1):
cell = ws.cell(row=1, column=col, value=header)
cell.font = header_font
cell.fill = header_fill
cell.alignment = header_alignment
cell.border = thin_border
Data rows
Data rows
for row_idx, row_data in enumerate(data, 2):
for col_idx, value in enumerate(row_data, 1):
cell = ws.cell(row=row_idx, column=col_idx, value=value)
cell.border = thin_border
for row_idx, row_data in enumerate(data, 2):
for col_idx, value in enumerate(row_data, 1):
cell = ws.cell(row=row_idx, column=col_idx, value=value)
cell.border = thin_border
Auto-fit column widths
Auto-fit column widths
for col in range(1, len(headers) + 1):
max_length = max(
len(str(ws.cell(row=row, column=col).value or ''))
for row in range(1, ws.max_row + 1)
)
ws.column_dimensions[get_column_letter(col)].width = min(max_length + 2, 50)
for col in range(1, len(headers) + 1):
max_length = max(
len(str(ws.cell(row=row, column=col).value or ''))
for row in range(1, ws.max_row + 1)
)
ws.column_dimensions[get_column_letter(col)].width = min(max_length + 2, 50)
Freeze header row
Freeze header row
ws.freeze_panes = 'A2'
wb.save('report.xlsx')
undefinedws.freeze_panes = 'A2'
wb.save('report.xlsx')
undefinedFormulas
公式使用
python
undefinedpython
undefinedBasic formulas
Basic formulas
ws['E2'] = '=C2/D2' # Division
ws['F2'] = '=SUM(C2:C100)' # Sum
ws['G2'] = '=AVERAGE(C2:C100)' # Average
ws['H2'] = '=COUNTIF(E2:E100,">1")' # Count if
ws['I2'] = '=IF(E2>=1,"Met","Below")' # Conditional
ws['J2'] = '=VLOOKUP(A2,Sheet2!A:B,2,FALSE)' # Lookup
ws['E2'] = '=C2/D2' # Division
ws['F2'] = '=SUM(C2:C100)' # Sum
ws['G2'] = '=AVERAGE(C2:C100)' # Average
ws['H2'] = '=COUNTIF(E2:E100,">1")' # Count if
ws['I2'] = '=IF(E2>=1,"Met","Below")' # Conditional
ws['J2'] = '=VLOOKUP(A2,Sheet2!A:B,2,FALSE)' # Lookup
Array formula (Excel 365 dynamic array)
Array formula (Excel 365 dynamic array)
ws['K2'] = '=UNIQUE(A2:A100)'
ws['K2'] = '=UNIQUE(A2:A100)'
Named range
Named range
from openpyxl.workbook.defined_name import DefinedName
ref = f"Report!$C$2:$C${len(data)+1}"
defn = DefinedName('RevenueRange', attr_text=ref)
wb.defined_names.add(defn)
undefinedfrom openpyxl.workbook.defined_name import DefinedName
ref = f"Report!$C$2:$C${len(data)+1}"
defn = DefinedName('RevenueRange', attr_text=ref)
wb.defined_names.add(defn)
undefinedCharts
图表使用
python
from openpyxl.chart import BarChart, LineChart, PieChart, Referencepython
from openpyxl.chart import BarChart, LineChart, PieChart, ReferenceBar chart
Bar chart
chart = BarChart()
chart.type = 'col'
chart.title = 'Revenue by Department'
chart.y_axis.title = 'Revenue ($)'
chart.x_axis.title = 'Department'
chart.style = 10 # Built-in style
data_ref = Reference(ws, min_col=3, min_row=1, max_row=ws.max_row)
cats_ref = Reference(ws, min_col=2, min_row=2, max_row=ws.max_row)
chart.add_data(data_ref, titles_from_data=True)
chart.set_categories(cats_ref)
chart.width = 20
chart.height = 12
ws.add_chart(chart, 'G2')
chart = BarChart()
chart.type = 'col'
chart.title = 'Revenue by Department'
chart.y_axis.title = 'Revenue ($)'
chart.x_axis.title = 'Department'
chart.style = 10 # Built-in style
data_ref = Reference(ws, min_col=3, min_row=1, max_row=ws.max_row)
cats_ref = Reference(ws, min_col=2, min_row=2, max_row=ws.max_row)
chart.add_data(data_ref, titles_from_data=True)
chart.set_categories(cats_ref)
chart.width = 20
chart.height = 12
ws.add_chart(chart, 'G2')
Line chart with multiple series
Line chart with multiple series
line = LineChart()
line.title = 'Monthly Trends'
for col in range(3, 6):
values = Reference(ws, min_col=col, min_row=1, max_row=13)
line.add_data(values, titles_from_data=True)
cats = Reference(ws, min_col=1, min_row=2, max_row=13)
line.set_categories(cats)
ws.add_chart(line, 'G20')
undefinedline = LineChart()
line.title = 'Monthly Trends'
for col in range(3, 6):
values = Reference(ws, min_col=col, min_row=1, max_row=13)
line.add_data(values, titles_from_data=True)
cats = Reference(ws, min_col=1, min_row=2, max_row=13)
line.set_categories(cats)
ws.add_chart(line, 'G20')
undefinedConditional Formatting
条件格式
python
from openpyxl.formatting.rule import CellIsRule, ColorScaleRule, DataBarRulepython
from openpyxl.formatting.rule import CellIsRule, ColorScaleRule, DataBarRuleHighlight cells above threshold
Highlight cells above threshold
ws.conditional_formatting.add(
'C2:C100',
CellIsRule(operator='greaterThan', formula=['100000'],
fill=PatternFill(bgColor='C6EFCE'))
)
ws.conditional_formatting.add(
'C2:C100',
CellIsRule(operator='greaterThan', formula=['100000'],
fill=PatternFill(bgColor='C6EFCE'))
)
Red for below target
Red for below target
ws.conditional_formatting.add(
'E2:E100',
CellIsRule(operator='lessThan', formula=['1'],
fill=PatternFill(bgColor='FFC7CE'),
font=Font(color='9C0006'))
)
ws.conditional_formatting.add(
'E2:E100',
CellIsRule(operator='lessThan', formula=['1'],
fill=PatternFill(bgColor='FFC7CE'),
font=Font(color='9C0006'))
)
Color scale (green to red)
Color scale (green to red)
ws.conditional_formatting.add(
'E2:E100',
ColorScaleRule(
start_type='min', start_color='F8696B',
mid_type='percentile', mid_value=50, mid_color='FFEB84',
end_type='max', end_color='63BE7B'
)
)
ws.conditional_formatting.add(
'E2:E100',
ColorScaleRule(
start_type='min', start_color='F8696B',
mid_type='percentile', mid_value=50, mid_color='FFEB84',
end_type='max', end_color='63BE7B'
)
)
Data bars
Data bars
ws.conditional_formatting.add(
'C2:C100',
DataBarRule(start_type='min', end_type='max', color='638EC6')
)
undefinedws.conditional_formatting.add(
'C2:C100',
DataBarRule(start_type='min', end_type='max', color='638EC6')
)
undefinedData Validation
数据验证
python
from openpyxl.worksheet.datavalidation import DataValidationpython
from openpyxl.worksheet.datavalidation import DataValidationDropdown list
Dropdown list
dv = DataValidation(type='list', formula1='"Active,Inactive,Pending"', allow_blank=True)
dv.error = 'Please select a valid status'
dv.errorTitle = 'Invalid Entry'
ws.add_data_validation(dv)
dv.add('D2:D100')
dv = DataValidation(type='list', formula1='"Active,Inactive,Pending"', allow_blank=True)
dv.error = 'Please select a valid status'
dv.errorTitle = 'Invalid Entry'
ws.add_data_validation(dv)
dv.add('D2:D100')
Number range
Number range
nv = DataValidation(type='whole', operator='between', formula1=0, formula2=100)
nv.error = 'Value must be between 0 and 100'
ws.add_data_validation(nv)
nv.add('F2:F100')
nv = DataValidation(type='whole', operator='between', formula1=0, formula2=100)
nv.error = 'Value must be between 0 and 100'
ws.add_data_validation(nv)
nv.add('F2:F100')
Date validation
Date validation
dv_date = DataValidation(type='date', operator='greaterThan', formula1='2025-01-01')
ws.add_data_validation(dv_date)
dv_date.add('G2:G100')
undefineddv_date = DataValidation(type='date', operator='greaterThan', formula1='2025-01-01')
ws.add_data_validation(dv_date)
dv_date.add('G2:G100')
undefinedpandas Patterns
pandas 常用模式
Reading Excel
读取Excel
python
import pandas as pdpython
import pandas as pdRead single sheet
Read single sheet
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
Read with options
Read with options
df = pd.read_excel('data.xlsx',
sheet_name='Sales',
header=0,
usecols='A:E',
dtype={'ID': str, 'Revenue': float},
parse_dates=['Date'],
na_values=['N/A', 'null', ''],
)
df = pd.read_excel('data.xlsx',
sheet_name='Sales',
header=0,
usecols='A:E',
dtype={'ID': str, 'Revenue': float},
parse_dates=['Date'],
na_values=['N/A', 'null', ''],
)
Read all sheets
Read all sheets
sheets = pd.read_excel('data.xlsx', sheet_name=None) # Dict of DataFrames
undefinedsheets = pd.read_excel('data.xlsx', sheet_name=None) # Dict of DataFrames
undefinedWriting Excel with pandas + openpyxl
结合pandas和openpyxl写入Excel
python
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
df_summary.to_excel(writer, sheet_name='Summary', index=False)
df_detail.to_excel(writer, sheet_name='Detail', index=False)
# Access openpyxl workbook for formatting
wb = writer.book
ws = wb['Summary']
# Apply formatting...python
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
df_summary.to_excel(writer, sheet_name='Summary', index=False)
df_detail.to_excel(writer, sheet_name='Detail', index=False)
# Access openpyxl workbook for formatting
wb = writer.book
ws = wb['Summary']
# Apply formatting...Pivot Tables
数据透视表
python
undefinedpython
undefinedCreate pivot table
Create pivot table
pivot = pd.pivot_table(
df,
values='Revenue',
index='Department',
columns='Quarter',
aggfunc='sum',
margins=True,
margins_name='Total'
)
pivot = pd.pivot_table(
df,
values='Revenue',
index='Department',
columns='Quarter',
aggfunc='sum',
margins=True,
margins_name='Total'
)
Write to Excel with formatting
Write to Excel with formatting
pivot.to_excel(writer, sheet_name='Pivot')
undefinedpivot.to_excel(writer, sheet_name='Pivot')
undefinedCSV Import/Export
CSV 导入/导出
python
undefinedpython
undefinedCSV to XLSX
CSV to XLSX
df = pd.read_csv('data.csv', encoding='utf-8-sig')
df.to_excel('output.xlsx', index=False)
df = pd.read_csv('data.csv', encoding='utf-8-sig')
df.to_excel('output.xlsx', index=False)
XLSX to CSV
XLSX to CSV
df = pd.read_excel('data.xlsx')
df.to_csv('output.csv', index=False, encoding='utf-8-sig')
df = pd.read_excel('data.xlsx')
df.to_csv('output.csv', index=False, encoding='utf-8-sig')
Handle encoding issues
Handle encoding issues
df = pd.read_csv('data.csv', encoding='latin-1') # or 'cp1252'
undefineddf = pd.read_csv('data.csv', encoding='latin-1') # or 'cp1252'
undefinedLarge File Handling
大文件处理
Memory-Efficient Reading
低内存读取
python
undefinedpython
undefinedopenpyxl read-only mode
openpyxl read-only mode
from openpyxl import load_workbook
wb = load_workbook('large_file.xlsx', read_only=True)
ws = wb.active
for row in ws.iter_rows(min_row=2, values_only=True):
process_row(row)
wb.close()
undefinedfrom openpyxl import load_workbook
wb = load_workbook('large_file.xlsx', read_only=True)
ws = wb.active
for row in ws.iter_rows(min_row=2, values_only=True):
process_row(row)
wb.close()
undefinedChunked Writing
分块写入
python
undefinedpython
undefinedWrite large datasets in chunks
Write large datasets in chunks
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows
wb = Workbook(write_only=True)
ws = wb.create_sheet()
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows
wb = Workbook(write_only=True)
ws = wb.create_sheet()
Write header
Write header
ws.append(headers)
ws.append(headers)
Write in chunks
Write in chunks
chunk_size = 10000
for chunk in pd.read_csv('large.csv', chunksize=chunk_size):
for row in dataframe_to_rows(chunk, index=False, header=False):
ws.append(row)
wb.save('output.xlsx')
undefinedchunk_size = 10000
for chunk in pd.read_csv('large.csv', chunksize=chunk_size):
for row in dataframe_to_rows(chunk, index=False, header=False):
ws.append(row)
wb.save('output.xlsx')
undefinedPerformance Decision Table
性能决策表
| Rows | Strategy | Notes |
|---|---|---|
| < 10,000 | Standard openpyxl or pandas | Full formatting available |
| 10K - 100K | write_only / read_only mode, chunked | Limited formatting in write_only |
| 100K - 1M | write_only mode, consider CSV instead | Near Excel row limit |
| > 1M | Use CSV or Parquet, not XLSX | Excel limit: 1,048,576 rows |
| 行数 | 处理策略 | 说明 |
|---|---|---|
| < 10,000 | 标准openpyxl或pandas模式 | 支持完整格式设置 |
| 1万 - 10万 | write_only / read_only 模式、分块处理 | write_only模式下格式设置功能有限 |
| 10万 - 100万 | write_only 模式,优先考虑使用CSV | 接近Excel的行数上限 |
| > 100万 | 使用CSV或Parquet格式,不要用XLSX | Excel上限为1,048,576行 |
Anti-Patterns / Common Mistakes
反模式/常见错误
| Anti-Pattern | Why It Fails | What To Do Instead |
|---|---|---|
| openpyxl for pure data analysis | Verbose and slow for analytics | Use pandas for data operations |
| Loading large files into memory | Memory exhaustion, crashes | Use read_only / write_only modes |
| Hardcoding row/column numbers | Breaks when data shape changes | Calculate from data length |
| Inconsistent date formats | Dates render as numbers or strings | Set number_format explicitly |
| Not closing read_only workbooks | Resource leaks | Always call |
| Using .xls format | Legacy, limited, security risks | Always use .xlsx |
| Formatting cells one by one | Extremely slow for large ranges | Apply styles to ranges or use named styles |
| Not testing in actual Excel | Features render differently | Test in Excel, LibreOffice, and Google Sheets |
| Forgetting to freeze header row | Poor UX when scrolling large data | Always freeze panes for data sheets |
| 反模式 | 失效原因 | 替代方案 |
|---|---|---|
| 使用openpyxl做纯数据分析 | 分析操作繁琐且速度慢 | 使用pandas进行数据操作 |
| 直接将大文件加载到内存 | 内存耗尽、程序崩溃 | 使用read_only / write_only 模式 |
| 硬编码行/列号 | 数据结构变化时代码失效 | 根据数据长度动态计算 |
| 日期格式不一致 | 日期会显示为数字或字符串 | 显式设置number_format |
| 不关闭read_only模式的工作簿 | 资源泄漏 | 始终调用 |
| 使用.xls格式 | 老旧、功能受限、存在安全风险 | 始终使用.xlsx格式 |
| 逐个单元格设置格式 | 大范围设置时速度极慢 | 对范围应用样式或使用命名样式 |
| 不在实际Excel软件中测试 | 不同软件功能渲染效果不同 | 在Excel、LibreOffice和Google Sheets中都进行测试 |
| 忘记冻结表头行 | 滚动查看大数据时用户体验差 | 数据表始终冻结窗格固定表头 |
Anti-Rationalization Guards
反合理化约束
- Do NOT use openpyxl for data analysis that pandas handles in one line.
- Do NOT skip the row count assessment -- it determines your entire approach.
- Do NOT assume standard mode works for files over 10K rows -- use streaming modes.
- Do NOT test only in one spreadsheet application -- formatting varies.
- Do NOT forget to close workbooks opened in read_only mode.
- 不要使用openpyxl处理pandas一行代码就能完成的数据分析任务。
- 不要跳过行数评估步骤——它决定了你的整体实现方案。
- 不要假设标准模式适用于超过1万行的文件——使用流式模式。
- 不要仅在一个电子表格应用中测试——不同软件的格式渲染存在差异。
- 不要忘记关闭以read_only模式打开的工作簿。
Integration Points
集成点
| Skill | How It Connects |
|---|---|
| Excel data feeds into PDF report generation |
| Excel data populates Word document tables |
| Generated spreadsheets attach to professional emails |
| Output file naming and directory structure conventions |
| Database exports to Excel for reporting |
| Automated report generation in CI/CD pipelines |
| 技能 | 关联方式 |
|---|---|
| Excel数据可用于PDF报告生成 |
| Excel数据可填充Word文档表格 |
| 生成的电子表格可作为专业邮件的附件 |
| 输出文件命名和目录结构约定 |
| 数据库导出为Excel用于报告展示 |
| CI/CD流水线中的自动化报告生成 |
Skill Type
技能类型
FLEXIBLE — Choose openpyxl for rich formatting and pandas for data analysis. Combine both when you need formatted reports from data analysis. Adapt file handling strategy to data volume.
灵活适配 — 需丰富格式设置时选择openpyxl,需数据分析时选择pandas。当你需要从数据分析结果生成带格式的报告时可结合使用两者。根据数据量调整文件处理策略。