xlsx-processing

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

XLSX Processing

XLSX 处理

Overview

概述

Manipulate Excel files programmatically using openpyxl for rich formatting and pandas for data analysis. This skill covers reading/writing spreadsheets, formulas, charts, conditional formatting, data validation, pivot table generation, CSV import/export, and strategies for handling large files.
Apply this skill whenever Excel files need to be created, read, transformed, or enriched through code rather than manual editing.
使用支持丰富格式设置的 openpyxl 和用于数据分析的 pandas 以编程方式操作Excel文件。该技能涵盖电子表格读写、公式、图表、条件格式、数据验证、数据透视表生成、CSV导入/导出,以及大文件处理策略。
当你需要通过代码而非手动编辑来创建、读取、转换或丰富Excel文件时,即可应用该技能。

Multi-Phase Process

多阶段流程

Phase 1: Requirements

阶段1:需求确认

  1. Determine operation (read, write, transform, report)
  2. Identify data sources and volume
  3. Define formatting and formula requirements
  4. Plan sheet structure and naming
  5. Assess performance needs (row count, file size)
STOP — Do NOT begin implementation until you know the row count and whether formatting is needed (this determines library choice).
  1. 确定操作类型(读取、写入、转换、生成报告)
  2. 明确数据源和数据量
  3. 定义格式和公式要求
  4. 规划工作表结构和命名
  5. 评估性能需求(行数、文件大小)
停止 — 在明确行数和是否需要格式设置之前,请勿开始实现(这将决定库的选择)。

Phase 2: Implementation

阶段2:实现

  1. Select library (see decision table)
  2. Implement data loading and transformation
  3. Apply formatting, formulas, and validation
  4. Add charts and conditional formatting
  5. Optimize for file size and memory
STOP — Do NOT skip memory optimization for files exceeding 10,000 rows.
  1. 选择库(参考决策表)
  2. 实现数据加载和转换
  3. 应用格式、公式和验证规则
  4. 添加图表和条件格式
  5. 针对文件大小和内存占用进行优化
停止 — 对于行数超过10000的文件,请勿跳过内存优化步骤。

Phase 3: Validation

阶段3:验证

  1. Open in Excel, LibreOffice, and Google Sheets
  2. Verify formulas calculate correctly
  3. Check formatting renders consistently
  4. Test with edge cases (empty data, max rows)
  5. Validate data accuracy
  1. 在Excel、LibreOffice和Google Sheets中打开文件
  2. 验证公式计算结果正确
  3. 检查格式渲染一致性
  4. 使用边界场景测试(空数据、最大行数)
  5. 验证数据准确性

Library Decision Table

库选择决策表

ScenarioLibraryWhy
Rich formatting (colors, borders, fonts)openpyxlFull formatting API
Data analysis, aggregation, pivotspandasDataFrame operations
Formatted report from data analysispandas + openpyxlCombine strengths
Reading data only, no formatting neededpandasSimplest API
Large file (> 10K rows), write-heavyopenpyxl write_onlyStreaming writes, low memory
Large file (> 10K rows), read-heavyopenpyxl read_onlyStreaming reads, low memory
CSV to/from Excel conversionpandasOne-liner operations
Charts in spreadsheetopenpyxlChart API with full control
场景选择原因
丰富格式设置(颜色、边框、字体)openpyxl提供完整的格式设置API
数据分析、聚合、数据透视pandas基于DataFrame的便捷操作
数据分析结果生成带格式的报告pandas + openpyxl结合两者优势
仅需读取数据、无需格式设置pandasAPI最简单
大文件(>1万行)、写入为主openpyxl write_only 模式流式写入、低内存占用
大文件(>1万行)、读取为主openpyxl read_only 模式流式读取、低内存占用
CSV与Excel格式互相转换pandas单命令即可完成操作
电子表格中添加图表openpyxl支持完全可控的图表API

openpyxl Patterns

openpyxl 常用模式

Creating a Workbook

创建工作簿

python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

wb = Workbook()
ws = wb.active
ws.title = "Report"
python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

wb = Workbook()
ws = wb.active
ws.title = "Report"

Header row

Header row

headers = ['Name', 'Department', 'Revenue', 'Target', 'Achievement'] header_font = Font(name='Calibri', size=11, bold=True, color='FFFFFF') header_fill = PatternFill(start_color='2F5496', end_color='2F5496', fill_type='solid') header_alignment = Alignment(horizontal='center', vertical='center') thin_border = Border( left=Side(style='thin'), right=Side(style='thin'), top=Side(style='thin'), bottom=Side(style='thin'), )
for col, header in enumerate(headers, 1): cell = ws.cell(row=1, column=col, value=header) cell.font = header_font cell.fill = header_fill cell.alignment = header_alignment cell.border = thin_border
headers = ['Name', 'Department', 'Revenue', 'Target', 'Achievement'] header_font = Font(name='Calibri', size=11, bold=True, color='FFFFFF') header_fill = PatternFill(start_color='2F5496', end_color='2F5496', fill_type='solid') header_alignment = Alignment(horizontal='center', vertical='center') thin_border = Border( left=Side(style='thin'), right=Side(style='thin'), top=Side(style='thin'), bottom=Side(style='thin'), )
for col, header in enumerate(headers, 1): cell = ws.cell(row=1, column=col, value=header) cell.font = header_font cell.fill = header_fill cell.alignment = header_alignment cell.border = thin_border

Data rows

Data rows

for row_idx, row_data in enumerate(data, 2): for col_idx, value in enumerate(row_data, 1): cell = ws.cell(row=row_idx, column=col_idx, value=value) cell.border = thin_border
for row_idx, row_data in enumerate(data, 2): for col_idx, value in enumerate(row_data, 1): cell = ws.cell(row=row_idx, column=col_idx, value=value) cell.border = thin_border

Auto-fit column widths

Auto-fit column widths

for col in range(1, len(headers) + 1): max_length = max( len(str(ws.cell(row=row, column=col).value or '')) for row in range(1, ws.max_row + 1) ) ws.column_dimensions[get_column_letter(col)].width = min(max_length + 2, 50)
for col in range(1, len(headers) + 1): max_length = max( len(str(ws.cell(row=row, column=col).value or '')) for row in range(1, ws.max_row + 1) ) ws.column_dimensions[get_column_letter(col)].width = min(max_length + 2, 50)

Freeze header row

Freeze header row

ws.freeze_panes = 'A2'
wb.save('report.xlsx')
undefined
ws.freeze_panes = 'A2'
wb.save('report.xlsx')
undefined

Formulas

公式使用

python
undefined
python
undefined

Basic formulas

Basic formulas

ws['E2'] = '=C2/D2' # Division ws['F2'] = '=SUM(C2:C100)' # Sum ws['G2'] = '=AVERAGE(C2:C100)' # Average ws['H2'] = '=COUNTIF(E2:E100,">1")' # Count if ws['I2'] = '=IF(E2>=1,"Met","Below")' # Conditional ws['J2'] = '=VLOOKUP(A2,Sheet2!A:B,2,FALSE)' # Lookup
ws['E2'] = '=C2/D2' # Division ws['F2'] = '=SUM(C2:C100)' # Sum ws['G2'] = '=AVERAGE(C2:C100)' # Average ws['H2'] = '=COUNTIF(E2:E100,">1")' # Count if ws['I2'] = '=IF(E2>=1,"Met","Below")' # Conditional ws['J2'] = '=VLOOKUP(A2,Sheet2!A:B,2,FALSE)' # Lookup

Array formula (Excel 365 dynamic array)

Array formula (Excel 365 dynamic array)

ws['K2'] = '=UNIQUE(A2:A100)'
ws['K2'] = '=UNIQUE(A2:A100)'

Named range

Named range

from openpyxl.workbook.defined_name import DefinedName ref = f"Report!$C$2:$C${len(data)+1}" defn = DefinedName('RevenueRange', attr_text=ref) wb.defined_names.add(defn)
undefined
from openpyxl.workbook.defined_name import DefinedName ref = f"Report!$C$2:$C${len(data)+1}" defn = DefinedName('RevenueRange', attr_text=ref) wb.defined_names.add(defn)
undefined

Charts

图表使用

python
from openpyxl.chart import BarChart, LineChart, PieChart, Reference
python
from openpyxl.chart import BarChart, LineChart, PieChart, Reference

Bar chart

Bar chart

chart = BarChart() chart.type = 'col' chart.title = 'Revenue by Department' chart.y_axis.title = 'Revenue ($)' chart.x_axis.title = 'Department' chart.style = 10 # Built-in style
data_ref = Reference(ws, min_col=3, min_row=1, max_row=ws.max_row) cats_ref = Reference(ws, min_col=2, min_row=2, max_row=ws.max_row) chart.add_data(data_ref, titles_from_data=True) chart.set_categories(cats_ref) chart.width = 20 chart.height = 12
ws.add_chart(chart, 'G2')
chart = BarChart() chart.type = 'col' chart.title = 'Revenue by Department' chart.y_axis.title = 'Revenue ($)' chart.x_axis.title = 'Department' chart.style = 10 # Built-in style
data_ref = Reference(ws, min_col=3, min_row=1, max_row=ws.max_row) cats_ref = Reference(ws, min_col=2, min_row=2, max_row=ws.max_row) chart.add_data(data_ref, titles_from_data=True) chart.set_categories(cats_ref) chart.width = 20 chart.height = 12
ws.add_chart(chart, 'G2')

Line chart with multiple series

Line chart with multiple series

line = LineChart() line.title = 'Monthly Trends' for col in range(3, 6): values = Reference(ws, min_col=col, min_row=1, max_row=13) line.add_data(values, titles_from_data=True) cats = Reference(ws, min_col=1, min_row=2, max_row=13) line.set_categories(cats) ws.add_chart(line, 'G20')
undefined
line = LineChart() line.title = 'Monthly Trends' for col in range(3, 6): values = Reference(ws, min_col=col, min_row=1, max_row=13) line.add_data(values, titles_from_data=True) cats = Reference(ws, min_col=1, min_row=2, max_row=13) line.set_categories(cats) ws.add_chart(line, 'G20')
undefined

Conditional Formatting

条件格式

python
from openpyxl.formatting.rule import CellIsRule, ColorScaleRule, DataBarRule
python
from openpyxl.formatting.rule import CellIsRule, ColorScaleRule, DataBarRule

Highlight cells above threshold

Highlight cells above threshold

ws.conditional_formatting.add( 'C2:C100', CellIsRule(operator='greaterThan', formula=['100000'], fill=PatternFill(bgColor='C6EFCE')) )
ws.conditional_formatting.add( 'C2:C100', CellIsRule(operator='greaterThan', formula=['100000'], fill=PatternFill(bgColor='C6EFCE')) )

Red for below target

Red for below target

ws.conditional_formatting.add( 'E2:E100', CellIsRule(operator='lessThan', formula=['1'], fill=PatternFill(bgColor='FFC7CE'), font=Font(color='9C0006')) )
ws.conditional_formatting.add( 'E2:E100', CellIsRule(operator='lessThan', formula=['1'], fill=PatternFill(bgColor='FFC7CE'), font=Font(color='9C0006')) )

Color scale (green to red)

Color scale (green to red)

ws.conditional_formatting.add( 'E2:E100', ColorScaleRule( start_type='min', start_color='F8696B', mid_type='percentile', mid_value=50, mid_color='FFEB84', end_type='max', end_color='63BE7B' ) )
ws.conditional_formatting.add( 'E2:E100', ColorScaleRule( start_type='min', start_color='F8696B', mid_type='percentile', mid_value=50, mid_color='FFEB84', end_type='max', end_color='63BE7B' ) )

Data bars

Data bars

ws.conditional_formatting.add( 'C2:C100', DataBarRule(start_type='min', end_type='max', color='638EC6') )
undefined
ws.conditional_formatting.add( 'C2:C100', DataBarRule(start_type='min', end_type='max', color='638EC6') )
undefined

Data Validation

数据验证

python
from openpyxl.worksheet.datavalidation import DataValidation
python
from openpyxl.worksheet.datavalidation import DataValidation

Dropdown list

Dropdown list

dv = DataValidation(type='list', formula1='"Active,Inactive,Pending"', allow_blank=True) dv.error = 'Please select a valid status' dv.errorTitle = 'Invalid Entry' ws.add_data_validation(dv) dv.add('D2:D100')
dv = DataValidation(type='list', formula1='"Active,Inactive,Pending"', allow_blank=True) dv.error = 'Please select a valid status' dv.errorTitle = 'Invalid Entry' ws.add_data_validation(dv) dv.add('D2:D100')

Number range

Number range

nv = DataValidation(type='whole', operator='between', formula1=0, formula2=100) nv.error = 'Value must be between 0 and 100' ws.add_data_validation(nv) nv.add('F2:F100')
nv = DataValidation(type='whole', operator='between', formula1=0, formula2=100) nv.error = 'Value must be between 0 and 100' ws.add_data_validation(nv) nv.add('F2:F100')

Date validation

Date validation

dv_date = DataValidation(type='date', operator='greaterThan', formula1='2025-01-01') ws.add_data_validation(dv_date) dv_date.add('G2:G100')
undefined
dv_date = DataValidation(type='date', operator='greaterThan', formula1='2025-01-01') ws.add_data_validation(dv_date) dv_date.add('G2:G100')
undefined

pandas Patterns

pandas 常用模式

Reading Excel

读取Excel

python
import pandas as pd
python
import pandas as pd

Read single sheet

Read single sheet

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

Read with options

Read with options

df = pd.read_excel('data.xlsx', sheet_name='Sales', header=0, usecols='A:E', dtype={'ID': str, 'Revenue': float}, parse_dates=['Date'], na_values=['N/A', 'null', ''], )
df = pd.read_excel('data.xlsx', sheet_name='Sales', header=0, usecols='A:E', dtype={'ID': str, 'Revenue': float}, parse_dates=['Date'], na_values=['N/A', 'null', ''], )

Read all sheets

Read all sheets

sheets = pd.read_excel('data.xlsx', sheet_name=None) # Dict of DataFrames
undefined
sheets = pd.read_excel('data.xlsx', sheet_name=None) # Dict of DataFrames
undefined

Writing Excel with pandas + openpyxl

结合pandas和openpyxl写入Excel

python
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
    df_summary.to_excel(writer, sheet_name='Summary', index=False)
    df_detail.to_excel(writer, sheet_name='Detail', index=False)

    # Access openpyxl workbook for formatting
    wb = writer.book
    ws = wb['Summary']
    # Apply formatting...
python
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
    df_summary.to_excel(writer, sheet_name='Summary', index=False)
    df_detail.to_excel(writer, sheet_name='Detail', index=False)

    # Access openpyxl workbook for formatting
    wb = writer.book
    ws = wb['Summary']
    # Apply formatting...

Pivot Tables

数据透视表

python
undefined
python
undefined

Create pivot table

Create pivot table

pivot = pd.pivot_table( df, values='Revenue', index='Department', columns='Quarter', aggfunc='sum', margins=True, margins_name='Total' )
pivot = pd.pivot_table( df, values='Revenue', index='Department', columns='Quarter', aggfunc='sum', margins=True, margins_name='Total' )

Write to Excel with formatting

Write to Excel with formatting

pivot.to_excel(writer, sheet_name='Pivot')
undefined
pivot.to_excel(writer, sheet_name='Pivot')
undefined

CSV Import/Export

CSV 导入/导出

python
undefined
python
undefined

CSV to XLSX

CSV to XLSX

df = pd.read_csv('data.csv', encoding='utf-8-sig') df.to_excel('output.xlsx', index=False)
df = pd.read_csv('data.csv', encoding='utf-8-sig') df.to_excel('output.xlsx', index=False)

XLSX to CSV

XLSX to CSV

df = pd.read_excel('data.xlsx') df.to_csv('output.csv', index=False, encoding='utf-8-sig')
df = pd.read_excel('data.xlsx') df.to_csv('output.csv', index=False, encoding='utf-8-sig')

Handle encoding issues

Handle encoding issues

df = pd.read_csv('data.csv', encoding='latin-1') # or 'cp1252'
undefined
df = pd.read_csv('data.csv', encoding='latin-1') # or 'cp1252'
undefined

Large File Handling

大文件处理

Memory-Efficient Reading

低内存读取

python
undefined
python
undefined

openpyxl read-only mode

openpyxl read-only mode

from openpyxl import load_workbook
wb = load_workbook('large_file.xlsx', read_only=True) ws = wb.active
for row in ws.iter_rows(min_row=2, values_only=True): process_row(row)
wb.close()
undefined
from openpyxl import load_workbook
wb = load_workbook('large_file.xlsx', read_only=True) ws = wb.active
for row in ws.iter_rows(min_row=2, values_only=True): process_row(row)
wb.close()
undefined

Chunked Writing

分块写入

python
undefined
python
undefined

Write large datasets in chunks

Write large datasets in chunks

from openpyxl import Workbook from openpyxl.utils.dataframe import dataframe_to_rows
wb = Workbook(write_only=True) ws = wb.create_sheet()
from openpyxl import Workbook from openpyxl.utils.dataframe import dataframe_to_rows
wb = Workbook(write_only=True) ws = wb.create_sheet()

Write header

Write header

ws.append(headers)
ws.append(headers)

Write in chunks

Write in chunks

chunk_size = 10000 for chunk in pd.read_csv('large.csv', chunksize=chunk_size): for row in dataframe_to_rows(chunk, index=False, header=False): ws.append(row)
wb.save('output.xlsx')
undefined
chunk_size = 10000 for chunk in pd.read_csv('large.csv', chunksize=chunk_size): for row in dataframe_to_rows(chunk, index=False, header=False): ws.append(row)
wb.save('output.xlsx')
undefined

Performance Decision Table

性能决策表

RowsStrategyNotes
< 10,000Standard openpyxl or pandasFull formatting available
10K - 100Kwrite_only / read_only mode, chunkedLimited formatting in write_only
100K - 1Mwrite_only mode, consider CSV insteadNear Excel row limit
> 1MUse CSV or Parquet, not XLSXExcel limit: 1,048,576 rows
行数处理策略说明
< 10,000标准openpyxl或pandas模式支持完整格式设置
1万 - 10万write_only / read_only 模式、分块处理write_only模式下格式设置功能有限
10万 - 100万write_only 模式,优先考虑使用CSV接近Excel的行数上限
> 100万使用CSV或Parquet格式,不要用XLSXExcel上限为1,048,576行

Anti-Patterns / Common Mistakes

反模式/常见错误

Anti-PatternWhy It FailsWhat To Do Instead
openpyxl for pure data analysisVerbose and slow for analyticsUse pandas for data operations
Loading large files into memoryMemory exhaustion, crashesUse read_only / write_only modes
Hardcoding row/column numbersBreaks when data shape changesCalculate from data length
Inconsistent date formatsDates render as numbers or stringsSet number_format explicitly
Not closing read_only workbooksResource leaksAlways call
wb.close()
or use context manager
Using .xls formatLegacy, limited, security risksAlways use .xlsx
Formatting cells one by oneExtremely slow for large rangesApply styles to ranges or use named styles
Not testing in actual ExcelFeatures render differentlyTest in Excel, LibreOffice, and Google Sheets
Forgetting to freeze header rowPoor UX when scrolling large dataAlways freeze panes for data sheets
反模式失效原因替代方案
使用openpyxl做纯数据分析分析操作繁琐且速度慢使用pandas进行数据操作
直接将大文件加载到内存内存耗尽、程序崩溃使用read_only / write_only 模式
硬编码行/列号数据结构变化时代码失效根据数据长度动态计算
日期格式不一致日期会显示为数字或字符串显式设置number_format
不关闭read_only模式的工作簿资源泄漏始终调用
wb.close()
或使用上下文管理器
使用.xls格式老旧、功能受限、存在安全风险始终使用.xlsx格式
逐个单元格设置格式大范围设置时速度极慢对范围应用样式或使用命名样式
不在实际Excel软件中测试不同软件功能渲染效果不同在Excel、LibreOffice和Google Sheets中都进行测试
忘记冻结表头行滚动查看大数据时用户体验差数据表始终冻结窗格固定表头

Anti-Rationalization Guards

反合理化约束

  • Do NOT use openpyxl for data analysis that pandas handles in one line.
  • Do NOT skip the row count assessment -- it determines your entire approach.
  • Do NOT assume standard mode works for files over 10K rows -- use streaming modes.
  • Do NOT test only in one spreadsheet application -- formatting varies.
  • Do NOT forget to close workbooks opened in read_only mode.
  • 不要使用openpyxl处理pandas一行代码就能完成的数据分析任务。
  • 不要跳过行数评估步骤——它决定了你的整体实现方案。
  • 不要假设标准模式适用于超过1万行的文件——使用流式模式。
  • 不要仅在一个电子表格应用中测试——不同软件的格式渲染存在差异。
  • 不要忘记关闭以read_only模式打开的工作簿。

Integration Points

集成点

SkillHow It Connects
pdf-processing
Excel data feeds into PDF report generation
docx-processing
Excel data populates Word document tables
email-composer
Generated spreadsheets attach to professional emails
file-organizer
Output file naming and directory structure conventions
database-schema-design
Database exports to Excel for reporting
deployment
Automated report generation in CI/CD pipelines
技能关联方式
pdf-processing
Excel数据可用于PDF报告生成
docx-processing
Excel数据可填充Word文档表格
email-composer
生成的电子表格可作为专业邮件的附件
file-organizer
输出文件命名和目录结构约定
database-schema-design
数据库导出为Excel用于报告展示
deployment
CI/CD流水线中的自动化报告生成

Skill Type

技能类型

FLEXIBLE — Choose openpyxl for rich formatting and pandas for data analysis. Combine both when you need formatted reports from data analysis. Adapt file handling strategy to data volume.
灵活适配 — 需丰富格式设置时选择openpyxl,需数据分析时选择pandas。当你需要从数据分析结果生成带格式的报告时可结合使用两者。根据数据量调整文件处理策略。