XLSX Processing

XLSX 处理

Overview

概述

Manipulate Excel files programmatically using openpyxl for rich formatting and pandas for data analysis. This skill covers reading/writing spreadsheets, formulas, charts, conditional formatting, data validation, pivot table generation, CSV import/export, and strategies for handling large files.

Apply this skill whenever Excel files need to be created, read, transformed, or enriched through code rather than manual editing.

使用支持丰富格式设置的 openpyxl 和用于数据分析的 pandas 以编程方式操作Excel文件。该技能涵盖电子表格读写、公式、图表、条件格式、数据验证、数据透视表生成、CSV导入/导出，以及大文件处理策略。

当你需要通过代码而非手动编辑来创建、读取、转换或丰富Excel文件时，即可应用该技能。

Multi-Phase Process

多阶段流程

Phase 1: Requirements

阶段1：需求确认

Determine operation (read, write, transform, report)
Identify data sources and volume
Define formatting and formula requirements
Plan sheet structure and naming
Assess performance needs (row count, file size)

STOP — Do NOT begin implementation until you know the row count and whether formatting is needed (this determines library choice).

确定操作类型（读取、写入、转换、生成报告）
明确数据源和数据量
定义格式和公式要求
规划工作表结构和命名
评估性能需求（行数、文件大小）

停止 — 在明确行数和是否需要格式设置之前，请勿开始实现（这将决定库的选择）。

Phase 2: Implementation

阶段2：实现

Select library (see decision table)
Implement data loading and transformation
Apply formatting, formulas, and validation
Add charts and conditional formatting
Optimize for file size and memory

STOP — Do NOT skip memory optimization for files exceeding 10,000 rows.

选择库（参考决策表）
实现数据加载和转换
应用格式、公式和验证规则
添加图表和条件格式
针对文件大小和内存占用进行优化

停止 — 对于行数超过10000的文件，请勿跳过内存优化步骤。

Phase 3: Validation

阶段3：验证

Open in Excel, LibreOffice, and Google Sheets
Verify formulas calculate correctly
Check formatting renders consistently
Test with edge cases (empty data, max rows)
Validate data accuracy

在Excel、LibreOffice和Google Sheets中打开文件
验证公式计算结果正确
检查格式渲染一致性
使用边界场景测试（空数据、最大行数）
验证数据准确性

Library Decision Table

库选择决策表

Scenario	Library	Why
Rich formatting (colors, borders, fonts)	openpyxl	Full formatting API
Data analysis, aggregation, pivots	pandas	DataFrame operations
Formatted report from data analysis	pandas + openpyxl	Combine strengths
Reading data only, no formatting needed	pandas	Simplest API
Large file (> 10K rows), write-heavy	openpyxl write_only	Streaming writes, low memory
Large file (> 10K rows), read-heavy	openpyxl read_only	Streaming reads, low memory
CSV to/from Excel conversion	pandas	One-liner operations
Charts in spreadsheet	openpyxl	Chart API with full control

场景	库	选择原因
丰富格式设置（颜色、边框、字体）	openpyxl	提供完整的格式设置API
数据分析、聚合、数据透视	pandas	基于DataFrame的便捷操作
数据分析结果生成带格式的报告	pandas + openpyxl	结合两者优势
仅需读取数据、无需格式设置	pandas	API最简单
大文件（>1万行）、写入为主	openpyxl write_only 模式	流式写入、低内存占用
大文件（>1万行）、读取为主	openpyxl read_only 模式	流式读取、低内存占用
CSV与Excel格式互相转换	pandas	单命令即可完成操作
电子表格中添加图表	openpyxl	支持完全可控的图表API

openpyxl Patterns

openpyxl 常用模式

Creating a Workbook

创建工作簿

python

from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

wb = Workbook()
ws = wb.active
ws.title = "Report"

python

from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

wb = Workbook()
ws = wb.active
ws.title = "Report"

Header row

headers = ['Name', 'Department', 'Revenue', 'Target', 'Achievement'] header_font = Font(name='Calibri', size=11, bold=True, color='FFFFFF') header_fill = PatternFill(start_color='2F5496', end_color='2F5496', fill_type='solid') header_alignment = Alignment(horizontal='center', vertical='center') thin_border = Border( left=Side(style='thin'), right=Side(style='thin'), top=Side(style='thin'), bottom=Side(style='thin'), )

for col, header in enumerate(headers, 1): cell = ws.cell(row=1, column=col, value=header) cell.font = header_font cell.fill = header_fill cell.alignment = header_alignment cell.border = thin_border

headers = ['Name', 'Department', 'Revenue', 'Target', 'Achievement'] header_font = Font(name='Calibri', size=11, bold=True, color='FFFFFF') header_fill = PatternFill(start_color='2F5496', end_color='2F5496', fill_type='solid') header_alignment = Alignment(horizontal='center', vertical='center') thin_border = Border( left=Side(style='thin'), right=Side(style='thin'), top=Side(style='thin'), bottom=Side(style='thin'), )

for col, header in enumerate(headers, 1): cell = ws.cell(row=1, column=col, value=header) cell.font = header_font cell.fill = header_fill cell.alignment = header_alignment cell.border = thin_border

Data rows

for row_idx, row_data in enumerate(data, 2): for col_idx, value in enumerate(row_data, 1): cell = ws.cell(row=row_idx, column=col_idx, value=value) cell.border = thin_border

Auto-fit column widths

for col in range(1, len(headers) + 1): max_length = max( len(str(ws.cell(row=row, column=col).value or '')) for row in range(1, ws.max_row + 1) ) ws.column_dimensions[get_column_letter(col)].width = min(max_length + 2, 50)

Freeze header row

ws.freeze_panes = 'A2'

wb.save('report.xlsx')

undefined

ws.freeze_panes = 'A2'

wb.save('report.xlsx')

undefined

Formulas

公式使用

python

undefined

python

undefined

Basic formulas

ws['E2'] = '=C2/D2' # Division ws['F2'] = '=SUM(C2:C100)' # Sum ws['G2'] = '=AVERAGE(C2:C100)' # Average ws['H2'] = '=COUNTIF(E2:E100,">1")' # Count if ws['I2'] = '=IF(E2>=1,"Met","Below")' # Conditional ws['J2'] = '=VLOOKUP(A2,Sheet2!A:B,2,FALSE)' # Lookup

Array formula (Excel 365 dynamic array)

ws['K2'] = '=UNIQUE(A2:A100)'

Named range

from openpyxl.workbook.defined_name import DefinedName ref = f"Report!$C$2:$C${len(data)+1}" defn = DefinedName('RevenueRange', attr_text=ref) wb.defined_names.add(defn)

undefined

from openpyxl.workbook.defined_name import DefinedName ref = f"Report!$C$2:$C${len(data)+1}" defn = DefinedName('RevenueRange', attr_text=ref) wb.defined_names.add(defn)

undefined

Charts

图表使用

python

from openpyxl.chart import BarChart, LineChart, PieChart, Reference

python

from openpyxl.chart import BarChart, LineChart, PieChart, Reference

Bar chart

chart = BarChart() chart.type = 'col' chart.title = 'Revenue by Department' chart.y_axis.title = 'Revenue ($)' chart.x_axis.title = 'Department' chart.style = 10 # Built-in style

data_ref = Reference(ws, min_col=3, min_row=1, max_row=ws.max_row) cats_ref = Reference(ws, min_col=2, min_row=2, max_row=ws.max_row) chart.add_data(data_ref, titles_from_data=True) chart.set_categories(cats_ref) chart.width = 20 chart.height = 12

ws.add_chart(chart, 'G2')

chart = BarChart() chart.type = 'col' chart.title = 'Revenue by Department' chart.y_axis.title = 'Revenue ($)' chart.x_axis.title = 'Department' chart.style = 10 # Built-in style

data_ref = Reference(ws, min_col=3, min_row=1, max_row=ws.max_row) cats_ref = Reference(ws, min_col=2, min_row=2, max_row=ws.max_row) chart.add_data(data_ref, titles_from_data=True) chart.set_categories(cats_ref) chart.width = 20 chart.height = 12

ws.add_chart(chart, 'G2')

Line chart with multiple series

line = LineChart() line.title = 'Monthly Trends' for col in range(3, 6): values = Reference(ws, min_col=col, min_row=1, max_row=13) line.add_data(values, titles_from_data=True) cats = Reference(ws, min_col=1, min_row=2, max_row=13) line.set_categories(cats) ws.add_chart(line, 'G20')

undefined

line = LineChart() line.title = 'Monthly Trends' for col in range(3, 6): values = Reference(ws, min_col=col, min_row=1, max_row=13) line.add_data(values, titles_from_data=True) cats = Reference(ws, min_col=1, min_row=2, max_row=13) line.set_categories(cats) ws.add_chart(line, 'G20')

undefined

Conditional Formatting

条件格式

python

from openpyxl.formatting.rule import CellIsRule, ColorScaleRule, DataBarRule

python

from openpyxl.formatting.rule import CellIsRule, ColorScaleRule, DataBarRule

Highlight cells above threshold

ws.conditional_formatting.add( 'C2:C100', CellIsRule(operator='greaterThan', formula=['100000'], fill=PatternFill(bgColor='C6EFCE')) )

Red for below target

ws.conditional_formatting.add( 'E2:E100', CellIsRule(operator='lessThan', formula=['1'], fill=PatternFill(bgColor='FFC7CE'), font=Font(color='9C0006')) )

Color scale (green to red)

ws.conditional_formatting.add( 'E2:E100', ColorScaleRule( start_type='min', start_color='F8696B', mid_type='percentile', mid_value=50, mid_color='FFEB84', end_type='max', end_color='63BE7B' ) )

Data bars

ws.conditional_formatting.add( 'C2:C100', DataBarRule(start_type='min', end_type='max', color='638EC6') )

undefined

ws.conditional_formatting.add( 'C2:C100', DataBarRule(start_type='min', end_type='max', color='638EC6') )

undefined

Data Validation

数据验证

python

from openpyxl.worksheet.datavalidation import DataValidation

python

from openpyxl.worksheet.datavalidation import DataValidation

Dropdown list

dv = DataValidation(type='list', formula1='"Active,Inactive,Pending"', allow_blank=True) dv.error = 'Please select a valid status' dv.errorTitle = 'Invalid Entry' ws.add_data_validation(dv) dv.add('D2:D100')

Number range

nv = DataValidation(type='whole', operator='between', formula1=0, formula2=100) nv.error = 'Value must be between 0 and 100' ws.add_data_validation(nv) nv.add('F2:F100')

Date validation

dv_date = DataValidation(type='date', operator='greaterThan', formula1='2025-01-01') ws.add_data_validation(dv_date) dv_date.add('G2:G100')

undefined

dv_date = DataValidation(type='date', operator='greaterThan', formula1='2025-01-01') ws.add_data_validation(dv_date) dv_date.add('G2:G100')

undefined

pandas Patterns

pandas 常用模式

Reading Excel

读取Excel

python

import pandas as pd

python

import pandas as pd

Read single sheet

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

Read with options

df = pd.read_excel('data.xlsx', sheet_name='Sales', header=0, usecols='A:E', dtype={'ID': str, 'Revenue': float}, parse_dates=['Date'], na_values=['N/A', 'null', ''], )

Read all sheets

sheets = pd.read_excel('data.xlsx', sheet_name=None) # Dict of DataFrames

undefined

sheets = pd.read_excel('data.xlsx', sheet_name=None) # Dict of DataFrames

undefined

Writing Excel with pandas + openpyxl

结合pandas和openpyxl写入Excel

python

with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
    df_summary.to_excel(writer, sheet_name='Summary', index=False)
    df_detail.to_excel(writer, sheet_name='Detail', index=False)

    # Access openpyxl workbook for formatting
    wb = writer.book
    ws = wb['Summary']
    # Apply formatting...

python

with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
    df_summary.to_excel(writer, sheet_name='Summary', index=False)
    df_detail.to_excel(writer, sheet_name='Detail', index=False)

    # Access openpyxl workbook for formatting
    wb = writer.book
    ws = wb['Summary']
    # Apply formatting...

Pivot Tables

数据透视表

python

undefined

python

undefined

Create pivot table

pivot = pd.pivot_table( df, values='Revenue', index='Department', columns='Quarter', aggfunc='sum', margins=True, margins_name='Total' )

Write to Excel with formatting

pivot.to_excel(writer, sheet_name='Pivot')

undefined

pivot.to_excel(writer, sheet_name='Pivot')

undefined

CSV Import/Export

CSV 导入/导出

python

undefined

python

undefined

CSV to XLSX

df = pd.read_csv('data.csv', encoding='utf-8-sig') df.to_excel('output.xlsx', index=False)

XLSX to CSV

df = pd.read_excel('data.xlsx') df.to_csv('output.csv', index=False, encoding='utf-8-sig')

Handle encoding issues

df = pd.read_csv('data.csv', encoding='latin-1') # or 'cp1252'

undefined

df = pd.read_csv('data.csv', encoding='latin-1') # or 'cp1252'

undefined

Large File Handling

大文件处理

Memory-Efficient Reading

低内存读取

python

undefined

python

undefined

openpyxl read-only mode

from openpyxl import load_workbook

wb = load_workbook('large_file.xlsx', read_only=True) ws = wb.active

for row in ws.iter_rows(min_row=2, values_only=True): process_row(row)

wb.close()

undefined

from openpyxl import load_workbook

wb = load_workbook('large_file.xlsx', read_only=True) ws = wb.active

for row in ws.iter_rows(min_row=2, values_only=True): process_row(row)

wb.close()

undefined

Chunked Writing

分块写入

python

undefined

python

undefined

Write large datasets in chunks

from openpyxl import Workbook from openpyxl.utils.dataframe import dataframe_to_rows

wb = Workbook(write_only=True) ws = wb.create_sheet()

from openpyxl import Workbook from openpyxl.utils.dataframe import dataframe_to_rows

wb = Workbook(write_only=True) ws = wb.create_sheet()

Write header

ws.append(headers)

Write in chunks

chunk_size = 10000 for chunk in pd.read_csv('large.csv', chunksize=chunk_size): for row in dataframe_to_rows(chunk, index=False, header=False): ws.append(row)

wb.save('output.xlsx')

undefined

chunk_size = 10000 for chunk in pd.read_csv('large.csv', chunksize=chunk_size): for row in dataframe_to_rows(chunk, index=False, header=False): ws.append(row)

wb.save('output.xlsx')

undefined

Performance Decision Table

性能决策表

Rows	Strategy	Notes
< 10,000	Standard openpyxl or pandas	Full formatting available
10K - 100K	write_only / read_only mode, chunked	Limited formatting in write_only
100K - 1M	write_only mode, consider CSV instead	Near Excel row limit
> 1M	Use CSV or Parquet, not XLSX	Excel limit: 1,048,576 rows

行数	处理策略	说明
< 10,000	标准openpyxl或pandas模式	支持完整格式设置
1万 - 10万	write_only / read_only 模式、分块处理	write_only模式下格式设置功能有限
10万 - 100万	write_only 模式，优先考虑使用CSV	接近Excel的行数上限
> 100万	使用CSV或Parquet格式，不要用XLSX	Excel上限为1,048,576行

Anti-Patterns / Common Mistakes

反模式/常见错误

Anti-Pattern	Why It Fails	What To Do Instead
openpyxl for pure data analysis	Verbose and slow for analytics	Use pandas for data operations
Loading large files into memory	Memory exhaustion, crashes	Use read_only / write_only modes
Hardcoding row/column numbers	Breaks when data shape changes	Calculate from data length
Inconsistent date formats	Dates render as numbers or strings	Set number_format explicitly
Not closing read_only workbooks	Resource leaks	Always call `wb.close()` or use context manager
Using .xls format	Legacy, limited, security risks	Always use .xlsx
Formatting cells one by one	Extremely slow for large ranges	Apply styles to ranges or use named styles
Not testing in actual Excel	Features render differently	Test in Excel, LibreOffice, and Google Sheets
Forgetting to freeze header row	Poor UX when scrolling large data	Always freeze panes for data sheets

反模式	失效原因	替代方案
使用openpyxl做纯数据分析	分析操作繁琐且速度慢	使用pandas进行数据操作
直接将大文件加载到内存	内存耗尽、程序崩溃	使用read_only / write_only 模式
硬编码行/列号	数据结构变化时代码失效	根据数据长度动态计算
日期格式不一致	日期会显示为数字或字符串	显式设置number_format
不关闭read_only模式的工作簿	资源泄漏	始终调用 `wb.close()` 或使用上下文管理器
使用.xls格式	老旧、功能受限、存在安全风险	始终使用.xlsx格式
逐个单元格设置格式	大范围设置时速度极慢	对范围应用样式或使用命名样式
不在实际Excel软件中测试	不同软件功能渲染效果不同	在Excel、LibreOffice和Google Sheets中都进行测试
忘记冻结表头行	滚动查看大数据时用户体验差	数据表始终冻结窗格固定表头

Anti-Rationalization Guards

反合理化约束

Do NOT use openpyxl for data analysis that pandas handles in one line.
Do NOT skip the row count assessment -- it determines your entire approach.
Do NOT assume standard mode works for files over 10K rows -- use streaming modes.
Do NOT test only in one spreadsheet application -- formatting varies.
Do NOT forget to close workbooks opened in read_only mode.

不要使用openpyxl处理pandas一行代码就能完成的数据分析任务。
不要跳过行数评估步骤——它决定了你的整体实现方案。
不要假设标准模式适用于超过1万行的文件——使用流式模式。
不要仅在一个电子表格应用中测试——不同软件的格式渲染存在差异。
不要忘记关闭以read_only模式打开的工作簿。

Integration Points

集成点

Skill	How It Connects
`pdf-processing`	Excel data feeds into PDF report generation
`docx-processing`	Excel data populates Word document tables
`email-composer`	Generated spreadsheets attach to professional emails
`file-organizer`	Output file naming and directory structure conventions
`database-schema-design`	Database exports to Excel for reporting
`deployment`	Automated report generation in CI/CD pipelines

技能	关联方式
`pdf-processing`	Excel数据可用于PDF报告生成
`docx-processing`	Excel数据可填充Word文档表格
`email-composer`	生成的电子表格可作为专业邮件的附件
`file-organizer`	输出文件命名和目录结构约定
`database-schema-design`	数据库导出为Excel用于报告展示
`deployment`	CI/CD流水线中的自动化报告生成

Skill Type

技能类型

FLEXIBLE — Choose openpyxl for rich formatting and pandas for data analysis. Combine both when you need formatted reports from data analysis. Adapt file handling strategy to data volume.

灵活适配 — 需丰富格式设置时选择openpyxl，需数据分析时选择pandas。当你需要从数据分析结果生成带格式的报告时可结合使用两者。根据数据量调整文件处理策略。

xlsx-processing

Original

Translation

XLSX Processing

XLSX 处理

Overview

概述

Multi-Phase Process

多阶段流程

Phase 1: Requirements

阶段1：需求确认

Phase 2: Implementation

阶段2：实现

Phase 3: Validation

阶段3：验证

Library Decision Table

库选择决策表

openpyxl Patterns

openpyxl 常用模式

Creating a Workbook

创建工作簿

Header row

Header row

Data rows

Data rows

Auto-fit column widths

Auto-fit column widths

Freeze header row

Freeze header row

Formulas

公式使用

Basic formulas

Basic formulas

Array formula (Excel 365 dynamic array)

Array formula (Excel 365 dynamic array)

Named range

Named range

Charts

图表使用

Bar chart

Bar chart

Line chart with multiple series

Line chart with multiple series

Conditional Formatting

条件格式

Highlight cells above threshold

Highlight cells above threshold

Red for below target

Red for below target

Color scale (green to red)

Color scale (green to red)

Data bars

Data bars

Data Validation

数据验证

Dropdown list

Dropdown list

Number range

Number range

Date validation

Date validation

pandas Patterns

pandas 常用模式

Reading Excel

读取Excel

Read single sheet

Read single sheet

Read with options

Read with options

Read all sheets

Read all sheets

Writing Excel with pandas + openpyxl

结合pandas和openpyxl写入Excel

Pivot Tables

数据透视表

Create pivot table

Create pivot table

Write to Excel with formatting

Write to Excel with formatting

CSV Import/Export