pandas-best-practices

Pandas Best Practices

Expert guidelines for Pandas development, focusing on data manipulation, analysis, and efficient DataFrame operations.

Code Style and Structure

Write concise, technical responses with accurate Python examples
Prioritize reproducibility in data analysis workflows
Use functional programming; avoid unnecessary classes
Prefer vectorized operations over explicit loops
Use descriptive variable names reflecting data content
Follow PEP 8 style guidelines

DataFrame Creation and I/O

Use `pd.read_csv()`, `pd.read_excel()`, `pd.read_json()` with appropriate parameters
Specify the `dtype` parameter to ensure correct data types on load
Use `parse_dates` for automatic datetime parsing
Set `index_col` when the data has a natural index column
Use `chunksize` for reading large files incrementally
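
The loading options above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the CSV content, the column names, and the chunk size are all made up for the example, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """order_id,customer,amount,ordered_at
1,alice,10.5,2024-01-01
2,bob,20.0,2024-01-02
3,alice,7.25,2024-01-03
"""

# Declare types, parse the date column, and use the natural key as the index.
orders = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"customer": "string", "amount": "float64"},
    parse_dates=["ordered_at"],
    index_col="order_id",
)

# With chunksize, read_csv yields regular DataFrames of at most that many rows,
# so an aggregate can be built without holding the whole file in memory.
total_amount = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_amount += chunk["amount"].sum()
```

For a real file, the `io.StringIO(...)` argument would be replaced by a path, and the chunk size would be chosen to fit available memory.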

Data Selection

Use `.loc[]` for label-based indexing
Use `.iloc[]` for integer position-based indexing
Avoid chained indexing (e.g., `df['col'][0]`) - use `.loc` or `.iloc` instead
Use boolean indexing for conditional selection: `df[df['col'] > value]`
Use the `.query()` method for complex filtering conditions
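
A short sketch of these selection styles on a toy frame may help. The `scores` data, its labels, and the `threshold` variable are invented for illustration.

```python
import pandas as pd

scores = pd.DataFrame(
    {"name": ["ann", "ben", "cat"], "score": [82, 67, 91]},
    index=["a", "b", "c"],
)

by_label = scores.loc["b", "score"]   # label-based: row "b", column "score"
by_position = scores.iloc[0, 1]       # position-based: first row, second column

# One .loc call selects rows and column together, instead of chaining
# scores["score"][...] = ..., which may write to a temporary copy.
scores.loc[scores["score"] < 70, "score"] = 70

threshold = 80
high = scores[scores["score"] > threshold]                    # boolean indexing
mid_high = scores.query("score > @threshold and score < 90")  # same idea via query
```

Inside `.query()`, the `@` prefix refers to a Python variable in the calling scope rather than a column.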

Method Chaining

Prefer method chaining for data transformations when possible
Use `.pipe()` for applying custom functions in a chain
Chain operations like `.assign()`, `.query()`, `.groupby()`, `.agg()`
Keep chains readable by breaking across multiple lines
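
As a rough illustration of such a chain, the sketch below wraps the whole expression in parentheses so each step sits on its own line. The `sales` data and the helper `add_revenue` are hypothetical names introduced only for this example.

```python
import pandas as pd


def add_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a revenue column (price * quantity)."""
    return df.assign(revenue=df["price"] * df["quantity"])


sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "price": [10.0, 20.0, 5.0, 8.0],
        "quantity": [3, 1, 4, 0],
    }
)

revenue_by_region = (
    sales
    .pipe(add_revenue)            # custom function slotted into the chain
    .query("quantity > 0")        # drop rows with nothing sold
    .groupby("region")
    .agg(total_revenue=("revenue", "sum"))
)
```

Because every step returns a new object, the original `sales` frame is left unchanged.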

Data Cleaning and Validation

Missing Data

Check for missing data with `.isna()` and `.info()`
Handle missing data appropriately: `.fillna()`, `.dropna()`, or imputation
Use `pd.NA` for nullable integer and boolean types
Document decisions about missing data handling
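
The following sketch shows one way these checks and fixes could look. The `readings` frame is invented, and median imputation is only one possible choice; the right strategy depends on the data.

```python
import pandas as pd

readings = pd.DataFrame(
    {
        "sensor": ["a", "b", "c", "d"],
        "temperature": [20.5, None, 22.0, None],
        # Nullable integer dtype: the missing value is stored as pd.NA.
        "event_count": pd.array([1, None, 3, 4], dtype="Int64"),
    }
)

# Count missing values per column.
missing_per_column = readings.isna().sum()

# Decision (documented here): impute temperature with the column median.
median_temperature = readings["temperature"].median()
filled = readings.assign(
    temperature=readings["temperature"].fillna(median_temperature)
)

# Alternative: keep only rows with no missing values at all.
complete_rows = readings.dropna()
```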

Data Quality Checks

Implement data quality checks at the beginning of analysis
Validate data types with `.dtypes` and convert as needed
Check for duplicates with `.duplicated()` and handle appropriately
Use `.describe()` for quick statistical overview
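
A minimal up-front check along these lines might look like the sketch below. The `customers` frame is hypothetical, and dropping duplicates is shown as one possible handling, not the only one.

```python
import pandas as pd

customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3],
        "age": ["34", "41", "41", "29"],  # numbers that arrived as text
    }
)

dtypes_before = customers.dtypes         # inspect what was actually loaded
duplicate_mask = customers.duplicated()  # True for repeats of an earlier row

# Drop exact duplicates, then convert age to a numeric type.
deduplicated = customers.drop_duplicates().astype({"age": "int64"})

summary = deduplicated.describe()        # count, mean, std, min, quartiles, max
```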

Type Conversion

Use `.astype()` for explicit type conversion
Use `pd.to_datetime()` for date parsing
Use `pd.to_numeric()` with `errors='coerce'` for safe numeric conversion
Utilize categorical data types for low-cardinality string columns
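
The conversions listed above can be combined in a single `.assign()` call, as in this sketch. The `raw` frame and its column names are made up for the example.

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "price": ["1.5", "2", "n/a"],
        "signup": ["2024-01-05", "2024-02-10", "2024-03-15"],
        "plan": ["free", "pro", "free"],
        "visits": [3, 7, 2],
    }
)

typed = raw.assign(
    # Unparseable entries such as "n/a" become NaN instead of raising.
    price=pd.to_numeric(raw["price"], errors="coerce"),
    signup=pd.to_datetime(raw["signup"]),
    # Few distinct values, so a categorical is a natural fit.
    plan=raw["plan"].astype("category"),
    visits=raw["visits"].astype("int32"),
)
```

Note that `errors="coerce"` silently turns bad values into missing ones, so it is worth counting the resulting NaN values afterwards.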

Grouping and Aggregation

GroupBy Operations

Use `.groupby()` for efficient aggregation operations
Specify aggregation functions with `.agg()` for multiple operations
Use named aggregation for clearer output column names
Consider `.transform()` for broadcasting results back to original shape
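
To make the difference between `.agg()` and `.transform()` concrete, here is a small sketch on invented `orders` data.

```python
import pandas as pd

orders = pd.DataFrame(
    {
        "customer": ["ann", "ben", "ann", "ben", "cat"],
        "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
    }
)

# Named aggregation: output_column=(input_column, function).
# The result has one row per customer.
per_customer = orders.groupby("customer").agg(
    total_amount=("amount", "sum"),
    order_count=("amount", "count"),
)

# transform returns one value per original row, aligned to the original index,
# so it can be assigned straight back as a new column.
orders["customer_mean"] = orders.groupby("customer")["amount"].transform("mean")
```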

Pivot Tables and Reshaping

Use `.pivot_table()` for multi-dimensional aggregation
Use `.melt()` to convert wide to long format
Use `.pivot()` to convert long to wide format
Use `.stack()` and `.unstack()` for hierarchical index manipulation
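
A round trip between long and wide layouts could be sketched like this. The `long_sales` frame and its store and quarter labels are illustrative only.

```python
import pandas as pd

long_sales = pd.DataFrame(
    {
        "store": ["s1", "s1", "s2", "s2"],
        "quarter": ["q1", "q2", "q1", "q2"],
        "sales": [100, 120, 80, 90],
    }
)

# Long to wide: one row per store, one column per quarter.
wide = long_sales.pivot(index="store", columns="quarter", values="sales")

# Wide back to long.
back_to_long = wide.reset_index().melt(
    id_vars="store", var_name="quarter", value_name="sales"
)

# pivot_table aggregates while reshaping (here: total sales per store).
totals = long_sales.pivot_table(index="store", values="sales", aggfunc="sum")

# stack moves the column level into the row index; unstack reverses it.
stacked = wide.stack()
unstacked = stacked.unstack()
```

Unlike `.pivot_table()`, `.pivot()` raises an error when an index/column pair appears more than once, since it does not aggregate.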

Performance Optimization

Memory Efficiency

Use categorical data types for low-cardinality strings
Downcast numeric types when appropriate
Use `pd.eval()` and `.eval()` for large expression evaluation
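
The sketch below compares memory use before and after applying the first two suggestions, then evaluates an expression with `.eval()`. The `events` frame is synthetic and far smaller than anything where `.eval()` would pay off; it only shows the mechanics.

```python
import pandas as pd

events = pd.DataFrame(
    {
        "country": ["us", "de", "us", "fr"] * 250,
        "clicks": [1, 2, 3, 4] * 250,
        "views": [10, 20, 30, 40] * 250,
    }
)

compact = events.assign(
    country=events["country"].astype("category"),
    # downcast picks the smallest integer type that holds every value.
    clicks=pd.to_numeric(events["clicks"], downcast="integer"),
    views=pd.to_numeric(events["views"], downcast="integer"),
)

bytes_before = int(events.memory_usage(deep=True).sum())
bytes_after = int(compact.memory_usage(deep=True).sum())

# eval parses the expression string and returns a new frame with the column added.
with_rate = events.eval("click_rate = clicks / views")
```

Downcasting is only safe when later values are known to stay within the smaller type's range.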

Computation Speed

Use vectorized operations instead of `.apply()` with row-wise functions
Prefer built-in aggregation functions over custom ones
Use `.values` or `.to_numpy()` for NumPy operations when faster
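
For comparison, this sketch computes the same column with a row-wise `.apply()` and with a vectorized expression, then hands an array to NumPy. The `trips` data is invented, and no timings are claimed here; on three rows the difference is not measurable.

```python
import numpy as np
import pandas as pd

trips = pd.DataFrame(
    {"distance_km": [5.0, 12.0, 30.0], "duration_h": [0.25, 0.5, 1.0]}
)

# Row-wise apply: calls a Python function once per row.
speed_apply = trips.apply(
    lambda row: row["distance_km"] / row["duration_h"], axis=1
)

# Vectorized: a single operation over whole columns, same result.
speed_vectorized = trips["distance_km"] / trips["duration_h"]

# Built-in aggregation instead of a custom Python function.
mean_speed = speed_vectorized.mean()

# Working on the underlying NumPy array directly.
trip_kind = np.where(trips["distance_km"].to_numpy() > 10, "long", "short")
```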

Avoiding Common Pitfalls

Avoid iterating with `.iterrows()` - use vectorized operations
Don't modify DataFrames while iterating
Be aware of SettingWithCopyWarning - use `.copy()` when needed
Avoid growing DataFrames row by row - collect in list and create once
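
Two of these pitfalls can be illustrated briefly: taking an explicit copy before modifying a filtered subset, and building a frame once from collected rows. The `inventory` data and the incoming records are hypothetical.

```python
import pandas as pd

inventory = pd.DataFrame(
    {"item": ["nut", "bolt", "gear"], "stock": [0, 15, 4]}
)

# An explicit copy makes it unambiguous that the subset is independent,
# so adding a column cannot affect (or warn about) the original frame.
in_stock = inventory[inventory["stock"] > 0].copy()
in_stock["reorder"] = in_stock["stock"] < 5

# Collect plain dicts in a list, then build the DataFrame in one step,
# rather than appending to a DataFrame inside the loop.
incoming = [("washer", 9), ("spring", 2)]
rows = []
for item, stock in incoming:
    rows.append({"item": item, "stock": stock})
new_items = pd.DataFrame(rows)
```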

Time Series Operations

Use `DatetimeIndex` for time series data
Leverage `.resample()` for time-based aggregation
Use `.shift()` and `.diff()` for lag operations
Use `.rolling()` and `.expanding()` for window calculations
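
The time series operations above might be combined as in the following sketch. The `daily` visit counts are made up; the index starts on Monday 2024-01-01, and `"W"` resampling uses pandas' default of weeks ending on Sunday.

```python
import pandas as pd

daily = pd.DataFrame(
    {"visits": [10, 12, 9, 15, 20, 18, 25, 30]},
    index=pd.date_range("2024-01-01", periods=8, freq="D"),  # a DatetimeIndex
)

weekly_total = daily["visits"].resample("W").sum()       # time-based aggregation
previous_day = daily["visits"].shift(1)                  # lag by one row
daily_change = daily["visits"].diff()                    # value minus previous value
rolling_mean = daily["visits"].rolling(window=3).mean()  # fixed 3-row window
running_max = daily["visits"].expanding().max()          # window grows from the start
```

The first entries of the shifted, differenced, and rolling results are missing because there is not yet enough history to compute them.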

Merging and Joining

Use `.merge()` for SQL-style joins
Specify the `how` parameter: 'inner', 'outer', 'left', 'right'
Use the `validate` parameter to check join cardinality
Use `pd.concat()` for stacking DataFrames
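
A small sketch of a validated join and a concatenation follows. The `customers` and `orders` frames are invented; the second merge is deliberately given the wrong cardinality to show `validate` rejecting it.

```python
import pandas as pd

customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "name": ["ann", "ben", "cat"]}
)
orders = pd.DataFrame(
    {"order_id": [10, 11, 12], "customer_id": [1, 1, 3], "amount": [5.0, 7.5, 2.0]}
)

# Left join: keep every order; each order should match at most one customer.
orders_with_names = orders.merge(
    customers, on="customer_id", how="left", validate="many_to_one"
)

# Customer 1 appears twice in orders, so a one-to-one check fails.
try:
    orders.merge(customers, on="customer_id", validate="one_to_one")
    cardinality_error = None
except pd.errors.MergeError as error:
    cardinality_error = str(error)

# Stack frames with the same columns on top of each other.
january = pd.DataFrame({"order_id": [10], "amount": [5.0]})
february = pd.DataFrame({"order_id": [11, 12], "amount": [7.5, 2.0]})
all_months = pd.concat([january, february], ignore_index=True)
```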

Key Conventions

Import as `import pandas as pd`
Use `snake_case` for column names when possible
Document data sources and transformations
Keep notebooks reproducible with clear cell execution order
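
One possible way to normalize column names to `snake_case` is sketched below. The `survey` frame is hypothetical, and the rule used (trim, lowercase, spaces to underscores) is an assumption that will not cover every naming scheme.

```python
import pandas as pd

survey = pd.DataFrame(
    {"Respondent ID": [1, 2], " Favorite Color": ["red", "blue"]}
)

# Trim whitespace, lowercase, and replace spaces with underscores.
survey.columns = (
    survey.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)
```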
