pandas-best-practices
Best practices for Pandas data manipulation, analysis, and DataFrame operations in Python
Pandas Best Practices
Expert guidelines for Pandas development, focusing on data manipulation, analysis, and efficient DataFrame operations.
Code Style and Structure
- Write concise, technical responses with accurate Python examples
- Prioritize reproducibility in data analysis workflows
- Use functional programming; avoid unnecessary classes
- Prefer vectorized operations over explicit loops
- Use descriptive variable names reflecting data content
- Follow PEP 8 style guidelines
DataFrame Creation and I/O
- Use `pd.read_csv()`, `pd.read_excel()`, `pd.read_json()` with appropriate parameters
- Specify the `dtype` parameter to ensure correct data types on load
- Use `parse_dates` for automatic datetime parsing
- Set `index_col` when the data has a natural index column
- Use `chunksize` for reading large files incrementally
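As a minimal sketch of typed loading (the CSV payload and column names here are invented for illustration):

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk
csv_data = io.StringIO(
    "order_id,order_date,amount\n"
    "1,2024-01-05,19.99\n"
    "2,2024-01-06,5.50\n"
)

# dtype pins column types at load time; parse_dates yields datetime64 columns;
# index_col uses the natural key as the index
orders = pd.read_csv(
    csv_data,
    dtype={"order_id": "int64", "amount": "float64"},
    parse_dates=["order_date"],
    index_col="order_id",
)
```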
Data Selection
- Use `.loc[]` for label-based indexing
- Use `.iloc[]` for integer position-based indexing
- Avoid chained indexing (e.g., `df['col'][0]`) - use `.loc` or `.iloc` instead
- Use boolean indexing for conditional selection: `df[df['col'] > value]`
- Use the `.query()` method for complex filtering conditions
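The selection styles above can be sketched on a toy frame (data and labels invented for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp": [4, 22, 31]},
    index=["a", "b", "c"],
)

label_val = df.loc["b", "city"]      # label-based: row "b", column "city"
position_val = df.iloc[2, 1]         # position-based: third row, second column
warm = df[df["temp"] > 20]           # boolean indexing
warm_named = df.query("temp > 20 and city != 'Lima'")  # complex filter
```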
Method Chaining
- Prefer method chaining for data transformations when possible
- Use `.pipe()` for applying custom functions in a chain
- Chain operations like `.assign()`, `.query()`, `.groupby().agg()`
- Keep chains readable by breaking across multiple lines
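A minimal sketch of one readable chain, broken across lines (the data and the `add_total` helper are hypothetical):

```python
import pandas as pd


def add_total(frame, rate):
    # Custom step injected into the chain via .pipe()
    return frame.assign(total=frame["price"] * (1 + rate))


sales = pd.DataFrame({"region": ["N", "N", "S"], "price": [10.0, 20.0, 30.0]})

summary = (
    sales
    .pipe(add_total, rate=0.1)
    .query("total > 12")
    .groupby("region")
    .agg(mean_total=("total", "mean"))
)
```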
Data Cleaning and Validation
Missing Data
- Check for missing data with `.isna()` and `.info()`
- Handle missing data appropriately: `.fillna()`, `.dropna()`, or imputation
- Use `pd.NA` for nullable integer and boolean types
- Document decisions about missing data handling
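A short sketch of the options above, using an invented frame with a nullable boolean column:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "score": [1.0, None, 3.0],
        "flag": pd.array([True, None, False], dtype="boolean"),  # holds pd.NA
    }
)

n_missing = df["score"].isna().sum()           # count gaps
filled = df["score"].fillna(df["score"].mean())  # simple mean imputation
dropped = df.dropna(subset=["score"])            # or drop incomplete rows
```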
Data Quality Checks
- Implement data quality checks at the beginning of analysis
- Validate data types with `.dtypes` and convert as needed
- Check for duplicates with `.duplicated()` and handle appropriately
- Use `.describe()` for a quick statistical overview
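As a quick illustration of these checks on invented data:

```python
import pandas as pd

df = pd.DataFrame({"user": ["a", "b", "a"], "amount": [5, 7, 5]})

dupes = df.duplicated()          # marks later repeats of a full row as True
deduped = df.drop_duplicates()   # one common way to handle them
stats = df["amount"].describe()  # count/mean/std/min/quartiles/max
```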
Type Conversion
- Use `.astype()` for explicit type conversion
- Use `pd.to_datetime()` for date parsing
- Use `pd.to_numeric()` with `errors='coerce'` for safe numeric conversion
- Utilize categorical data types for low-cardinality string columns
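A minimal sketch of these conversions (the raw values, including the bad `"oops"` entry, are invented):

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "qty": ["1", "2", "oops"],
        "day": ["2024-01-01", "2024-01-02", "2024-01-03"],
        "color": ["red", "red", "blue"],
    }
)

raw["qty"] = pd.to_numeric(raw["qty"], errors="coerce")  # "oops" -> NaN, no exception
raw["day"] = pd.to_datetime(raw["day"])                  # strings -> datetime64
raw["color"] = raw["color"].astype("category")           # low-cardinality strings
```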
Grouping and Aggregation
GroupBy Operations
- Use `.groupby()` for efficient aggregation operations
- Specify aggregation functions with `.agg()` for multiple operations
- Use named aggregation for clearer output column names
- Consider `.transform()` for broadcasting results back to the original shape
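Named aggregation and `.transform()` side by side, on invented data:

```python
import pandas as pd

df = pd.DataFrame({"team": ["x", "x", "y"], "pts": [10, 20, 5]})

# Named aggregation gives readable output column names
agg = df.groupby("team").agg(total_pts=("pts", "sum"), max_pts=("pts", "max"))

# transform broadcasts the group result back to the original row shape
df["team_total"] = df.groupby("team")["pts"].transform("sum")
```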
Pivot Tables and Reshaping
- Use `.pivot_table()` for multi-dimensional aggregation
- Use `.melt()` to convert wide to long format
- Use `.pivot()` to convert long to wide format
- Use `.stack()` and `.unstack()` for hierarchical index manipulation
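A round trip through these reshaping tools, with invented month columns:

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "jan": [3, 4], "feb": [5, 6]})

# Wide -> long
long = wide.melt(id_vars="id", var_name="month", value_name="val")

# Long -> wide again
back = long.pivot(index="id", columns="month", values="val")

# Aggregating pivot: sum of val per month
pt = long.pivot_table(index="month", values="val", aggfunc="sum")
```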
Performance Optimization
Memory Efficiency
- Use categorical data types for low-cardinality strings
- Downcast numeric types when appropriate
- Use `pd.eval()` and `.eval()` for large expression evaluation
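A small sketch of the memory levers above; with realistic string columns the savings are usually much larger than on this toy frame:

```python
import pandas as pd

df = pd.DataFrame({"status": ["ok", "ok", "fail"] * 1000, "n": [1, 2, 3] * 1000})

before = df.memory_usage(deep=True).sum()
df["status"] = df["status"].astype("category")            # low-cardinality strings
df["n"] = pd.to_numeric(df["n"], downcast="integer")      # int64 -> int8 here
after = df.memory_usage(deep=True).sum()

doubled = df.eval("n * 2")  # expression evaluated without Python-level temporaries
```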
Computation Speed
- Use vectorized operations instead of `.apply()` with row-wise functions
- Prefer built-in aggregation functions over custom ones
- Use `.values` or `.to_numpy()` for NumPy operations when faster
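The row-wise vs vectorized contrast in one sketch (both produce the same result; the vectorized form avoids a Python-level loop):

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5, 10)})

# Row-wise apply: a Python function call per row (slow)
slow = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Vectorized: one NumPy-backed operation over whole columns (fast)
fast = df["a"] + df["b"]

# Drop to a raw NumPy array when that is all you need
arr = df["a"].to_numpy()
```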
Avoiding Common Pitfalls
- Avoid iterating with `.iterrows()` - use vectorized operations
- Don't modify DataFrames while iterating
- Be aware of `SettingWithCopyWarning` - use `.copy()` when needed
- Avoid growing DataFrames row by row - collect rows in a list and create the DataFrame once
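The last two points sketched together (the row contents are invented):

```python
import pandas as pd

# Anti-pattern: pd.concat([df, new_row]) inside a loop reallocates every pass.
# Instead, collect plain dicts and build the DataFrame once.
rows = []
for i in range(3):
    rows.append({"id": i, "sq": i * i})
df = pd.DataFrame(rows)

# An explicit .copy() makes the filtered slice independent, so assigning to it
# cannot trigger SettingWithCopyWarning or silently miss the original
subset = df[df["sq"] > 0].copy()
subset["sq"] = subset["sq"] + 1
```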
Time Series Operations
- Use a `DatetimeIndex` for time series data
- Leverage `.resample()` for time-based aggregation
- Use `.shift()` and `.diff()` for lag operations
- Use `.rolling()` and `.expanding()` for window calculations
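A compact sketch over an invented six-day series:

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

two_day = ts.resample("2D").sum()     # time-based aggregation into 2-day bins
lagged = ts.shift(1)                  # previous observation (NaN at the start)
change = ts.diff()                    # equivalent to ts - ts.shift(1)
smooth = ts.rolling(window=3).mean()  # trailing 3-day mean
```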
Merging and Joining
- Use `.merge()` for SQL-style joins
- Specify the `how` parameter: 'inner', 'outer', 'left', 'right'
- Use the `validate` parameter to check join cardinality
- Use `pd.concat()` for stacking DataFrames
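A minimal join sketch (table contents invented); `validate="one_to_many"` would raise `MergeError` if `uid` repeated on the left side:

```python
import pandas as pd

users = pd.DataFrame({"uid": [1, 2, 3], "name": ["ann", "bob", "cat"]})
orders = pd.DataFrame({"uid": [1, 1, 3], "amt": [5, 7, 9]})

# Left join, with the join cardinality asserted up front
merged = users.merge(orders, on="uid", how="left", validate="one_to_many")

# concat stacks frames vertically
stacked = pd.concat([users, users], ignore_index=True)
```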
Key Conventions
- Import with `import pandas as pd`
- Use `snake_case` for column names when possible
- Document data sources and transformations
- Keep notebooks reproducible with clear cell execution order