pandas-best-practices

Original🇺🇸 English
Translated

Best practices for Pandas data manipulation, analysis, and DataFrame operations in Python

11installs
Added on

NPX Install

npx skill4agent add mindrally/skills pandas-best-practices

Pandas Best Practices

Expert guidelines for Pandas development, focusing on data manipulation, analysis, and efficient DataFrame operations.

Code Style and Structure

  • Write concise, technical responses with accurate Python examples
  • Prioritize reproducibility in data analysis workflows
  • Use functional programming; avoid unnecessary classes
  • Prefer vectorized operations over explicit loops
  • Use descriptive variable names reflecting data content
  • Follow PEP 8 style guidelines

DataFrame Creation and I/O

  • Use
    pd.read_csv()
    ,
    pd.read_excel()
    ,
    pd.read_json()
    with appropriate parameters
  • Specify
    dtype
    parameter to ensure correct data types on load
  • Use
    parse_dates
    for automatic datetime parsing
  • Set
    index_col
    when the data has a natural index column
  • Use
    chunksize
    for reading large files incrementally

Data Selection

  • Use
    .loc[]
    for label-based indexing
  • Use
    .iloc[]
    for integer position-based indexing
  • Avoid chained indexing (e.g.,
    df['col'][0]
    ) - use
    .loc
    or
    .iloc
    instead
  • Use boolean indexing for conditional selection:
    df[df['col'] > value]
  • Use
    .query()
    method for complex filtering conditions

Method Chaining

  • Prefer method chaining for data transformations when possible
  • Use
    .pipe()
    for applying custom functions in a chain
  • Chain operations like
    .assign()
    ,
    .query()
    ,
    .groupby()
    ,
    .agg()
  • Keep chains readable by breaking across multiple lines

Data Cleaning and Validation

Missing Data

  • Check for missing data with
    .isna()
    and
    .info()
  • Handle missing data appropriately:
    .fillna()
    ,
    .dropna()
    , or imputation
  • Use
    pd.NA
    for nullable integer and boolean types
  • Document decisions about missing data handling

Data Quality Checks

  • Implement data quality checks at the beginning of analysis
  • Validate data types with
    .dtypes
    and convert as needed
  • Check for duplicates with
    .duplicated()
    and handle appropriately
  • Use
    .describe()
    for quick statistical overview

Type Conversion

  • Use
    .astype()
    for explicit type conversion
  • Use
    pd.to_datetime()
    for date parsing
  • Use
    pd.to_numeric()
    with
    errors='coerce'
    for safe numeric conversion
  • Utilize categorical data types for low-cardinality string columns

Grouping and Aggregation

GroupBy Operations

  • Use
    .groupby()
    for efficient aggregation operations
  • Specify aggregation functions with
    .agg()
    for multiple operations
  • Use named aggregation for clearer output column names
  • Consider
    .transform()
    for broadcasting results back to original shape

Pivot Tables and Reshaping

  • Use
    .pivot_table()
    for multi-dimensional aggregation
  • Use
    .melt()
    to convert wide to long format
  • Use
    .pivot()
    to convert long to wide format
  • Use
    .stack()
    and
    .unstack()
    for hierarchical index manipulation

Performance Optimization

Memory Efficiency

  • Use categorical data types for low-cardinality strings
  • Downcast numeric types when appropriate
  • Use
    pd.eval()
    and
    .eval()
    for large expression evaluation

Computation Speed

  • Use vectorized operations instead of
    .apply()
    with row-wise functions
  • Prefer built-in aggregation functions over custom ones
  • Use
    .values
    or
    .to_numpy()
    for NumPy operations when faster

Avoiding Common Pitfalls

  • Avoid iterating with
    .iterrows()
    - use vectorized operations
  • Don't modify DataFrames while iterating
  • Be aware of SettingWithCopyWarning - use
    .copy()
    when needed
  • Avoid growing DataFrames row by row - collect in list and create once

Time Series Operations

  • Use
    DatetimeIndex
    for time series data
  • Leverage
    .resample()
    for time-based aggregation
  • Use
    .shift()
    and
    .diff()
    for lag operations
  • Use
    .rolling()
    and
    .expanding()
    for window calculations

Merging and Joining

  • Use
    .merge()
    for SQL-style joins
  • Specify
    how
    parameter: 'inner', 'outer', 'left', 'right'
  • Use
    validate
    parameter to check join cardinality
  • Use
    .concat()
    for stacking DataFrames

Key Conventions

  • Import as
    import pandas as pd
  • Use
    snake_case
    for column names when possible
  • Document data sources and transformations
  • Keep notebooks reproducible with clear cell execution order