pandera

Pandera: DataFrame Validation

Pandera is an open-source framework for validating DataFrame-like objects at runtime. Define schemas once and reuse them across pandas, polars, Dask, Modin, PySpark, and Ibis backends.

Import Convention

Since pandera v0.24.0, use the backend-specific module. Using the top-level `pandera` module produces a `FutureWarning` and will be deprecated in v0.29.0.

```python
import pandera.pandas as pa          # pandas (recommended)
import pandera.polars as pa          # polars
from pandera.typing.pandas import DataFrame, Series, Index
```

Two Schema Styles

Object-based API (`DataFrameSchema`)

Suitable for dynamic schema construction or when schemas need to be built programmatically.
```python
import pandas as pd
import pandera.pandas as pa

schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, pa.Check.gt(0)),
    "email": pa.Column(str, pa.Check.str_matches(r"^[^@]+@[^@]+\.[^@]+$")),
    "score": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(1.0)]),
    "status": pa.Column(str, pa.Check.isin(["active", "inactive", "banned"])),
})

validated = schema.validate(df)
```

Class-based API (`DataFrameModel`), preferred

Pydantic-style syntax with type annotations. Produces cleaner, reusable schemas that integrate with `@pa.check_types`.

```python
import pandera.pandas as pa
from pandera.typing.pandas import DataFrame, Series

class UserSchema(pa.DataFrameModel):
    user_id: int = pa.Field(gt=0)
    email: str = pa.Field(str_matches=r"^[^@]+@[^@]+\.[^@]+$")
    score: float = pa.Field(ge=0.0, le=1.0)
    status: str = pa.Field(isin=["active", "inactive", "banned"])

    class Config:
        strict = True       # reject extra columns
        coerce = False      # do not silently cast types
```

Validate directly


```python
UserSchema.validate(df)
```

Or via typing annotation + decorator


```python
@pa.check_types
def process(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    return df
```

Checks


Built-in Checks (prefer these over lambdas)


```python
pa.Check.gt(0)               # greater than
pa.Check.ge(0)               # greater than or equal
pa.Check.lt(100)             # less than
pa.Check.le(100)             # less than or equal
pa.Check.eq("value")         # equal to
pa.Check.ne("value")         # not equal to
pa.Check.isin(["a", "b"])    # membership
pa.Check.notin(["x"])        # exclusion
pa.Check.str_matches(r"^\d+$")  # regex match
pa.Check.in_range(0, 100)    # closed interval
pa.Check.str_startswith("prefix")
pa.Check.str_endswith("suffix")
pa.Check.str_length(1, 255)  # min/max string length
```

Custom Checks



Vectorized (default, faster — operates on the whole Series)

```python
pa.Check(lambda s: s.str.len() <= 255)
```

Element-wise (scalar input, use only when vectorized is impractical)

```python
pa.Check(lambda x: x > 0, element_wise=True)
```

Always add an error message

```python
pa.Check(lambda s: s > 0, error="values must be positive")
```

DataFrame-level Checks


```python
schema = pa.DataFrameSchema(
    columns={...},
    checks=pa.Check(lambda df: df["end_date"] >= df["start_date"]),
)
```

In `DataFrameModel`, use `@pa.dataframe_check`:

```python
class Schema(pa.DataFrameModel):
    start_date: int
    end_date: int

    @pa.dataframe_check
    @classmethod
    def end_after_start(cls, df: pd.DataFrame) -> pd.Series:
        return df["end_date"] >= df["start_date"]
```

Nullable and Optional Columns



Object API: allow nulls in a column

```python
pa.Column(float, nullable=True)
```

DataFrameModel: make a column optional (may be absent)

```python
from typing import Optional

class Schema(pa.DataFrameModel):
    required_col: Series[int]
    optional_col: Optional[Series[float]]
```

Coercion


Enable coercion to cast data to the declared type before validation. Use deliberately — coercion can hide upstream data issues.

Per-column

```python
pa.Column(int, coerce=True)
```

Schema-wide via Config


```python
class Schema(pa.DataFrameModel):
    year: int = pa.Field(gt=2000, coerce=True)

    class Config:
        coerce = True
```

Lazy Validation — Collect All Errors


By default pandera raises on the first error. Use `lazy=True` to collect all failures before raising, which is useful for batch reporting.

```python
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)   # DataFrame of all failures
```

Decorator Integration


Integrate validation transparently into pipelines using decorators.

DataFrameModel + check_types (recommended)


```python
@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=df["units"] * df["price"])
```

Object API: check_input / check_output


```python
@pa.check_input(input_schema)
@pa.check_output(output_schema)
def pipeline_step(df):
    return df
```

check_io: concisely specify both


```python
@pa.check_io(raw=input_schema, out=output_schema)
def pipeline_step(raw):
    return raw
```

Decorators work on sync/async functions, methods, class methods, and static methods.

Schema Inheritance


Build specialized schemas from a base to avoid repetition.
```python
class BaseEvent(pa.DataFrameModel):
    event_id: str
    timestamp: int = pa.Field(gt=0)

class ClickEvent(BaseEvent):
    url: str
    user_agent: str

    class Config:
        strict = True
```

Schema Persistence (YAML / Script)


Serialize and reload schemas to keep validation reproducible.
```python
import pandera.io
```

Save

```python
pandera.io.to_yaml(schema, "./schema.yaml")
```

Load

```python
schema = pandera.io.from_yaml("./schema.yaml")
```

Generate Python script

```python
pandera.io.to_script(schema, "./schema_definition.py")
```

Schema Inference (Prototyping Only)


Infer a schema from existing data to bootstrap development. Always review and tighten the generated schema before using in production.
```python
import pandera.pandas as pa

inferred = pa.infer_schema(df)
print(inferred.to_script())   # inspect then copy-edit
```

Dropping Invalid Rows


Use `drop_invalid_rows=True` on `DataFrameSchema` to filter out failing rows instead of raising an error. Supported on pandas and polars. Note that dropping invalid rows requires lazy validation.

```python
schema = pa.DataFrameSchema(
    {"score": pa.Column(float, pa.Check.ge(0))},
    drop_invalid_rows=True,
)
cleaned = schema.validate(df_with_bad_rows, lazy=True)
```

Error Handling


```python
from pandera.errors import SchemaError, SchemaErrors
```

Single error (eager validation)

```python
try:
    schema.validate(df)
except SchemaError as exc:
    print(exc.failure_cases)   # Series/DataFrame of failures
```

Multiple errors (lazy validation)

```python
try:
    schema.validate(df, lazy=True)
except SchemaErrors as exc:
    # structured dict with SCHEMA and DATA keys
    print(exc.error_counts)
    print(exc.failure_cases)
```

Key Configuration Options (`Config`)

| Option | Type | Effect |
| --- | --- | --- |
| `strict` | `bool` | Raise if extra columns present |
| `coerce` | `bool` | Cast columns to declared dtypes |
| `ordered` | `bool` | Require columns in declared order |
| `name` | `str` | Schema name shown in error messages |
| `add_missing_columns` | `bool` | Insert missing columns with default values |

Best Practices


  • Use `DataFrameModel` over `DataFrameSchema` for new code: cleaner syntax, inheritance, and type-annotation integration.
  • Prefer `strict=True` to catch unexpected extra columns early.
  • Use built-in checks (`Check.gt`, `Check.isin`, etc.) over custom lambdas where possible; they produce better error messages.
  • Write vectorized checks (`element_wise=False`, the default) for performance; only use `element_wise=True` when the logic is truly scalar.
  • Always add `error=` messages to custom `Check` objects to improve debuggability.
  • Use lazy validation in pipelines that process large batches so all failures surface in one pass.
  • Never rely on inferred schemas in production; always explicitly define constraints.
  • Use `coerce=True` deliberately: set it at the column level to limit scope, and avoid schema-wide coercion unless certain.
  • Reserve `raise_warning=True` for non-critical informational checks (e.g., normality tests), not for data integrity constraints.

Additional Resources


  • references/checks-and-validation.md: built-in check catalog, groupby checks, wide checks, hypothesis testing
  • references/dataframe-models.md: field spec, schema inheritance, MultiIndex, aliases, parsers, Polars usage