pandera

Pandera: DataFrame Validation

Pandera is an open-source framework for validating DataFrame-like objects at runtime. Define schemas once and reuse them across pandas, polars, Dask, Modin, PySpark, and Ibis backends.

Import Convention

Since pandera v0.24.0, use the backend-specific module. Using the top-level `pandera` module produces a `FutureWarning` and will be deprecated in v0.29.0.

```python
import pandera.pandas as pa          # pandas (recommended)
import pandera.polars as pa          # polars
from pandera.typing.pandas import DataFrame, Series, Index
```

Two Schema Styles

Object-based API (`DataFrameSchema`)

Suitable for dynamic schema construction or when schemas need to be built programmatically.
```python
import pandas as pd
import pandera.pandas as pa

schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, pa.Check.gt(0)),
    "email": pa.Column(str, pa.Check.str_matches(r"^[^@]+@[^@]+\.[^@]+$")),
    "score": pa.Column(float, [pa.Check.ge(0.0), pa.Check.le(1.0)]),
    "status": pa.Column(str, pa.Check.isin(["active", "inactive", "banned"])),
})

validated = schema.validate(df)
```

Class-based API (`DataFrameModel`), preferred

Pydantic-style syntax with type annotations. Produces cleaner, reusable schemas that integrate with `@pa.check_types`.

```python
import pandera.pandas as pa
from pandera.typing.pandas import DataFrame, Series

class UserSchema(pa.DataFrameModel):
    user_id: int = pa.Field(gt=0)
    email: str = pa.Field(str_matches=r"^[^@]+@[^@]+\.[^@]+$")
    score: float = pa.Field(ge=0.0, le=1.0)
    status: str = pa.Field(isin=["active", "inactive", "banned"])

    class Config:
        strict = True       # reject extra columns
        coerce = False      # do not silently cast types
```

Validate directly


```python
UserSchema.validate(df)
```

Or via typing annotation + decorator


```python
@pa.check_types
def process(df: DataFrame[UserSchema]) -> DataFrame[UserSchema]:
    return df
```

Checks


Built-in Checks (prefer these over lambdas)


```python
pa.Check.gt(0)               # greater than
pa.Check.ge(0)               # greater than or equal
pa.Check.lt(100)             # less than
pa.Check.le(100)             # less than or equal
pa.Check.eq("value")         # equal to
pa.Check.ne("value")         # not equal to
pa.Check.isin(["a", "b"])    # membership
pa.Check.notin(["x"])        # exclusion
pa.Check.str_matches(r"^\d+$")  # regex match
pa.Check.in_range(0, 100)    # closed interval
pa.Check.str_startswith("prefix")
pa.Check.str_endswith("suffix")
pa.Check.str_length(1, 255)  # min/max string length
```

Custom Checks



Vectorized (default, faster — operates on the whole Series)

```python
pa.Check(lambda s: s.str.len() <= 255)
```

Element-wise (scalar input, use only when vectorized is impractical)

```python
pa.Check(lambda x: x > 0, element_wise=True)
```

Always add an error message

```python
pa.Check(lambda s: s > 0, error="values must be positive")
```

DataFrame-level Checks


```python
schema = pa.DataFrameSchema(
    columns={...},
    checks=pa.Check(lambda df: df["end_date"] >= df["start_date"]),
)
```

In `DataFrameModel`, use `@pa.dataframe_check`:

```python
class Schema(pa.DataFrameModel):
    start_date: int
    end_date: int

    @pa.dataframe_check
    @classmethod
    def end_after_start(cls, df: pd.DataFrame) -> pd.Series:
        return df["end_date"] >= df["start_date"]
```

Nullable and Optional Columns



Object API: allow nulls in a column

```python
pa.Column(float, nullable=True)
```

DataFrameModel: make a column optional (may be absent)

```python
from typing import Optional

class Schema(pa.DataFrameModel):
    required_col: Series[int]
    optional_col: Optional[Series[float]]
```

Coercion


Enable coercion to cast data to the declared type before validation. Use deliberately — coercion can hide upstream data issues.

Per-column

```python
pa.Column(int, coerce=True)
```

Schema-wide via Config


```python
class Schema(pa.DataFrameModel):
    year: int = pa.Field(gt=2000, coerce=True)

    class Config:
        coerce = True
```

Lazy Validation — Collect All Errors


By default pandera raises on the first error. Use `lazy=True` to collect all failures before raising, which is useful for batch reporting.

```python
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)   # DataFrame of all failures
```

Decorator Integration


Integrate validation transparently into pipelines using decorators.

DataFrameModel + check_types (recommended)


```python
@pa.check_types
def transform(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    return df.assign(revenue=df["units"] * df["price"])
```

Object API: check_input / check_output


```python
@pa.check_input(input_schema)
@pa.check_output(output_schema)
def pipeline_step(df):
    return df
```

check_io: concisely specify both


```python
@pa.check_io(raw=input_schema, out=output_schema)
def pipeline_step(raw):
    return raw
```

Decorators work on sync/async functions, methods, class methods, and static methods.

Schema Inheritance


Build specialized schemas from a base to avoid repetition.
```python
class BaseEvent(pa.DataFrameModel):
    event_id: str
    timestamp: int = pa.Field(gt=0)

class ClickEvent(BaseEvent):
    url: str
    user_agent: str

    class Config:
        strict = True
```

Schema Persistence (YAML / Script)


Serialize and reload schemas to keep validation reproducible.
```python
import pandera.io
```

Save

```python
pandera.io.to_yaml(schema, "./schema.yaml")
```

Load

```python
schema = pandera.io.from_yaml("./schema.yaml")
```

Generate Python script

```python
pandera.io.to_script(schema, "./schema_definition.py")
```

Schema Inference (Prototyping Only)


Infer a schema from existing data to bootstrap development. Always review and tighten the generated schema before using in production.
```python
import pandera.pandas as pa

inferred = pa.infer_schema(df)
print(inferred.to_script())   # inspect then copy-edit
```

Dropping Invalid Rows


Use `drop_invalid_rows=True` on `DataFrameSchema` to filter out failing rows instead of raising an error. Supported on pandas and polars. Note that dropping invalid rows requires lazy validation.

```python
schema = pa.DataFrameSchema(
    {"score": pa.Column(float, pa.Check.ge(0))},
    drop_invalid_rows=True,
)
cleaned = schema.validate(df_with_bad_rows, lazy=True)
```

Error Handling


```python
from pandera.errors import SchemaError, SchemaErrors
```

Single error (eager validation)

```python
try:
    schema.validate(df)
except SchemaError as exc:
    print(exc.failure_cases)   # Series/DataFrame of failures
```

Multiple errors (lazy validation)

```python
try:
    schema.validate(df, lazy=True)
except SchemaErrors as exc:
    # structured dict with SCHEMA and DATA keys
    print(exc.error_counts)
    print(exc.failure_cases)
```

Key Configuration Options (`Config`)

| Option | Type | Effect |
| --- | --- | --- |
| `strict` | `bool` | Raise if extra columns present |
| `coerce` | `bool` | Cast columns to declared dtypes |
| `ordered` | `bool` | Require columns in declared order |
| `name` | `str` | Schema name shown in error messages |
| `add_missing_columns` | `bool` | Insert missing columns with default values |

Best Practices


  • Use `DataFrameModel` over `DataFrameSchema` for new code: cleaner syntax, inheritance, and type-annotation integration.
  • Prefer `strict=True` to catch unexpected extra columns early.
  • Use built-in checks (`Check.gt`, `Check.isin`, etc.) over custom lambdas where possible; they produce better error messages.
  • Write vectorized checks (`element_wise=False`, the default) for performance; only use `element_wise=True` when the logic is truly scalar.
  • Always add `error=` messages to custom `Check` objects to improve debuggability.
  • Use lazy validation in pipelines that process large batches so all failures surface in one pass.
  • Never rely on inferred schemas in production; always explicitly define constraints.
  • Use `coerce=True` deliberately: set it at the column level to limit scope, and avoid schema-wide coercion unless certain.
  • Reserve `raise_warning=True` for non-critical informational checks (e.g., normality tests), not for data integrity constraints.

Additional Resources


  • references/checks-and-validation.md: built-in check catalog, groupby checks, wide checks, hypothesis testing
  • references/dataframe-models.md: field spec, schema inheritance, MultiIndex, aliases, parsers, Polars usage