# Polars

## Overview
Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.
## Quick Start

### Installation and Basic Usage
Install Polars:

```bash
uv pip install polars
```

Basic DataFrame creation and operations:

```python
import polars as pl

# Create DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NY", "LA", "SF"]
})

# Select columns
df.select("name", "age")

# Filter rows
df.filter(pl.col("age") > 25)

# Add computed columns
df.with_columns(
    age_plus_10=pl.col("age") + 10
)
```
## Core Concepts

### Expressions

Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.

Key principles:
- Use `pl.col("column_name")` to reference columns
- Chain methods to build complex transformations
- Expressions are lazy and only execute within contexts (`select`, `with_columns`, `filter`, `group_by`)

Example:

```python
# Expression-based computation
df.select(
    pl.col("name"),
    (pl.col("age") * 12).alias("age_in_months")
)
```
### Lazy vs Eager Evaluation

**Eager (`DataFrame`):** Operations execute immediately.

```python
df = pl.read_csv("file.csv")            # Reads immediately
result = df.filter(pl.col("age") > 25)  # Executes immediately
```

**Lazy (`LazyFrame`):** Operations build a query plan that is optimized before execution.

```python
lf = pl.scan_csv("file.csv")  # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect()         # Now executes the optimized query
```

When to use lazy:
- Working with large datasets
- Complex query pipelines
- When only some columns/rows are needed
- When performance is critical

Benefits of lazy evaluation:
- Automatic query optimization
- Predicate pushdown
- Projection pushdown
- Parallel execution

For detailed concepts, load `references/core_concepts.md`.
## Common Operations
### Select

Select and manipulate columns:

```python
# Select specific columns
df.select("name", "age")

# Select with expressions
df.select(
    pl.col("name"),
    (pl.col("age") * 2).alias("double_age")
)

# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
```
### Filter

Filter rows by conditions:

```python
# Single condition
df.filter(pl.col("age") > 25)

# Multiple conditions (cleaner than chaining with &)
df.filter(
    pl.col("age") > 25,
    pl.col("city") == "NY"
)

# Complex conditions
df.filter(
    (pl.col("age") > 25) | (pl.col("city") == "LA")
)
```
### With Columns

Add or modify columns while preserving existing ones:

```python
# Add new columns
df.with_columns(
    age_plus_10=pl.col("age") + 10,
    name_upper=pl.col("name").str.to_uppercase()
)

# Parallel computation (both expressions evaluated in parallel;
# aliases avoid a duplicate-name collision on "value")
df.with_columns(
    (pl.col("value") * 10).alias("value_x10"),
    (pl.col("value") * 100).alias("value_x100"),
)
```
### Group By and Aggregations

Group data and compute aggregations:

```python
# Basic grouping
df.group_by("city").agg(
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("count")
)

# Multiple group keys
df.group_by("city", "department").agg(
    pl.col("salary").sum()
)

# Conditional aggregations
df.group_by("city").agg(
    (pl.col("age") > 30).sum().alias("over_30")
)
```

For detailed operation patterns, load `references/operations.md`.
## Aggregations and Window Functions
### Aggregation Functions

Common aggregations within a `group_by` context:
- `pl.len()` - count rows
- `pl.col("x").sum()` - sum values
- `pl.col("x").mean()` - average
- `pl.col("x").min()` / `pl.col("x").max()` - extremes
- `pl.first()` / `pl.last()` - first/last values
### Window Functions with over()

Apply aggregations while preserving the row count:

```python
# Add group statistics to each row
df.with_columns(
    avg_age_by_city=pl.col("age").mean().over("city"),
    rank_in_city=pl.col("salary").rank().over("city")
)

# Multiple grouping columns
df.with_columns(
    group_avg=pl.col("value").mean().over("category", "region")
)
```

**Mapping strategies:**
- `group_to_rows` (default): Preserves original row order
- `explode`: Faster, but groups rows together
- `join`: Creates list columns
## Data I/O
### Supported Formats

Polars supports reading and writing:
- CSV, Parquet, JSON, Excel
- Databases (via connectors)
- Cloud storage (S3, Azure, GCS)
- Google BigQuery
- Multiple/partitioned files
### Common I/O Operations

**CSV:**

```python
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")

# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
```

**Parquet (recommended for performance):**

```python
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
```

**JSON:**

```python
df = pl.read_json("file.json")
df.write_json("output.json")
```

For comprehensive I/O documentation, load `references/io_guide.md`.
## Transformations
### Joins

Combine DataFrames:

```python
# Inner join
df1.join(df2, on="id", how="inner")

# Left join
df1.join(df2, on="id", how="left")

# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
```
### Concatenation

Stack DataFrames:

```python
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")

# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")

# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
```
### Pivot and Unpivot

Reshape data:

```python
# Pivot (long to wide format); `on` replaces the deprecated `columns` parameter
df.pivot(values="sales", index="date", on="product")

# Unpivot (wide to long format)
df.unpivot(index="id", on=["col1", "col2"])
```

For detailed transformation examples, load `references/transformations.md`.
## Pandas Migration
Polars offers significant performance improvements over pandas with a cleaner API. Key differences:
### Conceptual Differences

- No index: Polars uses integer positions only
- Strict typing: no silent type conversions
- Lazy evaluation: available via LazyFrame
- Parallel by default: operations are parallelized automatically
### Common Operation Mappings

| Operation | Pandas | Polars |
|---|---|---|
| Select column | `df["col"]` | `df.select("col")` |
| Filter | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
| Add column | `df.assign(x=...)` | `df.with_columns(x=...)` |
| Group by | `df.groupby("g").agg(...)` | `df.group_by("g").agg(...)` |
| Window | `df.groupby("g")["x"].transform("mean")` | `pl.col("x").mean().over("g")` |
### Key Syntax Patterns

**Pandas sequential (slow):**

```python
df.assign(
    col_a=lambda df_: df_.value * 10,
    col_b=lambda df_: df_.value * 100
)
```

**Polars parallel (fast):**

```python
df.with_columns(
    col_a=pl.col("value") * 10,
    col_b=pl.col("value") * 100,
)
```

For a comprehensive migration guide, load `references/pandas_migration.md`.
## Best Practices
### Performance Optimization

1. Use lazy evaluation for large datasets:

   ```python
   lf = pl.scan_csv("large.csv")  # Don't use read_csv
   result = lf.filter(...).select(...).collect()
   ```

2. Avoid Python functions in hot paths:
   - Stay within the expression API for parallelization
   - Use `.map_elements()` only when necessary
   - Prefer native Polars operations

3. Use streaming for very large data:

   ```python
   lf.collect(streaming=True)
   ```

4. Select only the needed columns early:

   ```python
   # Good: Select columns early
   lf.select("col1", "col2").filter(...)

   # Bad: Filter on all columns first
   lf.filter(...).select("col1", "col2")
   ```

5. Use appropriate data types:
   - Categorical for low-cardinality strings
   - Appropriate integer sizes (i32 vs i64)
   - Date types for temporal data
### Expression Patterns

**Conditional operations:**

```python
pl.when(condition).then(value).otherwise(other_value)
```

**Operations across multiple columns:**

```python
df.select(pl.col("^.*_value$") * 2)  # Regex pattern
```

**Null handling:**

```python
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
```

For additional best practices and patterns, load `references/best_practices.md`.
## Resources
This skill includes comprehensive reference documentation in `references/`:

- `core_concepts.md` - Detailed explanations of expressions, lazy evaluation, and the type system
- `operations.md` - Comprehensive guide to all common operations, with examples
- `pandas_migration.md` - Complete migration guide from pandas to Polars
- `io_guide.md` - Data I/O operations for all supported formats
- `transformations.md` - Joins, concatenation, pivots, and reshaping operations
- `best_practices.md` - Performance optimization tips and common patterns

Load these references as needed when users require detailed information about specific topics.