# Polars

## Overview

Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.

## Quick Start

### Installation and Basic Usage

Install Polars:

```bash
uv pip install polars
```

Basic DataFrame creation and operations:

```python
import polars as pl

# Create DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NY", "LA", "SF"],
})

# Select columns
df.select("name", "age")

# Filter rows
df.filter(pl.col("age") > 25)

# Add computed columns
df.with_columns(
    age_plus_10=pl.col("age") + 10
)
```

## Core Concepts

### Expressions

Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.

Key principles:

- Use `pl.col("column_name")` to reference columns
- Chain methods to build complex transformations
- Expressions are lazy and only execute within contexts (`select`, `with_columns`, `filter`, `group_by`)

Example:

```python
# Expression-based computation
df.select(
    pl.col("name"),
    (pl.col("age") * 12).alias("age_in_months")
)
```

### Lazy vs Eager Evaluation

**Eager (`DataFrame`):** Operations execute immediately

```python
df = pl.read_csv("file.csv")  # Reads immediately
result = df.filter(pl.col("age") > 25)  # Executes immediately
```

**Lazy (`LazyFrame`):** Operations build a query plan that is optimized before execution

```python
lf = pl.scan_csv("file.csv")  # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect()  # Now executes the optimized query
```

When to use lazy:

- Working with large datasets
- Complex query pipelines
- When only some columns/rows are needed
- When performance is critical

Benefits of lazy evaluation:

- Automatic query optimization
- Predicate pushdown
- Projection pushdown
- Parallel execution

For detailed concepts, load `references/core_concepts.md`.

## Common Operations

### Select

Select and manipulate columns:

```python
# Select specific columns
df.select("name", "age")

# Select with expressions
df.select(
    pl.col("name"),
    (pl.col("age") * 2).alias("double_age")
)

# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
```

### Filter

Filter rows by conditions:

```python
# Single condition
df.filter(pl.col("age") > 25)

# Multiple conditions (cleaner than chaining with &)
df.filter(
    pl.col("age") > 25,
    pl.col("city") == "NY"
)

# Complex conditions
df.filter(
    (pl.col("age") > 25) | (pl.col("city") == "LA")
)
```

### With Columns

Add or modify columns while preserving existing ones:

```python
# Add new columns
df.with_columns(
    age_plus_10=pl.col("age") + 10,
    name_upper=pl.col("name").str.to_uppercase()
)

# All expressions in one with_columns call are computed in parallel
# (aliases are required here, since both would otherwise be named "value")
df.with_columns(
    (pl.col("value") * 10).alias("value_x10"),
    (pl.col("value") * 100).alias("value_x100"),
)
```

### Group By and Aggregations

Group data and compute aggregations:

```python
# Basic grouping
df.group_by("city").agg(
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("count")
)

# Multiple group keys
df.group_by("city", "department").agg(
    pl.col("salary").sum()
)

# Conditional aggregations
df.group_by("city").agg(
    (pl.col("age") > 30).sum().alias("over_30")
)
```

For detailed operation patterns, load `references/operations.md`.

## Aggregations and Window Functions

### Aggregation Functions

Common aggregations within the `group_by` context:

- `pl.len()` - count rows
- `pl.col("x").sum()` - sum values
- `pl.col("x").mean()` - average
- `pl.col("x").min()` / `pl.col("x").max()` - extremes
- `pl.first()` / `pl.last()` - first/last values

### Window Functions with `over()`

Apply aggregations while preserving row count:

```python
# Add group statistics to each row
df.with_columns(
    avg_age_by_city=pl.col("age").mean().over("city"),
    rank_in_city=pl.col("salary").rank().over("city")
)

# Multiple grouping columns
df.with_columns(
    group_avg=pl.col("value").mean().over("category", "region")
)
```

**Mapping strategies:**
- `group_to_rows` (default): Preserves original row order
- `explode`: Faster but groups rows together
- `join`: Creates list columns

## Data I/O

### Supported Formats

Polars supports reading and writing:

- CSV, Parquet, JSON, Excel
- Databases (via connectors)
- Cloud storage (S3, Azure, GCS)
- Google BigQuery
- Multiple/partitioned files

### Common I/O Operations

**CSV:**

```python
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")

# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
```

**Parquet (recommended for performance):**

```python
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
```

**JSON:**

```python
df = pl.read_json("file.json")
df.write_json("output.json")
```

For comprehensive I/O documentation, load `references/io_guide.md`.

## Transformations

### Joins

Combine DataFrames:

```python
# Inner join
df1.join(df2, on="id", how="inner")

# Left join
df1.join(df2, on="id", how="left")

# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
```

### Concatenation

Stack DataFrames:

```python
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")

# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")

# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
```

### Pivot and Unpivot

Reshape data:

```python
# Pivot (long to wide); recent Polars uses `on` rather than `columns`
df.pivot(on="product", index="date", values="sales")

# Unpivot (wide to long)
df.unpivot(index="id", on=["col1", "col2"])
```

For detailed transformation examples, load `references/transformations.md`.

## Pandas Migration

Polars offers significant performance improvements over pandas, with a cleaner API. Key differences:

### Conceptual Differences

- **No index:** Polars uses integer positions only
- **Strict typing:** No silent type conversions
- **Lazy evaluation:** Available via `LazyFrame`
- **Parallel by default:** Operations are parallelized automatically

### Common Operation Mappings

| Operation | Pandas | Polars |
| --- | --- | --- |
| Select column | `df["col"]` | `df.select("col")` |
| Filter | `df[df["col"] > 10]` | `df.filter(pl.col("col") > 10)` |
| Add column | `df.assign(x=...)` | `df.with_columns(x=...)` |
| Group by | `df.groupby("col").agg(...)` | `df.group_by("col").agg(...)` |
| Window | `df.groupby("col").transform(...)` | `df.with_columns(expr.over("col"))` |

### Key Syntax Patterns

**Pandas sequential (slow):**

```python
df.assign(
    col_a=lambda df_: df_.value * 10,
    col_b=lambda df_: df_.value * 100
)
```

**Polars parallel (fast):**

```python
df.with_columns(
    col_a=pl.col("value") * 10,
    col_b=pl.col("value") * 100,
)
```

For a comprehensive migration guide, load `references/pandas_migration.md`.

## Best Practices

### Performance Optimization

1. **Use lazy evaluation for large datasets:**

   ```python
   lf = pl.scan_csv("large.csv")  # Don't use read_csv
   result = lf.filter(...).select(...).collect()
   ```

2. **Avoid Python functions in hot paths:**
   - Stay within the expression API for parallelization
   - Use `.map_elements()` only when necessary
   - Prefer native Polars operations

3. **Use streaming for very large data:**

   ```python
   lf.collect(streaming=True)
   ```

4. **Select only needed columns early:**

   ```python
   # Good: select columns early
   lf.select("col1", "col2").filter(...)

   # Bad: filter on all columns first
   lf.filter(...).select("col1", "col2")
   ```

5. **Use appropriate data types:**
   - `Categorical` for low-cardinality strings
   - Appropriate integer sizes (`Int32` vs `Int64`)
   - Date types for temporal data

### Expression Patterns

**Conditional operations:**

```python
pl.when(condition).then(value).otherwise(other_value)
```

**Operations across multiple columns:**

```python
df.select(pl.col("^.*_value$") * 2)  # Regex pattern
```

**Null handling:**

```python
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
```

For additional best practices and patterns, load `references/best_practices.md`.

## Resources

This skill includes comprehensive reference documentation:

### references/

- `core_concepts.md` - Detailed explanations of expressions, lazy evaluation, and the type system
- `operations.md` - Comprehensive guide to all common operations, with examples
- `pandas_migration.md` - Complete migration guide from pandas to Polars
- `io_guide.md` - Data I/O operations for all supported formats
- `transformations.md` - Joins, concatenation, pivots, and reshaping operations
- `best_practices.md` - Performance optimization tips and common patterns

Load these references as needed when users require detailed information about specific topics.