# Polars

## Overview
Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.
## Quick Start

### Installation and Basic Usage
Install Polars:

```bash
uv pip install polars
```

Basic DataFrame creation and operations:

```python
import polars as pl

# Create DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NY", "LA", "SF"]
})

# Select columns
df.select("name", "age")

# Filter rows
df.filter(pl.col("age") > 25)

# Add computed columns
df.with_columns(
    age_plus_10=pl.col("age") + 10
)
```
## Core Concepts

### Expressions

Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.

Key principles:
- Use `pl.col("column_name")` to reference columns
- Chain methods to build complex transformations
- Expressions are lazy and only execute within contexts (`select`, `with_columns`, `filter`, `group_by`)

Example:

```python
# Expression-based computation
df.select(
    pl.col("name"),
    (pl.col("age") * 12).alias("age_in_months")
)
```
### Lazy vs Eager Evaluation

**Eager (`DataFrame`):** Operations execute immediately.

```python
df = pl.read_csv("file.csv")            # Reads immediately
result = df.filter(pl.col("age") > 25)  # Executes immediately
```

**Lazy (`LazyFrame`):** Operations build a query plan that is optimized before execution.

```python
lf = pl.scan_csv("file.csv")  # Doesn't read yet
result = lf.filter(pl.col("age") > 25).select("name", "age")
df = result.collect()         # Now executes the optimized query
```

When to use lazy:
- Working with large datasets
- Complex query pipelines
- When only some columns/rows are needed
- When performance is critical

Benefits of lazy evaluation:
- Automatic query optimization
- Predicate pushdown
- Projection pushdown
- Parallel execution

For detailed concepts, load `references/core_concepts.md`.
## Common Operations
### Select

Select and manipulate columns:

```python
# Select specific columns
df.select("name", "age")

# Select with expressions
df.select(
    pl.col("name"),
    (pl.col("age") * 2).alias("double_age")
)

# Select all columns matching a pattern
df.select(pl.col("^.*_id$"))
```
### Filter

Filter rows by conditions:

```python
# Single condition
df.filter(pl.col("age") > 25)

# Multiple conditions (cleaner than chaining with &)
df.filter(
    pl.col("age") > 25,
    pl.col("city") == "NY"
)

# Complex conditions
df.filter(
    (pl.col("age") > 25) | (pl.col("city") == "LA")
)
```
### With Columns

Add or modify columns while preserving existing ones:

```python
# Add new columns
df.with_columns(
    age_plus_10=pl.col("age") + 10,
    name_upper=pl.col("name").str.to_uppercase()
)

# Parallel computation (both expressions evaluated in parallel;
# aliases avoid a duplicate-name collision on "value")
df.with_columns(
    (pl.col("value") * 10).alias("value_x10"),
    (pl.col("value") * 100).alias("value_x100"),
)
```
### Group By and Aggregations

Group data and compute aggregations:

```python
# Basic grouping
df.group_by("city").agg(
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("count")
)

# Multiple group keys
df.group_by("city", "department").agg(
    pl.col("salary").sum()
)

# Conditional aggregations
df.group_by("city").agg(
    (pl.col("age") > 30).sum().alias("over_30")
)
```

For detailed operation patterns, load `references/operations.md`.
## Aggregations and Window Functions
### Aggregation Functions

Common aggregations within a `group_by` context:
- `pl.len()` - count rows
- `pl.col("x").sum()` - sum values
- `pl.col("x").mean()` - average
- `pl.col("x").min()` / `pl.col("x").max()` - extremes
- `pl.first()` / `pl.last()` - first/last values
### Window Functions with over()

Apply aggregations while preserving the row count:

```python
# Add group statistics to each row
df.with_columns(
    avg_age_by_city=pl.col("age").mean().over("city"),
    rank_in_city=pl.col("salary").rank().over("city")
)

# Multiple grouping columns
df.with_columns(
    group_avg=pl.col("value").mean().over("category", "region")
)
```

**Mapping strategies:**
- `group_to_rows` (default): Preserves original row order
- `explode`: Faster, but groups rows together
- `join`: Creates list columns
## Data I/O
### Supported Formats

Polars supports reading and writing:
- CSV, Parquet, JSON, Excel
- Databases (via connectors)
- Cloud storage (S3, Azure, GCS)
- Google BigQuery
- Multiple/partitioned files
### Common I/O Operations

**CSV:**

```python
# Eager
df = pl.read_csv("file.csv")
df.write_csv("output.csv")

# Lazy (preferred for large files)
lf = pl.scan_csv("file.csv")
result = lf.filter(...).select(...).collect()
```

**Parquet (recommended for performance):**

```python
df = pl.read_parquet("file.parquet")
df.write_parquet("output.parquet")
```

**JSON:**

```python
df = pl.read_json("file.json")
df.write_json("output.json")
```

For comprehensive I/O documentation, load `references/io_guide.md`.
## Transformations
### Joins

Combine DataFrames:

```python
# Inner join
df1.join(df2, on="id", how="inner")

# Left join
df1.join(df2, on="id", how="left")

# Join on different column names
df1.join(df2, left_on="user_id", right_on="id")
```
### Concatenation

Stack DataFrames:

```python
# Vertical (stack rows)
pl.concat([df1, df2], how="vertical")

# Horizontal (add columns)
pl.concat([df1, df2], how="horizontal")

# Diagonal (union with different schemas)
pl.concat([df1, df2], how="diagonal")
```
### Pivot and Unpivot

Reshape data:

```python
# Pivot (long to wide format); `on` replaces the deprecated `columns` parameter
df.pivot(values="sales", index="date", on="product")

# Unpivot (wide to long format)
df.unpivot(index="id", on=["col1", "col2"])
```

For detailed transformation examples, load `references/transformations.md`.
## Pandas Migration
Polars offers significant performance improvements over pandas with a cleaner API. Key differences:
### Conceptual Differences

- No index: Polars uses integer positions only
- Strict typing: no silent type conversions
- Lazy evaluation: available via LazyFrame
- Parallel by default: operations are parallelized automatically
### Common Operation Mappings

| Operation | Pandas | Polars |
|---|---|---|
| Select column | `df["col"]` | `df.select("col")` |
| Filter | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
| Add column | `df.assign(x=...)` | `df.with_columns(x=...)` |
| Group by | `df.groupby("g").agg(...)` | `df.group_by("g").agg(...)` |
| Window | `df.groupby("g")["x"].transform("mean")` | `pl.col("x").mean().over("g")` |
### Key Syntax Patterns

**Pandas sequential (slow):**

```python
df.assign(
    col_a=lambda df_: df_.value * 10,
    col_b=lambda df_: df_.value * 100
)
```

**Polars parallel (fast):**

```python
df.with_columns(
    col_a=pl.col("value") * 10,
    col_b=pl.col("value") * 100,
)
```

For a comprehensive migration guide, load `references/pandas_migration.md`.
## Best Practices
### Performance Optimization

1. Use lazy evaluation for large datasets:

   ```python
   lf = pl.scan_csv("large.csv")  # Don't use read_csv
   result = lf.filter(...).select(...).collect()
   ```

2. Avoid Python functions in hot paths:
   - Stay within the expression API for parallelization
   - Use `.map_elements()` only when necessary
   - Prefer native Polars operations

3. Use streaming for very large data:

   ```python
   lf.collect(streaming=True)
   ```

4. Select only the needed columns early:

   ```python
   # Good: Select columns early
   lf.select("col1", "col2").filter(...)

   # Bad: Filter on all columns first
   lf.filter(...).select("col1", "col2")
   ```

5. Use appropriate data types:
   - Categorical for low-cardinality strings
   - Appropriate integer sizes (i32 vs i64)
   - Date types for temporal data
### Expression Patterns

**Conditional operations:**

```python
pl.when(condition).then(value).otherwise(other_value)
```

**Operations across multiple columns:**

```python
df.select(pl.col("^.*_value$") * 2)  # Regex pattern
```

**Null handling:**

```python
pl.col("x").fill_null(0)
pl.col("x").is_null()
pl.col("x").drop_nulls()
```

For additional best practices and patterns, load `references/best_practices.md`.
## Resources
This skill includes comprehensive reference documentation in `references/`:

- `core_concepts.md` - Detailed explanations of expressions, lazy evaluation, and the type system
- `operations.md` - Comprehensive guide to all common operations, with examples
- `pandas_migration.md` - Complete migration guide from pandas to Polars
- `io_guide.md` - Data I/O operations for all supported formats
- `transformations.md` - Joins, concatenation, pivots, and reshaping operations
- `best_practices.md` - Performance optimization tips and common patterns

Load these references as needed when users require detailed information about specific topics.