dask

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Dask Parallel and Distributed Computing

Dask并行与分布式计算

Scale pandas/NumPy workflows beyond memory and across clusters.

扩展pandas/NumPy工作流，突破内存限制并跨集群运行。

When to Use

适用场景

Datasets exceed available RAM
Need to parallelize pandas or NumPy operations
Processing multiple files efficiently (CSVs, Parquet)
Building custom parallel workflows
Distributing workloads across multiple cores/machines

数据集超出可用RAM
需要并行化pandas或NumPy操作
高效处理多个文件（CSV、Parquet）
构建自定义并行工作流
在多个核心/机器上分配工作负载

Dask Collections

Dask集合

Collection	Like	Use Case
DataFrame	pandas	Tabular data, CSV/Parquet
Array	NumPy	Numerical arrays, matrices
Bag	list	Unstructured data, JSON logs
Delayed	Custom	Arbitrary Python functions

Key concept: All collections are lazy—computation happens only when you call

.compute()

集合类型	类似工具	使用场景
DataFrame	pandas	表格数据、CSV/Parquet格式
Array	NumPy	数值数组、矩阵
Bag	列表	非结构化数据、JSON日志
Delayed	自定义函数	任意Python函数

核心概念：所有集合都是惰性的——只有当你调用

.compute()

时才会执行计算。

Lazy Evaluation

惰性求值

Function	Behavior	Use
`dd.read_csv()`	Lazy load	Large CSVs
`dd.read_parquet()`	Lazy load	Large Parquet
Operations	Build graph	Chain transforms
`.compute()`	Execute	Get final result

Key concept: Dask builds a task graph of operations, optimizes it, then executes in parallel. Call

.compute()

once at the end, not after every operation.

函数	行为	用途
`dd.read_csv()`	惰性加载	大型CSV文件
`dd.read_parquet()`	惰性加载	大型Parquet文件
各类操作	构建任务图	链式转换
`.compute()`	执行计算	获取最终结果

核心概念：Dask会构建操作的任务图，对其进行优化，然后并行执行。仅在最后调用一次

.compute()

，不要在每次操作后都调用。

Schedulers

调度器

Scheduler	Best For	Start
threaded	NumPy/Pandas (releases GIL)	Default
processes	Pure Python (GIL bound)	`scheduler='processes'`
synchronous	Debugging	`scheduler='synchronous'`
distributed	Monitoring, scaling, clusters	`Client()`

调度器	最佳适用场景	启动方式
threaded	NumPy/Pandas（释放GIL）	默认
processes	纯Python代码（受GIL限制）	`scheduler='processes'`
synchronous	调试	`scheduler='synchronous'`
distributed	监控、扩缩容、集群	`Client()`

Distributed Scheduler

分布式调度器

Feature	Benefit
Dashboard	Real-time progress monitoring
Cluster scaling	Add/remove workers
Fault tolerance	Retry failed tasks
Worker resources	Memory management

特性	优势
仪表盘	实时进度监控
集群扩缩容	添加/移除工作节点
容错性	重试失败任务
工作节点资源管理	内存管理

Chunking Concepts

分块概念

DataFrame Partitions

DataFrame分区

Concept	Description
Partition	Subset of rows (like a mini DataFrame)
npartitions	Number of partitions
divisions	Index boundaries between partitions

概念	描述
Partition（分区）	行的子集（类似迷你DataFrame）
npartitions	分区数量
divisions	分区之间的索引边界

Array Chunks

Array分块

Concept	Description
Chunk	Subset of array (n-dimensional block)
chunks	Tuple of chunk sizes per dimension
Optimal size	~100 MB per chunk

Key concept: Chunk size is critical. Too small = scheduling overhead. Too large = memory issues. Target ~100 MB.

概念	描述
Chunk（分块）	数组的子集（n维块）
chunks	每个维度的分块大小元组
最优大小	每个分块约100 MB

核心概念：分块大小至关重要。过小会导致调度开销过大，过大会引发内存问题。目标是每个分块约100 MB。

DataFrame Operations

DataFrame操作

Supported (parallel)

支持的并行操作

Category	Operations
Selection	`filter` , `loc` , column selection
Aggregation	`groupby` , `sum` , `mean` , `count`
Transforms	`apply` (row-wise), `map_partitions`
Joins	`merge` , `join` (shuffles data)
I/O	`read_csv` , `read_parquet` , `to_parquet`

类别	操作
筛选	`filter` 、 `loc` 、列选择
聚合	`groupby` 、 `sum` 、 `mean` 、 `count`
转换	`apply` （按行）、 `map_partitions`
连接	`merge` 、 `join` （会打乱数据）
输入输出	`read_csv` 、 `read_parquet` 、 `to_parquet`

Avoid or Use Carefully

需避免或谨慎使用的操作

Operation	Issue	Alternative
`iterrows`	Kills parallelism	`map_partitions`
`apply(axis=1)`	Slow	`map_partitions`
Repeated `compute()`	Inefficient	Single `compute()` at end
`sort_values`	Expensive shuffle	Avoid if possible

操作	问题	替代方案
`iterrows`	破坏并行性	`map_partitions`
`apply(axis=1)`	速度慢	`map_partitions`
重复调用 `compute()`	效率低下	在最后只调用一次 `compute()`
`sort_values`	开销巨大的 shuffle 操作	尽可能避免

Common Patterns

常见模式

ETL Pipeline

ETL管道

```
scan_*
```
or
```
read_*
```
(lazy load)
Chain filters and transforms
Single
```
.compute()
```
or
```
.to_parquet()
```

```
scan_*
```
或
```
read_*
```
（惰性加载）
链式调用筛选和转换操作
单次
```
.compute()
```
或
```
.to_parquet()
```

Multi-File Processing

多文件处理

Pattern	Description
Glob patterns	`dd.read_csv('data/*.csv')`
Partition per file	Natural parallelism
Output partitioned	`to_parquet('output/')`

模式	描述
通配符模式	`dd.read_csv('data/*.csv')`
每个文件对应一个分区	天然的并行性
输出分区存储	`to_parquet('output/')`

Custom Operations

自定义操作

Method	Use Case
`map_partitions`	Apply function to each partition
`map_blocks`	Apply function to each array block
`delayed`	Wrap arbitrary Python functions

方法	适用场景
`map_partitions`	对每个分区应用函数
`map_blocks`	对每个数组块应用函数
`delayed`	包装任意Python函数

Best Practices

最佳实践

Practice	Why
Don't load locally first	Let Dask handle loading
Single compute() at end	Avoid redundant computation
Use Parquet	Faster than CSV, columnar
Match partition to files	One partition per file
Check task graph size	`len(ddf.__dask_graph__())` < 100k
Use distributed for debugging	Dashboard shows progress

实践	原因
不要先本地加载数据	让Dask处理加载过程
在最后只调用一次compute()	避免重复计算
使用Parquet格式	比CSV更快，列式存储
分区与文件匹配	一个文件对应一个分区
检查任务图大小	`len(ddf.__dask_graph__())` 应小于100k
使用分布式调度器调试	仪表盘可显示进度

Common Pitfalls

常见陷阱

Pitfall	Solution
Loading with pandas first	Use `dd.read_*` directly
compute() in loops	Collect all, single compute()
Too many partitions	Repartition to ~100 MB each
Memory errors	Reduce chunk size, add workers
Slow shuffles	Avoid sorts/joins when possible

陷阱	解决方案
先用pandas加载数据	直接使用 `dd.read_*`
在循环中调用compute()	收集所有操作后，只调用一次compute()
分区过多	重新分区为每个分区约100 MB
内存错误	减小分块大小，添加工作节点
缓慢的shuffle操作	尽可能避免排序/连接操作

vs Alternatives

与替代工具对比

Tool	Best For	Trade-off
Dask	Scale pandas/NumPy, clusters	Setup complexity
Polars	Fast in-memory	Must fit in RAM
Vaex	Out-of-core single machine	Limited operations
Spark	Enterprise, SQL-heavy	Infrastructure

工具	最佳适用场景	权衡
Dask	扩展pandas/NumPy、集群环境	配置复杂度
Polars	快速内存内处理	数据必须能放入RAM
Vaex	单机器核外处理	操作有限
Spark	企业级、重SQL场景	基础设施要求高

Resources

资源

Docs: https://docs.dask.org/
Best Practices: https://docs.dask.org/en/stable/best-practices.html
Examples: https://examples.dask.org/

文档：https://docs.dask.org/
最佳实践：https://docs.dask.org/en/stable/best-practices.html
示例：https://examples.dask.org/