dask

Dask Parallel and Distributed Computing

Scale pandas/NumPy workflows beyond memory and across clusters.

When to Use

  • Datasets exceed available RAM
  • Need to parallelize pandas or NumPy operations
  • Processing multiple files efficiently (CSVs, Parquet)
  • Building custom parallel workflows
  • Distributing workloads across multiple cores/machines

Dask Collections

| Collection | Like | Use Case |
|---|---|---|
| DataFrame | pandas | Tabular data, CSV/Parquet |
| Array | NumPy | Numerical arrays, matrices |
| Bag | list | Unstructured data, JSON logs |
| Delayed | Custom | Arbitrary Python functions |

Key concept: All collections are lazy; computation happens only when you call .compute().

Lazy Evaluation

| Function | Behavior | Use |
|---|---|---|
| dd.read_csv() | Lazy load | Large CSVs |
| dd.read_parquet() | Lazy load | Large Parquet |
| Operations | Build graph | Chain transforms |
| .compute() | Execute | Get final result |

Key concept: Dask builds a task graph of operations, optimizes it, then executes in parallel. Call .compute() once at the end, not after every operation.

Schedulers

| Scheduler | Best For | Start |
|---|---|---|
| threaded | NumPy/pandas (releases GIL) | Default |
| processes | Pure Python (GIL-bound) | scheduler='processes' |
| synchronous | Debugging | scheduler='synchronous' |
| distributed | Monitoring, scaling, clusters | Client() |
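
The scheduler can be chosen per call via the scheduler= keyword; a small sketch comparing the three single-machine schedulers:

```python
import dask.array as da

arr = da.ones((100, 100), chunks=(50, 50))

# threaded (default for arrays/dataframes): good when NumPy releases the GIL
s_threads = arr.sum().compute(scheduler="threads")

# synchronous: single-threaded, ideal for debugging with pdb and clean tracebacks
s_sync = arr.sum().compute(scheduler="synchronous")

# processes: sidesteps the GIL for pure-Python work (adds serialization cost)
s_procs = arr.sum().compute(scheduler="processes")

print(s_threads, s_sync, s_procs)  # 10000.0 10000.0 10000.0
```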

Distributed Scheduler

| Feature | Benefit |
|---|---|
| Dashboard | Real-time progress monitoring |
| Cluster scaling | Add/remove workers |
| Fault tolerance | Retry failed tasks |
| Worker resources | Memory management |
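
A minimal local sketch, assuming the distributed package is installed; the worker counts here are illustrative, not a recommendation:

```python
from dask.distributed import Client
import dask.array as da

# Start a lightweight local cluster inside this process
# (threads only; tune n_workers/threads_per_worker for real workloads)
client = Client(n_workers=2, threads_per_worker=1, processes=False)

# Once a Client exists, it becomes the default scheduler for .compute()
result = da.ones((100, 100), chunks=(50, 50)).sum().compute()
print(result)  # 10000.0

# client.dashboard_link points at the live monitoring dashboard
client.close()
```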

Chunking Concepts

DataFrame Partitions

| Concept | Description |
|---|---|
| Partition | Subset of rows (like a mini DataFrame) |
| npartitions | Number of partitions |
| divisions | Index boundaries between partitions |

Array Chunks

| Concept | Description |
|---|---|
| Chunk | Subset of array (n-dimensional block) |
| chunks | Tuple of chunk sizes per dimension |
| Optimal size | ~100 MB per chunk |

Key concept: Chunk size is critical. Too small means scheduling overhead; too large means memory issues. Target ~100 MB.
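
Chunk layout is visible on the array itself, and rechunk() adjusts it; the sizes below are kept small for illustration (scale the same idea up to ~100 MB blocks):

```python
import dask.array as da

# A 4000 x 4000 float64 array is 128 MB total; here it is split into
# four 2000 x 2000 blocks of ~32 MB each (all lazy, nothing allocated)
arr = da.ones((4000, 4000), chunks=(2000, 2000))

print(arr.chunks)     # ((2000, 2000), (2000, 2000))
print(arr.numblocks)  # (2, 2)

# rechunk() changes the block layout when chunks are too small or too large
arr2 = arr.rechunk((1000, 4000))
print(arr2.numblocks)  # (4, 1)
```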

DataFrame Operations

Supported (parallel)

| Category | Operations |
|---|---|
| Selection | filter, loc, column selection |
| Aggregation | groupby, sum, mean, count |
| Transforms | apply (row-wise), map_partitions |
| Joins | merge, join (shuffles data) |
| I/O | read_csv, read_parquet, to_parquet |

Avoid or Use Carefully

| Operation | Issue | Alternative |
|---|---|---|
| iterrows | Kills parallelism | map_partitions |
| apply(axis=1) | Slow | map_partitions |
| Repeated compute() | Inefficient | Single compute() at end |
| sort_values | Expensive shuffle | Avoid if possible |

Common Patterns

ETL Pipeline

  1. dd.read_* (lazy load)
  2. Chain filters and transforms
  3. Single .compute() or .to_parquet()

Multi-File Processing

| Pattern | Description |
|---|---|
| Glob patterns | dd.read_csv('data/*.csv') |
| Partition per file | Natural parallelism |
| Output partitioned | to_parquet('output/') |

Custom Operations

| Method | Use Case |
|---|---|
| map_partitions | Apply function to each partition |
| map_blocks | Apply function to each array block |
| delayed | Wrap arbitrary Python functions |

Best Practices

| Practice | Why |
|---|---|
| Don't load locally first | Let Dask handle loading |
| Single compute() at end | Avoid redundant computation |
| Use Parquet | Faster than CSV, columnar |
| Match partitions to files | One partition per file |
| Check task graph size | len(ddf.__dask_graph__()) < 100k |
| Use distributed for debugging | Dashboard shows progress |

Common Pitfalls

| Pitfall | Solution |
|---|---|
| Loading with pandas first | Use dd.read_* directly |
| compute() in loops | Collect all, then a single compute() |
| Too many partitions | Repartition to ~100 MB each |
| Memory errors | Reduce chunk size, add workers |
| Slow shuffles | Avoid sorts/joins when possible |
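
The compute()-in-loops pitfall and its fix, sketched with a small array:

```python
import dask
import dask.array as da

arr = da.ones((100, 100), chunks=(50, 50))

# Anti-pattern: compute() inside a loop re-executes shared work each time
# slow = [(arr + i).sum().compute() for i in range(3)]

# Better: collect the lazy results, then run one shared computation
lazy = [(arr + i).sum() for i in range(3)]
results = dask.compute(*lazy)
print(results)  # (10000.0, 20000.0, 30000.0)
```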

vs Alternatives

| Tool | Best For | Trade-off |
|---|---|---|
| Dask | Scale pandas/NumPy, clusters | Setup complexity |
| Polars | Fast in-memory | Must fit in RAM |
| Vaex | Out-of-core single machine | Limited operations |
| Spark | Enterprise, SQL-heavy | Infrastructure |

Resources
