Apache Spark

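Spark is a dominant engine for large-scale data processing. Spark 4.0 (preview releases in 2024, general availability in 2025) brings Spark Connect close to parity with the classic API and lets applications opt in through the `spark.api.mode` setting, so thin clients (such as an editor like VS Code on a laptop) can connect to large clusters easily.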

When to Use

Data Engineering: ETL at petabyte scale.
Streaming: Structured Streaming for real-time analytics.
Legacy ML: `spark.ml` (though often replaced by XGBoost or PyTorch for new work).

Core Concepts

Spark Connect

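Decouples the client (your laptop) from the server (the cluster): the client builds a query plan and sends it over gRPC, and the cluster executes it. Because the client is thin, Spark can also be used from languages such as Go, Rust, and TypeScript.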
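
As a rough illustration of the client side, the sketch below connects from Python to a Spark Connect server. It assumes a server is already listening on `localhost:15002` (the default port) and that PySpark is installed with its Connect dependencies. The helper names `connect_url` and `run_remote_count` are made up for this example.

```python
def connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect URL; 15002 is the server's default port."""
    return f"sc://{host}:{port}"


def run_remote_count(host: str = "localhost") -> int:
    """Connect to a Spark Connect server and run a small job on it.

    Assumes a server is reachable at the given host (hypothetical setup).
    """
    # Imported here so this file still loads where pyspark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(connect_url(host)).getOrCreate()
    try:
        # The client only describes the query; the cluster executes it.
        return spark.range(1_000_000).filter("id % 2 = 0").count()
    finally:
        spark.stop()
```

Calling `run_remote_count()` needs no local JVM on the client, only network access to the server.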

Catalyst Optimizer

Rewrites your SQL/DataFrame queries into an optimized plan before execution, for example by pushing filters down and pruning unused columns.
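
One way to see what the optimizer does is to print a query's plans with `DataFrame.explain`. The sketch below assumes a local PySpark install with a JVM available. The function name `show_plans` and the column name `bucket` are illustrative only.

```python
def show_plans() -> None:
    """Print the logical and physical plans Spark produces for a small query."""
    # Imported here so this file still loads where pyspark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("catalyst-demo")
        .getOrCreate()
    )
    try:
        df = spark.range(100).withColumn("bucket", F.col("id") % 10)
        query = df.filter(F.col("bucket") == 3).select("id")
        # "extended" prints the parsed, analyzed, and optimized logical
        # plans followed by the physical plan.
        query.explain(mode="extended")
    finally:
        spark.stop()
```

Comparing the analyzed and optimized logical plans in the output shows which rewrites were applied to the query as written.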

RDD

The low-level API underneath DataFrames. Rarely used directly in modern Spark, and not supported over Spark Connect.

Best Practices (2025)

Do:

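Use PySpark: Python is a first-class API, and Spark 4.0 adds built-in profiling for Python UDFs.
Use Delta Lake / Iceberg: Spark works best with modern table formats, which add ACID transactions, schema evolution, and time travel on top of data lake files.
Use `pandas_udf`: For vectorized Python UDFs.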
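
A minimal sketch of a vectorized UDF follows, assuming `pandas` and `pyarrow` are installed alongside PySpark. The temperature conversion, the column names, and the helper `apply_with_spark` are invented for illustration.

```python
import pandas as pd


def fahrenheit_to_celsius(temps: pd.Series) -> pd.Series:
    """Convert a whole batch of values at once instead of row by row."""
    return (temps - 32) * 5.0 / 9.0


def apply_with_spark() -> list:
    """Wrap the function as a pandas UDF and apply it to a tiny DataFrame."""
    # Imported here so the pure pandas function is usable without pyspark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("pandas-udf-demo")
        .getOrCreate()
    )
    try:
        # The Series -> Series type hints mark this as a scalar pandas UDF.
        to_celsius = pandas_udf(fahrenheit_to_celsius, returnType="double")
        df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
        rows = df.select(to_celsius("temp_f").alias("temp_c")).collect()
        return [row.temp_c for row in rows]
    finally:
        spark.stop()
```

Keeping the conversion as a plain pandas function means it can be unit-tested without starting Spark.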

Don't:

Don't use `rdd.map` from Python: It is slow, because every row is serialized to a Python worker and back. Use DataFrame operations instead.
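
To make the contrast concrete, the sketch below doubles a column both ways. It assumes a classic (non-Connect) local session, since the RDD API is not available over Spark Connect. The function name `double_ids` is hypothetical.

```python
def double_ids(n: int = 5) -> tuple[list, list]:
    """Double the ids 0..n-1 via rdd.map and via a DataFrame expression."""
    # Imported here so this file still loads where pyspark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("rdd-vs-dataframe")
        .getOrCreate()
    )
    try:
        df = spark.range(n)

        # Discouraged: the lambda runs in Python workers, so each row is
        # serialized out of the JVM and the result serialized back.
        via_rdd = df.rdd.map(lambda row: row.id * 2).collect()

        # Preferred: a column expression that stays inside the JVM and
        # goes through the Catalyst optimizer.
        doubled = df.select((F.col("id") * 2).alias("doubled"))
        via_df = [row.doubled for row in doubled.collect()]

        return via_rdd, via_df
    finally:
        spark.stop()
```

Both paths compute the same values. The difference is where the work runs and whether the optimizer can see it.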
