databricks-spark-structured-streaming
Spark Structured Streaming
Production-ready streaming pipelines with Spark Structured Streaming. This skill provides navigation to detailed patterns and best practices.
Quick Start
```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

# Schema of the JSON payload in the Kafka message value
# (placeholder fields -- adjust to your topic)
schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
])

# Basic Kafka to Delta streaming
df = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "topic")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

(df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/Volumes/catalog/checkpoints/stream")
    .trigger(processingTime="30 seconds")
    .start("/delta/target_table")
)
```
Core Patterns
| Pattern | Description | Reference |
|---|---|---|
| Kafka Streaming | Kafka to Delta, Kafka to Kafka, Real-Time Mode | See kafka-streaming.md |
| Stream Joins | Stream-stream joins, stream-static joins | See stream-stream-joins.md, stream-static-joins.md |
| Multi-Sink Writes | Write to multiple tables, parallel merges | See multi-sink-writes.md |
| Merge Operations | MERGE performance, parallel merges, optimizations | See merge-operations.md |
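The stream-static join pattern in the table above can be sketched as follows (table and column names are placeholders; the full pattern lives in stream-static-joins.md). Note the left join, which matches the production checklist below:

```python
# Hypothetical example: enrich a streaming table with a static Delta
# dimension table. All table and column names are illustrative.
events = spark.readStream.table("catalog.schema.events")   # streaming source
dim_users = spark.read.table("catalog.schema.dim_users")   # static lookup

# LEFT join keeps events with no matching dimension row;
# an inner join would silently drop them.
enriched = events.join(dim_users, on="user_id", how="left")

(enriched.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/Volumes/catalog/checkpoints/enriched")
    .toTable("catalog.schema.events_enriched")
)
```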
Configuration
| Topic | Description | Reference |
|---|---|---|
| Checkpoints | Checkpoint management and best practices | See checkpoint-best-practices.md |
| Stateful Operations | Watermarks, state stores, RocksDB configuration | See stateful-operations.md |
| Trigger & Cost | Trigger selection, cost optimization, RTM | See trigger-and-cost-optimization.md |
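A minimal sketch of the stateful-operations row above, assuming an `events` stream with an `event_time` timestamp column (column names and durations are illustrative; see stateful-operations.md for the full treatment):

```python
from pyspark.sql.functions import window, col

# Enable the RocksDB state store before starting stateful streams
# (Databricks-specific provider class).
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider",
)

# Windowed aggregation with a watermark: state older than the watermark
# is evicted, so memory use stays bounded.
counts = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"), col("event_type"))
    .count()
)
```

Without the watermark, the aggregation would retain every window's state forever and the stream would eventually run out of memory.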
Best Practices
| Topic | Description | Reference |
|---|---|---|
| Production Checklist | Comprehensive best practices | See streaming-best-practices.md |
Production Checklist
- Checkpoint location is persistent (UC volumes, not DBFS)
- Unique checkpoint per stream
- Fixed-size cluster (no autoscaling for streaming)
- Monitoring configured (input rate, lag, batch duration)
- Exactly-once verified (txnVersion/txnAppId)
- Watermark configured for stateful operations
- Left joins for stream-static (not inner)
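The exactly-once item in the checklist can be sketched with Delta's idempotent-write options inside `foreachBatch` (the app id and paths are placeholders):

```python
app_id = "orders_stream_v1"  # any string, fixed per logical stream

def upsert_batch(batch_df, batch_id):
    # txnAppId + txnVersion make the Delta write idempotent: if the micro-batch
    # is replayed after a failure, the duplicate write is skipped.
    (batch_df.write
        .format("delta")
        .option("txnAppId", app_id)
        .option("txnVersion", batch_id)
        .mode("append")
        .save("/delta/target_table")
    )

(df.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/Volumes/catalog/checkpoints/orders")
    .start()
)
```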