databricks-spark-structured-streaming

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Spark Structured Streaming

Spark Structured Streaming

Production-ready streaming pipelines with Spark Structured Streaming. This skill provides navigation to detailed patterns and best practices.
基于Spark Structured Streaming的生产级流处理管道。本指南提供了详细模式与最佳实践的导航。

Quick Start

快速入门

python
from pyspark.sql.functions import col, from_json
python
from pyspark.sql.functions import col, from_json

Basic Kafka to Delta streaming

Basic Kafka to Delta streaming

df = (spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "broker:9092") .option("subscribe", "topic") .load() .select(from_json(col("value").cast("string"), schema).alias("data")) .select("data.*") )
df.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/Volumes/catalog/checkpoints/stream")
.trigger(processingTime="30 seconds")
.start("/delta/target_table")
undefined
df = (spark .readStream .format("kafka") .option("kafka.bootstrap.servers", "broker:9092") .option("subscribe", "topic") .load() .select(from_json(col("value").cast("string"), schema).alias("data")) .select("data.*") )
df.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/Volumes/catalog/checkpoints/stream")
.trigger(processingTime="30 seconds")
.start("/delta/target_table")
undefined

Core Patterns

核心模式

PatternDescriptionReference
Kafka StreamingKafka to Delta, Kafka to Kafka, Real-Time ModeSee kafka-streaming.md
Stream JoinsStream-stream joins, stream-static joinsSee stream-stream-joins.md, stream-static-joins.md
Multi-Sink WritesWrite to multiple tables, parallel mergesSee multi-sink-writes.md
Merge OperationsMERGE performance, parallel merges, optimizationsSee merge-operations.md
模式描述参考
Kafka StreamingKafka到Delta、Kafka到Kafka、实时模式参见 kafka-streaming.md
流连接流-流连接、流-静态连接参见 stream-stream-joins.md, stream-static-joins.md
多输出端写入写入多张表、并行合并参见 multi-sink-writes.md
合并操作MERGE性能、并行合并、优化参见 merge-operations.md

Configuration

配置

TopicDescriptionReference
CheckpointsCheckpoint management and best practicesSee checkpoint-best-practices.md
Stateful OperationsWatermarks, state stores, RocksDB configurationSee stateful-operations.md
Trigger & CostTrigger selection, cost optimization, RTMSee trigger-and-cost-optimization.md
主题描述参考
检查点检查点管理与最佳实践参见 checkpoint-best-practices.md
有状态操作水印、状态存储、RocksDB配置参见 stateful-operations.md
触发器与成本触发器选择、成本优化、RTM参见 trigger-and-cost-optimization.md

Best Practices

最佳实践

TopicDescriptionReference
Production ChecklistComprehensive best practicesSee streaming-best-practices.md
主题描述参考
生产环境最佳实践全面的最佳实践参见 streaming-best-practices.md

Production Checklist

生产环境检查清单

  • Checkpoint location is persistent (UC volumes, not DBFS)
  • Unique checkpoint per stream
  • Fixed-size cluster (no autoscaling for streaming)
  • Monitoring configured (input rate, lag, batch duration)
  • Exactly-once verified (txnVersion/txnAppId)
  • Watermark configured for stateful operations
  • Left joins for stream-static (not inner)
  • 检查点位置为持久化存储(使用UC卷,而非DBFS)
  • 每个流处理任务使用唯一的检查点
  • 使用固定大小的集群(流处理任务不启用自动扩缩容)
  • 已配置监控(输入速率、延迟、批处理时长)
  • 已验证精确一次语义(txnVersion/txnAppId)
  • 为有状态操作配置水印
  • 流-静态连接使用左连接(而非内连接)