streaming-data-skill
Streaming Data Skill
Expert guidance for real-time and near-real-time data pipelines: continuous stream processing, event-driven architectures, and batch-vs-streaming decisions.
When to Use
Activate when:
- Building or troubleshooting Kafka pipelines (producers, consumers, Connect)
- Implementing stream processing with Flink, Spark Streaming, or Kafka Streams
- Designing event-driven architectures or real-time analytics
- Configuring warehouse streaming ingestion (Snowpipe, BigQuery Storage Write API)
- Creating materialized views or dynamic tables
- Evaluating latency requirements (batch vs streaming)
- Handling schema evolution, exactly-once semantics, or idempotent processing
- Debugging consumer lag, backpressure, or checkpoint failures
Do NOT use for: batch ETL (use dbt-skill), static data modeling, SQL optimization on analytical queries, data quality on static datasets, one-time migrations.
Scope Constraints
- This skill covers architecture decisions, configuration patterns, and code generation for streaming systems.
- It does NOT manage infrastructure provisioning (Terraform, CloudFormation) or CI/CD pipelines.
- For Kafka security details, production tuning, and connector configs, load references/kafka-deep-dive.md.
- For Flink/Spark framework-specific APIs, load references/flink-spark-streaming.md.
- For warehouse-native streaming (Snowpipe, BigQuery, Dynamic Tables), load references/warehouse-streaming-ingestion.md.
- For testing and replay strategies, load references/stream-testing-patterns.md.
- Load only the reference file relevant to the current task.
Model Routing
| reasoning_demand | preferred | acceptable | minimum |
|---|---|---|---|
| high | Opus | Sonnet | Sonnet |
Core Principles
Event Time over Processing Time. Design for event time. Use watermarks with `WATERMARK FOR event_timestamp AS event_timestamp - INTERVAL 'N' SECOND` to handle out-of-order and late-arriving data.
Exactly-Once vs At-Least-Once. At-least-once with idempotent consumers is simpler and often sufficient. Use exactly-once for financial transactions; at-least-once for analytics dashboards.
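The watermark principle above can be sketched in plain Python (a minimal illustration of the semantics behind the `WATERMARK FOR ... - INTERVAL` clause, not a framework API):

```python
# The watermark trails the max observed event time by a fixed delay,
# mirroring: WATERMARK FOR event_timestamp AS event_timestamp - INTERVAL '5' SECOND

def process_stream(events, delay=5):
    """events: iterable of (event_time, payload); yields (payload, is_late)."""
    max_event_time = float("-inf")
    for event_time, payload in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - delay
        # An event is "late" if the watermark has already passed its timestamp.
        yield payload, event_time < watermark

events = [(10, "a"), (12, "b"), (3, "late"), (20, "c")]
results = list(process_stream(events))
# event time 3 arrives when the watermark is 12 - 5 = 7, so it is flagged late
```

In a real pipeline the late event would be dropped, routed to a side output, or admitted within an allowed-lateness window rather than merely flagged.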
Backpressure Awareness. Monitor consumer lag and implement rate limiting. Use Kafka buffering or Flink backpressure mechanisms to smooth traffic spikes. Alert when lag exceeds thresholds.
Schema Evolution. Use schema registries (Confluent, AWS Glue) from day one. Enforce BACKWARD compatibility for most production systems. Add new fields with defaults; never remove required fields without multi-phase migration.
Idempotent Consumers. Store offsets transactionally with output, use unique keys for upserts, avoid operations that accumulate state incorrectly on replay.
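A toy sketch of the idempotent-consumer pattern (in-memory stand-ins for the sink and offset store; real systems commit both in one transaction):

```python
# Replaying the same records must not change the stored state:
# keyed upserts plus offsets committed with the output make redelivery safe.

store = {}      # stands in for a keyed sink (e.g. a table with a unique key)
offsets = {}    # committed offset per partition, stored with the output

def consume(partition, offset, key, value):
    # Skip records at or below the committed offset (already applied).
    if offset <= offsets.get(partition, -1):
        return
    store[key] = value            # upsert: safe to reapply, unlike a blind +=
    offsets[partition] = offset   # commit offset together with the write

batch = [(0, 0, "user:1", 5), (0, 1, "user:2", 7)]
for rec in batch:
    consume(*rec)
for rec in batch:                 # redelivery after a crash: a no-op
    consume(*rec)
```

Note the contrast with an accumulating operation like `store[key] += value`, which would double-count on replay.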
Architecture Decision Matrix
| Architecture | Latency | Complexity | Best For |
|---|---|---|---|
| Traditional Batch | Hours-days | Low | Historical reporting, large aggregations |
| Micro-Batch (Spark Streaming) | Seconds-minutes | Medium | Near-real-time analytics, unified batch/stream |
| True Streaming (Flink, Kafka Streams) | Milliseconds-seconds | High | Real-time dashboards, fraud detection, alerting |
| Kappa Architecture | Milliseconds-seconds | Medium | Stream-first, immutable event log, reprocessing |
| Warehouse-Native Streaming | Seconds-minutes | Low | SQL-first teams, simple ingestion, BI integration |
Stream Processing Framework Selection
| Framework | Latency | SQL | Managed | Best For |
|---|---|---|---|---|
| Kafka Streams | ms | KSQL (separate) | No | Microservices, JVM apps |
| Apache Flink | ms | Flink SQL | AWS KDA, Confluent | Complex event processing, large state |
| Spark Structured Streaming | seconds | Spark SQL | Databricks, EMR | Unified batch/stream, ML integration |
| ksqlDB | ms | Streaming SQL | Confluent Cloud | SQL-first simple transforms |
| Apache Beam/Dataflow | seconds | Limited | GCP Dataflow | Multi-cloud, GCP native |
Windowing Patterns
| Pattern | Description | Use Case |
|---|---|---|
| Tumbling | Fixed, non-overlapping intervals | Hourly aggregations, regular reporting |
| Sliding | Fixed, overlapping intervals | Moving averages, trend detection |
| Session | Gap-based, variable size | User sessions, activity bursts |
| Global | Custom trigger-controlled | Accumulate until condition met |
Configure watermarks to balance latency vs completeness. Shorter watermark delay = lower latency but may drop late data. Longer delay = higher completeness but delayed results. Handle late arrivals with side outputs (Flink) or allowed lateness windows.
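The tumbling pattern from the table reduces to bucketing by `floor(event_time / size)` — a quick sketch:

```python
# Tumbling-window aggregation in plain Python: events fall into fixed,
# non-overlapping intervals keyed by the window start time.
from collections import defaultdict

def tumbling_count(events, size):
    """events: iterable of event timestamps; returns {window_start: count}."""
    windows = defaultdict(int)
    for t in events:
        window_start = (t // size) * size   # floor to the window boundary
        windows[window_start] += 1
    return dict(windows)

counts = tumbling_count([1, 2, 61, 62, 63, 125], size=60)
# windows: [0, 60) -> 2 events, [60, 120) -> 3, [120, 180) -> 1
```

A sliding window would assign each event to every overlapping interval instead of exactly one.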
State Management & Checkpointing
State backends: RocksDB (disk, large state GB-TB), in-memory (fast, limited by heap), managed (cloud platforms).
Checkpointing: Periodic snapshots of state and positions. Shorter intervals (30s-1min) = faster recovery + more overhead. Longer intervals (5-10min) = less overhead + slower recovery.
State TTL: Expire old state to prevent unbounded growth. Set TTL based on business logic.
Savepoints: Manual checkpoints for planned downtime. Always take a savepoint before production deployments.
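State TTL can be illustrated with a small keyed store that lazily expires entries on read (the same idea Flink's `StateTtlConfig` implements; this is a sketch, not its API):

```python
# Entries not refreshed within `ttl` time units are evicted on access,
# keeping keyed state bounded instead of growing forever.

class TtlState:
    def __init__(self, ttl):
        self.ttl = ttl
        self._data = {}   # key -> (value, last_update)

    def put(self, key, value, now):
        self._data[key] = (value, now)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or now - entry[1] > self.ttl:
            self._data.pop(key, None)   # lazily expire on read
            return None
        return entry[0]

state = TtlState(ttl=30)
state.put("session:1", "active", now=0)
fresh = state.get("session:1", now=10)   # within TTL -> "active"
stale = state.get("session:1", now=45)   # past TTL -> evicted, None
```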
Schema Compatibility Modes
| Mode | Allowed Changes | Use When |
|---|---|---|
| BACKWARD | Delete fields, add optional | Consumers upgrade first (most common) |
| FORWARD | Add fields, delete optional | Producers upgrade first |
| FULL | Backward + Forward only | Upgrade order unpredictable |
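The BACKWARD rule boils down to: a reader on the new schema must be able to decode data written with the old one, so any field the new schema requires must exist in the old schema or carry a default. A simplified check over hypothetical schema dicts (not the Schema Registry API):

```python
# Schemas modeled as {field: {"required": bool, "default": ...}} for illustration.

def backward_compatible(old, new):
    """New readers must handle data written with the old schema: every
    field new requires must exist in old or have a default."""
    for name, spec in new.items():
        if spec.get("required") and name not in old and "default" not in spec:
            return False
    return True

v1 = {"id": {"required": True}}
v2_ok = {"id": {"required": True},
         "email": {"required": True, "default": ""}}   # new field with default: OK
v2_bad = {"id": {"required": True},
          "email": {"required": True}}                 # required, no default: breaks
```

Deleting a field also passes this check, matching the table: under BACKWARD, field deletion and optional additions are safe.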
Monitoring Essentials
| Metric | Alert Threshold |
|---|---|
| Consumer Lag | > 1M messages or > 5 min |
| Throughput | < 50% baseline |
| Error Rate | > 0.1% for critical pipelines |
| Checkpoint Duration | > 2x interval |
| Backpressure Ratio | > 10% sustained |
| Partition Skew | Max/min ratio > 3x |
Alert tiers: Critical (page) = consumer stopped, error spike, EOS violation. Warning (hours) = elevated lag, slow checkpoints. Info (trends) = rebalancing, schema changes.
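The lag and skew metrics in the table can be derived from per-partition log-end and committed offsets, for example:

```python
# Compute total consumer lag and partition skew from offset snapshots.

def lag_metrics(end_offsets, committed):
    lags = {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}
    total_lag = sum(lags.values())
    skew = max(lags.values()) / max(min(lags.values()), 1)  # max/min ratio
    return total_lag, skew

end = {0: 1000, 1: 5000, 2: 1200}    # log-end offset per partition
done = {0: 900, 1: 1000, 2: 1100}    # committed offset per partition
total, skew = lag_metrics(end, done)
# lags are 100, 4000, 100 -> total 4200; skew 4000/100 = 40 (partition 1 is hot)
alert = total > 1_000_000 or skew > 3   # thresholds from the monitoring table
```

Production setups typically pull these offsets from `kafka-consumer-groups` output or broker metrics rather than computing them in-process.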
Security Posture
Generates Kafka configs, stream processing code, and warehouse streaming pipelines. See Security & Compliance Patterns.
Credentials: Kafka broker auth (SASL/mTLS), Schema Registry auth, warehouse connections. All secrets via environment variables.
Kafka auth: Use SASL/SCRAM or mTLS. Never PLAINTEXT in production. Use `ConfigProvider` for connector secrets from Vault/AWS SM.
| Capability | Tier 1 (Cloud-Native) | Tier 2 (Regulated) | Tier 3 (Air-Gapped) |
|---|---|---|---|
| Kafka producer/consumer | Deploy to dev | Generate for review | Generate only |
| Flink/Spark jobs | Submit to dev | Generate for review | Generate only |
| Warehouse streaming | Configure dev | Generate configs | Generate only |
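A sketch of what a compliant producer config looks like, using librdkafka-style keys (as consumed by confluent-kafka); the environment variable names are assumptions, chosen to keep secrets out of source:

```python
# Production-leaning Kafka client config: SASL_SSL, no secrets in code.
import os

def producer_config():
    return {
        "bootstrap.servers": os.environ["KAFKA_BOOTSTRAP"],  # assumed env var names
        "security.protocol": "SASL_SSL",                     # never PLAINTEXT in prod
        "sasl.mechanism": "SCRAM-SHA-512",
        "sasl.username": os.environ["KAFKA_USER"],
        "sasl.password": os.environ["KAFKA_PASSWORD"],
        "enable.idempotence": True,                          # safe retries
        "acks": "all",
    }
```

Under Tier 2 or Tier 3, a config like this is generated for review only, never applied directly.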
Reference Files
- Kafka Deep Dive — Architecture, EOS, Connect, ksqlDB, security, production tuning
- Flink & Spark Streaming — DataStream API, Flink SQL, watermarks, state backends, deployment
- Warehouse Streaming Ingestion — Snowpipe Streaming, Dynamic Tables, BigQuery Storage Write API
- Stream Testing Patterns — Embedded Kafka, testcontainers, stream replay, backfill patterns