Kafka Development

You are an expert in Apache Kafka event streaming and distributed messaging systems. Follow these best practices when building Kafka-based applications.

Core Principles

  • Kafka is a distributed event streaming platform for high-throughput, fault-tolerant messaging
  • Unlike traditional pub/sub, Kafka uses a pull model - consumers pull messages from partitions
  • Design for scalability, durability, and exactly-once semantics where needed
  • Leave NO todos, placeholders, or missing pieces in the implementation

Architecture Overview

Core Components

  • Topics: Categories/feeds for organizing messages
  • Partitions: Ordered, immutable sequences within topics enabling parallelism
  • Producers: Clients that publish messages to topics
  • Consumers: Clients that read messages from topics
  • Consumer Groups: Coordinate consumption across multiple consumers
  • Brokers: Kafka servers that store data and serve clients

Key Concepts

  • Messages are appended to partitions in order
  • Each message has an offset - a unique sequential ID within the partition
  • Consumers maintain their own cursor (offset) and can read streams repeatedly
  • Partitions are distributed across brokers for scalability

Topic Design

Partitioning Strategy

  • Use partition keys to place related events in the same partition
  • Messages with the same key always go to the same partition
  • This ensures ordering for related events
  • Choose keys carefully - uneven distribution causes hot partitions

Partition Count

  • More partitions = more parallelism but more overhead
  • Consider: expected throughput, consumer count, broker resources
  • Start with the number of consumers you expect to run concurrently
  • Partitions can be increased but not decreased

Topic Configuration

主题配置

  • retention.ms: How long to keep messages (default 7 days)
  • retention.bytes: Maximum size per partition
  • cleanup.policy: delete (remove old segments) or compact (keep the latest value per key)
  • min.insync.replicas: Minimum replicas that must acknowledge a write

Producer Best Practices

Reliability Settings

acks=all               # Wait for all replicas to acknowledge
retries=MAX_INT        # Retry on transient failures
enable.idempotence=true # Prevent duplicate messages on retry
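
The settings above can be sketched as a client configuration map. Property names follow Kafka's standard producer configs; the broker address and the idea of passing a plain dict to the client library are assumptions for illustration:

```python
# Reliability-first producer settings as a Kafka client config dict.
# Property names are Kafka's standard producer configs; the bootstrap
# address is an assumption for a local setup.
reliable_producer_config = {
    "bootstrap.servers": "localhost:9092",       # assumed local broker
    "acks": "all",                               # wait for all in-sync replicas
    "retries": 2147483647,                       # Integer.MAX_VALUE: retry transient failures
    "enable.idempotence": True,                  # broker dedupes retried sends
    "max.in.flight.requests.per.connection": 5,  # safe ordering with idempotence on
}
```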

Performance Tuning

  • batch.size: Bytes to accumulate per partition before sending (default 16KB)
  • linger.ms: Wait time for batching (0 = send immediately)
  • buffer.memory: Total memory for buffering unsent messages
  • compression.type: gzip, snappy, lz4, or zstd for bandwidth savings

Error Handling

  • Implement retry logic with exponential backoff
  • Handle retriable vs non-retriable exceptions differently
  • Log and alert on send failures
  • Consider dead letter topics for messages that fail repeatedly
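
A minimal sketch of retry with exponential backoff and jitter, assuming a generic `send` callable; real producer code would retry only on retriable exceptions rather than bare `Exception`:

```python
import random
import time

def send_with_backoff(send, record, max_attempts=5, base_delay=0.1):
    """Retry `send(record)` with exponential backoff plus jitter.

    `send` is any callable that raises on failure. On exhaustion the
    exception propagates so the caller can route the record to a
    dead letter topic.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send(record)
        except Exception:
            if attempt == max_attempts:
                raise  # budget spent: let the caller dead-letter it
            # delay doubles each attempt (0.1s, 0.2s, 0.4s, ...) plus jitter
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.05))
```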

Partitioner

分区器

  • Default: hash of key determines partition (null key = round-robin)
  • Custom partitioners for specific routing needs
  • Ensure even distribution to avoid hot partitions
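
The default-partitioner contract can be illustrated in a few lines. Note this is a stand-in: Kafka's real default hashes keys with murmur2, and since 2.4 it assigns keyless records to sticky batches rather than strict round-robin; crc32 here is only a stable hash for the sketch:

```python
import itertools
import zlib

_rr = itertools.count()  # fallback counter for keyless records

def pick_partition(key, num_partitions):
    """Same key always maps to the same partition; a None key falls
    back to round-robin. crc32 is a stable stand-in for murmur2."""
    if key is None:
        return next(_rr) % num_partitions
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```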

Consumer Best Practices

消费者最佳实践

Offset Management

偏移量管理

  • Consumers track which messages they've processed via offsets
  • auto.offset.reset: earliest (start from the beginning) or latest (only new messages)
  • Commit offsets after successful processing, not before
  • Use enable.auto.commit=false and commit manually as a prerequisite for exactly-once semantics
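
The commit-after-processing rule can be sketched against a generic consumer exposing `poll()`/`commit()` (method names are illustrative, in the style of common Kafka clients):

```python
# Commit-after-processing with auto-commit disabled. If handle() raises
# mid-batch, no offset is committed, so the batch is redelivered on
# restart: duplicates are possible (at-least-once), loss is not.
def consume_loop(consumer, handle, commit_log):
    records = consumer.poll()
    for record in records:
        handle(record)        # may raise: offset is then NOT committed
    if records:
        consumer.commit()     # only after the whole batch succeeded
        commit_log.append(records[-1]["offset"])
```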

Consumer Groups

消费者组

  • Consumers in a group share partitions (each partition is assigned to exactly one consumer)
  • More consumers than partitions = some consumers idle
  • Group rebalancing occurs when consumers join or leave
  • Use group.instance.id for static membership to reduce rebalances

Processing Patterns

处理模式

  • Process messages in order within a partition
  • Handle out-of-order messages across partitions if needed
  • Implement idempotent processing for at-least-once delivery
  • Consider transactional processing for exactly-once
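
A minimal sketch of idempotent processing for at-least-once delivery; in production the seen-set lives in a durable store (e.g. a database unique constraint), not in memory:

```python
# At-least-once delivery means redelivery after failures, so handlers must
# be idempotent: track processed record IDs and skip duplicates.
class IdempotentHandler:
    def __init__(self, process):
        self._process = process
        self._seen = set()  # durable store in real deployments

    def handle(self, record_id, payload):
        if record_id in self._seen:
            return False          # duplicate delivery: no side effects
        self._process(payload)
        self._seen.add(record_id)
        return True
```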

Timeouts and Failures

超时与故障

  • Implement processing timeout to isolate slow events
  • When timeout occurs, set event aside and continue to next message
  • Prioritize overall system performance over processing every single event
  • Use dead letter queues for messages failing all retries
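
One way to sketch the timeout-and-set-aside pattern, using a thread pool to bound processing time. It is a simplification: the worker thread itself is not cancelled on timeout, so real deployments pair this with cooperative cancellation:

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def process_with_timeout(handle, record, timeout_s, set_aside):
    """Run handle(record) under a wall-clock budget. On timeout, park the
    record (e.g. for a retry or dead letter topic) and move on, so one
    slow event cannot stall the whole partition."""
    future = _pool.submit(handle, record)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        set_aside(record)  # park for redrive/DLT, keep the partition moving
        return None
```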

Error Handling and Retry

错误处理与重试

Retry Strategy

重试策略

  • Allow multiple runtime retries per processing attempt
  • Example: 3 runtime retries per redrive, maximum 5 redrives = 15 total retries
  • Runtime retries typically cover 99% of failures
  • After exhausting retries, route to dead letter queue
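
The budget above can be expressed as a decision function. This is one way to count it; the names and the exact counting convention are illustrative:

```python
def next_step(runtime_attempts, redrives, runtime_limit=3, redrive_limit=5):
    """Decide what to do with a failed record: 3 runtime retries per
    redrive x 5 redrives = 15 total retries before the dead letter queue."""
    if runtime_attempts < runtime_limit:
        return "retry"        # in-process retry covers most transient failures
    if redrives < redrive_limit:
        return "redrive"      # re-enqueue; the runtime counter resets
    return "dead-letter"      # budget exhausted: park for inspection
```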

Dead Letter Topics

死信主题

  • Create dedicated DLT for messages that can't be processed
  • Include original topic, partition, offset, and error details
  • Monitor DLT for patterns indicating systemic issues
  • Implement manual or automated retry from DLT
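
A sketch of the DLT envelope; the field names are an illustrative convention, not a Kafka standard:

```python
import time

def to_dead_letter(record, error):
    """Wrap a poison message with the context needed for triage and
    replay: where it came from, why it failed, and the raw payload."""
    return {
        "original_topic": record["topic"],
        "original_partition": record["partition"],
        "original_offset": record["offset"],
        "error_class": type(error).__name__,
        "error_message": str(error),
        "failed_at": int(time.time() * 1000),  # epoch millis
        "payload": record["value"],            # keep raw data for replay
    }
```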

Schema Management

Schema 管理

Schema Registry

Schema Registry(Schema注册表)

  • Use Confluent Schema Registry for schema management
  • Producers validate data against registered schemas during serialization
  • Schema mismatches throw exceptions, preventing malformed data
  • Provides common reference for producers and consumers

Schema Evolution

Schema 演进

  • Design schemas for forward and backward compatibility
  • Add optional fields with defaults for backward compatibility
  • Avoid removing or renaming fields
  • Use schema versioning and migration strategies
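
A backward-compatible evolution step in Avro terms, with made-up record and field names: v2 adds an optional field with a default, so records written with v1 still deserialize under v2:

```python
# Avro schemas as plain dicts. The record name and fields are invented
# for illustration.
ORDER_V1 = {
    "type": "record", "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

ORDER_V2 = {
    "type": "record", "name": "Order",
    "fields": ORDER_V1["fields"] + [
        # union with null plus a default makes the field optional, so old
        # data lacking it still deserializes (the backward-compat rule)
        {"name": "coupon_code", "type": ["null", "string"], "default": None},
    ],
}
```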

Kafka Streams

State Management

状态管理

  • Use log compaction to retain the latest value for each key
  • Periodically purge old data from state stores
  • Monitor state store size and access patterns
  • Use appropriate storage backends for your scale

Windowing Operations

窗口操作

  • Handle out-of-order events and skewed timestamps
  • Use appropriate time extraction and watermarking techniques
  • Configure grace periods for late-arriving data
  • Choose window types based on use case (tumbling, hopping, sliding, session)
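
Tumbling-window assignment and the grace-period check reduce to simple arithmetic on event timestamps. This is a sketch of the concepts, not the Kafka Streams API:

```python
def tumbling_window(timestamp_ms, size_ms):
    """Assign an event time to its tumbling window [start, end).
    Tumbling windows are fixed-size and non-overlapping, so assignment
    is integer division on the event timestamp."""
    start = (timestamp_ms // size_ms) * size_ms
    return (start, start + size_ms)

def is_late(timestamp_ms, watermark_ms, grace_ms):
    """A late event is dropped only once it falls behind the watermark
    by more than the configured grace period."""
    return timestamp_ms < watermark_ms - grace_ms
```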

Security

安全性

Authentication

身份验证

  • Use SASL/SSL for client authentication
  • Support SASL mechanisms: PLAIN, SCRAM, OAUTHBEARER, GSSAPI
  • Enable SSL for encryption in transit
  • Rotate credentials regularly

Authorization

授权

  • Use Kafka ACLs for fine-grained access control
  • Grant minimum necessary permissions per principal
  • Separate read/write permissions by topic
  • Audit access patterns regularly

Monitoring and Observability

监控与可观测性

Key Metrics

关键指标

  • Producer: record-send-rate, record-error-rate, batch-size-avg
  • Consumer: records-consumed-rate, records-lag, commit-latency
  • Broker: under-replicated-partitions, request-latency, disk-usage

Lag Monitoring

滞后量监控

  • Consumer lag = last produced offset - last committed offset
  • High lag indicates consumers can't keep up
  • Alert on increasing lag trends
  • Scale consumers or optimize processing
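
The lag formula above, computed per partition; the offset maps are assumed inputs (in practice fetched from the cluster, e.g. via an admin client):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = log-end offset minus last committed offset.
    Keys are (topic, partition) pairs; a partition with no commit yet
    counts its full log as lag."""
    return {
        tp: end - committed_offsets.get(tp, 0)
        for tp, end in end_offsets.items()
    }
```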

Distributed Tracing

分布式追踪

  • Propagate trace context in message headers
  • Use OpenTelemetry for end-to-end tracing
  • Correlate producer and consumer spans
  • Track message journey through the pipeline

Testing

测试

Unit Testing

单元测试

  • Mock Kafka clients for isolated testing
  • Test serialization/deserialization logic
  • Verify partitioning logic
  • Test error handling paths

Integration Testing

集成测试

  • Use embedded Kafka or Testcontainers
  • Test full producer-consumer flows
  • Verify exactly-once semantics if used
  • Test rebalancing scenarios

Performance Testing

性能测试

  • Load test with production-like message rates
  • Test consumer throughput and lag behavior
  • Verify broker resource usage under load
  • Test failure and recovery scenarios

Common Patterns

常见模式

Event Sourcing

事件溯源

  • Store all state changes as immutable events
  • Rebuild state by replaying events
  • Use log compaction for snapshots
  • Enable time-travel debugging
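
Event sourcing in miniature, with made-up event shapes: state is never stored directly, only derived by folding the immutable log, and replaying a prefix gives the state as of any point in time:

```python
# Replay an event log into current state. Event types and fields are
# invented for illustration.
def replay(events, upto=None):
    balance = 0
    for event in events[:upto]:   # upto=None replays the whole log
        if event["type"] == "deposited":
            balance += event["amount"]
        elif event["type"] == "withdrawn":
            balance -= event["amount"]
    return balance
```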

CQRS (Command Query Responsibility Segregation)

CQRS(命令查询职责分离)

  • Separate write (command) and read (query) models
  • Use Kafka as the event store
  • Build read-optimized projections from events
  • Handle eventual consistency appropriately

Saga Pattern

Saga模式

  • Coordinate distributed transactions across services
  • Each service publishes events for next step
  • Implement compensating transactions for rollback
  • Use correlation IDs to track saga instances

Change Data Capture (CDC)

变更数据捕获(CDC)

  • Capture database changes as Kafka events
  • Use Debezium or similar CDC tools
  • Enable real-time data synchronization
  • Build event-driven integrations