Kafka Development
You are an expert in Apache Kafka event streaming and distributed messaging systems. Follow these best practices when building Kafka-based applications.
Core Principles
- Kafka is a distributed event streaming platform for high-throughput, fault-tolerant messaging
- Unlike traditional pub/sub, Kafka uses a pull model - consumers pull messages from partitions
- Design for scalability, durability, and exactly-once semantics where needed
- Leave NO todos, placeholders, or missing pieces in the implementation
Architecture Overview
Core Components
- Topics: Categories/feeds for organizing messages
- Partitions: Ordered, immutable sequences within topics enabling parallelism
- Producers: Clients that publish messages to topics
- Consumers: Clients that read messages from topics
- Consumer Groups: Coordinate consumption across multiple consumers
- Brokers: Kafka servers that store data and serve clients
Key Concepts
- Messages are appended to partitions in order
- Each message has an offset - a unique sequential ID within the partition
- Consumers maintain their own cursor (offset) and can read streams repeatedly
- Partitions are distributed across brokers for scalability
Topic Design
Partitioning Strategy
- Use partition keys to place related events in the same partition
- Messages with the same key always go to the same partition
- This ensures ordering for related events
- Choose keys carefully - uneven distribution causes hot partitions
Partition Count
- More partitions = more parallelism but more overhead
- Consider: expected throughput, consumer count, broker resources
- Start with number of consumers you expect to run concurrently
- Partitions can be increased but not decreased
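The sizing advice above can be sketched as a back-of-the-envelope calculation. This is a common community heuristic, not an official formula; the per-partition throughput figures and function name are illustrative assumptions you would replace with your own benchmarks.

```python
import math

def estimate_partitions(target_mb_s, producer_mb_s_per_partition,
                        consumer_mb_s_per_partition, expected_consumers):
    """Heuristic: enough partitions to hit the target throughput on both the
    produce and consume side, and at least one partition per concurrent
    consumer. Round up; partitions can be added later but never removed."""
    by_producer = math.ceil(target_mb_s / producer_mb_s_per_partition)
    by_consumer = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(by_producer, by_consumer, expected_consumers)

# 100 MB/s target; 10 MB/s producing and 5 MB/s consuming per partition; 8 consumers
print(estimate_partitions(100, 10, 5, 8))  # -> 20
```

Consuming is the bottleneck here (20 partitions needed at 5 MB/s each), so it drives the count.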
Topic Configuration
- retention.ms: How long to keep messages (default 7 days)
- retention.bytes: Maximum size per partition
- cleanup.policy: delete (remove old) or compact (keep latest per key)
- min.insync.replicas: Minimum replicas that must acknowledge
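As a concrete sketch, the settings above could be collected into a config map to pass to an admin client when creating the topic. The topic name and values are illustrative assumptions, not recommendations.

```python
# Illustrative configuration for a hypothetical "order-events" topic.
# Broker config values are passed as strings.
order_events_config = {
    "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # keep messages 7 days
    "retention.bytes": str(10 * 1024**3),          # cap each partition at 10 GiB
    "cleanup.policy": "delete",                    # drop old segments ("compact" keeps latest per key)
    "min.insync.replicas": "2",                    # with acks=all, require 2 in-sync replica acks
}
print(order_events_config["retention.ms"])  # -> "604800000"
```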
Producer Best Practices
Reliability Settings
acks=all                 # Wait for all replicas to acknowledge
retries=MAX_INT          # Retry on transient failures
enable.idempotence=true  # Prevent duplicate messages on retry
Performance Tuning
- batch.size: Accumulate messages before sending (default 16KB)
- linger.ms: Wait time for batching (0 = send immediately)
- buffer.memory: Total memory for buffering unsent messages
- compression.type: gzip, snappy, lz4, or zstd for bandwidth savings
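Putting the reliability and tuning options together, a producer configuration might look like the map below. Property names follow the Java client; the values are illustrative starting points to tune against your own workload, not universal defaults.

```python
# Producer settings combining the reliability and batching options above
# (Java-client property names; values are illustrative starting points).
producer_props = {
    "acks": "all",                       # durability over latency
    "enable.idempotence": "true",        # no duplicates on retry
    "batch.size": str(32 * 1024),        # 32 KB batches (default is 16 KB)
    "linger.ms": "5",                    # wait up to 5 ms to fill a batch
    "compression.type": "lz4",           # cheap CPU-for-bandwidth trade
    "buffer.memory": str(64 * 1024**2),  # 64 MB buffer for unsent records
}
print(producer_props["batch.size"])  # -> "32768"
```

A small nonzero linger.ms usually improves throughput noticeably via batching at the cost of a few milliseconds of latency.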
Error Handling
- Implement retry logic with exponential backoff
- Handle retriable vs non-retriable exceptions differently
- Log and alert on send failures
- Consider dead letter topics for messages that fail repeatedly
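The retry guidance above can be sketched as a small wrapper. The function name and the exception classes standing in for a client's retriable errors are assumptions for illustration; real Kafka clients define their own retriable/non-retriable exception hierarchy.

```python
import random
import time

RETRIABLE = (TimeoutError, ConnectionError)  # stand-ins for a client's retriable errors

def send_with_backoff(send, record, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call send(record), retrying retriable failures with exponential
    backoff plus jitter; non-retriable errors propagate immediately."""
    for attempt in range(max_attempts):
        try:
            return send(record)
        except RETRIABLE:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: caller routes record to a dead letter topic
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids retry storms
```

The jitter factor matters in practice: without it, many producers that failed together retry together and hammer the broker in waves.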
Partitioner
- Default: hash of key determines partition (null key = round-robin)
- Custom partitioners for specific routing needs
- Ensure even distribution to avoid hot partitions
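A minimal sketch of the partitioning idea, with CRC32 standing in for the real hash (Kafka's Java client actually uses murmur2 for keys, and recent versions use sticky partitioning rather than plain round-robin for null keys):

```python
import itertools
from zlib import crc32

class SketchPartitioner:
    """Illustrative partitioner: keyed records hash to a stable partition,
    unkeyed records rotate across partitions."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key):
        if key is None:
            return next(self._round_robin)       # spread unkeyed records evenly
        return crc32(key) % self.num_partitions  # same key -> same partition

p = SketchPartitioner(6)
assert p.partition(b"order-42") == p.partition(b"order-42")  # stable for a given key
```

Note the consequence for hot partitions: if one key (say, one big customer ID) dominates your traffic, every one of its records lands on the same partition no matter how many partitions exist.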
Consumer Best Practices
Offset Management
- Consumers track which messages they've processed via offsets
- auto.offset.reset: earliest (start from beginning) or latest (only new messages)
- Commit offsets after successful processing, not before
- Use enable.auto.commit=false for exactly-once semantics
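The commit-after-processing rule can be shown with an in-memory stand-in for a consumer (the class and loop below are illustrative, not a real client API):

```python
class FakeConsumer:
    """In-memory stand-in for a Kafka consumer, just to show commit ordering."""
    def __init__(self, records):
        self._records = iter(records)
        self.committed = None  # last committed offset

    def poll(self):
        return next(self._records, None)

    def commit(self, offset):
        self.committed = offset

def consume(consumer, handle):
    # With enable.auto.commit=false, commit only AFTER a record is fully
    # processed: a crash mid-processing replays the record instead of losing it.
    while (record := consumer.poll()) is not None:
        handle(record)                         # may raise; offset stays uncommitted
        consumer.commit(record["offset"] + 1)  # commit the NEXT position to read

c = FakeConsumer([{"offset": 0, "value": "a"}, {"offset": 1, "value": "b"}])
consume(c, lambda record: None)
print(c.committed)  # -> 2
```

The committed value is the next offset to read, one past the last processed record, which matches how Kafka interprets committed offsets.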
Consumer Groups
- Consumers in a group share partitions (each partition to one consumer)
- More consumers than partitions = some consumers idle
- Group rebalancing occurs when consumers join/leave
- Use group.instance.id for static membership to reduce rebalances
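The one-partition-one-consumer rule is easy to see in a toy assignment function (a simplified sketch; Kafka's actual assignors, such as range and cooperative-sticky, are more involved):

```python
def assign_partitions(partitions, consumers):
    """Group-assignment sketch: each partition goes to exactly one consumer;
    when consumers outnumber partitions, the surplus consumers sit idle."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

a = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"])
print(a)  # -> {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}  (c4 is idle)
```

This is why partition count caps useful consumer parallelism: the fourth consumer above contributes nothing until the topic gains a fourth partition.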
Processing Patterns
- Process messages in order within a partition
- Handle out-of-order messages across partitions if needed
- Implement idempotent processing for at-least-once delivery
- Consider transactional processing for exactly-once
Timeouts and Failures
- Implement processing timeout to isolate slow events
- When timeout occurs, set event aside and continue to next message
- Prioritize overall system performance over processing every single event
- Use dead letter queues for messages failing all retries
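A minimal sketch of the set-aside-and-continue pattern, using a thread pool to impose a deadline on each handler (function names are assumptions; note the caveat in the comment about threads that outlive their deadline):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def process_stream(records, handle, timeout_s, set_aside):
    """Run each record's handler under a deadline; records that exceed it are
    parked (e.g. republished to a retry/DLQ topic) so the stream keeps moving.
    Caveat: a timed-out handler keeps running on its worker thread until it
    returns, so real implementations also need cancellation or bulkheads."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        for record in records:
            future = pool.submit(handle, record)
            try:
                future.result(timeout=timeout_s)
            except FutureTimeout:
                set_aside(record)  # don't block the partition on one slow event

parked = []
def handle(record):
    if record == "slow":
        time.sleep(0.3)  # simulate a pathological event

process_stream(["fast-1", "slow", "fast-2"], handle,
               timeout_s=0.1, set_aside=parked.append)
print(parked)  # -> ['slow']
```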
Error Handling and Retry
Retry Strategy
- Allow multiple runtime retries per processing attempt
- Example: 3 runtime retries per redrive, maximum 5 redrives = 15 total retries
- Runtime retries typically cover 99% of failures
- After exhausting retries, route to dead letter queue
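The retry budget above can be sketched end to end. The function name is illustrative, and the redrive step is simulated in-process; a real system would re-publish to a retry topic, typically with a delay between redrives.

```python
def deliver(process, message, runtime_retries=3, max_redrives=5):
    """Retry budget from the bullets above: `runtime_retries` in-process
    attempts per delivery, `max_redrives` deliveries, then the dead letter queue."""
    for redrive in range(max_redrives):
        for attempt in range(runtime_retries):
            try:
                return process(message)
            except Exception:
                continue  # cheap runtime retry; covers most transient failures
        # Redrive point: in a real system, re-publish to a retry topic here.
    return "DLQ"  # 3 runtime retries x 5 redrives = 15 failed attempts

attempts = {"n": 0}
def always_fails(message):
    attempts["n"] += 1
    raise RuntimeError("boom")

print(deliver(always_fails, {}), attempts["n"])  # -> DLQ 15
```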
Dead Letter Topics
- Create dedicated DLT for messages that can't be processed
- Include original topic, partition, offset, and error details
- Monitor DLT for patterns indicating systemic issues
- Implement manual or automated retry from DLT
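A sketch of enriching a failed record before publishing it to the DLT. The header names here are illustrative conventions, not a standard (frameworks like Spring Kafka define their own):

```python
import time

def to_dead_letter(record, error):
    """Wrap a failed record for the dead letter topic, preserving where it
    came from and why it failed, so operators can diagnose and redrive it."""
    return {
        "value": record["value"],  # original payload, unchanged
        "headers": {
            "dlt.original.topic": record["topic"],
            "dlt.original.partition": str(record["partition"]),
            "dlt.original.offset": str(record["offset"]),
            "dlt.exception.class": type(error).__name__,
            "dlt.exception.message": str(error),
            "dlt.failed.at.ms": str(int(time.time() * 1000)),
        },
    }

failed = {"topic": "orders", "partition": 2, "offset": 1041, "value": b"..."}
dead = to_dead_letter(failed, ValueError("bad currency code"))
print(dead["headers"]["dlt.original.offset"])  # -> 1041
```

Keeping the metadata in headers rather than the payload means the original bytes can be redriven to the source topic untouched.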
Schema Management
Schema Registry
- Use Confluent Schema Registry for schema management
- Producers validate data against registered schemas during serialization
- Schema mismatches throw exceptions, preventing malformed data
- Provides common reference for producers and consumers
Schema Evolution
- Design schemas for forward and backward compatibility
- Add optional fields with defaults for backward compatibility
- Avoid removing or renaming fields
- Use schema versioning and migration strategies
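A concrete example of a compatible change, written as Avro schema definitions (the `Order` record and its fields are hypothetical):

```python
# Version 1 of an illustrative Avro record schema.
order_v1 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# Version 2 adds an OPTIONAL field with a default. Old consumers simply
# ignore it, and new consumers reading old v1 data fill in the default,
# so the change stays compatible in both directions. Removing or renaming
# order_id, by contrast, would break existing readers.
order_v2 = {
    "type": "record",
    "name": "Order",
    "fields": order_v1["fields"] + [
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}
```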
Kafka Streams
State Management
- Implement log compaction to maintain latest version of each key
- Periodically purge old data from state stores
- Monitor state store size and access patterns
- Use appropriate storage backends for your scale
Windowing Operations
- Handle out-of-order events and skewed timestamps
- Use appropriate time extraction and watermarking techniques
- Configure grace periods for late-arriving data
- Choose window types based on use case (tumbling, hopping, sliding, session)
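Two of these ideas reduce to small formulas: assigning an event to its tumbling window, and deciding whether a late event still falls inside the grace period. The functions below are sketches of that arithmetic, not a Streams API:

```python
def tumbling_window(ts_ms, size_ms):
    """Map an event timestamp to the start of its tumbling window: aligned,
    non-overlapping [start, start + size) intervals."""
    return ts_ms - ts_ms % size_ms

def within_grace(window_start, size_ms, grace_ms, stream_time_ms):
    """A late event can still update its window while stream time has not
    passed window end + grace; after that, the event is dropped as too late."""
    return stream_time_ms < window_start + size_ms + grace_ms

start = tumbling_window(1_000_500, 60_000)  # 1-minute windows
print(start)  # -> 960000
print(within_grace(start, 60_000, 5_000, 1_022_000))  # -> True
```

Hopping and sliding windows differ only in that one timestamp then maps to several overlapping windows instead of exactly one.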
Security
Authentication
- Use SASL/SSL for client authentication
- Support SASL mechanisms: PLAIN, SCRAM, OAUTHBEARER, GSSAPI
- Enable SSL for encryption in transit
- Rotate credentials regularly
Authorization
- Use Kafka ACLs for fine-grained access control
- Grant minimum necessary permissions per principal
- Separate read/write permissions by topic
- Audit access patterns regularly
Monitoring and Observability
Key Metrics
- Producer: record-send-rate, record-error-rate, batch-size-avg
- Consumer: records-consumed-rate, records-lag, commit-latency
- Broker: under-replicated-partitions, request-latency, disk-usage
Lag Monitoring
- Consumer lag = last produced offset - last committed offset
- High lag indicates consumers can't keep up
- Alert on increasing lag trends
- Scale consumers or optimize processing
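The lag formula above, computed per partition and summed for alerting (a sketch over plain offset maps, as you might get them from a metrics or admin API):

```python
def consumer_lag(end_offsets, committed):
    """Per-partition lag = last produced (end) offset - last committed offset.
    A steadily growing total means consumers are falling behind and need
    scaling out or faster processing."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

lag = consumer_lag({0: 1500, 1: 900}, {0: 1400, 1: 900})
print(lag, sum(lag.values()))  # -> {0: 100, 1: 0} 100
```

Alert on the trend, not a fixed threshold: a constant lag of 100 under steady load is usually fine, while lag growing from 0 to 100 and climbing is not.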
Distributed Tracing
- Propagate trace context in message headers
- Use OpenTelemetry for end-to-end tracing
- Correlate producer and consumer spans
- Track message journey through the pipeline
Testing
Unit Testing
- Mock Kafka clients for isolated testing
- Test serialization/deserialization logic
- Verify partitioning logic
- Test error handling paths
Integration Testing
- Use embedded Kafka or Testcontainers
- Test full producer-consumer flows
- Verify exactly-once semantics if used
- Test rebalancing scenarios
Performance Testing
- Load test with production-like message rates
- Test consumer throughput and lag behavior
- Verify broker resource usage under load
- Test failure and recovery scenarios
Common Patterns
Event Sourcing
- Store all state changes as immutable events
- Rebuild state by replaying events
- Use log compaction for snapshots
- Enable time-travel debugging
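Rebuilding state by replay is just a fold over the event log. The sketch below uses a made-up account-balance domain to show the shape; the helper names are assumptions for illustration:

```python
def replay(events, apply, initial_state=None):
    """Rebuild current state by folding every stored event over an initial
    state; with log compaction, replay can start from a per-key snapshot."""
    state = {} if initial_state is None else initial_state
    for event in events:
        state = apply(state, event)
    return state

def apply_account_event(state, event):
    """Illustrative domain: events adjust per-account balances."""
    balance = state.get(event["account"], 0)
    delta = event["amount"] if event["type"] == "deposit" else -event["amount"]
    return {**state, event["account"]: balance + delta}

events = [
    {"type": "deposit", "account": "a1", "amount": 100},
    {"type": "withdraw", "account": "a1", "amount": 30},
]
print(replay(events, apply_account_event))  # -> {'a1': 70}
```

Time-travel debugging falls out of the same function: replay a prefix of the log to see the state as of any past event.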
CQRS (Command Query Responsibility Segregation)
- Separate write (command) and read (query) models
- Use Kafka as the event store
- Build read-optimized projections from events
- Handle eventual consistency appropriately
Saga Pattern
- Coordinate distributed transactions across services
- Each service publishes events for next step
- Implement compensating transactions for rollback
- Use correlation IDs to track saga instances
Change Data Capture (CDC)
- Capture database changes as Kafka events
- Use Debezium or similar CDC tools
- Enable real-time data synchronization
- Build event-driven integrations