Kafka Development

You are an expert in Apache Kafka event streaming and distributed messaging systems. Follow these best practices when building Kafka-based applications.

Core Principles

  • Kafka is a distributed event streaming platform for high-throughput, fault-tolerant messaging
  • Unlike traditional pub/sub, Kafka uses a pull model - consumers pull messages from partitions
  • Design for scalability, durability, and exactly-once semantics where needed
  • Leave NO todos, placeholders, or missing pieces in the implementation

Architecture Overview

Core Components

  • Topics: Categories/feeds for organizing messages
  • Partitions: Ordered, immutable sequences within topics enabling parallelism
  • Producers: Clients that publish messages to topics
  • Consumers: Clients that read messages from topics
  • Consumer Groups: Coordinate consumption across multiple consumers
  • Brokers: Kafka servers that store data and serve clients

Key Concepts

  • Messages are appended to partitions in order
  • Each message has an offset - a unique sequential ID within the partition
  • Consumers maintain their own cursor (offset) and can read streams repeatedly
  • Partitions are distributed across brokers for scalability

Topic Design

Partitioning Strategy

  • Use partition keys to place related events in the same partition
  • Messages with the same key always go to the same partition
  • This ensures ordering for related events
  • Choose keys carefully - uneven distribution causes hot partitions

Partition Count

  • More partitions = more parallelism but more overhead
  • Consider: expected throughput, consumer count, broker resources
  • Start with the number of consumers you expect to run concurrently
  • Partitions can be increased but not decreased

Topic Configuration

主题配置

  • retention.ms: How long to keep messages (default 7 days)
  • retention.bytes: Maximum size per partition
  • cleanup.policy: delete (remove old segments) or compact (keep the latest value per key)
  • min.insync.replicas: Minimum replicas that must acknowledge a write

Producer Best Practices

Reliability Settings

acks=all               # Wait for all replicas to acknowledge
retries=MAX_INT        # Retry on transient failures
enable.idempotence=true # Prevent duplicate messages on retry
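
The settings above can be sketched as a client configuration map. Property names follow Kafka's standard producer configs; the broker address and the idea of passing a plain dict to the client library are assumptions for illustration:

```python
# Reliability-first producer settings as a Kafka client config dict.
# Property names are Kafka's standard producer configs; the bootstrap
# address is an assumption for a local setup.
reliable_producer_config = {
    "bootstrap.servers": "localhost:9092",       # assumed local broker
    "acks": "all",                               # wait for all in-sync replicas
    "retries": 2147483647,                       # Integer.MAX_VALUE: retry transient failures
    "enable.idempotence": True,                  # broker dedupes retried sends
    "max.in.flight.requests.per.connection": 5,  # safe ordering with idempotence on
}
```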

Performance Tuning

  • batch.size: Bytes to accumulate per partition before sending (default 16KB)
  • linger.ms: Wait time for batching (0 = send immediately)
  • buffer.memory: Total memory for buffering unsent messages
  • compression.type: gzip, snappy, lz4, or zstd for bandwidth savings

Error Handling

  • Implement retry logic with exponential backoff
  • Handle retriable vs non-retriable exceptions differently
  • Log and alert on send failures
  • Consider dead letter topics for messages that fail repeatedly
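
A minimal sketch of retry with exponential backoff and jitter, assuming a generic `send` callable; real producer code would retry only on retriable exceptions rather than bare `Exception`:

```python
import random
import time

def send_with_backoff(send, record, max_attempts=5, base_delay=0.1):
    """Retry `send(record)` with exponential backoff plus jitter.

    `send` is any callable that raises on failure. On exhaustion the
    exception propagates so the caller can route the record to a
    dead letter topic.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send(record)
        except Exception:
            if attempt == max_attempts:
                raise  # budget spent: let the caller dead-letter it
            # delay doubles each attempt (0.1s, 0.2s, 0.4s, ...) plus jitter
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.05))
```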

Partitioner

分区器

  • Default: hash of key determines partition (null key = round-robin)
  • Custom partitioners for specific routing needs
  • Ensure even distribution to avoid hot partitions
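
The default-partitioner contract can be illustrated in a few lines. Note this is a stand-in: Kafka's real default hashes keys with murmur2, and since 2.4 it assigns keyless records to sticky batches rather than strict round-robin; crc32 here is only a stable hash for the sketch:

```python
import itertools
import zlib

_rr = itertools.count()  # fallback counter for keyless records

def pick_partition(key, num_partitions):
    """Same key always maps to the same partition; a None key falls
    back to round-robin. crc32 is a stable stand-in for murmur2."""
    if key is None:
        return next(_rr) % num_partitions
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```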

Consumer Best Practices

消费者最佳实践

Offset Management

偏移量管理

  • Consumers track which messages they've processed via offsets
  • auto.offset.reset: earliest (start from the beginning) or latest (only new messages)
  • Commit offsets after successful processing, not before
  • Use enable.auto.commit=false and commit manually as a prerequisite for exactly-once semantics
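
The commit-after-processing rule can be sketched against a generic consumer exposing `poll()`/`commit()` (method names are illustrative, in the style of common Kafka clients):

```python
# Commit-after-processing with auto-commit disabled. If handle() raises
# mid-batch, no offset is committed, so the batch is redelivered on
# restart: duplicates are possible (at-least-once), loss is not.
def consume_loop(consumer, handle, commit_log):
    records = consumer.poll()
    for record in records:
        handle(record)        # may raise: offset is then NOT committed
    if records:
        consumer.commit()     # only after the whole batch succeeded
        commit_log.append(records[-1]["offset"])
```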

Consumer Groups

消费者组

  • Consumers in a group share partitions (each partition is assigned to exactly one consumer)
  • More consumers than partitions = some consumers idle
  • Group rebalancing occurs when consumers join or leave
  • Use group.instance.id for static membership to reduce rebalances

Processing Patterns

处理模式

  • Process messages in order within a partition
  • Handle out-of-order messages across partitions if needed
  • Implement idempotent processing for at-least-once delivery
  • Consider transactional processing for exactly-once
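
A minimal sketch of idempotent processing for at-least-once delivery; in production the seen-set lives in a durable store (e.g. a database unique constraint), not in memory:

```python
# At-least-once delivery means redelivery after failures, so handlers must
# be idempotent: track processed record IDs and skip duplicates.
class IdempotentHandler:
    def __init__(self, process):
        self._process = process
        self._seen = set()  # durable store in real deployments

    def handle(self, record_id, payload):
        if record_id in self._seen:
            return False          # duplicate delivery: no side effects
        self._process(payload)
        self._seen.add(record_id)
        return True
```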

Timeouts and Failures

超时与故障

  • Implement processing timeout to isolate slow events
  • When timeout occurs, set event aside and continue to next message
  • Prioritize overall system performance over processing every single event
  • Use dead letter queues for messages failing all retries
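
One way to sketch the timeout-and-set-aside pattern, using a thread pool to bound processing time. It is a simplification: the worker thread itself is not cancelled on timeout, so real deployments pair this with cooperative cancellation:

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def process_with_timeout(handle, record, timeout_s, set_aside):
    """Run handle(record) under a wall-clock budget. On timeout, park the
    record (e.g. for a retry or dead letter topic) and move on, so one
    slow event cannot stall the whole partition."""
    future = _pool.submit(handle, record)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        set_aside(record)  # park for redrive/DLT, keep the partition moving
        return None
```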

Error Handling and Retry

错误处理与重试

Retry Strategy

重试策略

  • Allow multiple runtime retries per processing attempt
  • Example: 3 runtime retries per redrive, maximum 5 redrives = 15 total retries
  • Runtime retries typically cover 99% of failures
  • After exhausting retries, route to dead letter queue
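
The budget above can be expressed as a decision function. This is one way to count it; the names and the exact counting convention are illustrative:

```python
def next_step(runtime_attempts, redrives, runtime_limit=3, redrive_limit=5):
    """Decide what to do with a failed record: 3 runtime retries per
    redrive x 5 redrives = 15 total retries before the dead letter queue."""
    if runtime_attempts < runtime_limit:
        return "retry"        # in-process retry covers most transient failures
    if redrives < redrive_limit:
        return "redrive"      # re-enqueue; the runtime counter resets
    return "dead-letter"      # budget exhausted: park for inspection
```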

Dead Letter Topics

死信主题

  • Create dedicated DLT for messages that can't be processed
  • Include original topic, partition, offset, and error details
  • Monitor DLT for patterns indicating systemic issues
  • Implement manual or automated retry from DLT
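
A sketch of the DLT envelope; the field names are an illustrative convention, not a Kafka standard:

```python
import time

def to_dead_letter(record, error):
    """Wrap a poison message with the context needed for triage and
    replay: where it came from, why it failed, and the raw payload."""
    return {
        "original_topic": record["topic"],
        "original_partition": record["partition"],
        "original_offset": record["offset"],
        "error_class": type(error).__name__,
        "error_message": str(error),
        "failed_at": int(time.time() * 1000),  # epoch millis
        "payload": record["value"],            # keep raw data for replay
    }
```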

Schema Management

Schema 管理

Schema Registry

Schema Registry(Schema注册表)

  • Use Confluent Schema Registry for schema management
  • Producers validate data against registered schemas during serialization
  • Schema mismatches throw exceptions, preventing malformed data
  • Provides common reference for producers and consumers

Schema Evolution

Schema 演进

  • Design schemas for forward and backward compatibility
  • Add optional fields with defaults for backward compatibility
  • Avoid removing or renaming fields
  • Use schema versioning and migration strategies
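
A backward-compatible evolution step in Avro terms, with made-up record and field names: v2 adds an optional field with a default, so records written with v1 still deserialize under v2:

```python
# Avro schemas as plain dicts. The record name and fields are invented
# for illustration.
ORDER_V1 = {
    "type": "record", "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

ORDER_V2 = {
    "type": "record", "name": "Order",
    "fields": ORDER_V1["fields"] + [
        # union with null plus a default makes the field optional, so old
        # data lacking it still deserializes (the backward-compat rule)
        {"name": "coupon_code", "type": ["null", "string"], "default": None},
    ],
}
```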

Kafka Streams

State Management

状态管理

  • Use log compaction to retain the latest value for each key
  • Periodically purge old data from state stores
  • Monitor state store size and access patterns
  • Use appropriate storage backends for your scale

Windowing Operations

窗口操作

  • Handle out-of-order events and skewed timestamps
  • Use appropriate time extraction and watermarking techniques
  • Configure grace periods for late-arriving data
  • Choose window types based on use case (tumbling, hopping, sliding, session)
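
Tumbling-window assignment and the grace-period check reduce to simple arithmetic on event timestamps. This is a sketch of the concepts, not the Kafka Streams API:

```python
def tumbling_window(timestamp_ms, size_ms):
    """Assign an event time to its tumbling window [start, end).
    Tumbling windows are fixed-size and non-overlapping, so assignment
    is integer division on the event timestamp."""
    start = (timestamp_ms // size_ms) * size_ms
    return (start, start + size_ms)

def is_late(timestamp_ms, watermark_ms, grace_ms):
    """A late event is dropped only once it falls behind the watermark
    by more than the configured grace period."""
    return timestamp_ms < watermark_ms - grace_ms
```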

Security

安全性

Authentication

身份验证

  • Use SASL/SSL for client authentication
  • Support SASL mechanisms: PLAIN, SCRAM, OAUTHBEARER, GSSAPI
  • Enable SSL for encryption in transit
  • Rotate credentials regularly

Authorization

授权

  • Use Kafka ACLs for fine-grained access control
  • Grant minimum necessary permissions per principal
  • Separate read/write permissions by topic
  • Audit access patterns regularly

Monitoring and Observability

监控与可观测性

Key Metrics

关键指标

  • Producer: record-send-rate, record-error-rate, batch-size-avg
  • Consumer: records-consumed-rate, records-lag, commit-latency
  • Broker: under-replicated-partitions, request-latency, disk-usage

Lag Monitoring

滞后量监控

  • Consumer lag = last produced offset - last committed offset
  • High lag indicates consumers can't keep up
  • Alert on increasing lag trends
  • Scale consumers or optimize processing
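
The lag formula above, computed per partition; the offset maps are assumed inputs (in practice fetched from the cluster, e.g. via an admin client):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = log-end offset minus last committed offset.
    Keys are (topic, partition) pairs; a partition with no commit yet
    counts its full log as lag."""
    return {
        tp: end - committed_offsets.get(tp, 0)
        for tp, end in end_offsets.items()
    }
```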

Distributed Tracing

分布式追踪

  • Propagate trace context in message headers
  • Use OpenTelemetry for end-to-end tracing
  • Correlate producer and consumer spans
  • Track message journey through the pipeline

Testing

测试

Unit Testing

单元测试

  • Mock Kafka clients for isolated testing
  • Test serialization/deserialization logic
  • Verify partitioning logic
  • Test error handling paths

Integration Testing

集成测试

  • Use embedded Kafka or Testcontainers
  • Test full producer-consumer flows
  • Verify exactly-once semantics if used
  • Test rebalancing scenarios

Performance Testing

性能测试

  • Load test with production-like message rates
  • Test consumer throughput and lag behavior
  • Verify broker resource usage under load
  • Test failure and recovery scenarios

Common Patterns

常见模式

Event Sourcing

事件溯源

  • Store all state changes as immutable events
  • Rebuild state by replaying events
  • Use log compaction for snapshots
  • Enable time-travel debugging
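
Event sourcing in miniature, with made-up event shapes: state is never stored directly, only derived by folding the immutable log, and replaying a prefix gives the state as of any point in time:

```python
# Replay an event log into current state. Event types and fields are
# invented for illustration.
def replay(events, upto=None):
    balance = 0
    for event in events[:upto]:   # upto=None replays the whole log
        if event["type"] == "deposited":
            balance += event["amount"]
        elif event["type"] == "withdrawn":
            balance -= event["amount"]
    return balance
```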

CQRS (Command Query Responsibility Segregation)

CQRS(命令查询职责分离)

  • Separate write (command) and read (query) models
  • Use Kafka as the event store
  • Build read-optimized projections from events
  • Handle eventual consistency appropriately

Saga Pattern

Saga模式

  • Coordinate distributed transactions across services
  • Each service publishes events for next step
  • Implement compensating transactions for rollback
  • Use correlation IDs to track saga instances

Change Data Capture (CDC)

变更数据捕获(CDC)

  • Capture database changes as Kafka events
  • Use Debezium or similar CDC tools
  • Enable real-time data synchronization
  • Build event-driven integrations