altinity-expert-clickhouse-kafka


Diagnostics

Run all queries from the file checks.sql and analyze the results.

Interpreting Results

Consumer Health

Check if consumers are stuck by comparing exception time vs activity times:
  • last_exception_time >= last_poll_time OR last_exception_time >= last_commit_time → consumer stuck on an error, not progressing
  • Otherwise → consumer healthy
The exceptions column is a tuple of arrays with matching indices: exceptions.time[-1] and exceptions.text[-1] give the most recent error.
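The check above can be sketched as a single query against system.kafka_consumers (available since ClickHouse 23.8). Column availability varies by version, so treat this as a template rather than a drop-in replacement for checks.sql; if your build exposes only the exceptions arrays, derive last_exception_time as exceptions.time[-1].

```sql
-- Flag consumers whose most recent event is an exception rather than
-- a successful poll or commit.
SELECT
    database,
    `table`,
    last_poll_time,
    last_commit_time,
    last_exception_time,
    exceptions.time[-1] AS last_error_time,
    exceptions.text[-1] AS last_error_text,
    (last_exception_time >= last_poll_time)
        OR (last_exception_time >= last_commit_time) AS is_stuck
FROM system.kafka_consumers
ORDER BY is_stuck DESC, last_exception_time DESC;
```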

Thread Pool Capacity


  • kafka_consumers > mb_pool_size
    → thread starvation — consumers waiting for available threads
  • Fix: increase
    background_message_broker_schedule_pool_size
    (default: 16)
  • Sizing: total Kafka + RabbitMQ/NATS consumers + 25% buffer
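The fix is a server-level setting in config.xml; a minimal sketch, where the value 40 is illustrative (sized for, say, 24 Kafka + 8 RabbitMQ consumers plus the 25% buffer):

```xml
<!-- config.xml (or a file under config.d/); typically requires a server restart -->
<clickhouse>
    <!-- (24 + 8) * 1.25 = 40; round up to cover all broker consumers -->
    <background_message_broker_schedule_pool_size>40</background_message_broker_schedule_pool_size>
</clickhouse>
```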

Slow Materialized Views (Poll Interval Risk)


  • MV avg duration > 30s → consumer may exceed max.poll.interval.ms and get kicked from the group
  • MV executions with error status → likely consumer rebalances (consumer kicked, MV interrupted mid-batch)
  • Most common root cause for slow MVs: multiple JSONExtract calls re-parsing the same JSON blob
  • Fix: rewrite to one-pass JSONExtract(json, 'Tuple(...)') AS parsed + tupleElement(); see troubleshooting.md
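The one-pass rewrite looks like this sketch; the field names, types, and the kafka_events_queue table are illustrative, so match them to your actual payload:

```sql
-- Before: each JSONExtractString/JSONExtractUInt call re-parses the whole blob.
-- After: parse once into a named Tuple, then take cheap tuple elements.
SELECT
    JSONExtract(json, 'Tuple(user_id UInt64, event String, ts DateTime)') AS parsed,
    tupleElement(parsed, 'user_id') AS user_id,
    tupleElement(parsed, 'event')   AS event,
    tupleElement(parsed, 'ts')      AS ts
FROM kafka_events_queue;  -- hypothetical Kafka engine table
```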

Pool Utilization Trends (12h)


  • Sustained high values near pool size → capacity pressure
  • Spikes correlating with lag → temporary overload
  • Flat zero → Kafka consumers may not be active

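A minimal sketch of the 12-hour trend query, assuming system.metric_log is enabled (it is by default):

```sql
-- Peak pool occupancy per 10-minute bucket over the last 12 hours;
-- compare busy_threads against background_message_broker_schedule_pool_size.
SELECT
    toStartOfTenMinutes(event_time) AS bucket,
    max(CurrentMetric_BackgroundMessageBrokerSchedulePoolTask) AS busy_threads
FROM system.metric_log
WHERE event_time > now() - INTERVAL 12 HOUR
GROUP BY bucket
ORDER BY bucket;
```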

Advanced Diagnostics


For deeper investigation, run queries from advanced_checks.sql:
  • Consumer exception drill-down — filter to a specific problematic Kafka table
  • Consumption speed measurement — snapshot-based rate calculation
  • Topic lag via rdkafka_stat — total lag per table and per-partition breakdown
  • Broker connection health — connection state, errors, disconnects
Important: rdkafka_stat is not enabled by default in ClickHouse. It requires <statistics_interval_ms> in the Kafka engine settings. See advanced_checks.sql for setup instructions.
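Enabling it looks roughly like this sketch: statistics_interval_ms is forwarded to librdkafka via the <kafka> section of the server config, and the 3000 ms interval below is an illustrative value.

```xml
<!-- config.xml sketch: enable librdkafka statistics so the
     rdkafka_stat column in system.kafka_consumers is populated -->
<clickhouse>
    <kafka>
        <statistics_interval_ms>3000</statistics_interval_ms>
    </kafka>
</clickhouse>
```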

Common Issues


For troubleshooting common errors and configuration guidance, see troubleshooting.md:
  • Topic authorization / ACL errors
  • Poll interval exceeded (slow MV / JSON parsing optimization)
  • Thread pool starvation
  • Parsing errors / dead letter queue
  • Data loss with multiple materialized views
  • Offset rewind / replay
  • Parallel consumption tuning


Cross-Module Triggers


| Finding | Load Module | Reason |
|---|---|---|
| Slow MV inserts | altinity-expert-clickhouse-ingestion | Insert pipeline analysis |
| High merge memory | altinity-expert-clickhouse-merges | Merge patterns |
| Query-level issues | altinity-expert-clickhouse-reporting | Query optimization |
| Schema concerns | altinity-expert-clickhouse-schema | Table design |

Settings Reference


| Setting | Scope | Notes |
|---|---|---|
| background_message_broker_schedule_pool_size | Server | Thread pool for Kafka/RabbitMQ/NATS consumers (default: 16) |
| kafka_num_consumers | Table | Parallel consumers per table (limited by cores) |
| kafka_thread_per_consumer | Table | Required for parallel inserts (= 1) |
| kafka_handle_error_mode | Table | stream (21.6+) or dead_letter (25.8+) |
| max_poll_interval_ms | librdkafka | Max time between polls before consumer is kicked (default: 300s) |
| statistics_interval_ms | librdkafka | Enable rdkafka_stat collection (disabled by default) |
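As a sketch tying the table-level settings together (broker, topic, group, and table names are placeholders):

```sql
CREATE TABLE events_queue
(
    json String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse-events',
    kafka_format = 'JSONAsString',
    kafka_num_consumers = 4,            -- parallel consumers, keep <= CPU cores
    kafka_thread_per_consumer = 1,      -- required for truly parallel inserts
    kafka_handle_error_mode = 'stream'; -- expose bad rows via _error/_raw_message
```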