altinity-expert-clickhouse-kafka


Diagnostics

Run all queries from the file checks.sql and analyze the results.

Interpreting Results

Consumer Health

Check if consumers are stuck by comparing exception time vs activity times:
  • last_exception_time >= last_poll_time OR last_exception_time >= last_commit_time → consumer stuck on an error, not progressing
  • Otherwise → consumer healthy
The exceptions column is a tuple of arrays with matching indices: exceptions.time[-1] and exceptions.text[-1] give the most recent error.
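The check above can be sketched as a single query against system.kafka_consumers (available since ClickHouse 23.8). Column availability varies by version, so treat this as a template rather than a drop-in replacement for checks.sql; if your build exposes only the exceptions arrays, derive last_exception_time as exceptions.time[-1].

```sql
-- Flag consumers whose most recent event is an exception rather than
-- a successful poll or commit.
SELECT
    database,
    `table`,
    last_poll_time,
    last_commit_time,
    last_exception_time,
    exceptions.time[-1] AS last_error_time,
    exceptions.text[-1] AS last_error_text,
    (last_exception_time >= last_poll_time)
        OR (last_exception_time >= last_commit_time) AS is_stuck
FROM system.kafka_consumers
ORDER BY is_stuck DESC, last_exception_time DESC;
```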

Thread Pool Capacity


  • kafka_consumers > mb_pool_size
    → thread starvation — consumers waiting for available threads
  • Fix: increase
    background_message_broker_schedule_pool_size
    (default: 16)
  • Sizing: total Kafka + RabbitMQ/NATS consumers + 25% buffer
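The fix is a server-level setting in config.xml; a minimal sketch, where the value 40 is illustrative (sized for, say, 24 Kafka + 8 RabbitMQ consumers plus the 25% buffer):

```xml
<!-- config.xml (or a file under config.d/); typically requires a server restart -->
<clickhouse>
    <!-- (24 + 8) * 1.25 = 40; round up to cover all broker consumers -->
    <background_message_broker_schedule_pool_size>40</background_message_broker_schedule_pool_size>
</clickhouse>
```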

Slow Materialized Views (Poll Interval Risk)


  • MV avg duration > 30s → consumer may exceed max.poll.interval.ms and get kicked from the group
  • MV executions with error status → likely consumer rebalances (consumer kicked, MV interrupted mid-batch)
  • Most common root cause for slow MVs: multiple JSONExtract calls re-parsing the same JSON blob
  • Fix: rewrite to one-pass JSONExtract(json, 'Tuple(...)') AS parsed + tupleElement(); see troubleshooting.md
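The one-pass rewrite looks like this sketch; the field names, types, and the kafka_events_queue table are illustrative, so match them to your actual payload:

```sql
-- Before: each JSONExtractString/JSONExtractUInt call re-parses the whole blob.
-- After: parse once into a named Tuple, then take cheap tuple elements.
SELECT
    JSONExtract(json, 'Tuple(user_id UInt64, event String, ts DateTime)') AS parsed,
    tupleElement(parsed, 'user_id') AS user_id,
    tupleElement(parsed, 'event')   AS event,
    tupleElement(parsed, 'ts')      AS ts
FROM kafka_events_queue;  -- hypothetical Kafka engine table
```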

Pool Utilization Trends (12h)


  • Sustained high values near pool size → capacity pressure
  • Spikes correlating with lag → temporary overload
  • Flat zero → Kafka consumers may not be active

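A minimal sketch of the 12-hour trend query, assuming system.metric_log is enabled (it is by default):

```sql
-- Peak pool occupancy per 10-minute bucket over the last 12 hours;
-- compare busy_threads against background_message_broker_schedule_pool_size.
SELECT
    toStartOfTenMinutes(event_time) AS bucket,
    max(CurrentMetric_BackgroundMessageBrokerSchedulePoolTask) AS busy_threads
FROM system.metric_log
WHERE event_time > now() - INTERVAL 12 HOUR
GROUP BY bucket
ORDER BY bucket;
```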

Advanced Diagnostics


For deeper investigation, run queries from advanced_checks.sql:
  • Consumer exception drill-down — filter to a specific problematic Kafka table
  • Consumption speed measurement — snapshot-based rate calculation
  • Topic lag via rdkafka_stat — total lag per table and per-partition breakdown
  • Broker connection health — connection state, errors, disconnects
Important: rdkafka_stat is not enabled by default in ClickHouse. It requires <statistics_interval_ms> in the Kafka engine settings. See advanced_checks.sql for setup instructions.
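Enabling it looks roughly like this sketch: statistics_interval_ms is forwarded to librdkafka via the <kafka> section of the server config, and the 3000 ms interval below is an illustrative value.

```xml
<!-- config.xml sketch: enable librdkafka statistics so the
     rdkafka_stat column in system.kafka_consumers is populated -->
<clickhouse>
    <kafka>
        <statistics_interval_ms>3000</statistics_interval_ms>
    </kafka>
</clickhouse>
```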

Common Issues


For troubleshooting common errors and configuration guidance, see troubleshooting.md:
  • Topic authorization / ACL errors
  • Poll interval exceeded (slow MV / JSON parsing optimization)
  • Thread pool starvation
  • Parsing errors / dead letter queue
  • Data loss with multiple materialized views
  • Offset rewind / replay
  • Parallel consumption tuning


Cross-Module Triggers


| Finding | Load Module | Reason |
|---|---|---|
| Slow MV inserts | altinity-expert-clickhouse-ingestion | Insert pipeline analysis |
| High merge memory | altinity-expert-clickhouse-merges | Merge patterns |
| Query-level issues | altinity-expert-clickhouse-reporting | Query optimization |
| Schema concerns | altinity-expert-clickhouse-schema | Table design |

Settings Reference


| Setting | Scope | Notes |
|---|---|---|
| background_message_broker_schedule_pool_size | Server | Thread pool for Kafka/RabbitMQ/NATS consumers (default: 16) |
| kafka_num_consumers | Table | Parallel consumers per table (limited by cores) |
| kafka_thread_per_consumer | Table | Required for parallel inserts (= 1) |
| kafka_handle_error_mode | Table | stream (21.6+) or dead_letter (25.8+) |
| max_poll_interval_ms | librdkafka | Max time between polls before consumer is kicked (default: 300s) |
| statistics_interval_ms | librdkafka | Enable rdkafka_stat collection (disabled by default) |
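As a sketch tying the table-level settings together (broker, topic, group, and table names are placeholders):

```sql
CREATE TABLE events_queue
(
    json String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse-events',
    kafka_format = 'JSONAsString',
    kafka_num_consumers = 4,            -- parallel consumers, keep <= CPU cores
    kafka_thread_per_consumer = 1,      -- required for truly parallel inserts
    kafka_handle_error_mode = 'stream'; -- expose bad rows via _error/_raw_message
```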