altinity-expert-clickhouse-kafka
Diagnostics
Run all queries from the file checks.sql and analyze the results.
Interpreting Results
Consumer Health
Check if consumers are stuck by comparing exception time vs activity times:
- `last_exception_time >= last_poll_time` OR `last_exception_time >= last_commit_time` → consumer stuck on error, not progressing
- Otherwise → consumer healthy

The `exceptions` column is a tuple of arrays with matching indices — `exceptions.time[-1]` and `exceptions.text[-1]` give the most recent error.
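The check above can be sketched as a query over `system.kafka_consumers`. This is a minimal sketch (column names follow recent ClickHouse releases); checks.sql remains the authoritative version:

```sql
-- Sketch: flag consumers whose last exception is newer than their last
-- poll or commit, i.e. consumers that are erroring without progressing.
SELECT
    database,
    table,
    exceptions.time[-1] AS last_exception_time,
    exceptions.text[-1] AS last_exception,
    last_poll_time,
    last_commit_time,
    (last_exception_time >= last_poll_time)
        OR (last_exception_time >= last_commit_time) AS looks_stuck
FROM system.kafka_consumers
ORDER BY looks_stuck DESC;
```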
Thread Pool Capacity
- `kafka_consumers > mb_pool_size` → thread starvation — consumers waiting for available threads
- Fix: increase `background_message_broker_schedule_pool_size` (default: 16)
- Sizing: total Kafka + RabbitMQ/NATS consumers + 25% buffer
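A quick way to compare the two numbers is a sketch like the following; it assumes `system.server_settings` is available (values are stored as strings, hence the cast):

```sql
-- Sketch: compare live Kafka consumers to the message-broker pool size.
SELECT
    (SELECT count() FROM system.kafka_consumers) AS kafka_consumers,
    (SELECT toUInt32(value)
     FROM system.server_settings
     WHERE name = 'background_message_broker_schedule_pool_size') AS mb_pool_size,
    kafka_consumers > mb_pool_size AS thread_starvation
```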
Slow Materialized Views (Poll Interval Risk)
- MV avg duration > 30s → consumer may exceed `max.poll.interval.ms` and get kicked from the group
- MV executions with error status → likely consumer rebalances (consumer kicked, MV interrupted mid-batch)
- Most common root cause for slow MVs: multiple `JSONExtract` calls re-parsing the same JSON blob
- Fix: rewrite to one-pass `JSONExtract(json, 'Tuple(...)') AS parsed` + `tupleElement()` — see troubleshooting.md
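The rewrite looks roughly like this; the column and field names (`payload`, `user_id`, `event_ts`) are illustrative, not from the actual schema:

```sql
-- Slow: every JSONExtract* call re-parses the full JSON blob.
SELECT
    JSONExtractString(payload, 'user_id') AS user_id,
    JSONExtractUInt(payload, 'event_ts')  AS event_ts
FROM kafka_queue;

-- Fast: parse once into a typed tuple, then read fields with tupleElement().
SELECT
    JSONExtract(payload, 'Tuple(user_id String, event_ts UInt64)') AS parsed,
    tupleElement(parsed, 'user_id')  AS user_id,
    tupleElement(parsed, 'event_ts') AS event_ts
FROM kafka_queue;
```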
Pool Utilization Trends (12h)
- Sustained high values near pool size → capacity pressure
- Spikes correlating with lag → temporary overload
- Flat zero → Kafka consumers may not be active
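One way to see the trend is a sketch over `system.metric_log` (assuming the `metric_log` table is enabled on the server):

```sql
-- Sketch: peak busy threads in the message-broker pool per 10-minute
-- bucket over the last 12 hours; compare peaks to the pool size.
SELECT
    toStartOfTenMinutes(event_time) AS bucket,
    max(CurrentMetric_BackgroundMessageBrokerSchedulePoolTask) AS busy_pool_threads
FROM system.metric_log
WHERE event_time >= now() - INTERVAL 12 HOUR
GROUP BY bucket
ORDER BY bucket;
```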
Advanced Diagnostics
For deeper investigation, run queries from advanced_checks.sql:
- Consumer exception drill-down — filter to a specific problematic Kafka table
- Consumption speed measurement — snapshot-based rate calculation
- Topic lag via rdkafka_stat — total lag per table and per-partition breakdown
- Broker connection health — connection state, errors, disconnects
Important: `rdkafka_stat` is not enabled by default in ClickHouse. It requires `<statistics_interval_ms>` in the Kafka engine settings. See advanced_checks.sql for setup instructions.
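Before running the lag queries, it can be worth a quick sketch like this to confirm `rdkafka_stat` is actually being populated:

```sql
-- Sketch: an empty rdkafka_stat means statistics collection is not enabled.
SELECT database, table, rdkafka_stat != '' AS stats_enabled
FROM system.kafka_consumers;
```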
Common Issues
For troubleshooting common errors and configuration guidance, see troubleshooting.md:
- Topic authorization / ACL errors
- Poll interval exceeded (slow MV / JSON parsing optimization)
- Thread pool starvation
- Parsing errors / dead letter queue
- Data loss with multiple materialized views
- Offset rewind / replay
- Parallel consumption tuning
Cross-Module Triggers
| Finding | Load Module | Reason |
|---|---|---|
| Slow MV inserts | | Insert pipeline analysis |
| High merge memory | | Merge patterns |
| Query-level issues | | Query optimization |
| Schema concerns | | Table design |
Settings Reference
| Setting | Scope | Notes |
|---|---|---|
| `background_message_broker_schedule_pool_size` | Server | Thread pool for Kafka/RabbitMQ/NATS consumers (default: 16) |
| `kafka_num_consumers` | Table | Parallel consumers per table (limited by cores) |
| `kafka_thread_per_consumer` | Table | Required for parallel inserts (set to 1) |
| `max.poll.interval.ms` | librdkafka | Max time between polls before consumer is kicked (default: 300s) |
| `statistics.interval.ms` | librdkafka | Enable rdkafka_stat collection (disabled by default) |
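The table-level settings go in the Kafka engine DDL. A sketch with a hypothetical broker, topic, and schema (the two parallel-consumption settings are the ones that matter here):

```sql
-- Illustrative Kafka engine table; names and values are placeholders.
CREATE TABLE events_queue
(
    payload String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'broker:9092',
    kafka_topic_list = 'events',
    kafka_group_name = 'clickhouse-events',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 4,        -- parallel consumers for this table
    kafka_thread_per_consumer = 1;  -- one thread per consumer, enabling parallel inserts
```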