clickhouse-operations
ClickHouse Operations & Production Management
When to Use This Skill
Use when asked to:
- Diagnose ClickHouse performance problems ("query is slow", "out of memory")
- Set up production monitoring and alerting
- Scale ClickHouse vertically (more resources) or horizontally (clustering)
- Implement backup and disaster recovery procedures
- Troubleshoot replication lag, disk space, or resource issues
- Plan capacity for growing data volumes
Do NOT use when:
- Optimizing specific queries (use clickhouse-query-optimization instead)
- Building materialized views (use clickhouse-materialized-views instead)
- Initial ClickHouse installation (this is for operations, not setup)
Table of Contents
- Purpose
- Quick Start
- Key Operations
- Step 1: Set Up Essential Monitoring
- Step 2: Diagnose Issues Using System Tables
- Step 3: Implement Vertical Scaling (Single Server)
- Step 4: Implement Horizontal Scaling (Cluster)
- Step 5: Set Up Backup & Disaster Recovery
- Step 6: Tune Performance at Multiple Levels
- Step 7: Troubleshoot Common Issues
- Step 8: Manage Resources Effectively
- Step 9: Production Deployment Checklist
- Examples
- References & Resources
- Complete Query Reference - Diagnostic and monitoring queries
- Extended Troubleshooting Guide - Root cause analysis and solutions
- Scaling Case Studies - Real-world scaling decisions
- Monitoring Setup Examples - Production-ready Prometheus/Grafana configs
- Requirements
Purpose
This skill provides comprehensive operational guidance for managing ClickHouse in production environments. It covers monitoring strategies, system diagnostics, scaling approaches, backup procedures, performance tuning, troubleshooting, and resource management to help DevOps and SRE teams maintain reliable, performant ClickHouse deployments.
Quick Start
Monitor current cluster health in 30 seconds:
```bash
# SSH to the ClickHouse server, then run:
clickhouse-client --multiquery --query "
-- Check disk space
SELECT
    formatReadableSize(total_space) as total_disk,
    formatReadableSize(free_space) as free_disk,
    round(free_space / total_space * 100, 2) as free_percent
FROM system.disks;
-- Check running queries
SELECT query_id, user, elapsed, memory_usage / 1024 / 1024 as memory_mb
FROM system.processes
LIMIT 5;
-- Check recent errors
SELECT name, value as count, last_error_time
FROM system.errors
WHERE value > 0
ORDER BY last_error_time DESC
LIMIT 5;
"
```
Key Operations
This section walks through the 9 essential steps for managing ClickHouse in production.
Step 1: Set Up Essential Monitoring
ClickHouse requires monitoring of 9 critical metrics. Use these queries with your monitoring system (Prometheus, Grafana, Datadog, etc.):
Query Latency & Throughput:
```sql
-- p95 latency (use 0.99 for p99)
SELECT quantile(0.95)(query_duration_ms) as p95_latency_ms
FROM system.query_log
WHERE event_date >= today()
  AND type = 'QueryFinish'
  AND query NOT LIKE '%system.%';

-- Queries per second (rough estimate from currently running queries)
SELECT COUNT() / max(CAST(elapsed as Float64)) as qps
FROM system.processes
WHERE query NOT LIKE '%system.%';
```
Insert Rate & Memory Usage:
```sql
-- Insert throughput (rows/second); query_log records written_rows per INSERT
SELECT
    tables[1] as table,
    SUM(written_rows) / SUM(CAST(query_duration_ms as Float64)) * 1000 as rows_per_second
FROM system.query_log
WHERE query LIKE 'INSERT%'
  AND type = 'QueryFinish'
  AND event_date >= today()
GROUP BY table;

-- Memory usage
SELECT
    formatReadableSize(value) as current_memory,
    formatReadableSize(80000000000) as target_limit -- Adjust to 80% of your RAM
FROM system.metrics
WHERE metric = 'MemoryTracking';
```
Disk Usage & Background Operations:
```sql
-- Disk utilization
SELECT
    formatReadableSize(free_space) as free,
    formatReadableSize(total_space) as total,
    round(free_space / total_space * 100, 2) as free_percent
FROM system.disks;

-- Merge pressure (many active parts = slow inserts and merges)
SELECT database, table, COUNT() as parts_count
FROM system.parts
WHERE active = 1
GROUP BY database, table
HAVING parts_count > 1000
ORDER BY parts_count DESC;
```
Alert Thresholds:
- Query latency p95 > 5 seconds
- Insert rate drops > 20% without load change
- Free disk < 20% of total
- Active parts > 1000 per table
- Memory usage > 80% of limit
- Replication lag > 60 seconds
- Failed query rate > 1%
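The thresholds above can be wired into any alerting pipeline. A minimal sketch of the threshold logic in Python — the metric names and sample values here are illustrative, not part of ClickHouse:

```python
# Evaluate the Step 1 alert thresholds against sampled metric values.
# Metric names and the sample snapshot are illustrative assumptions.

THRESHOLDS = {
    "p95_latency_ms": lambda v: v > 5000,         # p95 latency > 5 s
    "free_disk_percent": lambda v: v < 20,        # free disk < 20%
    "active_parts_per_table": lambda v: v > 1000,
    "memory_used_percent": lambda v: v > 80,
    "replication_lag_s": lambda v: v > 60,
    "failed_query_percent": lambda v: v > 1,
}

def evaluate_alerts(metrics: dict) -> list:
    """Return the names of metrics that breach their alert threshold."""
    return [name for name, breached in THRESHOLDS.items()
            if name in metrics and breached(metrics[name])]

sample = {
    "p95_latency_ms": 7200,       # breached
    "free_disk_percent": 35.0,    # ok
    "active_parts_per_table": 240,
    "memory_used_percent": 91,    # breached
    "replication_lag_s": 12,
    "failed_query_percent": 0.3,
}
alerts = evaluate_alerts(sample)
print(alerts)  # ['p95_latency_ms', 'memory_used_percent']
```

Feeding real values in is a matter of running the SQL above on a schedule and mapping the result columns to these names.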
Step 2: Diagnose Issues Using System Tables
When problems occur, use these targeted queries to pinpoint root causes:
Find Slow Queries:
```sql
-- Slowest queries (last 24 hours)
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) as bytes_read,
    formatReadableSize(memory_usage) as peak_memory,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date >= today() - 1
  AND query NOT LIKE '%system.%'
ORDER BY query_duration_ms DESC
LIMIT 20;
```
Identify Problematic Queries:
```sql
-- Most resource-intensive queries
SELECT
    query,
    COUNT() as exec_count,
    AVG(query_duration_ms) as avg_duration_ms,
    MAX(memory_usage) / 1024 / 1024 as peak_memory_mb,
    SUM(read_bytes) / 1024 / 1024 / 1024 as total_gb_read
FROM system.query_log
WHERE event_date >= today() - 7
  AND type = 'QueryFinish'
GROUP BY query
ORDER BY total_gb_read DESC
LIMIT 20;
```
Track Failed Queries:
```sql
-- Error breakdown
SELECT
    exception,
    COUNT() as count,
    MAX(event_time) as last_error
FROM system.query_log
WHERE type = 'ExceptionWhileProcessing'
  AND event_date >= today() - 1
GROUP BY exception
ORDER BY count DESC;
```
Monitor Table Growth:
```sql
-- Largest tables and compression ratios
SELECT
    database,
    table,
    SUM(rows) as total_rows,
    formatReadableSize(SUM(bytes_on_disk)) as disk_size,
    formatReadableSize(SUM(data_compressed_bytes)) as compressed,
    round(SUM(data_uncompressed_bytes) / SUM(data_compressed_bytes), 1) as compression_ratio
FROM system.parts
WHERE active = 1
  AND database NOT IN ('system', 'information_schema')
GROUP BY database, table
ORDER BY SUM(bytes_on_disk) DESC;
```
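Note that `GROUP BY query` in the resource report treats every distinct literal as a separate query; normalizing literals first groups repeated query *shapes* together (ClickHouse offers `normalizeQuery()` and the `normalized_query_hash` column for exactly this). A simplified stand-in for that normalization, to show the idea:

```python
import re

def fingerprint(sql: str) -> str:
    """Collapse literals so structurally identical queries compare equal.
    A rough sketch; ClickHouse's normalizeQuery() is the real thing."""
    s = sql.strip()
    s = re.sub(r"'(?:[^'\\]|\\.)*'", "?", s)   # string literals -> ?
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)   # numeric literals -> ?
    s = re.sub(r"\s+", " ", s)                 # normalize whitespace
    return s

a = fingerprint("SELECT * FROM events WHERE user_id = 123 AND ts > '2024-01-01'")
b = fingerprint("SELECT * FROM events WHERE user_id = 456 AND ts > '2024-02-02'")
print(a == b)  # True: same query shape
```

In practice you would simply `GROUP BY normalized_query_hash` in the SQL above rather than post-process in Python.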
Step 3: Implement Vertical Scaling (Single Server)
Scale up individual ClickHouse servers by increasing CPU, memory, or disk resources:
CPU Optimization:
```sql
-- Check current thread setting
SELECT value FROM system.settings WHERE name = 'max_threads';

-- Optimize for your hardware (session-level)
SET max_threads = 16; -- Set to CPU core count (or slightly less)

-- Permanent setting: add to a profile in /etc/clickhouse-server/users.xml
-- <profiles><default><max_threads>16</max_threads></default></profiles>
```
Memory Management:
```sql
-- Current server memory limit (system.server_settings in newer versions)
SELECT value FROM system.server_settings WHERE name = 'max_server_memory_usage';

-- Set per-query limit (10 GB example)
SET max_memory_usage = 10000000000;

-- Production config (/etc/clickhouse-server/config.xml):
-- <max_server_memory_usage>64000000000</max_server_memory_usage> <!-- 64 GB = 80% of 80 GB RAM -->
-- <max_concurrent_queries>100</max_concurrent_queries>

-- Monitor current usage
SELECT formatReadableSize(value) as memory_usage
FROM system.metrics
WHERE metric = 'MemoryTracking';
```
Disk I/O Tuning:
```xml
<!-- In /etc/clickhouse-server/config.xml -->
<merge_tree>
    <!-- Enable direct I/O for large table scans -->
    <min_bytes_to_use_direct_io>10485760</min_bytes_to_use_direct_io> <!-- 10 MB -->
</merge_tree>
<!-- Filesystem recommendations: use NVMe SSD for best performance -->
<!-- Separate data and logs onto different disks if possible -->
```
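The "limit = ~80% of RAM" sizing rule above is simple enough to script when provisioning hosts of different sizes. A small helper (illustrative only, not a ClickHouse API):

```python
def max_server_memory_usage(total_ram_bytes: int, fraction: float = 0.8) -> int:
    """Suggested <max_server_memory_usage> value: leave ~20% of RAM
    for the OS and page cache. The 0.8 fraction follows the rule above."""
    return int(total_ram_bytes * fraction)

# An 80 GB host yields the 64 GB limit used in the config example above.
limit = max_server_memory_usage(80_000_000_000)
print(limit)  # 64000000000
```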
Step 4: Implement Horizontal Scaling (Cluster)
For massive scale, distribute data across multiple nodes:
Define Cluster Configuration:
```xml
<!-- Add to /etc/clickhouse-server/config.xml on all nodes -->
<remote_servers>
    <production_cluster>
        <shard>
            <replica>
                <host>node1.example.com</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>node2.example.com</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <replica>
                <host>node3.example.com</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>node4.example.com</host>
                <port>9000</port>
            </replica>
        </shard>
    </production_cluster>
</remote_servers>
```
Create Distributed Tables:
```sql
-- Step 1: Create a replicated local table on each shard
-- (the {shard} and {replica} macros must be defined in each node's config)
CREATE TABLE events_local ON CLUSTER 'production_cluster' (
    user_id UInt32,
    event_type String,
    timestamp DateTime,
    properties String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
ORDER BY (user_id, timestamp)
PARTITION BY toDate(timestamp);

-- Step 2: Create the distributed table (insert point)
CREATE TABLE events ON CLUSTER 'production_cluster' AS events_local
ENGINE = Distributed('production_cluster', default, events_local, rand());

-- Step 3: Insert through the distributed table
INSERT INTO events VALUES (123, 'purchase', now(), '{"amount": 99.99}');

-- Step 4: Query across all shards
SELECT COUNT() as total_events FROM events;
SELECT topK(10)(event_type) FROM events; -- Top 10 event types
```
Monitor Cluster Health:
```sql
-- Check replication status
SELECT
    database,
    table,
    total_replicas,
    active_replicas,
    absolute_delay
FROM system.replicas
ORDER BY absolute_delay DESC;

-- Verify all nodes are responding
SELECT hostname(), version(), uptime() as uptime_seconds
FROM clusterAllReplicas('production_cluster', system.one);
```
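The Distributed table above uses `rand()` as the sharding key, which balances load but scatters each user's rows across shards; a hash-based key such as `cityHash64(user_id)` keeps each user on one shard, so per-user queries stay local. The trade-off can be sketched outside ClickHouse (using Python's `zlib.crc32` as a stand-in for ClickHouse's hash functions):

```python
import zlib

NUM_SHARDS = 2  # matches the two-shard cluster config above

def shard_for(user_id: int) -> int:
    """Deterministic shard assignment, mimicking cityHash64(user_id) % N."""
    return zlib.crc32(user_id.to_bytes(8, "little")) % NUM_SHARDS

# The same user always lands on the same shard...
assert shard_for(123) == shard_for(123)

# ...while a population of users still spreads across all shards.
used = {shard_for(uid) for uid in range(1000)}
print(sorted(used))  # [0, 1]
```

Prefer the hash key when queries filter on the sharded column; keep `rand()` when uniform write load matters more than data locality.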
Step 5: Set Up Backup & Disaster Recovery
Implement automated, tested backup procedures:
Filesystem Snapshot Strategy (Recommended for production):
```bash
#!/bin/bash
# backup_clickhouse.sh - Daily automated backup

BACKUP_DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="/backups/clickhouse/$BACKUP_DATE"
CH_DATA_DIR="/var/lib/clickhouse"
mkdir -p "$BACKUP_DIR"

# Optional: Stop ClickHouse for a consistent backup
sudo systemctl stop clickhouse-server

# Create backup with rsync (preserve hard links)
rsync -av --hard-links "$CH_DATA_DIR/" "$BACKUP_DIR/"

# Optional: Restart ClickHouse
sudo systemctl start clickhouse-server

# Keep only the last 30 days of backups
find /backups/clickhouse -maxdepth 1 -type d -mtime +30 -exec rm -rf {} \;

echo "Backup completed to $BACKUP_DIR"
```
**Native BACKUP Command (ClickHouse 22.8+):**
```sql
-- Requires a 'backups' disk configured in the server's storage configuration
-- Backup single table
BACKUP TABLE events TO Disk('backups', 'events_' || toString(today()) || '.zip');
-- Backup entire database
BACKUP DATABASE default TO Disk('backups', 'db_' || toString(today()) || '.zip');
-- Restore table
RESTORE TABLE events FROM Disk('backups', 'events_20240115.zip');
-- Restore database
RESTORE DATABASE default FROM Disk('backups', 'db_20240115.zip');
```
Disaster Recovery Procedure:
```bash
#!/bin/bash
# Restore a complete ClickHouse instance
set -e
NEW_SERVER="new.example.com"
BACKUP_DATE="20240115"
echo "Starting disaster recovery..."

# 1. Install ClickHouse on the new server
ssh "root@$NEW_SERVER" 'apt-get update && apt-get install -y clickhouse-server clickhouse-client'

# 2. Stop ClickHouse
ssh "root@$NEW_SERVER" 'systemctl stop clickhouse-server'

# 3. Restore the data directory
ssh "root@$NEW_SERVER" "rsync -av /remote/backup/$BACKUP_DATE/ /var/lib/clickhouse/"

# 4. Fix permissions
ssh "root@$NEW_SERVER" 'chown -R clickhouse:clickhouse /var/lib/clickhouse'

# 5. Start ClickHouse
ssh "root@$NEW_SERVER" 'systemctl start clickhouse-server'

# 6. Verify data integrity
ssh "root@$NEW_SERVER" 'clickhouse-client --query "SELECT COUNT() FROM events"'

echo "Disaster recovery complete!"
```
**Test Backups Monthly:**
```bash
# Schedule as a monthly cron job:
# 0 2 1 * * /opt/scripts/test_backup_restore.sh
#
# The test script should:
# - Restore the latest backup to a test instance
# - Run sanity checks (table counts, data integrity)
# - Report success/failure to the ops team
```
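The `find -mtime +30` line above prunes by filesystem mtime. Since each backup directory encodes its creation time in the name (`%Y%m%d_%H%M%S`), retention can also be computed from the names themselves, which survives file copies that reset mtime. A hedged sketch of that logic:

```python
from datetime import datetime, timedelta

def expired_backups(dir_names, now, keep_days=30):
    """Return backup directory names older than keep_days, parsed from the
    BACKUP_DATE naming scheme (%Y%m%d_%H%M%S) used by the script above."""
    cutoff = now - timedelta(days=keep_days)
    out = []
    for name in dir_names:
        try:
            stamp = datetime.strptime(name, "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # skip anything that isn't a backup directory
        if stamp < cutoff:
            out.append(name)
    return out

now = datetime(2024, 2, 15)
dirs = ["20240201_020000", "20231201_020000", "lost+found"]
print(expired_backups(dirs, now))  # ['20231201_020000']
```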
Step 6: Tune Performance at Multiple Levels
Query-Level Tuning:
```sql
-- For analytical queries on large datasets
SET max_threads = 16;               -- Use all cores
SET max_memory_usage = 50000000000; -- 50 GB for large aggregations
SET max_execution_time = 300;       -- 5-minute timeout

-- For batch operations
SET async_insert = 1;          -- Queue inserts asynchronously
SET wait_for_async_insert = 0; -- Don't wait for async confirmation

-- For complex GROUP BY operations (takes effect with max_rows_to_group_by set)
SET group_by_overflow_mode = 'any'; -- Switch to approximate aggregation if the limit is exceeded

-- Enable the query result cache (recent ClickHouse versions)
SET use_query_cache = 1;
```
Server-Level Configuration (/etc/clickhouse-server/config.xml):
```xml
<clickhouse>
    <!-- Memory: Set to ~80% of available RAM -->
    <max_server_memory_usage>64000000000</max_server_memory_usage>

    <!-- Concurrency -->
    <max_concurrent_queries>100</max_concurrent_queries>
    <max_connections>4096</max_connections>

    <!-- Background operations (merges, mutations) -->
    <background_pool_size>16</background_pool_size>
    <background_schedule_pool_size>8</background_schedule_pool_size>

    <!-- Compression: LZ4 is fast (default), ZSTD is smaller -->
    <compression>
        <case>
            <method>lz4</method>
        </case>
    </compression>

    <!-- Network: Keep connections alive efficiently -->
    <keep_alive_timeout>3</keep_alive_timeout>
    <socket_receive_timeout_sec>300</socket_receive_timeout_sec>
    <socket_send_timeout_sec>300</socket_send_timeout_sec>

    <!-- Logging: Use warning level in production to reduce I/O -->
    <logger>
        <level>warning</level>
        <log>/var/log/clickhouse-server/clickhouse-server.log</log>
        <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
        <size>100M</size>
        <count>10</count>
    </logger>
</clickhouse>
```
Table-Level Optimization:
```sql
-- Use appropriate codecs for different column types
CREATE TABLE events (
    user_id UInt32,
    timestamp DateTime CODEC(DoubleDelta, LZ4), -- Delta encoding for timestamps
    event_type LowCardinality(String),          -- Dictionary encoding for low-cardinality values
    price Decimal(10, 2) CODEC(ZSTD(3)),        -- ZSTD for decimals (T64 applies to integer types only)
    json_data String CODEC(ZSTD(3))             -- ZSTD for larger strings
)
ENGINE = MergeTree()
ORDER BY (user_id, timestamp)
PARTITION BY toDate(timestamp)
SETTINGS
    index_granularity = 8192,
    min_bytes_for_wide_part = 10485760, -- 10 MB
    min_rows_for_wide_part = 100000;

-- Enable TTL for automatic data lifecycle management
ALTER TABLE events MODIFY TTL timestamp + INTERVAL 90 DAY;

-- Force optimization (merges all parts, applies TTL)
OPTIMIZE TABLE events FINAL;
```
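The codec choices above follow the shape of the data. `DoubleDelta` works on timestamps because consecutive events are close together: the deltas (and deltas-of-deltas) are tiny, repetitive numbers that compress far better than raw ~1.7e9 epoch values. The effect is easy to see outside ClickHouse:

```python
def deltas(xs):
    """First differences of a sequence."""
    return [b - a for a, b in zip(xs, xs[1:])]

# Raw epoch seconds for events arriving roughly once a second (synthetic data)
timestamps = [1_700_000_000 + i + (i % 3) for i in range(10)]

first = deltas(timestamps)  # small numbers instead of ~1.7e9 values
second = deltas(first)      # double delta: smaller still, mostly repeating
print(first)
print(second)
assert all(abs(d) < 10 for d in first)  # each delta fits in a few bits
```

The same reasoning says *not* to use delta codecs on unordered columns like `user_id`, where neighboring values are far apart.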
Step 7: Troubleshoot Common Issues
Issue: Query Timeout (Queries killed after max_execution_time)
Diagnosis:
```sql
SELECT query, query_duration_ms, read_rows, read_bytes / 1024 / 1024 as read_mb
FROM system.query_log
WHERE query_duration_ms > 30000 -- > 30 seconds
ORDER BY query_duration_ms DESC
LIMIT 10;
```
Solutions:
```sql
-- Option 1: Increase the timeout for a specific query
SET max_execution_time = 300; -- 5 minutes
SELECT * FROM events WHERE timestamp > '2024-01-01';

-- Option 2: Add a skip index for faster filtering
ALTER TABLE events ADD INDEX idx_timestamp timestamp TYPE minmax GRANULARITY 4;

-- Option 3: Use SAMPLE to estimate results faster (requires a sampling key on the table)
SELECT COUNT() FROM events SAMPLE 0.1; -- Estimate from 10% of data

-- Option 4: Reduce scanned data with appropriate WHERE clauses
SELECT COUNT() FROM events WHERE timestamp >= today() - 7; -- Last 7 days only
```
Issue: Out of Memory (Memory limit exceeded errors)
Diagnosis:
```sql
SELECT
    query_id,
    user,
    memory_usage / 1024 / 1024 as memory_mb,
    query
FROM system.processes
ORDER BY memory_usage DESC;

-- Check recent OOM errors
SELECT exception FROM system.query_log
WHERE exception LIKE '%Memory%'
  AND event_date >= today() - 1;
```
Solutions:
```sql
-- Option 1: Increase the per-query memory limit
SET max_memory_usage = 30000000000; -- 30 GB (if the server has capacity)

-- Option 2: Use approximate aggregations instead of exact ones
SELECT uniq(user_id) FROM events;        -- Fast approximate distinct count
SELECT topK(1000)(category) FROM events; -- Top 1000 without a full GROUP BY

-- Option 3: Process in smaller batches
SELECT * FROM events WHERE date = toDate(now()) LIMIT 1000000;

-- Option 4: Increase the server memory limit if within hardware capacity
-- Edit config.xml: <max_server_memory_usage>128000000000</max_server_memory_usage>
```
Issue: Slow Inserts (Low insert throughput)
Diagnosis:
```sql
SELECT
    tables[1] as table,
    COUNT() as insert_count,
    SUM(written_rows) as total_rows,
    SUM(written_rows) / SUM(CAST(query_duration_ms as Float64)) * 1000 as rows_per_second
FROM system.query_log
WHERE query LIKE 'INSERT%'
  AND type = 'QueryFinish'
  AND event_date >= today()
GROUP BY table;

-- Check the number of active parts (high = slow inserts)
SELECT database, table, COUNT() as parts_count
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY parts_count DESC;
```
Solutions:
```sql
-- Option 1: Increase batch size (1,000 -> 100,000 rows per insert)
INSERT INTO events
SELECT * FROM source_table
LIMIT 100000; -- Larger batches = better throughput

-- Option 2: Enable async inserts for non-critical data
SET async_insert = 1;
SET wait_for_async_insert = 0; -- Fire and forget
INSERT INTO events VALUES (row1), (row2), ...;

-- Option 3: Reduce merge pressure
ALTER TABLE events MODIFY SETTING max_parts_in_total = 10000;
SYSTEM STOP MERGES events; -- Temporarily disable merges during a bulk load
-- ... run inserts ...
SYSTEM START MERGES events;

-- Option 4: Force a merge after the bulk load to consolidate parts
OPTIMIZE TABLE events;
```
Issue: High Replication Lag (Replicas behind leader)
Diagnosis:
```sql
SELECT
    database,
    table,
    total_replicas,
    active_replicas,
    absolute_delay
FROM system.replicas
ORDER BY absolute_delay DESC;

-- On the problematic replica, check the queue
SELECT * FROM system.replication_queue;
```
Solutions:
```bash
# Option 1: Check the network between replicas
ping replica-host
mtr replica-host   # detailed network path analysis

# Option 2: Check ZooKeeper cluster health
echo stat | nc zookeeper-host 2181
echo mntr | nc zookeeper-host 2181

# Option 3: Force the replica to sync with the leader
clickhouse-client --query "SYSTEM SYNC REPLICA database.table"

# Option 4: Restart the replica if its queue is stuck
sudo systemctl restart clickhouse-server

# Option 5: If far behind, drop the replica's metadata and resync from scratch
clickhouse-client --query "SYSTEM DROP REPLICA 'replica2' FROM TABLE database.table"
```
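Returning to the slow-inserts issue: the "larger batches" fix is usually enforced client-side by buffering rows and flushing every N. A minimal sketch of such a buffering writer — the flush callback stands in for whatever driver call actually performs the INSERT:

```python
class BatchingWriter:
    """Buffer rows and flush in large batches (illustrative, not a real driver)."""

    def __init__(self, flush, batch_size=100_000):
        self.flush = flush  # e.g. lambda rows: client.insert("events", rows)
        self.batch_size = batch_size
        self.buf = []

    def write(self, row):
        self.buf.append(row)
        if len(self.buf) >= self.batch_size:
            self.close()

    def close(self):
        """Flush any buffered rows; call once more at shutdown."""
        if self.buf:
            self.flush(self.buf)
            self.buf = []

batches = []
w = BatchingWriter(batches.append, batch_size=3)
for row in range(7):
    w.write(row)
w.close()
print([len(b) for b in batches])  # [3, 3, 1] -> a few large INSERTs instead of 7
```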
---
**Issue: Kafka Table Not Consuming (No new data from Kafka)**
Diagnosis:
```sql
-- Check Kafka consumer status (system.kafka_consumers, recent ClickHouse versions)
SELECT * FROM system.kafka_consumers;

-- Check for Kafka-related errors
SELECT exception, COUNT() as count
FROM system.query_log
WHERE query LIKE '%Kafka%'
  AND exception != ''
  AND event_date >= today() - 1
GROUP BY exception;
```
Solutions:
```bash
# Option 1: Check connectivity to the Kafka broker
telnet kafka-broker 9092

# Option 2: Verify the consumer group exists and check its offsets
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group clickhouse_consumer --describe

# Option 3: Reset offsets to consume from the beginning (if stuck)
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group clickhouse_consumer --reset-offsets --to-earliest \
  --topic events --execute

# Option 4: Recreate the Kafka table
clickhouse-client --query "DROP TABLE kafka_events"
clickhouse-client -d default < create_kafka_table.sql
```
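Whether the table is actually consuming can also be judged from consumer-group lag: log-end offset minus committed offset, per partition. That is the arithmetic behind the `--describe` output above; a healthy Kafka table keeps the sum near zero, while a stuck one grows without bound. The calculation itself:

```python
def consumer_lag(end_offsets, committed):
    """Per-partition lag: log-end offset minus committed offset.
    Partitions with no committed offset count as fully behind."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

# Illustrative offsets, as reported by kafka-consumer-groups.sh --describe
end = {0: 1500, 1: 900}
done = {0: 1500, 1: 200}
lag = consumer_lag(end, done)
print(lag)  # partition 1 is 700 messages behind
assert sum(lag.values()) == 700
```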
Step 8: Manage Resources Effectively
Disk Space Management:

```sql
-- Monitor disk usage
SELECT
    formatReadableSize(free_space) as free,
    formatReadableSize(total_space) as total,
    round(free_space / total_space * 100, 2) as free_percent
FROM system.disks;

-- Drop old partitions (instant, no merge)
ALTER TABLE events DROP PARTITION '202301';
ALTER TABLE events DROP PARTITION '202302';

-- Set TTL for automatic cleanup
ALTER TABLE events MODIFY TTL timestamp + INTERVAL 90 DAY;

-- Manual optimization (applies TTL, merges parts)
OPTIMIZE TABLE events FINAL;

-- Check size reduction from optimization
SELECT
    formatReadableSize(SUM(bytes_on_disk)) as disk_size,
    COUNT() as parts_count
FROM system.parts
WHERE table = 'events' AND active = 1;
```

Connection Management:

```sql
-- View all current connections
SELECT
    query_id,
    user,
    client_hostname,
    elapsed as elapsed_seconds,
    query
FROM system.processes
ORDER BY elapsed DESC;

-- Kill specific long-running query
KILL QUERY WHERE query_id = 'abc123';

-- Kill all queries from a user (e.g., runaway job)
KILL QUERY WHERE user = 'batch_user';

-- Persistent limit in config.xml:
-- <max_connections>4096</max_connections>
```

Background Operation Control:

```sql
-- Check merge queue
SELECT
    database,
    table,
    elapsed,
    progress,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) as size
FROM system.merges
ORDER BY elapsed DESC;

-- Pause merges temporarily (e.g., for urgent queries)
SYSTEM STOP MERGES events;

-- Resume merges
SYSTEM START MERGES events;

-- Force immediate merge
OPTIMIZE TABLE events;

-- Prevent excessive merges for new tables
CREATE TABLE events_bulk (...)
ENGINE = MergeTree()
SETTINGS
    max_parts_in_total = 10000,    -- Allow more parts before forcing merge
    parts_to_throw_insert = 20000; -- Fail inserts at this threshold
```
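The free-space query above pairs naturally with a cron-based alert. A minimal sketch, assuming the 20%/10% thresholds from the deployment checklist; the helper name is illustrative, and in production the percentage would come from the `system.disks` query rather than a literal:

```shell
#!/bin/sh
# Classify disk pressure from a free-space percentage (integer 0-100).
# Thresholds mirror the deployment checklist: warn under 20% free, page under 10%.
disk_alert_level() {
    free_pct=$1
    if [ "$free_pct" -lt 10 ]; then
        echo "critical"
    elif [ "$free_pct" -lt 20 ]; then
        echo "warning"
    else
        echo "ok"
    fi
}

# In production, feed it the result of the system.disks query, e.g.:
#   free_pct=$(clickhouse-client --query \
#     "SELECT round(free_space / total_space * 100) FROM system.disks LIMIT 1")
disk_alert_level 15    # prints "warning"
```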
Step 9: Production Deployment Checklist
Use this before deploying ClickHouse to production:
Pre-Deployment (Hardware & Infrastructure)
- CPU: Minimum 8 cores, 16+ cores recommended
- Memory: 64 GB minimum, 128+ GB for production workloads
- Disk: NVMe SSD, 3x estimated data size minimum free space
- Network: < 1 ms latency between cluster nodes
- Redundancy: 3+ nodes if replication required
- ZooKeeper: 3-5 dedicated nodes for cluster coordination
- Monitoring: Prometheus/Grafana/Datadog setup ready
- Backup storage: Tested, sufficient capacity, separate location
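The hardware minimums above can be encoded as a quick pre-flight script. A sketch using the checklist's thresholds (8 cores, 64 GB RAM, 3x data size on disk); the function name and inputs are illustrative:

```shell
#!/bin/sh
# Pre-flight sizing check against the checklist minimums.
check_sizing() {
    cores=$1; ram_gb=$2; disk_gb=$3; data_gb=$4
    fail=0
    [ "$cores" -ge 8 ]   || { echo "FAIL: need >= 8 CPU cores"; fail=1; }
    [ "$ram_gb" -ge 64 ] || { echo "FAIL: need >= 64 GB RAM"; fail=1; }
    [ "$disk_gb" -ge $(( data_gb * 3 )) ] \
                         || { echo "FAIL: need >= 3x estimated data size on disk"; fail=1; }
    [ "$fail" -eq 0 ] && echo "sizing OK"
}

check_sizing 16 128 3000 800   # prints "sizing OK" (3000 GB >= 3 * 800 GB)
```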
Configuration
- Memory limits: max_server_memory_usage = 80% of RAM
- Connections: max_connections tuned for expected workload
- Logging: Level set to warning (not trace or debug)
- Compression: LZ4 configured for most tables
- TTL policies: Data lifecycle management enabled
- Authentication: users.xml configured with strong passwords
- Replication: ZooKeeper paths and cluster config validated
- Kafka: Consumer groups and topic offsets confirmed
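The 80%-of-RAM rule can be computed at provisioning time. A sketch for generating an explicit config value; note that ClickHouse can also derive this natively via its `max_server_memory_usage_to_ram_ratio` setting, so the helper is only for written-out configs:

```shell
#!/bin/sh
# Compute max_server_memory_usage (bytes) as 80% of total RAM, for writing an
# explicit value into config.xml.
compute_memory_limit() {
    total_ram_bytes=$1
    echo $(( total_ram_bytes * 80 / 100 ))
}

# Example for a 128 GB host; on Linux, total RAM could instead be read
# from /proc/meminfo rather than hard-coded.
compute_memory_limit $(( 128 * 1024 * 1024 * 1024 ))
```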
Operational
- Monitoring alerts: Configured for all critical metrics
  - Query latency p95 > 5s
  - Disk free < 20%
  - Memory > 80%
  - Replication lag > 60s
  - Error rate > 1%
- Log rotation: Logrotate or equivalent configured
- Graceful shutdown: systemctl stop (not kill -9)
- Upgrades: Tested in staging, rollback plan documented
- Capacity planning: Growth projections and scaling triggers
- Disaster recovery: Documented, tested within past 90 days
- Backups: Automated, tested restore procedure verified
- Documentation: Runbooks for common operations updated
Post-Deployment (First Week)
- Load testing: Realistic query and insert patterns
- Monitoring validation: All alerts working correctly
- Performance baseline: Record metrics for future comparison
- Team training: Ops team familiar with alerting & runbooks
- Incident response: On-call rotation established
Examples
Example 1: Emergency Query That's Using Too Much Memory
You're alerted that query p95 latency has spiked. A long-running job is causing memory pressure.
```sql
-- 1. Identify the problem query
SELECT
    query_id,
    user,
    elapsed,
    memory_usage / 1024 / 1024 / 1024 as memory_gb,
    query
FROM system.processes
WHERE memory_usage > 10000000000 -- > 10 GB
ORDER BY memory_usage DESC;

-- 2. Kill it immediately if necessary
KILL QUERY WHERE query_id = 'user_batch_12345';

-- 3. Investigate what went wrong in the query log
SELECT
    query_duration_ms,
    memory_usage / 1024 / 1024 / 1024 as memory_gb,
    query
FROM system.query_log
WHERE user = 'batch_user'
  AND event_date >= today()
ORDER BY event_time DESC
LIMIT 10;

-- 4. Optimize the problematic query
-- Instead of: SELECT * FROM events GROUP BY user_id, event_type, device, os, browser, location
-- Use approximate: SELECT topK(1000)(user_id) FROM events
```
Example 2: Disk Space Running Out (Critical)
Alert: Free disk space dropped below 15%. You have 4 hours before system fails.
```sql
-- 1. Check disk space immediately
SELECT
    formatReadableSize(free_space) as free,
    formatReadableSize(total_space) as total,
    round(free_space / total_space * 100, 2) as free_percent
FROM system.disks;

-- 2. Identify largest tables
SELECT
    database,
    table,
    formatReadableSize(SUM(bytes_on_disk)) as size,
    COUNT() as parts_count
FROM system.parts
WHERE active = 1
  AND database NOT IN ('system', 'information_schema')
GROUP BY database, table
ORDER BY SUM(bytes_on_disk) DESC
LIMIT 10;

-- 3. Drop old partitions from the largest table
ALTER TABLE events DROP PARTITION '202301';
ALTER TABLE events DROP PARTITION '202302';
ALTER TABLE events DROP PARTITION '202303';

-- 4. Trigger optimization to merge parts and reduce size
OPTIMIZE TABLE events FINAL;

-- 5. Monitor recovery
SELECT formatReadableSize(free_space) as free
FROM system.disks;

-- 6. For permanent fix: add more disk capacity and adjust TTL
ALTER TABLE events MODIFY TTL timestamp + INTERVAL 90 DAY;
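The DROP PARTITION statements in step 3 can be generated rather than typed by hand. A sketch that emits YYYYMM partition IDs older than a cutoff month (the naming matches the '202301' style used above; everything else is illustrative):

```shell
#!/bin/sh
# Emit monthly partition IDs (YYYYMM) from a start month up to, but not
# including, a cutoff month -- candidates for ALTER TABLE ... DROP PARTITION.
partitions_before() {
    start=$1; cutoff=$2
    y=$(( start / 100 )); m=$(( start % 100 ))
    while [ $(( y * 100 + m )) -lt "$cutoff" ]; do
        printf '%d%02d\n' "$y" "$m"
        m=$(( m + 1 ))
        if [ "$m" -gt 12 ]; then m=1; y=$(( y + 1 )); fi
    done
}

partitions_before 202311 202402   # prints 202311, 202312, 202401
```

Each emitted ID could then be looped into `clickhouse-client --query "ALTER TABLE events DROP PARTITION '<id>'"`.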
Example 3: Setting Up Monitoring for New Cluster
You've deployed a new 3-node ClickHouse cluster and need Prometheus monitoring.
Create `clickhouse_queries.yaml` for the Prometheus scraper:

```yaml
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'clickhouse-metrics'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['clickhouse1:9090', 'clickhouse2:9090', 'clickhouse3:9090']
        labels:
          job: 'clickhouse-cluster'
```

Create monitoring queries for Grafana:
```sql
-- Dashboard: Query Latency
SELECT
    avg(query_duration_ms) as avg_latency_ms,
    quantile(0.95)(query_duration_ms) as p95_latency_ms,
    quantile(0.99)(query_duration_ms) as p99_latency_ms
FROM system.query_log
WHERE event_date >= today()
  AND type = 'QueryFinish'
  AND query NOT LIKE '%system.%';

-- Dashboard: Insert Rate
SELECT
    toStartOfMinute(event_time) as minute,
    SUM(rows) as total_rows
FROM system.query_log
WHERE query LIKE 'INSERT%'
  AND event_date >= today()
GROUP BY minute
ORDER BY minute DESC;

-- Dashboard: Replication Health
SELECT
    database,
    table,
    absolute_delay as lag_seconds,
    active_replicas,
    total_replicas
FROM system.replicas;
```
Example 4: Implementing Backup with 30-Day Retention
Create automated daily backups that keep 30 days of history:
```bash
#!/bin/bash
# /usr/local/bin/backup_clickhouse.sh

set -e

BACKUP_DIR="/backups/clickhouse"
RETENTION_DAYS=30
BACKUP_DATE=$(date +%Y%m%d_%H%M%S)
FULL_BACKUP_PATH="$BACKUP_DIR/$BACKUP_DATE"

mkdir -p "$FULL_BACKUP_PATH"
echo "Starting ClickHouse backup at $BACKUP_DATE"

# Method 1: Using ClickHouse native backup (if ClickHouse 22.8+)
clickhouse-client << EOF
BACKUP DATABASE default TO Disk('backups', 'db_$BACKUP_DATE.zip');
BACKUP TABLE events TO Disk('backups', 'events_$BACKUP_DATE.zip');
EOF

# Method 2: Filesystem snapshot (alternative)
# rsync -av /var/lib/clickhouse/ "$FULL_BACKUP_PATH/"

# Cleanup old backups
find "$BACKUP_DIR" -maxdepth 1 -type d -mtime +"$RETENTION_DAYS" -exec rm -rf {} \;

# Verify backup integrity (sample check)
if [ -f "$FULL_BACKUP_PATH/.metadata" ]; then
    echo "Backup verified at $FULL_BACKUP_PATH"
else
    echo "Warning: Backup verification failed"
    exit 1
fi

echo "Backup completed successfully"
```

Add to crontab:

```bash
# Daily backup at 2 AM
0 2 * * * /usr/local/bin/backup_clickhouse.sh >> /var/log/clickhouse_backup.log 2>&1

# Monthly restore test (1st of month at 3 AM)
0 3 1 * * /usr/local/bin/test_backup_restore.sh >> /var/log/clickhouse_restore_test.log 2>&1
```
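The `find -mtime` cleanup above relies on filesystem timestamps; since the backup directory names already encode the date, retention can also be decided from the name itself. A sketch assuming GNU date and the script's YYYYMMDD_HHMMSS naming; the helper name is illustrative:

```shell
#!/bin/sh
# Decide from its name whether a backup directory (YYYYMMDD_HHMMSS) is older
# than a retention window, relative to a reference date (YYYYMMDD).
is_expired() {
    backup_dir=$1; today=$2; retention_days=$3
    backup_day=${backup_dir%%_*}        # strip the _HHMMSS suffix
    b=$(date -d "$backup_day" +%s)      # GNU date; BSD/macOS needs `date -j` instead
    t=$(date -d "$today" +%s)
    age_days=$(( (t - b) / 86400 ))
    if [ "$age_days" -gt "$retention_days" ]; then echo expired; else echo keep; fi
}

is_expired 20240101_020000 20240301 30   # prints "expired" (60 days old)
```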
References & Resources
This skill provides comprehensive reference materials and real-world examples organized in supporting files:
Query Reference Library
File: references/queries.md
Complete library of diagnostic, monitoring, and operational queries organized by category:
- Monitoring & diagnostics queries (real-time cluster status, query performance, error tracking)
- Performance analysis queries (insert performance, merge operations, memory usage, index effectiveness)
- Resource management queries (disk space, connections, background operations)
- Cluster operations queries (replication health, cluster-wide metrics, data consistency)
- Data lifecycle queries (TTL and retention management, partition management)
- System introspection queries (configuration, table engine information, system events)
Troubleshooting Guide
File: references/troubleshooting.md
Extended troubleshooting guide with root cause analysis and resolution procedures for:
- Query performance issues (slow queries, wrong results)
- Resource issues (OOM killer, high CPU, disk space)
- Replication & clustering (replication lag, coordination failures)
- Data ingestion issues (slow inserts, Kafka consumer problems)
- Operational issues (unresponsive server, backup/restore failures)
Scaling Decision Framework
File: examples/scaling-case-studies.md
Real-world scaling case studies and decision trees:
- Case Study 1: Scaling for growing data volume (capacity planning)
- Case Study 2: Scaling for query performance (optimization)
- Case Study 3: Scaling for insert throughput (ingestion bottlenecks)
- Scaling decision matrix (choosing between vertical and horizontal scaling)
- 6-month scaling roadmap template
Monitoring Setup Examples
File: examples/monitoring-setup.md
Production-ready monitoring implementations:
- Prometheus configuration for ClickHouse
- Grafana dashboard setup
- Alert rule examples
- Real-world monitoring patterns
Requirements
- ClickHouse server installed and running (version 21.0+, 22.8+ for native BACKUP)
- SSH access to ClickHouse servers
- clickhouse-client command-line tool available
- For cluster operations: ZooKeeper 3.4+ configured
- For Kafka integration: Kafka/Redpanda brokers accessible
- Backup storage: Sufficient disk space (3x data size minimum)
- Monitoring system: Prometheus, Grafana, or Datadog for dashboards