altinity-expert-clickhouse-index-analysis
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseDiagnostics
诊断步骤
Run all queries from the file checks.sql and analyze the results.
运行checks.sql文件中的所有查询并分析结果。
Deep Dive Queries (Placeholder-Based)
深度分析查询(基于占位符)
EXPLAIN Index Usage for Specific Query
特定查询的EXPLAIN索引使用分析
sql
EXPLAIN indexes = 1
{query_without_format}Look for:
- condition should not be
PrimaryKey(means no filtering)true - ratio shows selectivity (low X/Y = good)
Granules: X/Y - indexes should reduce parts/granules further
Skip
sql
EXPLAIN indexes = 1
{query_without_format}需要关注:
- 条件不应为
PrimaryKey(表示未进行过滤)true - 比值表示筛选性(X/Y值越小越好)
Granules: X/Y - 跳数索引(Skip indexes)应进一步减少数据分片/颗粒数
Column Cardinality Analysis
列基数分析
sql
SELECT
{columns} APPLY uniq
FROM {database}.{table}
WHERE {time_column} > now() - INTERVAL {days} DAYOptimal ORDER BY ordering: Low cardinality columns first, high cardinality last.
sql
SELECT
{columns} APPLY uniq
FROM {database}.{table}
WHERE {time_column} > now() - INTERVAL {days} DAY最优ORDER BY排序规则: 低基数列在前,高基数列在后。
Query Pattern WHERE Columns Extraction
查询模式中WHERE子句列提取
sql
WITH
any(query) AS q,
arrayJoin(extractAll(query, '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT|$)')) AS w,
arrayFilter(x -> (position(w, extract(x, '\\.(`[^`]+`|[^\\.]+)$')) > 0), columns) AS c,
arrayJoin(c) AS c2
SELECT
c2,
count() AS usage_count
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
AND arrayExists(x -> x LIKE '%{table}%', tables)
AND query ILIKE 'SELECT%'
AND type = 'QueryFinish'
GROUP BY c2
ORDER BY usage_count DESC
FORMAT PrettyCompactMonoBlocksql
WITH
any(query) AS q,
arrayJoin(extractAll(query, '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT|$)')) AS w,
arrayFilter(x -> (position(w, extract(x, '\\.(`[^`]+`|[^\\.]+)$')) > 0), columns) AS c,
arrayJoin(c) AS c2
SELECT
c2,
count() AS usage_count
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
AND arrayExists(x -> x LIKE '%{table}%', tables)
AND query ILIKE 'SELECT%'
AND type = 'QueryFinish'
GROUP BY c2
ORDER BY usage_count DESC
FORMAT PrettyCompactMonoBlockNormalized WHERE Clause Patterns
标准化WHERE子句模式
sql
WITH
arrayJoin(extractAll(normalizeQuery(query), '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT|$)')) AS w
SELECT
w AS where_pattern,
count() AS frequency
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
AND arrayExists(x -> x LIKE '%{table}%', tables)
AND query ILIKE 'SELECT%'
AND type = 'QueryFinish'
GROUP BY w
ORDER BY frequency DESC
LIMIT 20sql
WITH
arrayJoin(extractAll(normalizeQuery(query), '\\b(?:PRE)?WHERE\\s+(.*?)\\s+(?:GROUP BY|ORDER BY|UNION|SETTINGS|FORMAT|$)')) AS w
SELECT
w AS where_pattern,
count() AS frequency
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
AND arrayExists(x -> x LIKE '%{table}%', tables)
AND query ILIKE 'SELECT%'
AND type = 'QueryFinish'
GROUP BY w
ORDER BY frequency DESC
LIMIT 20Granule Selectivity from Query Log
从查询日志分析颗粒筛选性
sql
SELECT
query_id,
normalized_query_hash,
selected_parts,
selected_marks,
read_rows,
round(read_rows / nullIf(selected_marks, 0)) AS rows_per_mark,
query_duration_ms,
formatReadableSize(read_bytes) AS read_bytes
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
AND arrayExists(x -> x LIKE '%{table}%', tables)
AND query ILIKE 'SELECT%'
AND type = 'QueryFinish'
ORDER BY selected_marks DESC
LIMIT 20High / total marks = poor index utilization.
selected_markssql
SELECT
query_id,
normalized_query_hash,
selected_parts,
selected_marks,
read_rows,
round(read_rows / nullIf(selected_marks, 0)) AS rows_per_mark,
query_duration_ms,
formatReadableSize(read_bytes) AS read_bytes
FROM system.query_log
WHERE event_time >= now() - toIntervalDay({days})
AND arrayExists(x -> x LIKE '%{table}%', tables)
AND query ILIKE 'SELECT%'
AND type = 'QueryFinish'
ORDER BY selected_marks DESC
LIMIT 20selected_marksAnalysis Workflow
分析流程
Step 1: Check Current Indexes
步骤1:检查当前索引
sql
-- Table structure with ORDER BY, PRIMARY KEY, indexes
SHOW CREATE TABLE {database}.{table}sql
-- Skipping indexes
SELECT name, type, expr, granularity
FROM system.data_skipping_indices
WHERE database = '{database}' AND table = '{table}'sql
-- 包含ORDER BY、PRIMARY KEY、索引的表结构
SHOW CREATE TABLE {database}.{table}sql
-- 跳数索引
SELECT name, type, expr, granularity
FROM system.data_skipping_indices
WHERE database = '{database}' AND table = '{table}'Step 2: Extract Query Patterns
步骤2:提取查询模式
Run the WHERE column extraction and normalized pattern queries to understand:
- Which columns appear most frequently in WHERE clauses
- What condition combinations are common
运行WHERE列提取和标准化模式查询,以了解:
- WHERE子句中最常出现的列
- 常见的条件组合
Step 3: Check Column Cardinalities
步骤3:检查列基数
Compare cardinalities of columns in:
- Current ORDER BY key
- Frequently filtered columns from Step 2
比较以下列的基数:
- 当前ORDER BY键中的列
- 步骤2中筛选出的高频过滤列
Step 4: Evaluate Index Alignment
步骤4:评估索引匹配度
| Query Pattern | Index Support | Action |
|---|---|---|
| Filters on ORDER BY prefix | ✅ Good | None |
| Filters on non-ORDER BY cols | ⚠️ Skip index? | Add bloom_filter or projection |
| Time range + entity | ⚠️ Check order | Time in ORDER BY or partition? |
| High-cardinality first in ORDER BY | ❌ Bad | Reorder (low→high cardinality) |
| 查询模式 | 索引支持情况 | 操作建议 |
|---|---|---|
| 对ORDER BY前缀列进行过滤 | ✅ 良好 | 无需操作 |
| 对非ORDER BY列进行过滤 | ⚠️ 是否使用跳数索引? | 添加bloom_filter或投影 |
| 时间范围+实体筛选 | ⚠️ 检查顺序 | 时间列是否在ORDER BY或分区键中? |
| ORDER BY中高基数列在前 | ❌ 不合理 | 重新排序(低→高基数) |
ORDER BY Design Guidelines
ORDER BY设计指南
Column Order Principles
列排序原则
- Lowest cardinality first - maximizes granule skipping
- Most frequently filtered - columns in WHERE should be in ORDER BY
- Time column considerations:
- If most queries filter on time ranges → include in ORDER BY (possibly with lower resolution like )
toDate(ts) - If partition key handles time filtering → may not need in ORDER BY
- If most queries filter on time ranges → include in ORDER BY (possibly with lower resolution like
- 最低基数列优先 - 最大化颗粒跳过效率
- 高频过滤列优先 - WHERE子句中的列应包含在ORDER BY中
- 时间列注意事项:
- 如果大多数查询针对时间范围过滤 → 将其加入ORDER BY(可使用更低精度的格式如)
toDate(ts) - 如果分区键已处理时间过滤 → 可能无需加入ORDER BY
- 如果大多数查询针对时间范围过滤 → 将其加入ORDER BY(可使用更低精度的格式如
Common Anti-Patterns
常见反模式
| Anti-Pattern | Problem | Fix |
|---|---|---|
| High-cardinality UUID first | No granule skipping | Move after low-cardinality columns |
| DateTime64 microseconds first | Too granular | Use |
| Column in WHERE not in ORDER BY | Full scan | Add to ORDER BY or create projection |
| Bloom filter on ORDER BY column | Redundant | Remove skip index |
| Time not in ORDER BY or partition | Range queries scan all | Add |
| 反模式 | 问题 | 修复方案 |
|---|---|---|
| 高基数UUID列排在首位 | 无法跳过颗粒 | 移至低基数列之后 |
| DateTime64微秒列排在首位 | 粒度太细 | 使用 |
| WHERE子句中的列未在ORDER BY中 | 全表扫描 | 加入ORDER BY或创建投影 |
| ORDER BY列上创建Bloom过滤器 | 冗余 | 删除跳数索引 |
| 时间列未在ORDER BY或分区键中 | 范围查询扫描全表 | 将 |
Cardinality Ordering Example
基数排序示例
Given cardinalities:
- : 6
entity_type - : 18,588
entity - : 335,620
cast_hash
Recommended ORDER BY:
(entity_type, entity, cast_hash, ...)给定基数:
- : 6
entity_type - : 18,588
entity - : 335,620
cast_hash
推荐ORDER BY顺序:
(entity_type, entity, cast_hash, ...)Skipping Index Guidelines
跳数索引指南
When Skip Indexes Help
跳数索引适用场景
- Column NOT in ORDER BY
- Column values correlate with physical data order
- Low false-positive rate for the index type
- 列未包含在ORDER BY中
- 列值与物理数据顺序相关
- 索引类型的误报率低
When Skip Indexes Don't Help
跳数索引不适用场景
- Column already in ORDER BY prefix (use PRIMARY KEY instead)
- Column values randomly distributed (no correlation with ORDER BY)
- Very high cardinality with set/bloom_filter
- 列已在ORDER BY前缀中(改用PRIMARY KEY)
- 列值随机分布(与ORDER BY无关联)
- 使用set/bloom_filter的极高基数列