optimizing-query-by-id

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Optimize Query from Query ID

通过查询ID优化查询

Fetch query → Get profile → Apply best practices → Verify improvement → Return optimized query
获取查询 → 生成配置文件 → 应用最佳实践 → 验证提升效果 → 返回优化后的查询

Workflow

工作流程

1. Fetch Query Details from Query ID

1. 通过查询ID获取查询详情

sql
SELECT
    query_id,
    query_text,
    total_elapsed_time/1000 as seconds,
    bytes_scanned/1e9 as gb_scanned,
    bytes_spilled_to_local_storage/1e9 as gb_spilled_local,
    bytes_spilled_to_remote_storage/1e9 as gb_spilled_remote,
    partitions_scanned,
    partitions_total,
    rows_produced
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_id = '<query_id>';
Note the key metrics:
  • seconds
    : Total execution time
  • gb_scanned
    : Data read (lower is better)
  • gb_spilled
    : Spillage indicates memory pressure
  • partitions_scanned/total
    : Partition pruning effectiveness
sql
SELECT
    query_id,
    query_text,
    total_elapsed_time/1000 as seconds,
    bytes_scanned/1e9 as gb_scanned,
    bytes_spilled_to_local_storage/1e9 as gb_spilled_local,
    bytes_spilled_to_remote_storage/1e9 as gb_spilled_remote,
    partitions_scanned,
    partitions_total,
    rows_produced
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_id = '<query_id>';
注意以下关键指标:
  • seconds
    : 总执行时间
  • gb_scanned
    : 读取的数据量(越小越好)
  • gb_spilled
    : 溢出量表示内存压力
  • partitions_scanned/total
    : 分区修剪的有效性

2. Get Query Profile Details

2. 获取查询配置文件详情

sql
-- Get operator-level statistics
SELECT *
FROM TABLE(GET_QUERY_OPERATOR_STATS('<query_id>'));
Look for:
  • Operators with high
    output_rows
    vs
    input_rows
    (explosions)
  • TableScan operators with high bytes
  • Sort/Aggregate operators with spillage
sql
-- 获取算子级统计信息
SELECT *
FROM TABLE(GET_QUERY_OPERATOR_STATS('<query_id>'));
重点关注:
  • output_rows
    远大于
    input_rows
    的算子(数据膨胀)
  • 扫描字节数高的TableScan算子
  • 存在溢出的Sort/Aggregate算子

3. Identify Optimization Opportunities

3. 识别优化机会

Based on profile, look for:
MetricIssueFix
partitions_scanned = partitions_totalNo pruningAdd filter on cluster key
gb_spilled > 0Memory pressureSimplify query, increase warehouse
High bytes_scannedFull scanAdd selective filters, reduce columns
Join explosionCartesian or bad keyFix join condition, filter before join
根据配置文件,查看以下内容:
指标问题修复方案
partitions_scanned = partitions_total未进行分区修剪添加基于集群键的过滤条件
gb_spilled > 0内存压力简化查询,增大仓库规模
高bytes_scanned全表扫描添加选择性过滤条件,减少查询列数
连接膨胀笛卡尔积或关联键不合理修复连接条件,连接前先过滤数据

4. Apply Optimizations

4. 应用优化措施

Rewrite the query:
  • Select only needed columns
  • Filter early (before joins)
  • Use CTEs to avoid repeated scans
  • Ensure filters align with clustering keys
  • Add LIMIT if full result not needed
重写查询:
  • 仅选择需要的列
  • 尽早过滤(连接前)
  • 使用CTE避免重复扫描
  • 确保过滤条件与集群键匹配
  • 若不需要完整结果则添加LIMIT

5. Get Explain Plan for Optimized Query

5. 获取优化后查询的执行计划

sql
EXPLAIN USING JSON
<optimized_query>;
sql
EXPLAIN USING JSON
<optimized_query>;

6. Compare Plans

6. 对比执行计划

Compare original vs optimized:
  • Fewer partitions scanned?
  • Fewer intermediate rows?
  • Better join order?
对比原始查询与优化后查询:
  • 扫描的分区数是否更少?
  • 中间结果行数是否更少?
  • 连接顺序是否更优?

7. Return Results

7. 返回结果

Provide:
  1. Original query metrics (time, data scanned, spillage)
  2. Identified issues
  3. The optimized query
  4. Summary of changes made
  5. Expected improvement
提供以下内容:
  1. 原始查询指标(时间、扫描数据量、溢出量)
  2. 识别出的问题
  3. 优化后的查询
  4. 所做修改的总结
  5. 预期提升效果

Example Output

示例输出

Original Query Metrics:
  • Execution time: 45 seconds
  • Data scanned: 12.3 GB
  • Partitions: 500/500 (no pruning)
  • Spillage: 2.1 GB
Issues Found:
  1. No partition pruning - filtering on non-cluster column
  2. SELECT * scanning unnecessary columns
  3. Large table joined without pre-filtering
Optimized Query:
sql
WITH filtered_events AS (
    SELECT event_id, user_id, event_type, created_at
    FROM events
    WHERE created_at >= '2024-01-01'
      AND created_at < '2024-02-01'
      AND event_type = 'purchase'
)
SELECT fe.event_id, fe.created_at, u.name
FROM filtered_events fe
JOIN users u ON fe.user_id = u.id;
Changes:
  • Added date range filter matching cluster key
  • Replaced SELECT * with specific columns
  • Pre-filtered in CTE before join
Expected Improvement:
  • Partitions: 500 → ~15 (97% reduction)
  • Data scanned: 12.3 GB → ~0.4 GB
  • Estimated time: 45s → ~3s
原始查询指标:
  • 执行时间:45秒
  • 扫描数据量:12.3 GB
  • 分区情况:500/500(未进行分区修剪)
  • 溢出量:2.1 GB
发现的问题:
  1. 未进行分区修剪——过滤条件未使用集群键
  2. 使用SELECT *扫描不必要的列
  3. 大表连接前未进行预过滤
优化后的查询:
sql
WITH filtered_events AS (
    SELECT event_id, user_id, event_type, created_at
    FROM events
    WHERE created_at >= '2024-01-01'
      AND created_at < '2024-02-01'
      AND event_type = 'purchase'
)
SELECT fe.event_id, fe.created_at, u.name
FROM filtered_events fe
JOIN users u ON fe.user_id = u.id;
修改内容:
  • 添加了与集群键匹配的日期范围过滤条件
  • 将SELECT *替换为具体列
  • 在CTE中进行预过滤后再连接
预期提升效果:
  • 分区数:500 → 约15个(减少97%)
  • 扫描数据量:12.3 GB → 约0.4 GB
  • 预估执行时间:45秒 → 约3秒