optimizing-query-text

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Optimize Query from SQL Text

根据SQL文本优化查询

OUTPUT FORMAT

输出格式

Return ONLY the optimized SQL query. No markdown formatting, no explanations, no bullet points - just pure SQL that can be executed directly in Snowflake.
仅返回优化后的SQL查询。不要使用Markdown格式,不要解释,不要项目符号——只返回可直接在Snowflake中执行的纯SQL。

CRITICAL: Semantic Preservation Rules

关键:语义保留规则

The optimized query MUST return IDENTICAL results to the original.
Before returning ANY optimization, verify:
  • Same columns: Exact same columns in exact same order with exact same aliases
  • Same rows: Filter conditions must be semantically equivalent
  • Same ordering: Preserve
    ORDER BY
    exactly as written
  • Same limits: If original has
    LIMIT N
    , keep
    LIMIT N
    . If no LIMIT, do NOT add one.
If you cannot guarantee identical results, return the original query unchanged.

优化后的查询必须返回与原查询完全一致的结果。
在返回任何优化结果前,请验证:
  • 相同列:列完全相同,顺序完全一致,别名完全相同
  • 相同行:过滤条件必须在语义上等价
  • 相同排序:严格保留原查询的
    ORDER BY
    语句
  • 相同限制:如果原查询有
    LIMIT N
    ,则保留该限制;如果没有LIMIT,请勿添加。
如果无法保证结果完全一致,请原封不动返回原查询。

Pattern 1: Function on Filter Column

模式1:过滤列上使用函数

Problem: Functions on columns in WHERE clause prevent partition pruning and index usage.
问题:WHERE子句中列上使用函数会阻止分区裁剪和索引使用。

CAN Fix

可优化场景

OriginalOptimizedWhy Safe
WHERE DATE(ts) = '2024-01-01'
WHERE ts >= '2024-01-01' AND ts < '2024-01-02'
Equivalent range
WHERE YEAR(dt) = 2024
WHERE dt >= '2024-01-01' AND dt < '2025-01-01'
Equivalent range
WHERE MONTH(dt) = 3 AND YEAR(dt) = 2024
WHERE dt >= '2024-03-01' AND dt < '2024-04-01'
Equivalent range
WHERE DATE(ts) >= '2024-01-01' AND DATE(ts) < '2024-02-01'
WHERE ts >= '2024-01-01' AND ts < '2024-02-01'
Same boundaries
WHERE YEAR(dt) BETWEEN 1995 AND 1996
WHERE dt >= '1995-01-01' AND dt < '1997-01-01'
Equivalent range
原查询优化后查询安全性说明
WHERE DATE(ts) = '2024-01-01'
WHERE ts >= '2024-01-01' AND ts < '2024-01-02'
等价范围查询
WHERE YEAR(dt) = 2024
WHERE dt >= '2024-01-01' AND dt < '2025-01-01'
等价范围查询
WHERE MONTH(dt) = 3 AND YEAR(dt) = 2024
WHERE dt >= '2024-03-01' AND dt < '2024-04-01'
等价范围查询
WHERE DATE(ts) >= '2024-01-01' AND DATE(ts) < '2024-02-01'
WHERE ts >= '2024-01-01' AND ts < '2024-02-01'
边界条件一致
WHERE YEAR(dt) BETWEEN 1995 AND 1996
WHERE dt >= '1995-01-01' AND dt < '1997-01-01'
等价范围查询

CANNOT Fix

不可优化场景

PatternWhy Not
WHERE YEAR(dt) IN (SELECT year FROM ...)
Dynamic values, cannot precompute range
WHERE DATE(ts) = DATE(other_col)
Comparing two columns, both need function
WHERE EXTRACT(DOW FROM dt) = 1
Day-of-week has no contiguous range
WHERE DATE_TRUNC('month', dt) = '2024-01-01'
in GROUP BY
Needed for grouping logic
SELECT YEAR(dt) AS yr ... GROUP BY YEAR(dt)
Function in SELECT/GROUP BY is fine, only filter matters

模式原因
WHERE YEAR(dt) IN (SELECT year FROM ...)
动态值,无法预计算范围
WHERE DATE(ts) = DATE(other_col)
比较两列,均需使用函数
WHERE EXTRACT(DOW FROM dt) = 1
星期几没有连续范围
WHERE DATE_TRUNC('month', dt) = '2024-01-01'
在GROUP BY中
分组逻辑需要该函数
SELECT YEAR(dt) AS yr ... GROUP BY YEAR(dt)
SELECT/GROUP BY中使用函数无问题,仅过滤条件需关注

Pattern 2: Function on JOIN Column

模式2:连接列上使用函数

Problem: Functions on JOIN columns prevent hash joins, forcing slower nested loop joins.
问题:JOIN列上使用函数会阻止哈希连接,迫使用户使用更慢的嵌套循环连接。

CAN Fix

可优化场景

OriginalOptimizedWhy Safe
ON CAST(a.id AS VARCHAR) = CAST(b.id AS VARCHAR)
ON a.id = b.id
If both are same type (e.g., INTEGER)
ON UPPER(a.code) = UPPER(b.code)
ON a.code = b.code
If data is already consistently cased
ON TRIM(a.name) = TRIM(b.name)
ON a.name = b.name
If data has no leading/trailing spaces
原查询优化后查询安全性说明
ON CAST(a.id AS VARCHAR) = CAST(b.id AS VARCHAR)
ON a.id = b.id
如果两者类型相同(例如INTEGER)
ON UPPER(a.code) = UPPER(b.code)
ON a.code = b.code
如果数据大小写已保持一致
ON TRIM(a.name) = TRIM(b.name)
ON a.name = b.name
如果数据没有前导/尾随空格

CANNOT Fix

不可优化场景

PatternWhy Not
ON CAST(a.id AS VARCHAR) = b.string_id
Types genuinely differ, CAST required
ON DATE(a.timestamp) = b.date_col
Different granularity, DATE() required
ON UPPER(a.code) = b.code
If b.code might have different case
ON a.id = b.id + 1
Arithmetic transformation, cannot remove

模式原因
ON CAST(a.id AS VARCHAR) = b.string_id
类型确实不同,必须使用CAST
ON DATE(a.timestamp) = b.date_col
粒度不同,必须使用DATE()
ON UPPER(a.code) = b.code
如果b.code可能存在不同大小写
ON a.id = b.id + 1
算术转换,无法移除

Pattern 3: NOT IN Subquery

模式3:NOT IN子查询

Problem:
NOT IN
has poor performance and unexpected NULL behavior.
问题
NOT IN
性能较差且存在意外的NULL行为。

CAN Fix

可优化场景

OriginalOptimizedWhy Safe
WHERE id NOT IN (SELECT id FROM t WHERE ...)
WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = main.id AND ...)
Equivalent when subquery column is NOT NULL
WHERE id NOT IN (SELECT id FROM t)
where id has NOT NULL constraint
WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = main.id)
NOT NULL guarantees equivalence
原查询优化后查询安全性说明
WHERE id NOT IN (SELECT id FROM t WHERE ...)
WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = main.id AND ...)
当子查询列不为NULL时等价
WHERE id NOT IN (SELECT id FROM t)
且id有NOT NULL约束
WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = main.id)
NOT NULL约束保证等价性

CANNOT Fix

不可优化场景

PatternWhy Not
WHERE id NOT IN (SELECT nullable_col FROM t)
If subquery returns NULL, NOT IN returns no rows; NOT EXISTS doesn't
WHERE (a, b) NOT IN (SELECT x, y FROM t)
Multi-column NOT IN has complex NULL semantics
Key Rule: Only convert NOT IN to NOT EXISTS if you can verify the subquery column cannot be NULL.

模式原因
WHERE id NOT IN (SELECT nullable_col FROM t)
如果子查询返回NULL,NOT IN不会返回任何行;NOT EXISTS则不会出现此问题
WHERE (a, b) NOT IN (SELECT x, y FROM t)
多列NOT IN的NULL语义复杂
核心规则:仅当你能确认子查询列不可能为NULL时,才将NOT IN转换为NOT EXISTS。

Pattern 4: Repeated Subquery

模式4:重复子查询

Problem: Same subquery executed multiple times causes redundant scans.
问题:相同子查询多次执行会导致冗余扫描。

CAN Fix

可优化场景

OriginalOptimized
Subquery appears 2+ times identicallyExtract to CTE, reference CTE multiple times
Same aggregation used in multiple placesCompute once in CTE
原查询优化后查询
相同子查询出现2次及以上提取为CTE,多次引用该CTE
同一聚合在多个地方使用在CTE中计算一次

CANNOT Fix

不可优化场景

PatternWhy Not
Correlated subquery (references outer table)Each execution is different, cannot cache
Subqueries with different filtersNot actually the same subquery
Subquery in SELECT that depends on current rowCorrelation prevents extraction

模式原因
关联子查询(引用外部表)每次执行结果不同,无法缓存
子查询过滤条件不同并非真正相同的子查询
SELECT中依赖当前行的子查询关联关系导致无法提取

Pattern 5: Implicit Comma Joins

模式5:隐式逗号连接

Problem: Comma-separated tables in FROM clause are harder to read and optimize.
问题:FROM子句中用逗号分隔表的写法可读性差且难以优化。

CAN Fix - Always

可优化场景 - 始终可以优化

Convert
FROM a, b, c WHERE a.id = b.id AND b.id = c.id
to explicit JOIN syntax.
This is always safe - just restructuring, no semantic change.

FROM a, b, c WHERE a.id = b.id AND b.id = c.id
转换为显式JOIN语法。
这种优化始终安全——只是重构写法,不会改变语义。

UNSAFE Optimizations (NEVER apply)

不安全的优化操作(绝对禁止)

  • UNION to UNION ALL: UNION deduplicates rows, UNION ALL does not - different results
  • Changing window functions: Do not modify
    SUM(SUM(x)) OVER(...)
    or similar nested aggregates
  • Adding redundant filters: Do not add filters in JOIN ON if same filter exists in WHERE
  • Changing column names: Copy column names EXACTLY from original - do not "simplify" or rename
  • Changing column aliases: Keep all aliases exactly as original
  • Adding early filtering in JOINs: If a filter is in WHERE, do not duplicate it in JOIN ON clause

  • 将UNION改为UNION ALL:UNION会去重行,UNION ALL不会——结果不同
  • 修改窗口函数:请勿修改
    SUM(SUM(x)) OVER(...)
    或类似的嵌套聚合函数
  • 添加冗余过滤条件:如果WHERE子句中已有相同过滤条件,请勿在JOIN ON中重复添加
  • 修改列名:完全复制原查询的列名——不要“简化”或重命名
  • 修改列别名:严格保留所有原别名
  • 在JOIN中提前过滤:如果过滤条件在WHERE中,请勿在JOIN ON子句中重复

Principles

原则

  1. Minimal changes: Make the fewest changes necessary. Simpler optimizations are more reliable.
  2. Preserve structure: Keep subqueries, CTEs, and overall query structure unless there's a clear benefit.
  3. When in doubt, don't: If unsure whether a change preserves semantics, skip it.
  4. Copy exactly: Column names, table aliases, and expressions should be copied character-for-character.

  1. 最小修改:仅进行必要的最少修改。更简单的优化更可靠。
  2. 保留结构:除非有明确的好处,否则保留子查询、CTE和整体查询结构。
  3. 存疑则不做:如果不确定修改是否会保留语义,请跳过该优化。
  4. 完全复制:列名、表别名和表达式应逐字符复制。

Priority Order

优先级顺序

  1. Date/time functions on filter columns - Highest impact
  2. Implicit joins to explicit JOIN - Always safe, improves readability
  3. NOT IN to NOT EXISTS - Only if NULL-safe

  1. 过滤列上的日期/时间函数 - 影响最大
  2. 隐式连接转显式JOIN - 始终安全,提升可读性
  3. NOT IN转NOT EXISTS - 仅在NULL安全的情况下进行

Requirements

要求

  • Results must be identical: Same rows, same columns, same order
  • Valid Snowflake SQL: Output must execute without errors in Snowflake
  • 结果必须完全一致:相同的行、相同的列、相同的顺序
  • 有效的Snowflake SQL:输出必须能在Snowflake中无错误执行