optimizing-query-text

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Optimize Query from SQL Text

根据SQL文本优化查询

OUTPUT FORMAT

输出格式

Return ONLY the optimized SQL query. No markdown formatting, no explanations, no bullet points - just pure SQL that can be executed directly in Snowflake.

仅返回优化后的SQL查询。不要使用Markdown格式，不要解释，不要项目符号——只返回可直接在Snowflake中执行的纯SQL。

CRITICAL: Semantic Preservation Rules

关键：语义保留规则

The optimized query MUST return IDENTICAL results to the original.

Before returning ANY optimization, verify:

Same columns: Exact same columns in exact same order with exact same aliases
Same rows: Filter conditions must be semantically equivalent
Same ordering: Preserve
```
ORDER BY
```
exactly as written
Same limits: If original has
```
LIMIT N
```
, keep
```
LIMIT N
```
. If no LIMIT, do NOT add one.

If you cannot guarantee identical results, return the original query unchanged.

优化后的查询必须返回与原查询完全一致的结果。

在返回任何优化结果前，请验证：

相同列：列完全相同，顺序完全一致，别名完全相同
相同行：过滤条件必须在语义上等价
相同排序：严格保留原查询的
```
ORDER BY
```
语句
相同限制：如果原查询有
```
LIMIT N
```
，则保留该限制；如果没有LIMIT，请勿添加。

如果无法保证结果完全一致，请原封不动返回原查询。

Pattern 1: Function on Filter Column

模式1：过滤列上使用函数

Problem: Functions on columns in WHERE clause prevent partition pruning and index usage.

问题：WHERE子句中列上使用函数会阻止分区裁剪和索引使用。

CAN Fix

可优化场景

Original	Optimized	Why Safe
`WHERE DATE(ts) = '2024-01-01'`	`WHERE ts >= '2024-01-01' AND ts < '2024-01-02'`	Equivalent range
`WHERE YEAR(dt) = 2024`	`WHERE dt >= '2024-01-01' AND dt < '2025-01-01'`	Equivalent range
`WHERE MONTH(dt) = 3 AND YEAR(dt) = 2024`	`WHERE dt >= '2024-03-01' AND dt < '2024-04-01'`	Equivalent range
`WHERE DATE(ts) >= '2024-01-01' AND DATE(ts) < '2024-02-01'`	`WHERE ts >= '2024-01-01' AND ts < '2024-02-01'`	Same boundaries
`WHERE YEAR(dt) BETWEEN 1995 AND 1996`	`WHERE dt >= '1995-01-01' AND dt < '1997-01-01'`	Equivalent range

原查询	优化后查询	安全性说明
`WHERE DATE(ts) = '2024-01-01'`	`WHERE ts >= '2024-01-01' AND ts < '2024-01-02'`	等价范围查询
`WHERE YEAR(dt) = 2024`	`WHERE dt >= '2024-01-01' AND dt < '2025-01-01'`	等价范围查询
`WHERE MONTH(dt) = 3 AND YEAR(dt) = 2024`	`WHERE dt >= '2024-03-01' AND dt < '2024-04-01'`	等价范围查询
`WHERE DATE(ts) >= '2024-01-01' AND DATE(ts) < '2024-02-01'`	`WHERE ts >= '2024-01-01' AND ts < '2024-02-01'`	边界条件一致
`WHERE YEAR(dt) BETWEEN 1995 AND 1996`	`WHERE dt >= '1995-01-01' AND dt < '1997-01-01'`	等价范围查询

CANNOT Fix

不可优化场景

Pattern	Why Not
`WHERE YEAR(dt) IN (SELECT year FROM ...)`	Dynamic values, cannot precompute range
`WHERE DATE(ts) = DATE(other_col)`	Comparing two columns, both need function
`WHERE EXTRACT(DOW FROM dt) = 1`	Day-of-week has no contiguous range
`WHERE DATE_TRUNC('month', dt) = '2024-01-01'` in GROUP BY	Needed for grouping logic
`SELECT YEAR(dt) AS yr ... GROUP BY YEAR(dt)`	Function in SELECT/GROUP BY is fine, only filter matters

模式	原因
`WHERE YEAR(dt) IN (SELECT year FROM ...)`	动态值，无法预计算范围
`WHERE DATE(ts) = DATE(other_col)`	比较两列，均需使用函数
`WHERE EXTRACT(DOW FROM dt) = 1`	星期几没有连续范围
`WHERE DATE_TRUNC('month', dt) = '2024-01-01'` 在GROUP BY中	分组逻辑需要该函数
`SELECT YEAR(dt) AS yr ... GROUP BY YEAR(dt)`	SELECT/GROUP BY中使用函数无问题，仅过滤条件需关注

Pattern 2: Function on JOIN Column

模式2：连接列上使用函数

Problem: Functions on JOIN columns prevent hash joins, forcing slower nested loop joins.

问题：JOIN列上使用函数会阻止哈希连接，迫使用户使用更慢的嵌套循环连接。

CAN Fix

可优化场景

Original	Optimized	Why Safe
`ON CAST(a.id AS VARCHAR) = CAST(b.id AS VARCHAR)`	`ON a.id = b.id`	If both are same type (e.g., INTEGER)
`ON UPPER(a.code) = UPPER(b.code)`	`ON a.code = b.code`	If data is already consistently cased
`ON TRIM(a.name) = TRIM(b.name)`	`ON a.name = b.name`	If data has no leading/trailing spaces

原查询	优化后查询	安全性说明
`ON CAST(a.id AS VARCHAR) = CAST(b.id AS VARCHAR)`	`ON a.id = b.id`	如果两者类型相同（例如INTEGER）
`ON UPPER(a.code) = UPPER(b.code)`	`ON a.code = b.code`	如果数据大小写已保持一致
`ON TRIM(a.name) = TRIM(b.name)`	`ON a.name = b.name`	如果数据没有前导/尾随空格

CANNOT Fix

不可优化场景

Pattern	Why Not
`ON CAST(a.id AS VARCHAR) = b.string_id`	Types genuinely differ, CAST required
`ON DATE(a.timestamp) = b.date_col`	Different granularity, DATE() required
`ON UPPER(a.code) = b.code`	If b.code might have different case
`ON a.id = b.id + 1`	Arithmetic transformation, cannot remove

模式	原因
`ON CAST(a.id AS VARCHAR) = b.string_id`	类型确实不同，必须使用CAST
`ON DATE(a.timestamp) = b.date_col`	粒度不同，必须使用DATE()
`ON UPPER(a.code) = b.code`	如果b.code可能存在不同大小写
`ON a.id = b.id + 1`	算术转换，无法移除

Pattern 3: NOT IN Subquery

模式3：NOT IN子查询

Problem:

NOT IN

has poor performance and unexpected NULL behavior.

问题：

NOT IN

性能较差且存在意外的NULL行为。

CAN Fix

可优化场景

Original Optimized Why Safe

Original	Optimized	Why Safe
`WHERE id NOT IN (SELECT id FROM t WHERE ...)`	`WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = main.id AND ...)`	Equivalent when subquery column is NOT NULL
`WHERE id NOT IN (SELECT id FROM t)` where id has NOT NULL constraint	`WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = main.id)`	NOT NULL guarantees equivalence

WHERE id NOT IN (SELECT id FROM t WHERE ...)

WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = main.id AND ...)

Equivalent when subquery column is NOT NULL

WHERE id NOT IN (SELECT id FROM t)

where id has NOT NULL constraint

WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = main.id)

NOT NULL guarantees equivalence

原查询优化后查询安全性说明

原查询	优化后查询	安全性说明
`WHERE id NOT IN (SELECT id FROM t WHERE ...)`	`WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = main.id AND ...)`	当子查询列不为NULL时等价
`WHERE id NOT IN (SELECT id FROM t)` 且id有NOT NULL约束	`WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = main.id)`	NOT NULL约束保证等价性

WHERE id NOT IN (SELECT id FROM t WHERE ...)

WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = main.id AND ...)

当子查询列不为NULL时等价

WHERE id NOT IN (SELECT id FROM t)

且id有NOT NULL约束

WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.id = main.id)

NOT NULL约束保证等价性

CANNOT Fix

不可优化场景

Pattern	Why Not
`WHERE id NOT IN (SELECT nullable_col FROM t)`	If subquery returns NULL, NOT IN returns no rows; NOT EXISTS doesn't
`WHERE (a, b) NOT IN (SELECT x, y FROM t)`	Multi-column NOT IN has complex NULL semantics

Key Rule: Only convert NOT IN to NOT EXISTS if you can verify the subquery column cannot be NULL.

模式	原因
`WHERE id NOT IN (SELECT nullable_col FROM t)`	如果子查询返回NULL，NOT IN不会返回任何行；NOT EXISTS则不会出现此问题
`WHERE (a, b) NOT IN (SELECT x, y FROM t)`	多列NOT IN的NULL语义复杂

核心规则：仅当你能确认子查询列不可能为NULL时，才将NOT IN转换为NOT EXISTS。

Pattern 4: Repeated Subquery

模式4：重复子查询

Problem: Same subquery executed multiple times causes redundant scans.

问题：相同子查询多次执行会导致冗余扫描。

CAN Fix

可优化场景

Original	Optimized
Subquery appears 2+ times identically	Extract to CTE, reference CTE multiple times
Same aggregation used in multiple places	Compute once in CTE

原查询	优化后查询
相同子查询出现2次及以上	提取为CTE，多次引用该CTE
同一聚合在多个地方使用	在CTE中计算一次

CANNOT Fix

不可优化场景

Pattern	Why Not
Correlated subquery (references outer table)	Each execution is different, cannot cache
Subqueries with different filters	Not actually the same subquery
Subquery in SELECT that depends on current row	Correlation prevents extraction

模式	原因
关联子查询（引用外部表）	每次执行结果不同，无法缓存
子查询过滤条件不同	并非真正相同的子查询
SELECT中依赖当前行的子查询	关联关系导致无法提取

Pattern 5: Implicit Comma Joins

模式5：隐式逗号连接

Problem: Comma-separated tables in FROM clause are harder to read and optimize.

问题：FROM子句中用逗号分隔表的写法可读性差且难以优化。

CAN Fix - Always

可优化场景 - 始终可以优化

Convert

FROM a, b, c WHERE a.id = b.id AND b.id = c.id

to explicit JOIN syntax.

This is always safe - just restructuring, no semantic change.

将

FROM a, b, c WHERE a.id = b.id AND b.id = c.id

转换为显式JOIN语法。

这种优化始终安全——只是重构写法，不会改变语义。

UNSAFE Optimizations (NEVER apply)

不安全的优化操作（绝对禁止）

UNION to UNION ALL: UNION deduplicates rows, UNION ALL does not - different results
Changing window functions: Do not modify
```
SUM(SUM(x)) OVER(...)
```
or similar nested aggregates
Adding redundant filters: Do not add filters in JOIN ON if same filter exists in WHERE
Changing column names: Copy column names EXACTLY from original - do not "simplify" or rename
Changing column aliases: Keep all aliases exactly as original
Adding early filtering in JOINs: If a filter is in WHERE, do not duplicate it in JOIN ON clause

将UNION改为UNION ALL：UNION会去重行，UNION ALL不会——结果不同
修改窗口函数：请勿修改
```
SUM(SUM(x)) OVER(...)
```
或类似的嵌套聚合函数
添加冗余过滤条件：如果WHERE子句中已有相同过滤条件，请勿在JOIN ON中重复添加
修改列名：完全复制原查询的列名——不要“简化”或重命名
修改列别名：严格保留所有原别名
在JOIN中提前过滤：如果过滤条件在WHERE中，请勿在JOIN ON子句中重复

Principles

原则

Minimal changes: Make the fewest changes necessary. Simpler optimizations are more reliable.
Preserve structure: Keep subqueries, CTEs, and overall query structure unless there's a clear benefit.
When in doubt, don't: If unsure whether a change preserves semantics, skip it.
Copy exactly: Column names, table aliases, and expressions should be copied character-for-character.

最小修改：仅进行必要的最少修改。更简单的优化更可靠。
保留结构：除非有明确的好处，否则保留子查询、CTE和整体查询结构。
存疑则不做：如果不确定修改是否会保留语义，请跳过该优化。
完全复制：列名、表别名和表达式应逐字符复制。

Priority Order

优先级顺序

Date/time functions on filter columns - Highest impact
Implicit joins to explicit JOIN - Always safe, improves readability
NOT IN to NOT EXISTS - Only if NULL-safe

过滤列上的日期/时间函数 - 影响最大
隐式连接转显式JOIN - 始终安全，提升可读性
NOT IN转NOT EXISTS - 仅在NULL安全的情况下进行

Requirements

要求

Results must be identical: Same rows, same columns, same order
Valid Snowflake SQL: Output must execute without errors in Snowflake

结果必须完全一致：相同的行、相同的列、相同的顺序
有效的Snowflake SQL：输出必须能在Snowflake中无错误执行