databricks-sql-autotuner


# Databricks SQL Autotuner

Analyze, rewrite, validate, and benchmark SQL queries against a live Databricks cluster. The optimized query must produce identical results and show a statistically meaningful speedup before it is accepted.

## Examples

### Minimal — paste an inline query, discover everything automatically

```
/databricks-sql-autotuner "SELECT o.order_id, SUM(l.amount) FROM orders o JOIN lines l ON o.id = l.order_id GROUP BY o.order_id"
```

### From a file, cluster by ID

```
/databricks-sql-autotuner --cluster-id 0408-195905-abc --profile premium @queries/slow_report.sql
```

### From a file, cluster by name

```
/databricks-sql-autotuner --cluster-name my-cluster --profile premium @queries/slow_report.sql
```

### Optimize for speed, 5 benchmark runs

```
/databricks-sql-autotuner --cluster-id 0408-195905-abc --goals speed --runs 5 @queries/slow_report.sql
```

### Optimize for speed first, then simplify

```
/databricks-sql-autotuner --cluster-id 0408-195905-abc --goals speed,simplicity @queries/slow_report.sql
```

### Override catalog and schema

```
/databricks-sql-autotuner --cluster-id 0408-195905-abc --catalog sales --schema prod @queries/slow_report.sql
```

---

## Reference files

Read these before writing the optimized query:

| File | When to read |
| --- | --- |
| references/spark-sql-hints.md | Any time you are considering a hint (`BROADCAST`, `MERGE`, `SHUFFLE_HASH`, `REBALANCE`, etc.) or dealing with UNION ALL scoping |
| references/spark-sql-perf-tuning.md | For AQE behavior, partition tuning, statistics interpretation, and the full SQL-level optimization checklist |

## Phase 0 — Parse input

The user invokes with:

```
/databricks-sql-autotuner [options] <query-or-path>
```

Supported options (all optional — skip any discovery step where the value is provided):

| Option | Example | Effect |
| --- | --- | --- |
| `--cluster-id <id>` | `--cluster-id 0408-195905-abc` | Skip cluster listing; use this cluster ID directly |
| `--cluster-name <name>` | `--cluster-name my-cluster` | Skip cluster listing; find the cluster by name |
| `--profile <name>` | `--profile premium` | Skip profile selection; use this profile |
| `--catalog <name>` | `--catalog my_catalog` | Skip catalog discovery; use this value |
| `--schema <name>` | `--schema my_schema` | Skip schema discovery; use this value |
| `--runs <n>` | `--runs 5` | Override the default benchmark run count (default: 3) |
| `--goals <list>` | `--goals speed` | Comma-separated goals: `speed`, `simplicity`, or `speed,simplicity` |
| `--query <sql-or-path>` | `--query @slow.sql` | SQL string or `@path/to/file.sql`; if omitted, ask the user |
| `--optimized <path>` | `--optimized out.sql` | Write the optimized query to a separate file instead of editing the original in place |
| `--timing-count` | `--timing-count` | Wrap timing runs in `COUNT(*)` to avoid collecting large result sets to the driver (use when the query returns millions of rows) |

Record the goals as `GOALS` (default: `speed`). Accepted values:

| Value | Meaning |
| --- | --- |
| `speed` | Minimize execution time |
| `simplicity` | Minimize the complexity score (lines, nesting, subqueries) |
| `speed,simplicity` | Optimize speed first; once accepted, simplify while staying within 10% of the best time |

Parse these from the invocation arguments before starting Phase 1. For any option not supplied, run the normal discovery step.

The `--query` value:

- `@path/to/file.sql` → read the file
- Any other string → treat as inline SQL
- Omitted → ask the user to paste or provide the query
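The `--query` dispatch rules can be sketched as a small helper. This is an illustrative sketch only, not part of `tune.py`; the function name `classify_query_arg` is hypothetical.

```python
def classify_query_arg(value):
    """Apply the --query dispatch rules:
    @path -> read the file, any other string -> inline SQL, omitted -> ask the user."""
    if value is None:
        return ("ask", None)            # prompt the user to paste or provide the query
    if value.startswith("@"):
        return ("file", value[1:])      # strip the @ and read this path
    return ("inline", value)            # treat the string itself as SQL


# One example per rule:
print(classify_query_arg("@queries/slow_report.sql"))  # ('file', 'queries/slow_report.sql')
print(classify_query_arg("SELECT 1"))                  # ('inline', 'SELECT 1')
print(classify_query_arg(None))                        # ('ask', None)
```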

## Phase 1 — Environment discovery

### 1.1 Verify the Databricks CLI is authenticated

```bash
databricks auth profiles
```

If this fails or returns no profiles, stop and ask the user to authenticate:

```bash
databricks auth login --profile <name>
```

### 1.2 Select profile

If `--profile` was provided, skip this step.

Otherwise present all available profiles with their workspace host URLs. Auto-select if only one exists; otherwise ask.

### 1.3 Select cluster

If `--cluster-id` was provided, use that ID directly and skip the listing.

If `--cluster-name` was provided, resolve it to an ID:

```bash
databricks clusters list --profile <PROFILE> --output json \
  | python3 -c "
import sys, json
clusters = json.load(sys.stdin)
name = '<CLUSTER_NAME>'
match = [c for c in clusters if c.get('cluster_name') == name]
if not match:
    print('ERROR: no cluster named', name, file=sys.stderr); sys.exit(1)
print(match[0]['cluster_id'])
"
```

If neither was provided, list all clusters:

```bash
databricks clusters list --profile <PROFILE> --output json
```

Filter to clusters with `state = RUNNING` or `state = TERMINATED` (can be started). Present the list (cluster ID, name, state, DBR version). Auto-select any running general-purpose cluster if one exists; otherwise ask.

If the chosen cluster is terminated, start it:

```bash
databricks clusters start <CLUSTER_ID> --profile <PROFILE>
```

### 1.4 Get the DBR version

```bash
databricks clusters get <CLUSTER_ID> --profile <PROFILE> --output json \
  | python3 -c "import sys,json; v=json.load(sys.stdin)['spark_version']; print(v.split('-')[0])"
```

Record as `DBR_VERSION` (e.g., `17.3`). This determines which `databricks-connect` version to install.

### 1.5 Catalog / schema context

Run this to discover the current defaults:

```bash
databricks clusters get <CLUSTER_ID> --profile <PROFILE> --output json \
  | python3 -c "import sys,json; c=json.load(sys.stdin); \
    cfg=c.get('spark_conf',{}); \
    print('catalog:', cfg.get('spark.databricks.sql.initial.catalog.name','hive_metastore')); \
    print('schema:', cfg.get('spark.databricks.sql.initial.catalog.namespace','default'))"
```

Use whatever the cluster reports as `CATALOG` and `SCHEMA`. Only ask the user to override if the query explicitly references a different catalog/schema or if the command returns nothing useful.

### 1.6 Benchmark runs

If `--runs` was provided, skip this step and use that value.

Otherwise ask: How many benchmark runs per query variant? (default: 3)

Record as `N_RUNS` (default `3`, minimum `2`).

### 1.7 Confirm

Present a summary and wait for confirmation:

| Parameter | Value |
| --- | --- |
| Profile | |
| Cluster ID | |
| DBR version | |
| Catalog | |
| Schema | |
| N runs | |

## Phase 2 — Environment setup

### 2.0 Resolve the skill directory

The scripts live inside the skill, not the project. Locate them first:

```bash
SKILL_DIR=$(find ~/.claude/skills/databricks-sql-autotuner \
                  "$(git rev-parse --show-toplevel 2>/dev/null)/.claude/skills/databricks-sql-autotuner" \
             -maxdepth 0 -type d 2>/dev/null | head -1)
echo "SKILL_DIR=$SKILL_DIR"
```

If neither path exists, check where Claude Code installed the skill:

```bash
find ~ -path "*/.claude/skills/databricks-sql-autotuner" -maxdepth 6 -type d 2>/dev/null | head -1
```

Record as `SKILL_DIR`. All script paths below use `$SKILL_DIR/scripts/`.

### 2.1 Check uv

```bash
uv --version
```

If missing:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

### 2.2 Run setup

```bash
python3 "$SKILL_DIR/scripts/env_setup.py" \
  --dbr-version <DBR_VERSION> \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID>
```

This uses `uv venv` to create `.venv_autotuner/`, installs `databricks-connect==<DBR_VERSION>.*`, and verifies the connection with a `SELECT 1` smoke test.

All subsequent commands invoke the venv Python directly:

```bash
VENV_PYTHON=".venv_autotuner/bin/python"          # macOS / Linux
# Windows: VENV_PYTHON=".venv_autotuner\Scripts\python.exe"

TUNE="$VENV_PYTHON $SKILL_DIR/scripts/tune.py"
```

Use `$TUNE` for every `tune.py` invocation from here on.

---

## Phase 2b — Branch & baseline

Every tuning session gets its own branch and a results log. This keeps the work isolated, makes it easy to compare iterations, and produces a clean git history that shows exactly what was tried and kept.

### 2b.1 Generate a run ID

Use a short slug derived from the query's purpose — e.g. `sales-summary`, `user-funnel-join`.

```bash
RUN_ID="<query-slug>"           # e.g. sales-summary
RESULTS_FILE="sqltune-${RUN_ID}.tsv"
LOG_FILE="sqltune-${RUN_ID}.log"
```
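Deriving the slug can be sketched as a one-liner that keeps the result branch-safe. This is an illustrative sketch; the function name `make_run_id` and the 40-character cap are assumptions, not part of the skill's scripts.

```python
import re


def make_run_id(description):
    """Derive a short, branch-safe slug from a free-text description of the query:
    lowercase, non-alphanumeric runs collapsed to hyphens, capped at 40 chars."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return slug[:40]


print(make_run_id("Sales Summary (Q3)"))   # sales-summary-q3
```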

### 2b.2 Create a branch

```bash
git checkout -b sql-tune/${RUN_ID}
```

### 2b.3 Establish the working file

Determine `QUERY_FILE` (the file that will be edited each iteration) and `ORIGINAL_FILE` (the baseline passed to `--original` in tune.py):

| Scenario | ORIGINAL_FILE | QUERY_FILE |
| --- | --- | --- |
| `--query @file.sql` (default) | file.sql | file.sql — edit in place |
| `--query @file.sql --optimized out.sql` | file.sql | out.sql — write here, the original stays untouched |
| `--query "inline SQL"` | query.sql (write it, commit) | query.sql — edit in place |
| `--query "inline SQL" --optimized out.sql` | query.sql | out.sql |

When `--optimized` is given, copy the original into that file first:

```bash
cp $ORIGINAL_FILE $QUERY_FILE
git add $QUERY_FILE
```

When editing in place (no `--optimized`), `ORIGINAL_FILE == QUERY_FILE` and git tracks the evolution naturally.

### 2b.4 Initialize the results log

```bash
echo "${RESULTS_FILE}" >> .git/info/exclude
echo "${LOG_FILE}"     >> .git/info/exclude
```

Create `$RESULTS_FILE`:

```
attempt	commit	mean_s	speedup	status	description
0	<sha>	<baseline_mean>	1.0	baseline	original query
```
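Appending one attempt row in that tab-separated layout can be sketched as a small helper. This is an illustrative sketch; `append_attempt` is a hypothetical name, not something `tune.py` provides.

```python
def append_attempt(results_file, attempt, commit, mean_s, baseline_mean, status, description):
    """Append one row to the TSV results log, computing speedup against the baseline mean.
    Columns match the header: attempt, commit, mean_s, speedup, status, description."""
    speedup = baseline_mean / mean_s
    row = f"{attempt}\t{commit}\t{mean_s:.2f}\t{speedup:.2f}\t{status}\t{description}\n"
    with open(results_file, "a") as f:
        f.write(row)
    return row


# e.g. append_attempt("sqltune-sales-summary.tsv", 1, "a1b2c3d",
#                     8.40, 12.60, "accepted", "broadcast dim_country")
```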

### 2b.5 Run the baseline

```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --original @$QUERY_FILE \
  --optimized @$QUERY_FILE \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --n-runs <N_RUNS> > $LOG_FILE 2>&1
```

Extract `original.mean_s` and record it as attempt `0`. Then report:

```
Baseline: mean = X.XXs. Branch: sql-tune/<run-id>. Starting analysis.
```
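The acceptance rule ("a statistically meaningful speedup") can be sketched as a timing check over the per-run results. This is an illustrative sketch only: the 1.1× threshold and the no-overlap condition are assumptions, not values defined by the skill.

```python
def is_meaningful_speedup(baseline_runs, candidate_runs, min_speedup=1.1):
    """Require the mean speedup to clear a threshold AND every candidate run to beat
    the fastest baseline run, so one lucky run can't win. Thresholds are illustrative."""
    base_mean = sum(baseline_runs) / len(baseline_runs)
    cand_mean = sum(candidate_runs) / len(candidate_runs)
    speedup = base_mean / cand_mean
    accepted = speedup >= min_speedup and max(candidate_runs) < min(baseline_runs)
    return accepted, speedup


ok, speedup = is_meaningful_speedup([10.2, 11.0, 11.8], [5.1, 5.4, 5.8])
print(ok, round(speedup, 2))   # True 2.02
```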

## Phase 3 — Query analysis

Start with the execution plan — it tells you what the optimizer actually decided. Collect table stats only when the plan raises a question that size or schema data would answer (e.g. "is this table small enough to broadcast?").

### 3.1 Get the execution plan

```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --original "<QUERY_OR_@FILE>" \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --explain-only
```

Read the plan and identify bottlenecks:

| Bottleneck signal | What to look for |
| --- | --- |
| BroadcastNestedLoopJoin | Missing join key or a cartesian product — add an explicit join condition |
| SortMergeJoin | May be improvable with BROADCAST if the smaller side is small enough — check the table size |
| Exchange (shuffle) | Heavy repartitioning — check GROUP BY / JOIN key cardinality |
| FileScan with no PartitionFilters | The filter may not be using partition columns — worth checking the schema |
| Filter after scan (not pushed down) | Move the filter earlier or rewrite WHERE to use partition columns |
| Repeated subquery | Extract to a CTE or use a window function |
| UNION ALL + aggregate | Check whether ROLLUP or a window function can replace it |
| UDF calls on large datasets | Consider replacing with built-in Spark SQL functions |
| Large row count with a high-null join key | Add an `IS NOT NULL` filter before the join to reduce shuffle size |
| SubqueryAlias / deeply nested plan where a FileScan is expected | The join target is a view — read the view definition (step 3.2a) |
| Skewed Exchange — one partition vastly larger than the others | The GROUP BY key has low or highly uneven cardinality — consider SQL salting (see Phase 4) |
| `IN (subquery)` | Often generates a worse plan than `EXISTS` or a semi-join — rewrite if the subquery is large |
| `ObjectHashAggregate` or `ArrowEvalPython` nodes | Photon cannot accelerate these — check for Python UDFs, `TRANSFORM`/`FILTER` on arrays, or `MAP` operations |

After reading the plan, decide: can you already identify the optimization, or do you need more information about the tables?

- If the bottleneck is clear (e.g. a correlated subquery, a missing NULL filter, `LOWER()` on a filter column) — go straight to Phase 4.
- If the bottleneck depends on physical table properties (file size, partition columns, column cardinality) — collect stats for just those tables in 3.2.
- If a join target appears to be a view (the plan shows a nested subquery or a `SubqueryAlias` node where you expect a FileScan) — read the view definition in 3.2a before deciding what to optimize.
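A first pass over the signal table can be mechanized as a simple substring scan of the EXPLAIN output. This is an illustrative sketch: the signal list is abbreviated, the exact node strings vary by Spark/DBR version, and `scan_plan` is a hypothetical helper, not part of `tune.py`.

```python
# Abbreviated map of plan-node substrings to the bottleneck they suggest.
PLAN_SIGNALS = {
    "BroadcastNestedLoopJoin": "missing join key or cartesian product",
    "SortMergeJoin": "check whether the smaller side can be BROADCAST",
    "PartitionFilters: []": "FileScan without partition pruning",
}


def scan_plan(plan_text):
    """Return the bottleneck signals found in an EXPLAIN output (substring match)."""
    return {node: why for node, why in PLAN_SIGNALS.items() if node in plan_text}


plan = "== Physical Plan ==\nSortMergeJoin [id], [order_id]\n+- Exchange hashpartitioning(id, 200)"
print(scan_plan(plan))   # {'SortMergeJoin': 'check whether the smaller side can be BROADCAST'}
```

This only pre-screens the plan; the table's judgment calls (skew, cardinality, view detection) still need a human read.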

### 3.2a Read view definitions (when a join target is a view)

The EXPLAIN plan shows a view's underlying query inlined as a subquery tree. If you see a deeply nested plan where a simple table join should be, the view itself may be the bottleneck — not the outer query.

To read the view definition:

```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --original "SHOW CREATE TABLE <view_name>" \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --explain-only
```

Or collect it alongside table stats (the `--table-stats` flag works for views too).

When you read a view definition, look for:

- Filters that could be pushed closer to the source scan
- Joins inside the view that are better placed in the outer query (where the caller already filters heavily before joining)
- Aggregations computed in the view that the caller immediately re-aggregates
- DISTINCT or ORDER BY inside the view that serve no purpose when it is used as a subquery
- Correlated subqueries or scalar subqueries that run once per row

What to do if the view is suboptimal: you have two options — pick the one that is less invasive:

1. Inline the view as a CTE — replace `JOIN my_view ON ...` with `WITH my_view AS (<rewritten definition>) ...` in the optimized query. This avoids changing the view in the catalog and keeps the fix self-contained.
2. Suggest a view rewrite — if the view is used widely and inlining would make the outer query unreadable, write an optimized version of the view definition separately and note it under "Additional recommendations" in the final report. Do not change the view in the catalog.

Always prefer option 1 unless the view is complex enough that inlining it makes the outer query harder to reason about than it was before.

### 3.2 Collect table stats (when needed)

Only run this step when the execution plan leaves a question that physical metadata would answer. Pass only the tables relevant to the bottleneck — not every table in the query.

```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --table-stats <TABLE1> [<TABLE2> ...]
```

What this collects per table and when it matters:

| Section | Source | When you need it |
| --- | --- | --- |
| Schema | `DESCRIBE table` | Partition columns unknown; filter pushdown uncertain |
| Partition columns | `DESCRIBE EXTENDED` | The plan shows a FileScan without PartitionFilters |
| Delta detail | `DESCRIBE DETAIL` | The plan has a SortMergeJoin — need sizeInBytes to decide on BROADCAST |
| Table stats | `DESCRIBE EXTENDED` | The optimizer picked a bad join strategy; stats may be stale or missing |
| Column stats | `DESCRIBE EXTENDED t col` | High Exchange cost; need distinctCount/nullCount to assess skew |

Use these facts to make decisions:

| Fact | Decision |
| --- | --- |
| Delta detail `sizeInBytes` < 200 MB | Safe to BROADCAST this table in a join |
| Delta detail `numFiles` is very high | Consider Z-ordering or compaction (note as a recommendation, not a SQL change) |
| Avg file size < 32 MB | Small-files problem — predicate pushdown is likely hurting more than helping |
| No partition columns | All filters are post-scan; look for other pushdown opportunities |
| Filter column IS a partition column | Verify the WHERE clause actually uses it for pruning |
| Column `distinctCount` is low | High skew risk in GROUP BY or JOIN on that column |
| Column `nullCount` is high | NULL-safe joins may inflate row counts unexpectedly |
| Row count from DESCRIBE ≠ actual COUNT(*) | Stats are stale — note this, but do not run ANALYZE (it can be slow) |
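The size-based rows of the decision table reduce to two arithmetic checks. A minimal sketch, assuming `sizeInBytes` and `numFiles` come straight from `DESCRIBE DETAIL`; the function names and thresholds mirror the table above but are otherwise hypothetical.

```python
def broadcast_safe(size_in_bytes, threshold_mb=200):
    """sizeInBytes rule: a table under ~200 MB is safe to BROADCAST in a join."""
    return size_in_bytes < threshold_mb * 1024 * 1024


def small_files_problem(size_in_bytes, num_files, min_avg_mb=32):
    """Small-files rule: flag when the average file size falls below ~32 MB."""
    return num_files > 0 and (size_in_bytes / num_files) < min_avg_mb * 1024 * 1024


print(broadcast_safe(18 * 1024 * 1024))         # True — an 18 MB dim table can be broadcast
print(small_files_problem(10 * 1024**3, 2000))  # True — ~5 MB average file size
```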

## Phase 3.3 — Build an attack plan

Before writing a single line of optimized SQL, write a short numbered list of the specific changes you intend to make and why. Show it to the user (or include it in your reasoning) before proceeding to Phase 4.

Each item should name:

- What you will change (e.g. "replace SortMergeJoin with a BROADCAST hint on `dim_country`")
- Why (e.g. "Delta detail shows sizeInBytes = 18 MB, well under the 200 MB threshold")
- The expected effect (e.g. "eliminates one Exchange shuffle node")

Example:

```
Attack plan:
1. Add /*+ BROADCAST(dim_country) */ — sizeInBytes 18 MB, plan currently SortMergeJoin → eliminates shuffle
2. Replace LOWER(email) = ? filter with COLLATE UTF8_LCASE — restores Delta file-skipping
3. Extract repeated subquery into CTE — currently executed once per row in the outer SELECT
4. Add IS NOT NULL guard on user_id before the JOIN — 12% null rate inflates shuffle size
```

If you have no concrete changes to make (the plan looks fine or the bottleneck is outside SQL), say so explicitly and explain what you found.

## Phase 4 — SQL optimization

Output contract: the only thing this skill produces is a rewritten SQL query. No Python code. No cluster config. No DataFrame API. No schema changes. A plain SQL string — same SELECT structure, same columns, same output — just faster. SQL hints (`/*+ BROADCAST(...) */`) and SQL comments are allowed.

Rewrite the query using SQL-level rewrites only.

### What is allowed

- Rewriting JOINs (inner ↔ left, reordering join tables for a better plan)
- Adding / removing CTEs for de-duplication and readability
- Replacing correlated subqueries with joins or window functions
- Adding predicate filters earlier in the plan (push filters closer to the scan)
- Rewriting DISTINCT as GROUP BY where appropriate
- Replacing UDFs with equivalent built-in Spark SQL functions
- Splitting complex expressions to help predicate pushdown
- SQL optimizer hints: `/*+ BROADCAST(t) */`, `/*+ MERGE(t) */`, `/*+ SHUFFLE_HASH(t) */`
- SQL comments (`--` or `/* */`) explaining what the optimization does
- Inlining a view as a CTE — if a join target is a view whose definition is suboptimal, replace the view reference with `WITH view_name AS (<rewritten SQL>)` in the optimized query. The output is still a single self-contained SQL string.
- Rewriting `IN (subquery)` as `EXISTS` or a semi-join — Spark often builds a better plan for `EXISTS` / `LEFT SEMI JOIN` than for `IN` with a large subquery
- Skew salting — when a GROUP BY key is heavily skewed, a two-pass SQL salt distributes the hot partition across workers (see the salting pattern below)

### Skew salting pattern

When the plan shows a highly skewed Exchange on a GROUP BY key, salt the aggregation:

```sql
-- Pass 1: aggregate with a random salt to spread the hot key
WITH salted AS (
  SELECT
    key,
    FLOOR(RAND() * 8) AS salt,  -- 8 buckets; tune to the cluster core count
    SUM(value) AS partial_sum
  FROM large_table
  GROUP BY key, salt
)
-- Pass 2: collapse the salted buckets
SELECT key, SUM(partial_sum) AS total
FROM salted
GROUP BY key
```

Only use this when column stats or a skewed Exchange confirm that one key dominates. The extra aggregation pass adds overhead on balanced data — don't apply it speculatively.

### CTE materialization caveat

Spark does not guarantee that a CTE is computed only once. If a CTE is referenced multiple times in the query, Spark may re-evaluate it on each reference. This means:

- A CTE used to "cache" an expensive subquery may not help — or may hurt if the subquery is scanned multiple times instead of once.
- If you add a CTE to de-duplicate a repeated subquery and the plan still shows the subquery running multiple times, note `CACHE TABLE` as a recommendation outside SQL scope.
- Prefer CTEs for readability and hint targeting; don't assume they imply materialization.
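One way to check whether a CTE was actually re-evaluated is to count the source table's scan nodes in the EXPLAIN output. A rough sketch, assuming Delta/Parquet scans appear as `FileScan ... <table>` lines (the exact format varies by Spark/DBR version); `count_table_scans` is a hypothetical helper.

```python
import re


def count_table_scans(plan_text, table_name):
    """Count scan nodes referencing a table in an EXPLAIN output. If a CTE over the
    table appears N times in the physical plan, the table is being scanned N times."""
    pattern = r"FileScan[^\n]*\b" + re.escape(table_name) + r"\b"
    return len(re.findall(pattern, plan_text))


plan = (
    "FileScan parquet sales.orders [...]\n"
    "FileScan parquet sales.orders [...]\n"
)
print(count_table_scans(plan, "orders"))   # 2 — the CTE was evaluated twice
```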

### Photon-unfriendly patterns

Databricks Photon accelerates most SQL but silently falls back to JVM execution for:

| Pattern | Why Photon can't accelerate it |
| --- | --- |
| Python / Scala UDFs | Black-box execution; no vectorization possible |
| `TRANSFORM(array, x -> ...)` | Higher-order functions are not yet Photon-native |
| `FILTER(array, x -> ...)` | Same — lambda expressions bypass Photon |
| `MAP_KEYS` / `MAP_VALUES` on complex maps | Complex-type operations are not vectorized |
| Non-deterministic functions in certain contexts | Photon cannot reorder them safely |

If the plan shows `ArrowEvalPython`, `ObjectHashAggregate`, or `BatchEvalPython` nodes, a Photon fallback is happening. Flag it in the attack plan and suggest replacing the pattern with built-in Spark SQL equivalents where possible.
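Checking for those three fallback node names can be sketched as a substring scan of the EXPLAIN output. An illustrative sketch; `photon_fallbacks` is a hypothetical helper, and node names can vary across DBR versions.

```python
# Node names that indicate Photon has fallen back to JVM/Python execution.
PHOTON_FALLBACK_NODES = ("ArrowEvalPython", "ObjectHashAggregate", "BatchEvalPython")


def photon_fallbacks(plan_text):
    """Return which Photon-fallback node names appear in an EXPLAIN output."""
    return [node for node in PHOTON_FALLBACK_NODES if node in plan_text]


plan = "HashAggregate\n+- ArrowEvalPython [my_udf(email)]\n   +- FileScan parquet users"
print(photon_fallbacks(plan))   # ['ArrowEvalPython'] — a Python UDF is forcing a fallback
```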

### What is NOT allowed — never suggest these

- `spark.conf.set(...)` or any cluster/session configuration
- DataFrame / PySpark API code (`.filter()`, `.join()`, `.groupBy()`, etc.)
- DDL changes (`ALTER TABLE`, `OPTIMIZE`, `ZORDER`, `VACUUM`, `ANALYZE TABLE`)
- Any rewrite that changes the columns, row count, or values in the output

If a non-SQL improvement (e.g., Z-ordering, cluster sizing) would help, mention it separately under "Additional recommendations (outside scope)" in the final report — but never include it in the optimized query itself.

Hint restriction — CRITICAL for UNION ALL


Spark SQL does not allow /*+ ... */ hints inside CTE branches that are part of a UNION ALL. Place all optimizer hints on the outermost SELECT only:

```sql
-- ✅ CORRECT
SELECT /*+ BROADCAST(small_table) */ *
FROM big_table
JOIN small_table ON ...

-- ❌ WRONG — hint inside a UNION ALL branch causes a parse error
WITH branch AS (
  SELECT /*+ BROADCAST(t) */ * FROM t   -- this will fail
)
SELECT * FROM branch
UNION ALL
SELECT * FROM other
```

UDF handling


If the query calls UDFs, check whether each is:
  • A persistent SQL function (created with CREATE FUNCTION in the catalog) — these work as-is
  • A session UDF that must be registered at runtime — create a udf_setup.py file (see below)

If session UDFs are needed, create udf_setup.py in the current directory:

udf_setup.py — loaded by tune.py automatically if it exists


```python
from pyspark.sql.functions import udf as spark_udf
from pyspark.sql.types import StringType

def _my_udf(value):
    # your implementation
    return result

def register_udfs(spark):
    spark.udf.register("my_udf_name", spark_udf(_my_udf, StringType()))
```

`tune.py` checks for `udf_setup.py` in the current directory and calls
`register_udfs(spark)` automatically before running any queries.
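The auto-loading can be pictured as follows. This is an assumed sketch of tune.py's behavior based on the description above, not its actual source:

```python
import importlib.util
import os

def load_session_udfs(spark, path="udf_setup.py"):
    """Import udf_setup.py if present and call its register_udfs(spark) hook.

    Returns True if the file was found and loaded, False otherwise.
    (Assumed behavior of tune.py; illustrative only.)
    """
    if not os.path.exists(path):
        return False
    spec = importlib.util.spec_from_file_location("udf_setup", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    module.register_udfs(spark)
    return True
```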

---

Phase 5 — Validate, benchmark & log


5.1 Edit and commit


Edit $QUERY_FILE directly — the branch is isolated so edits are safe to make in place.

```bash
git add $QUERY_FILE
git commit -m "sql-tune: attempt <N> — <one-line description of what changed>"
```

Commit before benchmarking so every attempt is in the git log regardless of outcome.

5.2 Run the benchmark


```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --original @$QUERY_FILE \
  --optimized @$QUERY_FILE \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --n-runs <N_RUNS> > $LOG_FILE 2>&1
```

(Both --original and --optimized point to the same file — tune.py compares the current version to the baseline it runs internally.)

The script outputs JSON with:
  • validation.passed — whether results are byte-for-byte identical
  • validation.row_count — number of rows compared
  • original.mean_s, original.std_s — baseline timing stats
  • optimized.mean_s, optimized.std_s — optimized-query timing stats
  • speedup — ratio of the original mean to the optimized mean
  • statistically_significant — true if the optimized confidence interval lies entirely below the original's

5.3 Decide: keep or revert


A speedup is real only when:
  1. validation.passed = true (identical results)
  2. speedup > 1.0 (optimized is faster on average)
  3. statistically_significant = true (confidence intervals don't overlap)

✅ IMPROVED  → keep commit, update BEST_MEAN = optimized.mean_s
❌ SAME/WORSE → revert the file to the last kept commit:
               git checkout HEAD $QUERY_FILE
💥 VALIDATION FAIL → fix the query and re-benchmark before reverting
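The three criteria combine into a single predicate. Field names follow the tune.py JSON described in 5.2; the function itself is illustrative:

```python
def is_real_speedup(report):
    """Apply the three acceptance criteria above to a parsed tune.py report."""
    return (
        report["validation"]["passed"]            # 1. identical results
        and report["speedup"] > 1.0               # 2. faster on average
        and report["statistically_significant"]   # 3. CIs don't overlap
    )

print(is_real_speedup({
    "validation": {"passed": True},
    "speedup": 1.54,
    "statistically_significant": True,
}))  # True
```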

5.4 Log the attempt


Append to $RESULTS_FILE:
<N>	<commit-sha>	<optimized.mean_s>	<speedup>	<keep|discard>	<description>
If validation fails: fix the query and re-benchmark before doing anything else.
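A sketch of the append step — the column order matches the template above, and the helper name is illustrative:

```python
def log_attempt(results_file, n, sha, mean_s, speedup, status, description):
    """Append one tab-separated attempt row to the results file."""
    row = "\t".join(
        [str(n), sha, f"{mean_s:.2f}", f"{speedup:.2f}", status, description]
    )
    with open(results_file, "a") as f:
        f.write(row + "\n")
```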


Phase 5b — Autonomous optimization loop


Run continuously after the first benchmark. Never pause to ask "should I continue?". Stop only when the user interrupts or explicitly says they're satisfied.
THINK   Read $RESULTS_FILE and the current best query.
        Study the EXPLAIN of the best query so far — not the original.
        Form a specific hypothesis: "changing X should reduce Y because Z."
        Follow the strategy priority below.

EDIT    Edit $QUERY_FILE directly — one focused change per attempt.
        The branch is isolated; edits are safe to make in place.

COMMIT  git add $QUERY_FILE && git commit -m "sql-tune: attempt <N> — <description>"
        Commit before benchmarking — this records what was tried.

RUN     $TUNE \
          --profile <PROFILE> --cluster-id <CLUSTER_ID> \
          --original @$ORIGINAL_FILE --optimized @$QUERY_FILE \
          --catalog <CATALOG> --schema <SCHEMA> \
          --n-runs <N_RUNS> > $LOG_FILE 2>&1

        # For simplicity / both goals, also score complexity
        python3 "$SKILL_DIR/scripts/complexity.py" --json $QUERY_FILE > complexity.json

MEASURE Extract from $LOG_FILE (timing) and complexity.json (score).
        On crash: read last 50 lines for the error. Attempt up to 2 quick fixes,
        amend the commit, re-run. If still broken, revert and discard.

        Metric to optimize depends on GOALS:
        ┌──────────────────────┬────────────────────────────────────────────────────┐
        │ speed                │ optimized.mean_s  (lower is better)                │
        │ simplicity           │ complexity score  (lower is better)                │
        │ speed,simplicity     │ Phase 1: optimize mean_s (same as speed)           │
        │                      │ Phase 2: once speed accepted, optimize score        │
        │                      │          while mean_s stays ≤ BEST_MEAN * 1.10     │
        └──────────────────────┴────────────────────────────────────────────────────┘

DECIDE  Compare metric to BEST:
        speed             → ✅ if mean_s improved + statistically significant
        simplicity        → ✅ if complexity score decreased + validation passes
        speed,simplicity  → phase 1: same as speed
                            phase 2: ✅ if score decreased AND mean_s ≤ BEST_MEAN * 1.10

        ❌ no improvement → git checkout HEAD $QUERY_FILE   (revert file, keep commit msg in log)
        💥 VALIDATION FAIL → fix semantics first, then re-benchmark

LOG     Append to $RESULTS_FILE:
        <N>\t<sha>\t<mean_s>\t<complexity_score>\t<status>\t<description>
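The goal-dependent DECIDE rules above can be condensed into one function. The names here (phase, best_mean, best_score) are illustrative, and validation must already have passed before this check applies:

```python
def accept(goals, phase, mean_s, score, significant, best_mean, best_score):
    """Return True if this attempt improves on the best so far,
    per the DECIDE table above. Assumes validation already passed."""
    if goals == "speed":
        return mean_s < best_mean and significant
    if goals == "simplicity":
        return score < best_score
    if goals == "speed,simplicity":
        if phase == 1:  # phase 1 behaves exactly like the speed goal
            return mean_s < best_mean and significant
        # phase 2: simpler, while staying within 10% of the best speed
        return score < best_score and mean_s <= best_mean * 1.10
    raise ValueError(f"unknown goals: {goals}")

print(accept("speed,simplicity", 2, 10.5, 40, False, 10.0, 50))  # True
```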

Strategy priority


  1. Follow the plan — work through the attack plan from Phase 3.3 in order
  2. Follow wins — if a hint or rewrite helped, probe further in that direction
  3. Diversify after 3 consecutive discards — switch to a different bottleneck
  4. Combine winners — if A and B each improved independently, try A+B together
  5. Diagnose no-speedup runs — re-EXPLAIN the optimized query; check if AQE already did what the hint asked, or if a Photon fallback is dominating
  6. Accept the floor — if 5+ consecutive attempts yield no improvement and the remaining ideas are speculative, stop the loop and go to Phase 6
For the simplicity or speed,simplicity goals, the strategy shifts focus (in the simplicity phase):
  • Prefer removing CTEs that the optimizer re-evaluates anyway
  • Inline single-use subqueries where they don't increase nesting
  • Replace multi-step transformations with a single built-in function call
  • Remove redundant ORDER BY, DISTINCT, or GROUP BY that the caller drops
  • Simplicity wins don't require statistical significance — a lower score is enough
Hard constraints that apply to all goals, including simplicity:
  • Never introduce SELECT * — explicit column lists must be preserved exactly. SELECT * hides schema changes, breaks downstream consumers, and can silently change column order. A query with fewer lines but a SELECT * is not simpler — it is fragile.
  • Never remove columns from the output or reorder them
  • Never change NULL semantics or filter behaviour
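The first constraint can be checked mechanically. A minimal sketch — the helper is hypothetical and only catches a literal SELECT * (optionally preceded by a hint), not every variant:

```python
import re

# Matches SELECT * with an optional /*+ ... */ hint between SELECT and *.
_SELECT_STAR = re.compile(r"select\s+(?:/\*\+[^*]*\*/\s*)?\*", re.IGNORECASE)

def introduces_select_star(original_sql, rewritten_sql):
    """True if the rewrite adds a SELECT * that the original did not have."""
    return bool(_SELECT_STAR.search(rewritten_sql)) and not _SELECT_STAR.search(original_sql)

print(introduces_select_star("SELECT a, b FROM t", "SELECT * FROM t"))     # True
print(introduces_select_star("SELECT a, b FROM t", "SELECT b, a FROM t"))  # False
```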


Phase 6 — Report


Present a summary.

First print the full attempt log:

```bash
cat $RESULTS_FILE
```

Then show the git log of kept commits:

```bash
git log --oneline <baseline-sha>..HEAD
```

Then present the summary report:

SQL Tuning Report


Run summary


  • Attempts: N total / K kept / M discarded
  • Baseline: X.XXs → Best: X.XXs (X.Xx speedup)

Attempt log


| # | Mean (s) | Speedup | Status   | Description    |
| - | -------- | ------- | -------- | -------------- |
| 0 | x.xx     | 1.0x    | baseline | original query |
| 1 | x.xx     | x.xx    | keep     | ...            |
| 2 | x.xx     | x.xx    | discard  | ...            |

Query plan bottlenecks (original)


<key findings from EXPLAIN>

Optimizations applied (kept only)


  1. <change — what + why> ...

Original SQL


<original query>

Optimized SQL


<best optimized query>

Final benchmark


| Variant   | Mean (s) | Std (s) | Runs |
| --------- | -------- | ------- | ---- |
| Original  | x.xx     | x.xx    | N    |
| Optimized | x.xx     | x.xx    | N    |

Speedup: X.Xx | Statistically significant: yes/no
Validation: PASS — N rows, identical results

Conclusion


<one paragraph: what changed, why it helped, any caveats>

Additional recommendations (outside SQL scope)


<Z-ordering, ANALYZE TABLE, cluster sizing, etc. — if applicable>

---

Troubleshooting


| Error | Fix |
| ----- | --- |
| Cannot configure default credentials | Re-authenticate: databricks auth login --profile <PROFILE> |
| Cluster not found | Verify the cluster ID: databricks clusters list --profile <PROFILE> |
| Version mismatch | Ensure DBR_VERSION from clusters get matches the installed databricks-connect |
| .venv_autotuner import errors | Delete the venv and re-run env_setup.py |
| Validation diff on floats | May be floating-point non-determinism — check with ROUND() or cast to DECIMAL |
| .collect() fails with >4 GiB / driver OOM | Query returns millions of rows — add --timing-count to wrap timing runs in COUNT(*). Validation still uses real results. |
| Hint parse error in UNION ALL | Move the hint to the outermost SELECT (see Phase 4 restriction) |
| Session UDF not found | Create udf_setup.py with a register_udfs(spark) function |