databricks-sql-autotuner


# Databricks SQL Autotuner

Analyze, rewrite, validate, and benchmark SQL queries against a live Databricks cluster. The optimized query must produce identical results and show a statistically meaningful speedup before it is accepted.

## Examples

### Minimal — paste an inline query, discover everything automatically

```
/databricks-sql-autotuner "SELECT o.order_id, SUM(l.amount) FROM orders o JOIN lines l ON o.id = l.order_id GROUP BY o.order_id"
```

### From a file, cluster by ID

```
/databricks-sql-autotuner --cluster-id 0408-195905-abc --profile premium @queries/slow_report.sql
```

### From a file, cluster by name

```
/databricks-sql-autotuner --cluster-name my-cluster --profile premium @queries/slow_report.sql
```

### Optimize for speed, 5 benchmark runs

```
/databricks-sql-autotuner --cluster-id 0408-195905-abc --goals speed --runs 5 @queries/slow_report.sql
```

### Optimize for speed first, then simplify

```
/databricks-sql-autotuner --cluster-id 0408-195905-abc --goals speed,simplicity @queries/slow_report.sql
```

### Override catalog and schema

```
/databricks-sql-autotuner --cluster-id 0408-195905-abc --catalog sales --schema prod @queries/slow_report.sql
```

---

## Reference files

Read these before writing the optimized query:

| File | When to read |
| --- | --- |
| references/spark-sql-hints.md | Any time you are considering a hint (`BROADCAST`, `MERGE`, `SHUFFLE_HASH`, `REBALANCE`, etc.) or dealing with UNION ALL scoping |
| references/spark-sql-perf-tuning.md | For AQE behavior, partition tuning, statistics interpretation, and the full SQL-level optimization checklist |

## Phase 0 — Parse input

The user invokes with:

```
/databricks-sql-autotuner [options] <query-or-path>
```

Supported options (all optional — skip any discovery step where the value is provided):

| Option | Example | Effect |
| --- | --- | --- |
| `--cluster-id <id>` | `--cluster-id 0408-195905-abc` | Skip cluster listing; use this cluster ID directly |
| `--cluster-name <name>` | `--cluster-name my-cluster` | Skip cluster listing; find the cluster by name |
| `--profile <name>` | `--profile premium` | Skip profile selection; use this profile |
| `--catalog <name>` | `--catalog my_catalog` | Skip catalog discovery; use this value |
| `--schema <name>` | `--schema my_schema` | Skip schema discovery; use this value |
| `--runs <n>` | `--runs 5` | Override the default benchmark run count (default: 3) |
| `--goals <list>` | `--goals speed` | Comma-separated goals: `speed`, `simplicity`, or `speed,simplicity` |
| `--query <sql-or-path>` | `--query @slow.sql` | SQL string or `@path/to/file.sql`; if omitted, ask the user |
| `--optimized <path>` | `--optimized out.sql` | Write the optimized query to a separate file instead of editing the original in place |
| `--timing-count` | `--timing-count` | Wrap timing runs in `COUNT(*)` to avoid collecting large result sets to the driver (use when the query returns millions of rows) |

Record the goals as `GOALS` (default: `speed`). Accepted values:

| Value | Meaning |
| --- | --- |
| `speed` | Minimize execution time |
| `simplicity` | Minimize the complexity score (lines, nesting, subqueries) |
| `speed,simplicity` | Optimize speed first; once accepted, simplify while staying within 10% of the best time |

Parse these from the invocation arguments before starting Phase 1. For any option not supplied, run the normal discovery step.

The `--query` value:

- `@path/to/file.sql` → read the file
- Any other string → treat as inline SQL
- Omitted → ask the user to paste or provide the query
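The `--query` dispatch rules can be sketched as a small helper. This is an illustrative sketch only, not part of `tune.py`; the function name `classify_query_arg` is hypothetical.

```python
def classify_query_arg(value):
    """Apply the --query dispatch rules:
    @path -> read the file, any other string -> inline SQL, omitted -> ask the user."""
    if value is None:
        return ("ask", None)            # prompt the user to paste or provide the query
    if value.startswith("@"):
        return ("file", value[1:])      # strip the @ and read this path
    return ("inline", value)            # treat the string itself as SQL


# One example per rule:
print(classify_query_arg("@queries/slow_report.sql"))  # ('file', 'queries/slow_report.sql')
print(classify_query_arg("SELECT 1"))                  # ('inline', 'SELECT 1')
print(classify_query_arg(None))                        # ('ask', None)
```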

## Phase 1 — Environment discovery

### 1.1 Verify the Databricks CLI is authenticated

```bash
databricks auth profiles
```

If this fails or returns no profiles, stop and ask the user to authenticate:

```bash
databricks auth login --profile <name>
```

### 1.2 Select profile

If `--profile` was provided, skip this step.

Otherwise present all available profiles with their workspace host URLs. Auto-select if only one exists; otherwise ask.

### 1.3 Select cluster

If `--cluster-id` was provided, use that ID directly and skip the listing.

If `--cluster-name` was provided, resolve it to an ID:

```bash
databricks clusters list --profile <PROFILE> --output json \
  | python3 -c "
import sys, json
clusters = json.load(sys.stdin)
name = '<CLUSTER_NAME>'
match = [c for c in clusters if c.get('cluster_name') == name]
if not match:
    print('ERROR: no cluster named', name, file=sys.stderr); sys.exit(1)
print(match[0]['cluster_id'])
"
```

If neither was provided, list all clusters:

```bash
databricks clusters list --profile <PROFILE> --output json
```

Filter to clusters with `state = RUNNING` or `state = TERMINATED` (can be started). Present the list (cluster ID, name, state, DBR version). Auto-select any running general-purpose cluster if one exists; otherwise ask.

If the chosen cluster is terminated, start it:

```bash
databricks clusters start <CLUSTER_ID> --profile <PROFILE>
```

### 1.4 Get the DBR version

```bash
databricks clusters get <CLUSTER_ID> --profile <PROFILE> --output json \
  | python3 -c "import sys,json; v=json.load(sys.stdin)['spark_version']; print(v.split('-')[0])"
```

Record as `DBR_VERSION` (e.g., `17.3`). This determines which `databricks-connect` version to install.

### 1.5 Catalog / schema context

Run this to discover the current defaults:

```bash
databricks clusters get <CLUSTER_ID> --profile <PROFILE> --output json \
  | python3 -c "import sys,json; c=json.load(sys.stdin); \
    cfg=c.get('spark_conf',{}); \
    print('catalog:', cfg.get('spark.databricks.sql.initial.catalog.name','hive_metastore')); \
    print('schema:', cfg.get('spark.databricks.sql.initial.catalog.namespace','default'))"
```

Use whatever the cluster reports as `CATALOG` and `SCHEMA`. Only ask the user to override if the query explicitly references a different catalog/schema or if the command returns nothing useful.

### 1.6 Benchmark runs

If `--runs` was provided, skip this step and use that value.

Otherwise ask: How many benchmark runs per query variant? (default: 3)

Record as `N_RUNS` (default `3`, minimum `2`).

### 1.7 Confirm

Present a summary and wait for confirmation:

| Parameter | Value |
| --- | --- |
| Profile | |
| Cluster ID | |
| DBR version | |
| Catalog | |
| Schema | |
| N runs | |

## Phase 2 — Environment setup

### 2.0 Resolve the skill directory

The scripts live inside the skill, not the project. Locate them first:

```bash
SKILL_DIR=$(find ~/.claude/skills/databricks-sql-autotuner \
                  "$(git rev-parse --show-toplevel 2>/dev/null)/.claude/skills/databricks-sql-autotuner" \
             -maxdepth 0 -type d 2>/dev/null | head -1)
echo "SKILL_DIR=$SKILL_DIR"
```

If neither path exists, check where Claude Code installed the skill:

```bash
find ~ -path "*/.claude/skills/databricks-sql-autotuner" -maxdepth 6 -type d 2>/dev/null | head -1
```

Record as `SKILL_DIR`. All script paths below use `$SKILL_DIR/scripts/`.

### 2.1 Check uv

```bash
uv --version
```

If missing:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

### 2.2 Run setup

```bash
python3 "$SKILL_DIR/scripts/env_setup.py" \
  --dbr-version <DBR_VERSION> \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID>
```

This uses `uv venv` to create `.venv_autotuner/`, installs `databricks-connect==<DBR_VERSION>.*`, and verifies the connection with a `SELECT 1` smoke test.

All subsequent commands invoke the venv Python directly:

```bash
VENV_PYTHON=".venv_autotuner/bin/python"          # macOS / Linux
# Windows: VENV_PYTHON=".venv_autotuner\Scripts\python.exe"

TUNE="$VENV_PYTHON $SKILL_DIR/scripts/tune.py"
```

Use `$TUNE` for every `tune.py` invocation from here on.

---

## Phase 2b — Branch & baseline

Every tuning session gets its own branch and a results log. This keeps the work isolated, makes it easy to compare iterations, and produces a clean git history that shows exactly what was tried and kept.

### 2b.1 Generate a run ID

Use a short slug derived from the query's purpose — e.g. `sales-summary`, `user-funnel-join`.

```bash
RUN_ID="<query-slug>"           # e.g. sales-summary
RESULTS_FILE="sqltune-${RUN_ID}.tsv"
LOG_FILE="sqltune-${RUN_ID}.log"
```
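Deriving the slug can be sketched as a one-liner that keeps the result branch-safe. This is an illustrative sketch; the function name `make_run_id` and the 40-character cap are assumptions, not part of the skill's scripts.

```python
import re


def make_run_id(description):
    """Derive a short, branch-safe slug from a free-text description of the query:
    lowercase, non-alphanumeric runs collapsed to hyphens, capped at 40 chars."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return slug[:40]


print(make_run_id("Sales Summary (Q3)"))   # sales-summary-q3
```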

### 2b.2 Create a branch

```bash
git checkout -b sql-tune/${RUN_ID}
```

### 2b.3 Establish the working file

Determine `QUERY_FILE` (the file that will be edited each iteration) and `ORIGINAL_FILE` (the baseline passed to `--original` in tune.py):

| Scenario | ORIGINAL_FILE | QUERY_FILE |
| --- | --- | --- |
| `--query @file.sql` (default) | file.sql | file.sql — edit in place |
| `--query @file.sql --optimized out.sql` | file.sql | out.sql — write here, the original stays untouched |
| `--query "inline SQL"` | query.sql (write it, commit) | query.sql — edit in place |
| `--query "inline SQL" --optimized out.sql` | query.sql | out.sql |

When `--optimized` is given, copy the original into that file first:

```bash
cp $ORIGINAL_FILE $QUERY_FILE
git add $QUERY_FILE
```

When editing in place (no `--optimized`), `ORIGINAL_FILE == QUERY_FILE` and git tracks the evolution naturally.

### 2b.4 Initialize the results log

```bash
echo "${RESULTS_FILE}" >> .git/info/exclude
echo "${LOG_FILE}"     >> .git/info/exclude
```

Create `$RESULTS_FILE`:

```
attempt	commit	mean_s	speedup	status	description
0	<sha>	<baseline_mean>	1.0	baseline	original query
```
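Appending one attempt row in that tab-separated layout can be sketched as a small helper. This is an illustrative sketch; `append_attempt` is a hypothetical name, not something `tune.py` provides.

```python
def append_attempt(results_file, attempt, commit, mean_s, baseline_mean, status, description):
    """Append one row to the TSV results log, computing speedup against the baseline mean.
    Columns match the header: attempt, commit, mean_s, speedup, status, description."""
    speedup = baseline_mean / mean_s
    row = f"{attempt}\t{commit}\t{mean_s:.2f}\t{speedup:.2f}\t{status}\t{description}\n"
    with open(results_file, "a") as f:
        f.write(row)
    return row


# e.g. append_attempt("sqltune-sales-summary.tsv", 1, "a1b2c3d",
#                     8.40, 12.60, "accepted", "broadcast dim_country")
```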

### 2b.5 Run the baseline

```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --original @$QUERY_FILE \
  --optimized @$QUERY_FILE \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --n-runs <N_RUNS> > $LOG_FILE 2>&1
```

Extract `original.mean_s` and record it as attempt `0`. Then report:

```
Baseline: mean = X.XXs. Branch: sql-tune/<run-id>. Starting analysis.
```
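The acceptance rule ("a statistically meaningful speedup") can be sketched as a timing check over the per-run results. This is an illustrative sketch only: the 1.1× threshold and the no-overlap condition are assumptions, not values defined by the skill.

```python
def is_meaningful_speedup(baseline_runs, candidate_runs, min_speedup=1.1):
    """Require the mean speedup to clear a threshold AND every candidate run to beat
    the fastest baseline run, so one lucky run can't win. Thresholds are illustrative."""
    base_mean = sum(baseline_runs) / len(baseline_runs)
    cand_mean = sum(candidate_runs) / len(candidate_runs)
    speedup = base_mean / cand_mean
    accepted = speedup >= min_speedup and max(candidate_runs) < min(baseline_runs)
    return accepted, speedup


ok, speedup = is_meaningful_speedup([10.2, 11.0, 11.8], [5.1, 5.4, 5.8])
print(ok, round(speedup, 2))   # True 2.02
```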

## Phase 3 — Query analysis

Start with the execution plan — it tells you what the optimizer actually decided. Collect table stats only when the plan raises a question that size or schema data would answer (e.g. "is this table small enough to broadcast?").

### 3.1 Get the execution plan

```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --original "<QUERY_OR_@FILE>" \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --explain-only
```

Read the plan and identify bottlenecks:

| Bottleneck signal | What to look for |
| --- | --- |
| BroadcastNestedLoopJoin | Missing join key or a cartesian product — add an explicit join condition |
| SortMergeJoin | May be improvable with BROADCAST if the smaller side is small enough — check the table size |
| Exchange (shuffle) | Heavy repartitioning — check GROUP BY / JOIN key cardinality |
| FileScan with no PartitionFilters | The filter may not be using partition columns — worth checking the schema |
| Filter after scan (not pushed down) | Move the filter earlier or rewrite WHERE to use partition columns |
| Repeated subquery | Extract to a CTE or use a window function |
| UNION ALL + aggregate | Check whether ROLLUP or a window function can replace it |
| UDF calls on large datasets | Consider replacing with built-in Spark SQL functions |
| Large row count with a high-null join key | Add an `IS NOT NULL` filter before the join to reduce shuffle size |
| SubqueryAlias / deeply nested plan where a FileScan is expected | The join target is a view — read the view definition (step 3.2a) |
| Skewed Exchange — one partition vastly larger than the others | The GROUP BY key has low or highly uneven cardinality — consider SQL salting (see Phase 4) |
| `IN (subquery)` | Often generates a worse plan than `EXISTS` or a semi-join — rewrite if the subquery is large |
| `ObjectHashAggregate` or `ArrowEvalPython` nodes | Photon cannot accelerate these — check for Python UDFs, `TRANSFORM`/`FILTER` on arrays, or `MAP` operations |

After reading the plan, decide: can you already identify the optimization, or do you need more information about the tables?

- If the bottleneck is clear (e.g. a correlated subquery, a missing NULL filter, `LOWER()` on a filter column) — go straight to Phase 4.
- If the bottleneck depends on physical table properties (file size, partition columns, column cardinality) — collect stats for just those tables in 3.2.
- If a join target appears to be a view (the plan shows a nested subquery or a `SubqueryAlias` node where you expect a FileScan) — read the view definition in 3.2a before deciding what to optimize.
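A first pass over the signal table can be mechanized as a simple substring scan of the EXPLAIN output. This is an illustrative sketch: the signal list is abbreviated, the exact node strings vary by Spark/DBR version, and `scan_plan` is a hypothetical helper, not part of `tune.py`.

```python
# Abbreviated map of plan-node substrings to the bottleneck they suggest.
PLAN_SIGNALS = {
    "BroadcastNestedLoopJoin": "missing join key or cartesian product",
    "SortMergeJoin": "check whether the smaller side can be BROADCAST",
    "PartitionFilters: []": "FileScan without partition pruning",
}


def scan_plan(plan_text):
    """Return the bottleneck signals found in an EXPLAIN output (substring match)."""
    return {node: why for node, why in PLAN_SIGNALS.items() if node in plan_text}


plan = "== Physical Plan ==\nSortMergeJoin [id], [order_id]\n+- Exchange hashpartitioning(id, 200)"
print(scan_plan(plan))   # {'SortMergeJoin': 'check whether the smaller side can be BROADCAST'}
```

This only pre-screens the plan; the table's judgment calls (skew, cardinality, view detection) still need a human read.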

### 3.2a Read view definitions (when a join target is a view)

The EXPLAIN plan shows a view's underlying query inlined as a subquery tree. If you see a deeply nested plan where a simple table join should be, the view itself may be the bottleneck — not the outer query.

To read the view definition:

```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --original "SHOW CREATE TABLE <view_name>" \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --explain-only
```

Or collect it alongside table stats (the `--table-stats` flag works for views too).

When you read a view definition, look for:

- Filters that could be pushed closer to the source scan
- Joins inside the view that are better placed in the outer query (where the caller already filters heavily before joining)
- Aggregations computed in the view that the caller immediately re-aggregates
- DISTINCT or ORDER BY inside the view that serve no purpose when it is used as a subquery
- Correlated subqueries or scalar subqueries that run once per row

What to do if the view is suboptimal: you have two options — pick the one that is less invasive:

1. Inline the view as a CTE — replace `JOIN my_view ON ...` with `WITH my_view AS (<rewritten definition>) ...` in the optimized query. This avoids changing the view in the catalog and keeps the fix self-contained.
2. Suggest a view rewrite — if the view is used widely and inlining would make the outer query unreadable, write an optimized version of the view definition separately and note it under "Additional recommendations" in the final report. Do not change the view in the catalog.

Always prefer option 1 unless the view is complex enough that inlining it makes the outer query harder to reason about than it was before.

### 3.2 Collect table stats (when needed)

Only run this step when the execution plan leaves a question that physical metadata would answer. Pass only the tables relevant to the bottleneck — not every table in the query.

```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --table-stats <TABLE1> [<TABLE2> ...]
```

What this collects per table and when it matters:

| Section | Source | When you need it |
| --- | --- | --- |
| Schema | `DESCRIBE table` | Partition columns unknown; filter pushdown uncertain |
| Partition columns | `DESCRIBE EXTENDED` | The plan shows a FileScan without PartitionFilters |
| Delta detail | `DESCRIBE DETAIL` | The plan has a SortMergeJoin — need sizeInBytes to decide on BROADCAST |
| Table stats | `DESCRIBE EXTENDED` | The optimizer picked a bad join strategy; stats may be stale or missing |
| Column stats | `DESCRIBE EXTENDED t col` | High Exchange cost; need distinctCount/nullCount to assess skew |

Use these facts to make decisions:

| Fact | Decision |
| --- | --- |
| Delta detail `sizeInBytes` < 200 MB | Safe to BROADCAST this table in a join |
| Delta detail `numFiles` is very high | Consider Z-ordering or compaction (note as a recommendation, not a SQL change) |
| Avg file size < 32 MB | Small-files problem — predicate pushdown is likely hurting more than helping |
| No partition columns | All filters are post-scan; look for other pushdown opportunities |
| Filter column IS a partition column | Verify the WHERE clause actually uses it for pruning |
| Column `distinctCount` is low | High skew risk in GROUP BY or JOIN on that column |
| Column `nullCount` is high | NULL-safe joins may inflate row counts unexpectedly |
| Row count from DESCRIBE ≠ actual COUNT(*) | Stats are stale — note this, but do not run ANALYZE (it can be slow) |
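The size-based rows of the decision table reduce to two arithmetic checks. A minimal sketch, assuming `sizeInBytes` and `numFiles` come straight from `DESCRIBE DETAIL`; the function names and thresholds mirror the table above but are otherwise hypothetical.

```python
def broadcast_safe(size_in_bytes, threshold_mb=200):
    """sizeInBytes rule: a table under ~200 MB is safe to BROADCAST in a join."""
    return size_in_bytes < threshold_mb * 1024 * 1024


def small_files_problem(size_in_bytes, num_files, min_avg_mb=32):
    """Small-files rule: flag when the average file size falls below ~32 MB."""
    return num_files > 0 and (size_in_bytes / num_files) < min_avg_mb * 1024 * 1024


print(broadcast_safe(18 * 1024 * 1024))         # True — an 18 MB dim table can be broadcast
print(small_files_problem(10 * 1024**3, 2000))  # True — ~5 MB average file size
```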

## Phase 3.3 — Build an attack plan

Before writing a single line of optimized SQL, write a short numbered list of the specific changes you intend to make and why. Show it to the user (or include it in your reasoning) before proceeding to Phase 4.

Each item should name:

- What you will change (e.g. "replace SortMergeJoin with a BROADCAST hint on `dim_country`")
- Why (e.g. "Delta detail shows sizeInBytes = 18 MB, well under the 200 MB threshold")
- The expected effect (e.g. "eliminates one Exchange shuffle node")

Example:

```
Attack plan:
1. Add /*+ BROADCAST(dim_country) */ — sizeInBytes 18 MB, plan currently SortMergeJoin → eliminates shuffle
2. Replace LOWER(email) = ? filter with COLLATE UTF8_LCASE — restores Delta file-skipping
3. Extract repeated subquery into CTE — currently executed once per row in the outer SELECT
4. Add IS NOT NULL guard on user_id before the JOIN — 12% null rate inflates shuffle size
```

If you have no concrete changes to make (the plan looks fine or the bottleneck is outside SQL), say so explicitly and explain what you found.

## Phase 4 — SQL optimization

Output contract: the only thing this skill produces is a rewritten SQL query. No Python code. No cluster config. No DataFrame API. No schema changes. A plain SQL string — same SELECT structure, same columns, same output — just faster. SQL hints (`/*+ BROADCAST(...) */`) and SQL comments are allowed.

Rewrite the query using SQL-level rewrites only.

### What is allowed

- Rewriting JOINs (inner ↔ left, reordering join tables for a better plan)
- Adding / removing CTEs for de-duplication and readability
- Replacing correlated subqueries with joins or window functions
- Adding predicate filters earlier in the plan (push filters closer to the scan)
- Rewriting DISTINCT as GROUP BY where appropriate
- Replacing UDFs with equivalent built-in Spark SQL functions
- Splitting complex expressions to help predicate pushdown
- SQL optimizer hints: `/*+ BROADCAST(t) */`, `/*+ MERGE(t) */`, `/*+ SHUFFLE_HASH(t) */`
- SQL comments (`--` or `/* */`) explaining what the optimization does
- Inlining a view as a CTE — if a join target is a view whose definition is suboptimal, replace the view reference with `WITH view_name AS (<rewritten SQL>)` in the optimized query. The output is still a single self-contained SQL string.
- Rewriting `IN (subquery)` as `EXISTS` or a semi-join — Spark often builds a better plan for `EXISTS` / `LEFT SEMI JOIN` than for `IN` with a large subquery
- Skew salting — when a GROUP BY key is heavily skewed, a two-pass SQL salt distributes the hot partition across workers (see the salting pattern below)

### Skew salting pattern

When the plan shows a highly skewed Exchange on a GROUP BY key, salt the aggregation:

```sql
-- Pass 1: aggregate with a random salt to spread the hot key
WITH salted AS (
  SELECT
    key,
    FLOOR(RAND() * 8) AS salt,  -- 8 buckets; tune to the cluster core count
    SUM(value) AS partial_sum
  FROM large_table
  GROUP BY key, salt
)
-- Pass 2: collapse the salted buckets
SELECT key, SUM(partial_sum) AS total
FROM salted
GROUP BY key
```

Only use this when column stats or a skewed Exchange confirm that one key dominates. The extra aggregation pass adds overhead on balanced data — don't apply it speculatively.

### CTE materialization caveat

Spark does not guarantee that a CTE is computed only once. If a CTE is referenced multiple times in the query, Spark may re-evaluate it on each reference. This means:

- A CTE used to "cache" an expensive subquery may not help — or may hurt if the subquery is scanned multiple times instead of once.
- If you add a CTE to de-duplicate a repeated subquery and the plan still shows the subquery running multiple times, note `CACHE TABLE` as a recommendation outside SQL scope.
- Prefer CTEs for readability and hint targeting; don't assume they imply materialization.
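One way to check whether a CTE was actually re-evaluated is to count the source table's scan nodes in the EXPLAIN output. A rough sketch, assuming Delta/Parquet scans appear as `FileScan ... <table>` lines (the exact format varies by Spark/DBR version); `count_table_scans` is a hypothetical helper.

```python
import re


def count_table_scans(plan_text, table_name):
    """Count scan nodes referencing a table in an EXPLAIN output. If a CTE over the
    table appears N times in the physical plan, the table is being scanned N times."""
    pattern = r"FileScan[^\n]*\b" + re.escape(table_name) + r"\b"
    return len(re.findall(pattern, plan_text))


plan = (
    "FileScan parquet sales.orders [...]\n"
    "FileScan parquet sales.orders [...]\n"
)
print(count_table_scans(plan, "orders"))   # 2 — the CTE was evaluated twice
```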

### Photon-unfriendly patterns

Databricks Photon accelerates most SQL but silently falls back to JVM execution for:

| Pattern | Why Photon can't accelerate it |
| --- | --- |
| Python / Scala UDFs | Black-box execution; no vectorization possible |
| `TRANSFORM(array, x -> ...)` | Higher-order functions are not yet Photon-native |
| `FILTER(array, x -> ...)` | Same — lambda expressions bypass Photon |
| `MAP_KEYS` / `MAP_VALUES` on complex maps | Complex-type operations are not vectorized |
| Non-deterministic functions in certain contexts | Photon cannot reorder them safely |

If the plan shows `ArrowEvalPython`, `ObjectHashAggregate`, or `BatchEvalPython` nodes, a Photon fallback is happening. Flag it in the attack plan and suggest replacing the pattern with built-in Spark SQL equivalents where possible.
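Checking for those three fallback node names can be sketched as a substring scan of the EXPLAIN output. An illustrative sketch; `photon_fallbacks` is a hypothetical helper, and node names can vary across DBR versions.

```python
# Node names that indicate Photon has fallen back to JVM/Python execution.
PHOTON_FALLBACK_NODES = ("ArrowEvalPython", "ObjectHashAggregate", "BatchEvalPython")


def photon_fallbacks(plan_text):
    """Return which Photon-fallback node names appear in an EXPLAIN output."""
    return [node for node in PHOTON_FALLBACK_NODES if node in plan_text]


plan = "HashAggregate\n+- ArrowEvalPython [my_udf(email)]\n   +- FileScan parquet users"
print(photon_fallbacks(plan))   # ['ArrowEvalPython'] — a Python UDF is forcing a fallback
```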

### What is NOT allowed — never suggest these

- `spark.conf.set(...)` or any cluster/session configuration
- DataFrame / PySpark API code (`.filter()`, `.join()`, `.groupBy()`, etc.)
- DDL changes (`ALTER TABLE`, `OPTIMIZE`, `ZORDER`, `VACUUM`, `ANALYZE TABLE`)
- Any rewrite that changes the columns, row count, or values in the output

If a non-SQL improvement (e.g., Z-ordering, cluster sizing) would help, mention it separately under "Additional recommendations (outside scope)" in the final report — but never include it in the optimized query itself.

Hint restriction — CRITICAL for UNION ALL


Spark SQL does not allow /*+ ... */ hints inside CTE branches that are part of a UNION ALL. Place all optimizer hints on the outermost SELECT only:

```sql
-- ✅ CORRECT
SELECT /*+ BROADCAST(small_table) */ *
FROM big_table
JOIN small_table ON ...

-- ❌ WRONG — hint inside a UNION ALL branch causes a parse error
WITH branch AS (
  SELECT /*+ BROADCAST(t) */ * FROM t   -- this will fail
)
SELECT * FROM branch
UNION ALL
SELECT * FROM other
```

UDF handling


If the query calls UDFs, check whether each is:
  • A persistent SQL function (created with CREATE FUNCTION in the catalog) — these work as-is
  • A session UDF that must be registered at runtime — create a udf_setup.py file (see below)

If session UDFs are needed, create udf_setup.py in the current directory:

udf_setup.py — loaded by tune.py automatically if it exists


```python
from pyspark.sql.functions import udf as spark_udf
from pyspark.sql.types import StringType

def _my_udf(value):
    # your implementation
    return result

def register_udfs(spark):
    spark.udf.register("my_udf_name", spark_udf(_my_udf, StringType()))
```

`tune.py` checks for `udf_setup.py` in the current directory and calls
`register_udfs(spark)` automatically before running any queries.
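The auto-loading can be pictured as follows. This is an assumed sketch of tune.py's behavior based on the description above, not its actual source:

```python
import importlib.util
import os

def load_session_udfs(spark, path="udf_setup.py"):
    """Import udf_setup.py if present and call its register_udfs(spark) hook.

    Returns True if the file was found and loaded, False otherwise.
    (Assumed behavior of tune.py; illustrative only.)
    """
    if not os.path.exists(path):
        return False
    spec = importlib.util.spec_from_file_location("udf_setup", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    module.register_udfs(spark)
    return True
```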

---

Phase 5 — Validate, benchmark & log


5.1 Edit and commit


Edit $QUERY_FILE directly — the branch is isolated so edits are safe to make in place.

```bash
git add $QUERY_FILE
git commit -m "sql-tune: attempt <N> — <one-line description of what changed>"
```

Commit before benchmarking so every attempt is in the git log regardless of outcome.

5.2 Run the benchmark


```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --original @$QUERY_FILE \
  --optimized @$QUERY_FILE \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --n-runs <N_RUNS> > $LOG_FILE 2>&1
```

(Both --original and --optimized point to the same file — tune.py compares the current version to the baseline it runs internally.)

The script outputs JSON with:
  • validation.passed — whether results are byte-for-byte identical
  • validation.row_count — number of rows compared
  • original.mean_s, original.std_s — baseline timing stats
  • optimized.mean_s, optimized.std_s — optimized-query timing stats
  • speedup — ratio of the original mean to the optimized mean
  • statistically_significant — true if the optimized confidence interval lies entirely below the original's

5.3 Decide: keep or revert


A speedup is real only when:
  1. validation.passed = true (identical results)
  2. speedup > 1.0 (optimized is faster on average)
  3. statistically_significant = true (confidence intervals don't overlap)

✅ IMPROVED  → keep commit, update BEST_MEAN = optimized.mean_s
❌ SAME/WORSE → revert the file to the last kept commit:
               git checkout HEAD $QUERY_FILE
💥 VALIDATION FAIL → fix the query and re-benchmark before reverting
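The three criteria combine into a single predicate. Field names follow the tune.py JSON described in 5.2; the function itself is illustrative:

```python
def is_real_speedup(report):
    """Apply the three acceptance criteria above to a parsed tune.py report."""
    return (
        report["validation"]["passed"]            # 1. identical results
        and report["speedup"] > 1.0               # 2. faster on average
        and report["statistically_significant"]   # 3. CIs don't overlap
    )

print(is_real_speedup({
    "validation": {"passed": True},
    "speedup": 1.54,
    "statistically_significant": True,
}))  # True
```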

5.4 Log the attempt


Append to $RESULTS_FILE:
<N>	<commit-sha>	<optimized.mean_s>	<speedup>	<keep|discard>	<description>
If validation fails: fix the query and re-benchmark before doing anything else.
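A sketch of the append step — the column order matches the template above, and the helper name is illustrative:

```python
def log_attempt(results_file, n, sha, mean_s, speedup, status, description):
    """Append one tab-separated attempt row to the results file."""
    row = "\t".join(
        [str(n), sha, f"{mean_s:.2f}", f"{speedup:.2f}", status, description]
    )
    with open(results_file, "a") as f:
        f.write(row + "\n")
```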


Phase 5b — Autonomous optimization loop


Run continuously after the first benchmark. Never pause to ask "should I continue?". Stop only when the user interrupts or explicitly says they're satisfied.
THINK   Read $RESULTS_FILE and the current best query.
        Study the EXPLAIN of the best query so far — not the original.
        Form a specific hypothesis: "changing X should reduce Y because Z."
        Follow the strategy priority below.

EDIT    Edit $QUERY_FILE directly — one focused change per attempt.
        The branch is isolated; edits are safe to make in place.

COMMIT  git add $QUERY_FILE && git commit -m "sql-tune: attempt <N> — <description>"
        Commit before benchmarking — this records what was tried.

RUN     $TUNE \
          --profile <PROFILE> --cluster-id <CLUSTER_ID> \
          --original @$ORIGINAL_FILE --optimized @$QUERY_FILE \
          --catalog <CATALOG> --schema <SCHEMA> \
          --n-runs <N_RUNS> > $LOG_FILE 2>&1

        # For simplicity / both goals, also score complexity
        python3 "$SKILL_DIR/scripts/complexity.py" --json $QUERY_FILE > complexity.json

MEASURE Extract from $LOG_FILE (timing) and complexity.json (score).
        On crash: read last 50 lines for the error. Attempt up to 2 quick fixes,
        amend the commit, re-run. If still broken, revert and discard.

        Metric to optimize depends on GOALS:
        ┌──────────────────────┬────────────────────────────────────────────────────┐
        │ speed                │ optimized.mean_s  (lower is better)                │
        │ simplicity           │ complexity score  (lower is better)                │
        │ speed,simplicity     │ Phase 1: optimize mean_s (same as speed)           │
        │                      │ Phase 2: once speed accepted, optimize score        │
        │                      │          while mean_s stays ≤ BEST_MEAN * 1.10     │
        └──────────────────────┴────────────────────────────────────────────────────┘

DECIDE  Compare metric to BEST:
        speed             → ✅ if mean_s improved + statistically significant
        simplicity        → ✅ if complexity score decreased + validation passes
        speed,simplicity  → phase 1: same as speed
                            phase 2: ✅ if score decreased AND mean_s ≤ BEST_MEAN * 1.10

        ❌ no improvement → git checkout HEAD $QUERY_FILE   (revert file, keep commit msg in log)
        💥 VALIDATION FAIL → fix semantics first, then re-benchmark

LOG     Append to $RESULTS_FILE:
        <N>\t<sha>\t<mean_s>\t<complexity_score>\t<status>\t<description>
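The goal-dependent DECIDE rules above can be condensed into one function. The names here (phase, best_mean, best_score) are illustrative, and validation must already have passed before this check applies:

```python
def accept(goals, phase, mean_s, score, significant, best_mean, best_score):
    """Return True if this attempt improves on the best so far,
    per the DECIDE table above. Assumes validation already passed."""
    if goals == "speed":
        return mean_s < best_mean and significant
    if goals == "simplicity":
        return score < best_score
    if goals == "speed,simplicity":
        if phase == 1:  # phase 1 behaves exactly like the speed goal
            return mean_s < best_mean and significant
        # phase 2: simpler, while staying within 10% of the best speed
        return score < best_score and mean_s <= best_mean * 1.10
    raise ValueError(f"unknown goals: {goals}")

print(accept("speed,simplicity", 2, 10.5, 40, False, 10.0, 50))  # True
```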

Strategy priority


  1. Follow the plan — work through the attack plan from Phase 3.3 in order
  2. Follow wins — if a hint or rewrite helped, probe further in that direction
  3. Diversify after 3 consecutive discards — switch to a different bottleneck
  4. Combine winners — if A and B each improved independently, try A+B together
  5. Diagnose no-speedup runs — re-EXPLAIN the optimized query; check if AQE already did what the hint asked, or if a Photon fallback is dominating
  6. Accept the floor — if 5+ consecutive attempts yield no improvement and the remaining ideas are speculative, stop the loop and go to Phase 6
For the simplicity or speed,simplicity goals, the strategy shifts focus (in the simplicity phase):
  • Prefer removing CTEs that the optimizer re-evaluates anyway
  • Inline single-use subqueries where they don't increase nesting
  • Replace multi-step transformations with a single built-in function call
  • Remove redundant ORDER BY, DISTINCT, or GROUP BY that the caller drops
  • Simplicity wins don't require statistical significance — a lower score is enough
Hard constraints that apply to all goals, including simplicity:
  • Never introduce SELECT * — explicit column lists must be preserved exactly. SELECT * hides schema changes, breaks downstream consumers, and can silently change column order. A query with fewer lines but a SELECT * is not simpler — it is fragile.
  • Never remove columns from the output or reorder them
  • Never change NULL semantics or filter behaviour
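The first constraint can be checked mechanically. A minimal sketch — the helper is hypothetical and only catches a literal SELECT * (optionally preceded by a hint), not every variant:

```python
import re

# Matches SELECT * with an optional /*+ ... */ hint between SELECT and *.
_SELECT_STAR = re.compile(r"select\s+(?:/\*\+[^*]*\*/\s*)?\*", re.IGNORECASE)

def introduces_select_star(original_sql, rewritten_sql):
    """True if the rewrite adds a SELECT * that the original did not have."""
    return bool(_SELECT_STAR.search(rewritten_sql)) and not _SELECT_STAR.search(original_sql)

print(introduces_select_star("SELECT a, b FROM t", "SELECT * FROM t"))     # True
print(introduces_select_star("SELECT a, b FROM t", "SELECT b, a FROM t"))  # False
```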


Phase 6 — Report


Present a summary.

First print the full attempt log:

```bash
cat $RESULTS_FILE
```

Then show the git log of kept commits:

```bash
git log --oneline <baseline-sha>..HEAD
```

Then present the summary report:

SQL Tuning Report


Run summary


  • Attempts: N total / K kept / M discarded
  • Baseline: X.XXs → Best: X.XXs (X.Xx speedup)

Attempt log


| # | Mean (s) | Speedup | Status   | Description    |
| - | -------- | ------- | -------- | -------------- |
| 0 | x.xx     | 1.0x    | baseline | original query |
| 1 | x.xx     | x.xx    | keep     | ...            |
| 2 | x.xx     | x.xx    | discard  | ...            |

Query plan bottlenecks (original)


<key findings from EXPLAIN>

Optimizations applied (kept only)


  1. <change — what + why> ...

Original SQL


<original query>

Optimized SQL


<best optimized query>

Final benchmark


| Variant   | Mean (s) | Std (s) | Runs |
| --------- | -------- | ------- | ---- |
| Original  | x.xx     | x.xx    | N    |
| Optimized | x.xx     | x.xx    | N    |

Speedup: X.Xx | Statistically significant: yes/no
Validation: PASS — N rows, identical results

Conclusion


<one paragraph: what changed, why it helped, any caveats>

Additional recommendations (outside SQL scope)


<Z-ordering, ANALYZE TABLE, cluster sizing, etc. — if applicable>

---

Troubleshooting


| Error | Fix |
| ----- | --- |
| Cannot configure default credentials | Re-authenticate: databricks auth login --profile <PROFILE> |
| Cluster not found | Verify the cluster ID: databricks clusters list --profile <PROFILE> |
| Version mismatch | Ensure DBR_VERSION from clusters get matches the installed databricks-connect |
| .venv_autotuner import errors | Delete the venv and re-run env_setup.py |
| Validation diff on floats | May be floating-point non-determinism — check with ROUND() or cast to DECIMAL |
| .collect() fails with >4 GiB / driver OOM | Query returns millions of rows — add --timing-count to wrap timing runs in COUNT(*). Validation still uses real results. |
| Hint parse error in UNION ALL | Move the hint to the outermost SELECT (see Phase 4 restriction) |
| Session UDF not found | Create udf_setup.py with a register_udfs(spark) function |