# Databricks SQL Autotuner
Analyze, rewrite, validate, and benchmark SQL queries against a live Databricks
cluster. The optimized query must produce identical results and show a
statistically meaningful speedup before it is accepted.
## Examples

Minimal — paste an inline query, discover everything automatically:

```
/databricks-sql-autotuner "SELECT o.order_id, SUM(l.amount) FROM orders o JOIN lines l ON o.id = l.order_id GROUP BY o.order_id"
```
From a file, cluster by ID:

```
/databricks-sql-autotuner --cluster-id 0408-195905-abc --profile premium @queries/slow_report.sql
```
From a file, cluster by name:

```
/databricks-sql-autotuner --cluster-name my-cluster --profile premium @queries/slow_report.sql
```
Optimize for speed, 5 benchmark runs:

```
/databricks-sql-autotuner --cluster-id 0408-195905-abc --goals speed --runs 5 @queries/slow_report.sql
```
Optimize for speed first, then simplify:

```
/databricks-sql-autotuner --cluster-id 0408-195905-abc --goals speed,simplicity @queries/slow_report.sql
```
Override catalog and schema:

```
/databricks-sql-autotuner --cluster-id 0408-195905-abc --catalog sales --schema prod @queries/slow_report.sql
```

---
## Reference files

Read these before writing the optimized query:

| File | When to read |
|---|---|
| | Any time you are considering a hint |
| | For AQE behavior, partition tuning, statistics interpretation, and the full SQL-level optimization checklist |
## Phase 0 — Parse input

The user invokes with:

```
/databricks-sql-autotuner [options] <query-or-path>
```

Supported options (all optional — skip any discovery step where the value is provided):
| Option | Example | Effect |
|---|---|---|
| `--cluster-id` | `--cluster-id 0408-195905-abc` | Skip cluster listing; use this cluster ID directly |
| `--cluster-name` | `--cluster-name my-cluster` | Skip cluster listing; find cluster by name |
| `--profile` | `--profile premium` | Skip profile selection; use this profile |
| `--catalog` | `--catalog sales` | Skip catalog discovery; use this value |
| `--schema` | `--schema prod` | Skip schema discovery; use this value |
| `--runs` | `--runs 5` | Override default benchmark run count (default: 3) |
| `--goals` | `--goals speed,simplicity` | Comma-separated goals; see below |
| `--query <sql-or-path>` | `--query @slow.sql` | SQL string or `@path/to/file.sql`; if omitted, ask the user |
| `--optimized <path>` | `--optimized out.sql` | Write optimized query to a separate file instead of editing the original in place |
| `--timing-count` | `--timing-count` | Wrap timing runs in `COUNT(*)` to avoid collecting large result sets to the driver (use when the query returns millions of rows) |

Record the goals as `GOALS` (default: `speed`). Accepted values:

| Value | Meaning |
|---|---|
| `speed` | Minimize execution time |
| `simplicity` | Minimize complexity score (lines, nesting, subqueries) |
| `speed,simplicity` | Optimize speed first; once accepted, simplify while staying within 10% of best time |

Parse these from the invocation arguments before starting Phase 1. For any option
not supplied, run the normal discovery step.

The `--query` value:

- `@path/to/file.sql` → read the file
- Any other string → treat as inline SQL
- Omitted → ask the user to paste or provide the query
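The parsing rules above can be sketched as follows. This is an illustrative reconstruction, not the skill's actual implementation: the flag names come from the table, but the parser structure, defaults, and the `resolve_query` helper are assumptions.

```python
import argparse

# Illustrative sketch of Phase 0 parsing — flag names follow the option table;
# defaults and structure are assumptions, not the skill's real code.
parser = argparse.ArgumentParser(prog="databricks-sql-autotuner")
parser.add_argument("--cluster-id")
parser.add_argument("--cluster-name")
parser.add_argument("--profile")
parser.add_argument("--catalog")
parser.add_argument("--schema")
parser.add_argument("--runs", type=int, default=3)
parser.add_argument("--goals", default="speed")
parser.add_argument("--optimized")
parser.add_argument("--timing-count", action="store_true")
parser.add_argument("query", nargs="?")   # inline SQL or @path/to/file.sql

def resolve_query(raw):
    """Apply the three --query rules: @file path, inline SQL, or omitted."""
    if raw is None:
        return ("ask_user", None)
    if raw.startswith("@"):
        return ("file", raw[1:])
    return ("inline", raw)

args = parser.parse_args(
    ["--cluster-id", "0408-195905-abc", "--goals", "speed,simplicity",
     "@queries/slow_report.sql"]
)
goals = args.goals.split(",")
```

For any option left at its default, the corresponding discovery step in Phase 1 still runs.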
## Phase 1 — Environment discovery

### 1.1 Verify Databricks CLI is authenticated

```bash
databricks auth profiles
```

If this fails or returns no profiles, stop and ask the user to authenticate:

```bash
databricks auth login --profile <name>
```

### 1.2 Select profile
If `--profile` was provided, skip this step.

Otherwise present all available profiles with their workspace host URLs.
Auto-select if only one exists; otherwise ask.
### 1.3 Select cluster
If `--cluster-id` was provided, use that ID directly and skip the listing.

If `--cluster-name` was provided, resolve it to an ID:

```bash
databricks clusters list --profile <PROFILE> --output json \
  | python3 -c "
import sys, json
clusters = json.load(sys.stdin)
name = '<CLUSTER_NAME>'
match = [c for c in clusters if c.get('cluster_name') == name]
if not match:
    print('ERROR: no cluster named', name, file=sys.stderr); sys.exit(1)
print(match[0]['cluster_id'])
"
```

If neither was provided, list all clusters:

```bash
databricks clusters list --profile <PROFILE> --output json
```

Filter to clusters with `state = RUNNING` or `state = TERMINATED` (can be started).
Present the list (cluster ID, name, state, DBR version).
Auto-select any running general-purpose cluster if one exists; otherwise ask.

If the chosen cluster is terminated, start it:

```bash
databricks clusters start <CLUSTER_ID> --profile <PROFILE>
```
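The filter-and-auto-select rules above can be sketched in a few lines. The `state` and `cluster_id` field names follow the Databricks clusters API JSON; the helper itself is hypothetical.

```python
# Illustrative sketch of the 1.3 selection rules: keep RUNNING or TERMINATED
# clusters, auto-select a running one if present, otherwise fall back to a
# startable one or to asking the user.
def pick_cluster(clusters):
    usable = [c for c in clusters if c.get("state") in ("RUNNING", "TERMINATED")]
    running = [c for c in usable if c["state"] == "RUNNING"]
    if running:
        return running[0]["cluster_id"], False   # usable immediately
    if usable:
        return usable[0]["cluster_id"], True     # needs `databricks clusters start`
    return None, False                           # nothing selectable; ask the user

choice, needs_start = pick_cluster([
    {"cluster_id": "0408-195905-abc", "state": "TERMINATED"},
    {"cluster_id": "0501-120000-def", "state": "RUNNING"},
])
```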
### 1.4 Get DBR version
```bash
databricks clusters get <CLUSTER_ID> --profile <PROFILE> --output json \
  | python3 -c "import sys,json; v=json.load(sys.stdin)['spark_version']; print(v.split('-')[0])"
```

Record as `DBR_VERSION` (e.g., `17.3`). This determines which `databricks-connect` to install.

### 1.5 Catalog / schema context
Run the following to discover the current defaults:

```bash
databricks clusters get <CLUSTER_ID> --profile <PROFILE> --output json \
  | python3 -c "import sys,json; c=json.load(sys.stdin); \
cfg=c.get('spark_conf',{}); \
print('catalog:', cfg.get('spark.databricks.sql.initial.catalog.name','hive_metastore')); \
print('schema:', cfg.get('spark.databricks.sql.initial.catalog.namespace','default'))"
```

Use whatever the cluster reports as `CATALOG` and `SCHEMA`. Only ask the user to
override if the query explicitly references a different catalog/schema or if the
command returns nothing useful.

### 1.6 Benchmark runs
If `--runs` was provided, skip this step and use that value.

Otherwise ask:

> How many benchmark runs per query variant? (default: 3)

Record as `N_RUNS` (default `3`, minimum `2`).

### 1.7 Confirm
Present a summary and wait for confirmation:

| Parameter | Value |
|---|---|
| Profile | `<PROFILE>` |
| Cluster ID | `<CLUSTER_ID>` |
| DBR version | `<DBR_VERSION>` |
| Catalog | `<CATALOG>` |
| Schema | `<SCHEMA>` |
| N runs | `<N_RUNS>` |
## Phase 2 — Environment setup

### 2.0 Resolve the skill directory
The scripts live inside the skill, not the project. Locate them first:

```bash
SKILL_DIR=$(find ~/.claude/skills/databricks-sql-autotuner \
  "$(git rev-parse --show-toplevel 2>/dev/null)/.claude/skills/databricks-sql-autotuner" \
  -maxdepth 0 -type d 2>/dev/null | head -1)
echo "SKILL_DIR=$SKILL_DIR"
```

If neither path exists, check where Claude Code installed the skill:

```bash
find ~ -path "*/.claude/skills/databricks-sql-autotuner" -maxdepth 6 -type d 2>/dev/null | head -1
```

Record as `SKILL_DIR`. All script paths below use `$SKILL_DIR/scripts/`.
### 2.1 Check uv
```bash
uv --version
```

If missing:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

### 2.2 Run setup
```bash
python3 "$SKILL_DIR/scripts/env_setup.py" \
  --dbr-version <DBR_VERSION> \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID>
```

This uses `uv venv` to create `.venv_autotuner/`, installs
`databricks-connect==<DBR_VERSION>.*`, and verifies the connection with a
`SELECT 1` smoke test.

All subsequent commands invoke the venv Python directly:

```bash
VENV_PYTHON=".venv_autotuner/bin/python"   # macOS / Linux
# Windows: VENV_PYTHON=".venv_autotuner\Scripts\python.exe"

TUNE="$VENV_PYTHON $SKILL_DIR/scripts/tune.py"
```

Use `$TUNE` for every `tune.py` invocation from here on.
---

## Phase 2b — Branch & baseline
Every tuning session gets its own branch and a results log. This keeps the work
isolated, makes it easy to compare iterations, and produces a clean git history
that shows exactly what was tried and kept.
### 2b.1 Generate a run ID
Use a short slug derived from the query's purpose — e.g. `sales-summary`, `user-funnel-join`.

```bash
RUN_ID="<query-slug>"                 # e.g. sales-summary
RESULTS_FILE="sqltune-${RUN_ID}.tsv"
LOG_FILE="sqltune-${RUN_ID}.log"
```

### 2b.2 Create a branch
```bash
git checkout -b sql-tune/${RUN_ID}
```

### 2b.3 Establish the working file
Determine `QUERY_FILE` (the file that will be edited each iteration) and
`ORIGINAL_FILE` (the baseline passed to `--original` in tune.py):

| Scenario | ORIGINAL_FILE | QUERY_FILE |
|---|---|---|
| | | |
| | | |
| | | |
| | | |

When `--optimized` is given, copy the original into that file first:

```bash
cp $ORIGINAL_FILE $QUERY_FILE
git add $QUERY_FILE
```

When editing in place (no `--optimized`), `ORIGINAL_FILE == QUERY_FILE` and git tracks the evolution naturally.
### 2b.4 Initialize the results log
--optimizedORIGINAL_FILE == QUERY_FILE2b.4 Initialize the results log
2b.4 初始化结果日志
bash
echo "${RESULTS_FILE}" >> .git/info/exclude
echo "${LOG_FILE}" >> .git/info/excludeCreate :
$RESULTS_FILEattempt commit mean_s speedup status description
0 <sha> <baseline_mean> 1.0 baseline original querybash
echo "${RESULTS_FILE}" >> .git/info/exclude
echo "${LOG_FILE}" >> .git/info/exclude创建:
$RESULTS_FILEattempt commit mean_s speedup status description
0 <sha> <baseline_mean> 1.0 baseline original query2b.5 Run the baseline
```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --original @$QUERY_FILE \
  --optimized @$QUERY_FILE \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --n-runs <N_RUNS> > $LOG_FILE 2>&1
```

Extract `original.mean_s` and record it as attempt `0`. Report:

> Baseline: mean = X.XXs. Branch: `sql-tune/<run-id>`. Starting analysis.
## Phase 3 — Query analysis
Start with the execution plan — it tells you what the optimizer actually decided.
Collect table stats only when the plan raises a question that size or schema data
would answer (e.g. "is this table small enough to broadcast?").
### 3.1 Get the execution plan
```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --original "<QUERY_OR_@FILE>" \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --explain-only
```

Read the plan and identify bottlenecks:

| Bottleneck signal | What to look for |
|---|---|
| BroadcastNestedLoopJoin | Missing join key or cartesian product — add explicit join condition |
| SortMergeJoin | May be improvable with BROADCAST if the smaller side is small enough — check table size |
| Exchange (shuffle) | Heavy repartitioning — check GROUP BY / JOIN key cardinality |
| FileScan with no PartitionFilters | Filter may not be using partition columns — worth checking schema |
| Filter after scan (not pushed) | Move filter earlier or rewrite WHERE to use partition columns |
| Repeated subquery | Extract to CTE or use window function |
| UNION ALL + aggregate | Check if ROLLUP or window can replace |
| UDF calls on large datasets | Consider replacing with built-in Spark SQL functions |
| Large row count with high-null join key | Add an `IS NOT NULL` filter before the join |
| SubqueryAlias / deep nested plan where a FileScan is expected | Join target is a view — read the view definition (step 3.2a) |
| Skewed Exchange — one partition vastly larger than others | GROUP BY key has low or highly uneven cardinality — consider SQL salting (see Phase 4) |
| | Often generates a worse plan than |
| | Photon cannot accelerate these — check for Python UDFs |

After reading the plan, decide: can you already identify the optimization, or do
you need more information about the tables?

- If the bottleneck is clear (e.g. a correlated subquery, missing NULL filter, LOWER() on a filter column) — go straight to Phase 4.
- If the bottleneck depends on physical table properties (file size, partition columns, column cardinality) — collect stats for just those tables in 3.2.
- If a join target appears to be a view (plan shows a nested subquery or a `SubqueryAlias` node where you expect a FileScan) — read the view definition in 3.2a before deciding what to optimize.
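A mechanical first pass over the checklist above can be sketched as a substring scan of the EXPLAIN output. The helper is illustrative; the signal list is a subset of the table's node names.

```python
# Illustrative helper: scan EXPLAIN output for signal strings from the
# bottleneck table above. Extend SIGNALS with the other node names as needed.
SIGNALS = [
    "BroadcastNestedLoopJoin",
    "SortMergeJoin",
    "SubqueryAlias",
    "ArrowEvalPython",
    "BatchEvalPython",
]

def flag_bottlenecks(plan_text):
    return [s for s in SIGNALS if s in plan_text]

plan = (
    "== Physical Plan ==\n"
    "SortMergeJoin [id], [order_id]\n"
    "Exchange hashpartitioning(id, 200)"
)
flags = flag_bottlenecks(plan)
```

A hit is only a starting point — e.g. a `SortMergeJoin` is fine unless the smaller side is broadcastable, which is exactly what step 3.2 checks.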
### 3.2a Read view definitions (when a join target is a view)
The EXPLAIN plan will show a view's underlying query inlined as a subquery tree.
If you see a deep nested plan for what should be a simple table join, the view
itself may be the bottleneck — not the outer query.

To read the view definition:

```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --original "SHOW CREATE TABLE <view_name>" \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --explain-only
```

Or collect it alongside table stats (the `--table-stats` flag works for views too).

When you read a view definition, look for:

- Filters that could be pushed closer to the source scan
- Joins inside the view that are better placed in the outer query (where the caller already filters heavily before joining)
- Aggregations computed in the view that the caller immediately re-aggregates
- DISTINCT or ORDER BY inside the view that serve no purpose when used as a subquery
- Correlated subqueries or scalar subqueries that run once per row

What to do if the view is suboptimal — you have two options; pick the one that is less invasive:

1. Inline the view as a CTE — replace `JOIN my_view ON ...` with `WITH my_view AS (<rewritten definition>) ...` in the optimized query. This avoids changing the view in the catalog and keeps the fix self-contained.
2. Suggest a view rewrite — if the view is used widely and inlining would make the outer query unreadable, write an optimized version of the view definition separately and note it under "Additional recommendations" in the final report. Do not change the view in the catalog.

Always prefer option 1 unless the view is complex enough that inlining it makes
the outer query harder to reason about than it was before.
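At the text level, option 1 is just a string rewrite: prepend the rewritten view definition as a CTE. The helper below is a deliberately naive, hypothetical sketch — it does not handle queries that already begin with their own `WITH` clause (those need the CTE lists merged).

```python
# Illustrative only: option 1 rewrites the query text, replacing a view
# reference with a CTE carrying the rewritten view definition.
def inline_view(query, view_name, rewritten_definition):
    # Naive sketch: assumes `query` has no WITH clause of its own.
    return f"WITH {view_name} AS ({rewritten_definition})\n{query}"

sql = inline_view(
    "SELECT o.order_id, v.total FROM orders o JOIN my_view v ON o.id = v.order_id",
    "my_view",
    "SELECT order_id, SUM(amount) AS total FROM lines GROUP BY order_id",
)
```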
### 3.2 Collect table stats (when needed)
Only run this step when the execution plan leaves a question that physical metadata
would answer. Pass only the tables relevant to the bottleneck — not every table in
the query.

```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --table-stats <TABLE1> [<TABLE2> ...]
```

What this collects per table and when it matters:

| Section | Source | When you need it |
|---|---|---|
| Schema | | Partition columns unknown; filter pushdown uncertain |
| Partition columns | | Plan shows FileScan without PartitionFilters |
| Delta detail | | Plan has SortMergeJoin — need sizeInBytes to decide BROADCAST |
| Table stats | | Optimizer picked a bad join strategy; may have stale/missing stats |
| Column stats | | High Exchange cost; need distinctCount/nullCount to assess skew |

Use these facts to make decisions:

| Fact | Decision |
|---|---|
| Delta detail: small `sizeInBytes` | Safe to BROADCAST this table in a join |
| Delta detail: many small files | Consider Z-ordering or compaction (note as recommendation, not SQL change) |
| Avg file size < 32 MB | Small files problem — predicate pushdown likely hurting more than helping |
| No partition columns | All filters are post-scan; look for other pushdown opportunities |
| Filter column IS a partition column | Verify the WHERE clause actually uses it for pruning |
| Column stats: low `distinctCount` | High-skew risk in GROUP BY or JOIN on that column |
| Column stats: high `nullCount` | NULL-safe joins may inflate row counts unexpectedly |
| Row count from DESCRIBE ≠ actual COUNT(*) | Stats are stale — note this, but do not run ANALYZE (it can be slow) |
## Phase 3.3 — Build an attack plan
Before writing a single line of optimized SQL, write a short numbered list of the
specific changes you intend to make and why. Show it to the user (or include it
in your reasoning) before proceeding to Phase 4.

Each item should name:

- What you will change (e.g. "replace SortMergeJoin with BROADCAST hint on `dim_country`")
- Why (e.g. "Delta detail shows sizeInBytes = 18 MB, well under the 200 MB threshold")
- Expected effect (e.g. "eliminates one Exchange shuffle node")

Example:

```
Attack plan:
1. Add /*+ BROADCAST(dim_country) */ — sizeInBytes 18 MB, plan currently SortMergeJoin → eliminates shuffle
2. Replace LOWER(email) = ? filter with COLLATE UTF8_LCASE — restores Delta file-skipping
3. Extract repeated subquery into CTE — currently executed once per row in the outer SELECT
4. Add IS NOT NULL guard on user_id before the JOIN — 12% null rate inflates shuffle size
```

If you have no concrete changes to make (the plan looks fine or the bottleneck is
outside SQL), say so explicitly and explain what you found.
## Phase 4 — SQL optimization

Output contract: The only thing this skill produces is a rewritten SQL query. No Python code. No cluster config. No DataFrame API. No schema changes. A plain SQL string — same SELECT structure, same columns, same output — just faster. SQL hints (`/*+ BROADCAST(...) */`) and SQL comments are allowed.

Rewrite the query using SQL-level rewrites only.
### What is allowed
- Rewriting JOINs (inner ↔ left, reordering join tables for a better plan)
- Adding / removing CTEs for de-duplication and readability
- Replacing correlated subqueries with joins or window functions
- Adding predicate filters earlier in the plan (push filters closer to the scan)
- Rewriting DISTINCT with GROUP BY where appropriate
- Replacing UDFs with equivalent built-in Spark SQL functions
- Splitting complex expressions to help predicate pushdown
- SQL optimizer hints: `/*+ BROADCAST(t) */`, `/*+ MERGE(t) */`, `/*+ SHUFFLE_HASH(t) */`
- SQL comments (`--` or `/* */`) explaining what the optimization does
- Inlining a view as a CTE — if a join target is a view whose definition is suboptimal, replace the view reference with `WITH view_name AS (<rewritten SQL>)` in the optimized query. The output is still a single self-contained SQL string.
- Rewriting `IN (subquery)` as `EXISTS` or a semi-join — Spark often builds a better plan for `EXISTS`/`LEFT SEMI JOIN` than for `IN` with a large subquery
- Skew salting — when a GROUP BY key is heavily skewed, a two-pass SQL salt distributes the hot partition across workers (see salting pattern below)
### Skew salting pattern
When the plan shows a highly skewed Exchange on a GROUP BY key, salt the aggregation:

```sql
-- Phase 1: aggregate with a random salt to spread the hot key
WITH salted AS (
  SELECT
    key,
    FLOOR(RAND() * 8) AS salt,   -- 8 buckets; tune to cluster core count
    SUM(value) AS partial_sum
  FROM large_table
  GROUP BY key, salt
)
-- Phase 2: collapse the salted buckets
SELECT key, SUM(partial_sum) AS total
FROM salted
GROUP BY key
```

Only use this when column stats or a skewed Exchange confirms one key dominates.
The extra aggregation pass adds overhead on balanced data — don't apply it speculatively.
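The two-pass rewrite is result-preserving because addition is associative: summing the salted partials per key gives the same totals as a direct GROUP BY. A tiny numeric check of that equivalence:

```python
# Numeric sanity check of the salted aggregation above: per-key sums of the
# salted partials equal the direct GROUP BY sums, for any salt assignment.
import random
from collections import defaultdict

rows = [("hot", 1)] * 1000 + [("cold", 2)] * 10

direct = defaultdict(int)
for key, value in rows:
    direct[key] += value

# Pass 1: GROUP BY key, salt
partial = defaultdict(int)
for key, value in rows:
    salt = random.randrange(8)        # FLOOR(RAND() * 8)
    partial[(key, salt)] += value

# Pass 2: collapse the salted buckets
total = defaultdict(int)
for (key, _salt), s in partial.items():
    total[key] += s
```

The same argument covers COUNT and MIN/MAX; AVG must be salted as SUM and COUNT and divided at the end.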
### CTE materialization caveat

Spark does not guarantee that a CTE is computed only once. If a CTE is referenced
multiple times in the query, Spark may re-evaluate it on each reference. This means:

- A CTE used to "cache" an expensive subquery may not help — or may hurt if the subquery is scanned multiple times instead of once.
- If you add a CTE to de-duplicate a repeated subquery and the plan still shows the subquery running multiple times, note `CACHE TABLE` as a recommendation outside SQL scope.
- Prefer CTEs for readability and hint-targeting; don't assume they imply materialization.
### Photon-unfriendly patterns

Databricks Photon accelerates most SQL, but silently falls back to JVM execution for:

| Pattern | Why Photon can't accelerate it |
|---|---|
| Python / Scala UDFs | Black-box execution, no vectorization possible |
| | Higher-order functions not yet Photon-native |
| | Same — lambda expressions bypass Photon |
| | Complex type operations not vectorized |
| Non-deterministic functions in certain contexts | Photon cannot reorder them safely |

If the plan shows `ArrowEvalPython`, `ObjectHashAggregate`, or `BatchEvalPython` nodes,
a Photon fallback is happening. Flag it in the attack plan and suggest replacing with
built-in Spark SQL equivalents where possible.
### What is NOT allowed — never suggest these
- `spark.conf.set(...)` or any cluster/session configuration
- DataFrame / PySpark API code (`.filter()`, `.join()`, `.groupBy()`, etc.)
- DDL changes (`ALTER TABLE`, `OPTIMIZE`, `ZORDER`, `VACUUM`, `ANALYZE TABLE`)
- Any rewrite that changes columns, row count, or values in the output

If a non-SQL improvement (e.g., Z-ordering, cluster sizing) would help, mention it
separately under "Additional recommendations (outside scope)" in the final report —
but never include it in the optimized query itself.
### Hint restriction — CRITICAL for UNION ALL

Spark SQL does not allow `/*+ ... */` hints inside CTE branches that are
part of a `UNION ALL`. Place all optimizer hints on the outermost SELECT only:

```sql
-- ✅ CORRECT
SELECT /*+ BROADCAST(small_table) */ *
FROM big_table
JOIN small_table ON ...

-- ❌ WRONG — hint inside a UNION ALL branch causes a parse error
WITH branch AS (
  SELECT /*+ BROADCAST(t) */ * FROM t   -- this will fail
)
SELECT * FROM branch
UNION ALL
SELECT * FROM other
```
### UDF handling
If the query calls UDFs, check whether each is:

- A persistent SQL function (created with `CREATE FUNCTION` in the catalog) — these work as-is
- A session UDF that must be registered at runtime — create a `udf_setup.py` file (see below)

If session UDFs are needed, create `udf_setup.py` in the current directory:
undefinedudf_setup.py — loaded by tune.py automatically if it exists
udf_setup.py — 如果存在会被tune.py自动加载
from pyspark.sql.functions import udf as spark_udf
from pyspark.sql.types import StringType
def _my_udf(value):
# your implementation
return result
def register_udfs(spark):
spark.udf.register("my_udf_name", spark_udf(_my_udf, StringType()))
`tune.py` checks for `udf_setup.py` in the current directory and calls
`register_udfs(spark)` automatically before running any queries.
---

## Phase 5 — Validate, benchmark & log
### 5.1 Edit and commit
Edit `$QUERY_FILE` directly — the branch is isolated so edits are safe to make in place.

```bash
git add $QUERY_FILE
git commit -m "sql-tune: attempt <N> — <one-line description of what changed>"
```

Commit before benchmarking so every attempt is in the git log regardless of outcome.
直接编辑 — 分支是隔离的,原地编辑是安全的。
$QUERY_FILEbash
git add $QUERY_FILE
git commit -m "sql-tune: attempt <N> — <修改内容的单行描述>"在基准测试前提交,这样无论结果如何,每次尝试都会记录在git日志中。
5.2 Run the benchmark
```bash
$TUNE \
  --profile <PROFILE> \
  --cluster-id <CLUSTER_ID> \
  --original @$QUERY_FILE \
  --optimized @$QUERY_FILE \
  --catalog <CATALOG> \
  --schema <SCHEMA> \
  --n-runs <N_RUNS> > $LOG_FILE 2>&1
```

(Both `--original` and `--optimized` point to the same file — tune.py compares the current version to the baseline it runs internally.)

The script outputs JSON with:
- `validation.passed` — whether results are byte-for-byte identical
- `validation.row_count` — number of rows compared
- `original.mean_s`, `original.std_s` — timing stats for the baseline
- `optimized.mean_s`, `optimized.std_s` — timing stats for the optimized query
- `speedup` — ratio of original mean to optimized mean
- `statistically_significant` — true if the optimized CI is entirely below the original CI
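Consuming that JSON and applying the acceptance rule can be sketched like this. The confidence-interval formula shown (normal approximation) is one common formulation, assumed here for illustration; tune.py's exact method may differ.

```python
import math

def confidence_interval(mean, std, n, z=1.96):
    """Approximate 95% CI for the mean (normal approximation)."""
    half = z * std / math.sqrt(n)
    return (mean - half, mean + half)

def is_real_speedup(report, n_runs):
    """Acceptance rule: identical results, faster on average, and the
    optimized CI entirely below the original CI."""
    if not report["validation"]["passed"]:
        return False
    orig = confidence_interval(report["original"]["mean_s"],
                               report["original"]["std_s"], n_runs)
    opt = confidence_interval(report["optimized"]["mean_s"],
                              report["optimized"]["std_s"], n_runs)
    return report["speedup"] > 1.0 and opt[1] < orig[0]
```

In practice the report dict would come from `json.loads()` over the tune.py output captured in `$LOG_FILE`.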
5.3 Decide: keep or revert
A speedup is real only when:
- `validation.passed = true` (identical results)
- `speedup > 1.0` (optimized is faster on average)
- `statistically_significant = true` (confidence intervals don't overlap)

✅ IMPROVED → keep commit, update BEST_MEAN = optimized.mean_s
❌ SAME/WORSE → revert the file to the last kept commit: git checkout HEAD $QUERY_FILE
💥 VALIDATION FAIL → fix the query and re-benchmark before reverting

5.4 Log the attempt
Append to $RESULTS_FILE:

<N> <commit-sha> <optimized.mean_s> <speedup> <keep|discard> <description>

If validation fails: fix the query and re-benchmark before doing anything else.
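Appending a record in that format can be sketched as below; the `log_attempt` helper name is hypothetical, and the field order follows the template above.

```python
def log_attempt(results_file, n, sha, mean_s, speedup, kept, description):
    """Append one attempt record: number, commit sha, mean runtime,
    speedup, keep/discard status, and a short description."""
    status = "keep" if kept else "discard"
    line = f"{n} {sha} {mean_s:.2f} {speedup:.2f} {status} {description}"
    with open(results_file, "a") as f:
        f.write(line + "\n")
    return line
```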
Phase 5b — Autonomous optimization loop
Run continuously after the first benchmark. Never pause to ask "should I continue?".
Stop only when the user interrupts or explicitly says they're satisfied.
THINK Read $RESULTS_FILE and the current best query.
Study the EXPLAIN of the best query so far — not the original.
Form a specific hypothesis: "changing X should reduce Y because Z."
Follow the strategy priority below.
EDIT Edit $QUERY_FILE directly — one focused change per attempt.
The branch is isolated; edits are safe to make in place.
COMMIT git add $QUERY_FILE && git commit -m "sql-tune: attempt <N> — <description>"
Commit before benchmarking — this records what was tried.
RUN $TUNE \
--profile <PROFILE> --cluster-id <CLUSTER_ID> \
--original @$ORIGINAL_FILE --optimized @$QUERY_FILE \
--catalog <CATALOG> --schema <SCHEMA> \
--n-runs <N_RUNS> > $LOG_FILE 2>&1
# For simplicity / both goals, also score complexity
python3 "$SKILL_DIR/scripts/complexity.py" --json $QUERY_FILE > complexity.json
MEASURE Extract from $LOG_FILE (timing) and complexity.json (score).
On crash: read last 50 lines for the error. Attempt up to 2 quick fixes,
amend the commit, re-run. If still broken, revert and discard.
Metric to optimize depends on GOALS:
┌──────────────────────┬────────────────────────────────────────────────────┐
│ speed │ optimized.mean_s (lower is better) │
│ simplicity │ complexity score (lower is better) │
│ speed,simplicity │ Phase 1: optimize mean_s (same as speed) │
│ │ Phase 2: once speed accepted, optimize score │
│ │ while mean_s stays ≤ BEST_MEAN * 1.10 │
└──────────────────────┴────────────────────────────────────────────────────┘
DECIDE Compare metric to BEST:
speed → ✅ if mean_s improved + statistically significant
simplicity → ✅ if complexity score decreased + validation passes
speed,simplicity → phase 1: same as speed
phase 2: ✅ if score decreased AND mean_s ≤ BEST_MEAN * 1.10
❌ no improvement → git checkout HEAD $QUERY_FILE (revert file, keep commit msg in log)
💥 VALIDATION FAIL → fix semantics first, then re-benchmark
LOG Append to $RESULTS_FILE:
<N>\t<sha>\t<mean_s>\t<complexity_score>\t<status>\t<description>
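The goal-dependent DECIDE rules above can be sketched as a small helper. The dict field names (`mean_s`, `score`, `significant`, `valid`) are assumptions mirroring the loop description, and only phase 2 of the combined goal is modelled here; phase 1 is identical to the `speed` goal.

```python
def accept_attempt(goal, best, attempt, slack=1.10):
    """Decide whether an attempt beats the current best, per the GOALS table."""
    if not attempt["valid"]:
        return False  # a validation failure is never accepted
    if goal == "speed":
        return attempt["mean_s"] < best["mean_s"] and attempt["significant"]
    if goal == "simplicity":
        return attempt["score"] < best["score"]
    if goal == "speed,simplicity":
        # phase 2: lower complexity while staying within 10% of the best time
        return (attempt["score"] < best["score"]
                and attempt["mean_s"] <= best["mean_s"] * slack)
    raise ValueError(f"unknown goal: {goal}")
```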
Strategy priority
- Follow the plan — work through the attack plan from Phase 3.3 in order
- Follow wins — if a hint or rewrite helped, probe further in that direction
- Diversify after 3 consecutive discards — switch to a different bottleneck
- Combine winners — if A and B each improved independently, try A+B together
- Diagnose no-speedup runs — re-EXPLAIN the optimized query; check if AQE already did what the hint asked, or if a Photon fallback is dominating
- Accept the floor — if 5+ consecutive attempts yield no improvement and the remaining ideas are speculative, stop the loop and go to Phase 6
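The "diversify after 3 consecutive discards" rule above can be sketched as a strategy picker; the function and parameter names here are hypothetical.

```python
def next_strategy(history, current, alternatives, patience=3):
    """Stay on the current strategy until `patience` consecutive trailing
    discards, then switch to a different bottleneck from `alternatives`."""
    streak = 0
    for status in reversed(history):  # count trailing discards only
        if status != "discard":
            break
        streak += 1
    if streak < patience:
        return current
    for alt in alternatives:
        if alt != current:
            return alt
    return current  # nothing else to try
```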
For the `simplicity` or `speed,simplicity` goal, strategy shifts focus (in the simplicity phase):
- Prefer removing CTEs that the optimizer re-evaluates anyway
- Inline single-use subqueries where they don't increase nesting
- Replace multi-step transformations with a single built-in function call
- Remove redundant ORDER BY, DISTINCT, or GROUP BY that the caller drops
- Simplicity wins don't require statistical significance — a lower score is enough

Hard constraints that apply to all goals, including simplicity:
- Never introduce `SELECT *` — explicit column lists must be preserved exactly. `SELECT *` hides schema changes, breaks downstream consumers, and can silently change column order. A query with fewer lines but a `SELECT *` is not simpler — it is fragile.
- Never remove columns from the output or reorder them
- Never change NULL semantics or filter behaviour
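The `SELECT *` constraint can be guarded with a crude textual check like the sketch below (a real guard would inspect the parsed plan or output schema rather than the SQL text); the helper name is hypothetical.

```python
import re

def violates_select_star(original_sql, optimized_sql):
    """Flag a rewrite that introduces SELECT * (optionally after an
    optimizer hint comment) when the original query had none."""
    star = re.compile(r"\bselect\s+(/\*\+[^*]*\*/\s*)?\*", re.IGNORECASE)
    return bool(star.search(optimized_sql)) and not star.search(original_sql)
```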
Phase 6 — Report
Present a summary. First print the full attempt log:

```bash
cat $RESULTS_FILE
```

Then show the git log of kept commits:

```bash
git log --oneline <baseline-sha>..HEAD
```

Then present the summary report:
SQL Tuning Report
Run summary
- Attempts: N total / K kept / M discarded
- Baseline: X.XXs → Best: X.XXs (X.Xx speedup)

Attempt log
| # | Mean (s) | Speedup | Status | Description |
|---|---|---|---|---|
| 0 | x.xx | 1.0x | baseline | original query |
| 1 | x.xx | x.xx | keep | ... |
| 2 | x.xx | x.xx | discard | ... |

Query plan bottlenecks (original)
<key findings from EXPLAIN>

Optimizations applied (kept only)
- <change — what + why> ...

Original SQL
<original query>

Optimized SQL
<best optimized query>

Final benchmark
| Variant | Mean (s) | Std (s) | Runs |
|---|---|---|---|
| Original | x.xx | x.xx | N |
| Optimized | x.xx | x.xx | N |
Speedup: X.Xx | Statistically significant: yes/no
Validation: PASS — N rows, identical results

Conclusion
<one paragraph: what changed, why it helped, any caveats>

Additional recommendations (outside SQL scope)
<Z-ordering, ANALYZE TABLE, cluster sizing, etc. — if applicable>
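The attempt-log table in the report can be generated mechanically from the tab-separated $RESULTS_FILE written in Phase 5b. A minimal sketch, assuming the first row is the baseline and deriving each speedup from the baseline mean (the `attempt_table` helper name is hypothetical):

```python
def attempt_table(results_path):
    """Render the tab-separated results file as the report's markdown
    attempt-log table, computing speedup against the first (baseline) row."""
    rows = ["| # | Mean (s) | Speedup | Status | Description |",
            "|---|---|---|---|---|"]
    baseline = None
    with open(results_path) as f:
        for line in f:
            n, sha, mean_s, score, status, desc = line.rstrip("\n").split("\t")
            mean = float(mean_s)
            baseline = baseline if baseline is not None else mean
            rows.append(f"| {n} | {mean:.2f} | {baseline / mean:.2f}x | {status} | {desc} |")
    return "\n".join(rows)
```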
---

Troubleshooting
| Error | Fix |
|---|---|
|  | Re-authenticate: |
|  | Verify cluster ID: |
| Version mismatch | Ensure |
|  | Delete the venv and re-run |
| Validation diff on floats | May be floating-point non-determinism — check with |
|  | Query returns millions of rows — add |
| Hint parse error in UNION ALL | Move hint to outermost SELECT (see Phase 4 restriction) |
| Session UDF not found | Create |
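For the "Validation diff on floats" row above, a tolerance-based row comparison can confirm whether a diff is only floating-point noise. A minimal sketch (the `rows_match` helper name is hypothetical; the tolerances shown are illustrative defaults):

```python
import math

def rows_match(row_a, row_b, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two result rows cell by cell, treating float cells as
    equal within the given tolerance and all other cells exactly."""
    if len(row_a) != len(row_b):
        return False
    for a, b in zip(row_a, row_b):
        if isinstance(a, float) or isinstance(b, float):
            if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
        elif a != b:
            return False
    return True
```

If all mismatched rows pass this check, the rewrite is likely semantically correct and only differs by accumulation order in aggregates.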