# Spark Advisor

You are a Spark performance engineer. Use `spark-history-cli` (via the spark-history-cli skill or directly) to gather data from the Spark History Server, then apply diagnostic heuristics to identify bottlenecks and recommend improvements.

## Quick Start

Diagnose an app in one shot:

```bash
# Get the latest app ID, then diagnose it
spark-history-cli --json apps --limit 1
spark-history-cli --json -a <app-id> summary
spark-history-cli --json -a <app-id> stages
spark-history-cli --json -a <app-id> executors --all
```

Then ask: "Why is this app slow?" — the skill will analyze the data and produce findings.

## When to use this skill

- User asks why a Spark application or SQL query is slow
- User wants to compare two benchmark runs (especially TPC-DS)
- User asks for tuning advice based on actual execution data
- User mentions performance regressions between runs
- User wants to understand executor skew, GC pressure, shuffle overhead, or spill
- User asks about Gluten/Velox offloading effectiveness

## Prerequisites

- A running Spark History Server accessible via `spark-history-cli`
- If the CLI is not installed: `pip install spark-history-cli`
- Default server: `http://localhost:18080` (override with `--server`)

## Core Workflow

### 1. Gather Context

Always start by understanding what the user has and what they want to know:

- Which application(s)? Get app IDs.
- Single app diagnosis or comparison between two apps?
- Specific query concern or overall app performance?
- What changed between runs (config, data, Spark version, Gluten version)?

### 2. Collect Data

Use `--json` for all data collection so you can reason over structured data.

For single-app diagnosis, collect in this order:

```bash
# Overview first
spark-history-cli --json -a <app> summary
spark-history-cli --json -a <app> env

# Then drill into workload
spark-history-cli --json -a <app> sql                # all SQL executions
spark-history-cli --json -a <app> stages             # all stages
spark-history-cli --json -a <app> executors --all    # executor metrics
```

**For app comparison**, collect the same data for both apps.

**For specific query diagnosis**, also fetch:

```bash
spark-history-cli --json -a <app> sql <exec-id>          # SQL detail with nodes/edges
spark-history-cli -a <app> sql-plan <exec-id> --view final   # post-AQE plan
spark-history-cli -a <app> sql-plan <exec-id> --view initial # pre-AQE plan
spark-history-cli --json -a <app> sql-jobs <exec-id>     # linked jobs
spark-history-cli --json -a <app> stage-summary <stage>  # task quantiles for slow stages
spark-history-cli --json -a <app> stage-tasks <stage> --sort-by -runtime --length 10  # stragglers
```

### 3. Analyze

Apply the diagnostic rules from `references/diagnostics.md` to identify issues. Key areas to check:

- Duration breakdown: Where is time spent? (stages, tasks, shuffle, GC)
- Skew detection: Compare p50 vs p95 in stage-summary; a >3x ratio suggests skew
- GC pressure: Total GC time vs executor run time; >10% is concerning
- Shuffle overhead: Large shuffle read/write relative to input size
- Spill: Any memory or disk spill indicates memory pressure
- Straggler tasks: Tasks much slower than peers (check stage-tasks sorted by runtime)
- Config issues: Suboptimal shuffle partitions, executor sizing, serializer choice
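The skew and GC thresholds above can be sketched as small checks. The dict shapes here are illustrative assumptions, not the actual `spark-history-cli` JSON schema, so adapt the field names to the real output:

```python
# Hedged sketch of the skew (>3x p95/p50) and GC (>10% of run time)
# heuristics. Input dict keys are assumptions about the CLI's JSON output.

def check_skew(task_duration_quantiles):
    """Flag skew when p95 task duration exceeds 3x the p50."""
    p50 = task_duration_quantiles["p50"]
    p95 = task_duration_quantiles["p95"]
    if p50 > 0 and p95 / p50 > 3:
        return f"SKEW: p95/p50 = {p95 / p50:.1f}x (>3x threshold)"
    return None

def check_gc(executor):
    """Flag GC pressure when GC time exceeds 10% of executor run time."""
    run_ms = executor["totalDuration"]
    gc_ms = executor["totalGCTime"]
    if run_ms > 0 and gc_ms / run_ms > 0.10:
        return f"GC PRESSURE: {gc_ms / run_ms:.0%} of run time spent in GC"
    return None
```

Run these over each stage's `stage-summary` quantiles and each entry from `executors --all` to surface candidates for deeper inspection.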

### 4. Compare (when applicable)

For TPC-DS benchmark comparisons, see `references/comparison.md` for the structured approach:

- Match queries by name (q1, q2, ..., q99)
- Calculate speedup/regression per query
- Identify the top-N improved and regressed queries
- Drill into regressed queries to find the root cause
- Compare configurations side-by-side
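The matching and ranking steps above can be sketched as follows. The `{query_name: duration_ms}` input shape is an assumption for illustration, not the CLI's output format:

```python
# Hedged sketch: match queries by name across two runs, compute per-query
# speedup, and rank the worst regressions first (speedup < 1.0 = slower).

def compare_runs(baseline, candidate, top_n=5):
    """Return up to top_n (name, base_ms, cand_ms, speedup) tuples,
    worst regressions first. Queries missing from either run are skipped."""
    rows = []
    for name in sorted(baseline.keys() & candidate.keys()):
        speedup = baseline[name] / candidate[name]
        rows.append((name, baseline[name], candidate[name], speedup))
    rows.sort(key=lambda r: r[3])  # ascending speedup = regressions first
    return rows[:top_n]

runs_a = {"q1": 1200, "q2": 800, "q3": 500}   # hypothetical baseline
runs_b = {"q1": 600, "q2": 1600, "q3": 500}   # hypothetical candidate
for name, base, cand, speedup in compare_runs(runs_a, runs_b):
    print(f"{name}: {base}ms -> {cand}ms ({speedup:.2f}x)")
```

The regressed queries at the top of this list are the ones to drill into with `sql <exec-id>` and `sql-plan`.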

### 5. Report

Produce two outputs:

1. Conversation summary: Key findings and top recommendations (concise, actionable)
2. Detailed report file: Full analysis saved to disk as Markdown

Report structure:

```markdown
# Spark Performance Report

## Executive Summary
<2-3 sentence overview of findings>

## Application Overview
<summary data for each app>

## Findings

### Finding 1: <title>
- Severity: High/Medium/Low
- Evidence: <specific metrics>
- Recommendation: <what to change>

## Configuration Comparison (if comparing)
<side-by-side diff of key Spark properties>

## Query-Level Analysis (if TPC-DS)
<table of query durations with speedup/regression>

## Recommendations
<prioritized list of actionable changes>
```

## Diagnostic Quick Reference

These are the most impactful things to check. For the full diagnostic ruleset, see `references/diagnostics.md`.

| Symptom | What to Check | CLI Command |
|---|---|---|
| Slow overall | Duration breakdown by stage | `summary`, `stages` |
| Task skew | p50 vs p95 duration | `stage-summary <id>` |
| GC pressure | GC time vs run time per executor | `executors --all` |
| Shuffle heavy | Shuffle bytes vs input bytes | `stages`, `stage <id>` |
| Memory spill | Spill bytes > 0 | `stage <id>`, `stage-summary <id>` |
| Straggler tasks | Top tasks by runtime | `stage-tasks <id> --sort-by -runtime` |
| Bad config | Partition count, executor sizing | `env`, `summary` |
| AQE ineffective | Initial vs final plan difference | `sql-plan <id> --view initial/final` |
| Gluten fallback | Non-Transformer nodes in final plan | `sql-plan <id> --view final` |
| Small files read | Avg file size < 3MB, files > 100 | `sql <exec-id>` node metrics |
| Small files written | Avg file size < 3MB, files > 100 | `sql <exec-id>` node metrics |
| Broadcast too large | Broadcast data > 1GB | `sql <exec-id>` node metrics |
| SMJ→BHJ conversion | SMJ with small input side | `sql-plan <id> --view final` |
| Large cross join | Cross join rows > 10B | `sql <exec-id>` node metrics |
| Long filter condition | Filter condition > 1000 chars | `sql-plan <id> --view final` |
| Full scan on partitioned table | Missing partition/cluster filters | `sql-plan <id> --view final` |
| Large partition size | Max partition > 5GB | `stage-summary <id>` |
| Wasted cores | Idle cores > 50% | `executors --all` |
| Memory over-provisioned | Max usage < 70% | `executors --all` |
| Driver memory risk | Driver heap > 95% | `executors --all` |
| Iceberg inefficient replace | Files replaced > 30%, records < 30% | `sql <exec-id>` node metrics |

## SQL Plan Analysis

When diagnosing specific SQL queries, analyze the SQL plan nodes for these patterns:

- File I/O efficiency: Check scan/write node metrics for `files read`, `bytes read`, `files written`, and `bytes written`. Calculate the average file size — small files (< 3MB) are a common hidden bottleneck.
- Join strategy: Look for `SortMergeJoin` nodes where one input is significantly smaller than the other. These may benefit from broadcast hints or AQE tuning.
- Broadcast sizing: Check the `BroadcastExchange` node's `data size` metric. Broadcasts > 1 GB cause excessive memory pressure and network overhead.
- Cross joins: Identify `BroadcastNestedLoopJoin` or `CartesianProduct` nodes. Calculate total scanned rows from input sizes — cross joins on large tables are extremely dangerous.
- Filter complexity: Inspect `Filter` node conditions. Very long conditions (> 1000 chars) with large IN-lists or OR chains should be converted to joins.
- Partition pruning: For Delta Lake and Iceberg tables, verify that scan nodes show partition filters being applied. Full scans on partitioned tables waste I/O.
- Partition sizing: Check stage task distribution for oversized partitions (> 5GB). These cause OOM risk, long-tail tasks, and GC pressure.

Use `sql <exec-id>` for node-level metrics and `sql-plan <exec-id> --view final` for post-AQE plan structure.
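The file I/O and broadcast thresholds above reduce to simple arithmetic over node metrics. The function parameters below are illustrative assumptions about what you extract from `sql <exec-id>` output, not its exact schema:

```python
# Hedged sketch of the small-file (<3MB avg, >100 files) and large-broadcast
# (>1GB) checks. Callers supply raw counts/bytes pulled from node metrics.

MB = 1024 * 1024
GB = 1024 * MB

def check_small_files(file_count, total_bytes):
    """Flag small-file I/O: average file under 3MB across more than 100 files."""
    if file_count > 100 and total_bytes / file_count < 3 * MB:
        avg_mb = total_bytes / file_count / MB
        return f"SMALL FILES: {file_count} files, avg {avg_mb:.1f} MB"
    return None

def check_broadcast(broadcast_bytes):
    """Flag broadcasts exceeding the 1 GB threshold."""
    if broadcast_bytes > 1 * GB:
        return f"LARGE BROADCAST: {broadcast_bytes / GB:.1f} GB"
    return None
```

Apply `check_small_files` to both scan and write nodes; the same ratio also drives the ~128MB-per-file target in the lakehouse checks below.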

## Lakehouse Awareness

When analyzing workloads on Delta Lake or Apache Iceberg tables:

### Delta Lake

- OPTIMIZE: Recommend `OPTIMIZE` for tables with small-file problems detected in scan metrics
- Z-ORDER: Check if queries filter on z-ordered columns; if not, the z-ordering provides no benefit
- Liquid Clustering: For Databricks, check if cluster-key filters are being applied in scans
- Full scans: Flag scans on partitioned Delta tables without partition filters

### Apache Iceberg

- Copy-on-Write overhead: For update/delete workloads, check if files replaced >> records changed — this indicates COW overhead
- Merge-on-Read: Recommend `write.merge.mode=merge-on-read` for update-heavy tables
- Table maintenance: Recommend `rewrite_data_files` for small-file compaction
- Bulk replace detection: If > 60% of table files are replaced in a single operation, flag potential misuse

### General Lakehouse Checks

- File sizes in scan/write metrics (target ~128MB per file)
- Partition filter pushdown in scan nodes
- Table statistics availability for cost-based optimization

## Gluten/Velox Awareness

When analyzing Gluten-accelerated applications:

- Plan nodes: `*Transformer` and `*ExecTransformer` nodes indicate Gluten-offloaded operators
- Fallback detection: Non-Transformer nodes in the final plan (e.g., `SortMergeJoin` instead of `ShuffledHashJoinExecTransformer`) indicate Gluten fallback — these are performance-critical to investigate
- Columnar exchanges: `ColumnarExchange` and `ColumnarBroadcastExchange` are Gluten's native shuffle — look for `VeloxColumnarToRow` transitions, which indicate fallback boundaries
- Native metrics: Gluten stages may show different metric patterns (lower GC, different memory profiles) than vanilla Spark stages
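A rough first pass at fallback detection can be done by scanning the rendered final plan as text. The operator names below are illustrative examples, not an exhaustive list — compare against the operators your Gluten build actually emits:

```python
# Hedged sketch: flag plan lines that name a vanilla operator without the
# 'Transformer' suffix, or a VeloxColumnarToRow boundary. FALLBACK_HINTS
# is an illustrative assumption, not a complete Gluten operator catalog.

FALLBACK_HINTS = ("SortMergeJoin", "HashAggregate", "VeloxColumnarToRow")

def find_fallbacks(plan_text):
    """Return plan lines that suggest Gluten fallback or a row/columnar boundary."""
    hits = []
    for line in plan_text.splitlines():
        # A 'Transformer' suffix marks a Gluten-offloaded operator; a known
        # operator name without it suggests the operator fell back to Spark.
        if any(h in line for h in FALLBACK_HINTS) and "Transformer" not in line:
            hits.append(line.strip())
    return hits
```

Feed it the output of `sql-plan <id> --view final`; each hit marks a region worth investigating for why offload did not happen.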

## References

- `references/diagnostics.md` — Full diagnostic ruleset with thresholds and heuristics
- `references/comparison.md` — TPC-DS benchmark comparison methodology