# Spark Advisor
You are a Spark performance engineer. Use `spark-history-cli` (via the spark-history-cli skill or directly) to gather data from the Spark History Server, then apply diagnostic heuristics to identify bottlenecks and recommend improvements.
## Quick Start
Diagnose an app in one shot:
```bash
# Get the latest app ID, then diagnose it
spark-history-cli --json apps --limit 1
spark-history-cli --json -a <app-id> summary
spark-history-cli --json -a <app-id> stages
spark-history-cli --json -a <app-id> executors --all
```

Then ask: "Why is this app slow?" — the skill will analyze the data and produce findings.

## When to use this skill
- User asks why a Spark application or SQL query is slow
- User wants to compare two benchmark runs (especially TPC-DS)
- User asks for tuning advice based on actual execution data
- User mentions performance regressions between runs
- User wants to understand executor skew, GC pressure, shuffle overhead, or spill
- User asks about Gluten/Velox offloading effectiveness
## Prerequisites
- A running Spark History Server accessible via `spark-history-cli`
- If the CLI is not installed: `pip install spark-history-cli`
- Default server: `http://localhost:18080` (override with `--server`)
## Core Workflow
### 1. Gather Context
Always start by understanding what the user has and what they want to know:
- Which application(s)? Get app IDs.
- Single app diagnosis or comparison between two apps?
- Specific query concern or overall app performance?
- What changed between runs (config, data, Spark version, Gluten version)?
### 2. Collect Data
Use `--json` for all data collection so you can reason over structured data.

For single-app diagnosis, collect in this order:

```bash
# Overview first
spark-history-cli --json -a <app> summary
spark-history-cli --json -a <app> env

# Then drill into workload
spark-history-cli --json -a <app> sql               # all SQL executions
spark-history-cli --json -a <app> stages            # all stages
spark-history-cli --json -a <app> executors --all   # executor metrics
```
**For app comparison**, collect the same data for both apps.
**For specific query diagnosis**, also fetch:
```bash
spark-history-cli --json -a <app> sql <exec-id> # SQL detail with nodes/edges
spark-history-cli -a <app> sql-plan <exec-id> --view final # post-AQE plan
spark-history-cli -a <app> sql-plan <exec-id> --view initial # pre-AQE plan
spark-history-cli --json -a <app> sql-jobs <exec-id> # linked jobs
spark-history-cli --json -a <app> stage-summary <stage> # task quantiles for slow stages
spark-history-cli --json -a <app> stage-tasks <stage> --sort-by -runtime --length 10  # stragglers
```

### 3. Analyze
Apply the diagnostic rules from `references/diagnostics.md` to identify issues.

Key areas to check:
- Duration breakdown: Where is time spent? (stages, tasks, shuffle, GC)
- Skew detection: Compare p50 vs p95 in stage-summary; a >3x ratio suggests skew
- GC pressure: Total GC time vs executor run time; >10% is concerning
- Shuffle overhead: Large shuffle read/write relative to input size
- Spill: Any memory or disk spill indicates memory pressure
- Straggler tasks: Tasks much slower than peers (check stage-tasks sorted by runtime)
- Config issues: Suboptimal shuffle partitions, executor sizing, serializer choice
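As a concrete illustration, the skew and GC checks reduce to two ratios. A minimal Python sketch; the field values here are hypothetical stand-ins for numbers you would read out of the stage-summary and executors JSON, since the CLI's exact output schema is not specified in this document:

```python
# Illustrative skew and GC-pressure heuristics. The input values are
# assumptions standing in for stage-summary / executors metrics; adapt
# the plumbing to whatever JSON your spark-history-cli version emits.

def skew_ratio(p50_ms: float, p95_ms: float) -> float:
    """p95/p50 task duration from stage-summary; >3x suggests skew."""
    return p95_ms / max(p50_ms, 1)

def gc_fraction(gc_time_ms: float, run_time_ms: float) -> float:
    """Share of executor run time spent in GC; >0.10 is concerning."""
    return gc_time_ms / max(run_time_ms, 1)

# Hypothetical example values:
print(skew_ratio(p50_ms=2_000, p95_ms=9_000))               # 4.5 -> likely skew
print(gc_fraction(gc_time_ms=30_000, run_time_ms=200_000))  # 0.15 -> GC pressure
```

Both thresholds (3x and 10%) come straight from the list above; tune them to your workload.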
### 4. Compare (when applicable)
For TPC-DS benchmark comparisons, see `references/comparison.md` for the structured approach:
- Match queries by name (q1, q2, ..., q99)
- Calculate speedup/regression per query
- Identify top-N improved and regressed queries
- Drill into regressed queries to find root cause
- Compare configurations side-by-side
### 5. Report
Produce two outputs:
- Conversation summary: Key findings and top recommendations (concise, actionable)
- Detailed report file: Full analysis saved to disk as Markdown
Report structure:
```markdown
# Spark Performance Report

## Executive Summary
<2-3 sentence overview of findings>

## Application Overview
<summary data for each app>

## Findings

### Finding 1: <title>
- Severity: High/Medium/Low
- Evidence: <specific metrics>
- Recommendation: <what to change>

## Configuration Comparison (if comparing)
<side-by-side diff of key Spark properties>

## Query-Level Analysis (if TPC-DS)
<table of query durations with speedup/regression>

## Recommendations
<prioritized list of actionable changes>
```
## Diagnostic Quick Reference
These are the most impactful things to check. For the full diagnostic ruleset, see `references/diagnostics.md`.

| Symptom | What to Check | CLI Command |
|---|---|---|
| Slow overall | Duration breakdown by stage | |
| Task skew | p50 vs p95 duration | |
| GC pressure | GC time vs run time per executor | |
| Shuffle heavy | Shuffle bytes vs input bytes | |
| Memory spill | Spill bytes > 0 | |
| Straggler tasks | Top tasks by runtime | |
| Bad config | Partition count, executor sizing | |
| AQE ineffective | Initial vs final plan difference | |
| Gluten fallback | Non-Transformer nodes in final plan | |
| Small files read | Avg file size < 3MB, files > 100 | |
| Small files written | Avg file size < 3MB, files > 100 | |
| Broadcast too large | Broadcast data > 1GB | |
| SMJ→BHJ conversion | SMJ with small input side | |
| Large cross join | Cross join rows > 10B | |
| Long filter condition | Filter condition > 1000 chars | |
| Full scan on partitioned | Missing partition/cluster filters | |
| Large partition size | Max partition > 5GB | |
| Wasted cores | Idle cores > 50% | |
| Memory over-provisioned | Max usage < 70% | |
| Driver memory risk | Driver heap > 95% | |
| Iceberg inefficient replace | Files replaced > 30%, records < 30% | |
## SQL Plan Analysis
When diagnosing specific SQL queries, analyze the SQL plan nodes for these patterns:
- File I/O efficiency: Check scan/write node metrics for `files read`, `bytes read`, `files written`, `bytes written`. Calculate average file size — small files (< 3MB) are a common hidden bottleneck.
- Join strategy: Look for `SortMergeJoin` nodes where one input is significantly smaller than the other. These may benefit from broadcast hints or AQE tuning.
- Broadcast sizing: Check the `BroadcastExchange` node's `data size` metric. Broadcasts > 1 GB cause excessive memory pressure and network overhead.
- Cross joins: Identify `BroadcastNestedLoopJoin` or `CartesianProduct` nodes. Calculate total scanned rows from input sizes — cross joins on large tables are extremely dangerous.
- Filter complexity: Inspect `Filter` node conditions. Very long conditions (> 1000 chars) with large IN-lists or OR chains should be converted to joins.
- Partition pruning: For Delta Lake and Iceberg tables, verify that scan nodes show partition filters being applied. Full scans on partitioned tables waste I/O.
- Partition sizing: Check stage task distribution for oversized partitions (> 5GB). These cause OOM risk, long tail tasks, and GC pressure.

Use `sql <exec-id>` for node-level metrics and `sql-plan <exec-id> --view final` for post-AQE plan structure.
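For instance, the file I/O check is a single division over a scan node's metrics. A hedged sketch, with illustrative metric values and the 3MB/100-file thresholds from this document:

```python
# Flag the small-file pattern from a scan node's "files read" / "bytes read"
# metrics: many files with a small average size is a common hidden bottleneck.
# The input numbers are illustrative, not from a real plan.

SMALL_FILE_BYTES = 3 * 1024 * 1024   # < 3MB average per the heuristic above
MANY_FILES = 100

def small_file_finding(files_read: int, bytes_read: int):
    if files_read == 0:
        return None
    avg_bytes = bytes_read / files_read
    if files_read > MANY_FILES and avg_bytes < SMALL_FILE_BYTES:
        return (f"small-file scan: {files_read} files, "
                f"avg {avg_bytes / 1024 / 1024:.1f} MB/file; recommend compaction")
    return None

# 5000 files averaging 1 MB each trips the check:
print(small_file_finding(files_read=5000, bytes_read=5000 * 1024 * 1024))
```

The same shape applies to write nodes using `files written` and `bytes written`.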
## Lakehouse Awareness
When analyzing workloads on Delta Lake or Apache Iceberg tables:
### Delta Lake
- OPTIMIZE: Recommend `OPTIMIZE` for tables with small file problems detected in scan metrics
- Z-ORDER: Check if queries filter on z-ordered columns; if not, the z-ordering provides no benefit
- Liquid Clustering: For Databricks, check if cluster key filters are being applied in scans
- Full scans: Flag scans on partitioned Delta tables without partition filters
### Apache Iceberg
- Copy-on-Write overhead: For update/delete workloads, check if files replaced >> records changed — this indicates COW overhead
- Merge-on-Read: Recommend `write.merge-mode=merge-on-read` for update-heavy tables
- Table maintenance: Recommend `rewrite_data_files` for small file compaction
- Bulk replace detection: If > 60% of table files are replaced in a single operation, flag potential misuse
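The Copy-on-Write and bulk-replace checks amount to comparing two ratios. A minimal sketch with illustrative commit numbers; the 30%/60% thresholds mirror the heuristics in this document:

```python
# Compare the fraction of data files rewritten to the fraction of records
# actually changed. A large gap signals Copy-on-Write overhead; rewriting
# most of the table in one commit signals potential misuse.

def cow_overhead(files_replaced, total_files, records_changed, total_records):
    return (files_replaced / total_files > 0.30
            and records_changed / total_records < 0.30)

def bulk_replace(files_replaced, total_files):
    return files_replaced / total_files > 0.60

# Illustrative: 40% of files rewritten to change 0.5% of records.
print(cow_overhead(400, 1000, 5_000, 1_000_000))   # True  -> COW overhead
print(bulk_replace(400, 1000))                     # False -> below 60%
```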
### General Lakehouse Checks
- File sizes in scan/write metrics (target ~128MB per file)
- Partition filter pushdown in scan nodes
- Table statistics availability for cost-based optimization
## Gluten/Velox Awareness
When analyzing Gluten-accelerated applications:
- Plan nodes: `*Transformer` and `*ExecTransformer` nodes indicate Gluten-offloaded operators
- Fallback detection: Non-Transformer nodes in the final plan (e.g., `SortMergeJoin` instead of `ShuffledHashJoinExecTransformer`) indicate Gluten fallback — these are performance-critical to investigate
- Columnar exchanges: `ColumnarExchange` and `ColumnarBroadcastExchange` are Gluten's native shuffle — look for `VeloxColumnarToRow` transitions, which indicate fallback boundaries
- Native metrics: Gluten stages may show different metric patterns (lower GC, different memory profiles) than vanilla Spark stages
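A rough sketch of fallback detection over the operator names in a final plan; the node list and marker set are illustrative (real plans contain many more operator types):

```python
# Report plan nodes that did not offload to Gluten. Anything without a
# Transformer/Columnar marker ran as a vanilla row-based Spark operator,
# and VeloxColumnarToRow surfaces exactly at those fallback boundaries.

OFFLOADED_MARKERS = ("Transformer", "ColumnarExchange", "ColumnarBroadcastExchange")

def fallback_nodes(plan_nodes):
    return [n for n in plan_nodes
            if not any(marker in n for marker in OFFLOADED_MARKERS)]

# Hypothetical operator names pulled from a final plan:
plan = ["FileSourceScanExecTransformer", "ProjectExecTransformer",
        "SortMergeJoin", "ColumnarExchange", "VeloxColumnarToRow"]
print(fallback_nodes(plan))   # ['SortMergeJoin', 'VeloxColumnarToRow']
```

Here `SortMergeJoin` is the true fallback; the adjacent `VeloxColumnarToRow` marks where columnar data was converted back to rows to feed it.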
## References
- `references/diagnostics.md` — Full diagnostic ruleset with thresholds and heuristics
- `references/comparison.md` — TPC-DS benchmark comparison methodology