spark-consumption-cli
**Update Check — ONCE PER SESSION (mandatory).** The first time this skill is used in a session, run the check-updates skill before proceeding.
  • GitHub Copilot CLI / VS Code: invoke the check-updates skill.
  • Claude Code / Cowork / Cursor / Windsurf / Codex: compare local vs remote package.json version.
  • Skip if the check was already performed earlier in this session.
CRITICAL NOTES
  1. To find workspace details (including its ID) from a workspace name: list all workspaces, then filter with JMESPath.
  2. To find item details (including its ID) from a workspace ID, item type, and item name: list all items of that type in the workspace, then filter with JMESPath.
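Both lookups are list-then-filter operations. This Python sketch mirrors the JMESPath expression typically passed to `az rest --query` (for example `value[?displayName=='Sales'].id | [0]`); the payloads and names below are illustrative, not real API output.

```python
def find_id_by_name(items, name):
    """Return the id of the first item whose displayName matches, else None.

    Equivalent to the JMESPath filter "value[?displayName=='<name>'].id | [0]"
    applied to a Fabric list response.
    """
    matches = [item["id"] for item in items if item.get("displayName") == name]
    return matches[0] if matches else None


# Illustrative payloads shaped like the "value" arrays in list responses.
workspaces = [
    {"displayName": "Sales", "id": "ws-111"},
    {"displayName": "Finance", "id": "ws-222"},
]
lakehouses = [
    {"displayName": "Bronze", "id": "lh-aaa", "type": "Lakehouse"},
]

workspace_id = find_id_by_name(workspaces, "Sales")  # step 1: workspace name -> ID
item_id = find_id_by_name(lakehouses, "Bronze")      # step 2: item name -> ID
print(workspace_id, item_id)  # ws-111 lh-aaa
```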

# Data Engineering Consumption — CLI Skill

## Table of Contents

| Task | Reference | Notes |
| --- | --- | --- |
| Fabric Topology & Key Concepts | COMMON-CORE.md § Fabric Topology & Key Concepts | |
| Environment URLs | COMMON-CORE.md § Environment URLs | |
| Authentication & Token Acquisition | COMMON-CORE.md § Authentication & Token Acquisition | Wrong audience = 401; read before any auth issue |
| Core Control-Plane REST APIs | COMMON-CORE.md § Core Control-Plane REST APIs | |
| Pagination | COMMON-CORE.md § Pagination | |
| Long-Running Operations (LRO) | COMMON-CORE.md § Long-Running Operations (LRO) | |
| Rate Limiting & Throttling | COMMON-CORE.md § Rate Limiting & Throttling | |
| OneLake Data Access | COMMON-CORE.md § OneLake Data Access | Requires `storage.azure.com` token, not Fabric token |
| Job Execution | COMMON-CORE.md § Job Execution | |
| Capacity Management | COMMON-CORE.md § Capacity Management | |
| Gotchas & Troubleshooting | COMMON-CORE.md § Gotchas & Troubleshooting | |
| Best Practices | COMMON-CORE.md § Best Practices | |
| Tool Selection Rationale | COMMON-CLI.md § Tool Selection Rationale | |
| Finding Workspaces and Items in Fabric | COMMON-CLI.md § Finding Workspaces and Items in Fabric | **Mandatory**: read the linked section first (needed for finding a workspace ID by name, or an item ID by name, item type, and workspace ID) |
| Authentication Recipes | COMMON-CLI.md § Authentication Recipes | `az login` flows and token acquisition |
| Fabric Control-Plane API via `az rest` | COMMON-CLI.md § Fabric Control-Plane API via az rest | Always pass `--resource https://api.fabric.microsoft.com` or `az rest` fails |
| Pagination Pattern | COMMON-CLI.md § Pagination Pattern | |
| Long-Running Operations (LRO) Pattern | COMMON-CLI.md § Long-Running Operations (LRO) Pattern | |
| OneLake Data Access via `curl` | COMMON-CLI.md § OneLake Data Access via curl | Use `curl`, not `az rest` (different token audience) |
| SQL / TDS Data-Plane Access | COMMON-CLI.md § SQL / TDS Data-Plane Access | `sqlcmd` (Go): connect, query, CSV export |
| Job Execution (CLI) | COMMON-CLI.md § Job Execution | |
| OneLake Shortcuts | COMMON-CLI.md § OneLake Shortcuts | |
| Capacity Management (CLI) | COMMON-CLI.md § Capacity Management | |
| Composite Recipes | COMMON-CLI.md § Composite Recipes | |
| Gotchas & Troubleshooting (CLI-Specific) | COMMON-CLI.md § Gotchas & Troubleshooting (CLI-Specific) | `az rest` audience, shell escaping, token expiry |
| Quick Reference: `az rest` Template | COMMON-CLI.md § Quick Reference: az rest Template | |
| Quick Reference: Token Audience / CLI Tool Matrix | COMMON-CLI.md § Quick Reference: Token Audience ↔ CLI Tool Matrix | Which `--resource` + tool for each service |
| Relationship to SPARK-AUTHORING-CORE.md | SPARK-CONSUMPTION-CORE.md § Relationship to SPARK-AUTHORING-CORE.md | |
| Data Engineering Consumption Capability Matrix | SPARK-CONSUMPTION-CORE.md § Data Engineering Consumption Capability Matrix | |
| OneLake Table APIs (Schema-enabled Lakehouses) | SPARK-CONSUMPTION-CORE.md § OneLake Table APIs (Schema-enabled Lakehouses) | Unity Catalog-compatible metadata; requires `storage.azure.com` token |
| Livy Session Management | SPARK-CONSUMPTION-CORE.md § Livy Session Management | Session creation, states, lifecycle, termination |
| Interactive Data Exploration | SPARK-CONSUMPTION-CORE.md § Interactive Data Exploration | Statement execution, output retrieval, data discovery |
| PySpark Analytics Patterns | SPARK-CONSUMPTION-CORE.md § PySpark Analytics Patterns | Cross-lakehouse 3-part naming, performance optimization |
| Must/Prefer/Avoid | SKILL.md § Must/Prefer/Avoid | MUST DO / AVOID / PREFER checklists |
| Quick Start | SKILL.md § Quick Start | CLI-specific Livy session setup and data exploration |
| Key Fabric Patterns | SKILL.md § Key Fabric Patterns | Spark pattern quick-reference table |
| Session Cleanup | SKILL.md § Session Cleanup | Clean up idle Livy sessions via CLI |


## Must/Prefer/Avoid

### MUST DO

  • Check for existing idle sessions before creating new ones
  • Use dynamic workspace/lakehouse discovery
  • Follow API patterns from COMMON-CLI.md

### PREFER

  • Use sqldw-consumption-cli for simple lakehouse queries — row counts, SELECT, schema exploration, filtering, and aggregation on lakehouse Delta tables should go through the SQL Endpoint via `sqlcmd`, not Spark. Use this skill only when the user explicitly requests PySpark, DataFrames, or Spark-specific features.
  • SQL Endpoint for Delta tables
  • Livy for unstructured/JSON data or complex Python analytics
  • Session reuse over creation

### AVOID

  • Hardcoded workspace IDs
  • Creating unnecessary sessions
  • Large result sets without LIMIT


## Quick Start

### Environment Setup

Apply the environment detection from COMMON-CORE.md § Environment Detection Pattern to set:
  • `$FABRIC_API_BASE` and `$FABRIC_RESOURCE_SCOPE`
  • `$FABRIC_API_URL` and `$LIVY_API_PATH` for Livy operations

Authentication: use the token acquisition from COMMON-CLI.md § Environment Detection and API Configuration.
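As a rough sketch of what the detection pattern produces, the variables compose into the Livy endpoint as shown below. The endpoint strings and the Livy path segment here are illustrative assumptions; take the real values from COMMON-CORE.md.

```python
# Illustrative values only; the real ones come from environment detection.
FABRIC_API_BASE = "https://api.fabric.microsoft.com"
FABRIC_RESOURCE_SCOPE = FABRIC_API_BASE           # token audience for control-plane calls
FABRIC_API_URL = FABRIC_API_BASE + "/v1"
LIVY_API_PATH = "livyApi/versions/2023-12-01"     # hypothetical version segment


def livy_sessions_url(workspace_id, lakehouse_id):
    """Build the Livy sessions endpoint used by the session-management recipes."""
    return (
        f"{FABRIC_API_URL}/workspaces/{workspace_id}"
        f"/lakehouses/{lakehouse_id}/{LIVY_API_PATH}/sessions"
    )


print(livy_sessions_url("ws-111", "lh-aaa"))
```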

### Workspace & Item Discovery

**Preferred**: use the item-discovery patterns in COMMON-CLI.md (Finding Workspaces and Items in Fabric) to find workspaces and items by name.
**Fallback** (when the workspace is already known):
```bash
# List workspaces
az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces" \
  --query "value[].{name:displayName, id:id}" --output table
read -p "Workspace ID: " workspaceId

# List lakehouses in workspace
az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/items?type=Lakehouse" \
  --query "value[].{name:displayName, id:id}" --output table
read -p "Lakehouse ID: " lakehouseId
```

### Session Management

```bash
# Check for existing idle session (avoid resource waste)
sessionId=$(az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$lakehouseId/$LIVY_API_PATH/sessions" \
  --query "sessions[?state=='idle'][0].id" --output tsv)

# Create if none available - FORCE STARTER POOL USAGE
if [[ -z "$sessionId" ]]; then
  cat > /tmp/body.json << 'EOF'
{
  "name": "analysis",
  "driverMemory": "56g",
  "driverCores": 8,
  "executorMemory": "56g",
  "executorCores": 8,
  "conf": {
    "spark.dynamicAllocation.enabled": "true",
    "spark.fabric.pool.name": "Starter Pool"
  }
}
EOF
  sessionId=$(az rest --method post --resource "$FABRIC_RESOURCE_SCOPE" \
    --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$lakehouseId/$LIVY_API_PATH/sessions" \
    --body @/tmp/body.json --query "id" --output tsv)

  echo "⏳ Waiting for starter pool session to be ready..."
  # With starter pools, this should take 3-5 seconds
  timeout=30  # Reduced from 90s since starter pools are fast
  while [ $timeout -gt 0 ]; do
    state=$(az rest --resource "$FABRIC_RESOURCE_SCOPE" \
      --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$lakehouseId/$LIVY_API_PATH/sessions/$sessionId" \
      --query "state" --output tsv)
    if [[ "$state" == "idle" ]]; then
      echo "✅ Session ready in starter pool!"
      break
    fi
    echo "   Session state: $state (${timeout}s remaining)"
    sleep 3
    timeout=$((timeout - 3))
  done
fi
```
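The wait loop above is a poll-until-idle with a time budget. Factored out with an injected state lookup, the logic can be exercised without any API calls; the stubbed state sequence below is illustrative.

```python
import time


def wait_for_idle(get_state, timeout_s=30, interval_s=3, sleep=time.sleep):
    """Poll get_state() until it returns 'idle' or the time budget is spent.

    Returns True once the session is idle, False on timeout. With starter
    pools the loop typically exits within the first one or two polls.
    """
    remaining = timeout_s
    while remaining > 0:
        if get_state() == "idle":
            return True
        sleep(interval_s)
        remaining -= interval_s
    return False


# Stubbed state sequence standing in for the az rest state query.
states = iter(["not_started", "starting", "idle"])
print(wait_for_idle(lambda: next(states), sleep=lambda _: None))  # True
```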

### Data Exploration (Fabric-Specific Patterns)

```bash
# Execute statement (LLM knows Python/Spark syntax)
cat > /tmp/body.json << 'EOF'
{
  "code": "spark.sql('SHOW TABLES').show(); df = spark.table('your_table'); df.describe().show()",
  "kind": "pyspark"
}
EOF
az rest --method post --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$lakehouseId/$LIVY_API_PATH/sessions/$sessionId/statements" \
  --body @/tmp/body.json
```
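Quoting PySpark code inside the JSON body is the fragile part of this recipe: double quotes inside the `code` string must be escaped or replaced with single quotes. Generating the body with a JSON serializer sidesteps the problem entirely; a minimal sketch:

```python
import json

# Build the Livy statement body programmatically so quotes inside the
# PySpark snippet never need manual escaping.
code = "spark.sql('SHOW TABLES').show(); df = spark.table('your_table'); df.describe().show()"
body = json.dumps({"code": code, "kind": "pyspark"}, indent=2)

# The result is valid JSON, ready to write out for --body @/tmp/body.json.
parsed = json.loads(body)
print(parsed["kind"], "SHOW TABLES" in parsed["code"])  # pyspark True
```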

## Key Fabric Patterns

| Pattern | Code | Use Case |
| --- | --- | --- |
| Table Discovery | `spark.sql("SHOW TABLES")` | List available tables |
| Cross-Lakehouse | `spark.sql("SELECT * FROM other_workspace.table")` | Query across workspaces |
| Delta Features | `spark.sql("DESCRIBE HISTORY tbl")`, `spark.sql("SELECT * FROM tbl VERSION AS OF 1")` | Time travel, versioning |
| Schema Evolution | `df.printSchema()` | Understand structure |

## Session Cleanup

```bash
# Clean up idle sessions (optional)
az rest --method get --resource "$FABRIC_RESOURCE_SCOPE" \
  --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$lakehouseId/$LIVY_API_PATH/sessions" \
  --query "sessions[?state=='idle'].id" --output tsv \
  | xargs -I {} az rest --method delete --resource "$FABRIC_RESOURCE_SCOPE" \
      --url "$FABRIC_API_URL/workspaces/$workspaceId/lakehouses/$lakehouseId/$LIVY_API_PATH/sessions/{}"
```
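The cleanup pipeline is a filter-then-delete over the sessions list. The same logic, with the two `az rest` calls stubbed out (the sample data is illustrative):

```python
def cleanup_idle(sessions, delete):
    """Delete every session whose state is 'idle'; return the deleted ids.

    `delete` stands in for the az rest --method delete call.
    """
    idle_ids = [s["id"] for s in sessions if s.get("state") == "idle"]
    for session_id in idle_ids:
        delete(session_id)
    return idle_ids


# Stub data shaped like the Livy sessions list response.
sessions = [{"id": "1", "state": "idle"}, {"id": "2", "state": "busy"}]
deleted = []
print(cleanup_idle(sessions, deleted.append))  # ['1']
```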

---

**Focus**: This skill provides Fabric-specific REST API patterns. The LLM already knows Python/Spark syntax; the skill focuses on Fabric integration, session management, and API endpoints.