Develops and executes Spark code on Dataproc clusters and Dataproc Serverless. Reads and writes data using BigLake Iceberg catalogs, BigQuery, and Spanner. Debugs execution failures.

Use when:

- Writing Spark ETL pipelines on GCP.
- Training or running inference with ML models using Spark on GCP.
- Managing Spark clusters, jobs, batches, and interactive sessions.

Don't use when:

- Writing generic Python scripts that don't use Spark.
- Performing simple SQL queries that can be done directly in BigQuery.
Install the skill:

```shell
npx skill4agent add gemini-cli-extensions/data-agent-kit-starter-pack gcp-spark
```

> [!IMPORTANT]
> You MUST ALWAYS follow the Task Execution Workflow when writing Spark code.
Skills and resources referenced by the workflow:

- @skill:discovering-gcp-data-assets (resources/schema_direct_inspection.md, resources/read_write_data.md)
- @skill:ml-best-practices (resources/ml_tasks.md, resources/spark_optimizations.md)
- Inspect schemas with `df.printSchema()`.
- Convert notebooks before submission with `jupyter nbconvert --to script your-notebook.ipynb`, then syntax-check the generated `.py` file with `python3 -m py_compile your-notebook.py`.
- resources/gcloud_dataproc.md

> [!CAUTION]
> Ensure you verify this checklist to avoid mistakes.
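The `py_compile` syntax check mentioned above can be sketched as follows (the file name and contents are hypothetical stand-ins for what `jupyter nbconvert --to script` would produce):

```python
import pathlib
import py_compile
import tempfile

# Hypothetical stand-in for the script produced by
# `jupyter nbconvert --to script your-notebook.ipynb`.
script = pathlib.Path(tempfile.mkdtemp()) / "your-notebook.py"
script.write_text("from math import sqrt\nprint(sqrt(16.0))\n")

# With doraise=True, py_compile raises PyCompileError on a syntax error,
# making this a cheap pre-submission check before shipping to Dataproc.
try:
    py_compile.compile(str(script), doraise=True)
    syntax_ok = True
except py_compile.PyCompileError:
    syntax_ok = False
```

This catches syntax errors locally in seconds, instead of discovering them minutes later in a failed Dataproc batch.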
Checklist:

- Import `col`, `when`, and `lit` from `pyspark.sql.functions`.
- `vector_to_array` lives in `pyspark.ml.functions` (`from pyspark.ml.functions import vector_to_array`), not in `pyspark.sql.functions`.
- Verify column types with `df.printSchema()` before transforming data.
- Set the `header` and `inferSchema` options when reading CSV files.
- IAM roles referenced: `roles/dataproc.worker`, `roles/biglake.admin`, `roles/bigquery.jobUser`, `roles/storage.objectUser`, `roles/spanner.databaseUser`.
- See resources/gcloud_dataproc.md for cluster, job, batch, and session commands.