Found 91 Skills
Transform raw data into analytical assets using ETL/ELT patterns, SQL (dbt), Python (pandas/polars/PySpark), and orchestration (Airflow). Use when building data pipelines, implementing incremental models, migrating from pandas to polars, or orchestrating multi-step transformations with testing and quality checks.
Master data engineering, ETL/ELT, data warehousing, SQL optimization, and analytics. Use when building data pipelines, designing data systems, or working with large datasets.
Production ETL patterns orchestrator. Routes to core reliability patterns and incremental load strategies.
This skill should be used when building data processing pipelines with CocoIndex v1, a Python library for incremental data transformation. Use when the task involves processing files/data into databases, creating vector embeddings, building knowledge graphs, ETL workflows, or any data pipeline requiring automatic change detection and incremental updates. CocoIndex v1 is Python-native (supports any Python type), has no DSL, and is currently in pre-release (version 1.0.0a1 or later).
Use this skill when building data pipelines, ETL/ELT workflows, or data transformation layers. Triggers on Airflow DAG design, dbt model creation, Spark job optimization, streaming vs batch architecture decisions, data ingestion, data quality checks, pipeline orchestration, incremental loads, CDC (change data capture), schema evolution, and data warehouse modeling. Acts as a senior data engineer advisor for building reliable, scalable data infrastructure.
Design data pipelines covering ETL vs ELT architectures, data source integration, scheduling, quality checks, and warehouse design. Use this skill when the user needs to move data between systems, build a data warehouse, automate data processing, or improve data reliability — even if they say 'move data from X to Y', 'build an ETL pipeline', 'our data is a mess', or 'set up a data warehouse'.
Import data into the AWS data lake from S3 files, local uploads, JDBC databases (Oracle, SQL Server, PostgreSQL, MySQL, RDS, Aurora), Amazon Redshift, Snowflake, BigQuery, DynamoDB, or existing Glue catalog tables (migration). Default target is S3 Tables; standard Iceberg on a general purpose bucket is supported where S3 Tables is not adopted. Handles one-time loads, recurring pipelines, and migrations. Triggers on: import data, load data, ingest, sync database, migrate table, move data to AWS, set up pipeline, ETL, pull from Snowflake, query BigQuery into S3, export DynamoDB, CTAS, convert to Iceberg. Do NOT use for setting up or troubleshooting Glue connections (use connecting-to-data-source), creating empty tables (use creating-data-lake-table), running queries (use querying-data-lake), finding tables by fuzzy name (use finding-data-lake-assets), catalog audit (use exploring-data-catalog), or SaaS platforms like Salesforce, ServiceNow, SAP, MongoDB, Kafka.
End-to-end data engineering pipeline using the Harvard Art Museums API, with ETL, SQL analytics, and Streamlit visualization.
This skill provides guidance for merging data from multiple heterogeneous sources (JSON, CSV, Parquet, XML, etc.) into a unified dataset. Use this skill when tasks involve combining records from different file formats, applying field mappings, resolving conflicts based on priority rules, or generating merged outputs with conflict reports. Applicable to ETL pipelines, data consolidation, and record deduplication scenarios.
Generate synthetic test data with edge cases for ETL pipeline testing.
Data pipeline expert for ETL, Apache Spark, Airflow, dbt, and data quality