Design data pipelines covering ETL vs ELT architectures, data source integration, scheduling, quality checks, and warehouse design. Use this skill when the user needs to move data between systems, build a data warehouse, automate data processing, or improve data reliability — even if they say 'move data from X to Y', 'build an ETL pipeline', 'our data is a mess', or 'set up a data warehouse'.
```shell
npx skill4agent add asgard-ai-platform/skills tech-data-pipeline
```

## IRON LAW: Data Quality Checks at Every Stage
A pipeline that moves bad data fast is worse than no pipeline — it
corrupts downstream analytics and decisions. Every pipeline stage
(extract, transform, load) must have data quality checks:
row counts, null checks, schema validation, freshness checks.
"Garbage in, garbage out" is not a warning — it's a guarantee.| Aspect | ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
|---|---|---|
| Transform where? | Before loading (in pipeline) | After loading (in warehouse) |
| Best for | Structured data, compliance-heavy | Cloud warehouses (BigQuery, Snowflake) |
| Flexibility | Less (transform logic is fixed) | More (transform in SQL after loading) |
| Cost | Compute in pipeline | Compute in warehouse |
| Trend | Legacy/on-prem | Modern/cloud-native |
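The ETL/ELT split above can be sketched in a few lines of Python, using sqlite3 as a stand-in "warehouse". The table names (`raw_orders`, `orders_clean`) and the sample rows are invented for illustration, not part of any real schema:

```python
import sqlite3

rows = [("a1", "  ALICE ", 120), ("a2", " bob", -5), ("a3", "carol", 80)]

# ETL: transform in the pipeline, load only clean data.
# The transform logic is fixed in pipeline code.
def etl(conn, rows):
    clean = [(oid, name.strip().lower(), amt)
             for oid, name, amt in rows if amt >= 0]
    conn.execute("CREATE TABLE orders_clean (id TEXT, customer TEXT, amount INT)")
    conn.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", clean)

# ELT: load raw data first, transform later with SQL inside the warehouse.
# The transform can be rewritten without re-extracting from the source.
def elt(conn, rows):
    conn.execute("CREATE TABLE raw_orders (id TEXT, customer TEXT, amount INT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)
    conn.execute("""
        CREATE TABLE orders_clean AS
        SELECT id, lower(trim(customer)) AS customer, amount
        FROM raw_orders
        WHERE amount >= 0
    """)

conn = sqlite3.connect(":memory:")
elt(conn, rows)
print(conn.execute("SELECT COUNT(*) FROM orders_clean").fetchone()[0])  # 2
```

Both paths end with the same clean table; the difference is where the cleanup runs and how easily it can change later.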
## Pipeline Architecture

```
[Sources] → [Extract] → [Stage] → [Transform] → [Load] → [Serve]
    ↑                                                       ↓
    |  [Quality Checks at every stage]              [Dashboard]
    |  [Monitoring & Alerting]                      [API]
    └─────────── [Orchestrator (Airflow/Prefect)] ──────────┘
```

## Data Source Integration

| Source | Extraction Method | Challenges |
|---|---|---|
| Database | CDC (Change Data Capture), bulk query, replication | Schema changes, performance impact on source |
| API | REST/GraphQL polling, webhooks | Rate limits, pagination, auth token refresh |
| Files | S3/GCS pickup, SFTP, email attachment | Format inconsistency, encoding issues |
| Streaming | Kafka, Kinesis, Pub/Sub | Ordering, exactly-once processing |
| SaaS tools | Pre-built connectors (Fivetran, Airbyte) | API changes, data model complexity |
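The API row above hides most of the day-to-day pain: pagination and rate limits. Here is a hedged sketch of cursor-based extraction; `fetch_page` is a local stand-in for a real HTTP call (e.g. via `requests`), and the cursor/`items` response shape is an assumption, not any specific API:

```python
DATA = [{"id": i} for i in range(7)]  # pretend remote dataset

def fetch_page(cursor, page_size=3):
    """Stand-in for GET /items?cursor=...&limit=... against a real API."""
    page = DATA[cursor:cursor + page_size]
    next_cursor = cursor + page_size if cursor + page_size < len(DATA) else None
    return {"items": page, "next_cursor": next_cursor}

def extract_all(page_size=3):
    """Follow the cursor until the API signals the last page."""
    cursor, out = 0, []
    while True:
        resp = fetch_page(cursor, page_size)
        out.extend(resp["items"])
        if resp["next_cursor"] is None:
            return out
        cursor = resp["next_cursor"]
        # In a real extractor, back off here (sleep and retry)
        # when the API returns HTTP 429 (rate limited).

rows = extract_all()
print(len(rows))  # 7
```

The same loop shape works for offset- or token-based pagination; only `fetch_page` and the cursor bookkeeping change.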
## Orchestration Tools

| Tool | Type | Best For | Complexity |
|---|---|---|---|
| Airflow | Python DAGs | Complex pipelines, team of engineers | High |
| Prefect | Python, modern API | Simpler than Airflow, good DX | Medium |
| dbt | SQL transforms only | Transform layer in ELT | Low-Medium |
| Cron | Simple scheduling | Single script, low complexity | Low |
| Fivetran/Airbyte | Managed connectors | Extract + Load (no transform) | Low |
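What all of these orchestrators share is running tasks in dependency order. A minimal sketch of that core idea, using the standard library's `graphlib`; the task names and the `dag` dict are invented for illustration, and real tools add scheduling, retries, and monitoring on top:

```python
from graphlib import TopologicalSorter

ran = []

def extract():   ran.append("extract")
def transform(): ran.append("transform")
def load():      ran.append("load")
def report():    ran.append("report")

tasks = {"extract": extract, "transform": transform,
         "load": load, "report": report}

# Each key lists the tasks it depends on (its predecessors).
dag = {"transform": {"extract"}, "load": {"transform"}, "report": {"load"}}

# static_order() yields tasks so every dependency runs first.
for name in TopologicalSorter(dag).static_order():
    tasks[name]()

print(ran)  # ['extract', 'transform', 'load', 'report']
```

An Airflow DAG or Prefect flow is this same graph, plus a schedule, retry policy, and UI around it.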
## Standard Quality Checks

| Check | What It Validates | When |
|---|---|---|
| Row count | Expected number of rows (within ±10% of prior run) | After extract, after load |
| Null check | Critical columns have no unexpected nulls | After extract |
| Schema validation | Column names, types match expected | After extract |
| Freshness | Data is recent (not stale) | After load |
| Uniqueness | No duplicate primary keys | After load |
| Range check | Values within expected bounds | After transform |
| Referential integrity | Foreign keys match parent tables | After load |
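Most of the checks in the table are a few lines each. A hedged sketch of four of them as plain functions; the thresholds (±10% row count, 6-hour freshness) mirror the defaults used in this document, and the column names are illustrative:

```python
from datetime import datetime, timedelta, timezone

def check_row_count(current, prior, tolerance=0.10):
    """Row count within ±10% of the prior run."""
    return abs(current - prior) <= prior * tolerance

def check_nulls(rows, critical_cols):
    """Critical columns contain no unexpected nulls."""
    return all(row.get(c) is not None for row in rows for c in critical_cols)

def check_schema(rows, expected_cols):
    """Every row has exactly the expected column names."""
    return all(set(row) == set(expected_cols) for row in rows)

def check_freshness(latest_ts, max_age=timedelta(hours=6)):
    """Newest record is less than max_age old."""
    return datetime.now(timezone.utc) - latest_ts < max_age

rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": None}]
print(check_row_count(102, 100))        # True: within ±10%
print(check_nulls(rows, ["id"]))        # True: id is never null
print(check_nulls(rows, ["amount"]))    # False: row 2 has a null amount
```

In practice these run as assertions between stages, with a failure routed to the alert channel named in the pipeline's quality-check table.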
# Data Pipeline Design: {Project}
## Sources & Destinations
| Source | Type | Destination | Freshness | Volume |
|--------|------|-----------|-----------|--------|
| {source} | DB/API/File | {dest} | {daily/hourly} | {rows/day} |
## Architecture
- Pattern: ETL / ELT
- Orchestrator: {tool}
- Transform: {tool/SQL}
- Quality: {tool/custom checks}
## Pipeline Diagram
{Source} → {Extract} → {Stage} → {Transform} → {Load} → {Serve}
## Quality Checks
| Stage | Check | Threshold | Alert |
|-------|-------|-----------|-------|
| Extract | Row count | ±10% of prior | Slack alert |
| Load | Freshness | < 6 hours old | PagerDuty |
## Schedule
| Pipeline | Frequency | Start Time | SLA |
|----------|-----------|-----------|-----|
| {name} | {daily/hourly} | {time} | Data ready by {time} |

## References
- references/dbt-guide.md
- references/dimensional-modeling.md