# ETL Patterns
Orchestrator for production-grade Extract-Transform-Load patterns.
## Skill Routing
| Need | Skill | Content |
|---|---|---|
| Reliability patterns | | Idempotency, checkpointing, error handling, chunking, retry, logging |
| Load strategies | | Backfill, timestamp-based, CDC, pipeline orchestration |
## Pattern Selection Guide

### By Reliability Need
| Need | Pattern | Skill |
|---|---|---|
| Repeatable runs | Idempotency | |
| Resume after failure | Checkpointing | |
| Handle bad records | Error handling + DLQ | |
| Memory management | Chunked processing | |
| Network resilience | Retry with backoff | |
| Observability | Structured logging | |
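The retry-with-backoff row above can be sketched as a small helper. This is an illustrative sketch, not the skill's actual API; the function name and parameters are assumptions:

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Call fn, retrying on any exception with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff, capped at max_delay, with full jitter
            # so many workers retrying at once don't stampede the source.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```

In production you would typically narrow the `except` clause to transient errors (timeouts, connection resets) rather than catching everything.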
### By Load Strategy
| Scenario | Pattern | Skill |
|---|---|---|
| Small tables (<100K) | Full refresh | |
| Large tables | Timestamp incremental | |
| Real-time sync | CDC events | |
| Historical migration | Parallel backfill | |
| Zero-downtime refresh | Swap pattern | |
| Multi-step pipelines | Pipeline orchestration | |
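As one possible shape for the timestamp-incremental row, here is a sketch against SQLite; the signature, the raw-SQL approach, and the strict `>` comparison are all assumptions, not the skill's API:

```python
import sqlite3


def incremental_by_timestamp(conn, table, ts_column, last_seen):
    """Extract only rows updated since the last checkpoint (illustrative sketch).

    Strict '>' means re-running with the same checkpoint never re-reads
    already-processed rows, at the cost of skipping rows that share the
    exact checkpoint timestamp.
    """
    # Table/column names cannot be bound parameters, so they are interpolated
    # into the SQL string; only use trusted identifiers here.
    query = f"SELECT * FROM {table} WHERE {ts_column} > ? ORDER BY {ts_column}"
    return conn.execute(query, (last_seen,)).fetchall()
```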
## Quick Reference

### Idempotency Options
```python
# Small datasets: Delete-then-insert
# Large datasets: UPSERT on conflict
# Change detection: Row hash comparison
```
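A sketch of the UPSERT option, using SQLite's `ON CONFLICT` clause (PostgreSQL accepts the same syntax); the table, columns, and function signature are illustrative assumptions:

```python
import sqlite3


def upsert_records(conn, rows):
    """Idempotent load: insert new rows, update existing ones by primary key."""
    with conn:  # wrap the batch in one transaction
        conn.executemany(
            "INSERT INTO users (id, name) VALUES (:id, :name) "
            "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
            rows,
        )
```

Re-running the same batch is safe: the second run updates rows in place instead of failing on duplicate keys.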
### Load Strategy Decision
```
Is table < 100K rows?
  → Full refresh
Has reliable timestamp column?
  → Timestamp incremental
Source supports CDC?
  → CDC event processing
Need zero downtime?
  → Swap pattern (temp table → rename)
One-time historical load?
  → Parallel backfill with date ranges
```
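The decision tree can be encoded as a small helper that asks the questions in the same order; the function name, its parameters, and the return labels are illustrative, not part of the skill:

```python
def choose_load_strategy(row_count, has_timestamp=False, supports_cdc=False,
                         zero_downtime=False, one_time_backfill=False):
    """Walk the decision tree above; the first matching question wins."""
    if row_count < 100_000:
        return 'full_refresh'
    if has_timestamp:
        return 'timestamp_incremental'
    if supports_cdc:
        return 'cdc_events'
    if zero_downtime:
        return 'swap'
    if one_time_backfill:
        return 'parallel_backfill'
    return 'full_refresh'  # fallback when no other signal applies
```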
### Common Pipeline Structure
```python
# 1. Setup
checkpoint = Checkpoint('.etl_checkpoint.json')
processor = ETLProcessor()

# 2. Extract (with incremental)
df = incremental_by_timestamp(source_table, 'updated_at')

# 3. Transform (with error handling)
transformed = processor.process_batch(df.to_dict('records'))

# 4. Load (with idempotency)
upsert_records(pd.DataFrame(transformed))

# 5. Checkpoint
checkpoint.set_last_processed('sync', df['updated_at'].max())

# 6. Handle failures
processor.save_failures('failures/')
```
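The pipeline assumes a `Checkpoint` class; below is a minimal file-backed sketch of what it might look like. The JSON file format and any method beyond `set_last_processed` are assumptions:

```python
import json
import os


class Checkpoint:
    """Minimal file-backed checkpoint store (illustrative sketch)."""

    def __init__(self, path):
        self.path = path
        self.state = {}
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)

    def get_last_processed(self, key, default=None):
        return self.state.get(key, default)

    def set_last_processed(self, key, value):
        self.state[key] = value
        # Write to a temp file, then rename: os.replace is atomic, so a
        # crash mid-write cannot leave a corrupt checkpoint behind.
        tmp = self.path + '.tmp'
        with open(tmp, 'w') as f:
            json.dump(self.state, f)
        os.replace(tmp, self.path)
```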
## Related Skills
- data-validation - Validate data quality during ETL
- data-quality - Monitor data quality metrics
- pandas-coder - DataFrame transformations