incremental-fetch
Incremental Fetch
Build data pipelines that never lose progress and never re-fetch existing data.
The Two Watermarks Pattern
Track TWO cursors to support both forward and backward fetching:
| Watermark | Purpose | API Parameter |
|---|---|---|
| `newest_id` | Fetch new data since last run | `since_id` |
| `oldest_id` | Backfill older data | `until_id` |
A single watermark only fetches forward. Two watermarks enable:
- Regular runs: fetch NEW data (since `newest_id`)
- Backfill runs: fetch OLD data (until `oldest_id`)
- No overlap, no gaps
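One way to picture the two cursors is as the bounds of the ID range already ingested, which each run can only widen. The sketch below is illustrative rather than the project's actual code: the `Watermarks` class and its `widened` method are hypothetical names, and it assumes numeric, comparable IDs.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Watermarks:
    """Hypothetical container for the two cursors."""
    newest_id: Optional[int] = None  # upper bound of what has been fetched
    oldest_id: Optional[int] = None  # lower bound of what has been fetched

    def widened(self, seen_ids: Iterable[int]) -> "Watermarks":
        """Return new watermarks extended by the IDs seen in one run."""
        ids = list(seen_ids)
        if not ids:
            return self  # nothing fetched, nothing to move
        newest = max(ids) if self.newest_id is None else max(self.newest_id, max(ids))
        oldest = min(ids) if self.oldest_id is None else min(self.oldest_id, min(ids))
        return Watermarks(newest, oldest)
```

A regular run only moves `newest_id` upward and a backfill run only moves `oldest_id` downward, so the two modes never cover the same IDs.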
Critical: Data vs Watermark Saving
These are different operations with different timing:
| What | When to Save | Why |
|---|---|---|
| Data records | After EACH page | Resilience: interrupted on page 47? Keep 46 pages |
| Watermarks | ONCE at end of run | Correctness: only commit progress after full success |
fetch page 1 → save records → fetch page 2 → save records → ... → update watermarks
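The timing difference can be sketched as a single loop. This is a minimal illustration under stated assumptions: `fetch_page`, `save_records`, and `save_watermarks` are hypothetical callables supplied by the caller, records are assumed to be dicts with a numeric `"id"`, and an empty page is assumed to signal the end of the data.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, int]  # hypothetical record shape: {"id": ...}


def run_once(
    fetch_page: Callable[[int], List[Record]],
    save_records: Callable[[List[Record]], None],
    save_watermarks: Callable[[int, int], None],
) -> int:
    """Fetch pages until one comes back empty; return the number of records saved."""
    newest: Optional[int] = None
    oldest: Optional[int] = None
    saved = 0
    page = 1
    while True:
        records = fetch_page(page)  # may raise; pages saved so far are kept
        if not records:
            break
        save_records(records)  # persist data after EACH page
        ids = [r["id"] for r in records]
        newest = max(ids) if newest is None else max(newest, max(ids))
        oldest = min(ids) if oldest is None else min(oldest, min(ids))
        saved += len(records)
        page += 1
    if newest is not None and oldest is not None:
        # Commit progress ONCE, and only after every page succeeded.
        save_watermarks(newest, oldest)
    return saved
```

If `fetch_page` raises partway through, the exception propagates before `save_watermarks` is reached, so the stored watermarks stay where they were. A rerun may then see some already-saved records again, so this sketch assumes `save_records` is idempotent (for example, an upsert keyed on ID).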
Workflow Decision Tree
First run (no watermarks)?
├── YES → Full fetch (no since_id, no until_id)
└── NO → Backfill flag set?
    ├── YES → Backfill mode (until_id = oldest_id)
    └── NO → Update mode (since_id = newest_id)
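The decision tree maps directly onto a small function that builds the request parameters. The sketch below is an assumption-laden illustration: `choose_params` is a hypothetical helper, and whether `since_id` and `until_id` are inclusive or exclusive depends on the API being called.

```python
from typing import Dict, Optional


def choose_params(
    newest_id: Optional[int],
    oldest_id: Optional[int],
    backfill: bool = False,
) -> Dict[str, Optional[int]]:
    """Pick API parameters for this run from the stored watermarks."""
    if newest_id is None and oldest_id is None:
        return {}  # first run: full fetch, no since_id and no until_id
    if backfill:
        return {"until_id": oldest_id}  # backfill mode: older than what we have
    return {"since_id": newest_id}  # update mode: newer than what we have
```

For example, with stored watermarks `newest_id=500` and `oldest_id=100`, a regular run would request `{"since_id": 500}` and a backfill run `{"until_id": 100}`.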
Implementation Checklist
- Database: Create ingestion_state table (see patterns.md)
- Fetch loop: Insert records immediately after each API page
- Watermark tracking: Track newest/oldest IDs seen in this run
- Watermark update: Save watermarks ONCE at end of successful run
- Retry: Exponential backoff with jitter
- Rate limits: Wait for reset or skip and record for next run
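The retry item in the checklist could look roughly like the following. It is a sketch, not the referenced implementation in patterns.md: `call_with_retry` and its parameters are hypothetical, it uses the "full jitter" variant (a uniform random delay up to the exponential bound), and it retries on any exception, which a real pipeline would likely narrow to transient errors.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying up to max_retries times with jittered exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            bound = min(cap, base * 2 ** attempt)  # 1s, 2s, 4s, ... capped
            sleep(random.uniform(0, bound))  # jitter spreads out retries
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the delays can be observed or skipped in tests.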
Pagination Types
This pattern works best with ID-based pagination (numeric IDs that can be compared). For other pagination types:
| Type | Adaptation |
|---|---|
| Cursor/token | Store cursor string instead of ID; can't compare numerically |
| Timestamp | Use |
| Offset/limit | Store page number; resume from last saved page |
See references/patterns.md for schemas and code examples.