incremental-fetch

Incremental Fetch

Build data pipelines that never lose progress and never re-fetch existing data.

The Two Watermarks Pattern

Track TWO cursors to support both forward and backward fetching:

| Watermark | Purpose                       | API Parameter |
|-----------|-------------------------------|---------------|
| newest_id | Fetch new data since last run | since_id      |
| oldest_id | Backfill older data           | until_id      |

A single watermark only fetches forward. Two watermarks enable:
  • Regular runs: fetch NEW data (since newest_id)
  • Backfill runs: fetch OLD data (until oldest_id)
  • No overlap, no gaps
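
As a minimal sketch of how the two cursors might be held in memory, the snippet below widens a pair of watermarks to cover the IDs seen in a run. The `Watermarks` class and its `extended_by` method are hypothetical names chosen for illustration; only `newest_id` and `oldest_id` come from the table above, and numeric, comparable IDs are assumed.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Watermarks:
    """Hypothetical container for the two cursors (assumes numeric IDs)."""
    newest_id: Optional[int] = None  # highest ID stored so far
    oldest_id: Optional[int] = None  # lowest ID stored so far

    def extended_by(self, seen_ids: Iterable[int]) -> "Watermarks":
        """Return new watermarks widened to cover the IDs seen in a run."""
        ids = list(seen_ids)
        if not ids:
            return self  # nothing fetched, nothing to move
        highs = ids + ([self.newest_id] if self.newest_id is not None else [])
        lows = ids + ([self.oldest_id] if self.oldest_id is not None else [])
        return Watermarks(newest_id=max(highs), oldest_id=min(lows))


wm = Watermarks().extended_by([120, 119, 118])  # first run
wm = wm.extended_by([123, 122, 121])            # regular run moves newest_id up
wm = wm.extended_by([117, 116])                 # backfill run moves oldest_id down
print(wm)
```

Each kind of run moves only one end of the stored range, which is what keeps regular runs and backfill runs from overlapping.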

Critical: Data vs Watermark Saving

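These are different operations with different timing:

| What         | When to Save       | Why                                                  |
|--------------|--------------------|------------------------------------------------------|
| Data records | After EACH page    | Resilience: interrupted on page 47? Keep 46 pages    |
| Watermarks   | ONCE at end of run | Correctness: only commit progress after full success |
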
fetch page 1 → save records → fetch page 2 → save records → ... → update watermarks
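
The timing difference can be sketched with an in-memory SQLite database. Everything here is an assumption made for illustration: the `records` and `ingestion_state` schemas (the real ones live in patterns.md, which is not shown), the `'demo'` source name, and the `flaky_pages` stand-in for an API client that fails on its third page.

```python
import sqlite3
from typing import Callable, Iterator, List, Optional

Page = List[dict]


def run_update(conn: sqlite3.Connection,
               fetch_pages: Callable[[Optional[int]], Iterator[Page]]) -> None:
    row = conn.execute(
        "SELECT newest_id FROM ingestion_state WHERE source = 'demo'").fetchone()
    newest_id = row[0] if row else None
    seen_max = newest_id

    for page in fetch_pages(newest_id):
        # Data records: saved and committed after EACH page.
        conn.executemany(
            "INSERT OR IGNORE INTO records (id, body) VALUES (:id, :body)", page)
        conn.commit()
        for rec in page:
            if seen_max is None or rec["id"] > seen_max:
                seen_max = rec["id"]

    # Watermark: written ONCE, reached only if every page succeeded.
    conn.execute(
        "INSERT OR REPLACE INTO ingestion_state (source, newest_id) "
        "VALUES ('demo', ?)", (seen_max,))
    conn.commit()


def flaky_pages(since_id: Optional[int]) -> Iterator[Page]:
    """Hypothetical API client that fails while fetching page 3."""
    yield [{"id": 1, "body": "a"}, {"id": 2, "body": "b"}]
    yield [{"id": 3, "body": "c"}]
    raise ConnectionError("simulated failure on page 3")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "CREATE TABLE ingestion_state (source TEXT PRIMARY KEY, newest_id INTEGER)")

try:
    run_update(conn, flaky_pages)
except ConnectionError:
    pass  # the run failed, but pages 1 and 2 are already stored

saved = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
state = conn.execute("SELECT COUNT(*) FROM ingestion_state").fetchone()[0]
print(saved, state)
```

After the simulated failure the three fetched records are kept while no watermark row exists, so the next run starts from the old position. `INSERT OR IGNORE` is one way to make that re-fetch harmless.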

Workflow Decision Tree

First run (no watermarks)?
├── YES → Full fetch (no since_id, no until_id)
└── NO → Backfill flag set?
    ├── YES → Backfill mode (until_id = oldest_id)
    └── NO → Update mode (since_id = newest_id)
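
The tree above maps onto a small function that builds the request parameters. `build_params` and its `backfill` argument are hypothetical names; whether `since_id` and `until_id` are inclusive or exclusive depends on the API being called, so that is left open here.

```python
from typing import Dict, Optional


def build_params(newest_id: Optional[int],
                 oldest_id: Optional[int],
                 backfill: bool = False) -> Dict[str, int]:
    """Pick request parameters following the decision tree."""
    if newest_id is None or oldest_id is None:
        return {}                         # first run: full fetch
    if backfill:
        return {"until_id": oldest_id}    # backfill mode
    return {"since_id": newest_id}        # update mode


print(build_params(None, None))                # first run
print(build_params(123, 116, backfill=True))   # backfill mode
print(build_params(123, 116))                  # update mode
```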

Implementation Checklist

  1. Database: Create ingestion_state table (see patterns.md)
  2. Fetch loop: Insert records immediately after each API page
  3. Watermark tracking: Track newest/oldest IDs seen in this run
  4. Watermark update: Save watermarks ONCE at end of successful run
  5. Retry: Exponential backoff with jitter
  6. Rate limits: Wait for reset or skip and record for next run
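
For item 5, one common form of exponential backoff with jitter is "full jitter": sleep a random time between zero and an exponentially growing cap. The sketch below assumes that variant; `with_retry`, its defaults, and the choice of `ConnectionError` as the retryable error are illustrative assumptions, not something the checklist prescribes.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(call: Callable[[], T],
               max_attempts: int = 5,
               base_delay: float = 1.0,
               max_delay: float = 60.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` on ConnectionError with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the run fail
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # random wait in [0, cap]
    raise AssertionError("unreachable")


attempts = {"n": 0}


def flaky() -> str:
    """Hypothetical request that fails twice, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = with_retry(flaky, sleep=delays.append)
print(result, len(delays))
```

Passing `sleep` as a parameter keeps the example fast to run and makes the delays easy to inspect. Because a failed run never reaches the watermark update, giving up after the last attempt stays safe.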

Pagination Types

This pattern works best with ID-based pagination (numeric IDs that can be compared). For other pagination types:

| Type         | Adaptation                                                   |
|--------------|--------------------------------------------------------------|
| Cursor/token | Store cursor string instead of ID; can't compare numerically |
| Timestamp    | Use last_timestamp column; compare as dates                  |
| Offset/limit | Store page number; resume from last saved page               |

See references/patterns.md for schemas and code examples.
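
For the timestamp row, "compare as dates" matters because ISO 8601 strings with different UTC offsets do not sort correctly as text. A minimal sketch, assuming timestamps arrive as ISO 8601 strings that all carry an explicit offset (`newest_timestamp` is a hypothetical helper):

```python
from datetime import datetime
from typing import Iterable, Optional


def newest_timestamp(current: Optional[str],
                     seen: Iterable[str]) -> Optional[str]:
    """Return the latest timestamp, comparing parsed datetimes, not strings."""
    candidates = list(seen)
    if current is not None:
        candidates.append(current)
    if not candidates:
        return None
    # Assumes every value has a UTC offset; mixing naive and aware
    # datetimes would raise TypeError during comparison.
    return max(candidates, key=datetime.fromisoformat)


seen = [
    "2024-05-01T11:00:00+02:00",  # 09:00 UTC
    "2024-05-01T09:30:00+00:00",  # 09:30 UTC, later despite sorting lower as text
]
print(newest_timestamp(None, seen))
```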