incremental-fetch

Incremental Fetch

Build data pipelines that never lose progress and never re-fetch existing data.

The Two Watermarks Pattern

Track TWO cursors to support both forward and backward fetching:

Watermark	Purpose	API Parameter
`newest_id`	Fetch new data since last run	`since_id`
`oldest_id`	Backfill older data	`until_id`

A single watermark only fetches forward. Two watermarks enable:

Regular runs: fetch NEW data (since `newest_id`)
Backfill runs: fetch OLD data (until `oldest_id`)
No overlap, no gaps
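
As a rough illustration of the two cursors, the sketch below keeps both watermarks in one small object and widens the range as pages are seen. The `Watermarks` class and its `observe` method are hypothetical names introduced here, and numeric, comparable IDs are assumed.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Watermarks:
    """Hypothetical holder for the two cursors (not from patterns.md)."""
    newest_id: Optional[int] = None  # highest ID stored so far
    oldest_id: Optional[int] = None  # lowest ID stored so far

    def observe(self, ids: Iterable[int]) -> None:
        """Widen the tracked range to cover the IDs seen in one page."""
        for record_id in ids:
            if self.newest_id is None or record_id > self.newest_id:
                self.newest_id = record_id
            if self.oldest_id is None or record_id < self.oldest_id:
                self.oldest_id = record_id


marks = Watermarks()
marks.observe([105, 104, 103])  # a regular run sees newer records
marks.observe([102, 101, 100])  # a backfill run reaches further back
print(marks)  # Watermarks(newest_id=105, oldest_id=100)
```

Because the range only ever widens, a regular run can never move `oldest_id` forward and a backfill run can never move `newest_id` backward.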

Critical: Data vs Watermark Saving

These are different operations with different timing:

What	When to Save	Why
Data records	After EACH page	Resilience: interrupted on page 47? Keep 46 pages
Watermarks	ONCE at end of run	Correctness: only commit progress after full success

fetch page 1 → save records → fetch page 2 → save records → ... → update watermarks
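
The timing difference can be sketched with an in-memory SQLite database. This is a minimal sketch under assumptions: the `records` and `ingestion_state` schemas, the `run_fetch` function, and the simulated `flaky` source are all made up for illustration, and the real schema lives in patterns.md.

```python
import sqlite3
from typing import Callable, Dict, List, Optional

Page = List[Dict[str, object]]


def run_fetch(conn: sqlite3.Connection,
              fetch_page: Callable[[int], Optional[Page]],
              source: str = "demo") -> None:
    """Commit records after each page; commit watermarks once, at the end."""
    newest: Optional[int] = None
    oldest: Optional[int] = None
    page_no = 1
    while True:
        page = fetch_page(page_no)  # may raise; earlier pages stay committed
        if not page:
            break
        conn.executemany(
            "INSERT OR IGNORE INTO records (id, body) VALUES (:id, :body)", page)
        conn.commit()  # data: saved after EACH page
        ids = [int(r["id"]) for r in page]
        newest = max(ids) if newest is None else max(newest, max(ids))
        oldest = min(ids) if oldest is None else min(oldest, min(ids))
        page_no += 1

    if newest is None:
        return  # nothing fetched, nothing to commit
    row = conn.execute(
        "SELECT newest_id, oldest_id FROM ingestion_state WHERE source = ?",
        (source,)).fetchone()
    if row is not None:  # widen the stored range, never shrink it
        newest, oldest = max(newest, row[0]), min(oldest, row[1])
    conn.execute(
        "INSERT OR REPLACE INTO ingestion_state (source, newest_id, oldest_id) "
        "VALUES (?, ?, ?)", (source, newest, oldest))
    conn.commit()  # watermarks: saved ONCE, only after every page succeeded


# Assumed schemas, for this demo only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE ingestion_state (source TEXT PRIMARY KEY, "
             "newest_id INTEGER NOT NULL, oldest_id INTEGER NOT NULL)")

pages = {1: [{"id": 6, "body": "f"}, {"id": 5, "body": "e"}],
         2: [{"id": 4, "body": "d"}, {"id": 3, "body": "c"}]}


def flaky(page_no: int) -> Optional[Page]:
    """Simulated API that fails while fetching page 3."""
    if page_no == 3:
        raise ConnectionError("simulated outage")
    return pages.get(page_no)


try:
    run_fetch(conn, flaky)
except ConnectionError:
    pass
rows_after_crash = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
state_after_crash = conn.execute("SELECT * FROM ingestion_state").fetchall()

run_fetch(conn, pages.get)  # retry against a healthy source
state_after_retry = conn.execute(
    "SELECT newest_id, oldest_id FROM ingestion_state").fetchall()
```

After the simulated outage the four records from pages 1 and 2 are kept, but no watermark row exists, so the retry starts from the same position and `INSERT OR IGNORE` absorbs the rows it sees again.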

Workflow Decision Tree

First run (no watermarks)?
├── YES → Full fetch (no since_id, no until_id)
└── NO → Backfill flag set?
    ├── YES → Backfill mode (until_id = oldest_id)
    └── NO → Update mode (since_id = newest_id)
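
The tree above maps onto a small function. This is only a sketch: `request_params` is a hypothetical helper, and it assumes both watermarks are written together, so either both are present or both are missing.

```python
from typing import Dict, Optional


def request_params(newest_id: Optional[int],
                   oldest_id: Optional[int],
                   backfill: bool = False) -> Dict[str, int]:
    """Pick the API parameters for this run from the stored watermarks."""
    if newest_id is None or oldest_id is None:
        return {}  # first run: full fetch, no since_id, no until_id
    if backfill:
        return {"until_id": oldest_id}  # backfill mode: older than oldest_id
    return {"since_id": newest_id}  # update mode: newer than newest_id


print(request_params(None, None))                 # {}
print(request_params(105, 100))                   # {'since_id': 105}
print(request_params(105, 100, backfill=True))    # {'until_id': 100}
```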

Implementation Checklist

Database: Create ingestion_state table (see patterns.md)
Fetch loop: Insert records immediately after each API page
Watermark tracking: Track newest/oldest IDs seen in this run
Watermark update: Save watermarks ONCE at end of successful run
Retry: Exponential backoff with jitter
Rate limits: Wait for reset or skip and record for next run
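
For the retry item, one common reading of "exponential backoff with jitter" is the full-jitter variant: sleep a random time between zero and a ceiling that doubles on every attempt. The sketch below assumes that variant. `with_retry`, its defaults, and the choice of retryable exceptions are illustrative, not taken from patterns.md.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retry(call: Callable[[], T],
               max_attempts: int = 5,
               base_delay: float = 1.0,
               max_delay: float = 60.0,
               sleep: Callable[[float], object] = time.sleep) -> T:
    """Retry `call` on transient errors with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the run fail, watermarks untouched
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # full jitter
    raise AssertionError("unreachable")


# Demo: fail twice, then succeed; record delays instead of really sleeping.
delays: List[float] = []
attempts = {"n": 0}


def flaky_call() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


result = with_retry(flaky_call, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper testable. In a real run the default `time.sleep` would apply.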

Pagination Types

This pattern works best with ID-based pagination (numeric IDs that can be compared). For other pagination types:

Type	Adaptation
Cursor/token	Store cursor string instead of ID; can't compare numerically
Timestamp	Use `last_timestamp` column; compare as dates
Offset/limit	Store page number; resume from last saved page
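
For the timestamp row, "compare as dates" matters because ISO 8601 strings with different UTC offsets do not sort correctly as text. A small sketch, assuming the API returns offset-aware ISO 8601 strings (the `is_newer` helper and the sample values are hypothetical):

```python
from datetime import datetime
from typing import Optional


def is_newer(candidate: str, last_timestamp: Optional[str]) -> bool:
    """True if `candidate` is strictly later than the stored watermark."""
    if last_timestamp is None:
        return True  # first run: everything is new
    return datetime.fromisoformat(candidate) > datetime.fromisoformat(last_timestamp)


a = "2024-01-01T00:30:00+02:00"  # equals 2023-12-31T22:30:00 UTC
b = "2023-12-31T23:00:00+00:00"

string_says_a_is_newer = a > b       # text comparison: misleading
date_says_a_is_newer = is_newer(a, b)  # date comparison: a is actually older
```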

See references/patterns.md for schemas and code examples.
