AI Tech Fulltext Fetch


Core Goal

  • Reuse the same SQLite database populated by `ai-tech-rss-fetch`.
  • Fetch article body text from each RSS entry URL.
  • Persist extraction status and text in a companion table (`entry_content`).
  • Support incremental runs and safe retries without creating duplicate fulltext rows.

Triggering Conditions

  • Receive a request to fetch article body/full text for entries already in `ai_rss.db`.
  • Receive a request to build a second-stage pipeline after RSS metadata sync.
  • Need a stable, resumable queue over existing `entries` rows.
  • Need URL-based fulltext persistence before chunking, indexing, or summarization.
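
The resumable queue above can be sketched as a single query joining `entries` to `entry_content` (a sketch only: the real selection logic lives in `scripts/fulltext_fetch.py`, the column names follow the Data Contract section, and the hard-coded retry cap stands in for `--max-retries`):

```python
import sqlite3

def pending_entries(db_path, limit=50):
    """Select entries that still need fulltext: no entry_content row yet,
    or a failed row whose retry window has elapsed."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            """
            SELECT e.id, COALESCE(e.canonical_url, e.url) AS fetch_url
            FROM entries e
            LEFT JOIN entry_content c ON c.entry_id = e.id
            WHERE c.entry_id IS NULL
               OR (c.status = 'failed'
                   AND c.retry_count < 3
                   AND c.next_retry_at <= datetime('now'))
            ORDER BY e.id
            LIMIT ?
            """,
            (limit,),
        ).fetchall()
    finally:
        con.close()
```

Because the query is driven entirely by database state, interrupted runs resume cleanly: already-`ready` rows are skipped and only missing or retry-due rows are returned.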

Workflow

  1. Ensure the metadata table exists first.
     • Run `ai-tech-rss-fetch` and populate `entries` in SQLite before using this skill.
     • This skill requires the `entries` table to exist.
     • In multi-agent runtimes, pin the DB to the same absolute path used by `ai-tech-rss-fetch`:

     ```bash
     export AI_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/ai_rss.db"
     ```

  2. Initialize the fulltext table.

     ```bash
     python3 scripts/fulltext_fetch.py init-db --db "$AI_RSS_DB_PATH"
     ```

  3. Run an incremental fulltext sync.
     • The default behavior fetches rows that are missing full text or whose last attempt failed.

     ```bash
     python3 scripts/fulltext_fetch.py sync \
       --db "$AI_RSS_DB_PATH" \
       --limit 50 \
       --timeout 20 \
       --min-chars 300
     ```

  4. Fetch one entry on demand.

     ```bash
     python3 scripts/fulltext_fetch.py fetch-entry \
       --db "$AI_RSS_DB_PATH" \
       --entry-id 1234
     ```

  5. Inspect extracted content state.

     ```bash
     python3 scripts/fulltext_fetch.py list-content \
       --db "$AI_RSS_DB_PATH" \
       --status ready \
       --limit 100
     ```

Data Contract

  • Reads from the existing `entries` table:
    • `id`, `canonical_url`, `url`, `title`.
  • Writes to the `entry_content` table:
    • `entry_id` (unique, one row per entry)
    • `source_url`, `final_url`, `http_status`
    • `extractor` (`trafilatura`, `html-parser`, or `none`)
    • `content_text`, `content_hash`, `content_length`
    • `status` (`ready` or `failed`)
    • `retry_count`, `last_error`, timestamps.
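
The contract above implies a table shape roughly like the following (a sketch only: the authoritative DDL is created by `init-db` and documented in `references/schema.md`; the column types and constraints here are assumptions for illustration):

```python
import sqlite3

# One plausible DDL implied by the field list above. Assumed, not authoritative:
# the real schema is created by `init-db` and lives in references/schema.md.
ENTRY_CONTENT_DDL = """
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER NOT NULL UNIQUE REFERENCES entries(id),
    source_url     TEXT,
    final_url      TEXT,
    http_status    INTEGER,
    extractor      TEXT CHECK (extractor IN ('trafilatura', 'html-parser', 'none')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    status         TEXT NOT NULL CHECK (status IN ('ready', 'failed')),
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT,
    next_retry_at  TEXT,
    created_at     TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at     TEXT NOT NULL DEFAULT (datetime('now'))
);
"""

con = sqlite3.connect(":memory:")
# The entries table must exist first, as the Workflow section requires.
con.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, canonical_url TEXT, url TEXT, title TEXT)")
con.executescript(ENTRY_CONTENT_DDL)
```

The `UNIQUE` constraint on `entry_id` is what makes "one row per entry" and duplicate-free retries enforceable at the database level.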

Extraction and Update Rules

  • URL source priority: `canonical_url` first, fall back to `url`.
  • Attempt `trafilatura` extraction when the dependency is available; fall back to the built-in HTML parser.
  • Upsert by `entry_id`:
    • Success: write/update the full text and reset `retry_count` to `0`.
    • Failure with existing `ready` content: keep the old text, keep status `ready`, record `last_error`.
    • Failure without ready content: status becomes `failed`, increment `retry_count`, set `next_retry_at`.
  • Failed retries are capped by `--max-retries` (default `3`) and paced by `--retry-backoff-minutes`.
  • `--force` allows refetching already `ready` rows.
  • `--refetch-days N` allows refreshing rows older than `N` days.
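
The upsert rules above can be condensed into one statement per outcome (a sketch under the assumed column names from the Data Contract section; the hypothetical `record_result` helper and the hard-coded cap and backoff stand in for the real implementation in `scripts/fulltext_fetch.py` and its `--max-retries`/`--retry-backoff-minutes` flags):

```python
import hashlib
import sqlite3
from typing import Optional

BACKOFF_MINUTES = 30  # stand-in for --retry-backoff-minutes

def record_result(con: sqlite3.Connection, entry_id: int,
                  text: Optional[str], error: Optional[str] = None) -> None:
    """Success resets retries; failure keeps any existing ready text and
    only escalates rows that never had usable content."""
    if text is not None:
        con.execute(
            """
            INSERT INTO entry_content (entry_id, content_text, content_hash,
                                       content_length, status, retry_count, last_error)
            VALUES (?, ?, ?, ?, 'ready', 0, NULL)
            ON CONFLICT(entry_id) DO UPDATE SET
                content_text   = excluded.content_text,
                content_hash   = excluded.content_hash,
                content_length = excluded.content_length,
                status         = 'ready',
                retry_count    = 0,
                last_error     = NULL
            """,
            (entry_id, text, hashlib.sha256(text.encode()).hexdigest(), len(text)),
        )
    else:
        con.execute(
            """
            INSERT INTO entry_content (entry_id, status, retry_count, last_error, next_retry_at)
            VALUES (?, 'failed', 1, ?, datetime('now', ?))
            ON CONFLICT(entry_id) DO UPDATE SET
                -- unqualified names refer to the existing row: keep ready rows ready
                status        = CASE WHEN status = 'ready' THEN 'ready' ELSE 'failed' END,
                retry_count   = CASE WHEN status = 'ready' THEN retry_count
                                     ELSE retry_count + 1 END,
                last_error    = excluded.last_error,
                next_retry_at = CASE WHEN status = 'ready' THEN next_retry_at
                                     ELSE excluded.next_retry_at END
            """,
            (entry_id, error, f"+{BACKOFF_MINUTES} minutes"),
        )
```

Routing both outcomes through `ON CONFLICT(entry_id)` is what makes retries safe: re-running a fetch can only update the single existing row, never insert a duplicate.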

Configurable Parameters

  • `--db`
  • `AI_RSS_DB_PATH` (recommended absolute path in multi-agent runtimes)
  • `--limit`
  • `--force`
  • `--only-failed`
  • `--refetch-days`
  • `--oldest-first`
  • `--timeout`
  • `--max-bytes`
  • `--min-chars`
  • `--max-retries`
  • `--retry-backoff-minutes`
  • `--user-agent`
  • `--disable-trafilatura`
  • `--fail-on-errors`

Error Handling

  • Missing `entries` table: return an actionable error and stop.
  • Network/HTTP/parse errors: store the failure state and continue processing other entries.
  • Non-text content types (PDF/image/audio/video/zip): mark that entry as failed.
  • Extraction too short (below `--min-chars`): treat as failure to avoid low-quality body text.
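
The non-text and minimum-length rules amount to a small pre-check before persisting (a sketch only: the real logic lives in `scripts/fulltext_fetch.py`, and both the content-type prefix list and the `classify` helper are assumptions mirroring the bullets above):

```python
# Assumed prefix list mirroring the PDF/image/audio/video/zip rule above.
NON_TEXT_PREFIXES = ("application/pdf", "image/", "audio/", "video/",
                     "application/zip")

def classify(content_type: str, text: str, min_chars: int = 300) -> str:
    """Return 'ready' or 'failed' for an extraction, per the rules above."""
    # Strip any "; charset=..." suffix before matching the media type.
    ctype = content_type.split(";", 1)[0].strip().lower()
    if ctype.startswith(NON_TEXT_PREFIXES):
        return "failed"  # non-text content type
    if len(text) < min_chars:
        return "failed"  # extraction too short (--min-chars)
    return "ready"
```

Treating short extractions as failures (rather than storing them) keeps downstream chunking and summarization from operating on boilerplate-only pages.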

References

  • references/schema.md
  • references/fetch-rules.md

Assets

  • assets/config.example.json

Scripts

  • scripts/fulltext_fetch.py