AI Tech Fulltext Fetch


Core Goal

  • Reuse the same SQLite database populated by `ai-tech-rss-fetch`.
  • Fetch article body text from each RSS entry URL.
  • Persist extraction status and text in a companion table (`entry_content`).
  • Support incremental runs and safe retries without creating duplicate fulltext rows.

Triggering Conditions

  • Receive a request to fetch article body/full text for entries already in `ai_rss.db`.
  • Receive a request to build a second-stage pipeline after RSS metadata sync.
  • Need a stable, resumable queue over existing `entries` rows.
  • Need URL-based fulltext persistence before chunking, indexing, or summarization.
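
The resumable queue above can be sketched as a single query joining `entries` to `entry_content` (a sketch only: the real selection logic lives in `scripts/fulltext_fetch.py`, the column names follow the Data Contract section, and the hard-coded retry cap stands in for `--max-retries`):

```python
import sqlite3

def pending_entries(db_path, limit=50):
    """Select entries that still need fulltext: no entry_content row yet,
    or a failed row whose retry window has elapsed."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            """
            SELECT e.id, COALESCE(e.canonical_url, e.url) AS fetch_url
            FROM entries e
            LEFT JOIN entry_content c ON c.entry_id = e.id
            WHERE c.entry_id IS NULL
               OR (c.status = 'failed'
                   AND c.retry_count < 3
                   AND c.next_retry_at <= datetime('now'))
            ORDER BY e.id
            LIMIT ?
            """,
            (limit,),
        ).fetchall()
    finally:
        con.close()
```

Because the query is driven entirely by database state, interrupted runs resume cleanly: already-`ready` rows are skipped and only missing or retry-due rows are returned.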

Workflow

  1. Ensure the metadata table exists first.
     • Run `ai-tech-rss-fetch` and populate `entries` in SQLite before using this skill.
     • This skill requires the `entries` table to exist.
     • In multi-agent runtimes, pin the DB to the same absolute path used by `ai-tech-rss-fetch`:

     ```bash
     export AI_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/ai_rss.db"
     ```

  2. Initialize the fulltext table.

     ```bash
     python3 scripts/fulltext_fetch.py init-db --db "$AI_RSS_DB_PATH"
     ```

  3. Run an incremental fulltext sync.
     • The default behavior fetches rows that are missing full text or whose last attempt failed.

     ```bash
     python3 scripts/fulltext_fetch.py sync \
       --db "$AI_RSS_DB_PATH" \
       --limit 50 \
       --timeout 20 \
       --min-chars 300
     ```

  4. Fetch one entry on demand.

     ```bash
     python3 scripts/fulltext_fetch.py fetch-entry \
       --db "$AI_RSS_DB_PATH" \
       --entry-id 1234
     ```

  5. Inspect extracted content state.

     ```bash
     python3 scripts/fulltext_fetch.py list-content \
       --db "$AI_RSS_DB_PATH" \
       --status ready \
       --limit 100
     ```

Data Contract

  • Reads from the existing `entries` table:
    • `id`, `canonical_url`, `url`, `title`.
  • Writes to the `entry_content` table:
    • `entry_id` (unique, one row per entry)
    • `source_url`, `final_url`, `http_status`
    • `extractor` (`trafilatura`, `html-parser`, or `none`)
    • `content_text`, `content_hash`, `content_length`
    • `status` (`ready` or `failed`)
    • `retry_count`, `last_error`, timestamps.
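
The contract above implies a table shape roughly like the following (a sketch only: the authoritative DDL is created by `init-db` and documented in `references/schema.md`; the column types and constraints here are assumptions for illustration):

```python
import sqlite3

# One plausible DDL implied by the field list above. Assumed, not authoritative:
# the real schema is created by `init-db` and lives in references/schema.md.
ENTRY_CONTENT_DDL = """
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER NOT NULL UNIQUE REFERENCES entries(id),
    source_url     TEXT,
    final_url      TEXT,
    http_status    INTEGER,
    extractor      TEXT CHECK (extractor IN ('trafilatura', 'html-parser', 'none')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    status         TEXT NOT NULL CHECK (status IN ('ready', 'failed')),
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT,
    next_retry_at  TEXT,
    created_at     TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at     TEXT NOT NULL DEFAULT (datetime('now'))
);
"""

con = sqlite3.connect(":memory:")
# The entries table must exist first, as the Workflow section requires.
con.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, canonical_url TEXT, url TEXT, title TEXT)")
con.executescript(ENTRY_CONTENT_DDL)
```

The `UNIQUE` constraint on `entry_id` is what makes "one row per entry" and duplicate-free retries enforceable at the database level.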

Extraction and Update Rules

  • URL source priority: `canonical_url` first, fall back to `url`.
  • Attempt `trafilatura` extraction when the dependency is available; fall back to the built-in HTML parser.
  • Upsert by `entry_id`:
    • Success: write/update the full text and reset `retry_count` to `0`.
    • Failure with existing `ready` content: keep the old text, keep status `ready`, record `last_error`.
    • Failure without ready content: status becomes `failed`, increment `retry_count`, set `next_retry_at`.
  • Failed retries are capped by `--max-retries` (default `3`) and paced by `--retry-backoff-minutes`.
  • `--force` allows refetching already `ready` rows.
  • `--refetch-days N` allows refreshing rows older than `N` days.
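
The upsert rules above can be condensed into one statement per outcome (a sketch under the assumed column names from the Data Contract section; the hypothetical `record_result` helper and the hard-coded cap and backoff stand in for the real implementation in `scripts/fulltext_fetch.py` and its `--max-retries`/`--retry-backoff-minutes` flags):

```python
import hashlib
import sqlite3
from typing import Optional

BACKOFF_MINUTES = 30  # stand-in for --retry-backoff-minutes

def record_result(con: sqlite3.Connection, entry_id: int,
                  text: Optional[str], error: Optional[str] = None) -> None:
    """Success resets retries; failure keeps any existing ready text and
    only escalates rows that never had usable content."""
    if text is not None:
        con.execute(
            """
            INSERT INTO entry_content (entry_id, content_text, content_hash,
                                       content_length, status, retry_count, last_error)
            VALUES (?, ?, ?, ?, 'ready', 0, NULL)
            ON CONFLICT(entry_id) DO UPDATE SET
                content_text   = excluded.content_text,
                content_hash   = excluded.content_hash,
                content_length = excluded.content_length,
                status         = 'ready',
                retry_count    = 0,
                last_error     = NULL
            """,
            (entry_id, text, hashlib.sha256(text.encode()).hexdigest(), len(text)),
        )
    else:
        con.execute(
            """
            INSERT INTO entry_content (entry_id, status, retry_count, last_error, next_retry_at)
            VALUES (?, 'failed', 1, ?, datetime('now', ?))
            ON CONFLICT(entry_id) DO UPDATE SET
                -- unqualified names refer to the existing row: keep ready rows ready
                status        = CASE WHEN status = 'ready' THEN 'ready' ELSE 'failed' END,
                retry_count   = CASE WHEN status = 'ready' THEN retry_count
                                     ELSE retry_count + 1 END,
                last_error    = excluded.last_error,
                next_retry_at = CASE WHEN status = 'ready' THEN next_retry_at
                                     ELSE excluded.next_retry_at END
            """,
            (entry_id, error, f"+{BACKOFF_MINUTES} minutes"),
        )
```

Routing both outcomes through `ON CONFLICT(entry_id)` is what makes retries safe: re-running a fetch can only update the single existing row, never insert a duplicate.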

Configurable Parameters

  • `--db`
  • `AI_RSS_DB_PATH` (recommended absolute path in multi-agent runtimes)
  • `--limit`
  • `--force`
  • `--only-failed`
  • `--refetch-days`
  • `--oldest-first`
  • `--timeout`
  • `--max-bytes`
  • `--min-chars`
  • `--max-retries`
  • `--retry-backoff-minutes`
  • `--user-agent`
  • `--disable-trafilatura`
  • `--fail-on-errors`

Error Handling

  • Missing `entries` table: return an actionable error and stop.
  • Network/HTTP/parse errors: store the failure state and continue processing other entries.
  • Non-text content types (PDF/image/audio/video/zip): mark that entry as failed.
  • Extraction too short (below `--min-chars`): treat as failure to avoid low-quality body text.
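
The non-text and minimum-length rules amount to a small pre-check before persisting (a sketch only: the real logic lives in `scripts/fulltext_fetch.py`, and both the content-type prefix list and the `classify` helper are assumptions mirroring the bullets above):

```python
# Assumed prefix list mirroring the PDF/image/audio/video/zip rule above.
NON_TEXT_PREFIXES = ("application/pdf", "image/", "audio/", "video/",
                     "application/zip")

def classify(content_type: str, text: str, min_chars: int = 300) -> str:
    """Return 'ready' or 'failed' for an extraction, per the rules above."""
    # Strip any "; charset=..." suffix before matching the media type.
    ctype = content_type.split(";", 1)[0].strip().lower()
    if ctype.startswith(NON_TEXT_PREFIXES):
        return "failed"  # non-text content type
    if len(text) < min_chars:
        return "failed"  # extraction too short (--min-chars)
    return "ready"
```

Treating short extractions as failures (rather than storing them) keeps downstream chunking and summarization from operating on boilerplate-only pages.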

References

  • references/schema.md
  • references/fetch-rules.md

Assets

  • assets/config.example.json

Scripts

  • scripts/fulltext_fetch.py