# AI Tech Fulltext Fetch
## Core Goal
- Reuse the same SQLite database populated by `ai-tech-rss-fetch`.
- Fetch article body text from each RSS entry URL.
- Persist extraction status and text in a companion table (`entry_content`).
- Support incremental runs and safe retries without creating duplicate fulltext rows.
## Triggering Conditions
- Receive a request to fetch article body/full text for entries already in `ai_rss.db`.
- Receive a request to build a second-stage pipeline after RSS metadata sync.
- Need a stable, resumable queue over existing `entries` rows.
- Need URL-based fulltext persistence before chunking, indexing, or summarization.
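The "stable, resumable queue" idea can be sketched as a `LEFT JOIN` between `entries` and `entry_content`: an entry is pending when it has no content row yet, or its row is `failed`. A minimal sketch, assuming the table and column names from the data contract in this document (not the script's actual query):

```python
import sqlite3

def pending_entry_ids(conn, limit=50):
    """Entries with no fulltext row yet, or a failed row that may be retried.
    Assumed schema: entries(id, url), entry_content(entry_id, status)."""
    rows = conn.execute(
        """
        SELECT e.id
        FROM entries e
        LEFT JOIN entry_content c ON c.entry_id = e.id
        WHERE c.entry_id IS NULL OR c.status = 'failed'
        ORDER BY e.id
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE entry_content (entry_id INTEGER UNIQUE, status TEXT);
    INSERT INTO entries (id, url) VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO entry_content VALUES (1, 'ready'), (2, 'failed');
""")
print(pending_entry_ids(conn))  # → [2, 3]: entry 1 is ready, 2 failed, 3 has no row
```

Because the query is derived purely from table state, interrupted runs resume from wherever they left off without any separate queue bookkeeping.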
## Workflow
- Ensure the metadata table exists first.
  - Run `ai-tech-rss-fetch` and populate `entries` in SQLite before using this skill.
  - This skill requires the `entries` table to exist.
  - In multi-agent runtimes, pin the DB to the same absolute path used by `ai-tech-rss-fetch`:

  ```bash
  export AI_RSS_DB_PATH="/absolute/path/to/workspace-rss-bot/ai_rss.db"
  ```
- Initialize the fulltext table.

  ```bash
  python3 scripts/fulltext_fetch.py init-db --db "$AI_RSS_DB_PATH"
  ```
- Run an incremental fulltext sync.
  - Default behavior fetches rows that are missing full text or currently failed.

  ```bash
  python3 scripts/fulltext_fetch.py sync \
    --db "$AI_RSS_DB_PATH" \
    --limit 50 \
    --timeout 20 \
    --min-chars 300
  ```
- Fetch one entry on demand.

  ```bash
  python3 scripts/fulltext_fetch.py fetch-entry \
    --db "$AI_RSS_DB_PATH" \
    --entry-id 1234
  ```
- Inspect extracted content state.

  ```bash
  python3 scripts/fulltext_fetch.py list-content \
    --db "$AI_RSS_DB_PATH" \
    --status ready \
    --limit 100
  ```
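A supervising agent can chain these steps programmatically. The sketch below only builds the argument vectors for the commands shown above; `subprocess.run` would execute them once `scripts/fulltext_fetch.py` is present in the workspace:

```python
import os

def fulltext_cmd(subcommand, db, **flags):
    """Build an argv list for scripts/fulltext_fetch.py.
    Keyword names map to CLI flags, e.g. min_chars=300 → --min-chars 300."""
    cmd = ["python3", "scripts/fulltext_fetch.py", subcommand, "--db", db]
    for name, value in flags.items():
        cmd += [f"--{name.replace('_', '-')}", str(value)]
    return cmd

db = os.environ.get("AI_RSS_DB_PATH", "ai_rss.db")
init_cmd = fulltext_cmd("init-db", db)
sync_cmd = fulltext_cmd("sync", db, limit=50, timeout=20, min_chars=300)
# subprocess.run(sync_cmd, check=True) would run the sync step for real
```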
## Data Contract
- Reads from the existing `entries` table: `id`, `url`, `canonical_url`, `title`.
- Writes to the `entry_content` table:
  - `entry_id` (unique, one row per entry)
  - `source_url`, `final_url`, `http_status`
  - `extractor` (`trafilatura`, `html-parser`, or `none`)
  - `content_text`, `content_hash`, `content_length`
  - `status` (`ready` or `failed`)
  - `retry_count`, `last_error`, timestamps
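A plausible DDL matching the fields above might look like the following. Column types and the timestamp column names are assumptions; the authoritative schema lives in `references/schema.md` and is created by `fulltext_fetch.py init-db`:

```python
import sqlite3

# Hypothetical DDL derived from the documented data contract, not the real schema.
ENTRY_CONTENT_DDL = """
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER NOT NULL UNIQUE,   -- one row per entry
    source_url     TEXT,
    final_url      TEXT,
    http_status    INTEGER,
    extractor      TEXT,                      -- trafilatura / html-parser / none
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    status         TEXT NOT NULL,             -- ready / failed
    retry_count    INTEGER NOT NULL DEFAULT 0,
    last_error     TEXT,
    fetched_at     TEXT,                      -- timestamp column names assumed
    next_retry_at  TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(ENTRY_CONTENT_DDL)
conn.execute("INSERT INTO entry_content (entry_id, status) VALUES (1, 'ready')")
# The UNIQUE constraint on entry_id is what makes retries safe: a second
# attempt for the same entry must be an upsert, never a new row.
```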
## Extraction and Update Rules
- URL source priority: `canonical_url` first, fallback to `url`.
- Attempt extraction with `trafilatura` when the dependency is available; fall back to the built-in HTML parser.
- Upsert by `entry_id`:
  - Success: write/update the full text and reset `retry_count` to `0`.
  - Failure with existing ready content: keep the old text, keep status `ready`, record `last_error`.
  - Failure without ready content: status becomes `failed`; increment `retry_count` and set `next_retry_at`.
- Failed retries are capped by `--max-retries` (default `3`) and paced by `--retry-backoff-minutes`.
- `--force` allows refetching already-`ready` rows.
- `--refetch-days N` allows refreshing rows older than `N` days.
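The success/failure rules above can be modeled as a single upsert keyed on `entry_id`. A simplified sketch (helper name and schema are illustrative, and retry pacing via `next_retry_at` is omitted for brevity):

```python
import sqlite3

def record_result(conn, entry_id, text=None, error=None):
    """Upsert one fetch attempt following the documented rules (sketch)."""
    row = conn.execute(
        "SELECT status FROM entry_content WHERE entry_id = ?", (entry_id,)
    ).fetchone()
    if text is not None:                       # success: store text, reset retries
        conn.execute(
            """INSERT INTO entry_content (entry_id, content_text, status, retry_count)
               VALUES (?, ?, 'ready', 0)
               ON CONFLICT(entry_id) DO UPDATE SET
                 content_text = excluded.content_text,
                 status = 'ready', retry_count = 0, last_error = NULL""",
            (entry_id, text),
        )
    elif row and row[0] == "ready":            # failure, but old text exists: keep it
        conn.execute(
            "UPDATE entry_content SET last_error = ? WHERE entry_id = ?",
            (error, entry_id),
        )
    else:                                      # failure with nothing usable yet
        conn.execute(
            """INSERT INTO entry_content (entry_id, status, retry_count, last_error)
               VALUES (?, 'failed', 1, ?)
               ON CONFLICT(entry_id) DO UPDATE SET
                 status = 'failed',
                 retry_count = entry_content.retry_count + 1,
                 last_error = excluded.last_error""",
            (entry_id, error),
        )

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE entry_content (
    entry_id INTEGER PRIMARY KEY, content_text TEXT,
    status TEXT, retry_count INTEGER, last_error TEXT)""")
record_result(conn, 1, text="article body")    # ready, retry_count 0
record_result(conn, 1, error="timeout")        # keeps old text and ready status
record_result(conn, 2, error="HTTP 404")       # failed, retry_count 1
```

Keeping previously extracted text on a failed refetch means a transient outage never degrades a row that was already `ready`.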
## Configurable Parameters
- `--db` / `AI_RSS_DB_PATH` (an absolute path is recommended in multi-agent runtimes)
- `--limit`, `--force`, `--only-failed`, `--refetch-days`, `--oldest-first`
- `--timeout`, `--max-bytes`, `--min-chars`
- `--max-retries`, `--retry-backoff-minutes`
- `--user-agent`, `--disable-trafilatura`, `--fail-on-errors`
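These flags could be wired with `argparse` roughly as follows. All defaults here are guesses except `--max-retries 3`, which the update rules above state; the real script also has subcommands (`init-db`, `sync`, `fetch-entry`, `list-content`), omitted in this sketch:

```python
import argparse
import os

def build_parser():
    """Sketch of the documented flags; not the script's actual parser."""
    p = argparse.ArgumentParser(prog="fulltext_fetch.py")
    p.add_argument("--db", default=os.environ.get("AI_RSS_DB_PATH", "ai_rss.db"))
    p.add_argument("--limit", type=int, default=50)
    p.add_argument("--force", action="store_true")
    p.add_argument("--only-failed", action="store_true")
    p.add_argument("--refetch-days", type=int)
    p.add_argument("--oldest-first", action="store_true")
    p.add_argument("--timeout", type=int, default=20)
    p.add_argument("--max-bytes", type=int)
    p.add_argument("--min-chars", type=int, default=300)
    p.add_argument("--max-retries", type=int, default=3)   # documented default
    p.add_argument("--retry-backoff-minutes", type=int)
    p.add_argument("--user-agent")
    p.add_argument("--disable-trafilatura", action="store_true")
    p.add_argument("--fail-on-errors", action="store_true")
    return p

args = build_parser().parse_args(["--db", "ai_rss.db", "--limit", "10", "--force"])
```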
## Error Handling
- Missing `entries` table: return an actionable error and stop.
- Network/HTTP/parse errors: store the failure state and continue processing other entries.
- Non-text content types (PDF/image/audio/video/zip): mark extraction as failed for that entry.
- Extraction too short (below `--min-chars`): treat as failure to avoid low-quality body text.
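The non-text and too-short rules combine into a simple gate applied before any text is stored. A sketch, where the function name and the exact list of accepted content types are illustrative:

```python
# Content types treated as extractable text (illustrative list).
TEXTUAL_TYPES = ("text/html", "application/xhtml+xml", "text/plain")

def gate_extraction(content_type, text, min_chars=300):
    """Return (status, reason): 'failed' for non-text responses or short bodies."""
    base_type = (content_type or "").split(";")[0].strip().lower()
    if base_type not in TEXTUAL_TYPES:          # PDF/image/audio/video/zip, etc.
        return "failed", f"non-text content type: {base_type or 'unknown'}"
    if len(text or "") < min_chars:             # avoid storing low-quality body text
        return "failed", f"extracted {len(text or '')} chars < {min_chars}"
    return "ready", None

print(gate_extraction("application/pdf", ""))                  # → ('failed', ...)
print(gate_extraction("text/html; charset=utf-8", "x" * 500))  # → ('ready', None)
```

Failures from this gate feed the same per-entry `failed` state as network errors, so one bad URL never aborts the rest of the sync batch.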
## References
- `references/schema.md`
- `references/fetch-rules.md`
## Assets
- `assets/config.example.json`
## Scripts
- `scripts/fulltext_fetch.py`