eceee-news-fulltext-fetch
eceee News Fulltext Fetch
Core Goal
- Discover news article URLs from `https://www.eceee.org/all-news/`.
- Persist discovered entry metadata into SQLite.
- Fetch and extract article body text from each entry page.
- Persist status and text in a companion table (`entry_content`) with retry-safe updates.
Triggering Conditions
- Receive a request to extract full text from eceee news archive pages.
- Receive a request to run incremental fulltext sync for eceee news links.
- Need a resilient local SQLite queue for discovery + extraction + retries.
Workflow
- Initialize database.

```bash
export ECEEE_NEWS_DB_PATH="/absolute/path/to/eceee_news.db"
python3 scripts/fulltext_fetch.py init-db --db "$ECEEE_NEWS_DB_PATH"
```

- Discover links and fetch fulltext incrementally.

```bash
python3 scripts/fulltext_fetch.py sync \
  --db "$ECEEE_NEWS_DB_PATH" \
  --index-url "https://www.eceee.org/all-news/" \
  --limit 50 \
  --min-chars 180
```

- Discover only (refresh URL catalog without fetching bodies).

```bash
python3 scripts/fulltext_fetch.py sync \
  --db "$ECEEE_NEWS_DB_PATH" \
  --discover-only
```

- Fetch one entry on demand.

```bash
python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$ECEEE_NEWS_DB_PATH" \
  --entry-id 123
```

Or by URL:

```bash
python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$ECEEE_NEWS_DB_PATH" \
  --url "https://www.eceee.org/all-news/news/example-slug/"
```

- Inspect stored state.

```bash
python3 scripts/fulltext_fetch.py list-entries --db "$ECEEE_NEWS_DB_PATH" --limit 100
python3 scripts/fulltext_fetch.py list-content --db "$ECEEE_NEWS_DB_PATH" --status ready --limit 100
```
Data Contract
- `entries` table stores discovery metadata:
  - `url`, `title`, `published_at`
  - `discovered_at`, `last_seen_at`
- `entry_content` table stores extraction result (one row per `entry_id`):
  - `source_url`, `final_url`, `http_status`
  - `extractor` (`trafilatura`, `html-parser`, or `none`)
  - `content_text`, `content_hash`, `content_length`
  - `status` (`ready` or `failed`)
  - retry fields + timestamps
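The two tables described above could be laid out roughly as follows. This is only a sketch: the column names listed in the contract are taken from it, but the column types, the constraints, and the names `id`, `retry_count`, `last_error`, and `updated_at` are assumptions; the authoritative layout is whatever `references/schema.md` and `init-db` define.

```python
import sqlite3

# Hypothetical DDL. Types, constraints, and the retry/timestamp column names
# (retry_count, last_error, updated_at) are assumptions, not the real schema.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entries (
    id            INTEGER PRIMARY KEY,
    url           TEXT NOT NULL UNIQUE,
    title         TEXT,
    published_at  TEXT,
    discovered_at TEXT NOT NULL,
    last_seen_at  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER PRIMARY KEY REFERENCES entries(id),
    source_url     TEXT NOT NULL,
    final_url      TEXT,
    http_status    INTEGER,
    extractor      TEXT CHECK (extractor IN ('trafilatura', 'html-parser', 'none')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    status         TEXT NOT NULL CHECK (status IN ('ready', 'failed')),
    retry_count    INTEGER NOT NULL DEFAULT 0,
    next_retry_at  TEXT,
    last_error     TEXT,
    updated_at     TEXT
);
"""


def init_db(path: str) -> sqlite3.Connection:
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Using `entry_id` as the primary key of `entry_content` is one way to enforce the "one row per `entry_id`" rule at the database level.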
Extraction and Update Rules
- Discovery source is `https://www.eceee.org/all-news/`, extracting anchor tags with class `newslink` under `/all-news/news/`.
- Fulltext extraction uses article main content region (`mainContentColumn`) and removes related-news/share blocks.
- Extraction path:
  - `trafilatura` (if installed and not disabled)
  - built-in HTML parser fallback
- Upsert by `entry_id`:
  - Success: set `ready`, write text/hash/length, reset retry counters.
  - Failure with existing `ready` content: keep old content, update error/retry metadata.
  - Failure without ready content: set `failed`, increment retries, set `next_retry_at`.
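The three upsert cases above can be illustrated with a small sketch. It is not the script's actual code: the function name `record_result`, the columns `retry_count`, `last_error`, and `updated_at`, the SHA-256 hash, and the doubling backoff formula are all assumptions made for the example.

```python
import hashlib
import sqlite3
from datetime import datetime, timedelta, timezone

# Minimal stand-in for entry_content; retry_count, last_error and updated_at
# are assumed column names.
DDL = """
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id INTEGER PRIMARY KEY,
    source_url TEXT NOT NULL,
    content_text TEXT,
    content_hash TEXT,
    content_length INTEGER,
    status TEXT NOT NULL,
    retry_count INTEGER NOT NULL DEFAULT 0,
    next_retry_at TEXT,
    last_error TEXT,
    updated_at TEXT
)
"""


def record_result(conn, entry_id, source_url, text=None, error=None,
                  backoff_minutes=30):
    """Apply the success / failure rules for one fetch attempt (hypothetical helper)."""
    now = datetime.now(timezone.utc)
    row = conn.execute(
        "SELECT status, retry_count FROM entry_content WHERE entry_id = ?",
        (entry_id,),
    ).fetchone()
    if row is None:
        # First attempt: create a placeholder row that the branches below update.
        conn.execute(
            "INSERT INTO entry_content (entry_id, source_url, status) "
            "VALUES (?, ?, 'failed')",
            (entry_id, source_url),
        )
        row = ("failed", 0)
    status, retries = row

    if text is not None:
        # Success: mark ready, store text/hash/length, reset retry state.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # assumed hash
        conn.execute(
            """UPDATE entry_content
               SET status = 'ready', content_text = ?, content_hash = ?,
                   content_length = ?, retry_count = 0, next_retry_at = NULL,
                   last_error = NULL, updated_at = ?
               WHERE entry_id = ?""",
            (text, digest, len(text), now.isoformat(), entry_id),
        )
    else:
        # Failure: never touch stored content; a previously ready row stays ready.
        retries += 1
        delay = timedelta(minutes=backoff_minutes * 2 ** (retries - 1))  # assumed formula
        new_status = "ready" if status == "ready" else "failed"
        conn.execute(
            """UPDATE entry_content
               SET status = ?, retry_count = ?, next_retry_at = ?,
                   last_error = ?, updated_at = ?
               WHERE entry_id = ?""",
            (new_status, retries, (now + delay).isoformat(), error,
             now.isoformat(), entry_id),
        )
    conn.commit()
```

Because the failure branch only writes status and retry metadata, a transient error on a refetch cannot overwrite text that was extracted successfully earlier.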
Configurable Parameters
- `--db`
- `ECEEE_NEWS_DB_PATH`
- `--index-url`
- `--discover-only`
- `--limit`
- `--force`
- `--only-failed`
- `--since-date`
- `--refetch-days`
- `--oldest-first`
- `--timeout`
- `--max-bytes`
- `--min-chars`
- `--max-retries`
- `--retry-backoff-minutes`
- `--user-agent`
- `--disable-trafilatura`
- `--fail-on-errors`
Error Handling
- Index fetch/parse failure returns actionable error.
- HTTP/network/content-type failures are recorded per entry and do not stop the whole sync batch.
- Short extracted text (`< --min-chars`) is treated as failed to avoid low-quality bodies.
- Retry queue is controlled via `max_retries` + exponential backoff.
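As a rough illustration of the per-entry isolation, the minimum-length check, and the retry gate described above, consider the sketch below. The helper names `run_batch` and `should_retry`, and the `fetch` callable passed in, are hypothetical; the real script may structure this differently.

```python
from datetime import datetime, timedelta, timezone


def should_retry(retry_count, next_retry_at, max_retries, now):
    """Return True if a failed entry is still eligible and its backoff has elapsed."""
    if retry_count >= max_retries:
        return False
    return next_retry_at is None or next_retry_at <= now


def run_batch(entries, fetch, min_chars=180):
    """Process (entry_id, url) pairs; one failure never aborts the rest.

    `fetch` is a caller-supplied function returning extracted text for a URL.
    """
    results = {}
    for entry_id, url in entries:
        try:
            text = fetch(url)
            if len(text) < min_chars:
                # Too-short bodies are treated as failures, not stored as content.
                raise ValueError(f"extracted text shorter than {min_chars} chars")
            results[entry_id] = ("ready", None)
        except Exception as exc:
            # Record the error for this entry and continue with the next one.
            results[entry_id] = ("failed", str(exc))
    return results
```

Catching the exception inside the loop body is what keeps a single network or parsing error from stopping the whole sync batch.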
References
- `references/schema.md`
- `references/fetch-rules.md`
Assets
- `assets/config.example.json`
Scripts
- `scripts/fulltext_fetch.py`