eceee-news-fulltext-fetch

eceee News Fulltext Fetch

Core Goal

  • Discover news article URLs from https://www.eceee.org/all-news/.
  • Persist discovered entry metadata into SQLite.
  • Fetch and extract article body text from each entry page.
  • Persist status and text in a companion table (entry_content) with retry-safe updates.

Triggering Conditions

  • Receive a request to extract full text from eceee news archive pages.
  • Receive a request to run incremental fulltext sync for eceee news links.
  • Need a resilient local SQLite queue for discovery + extraction + retries.

Workflow

  1. Initialize database.
```bash
export ECEEE_NEWS_DB_PATH="/absolute/path/to/eceee_news.db"
python3 scripts/fulltext_fetch.py init-db --db "$ECEEE_NEWS_DB_PATH"
```
  2. Discover links and fetch fulltext incrementally.
```bash
python3 scripts/fulltext_fetch.py sync \
  --db "$ECEEE_NEWS_DB_PATH" \
  --index-url "https://www.eceee.org/all-news/" \
  --limit 50 \
  --min-chars 180
```
  3. Discover only (refresh URL catalog without fetching bodies).
```bash
python3 scripts/fulltext_fetch.py sync \
  --db "$ECEEE_NEWS_DB_PATH" \
  --discover-only
```
  4. Fetch one entry on demand.
```bash
python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$ECEEE_NEWS_DB_PATH" \
  --entry-id 123
```
Or by URL:
```bash
python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$ECEEE_NEWS_DB_PATH" \
  --url "https://www.eceee.org/all-news/news/example-slug/"
```
  5. Inspect stored state.
```bash
python3 scripts/fulltext_fetch.py list-entries --db "$ECEEE_NEWS_DB_PATH" --limit 100
python3 scripts/fulltext_fetch.py list-content --db "$ECEEE_NEWS_DB_PATH" --status ready --limit 100
```
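
When the sync step is driven from a scheduler or another Python program rather than a shell, it can help to assemble the argument list in code. The following is a minimal sketch, not part of the shipped script: `build_sync_command` is a hypothetical helper, and it only uses the subcommand and flags shown in the workflow above.

```python
import os
import shlex
from typing import List, Optional

SCRIPT = "scripts/fulltext_fetch.py"  # script path as used in the workflow


def build_sync_command(
    db_path: str,
    index_url: Optional[str] = None,
    limit: Optional[int] = None,
    min_chars: Optional[int] = None,
    discover_only: bool = False,
) -> List[str]:
    """Build argv for the `sync` subcommand (hypothetical helper).

    Flags left as None/False are omitted so the script's own defaults apply.
    """
    cmd = ["python3", SCRIPT, "sync", "--db", db_path]
    if index_url is not None:
        cmd += ["--index-url", index_url]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if min_chars is not None:
        cmd += ["--min-chars", str(min_chars)]
    if discover_only:
        cmd.append("--discover-only")
    return cmd


if __name__ == "__main__":
    db = os.environ.get("ECEEE_NEWS_DB_PATH", "/absolute/path/to/eceee_news.db")
    cmd = build_sync_command(
        db, "https://www.eceee.org/all-news/", limit=50, min_chars=180
    )
    # Print the command for inspection; to execute it, pass `cmd` to
    # subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting issues with the database path.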

Data Contract

  • entries table stores discovery metadata:
    • url, title, published_at
    • discovered_at, last_seen_at
  • entry_content table stores extraction result (one row per entry_id):
    • source_url, final_url, http_status
    • extractor (trafilatura, html-parser, or none)
    • content_text, content_hash, content_length
    • status (ready or failed)
    • retry fields + timestamps
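
To make the contract concrete, here is one way the two tables could be declared with Python's built-in `sqlite3` module. This is a sketch under assumptions, not the script's actual DDL: the column types, the `id` primary key on entries, and the names `retry_count`, `last_error`, and `updated_at` are guesses for the unnamed "retry fields + timestamps". references/schema.md remains the authoritative definition.

```python
import sqlite3

# Assumed DDL: column names listed in the data contract are kept as-is;
# types, constraints, and the columns marked "assumed" are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entries (
    id            INTEGER PRIMARY KEY,           -- assumed name
    url           TEXT NOT NULL UNIQUE,
    title         TEXT,
    published_at  TEXT,
    discovered_at TEXT NOT NULL,
    last_seen_at  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER PRIMARY KEY REFERENCES entries(id),
    source_url     TEXT NOT NULL,
    final_url      TEXT,
    http_status    INTEGER,
    extractor      TEXT CHECK (extractor IN ('trafilatura', 'html-parser', 'none')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    status         TEXT NOT NULL CHECK (status IN ('ready', 'failed')),
    retry_count    INTEGER NOT NULL DEFAULT 0,   -- assumed name
    last_error     TEXT,                         -- assumed name
    next_retry_at  TEXT,
    updated_at     TEXT                          -- assumed name
);
"""


def init_db(path: str) -> sqlite3.Connection:
    """Open (or create) the database and ensure both tables exist."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

Using `entry_id` as the primary key of entry_content is what enforces the "one row per entry_id" rule at the database level.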

Extraction and Update Rules

  • Discovery source is https://www.eceee.org/all-news/, extracting anchor tags with class newslink under /all-news/news/.
  • Fulltext extraction uses article main content region (mainContentColumn) and removes related-news/share blocks.
  • Extraction path:
    1. trafilatura (if installed and not disabled)
    2. built-in HTML parser fallback
  • Upsert by entry_id:
    • Success: set ready, write text/hash/length, reset retry counters.
    • Failure with existing ready content: keep old content, update error/retry metadata.
    • Failure without ready content: set failed, increment retries, set next_retry_at.
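
The discovery rule above can be illustrated with the standard library's `html.parser`. This is a minimal sketch, not the script's own parser: the `NewsLinkParser` class and `discover` function are hypothetical names, and the sketch assumes that relative hrefs should be resolved against the index URL and that repeated links should be collapsed.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin, urlparse

INDEX_URL = "https://www.eceee.org/all-news/"
NEWS_PREFIX = "/all-news/news/"


class NewsLinkParser(HTMLParser):
    """Collect (url, title) pairs from <a class="newslink"> anchors."""

    def __init__(self, base_url: str = INDEX_URL) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None  # URL of the anchor being read
        self._text: List[str] = []        # text fragments inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        url = urljoin(self.base_url, attr.get("href") or "")
        # Keep only newslink anchors whose path is under /all-news/news/.
        if "newslink" in classes and urlparse(url).path.startswith(NEWS_PREFIX):
            self._href, self._text = url, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())  # collapse whitespace
            if all(url != self._href for url, _ in self.links):
                self.links.append((self._href, title))
            self._href = None


def discover(html: str) -> List[Tuple[str, str]]:
    """Return unique (url, title) pairs in document order."""
    parser = NewsLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Each returned pair would map onto the url and title columns of the entries table.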
Configurable Parameters

  • --db
  • ECEEE_NEWS_DB_PATH
  • --index-url
  • --discover-only
  • --limit
  • --force
  • --only-failed
  • --since-date
  • --refetch-days
  • --oldest-first
  • --timeout
  • --max-bytes
  • --min-chars
  • --max-retries
  • --retry-backoff-minutes
  • --user-agent
  • --disable-trafilatura
  • --fail-on-errors

Error Handling

  • An index fetch or parse failure returns an actionable error.
  • HTTP, network, and content-type failures are recorded per entry and do not stop the whole sync batch.
  • Extracted text shorter than --min-chars is treated as failed, to avoid storing low-quality bodies.
  • The retry queue is controlled via max_retries plus exponential backoff.
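
The short-text rule, the retry policy, and the upsert rules from the previous section combine into a small state transition. The sketch below shows one plausible reading of them as a pure function. It is not the script's implementation: `ContentRow`, `apply_result`, the `retry_count` and `last_error` field names, the doubling backoff formula, SHA-256 as the hash, and the default values are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ContentRow:
    status: str = "failed"                   # 'ready' or 'failed'
    content_text: Optional[str] = None
    content_hash: Optional[str] = None
    content_length: int = 0
    retry_count: int = 0                     # assumed field name
    last_error: Optional[str] = None         # assumed field name
    next_retry_at: Optional[datetime] = None


def backoff_delay(retry_count: int, base_minutes: int) -> timedelta:
    """Assumed formula: base, 2*base, 4*base, ... for retries 1, 2, 3, ..."""
    return timedelta(minutes=base_minutes * 2 ** max(retry_count - 1, 0))


def apply_result(
    existing: Optional[ContentRow],
    text: Optional[str],
    error: Optional[str],
    now: datetime,
    min_chars: int = 180,     # value from the workflow example, not a known default
    max_retries: int = 3,     # hypothetical default
    base_minutes: int = 30,   # hypothetical default
) -> ContentRow:
    """Return the row to upsert after one fetch attempt."""
    text = (text or "").strip()
    if error is None and len(text) < min_chars:
        error = f"extracted text shorter than {min_chars} chars"

    if error is None:
        # Success: write text/hash/length and reset retry state.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return ContentRow("ready", text, digest, len(text))

    prev = existing or ContentRow()
    retries = prev.retry_count + 1
    # Assumption: once max_retries is exceeded, no further retry is scheduled.
    next_retry = now + backoff_delay(retries, base_minutes) if retries <= max_retries else None
    # Failure keeps old 'ready' content; otherwise the row is marked failed.
    status = "ready" if prev.status == "ready" and prev.content_text else "failed"
    return replace(prev, status=status, retry_count=retries,
                   last_error=error, next_retry_at=next_retry)
```

Keeping the decision separate from the SQL makes the three upsert cases easy to test without a database.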

References

  • references/schema.md
  • references/fetch-rules.md

Assets

  • assets/config.example.json

Scripts

  • scripts/fulltext_fetch.py