eceee-news-fulltext-fetch

eceee News Fulltext Fetch

Core Goal

Discover news article URLs from `https://www.eceee.org/all-news/`.
Persist discovered entry metadata into SQLite.
Fetch and extract article body text from each entry page.
Persist status and text in a companion table (`entry_content`) with retry-safe updates.

Triggering Conditions

Receive a request to extract full text from eceee news archive pages.
Receive a request to run incremental fulltext sync for eceee news links.
Need a resilient local SQLite queue for discovery, extraction, and retries.

Workflow

Initialize database.

```bash
export ECEEE_NEWS_DB_PATH="/absolute/path/to/eceee_news.db"
python3 scripts/fulltext_fetch.py init-db --db "$ECEEE_NEWS_DB_PATH"
```

Discover links and fetch fulltext incrementally.

```bash
python3 scripts/fulltext_fetch.py sync \
  --db "$ECEEE_NEWS_DB_PATH" \
  --index-url "https://www.eceee.org/all-news/" \
  --limit 50 \
  --min-chars 180
```

Discover only (refresh URL catalog without fetching bodies).

```bash
python3 scripts/fulltext_fetch.py sync \
  --db "$ECEEE_NEWS_DB_PATH" \
  --discover-only
```

Fetch one entry on demand.

```bash
python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$ECEEE_NEWS_DB_PATH" \
  --entry-id 123
```

Or by URL:

```bash
python3 scripts/fulltext_fetch.py fetch-entry \
  --db "$ECEEE_NEWS_DB_PATH" \
  --url "https://www.eceee.org/all-news/news/example-slug/"
```

Inspect stored state.

```bash
python3 scripts/fulltext_fetch.py list-entries --db "$ECEEE_NEWS_DB_PATH" --limit 100
python3 scripts/fulltext_fetch.py list-content --db "$ECEEE_NEWS_DB_PATH" --status ready --limit 100
```
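
When the sync step is scheduled from another Python program rather than a shell, the command line can be assembled programmatically. The following is a minimal sketch, not part of the skill itself: `build_sync_command` and `run_sync` are hypothetical helper names, and only the subcommand and flags shown in the workflow above are used.

```python
import subprocess
import sys

SCRIPT = "scripts/fulltext_fetch.py"  # script path as used in the workflow above
INDEX_URL = "https://www.eceee.org/all-news/"


def build_sync_command(db_path, limit=50, min_chars=180, discover_only=False):
    """Build the argv list for the `sync` subcommand (hypothetical helper)."""
    cmd = [sys.executable, SCRIPT, "sync", "--db", db_path]
    if discover_only:
        # Refresh the URL catalog only; no article bodies are fetched.
        cmd.append("--discover-only")
    else:
        cmd += ["--index-url", INDEX_URL,
                "--limit", str(limit),
                "--min-chars", str(min_chars)]
    return cmd


def run_sync(db_path, **kwargs):
    """Run one sync pass and return the process exit code (hypothetical helper)."""
    return subprocess.run(build_sync_command(db_path, **kwargs), check=False).returncode
```

Calling `run_sync(os.environ["ECEEE_NEWS_DB_PATH"])` from a scheduler would then mirror the incremental sync command above, and passing `discover_only=True` would mirror the discover-only variant.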

Data Contract

`entries` table stores discovery metadata:
- `url`, `title`, `published_at`
- `discovered_at`, `last_seen_at`

`entry_content` table stores extraction result (one row per `entry_id`):
- `source_url`, `final_url`, `http_status`
- `extractor` (`trafilatura`, `html-parser`, or `none`)
- `content_text`, `content_hash`, `content_length`
- `status` (`ready` or `failed`)
- retry fields + timestamps
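
To make the contract above concrete, here is one possible SQLite layout, offered as a sketch only. The column types, the `id` primary key, the constraints, and the names `retry_count`, `last_error`, and `updated_at` are assumptions for illustration; the authoritative definition is whatever `references/schema.md` and `init-db` actually produce.

```python
import sqlite3

# Illustrative schema only: types, constraints, and the columns marked
# "assumed" are guesses, not the script's real DDL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entries (
    id            INTEGER PRIMARY KEY,          -- assumed surrogate key
    url           TEXT NOT NULL UNIQUE,
    title         TEXT,
    published_at  TEXT,
    discovered_at TEXT NOT NULL,
    last_seen_at  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS entry_content (
    entry_id       INTEGER PRIMARY KEY REFERENCES entries(id),  -- one row per entry
    source_url     TEXT NOT NULL,
    final_url      TEXT,
    http_status    INTEGER,
    extractor      TEXT CHECK (extractor IN ('trafilatura', 'html-parser', 'none')),
    content_text   TEXT,
    content_hash   TEXT,
    content_length INTEGER,
    status         TEXT NOT NULL CHECK (status IN ('ready', 'failed')),
    retry_count    INTEGER NOT NULL DEFAULT 0,  -- assumed retry field
    last_error     TEXT,                        -- assumed retry field
    next_retry_at  TEXT,
    updated_at     TEXT                         -- assumed timestamp
);
"""


def init_demo_db(path=":memory:"):
    """Create both tables in a fresh database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Using `entry_id` as the primary key of `entry_content` is what enforces the one-row-per-entry rule in this sketch.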

Extraction and Update Rules

Discovery source is `https://www.eceee.org/all-news/`, extracting anchor tags with class `newslink` under `/all-news/news/`.
Fulltext extraction uses the article main content region (`mainContentColumn`) and removes related-news/share blocks.
Extraction path:
1. `trafilatura` (if installed and not disabled)
2. built-in HTML parser fallback
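
The two-step extraction path can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: `mainContentColumn` is treated as an element `id`, the markup is assumed to be well formed, removal of related-news/share blocks is not covered, and `MainContentText` and `extract_text` are hypothetical names. `trafilatura.extract` is the library's real entry point and returns `None` when it finds no usable text.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainContentText(HTMLParser):
    """Collect text inside the element with id `mainContentColumn` (assumed id)."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside the region; 0 means outside
        self.in_code = False    # True while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            if tag in ("script", "style"):
                self.in_code = True
        elif dict(attrs).get("id") == "mainContentColumn":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if tag in ("script", "style"):
            self.in_code = False

    def handle_data(self, data):
        if self.depth and not self.in_code and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, use_trafilatura=True):
    """Return (extractor, text): trafilatura first, built-in parser second."""
    if use_trafilatura:
        try:
            import trafilatura
        except ImportError:
            trafilatura = None  # not installed: fall through to the parser
        if trafilatura is not None:
            text = trafilatura.extract(html)
            if text:
                return "trafilatura", text.strip()
    parser = MainContentText()
    parser.feed(html)
    parser.close()
    text = "\n".join(parser.chunks)
    return ("html-parser", text) if text else ("none", "")
```

Passing `use_trafilatura=False` plays the role of `--disable-trafilatura` in this sketch, and the returned label matches the `extractor` values listed in the data contract.
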
Upsert by `entry_id`:
- Success: set status to `ready`, write text/hash/length, reset retry counters.
- Failure with existing `ready` content: keep old content, update error/retry metadata.
- Failure without ready content: set status to `failed`, increment retries, set `next_retry_at`.
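
The three upsert rules above can be expressed with SQLite's `INSERT ... ON CONFLICT DO UPDATE` (available since SQLite 3.24). The sketch below is an assumption-laden illustration: the reduced demo table, the column names `retry_count`, `last_error`, and `updated_at`, the SHA-256 hash, the fixed retry delay, and the function name `record_result` are all hypothetical. It also assumes that a failure on an entry with `ready` content still bumps the retry counter.

```python
import hashlib
import sqlite3
from datetime import datetime, timedelta, timezone

# Reduced demo table; the real schema lives in references/schema.md.
DEMO_DDL = """
CREATE TABLE entry_content (
    entry_id INTEGER PRIMARY KEY, source_url TEXT,
    content_text TEXT, content_hash TEXT, content_length INTEGER,
    status TEXT, retry_count INTEGER DEFAULT 0, last_error TEXT,
    next_retry_at TEXT, updated_at TEXT
)
"""


def record_result(conn, entry_id, source_url, text=None, error=None,
                  retry_delay_minutes=30, now=None):
    """Apply the upsert rules for one fetch attempt; text=None means failure."""
    now = now or datetime.now(timezone.utc)
    stamp = now.isoformat()
    if text is not None:
        # Success: write text/hash/length and reset retry state.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        conn.execute(
            """INSERT INTO entry_content
                   (entry_id, source_url, content_text, content_hash,
                    content_length, status, retry_count, updated_at)
               VALUES (?, ?, ?, ?, ?, 'ready', 0, ?)
               ON CONFLICT(entry_id) DO UPDATE SET
                   content_text = excluded.content_text,
                   content_hash = excluded.content_hash,
                   content_length = excluded.content_length,
                   status = 'ready', retry_count = 0,
                   last_error = NULL, next_retry_at = NULL,
                   updated_at = excluded.updated_at""",
            (entry_id, source_url, text, digest, len(text), stamp))
    else:
        # Failure: content columns are never touched, so old text survives.
        row = conn.execute("SELECT retry_count FROM entry_content WHERE entry_id = ?",
                           (entry_id,)).fetchone()
        retries = (row[0] if row else 0) + 1
        retry_at = (now + timedelta(minutes=retry_delay_minutes)).isoformat()
        conn.execute(
            """INSERT INTO entry_content
                   (entry_id, source_url, status, retry_count,
                    last_error, next_retry_at, updated_at)
               VALUES (?, ?, 'failed', ?, ?, ?, ?)
               ON CONFLICT(entry_id) DO UPDATE SET
                   status = CASE WHEN entry_content.status = 'ready'
                                 THEN 'ready' ELSE 'failed' END,
                   retry_count = excluded.retry_count,
                   last_error = excluded.last_error,
                   next_retry_at = excluded.next_retry_at,
                   updated_at = excluded.updated_at""",
            (entry_id, source_url, retries, error, retry_at, stamp))
    conn.commit()
```

The `CASE` expression is what keeps a previously `ready` row from being downgraded by a later failed refetch, while its error and retry metadata are still refreshed.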

Configurable Parameters

- `--db`
- `ECEEE_NEWS_DB_PATH`
- `--index-url`
- `--discover-only`
- `--limit`
- `--force`
- `--only-failed`
- `--since-date`
- `--refetch-days`
- `--oldest-first`
- `--timeout`
- `--max-bytes`
- `--min-chars`
- `--max-retries`
- `--retry-backoff-minutes`
- `--user-agent`
- `--disable-trafilatura`
- `--fail-on-errors`

Error Handling

Index fetch/parse failure returns an actionable error.
HTTP/network/content-type failures are recorded per entry and do not stop the whole sync batch.
Short extracted text (shorter than `--min-chars`) is treated as failed to avoid low-quality bodies.
The retry queue is controlled via `max_retries` plus exponential backoff.
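
As an illustration of the last two rules, the sketch below computes a doubling retry delay and applies the minimum-length check. The doubling formula, the defaults of 30 minutes and 3 retries, and the function names are assumptions chosen for the example; the value 180 is taken from the sample `sync` command, not from a documented default. The script's actual schedule is described in `references/fetch-rules.md`.

```python
from datetime import datetime, timedelta, timezone


def next_retry_at(retry_count, backoff_minutes=30, max_retries=3, now=None):
    """Return when to retry next, or None once retries are exhausted.

    retry_count is the number of failed attempts so far (1 after the
    first failure). The delay doubles with each further failure.
    """
    if retry_count >= max_retries:
        return None  # give up: the entry leaves the retry queue
    now = now or datetime.now(timezone.utc)
    delay = backoff_minutes * 2 ** max(retry_count - 1, 0)
    return now + timedelta(minutes=delay)


def is_too_short(text, min_chars=180):
    """Treat bodies below min_chars as failed extractions."""
    return len(text.strip()) < min_chars
```

With these assumed defaults, the first failure schedules a retry 30 minutes out, the second 60 minutes out, and the third stops retrying.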

References

- `references/schema.md`
- `references/fetch-rules.md`

Assets

- `assets/config.example.json`

Scripts

- `scripts/fulltext_fetch.py`
