wechat-article-fetcher

WeChat Official Account Article Fetcher

Fetch, parse, and save WeChat Official Account articles. Supports single and batch downloads, metadata extraction, image download, and Markdown conversion.

Quick Start

Fetch a single article:

bash

python scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/xxxxx"

Fetch multiple articles in batch (separated by spaces):

bash

python scripts/fetch_wechat_article.py "url1" "url2" "url3" --output-dir ./output

Fetch multiple articles in batch (separated by commas):

bash

python scripts/fetch_wechat_article.py "url1,url2,url3" --output-dir ./output

Output metadata only (no files saved):

bash

python scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/xxxxx" --json

Dependency Installation

bash

pip install beautifulsoup4 html2text requests
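Before running the script, it can help to confirm that the three packages are importable. The following is a minimal sketch rather than part of the tool; `REQUIRED` and `missing_dependencies` are hypothetical names introduced here. Note that `beautifulsoup4` is imported as `bs4`.

```python
from importlib.util import find_spec

# pip package name -> importable module name
REQUIRED = {
    "beautifulsoup4": "bs4",
    "html2text": "html2text",
    "requests": "requests",
}


def missing_dependencies() -> list[str]:
    """Return the pip names of required packages that cannot be imported."""
    return [pip_name for pip_name, module in REQUIRED.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All dependencies are installed.")
```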

Features

1. Fetch articles and save them locally

bash

python scripts/fetch_wechat_article.py "<url>" --output-dir ./output

Output directory structure:

output/<account_name>/<date>_<title>/
├── index.html    # Formatted standalone HTML file
├── article.md    # Markdown version
├── meta.json     # Article metadata
└── images/       # Downloaded images
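The layout above can be reproduced from an article's metadata. This sketch shows one way to build the directory path. The helper names `safe_name` and `article_dir`, the sanitizing rule, and the date format are assumptions made for illustration and may differ from what the script actually does.

```python
import re
from pathlib import Path


def safe_name(text: str, max_len: int = 80) -> str:
    """Replace characters that are invalid in file names and trim the result."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", text).strip("_")
    return cleaned[:max_len] or "untitled"


def article_dir(output_dir: str, account: str, date: str, title: str) -> Path:
    """Build <output_dir>/<account>/<date>_<title> (hypothetical helper)."""
    folder = f"{safe_name(date)}_{safe_name(title)}"
    return Path(output_dir) / safe_name(account) / folder


if __name__ == "__main__":
    target = article_dir("./output", "Tech Daily", "2024-01-05", "What's new: v2?")
    (target / "images").mkdir(parents=True, exist_ok=True)  # also creates parents
    print(target)
```

Sanitizing the account name and title keeps the path valid on file systems that reject characters such as `:` or `?`.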

2. Extract metadata only

bash

python scripts/fetch_wechat_article.py "<url>" --json

The returned JSON contains:
- `title`: article title
- `author`: author
- `account_nickname`: WeChat Official Account name
- `description`: abstract
- `create_time`: publish time
- `content_text`: main body text
- `content_markdown`: Markdown content
- `cover_image`: cover image
- `source_url`: original article link
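A consumer of the `--json` output might check that the documented fields are present before using them. This is a minimal sketch that assumes the script prints a single JSON object for a single URL; `load_metadata` and `preview` are hypothetical helpers, not functions provided by the tool.

```python
import json

# Field names as documented for the --json output.
EXPECTED_FIELDS = (
    "title", "author", "account_nickname", "description", "create_time",
    "content_text", "content_markdown", "cover_image", "source_url",
)


def load_metadata(raw: str) -> dict:
    """Parse the JSON text and fail loudly if a documented field is absent."""
    meta = json.loads(raw)
    missing = [name for name in EXPECTED_FIELDS if name not in meta]
    if missing:
        raise KeyError(f"metadata is missing fields: {', '.join(missing)}")
    return meta


def preview(meta: dict, length: int = 200) -> str:
    """One-line summary: title, account name, and the start of the body text."""
    return f"{meta['title']} ({meta['account_nickname']}): {meta['content_text'][:length]}"
```

The raw text could come from the script's standard output, for example captured with `subprocess.run(..., capture_output=True, text=True)`.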

3. Batch download multiple articles

Multiple links separated by spaces:

bash

python scripts/fetch_wechat_article.py "url1" "url2" "url3" --output-dir ./output

Multiple links separated by commas:

bash

python scripts/fetch_wechat_article.py "url1,url2,url3" --output-dir ./output

Set a custom download interval (default: 3 seconds) to avoid triggering the anti-crawling mechanism:

bash

python scripts/fetch_wechat_article.py "url1" "url2" --interval 5

Articles from the same Official Account are automatically categorized into the same directory.
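Accepting both space-separated and comma-separated links means the command-line arguments have to be flattened into one list. The sketch below shows one way to do that. `split_urls` is a hypothetical helper, and dropping duplicates is an added design choice rather than documented behavior.

```python
import re


def split_urls(args: list[str]) -> list[str]:
    """Flatten arguments that each hold one URL or a comma-separated list."""
    urls: list[str] = []
    for arg in args:
        for part in re.split(r"[,\s]+", arg):
            # Skip empty pieces and repeated links while keeping the order.
            if part and part not in urls:
                urls.append(part)
    return urls


if __name__ == "__main__":
    print(split_urls(["url1,url2", "url3"]))
```

Splitting on commas assumes the article links themselves contain no commas, which holds for the short `/s/xxxxx` form.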

4. Skip image download

bash

python scripts/fetch_wechat_article.py "<url>" --no-images

5. Use as a Python library

python

from scripts.fetch_wechat_article import fetch_article, batch_fetch

# Fetch and save a single article
result = fetch_article("https://mp.weixin.qq.com/s/xxxxx", output_dir="./output")
print(result['title'], result['path'])

# Fetch only the metadata of a single article
meta = fetch_article("https://mp.weixin.qq.com/s/xxxxx", json_only=True)
print(meta['title'])
print(meta['content_text'][:200])

# Batch fetch
urls = ["https://mp.weixin.qq.com/s/aaa", "https://mp.weixin.qq.com/s/bbb"]
stats = batch_fetch(urls, output_dir="./output", interval=3.0)
print(f"Success: {stats['success']} articles, Fail: {stats['fail']} articles")

Main function parameters:
- `url`: Article link (supports both short links and long links)
- `output_dir`: Save directory (default: `./wechat_articles`)
- `download_img`: Whether to download images (default: `True`)
- `to_markdown`: Whether to convert to Markdown (default: `True`)
- `json_only`: Return only the metadata dictionary without saving any files

Extra parameters for `batch_fetch`:
- `urls`: List of article links
- `interval`: Download interval in seconds between each article (default: `3.0`)
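To make the role of `interval` concrete, here is a sketch of a batch loop that pauses between articles and counts outcomes. It is not the actual `batch_fetch` implementation. `batch_with_interval` and its `fetch` callback are hypothetical, and only the `success` and `fail` keys mirror the usage shown above.

```python
import time
from typing import Callable, Iterable


def batch_with_interval(urls: Iterable[str],
                        fetch: Callable[[str], object],
                        interval: float = 3.0) -> dict:
    """Call fetch(url) for each URL, pausing between requests."""
    stats = {"success": 0, "fail": 0}
    url_list = list(urls)
    for index, url in enumerate(url_list):
        try:
            fetch(url)
            stats["success"] += 1
        except Exception as exc:  # one failed article should not stop the batch
            print(f"failed: {url}: {exc}")
            stats["fail"] += 1
        if index < len(url_list) - 1:
            time.sleep(interval)  # no pause is needed after the last article
    return stats
```

Passing the fetch function as a parameter keeps the loop easy to test with a stub and `interval=0`.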

Notes

- Prefer short links (`/s/xxxxx`); long links with the `__biz` parameter may trigger a CAPTCHA.
- The default interval for batch downloads is 3 seconds; adjust it with `--interval` to avoid triggering WeChat's anti-crawling mechanism.
- A WeChat mobile User-Agent is used automatically to bypass access restrictions.
- WeChat images use the `data-src` attribute (not `src`) because they are lazy-loaded.
- Downloading images requires the `Referer: https://mp.weixin.qq.com/` request header.
- For details of the HTML structure, see references/wechat_html_structure.md.
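The notes on links, lazy-loaded images, and request headers can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the script's own code. `ImageCollector`, `image_urls`, `is_short_link`, and `IMAGE_HEADERS` are hypothetical names, and the User-Agent string is an illustrative placeholder rather than the value the tool sends.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse

# The Referer comes from the notes above; the User-Agent is a placeholder.
IMAGE_HEADERS = {
    "Referer": "https://mp.weixin.qq.com/",
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) MicroMessenger",
}


class ImageCollector(HTMLParser):
    """Collect image URLs, preferring data-src over src."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        url = attr.get("data-src") or attr.get("src")
        if url:
            self.urls.append(url)


def image_urls(html: str) -> list[str]:
    """Return the image URLs found in an HTML fragment."""
    parser = ImageCollector()
    parser.feed(html)
    return parser.urls


def is_short_link(url: str) -> bool:
    """True for /s/<id> links that carry no __biz query parameter."""
    parts = urlparse(url)
    return parts.path.startswith("/s/") and "__biz" not in parse_qs(parts.query)
```

With the `requests` package installed, an image could then be requested with `requests.get(url, headers=IMAGE_HEADERS)`.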