wechat-article-fetcher

WeChat Official Account Article Fetcher

Fetch, parse, and save WeChat Official Account articles. Supports single and batch downloads, metadata extraction, image downloads, and Markdown conversion.

Quick Start

Fetch a single article:
```bash
python scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/xxxxx"
```
Fetch multiple articles in one batch (space-separated):
```bash
python scripts/fetch_wechat_article.py "url1" "url2" "url3" --output-dir ./output
```
Fetch multiple articles in one batch (comma-separated):
```bash
python scripts/fetch_wechat_article.py "url1,url2,url3" --output-dir ./output
```
Output metadata only (no files are saved):
```bash
python scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/xxxxx" --json
```

Dependency Installation

```bash
pip install beautifulsoup4 html2text requests
```

Features

1. Fetch articles and save them locally

```bash
python scripts/fetch_wechat_article.py "<url>" --output-dir ./output
```
Output directory structure:
```
output/<account name>/<date>_<title>/
├── index.html    # Formatted standalone HTML file
├── article.md    # Markdown version
├── meta.json     # Article metadata
└── images/       # Downloaded images
```
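The directory layout above can be illustrated with a small path-building sketch. It is not the script's actual code: the helper names `safe_name` and `article_dir`, the sanitization rule, and the `YYYY-MM-DD` date format are all assumptions made for illustration.

```python
import re
from pathlib import Path


def safe_name(text: str, max_len: int = 80) -> str:
    """Hypothetical helper: replace characters that are invalid in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", text).strip(" ._")
    return cleaned[:max_len] or "untitled"


def article_dir(output_dir: str, account: str, date: str, title: str) -> Path:
    """Build <output_dir>/<account name>/<date>_<title>/ (date format is assumed)."""
    return Path(output_dir) / safe_name(account) / f"{date}_{safe_name(title)}"


path = article_dir("./output", "Example Account", "2024-01-15", "What is A/B testing?")
print(path.as_posix())
```

Grouping by account name first is what lets several articles from the same Official Account share one parent directory.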

2. Extract metadata only

```bash
python scripts/fetch_wechat_article.py "<url>" --json
```
The returned JSON contains:
- `title`: article title
- `author`: author
- `account_nickname`: Official Account name
- `description`: summary
- `create_time`: publish time
- `content_text`: body text
- `content_markdown`: Markdown content
- `cover_image`: cover image
- `source_url`: original article link
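As a rough illustration of consuming these fields, the sketch below parses a metadata JSON string and builds a one-line summary. The `summarize` helper and the sample values are hypothetical; only the field names come from the list above.

```python
import json


def summarize(meta_json: str) -> str:
    """Hypothetical helper: turn a metadata JSON string into a short summary."""
    meta = json.loads(meta_json)
    title = meta.get("title") or "(untitled)"
    author = meta.get("author") or "unknown author"
    account = meta.get("account_nickname") or "unknown account"
    preview = (meta.get("content_text") or "")[:60]
    return f"{title} | {author} @ {account} | {preview}"


# Sample data invented for illustration; real output comes from --json.
sample = json.dumps({
    "title": "Hello",
    "author": "Alice",
    "account_nickname": "Example Account",
    "content_text": "Body text",
})
print(summarize(sample))
```

This assumes the `--json` output is a single JSON object; if so, redirecting it to a file and passing the file contents to a function like this should work.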

3. Batch download multiple articles

Separate multiple links with spaces:
```bash
python scripts/fetch_wechat_article.py "url1" "url2" "url3" --output-dir ./output
```
Separate multiple links with commas:
```bash
python scripts/fetch_wechat_article.py "url1,url2,url3" --output-dir ./output
```
Set a custom download interval (default: 3 seconds, to avoid triggering anti-crawling measures):
```bash
python scripts/fetch_wechat_article.py "url1" "url2" --interval 5
```
Articles from the same Official Account are automatically grouped into the same directory.
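Since both space-separated and comma-separated links are accepted, the arguments have to be flattened into one URL list at some point. The following is a minimal sketch of one way to do that, not the script's own parsing code; `split_urls` is a hypothetical name, and dropping duplicates is an added assumption.

```python
from typing import Iterable, List


def split_urls(args: Iterable[str]) -> List[str]:
    """Flatten space- and comma-separated arguments into an ordered URL list."""
    urls: List[str] = []
    seen = set()
    for arg in args:                 # each positional argument (space-separated)
        for part in arg.split(","):  # may itself hold comma-separated links
            url = part.strip()
            if url and url not in seen:  # skip empty pieces and repeats (assumption)
                seen.add(url)
                urls.append(url)
    return urls


print(split_urls(["url1,url2", "url3", " url2 , url4,"]))
```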

4. Skip image download

```bash
python scripts/fetch_wechat_article.py "<url>" --no-images
```

5. Call as a Python library

```python
from scripts.fetch_wechat_article import fetch_article, batch_fetch

# Fetch and save a single article
result = fetch_article("https://mp.weixin.qq.com/s/xxxxx", output_dir="./output")
print(result['title'], result['path'])

# Fetch only the metadata for a single article
meta = fetch_article("https://mp.weixin.qq.com/s/xxxxx", json_only=True)
print(meta['title'])
print(meta['content_text'][:200])

# Batch fetch
urls = ["https://mp.weixin.qq.com/s/aaa", "https://mp.weixin.qq.com/s/bbb"]
stats = batch_fetch(urls, output_dir="./output", interval=3.0)
print(f"Success: {stats['success']} articles, Fail: {stats['fail']} articles")
```

Main function parameters:
- `url`: Article link (both short and long links are supported)
- `output_dir`: Directory to save to (default: `./wechat_articles`)
- `download_img`: Whether to download images (default: `True`)
- `to_markdown`: Whether to convert to Markdown (default: `True`)
- `json_only`: Return only the metadata dictionary without saving any files

Extra parameters for `batch_fetch`:
- `urls`: List of article links
- `interval`: Interval in seconds between article downloads (default: `3.0`)
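To make the role of `interval` and the `success`/`fail` counters concrete, here is a sketch of a paced batch loop. It is an assumption about how such a loop might look, not the implementation of `batch_fetch`; `batch_with_interval`, `fetch_one`, and the injectable `sleep` argument are hypothetical names.

```python
import time
from typing import Callable, Dict, Iterable


def batch_with_interval(
    urls: Iterable[str],
    fetch_one: Callable[[str], object],
    interval: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Call fetch_one on each URL, pausing between requests and counting results."""
    url_list = list(urls)
    stats = {"success": 0, "fail": 0}
    for i, url in enumerate(url_list):
        try:
            fetch_one(url)
            stats["success"] += 1
        except Exception:
            # One failed article should not abort the rest of the batch.
            stats["fail"] += 1
        if i < len(url_list) - 1:  # no pause needed after the last article
            sleep(interval)
    return stats


def fake_fetch(url: str) -> None:
    """Stand-in fetcher for the demo: fails on URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError("simulated failure")


pauses = []  # record requested pauses instead of actually sleeping
stats = batch_with_interval(["u1", "bad-url", "u3"], fake_fetch, interval=3.0, sleep=pauses.append)
print(stats, pauses)
```

Passing the sleep function as a parameter keeps the demo instant while leaving the default behavior a real pause.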

Notes

- Prefer short links (`/s/xxxxx`); long links with the `__biz` parameter may trigger a CAPTCHA.
- Batch downloads wait 3 seconds between articles by default; adjust this with `--interval` to avoid triggering WeChat's anti-crawling mechanism.
- A WeChat mobile User-Agent is used automatically to bypass access restrictions.
- WeChat images use the `data-src` attribute (not `src`) because they are lazy-loaded.
- Image downloads require the `Referer: https://mp.weixin.qq.com/` request header.
- For details of the HTML structure, see references/wechat_html_structure.md.
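The notes on `data-src` and the `Referer` header can be sketched as follows. The project itself lists BeautifulSoup as a dependency; this sketch uses the standard-library `html.parser` instead so it runs with no extra packages. `ImageCollector`, `image_urls`, `image_headers`, the sample HTML, and the User-Agent string are all hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List


class ImageCollector(HTMLParser):
    """Collect image URLs, preferring the lazy-loading data-src attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        # With lazy loading, src may hold only a placeholder; data-src has the real URL.
        url = attr.get("data-src") or attr.get("src")
        if url:
            self.urls.append(url)


def image_urls(html: str) -> List[str]:
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


def image_headers(user_agent: str) -> Dict[str, str]:
    """Headers for image requests; the Referer value is taken from the notes above."""
    return {"Referer": "https://mp.weixin.qq.com/", "User-Agent": user_agent}


sample_html = (
    '<p><img data-src="https://example.com/a.jpg" src="data:image/gif;base64,AAAA">'
    '<img src="https://example.com/b.png"/></p>'
)
print(image_urls(sample_html))
print(image_headers("ExampleMobileUA/1.0"))  # placeholder, not a real WeChat User-Agent
```

A headers dictionary like this could then be passed to `requests.get(url, headers=...)` when downloading each image.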