wechat-article-fetcher
WeChat Official Account Article Fetcher
Fetch, parse, and save WeChat Official Account articles, with support for single and batch downloads, metadata extraction, image download, and Markdown conversion.
Quick Start
Fetch a single article:
bash
python scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/xxxxx"
Fetch multiple articles in a batch (space-separated):
bash
python scripts/fetch_wechat_article.py "url1" "url2" "url3" --output-dir ./output
Fetch multiple articles in a batch (comma-separated):
bash
python scripts/fetch_wechat_article.py "url1,url2,url3" --output-dir ./output
Output metadata only (no files saved):
bash
python scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/xxxxx" --json
Dependency Installation
bash
pip install beautifulsoup4 html2text requests
Feature Description
1. Fetch articles and save them locally
bash
python scripts/fetch_wechat_article.py "<url>" --output-dir ./output
Output directory structure:
output/<account name>/<date>_<title>/
├── index.html # Formatted standalone HTML file
├── article.md # Markdown version
├── meta.json # Article metadata
└── images/ # Downloaded images
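The directory layout above can be pictured with a short sketch. The helper `article_dir`, its character-cleaning rule, and the date format are illustrative assumptions, not the script's actual naming logic:

```python
import re
from pathlib import Path


def article_dir(output_dir: str, account: str, date: str, title: str) -> Path:
    """Hypothetical helper: build <output_dir>/<account name>/<date>_<title>."""

    def clean(name: str) -> str:
        # Assumed rule: collapse whitespace and characters that are invalid
        # in file names on common platforms into a single underscore.
        return re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_") or "untitled"

    return Path(output_dir) / clean(account) / f"{date}_{clean(title)}"


# Example values are made up; the real date format may differ.
path = article_dir("./output", "Example Account", "2024-01-01", "Hello: World?")
print(path.as_posix())
```

Grouping by account name first is what lets articles from the same Official Account end up under one parent directory.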
2. Extract metadata only
bash
python scripts/fetch_wechat_article.py "<url>" --json
The returned JSON contains: `title`, `author`, `account_nickname` (Official Account name), `description` (summary), `create_time` (publish time), `content_text` (body text), `content_markdown` (Markdown content), `cover_image` (cover image), and `source_url` (original article link).
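As a rough illustration of consuming that JSON, the sketch below parses a made-up sample and checks it against the documented field names. The helpers `missing_fields` and `summarize` are hypothetical, and the sample values are placeholders rather than real script output:

```python
import json

# Field names as listed in this section.
EXPECTED_FIELDS = (
    "title", "author", "account_nickname", "description", "create_time",
    "content_text", "content_markdown", "cover_image", "source_url",
)


def missing_fields(meta: dict) -> list:
    """Return the documented field names that are absent from meta."""
    return [name for name in EXPECTED_FIELDS if name not in meta]


def summarize(meta: dict, preview_chars: int = 80) -> str:
    """Build a one-line summary from a few of the fields."""
    title = meta.get("title", "(untitled)")
    account = meta.get("account_nickname", "(unknown account)")
    preview = meta.get("content_text", "")[:preview_chars]
    return f"{title} [{account}]: {preview}"


# Made-up sample; a real response would carry all nine fields.
raw = (
    '{"title": "Example title", "account_nickname": "Example account", '
    '"content_text": "Body text goes here."}'
)
meta = json.loads(raw)
print(summarize(meta))
print("missing:", missing_fields(meta))
```

Using `dict.get` with defaults keeps the summary from failing if a field is absent, which may be useful when an article has no listed author or cover image.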
3. Batch download multiple articles
Multiple links separated by spaces:
bash
python scripts/fetch_wechat_article.py "url1" "url2" "url3" --output-dir ./output
Multiple links separated by commas:
bash
python scripts/fetch_wechat_article.py "url1,url2,url3" --output-dir ./output
Custom download interval (default 3 seconds, to avoid triggering anti-crawling measures):
bash
python scripts/fetch_wechat_article.py "url1" "url2" --interval 5
Articles from the same Official Account are automatically grouped into the same directory.
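One way to picture the batch behavior described above is the sketch below. It is not the script's implementation: `normalize_urls` and `fetch_paced` are hypothetical names, and it assumes URLs contain no literal commas:

```python
import time
from typing import Callable, Dict, Iterable, List


def normalize_urls(args: Iterable[str]) -> List[str]:
    """Flatten separate arguments and comma-separated strings into one URL list."""
    urls = []
    for arg in args:
        for part in arg.split(","):
            part = part.strip()
            if part:
                urls.append(part)
    return urls


def fetch_paced(
    urls: List[str],
    fetch_one: Callable[[str], object],
    interval: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Call fetch_one per URL, pausing `interval` seconds between calls."""
    stats = {"success": 0, "fail": 0}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(interval)  # pause before every request except the first
        try:
            fetch_one(url)
            stats["success"] += 1
        except Exception:
            stats["fail"] += 1  # count the failure and continue with the rest
    return stats


def fake_fetch(url: str) -> None:
    # Stand-in for a real download; fails on one URL to show the counters.
    if url == "url2":
        raise ValueError("simulated failure")


urls = normalize_urls(["url1,url2", "url3"])
print(urls)
print(fetch_paced(urls, fake_fetch, interval=5, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter is only there so the example runs instantly; real use would leave the default `time.sleep` in place.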
4. Skip image download
bash
python scripts/fetch_wechat_article.py "<url>" --no-images
5. Call as a Python library
python
from scripts.fetch_wechat_article import fetch_article, batch_fetch
# Fetch and save a single article
result = fetch_article("https://mp.weixin.qq.com/s/xxxxx", output_dir="./output")
print(result['title'], result['path'])
# Fetch metadata only for a single article
meta = fetch_article("https://mp.weixin.qq.com/s/xxxxx", json_only=True)
print(meta['title'])
print(meta['content_text'][:200])
# Batch fetch
urls = ["https://mp.weixin.qq.com/s/aaa", "https://mp.weixin.qq.com/s/bbb"]
stats = batch_fetch(urls, output_dir="./output", interval=3.0)
print(f"Success: {stats['success']} articles, Fail: {stats['fail']} articles")
Main function parameters:
- `url`: Article link (both short and long links are supported)
- `output_dir`: Save directory (default: `./wechat_articles`)
- `download_img`: Whether to download images (default: `True`)
- `to_markdown`: Whether to convert to Markdown (default: `True`)
- `json_only`: Return only the metadata dictionary without saving files
Extra parameters for `batch_fetch`:
- `urls`: List of article links
- `interval`: Delay in seconds between article downloads (default: `3.0`)
Notes
- Prefer short links (`/s/xxxxx`); long links with the `__biz` parameter may trigger a CAPTCHA.
- The default interval for batch downloads is 3 seconds; adjust it with `--interval` to avoid triggering WeChat's anti-crawling mechanism.
- A WeChat mobile User-Agent is used automatically to bypass access restrictions.
- WeChat images use the `data-src` attribute (not `src`) because they are lazy-loaded.
- Image downloads require the `Referer: https://mp.weixin.qq.com/` request header.
- For details of the HTML structure, see references/wechat_html_structure.md.
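Several of these notes can be illustrated together. The following is a minimal sketch under stated assumptions, not the script's own code: it uses the standard-library `html.parser` instead of BeautifulSoup so it runs without extra packages, and `ImageSrcCollector`, `image_urls`, `is_short_link`, and `IMAGE_HEADERS` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse

# Header the notes say is needed when downloading images.
IMAGE_HEADERS = {"Referer": "https://mp.weixin.qq.com/"}


class ImageSrcCollector(HTMLParser):
    """Collect image URLs, preferring data-src over src (lazy loading)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        url = attributes.get("data-src") or attributes.get("src")
        if url:
            self.urls.append(url)


def image_urls(html: str) -> list:
    """Return the image URLs found in an HTML fragment, in document order."""
    collector = ImageSrcCollector()
    collector.feed(html)
    return collector.urls


def is_short_link(url: str) -> bool:
    """Assumed check: a /s/<id> path and no __biz query parameter."""
    parsed = urlparse(url)
    return parsed.path.startswith("/s/") and "__biz" not in parse_qs(parsed.query)


# Made-up fragment: the first image is lazy-loaded, the second is not.
html = (
    '<p><img data-src="https://example.com/a.png" src="placeholder.gif">'
    '<img src="https://example.com/b.png"></p>'
)
print(image_urls(html))
print(is_short_link("https://mp.weixin.qq.com/s/xxxxx"))
print(is_short_link("https://mp.weixin.qq.com/s?__biz=abc&mid=1"))
```

With `requests`, the header dictionary would typically be passed as `requests.get(image_url, headers=IMAGE_HEADERS, timeout=10)`; the timeout value here is an arbitrary choice.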