fetch-wechat-article
WeChat Official Account Article Scraper
Overview
Scrapes WeChat Official Account articles with Playwright. It runs in the background without opening a browser window, waits for dynamically loaded content, extracts clean article text, and can save the result as a Markdown file.
Features
- ✅ Headless Mode Execution: Default background scraping without popping up browser windows
- ✅ Smart Fallback Mechanism: Automatically switch to headed mode when headless mode fails
- ✅ Dynamic Content Support: Automatically wait for page loading completion, handle lazy-loaded images
- ✅ Auto-save as Markdown: Support saving scraping results as formatted Markdown files
- ✅ Content Cleaning: Remove HTML tags, retain paragraph structure, output plain text
- ✅ Automatic Retry: Automatically retry 3 times on failure to improve success rate
- ✅ Error Detection: Identify abnormal pages such as "parameter error"
- ✅ Cross-platform Support: Fully compatible with Windows, macOS and Linux
- ✅ Smart Workflow: Automatically call formatting skill when detecting legal content
- ✅ Image Download: Automatically download all images in articles to local storage
- ✅ Smart Image Filtering: Automatically filter small decorative images (such as social media buttons, emojis)
- ✅ Image Position Preservation: Retain the position of images in the original document
- ✅ Auto File Naming: Generate file names and resource folders based on article titles
Collaboration with Other Skills
Smart Workflow
This skill focuses on article scraping and stays general-purpose. The AI automatically calls the legal-text-format skill for formatting only when legal content is detected.
AI execution flow:
```text
User Request → wechat-article-fetch scraping → [Judge Content Type]
                          ↓
     ┌────────────────────┴────────────────────┐
     ↓                                         ↓
Detected Legal Content                  Ordinary Article
     ↓                                         ↓
Automatically call legal-text-format    Save original content to project root directory
     ↓
Output to archive/ directory
```
Legal Content Detection
The AI decides whether an article is legal content based on the following features:
- Title Keywords: The title contains terms such as "案例" (case), "裁判" (adjudication), "判决" (judgment), "法规" (regulation), "条例" (ordinance), "最高法" (Supreme People's Court), or "最高检" (Supreme People's Procuratorate)
- Content Features: The text contains case numbers, court names, citations of legal provisions, etc.
- Structural Features: The article follows the typical structure of a legal case (basic facts, judgment result, typical significance, etc.)
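The three signals above can be approximated with simple string checks. The following is only a rough sketch: the detection in this skill is performed by the AI, and the function name `looksLikeLegalContent`, the case-number pattern, and the "two structural markers" threshold are all assumptions made for illustration.

```javascript
// Keyword lists taken from the features listed above.
const TITLE_KEYWORDS = ["案例", "裁判", "判决", "法规", "条例", "最高法", "最高检"];
const STRUCTURE_MARKERS = ["基本案情", "裁判结果", "典型意义"];

// Assumed shape of a Chinese case number, e.g. （2023）京01民终1234号.
const CASE_NUMBER = /[（(]\d{4}[）)][^\s，。；]{1,20}?\d+号/;

// Hypothetical heuristic: returns true when any one signal fires.
function looksLikeLegalContent(title, content) {
  const titleHit = TITLE_KEYWORDS.some((keyword) => title.includes(keyword));
  const caseNumberHit = CASE_NUMBER.test(content);
  const structureHits = STRUCTURE_MARKERS.filter((marker) => content.includes(marker)).length;
  // Assumption: two or more structural markers indicate a case write-up.
  return titleHit || caseNumberHit || structureHits >= 2;
}
```

A heuristic like this is deliberately permissive; a real classifier would likely weigh the signals rather than accept any single one.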
Default Save Location
- No Path Specified: Saves to the project root directory
- Relative Path Specified: Resolved relative to the project root directory
- Absolute Path Specified: Uses the full path as given
Examples:
```bash
# Save to project root directory
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx"
# Save to specified directory
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/"
# Save to specified file
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/case.md"
```
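The path rules above could be implemented roughly as follows. This is a sketch under stated assumptions, not the code in scripts/fetch.js: `resolveSavePath` is a hypothetical helper, and treating a trailing path separator as "this is a directory" is an assumption inferred from the examples.

```javascript
const path = require("path");

// Hypothetical helper: maps the optional save-path argument to a final file path.
function resolveSavePath(target, title, projectRoot) {
  const fileName = `${title}.md`;
  // No path given: save into the project root.
  if (!target) return path.join(projectRoot, fileName);
  // Relative paths are resolved against the project root; absolute paths are kept.
  const full = path.isAbsolute(target) ? target : path.resolve(projectRoot, target);
  // Assumption: a trailing separator marks a directory, anything else is a file.
  const isDirectory = /[\\/]$/.test(target);
  return isDirectory ? path.join(full, fileName) : full;
}
```

In practice the title would also need to be cleaned of characters that are not valid in file names before it is used here.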
Usage
Call in Claude Code
```javascript
// Scrape an article (returns the result only)
const result = await fetchWechatArticle("https://mp.weixin.qq.com/s/xxxxx");

// Scrape an article and automatically save it as a Markdown file
const saved = await fetchWechatArticle(
  "https://mp.weixin.qq.com/s/xxxxx",
  3,            // Retry count (optional)
  "./output.md" // Save path (optional)
);

// Return format:
// {
//   title: "Article Title",
//   content: "Article main text...",
//   url: "Article URL"
// }
```
Command Line Call
```bash
# Basic usage (output to console only)
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx"
# Save to specified file
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/my-article.md"
# Save to directory (automatically use article title as file name)
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/"
```
Output Format
Console Output
```text
Title: Article Title
First paragraph of article main text...
Second paragraph of article main text...
```
Markdown File Format
```markdown
# Article Title
Original URL: https://mp.weixin.qq.com/s/xxxxx
Scraping Time: 2026-01-21 20:30:00

First paragraph of article main text...

Second paragraph of article main text...
```
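A result object could be rendered into this Markdown layout along the following lines. This is a minimal sketch: `toMarkdown` and `formatTimestamp` are hypothetical names, and the exact header layout (blank lines, label wording) is an assumption based on the sample above.

```javascript
// Formats a Date as "YYYY-MM-DD HH:mm:ss" in local time.
function formatTimestamp(date) {
  const pad = (n) => String(n).padStart(2, "0");
  const day = `${date.getFullYear()}-${pad(date.getMonth() + 1)}-${pad(date.getDate())}`;
  const time = `${pad(date.getHours())}:${pad(date.getMinutes())}:${pad(date.getSeconds())}`;
  return `${day} ${time}`;
}

// Hypothetical renderer: turns { title, content, url } into a Markdown document.
function toMarkdown(article, scrapedAt = new Date()) {
  return [
    `# ${article.title}`,
    "",
    `Original URL: ${article.url}`,
    `Scraping Time: ${formatTimestamp(scrapedAt)}`,
    "",
    article.content,
    ""
  ].join("\n");
}
```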
File Structure
When an article contains images, the following file structure is generated automatically:
```text
Output Directory/
├── Article_Title.md          # Markdown file
└── Article_Title_assets/     # Image resource folder
    ├── image_xxx_0.jpg
    ├── image_xxx_1.jpg
    └── ...
```
Image Filtering
Smart image filtering is enabled by default and automatically filters out decorative images smaller than 15 KB (such as social media buttons and emojis).
You can modify the filtering configuration in scripts/fetch.js:
```javascript
const IMAGE_FILTER_CONFIG = {
  minFileSize: 15 * 1024, // Minimum file size (bytes)
  enabled: true           // Whether to enable filtering
};
```
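To illustrate how such a configuration might be applied, here is a small sketch. `shouldKeepImage` is a hypothetical helper, and the image objects with `name` and `size` fields are made up for the example; the actual script may structure this differently.

```javascript
const IMAGE_FILTER_CONFIG = {
  minFileSize: 15 * 1024, // Minimum file size (bytes)
  enabled: true           // Whether to enable filtering
};

// Hypothetical helper: decides whether a downloaded image should be kept.
function shouldKeepImage(sizeInBytes, config = IMAGE_FILTER_CONFIG) {
  // With filtering disabled, every image is kept.
  if (!config.enabled) return true;
  return sizeInBytes >= config.minFileSize;
}

// Illustrative data: one small icon and one real photo.
const downloaded = [
  { name: "share_button.png", size: 2 * 1024 },
  { name: "photo.jpg", size: 240 * 1024 }
];
const kept = downloaded.filter((image) => shouldKeepImage(image.size));
```

A size threshold is a cheap proxy for "decorative"; it can discard small but meaningful images, which is why the `enabled` switch is useful.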
Technical Implementation
Dependency Requirements
- Playwright (npx playwright install chromium)
- Node.js >= 14.0.0
Scraping Process
- Detect and install Playwright (if needed)
- Launch Playwright headless browser
- Set anti-detection parameters (User-Agent, webdriver hiding, etc.)
- Navigate to target URL, wait for network idle
- Scroll page to trigger lazy loading
- Extract the #js_content or .rich_media_content area
- Clean up HTML tags, retain paragraph structure
- Return title and plain text content
- Automatically save as Markdown file if save path is specified
- Automatically fall back to headed mode for retry if headless mode fails
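The "clean up HTML tags, retain paragraph structure" step can be sketched as a plain string transformation. This is an illustrative simplification, not the script's actual cleaner: `htmlToPlainText` is a hypothetical name, regex-based stripping only copes with reasonably well-formed markup, and only a handful of entities are decoded.

```javascript
// Hypothetical cleaner: converts article HTML to plain text, one paragraph per block.
function htmlToPlainText(html) {
  const entities = {
    "&nbsp;": " ", "&lt;": "<", "&gt;": ">", "&quot;": '"', "&#39;": "'", "&amp;": "&"
  };
  return html
    // Drop script and style elements together with their contents.
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, "")
    // Line breaks become newlines; closing block tags become paragraph breaks.
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<\/(p|div|section|h[1-6]|li)>/gi, "\n\n")
    // Remove all remaining tags.
    .replace(/<[^>]+>/g, "")
    // Decode a small set of common entities in a single pass.
    .replace(/&(nbsp|lt|gt|quot|#39|amp);/g, (match) => entities[match])
    // Normalize to non-empty paragraphs separated by one blank line.
    .split(/\n{2,}/)
    .map((paragraph) => paragraph.trim())
    .filter(Boolean)
    .join("\n\n");
}
```

When the text is extracted inside the browser, reading the element's rendered text is usually more robust than post-processing HTML like this.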
Error Handling
- Automatically retry 3 times, wait 3 seconds after each failure
- Automatically fall back to headed mode if headless mode fails
- Detect error pages (parameter error, access exception)
- Timeout set to 30 seconds
- Special handling for Windows platform (paths, command formats)
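The retry and fallback behavior listed above might be structured like this. It is a sketch only: `fetchWithRetry` is a hypothetical wrapper, `scrape(url, { headless })` stands in for the real Playwright routine, and whether headed mode gets its own full set of retries is an assumption.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: `scrape` is expected to throw when a fetch attempt fails.
async function fetchWithRetry(scrape, url, { retries = 3, delayMs = 3000 } = {}) {
  let lastError;
  // Try headless mode first, then fall back to headed mode.
  for (const headless of [true, false]) {
    for (let attempt = 1; attempt <= retries; attempt++) {
      try {
        return await scrape(url, { headless });
      } catch (error) {
        lastError = error;
        // Wait after each failure before the next attempt.
        await sleep(delayMs);
      }
    }
  }
  // Every attempt in both modes failed.
  throw lastError;
}
```

Passing the scraping routine in as a parameter keeps the retry policy separate from the browser code, which also makes it easy to exercise with a stub.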
Cross-Platform Compatibility
- Windows: Automatically detects the platform and uses cmd.exe to run npx commands
- macOS/Linux: Runs npx commands directly
- Path Handling: Automatically normalizes path separators
- File Name Handling: Automatically removes characters that are illegal on Windows
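Two of these points lend themselves to a short sketch. Both helpers below are hypothetical (`sanitizeFileName`, `npxInvocation`), the `"untitled"` fallback is an assumption, and the `cmd.exe /c npx ...` form is one plausible way to run npx on Windows rather than a statement of what the script does.

```javascript
// Hypothetical helper: strips characters Windows does not allow in file names.
function sanitizeFileName(title) {
  const cleaned = title
    // Reserved characters and ASCII control characters.
    .replace(/[<>:"\/\\|?*\u0000-\u001f]/g, "")
    // Windows also rejects names that end in a dot or a space.
    .replace(/[. ]+$/, "")
    .trim();
  // Assumption: fall back to a fixed name when nothing usable remains.
  return cleaned || "untitled";
}

// Hypothetical helper: chooses how to invoke npx on the given platform.
function npxInvocation(args, platform = process.platform) {
  return platform === "win32"
    ? { command: "cmd.exe", args: ["/c", "npx", ...args] }
    : { command: "npx", args };
}
```

The returned `command` and `args` pair is shaped so it could be handed to Node's `child_process.spawn`.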
Application Scenarios
- Input source for content conversion tools
- Article analysis and processing
- Automated content scraping
- Batch article downloading
- Article archiving and local saving
- Markdown format conversion
- Legal document automatic formatting (when legal content is detected)
- Complete saving of articles with images (offline archiving including images)
- Image resource management (automatically download and organize images in articles)
Usage Examples
Example 1: Batch Scraping and Saving
```javascript
const urls = [
  "https://mp.weixin.qq.com/s/xxxx1",
  "https://mp.weixin.qq.com/s/xxxx2",
  "https://mp.weixin.qq.com/s/xxxx3"
];
for (const url of urls) {
  const result = await fetchWechatArticle(url, 3, "./articles/");
  console.log(`Saved: ${result.title}`);
}
```
Example 2: Direct Use in Claude Code
```text
Please help me scrape this WeChat Official Account article and save it as a Markdown file:
https://mp.weixin.qq.com/s/xxxxx
```
Notes
⚠️ For personal study and research only; please comply with the website's terms of service
⚠️ Frequent scraping may trigger rate limiting, so keep the request frequency under control
⚠️ Copyright of the scraped content belongs to the original author
⚠️ Headed mode opens a browser window, which may interrupt your workflow
⚠️ Windows users need Playwright installed before first use (it is installed automatically)
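One way to respect the rate-limiting note in a batch run is to pause between requests. The sketch below is an illustration under assumptions: `fetchSequentially` is a hypothetical helper, `fetchArticle` stands in for fetchWechatArticle, and the 5-second default delay is an arbitrary choice rather than a documented limit.

```javascript
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical batch helper: fetches URLs one at a time with a delay in between.
async function fetchSequentially(urls, fetchArticle, delayMs = 5000) {
  const results = [];
  for (const [index, url] of urls.entries()) {
    // Pause between requests, but not before the first one.
    if (index > 0) await pause(delayMs);
    try {
      results.push({ url, ok: true, article: await fetchArticle(url) });
    } catch (error) {
      // Record the failure and continue with the remaining URLs.
      results.push({ url, ok: false, error: error.message });
    }
  }
  return results;
}
```

Collecting failures instead of aborting lets a long batch finish and leaves a list of URLs to retry later.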
