fetch-wechat-article

WeChat Official Account Article Scraper

Overview

Scrapes WeChat Official Account articles with Playwright. It runs in the background without opening a browser window, handles dynamically loaded content, extracts clean article text, and can save the result as a Markdown file.

Features

  • Headless mode: scrapes in the background by default, without opening a browser window
  • Smart fallback: switches to headed mode automatically when headless mode fails
  • Dynamic content support: waits for the page to finish loading and handles lazy-loaded images
  • Auto-save as Markdown: can save the scraped result as a formatted Markdown file
  • Content cleaning: removes HTML tags, keeps paragraph structure, and outputs plain text
  • Automatic retry: retries up to 3 times on failure
  • Error detection: recognizes error pages such as "parameter error" (参数错误)
  • Cross-platform support: works on Windows, macOS, and Linux
  • Smart workflow: calls the formatting skill automatically when legal content is detected
  • Image download: downloads the images in the article to local storage
  • Smart image filtering: filters out small decorative images (such as social media buttons and emojis)
  • Image position preservation: keeps each image at its original position in the document
  • Automatic file naming: derives the file name and asset folder name from the article title

Collaboration with Other Skills

Smart Workflow

This skill focuses on article scraping and stays general-purpose. Only when legal content is detected does the AI automatically call the legal-text-format skill to format the result.
AI execution flow:
text
User request → wechat-article-fetch scrapes → [determine content type]
                    ┌────────────────────────┴────────────────────────┐
                    ↓                                                 ↓
          Legal content detected                               Ordinary article
                    ↓                                                 ↓
      Call legal-text-format automatically        Save original content to project root
          Output to archive/ directory

Legal Content Detection

The AI decides whether an article is legal content based on the following signals:
  • Title keywords: the title contains terms such as "案例" (case), "裁判" (adjudication), "判决" (judgment), "法规" (regulation), "条例" (ordinance), "最高法" (Supreme People's Court), or "最高检" (Supreme People's Procuratorate)
  • Content features: the text contains case numbers, court names, citations of legal provisions, and so on
  • Structural features: the article follows the typical structure of a legal case write-up (basic facts, judgment result, typical significance, and so on)
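The three signals above can be approximated with a simple rule-based check. This is only a sketch: the skill leaves the actual judgment to the AI, and the function name `looksLikeLegalContent`, the case-number pattern, and the threshold of two matching section headings are assumptions made for illustration.

```javascript
// Title keywords listed in this document.
const LEGAL_TITLE_KEYWORDS = ["案例", "裁判", "判决", "法规", "条例", "最高法", "最高检"];

// Assumed shape of a Chinese case number, e.g. "（2023）京01民终123号".
const CASE_NUMBER_PATTERN = /[（(]\d{4}[）)][^\s，。；]{1,20}?号/;

// Section headings typical of published legal cases (from the list above).
const CASE_SECTION_HEADINGS = ["基本案情", "裁判结果", "典型意义"];

// Hypothetical heuristic: returns true when any one signal is present.
function looksLikeLegalContent(title, content) {
  const titleHit = LEGAL_TITLE_KEYWORDS.some((keyword) => title.includes(keyword));
  const caseNumberHit = CASE_NUMBER_PATTERN.test(content);
  const headingHits = CASE_SECTION_HEADINGS.filter((h) => content.includes(h)).length;
  // Requiring two headings is an assumption to avoid matching a stray phrase.
  return titleHit || caseNumberHit || headingHits >= 2;
}
```

A keyword check like this is cheap but will produce false positives (for example, a news article that merely mentions a judgment), which is one reason to treat it as a pre-filter at most.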

Default Save Location

  • No path specified: saves to the project root directory
  • Relative path specified: resolved relative to the project root directory
  • Absolute path specified: uses the full path as given
Examples:
bash
# Save to the project root directory
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx"

# Save to a specified directory
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/"

# Save to a specified file
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/case.md"
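The save-location rules can be sketched with Node's `path` module. The helper name `resolveSavePath` is hypothetical, and the sketch assumes that a target ending in a path separator names a directory and that the article title is used as the file name in that case; the real script may decide this differently.

```javascript
const path = require("path");

// Hypothetical helper: resolves where the Markdown file should be written.
// Assumption: a target ending in "/" or "\" names a directory, and a missing
// target means "project root"; in both cases the title becomes the file name.
function resolveSavePath(projectRoot, target, title) {
  const fileName = `${title}.md`;
  if (!target) {
    return path.join(projectRoot, fileName);
  }
  const isDirectory = /[\\/]$/.test(target);
  // path.resolve leaves absolute targets unchanged and anchors relative
  // targets at projectRoot.
  const resolved = path.resolve(projectRoot, target);
  return isDirectory ? path.join(resolved, fileName) : resolved;
}
```

For example, `resolveSavePath(process.cwd(), "./articles/", "My Article")` yields a path ending in `articles/My Article.md` under the current directory.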

Usage

Calling from Claude Code

javascript
// Scrape an article (return the result only)
const result = await fetchWechatArticle("https://mp.weixin.qq.com/s/xxxxx");

// Scrape an article and save it as a Markdown file
const savedResult = await fetchWechatArticle(
  "https://mp.weixin.qq.com/s/xxxxx",
  3,            // Retry count (optional)
  "./output.md" // Save path (optional)
);

// Return format:
// {
//   title: "Article Title",
//   content: "Article main text...",
//   url: "Article URL"
// }

Command-Line Usage

bash
# Basic usage (prints to the console only)
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx"

# Save to a specified file
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/my-article.md"

# Save to a directory (the article title is used as the file name)
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/"

Output Format

Console Output

text
Title: Article Title

First paragraph of the article body...

Second paragraph of the article body...

Markdown File Format

markdown
# Article Title

Original URL: https://mp.weixin.qq.com/s/xxxxx
Scraping Time: 2026-01-21 20:30:00

First paragraph of the article body...
Image Description
Second paragraph of the article body...
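A minimal sketch of how a scrape result could be turned into a file in this format. The helper names `formatTimestamp` and `buildMarkdown` are hypothetical, the sketch assumes the title is written as a top-level heading, and it covers only the text part (image references are left out).

```javascript
// Formats a Date as "YYYY-MM-DD HH:mm:ss" in local time.
function formatTimestamp(date) {
  const pad = (n) => String(n).padStart(2, "0");
  const day = `${date.getFullYear()}-${pad(date.getMonth() + 1)}-${pad(date.getDate())}`;
  const time = `${pad(date.getHours())}:${pad(date.getMinutes())}:${pad(date.getSeconds())}`;
  return `${day} ${time}`;
}

// Hypothetical helper: builds the Markdown text from a { title, content, url }
// result. The exact layout used by scripts/fetch.js is an assumption here.
function buildMarkdown(result, scrapedAt = new Date()) {
  return [
    `# ${result.title}`,
    "",
    `Original URL: ${result.url}`,
    `Scraping Time: ${formatTimestamp(scrapedAt)}`,
    "",
    result.content,
    "",
  ].join("\n");
}
```

Passing the timestamp in as a parameter keeps the function deterministic, which makes it easy to check the output in a test.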

File Structure

When an article contains images, the following file structure is generated automatically:
Output Directory/
├── Article_Title.md              # Markdown file
└── Article_Title_assets/         # Image asset folder
    ├── image_xxx_0.jpg
    ├── image_xxx_1.jpg
    └── ...

Image Filtering

Smart image filtering is enabled by default and filters out decorative images smaller than 15 KB (such as social media buttons and emojis).
You can change the filter settings in scripts/fetch.js:
javascript
const IMAGE_FILTER_CONFIG = {
  minFileSize: 15 * 1024,  // Minimum file size (bytes)
  enabled: true            // Whether filtering is enabled
};
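To show how such a configuration might be applied, here is a small sketch of the size check. `shouldKeepImage` is a hypothetical name, and treating an image of exactly 15 KB as "keep" is an assumption about the boundary.

```javascript
const IMAGE_FILTER_CONFIG = {
  minFileSize: 15 * 1024, // Minimum file size (bytes)
  enabled: true,          // Whether filtering is enabled
};

// Hypothetical helper: decides whether a downloaded image is kept.
// Images below minFileSize are treated as decorative and dropped.
function shouldKeepImage(sizeInBytes, config = IMAGE_FILTER_CONFIG) {
  if (!config.enabled) {
    return true; // Filtering disabled: keep everything.
  }
  return sizeInBytes >= config.minFileSize;
}
```

With a downloaded image held in a `Buffer`, the check would be `shouldKeepImage(buffer.length)`.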

Technical Implementation

Dependency Requirements

  • Playwright (npx playwright install chromium)
  • Node.js >= 14.0.0

Scraping Process

  1. Detect and install Playwright (if needed)
  2. Launch a headless Playwright browser
  3. Set anti-detection parameters (User-Agent, hiding the webdriver flag, and so on)
  4. Navigate to the target URL and wait for the network to go idle
  5. Scroll the page to trigger lazy loading
  6. Extract the #js_content or .rich_media_content area
  7. Strip HTML tags while keeping the paragraph structure
  8. Return the title and plain-text content
  9. If a save path was given, save the result as a Markdown file
  10. If headless mode fails, fall back to headed mode and retry
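Steps 2 to 8 and step 10 can be sketched with the Playwright API roughly as follows. This is an illustration under assumptions, not the code in scripts/fetch.js: the function names `scrapeOnce` and `scrapeWithFallback`, the User-Agent string, the scroll step size, and the use of `page.title()` for the article title are all placeholders chosen for the example.

```javascript
// Sketch only: needs the "playwright" package and an installed Chromium.
async function scrapeOnce(url, { headless = true, timeoutMs = 30000 } = {}) {
  const { chromium } = require("playwright"); // Loaded lazily, on first call.
  const browser = await chromium.launch({ headless });
  try {
    const context = await browser.newContext({
      // Placeholder desktop User-Agent (assumption, not the script's value).
      userAgent:
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    });
    // Hide navigator.webdriver before any page script runs.
    await context.addInitScript(() => {
      Object.defineProperty(navigator, "webdriver", { get: () => undefined });
    });
    const page = await context.newPage();
    await page.goto(url, { waitUntil: "networkidle", timeout: timeoutMs });

    // Scroll down in steps so lazy-loaded images start loading.
    await page.evaluate(async () => {
      for (let y = 0; y < document.body.scrollHeight; y += 600) {
        window.scrollTo(0, y);
        await new Promise((resolve) => setTimeout(resolve, 100));
      }
    });

    // innerText returns rendered text, so tags are gone and paragraphs
    // remain separated by line breaks.
    const body = page.locator("#js_content, .rich_media_content").first();
    const content = (await body.innerText({ timeout: timeoutMs })).trim();
    const title = (await page.title()).trim();
    return { title, content, url };
  } finally {
    await browser.close();
  }
}

// Try headless first, then fall back to a visible browser window.
async function scrapeWithFallback(url) {
  try {
    return await scrapeOnce(url, { headless: true });
  } catch (error) {
    return scrapeOnce(url, { headless: false });
  }
}
```

Closing the browser in `finally` matters here because a failed headless attempt would otherwise leave a Chromium process running while the headed retry starts.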

Error Handling

  • Retries up to 3 times, waiting 3 seconds after each failure
  • Falls back to headed mode automatically if headless mode fails
  • Detects error pages (parameter error, access exception)
  • Uses a 30-second timeout
  • Handles Windows-specific details (paths, command formats)
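The retry and error-page rules above could look like the sketch below. The names `isErrorPage` and `withRetry` are hypothetical, and the sketch reads "retry 3 times" as three attempts in total with no wait after the last one; the actual script may count differently.

```javascript
// Error-page markers named in this document (text shown on the WeChat page).
const ERROR_PAGE_MARKERS = ["参数错误", "访问异常"];

// Returns true when the page text contains a known error marker.
function isErrorPage(pageText) {
  return ERROR_PAGE_MARKERS.some((marker) => pageText.includes(marker));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: runs `task` up to `attempts` times and waits
// `delayMs` between failed attempts. Rethrows the last error if all fail.
async function withRetry(task, { attempts = 3, delayMs = 3000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        await sleep(delayMs);
      }
    }
  }
  throw lastError;
}
```

A scrape function can signal an error page by throwing, for example `if (isErrorPage(text)) throw new Error("error page")`, so that `withRetry` treats it like any other failed attempt.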

Cross-Platform Compatibility

  • Windows: detects the platform and runs npx commands through cmd.exe
  • macOS/Linux: runs npx commands directly
  • Path handling: normalizes path separators automatically
  • File name handling: removes characters that are illegal in Windows file names
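Two of these points lend themselves to a short sketch: cleaning a title into a safe file name, and choosing how to invoke npx per platform. Both helper names are hypothetical, and the `"untitled"` fallback and the trimming of trailing dots and spaces are assumptions added for the example.

```javascript
// Characters Windows forbids in file names, plus ASCII control characters.
const WINDOWS_ILLEGAL_CHARS = /[<>:"/\\|?*\x00-\x1f]/g;

// Hypothetical helper: turns an article title into a portable file name.
function sanitizeFileName(title) {
  const cleaned = title
    .replace(WINDOWS_ILLEGAL_CHARS, "")
    .replace(/[. ]+$/, "") // Windows also rejects trailing dots and spaces.
    .trim();
  return cleaned || "untitled"; // Fallback name is an assumption.
}

// Hypothetical helper: picks the command used to run npx on each platform.
function buildNpxCommand(args, platform = process.platform) {
  if (platform === "win32") {
    return { command: "cmd.exe", args: ["/c", "npx", ...args] };
  }
  return { command: "npx", args };
}
```

The returned pair can be passed to `child_process.spawn(command, args, { stdio: "inherit" })`, for instance with `buildNpxCommand(["playwright", "install", "chromium"])`.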

Application Scenarios

  • Input source for content conversion tools
  • Article analysis and processing
  • Automated content scraping
  • Batch article downloading
  • Article archiving and local saving
  • Markdown format conversion
  • Automatic formatting of legal documents (when legal content is detected)
  • Complete saving of illustrated articles (offline archives that include images)
  • Image asset management (downloading and organizing the images in an article)

Usage Examples

Example 1: Batch Scraping and Saving

javascript
const urls = [
  "https://mp.weixin.qq.com/s/xxxx1",
  "https://mp.weixin.qq.com/s/xxxx2",
  "https://mp.weixin.qq.com/s/xxxx3"
];

// Fetch one article at a time; each is saved under ./articles/
for (const url of urls) {
  const result = await fetchWechatArticle(url, 3, "./articles/");
  console.log(`Saved: ${result.title}`);
}
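Because frequent scraping can be rate limited, a batch run may benefit from a pause between requests. The sketch below shows one way to do that; `fetchSequentially` is a hypothetical helper, `fetchArticle` stands in for a call such as `(url) => fetchWechatArticle(url, 3, "./articles/")`, and the 5-second default delay is an arbitrary assumption.

```javascript
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: fetches URLs one by one, waiting `delayMs`
// before every request except the first.
async function fetchSequentially(urls, fetchArticle, delayMs = 5000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) {
      await pause(delayMs);
    }
    results.push(await fetchArticle(urls[i]));
  }
  return results;
}
```

Running the requests strictly in sequence keeps at most one browser session open at a time, at the cost of a longer total run.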

Example 2: Direct Use in Claude Code

text
Please help me scrape this WeChat Official Account article and save it as a Markdown file:
https://mp.weixin.qq.com/s/xxxxx

Notes

⚠️ For personal study and research only; please comply with the website's terms of service
⚠️ Frequent scraping may be rate limited, so keep the request frequency under control
⚠️ Copyright of the scraped content belongs to the original author
⚠️ Headed mode opens a browser window, which may interrupt your workflow
⚠️ Windows users need Playwright installed before first use (it is installed automatically)