wechat-article-fetch

WeChat Official Account Article Scraper

Overview

Uses Playwright to scrape WeChat Official Account articles. It runs headless in the background with no browser window, waits for dynamically loaded content, extracts clean article text, and can save the result as a Markdown file.

Features

✅ Headless by default: scrapes in the background without opening a browser window
✅ Smart fallback: switches to headed mode automatically when headless mode fails
✅ Dynamic content support: waits for the page to finish loading and handles lazy-loaded images
✅ Auto-save as Markdown: saves the scraped result as a formatted Markdown file
✅ Content cleaning: strips HTML tags, keeps paragraph structure, and outputs plain text
✅ Auto-retry: retries up to 3 times on failure
✅ Error detection: recognizes error pages such as "参数错误" (parameter error)
✅ Cross-platform: supports Windows, macOS, and Linux
✅ Smart workflow: invokes a formatting skill automatically when legal content is detected
✅ Image download: downloads the images in the article to local storage
✅ Smart image filtering: skips small decorative images (such as social media buttons and emojis)
✅ Image position retention: keeps each image at its original position in the document
✅ Auto file naming: derives the file name and assets folder from the article title

Collaboration with Other Skills

Smart Workflow

This skill focuses on article scraping and stays general-purpose. The AI calls the `legal-text-format` skill for formatting only when it detects legal content.

AI execution flow:

text

User request → wechat-article-fetch scrapes → [determine content type]
                                              ↓
                    ┌────────────────────────┴────────────────────────┐
                    ↓                                                 ↓
              Legal content detected                            Regular article
                    ↓                                                 ↓
          Auto-call legal-text-format                      Save original content to project root directory
                    ↓
          Output to archive/ directory

Legal Content Detection

The AI decides whether content is legal-related based on the following signals:

Title keywords: the title contains terms such as "案例" (case), "裁判" (judgment), "判决" (verdict), "法规" (regulation), "条例" (ordinance), "最高法" (Supreme People's Court), or "最高检" (Supreme People's Procuratorate)
Content features: the text includes case numbers, court names, or citations of legal provisions
Structural features: the article follows the typical structure of a legal case write-up (basic facts, judgment result, significance, and so on)
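
The title-keyword signal could be sketched roughly as below. This is an illustration only: the skill leaves the judgment to the AI rather than to a fixed rule, and the function name `looksLikeLegalTitle` and the exact keyword list are assumptions based on the examples given in this section.

```javascript
// Hypothetical helper, not part of scripts/fetch.js.
// Keywords stay in Chinese because WeChat article titles are Chinese.
const LEGAL_TITLE_KEYWORDS = ["案例", "裁判", "判决", "法规", "条例", "最高法", "最高检"];

// Returns true when the title contains at least one listed keyword.
function looksLikeLegalTitle(title) {
  return LEGAL_TITLE_KEYWORDS.some((keyword) => title.includes(keyword));
}

console.log(looksLikeLegalTitle("最高法发布典型案例")); // true
console.log(looksLikeLegalTitle("周末徒步路线推荐")); // false
```

A keyword match alone is a weak signal, which is presumably why the content and structural features are considered as well.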

Default Save Locations

No path specified: saves to the project root directory
Relative path: resolved against the project root directory
Absolute path: used as given

Examples:

bash

# Save to the project root directory
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx"

# Save to a specified directory
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/"

# Save to a specified file
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/case.md"
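
One way to read the three save-location rules is as a small path-resolution function. The sketch below is an assumption about how such logic might look, not the code in `scripts/fetch.js`; `resolveSavePath`, its `projectRoot` parameter, and the trailing-separator test for directories are all hypothetical.

```javascript
const path = require("path");

// Hypothetical helper illustrating the save-location rules.
function resolveSavePath(projectRoot, title, savePath) {
  const fileName = `${title}.md`;
  // No path specified: save into the project root.
  if (!savePath) return path.join(projectRoot, fileName);
  // path.resolve keeps absolute paths as given and resolves
  // relative ones against projectRoot.
  const target = path.resolve(projectRoot, savePath);
  // Assumption: a trailing separator means "this is a directory".
  const isDirectory = /[\\/]$/.test(savePath);
  return isDirectory ? path.join(target, fileName) : target;
}

console.log(resolveSavePath("/work/project", "My Article"));
console.log(resolveSavePath("/work/project", "My Article", "./articles/"));
console.log(resolveSavePath("/work/project", "My Article", "./articles/case.md"));
```

Checking for a trailing separator avoids touching the file system, at the cost of treating `./articles` (no slash) as a file name.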

Usage

Calling from Claude Code

javascript

// Scrape an article (returns the result only)
const result = await fetchWechatArticle("https://mp.weixin.qq.com/s/xxxxx");

// Scrape an article and save it as a Markdown file
const saved = await fetchWechatArticle(
  "https://mp.weixin.qq.com/s/xxxxx",
  3,            // retry count (optional)
  "./output.md" // save path (optional)
);

// Return format
// {
//   title: "Article Title",
//   content: "Article main text...",
//   url: "Article URL"
// }

Command Line Usage

bash

# Basic usage (prints to the console only)
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx"

# Save to a specified file
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/my-article.md"

# Save to a directory (the article title becomes the file name)
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/"

Output Format

Console Output

text

Title: Article Title

First paragraph of the article...

Second paragraph of the article...

Markdown File Format

markdown

# Article Title

Original URL: https://mp.weixin.qq.com/s/xxxxx
Scraped Time: 2026-01-21 20:30:00

First paragraph of the article...

Second paragraph of the article...
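
A formatter that produces a file of this shape might look like the sketch below. It is not taken from `scripts/fetch.js`: `toMarkdown` and `formatTimestamp` are hypothetical names, and the `#` heading marker and the use of local time for the timestamp are assumptions.

```javascript
// Hypothetical formatter; the real script may lay out the header differently.
function pad(n) {
  return String(n).padStart(2, "0");
}

// Formats a Date as "YYYY-MM-DD HH:mm:ss" in local time.
function formatTimestamp(date) {
  const day = `${date.getFullYear()}-${pad(date.getMonth() + 1)}-${pad(date.getDate())}`;
  const time = `${pad(date.getHours())}:${pad(date.getMinutes())}:${pad(date.getSeconds())}`;
  return `${day} ${time}`;
}

// Builds the Markdown text from the { title, content, url } result object.
function toMarkdown(article, scrapedAt = new Date()) {
  return [
    `# ${article.title}`,
    "",
    `Original URL: ${article.url}`,
    `Scraped Time: ${formatTimestamp(scrapedAt)}`,
    "",
    article.content,
    "",
  ].join("\n");
}

console.log(
  toMarkdown(
    {
      title: "Article Title",
      content: "First paragraph...\n\nSecond paragraph...",
      url: "https://mp.weixin.qq.com/s/xxxxx",
    },
    new Date(2026, 0, 21, 20, 30, 0)
  )
);
```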

File Structure

When an article contains images, the following file structure is generated automatically:

Output Directory/
├── Article_Title.md              # Markdown file
└── Article_Title_assets/         # Image resource folder
    ├── image_xxx_0.jpg
    ├── image_xxx_1.jpg
    └── ...
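
The layout can be derived from the output directory and the article title alone. The following is a minimal sketch under that assumption; `prepareOutput` is a hypothetical name, and the demo writes into a temporary directory rather than a real project.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: computes both paths and creates the assets folder.
function prepareOutput(outputDir, title) {
  const markdownPath = path.join(outputDir, `${title}.md`);
  const assetsDir = path.join(outputDir, `${title}_assets`);
  // recursive: true creates missing parents and does nothing if the folder exists.
  fs.mkdirSync(assetsDir, { recursive: true });
  return { markdownPath, assetsDir };
}

// Demo in a throwaway temp directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "wechat-demo-"));
const layout = prepareOutput(demoDir, "Article_Title");
console.log(layout.markdownPath);
console.log(layout.assetsDir);
```

Keeping the assets folder next to the Markdown file means relative image links keep working when the pair is moved together.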

Image Filtering

Smart image filtering is enabled by default. It skips decorative images smaller than 15 KB (such as social media buttons and emojis).

You can change the filter settings in `scripts/fetch.js`:

javascript

const IMAGE_FILTER_CONFIG = {
  minFileSize: 15 * 1024,  // Minimum file size (bytes)
  enabled: true            // Whether filtering is enabled
};
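
To show how a configuration of this shape might be applied, here is a small sketch. The `shouldKeepImage` helper is hypothetical, and applying the check to a downloaded image buffer is an assumption; the actual script may measure size differently.

```javascript
// Same shape as the configuration in scripts/fetch.js.
const IMAGE_FILTER_CONFIG = {
  minFileSize: 15 * 1024, // minimum file size in bytes
  enabled: true,          // set to false to keep every image
};

// Hypothetical helper: decides whether a downloaded image is kept.
function shouldKeepImage(buffer, config = IMAGE_FILTER_CONFIG) {
  if (!config.enabled) return true;
  return buffer.length >= config.minFileSize;
}

console.log(shouldKeepImage(Buffer.alloc(2 * 1024)));  // false
console.log(shouldKeepImage(Buffer.alloc(40 * 1024))); // true
```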

Technical Implementation

Dependency Requirements

Playwright (`npx playwright install chromium`)
Node.js >= 14.0.0

Scraping Process

Detect Playwright and install it if needed
Launch a headless Playwright browser
Set anti-detection parameters (User-Agent, hiding the webdriver flag, etc.)
Navigate to the target URL and wait for network idle
Scroll the page to trigger lazy loading
Extract the `#js_content` or `.rich_media_content` region
Strip HTML tags while keeping paragraph structure
Return the title and plain-text content
If a save path was given, save the result as a Markdown file
If headless mode fails, fall back to headed mode and retry
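
The core of these steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `scripts/fetch.js`: the names `htmlToText`, `scrapeArticle`, and `DESKTOP_USER_AGENT` are hypothetical, the User-Agent string is a placeholder, the one-second wait after scrolling is arbitrary, and the title is read from the page title for simplicity.

```javascript
// Placeholder User-Agent (an assumption, not the value the skill uses).
const DESKTOP_USER_AGENT =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

// Pure helper: turns article HTML into plain text with paragraph breaks.
function htmlToText(html) {
  return html
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, "") // drop scripts and styles
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<\/(p|div|section|h[1-6]|li)>/gi, "\n\n") // block end becomes a paragraph break
    .replace(/<[^>]+>/g, "") // strip remaining tags
    .replace(/&nbsp;/g, " ") // decode a few common entities
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&")
    .replace(/[ \t]+\n/g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

async function scrapeArticle(url, { headless = true } = {}) {
  // Loaded lazily so htmlToText can be used without Playwright installed.
  const { chromium } = require("playwright");
  const browser = await chromium.launch({ headless });
  try {
    const context = await browser.newContext({ userAgent: DESKTOP_USER_AGENT });
    // Hide the webdriver flag before any page script runs.
    await context.addInitScript(() => {
      Object.defineProperty(navigator, "webdriver", { get: () => undefined });
    });
    const page = await context.newPage();
    await page.goto(url, { waitUntil: "networkidle", timeout: 30000 });
    // Scroll to the bottom so lazy-loaded images start loading.
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(1000);
    const title = await page.title();
    const html = await page
      .locator("#js_content, .rich_media_content")
      .first()
      .innerHTML();
    return { title, content: htmlToText(html), url };
  } finally {
    await browser.close();
  }
}

console.log(htmlToText("<p>First <b>paragraph</b></p><p>Second paragraph</p>"));
```

Regex-based tag stripping is enough for a sketch; a production cleaner would more likely walk the DOM inside the page.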

Error Handling

Retries up to 3 times, waiting 3 seconds after each failure
Falls back to headed mode when headless mode fails
Detects error pages (parameter errors, access anomalies)
Times out after 30 seconds
Handles Windows specifics (paths, command formats)
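
The retry and fallback behaviour might be structured like the sketch below. The details are assumptions: `withRetry`, `isErrorPage`, and `ERROR_MARKERS` are hypothetical names, giving each mode its own set of attempts is one possible reading of the rules, the wait is skipped after the last attempt in a mode, and the literal marker strings are guesses at what an error page contains.

```javascript
// Assumed marker strings for WeChat error pages.
const ERROR_MARKERS = ["参数错误", "访问异常"];

// Returns true when the page text looks like an error page.
function isErrorPage(pageText) {
  return ERROR_MARKERS.some((marker) => pageText.includes(marker));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: `attempt` is any async function taking { headless }.
async function withRetry(attempt, { retries = 3, delayMs = 3000 } = {}) {
  let lastError;
  // Try headless first, then fall back to headed mode.
  for (const headless of [true, false]) {
    for (let i = 1; i <= retries; i++) {
      try {
        return await attempt({ headless });
      } catch (error) {
        lastError = error;
        // Wait before the next attempt in the same mode.
        if (i < retries) await sleep(delayMs);
      }
    }
  }
  throw lastError;
}
```

A caller would pass a function that scrapes once and throws when `isErrorPage` returns true for the page text, so that error pages count as failed attempts.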

Cross-Platform Compatibility

Windows: detects the platform and runs npx commands through `cmd.exe`
macOS/Linux: runs npx commands directly
Path handling: normalizes path separators automatically
File name handling: removes characters that are illegal on Windows
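
Two of these points lend themselves to a short sketch: choosing how to launch npx, and cleaning a title into a safe file name. Both helpers below are hypothetical (`npxCommand`, `sanitizeFileName`), and the `"untitled"` fallback is an assumption; the character set is the list of characters Windows forbids in file names.

```javascript
// Characters Windows forbids in file names: < > : " / \ | ? * and control characters.
const WINDOWS_ILLEGAL_CHARS = /[<>:"\/\\|?*\x00-\x1f]/g;

// Hypothetical helper: makes an article title safe to use as a file name.
function sanitizeFileName(title, fallback = "untitled") {
  const cleaned = title
    .replace(WINDOWS_ILLEGAL_CHARS, "")
    .replace(/[. ]+$/, "") // Windows also rejects trailing dots and spaces
    .trim();
  return cleaned || fallback;
}

// Hypothetical helper: builds a command and argument list suitable for
// child_process.spawn, going through cmd.exe on Windows.
function npxCommand(args, platform = process.platform) {
  return platform === "win32"
    ? { command: "cmd.exe", args: ["/c", "npx", ...args] }
    : { command: "npx", args };
}

console.log(sanitizeFileName('What is "AI"? A/B test: results'));
console.log(npxCommand(["playwright", "install", "chromium"]));
```

Reserved device names such as `CON` or `NUL` are not handled here and would need an extra check.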

Applicable Scenarios

Input source for content conversion tools
Article analysis and processing
Automated content scraping
Batch article downloading
Article archiving and local storage
Conversion to Markdown
Automatic formatting of legal documents (when legal content is detected)
Complete saving of illustrated articles (offline archiving with images)
Image resource management (downloading and organizing the images in an article)

Usage Examples

Example 1: Batch Scraping and Saving

javascript

// Assumes fetchWechatArticle is already in scope.
const urls = [
  "https://mp.weixin.qq.com/s/xxxx1",
  "https://mp.weixin.qq.com/s/xxxx2",
  "https://mp.weixin.qq.com/s/xxxx3"
];

// Wrapped in an async function so that await is valid in a plain script.
async function main() {
  for (const url of urls) {
    const result = await fetchWechatArticle(url, 3, "./articles/");
    console.log(`Saved: ${result.title}`);
  }
}

main();

Example 2: Direct Use in Claude Code

text

Please scrape this WeChat Official Account article and save it as a Markdown file:
https://mp.weixin.qq.com/s/xxxxx

Notes

⚠️ For personal study and research only; comply with the website's terms of service
⚠️ Frequent scraping may trigger rate limiting; keep the request rate low
⚠️ Copyright in scraped content remains with the original author
⚠️ Headed mode opens a browser window, which may interrupt your workflow
⚠️ Windows users need Playwright installed before first use (the tool installs it automatically)