WeChat Official Account Article Scraper
Overview
Scrapes WeChat Official Account articles with Playwright. It runs in the background without opening a browser window, handles dynamically loaded content, extracts clean article text, and can save the result as a Markdown file.
Features
- ✅ Headless Mode Execution: Scrapes in the background by default, without opening a browser window
- ✅ Smart Fallback Mechanism: Automatically switches to headed mode when headless mode fails
- ✅ Dynamic Content Support: Waits for the page to finish loading and handles lazy-loaded images
- ✅ Auto-save as Markdown: Saves scraping results as formatted Markdown files
- ✅ Content Cleaning: Removes HTML tags, retains paragraph structure, and outputs plain text
- ✅ Automatic Retry: Retries up to 3 times on failure to improve the success rate
- ✅ Error Detection: Identifies abnormal pages such as "parameter error"
- ✅ Cross-platform Support: Works on Windows, macOS, and Linux
- ✅ Smart Workflow: Automatically calls the formatting skill when legal content is detected
- ✅ Image Download: Automatically downloads the images in an article to a local folder
- ✅ Smart Image Filtering: Automatically filters out small decorative images (such as social media buttons and emojis)
- ✅ Image Position Preservation: Retains the position of images in the original document
- ✅ Auto File Naming: Generates file names and resource folders from article titles
Collaboration with Other Skills
Smart Workflow
This skill focuses on article scraping and stays general-purpose. The AI automatically calls the legal-text-format skill for formatting only when legal-related content is detected.
AI Execution Flow:
text
User Request → wechat-article-fetch scraping → [Judge Content Type]
                                                        ↓
                        ┌───────────────────────────────┴───────────────────────────────┐
                        ↓                                                               ↓
             Detected Legal Content                                             Ordinary Article
                        ↓                                                               ↓
      Automatically call legal-text-format                       Save original content to project root directory
                        ↓
          Output to archive/ directory
Legal Content Detection
The AI decides whether an article is legal content based on the following features:
- Title Keywords: Contains "case", "judgment", "verdict", "regulation", "rule", "Supreme People's Court", "Supreme People's Procuratorate", etc.
- Content Features: Contains case numbers, court names, legal article citations, etc.
- Structural Features: Conforms to the typical structure of legal cases (basic facts, judgment results, typical significance, etc.)
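The features above are judged by the AI, not by a fixed rule. As a rough illustration only, a keyword-based approximation might look like the sketch below. The function name `looksLikeLegalContent`, both constant lists, and the threshold of two structural markers are assumptions made for this example, and plain substring matching will produce false positives (for instance, "showcase" contains "case").

```javascript
// Hypothetical keyword list, copied from the title keywords named above.
const LEGAL_TITLE_KEYWORDS = [
  "case", "judgment", "verdict", "regulation", "rule",
  "Supreme People's Court", "Supreme People's Procuratorate",
];

// Hypothetical structural markers, taken from the "typical structure" bullet.
const LEGAL_SECTION_MARKERS = ["basic facts", "judgment results", "typical significance"];

// Returns true when the title or the body text looks like legal content.
function looksLikeLegalContent(title, content) {
  const lowerTitle = title.toLowerCase();
  const lowerContent = content.toLowerCase();
  const titleHit = LEGAL_TITLE_KEYWORDS.some((k) => lowerTitle.includes(k.toLowerCase()));
  const sectionHits = LEGAL_SECTION_MARKERS.filter((m) => lowerContent.includes(m)).length;
  // Assumption: one title keyword, or two structural markers, is enough.
  return titleHit || sectionHits >= 2;
}
```

A real implementation would also need to look for case numbers, court names, and legal citations, which this sketch leaves out.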
Default Save Location
- No Path Specified: Save to project root directory
- Specified Relative Path: Relative to project root directory
- Specified Absolute Path: Use the specified full path
Examples:
bash
# Save to project root directory
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx"
# Save to specified directory
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/"
# Save to specified file
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/case.md"
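The three path rules can be sketched with Node's built-in `path` module. The helper names `resolveSavePath` and `targetFile` are hypothetical, and treating any path that does not end in `.md` as a directory is an assumption of this example, not something the script is confirmed to do.

```javascript
const path = require("path");

// Hypothetical helper: map the optional save-path argument to an absolute path.
function resolveSavePath(savePath, projectRoot) {
  if (!savePath) return projectRoot;                               // no path: project root
  if (path.isAbsolute(savePath)) return path.normalize(savePath);  // absolute: use as given
  return path.resolve(projectRoot, savePath);                      // relative: against project root
}

// Hypothetical helper: decide the final file name.
// Assumption: a target ending in ".md" is a file, anything else is a directory.
function targetFile(resolvedPath, title) {
  return resolvedPath.toLowerCase().endsWith(".md")
    ? resolvedPath
    : path.join(resolvedPath, `${title}.md`);
}
```

For example, `targetFile(resolveSavePath("./articles/", root), "My Article")` yields `My Article.md` inside the `articles` folder under `root`.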
Usage
Call in Claude Code
javascript
// Scrape article (return results only)
const result = await fetchWechatArticle("https://mp.weixin.qq.com/s/xxxxx");
// Scrape article and automatically save as Markdown file
const saved = await fetchWechatArticle(
  "https://mp.weixin.qq.com/s/xxxxx",
  3, // Retry count (optional)
  "./output.md" // Save path (optional)
);
// Return format:
// {
//   title: "Article Title",
//   content: "Article main text...",
//   url: "Article URL"
// }
Command Line Call
bash
# Basic usage (output to console only)
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx"
# Save to specified file
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/my-article.md"
# Save to directory (automatically use article title as file name)
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/"
Output Format
Console Output
text
Title: Article Title
First paragraph of article main text...
Second paragraph of article main text...
Markdown File Format
markdown
# Article Title
> Original URL: https://mp.weixin.qq.com/s/xxxxx
> Scraping Time: 2026-01-21 20:30:00
---
First paragraph of article main text...

Second paragraph of article main text...
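A minimal sketch of how the result object could be rendered into this layout follows. The names `renderMarkdown` and `formatTimestamp` are hypothetical, and the placement of blank lines between the sections is an assumption, since the sample above does not show it unambiguously.

```javascript
// Zero-pad a number to two digits.
function pad(n) {
  return String(n).padStart(2, "0");
}

// Format a Date as "YYYY-MM-DD HH:mm:ss" in local time.
function formatTimestamp(date) {
  const day = `${date.getFullYear()}-${pad(date.getMonth() + 1)}-${pad(date.getDate())}`;
  const time = `${pad(date.getHours())}:${pad(date.getMinutes())}:${pad(date.getSeconds())}`;
  return `${day} ${time}`;
}

// Hypothetical renderer for the { title, content, url } result object.
function renderMarkdown(article, scrapedAt = new Date()) {
  return [
    `# ${article.title}`,
    "",
    `> Original URL: ${article.url}`,
    `> Scraping Time: ${formatTimestamp(scrapedAt)}`,
    "",
    "---",
    "",
    article.content,
    "",
  ].join("\n");
}
```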
File Structure
When articles contain images, the following file structure will be automatically generated:
Output Directory/
├── Article_Title.md # Markdown file
└── Article_Title_assets/ # Image resource folder
├── image_xxx_0.jpg
├── image_xxx_1.jpg
└── ...
Image Filtering
Smart image filtering is enabled by default. It drops decorative images smaller than 15KB, such as social media buttons and emojis.
The filtering configuration can be modified in the script source:
javascript
const IMAGE_FILTER_CONFIG = {
minFileSize: 15 * 1024, // Minimum file size (bytes)
enabled: true // Whether to enable filtering
};
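One way such a configuration might be applied to a downloaded image is sketched below. The function `shouldKeepImage` is hypothetical, and the configuration object is repeated only so that the example is self-contained.

```javascript
const IMAGE_FILTER_CONFIG = {
  minFileSize: 15 * 1024, // Minimum file size (bytes)
  enabled: true,          // Whether to enable filtering
};

// Hypothetical check: keep an image unless filtering is on and it is too small.
function shouldKeepImage(imageBuffer, config = IMAGE_FILTER_CONFIG) {
  if (!config.enabled) return true;
  return imageBuffer.length >= config.minFileSize;
}
```

With the defaults, a 1KB icon is dropped while a 15KB image is kept. Setting `enabled` to `false` keeps everything.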
Technical Implementation
Dependency Requirements
- Playwright (install the browser with npx playwright install chromium)
- Node.js >= 14.0.0
Scraping Process
- Detect and install Playwright (if needed)
- Launch Playwright headless browser
- Set anti-detection parameters (User-Agent, webdriver hiding, etc.)
- Navigate to target URL, wait for network idle
- Scroll page to trigger lazy loading
- Extract the article content area (either of two candidate containers)
- Clean up HTML tags, retain paragraph structure
- Return title and plain text content
- Automatically save as Markdown file if save path is specified
- Automatically fall back to headed mode for retry if headless mode fails
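The core of the process above (launch, anti-detection, navigation, scrolling, extraction, cleaning) can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the names `scrapeOnce` and `htmlToPlainText` are hypothetical, the `#js_content` selector and the User-Agent string are assumptions, and installation, saving, and the headed fallback are left out.

```javascript
// Assumption: the article body lives in this container; adjust as needed.
const CONTENT_SELECTOR = "#js_content";
// Placeholder desktop User-Agent (an assumption, not taken from the script).
const USER_AGENT =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

// Strip HTML tags while keeping paragraph breaks.
function htmlToPlainText(html) {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, "")       // drop script/style blocks
    .replace(/<br\s*\/?>/gi, "\n")                        // line breaks
    .replace(/<\/(p|div|section|h[1-6]|li)>/gi, "\n\n")   // block ends become blank lines
    .replace(/<[^>]+>/g, "")                              // remaining tags
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&")
    .replace(/[ \t]+\n/g, "\n")                           // trailing spaces
    .replace(/\n{3,}/g, "\n\n")                           // collapse extra blank lines
    .trim();
}

async function scrapeOnce(url, { headless = true } = {}) {
  const { chromium } = require("playwright"); // loaded lazily, only when scraping
  const browser = await chromium.launch({ headless });
  try {
    const context = await browser.newContext({ userAgent: USER_AGENT });
    // Hide the webdriver flag before any page script runs.
    await context.addInitScript(() => {
      Object.defineProperty(navigator, "webdriver", { get: () => undefined });
    });
    const page = await context.newPage();
    await page.goto(url, { waitUntil: "networkidle", timeout: 30000 });
    // Scroll down in steps so lazy-loaded images are requested.
    await page.evaluate(async () => {
      for (let y = 0; y < document.body.scrollHeight; y += 800) {
        window.scrollTo(0, y);
        await new Promise((resolve) => setTimeout(resolve, 100));
      }
    });
    const title = (await page.title()).trim();
    const html = await page.locator(CONTENT_SELECTOR).first().innerHTML();
    return { title, content: htmlToPlainText(html), url };
  } finally {
    await browser.close(); // always release the browser
  }
}
```

Regex-based tag stripping is enough for a sketch. A DOM-based extraction inside `page.evaluate` would be more robust against unusual markup.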
Error Handling
- Automatically retry up to 3 times, waiting 3 seconds after each failure
- Automatically fall back to headed mode if headless mode fails
- Detect error pages (parameter error, access exception)
- Timeout set to 30 seconds
- Special handling for Windows platform (paths, command formats)
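The retry and fallback rules can be combined in one small loop. In this sketch, `fetchWithRetry` is a hypothetical name and `attempt` is a caller-supplied async function that receives the headless flag. Giving headed mode the same number of retries as headless mode is an assumption; the list above does not say how many headed attempts are made.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Try headless mode first, then fall back to headed mode.
// Each mode gets `retries` attempts, with `delayMs` of waiting after every failure.
async function fetchWithRetry(attempt, { retries = 3, delayMs = 3000 } = {}) {
  let lastError;
  for (const headless of [true, false]) {
    for (let i = 1; i <= retries; i++) {
      try {
        return await attempt(headless);
      } catch (err) {
        lastError = err;      // remember the most recent failure
        await sleep(delayMs); // wait before the next attempt
      }
    }
  }
  throw lastError; // every attempt in both modes failed
}
```

A caller would pass something like `(headless) => scrape(url, { headless })`, where `scrape` stands for whatever function performs a single scraping attempt.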
Cross-Platform Compatibility
- Windows: Automatically detect the platform and use the Windows-specific command form to run npx commands
- macOS/Linux: Directly use npx commands
- Path Handling: Automatically normalize path separators
- File Name Handling: Automatically remove Windows illegal characters
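Two of these points are easy to illustrate. In the sketch below, `sanitizeFileName` and `npxCommand` are hypothetical names, replacing illegal characters with `_` is an assumption, and using `npx.cmd` on Windows is one common approach rather than a confirmed detail of the script.

```javascript
// Replace characters that Windows forbids in file names: < > : " / \ | ? *
// and control characters. Trailing dots and spaces are also rejected by Windows.
function sanitizeFileName(title) {
  return title
    .replace(/[<>:"/\\|?*\x00-\x1f]/g, "_")
    .replace(/[. ]+$/, "")
    .trim() || "untitled"; // fall back when nothing usable is left
}

// Assumption: npx is exposed as "npx.cmd" on Windows and plain "npx" elsewhere.
function npxCommand(platform = process.platform) {
  return platform === "win32" ? "npx.cmd" : "npx";
}
```

Applying the same sanitizing on every platform keeps generated file names portable when an archive is moved between machines.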
Application Scenarios
- Input source for content conversion tools
- Article analysis and processing
- Automated content scraping
- Batch article downloading
- Article archiving and local saving
- Markdown format conversion
- Legal document automatic formatting (when legal content is detected)
- Complete saving of articles with images (offline archiving including images)
- Image resource management (automatically download and organize images in articles)
Usage Examples
Example 1: Batch Scraping and Saving
javascript
const urls = [
"https://mp.weixin.qq.com/s/xxxx1",
"https://mp.weixin.qq.com/s/xxxx2",
"https://mp.weixin.qq.com/s/xxxx3"
];
for (const url of urls) {
const result = await fetchWechatArticle(url, 3, "./articles/");
console.log(`Saved: ${result.title}`);
}
Example 2: Direct Use in Claude Code
text
Please help me scrape this WeChat Official Account article and save it as a Markdown file:
https://mp.weixin.qq.com/s/xxxxx
Notes
⚠️ For personal study and research only. Please comply with the website's terms of service.
⚠️ Frequent scraping may trigger rate limiting. Keep the request frequency under control.
⚠️ The copyright of scraped content belongs to the original author.
⚠️ Headed mode opens a browser window, which may interrupt your workflow.
⚠️ Windows users need Playwright installed before first use (the script installs it automatically).