WeChat Official Account Article Scraper
Overview
Uses Playwright to scrape WeChat Official Account articles. The scraper runs in the background without opening a browser window, handles dynamically loaded content, extracts clean article text, and can save the result as a Markdown file.
Features
- ✅ Headless mode operation: By default, scrape in the background without popping up a browser window
- ✅ Smart fallback mechanism: Automatically switch to headed mode when headless mode fails
- ✅ Dynamic content support: Automatically wait for page loading completion and handle lazy-loaded images
- ✅ Auto-save as Markdown: Support saving scraping results as formatted Markdown files
- ✅ Content cleaning: Remove HTML tags, retain paragraph structure, and output plain text
- ✅ Auto-retry: Automatically retry 3 times on failure to improve success rate
- ✅ Error detection: Identify abnormal pages such as "parameter error"
- ✅ Cross-platform support: Fully compatible with Windows, macOS, and Linux
- ✅ Smart workflow: Automatically call formatting skills when detecting legal content
- ✅ Image downloading: Automatically download all images in the article to local storage
- ✅ Smart image filtering: Automatically filter small decorative images (such as social media buttons, emojis)
- ✅ Image position retention: Keep images in their original positions in the document
- ✅ Auto file naming: Generate file names and resource folders based on article titles
Collaboration with Other Skills
Smart Workflow
This skill focuses on article scraping and stays general-purpose. The AI automatically calls the legal-text-format skill for formatting only when legal-related content is detected.
AI Execution Flow:
text
User Request → wechat-article-fetch scraping → [Judge Content Type]
                                                         ↓
                                ┌────────────────────────┴────────────────────────┐
                                ↓                                                 ↓
                     Legal Content Detected                               Regular Articles
                                ↓                                                 ↓
                   Auto-call legal-text-format             Save original content to project root directory
                                ↓
                  Output to archive/ directory
Legal Content Detection
The AI determines whether content is legal-related based on the following features:
- Title keywords: Contains terms like "case", "judgment", "verdict", "regulation", "rule", "Supreme People's Court", "Supreme People's Procuratorate", etc.
- Content features: Includes case numbers, court names, citations of legal provisions, etc.
- Structural features: Conforms to the typical structure of legal cases (basic facts, judgment results, typical significance, etc.)
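The decision above is made by the AI, not by fixed rules. Purely as an illustration, the title-keyword and content-feature checks could be approximated with a heuristic like the one below. The function name `looksLikeLegalContent`, the regular expressions, and the scoring rule are all assumptions and not part of the skill; the keyword list reuses the English terms listed above.

```javascript
// Hypothetical heuristic; the real decision is made by the AI, not by this code.
const TITLE_KEYWORDS = [
  "case", "judgment", "verdict", "regulation", "rule",
  "Supreme People's Court", "Supreme People's Procuratorate",
];

// Assumed patterns for the content features: case numbers, court names, provision citations.
const CONTENT_PATTERNS = [
  /\(\d{4}\)[^.\n]{1,40}?No\.\s*\d+/i, // e.g. "(2023) Jing 01 Min Zhong No. 123"
  /People's Court/i,
  /\bArticle\s+\d+/i,
];

function looksLikeLegalContent(title, content) {
  const lowerTitle = title.toLowerCase();
  const titleHit = TITLE_KEYWORDS.some((k) => lowerTitle.includes(k.toLowerCase()));
  const contentHits = CONTENT_PATTERNS.filter((re) => re.test(content)).length;
  // Assumed rule: a title keyword needs at least one supporting content feature;
  // without a title keyword, require two content features.
  return (titleHit && contentHits >= 1) || contentHits >= 2;
}
```

Plain substring matching will produce false positives (for example "case" inside "showcase"), which is one reason the skill leaves the judgment to the AI.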
Default Save Locations
- No path specified: Save to project root directory
- Specified relative path: Relative to the project root directory
- Specified absolute path: Use the specified full path
Examples:
bash
# Save to project root directory
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx"
# Save to specified directory
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/"
# Save to specified file
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/case.md"
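The three save-location rules can be sketched with Node's `path` module. This is a minimal sketch, not the script's actual code: `resolveSavePath` and its parameters are hypothetical names, and treating a trailing separator as "this is a directory" is an assumption based on the examples above.

```javascript
const path = require("path");

// Hypothetical helper: maps the optional save argument to a final file path.
function resolveSavePath(projectRoot, savePath, title) {
  const fileName = `${title}.md`;
  // No path specified: save to the project root directory.
  if (!savePath) return path.join(projectRoot, fileName);
  // path.resolve keeps absolute paths as-is and resolves relative ones against projectRoot.
  const resolved = path.resolve(projectRoot, savePath);
  // Assumption: a trailing separator means "directory; name the file after the title".
  const isDirectory = /[\\/]$/.test(savePath);
  return isDirectory ? path.join(resolved, fileName) : resolved;
}
```

For example, `resolveSavePath(root, "./articles/", "My Article")` yields a path ending in `articles/My Article.md` under `root`.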
Usage
Call in Claude Code
javascript
// Scrape article (return results only)
const result = await fetchWechatArticle("https://mp.weixin.qq.com/s/xxxxx");
// Scrape article and auto-save as Markdown file
const saved = await fetchWechatArticle(
  "https://mp.weixin.qq.com/s/xxxxx",
  3,             // Retry count (optional)
  "./output.md"  // Save path (optional)
);
// Return format:
// {
//   title: "Article Title",
//   content: "Article main text...",
//   url: "Article URL"
// }
Command Line Call
bash
# Basic usage (output to console only)
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx"
# Save to specified file
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/my-article.md"
# Save to directory (automatically use article title as file name)
node scripts/fetch.js "https://mp.weixin.qq.com/s/xxxxx" "./articles/"
Output Format
Console Output
text
Title: Article Title
First paragraph of article main text...
Second paragraph of article main text...
Markdown File Format
markdown
# Article Title
> Original URL: https://mp.weixin.qq.com/s/xxxxx
> Scraped Time: 2026-01-21 20:30:00
---
First paragraph of article main text...

Second paragraph of article main text...
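A file in this layout could be produced from the returned `{ title, content, url }` object roughly as follows. The helper names `formatTimestamp` and `renderMarkdown` are hypothetical, and the placement of blank lines around the header is an assumption, since the sample above does not show it unambiguously.

```javascript
// Formats a Date as "YYYY-MM-DD HH:mm:ss" in local time.
function formatTimestamp(date) {
  const pad = (n) => String(n).padStart(2, "0");
  return (
    `${date.getFullYear()}-${pad(date.getMonth() + 1)}-${pad(date.getDate())} ` +
    `${pad(date.getHours())}:${pad(date.getMinutes())}:${pad(date.getSeconds())}`
  );
}

// Hypothetical renderer: builds the Markdown document from a scraped article.
function renderMarkdown(article, scrapedAt = new Date()) {
  // Treat each non-empty line of the plain-text content as one paragraph.
  const paragraphs = article.content
    .split(/\n+/)
    .map((p) => p.trim())
    .filter(Boolean);
  return [
    `# ${article.title}`,
    "",
    `> Original URL: ${article.url}`,
    `> Scraped Time: ${formatTimestamp(scrapedAt)}`,
    "",
    "---",
    "",
    paragraphs.join("\n\n"),
    "",
  ].join("\n");
}
```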
File Structure
When an article contains images, the following file structure will be automatically generated:
Output Directory/
├── Article_Title.md          # Markdown file
└── Article_Title_Assets/     # Image resource folder
    ├── image_xxx_0.jpg
    ├── image_xxx_1.jpg
    └── ...
Image Filtering
Smart image filtering is enabled by default. It filters out decorative images smaller than 15 KB (such as social media buttons and emojis).
You can modify the filtering configuration:
javascript
const IMAGE_FILTER_CONFIG = {
  minFileSize: 15 * 1024, // Minimum file size (bytes)
  enabled: true           // Whether to enable filtering
};
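How such a configuration might be applied to a downloaded image is sketched below. `shouldKeepImage` is a hypothetical helper, and checking the size after download (rather than from response headers) is an assumption.

```javascript
const IMAGE_FILTER_CONFIG = {
  minFileSize: 15 * 1024, // Minimum file size (bytes)
  enabled: true,          // Whether to enable filtering
};

// Hypothetical helper: decide whether an image of the given size should be kept.
function shouldKeepImage(sizeInBytes, config = IMAGE_FILTER_CONFIG) {
  if (!config.enabled) return true;         // filtering disabled: keep everything
  return sizeInBytes >= config.minFileSize; // drop images below the threshold
}
```

With the defaults, a 2 KB icon is dropped and a 200 KB photo is kept; setting `enabled: false` keeps every image.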
Technical Implementation
Dependency Requirements
- Playwright (npx playwright install chromium)
- Node.js >= 14.0.0
Scraping Process
- Detect and install Playwright (if needed)
- Launch Playwright headless browser
- Set anti-detection parameters (User-Agent, webdriver hiding, etc.)
- Navigate to the target URL and wait for network idle
- Scroll the page to trigger lazy loading
- Extract the article content area
- Clean up HTML tags and retain paragraph structure
- Return title and plain text content
- Automatically save as Markdown file if a save path is specified
- Automatically fall back to headed mode for retry if headless mode fails
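The cleaning step in the process above (strip tags, keep paragraph breaks, return plain text) can be illustrated with a small regex-based function. This is only a sketch under stated assumptions: `htmlToPlainText` is a hypothetical name, the list of block-level tags is a guess, and the actual script may well work on the DOM inside the browser instead.

```javascript
// Hypothetical, regex-based cleaner; a real implementation may rely on the DOM instead.
function htmlToPlainText(html) {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, "")               // drop scripts and styles
    .replace(/<br\s*\/?>/gi, "\n")                                // line breaks
    .replace(/<\/(p|div|section|h[1-6]|li|blockquote)>/gi, "\n")  // end of a block = new line
    .replace(/<[^>]+>/g, "")                                      // remove remaining tags
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&")                                       // decode &amp; last
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean)                                              // drop empty lines
    .join("\n\n");                                                // one blank line per paragraph
}
```

Regular expressions are not a general HTML parser, so this approach is only reasonable for the fairly regular markup of an article body.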
Error Handling
- Automatically retry 3 times, waiting 3 seconds after each failure
- Automatically fall back to headed mode if headless mode fails
- Detect error pages (parameter errors, access exceptions)
- Timeout set to 30 seconds
- Special handling for Windows platform (paths, command formats)
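The retry and fallback behavior can be sketched as a wrapper around a single-attempt scraper. The names `fetchWithRetry` and `fetchOnce` are hypothetical, and the exact ordering (all headless attempts first, then one headed attempt) is an assumption; the list above only states that both mechanisms exist.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// fetchOnce(url, { headless }) is a hypothetical single-attempt scraper supplied by the caller.
async function fetchWithRetry(fetchOnce, url, { retries = 3, delayMs = 3000 } = {}) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await fetchOnce(url, { headless: true });
    } catch (err) {
      console.warn(`Headless attempt ${attempt}/${retries} failed: ${err.message}`);
      await sleep(delayMs); // wait before the next attempt
    }
  }
  // Assumed fallback: one final attempt with a visible browser window.
  return fetchOnce(url, { headless: false });
}
```

Passing `fetchOnce` in as a parameter keeps the retry policy separate from the Playwright code, so the policy can be exercised without launching a browser.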
Cross-Platform Compatibility
- Windows: Automatically detect the platform and adjust how npx commands are run
- macOS/Linux: Directly use npx commands
- Path handling: Automatically normalize path separators
- File name handling: Automatically remove illegal characters for Windows
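As an illustration of the file name handling item, the sketch below turns an article title into a name that is also valid on Windows. `sanitizeFileName`, the length limit, and the `"untitled"` fallback are assumptions; replacing spaces with underscores follows the `Article_Title.md` naming shown under File Structure.

```javascript
// Hypothetical helper: make an article title safe to use as a file name on all platforms.
function sanitizeFileName(title, maxLength = 100) {
  const cleaned = title
    .replace(/[<>:"\/\\|?*\x00-\x1f]/g, "") // characters Windows does not allow in file names
    .replace(/\s+/g, "_")                   // whitespace runs become underscores
    .replace(/[._]+$/, "");                 // no trailing dots (rejected by Windows) or underscores
  return cleaned.slice(0, maxLength) || "untitled";
}
```

For separators, Node's built-in `path.normalize` and `path.join` already produce the correct form for the current platform.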
Applicable Scenarios
- Input source for content conversion tools
- Article analysis and processing
- Automated content scraping
- Batch article downloading
- Article archiving and local saving
- Markdown format conversion
- Automatic formatting of legal documents (when legal content is detected)
- Complete saving of articles with images (offline archiving including images)
- Image resource management (automatically download and organize images in articles)
Usage Examples
Example 1: Batch Scraping and Saving
javascript
const urls = [
  "https://mp.weixin.qq.com/s/xxxx1",
  "https://mp.weixin.qq.com/s/xxxx2",
  "https://mp.weixin.qq.com/s/xxxx3"
];
for (const url of urls) {
  const result = await fetchWechatArticle(url, 3, "./articles/");
  console.log(`Saved: ${result.title}`);
}
Example 2: Direct Use in Claude Code
text
Please help me scrape this WeChat Official Account article and save it as a Markdown file:
https://mp.weixin.qq.com/s/xxxxx
Notes
⚠️ For personal study and research only. Please comply with the website's terms of service.
⚠️ Frequent scraping may trigger rate limiting. Keep request frequency under control.
⚠️ The copyright of scraped content belongs to the original author.
⚠️ Headed mode opens a browser window, which may interrupt your workflow.
⚠️ Windows users need Playwright installed before first use (installation is automatic).
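One simple way to control request frequency in batch runs is a fixed pause between requests. The sketch below is an assumption-laden example: `scrapeSequentially` is a hypothetical name, the 5-second default is arbitrary, and the scraping function is passed in as `fetchArticle` so the example does not depend on how `fetchWechatArticle` is exposed.

```javascript
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// fetchArticle is whatever scraping function you use (for example fetchWechatArticle).
async function scrapeSequentially(urls, fetchArticle, delayMs = 5000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await pause(delayMs); // pause between requests, not before the first one
    try {
      results.push({ url: urls[i], article: await fetchArticle(urls[i]) });
    } catch (err) {
      // Record the failure and continue with the remaining URLs.
      results.push({ url: urls[i], error: err.message });
    }
  }
  return results;
}
```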