news-extractor

Original: 🇨🇳 Chinese

Translated

12 scripts

Extracts article content from news sites. Supports WeChat Official Accounts, Toutiao, NetEase News, Sohu News, and Tencent News. Activates when a user needs to extract news content, crawl Official Account articles, scrape news, or obtain news in JSON or Markdown format.

12 installs

Source: nanmicoder/claude-code-skills

Added on 2026-02-08

NPX Install

npx skill4agent add nanmicoder/claude-code-skills news-extractor

SKILL.md Content (Chinese)


News Extractor Skill

Extract article content from mainstream news platforms and output in JSON and Markdown formats.

Supported Platforms

Platform	ID	URL Example
WeChat Official Accounts	wechat	`https://mp.weixin.qq.com/s/xxxxx`
Toutiao	toutiao	`https://www.toutiao.com/article/123456/`
NetEase News	netease	`https://www.163.com/news/article/ABC123.html`
Sohu News	sohu	`https://www.sohu.com/a/123456_789`
Tencent News	tencent	`https://news.qq.com/rain/a/20251016A07W8J00`
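
As a rough illustration of how a URL could be mapped to the platform IDs in the table above, here is a minimal sketch. The host-to-ID mapping is an assumption inferred from the URL examples only, and `PLATFORM_HOSTS` and `detect_platform` are hypothetical names; the skill's actual detection rules may be broader, for example covering mobile subdomains.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumption: one hostname per platform, taken from the URL examples in the table.
PLATFORM_HOSTS = {
    "mp.weixin.qq.com": "wechat",
    "www.toutiao.com": "toutiao",
    "www.163.com": "netease",
    "www.sohu.com": "sohu",
    "news.qq.com": "tencent",
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform ID for a URL, or None if the host is not recognized."""
    host = (urlparse(url).hostname or "").lower()
    return PLATFORM_HOSTS.get(host)


if __name__ == "__main__":
    print(detect_platform("https://mp.weixin.qq.com/s/xxxxx"))
    print(detect_platform("https://example.com/post/1"))
```

Returning `None` instead of raising leaves the caller free to decide how to report an unrecognized URL.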

Dependency Installation

This skill uses uv for dependency management. Install dependencies before first use:

bash

cd ~/.claude/skills/news-extractor
uv sync

Important: All scripts must be executed using `uv run`, not directly with `python`. `uv run` automatically uses dependencies from the project's virtual environment.

Dependency List

Package	Purpose
pydantic	Data model validation
requests	HTTP requests
curl_cffi	Browser simulation crawling
tenacity	Retry mechanism
parsel	HTML/XPath parsing
demjson3	Non-standard JSON parsing

Usage

Basic Usage

bash

# Extract news, auto-detect platform, output JSON + Markdown
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL"

# Specify output directory
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --output ./output

# Output only JSON
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --format json

# Output only Markdown
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --format markdown

# List supported platforms
uv run .claude/skills/news-extractor/scripts/extract_news.py --list-platforms

Output Files

By default, the script writes both formats to the output directory (`./output` unless `--output` is given):

- `{news_id}.json` - Structured JSON data
- `{news_id}.md` - Markdown-formatted article

Workflow

1. Receive URL - User provides a news link
2. Platform Detection - Automatically identify the platform type
3. Content Extraction - Call the corresponding crawler to fetch and parse content
4. Format Conversion - Generate JSON and Markdown
5. Output Files - Save to the specified directory

Output Formats

JSON Structure

json

{
  "title": "Article Title",
  "news_url": "Original Link",
  "news_id": "Article ID",
  "meta_info": {
    "author_name": "Author/Source",
    "author_url": "",
    "publish_time": "2024-01-01 12:00"
  },
  "contents": [
    {"type": "text", "content": "Paragraph text", "desc": ""},
    {"type": "image", "content": "https://...", "desc": ""},
    {"type": "video", "content": "https://...", "desc": ""}
  ],
  "texts": ["Paragraph 1", "Paragraph 2"],
  "images": ["Image URL1", "Image URL2"],
  "videos": []
}
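
To show how a downstream script might consume this structure, the sketch below parses a JSON document of the shape above and counts the content blocks by type. It uses only the standard library; `summarize_article` is a hypothetical helper, not part of the skill, and it assumes the field names shown in the sample.

```python
import json
from collections import Counter


def summarize_article(raw: str) -> dict:
    """Parse an extracted-article JSON string and summarize its contents."""
    article = json.loads(raw)
    # Count blocks in the ordered "contents" list by their "type" field.
    counts = Counter(item["type"] for item in article.get("contents", []))
    return {
        "title": article["title"],
        "author": article.get("meta_info", {}).get("author_name", ""),
        "text_blocks": counts.get("text", 0),
        "image_blocks": counts.get("image", 0),
        "video_blocks": counts.get("video", 0),
    }


if __name__ == "__main__":
    sample = json.dumps({
        "title": "Article Title",
        "news_url": "Original Link",
        "news_id": "Article ID",
        "meta_info": {"author_name": "Author/Source", "author_url": "",
                      "publish_time": "2024-01-01 12:00"},
        "contents": [
            {"type": "text", "content": "Paragraph text", "desc": ""},
            {"type": "image", "content": "https://...", "desc": ""},
        ],
        "texts": ["Paragraph text"],
        "images": ["https://..."],
        "videos": [],
    })
    print(summarize_article(sample))
```

In the sample, `contents` carries text and media in one ordered list, while `texts`, `images`, and `videos` list the same items grouped by kind.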

Markdown Structure

markdown

# Article Title

## Article Information
**Author**: xxx
**Publish Time**: 2024-01-01 12:00
**Original Link**: [Link](URL)

---

## Article Content

Paragraph content...

![Image](URL)

---

## Media Resources
### Images (N)
1. URL1
2. URL2
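
The conversion from the JSON structure to this Markdown layout could look roughly like the sketch below. `render_markdown` is a hypothetical function written for illustration; the skill's own formatter may differ in details such as blank lines or video handling, which this sketch does not cover.

```python
def render_markdown(article: dict) -> str:
    """Render an extracted-article dict into the Markdown layout shown above."""
    meta = article.get("meta_info", {})
    lines = [
        f"# {article['title']}",
        "",
        "## Article Information",
        f"**Author**: {meta.get('author_name', '')}",
        f"**Publish Time**: {meta.get('publish_time', '')}",
        f"**Original Link**: [Link]({article['news_url']})",
        "",
        "---",
        "",
        "## Article Content",
        "",
    ]
    # Emit text and images in their original order; other types are skipped here.
    for item in article.get("contents", []):
        if item["type"] == "text":
            lines += [item["content"], ""]
        elif item["type"] == "image":
            lines += [f"![Image]({item['content']})", ""]
    images = article.get("images", [])
    if images:
        lines += ["---", "", "## Media Resources", f"### Images ({len(images)})"]
        lines += [f"{i}. {url}" for i, url in enumerate(images, start=1)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = {
        "title": "Article Title",
        "news_url": "https://example.com/a",
        "meta_info": {"author_name": "xxx", "publish_time": "2024-01-01 12:00"},
        "contents": [{"type": "text", "content": "Paragraph content...", "desc": ""}],
        "images": [],
    }
    print(render_markdown(demo))
```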

Usage Examples

Extract WeChat Official Account Article

bash

uv run .claude/skills/news-extractor/scripts/extract_news.py \
  "https://mp.weixin.qq.com/s/ebMzDPu2zMT_mRgYgtL6eQ"

Output:

[INFO] Platform detected: wechat (WeChat Official Accounts)
[INFO] Extracting content...
[INFO] Title: Article Title
[INFO] Author: Official Account Name
[INFO] Text paragraphs: 15
[INFO] Images: 3
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.json
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.md
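
In the log above, the saved file names reuse the last path segment of the WeChat URL. The sketch below reproduces that mapping under the assumption that the ID is always the last non-empty path segment; this is only confirmed by the WeChat example, and other platforms may derive `news_id` differently. `guess_news_id` and `output_paths` are hypothetical helpers.

```python
from pathlib import Path
from typing import Tuple
from urllib.parse import urlparse


def guess_news_id(url: str) -> str:
    """Assumption: the article ID is the last non-empty segment of the URL path."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    if not segments:
        raise ValueError(f"no path segment to use as an ID: {url!r}")
    return segments[-1]


def output_paths(url: str, output_dir: str = "./output") -> Tuple[Path, Path]:
    """Return the expected (JSON, Markdown) output paths for a URL."""
    news_id = guess_news_id(url)
    base = Path(output_dir)
    return base / f"{news_id}.json", base / f"{news_id}.md"


if __name__ == "__main__":
    json_path, md_path = output_paths(
        "https://mp.weixin.qq.com/s/ebMzDPu2zMT_mRgYgtL6eQ"
    )
    print(json_path.name, md_path.name)
```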

Extract Toutiao Article

bash

uv run .claude/skills/news-extractor/scripts/extract_news.py \
  "https://www.toutiao.com/article/7434425099895210546/"

Error Handling

Error Type	Description	Solution
`Unrecognized Platform`	URL does not match any supported platform	Check if the URL is correct
`Platform Not Supported`	Unsupported site	This Skill only supports the listed news sites
`Extraction Failed`	Network error or page structure change	Retry or check URL validity
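
For the "retry" advice on `Extraction Failed`, a caller could wrap the extraction step in a small retry helper with exponential backoff. The sketch below uses only the standard library and is not the skill's own retry logic (the dependency list mentions tenacity for that); `retry` and its parameters are hypothetical, and `operation` stands in for whatever function performs the extraction.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            # Re-raise the last error once the attempt budget is exhausted.
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Catching every `Exception` keeps the sketch short; real code would likely retry only on network errors, since a changed page structure will not fix itself on a second attempt.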

Notes

- For educational and research purposes only
- Do not perform large-scale crawling
- Respect the target website's robots.txt and terms of service
- WeChat Official Accounts may require valid Cookies (the default configuration usually works)

References

Platform URL Pattern Instructions
