china-news-crawler

China News Crawler Skill

Extracts article content from mainstream Chinese news platforms and outputs it in JSON and Markdown formats.

Self-contained and portable: this Skill bundles all the code it needs and does not depend on code elsewhere in the host project, so it can be copied directly into other projects. The third-party Python packages it uses are listed under Dependency Requirements.

Supported Platforms

| Platform | ID | URL Example |
| --- | --- | --- |
| WeChat Official Account | wechat | `https://mp.weixin.qq.com/s/xxxxx` |
| Toutiao | toutiao | `https://www.toutiao.com/article/123456/` |
| NetEase News | netease | `https://www.163.com/news/article/ABC123.html` |
| Sohu News | sohu | `https://www.sohu.com/a/123456_789` |
| Tencent News | tencent | `https://news.qq.com/rain/a/20251016A07W8J00` |

Usage

Basic Usage

```bash
# Extract news, auto-detect the platform, output JSON + Markdown
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL"

# Specify the output directory
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL" --output ./output

# Output JSON only
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL" --format json

# Output Markdown only
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL" --format markdown

# List supported platforms
uv run .claude/skills/china-news-crawler/scripts/extract_news.py --list-platforms
```
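When the script has to be driven from Python rather than a shell, the same invocations can be assembled as argument lists. The sketch below is illustrative only: `build_command` is a hypothetical helper that is not part of the Skill, and it relies solely on the script path and the `--output` and `--format` flags documented in this section.

```python
from typing import List, Optional

# Script path as it appears in the documented commands.
SCRIPT = ".claude/skills/china-news-crawler/scripts/extract_news.py"


def build_command(url: str,
                  output_dir: Optional[str] = None,
                  fmt: Optional[str] = None) -> List[str]:
    """Build the argv list for one extraction run (hypothetical helper)."""
    # Only the two format values documented for --format are accepted.
    if fmt is not None and fmt not in ("json", "markdown"):
        raise ValueError(f"unsupported format: {fmt!r}")
    cmd = ["uv", "run", SCRIPT, url]
    if output_dir is not None:
        cmd += ["--output", output_dir]
    if fmt is not None:
        cmd += ["--format", fmt]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with URLs that contain `&` or `?`.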

Output Files

By default, the script writes both formats to the output directory (`./output` unless `--output` is given):

- `{news_id}.json`: structured JSON data
- `{news_id}.md`: the article in Markdown format

Workflow

1. Receive URL: the user provides a news link.
2. Platform detection: the platform type is identified automatically.
3. Content extraction: the matching crawler fetches and parses the content.
4. Format conversion: JSON and Markdown are generated.
5. Output files: the results are saved to the specified directory.
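The platform detection step can be pictured as a lookup from the URL's hostname to a platform ID. The following is a minimal sketch under that assumption; the hostnames come from the URL examples in the Supported Platforms table, while `HOST_TO_PLATFORM` and `detect_platform` are hypothetical names, and the Skill's own `detector.py` may match on richer patterns.

```python
from typing import Optional
from urllib.parse import urlparse

# Hostnames taken from the documented URL examples; this mapping is an
# assumption, not the Skill's actual detection table.
HOST_TO_PLATFORM = {
    "mp.weixin.qq.com": "wechat",
    "www.toutiao.com": "toutiao",
    "www.163.com": "netease",
    "www.sohu.com": "sohu",
    "news.qq.com": "tencent",
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform ID for a URL, or None if the host is not recognized."""
    host = (urlparse(url).hostname or "").lower()
    return HOST_TO_PLATFORM.get(host)
```

Returning `None` for an unknown host leaves it to the caller to decide how to report the unrecognized-platform case.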

Output Formats

JSON Structure

```json
{
  "title": "Article Title",
  "news_url": "Original URL",
  "news_id": "Article ID",
  "meta_info": {
    "author_name": "Author/Source",
    "author_url": "",
    "publish_time": "2024-01-01 12:00"
  },
  "contents": [
    {"type": "text", "content": "Paragraph text", "desc": ""},
    {"type": "image", "content": "https://...", "desc": ""},
    {"type": "video", "content": "https://...", "desc": ""}
  ],
  "texts": ["Paragraph 1", "Paragraph 2"],
  "images": ["Image URL 1", "Image URL 2"],
  "videos": []
}
```
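To show how a downstream script might consume this structure, here is a small sketch that loads one record and summarizes it. The sample values and the `summarize` helper are hypothetical; only the field names come from the structure in this section. It assumes, as the example suggests, that `contents` holds the items in document order while `texts`, `images`, and `videos` are flat per-type lists.

```python
import json
from collections import Counter

# Placeholder record that follows the documented field names.
SAMPLE = """
{
  "title": "Article Title",
  "news_url": "https://example.com/a",
  "news_id": "abc123",
  "meta_info": {"author_name": "Author", "author_url": "", "publish_time": "2024-01-01 12:00"},
  "contents": [
    {"type": "text", "content": "Paragraph 1", "desc": ""},
    {"type": "image", "content": "https://example.com/1.jpg", "desc": ""},
    {"type": "text", "content": "Paragraph 2", "desc": ""}
  ],
  "texts": ["Paragraph 1", "Paragraph 2"],
  "images": ["https://example.com/1.jpg"],
  "videos": []
}
"""


def summarize(article: dict) -> dict:
    """Count content items by type and join the text paragraphs."""
    counts = Counter(item["type"] for item in article["contents"])
    return {
        "title": article["title"],
        "counts": dict(counts),
        "body": "\n\n".join(article["texts"]),
    }


summary = summarize(json.loads(SAMPLE))
```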

Markdown Structure

```markdown
# Article Title

## Article Information

Author: xxx
Publish Time: 2024-01-01 12:00
Original Link: Link

## Article Content

Paragraph content...

## Media Resources

### Images (N)

URL1
URL2
```
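As an illustration of the format conversion step, the sketch below renders an article dictionary into the outline described in this section. It is not the Skill's `formatter.py`: `to_markdown` is a hypothetical function, and the heading levels (`#`, `##`, `###`) are an assumption about how the sections nest.

```python
def to_markdown(article: dict) -> str:
    """Render an article dict into the documented outline (heading levels assumed)."""
    meta = article.get("meta_info", {})
    lines = [
        f"# {article['title']}",
        "",
        "## Article Information",
        "",
        f"Author: {meta.get('author_name', '')}",
        f"Publish Time: {meta.get('publish_time', '')}",
        f"Original Link: {article['news_url']}",
        "",
        "## Article Content",
        "",
    ]
    # One paragraph per entry in "texts", separated by blank lines.
    for paragraph in article.get("texts", []):
        lines += [paragraph, ""]
    # The media section is emitted only when there is at least one image.
    images = article.get("images", [])
    if images:
        lines += ["## Media Resources", "", f"### Images ({len(images)})", ""]
        lines += images
    return "\n".join(lines).rstrip() + "\n"
```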

Usage Examples

Extract WeChat Official Account Article

```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py \
  "https://mp.weixin.qq.com/s/ebMzDPu2zMT_mRgYgtL6eQ"
```

Output:

```
[INFO] Platform detected: wechat (WeChat Official Account)
[INFO] Extracting content...
[INFO] Title: Article Title
[INFO] Author: Official Account Name
[INFO] Text paragraphs: 15
[INFO] Images: 3
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.json
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.md
```

Extract Toutiao Article

```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py \
  "https://www.toutiao.com/article/7434425099895210546/"
```

Dependency Requirements

This Skill requires the following Python packages (usually already installed in the main project):

- parsel
- pydantic
- requests
- curl-cffi
- tenacity
- demjson3

Error Handling

| Error Type | Description | Solution |
| --- | --- | --- |
| `无法识别该平台` (Unrecognized platform) | The URL does not match any supported platform | Check that the URL is correct |
| `平台不支持` (Unsupported platform) | Non-Chinese site | This Skill supports only Chinese news sites |
| `提取失败` (Extraction failed) | Network error or a change in page structure | Retry, or check that the URL is valid |
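For the extraction-failed case, where the suggested remedy is to retry, a caller could wrap the extraction call in a small retry loop. The following is a generic standard-library sketch, not the Skill's own code; `with_retries` is a hypothetical helper, and the Skill may already handle retries internally (`tenacity` is among its dependencies).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> T:
    """Call func until it succeeds or attempts run out; re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            # Linear back-off: wait a little longer after each failed try.
            time.sleep(delay * attempt)
    raise AssertionError("unreachable")
```

Keeping `attempts` small fits the note about avoiding large-scale crawling: a page whose structure has changed will keep failing no matter how often it is requested.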

Notes

- For educational and research purposes only.
- Do not crawl at large scale.
- Respect each target site's robots.txt and terms of service.
- WeChat Official Account pages may require a valid cookie (the current default configuration usually works).

Directory Structure

```
china-news-crawler/
├── SKILL.md                      # [Required] Skill definition file
├── references/
│   └── platform-patterns.md      # Platform URL pattern description
└── scripts/
    ├── extract_news.py           # CLI entry script
    ├── models.py                 # Data models
    ├── detector.py               # Platform detection
    ├── formatter.py              # Markdown formatting
    └── crawlers/                 # Crawler modules
        ├── __init__.py
        ├── base.py               # BaseNewsCrawler base class
        ├── fetchers.py           # HTTP fetching strategies
        ├── wechat.py             # WeChat Official Accounts
        ├── toutiao.py            # Toutiao
        ├── netease.py            # NetEase News
        ├── sohu.py               # Sohu News
        └── tencent.py            # Tencent News
```

References

Platform URL Pattern Description (`references/platform-patterns.md`)