# China News Crawler Skill

Extract article content from mainstream Chinese news platforms and output it in JSON and Markdown formats.

**Independent and portable:** this Skill bundles all the code it needs (it imports nothing from the host project) and can be copied directly into other projects; only the common Python packages listed below are required.

## Supported Platforms

| Platform | ID | Example URL |
| --- | --- | --- |
| WeChat Official Account | `wechat` | https://mp.weixin.qq.com/s/xxxxx |
| Toutiao | `toutiao` | https://www.toutiao.com/article/123456/ |
| NetEase News | `netease` | https://www.163.com/news/article/ABC123.html |
| Sohu News | `sohu` | https://www.sohu.com/a/123456_789 |
| Tencent News | `tencent` | https://news.qq.com/rain/a/20251016A07W8J00 |

## Usage

### Basic Usage

Extract a news article with automatic platform detection, writing both JSON and Markdown:

```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL"
```

Specify the output directory:

```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL" --output ./output
```

Output JSON only:

```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL" --format json
```

Output Markdown only:

```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL" --format markdown
```

List the supported platforms:

```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py --list-platforms
```

## Output Files

By default the script writes both formats to the output directory (default `./output`):

- `{news_id}.json` - structured JSON data
- `{news_id}.md` - Markdown-formatted article

## Workflow

1. **Receive URL** - the user provides a news link
2. **Platform detection** - the platform type is identified automatically
3. **Content extraction** - the matching crawler fetches and parses the page
4. **Format conversion** - JSON and Markdown output are generated
5. **Output files** - results are saved to the specified directory

## Output Formats

### JSON Structure

```json
{
  "title": "Article Title",
  "news_url": "Original URL",
  "news_id": "Article ID",
  "meta_info": {
    "author_name": "Author/Source",
    "author_url": "",
    "publish_time": "2024-01-01 12:00"
  },
  "contents": [
    {"type": "text", "content": "Paragraph text", "desc": ""},
    {"type": "image", "content": "https://...", "desc": ""},
    {"type": "video", "content": "https://...", "desc": ""}
  ],
  "texts": ["Paragraph 1", "Paragraph 2"],
  "images": ["Image URL 1", "Image URL 2"],
  "videos": []
}
```
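Consuming this JSON from Python is straightforward; the inline sample below mirrors the schema above (real files are read from the output directory):

```python
import json

# Inline sample matching the schema above; real files live under ./output/.
raw = """{
  "title": "Article Title",
  "news_url": "https://example.com/article",
  "news_id": "demo",
  "meta_info": {"author_name": "Author", "author_url": "", "publish_time": "2024-01-01 12:00"},
  "contents": [
    {"type": "text", "content": "Paragraph 1", "desc": ""},
    {"type": "image", "content": "https://example.com/img.jpg", "desc": ""}
  ],
  "texts": ["Paragraph 1"],
  "images": ["https://example.com/img.jpg"],
  "videos": []
}"""

article = json.loads(raw)

# "contents" interleaves text and media in reading order, while
# "texts"/"images"/"videos" are flat convenience lists of the same data.
paragraphs = [c["content"] for c in article["contents"] if c["type"] == "text"]
images = [c["content"] for c in article["contents"] if c["type"] == "image"]
```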

### Markdown Structure

```markdown
# Article Title

## Article Information

Author: xxx  Publish Time: 2024-01-01 12:00  Original Link: Link

## Article Content

Paragraph content...

![Image](https://...)

## Media Resources

### Images (N)

1. URL1
2. URL2
```
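The conversion from the JSON structure to this Markdown layout could look roughly like the following. This is an illustrative sketch only; the actual logic lives in `scripts/formatter.py` and may differ in detail.

```python
def to_markdown(article: dict) -> str:
    """Render an extracted article dict (see the JSON structure above) as Markdown."""
    meta = article["meta_info"]
    lines = [
        f"# {article['title']}",
        "",
        "## Article Information",
        "",
        f"Author: {meta['author_name']}  "
        f"Publish Time: {meta['publish_time']}  "
        f"Original Link: {article['news_url']}",
        "",
        "## Article Content",
        "",
    ]
    # Walk "contents" in reading order, emitting text paragraphs and images.
    for block in article["contents"]:
        if block["type"] == "text":
            lines += [block["content"], ""]
        elif block["type"] == "image":
            lines += [f"![{block['desc'] or 'Image'}]({block['content']})", ""]
    # Append the media summary from the flat "images" list.
    if article["images"]:
        lines += ["## Media Resources", "", f"### Images ({len(article['images'])})", ""]
        lines += [f"{i}. {url}" for i, url in enumerate(article["images"], 1)]
    return "\n".join(lines)
```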

## Usage Examples

### Extract a WeChat Official Account Article

```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py \
  "https://mp.weixin.qq.com/s/ebMzDPu2zMT_mRgYgtL6eQ"
```

Output:

```
[INFO] Platform detected: wechat (WeChat Official Account)
[INFO] Extracting content...
[INFO] Title: Article Title
[INFO] Author: Official Account Name
[INFO] Text paragraphs: 15
[INFO] Images: 3
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.json
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.md
```

### Extract a Toutiao Article

```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py \
  "https://www.toutiao.com/article/7434425099895210546/"
```

## Dependency Requirements

This Skill requires the following Python packages (usually already installed in the main project):

- parsel
- pydantic
- requests
- curl-cffi
- tenacity
- demjson3

## Error Handling

| Error | Description | Solution |
| --- | --- | --- |
| Unrecognized platform | The URL does not match any supported platform | Check that the URL is correct |
| Unsupported platform | The site is not a Chinese news site | This Skill only supports Chinese news sites |
| Extraction failed | Network error or the page structure changed | Retry, or verify that the URL is still valid |
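The "retry" advice in the last row can be automated with exponential backoff. Below is a plain-stdlib sketch of that strategy; the Skill's own fetchers use the `tenacity` package listed above and may configure retries differently.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0):
    """Call fetch(url), retrying network errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```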

## Notes

- For educational and research purposes only
- Do not perform large-scale crawling
- Respect the target website's robots.txt and terms of service
- WeChat Official Accounts may require valid cookies (the default configuration usually works)

## Directory Structure

```
china-news-crawler/
├── SKILL.md                      # [Required] Skill definition file
├── references/
│   └── platform-patterns.md      # Platform URL pattern reference
└── scripts/
    ├── extract_news.py           # CLI entry script
    ├── models.py                 # Data models
    ├── detector.py               # Platform detection
    ├── formatter.py              # Markdown formatting
    └── crawlers/                 # Crawler modules
        ├── __init__.py
        ├── base.py               # BaseNewsCrawler base class
        ├── fetchers.py           # HTTP fetching strategies
        ├── wechat.py             # WeChat Official Accounts
        ├── toutiao.py            # Toutiao
        ├── netease.py            # NetEase News
        ├── sohu.py               # Sohu News
        └── tencent.py            # Tencent News
```

## References

- [Platform URL patterns](references/platform-patterns.md)