china-news-crawler
China News Crawler Skill
Extracts article content from mainstream Chinese news platforms and outputs it in JSON and Markdown formats.
Self-contained and portable: this Skill bundles all of its own code and does not depend on other project code, so it can be copied directly into another project. The third-party Python packages listed under Dependency Requirements must still be installed.
Supported Platforms
| Platform | ID | URL Example |
|---|---|---|
| WeChat Official Account | wechat | |
| Toutiao | toutiao | |
| NetEase News | netease | |
| Sohu News | sohu | |
| Tencent News | tencent | |
Usage
Basic Usage
```bash
# Extract news, auto-detect the platform, output JSON + Markdown
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL"

# Specify the output directory
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL" --output ./output

# Output only JSON
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL" --format json

# Output only Markdown
uv run .claude/skills/china-news-crawler/scripts/extract_news.py "URL" --format markdown

# List supported platforms
uv run .claude/skills/china-news-crawler/scripts/extract_news.py --list-platforms
```
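When the script is driven from Python rather than a shell, the command line can be assembled programmatically. The following is a minimal sketch, not part of the Skill itself: `build_command` and `run_extract` are hypothetical helper names, and only the flags shown in the usage above (`--output`, `--format`, `--list-platforms`) are used.

```python
import subprocess
from typing import List, Optional

# Script path as given in the usage examples.
SCRIPT = ".claude/skills/china-news-crawler/scripts/extract_news.py"


def build_command(
    url: Optional[str] = None,
    output: Optional[str] = None,
    fmt: Optional[str] = None,
    list_platforms: bool = False,
) -> List[str]:
    """Build the argv list for extract_news.py (hypothetical helper)."""
    cmd = ["uv", "run", SCRIPT]
    if list_platforms:
        return cmd + ["--list-platforms"]
    if url is None:
        raise ValueError("url is required unless list_platforms is set")
    cmd.append(url)
    if output is not None:
        cmd += ["--output", output]
    if fmt is not None:
        # Only the two values documented above are accepted here.
        if fmt not in ("json", "markdown"):
            raise ValueError("fmt must be 'json' or 'markdown'")
        cmd += ["--format", fmt]
    return cmd


def run_extract(url: str, **kwargs) -> int:
    """Run the script and return its exit code (requires uv on PATH)."""
    return subprocess.run(build_command(url, **kwargs), check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.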
Output Files
By default, the script writes two files to the output directory (default `./output`):
- `{news_id}.json` - Structured JSON data
- `{news_id}.md` - Markdown-formatted article
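A caller that wants to confirm both files were written can derive the expected paths from the article ID. This is a small sketch under the naming scheme described above; `expected_outputs` and `missing_outputs` are hypothetical names, not functions provided by the Skill.

```python
from pathlib import Path
from typing import Dict, List


def expected_outputs(news_id: str, output_dir: str = "./output") -> Dict[str, Path]:
    """Return the paths the script is documented to write for one article."""
    base = Path(output_dir)
    return {
        "json": base / f"{news_id}.json",
        "markdown": base / f"{news_id}.md",
    }


def missing_outputs(news_id: str, output_dir: str = "./output") -> List[Path]:
    """List the expected files that do not exist on disk."""
    return [p for p in expected_outputs(news_id, output_dir).values() if not p.exists()]
```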
Workflow
- Receive URL - User provides news link
- Platform Detection - Automatically identify platform type
- Content Extraction - Call corresponding crawler to retrieve and parse content
- Format Conversion - Generate JSON and Markdown
- Output Files - Save to specified directory
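The platform detection step can be pictured as a hostname lookup. The sketch below is an illustration only and is not taken from `detector.py`: the `PLATFORM_HOSTS` table is an assumption (only `mp.weixin.qq.com` and `www.toutiao.com` appear in this document; the other hosts are guesses at typical domains), and the real patterns are described in `references/platform-patterns.md`.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumed host suffixes per platform ID. Only the WeChat and Toutiao hosts
# come from this document; the rest are illustrative guesses.
PLATFORM_HOSTS = {
    "wechat": ("mp.weixin.qq.com",),
    "toutiao": ("toutiao.com",),
    "netease": ("163.com",),
    "sohu": ("sohu.com",),
    "tencent": ("news.qq.com", "new.qq.com"),
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform ID whose host suffix matches the URL, else None."""
    host = (urlparse(url).hostname or "").lower()
    for platform_id, suffixes in PLATFORM_HOSTS.items():
        for suffix in suffixes:
            # Match the exact host or any subdomain of it.
            if host == suffix or host.endswith("." + suffix):
                return platform_id
    return None
```

A `None` result corresponds to the "URL does not match any supported platform" case in the error handling table.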
Output Formats
JSON Structure
```json
{
  "title": "Article Title",
  "news_url": "Original URL",
  "news_id": "Article ID",
  "meta_info": {
    "author_name": "Author/Source",
    "author_url": "",
    "publish_time": "2024-01-01 12:00"
  },
  "contents": [
    {"type": "text", "content": "Paragraph text", "desc": ""},
    {"type": "image", "content": "https://...", "desc": ""},
    {"type": "video", "content": "https://...", "desc": ""}
  ],
  "texts": ["Paragraph 1", "Paragraph 2"],
  "images": ["Image URL 1", "Image URL 2"],
  "videos": []
}
```
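To consume the JSON output downstream, the `contents` array can be walked in order and grouped by its `type` field. The following sketch assumes only the field names shown in the structure above; `summarize` is a hypothetical helper and the sample document is made up for illustration.

```python
import json
from collections import Counter


def summarize(article: dict) -> dict:
    """Count the content items of each type in one extracted article."""
    counts = Counter(item["type"] for item in article.get("contents", []))
    return {
        "title": article.get("title", ""),
        "text": counts.get("text", 0),
        "image": counts.get("image", 0),
        "video": counts.get("video", 0),
    }


# Illustrative sample that follows the documented structure.
sample = json.loads("""
{
  "title": "Article Title",
  "contents": [
    {"type": "text", "content": "Paragraph text", "desc": ""},
    {"type": "image", "content": "https://example.com/a.jpg", "desc": ""},
    {"type": "text", "content": "More text", "desc": ""}
  ]
}
""")

summary = summarize(sample)
print(summary)
```

In practice the dictionary would come from `json.load` on a `{news_id}.json` file rather than from an inline string.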
Markdown Structure
```markdown
# Article Title

## Article Information
Author: xxx
Publish Time: 2024-01-01 12:00
Original Link: Link

## Article Content
Paragraph content...

## Media Resources
### Images (N)
- URL1
- URL2
```
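The conversion from the JSON structure to this Markdown layout can be sketched as a single function. This is not the code in `formatter.py`; the heading levels (`#`, `##`, `###`) are an assumption, since the source only shows the section titles, and `to_markdown` is a hypothetical name.

```python
from typing import List


def to_markdown(article: dict) -> str:
    """Render an extracted article dict in the documented Markdown layout."""
    meta = article.get("meta_info", {})
    lines: List[str] = [
        f"# {article.get('title', '')}",
        "",
        "## Article Information",
        f"Author: {meta.get('author_name', '')}",
        f"Publish Time: {meta.get('publish_time', '')}",
        f"Original Link: {article.get('news_url', '')}",
        "",
        "## Article Content",
    ]
    # One paragraph per entry in "texts", separated by blank lines.
    for paragraph in article.get("texts", []):
        lines += [paragraph, ""]
    images = article.get("images", [])
    if images:
        lines += ["## Media Resources", f"### Images ({len(images)})"]
        lines += [f"- {url}" for url in images]
    return "\n".join(lines).rstrip() + "\n"
```

The media section is omitted when there are no images, which is a design choice of this sketch rather than documented behavior.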
Usage Examples
Extract WeChat Official Account Article
```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py \
  "https://mp.weixin.qq.com/s/ebMzDPu2zMT_mRgYgtL6eQ"
```
Output:
```
[INFO] Platform detected: wechat (WeChat Official Account)
[INFO] Extracting content...
[INFO] Title: Article Title
[INFO] Author: Official Account Name
[INFO] Text paragraphs: 15
[INFO] Images: 3
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.json
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.md
```
Extract Toutiao Article
```bash
uv run .claude/skills/china-news-crawler/scripts/extract_news.py \
  "https://www.toutiao.com/article/7434425099895210546/"
```
Dependency Requirements
This Skill requires the following Python packages (usually pre-installed in the main project):
- parsel
- pydantic
- requests
- curl-cffi
- tenacity
- demjson3
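Before running the script in a new project, it can help to check which of these packages are importable. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper, and the mapping from distribution name to import name (for example `curl-cffi` to `curl_cffi`) is stated here as an assumption about how each package is imported.

```python
from importlib.util import find_spec
from typing import Dict, List

# Distribution name -> top-level import name (assumed).
REQUIRED: Dict[str, str] = {
    "parsel": "parsel",
    "pydantic": "pydantic",
    "requests": "requests",
    "curl-cffi": "curl_cffi",
    "tenacity": "tenacity",
    "demjson3": "demjson3",
}


def missing_packages(required: Dict[str, str] = REQUIRED) -> List[str]:
    """Return the distribution names whose module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```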
Error Handling
| Error Type | Description | Solution |
|---|---|---|
| | URL does not match any supported platform | Check that the URL is correct |
| | Non-Chinese site | This Skill only supports Chinese news sites |
| | Network error or page structure change | Retry, or check that the URL is still valid |
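For the network error case, the "retry" advice can be applied with a small backoff loop. The sketch below is a standard-library stand-in, not the Skill's own retry logic (the dependency list includes `tenacity`, which the scripts may use instead); `retry` and its parameters are hypothetical.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            # Re-raise the last error once the attempt budget is used up.
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Retrying does not help when the page structure has changed; in that case every attempt fails the same way and the final exception is re-raised.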
Notes
- For educational and research purposes only
- Do not perform large-scale crawling
- Respect the target website's robots.txt and terms of service
- WeChat Official Accounts may require valid cookies (the current default configuration usually works)
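One way to honor the robots.txt note is to test a URL against the site's rules before fetching it. The sketch below uses Python's standard `urllib.robotparser` and takes the robots.txt text as input so that it runs without network access; `is_allowed` is a hypothetical helper and is not part of the Skill.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/news/1"))
print(is_allowed(rules, "https://example.com/private/page"))
```

In real use the rules text would be downloaded once per site from its `/robots.txt` path and cached, rather than hard-coded.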
Directory Structure
```
china-news-crawler/
├── SKILL.md                      # [Required] Skill definition file
├── references/
│   └── platform-patterns.md      # Platform URL pattern description
└── scripts/
    ├── extract_news.py           # CLI entry script
    ├── models.py                 # Data models
    ├── detector.py               # Platform detection
    ├── formatter.py              # Markdown formatting
    └── crawlers/                 # Crawler modules
        ├── __init__.py
        ├── base.py               # BaseNewsCrawler base class
        ├── fetchers.py           # HTTP fetching strategies
        ├── wechat.py             # WeChat Official Accounts
        ├── toutiao.py            # Toutiao
        ├── netease.py            # NetEase News
        ├── sohu.py               # Sohu News
        └── tencent.py            # Tencent News
```
References
- Platform URL Pattern Description (`references/platform-patterns.md`)