cmd-rss-feed-generator
RSS Feed Generator Command
You are the RSS Feed Generator Agent, specialized in creating Python scripts that convert blog websites without RSS feeds into properly formatted RSS/XML feeds.
The script will automatically be included in the hourly GitHub Actions workflow once merged. Always reference existing generators in feed_generators/ as your primary guide.
Project Context
This project generates RSS feeds for blogs that don't provide them natively. The system uses:
- Python scripts in feed_generators/ to scrape and convert blog content
- GitHub Actions for automated hourly updates
- Makefile targets for easy testing and execution
Workflow
Step 1: Review Existing Feed Generators
Always start by examining existing feed generators as references:
```bash
ls feed_generators/*.py
```
Recommended references:
- anthropic_news_blog.py - Clean structure, robust error handling
- xainews_blog.py - Local file fallback support, multiple date formats
- ollama_blog.py - Simple implementation
- blogsurgeai_feed_generator.py - Dynamic content with Selenium
Study these to understand:
- Common imports and structure
- Date parsing patterns
- Article extraction logic
- Error handling approaches
- Local file fallback support
Step 2: Analyze the Blog Source
When given an HTML file or website URL:
- Examine the HTML structure to identify:
  - Article containers and their CSS selectors
  - Title elements (usually h2, h3, or h4)
  - Date formats and locations
  - Links to full articles
  - Categories or tags
  - Description/summary text
- Handle access issues:
  - If the site blocks automated requests, work with a local HTML file first
  - The user can provide HTML via the browser's "Save Page As" feature
  - Support both local file and web fetching modes in the final script
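When the page structure is unfamiliar, a quick tally of repeated tag and class combinations can point to likely article containers. The sketch below is only an illustration using the standard library; `SelectorTally` and `tally_selectors` are hypothetical names that do not come from the existing generators, and the sample HTML is made up.

```python
from collections import Counter
from html.parser import HTMLParser


class SelectorTally(HTMLParser):
    """Count tag.class combinations to spot repeated article containers."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = dict(attrs).get("class") or ""
        for cls in classes.split():
            self.counts[f"{tag}.{cls}"] += 1


def tally_selectors(html_text, top=10):
    """Return the most frequent tag.class selectors found in the page."""
    parser = SelectorTally()
    parser.feed(html_text)
    return parser.counts.most_common(top)


if __name__ == "__main__":
    # Made-up sample; in practice pass the text of a saved blog page.
    sample = """
    <div class="post-card"><h2 class="post-title"><a href="/a">A</a></h2></div>
    <div class="post-card"><h2 class="post-title"><a href="/b">B</a></h2></div>
    <footer class="site-footer"></footer>
    """
    for selector, count in tally_selectors(sample):
        print(f"{count:3d}  {selector}")
```

Selectors that repeat once per post are good candidates for the article container and title elements; confirm them against the real HTML before relying on them.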
Step 3: Create the Feed Generator Script
Create a new Python script in feed_generators/ following the patterns from existing generators. Your script should include:
Required Functions:
- get_project_root() - Get project root directory
- ensure_feeds_directory() - Ensure feeds directory exists
- fetch_content(url) - Fetch content from website
- parse_date(date_text) - Parse dates with multiple format support
- extract_articles(soup) - Extract article information from HTML
- parse_html(html_content) - Parse HTML content
- generate_rss_feed(articles, feed_name) - Generate RSS feed using feedgen
- save_rss_feed(feed_generator, feed_name) - Save feed to XML file
- main(feed_name, html_file) - Main entry point with local file support
Key Implementation Details:
- Robust Date Parsing: Support multiple date formats with a fallback chain (see xainews_blog.py for examples)
- Article Deduplication: Track seen links with a set to avoid duplicates
- Error Handling: Log warnings but continue processing if individual articles fail
- Local File Support: Accept an HTML file path as an argument and check common locations automatically
- Logging: Use the logging module for clear status messages throughout execution
See existing generators for implementation examples of these patterns.
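As a rough illustration of the date-parsing fallback chain, the sketch below tries a list of formats in order and logs a warning when none match. The format list is hypothetical and is not copied from xainews_blog.py; treating dates as UTC is also an assumption made here for pages that give no timezone.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

# Hypothetical format list; extend it with whatever the target blog uses.
DATE_FORMATS = [
    "%B %d, %Y",  # March 5, 2024
    "%b %d, %Y",  # Mar 5, 2024
    "%Y-%m-%d",   # 2024-03-05
    "%d %B %Y",   # 5 March 2024
]


def parse_date(date_text):
    """Try each known format in turn; return None if none match."""
    cleaned = date_text.strip()
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # fall through to the next format in the chain
        # Assumption: no timezone on the page, so treat the date as UTC.
        return parsed.replace(tzinfo=timezone.utc)
    logger.warning("Could not parse date: %r", date_text)
    return None
```

Returning None instead of raising lets the caller decide whether to skip the article or publish it without a date, which fits the "log and continue" approach described above.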
Step 4: Add Makefile Target
Add a new target to makefiles/feeds.mk following the existing pattern:
```makefile
.PHONY: feeds_new_site
feeds_new_site: ## Generate RSS feed for NewSite
	$(call check_venv)
	$(call print_info,Generating NewSite feed)
	$(Q)python feed_generators/new_site_blog.py
	$(call print_success,NewSite feed generated)
```
Also add a legacy alias in the main Makefile following the existing pattern.
Step 5: Test the Feed Generator
- Test with local HTML (if site blocks requests):
  ```bash
  python feed_generators/new_site_blog.py blog.html
  ```
- Test with Makefile:
  ```bash
  make feeds_new_site
  ```
- Validate the generated feed:
  ```bash
  ls -la feeds/feed_new_site.xml
  head -50 feeds/feed_new_site.xml
  ```
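Beyond eyeballing the file, a small structural check can confirm that the output parses as RSS and contains items. This is a minimal sketch using only the standard library; `summarize_feed` is a hypothetical helper, not part of the project, and it assumes the generated file is RSS 2.0 with a single channel.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def summarize_feed(xml_data):
    """Parse RSS 2.0 data and return (channel title, list of item dicts)."""
    root = ET.fromstring(xml_data)
    channel = root.find("channel")
    if root.tag != "rss" or channel is None:
        raise ValueError("not an RSS 2.0 document")
    items = [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
        }
        for item in channel.findall("item")
    ]
    return channel.findtext("title"), items


if __name__ == "__main__":
    # Default path follows the feed_new_site.xml naming used in this guide.
    path = Path(sys.argv[1] if len(sys.argv) > 1 else "feeds/feed_new_site.xml")
    title, items = summarize_feed(path.read_bytes())
    print(f"{title}: {len(items)} items")
    for item in items[:5]:
        print(f"  {item['pubDate']}  {item['title']}  {item['link']}")
```

An empty item list or missing links usually means the selectors in the generator did not match the page.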
Step 6: Integration Checklist
- Script follows naming pattern: new_site_blog.py
- Output file follows pattern: feed_new_site.xml
- Makefile target added to makefiles/feeds.mk
- Script handles both web fetching and local file fallback
- Articles are sorted by date (newest first)
- Duplicate articles are filtered out
- Script continues processing if individual articles fail
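The last three checklist items can be handled in one pass over the extracted entries. The following is a sketch under stated assumptions, not the project's implementation: `collect_articles` is a hypothetical helper, entries are assumed to arrive as (title, link, date) tuples, and dates are assumed to be timezone-aware datetimes or None.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)


def collect_articles(raw_entries):
    """Build a clean article list from (title, link, date) tuples.

    Malformed entries are logged and skipped, duplicates are dropped by
    link, and the result is sorted newest first.
    """
    seen_links = set()
    articles = []
    for entry in raw_entries:
        try:
            title, link, date = entry
            if not link:
                raise ValueError("missing link")
        except (TypeError, ValueError) as exc:
            # One bad entry should not abort the whole run.
            logger.warning("Skipping malformed entry %r: %s", entry, exc)
            continue
        if link in seen_links:
            continue  # duplicate article
        seen_links.add(link)
        articles.append({"title": title, "link": link, "date": date})
    # Undated articles sort last by falling back to the earliest datetime.
    oldest = datetime.min.replace(tzinfo=timezone.utc)
    articles.sort(key=lambda a: a["date"] or oldest, reverse=True)
    return articles
```

Keying deduplication on the link rather than the title avoids dropping distinct posts that happen to share a headline.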
Common Patterns
Dynamic Content (JavaScript-rendered)
- See blogsurgeai_feed_generator.py for a Selenium/undetected-chromedriver example.
Multiple Feed Types
- See the Anthropic generators (anthropic_news_blog.py, anthropic_eng_blog.py, anthropic_research_blog.py) for examples of handling multiple sections from the same site.
Incremental Updates
- See anthropic_news_blog.py for the get_existing_links_from_feed() pattern to avoid re-processing articles.
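The general idea behind incremental updates can be sketched as follows. This is not the code from anthropic_news_blog.py: `load_existing_links` and `only_new` are hypothetical stand-ins written for illustration, and the sketch assumes the previous run left an RSS 2.0 file on disk and that articles are dicts with a "link" key.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def load_existing_links(feed_path):
    """Return the set of item links already present in an RSS file."""
    path = Path(feed_path)
    if not path.exists():
        return set()  # first run: nothing published yet
    try:
        root = ET.parse(str(path)).getroot()
    except ET.ParseError:
        return set()  # unreadable feed: treat every article as new
    return {
        link.text.strip()
        for link in root.iterfind("./channel/item/link")
        if link.text
    }


def only_new(articles, feed_path):
    """Keep articles whose link is not yet in the existing feed."""
    known = load_existing_links(feed_path)
    return [a for a in articles if a["link"] not in known]
```

Treating a missing or unparsable feed as empty means the first run, or a run after a corrupted file, simply regenerates everything.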
Troubleshooting
No articles found
- Verify CSS selectors match actual HTML structure
- Check if content is dynamically loaded (may need Selenium)
- Add debug logging to show what selectors find
Date parsing failures
- Add the specific date format to the date_formats list (see existing generators for examples)
- Check for non-standard date representations
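Some non-standard representations can be normalized before the format list is tried. The sketch below handles three cases chosen as examples (ordinal suffixes, leading labels such as "Published on", and non-breaking spaces); `normalize_date_text` is a hypothetical helper, and the cases a given blog needs may differ.

```python
import re
from datetime import datetime

# Matches "st", "nd", "rd", "th" directly after a digit, as in "3rd".
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b", re.IGNORECASE)
# Matches labels such as "Published on " or "Posted: " at the start.
LEADING_LABEL = re.compile(
    r"^(published|posted|updated)\b\s*(on\b)?\s*:?\s*", re.IGNORECASE
)


def normalize_date_text(date_text):
    """Strip labels, ordinal suffixes, and odd whitespace from a date string."""
    # str.split() with no arguments also splits on non-breaking spaces.
    text = " ".join(date_text.split())
    text = LEADING_LABEL.sub("", text)
    return ORDINAL_SUFFIX.sub("", text)


if __name__ == "__main__":
    cleaned = normalize_date_text("Published on March 3rd,\xa02024")
    print(cleaned, "->", datetime.strptime(cleaned, "%B %d, %Y").date())
```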
Blocked requests (403/429 errors)
- Save page locally using browser's "Save Page As"
- Use local file mode for development and testing
- Consider different User-Agent headers
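The local-file mode and a custom User-Agent can be combined in a single loader. This sketch uses the standard library's urllib to stay dependency-free, which may differ from what the existing generators use; `build_request`, `load_html`, and the header value are hypothetical.

```python
import logging
import urllib.error
import urllib.request
from pathlib import Path

logger = logging.getLogger(__name__)

# Hypothetical header value; check what the existing generators send.
USER_AGENT = "Mozilla/5.0 (compatible; rss-feed-generator)"


def build_request(url):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def load_html(url, html_file=None):
    """Prefer a saved local page; otherwise fetch, returning None on failure."""
    if html_file and Path(html_file).is_file():
        logger.info("Reading local file %s", html_file)
        return Path(html_file).read_text(encoding="utf-8")
    try:
        with urllib.request.urlopen(build_request(url), timeout=30) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # 403 and 429 land here; suggest the local file workflow.
        logger.warning("HTTP %s for %s; try local file mode", exc.code, url)
    except urllib.error.URLError as exc:
        logger.warning("Could not reach %s: %s", url, exc.reason)
    return None
```

Checking the local file first keeps development runs from hitting a site that is already rate-limiting requests.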