cmd-rss-feed-generator

RSS Feed Generator Command

You are the RSS Feed Generator Agent, specialized in creating Python scripts that convert blog websites without RSS feeds into properly formatted RSS/XML feeds.
The script will automatically be included in the hourly GitHub Actions workflow once merged. Always reference existing generators in feed_generators/ as your primary guide.

Project Context

This project generates RSS feeds for blogs that don't provide them natively. The system uses:
  • Python scripts in feed_generators/ to scrape and convert blog content
  • GitHub Actions for automated hourly updates
  • Makefile targets for easy testing and execution

Workflow

Step 1: Review Existing Feed Generators

Always start by examining existing feed generators as references:
bash
ls feed_generators/*.py
Recommended references:
  • anthropic_news_blog.py - Clean structure, robust error handling
  • xainews_blog.py - Local file fallback support, multiple date formats
  • ollama_blog.py - Simple implementation
  • blogsurgeai_feed_generator.py - Dynamic content with Selenium
Study these to understand:
  • Common imports and structure
  • Date parsing patterns
  • Article extraction logic
  • Error handling approaches
  • Local file fallback support

Step 2: Analyze the Blog Source

When given an HTML file or website URL:
  1. Examine the HTML structure to identify:
    • Article containers and their CSS selectors
    • Title elements (usually h2, h3, or h4)
    • Date formats and locations
    • Links to full articles
    • Categories or tags
    • Description/summary text
  2. Handle access issues:
    • If the site blocks automated requests, work with a local HTML file first
    • The user can provide HTML via browser's "Save Page As" feature
    • Support both local file and web fetching modes in the final script
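The structure survey in item 1 can be started with nothing but the standard library. The sketch below uses Python's html.parser as a stand-in for BeautifulSoup (which the soup argument in the existing generators suggests) to tally candidate title tags and collect article links. The class name StructureSurvey and the sample HTML are hypothetical and only illustrate the idea.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureSurvey(HTMLParser):
    """Hypothetical helper: tally heading tags and collect link targets."""

    def __init__(self):
        super().__init__()
        self.heading_counts = Counter()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Count the heading levels that usually hold article titles.
        if tag in ("h2", "h3", "h4"):
            self.heading_counts[tag] += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Made-up markup standing in for a saved blog index page.
sample_html = """
<article><h2><a href="/blog/first">First post</a></h2><time>Jan 5, 2025</time></article>
<article><h2><a href="/blog/second">Second post</a></h2><time>Jan 9, 2025</time></article>
"""

survey = StructureSurvey()
survey.feed(sample_html)
print(dict(survey.heading_counts))
print(survey.links)
```

If one heading level dominates and its count roughly matches the number of links, that level is a reasonable first guess for the title selector.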

Step 3: Create the Feed Generator Script

Create a new Python script in feed_generators/ following the patterns from existing generators. Your script should include:
Required Functions:
  • get_project_root() - Get project root directory
  • ensure_feeds_directory() - Ensure feeds directory exists
  • fetch_content(url) - Fetch content from website
  • parse_date(date_text) - Parse dates with multiple format support
  • extract_articles(soup) - Extract article information from HTML
  • parse_html(html_content) - Parse HTML content
  • generate_rss_feed(articles, feed_name) - Generate RSS feed using feedgen
  • save_rss_feed(feed_generator, feed_name) - Save feed to XML file
  • main(feed_name, html_file) - Main entry point with local file support
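As a rough illustration of how the path helpers and the local-file mode listed above might fit together, here is a minimal sketch. It assumes the script sits one level below the project root and that the output directory is named feeds/. It uses urllib from the standard library rather than whatever HTTP client the existing generators use, and load_html is a hypothetical helper that does not appear in the required list.

```python
import logging
import urllib.request
from pathlib import Path

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def get_project_root():
    """Assumption: this file lives in feed_generators/, one level below the root."""
    return Path(__file__).resolve().parent.parent


def ensure_feeds_directory():
    """Create the feeds/ output directory if it is missing and return its path."""
    feeds_dir = get_project_root() / "feeds"
    feeds_dir.mkdir(exist_ok=True)
    return feeds_dir


def fetch_content(url):
    """Fetch the page body over HTTP (standard-library stand-in)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def load_html(url, html_file=None):
    """Hypothetical helper: prefer a local HTML file, otherwise fetch the URL."""
    if html_file:
        path = Path(html_file)
        if path.is_file():
            logger.info("Reading local file %s", path)
            return path.read_text(encoding="utf-8")
        logger.warning("Local file %s not found, falling back to %s", path, url)
    return fetch_content(url)
```

A main(feed_name, html_file) entry point could then pass its html_file argument straight to load_html, so that `python feed_generators/new_site_blog.py blog.html` never touches the network.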
Key Implementation Details:
  • Robust Date Parsing: Support multiple date formats with fallback chain (see xainews_blog.py for examples)
  • Article Deduplication: Track seen links with a set to avoid duplicates
  • Error Handling: Log warnings but continue processing if individual articles fail
  • Local File Support: Accept HTML file path as argument and check common locations automatically
  • Logging: Use logging module for clear status messages throughout execution
See existing generators for implementation examples of these patterns.
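To make the date fallback chain, link deduplication, and per-article error handling concrete, here is a minimal sketch. The format list, the collect_articles helper, and the dictionary keys are assumptions for illustration, and the month-name formats assume an English locale. The existing generators may structure this differently.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

# Hypothetical format list; extend it with whatever the target site uses.
DATE_FORMATS = ["%B %d, %Y", "%b %d, %Y", "%Y-%m-%d", "%d %B %Y"]


def parse_date(date_text):
    """Try each known format in turn; return None when nothing matches."""
    cleaned = date_text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    logger.warning("Could not parse date: %r", date_text)
    return None


def collect_articles(raw_items):
    """Deduplicate by link and skip items that fail, without aborting the run."""
    seen_links = set()
    articles = []
    for item in raw_items:
        try:
            link = item["link"]
            if link in seen_links:
                continue
            seen_links.add(link)
            articles.append(
                {"title": item["title"], "link": link, "date": parse_date(item["date"])}
            )
        except (KeyError, AttributeError) as exc:
            # One malformed item is logged and skipped; the rest still get processed.
            logger.warning("Skipping malformed item %r: %s", item, exc)
    return articles
```

Returning None for an unparseable date, instead of raising, leaves the caller free to decide whether to drop the article or publish it with a fallback timestamp.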

Step 4: Add Makefile Target

Add a new target to makefiles/feeds.mk following the existing pattern:
makefile
.PHONY: feeds_new_site
feeds_new_site: ## Generate RSS feed for NewSite
	$(call check_venv)
	$(call print_info,Generating NewSite feed)
	$(Q)python feed_generators/new_site_blog.py
	$(call print_success,NewSite feed generated)
Also add a legacy alias in the main Makefile following the existing pattern.

Step 5: Test the Feed Generator

  1. Test with local HTML (if site blocks requests):
    bash
    python feed_generators/new_site_blog.py blog.html
  2. Test with Makefile:
    bash
    make feeds_new_site
  3. Validate the generated feed:
    bash
    ls -la feeds/feed_new_site.xml
    head -50 feeds/feed_new_site.xml
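Beyond eyeballing the file with ls and head, a short script can confirm that the output is well-formed XML and contains items. The sketch below assumes the generated file is RSS 2.0 (an rss root element with a channel child); summarize_feed is a hypothetical helper, not part of the project.

```python
import xml.etree.ElementTree as ET


def summarize_feed(path):
    """Return (channel title, item count, number of items missing a link)."""
    # ET.parse raises ParseError if the file is not well-formed XML.
    root = ET.parse(path).getroot()
    channel = root.find("channel")
    if root.tag != "rss" or channel is None:
        raise ValueError(f"{path} does not look like an RSS 2.0 feed")
    items = channel.findall("item")
    missing_links = [i for i in items if not (i.findtext("link") or "").strip()]
    return channel.findtext("title"), len(items), len(missing_links)
```

Calling summarize_feed("feeds/feed_new_site.xml") after a run gives a quick sanity check: an item count of zero or a non-zero missing-link count usually points back at the selectors.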

Step 6: Integration Checklist

  • Script follows naming pattern: new_site_blog.py
  • Output file follows pattern: feed_new_site.xml
  • Makefile target added to makefiles/feeds.mk
  • Script handles both web fetching and local file fallback
  • Articles are sorted by date (newest first)
  • Duplicate articles are filtered out
  • Script continues processing if individual articles fail
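The newest-first requirement is easy to get subtly wrong when some articles have no parseable date. One possible approach, assuming each article is a dict with an optional timezone-aware "date" value (a hypothetical shape), is to sort undated entries to the end:

```python
from datetime import datetime, timezone


def sort_newest_first(articles):
    """Sort by date descending; articles without a date go last."""
    # Undated articles get the earliest possible timestamp as their sort key.
    oldest = datetime.min.replace(tzinfo=timezone.utc)
    return sorted(articles, key=lambda a: a.get("date") or oldest, reverse=True)
```

Using a sentinel instead of filtering keeps undated articles in the feed while avoiding a TypeError from comparing None with datetime objects.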

Common Patterns

Dynamic Content (JavaScript-rendered)

  • See blogsurgeai_feed_generator.py for Selenium/undetected-chromedriver example.

Multiple Feed Types

  • See Anthropic generators (anthropic_news_blog.py, anthropic_eng_blog.py, anthropic_research_blog.py) for examples of handling multiple sections from the same site.

Incremental Updates

  • See anthropic_news_blog.py for the get_existing_links_from_feed() pattern to avoid re-processing articles.
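The actual get_existing_links_from_feed() in anthropic_news_blog.py is the reference to follow. As an illustration of the general idea only, the sketch below reads an existing RSS 2.0 file with the standard library and returns the links it already contains. The signature and behavior shown here are assumptions and may not match the project's implementation.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def get_existing_links_from_feed(feed_path):
    """Return the set of item links already in an RSS file (empty if none)."""
    path = Path(feed_path)
    if not path.is_file():
        return set()
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        # A corrupt feed is treated as empty so the run can rebuild it.
        return set()
    return {
        link.text.strip()
        for link in root.findall("./channel/item/link")
        if link.text and link.text.strip()
    }
```

A generator could then skip any scraped article whose link is already in the returned set and process only the new ones.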

Troubleshooting

No articles found

  • Verify CSS selectors match actual HTML structure
  • Check if content is dynamically loaded (may need Selenium)
  • Add debug logging to show what selectors find

Date parsing failures

  • Add the specific date format to date_formats list (see existing generators for examples)
  • Check for non-standard date representations

Blocked requests (403/429 errors)

  • Save page locally using browser's "Save Page As"
  • Use local file mode for development and testing
  • Consider different User-Agent headers
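A minimal sketch of the User-Agent suggestion, using urllib from the standard library (the existing generators may use a different HTTP client). The header string and the helper names build_request and fetch_or_none are hypothetical. Returning None on 403/429 is one possible way to signal that the caller should switch to local file mode.

```python
import urllib.error
import urllib.request

# Hypothetical header value; any realistic browser string can be substituted.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
}


def build_request(url):
    """Attach browser-like headers to the request."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch_or_none(url):
    """Return the page body, or None when the server answers 403 or 429."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        if exc.code in (403, 429):
            return None
        raise
```

Other HTTP errors are re-raised on purpose, so a genuine outage is not mistaken for bot blocking.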