xiaohongshu-search-summarizer
Xiaohongshu Search and Summarize
This skill automates the process of extracting high-quality multi-modal content (text + images) from Xiaohongshu (小红书) and actively assists you in generating a deeply integrated, analytical final report for the user. Due to Xiaohongshu's aggressive anti-scraping mechanisms, direct HTTP requests or naive scraping often result in 404s or blocks. This skill bypasses these by simulating a real user through `playwright-cli` in a headed browser window.

It operates in two distinct phases:
Phase 1: Subagent Data Collection
- Simulate a search for the keyword on Xiaohongshu in a headed browser.
- Advance through image sliders to fully load all lazy pictures from the top N posts.
- Extract titles, descriptions, top comments, and all high-resolution images.
- Download those images to a local directory and generate a raw data document (`[keyword]_raw_data.md`).
Phase 2: AI Multi-Modal Synthesis (Your Job)
- You MUST use your file reading capabilities to read the `[keyword]_raw_data.md` file.
- Inside the raw data markdown, you will find paths to image files. You MUST use your file reading / vision capabilities on these image file paths to actually ingest and "see" their visual content. If you skip this step, you are only reading file names, not the images themselves!
- You analyze the texts, summarize the genuinely useful comments (discarding noise like "pm me"), and interpret the semantic content of the images you just viewed (e.g. diagrams, guidelines, step-by-step UI flows).
- You compile everything into a beautifully synthesized, single comprehensive report rather than just a linear list of posts.
Dependencies
- `playwright-cli` (Must be available on the path)
- `python3` (Required to download images and stitch the raw data markdown)
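A quick preflight check can confirm that both dependencies are on the path before running anything. This is a minimal sketch only: the function name `check_deps` is hypothetical and is not part of the skill.

```bash
# Preflight sketch: report which of the given commands are missing from PATH.
# check_deps is a hypothetical helper, not something the skill ships.
check_deps() {
  missing=0
  for cmd in "$@"; do
    # command -v prints the resolved path and succeeds only if the command exists
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing dependency: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_deps playwright-cli python3 || exit 1
```

The function returns non-zero if any listed command cannot be resolved, so it can gate the extraction step.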
Usage Instructions
Step 1: Run the Extraction Script
Execute the wrapper script in `scripts/run.sh`. It accepts the following arguments:

```bash
/bin/bash <skill_dir>/scripts/run.sh "YOUR KEYWORD" <MAX_POSTS> <OUTPUT_DIRECTORY>
```

- `YOUR KEYWORD`: The search term to look up on Xiaohongshu.
- `<MAX_POSTS>`: (Optional, default = 10) The number of top posts to scan.
- `<OUTPUT_DIRECTORY>`: (Optional, default = `./`) Directory where the raw data and images will be saved.

Example execution:

```bash
/bin/bash ~/.claude/skills/xiaohongshu-search-summarizer/scripts/run.sh "openclaw使用场景" 10 "./xhs_report_openclaw_scenarios"
```
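To make the documented defaults concrete, the sketch below shows how a caller might fill in the two optional arguments explicitly. It is an illustration under assumptions: `resolve_args` is a hypothetical helper, and the real `scripts/run.sh` may handle its arguments differently.

```bash
# Hypothetical helper: apply the documented defaults (MAX_POSTS=10, OUTPUT_DIRECTORY=./).
# This is not the actual argument handling inside scripts/run.sh.
resolve_args() {
  if [ -z "${1:-}" ]; then
    echo "usage: resolve_args KEYWORD [MAX_POSTS] [OUTPUT_DIRECTORY]" >&2
    return 2
  fi
  # Print the effective keyword, post count, and output directory
  printf '%s|%s|%s\n' "$1" "${2:-10}" "${3:-./}"
}
```

For example, `resolve_args "demo"` prints `demo|10|./`, while `resolve_args "demo" 5 out` prints `demo|5|out`.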
Step 2: Read Raw Data & Images
Once the bash script finishes successfully, navigate to the `OUTPUT_DIRECTORY` and use your file reading capabilities to ingest the generated `[keyword]_raw_data.md` file.

Inside this file, you will find descriptions, comments, and file paths pointing to `post_X_img_Y.webp` or `post_X_img_Y.jpg`.
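Before opening the images, it can help to enumerate every image file the raw data refers to, so that none is skipped. The sketch below assumes that `X` and `Y` in `post_X_img_Y` are decimal integers, which the source does not state; `list_post_images` is a hypothetical helper name.

```bash
# List the unique image file names referenced in a raw data file.
# Assumption: X and Y in post_X_img_Y.webp / post_X_img_Y.jpg are decimal integers.
list_post_images() {
  # -o prints only the matching part of each line; sort -u removes duplicates
  grep -oE 'post_[0-9]+_img_[0-9]+\.(webp|jpg)' "$1" | sort -u
}

# Example: list_post_images "./xhs_report_openclaw_scenarios/[keyword]_raw_data.md"
```

The output contains file names only, so any directory prefix used in the raw data still has to be taken from the file itself.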
Step 3: Synthesis & Summarization
This is the most critical step. Do not just return the raw markdown file to the user. Instead, write a polished comprehensive markdown report that reorganizes the information logically, while retaining a high level of detail.
Follow these strict compilation rules:
- Do not list posts individually (e.g. avoid "Post 1: ... Post 2: ...").
- Read the Images: You MUST use your file reading and vision capabilities on the `.webp` or `.jpg` image files found in the raw data directory to interpret their contents.
- Detailed & Comprehensive Synthesis: Provide a highly detailed summary that includes diverse viewpoints, nuances, and specific examples found across different posts. Avoid over-summarizing or losing important context; preserve the richness and diversity of the information.
- Extract and merge themes: Group ideas by concepts, steps, recurring themes, or pros/cons.
- Evaluate comments: Merge insights from valuable comments directly into the core narrative. Skip useless or repetitive comments, but preserve diverse opinions or helpful counter-arguments from the comments section.
- Integrate images contextually: Embed the most relevant and high-quality images directly into the flow of your final report to support the analytical points being made. Describe their visual meaning based on what you saw with your vision capabilities.
- Save to OUTPUT_DIRECTORY: Save your beautifully compiled final Markdown report using your file writing capabilities directly into the same `<OUTPUT_DIRECTORY>` as the raw data (e.g., `<OUTPUT_DIRECTORY>/[keyword]_synthesis.md`), and give the user the path to it.
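After saving the report, one optional sanity check is to confirm that every image embedded in it actually exists on disk. The sketch below assumes the report embeds images with standard Markdown syntax (`![alt](path)`) and uses paths relative to the report's own directory; `check_report_images` is a hypothetical helper, not part of the skill.

```bash
# Verify that every Markdown image target in a report resolves to a real file.
# Assumptions: images use ![alt](path) syntax, and paths are relative to the report.
check_report_images() {
  report="$1"
  report_dir="$(dirname "$report")"
  # Extract each ![alt](target) link, then strip it down to the target path
  grep -oE '!\[[^]]*\]\([^)]+\)' "$report" |
    sed -E 's/^!\[[^]]*\]\(([^)]+)\)$/\1/' |
    (
      status=0
      while IFS= read -r target; do
        if [ ! -f "$report_dir/$target" ]; then
          echo "missing image: $target" >&2
          status=1
        fi
      done
      # The subshell's exit code becomes the function's return code
      exit "$status"
    )
}
```

A non-zero return means at least one embedded image path does not point to a file, which usually indicates a typo in the path.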
Error Handling
If you encounter `404 Not Found` or "element not visible" errors during the browser invocation:
- Keep in mind that Xiaohongshu may demand a login challenge. If the site pauses waiting for a login, instruct the user to check the `playwright-cli` browser window and complete the necessary authentication manually, then run the script again.
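The retry-after-login flow can be sketched as a small wrapper that runs a command, pauses for manual login if it fails, and then tries once more. This assumes the extraction script exits with a non-zero status on failure, which the source does not confirm; `run_with_login_retry` and `SKILL_DIR` are hypothetical names.

```bash
# Run a command; if it fails, wait for the user to log in manually, then retry once.
# Assumption: the wrapped command exits non-zero when the browser run fails.
run_with_login_retry() {
  if "$@"; then
    return 0
  fi
  echo "Run failed. If the browser window shows a Xiaohongshu login prompt," >&2
  echo "complete the login there, then press Enter to retry." >&2
  # Wait for Enter; tolerate a closed stdin so non-interactive runs still retry
  read -r reply || true
  "$@"
}

# Example (SKILL_DIR is a hypothetical variable holding the skill's location):
# run_with_login_retry /bin/bash "$SKILL_DIR/scripts/run.sh" "YOUR KEYWORD" 10 ./out
```

The wrapper retries only once, so a second failure is passed back to the caller rather than looping indefinitely.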