# Web Search Scraper API Skill
## 📖 Introduction
This skill provides a one-stop web page extraction service built on the BrowserAct Web Search Scraper API template: it extracts structured markdown content directly from any given URL. Simply input the target URL and you get clean, usable markdown data back.
## ✨ Features
- No hallucinations, ensuring stable and precise data extraction: pre-set workflows avoid AI generative hallucinations.
- No human verification issues: no need to deal with reCAPTCHA or other verification challenges.
- No IP access restrictions or geofencing: no need to handle regional IP limitations.
- Faster execution: compared to purely AI-driven browser automation solutions, tasks complete more quickly.
- High cost-effectiveness: compared to AI solutions that consume large numbers of tokens, it significantly reduces the cost of data acquisition.
## 🔑 API Key Configuration
Before running, you must check the `BROWSERACT_API_KEY` environment variable. If it is not set, do not take any other action first; ask the user to provide it and wait.

The Agent must inform the user at this point:

> "Since you have not configured the BrowserAct API Key, please go to the BrowserAct Console first to get your Key."
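The pre-flight check described above can be sketched as a small helper (a minimal sketch; the function name and wording of the message are illustrative, only the `BROWSERACT_API_KEY` variable name comes from this document):

```python
import os
import sys

def require_api_key() -> str:
    """Return the BrowserAct API key, or stop with guidance if it is missing."""
    key = os.environ.get("BROWSERACT_API_KEY")
    if not key:
        # Do not proceed with any other action; ask the user for the key first.
        sys.exit(
            "BROWSERACT_API_KEY is not set. "
            "Please go to the BrowserAct Console first to get your Key."
        )
    return key
```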
## 🛠️ Input Parameter Details
The Agent should configure the following parameters flexibly, based on user needs, when calling the script:

- `target_url`
  - Type: `string`
  - Description: The website URL to extract content from. Supports any HTTP/HTTPS URL.
  - Example: `https://www.browseract.com`
## 🚀 Invocation Method (Recommended)
The Agent should execute the following standalone script to get the result with one command:

```bash
# Example invocation
python -u ./scripts/web_search_scraper_api.py "target_url"
```
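For Agents that drive the script programmatically rather than from a terminal, a minimal subprocess wrapper might look like this (a sketch: the default command mirrors the invocation example above, while the `command` override and the timeout value are illustrative assumptions):

```python
import subprocess
from typing import Optional, Sequence

def run_scraper(target_url: str,
                command: Optional[Sequence[str]] = None,
                timeout_s: int = 600) -> str:
    """Run the scraper script for target_url and return its stdout."""
    # Default to the documented invocation; allow overriding for testing.
    cmd = list(command) if command else [
        "python", "-u", "./scripts/web_search_scraper_api.py"
    ]
    result = subprocess.run(
        cmd + [target_url],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    if result.returncode != 0:
        raise RuntimeError(f"Scraper failed: {result.stderr.strip()}")
    return result.stdout
```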
## ⏳ Execution Status Monitoring
Since the task involves automated browser operations, it may take a long time (several minutes). While running, the script continuously outputs timestamped status logs (e.g., `[14:30:05] Task Status: running`).

Notice for the Agent:

- While waiting for the script to return results, keep watching the terminal output.
- As long as the terminal is still printing new status logs, the task is running normally. Do not misjudge it as a deadlock or unresponsiveness.
- If the status remains unchanged for a long time, or the script stops printing output without returning a result, the retry mechanism may be triggered.
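An Agent watching the log stream can distinguish status lines from other output with a small parser (a sketch: the regex is derived from the `[14:30:05] Task Status: running` example above; the exact log format beyond that example is an assumption):

```python
import re
from typing import Optional, Tuple

# Matches timestamped status lines such as "[14:30:05] Task Status: running".
STATUS_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] Task Status: (\w+)$")

def parse_status(line: str) -> Optional[Tuple[str, str]]:
    """Return (timestamp, status) for a status log line, or None for other output."""
    m = STATUS_RE.match(line.strip())
    return (m.group(1), m.group(2)) if m else None
```

As long as `parse_status` keeps yielding fresh timestamps, the task is alive; a long gap with no new status line is the signal to consider a retry.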
## 📊 Data Output Description
Upon successful execution, the script directly parses and prints the result from the API response. The result contains:

- `content`: The complete markdown content of the webpage.
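Consuming the result can be sketched as follows (an assumption-laden sketch: only the `content` field is documented above; that the printed result is a JSON object is an assumption about the envelope):

```python
import json

def extract_content(raw_output: str) -> str:
    """Pull the markdown `content` field out of the script's printed result.

    Assumes the result is printed as a JSON object; only the `content`
    key is documented, the surrounding envelope is an assumption.
    """
    data = json.loads(raw_output)
    return data["content"]
```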
## ⚠️ Error Handling & Retry Mechanism
If an error occurs while the script is running (such as a network fluctuation or task failure), the Agent should follow this logic:

- Check the output content:
  - If the output contains `"Invalid authorization"`, the API Key is invalid or expired. Do not retry; guide the user to recheck and provide the correct API Key.
  - If the output does not contain `"Invalid authorization"` but the task execution fails (for example, the output starts with `Error:` or the returned result is empty), the Agent should automatically re-execute the script once.
- Retry limit:
  - Automatic retry is limited to once. If the second attempt still fails, stop retrying and report the specific error message to the user.
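The retry policy above can be sketched as a small wrapper (the `run_once` callable stands in for invoking the script; the function and return-value names are illustrative, while the `"Invalid authorization"` and `Error:` markers come from this document):

```python
def run_with_retry(run_once):
    """Apply the retry policy: never retry auth errors, retry other
    failures exactly once, then report the last error."""
    last_error = None
    for attempt in (1, 2):
        output = run_once()
        if "Invalid authorization" in output:
            # API Key invalid or expired: do not retry, ask for a valid key.
            return ("auth_error", output)
        if output.strip() and not output.startswith("Error:"):
            return ("ok", output)      # success
        last_error = output            # empty or Error: output -> retry once
    return ("failed", last_error)      # second attempt also failed; report it
```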
## 🌟 Typical Use Cases
- Article Extraction: Scrape the main content of a news article link into markdown.
- Blog Post Parsing: Download the readable text from a target blog post URL.
- Webpage to Markdown: Convert any given website URL into clean markdown format.
- Documentation Scraping: Fetch the contents of a tutorial or documentation page for offline reading.
- Content Monitoring: Automatically extract the text from a specific webpage for updates.
- Data Processing: Parse the HTML of an arbitrary HTTP/HTTPS URL to structure its content.