web-fetcher
Web Fetcher
Extract web page content as clean text/markdown from a given URL using a fallback chain of free services.
Usage
```bash
python3 <skill-path>/scripts/fetch.py <url>
```
Save to file:
```bash
python3 <skill-path>/scripts/fetch.py <url> -o output.md
```
Fallback Chain
The script tries these sources in order, falling back on failure:
- Jina Reader (`r.jina.ai/{url}`): best markdown quality, supports JS-rendered pages
- defuddle.md (`defuddle.md/{url}`): by Obsidian creator @kepano
- markdown.new (`markdown.new/{url}`): 3-layer strategy with browser rendering fallback
- OpenCLI: platform-specific commands with browser login state (zhihu, reddit, twitter, weibo)
- Raw HTML: direct fetch as last resort
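The ordering above can be sketched as a small loop. This is an illustrative sketch, not the actual `fetch.py`: the `https://` scheme, the helper names (`http_get`, `candidate_urls`, `fetch_with_fallback`), the timeout, and the "non-empty response counts as success" rule are all assumptions, and the OpenCLI step is left out because it runs a local command rather than an HTTP request.

```python
from typing import Callable, List, Optional
from urllib.request import Request, urlopen

# Prefixes come from the list above; the https:// scheme is an assumption.
PROXY_PREFIXES = [
    "https://r.jina.ai/",
    "https://defuddle.md/",
    "https://markdown.new/",
]


def http_get(url: str, timeout: float = 30.0) -> str:
    """Fetch a URL and return its body as text (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def candidate_urls(url: str) -> List[str]:
    """Return the URLs to try in order: each proxy service, then the raw page."""
    return [prefix + url for prefix in PROXY_PREFIXES] + [url]


def fetch_with_fallback(
    url: str, get: Callable[[str], str] = http_get
) -> Optional[str]:
    """Try each source in order; return the first non-empty body, else None."""
    for candidate in candidate_urls(url):
        try:
            text = get(candidate)
        except OSError:
            # Network errors, HTTP errors, and timeouts all derive from OSError.
            continue
        if text.strip():
            return text
    return None
```

Passing the getter as a parameter keeps the ordering logic testable without network access.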
When to Use
- JS-rendered pages that WebFetch can't handle (Twitter/X, SPAs)
- Login-required pages on supported platforms (zhihu, reddit, twitter, weibo, xiaohongshu)
- Bulk content extraction
- When you need clean markdown instead of summarized content
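For the bulk extraction case, one possible approach is to call the script once per URL with the documented `-o` flag. The sketch below is hypothetical: the hashed file naming, the helper names, and the assumption that `fetch.py` exits with a non-zero status on failure are not confirmed by this document.

```python
import hashlib
import subprocess
import sys
from pathlib import Path
from typing import Callable, Iterable, List


def output_path(url: str, out_dir: Path) -> Path:
    """Derive a stable file name from the URL (naming scheme is an assumption)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return out_dir / f"{digest}.md"


def build_command(script: Path, url: str, out_file: Path) -> List[str]:
    """Mirror the documented call: fetch.py <url> -o <file>."""
    return [sys.executable, str(script), url, "-o", str(out_file)]


def fetch_all(
    urls: Iterable[str],
    script: Path,
    out_dir: Path,
    run: Callable = subprocess.run,
) -> List[str]:
    """Run the script once per URL; return the URLs whose process failed.

    Assumes (unverified) that fetch.py signals failure via its exit code.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in urls:
        result = run(build_command(script, url, output_path(url, out_dir)))
        if result.returncode != 0:
            failed.append(url)
    return failed
```

When fetching many pages this way, the per-service rate limits still apply to each call.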
OpenCLI Supported Platforms
When free services fail, OpenCLI auto-detects the platform from URL and routes to the right command:
| URL Pattern | OpenCLI Command |
|---|---|
| |
| |
| |
| |
| |
Requires: `npm i -g @jackwener/opencli` + Browser Bridge extension in Chrome/Arc.
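Platform detection from a URL might look roughly like the sketch below. The hostnames and the `detect_platform` helper are assumptions made for illustration, based only on the platform names mentioned in this document; the actual URL patterns and OpenCLI commands are not reproduced here, so the function returns a platform label rather than a command.

```python
from typing import Optional
from urllib.parse import urlparse

# Hostnames are illustrative assumptions, not OpenCLI's real routing table.
PLATFORM_HOSTS = {
    "zhihu.com": "zhihu",
    "reddit.com": "reddit",
    "twitter.com": "twitter",
    "x.com": "twitter",
    "weibo.com": "weibo",
    "xiaohongshu.com": "xiaohongshu",
}


def detect_platform(url: str) -> Optional[str]:
    """Return a platform label if the URL's host matches a known domain."""
    host = (urlparse(url).hostname or "").lower()
    for domain, platform in PLATFORM_HOSTS.items():
        # Match the bare domain or any subdomain, but not lookalike hosts.
        if host == domain or host.endswith("." + domain):
            return platform
    return None
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives such as a query string that happens to contain a platform name.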
Limitations
- WeChat Official Account articles (微信公众号) are not supported by any strategy
- OpenCLI requires browser extension setup (one-time)
Rate Limits
| Service | Limit |
|---|---|
| Jina Reader | 20 req/min (free), 10M token key available at jina.ai/reader |
| markdown.new | 500 req/day/IP |
| defuddle.md | Not documented |
| OpenCLI | No documented limits (uses browser session) |
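To stay under a per-minute limit such as Jina Reader's free tier, a caller could throttle requests on the client side. The following is a minimal sliding-window sketch; the class name and its parameters are hypothetical and it is not part of `fetch.py`.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window of `period` seconds."""

    def __init__(
        self,
        max_calls: int,
        period: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps: Deque[float] = deque()

    def wait(self) -> float:
        """Block until a call is allowed; return the seconds slept."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._stamps and now - self._stamps[0] >= self._period:
            self._stamps.popleft()
        delay = 0.0
        if len(self._stamps) >= self._max_calls:
            # Sleep until the oldest recorded call ages out, then drop it.
            delay = self._period - (now - self._stamps[0])
            self._sleep(delay)
            now = self._clock()
            self._stamps.popleft()
        self._stamps.append(now)
        return delay
```

With the figures from the table, `SlidingWindowLimiter(20, 60.0)` would be created once and `wait()` called before each Jina Reader request.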