web-fetch

Web Fetch Skill

Fetch and parse web content from URLs.

When to Use

USE this skill when:
  • "Fetch content from URL"
  • "Download file from..."
  • "Extract article text from..."
  • "Get page title and description"
  • "Scrape data from webpage"

When NOT to Use

DON'T use this skill when:
  • Interactive browser actions → use browser-tools
  • Authenticated sessions → use browser-tools with profile
  • JavaScript-heavy SPAs → use browser-tools

Commands

Fetch Content

bash
{baseDir}/fetch.sh "https://example.com"
{baseDir}/fetch.sh "https://example.com" --markdown
{baseDir}/fetch.sh "https://example.com" --json

Extract Article

bash
{baseDir}/extract.sh "https://example.com/article"
{baseDir}/extract.sh "https://example.com/article" --format markdown

Download File

bash
{baseDir}/download.sh "https://example.com/file.pdf" --out /tmp/file.pdf
{baseDir}/download.sh "https://example.com/archive.zip" --out /tmp/archive.zip

Get Page Metadata

bash
{baseDir}/metadata.sh "https://example.com"
{baseDir}/metadata.sh "https://example.com" --json

Extract Links

bash
{baseDir}/links.sh "https://example.com"
{baseDir}/links.sh "https://example.com" --filter "blog"

Extract Images

bash
{baseDir}/images.sh "https://example.com"
{baseDir}/images.sh "https://example.com" --download --out /tmp/images/

Options

  • --markdown: Output as markdown
  • --json: Output as JSON
  • --text: Plain text output
  • --timeout N: Timeout in seconds (default: 30)
  • --user-agent: Custom user agent
  • --out <path>: Output file path
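
To make the flag list concrete, here is a minimal sketch of how a wrapper script could parse these options. It is not the skill's actual implementation: the function name `parse_fetch_args`, the variable names, and the choice of `text` as the default format are all assumptions made for illustration. Only the flags and the 30-second default timeout come from the list above.

```bash
#!/usr/bin/env bash
# Hypothetical option parser for the flags listed above.
# Assumption: "text" is the default format (the source does not say).
parse_fetch_args() {
  FORMAT="text"; TIMEOUT=30; USER_AGENT=""; OUT=""; URL=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --markdown) FORMAT="markdown" ;;
      --json)     FORMAT="json" ;;
      --text)     FORMAT="text" ;;
      --timeout|--user-agent|--out)
        # These flags take a value; fail if it is missing.
        [ "$#" -ge 2 ] || { echo "missing value for $1" >&2; return 2; }
        case "$1" in
          --timeout)    TIMEOUT="$2" ;;
          --user-agent) USER_AGENT="$2" ;;
          --out)        OUT="$2" ;;
        esac
        shift
        ;;
      -*) echo "unknown option: $1" >&2; return 2 ;;
      *)  URL="$1" ;;
    esac
    shift
  done
  [ -n "$URL" ] || { echo "missing URL" >&2; return 2; }
}

# Usage: parse a command line, then read the resulting variables.
parse_fetch_args "https://example.com" --json --timeout 10
echo "$URL $FORMAT $TIMEOUT"
```

Handling the three value-taking flags in one branch keeps the "missing value" check in a single place.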

Output Formats

Plain Text

Extract visible text from HTML, cleaned of scripts and styles.
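
As a rough illustration of what "cleaned of scripts and styles" can mean, the sketch below strips `<script>` and `<style>` blocks, removes the remaining tags, and collapses whitespace. The function name `html_to_text` is hypothetical, and the skill's real extraction is likely more thorough. For instance, this version does not decode HTML entities or handle comments.

```bash
#!/usr/bin/env bash
# Hypothetical plain-text extractor: reads HTML on stdin, prints visible text.
html_to_text() {
  awk '
    # Join all input lines so that multi-line blocks can be removed.
    { buf = buf $0 " " }
    END {
      buf = strip_blocks(buf, "script")
      buf = strip_blocks(buf, "style")
      gsub(/<[^>]*>/, " ", buf)    # drop the remaining tags
      gsub(/[ \t]+/, " ", buf)     # collapse runs of whitespace
      sub(/^ /, "", buf); sub(/ $/, "", buf)
      print buf
    }
    # Remove every <tag ...>...</tag> block, matching case-insensitively.
    function strip_blocks(s, tag,    lower, start, rel, closer) {
      closer = "</" tag ">"
      while (1) {
        lower = tolower(s)
        start = index(lower, "<" tag)
        if (start == 0) break
        rel = index(substr(lower, start), closer)
        if (rel == 0) { s = substr(s, 1, start - 1); break }
        s = substr(s, 1, start - 1) " " substr(s, start + rel - 1 + length(closer))
      }
      return s
    }
  '
}

# Usage:
printf '%s\n' '<html><head><style>p{color:red}</style><script>var x = "<b>";</script></head><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>' | html_to_text
# prints: Title Hello world
```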

Markdown

Convert HTML to markdown with proper formatting.

JSON

Structured output with title, content, and metadata.

Examples

Get article content:
bash
{baseDir}/extract.sh "https://example.com/blog/post" --markdown
Download all PDFs from a page:
bash
{baseDir}/links.sh "https://example.com" --filter ".pdf" | xargs -I {} {baseDir}/download.sh "{}"
Get page metadata:
bash
{baseDir}/metadata.sh "https://example.com" --json

Output: {"title": "...", "description": "...", "og:image": "..."}
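
The bulk-download example can also be written as an explicit loop that gives each file its own `--out` path, in line with the Download File commands. The following is a sketch under stated assumptions: `links.sh` is assumed to print one absolute URL per line, `SKILL_DIR` is a hypothetical variable standing in for `{baseDir}`, and `url_to_filename` and `download_pdfs` are illustrative helper names, not part of the skill.

```bash
#!/usr/bin/env bash
# Derive a local file name from a URL: drop any query string or fragment,
# then keep the last path segment.
url_to_filename() {
  local url="${1%%[?#]*}"
  basename "$url"
}

# Hypothetical wrapper. SKILL_DIR stands in for {baseDir}.
# Assumes links.sh prints one absolute URL per line.
download_pdfs() {
  local page="$1" dest="$2" url
  mkdir -p "$dest"
  "$SKILL_DIR/links.sh" "$page" --filter ".pdf" |
    while IFS= read -r url; do
      [ -n "$url" ] || continue
      "$SKILL_DIR/download.sh" "$url" --out "$dest/$(url_to_filename "$url")"
    done
}

# Usage (requires the skill scripts to be installed):
#   SKILL_DIR=/path/to/skill download_pdfs "https://example.com" /tmp/pdfs
url_to_filename "https://example.com/docs/report.pdf?download=1"
# prints: report.pdf
```

Reading URLs with `while read` instead of `xargs` avoids surprises with quotes or spaces in URLs, at the cost of a few more lines.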


Notes

  • Respects robots.txt by default
  • Rate limiting: 1 request per second by default
  • Use --user-agent to set a custom user agent
  • For JavaScript-heavy pages, use browser-tools instead
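
When calling the scripts in a loop, the one-request-per-second default can be mirrored on the caller's side. The sketch below is one simple way to do that and is not how the skill itself enforces its limit. The names `throttle` and `RATE_LIMIT_SECONDS` are hypothetical, and because it relies on whole-second timestamps from `date +%s`, the spacing is only approximate.

```bash
#!/usr/bin/env bash
# Hypothetical caller-side throttle: keep roughly RATE_LIMIT_SECONDS
# between consecutive calls. Uses whole-second resolution.
RATE_LIMIT_SECONDS=1
_last_request=0

throttle() {
  local now elapsed
  now=$(date +%s)
  elapsed=$(( now - _last_request ))
  if [ "$elapsed" -lt "$RATE_LIMIT_SECONDS" ]; then
    sleep "$(( RATE_LIMIT_SECONDS - elapsed ))"
  fi
  _last_request=$(date +%s)
}

# Usage: call throttle before each request.
for url in "https://example.com/a" "https://example.com/b"; do
  throttle
  echo "would fetch $url"
done
```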