web-scraper

Web Scraper

Fetch web page content and convert to clean markdown format.

Usage

Run the fetch script to get web content:
bash
python3 scripts/fetch_url.py <url> [options]

Options

  • --timeout <seconds>: Request timeout in seconds (default: 30)
  • --max-length <chars>: Maximum output length in characters (default: 100000)
  • --raw: Output raw HTML instead of markdown
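The source of fetch_url.py is not shown here, so the following is only a sketch of how the three documented options could be declared with Python's standard argparse module. The function name build_parser and the choice of int for --timeout are assumptions; the real script may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(
        prog="fetch_url.py",
        description="Fetch web page content and convert it to markdown.",
    )
    parser.add_argument("url", help="URL to fetch")
    # Assumption: whole seconds; the real script might accept floats.
    parser.add_argument("--timeout", type=int, default=30, metavar="seconds",
                        help="request timeout (default: 30)")
    parser.add_argument("--max-length", type=int, default=100000, metavar="chars",
                        help="maximum output length (default: 100000)")
    parser.add_argument("--raw", action="store_true",
                        help="output raw HTML instead of markdown")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["https://example.com/article", "--timeout", "60"])
print(args.url, args.timeout, args.max_length, args.raw)
```

Note that argparse exposes --max-length as the attribute max_length.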
  • --timeout <seconds>
    :请求超时时间(默认值:30)
  • --max-length <chars>
    :最大输出长度(默认值:100000)
  • --raw
    :输出原始HTML而非Markdown

Examples

Fetch a single URL:
bash
python3 scripts/fetch_url.py "https://example.com/article"
Fetch with a custom timeout:
bash
python3 scripts/fetch_url.py "https://example.com/article" --timeout 60
Fetch multiple URLs in parallel:
bash
for url in "https://url1.com" "https://url2.com"; do
  python3 scripts/fetch_url.py "$url" &
done
wait
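With background shell jobs, the output of several fetches may end up interleaved on the same terminal. As an alternative, here is a minimal Python sketch that runs the same command for each URL in a thread pool and keeps each result separate. The helper names (build_command, run_fetch, fetch_all) are hypothetical, and the sketch assumes the script writes its result to standard output.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(url, timeout=30):
    """Build the documented command line for one URL."""
    return ["python3", "scripts/fetch_url.py", url, "--timeout", str(timeout)]


def run_fetch(url, timeout=30):
    """Run one fetch and return (url, exit code, captured stdout)."""
    proc = subprocess.run(build_command(url, timeout),
                          capture_output=True, text=True)
    return url, proc.returncode, proc.stdout


def fetch_all(urls, fetch=run_fetch, max_workers=4):
    """Fetch several URLs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Calling fetch_all(["https://url1.com", "https://url2.com"]) would return one tuple per URL. The fetch parameter exists so the subprocess call can be swapped out, for example in tests.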

Workflow

  1. Single URL: Run fetch_url.py with the URL.
  2. Multiple URLs: Run multiple fetch commands in parallel using background processes.
  3. Handle errors: If a URL fails, check:
    • Network connectivity
    • URL validity
    • Whether the website blocks automated requests (try a different User-Agent or use browser automation)
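The documented options do not include a way to change the User-Agent, so trying a different one would mean sending the request yourself or editing the script. The sketch below shows one way to set the header with the standard urllib.request module. The helper names and the example User-Agent string are assumptions for illustration, not part of fetch_url.py.

```python
import urllib.request

# Example value only; any realistic browser string could be substituted.
BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/120.0 Safari/537.36")


def build_request(url, user_agent=BROWSER_UA):
    """Create a request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=30, user_agent=BROWSER_UA):
    """Download a page and decode it using the charset the server reports."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A site that still refuses the request after this change is likely checking more than the header, in which case browser automation is the remaining option named above.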

Output Format

The script converts HTML to clean markdown:
  • Headings → #, ##, ###, etc.
  • Lists → - for unordered, 1. for ordered
  • Bold/Italic → **bold**, *italic*
  • Code blocks are preserved
  • Navigation, footer, and ads are removed
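To make the mapping above concrete, here is a small sketch built on the standard html.parser module. It is not the script's actual converter, which is not shown here. It covers only headings, lists, bold/italic, and dropping nav and footer elements; code blocks and ad removal are left out to keep it short. The names MarkdownSketch and html_to_markdown are hypothetical.

```python
import re
from html.parser import HTMLParser

SKIPPED = {"nav", "footer", "script", "style"}   # elements dropped entirely
HEADINGS = {f"h{n}": "#" * n for n in range(1, 7)}
INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}


class MarkdownSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []         # collected output fragments
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.lists = []       # stack of [list tag, item counter]

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        if self.skip_depth:
            return
        if tag in HEADINGS:
            self.out.append("\n\n" + HEADINGS[tag] + " ")
        elif tag in ("ul", "ol"):
            self.lists.append([tag, 0])
            self.out.append("\n\n")
        elif tag == "li":
            current = self.lists[-1] if self.lists else ["ul", 0]
            current[1] += 1
            marker = "- " if current[0] == "ul" else f"{current[1]}. "
            self.out.append("\n" + marker)
        elif tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag == "p":
            self.out.append("\n\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag in HEADINGS or tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            if self.lists:
                self.lists.pop()
            self.out.append("\n\n")
        elif tag in INLINE:
            self.out.append(INLINE[tag])

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse runs of whitespace, as a browser would.
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)   # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)          # at most one blank line
    return text.strip()


print(html_to_markdown("<nav>Menu</nav><h2>Title</h2><p>Some <b>bold</b> text.</p>"))
```

A production converter would also need to handle links, images, tables, nested lists, and preformatted code, which is why established HTML-to-markdown libraries are usually preferable to a hand-written parser.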

Troubleshooting

403 Forbidden: The website blocks automated requests. Consider:
  • Some sites require JavaScript rendering, which this script does not support
  • Try accessing the site from a different network
Timeout errors: Increase the timeout with --timeout 60.
Empty content: The website may require JavaScript to render its content.
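The timeout advice can be automated with a small wrapper that retries with progressively larger --timeout values. This is a sketch under two assumptions the documentation does not confirm: that fetch_url.py exits with a non-zero status on failure, and that it prints its result to standard output. The helper names and the timeout sequence are illustrative.

```python
import subprocess


def run_fetch(url, timeout):
    """Run the fetch script once and return (exit code, stdout)."""
    cmd = ["python3", "scripts/fetch_url.py", url, "--timeout", str(timeout)]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout


def fetch_with_retries(url, timeouts=(30, 60, 120), run=run_fetch):
    """Try each timeout in turn; return the first non-empty successful output."""
    for timeout in timeouts:
        code, output = run(url, timeout)
        # Assumption: exit code 0 with non-empty output means success.
        if code == 0 and output.strip():
            return output
    # Every attempt failed or came back empty.
    return None
```

A None result after the longest timeout suggests the problem is not speed. The page may be blocking the request or may need JavaScript rendering, as described above.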