web-scraper
Web Scraper
Fetch web page content and convert to clean markdown format.
Usage
Run the fetch script to get web content:
bash
python3 scripts/fetch_url.py <url> [options]
Options
- --timeout <seconds>: Request timeout in seconds (default: 30)
- --max-length <chars>: Maximum output length in characters (default: 100000)
- --raw: Output raw HTML instead of markdown
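When the script is driven from other Python code, the options above can be assembled into a single argument list. The following is a minimal sketch: the helper name build_fetch_command is hypothetical and not part of the script, and it uses only the flags and defaults documented above.

```python
def build_fetch_command(url, timeout=30, max_length=100000, raw=False,
                        script="scripts/fetch_url.py"):
    """Return an argv list for fetch_url.py (hypothetical helper)."""
    cmd = [
        "python3", script, url,
        "--timeout", str(timeout),        # request timeout in seconds
        "--max-length", str(max_length),  # cap on output length in characters
    ]
    if raw:
        cmd.append("--raw")  # ask for raw HTML instead of markdown
    return cmd
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with URLs that contain characters such as & or ?.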
Examples
Fetch single URL:
bash
python3 scripts/fetch_url.py "https://example.com/article"
Fetch with custom timeout:
bash
python3 scripts/fetch_url.py "https://example.com/article" --timeout 60
Fetch multiple URLs in parallel:
bash
for url in "https://url1.com" "https://url2.com"; do
  python3 scripts/fetch_url.py "$url" &
done
wait
Workflow
- Single URL: Run fetch_url.py with the URL
- Multiple URLs: Run multiple fetch commands in parallel using background processes
- Handle errors: If a URL fails, check:
  - Network connectivity
  - URL validity
  - Whether the website blocks automated requests (try a different User-Agent or use browser automation)
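The workflow above can also be driven from Python when the results and failures need to be collected in one place. This is a sketch under stated assumptions: fetch_one and fetch_all are hypothetical names, a thread pool stands in for shell background processes, and a non-zero exit status is assumed (not confirmed by the script's documentation) to signal a failed fetch.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def fetch_one(url, script="scripts/fetch_url.py", timeout=30):
    """Run the fetch script for one URL; return (url, ok, output_or_error)."""
    try:
        proc = subprocess.run(
            ["python3", script, url, "--timeout", str(timeout)],
            capture_output=True, text=True,
            timeout=timeout + 10,  # hard cap so a hung child cannot block forever
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return url, False, str(exc)
    # Assumption: a non-zero exit status means the fetch failed.
    if proc.returncode != 0:
        return url, False, proc.stderr.strip()
    return url, True, proc.stdout

def fetch_all(urls, fetch=fetch_one, max_workers=4):
    """Fetch URLs concurrently; return (results, failures), both keyed by URL."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, ok, payload in pool.map(fetch, urls):
            (results if ok else failures)[url] = payload
    return results, failures
```

URLs that end up in failures are the ones to check against the list above: connectivity, URL validity, and possible blocking of automated requests.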
Output Format
The script converts HTML to clean markdown:
- Headings → #, ##, ###, etc.
- Lists → - for unordered, 1. for ordered
- Bold/Italic → **bold**, *italic*
- Code blocks preserved
- Navigation, footer, and ads removed
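To make the mapping above concrete, here is a toy converter built on Python's standard html.parser module. It is an illustration only and not the implementation used by fetch_url.py: the names TinyMarkdown and to_markdown are hypothetical, and it covers just headings, lists, bold/italic, and the removal of nav and footer elements.

```python
from html.parser import HTMLParser

class TinyMarkdown(HTMLParser):
    """Toy HTML-to-markdown converter (illustration, not the real script)."""
    SKIP = {"nav", "footer", "script", "style"}
    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}
    HEADINGS = {f"h{n}" for n in range(1, 7)}

    def __init__(self):
        super().__init__()
        self.out = []        # collected output fragments
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.lists = []      # stack of [list_tag, item_counter]

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        if self.skip_depth:
            return
        if tag in self.HEADINGS:
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag in ("ul", "ol"):
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            current = self.lists[-1]
            current[1] += 1
            marker = "- " if current[0] == "ul" else f"{current[1]}. "
            self.out.append("\n" + marker)
        elif tag == "p":
            self.out.append("\n")
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag in ("ul", "ol") and self.lists:
            self.lists.pop()
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag in self.HEADINGS or tag == "p":
            self.out.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

def to_markdown(html):
    """Convert an HTML string to markdown, dropping blank lines."""
    parser = TinyMarkdown()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.out).splitlines())
    return "\n".join(line for line in lines if line)
```

For example, to_markdown("<nav>Home</nav><h2>Intro</h2><ul><li>one</li></ul>") drops the navigation text and yields a ## heading followed by a - list item.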
Troubleshooting
403 Forbidden: Website blocks automated requests. Consider:
- Some sites require JavaScript rendering (not supported by this script)
- Try accessing from a different network
Timeout errors: Increase timeout with --timeout 60
Empty content: Website may require JavaScript to render content
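The timeout and empty-content cases above call for different reactions: a timeout is worth retrying with a larger value, while empty output usually is not. The sketch below shows one way to encode that. The function name and the run callback are hypothetical; run(url, timeout) is assumed to return the fetched text or raise TimeoutError, which is not something the script itself is documented to do.

```python
def fetch_with_escalating_timeout(url, run, timeouts=(30, 60, 120)):
    """Try each timeout in turn and return the first non-empty output.

    `run(url, timeout)` is a caller-supplied function (hypothetical) that
    returns the fetched text or raises TimeoutError.
    """
    for timeout in timeouts:
        try:
            text = run(url, timeout)
        except TimeoutError:
            continue  # retry with the next, longer timeout
        if not text.strip():
            # Empty output suggests JavaScript rendering; a longer timeout will not help.
            raise ValueError(f"empty content from {url}; page may require JavaScript")
        return text
    raise TimeoutError(f"{url} timed out after {len(timeouts)} attempts")
```

A wrapper around subprocess.run that translates the script's failure mode into TimeoutError would be needed to use this with fetch_url.py.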