web-scraper

Web Scraper

Fetch web page content and convert to clean markdown format.

Usage

Run the fetch script to get web content:

```bash
python3 scripts/fetch_url.py <url> [options]
```

Options

- `--timeout <seconds>`: Request timeout in seconds (default: 30)
- `--max-length <chars>`: Maximum output length in characters (default: 100000)
- `--raw`: Output raw HTML instead of markdown

Examples

Fetch single URL:

```bash
python3 scripts/fetch_url.py "https://example.com/article"
```

Fetch with custom timeout:

```bash
python3 scripts/fetch_url.py "https://example.com/article" --timeout 60
```

Fetch multiple URLs in parallel:

```bash
# Start one background fetch per URL, then wait for all of them to finish
for url in "https://url1.com" "https://url2.com"; do
  python3 scripts/fetch_url.py "$url" &
done
wait
```
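
When the exit status of each fetch matters, the same parallel pattern can be driven from Python. The following is a minimal sketch, not part of the script itself: `fetch_one`, `fetch_all`, and `FETCH_COMMAND` are hypothetical names, and it assumes the script prints its result to stdout and exits with a non-zero status on failure.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumption: the script writes its output to stdout and exits non-zero on failure.
FETCH_COMMAND = ("python3", "scripts/fetch_url.py")


def fetch_one(url, timeout=30, command=FETCH_COMMAND):
    """Run the fetch script for one URL and return (url, exit_code, stdout)."""
    proc = subprocess.run(
        [*command, url, "--timeout", str(timeout)],
        capture_output=True,
        text=True,
    )
    return url, proc.returncode, proc.stdout


def fetch_all(urls, timeout=30, max_workers=4, command=FETCH_COMMAND):
    """Fetch several URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: fetch_one(u, timeout, command), urls))
```

Calling `fetch_all(["https://url1.com", "https://url2.com"])` returns one `(url, exit_code, stdout)` tuple per URL, so failed fetches can be told apart from successful ones, which the plain `wait` loop does not report.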

Workflow

1. Single URL: Run `fetch_url.py` with the URL
2. Multiple URLs: Run multiple fetch commands in parallel using background processes
3. Handle errors: If a URL fails, check:
   - Network connectivity
   - URL validity
   - Whether the website blocks automated requests (try a different User-Agent or use browser automation)
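
Two of these checks can be sketched with the Python standard library: validating the URL before fetching, and sending a different User-Agent. This is only an illustration under stated assumptions. The documented options do not include a User-Agent flag, so the sketch makes the request directly with `urllib.request` rather than through `fetch_url.py`; the function names and the User-Agent string are hypothetical.

```python
import urllib.request
from urllib.parse import urlsplit

# Hypothetical User-Agent string, chosen only for illustration.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-fetcher/0.1)"


def is_valid_url(url):
    """Return True for absolute http(s) URLs that include a host."""
    try:
        parts = urlsplit(url)
    except ValueError:  # raised for some malformed hosts, e.g. broken IPv6 literals
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Build a request that carries a custom User-Agent header."""
    if not is_valid_url(url):
        raise ValueError(f"not a valid http(s) URL: {url!r}")
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=DEFAULT_USER_AGENT, timeout=30):
    """Fetch the page body; network and HTTP errors propagate to the caller."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

A URL such as `example.com/article` fails `is_valid_url` because it has no scheme, which is a common cause of a failed fetch that has nothing to do with the network or the target site.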

Output Format

The script converts HTML to clean markdown:

- Headings → `#`, `##`, `###`, etc.
- Lists → `-` for unordered, `1.` for ordered
- Bold/Italic → `**bold**`, `*italic*`
- Code blocks are preserved
- Navigation, footers, and ads are removed
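
To make these mappings concrete, here is a simplified sketch of such a conversion built on the standard library's `html.parser`. It is not the implementation used by `fetch_url.py`, which is not shown here; `MarkdownSketch` and `html_to_markdown` are hypothetical names, and the sketch covers only headings, lists, bold/italic, and the removal of `nav`/`footer` elements (code blocks and ad detection are left out).

```python
from html.parser import HTMLParser

SKIPPED = {"nav", "footer", "script", "style"}  # dropped together with their contents
HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}
INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}


class MarkdownSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []        # markdown fragments, joined at the end
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.lists = []      # stack of [list_tag, item_counter]

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in HEADINGS:
            self.out.append("\n" + HEADINGS[tag] + " ")
        elif tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag in ("ul", "ol"):
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            current = self.lists[-1]
            current[1] += 1
            marker = "-" if current[0] == "ul" else f"{current[1]}."
            self.out.append("\n" + marker + " ")
        elif tag == "p":
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag in ("ul", "ol") and self.lists:
            self.lists.pop()
        elif tag in HEADINGS or tag == "p":
            self.out.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def html_to_markdown(html):
    """Convert a small HTML fragment to markdown, one block per line."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.out).splitlines()]
    return "\n".join(line for line in lines if line)
```

For example, `html_to_markdown("<nav>Home</nav><h1>Title</h1><ul><li>one</li></ul>")` yields the two lines `# Title` and `- one`, with the navigation text dropped.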

Troubleshooting

- 403 Forbidden: The website blocks automated requests. Consider:
  - Some sites require JavaScript rendering (not supported by this script)
  - Try accessing from a different network
- Timeout errors: Increase the timeout with `--timeout 60`
- Empty content: The website may require JavaScript to render content
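
The timeout and empty-content cases can be combined into a small retry wrapper. This is a sketch under assumptions the documentation above does not confirm: that the script exits with a non-zero status when a fetch fails or times out, and that empty stdout means empty content. `fetch_with_retries` and `FETCH_COMMAND` are hypothetical names.

```python
import subprocess

# Assumption: non-zero exit status signals a failed or timed-out fetch.
FETCH_COMMAND = ("python3", "scripts/fetch_url.py")


def fetch_with_retries(url, timeouts=(30, 60, 120), command=FETCH_COMMAND):
    """Try each timeout in turn; return (timeout_used, output) or raise."""
    for timeout in timeouts:
        proc = subprocess.run(
            [*command, url, "--timeout", str(timeout)],
            capture_output=True,
            text=True,
        )
        # Treat empty output as a failure too (assumed to mean empty content).
        if proc.returncode == 0 and proc.stdout.strip():
            return timeout, proc.stdout
    raise RuntimeError(f"fetch failed for {url} with timeouts {timeouts}")
```

If every attempt fails or returns nothing, retrying further is unlikely to help, and the page most likely needs JavaScript rendering or blocks automated requests.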
