algo-seo-crawl

Web Crawler

Overview

A web crawler systematically traverses web pages by discovering URLs, fetching content, parsing HTML, and storing results. It uses BFS or priority-based frontier management. Performance is I/O-bound, typically limited by politeness constraints rather than compute.
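
A minimal sketch of the two frontier shapes mentioned above, using only standard-library containers; the priority scores are illustrative (e.g. depth or an importance estimate), not part of any particular crawler library:

```python
import heapq
from collections import deque

# FIFO frontier: plain breadth-first order.
fifo_frontier = deque(["https://example.com/"])
fifo_frontier.append("https://example.com/about")
next_url = fifo_frontier.popleft()  # homepage is crawled first

# Priority frontier: lower score = crawl sooner (e.g. shallower depth first).
priority_frontier = []
heapq.heappush(priority_frontier, (2, "https://example.com/blog/archive"))
heapq.heappush(priority_frontier, (0, "https://example.com/"))
score, next_url = heapq.heappop(priority_frontier)  # homepage again, despite insertion order
```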

When to Use

Trigger conditions:
  • Building a site audit tool to discover all pages and their link structure
  • Collecting structured data from websites at scale
  • Mapping site architecture for SEO analysis
When NOT to use:
  • When you need data from a single API endpoint (use HTTP client directly)
  • When a sitemap.xml provides all needed URLs (parse sitemap instead)

Algorithm

IRON LAW: Respect robots.txt and Rate Limits
A crawler MUST:
1. Parse and obey robots.txt before crawling any path
2. Enforce crawl-delay (default 1s if unspecified)
3. Identify itself with a descriptive User-Agent
Ignoring these is unethical and will get your IP blocked.
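
A minimal sketch of these three rules using Python's standard-library robots.txt parser; the `USER_AGENT` string is a hypothetical example, and the 1-second fallback mirrors the default stated above:

```python
import time
import urllib.robotparser
from urllib.parse import urljoin

USER_AGENT = "ExampleAuditBot/1.0 (+https://example.com/bot-info)"  # hypothetical, descriptive UA

def load_robots(domain_root: str) -> urllib.robotparser.RobotFileParser:
    """Fetch and parse robots.txt before crawling any path on this domain."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(domain_root, "/robots.txt"))
    rp.read()
    return rp

def polite_delay(rp: urllib.robotparser.RobotFileParser) -> float:
    """Honor crawl-delay if the site declares one; otherwise fall back to 1 second."""
    return rp.crawl_delay(USER_AGENT) or 1.0

rp = load_robots("https://example.com")
url = "https://example.com/some/path"
if rp.can_fetch(USER_AGENT, url):
    time.sleep(polite_delay(rp))  # enforce the delay between requests to this host
    # ... fetch the page with headers={"User-Agent": USER_AGENT} ...
```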

Phase 1: Input Validation

Parse seed URLs, fetch and parse robots.txt for each domain, set crawl scope (same-domain, subdomain, or cross-domain). Gate: Valid seed URLs, robots.txt rules loaded, scope defined.
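
The scope check itself is a small function; a sketch under the assumption that scope is expressed as one of the three modes named above (the function and parameter names are illustrative):

```python
from urllib.parse import urlparse

def in_scope(url: str, seed_host: str, scope: str = "same-domain") -> bool:
    """Return True if url falls inside the configured crawl scope."""
    host = (urlparse(url).hostname or "").lower()
    if scope == "same-domain":
        return host == seed_host
    if scope == "subdomain":
        return host == seed_host or host.endswith("." + seed_host)
    return True  # "cross-domain": no host restriction

assert in_scope("https://example.com/page", "example.com")
assert in_scope("https://blog.example.com/", "example.com", scope="subdomain")
assert not in_scope("https://other.org/", "example.com")
```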

Phase 2: Core Algorithm

  1. Initialize URL frontier with seed URLs (priority queue or FIFO)
  2. Dequeue URL, check: not visited, allowed by robots.txt, within scope
  3. Fetch page with timeout and retry logic, respect crawl-delay
  4. Parse HTML: extract links (normalize, deduplicate), extract content/metadata
  5. Enqueue discovered URLs, store parsed data
  6. Repeat until frontier empty or limit reached
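
A single-threaded sketch of this loop, assuming the third-party `requests` and `beautifulsoup4` packages; it simplifies to same-domain scope, skips retries, and uses a hypothetical `USER_AGENT` string:

```python
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse, urldefrag

import requests                # assumed dependency: pip install requests
from bs4 import BeautifulSoup  # assumed dependency: pip install beautifulsoup4

USER_AGENT = "ExampleAuditBot/1.0 (+https://example.com/bot-info)"

def crawl(seed: str, max_depth: int = 2, max_pages: int = 100) -> list[dict]:
    seed_host = urlparse(seed).hostname
    rp = urllib.robotparser.RobotFileParser(urljoin(seed, "/robots.txt"))
    rp.read()
    delay = rp.crawl_delay(USER_AGENT) or 1.0

    frontier = deque([(seed, 0)])                    # 1. initialize frontier with seed
    visited, pages = set(), []

    while frontier and len(pages) < max_pages:       # 6. stop when empty or limit reached
        url, depth = frontier.popleft()              # 2. dequeue and filter
        url = urldefrag(url)[0]
        if url in visited or depth > max_depth:
            continue
        if urlparse(url).hostname != seed_host or not rp.can_fetch(USER_AGENT, url):
            continue
        visited.add(url)

        time.sleep(delay)                            # 3. fetch politely, with a timeout
        try:
            resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        except requests.RequestException:
            continue

        soup = BeautifulSoup(resp.text, "html.parser")  # 4. parse links and metadata
        links = {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}
        pages.append({"url": url, "status": resp.status_code,
                      "title": soup.title.string if soup.title else None,
                      "links_out": len(links), "depth": depth})

        for link in links:                           # 5. enqueue discovered URLs
            frontier.append((link, depth + 1))
    return pages

# Matches the Sample I/O below: homepage at depth 0, linked pages at depths 1-2.
# pages = crawl("https://example.com", max_depth=2, max_pages=100)
```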

Phase 3: Verification

Check: no robots.txt violations in crawl log, no duplicate pages stored, all discovered URLs accounted for. Gate: Crawl completed within scope, politeness maintained.
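
One way to mechanize part of this gate, assuming the page records produced by the loop above (field names follow the Output Format below) and the parsed robots.txt for the crawled domain:

```python
def verify_crawl(pages: list[dict], rp, user_agent: str) -> list[str]:
    """Post-crawl audit: report robots.txt violations and duplicate stored pages."""
    problems, seen = [], set()
    for page in pages:
        if not rp.can_fetch(user_agent, page["url"]):
            problems.append(f"robots.txt violation: {page['url']}")
        if page["url"] in seen:
            problems.append(f"duplicate page stored: {page['url']}")
        seen.add(page["url"])
    return problems  # an empty list means the politeness and deduplication checks pass
```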

Phase 4: Output

Return site map with pages, link graph, and extracted metadata.

Output Format

```json
{
  "pages": [{"url": "...", "status": 200, "title": "...", "links_out": 15, "depth": 2}],
  "metadata": {"pages_crawled": 500, "errors": 12, "duration_seconds": 300, "domain": "example.com"}
}
```

Examples

Sample I/O

Input: Seed: "https://example.com", max_depth: 2, max_pages: 100
Expected: Crawl tree with homepage at depth 0, linked pages at depth 1-2, respecting robots.txt

Edge Cases

| Input | Expected | Why |
| --- | --- | --- |
| robots.txt disallows / | Zero pages crawled | Must respect full disallow |
| Redirect loop | Stop after 5 redirects | Prevent infinite loop |
| Soft 404 (200 with error page) | Flag as soft 404 | Status code alone is insufficient |
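
A sketch of handling the last two rows with `requests`, capping the redirect chain and flagging likely soft 404s; the error-page phrases are illustrative heuristics, not a definitive detector:

```python
import requests

def fetch_with_limits(url: str, user_agent: str, max_redirects: int = 5):
    """Cap redirect chains and flag 200 responses that look like error pages."""
    session = requests.Session()
    session.max_redirects = max_redirects  # requests raises TooManyRedirects beyond this
    resp = session.get(url, headers={"User-Agent": user_agent}, timeout=10)

    error_markers = ("page not found", "does not exist", "no longer available")  # heuristic
    soft_404 = resp.status_code == 200 and any(m in resp.text.lower() for m in error_markers)
    return resp, soft_404
```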

Gotchas

  • URL normalization: http://Example.COM/path/ and http://example.com/path are the same URL. Normalize: lowercase host, remove default port, remove trailing slash, sort query params (see the sketch after this list).
  • JavaScript-rendered content: A basic HTTP fetch misses JS-rendered content. Use headless browser (Playwright/Puppeteer) for SPAs.
  • Trap detection: Calendar pages, session IDs in URLs, and infinite pagination create crawler traps. Set max depth and URL pattern limits.
  • Rate limiting yourself: Parallel fetching without per-domain rate limiting will overwhelm small servers. Use per-domain semaphores.
  • Character encoding: Not all pages are UTF-8. Detect encoding from HTTP headers and meta tags; fall back to charset detection libraries.
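
A sketch of a normalizer implementing the rules from the first bullet (lowercase host, drop default port, drop trailing slash, sort query parameters); full RFC 3986 normalization covers more cases, see the references below:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Canonicalize a URL so equivalent forms deduplicate to the same key."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    port = parts.port
    if port and (parts.scheme, port) not in (("http", 80), ("https", 443)):
        host = f"{host}:{port}"                      # keep only non-default ports
    path = parts.path.rstrip("/") or "/"             # treat /path/ and /path as the same page
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))

assert normalize_url("http://Example.COM:80/path/") == normalize_url("http://example.com/path")
```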

References

  • For URL normalization rules (RFC 3986), see
    references/url-normalization.md
  • For distributed crawling architecture, see
    references/distributed-crawl.md