algo-seo-crawl

Web Crawler

Overview

A web crawler systematically traverses web pages by discovering URLs, fetching content, parsing HTML, and storing results. It uses BFS or priority-based frontier management. Performance is I/O-bound, typically limited by politeness constraints rather than compute.
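
A minimal sketch of the two frontier shapes mentioned above, using only standard-library containers; the priority scores are illustrative (e.g. depth or an importance estimate), not part of any particular crawler library:

```python
import heapq
from collections import deque

# FIFO frontier: plain breadth-first order.
fifo_frontier = deque(["https://example.com/"])
fifo_frontier.append("https://example.com/about")
next_url = fifo_frontier.popleft()  # homepage is crawled first

# Priority frontier: lower score = crawl sooner (e.g. shallower depth first).
priority_frontier = []
heapq.heappush(priority_frontier, (2, "https://example.com/blog/archive"))
heapq.heappush(priority_frontier, (0, "https://example.com/"))
score, next_url = heapq.heappop(priority_frontier)  # homepage again, despite insertion order
```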

When to Use

Trigger conditions:
  • Building a site audit tool to discover all pages and their link structure
  • Collecting structured data from websites at scale
  • Mapping site architecture for SEO analysis
When NOT to use:
  • When you need data from a single API endpoint (use HTTP client directly)
  • When a sitemap.xml provides all needed URLs (parse sitemap instead)

Algorithm

IRON LAW: Respect robots.txt and Rate Limits
A crawler MUST:
1. Parse and obey robots.txt before crawling any path
2. Enforce crawl-delay (default 1s if unspecified)
3. Identify itself with a descriptive User-Agent
Ignoring these is unethical and will get your IP blocked.
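
A minimal sketch of these three rules using Python's standard-library robots.txt parser; the `USER_AGENT` string is a hypothetical example, and the 1-second fallback mirrors the default stated above:

```python
import time
import urllib.robotparser
from urllib.parse import urljoin

USER_AGENT = "ExampleAuditBot/1.0 (+https://example.com/bot-info)"  # hypothetical, descriptive UA

def load_robots(domain_root: str) -> urllib.robotparser.RobotFileParser:
    """Fetch and parse robots.txt before crawling any path on this domain."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(domain_root, "/robots.txt"))
    rp.read()
    return rp

def polite_delay(rp: urllib.robotparser.RobotFileParser) -> float:
    """Honor crawl-delay if the site declares one; otherwise fall back to 1 second."""
    return rp.crawl_delay(USER_AGENT) or 1.0

rp = load_robots("https://example.com")
url = "https://example.com/some/path"
if rp.can_fetch(USER_AGENT, url):
    time.sleep(polite_delay(rp))  # enforce the delay between requests to this host
    # ... fetch the page with headers={"User-Agent": USER_AGENT} ...
```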

Phase 1: Input Validation

Parse seed URLs, fetch and parse robots.txt for each domain, set crawl scope (same-domain, subdomain, or cross-domain). Gate: Valid seed URLs, robots.txt rules loaded, scope defined.
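
The scope check itself is a small function; a sketch under the assumption that scope is expressed as one of the three modes named above (the function and parameter names are illustrative):

```python
from urllib.parse import urlparse

def in_scope(url: str, seed_host: str, scope: str = "same-domain") -> bool:
    """Return True if url falls inside the configured crawl scope."""
    host = (urlparse(url).hostname or "").lower()
    if scope == "same-domain":
        return host == seed_host
    if scope == "subdomain":
        return host == seed_host or host.endswith("." + seed_host)
    return True  # "cross-domain": no host restriction

assert in_scope("https://example.com/page", "example.com")
assert in_scope("https://blog.example.com/", "example.com", scope="subdomain")
assert not in_scope("https://other.org/", "example.com")
```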

Phase 2: Core Algorithm

  1. Initialize URL frontier with seed URLs (priority queue or FIFO)
  2. Dequeue URL, check: not visited, allowed by robots.txt, within scope
  3. Fetch page with timeout and retry logic, respect crawl-delay
  4. Parse HTML: extract links (normalize, deduplicate), extract content/metadata
  5. Enqueue discovered URLs, store parsed data
  6. Repeat until frontier empty or limit reached
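
A single-threaded sketch of this loop, assuming the third-party `requests` and `beautifulsoup4` packages; it simplifies to same-domain scope, skips retries, and uses a hypothetical `USER_AGENT` string:

```python
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse, urldefrag

import requests                # assumed dependency: pip install requests
from bs4 import BeautifulSoup  # assumed dependency: pip install beautifulsoup4

USER_AGENT = "ExampleAuditBot/1.0 (+https://example.com/bot-info)"

def crawl(seed: str, max_depth: int = 2, max_pages: int = 100) -> list[dict]:
    seed_host = urlparse(seed).hostname
    rp = urllib.robotparser.RobotFileParser(urljoin(seed, "/robots.txt"))
    rp.read()
    delay = rp.crawl_delay(USER_AGENT) or 1.0

    frontier = deque([(seed, 0)])                    # 1. initialize frontier with seed
    visited, pages = set(), []

    while frontier and len(pages) < max_pages:       # 6. stop when empty or limit reached
        url, depth = frontier.popleft()              # 2. dequeue and filter
        url = urldefrag(url)[0]
        if url in visited or depth > max_depth:
            continue
        if urlparse(url).hostname != seed_host or not rp.can_fetch(USER_AGENT, url):
            continue
        visited.add(url)

        time.sleep(delay)                            # 3. fetch politely, with a timeout
        try:
            resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        except requests.RequestException:
            continue

        soup = BeautifulSoup(resp.text, "html.parser")  # 4. parse links and metadata
        links = {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}
        pages.append({"url": url, "status": resp.status_code,
                      "title": soup.title.string if soup.title else None,
                      "links_out": len(links), "depth": depth})

        for link in links:                           # 5. enqueue discovered URLs
            frontier.append((link, depth + 1))
    return pages

# Matches the Sample I/O below: homepage at depth 0, linked pages at depths 1-2.
# pages = crawl("https://example.com", max_depth=2, max_pages=100)
```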

Phase 3: Verification

Check: no robots.txt violations in crawl log, no duplicate pages stored, all discovered URLs accounted for. Gate: Crawl completed within scope, politeness maintained.
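
One way to mechanize part of this gate, assuming the page records produced by the loop above (field names follow the Output Format below) and the parsed robots.txt for the crawled domain:

```python
def verify_crawl(pages: list[dict], rp, user_agent: str) -> list[str]:
    """Post-crawl audit: report robots.txt violations and duplicate stored pages."""
    problems, seen = [], set()
    for page in pages:
        if not rp.can_fetch(user_agent, page["url"]):
            problems.append(f"robots.txt violation: {page['url']}")
        if page["url"] in seen:
            problems.append(f"duplicate page stored: {page['url']}")
        seen.add(page["url"])
    return problems  # an empty list means the politeness and deduplication checks pass
```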

Phase 4: Output

Return site map with pages, link graph, and extracted metadata.

Output Format

```json
{
  "pages": [{"url": "...", "status": 200, "title": "...", "links_out": 15, "depth": 2}],
  "metadata": {"pages_crawled": 500, "errors": 12, "duration_seconds": 300, "domain": "example.com"}
}
```

Examples

Sample I/O

Input: Seed: "https://example.com", max_depth: 2, max_pages: 100
Expected: Crawl tree with homepage at depth 0, linked pages at depth 1-2, respecting robots.txt

Edge Cases

| Input | Expected | Why |
| --- | --- | --- |
| robots.txt disallows / | Zero pages crawled | Must respect full disallow |
| Redirect loop | Stop after 5 redirects | Prevent infinite loop |
| Soft 404 (200 with error page) | Flag as soft 404 | Status code alone is insufficient |
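
A sketch of handling the last two rows with `requests`, capping the redirect chain and flagging likely soft 404s; the error-page phrases are illustrative heuristics, not a definitive detector:

```python
import requests

def fetch_with_limits(url: str, user_agent: str, max_redirects: int = 5):
    """Cap redirect chains and flag 200 responses that look like error pages."""
    session = requests.Session()
    session.max_redirects = max_redirects  # requests raises TooManyRedirects beyond this
    resp = session.get(url, headers={"User-Agent": user_agent}, timeout=10)

    error_markers = ("page not found", "does not exist", "no longer available")  # heuristic
    soft_404 = resp.status_code == 200 and any(m in resp.text.lower() for m in error_markers)
    return resp, soft_404
```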

Gotchas

  • URL normalization: http://Example.COM/path/ and http://example.com/path are the same URL. Normalize: lowercase host, remove default port, remove trailing slash, sort query params (see the sketch after this list).
  • JavaScript-rendered content: A basic HTTP fetch misses JS-rendered content. Use headless browser (Playwright/Puppeteer) for SPAs.
  • Trap detection: Calendar pages, session IDs in URLs, and infinite pagination create crawler traps. Set max depth and URL pattern limits.
  • Rate limiting yourself: Parallel fetching without per-domain rate limiting will overwhelm small servers. Use per-domain semaphores.
  • Character encoding: Not all pages are UTF-8. Detect encoding from HTTP headers and meta tags; fall back to charset detection libraries.
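
A sketch of a normalizer implementing the rules from the first bullet (lowercase host, drop default port, drop trailing slash, sort query parameters); full RFC 3986 normalization covers more cases, see the references below:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Canonicalize a URL so equivalent forms deduplicate to the same key."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    port = parts.port
    if port and (parts.scheme, port) not in (("http", 80), ("https", 443)):
        host = f"{host}:{port}"                      # keep only non-default ports
    path = parts.path.rstrip("/") or "/"             # treat /path/ and /path as the same page
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))

assert normalize_url("http://Example.COM:80/path/") == normalize_url("http://example.com/path")
```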

References

  • For URL normalization rules (RFC 3986), see
    references/url-normalization.md
  • For distributed crawling architecture, see
    references/distributed-crawl.md