algo-seo-crawl
Web Crawler
Overview
概述
A web crawler systematically traverses web pages by discovering URLs, fetching content, parsing HTML, and storing results. The URL frontier is managed with BFS or priority-based ordering. Performance is I/O-bound, typically limited by politeness constraints rather than compute.
When to Use
Trigger conditions:
- Building a site audit tool to discover all pages and their link structure
- Collecting structured data from websites at scale
- Mapping site architecture for SEO analysis
When NOT to use:
- When you need data from a single API endpoint (use HTTP client directly)
- When a sitemap.xml provides all needed URLs (parse sitemap instead)
Algorithm
IRON LAW: Respect robots.txt and Rate Limits
A crawler MUST:
1. Parse and obey robots.txt before crawling any path
2. Enforce crawl-delay (default 1s if unspecified)
3. Identify itself with a descriptive User-Agent
Ignoring these is unethical and will get your IP blocked.
Phase 1: Input Validation
Parse seed URLs, fetch and parse robots.txt for each domain, set crawl scope (same-domain, subdomain, or cross-domain).
Gate: Valid seed URLs, robots.txt rules loaded, scope defined.
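A minimal sketch of this phase using only Python's standard library; the `MyCrawler/1.0` user-agent is an illustrative assumption, and the 1-second fallback delay comes from the iron law above:

```python
# Phase 1 sketch: validate a seed URL and load its robots.txt rules.
from urllib.parse import urlparse, urljoin
from urllib.robotparser import RobotFileParser

def load_robots(seed_url: str) -> RobotFileParser:
    """Validate the seed URL, then fetch and parse its domain's robots.txt."""
    parts = urlparse(seed_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"Invalid seed URL: {seed_url}")
    rp = RobotFileParser()
    rp.set_url(urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt"))
    rp.read()  # network fetch; wrap in try/except in production
    return rp

rp = load_robots("https://example.com")
allowed = rp.can_fetch("MyCrawler/1.0", "https://example.com/some/path")
delay = rp.crawl_delay("MyCrawler/1.0") or 1.0  # default 1s if unspecified
```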
Phase 2: Core Algorithm
- Initialize URL frontier with seed URLs (priority queue or FIFO)
- Dequeue URL, check: not visited, allowed by robots.txt, within scope
- Fetch page with timeout and retry logic, respect crawl-delay
- Parse HTML: extract links (normalize, deduplicate), extract content/metadata
- Enqueue discovered URLs, store parsed data
- Repeat until frontier empty or limit reached (see the sketch after this list)
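The loop might look like the following sketch. It assumes the third-party `requests` and `beautifulsoup4` packages, a same-domain scope, and the `rp` robots parser from the Phase 1 sketch; retries and per-domain throttling (see Gotchas) are omitted for brevity:

```python
# A compact single-domain BFS crawler sketch.
import time
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed: str, rp, max_pages: int = 100, max_depth: int = 2, delay: float = 1.0):
    domain = urlparse(seed).netloc
    frontier = deque([(seed, 0)])           # FIFO frontier -> BFS order
    visited, pages = {seed}, []
    while frontier and len(pages) < max_pages:
        url, depth = frontier.popleft()
        if not rp.can_fetch("MyCrawler/1.0", url):
            continue                        # robots.txt gate
        time.sleep(delay)                   # politeness delay before each fetch
        try:
            resp = requests.get(url, timeout=10,
                                headers={"User-Agent": "MyCrawler/1.0"})
        except requests.RequestException:
            continue                        # retry logic omitted for brevity
        soup = BeautifulSoup(resp.text, "html.parser")
        links = []
        for a in soup.find_all("a", href=True):
            link = urldefrag(urljoin(url, a["href"]))[0]  # absolutize, drop #fragment
            if urlparse(link).netloc == domain:           # same-domain scope check
                links.append(link)
        pages.append({"url": url, "status": resp.status_code,
                      "title": soup.title.string if soup.title else None,
                      "links_out": len(links), "depth": depth})
        if depth < max_depth:
            for link in links:
                if link not in visited:     # dedupe before enqueueing
                    visited.add(link)
                    frontier.append((link, depth + 1))
    return pages
```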
Phase 3: Verification
Check: no robots.txt violations in crawl log, no duplicate pages stored, all discovered URLs accounted for.
Gate: Crawl completed within scope, politeness maintained.
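A hypothetical verification pass over the `pages` list produced by the sketch above; the field names mirror the Output Format section:

```python
# Post-crawl checks: no duplicates stored, no robots.txt violations.
def verify(pages, rp, user_agent: str = "MyCrawler/1.0") -> None:
    urls = [p["url"] for p in pages]
    assert len(urls) == len(set(urls)), "duplicate pages stored"
    violations = [u for u in urls if not rp.can_fetch(user_agent, u)]
    assert not violations, f"robots.txt violations: {violations}"
```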
Phase 4: Output
Return site map with pages, link graph, and extracted metadata.
Output Format
```json
{
  "pages": [{"url": "...", "status": 200, "title": "...", "links_out": 15, "depth": 2}],
  "metadata": {"pages_crawled": 500, "errors": 12, "duration_seconds": 300, "domain": "example.com"}
}
```
Examples
Sample I/O
Input: Seed: "https://example.com", max_depth: 2, max_pages: 100
Expected: Crawl tree with homepage at depth 0, linked pages at depth 1-2, respecting robots.txt
Edge Cases
| Input | Expected | Why |
|---|---|---|
| robots.txt disallows / | Zero pages crawled | Must respect full disallow |
| Redirect loop | Stop after 5 redirects | Prevent infinite loop |
| Soft 404 (200 with error page) | Flag as soft 404 | Status code alone is insufficient |
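A sketch of how the redirect-loop and soft-404 rows might be handled, assuming the `requests` package; the soft-404 marker phrases are an illustrative heuristic, not a standard list:

```python
# Hypothetical handling for two edge cases: redirect loops and soft 404s.
import requests

SOFT_404_MARKERS = ("page not found", "does not exist")  # illustrative heuristic

def fetch_checked(url: str) -> dict:
    session = requests.Session()
    session.max_redirects = 5            # stop redirect loops after 5 hops
    try:
        resp = session.get(url, timeout=10, allow_redirects=True)
    except requests.TooManyRedirects:
        return {"url": url, "error": "redirect_loop"}
    soft_404 = (resp.status_code == 200 and
                any(m in resp.text.lower() for m in SOFT_404_MARKERS))
    return {"url": url, "status": resp.status_code, "soft_404": soft_404}
```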
Gotchas
- URL normalization: http://Example.COM/path/ and http://example.com/path are the same URL. Normalize: lowercase host, remove default port, remove trailing slash, sort query params (first sketch after this list).
- JavaScript-rendered content: A basic HTTP fetch misses JS-rendered content. Use a headless browser (Playwright/Puppeteer) for SPAs.
- Trap detection: Calendar pages, session IDs in URLs, and infinite pagination create crawler traps. Set max depth and URL pattern limits.
- Rate limiting yourself: Parallel fetching without per-domain rate limiting will overwhelm small servers. Use per-domain semaphores (second sketch after this list).
- Character encoding: Not all pages are UTF-8. Detect encoding from HTTP headers and meta tags; fall back to charset detection libraries.
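The first sketch applies the normalization rules above using only the standard library:

```python
# Normalize: lowercase host, drop default port, strip trailing slash, sort params.
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(url: str) -> str:
    p = urlparse(url)
    host = p.hostname.lower() if p.hostname else ""
    if p.port and p.port != DEFAULT_PORTS.get(p.scheme):
        host = f"{host}:{p.port}"          # keep only non-default ports
    path = p.path.rstrip("/") or "/"       # remove trailing slash, keep root
    query = urlencode(sorted(parse_qsl(p.query)))  # sort query params
    return urlunparse((p.scheme, host, path, "", query, ""))

assert normalize_url("http://Example.COM:80/path/") == "http://example.com/path"
```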
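The second sketch throttles per domain with asyncio semaphores; the `aiohttp` dependency and the limit of two concurrent requests per domain are assumptions:

```python
# Hypothetical per-domain throttle: one semaphore per netloc.
import asyncio
from urllib.parse import urlparse

import aiohttp

MAX_PER_DOMAIN = 2  # illustrative limit
_domain_locks: dict[str, asyncio.Semaphore] = {}

async def polite_get(session: aiohttp.ClientSession, url: str, delay: float = 1.0) -> str:
    domain = urlparse(url).netloc
    sem = _domain_locks.setdefault(domain, asyncio.Semaphore(MAX_PER_DOMAIN))
    async with sem:                      # at most MAX_PER_DOMAIN in flight per domain
        async with session.get(url) as resp:
            body = await resp.text()
        await asyncio.sleep(delay)       # crawl-delay while still holding the slot
        return body
```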
References
- For URL normalization rules (RFC 3986), see references/url-normalization.md
- For distributed crawling architecture, see references/distributed-crawl.md