algo-seo-pagerank

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

PageRank Algorithm

PageRank算法

Overview

概述

PageRank computes the importance of web pages by modeling a random surfer who follows links with probability d (damping factor) and jumps to a random page with probability 1-d. Converges in O(k * E) where k is iterations and E is number of edges.
PageRank通过模拟随机冲浪者的行为来计算网页的重要性:随机冲浪者有概率d(阻尼因子,damping factor)点击链接跳转,有概率1-d随机跳转到任意页面。算法的收敛时间复杂度为O(k * E),其中k是迭代次数,E是边的数量。

When to Use

适用场景

Trigger conditions:
  • Computing page importance from link graph structure
  • Building link-based authority scoring systems
  • Analyzing citation networks or any directed graph importance
When NOT to use:
  • When you only need keyword relevance (use TF-IDF instead)
  • When the graph is undirected or unweighted (consider centrality measures)
触发条件:
  • 根据链接图结构计算页面重要性
  • 构建基于链接的权威度评分系统
  • 分析引用网络或任意有向图的节点重要性
不适用场景:
  • 仅需关键词相关性时(改用TF-IDF)
  • 处理无向图或无权图时(考虑中心性度量方法)

Algorithm

算法流程

IRON LAW: PageRank Convergence
- Damping factor d MUST be < 1 (typically 0.85)
- Without damping, rank sinks and spider traps break convergence
- Correctness invariant: sum of all PageRank values = 1.0
IRON LAW: PageRank收敛规则
- 阻尼因子d必须小于1(通常取0.85)
- 若无阻尼因子,排名会陷入吸收态和蜘蛛陷阱,导致无法收敛
- 正确性不变量:所有PageRank值的总和 = 1.0

Phase 1: Input Validation

阶段1:输入验证

Build adjacency list from link data. Verify: no self-loops counted, all nodes accounted for (including dangling nodes with no outlinks). Gate: Graph is well-formed, dangling nodes identified.
从链接数据构建邻接表。验证:不统计自环,所有节点(包括没有出链的悬挂节点)都已被统计。 检查点: 图结构合规,已识别悬挂节点。

Phase 2: Core Algorithm

阶段2:核心算法

  1. Initialize all N pages with PR = 1/N
  2. For each iteration:
    • For each page p: PR(p) = (1-d)/N + d * Σ(PR(q)/L(q)) for all q linking to p
    • Distribute dangling node rank equally to all pages
  3. Repeat until convergence (L1 norm change < ε, typically 1e-6)
  1. 初始化所有N个页面的PR值为1/N
  2. 每次迭代执行:
    • 对每个页面p:PR(p) = (1-d)/N + d * Σ(PR(q)/L(q)),其中q是所有指向p的页面
    • 将悬挂节点的排名值平均分配给所有页面
  3. 重复迭代直至收敛(L1范数变化量 < ε,通常取1e-6)

Phase 3: Verification

阶段3:验证

Check: all PR values sum to ~1.0. Compare top-k rankings against known authority pages. Gate: |Σ PR - 1.0| < 0.001 and convergence achieved within max iterations.
检查:所有PR值的总和约等于1.0。将前k个排名与已知的权威页面进行对比。 检查点: |Σ PR - 1.0| < 0.001,且在最大迭代次数内完成收敛。

Phase 4: Output

阶段4:输出

Return sorted page scores with rank position.
返回按排名排序的页面得分。

Output Format

输出格式

json
{
  "rankings": [{"page": "url", "score": 0.042, "rank": 1}],
  "metadata": {"nodes": 1000, "edges": 5000, "iterations": 45, "damping": 0.85, "converged": true}
}
json
{
  "rankings": [{"page": "url", "score": 0.042, "rank": 1}],
  "metadata": {"nodes": 1000, "edges": 5000, "iterations": 45, "damping": 0.85, "converged": true}
}

Examples

示例

Sample I/O

输入输出示例

Input: Pages A→B, A→C, B→C, C→A (3 nodes, 4 edges, d=0.85) Expected Output: C: 0.390, A: 0.327, B: 0.283 (approximate)
输入: 页面A→B,A→C,B→C,C→A(3个节点,4条边,d=0.85) 预期输出: C: 0.390,A: 0.327,B: 0.283(近似值)

Edge Cases

边缘情况

InputExpectedWhy
Single node, no linksPR = 1.0Only node gets all rank
All nodes link to oneTarget gets highest PRStar topology concentrates rank
Dangling node (no outlinks)Distribute its rank equallyPrevents rank leakage
输入预期结果原因
单个节点,无链接PR = 1.0唯一节点获得全部排名值
所有节点都指向同一个节点目标节点获得最高PR值星型拓扑结构会集中排名值
悬挂节点(无出链)将其排名值平均分配给所有页面防止排名值流失

Gotchas

注意事项

  • Dangling nodes: Pages with no outgoing links leak rank. Redistribute their rank equally across all pages each iteration.
  • Spider traps: A group of pages that only link to each other accumulate rank. Damping factor prevents this but doesn't eliminate it entirely.
  • Convergence speed: Dense graphs converge faster. Sparse graphs with long chains may need 100+ iterations.
  • Floating point accumulation: For large graphs, use double precision. Single precision drifts noticeably after 50+ iterations.
  • Personalized PageRank: Standard PageRank uses uniform random jump. For personalized recommendations, bias the jump vector toward seed pages.
  • 悬挂节点:没有出链的页面会导致排名值流失。每次迭代时需将其排名值平均分配给所有页面。
  • 蜘蛛陷阱:一组仅互相链接的页面会累积排名值。阻尼因子可缓解此问题,但无法完全消除。
  • 收敛速度:稠密图收敛更快。带有长链的稀疏图可能需要100+次迭代。
  • 浮点精度累积:处理大型图时,使用双精度浮点数。单精度浮点数在50+次迭代后会出现明显的精度漂移。
  • 个性化PageRank:标准PageRank使用均匀随机跳转。若用于个性化推荐,可将跳转向量偏向种子页面。

References

参考资料

  • For mathematical derivation of convergence proof, see
    references/convergence-proof.md
  • For efficient sparse matrix implementation, see
    references/sparse-implementation.md
  • 关于收敛性证明的数学推导,请参阅
    references/convergence-proof.md
  • 关于高效稀疏矩阵实现,请参阅
    references/sparse-implementation.md