algo-seo-pagerank

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

PageRank Algorithm

PageRank算法

Overview

概述

PageRank computes the importance of web pages by modeling a random surfer who follows links with probability d (damping factor) and jumps to a random page with probability 1-d. Converges in O(k * E) where k is iterations and E is number of edges.

PageRank通过模拟随机冲浪者的行为来计算网页的重要性：随机冲浪者有概率d（阻尼因子，damping factor）点击链接跳转，有概率1-d随机跳转到任意页面。算法的收敛时间复杂度为O(k * E)，其中k是迭代次数，E是边的数量。

When to Use

适用场景

Trigger conditions:

Computing page importance from link graph structure
Building link-based authority scoring systems
Analyzing citation networks or any directed graph importance

When NOT to use:

When you only need keyword relevance (use TF-IDF instead)
When the graph is undirected or unweighted (consider centrality measures)

触发条件：

根据链接图结构计算页面重要性
构建基于链接的权威度评分系统
分析引用网络或任意有向图的节点重要性

不适用场景：

仅需关键词相关性时（改用TF-IDF）
处理无向图或无权图时（考虑中心性度量方法）

Algorithm

算法流程

IRON LAW: PageRank Convergence
- Damping factor d MUST be < 1 (typically 0.85)
- Without damping, rank sinks and spider traps break convergence
- Correctness invariant: sum of all PageRank values = 1.0

IRON LAW: PageRank收敛规则
- 阻尼因子d必须小于1（通常取0.85）
- 若无阻尼因子，排名会陷入吸收态和蜘蛛陷阱，导致无法收敛
- 正确性不变量：所有PageRank值的总和 = 1.0

Phase 1: Input Validation

阶段1：输入验证

Build adjacency list from link data. Verify: no self-loops counted, all nodes accounted for (including dangling nodes with no outlinks). Gate: Graph is well-formed, dangling nodes identified.

从链接数据构建邻接表。验证：不统计自环，所有节点（包括没有出链的悬挂节点）都已被统计。 检查点： 图结构合规，已识别悬挂节点。

Phase 2: Core Algorithm

阶段2：核心算法

Initialize all N pages with PR = 1/N
For each iteration:
- For each page p: PR(p) = (1-d)/N + d * Σ(PR(q)/L(q)) for all q linking to p
- Distribute dangling node rank equally to all pages
Repeat until convergence (L1 norm change < ε, typically 1e-6)

初始化所有N个页面的PR值为1/N
每次迭代执行：
- 对每个页面p：PR(p) = (1-d)/N + d * Σ(PR(q)/L(q))，其中q是所有指向p的页面
- 将悬挂节点的排名值平均分配给所有页面
重复迭代直至收敛（L1范数变化量 < ε，通常取1e-6）

Phase 3: Verification

阶段3：验证

Check: all PR values sum to ~1.0. Compare top-k rankings against known authority pages. Gate: |Σ PR - 1.0| < 0.001 and convergence achieved within max iterations.

检查：所有PR值的总和约等于1.0。将前k个排名与已知的权威页面进行对比。 检查点： |Σ PR - 1.0| < 0.001，且在最大迭代次数内完成收敛。

Phase 4: Output

阶段4：输出

Return sorted page scores with rank position.

返回按排名排序的页面得分。

Output Format

输出格式

json

{
  "rankings": [{"page": "url", "score": 0.042, "rank": 1}],
  "metadata": {"nodes": 1000, "edges": 5000, "iterations": 45, "damping": 0.85, "converged": true}
}

json

{
  "rankings": [{"page": "url", "score": 0.042, "rank": 1}],
  "metadata": {"nodes": 1000, "edges": 5000, "iterations": 45, "damping": 0.85, "converged": true}
}

Examples

示例

Sample I/O

输入输出示例

Input: Pages A→B, A→C, B→C, C→A (3 nodes, 4 edges, d=0.85) Expected Output: C: 0.390, A: 0.327, B: 0.283 (approximate)

输入： 页面A→B，A→C，B→C，C→A（3个节点，4条边，d=0.85） 预期输出： C: 0.390，A: 0.327，B: 0.283（近似值）

Edge Cases

边缘情况

Input	Expected	Why
Single node, no links	PR = 1.0	Only node gets all rank
All nodes link to one	Target gets highest PR	Star topology concentrates rank
Dangling node (no outlinks)	Distribute its rank equally	Prevents rank leakage

输入	预期结果	原因
单个节点，无链接	PR = 1.0	唯一节点获得全部排名值
所有节点都指向同一个节点	目标节点获得最高PR值	星型拓扑结构会集中排名值
悬挂节点（无出链）	将其排名值平均分配给所有页面	防止排名值流失

Gotchas

注意事项

Dangling nodes: Pages with no outgoing links leak rank. Redistribute their rank equally across all pages each iteration.
Spider traps: A group of pages that only link to each other accumulate rank. Damping factor prevents this but doesn't eliminate it entirely.
Convergence speed: Dense graphs converge faster. Sparse graphs with long chains may need 100+ iterations.
Floating point accumulation: For large graphs, use double precision. Single precision drifts noticeably after 50+ iterations.
Personalized PageRank: Standard PageRank uses uniform random jump. For personalized recommendations, bias the jump vector toward seed pages.

悬挂节点：没有出链的页面会导致排名值流失。每次迭代时需将其排名值平均分配给所有页面。
蜘蛛陷阱：一组仅互相链接的页面会累积排名值。阻尼因子可缓解此问题，但无法完全消除。
收敛速度：稠密图收敛更快。带有长链的稀疏图可能需要100+次迭代。
浮点精度累积：处理大型图时，使用双精度浮点数。单精度浮点数在50+次迭代后会出现明显的精度漂移。
个性化PageRank：标准PageRank使用均匀随机跳转。若用于个性化推荐，可将跳转向量偏向种子页面。

References

参考资料

For mathematical derivation of convergence proof, see
```
references/convergence-proof.md
```
For efficient sparse matrix implementation, see
```
references/sparse-implementation.md
```

关于收敛性证明的数学推导，请参阅
```
references/convergence-proof.md
```
关于高效稀疏矩阵实现，请参阅
```
references/sparse-implementation.md
```