algo-seo-pagerank
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChinesePageRank Algorithm
PageRank算法
Overview
概述
PageRank computes the importance of web pages by modeling a random surfer who follows links with probability d (damping factor) and jumps to a random page with probability 1-d. Converges in O(k * E) where k is iterations and E is number of edges.
PageRank通过模拟随机冲浪者的行为来计算网页的重要性:随机冲浪者有概率d(阻尼因子,damping factor)点击链接跳转,有概率1-d随机跳转到任意页面。算法的收敛时间复杂度为O(k * E),其中k是迭代次数,E是边的数量。
When to Use
适用场景
Trigger conditions:
- Computing page importance from link graph structure
- Building link-based authority scoring systems
- Analyzing citation networks or any directed graph importance
When NOT to use:
- When you only need keyword relevance (use TF-IDF instead)
- When the graph is undirected or unweighted (consider centrality measures)
触发条件:
- 根据链接图结构计算页面重要性
- 构建基于链接的权威度评分系统
- 分析引用网络或任意有向图的节点重要性
不适用场景:
- 仅需关键词相关性时(改用TF-IDF)
- 处理无向图或无权图时(考虑中心性度量方法)
Algorithm
算法流程
IRON LAW: PageRank Convergence
- Damping factor d MUST be < 1 (typically 0.85)
- Without damping, rank sinks and spider traps break convergence
- Correctness invariant: sum of all PageRank values = 1.0IRON LAW: PageRank收敛规则
- 阻尼因子d必须小于1(通常取0.85)
- 若无阻尼因子,排名会陷入吸收态和蜘蛛陷阱,导致无法收敛
- 正确性不变量:所有PageRank值的总和 = 1.0Phase 1: Input Validation
阶段1:输入验证
Build adjacency list from link data. Verify: no self-loops counted, all nodes accounted for (including dangling nodes with no outlinks).
Gate: Graph is well-formed, dangling nodes identified.
从链接数据构建邻接表。验证:不统计自环,所有节点(包括没有出链的悬挂节点)都已被统计。
检查点: 图结构合规,已识别悬挂节点。
Phase 2: Core Algorithm
阶段2:核心算法
- Initialize all N pages with PR = 1/N
- For each iteration:
- For each page p: PR(p) = (1-d)/N + d * Σ(PR(q)/L(q)) for all q linking to p
- Distribute dangling node rank equally to all pages
- Repeat until convergence (L1 norm change < ε, typically 1e-6)
- 初始化所有N个页面的PR值为1/N
- 每次迭代执行:
- 对每个页面p:PR(p) = (1-d)/N + d * Σ(PR(q)/L(q)),其中q是所有指向p的页面
- 将悬挂节点的排名值平均分配给所有页面
- 重复迭代直至收敛(L1范数变化量 < ε,通常取1e-6)
Phase 3: Verification
阶段3:验证
Check: all PR values sum to ~1.0. Compare top-k rankings against known authority pages.
Gate: |Σ PR - 1.0| < 0.001 and convergence achieved within max iterations.
检查:所有PR值的总和约等于1.0。将前k个排名与已知的权威页面进行对比。
检查点: |Σ PR - 1.0| < 0.001,且在最大迭代次数内完成收敛。
Phase 4: Output
阶段4:输出
Return sorted page scores with rank position.
返回按排名排序的页面得分。
Output Format
输出格式
json
{
"rankings": [{"page": "url", "score": 0.042, "rank": 1}],
"metadata": {"nodes": 1000, "edges": 5000, "iterations": 45, "damping": 0.85, "converged": true}
}json
{
"rankings": [{"page": "url", "score": 0.042, "rank": 1}],
"metadata": {"nodes": 1000, "edges": 5000, "iterations": 45, "damping": 0.85, "converged": true}
}Examples
示例
Sample I/O
输入输出示例
Input: Pages A→B, A→C, B→C, C→A (3 nodes, 4 edges, d=0.85)
Expected Output: C: 0.390, A: 0.327, B: 0.283 (approximate)
输入: 页面A→B,A→C,B→C,C→A(3个节点,4条边,d=0.85)
预期输出: C: 0.390,A: 0.327,B: 0.283(近似值)
Edge Cases
边缘情况
| Input | Expected | Why |
|---|---|---|
| Single node, no links | PR = 1.0 | Only node gets all rank |
| All nodes link to one | Target gets highest PR | Star topology concentrates rank |
| Dangling node (no outlinks) | Distribute its rank equally | Prevents rank leakage |
| 输入 | 预期结果 | 原因 |
|---|---|---|
| 单个节点,无链接 | PR = 1.0 | 唯一节点获得全部排名值 |
| 所有节点都指向同一个节点 | 目标节点获得最高PR值 | 星型拓扑结构会集中排名值 |
| 悬挂节点(无出链) | 将其排名值平均分配给所有页面 | 防止排名值流失 |
Gotchas
注意事项
- Dangling nodes: Pages with no outgoing links leak rank. Redistribute their rank equally across all pages each iteration.
- Spider traps: A group of pages that only link to each other accumulate rank. Damping factor prevents this but doesn't eliminate it entirely.
- Convergence speed: Dense graphs converge faster. Sparse graphs with long chains may need 100+ iterations.
- Floating point accumulation: For large graphs, use double precision. Single precision drifts noticeably after 50+ iterations.
- Personalized PageRank: Standard PageRank uses uniform random jump. For personalized recommendations, bias the jump vector toward seed pages.
- 悬挂节点:没有出链的页面会导致排名值流失。每次迭代时需将其排名值平均分配给所有页面。
- 蜘蛛陷阱:一组仅互相链接的页面会累积排名值。阻尼因子可缓解此问题,但无法完全消除。
- 收敛速度:稠密图收敛更快。带有长链的稀疏图可能需要100+次迭代。
- 浮点精度累积:处理大型图时,使用双精度浮点数。单精度浮点数在50+次迭代后会出现明显的精度漂移。
- 个性化PageRank:标准PageRank使用均匀随机跳转。若用于个性化推荐,可将跳转向量偏向种子页面。
References
参考资料
- For mathematical derivation of convergence proof, see
references/convergence-proof.md - For efficient sparse matrix implementation, see
references/sparse-implementation.md
- 关于收敛性证明的数学推导,请参阅
references/convergence-proof.md - 关于高效稀疏矩阵实现,请参阅
references/sparse-implementation.md