algo-net-influence

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Influence Maximization

影响力最大化

Overview

概述

Influence maximization selects k seed nodes in a network to maximize expected spread under a diffusion model (Independent Cascade or Linear Threshold). NP-hard, but the greedy algorithm achieves (1-1/e) ≈ 63% approximation guarantee due to submodularity. Practical for networks up to millions of nodes with CELF optimization.
影响力最大化指在网络中选择k个种子节点,以在扩散模型(Independent Cascade或Linear Threshold)下最大化预期传播范围。该问题属于NP难问题,但由于具备次模性,贪心算法可达到(1-1/e)≈63%的近似保证。结合CELF优化后,该算法可处理规模达数百万节点的网络。

When to Use

适用场景

Trigger conditions:
  • Selecting k influencers/users to seed a viral marketing campaign
  • Maximizing information spread under a fixed budget (k seeds)
  • Comparing seeding strategies (degree-based vs greedy vs random)
When NOT to use:
  • When measuring existing influence (use centrality metrics)
  • For community structure analysis (use community detection)
触发条件:
  • 为病毒式营销活动选择k个影响者/用户作为种子
  • 在固定预算(k个种子)下最大化信息传播范围
  • 比较不同种子选择策略(基于度的策略vs贪心策略vs随机策略)
不适用场景:
  • 衡量现有影响力(使用中心性指标)
  • 社区结构分析(使用社区检测算法)

Algorithm

算法

IRON LAW: Greedy With Lazy Evaluation (CELF) Is the Practical Standard
The naive greedy algorithm requires O(k × n × R) simulations where
R = Monte Carlo runs (10,000+). CELF exploits submodularity to skip
unnecessary evaluations, achieving 700x speedup. Always use CELF
over naive greedy. Simple heuristics (top-k by degree) are fast
but can perform 50%+ worse than greedy.
IRON LAW: Greedy With Lazy Evaluation (CELF) Is the Practical Standard
The naive greedy algorithm requires O(k × n × R) simulations where
R = Monte Carlo runs (10,000+). CELF exploits submodularity to skip
unnecessary evaluations, achieving 700x speedup. Always use CELF
over naive greedy. Simple heuristics (top-k by degree) are fast
but can perform 50%+ worse than greedy.

Phase 1: Input Validation

阶段1:输入验证

Build network graph. Choose diffusion model: Independent Cascade (probability per edge) or Linear Threshold (threshold per node). Set k (number of seeds) and propagation probabilities. Gate: Graph loaded, diffusion model selected, k defined.
构建网络图谱。选择扩散模型:Independent Cascade(边级传播概率)或Linear Threshold(节点级阈值)。设置k(种子数量)和传播概率。 准入条件: 图谱已加载、扩散模型已选定、k值已定义。

Phase 2: Core Algorithm

阶段2:核心算法

Greedy with CELF:
  1. Initialize: seed set S = ∅
  2. For each candidate node, estimate marginal gain: σ(S∪{v}) - σ(S) via Monte Carlo simulation (R=10,000 runs)
  3. Select node with highest marginal gain, add to S
  4. CELF optimization: reuse previous marginal gains, only re-evaluate when a node's upper bound exceeds current best
  5. Repeat until |S| = k
结合CELF的贪心算法:
  1. 初始化:种子集合S = ∅
  2. 对于每个候选节点,通过Monte Carlo模拟(R=10000次运行)估算边际增益:σ(S∪{v}) - σ(S)
  3. 选择边际增益最高的节点,将其加入S
  4. CELF优化:复用之前的边际增益,仅当节点的上限超过当前最优值时才重新评估
  5. 重复步骤直到|S| = k

Phase 3: Verification

阶段3:结果验证

Compare greedy result against baselines: random seeds, top-k degree, top-k PageRank. Greedy should significantly outperform. Gate: Greedy spread > degree heuristic spread, difference is meaningful.
将贪心算法的结果与基准策略对比:随机种子、基于度的Top-k种子、基于PageRank的Top-k种子。贪心算法应显著优于其他策略。 准入条件: 贪心算法的传播范围 > 基于度的启发式策略的传播范围,且差异具备实际意义。

Phase 4: Output

阶段4:输出结果

Return seed set with expected spread and comparison.
返回种子集合,以及预期传播范围和对比数据。

Output Format

输出格式

json
{
  "seeds": [{"node": "user_42", "marginal_gain": 150, "selection_order": 1}],
  "expected_spread": 2500,
  "baselines": {"random": 800, "top_degree": 1900, "greedy": 2500},
  "metadata": {"k": 10, "model": "independent_cascade", "mc_simulations": 10000, "nodes": 50000}
}
json
{
  "seeds": [{"node": "user_42", "marginal_gain": 150, "selection_order": 1}],
  "expected_spread": 2500,
  "baselines": {"random": 800, "top_degree": 1900, "greedy": 2500},
  "metadata": {"k": 10, "model": "independent_cascade", "mc_simulations": 10000, "nodes": 50000}
}

Examples

示例

Sample I/O

输入输出示例

Input: Social network 10K nodes, k=5 seeds, IC model with p=0.1 per edge Expected: Greedy selects diverse, well-positioned seeds (not all high-degree), expected spread ~500-1000.
输入: 含10000个节点的社交网络,k=5个种子,IC模型,每条边的传播概率p=0.1 预期结果: 贪心算法会选择分布分散、位置优良的种子(并非全部是高节点度的节点),预期传播范围约为500-1000。

Edge Cases

边缘情况

InputExpectedWhy
k=1Node with highest individual spreadSingle seed, no overlap consideration
k > number of communitiesOne seed per community optimalDiversity beats concentration
Very sparse graph (low p)Small spread regardless of seedsNetwork can't propagate with low probability
输入预期结果原因
k=1选择单个节点传播范围最高的节点仅需单个种子,无需考虑传播范围重叠
k > 社区数量每个社区选择一个种子为最优策略分散性优于集中性
极稀疏图谱(低p值)无论选择哪些种子,传播范围都很小网络传播概率过低,无法有效扩散

Gotchas

注意事项

  • Monte Carlo variance: With R=1000, spread estimates have ~5% variance. Use R=10,000+ for stable results, especially when comparing close candidates.
  • Diffusion model choice matters: IC and LT produce different optimal seed sets. IC favors high-degree nodes; LT favors nodes that can trigger cascades.
  • Propagation probability estimation: Real-world edge probabilities are unknown. Common approaches: uniform (p=0.01-0.1), weighted inverse degree (1/in-degree), or learned from cascade data.
  • Overlap penalty: Greedy naturally handles overlap (submodularity). Heuristics that independently select top nodes waste seeds on overlapping influence spheres.
  • Scalability: Even with CELF, millions of nodes require further approximation (sketch-based methods like IMM or TIM+).
  • Monte Carlo方差: 当R=1000时,传播范围估算的方差约为5%。为获得稳定结果,尤其是在对比性能接近的候选策略时,应使用R=10000+的模拟次数。
  • 扩散模型的选择至关重要: IC和LT模型会产生不同的最优种子集合。IC模型偏好高节点度的节点;LT模型偏好能够触发连锁传播的节点。
  • 传播概率的估算: 现实场景中边的传播概率是未知的。常用方法包括:统一概率(p=0.01-0.1)、加权逆节点度(1/入度),或是从传播数据中学习得到。
  • 重叠惩罚: 贪心算法会自然处理传播范围重叠的问题(次模性)。而独立选择Top节点的启发式策略会因种子的传播范围重叠而浪费资源。
  • 可扩展性: 即便使用CELF算法,处理数百万节点的网络仍需进一步的近似方法(如基于草图的IMM或TIM+方法)。

References

参考文献

  • For CELF and CELF++ implementation, see
    references/celf-implementation.md
  • For scalable influence maximization (IMM), see
    references/scalable-im.md
  • 关于CELF和CELF++的实现,详见
    references/celf-implementation.md
  • 关于可扩展的影响力最大化(IMM),详见
    references/scalable-im.md