algo-net-influence

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Influence Maximization

影响力最大化

Overview

概述

Influence maximization selects k seed nodes in a network to maximize expected spread under a diffusion model (Independent Cascade or Linear Threshold). NP-hard, but the greedy algorithm achieves (1-1/e) ≈ 63% approximation guarantee due to submodularity. Practical for networks up to millions of nodes with CELF optimization.

影响力最大化指在网络中选择k个种子节点，以在扩散模型（Independent Cascade或Linear Threshold）下最大化预期传播范围。该问题属于NP难问题，但由于具备次模性，贪心算法可达到(1-1/e)≈63%的近似保证。结合CELF优化后，该算法可处理规模达数百万节点的网络。

When to Use

适用场景

Trigger conditions:

Selecting k influencers/users to seed a viral marketing campaign
Maximizing information spread under a fixed budget (k seeds)
Comparing seeding strategies (degree-based vs greedy vs random)

When NOT to use:

When measuring existing influence (use centrality metrics)
For community structure analysis (use community detection)

触发条件：

为病毒式营销活动选择k个影响者/用户作为种子
在固定预算（k个种子）下最大化信息传播范围
比较不同种子选择策略（基于度的策略vs贪心策略vs随机策略）

不适用场景：

衡量现有影响力（使用中心性指标）
社区结构分析（使用社区检测算法）

Algorithm

算法

IRON LAW: Greedy With Lazy Evaluation (CELF) Is the Practical Standard
The naive greedy algorithm requires O(k × n × R) simulations where
R = Monte Carlo runs (10,000+). CELF exploits submodularity to skip
unnecessary evaluations, achieving 700x speedup. Always use CELF
over naive greedy. Simple heuristics (top-k by degree) are fast
but can perform 50%+ worse than greedy.

IRON LAW: Greedy With Lazy Evaluation (CELF) Is the Practical Standard
The naive greedy algorithm requires O(k × n × R) simulations where
R = Monte Carlo runs (10,000+). CELF exploits submodularity to skip
unnecessary evaluations, achieving 700x speedup. Always use CELF
over naive greedy. Simple heuristics (top-k by degree) are fast
but can perform 50%+ worse than greedy.

Phase 1: Input Validation

阶段1：输入验证

Build network graph. Choose diffusion model: Independent Cascade (probability per edge) or Linear Threshold (threshold per node). Set k (number of seeds) and propagation probabilities. Gate: Graph loaded, diffusion model selected, k defined.

构建网络图谱。选择扩散模型：Independent Cascade（边级传播概率）或Linear Threshold（节点级阈值）。设置k（种子数量）和传播概率。 准入条件： 图谱已加载、扩散模型已选定、k值已定义。

Phase 2: Core Algorithm

阶段2：核心算法

Greedy with CELF:

Initialize: seed set S = ∅
For each candidate node, estimate marginal gain: σ(S∪{v}) - σ(S) via Monte Carlo simulation (R=10,000 runs)
Select node with highest marginal gain, add to S
CELF optimization: reuse previous marginal gains, only re-evaluate when a node's upper bound exceeds current best
Repeat until |S| = k

结合CELF的贪心算法：

初始化：种子集合S = ∅
对于每个候选节点，通过Monte Carlo模拟（R=10000次运行）估算边际增益：σ(S∪{v}) - σ(S)
选择边际增益最高的节点，将其加入S
CELF优化：复用之前的边际增益，仅当节点的上限超过当前最优值时才重新评估
重复步骤直到|S| = k

Phase 3: Verification

阶段3：结果验证

Compare greedy result against baselines: random seeds, top-k degree, top-k PageRank. Greedy should significantly outperform. Gate: Greedy spread > degree heuristic spread, difference is meaningful.

将贪心算法的结果与基准策略对比：随机种子、基于度的Top-k种子、基于PageRank的Top-k种子。贪心算法应显著优于其他策略。 准入条件： 贪心算法的传播范围 > 基于度的启发式策略的传播范围，且差异具备实际意义。

Phase 4: Output

阶段4：输出结果

Return seed set with expected spread and comparison.

返回种子集合，以及预期传播范围和对比数据。

Output Format

输出格式

json

{
  "seeds": [{"node": "user_42", "marginal_gain": 150, "selection_order": 1}],
  "expected_spread": 2500,
  "baselines": {"random": 800, "top_degree": 1900, "greedy": 2500},
  "metadata": {"k": 10, "model": "independent_cascade", "mc_simulations": 10000, "nodes": 50000}
}

json

{
  "seeds": [{"node": "user_42", "marginal_gain": 150, "selection_order": 1}],
  "expected_spread": 2500,
  "baselines": {"random": 800, "top_degree": 1900, "greedy": 2500},
  "metadata": {"k": 10, "model": "independent_cascade", "mc_simulations": 10000, "nodes": 50000}
}

Examples

示例

Sample I/O

输入输出示例

Input: Social network 10K nodes, k=5 seeds, IC model with p=0.1 per edge Expected: Greedy selects diverse, well-positioned seeds (not all high-degree), expected spread ~500-1000.

输入： 含10000个节点的社交网络，k=5个种子，IC模型，每条边的传播概率p=0.1 预期结果： 贪心算法会选择分布分散、位置优良的种子（并非全部是高节点度的节点），预期传播范围约为500-1000。

Edge Cases

边缘情况

Input	Expected	Why
k=1	Node with highest individual spread	Single seed, no overlap consideration
k > number of communities	One seed per community optimal	Diversity beats concentration
Very sparse graph (low p)	Small spread regardless of seeds	Network can't propagate with low probability

输入	预期结果	原因
k=1	选择单个节点传播范围最高的节点	仅需单个种子，无需考虑传播范围重叠
k > 社区数量	每个社区选择一个种子为最优策略	分散性优于集中性
极稀疏图谱（低p值）	无论选择哪些种子，传播范围都很小	网络传播概率过低，无法有效扩散

Gotchas

注意事项

Monte Carlo variance: With R=1000, spread estimates have ~5% variance. Use R=10,000+ for stable results, especially when comparing close candidates.
Diffusion model choice matters: IC and LT produce different optimal seed sets. IC favors high-degree nodes; LT favors nodes that can trigger cascades.
Propagation probability estimation: Real-world edge probabilities are unknown. Common approaches: uniform (p=0.01-0.1), weighted inverse degree (1/in-degree), or learned from cascade data.
Overlap penalty: Greedy naturally handles overlap (submodularity). Heuristics that independently select top nodes waste seeds on overlapping influence spheres.
Scalability: Even with CELF, millions of nodes require further approximation (sketch-based methods like IMM or TIM+).

Monte Carlo方差： 当R=1000时，传播范围估算的方差约为5%。为获得稳定结果，尤其是在对比性能接近的候选策略时，应使用R=10000+的模拟次数。
扩散模型的选择至关重要： IC和LT模型会产生不同的最优种子集合。IC模型偏好高节点度的节点；LT模型偏好能够触发连锁传播的节点。
传播概率的估算： 现实场景中边的传播概率是未知的。常用方法包括：统一概率（p=0.01-0.1）、加权逆节点度（1/入度），或是从传播数据中学习得到。
重叠惩罚： 贪心算法会自然处理传播范围重叠的问题（次模性）。而独立选择Top节点的启发式策略会因种子的传播范围重叠而浪费资源。
可扩展性： 即便使用CELF算法，处理数百万节点的网络仍需进一步的近似方法（如基于草图的IMM或TIM+方法）。

References

参考文献

For CELF and CELF++ implementation, see
```
references/celf-implementation.md
```
For scalable influence maximization (IMM), see
```
references/scalable-im.md
```

关于CELF和CELF++的实现，详见
```
references/celf-implementation.md
```
关于可扩展的影响力最大化（IMM），详见
```
references/scalable-im.md
```