algo-net-influence
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseInfluence Maximization
影响力最大化
Overview
概述
Influence maximization selects k seed nodes in a network to maximize expected spread under a diffusion model (Independent Cascade or Linear Threshold). NP-hard, but the greedy algorithm achieves (1-1/e) ≈ 63% approximation guarantee due to submodularity. Practical for networks up to millions of nodes with CELF optimization.
影响力最大化指在网络中选择k个种子节点,以在扩散模型(Independent Cascade或Linear Threshold)下最大化预期传播范围。该问题属于NP难问题,但由于具备次模性,贪心算法可达到(1-1/e)≈63%的近似保证。结合CELF优化后,该算法可处理规模达数百万节点的网络。
When to Use
适用场景
Trigger conditions:
- Selecting k influencers/users to seed a viral marketing campaign
- Maximizing information spread under a fixed budget (k seeds)
- Comparing seeding strategies (degree-based vs greedy vs random)
When NOT to use:
- When measuring existing influence (use centrality metrics)
- For community structure analysis (use community detection)
触发条件:
- 为病毒式营销活动选择k个影响者/用户作为种子
- 在固定预算(k个种子)下最大化信息传播范围
- 比较不同种子选择策略(基于度的策略vs贪心策略vs随机策略)
不适用场景:
- 衡量现有影响力(使用中心性指标)
- 社区结构分析(使用社区检测算法)
Algorithm
算法
IRON LAW: Greedy With Lazy Evaluation (CELF) Is the Practical Standard
The naive greedy algorithm requires O(k × n × R) simulations where
R = Monte Carlo runs (10,000+). CELF exploits submodularity to skip
unnecessary evaluations, achieving 700x speedup. Always use CELF
over naive greedy. Simple heuristics (top-k by degree) are fast
but can perform 50%+ worse than greedy.IRON LAW: Greedy With Lazy Evaluation (CELF) Is the Practical Standard
The naive greedy algorithm requires O(k × n × R) simulations where
R = Monte Carlo runs (10,000+). CELF exploits submodularity to skip
unnecessary evaluations, achieving 700x speedup. Always use CELF
over naive greedy. Simple heuristics (top-k by degree) are fast
but can perform 50%+ worse than greedy.Phase 1: Input Validation
阶段1:输入验证
Build network graph. Choose diffusion model: Independent Cascade (probability per edge) or Linear Threshold (threshold per node). Set k (number of seeds) and propagation probabilities.
Gate: Graph loaded, diffusion model selected, k defined.
构建网络图谱。选择扩散模型:Independent Cascade(边级传播概率)或Linear Threshold(节点级阈值)。设置k(种子数量)和传播概率。
准入条件: 图谱已加载、扩散模型已选定、k值已定义。
Phase 2: Core Algorithm
阶段2:核心算法
Greedy with CELF:
- Initialize: seed set S = ∅
- For each candidate node, estimate marginal gain: σ(S∪{v}) - σ(S) via Monte Carlo simulation (R=10,000 runs)
- Select node with highest marginal gain, add to S
- CELF optimization: reuse previous marginal gains, only re-evaluate when a node's upper bound exceeds current best
- Repeat until |S| = k
结合CELF的贪心算法:
- 初始化:种子集合S = ∅
- 对于每个候选节点,通过Monte Carlo模拟(R=10000次运行)估算边际增益:σ(S∪{v}) - σ(S)
- 选择边际增益最高的节点,将其加入S
- CELF优化:复用之前的边际增益,仅当节点的上限超过当前最优值时才重新评估
- 重复步骤直到|S| = k
Phase 3: Verification
阶段3:结果验证
Compare greedy result against baselines: random seeds, top-k degree, top-k PageRank. Greedy should significantly outperform.
Gate: Greedy spread > degree heuristic spread, difference is meaningful.
将贪心算法的结果与基准策略对比:随机种子、基于度的Top-k种子、基于PageRank的Top-k种子。贪心算法应显著优于其他策略。
准入条件: 贪心算法的传播范围 > 基于度的启发式策略的传播范围,且差异具备实际意义。
Phase 4: Output
阶段4:输出结果
Return seed set with expected spread and comparison.
返回种子集合,以及预期传播范围和对比数据。
Output Format
输出格式
json
{
"seeds": [{"node": "user_42", "marginal_gain": 150, "selection_order": 1}],
"expected_spread": 2500,
"baselines": {"random": 800, "top_degree": 1900, "greedy": 2500},
"metadata": {"k": 10, "model": "independent_cascade", "mc_simulations": 10000, "nodes": 50000}
}json
{
"seeds": [{"node": "user_42", "marginal_gain": 150, "selection_order": 1}],
"expected_spread": 2500,
"baselines": {"random": 800, "top_degree": 1900, "greedy": 2500},
"metadata": {"k": 10, "model": "independent_cascade", "mc_simulations": 10000, "nodes": 50000}
}Examples
示例
Sample I/O
输入输出示例
Input: Social network 10K nodes, k=5 seeds, IC model with p=0.1 per edge
Expected: Greedy selects diverse, well-positioned seeds (not all high-degree), expected spread ~500-1000.
输入: 含10000个节点的社交网络,k=5个种子,IC模型,每条边的传播概率p=0.1
预期结果: 贪心算法会选择分布分散、位置优良的种子(并非全部是高节点度的节点),预期传播范围约为500-1000。
Edge Cases
边缘情况
| Input | Expected | Why |
|---|---|---|
| k=1 | Node with highest individual spread | Single seed, no overlap consideration |
| k > number of communities | One seed per community optimal | Diversity beats concentration |
| Very sparse graph (low p) | Small spread regardless of seeds | Network can't propagate with low probability |
| 输入 | 预期结果 | 原因 |
|---|---|---|
| k=1 | 选择单个节点传播范围最高的节点 | 仅需单个种子,无需考虑传播范围重叠 |
| k > 社区数量 | 每个社区选择一个种子为最优策略 | 分散性优于集中性 |
| 极稀疏图谱(低p值) | 无论选择哪些种子,传播范围都很小 | 网络传播概率过低,无法有效扩散 |
Gotchas
注意事项
- Monte Carlo variance: With R=1000, spread estimates have ~5% variance. Use R=10,000+ for stable results, especially when comparing close candidates.
- Diffusion model choice matters: IC and LT produce different optimal seed sets. IC favors high-degree nodes; LT favors nodes that can trigger cascades.
- Propagation probability estimation: Real-world edge probabilities are unknown. Common approaches: uniform (p=0.01-0.1), weighted inverse degree (1/in-degree), or learned from cascade data.
- Overlap penalty: Greedy naturally handles overlap (submodularity). Heuristics that independently select top nodes waste seeds on overlapping influence spheres.
- Scalability: Even with CELF, millions of nodes require further approximation (sketch-based methods like IMM or TIM+).
- Monte Carlo方差: 当R=1000时,传播范围估算的方差约为5%。为获得稳定结果,尤其是在对比性能接近的候选策略时,应使用R=10000+的模拟次数。
- 扩散模型的选择至关重要: IC和LT模型会产生不同的最优种子集合。IC模型偏好高节点度的节点;LT模型偏好能够触发连锁传播的节点。
- 传播概率的估算: 现实场景中边的传播概率是未知的。常用方法包括:统一概率(p=0.01-0.1)、加权逆节点度(1/入度),或是从传播数据中学习得到。
- 重叠惩罚: 贪心算法会自然处理传播范围重叠的问题(次模性)。而独立选择Top节点的启发式策略会因种子的传播范围重叠而浪费资源。
- 可扩展性: 即便使用CELF算法,处理数百万节点的网络仍需进一步的近似方法(如基于草图的IMM或TIM+方法)。
References
参考文献
- For CELF and CELF++ implementation, see
references/celf-implementation.md - For scalable influence maximization (IMM), see
references/scalable-im.md
- 关于CELF和CELF++的实现,详见
references/celf-implementation.md - 关于可扩展的影响力最大化(IMM),详见
references/scalable-im.md