algo-rec-cf
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseCollaborative Filtering
协同过滤
Overview
概述
Collaborative filtering recommends items based on collective user behavior patterns. User-based CF finds similar users; item-based CF finds similar items. Computes in O(U² × I) for user-based or O(I² × U) for item-based where U=users, I=items.
协同过滤基于用户的集体行为模式推荐物品。基于用户的CF(User-based CF)查找相似用户;基于物品的CF(Item-based CF)查找相似物品。基于用户的CF时间复杂度为O(U² × I),基于物品的CF为O(I² × U),其中U代表用户数量,I代表物品数量。
When to Use
使用场景
Trigger conditions:
- Building recommendations from user-item interaction data (ratings, clicks, purchases)
- Finding "users like you also liked" or "frequently bought together" patterns
When NOT to use:
- When you have no interaction data (cold start — use content-based filtering)
- When item features matter more than behavior patterns (use content-based)
触发条件:
- 基于用户-物品交互数据(评分、点击、购买记录)构建推荐系统
- 查找“和你相似的用户也喜欢”或“经常一起购买”的模式
不适用场景:
- 没有交互数据时(冷启动场景——请使用基于内容的过滤算法)
- 物品特征比行为模式更重要时(使用基于内容的过滤算法)
Algorithm
算法实现
IRON LAW: CF Requires SUFFICIENT Interaction Data
With sparse matrices (< 1% fill rate), similarity computation is
unreliable. Minimum viable: each user has rated 5+ items, each item
has 5+ ratings. Below this, fallback to content-based or popularity.IRON LAW: CF Requires SUFFICIENT Interaction Data
With sparse matrices (< 1% fill rate), similarity computation is
unreliable. Minimum viable: each user has rated 5+ items, each item
has 5+ ratings. Below this, fallback to content-based or popularity.Phase 1: Input Validation
阶段1:输入验证
Load user-item interaction matrix. Check sparsity level and filter users/items below minimum interaction threshold.
Gate: Matrix sparsity < 99%, minimum interaction thresholds met.
加载用户-物品交互矩阵,检查稀疏度并过滤掉交互次数低于最低阈值的用户/物品。
**准入条件:**矩阵稀疏度<99%,且满足最低交互次数阈值。
Phase 2: Core Algorithm
阶段2:核心算法
User-based CF:
- Compute pairwise user similarity (cosine or Pearson correlation)
- For target user, find top-K most similar users
- Predict rating: weighted average of similar users' ratings
Item-based CF:
- Compute pairwise item similarity from co-rating patterns
- For target item, find top-K most similar items
- Predict: weighted average of user's ratings on similar items
基于用户的CF:
- 计算用户间的两两相似度(余弦相似度或皮尔逊相关系数)
- 为目标用户找到最相似的Top-K用户
- 预测评分:对相似用户的评分进行加权平均
基于物品的CF:
- 根据共同评分模式计算物品间的两两相似度
- 为目标物品找到最相似的Top-K物品
- 预测评分:对用户在相似物品上的评分进行加权平均
Phase 3: Verification
阶段3:验证
Hold out 20% of interactions for testing. Compute RMSE, MAE, or precision@K / recall@K.
Gate: RMSE below baseline (global mean predictor).
预留20%的交互数据用于测试,计算RMSE(均方根误差)、MAE(平均绝对误差)或Precision@K/Recall@K。
**准入条件:**RMSE低于基准值(全局均值预测器的结果)。
Phase 4: Output
阶段4:输出
Return top-N recommendations with predicted scores.
返回带有预测评分的Top-N推荐结果。
Output Format
输出格式
json
{
"recommendations": [{"item_id": "123", "predicted_score": 4.2, "similar_items_used": 5}],
"metadata": {"method": "item-based", "similarity": "cosine", "k_neighbors": 20, "sparsity": 0.97}
}json
{
"recommendations": [{"item_id": "123", "predicted_score": 4.2, "similar_items_used": 5}],
"metadata": {"method": "item-based", "similarity": "cosine", "k_neighbors": 20, "sparsity": 0.97}
}Examples
示例
Sample I/O
输入输出样例
Input: 5 users × 5 items rating matrix, target: user1, item5
Expected: Predicted rating based on weighted similarity of user1's rated items similar to item5
**输入:**5个用户×5个物品的评分矩阵,目标:用户1,物品5
**预期输出:**基于用户1已评分且与物品5相似的物品的加权相似度计算出的预测评分
Edge Cases
边缘案例
| Input | Expected | Why |
|---|---|---|
| New user, no ratings | Cannot recommend | Cold start — fallback to popularity |
| Item rated by all users | Low differentiation | High popularity ≠ personalized match |
| Single shared item | Unreliable similarity | Need multiple co-ratings for stable similarity |
| 输入 | 预期结果 | 原因 |
|---|---|---|
| 新用户,无评分记录 | 无法生成个性化推荐 | 冷启动场景—— fallback至热门推荐 |
| 所有用户都评分过的物品 | 区分度低 | 高热度≠个性化匹配 |
| 仅单个共同评分物品 | 相似度结果不可靠 | 需要多个共同评分才能得到稳定的相似度 |
Gotchas
注意事项
- Scalability: User-based CF with millions of users is O(U²). Use approximate nearest neighbors (LSH) or switch to item-based CF (item catalog is usually smaller).
- Popularity bias: Popular items have more co-ratings, inflating their similarity scores. Normalize by inverse popularity.
- Implicit vs explicit feedback: Clicks/views (implicit) need different treatment than ratings (explicit). Use confidence weighting for implicit data.
- Similarity metric matters: Cosine similarity ignores rating scale differences; Pearson correlation accounts for user rating biases. Choose based on data characteristics.
- Gray sheep: Users with unusual taste patterns have no similar peers. CF fails for them — consider hybrid approaches.
- 可扩展性:当用户数量达数百万时,基于用户的CF时间复杂度为O(U²),可使用近似最近邻(LSH)算法或切换为基于物品的CF(物品目录通常规模更小)。
- 流行度偏差:热门物品有更多共同评分,会抬高其相似度得分,需通过逆流行度进行归一化。
- 隐式与显式反馈:点击/浏览记录(隐式反馈)与评分(显式反馈)的处理方式不同,隐式数据需使用置信度加权。
- 相似度指标的选择:余弦相似度忽略评分尺度差异;皮尔逊相关系数可消除用户评分偏差,需根据数据特征选择。
- 灰色绵羊用户:口味独特的用户没有相似的同类用户,CF算法对这类用户无效——可考虑混合推荐方案。
References
参考资料
- For matrix factorization as a scalable alternative, see
references/matrix-factorization.md - For implicit feedback handling, see
references/implicit-feedback.md
- 关于可扩展的替代方案矩阵分解,请参考
references/matrix-factorization.md - 关于隐式反馈的处理,请参考
references/implicit-feedback.md