clip-aware-embeddings

CLIP-Aware Image Embeddings

Smart image-text matching that knows when CLIP works and when to use alternatives.

MCP Integrations

| MCP | Purpose |
|-----|---------|
| Firecrawl | Research latest CLIP alternatives and benchmarks |
| Hugging Face (if configured) | Access model cards and documentation |

Quick Decision Tree

Your task:
├─ Semantic search ("find beach images") → CLIP ✓
├─ Zero-shot classification (broad categories) → CLIP ✓
├─ Counting objects → CLIP ✗, use DETR or Faster R-CNN
├─ Fine-grained ID (celebrities, car models) → CLIP ✗, use a specialized model
├─ Spatial relations ("cat left of dog") → CLIP ✗, use GQA or SWIG models
└─ Compositional ("red car AND blue truck") → CLIP ✗, use DCSMs or PC-CLIP

When to Use This Skill

Use for:
  • Semantic image search
  • Broad category classification
  • Image similarity matching
  • Zero-shot tasks on new categories
Do NOT use for:
  • Counting objects in images
  • Fine-grained classification
  • Spatial understanding
  • Attribute binding
  • Negation handling

Installation

```bash
pip install transformers pillow torch sentence-transformers --break-system-packages
```
**Validation**: Run `python scripts/validate_setup.py`

Basic Usage

Image Search

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Embed images
images = [Image.open(f"img{i}.jpg") for i in range(10)]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**inputs)

# Search with text
text_inputs = processor(text=["a beach at sunset"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)

# Compute similarity: L2-normalize so the dot product is a cosine similarity,
# then softmax over the images (dim=0) to rank them for the query.
# 100.0 approximates CLIP's learned logit scale.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=0)
```
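
To turn similarity scores into search results, you would usually rank the images by cosine similarity and keep the top few. The sketch below shows only that ranking step, without any dependencies. `cosine_similarity` and `top_k_matches` are illustrative names, not part of `transformers`, and the toy vectors stand in for real CLIP embeddings. With tensors, `torch.topk` on the similarity column should do the same job.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_matches(image_embeddings, text_embedding, k=3):
    """Return (index, score) pairs for the k images most similar to the text."""
    scored = [
        (i, cosine_similarity(emb, text_embedding))
        for i, emb in enumerate(image_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-d vectors standing in for CLIP embeddings (assumed values)
image_embeddings = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
text_embedding = [0.0, 1.0, 0.0]
matches = top_k_matches(image_embeddings, text_embedding, k=2)
```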

Common Anti-Patterns

Anti-Pattern 1: "CLIP for Everything"

❌ Wrong:
```python
# Using CLIP to count cars in an image
prompt = "How many cars are in this image?"
# CLIP cannot count - it will give nonsense results
```

**Why wrong**: CLIP pools each image into a single embedding vector, which discards most spatial and per-instance information. It has no mechanism for counting.

**✓ Right**:
```python
import torch
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

# Detect objects (`image` is a PIL.Image loaded elsewhere)
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to labelled boxes above a confidence threshold
results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.9
)[0]

# Filter for cars and count
car_detections = [l for l in results["labels"] if model.config.id2label[l.item()] == "car"]
count = len(car_detections)
```

**How to detect**: If the query contains "how many", "count", or a numeric question → Use object detection
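
Once a detector has produced labelled detections, counting is just filtering by label and confidence. A minimal sketch, assuming the detections have already been converted to plain dicts with `label` and `score` keys. That format and the `count_objects` helper are hypothetical and chosen for illustration; DETR's post-processed output uses tensors and needs `model.config.id2label` to recover names.

```python
def count_objects(detections, label, min_score=0.9):
    """Count detections of `label` whose confidence is at least `min_score`."""
    return sum(
        1 for d in detections
        if d["label"] == label and d["score"] >= min_score
    )

# Hypothetical detections for one street scene
detections = [
    {"label": "car", "score": 0.98},
    {"label": "car", "score": 0.95},
    {"label": "car", "score": 0.42},   # low confidence, ignored by default
    {"label": "person", "score": 0.99},
]
car_count = count_objects(detections, "car")
```

Lowering `min_score` trades missed objects for false positives, so the count is only as reliable as the threshold you choose.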

---

Anti-Pattern 2: Fine-Grained Classification

❌ Wrong:
```python
# Trying to identify specific celebrities with CLIP
prompts = ["Tom Hanks", "Brad Pitt", "Morgan Freeman"]
# CLIP will perform poorly - not trained for fine-grained face ID
```

**Why wrong**: CLIP was trained on loosely captioned web images, so its embeddings capture coarse semantics. Distinguishing individual faces, car models, or flower species requires specialized models.

**✓ Right**:
```python
# Use a fine-tuned face recognition model
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained(
    "microsoft/resnet-50"  # General backbone; then fine-tune on a celebrity dataset
)
# Or use dedicated face recognition: ArcFace, CosFace
```

**How to detect**: If the query asks to distinguish between similar items in the same category → Use specialized model

---

Anti-Pattern 3: Spatial Understanding

❌ Wrong:
```python
# CLIP cannot understand spatial relationships
prompts = [
    "cat to the left of dog",
    "cat to the right of dog"
]
# Will give nearly identical scores
```

**Why wrong**: CLIP embeddings lose spatial topology. The text encoder treats the prompt much like a bag of words, so swapping "left" for "right" barely changes the embedding.

**✓ Right**:
```python
# Use a spatial reasoning model
# Examples: GQA models, Visual Genome models, SWIG
# NOTE: `swig_model` and `SpatialRelationModel` are illustrative names, not a
# verified package; substitute the import and API of the model you actually use.
from swig_model import SpatialRelationModel

model = SpatialRelationModel()
result = model.predict_relation(image, "cat", "dog")
# Returns: "left", "right", "above", "below", etc.
```

**How to detect**: If the query contains directional words (left, right, above, under, next to) → Use spatial model
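
When an object detector already gives you bounding boxes, one simple alternative is to derive coarse relations geometrically. This is a sketch of that idea, not a description of how GQA or SWIG models work. `spatial_relation` is a hypothetical helper, and boxes are assumed to be `(x_min, y_min, x_max, y_max)` in pixel coordinates with the origin at the top-left.

```python
def box_center(box):
    """Center (x, y) of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)

def spatial_relation(subject_box, object_box):
    """Describe where the subject sits relative to the object."""
    sx, sy = box_center(subject_box)
    ox, oy = box_center(object_box)
    dx, dy = sx - ox, sy - oy
    # Use the axis with the larger displacement between centers
    if abs(dx) >= abs(dy):
        return "left" if dx < 0 else "right"
    # Image y grows downward, so a smaller y means higher in the frame
    return "above" if dy < 0 else "below"

# Assumed example boxes for a cat and a dog in the same image
cat_box = (10, 100, 110, 200)
dog_box = (300, 90, 420, 210)
relation = spatial_relation(cat_box, dog_box)
```

Comparing box centers ignores overlap and depth, so treat the result as a rough cue rather than a full spatial understanding.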

---

Anti-Pattern 4: Attribute Binding

❌ Wrong:
```python
prompts = [
    "red car and blue truck",
    "blue car and red truck"
]
# CLIP often gives similar scores for both
```

**Why wrong**: CLIP binds attributes to objects only weakly. It tends to see "red, blue, car, truck" as a bag of concepts.

**✓ Right - Use PC-CLIP or DCSMs**:
```python
# PC-CLIP: Fine-tuned for pairwise comparisons
# NOTE: `pc_clip`, `PCCLIPModel`, and the "pc-clip-vit-l" checkpoint name are
# illustrative, not a verified package; adapt them to the release you use.
from pc_clip import PCCLIPModel

model = PCCLIPModel.from_pretrained("pc-clip-vit-l")
# Or use DCSMs (Dense Cosine Similarity Maps)
```

**How to detect**: If the query has multiple objects with different attributes → Use compositional model
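
The bag-of-concepts failure is easy to see at the word level: the two prompts contain exactly the same words, so any representation that ignores order cannot tell them apart. A small illustration in plain Python. It models the failure mode only, not CLIP's actual text encoder, and `attribute_pairs` assumes prompts shaped like "ATTR OBJ and ATTR OBJ".

```python
from collections import Counter

def bag_of_words(prompt):
    """Order-insensitive view of a prompt: word -> count."""
    return Counter(prompt.lower().split())

def attribute_pairs(prompt):
    """Order-sensitive view: the set of (attribute, object) pairs."""
    return {tuple(phrase.split()) for phrase in prompt.lower().split(" and ")}

a = "red car and blue truck"
b = "blue car and red truck"

same_bag = bag_of_words(a) == bag_of_words(b)             # identical word counts
same_bindings = attribute_pairs(a) == attribute_pairs(b)  # different pairings
```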

---

Evolution Timeline

2021: CLIP Released

  • Revolutionary: zero-shot, 400M image-text pairs
  • Widely adopted for everything
  • Limitations not yet understood

2022-2023: Limitations Discovered

  • Cannot count objects
  • Poor at fine-grained classification
  • Fails spatial reasoning
  • Can't bind attributes

2024: Alternatives Emerge

  • DCSMs: Preserve patch/token topology
  • PC-CLIP: Trained on pairwise comparisons
  • SpLiCE: Sparse interpretable embeddings

2025: Current Best Practices

  • Use CLIP for what it's good at
  • Use task-specific models where CLIP falls short
  • Use compositional models for complex queries

LLM Mistake: LLMs trained on 2021-2023 data tend to suggest CLIP for everything because its limitations weren't yet widely known. This skill corrects that.


Validation Script

Before using CLIP, check if it's appropriate:
```bash
python scripts/validate_clip_usage.py \
    --query "your query here" \
    --check-all
```
Returns:
  • ✅ CLIP is appropriate
  • ❌ Use alternative (with suggestion)
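
The script itself isn't reproduced here. As a rough idea of what such a check could look like, the sketch below applies the "How to detect" rules from the anti-patterns as keyword heuristics. `recommend_model` and its keyword lists are assumptions made for illustration, not the contents of `validate_clip_usage.py`. Keyword matching will miss paraphrases, and fine-grained queries are left out because they are hard to spot from keywords alone.

```python
import re

# Keyword heuristics drawn from the "How to detect" rules (illustrative, not exhaustive)
COUNTING = re.compile(r"\b(how many|count|number of)\b", re.IGNORECASE)
SPATIAL = re.compile(
    r"\b(left|right|above|below|under|next to|behind|in front of)\b", re.IGNORECASE
)
COLORS = r"(red|blue|green|yellow|black|white)"
# Two or more "<color> <noun>" phrases suggest an attribute-binding query
ATTRIBUTE_PAIR = re.compile(rf"\b{COLORS}\s+\w+", re.IGNORECASE)

def recommend_model(query):
    """Return (is_clip_ok, suggestion) for a natural-language image query."""
    if COUNTING.search(query):
        return False, "object detection (e.g. DETR)"
    if SPATIAL.search(query):
        return False, "spatial relation model"
    if len(ATTRIBUTE_PAIR.findall(query)) >= 2:
        return False, "compositional model (e.g. PC-CLIP, DCSMs)"
    return True, "CLIP"

beach = recommend_model("find beach images")
counting = recommend_model("how many cars are parked?")
```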

Task-Specific Guidance

Image Search (CLIP ✓)

```python
# Good use of CLIP
queries = ["beach", "mountain", "city skyline"]
# Works well for broad semantic concepts
```

Zero-Shot Classification (CLIP ✓)

```python
# Good: Broad categories
categories = ["indoor", "outdoor", "nature", "urban"]
# CLIP excels at this
```
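
Zero-shot classification with CLIP usually means wrapping each category in a prompt, scoring the image against every prompt, and taking a softmax over the scaled similarities. The sketch below covers only the last step, using made-up cosine similarities. The "{category} photo" template and the scale of 100 (close to the temperature CLIP learns) are assumptions you may want to tune.

```python
import math

def zero_shot_probs(similarities, scale=100.0):
    """Softmax over scaled cosine similarities, one per category."""
    logits = [scale * s for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

categories = ["indoor", "outdoor", "nature", "urban"]
prompts = [f"{c} photo" for c in categories]  # text that would be fed to the text encoder

# Hypothetical cosine similarities between one image and each prompt
similarities = [0.18, 0.24, 0.27, 0.19]
probs = zero_shot_probs(similarities)
prediction = categories[probs.index(max(probs))]
```

Small similarity gaps become large probability gaps after scaling, which is why raw cosine values that look close together can still give a confident prediction.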

Object Counting (CLIP ✗)

```python
# Use object detection instead
from transformers import DetrImageProcessor, DetrForObjectDetection
# See /references/object_detection.md
```

Fine-Grained Classification (CLIP ✗)

```python
# Use specialized models
# See /references/fine_grained_models.md
```

Spatial Reasoning (CLIP ✗)

```python
# Use spatial relation models
# See /references/spatial_models.md
```

---

Troubleshooting

Issue: CLIP gives unexpected results

Check:
  1. Is this a counting task? → Use object detection
  2. Fine-grained classification? → Use specialized model
  3. Spatial query? → Use spatial model
  4. Multiple objects with attributes? → Use compositional model
Validation:
```bash
python scripts/diagnose_clip_issue.py --image path/to/image --query "your query"
```

Issue: Low similarity scores

Possible causes:
  1. Query too specific (CLIP works better with broad concepts)
  2. Fine-grained task (not CLIP's strength)
  3. Threshold set too high (raw CLIP cosine similarities are often low even for good matches)
Solution: Try a broader query, lower the threshold, or use an alternative model
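
If the threshold is the problem, one option is to calibrate it against scores for pairs you know are unrelated, rather than picking an absolute number. A minimal sketch, assuming you have collected such negative scores. `calibrate_threshold` and the choice of mean plus two standard deviations are illustrative, not a rule that comes from CLIP itself.

```python
import statistics

def calibrate_threshold(negative_scores, num_std=2.0):
    """Threshold = mean of known-unrelated scores plus `num_std` standard deviations."""
    mean = statistics.mean(negative_scores)
    std = statistics.stdev(negative_scores)
    return mean + num_std * std

# Hypothetical cosine similarities for image/text pairs known to be unrelated
negative_scores = [0.12, 0.15, 0.14, 0.16, 0.13]
threshold = calibrate_threshold(negative_scores)

# Hypothetical scores for one query against two candidate images
candidate_scores = {"beach.jpg": 0.27, "office.jpg": 0.15}
matches = [name for name, score in candidate_scores.items() if score >= threshold]
```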


Model Selection Guide

| Model | Best For | Avoid For |
|-------|----------|-----------|
| CLIP ViT-L/14 | Semantic search, broad categories | Counting, fine-grained, spatial |
| DETR | Object detection, counting | Semantic similarity |
| DINOv2 | Fine-grained features | Text-image matching |
| PC-CLIP | Attribute binding, comparisons | General embedding |
| DCSMs | Compositional reasoning | Simple similarity |

Performance Notes

CLIP models:
  • ViT-B/32: Fast, lower quality
  • ViT-L/14: Balanced (recommended)
  • ViT-g-14: Highest quality, slower
Inference time (single image, CPU):
  • ViT-B/32: ~100ms
  • ViT-L/14: ~300ms
  • ViT-g-14: ~1000ms
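
These figures depend heavily on hardware, so it's worth measuring on your own machine. The helper below is a generic timing sketch: `measure_latency_ms` is a hypothetical name, and `fake_inference` stands in for a real call such as `model.get_image_features(**inputs)`.

```python
import statistics
import time

def measure_latency_ms(fn, warmup=2, runs=10):
    """Median wall-clock latency of `fn()` in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

def fake_inference():
    """Placeholder workload; replace with a real model call."""
    time.sleep(0.005)

latency = measure_latency_ms(fake_inference)
```

The median is used rather than the mean so that a single slow run, for example from a cold cache, doesn't skew the result.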

Further Reading

  • /references/clip_limitations.md
    - Detailed analysis of CLIP's failures
  • /references/alternatives.md
    - When to use what model
  • /references/compositional_reasoning.md
    - DCSMs and PC-CLIP deep dive
  • /scripts/validate_clip_usage.py
    - Pre-flight validation tool
  • /scripts/diagnose_clip_issue.py
    - Debug unexpected results

See CHANGELOG.md for version history.