CLIP-Aware Image Embeddings

Smart image-text matching that knows when CLIP works and when to use alternatives.

MCP Integrations

MCP	Purpose
Firecrawl	Research latest CLIP alternatives and benchmarks
Hugging Face (if configured)	Access model cards and documentation

Quick Decision Tree

Your task:
├─ Semantic search ("find beach images") → CLIP ✓
├─ Zero-shot classification (broad categories) → CLIP ✓
├─ Counting objects → CLIP ✗, use DETR or Faster R-CNN
├─ Fine-grained ID (celebrities, car models) → CLIP ✗, use a specialized model
├─ Spatial relations ("cat left of dog") → CLIP ✗, use a spatial model (e.g. trained on GQA or SWIG)
└─ Compositional ("red car AND blue truck") → CLIP ✗, use DCSMs or PC-CLIP

When to Use This Skill

✅ Use for:

- Semantic image search
- Broad category classification
- Image similarity matching
- Zero-shot tasks on new categories

❌ Do NOT use for:

- Counting objects in images
- Fine-grained classification
- Spatial understanding
- Attribute binding
- Negation handling

Installation

bash

pip install transformers pillow torch sentence-transformers --break-system-packages

Validation: Run

python scripts/validate_setup.py

Basic Usage

Image Search

python

import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Embed images
images = [Image.open(f"img{i}.jpg") for i in range(10)]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**inputs)

# Search with text
text_inputs = processor(text=["a beach at sunset"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)

# Compute similarity: L2-normalize so the dot product is cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).squeeze(1)  # one score per image
ranking = similarity.argsort(descending=True)  # best match first
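The ranking step can be tried without downloading a model. The sketch below is plain Python on made-up 3-dimensional vectors (real ViT-L/14 embeddings are much larger); `cosine` and `rank_images` are illustrative helper names, not part of `transformers`. It shows why normalizing matters: image 2 has the largest raw dot product with the query only because of its large norm, yet it is not the closest in direction.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(image_vecs, text_vec):
    """Return image indices sorted from most to least similar, plus the scores."""
    scores = [cosine(v, text_vec) for v in image_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy embeddings, made up for illustration
image_vecs = [
    [0.9, 0.1, 0.0],    # image 0: points almost exactly along the query
    [0.1, 0.9, 0.2],    # image 1: points elsewhere
    [10.0, 9.0, 0.0],   # image 2: large norm, intermediate direction
]
text_vec = [1.0, 0.0, 0.0]

order, scores = rank_images(image_vecs, text_vec)
raw_dots = [sum(x * y for x, y in zip(v, text_vec)) for v in image_vecs]
```

With cosine similarity image 0 ranks first; ranking by `raw_dots` would instead put image 2 first.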

Common Anti-Patterns

Anti-Pattern 1: "CLIP for Everything"

❌ Wrong:

```python
# Using CLIP to count cars in an image
prompt = "How many cars are in this image?"
# CLIP cannot count - it will give nonsense results
```

**Why wrong**: CLIP pools each image into a single global embedding, so per-object information such as how many instances are present is largely lost. Similarity scores for "two cars" versus "five cars" are not reliable counts.

**✓ Right**:
```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

# Detect objects
image = Image.open("street.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into scored, labeled boxes in image coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]

# Filter for cars and count
count = sum(model.config.id2label[i.item()] == "car" for i in results["labels"])
```


**How to detect**: If query contains "how many", "count", or numeric questions → Use object detection
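
The counting step itself is just a filter over detector output. As a minimal sketch, assuming the detections have already been converted to plain lists of label ids and confidence scores, the hypothetical helper `count_labels` below tallies confident detections per class. The `id2label` mapping and the detections are made up for illustration.

```python
from collections import Counter

def count_labels(labels, scores, id2label, min_score=0.9):
    """Count detections per class name, keeping only confident ones."""
    counts = Counter()
    for label_id, score in zip(labels, scores):
        if score >= min_score:
            counts[id2label[int(label_id)]] += 1
    return counts

# Made-up detections standing in for one image's post-processed output
id2label = {3: "car", 1: "person"}
labels = [3, 3, 1, 3]
scores = [0.98, 0.95, 0.99, 0.42]  # the last car is below the cutoff

counts = count_labels(labels, scores, id2label)
```

The `min_score` cutoff is the main tuning knob: a lower value counts more tentative detections, a higher one risks missing partially occluded objects.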

---


Anti-Pattern 2: Fine-Grained Classification

❌ Wrong:

```python
# Trying to identify specific celebrities with CLIP
prompts = ["Tom Hanks", "Brad Pitt", "Morgan Freeman"]
# CLIP will perform poorly - not trained for fine-grained face ID
```

**Why wrong**: CLIP is trained on web image-caption pairs, which mostly supervise broad concepts. Distinguishing faces, car models, or flower species within one category usually requires a specialized model.

**✓ Right**:
```python
# Use a fine-tuned face recognition model
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained(
    "microsoft/resnet-50"  # Then fine-tune on celebrity dataset
)

# Or use dedicated face recognition: ArcFace, CosFace
```


**How to detect**: If query asks to distinguish between similar items in same category → Use specialized model

---



Anti-Pattern 3: Spatial Understanding

❌ Wrong:

```python
# CLIP cannot understand spatial relationships
prompts = [
    "cat to the left of dog",
    "cat to the right of dog",
]
# Will give nearly identical scores
```

**Why wrong**: CLIP's pooled embedding discards most spatial layout, and its text side is only weakly sensitive to word order, so "left" and "right" prompts behave almost like a bag of words.

**✓ Right**:
```python
# Use a spatial reasoning model
# Examples: GQA models, Visual Genome models, SWIG
# NOTE: `swig_model` and `SpatialRelationModel` are placeholder names for
# whichever spatial model you adopt, not a published package.
from swig_model import SpatialRelationModel

model = SpatialRelationModel()
result = model.predict_relation(image, "cat", "dog")
# Returns: "left", "right", "above", "below", etc.
```


**How to detect**: If query contains directional words (left, right, above, under, next to) → Use spatial model
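
When a learned spatial model is not available, a simple geometric fallback is to run an object detector and compare bounding boxes. This is a different, cruder technique than the GQA- or SWIG-style models named above, and the function names and box values here are illustrative assumptions. Boxes use `(x_min, y_min, x_max, y_max)` in image coordinates, where y grows downward.

```python
def box_center(box):
    """Center (x, y) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2, (y_min + y_max) / 2

def spatial_relation(subject_box, object_box):
    """Describe where the subject is relative to the object.

    The axis with the larger center offset decides the answer.
    """
    sx, sy = box_center(subject_box)
    ox, oy = box_center(object_box)
    dx, dy = sx - ox, sy - oy
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "above" if dy < 0 else "below"

cat_box = (40, 120, 160, 260)    # made-up detector output
dog_box = (300, 100, 480, 280)

relation = spatial_relation(cat_box, dog_box)
```

Unlike the CLIP prompts above, swapping the two arguments flips the answer, which is the property a spatial query needs.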

---



Anti-Pattern 4: Attribute Binding

❌ Wrong:

```python
prompts = [
    "red car and blue truck",
    "blue car and red truck"
]
# CLIP often gives similar scores for both
```

**Why wrong**: CLIP binds attributes to objects only weakly. It tends to treat the prompt as a bag of concepts ("red, blue, car, truck"), so swapping the colors barely changes the score.

**✓ Right - Use PC-CLIP or DCSMs**:
```python
# PC-CLIP: Fine-tuned for pairwise comparisons
# NOTE: the `pc_clip` module and the "pc-clip-vit-l" checkpoint name are
# placeholders; check the PC-CLIP release for the actual import path and weights.
from pc_clip import PCCLIPModel

model = PCCLIPModel.from_pretrained("pc-clip-vit-l")

# Or use DCSMs (Dense Cosine Similarity Maps)
```


**How to detect**: If query has multiple objects with different attributes → Use compositional model
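
The "bag of concepts" problem can be made concrete with a toy example. CLIP's text encoder is not literally a bag-of-words model, so this is only an analogy: it shows that any representation that ignores word order cannot tell the two prompts apart.

```python
from collections import Counter

def bag_of_words(prompt):
    """Lowercased word counts, ignoring word order."""
    return Counter(prompt.lower().split())

a = "red car and blue truck"
b = "blue car and red truck"

same_bag = bag_of_words(a) == bag_of_words(b)   # order-insensitive view
same_text = a == b                              # order-sensitive view
```

Here `same_bag` is true while `same_text` is false: the prompts differ only in which attribute attaches to which object, which is exactly the information a compositional model has to preserve.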

---



Evolution Timeline

2021: CLIP Released

- Revolutionary: zero-shot, 400M image-text pairs
- Widely adopted for everything
- Limitations not yet understood

2022-2023: Limitations Discovered

- Cannot count objects
- Poor at fine-grained classification
- Fails spatial reasoning
- Can't bind attributes

2024: Alternatives Emerge

- DCSMs: Preserve patch/token topology
- PC-CLIP: Trained on pairwise comparisons
- SpLiCE: Sparse interpretable embeddings

2025: Current Best Practices

- Use CLIP for what it's good at
- Use task-specific models where CLIP is limited
- Use compositional models for complex queries

LLM Mistake: LLMs trained mostly on 2021-2023 data tend to suggest CLIP for every image-text task, because its limitations were not yet widely documented. This skill corrects for that bias.

Validation Script

Before using CLIP, check if it's appropriate:

bash

python scripts/validate_clip_usage.py \
    --query "your query here" \
    --check-all

Returns:

✅ CLIP is appropriate
❌ Use alternative (with suggestion)
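
The contents of `scripts/validate_clip_usage.py` are not shown here. As a rough sketch of the kind of check the "How to detect" rules describe, the hypothetical function `suggest_model` below routes a query with keyword heuristics. The keyword lists and the color set are assumptions chosen for illustration, and a real validator would need broader coverage.

```python
import re

# Small illustrative vocabulary; extend for real use
COLORS = {"red", "blue", "green", "yellow", "black", "white"}
FINE_GRAINED_WORDS = {"species", "breed", "celebrity"}

def suggest_model(query):
    """Return (task, suggestion) for a free-text query using keyword heuristics."""
    q = query.lower()
    words = set(re.findall(r"[a-z]+", q))
    if re.search(r"\b(how many|count|number of)\b", q):
        return "counting", "object detection (e.g. DETR)"
    if re.search(r"\b(left|right|above|below|under|next to)\b", q):
        return "spatial", "spatial reasoning model"
    if " and " in q and len(words & COLORS) >= 2:
        return "compositional", "compositional model (e.g. PC-CLIP or DCSMs)"
    if words & FINE_GRAINED_WORDS or "who is" in q:
        return "fine-grained", "specialized fine-grained model"
    return "semantic", "CLIP"

for example in ["a beach at sunset", "How many cars are in this image?"]:
    print(example, "->", suggest_model(example))
```

Keyword rules like these produce false positives (for example "right" in a non-spatial sense), so they are better treated as a prompt to double-check the model choice than as a hard gate.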

Task-Specific Guidance

Image Search (CLIP ✓)

python

# Good use of CLIP
queries = ["beach", "mountain", "city skyline"]
# Works well for broad semantic concepts

Zero-Shot Classification (CLIP ✓)

python

# Good: Broad categories
categories = ["indoor", "outdoor", "nature", "urban"]
# CLIP excels at this
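
For zero-shot classification the usual recipe is a softmax over the candidate labels for each image, after scaling the cosine similarities. The sketch below works on made-up similarity values so it runs without a model. The `logit_scale` of 100 is an assumed stand-in for the scale a CLIP checkpoint learns, and `classify` is an illustrative helper name.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def classify(similarities, labels, logit_scale=100.0):
    """Turn per-label cosine similarities into probabilities and pick the best label."""
    probs = softmax([logit_scale * s for s in similarities])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

labels = ["indoor", "outdoor", "nature", "urban"]
similarities = [0.18, 0.24, 0.27, 0.20]  # made-up cosine similarities for one image

label, probs = classify(similarities, labels)
```

Note that the softmax runs across labels for a single image. Small gaps in cosine similarity become large gaps in probability once scaled, which is why the probabilities look confident even when the raw scores are close.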

Object Counting (CLIP ✗)

python

# Use object detection instead
from transformers import DetrImageProcessor, DetrForObjectDetection
# See /references/object_detection.md

Fine-Grained Classification (CLIP ✗)

python

# Use specialized models
# See /references/fine_grained_models.md

Spatial Reasoning (CLIP ✗)

python

# Use spatial relation models
# See /references/spatial_models.md

---

Troubleshooting

Issue: CLIP gives unexpected results

Check:

- Is this a counting task? → Use object detection
- Fine-grained classification? → Use specialized model
- Spatial query? → Use spatial model
- Multiple objects with attributes? → Use compositional model

Validation:

bash

python scripts/diagnose_clip_issue.py --image path/to/image --query "your query"

Issue: Low similarity scores

Possible causes:

- Query too specific (CLIP works better with broad concepts)
- Fine-grained task (not CLIP's strength)
- Need to adjust threshold

Solution: Try broader query or use alternative model
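
Absolute cosine scores from CLIP are often small even for good matches, so a fixed cutoff borrowed from another system may reject everything. One way to adjust the threshold, sketched here under the assumption that a handful of query results have been labeled by hand, is to pick the cutoff that best separates matches from non-matches. The scores and labels below are made up, and `best_threshold` is an illustrative helper.

```python
def best_threshold(scores, is_match):
    """Pick the score cutoff that maximizes accuracy on labeled examples.

    Ties are resolved in favor of the lowest cutoff.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        predictions = [s >= t for s in scores]
        correct = sum(p == m for p, m in zip(predictions, is_match))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Made-up cosine similarities with hand-assigned relevance labels
scores = [0.31, 0.28, 0.22, 0.19, 0.17, 0.12]
is_match = [True, True, True, False, True, False]

threshold, accuracy = best_threshold(scores, is_match)
```

Accuracy is used here for simplicity. If false positives and false negatives have different costs, precision or recall at the cutoff may be a better criterion.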

Model Selection Guide

Model	Best For	Avoid For
CLIP ViT-L/14	Semantic search, broad categories	Counting, fine-grained, spatial
DETR	Object detection, counting	Semantic similarity
DINOv2	Fine-grained features	Text-image matching
PC-CLIP	Attribute binding, comparisons	General embedding
DCSMs	Compositional reasoning	Simple similarity

Performance Notes

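CLIP models:

- ViT-B/32: Fast, lower quality
- ViT-L/14: Balanced (recommended)
- ViT-g-14: Highest quality, slower (an OpenCLIP model name; not part of the original OpenAI release)

Approximate inference time (single image, CPU; actual figures depend on hardware, threading, and precision):

- ViT-B/32: ~100ms
- ViT-L/14: ~300ms
- ViT-g-14: ~1000ms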
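
Because these timings vary so much between machines, it is worth measuring on the target hardware. The sketch below is a generic timing harness using only the standard library. `time_call` and `dummy_workload` are illustrative names, and the workload is a stand-in to be replaced with a real model call such as `lambda: model.get_image_features(**inputs)`.

```python
import statistics
import time

def time_call(fn, warmup=2, repeats=10):
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization before measuring
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def dummy_workload():
    """Stand-in for a model forward pass."""
    return sum(i * i for i in range(10_000))

median_ms = time_call(dummy_workload)
print(f"median: {median_ms:.2f} ms")
```

The median is reported rather than the mean so that one slow run, for example from a background process, does not distort the figure.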

Further Reading

- `/references/clip_limitations.md` - Detailed analysis of CLIP's failures
- `/references/alternatives.md` - When to use what model
- `/references/compositional_reasoning.md` - DCSMs and PC-CLIP deep dive
- `/scripts/validate_clip_usage.py` - Pre-flight validation tool
- `/scripts/diagnose_clip_issue.py` - Debug unexpected results

See CHANGELOG.md for version history.
