clip-aware-embeddings
CLIP-Aware Image Embeddings
Smart image-text matching that knows when CLIP works and when to use alternatives.
MCP Integrations
| MCP | Purpose |
|---|---|
| Firecrawl | Research latest CLIP alternatives and benchmarks |
| Hugging Face (if configured) | Access model cards and documentation |
Quick Decision Tree
```
Your task:
├─ Semantic search ("find beach images") → CLIP ✓
├─ Zero-shot classification (broad categories) → CLIP ✓
├─ Counting objects → DETR, Faster R-CNN ✗
├─ Fine-grained ID (celebrities, car models) → Specialized model ✗
├─ Spatial relations ("cat left of dog") → GQA, SWIG ✗
└─ Compositional ("red car AND blue truck") → DCSMs, PC-CLIP ✗
```
When to Use This Skill
✅ Use for:
- Semantic image search
- Broad category classification
- Image similarity matching
- Zero-shot tasks on new categories
❌ Do NOT use for:
- Counting objects in images
- Fine-grained classification
- Spatial understanding
- Attribute binding
- Negation handling
Installation
```bash
pip install transformers pillow torch sentence-transformers --break-system-packages
```
Validation: Run `python scripts/validate_setup.py`
Basic Usage
Image Search
```python
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Embed images (inference only, so skip gradient tracking)
images = [Image.open(f"img{i}.jpg") for i in range(10)]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**inputs)

# Search with text
text_inputs = processor(text=["a beach at sunset"], padding=True, return_tensors="pt")
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)

# Compute similarity: L2-normalize so the dot product is cosine similarity,
# then softmax over the images (dim=0) to rank them for the query
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (model.logit_scale.exp() * image_features @ text_features.T).softmax(dim=0)
```
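To turn similarity scores into ranked search results, one option is to sort images by cosine similarity to the query and keep the top few. The sketch below uses plain Python lists as stand-ins for CLIP embeddings; `cosine_similarity` and `top_k` are hypothetical helper names, not part of transformers.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, image_vecs, k=3):
    """Return (index, score) pairs for the k most similar images."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(image_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional vectors standing in for real CLIP embeddings
image_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.1, 0.0]
results = top_k(query_vec, image_vecs, k=2)
```

With real embeddings, the same ranking can be done on tensors; the list version is only meant to make the ordering logic explicit.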
Common Anti-Patterns
Anti-Pattern 1: "CLIP for Everything"
❌ Wrong:
```python
# Using CLIP to count cars in an image
prompt = "How many cars are in this image?"
# CLIP cannot count - it will give nonsense results
```
**Why wrong**: CLIP pools each image into a single global embedding, which discards per-object spatial information. Its similarity scores carry no reliable count signal.
**✓ Right**:
```python
import torch
from transformers import DetrImageProcessor, DetrForObjectDetection
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
# Detect objects (`image` is a PIL image loaded elsewhere)
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Convert raw outputs into labelled detections above a confidence threshold
detections = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=[image.size[::-1]]
)[0]
# Filter for cars and count
labels = [model.config.id2label[i.item()] for i in detections["labels"]]
count = labels.count("car")
```
**How to detect**: If the query contains "how many", "count", or another numeric question → Use object detection
---
Anti-Pattern 2: Fine-Grained Classification
❌ Wrong:
```python
# Trying to identify specific celebrities with CLIP
prompts = ["Tom Hanks", "Brad Pitt", "Morgan Freeman"]
# CLIP will perform poorly - not trained for fine-grained face ID
```
**Why wrong**: CLIP's contrastive training on image-caption pairs favors coarse semantic matching. Distinguishing faces, car models, or flower species requires specialized models.
**✓ Right**:
```python
# Use a fine-tuned face recognition model
from transformers import AutoImageProcessor, AutoModelForImageClassification
processor = AutoImageProcessor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained(
    "microsoft/resnet-50"  # Then fine-tune on celebrity dataset
)
# Or use dedicated face recognition: ArcFace, CosFace
```
**How to detect**: If the query asks to distinguish between similar items in the same category → Use specialized model
---
Anti-Pattern 3: Spatial Understanding
❌ Wrong:
```python
# CLIP cannot understand spatial relationships
prompts = [
    "cat to the left of dog",
    "cat to the right of dog"
]
# Will give nearly identical scores
```
**Why wrong**: CLIP's pooled embeddings discard spatial layout, and its text encoder behaves much like a bag of words, so swapping "left" for "right" barely changes the score.
**✓ Right**:
```python
# Use a spatial reasoning model
# Examples: GQA models, Visual Genome models, SWIG
# NOTE: `swig_model` and `SpatialRelationModel` are treated here as placeholder
# names for whichever spatial-relation model you adopt; they may not
# correspond to an installable package
from swig_model import SpatialRelationModel
model = SpatialRelationModel()
result = model.predict_relation(image, "cat", "dog")
# Returns: "left", "right", "above", "below", etc.
```
**How to detect**: If the query contains directional words (left, right, above, under, next to) → Use spatial model
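When a dedicated spatial model is not available, a simpler fallback (swapped in here, not taken from the original skill) is to run an object detector and compare bounding-box centers. The sketch assumes boxes in `(x_min, y_min, x_max, y_max)` pixel coordinates with the origin at the top-left; `box_center` and `box_relation` are hypothetical helpers.

```python
def box_center(box):
    """Center point of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)

def box_relation(subject_box, object_box):
    """Describe where the subject sits relative to the object."""
    sx, sy = box_center(subject_box)
    ox, oy = box_center(object_box)
    dx, dy = sx - ox, sy - oy
    # Use the dominant axis; image y coordinates grow downward
    if abs(dx) >= abs(dy):
        return "left" if dx < 0 else "right"
    return "above" if dy < 0 else "below"

# Made-up boxes for a cat and a dog detected in the same image
cat_box = (10, 50, 110, 150)
dog_box = (200, 60, 320, 170)
relation = box_relation(cat_box, dog_box)
```

This only captures 2D image-plane layout; relations such as "behind" or "inside" need depth cues or a learned model.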
---
Anti-Pattern 4: Attribute Binding
❌ Wrong:
```python
prompts = [
    "red car and blue truck",
    "blue car and red truck"
]
# CLIP often gives similar scores for both
```
**Why wrong**: CLIP cannot reliably bind attributes to objects. It tends to treat the prompt as a bag of concepts ("red", "blue", "car", "truck"), so swapping the colors barely changes the embedding.
**✓ Right - Use PC-CLIP or DCSMs**:
```python
# PC-CLIP: Fine-tuned for pairwise comparisons
# NOTE: the `pc_clip` import and the "pc-clip-vit-l" checkpoint name are
# illustrative; substitute the package and weights you actually use
from pc_clip import PCCLIPModel
model = PCCLIPModel.from_pretrained("pc-clip-vit-l")
# Or use DCSMs (Dense Cosine Similarity Maps)
```
**How to detect**: If the query has multiple objects with different attributes → Use compositional model
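One way to see why the two prompts in this anti-pattern are hard to tell apart, and to flag such pairs before they reach CLIP, is to compare them as unordered word collections. The helpers below are a hypothetical illustration of the bag-of-concepts failure mode, not a model of CLIP's actual text encoder.

```python
from collections import Counter

def bag_of_words(prompt):
    """Lowercased word counts, ignoring word order."""
    return Counter(prompt.lower().split())

def differ_only_in_binding(prompt_a, prompt_b):
    """True when two distinct prompts use exactly the same words."""
    return prompt_a != prompt_b and bag_of_words(prompt_a) == bag_of_words(prompt_b)

prompts = ["red car and blue truck", "blue car and red truck"]
needs_compositional_model = differ_only_in_binding(*prompts)
```

When this check is true for a pair of candidate captions, word order is the only thing separating them, which is the case where a compositional model is most likely to help.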
---
Evolution Timeline
2021: CLIP Released
- Revolutionary: zero-shot, 400M image-text pairs
- Widely adopted for everything
- Limitations not yet understood
2022-2023: Limitations Discovered
- Cannot count objects
- Poor at fine-grained classification
- Fails spatial reasoning
- Can't bind attributes
2024: Alternatives Emerge
- DCSMs: Preserve patch/token topology
- PC-CLIP: Trained on pairwise comparisons
- SpLiCE: Sparse interpretable embeddings
2025: Current Best Practices
- Use CLIP for what it's good at
- Task-specific models for limitations
- Compositional models for complex queries
LLM Mistake: LLMs trained mostly on 2021-2023 data tend to suggest CLIP for every vision-language task, because its limitations were not yet widely documented. This skill corrects that.
Validation Script
Before using CLIP, check if it's appropriate:
```bash
python scripts/validate_clip_usage.py \
  --query "your query here" \
  --check-all
```
Returns:
- ✅ CLIP is appropriate
- ❌ Use alternative (with suggestion)
Task-Specific Guidance
Image Search (CLIP ✓)
```python
# Good use of CLIP
queries = ["beach", "mountain", "city skyline"]
# Works well for broad semantic concepts
```
Zero-Shot Classification (CLIP ✓)
```python
# Good: Broad categories
categories = ["indoor", "outdoor", "nature", "urban"]
# CLIP excels at this
```
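Zero-shot accuracy is usually better when each category is wrapped in a caption-like template rather than passed as a bare word; the CLIP paper uses "a photo of a {label}." as its default. A minimal sketch, where `with_article` and `build_prompts` are hypothetical helpers and the template wording is only a starting point to tune:

```python
def with_article(word):
    """Prefix a word with 'a' or 'an' based on its first letter."""
    article = "an" if word[0].lower() in "aeiou" else "a"
    return f"{article} {word}"

def build_prompts(categories, template="a photo of {} scene."):
    """Wrap each category name in a caption-like template."""
    return [template.format(with_article(c)) for c in categories]

categories = ["indoor", "outdoor", "nature", "urban"]
prompts = build_prompts(categories)
```

The resulting strings can then be passed to the processor as `processor(text=prompts, padding=True, return_tensors="pt")`.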
Object Counting (CLIP ✗)
```python
# Use object detection instead
from transformers import DetrImageProcessor, DetrForObjectDetection
# See /references/object_detection.md
```
Fine-Grained Classification (CLIP ✗)
```python
# Use specialized models
# See /references/fine_grained_models.md
```
Spatial Reasoning (CLIP ✗)
```python
# Use spatial relation models
# See /references/spatial_models.md
```
---
Troubleshooting
Issue: CLIP gives unexpected results
Check:
- Is this a counting task? → Use object detection
- Fine-grained classification? → Use specialized model
- Spatial query? → Use spatial model
- Multiple objects with attributes? → Use compositional model
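The checklist above can be approximated with a few keyword rules. This is a rough sketch under stated assumptions, not the logic of the skill's own scripts: `suggest_model`, the regular expressions, and the `ATTRIBUTES` word list are all hypothetical, and fine-grained queries are not covered because they are hard to spot from keywords alone.

```python
import re

# Small illustrative attribute vocabulary; extend for your own data
ATTRIBUTES = {"red", "blue", "green", "yellow", "black", "white", "big", "small"}

COUNT_PATTERN = re.compile(r"\b(how many|count|counting|number of)\b")
SPATIAL_PATTERN = re.compile(r"\b(left|right|above|below|under|next to)\b")

def suggest_model(query):
    """Return a coarse model recommendation for a text query."""
    q = query.lower()
    if COUNT_PATTERN.search(q):
        return "object detection"
    if SPATIAL_PATTERN.search(q):
        return "spatial model"
    # Two or more distinct attribute words hint at attribute binding
    words = set(re.findall(r"[a-z]+", q))
    if len(ATTRIBUTES & words) >= 2:
        return "compositional model"
    return "CLIP"
```

Keyword rules like these produce false positives (for example, "right" used in a non-spatial sense), so treat the result as a hint rather than a verdict.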
Validation:
```bash
python scripts/diagnose_clip_issue.py --image path/to/image --query "your query"
```
Issue: Low similarity scores
Possible causes:
- Query too specific (CLIP works better with broad concepts)
- Fine-grained task (not CLIP's strength)
- Need to adjust threshold
Solution: Try broader query or use alternative model
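For the threshold adjustment mentioned above, one option is to pick the cutoff from a small labelled validation set instead of guessing. The sketch sweeps each observed score as a candidate threshold and keeps the one with the best F1; the scores and labels are made-up placeholders, and `f1_at` and `best_threshold` are hypothetical helpers.

```python
def f1_at(threshold, scores, labels):
    """F1 when every score >= threshold is predicted as a match."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels):
    """Try each observed score as a cutoff and return the best one."""
    return max(sorted(set(scores)), key=lambda t: f1_at(t, scores, labels))

# Placeholder similarity scores with ground-truth match labels
scores = [0.31, 0.28, 0.22, 0.19, 0.15]
labels = [True, True, False, True, False]
threshold = best_threshold(scores, labels)
```

F1 is only one possible criterion; a search application that cares more about precision could maximize precision at a minimum recall instead.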
Model Selection Guide
| Model | Best For | Avoid For |
|---|---|---|
| CLIP ViT-L/14 | Semantic search, broad categories | Counting, fine-grained, spatial |
| DETR | Object detection, counting | Semantic similarity |
| DINOv2 | Fine-grained features | Text-image matching |
| PC-CLIP | Attribute binding, comparisons | General embedding |
| DCSMs | Compositional reasoning | Simple similarity |
Performance Notes
CLIP models:
- ViT-B/32: Fast, lower quality
- ViT-L/14: Balanced (recommended)
- ViT-g-14: Highest quality, slower
Approximate inference time (single image, CPU; varies with hardware):
- ViT-B/32: ~100ms
- ViT-L/14: ~300ms
- ViT-g-14: ~1000ms
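Because these timings depend heavily on hardware, it is worth measuring on your own machine. A minimal sketch using only the standard library: `time_call` is a hypothetical helper, and `fake_inference` is a placeholder workload standing in for a real call such as `model.get_image_features(**inputs)`.

```python
import statistics
import time

def time_call(fn, warmup=1, repeats=5):
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are excluded from the measurement
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

def fake_inference():
    # Placeholder workload; replace with a real model call
    return sum(i * i for i in range(10_000))

median_ms = time_call(fake_inference)
```

The median is used rather than the mean so that a single slow run (for example, from a cold cache) does not skew the result.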
Further Reading
- /references/clip_limitations.md - Detailed analysis of CLIP's failures
- /references/alternatives.md - When to use what model
- /references/compositional_reasoning.md - DCSMs and PC-CLIP deep dive
- /scripts/validate_clip_usage.py - Pre-flight validation tool
- /scripts/diagnose_clip_issue.py - Debug unexpected results
See CHANGELOG.md for version history.