
CLIP - Contrastive Language-Image Pre-Training

OpenAI's model that understands images from natural language.

When to use CLIP

Use when:
  • Zero-shot image classification (no training data needed)
  • Image-text similarity/matching
  • Semantic image search
  • Content moderation (detect NSFW, violence)
  • Visual question answering
  • Cross-modal retrieval (image→text, text→image)
Metrics:
  • 25,300+ GitHub stars
  • Trained on 400M image-text pairs
  • Matches ResNet-50 on ImageNet (zero-shot)
  • MIT License
Consider alternatives:
  • BLIP-2: Better captioning
  • LLaVA: Vision-language chat
  • Segment Anything: Image segmentation

Quick start

Installation

```bash
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
```

Zero-shot classification

```python
import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
text = clip.tokenize(["a dog", "a cat", "a bird", "a car"]).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
labels = ["a dog", "a cat", "a bird", "a car"]
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```
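The final softmax simply converts the image-text similarity logits into a probability distribution over the labels. A stdlib-only sketch with made-up logits (the numbers are illustrative, not real CLIP outputs):

```python
import math

# Hypothetical logits for ["a dog", "a cat", "a bird", "a car"]
logits = [25.0, 20.0, 18.0, 15.0]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
print([round(p, 4) for p in probs])  # [0.9924, 0.0067, 0.0009, 0.0]
```

Because softmax is exponential in the logit gaps, even a modest lead for one label turns into a near-certain prediction.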

Available models

```python
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]
model, preprocess = clip.load("ViT-B/32")
```

| Model | Parameters | Speed | Quality |
|-------|------------|-------|---------|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |

Image-text similarity

```python
# Compute embeddings
image_features = model.encode_image(image)
text_features = model.encode_text(text)

# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")
```
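The normalize-then-dot-product pattern above is exactly cosine similarity. A framework-free sketch with toy 4-d vectors (real CLIP embeddings are 512-d):

```python
import math

# Toy 4-d vectors standing in for CLIP embeddings.
image_feat = [3.0, 0.0, 4.0, 0.0]
text_feat = [1.0, 0.0, 1.0, 0.0]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# L2-normalize, mirroring the /= norm(...) lines above.
image_unit = [x / norm(image_feat) for x in image_feat]
text_unit = [x / norm(text_feat) for x in text_feat]

# After normalization, the plain dot product IS the cosine similarity.
similarity = dot(image_unit, text_unit)
cosine = dot(image_feat, text_feat) / (norm(image_feat) * norm(text_feat))
print(f"Similarity: {similarity:.4f}")  # Similarity: 0.9899
```

This is why normalization is required: without it, the dot product also reflects vector magnitudes, not just direction.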

Semantic image search

```python
# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []
for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
        image_embeddings.append(embedding)
image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)
for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
```
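Once embeddings are normalized, retrieval is just a dot product plus a top-k sort. A stdlib-only sketch of the same ranking on toy 3-d unit vectors (paths and numbers are made up):

```python
# Toy stand-ins for real CLIP outputs: three unit image vectors, one query.
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
text_embedding = [0.6, 0.8, 0.0]  # unit-length query vector

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Score every image against the query, then sort descending: a top-k.
scores = [dot(text_embedding, emb) for emb in image_embeddings]
ranked = sorted(zip(image_paths, scores), key=lambda p: p[1], reverse=True)
for path, score in ranked[:3]:
    print(f"{path}: {score:.3f}")  # img3.jpg ranks first
```

For more than a few thousand images, the linear scan is usually replaced by an approximate index (FAISS, Chroma), as in the vector-database section below.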

Content moderation

```python
# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content",
]
text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()
print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
```
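Argmax alone can be brittle for moderation: an image may be mostly "safe" yet still carry a worrying probability mass on an unsafe category. A common pattern (not something CLIP provides) is to flag when any unsafe category clears a threshold. A sketch with hypothetical probabilities:

```python
# Threshold-based flagging over category probabilities (toy numbers).
categories = ["safe for work", "not safe for work",
              "violent content", "graphic content"]
probs = [0.70, 0.05, 0.20, 0.05]  # hypothetical softmax output

UNSAFE = {"not safe for work", "violent content", "graphic content"}
THRESHOLD = 0.15  # illustrative; tune on a labeled validation set

flagged = [c for c, p in zip(categories, probs)
           if c in UNSAFE and p >= THRESHOLD]
print(flagged)  # ['violent content']
```

Here argmax would return "safe for work", but the threshold rule still routes the image to review.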

Batch processing

```python
# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)
with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # (10, 3)
```
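The resulting (10, 3) matrix scores every image against every text. The same matmul written out with plain Python loops on deterministic dummy features, just to make the shapes concrete:

```python
# Shapes of the batch similarity matrix, with dummy features
# (the values are arbitrary; only the dimensions matter here).
num_images, num_texts, dim = 10, 3, 4
image_features = [[float(i % 2)] * dim for i in range(num_images)]
text_features = [[float(t + 1)] * dim for t in range(num_texts)]

# (num_images, dim) @ (dim, num_texts) -> (num_images, num_texts)
similarities = [
    [sum(a * b for a, b in zip(img, txt)) for txt in text_features]
    for img in image_features
]
print(len(similarities), len(similarities[0]))  # 10 3
```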

Integration with vector databases

```python
# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path],
    )

# Query with text
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
results = collection.query(
    query_embeddings=text_embedding.cpu().numpy().tolist(),  # already (1, dim)
    n_results=5,
)
```

Best practices

  1. Use ViT-B/32 for most cases - Good balance of speed and quality
  2. Normalize embeddings - Required for cosine similarity
  3. Batch processing - More efficient than encoding one at a time
  4. Cache embeddings - Expensive to recompute
  5. Use descriptive labels - Better zero-shot performance
  6. GPU recommended - 10-50× faster than CPU
  7. Preprocess images - Use the provided preprocess function
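Practice 4 can be as little as memoizing on the image path. A minimal stdlib sketch, where `fake_encode` is a hypothetical stand-in for an expensive `model.encode_image` call:

```python
# Minimal embedding cache keyed by image path (illustrative only;
# `fake_encode` stands in for an expensive model.encode_image call).
calls = {"count": 0}

def fake_encode(path):
    calls["count"] += 1
    return [float(len(path))]  # dummy "embedding"

_cache = {}

def get_embedding(path):
    if path not in _cache:
        _cache[path] = fake_encode(path)
    return _cache[path]

get_embedding("photo.jpg")
get_embedding("photo.jpg")  # served from cache; encoder not re-run
print(calls["count"])  # 1
```

For persistence across runs, the same idea extends to saving arrays to disk (e.g. `numpy.save`) or storing them in a vector database, as shown above.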

Performance

| Operation | CPU | GPU (V100) |
|-----------|-----|------------|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity compute | <1ms | <1ms |

Limitations

  1. Not for fine-grained tasks - Best for broad categories
  2. Requires descriptive text - Vague labels perform poorly
  3. Trained on web data - May inherit its biases
  4. No bounding boxes - Whole-image understanding only
  5. Limited spatial understanding - Weak at position and counting
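A common mitigation for limitation 2 is to expand bare labels with a prompt template such as "a photo of a {label}" (the template used in the CLIP paper's zero-shot experiments) before tokenizing:

```python
# Expand bare labels into descriptive prompts before clip.tokenize().
labels = ["dog", "cat", "bird"]
template = "a photo of a {}"

prompts = [template.format(label) for label in labels]
print(prompts)  # ['a photo of a dog', 'a photo of a cat', 'a photo of a bird']
```

Tokenizing `prompts` instead of `labels` typically improves zero-shot accuracy, since the training captions were full sentences rather than single words.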

Resources