
CLIP - Contrastive Language-Image Pre-Training

OpenAI's model that understands images from natural language.

When to use CLIP

Use when:
  • Zero-shot image classification (no training data needed)
  • Image-text similarity/matching
  • Semantic image search
  • Content moderation (detect NSFW, violence)
  • Visual question answering
  • Cross-modal retrieval (image→text, text→image)
Metrics:
  • 25,300+ GitHub stars
  • Trained on 400M image-text pairs
  • Matches ResNet-50 on ImageNet (zero-shot)
  • MIT License
Consider alternatives:
  • BLIP-2: Better captioning
  • LLaVA: Vision-language chat
  • Segment Anything: Image segmentation

Quick start

Installation

```bash
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
```

Zero-shot classification

```python
import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
text = clip.tokenize(["a dog", "a cat", "a bird", "a car"]).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
labels = ["a dog", "a cat", "a bird", "a car"]
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```
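The final softmax simply converts the image-text similarity logits into a probability distribution over the labels. A stdlib-only sketch with made-up logits (the numbers are illustrative, not real CLIP outputs):

```python
import math

# Hypothetical logits for ["a dog", "a cat", "a bird", "a car"]
logits = [25.0, 20.0, 18.0, 15.0]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
print([round(p, 4) for p in probs])  # [0.9924, 0.0067, 0.0009, 0.0]
```

Because softmax is exponential in the logit gaps, even a modest lead for one label turns into a near-certain prediction.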

Available models

```python
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]
model, preprocess = clip.load("ViT-B/32")
```

| Model | Parameters | Speed | Quality |
|-------|------------|-------|---------|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |

Image-text similarity

```python
# Compute embeddings
image_features = model.encode_image(image)
text_features = model.encode_text(text)

# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")
```
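The normalize-then-dot-product pattern above is exactly cosine similarity. A framework-free sketch with toy 4-d vectors (real CLIP embeddings are 512-d):

```python
import math

# Toy 4-d vectors standing in for CLIP embeddings.
image_feat = [3.0, 0.0, 4.0, 0.0]
text_feat = [1.0, 0.0, 1.0, 0.0]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# L2-normalize, mirroring the /= norm(...) lines above.
image_unit = [x / norm(image_feat) for x in image_feat]
text_unit = [x / norm(text_feat) for x in text_feat]

# After normalization, the plain dot product IS the cosine similarity.
similarity = dot(image_unit, text_unit)
cosine = dot(image_feat, text_feat) / (norm(image_feat) * norm(text_feat))
print(f"Similarity: {similarity:.4f}")  # Similarity: 0.9899
```

This is why normalization is required: without it, the dot product also reflects vector magnitudes, not just direction.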

Semantic image search

```python
# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []
for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
        image_embeddings.append(embedding)
image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)
for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
```
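Once embeddings are normalized, retrieval is just a dot product plus a top-k sort. A stdlib-only sketch of the same ranking on toy 3-d unit vectors (paths and numbers are made up):

```python
# Toy stand-ins for real CLIP outputs: three unit image vectors, one query.
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
text_embedding = [0.6, 0.8, 0.0]  # unit-length query vector

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Score every image against the query, then sort descending: a top-k.
scores = [dot(text_embedding, emb) for emb in image_embeddings]
ranked = sorted(zip(image_paths, scores), key=lambda p: p[1], reverse=True)
for path, score in ranked[:3]:
    print(f"{path}: {score:.3f}")  # img3.jpg ranks first
```

For more than a few thousand images, the linear scan is usually replaced by an approximate index (FAISS, Chroma), as in the vector-database section below.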

Content moderation

```python
# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content",
]
text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()
print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
```
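Argmax alone can be brittle for moderation: an image may be mostly "safe" yet still carry a worrying probability mass on an unsafe category. A common pattern (not something CLIP provides) is to flag when any unsafe category clears a threshold. A sketch with hypothetical probabilities:

```python
# Threshold-based flagging over category probabilities (toy numbers).
categories = ["safe for work", "not safe for work",
              "violent content", "graphic content"]
probs = [0.70, 0.05, 0.20, 0.05]  # hypothetical softmax output

UNSAFE = {"not safe for work", "violent content", "graphic content"}
THRESHOLD = 0.15  # illustrative; tune on a labeled validation set

flagged = [c for c, p in zip(categories, probs)
           if c in UNSAFE and p >= THRESHOLD]
print(flagged)  # ['violent content']
```

Here argmax would return "safe for work", but the threshold rule still routes the image to review.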

Batch processing

```python
# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)
with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # (10, 3)
```
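The resulting (10, 3) matrix scores every image against every text. The same matmul written out with plain Python loops on deterministic dummy features, just to make the shapes concrete:

```python
# Shapes of the batch similarity matrix, with dummy features
# (the values are arbitrary; only the dimensions matter here).
num_images, num_texts, dim = 10, 3, 4
image_features = [[float(i % 2)] * dim for i in range(num_images)]
text_features = [[float(t + 1)] * dim for t in range(num_texts)]

# (num_images, dim) @ (dim, num_texts) -> (num_images, num_texts)
similarities = [
    [sum(a * b for a, b in zip(img, txt)) for txt in text_features]
    for img in image_features
]
print(len(similarities), len(similarities[0]))  # 10 3
```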

Integration with vector databases

```python
# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path],
    )

# Query with text
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
results = collection.query(
    query_embeddings=text_embedding.cpu().numpy().tolist(),  # already (1, dim)
    n_results=5,
)
```

Best practices

  1. Use ViT-B/32 for most cases - Good balance of speed and quality
  2. Normalize embeddings - Required for cosine similarity
  3. Batch processing - More efficient than encoding one at a time
  4. Cache embeddings - Expensive to recompute
  5. Use descriptive labels - Better zero-shot performance
  6. GPU recommended - 10-50× faster than CPU
  7. Preprocess images - Use the provided preprocess function
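Practice 4 can be as little as memoizing on the image path. A minimal stdlib sketch, where `fake_encode` is a hypothetical stand-in for an expensive `model.encode_image` call:

```python
# Minimal embedding cache keyed by image path (illustrative only;
# `fake_encode` stands in for an expensive model.encode_image call).
calls = {"count": 0}

def fake_encode(path):
    calls["count"] += 1
    return [float(len(path))]  # dummy "embedding"

_cache = {}

def get_embedding(path):
    if path not in _cache:
        _cache[path] = fake_encode(path)
    return _cache[path]

get_embedding("photo.jpg")
get_embedding("photo.jpg")  # served from cache; encoder not re-run
print(calls["count"])  # 1
```

For persistence across runs, the same idea extends to saving arrays to disk (e.g. `numpy.save`) or storing them in a vector database, as shown above.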

Performance

| Operation | CPU | GPU (V100) |
|-----------|-----|------------|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity compute | <1ms | <1ms |

Limitations

  1. Not for fine-grained tasks - Best for broad categories
  2. Requires descriptive text - Vague labels perform poorly
  3. Trained on web data - May inherit its biases
  4. No bounding boxes - Whole-image understanding only
  5. Limited spatial understanding - Weak at position and counting
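A common mitigation for limitation 2 is to expand bare labels with a prompt template such as "a photo of a {label}" (the template used in the CLIP paper's zero-shot experiments) before tokenizing:

```python
# Expand bare labels into descriptive prompts before clip.tokenize().
labels = ["dog", "cat", "bird"]
template = "a photo of a {}"

prompts = [template.format(label) for label in labels]
print(prompts)  # ['a photo of a dog', 'a photo of a cat', 'a photo of a bird']
```

Tokenizing `prompts` instead of `labels` typically improves zero-shot accuracy, since the training captions were full sentences rather than single words.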

Resources