# CLIP - Contrastive Language-Image Pre-Training
OpenAI's model that understands images through natural language supervision.
## When to use CLIP
Use when:
- Zero-shot image classification (no training data needed)
- Image-text similarity/matching
- Semantic image search
- Content moderation (detect NSFW, violence)
- Visual question answering
- Cross-modal retrieval (image→text, text→image)
Metrics:
- 25,300+ GitHub stars
- Trained on 400M image-text pairs
- Matches ResNet-50 on ImageNet (zero-shot)
- MIT License
Consider alternatives:
- BLIP-2: Better captioning
- LLaVA: Vision-language chat
- Segment Anything: Image segmentation
## Quick start

### Installation
```bash
pip install git+https://github.com/openai/CLIP.git
pip install torch torchvision ftfy regex tqdm
```

### Zero-shot classification
```python
import torch
import clip
from PIL import Image

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

# Define possible labels
text = clip.tokenize(["a dog", "a cat", "a bird", "a car"]).to(device)

# Compute similarity
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Print results
labels = ["a dog", "a cat", "a bird", "a car"]
for label, prob in zip(labels, probs[0]):
    print(f"{label}: {prob:.2%}")
```

### Available models
```python
# Models (sorted by size)
models = [
    "RN50",      # ResNet-50
    "RN101",     # ResNet-101
    "ViT-B/32",  # Vision Transformer (recommended)
    "ViT-B/16",  # Better quality, slower
    "ViT-L/14",  # Best quality, slowest
]
model, preprocess = clip.load("ViT-B/32")
```

| Model | Parameters | Speed | Quality |
|-------|------------|-------|---------|
| RN50 | 102M | Fast | Good |
| ViT-B/32 | 151M | Medium | Better |
| ViT-L/14 | 428M | Slow | Best |

### Image-text similarity
```python
# Compute embeddings
image_features = model.encode_image(image)
text_features = model.encode_text(text)

# Normalize
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity
similarity = (image_features @ text_features.T).item()
print(f"Similarity: {similarity:.4f}")
```

### Semantic image search
```python
# Index images
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = []
for img_path in image_paths:
    image = preprocess(Image.open(img_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(image)
        embedding /= embedding.norm(dim=-1, keepdim=True)
    image_embeddings.append(embedding)
image_embeddings = torch.cat(image_embeddings)

# Search with text query
query = "a sunset over the ocean"
text_input = clip.tokenize([query]).to(device)
with torch.no_grad():
    text_embedding = model.encode_text(text_input)
    text_embedding /= text_embedding.norm(dim=-1, keepdim=True)

# Find most similar images
similarities = (text_embedding @ image_embeddings.T).squeeze(0)
top_k = similarities.topk(3)
for idx, score in zip(top_k.indices, top_k.values):
    print(f"{image_paths[idx]}: {score:.3f}")
```

### Content moderation
```python
# Define categories
categories = [
    "safe for work",
    "not safe for work",
    "violent content",
    "graphic content"
]
text = clip.tokenize(categories).to(device)

# Check image
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# Get classification
max_idx = probs.argmax().item()
max_prob = probs[0, max_idx].item()
print(f"Category: {categories[max_idx]} ({max_prob:.2%})")
```

### Batch processing
```python
# Process multiple images
images = [preprocess(Image.open(f"img{i}.jpg")) for i in range(10)]
images = torch.stack(images).to(device)
with torch.no_grad():
    image_features = model.encode_image(images)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Batch text
texts = ["a dog", "a cat", "a bird"]
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Similarity matrix (10 images × 3 texts)
similarities = image_features @ text_features.T
print(similarities.shape)  # (10, 3)
```

### Integration with vector databases
```python
# Store CLIP embeddings in Chroma/FAISS
import chromadb

client = chromadb.Client()
collection = client.create_collection("image_embeddings")

# Add image embeddings
for img_path, embedding in zip(image_paths, image_embeddings):
    collection.add(
        embeddings=[embedding.cpu().numpy().tolist()],
        metadatas=[{"path": img_path}],
        ids=[img_path]
    )

# Query with text
query = "a sunset"
with torch.no_grad():
    text_embedding = model.encode_text(clip.tokenize([query]).to(device))
results = collection.query(
    query_embeddings=text_embedding.cpu().numpy().tolist(),  # shape (1, 512)
    n_results=5
)
```

## Best practices
- Use ViT-B/32 for most cases - Good balance
- Normalize embeddings - Required for cosine similarity
- Batch processing - More efficient
- Cache embeddings - Expensive to recompute
- Use descriptive labels - Better zero-shot performance
- GPU recommended - 10-50× faster
- Preprocess images - Use provided preprocess function
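Caching can be sketched as a small disk-backed wrapper around the encoder. This is a minimal sketch: the `embed` function is a deterministic stand-in for `model.encode_image`/`model.encode_text` (so it runs without model weights), and the `clip_cache/` directory and hashing scheme are illustrative choices, not part of the CLIP API:

```python
import hashlib
from pathlib import Path

import numpy as np

CACHE_DIR = Path("clip_cache")
CACHE_DIR.mkdir(exist_ok=True)

def embed(key: str) -> np.ndarray:
    # Hypothetical stand-in for a CLIP encoder call; deterministic so
    # the cache can be demonstrated without a GPU or model weights.
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    return rng.standard_normal(512).astype(np.float32)

def cached_embed(key: str) -> np.ndarray:
    # One .npy file per item, keyed by a hash of the input
    path = CACHE_DIR / (hashlib.sha256(key.encode()).hexdigest() + ".npy")
    if path.exists():
        return np.load(path)               # cache hit: skip the encoder
    emb = embed(key)
    emb /= np.linalg.norm(emb)             # store normalized, ready for cosine similarity
    np.save(path, emb)
    return emb

first = cached_embed("a photo of a dog")   # computes and writes to disk
second = cached_embed("a photo of a dog")  # reads back from disk
print(np.allclose(first, second))          # True
```

Storing already-normalized vectors means lookups can go straight into a dot-product similarity without re-normalizing.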
## Performance
| Operation | CPU | GPU (V100) |
|---|---|---|
| Image encoding | ~200ms | ~20ms |
| Text encoding | ~50ms | ~5ms |
| Similarity compute | <1ms | <1ms |
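These numbers vary with hardware, model, and batch size; a warmup-then-average loop is a quick way to measure your own setup. This is a generic sketch with a stand-in workload; in practice you would time `model.encode_image(...)` under `torch.no_grad()` (and call `torch.cuda.synchronize()` around the timed region on GPU):

```python
import time

def time_op(fn, warmup=2, runs=10):
    # Warmup calls absorb one-time costs (JIT, kernel compilation, caches)
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1000.0  # ms per call

# Stand-in workload; replace with e.g. lambda: model.encode_image(images)
ms = time_op(lambda: sum(i * i for i in range(10_000)))
print(f"{ms:.3f} ms per call")
```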
## Limitations
- Not for fine-grained tasks - Best for broad categories
- Requires descriptive text - Vague labels perform poorly
- Biased on web data - May have dataset biases
- No bounding boxes - Whole image only
- Limited spatial understanding - Position/counting weak
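One common mitigation for vague labels is prompt templating: wrap each bare label in a few descriptive phrases and average the resulting text embeddings per label. The templates below are illustrative examples, not a fixed CLIP API; the encoding step is left as a comment since it needs a loaded model:

```python
labels = ["dog", "cat", "bird"]
templates = [
    "a photo of a {}",
    "a close-up photo of a {}",
    "a low-resolution photo of a {}",
]

# One prompt per (label, template) pair, grouped by label
prompts_by_label = {
    label: [t.format(label) for t in templates] for label in labels
}
print(prompts_by_label["dog"][0])  # a photo of a dog

# In practice (sketch): encode all prompts with clip.tokenize + model.encode_text,
# normalize, then average the prompt embeddings within each label before
# comparing against image embeddings.
```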
## Resources
- GitHub: https://github.com/openai/CLIP ⭐ 25,300+
- Paper: https://arxiv.org/abs/2103.00020
- Colab: https://colab.research.google.com/github/openai/clip/
- License: MIT