# Multimodal RAG

Build retrieval-augmented generation systems that handle images, text, and mixed content.
## Overview

- Image + text retrieval (product search, documentation)
- Cross-modal search (text query -> image results)
- Multimodal document processing (PDFs with charts)
- Visual question answering with context
- Image similarity and deduplication
- Hybrid search pipelines
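The similarity-and-deduplication use case above reduces to cosine comparisons over embeddings. The following is a minimal sketch assuming embeddings are already computed; the `dedupe_images` helper and the 0.95 threshold are illustrative choices, not part of any library API:

```python
def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)


def dedupe_images(embeddings: dict[str, list[float]], threshold: float = 0.95) -> list[str]:
    """Greedily keep one representative per near-duplicate cluster."""
    kept: list[str] = []
    for img_id, emb in embeddings.items():
        # Keep the image only if it is not too similar to anything already kept
        if all(cosine(emb, embeddings[k]) < threshold for k in kept):
            kept.append(img_id)
    return kept
```

Greedy dedup is O(n²); for large collections an approximate-nearest-neighbor index does the same job.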
## Architecture Approaches

| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Joint Embedding (CLIP) | Direct comparison | Limited context | Pure image search |
| Caption-based | Works with text LLMs | Lossy conversion | Existing text RAG |
| Hybrid | Best accuracy | More complex | Production systems |
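One common way a hybrid approach merges ranked lists from separate image and text indexes is reciprocal rank fusion. This sketch is generic; the `k = 60` damping constant is a widespread convention, not something this stack mandates:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists into one; k dampens top-rank dominance."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1 / (k + rank) style credit to the doc
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because it uses only ranks, RRF sidesteps the problem that image and text similarity scores live on different scales.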
## Embedding Models

| Model | Context | Modalities | Best For |
|---|---|---|---|
| Voyage multimodal-3 | 32K tokens | Text + Image | Long documents |
| SigLIP 2 | Standard | Text + Image | Large-scale retrieval |
| CLIP ViT-L/14 | 77 tokens | Text + Image | General purpose |
| ImageBind | Standard | 6 modalities | Audio/video included |
| ColPali | Document | Text + Image | PDF/document RAG |
## CLIP-Based Image Embeddings

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")


def embed_image(image_path: str) -> list[float]:
    """Generate CLIP embedding for an image."""
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)
    # Normalize for cosine similarity
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()


def embed_text(text: str) -> list[float]:
    """Generate CLIP embedding for a text query."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        embeddings = model.get_text_features(**inputs)
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()


# Cross-modal search: text -> images
def search_images(query: str, image_embeddings: list, top_k: int = 5):
    """Search images using a text query."""
    query_embedding = embed_text(query)
    # On normalized vectors, cosine similarity reduces to a dot product
    similarities = [
        np.dot(query_embedding, img_emb)
        for img_emb in image_embeddings
    ]
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return top_indices, [similarities[i] for i in top_indices]
```

## Voyage Multimodal-3 (Long Context)
```python
import base64
from typing import Optional

import voyageai

client = voyageai.Client()


def embed_multimodal_voyage(
    texts: Optional[list[str]] = None,
    images: Optional[list[str]] = None  # File paths
) -> list[list[float]]:
    """Embed text and/or images with a 32K-token context."""
    inputs = []
    if texts:
        inputs.extend([{"type": "text", "content": t} for t in texts])
    if images:
        for img_path in images:
            with open(img_path, "rb") as f:
                b64 = base64.b64encode(f.read()).decode()
            inputs.append({
                "type": "image",
                "content": f"data:image/png;base64,{b64}"
            })
    response = client.multimodal_embed(
        inputs=inputs,
        model="voyage-multimodal-3"
    )
    return response.embeddings
```

## Hybrid RAG Pipeline
```python
from typing import Optional


class MultimodalRAG:
    """Production multimodal RAG with hybrid retrieval."""

    def __init__(self, vector_db, vision_model, text_model):
        self.vector_db = vector_db
        self.vision_model = vision_model
        self.text_model = text_model

    async def index_document(
        self,
        doc_id: str,
        text: Optional[str] = None,
        image_path: Optional[str] = None,
        metadata: dict = None
    ):
        """Index a document with text and/or image."""
        embeddings = []
        if text:
            text_emb = embed_text(text)
            embeddings.append(("text", text_emb))
        if image_path:
            # Option 1: Direct image embedding
            img_emb = embed_image(image_path)
            embeddings.append(("image", img_emb))
            # Option 2: Generate caption for text search
            caption = await self.generate_caption(image_path)
            caption_emb = embed_text(caption)
            embeddings.append(("caption", caption_emb))
        # Store with shared document ID
        for emb_type, emb in embeddings:
            await self.vector_db.upsert(
                id=f"{doc_id}_{emb_type}",
                embedding=emb,
                metadata={
                    "doc_id": doc_id,
                    "type": emb_type,
                    "image_url": image_path,
                    "text": text,
                    **(metadata or {})
                }
            )

    async def generate_caption(self, image_path: str) -> str:
        """Generate a text caption for image indexing."""
        # Use GPT-4o or Claude for high-quality captions
        response = await self.vision_model.analyze(
            image_path,
            prompt="Describe this image in detail for search indexing. "
                   "Include objects, text, colors, and context."
        )
        return response

    async def retrieve(
        self,
        query: str,
        query_image: Optional[str] = None,
        top_k: int = 10
    ) -> list[dict]:
        """Hybrid retrieval with an optional image query."""
        results = []
        # Text query embedding
        text_emb = embed_text(query)
        text_results = await self.vector_db.search(
            embedding=text_emb,
            top_k=top_k
        )
        results.extend(text_results)
        # Image query embedding (if provided)
        if query_image:
            img_emb = embed_image(query_image)
            img_results = await self.vector_db.search(
                embedding=img_emb,
                top_k=top_k
            )
            results.extend(img_results)
        # Dedupe by doc_id, keep highest score
        seen = {}
        for r in results:
            doc_id = r["metadata"]["doc_id"]
            if doc_id not in seen or r["score"] > seen[doc_id]["score"]:
                seen[doc_id] = r
        return sorted(seen.values(), key=lambda x: x["score"], reverse=True)[:top_k]
```

## Claude Code PDF Handling (CC 2.1.30+)
For large PDFs, use the `pages` parameter to process in batches:

```python
import subprocess


# Process a large PDF in page-range batches for embedding
async def process_large_pdf_for_rag(pdf_path: str, pages_per_batch: int = 10):
    """Process a large PDF by page ranges before embedding."""
    # Get total page count
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True, text=True
    )
    total_pages = int([l for l in result.stdout.split('\n')
                       if 'Pages:' in l][0].split(':')[1].strip())
    chunks = []
    for start in range(1, total_pages + 1, pages_per_batch):
        end = min(start + pages_per_batch - 1, total_pages)
        # Read page range (CC 2.1.30 pages parameter)
        # Read(file_path=pdf_path, pages=f"{start}-{end}")
        # Extract and embed content from this range
        page_chunks = extract_chunks_from_range(pdf_path, start, end)
        chunks.extend(page_chunks)
    return chunks
```

## Limits
- Max 20 pages per Read request
- Max 20MB file size
- Process large documents in batches for embedding
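The limits above can be turned into concrete `pages`-style ranges with a small helper; the name `page_batches` is hypothetical, not a CC API:

```python
def page_batches(total_pages: int, pages_per_batch: int = 10, max_pages: int = 20) -> list[str]:
    """Split a document into Read-sized page ranges, capped at the per-request limit."""
    size = min(pages_per_batch, max_pages)  # never exceed the 20-page Read limit
    return [
        f"{start}-{min(start + size - 1, total_pages)}"
        for start in range(1, total_pages + 1, size)
    ]
```

Each returned string can be passed directly as a `pages` argument.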
## Multimodal Document Chunking

```python
from dataclasses import dataclass
from typing import Literal, Optional

import fitz  # PyMuPDF


@dataclass
class Chunk:
    content: str
    chunk_type: Literal["text", "image", "table", "chart"]
    page: int
    image_path: Optional[str] = None
    embedding: Optional[list[float]] = None


def chunk_multimodal_document(pdf_path: str) -> list[Chunk]:
    """Chunk a PDF while preserving images and tables."""
    doc = fitz.open(pdf_path)
    chunks = []
    for page_num, page in enumerate(doc):
        current_text = ""
        # "dict" mode yields text blocks (type 0) and image blocks (type 1) in reading order
        for block in page.get_text("dict")["blocks"]:
            if block["type"] == 0:  # Text block
                for line in block["lines"]:
                    current_text += "".join(span["text"] for span in line["spans"]) + "\n"
            else:  # Image block: raw bytes are in block["image"]
                # Save the text accumulated so far as its own chunk
                if current_text.strip():
                    chunks.append(Chunk(
                        content=current_text.strip(),
                        chunk_type="text",
                        page=page_num
                    ))
                    current_text = ""
                # Extract and save the image
                img_path = f"/tmp/page{page_num}_img{block['number']}.{block['ext']}"
                with open(img_path, "wb") as f:
                    f.write(block["image"])
                # Generate a caption for the image
                caption = generate_image_caption(img_path)
                chunks.append(Chunk(
                    content=caption,
                    chunk_type="image",
                    page=page_num,
                    image_path=img_path
                ))
        # Final text chunk
        if current_text.strip():
            chunks.append(Chunk(
                content=current_text.strip(),
                chunk_type="text",
                page=page_num
            ))
    return chunks
```

## Vector Database Setup (Milvus)
```python
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType


def setup_multimodal_collection():
    """Create a Milvus collection for multimodal embeddings."""
    connections.connect("default", host="localhost", port="19530")
    fields = [
        FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=256),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
        FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=256),
        FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=32),
        FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="image_url", dtype=DataType.VARCHAR, max_length=1024),
        FieldSchema(name="page", dtype=DataType.INT64)
    ]
    schema = CollectionSchema(fields, "Multimodal document collection")
    collection = Collection("multimodal_docs", schema)
    # Create index for vector search
    index_params = {
        "metric_type": "COSINE",
        "index_type": "HNSW",
        "params": {"M": 16, "efConstruction": 256}
    }
    collection.create_index("embedding", index_params)
    return collection
```

## Multimodal Generation
```python
import anthropic

client = anthropic.AsyncAnthropic()


async def generate_with_context(
    query: str,
    retrieved_chunks: list[Chunk],
    model: str = "claude-opus-4-6"
) -> str:
    """Generate a response using multimodal context."""
    content = []
    # Add retrieved images first (attention positioning)
    for chunk in retrieved_chunks:
        if chunk.chunk_type == "image" and chunk.image_path:
            base64_data, media_type = encode_image_base64(chunk.image_path)
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64_data
                }
            })
    # Add text context
    text_context = "\n\n".join([
        f"[Page {c.page}]: {c.content}"
        for c in retrieved_chunks if c.chunk_type == "text"
    ])
    content.append({
        "type": "text",
        "text": f"""Use the following context to answer the question.

Context:
{text_context}

Question: {query}

Provide a detailed answer based on the context and images provided."""
    })
    response = await client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text
```

## Key Decisions
| Decision | Recommendation |
|---|---|
| Long documents | Voyage multimodal-3 (32K context) |
| Scale retrieval | SigLIP 2 (optimized for large-scale) |
| PDF processing | ColPali (document-native) |
| Multi-modal search | Hybrid: CLIP + text embeddings |
| Production DB | Milvus or Pinecone with hybrid |
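The decision table above can be read as a simple dispatch. This mapping is purely illustrative (the labels come from this table, not from any library):

```python
def pick_embedding_model(doc_type: str, scale: str = "small") -> str:
    """Map a document profile to the recommended embedding model."""
    if doc_type == "pdf":
        return "ColPali"           # document-native
    if doc_type == "long_text":
        return "voyage-multimodal-3"  # 32K context
    if scale == "large":
        return "SigLIP 2"          # large-scale retrieval
    return "CLIP ViT-L/14"         # general purpose
```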
## Common Mistakes

- Embedding images without captions (limits text search)
- Not deduplicating by document ID
- Missing image URL storage (can't display results)
- Using only image OR text embeddings (use both)
- Ignoring chunk boundaries (splitting mid-paragraph)
- Not validating image retrieval quality
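For the chunk-boundary mistake above, a minimal paragraph-aware splitter never cuts mid-paragraph. This is a sketch only; production chunkers also handle headings, tables, and token budgets:

```python
def split_on_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Greedy splitter that only breaks on paragraph boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Start a new chunk if adding this paragraph would overflow
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

An over-long single paragraph simply becomes its own chunk rather than being cut.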
## Related Skills

- vision-language-models - Image analysis
- embeddings - Text embedding patterns
- rag-retrieval - Text RAG patterns
- contextual-retrieval - Hybrid BM25+vector
## Capability Details

### image-embeddings

Keywords: CLIP, image embedding, visual features, SigLIP

Solves:
- Convert images to vector representations
- Enable image similarity search
- Cross-modal retrieval

### cross-modal-search

Keywords: text to image, image to text, cross-modal

Solves:
- Find images from text queries
- Find text from image queries
- Bridge modalities

### multimodal-chunking

Keywords: chunk PDF, split document, extract images

Solves:
- Process documents with mixed content
- Preserve image-text relationships
- Handle tables and charts

### hybrid-retrieval

Keywords: hybrid search, fusion, multi-embedding

Solves:
- Combine text and image search
- Improve retrieval accuracy
- Handle diverse queries