multimodal-rag

Multimodal RAG

Build retrieval-augmented generation systems that handle images, text, and mixed content.

Overview


  • Image + text retrieval (product search, documentation)
  • Cross-modal search (text query -> image results)
  • Multimodal document processing (PDFs with charts)
  • Visual question answering with context
  • Image similarity and deduplication
  • Hybrid search pipelines

Architecture Approaches


| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Joint Embedding (CLIP) | Direct comparison | Limited context | Pure image search |
| Caption-based | Works with text LLMs | Lossy conversion | Existing text RAG |
| Hybrid | Best accuracy | More complex | Production systems |
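To make the caption-based row concrete: the image is routed through a captioning model and then indexed with an ordinary text-embedding pipeline, which is exactly where the lossy conversion happens. A minimal sketch, where `caption_image`, `embed_text`, and `store` are hypothetical stand-ins for whatever captioner, embedder, and index are in use:

```python
def index_image_via_caption(doc_id, image_path, caption_image, embed_text, store):
    """Caption-based indexing: image -> caption text -> text embedding.

    Lossy by design: only details the captioner mentions remain searchable.
    """
    caption = caption_image(image_path)
    store[doc_id] = {
        "embedding": embed_text(caption),
        "caption": caption,
        "image_path": image_path,
    }
    return caption
```

The upside is that the rest of an existing text-RAG stack is reused unchanged; only the captioning step is new.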

Embedding Models

| Model | Context | Modalities | Best For |
|---|---|---|---|
| Voyage multimodal-3 | 32K tokens | Text + Image | Long documents |
| SigLIP 2 | Standard | Text + Image | Large-scale retrieval |
| CLIP ViT-L/14 | 77 tokens | Text + Image | General purpose |
| ImageBind | Standard | 6 modalities | Audio/video included |
| ColPali | Document | Text + Image | PDF/document RAG |

CLIP-Based Image Embeddings


python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image_path: str) -> list[float]:
    """Generate CLIP embedding for an image."""
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)

    # Normalize for cosine similarity
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

def embed_text(text: str) -> list[float]:
    """Generate CLIP embedding for a text query."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)

    with torch.no_grad():
        embeddings = model.get_text_features(**inputs)

    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

Cross-modal search: text -> images

python
import numpy as np

def search_images(query: str, image_embeddings: list, top_k: int = 5):
    """Search images using a text query."""
    query_embedding = embed_text(query)

    # Cosine similarities (embeddings are already L2-normalized)
    similarities = [
        np.dot(query_embedding, img_emb)
        for img_emb in image_embeddings
    ]

    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return top_indices, [similarities[i] for i in top_indices]
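The ranking step above is just cosine similarity over normalized vectors followed by a top-k cut. A dependency-free sketch of the same logic, useful for sanity-checking retrieval behavior without loading any model:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k_cosine(query, candidates, k=2):
    """Rank candidate vectors by cosine similarity to the query vector."""
    q = normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, normalize(c))))
        for i, c in enumerate(candidates)
    ]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]
```

Because CLIP embeddings are normalized at encode time, the dot product in `search_images` and the full cosine computation here give identical rankings.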

Voyage Multimodal-3 (Long Context)


python
import voyageai

client = voyageai.Client()

def embed_multimodal_voyage(
    texts: list[str] = None,
    images: list[str] = None  # File paths or URLs
) -> list[list[float]]:
    """Embed text and/or images with 32K token context."""
    inputs = []

    if texts:
        inputs.extend([{"type": "text", "content": t} for t in texts])

    if images:
        import base64
        for img_path in images:
            with open(img_path, "rb") as f:
                b64 = base64.b64encode(f.read()).decode()
            inputs.append({
                "type": "image",
                "content": f"data:image/png;base64,{b64}"
            })

    response = client.multimodal_embed(
        inputs=inputs,
        model="voyage-multimodal-3"
    )

    return response.embeddings
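The data-URL construction inside the helper above is plain base64 and can be isolated and verified on its own; `to_data_url` is an illustrative helper, not part of the voyageai SDK:

```python
import base64

def to_data_url(raw: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data: URL for APIs that accept inline images."""
    return f"data:{mime};base64,{base64.b64encode(raw).decode()}"
```

Note the snippet above hardcodes image/png regardless of the file's actual format; passing the correct MIME type per file is safer.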

Hybrid RAG Pipeline


python
from typing import Optional
import numpy as np

class MultimodalRAG:
    """Production multimodal RAG with hybrid retrieval."""

    def __init__(self, vector_db, vision_model, text_model):
        self.vector_db = vector_db
        self.vision_model = vision_model
        self.text_model = text_model

    async def index_document(
        self,
        doc_id: str,
        text: Optional[str] = None,
        image_path: Optional[str] = None,
        metadata: dict = None
    ):
        """Index a document with text and/or image."""
        embeddings = []

        if text:
            text_emb = embed_text(text)
            embeddings.append(("text", text_emb))

        if image_path:
            # Option 1: Direct image embedding
            img_emb = embed_image(image_path)
            embeddings.append(("image", img_emb))

            # Option 2: Generate caption for text search
            caption = await self.generate_caption(image_path)
            caption_emb = embed_text(caption)
            embeddings.append(("caption", caption_emb))

        # Store with shared document ID
        for emb_type, emb in embeddings:
            await self.vector_db.upsert(
                id=f"{doc_id}_{emb_type}",
                embedding=emb,
                metadata={
                    "doc_id": doc_id,
                    "type": emb_type,
                    "image_url": image_path,
                    "text": text,
                    **(metadata or {})
                }
            )

    async def generate_caption(self, image_path: str) -> str:
        """Generate text caption for image indexing."""
        # Use GPT-4o or Claude for high-quality captions
        response = await self.vision_model.analyze(
            image_path,
            prompt="Describe this image in detail for search indexing. "
                   "Include objects, text, colors, and context."
        )
        return response

    async def retrieve(
        self,
        query: str,
        query_image: Optional[str] = None,
        top_k: int = 10
    ) -> list[dict]:
        """Hybrid retrieval with optional image query."""
        results = []

        # Text query embedding
        text_emb = embed_text(query)
        text_results = await self.vector_db.search(
            embedding=text_emb,
            top_k=top_k
        )
        results.extend(text_results)

        # Image query embedding (if provided)
        if query_image:
            img_emb = embed_image(query_image)
            img_results = await self.vector_db.search(
                embedding=img_emb,
                top_k=top_k
            )
            results.extend(img_results)

        # Dedupe by doc_id, keep highest score
        seen = {}
        for r in results:
            doc_id = r["metadata"]["doc_id"]
            if doc_id not in seen or r["score"] > seen[doc_id]["score"]:
                seen[doc_id] = r

        return sorted(seen.values(), key=lambda x: x["score"], reverse=True)[:top_k]
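The dedupe-and-rerank step at the end of retrieve() is worth isolating, since it is where text hits, image hits, and caption hits for the same document get merged. A standalone sketch with plain dicts:

```python
def dedupe_by_doc(results, top_k=10):
    """Keep the highest-scoring hit per doc_id, then rank descending by score."""
    best = {}
    for r in results:
        doc_id = r["metadata"]["doc_id"]
        if doc_id not in best or r["score"] > best[doc_id]["score"]:
            best[doc_id] = r
    return sorted(best.values(), key=lambda x: x["score"], reverse=True)[:top_k]
```

Without this step, a document indexed three ways (text, image, caption) can occupy three of the top-k slots and crowd out other results.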

Claude Code PDF Handling (CC 2.1.30+)


For large PDFs, use the pages parameter to process in batches:

Process large PDF in page-range batches for embedding

python
import subprocess

async def process_large_pdf_for_rag(pdf_path: str, pages_per_batch: int = 10):
    """Process a large PDF by page ranges before embedding."""
    # Get total page count via pdfinfo (poppler-utils)
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True, text=True
    )
    total_pages = int([l for l in result.stdout.split('\n')
                       if 'Pages:' in l][0].split(':')[1].strip())

    chunks = []
    for start in range(1, total_pages + 1, pages_per_batch):
        end = min(start + pages_per_batch - 1, total_pages)

        # Read page range (CC 2.1.30 pages parameter)
        # Read(file_path=pdf_path, pages=f"{start}-{end}")

        # Extract and embed content from this range
        page_chunks = extract_chunks_from_range(pdf_path, start, end)
        chunks.extend(page_chunks)

    return chunks
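The page-range arithmetic in the loop above can be checked on its own, which matters because an off-by-one here silently drops or double-reads pages:

```python
def page_ranges(total_pages, pages_per_batch=10):
    """Split 1-indexed pages into inclusive (start, end) ranges of at most pages_per_batch."""
    return [
        (start, min(start + pages_per_batch - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_batch)
    ]
```

Ranges are inclusive on both ends, matching the "1-10", "11-20" style that the pages parameter expects.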

Limits


  • Max 20 pages per Read request
  • Max 20MB file size
  • Process large documents in batches for embedding


Multimodal Document Chunking


python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class Chunk:
    content: str
    chunk_type: Literal["text", "image", "table", "chart"]
    page: int
    image_path: Optional[str] = None
    embedding: Optional[list[float]] = None

def chunk_multimodal_document(pdf_path: str) -> list[Chunk]:
    """Chunk PDF preserving images and tables."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    chunks = []

    for page_num, page in enumerate(doc):
        # "dict" mode yields blocks tagged type 0 (text) or 1 (image)
        blocks = page.get_text("dict")["blocks"]
        current_text = ""

        for block in blocks:
            if block["type"] == 0:  # Text block
                for line in block["lines"]:
                    current_text += "".join(s["text"] for s in line["spans"]) + "\n"
            else:  # Image block
                # Flush the accumulated text chunk first
                if current_text.strip():
                    chunks.append(Chunk(
                        content=current_text.strip(),
                        chunk_type="text",
                        page=page_num
                    ))
                    current_text = ""

                # Image blocks carry the raw bytes and extension directly
                img_path = f"/tmp/page{page_num}_img{block['number']}.{block['ext']}"
                with open(img_path, "wb") as f:
                    f.write(block["image"])

                # Generate a caption so the image is text-searchable
                caption = generate_image_caption(img_path)

                chunks.append(Chunk(
                    content=caption,
                    chunk_type="image",
                    page=page_num,
                    image_path=img_path
                ))

        # Final text chunk for the page
        if current_text.strip():
            chunks.append(Chunk(
                content=current_text.strip(),
                chunk_type="text",
                page=page_num
            ))

    return chunks
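The flush-on-image pattern in the chunker generalizes to any interleaved block stream: accumulate text, and emit a text chunk whenever an image interrupts so image-text ordering survives into the index. A stub-based sketch (no PyMuPDF required):

```python
def interleave_chunks(blocks):
    """blocks: list of ('text', str) or ('image', caption) tuples.

    Accumulate text, flushing a text chunk whenever an image interrupts,
    so the output preserves the original reading order.
    """
    chunks, buf = [], ""
    for kind, payload in blocks:
        if kind == "text":
            buf += payload + "\n"
        else:
            if buf.strip():
                chunks.append(("text", buf.strip()))
                buf = ""
            chunks.append(("image", payload))
    if buf.strip():
        chunks.append(("text", buf.strip()))
    return chunks
```

Preserving this order matters downstream: a chart's caption chunk should sit next to the paragraph that discusses it, not at the end of the page.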

Vector Database Setup (Milvus)


python
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

def setup_multimodal_collection():
    """Create Milvus collection for multimodal embeddings."""
    connections.connect("default", host="localhost", port="19530")

    fields = [
        FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=256),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
        FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=256),
        FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=32),
        FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="image_url", dtype=DataType.VARCHAR, max_length=1024),
        FieldSchema(name="page", dtype=DataType.INT64)
    ]

    schema = CollectionSchema(fields, "Multimodal document collection")
    collection = Collection("multimodal_docs", schema)

    # Create index for vector search
    index_params = {
        "metric_type": "COSINE",
        "index_type": "HNSW",
        "params": {"M": 16, "efConstruction": 256}
    }
    collection.create_index("embedding", index_params)

    return collection

Multimodal Generation


python
async def generate_with_context(
    query: str,
    retrieved_chunks: list[Chunk],
    model: str = "claude-opus-4-6"
) -> str:
    """Generate response using multimodal context."""
    content = []

    # Add retrieved images first (attention positioning)
    for chunk in retrieved_chunks:
        if chunk.chunk_type == "image" and chunk.image_path:
            base64_data, media_type = encode_image_base64(chunk.image_path)
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64_data
                }
            })

    # Add text context
    text_context = "\n\n".join([
        f"[Page {c.page}]: {c.content}"
        for c in retrieved_chunks if c.chunk_type == "text"
    ])

    content.append({
        "type": "text",
        "text": f"""Use the following context to answer the question.

Context:
{text_context}

Question: {query}

Provide a detailed answer based on the context and images provided."""
    })

    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": content}]
    )

    return response.content[0].text
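The content-ordering decision above (images first, then a single text block) can be separated from the API call itself. A simplified sketch; the dict shapes here are illustrative placeholders, not the Anthropic Messages wire format used in the full function:

```python
def build_content(chunks, question):
    """Order multimodal context: all image chunks first, then one text block."""
    content = [
        {"type": "image", "path": c["image_path"]}
        for c in chunks if c["type"] == "image"
    ]
    text_context = "\n\n".join(
        f"[Page {c['page']}]: {c['content']}"
        for c in chunks if c["type"] == "text"
    )
    content.append({
        "type": "text",
        "text": f"Context:\n{text_context}\n\nQuestion: {question}",
    })
    return content
```

Tagging each text chunk with its page number also gives the model something to cite when answering.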

Key Decisions


| Decision | Recommendation |
|---|---|
| Long documents | Voyage multimodal-3 (32K context) |
| Scale retrieval | SigLIP 2 (optimized for large-scale) |
| PDF processing | ColPali (document-native) |
| Multi-modal search | Hybrid: CLIP + text embeddings |
| Production DB | Milvus or Pinecone with hybrid |

Common Mistakes


  • Embedding images without captions (limits text search)
  • Not deduplicating by document ID
  • Missing image URL storage (can't display results)
  • Using only image OR text embeddings (use both)
  • Ignoring chunk boundaries (split mid-paragraph)
  • Not validating image retrieval quality

Related Skills


  • vision-language-models
    - Image analysis
  • embeddings
    - Text embedding patterns
  • rag-retrieval
    - Text RAG patterns
  • contextual-retrieval
    - Hybrid BM25+vector

Capability Details


image-embeddings

Keywords: CLIP, image embedding, visual features, SigLIP
Solves:
  • Convert images to vector representations
  • Enable image similarity search
  • Cross-modal retrieval

cross-modal-search

Keywords: text to image, image to text, cross-modal
Solves:
  • Find images from text queries
  • Find text from image queries
  • Bridge modalities

multimodal-chunking

Keywords: chunk PDF, split document, extract images
Solves:
  • Process documents with mixed content
  • Preserve image-text relationships
  • Handle tables and charts

hybrid-retrieval

Keywords: hybrid search, fusion, multi-embedding
Solves:
  • Combine text and image search
  • Improve retrieval accuracy
  • Handle diverse queries