multimodal-rag

Multimodal RAG

Build retrieval-augmented generation systems that handle images, text, and mixed content.

Overview


  • Image + text retrieval (product search, documentation)
  • Cross-modal search (text query -> image results)
  • Multimodal document processing (PDFs with charts)
  • Visual question answering with context
  • Image similarity and deduplication
  • Hybrid search pipelines

Architecture Approaches


| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Joint Embedding (CLIP) | Direct comparison | Limited context | Pure image search |
| Caption-based | Works with text LLMs | Lossy conversion | Existing text RAG |
| Hybrid | Best accuracy | More complex | Production systems |
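To make the caption-based row concrete: the image is routed through a captioning model and then indexed with an ordinary text-embedding pipeline, which is exactly where the lossy conversion happens. A minimal sketch, where `caption_image`, `embed_text`, and `store` are hypothetical stand-ins for whatever captioner, embedder, and index are in use:

```python
def index_image_via_caption(doc_id, image_path, caption_image, embed_text, store):
    """Caption-based indexing: image -> caption text -> text embedding.

    Lossy by design: only details the captioner mentions remain searchable.
    """
    caption = caption_image(image_path)
    store[doc_id] = {
        "embedding": embed_text(caption),
        "caption": caption,
        "image_path": image_path,
    }
    return caption
```

The upside is that the rest of an existing text-RAG stack is reused unchanged; only the captioning step is new.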

Embedding Models

| Model | Context | Modalities | Best For |
|---|---|---|---|
| Voyage multimodal-3 | 32K tokens | Text + Image | Long documents |
| SigLIP 2 | Standard | Text + Image | Large-scale retrieval |
| CLIP ViT-L/14 | 77 tokens | Text + Image | General purpose |
| ImageBind | Standard | 6 modalities | Audio/video included |
| ColPali | Document | Text + Image | PDF/document RAG |

CLIP-Based Image Embeddings


python
import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Load CLIP model
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image_path: str) -> list[float]:
    """Generate CLIP embedding for an image."""
    image = Image.open(image_path)
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)

    # Normalize for cosine similarity
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

def embed_text(text: str) -> list[float]:
    """Generate CLIP embedding for a text query."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)

    with torch.no_grad():
        embeddings = model.get_text_features(**inputs)

    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
    return embeddings[0].tolist()

Cross-modal search: text -> images

python
import numpy as np

def search_images(query: str, image_embeddings: list, top_k: int = 5):
    """Search images using a text query."""
    query_embedding = embed_text(query)

    # Cosine similarities (embeddings are already L2-normalized)
    similarities = [
        np.dot(query_embedding, img_emb)
        for img_emb in image_embeddings
    ]

    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return top_indices, [similarities[i] for i in top_indices]
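The ranking step above is just cosine similarity over normalized vectors followed by a top-k cut. A dependency-free sketch of the same logic, useful for sanity-checking retrieval behavior without loading any model:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k_cosine(query, candidates, k=2):
    """Rank candidate vectors by cosine similarity to the query vector."""
    q = normalize(query)
    scored = [
        (i, sum(a * b for a, b in zip(q, normalize(c))))
        for i, c in enumerate(candidates)
    ]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]
```

Because CLIP embeddings are normalized at encode time, the dot product in `search_images` and the full cosine computation here give identical rankings.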

Voyage Multimodal-3 (Long Context)


python
import voyageai

client = voyageai.Client()

def embed_multimodal_voyage(
    texts: list[str] = None,
    images: list[str] = None  # File paths or URLs
) -> list[list[float]]:
    """Embed text and/or images with 32K token context."""
    inputs = []

    if texts:
        inputs.extend([{"type": "text", "content": t} for t in texts])

    if images:
        import base64
        for img_path in images:
            with open(img_path, "rb") as f:
                b64 = base64.b64encode(f.read()).decode()
            inputs.append({
                "type": "image",
                "content": f"data:image/png;base64,{b64}"
            })

    response = client.multimodal_embed(
        inputs=inputs,
        model="voyage-multimodal-3"
    )

    return response.embeddings
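The data-URL construction inside the helper above is plain base64 and can be isolated and verified on its own; `to_data_url` is an illustrative helper, not part of the voyageai SDK:

```python
import base64

def to_data_url(raw: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data: URL for APIs that accept inline images."""
    return f"data:{mime};base64,{base64.b64encode(raw).decode()}"
```

Note the snippet above hardcodes image/png regardless of the file's actual format; passing the correct MIME type per file is safer.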

Hybrid RAG Pipeline


python
from typing import Optional
import numpy as np

class MultimodalRAG:
    """Production multimodal RAG with hybrid retrieval."""

    def __init__(self, vector_db, vision_model, text_model):
        self.vector_db = vector_db
        self.vision_model = vision_model
        self.text_model = text_model

    async def index_document(
        self,
        doc_id: str,
        text: Optional[str] = None,
        image_path: Optional[str] = None,
        metadata: dict = None
    ):
        """Index a document with text and/or image."""
        embeddings = []

        if text:
            text_emb = embed_text(text)
            embeddings.append(("text", text_emb))

        if image_path:
            # Option 1: Direct image embedding
            img_emb = embed_image(image_path)
            embeddings.append(("image", img_emb))

            # Option 2: Generate caption for text search
            caption = await self.generate_caption(image_path)
            caption_emb = embed_text(caption)
            embeddings.append(("caption", caption_emb))

        # Store with shared document ID
        for emb_type, emb in embeddings:
            await self.vector_db.upsert(
                id=f"{doc_id}_{emb_type}",
                embedding=emb,
                metadata={
                    "doc_id": doc_id,
                    "type": emb_type,
                    "image_url": image_path,
                    "text": text,
                    **(metadata or {})
                }
            )

    async def generate_caption(self, image_path: str) -> str:
        """Generate text caption for image indexing."""
        # Use GPT-4o or Claude for high-quality captions
        response = await self.vision_model.analyze(
            image_path,
            prompt="Describe this image in detail for search indexing. "
                   "Include objects, text, colors, and context."
        )
        return response

    async def retrieve(
        self,
        query: str,
        query_image: Optional[str] = None,
        top_k: int = 10
    ) -> list[dict]:
        """Hybrid retrieval with optional image query."""
        results = []

        # Text query embedding
        text_emb = embed_text(query)
        text_results = await self.vector_db.search(
            embedding=text_emb,
            top_k=top_k
        )
        results.extend(text_results)

        # Image query embedding (if provided)
        if query_image:
            img_emb = embed_image(query_image)
            img_results = await self.vector_db.search(
                embedding=img_emb,
                top_k=top_k
            )
            results.extend(img_results)

        # Dedupe by doc_id, keep highest score
        seen = {}
        for r in results:
            doc_id = r["metadata"]["doc_id"]
            if doc_id not in seen or r["score"] > seen[doc_id]["score"]:
                seen[doc_id] = r

        return sorted(seen.values(), key=lambda x: x["score"], reverse=True)[:top_k]
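The dedupe-and-rerank step at the end of retrieve() is worth isolating, since it is where text hits, image hits, and caption hits for the same document get merged. A standalone sketch with plain dicts:

```python
def dedupe_by_doc(results, top_k=10):
    """Keep the highest-scoring hit per doc_id, then rank descending by score."""
    best = {}
    for r in results:
        doc_id = r["metadata"]["doc_id"]
        if doc_id not in best or r["score"] > best[doc_id]["score"]:
            best[doc_id] = r
    return sorted(best.values(), key=lambda x: x["score"], reverse=True)[:top_k]
```

Without this step, a document indexed three ways (text, image, caption) can occupy three of the top-k slots and crowd out other results.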

Claude Code PDF Handling (CC 2.1.30+)


For large PDFs, use the pages parameter to process in batches:

Process large PDF in page-range batches for embedding

python
import subprocess

async def process_large_pdf_for_rag(pdf_path: str, pages_per_batch: int = 10):
    """Process a large PDF by page ranges before embedding."""
    # Get total page count via pdfinfo (poppler-utils)
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True, text=True
    )
    total_pages = int([l for l in result.stdout.split('\n')
                       if 'Pages:' in l][0].split(':')[1].strip())

    chunks = []
    for start in range(1, total_pages + 1, pages_per_batch):
        end = min(start + pages_per_batch - 1, total_pages)

        # Read page range (CC 2.1.30 pages parameter)
        # Read(file_path=pdf_path, pages=f"{start}-{end}")

        # Extract and embed content from this range
        page_chunks = extract_chunks_from_range(pdf_path, start, end)
        chunks.extend(page_chunks)

    return chunks
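The page-range arithmetic in the loop above can be checked on its own, which matters because an off-by-one here silently drops or double-reads pages:

```python
def page_ranges(total_pages, pages_per_batch=10):
    """Split 1-indexed pages into inclusive (start, end) ranges of at most pages_per_batch."""
    return [
        (start, min(start + pages_per_batch - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_batch)
    ]
```

Ranges are inclusive on both ends, matching the "1-10", "11-20" style that the pages parameter expects.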

Limits


  • Max 20 pages per Read request
  • Max 20MB file size
  • Process large documents in batches for embedding


Multimodal Document Chunking


python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class Chunk:
    content: str
    chunk_type: Literal["text", "image", "table", "chart"]
    page: int
    image_path: Optional[str] = None
    embedding: Optional[list[float]] = None

def chunk_multimodal_document(pdf_path: str) -> list[Chunk]:
    """Chunk PDF preserving images and tables."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    chunks = []

    for page_num, page in enumerate(doc):
        # "dict" mode yields blocks tagged type 0 (text) or 1 (image)
        blocks = page.get_text("dict")["blocks"]
        current_text = ""

        for block in blocks:
            if block["type"] == 0:  # Text block
                for line in block["lines"]:
                    current_text += "".join(s["text"] for s in line["spans"]) + "\n"
            else:  # Image block
                # Flush the accumulated text chunk first
                if current_text.strip():
                    chunks.append(Chunk(
                        content=current_text.strip(),
                        chunk_type="text",
                        page=page_num
                    ))
                    current_text = ""

                # Image blocks carry the raw bytes and extension directly
                img_path = f"/tmp/page{page_num}_img{block['number']}.{block['ext']}"
                with open(img_path, "wb") as f:
                    f.write(block["image"])

                # Generate a caption so the image is text-searchable
                caption = generate_image_caption(img_path)

                chunks.append(Chunk(
                    content=caption,
                    chunk_type="image",
                    page=page_num,
                    image_path=img_path
                ))

        # Final text chunk for the page
        if current_text.strip():
            chunks.append(Chunk(
                content=current_text.strip(),
                chunk_type="text",
                page=page_num
            ))

    return chunks
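The flush-on-image pattern in the chunker generalizes to any interleaved block stream: accumulate text, and emit a text chunk whenever an image interrupts so image-text ordering survives into the index. A stub-based sketch (no PyMuPDF required):

```python
def interleave_chunks(blocks):
    """blocks: list of ('text', str) or ('image', caption) tuples.

    Accumulate text, flushing a text chunk whenever an image interrupts,
    so the output preserves the original reading order.
    """
    chunks, buf = [], ""
    for kind, payload in blocks:
        if kind == "text":
            buf += payload + "\n"
        else:
            if buf.strip():
                chunks.append(("text", buf.strip()))
                buf = ""
            chunks.append(("image", payload))
    if buf.strip():
        chunks.append(("text", buf.strip()))
    return chunks
```

Preserving this order matters downstream: a chart's caption chunk should sit next to the paragraph that discusses it, not at the end of the page.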

Vector Database Setup (Milvus)


python
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

def setup_multimodal_collection():
    """Create Milvus collection for multimodal embeddings."""
    connections.connect("default", host="localhost", port="19530")

    fields = [
        FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=256),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
        FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=256),
        FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=32),
        FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="image_url", dtype=DataType.VARCHAR, max_length=1024),
        FieldSchema(name="page", dtype=DataType.INT64)
    ]

    schema = CollectionSchema(fields, "Multimodal document collection")
    collection = Collection("multimodal_docs", schema)

    # Create index for vector search
    index_params = {
        "metric_type": "COSINE",
        "index_type": "HNSW",
        "params": {"M": 16, "efConstruction": 256}
    }
    collection.create_index("embedding", index_params)

    return collection

Multimodal Generation


python
async def generate_with_context(
    query: str,
    retrieved_chunks: list[Chunk],
    model: str = "claude-opus-4-6"
) -> str:
    """Generate response using multimodal context."""
    content = []

    # Add retrieved images first (attention positioning)
    for chunk in retrieved_chunks:
        if chunk.chunk_type == "image" and chunk.image_path:
            base64_data, media_type = encode_image_base64(chunk.image_path)
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": base64_data
                }
            })

    # Add text context
    text_context = "\n\n".join([
        f"[Page {c.page}]: {c.content}"
        for c in retrieved_chunks if c.chunk_type == "text"
    ])

    content.append({
        "type": "text",
        "text": f"""Use the following context to answer the question.

Context:
{text_context}

Question: {query}

Provide a detailed answer based on the context and images provided."""
    })

    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": content}]
    )

    return response.content[0].text
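The content-ordering decision above (images first, then a single text block) can be separated from the API call itself. A simplified sketch; the dict shapes here are illustrative placeholders, not the Anthropic Messages wire format used in the full function:

```python
def build_content(chunks, question):
    """Order multimodal context: all image chunks first, then one text block."""
    content = [
        {"type": "image", "path": c["image_path"]}
        for c in chunks if c["type"] == "image"
    ]
    text_context = "\n\n".join(
        f"[Page {c['page']}]: {c['content']}"
        for c in chunks if c["type"] == "text"
    )
    content.append({
        "type": "text",
        "text": f"Context:\n{text_context}\n\nQuestion: {question}",
    })
    return content
```

Tagging each text chunk with its page number also gives the model something to cite when answering.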

Key Decisions


| Decision | Recommendation |
|---|---|
| Long documents | Voyage multimodal-3 (32K context) |
| Scale retrieval | SigLIP 2 (optimized for large-scale) |
| PDF processing | ColPali (document-native) |
| Multi-modal search | Hybrid: CLIP + text embeddings |
| Production DB | Milvus or Pinecone with hybrid |

Common Mistakes


  • Embedding images without captions (limits text search)
  • Not deduplicating by document ID
  • Missing image URL storage (can't display results)
  • Using only image OR text embeddings (use both)
  • Ignoring chunk boundaries (split mid-paragraph)
  • Not validating image retrieval quality

Related Skills


  • vision-language-models
    - Image analysis
  • embeddings
    - Text embedding patterns
  • rag-retrieval
    - Text RAG patterns
  • contextual-retrieval
    - Hybrid BM25+vector

Capability Details


image-embeddings

Keywords: CLIP, image embedding, visual features, SigLIP
Solves:
  • Convert images to vector representations
  • Enable image similarity search
  • Cross-modal retrieval

cross-modal-search

Keywords: text to image, image to text, cross-modal
Solves:
  • Find images from text queries
  • Find text from image queries
  • Bridge modalities

multimodal-chunking

Keywords: chunk PDF, split document, extract images
Solves:
  • Process documents with mixed content
  • Preserve image-text relationships
  • Handle tables and charts

hybrid-retrieval

Keywords: hybrid search, fusion, multi-embedding
Solves:
  • Combine text and image search
  • Improve retrieval accuracy
  • Handle diverse queries