Vector Databases for AI Applications

When to Use This Skill

Use this skill when implementing:

RAG (Retrieval-Augmented Generation) systems for AI chatbots
Semantic search capabilities (meaning-based, not just keyword)
Recommendation systems based on similarity
Multi-modal AI (unified search across text, images, audio)
Document similarity and deduplication
Question answering over private knowledge bases

Quick Decision Framework

1. Vector Database Selection

START: Choosing a Vector Database

EXISTING INFRASTRUCTURE?
├─ Using PostgreSQL already?
│  └─ pgvector (<10M vectors, tight budget)
│      See: references/pgvector.md
│
└─ No existing vector database?
   │
   ├─ OPERATIONAL PREFERENCE?
   │  │
   │  ├─ Zero-ops managed only
   │  │  └─ Pinecone (fully managed, excellent DX)
   │  │      See: references/pinecone.md
   │  │
   │  └─ Flexible (self-hosted or managed)
   │     │
   │     ├─ SCALE: <100M vectors + complex filtering ⭐
   │     │  └─ Qdrant (RECOMMENDED)
   │     │      • Best metadata filtering
   │     │      • Built-in hybrid search (BM25 + Vector)
   │     │      • Self-host: Docker/K8s
   │     │      • Managed: Qdrant Cloud
   │     │      See: references/qdrant.md
   │     │
   │     ├─ SCALE: >100M vectors + GPU acceleration
   │     │  └─ Milvus / Zilliz Cloud
   │     │      See: references/milvus.md
   │     │
   │     ├─ Embedded / No server
   │     │  └─ LanceDB (serverless, edge deployment)
   │     │
   │     └─ Local prototyping
   │        └─ Chroma (simple API, in-memory)

2. Embedding Model Selection

REQUIREMENTS?

├─ Best quality (cost no object)
│  └─ Voyage AI voyage-3 (1024d)
│      • 9.74% better than OpenAI on MTEB
│      • ~$0.12/1M tokens
│      See: references/embedding-strategies.md
│
├─ Enterprise reliability
│  └─ OpenAI text-embedding-3-large (3072d)
│      • Industry standard
│      • ~$0.13/1M tokens
│      • Matryoshka shortening: reduce to 256/512/1024d
│
├─ Cost-optimized
│  └─ OpenAI text-embedding-3-small (1536d)
│      • ~$0.02/1M tokens (~6.5x cheaper than text-embedding-3-large)
│      • 90-95% of large model performance
│
├─ Multilingual (100+ languages)
│  └─ Cohere embed-v3 (1024d)
│      • ~$0.10/1M tokens
│
└─ Self-hosted / Privacy-critical
   ├─ English: nomic-embed-text-v1.5 (768d, Apache 2.0)
   ├─ Multilingual: BAAI/bge-m3 (1024d, MIT)
   └─ Long docs: jina-embeddings-v2 (768d, 8K context)
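To put the per-token prices in the tree above into perspective, a rough one-off embedding cost can be estimated as below. This is only a sketch: the prices are the approximate figures quoted above, the dictionary keys are plain labels rather than exact API model IDs, and the corpus size is a made-up illustration value.

```python
# Approximate prices (USD per 1M tokens) copied from the selection tree above.
PRICE_PER_MILLION_TOKENS = {
    "voyage-3": 0.12,
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "cohere-embed-v3": 0.10,
}


def embedding_cost(model: str, num_chunks: int, tokens_per_chunk: int = 512) -> float:
    """Estimate the one-off cost (USD) of embedding a corpus of fixed-size chunks."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Hypothetical corpus: 200,000 chunks of 512 tokens each (102.4M tokens).
for name in PRICE_PER_MILLION_TOKENS:
    print(f"{name}: ${embedding_cost(name, 200_000):.2f}")
```

Re-embedding after a model change costs the same again, so the estimate is worth repeating whenever the model or chunk size changes.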

Core Concepts

Document Chunking Strategy

Recommended defaults for most RAG systems:

Chunk size: 512 tokens (not characters)
Overlap: 50 tokens (about 10% of the chunk size)

Why these numbers?

512 tokens balances context vs. precision
- Too small (128-256): Fragments concepts, loses context
- Too large (1024-2048): Dilutes relevance, wastes LLM tokens
50-token overlap means a sentence cut at a chunk boundary usually appears whole in the neighboring chunk

See references/chunking-patterns.md for advanced strategies by content type.
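The default chunking parameters can be sketched as a sliding window over tokens. This is a minimal illustration, not the skill's own implementation: `chunk_tokens` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace splitting is an assumption made for the example; in practice,
# count tokens with the tokenizer that belongs to your embedding model.
tokens = ("word " * 1200).split()
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```

With 1,200 tokens the windows start at positions 0, 462, and 924, so the last chunk is shorter than the others.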

Hybrid Search (Vector + Keyword)

Hybrid Search = Vector Similarity + BM25 Keyword Matching

User Query: "OAuth refresh token implementation"
           │
    ┌──────┴──────┐
    │             │
Vector Search   Keyword Search
(Semantic)      (BM25)
    │             │
Top 20 docs   Top 20 docs
    │             │
    └──────┬──────┘
           │
   Reciprocal Rank Fusion
   (Merge + Re-rank)
           │
    Final Top 5 Results

Why hybrid matters:

Vector captures semantic meaning ("OAuth refresh" ≈ "token renewal")
Keyword ensures exact matches ("refresh_token" literal)
Combined provides best retrieval quality

See references/hybrid-search.md for implementation details.
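The fusion step in the diagram can be sketched in a few lines. Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over every ranking it appears in; k = 60 is a commonly used default, and the document IDs below are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[tuple[str, float]]:
    """Merge several ranked ID lists; documents ranked high in many lists win."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for stable output.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Hypothetical result lists from the two retrievers.
vector_hits = ["doc_a", "doc_b", "doc_c", "doc_d"]
keyword_hits = ["doc_c", "doc_a", "doc_e"]

for doc_id, score in reciprocal_rank_fusion([vector_hits, keyword_hits]):
    print(doc_id, round(score, 4))
```

Because only ranks are used, the raw vector and BM25 scores never need to be normalized against each other.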

Getting Started

Python + Qdrant Example

python

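from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    FieldCondition,
    Filter,
    MatchValue,
    PointStruct,
    VectorParams,
)

# Assumed inputs (not defined in this snippet):
#   chunks          - list of (embedding, chunk_text) pairs
#   query_embedding - embedding of the user query, from the same model

# 1. Initialize client (6333 is Qdrant's default REST port)
client = QdrantClient("localhost", port=6333)

# 2. Create collection; size must equal the embedding dimension
#    (1024 fits voyage-3; text-embedding-3-large defaults to 3072)
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE)
)

# 3. Insert documents with embeddings
points = [
    PointStruct(
        id=idx,
        vector=embedding,  # From OpenAI/Voyage/etc
        payload={
            "text": chunk_text,
            "source": "docs/api.md",
            "section": "Authentication"
        }
    )
    for idx, (embedding, chunk_text) in enumerate(chunks)
]
client.upsert(collection_name="documents", points=points)

# 4. Search with metadata filtering (typed filter models instead of raw dicts)
#    Note: newer qdrant-client releases favor query_points() over search()
results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(key="section", match=MatchValue(value="Authentication"))
        ]
    )
)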

For complete examples, see examples/qdrant-python/

TypeScript + Qdrant Example

typescript

import { QdrantClient } from '@qdrant/js-client-rest';

const client = new QdrantClient({ url: 'http://localhost:6333' });

// Create collection
await client.createCollection('documents', {
  vectors: { size: 1024, distance: 'Cosine' }
});

// Insert documents
await client.upsert('documents', {
  points: chunks.map((chunk, idx) => ({
    id: idx,
    vector: chunk.embedding,
    payload: {
      text: chunk.text,
      source: chunk.source
    }
  }))
});

// Search
const results = await client.search('documents', {
  vector: queryEmbedding,
  limit: 5,
  filter: {
    must: [
      { key: 'source', match: { value: 'docs/api.md' } }
    ]
  }
});

For complete examples, see examples/typescript-rag/

RAG Pipeline Architecture

Complete Pipeline Components

1. INGESTION
   ├─ Document Loading (PDF, web, code, Office)
   ├─ Text Extraction & Cleaning
   ├─ Chunking (semantic, recursive, code-aware)
   └─ Embedding Generation (batch, rate-limited)

2. INDEXING
   ├─ Vector Store Insertion (batch upsert)
   ├─ Index Configuration (HNSW, distance metric)
   └─ Keyword Index (BM25 for hybrid search)

3. RETRIEVAL (Query Time)
   ├─ Query Processing (expansion, embedding)
   ├─ Hybrid Search (vector + keyword)
   ├─ Filtering & Post-Processing (metadata, MMR)
   └─ Re-Ranking (cross-encoder, LLM-based)

4. GENERATION
   ├─ Context Construction (format chunks, citations)
   ├─ Prompt Engineering (system + context + query)
   ├─ LLM Inference (streaming, temperature tuning)
   └─ Response Post-Processing (citations, validation)

5. EVALUATION (Production Critical)
   ├─ Retrieval Metrics (precision, recall, relevancy)
   ├─ Generation Metrics (faithfulness, correctness)
   └─ System Metrics (latency, cost, satisfaction)
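As an illustration of the Context Construction and Prompt Engineering steps in stage 4, the sketch below formats retrieved chunks into a numbered, source-tagged prompt. `build_prompt`, the instruction wording, and the sample chunks are assumptions made for this example, not part of any library.

```python
def build_prompt(query: str, chunks: list[dict], max_chunks: int = 5) -> str:
    """Format retrieved chunks as numbered, source-tagged context for the LLM."""
    context_blocks = [
        f"[{number}] (source: {chunk['source']})\n{chunk['text']}"
        for number, chunk in enumerate(chunks[:max_chunks], start=1)
    ]
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the numbered context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Hypothetical retrieval results; in a real pipeline these come from the vector store.
retrieved = [
    {"source": "docs/api.md", "text": "POST to /token with grant_type=refresh_token."},
    {"source": "docs/auth.md", "text": "Refresh tokens can be exchanged for new access tokens."},
]
prompt = build_prompt("How do I refresh OAuth tokens?", retrieved)
print(prompt)
```

Numbering the chunks gives the response post-processing step something concrete to validate: every [n] in the answer should map back to a retrieved source.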

Essential Metadata for Production RAG

Critical for filtering and relevance:

python

metadata = {
    # SOURCE TRACKING
    "source": "docs/api-reference.md",
    "source_type": "documentation",  # code, docs, logs, chat
    "last_updated": "2025-12-01T12:00:00Z",

    # HIERARCHICAL CONTEXT
    "section": "Authentication",
    "subsection": "OAuth 2.1",
    "heading_hierarchy": ["API Reference", "Authentication", "OAuth 2.1"],

    # CONTENT CLASSIFICATION
    "content_type": "code_example",  # prose, code, table, list
    "programming_language": "python",

    # FILTERING DIMENSIONS
    "product_version": "v2.0",
    "audience": "enterprise",  # free, pro, enterprise

    # RETRIEVAL HINTS
    "chunk_index": 3,
    "total_chunks": 12,
    "has_code": True
}

Why metadata matters:

Enables filtering BEFORE vector search (reduces search space)
Improves relevance through targeted retrieval
Supports multi-tenant systems (filter by user/org)
Enables versioned documentation (filter by product version)

Evaluation with RAGAS

Use scripts/evaluate_rag.py for automated evaluation:

python

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Answer grounded in context
    answer_relevancy,   # Answer addresses query
    context_recall,     # Retrieved docs cover ground truth
    context_precision   # Retrieved docs are relevant
)

# Test dataset (one list entry per sample; column names follow the classic RAGAS schema)
test_data = {
    "question": ["How do I refresh OAuth tokens?"],
    "answer": ["Use /token with refresh_token grant..."],
    "contexts": [["OAuth refresh documentation..."]],
    "ground_truth": ["POST to /token with grant_type=refresh_token"]
}

# evaluate() expects a dataset object, not a plain dict
dataset = Dataset.from_dict(test_data)

# Evaluate (the metrics call an LLM, so model credentials must be configured)
results = evaluate(dataset, metrics=[
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
])

# Production targets:
# faithfulness: >0.90 (minimal hallucination)
# answer_relevancy: >0.85 (addresses user query)
# context_recall: >0.80 (sufficient context retrieved)
# context_precision: >0.75 (minimal noise)
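The production targets above can be turned into a simple quality gate. This sketch assumes the evaluation scores have already been collected into a plain dictionary; `failing_metrics` and the sample scores are hypothetical.

```python
# Targets from the list above; a metric passes only when it exceeds its target.
TARGETS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_recall": 0.80,
    "context_precision": 0.75,
}


def failing_metrics(scores: dict[str, float], targets: dict[str, float] = TARGETS) -> dict[str, float]:
    """Return the metrics that miss their target (a missing metric counts as 0.0)."""
    return {
        name: scores.get(name, 0.0)
        for name, target in targets.items()
        if scores.get(name, 0.0) <= target
    }


# Made-up scores for illustration only.
scores = {
    "faithfulness": 0.93,
    "answer_relevancy": 0.88,
    "context_recall": 0.72,
    "context_precision": 0.80,
}
print(failing_metrics(scores))
```

A check like this could run in CI so that a drop in retrieval quality blocks a release instead of surfacing in production.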

Performance Optimization

Embedding Generation

Batch processing: 100-500 chunks per batch
Caching: Cache embeddings by content hash
Rate limiting: Respect API provider limits (exponential backoff)
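The three practices above can be combined in one helper. This is a sketch under stated assumptions: `embed_batch` stands for whatever provider call you use (any callable that maps a list of texts to a list of vectors), `fake_embed_batch` is a toy stand-in, and the cache is an ordinary dictionary rather than a persistent store.

```python
import hashlib
import random
import time


def content_hash(text: str) -> str:
    """Stable cache key: identical chunk text always maps to the same key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(texts, embed_batch, cache, batch_size=100,
                     max_retries=5, base_delay=1.0):
    """Embed texts in batches, skipping any text whose hash is already cached."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    # Collect each uncached text once, keyed by content hash.
    pending = {}
    for text in texts:
        key = content_hash(text)
        if key not in cache:
            pending[key] = text

    keys = list(pending)
    for start in range(0, len(keys), batch_size):
        batch_keys = keys[start:start + batch_size]
        batch_texts = [pending[key] for key in batch_keys]
        for attempt in range(max_retries):
            try:
                vectors = embed_batch(batch_texts)
                break
            except Exception:  # real code should catch the provider's rate-limit error
                if attempt == max_retries - 1:
                    raise
                delay = base_delay * 2 ** attempt  # doubles on every retry
                time.sleep(delay + random.uniform(0, delay * 0.1))  # small jitter
        for key, vector in zip(batch_keys, vectors):
            cache[key] = vector

    return [cache[content_hash(text)] for text in texts]


def fake_embed_batch(batch):
    """Hypothetical stand-in for a provider call; returns toy 1-d vectors."""
    return [[float(len(text))] for text in batch]


cache = {}
vectors = embed_with_cache(["alpha", "beta", "alpha"], fake_embed_batch, cache)
print(vectors)     # the repeated text reuses the same cached vector
print(len(cache))  # number of distinct texts that were embedded
```

Hashing the content rather than the file path means unchanged chunks are not re-embedded when a document is re-ingested.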

Vector Search

Index type: HNSW (Hierarchical Navigable Small World) for most cases
Distance metric: Cosine for normalized embeddings
Pre-filtering: Apply metadata filters before vector search
Result diversity: Use MMR (Maximal Marginal Relevance) to reduce redundancy
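To make the MMR item concrete, here is a minimal sketch of greedy Maximal Marginal Relevance selection over toy vectors. The function name, the `lambda_mult` default of 0.5, and the candidate vectors are illustration choices; production systems usually run this over the top results returned by the vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vector, candidates, k=5, lambda_mult=0.5):
    """Greedily pick up to k IDs, trading query relevance against redundancy.

    candidates maps document ID -> vector. lambda_mult=1.0 is pure relevance;
    lower values penalize similarity to already selected documents more.
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < k:
        def mmr_score(doc_id):
            relevance = cosine(query_vector, remaining[doc_id])
            redundancy = max(
                (cosine(remaining[doc_id], candidates[chosen]) for chosen in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


# doc_a and doc_a_dup are near-duplicates; doc_b covers a different direction.
candidates = {
    "doc_a": [0.98, 0.05],
    "doc_a_dup": [1.0, 0.0],
    "doc_b": [0.6, 0.8],
}
print(mmr([1.0, 0.2], candidates, k=2))
```

Ranking by similarity alone would return both near-duplicates here; MMR keeps the best one and spends the second slot on the less redundant document.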

Cost Optimization

Embedding model: Consider text-embedding-3-small for budget constraints
Dimension reduction: Use Matryoshka shortening (3072d → 1024d)
Caching: Implement semantic caching for repeated queries
Batch operations: Group insertions/updates for efficiency
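Dimension reduction by truncation (Matryoshka-style shortening) can be sketched as keeping the leading components and re-normalizing. This only preserves quality for models trained to support it, such as the text-embedding-3 family, where the API's `dimensions` parameter can also do the shortening server-side; `shorten_embedding` is a hypothetical helper.

```python
import math


def shorten_embedding(vector: list[float], target_dim: int) -> list[float]:
    """Keep the first target_dim components and rescale to unit length."""
    if not 0 < target_dim <= len(vector):
        raise ValueError("target_dim must be between 1 and len(vector)")
    truncated = vector[:target_dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    if norm == 0.0:
        return truncated  # nothing to normalize
    return [x / norm for x in truncated]


# Toy 4-d vector shortened to 2-d; real use would be e.g. 3072d -> 1024d.
full = [3.0, 4.0, 12.0, 0.5]
short = shorten_embedding(full, 2)
print(short)
```

Re-normalizing matters because cosine and dot-product scores are only comparable across vectors of unit length. The collection's configured vector size must match the shortened dimension.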

Common Workflows

1. Building a RAG Chatbot

Vector database: Qdrant (self-hosted or cloud)
Embeddings: OpenAI text-embedding-3-large
Chunking: 512 tokens, 50 overlap, semantic splitter
Search: Hybrid (vector + BM25)
Integration: Frontend with ai-chat skill

See examples/qdrant-python/ for complete implementation.

2. Semantic Search Engine

Vector database: Qdrant or Pinecone
Embeddings: Voyage AI voyage-3 (best quality)
Chunking: Content-type specific (see chunking-patterns.md)
Search: Hybrid with re-ranking
Filtering: Pre-filter by metadata (date, category, etc.)

3. Code Search

Vector database: Qdrant
Embeddings: OpenAI text-embedding-3-large
Chunking: AST-based (function/class boundaries)
Metadata: Language, file path, imports
Search: Hybrid with language filtering

See examples/qdrant-python/ for code-specific implementation.
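AST-based chunking for Python sources can be sketched with the standard library `ast` module. This is a simplified illustration, not the implementation in the examples directory: it handles top-level functions and classes only, and the metadata field names are assumptions.

```python
import ast


def chunk_python_source(source: str, file_path: str) -> list[dict]:
    """Create one chunk per top-level function or class, with search metadata."""
    tree = ast.parse(source)
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "text": ast.get_source_segment(source, node),
                "metadata": {
                    "language": "python",
                    "file_path": file_path,
                    "symbol": node.name,
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                    "imports": imports,
                },
            })
    return chunks


SAMPLE = '''import os

def load(path):
    return open(path).read()

class Parser:
    def parse(self, text):
        return text.split()
'''

for chunk in chunk_python_source(SAMPLE, "src/parser.py"):
    meta = chunk["metadata"]
    print(meta["symbol"], meta["start_line"], meta["end_line"])
```

Splitting on syntactic boundaries keeps each function or class intact, which fixed-size token windows cannot guarantee. Other languages need their own parser.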

Integration with Other Skills

Frontend Skills

ai-chat: Vector DB powers RAG pipeline behind chat interface
search-filter: Replace keyword search with semantic search
data-viz: Visualize embedding spaces, similarity scores

Backend Skills

databases-relational: Hybrid approach using pgvector extension
api-patterns: Expose semantic search via REST/GraphQL
observability: Monitor embedding quality and retrieval metrics

Multi-Language Support

Python (Primary)

Client: qdrant-client
Framework: LangChain, LlamaIndex
See: examples/qdrant-python/

Rust

Client: qdrant-client (1,549 code snippets in Context7)
Framework: Raw Rust for performance-critical systems
See: examples/rust-axum-vector/

TypeScript

Client: @qdrant/js-client-rest
Framework: LangChain.js, integration with Next.js
See: examples/typescript-rag/

Go

Client: qdrant-go
Use case: High-performance microservices

Troubleshooting

Poor Retrieval Quality

Check chunking strategy (too large/small?)
Verify metadata filtering (too restrictive?)
Try hybrid search instead of vector-only
Implement re-ranking stage
Evaluate with RAGAS metrics

Slow Performance

Use HNSW index (not Flat)
Pre-filter with metadata before vector search
Reduce vector dimensions (Matryoshka shortening)
Batch operations (insertions, searches)
Consider GPU acceleration (Milvus)

High Costs

Switch to text-embedding-3-small
Implement semantic caching
Reduce chunk overlap
Use self-hosted embeddings (nomic, bge-m3)
Batch embedding generation
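Semantic caching, listed above, can be sketched as a lookup over previously answered query embeddings. The 0.95 threshold is an illustrative value rather than a recommendation, and a linear scan like this only suits small caches; larger ones would store the cached queries in the vector database itself.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored answer when a new query embedding is close to a cached one."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold  # illustrative value; tune on real traffic
        self.entries = []           # list of (query_vector, answer) pairs

    def get(self, query_vector):
        best_score, best_answer = -1.0, None
        for cached_vector, answer in self.entries:
            score = cosine(cached_vector, query_vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query_vector, answer):
        self.entries.append((query_vector, answer))


cache = SemanticCache()
cache.put([1.0, 0.0], "Use the refresh_token grant.")
print(cache.get([0.99, 0.05]))  # near-duplicate query: served from the cache
print(cache.get([0.0, 1.0]))    # unrelated query: None, so fall through to the LLM
```

A hit skips both retrieval and generation, so the threshold trades cost savings against the risk of answering a subtly different question.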

Qdrant Context7 Documentation

Primary resource:

/llmstxt/qdrant_tech_llms-full_txt

Trust score: High
Code snippets: 10,154
Quality score: 83.1

Access via Context7:

resolve-library-id({ libraryName: "Qdrant" })
get-library-docs({
  context7CompatibleLibraryID: "/llmstxt/qdrant_tech_llms-full_txt",
  topic: "hybrid search collections python",
  mode: "code"
})

Additional Resources

Reference Documentation

references/qdrant.md - Comprehensive Qdrant guide
references/pgvector.md - PostgreSQL pgvector extension
references/milvus.md - Milvus/Zilliz for billion-scale
references/embedding-strategies.md - Embedding model comparison
references/chunking-patterns.md - Advanced chunking techniques

Code Examples

examples/qdrant-python/ - FastAPI + Qdrant RAG pipeline
examples/pgvector-prisma/ - PostgreSQL + Prisma integration
examples/typescript-rag/ - TypeScript RAG with Hono

Automation Scripts

scripts/generate_embeddings.py - Batch embedding generation
scripts/benchmark_similarity.py - Performance benchmarking
scripts/evaluate_rag.py - RAGAS-based evaluation

Next Steps:

Choose vector database based on scale and infrastructure
Select embedding model based on quality vs. cost trade-off
Implement chunking strategy for the content type
Set up hybrid search for production quality
Evaluate with RAGAS metrics
Optimize for performance and cost
