pgvector-search
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChinesePGVector Hybrid Search
PGVector混合搜索
Production-grade semantic + keyword search using PostgreSQL
基于PostgreSQL的生产级语义+关键词搜索方案
Overview
概述
Architecture:
Query
|
[Generate embedding] --> Vector Search (PGVector) --> Top 30 results
|
[Generate ts_query] --> Keyword Search (BM25) --> Top 30 results
|
[Reciprocal Rank Fusion (RRF)] --> Merge & re-rank --> Top 10 final resultsWhen to use this skill:
- Building semantic search (RAG, knowledge bases, recommendations)
- Implementing hybrid retrieval (vector + keyword)
- Optimizing PGVector performance
- Working with large document collections (1M+ chunks)
架构:
查询
|
[生成嵌入向量] --> 向量搜索(PGVector) --> 前30条结果
|
[生成ts_query] --> 关键词搜索(BM25) --> 前30条结果
|
[Reciprocal Rank Fusion(RRF)] --> 合并并重排序 --> 最终前10条结果适用场景:
- 构建语义搜索(RAG、知识库、推荐系统)
- 实现混合检索(向量+关键词)
- 优化PGVector性能
- 处理大规模文档集合(100万+文本块)
Quick Reference
快速参考
Search Type Comparison
搜索类型对比
| Aspect | Semantic (Vector) | Keyword (BM25) |
|---|---|---|
| Query | Embedding similarity | Exact word matches |
| Strengths | Synonyms, concepts | Exact phrases, rare terms |
| Weaknesses | Exact matches, technical terms | No semantic understanding |
| Index | HNSW (pgvector) | GIN (tsvector) |
| 维度 | 语义(向量)搜索 | 关键词(BM25)搜索 |
|---|---|---|
| 查询方式 | 嵌入向量相似度匹配 | 精确词汇匹配 |
| 优势 | 支持同义词、概念匹配 | 精确短语、罕见术语匹配 |
| 劣势 | 精确匹配、技术术语表现不佳 | 无语义理解能力 |
| 索引类型 | HNSW(pgvector) | GIN(tsvector) |
Index Comparison
索引对比
| Metric | IVFFlat | HNSW |
|---|---|---|
| Query speed | 50ms | 3ms (17x faster) |
| Index time | 2 min | 20 min |
| Best for | < 100k vectors | 100k+ vectors |
| Recall@10 | 0.85-0.95 | 0.95-0.99 |
Recommendation: Use HNSW for production (scales to millions).
| 指标 | IVFFlat | HNSW |
|---|---|---|
| 查询速度 | 50ms | 3ms(快17倍) |
| 索引构建时间 | 2分钟 | 20分钟 |
| 最佳适用场景 | 向量数量<10万 | 向量数量>10万 |
| Recall@10 | 0.85-0.95 | 0.95-0.99 |
推荐: 生产环境使用HNSW(可扩展至百万级向量)。
RRF Formula
RRF公式
python
rrf_score = 1/(k + vector_rank) + 1/(k + keyword_rank) # k=60 (standard)python
rrf_score = 1/(k + vector_rank) + 1/(k + keyword_rank) # k=60(标准值)Database Schema
数据库 Schema
sql
CREATE TABLE chunks (
id UUID PRIMARY KEY,
document_id UUID REFERENCES documents(id),
content TEXT NOT NULL,
embedding vector(1024), -- PGVector
content_tsvector tsvector GENERATED ALWAYS AS (
to_tsvector('english', content)
) STORED,
section_title TEXT,
content_type TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
-- Indexes
CREATE INDEX idx_chunks_embedding ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
CREATE INDEX idx_chunks_content_tsvector ON chunks
USING gin (content_tsvector);sql
CREATE TABLE chunks (
id UUID PRIMARY KEY,
document_id UUID REFERENCES documents(id),
content TEXT NOT NULL,
embedding vector(1024), -- PGVector
content_tsvector tsvector GENERATED ALWAYS AS (
to_tsvector('english', content)
) STORED,
section_title TEXT,
content_type TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
-- 索引
CREATE INDEX idx_chunks_embedding ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
CREATE INDEX idx_chunks_content_tsvector ON chunks
USING gin (content_tsvector);Hybrid Search Query (SQLAlchemy)
混合搜索查询(SQLAlchemy)
python
async def hybrid_search(
query: str,
query_embedding: list[float],
top_k: int = 10
) -> list[Chunk]:
FETCH_MULTIPLIER = 3 # Fetch 30 for better RRF coverage
K = 60 # RRF smoothing constant
# Vector search subquery
vector_subq = (
select(Chunk.id,
func.row_number().over(
order_by=Chunk.embedding.cosine_distance(query_embedding)
).label("vector_rank"))
.limit(top_k * FETCH_MULTIPLIER)
.subquery()
)
# Keyword search subquery
ts_query = func.plainto_tsquery("english", query)
keyword_subq = (
select(Chunk.id,
func.row_number().over(
order_by=func.ts_rank_cd(Chunk.content_tsvector, ts_query).desc()
).label("keyword_rank"))
.where(Chunk.content_tsvector.op("@@")(ts_query))
.limit(top_k * FETCH_MULTIPLIER)
.subquery()
)
# RRF fusion with FULL OUTER JOIN
rrf_subq = (
select(
func.coalesce(vector_subq.c.id, keyword_subq.c.id).label("chunk_id"),
(func.coalesce(1.0 / (K + vector_subq.c.vector_rank), 0.0) +
func.coalesce(1.0 / (K + keyword_subq.c.keyword_rank), 0.0)
).label("rrf_score"))
.select_from(vector_subq.outerjoin(keyword_subq, ..., full=True))
.order_by("rrf_score DESC")
.limit(top_k)
.subquery()
)
return await session.execute(
select(Chunk).join(rrf_subq, Chunk.id == rrf_subq.c.chunk_id)
)python
async def hybrid_search(
query: str,
query_embedding: list[float],
top_k: int = 10
) -> list[Chunk]:
FETCH_MULTIPLIER = 3 # 取30条结果以提升RRF覆盖度
K = 60 # RRF平滑常数
# 向量搜索子查询
vector_subq = (
select(Chunk.id,
func.row_number().over(
order_by=Chunk.embedding.cosine_distance(query_embedding)
).label("vector_rank"))
.limit(top_k * FETCH_MULTIPLIER)
.subquery()
)
# 关键词搜索子查询
ts_query = func.plainto_tsquery("english", query)
keyword_subq = (
select(Chunk.id,
func.row_number().over(
order_by=func.ts_rank_cd(Chunk.content_tsvector, ts_query).desc()
).label("keyword_rank"))
.where(Chunk.content_tsvector.op("@@")(ts_query))
.limit(top_k * FETCH_MULTIPLIER)
.subquery()
)
# 基于FULL OUTER JOIN的RRF融合
rrf_subq = (
select(
func.coalesce(vector_subq.c.id, keyword_subq.c.id).label("chunk_id"),
(func.coalesce(1.0 / (K + vector_subq.c.vector_rank), 0.0) +
func.coalesce(1.0 / (K + keyword_subq.c.keyword_rank), 0.0)
).label("rrf_score"))
.select_from(vector_subq.outerjoin(keyword_subq, ..., full=True))
.order_by("rrf_score DESC")
.limit(top_k)
.subquery()
)
return await session.execute(
select(Chunk).join(rrf_subq, Chunk.id == rrf_subq.c.chunk_id)
)Common Patterns
常见模式
Filtered Search
过滤搜索
python
results = await hybrid_search(
query="binary search",
query_embedding=embedding,
content_type_filter=["code_block"]
)python
results = await hybrid_search(
query="binary search",
query_embedding=embedding,
content_type_filter=["code_block"]
)Similarity Threshold
相似度阈值过滤
python
results = await hybrid_search(query, embedding, top_k=50)
filtered = [r for r in results if (1 - r.vector_distance) >= 0.75][:10]python
results = await hybrid_search(query, embedding, top_k=50)
filtered = [r for r in results if (1 - r.vector_distance) >= 0.75][:10]Multi-Query Retrieval
多查询检索
python
queries = ["machine learning", "ML algorithms", "neural networks"]
all_results = [await hybrid_search(q, embed(q)) for q in queries]
final = deduplicate_and_rerank(all_results)python
queries = ["machine learning", "ML algorithms", "neural networks"]
all_results = [await hybrid_search(q, embed(q)) for q in queries]
final = deduplicate_and_rerank(all_results)Performance Tips
性能优化技巧
- Pre-compute tsvector - 5-10x faster than at query time
to_tsvector() - Use HNSW index - 17x faster queries than IVFFlat
- 3x fetch multiplier - Better RRF coverage (30 results per search for top 10)
- Iterative scan for filtered queries - Set
hnsw.iterative_scan = 'relaxed_order' - Metadata boosting - +6% MRR with title/path matching
- 预计算tsvector - 比查询时调用快5-10倍
to_tsvector() - 使用HNSW索引 - 查询速度比IVFFlat快17倍
- 3倍结果获取系数 - 提升RRF覆盖度(每个搜索取30条结果以得到最终10条)
- 过滤查询使用迭代扫描 - 设置
hnsw.iterative_scan = 'relaxed_order' - 元数据加权 - 匹配标题/路径可提升6%的MRR(平均倒数排名)
Redis 8 FT.HYBRID Alternative
Redis 8 FT.HYBRID替代方案
Redis 8.4 introduced with native hybrid search support. Consider this alternative when latency is critical.
FT.HYBRIDRedis 8.4引入了,支持原生混合搜索。当对延迟要求极高时,可考虑此替代方案。
FT.HYBRIDFT.HYBRID Command Syntax
FT.HYBRID命令语法
redis
FT.HYBRID index query
VECTOR $embedding AS vec_score
TEXT "search terms" AS text_score
COMBINE RRF
LIMIT 0 10
RETURN 3 content title scoreredis
FT.HYBRID index query
VECTOR $embedding AS vec_score
TEXT "search terms" AS text_score
COMBINE RRF
LIMIT 0 10
RETURN 3 content title scoreComparison: pgvector vs Redis 8 FT.HYBRID
对比:pgvector vs Redis 8 FT.HYBRID
| Aspect | pgvector | Redis 8 FT.HYBRID |
|---|---|---|
| Setup complexity | Medium (extensions, indexes) | Low (single FT.CREATE) |
| RRF implementation | Manual SQL with FULL OUTER JOIN | Native |
| Keyword scoring | | |
| Typical latency | 5-20ms | 2-5ms |
| Persistence | ACID transactions | AOF/RDB (configurable) |
| Max dataset | Billions (with partitioning) | Memory-bound (~100M vectors) |
| Joins/relations | Full SQL support | Limited (hash/JSON docs) |
| Operational maturity | Battle-tested | New (8.4, 2024) |
| 维度 | pgvector | Redis 8 FT.HYBRID |
|---|---|---|
| 部署复杂度 | 中等(需扩展插件、创建索引) | 低(仅需FT.CREATE命令) |
| RRF实现方式 | 手动通过SQL FULL OUTER JOIN实现 | 原生支持 |
| 关键词评分算法 | | |
| 典型延迟 | 5-20ms | 2-5ms |
| 持久性 | ACID事务支持 | AOF/RDB(可配置) |
| 最大数据集规模 | 数十亿(支持分区) | 受内存限制(约1亿向量) |
| 关联查询支持 | 完整SQL支持 | 有限(哈希/JSON文档) |
| 生产成熟度 | 久经考验 | 较新(2024年8.4版本推出) |
When to Choose pgvector
何时选择pgvector
- Already using PostgreSQL - No new infrastructure, single source of truth
- Need ACID transactions - Atomic updates across search index and relational data
- Complex joins with relational data - Foreign keys, aggregations, reporting
- Large datasets - Billions of vectors with table partitioning
- Compliance requirements - Established PostgreSQL security/audit tooling
- 已在使用PostgreSQL - 无需新增基础设施,保持单一数据源
- 需要ACID事务 - 搜索索引与关系型数据可实现原子更新
- 复杂关联查询 - 支持外键、聚合、报表等操作
- 超大规模数据集 - 支持表分区,可处理数十亿向量
- 合规要求 - PostgreSQL拥有成熟的安全/审计工具链
When to Choose Redis 8
何时选择Redis 8
- Need sub-5ms latency - Real-time autocomplete, typeahead, live recommendations
- Caching layer with search - Already using Redis for cache, add search capability
- Simpler deployment - Single binary, no extensions or complex configuration
- Ephemeral search indexes - Rebuild from source of truth is acceptable
- Memory budget allows - Dataset fits comfortably in RAM
- 亚5ms延迟需求 - 实时自动补全、输入提示、实时推荐
- 已有Redis缓存层 - 在现有缓存基础上添加搜索能力
- 简化部署 - 单二进制文件,无需扩展插件或复杂配置
- 临时搜索索引 - 可从源数据重建索引的场景
- 内存预算充足 - 数据集可完全放入内存
References
参考资料
Detailed Implementation Guides
详细实现指南
| Reference | Description | Use When |
|---|---|---|
| index-strategies.md | HNSW vs IVFFlat, tuning, iterative scans | Choosing/optimizing indexes |
| hybrid-search-rrf.md | RRF algorithm, SQL implementation, debugging | Implementing hybrid search |
| metadata-filtering.md | Pre/post filtering, score boosting | Improving relevance |
| 参考文档 | 描述 | 适用场景 |
|---|---|---|
| index-strategies.md | HNSW与IVFFlat对比、调优、迭代扫描 | 选择/优化索引时 |
| hybrid-search-rrf.md | RRF算法、SQL实现、调试 | 实现混合搜索时 |
| metadata-filtering.md | 预过滤/后过滤、评分加权 | 提升搜索相关性时 |
External Resources
外部资源
Related Skills
相关技能
- - Embeddings and vector concepts
ai-native-development - - Schema design for vector search
database-schema-designer
Version: 1.2.0 | Status: Production-ready | Updated: pgvector 0.8.1
- - 嵌入向量与向量相关概念
ai-native-development - - 向量搜索的Schema设计
database-schema-designer
版本: 1.2.0 | 状态: 可用于生产环境 | 更新适配: pgvector 0.8.1
Capability Details
能力详情
hybrid-search-rrf
hybrid-search-rrf
Keywords: hybrid search, rrf, reciprocal rank fusion, vector bm25, semantic keyword search
Solves:
- How do I combine vector and keyword search?
- Implement hybrid retrieval with RRF
- Merge semantic and BM25 results
关键词: hybrid search, rrf, reciprocal rank fusion, vector bm25, semantic keyword search
解决问题:
- 如何结合向量与关键词搜索?
- 基于RRF实现混合检索
- 合并语义与BM25搜索结果
semantic-search
semantic-search
Keywords: semantic search, vector similarity, embedding, nearest neighbor, cosine distance
Solves:
- How does semantic search work?
- When to use semantic vs keyword search
- Semantic search strengths and weaknesses
关键词: semantic search, vector similarity, embedding, nearest neighbor, cosine distance
解决问题:
- 语义搜索的工作原理是什么?
- 何时使用语义搜索 vs 关键词搜索
- 语义搜索的优势与劣势
keyword-search-bm25
keyword-search-bm25
Keywords: bm25, full-text search, tsvector, tsquery, keyword search
Solves:
- How does BM25 keyword search work?
- Implement PostgreSQL full-text search
- BM25 vs semantic search trade-offs
关键词: bm25, full-text search, tsvector, tsquery, keyword search
解决问题:
- BM25关键词搜索的工作原理是什么?
- 实现PostgreSQL全文搜索
- BM25与语义搜索的权衡
rrf-algorithm
rrf-algorithm
Keywords: rrf, reciprocal rank fusion, rank-based fusion, score normalization
Solves:
- How does Reciprocal Rank Fusion work?
- Why use rank instead of scores?
- RRF smoothing constant (k parameter)
关键词: rrf, reciprocal rank fusion, rank-based fusion, score normalization
解决问题:
- Reciprocal Rank Fusion的工作原理是什么?
- 为什么基于排名而非分数进行融合?
- RRF平滑常数(k参数)
database-schema
database-schema
Keywords: pgvector schema, chunk table, embedding column, tsvector, generated column
Solves:
- How do I design schema for hybrid search?
- Store embeddings with vector(1024)
- Pre-compute tsvector for performance
关键词: pgvector schema, chunk table, embedding column, tsvector, generated column
解决问题:
- 如何设计混合搜索的数据库Schema?
- 使用vector(1024)存储嵌入向量
- 预计算tsvector以提升性能
search-query-implementation
search-query-implementation
Keywords: hybrid search query, sqlalchemy, vector distance, ts_rank_cd, full outer join
Solves:
- How do I write hybrid search SQL?
- Implement RRF in SQLAlchemy
- Use fetch multiplier for better coverage
关键词: hybrid search query, sqlalchemy, vector distance, ts_rank_cd, full outer join
解决问题:
- 如何编写混合搜索的SQL语句?
- 在SQLAlchemy中实现RRF
- 使用结果获取系数提升覆盖度
indexing-strategies
indexing-strategies
Keywords: pgvector index, hnsw, ivfflat, vector index performance, index tuning
Solves:
- HNSW vs IVFFlat comparison
- Optimize vector search speed
- Scale to millions of vectors
关键词: pgvector index, hnsw, ivfflat, vector index performance, index tuning
解决问题:
- HNSW与IVFFlat对比
- 优化向量搜索速度
- 扩展至百万级向量
pre-computed-tsvector
pre-computed-tsvector
Keywords: tsvector, gin index, full-text index, pre-computed column, generated column
Solves:
- Optimize keyword search performance
- 5-10x speedup with indexed tsvector
关键词: tsvector, gin index, full-text index, pre-computed column, generated column
解决问题:
- 优化关键词搜索性能
- 索引化tsvector可提升5-10倍速度
metadata-filtering
metadata-filtering
Keywords: metadata filter, faceted search, content type filter, score boosting
Solves:
- Filter search by metadata
- Boost results by section title
- Pre-filter by content type
关键词: metadata filter, faceted search, content type filter, score boosting
解决问题:
- 基于元数据过滤搜索结果
- 通过章节标题加权提升结果排名
- 按内容类型预过滤
common-patterns
common-patterns
Keywords: filtered search, similarity threshold, multi-query retrieval, search patterns
Solves:
- Filter search by content type
- Set minimum similarity threshold
- Implement multi-query retrieval
关键词: filtered search, similarity threshold, multi-query retrieval, search patterns
解决问题:
- 按内容类型过滤搜索结果
- 设置最小相似度阈值
- 实现多查询检索
golden-dataset-testing
golden-dataset-testing
Keywords: golden dataset, search evaluation, pass rate, mrr, retrieval testing
Solves:
- Test hybrid search quality
- Evaluate search with golden queries
- Calculate pass rate and MRR metrics
关键词: golden dataset, search evaluation, pass rate, mrr, retrieval testing
解决问题:
- 测试混合搜索的质量
- 使用标准查询评估搜索效果
- 计算通过率与MRR指标