postgres-hybrid-text-search
Hybrid Text Search
Hybrid search combines keyword search (BM25) with semantic search (vector embeddings) to get the best of both: exact keyword matching and meaning-based retrieval. Use Reciprocal Rank Fusion (RRF) to merge results from both methods into a single ranked list.
This guide covers combining pg_textsearch (BM25) with pgvector. Requires both extensions. For high-volume setups, filtering, or advanced pgvector tuning (binary quantization, HNSW parameters), see the pgvector-semantic-search skill.
pg_textsearch is a new BM25 text search extension for PostgreSQL, fully open-source and available hosted on Tiger Cloud as well as for self-managed deployments. It provides true BM25 ranking, which often improves relevance compared to PostgreSQL's built-in ts_rank and can offer better performance at scale. Note: pg_textsearch is currently in prerelease and not yet recommended for production use. pg_textsearch currently supports PostgreSQL 17 and 18.
When to Use Hybrid Search
- Use hybrid when queries mix specific terms (product names, codes, proper nouns) with conceptual intent
- Use semantic only when meaning matters more than exact wording (e.g., "how to fix slow queries" should match "query optimization")
- Use keyword only when exact matches are critical (e.g., error codes, SKUs, legal citations)
Hybrid search typically improves recall over either method alone, at the cost of slightly more complexity.
Data Preparation
Chunk your documents into smaller pieces (typically 500–1000 tokens) and store each chunk with its embedding. Both BM25 and semantic search operate on the same chunks—this keeps fusion simple since you're comparing like with like.
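The chunking step can be sketched in Python. This is a minimal illustration only, using whitespace-separated words as a stand-in for real tokens; production pipelines typically use a model tokenizer (e.g., tiktoken) and sentence-aware boundaries:

```python
def chunk_text(text, chunk_size=800, overlap=100):
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    step = chunk_size - overlap  # advance per chunk; overlap preserves context
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail
    return chunks
```

Each returned chunk would then be embedded and inserted as one row, so BM25 and vector search score the same unit of text.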
Golden Path (Default Setup)
```sql
-- Enable extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_textsearch;

-- Table with both indexes
CREATE TABLE documents (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  content TEXT NOT NULL,
  embedding halfvec(1536) NOT NULL
);

-- BM25 index for keyword search
CREATE INDEX ON documents USING bm25 (content) WITH (text_config = 'english');

-- HNSW index for semantic search
CREATE INDEX ON documents USING hnsw (embedding halfvec_cosine_ops);
```

BM25 Notes
- Negative scores: The `<@>` operator returns negative values, where lower = better match. RRF uses rank position, so this doesn't affect fusion.
- Language config: Change `text_config` to match your content language (e.g., `'french'`, `'german'`). See PostgreSQL text search configurations.
- Tuning: BM25 has `k1` (term frequency saturation, default 1.2) and `b` (length normalization, default 0.75) parameters. Defaults work well; only tune if relevance is poor.

```sql
CREATE INDEX ON documents USING bm25 (content) WITH (text_config = 'english', k1 = 1.5, b = 0.8);
```

- Partitioned tables: Each partition maintains local statistics. Scores are not directly comparable across partitions—query individual partitions when score comparability matters.
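Since RRF consumes only rank positions, the negative-score convention never affects fusion. A tiny sketch, with made-up scores, of turning BM25 scores into ranks:

```python
# Hypothetical BM25 scores from pg_textsearch: negative, lower = better match.
rows = [
    {'id': 10, 'score': -2.1},  # best match (most negative)
    {'id': 7,  'score': -0.4},  # weakest match
    {'id': 3,  'score': -1.3},
]

# Sorting ascending puts the best match first; only these rank
# positions feed into the RRF formula 1 / (k + rank).
ranked_ids = [r['id'] for r in sorted(rows, key=lambda r: r['score'])]
# ranked_ids == [10, 3, 7]
```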
RRF Query Pattern
Reciprocal Rank Fusion combines rankings from multiple searches. Each result's score is `1 / (k + rank)`, where `k` is a constant (typically 60). Scores are summed across searches and the results re-sorted.

Run both queries in parallel from your client for lower latency, then fuse results client-side:
```sql
-- Query 1: Keyword search (BM25)
-- $1: search text
SELECT id, content FROM documents ORDER BY content <@> $1 LIMIT 50;
```

```sql
-- Query 2: Semantic search (separate query, run in parallel)
-- $1: embedding of your search text as halfvec(1536)
SELECT id, content FROM documents ORDER BY embedding <=> $1::halfvec(1536) LIMIT 50;
```

```python
# Client-side RRF fusion (Python)
def rrf_fusion(keyword_results, semantic_results, k=60, limit=10):
    scores = {}
    content_map = {}
    for rank, row in enumerate(keyword_results, start=1):
        scores[row['id']] = scores.get(row['id'], 0) + 1 / (k + rank)
        content_map[row['id']] = row['content']
    for rank, row in enumerate(semantic_results, start=1):
        scores[row['id']] = scores.get(row['id'], 0) + 1 / (k + rank)
        content_map[row['id']] = row['content']
    sorted_ids = sorted(scores, key=scores.get, reverse=True)[:limit]
    return [{'id': id, 'content': content_map[id], 'score': scores[id]} for id in sorted_ids]
```

```typescript
// Client-side RRF fusion (TypeScript)
type Row = { id: number; content: string };
type Result = Row & { score: number };

function rrfFusion(keywordResults: Row[], semanticResults: Row[], k = 60, limit = 10): Result[] {
  const scores = new Map<number, number>();
  const contentMap = new Map<number, string>();
  keywordResults.forEach((row, i) => {
    scores.set(row.id, (scores.get(row.id) ?? 0) + 1 / (k + i + 1));
    contentMap.set(row.id, row.content);
  });
  semanticResults.forEach((row, i) => {
    scores.set(row.id, (scores.get(row.id) ?? 0) + 1 / (k + i + 1));
    contentMap.set(row.id, row.content);
  });
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([id, score]) => ({ id, content: contentMap.get(id)!, score }));
}
```

RRF Parameters
| Parameter | Default | Description |
|---|---|---|
| `k` | 60 | Smoothing constant. Higher values reduce rank differences; 60 is standard |
| Candidates per search | 50 | Higher = better recall, more work |
| Final limit | 10 | Results returned after fusion |
Increase candidates if relevant results are being missed. The k=60 constant rarely needs tuning.
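To see why `k = 60` rarely needs tuning, compare the contribution gap between adjacent ranks at different `k` values:

```python
def rrf_contribution(rank, k=60):
    return 1 / (k + rank)

# With k = 60 the gap between rank 1 and rank 2 is tiny, so one list's
# top hit cannot swamp the fused score; with a small k it would dominate.
gap_smooth = rrf_contribution(1, k=60) - rrf_contribution(2, k=60)  # ~0.00026
gap_sharp = rrf_contribution(1, k=1) - rrf_contribution(2, k=1)     # ~0.167
```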
Weighting Keyword vs Semantic
To favor one method over the other, multiply its RRF contribution by a weight:

```python
# Weight semantic search 2x higher than keyword
keyword_weight = 1.0
semantic_weight = 2.0

for rank, row in enumerate(keyword_results, start=1):
    scores[row['id']] = scores.get(row['id'], 0) + keyword_weight / (k + rank)
for rank, row in enumerate(semantic_results, start=1):
    scores[row['id']] = scores.get(row['id'], 0) + semantic_weight / (k + rank)
```

```typescript
// Weight semantic search 2x higher than keyword
const keywordWeight = 1.0;
const semanticWeight = 2.0;

keywordResults.forEach((row, i) => {
  scores.set(row.id, (scores.get(row.id) ?? 0) + keywordWeight / (k + i + 1));
});
semanticResults.forEach((row, i) => {
  scores.set(row.id, (scores.get(row.id) ?? 0) + semanticWeight / (k + i + 1));
});
```

Start with equal weights (1.0 each) and adjust based on measured relevance.
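One way to measure relevance before adjusting weights is to track recall over a small labeled set. A minimal sketch, assuming you have per-query sets of known-relevant document ids:

```python
def recall_at_k(fused_results, relevant_ids, k=10):
    """Fraction of known-relevant documents that appear in the top k."""
    top_ids = {row['id'] for row in fused_results[:k]}
    return len(top_ids & relevant_ids) / max(len(relevant_ids), 1)

# Toy example: 2 of the 3 relevant docs retrieved in the top 3.
fused = [{'id': 1}, {'id': 5}, {'id': 9}, {'id': 2}]
score = recall_at_k(fused, relevant_ids={1, 2, 9}, k=3)
```

Re-run the same labeled queries under each candidate weight setting and keep the weights with the best average recall.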
Reranking with ML Models
For the highest quality, add a reranking step using a cross-encoder model. Cross-encoders (e.g., `cross-encoder/ms-marco-MiniLM-L-6-v2`) are more accurate than bi-encoders but too slow for initial retrieval—use them only on the candidate set.

Run the same parallel queries as above with a higher LIMIT (e.g., 100), then:
```python
# 1. Fuse results with RRF (more candidates for reranking)
candidates = rrf_fusion(keyword_results, semantic_results, limit=100)

# 2. Rerank with cross-encoder
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
pairs = [(query_text, doc['content']) for doc in candidates]
scores = reranker.predict(pairs)

# 3. Return top 10 by reranker score
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:10]
```

```typescript
import { CohereClientV2 } from 'cohere-ai';

// 1. Fuse results with RRF (more candidates for reranking)
const candidates = rrfFusion(keywordResults, semanticResults, 60, 100);

// 2. Rerank via API (example uses Cohere SDK; Jina, Voyage, and others work similarly)
const cohere = new CohereClientV2({ token: COHERE_API_KEY });
const reranked = await cohere.rerank({
  model: 'rerank-v3.5',
  query: queryText,
  documents: candidates.map(c => c.content),
  topN: 10
});

// 3. Map back to original documents
const results = reranked.results.map(r => candidates[r.index]);
```

Reranking is optional—hybrid RRF alone significantly improves over single-method search.
Performance Considerations
- Index both columns: BM25 index on text, HNSW index on embedding
- Limit candidate pools: 50–100 candidates per method is usually sufficient
- Run queries in parallel: Client-side parallelism reduces latency vs sequential execution
- Monitor latency: Hybrid adds overhead; ensure both indexes fit in memory
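The "run queries in parallel" point can be sketched with asyncio. The two query functions here are stubs standing in for real database round-trips (e.g., asyncpg's `conn.fetch`); the point is that `asyncio.gather` overlaps the two latencies instead of paying them back to back:

```python
import asyncio

async def run_keyword_query(query_text):
    await asyncio.sleep(0.05)  # stand-in for the BM25 query round-trip
    return [{'id': 1, 'content': 'keyword hit'}]

async def run_semantic_query(query_embedding):
    await asyncio.sleep(0.05)  # stand-in for the vector query round-trip
    return [{'id': 2, 'content': 'semantic hit'}]

async def hybrid_search(query_text, query_embedding):
    # Both round-trips overlap: total wait is ~one round-trip, not two
    keyword_results, semantic_results = await asyncio.gather(
        run_keyword_query(query_text),
        run_semantic_query(query_embedding),
    )
    return keyword_results, semantic_results

kw, sem = asyncio.run(hybrid_search('search text', [0.1] * 1536))
```

With real connections, each query would run on its own connection (or via a pool), since a single PostgreSQL connection processes one query at a time.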
Scaling with pgvectorscale
For large datasets (10M+ vectors) or workloads with selective metadata filters, consider pgvectorscale's StreamingDiskANN index instead of HNSW for the semantic search component.
When to use StreamingDiskANN:
- Large datasets where HNSW doesn't fit in memory
- Queries that filter by labels (e.g., tenant_id, category, tags)
- When you need high-performance filtered vector search
Label-based filtering: StreamingDiskANN supports filtered indexes on `smallint[]` label columns. Labels are indexed alongside vectors, enabling efficient filtered search without post-filtering accuracy loss.

```sql
-- Enable pgvectorscale (in addition to pgvector)
CREATE EXTENSION IF NOT EXISTS vectorscale;

-- Table with label column for filtering
CREATE TABLE documents (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  content TEXT NOT NULL,
  embedding halfvec(1536) NOT NULL,
  labels smallint[] NOT NULL  -- e.g., category IDs, tenant IDs
);

-- StreamingDiskANN index with label filtering
CREATE INDEX ON documents USING diskann (embedding vector_cosine_ops, labels);

-- BM25 index for keyword search
CREATE INDEX ON documents USING bm25 (content) WITH (text_config = 'english');

-- Filtered semantic search using && (array overlap)
SELECT id, content FROM documents
WHERE labels && ARRAY[1, 3]::smallint[]
ORDER BY embedding <=> $1::halfvec(1536) LIMIT 50;
```

See the pgvectorscale documentation for more details on filtered indexes and tuning parameters.
Monitoring & Debugging
```sql
-- Force index usage for verification (planner may prefer seqscan on small tables)
SET enable_seqscan = off;

-- Verify BM25 index is used
EXPLAIN SELECT id, content FROM documents ORDER BY content <@> 'search text' LIMIT 10;
-- Look for: Index Scan using ... (bm25)

-- Verify HNSW index is used
EXPLAIN SELECT id, content FROM documents ORDER BY embedding <=> '[0.1, 0.2, ...]'::halfvec(1536) LIMIT 10;
-- Look for: Index Scan using ... (hnsw)

SET enable_seqscan = on; -- Re-enable for normal operation

-- Check index sizes
SELECT indexname, pg_size_pretty(pg_relation_size(indexname::regclass)) AS size
FROM pg_indexes WHERE tablename = 'documents';
```

If EXPLAIN still shows sequential scans with `enable_seqscan = off`, verify the indexes exist and the queries use the correct operators (`<@>` for BM25, `<=>` for cosine distance). For more pgvector debugging guidance, see the pgvector-semantic-search skill.

Common Issues
| Symptom | Likely Cause | Fix |
|---|---|---|
| Missing exact matches | Keyword search not returning them | Check BM25 index exists; verify text_config matches content language |
| Poor semantic results | Embedding model mismatch | Ensure query embedding uses same model as stored embeddings |
| Slow queries | Large candidate pools or missing indexes | Reduce inner LIMIT; verify both indexes exist and are used (EXPLAIN) |
| Skewed results | One method dominating | Adjust RRF weights; verify both searches return reasonable candidates |
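For the "skewed results" row, a quick diagnostic is to check how much of the fused top-k each method contributed. A small helper, assuming the fused rows and both candidate id lists are already in hand:

```python
def source_coverage(fused_top, keyword_ids, semantic_ids):
    """Fraction of the fused top-k present in each method's candidate list."""
    top_ids = {row['id'] for row in fused_top}
    return {
        'keyword': len(top_ids & set(keyword_ids)) / len(top_ids),
        'semantic': len(top_ids & set(semantic_ids)) / len(top_ids),
    }

# Toy example: doc 3 was found by both methods.
coverage = source_coverage(
    fused_top=[{'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}],
    keyword_ids=[1, 2, 3],
    semantic_ids=[3, 4],
)
```

If one fraction sits near 1.0 while the other is near 0.0 across many queries, that method is dominating and the RRF weights (or candidate pools) deserve a look.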