spice-search
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseSearch Data
数据搜索
Spice provides integrated search capabilities: vector (semantic) search, full-text (keyword) search, and hybrid search with Reciprocal Rank Fusion (RRF) — all via SQL functions and HTTP APIs. Search indexes are built on top of accelerated datasets.
Spice 提供集成搜索功能:向量(语义)搜索、全文(关键词)搜索以及结合Reciprocal Rank Fusion (RRF)的混合搜索——所有功能均可通过SQL函数和HTTP API实现。搜索索引构建在加速数据集之上。
Search Methods
搜索方法
| Method | When to Use | Requires |
|---|---|---|
| Vector search | Semantic similarity, RAG, recommendations | Embedding model + column embeddings |
| Full-text search | Keyword/phrase matching, exact terms | |
| Hybrid (RRF) | Best of both — combines rankings from multiple methods | Multiple search methods configured |
| Lexical (LIKE/=) | Exact pattern or value matching | Nothing extra |
| 方法 | 适用场景 | 所需条件 |
|---|---|---|
| 向量搜索 | 语义相似度匹配、RAG、推荐系统 | Embedding模型 + 列向量Embedding |
| 全文搜索 | 关键词/短语匹配、精确术语匹配 | 列上配置 |
| 混合搜索(RRF) | 兼顾两者优势——结合多种搜索方法的排名结果 | 已配置多种搜索方法 |
| 词法搜索(LIKE/=) | 精确模式或值匹配 | 无额外要求 |
Set Up Vector Search
配置向量搜索
1. Define an Embedding Model
1. 定义Embedding模型
yaml
embeddings:
- name: local_embeddings
from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- name: openai_embeddings
from: openai:text-embedding-3-small
params:
openai_api_key: ${ secrets:OPENAI_API_KEY }yaml
embeddings:
- name: local_embeddings
from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- name: openai_embeddings
from: openai:text-embedding-3-small
params:
openai_api_key: ${ secrets:OPENAI_API_KEY }Supported Embedding Providers
支持的Embedding提供商
| Provider | From Format | Status |
|---|---|---|
| OpenAI | | Release Candidate |
| HuggingFace | | Release Candidate |
| Local file | | Release Candidate |
| Azure OpenAI | | Alpha |
| Google AI | | Alpha |
| Amazon Bedrock | | Alpha |
| Databricks | | Alpha |
| Model2Vec | | Alpha |
| 提供商 | 来源格式 | 状态 |
|---|---|---|
| OpenAI | | Release Candidate |
| HuggingFace | | Release Candidate |
| 本地文件 | | Release Candidate |
| Azure OpenAI | | Alpha |
| Google AI | | Alpha |
| Amazon Bedrock | | Alpha |
| Databricks | | Alpha |
| Model2Vec | | Alpha |
2. Configure Dataset Columns for Embeddings
2. 为数据集列配置Embedding
yaml
datasets:
- from: postgres:documents
name: docs
acceleration:
enabled: true
columns:
- name: content
embeddings:
- from: local_embeddings
row_id: id
chunking:
enabled: true
target_chunk_size: 512
overlap_size: 64yaml
datasets:
- from: postgres:documents
name: docs
acceleration:
enabled: true
columns:
- name: content
embeddings:
- from: local_embeddings
row_id: id
chunking:
enabled: true
target_chunk_size: 512
overlap_size: 64Embedding Methods
Embedding方式
| Method | Description | When to Use |
|---|---|---|
| Accelerated | Precomputed and stored | Faster queries, frequently searched datasets |
| JIT (Just-in-Time) | Computed at query time (no acceleration) | Large or rarely queried datasets |
| Passthrough | Pre-existing embeddings used directly | Source already has |
| 方式 | 描述 | 适用场景 |
|---|---|---|
| 预计算加速 | 提前计算并存储 | 查询速度快、频繁搜索的数据集 |
| 即时计算(JIT) | 查询时计算(无加速) | 大型或极少查询的数据集 |
| 直接复用 | 直接使用已有的Embedding | 数据源已包含 |
3. Query via HTTP API
3. 通过HTTP API查询
bash
curl -X POST http://localhost:8090/v1/search \
-H 'Content-Type: application/json' \
-d '{
"datasets": ["docs"],
"text": "cutting edge AI",
"where": "author=\"jeadie\"",
"additional_columns": ["title", "state"],
"limit": 5
}'| Field | Required | Description |
|---|---|---|
| Yes | Search text |
| No | Datasets to search (null = all searchable) |
| No | Extra columns to return |
| No | SQL filter predicate |
| No | Max results per dataset |
To retrieve full documents (not just chunks), include the embedding column name in .
additional_columnsbash
curl -X POST http://localhost:8090/v1/search \
-H 'Content-Type: application/json' \
-d '{
"datasets": ["docs"],
"text": "cutting edge AI",
"where": "author=\"jeadie\"",
"additional_columns": ["title", "state"],
"limit": 5
}'| 字段 | 是否必填 | 描述 |
|---|---|---|
| 是 | 搜索文本 |
| 否 | 要搜索的数据集(null = 所有可搜索数据集) |
| 否 | 要返回的额外列 |
| 否 | SQL过滤条件 |
| 否 | 每个数据集的最大结果数 |
要检索完整文档(而非仅片段),需在中包含Embedding列名。
additional_columns4. Query via SQL UDTF
4. 通过SQL UDTF查询
sql
SELECT id, title, score
FROM vector_search(docs, 'cutting edge AI')
WHERE state = 'Open'
ORDER BY score DESC
LIMIT 5;vector_searchsql
vector_search(
table STRING, -- Dataset name (required)
query STRING, -- Search text (required)
col STRING, -- Column (optional if single embedding column)
limit INTEGER, -- Max results (default: 1000)
include_score BOOLEAN -- Include score column (default: TRUE)
) RETURNS TABLELimitation:UDTF does not yet support chunked embedding columns. Use the HTTP API for chunked data.vector_search
sql
SELECT id, title, score
FROM vector_search(docs, 'cutting edge AI')
WHERE state = 'Open'
ORDER BY score DESC
LIMIT 5;vector_searchsql
vector_search(
table STRING, -- 数据集名称(必填)
query STRING, -- 搜索文本(必填)
col STRING, -- 列名(若只有一个Embedding列则可选)
limit INTEGER, -- 最大结果数(默认:1000)
include_score BOOLEAN -- 是否包含得分列(默认:TRUE)
) RETURNS TABLE限制:UDTF目前暂不支持分块的Embedding列。对于分块数据,请使用HTTP API。vector_search
Set Up Full-Text Search
配置全文搜索
Full-text search uses BM25 scoring (powered by Tantivy) for keyword relevance ranking.
全文搜索采用BM25评分(由Tantivy提供支持)进行关键词相关性排名。
1. Enable Indexing on Columns
1. 为列启用索引
yaml
datasets:
- from: postgres:articles
name: articles
acceleration:
enabled: true
columns:
- name: title
full_text_search:
enabled: true
row_id:
- id
- name: body
full_text_search:
enabled: trueyaml
datasets:
- from: postgres:articles
name: articles
acceleration:
enabled: true
columns:
- name: title
full_text_search:
enabled: true
row_id:
- id
- name: body
full_text_search:
enabled: true2. Query via SQL UDTF
2. 通过SQL UDTF查询
sql
SELECT id, title, score
FROM text_search(articles, 'search keywords', body)
ORDER BY score DESC
LIMIT 5;text_searchsql
text_search(
table STRING, -- Dataset name (required)
query STRING, -- Keywords/phrase (required)
col STRING, -- Column (required if multiple indexed columns)
limit INTEGER, -- Max results (default: 1000)
include_score BOOLEAN -- Include score column (default: TRUE)
) RETURNS TABLEsql
SELECT id, title, score
FROM text_search(articles, 'search keywords', body)
ORDER BY score DESC
LIMIT 5;text_searchsql
text_search(
table STRING, -- 数据集名称(必填)
query STRING, -- 关键词/短语(必填)
col STRING, -- 列名(若有多个索引列则必填)
limit INTEGER -- 最大结果数(默认:1000)
include_score BOOLEAN -- 是否包含得分列(默认:TRUE)
) RETURNS TABLEHybrid Search with RRF
结合RRF的混合搜索
Reciprocal Rank Fusion merges rankings from multiple search methods. Each query runs independently, then results are combined:
RRF Score = Σ(rank_weight / (k + rank))Documents appearing across multiple result sets receive higher scores.
Reciprocal Rank Fusion会合并多种搜索方法的排名结果。每个查询独立运行,然后合并结果:
RRF 得分 = Σ(rank_weight / (k + rank))在多个结果集中出现的文档会获得更高的得分。
Basic Hybrid Search
基础混合搜索
sql
SELECT id, title, content, fused_score
FROM rrf(
vector_search(documents, 'machine learning algorithms'),
text_search(documents, 'neural networks deep learning', content),
join_key => 'id'
)
ORDER BY fused_score DESC
LIMIT 5;sql
SELECT id, title, content, fused_score
FROM rrf(
vector_search(documents, 'machine learning algorithms'),
text_search(documents, 'neural networks deep learning', content),
join_key => 'id'
)
ORDER BY fused_score DESC
LIMIT 5;Weighted Ranking
加权排名
sql
SELECT fused_score, title, content
FROM rrf(
text_search(posts, 'artificial intelligence', rank_weight => 50.0),
vector_search(posts, 'AI machine learning', rank_weight => 200.0)
)
ORDER BY fused_score DESC
LIMIT 10;sql
SELECT fused_score, title, content
FROM rrf(
text_search(posts, 'artificial intelligence', rank_weight => 50.0),
vector_search(posts, 'AI machine learning', rank_weight => 200.0)
)
ORDER BY fused_score DESC
LIMIT 10;Recency-Boosted Search
时效性增强搜索
sql
-- Exponential decay (1-hour scale)
SELECT fused_score, title, created_at
FROM rrf(
text_search(news, 'breaking news'),
vector_search(news, 'latest updates'),
time_column => 'created_at',
recency_decay => 'exponential',
decay_constant => 0.05,
decay_scale_secs => 3600
)
ORDER BY fused_score DESC
LIMIT 10;
-- Linear decay (24-hour window)
SELECT fused_score, content
FROM rrf(
text_search(posts, 'trending'),
vector_search(posts, 'viral popular'),
time_column => 'created_at',
recency_decay => 'linear',
decay_window_secs => 86400
)
ORDER BY fused_score DESC;sql
-- 指数衰减(1小时尺度)
SELECT fused_score, title, created_at
FROM rrf(
text_search(news, 'breaking news'),
vector_search(news, 'latest updates'),
time_column => 'created_at',
recency_decay => 'exponential',
decay_constant => 0.05,
decay_scale_secs => 3600
)
ORDER BY fused_score DESC
LIMIT 10;
-- 线性衰减(24小时窗口)
SELECT fused_score, content
FROM rrf(
text_search(posts, 'trending'),
vector_search(posts, 'viral popular'),
time_column => 'created_at',
recency_decay => 'linear',
decay_window_secs => 86400
)
ORDER BY fused_score DESC;Cross-Language Search
跨语言搜索
sql
SELECT fused_score, text, langs
FROM rrf(
vector_search(posts, 'ultimas noticias', rank_weight => 100),
text_search(posts, 'news'),
time_column => 'created_at',
recency_decay => 'exponential',
decay_constant => 0.05,
decay_scale_secs => 3600
)
WHERE trim(text) != ''
ORDER BY fused_score DESC LIMIT 15;sql
SELECT fused_score, text, langs
FROM rrf(
vector_search(posts, 'ultimas noticias', rank_weight => 100),
text_search(posts, 'news'),
time_column => 'created_at',
recency_decay => 'exponential',
decay_constant => 0.05,
decay_scale_secs => 3600
)
WHERE trim(text) != ''
ORDER BY fused_score DESC LIMIT 15;rrf
Parameters
rrfrrf
参数
rrf| Parameter | Type | Required | Description |
|---|---|---|---|
| Search UDTF | Yes (2+) | |
| String | No | Column for joining results (default: auto-hash) |
| Float | No | Smoothing parameter (default: 60.0, lower = more aggressive) |
| String | No | Timestamp column for recency boosting |
| String | No | |
| Float | No | Rate for exponential decay (default: 0.01) |
| Float | No | Time scale for exponential decay (default: 86400) |
| Float | No | Window for linear decay (default: 86400) |
| Float | No | Per-query weight (specified inside search calls) |
| 参数 | 类型 | 是否必填 | 描述 |
|---|---|---|---|
| 搜索UDTF | 是(至少2个) | |
| 字符串 | 否 | 用于合并结果的列(默认:自动哈希) |
| 浮点数 | 否 | 平滑参数(默认:60.0,值越小,排名越激进) |
| 字符串 | 否 | 用于时效性增强的时间戳列 |
| 字符串 | 否 | |
| 浮点数 | 否 | 指数衰减速率(默认:0.01) |
| 浮点数 | 否 | 指数衰减的时间尺度(默认:86400) |
| 浮点数 | 否 | 线性衰减的时间窗口(默认:86400) |
| 浮点数 | 否 | 每个查询的权重(在搜索调用中指定) |
Vector Engines
向量引擎
Store and index embeddings at scale using dedicated vector engines:
yaml
datasets:
- from: postgres:documents
name: docs
acceleration:
enabled: true
columns:
- name: content
embeddings:
- from: embed_model
row_id: id
metadata:
vectors: non-filterable
- name: category
metadata:
vectors: filterable # enable filtering on this column
vectors:
enabled: true
engine: s3_vectors
params:
s3_vectors_bucket: my-bucket
s3_vectors_region: us-east-1使用专用向量引擎大规模存储和索引Embedding:
yaml
datasets:
- from: postgres:documents
name: docs
acceleration:
enabled: true
columns:
- name: content
embeddings:
- from: embed_model
row_id: id
metadata:
vectors: non-filterable
- name: category
metadata:
vectors: filterable # 启用该列的过滤功能
vectors:
enabled: true
engine: s3_vectors
params:
s3_vectors_bucket: my-bucket
s3_vectors_region: us-east-1Lexical Search (SQL)
词法搜索(SQL)
Standard SQL filtering:
sql
SELECT * FROM my_table WHERE column LIKE '%substring%';
SELECT * FROM my_table WHERE column = 'exact value';
SELECT * FROM my_table WHERE regexp_like(column, '^spice.*ai$');标准SQL过滤:
sql
SELECT * FROM my_table WHERE column LIKE '%substring%';
SELECT * FROM my_table WHERE column = 'exact value';
SELECT * FROM my_table WHERE regexp_like(column, '^spice.*ai$');CLI Search
CLI搜索
bash
spice search "cutting edge AI" --dataset docs --limit 5
spice search --cache-control no-cache "search terms"bash
spice search "cutting edge AI" --dataset docs --limit 5
spice search --cache-control no-cache "search terms"Complete Example
完整示例
yaml
version: v1
kind: Spicepod
name: search_app
secrets:
- from: env
name: env
embeddings:
- name: embeddings
from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
datasets:
- from: file:articles.parquet
name: articles
acceleration:
enabled: true
engine: duckdb
columns:
- name: title
full_text_search:
enabled: true
row_id:
- id
- name: content
embeddings:
- from: embeddings
full_text_search:
enabled: truesql
SELECT id, title, content, fused_score
FROM rrf(
vector_search(articles, 'machine learning best practices'),
text_search(articles, 'neural network training', content),
join_key => 'id',
time_column => 'published_at',
recency_decay => 'exponential',
decay_constant => 0.01,
decay_scale_secs => 86400
)
WHERE fused_score > 0.01
ORDER BY fused_score DESC
LIMIT 10;yaml
version: v1
kind: Spicepod
name: search_app
secrets:
- from: env
name: env
embeddings:
- name: embeddings
from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2
datasets:
- from: file:articles.parquet
name: articles
acceleration:
enabled: true
engine: duckdb
columns:
- name: title
full_text_search:
enabled: true
row_id:
- id
- name: content
embeddings:
- from: embeddings
full_text_search:
enabled: truesql
SELECT id, title, content, fused_score
FROM rrf(
vector_search(articles, 'machine learning best practices'),
text_search(articles, 'neural network training', content),
join_key => 'id',
time_column => 'published_at',
recency_decay => 'exponential',
decay_constant => 0.01,
decay_scale_secs => 86400
)
WHERE fused_score > 0.01
ORDER BY fused_score DESC
LIMIT 10;Troubleshooting
故障排除
| Issue | Solution |
|---|---|
| Verify embeddings configured on column and model is loaded |
| Check |
| Poor hybrid search relevance | Tune |
| Results missing recent content | Add |
| Chunked vector search not working via SQL | Use HTTP API instead (UDTF doesn't support chunked columns yet) |
| 问题 | 解决方案 |
|---|---|
| 验证列上已配置Embedding且模型已加载 |
| 检查是否开启 |
| 混合搜索相关性差 | 调整每个查询的 |
| 结果缺少最新内容 | 在RRF中添加 |
| 分块向量搜索无法通过SQL工作 | 改用HTTP API(UDTF暂不支持分块列) |