spice-search

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Search Data

数据搜索

Spice provides integrated search capabilities: vector (semantic) search, full-text (keyword) search, and hybrid search with Reciprocal Rank Fusion (RRF) — all via SQL functions and HTTP APIs. Search indexes are built on top of accelerated datasets.

Spice 提供集成搜索功能：向量（语义）搜索、全文（关键词）搜索以及结合Reciprocal Rank Fusion (RRF)的混合搜索——所有功能均可通过SQL函数和HTTP API实现。搜索索引构建在加速数据集之上。

Search Methods

搜索方法

Method	When to Use	Requires
Vector search	Semantic similarity, RAG, recommendations	Embedding model + column embeddings
Full-text search	Keyword/phrase matching, exact terms	`full_text_search.enabled: true` on columns
Hybrid (RRF)	Best of both — combines rankings from multiple methods	Multiple search methods configured
Lexical (LIKE/=)	Exact pattern or value matching	Nothing extra

方法	适用场景	所需条件
向量搜索	语义相似度匹配、RAG、推荐系统	Embedding模型 + 列向量Embedding
全文搜索	关键词/短语匹配、精确术语匹配	列上配置 `full_text_search.enabled: true`
混合搜索(RRF)	兼顾两者优势——结合多种搜索方法的排名结果	已配置多种搜索方法
词法搜索(LIKE/=)	精确模式或值匹配	无额外要求

Set Up Vector Search

配置向量搜索

1. Define an Embedding Model

1. 定义Embedding模型

yaml

embeddings:
  - name: local_embeddings
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

  - name: openai_embeddings
    from: openai:text-embedding-3-small
    params:
      openai_api_key: ${ secrets:OPENAI_API_KEY }

yaml

embeddings:
  - name: local_embeddings
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

  - name: openai_embeddings
    from: openai:text-embedding-3-small
    params:
      openai_api_key: ${ secrets:OPENAI_API_KEY }

Supported Embedding Providers

支持的Embedding提供商

Provider	From Format	Status
OpenAI	`openai:text-embedding-3-large`	Release Candidate
HuggingFace	`huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2`	Release Candidate
Local file	`file:model.safetensors`	Release Candidate
Azure OpenAI	`azure:my-deployment`	Alpha
Google AI	`google:text-embedding-004`	Alpha
Amazon Bedrock	`bedrock:amazon.titan-embed-text-v1`	Alpha
Databricks	`databricks:endpoint`	Alpha
Model2Vec	`model2vec:model-name`	Alpha

提供商	来源格式	状态
OpenAI	`openai:text-embedding-3-large`	Release Candidate
HuggingFace	`huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2`	Release Candidate
本地文件	`file:model.safetensors`	Release Candidate
Azure OpenAI	`azure:my-deployment`	Alpha
Google AI	`google:text-embedding-004`	Alpha
Amazon Bedrock	`bedrock:amazon.titan-embed-text-v1`	Alpha
Databricks	`databricks:endpoint`	Alpha
Model2Vec	`model2vec:model-name`	Alpha

2. Configure Dataset Columns for Embeddings

2. 为数据集列配置Embedding

yaml

datasets:
  - from: postgres:documents
    name: docs
    acceleration:
      enabled: true
    columns:
      - name: content
        embeddings:
          - from: local_embeddings
            row_id: id
            chunking:
              enabled: true
              target_chunk_size: 512
              overlap_size: 64

yaml

datasets:
  - from: postgres:documents
    name: docs
    acceleration:
      enabled: true
    columns:
      - name: content
        embeddings:
          - from: local_embeddings
            row_id: id
            chunking:
              enabled: true
              target_chunk_size: 512
              overlap_size: 64

Embedding Methods

Embedding方式

Method	Description	When to Use
Accelerated	Precomputed and stored	Faster queries, frequently searched datasets
JIT (Just-in-Time)	Computed at query time (no acceleration)	Large or rarely queried datasets
Passthrough	Pre-existing embeddings used directly	Source already has `<col>_embedding` columns

方式	描述	适用场景
预计算加速	提前计算并存储	查询速度快、频繁搜索的数据集
即时计算(JIT)	查询时计算（无加速）	大型或极少查询的数据集
直接复用	直接使用已有的Embedding	数据源已包含 `<col>_embedding` 列

3. Query via HTTP API

3. 通过HTTP API查询

bash

curl -X POST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["docs"],
    "text": "cutting edge AI",
    "where": "author=\"jeadie\"",
    "additional_columns": ["title", "state"],
    "limit": 5
  }'

Field	Required	Description
`text`	Yes	Search text
`datasets`	No	Datasets to search (null = all searchable)
`additional_columns`	No	Extra columns to return
`where`	No	SQL filter predicate
`limit`	No	Max results per dataset

To retrieve full documents (not just chunks), include the embedding column name in

additional_columns

bash

curl -X POST http://localhost:8090/v1/search \
  -H 'Content-Type: application/json' \
  -d '{
    "datasets": ["docs"],
    "text": "cutting edge AI",
    "where": "author=\"jeadie\"",
    "additional_columns": ["title", "state"],
    "limit": 5
  }'

字段	是否必填	描述
`text`	是	搜索文本
`datasets`	否	要搜索的数据集（null = 所有可搜索数据集）
`additional_columns`	否	要返回的额外列
`where`	否	SQL过滤条件
`limit`	否	每个数据集的最大结果数

要检索完整文档（而非仅片段），需在

additional_columns

中包含Embedding列名。

4. Query via SQL UDTF

4. 通过SQL UDTF查询

sql

SELECT id, title, score
FROM vector_search(docs, 'cutting edge AI')
WHERE state = 'Open'
ORDER BY score DESC
LIMIT 5;

vector_search
signature:

sql

vector_search(
  table STRING,          -- Dataset name (required)
  query STRING,          -- Search text (required)
  col STRING,            -- Column (optional if single embedding column)
  limit INTEGER,         -- Max results (default: 1000)
  include_score BOOLEAN  -- Include score column (default: TRUE)
) RETURNS TABLE

Limitation:
vector_search
UDTF does not yet support chunked embedding columns. Use the HTTP API for chunked data.

sql

SELECT id, title, score
FROM vector_search(docs, 'cutting edge AI')
WHERE state = 'Open'
ORDER BY score DESC
LIMIT 5;

vector_search
签名:

sql

vector_search(
  table STRING,          -- 数据集名称（必填）
  query STRING,          -- 搜索文本（必填）
  col STRING,            -- 列名（若只有一个Embedding列则可选）
  limit INTEGER,         -- 最大结果数（默认：1000）
  include_score BOOLEAN  -- 是否包含得分列（默认：TRUE）
) RETURNS TABLE

限制：
vector_search
UDTF目前暂不支持分块的Embedding列。对于分块数据，请使用HTTP API。

Set Up Full-Text Search

配置全文搜索

Full-text search uses BM25 scoring (powered by Tantivy) for keyword relevance ranking.

全文搜索采用BM25评分（由Tantivy提供支持）进行关键词相关性排名。

1. Enable Indexing on Columns

1. 为列启用索引

yaml

datasets:
  - from: postgres:articles
    name: articles
    acceleration:
      enabled: true
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
      - name: body
        full_text_search:
          enabled: true

yaml

datasets:
  - from: postgres:articles
    name: articles
    acceleration:
      enabled: true
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
      - name: body
        full_text_search:
          enabled: true

2. Query via SQL UDTF

2. 通过SQL UDTF查询

sql

SELECT id, title, score
FROM text_search(articles, 'search keywords', body)
ORDER BY score DESC
LIMIT 5;

text_search
signature:

sql

text_search(
  table STRING,          -- Dataset name (required)
  query STRING,          -- Keywords/phrase (required)
  col STRING,            -- Column (required if multiple indexed columns)
  limit INTEGER,         -- Max results (default: 1000)
  include_score BOOLEAN  -- Include score column (default: TRUE)
) RETURNS TABLE

sql

SELECT id, title, score
FROM text_search(articles, 'search keywords', body)
ORDER BY score DESC
LIMIT 5;

text_search
签名:

sql

text_search(
  table STRING,          -- 数据集名称（必填）
  query STRING,          -- 关键词/短语（必填）
  col STRING,            -- 列名（若有多个索引列则必填）
  limit INTEGER         -- 最大结果数（默认：1000）
  include_score BOOLEAN  -- 是否包含得分列（默认：TRUE）
) RETURNS TABLE

Hybrid Search with RRF

结合RRF的混合搜索

Reciprocal Rank Fusion merges rankings from multiple search methods. Each query runs independently, then results are combined:

RRF Score = Σ(rank_weight / (k + rank))

Documents appearing across multiple result sets receive higher scores.

Reciprocal Rank Fusion会合并多种搜索方法的排名结果。每个查询独立运行，然后合并结果：

RRF 得分 = Σ(rank_weight / (k + rank))

在多个结果集中出现的文档会获得更高的得分。

Basic Hybrid Search

基础混合搜索

sql

SELECT id, title, content, fused_score
FROM rrf(
    vector_search(documents, 'machine learning algorithms'),
    text_search(documents, 'neural networks deep learning', content),
    join_key => 'id'
)
ORDER BY fused_score DESC
LIMIT 5;

sql

SELECT id, title, content, fused_score
FROM rrf(
    vector_search(documents, 'machine learning algorithms'),
    text_search(documents, 'neural networks deep learning', content),
    join_key => 'id'
)
ORDER BY fused_score DESC
LIMIT 5;

Weighted Ranking

加权排名

sql

SELECT fused_score, title, content
FROM rrf(
    text_search(posts, 'artificial intelligence', rank_weight => 50.0),
    vector_search(posts, 'AI machine learning', rank_weight => 200.0)
)
ORDER BY fused_score DESC
LIMIT 10;

sql

SELECT fused_score, title, content
FROM rrf(
    text_search(posts, 'artificial intelligence', rank_weight => 50.0),
    vector_search(posts, 'AI machine learning', rank_weight => 200.0)
)
ORDER BY fused_score DESC
LIMIT 10;

Recency-Boosted Search

时效性增强搜索

sql

-- Exponential decay (1-hour scale)
SELECT fused_score, title, created_at
FROM rrf(
    text_search(news, 'breaking news'),
    vector_search(news, 'latest updates'),
    time_column => 'created_at',
    recency_decay => 'exponential',
    decay_constant => 0.05,
    decay_scale_secs => 3600
)
ORDER BY fused_score DESC
LIMIT 10;

-- Linear decay (24-hour window)
SELECT fused_score, content
FROM rrf(
    text_search(posts, 'trending'),
    vector_search(posts, 'viral popular'),
    time_column => 'created_at',
    recency_decay => 'linear',
    decay_window_secs => 86400
)
ORDER BY fused_score DESC;

sql

-- 指数衰减（1小时尺度）
SELECT fused_score, title, created_at
FROM rrf(
    text_search(news, 'breaking news'),
    vector_search(news, 'latest updates'),
    time_column => 'created_at',
    recency_decay => 'exponential',
    decay_constant => 0.05,
    decay_scale_secs => 3600
)
ORDER BY fused_score DESC
LIMIT 10;

-- 线性衰减（24小时窗口）
SELECT fused_score, content
FROM rrf(
    text_search(posts, 'trending'),
    vector_search(posts, 'viral popular'),
    time_column => 'created_at',
    recency_decay => 'linear',
    decay_window_secs => 86400
)
ORDER BY fused_score DESC;

Cross-Language Search

跨语言搜索

sql

SELECT fused_score, text, langs
FROM rrf(
    vector_search(posts, 'ultimas noticias', rank_weight => 100),
    text_search(posts, 'news'),
    time_column => 'created_at',
    recency_decay => 'exponential',
    decay_constant => 0.05,
    decay_scale_secs => 3600
)
WHERE trim(text) != ''
ORDER BY fused_score DESC LIMIT 15;

sql

SELECT fused_score, text, langs
FROM rrf(
    vector_search(posts, 'ultimas noticias', rank_weight => 100),
    text_search(posts, 'news'),
    time_column => 'created_at',
    recency_decay => 'exponential',
    decay_constant => 0.05,
    decay_scale_secs => 3600
)
WHERE trim(text) != ''
ORDER BY fused_score DESC LIMIT 15;

rrf

Parameters

rrf

参数

Parameter	Type	Required	Description
`query_1` , `query_2` , ...	Search UDTF	Yes (2+)	`vector_search` or `text_search` calls (variadic)
`join_key`	String	No	Column for joining results (default: auto-hash)
`k`	Float	No	Smoothing parameter (default: 60.0, lower = more aggressive)
`time_column`	String	No	Timestamp column for recency boosting
`recency_decay`	String	No	`'exponential'` (default) or `'linear'`
`decay_constant`	Float	No	Rate for exponential decay (default: 0.01)
`decay_scale_secs`	Float	No	Time scale for exponential decay (default: 86400)
`decay_window_secs`	Float	No	Window for linear decay (default: 86400)
`rank_weight`	Float	No	Per-query weight (specified inside search calls)

参数	类型	是否必填	描述
`query_1` , `query_2` , ...	搜索UDTF	是（至少2个）	`vector_search` 或 `text_search` 调用（支持多个）
`join_key`	字符串	否	用于合并结果的列（默认：自动哈希）
`k`	浮点数	否	平滑参数（默认：60.0，值越小，排名越激进）
`time_column`	字符串	否	用于时效性增强的时间戳列
`recency_decay`	字符串	否	`'exponential'` （默认）或 `'linear'`
`decay_constant`	浮点数	否	指数衰减速率（默认：0.01）
`decay_scale_secs`	浮点数	否	指数衰减的时间尺度（默认：86400）
`decay_window_secs`	浮点数	否	线性衰减的时间窗口（默认：86400）
`rank_weight`	浮点数	否	每个查询的权重（在搜索调用中指定）

Vector Engines

向量引擎

Store and index embeddings at scale using dedicated vector engines:

yaml

datasets:
  - from: postgres:documents
    name: docs
    acceleration:
      enabled: true
    columns:
      - name: content
        embeddings:
          - from: embed_model
            row_id: id
        metadata:
          vectors: non-filterable
      - name: category
        metadata:
          vectors: filterable # enable filtering on this column
    vectors:
      enabled: true
      engine: s3_vectors
      params:
        s3_vectors_bucket: my-bucket
        s3_vectors_region: us-east-1

使用专用向量引擎大规模存储和索引Embedding：

yaml

datasets:
  - from: postgres:documents
    name: docs
    acceleration:
      enabled: true
    columns:
      - name: content
        embeddings:
          - from: embed_model
            row_id: id
        metadata:
          vectors: non-filterable
      - name: category
        metadata:
          vectors: filterable # 启用该列的过滤功能
    vectors:
      enabled: true
      engine: s3_vectors
      params:
        s3_vectors_bucket: my-bucket
        s3_vectors_region: us-east-1

Lexical Search (SQL)

词法搜索(SQL)

Standard SQL filtering:

sql

SELECT * FROM my_table WHERE column LIKE '%substring%';
SELECT * FROM my_table WHERE column = 'exact value';
SELECT * FROM my_table WHERE regexp_like(column, '^spice.*ai$');

标准SQL过滤：

sql

SELECT * FROM my_table WHERE column LIKE '%substring%';
SELECT * FROM my_table WHERE column = 'exact value';
SELECT * FROM my_table WHERE regexp_like(column, '^spice.*ai$');

CLI Search

CLI搜索

bash

spice search "cutting edge AI" --dataset docs --limit 5
spice search --cache-control no-cache "search terms"

bash

spice search "cutting edge AI" --dataset docs --limit 5
spice search --cache-control no-cache "search terms"

Complete Example

完整示例

yaml

version: v1
kind: Spicepod
name: search_app

secrets:
  - from: env
    name: env

embeddings:
  - name: embeddings
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

datasets:
  - from: file:articles.parquet
    name: articles
    acceleration:
      enabled: true
      engine: duckdb
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
      - name: content
        embeddings:
          - from: embeddings
        full_text_search:
          enabled: true

sql

SELECT id, title, content, fused_score
FROM rrf(
    vector_search(articles, 'machine learning best practices'),
    text_search(articles, 'neural network training', content),
    join_key => 'id',
    time_column => 'published_at',
    recency_decay => 'exponential',
    decay_constant => 0.01,
    decay_scale_secs => 86400
)
WHERE fused_score > 0.01
ORDER BY fused_score DESC
LIMIT 10;

yaml

version: v1
kind: Spicepod
name: search_app

secrets:
  - from: env
    name: env

embeddings:
  - name: embeddings
    from: huggingface:huggingface.co/sentence-transformers/all-MiniLM-L6-v2

datasets:
  - from: file:articles.parquet
    name: articles
    acceleration:
      enabled: true
      engine: duckdb
    columns:
      - name: title
        full_text_search:
          enabled: true
          row_id:
            - id
      - name: content
        embeddings:
          - from: embeddings
        full_text_search:
          enabled: true

sql

SELECT id, title, content, fused_score
FROM rrf(
    vector_search(articles, 'machine learning best practices'),
    text_search(articles, 'neural network training', content),
    join_key => 'id',
    time_column => 'published_at',
    recency_decay => 'exponential',
    decay_constant => 0.01,
    decay_scale_secs => 86400
)
WHERE fused_score > 0.01
ORDER BY fused_score DESC
LIMIT 10;

Troubleshooting

故障排除

Issue	Solution
`vector_search` returns no results	Verify embeddings configured on column and model is loaded
`text_search` returns no results	Check `full_text_search.enabled: true` ; acceleration must be enabled
Poor hybrid search relevance	Tune `rank_weight` per query and adjust `k`
Results missing recent content	Add `time_column` and `recency_decay` to RRF
Chunked vector search not working via SQL	Use HTTP API instead (UDTF doesn't support chunked columns yet)

问题	解决方案
`vector_search` 无结果返回	验证列上已配置Embedding且模型已加载
`text_search` 无结果返回	检查是否开启 `full_text_search.enabled: true` ；必须启用加速功能
混合搜索相关性差	调整每个查询的 `rank_weight` 和参数 `k`
结果缺少最新内容	在RRF中添加 `time_column` 和 `recency_decay` 参数
分块向量搜索无法通过SQL工作	改用HTTP API（UDTF暂不支持分块列）

spice-search

Original

Translation

Search Data

数据搜索

Search Methods

搜索方法

Set Up Vector Search

配置向量搜索

1. Define an Embedding Model

1. 定义Embedding模型

Supported Embedding Providers

支持的Embedding提供商

2. Configure Dataset Columns for Embeddings

2. 为数据集列配置Embedding

Embedding Methods

Embedding方式

3. Query via HTTP API

3. 通过HTTP API查询

4. Query via SQL UDTF

4. 通过SQL UDTF查询

Set Up Full-Text Search

配置全文搜索

1. Enable Indexing on Columns

1. 为列启用索引

2. Query via SQL UDTF

2. 通过SQL UDTF查询

Hybrid Search with RRF

结合RRF的混合搜索

Basic Hybrid Search

基础混合搜索

Weighted Ranking

加权排名

Recency-Boosted Search

时效性增强搜索

Cross-Language Search

跨语言搜索

rrf Parameters

rrf 参数

Vector Engines

向量引擎

Lexical Search (SQL)

词法搜索(SQL)

CLI Search

CLI搜索

Complete Example

完整示例

Troubleshooting

故障排除

Documentation

相关文档

`rrf`
Parameters

`rrf`
参数