
Semantic Caching


Cache LLM responses by semantic similarity.
Redis 8 Note: Redis 8+ includes the Search, JSON, TimeSeries, and Bloom modules built in; no separate Redis Stack installation is required. Use `redis:8` in Docker or any Redis 8+ deployment.

Cache Hierarchy

缓存层级

```
Request → L1 (Exact) → L2 (Semantic) → L3 (Prompt) → L4 (LLM)
           ~1ms         ~10ms           ~2s          ~3s
         100% save    100% save       90% save    Full cost
```

Redis Semantic Cache


```python
import json
import struct
import time

from redis import Redis
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery

# embed_text() and hash_content() are application-provided helpers.

class SemanticCacheService:
    def __init__(self, redis_url: str, threshold: float = 0.92):
        self.client = Redis.from_url(redis_url)
        self.threshold = threshold
        # Attach to a pre-created index over the cache:* hash keys
        self.index = SearchIndex.from_existing("llm_cache", redis_url=redis_url)

    async def get(self, content: str, agent_type: str) -> dict | None:
        embedding = await embed_text(content[:2000])

        query = VectorQuery(
            vector=embedding,
            vector_field_name="embedding",
            return_fields=["response", "vector_distance"],
            filter_expression=f"@agent_type:{{{agent_type}}}",
            num_results=1,
        )

        results = self.index.query(query)

        if results:
            distance = float(results[0].get("vector_distance", 1.0))
            # Cosine similarity = 1 - distance, so a hit needs
            # distance <= 1 - threshold
            if distance <= (1 - self.threshold):
                return json.loads(results[0]["response"])

        return None

    async def set(self, content: str, response: dict, agent_type: str):
        embedding = await embed_text(content[:2000])
        key = f"cache:{agent_type}:{hash_content(content)}"

        self.client.hset(key, mapping={
            "agent_type": agent_type,
            # Vector fields must be stored as raw float32 bytes for Redis search
            "embedding": struct.pack(f"{len(embedding)}f", *embedding),
            "response": json.dumps(response),
            "created_at": time.time(),
        })
        self.client.expire(key, 86400)  # 24h TTL
```
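The service above queries a pre-created index named `llm_cache`. A minimal sketch of what that schema might look like as a RedisVL schema dict (the field names mirror the hash fields the service writes; the index name and 1536 dims for text-embedding-3-small are assumptions):

```python
# Hypothetical schema for the "llm_cache" index assumed above.
# "prefix" matches the cache:{agent_type}:{hash} key pattern;
# "dims" must match your embedding model (1536 for text-embedding-3-small).
schema = {
    "index": {"name": "llm_cache", "prefix": "cache:"},
    "fields": [
        {"name": "agent_type", "type": "tag"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "dims": 1536,
                "algorithm": "hnsw",
                "distance_metric": "cosine",
                "datatype": "float32",
            },
        },
    ],
}
# With RedisVL this would be created once at deploy time, e.g.:
# SearchIndex.from_dict(schema, redis_url=redis_url).create()
```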

Similarity Thresholds


| Threshold | Distance | Use Case |
|-----------|----------|----------|
| 0.98-1.00 | 0.00-0.02 | Nearly identical |
| 0.95-0.98 | 0.02-0.05 | Very similar |
| 0.92-0.95 | 0.05-0.08 | Similar (default) |
| 0.85-0.92 | 0.08-0.15 | Moderately similar |

Multi-Level Lookup


```python
# Assumes module-level helpers: lru_cache (bounded in-memory map),
# semantic_cache (SemanticCacheService), llm (LLM client), hash_content().
async def get_llm_response(query: str, agent_type: str) -> dict:
    # L1: Exact match (in-memory LRU)
    cache_key = hash_content(query)
    if cache_key in lru_cache:
        return lru_cache[cache_key]

    # L2: Semantic similarity (Redis)
    similar = await semantic_cache.get(query, agent_type)
    if similar:
        lru_cache[cache_key] = similar  # Promote to L1
        return similar

    # L3/L4: LLM call with prompt caching
    response = await llm.generate(query)

    # Store in caches
    await semantic_cache.set(query, response, agent_type)
    lru_cache[cache_key] = response

    return response
```
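The `lru_cache` referenced above is assumed to be a bounded in-process map with least-recently-used eviction. One way to sketch it with only the standard library (the class name and default size are assumptions, not part of the original):

```python
from collections import OrderedDict

class BoundedLRU:
    """Minimal L1 cache: recently used entries stay, oldest are evicted."""

    def __init__(self, max_entries: int = 1000):
        self._data: OrderedDict = OrderedDict()
        self.max_entries = max_entries

    def __contains__(self, key) -> bool:
        return key in self._data

    def __getitem__(self, key):
        self._data.move_to_end(key)  # reads refresh recency
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used

lru_cache = BoundedLRU(max_entries=2)
lru_cache["a"] = 1
lru_cache["b"] = 2
_ = lru_cache["a"]       # touch "a" so "b" becomes the eviction candidate
lru_cache["c"] = 3       # evicts "b"
print("b" in lru_cache)  # False
print("a" in lru_cache)  # True
```

Because L2 hits are written back through `__setitem__`, promotion to L1 automatically refreshes recency.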

Redis 8.4+ Hybrid Search (FT.HYBRID)


Redis 8.4 introduces native hybrid search combining semantic (vector) and exact (keyword) matching in a single query. This is ideal for caches that need both similarity and metadata filtering.
```python
# Redis 8.4 native hybrid search (FT.HYBRID)
result = redis.execute_command(
    "FT.HYBRID", "llm_cache",
    "SEARCH", f"@agent_type:{{{agent_type}}}",
    "VSIM", "@embedding", "$query_vec",
    "KNN", "2", "K", "5",
    "COMBINE", "RRF", "4", "CONSTANT", "60",
    "PARAMS", "2", "query_vec", embedding_bytes,
)
```

**Hybrid Search Benefits:**
- Single query for keyword + vector matching
- RRF (Reciprocal Rank Fusion) combines scores intelligently
- Better results than sequential filtering
- BM25STD is now the default scorer for keyword matching

**When to Use Hybrid:**
- Filtering by metadata (agent_type, tenant, category) + semantic similarity
- Multi-tenant caches where exact tenant match is required
- Combining keyword search with vector similarity
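RRF needs only each document's rank in each result list, not its raw scores. A minimal sketch of the fusion that the `COMBINE RRF` clause performs (the function name and example doc IDs are illustrative; the constant 60 matches the command above):

```python
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents appearing near the top of multiple lists rank highest
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc2", "doc1", "doc4"]  # e.g. BM25STD ranking
vector_hits = ["doc1", "doc3", "doc2"]   # e.g. KNN ranking
print(rrf_fuse([keyword_hits, vector_hits]))
# → ['doc1', 'doc2', 'doc3', 'doc4']
```

doc1 wins because it ranks high in both lists, even though it tops neither; that is the "combines scores intelligently" behavior noted above.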

Key Decisions


| Decision | Recommendation |
|----------|----------------|
| Threshold | Start at 0.92, tune based on hit rate |
| TTL | 24h for production |
| Embedding | text-embedding-3-small (fast) |
| L1 size | 1000-10000 entries |
| Scorer | BM25STD (Redis 8+ default) |
| Hybrid | Use FT.HYBRID for metadata + vector queries |

Common Mistakes


  • Threshold too low (false positives)
  • No cache warming (cold start)
  • Missing metadata filters
  • Not promoting L2 hits to L1

Related Skills


  • prompt-caching
    - Provider-native caching
  • embeddings
    - Vector generation
  • cache-cost-tracking
    - Langfuse integration

Capability Details


redis-vector-cache


Keywords: redis, vector, embedding, similarity, cache
Solves:
  • Cache LLM responses by semantic similarity
  • Reduce API costs with smart caching
  • Implement multi-level cache hierarchy

similarity-threshold


Keywords: threshold, similarity, tuning, cosine
Solves:
  • Set appropriate similarity threshold
  • Balance hit rate vs accuracy
  • Tune cache performance

orchestkit-integration


Keywords: orchestkit, integration, roi, cost-savings
Solves:
  • Integrate caching with OrchestKit
  • Calculate ROI for caching
  • Production implementation guide

cache-service


Keywords: service, implementation, template, production
Solves:
  • Production cache service template
  • Complete implementation example
  • Redis integration code

hybrid-search


Keywords: hybrid, ft.hybrid, bm25, rrf, keyword, metadata, filter
Solves:
  • Combine semantic and keyword search
  • Filter cache by metadata with vector similarity
  • Use Redis 8.4 FT.HYBRID command
  • BM25STD scoring for keyword matching