redis-semantic-cache

Redis Semantic Cache

Semantic caching for LLM responses with Redis Cloud's LangCache service. Stores prompts as embeddings; subsequent semantically-similar prompts return the cached response without re-calling the model.
LangCache is currently in preview on Redis Cloud. Features and behavior may change.

When to apply

  • Wrapping an LLM call (OpenAI, Anthropic, etc.) with a cache layer to cut cost and latency.
  • Caching RAG answers, classification outputs, or any deterministic LLM workload.
  • Tuning the precision/hit-rate trade-off for a semantic cache.
  • Splitting one application's LLM workloads across multiple cache instances.

1. The cache-aside flow

LangCache fits in front of any LLM call as a standard cache-aside pattern:
  1. Send the user's prompt to LangCache's `search`.
  2. Cache hit: return the stored response directly.
  3. Cache miss: call the LLM, then `set` the response so future similar prompts hit.
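python
import os

from langcache import LangCache

# Connection details come from the environment (HOST, CACHE_ID, API_KEY).
lang_cache = LangCache(
    server_url=f"https://{os.getenv('HOST')}",
    cache_id=os.getenv("CACHE_ID"),
    api_key=os.getenv("API_KEY"),
)

prompt = "What is Redis?"

# 1. Look for a semantically similar cached prompt.
result = lang_cache.search(prompt=prompt, similarity_threshold=0.9)
if result:
    # 2. Cache hit: reuse the stored response.
    response = result[0]["response"]
else:
    # 3. Cache miss: call the model, then store the answer.
    # `llm` stands in for your model client; it is not defined here.
    response = llm.generate(prompt)
    lang_cache.set(prompt=prompt, response=response)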
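The three steps of the flow can also be wrapped in a single helper so that callers never touch the cache directly. What follows is a minimal sketch, not part of the LangCache SDK: `cached_completion`, `InMemoryCache`, and `fake_llm` are hypothetical names, and the stub cache matches identical prompt text only (it ignores the threshold) so the flow can run offline. It assumes `search` returns a list of entries with a `"response"` key, mirroring the snippet above.

```python
from __future__ import annotations

from typing import Callable


class InMemoryCache:
    """Hypothetical stand-in for a LangCache client (exact-match only)."""

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}

    def search(self, prompt: str, similarity_threshold: float = 0.9) -> list[dict]:
        # A real semantic cache compares embeddings; this stub ignores the
        # threshold and only matches identical prompt text.
        if prompt in self._entries:
            return [{"prompt": prompt, "response": self._entries[prompt]}]
        return []

    def set(self, prompt: str, response: str) -> None:
        self._entries[prompt] = response


def cached_completion(
    cache,
    generate: Callable[[str], str],
    prompt: str,
    similarity_threshold: float = 0.9,
) -> tuple[str, bool]:
    """Return (response, was_cache_hit) using the cache-aside flow."""
    hits = cache.search(prompt=prompt, similarity_threshold=similarity_threshold)
    if hits:
        return hits[0]["response"], True
    # Miss: call the model, then store the answer for future lookups.
    response = generate(prompt)
    cache.set(prompt=prompt, response=response)
    return response, False


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    """Hypothetical model call that records how often it is invoked."""
    calls.append(prompt)
    return f"answer to: {prompt}"


cache = InMemoryCache()
first = cached_completion(cache, fake_llm, "What is Redis?")
second = cached_completion(cache, fake_llm, "What is Redis?")
```

In this sketch the second call is served from the cache, so `fake_llm` runs only once. Swapping `InMemoryCache` for a real client should only require that its `search` and `set` accept the same keyword arguments.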
The same operations are available via REST (`POST /v1/caches/{cacheId}/entries/search` and `POST /v1/caches/{cacheId}/entries`) when an SDK isn't an option.
See references/langcache-usage.md for full SDK + REST samples and attribute-based storage.
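As a rough illustration of the REST route, the sketch below builds (but does not send) the two requests using only the standard library. Only the paths come from the text above. The bearer-token header and the JSON field names (`prompt`, `response`, `similarityThreshold`) are assumptions that should be checked against the LangCache API reference, and the function names are hypothetical.

```python
import json
import urllib.request


def _post(url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request; nothing is sent over the network here."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def build_search_request(host, cache_id, api_key, prompt, similarity_threshold=0.9):
    """Request for POST /v1/caches/{cacheId}/entries/search."""
    url = f"https://{host}/v1/caches/{cache_id}/entries/search"
    # Assumed body field names; verify before use.
    payload = {"prompt": prompt, "similarityThreshold": similarity_threshold}
    return _post(url, api_key, payload)


def build_set_request(host, cache_id, api_key, prompt, response):
    """Request for POST /v1/caches/{cacheId}/entries."""
    url = f"https://{host}/v1/caches/{cache_id}/entries"
    payload = {"prompt": prompt, "response": response}  # assumed field names
    return _post(url, api_key, payload)


search_req = build_search_request("example.invalid", "my-cache", "secret", "What is Redis?")
set_req = build_set_request("example.invalid", "my-cache", "secret", "What is Redis?", "A data store.")
```

To actually issue a call, pass the request to `urllib.request.urlopen` and parse the JSON body of the reply.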

2. Tune the similarity threshold

The threshold controls how close (in embedding cosine similarity) a new prompt must be to a cached one to count as a hit. Higher = stricter match, fewer false positives. Lower = more hits, more risk of returning an off-topic answer.
| Threshold | Behavior | Use when |
|---|---|---|
| 0.95+ | Near-exact match required | Customer-facing answers where wrong responses are costly |
| 0.9 | Balanced default | Most workloads; start here |
| 0.8 | Loose semantic match | Internal tools, exploratory queries, FAQ deduplication |

python
# Stricter: fewer false positives
result = lang_cache.search(prompt="What is Redis?", similarity_threshold=0.95)

# Looser: higher hit rate
result = lang_cache.search(prompt="What is Redis?", similarity_threshold=0.8)

Adjust by watching the actual cache-hit rate and spot-checking that returned answers are still relevant.

See [references/best-practices.md](references/best-practices.md).
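One way to watch the hit rate before changing a threshold is to log the best similarity score for each lookup and replay those scores against candidate thresholds. This is a small sketch under that assumption: `hit_rate` is a hypothetical helper, and the scores are made up for illustration rather than measured.

```python
from __future__ import annotations


def hit_rate(best_similarities: list[float], threshold: float) -> float:
    """Fraction of lookups whose best match meets the threshold."""
    if not best_similarities:
        return 0.0
    hits = sum(1 for score in best_similarities if score >= threshold)
    return hits / len(best_similarities)


# Illustrative scores only, not measured data.
logged = [0.97, 0.93, 0.91, 0.88, 0.84, 0.79, 0.62, 0.55]

# Replay the same log against the three thresholds from the table.
rates = {t: hit_rate(logged, t) for t in (0.95, 0.9, 0.8)}
```

A replay like this only shows how many lookups would hit. Whether the extra hits at a looser threshold are still relevant has to be checked by reading a sample of them.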

3. Separate caches per task type

Different LLM workloads should not share one cache. A "code question" prompt is semantically close to other code questions but unrelated to a password-reset support query, and mixing the two in one cache risks serving an answer cached for the wrong task.
python
support_cache = LangCache(server_url=..., cache_id="support-cache-id", api_key=...)
code_cache    = LangCache(server_url=..., cache_id="code-cache-id",    api_key=...)
Create distinct cache IDs in Redis Cloud per task, and route each call to the right one. As a finer-grained alternative, store and search with custom attributes (e.g. `{"category": "database"}`) to keep tasks in the same cache but isolated by attribute filter, which is useful when the same prompt format spans subtopics.
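Both isolation strategies can be sketched with a stub client so the behaviour is visible without a live cache. Everything named here is hypothetical: `StubCache` matches identical prompt text only, `cache_for` is an illustrative routing helper, and the `attributes` keyword is an assumption about how the SDK accepts custom attributes.

```python
from __future__ import annotations


class StubCache:
    """Hypothetical stand-in for a LangCache client (exact-match only)."""

    def __init__(self, cache_id: str) -> None:
        self.cache_id = cache_id
        self._entries: list[dict] = []

    def set(self, prompt: str, response: str, attributes: dict | None = None) -> None:
        self._entries.append(
            {"prompt": prompt, "response": response, "attributes": attributes or {}}
        )

    def search(self, prompt: str, attributes: dict | None = None) -> list[dict]:
        # Return entries whose prompt matches and whose stored attributes
        # contain every requested key/value pair.
        wanted = attributes or {}
        return [
            entry
            for entry in self._entries
            if entry["prompt"] == prompt
            and all(entry["attributes"].get(k) == v for k, v in wanted.items())
        ]


# Option 1: one cache per task type, selected through a routing table.
caches = {
    "support": StubCache("support-cache-id"),
    "code": StubCache("code-cache-id"),
}


def cache_for(task: str) -> StubCache:
    return caches[task]


cache_for("code").set("How do I reset it?", "placeholder code answer")
same_task = cache_for("code").search("How do I reset it?")
cross_task = cache_for("support").search("How do I reset it?")

# Option 2: one shared cache, isolated by an attribute filter.
shared = StubCache("shared-cache-id")
shared.set("What is a key?", "placeholder database answer", {"category": "database"})
filtered_hit = shared.search("What is a key?", {"category": "database"})
filtered_miss = shared.search("What is a key?", {"category": "security"})
```

In the sketch, the same prompt text misses in the support cache and under the `security` filter, which is the isolation the section describes. Separate caches keep the routing decision in application code, while attributes keep it in the query.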

References
