prompt-caching

Prompt Caching Skill

Leverage Anthropic's prompt caching to dramatically reduce latency and costs for repeated prompts.

When to Use This Skill

  • RAG systems with large static documents
  • Multi-turn conversations with long instructions
  • Code analysis with large codebase context
  • Batch processing with shared prefixes
  • Document analysis and summarization

Core Concepts

Cache Control Placement

python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with access to a large knowledge base...",
            "cache_control": {"type": "ephemeral"}  # Cache this content
        }
    ],
    messages=[{"role": "user", "content": "What is...?"}]
)

Cache Hierarchy

Cache breakpoints are checked in this order:
  1. Tools - Tool definitions cached first
  2. System - System prompts cached second
  3. Messages - Conversation history cached last

TTL Options

| TTL | Write Cost | Read Cost | Use Case |
|---|---|---|---|
| 5 minutes (default) | 1.25x base | 0.1x base | Interactive sessions |
| 1 hour | 2.0x base | 0.1x base | Batch processing, stable docs |
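To see how the write premium amortizes, a back-of-the-envelope comparison — the $0.003/1K rate, prefix size, and request count are illustrative assumptions, not current pricing:

python
```python
# Illustrative cost comparison for N requests sharing a 10,000-token prefix.
# Rates are made up for the example; check current pricing.
base_rate = 0.003 / 1000  # $ per token
prefix_tokens = 10_000
n_requests = 100

# No caching: every request pays full price for the prefix
uncached = n_requests * prefix_tokens * base_rate

# 1-hour TTL: one 2.0x write, then 0.1x reads for the rest
cached_1h = (prefix_tokens * base_rate * 2.0
             + (n_requests - 1) * prefix_tokens * base_rate * 0.1)

print(f"uncached: ${uncached:.2f}, cached (1h TTL): ${cached_1h:.2f}")
# prints: uncached: $3.00, cached (1h TTL): $0.36
```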

Cache Requirements

  • Minimum tokens: 1024-4096 (varies by model)
  • Maximum breakpoints: 4 per request
  • Supported models: Claude Opus 4.5, Sonnet 4.5, Haiku 4.5
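Because a prefix below the minimum is simply not cached (with no error), a quick pre-flight size check can save debugging time. This uses the rough ~4-characters-per-token heuristic for English text — an assumption, not a real tokenizer:

python
```python
# Rough pre-flight check: ~4 characters per token is a common English-text
# heuristic, not an exact count; use a real tokenizer for precision.
def likely_cacheable(text: str, min_tokens: int = 1024) -> bool:
    estimated_tokens = len(text) / 4
    return estimated_tokens >= min_tokens

print(likely_cacheable("short prompt"))  # small prefix won't cache
print(likely_cacheable("x" * 10_000))    # ~2,500 tokens, likely cacheable
```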

Implementation Patterns

Pattern 1: Single Breakpoint (Recommended)

python
# Best for: Document analysis, Q&A with static context
system = [
    {
        "type": "text",
        "text": large_document_content,
        "cache_control": {"type": "ephemeral"}  # Single breakpoint at end
    }
]

Pattern 2: Multi-Turn Conversation

python
# Cache grows with the conversation. Note: cache_control goes on a
# content block, not on the message dict itself.
messages = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Follow-up question",
                "cache_control": {"type": "ephemeral"}  # Cache the conversation up to here
            }
        ]
    }
]

Pattern 3: RAG with Multiple Breakpoints

python
system = [
    {
        "type": "text",
        "text": "Tool definitions and instructions",
        "cache_control": {"type": "ephemeral"}  # Breakpoint 1: Tools
    },
    {
        "type": "text",
        "text": retrieved_documents,
        "cache_control": {"type": "ephemeral"}  # Breakpoint 2: Documents
    }
]

Pattern 4: Batch Processing with 1-Hour TTL

python
# Warm the cache before the batch
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": shared_context,
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }],
    messages=[{"role": "user", "content": "Initialize cache"}]
)

# Now run the batch - all requests hit the cache
for item in batch_items:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": shared_context,
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }],
        messages=[{"role": "user", "content": item}]
    )

Performance Monitoring

Check Cache Usage

python
response = client.messages.create(...)

# Monitor these fields
cache_write = response.usage.cache_creation_input_tokens  # New cache written
cache_read = response.usage.cache_read_input_tokens       # Cache hit!
uncached = response.usage.input_tokens                    # Tokens after the breakpoint

print(f"Cache hit rate: {cache_read / (cache_read + cache_write + uncached) * 100:.1f}%")

Cost Calculation

python
def calculate_cost(usage, model="claude-sonnet-4-20250514"):
    # Example rates (check current pricing)
    base_input_rate = 0.003  # per 1K tokens

    write_cost = (usage.cache_creation_input_tokens / 1000) * base_input_rate * 1.25
    read_cost = (usage.cache_read_input_tokens / 1000) * base_input_rate * 0.1
    uncached_cost = (usage.input_tokens / 1000) * base_input_rate

    return write_cost + read_cost + uncached_cost

Cache Invalidation

Changes that invalidate the cache:

| Change | Impact |
|---|---|
| Tool definitions | Entire cache invalidated |
| System prompt | System + messages invalidated |
| Any content before a breakpoint | That breakpoint and all later ones invalidated |
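One way to catch accidental invalidation in practice is to fingerprint everything before the breakpoint and log it per request; an unexpected fingerprint change explains an unexpected cache rewrite. A debugging sketch — the `prefix_fingerprint` helper is hypothetical, not part of the SDK (`sort_keys` keeps the JSON serialization deterministic):

python
```python
import hashlib
import json

# Debugging sketch: hash everything before the breakpoint so an unexpected
# change (and therefore a cache miss) is easy to spot in logs.
def prefix_fingerprint(tools, system) -> str:
    payload = json.dumps({"tools": tools, "system": system}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

system = [{"type": "text", "text": "stable instructions",
           "cache_control": {"type": "ephemeral"}}]

fp1 = prefix_fingerprint([], system)
fp2 = prefix_fingerprint([], system)
assert fp1 == fp2  # identical prefix -> same fingerprint -> cache hit expected

system[0]["text"] = "stable instructions v2"
fp3 = prefix_fingerprint([], system)
assert fp3 != fp1  # any edit before the breakpoint means a cache rewrite
```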

Best Practices

DO:

  • Place breakpoint at END of static content
  • Keep tools/instructions stable across requests
  • Use 1-hour TTL for batch processing
  • Monitor cache_read_input_tokens for savings

DON'T:

  • Place breakpoint in middle of dynamic content
  • Change tool definitions frequently
  • Expect cache to work with <1024 tokens
  • Ignore the 20-block lookback limit

Integration with Extended Thinking

python
# Cache + extended thinking
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    system=[{
        "type": "text",
        "text": large_context,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Analyze this..."}]
)

See Also

  • [[llm-integration]] - Claude API basics
  • [[extended-thinking]] - Deep reasoning
  • [[batch-processing]] - Bulk processing