prompt-caching

Prompt Caching Skill

Leverage Anthropic's prompt caching to dramatically reduce latency and costs for repeated prompts.

When to Use This Skill

  • RAG systems with large static documents
  • Multi-turn conversations with long instructions
  • Code analysis with large codebase context
  • Batch processing with shared prefixes
  • Document analysis and summarization

Core Concepts

Cache Control Placement

python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with access to a large knowledge base...",
            "cache_control": {"type": "ephemeral"}  # Cache this content
        }
    ],
    messages=[{"role": "user", "content": "What is...?"}]
)

Cache Hierarchy

Cache breakpoints are checked in this order:
  1. Tools - Tool definitions cached first
  2. System - System prompts cached second
  3. Messages - Conversation history cached last

TTL Options

| TTL | Write Cost | Read Cost | Use Case |
|---|---|---|---|
| 5 minutes (default) | 1.25x base | 0.1x base | Interactive sessions |
| 1 hour | 2.0x base | 0.1x base | Batch processing, stable docs |
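To see how the write premium amortizes, a back-of-the-envelope comparison — the $0.003/1K rate, prefix size, and request count are illustrative assumptions, not current pricing:

python
```python
# Illustrative cost comparison for N requests sharing a 10,000-token prefix.
# Rates are made up for the example; check current pricing.
base_rate = 0.003 / 1000  # $ per token
prefix_tokens = 10_000
n_requests = 100

# No caching: every request pays full price for the prefix
uncached = n_requests * prefix_tokens * base_rate

# 1-hour TTL: one 2.0x write, then 0.1x reads for the rest
cached_1h = (prefix_tokens * base_rate * 2.0
             + (n_requests - 1) * prefix_tokens * base_rate * 0.1)

print(f"uncached: ${uncached:.2f}, cached (1h TTL): ${cached_1h:.2f}")
# prints: uncached: $3.00, cached (1h TTL): $0.36
```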

Cache Requirements

  • Minimum tokens: 1024-4096 (varies by model)
  • Maximum breakpoints: 4 per request
  • Supported models: Claude Opus 4.5, Sonnet 4.5, Haiku 4.5
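Because a prefix below the minimum is simply not cached (with no error), a quick pre-flight size check can save debugging time. This uses the rough ~4-characters-per-token heuristic for English text — an assumption, not a real tokenizer:

python
```python
# Rough pre-flight check: ~4 characters per token is a common English-text
# heuristic, not an exact count; use a real tokenizer for precision.
def likely_cacheable(text: str, min_tokens: int = 1024) -> bool:
    estimated_tokens = len(text) / 4
    return estimated_tokens >= min_tokens

print(likely_cacheable("short prompt"))  # small prefix won't cache
print(likely_cacheable("x" * 10_000))    # ~2,500 tokens, likely cacheable
```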

Implementation Patterns

Pattern 1: Single Breakpoint (Recommended)

python
# Best for: Document analysis, Q&A with static context
system = [
    {
        "type": "text",
        "text": large_document_content,
        "cache_control": {"type": "ephemeral"}  # Single breakpoint at end
    }
]

Pattern 2: Multi-Turn Conversation

python
# Cache grows with the conversation. Note: cache_control goes on a
# content block, not on the message dict itself.
messages = [
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Follow-up question",
                "cache_control": {"type": "ephemeral"}  # Cache the conversation up to here
            }
        ]
    }
]

Pattern 3: RAG with Multiple Breakpoints

python
system = [
    {
        "type": "text",
        "text": "Tool definitions and instructions",
        "cache_control": {"type": "ephemeral"}  # Breakpoint 1: Tools
    },
    {
        "type": "text",
        "text": retrieved_documents,
        "cache_control": {"type": "ephemeral"}  # Breakpoint 2: Documents
    }
]

Pattern 4: Batch Processing with 1-Hour TTL

python
# Warm the cache before the batch
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    system=[{
        "type": "text",
        "text": shared_context,
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }],
    messages=[{"role": "user", "content": "Initialize cache"}]
)

# Now run the batch - all requests hit the cache
for item in batch_items:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": shared_context,
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }],
        messages=[{"role": "user", "content": item}]
    )

Performance Monitoring

Check Cache Usage

python
response = client.messages.create(...)

# Monitor these fields
cache_write = response.usage.cache_creation_input_tokens  # New cache written
cache_read = response.usage.cache_read_input_tokens       # Cache hit!
uncached = response.usage.input_tokens                    # Tokens after the breakpoint

print(f"Cache hit rate: {cache_read / (cache_read + cache_write + uncached) * 100:.1f}%")

Cost Calculation

python
def calculate_cost(usage, model="claude-sonnet-4-20250514"):
    # Example rates (check current pricing)
    base_input_rate = 0.003  # per 1K tokens

    write_cost = (usage.cache_creation_input_tokens / 1000) * base_input_rate * 1.25
    read_cost = (usage.cache_read_input_tokens / 1000) * base_input_rate * 0.1
    uncached_cost = (usage.input_tokens / 1000) * base_input_rate

    return write_cost + read_cost + uncached_cost

Cache Invalidation

Changes that invalidate the cache:

| Change | Impact |
|---|---|
| Tool definitions | Entire cache invalidated |
| System prompt | System + messages invalidated |
| Any content before a breakpoint | That breakpoint and all later ones invalidated |
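One way to catch accidental invalidation in practice is to fingerprint everything before the breakpoint and log it per request; an unexpected fingerprint change explains an unexpected cache rewrite. A debugging sketch — the `prefix_fingerprint` helper is hypothetical, not part of the SDK (`sort_keys` keeps the JSON serialization deterministic):

python
```python
import hashlib
import json

# Debugging sketch: hash everything before the breakpoint so an unexpected
# change (and therefore a cache miss) is easy to spot in logs.
def prefix_fingerprint(tools, system) -> str:
    payload = json.dumps({"tools": tools, "system": system}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

system = [{"type": "text", "text": "stable instructions",
           "cache_control": {"type": "ephemeral"}}]

fp1 = prefix_fingerprint([], system)
fp2 = prefix_fingerprint([], system)
assert fp1 == fp2  # identical prefix -> same fingerprint -> cache hit expected

system[0]["text"] = "stable instructions v2"
fp3 = prefix_fingerprint([], system)
assert fp3 != fp1  # any edit before the breakpoint means a cache rewrite
```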

Best Practices

DO:

  • Place breakpoint at END of static content
  • Keep tools/instructions stable across requests
  • Use 1-hour TTL for batch processing
  • Monitor cache_read_input_tokens for savings

DON'T:

  • Place breakpoint in middle of dynamic content
  • Change tool definitions frequently
  • Expect cache to work with <1024 tokens
  • Ignore the 20-block lookback limit

Integration with Extended Thinking

python
# Cache + extended thinking
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    system=[{
        "type": "text",
        "text": large_context,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Analyze this..."}]
)

See Also

  • [[llm-integration]] - Claude API basics
  • [[extended-thinking]] - Deep reasoning
  • [[batch-processing]] - Bulk processing