
Context Optimization Techniques

Extend effective context capacity through compression, masking, caching, and partitioning. Well-applied optimization can increase effective capacity 2-3x without moving to a larger model.

Optimization Strategies


| Strategy | Token Reduction | Use Case |
|---|---|---|
| Compaction | 50-70% | Message history dominates |
| Observation Masking | 60-80% | Tool outputs dominate |
| KV-Cache Optimization | 70%+ cache hits | Stable workloads |
| Context Partitioning | Variable | Complex multi-task |

Compaction


Summarize context when approaching limits:

```python
if context_tokens / context_limit > 0.8:
    context = compact_context(context)
```

Priority for compression:

  1. Tool outputs → replace with summaries
  2. Old turns → summarize early conversation
  3. Retrieved docs → summarize if recent versions exist
  4. Never compress the system prompt

Summary generation by type:

  • Tool outputs: preserve findings, metrics, conclusions
  • Conversational: preserve decisions, commitments, context shifts
  • Documents: preserve key facts, remove supporting evidence
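A minimal sketch of the compaction pass described above. `compact_context` here takes a pluggable `summarize` callable (standing in for an LLM summarization call), and `keep_recent` is an illustrative threshold; the priority rules are the ones listed:

```python
def compact_context(messages, summarize, keep_recent=4):
    """Compact a message list: summarize old tool outputs and old turns,
    keep the most recent turns and the system prompt intact."""
    old = messages[:-keep_recent] if keep_recent else messages
    recent = messages[-keep_recent:] if keep_recent else []
    compacted = []
    for msg in old:
        if msg["role"] == "system":
            compacted.append(msg)  # never compress the system prompt
        elif msg["role"] == "tool":
            # Tool outputs go first: replace with a labeled summary
            compacted.append({"role": "tool",
                              "content": "[summary] " + summarize(msg["content"])})
        else:
            # Old conversational turns: summarize in place
            compacted.append({"role": msg["role"],
                              "content": summarize(msg["content"])})
    return compacted + recent
```

Triggering it only above the 0.8 utilization threshold (as in the snippet above) keeps the pass cheap on short conversations.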

Observation Masking


Tool outputs can account for 80%+ of tokens. Replace verbose outputs with references:

```python
if len(observation) > max_length:
    ref_id = store_observation(observation)
    return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"
```

Masking rules:

  • Never mask: current-task-critical content, the most recent turn, active reasoning
  • Consider masking: 3+ turns old, key points extractable, purpose served
  • Always mask: repeated outputs, boilerplate, already-summarized content

KV-Cache Optimization


Cache Key/Value tensors for requests with identical prefixes. Use cache-friendly ordering, with stable content first:

```python
context = [
    system_prompt,      # Cacheable
    tool_definitions,   # Cacheable
    reused_templates,   # Reusable
    unique_content      # Unique
]
```

**Design for cache stability**:

- Avoid dynamic content (timestamps)
- Use consistent formatting
- Keep structure stable across sessions
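A sketch of prefix-stable prompt assembly under these rules; all names are illustrative. The point is that the date is pinned once per session rather than interpolated per request, so the cacheable prefix stays byte-identical across requests:

```python
def build_context(system_prompt, tool_definitions, session_date, user_message):
    # Stable prefix: identical for every request in the session, so the
    # serving layer can reuse the KV-cache for these tokens.
    stable_prefix = "\n\n".join([
        system_prompt,
        tool_definitions,
        f"Session date: {session_date}",  # pinned per session, not per request
    ])
    # Unique suffix: only this part varies request to request.
    return stable_prefix + "\n\n" + user_message
```

Had the date been a per-request timestamp, the prefix would differ on every call and the cache hit rate would collapse to zero.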

Context Partitioning


Split work across sub-agents with isolated contexts:

```python
# Each sub-agent has a clean, focused context
results = await gather(
    research_agent.search("topic A"),
    research_agent.search("topic B"),
    research_agent.search("topic C")
)

# Coordinator synthesizes without carrying full context
synthesized = await coordinator.synthesize(results)
```
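A self-contained version of this fan-out pattern using `asyncio.gather`; `ResearchAgent` and `Coordinator` are illustrative stand-ins for real agent implementations:

```python
import asyncio

class ResearchAgent:
    """Illustrative sub-agent: each call works in its own isolated context."""
    async def search(self, topic):
        context = [f"task: research {topic}"]  # fresh context per task
        # A real agent would run a model loop over `context` here.
        return {"topic": topic, "finding": f"summary of {topic}"}

class Coordinator:
    """Sees only the sub-agents' results, never their full contexts."""
    async def synthesize(self, results):
        return "; ".join(r["finding"] for r in results)

async def main():
    agent = ResearchAgent()
    results = await asyncio.gather(
        agent.search("topic A"),
        agent.search("topic B"),
        agent.search("topic C"),
    )
    return await Coordinator().synthesize(results)

report = asyncio.run(main())
```

`asyncio.gather` returns results in call order, so the coordinator receives a stable, compact list regardless of how long each sub-task ran.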

Budget Management


Assign an explicit token budget to each context component:

```python
context_budget = {
    "system_prompt": 2000,
    "tool_definitions": 3000,
    "retrieved_docs": 10000,
    "message_history": 15000,
    "reserved_buffer": 2000
}
```

Monitor utilization and trigger optimization at 70-80%.
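A sketch of a budget monitor wired to those thresholds; token counts are assumed to come from some tokenizer hook, and the 70%/80% triggers mirror the levels above:

```python
def budget_status(usage, budget):
    """Compare per-component token usage against its budget.
    `usage` and `budget` both map component name -> token count."""
    total_used = sum(usage.values())
    total_budget = sum(budget.values())
    utilization = total_used / total_budget
    # Components individually over their allocation
    over = [name for name, used in usage.items() if used > budget.get(name, 0)]
    if utilization > 0.8:
        action = "compact"   # apply compaction now
    elif utilization > 0.7:
        action = "monitor"   # start watching closely
    else:
        action = "ok"
    return {"utilization": utilization, "over_budget": over, "action": action}
```

Flagging individual over-budget components tells you *which* optimization to reach for (masking for tool outputs, summarization for docs), not just *when*.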

When to Optimize


| Signal | Action |
|---|---|
| Utilization >70% | Start monitoring |
| Utilization >80% | Apply compaction |
| Quality degradation | Investigate cause |
| Tool outputs dominate | Observation masking |
| Docs dominate | Summarization/partitioning |

Performance Targets


  • Compaction: 50-70% reduction, <5% quality loss
  • Masking: 60-80% reduction in masked observations
  • Cache: 70%+ hit rate for stable workloads

Best Practices


  1. Measure before optimizing
  2. Apply compaction before masking
  3. Design for cache stability
  4. Partition before context becomes problematic
  5. Monitor effectiveness over time
  6. Balance token savings vs quality
  7. Test at production scale
  8. Implement graceful degradation