
Context Optimization Techniques

Extend effective context capacity through compression, masking, caching, and partitioning. Well-applied optimization can increase effective capacity 2-3x without moving to a larger model.

Optimization Strategies


| Strategy | Token Reduction | Use Case |
|---|---|---|
| Compaction | 50-70% | Message history dominates |
| Observation Masking | 60-80% | Tool outputs dominate |
| KV-Cache Optimization | 70%+ cache hits | Stable workloads |
| Context Partitioning | Variable | Complex multi-task |

Compaction


Summarize context when approaching limits:

```python
if context_tokens / context_limit > 0.8:
    context = compact_context(context)
```

Priority for compression:

  1. Tool outputs → replace with summaries
  2. Old turns → summarize early conversation
  3. Retrieved docs → summarize if recent versions exist
  4. Never compress the system prompt

Summary generation by type:

  • Tool outputs: preserve findings, metrics, conclusions
  • Conversational: preserve decisions, commitments, context shifts
  • Documents: preserve key facts, remove supporting evidence
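A minimal sketch of the compaction pass described above. `compact_context` here takes a pluggable `summarize` callable (standing in for an LLM summarization call), and `keep_recent` is an illustrative threshold; the priority rules are the ones listed:

```python
def compact_context(messages, summarize, keep_recent=4):
    """Compact a message list: summarize old tool outputs and old turns,
    keep the most recent turns and the system prompt intact."""
    old = messages[:-keep_recent] if keep_recent else messages
    recent = messages[-keep_recent:] if keep_recent else []
    compacted = []
    for msg in old:
        if msg["role"] == "system":
            compacted.append(msg)  # never compress the system prompt
        elif msg["role"] == "tool":
            # Tool outputs go first: replace with a labeled summary
            compacted.append({"role": "tool",
                              "content": "[summary] " + summarize(msg["content"])})
        else:
            # Old conversational turns: summarize in place
            compacted.append({"role": msg["role"],
                              "content": summarize(msg["content"])})
    return compacted + recent
```

Triggering it only above the 0.8 utilization threshold (as in the snippet above) keeps the pass cheap on short conversations.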

Observation Masking


Tool outputs can account for 80%+ of tokens. Replace verbose outputs with references:

```python
if len(observation) > max_length:
    ref_id = store_observation(observation)
    return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"
```

Masking rules:

  • Never mask: current-task-critical content, the most recent turn, active reasoning
  • Consider masking: 3+ turns old, key points extractable, purpose served
  • Always mask: repeated outputs, boilerplate, already-summarized content

KV-Cache Optimization


Cache Key/Value tensors for requests with identical prefixes. Use cache-friendly ordering, with stable content first:

```python
context = [
    system_prompt,      # Cacheable
    tool_definitions,   # Cacheable
    reused_templates,   # Reusable
    unique_content      # Unique
]
```

**Design for cache stability**:

- Avoid dynamic content (timestamps)
- Use consistent formatting
- Keep structure stable across sessions
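A sketch of prefix-stable prompt assembly under these rules; all names are illustrative. The point is that the date is pinned once per session rather than interpolated per request, so the cacheable prefix stays byte-identical across requests:

```python
def build_context(system_prompt, tool_definitions, session_date, user_message):
    # Stable prefix: identical for every request in the session, so the
    # serving layer can reuse the KV-cache for these tokens.
    stable_prefix = "\n\n".join([
        system_prompt,
        tool_definitions,
        f"Session date: {session_date}",  # pinned per session, not per request
    ])
    # Unique suffix: only this part varies request to request.
    return stable_prefix + "\n\n" + user_message
```

Had the date been a per-request timestamp, the prefix would differ on every call and the cache hit rate would collapse to zero.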

Context Partitioning


Split work across sub-agents with isolated contexts:

```python
# Each sub-agent has a clean, focused context
results = await gather(
    research_agent.search("topic A"),
    research_agent.search("topic B"),
    research_agent.search("topic C")
)

# Coordinator synthesizes without carrying full context
synthesized = await coordinator.synthesize(results)
```
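A self-contained version of this fan-out pattern using `asyncio.gather`; `ResearchAgent` and `Coordinator` are illustrative stand-ins for real agent implementations:

```python
import asyncio

class ResearchAgent:
    """Illustrative sub-agent: each call works in its own isolated context."""
    async def search(self, topic):
        context = [f"task: research {topic}"]  # fresh context per task
        # A real agent would run a model loop over `context` here.
        return {"topic": topic, "finding": f"summary of {topic}"}

class Coordinator:
    """Sees only the sub-agents' results, never their full contexts."""
    async def synthesize(self, results):
        return "; ".join(r["finding"] for r in results)

async def main():
    agent = ResearchAgent()
    results = await asyncio.gather(
        agent.search("topic A"),
        agent.search("topic B"),
        agent.search("topic C"),
    )
    return await Coordinator().synthesize(results)

report = asyncio.run(main())
```

`asyncio.gather` returns results in call order, so the coordinator receives a stable, compact list regardless of how long each sub-task ran.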

Budget Management


Assign an explicit token budget to each context component:

```python
context_budget = {
    "system_prompt": 2000,
    "tool_definitions": 3000,
    "retrieved_docs": 10000,
    "message_history": 15000,
    "reserved_buffer": 2000
}
```

Monitor utilization and trigger optimization at 70-80%.
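A sketch of a budget monitor wired to those thresholds; token counts are assumed to come from some tokenizer hook, and the 70%/80% triggers mirror the levels above:

```python
def budget_status(usage, budget):
    """Compare per-component token usage against its budget.
    `usage` and `budget` both map component name -> token count."""
    total_used = sum(usage.values())
    total_budget = sum(budget.values())
    utilization = total_used / total_budget
    # Components individually over their allocation
    over = [name for name, used in usage.items() if used > budget.get(name, 0)]
    if utilization > 0.8:
        action = "compact"   # apply compaction now
    elif utilization > 0.7:
        action = "monitor"   # start watching closely
    else:
        action = "ok"
    return {"utilization": utilization, "over_budget": over, "action": action}
```

Flagging individual over-budget components tells you *which* optimization to reach for (masking for tool outputs, summarization for docs), not just *when*.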

When to Optimize


| Signal | Action |
|---|---|
| Utilization >70% | Start monitoring |
| Utilization >80% | Apply compaction |
| Quality degradation | Investigate cause |
| Tool outputs dominate | Observation masking |
| Docs dominate | Summarization/partitioning |

Performance Targets


  • Compaction: 50-70% reduction, <5% quality loss
  • Masking: 60-80% reduction in masked observations
  • Cache: 70%+ hit rate for stable workloads

Best Practices


  1. Measure before optimizing
  2. Apply compaction before masking
  3. Design for cache stability
  4. Partition before context becomes problematic
  5. Monitor effectiveness over time
  6. Balance token savings vs quality
  7. Test at production scale
  8. Implement graceful degradation