Context Optimization Techniques
Extend effective context capacity through compression, masking, caching, and partitioning. Effective optimization can increase effective context capacity by 2-3x without moving to a larger model.
Optimization Strategies
| Strategy | Token Reduction | Use Case |
|---|---|---|
| Compaction | 50-70% | Message history dominates |
| Observation Masking | 60-80% | Tool outputs dominate |
| KV-Cache Optimization | 70%+ cache hits | Stable workloads |
| Context Partitioning | Variable | Complex multi-task |
Compaction
Summarize context when approaching limits:

```python
if context_tokens / context_limit > 0.8:
    context = compact_context(context)
```

Priority for compression:
- Tool outputs → replace with summaries
- Old turns → summarize early conversation
- Retrieved docs → summarize if newer versions exist
- Never compress the system prompt

Summary generation by type:
- Tool outputs: preserve findings, metrics, conclusions
- Conversational: preserve decisions, commitments, context shifts
- Documents: preserve key facts, drop supporting evidence
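The trigger and priority order above can be sketched as follows. This is a minimal illustration, not a reference implementation: `summarize` is a hypothetical stub standing in for an LLM summarization call, token counting uses a rough characters-per-token proxy, and the 0.8 threshold comes from the snippet above.

```python
# Sketch of threshold-triggered compaction, assuming messages are dicts
# with "role" and "content" keys.

def summarize(text: str, limit: int = 200) -> str:
    # Placeholder: a real system would call a model here.
    return text[:limit] + ("..." if len(text) > limit else "")

def count_tokens(messages) -> int:
    # Rough proxy: ~4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def compact_context(messages, keep_recent: int = 4):
    """Compress in priority order: tool outputs first, then old turns.
    The system prompt (role == "system") is never touched."""
    compacted = []
    for i, m in enumerate(messages):
        is_old = i < len(messages) - keep_recent
        if m["role"] == "system":
            compacted.append(m)  # never compress the system prompt
        elif m["role"] == "tool":
            compacted.append({**m, "content": summarize(m["content"])})
        elif is_old:
            compacted.append({**m, "content": summarize(m["content"], 100)})
        else:
            compacted.append(m)  # recent turns kept verbatim
    return compacted

def maybe_compact(messages, context_limit: int = 1000):
    if count_tokens(messages) / context_limit > 0.8:
        return compact_context(messages)
    return messages
```

In production the `keep_recent` window and per-type summarizers would be tuned per workload; the structure (check utilization, then compress in priority order) is the point here.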
Observation Masking
Tool outputs can be 80%+ of total tokens. Replace verbose outputs with references:

```python
if len(observation) > max_length:
    ref_id = store_observation(observation)
    return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"
```

Masking rules:
- Never mask: content critical to the current task, the most recent turn, active reasoning
- Consider masking: 3+ turns old, key points extractable, purpose already served
- Always mask: repeated outputs, boilerplate, already-summarized content
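A minimal version of this pattern, assuming an in-memory reference store; `extract_key` here is a hypothetical stand-in (it just keeps the first line) for real key-point extraction:

```python
# Sketch of observation masking with an in-memory reference store.

_observation_store: dict[str, str] = {}

def store_observation(observation: str) -> str:
    ref_id = f"obs-{len(_observation_store)}"
    _observation_store[ref_id] = observation  # full text kept for later lookup
    return ref_id

def extract_key(observation: str) -> str:
    # Crude heuristic: treat the first line as the key point.
    return observation.splitlines()[0][:120]

def mask_observation(observation: str, max_length: int = 500) -> str:
    if len(observation) <= max_length:
        return observation  # short outputs pass through unmasked
    ref_id = store_observation(observation)
    return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"

def fetch_observation(ref_id: str) -> str:
    # The agent can dereference a masked observation on demand.
    return _observation_store[ref_id]
```

Keeping the full text retrievable is what distinguishes masking from deletion: the agent can always fetch the reference back if a later step needs the detail.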
KV-Cache Optimization
Cache Key/Value tensors for requests with identical prefixes:

```python
# Cache-friendly ordering: stable content first
context = [
    system_prompt,      # Cacheable
    tool_definitions,   # Cacheable
    reused_templates,   # Reusable
    unique_content      # Unique
]
```
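One way to make this ordering concrete is to assemble prompts so the cacheable prefix is byte-identical across requests. The names and string formats below are illustrative, not any specific inference API:

```python
# Sketch: build prompts so the stable prefix is byte-identical across
# requests; anything dynamic (timestamps, request IDs) goes last.

def build_prompt(system_prompt: str, tool_definitions: str,
                 reused_templates: str, unique_content: str) -> str:
    # Stable, cacheable prefix first; unique content appended last.
    stable_prefix = "\n".join([system_prompt, tool_definitions, reused_templates])
    return stable_prefix + "\n" + unique_content

SYSTEM = "You are a research agent."        # no timestamps in here
TOOLS = "tools: search(query), read(url)"
TEMPLATE = "Answer with citations."

a = build_prompt(SYSTEM, TOOLS, TEMPLATE, "user query A")
b = build_prompt(SYSTEM, TOOLS, TEMPLATE, "user query B")
stable = "\n".join([SYSTEM, TOOLS, TEMPLATE]) + "\n"
# Both prompts share `stable`, so KV tensors for that prefix can be reused.
```

If the system prompt embedded a timestamp, the shared prefix would shrink to nearly zero and every request would recompute the full KV state.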
**Design for cache stability**:
- Avoid dynamic content (timestamps)
- Use consistent formatting
- Keep structure stable across sessions

Context Partitioning
Split work across sub-agents with isolated contexts:

```python
# Each sub-agent has a clean, focused context
results = await gather(
    research_agent.search("topic A"),
    research_agent.search("topic B"),
    research_agent.search("topic C")
)

# Coordinator synthesizes without carrying full context
synthesized = await coordinator.synthesize(results)
```
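A runnable end-to-end version of this pattern, assuming `gather` means `asyncio.gather`; `ResearchAgent` and `Coordinator` are hypothetical stand-ins whose only job is to show the context isolation:

```python
import asyncio

# Sketch of context partitioning: each agent instance keeps its own
# isolated context; the coordinator sees only the results.

class ResearchAgent:
    def __init__(self):
        self.context: list[str] = []  # isolated per instance

    async def search(self, topic: str) -> str:
        self.context.append(topic)    # only this topic enters this context
        return f"findings on {topic}"

class Coordinator:
    async def synthesize(self, results: list[str]) -> str:
        # Receives sub-agent results, never their full contexts.
        return " | ".join(results)

async def main() -> str:
    agents = [ResearchAgent() for _ in range(3)]
    results = await asyncio.gather(
        agents[0].search("topic A"),
        agents[1].search("topic B"),
        agents[2].search("topic C"),
    )
    return await Coordinator().synthesize(list(results))

print(asyncio.run(main()))
```

Note that using one agent instance per topic (rather than reusing one) is what keeps each context clean; the coordinator's context grows only by the size of the results.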
Budget Management
```python
context_budget = {
    "system_prompt": 2000,
    "tool_definitions": 3000,
    "retrieved_docs": 10000,
    "message_history": 15000,
    "reserved_buffer": 2000
}

# Monitor and trigger optimization at 70-80% utilization
```
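A small monitor over such a budget, wiring in the 70%/80% thresholds from this document; the returned action strings are placeholders for whatever your pipeline does at each level:

```python
# Sketch of budget monitoring with 70%/80% trigger thresholds.

context_budget = {
    "system_prompt": 2000,
    "tool_definitions": 3000,
    "retrieved_docs": 10000,
    "message_history": 15000,
    "reserved_buffer": 2000,
}

def check_budget(usage: dict[str, int], budget: dict[str, int]) -> str:
    """Return the recommended action for the current utilization."""
    # The reserved buffer is headroom, not spendable capacity.
    capacity = sum(budget.values()) - budget["reserved_buffer"]
    utilization = sum(usage.values()) / capacity
    if utilization > 0.8:
        return "compact"   # apply compaction now
    if utilization > 0.7:
        return "monitor"   # start watching closely
    return "ok"
```

Counting the reserved buffer as headroom rather than capacity means the triggers fire before the hard limit is actually in sight.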
When to Optimize
| Signal | Action |
|---|---|
| Utilization >70% | Start monitoring |
| Utilization >80% | Apply compaction |
| Quality degradation | Investigate cause |
| Tool outputs dominate | Observation masking |
| Docs dominate | Summarization/partitioning |
Performance Targets
- Compaction: 50-70% reduction, <5% quality loss
- Masking: 60-80% reduction in masked observations
- Cache: 70%+ hit rate for stable workloads
Best Practices
- Measure before optimizing
- Apply compaction before masking
- Design for cache stability
- Partition before context becomes problematic
- Monitor effectiveness over time
- Balance token savings against quality
- Test at production scale
- Implement graceful degradation