Context Optimization Techniques

Context optimization extends the effective capacity of limited context windows through strategic compression, masking, caching, and partitioning. The goal is not to magically increase context windows but to make better use of available capacity. Done well, optimization can double or triple usable capacity without requiring larger models or longer contexts.

When to Activate

Activate this skill when:
  • Context limits constrain task complexity
  • Optimizing for cost reduction (fewer tokens = lower costs)
  • Reducing latency for long conversations
  • Implementing long-running agent systems
  • Needing to handle larger documents or conversations
  • Building production systems at scale

Core Concepts

Context optimization extends effective capacity through four primary strategies: compaction (summarizing context near limits), observation masking (replacing verbose outputs with references), KV-cache optimization (reusing cached computations), and context partitioning (splitting work across isolated contexts).
The key insight is that context quality matters more than quantity. Optimization preserves signal while reducing noise. The art lies in selecting what to keep versus what to discard, and when to apply each technique.

Detailed Topics

Compaction Strategies

What is Compaction
Compaction is the practice of summarizing context contents when approaching limits, then reinitializing a new context window with the summary. This distills the contents of a context window in a high-fidelity manner, enabling the agent to continue with minimal performance degradation.
Compaction typically serves as the first lever in context optimization. The art lies in selecting what to keep versus what to discard.
Compaction Implementation
Compaction works by identifying sections that can be compressed, generating summaries that capture their essential points, and replacing the full content with those summaries. Compression priority: tool outputs first (replace with summaries), then old turns (summarize early conversation), then retrieved docs (summarize when newer versions exist); never compress the system prompt.
Summary Generation
Effective summaries preserve different elements depending on message type:
Tool outputs: Preserve key findings, metrics, and conclusions. Remove verbose raw output.
Conversational turns: Preserve key decisions, commitments, and context shifts. Remove filler and back-and-forth.
Retrieved documents: Preserve key facts and claims. Remove supporting evidence and elaboration.
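A rough sketch of this priority scheme follows. The message shape and the `summarize` helper are assumptions for illustration, not a library API:

```python
def compact_context(messages, summarize, keep_recent=4):
    """Compact a message list: keep the system prompt verbatim, fold old
    turns into one summary, and shorten verbose tool outputs.
    `summarize` is any callable mapping text -> shorter text."""
    system = [m for m in messages if m["role"] == "system"]   # never compress
    rest = [m for m in messages if m["role"] != "system"]
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    compacted = list(system)
    if old:
        joined = "\n".join(m["content"] for m in old)
        compacted.append({"role": "user",
                          "content": "Summary of earlier turns: " + summarize(joined)})
    for m in recent:
        if m["role"] == "tool" and len(m["content"]) > 500:
            m = {**m, "content": summarize(m["content"])}      # keep key findings
        compacted.append(m)
    return compacted
```

In a real system `summarize` would be a model call that preserves the elements listed above per message type; here any text-shortening function stands in for it.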

Observation Masking

The Observation Problem
Tool outputs can comprise 80%+ of token usage in agent trajectories. Much of this is verbose output that has already served its purpose. Once an agent has used a tool output to make a decision, keeping the full output provides diminishing value while consuming significant context.
Observation masking replaces verbose tool outputs with compact references. The information remains accessible if needed but does not consume context continuously.
Masking Strategy Selection
Not all observations should be masked equally:
Never mask: Observations critical to current task, observations from the most recent turn, observations used in active reasoning.
Consider masking: Observations from 3+ turns ago, verbose outputs with key points extractable, observations whose purpose has been served.
Always mask: Repeated outputs, boilerplate headers/footers, outputs already summarized in conversation.
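These rules can be encoded as a small decision helper. The observation fields and thresholds below are illustrative assumptions:

```python
def should_mask(obs, current_turn, min_age=3, min_length=1000):
    """Decide whether an observation can be replaced with a compact reference."""
    # Never mask: critical to the task, from the current turn, or in active reasoning.
    if obs.get("critical") or obs["turn"] == current_turn or obs.get("in_reasoning"):
        return False
    # Always mask: repeats, boilerplate, or content already summarized in conversation.
    if obs.get("repeated") or obs.get("boilerplate") or obs.get("summarized"):
        return True
    # Consider masking: old enough and verbose enough to be worth eliding.
    age = current_turn - obs["turn"]
    return age >= min_age and len(obs["content"]) >= min_length
```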

KV-Cache Optimization

Understanding KV-Cache
The KV-cache stores the Key and Value tensors computed during inference, and it grows linearly with sequence length. Reusing cached entries across requests that share an identical prefix avoids recomputing them.
Prefix caching reuses KV blocks across requests with identical prefixes using hash-based block matching. This dramatically reduces cost and latency for requests with common prefixes like system prompts.
Cache Optimization Patterns
Optimize for caching by ordering context elements to maximize cache hits. Place stable elements first (system prompt, tool definitions), then frequently reused elements, then unique elements last.
Design prompts to maximize cache stability: avoid dynamic content like timestamps, use consistent formatting, keep structure stable across sessions.
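A toy model of hash-based prefix caching shows why stable prefixes matter: each block is keyed by the cumulative hash of everything before it, so a single divergent token (e.g. a timestamp) invalidates every later block. This is an illustration of the idea, not any particular engine's implementation:

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: each KV block is keyed by the hash of the entire
    prefix up to and including that block."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.cached = set()          # hashes of blocks we have "computed"

    def lookup(self, tokens):
        """Return how many leading tokens could reuse cached KV blocks."""
        reused = 0
        h = hashlib.sha256()
        for i in range(0, len(tokens), self.block_size):
            block = tokens[i:i + self.block_size]
            if len(block) < self.block_size:
                break                            # partial blocks are not cached
            h.update(repr(block).encode())       # cumulative prefix hash
            key = h.hexdigest()
            if key in self.cached:
                reused += len(block)             # hit: skip recomputation
            else:
                self.cached.add(key)             # miss: compute, then cache
        return reused
```

Because the hash is cumulative, hits always form a contiguous leading run, which is exactly why stable elements belong at the front of the context.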

Context Partitioning

Sub-Agent Partitioning
The most aggressive form of context optimization is partitioning work across sub-agents with isolated contexts. Each sub-agent operates in a clean context focused on its subtask, without carrying accumulated context from other subtasks.
This approach achieves separation of concerns—the detailed search context remains isolated within sub-agents while the coordinator focuses on synthesis and analysis.
Result Aggregation
Aggregate results from partitioned subtasks by validating that all partitions completed, merging compatible results, and summarizing the merged output if it is still too large.
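A minimal aggregation sketch, assuming hypothetical `count_tokens` and `summarize` helpers and a simple result-dict shape:

```python
def aggregate_results(results, max_tokens, count_tokens, summarize):
    """Validate, merge, and (if needed) re-summarize sub-agent outputs."""
    failed = [r["task"] for r in results if r.get("status") != "done"]
    if failed:
        raise RuntimeError(f"partitions did not complete: {failed}")
    merged = "\n\n".join(r["output"] for r in results)    # merge compatible results
    if count_tokens(merged) > max_tokens:                 # still too large?
        merged = summarize(merged)
    return merged
```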

Budget Management

Context Budget Allocation
Design explicit context budgets. Allocate tokens to categories: system prompt, tool definitions, retrieved docs, message history, and reserved buffer. Monitor usage against the budget and trigger optimization when approaching limits.
Trigger-Based Optimization
Monitor signals that should trigger optimization: token utilization above 80%, degradation indicators, and drops in response quality. Apply the appropriate techniques based on context composition.
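A budget can be as simple as a per-category token table plus a threshold check. The numbers below are illustrative placeholders for a 128k window, not recommendations:

```python
BUDGET = {                       # illustrative split of a 128k-token window
    "system_prompt":    4_000,
    "tool_definitions": 6_000,
    "retrieved_docs":   40_000,
    "message_history":  60_000,
    "reserved_buffer":  18_000,
}

def optimization_triggers(usage, budget=BUDGET, threshold=0.8):
    """Return the categories whose token usage exceeds the trigger threshold."""
    return [cat for cat, used in usage.items()
            if used > threshold * budget.get(cat, 0)]
```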

Practical Guidance

Optimization Decision Framework

When to optimize:
  • Context utilization exceeds 70%
  • Response quality degrades as conversations extend
  • Costs increase due to long contexts
  • Latency increases with conversation length
What to apply:
  • Tool outputs dominate: observation masking
  • Retrieved documents dominate: summarization or partitioning
  • Message history dominates: compaction with summarization
  • Multiple components: combine strategies
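This dispatch can be sketched as a lookup on the dominant component; the strategy names are placeholders, and the 50% dominance cutoff is an assumption:

```python
def choose_strategy(composition):
    """Pick a technique from a token-count breakdown by context component."""
    dominant = max(composition, key=composition.get)
    strategies = {
        "tool_outputs":    "observation masking",
        "retrieved_docs":  "summarization or partitioning",
        "message_history": "compaction with summarization",
    }
    # If no single component dominates (< 50% of total), combine strategies.
    if composition[dominant] < 0.5 * sum(composition.values()):
        return "combine strategies"
    return strategies.get(dominant, "combine strategies")
```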

Performance Considerations

Compaction should achieve 50-70% token reduction with less than 5% quality degradation. Masking should achieve 60-80% reduction in masked observations. Cache optimization should achieve 70%+ hit rate for stable workloads.
Monitor and iterate on optimization strategies based on measured effectiveness.
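The compaction target above can be checked mechanically. A sketch, where `quality_drop` stands for whatever degradation metric you already track:

```python
def compaction_on_target(tokens_before, tokens_after, quality_drop):
    """True if compaction hit the 50-70% reduction target with <5% quality loss."""
    reduction = 1 - tokens_after / tokens_before
    return 0.50 <= reduction <= 0.70 and quality_drop < 0.05
```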

Examples

Example 1: Compaction Trigger

```python
if context_tokens / context_limit > 0.8:
    context = compact_context(context)
```

Example 2: Observation Masking

```python
if len(observation) > max_length:
    ref_id = store_observation(observation)
    return f"[Obs:{ref_id} elided. Key: {extract_key(observation)}]"
```

Example 3: Cache-Friendly Ordering

```python
# Stable content first
context = [system_prompt, tool_definitions]  # Cacheable
context += [reused_templates]                # Reusable
context += [unique_content]                  # Unique
```

Guidelines

  1. Measure before optimizing—know your current state
  2. Apply compaction before masking when possible
  3. Design for cache stability with consistent prompts
  4. Partition before context becomes problematic
  5. Monitor optimization effectiveness over time
  6. Balance token savings against quality preservation
  7. Test optimization at production scale
  8. Implement graceful degradation for edge cases

Integration

This skill builds on context-fundamentals and context-degradation. It connects to:
  • multi-agent-patterns - Partitioning as isolation
  • evaluation - Measuring optimization effectiveness
  • memory-systems - Offloading context to memory

References

Internal reference:
  • Optimization Techniques Reference - Detailed technical reference
Related skills in this collection:
  • context-fundamentals - Context basics
  • context-degradation - Understanding when to optimize
  • evaluation - Measuring optimization
External resources:
  • Research on context window limitations
  • KV-cache optimization techniques
  • Production engineering guides


Skill Metadata

Created: 2025-12-20
Last Updated: 2025-12-20
Author: Agent Skills for Context Engineering Contributors
Version: 1.0.0