# Context Degradation Patterns
Language models exhibit predictable degradation patterns as context length increases. Understanding these patterns is essential for diagnosing failures and designing resilient systems. Context degradation is not a binary state but a continuum of performance loss that manifests in several distinct ways.
## When to Activate
Activate this skill when:
- Agent performance degrades unexpectedly during long conversations
- Debugging cases where agents produce incorrect or irrelevant outputs
- Designing systems that must handle large contexts reliably
- Evaluating context engineering choices for production systems
- Investigating "lost in middle" phenomena in agent outputs
- Analyzing context-related failures in agent behavior
## Core Concepts
Context degradation manifests through several distinct patterns. The lost-in-middle phenomenon causes information in the center of context to receive less attention. Context poisoning occurs when errors compound through repeated reference. Context distraction happens when irrelevant information overwhelms relevant content. Context confusion arises when the model cannot determine which context applies. Context clash develops when accumulated information directly conflicts.
These patterns are predictable and can be mitigated through architectural patterns like compaction, masking, partitioning, and isolation.
## Detailed Topics

### The Lost-in-Middle Phenomenon
The most well-documented degradation pattern is the "lost-in-middle" effect, where models demonstrate U-shaped attention curves. Information at the beginning and end of context receives reliable attention, while information buried in the middle suffers from dramatically reduced recall accuracy.
#### Empirical Evidence
Research demonstrates that relevant information placed in the middle of context experiences 10-40% lower recall accuracy compared to the same information at the beginning or end. This is not a failure of the model but a consequence of attention mechanics and training data distributions.
Models allocate massive attention to the first token (often the BOS token) to stabilize internal states. This creates an "attention sink" that soaks up attention budget. As context grows, the limited budget is stretched thinner, and middle tokens fail to garner sufficient attention weight for reliable retrieval.
#### Practical Implications
Design context placement with attention patterns in mind. Place critical information at the beginning or end of context. Consider whether information will be queried directly or needs to support reasoning—if the latter, placement matters less but overall signal quality matters more.
For long documents or conversations, use summary structures that surface key information at attention-favored positions. Use explicit section headers and transitions to help models navigate structure.
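The placement guidance above can be expressed as a small context-assembly helper. This is a minimal sketch: the function name, section labels, and ordering are illustrative assumptions, not any particular framework's API.

```python
def assemble_context(task: str, details: list[str], key_findings: list[str]) -> str:
    """Place attention-favored content at the edges of the prompt.

    Critical task framing goes first, bulk detail sits in the middle
    (where recall is weakest), and key findings are restated at the end.
    """
    parts = ["[CURRENT TASK]", task]
    parts += ["", "[DETAILED CONTEXT]"] + details
    parts += ["", "[KEY FINDINGS]"] + key_findings
    return "\n".join(parts)

prompt = assemble_context(
    task="Generate quarterly report",
    details=["50 pages of data", "Multiple analysis sections"],
    key_findings=["Revenue up 15%", "Costs down 8%"],
)
```

The summary-at-the-edges shape mirrors Example 2 later in this document: the middle section can grow without pushing critical framing out of the attention-favored positions.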
### Context Poisoning
Context poisoning occurs when hallucinations, errors, or incorrect information enters context and compounds through repeated reference. Once poisoned, context creates feedback loops that reinforce incorrect beliefs.
#### How Poisoning Occurs
Poisoning typically enters through three pathways. First, tool outputs may contain errors or unexpected formats that models accept as ground truth. Second, retrieved documents may contain incorrect or outdated information that models incorporate into reasoning. Third, model-generated summaries or intermediate outputs may introduce hallucinations that persist in context.
The compounding effect is severe. If an agent's goals section becomes poisoned, it develops strategies that take substantial effort to undo. Each subsequent decision references the poisoned content, reinforcing incorrect assumptions.
#### Detection and Recovery
Watch for symptoms including degraded output quality on tasks that previously succeeded, tool misalignment where agents call wrong tools or parameters, and hallucinations that persist despite correction attempts. When these symptoms appear, consider context poisoning.
Recovery requires removing or replacing poisoned content. This may involve truncating context to before the poisoning point, explicitly noting the poisoning in context and asking for re-evaluation, or restarting with clean context and preserving only verified information.
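The truncation option can be sketched as below, assuming a chat-style message list. Detecting the poisoning point itself is the hard part and is left to separate validation; the index here is supplied by the caller.

```python
def truncate_before_poisoning(messages: list[dict], poisoned_index: int) -> list[dict]:
    """Drop the poisoned message and everything after it.

    Later turns may have referenced the poisoned content, so they are
    discarded rather than patched in place.
    """
    return messages[:poisoned_index]

history = [
    {"role": "user", "content": "Summarize Q3 results"},
    {"role": "assistant", "content": "Revenue was $4.1M"},       # hallucinated figure
    {"role": "assistant", "content": "Given $4.1M revenue..."},  # references the error
]
clean = truncate_before_poisoning(history, poisoned_index=1)
```

Preserving only verified information (the third recovery option) would then mean re-adding facts to `clean` after checking them against ground truth.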
### Context Distraction
Context distraction emerges when context grows so long that models over-focus on provided information at the expense of their training knowledge. The model attends to everything in context regardless of relevance, and this creates pressure to use provided information even when internal knowledge is more accurate.
#### The Distractor Effect
Research shows that even a single irrelevant document in context reduces performance on tasks involving relevant documents. Multiple distractors compound degradation. The effect is not about noise in absolute terms but about attention allocation—irrelevant information competes with relevant information for limited attention budget.
Models do not have a mechanism to "skip" irrelevant context. They must attend to everything provided, and this obligation creates distraction even when the irrelevant information is clearly not useful.
#### Mitigation Strategies
Mitigate distraction through careful curation of what enters context. Apply relevance filtering before loading retrieved documents. Use namespacing and organization to make irrelevant sections easy to ignore structurally. Consider whether information truly needs to be in context or can be accessed through tool calls instead.
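A minimal sketch of relevance filtering before documents enter context. The keyword-overlap scorer is a deliberately naive stand-in for whatever relevance model you actually use (embedding similarity, a reranker, etc.); the threshold and `top_k` values are illustrative.

```python
def filter_relevant(docs: list[str], scorer, threshold: float = 0.5, top_k: int = 3) -> list[str]:
    """Keep only documents whose relevance score clears a threshold,
    capped at top_k, so distractors never enter context."""
    ranked = sorted(docs, key=scorer, reverse=True)
    return [d for d in ranked[:top_k] if scorer(d) >= threshold]

def keyword_overlap(query: str):
    """Toy scorer: fraction of query words appearing in the document."""
    q = set(query.lower().split())
    def score(doc: str) -> float:
        return len(q & set(doc.lower().split())) / max(len(q), 1)
    return score

docs = ["quarterly revenue grew 15 percent", "office picnic scheduled friday"]
relevant = filter_relevant(docs, keyword_overlap("quarterly revenue report"), threshold=0.3)
```

Because the distractor effect follows a step function, dropping a single irrelevant document this way can matter more than trimming tokens elsewhere.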
### Context Confusion
Context confusion arises when irrelevant information influences responses in ways that degrade quality. This is related to distraction but distinct—confusion concerns the influence of context on model behavior rather than attention allocation.
If you put something in context, the model has to pay attention to it. The model may incorporate irrelevant information, use inappropriate tool definitions, or apply constraints that came from different contexts. Confusion is especially problematic when context contains multiple task types or when switching between tasks within a single session.
#### Signs of Confusion
Watch for responses that address the wrong aspect of a query, tool calls that seem appropriate for a different task, or outputs that mix requirements from multiple sources. These indicate confusion about what context applies to the current situation.
#### Architectural Solutions
Architectural solutions include explicit task segmentation where different tasks get different context windows, clear transitions between task contexts, and state management that isolates context for different objectives.
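Explicit task segmentation can be as simple as keeping one message list per objective. This class is an illustrative sketch, not a particular framework's state-management API.

```python
class TaskContexts:
    """Keep a separate message list per task so instructions from one
    objective cannot bleed into another's context window."""

    def __init__(self) -> None:
        self._contexts: dict[str, list[dict]] = {}

    def context_for(self, task_id: str) -> list[dict]:
        # Each task gets its own isolated history, created on first use.
        return self._contexts.setdefault(task_id, [])

    def add(self, task_id: str, role: str, content: str) -> None:
        self.context_for(task_id).append({"role": role, "content": content})

ctx = TaskContexts()
ctx.add("billing", "user", "Refund order #123")
ctx.add("research", "user", "Compare vector databases")
```

When the agent switches tasks, only `ctx.context_for(task_id)` is sent to the model, so constraints from the billing task never appear in the research context.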
### Context Clash
Context clash develops when accumulated information directly conflicts, creating contradictory guidance that derails reasoning. This differs from poisoning where one piece of information is incorrect—in clash, multiple correct pieces of information contradict each other.
#### Sources of Clash
Clash commonly arises from multi-source retrieval where different sources have contradictory information, version conflicts where outdated and current information both appear in context, and perspective conflicts where different viewpoints are valid but incompatible.
#### Resolution Approaches
Resolution approaches include explicit conflict marking that identifies contradictions and requests clarification, priority rules that establish which source takes precedence, and version filtering that excludes outdated information from context.
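Priority rules and version filtering can be combined in a single resolution pass. The fact schema and the source-priority table below are illustrative assumptions; real systems would define these per domain.

```python
from datetime import date

def resolve_clash(facts: list[dict]) -> dict:
    """Pick one value per key using source priority first, recency second.

    Each fact looks like {"key", "value", "source", "as_of"}.
    Lower priority rank wins; ties break toward the newer fact.
    """
    priority = {"official_docs": 0, "wiki": 1, "forum": 2}  # assumed ordering
    resolved: dict[str, dict] = {}
    for fact in facts:
        rank = (priority.get(fact["source"], 99), -fact["as_of"].toordinal())
        current = resolved.get(fact["key"])
        if current is None or rank < current["_rank"]:
            resolved[fact["key"]] = {**fact, "_rank": rank}
    return {k: v["value"] for k, v in resolved.items()}

facts = [
    {"key": "api_version", "value": "v1", "source": "forum", "as_of": date(2024, 1, 1)},
    {"key": "api_version", "value": "v2", "source": "official_docs", "as_of": date(2023, 6, 1)},
]
winner = resolve_clash(facts)
```

Only the winning value per key enters context; the losing facts are excluded rather than presented alongside it, which is what prevents the clash.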
## Empirical Benchmarks and Thresholds
Research provides concrete data on degradation patterns that inform design decisions.
### RULER Benchmark Findings
The RULER benchmark delivers sobering findings: only 50% of models claiming 32K+ context maintain satisfactory performance at 32K tokens. GPT-5.2 shows the least degradation among current models, while many still drop 30+ points at extended contexts. Near-perfect scores on simple needle-in-haystack tests do not translate to real long-context understanding.
### Model-Specific Degradation Thresholds
| Model | Degradation Onset | Severe Degradation | Notes |
|---|---|---|---|
| GPT-5.2 | ~64K tokens | ~200K tokens | Best overall degradation resistance with thinking mode |
| Claude Opus 4.5 | ~100K tokens | ~180K tokens | 200K context window, strong attention management |
| Claude Sonnet 4.5 | ~80K tokens | ~150K tokens | Optimized for agents and coding tasks |
| Gemini 3 Pro | ~500K tokens | ~800K tokens | 1M context window, native multimodality |
| Gemini 3 Flash | ~300K tokens | ~600K tokens | 3x speed of Gemini 2.5, 81.2% MMMU-Pro |
### Model-Specific Behavior Patterns
Different models exhibit distinct failure modes under context pressure:
- Claude 4.5 series: Lowest hallucination rates with calibrated uncertainty. Claude Opus 4.5 achieves 80.9% on SWE-bench Verified. Tends to refuse or ask clarification rather than fabricate.
- GPT-5.2: Two modes available - instant (fast) and thinking (reasoning). Thinking mode reduces hallucination through step-by-step verification but increases latency.
- Gemini 3 Pro/Flash: Native multimodality with 1M context window. Gemini 3 Flash offers 3x speed improvement over previous generation. Strong at multi-modal reasoning across text, code, images, audio, and video.
These patterns inform model selection for different use cases. High-stakes tasks benefit from Claude 4.5's conservative approach or GPT-5.2's thinking mode; speed-critical tasks may use instant modes.
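The thresholds in the table above can be encoded as a simple monitoring check. The cutoffs are taken directly from the table; the three-tier classification logic is an illustrative sketch.

```python
# (degradation onset, severe degradation) in tokens, from the table above.
THRESHOLDS = {
    "gpt-5.2": (64_000, 200_000),
    "claude-opus-4.5": (100_000, 180_000),
    "claude-sonnet-4.5": (80_000, 150_000),
    "gemini-3-pro": (500_000, 800_000),
    "gemini-3-flash": (300_000, 600_000),
}

def degradation_risk(model: str, context_tokens: int) -> str:
    """Classify current context size against the model's known thresholds."""
    onset, severe = THRESHOLDS[model]
    if context_tokens >= severe:
        return "severe"
    if context_tokens >= onset:
        return "degrading"
    return "ok"
```

A production system might emit a metric or trigger compaction whenever the result leaves `"ok"`, rather than waiting for visible quality loss.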
## Counterintuitive Findings
Research reveals several counterintuitive patterns that challenge assumptions about context management.
### Shuffled Haystacks Outperform Coherent Ones
Studies found that shuffled (incoherent) haystacks produce better performance than logically coherent ones. This suggests that coherent context may create false associations that confuse retrieval, while incoherent context forces models to rely on exact matching.
### Single Distractors Have Outsized Impact
Even a single irrelevant document reduces performance significantly. The effect is not proportional to the amount of noise but follows a step function where the presence of any distractor triggers degradation.
### Needle-Question Similarity Correlation
Lower similarity between needle and question pairs shows faster degradation with context length. Tasks requiring inference across dissimilar content are particularly vulnerable.
## When Larger Contexts Hurt
Larger context windows do not uniformly improve performance. In many cases, larger contexts create new problems that outweigh benefits.
### Performance Degradation Curves
Models exhibit non-linear degradation with context length. Performance remains stable up to a threshold, then degrades rapidly. The threshold varies by model and task complexity. For many models, meaningful degradation begins around 8,000-16,000 tokens even when context windows support much larger sizes.
### Cost Implications
Processing cost grows superlinearly with context length. The cost to process a 400K-token context is not merely double that of 200K: self-attention cost scales roughly quadratically with sequence length, so latency and compute climb steeply as context grows. For many applications, this makes large-context processing economically impractical.
### Cognitive Load Metaphor
Even with an infinite context, asking a single model to maintain consistent quality across dozens of independent tasks creates a cognitive bottleneck. The model must constantly switch context between items, maintain a comparative framework, and ensure stylistic consistency. This is not a problem that more context solves.
## Practical Guidance

### The Four-Bucket Approach
Four strategies address different aspects of context degradation:
Write: Save context outside the window using scratchpads, file systems, or external storage. This keeps active context lean while preserving information access.
Select: Pull relevant context into the window through retrieval, filtering, and prioritization. This addresses distraction by excluding irrelevant information.
Compress: Reduce tokens while preserving information through summarization, abstraction, and observation masking. This extends effective context capacity.
Isolate: Split context across sub-agents or sessions to prevent any single context from growing large enough to degrade. This is the most aggressive strategy but often the most effective.
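The Write bucket can be sketched as a scratchpad that persists intermediate results outside the window and leaves only a short pointer in context. The class and reference format below are illustrative assumptions, not a standard API.

```python
import json
import pathlib
import tempfile

class Scratchpad:
    """'Write' bucket: store data externally, keep a compact reference
    in the active context, and reload on demand."""

    def __init__(self, directory: str) -> None:
        self.dir = pathlib.Path(directory)

    def write(self, key: str, data: dict) -> str:
        path = self.dir / f"{key}.json"
        path.write_text(json.dumps(data))
        # Only this short string goes into the context window.
        return f"[stored: {key} -> {path.name}]"

    def read(self, key: str) -> dict:
        return json.loads((self.dir / f"{key}.json").read_text())

pad = Scratchpad(tempfile.mkdtemp())
ref = pad.write("search_results", {"hits": 42})
```

The Select bucket then becomes a `read` call made just-in-time, so bulky data rejoins context only for the step that actually needs it.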
### Architectural Patterns
Implement these strategies through specific architectural patterns. Use just-in-time context loading to retrieve information only when needed. Use observation masking to replace verbose tool outputs with compact references. Use sub-agent architectures to isolate context for different tasks. Use compaction to summarize growing context before it exceeds limits.
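Observation masking can be sketched as follows: verbose tool output is swapped for a compact reference, with the full text held in an external store for just-in-time reloading. The reference format and size cutoff are illustrative assumptions.

```python
def mask_observation(tool_name: str, output: str, store: dict, max_chars: int = 200) -> str:
    """Replace a verbose tool output with a compact reference.

    Short outputs pass through unchanged; long ones are stored
    externally and summarized by a one-line pointer.
    """
    if len(output) <= max_chars:
        return output
    ref = f"obs_{len(store)}"
    store[ref] = output
    first_line = output.splitlines()[0]
    return f"[{tool_name} output ({len(output)} chars) stored as {ref}; first line: {first_line!r}]"

store: dict[str, str] = {}
long_output = "\n".join(f"row {i}" for i in range(500))
masked = mask_observation("sql_query", long_output, store)
```

Only `masked` enters the conversation; a later step that genuinely needs the rows retrieves `store["obs_0"]` through a tool call, keeping the active context lean.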
## Examples
**Example 1: Detecting Degradation**

```yaml
# Context grows during long conversation
turn_1: 1000 tokens
turn_5: 8000 tokens
turn_10: 25000 tokens
turn_20: 60000 tokens (degradation begins)
turn_30: 90000 tokens (significant degradation)
```

**Example 2: Mitigating Lost-in-Middle**

```markdown
Organize context with critical info at edges

[CURRENT TASK] # At start
- Goal: Generate quarterly report
- Deadline: End of week

[DETAILED CONTEXT] # Middle (less attention)
- 50 pages of data
- Multiple analysis sections
- Supporting evidence

[KEY FINDINGS] # At end
- Revenue up 15%
- Costs down 8%
- Growth in Region A
```
## Guidelines
- Monitor context length and performance correlation during development
- Place critical information at beginning or end of context
- Implement compaction triggers before degradation becomes severe
- Validate retrieved documents for accuracy before adding to context
- Use versioning to prevent outdated information from causing clash
- Segment tasks to prevent context confusion across different objectives
- Design for graceful degradation rather than assuming perfect conditions
- Test with progressively larger contexts to find degradation thresholds
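The compaction-trigger guideline can be sketched as a budget check that summarizes older turns before degradation becomes severe. The `summarize` stub stands in for a model call; the budget and `keep_recent` values are illustrative assumptions.

```python
def maybe_compact(messages: list[str], token_count, budget: int = 60_000,
                  keep_recent: int = 5) -> list[str]:
    """Summarize older turns once total tokens approach the budget.

    `token_count` is any callable returning a token estimate for one
    message; `summarize` is a stub for an actual summarization call.
    """
    def summarize(turns: list[str]) -> str:
        return f"[summary of {len(turns)} earlier turns]"

    if sum(token_count(m) for m in messages) < budget:
        return messages  # still under budget: leave context untouched
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent

msgs = [f"turn {i}: " + "x " * 1000 for i in range(20)]
compacted = maybe_compact(msgs, token_count=lambda m: len(m.split()), budget=10_000)
```

Triggering on a budget well below the model's degradation onset (see the thresholds table) keeps compaction proactive rather than reactive.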
## Integration
This skill builds on context-fundamentals and should be studied after understanding basic context concepts. It connects to:
- context-optimization - Techniques for mitigating degradation
- multi-agent-patterns - Using isolation to prevent degradation
- evaluation - Measuring and detecting degradation in production
## References
Internal reference:
- Degradation Patterns Reference - Detailed technical reference
Related skills in this collection:
- context-fundamentals - Context basics
- context-optimization - Mitigation techniques
- evaluation - Detection and measurement
External resources:
- Research on attention mechanisms and context window limitations
- Studies on the "lost-in-middle" phenomenon
- Production engineering guides from AI labs
## Skill Metadata
Created: 2025-12-20
Last Updated: 2025-12-20
Author: Agent Skills for Context Engineering Contributors
Version: 1.0.0