Context Degradation in Long Sessions

In short: Context degradation is the gradual loss of accuracy that happens as a session fills with verbose output: the model begins to reason from generic typical patterns rather than the specific classes, files, and findings it discovered earlier. Anthropic calls the underlying effect context rot, and recognising its symptoms is the trigger to take corrective action.

What context degradation looks like in extended sessions

Context degradation is the moment a long working session quietly stops being reliable. Early in a session, ask Claude about the TokenValidator class you explored twenty minutes ago and it answers with the exact method names and edge cases it read. An hour and forty thousand tokens of output later, the same question returns a confident but generic answer about how authentication is "typically" structured. Nothing errored. The model simply lost its grip on the specifics, and that silent slide is precisely what makes degradation dangerous on the Claude Certified Architect exam.

The mechanism is not a bug; it is a property of how attention works over long inputs. Anthropic's documentation is explicit that as token count grows, "accuracy and recall degrade, a phenomenon known as context rot." Because every token has to attend to every other token, attention is stretched thinner as the sequence lengthens. The result is a performance gradient: the model does not fall off a cliff at the context limit, it degrades steadily well before it. This is why a session can feel sharp at the start and muddy in the middle of the same window.

Context rot: Anthropic's name for the measured decline in a model's ability to recall information accurately as the number of tokens in context increases. It is gradual and begins long before the hard token limit, which is why curating context matters more than raw window size.

Why the model starts referencing typical patterns

The single most exam-relevant symptom of context degradation is a shift in how the model talks about your codebase. A healthy session grounds answers in concrete artefacts: this class, that file, this specific exception. A degraded session drifts toward generic language: "services like this usually validate the token here," "a typical retry policy would back off exponentially." The hedging words are the signal. When specifics are replaced by what is "typical" or "usually" true, the model has lost the earlier findings and is filling the gap with its training priors instead.

This matters because the priors are often plausible and occasionally wrong. In a large codebase exploration the whole point is that your system deviates from the textbook, and the deviations are exactly what gets lost first. The exam frames this under Task 5.4 (managing context in large codebase exploration) and Scenario 2 (code generation), where an agent is asked to navigate an unfamiliar repository over a long session. The trap the exam sets is continuing such a session past the point where the model is clearly generalising, instead of treating that generalisation as a stop-and-act signal.

How context fills and grip is lost

Two forces fill a session window in tandem. The first is conversation accumulation: the Claude API preserves every previous turn completely, so input grows linearly with each round trip. The second is verbose tool output, which dominates in codebase work. A single directory listing, a grep across the repo, or a file read can each add thousands of tokens, most of which are never referenced again. As that low-value bulk accumulates, the high-value early findings make up a smaller and smaller fraction of what the model is attending to, and their effective recall drops.

How an extended session degrades

Loading diagram...

Degradation is a loop you monitor: when answers turn generic, the session needs a context-management intervention before it continues.

Note what the diagram does not show: a sudden failure at the right-hand edge. There is no single token at which the session breaks. Instead, the quality of every answer is a little worse than it would have been with a cleaner window, and the effect compounds. A bigger window does not escape this. Anthropic is direct that "more context isn't automatically better" and that even on long-context benchmarks the gains "depend on what's in context, not just how much fits." A million-token window simply lets you accumulate more bulk before the symptoms become obvious.

Why recognising degradation is a skill, not a metric

There is no API field that announces "context is now degraded." Newer models expose context awareness, a running token-budget signal the model can read, but that tells you how full the window is, not how much grip has been lost. Grip is judged behaviourally. The architect's job is to watch for the symptoms and treat them as a trigger: vague references to typical structure, answers that contradict something established earlier in the session, re-asking for information that was already provided, or summaries that flatten distinct findings into one generic claim.

This is why this knowledge point sits at the root of Task 5.4 and unlocks the four remedies that follow. You cannot choose between writing findings to a scratchpad file, delegating noisy work to a subagent, injecting a summary and running /compact, or exporting a recovery manifest until you can first detect that the window has degraded. Detection is the prerequisite skill; the interventions are the apply-level responses.

Worked example

An agent is refactoring a 400-file authentication service over a two-hour session and the engineer notices the answers getting vaguer.

The session opens well. Asked how sessions are revoked, the agent cites SessionStore.revoke(), names the Redis key prefix it uses, and flags a race condition it spotted in revokeAll(). Forty file reads later, the engineer asks a follow-up about the same revocation path. This time the agent replies that "session revocation is typically handled by clearing the session store and invalidating tokens", accurate in the abstract, but it has dropped the specific method, the key prefix, and the race condition it found earlier.

That shift from SessionStore.revoke() to "typically handled by" is the degradation symptom in its purest form. The verbose output of forty file reads has crowded out the precise early findings. The correct architect response is not to push on and hope the next answer is sharper; it is to treat the generic answer as the trigger to intervene. Here the engineer captures the concrete findings into a scratchpad file, then runs the next phase from that durable note rather than from fading conversational memory. The session continues, but now the specifics live somewhere the window cannot rot.

What the engineer must not do is conclude the model "got worse" and reach for a larger context window or a hotter temperature. Neither addresses the cause. The cause is an unmanaged window full of low-value bulk, and the cure is curation, not capacity.

Context awareness helps but does not cure it

Newer Claude models add a feature that is easy to mistake for a fix: context awareness. Claude Sonnet 4.6, Sonnet 4.5, and Haiku 4.5 receive an explicit signal of their remaining token budget. At the start of a conversation the model is told its total window, for example <budget:token_budget>1000000</budget:token_budget>, and after each tool call it receives an update such as <system_warning>Token usage: 35000/1000000; 965000 remaining</system_warning>. Anthropic likens a model without this to a contestant cooking without a clock; with it, the model can pace itself and persist on long tasks rather than guessing how much room remains.

This is genuinely useful for long agent runs, but it is important not to over-read it. Knowing how full the window is tells the model how much space it has to work; it does not stop the recall of earlier content from fading as that space fills. A model can be perfectly aware that it is at sixty percent of its budget and still have lost grip on a finding from the first ten percent. Context awareness improves pacing and planning; it does not repeal context rot. The architect still owns the decision to curate, and the symptoms, generic answers, re-asked facts, remain the trigger to act, budget signal or no.

Degradation is not the same as statelessness

A subtle exam distinction is the difference between context degradation and the statelessness of the Messages API. Statelessness, covered by the full conversation history requirement, is about memory between calls: the endpoint keeps nothing, so the caller must resend the whole history each request. Degradation is about recall within a single, growing window: even when the full history is faithfully resent, the model attends to it less precisely as it lengthens. They are different failure modes with different fixes.

Conflating them produces wrong answers. If an agent forgets earlier turns because the developer stopped sending them, that is a statelessness bug fixed by resending history. If an agent has the full history in context but still drifts to generic answers, that is degradation fixed by curating the window, scratchpads, subagents, summaries, manifests. The exam will sometimes describe one and tempt you with the remedy for the other. Anchor on the question: is the information absent from the request, or present but poorly recalled? The first is statelessness; the second is degradation.

How extended thinking adds to a filling window

A source of context growth that is easy to overlook is the model's own reasoning. When extended thinking is enabled, Claude emits thinking blocks, and those blocks consume tokens in the window exactly like prompt text or tool output. On a long agentic session this is not a rounding error: rich step-by-step reasoning across dozens of turns can occupy a meaningful slice of the budget, accelerating the fill that drives degradation. An architect estimating how quickly a session will degrade has to account for thinking tokens, not just the visible conversation.

How those blocks are carried forward is model-specific, and the difference matters for pacing. Anthropic's extended-thinking documentation notes that on its more recent models, Claude Opus 4.5 and later and Sonnet 4.6 and later, previous thinking blocks are retained by default, so they keep occupying context across later turns; on earlier Opus and Sonnet models and on all Haiku models they are not included in later-turn context. The same workflow therefore degrades at different rates depending on the model, because one configuration keeps spending budget on reasoning history while another discards it.

There is also a display nuance worth knowing. Extended thinking can return summarised rather than full reasoning, and on some Claude 4 models the display field defaults to omitted, so you must request summarised thinking explicitly if you want it back. None of this changes the core lesson of this knowledge point: thinking is real context you are spending, and treating it as free is one more way a window fills faster than the visible conversation suggests.

How this is tested on the exam

Domain 5 (Context Management and Reliability) weights this lightly in volume but heavily in consequence, because almost every reliability failure in a long agent run traces back to unrecognised degradation. Questions at the understand level describe a session where the agent starts generalising, re-asks for known facts, or contradicts an earlier finding, and ask you to name what is happening and what it signals. The right answer identifies context degradation, attributes it to the window filling with verbose output, and frames it as the cue to apply a context-management technique, never as a reason to raise max_tokens, increase temperature, or simply trust the larger model.

The most common distractor exploits the intuition that a bigger window solves the problem. It does not. Capacity delays the symptom; it does not remove the underlying recall decline. Hold onto two facts and these questions become straightforward: degradation is gradual and starts before the limit, and its signature is the move from specific findings to typical patterns.

Misconception

Context degradation only becomes a problem once the session hits the context window limit.

What's actually true

It begins well before the limit. Anthropic describes context rot as a gradual decline in recall as token count rises, so quality is already dropping while there is plenty of room left. Waiting for the hard limit means acting long after the answers stopped being reliable.

Misconception

If the model starts giving generic answers, switching to a model with a larger context window will fix it.

What's actually true

A larger window holds more bulk but does not stop recall from degrading. Anthropic states that more context is not automatically better and that results depend on what is in context, not how much fits. The fix is curating the window, not enlarging it.

Check your understanding

During a long codebase-exploration session, an agent that earlier cited the exact validator class and a specific race condition now answers that 'authentication is typically handled by validating the token and checking the session store.' No errors occur. What is happening and what should the architect do?

Watch and learn

Official Anthropic Academy lessons first, then hand-picked walkthroughs. Videos load only when you press play.

No videos curated for this concept yet

We are still curating the best official and community videos for this topic.

References & primary sources

Adaptive study

Master this concept with Archie

Practice it inside an adaptive study session. Archie, your Socratic AI tutor, tracks your mastery with Bayesian Knowledge Tracing and schedules the perfect next review.

Start studying

Context Degradation in Extended Claude Sessions