I work on research at Chroma, and I just published our latest technical report on context rot.
TLDR: Model performance is non-uniform across context lengths, including state-of-the-art GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 models.
This highlights the need for context engineering. Whether relevant information is present in a model’s context is not all that matters; what matters more is how that information is presented.
Here is the complete open-source codebase to replicate our results: https://github.com/chroma-core/context-rot
Especially with Gemini Pro when providing long-form textual references: putting many documents into a single context window gives worse answers than having it summarize the documents first, answer questions against the summaries only, and then provide the full text of individual sub-documents on request (RAG-style, or just a simple agent loop).
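A minimal sketch of that summarize-first flow, assuming a placeholder llm() chat call and made-up helper names (nothing here is from the report):

    # Hypothetical summarize-first workflow: keep only summaries in the main
    # context, and pull full sub-document text in on demand.

    def llm(prompt: str) -> str:
        """Stand-in for whatever chat-completion call you use (assumption)."""
        raise NotImplementedError

    def summarize(doc: str) -> str:
        return llm(f"Summarize this document in a few sentences:\n\n{doc}")

    def answer(question: str, docs: dict[str, str]) -> str:
        # 1. Summarize each document so the working context stays small.
        summaries = {name: summarize(text) for name, text in docs.items()}

        # 2. Ask which documents are actually needed, using summaries only.
        index = "\n".join(f"- {name}: {s}" for name, s in summaries.items())
        wanted = llm(
            f"Question: {question}\n\nDocument summaries:\n{index}\n\n"
            "List the names of the documents you need in full, one per line."
        ).splitlines()

        # 3. Provide only the requested full texts, then answer.
        selected = "\n\n".join(docs[w.strip()] for w in wanted if w.strip() in docs)
        return llm(f"Question: {question}\n\nDocuments:\n{selected}\n\nAnswer:")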
Similarly, I've personally noticed that Claude Code with Opus or Sonnet gets worse the more compactions happen. It's unclear to me whether the summary itself gets worse, or whether the context window ends up with a higher percentage of less relevant data, but even clearing the context and asking it to re-read the relevant files (even if they were mentioned and summarized in the compaction) gives better results.
Long story short: Context engineering is still king, RAG is not dead
LLMs will need RAG one way or another; you can hide it from the user, but it still has to be there.
The thing that would signal context rot is when you approach the auto-compact threshold. Am I thinking about this right?
It's actually even more significant than is easy to benchmark (though I'm glad this paper has done so).
Truly useful LLM applications live at the boundaries of what the model can do. That is, attending to some aspect of the context that might be several logical "hops" away from the actual question or task.
I suspect that the context rot problem gets much worse for these more complex tasks... in fact, exponentially so for each logical "hop" required to answer successfully. Each hop compounds the "attention difficulty" that long or distracting contexts already increase.
The best results seem to come from clear, explicit instructions and a plan up front for a discrete change or feature, with the relevant files to edit dragged into the context prompt.
https://lukebechtel.com/blog/vibe-speccing
Instead I have a good instance going, but the model fumbles for 20k tokens and then that session is heavily rotted. Let me cut that part out!
LLMs-as-a-service don't offer this because it makes it trivial to bypass their censoring.
I'm sure it's all my poor prompting and context, but it really seems like Claude has lost 30 IQ points in the last few weeks.
Does this not feel like gaslighting we've all now internalized?
One paper that stood out to me a while back was Many-Shot In-Context Learning[1] which showed large positive jumps in performance from filling the context with examples.
As always, it's important to test on your own problem to see how the LLM's behavior changes with different context contents and lengths; I wouldn't assume a longer context is always worse.
[1] https://arxiv.org/pdf/2404.11018
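The many-shot setup is essentially just packing the prompt with labeled examples before the query; here is a toy sketch (the example data and prompt format are mine, not the paper's):

    # Many-shot in-context learning, sketched: prepend many labeled examples
    # to the query and read the model's completion as the label.

    def build_many_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
        shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
        return f"{shots}\n\nInput: {query}\nLabel:"

    examples = [
        ("the movie was a delight", "positive"),
        ("two hours I will never get back", "negative"),
        # ...hundreds or thousands more shots, as in the paper's scaling experiments
    ]
    prompt = build_many_shot_prompt(examples, "surprisingly good for a sequel")
    # send `prompt` to the model of your choice and take the completion as the label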
ICL is a phenomenon separate from long-context performance degradation, and the two can coexist; lost-in-the-middle, for example, affects the performance of in-context examples at different positions just the same.
It really depends on the task, but I imagine most real-world scenarios have a mixed bag of requirements, so it's not a needle-in-a-haystack problem but closer to ICL. Even memory retrieval (an example given in the post) can be tricky, because you can't always trust cosine similarity[1] on short text snippets to map cleanly to relevant memories; you may end up omitting good data and including bad data, which heavily skews the LLM the wrong way.
[1]: Coincidentally what the post author is selling
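To make that retrieval pitfall concrete, here is a toy sketch (the embedding function, snippets, and 0.8 cutoff are all made up, and not tied to any particular vector DB):

    # Toy illustration: cosine similarity over short snippets with a hard cutoff.
    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def embed(text: str) -> np.ndarray:
        # Crude stand-in for a real embedding model (assumption): hashed bag of words.
        vec = np.zeros(32)
        for word in text.lower().split():
            vec[sum(ord(ch) for ch in word) % 32] += 1.0
        return vec

    memories = [
        "user prefers dark mode",
        "user's cat is named Mochi",
        "user is allergic to peanuts",
    ]
    query = "what should I avoid putting in the user's lunch?"

    THRESHOLD = 0.8  # arbitrary cutoff: too high drops relevant memories,
                     # too low admits unrelated ones that skew the model
    q = embed(query)
    retrieved = [m for m in memories if cosine(embed(m), q) >= THRESHOLD]
    # With real embeddings, short snippets often score misleadingly against the
    # query, so a hard cutoff can omit the allergy note while keeping the cat fact.
    print(retrieved)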
Media literacy disclaimer: Chroma is a vectorDB company.