I work on research at Chroma, and I just published our latest technical report on context rot.
TLDR: Model performance is non-uniform across context lengths, including state-of-the-art GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 models.
This highlights the need for context engineering. Whether relevant information is present in a model’s context is not all that matters; what matters more is how that information is presented.
Here is the complete open-source codebase to replicate our results: https://github.com/chroma-core/context-rot
Especially with Gemini Pro when providing long-form textual references: putting many documents into a single context window gives worse answers than having it summarize the documents first, answer questions against the summaries only, and then provide the full text of individual sub-documents on request (RAG-style, or just a simple agent loop).
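A minimal sketch of that summarize-first flow, assuming a placeholder llm() chat call and made-up helper names (nothing here is from the report):

    # Hypothetical summarize-first workflow: keep only summaries in the main
    # context, and pull full sub-document text in on demand.

    def llm(prompt: str) -> str:
        """Stand-in for whatever chat-completion call you use (assumption)."""
        raise NotImplementedError

    def summarize(doc: str) -> str:
        return llm(f"Summarize this document in a few sentences:\n\n{doc}")

    def answer(question: str, docs: dict[str, str]) -> str:
        # 1. Summarize each document so the working context stays small.
        summaries = {name: summarize(text) for name, text in docs.items()}

        # 2. Ask which documents are actually needed, using summaries only.
        index = "\n".join(f"- {name}: {s}" for name, s in summaries.items())
        wanted = llm(
            f"Question: {question}\n\nDocument summaries:\n{index}\n\n"
            "List the names of the documents you need in full, one per line."
        ).splitlines()

        # 3. Provide only the requested full texts, then answer.
        selected = "\n\n".join(docs[w.strip()] for w in wanted if w.strip() in docs)
        return llm(f"Question: {question}\n\nDocuments:\n{selected}\n\nAnswer:")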
Similarly, I've personally noticed that Claude Code with Opus or Sonnet gets worse the more compactions happen. It's unclear to me whether the summary itself gets worse, or whether the context window ends up with a higher percentage of less relevant data, but even clearing the context and asking it to re-read the relevant files (even if they were mentioned and summarized in the compaction) gives better results.
Long story short: Context engineering is still king, RAG is not dead
LLMs will need RAG one way or another; you can hide it from the user, but it still has to be there.
The thing that would signal context rot is when you approach the auto-compact threshold. Am I thinking about this right?
It's actually even more significant than is easy to benchmark (though I'm glad this paper has done so).
Truly useful LLM applications live at the boundaries of what the model can do. That is, attending to some aspect of the context that might be several logical "hops" away from the actual question or task.
I suspect that the context rot problem gets much worse for these more complex tasks... in fact, exponentially so for each logical "hop" required to answer successfully. Each hop compounds the "attention difficulty" that long or distracting contexts already increase.
The best results seem to come from clear, explicit instructions and a plan up front for a discrete change or feature, with the relevant files to edit dragged into the context prompt.
https://lukebechtel.com/blog/vibe-speccing
Instead I have a good instance going, but the model fumbles for 20k tokens and then that session is heavily rotted. Let me cut that part out!
LLMs-as-a-service don't offer this because it makes it trivial to bypass their censoring.
I'm sure it's all my poor prompting and context, but it really seems like Claude has lost 30 IQ points in the last few weeks.
Does this not feel like gaslighting we've all now internalized?
One paper that stood out to me a while back was Many-Shot In-Context Learning[1] which showed large positive jumps in performance from filling the context with examples.
As always, it's important to test on your own problem to see how the LLM's behavior changes with different context contents and lengths; I wouldn't assume a longer context is always worse.
[1] https://arxiv.org/pdf/2404.11018
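The many-shot setup is essentially just packing the prompt with labeled examples before the query; here is a toy sketch (the example data and prompt format are mine, not the paper's):

    # Many-shot in-context learning, sketched: prepend many labeled examples
    # to the query and read the model's completion as the label.

    def build_many_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
        shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
        return f"{shots}\n\nInput: {query}\nLabel:"

    examples = [
        ("the movie was a delight", "positive"),
        ("two hours I will never get back", "negative"),
        # ...hundreds or thousands more shots, as in the paper's scaling experiments
    ]
    prompt = build_many_shot_prompt(examples, "surprisingly good for a sequel")
    # send `prompt` to the model of your choice and take the completion as the label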
ICL is a phenomenon separate from long-context performance degradation, and the two can coexist; lost-in-the-middle, for example, affects the performance of in-context examples at different positions just the same.
It really depends on the task, but I imagine most real-world scenarios have a mixed bag of requirements, so it's not a needle-in-a-haystack problem but closer to ICL. Even memory retrieval (an example given in the post) can be tricky, because you can't always trust cosine similarity[1] on short text snippets to map cleanly to relevant memories; you may end up omitting good data and including bad data, which heavily skews the LLM the wrong way.
[1]: Coincidentally what the post author is selling
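To make that retrieval pitfall concrete, here is a toy sketch (the embedding function, snippets, and 0.8 cutoff are all made up, and not tied to any particular vector DB):

    # Toy illustration: cosine similarity over short snippets with a hard cutoff.
    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def embed(text: str) -> np.ndarray:
        # Crude stand-in for a real embedding model (assumption): hashed bag of words.
        vec = np.zeros(32)
        for word in text.lower().split():
            vec[sum(ord(ch) for ch in word) % 32] += 1.0
        return vec

    memories = [
        "user prefers dark mode",
        "user's cat is named Mochi",
        "user is allergic to peanuts",
    ]
    query = "what should I avoid putting in the user's lunch?"

    THRESHOLD = 0.8  # arbitrary cutoff: too high drops relevant memories,
                     # too low admits unrelated ones that skew the model
    q = embed(query)
    retrieved = [m for m in memories if cosine(embed(m), q) >= THRESHOLD]
    # With real embeddings, short snippets often score misleadingly against the
    # query, so a hard cutoff can omit the allergy note while keeping the cat fact.
    print(retrieved)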
Media literacy disclaimer: Chroma is a vectorDB company.