lihanc111 commented on Reuse non-prefix KV Cache and speed up RAG by 3X with LMCache   github.com/LMCache/LMCach... · Posted by u/lihanc111
lihanc111 · 2 months ago
Hey HN Community!

A while back, we shared our open-source project LMCache here and were blown away by the incredible support and feedback. Today, our team is thrilled to share more about one of our core components: CacheBlend. Recognized with a Best Paper Award at ACM EuroSys 2025, this technique is a painkiller for efficient RAG applications.

The Problem: Your KV Cache is Wasting Potential

In modern LLM applications like RAG and Agents, we constantly feed the model new context. For example, in RAG, we retrieve relevant documents and stuff them into the prompt.

The issue is that this dynamically retrieved context doesn't always appear at the beginning of the input sequence. Traditional KV caching only reuses a "common prefix," so if the new information isn't at the very start, the cache hit rate plummets and your GPU ends up recomputing the same things over and over.

The Solution: CacheBlend - 100% Hit Rate, No Compromises

CacheBlend changes the game by allowing the reuse of pre-computed KV caches regardless of their position in the input sequence.
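To make the contrast concrete, here is a toy illustration (hypothetical helper code written for this post, not LMCache's API): two RAG prompts reuse the same retrieved chunks in a different order, so prefix-only caching finds almost nothing to reuse, while chunk-level lookup hits every chunk.

```python
import os

# Two prompts built from the same retrieved chunks, just in a different order.
chunks = {
    "doc_a": "tokens of document A ...",
    "doc_b": "tokens of document B ...",
}
question = "user question ..."

prompt_1 = chunks["doc_a"] + chunks["doc_b"] + question
prompt_2 = chunks["doc_b"] + chunks["doc_a"] + question   # same docs, new order

# Prefix reuse: only the shared leading span can be served from cache,
# and it collapses as soon as the chunk order changes.
common_prefix = os.path.commonprefix([prompt_1, prompt_2])
print("prefix reuse:", len(common_prefix), "of", len(prompt_2), "characters")

# Chunk-level reuse (the CacheBlend setting): each chunk's KV cache is keyed
# by the chunk itself, so every retrieved chunk is a hit wherever it appears.
kv_store = {name: f"<KV cache of {name}>" for name in chunks}
hits = [name for name in ("doc_b", "doc_a") if name in kv_store]
print("chunk-level hits:", hits)
```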

This means we can finally achieve a 100% KV Cache hit rate in applications like RAG. The performance gains are significant:

Faster Time-To-First-Token (TTFT): Get your initial response much quicker.

More Throughput: Serve significantly more users with the same hardware.

Almost Lossless Output Quality: All of this is achieved with little degradation in the model's generation quality.

How does it work?

CacheBlend intelligently handles the two main challenges of reusing non-prefix caches (see the sketch after this list):

Positional Encoding Update: It efficiently updates positional encodings to ensure the model always knows the correct position of each token, even when we're stitching together cached and new data.

Selective Attention Recalculation: Instead of recomputing everything, it strategically recalculates only the minimal cross-attention needed between the new and cached chunks to preserve generation quality.
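Here is a rough PyTorch sketch of both ideas. The shapes, names, and the deviation-based token selection are my own simplifications for illustration, not LMCache internals; the actual algorithm is in the paper linked below.

```python
import torch

def rope_rotate(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape [seq, dim]."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions[:, None].float() * freqs[None, :]          # [seq, half]
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# 1) Positional encoding update: a chunk's keys were cached at offset 0,
#    but the chunk now sits at offset 128 in the new prompt. Because RoPE
#    rotations compose, rotating the cached keys by the position delta is
#    equivalent to recomputing RoPE at the new positions.
chunk_len, head_dim = 16, 64
raw_k = torch.randn(chunk_len, head_dim)
cached_k = rope_rotate(raw_k, torch.arange(chunk_len))             # as cached
shifted_k = rope_rotate(cached_k, torch.full((chunk_len,), 128))   # cheap fix-up
fresh_k = rope_rotate(raw_k, torch.arange(128, 128 + chunk_len))   # ground truth
assert torch.allclose(shifted_k, fresh_k, atol=1e-4)

# 2) Selective recomputation: instead of re-running attention for every cached
#    token, pick only the tokens whose cached KV deviates most from a freshly
#    computed reference and recompute just that small fraction.
deviation = torch.randn(chunk_len).abs()     # stand-in for the measured KV difference
recompute_ratio = 0.15
k = max(1, int(chunk_len * recompute_ratio))
tokens_to_recompute = torch.topk(deviation, k).indices
print("recompute only tokens", tokens_to_recompute.tolist())
```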

For detailed analysis, please refer to the official paper: https://dl.acm.org/doi/10.1145/3689031.3696098

Where can I try it?

Our official repo: https://github.com/LMCache/LMCache

The newest interactive CacheBlend demo: https://github.com/LMCache/LMCache-Examples/tree/main/demo-r...

Ask us anything!

lihanc111 commented on Lossless LLM 3x Throughput Increase by LMCache   github.com/LMCache/LMCach... · Posted by u/lihanc111
sgammon · 2 months ago
Hey LMCache team! Saw you guys at OSS N.A. but wasn’t able to set aside time to say hello. We’d love to chat about collaborating. Is there an email we can reach out to?
lihanc111 · 2 months ago
Please send to contact@lmcache.ai
lihanc111 commented on Lossless LLM 3x Throughput Increase by LMCache   github.com/LMCache/LMCach... · Posted by u/lihanc111
0xjunhao · 3 months ago
Hi, I had a quick question. Would it be correct to say the following?

1. For long inputs and short outputs, inference can be an arbitrary number of times faster, since it avoids repeated KV computation.

2. Conversely, for short inputs and long outputs, it might be slightly slower, since loading and storing the KV cache are on the critical path of the execution.

lihanc111 · 2 months ago
That is roughly true for both. Although in the second case, you can simply skip storing the KV cache when there is little improvement to be had.
lihanc111 commented on Lossless LLM 3x Throughput Increase by LMCache   github.com/LMCache/LMCach... · Posted by u/lihanc111
nativeit · 2 months ago
Has it been used in IBM's inference stack, or used with IBM's inference stack? In other words, has this been merged into IBM's own repositories, or has someone just tested it using them?
lihanc111 · 2 months ago
It is in IBM's llm-d open-source stack.
lihanc111 commented on Lossless LLM 3x Throughput Increase by LMCache   github.com/LMCache/LMCach... · Posted by u/lihanc111
lihanc111 · 3 months ago
Our team has built this open-source project, LMCache, to reduce repetitive computation in LLM inference so that systems can serve more people (3x more throughput in chat applications). It has been used in IBM's open-source LLM inference stack.

In LLM serving, the input is first computed into intermediate states called the KV cache, which are then used to generate answers. These data are relatively large (~1-2 GB for a long context) and are often evicted when GPU memory runs low. When a user then asks a follow-up question, the software has to recompute the same KV cache. LMCache is designed to combat that by efficiently offloading and loading these KV caches to and from DRAM and disk.
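For intuition, here is a minimal sketch of the offloading idea: a tiered store that keeps hot KV entries on the GPU, spills to CPU DRAM and then to disk, and pulls entries back up when a follow-up question reuses the same context. The class, eviction policy, and file layout are my own toy simplifications, not LMCache's actual implementation.

```python
import os
import torch

class TieredKVStore:
    """Toy GPU -> CPU DRAM -> disk cache for KV tensors."""

    def __init__(self, cache_dir="kv_cache", gpu_budget=4):
        self.gpu, self.cpu = {}, {}          # key -> KV tensor
        self.gpu_budget = gpu_budget         # max entries kept on the GPU
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def put(self, key, kv):
        if len(self.gpu) >= self.gpu_budget:     # evict oldest GPU entry to DRAM
            old_key = next(iter(self.gpu))
            self.cpu[old_key] = self.gpu.pop(old_key).cpu()
        self.gpu[key] = kv

    def get(self, key):
        if key in self.gpu:                      # hot: already on the GPU
            return self.gpu[key]
        if key in self.cpu:                      # warm: promote DRAM -> GPU
            kv = self.cpu.pop(key)
            self.put(key, kv.to("cuda") if torch.cuda.is_available() else kv)
            return self.gpu[key]
        path = os.path.join(self.cache_dir, f"{key}.pt")
        if os.path.exists(path):                 # cold: load from disk and promote
            self.put(key, torch.load(path))
            return self.gpu[key]
        return None                              # miss: the KV must be recomputed

    def spill_to_disk(self, key):
        if key in self.cpu:                      # push a DRAM entry down to disk
            torch.save(self.cpu.pop(key),
                       os.path.join(self.cache_dir, f"{key}.pt"))
```

In this toy version, a follow-up question whose context hashes to the same key gets its KV back from DRAM or disk instead of being recomputed on the GPU.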

Ask us anything!
