> Embeddings are crucial here, as they efficiently identify and integrate vital information—like documents, conversation history, and tool definitions—directly into a model's working memory.
I feel like I'm falling behind here, but can someone explain this to me?
My high-level view of embedding is that I send some text to the provider, they tokenize the text and then run it through some NN that spits out a vector of numbers of a particular size (it looks to be variable in this case, including 768, 1536 and 3072). I can then use those embeddings in places like a vector DB where I might want to do some kind of similarity search (e.g. cosine distance). I can also use them to do clustering on that similarity, which can give me some classification capabilities.
But how does this translate to these things being "directly into a model's working memory"? My understanding is that with RAG I just throw a bunch of the embeddings into a vector DB as keys, but the ultimate text I send in the context to the LLM is the source text that the keys represent. I don't actually send the embeddings themselves to the LLM.
So what is this marketing stuff about "directly into a model's working memory"? Is my mental view wrong?
RAG is taking a bunch of docs, chunking them into text blocks of a certain length (how best to do this is up for debate), creating a search API that takes a query (like a Google search) and compares it to the document chunks (very much how you're describing). Take the returned chunks, ignore the score from vector search, feed those chunks into a re-ranker along with the original query (this step is important; vector search mostly sucks), filter the re-ranked results down to the top 1-2, and then format a prompt like:
The user asked 'long query', we fetched some docs (see below), answer the query based on the docs (reference the docs if you feel like it)
Doc1.pdf - Chunk N
Eat cheese
Doc2.pdf - Chunk Y
Don't eat cheese
You then expose the search API as a "tool" for the LLM to call, slightly reformatting the prompt above into a multi-turn convo, and suddenly you're in ze money.
But once your users are happy with those results they'll want something dumb like the latest football scores, then you need a web tool - and then it never ends.
To be fair though, it's pretty powerful once you've got it in place.
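A rough sketch of that retrieve -> re-rank -> prompt flow, assuming a hypothetical `vector_search` backend for the vector DB and a cross-encoder from sentence-transformers as the re-ranker (names and prompt wording are illustrative, not any particular product's API):

    # Sketch of the retrieve -> re-rank -> prompt flow described above.
    # `vector_search` is a hypothetical stand-in for your vector DB query;
    # the cross-encoder model name is just one common choice.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def answer_prompt(query: str, vector_search, top_k: int = 20, keep: int = 2) -> str:
        # 1. Vector search returns candidate chunks (ignore its scores).
        chunks = vector_search(query, limit=top_k)  # e.g. [{"source": "Doc1.pdf", "text": "..."}]
        # 2. Re-rank the candidates against the original query.
        scores = reranker.predict([(query, c["text"]) for c in chunks])
        ranked = sorted(zip(scores, chunks), key=lambda p: -p[0])
        best = [c for _, c in ranked[:keep]]
        # 3. Format the prompt with the surviving chunks.
        docs = "\n\n".join(f'{c["source"]}\n{c["text"]}' for c in best)
        return (
            f"The user asked: {query}\n"
            f"We fetched some docs (see below); answer based on them.\n\n{docs}"
        )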
Sorry for my lack of knowledge, but I've been wondering: what if you ask the RAG a question where the answer is not close in embedding space to the embedded question? Won't that limit the quality of the result? Or how does a RAG handle that? I guess maybe the multi-turn convo you mentioned helps in this regard?
The way I see RAG is it's basically some sort of semantic search, where the query needs to be similar to whatever you are searching for in the embedding space in order to get good results.
> So what is this marketing stuff about "directly into a model's working memory"? Is my mental view wrong?
Context is sometimes called working memory. But no, your understanding is right: find the right documents through cosine similarity (and thus through embeddings), then add the content of those docs to the context.
One of the things I find confusing about this article is that the author positions RAG as being unrelated to both context engineering and vector search.
The "directly into working memory" bit is nonsense of course, but it does point to a problem that is probably worth solving.
What would it take to make the KV cache more portable and cut/paste vs. highly specific to the query?
In theory today, I should be able to process <long quote from document> <specific query> and just stop after the long document and save the KV cache, right? The next time around, I can just load it in, and continue from <new query>?
To keep going, you should be able to train the model to operate so that you can have discontinuous KV cache segments that are unrelated, so you can drop in <cached KV from doc 1> <cached KV from doc 2> with <query related to both> and have it just work ... but I don't think you can do that today.
I seem to remember seeing some papers that tried to "unRoPE" the KV and then "re-RoPE" it, so it can be reused ... but I have not seen the latest. Anybody know what the current state is?
Seems crazy to have to re-process the same context multiple times just to ask it a new query.
Do you have any links to the papers for the “unRoPE” and “re-Rope” technique? I tried some searching and couldn’t find anything. I would love to look into this idea more.
I think that copy/paste-able KV cache idea sounds pretty promising. It might lose some of the inter-document context and attention that would get built up in the hidden state of the model as it processes the prompt. Maybe just throw in some 'reasoning' tokens before it gives its answer to give it a chance to attend across documents.
> In theory today, I should be able to process <long quote from document> <specific query> and just stop after the long document and save the KV cache right?
People do this, it's called prefix caching.
There's also https://arxiv.org/abs/2506.06266 where they compress the context down to a smaller representation they call a "cartridge," and composing cartridges from different contexts seems to work reasonably well.
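A rough sketch of the save-the-KV-cache-after-the-document idea with Hugging Face transformers ("gpt2" is just a small stand-in model, and the exact cache object returned varies by library version):

    # Sketch of prefix caching: run the long document once, keep the KV cache,
    # then continue from a new query without re-processing the document.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    doc_ids = tok("<long quote from document>", return_tensors="pt").input_ids
    with torch.no_grad():
        prefix = model(doc_ids, use_cache=True).past_key_values  # the reusable KV cache

    query_ids = tok(" <new query>", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(query_ids, past_key_values=prefix, use_cache=True)
    # out.logits now conditions on document + query, with the document half cached.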
Perhaps the person that wrote it is also confused. I guess Gemini's embedding model offers multilingual support, but we can use anything. The assumption is the developer uses these embeddings on their end with their own implementation of storage/querying (their own vector DB). The confusing thing is the article is suggesting that whole process is now done automatically as soon as you send the embeddings to Gemini (which doesn't even make sense; shouldn't it only take text?).
At least in theory. If the model is the same, the embeddings can be reused by the model rather than recomputing them.
I believe this is what they mean.
In practice, how fast will the model change (including the tokenizer)? How fast will the vector DB be fully backfilled to match the model version?
That would be the “cache hit rate” of sorts and how much it helps likely depends on some of those variables for your specific corpus and query volumes.
This can't be what they mean. Even if this were somehow possible, embeddings lose information and are not reversible, i.e. embeddings do not magically compress actual text into a vector in a way that a model can implicitly recover the source text from the vector.
LLMs can't take embeddings (unless I'm really confused). Even if they could take embeddings, the embeddings would have lost all word sequence and structure (they wouldn't make sense to the LLM).
LLMs can use search engines as a tool. One possibility is Google embeds the search query through these embeddings and does retrieval using them, and then the retrieved result is pasted into the model's chain of thought (which, unless they have an external memory module in their model, is basically the model's only working memory).
I'm reading the docs and it does not appear Google keeps these embeddings at all. I send some text to them, they return the embedding for that text at the size I specified.
So the flow is something like:
1. Have a text doc (or library of docs)
2. Chunk it into small pieces
3. Send each chunk to <provider> and get an embedding vector of some size back
4. Use the embedding to:
4a. Semantic search / RAG: put the embeddings in a vector DB and do some similarity search on the embedding. The ultimate output is the source chunk
4b. Run a cluster algorithm on the embedding to generate some kind of graph representation of my data
4c. Run a classifier algorithm on the embedding to allow me to classify new data
5. The output of all the steps in 4 is, crucially, text
6. Send that text to an LLM
At no point is the embedding directly in the model's memory.
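A minimal sketch of that flow, with a hypothetical `embed()` provider call and a toy in-memory index; note that only chunk text ever reaches the LLM:

    # Sketch of steps 1-6: embed chunks, do cosine similarity in a toy in-memory
    # "vector DB", and hand only the winning chunks' *text* to the LLM.
    # `embed(text) -> list[float]` is a hypothetical provider call.
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def build_index(chunks, embed):
        return [(chunk, np.array(embed(chunk))) for chunk in chunks]     # steps 2-3

    def retrieve(query, index, embed, k=3):
        q = np.array(embed(query))
        ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]                        # step 4a: text out

    def build_prompt(query, index, embed):
        context = "\n\n".join(retrieve(query, index, embed))             # step 5: text
        return f"Answer using these excerpts:\n{context}\n\nQuestion: {query}"  # step 6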
You're right on this. "Advanced" RAG techniques are all complete marketing BS; in the end all you're doing is passing the text into the model's context window.
Your comment really helps me improve my mental model of LLMs. Can someone smarter help me verify my understanding:
1) At the end of the day, we are still sending raw text to the LLM as input to get output back as the response.
2) RAG/embedding is just a way to identify a "certain chunk" to be included in the LLM input so that you don't have to dump the entire ground-truth document into the LLM.
Let's take Everlaw for example: all of their legal docs are in embedding format, and a RAG/tool call will retrieve the relevant documents to feed into the LLM input.
So in that sense, what do these non-foundational-model startups mean when they say they are training or fine-tuning models? Where does the line fall between putting things into the LLM's input vs. having them baked into the model weights?
(1) and (2) are correct (well, I don’t know specifics of Everlaw). Fine tuning is something different, where you incrementally train the model itself further using more inputs, so that given the same input context it will produce better output in your use case.
To be more precise, you seldom directly continue training the model, because it's much cheaper and easier to add some small extra layers to the big model and train those instead (see LoRA or PEFT).
Something like Everlaw might do all three, by fine tuning a model to do better at discovery retrieval, then building a RAG system on top of that.
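For reference, a minimal sketch of the adapter approach with the PEFT library (base model and hyperparameters are arbitrary stand-ins, not what any particular company does):

    # Sketch: wrap a base model with small trainable LoRA adapters instead of
    # fine-tuning all of its weights. Hyperparameters here are illustrative only.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model
    lora = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
    model = get_peft_model(base, lora)
    model.print_trainable_parameters()  # only the adapter weights get gradients
    # ...train `model` on your domain data as usual, then save just the adapter:
    model.save_pretrained("my-discovery-adapter")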
Oh, what you don't understand is that LLMs also use embeddings inside; it's how they represent tokens. It's just that you don't get to see those embeddings - they are inner workings.
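You can see that internal table in open-weights models; a tiny sketch with transformers ("gpt2" as a stand-in):

    # Sketch: the "embeddings inside" an LLM are its token embedding table,
    # which is separate from a text-embedding API like gemini-embedding-001.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("Eat cheese", return_tensors="pt").input_ids
    vectors = model.get_input_embeddings()(ids)            # shape (1, n_tokens, 768)
    print(vectors.shape)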
> The Gemini embedding model, gemini-embedding-001, is trained using the Matryoshka Representation Learning (MRL) technique which teaches a model to learn high-dimensional embeddings that have initial segments (or prefixes) which are also useful, simpler versions of the same data. Use the output_dimensionality parameter to control the size of the output embedding vector. Selecting a smaller output dimensionality can save storage space and increase computational efficiency for downstream applications, while sacrificing little in terms of quality. By default, it outputs a 3072-dimensional embedding, but you can truncate it to a smaller size without losing quality to save storage space. We recommend using 768, 1536, or 3072 output dimensions. [0]
Looks like even the 256-dim embeddings perform really well.
[0]: https://ai.google.dev/gemini-api/docs/embeddings#quality-for...
> Both of our new embedding models were trained with a technique that allows developers to trade-off performance and cost of using embeddings. Specifically, developers can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties by passing in the dimensions API parameter. For example, on the MTEB benchmark, a text-embedding-3-large embedding can be shortened to a size of 256 while still outperforming an unshortened text-embedding-ada-002 embedding with a size of 1536.
It's a practical feature. Scaling is irrelevant in this context because it scales to the length of the embedding, although in batches of k-length embeddings.
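The shortening described in both quotes amounts to keeping a prefix of the vector and re-normalizing; a small sketch, assuming a 3072-dim input:

    # Sketch of Matryoshka-style truncation: keep the first k dimensions of an
    # embedding and re-normalize so cosine similarity still behaves.
    import numpy as np

    def truncate(vec: np.ndarray, k: int) -> np.ndarray:
        head = vec[:k]
        return head / np.linalg.norm(head)

    full = np.random.rand(3072)        # stand-in for a 3072-dim embedding
    small = truncate(full, 768)        # cheaper to store, loses little quality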
To anyone working in these types of applications, are embeddings still worth it compared to agentic search for text? If I have a directory of text files, for example, is it better to save all of their embeddings in a VDB and use that, or are LLMs now good enough that I can just let them use ripgrep or something to search for themselves?
If your LLM is good enough you'll likely get better results from tool calling with grep or a FTS engine - the better models can even adapt their search patterns to search for things like "dog OR canine" where previously vector similarity may have been a bigger win.
Getting embeddings working takes a bunch of work: you need to decide on a chunking strategy, then run the embeddings, then decide how best to store them for fast retrieval. You often end up having to keep your embedding store in memory which can add up for larger volumes of data.
I did a whole lot of work with embeddings last year but I've mostly lost interest now that tool-based-search has become so powerful.
Hooking up tool-based-search that itself uses embeddings is worth exploring, but you may find that the results you get from ripgrep are good enough that it's not worth the considerable extra effort.
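A minimal sketch of exposing ripgrep as a tool (the JSON-schema-style tool definition is the generic function-calling shape; exact wiring depends on your provider or agent framework):

    # Sketch: a grep-style search tool an LLM can call instead of vector search.
    import subprocess

    def search_files(pattern: str, directory: str = ".") -> str:
        """Run ripgrep and return matching lines (path:line:text)."""
        result = subprocess.run(
            ["rg", "--line-number", "--max-count", "5", pattern, directory],
            capture_output=True, text=True,
        )
        return result.stdout or "no matches"

    # Generic tool schema to register with whichever model API you use.
    search_tool = {
        "name": "search_files",
        "description": "Search text files for a regex pattern, e.g. 'dog|canine'.",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {"type": "string"},
                "directory": {"type": "string"},
            },
            "required": ["pattern"],
        },
    }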
If you have a million records of unstructured text (very common - maybe website scrapes of product descriptions, user reviews, etc.), you want to be doing an embedding search on these to get the most relevant docs.
If you have a hundred .py files, then you want your agent to navigate through these with a grep tool.
With the caveat that I have not spent a serious amount of time trying to get RAG to work - my brief attempt to use it via AWS knowledge base to compare it vs agentic search resulted in me sticking with agentic search (via Claude code SDK)
My impression was there’s lots of knobs you can tune with RAG and it’s just more complex in general - so maybe there’s a point where the amount of text I have is large enough that that complexity pays off - but right now agentic search works very well and is significantly simpler to get started with
Use pymupdf to extract the PDF text. Hell, run that nasty business through an LLM as step-2 to get a beautiful clean markdown version of the text. Lord knows the PDF format is horribly complex!
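A minimal pymupdf sketch for step 1; the LLM cleanup pass is provider-specific and left as a placeholder:

    # Sketch: step 1, pull raw text out of a PDF with pymupdf; step 2 (cleaning it
    # into markdown with an LLM) depends on your provider and is stubbed out.
    import fitz  # pymupdf

    def extract_text(path: str) -> str:
        with fitz.open(path) as doc:
            return "\n\n".join(page.get_text() for page in doc)

    raw = extract_text("contract.pdf")       # hypothetical input file
    # cleaned_markdown = call_your_llm(f"Convert to clean markdown:\n{raw}")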
We OCR them with an LLM into markdown. Super expensive and slow but way more reliable than trying to decode insanely structured PDFs that users upload, which often include pages that are images of the text, or diagrams and figures that need to be read.
Really depends on your scale and speed requirements.
Question to other GCP users: how are you finding Google's aggressive deprecation of older embedding models? Feels like you have to pay to re-run your data through every 12 months.
You know of lots of LLM-using apps that don't need to re-run their fine tunings or embeddings because of improvements or new features at least annually? Things are moving so fast that "every 12 months" seems kinda slow...
This is precisely the risk I’ve been wondering about with vectorization. I’ve considered that an open source model might be valuable as you could always find someone to host it for you and control the deprecation rate yourself.
I have been thinking about solving this problem. I think one of the reasons some AI assistants shine vs others is how they can reduce the amount of context the LLM needs to work with using built-in tools. I think there's room to democratize these capabilities. One such capability is allowing the LLMs to directly work with the embeddings.
I wrote an MCP server, directory-indexer[1], for this (a self-hosted indexing MCP server). The goal is to index any directories you want your AI to know about and give it MCP tools to search through the embeddings etc. While agentic grep may be valuable, when working with tons of files on similar topics (like customer cases, technical docs), pre-processed embeddings have proven valuable for me. One reason I really like it is that it democratizes my data and documents: it gives consistent results when working with different AI assistants - the alternative being vastly different results based on the built-in capabilities of the coding assistants. Another is having access to your "knowledge" from any project you're on. Though since this is self-hosted, I use nomic-embed-text for the embedding, which has been sufficient for most use cases.
[1] https://github.com/peteretelej/directory-indexer
> The way I see RAG is it's basically some sort of semantic search, where the query needs to be similar to whatever you are searching for in the embedding space in order to get good results.
I've been thinking about this because it would be nice to have a fuzzier search.
> you should be able to train the model to operate so that you can have discontinuous KV cache segments that are unrelated
imo the discontinuous segments bit would not work because of the causal dependence in transformers + RoPE as you mention, but maybe it could be possible.
> So what is this marketing stuff about "directly into a model's working memory"?
They're listing applications of that by third parties to demonstrate the use case, but this is just a model for generating those vectors.
> If the model is the same, the embeddings can be reused by the model rather than recomputing them.
I can't find any evidence that this is possible with Gemini or any other LLM provider.
https://huggingface.co/spaces/mteb/leaderboard
Particularly Qwen3-Embedding-8B and Qwen3-Embedding-4B:
https://huggingface.co/Qwen/Qwen3-Embedding-8B
I made a small tool to help me compare various embedding models: https://www.vectorsimilaritytest.com/
Qwen embedding models score very highly but are highly sensitive to word order (they use last-token pooling, which, simplified, means they only look at the last word of the input). Change the word order and the scores change completely. Voyage models score highly too, but changing a word from singular to plural can again completely change the scores.
I find myself doing a hybrid search, reranking and shortlisting the results, then feeding them to an LLM to judge what is and isn't relevant.
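One common way to do the hybrid part is reciprocal rank fusion over the keyword and vector result lists before re-ranking; a small sketch (the two input rankings are assumed to come from an FTS engine and a vector DB):

    # Sketch of reciprocal rank fusion (RRF): merge a keyword (e.g. BM25) ranking
    # and a vector-similarity ranking before re-ranking / LLM judging.
    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    keyword_hits = ["doc3", "doc1", "doc7"]       # from your FTS engine
    vector_hits = ["doc1", "doc9", "doc3"]        # from your vector DB
    shortlist = rrf([keyword_hits, vector_hits])  # feed the top few to a re-ranker/LLM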
I've seen this Matryoshka-style truncation in a few models now - Nomic Embed 1.5 was the first: https://www.nomic.ai/blog/posts/nomic-embed-matryoshka
But am I crazy or did the pre-production version of gemini-embedding-001 have a much larger max context length?
Edit: It seems like it did? 8k -> 2k? Huge downgrade if true - I was really excited about the experimental model reaching GA before that.
There are some good open models there that have longer context limits and fewer dimensions.
The benchmarks are just a guide. It's best to build a test dataset with your own data. This is a good example of that: https://github.com/beir-cellar/beir/wiki/Load-your-custom-da...
Another benefit of having your own test dataset is that it can grow as your data grows. And you can quickly test new models to see how they perform with YOUR data.
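A minimal sketch of evaluating an embedding model on a custom BEIR-format dataset, following the linked wiki (the folder layout and model name are assumptions):

    # Sketch: load a custom BEIR-format dataset (corpus.jsonl, queries.jsonl,
    # qrels/test.tsv) and evaluate an embedding model on YOUR data.
    from beir.datasets.data_loader import GenericDataLoader
    from beir.retrieval import models
    from beir.retrieval.evaluation import EvaluateRetrieval
    from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

    corpus, queries, qrels = GenericDataLoader("my_dataset/").load(split="test")

    model = DRES(models.SentenceBERT("BAAI/bge-small-en-v1.5"), batch_size=64)
    retriever = EvaluateRetrieval(model, score_function="cos_sim")
    results = retriever.retrieve(corpus, queries)
    ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)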
> One such capability is allowing the LLMs to directly work with the embeddings.
For now, https://exa.ai/ seems to be the only one currently doing something similar.