The key challenges I've faced with RAG are things like:
- Only works on text-based modalities (how can I use this with all types of source documents, including images?)
- Chunking "well" for the type of document (by paragraph, CSVs with the header repeated on every chunk, tables in PDFs, diagrams, etc.). The rudimentary chunk-by-character-with-overlap approach is demonstrably not very good for retrieval
- The R in RAG is really just "how can you do the best possible search for the given query?" The approach here is so simple that it is definitely not the best possible search. It's missing so many known techniques right now, like:
- Generate example queries that the chunk can answer and embed those to search against.
- Parent document retrieval
- So many newer RAG techniques that improve on plain chunk-based retrieval have been discussed and used
- How do you differentiate "needs the whole source" vs. "find in source" questions? Think: "summarize the entire PDF" vs. a specific question like "how long does it take for light to travel to the moon and back?"
- Also other search approaches like fuzzy/lexical search, and choosing between them based on criteria like "the user query is one word, so use fuzzy search instead of semantic search." Things like that.

So far this platform seems to just lock you into a really simple embedding pipeline that only supports the most basic chunk-based retrieval. I wouldn't use it unless there were some promise of it actually solving some of these challenges in RAG.
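To make the "embed the questions a chunk can answer" idea concrete, here's a minimal sketch. The `embed()` function is a bag-of-words stand-in for a real embedding model, and the generated questions are hand-written here rather than LLM-produced, so the whole flow runs end to end; the point is that the index maps question embeddings back to source chunks, so retrieval matches question-to-question instead of question-to-prose.

```python
# Sketch of query-generation retrieval. embed() is a toy stand-in
# for a real embedding model; generated questions are hand-written
# stand-ins for LLM output.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Stand-in embedding: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each entry maps a *generated question's* embedding to its source chunk.
index = []  # list of (question_embedding, chunk_text)

def ingest(chunk: str, generated_questions: list[str]) -> None:
    for q in generated_questions:
        index.append((embed(q), chunk))

def retrieve(query: str, k: int = 1) -> list[str]:
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[0]), reverse=True)
    seen, out = set(), []
    for _, chunk in ranked:
        if chunk not in seen:
            seen.add(chunk)
            out.append(chunk)
        if len(out) == k:
            break
    return out

ingest(
    "Light takes about 1.3 seconds to reach the Moon from Earth.",
    ["How long does light take to travel to the moon?",
     "What is the light travel time from Earth to the Moon?"],
)
ingest(
    "The Moon's average distance from Earth is about 384,400 km.",
    ["How far away is the moon?"],
)
print(retrieve("how long does it take for light to travel to the moon?"))
```

A user's question tends to look a lot more like another question than like the declarative prose of the chunk, which is why this often beats embedding the chunk text directly.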
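And the per-query routing idea from the last bullet can be sketched in a few lines. Both backends here are stubs (fuzzy matching over titles via `difflib`, and a trivial token-overlap score standing in for semantic search); the routing heuristic is the point.

```python
# Sketch of routing a query to the right search strategy:
# single-token queries go to a fuzzy/lexical matcher, longer
# ones to "semantic" search. Backends are deliberately trivial.
import difflib

DOCS = ["quarterly revenue report", "employee onboarding guide"]

def fuzzy_search(query: str) -> str:
    # Lexical: best fuzzy match against document titles.
    return difflib.get_close_matches(query, DOCS, n=1, cutoff=0.0)[0]

def semantic_search(query: str) -> str:
    # Stand-in for an embedding-based backend: token overlap.
    return max(DOCS, key=lambda d: len(set(query.lower().split()) & set(d.split())))

def route(query: str) -> str:
    # Heuristic from the bullet above: one-word queries are usually
    # lookups, so skip embeddings entirely.
    if len(query.split()) == 1:
        return fuzzy_search(query)
    return semantic_search(query)

print(route("onbording"))                # typo'd single word -> fuzzy matcher
print(route("revenue for last quarter"))  # full question -> semantic backend
```

In a real system you'd likely rank candidates from both backends rather than hard-routing, but even this crude rule handles the "user typed one misspelled word" case that pure semantic search fumbles.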
I've also dealt with what you're describing, but IME it goes much further once you get to prod. The ingestion part is even messier in ways these kinds of platforms don't seem to help with. When managing multiple tools in prod with overlapping, non-constant data sources (say, two tools that both need to know the price of a product, which can change at any time), I need both of them built on the same source of truth, and that source of truth fed by our data infra in real time, with the relevant documents replaced in real time in a more or less atomic way.
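The atomic-replacement invariant can be sketched with a toy in-memory index keyed by source document ID. A real system would do the same delete-and-reinsert inside one transaction (or swap a staging collection), but the invariant is the same: readers never see a mix of old and new chunks for one source.

```python
# Sketch of atomic per-source chunk replacement, assuming an
# in-memory index. The names here are illustrative, not a real API.
import threading

class ChunkIndex:
    def __init__(self):
        self._lock = threading.Lock()
        self._by_source = {}  # source_id -> list of chunk strings

    def replace_source(self, source_id: str, new_chunks: list[str]) -> None:
        # Swap the whole chunk list under the lock, so the update is
        # atomic from any reader's point of view.
        with self._lock:
            self._by_source[source_id] = list(new_chunks)

    def search(self, term: str) -> list[str]:
        with self._lock:
            return [c for chunks in self._by_source.values()
                    for c in chunks if term in c]

idx = ChunkIndex()
idx.replace_source("product-42", ["Widget price: $10"])
idx.replace_source("product-42", ["Widget price: $12"])  # price changed upstream
print(idx.search("price"))  # only the new chunk survives
```

Chunk-level upserts without a source-level key are where the stale-price bugs come from: the new chunks land but some of the old ones linger.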
Then, I have some tools with varying levels of permissioning on those overlapping data sources. Say you have two tools that exist in a classroom: one that helps the student based on their work, and another used by the TA or teacher to understand students' answers in a large course. They have overlapping data needs on otherwise private data, and this kind of permissioning layer, which is pretty trivial in a normal webapp, has IME had to be implemented basically from scratch on top of the vector DB and retrieval system.
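What that hand-rolled layer tends to look like is ACL metadata attached to every chunk at ingest time, filtered before (or during) retrieval, so the student tool and the teacher tool share one index but never leak across roles. A minimal sketch, with illustrative names and a substring match standing in for vector search:

```python
# Sketch of permission-filtered retrieval over a shared index.
# The ACL model (owner + extra readers) is a simplification.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    owner: str                                  # e.g. the student who wrote it
    readers: set = field(default_factory=set)   # extra principals (TA, teacher)

    def visible_to(self, principal: str) -> bool:
        return principal == self.owner or principal in self.readers

INDEX = [
    Chunk("Alice's essay draft", owner="alice", readers={"ta-bob"}),
    Chunk("Carol's essay draft", owner="carol", readers={"ta-bob"}),
]

def retrieve(query: str, principal: str) -> list[str]:
    # Filter first, then rank; the substring match is a stand-in
    # for a real vector similarity search.
    return [c.text for c in INDEX
            if c.visible_to(principal) and query in c.text.lower()]

print(retrieve("essay", "alice"))   # student sees only her own work
print(retrieve("essay", "ta-bob"))  # TA sees the whole section
```

The subtle production requirement is filtering *inside* the search rather than post-filtering top-k results, since otherwise a user with access to few documents can get an empty result set even when relevant permitted documents exist further down the ranking.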
Then experimentation, eval, testing, and releases are the hardest and most underserved parts. It was only relatively recently that anyone even seemed to be talking about eval as a problem worth solving. There's a pretty interesting and novel interplay between the problems of production ML eval (but with potentially sparse data) and conventional unit testing. This is the area where we had to put in the most of our own thought before I felt reasonably confident putting anything into prod.
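The retrieval half of that eval work can be as simple as a small labeled set of (query, relevant document) pairs scored with recall@k and run in CI like a unit test. A sketch, with a stubbed retriever and hypothetical doc IDs:

```python
# Sketch of a recall@k eval harness for the retrieval stage.
# The retriever and labels are stand-ins; the harness shape is the point.
def recall_at_k(retriever, labeled_queries, k=3):
    hits = 0
    for query, relevant_id in labeled_queries:
        if relevant_id in retriever(query)[:k]:
            hits += 1
    return hits / len(labeled_queries)

def stub_retriever(query):
    # Stand-in: a real retriever returns ranked document IDs.
    lookup = {
        "moon distance": ["doc-astro", "doc-misc"],
        "refund policy": ["doc-misc", "doc-legal"],
    }
    return lookup.get(query, [])

labeled = [("moon distance", "doc-astro"), ("refund policy", "doc-legal")]
score = recall_at_k(stub_retriever, labeled, k=2)
print(score)  # 1.0 -- both relevant docs appear in the top 2
assert score >= 0.9, "retrieval regression"
```

Even a few dozen labeled pairs like this turn "we changed the chunking and it feels worse" into a number you can gate a release on, which is the sparse-data-friendly middle ground between full ML eval and unit tests.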
FWIW, we just built our own internal platform on top of LangChain a while back; it seemed like a good balance of the right level of abstraction for our use cases, with solid productivity gains from the shared effort.
I think this is a really interesting problem space, but yeah, I'm skeptical of all of these platforms as they seem to always be promising a lot more than they're delivering. It looks superficially like there has been all of this progress on tooling, but I built a production service based on vector search in 2018 and it really isn't that much easier today. It works better because the models are so much better, but the tools and frameworks don't help that much with the hard parts, to my surprise honestly.
Perhaps I'm just not the user and am being excessively critical, but I keep having to deal with execs and product people throwing these frameworks at us internally without understanding the alignment between what is hard about building these kinds of services in prod and what these kinds of tools make easier vs harder.
How will the concept of RAG fare in the era of ultra large context windows and sub-quadratic alternatives to attention in transformers?
Another 12 months and we might have million+ token context windows at GPT-3.5 pricing.
For most use cases, does it even make sense to invest in RAG anymore?
If you are dealing with high-cardinality permissioning models (even just a large number of users who own their own data, and the problem compounds if you have overlapping permissions), then tuning a separate set of layers for every permission set is always going to be wasteful. Trusting a model to have some kind of "understanding" of its permissioning seems plausible assuming some kind of omniscient and perfectly aligned machine, but it's unrealistic in the foreseeable future and definitely not going to cut it for data regulations.
Also, in the current status quo I don't believe there is a solution on the horizon for continuous, rapid incremental training in prod, so any data sources that change often are also going to be best addressed this way. That will most likely be solved at some point, but it doesn't seem imminent, and regardless there will likely be some cost/performance balance where injecting context from after the training watermark at inference time still makes sense, to keep training costs manageable rather than having to iterate training on literally every single interaction.
But yeah, if you're just using it because you have a single collection of context, shared by many users, that's too large to fit into the prompt, then that does seem subject to the problem you're describing. Although there might still be some benefit to keeping the prompt short (for cost) and focused (for performance).