Best case scenario, you can come up with a chunking strategy specific to your use case that makes it work: things like grouping all the paragraphs and tables about a register together, keeping tables of physical properties in a datasheet attached to their titles, or merging the paragraphs of a PCB layout guideline into a single unit. You also have to figure out how much overlap to allow between the different types of chunks and how many dimensions you need in the output vectors. Then you have to link chunks together (rough sketch below), so that when your RAG pipeline matches a register's description, it knows to also pull in the chunk containing the actual documentation, and the LLM gets something it can use rather than just the description. I've had to train many a classifier to get this part even remotely usable in nontrivial use cases like caselaw.
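To make the linking concrete, here's a minimal sketch in plain Python (all names and data are hypothetical, not any particular library's API): each chunk carries the IDs of its companion chunks, and retrieval expands every hit to include them.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str
    kind: str                      # e.g. "description" or "documentation"
    linked_ids: list[str] = field(default_factory=list)

# Hypothetical example: a register's prose description linked to the
# table that actually documents its bit fields.
chunks = {
    "reg42-desc": Chunk("reg42-desc", "The CTRL register controls ...",
                        "description", linked_ids=["reg42-table"]),
    "reg42-table": Chunk("reg42-table", "| bit | name | reset | ... |",
                         "documentation"),
}

def expand_hits(hit_ids: list[str]) -> list[Chunk]:
    """Follow links so the LLM sees the documentation chunk,
    not just the description chunk that matched the query."""
    seen, out = set(), []
    for cid in hit_ids:
        for linked in [cid, *chunks[cid].linked_ids]:
            if linked not in seen:
                seen.add(linked)
                out.append(chunks[linked])
    return out

# A vector search that only matched the description still returns both:
print([c.chunk_id for c in expand_hits(["reg42-desc"])])
```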
Worst case scenario, you have to finetune your own embedding model, because the colloquialisms the general-purpose ones are trained on have little overlap with the terms of art and jargon used in the documents (this is especially bad for legal and highly technical texts IME). That generally requires thousands of examples created by an expert in the field.
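If you do go down that road, the mechanics are the easy part; building the dataset is the hard part. A rough sketch using sentence-transformers (the `fit`/`InputExample` API is version-dependent, and the two pairs here are stand-ins for the thousands of expert-written examples you'd actually need):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Stand-in data: in practice, thousands of expert-curated pairs mapping
# how practitioners phrase queries to the jargon in the documents.
train_examples = [
    InputExample(texts=["summary judgment standard",
                        "no genuine dispute as to any material fact"]),
    InputExample(texts=["register reset value",
                        "POR default of the CTRL register is 0x0000"]),
]

loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("domain-tuned-embedder")
```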
> Every time I've tried to apply general purpose RAG tools to specific types of documents like medical records, internal knowledge base, case law, datasheets, and legislation, it's been a mess.
Would it be fair to paraphrase you as saying that people should avoid using _any_ library's ready-made components for a RAG pipeline, or do you think there's something specific to LangChain that is making it harder for people to achieve their goals when they use it? Either way, is there more detail that you can share on this? Even if it's _any_ library - what are we all getting wrong?
Not trying to correct you here - rather stating my perspective in hopes that you'll correct it (pretty please) - but my take as someone who was a user before joining the company is that LangChain is a good starting point because of the _structure_ it provides, rather than the specific components.
I don't know what the specific design intent was (again, new to the team!), but candidly, as a user I tend to look at the components as stand-ins that'll help me get something up and running super quickly so I can start building out evals. I might be unusual in this, but I tend to think that until I have evals, I don't really have any idea whether my changes are actually improvements or not. Once I have evals running against something that does _roughly_ what I want it to do, I can start optimizing the end-to-end workflow (a tiny example of what I mean by evals below). I suspect in 99.9% of cases that'll involve replacing some (many?) of our prebuilt components with custom ones more tailored to your specific task.
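For what it's worth, the eval doesn't have to be fancy to be useful. Something as small as this, a plain hit-rate over golden question/document pairs (all names and data here are hypothetical), is enough to tell you whether a chunking change or component swap actually helped:

```python
# Hypothetical minimal retrieval eval: golden (question -> expected doc id)
# pairs, scored by whether the expected doc appears in the top-k hits.
golden = [
    ("What is the reset value of CTRL?", "reg42-table"),
    ("Minimum trace clearance for 48V?", "pcb-clearance-3"),
]

def hit_rate(retrieve, k: int = 5) -> float:
    hits = 0
    for question, expected_id in golden:
        top_ids = [doc_id for doc_id, _ in retrieve(question)[:k]]
        hits += expected_id in top_ids
    return hits / len(golden)

def dummy_retrieve(question: str):
    # Stand-in for your real retriever: query -> ranked (doc_id, score) pairs.
    return [("reg42-table", 0.9), ("pcb-clearance-3", 0.7)]

print(hit_rate(dummy_retrieve))  # swap components, rerun, compare
```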
Complete side note, but for anyone looking at LangChain to build out RAG stuff today, I'd advise using LangGraph for structuring your end-to-end process. You can still pull in components for individual process steps from LangChain (or any other library you prefer) as needed, and you can still use LangChain pipelines as individual workflow steps if you want to, but I think you'll find that LangGraph is a more flexible foundation to build upon when it comes to defining the structure of your overall workflow.
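A minimal skeleton of what that looks like, with placeholder node bodies you'd replace with real retriever and LLM calls (exact imports can vary by langgraph version):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    docs: list[str]
    answer: str

def retrieve(state: State) -> dict:
    # Plug in any retriever here: a LangChain component or your own code.
    return {"docs": ["...chunks relevant to " + state["question"]]}

def generate(state: State) -> dict:
    # Plug in any LLM call here; placeholder answer for the sketch.
    return {"answer": f"Answer based on {len(state['docs'])} chunks"}

builder = StateGraph(State)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

graph = builder.compile()
print(graph.invoke({"question": "What does the CTRL register do?"}))
```

The nice part is that each node is just a function over shared state, so swapping a prebuilt component for a custom one is a one-line change to the graph.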
If you ever see one of those contracts here, it's usually for a very reasonable situation and a well-paid position.
It’s not at all a ridiculous ask, either. I’ve made a career out of going after high-impact roles in whatever is the fastest growing area of technology at the time. The non-compete isn’t just asking me to sacrifice the income from my next role, it’s asking me to sacrifice the experience as well. It also limits my ability to renegotiate comp while on the job, because they know your BATNA isn’t to just go get a better offer from a competitor.
If a company wants me to give all of that up, I’m sure as shit not doing it just for the privilege of working for them.