I’ve been working on a project called L88, a local RAG system. I initially focused on the UI/UX, so the retrieval and model architecture still need proper refinement.
Repo: https://github.com/Hundred-Trillion/L88-Full
I’m running this on 8GB VRAM and a strong CPU (128GB RAM). Embeddings and preprocessing run on CPU, and the main model runs on GPU. One limitation I ran into is that my evaluator and generator LLM ended up being the same model due to compute constraints, which defeats the purpose of evaluation.
I’d really appreciate feedback on:
- Better architecture ideas for small-VRAM RAG
- Splitting evaluator/generator roles effectively
- Improving the LangGraph pipeline
- Any bugs or design smells you notice
- Ways to optimize the system for local hardware
I’m 18 and still learning a lot about proper LLM architecture, so any technical critique or suggestions would help me grow as a developer. If you check out the repo or leave feedback, it would mean a lot — I’m trying to build a solid foundation and reputation through real projects.
Thanks!
1. Separate your query analysis from retrieval. A single LLM call can classify the query type, decide whether to use hybrid search, and pick search parameters all at once. That saves round-trips compared to running router, analyzer, and rewriter sequentially.
2. If you add BM25 alongside vector search, the blend ratio matters a lot by query type. Exact-match queries need heavy keyword weighting, while conceptual questions need more embedding weight. A static 50/50 split leaves performance on the table.
3. For your evaluator/generator being the same model — one practical workaround is to skip LLM-as-judge evaluation entirely and use a small cross-encoder reranker between retrieval and generation instead. It catches the cases where vector similarity returns semantically related but not actually useful chunks, and it gives you a relevance score you can threshold on without needing a separate evaluation model.
4. Consider a two-level cache: exact match (hash the query, short TTL) plus a semantic cache (cosine similarity threshold on the query embedding, longer TTL). The semantic layer catches "how do I X" vs "what's the way to X" without hitting the retriever again.
What model are you using for generation on the 8GB? That constraint probably shapes a lot of the architecture choices downstream.
You’re right about my query flow: I’m still doing separate LLM calls for the router, analyzer, and rewriter. Merging that into one should cut latency a lot, especially since Qwen2.5-7B-AWQ on an RTX 4000 Ada only gives me ~15–25 tok/s.
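As a sketch of what that merged call could look like: one prompt that asks the model for a single JSON object covering the route, the hybrid flag, and the rewritten query, plus a defensive parser. The prompt wording, the field names, and the canned reply below are all illustrative, not from the repo; in practice `reply` would come from your Qwen2.5-7B-AWQ endpoint.

```python
import json

# One merged analysis call: instead of separate router, analyzer, and
# rewriter prompts, ask the model for a single JSON object covering all
# three decisions. Prompt text and field names are illustrative.
ANALYSIS_PROMPT = (
    "Analyze the user query and reply with JSON only, for example:\n"
    '{"route": "rag", "use_hybrid": true, "rewritten_query": "..."}\n'
    "Query: "
)

def parse_analysis(raw: str) -> dict:
    """Parse the model's JSON reply, falling back to safe defaults
    (route everything to RAG) when the model emits malformed JSON."""
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        out = {}
    return {
        "route": out.get("route", "rag"),
        "use_hybrid": bool(out.get("use_hybrid", False)),
        "rewritten_query": out.get("rewritten_query", ""),
    }

# A canned reply standing in for the actual model call:
reply = '{"route": "rag", "use_hybrid": true, "rewritten_query": "L88 install steps"}'
plan = parse_analysis(reply)
```

The fallback matters more than the prompt: a 7B model will occasionally return broken JSON, and defaulting to the full RAG path is the safe failure mode.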
The BM25 point is spot-on too. I’ve been running pure vector search (BGE-base-en-v1.5 + FAISS, reranked with bge-reranker-v2-m3). Adding BM25 with dynamic weighting — especially for exact-match queries like titles/authors — is something I really shouldn’t keep putting off.
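A minimal sketch of that dynamic blending, assuming you min-max normalize each score list before fusing (BM25 and cosine scores live on very different scales). The 0.7/0.3 weights are placeholder starting points, not tuned values:

```python
def minmax(scores):
    """Scale a score list to [0, 1] so BM25 and cosine scores are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25_scores, vector_scores, keyword_weight):
    """Blend normalized BM25 and vector scores per document.
    keyword_weight is picked per query type, e.g. ~0.7 for exact-match
    queries (titles/authors) and ~0.3 for conceptual questions."""
    b = minmax(bm25_scores)
    v = minmax(vector_scores)
    return [keyword_weight * bs + (1 - keyword_weight) * vs
            for bs, vs in zip(b, v)]

# Exact-match query: doc 0 has a strong keyword hit, so it should win
# even though doc 1 is slightly closer in embedding space.
fused = hybrid_scores([9.1, 2.0, 0.5], [0.62, 0.70, 0.40], keyword_weight=0.7)
best = max(range(len(fused)), key=fused.__getitem__)
```

If score calibration turns out to be fiddly, reciprocal rank fusion is a common alternative that combines the two rankings by position and skips normalization entirely.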
Using the cross-encoder as the evaluator is probably the easiest fix. My current GOOD/UNSURE/BAD scoring uses the same Qwen model, which is the circular issue I mentioned. Since I’m already running the cross-encoder, letting it handle the thresholding would let me drop the LLM evaluator entirely.
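Here is one way that could look if you keep the existing GOOD/UNSURE/BAD labels but source them from reranker scores instead of the LLM. The thresholds below are placeholders; bge-reranker-v2-m3 scores aren't calibrated probabilities, so tune them on a handful of labeled queries:

```python
def grade_chunks(scores, good=0.5, bad=0.1):
    """Map reranker relevance scores to GOOD/UNSURE/BAD labels.
    `scores` would come from the cross-encoder you already run
    (e.g. scoring (query, chunk) pairs); the 0.5/0.1 cutoffs are
    illustrative and need calibration against real queries."""
    labels = []
    for s in scores:
        if s >= good:
            labels.append("GOOD")
        elif s <= bad:
            labels.append("BAD")
        else:
            labels.append("UNSURE")
    return labels

def should_retry(labels):
    """Route back to retrieval (e.g. with a rewritten query) when no
    chunk graded GOOD, instead of asking an LLM judge."""
    return "GOOD" not in labels

labels = grade_chunks([0.83, 0.32, 0.04])
```

This also gives your LangGraph conditional edge a cheap, deterministic signal: `should_retry` replaces the circular LLM-as-judge call entirely.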
No caching yet, but I’ll start with exact-match hashing and layer semantic caching later.
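A plain-Python sketch of that two-level cache; the class name, the TTLs, and the 0.92 similarity threshold are all illustrative choices, and the embeddings would come from the same BGE model you already use for queries:

```python
import hashlib
import math
import time

class TwoLevelCache:
    """Exact layer: sha256 of the normalized query text, short TTL.
    Semantic layer: cosine similarity on the query embedding, longer TTL.
    TTLs and the similarity threshold are illustrative defaults."""

    def __init__(self, exact_ttl=300, semantic_ttl=3600, threshold=0.92):
        self.exact_ttl = exact_ttl
        self.semantic_ttl = semantic_ttl
        self.threshold = threshold
        self.exact = {}      # key -> (expires_at, answer)
        self.semantic = []   # (expires_at, embedding, answer)

    @staticmethod
    def _key(query):
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query, embedding):
        now = time.monotonic()
        hit = self.exact.get(self._key(query))
        if hit and hit[0] > now:
            return hit[1]
        for expires, emb, answer in self.semantic:
            if expires > now and self._cosine(embedding, emb) >= self.threshold:
                return answer
        return None

    def put(self, query, embedding, answer):
        now = time.monotonic()
        self.exact[self._key(query)] = (now + self.exact_ttl, answer)
        self.semantic.append((now + self.semantic_ttl, embedding, answer))

cache = TwoLevelCache()
cache.put("how do I install L88?", [0.9, 0.1, 0.2], "clone the repo, then ...")
```

The semantic list is a linear scan here; once it grows past a few thousand entries you'd point it at the same FAISS index machinery you already have.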
Model-wise: Qwen2.5-7B-AWQ on GPU, with Qwen2.5-14B on CPU as a slow fallback. AWQ is what makes the 8GB VRAM setup workable.
Really appreciate you taking the time — I’ll open issues for hybrid search + caching this week.
I recently challenged myself to architect a multi-agent LangGraph pipeline on an extremely constrained 512MB RAM free-tier server, so I totally understand your VRAM/RAM pain. Here is some architecture feedback based on your questions:
1. Small-VRAM Architecture (Parent-Child Chunking): If you are running out of memory, don't keep large text chunks in your vector DB. Implement strict parent-child chunking: embed only tiny 'child' chunks into the vector store (Qdrant, in my case) and keep the large 'parent' text payloads in a lightweight DB like SQLite/Supabase. Search small, retrieve large.
2. Routing & Skipping Heavy Rerankers: I completely agree with using a single LLM call for routing to save compute. In my setup, I deliberately skipped heavy cross-encoder rerankers because they blow past free-tier/low-VRAM budgets; save them for when you aren't resource-constrained.
3. LangGraph Memory Management: Leverage LangGraph's state machine to avoid OOM crashes. Don't try to hold Evaluator and Generator contexts in memory simultaneously. Sequence your nodes with conditional edges (Generator -> Evaluator -> Route back if failed). By doing this sequentially, you never overload your VRAM at any single tick.
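The parent-child split in point 1 can be sketched roughly like this; the chunk sizes, the SQLite schema, and the returned `(child_text, parent_id)` pairs (which you would then embed into the vector store) are illustrative:

```python
import sqlite3

# Parent-child chunking: only small child chunks get embedded; each
# vector hit carries a parent_id, and the full parent text lives in
# SQLite. The in-memory DB and fixed character splits are stand-ins
# for a real store and a real text splitter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parents (id INTEGER PRIMARY KEY, text TEXT)")

def index_document(text, parent_chars=1200, child_chars=200):
    """Split into large parents, then small children that point back."""
    children = []  # (child_text, parent_id) pairs you would embed
    for start in range(0, len(text), parent_chars):
        parent = text[start:start + parent_chars]
        cur = conn.execute("INSERT INTO parents (text) VALUES (?)", (parent,))
        pid = cur.lastrowid
        for c in range(0, len(parent), child_chars):
            children.append((parent[c:c + child_chars], pid))
    return children

def fetch_parent(parent_id):
    """After a child chunk matches, pull the large parent for the LLM."""
    row = conn.execute(
        "SELECT text FROM parents WHERE id = ?", (parent_id,)
    ).fetchone()
    return row[0] if row else None

children = index_document("A" * 1500)  # yields 2 parents, 8 children
```

The payoff is that the vector index only ever holds short strings and their embeddings, while the context you hand the generator stays large.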
Keep building, you have a very solid foundation here!