The cases that bite me:
1. Docker build args — tokens passed to Dockerfiles for private package installs live in docker-compose.yml, not .env. No .env-focused tool catches them.
2. YAML config files with connection strings and API keys — again, not .env format, invisible to .env tooling.
3. Shell history — even if you never cat the .env, you've probably exported a var or run a curl with a key at some point in the session.
The proxy/surrogate approach discussed upthread seems like the only thing that actually closes the loop, since it works regardless of which file or log the secret would have ended up in.
Per-call monkey-patching sees each call in isolation. What I ended up doing was a trace-based approach: every request gets a trace ID, each service appends cost spans asynchronously, and a separate enrichment step aggregates the total. The hard part was deduplication — when service A reports an aggregate cost and service B reports the individual calls that compose it, you need to reconcile or you double-count.
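A rough sketch of what I mean, in Python; the field names (`is_aggregate`, `parent_id`, etc.) are my own shorthand, not from any particular tracing library:

```python
from dataclasses import dataclass

@dataclass
class CostSpan:
    trace_id: str
    span_id: str
    parent_id: str | None   # None for the root span of the request
    service: str
    cost_usd: float
    is_aggregate: bool       # True when this span already rolls up its children's cost

def trace_total(spans: list[CostSpan]) -> float:
    """Sum the cost of one trace without double-counting aggregate + leaf reports."""
    by_id = {s.span_id: s for s in spans}

    def covered_by_aggregate(span: CostSpan) -> bool:
        # Walk up the parent chain; if any ancestor already reported an
        # aggregate cost, this span's cost is included in that number.
        parent = by_id.get(span.parent_id) if span.parent_id else None
        while parent is not None:
            if parent.is_aggregate:
                return True
            parent = by_id.get(parent.parent_id) if parent.parent_id else None
        return False

    return sum(s.cost_usd for s in spans if not covered_by_aggregate(s))
```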
Your atomic disk writes for halt state are a nice pattern. I went with fire-and-forget (never block the request path, accept eventual consistency on cost data) but that means you can't do hard enforcement mid-request like AgentBudget does.
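For reference, the fire-and-forget part is basically this; a minimal asyncio sketch, and `report_cost` is a stand-in for whatever writes to your cost store:

```python
import asyncio
import logging

log = logging.getLogger("costs")

async def report_cost(trace_id: str, cost_usd: float) -> None:
    """Write a cost span to the store. Hypothetical; fill in your backend."""
    ...

def _log_failure(task: asyncio.Task) -> None:
    if not task.cancelled() and task.exception() is not None:
        log.warning("cost write failed: %r", task.exception())

def record_cost(trace_id: str, cost_usd: float) -> None:
    # Schedule the write and return immediately; the request path never awaits it,
    # so a slow or failing cost store can't block or fail the request.
    asyncio.create_task(report_cost(trace_id, cost_usd)).add_done_callback(_log_failure)
```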
I've built multi-channel chat infrastructure and the honest answer is: keep the monolith until you have a specific scaling bottleneck, not a theoretical one.
One pattern that helped was normalizing all channel-specific message formats into a single internal message type early. Each channel adapter handles its own quirks (some platforms give you 3 seconds to respond, others 20, some need deferred responses) but they all produce the same normalized message that the core processing pipeline consumes. This decoupling is what made it possible to split later without rewriting business logic.
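Roughly, the normalized type plus one adapter looks like this; the field names are just how I shaped mine, and I'm sketching the Telegram payload from memory, so treat the details as illustrative:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class InboundMessage:
    channel: str                     # "slack", "telegram", "sms", ...
    channel_user_id: str             # the user's ID as that platform knows it
    text: str
    received_at: datetime
    reply_deadline_s: float | None   # how long the platform gives you to respond
    raw: dict                        # original payload, kept around for debugging

class TelegramAdapter:
    def normalize(self, update: dict) -> InboundMessage:
        msg = update["message"]
        return InboundMessage(
            channel="telegram",
            channel_user_id=str(msg["from"]["id"]),
            text=msg.get("text", ""),
            received_at=datetime.fromtimestamp(msg["date"], tz=timezone.utc),
            reply_deadline_s=None,   # Telegram doesn't enforce a response window
            raw=update,
        )
```

The core pipeline only ever sees `InboundMessage`, so channel quirks stay inside the adapters.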
On Redis pub/sub specifically: for a solo dev, skip it until you actually have multiple server instances that need to share state. A single process with WebSocket sessions in memory is fine for early users. The complexity cost of pub/sub isn't worth it until you need horizontal scaling or have a separate worker process pushing messages.
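Concretely, "sessions in memory" can be as small as this (assuming something like FastAPI/Starlette WebSockets); this dict is exactly the thing pub/sub replaces once you run a second instance or a separate worker:

```python
from fastapi import WebSocket

# user_id -> live connection; process-local state, fine with a single instance
sessions: dict[str, WebSocket] = {}

async def push_to_user(user_id: str, payload: dict) -> None:
    ws = sessions.get(user_id)
    if ws is not None:
        await ws.send_json(payload)
```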
What's your current message volume like? That usually determines timing better than architecture diagrams.
1. Separate your query analysis from retrieval. A single LLM call can classify the query type, decide whether to use hybrid search, and pick search parameters all at once (sketched after this list). This saves a round-trip vs doing them sequentially.
2. If you add BM25 alongside vector search, the blend ratio matters a lot by query type. Exact-match queries need heavy keyword weighting, while conceptual questions need more embedding weight. A static 50/50 split leaves performance on the table (see the weighting sketch below).
3. For your evaluator/generator being the same model: one practical workaround is to skip LLM-as-judge evaluation entirely and use a small cross-encoder reranker between retrieval and generation instead (example below). It catches the cases where vector similarity returns semantically related but not actually useful chunks, and it gives you a relevance score you can threshold on without needing a separate evaluation model.
4. Consider a two-level cache: exact match (hash the query, short TTL) plus a semantic cache (cosine similarity threshold on the query embedding, longer TTL). The semantic layer catches "how do I X" vs "what's the way to X" without hitting the retriever again (sketch below).
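On point 1, the single-call query analysis is just asking for one small structured object up front; the schema here is only an example of what you might return:

```python
from typing import Literal, TypedDict

class QueryPlan(TypedDict):
    query_type: Literal["exact", "conceptual"]  # error codes/IDs vs. open-ended questions
    use_hybrid: bool                            # run BM25 alongside vector search?
    top_k: int                                  # how many chunks to retrieve

# Template; substitute the user query where {query} appears before sending.
QUERY_ANALYSIS_PROMPT = """Classify the query and choose retrieval settings.
Respond with JSON only: {"query_type": "exact" | "conceptual", "use_hybrid": true|false, "top_k": <int>}

Query: {query}"""
```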
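On point 2, the blend can literally be a per-query-type weight pair. The numbers are illustrative starting points, not tuned, and both scores are assumed already normalized to 0-1:

```python
def hybrid_score(bm25: float, vector: float, query_type: str) -> float:
    # Inputs assumed normalized to [0, 1]; weights are starting points to tune per corpus.
    weights = {
        "exact": (0.8, 0.2),       # error codes, function names, literal strings
        "conceptual": (0.3, 0.7),  # "how does X relate to Y" style questions
    }
    w_bm25, w_vec = weights.get(query_type, (0.5, 0.5))
    return w_bm25 * bm25 + w_vec * vector
```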
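On point 3, with sentence-transformers (my pick, not something you mentioned) the reranker is a few lines; the model name and threshold are things you'd calibrate against your own data:

```python
from sentence_transformers import CrossEncoder

# Small cross-encoder; manageable on CPU for a handful of candidates per query.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], threshold: float, top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    # Score scale depends on the model (ms-marco rerankers emit unbounded logits),
    # so pick the threshold from a few labeled query/chunk pairs rather than guessing.
    return [chunk for chunk, score in ranked if score >= threshold][:top_k]
```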
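And point 4 in sketch form. The TTLs and the 0.92 similarity threshold are made-up defaults, and the linear scan over the semantic layer only holds up while the cache is small:

```python
import hashlib
import time
import numpy as np

class TwoLevelCache:
    def __init__(self, exact_ttl: float = 300, semantic_ttl: float = 3600,
                 sim_threshold: float = 0.92):
        self.exact: dict[str, tuple[float, str]] = {}             # sha256(query) -> (expires_at, answer)
        self.semantic: list[tuple[float, np.ndarray, str]] = []   # (expires_at, embedding, answer)
        self.exact_ttl = exact_ttl
        self.semantic_ttl = semantic_ttl
        self.sim_threshold = sim_threshold

    def get(self, query: str, embedding: np.ndarray) -> str | None:
        now = time.time()
        key = hashlib.sha256(query.encode()).hexdigest()
        hit = self.exact.get(key)
        if hit and hit[0] > now:
            return hit[1]
        # Semantic layer: cosine similarity, assuming embeddings are L2-normalized.
        for expires_at, emb, answer in self.semantic:
            if expires_at > now and float(emb @ embedding) >= self.sim_threshold:
                return answer
        return None

    def put(self, query: str, embedding: np.ndarray, answer: str) -> None:
        now = time.time()
        key = hashlib.sha256(query.encode()).hexdigest()
        self.exact[key] = (now + self.exact_ttl, answer)
        self.semantic.append((now + self.semantic_ttl, embedding, answer))
```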
What model are you using for generation on the 8GB? That constraint probably shapes a lot of the architecture choices downstream.