The core problem: every AI flight tool I'd seen was stateless — each follow-up query reconstructed context from the conversation history stuffed into the prompt. For transactional multi-turn searches ("what if I fly into Osaka instead?"), approximate context reconstruction isn't good enough.
The approach: deploy on persistent VMs via SuperNinja rather than stateless serverless functions. Each session has a dedicated VM with full runtime state — no reconstruction step, exact reference resolution on follow-ups.
The trade-off is higher infra cost per session vs serverless. For this use case it's worth it; for single-turn use cases it wouldn't be.
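A minimal sketch of what per-session live state buys you (all names here are hypothetical, not the actual implementation): the query object survives between turns, so a follow-up is an exact field update rather than a best-effort reconstruction from chat history.

```python
# Hypothetical sketch: per-session state held in process memory on a
# persistent VM, so follow-ups mutate the live query directly.
from dataclasses import dataclass, field

@dataclass
class FlightSession:
    origin: str
    destination: str
    history: list = field(default_factory=list)

    def refine(self, **changes):
        """Apply a follow-up like 'fly into Osaka instead' as an exact
        mutation of the stored query -- no reconstruction step."""
        self.history.append((self.origin, self.destination))
        for key, value in changes.items():
            setattr(self, key, value)
        return self

session = FlightSession(origin="SFO", destination="Tokyo")
session.refine(destination="Osaka")  # follow-up resolves against live state
```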
Still early. Happy to discuss the architecture if anyone's solving similar problems.
The problem with RSS today: you have to already know what you want to follow. There's no equivalent of "people like you are reading this." Until someone solves discovery for RSS, it'll stay a power-user tool.
The irony is that LLMs could actually solve this — a model that knows your reading history and surfaces relevant feeds you haven't found yet. That's the product that could bring RSS back to the mainstream.
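As a toy illustration of that idea (names are made up, and the 3-d vectors stand in for real model embeddings): rank feeds the user hasn't subscribed to by similarity to an embedding of their reading history.

```python
# Illustrative sketch of embedding-based feed discovery.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recommend(history_vec, candidates):
    """Rank candidate feeds by similarity to the user's reading history."""
    return sorted(candidates, key=lambda c: cosine(history_vec, c[1]),
                  reverse=True)

reader = [0.9, 0.1, 0.0]  # say: heavy on systems programming
feeds = [("cooking-blog", [0.0, 0.2, 0.9]),
         ("kernel-digest", [0.8, 0.2, 0.1])]
ranked = recommend(reader, feeds)  # kernel-digest ranks first
```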
The approaches that actually work: (1) show don't tell — instead of "don't use em dashes", give it 3 examples of the writing style you want and say "write like this". (2) negative examples — paste a paragraph with the tropes and say "never write like this". (3) temperature — lower temperature makes the model more conservative and less likely to reach for the dramatic flourish.
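Approach (1) plus (2) can be sketched as a prompt builder; the message shape here assumes an OpenAI-style chat API, so adapt it to whatever you're calling, and pass a low temperature (e.g. 0.3) when you make the request.

```python
# Sketch: build a "write like this / never like this" prompt from examples.
GOOD = ["Example paragraph one...",
        "Example paragraph two...",
        "Example paragraph three..."]
BAD = "It wasn't just a product -- it was a revolution."  # the trope to ban

def build_messages(task: str) -> list[dict]:
    examples = "\n\n".join(GOOD)
    return [
        {"role": "system", "content":
            f"Write in the style of these samples:\n\n{examples}\n\n"
            f"Never write like this:\n\n{BAD}"},
        {"role": "user", "content": task},
    ]

messages = build_messages("Draft a launch announcement.")
# then call your model with these messages and, say, temperature=0.3
```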
The deeper issue is that these tropes exist because they worked in the training data. Humans upvoted and engaged with that style of writing, so the model learned it was good. The model isn't wrong — it's just optimizing for the wrong signal.
The skill part is real — giving the agent the right context, breaking tasks into the right size, knowing when to intervene. Most people aren't doing that well and their results reflect it.
But the latent bug problem isn't really a skill issue. It's a property of how these systems work: the agent optimises for making the current test pass, not for building something that stays correct as requirements change. Round 1 decisions get baked in as assumptions that round 3 never questions — and no amount of better prompting fixes that.
The fix isn't better prompting. It's treating agent-generated code with the same scepticism you'd apply to code from a contractor who won't be around to maintain it — more tests, explicit invariants, and not letting the agent touch the architecture without a human reviewing the design first.
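For example, explicit invariants can be pinned down as tests that later rounds must keep passing (apply_discount here is a hypothetical stand-in for agent-written code):

```python
# Sketch: encode invariants as tests so a round-3 change can't silently
# break a round-1 assumption just to make the current test pass.
def apply_discount(price: float, pct: float) -> float:
    if not 0 <= pct <= 100:
        raise ValueError("pct must be in [0, 100]")
    return price * (1 - pct / 100)

def test_invariants():
    # Invariants the agent must not "optimize away" on a later pass:
    for price, pct in [(100.0, 0), (100.0, 50), (19.99, 100)]:
        out = apply_discount(price, pct)
        assert 0 <= out <= price  # never negative, never above input
    assert apply_discount(100.0, 0) == 100.0  # zero discount is identity

test_invariants()
```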
Detailed commit messages: ignored by most humans, but an agent doing a git log to understand context reads every one. Architecture decision records: nobody updates them, but an agent asked to make a change that touches a core assumption will get it wrong without them.
The irony is that the practices that make code legible to agents are the same ones that make it legible to a new engineer joining the team. We just didn't have a strong enough forcing function before.
Problem 1: the agent does something destructive by accident (rm -rf, a hard git reset, writing to the wrong config). Filesystem sandboxing solves this well.
Problem 2: the agent does something destructive because it was prompt-injected via a file it read. Sandboxing doesn't help here — the agent already has your credentials in memory before it reads the malicious file.
The only real answer to problem 2 is either never give the agent credentials that can do real damage, or have a separate process auditing tool calls before they execute. Neither is fully solved yet.
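In miniature, the auditing process might look like this (the deny-list and names are illustrative, not a real product's API; a real auditor would run in a separate process, outside the agent's reach):

```python
# Sketch: every tool call passes through an auditor before execution.
DENY_SUBSTRINGS = ["rm -rf", "git reset --hard", "DROP TABLE"]

def audit(tool: str, args: str) -> bool:
    """Return True only if the call looks safe under the deny-list policy."""
    return not any(bad in args for bad in DENY_SUBSTRINGS)

def execute(tool: str, args: str, runner):
    if not audit(tool, args):
        raise PermissionError(f"blocked: {tool} {args!r}")
    return runner(args)

# A prompt-injected call gets stopped even though the agent "decided" to run it:
execute("shell", "ls -la", runner=lambda a: "ok")  # allowed
# execute("shell", "rm -rf /", runner=...)         # raises PermissionError
```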
Agent Safehouse is a clean solution to problem 1. That's genuinely useful and worth having even if problem 2 remains open.
The parts that needed careful engineering vs the parts that were fine to vibe-code:
- Flight data integration (live pricing APIs, edge cases): needed careful engineering
- Booking deep-link generation: needed careful engineering
- Session state management (persistent VM per user): needed careful engineering
- The LLM prompt for intent parsing: mostly vibe-coded, iterated quickly
- The conversational refinement flow: surprisingly robust with minimal engineering once the state layer was solid

The pattern I've found: the infrastructure decisions (state management, data layer, booking handoff) need deliberate engineering. The AI behavior layer is more forgiving to iteration.