Launch HN: LlamaFarm (YC W22) – Open-source framework for distributed AI

Launch HN: LlamaFarm (YC W22) – Open-source framework for distributed AI github.com/llama-farm/lla...

Hi HN! We're Rob, Matt, and Rachel from LlamaFarm (https://llamafarm.dev). We're building an open-source AI framework based on a simple belief: the future isn't one massive model in the cloud—it's specialized models running everywhere, continuously fine-tuned from real usage.

The problem: We were building AI tools and kept falling into the same trap. AI demos die before production. We built a bunch of AI demos but they were impossible to get to production. It would work perfectly on our laptop, but when we deployed it, something broke, and RAG would degrade. If we were running our own model, it would quickly become out of date. The proof-of-concept that impressed the team couldn't handle real-world data.

Our solution: declarative AI-as-code. One YAML defines models, policies, data, evals, and deploy. Instead of one brittle giant, we orchestrate a Mixture of Experts—many small, specialized models you continuously fine-tune from real usage. With RAG for source-grounded answers, systems get cheaper, faster, and auditable.

There’s a short demo here: https://www.youtube.com/watch?v=W7MHGyN0MdQ and a more in-depth one at https://www.youtube.com/watch?v=HNnZ4iaOSJ4.

Ultimately, we want to deliver a single, signed bundle—models + retrieval + database + API + tests—that runs anywhere: cloud, edge, or air-gapped. No glue scripts. No surprise egress bills. Your data stays in your runtime.

We believe that the AI industry is evolving like computing did. Just as we went from mainframes to distributed systems and monolithic apps to microservices, AI is following the same path: models are getting smaller and better. Mixture of Experts is here to stay. Qwen3 is sick. Llama 3.2 runs on phones. Phi-3 fits on edge devices. Domain models beat GPT-5 on specific tasks.

RAG brings specialized data to your model: You don't need a 1T parameter model that "knows everything." You need a smart model that can read your data. Fine-tuning is democratizing: what cost $100k last year now costs $500. Every company will have custom models.

Data gravity is real: Your data wants to stay where it is: on-prem, in your AWS account, on employee laptops.

Bottom line: LlamaFarm turns AI from experiments into repeatable, secure releases, so teams can ship fast.

What we have working today: Full RAG pipeline: 15+ document formats, programmatic extraction (no LLM calls needed), vector-database embedding, universal model layer that runs the same code for 25+ providers, automatic failover, cost-based routing; Truly portable: Identical behavior from laptop → datacenter → cloud; Real deployment: Docker Compose works now with Kubernetes basics and cloud templates on the way.

Check out our readme/quickstart for easy install instructions: https://github.com/llama-farm/llamafarm?tab=readme-ov-file#-...

Or just grab a binary for your platform directly from the latest release: https://github.com/llama-farm/llamafarm/releases/latest

The vision is to be able to run, update, and continuously fine-tune dozens of models across environments with built-in RAG and evaluations, all wrapped in a self-healing runtime. We have an MVP of that today (with a lot more to do!).

We’d love to hear your feedback! Think we’re way off? Spot on? Want us to build something for your specific use case? We’re here for all your comments!

So this sounds like an application layer approach, maybe just shy of a replit or base44, with the twist that you can own the pipeline. While there's something to that, I think there are some further questions around differentiation that need to be answered. I think the biggest challenge is going to be the beachead: what client demographic has the cash to want to own the pipeline and not use SaaS, but doesn't have the staff on hand to do it?

mhamann · 3 months ago

I think that enterprises and small businesses alike need stuff like this, regardless of whether they're software companies or some other vertical like healthcare or legal. I worked at IBM for over a decade and it was always preferable to start with an open source framework if it fit your problem space, especially for internal stuff. We shipped products with components built on Elastic, Drupal, Express, etc.

You could make the same argument for Kubernetes. If you have the cash and the team, why not build it yourself? Most don't have the expertise or the time to find/train the people who do.

People want AI that works out of the box on day one. Not day 100.

rgthelen · 3 months ago

Yeah, that’s a fair framing — it is kind of an “application layer” for AI orchestration, but focused on ownership and portability instead of just convenience.

Yeah, the beachead will be our biggest issue - where to find first hard-core users. I was thinking legal (they have a need for AI, but data cannot leave their servers), healthcare (same as legal, but more regualtions), and government (not right now, but normally have deep pockets).

What do you think is a good starting place?

johnthecto · 3 months ago

Like your ideas around legal and Healthcare. Both sectors where you've got interest and money. Another way might be to look at partnering with a service org who does transformation/modernization as a way to accelerate delivery. Maybe MSPs? They're always trying to figure out how to lean out.

An idea might be to try and get a vertical sooner rather than later. The only thing better than an interested lawyer would be a selection of curated templates and prompts designed by people in the industry for example. So you get orchestration and industry-specific aligned verts. Much easier sell than a general purpose platform. But then you're fighting with the other vertically integrated offerings.

Maybe there are other differentiators? If this is like bedrock for your network, maybe the angle is private models where you want them. Others are doing that though, so there's pretty active competition there as well.

The more technical and general the audience the more you're going to have to talk them out of just rolling openwebui themselves.

jochalek · 3 months ago

Very cool to see a serious local first effort. Looking back at how far local models have come I definitely believe their usefulness combined with RAG or in domain specific contexts is soon to be (or already is) on par with general purpose gpt5-like massive parameter cloud models. The ability to generate quality responses without having to relinquish private data to the cloud used to be a pipedream. It's exciting to see a team dedicated to making this a reality.

What are a few use-cases you want to see this used for?

Few ideas related to healthcare

- AI assistants for smaller practices without enterprise EHR. Epic at the moment integrates 3rd party AI assistants, but those are of course cloud services and are aimed at contracts with large hospital systems. They're a great step forward, but leave much to be desired by doctors in actual usefulness.

- Consumer/patient facing products to help people synthesize all of their health information and understand what their healthcare providers are doing. Think of a n on device assistant that can connect with something like https://www.fastenhealth.com/ to make local RAG of their health history.

Overall, users can feel more confident they know where their PHI is, and potentially easier for smaller companies/start-ups to get into the healthcare space without having to move/store people's PHI.

Thanks! It means a lot to hear you say that.

A4ET8a8uTh0_v2 · 3 months ago

I am not sure if it is the future, but I am glad there is some movement to hinder centralization in this sector as much as possible ( yes, I recognize future risk, but for now it counts as hindering it ).

I am kind of militant about this. The ability to run great AI models locally is critical, not just for this sector, but for innovation overall. The bar of "build a few datacenters" is far too high for all but the largest countries and companies in the world.

100%. We don't know what's going to happen in the future. Things are evolving so quickly. Hopefully pushing back on centralization now will keep the ecosystem healthier and give developers real options outside the big two/three cloud providers.

zackangelo · 3 months ago

Just a bit of feedback:

> Instead of one brittle giant, we orchestrate a Mixture of Experts…

“mixture of experts” is a specific term of art that describes an architectural detail of a type of transformer model. It’s definitely not using smaller specialized models for individual tasks. Experts in an MoE model are actually routed to on a per token basis, not on a per task or per generation basis.

I know it’s tempting to co-opt this term because it would fit nicely for what you’re trying to do but it just adds confusion.

I hear you and valid. A mixture of Models is probably a better phrase - we are constantly moving between AI experts and very skilled Developers who use OpenAI endpoints and call it AI, so we are constantly working on finding the correct language. This was a miss though - will do better :)

gus_22 · 3 months ago

Would like to connect. I've got a YC colleague I have asked to setup an intro. Agree with John's point below about the chicken and egg scenario. Our community flagged your post. Fundamental problems like who cares, who pays reinforce why this hasn't already been done (well and at scale). And why this time might be the right time!

Appreciate that — and totally agree. The “who cares / who pays” question is exactly why this hasn’t scaled before.

Our bet is that the timing’s finally right: local inference, smaller and more powerful open models (Qwen, Granite, Deepseek), and enterprise appetite for control have all converged. We’re working with large enterprises (especially in regulated industries) where innovation teams need to build and run AI systems internally, across mixed or disconnected environments.

That’s the wedge — not another SaaS, but a reproducible, ownable AI layer that can actually move between cloud, edge, and air-gapped. Just reach out, no intro needed - robert @ llamafarm.dev

Eisenstein · 3 months ago

How do you deal with the space continually evolving? Like, MCP changed major ways over the course of a few months, new models are released with significant capability upgrades every month, inference engines like llamacpp get updated multiple times a day. But organizations want to setup their frameworks and then maintain them. Will this let them do that?

Yes, our goal is to provide a stable, open source platform on top of the cutting-edge AI tools. We can systematically update dependencies as needed and ensure that outputs meet quality requirements.

We also have plans for eval features in the product so that users can measure the quality of changes over time, whether to their own project configs or actual LlamaFarm updates.

Yes, all that's a bit hand-wavy, I know. :-) But we do recognize the problem and have real ideas on solutions. But execution is everything. ;-)

ivape · 3 months ago

We built a bunch of AI demos but they were impossible to get to production. It would work perfectly on our laptop, but when we deployed it, something broke, and RAG would degrade.

How did RAG degrade when it went to prod? Do you mean your prod server had throughput issues?

Multiple areas of degradation. Typically, you don't ship a dataset to prod and then never change it. You want the system to continue to learn and improve as new data is available. This can create performance issues as the dataset grows in size. But also, your model's performance in terms of quality can degrade over time if you're not constantly evaluating its responses. This can occur because of new info within RAG, a model swap/upgrade, or changes to prompts. Keeping all of those knives in the air is tricky. We're hoping we can solve a bunch of pain points around this so that reliable AI systems are accessible to anyone.

Heya - one blue sea question.

Where are you on Vulkan support? Hard to find good stacks to use with all this great intel and non-rocm amd hardware. Might be a good angle too rather than chasing the usual Nvidia money train.

Funny you bring it up. We shipped Vulkan support TODAY through a tight integration Lemonade (https://lemonade-server.ai).

We now support AMD, Intel, CPU, and Cuda/Nvidia.

Hit me up if you want a walk through - this is in dev right now (you have to pull down the repo to run it), but we'll ship it as a part of our next release.

https://docs.llamafarm.dev/docs/models#lemonade-runtime