This is fantastic work. The focus on a local, sandboxed execution layer is a huge piece of the puzzle for a private AI workspace. The `coderunner` tool looks incredibly useful.
A complementary challenge is the knowledge layer: making the AI aware of your personal data (emails, notes, files) via RAG. As soon as you try this on a large scale, storage becomes a massive bottleneck. A vector database for years of emails can easily exceed 50GB.
(Full disclosure: I'm part of the team at Berkeley that tackled this). We built LEANN, a vector index that cuts storage by ~97% by not storing the embeddings at all. It makes indexing your entire digital life locally actually feasible.
Combining a local execution engine like this with a hyper-efficient knowledge index like LEANN feels like the real path to a true "local Jarvis."
Code: https://github.com/yichuan-w/LEANN Paper: https://arxiv.org/abs/2405.08051
I think you meant this: https://arxiv.org/abs/2506.08276
In 2025 I would consider this a relatively meager requirement.
Yeah, that's a fair point at first glance. 50GB might not sound like a huge burden for a modern SSD.
However, the 50GB figure was just a starting point for emails. A true "local Jarvis" would need to index everything: all your code repositories, documents, notes, and chat histories. That raw data can easily be hundreds of gigabytes.
For a 200GB text corpus, a traditional vector index can swell to >500GB. At that point, it's no longer a "meager" requirement. It becomes a heavy "tax" on your primary drive, which is often non-upgradable on modern laptops.
The goal for practical local AI shouldn't just be that it's possible, but that it's also lightweight and sustainable. That's the problem we focused on: making a comprehensive local knowledge base feasible without forcing users to dedicate half their SSD to a single index.
It feels weird that the search index is bigger than the underlying data. Weren't search indexes supposed to be efficient structures giving fast access to the underlying data?
Exactly. That's because instead of just mapping keywords, vector search stores the rich meaning of the text as massive data structures, and LEANN is our solution to that paradoxical inefficiency.
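If it helps intuition, here's a minimal sketch of the general shape of that idea (this is not the actual LEANN code or API; `toy_embed` is a throwaway stand-in for a real embedding model, and the graph and search parameters are made up): store only the raw chunks plus a small neighbor graph, and recompute embeddings on the fly for the few nodes a query actually touches.

    import heapq
    import numpy as np

    def toy_embed(text, dim=64):
        # Stand-in for a real embedding model: hash words into a fixed-size vector.
        v = np.zeros(dim)
        for w in text.lower().split():
            v[hash(w) % dim] += 1.0
        n = np.linalg.norm(v)
        return v / n if n else v

    def search(query, chunks, graph, entry=0, top_k=3, budget=32):
        # Only `chunks` (raw text) and `graph` (node -> neighbor ids) live on disk;
        # embeddings are recomputed for the handful of nodes actually visited.
        q = toy_embed(query)
        visited, scored = set(), []
        frontier = [(-float(q @ toy_embed(chunks[entry])), entry)]
        while frontier and len(visited) < budget:
            neg_sim, node = heapq.heappop(frontier)
            if node in visited:
                continue
            visited.add(node)
            scored.append((neg_sim, node))
            for nb in graph[node]:
                if nb not in visited:
                    heapq.heappush(frontier, (-float(q @ toy_embed(chunks[nb])), nb))
        return [chunks[n] for _, n in sorted(scored)[:top_k]]

    chunks = ["quarterly budget email", "flight booking to Berlin", "notes on vector search"]
    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
    print(search("how does vector search work", chunks, graph, top_k=2))

The storage win comes from the fact that the text and the graph are tiny compared to the per-chunk vectors you would otherwise persist.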
Good point! Maybe indexing is a bad term here, and it's more like feature extraction (and since embeddings are high dimensional we extract a lot of features). From that point of view it makes sense that "the index" takes more space than the original data.
I guess for semantic search (rather than keyword search), the index is larger than the text because we need to embed it into a huge semantic space, which makes sense to me.
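To put rough numbers on that (illustrative assumptions, not measurements): a 1,536-dimensional float32 embedding per ~1 KB chunk of text already costs several times the text itself, before any graph or metadata overhead.

    chunk_bytes = 1_024                   # ~1 KB of text per chunk (assumption)
    dim, bytes_per_float = 1536, 4        # e.g. a 1536-dim float32 embedding
    embedding_bytes = dim * bytes_per_float

    print(embedding_bytes)                # 6144 bytes of vector per chunk
    print(embedding_bytes / chunk_bytes)  # ~6x the raw text it describes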
Nonclustered indexes in RDBMS can be larger than the tables. It’s usually poor design or indexing a very simple schema in a non-trivial way, but the ultimate goal of the index is speed, not size. As long as you can select and use only a subset of the index based on its ordering it’s still a win.
Are there projects that implement this same “pruned graph” approach for cloud embeddings?
Why is it considered relevant to have a RAG of people's digital traces burdening them in every single interaction they have with a computer?
Having this kind of capability locally distributed is one thing. Pushing everyone ever deeper into their own information bubble is another, orthogonal topic.
When someone's mind recalls that email from years before, having the option to find it again in a few moments can be interesting. But when the device starts to funnel you through your past traces, it doesn't matter much whether the solution is local or remote: the spontaneous flow of thought is hijacked.
Since it’ll be local, this behavior can be controlled. I for one find the option of it digging through my personal files to give me valuable personal information attractive.
In mindset dystopia, the device prompts you.
Thank you for the pointer to LEANN! I've been experimenting with RAGs and missed this one.
I am particularly excited about using RAG as the knowledge layer for LLM agents/pipelines/execution engines to make it feasible for LLMs to work with large codebases. It seems like the current solution is already worth a try. It really makes it easier that your RAG solution already has Claude Code integration![1]
Has anyone tried the above challenge (RAG + some LLM for working with large codebases)? I'm very curious how it goes (thinking it may require some careful system-prompting to push agent to make heavy use of RAG index/graph/KB, but that is fine).
I think I'll give it a try later (using cloud frontier model for LLM though, for now...)
[1]: https://github.com/yichuan-w/LEANN/blob/main/packages/leann-...
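In case it helps anyone trying this, here's the rough shape of the pipeline I have in mind, kept deliberately self-contained: TF-IDF stands in as a placeholder retriever (a real setup would swap in an embedding index like LEANN), and the final LLM call is left as a comment since that part depends on your model.

    from pathlib import Path
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def build_index(repo_root, exts=(".py", ".md")):
        # Treat each source file as a "document"; a real setup would chunk large files.
        paths = [p for p in Path(repo_root).rglob("*") if p.suffix in exts]
        docs = [p.read_text(errors="ignore") for p in paths]
        vectorizer = TfidfVectorizer(stop_words="english")
        matrix = vectorizer.fit_transform(docs)
        return paths, docs, vectorizer, matrix

    def retrieve(question, paths, docs, vectorizer, matrix, k=5):
        # Rank files by similarity to the question and return the top-k as context.
        scores = cosine_similarity(vectorizer.transform([question]), matrix)[0]
        top = scores.argsort()[::-1][:k]
        return [(paths[i], docs[i][:2000]) for i in top]

    paths, docs, vec, mat = build_index(".")
    context = retrieve("where is the request retry logic implemented?", paths, docs, vec, mat)
    prompt = "Answer using only this context:\n\n" + "\n\n".join(
        f"# {p}\n{snippet}" for p, snippet in context
    )
    # `prompt` would then be sent to the LLM of your choice (local or cloud) with the question.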
This is annoyingly Apple-only though. Even though my main dev machine is a Macbook, this would be a LOT more useful if it was a Docker container.
I'd still take a Docker container over an Apple container, because even though docker is not VM-level-secure, it's good enough for running local AI generated code. You don't need DEFCON Las Vegas levels of security for that.
And also because Docker runs on my windows gaming machine with a fast GPU with WSL ubuntu, and my linux VPS in the cloud running my website, etc etc. And most people have already memorized all the basic Docker commands.
This would be a LOT better if it was just a single docker command we can copy paste, run it a few times, and then delete if necessary.
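To make the "single docker command" point concrete, here's roughly the kind of wrapper I mean, sketched in Python around a plain `docker run` with the stock python:3.12-slim image (my own sketch, not the coderunner tool): a throwaway container with no network, capped memory and CPU, and only the generated script mounted read-only.

    import subprocess, tempfile, pathlib

    def run_untrusted(code, image="python:3.12-slim", timeout=30):
        # Run LLM-generated Python in a disposable, network-less, resource-capped container.
        with tempfile.TemporaryDirectory() as tmp:
            script = pathlib.Path(tmp) / "task.py"
            script.write_text(code)
            cmd = [
                "docker", "run", "--rm",
                "--network", "none",
                "--memory", "512m", "--cpus", "1",
                "-v", f"{tmp}:/work:ro",
                image, "python", "/work/task.py",
            ]
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            return result.stdout or result.stderr

    print(run_untrusted("print(sum(range(10)))"))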
> Even with help from the "world's best" LLMs, things didn't go quite as smoothly as we had expected. They hallucinated steps, missed platform-specific quirks, and often left us worse off.
This shows how little native app training data is even available.
People rarely write blog posts about designing native apps, long-winded Medium tutorials don't exist, and even the number of open source projects for native desktop apps is a small percentage compared to mobile and web apps.
Historically, Microsoft paid some of the best technical writers in the world to write amazing books on how to code for Windows (see: Charles Petzold), but nowadays that entire industry is almost dead.
These types of holes in training data are going to be a larger and larger problem.
Although this is just representative of software engineering in general - few people want to write native desktop apps because it is a career dead end. Back in the 90s, knowing how to write Windows desktop apps was great; it pretty much promised a middle-class lifestyle, with a pretty large barrier to entry (C/C++ programming was hard, and the Windows APIs were not easy to learn, even though MS dumped tons of money into training programs), but things have changed a lot. Outside of the OS vendors themselves (Microsoft, Apple) and a few legacy app teams (Adobe, Autodesk, etc.), very few jobs exist for writing desktop apps.
You left out the next lines, which add some important context:
> Then we tried wrapping a NextJS app inside Electron. It took us longer than we'd like to admit. As of this writing, it looks like there's just no (clean) way to do it.
> So, we gave up on the Mac app.
They weren't writing a fully native app. They started with a NextJS web app and then tried to put it inside Electron, a cross-platform toolkit.
All the training data in the world about native app development wouldn't have helped here. They were using a recent JS framework and trying to put it in a relatively recent cross-platform tool. The two parts weren't made to work together so training data likely doesn't exist, other than maybe some small amount of code or issues on GitHub discussing problems with the approach.
I thought that was odd too. There are lots of ChatGPT clones implemented as native MacOS apps.
The main advancement in TFA is using the new Container Swift API for local tool use. That functionality would probably be a welcome contribution to any of these:
https://github.com/Renset/macai
https://github.com/huggingface/chat-macOS
https://github.com/SidhuK/WardenApp
https://github.com/psugihara/FreeChat
> This shows how little native app training data is even available.
FWIW, we have very few desktop native apps nowadays. Most apps are either mobile, cli or web-based. Heck, I’m sure there’s more material online on writing cli apps than gui apps.
A lot of us just don't want to be web developers. I mostly write IEC 61131 code, with sprinkles of BASIC (yuck), C, Perl, and Lisp. I've used JavaScript and quite frankly, you can keep it.
Great effort, a strong self-hosting community for LLMs is going to be similarly important as the FLOSS movement imho. But right now I feel the bigger bottleneck is on the hardware side rather than software. The amount of fast RAM that you need for decent models (80b+ params) is just not something that's commonly available in consumer hardware right now, not even gaming machines. I heard that Macs (minis) are great for the purpose, but you don't really get them with enough RAM at prices that still qualify as consumer-grade. I've seen people create home clusters (e.g. using Exo [0]), but I wouldn't really call it practical (single digit tokens/sec for large models, and the price isn't exactly accessible either). Framework (the modular laptop company) has announced a desktop that can be configured up to 128GB unified RAM, but it's still going to come in at around 2-2.5k depending on your config.
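Rough capacity math for why the RAM numbers balloon (illustrative only; KV cache and runtime overhead add more on top):

    params = 80e9  # an "80b+" class model
    for name, bytes_per_param in [("fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
        print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
    # fp16: ~160 GB, 8-bit: ~80 GB, 4-bit: ~40 GB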
With smaller models becoming more efficient and hardware continually improving, I think the sweet spot for local LLM computing will arrive in a couple of years.
[0] https://github.com/exo-explore/exo
So many comments like to highlight that you can buy a Mac Studio with 512GB of RAM for $10K, but that's a huge amount of money to spend on something that still can't compete with a $2/hour rented cloud GPU server in terms of output speed. Even that will be lower quality and slower than the $20/month plan from the LLM provider of your choice.
The only reasons to go local are if you need it (privacy, contractual obligations, regulations) or if you're a hardcore hobbyist who values running it yourself over quality and speed of output.
> Framework (the modular laptop company) has announced a desktop that can be configured up to 128GB unified RAM, but it's still going to come in at around 2-2.5k depending on your config.
Framework is getting a lot of headlines for their brand recognition but there are a growing number of options with the same AMD Strix Halo part. Here's a random example I found from a Google search - https://www.gmktec.com/products/amd-ryzen%E2%84%A2-ai-max-39...
All of these are somewhat overpriced right now due to supply and demand. If the supply situation is alleviated they should come down in price.
They're great for what they are, but their memory bandwidth is still relatively limited. If the 128GB versions came down to $1K I might pick one up, but at the $2-3K price range I'd rather put that money toward upgrading my laptop to an M4 MacBook Pro with 128GB of RAM.
Prices are still coming down. Assuming that keeps happening we will have laptops with enough RAM in the sub-2k range in 5 years.
Question is whether models will keep getting bigger. If useful model sizes plateau, then eventually a good model becomes something that at least many people can easily run locally. If models keep usefully growing, this doesn't happen.
The largest ones I see are in the 405B-parameter range, which quantized fits in 256GB of RAM.
Long term I expect custom hardware accelerators designed specifically for LLMs to show up, basically an ASIC. If those got affordable I could see little USB-C accelerator boxes being under $1k able to run huge LLMs fast and with less power.
GPUs are most efficient for batch inference, which lends itself to hosting rather than local use. What I mean is a lighter chip made to run small or single-batch inference very fast using less power. Small or single-batch inference is memory bandwidth bound, so I suspect fast RAM would be most of the cost of such a device.
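To put rough numbers on "memory bandwidth bound": for single-batch decoding, every generated token has to stream essentially all of the active weights through memory once, so bandwidth divided by model size gives a ceiling on tokens/second (figures below are approximate and only for illustration):

    weights_gb = 40  # e.g. a large model quantized down to ~40 GB
    for device, bandwidth_gb_s in [("dual-channel DDR5", 90),
                                   ("Strix Halo-class unified memory", 256),
                                   ("M3 Ultra-class unified memory", 800)]:
        print(f"{device}: ~{bandwidth_gb_s / weights_gb:.0f} tokens/sec upper bound")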
What's the deal with Exo anyway? I've seen it described as an abandoned, unmaintained project.
Anyway, you don't really need a lot of fast RAM unless you insist on getting a real-time usable response. If you're fine with running a "good" model overnight or thereabouts, there are things you can do to get better use of fairly low-end hardware.
Jeff Geerling just did a video with a cluster of 4 Framework Desktop main boards. He put a decent amount of work into Exo and concluded it’s a VC Rugpull… abandoned as soon as it won some attention.
He also explored several other open source AI scale out libraries, and reported that they’re generally way less mature than tooling for traditional scientific cluster computing.
https://www.jeffgeerling.com/blog/2025/i-clustered-four-fram...
the slow interconnects (yes, even at 40Gbps thunderbolt) severely limit both TtFT and tokens/second.
I tried it extensively for a few days, and ended up getting a single M3 Ultra Mac Studio, and am loving life.
What sort of specs do you need?
The founders of Exo ghosted the dev community and went closed-source. Nobody has heard from them. I wish people would stop recommending Exo (a tribute to their marketing) and check out GPUStack instead. Overall another rug pull by the devs as soon as they got traction.
> The amount of friction to get privacy today is astounding
I don't understand this.
It's easy to get a local LLM running with a couple commands in the terminal. There are multiple local LLM runners to choose from.
This blog post introduces some additional tools for sandboxed code execution and browser automation, but you don't need those to get started with local LLMs.
There are multiple local options. This one is easy to start with: https://ollama.com/
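For example, assuming a default Ollama install listening on localhost:11434 and a model you've already pulled (swap in whatever model name you actually have), a complete round trip is just:

    import json, urllib.request

    payload = {"model": "llama3.2",  # assumes you've run `ollama pull llama3.2`
               "prompt": "In two sentences, why do local LLMs matter for privacy?",
               "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])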
Easy for what percentage of people?
I'd assume they are referring to being able to run your own workloads on a home-built system, rather than surrendering that ownership to the tech giants.
It's the hardware more than the software that is the limiting factor at the moment, no? Hardware to run a good LLM locally starts around $2000 (e.g. Strix Halo / AI Max 395). I think a few Strix Halo iterations will make it considerably easier.
> Hardware to run a good LLM locally starts around $2000 (e.g. Strix Halo / AI Max 395). I think a few Strix Halo iterations will make it considerably easier.
And "good" is still questionable. The thing that makes this stuff useful is when it works instantly like magic. Once you find yourself fiddling around with subpar results at slower speeds, essentially all of the value is gone. Local models have come a long way but there is still nothing even close to Claude levels when it comes to coding. I just tried taking the latest Qwen and GLM models for a spin through OpenRouter with Cline recently and they feel roughly on par with Claude 3.0. Benchmarks are one thing, but reality is a completely different story.
I hope it keeps improving at a steady rate! Let's just hope there is still room to pack even more improvements into these LLMs, which would help the home-labbing community in general.
I think I still prefer local, but I feel like that's because most AI inference is kinda slow, or at best comparable to local. But I recently tried out Cerebras (I've heard about Groq too), and honestly, when you try things at 1000 tok/s or so, your mental model really shifts and you become quite impatient.
Cerebras does say that they don't log your data or anything in general, and you'll have to trust me that I am not sponsored by them (wish I was, though). It's just that they are kinda nice.
But I still hope that we can someday see some meaningful improvements in speed too. Diffusion models seem to be a really fast architecture.
Until a judge says they must log everything, indefinitely.
Avoiding cloud dependency seems like a huge amount of effort to make life harder for yourself, just to end up with a situation where you have to rely on other parts of the cloud to do any other part of your work or business anyway. I mean, why stop there? Unplug yourself from the grid so you don’t have to depend on water or electricity just in case the companies that provide it stop working.
Yeah, you could do that, but honestly the only world where you’re able to live off grid safely without being attacked and looted is a world where the rest of society hasn’t broken down and still has a grid to connect to. Similarly, the only world where you could succeed as a software developer is one where the cloud generally still functions so you may as well use cloud services that are convenient and right there.
You’re not the military with military secrets, you don’t have anything to gain from being independent from the cloud.
Interesting as a thought experiment, though.
I'm working on something similar focused on being able to easily jump between the two (cloud and fully local) using a Bring Your Own [API] Key model – all data/config/settings/prompts are fully stored locally and provider API calls are routed directly (never pass through our servers). Currently using mlc-llm for models & inference fully local in the browser (Qwen3-1.7b has been working great)
[1] https://hypersonic.chat/