Experimenting with Local LLMs on macOS

I agree that it's kind of magical that you can download a ~10GB file and suddenly your laptop is running something that can summarize text, answer questions and even reason a bit.

The trick is balancing model size vs RAM: 12B–20B is about the upper limit for a 16GB machine without it choking.

What I find interesting is that these models don't actually hit Apple's Neural Engine; they run on the GPU via Metal. Core ML isn't great for custom runtimes and Apple hasn't given low-level developer access to the ANE afaik. And then there are memory bandwidth and dedicated SRAM issues. Hopefully Apple optimizes Core ML to map transformer workloads to the ANE.

giancarlostoro · 5 months ago

I feel like Apple needs a new CEO, I've felt this way for a long time. If I had been in charge of Apple I would have embraced local LLMs and built an inference engine that optimizes models that are designed for Nvidia, I also would have probably toyed around with the idea of selling server-grade Apple Silicon processors and opening up the GPU spec so people can build against it. Seems like Apple tries to play it too safe. While Tim Cook is good as a COO, he's still running Apple as a COO. They need a man of vision, not a COO at the helm.

aurareturn · 5 months ago

I think if Cook had vision, he could have started something called Apple Enterprise and sold Apple Silicon as a server and made AI chips. I agree he’s too conservative and has no product vision. Great manager though.

jbverschoor · 5 months ago

Local llm.. everybody is scared of privacy.. many people don’t want to buy subscriptions (still).

Just sell a proper HomePod with 64GB-128GB ram, which handles everything including your personal LLM, Time Machine if needed, back to Mac (Tailscale/zerotier)

+ they can compete efficiently with the other cloud providers.

ako · 5 months ago

They have local LLMs, apple foundation models: https://developer.apple.com/documentation/FoundationModels

jbs789 · 5 months ago

Sounds like you’ve got a solid handle on things - go do it!

elAhmo · 5 months ago

I think shareholders are fine with Tim Cook as a CEO.

woooooo · 5 months ago

Not to mention, build a car with all that cash they have. Xiaomi makes awesome cars, an Apple-branded electric could scoop up all the brand equity that Elon threw away.

saagarjha · 5 months ago

One does not simply put a 5090 into an existing chip.

__loam · 5 months ago

I'm glad Tim is the CEO instead of you.

brookst · 5 months ago

Under Cook, Apple’s market cap has increased 10x, at a CAGR of 18%.

Do you really think that they need something different? As a shareholder would you bet on your vision of focusing on server parts?

bigyabai · 5 months ago

Software-wise, it makes sense: Nvidia has the IP lead, industry buy-in and supports the OSes everyone wants to use.

Hardware-wise though, I actually agree - Apple has dropped the ball so hard here that it's dumbfounding. They're the only TSMC customer that could realistically ship a comparable volume of chips as Nvidia, even without really impacting their smartphone business. They have hardware designers who can design GPUs from scratch, write proprietary graphics APIs and fine-tune for power efficiency. The only organizational roadblock that I can see is the executive vision, which has been pretty wishy-washy on AI for a while now. Apple wants to build a CoreML silo in a world where better products exist everywhere, it's a dead-end approach that should have died back in 2018.

Contextually it's weird too, I've seen tons of people defend Cook's relationship with Trump as "his duty to shareholders" and the like. But whenever you mention crypto mining or AI datacenter markets, people act like Apple is above selling products that people want. Future MBAs will be taught about this hubris once the shape of the total damages comes into view.

zozbot234 · 5 months ago

From reverse engineered information (in the context of Asahi Linux, which can have raw hardware access to the ANE) it seems that the M1/M2 Apple Neural Engine provides exclusively for statically scheduled MADD's of INT8 or FP16 values.[0] This wastes a lot of memory bandwidth on padding in the context of newer local models which generally are more heavily quantized.

(That is, when in-memory model values must be padded to FP16/INT8 this slashes your effective use of memory bandwidth, which is what determines token generation speed. GPU compute doesn't have that issue; one can simply de-quantize/pad the input in fast local registers to feed the matrix compute units, so memory bandwidth is used efficiently.)
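A rough back-of-the-envelope sketch of the bandwidth argument above, assuming every weight is streamed from memory once per generated token and that decoding is purely bandwidth-bound. The 8B parameter count and the 100 GB/s figure are illustrative assumptions, not measurements of any particular chip.

```python
def decode_tokens_per_sec(params_billions, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/s if each weight is read from memory once per token."""
    bytes_per_token = params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical 8B model on a hypothetical 100 GB/s memory system.
native_4bit = decode_tokens_per_sec(8, 4, 100)   # weights stay 4-bit in memory
padded_fp16 = decode_tokens_per_sec(8, 16, 100)  # same weights padded to FP16 in memory

print(f"4-bit in memory: {native_4bit:.2f} tok/s")
print(f"padded to FP16:  {padded_fp16:.2f} tok/s")
```

Under these assumptions, padding 4-bit weights to FP16 in memory cuts the ceiling on decode speed by 4x, which is the cost being described.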

The NPU/ANE is still potentially useful for lowering power use in the context of prompt pre-processing, which is limited by raw compute as opposed to the memory bandwidth bound of token generation. (Lower power usage in this context will save on battery and may help performance by avoiding power/thermal throttling, especially on passively-cooled laptops. So this is definitely worth going for.)

[0] Some historical information about bare-metal use of the ANE is available from the Whisper.cpp pull req: https://github.com/ggml-org/whisper.cpp/pull/1021 Even older information at: https://github.com/eiln/ane/tree/33a61249d773f8f50c02ab0b9fe... .

More extensive information at https://github.com/tinygrad/tinygrad/tree/master/extra/accel... (from the Tinygrad folks) seems to basically confirm the above.

(The jury is still out for M3/M4 which currently have no Asahi support - thus, no current prospects for driving the ANE bare-metal. Note however that the M3/Pro/Max ANE reported performance numbers are quite close to the M2 version, so there may not be a real improvement there either. M3 Ultra and especially the M4 series may be a different story.)

slacka · 5 months ago

I too found it interesting that Apple's Neural Engine doesn't work with local LLMs. Seems like Apple, AMD, and Intel are missing the AI boat by not properly supporting their NPUs in llama.cpp. Any thoughts on why this is?

bigyabai · 5 months ago

NPUs are almost universally too weak to use for serious LLM inference. Most of the time you get better performance-per-watt out of GPU compute shaders; the majority of NPUs are dark silicon.

Keep in mind - Nvidia has no NPU hardware because that functionality is baked-into their GPU architecture. AMD, Apple and Intel are all in this awkward NPU boat because they wanted to avoid competition with Nvidia and continue shipping simple raster designs.

numpad0 · 5 months ago

Perhaps due to sizes? AI/NN models before LLMs were orders of magnitude smaller, as is evident in effectively all LLMs carrying "Large" in their name regardless of relative size differences.

Someone · 5 months ago

I guess that hardware doesn’t make things faster (¿yet?). If so I guess they would have mentioned it in https://machinelearning.apple.com/research/core-ml-on-device.... That is updated for Sequoia and says

“This technical post details how to optimize and deploy an LLM to Apple silicon, achieving the performance required for real time use cases. In this example we use Llama-3.1-8B-Instruct, a popular mid-size LLM, and we show how using Apple’s Core ML framework and the optimizations described here, this model can be run locally on a Mac with M1 Max with about ~33 tokens/s decoding speed. While this post focuses on a particular Llama model, the principles outlined here apply generally to other transformer-based LLMs of different sizes.”

GeekyBear · 5 months ago

There is no NPU "standard".

Llama.cpp would have to target every hardware vendor's NPU individually and those NPUs tend to have breaking changes when newer generations of hardware are released.

Even Nvidia GPUs often have breaking changes moving from one generation to the next.

svachalek · 5 months ago

I think I saw something that got Ollama to run models on it? But it only works with tiny models. Seems like the neural engine is extremely power efficient but not fast enough to do LLMs with billions of parameters.

jondwillis · 5 months ago

https://github.com/Anemll/Anemll

GeekyBear · 5 months ago

> Hopefully Apple optimizes Core ML to map transformer workloads to the ANE.

If you want to convert models to run on the ANE there are tools provided:

> Convert models from TensorFlow, PyTorch, and other libraries to Core ML.

https://apple.github.io/coremltools/docs-guides/index.html

ls-a · 5 months ago

I thought Apple MLX can do that if you convert your model using it https://mlx-framework.org/

coffeecoders · 5 months ago

It is less about conversion and more about extending ANE support for transformer-style models or giving developers more control.

The issue is in targeting specific hardware blocks. When you convert with coremltools, Core ML takes over and doesn't provide fine-grained control - run on GPU, CPU or ANE. Also, ANE isn't really designed with transformers in mind, so most LLM inference defaults to GPU.

ai-christianson · 5 months ago

I can run GLM 4.5 Air and gpt-oss-120b both very reasonably. GPT OSS has particularly good latency.

I'm on a 128GB M4 macbook. This is "powerful" today, but it will be old news in a few years.

These models are just about getting as good as the frontier models.

ru552 · 5 months ago

You're better served using Apple's MLX if you want to run models locally.

daemonologist · 5 months ago

ONNX Runtime purports to support CoreML: https://onnxruntime.ai/docs/execution-providers/CoreML-Execu... , which gives a decent amount of compatibility for inference. I have no idea to what extent workloads actually end up on the ANE though.

(Unfortunately ONNX doesn't support Vulkan, which limits it on other platforms. It's always something...)

wslh · 5 months ago

I find it surprising that you can also do that from the browser (e.g. WebLLM). I imagine that in the near future we will run these engines locally for many use cases, instead of via APIs.

witnessme · 5 months ago

Don't try 12-20B on 16GB. You should stick with 4-8B instead. You'll get way too slow tps and marginal perf improvements going higher on a 16GB machine.
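A minimal sketch of the size-vs-RAM arithmetic behind this advice. The flat 1.5 GB overhead (KV cache, runtime buffers) and the 65% usable-RAM fraction are assumptions picked for illustration; real numbers vary by runtime, context length and what else is running.

```python
def model_footprint_gb(params_billions, bits_per_weight, overhead_gb=1.5):
    """Approximate resident size: quantized weights plus an assumed flat
    allowance for KV cache and runtime buffers."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

def fits(params_billions, bits_per_weight, ram_gb, usable_fraction=0.65):
    """usable_fraction is an assumption about how much RAM the OS leaves free."""
    return model_footprint_gb(params_billions, bits_per_weight) <= ram_gb * usable_fraction

for size in (4, 8, 12, 20):
    gb = model_footprint_gb(size, 4)
    print(f"{size}B @ 4-bit: ~{gb:.1f} GB, fits in 16 GB: {fits(size, 4, 16)}")
```

Fitting in memory is only half of it: a model that barely fits leaves little room for context and will still generate slowly, which is the tokens-per-second point made above.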

zackmorris · 5 months ago

Don't get me started. Many new computers come with an NPU of some kind, which is superfluous to a GPU.

But what's really going on is that we never got the highly multicore and distributed computers that could have started going mainstream in the 1980s, and certainly by the late 1990s when high-speed internet hit. So single-threaded performance is about the same now as 20 years ago. Meanwhile video cards have gotten exponentially more powerful and affordable, but without the virtual memory and virtualization capabilities of CPUs, so we're seeing ridiculous artificial limitations like not being able to run certain LLMs because the hardware "isn't powerful enough", rather than just having a slower experience or borrowing the PC in the next room for more computing power.

To go to the incredible lengths that Apple went to in designing the M1, not just wrt hardware but in adding yet another layer of software emulation since the 68000 days, without actually bringing multicore with local memories to the level that today's VLSI design rules could allow, would be laughable to me if it weren't so tragic.

It's hard for me to live and work in a tech status quo so far removed from what I had envisioned growing up. We're practically at AGI, but also mired in ensh@ttification. Reflected in politics too. We'll have the first trillionaire before we solve world hunger, and I'm bracing for Skynet/Ultron before we have C3P0/JARVIS.

jondwillis · 5 months ago

https://github.com/Anemll/Anemll

So far I've not run into the kind of use cases that local LLMs can convincingly provide without making me feel like I'm using the first ever ChatGPT from 2022, in that they are limited and quite limiting. I am curious about what use cases the community has found that work for them. The example that one user has given in this thread about their local LLM inventing a Sun Tzu interview is exactly the kind of limitation I'm talking about. How does one use a local LLM to do something actually useful?

narrator · 5 months ago

I have tried a lot of different LLMs and Gemma3:27b on a 48gb+ Macbook is probably the best for analyzing diaries and personal stuff you don't want to share with the cloud. The China models are comically bad with life advice. For example, I asked Deepseek to read my diaries and talk to me about my life goals and it told me in a very Confucian manner what the proper relationships in my life were for my stage of life and station in society. Gemma is much more western.

solardev · 5 months ago

Lol, that's actually kinda cool. Did you get any interesting Eastern responses to your diary entries?

I'm imagining something like...

> Dear diary, I got bullied again today, and the bread was stale in my PB&J :(

>> My son, remember this: The one who mocks others wounds his own virtue. The one who suffers mockery must guard his heart. To endure without hatred is strength; to strike without cause is disgrace. The noble one corrects himself first, then the world will follow.

elorant · 5 months ago

Chinese models are also awful with translations. Even the Deepseek R1 model performs worse than Mistral small.

punitvthakkar · 5 months ago

That is fascinating. One insight I read about LLMs is that they do represent a world-view of the people who train it, and hence the country that ships the dominant LLM technology can spread widely its world-view on others. Your experience seems to validate that insight.

crazygringo · 5 months ago

I see local LLM's being used mainly for automation as opposed to factual knowledge -- for classification, summarization, search, and things like grammar checking.

So they need to be smart about your desired language(s) and all the everyday concepts we use in it (so they can understand the content of documents and messages), but they don't need any of the detailed factual knowledge around human history, programming languages and libraries, health, and everything else.

The idea is that you don't prompt the LLM directly, but your OS tools make use of it, and applications prompt it as frequently as they fetch URL's.
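A minimal sketch of what "applications prompt it" could look like, assuming an Ollama server is running locally on its default port and exposing its `/api/generate` endpoint. The model name `gemma3:4b` and the label set are assumptions for illustration; any locally pulled model would do.

```python
import json
import urllib.request

# Assumes a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_classify_request(text, labels, model="gemma3:4b"):
    """Build the JSON body for a one-shot classification prompt.
    The model name is an assumption, not a recommendation."""
    prompt = (
        "Classify the message into exactly one of: " + ", ".join(labels) + ".\n"
        "Reply with the label only.\n\nMessage:\n" + text
    )
    return {"model": model, "prompt": prompt, "stream": False}

def classify(text, labels):
    """Send the request and return the model's reply; needs the server running."""
    body = json.dumps(build_classify_request(text, labels)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

# Building the payload needs no server, so it is safe to inspect offline.
payload = build_classify_request("Your invoice is attached.", ["billing", "support", "spam"])
print(payload["prompt"])
```

An application would call `classify(...)` the same way it calls any other local service, with no user-facing chat window involved.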

theshrike79 · 5 months ago

And local models are static, predictable and don't just go away when a new one comes out.

This makes them perfect for automation tasks.

dxetech · 5 months ago

There are situations where internet access is limited, or where there are frequent outages. An outdated LLM might be more useful than none at all. For example: my internet is out due to a severe storm, what safety precautions do I need to take?

theshrike79 · 5 months ago

Just use Kiwix: https://kiwix.org/en/

punitvthakkar · 5 months ago

Yes - emergency use cases make tons of sense.

volemo · 5 months ago

Surely not the ones you get from an LLM?

jondwillis · 5 months ago

I use, or at least try to use local models while prototyping/developing apps.

First, they control costs during development, which depending on what you're doing, can get quite expensive for low or no budget projects.

Second, they force me to have more constraints and more carefully compose things. If a local model (albeit something somewhat capable like gpt-oss or qwen3) can start to piece together this agentic workflow I am trying to model, chances are, it'll start working quite well and quite quickly if I switch to even a budget cloud model (something like gpt-5-mini.)

However, dealing with these constraints might not be worth the time if you can stuff all of the documents in your context window for the cloud models and get good results, but it will probably be cheaper and faster on an ongoing basis to have split the task up.

vorticalbox · 5 months ago

I keep a lot of notes, all my thoughts feelings both happy and sad, things I’ve done, etc. in obsidian. These are deeply personal and I don’t want this going to a cloud provider even if they “say” they don’t train on my chats.

I forget a lot of things so I feed these into ChromaDB and then use an LLM to chat with all my notes.

I’ve started using abliterated models which have their refusal removed [0]

Other use case is for work. I work with financial data and I have created an mcp that automates some of my job. Running model locally allows me to not worry about the information I feed it.

[0] https://github.com/Sumandora/remove-refusals-with-transforme...

dragonwriter · 5 months ago

Well, a lot of what is possible with local models depends on what your local hardware is, but docling is a pretty good example of a library that can use local models (VLMs instead of regular LLMs) “under the hood” for productive tasks.

ivape · 5 months ago

I use Claude Code in the terminal mostly just to figure out what to commit and what to write for the commit message. I believe a solid 7-8B model can do this locally.

So, that’s at least one small highly useful workflow robot I have a use for (and very easy to cook up on your own).
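A rough sketch of such a commit-message helper, assuming it is run inside a git repository. It only gathers the staged diff and wraps it in a prompt; sending that prompt to a local model is left out. The 8000-character budget is an assumed figure, not a property of any particular model.

```python
import subprocess

def staged_diff():
    """Return the staged diff; must be run inside a git repository."""
    return subprocess.run(
        ["git", "diff", "--staged"], capture_output=True, text=True, check=True
    ).stdout

def build_commit_prompt(diff, max_chars=8000):
    """Wrap a diff in instructions, truncating it so it fits a small model's
    context window. max_chars is an assumed budget."""
    if len(diff) > max_chars:
        diff = diff[:max_chars] + "\n[diff truncated]"
    return (
        "Write a one-line commit subject (max 72 chars) and a short body "
        "for the following staged changes.\n\n" + diff
    )

# Example with a hand-written diff, so it runs without a repository.
example = build_commit_prompt("diff --git a/app.py b/app.py\n+print('hi')\n")
print(example)
```

In real use the input would be `build_commit_prompt(staged_diff())`, so nothing leaves the machine.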

I also have a use for terminal command autocompletion, which again, a small model can be great for.

Something felt kind of really wrong about sending entire folder contents over to Claude online, so I am absolutely looking to create the toolkit locally.

The universe of offline is just getting started, and these big companies literally are telling you “watch out, we save this stuff”.

rukuu001 · 5 months ago

I'm running Gemma3-270M locally (MLX). I got a Python script that pulls down emails based on a whitelist and summarises them. The 270M model does a good job of this. This is running in a terminal. It means I barely look at my email during the day.
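This is not the script described above, just a minimal sketch of the whitelist-then-summarise idea using only the standard library. The addresses are hypothetical, fetching mail (IMAP etc.) is left out, and the generated prompt would still need to be handed to a local model.

```python
from email import message_from_string
from email.utils import parseaddr

WHITELIST = {"boss@example.com", "alerts@example.com"}  # hypothetical addresses

def is_whitelisted(raw_message, whitelist=WHITELIST):
    """True if the From: address (case-insensitive) is on the whitelist."""
    msg = message_from_string(raw_message)
    _, address = parseaddr(msg.get("From", ""))
    return address.lower() in whitelist

def summary_prompt(raw_message):
    """Build a summarisation prompt; handles plain-text, single-part mail only."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    body = msg.get_payload()
    return f"Summarise this email in two sentences.\nSubject: {subject}\n\n{body}"

raw = (
    "From: Boss <Boss@Example.com>\n"
    "Subject: Q3 plan\n"
    "\n"
    "Please review the attached plan by Friday.\n"
)
if is_whitelisted(raw):
    print(summary_prompt(raw))
```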

ghilston · 5 months ago

Any willingness to share this script? I've been working on some code to ingest things and summarize for them and I haven't gotten to email just yet.

luckydata · 5 months ago

Local models can do embedding very well, which is useful for things like building a screenshot manager for example.

punitvthakkar · 5 months ago

Whoa. I didn't think about using embeddings for screenshot management. How would I do this?
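One way this could work, sketched under assumptions: each screenshot is turned into an embedding vector once (from its OCR text or via an image embedding model), and a search embeds the query and ranks screenshots by cosine similarity. The 3-dimensional vectors and file names below are made up; real embeddings have hundreds of dimensions and come from a model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, index, top_k=2):
    """Return the top_k file names ranked by similarity to the query vector."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Made-up vectors standing in for real embeddings of each screenshot.
index = {
    "invoice_march.png": [0.9, 0.1, 0.0],
    "error_dialog.png": [0.1, 0.9, 0.1],
    "holiday_photo.png": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # pretend embedding of the query "receipt from March"
print(search(query, index))
```

A linear scan like this is fine for a few thousand screenshots; a vector database only becomes worthwhile at larger scale.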

bityard · 5 months ago

I use a local LLM for lots of little things that I used to use search engines for. Defining words, looking up unicode symbols for copy/paste, reminders on how to do X in bash or Python. Sometimes I use it as a starting point for high-level questions and curiosity and then move to actual human content or larger online models for more details and/or fact-checking if needed.

If your computer is somewhat modern and has a decent amount of RAM to spare, it can probably run one of the smaller-but-still-useful models just fine, even without a GPU.

My reasons:

1) Search engines are actively incentivized to not show useful results. SEO-optimized clickbait articles contain long, fluffy, contentless prose intermixed with ads. The longer they can keep you "searching" for the information instead of "finding" it, the better it is for their bottom line. Because if you actually manage to find the information you're looking for, you close the tab and stop looking at ads. If you don't find what you need, you keep scrolling and generate more ad revenue for the advertisers and search engines. It's exactly the same reason online dating sites are futile for most people: every successful match made results in two lost customers, which is bad for revenue.

LLMs (even local ones in some cases) are quite good at giving you direct answers to direct questions which is 90% of my use for search engines to begin with. Yes, sometimes they hallucinate. No, it's not usually a big deal if you apply some common sense.

2) Most datacenter-hosted LLMs don't have ads built into them now, but they will. As soon as we get used to "trusting" hosted models due to how good they have become, the model developers and operators will figure out how to turn the model into a sneaky salesman. You'll ask it for the specs on a certain model of Dell laptop and it will pretend it didn't hear you and reply, "You should try HP's latest line of business-class notebooks, they're fast, affordable, and come in 5 fabulous colors to suit your unique personal style!" I want to make sure I'm emphasizing that it's not IF this happens, it's WHEN.

Local LLMs COULD have advertising at some point, but it will probably be rare and/or weird as these smaller models are meant mainly for development and further experimentation. I have faith that some open-weight models will always exist in some form, even if they never rival commercially-hosted models in overall quality.

3) I've made peace with the fact that data privacy in the age of Big Tech is a myth, but that doesn't mean I can't minimize my exposure by keeping some of my random musings and queries to myself. Self-hosted AI models will never be as "good" as the ones hosted in datacenters, but they are still plenty useful.

4) I'm still in the early stages of this, but I can develop my own tools around small local models without paying a hosted model provider and/or becoming their product.

5) I was a huge skeptic about the overall value of AI during all of the initial hype. Then I realized that this stuff isn't some fad that will disappear tomorrow. It will get better. The experience will get more refined. It will get more accurate. It will consume less energy. It will be totally ubiquitous. If you fail to come up to speed on some important new technology or trend, you will be left in the dust by those who do. I understand the skepticism and pushback, but the future moves forward regardless.

punitvthakkar · 5 months ago

All totally valid points and insights. This is great, thank you!

jeffybefffy519 · 5 months ago

Gemma3 is pretty useful on a long haul flight without internet

kristopolous · 5 months ago

Kimi K2 by Moonshot. Give it a go.

bigyabai · 5 months ago

Qwen3 A3B (in my experience) writes code as-good-as ChatGPT 4o and much better than GPT-OSS.

hu3 · 5 months ago

I just tested Qwen3 A3B vs ChatGPT with a random prompt from my head:

> Please write a C# middleware to block requests from browser agents that contain any word in a specified list of words: openai, grok, gemini, claude.

I used ChatGPT 4o from GitHub Copilot inside VSCode. And Qwen3 A3B from here: https://deepinfra.com/Qwen/Qwen3-30B-A3B

ChatGPT 4o was considerably better. Less verbose and fewer unnecessary abstractions.

ActorNightly · 5 months ago

Smaller models require a lot more direction, a.k.a. system prompt engineering, and sometimes custom wrappers. For example, Gemma models are very eager to generate code even if you tell them not to.

mentalgear · 5 months ago

something like rewind or openRecall can use local LLMs for on-device semantic search.

segmondy · 5 months ago

The same way you use a cloud LLM.

oblio · 5 months ago

I think the point was that, for programming for example, people perceive state of the art LLMs as being net positive contributors, at least for mainstream programming languages and tasks, and I guess local LLMs aren't net positive contributors (i.e. an experienced programmer can build the same thing at least as fast without one).