mark_l_watson · 16 hours ago
I have spent a HUGE amount of time the last two years experimenting with local models.

A few lessons learned:

1. small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.

2. For coding tools, just use Google Antigravity and gemini-cli, or, Anthropic Claude, or...

Now to be clear, I have spent perhaps 100 hours in the last year configuring local models for coding using Emacs, Claude Code (configured for local), etc. However, I am retired and this time was a lot of fun for me: lots of effort trying to maximize local-only results. I don't recommend it for others.

I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.

sdrinf · 14 hours ago
Just want to echo the recommendation for qwen3.5:9b. This is a smol, thinking, agentic, tool-using, text-image multimodal creature with very good internal chains of thought. The CoT can sometimes be excessive, but it leads to a very stable decision-making process, even across very large contexts, something we haven't seen in models of this size before.

What's also new here is the VRAM/context-size trade-off: for 25% of its attention layers they use the regular KV cache for global coherency, but for the other 75% they use a new KV cache whose memory grows linearly(!!!!) with context size. That means, e.g., ~100K tokens -> ~1.5GB of VRAM use, so for the first time you can do extremely long conversations / document processing on, e.g., a 3060.
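That memory claim can be sanity-checked with a back-of-envelope sketch. All architecture numbers below (layer counts, KV heads, head dim) are illustrative assumptions, not qwen3.5's actual config; the point is that only the full-attention layers pay a per-token cost, while the linear-attention layers keep a bounded state.

```python
# Back-of-envelope KV-cache sizing for a hybrid-attention model.
# All architecture numbers are illustrative assumptions, NOT qwen3.5's real config.
def kv_cache_bytes(tokens: int,
                   full_attn_layers: int,
                   linear_attn_layers: int,
                   kv_heads: int = 4,
                   head_dim: int = 128,
                   dtype_bytes: int = 2,
                   linear_state_tokens: int = 512) -> int:
    per_token = 2 * kv_heads * head_dim * dtype_bytes      # K and V, per layer
    full = full_attn_layers * tokens * per_token           # grows with context
    linear = linear_attn_layers * linear_state_tokens * per_token  # bounded state
    return full + linear

# e.g. 9 full-attention layers out of 36 (the 25%/75% split), 100K context:
gib = kv_cache_bytes(100_000, full_attn_layers=9, linear_attn_layers=27) / 2**30
```

With these made-up numbers the cache lands in the same ballpark as the ~1.5GB figure, versus roughly 4x that if all 36 layers used the regular cache.
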

Strong, strong recommend.

steve_adams_86 · 14 hours ago
I've been building a harness for qwen3.5:9b lately (to better understand how to create agentic tools / have fun). I'm not going to use it instead of Opus 4.6 for my day job, but it's remarkably useful for small tasks, and more than snappy enough on my equipment. It's a fun model to experiment with. I was previously using an old model from Meta, and the contrast in capability is pretty crazy.

I like the idea of finding practical uses for it, but so far haven't managed to be creative enough. I'm so accustomed to using these things for programming.

threecheese · 12 hours ago
You can really see the limitations of qwen3.5:9b in its reasoning traces, and it's fascinating. When a question “goes bad”, the thinking tokens are sometimes WILD; it's like watching Poirot after a head injury.

Example: “what is the air speed velocity of a swallow?” Qwen knew it was a Monty Python gag, but couldn't and didn't figure out which one.

kingo55 · 13 hours ago
How does it compare in quality with larger models in the same series, e.g. the 122B?
ggsp · 13 hours ago
How much difference are you seeing between standard and Q4 versions in terms of degradation, and is it constant across tasks or more noticeable in some vs others?

johnmaguire · 16 hours ago
I'd love to know how you fit smaller models into your workflow. I have an M4 Macbook Pro w/ 128GB RAM and while I have toyed with some models via ollama, I haven't really found a nice workflow for them yet.
philipkglass · 16 hours ago
It really depends on the tasks you have to perform. I am using specialized OCR models running locally to extract page layout information and text from scanned legal documents. The quality isn't perfect, but it is really good compared to desktop/server OCR software that I formerly used that cost hundreds or thousands of dollars for a license. If you have similar needs and the time to try just one model, start with GLM-OCR.

If you want a general knowledge model for answering questions or a coding agent, nothing you can run on your MacBook will come close to the frontier models. It's going to be frustrating if you try to use local models that way. But there are a lot of useful applications for local-sized models when it comes to interpreting and transforming unstructured data.

tempaccount5050 · 12 hours ago
Not OP but I had an XML file with inconsistent formatting for album releases. I wanted to extract YouTube links from it, but the formatting was different from album to album. Nothing you could regex or filter manually. I shoved it all into a DB, looked up the album, then gave the xml to a local LLM and said "give me the song/YouTube pairs from this DB entry". Worked like a charm.
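A minimal sketch of that workflow, assuming a local Ollama server on its default port; the model name, prompt wording, and response handling are placeholders, not the original poster's code.

```python
# Hypothetical sketch: hand a messy XML fragment to a local model via
# Ollama's /api/generate endpoint and ask for structured pairs back.
import json
import urllib.request

def build_prompt(xml_fragment: str) -> str:
    """Wrap the inconsistent XML in an extraction instruction."""
    return (
        "Extract every song title and its YouTube link from this XML. "
        "Return one 'title<TAB>url' pair per line, nothing else.\n\n"
        + xml_fragment
    )

def ask_local_llm(xml_fragment: str, model: str = "qwen3.5:9b") -> str:
    """Send the prompt to a locally running Ollama instance (assumed running)."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({
            "model": model,
            "prompt": build_prompt(xml_fragment),
            "stream": False,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```
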
Bluecobra · 15 hours ago
I didn’t realize that you can get 128GB of memory in a notebook, that is impressive!
saltwounds · 15 hours ago
I use Raycast and connect it to LM Studio to run text cleanup and summaries often. The models are small enough that I keep them in memory more often than not.

echelon · 14 hours ago
Shouldn't we prioritize large scale open weights and open source cloud infra?

An OpenRunPod with decent usage might encourage more non-leading labs to dump foundation models into the commons. We just need the infra to run them. Distilling them down to desktop is a fool's errand; they're meant to run on DC compute.

I'm fine with running everything in the cloud as long as we own the software infra and the weights.

Conceivably the only way we catch up to Claude Code is for the Chinese labs to start releasing their best coding models and for those models to get significant traction with companies calling out to hosted versions. Otherwise, we're going to be stuck in a take-off scenario with no bridge.

flutetornado · 11 hours ago
My experience with qwen3.5 9b has not been the same. It's definitely good at agentic responses, but it hallucinates a lot: 30%-50% of the content it generated for a research task (local code repo exploration) turned out to be plain wrong, to the extent of made-up file names and function names. I ran its output through Kimi K2 and asked it to verify the output, and it found that much of what qwen3.5 had figured out after agentic exploration was plain wrong. So use smaller models, but be very cautious about how much you depend on their output.
adamkittelson · 13 hours ago
Anecdotal but for some reason I had a pretty bad time with qwen3.5 locally for tool usage. I've been using GPT-OSS-120B successfully and switched to qwen so that I could process images as well (I'm using this for a discord chat bot).

Everything worked fine on GPT, but Qwen as often as not preferred to pretend to call a tool rather than actually call it. After much aggravation I wound up setting my bot / llama-swap to use GPT for chat, load Qwen only when someone posts an image, process / respond to the image with Qwen, and pop back over to GPT when the next chat comes in.

GorbachevyChase · 12 hours ago
You are responsible for the dead internet theory.
dhblumenfeld1 · 13 hours ago
Have you found that using a frontier model for planning and small local model for writing code to be a solid workflow? Been wanting to experiment with relying less on Claude Code/Codex and more on local models.
dataflow · 13 hours ago
Thanks for sharing this, it's super helpful. I have a question if you don't mind: I want a model that I can feed, say, my entire email mailbox to, so that I can ask it questions later. (Just the text content, which I can clean and preprocess offline for its use.) Have any offline models you've dealt with seemed suitable for that sort of use case, with that volume of content?
lilactown · 4 hours ago
If your inbox is as big as mine, you won’t be able to load all the text content into a prompt even with SotA cloud hosted models.

Instead you should give it tools to search over the mailbox for terms, labels, addresses, etc. so that the model can do fine grained filters based on the query, read the relevant emails it finds, then answer the question.
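A minimal sketch of that tool-based approach. The search function and the tool schema below are hypothetical (the schema follows the common OpenAI-style function-calling shape; your local runtime's field names may differ):

```python
# Hypothetical mailbox-search tool for a local agentic model: instead of
# stuffing the whole mailbox into the prompt, the model calls this to
# retrieve only the relevant emails.
from typing import TypedDict

class Email(TypedDict):
    sender: str
    subject: str
    body: str

def search_mailbox(emails: list[Email], term: str, limit: int = 5) -> list[Email]:
    """Return up to `limit` emails whose subject or body mention the term."""
    term = term.lower()
    hits = [e for e in emails
            if term in e["subject"].lower() or term in e["body"].lower()]
    return hits[:limit]

# Tool description handed to the model (OpenAI-style function-calling format):
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_mailbox",
        "description": "Search emails by keyword; returns the closest matches.",
        "parameters": {
            "type": "object",
            "properties": {
                "term": {"type": "string", "description": "Keyword to search for"},
                "limit": {"type": "integer", "default": 5},
            },
            "required": ["term"],
        },
    },
}
```

The model then answers questions by issuing one or more `search_mailbox` calls and reading only the hits, which keeps the context small regardless of mailbox size.
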

perbu · 13 hours ago
Prompt injection is a problem if your agent has access to anything.

The local models are quite weak here.

eek2121 · 13 hours ago
Qwen is actually really good at code as well. I used qwen3-coder-next a while back and it was every bit as good as Claude Code in the use cases I tested. Both made about the same number of mistakes, and both did a good job on the rest.
storus · 11 hours ago
Coding locally with Qwen3-Coder-Next or Qwen-3.5 is a piece of cake on a workstation card (RTX Pro 6000): set it up in llama.cpp or vLLM in an hour, install Claude Code, point it at the local API hostname with a fake secret key, and run it like a regular Claude setup, but on Qwen.
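A hypothetical sketch of that setup; exact server flags and environment variables vary by llama.cpp / Claude Code version, and depending on versions you may need an Anthropic-to-OpenAI translation proxy between the two. The model filename and port are placeholders.

```shell
# 1. Serve the model locally (model path is a placeholder):
llama-server -m Qwen3.5-9B-Q4_K_M.gguf --port 8080

# 2. In another terminal, point Claude Code at the local endpoint
#    with a dummy key, then launch it as usual:
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_API_KEY="local-dummy-key"
claude
```
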
chrisweekly · 12 hours ago
Thanks for this, Mark. And for your website and books and generosity of spirit. Signal in the noise. Have an awesome weekend!
sakesun · 12 hours ago
Becoming a retired builder is the ultimate bliss.
manmal · 16 hours ago
What about running e.g. Qwen3.5 128B on a rented RTX Pro 6000?
girvo · 13 hours ago
IMO you’re better off using qwen3.5-plus through the model studio coding plan, but ymmv
nine_k · 16 hours ago
What kind of hardware did you use? I suppose that an 8GB gaming GPU and a Mac Pro with 512 GB unified RAM give quite different results, both formally being local.
fzzzy · 15 hours ago
A Mac Pro with 512 gb unified ram does not exist.
cyanydeez · 14 hours ago
Cline (https://marketplace.visualstudio.com/items?itemName=saoudriz...) in VS Code, inside a code-server run within Docker (https://docs.linuxserver.io/images/docker-code-server/), using LM Studio (https://lmstudio.ai/) to access Unsloth models (https://unsloth.ai/docs/get-started/unsloth-model-catalog), specifically (https://unsloth.ai/docs/models/qwen3-coder-next), appears to be right at the edge of productivity, as long as you realize what complexity means when issuing tasks.
kylehotchkiss · 16 hours ago
I've been really interested in the difference between 3.5 9b and 14b for information extraction. Is there a discernible difference in quality or capability?

meatmanek · 17 hours ago
This seems to be estimating based on memory bandwidth / size of model, which is a really good estimate for dense models, but MoE models like GPT-OSS-20b don't involve the entire model for every token, so they can produce more tokens/second on the same hardware. GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.

(In terms of intelligence, they tend to score similarly to a dense model that's as big as the geometric mean of the full model size and the active parameters, i.e. for GPT-OSS-20B, it's roughly as smart as a sqrt(20b*3.6b) ≈ 8.5b dense model, but produces tokens 2x faster.)
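Both rules of thumb from this comment can be put into a quick sketch. The bandwidth figure and bytes-per-parameter are assumptions for illustration, not measurements:

```python
import math

# Rule of thumb 1: memory-bound decode speed ≈ bandwidth / bytes read per token.
# For a MoE model, only the ACTIVE parameters are read per token.
def decode_tokens_per_sec(bandwidth_gb_s: float,
                          active_params_b: float,
                          bytes_per_param: float = 2) -> float:
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# Rule of thumb 2: a MoE model scores roughly like a dense model whose size is
# the geometric mean of total and active parameter counts.
def moe_effective_dense_b(total_b: float, active_b: float) -> float:
    return math.sqrt(total_b * active_b)

moe_effective_dense_b(20, 3.6)   # GPT-OSS-20B "feels like" an ~8.5B dense model
decode_tokens_per_sec(400, 3.6)  # e.g. on an assumed 400 GB/s machine, fp16 weights
```
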

lambda · 17 hours ago
Yeah, I looked up some models I have actually run locally on my Strix Halo laptop, and it's saying I should have much lower performance than I actually have on models I've tested.

For MoE models, it should be using the active parameters in memory bandwidth computation, not the total parameters.

tommy_axle · 16 hours ago
I'm guessing this is also calculating based on the full context size that the model supports but depending on your use case it will be misleading. Even on a small consumer card with Qwen 3 30B-A3B you probably don't need 128K context depending on what you're doing so a smaller context and some tensor overrides will help. llama.cpp's llama-fit-params is helpful in those cases.
pbronez · 17 hours ago
The docs page addresses this:

> A Mixture of Experts model splits its parameters into groups called "experts." On each token, only a few experts are active — for example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token. This means you get the quality of a larger model with the speed of a smaller one. The tradeoff: the full model still needs to fit in memory, even though only part of it runs at inference time.

> A dense model activates all its parameters for every token — what you see is what you get. A MoE model has more total parameters but only uses a subset per token. Dense models are simpler and more predictable in terms of memory/speed. MoE models can punch above their weight in quality but need more VRAM than their active parameter count suggests.

https://www.canirun.ai/docs

lambda · 16 hours ago
It discusses it, and they have data showing that they know the number of active parameters on a MoE model, but they don't seem to use it in their calculation. It gives me answers far lower than my real-world usage on my setup; its calculation lines up fairly well with what I'd see if I were trying to run a dense model of that size. Or, if I increase my memory bandwidth in the calculator by a factor of 10 or so, roughly the ratio of total to active parameters in the model, I get results that are much closer to real-world usage.
littlestymaar · 17 hours ago
While your remark is valid, there are two small inaccuracies here:

> GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.

First, the token generation speed is going to be comparable, but not the prefill speed (context processing is going to be much slower on a big MoE than on a small dense model).

Second, without speculative decoding, it is correct to say that a small dense model and a bigger MoE with the same number of active parameters are going to be roughly as fast. But with a small dense model you will see token generation improvements from speculative decoding (up to 3x the speed), whereas you probably won't gain much from speculative decoding on a MoE model (because two consecutive tokens won't trigger the same “experts”, so you'd need to load more weights into the compute units, using more bandwidth).

lambda · 15 hours ago
So, this is all true, but this calculation isn't that nuanced. It's trying to get you into a ballpark range, and based on my usage on my real hardware (if I put in my specs, since it's not in their hardware list), the results are fairly close to my real experience if I compensate for the issue where it's calculating based on total params instead of active.

So by doing that, this calculator is telling you that you should run entirely dense models, and sparse MoE models that may be both faster and better-performing are not recommended.

mopierotti · 16 hours ago
This (plus llmfit) is a great attempt, but I've been generally frustrated by how hard it feels to find any guidance on what I would expect to be the most straightforward/common question:

"What is the highest-quality model that I can run on my hardware, with tok/s greater than <x>, and context limit greater than <y>"

(My personal approach has just devolved into guess-and-check, which is time consuming.) When using TFA/llmfit, I am immediately skeptical because I already know that Qwen 3.5 27B Q6 @ 100k context works great on my machine, but it's buried behind relatively obsolete suggestions like the Qwen 2.5 series.

I'm assuming this is because the tok/s is much higher, but I don't really get much marginal utility out of tok/s speeds beyond ~50 t/s, and there's no way to sort results by quality.

comboy · 14 hours ago
What is the $/Mtok that would make you choose your time vs savings of running stuff locally?

Just to be clear, it may sound like a snarky comment, but I'm really curious how you or others see it. I mean, there are some long-running batch tasks where, ignoring electricity, it's kind of free, but usually local generation is slower (and worse quality), and we all kind of want stuff to get done.

Or is it not about the cost at all, just about not pushing your data into the clouds?

mopierotti · 13 hours ago
Good question. I agree with what I think you're implying, which is that local generation is not the right choice if you want to maximize results per time/$ spent. In my experience, hosted models like Claude Opus 4.6 are just so effective that it's hard to justify using much else.

Nevertheless, I spend a lot of time with local models because of:

1. Pure engineering/academic curiosity. It's a blast to experiment with low-level settings/finetunes/lora's/etc. (I have a Cog Sci/ML/software eng background.)

2. I prefer not to share my data with 3rd party services, and it's also nice to not have to worry too much about accidentally pasting sensitive data into prompts (like personal health notes), or if I'm wasting $ with silly experiments, or if I'm accidentally poisoning some stateful cross-session 'memories' linked to an account.

3. It's nice to be able to solve simple tasks without having to reason about any external 'side effects' outside my machine.

wilkystyle · 13 hours ago
For me it's a combination of privacy and wanting to be able to experiment as much as I want without limits. I'd happily take something that is 80% as good as SOTA but I can run it locally 24/7. I don't think there's anything out there yet that would 100% obviate my desire to at least occasionally fall back to e.g. Claude, but I think most of it could be done locally if I had infinite tokens to throw at it.
phillmv · 13 hours ago
i can think of some tasks (classification, structured info extraction) that i _imagine_ even small meh models could do quite well at

on data i would never ever want to upload to any vendor if i can avoid it

0xbadcafebee · 12 hours ago
Too generic a question. Gotta be more specific:

   "what is the best open weight model for high-quality coding that fits in 8GB VRAM and 32GB system RAM with t/s >= 30 and context >= 32768" -> Qwen2.5-Coder-7B-Instruct

   "what is the best open weight model for research w/web search that fits in 24GB VRAM and 32GB system RAM with t/s >= 60 and context >= 400k" -> Qwen3-30B-A3B-Instruct-2507

   "what is the best open weight embedding model for RAG on a collection of 100,000 documents that fits in 40GB VRAM and 128GB system RAM with t/s >= 50 and context >= 200k" -> Qwen3-Embedding-8B

Specific models & sizes for specific use cases on specific hardware at specific speeds.

J_Shelby_J · 16 hours ago
It’s a hard problem. I’ve been working on it for the better part of a year.

Well, granted, my project is trying to do this in a way that works across multiple devices and supports multiple models, to find the best "quality" and the best allocation. And that puts an exponential search space on top of the project.

But “quality” is the hard part. In this case I’m just choosing the largest quants.

mopierotti · 14 hours ago
Supporting all the various devices does sound quite challenging.

I wouldn't expect a perfect single measurement of "quality" to exist, but it seems like it could be approximated enough to at least be directionally useful. (e.g. comparing subsequent releases of the same model family)

downrightmike · 15 hours ago
LLMs are just special-purpose calculators, as opposed to normal calculators, which just do numbers and MUST be accurate. There aren't very good ways of knowing what you want, because the people making the models can't read your mind and have different goals.
suheilaaita · an hour ago
The simplest way to really start: use anything like Claude Code, VS Code, Cursor, Antigravity (or any other IDE), and ask it to install Ollama and pull the latest solid local model that you can run based on your computer specs.

Wait 5-10 minutes, and should be done.

It genuinely is that simple.

You can even use local models through the Claude Code or Codex infrastructure (MASSIVE UNLOCK), but you need solid GPU(s) to run decent models. So that's the downside.

cloogshicer · 41 minutes ago
Genuine question, will this actually give you the latest solid local model?

I would've thought no, because of the knowledge cutoff in whatever model you use to download it.

twampss · 18 hours ago
Is this just llmfit but a web version of it?

https://github.com/AlexsJones/llmfit

deanc · 18 hours ago
Yes. But llmfit is far more useful as it detects your system resources.
Someone1234 · 15 hours ago
I feel like they both solve different issues well:

- If you already HAVE a computer and are looking for models: LLMFit

- If you are looking to BUY a computer/hardware, and want to compare/contrast for local LLM usage: This

You cannot exactly run LLMFit on hardware you don't have.

dgrin91 · 18 hours ago
Honestly I was surprised by this. It accurately got my GPU and specs without asking for any permissions. I didn't realize I was exposing this info.
rootusrootus · 17 hours ago
That's super handy, thanks for sharing the link. Way more useful than the web site this post is about, to be honest.

It looks like I can run more local LLMs than I thought, I'll have to give some of those a try. I have decent memory (96GB) but my M2 Max MBP is a few years old now and I figured it would be getting inadequate for the latest models. But llmfit thinks it's a really good fit for the vast majority of them. Interesting!

hrmtst93837 · 15 hours ago
Your hardware can run a good range of local models, but keep an eye on quantization since 4-bit models trade off some accuracy, especially with longer context or tougher tasks. Thermal throttling is also an issue, since even Apple silicon can slow down when all cores are pushed for a while, so sustained performance might not match benchmark numbers.
LeifCarrotson · 18 hours ago
This lacks a whole lot of mobile GPUs. It also does not understand that you can share CPU memory with the GPU, or perform various KV cache offloading strategies to work around memory limits.

It says I have an Arc 750 with 2 GB of shared RAM, because that's the GPU that renders my browser... but I actually have an RTX 1000 Ada with 6 GB of GDDR6. It's kind of like an RTX 4050 (not listed in the dropdowns) with lower thermal limits. I also have 64 GB of LPDDR5 main memory.

It works - Qwen3 Coder Next, Devstral Small, Qwen3.5 4B, and others can run locally on my laptop in near real-time. They're not quite as good as the latest models, and I've tried some bigger ones (up to 24GB, it produces tokens about half as fast as I can type...which is disappointingly slow) that are slower but smarter.

But I don't run out of tokens.

rahimnathwani · 9 hours ago
This site presents models in an incomplete and misleading way.

When I visit the site with an Apple M1 Max with 32GB RAM, the first model that's listed is Llama 3.1 8B, which is listed as needing 4.1GB RAM.

But the weights for Llama 3.1 8B are over 16GB. You can see that here in the official HF repo: https://huggingface.co/meta-llama/Llama-3.1-8B/tree/main

The model this site calls 'Llama 3.1 8B' is actually a 4-bit quantized version (Q4_K_M) available on ollama.com/library: https://ollama.com/library/llama3.1:8b

If you're going to recommend a model to someone based on their hardware, you have to recommend not only a specific model, but a specific version of that model (either the original, or some specific quantized version).

This matters because different quantized versions of the model will have different RAM requirements and different performance characteristics.

Another thing I don't like is that the model names are sometimes misleading. For example, there's a model with the name 'DeepSeek R1 1.5B'. There's only one architecture for DeepSeek R1, and it has 671B parameters. The model they call 'DeepSeek R1 1.5B' does not use that architecture. It's a qwen2 1.5B model that's been finetuned on DeepSeek R1's outputs. (And it's a Q4_K_M quantized version.)
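The gap between 16GB of weights and the site's 4.1GB figure follows from a simple estimate: parameters times bits per weight, divided by 8. The overhead factor below is an assumption to account for quantization scales and metadata, and Q4_K_M's average bits-per-weight is approximate:

```python
# Rough on-disk/in-RAM size of model weights: params * bits-per-weight / 8.
# The 5% overhead for scales/zero-points is an illustrative assumption.
def approx_weights_gb(params_b: float,
                      bits_per_weight: float,
                      overhead: float = 1.05) -> float:
    return params_b * bits_per_weight / 8 * overhead

approx_weights_gb(8, 16)   # fp16 Llama 3.1 8B: roughly the 16+ GB seen on HF
approx_weights_gb(8, 4.5)  # Q4_K_M (averages ~4.5-5 bits/weight): under 5 GB
```

This is why the quantized variant fits where the original never could, and why a recommendation is meaningless without naming the exact quantization.
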

zargon · 8 hours ago
They appear to be using Ollama as a data source. Ollama does that sort of thing regularly.
sxates · 19 hours ago
Cool thing!

A couple suggestions:

1. I have an M3 Ultra with 256GB of memory, but the options list only goes up to 192GB. The M3 Ultra supports up to 512GB.

2. It'd be great if I could flip this around and choose a model, and then see the performance for all the different processors. Would help with buying decisions!

utopcell · 15 hours ago
Unfortunately, Apple retired the 512GiB models.
ProllyInfamous · 15 hours ago
Sure, but those already sold still exist.
ActorNightly · 4 hours ago
> I have an M3 Ultra with 256GB of memory

I'm sorry, but spending this kind of money when you could have just built yourself a dual-3090 workstation that would have been better for pretty much everything, including local models, is just plain stupid.

Hell, even one 3090 can now run Gemma 3 27b qat very fast.

xiconfjs · an hour ago
Except if you are living in a region where electricity is quite expensive :/
gentleman11 · 6 hours ago
Ask Apple to graciously allow you to install your own RAM in the computer you "own".