I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.
I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.
Last night I had it write me a complete plugin for my LLM tool like this:
llm install llm-mlx
llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
llm -m mlx-community/gemma-3-27b-it-qat-4bit \
-f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
-f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
-s 'Write a new fragments plugin in Python that registers
issue:org/repo/123 which fetches that issue
number from the specified github repo and uses the same
markdown logic as the HTML page to turn that into a
fragment'
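For anyone curious what the generated plugin actually has to do: the core of it is just fetching the issue (and its comments) from the GitHub REST API and flattening that to markdown. A rough, hypothetical sketch of that logic - not the actual generated code, and without the llm fragment-registration wiring:

    import json
    import urllib.request

    def fetch_issue_markdown(org: str, repo: str, number: int) -> str:
        """Fetch a GitHub issue plus its comments and flatten them to markdown."""
        base = f"https://api.github.com/repos/{org}/{repo}/issues/{number}"

        def get(url):
            req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)

        issue = get(base)
        comments = get(base + "/comments")

        parts = [f"# {issue['title']}\n\n*Posted by @{issue['user']['login']}*\n\n{issue['body'] or ''}"]
        for c in comments:
            parts.append(f"---\n\n*Comment by @{c['user']['login']}*\n\n{c['body'] or ''}")
        return "\n\n".join(parts)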
More and more I'm starting to realize that cost savings are a minor benefit of local LLMs. If a model is too slow it becomes unusable, so much so that you might as well use public LLM endpoints - unless you really care about getting things done locally without sending information to another server.
With the OpenAI API/ChatGPT, I get responses much faster than I can read, and for simple questions that means I just need a glimpse of the response, copy & paste, and I'm done. Whereas with a local LLM, I watch it painstakingly print preambles I don't care about and only get what I actually need after 20 seconds (on a fast GPU).
And I'm not even getting into context windows, etc.
I have been researching how people integrate local LLMs into their workflows. My finding is that most people play with them for a short time and that's about it; most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than buying a beefed-up Mac Studio or building a machine with a 4090.
My tooling doesn't measure TPS yet. It feels snappy to me on MLX.
I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.
I enjoy local models for research and for the occasional offline scenario.
I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.
> Whereas with a local LLM, I watch it painstakingly print preambles I don't care about and only get what I actually need after 20 seconds.
You may need to "right-size" the models you use to match your hardware and TPS expectations, which may mean using a smaller version of the model with faster TPS, upgrading your hardware, or paying for hosted models.
Alternatively, if you can use agentic workflows or tools like Aider, you don't have to sit and watch a large model work slowly on local hardware. Instead you queue work for it, go to sleep, eat, or do other work, and then much later look over the pull requests whenever it completes them.
Sometimes TPS doesn't matter. I've generated textual descriptions for 100K or so images in my photo archive, some of which I have absolutely no interest in uploading to someone else's computer. This works pretty well with Gemma. I use local LLMs all the time for things where privacy is even remotely important. I estimate this constitutes easily a quarter of my LLM usage.
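If anyone wants to try the same kind of batch job, here's a minimal sketch assuming an Ollama server on its default port and a vision-capable model tag; the folder name and prompt are placeholders:

    import base64, json, pathlib, urllib.request

    def describe(image_path: pathlib.Path, model: str = "gemma3:27b-it-qat") -> str:
        # Send one image to Ollama's /api/generate endpoint as base64 and
        # return the generated description.
        payload = {
            "model": model,
            "prompt": "Describe this image in one paragraph.",
            "images": [base64.b64encode(image_path.read_bytes()).decode()],
            "stream": False,
        }
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    for path in sorted(pathlib.Path("photos").glob("*.jpg")):
        # Store the caption next to the image; TPS barely matters when it
        # just churns away overnight.
        path.with_suffix(".txt").write_text(describe(path))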
> More and more I'm starting to realize that cost savings are a minor benefit of local LLMs. If a model is too slow it becomes unusable, so much so that you might as well use public LLM endpoints - unless you really care about getting things done locally without sending information to another server.
There is another aspect to consider, aside from privacy.
These models are trained by downloading every scrap of information from the internet, including the works of many, many authors who never consented to that. And they are for sure not going to get a share of the profits, if there are ever going to be any. If you use a cloud provider, you are basically saying that is all fine: you are happy to pay them, and to make yourself dependent on their service, based on work that wasn't theirs to use.
However, if you use a local model, the authors still did not give consent, but one could argue that the company that made the model is at least giving back to the community. They don't get any money out of it, and you are not becoming dependent on their hyper-capitalist service. No rent-seeking. The benefits of the work are free for everyone to use. This makes using AI a little more acceptable from a moral standpoint.
Thanks for the callout on this model! I have 42GB of usable VRAM on my ancient (~10-year-old) quad-SLI Titan X workstation and have been looking for a model that balances a large context window with output quality. I'm able to run this model with a 56K context window and it just fits into my 42GB of VRAM, running 100% on GPU. The output quality is really good and a 56K context window is very usable. Nice find!
The original gemma3:27b also took only 22GB using Ollama on my 64GB MacBook. I'm quite confused that the QAT version takes the same amount. Do you know why?
Which model is better? `gemma3:27b`, or `gemma3:27b-qat`?
Both versions are quantized and should use the same amount of RAM. The difference with QAT is that the quantization happens during training, which should result in slightly better output (closer to the bf16 weights).
Quantization-aware training just means having the model deal with quantized values a bit during training, so it handles being quantized after training better. It doesn't change the model size itself.
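For the curious, the usual core trick is "fake quantization" with a straight-through estimator: the forward pass sees weights rounded to the 4-bit grid, while gradients flow through as if nothing happened, so the model learns weights that survive quantization. A minimal PyTorch-style sketch of the general idea (not Google's actual QAT recipe):

    import torch

    def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
        # Symmetric per-tensor fake quantization: forward uses the rounded
        # weights, backward sees the identity (straight-through estimator).
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        return w + (w_q - w).detach()

    class QATLinear(torch.nn.Linear):
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.nn.functional.linear(x, fake_quantize(self.weight), self.bias)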
I've been super impressed with local models on the Mac. I love that the Gemma models have a 128K-token input context size. However, outputs are usually pretty short.
Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?
I haven't tried it yet but there's an MLX project that exposes an OpenAI-compatible serving endpoint that should work with Open WebUI: https://github.com/madroidmaq/mlx-omni-server
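I haven't checked that project's defaults, but "OpenAI-compatible" usually means the standard client just needs a different base_url. Something like this sketch should work, where the port and model name are guesses to swap for whatever the server actually reports:

    from openai import OpenAI

    # Point the standard OpenAI client at the local server (port is a guess).
    client = OpenAI(base_url="http://localhost:10240/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="mlx-community/gemma-3-27b-it-qat-4bit",
        messages=[{"role": "user", "content": "Write a short note about local models."}],
    )
    print(resp.choices[0].message.content)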
I have a few private "vibe check" questions and the 4-bit QAT 27B model got them all correct. I'm kind of shocked at the information density locked in just 13 GB of weights. If anyone at DeepMind is reading this: Gemma 3 27B is the single most impressive open-source model I have ever used. Well done!
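That ~13 GB is about what back-of-the-envelope math predicts for 27B parameters at 4 bits each, ignoring embeddings, quantization scales, and other overhead:

    params = 27e9              # ~27 billion parameters
    bits_per_weight = 4        # 4-bit QAT quantization
    gigabytes = params * bits_per_weight / 8 / 1e9
    print(f"{gigabytes:.1f} GB")  # -> 13.5 GB, close to the observed ~13 GB of weights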
The first graph compares "Elo score" across various models at "native" BF16 precision, and the second compares VRAM usage between native BF16 precision and the QAT models. But since this method is about quantizing while maintaining quality, isn't the obvious graph - quality of BF16 versus QAT - missing? The text doesn't seem to talk about it either, yet it's basically the topic of the blog post.
Indeed, the one thing I was looking for was the Elo/performance of the quantized models, not how good the base model is. Showing how much memory is saved by quantization in a figure is a bit of an insult to the reader's intelligence.
In addition, the "Massive VRAM Savings" graph states what looks like a tautology: reducing from 16 bits to 4 bits unsurprisingly leads to a 4x reduction in memory usage.
Yea they mention a “perplexity drop” relative to naive quantization, but that’s meaningless to me.
> We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.
Wish they showed benchmarks / added quantized versions to the arena! :>
Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32GB Mac.
gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also.
I still like Gemini 2.5 Pro and o3 for brainstorming or working on difficult problems, but for routine work it (simply) makes me feel good to have everything open source/weights running on my own system.
When I bought my 32GB Mac a year ago, I didn't expect to be this happy running gemma3:27b-it-qat with open-codex locally.
Fellow owner of a 32GB MBP here: how much memory does it use while resident - or, if swapping happens, do you see the effects in your day to day work? I’m in the awkward position of using on a daily basis a lot of virtualized bloated Windows software (mostly SAS).
I have the usual programs running on my Mac, along with open-codex: Emacs, web browser, terminals, VSCode, etc. Even with large contexts, open-codex with Ollama and Gemma 3 27B QAT does not seem to overload my system.
To be clear, I sometimes toggle open-codex to use the Gemini 2.5 Pro API as well, but I enjoy running locally for simpler routine work.
Never mind, found your Leanpub book and followed the instructions and at least I have it running with qwen-2.5. I'll investigate what happens with Gemma.
Gemma 3 is way, way better than Llama 4. I think Meta will start to lose its LLM mindshare. Another weakness of Llama 4 is its size: even though it can run fast thanks to MoE, it is simply too large, which limits it to the small percentage of enthusiasts with enough GPU VRAM. Meanwhile, Gemma 3 is widely usable across all hardware sizes.
It seems pretty impressive - I'm running it on my CPU (16-core AMD 3950X) and it's very, very good at translation, and the image description is impressive as well. I'm getting about 2.3 tokens/s (compared to under 1/s with Calme-3.2, which I was previously using).
It does tend to be a bit chatty unless you tell it not to be; it'll give you a 'breakdown' of pretty much everything unless you tell it not to - so for translation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.
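If you drive it programmatically rather than interactively, the same trick works as a fixed system prompt. A minimal sketch using the ollama Python client - assuming you serve the model through Ollama under the gemma3:27b tag; adjust for whatever runner you actually use:

    import ollama  # assumes a local Ollama server is running

    SYSTEM = "Translate the input to English, only output the translation."

    def translate(text: str, model: str = "gemma3:27b") -> str:
        response = ollama.chat(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": text},
            ],
        )
        return response["message"]["content"]

    print(translate("Les modèles locaux deviennent vraiment utilisables."))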
I am running this on a 16 GB AMD Radeon 7900 GRE in a 64 GB machine with ROCm and llama.cpp on Windows 11. I can use Open WebUI or the native GUI for the interface. It is made available via an internal IP to all members of my home.
It runs at around 26 tokens/sec at FP16; FP8 is not supported by the Radeon 7900 GRE.
I just love it.
For coding, QwQ 32B is still king. But with a 16GB VRAM card it gives me ~3 tokens/sec, which is unusable.
I tried to make Gemma 3 write a PowerShell script with a terminal GUI and it ran into dead ends and finally gave up. QwQ 32B performed a lot better.
But for most general purposes it is great. My kid's been using it to feed in his school textbooks and ask it questions. It is better than anything else currently.
Somehow it is more "uptight" than Llama or the Chinese models like Qwen. I can't put my finger on it; the Chinese models seem nicer and more talkative.
Also, ~no one runs an H100 at home, i.e. at batch size 1. What matters is throughput. With 37B active parameters and a massive deployment, throughput (per GPU) should be similar to Gemma's.
So what is the real comparison against DeepSeek R1? It would be good to know which is actually more cost-efficient and open (reproducible build) to run locally.
Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
`ollama run gemma3:27b --verbose` gives me 42.5 TPS +-0.3TPS
`ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS +-0.3TPS
Strange results; the full model gives me slightly more TPS.
query: "make me a snake game in python with pygame"
(mlx 4 bit quant) mlx-community/gemma-3-27b-it-qat@4bit: 26.39 tok/sec • 1681 tokens 0.63s to first token
(gguf 4 bit quant) lmstudio-community/gemma-3-27b-it-qat: 22.72 tok/sec • 1866 tokens 0.49s to first token
using Unsloth's settings: https://docs.unsloth.ai/basics/tutorial-how-to-run-and-fine-...
LM Studio running on a Mac Studio M4 Max with 128GB: with gemma-3-27B-it-QAT-Q4_0.gguf and a 4096-token context I get 8.89 tps.
(-t 32 -ngl 100 -c 8192 -fa -ctk q8_0 -ctv q8_0 -m models/gemma-3-27b-it-qat-q4_0.gguf)
That's not the stuff you want to send to a public API; it's something you want as a 24/7 locally running batch job.
("AI assistant" is an evolutionary dead end, and Star Trek be damned.)
You'll find the id a418f5838eaf which also corresponds to 27b-it-q4_K_M
Is that small enough to run well (without thrashing) on a system with only 16GiB RAM?
> Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?
By comparison, Gemma3's output (both 12b and 27b) seems to typically be more long/verbose, but not problematically so.
Why is the memory use different? Are you using a different context size in the two setups?
https://ollama.com/library/gemma3:27b-it-qat says it's Q4_0. https://huggingface.co/mlx-community/gemma-3-27b-it-qat-4bit says it's 4bit. I think those are the same quantization?
I think this means I either have to train the -pt model with my own instruction tuning or use another provider :(
./build/bin/llama-gemma3-cli -m /discs/fast/ai/gemma-3-27b-it-q4_0.gguf --mmproj /discs/fast/ai/mmproj-model-f16-27B.gguf -p "Describe this image." --image ~/Downloads/surprise.png
Note the 2nd gguf in there - I'm not sure, but I think that's for encoding the image.
> My kid's been using it to feed in his school textbooks and ask it questions.
Which method are you employing to feed a textbook into the model?
That said, the first graph is misleading about the number of H100s required to run DeepSeek r1 at FP16. The model is FP8.