simonw · 4 months ago
I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.

Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

Last night I had it write me a complete plugin for my LLM tool like this:

  llm install llm-mlx
  llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit

  llm -m mlx-community/gemma-3-27b-it-qat-4bit \
    -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
    -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
    -s 'Write a new fragments plugin in Python that registers
    issue:org/repo/123 which fetches that issue
        number from the specified github repo and uses the same
        markdown logic as the HTML page to turn that into a
        fragment'
It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/

rs186 · 4 months ago
Can you quote tps?

More and more I'm starting to realize that cost saving is a minor consideration for local LLMs. If it is too slow, it becomes unusable, so much so that you might as well use public LLM endpoints - unless you really care about getting things done locally without sending information to another server.

With the OpenAI API/ChatGPT, I get responses much faster than I can read, and for simple questions that means I just need a glimpse of the response, copy & paste, and get things done. Whereas with a local LLM, I watch it painstakingly print preambles that I don't care about, and only get what I actually need after 20 seconds (on a fast GPU).

And I am not yet talking about context window etc.

I have been researching how people integrate local LLMs into their workflows. My finding is that most people play with them for a short time and that's about it, and most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than getting a beefed-up Mac Studio or building a machine with a 4090.

simonw · 4 months ago
My tooling doesn't measure TPS yet. It feels snappy to me on MLX.

I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.

I enjoy local models for research and for the occasional offline scenario.

I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

overfeed · 4 months ago
> Whereas with a local LLM, I watch it painstakingly print preambles that I don't care about, and only get what I actually need after 20 seconds.

You may need to "right-size" the models you use to match your hardware and TPS expectations, which may involve using a smaller version of the model with faster TPS, upgrading your hardware, or paying for hosted models.

Alternatively, if you can use agentic workflows or tools like Aider, you don't have to watch a large model work slowly locally. Instead you queue work for it, go to sleep, eat, or do other work, and then much later look over the pull requests whenever it completes them.

ein0p · 4 months ago
Sometimes TPS doesn't matter. I've generated textual descriptions for 100K or so images in my photo archive, some of which I have absolutely no interest in uploading to someone else's computer. This works pretty well with Gemma. I use local LLMs all the time for things where privacy is even remotely important. I estimate this constitutes easily a quarter of my LLM usage.
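
A rough sketch of that kind of batch job, assuming the `ollama` Python package and a local multimodal Gemma tag (the model name and paths here are just illustrative, not my exact script):

  # Illustrative batch captioner: walk a photo archive and ask a local
  # Gemma model, via Ollama, to describe each image.
  import pathlib
  import ollama  # assumes `pip install ollama` and a running Ollama server

  photo_dir = pathlib.Path("~/Pictures/archive").expanduser()  # placeholder path

  for img in sorted(photo_dir.rglob("*.jpg")):
      resp = ollama.chat(
          model="gemma3:27b-it-qat",  # any local multimodal Gemma tag
          messages=[{
              "role": "user",
              "content": "Describe this photo in one short paragraph.",
              "images": [str(img)],  # Ollama accepts file paths or raw bytes
          }],
      )
      # store the caption next to the image
      img.with_suffix(".txt").write_text(resp["message"]["content"])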
trees101 · 4 months ago
Not sure how accurate my stats are. I used Ollama with the --verbose flag. Using a 4090 and all default settings, I get around 40 TPS for the Gemma 27B model:

`ollama run gemma3:27b --verbose` gives me 42.5 TPS ±0.3 TPS

`ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS ±0.3 TPS

Strange results; the full model gives me slightly more TPS.

k__ · 4 months ago
The local LLM is your project manager, the big remote ones are the engineers and designers :D
jonaustin · 4 months ago
On a M4 Max 128GB via LM Studio:

query: "make me a snake game in python with pygame"

(mlx 4-bit quant) mlx-community/gemma-3-27b-it-qat@4bit: 26.39 tok/sec • 1681 tokens • 0.63s to first token

(gguf 4-bit quant) lmstudio-community/gemma-3-27b-it-qat: 22.72 tok/sec • 1866 tokens • 0.49s to first token

using Unsloth's settings: https://docs.unsloth.ai/basics/tutorial-how-to-run-and-fine-...

starik36 · 4 months ago
On an A5000 with 24GB, this model typically gets between 20 and 25 tps.
pantulis · 4 months ago
> Can you quote tps?

LM Studio running on a Mac Studio M4 Max with 128GB; with gemma-3-27B-it-QAT-Q4_0.gguf and a 4096-token context I get 8.89 tps.

a_e_k · 4 months ago
I'm seeing ~38-42 tps on a 4090 in a fresh build of llama.cpp under Fedora 42 on my personal machine.

(-t 32 -ngl 100 -c 8192 -fa -ctk q8_0 -ctv q8_0 -m models/gemma-3-27b-it-qat-q4_0.gguf)

DJHenk · 4 months ago
> More and more I'm starting to realize that cost saving is a minor consideration for local LLMs. If it is too slow, it becomes unusable, so much so that you might as well use public LLM endpoints - unless you really care about getting things done locally without sending information to another server.

There is another aspect to consider, aside from privacy.

These models are trained by downloading every scrap of information from the internet, including the works of many, many authors who have never consented to that. And they are for sure not going to get a share of the profits, if there are ever going to be any. If you use a cloud provider, you are basically saying that is all fine. You are happy to pay them and make yourself dependent on their service, based on work that wasn't theirs to use.

However, if you use a local model, the authors still did not give consent, but one could argue that the company that made the model is at least giving back to the community. The company doesn't get any money out of it, and you are not becoming dependent on their hyper-capitalist service. No rent-seeking. The benefits of the work are free for everyone to use. This makes using AI a little more acceptable from a moral standpoint.

otabdeveloper4 · 4 months ago
The only actually useful application of LLMs is processing large amounts of data for classification and/or summarization purposes.

That's not the stuff you want to send to a public API; it's something you want as a 24/7 locally running batch job.

("AI assistant" is an evolutionary dead end, and Star Trek be damned.)

bobjordan · 4 months ago
Thanks for the call out on this model! I have 42GB of usable VRAM on my ancient (~10 years old) quad-SLI Titan X workstation and have been looking for a model that balances a large context window with output quality. I'm able to run this model with a 56K context window, and it just fits into my 42GB of VRAM to run 100% on GPU. The output quality is really good and the 56K context window is very usable. Nice find!
paprots · 4 months ago
The original gemma3:27b also took only 22GB using Ollama on my 64GB MacBook. I'm quite confused that the QAT version takes the same. Do you know why? Which model is better: `gemma3:27b` or `gemma3:27b-qat`?
zorgmonkey · 4 months ago
Both versions are quantized and should use the same amount of RAM. The difference with QAT is that the quantization happens during training time, which should result in slightly better output (closer to the bf16 weights).
kgwgk · 4 months ago
Look up 27b in https://ollama.com/library/gemma3/tags

You'll find the id a418f5838eaf which also corresponds to 27b-it-q4_K_M

superkuh · 4 months ago
Quantization-aware training just means having the model deal with quantized values a bit during training so that it handles the quantization better when it is actually quantized after training. It doesn't change the model size itself.
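
Roughly, the idea looks like this (a minimal PyTorch-flavored sketch of fake quantization with a straight-through estimator, not Google's actual training code):

  import torch

  def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
      # Forward pass sees weights rounded to a 4-bit grid; backward pass treats
      # the rounding as identity, so the full-precision weights learn to sit
      # where the eventual post-training quantization hurts least.
      qmax = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit
      scale = w.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale (real schemes are per-block)
      w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
      return w + (w_q - w).detach()                 # straight-through estimator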
nolist_policy · 4 months ago
I suspect your "original gemma3:27b" was a quantized model, since the non-quantized (16-bit) version needs around 54GB.
prvc · 4 months ago
> ~15GB (MLX) leaving plenty of memory for running other apps.

Is that small enough to run well (without thrashing) on a system with only 16GiB RAM?

simonw · 4 months ago
I expect not. On my Mac at least I've found I need a bunch of GB free to have anything else running at all.
tomrod · 4 months ago
Simon, what is your local GPU setup? (No doubt you've covered this, but I'm not sure where to dig it up.)
simonw · 4 months ago
MacBook Pro M2 with 64GB of RAM. That's why I tend to be limited to Ollama and MLX - stuff that requires NVIDIA doesn't work for me locally.
nico · 4 months ago
Been super impressed with local models on Mac. Love that the Gemma models have a 128k-token context input size. However, outputs are usually pretty short.

Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?

simonw · 4 months ago
The tool you are using may set a default max output size without you realizing it. Ollama, for example, has a num_ctx setting that defaults to 2048: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-c...
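
For example, with the Ollama Python client you can raise both the context window and the output cap per request (the numbers below are just illustrative; `num_predict` is the output-length knob):

  import ollama

  response = ollama.chat(
      model="gemma3:27b-it-qat",
      messages=[{"role": "user", "content": "Write a multi-page short story."}],
      options={
          "num_ctx": 16384,     # context window; Ollama's default is only 2048
          "num_predict": 8192,  # max tokens to generate (-1 removes the cap)
      },
  )
  print(response["message"]["content"])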
tootie · 4 months ago
I'm using 12b and getting seriously verbose answers. It's squeezed into 8GB and takes its sweet time but answers are really solid.
Casteil · 4 months ago
This is basically the opposite of what I've experienced - at least compared to another recent entry like IBM's Granite 3.3.

By comparison, Gemma3's output (both 12b and 27b) seems to typically be more long/verbose, but not problematically so.

littlestymaar · 4 months ago
> and it only uses ~22GB (via Ollama) or ~15GB (MLX)

Why is the memory use different? Are you using different context size in both set-ups?

simonw · 4 months ago
No idea. MLX is its own thing, optimized for Apple Silicon. Ollama uses GGUFs.

https://ollama.com/library/gemma3:27b-it-qat says it's Q4_0. https://huggingface.co/mlx-community/gemma-3-27b-it-qat-4bit says it's 4bit. I think those are the same quantization?
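
For what it's worth, GGUF's Q4_0 is (roughly) symmetric 4-bit quantization over blocks of 32 weights sharing one scale per block; a simplified sketch of the round trip is below. I'm assuming MLX's "4bit" scheme is similar in spirit, though its group size and zero-point handling may differ.

  import numpy as np

  def q4_0_roundtrip(weights: np.ndarray, block: int = 32) -> np.ndarray:
      # Simplified picture of a Q4_0-style format: quantize each block of 32
      # weights to 4-bit ints sharing one scale, then dequantize.
      # Assumes a 1-D float array.
      out = np.empty_like(weights, dtype=np.float32)
      for i in range(0, len(weights), block):
          w = weights[i:i + block].astype(np.float32)
          scale = np.abs(w).max() / 7.0
          if scale == 0.0:
              scale = 1.0
          q = np.clip(np.round(w / scale), -8, 7)  # signed 4-bit range
          out[i:i + block] = q * scale
      return out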

Patrick_Devine · 4 months ago
The vision tower is 7GB, so I was wondering if you were loading it without vision?
codybontecou · 4 months ago
Can you run the mlx-variation of this model through Ollama so that I can interact with it in Open WebUI?
simonw · 4 months ago
I haven't tried it yet but there's an MLX project that exposes an OpenAI-compatible serving endpoint that should work with Open WebUI: https://github.com/madroidmaq/mlx-omni-server
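
If that works, Open WebUI (or any OpenAI-style client) should just need the server's base URL; a hedged sketch with the openai Python package - the host, port, and model name below are assumptions, use whatever mlx-omni-server actually reports:

  from openai import OpenAI

  # Placeholder base URL - point this at wherever the local server listens.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

  resp = client.chat.completions.create(
      model="mlx-community/gemma-3-27b-it-qat-4bit",
      messages=[{"role": "user", "content": "Say hello from MLX."}],
  )
  print(resp.choices[0].message.content)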
ygreif · 4 months ago
Do many consumer GPUs have >20 gigabytes RAM? That sounds like a lot to me
mcintyre1994 · 4 months ago
I don't think so, but Apple's unified memory architecture makes it a possibility for people with MacBook Pros.

Samin100 · 4 months ago
I have a few private “vibe check” questions and the 4 bit QAT 27B model got them all correctly. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at Deepmind is reading this — Gemma 3 27B is the single most impressive open source model I have ever used. Well done!
itake · 4 months ago
I tried to use the -it models for translation, but it completely failed at translating adult content.

I think this means I either have to train the -pt model with my own instruction tuning or use another provider :(

jychang · 4 months ago
Try mradermacher/amoral-gemma3-27B-v2-qat-GGUF
andhuman · 4 months ago
Have you tried Mistral Small 24b?
diggan · 4 months ago
The first graph compares the "Elo Score" of various models at "native" BF16 precision, and the second graph compares VRAM usage between native BF16 precision and their QAT models. But since this method is about quantizing while maintaining quality, isn't the obvious graph - a quality comparison between BF16 and QAT - missing? The text doesn't seem to talk about it either, yet it's basically the topic of the blog post.
croemer · 4 months ago
Indeed, the one thing I was looking for was Elo/performance of the quantized models, not how good the base model is. Showing how much memory is saved by quantization in a figure is a bit of an insult to the intelligence of the reader.
nithril · 4 months ago
In addition, the "Massive VRAM Savings" graph states what looks like a tautology: reducing from 16 bits to 4 bits unsurprisingly leads to a 4x reduction in memory usage.
claiir · 4 months ago
Yea they mention a "perplexity drop" relative to naive quantization, but that's meaningless to me.

> We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.

Wish they showed benchmarks / added quantized versions to the arena! :>

mark_l_watson · 4 months ago
Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32GB Mac.

gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also.

I still like Gemini 2.5 Pro and o3 for brainstorming or working on difficult problems, but for routine work it (simply) makes me feel good to have everything open source/weights running on my own system.

When I bought my 32GB Mac a year ago, I didn't expect to be this happy running gemma3:27b-it-qat with open-codex locally.

nxobject · 4 months ago
Fellow owner of a 32GB MBP here: how much memory does it use while resident - or, if swapping happens, do you see the effects in your day-to-day work? I'm in the awkward position of using a lot of virtualized, bloated Windows software (mostly SAS) on a daily basis.
mark_l_watson · 4 months ago
I have the usual programs running on my Mac, along with open-codex: Emacs, web browser, terminals, VSCode, etc. Even with large contexts, open-codex with Ollama and Gemma 3 27B QAT does not seem to overload my system.

To be clear, I sometimes toggle open-codex to use the Gemini 2.5 Pro API also, but I enjoy running locally for simpler routine work.

pantulis · 4 months ago
How did you manage to run open-codex against a local Ollama? I keep getting 400 errors no matter what I try with the --provider and --model options.
pantulis · 4 months ago
Never mind - I found your Leanpub book, followed the instructions, and at least I have it running with qwen-2.5. I'll investigate what happens with Gemma.
Tsarp · 4 months ago
What tps are you hitting? And did you have to change KV size?
mekpro · 4 months ago
Gemma 3 is way, way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its model size, which is too large (even though it can run fast with MoE); this greatly limits its users to the small percentage of enthusiasts who have enough GPU VRAM. Meanwhile, Gemma 3 is widely usable across all hardware sizes.
trebligdivad · 4 months ago
It seems pretty impressive - I'm running it on my CPU (16-core AMD 3950X) and it's very, very good at translation, and the image description is very impressive as well. I'm getting about 2.3 tokens/s on it (compared to under 1/s on the Calme-3.2 I was previously using). It does tend to be a bit chatty unless you tell it not to be; it'll give you a 'breakdown' of pretty much everything, so for translation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.
simonw · 4 months ago
What are you using to run it? I haven't got image input working yet myself.
trebligdivad · 4 months ago
I'm using llama.cpp - built last night from head; to do image stuff you have to run a separate client they provide, with something like:

./build/bin/llama-gemma3-cli -m /discs/fast/ai/gemma-3-27b-it-q4_0.gguf --mmproj /discs/fast/ai/mmproj-model-f16-27B.gguf -p "Describe this image." --image ~/Downloads/surprise.png

Note the 2nd gguf in there - I'm not sure, but I think that's for encoding the image.

terhechte · 4 months ago
Image input has been working with LM Studio for quite some time
Havoc · 4 months ago
The upcoming qwen3 series is supposed to be MoE...likely to give better tk/s on CPU
slekker · 4 months ago
What's MoE?
manjunaths · 4 months ago
I am running this on a 16GB AMD Radeon 7900 GRE in a 64GB machine, with ROCm and llama.cpp on Windows 11. I can use Open WebUI or the native GUI for the interface. It is made available via an internal IP to all members of my home.

It runs at around 26 tokens/sec with FP16; FP8 is not supported by the Radeon 7900 GRE.

I just love it.

For coding, QwQ 32B is still king. But with a 16GB VRAM card it gives me ~3 tokens/sec, which is unusable.

I tried to make Gemma 3 write a PowerShell script with a terminal GUI interface, and it ran into dead ends and finally gave up. QwQ 32B performed a lot better.

But for most general purposes it is great. My kid's been using it to feed his school textbooks and ask it questions. It is better than anything else currently.

Somehow it is more "uptight" than Llama or the Chinese models like Qwen. I can't put my finger on it; the Chinese models seem nicer and more talkative.

mdp2021 · 4 months ago
> My kid's been using it to feed his school textbooks and ask it questions

Which method are you employing to feed a textbook into the model?

behnamoh · 4 months ago
This is what local LLMs need - being treated like first-class citizens by the companies that make them.

That said, the first graph is misleading about the number of H100s required to run DeepSeek r1 at FP16. The model is FP8.

mmoskal · 4 months ago
Also, ~no one runs an H100 at home, i.e. at batch size 1. What matters is throughput. With 37B active parameters and a massive deployment, throughput (per GPU) should be similar to Gemma.
freeamz · 4 months ago
So what is the real comparison against DeepSeek R1? It would be good to know which is actually more cost-efficient and open (reproducible build) to run locally.
behnamoh · 4 months ago
Half the number of those dots is what it takes. But also, why compare a 27B model with a 600B+ one? That doesn't make sense.