archerx · 9 months ago
I have tried a lot of local models. I have 656GB of them on my computer, so I have experience with a diverse array of LLMs. Gemma has been nothing to write home about and has been disappointing every single time I have used it.

Models that are worth writing home about are:

EXAONE-3.5-7.8B-Instruct - It was excellent at taking podcast transcriptions and generating show notes and summaries.

Rocinante-12B-v2i - Fun for stories and D&D

Qwen2.5-Coder-14B-Instruct - Good for simple coding tasks

OpenThinker-7B - Good and fast reasoning

The DeepSeek distills - Able to handle more complex tasks while still being fast

DeepHermes-3-Llama-3-8B - A really good vLLM

Medical-Llama3-v2 - Very interesting but be careful

Plus more but not Gemma.

anon373839 · 9 months ago
From the limited testing I've done, Gemma 3 27B appears to be an incredibly strong model. But I'm not seeing the same performance in Ollama as I'm seeing on aistudio.google.com. So, I'd recommend trying it from the source before you draw any conclusions.

One of the downsides of open models is that there are a gazillion little parameters at inference time (sampling strategy, prompt template, etc.) that can easily impair a model's performance. It takes some time for the community to iron out the wrinkles.
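
If you want to rule out the sampling side at least, you can pin those parameters per request instead of relying on defaults. A minimal sketch against a local Ollama server (model name and values are illustrative, not Gemma's tuned settings):

    # Pin sampling parameters explicitly so runs are comparable across setups.
    # Assumes Ollama is running locally; model name and values are examples.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:27b",
            "prompt": "Summarize the plot of Hamlet in two sentences.",
            "options": {"temperature": 0.7, "top_p": 0.9, "top_k": 40},
            "stream": False,
        },
    )
    print(resp.json()["response"])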

moffkalast · 9 months ago
At the end of the day it doesn't matter how good it is: it has no system prompt, which means no steerability; sliding-window attention, which makes inference incredibly slow compared to similarly sized models because it's too niche and most inference systems implement it with high overhead; and Google's psychotic instruct tuning that made Gemma 2 an inconsistent and unreliable glass cannon.

I mean hell, even Mistral added system prompts in their last release; Google is the only one that still doesn't seem to bother with it.

sieve · 9 months ago
The Gemma 2 Instruct models (9B & 27B) are quite good for writing. The 27B is good at following instructions. I also like DeepSeek R1 Distill Llama 70B.

The Gemma 3 Instruct 4B model that was released today matches the output of the larger models for some of the stuff I am trying.

Recently, I compared 13 different online and local LLMs in a test where they tried to recreate Saki's "The Open Window" from a prompt.[1] Claude wins hands down IMO, but the other models are not bad.

[1] Variations on a Theme of Saki (https://gist.github.com/s-i-e-v-e/b4d696bfb08488aeb893cce3a4...)


mythz · 9 months ago
Concur with Gemma 2 being underwhelming; I dismissed it pretty quickly, but gemma3:27b is looking pretty good atm.

BTW mistral-small:24b is also worth mentioning (IMO the best local model), and phi4:14b is pretty strong for its size.

mistral-small was my previous local go-to model; testing now to see if gemma3 can replace it.

InsideOutSanta · 9 months ago
One more vote for Mistral for local models. The 7B model is extremely fast and still good enough for many prompts.
zacksiri · 9 months ago
You should try Mistral Small 24B; it's been my daily companion for a while and has continued to impress me. I've heard good things about QwQ 32B that just came out too.
jrm4 · 9 months ago
Nice, I think you're nailing the important thing -- which is "what exactly are they good FOR?"

I see a lot of talk about good and not good here, but (and a question for everyone) what are people using the non-local big boys for that the locals CAN'T do? I mean, IRL tasks?

blooalien · 9 months ago
I have had nothing but good results using the Qwen2.5 and Hermes3 models. The response times and token generation speeds have been pretty fantastic compared against other models I've tried, too.
usef- · 9 months ago
To clarify, are you basing this comment on experience with previous Gemma releases, or the one from today?
mupuff1234 · 9 months ago
Ok, but have you tried Gemma3?
rpastuszak · 9 months ago
Thanks for the overview.

> Qwen2.5-Coder-14B-Instruct - Good for simple coding tasks

> OpenThinker-7B - Good and fast reasoning

Any chance you could be more specific, i.e. give an example of a concrete coding task or reasoning problem you used them for?

miroljub · 9 months ago
Qwen2.5-Coder:32B is the best open source coding model. I use it daily, and I don't notice that it lags much behind Claude 3.5.

I would actually be happy to see an R1-distilled version; it might perform better with less resource usage.

thom · 9 months ago
Could you talk a little more about your D&D usage? This has turned into one of my primary use cases for ChatGPT, cooking up encounters or NPCs with a certain flavour if I don't have time to think something up myself. I've also been working on hooking up to the D&D Beyond API so you can get everything into homebrew monsters and encounters.
archerx · 9 months ago
I noticed the prompt makes a big difference in the experience you get. I have a simple game prompt.

The first prompt I tested came from this video: https://www.youtube.com/watch?v=0Cq-LuJnaRg

It was OK but produced shallow adventures.

The second one I tried was from this site: https://www.rpgprompts.com/post/dungeons-dragons-chatgpt-pro...

A bit better and easier to modify, but still shallow.

The best one I have tried so far is this one from Reddit: https://old.reddit.com/r/ChatGPT/comments/zoiqro/most_improv...

It is a super long prompt; I had to edit it a lot and manually extract the data from some of the links, but it has been the best experience by far. I even became "friends" with an NPC who accompanied me on a quest; it was a lot of fun and I was fully engaged.

The model of choice matters, but even Llama 1B and 2B can handle some stories.

camel_Snake · 9 months ago
You may want to check out the Wayfarer models at https://huggingface.co/LatitudeGames

AFAIK they're more for roleplaying a D&D-style adventure than planning one, but I've heard good things.

DeepSeaTortoise · 9 months ago
TBH, I REALLY like the tiny models. Like smollm2.

Also lobotomized LLMs ("abliterated") can be a lot of fun.

andai · 9 months ago
I think you mean un-lobotomize, and apparently it can be done without retraining? Wild!

https://huggingface.co/blog/mlabonne/abliteration
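
If I'm reading that post right, the trick is to find a "refusal direction" in activation space and project it out of the weights, with no gradient updates. A toy numpy sketch of just the projection step (random stand-ins for real activations):

    # Toy illustration of the projection step behind "abliteration":
    # remove one direction from a weight matrix so it can no longer
    # write along that direction. Random data stands in for real activations.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 64                            # toy hidden size
    mean_refuse = rng.normal(size=d)  # mean activation on refused prompts
    mean_comply = rng.normal(size=d)  # mean activation on complied prompts

    r = mean_refuse - mean_comply     # candidate refusal direction
    r /= np.linalg.norm(r)

    W = rng.normal(size=(d, d))       # a weight matrix writing to the residual stream
    W_abl = W - np.outer(r, r) @ W    # project out the component along r

    print(np.abs(r @ W_abl).max())    # ~1e-15: nothing written along r anymore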


pduggishetti · 9 months ago
Recently phi4 has been very good too!
m00dy · 9 months ago
sshht, don't make it a public debate :P
memhole · 9 months ago
Do you mostly stick with smaller models? I'm pretty surprised at how good the smaller models can be at times now. A year ago they were nearly useless. I kind of like, too, that the hallucinations are more obvious sometimes. Or at least it seems like they are.
archerx · 9 months ago
I like the smaller models because they are faster. I even got a Llama 3 1B model running on a Tinker Board 2S, and it was fun to play around with and not too slow. The smaller models are still good at summarizing and other basic tasks. For coding they start showing their limits, but they still work great for figuring out issues in small bits of code.

The real issue with local models is managing context. Smaller models let you have a longer context without losing performance; bigger models are smarter, but if I want to keep them fast I have to reduce the context length.
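
With Ollama the context window is just a per-request option, so the tradeoff is easy to experiment with (model name and values are illustrative):

    # Context length is a per-request option in Ollama; shrinking it is the
    # easy speed lever for bigger models. Values here are illustrative.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:27b",         # big model: keep the window small
            "prompt": "Summarize this document: ...",
            "options": {"num_ctx": 4096},  # vs. 16384+ for a small model
            "stream": False,
        },
    )
    print(resp.json()["response"])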

Also, all of the models have their own "personalities", which still manifest in the finetunes.

jeswin · 9 months ago
I still find them useless. What do you use them for?
sebastiansm · 9 months ago
Can anyone recommend a small model specifically for translation? English to Spanish, mostly.
pzo · 9 months ago
Depends what you mean by small: 4B? 7B? You can try Qwen2.5 3B or 7B, though the 3B version is under a non-commercial license. Phi4-mini should also be good. I've only tested Polish/English pairs, but it should be good for Spanish too. Smaller models like 1.5B were kind of useless for me.
archerx · 9 months ago
I haven't done deep testing on it, but Tower-Babel_Babel-9B should be what you're looking for.
karma_fountain · 9 months ago
Ah, OpenThinker-7B. A diverse variety of LLM from the OpenThoughts team. Light and airy, suitable for everyday usage and not too heavy on the CPU. A new world LLM for the discerning user.
flir · 9 months ago
I find New World LLMs kinda... well, they don't have the terroir, ya know?
panki27 · 9 months ago
I've had really good results with Qwen2.5-7b-Instruct.

Do you have any recommendations for a "general AI assistant" model, not focused on a specific task, but more a jack-of-all-trades?

archerx · 9 months ago
If I could only use one model from now on, it would be either the DeepSeek R1 Qwen or Llama distill.
xnx · 9 months ago
Let us know when you've evaluated Gemma 3. Just as with the switch between ChatGPT 3.5 and ChatGPT 4, old versions don't tell you much about the current version.
tomp · 9 months ago
Any below 7B you'd recommend?

IME Qwen2.5-3B-Instruct (or even 1.5B) has been quite remarkable, but I haven't done much heavy testing.

archerx · 9 months ago
Try:

- EXAONE-3.5-2.4B-Instruct

- Llama-3.2-3B-Instruct-uncensored

- qwq-lcot-3b-instruct

- qwen2.5-3b-instruct

These have been very interesting tiny models; they can do text-processing tasks and can handle storytelling. The Llama-3.2 is way too sensitive to random stuff, so get the uncensored or abliterated versions.

_1 · 9 months ago
How are you grading these? Are you going on feeling, or do you have a formalized benchmarking process?
archerx · 9 months ago
From just using them a lot and getting the results that I want without going "ugh!".
dudefeliciano · 9 months ago
What hardware are you using those on? Is it still prohibitively expensive to self-host a model that gives decent outputs? (Sorry, my last experience with Llama a while back was underwhelming.)
sliken · 9 months ago
I'm tinkering with Gemma 3 27B on a last-gen 12-core Ryzen. I get 5 tokens/sec.
archerx · 9 months ago
I have an AMD 6700 XT card with 12GB of VRAM and a 24-core CPU with 48GB of RAM. This is the bare minimum.
michaelbuckbee · 9 months ago
What's the driving reason for local models? Cost? Censorship?
laborcontract · 9 months ago
PII is the driving force for me. I like to have local models manage my browser tabs, reply to emails, and go through personal documents. I don't trust LLM providers not to retain my data.
dannyw · 9 months ago
Privacy is another big reason. I like to store my files locally with a backup, not on Dropbox or whatever.


danielhanchen · 9 months ago
I wrote a mini guide on running Gemma 3 at https://docs.unsloth.ai/basics/tutorial-how-to-run-gemma-3-e...!

The recommended settings according to the Gemma team are:

temperature = 0.95

top_p = 0.95

top_k = 64

Also beware of double BOS tokens! You can run my uploaded GGUFs with the recommended chat template and settings via:

    ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M
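
If you want to check whether your stack hits the double-BOS problem, here's a quick sketch with HF transformers (assumes access to the Gemma 3 tokenizer; the model id is an example):

    # Check for doubled BOS: Gemma's chat template already emits <bos>, and
    # tokenizing the rendered text with the default add_special_tokens=True
    # prepends another one.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("google/gemma-3-27b-it")
    chat = [{"role": "user", "content": "hello"}]
    text = tok.apply_chat_template(chat, tokenize=False)  # starts with <bos>
    ids = tok(text).input_ids                             # default adds BOS again
    print(ids[:2], tok.bos_token_id)                      # two BOS ids == the bug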

vessenes · 9 months ago
Daniel, as always, thanks for these. I had good results with your Q4_K_M quant on Mac / llama.cpp. However, on Linux/A100/ollama, there is something very wrong with your Q8_0 quant. Python code has indentation errors, missing close parens, quite a lot that's bad. I ran both with your suggested command lines, but of course could have been some mistake I made. I'm testing the bf16 on the A100 now to make sure it's not a hardware issue, but my gut is there's a model or ollama sampling problem here.

EDIT: 27b size

tarruda · 9 months ago
Thanks for this, but I'm still unable to reproduce the results from Google AI studio.

I tried your version and when I ask it to create a tetris game in python, the resulting file has syntax errors. I see strange things like a space in the middle of a variable name/reference or weird spacing in the code output.

ac29 · 9 months ago
Some models are more sensitive to quantization than others; presumably AI Studio is running the full 16-bit model.

Maybe try the 8-bit quant if you have the hardware for it:

    ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q8_0

svachalek · 9 months ago
This seems worse than the official Ollama build. First question I tried:

>>> who is president

The বর্তমানpresident of the United States is Джо Байден (JoeBiden).

swores · 9 months ago
See the other HN submission (for the Gemma3 technical report doc) for a more active discussion thread - 50 comments at time of writing this.

https://news.ycombinator.com/item?id=43340491

iamgopal · 9 months ago
Small models should be trained on specific problems in specific languages, and should be built one upon another, the way containers work. I see a future where a factory or home has a local AI server hosting many highly specific models, continuously trained by super-large LLMs on the web and connected via the network to all instruments and computers, basically controlling the whole factory. I also see a future where all machinery comes with an AI-readable language for its own functioning: an HTTP-like AI protocol for two-way communication between a machine and an AI. Lots of possibilities.
antirez · 9 months ago
After reading the technical report, make the effort to download the model and run it against a few prompts. In 5 minutes you'll understand how broken LLM benchmarking is.
archerx · 9 months ago
That's why I like giving models a real-world test. For example, take a podcast transcription and ask for show notes and a summary. With a temperature of 0, different models will tackle the problem in different ways, and you can infer whether they really understood the transcript. Usually the transcripts I use come from about 1 hour of audio of two or more people talking.
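
If anyone wants to reproduce this kind of test, a rough harness looks like this (local Ollama assumed; model names and the transcript path are placeholders):

    # Rough harness for the transcript test: same prompt, temperature 0,
    # several local models. Model names and file path are placeholders.
    import requests

    MODELS = ["gemma3:27b", "mistral-small:24b", "qwen2.5:14b"]
    transcript = open("podcast_transcript.txt").read()  # ~1 hour of audio, transcribed
    prompt = "Write show notes and a summary for this podcast transcript:\n\n" + transcript

    for model in MODELS:
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt,
                  "options": {"temperature": 0}, "stream": False},
        )
        print(f"=== {model} ===\n{r.json()['response'][:400]}\n")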
antirez · 9 months ago
Good test. I'm slowly accumulating private tests that I use to rate LLMs, and this one was missing... Thanks.
amelius · 9 months ago
Aren't there any "blind" benchmarks?
nathanasmith · 9 months ago
Unfortunately, that wouldn't help as much as you'd think, since talented AI labs can just watch the public leaderboard, note which models move up and down, and deduce and target whatever the hidden benchmark is testing.
nickthegreek · 9 months ago
OpenRouter Arena Ratings are probably the closest thing.
toinewx · 9 months ago
Can you expand a bit?
antirez · 9 months ago
The model performs very poorly in practice, while the benchmarks show it at DeepSeek V3 level. It's not terrible, but it's in a different league from the models it appears very close to (a bit better / a bit worse) in the benchmarks.
bearjaws · 9 months ago
Prompt adherence is pretty bad from what I can tell.
smcleod · 9 months ago
No mention of how well it's claimed to perform with tool calling?

The Gemma series of models has historically been pretty poor when it comes to coding and tool calling - two things that are very important to agentic systems, so it will be interesting to see how 3 does in this regard.
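
For reference, this is roughly the request shape involved, via Ollama's chat endpoint with a tools list (the schema is illustrative; whether Gemma 3 actually emits tool_calls is the open question):

    # A tool-calling request against Ollama's chat endpoint. The tool schema
    # is illustrative; a capable model should respond with a tool_calls entry.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "gemma3:27b",
            "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
            "tools": [{
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }],
            "stream": False,
        },
    )
    print(resp.json()["message"].get("tool_calls"))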

PKop · 9 months ago
I wasn't able to get function calls to work for Gemma3 in ollama, nor were others[0]. What is another way to run these models locally?

[0] https://github.com/ollama/ollama/issues/9680

[1] https://github.com/ollama/ollama/issues/9680#issuecomment-27...

mythz · 9 months ago
Not sure if anyone else experiences this, but Ollama downloads start off strong and then the last few MBs take forever.

Finally just finished downloading (gemma3:27b). It requires the latest version of Ollama, but it's now working; I'm getting about 21 tok/s on my local 2x A4000.

From my few test prompts it looks like a quality model; going to run more tests to compare it against mistral-small:24b and see if it becomes my new local model.

Patrick_Devine · 9 months ago
There are some fixes coming to uniformly speed up pulls. We've been testing that out but there are a lot of moving pieces with the new engine so it's not here quite yet.
dizhn · 9 months ago
It might not be downloading but converting the model. Or, if it's already downloading a properly formatted model file, deduping it on disk, which I hear it does. This also makes its on-disk model files useless for other frontends.
squeakywhite · 9 months ago
I experienced this just now. The download slowed to approx 500kB/s for the last 1% or so. When this happens, you can Ctrl+C to cancel and then start the download again. It will continue from where it left off, but at the regular (fast) download speed.
elif · 9 months ago
Good job Google. It is kinda hilarious that 'open'AI seems to be the big player least likely to release any of their models.
amelius · 9 months ago
lyingAI