Is there a crowd-sourced sentiment score for models? I know all these benchmark scores are juiced like crazy; I stopped taking them at face value months ago. What I want to know is whether other folks out there actually use these models, or whether they find them unreliable.
Besides the LM Arena Leaderboard mentioned by a sibling comment, if you go to the r/LocalLlama subreddit you can very unscientifically get a rough sense of how the models perform by reading the comments (and maybe even checking the upvotes). I think the crowd's knee-jerk reaction is unreliable, though, but that's what you asked for.
Not anymore, though. It used to be the place to vibe-check a model about a year ago, but lately it's filled with toxic my-team-vs.-your-team fights, memes about CEOs (wtf), and generally poor takes on a lot of things.
For a while it was China vs. the world, but lately it's even more divided, with heavy camping on specific models. You can still get some signal, but you have to either block a lot of accounts or read /new during different time zones to get some of that "I'm just here for the tech stack" vibe from posters.
Since the ranking is based on token usage, wouldn't it be skewed by the fact that small models' APIs are often used for consumer products, especially free ones? Meanwhile, reasoning models skew it in the opposite direction, though to what extent I don't know.
It's an interesting proxy, but idk how reliable it'd be.
According to the benchmarks, this one improves on the previous version across the board, and on some it even beats 30B-A3B. Definitely worth a try; it'll easily fit into memory, and token generation speed will be pleasantly fast.
It is interesting to think about how they are achieving these scores. The evals are rated by GPT-4.1. Beyond just overfitting to benchmarks, is it possible the models are internalizing how to manipulate the rating model/agent? Is anyone manually auditing these performance tables?
Is there like a leaderboard or power rankings sort of thing that tracks these small open models and assigns ratings or grades to them based on particular use cases?
Claude is not cheap, so why is it far and away the most popular if it's not in the top 10 for performance?
Qwen3 235B ranks highest among open models on these benchmarks, but I have never met anyone who prefers its output over DeepSeek R1's. It's extremely wordy and often gets caught in thought loops.
My interpretation is that the models at the top of ArtificialAnalysis are the ones focusing most on public benchmarks in their training. Note I am not saying xAI is necessarily doing this nefariously; it could just be that they decided it's better bang for the buck to rely on public benchmarks than to focus on building their own evaluation systems.
But Grok is not very good compared to the Anthropic, OpenAI, or Google models, despite ranking so highly on benchmarks.
Reasoning models do a lot better at AIME than non-reasoning models, with o3-mini getting 85% and 4o-mini getting 11%. It makes sense that this would apply to small models as well.
Just install LM Studio and run the Q8_0 version of it, e.g. from here: https://huggingface.co/bartowski/Qwen_Qwen3-4B-Instruct-2507....
You can even run it on a 4 GB Raspberry Pi with Qwen_Qwen3-4B-Instruct-2507-Q4_K_L.gguf. https://lmstudio.ai/
Keep in mind that if you run it at the full 262,144 tokens of context you'll need ~65 GB of RAM.
Anyway, if you're on a Mac you can search for "qwen3 4b 2507 mlx 4bit" and run the MLX version, which is often faster on M-series chips. Crazy impressive what you get from a 2 GB file, in my opinion.
It's pretty good for summaries etc., and can even make simple index.html sites if you're teaching students, but it can't really vibecode in my opinion. However, for local automation tasks like summarizing your emails or home automation, it is excellent.
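If you want to script those automation tasks, LM Studio can expose the loaded model through an OpenAI-compatible local server (http://localhost:1234/v1 by default). A minimal sketch, assuming the server is running; the model identifier below is an assumption, so use whatever name LM Studio shows for your download:

    # pip install openai  -- talks to LM Studio's OpenAI-compatible local server
    from openai import OpenAI

    # LM Studio's server defaults to this address; the api_key value is ignored.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    resp = client.chat.completions.create(
        # Assumption: replace with the identifier LM Studio shows for your model.
        model="qwen3-4b-instruct-2507",
        messages=[{"role": "user", "content": "Summarize this email: ..."}],
        max_tokens=256,
    )
    print(resp.choices[0].message.content)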
It's crazy that we're at this point now.
mlx 4bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...
mlx 5bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...
mlx 6bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...
mlx 8bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...
edit: corrected the 4b link
https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...
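If you'd rather run the MLX builds from Python than through LM Studio, a minimal sketch using the mlx-lm package; the repo name below is an assumption, so substitute the exact repo from whichever link above you picked:

    # pip install mlx-lm  (Apple Silicon only)
    from mlx_lm import load, generate

    # Assumption: substitute the exact repo name from the links above.
    model, tokenizer = load("lmstudio-community/Qwen3-4B-Thinking-2507-MLX-4bit")

    # Downloads the weights on first run, then generates locally.
    print(generate(model, tokenizer,
                   prompt="Explain KV caches in two sentences.",
                   max_tokens=200))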
What is the relationship between context size and RAM required? Isn't the size of RAM related only to number of parameters and quantization?
The context cache (or KV cache) is where intermediate attention results are stored, one entry per token in the context; its size depends on the model architecture and dimensions. KV cache size = 2 * batch_size * context_len * num_key_value_heads * head_dim * num_layers * element_size. The "2" is for the two parts, key and value. Element size is the precision in bytes. This model uses grouped-query attention (GQA), which reduces num_key_value_heads compared to a multi-head attention (MHA) model.
With batch size 1 (for low-latency single-user inference), 32k context (recommended in the model card), fp16 precision:
2 * 1 * 32768 * 8 * 128 * 36 * 2 bytes = 4.5 GiB.
I think, anyway. It's hard to keep up with this stuff. :)
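As a sanity check, the same arithmetic in a few lines of Python (the head/layer numbers come from the formula above, so verify them against the model's config.json):

    # Back-of-the-envelope KV cache size, per the formula above.
    def kv_cache_bytes(batch_size, context_len, num_kv_heads, head_dim,
                       num_layers, element_size):
        # The leading 2 covers the separate key and value tensors.
        return (2 * batch_size * context_len * num_kv_heads
                * head_dim * num_layers * element_size)

    # Qwen3-4B-ish numbers from the comment above; fp16 -> 2 bytes per element.
    size = kv_cache_bytes(1, 32768, 8, 128, 36, 2)
    print(f"{size / 2**30:.1f} GiB")  # 4.5 GiB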
The new qwen3 model is not out yet.
I am running this beast on my dumb PC with no GPU; now we are talking!
=====
LiveCodeBench
E4B IT: 13.2
Qwen: 55.2
=====
AIME25
E4B IT: 11.6
Qwen: 81.3
[1]: https://huggingface.co/google/gemma-3n-E4B