We are in this together! Hoping for more models to come from the labs in varying sizes that will fit on devices.
As for 4chan, they've hated Ollama for a long time because it was built on top of llama.cpp without contributing upstream or crediting the original project.
To help future optimizations for given quantizations, we have been trying to limit the quantizations we ship to the ones that fit the majority of users.
In the case of mistral-small3.1, Ollama supports ~4-bit (q4_k_m), ~8-bit (q8_0), and fp16.
https://ollama.com/library/mistral-small3.1/tags
I'm hopeful that in the future, more and more model providers will help optimize for given model quantizations: 4-bit (e.g. NVFP4, MXFP4), 8-bit, and a 'full' model.
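For a rough sense of what those levels cost in memory, here's a back-of-envelope sketch (my own approximate bits-per-weight figures and helper function, not Ollama's actual sizing code) for a ~24B-parameter model like mistral-small3.1:

```python
# Back-of-envelope only: effective bits/weight vary by quant and model;
# these are rough averages, not Ollama's real sizing logic.
BITS_PER_WEIGHT = {
    "q4_k_m": 4.8,   # ~4-bit K-quant (a bit above 4 bits/weight on average)
    "q8_0":   8.5,   # ~8-bit with per-block scale overhead
    "fp16":   16.0,  # full half-precision weights
}

def approx_weight_gib(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in GiB, ignoring KV cache and runtime overhead."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / (1024 ** 3)

for quant in BITS_PER_WEIGHT:
    print(f"{quant:>7}: ~{approx_weight_gib(24, quant):.0f} GiB")  # ~13 / ~24 / ~45 GiB
```

That gap is roughly why the ~4-bit build is what most people on consumer GPUs end up pulling, while fp16 is mostly there for workstation-class memory.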
When z.ai launched GLM-4.6, I subscribed to their Coding Pro plan. Although I haven't been coding as heavily this month as in the prior two months, I used to hit Claude limits almost daily, often twice a day. That was with both the $20 and $100 plans. I have yet to hit a limit with z.ai, and the server response is at least as good as Claude's.
I mention synthetic.new as it's good to have options, and I do appreciate them sponsoring the dev of Octofriend. z.ai is a Chinese company and I think it hosts in Singapore. That could be a blocker for some.
Hosting through z.ai and synthetic.new. Both good experiences. z.ai even answers their support emails!! 5-stars ;)
May I ask what system you are using where you are getting memory estimations wrong? This is an area Ollama has been working on and has improved quite a bit.
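To be concrete about what an estimate has to cover (an illustrative sketch only, with a made-up helper and made-up architecture numbers, not Ollama's actual estimator): the weights are just the starting point, and the KV cache growing with context length is where estimates tend to drift:

```python
# Illustrative sketch only -- not Ollama's estimator. The architecture numbers
# below are hypothetical and just show how much the context window moves the total.
def rough_vram_gib(params_b: float, bits_per_weight: float,
                   n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, kv_bytes: int = 2,
                   overhead_gib: float = 1.0) -> float:
    weights = params_b * 1e9 * bits_per_weight / 8
    # K and V tensors per layer, per KV head, per token in the context window
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bytes
    return (weights + kv_cache) / (1024 ** 3) + overhead_gib

# e.g. a ~24B model at ~4-bit with a 32k context (made-up layer/head counts)
print(round(rough_vram_gib(24, 4.8, n_layers=40, n_kv_heads=8,
                           head_dim=128, ctx_len=32768), 1), "GiB")
```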
The latest version of Ollama is 0.12.5, with a 0.12.6 pre-release out.
0.7.1 is 28 versions behind.
https://github.com/21st-dev/1code