Big day for open source Chinese model releases - DeepSeek-v3-0324 came out today too, an updated version of DeepSeek v3 now under an MIT license (previously it was a custom DeepSeek license). https://simonwillison.net/2025/Mar/24/deepseek/
I still don't get where the money for new open source models is going to come from once setting investor dollars on fire is no longer a viable business model. Does anyone seriously expect companies to keep buying and running thousands of ungodly expensive GPUs, plus whatever they spend on human workers to do labelling/tuning, and then giving away the spoils for free, forever?
I've been waiting since November for one, just one*, model other than Claude that can reliably do agentic tool call loops.
As long as the Chinese open models are chasing reasoning and benchmark-maxxing against mid-2024 US private models, I'm very comfortable with somewhat ignoring these models.
(This isn't idle prognostication hinging on my personal hobby horse. I've got skin in the game: I'm virtually certain I have the only AI client that can reliably do tool calls with open models in an agentic setting. llama.cpp got a massive contribution to make this happen, and the big players who bother, like Ollama, are still using a dated JSON-schema-forcing method that doesn't comport with recent local model releases that can do tool calls. IMHO we're comfortably past the point where products using these models can afford to focus on conversational chatbots; that's cute, but it's a commodity to give away, per standard 2010s SV thinking.)
* OpenAI's can, but they're a little less... grounded? situated? i.e. they can't handle "read this file and edit it to do $X". Same-ish for Gemini. Though sometimes I feel like the only person in the world who actually waits for the experimental models to go GA; per the letter of the law, I shouldn't deploy them until then.
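The loop being asked for is conceptually simple — the hard part is models emitting well-formed calls reliably. Here's a minimal sketch of an agentic tool-call loop, with a canned stand-in for the model; every name below is hypothetical, not any real client's API:

```python
# A minimal sketch of an agentic tool-call loop. `fake_model` is a
# canned stand-in for a real chat endpoint: it either requests a
# tool call or returns a final answer.
def read_file(path: str) -> str:
    # Hypothetical tool; a real client would hit the filesystem.
    return "print('hello')"

TOOLS = {"read_file": read_file}

def fake_model(messages):
    # Canned behavior: ask for the tool once, then answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "read_file", "arguments": {"path": "main.py"}}}
    return {"content": "The file prints 'hello'."}

def agent_loop(user_msg: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = fake_model(messages)
        if "tool_call" in reply:
            call = reply["tool_call"]
            # Execute the requested tool and feed the result back in.
            result = TOOLS[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "content": result})
        else:
            return reply["content"]
    raise RuntimeError("tool loop did not terminate")

print(agent_loop("read main.py and tell me what it does"))
```

The "reliably" part is exactly what the stub hides: a real model has to produce that structured call correctly every iteration, which is where many open models still fall down.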
"The foundation model companies are screwed." Not really, they can either make API access expensive or resign from exposing APIs and offer their custom products. Open Source models are great, but you need powerful hardware to run them, surely it will not be a smartphone, at least in the nearest future.
Yes, I believe the same, though of the Western models I only believe in Grok, Gemini, or Claude.
Gemini isn't too special; it's comparable to DeepSeek, or a bit behind it, but it is damn fast. So maybe skip Gemini for demanding tasks.
Grok and Gemini can be used as deep research models, which I think I like? Grok seems to have just taken the DeepSeek approach and scaled it up on their massive GPU cluster, so I'm not sure; I think Grok can also be replaced.
What I truly believe in is Claude.
I am not sure why, but Claude really feels good, especially for coding.
For anything else I might use something like DeepSeek or other Chinese models.
I used cerebras.ai and holy moly, they are so fast. I used the DeepSeek 70B model, and it is still incredibly fast; my time matters, so I really like the open source way, since it lets companies like Cerebras focus on what they do best.
I am not sure about Nvidia though. Nvidia seems so tied to Western AI that DeepSeek improvements impact Nvidia.
I do hope Nvidia cuts GPU prices, though I don't think they have much incentive to.
OpenAI is basically a zombie company at this point. They couldn't make a profit even when they were the only player in town, and it's now a very competitive landscape.
IMO, people will keep investing in this because whoever accomplishes the first intelligence explosion is going to have the potential for massive influence over all human life.
Since we are on HN here, I can highly recommend open-webui with some OpenAI-compatible provider. I've been running with Deep Infra for more than a year now and am very happy. New models are usually available within one or two days after release. I also have some friends who use the service almost daily.
That's because it's a 3rd-party API someone is hosting, trying to arbitrage the infra cost or mine training data, or maybe something even more sinister. I stay away from OpenRouter APIs that aren't served by reputable, well-known companies, and even then...
I think they didn’t want to rewrite their post. It’s more substantial and researched than any comment here, and all their posts are full of information. I think they should get a pass, and calling it self-promotion is a stretch.
I’ve legit seen a heated online debate with hundreds of comments about this question (maybe not the exact numbers), and I don’t think most participants were memeing. People are that bad at math. It’s depressing.
We were using Llama vision 3.2 a few months back and were very frustrated with it (both in terms of speed and result quality). One day we were looking for alternatives on Hugging Face and eventually stumbled upon Qwen. The difference in accuracy and speed absolutely blew our mind. We ask it to find something in an image and we get a response in like half a second with a 4090, and it's correct most of the time. What's even more mind blowing is that when we ask it to extract any entity name from the image, and the entity name is truncated, it gives us the complete name without even having to ask for it (e.g. "Coca-C" is barely visible in the background, and it will return "Coca-Cola" on its own). And it does it with entities not as well known as Coca-Cola, and with entities only known in some very specific regions too. Haven't looked back to Llama or any other vision models since we tried Qwen.
Ever since I switched to Qwen as my go-to, it's been bliss. They have a model for many (if not all) cases. No more daily quota! And you get to use their massive context window (1M tokens).
32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough you can run them on a single GPU or a reasonably well specced Mac laptop (32GB or more).
I just started self-hosting as well on my local machine; I've been using https://lmstudio.ai/ locally for now.
I think the 32b models are actually good enough that I might stop paying for ChatGPT plus and Claude.
I get around 20 tok/second on my M3, and I can get 100 tok/second on smaller or quantized models. 80-100 tok/second is the sweet spot for interactive usage; if you go above that, you basically can't read as fast as it generates.
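For scale, converting throughput to reading speed (the ~0.75 words per token ratio is a rough assumption for English text, not a measured figure):

```python
# Convert token throughput to an approximate words-per-minute rate,
# assuming ~0.75 English words per token (a rough rule of thumb).
def words_per_minute(tok_per_sec: float, words_per_token: float = 0.75) -> float:
    return tok_per_sec * words_per_token * 60

print(words_per_minute(20))   # 900.0
print(words_per_minute(100))  # 4500.0
```

Even 20 tok/s comfortably exceeds typical reading speed (~250 wpm), which is why it feels fine for background tasks; higher rates matter mostly when you're skimming or piping output into tools.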
I also really like the QwQ reasoning model. I haven't gotten around to trying locally hosted models for agents and RAG; coding agents especially are what I'm interested in. I feel like 20 tok/second is fine if it's just running in the background.
Anyway, I'd love to hear others' experiences; that was mine this weekend. The way it's going, I really don't see a point in paying. I think on-device is the near future, and providers should just charge a licensing fee, like DB vendors do, for enterprise support and updates.
If you were paying $20/mo for ChatGPT a year ago, the 32B models are basically at that level: slightly slower and slightly lower quality, but useful enough to consider cancelling your subscriptions at this point.
Are there any good sources I can read up on for estimating the hardware specs required to run 7B, 13B, 32B, etc. models locally? I'm a grad student on a budget, but I want to host one locally and am trying to build a PC that can run one of these models.
Qwq:32b + qwen2.5-coder:32b is a nice combination for aider, running locally on a 4090. It has to swap models between architect and edit steps so it's not especially fast, but it's capable enough to be useful. qwen2.5-coder does screw up the edit format sometimes though, which is a pain.
I've only recently started looking into running these models locally on my system. I have limited knowledge regarding LLMs and even more limited when it comes to building my own PC.
Are there any good sources I can read up on for estimating the hardware specs required to run 7B, 13B, 32B, etc. models locally?
I don't think these models are GPT-4 level. Yes, they seem to be on benchmarks, but it's known that labs increasingly use A/B testing in dataset curation and synthesis (using GPT-4-level models) to optimize not just the benchmarks themselves, but anything that could be benchmarked, like academic material.
Also "GPT-4 level" is a bit loaded. One way to think about it that I found helpful is to split how good a model is into "capability" and "knowledge/hallucination".
Many benchmarks test "capability" more than "knowledge". There are many use cases where the model gets all the necessary context in the prompt. There, a model with good capability for the use case will do fine (e.g. as good as GPT-4).
That same model might hallucinate when you ask about the plot of a movie while a larger model like GPT-4 might be able to recall better what the movie is about.
I don't think there's any local model other than full-sized DeepSeek (not distillations!) that is on the level of the original GPT-4, at least not in reasoning tasks. Scoreboards lie.
That aside, QwQ-32 is amazingly smart for its size.
The 4090 can run 32B models in Q4_K_M, so yes, on that measure. Not unquantised though, nothing bigger than Q8 would fit. On a 32GB card you'll have more choices to trade off quantisation against context.
Silly question: how can OpenAI, Anthropic, and the rest have valuations so large considering all the open source models? I'm not saying the closed models will disappear or become tiny, but why so incredibly valuable?
Valuation can depend on lots of different things, including hype. However, it ultimately comes down to an estimated discounted cash flow from the future, i.e. those who buy their shares (through private equity methods) at the current valuation believe the company will earn such and such money in the future to justify the valuation.
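That discounted-cash-flow idea can be made concrete with a toy calculation (all numbers below are invented for illustration):

```python
# Discounted cash flow: each future cash flow is divided by
# (1 + r)^t, where r is the discount rate and t the year.
def dcf(cash_flows, r):
    return sum(cf / (1 + r) ** t for t, cf in enumerate(cash_flows, start=1))

# Made-up projected cash flows (in $B) over three years, 10% rate.
print(round(dcf([1.0, 2.0, 4.0], 0.10), 2))  # → 5.57
```

The valuation debate is really a debate about those projected cash flows: believers plug in explosive growth, skeptics plug in margin compression from open models, and both get "justified" numbers out.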
OpenAI's o1 is still really good, and the free options aren't compelling enough to switch if you've been using it for a while. They've positioned themselves as a good mainstream default.
Because what looks like a tiny difference in those benchmark graphs is, in practice, the difference between worth paying for and a complete waste of time.
Yeah, but with cheaper alternatives (including open source and local ones), it would be super easy for most customers to migrate to a different provider. I'm not saying they don't provide any value, but it's like paid software vs. an open source alternative: the open source alternative tends to win out, especially among tech people.
Their valuation is not marked to market. We know their previous round valuation, but at this point it is speculative until they go through another round that will mark them again.
That being said, they have a user base and integrations. As long as they stay close to or a bit ahead of the Chinese models, they'll be fine. If the Chinese models significantly jump ahead of them, well, then they are pretty much dead. Add open source to the mix and they become history.
The competition isn't self-hosting. If you can just pick a capable model from any provider, inference turns into an infrastructure/PaaS game, and the majority of the profits will be captured by the cloud providers.
Because they offer extremely powerful models at pretty modest prices.
The hardware for a local model would cost years and years of a $20/mo subscription, would output lower quality work, and would be much slower.
3.7 Thinking is an insane programming model. Maybe it cannot do an SWE's job, but it sure as hell can write functional narrow-scope programs with a GUI.
For coding and other integrations, people pay per token with an API key, not a subscription. Claude Code costs a few dollars per task on your code; it gets expensive quite quickly.
Does anyone know how making the models multimodal impacts their text capabilities? The article is claiming this achieves good performance on pure text as well, but I'm curious if there is any analysis on how much impact it usually has.
I've seen some people claim it should make the models better at text, but I find that a little difficult to believe without data.
I'm having a hard time finding controlled testing, but the premise is straightforward: different modalities encourage different skills and understandings. Text builds up more formal idea tokenization and strengthens logic/reasoning, while images require the model to learn a more robust geometric intuition. Since these learnings land in the same latent space, the strengths can be cross-applied.
The same applies to humans. Imagine a human whose entire life involved reading books in a dark room, vs. one who could see images, vs. one who can actually interact with the world.
My understanding is that in multimodal models, both text and image vectors align to the same semantic space; this alignment seems to be the main difference from text-only models.
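A toy sketch of that idea (the projection matrices are random stand-ins, nothing from any real model): separate per-modality encoders map into one shared space, and similarity is measured there:

```python
# Toy illustration of a shared semantic space: two modality encoders
# (stubbed as fixed random matrices) project differently-sized
# feature vectors into the same 4-dim space, where cosine
# similarity is well-defined between modalities.
import numpy as np

rng = np.random.default_rng(0)
W_text = rng.normal(size=(8, 4))    # text features  -> shared space
W_image = rng.normal(size=(16, 4))  # image features -> shared space

def embed(x, W):
    v = x @ W
    return v / np.linalg.norm(v)  # unit-normalize for cosine similarity

text_feat = rng.normal(size=8)
image_feat = rng.normal(size=16)
sim = embed(text_feat, W_text) @ embed(image_feat, W_image)
print(f"cosine similarity in shared space: {sim:.3f}")
```

In a trained model the projections are learned (e.g. contrastively, CLIP-style) so that matching text and images land near each other; the toy above only shows why cross-modal similarity is even computable.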
That will help you quickly calculate the model's VRAM usage as well as the VRAM usage of the context length you want. You can put "Qwen/Qwen2.5-VL-32B-Instruct" in the "Model (unquantized)" field. Funnily enough, the calculator lacks an option for running the model unquantized, presumably because nobody worried about VRAM bothers running >8-bit quants.
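As a rough cross-check of what such a calculator computes, here's a back-of-envelope estimate. The ~20% overhead factor for activations/KV cache is an assumption that only holds at modest context lengths; real usage also depends on the quant format:

```python
# Back-of-envelope VRAM estimate: weights take params * bytes/param,
# plus an assumed ~20% overhead for activations and KV cache.
def vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    return params_billion * (bits / 8) * overhead

for size in (7, 13, 32):
    print(f"{size}B @ Q4: ~{vram_gb(size, 4):.1f} GB, @ FP16: ~{vram_gb(size, 16):.1f} GB")
```

By this estimate a 32B model at 4-bit lands around 19 GB, which matches the experience elsewhere in the thread of Q4 32B models fitting on a 24 GB 4090 but not on 16 GB cards.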
I wish they would start publishing graphs with quantized version performance as well. What matters is RAM/bandwidth vs. performance, not parameter count.
deepseek-r1:14b/mistral-small:24b/qwen2.5-coder:14b fit 16GB VRAM with fast generation.
32b versions bleed into RAM and take a serious performance hit, but are still usable.
The foundation model companies are screwed. Only shovel makers (Nvidia, infra companies) and product companies are going to win.
https://openrouter.ai/deepseek/deepseek-chat-v3-0324:free
do you think this needs attention?
Q: "9.11 and 9.9, which one is larger?"
A: "To determine which number is larger between 9.11 and 9.9, let's compare them step by step. Both numbers have the same whole number part: 9. 9.11 has a 1 in the tenths place. 9.9 has a 9 in the tenths place. Since 9 (from 9.9) is greater than 1 (from 9.11), we can conclude that 9.9 is larger than 9.11."
"Final Answer: 9.9"
I don't think anything else needs to be said here.
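For reference, the place-value comparison the model stumbled through is exactly what exact decimal arithmetic does mechanically:

```python
# Compare 9.11 and 9.9 with exact decimal arithmetic: same whole
# part (9), then tenths (1 vs 9), so 9.9 is larger.
from decimal import Decimal

a, b = Decimal("9.11"), Decimal("9.9")
print(max(a, b))  # → 9.9
```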
Will this humbling moment change your opinion?
-1 to humanity
Might want to check your math? Seems right to me
Maybe try rephrasing your question to "which version came later, 9.9 or 9.11".
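That rephrasing captures the real ambiguity: the same digits order differently as decimals than as version numbers. A quick illustration of the two readings:

```python
# As decimals, 9.9 > 9.11; as version numbers, 9.11 comes after 9.9
# (compare components as a tuple, the way version sorting works).
print(9.9 > 9.11)        # True  (decimal comparison)
print((9, 11) > (9, 9))  # True  (version-style comparison)
```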
I ran https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct... using uv (so no need to install libraries first) and https://github.com/Blaizzy/mlx-vlm like this:
That downloaded an ~18GB model and gave me a VERY impressive result, shown at the bottom here: https://simonwillison.net/2025/Mar/24/qwen25-vl-32b/

Is UV the best way to run it?
Are there any good sources I can read up on for estimating the hardware specs required to run 7B, 13B, 32B, etc. models locally?
[0]: https://twm.me/posts/calculate-vram-requirements-local-llms/
To pick just the most popular one, https://lmarena.ai/?leaderboard= has GPT-4-0314 ranked 83rd now.
It's not unlikely that Chinese products may be banned or tariffed.
That name alone holds the most mindshare in its product category, and is close to the level of name recognition of Google.
In reality, OpenAI is losing money per user.
Cost per token is tanking like crazy due to competition.
They guesstimate breaking even and then turning a profit in a couple of years.
Their guesses don't seem to account much for progress, especially on open-weight models.
Frankly, I have no idea what they're thinking there; they can barely keep up even with an investor-subsidized, unsustainable model.
Helen Keller still learned robust generalizations.
(It's not mentioned anywhere in the blog post.)
I guess the -7B might run on my 16GB AMD card?
Also it's entirely possible to run a model that doesn't fit in available GPU memory, it will just be slower.