Big day for open source Chinese model releases - DeepSeek-v3-0324 came out today too, an updated version of DeepSeek v3 now under an MIT license (previously it was a custom DeepSeek license). https://simonwillison.net/2025/Mar/24/deepseek/
I still don't get where the money for new open source models is going to come from once setting investor dollars on fire is no longer a viable business model. Does anyone seriously expect companies to keep buying and running thousands of ungodly expensive GPUs, plus whatever they spend on human workers to do labelling/tuning, and then giving away the spoils for free, forever?
I've been waiting since November for one, just one*, model other than Claude that can reliably do agentic tool call loops.
As long as the Chinese open models are chasing reasoning and benchmark-maxxing against mid-2024 US private models, I'm very comfortable with somewhat ignoring these models.
(This isn't idle prognostication hinging on my personal hobby horse. I've got skin in the game: I'm virtually certain I have the only AI client that can reliably do tool calls with open models in an agentic setting. llama.cpp got a massive contribution to make this happen, and the big players who bother, like Ollama, are still using a dated JSON-schema-forcing method that doesn't comport with recent local model releases that can do tool calls. IMHO we're comfortably past the point where products using these models can afford to focus on conversational chatbots; that's cute, but it's a commodity to give away, per standard 2010s SV thinking.)
* OpenAI's can, but they're a little less... grounded? situated? i.e. they can't handle "read this file and edit it to do $X". Same-ish for Gemini. Though sometimes I feel like the only person in the world who actually waits for the experimental models to go GA; per the letter of the law, I shouldn't deploy them until then.
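The loop being asked for is conceptually simple — the hard part is models emitting well-formed calls reliably. Here's a minimal sketch of an agentic tool-call loop, with a canned stand-in for the model; every name below is hypothetical, not any real client's API:

```python
# A minimal sketch of an agentic tool-call loop. `fake_model` is a
# canned stand-in for a real chat endpoint: it either requests a
# tool call or returns a final answer.
def read_file(path: str) -> str:
    # Hypothetical tool; a real client would hit the filesystem.
    return "print('hello')"

TOOLS = {"read_file": read_file}

def fake_model(messages):
    # Canned behavior: ask for the tool once, then answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "read_file", "arguments": {"path": "main.py"}}}
    return {"content": "The file prints 'hello'."}

def agent_loop(user_msg: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = fake_model(messages)
        if "tool_call" in reply:
            call = reply["tool_call"]
            # Execute the requested tool and feed the result back in.
            result = TOOLS[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "content": result})
        else:
            return reply["content"]
    raise RuntimeError("tool loop did not terminate")

print(agent_loop("read main.py and tell me what it does"))
```

The "reliably" part is exactly what the stub hides: a real model has to produce that structured call correctly every iteration, which is where many open models still fall down.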
"The foundation model companies are screwed." Not really, they can either make API access expensive or resign from exposing APIs and offer their custom products. Open Source models are great, but you need powerful hardware to run them, surely it will not be a smartphone, at least in the nearest future.
Yes, I believe the same, though of the Western models I only believe in Grok, Gemini, or Claude.
Gemini isn't too special; it's comparable to DeepSeek, or a bit behind it, but it is damn fast. So maybe skip Gemini for demanding tasks.
Grok and Gemini can be used as deep research models, which I think I like? Grok seems to have just taken the DeepSeek approach and scaled it up on their massive GPU cluster, so I'm not sure; I think Grok can also be replaced.
What I truly believe in is Claude.
I am not sure why, but Claude really feels good, especially for coding.
For anything else I might use something like DeepSeek or other Chinese models.
I used cerebras.ai and holy moly, they are so fast. I used the DeepSeek 70B model, and it is still incredibly fast; my time matters, so I really like the open source way, since it lets companies like Cerebras focus on what they do best.
I am not sure about Nvidia though. Nvidia seems so tied to Western AI that DeepSeek improvements impact Nvidia.
I do hope Nvidia cuts GPU prices, though I don't think they have much incentive to.
OpenAI is basically a zombie company at this point. They couldn't make a profit even when they were the only player in town, and it's now a very competitive landscape.
IMO, people will keep investing in this because whoever accomplishes the first intelligence explosion is going to have the potential for massive influence over all human life.
Since we are on HN here, I can highly recommend open-webui with some OpenAI-compatible provider. I've been running with Deep Infra for more than a year now and am very happy. New models are usually available within one or two days after release. I also have some friends who use the service almost daily.
That's because it's a 3rd-party API someone is hosting, trying to arbitrage the infra cost or mine training data, or maybe something even more sinister. I stay away from OpenRouter APIs that aren't served by reputable, well-known companies, and even then...
I think they didn’t want to rewrite their post. It’s more substantial and researched than any comment here, and all their posts are full of information. I think they should get a pass, and calling it self-promotion is a stretch.
I’ve legit seen a heated online debate with hundreds of comments about this question (maybe not the exact numbers), and I don’t think most participants were memeing. People are that bad at math. It’s depressing.
We were using Llama vision 3.2 a few months back and were very frustrated with it (both in terms of speed and result quality). One day we were looking for alternatives on Hugging Face and eventually stumbled upon Qwen. The difference in accuracy and speed absolutely blew our mind. We ask it to find something in an image and we get a response in like half a second with a 4090, and it's correct most of the time. What's even more mind blowing is that when we ask it to extract any entity name from the image, and the entity name is truncated, it gives us the complete name without even having to ask for it (e.g. "Coca-C" is barely visible in the background, and it will return "Coca-Cola" on its own). And it does it with entities not as well known as Coca-Cola, and with entities only known in some very specific regions too. Haven't looked back to Llama or any other vision models since we tried Qwen.
Ever since I switched to Qwen as my go-to, it's been bliss. They have a model for many (if not all) cases. No more daily quota! And you get to use their massive context window (1M tokens).
32B is one of my favourite model sizes at this point - large enough to be extremely capable (generally equivalent to GPT-4 March 2023 level performance, which is when LLMs first got really useful) but small enough you can run them on a single GPU or a reasonably well specced Mac laptop (32GB or more).
I just started self-hosting as well on my local machine; I've been using https://lmstudio.ai/ locally for now.
I think the 32b models are actually good enough that I might stop paying for ChatGPT plus and Claude.
I get around 20 tok/second on my M3, and I can get 100 tok/second on smaller or quantized models. 80-100 tok/second is the sweet spot for interactive usage; if you go above that, you basically can't read as fast as it generates.
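For scale, converting throughput to reading speed (the ~0.75 words per token ratio is a rough assumption for English text, not a measured figure):

```python
# Convert token throughput to an approximate words-per-minute rate,
# assuming ~0.75 English words per token (a rough rule of thumb).
def words_per_minute(tok_per_sec: float, words_per_token: float = 0.75) -> float:
    return tok_per_sec * words_per_token * 60

print(words_per_minute(20))   # 900.0
print(words_per_minute(100))  # 4500.0
```

Even 20 tok/s comfortably exceeds typical reading speed (~250 wpm), which is why it feels fine for background tasks; higher rates matter mostly when you're skimming or piping output into tools.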
I also really like the QwQ reasoning model. I haven't gotten around to trying locally hosted models for agents and RAG; coding agents especially are what I'm interested in. I feel like 20 tok/second is fine if it's just running in the background.
Anyway, I'd love to hear others' experiences; that was mine this weekend. The way it's going, I really don't see a point in paying. I think on-device is the near future, and providers should just charge a licensing fee, like DB vendors do, for enterprise support and updates.
If you were paying $20/mo for ChatGPT a year ago, the 32B models are basically at that level: slightly slower and slightly lower quality, but useful enough to consider cancelling your subscriptions at this point.
Are there any good sources I can read up on for estimating the hardware specs required to run 7B, 13B, 32B, etc. models locally? I'm a grad student on a budget, but I want to host one locally and am trying to build a PC that can run one of these models.
Qwq:32b + qwen2.5-coder:32b is a nice combination for aider, running locally on a 4090. It has to swap models between architect and edit steps so it's not especially fast, but it's capable enough to be useful. qwen2.5-coder does screw up the edit format sometimes though, which is a pain.
I've only recently started looking into running these models locally on my system. I have limited knowledge regarding LLMs and even more limited when it comes to building my own PC.
Are there any good sources I can read up on for estimating the hardware specs required to run 7B, 13B, 32B, etc. models locally?
I don't think these models are GPT-4 level. Yes, they seem to be on benchmarks, but it's known that labs increasingly use A/B testing in dataset curation and synthesis (using GPT-4-level models) to optimize not just the benchmarks themselves, but anything that could be benchmarked, like academic material.
Also "GPT-4 level" is a bit loaded. One way to think about it that I found helpful is to split how good a model is into "capability" and "knowledge/hallucination".
Many benchmarks test "capability" more than "knowledge". There are many use cases where the model gets all the necessary context in the prompt. There, a model with good capability for the use case will do fine (e.g. as good as GPT-4).
That same model might hallucinate when you ask about the plot of a movie while a larger model like GPT-4 might be able to recall better what the movie is about.
I don't think there's any local model other than full-sized DeepSeek (not distillations!) that is on the level of the original GPT-4, at least not in reasoning tasks. Scoreboards lie.
That aside, QwQ-32 is amazingly smart for its size.
The 4090 can run 32B models in Q4_K_M, so yes, on that measure. Not unquantised though, nothing bigger than Q8 would fit. On a 32GB card you'll have more choices to trade off quantisation against context.
Silly question: how can OpenAI, Anthropic, and the rest have valuations so large considering all the open source models? I'm not saying the closed models will disappear or become tiny, but why so incredibly valuable?
Valuation can depend on lots of different things, including hype. However, it ultimately comes down to an estimated discounted cash flow from the future, i.e. those who buy their shares (through private equity methods) at the current valuation believe the company will earn such and such money in the future to justify the valuation.
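That discounted-cash-flow idea can be made concrete with a toy calculation (all numbers below are invented for illustration):

```python
# Discounted cash flow: each future cash flow is divided by
# (1 + r)^t, where r is the discount rate and t the year.
def dcf(cash_flows, r):
    return sum(cf / (1 + r) ** t for t, cf in enumerate(cash_flows, start=1))

# Made-up projected cash flows (in $B) over three years, 10% rate.
print(round(dcf([1.0, 2.0, 4.0], 0.10), 2))  # → 5.57
```

The valuation debate is really a debate about those projected cash flows: believers plug in explosive growth, skeptics plug in margin compression from open models, and both get "justified" numbers out.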
OpenAI's o1 is still really good, and the free options aren't compelling enough to switch if you've been using it for a while. They've positioned themselves as a good mainstream default.
Because what looks like a tiny difference in those benchmark graphs is, in practice, the difference between worth paying for and a complete waste of time.
Yeah, but with cheaper alternatives (including open source and local ones), it would be super easy for most customers to migrate to a different provider. I'm not saying they don't provide any value, but it's like paid software vs. an open source alternative: the open source alternative tends to win out, especially among tech people.
Their valuation is not marked to market. We know their previous round valuation, but at this point it is speculative until they go through another round that will mark them again.
That being said, they have a user base and integrations. As long as they stay close to or a bit ahead of the Chinese models, they'll be fine. If the Chinese models significantly jump ahead of them, well, then they are pretty much dead. Add open source to the mix and they become history.
The competition isn't self-hosting. If you can just pick a capable model from any provider, inference turns into an infrastructure/PaaS game, and the majority of the profits will be captured by the cloud providers.
Because they offer extremely powerful models at pretty modest prices.
The hardware for a local model would cost years and years of a $20/mo subscription, would output lower quality work, and would be much slower.
3.7 Thinking is an insane programming model. Maybe it cannot do an SWE's job, but it sure as hell can write functional narrow-scope programs with a GUI.
For coding and other integrations, people pay per token with an API key, not a subscription. Claude Code costs a few dollars per task on your code; it gets expensive quite quickly.
Does anyone know how making the models multimodal impacts their text capabilities? The article is claiming this achieves good performance on pure text as well, but I'm curious if there is any analysis on how much impact it usually has.
I've seen some people claim it should make the models better at text, but I find that a little difficult to believe without data.
I'm having a hard time finding controlled testing, but the premise is straightforward: different modalities encourage different skills and understandings. Text builds up more formal idea tokenization and strengthens logic/reasoning, while images require the model to learn a more robust geometric intuition. Since these learnings land in the same latent space, the strengths can be cross-applied.
The same applies to humans. Imagine a human whose entire life involved reading books in a dark room, vs. one who could see images, vs. one who can actually interact with the world.
My understanding is that in multimodal models, both text and image vectors align to the same semantic space; this alignment seems to be the main difference from text-only models.
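A toy sketch of that idea (the projection matrices are random stand-ins, nothing from any real model): separate per-modality encoders map into one shared space, and similarity is measured there:

```python
# Toy illustration of a shared semantic space: two modality encoders
# (stubbed as fixed random matrices) project differently-sized
# feature vectors into the same 4-dim space, where cosine
# similarity is well-defined between modalities.
import numpy as np

rng = np.random.default_rng(0)
W_text = rng.normal(size=(8, 4))    # text features  -> shared space
W_image = rng.normal(size=(16, 4))  # image features -> shared space

def embed(x, W):
    v = x @ W
    return v / np.linalg.norm(v)  # unit-normalize for cosine similarity

text_feat = rng.normal(size=8)
image_feat = rng.normal(size=16)
sim = embed(text_feat, W_text) @ embed(image_feat, W_image)
print(f"cosine similarity in shared space: {sim:.3f}")
```

In a trained model the projections are learned (e.g. contrastively, CLIP-style) so that matching text and images land near each other; the toy above only shows why cross-modal similarity is even computable.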
That will help you quickly calculate the model's VRAM usage as well as the VRAM usage of the context length you want. You can put "Qwen/Qwen2.5-VL-32B-Instruct" in the "Model (unquantized)" field. Funnily enough, the calculator lacks an option for running the model unquantized, presumably because nobody worried about VRAM bothers running >8-bit quants.
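As a rough cross-check of what such a calculator computes, here's a back-of-envelope estimate. The ~20% overhead factor for activations/KV cache is an assumption that only holds at modest context lengths; real usage also depends on the quant format:

```python
# Back-of-envelope VRAM estimate: weights take params * bytes/param,
# plus an assumed ~20% overhead for activations and KV cache.
def vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    return params_billion * (bits / 8) * overhead

for size in (7, 13, 32):
    print(f"{size}B @ Q4: ~{vram_gb(size, 4):.1f} GB, @ FP16: ~{vram_gb(size, 16):.1f} GB")
```

By this estimate a 32B model at 4-bit lands around 19 GB, which matches the experience elsewhere in the thread of Q4 32B models fitting on a 24 GB 4090 but not on 16 GB cards.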
I wish they would start publishing graphs with quantized version performance as well. What matters is RAM/bandwidth vs. performance, not parameter count.
deepseek-r1:14b/mistral-small:24b/qwen2.5-coder:14b fit 16GB VRAM with fast generation.
32b versions bleed into RAM and take a serious performance hit, but are still usable.
The foundation model companies are screwed. Only shovel makers (Nvidia, infra companies) and product companies are going to win.
https://openrouter.ai/deepseek/deepseek-chat-v3-0324:free
do you think this needs attention?
Q: "9.11 and 9.9, which one is larger?"
A: "To determine which number is larger between 9.11 and 9.9, let's compare them step by step. Both numbers have the same whole number part: 9. 9.11 has a 1 in the tenths place. 9.9 has a 9 in the tenths place. Since 9 (from 9.9) is greater than 1 (from 9.11), we can conclude that 9.9 is larger than 9.11."
"Final Answer: 9.9"
I don't think anything else needs to be said here.
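For reference, the place-value comparison the model stumbled through is exactly what exact decimal arithmetic does mechanically:

```python
# Compare 9.11 and 9.9 with exact decimal arithmetic: same whole
# part (9), then tenths (1 vs 9), so 9.9 is larger.
from decimal import Decimal

a, b = Decimal("9.11"), Decimal("9.9")
print(max(a, b))  # → 9.9
```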
Will this humbling moment change your opinion?
-1 to humanity
Might want to check your math? Seems right to me
Maybe try rephrasing your question to "which version came later, 9.9 or 9.11".
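That rephrasing captures the real ambiguity: the same digits order differently as decimals than as version numbers. A quick illustration of the two readings:

```python
# As decimals, 9.9 > 9.11; as version numbers, 9.11 comes after 9.9
# (compare components as a tuple, the way version sorting works).
print(9.9 > 9.11)        # True  (decimal comparison)
print((9, 11) > (9, 9))  # True  (version-style comparison)
```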
I ran https://huggingface.co/mlx-community/Qwen2.5-VL-32B-Instruct... using uv (so no need to install libraries first) and https://github.com/Blaizzy/mlx-vlm like this:
That downloaded an ~18GB model and gave me a VERY impressive result, shown at the bottom here: https://simonwillison.net/2025/Mar/24/qwen25-vl-32b/

Is UV the best way to run it?
Are there any good sources I can read up on for estimating the hardware specs required to run 7B, 13B, 32B, etc. models locally?
[0]: https://twm.me/posts/calculate-vram-requirements-local-llms/
To pick just the most popular one, https://lmarena.ai/?leaderboard= has GPT-4-0314 ranked 83rd now.
It's not unlikely that Chinese products may be banned or tariffed.
That name alone holds the most mindshare in its product category, and is close to the level of name recognition of Google.
In reality, OpenAI is losing money per user.
Cost per token is tanking like crazy due to competition.
They guesstimate breaking even and then turning a profit in a couple of years.
Their guesses don't seem to account much for progress, especially on open-weight models.
Frankly, I have no idea what they're thinking there; they can barely keep up even with an investor-subsidized, unsustainable model.
Helen Keller still learned robust generalizations.
(It's not mentioned anywhere in the blog post.)
I guess the -7B might run on my 16GB AMD card?
Also it's entirely possible to run a model that doesn't fit in available GPU memory, it will just be slower.