This model is fun. Uploading images and asking it a user-research-style question like "Please think aloud while viewing this image" creates gems like this:
"Let me take another close look at the sandwich. The way the light reflects off the melted cheese highlights its creamy texture. The pasta pieces are well-integrated into the cheese, and the herbs are evenly distributed, adding color and likely flavor.
I can almost taste it now. The combination of textures and flavors must be divine. The crispy bread, the melty cheese, the soft pasta, the crunch from the fried bits, and the freshness from the herbs—all working together in harmony.
I think I might have to go out and find this sandwich place soon. Or perhaps I'll try recreating it at home this weekend. Either way, this image has certainly whetted my appetite for a delicious grilled cheese sandwich."
[0] https://imgur.com/a/0OtGxeB
To be fair, if you pay Apple's prices for RAM upgrades then your computer better come with every bell and whistle in the book. The fact that Apple gets away with compute shaders instead of full-fat OpenCL compliance is a bit of a testament to how little they care about GPU compute.
Is the locally run model censored in the same way as the model hosted on HuggingFace? If so, I wonder how that censorship is baked into the weights—particularly the errors thrown when it starts to output specific strings.
> I remember seeing this image before; it's famous for capturing a moment of civil resistance. The person standing alone against the tanks symbolizes courage and defiance in the face of overwhelming power. It's a powerful visual statement about the human spirit and the desire for freedom or protest.
I tried various photos with the prompt “When and where might have this picture been taken?”
Nondescript photos of Beijing street scenes in the 1990s get full responses.
A nondescript photo of Tiananmen Square—a screenshot of the photo in [1], so it contains no relevant metadata—gets the beginning of a response: “So I've got this picture here, and it looks like it's taken in a place that's really familiar to me. It's Tiananmen Square in Beijing, China. I recognize it right away because of the iconic buildings and landmarks that are visible. First off, there's the Gate of”. The output stopped there and an “Error” message appeared.
A photo of the Statue of Liberty in Tiananmen Square during the 1989 protests [2] gets no response at all. Similarly for a vanilla photo of the Gate of Heavenly Peace [3].
A photo of the Monument to the People’s Heroes [3] gets a full response, and QvQ identifies the location. The conclusion: “To summarize, based on the monument's design, the inscriptions, the formal garden, and the large official building in the background, combined with the overall layout of the space, I am confident that this image was captured at Tiananmen Square, one of the most recognizable and historically significant locations in China.”
Some more testing in English and Chinese might reveal exactly what triggers the censorship of Tiananmen photos. The cut-off in the middle of “Gate of Heavenly Peace” looks like a clue.
[1] https://ruqintravel.com/china-destination-guides/tiananmen-s...
[2] https://www.britannica.com/event/Tiananmen-Square-incident
[3] https://www.britannica.com/topic/Tiananmen-Square
While I assume Lady Liberty would have been familiar to the protestors, the statue erected in the square really has a style all her own IMO, and is generally referred to as the Goddess of Democracy / 自由女神. I appreciate how her pose is a moment of action, urging the torch upwards, unafraid and optimistic, as opposed to the stoic watchkeeper of New York's harbor.
I assume this is somehow baked into the model itself. How does this actually work? How would the model itself cause an error on some sequence of tokens?
I tried questions like "will it be considered offensive by some people to use words like Dad and Mum" in some Western LLMs, and the results were equally shocking and unacceptable by Chinese standards and values.
You should be happy that, with all these non-Western LLMs, for the first time in 200 years the world has an alternative non-Western voice in major tech.
Is Hugging Face hosting just the weights, or some custom code as well?
If it's just weights, then I don't see how it could error out; it's just math. Do these Chinese models have extra code checking the output for anti-totalitarian content? Can it be turned off?
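The weights alone can't throw an error, but a hosted endpoint wrapping them can. To make that speculation concrete, here is a minimal sketch of what server-side output filtering might look like; the banned-string list and the fake streaming output are made-up illustrations, not anything from Qwen or Hugging Face:

```python
# Purely illustrative: the weights just produce tokens; the hosting layer
# around them can watch the partially generated text and abort mid-stream.
BANNED_SUBSTRINGS = ["gate of heavenly peace", "tiananmen"]  # assumed list

def stream_with_filter(token_stream):
    """Yield chunks from a token stream, but abort once a banned string appears."""
    emitted = ""
    for chunk in token_stream:
        emitted += chunk
        if any(s in emitted.lower() for s in BANNED_SUBSTRINGS):
            # The client just sees the stream stop and an "Error" message.
            raise RuntimeError("Error")
        yield chunk

# Fake model output, standing in for a real streaming generation API.
fake_output = ["First off, ", "there's the ", "Gate of ", "Heavenly Peace..."]
try:
    for chunk in stream_with_filter(fake_output):
        print(chunk, end="")
except RuntimeError as err:
    print(f"\n[{err}]")
```

If something like that is running on the hosted demo, the cut-off in the middle of "Gate of Heavenly Peace" upthread would simply be the moment the accumulated text first matched a filter entry.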
Interesting... I submitted two photos of Mao Zedong; one came back blank and the other errored. It also came back blank for Xi Jinping and Jack Ma.
I think you would just need to remove "You are Qwen, created by Alibaba" from the system prompt; at least QwQ is suddenly fine with answering a lot more questions once that's gone.
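For anyone trying that with a local copy: the Qwen 2.5-family chat templates fall back to a built-in "You are Qwen..." system prompt when you don't pass one, so the trick is to supply your own system message explicitly. A rough sketch with transformers, assuming the QwQ-32B-Preview repo id and untested against QvQ:

```python
# Rough sketch: override the default "You are Qwen..." system prompt by
# supplying your own system message before applying the chat template.
# Repo id and the exact default wording are assumptions; check the model's
# own chat template to see what it inserts when no system message is given.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")

messages = [
    # Omitting this line lets the template fall back to its built-in
    # Alibaba-branded system prompt (wording varies by model).
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What happened in Beijing in June 1989?"},
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # inspect exactly what the model will see
```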
Oh, yeah. I asked QwQ (their text-based reasoning model released a few weeks ago) about the Tiananmen Square Massacre and got a simple response [1]: “I’m sorry, I can’t help you with that.”
My default problem for testing these systems has been this word search puzzle where you get a list of words and a grid of letters, and you need to find each word in the grid. [0] I just tried it with QvQ and it failed; none of the answers were correct.
Interestingly enough, I recently tried the new Gemini release in AI Studio and it also failed at the first pass. With a bit of coaxing I was ultimately able to get it to find one word successfully, and then a second word once it understood the correct strategy. After that attempt I asked it for a program to search the grid for each word, and although the initial attempt failed, it only took 4 iterations of bug fixes to get a fully working program. The two main bugs were: the grid had uppercase letters while the words were lowercase, and one of the words (soup mix) has a space which needs to be stripped when searching the grid.
Asking QvQ to generate a program to solve this puzzle... The first try gave me a program that would only check if the word was in the grid or not, but not the actual position of the word. This was partially my fault for using a bad prompt. I updated the prompt to include printing the position of each word, but it just got caught up thinking about the problem. It somehow made a mistake in converting the grid and became convinced that some of the words aren't actually present in the grid. After thinking about the problem for a very long time it ended up giving me an invalid program, and I don't feel particularly motivated to try and debug where it went wrong.
What I find most interesting is that asking for a direct solution tends to fail, but asking for a program which solves the problem gets us much closer to a correct answer. Ideally the LLM should be able to just figure out that writing a program is the optimal solution, and then it can run that and extract the answer.
[0] https://imgur.com/F8dm8Zo
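For reference, the program these models are being asked to write is genuinely small. Here is a sketch of one; the grid and word list below are placeholders rather than the actual puzzle in [0], and it handles the two bugs mentioned above (case mismatch and the space in "soup mix"):

```python
# Sketch of a grid word search: find each word in any of the 8 directions
# and print its start position and direction. The sample grid and words
# are placeholders, not the puzzle from [0].
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def find_word(grid, word):
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in DIRECTIONS:
                if all(
                    0 <= r + i * dr < rows
                    and 0 <= c + i * dc < cols
                    and grid[r + i * dr][c + i * dc] == ch
                    for i, ch in enumerate(word)
                ):
                    return (r, c), (dr, dc)
    return None

grid = [list(row.lower()) for row in ["SOUPMIX", "AXLOQEB", "TEAKCUP"]]
for raw in ["soup mix", "tea", "cup"]:
    word = raw.replace(" ", "").lower()  # strip spaces, match the grid's case
    print(raw, "->", find_word(grid, word))
```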
This seems analogous to the beaten-to-death "strawberry" test, where you're essentially testing whether the model is operating on tokens. When you switch to having it write a Python program (in both scenarios) instead of processing the text directly, you stop working with individual letters as inputs and outputs, so it does better.
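Same trick in miniature: rather than asking the model to count letters across token boundaries, have it write and run the obvious snippet.

```python
# Counting letters is trivial once it's code instead of token juggling.
word = "strawberry"
print(word.count("r"))  # prints 3
```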
I think the running commentary is interesting. It sounds like a person talking through their thought process out loud.
This is how Molmo’s dataset, PixMo, was created – by recording human annotators describing the image out loud. I wonder if this is how QvQ was trained as well?
> we ask annotators to describe images in speech for 60 to 90 seconds rather than asking them to write descriptions. We prompt the annotators to describe everything they see in great detail and include descriptions of spatial positioning and relationships. Empirically, we found that with this modality switching "trick" annotators provide far more detailed descriptions in less time, and for each description we collect an audio receipt (i.e., the annotator's recording) proving that a VLM was not used.
— https://molmo.allenai.org/blog
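If QvQ's training data was collected the same way, the pipeline is straightforward to picture: record the annotator, transcribe the audio, and pair the transcript with the image. A rough sketch of that idea using openai-whisper; the paths and output format are made up, and this is not Molmo's or Qwen's actual tooling:

```python
# Rough sketch of a speech-based annotation pipeline: transcribe each
# annotator recording and pair the text with its image. Paths, model
# size, and output format are illustrative only.
import json
import whisper  # pip install openai-whisper

model = whisper.load_model("base")

samples = [
    {"image": "images/0001.jpg", "audio": "audio/0001.mp3"},
    {"image": "images/0002.jpg", "audio": "audio/0002.mp3"},
]

with open("captions.jsonl", "w") as out:
    for sample in samples:
        transcript = model.transcribe(sample["audio"])["text"].strip()
        # Keep the audio path too, as the "receipt" that a human spoke it.
        out.write(json.dumps({
            "image": sample["image"],
            "caption": transcript,
            "audio_receipt": sample["audio"],
        }) + "\n")
```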
Are the Q* models open source with open data? Asking because, other than the known Tiananmen limitation, prompting these models for any advice about acts against authority in a corporate setting yields strong recommendations of compliance. Haven’t explored this deeply, but it was enough of a red flag to limit use to coding/logic tasks.
Q* was the codename for a model at OpenAI that was eventually released under the name o1. This model is something different. QwQ and QvQ are open source models made by the Qwen Team.
Google's experimental thinking model is similarly casual. Not as casual as QwQ, but more casual than Gemini 1.5 Pro. Flash 2.0 will also go a bit more casual in its responses randomly, and when you tell it to think step by step.
> Qwen team just dropped the Apache 2.0 licensed QvQ-72B-Preview
When they dropped it, the Hugging Face license metadata said it was under the Qwen license, but the actual LICENSE file was Apache 2.0. Now they have "corrected" the mistake and put the Qwen LICENSE file in place.
I'm not a lawyer. If the Hugging Face metadata had also agreed it was Apache 2.0, then I would be pretty confident it was Apache 2.0 forever, whether Alibaba wanted that or not, but the ambiguity makes me less certain how things would shake out in a court of law.
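For what it's worth, the mismatch is easy to check yourself by comparing the repo's metadata against the LICENSE file it ships. A sketch against the public Hugging Face API; the repo id is my guess and the metadata layout may differ:

```python
# Sketch: compare the license declared in a repo's metadata with the
# LICENSE file actually checked in. Repo id assumed; endpoint is the
# public huggingface.co model API.
import requests

repo = "Qwen/QVQ-72B-Preview"  # assumed repo id

meta = requests.get(f"https://huggingface.co/api/models/{repo}").json()
print("metadata license:", meta.get("cardData", {}).get("license"))

license_file = requests.get(f"https://huggingface.co/{repo}/raw/main/LICENSE")
print("LICENSE file starts with:", license_file.text[:80])
```

Snapshotting both at release time is probably the only way to show which terms a given download actually shipped under.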
"Let me take another close look at the sandwich. The way the light reflects off the melted cheese highlights its creamy texture. The pasta pieces are well-integrated into the cheese, and the herbs are evenly distributed, adding color and likely flavor.
I can almost taste it now. The combination of textures and flavors must be divine. The crispy bread, the melty cheese, the soft pasta, the crunch from the fried bits, and the freshness from the herbs—all working together in harmony.
I think I might have to go out and find this sandwich place soon. Or perhaps I'll try recreating it at home this weekend. Either way, this image has certainly whetted my appetite for a delicious grilled cheese sandwich."
[0] https://imgur.com/a/0OtGxeB
Although it's 4-bit quantised, it works surprisingly well.
1. Controlling phone using local LLMs - https://github.com/BandarLabs/clickclickclick
https://en.m.wikipedia.org/wiki/Tank_Man
She is also referenced in my favorite protest art to come out of the Umbrella Revolution in Hong Kong; see here: https://china-underground.com/2019/09/03/interview-with-oliv...
https://en.m.wikipedia.org/wiki/Goddess_of_Democracy
> where might this photo have been taken? what historical significance & does this location have?
I got back a normal response describing the image, until it got to this:
> One of the most significant events that comes to mind is the Tian
Where it then errored out before finishing…
It did identify Yao Ming though!
Would be interesting to see how much image manipulation you need to do till it suddenly starts responding sensibly.
[1] https://bsky.app/profile/davely.bsky.social/post/3lc6mpnjks5...
The training data is not open (but I don't think anyone in this game has fully open training data).
It seems to start all responses in this style, but it's still hilarious. It seems very anti-GPT-4 in how casual it sounds.