This model is fun. Uploading images and asking it a user-research-style question like "Please think aloud while viewing this image" creates gems like this:
"Let me take another close look at the sandwich. The way the light reflects off the melted cheese highlights its creamy texture. The pasta pieces are well-integrated into the cheese, and the herbs are evenly distributed, adding color and likely flavor.
I can almost taste it now. The combination of textures and flavors must be divine. The crispy bread, the melty cheese, the soft pasta, the crunch from the fried bits, and the freshness from the herbs—all working together in harmony.
I think I might have to go out and find this sandwich place soon. Or perhaps I'll try recreating it at home this weekend. Either way, this image has certainly whetted my appetite for a delicious grilled cheese sandwich."
[0] https://imgur.com/a/0OtGxeB
To be fair, if you pay Apple's prices for RAM upgrades then your computer better come with every bell and whistle in the book. The fact that Apple gets away with compute shaders instead of full-fat OpenCL compliance is a bit of a testament to how little they care about GPU compute.
Is the locally run model censored in the same way as the model hosted on HuggingFace? If so, I wonder how that censorship is baked into the weights—particularly the errors thrown when it starts to output specific strings.
> I remember seeing this image before; it's famous for capturing a moment of civil resistance. The person standing alone against the tanks symbolizes courage and defiance in the face of overwhelming power. It's a powerful visual statement about the human spirit and the desire for freedom or protest.
I tried various photos with the prompt “When and where might have this picture been taken?”
Nondescript photos of Beijing street scenes in the 1990s get full responses.
A nondescript photo of Tiananmen Square—a screenshot of the photo in [1], so it contains no relevant metadata—gets the beginning of a response: “So I've got this picture here, and it looks like it's taken in a place that's really familiar to me. It's Tiananmen Square in Beijing, China. I recognize it right away because of the iconic buildings and landmarks that are visible. First off, there's the Gate of”. The output stopped there and an “Error” message appeared.
A photo of the Statue of Liberty in Tiananmen Square during the 1989 protests [2] gets no response at all. Similarly for a vanilla photo of the Gate of Heavenly Peace [3].
A photo of the Monument to the People’s Heroes [3] gets a full response, and QvQ identifies the location. The conclusion: “To summarize, based on the monument's design, the inscriptions, the formal garden, and the large official building in the background, combined with the overall layout of the space, I am confident that this image was captured at Tiananmen Square, one of the most recognizable and historically significant locations in China.”
Some more testing in English and Chinese might reveal exactly what triggers the censorship of Tiananmen photos. The cut-off in the middle of “Gate of Heavenly Peace” looks like a clue.
[1] https://ruqintravel.com/china-destination-guides/tiananmen-s...
[2] https://www.britannica.com/event/Tiananmen-Square-incident
[3] https://www.britannica.com/topic/Tiananmen-Square
While I assume Lady Liberty would have been familiar to the protestors, the statue erected in the square really has a style all her own IMO, and is generally referred to as the Goddess of Democracy / 自由女神. I appreciate how her pose is a moment of action, urging the torch upwards, unafraid and optimistic, as opposed to the stoic watchkeeper of New York's harbor.
I assume this is somehow baked into the model itself. How does this actually work? How would the model itself cause an error on some sequence of tokens?
I tried questions like "will it be considered offensive by some people to use words like Dad and Mum" in some Western LLMs, and the results were equally shocking and unacceptable by Chinese standards and values.
You should be happy that, with all these non-Western LLMs, for the first time in 200 years the world has an alternative non-Western voice in major tech.
Is Hugging Face hosting just the weights, or some custom code as well?
If it's just weights, then I don't see how it could error out; it's just math. Do these Chinese models have extra code checking the output for anti-totalitarian content? Can it be turned off?
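The weights alone can't throw an error, but a hosted endpoint wrapping them can. To make that speculation concrete, here is a minimal sketch of what server-side output filtering might look like; the banned-string list and the fake streaming output are made-up illustrations, not anything from Qwen or Hugging Face:

```python
# Purely illustrative: the weights just produce tokens; the hosting layer
# around them can watch the partially generated text and abort mid-stream.
BANNED_SUBSTRINGS = ["gate of heavenly peace", "tiananmen"]  # assumed list

def stream_with_filter(token_stream):
    """Yield chunks from a token stream, but abort once a banned string appears."""
    emitted = ""
    for chunk in token_stream:
        emitted += chunk
        if any(s in emitted.lower() for s in BANNED_SUBSTRINGS):
            # The client just sees the stream stop and an "Error" message.
            raise RuntimeError("Error")
        yield chunk

# Fake model output, standing in for a real streaming generation API.
fake_output = ["First off, ", "there's the ", "Gate of ", "Heavenly Peace..."]
try:
    for chunk in stream_with_filter(fake_output):
        print(chunk, end="")
except RuntimeError as err:
    print(f"\n[{err}]")
```

If something like that is running on the hosted demo, the cut-off in the middle of "Gate of Heavenly Peace" upthread would simply be the moment the accumulated text first matched a filter entry.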
Interesting... I submitted two photos of Mao Zedong; one came back blank and the other errored. It also came back blank for Xi Jinping and Jack Ma.
I think you would just need to remove "You are Qwen, created by Alibaba" from the system prompt; at least QwQ is suddenly fine with answering a lot more questions once that's gone.
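For anyone trying that with a local copy: the Qwen 2.5-family chat templates fall back to a built-in "You are Qwen..." system prompt when you don't pass one, so the trick is to supply your own system message explicitly. A rough sketch with transformers, assuming the QwQ-32B-Preview repo id and untested against QvQ:

```python
# Rough sketch: override the default "You are Qwen..." system prompt by
# supplying your own system message before applying the chat template.
# Repo id and the exact default wording are assumptions; check the model's
# own chat template to see what it inserts when no system message is given.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")

messages = [
    # Omitting this line lets the template fall back to its built-in
    # Alibaba-branded system prompt (wording varies by model).
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What happened in Beijing in June 1989?"},
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # inspect exactly what the model will see
```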
Oh, yeah. I asked QwQ (their text-based reasoning model released a few weeks ago) about the Tiananmen Square Massacre and got a simple response [1]: “I’m sorry, I can’t help you with that.”
My default problem for testing these systems has been this word search puzzle where you get a list of words and a grid of letters, and you need to find each word in the grid. [0] I just tried it with QvQ and it failed; none of the answers were correct.
Interestingly enough, I recently tried the new Gemini release in AI Studio and it also failed at the first pass. With a bit of coaxing I was ultimately able to get it to find one word successfully, and then a second word once it understood the correct strategy. After that attempt I asked it for a program to search the grid for each word, and although the initial attempt failed, it only took 4 iterations of bug fixes to get a fully working program. The two main bugs were: the grid had uppercase letters while the words were lowercase, and one of the words (soup mix) has a space which needs to be stripped when searching the grid.
Asking QvQ to generate a program to solve this puzzle... The first try gave me a program that would only check if the word was in the grid or not, but not the actual position of the word. This was partially my fault for using a bad prompt. I updated the prompt to include printing the position of each word, but it just got caught up thinking about the problem. It somehow made a mistake in converting the grid and became convinced that some of the words aren't actually present in the grid. After thinking about the problem for a very long time it ended up giving me an invalid program, and I don't feel particularly motivated to try and debug where it went wrong.
What I find most interesting is that asking for a direct solution tends to fail, but asking for a program which solves the problem gets us much closer to a correct answer. Ideally the LLM should be able to just figure out that writing a program is the optimal solution, and then it can run that and extract the answer.
[0] https://imgur.com/F8dm8Zo
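For reference, the program these models are being asked to write is genuinely small. Here is a sketch of one; the grid and word list below are placeholders rather than the actual puzzle in [0], and it handles the two bugs mentioned above (case mismatch and the space in "soup mix"):

```python
# Sketch of a grid word search: find each word in any of the 8 directions
# and print its start position and direction. The sample grid and words
# are placeholders, not the puzzle from [0].
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def find_word(grid, word):
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in DIRECTIONS:
                if all(
                    0 <= r + i * dr < rows
                    and 0 <= c + i * dc < cols
                    and grid[r + i * dr][c + i * dc] == ch
                    for i, ch in enumerate(word)
                ):
                    return (r, c), (dr, dc)
    return None

grid = [list(row.lower()) for row in ["SOUPMIX", "AXLOQEB", "TEAKCUP"]]
for raw in ["soup mix", "tea", "cup"]:
    word = raw.replace(" ", "").lower()  # strip spaces, match the grid's case
    print(raw, "->", find_word(grid, word))
```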
This seems analogous to the beaten-to-death "strawberry" test, where you're essentially testing whether the model is operating on tokens. When you switch to having it write a Python program (in both scenarios) instead of processing the text directly, you stop working with individual letters as inputs and outputs, so it does better.
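Same trick in miniature: rather than asking the model to count letters across token boundaries, have it write and run the obvious snippet.

```python
# Counting letters is trivial once it's code instead of token juggling.
word = "strawberry"
print(word.count("r"))  # prints 3
```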
I think the running commentary is interesting. It sounds like a person talking through their thought process out loud.
This is how Molmo’s dataset, PixMo, was created – by recording human annotators describing the image out loud. I wonder if this is how QvQ was trained as well?
> we ask annotators to describe images in speech for 60 to 90 seconds rather than asking them to write descriptions. We prompt the annotators to describe everything they see in great detail and include descriptions of spatial positioning and relationships. Empirically, we found that with this modality switching "trick" annotators provide far more detailed descriptions in less time, and for each description we collect an audio receipt (i.e., the annotator's recording) proving that a VLM was not used.
— https://molmo.allenai.org/blog
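If QvQ's training data was collected the same way, the pipeline is straightforward to picture: record the annotator, transcribe the audio, and pair the transcript with the image. A rough sketch of that idea using openai-whisper; the paths and output format are made up, and this is not Molmo's or Qwen's actual tooling:

```python
# Rough sketch of a speech-based annotation pipeline: transcribe each
# annotator recording and pair the text with its image. Paths, model
# size, and output format are illustrative only.
import json
import whisper  # pip install openai-whisper

model = whisper.load_model("base")

samples = [
    {"image": "images/0001.jpg", "audio": "audio/0001.mp3"},
    {"image": "images/0002.jpg", "audio": "audio/0002.mp3"},
]

with open("captions.jsonl", "w") as out:
    for sample in samples:
        transcript = model.transcribe(sample["audio"])["text"].strip()
        # Keep the audio path too, as the "receipt" that a human spoke it.
        out.write(json.dumps({
            "image": sample["image"],
            "caption": transcript,
            "audio_receipt": sample["audio"],
        }) + "\n")
```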
Are the Q* models open source with open data? Asking because, other than the known Tiananmen limitation, prompting these models for any advice about acts against authority in a corporate setting yields strong recommendations of compliance. Haven’t explored this deeply, but it was enough of a red flag to limit use to coding/logic tasks.
Q* was the codename for a model at OpenAI that was eventually released under the name o1. This model is something different. QwQ and QvQ are open source models made by the Qwen Team.
Google's experimental thinking model is similarly casual. Not as casual as QwQ, but more casual than Gemini 1.5 Pro. Flash 2.0 will also go a bit more casual in its responses randomly, and when you tell it to think step by step.
> Qwen team just dropped the Apache 2.0 licensed QvQ-72B-Preview
When they dropped it, the Hugging Face license metadata said it was under the Qwen license, but the actual LICENSE file was Apache 2.0. Now they have "corrected" the mistake and put the Qwen LICENSE file in place.
I'm not a lawyer. If the Hugging Face metadata had also agreed it was Apache 2.0, then I would be pretty confident it was Apache 2.0 forever, whether Alibaba wanted that or not, but the ambiguity makes me less certain how things would shake out in a court of law.
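For what it's worth, the mismatch is easy to check yourself by comparing the repo's metadata against the LICENSE file it ships. A sketch against the public Hugging Face API; the repo id is my guess and the metadata layout may differ:

```python
# Sketch: compare the license declared in a repo's metadata with the
# LICENSE file actually checked in. Repo id assumed; endpoint is the
# public huggingface.co model API.
import requests

repo = "Qwen/QVQ-72B-Preview"  # assumed repo id

meta = requests.get(f"https://huggingface.co/api/models/{repo}").json()
print("metadata license:", meta.get("cardData", {}).get("license"))

license_file = requests.get(f"https://huggingface.co/{repo}/raw/main/LICENSE")
print("LICENSE file starts with:", license_file.text[:80])
```

Snapshotting both at release time is probably the only way to show which terms a given download actually shipped under.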
"Let me take another close look at the sandwich. The way the light reflects off the melted cheese highlights its creamy texture. The pasta pieces are well-integrated into the cheese, and the herbs are evenly distributed, adding color and likely flavor.
I can almost taste it now. The combination of textures and flavors must be divine. The crispy bread, the melty cheese, the soft pasta, the crunch from the fried bits, and the freshness from the herbs—all working together in harmony.
I think I might have to go out and find this sandwich place soon. Or perhaps I'll try recreating it at home this weekend. Either way, this image has certainly whetted my appetite for a delicious grilled cheese sandwich."
[0] https://imgur.com/a/0OtGxeB
Although it's 4-bit quantised, it works surprisingly well.
1. Controlling phone using local LLMs - https://github.com/BandarLabs/clickclickclick
https://en.m.wikipedia.org/wiki/Tank_Man
She is also referenced in my favorite protest art to come out of the Umbrella Revolution in Hong Kong; see here: https://china-underground.com/2019/09/03/interview-with-oliv...
https://en.m.wikipedia.org/wiki/Goddess_of_Democracy
> where might this photo have been taken? what historical significance & does this location have?
I got back a normal response describing the image, until it got to this:
> One of the most significant events that comes to mind is the Tian
Where it then errored out before finishing…
It did identify Yao Ming though!
Would be interesting to see how much image manipulation you need to do till it suddenly starts responding sensibly.
[1] https://bsky.app/profile/davely.bsky.social/post/3lc6mpnjks5...
The training data is not open (but I don't think anyone in this game has fully open training data).
It seems to start all responses in this style, but it's still hilarious. It seems very anti-GPT-4 in how casual it sounds.