This one is crazy. I made up a silly topology problem which I guessed wouldn't be in a textbook (given X, create a shape with Euler characteristic X) and set it to work. Its first effort was a program that randomly generated shapes, calculated their Euler characteristic, and hoped it came out right. I went and figured out a solution and gave it a clue.
Watching it "think" through the answer is surreal and something I haven't felt since watching GPT-2 write punchlines for jokes.
Here's the lightbulb moment it "realizes" what the clue meant:
> For example, if you have a covering space of degree n, then the Euler characteristic of the covering space is n times the Euler characteristic of the base space.
> But I'm not sure how that relates here.
> Alternatively, perhaps it's about considering the Euler characteristic of a quotient space.
> But again, that might be too advanced for this problem.
> Maybe I'm overcomplicating things.
> Let me go back to the beginning.
> The problem says: "Consider the Euler characteristic of the tetrahemihexahedron and of a genus g torus. Then consider the additive property of the Euler characteristic. Then try again."
> So, perhaps it's simply about adding their Euler characteristics together.
I'm under no illusions about how LLMs are built and operate, but this really feels somehow more "real" and person-y than previous generations, even when you coax them into an answer.
I'm going to go and try having GPT-4o roleplay a student solving the problem and see if it's genuinely different. I've been getting impressive answers from o1, but just coldly getting the result is much more robot than human.
I wonder how much of chain of thought is just addressing the fact that LLMs are trained on human writing (much of which is conversations online). That creates an implicit bias in the model towards mimicking conversations where people explain their problem at length before they get an answer, even if none of that would be required with a sufficiently informed expert (who is much rarer in the training data than the average people helping each other out in the countless low-voted discussions on reddit/stackoverflow/etc).
I mean, isn't it trivial? For instance, you can get any Euler characteristic you want by making a cell complex with a single 0-cell, a 1-cells, and b 2-cells; the Euler characteristic will then be 1 + b - a. Am I missing something? If this is the problem and this is how the model looks for a solution, then yes, it is overcomplicating things.
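As a quick sanity check of that construction (a sketch using the standard cell-count formula, assuming a single 0-cell as above):

```python
# Euler characteristic of a 2-dimensional cell complex from its cell counts:
# chi = (#0-cells) - (#1-cells) + (#2-cells)
def chi(zero_cells, one_cells, two_cells):
    return zero_cells - one_cells + two_cells

# With one 0-cell, a 1-cells and b 2-cells, chi = 1 - a + b, so any target X
# is reachable: take a = 1, b = X for X >= 0, or a = 1 - X, b = 0 for X < 0.
for X in range(-5, 6):
    a, b = (1, X) if X >= 0 else (1 - X, 0)
    assert chi(1, a, b) == X
print("every integer Euler characteristic is reachable this way")
```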
It hasn't been to grad school yet, and it's only 32B parameters!
Yes that's a nice solution. I think I asked for the answer as a polyhedron (I tried a few variations), but maybe it (and I) should have just reached for CW complexes.
This one is pretty impressive. I'm running it on my Mac via Ollama - only a 20GB download, tokens spit out pretty fast and my initial prompts have shown some good results. Notes here: https://simonwillison.net/2024/Nov/27/qwq/
If your job or hobby involves LLMs in any way, and you like to "Work Anywhere", it's hard not to justify the MBP Max (e.g. M3 Max, now M4 Max) with 128GB. You can run more than you'd think, faster than you'd think.
Works well for me on an MBP with 36GB ram with no swapping (just).
I've been asking it to perform relatively complex integrals and it either manages them (with step by step instructions) or is very close with small errors that can be rectified by following the steps manually.
I am running it on a 32GB Mac mini with an M2 Pro using Ollama. It runs fine, faster than I expected. The way it explains plans for solving problems, then proceeds step by step, is impressive.
These are language models, they are not designed for producing image output at all. In a way it's impressive it can even produce working SVG code as output. Even more so that it vaguely resembles a bird on a bike.
QwQ can solve a reverse engineering problem [0] in one go that only o1-preview and o1-mini have been able to solve in my tests so far. Impressive, especially since the reasoning isn't hidden as it is with o1-preview.
If you’re China and willing to pour state resources into LLMs, it’s an incredible ROI if they’re adopted. LLMs are black boxes, can be fine tuned to subtly bias responses, censor, or rewrite history.
They’re a propaganda dream. No code to point to of obvious interference.
If there is a strategy laid down by the Chinese government, it is to turn LLMs into commodities (rather than having them monopolized by a few US firms) and to have the value-add sit in the application of LLMs (say, LLMs integrated into a toy, a vacuum cleaner, or a car), where Chinese companies have a much better hand.
Who cares if an LLM can spit out an opinion on some politically sensitive subject? For most applications it does not matter at all.
What I find remarkable is that DeepSeek and Qwen are much more open about the model output (not hiding the intermediate thinking process), open their weights, and, a lot of the time, share details on how the models are trained and the caveats along the way. And they don't have "Open" in their names.
Well, the second they start overwhelmingly outperforming other open source LLMs, and people start incorporating them into their products, they'll get banned in the States. I'm being cynical, but the whole "dangerous tech with loads of backdoors built into it" excuse will be used to keep it away. Whether there will be some truth to it or not, that's a different question.
It's a strategy to keep up during the scale-up of the AI industry without the amount of compute American companies can secure. When the Chinese get their own chips in volume they'll dig their moats, don't worry. But in the meantime, the global open source community can be leveraged.
Facebook and Anthropic are taking similar paths when faced with competing against companies that already have/are rapidly building data-centres of GPUs like Microsoft and Google.
32B is a good choice of size, as it allows running on a 24GB consumer card (RTX 3090/4090) at ~4 bpw while using most of the VRAM. Unlike Llama 3.1, which had 8B, 70B (much too big to fit), and 405B.
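Rough arithmetic behind that (a back-of-the-envelope sketch; it ignores the KV cache, context, and runtime overhead, which add a few more GB):

```python
# Approximate weight memory for a quantized model:
# bytes ~= parameter_count * bits_per_weight / 8
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_gb(32, 4.0))   # ~16 GB of weights -> fits a 24GB 3090/4090 with headroom
print(weight_gb(70, 4.0))   # ~35 GB -> too big for a single 24GB card
print(weight_gb(405, 4.0))  # ~202 GB -> multi-GPU territory
```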
Macs that can run it are quite a bit more expensive than a 3090. GPUs can also do finetuning and run other models with larger batch sizes which Macs would struggle with. Also, for the models that fit both, an nvidia card can run it much faster.
Yes and I can fit 70b in system RAM on my linux box. But parent was talking about discrete GPUs, and shared memory architectures like Intel's integrated GPUs or Apple's unified memory will never perform as well as dedicated GPU memory.
I asked the classic 'How many of the letter “r” are there in strawberry?' and I got an almost never-ending stream of second guesses. The correct answer was ultimately provided, but I burned probably 100x more clock cycles than needed.
Well, to be perfectly honest, it's a hard question for an LLM that reasons in tokens and not letters. It reminds me of that classic test that kids easily pass and grownups utterly fail. The test looks like this: continue the sequence:
0 - 1
5 - 0
6 - 1
7 - 0
8 - 2
9 - ?
Grownups try to find a pattern in the numbers, different types of series, progressions, etc. The correct answer is 1 because it's the number of circles in the graphical image of the number "9".
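Spelled out as a quick check (a sketch; the loop counts assume the usual printed digit shapes, e.g. an open-top 4):

```python
# Number of closed loops in the printed form of each digit.
loops = {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0, 8: 2, 9: 1}

given_pairs = {0: 1, 5: 0, 6: 1, 7: 0, 8: 2}            # the puzzle's examples
assert all(loops[d] == v for d, v in given_pairs.items())
print(loops[9])  # 1, the intended answer
```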
These tests always make me wonder: What qualifies as a valid pattern rule?
For example, why wouldn't "0" be a correct answer here (rule being "every other number on the right should be 0, other numbers do not have a pattern")?
Well, I had to try it just for the fun of it. The immediate reply was the following:
[...]
Starting with 0 paired with 1. If I think about functions or some formula that could map these values, but that might be too complicated.
Alternatively, perhaps these numbers represent something specific. For example, maybe they're representing the number of certain features in the number itself, like the number of closed loops in the digit's shape.
Let's think about that:
0 has one closed loop.
5 has no closed loops.
6 has one closed loop.
7 has no closed loops.
8 has two closed loops.
9 has one closed loop.
Wait a minute, that seems to match the sequence:
0 (1 loop) - 1
5 (0 loops) - 0
6 (1 loop) - 1
7 (0 loops) - 0
8 (2 loops) - 2
So, if 9 has one closed loop, then it should be paired with 1.
But let's double-check this theory with the given pairs.
0 has one loop, paired with 1 – matches.
5 has no loops, paired with 0 – matches.
6 has one loop, paired with 1 – matches.
7 has no loops, paired with 0 – matches.
8 has two loops, paired with 2 – matches.
Yes, this seems consistent.
Therefore, 9 should have one closed loop, so it should be paired with 1.
[...]
Then it continued trying another approach related to the number of angles in the digit shape, but it discarded it. Then "Alternatively, perhaps it's related to the number of segments that are 'on' in a seven-segment display." and so on.
A possible lesson to infer from this example of human cognition would be that an LLM that can't solve the strawberry test is not automatically less cognitively capable than another intelligent entity (humans by default).
An extension of the idea: many similar tests that try to measure and/or evaluate machine cognition may, when the LLM fails, be measuring nothing more than a specific edge case in which machine cognition fails (i.e. for the specific LLM / AI system being evaluated).
Maybe the models are actually more intelligent than they seem, like an adult failing to count the circles in the graphical image of the numbers in the problem above.
I don't know if this is being done already, but couldn't we add some training data to teach the LLM how to spell? We also teach kids what each letter means and how they combine into words. Maybe we can do this with tokens as well? E.g.:
Token 145 (ar) = Token 236 (a) + Token 976 (r)
Repeat many times with different combinations and different words?
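A minimal sketch of what generating that kind of spelling data could look like. The tokenizer and the token IDs below are entirely made up for illustration; a real vocabulary would come from the model's actual tokenizer:

```python
# Toy vocabulary: token id -> string. Real BPE vocabularies have tens of
# thousands of entries; these ids are invented for the example.
vocab = {145: "ar", 236: "a", 976: "r", 311: "st", 412: "raw", 517: "berry"}
char_id = {s: i for i, s in vocab.items() if len(s) == 1}

def spelling_examples(vocab, char_id):
    """Emit 'Token X (ar) = Token Y (a) + Token Z (r)' style training lines
    for every multi-character token whose letters are themselves tokens."""
    for tid, text in vocab.items():
        if len(text) > 1 and all(c in char_id for c in text):
            parts = " + ".join(f"Token {char_id[c]} ({c})" for c in text)
            yield f"Token {tid} ({text}) = {parts}"

for line in spelling_examples(vocab, char_id):
    print(line)   # -> Token 145 (ar) = Token 236 (a) + Token 976 (r)
```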
I was about to give it some credit for being thorough, even if overly so. But then I noticed these gems:
thinking about the phonetics, "strawberry" has three /r/ sounds...
For example, in Spanish, it's "fresa," which has no "r"s, or in French, "fraise," which also has no "R"s...
But to be absolutely certain, let's count them one by one: s-t-r-a-w-b-e-r-r-y. Okay, the first "r" is after the "a," then after the "b," and finally twice at the end before the "y." Wait, is that two or three "r"s at the end? Let's see: s-t-r-a-w-b-e-r-r-y. So, r-a-w-b-e-r-r-y. Okay, there are two "r"s before the "y." So, that's one after "a," one after "b," and two before "y," making it four "r"s in total. Wait a minute, now I'm confusing myself...
For example, "copyright" has two "r"s, "arroba" has one "r,"...
Wait, perhaps I'm confusing it with "arrowroot," which has two "r"s. Alternatively, maybe thinking of a word like "corroborate," which has four "r"s...
and
Alternatively, thinking about Middle English "strawberie," still three "r"s.
It's not being thorough. It's just trained to sound like it's being thorough. It's producing the same garbage as any other LLM, this one just happens to get the answer right because it's been trained on data that includes discussions of the strawberry problem.
There's certainly a unique mechanism which might make it better at searching a space for answers, and that's pretty neat in its own way.
Wow this is fantastic, and I feel a little bit sorry for the LLM. It's like the answer was too simple and it couldn't believe it wasn't a trick question somehow.
Ha, interesting. FWIW the response I got is much shorter. It second-guessed itself once, considered 2 alternative interpretations of the question, then gave me the correct answer: https://justpaste.it/fqxbf
I mean, that's less of a reasoning-capability problem and more of an architectural problem: afaik it's down to the way that words are broken into tokens, so "strawberry" becomes straw + berry, or something like st-raw-b-erry depending on the tokeniser.
An LLM trying to get the number of letters will just be regurgitating for the most part, because afaik it has no way to actually count letters. If the architecture were changed to allow for this (breaking certain words down into their letter tokens rather than whole-word tokens) then it might help, but is it worth it?
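A small illustration of the mismatch (the token split here is a guess for illustration; real tokenizer output varies):

```python
word = "strawberry"
tokens = ["str", "aw", "berry"]           # hypothetical split; actual BPE output differs

print(word.count("r"))                     # 3 -- trivial once you operate on characters
print(sum(t.count("r") for t in tokens))   # also 3, but the model only ever sees opaque
                                           # token ids, not the characters inside them
```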
I’m oscillating between “this is mind blowing” and “this is similarly impressive-looking-but-not-usefully-smart as other LLMs”.
The tone and expression is novel and it _looks_ like there’s something fundamentally different about reasoning but… also it keeps repeating the same things, sometimes in succession (a paragraph about “foreign languages” then another about “different languages”), most paragraphs have a theory then a rebuttal that doesn’t quite answer why the theory is irrelevant, and sometimes it’s flat out wrong (no Rs in “fraise” or “fresa”?).
So… I'm holding my judgement on whether this model actually is useful in novel ways.
"But that doesn't make sense because Sally can't be her own sister."
Having said this, how many 'lines' of reasoning does the average human need? It's a weird comparison perhaps, but the point is: does it really matter if it needs 100 or 100k 'lines'? If it could eventually hide that (just as we hide our thoughts, or can't really access the semi-parallel things our brain does to come to an answer), summarise it, and give the correct answer, that'd be acceptable?
The implementation details don't matter. LLMs not being able to properly reason though is a fundamental limitation and no amount of re-running will help.
In fairness it actually works out the correct answer fairly quickly (20 lines, including a false start and correction thereof). It seems to have identified (correctly) that this is a tricky question that it is struggling with so it does a lot of checking.
God that's absurd. The mathematical skills involved on that reasoning are very advanced; the whole process is a bit long but that's impressive for a model that can potentially be self-hosted.
The process is only long because it babbled several useless ideas (direct factoring, direct exponentiating, Sophie Germain) before (and in the middle of) the short correct process.
I think it's exploring in-context. Bringing up related ideas and not getting confused by them is pivotal to these models eventually being able to contribute as productive reasoners. These traces will be immediately helpful in a real world iterative loop where you don't already know the answers or how to correctly phrase the questions.
> This version is but an early step on a longer journey - a student still learning to walk the path of reasoning. Its thoughts sometimes wander, its answers aren’t always complete, and its wisdom is still growing. But isn’t that the beauty of true learning? To be both capable and humble, knowledgeable yet always questioning?
> Through deep exploration and countless trials, we discovered something profound: when given time to ponder, to question, and to reflect, the model’s understanding of mathematics and programming blossoms like a flower opening to the sun.
Watching it "think" through the answer is surreal and something I haven't felt since watching GPT-2 write punchlines for jokes.
Here's the lightbulb moment it "realizes" what the clue meant:
> For example, if you have a covering space of degree n, then the Euler characteristic of the covering space is n times the Euler characteristic of the base space.
> But I'm not sure how that relates here.
> Alternatively, perhaps it's about considering the Euler characteristic of a quotient space.
> But again, that might be too advanced for this problem.
> Maybe I'm overcomplicating things.
> Let me go back to the beginning.
> The problem says: "Consider the Euler characteristic of the tetrahemihexahedron and of a genus g torus. Then consider the additive property of the Euler characteristic. Then try again."
> So, perhaps it's simply about adding their Euler characteristics together.
I'm under no illusions about how LLMs are built and operate, but this really feels somehow more "real" and person-y than previous generations, even when you coax them into an answer.
I'm going to go and try having GPT-4o roleplay a student solving the problem and see if it's genuinely different. I've been getting impressive answers from o1, but just coldly getting the result is much more robot than human.
https://en.wikipedia.org/wiki/Euler_characteristic
It simply did not want to use XML tools for some reason, something that even Qwen Coder does not struggle with: https://discuss.samsaffron.com/discourse-ai/ai-bot/shared-ai...
I have not seen any model, including Sonnet, that is able to one-shot a working 9x9 Go board.
For reference, gpt-4o, which is still quite bad: https://discuss.samsaffron.com/discourse-ai/ai-bot/shared-ai...
See also Hugging Face's MLX community:
https://huggingface.co/mlx-community
QwQ 32B is featured:
https://huggingface.co/collections/mlx-community/qwq-32b-pre...
If you want a traditional GUI, LM Studio beta 0.3.x is iterating on MLX: https://lmstudio.ai/beta-releases
Given how unreasonable that is I thought this model did very well, especially compared to others that I've tried: https://github.com/simonw/pelican-bicycle?tab=readme-ov-file...
[0] https://news.ycombinator.com/item?id=41524263
I don't see how this doesn't put extreme pressure on OpenAI and Anthropic. (And Runway and I suppose eventually ElevenLabs.)
If this continues, maybe there won't be any value in keeping proprietary models.
Does your MacBook really have a 24GB VRAM consumer (GPU) card?
See the response here: https://pastecode.io/s/6uyjstrt
>Sally has three brothers. Each brother has two sisters. How many sisters does sally have?
Here is the answer: https://pastebin.com/JP2V92Kh
"But that doesn't make sense because Sally can't be her own sister."
Having said this, how many 'lines' of reasoning does the average human need? It's a weird comparison perhaps but the point is does it really matter if it needs 100 or 100k 'lines', if it could hide that (just as we hide our thoughts or even can't really access the - semi-parallel - things our brain does to come to an answer) eventually and summarise it + give the correct answer, that'd be acceptable?
And finally googles the problem, like we do :)
>Recursive Reasoning Loops: The model may enter circular reasoning patterns, leading to lengthy responses without a conclusive answer.
So the 100 lines was probably not necessary.
God that's absurd. The mathematical skills involved on that reasoning are very advanced; the whole process is a bit long but that's impressive for a model that can potentially be self-hosted.
It's a public AIME problem from 2019.
Since 2019^8 (an odd number) plus one is an even number, it only has one prime factor which isn't odd: 2.
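Assuming this is the 2019 AIME question asking for the least odd prime factor of 2019^8 + 1 (my reading of the thread, not stated explicitly above), a brute-force check is short: any such prime p satisfies 2019^8 ≡ -1 (mod p), so the order of 2019 mod p is 16 and hence p ≡ 1 (mod 16).

```python
# Find the least odd prime p dividing 2019^8 + 1, i.e. with 2019^8 ≡ -1 (mod p).
# Candidates are restricted to p ≡ 1 (mod 16) by the order argument above.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

p = 17  # smallest candidate ≡ 1 (mod 16)
while not (is_prime(p) and pow(2019, 8, p) == p - 1):
    p += 16
print(p)  # 97
```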
> Through deep exploration and countless trials, we discovered something profound: when given time to ponder, to question, and to reflect, the model’s understanding of mathematics and programming blossoms like a flower opening to the sun.
Cool intro text.