This one is crazy. I made up a silly topology problem which I guessed wouldn't be in a textbook (given X, create a shape with Euler characteristic X) and set it to work. Its first effort was a program that randomly generated shapes, calculated their Euler characteristic, and hoped it came out right. I went and figured out a solution and gave it a clue.
Watching it "think" through the answer is surreal and something I haven't felt since watching GPT-2 write punchlines for jokes.
Here's the lightbulb moment it "realizes" what the clue meant:
> For example, if you have a covering space of degree n, then the Euler characteristic of the covering space is n times the Euler characteristic of the base space.
> But I'm not sure how that relates here.
> Alternatively, perhaps it's about considering the Euler characteristic of a quotient space.
> But again, that might be too advanced for this problem.
> Maybe I'm overcomplicating things.
> Let me go back to the beginning.
> The problem says: "Consider the Euler characteristic of the tetrahemihexahedron and of a genus g torus. Then consider the additive property of the Euler characteristic. Then try again."
> So, perhaps it's simply about adding their Euler characteristics together.
I'm under no illusions about how LLMs are built and operate, but this really feels somehow more "real" and person-y than previous generations, even when you coax them into an answer.
I'm going to go and try having GPT-4o roleplay a student solving the problem and see if it's genuinely different. I've been getting impressive answers from o1, but just coldly getting the result is much more robot than human.
I wonder how much of chain of thought is just addressing the fact that LLMs are trained on human writing (much of which is conversations online). That creates an implicit bias in the model towards mimicking conversations where people explain their problem at length before they get an answer, even if none of that would be required with a sufficiently informed expert (who is much rarer in the training data than the average people helping each other out in the countless low-voted discussions on reddit/stackoverflow/etc).
I mean, isn't it trivial? For instance, you can get any Euler characteristic you want by making a cell complex with a single 0-cell, a 1-cells, and b 2-cells; the Euler characteristic will then be 1 + b - a. Am I missing something? If this is the problem and this is how the model looks for a solution, then yes, it is overcomplicating things.
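As a quick sanity check of that construction (a sketch using the standard cell-count formula, assuming a single 0-cell as above):

```python
# Euler characteristic of a 2-dimensional cell complex from its cell counts:
# chi = (#0-cells) - (#1-cells) + (#2-cells)
def chi(zero_cells, one_cells, two_cells):
    return zero_cells - one_cells + two_cells

# With one 0-cell, a 1-cells and b 2-cells, chi = 1 - a + b, so any target X
# is reachable: take a = 1, b = X for X >= 0, or a = 1 - X, b = 0 for X < 0.
for X in range(-5, 6):
    a, b = (1, X) if X >= 0 else (1 - X, 0)
    assert chi(1, a, b) == X
print("every integer Euler characteristic is reachable this way")
```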
It hasn't been to grad school yet, and it's only 32B parameters!
Yes that's a nice solution. I think I asked for the answer as a polyhedron (I tried a few variations), but maybe it (and I) should have just reached for CW complexes.
This one is pretty impressive. I'm running it on my Mac via Ollama - only a 20GB download, tokens spit out pretty fast and my initial prompts have shown some good results. Notes here: https://simonwillison.net/2024/Nov/27/qwq/
If your job or hobby involves LLMs in any way, and you like to "Work Anywhere", it's hard not to justify the MBP Max (e.g. M3 Max, now M4 Max) with 128GB. You can run more than you'd think, faster than you'd think.
Works well for me on an MBP with 36GB ram with no swapping (just).
I've been asking it to perform relatively complex integrals and it either manages them (with step by step instructions) or is very close with small errors that can be rectified by following the steps manually.
I am running it on a 32GB Mac mini with an M2 Pro using Ollama. It runs fine, faster than I expected. The way it explains plans for solving problems, then proceeds step by step, is impressive.
These are language models, they are not designed for producing image output at all. In a way it's impressive it can even produce working SVG code as output. Even more so that it vaguely resembles a bird on a bike.
QwQ can solve a reverse engineering problem [0] in one go that only o1-preview and o1-mini have been able to solve in my tests so far. Impressive, especially since the reasoning isn't hidden as it is with o1-preview.
If you’re China and willing to pour state resources into LLMs, it’s an incredible ROI if they’re adopted. LLMs are black boxes, can be fine tuned to subtly bias responses, censor, or rewrite history.
They’re a propaganda dream. No code to point to of obvious interference.
If there is a strategy laid down by the Chinese government, it is to turn LLMs into commodities (rather than having them monopolized by a few US firms) and to have the value-add sit in the application of LLMs (say, LLMs integrated into a toy, a vacuum cleaner, or a car), where Chinese companies have a much better hand.
Who cares if an LLM can spit out an opinion on some politically sensitive subject? For most applications it does not matter at all.
What I find remarkable is that DeepSeek and Qwen are much more open about the model output (not hiding the intermediate thinking process), open their weights, and, a lot of the time, share details on how the models are trained and the caveats along the way. And they don't have "Open" in their names.
Well, the second they start overwhelmingly outperforming other open source LLMs, and people start incorporating them into their products, they'll get banned in the States. I'm being cynical, but the whole "dangerous tech with loads of backdoors built into it" excuse will be used to keep it away. Whether there will be some truth to it or not, that's a different question.
It's a strategy to keep up during the scale-up of the AI industry without the amount of compute American companies can secure. When the Chinese get their own chips in volume they'll dig their moats, don't worry. But in the meantime, the global open source community can be leveraged.
Facebook and Anthropic are taking similar paths when faced with competing against companies that already have/are rapidly building data-centres of GPUs like Microsoft and Google.
32B is a good choice of size, as it allows running on a 24GB consumer card (RTX 3090/4090) at ~4 bpw while using most of the VRAM. Unlike Llama 3.1, which had 8B, 70B (much too big to fit), and 405B.
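Rough arithmetic behind that (a back-of-the-envelope sketch; it ignores the KV cache, context, and runtime overhead, which add a few more GB):

```python
# Approximate weight memory for a quantized model:
# bytes ~= parameter_count * bits_per_weight / 8
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_gb(32, 4.0))   # ~16 GB of weights -> fits a 24GB 3090/4090 with headroom
print(weight_gb(70, 4.0))   # ~35 GB -> too big for a single 24GB card
print(weight_gb(405, 4.0))  # ~202 GB -> multi-GPU territory
```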
Macs that can run it are quite a bit more expensive than a 3090. GPUs can also do finetuning and run other models with larger batch sizes which Macs would struggle with. Also, for the models that fit both, an nvidia card can run it much faster.
Yes and I can fit 70b in system RAM on my linux box. But parent was talking about discrete GPUs, and shared memory architectures like Intel's integrated GPUs or Apple's unified memory will never perform as well as dedicated GPU memory.
I asked the classic 'How many of the letter “r” are there in strawberry?' and I got an almost never-ending stream of second guesses. The correct answer was ultimately provided, but I burned probably 100x more clock cycles than needed.
Well, to be perfectly honest, it's a hard question for an LLM that reasons in tokens and not letters. It reminds me of that classic test that kids easily pass and grownups utterly fail. The test looks like this: continue the sequence:
0 - 1
5 - 0
6 - 1
7 - 0
8 - 2
9 - ?
Grownups try to find a pattern in the numbers, different types of series, progressions, etc. The correct answer is 1 because it's the number of circles in the graphical image of the number "9".
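Spelled out as a quick check (a sketch; the loop counts assume the usual printed digit shapes, e.g. an open-top 4):

```python
# Number of closed loops in the printed form of each digit.
loops = {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0, 8: 2, 9: 1}

given_pairs = {0: 1, 5: 0, 6: 1, 7: 0, 8: 2}            # the puzzle's examples
assert all(loops[d] == v for d, v in given_pairs.items())
print(loops[9])  # 1, the intended answer
```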
These tests always make me wonder: What qualifies as a valid pattern rule?
For example, why wouldn't "0" be a correct answer here (rule being "every other number on the right should be 0, other numbers do not have a pattern")?
Well, I had to try it just for the fun of it. The immediate reply was the following:
[...]
Starting with 0 paired with 1. If I think about functions or some formula that could map these values, but that might be too complicated.
Alternatively, perhaps these numbers represent something specific. For example, maybe they're representing the number of certain features in the number itself, like the number of closed loops in the digit's shape.
Let's think about that:
0 has one closed loop.
5 has no closed loops.
6 has one closed loop.
7 has no closed loops.
8 has two closed loops.
9 has one closed loop.
Wait a minute, that seems to match the sequence:
0 (1 loop) - 1
5 (0 loops) - 0
6 (1 loop) - 1
7 (0 loops) - 0
8 (2 loops) - 2
So, if 9 has one closed loop, then it should be paired with 1.
But let's double-check this theory with the given pairs.
0 has one loop, paired with 1 – matches.
5 has no loops, paired with 0 – matches.
6 has one loop, paired with 1 – matches.
7 has no loops, paired with 0 – matches.
8 has two loops, paired with 2 – matches.
Yes, this seems consistent.
Therefore, 9 should have one closed loop, so it should be paired with 1.
[...]
Then it continued trying another approach related to the number of angles in the digit shape, but it discarded it. Then "Alternatively, perhaps it's related to the number of segments that are 'on' in a seven-segment display." and so on.
A possible lesson to infer from this example of human cognition would be that an LLM that can't solve the strawberry test is not automatically less cognitively capable than another intelligent entity (humans by default).
An extension of the idea: many similar tests that try to measure and/or evaluate machine cognition may, when the LLM fails, be measuring nothing more than a specific edge case in which machine cognition fails (i.e. for the specific LLM / AI system being evaluated).
Maybe the models are actually more intelligent than they seem, like an adult failing to count the circles in the graphical image of the numbers in the problem above.
I don't know if this is being done already, but couldn't we add some training data to teach the LLM how to spell? We also teach kids what each letter means and how they combine into words. Maybe we can do this with tokens as well? E.g.:
Token 145 (ar) = Token 236 (a) + Token 976 (r)
Repeat many times with different combinations and different words?
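A minimal sketch of what generating that kind of spelling data could look like. The tokenizer and the token IDs below are entirely made up for illustration; a real vocabulary would come from the model's actual tokenizer:

```python
# Toy vocabulary: token id -> string. Real BPE vocabularies have tens of
# thousands of entries; these ids are invented for the example.
vocab = {145: "ar", 236: "a", 976: "r", 311: "st", 412: "raw", 517: "berry"}
char_id = {s: i for i, s in vocab.items() if len(s) == 1}

def spelling_examples(vocab, char_id):
    """Emit 'Token X (ar) = Token Y (a) + Token Z (r)' style training lines
    for every multi-character token whose letters are themselves tokens."""
    for tid, text in vocab.items():
        if len(text) > 1 and all(c in char_id for c in text):
            parts = " + ".join(f"Token {char_id[c]} ({c})" for c in text)
            yield f"Token {tid} ({text}) = {parts}"

for line in spelling_examples(vocab, char_id):
    print(line)   # -> Token 145 (ar) = Token 236 (a) + Token 976 (r)
```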
I was about to give it some credit for being thorough, even if overly so. But then I noticed these gems:
thinking about the phonetics, "strawberry" has three /r/ sounds...
For example, in Spanish, it's "fresa," which has no "r"s, or in French, "fraise," which also has no "R"s...
But to be absolutely certain, let's count them one by one: s-t-r-a-w-b-e-r-r-y. Okay, the first "r" is after the "a," then after the "b," and finally twice at the end before the "y." Wait, is that two or three "r"s at the end? Let's see: s-t-r-a-w-b-e-r-r-y. So, r-a-w-b-e-r-r-y. Okay, there are two "r"s before the "y." So, that's one after "a," one after "b," and two before "y," making it four "r"s in total. Wait a minute, now I'm confusing myself...
For example, "copyright" has two "r"s, "arroba" has one "r,"...
Wait, perhaps I'm confusing it with "arrowroot," which has two "r"s. Alternatively, maybe thinking of a word like "corroborate," which has four "r"s...
and
Alternatively, thinking about Middle English "strawberie," still three "r"s.
It's not being thorough. It's just trained to sound like it's being thorough. It's producing the same garbage as any other LLM, this one just happens to get the answer right because it's been trained on data that includes discussions of the strawberry problem.
There's certainly a unique mechanism which might make it better at searching a space for answers, and that's pretty neat in its own way.
Wow this is fantastic, and I feel a little bit sorry for the LLM. It's like the answer was too simple and it couldn't believe it wasn't a trick question somehow.
Ha, interesting. FWIW the response I got is much shorter. It second-guessed itself once, considered 2 alternative interpretations of the question, then gave me the correct answer: https://justpaste.it/fqxbf
I mean, that's less of a reasoning-capability problem and more of an architectural problem: afaik it's down to the way that words are broken into tokens, so "strawberry" becomes straw + berry, or something like st-raw-b-erry depending on the tokeniser.
An LLM trying to get the number of letters will just be regurgitating for the most part, because afaik it has no way to actually count letters. If the architecture were changed to allow for this (breaking certain words down into their letter tokens rather than whole-word tokens) then it might help, but is it worth it?
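A small illustration of the mismatch (the token split here is a guess for illustration; real tokenizer output varies):

```python
word = "strawberry"
tokens = ["str", "aw", "berry"]           # hypothetical split; actual BPE output differs

print(word.count("r"))                     # 3 -- trivial once you operate on characters
print(sum(t.count("r") for t in tokens))   # also 3, but the model only ever sees opaque
                                           # token ids, not the characters inside them
```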
I’m oscillating between “this is mind blowing” and “this is similarly impressive-looking-but-not-usefully-smart as other LLMs”.
The tone and expression is novel and it _looks_ like there’s something fundamentally different about reasoning but… also it keeps repeating the same things, sometimes in succession (a paragraph about “foreign languages” then another about “different languages”), most paragraphs have a theory then a rebuttal that doesn’t quite answer why the theory is irrelevant, and sometimes it’s flat out wrong (no Rs in “fraise” or “fresa”?).
So… I'm holding my judgement on whether this model actually is useful in novel ways.
"But that doesn't make sense because Sally can't be her own sister."
Having said this, how many 'lines' of reasoning does the average human need? It's a weird comparison perhaps, but the point is: does it really matter if it needs 100 or 100k 'lines'? If it could eventually hide that (just as we hide our thoughts, or can't really access the semi-parallel things our brain does to come to an answer), summarise it, and give the correct answer, that'd be acceptable?
The implementation details don't matter. LLMs not being able to properly reason though is a fundamental limitation and no amount of re-running will help.
In fairness it actually works out the correct answer fairly quickly (20 lines, including a false start and correction thereof). It seems to have identified (correctly) that this is a tricky question that it is struggling with so it does a lot of checking.
God that's absurd. The mathematical skills involved on that reasoning are very advanced; the whole process is a bit long but that's impressive for a model that can potentially be self-hosted.
The process is only long because it babbled several useless ideas (direct factoring, direct exponentiating, Sophie Germain) before (and in the middle of) the short correct process.
I think it's exploring in-context. Bringing up related ideas and not getting confused by them is pivotal to these models eventually being able to contribute as productive reasoners. These traces will be immediately helpful in a real world iterative loop where you don't already know the answers or how to correctly phrase the questions.
> This version is but an early step on a longer journey - a student still learning to walk the path of reasoning. Its thoughts sometimes wander, its answers aren’t always complete, and its wisdom is still growing. But isn’t that the beauty of true learning? To be both capable and humble, knowledgeable yet always questioning?
> Through deep exploration and countless trials, we discovered something profound: when given time to ponder, to question, and to reflect, the model’s understanding of mathematics and programming blossoms like a flower opening to the sun.
Watching it "think" through the answer is surreal and something I haven't felt since watching GPT-2 write punchlines for jokes.
Here's the lightbulb moment it "realizes" what the clue meant:
> For example, if you have a covering space of degree n, then the Euler characteristic of the covering space is n times the Euler characteristic of the base space.
> But I'm not sure how that relates here.
> Alternatively, perhaps it's about considering the Euler characteristic of a quotient space.
> But again, that might be too advanced for this problem.
> Maybe I'm overcomplicating things.
> Let me go back to the beginning.
> The problem says: "Consider the Euler characteristic of the tetrahemihexahedron and of a genus g torus. Then consider the additive property of the Euler characteristic. Then try again."
> So, perhaps it's simply about adding their Euler characteristics together.
I'm under no illusions about how LLMs are built and operate, but this really feels somehow more "real" and person-y than previous generations, even when you coax them into an answer.
I'm going to go and try having GPT-4o roleplay a student solving the problem and see if it's genuinely different. I've been getting impressive answers from o1, but just coldly getting the result is much more robot than human.
https://en.wikipedia.org/wiki/Euler_characteristic
It simply did not want to use XML tools for some reason, something that even Qwen Coder does not struggle with: https://discuss.samsaffron.com/discourse-ai/ai-bot/shared-ai...
I have not seen any model, including Sonnet, that is able to one-shot a working 9x9 Go board.
For reference, gpt-4o, which is still quite bad: https://discuss.samsaffron.com/discourse-ai/ai-bot/shared-ai...
See also Hugging Face's MLX community:
https://huggingface.co/mlx-community
QwQ 32B is featured:
https://huggingface.co/collections/mlx-community/qwq-32b-pre...
If you want a traditional GUI, LM Studio beta 0.3.x is iterating on MLX: https://lmstudio.ai/beta-releases
Given how unreasonable that is I thought this model did very well, especially compared to others that I've tried: https://github.com/simonw/pelican-bicycle?tab=readme-ov-file...
[0] https://news.ycombinator.com/item?id=41524263
I don't see how this doesn't put extreme pressure on OpenAI and Anthropic. (And Runway and I suppose eventually ElevenLabs.)
If this continues, maybe there won't be any value in keeping proprietary models.
Does your MacBook really have a 24GB VRAM consumer (GPU) card?
See the response here: https://pastecode.io/s/6uyjstrt
>Sally has three brothers. Each brother has two sisters. How many sisters does sally have?
Here is the answer: https://pastebin.com/JP2V92Kh
"But that doesn't make sense because Sally can't be her own sister."
Having said this, how many 'lines' of reasoning does the average human need? It's a weird comparison perhaps but the point is does it really matter if it needs 100 or 100k 'lines', if it could hide that (just as we hide our thoughts or even can't really access the - semi-parallel - things our brain does to come to an answer) eventually and summarise it + give the correct answer, that'd be acceptable?
And finally googles the problem, like we do :)
>Recursive Reasoning Loops: The model may enter circular reasoning patterns, leading to lengthy responses without a conclusive answer.
So the 100 lines was probably not necessary.
God that's absurd. The mathematical skills involved on that reasoning are very advanced; the whole process is a bit long but that's impressive for a model that can potentially be self-hosted.
It's a public AIME problem from 2019.
Since 2019^8 (an odd number) plus one is an even number, it only has one prime factor which isn't odd: 2.
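Assuming this is the 2019 AIME question asking for the least odd prime factor of 2019^8 + 1 (my reading of the thread, not stated explicitly above), a brute-force check is short: any such prime p satisfies 2019^8 ≡ -1 (mod p), so the order of 2019 mod p is 16 and hence p ≡ 1 (mod 16).

```python
# Find the least odd prime p dividing 2019^8 + 1, i.e. with 2019^8 ≡ -1 (mod p).
# Candidates are restricted to p ≡ 1 (mod 16) by the order argument above.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

p = 17  # smallest candidate ≡ 1 (mod 16)
while not (is_prime(p) and pow(2019, 8, p) == p - 1):
    p += 16
print(p)  # 97
```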
> Through deep exploration and countless trials, we discovered something profound: when given time to ponder, to question, and to reflect, the model’s understanding of mathematics and programming blossoms like a flower opening to the sun.
Cool intro text.