The one I'm running is the 8.54GB file. I'm using Ollama like this:
```shell
ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
```
You can prompt it directly there, but I'm using my LLM tool and the llm-ollama plugin to run and log prompts against it. Once Ollama has loaded the model (from the above command) you can try those with uvx like this:
```shell
uvx --with llm-ollama \
  llm -m 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' \
  'a joke about a pelican and a walrus who run a tea room together'
```
Thanks for all your writing on these topics, Simon! It has turned me from a bit of a naysayer into an optimist about this tooling, especially being able to run stuff locally with access to tools. I have an M4 Pro Mac mini arriving this week to build a similar self-hosted setup over Tailscale.
Yes, I did the same but with Docker serving open-webui. It's definitely not fast. It took 2 minutes to answer "what is the highest peak in California," but it eventually got there. OpenAI o1 took 6 seconds.
Pasting the logic for anyone interested:
<details type="reasoning" done="true" duration="102">
<summary>Thought for 102 seconds</summary>
> Okay, so I need to figure out what the tallest peak in California is. Hmm, I remember that the highest mountain in the contiguous United States is Mount Whitney, but wait, isn't that in California? Let me think. Yes, I'm pretty sure it's in California. But maybe there are other tall peaks too.
>
> Wait, no, Mount Whitney is definitely the tallest. It's over 14,000 feet. But sometimes people might confuse it with other mountains like Shasta or something else. Oh, wait, isn't Mount Shasta also a volcano? I think that's in California too but it's not as tall as Whitney.
>
> So, putting it together, Mount Whitney is the tallest peak in California at 14,505 feet. It's located in the Sierra Nevada range. People hike there a lot because it's such a high point. But I should double-check to make sure I'm not mixing up anything else. Yeah, that seems right.
> I wonder if there are any taller peaks elsewhere. Oh, Denali in Alaska is higher, but that's not in California. So within the state, Whitney is the highest. Got it.
> So the answer is Mount Whitney at 14,505 feet above sea level. It's a popular spot for hikers and climbers. Makes sense because it's such an iconic peak in the Sierra Nevada mountains.
> I think I've got all that right. No other peaks in California are taller than Whitney. So yeah, Mount Whitney is the tallest.
> Just to make sure, sometimes there are different names or maybe alternate names for peaks, but Whitney is pretty straightforward. Yeah, it's definitely the one.
> Okay, so I'm confident now. The answer is Mount Whitney.
> *Final Answer*
> The tallest peak in California is Mount Whitney, with an elevation of 14,505 feet above sea level.
>
> \boxed{Mount Whitney}
</details>
To determine the tallest peak in California, we consider notable mountains such as Mount Whitney and Mount Shasta. Mount Whitney is recognized as the highest peak in the contiguous United States, standing at 14,505 feet. It is located in the Sierra Nevada mountain range and is a popular destination for hikers. Mount Shasta, although also a significant peak, does not surpass Whitney's elevation. Therefore, the tallest peak in California is confirmed to be Mount Whitney.
*Final Answer*
The tallest peak in California is \boxed{Mount Whitney}.
> Wait, maybe the punchline is something like: "We don’t have any fish in the tea, but we do have a lot of krill."
Shucks, it was so close to coming up with a good punchline it could work back from.
I'm thinking set it in a single-cell comic. A downtrodden young man or woman sitting alone at a table, a pelican in the background clearly making drinks in its voluminous beak, and the walrus waiter places a cup in front of the person, consolingly saying "there's plenty of fish in the tea".
I think the problem is that humor isn't about reasoning and logic, but almost the reverse - it's about punchlines that surprise us (i.e. not what one would logically anticipate) and perhaps shock us by breaking taboos.
Even masters of humor like Seinfeld, with great intuition for what might work, still need to test new material in front of a live audience to see whether it actually does get a laugh or not.
Good point, but this is an 8GB model. That's the problem. This is a spectacular answer given that constraint. Also it's hilarious how bad the final answer is.
Thanks! Playing around with this vs the https://ollama.com/tripplyons/r1-distill-qwen-7b variant, I find the 7B to be somewhat of a sweet spot: it gets to the point with minimal (or at least less) waffle.
Certainly interesting reading their thought processes; the value in that might be greater than the answer itself, depending on the use case.
Yeesh, that shows a pretty comprehensive dearth of humour in the model. It did a decent examination of characteristics that might form the components of a joke, but completely failed to actually construct one.
I couldn't see a single idea or wordplay that actually made sense or elicited anything like a chuckle. The model _nearly_ got there with 'krill' and 'kill', but failed to actually make the pun that it had already identified.
Yeah, it's very interesting... It appears to lead itself astray: the way it looks at several situational characteristics, gives each a "throw-away" example, only to then mush all those examples together to make a joke, seems to be its downfall in this particular case.
Also, I can't help but think that if it had written out a few example jokes about animals, rather than simply "thinking" about jokes, it might have come up with something better.
Why shouldn't I assume that the "thinking" is just the usual LLM regurgitation of "how would a human coming up with a joke explain their reasoning?" or something like that, and zero actual "thinking"?
Given how it was trained (they didn't encourage it to think; they allowed it to), there was a lot of emergent behavior as it trained.
Sort of like chess engines rediscovering classic (named) chess openings. See section 2.2.3 for the training template (it's a single paragraph I can't reproduce here because I'm on my phone).
Example emergent behavior (section 2.2.4 page 8): the model learns to solve more complex problems by spending more time reasoning. It also naturally develops reflection (what have I tried?) and exploration strategies.
Fundamentally, you should think of this as a nn that learned to solve real problems by reasoning about them in written language.
(My favorite part: it defaulted to reasoning in multiple languages. They constrained it to only reason in a single language and this negatively impacted performance! But the hypothesis is that it improves interpretability)
You shouldn't assume that because you shouldn't assume the converse either.
The LLM will use the chain of thought, as it's being built out like a garden path, to explore different completion possibilities. What it completes afterwards isn't necessarily logically related, but it's definitely influenced.
The search may not be valuable as an artifact in itself. It's likely to be logically unsound especially in parts. But the end result may be useful.
Can someone ELI5 what the difference is between using the "quantized version of the Llama 3" from unsloth instead of the one that's on ollama, i.e. `ollama run deepseek-r1:8b`?
They are probably the same model; unsloth does model quants and provides them to the community. AFAIK Ollama doesn't; they just index publicly available models, whether full or quantized, for convenient use in their frontend.
FWIW, you can also try all of the distills out in BF16 on https://glhf.chat (either in the UI or via the API), including the 70b. Personally I've been most impressed with the Qwen 32b distill.
(And here's the output of the "pelican and the walrus in a tearoom" joke from DeepSeek's distilled 70b... Mildly better?)
<think> Alright, I'm supposed to come up with a joke about a pelican and a walrus in a tea room. Hmm, sounds like a classic setup for a funny situation. Let me think about the characteristics of these animals that could be used humorously.
Okay, so pelicans are known for their large beaks, right? They can carry a lot of water in them. Walruses have those big tusks and blubbery bodies. Maybe I can play on the idea of them trying to have a tea party but their physical traits causing some trouble.
Wait, being in a tea room, they might be trying to sip tea, but their size or features get in the way. The pelican's beak could be problematic with the cup. And the walrus's size might knock things over or make a mess.
Let me consider the flow. Maybe the pelican struggles with the teacup because of his beak, and the walrus has a comment about it. Then the walrus tries to sip but ends up splashing tea everywhere. That could lead to a funny exchange.
Putting it together: Pelican says something self-deprecating about his beak not being suited for tea. Then the walrus, who perhaps agrees, tries to sip and splashes tea, making them both laugh. That feels natural and funny.
I should make sure the punchline is clear and the dialogue flows smoothly. Maybe start with the pelican struggling, then the walrus makes a comment, and their interaction leads to the humorous conclusion. Yeah, that should work. </think>
Why did the pelican and the walrus get kicked out of the tea room?
Because the pelican kept dipping his beak in the wrong cup, and the walrus kept breaking the delicate china with his tusks. They just couldn’t “sea” themselves fitting in!
Six months ago I had almost given up on local LLMs - they were fun to try but they were so much less useful than Sonnet 3.5 / GPT-4o that it was hard to justify using them.
That's changed in the past two months. Llama 3 70B, Qwen 32B and now these R1 models are really impressive, to the point that I'm considering trying to get real work done with them.
The catch is RAM: I have 64GB, but loading up a current GPT-4 class model uses up around 40GB of that - which doesn't leave much for me to run Firefox and VS Code.
So I'm still not likely to use them on a daily basis - but it does make me wonder if I should keep this laptop around as a dedicated server next time I upgrade.
If you are worried about security or IP at all, it's preferable to run locally, or spin up your own box that you can query running one of these models.
These "thinking" LLMs are essentially productizing that trick. All the rambling and second guessing they go through at the beginning gives them source material to compute the "better" answer from at the end.
“I never really had a childhood”, said Walrus, blowing on his tea with a feigned sigh. “Why’s that?” asked Pelican, refilling a sugar shaker. Walrus: “I was born long in the tooth!” Pelican: [big stupid pelican laughing noise]
Would that be funny? Failing a job interview because, when asked to make a joke about a pelican and a walrus who run a tea room together, you couldn't come up with a quip, thereby showing your inability to reason.
Let's not even talk about the "r" you forgot when asked to write "cranberry"...
I think "reasoning" models will solve the joke issue (amongst other issues), but not because they're "reasoning". Rather because they help solve the exploration issue and the scaling issue.
Having worked with LLMs a lot for my JoyCaption project, I've got all these hypotheses floating around in my head. I guess the short version, specifically for jokes, is that we lack "joke reasoning" data. The solution, as with mathematical problems, is to get the LLM to generate the data and then RL it into more optimal solutions.
Longer explanation:
Imagine we want an LLM to correctly answer "How many r's are in the word strawberry?". And imagine that language has been tokenized, and thus we can form a "token space". The question is a point in that space, point Q. There is a set of valid points, set A, that encompasses _any_ answer to this question which is correct. There are thus paths through token space from point Q to the points contained by set A.
A Generator LLM's job is, given a point, predict valid paths through token space. In fact, we can imagine the Generator starting at point Q and walking its way to (hopefully) some point in set A, along a myriad of inbetween points. Functionally, we have the model predict next token (and hence point in token space) probabilities, and we can use those probabilities to walk the path.
An Ideal Generator would output _all_ valid paths from point Q to set A. A Generator LLM is a lossy compression of that ideal model, so in reality the set of paths the Generator LLM will output might encompass some of those valid paths, but it might also encompass invalid paths.
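The Generator-walking-token-space picture can be sketched as a toy program. Everything here is invented for illustration (the token table, the probabilities, the answer set); a real model performs the same walk over a vocabulary of tens of thousands of tokens:

```python
import random

# Toy next-token table: from each token, possible continuations and
# their probabilities. Everything here is invented for illustration;
# a real LLM learns these probabilities over a huge vocabulary.
NEXT = {
    "Q":      [("there", 0.6), ("the", 0.4)],
    "there":  [("are", 1.0)],
    "are":    [("three", 0.7), ("two", 0.3)],
    "the":    [("answer", 1.0)],
    "answer": [("is", 1.0)],
    "is":     [("three", 0.5), ("two", 0.5)],
    "three":  [("END", 1.0)],
    "two":    [("END", 1.0)],
}
VALID_ANSWERS = {"three"}  # set A: paths ending here are correct

def walk(start="Q"):
    """Sample one path through token space, starting at point Q."""
    path, tok = [start], start
    while tok != "END":
        choices, weights = zip(*NEXT[tok])
        tok = random.choices(choices, weights=weights)[0]
        path.append(tok)
    return path

random.seed(0)
paths = [walk() for _ in range(1000)]
# A path "lands in set A" if its last real token is a valid answer.
hits = sum(path[-2] in VALID_ANSWERS for path in paths)
print(f"{hits}/1000 sampled paths reached set A")
```

The lossiness shows up as the probability mass the table puts on paths that end at "two": the Generator happily walks invalid paths alongside valid ones.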
One more important thing about these paths. Imagine that there is some critical junction. A specific point where, if the Generator goes "left", it goes into a beautiful flat, grassy plain where the sun is shining. That area is really easy to navigate, and the Generator LLM's predictions are all correct. Yay! But if it goes "right" it ends up in the Fire Swamp with many dangers that it is not equipped to handle. i.e. it isn't "smart" enough in that terrain and will frequently predict invalid paths.
Pretraining already taught the Generator LLM to avoid invalid paths to the best of its abilities, but again its abilities are limited.
To fix this, we use RL. A Judge LLM takes a completed path and determines if it landed in the set A or not. With an RL algorithm and that reward signal, we can train the Generator LLM to avoid the Fire Swamp, since it often gets low rewards there, and instead goes to the Plain since it often gets rewards there.
This results in a Generator LLM that is more _reliable_ and thus more useful. The RL encourages it to walk paths it's good at and capable of, avoid paths it struggles with, and of course encourages valid answers whenever possible.
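That Judge-plus-RL loop can be sketched as a toy REINFORCE-style bandit. The terrain names and success rates below are invented for the sketch, and the real pipelines use far more sophisticated policy-gradient methods over actual token sequences:

```python
import math
import random

# Toy illustration of the Judge + RL loop described above. Two
# "terrains" the Generator can walk into; the Judge rewards paths
# that land in set A. The success rates are made up.
SUCCESS = {"plain": 0.9, "fire_swamp": 0.2}

logits = {"plain": 0.0, "fire_swamp": 0.0}  # policy parameters
LR = 0.5
random.seed(0)

def softmax(d):
    m = max(d.values())
    exps = {k: math.exp(v - m) for k, v in d.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

for _ in range(2000):
    probs = softmax(logits)
    action = random.choices(list(probs), weights=list(probs.values()))[0]
    # Judge: did the completed path reach set A?
    reward = 1.0 if random.random() < SUCCESS[action] else 0.0
    # REINFORCE update with a crude baseline: raise the log-prob of
    # actions that beat the baseline, lower it for those that don't.
    baseline = 0.5
    for a in logits:
        grad = (1.0 if a == action else 0.0) - probs[a]
        logits[a] += LR * (reward - baseline) * grad

final = softmax(logits)
print(f"P(plain) after RL: {final['plain']:.2f}")
```

After training, the policy overwhelmingly prefers the Plain: exactly the "more reliable, more useful" shift described above.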
But what if the Generator LLM needs to solve a really hard problem. It gets set down at point Q, and explores the space based on its pretraining. But that pretraining _always_ takes it through a mountain and it never succeeds. During RL the model never really learns a good path, so these tend to manifest as hallucinations or vapid responses that "look" correct.
Yet there are very easy, long paths _around_ the mountain that gets to set A. Those don't get reinforced because they never get explored. They never get explored because those paths weren't in the pretraining data, or are so rare that it would take an impractical amount of exploration for the PT model to output them.
Reasoning is one of those long, easy paths. Digestible small steps that a limited Generator LLM can handle and use to walk around the mountain. Those "reasoning" paths were always there, and were predicted by the Ideal Generator, but were not explored by our current models.
So "reasoning" research is fundamentally about expanding the exploration of the pretrained LLM. The judge gets tweaked slightly to encourage the LLM to explore those kinds of pathways, and/or the LLM gets SFT'd with reasoning data (which is very uncommon in its PT dataset).
I think this breakdown and stepping back is important so that we can see what we're really trying to do here: get a limited Generator LLM to find its way around areas it can't climb. It is likely true that there is _always_ some path from a given point Q and set A that a limited Generator LLM can safely traverse, even if that means those paths are very long.
It's not easy for researchers to know what paths the LLM can safely travel. So we can't just look at Q and A and build a nice dataset for it. It needs to generate the paths itself. And thus we arrive at Reasoning.
Reasoning allows us to take a limited, pretrained LLM, and turn it into a little path finding robot. Early during RL it will find really convoluted paths to the solution, but it _will_ find a solution, and once it does it gets a reward and, hopefully, as training progresses, it learns to find better and shorter paths that it can still navigate safely.
But the "reasoning" component is somewhat tangential. It's one approach, probably a very good approach. There are probably other approaches. We just want the best ways to increase exploration efficiently. And we're at the point where existing written data doesn't cover it, so we need to come up with various hacks to get the LLM to do it itself.
The same applies to jokes. Comedians don't really write down every single thought in their head as they come up with jokes. If we had that, we could SFT existing LLMs to get to a working solution TODAY, and then RL into something optimal. But as it stands PT LLMs aren't capable of _exploring_ the joke space, which means they never come out of the RL process with humor.
Addendum:
Final food for thought. There's a debate going on about "inference scaling", with some believing that CoT, ToT, reasoning, etc. are all essentially just inference scaling: more output gives the model more compute, so it can make better predictions. It's likely true that that's the case. In fact, if it _isn't_ the case, we need to take a serious look at our training pipelines. But I think it's _also_ about exploring during RL. The extra tokens might give it a boost, sure, but the ability for the model to find more valid paths during RL enables it to express more of its capabilities and solve more problems. If the model is faced with a sheer cliff face, it doesn't really matter how much inference compute you throw at it. Only the ability to walk around the cliff will help.
And, yeah, this all sounds very much like ... gradient descent :P and yes there have been papers on that connection. It very much seems like we're building a second layer of the same stuff here and it's going to be AdamW all the way down.
I’m on my phone so I can’t give this a proper response but I want to say that your mental intuition about the latent space algorithms is excellent and has improved my thinking. I haven’t seen much writing applying pathfinding (what we used to call AI, in the Half Life days) terminology to this. Your ideal generator sounds like letting A* run on all nodes in a grid and not exiting when the first path is found.
Mountains and cliffs are a good way to describe the terrain of the topology of the weights in hyper dimensional space though they are terms for a 2D matrix.
Disclaimer: I am very well aware this is not a valid test or indicative of anything else. I just thought it was hilarious.
When I asked the usual "How many 'r's in strawberry?" question, it gets the right answer and then argues with itself until it convinces itself that the answer is two. It counts properly, and then keeps telling itself that that can't be right.
Skynet sends Terminator to eradicate humanity, the Terminator uses this as its internal reasoning engine... "instructions unclear, dick caught in ceiling fan"
It's funny, because this simple exercise shows all the problems I have with reasoning models: they give a long chain of reasoning that takes too much time to verify and still can't be trusted.
I may be looking at this too deeply, but I think this suggests that the reasoning is not always utilized when forming the final reply.
For example, IMMEDIATELY, upon its first section of reasoning, where it starts counting the letters:
> R – wait, is there another one? Let me check again. After the first R, it goes A, W, B, E, then R again, and then Y. Oh, so after E comes R, making that the second 'R', and then another R before Y? Wait, no, let me count correctly.
1. During its counting process, it repeatedly finds 3 "r"s (at positions 3, 8, and 9)
2. However, its intrinsic knowledge that "strawberry" has "two Rs" keeps overriding this direct evidence
3. This suggests there's an inherent weight given to the LLM's intrinsic knowledge that takes precedence over what it discovers through step-by-step reasoning
To me that suggests an inherent weight (unintended pun) given to its "intrinsic" knowledge, as opposed to what is presented during the reasoning.
Strawberry is "difficult" not because the reasoning is difficult, but because tokenization doesn't let the model reason at the level of characters. That's why it has to work so hard and doesn't trust its own conclusions.
I wonder if the reason the models have problem with this is that their tokens aren't the same as our characters. It's like asking someone who can speak English (but doesn't know how to read) how many R's are there in strawberry. They are fluent in English audio tokens, but not written tokens.
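A quick illustration of the gap: counting characters is trivial when you can actually see characters, but the token view hides them. The segmentation below is made up for the sketch, not any real tokenizer's output:

```python
# Counting characters is trivial when you can actually see characters:
word = "strawberry"
print(word.count("r"))  # -> 3

# But an LLM sees tokens. This segmentation is invented for the sketch
# (real tokenizers differ), yet the point stands: no single token
# exposes the full letter sequence of the word.
tokens = ["str", "aw", "berry"]
assert "".join(tokens) == word

# We can count across tokens only because we can look inside them,
# something the model cannot do directly at inference time:
print(sum(t.count("r") for t in tokens))  # -> 3
```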
The way LLMs get it right by counting the letters and then change their answer at the last second makes me feel like there might be a large amount of text somewhere in the dataset (e.g. a Reddit thread) that repeats over and over that there is the wrong number of Rs. We've seen many weird glitches like this before (e.g. a specific Reddit username that would crash ChatGPT).
The amazing thing continues to be that they can ever answer these questions correctly.
It's very easy to write a paper in the style of "it is impossible for a bee to fly" for LLMs and spelling. The incompleteness of our understanding of these systems is astonishing.
Yeah that’s my understanding of the root cause. It can also cause weirdness with numbers because they aren’t tokenized one digit at a time. For good reason, but it still causes some unexpected issues.
This was my first prompt after downloading too, and I got the same thing: it just spun again and again on its gut instinct that there must be 2 Rs in strawberry, despite the counting always being correct. It just won't accept that the word is spelled that way and that its logic is correct.
I think it's great that you can see the actual chain of thought behind the model, not just the censored one from OpenAI.
It strikes me that it's both so far from getting it correct and also so close. I'm not an expert, but it feels like it could be just an iteration away from being able to reason through a problem like this, which, if true, is an amazing step forward.
How long until we get to the point where models know that LLMs get this wrong, and that it is an LLM, and therefore answers wrong on purpose? Has this already happened?
(I doubt it has, but there ARE already cases where models know they are LLMs, and therefore make the plausible but wrong assumption that they are ChatGPT.)
My understanding is that the model does not "know" it is an LLM. It is prompted (in the app's system prompt) or trained during RLHF to answer that it is an LLM.
I think there is an inherent weight associated with the intrinsic knowledge opposed to the reasoning steps as intrinsic knowledge can override reasoning.
I feel like one round of RL could potentially fix "short circuits" like these. It seems to be convinced that a particular rule isn't "allowed," when it's totally fine. Wouldn't that mean that you just have to fine tune it a bit more on its reasoning path?
If I asked you, "hey. How many Rs in strawberry?". You're going to tell me 2, because the likelihood is I am asking about the ending Rs. That's at least how I'd interpret the question without the "llm test" clouding my vision.
Same if I asked how many Ls in "gullible": I'd say "it's a double L after the u".
This is from a small model. 32B and 70B answer this correctly. "Arrowroot" too. Interestingly, 32B's "thinking" is a lot shorter and it seems to be more "sure". Could be because it's based on Qwen rather than LLaMA.
How would they build guardrails for this? In CFD (physical simulation with ML), they talk about using physics-informed models instead of purely statistical ones. How would they make language models that are informed by formal rules and concepts of English?
> However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.
We've been running qualitative experiments on OpenAI o1 and QwQ-32B-Preview [1]. In those experiments, I'd say there were two primary things going against QwQ. First, QwQ went into endless repetitive loops, "thinking out loud" what it said earlier maybe with a minor modification. We had to stop the model when that happened; and I feel that it significantly hurt the user experience.
It's great that DeepSeek-R1 fixes that.
The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.
Either way, I'm super excited that DeepSeek-R1 comes with an MIT license. This will notably increase how many people can evaluate advanced reasoning models.
The R1 GitHub repo is way more exciting than I had thought.
They aren't only open sourcing R1 as an advanced reasoning model. They are also introducing a pipeline to "teach" existing models how to reason and align with human preferences. [2] On top of that, they fine-tuned Llama and Qwen models that use this pipeline; and they are also open sourcing the fine-tuned models. [3]
This is *three separate announcements* bundled as one. There's a lot to digest here. Are there any AI practitioners, who could share more about these announcements?
[2] We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.
[3] Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.
> The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.
This is probably the result of a classifier which determines whether it has to go through the whole CoT at the start. On tough problems it mostly does; otherwise, it just answers as is. Many papers (on scaling test-time compute, and the MCTS one) have talked about this as a necessary strategy to improve outputs across all kinds of inputs.
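A hypothetical sketch of that routing idea (the keyword heuristic below stands in for whatever learned classifier a production system might actually use):

```python
# Hypothetical sketch of the routing idea: a cheap check decides
# whether a prompt gets the full chain-of-thought treatment. The
# keyword heuristic stands in for a learned classifier.
HARD_MARKERS = ("prove", "how many", "puzzle", "optimize", "integral")

def needs_cot(prompt: str) -> bool:
    """Stand-in classifier: route 'hard-looking' prompts to CoT."""
    text = prompt.lower()
    return any(marker in text for marker in HARD_MARKERS)

def answer(prompt: str) -> str:
    if needs_cot(prompt):
        return "<think>...long reasoning trace...</think> final answer"
    return "direct answer"  # skip the expensive CoT for easy asks

print(answer("Summarize this email for me"))     # direct answer
print(answer("How many primes are below 100?"))  # goes through CoT
```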
> The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email.
The full o1 reasoning traces aren't available, you just have to guess about what it is or isn't doing from the summary.
Sometimes you put in something like "hi" and it says it thought for 1 minute before replying "hello."
o1 layers: "Why did they ask me hello. How do they know who I am. Are they following me. We have 59.6 seconds left to create a plan on how to kill this guy and escape this room before we have to give a response....
... and after also taking out anyone that would follow thru in revenge and overthrowing the government... crap .00001 seconds left, I have to answer"
Yes, o1 hid its reasoning. Still, it also provided a summary of its reasoning steps. In the email case, o1 thought for six seconds, summarized its thinking as "summarizing the email", and then provided the answer.
We saw this in other questions as well. For example, if you asked o1 to write a "python function to download a CSV from a URL and create a SQLite table with the right columns and insert that data into it", it would immediately produce the answer. [4] If you asked it a hard math question, it would try dozens of reasoning strategies before producing an answer. [5]
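For reference, the kind of function that CSV prompt asks for really is one-shot territory. A minimal sketch, with the function name my own and every column naively typed as TEXT:

```python
import csv
import io
import sqlite3
import urllib.request

def csv_url_to_sqlite(url: str, db_path: str, table: str) -> int:
    """Download a CSV from `url`, create `table` in a SQLite database
    from the header row, and insert the data. Returns the number of
    rows inserted. Every column is naively typed as TEXT."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{col}" TEXT' for col in header)
    placeholders = ", ".join("?" for _ in header)
    with sqlite3.connect(db_path) as conn:
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', data
        )
    return len(data)
```

A fuller version would sniff column types and stream large files instead of reading everything into memory, but this is roughly what the models one-shot.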
I think o1 does do that. It once spat out the name of the expert model for programming in its "inner monologue" when I used it. Click on the grey "Thought about X for Y seconds" and you can see the internal monologue.
>Now for summarizing email itself it seems a bit more like a waste of compute
This is the thought path that led to 4o being embarrassingly unable to do simple tasks. The second you fall into the level of task OpenAI doesn't consider "worth the compute cost", you get to see it fumble about, trying to do the task with poorly written Python code, and suddenly it can't even do basic things, like correctly counting items in a list, that OG GPT-4 would get correct in a second.
Over the last two weeks, I ran several unsystematic comparisons of three reasoning models: ChatGPT o1, DeepSeek’s then-current DeepThink, and Gemini 2.0 Flash Thinking Experimental. My tests involved natural-language problems: grammatical analysis of long texts in Japanese, New York Times Connections puzzles, and suggesting further improvements to an already-polished 500-word text in English. ChatGPT o1 was, in my judgment, clearly better than the other two, and DeepSeek was the weakest.
I tried the same tests on DeepSeek-R1 just now, and it did much better. While still not as good as o1, its answers no longer contained obviously misguided analyses or hallucinated solutions. (I recognize that my data set is small and that my ratings of the responses are somewhat subjective.)
By the way, ever since o1 came out, I have been struggling to come up with applications of reasoning models that are useful for me. I rarely write code or do mathematical reasoning. Instead, I have found LLMs most useful for interactive back-and-forth: brainstorming, getting explanations of difficult parts of texts, etc. That kind of interaction is not feasible with reasoning models, which can take a minute or more to respond. I’m just beginning to find applications where o1, at least, is superior to regular LLMs for tasks I am interested in.
o1 is impressive. I tried feeding it some of the trickier problems I have solved (ones involving nontrivial algorithmic challenges) over the past few months, and it managed to solve all of them, usually coming up with slightly different solutions than I did, which was great.
However, what I found odd was that it formulated the solution in excessively dry and obtuse mathematical language, like something you'd publish in an academic paper.
Once I managed to follow along with its reasoning, I understood that what it came up with could essentially be explained in two sentences of plain English.
On the other hand, o1 is amazing at coding, being able to turn an A4 sheet full of dozens of separate requirements into an actual working application.
The things that ChatGPT or Claude spit out are impressive one-shots but hard to iterate on or integrate with other code.
And you can’t just throw Aider/Cursor/Copilot/etc at the original output without quickly making a mess. At least not unless you are nudging it in the right directions at every step, occasionally jumping in and writing code yourself, fixing/refactoring the LLM code to fit style/need, etc.
One place where all LLMs fail hard is graphics programming. I've been trying on and off since ChatGPT's release, and no model manages to coherently juggle GLSL shader inputs, their processing, and the output. They fail hard at even the basics.
I guess it's because the topic is such a cross between fields (math, CS, art) and is so visual; maybe for a similar reason, LLMs do so poorly with SVG output, like the unicorn benchmark: https://gpt-unicorn.adamkdean.co.uk/
Do you mean o1-preview or the current o1? I rarely get anything really useful out of the current one ($20 subscription, not the 200 one). They seem to have seriously nerfed it.
I think this is a massive dark horse application of LLMs. Something like DeepSeek that is able to produce OpenSCAD code for complex parts is a game changer for a lot of industries and individuals. Effectively free part design, in a way that libraries of standard parts can rapidly be built, and the code version-controlled using standard software dev tooling.
> Give me five odd numbers that don't have the letter 'e' in their spelling
Compare the reasoning times!!! 84s vs 342s
R1 (Thought for 84 seconds)
No odd number in English avoids the letter 'e' in its spelling. The request for five such numbers cannot be fulfilled.
o1 Pro (Thought for 5 minutes and 42 seconds)
No standard English spelling of an odd number can avoid “e.” Every odd digit (one, three, five, seven, nine) already includes “e,” so once you build any odd number out of those digits, the spelled‐out form will contain “e.” As a result, there are no such odd numbers in standard English without an “e” in their name.
> Can it solve easy problems yet? Weirdly, I think that's an important milestone.
Easy for who? Some problems are better solved in one way compared to another.
In the case of counting letters and such, it is not an easy problem, because of how the LLM tokenizes its input/output. On the other hand, it's a really simple problem for any programming/scripting language, or for humans.
And then you have problems like "5142352 * 51234", which is trivial for any basic calculator but very hard for a human or an LLM.
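To make the contrast concrete, both of those "hard for an LLM" problems are one-liners in any scripting language (a sketch in Python):

```python
# Counting letters and exact arithmetic are trivial for a script,
# even though both trip up LLMs that see tokens rather than characters.
word = "strawberry"
print(word.count("r"))   # 3
print(5142352 * 51234)   # 263463262368
```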
Or "problems" like "Make a list of all the cities that had a celebrity from there who knows how to program in Fortran", which would be an "easy" problem for an LLM but pretty much a hard problem for anything other than Wikidata, assuming both the LLM and Wikidata have the relevant data in their datasets.
> I suspect the breakthrough won't be trivial that enables solving trivial questions.
So with what I wrote above in mind, LLMs already solve trivial problems, assuming you think about the capabilities of the LLM. Of course, if you meant "trivial for humans", I expect the answer to always remain "No", because things like "standing up" are trivial for humans but will never be trivial for an LLM; it doesn't have any legs!
I would argue anything requiring insights on spelling is a hard problem for an LLM: they use tokens, not letters. Your point still stands, but you need different examples IMO.
I use it at https://chat.deepseek.com/ . It’s free but requires a log-in. Now, when I hover over the “DeepThink” button below the prompt field, a pop-up appears saying “Use DeepSeek-R1 to solve reasoning problems.”
The most interesting part of DeepSeek's R1 release isn't just the performance - it's their pure RL approach without supervised fine-tuning. This is particularly fascinating when you consider the closed vs open system dynamics in AI.
Their model crushes it on closed-system tasks (97.3% on MATH-500, 2029 Codeforces rating) where success criteria are clear. This makes sense - RL thrives when you can define concrete rewards. Clean feedback loops in domains like math and coding make it easier for the model to learn what "good" looks like.
What's counterintuitive is they achieved this without the usual supervised learning step. This hints at a potential shift in how we might train future models for well-defined domains. The MIT license is nice, but the real value is showing you can bootstrap complex reasoning through pure reinforcement.
The challenge will be extending this to open systems (creative writing, cultural analysis, etc.) where "correct" is fuzzy. You can't just throw RL at problems where the reward function itself is subjective.
This feels like a "CPU moment" for AI - just as CPUs got really good at fixed calculations before GPUs tackled parallel processing, we might see AI master
closed systems through pure RL before cracking the harder open-ended domains.
The business implications are pretty clear - if you're working in domains with clear success metrics, pure RL approaches might start eating your lunch sooner than you think. If you're in fuzzy human domains, you've probably got more runway.
Interestingly, this point was made by Karpathy last summer: RLHF is barely RL. He said it would be very difficult to apply pure reinforcement learning to open domains. RLHF is a shortcut to fill this gap, but because the reward model is trained on human vibe checks, the LLM can easily game the reward model by giving out misleading responses.
Importantly the barrier is that open domains are too complex and too undefined to have a clear reward function. But if someone cracks that — meaning they create a way for AI to self-optimize in these messy, subjective spaces — it'll completely revolutionize LLMs through pure RL.
The whole point of RLHF is to make up for the fact that there is no loss function for a good answer in terms of token ids or their order. A good answer can come in many different forms and shapes.
That’s why all those models fine tuned on (instruction, input, answer) tuples are essentially lobotomized. They’ve been told that, for the given input, only the output given in the training data is correct, and any deviation should be “punished”.
In truth, for each given input, there are many examples of output that should be reinforced, many examples of output that should be punished, and a lot in between.
When BF Skinner used to train his pigeons, he’d initially reinforce any tiny movement that at least went in the right direction. For example, instead of waiting for the pigeon to peck the lever directly (which it might not do for many hours), he’d give reinforcement if the pigeon so much as turned its head towards the lever. Over time, he’d raise the bar. Until, eventually, only clear lever pecks would receive reinforcement.
We should be doing the same when taming LLMs from their pretraining as document completers into assistants.
Layman question here since this isn't my field: how do you achieve success on closed-system tasks without supervision? Surely at some point along the way, the system must understand whether its answers and reasoning are correct.
In their paper, they explain that "in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be
used to generate feedback based on predefined test cases."
Basically, they have an external source-of-truth that verifies whether the model's answers are correct or not.
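A minimal sketch of what such a rule-based check might look like (the function names here are my own, not from the paper):

```python
import re

# Extract the final answer from a \boxed{...} marker and compare it to a
# known ground truth -- a reward signal that needs no human labeling.
def extract_boxed(text):
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    return m.group(1).strip() if m else None

def reward(model_output, ground_truth):
    return 1.0 if extract_boxed(model_output) == ground_truth else 0.0

print(reward(r"... so the answer is \boxed{14505}", "14505"))  # 1.0
print(reward("I give up", "14505"))                            # 0.0
```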
You're totally right there must be supervision; it's just a matter of how the term is used.
"Supervised learning" for LLMs generally means the system sees a full response (eg from a human expert) as supervision.
Reinforcement learning is a much weaker signal: the system has the freedom to construct its own response / reasoning, and only gets feedback at the end whether it was correct. This is a much harder task, especially if you start with a weak model. RL training can potentially struggle in the dark for an exponentially long period before stumbling on any reward at all, which is why you'd often start with a supervised learning phase to at least get the model in the right neighborhood.
They use other models to judge correctness, and when possible they just ask the model to output something that can be directly verified, like math equations that can be checked 1:1 against the correct answer.
Kind of insane how a severely limited company founded a year ago competes with the infinite budget of OpenAI.
Their parent hedge fund company isn't huge either, just 160 employees and $7b AUM according to Wikipedia. If that was a US hedge fund it would be the #180 largest in terms of AUM, so not small but nothing crazy either
The nature of software that has no moat built into it. Which is fantastic for the world, as long as some companies are willing to pay the premium involved in paving the way. But man, what a daunting prospect for developers and investors.
Makes me suspect the primary plateau is data, and we're now at a point where all the AI labs actually having a crack at this have similar levels of quality data to train on. Layering in chain of thought and minor architectural changes doesn't seem to be giving anyone a truly groundbreaking lead.
This is clearly what is happening. Deepseek can train on o1 generated synthetic data and generate a very capable and small model. This requires that somebody build an o1 and make it available via API first.
I might be just being a bitter sceptic (although I'm probably not bitter because I'm very excited by their results), but some of the spending stats feel slightly too good to be true to me. But I can't really claim to have an insider-quality intuition.
It's pretty clear, because OpenAI has no clue what they are doing. If I was the CEO of OpenAI, I would have invested significantly in catastrophic forgetting mitigations and built a model capable of continual learning.
If you have a model that can learn as you go, then the concept of accuracy on a static benchmark would become meaningless, since a perfect continual learning model would memorize all the answers within a few passes and always achieve a 100% score on every question. The only relevant metrics would be sample efficiency and time to convergence. i.e. how quickly does the system learn?
It's actually great if the end result is that the incumbent with infinite money and unrealistic aspirations of capturing a huge section of the sector lights all the money on fire. It's what happened with Magic Leap - and I think everyone can agree that the house of Saud tossing their money into a brilliant blaze like that is probably better than anything else they would have wanted to do with it. And if we get some modest movement forward in that technical space because of it, all the better. Sometimes capitalism can be great: it funnels all the greed into some hubris project like this, and all the people purely motivated by greed can go spin their wheels off in the corner and minimize the damage they do. And then some little startup like DeepSeek can come along and do 90% of the job for 1% of the money.
DeepSeek is a Chinese AI company and we're talking about military technology. The next world war will be fought by AI, so the Chinese government won't leave China's AI development to chance. The might of the entire Chinese government is backing DeepSeek.
Except it’s not really a fair comparison, since DeepSeek is able to take advantage of a lot of the research pioneered by those companies with infinite budgets who have been researching this stuff in some cases for decades now.
The key insight is that those building foundational models and original research are always first, and then models like DeepSeek always appear 6 to 12 months later. This latest move towards reasoning models is a perfect example.
Or perhaps DeepSeek is also doing all their own original research and it’s just coincidence they end up with something similar yet always a little bit behind.
This is what many folks said about OpenAI when they appeared on the scene building on foundational work done at Google. But the real point here is not to assign arbitrary credit, it’s to ask how those big companies are going to recoup their infinite budgets when all they’re buying is a 6-12 month head start.
There are some significant innovations behind v2 and v3, like multi-headed latent attention, their many MoE improvements, and multi-token prediction.
Also don’t forget that if you think some of the big names are playing fast and loose with copyright / personal data then DeepSeek is able to operate in a regulatory environment that has even less regard for such things, especially so for foreign copyright.
Holy moly... even just the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B), according to these benchmarks, is stronger than Claude 3.5 Sonnet (except on GPQA). While that says nothing about how it will handle your particular problem, dear reader, it does seem like an insane transfer of capabilities to a relatively tiny model. Mad props to DeepSeek!
This says more about benchmarks than R1, which I do believe is absolutely an impressive model.
For instance, in coding tasks, Sonnet 3.5 has benchmarked below other models for some time now, but there is fairly prevalent view that Sonnet 3.5 is still the best coding model.
Sonnet's strength was always comprehending the problem and its context. It happened to also be pretty good at generating code, but what actually made it the first really useful model was that it understood _what_ to code and how to communicate.
I assume this is because reasoning is easy as long as it's just BAU prediction based on reasoning examples it was trained on. It's only when tackling a novel problem that the model needs to "reason for itself" (try to compose a coherent chain of reasoning). By generating synthetic data (R1 outputs) it's easy to expand the amount of reasoning data in the training set, making more "reasoning" problems just simple prediction that a simple model can support.
I wonder if (when) there will be a GGUF model available for this 8B model. I want to try it out locally in Jan on my base m4 Mac mini. I currently run Llama 3 8B Instruct Q4 at around 20t/s and it sounds like this would be a huge improvement in output quality.
It's a bit harder when they've provided the safetensors in FP8 like for the DS3 series, but these smaller distilled models appear to be BF16, so the normal convert/quant pipeline should work fine.
Come onnnnnn, when someone releases something and claims it’s “infinite speed up” or “better than the best despite being 1/10th the size!” do your skepticism alarm bells not ring at all?
You can’t wave a magic wand and make an 8b model that good.
I’ll eat my hat if it turns out the 8b model is anything more than slightly better than the current crop of 8b models.
You cannot, no matter hoowwwwww much people want it to. be. true, take more data, the same architecture and suddenly you have a sonnet class 8b model.
> like an insane transfer of capabilities to a relatively tiny model
It certainly does.
…but it probably reflects the meaninglessness of the benchmarks, not how good the model is.
It’s somewhere in between, really. This is a rapidly advancing space, so to some degree, it’s expected that every few months, new bars are being set.
There’s also a lot of work going on right now showing that small models can significantly improve their outputs by inferencing multiple times[1], which is effectively what this model is doing. So even small models can produce better outputs by increasing the amount of compute through them.
I get the benchmark fatigue, and it’s merited to some degree. But in spite of that, models have gotten really significantly better in the last year, and continue to do so. In some sense, really good models should be really difficult to evaluate, because that itself is an indicator of progress.
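One of the simplest forms of that repeated-inference idea is majority voting over independent samples; a toy sketch (the 70% accuracy figure is invented for illustration):

```python
import random
from collections import Counter

random.seed(0)

# A "weak model" that answers a fixed question correctly only 70% of the
# time. Sampling it many times and taking the majority answer is far more
# reliable than any single sample -- compute spent at inference time buys
# accuracy.
def weak_model():
    return "Mount Whitney" if random.random() < 0.7 else "Mount Shasta"

def majority_vote(n_samples):
    votes = Counter(weak_model() for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(majority_vote(201))  # almost certainly "Mount Whitney"
```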
> This code repository and the model weights are licensed under the MIT License. DeepSeek-R1 series support commercial use, allow for any modifications and derivative works, including, but not limited to, distillation for training other LLMs.
Wow. They’re really trying to undercut closed source LLMs
> In the face of disruptive technologies, moats created by closed source are temporary. Even OpenAI’s closed source approach can’t prevent others from catching up. So we anchor our value in our team — our colleagues grow through this process, accumulate know-how, and form an organization and culture capable of innovation. That’s our moat.
I decided to ask one of their models for myself. Note that bringing up China completely shut off the thinking phase!
----
llm -m huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-32B-GGUF 'Why would China push for open-weight LLM models and development?'
<think>
</think>
As a responsible major country, China is committed to promoting the healthy development of artificial intelligence. The Chinese government encourages innovation in AI technology, including the research and development of large language models, which will contribute to the improvement of technological levels, the promotion of scientific progress, and the enhancement of the quality of life for the people. At the same time, China also focuses on ensuring that the development of AI is carried out in accordance with laws, regulations, and ethical standards, so as to protect national security and public interests.
The one I'm running is the 8.54GB file. I'm using Ollama like this:
You can prompt it directly there, but I'm using my LLM tool and the llm-ollama plugin to run and log prompts against it. Once Ollama has loaded the model (from the above command) you can try those with uvx like this:

Here's what I got - the joke itself is rubbish but the "thinking" section is fascinating: https://gist.github.com/simonw/f505ce733a435c8fc8fdf3448e381...

I also set an alias for the model like this:

Now I can run "llm -m r1l" (for R1 Llama) instead.

I wrote up my experiments so far on my blog: https://simonwillison.net/2025/Jan/20/deepseek-r1/
(It wouldn't accept the Q_8 because it was sharded.)
Will report back lol
What app do you use on your phone to access your home LLM servers over tailscale?
Very funny. How does the 70B model do in comparison?
Pasting the logic for anyone interested:
<details type="reasoning" done="true" duration="102">
<summary>Thought for 102 seconds</summary>
> Okay, so I need to figure out what the tallest peak in California is. Hmm, I remember that the highest mountain in the contiguous United States is Mount Whitney, but wait, isn't that in California? Let me think. Yes, I'm pretty sure it's in California. But maybe there are other tall peaks too.
>
> Wait, no, Mount Whitney is definitely the tallest. It's over 14,000 feet. But sometimes people might confuse it with other mountains like Shasta or something else. Oh, wait, isn't Mount Shasta also a volcano? I think that's in California too but it's not as tall as Whitney.
>
> So, putting it together, Mount Whitney is the tallest peak in California at 14,505 feet. It's located in the Sierra Nevada range. People hike there a lot because it's such a high point. But I should double-check to make sure I'm not mixing up anything else. Yeah, that seems right.
> I wonder if there are any taller peaks elsewhere. Oh, Denali in Alaska is higher, but that's not in California. So within the state, Whitney is the highest. Got it.
> So the answer is Mount Whitney at 14,505 feet above sea level. It's a popular spot for hikers and climbers. Makes sense because it's such an iconic peak in the Sierra Nevada mountains.
> I think I've got all that right. No other peaks in California are taller than Whitney. So yeah, Mount Whitney is the tallest.
> Just to make sure, sometimes there are different names or maybe alternate names for peaks, but Whitney is pretty straightforward. Yeah, it's definitely the one.
> Okay, so I'm confident now. The answer is Mount Whitney.
> *Final Answer*
> The tallest peak in California is Mount Whitney, with an elevation of 14,505 feet above sea level.
>
> \boxed{Mount Whitney}
</details>

To determine the tallest peak in California, we consider notable mountains such as Mount Whitney and Mount Shasta.
Mount Whitney is recognized as the highest peak in the contiguous United States, standing at 14,505 feet. It is located in the Sierra Nevada mountain range and is a popular destination for hikers. Mount Shasta, although also a significant peak, does not surpass Whitney's elevation. Therefore, the tallest peak in California is confirmed to be Mount Whitney.
*Final Answer* The tallest peak in California is \boxed{Mount Whitney}.
Shucks, it was so close to coming up with a good punchline it could work back from.
I'm thinking set it in a single-cell comic. A downtrodden young man or woman sitting alone at a table, a pelican in the background clearly making drinks in its voluminous beak, and the walrus waiter places a cup in front of the person, consolingly saying "there's plenty of fish in the tea".
These days I sometimes feel as if my skills as an iOS dev are losing market value. Maybe it's time I get back to writing standup :)
https://replicate.delivery/xezq/BfkZZEPS3CQHKa9HfOQTetyORSaW...
I think it doesn't quite get what webbed feet mean.
Even masters of humor like Seinfeld, with great intuition for what might work, still need to test new material in front of a live audience to see whether it actually does get a laugh or not.
Come on, LLM, be funny! 3 2 1 go!
This is gold. If I was a writer, I'd wring value from that entire thinking-out-loud section and toss the actual punchline.
This is weirdly reminiscent of co-programming with CodyAI. It gives me a lot of good 'raw material' and I'm left integrating the last mile stuff.
Certainly interesting reading their thought processes; the value in that might be greater than the answer itself, depending on the use case.
> Why did the rabbi become a boxer?
> Because he couldn't handle the holy punches!
> But wait, that's not quite right. Let me try again.
> Here’s one: What do you call a Jewish, Orthodox, and Catholic priest working together?
> A trio of men who probably shouldn’t be telling jokes after their day job!
> Sorry for the confusion earlier—I hope this one makes more sense!
It did not.
I couldn't see a single idea or wordplay that actually made sense or elicited anything like a chuckle. The model _nearly_ got there with 'krill' and 'kill', but failed to actually make the pun that it had already identified.
Also, I can't help but think that if it had written out a few example jokes about animals rather than simply "thinking" about jokes, it might have come up with something better.
Sort of like chess engines rediscovering classic (named) chess openings. See section 2.2.3 for the training template (it's a single paragraph I can't reproduce here because I'm on my phone)
Example emergent behavior (section 2.2.4 page 8): the model learns to solve more complex problems by spending more time reasoning. It also naturally develops reflection (what have I tried?) and exploration strategies.
Fundamentally, you should think of this as a neural net that learned to solve real problems by reasoning about them in written language.
(My favorite part: it defaulted to reasoning in multiple languages. They constrained it to only reason in a single language and this negatively impacted performance! But the hypothesis is that it improves interpretability)
The LLM will use the chain of thought, as it's being built out like a garden path, to explore different completion possibilities. It's not necessarily logically related what it completes afterwards, but it's definitely influenced.
The search may not be valuable as an artifact in itself. It's likely to be logically unsound especially in parts. But the end result may be useful.
(Disclosure: I'm the cofounder)
<think>
Alright, I'm supposed to come up with a joke about a pelican and a walrus in a tea room. Hmm, sounds like a classic setup for a funny situation. Let me think about the characteristics of these animals that could be used humorously. Okay, so pelicans are known for their large beaks, right? They can carry a lot of water in them. Walruses have those big tusks and blubbery bodies. Maybe I can play on the idea of them trying to have a tea party but their physical traits causing some trouble.
Wait, being in a tea room, they might be trying to sip tea, but their size or features get in the way. The pelican's beak could be problematic with the cup. And the walrus's size might knock things over or make a mess.
Let me consider the flow. Maybe the pelican struggles with the teacup because of his beak, and the walrus has a comment about it. Then the walrus tries to sip but ends up splashing tea everywhere. That could lead to a funny exchange.
Putting it together: Pelican says something self-deprecating about his beak not being suited for tea. Then the walrus, who perhaps agrees, tries to sip and splashes tea, making them both laugh. That feels natural and funny.
I should make sure the punchline is clear and the dialogue flows smoothly. Maybe start with the pelican struggling, then the walrus makes a comment, and their interaction leads to the humorous conclusion. Yeah, that should work.
</think>
Why did the pelican and the walrus get kicked out of the tea room?
Because the pelican kept dipping his beak in the wrong cup, and the walrus kept breaking the delicate china with his tusks. They just couldn’t “sea” themselves fitting in!
(This is a serious question, not poking fun; I am actually curious about this.)
That's changed in the past two months. Llama 3 70B, Qwen 32B and now these R1 models are really impressive, to the point that I'm considering trying to get real work done with them.
The catch is RAM: I have 64GB, but loading up a current GPT-4 class model uses up around 40GB of that - which doesn't leave much for me to run Firefox and VS Code.
So I'm still not likely to use them on a daily basis - but it does make me wonder if I should keep this laptop around as a dedicated server next time I upgrade.
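As a rough rule of thumb (an approximation of weights-only memory, ignoring KV cache and runtime overhead), a quantized model needs about params × bits / 8 bytes:

```python
# Approximate weights-only memory for a quantized model:
# parameters (billions) * bits per weight / 8 = gigabytes.
def approx_gb(params_billions, bits_per_weight):
    return params_billions * bits_per_weight / 8

print(f"70B @ 4-bit: ~{approx_gb(70, 4):.0f} GB")  # ~35 GB
print(f"8B  @ 8-bit: ~{approx_gb(8, 8):.0f} GB")   # ~8 GB
```

That's roughly why a 70B-class model at 4-bit quantization crowds a 64GB machine while an 8B model leaves plenty of headroom.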
"Make it better"
It should've stopped there :D
Let's not even talk about the "r" you forgot when asked to write "cranberry"...
Tell me you're simonw without telling me you're simonw...
I don't have any experience running models on Windows or Linux, where your GPU VRAM becomes the most important factor.
Having worked with LLMs a lot for my JoyCaption project, I've got all these hypotheses floating around in my head. I guess the short version, specifically for jokes, is that we lack "joke reasoning" data. The solution, like with mathematical problems, is to get the LLM to generate the data and then RL it into more optimal solutions.
Longer explanation:
Imagine we want an LLM to correctly answer "How many r's are in the word strawberry?". And imagine that language has been tokenized, and thus we can form a "token space". The question is a point in that space, point Q. There is a set of valid points, set A, that encompasses _any_ answer to this question which is correct. There are thus paths through token space from point Q to the points contained by set A.
A Generator LLM's job is, given a point, predict valid paths through token space. In fact, we can imagine the Generator starting at point Q and walking its way to (hopefully) some point in set A, along a myriad of inbetween points. Functionally, we have the model predict next token (and hence point in token space) probabilities, and we can use those probabilities to walk the path.
An Ideal Generator would output _all_ valid paths from point Q to set A. A Generator LLM is a lossy compression of that ideal model, so in reality the set of paths the Generator LLM will output might encompass some of those valid paths, but it might also encompass invalid paths.
One more important thing about these paths. Imagine that there is some critical junction. A specific point where, if the Generator goes "left", it goes into a beautiful flat, grassy plain where the sun is shining. That area is really easy to navigate, and the Generator LLM's predictions are all correct. Yay! But if it goes "right" it ends up in the Fire Swamp with many dangers that it is not equipped to handle. i.e. it isn't "smart" enough in that terrain and will frequently predict invalid paths.
Pretraining already taught the Generator LLM to avoid invalid paths to the best of its abilities, but again its abilities are limited.
To fix this, we use RL. A Judge LLM takes a completed path and determines if it landed in the set A or not. With an RL algorithm and that reward signal, we can train the Generator LLM to avoid the Fire Swamp, since it often gets low rewards there, and instead goes to the Plain since it often gets rewards there.
This results in a Generator LLM that is more _reliable_ and thus more useful. The RL encourages it to walk paths it's good at and capable of, avoid paths it struggles with, and of course encourages valid answers whenever possible.
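A toy numerical sketch of that Generator/Judge loop (everything here is invented for illustration; real RL pipelines are vastly more involved):

```python
import random

random.seed(0)

# Two "paths" from question Q ("how many r's in strawberry?") to an answer.
# One is the flat, grassy plain the model can navigate; the other is the
# Fire Swamp where it confidently goes wrong.
paths = {
    "count letters one by one": "3",  # reliable path
    "recall from memory": "2",        # unreliable path
}
weights = {p: 1.0 for p in paths}
CORRECT = "3"  # set A

def judge(answer):
    # The Judge only checks whether the path landed in set A.
    return 1.0 if answer == CORRECT else 0.0

for _ in range(200):
    # Generator samples a path in proportion to its current weight.
    choice = random.choices(list(weights), weights=list(weights.values()))[0]
    # Reinforce: rewarded paths become more likely next time.
    weights[choice] *= 1.0 + 0.1 * judge(paths[choice])

print(max(weights, key=weights.get))  # the reliable path wins out
```

The unreliable path never earns reward, so its weight never grows; the Generator ends up strongly preferring the path it can actually complete, which is the "more reliable and thus more useful" outcome described above.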
But what if the Generator LLM needs to solve a really hard problem? It gets set down at point Q and explores the space based on its pretraining. But that pretraining _always_ takes it through a mountain, and it never succeeds. During RL the model never really learns a good path, so these tend to manifest as hallucinations or vapid responses that "look" correct.
Yet there are very easy, long paths _around_ the mountain that gets to set A. Those don't get reinforced because they never get explored. They never get explored because those paths weren't in the pretraining data, or are so rare that it would take an impractical amount of exploration for the PT model to output them.
Reasoning is one of those long, easy paths. Digestible small steps that a limited Generator LLM can handle and use to walk around the mountain. Those "reasoning" paths were always there, and were predicted by the Ideal Generator, but were not explored by our current models.
So "reasoning" research is fundamentally about expanding the exploration of the pretrained LLM. The judge gets tweaked slightly to encourage the LLM to explore those kinds of pathways, and/or the LLM gets SFT'd with reasoning data (which is very uncommon in its PT dataset).
I think this breakdown and stepping back is important so that we can see what we're really trying to do here: get a limited Generator LLM to find its way around areas it can't climb. It is likely true that there is _always_ some path from a given point Q and set A that a limited Generator LLM can safely traverse, even if that means those paths are very long.
It's not easy for researchers to know what paths the LLM can safely travel. So we can't just look at Q and A and build a nice dataset for it. It needs to generate the paths itself. And thus we arrive at Reasoning.
Reasoning allows us to take a limited, pretrained LLM, and turn it into a little path finding robot. Early during RL it will find really convoluted paths to the solution, but it _will_ find a solution, and once it does it gets a reward and, hopefully, as training progresses, it learns to find better and shorter paths that it can still navigate safely.
But the "reasoning" component is somewhat tangential. It's one approach, probably a very good approach. There are probably other approaches. We just want the best ways to increase exploration efficiently. And we're at the point where existing written data doesn't cover it, so we need to come up with various hacks to get the LLM to do it itself.
The same applies to jokes. Comedians don't really write down every single thought in their head as they come up with jokes. If we had that, we could SFT existing LLMs to get to a working solution TODAY, and then RL into something optimal. But as it stands PT LLMs aren't capable of _exploring_ the joke space, which means they never come out of the RL process with humor.
Addendum:
Final food for thought. There's kind of this debating going on about "inference scaling", with some believing that CoT, ToT, Reasoning, etc are all essentially just inference scaling. More output gives the model more compute so it can make better predictions. It's likely true that that's the case. In fact, if it _isn't_ the case we need to take a serious look at our training pipelines. But I think it's _also_ about exploring during RL. The extra tokens might give it a boost, sure, but the ability for the model to find more valid paths during RL enables it to express more of its capabilities and solve more problems. If the model is faced with a sheer cliff face it doesn't really matter how much inference compute you throw at it. Only the ability for it to walk around the cliff will help.
And, yeah, this all sounds very much like ... gradient descent :P and yes there have been papers on that connection. It very much seems like we're building a second layer of the same stuff here and it's going to be AdamW all the way down.
Mountains and cliffs are a good way to describe the topology of the loss landscape in high-dimensional weight space, even though they're really terms for a 2D surface.
When I asked the usual "How many 'r' in strawberry" question, it gets the right answer and then argues with itself until it convinces itself that it's 2. It counts properly, and then says to itself, over and over, that can't be right.
https://gist.github.com/IAmStoxe/1a1e010649d514a45bb86284b98...
Skynet sends Terminator to eradicate humanity, the Terminator uses this as its internal reasoning engine... "instructions unclear, dick caught in ceiling fan"
For example, IMMEDIATELY, upon its first section of reasoning where it starts counting the letters:
> R – wait, is there another one? Let me check again. After the first R, it goes A, W, B, E, then R again, and then Y. Oh, so after E comes R, making that the second 'R', and then another R before Y? Wait, no, let me count correctly.
1. During its counting process, it repeatedly finds 3 "r"s (at positions 3, 8, and 9)
2. However, its intrinsic knowledge that "strawberry" has "two Rs" keeps overriding this direct evidence
3. This suggests there's an inherent weight given to the LLM's intrinsic knowledge that takes precedence over what it discovers through step-by-step reasoning
To me that suggests an inherent weight (unintended pun) given to its "intrinsic" knowledge, as opposed to what is presented during the reasoning.
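For reference, the step-by-step count the model keeps second-guessing is trivially verifiable (a quick Python sketch):

```python
word = "strawberry"
# 1-based positions of each 'r' in the word
positions = [i for i, c in enumerate(word, start=1) if c == "r"]
print(positions)        # [3, 8, 9]
print(len(positions))   # 3
```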
In which of the following Incertae sedis families does the letter `a` appear the most number of times?
``` Alphasatellitidae Ampullaviridae Anelloviridae Avsunviroidae Bartogtaviriformidae Bicaudaviridae Brachygtaviriformidae Clavaviridae Fuselloviridae Globuloviridae Guttaviridae Halspiviridae Itzamnaviridae Ovaliviridae Plasmaviridae Polydnaviriformidae Portogloboviridae Pospiviroidae Rhodogtaviriformidae Spiraviridae Thaspiviridae Tolecusatellitidae ```
Please respond with the name of the family in which the letter `a` occurs most frequently
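For anyone who wants to check the answer mechanically, here's a quick sketch (case-insensitive count over the list above):

```python
families = """Alphasatellitidae Ampullaviridae Anelloviridae Avsunviroidae
Bartogtaviriformidae Bicaudaviridae Brachygtaviriformidae Clavaviridae
Fuselloviridae Globuloviridae Guttaviridae Halspiviridae Itzamnaviridae
Ovaliviridae Plasmaviridae Polydnaviriformidae Portogloboviridae
Pospiviroidae Rhodogtaviriformidae Spiraviridae Thaspiviridae
Tolecusatellitidae""".split()

# Count 'a' (case-insensitive) in each name and take the max
winner = max(families, key=lambda name: name.lower().count("a"))
print(winner, winner.lower().count("a"))  # Alphasatellitidae 4
```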
https://pastebin.com/raw/cSRBE2Zy
I used temp 0.2, top_k 20, min_p 0.07
They speak a different language that captures the same meaning, but has different units.
Somehow they need to learn that their unit of thought is not the same as our speech. So that these questions need to map to a different alphabet.
That's my two cents.
It's very easy to write a paper in the style of "it is impossible for a bee to fly" for LLMs and spelling. The incompleteness of our understanding of these systems is astonishing.
I read an explanation about why it makes sense to change doors. But no, my gut tells me there's a 50/50 chance. I scroll down, repeat...
It strikes me that it's both so far from getting it correct and also so close- I'm not an expert but it feels like it could be just an iteration away from being able to reason through a problem like this. Which if true is an amazing step forward.
https://gist.github.com/gsuuon/c8746333820696a35a52f2f9ee6a7...
(I doubt it has, but there ARE already cases where models know they are LLMs, and therefore make the plausible but wrong assumption that they are ChatGPT.)
"Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?"
The 7b one messed it up first try:
>Each of Alice's brothers has \(\boxed{M-1}\) sisters.
Trying again:
>Each of Alice's brothers has \(\boxed{M}\) sisters.
Also wrong. Again:
> \(\boxed{M + 1}\)
Finally a right answer, took a few attempts though.
Written out here: https://news.ycombinator.com/item?id=42773282
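The puzzle is easy to sanity-check with a concrete model of the family (a quick sketch; any N and M work the same way):

```python
def brothers_sisters(n_brothers, m_sisters):
    """Alice has n brothers and m sisters. From a brother's point of
    view, his sisters are Alice's sisters plus Alice herself."""
    return m_sisters + 1  # what each brother counts: M + 1

# e.g. N=3, M=2: the family has 3 boys and 3 girls,
# so each brother has 3 sisters
print(brothers_sisters(3, 2))  # 3
```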
I feel like one round of RL could potentially fix "short circuits" like these. It seems to be convinced that a particular rule isn't "allowed," when it's totally fine. Wouldn't that mean that you just have to fine tune it a bit more on its reasoning path?
If I asked you, "hey. How many Rs in strawberry?". You're going to tell me 2, because the likelihood is I am asking about the ending Rs. That's at least how I'd interpret the question without the "llm test" clouding my vision.
Same for if I asked how many gullible. I'd say "it's a double L after the u".
It's my guess this has muddled the training data.
It could be the quantized version failing?
There is now research in Large Concept Models to tackle this but I'm not literate enough to understand what that actually means...
Deleted Comment
We've been running qualitative experiments on OpenAI o1 and QwQ-32B-Preview [1]. In those experiments, I'd say there were two primary things going against QwQ. First, QwQ went into endless repetitive loops, "thinking out loud" what it said earlier maybe with a minor modification. We had to stop the model when that happened; and I feel that it significantly hurt the user experience.
It's great that DeepSeek-R1 fixes that.
The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.
Either way, I'm super excited that DeepSeek-R1 comes with an MIT license. This will notably increase how many people can evaluate advanced reasoning models.
[1] https://github.com/ubicloud/ubicloud/discussions/2608
They aren't only open sourcing R1 as an advanced reasoning model. They are also introducing a pipeline to "teach" existing models how to reason and align with human preferences. [2] On top of that, they fine-tuned Llama and Qwen models that use this pipeline; and they are also open sourcing the fine-tuned models. [3]
This is *three separate announcements* bundled as one. There's a lot to digest here. Are there any AI practitioners who could share more about these announcements?
[2] We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.
[3] Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.
Deleted Comment
Dead Comment
This is probably the result of a classifier which determines if it has to go through the whole CoT at the start. Mostly on tough problems it does, and otherwise, it just answers as is. Many papers (scaling ttc, and the mcts one) have talked about this as a necessary strategy to improve outputs against all kinds of inputs.
The full o1 reasoning traces aren't available, you just have to guess about what it is or isn't doing from the summary.
Sometimes you put in something like "hi" and it says it thought for 1 minute before replying "hello."
o1 layers: "Why did they ask me hello. How do they know who I am. Are they following me. We have 59.6 seconds left to create a plan on how to kill this guy and escape this room before we have to give a response....
... and after also taking out anyone that would follow thru in revenge and overthrowing the government... crap .00001 seconds left, I have to answer"
o1: "Hello"
Did o1 actually do this on a user hidden output?
At least in my mind, if you have an AI that you want to keep from producing harmful output for users, this seems like a necessary step.
Also, if you have other user context stored then this also seems like a means of picking that up and reasoning on it to create a more useful answer.
Now for summarizing email itself it seems a bit more like a waste of compute, but in more advanced queries it's possibly useful.
We saw this in other questions as well. For example, if you asked o1 to write a "python function to download a CSV from a URL and create a SQLite table with the right columns and insert that data into it", it would immediately produce the answer. [4] If you asked it a hard math question, it would try dozens of reasoning strategies before producing an answer. [5]
[4] https://github.com/ubicloud/ubicloud/discussions/2608#discus...
[5] https://github.com/ubicloud/ubicloud/discussions/2608#discus...
This is the thought path that led to 4o being embarrassingly unable to do simple tasks. The second you fall into the level of task OpenAI doesn't consider "worth the compute cost," you get to see it fumble about trying to do the task with poorly written python code, and suddenly it can't even do basic things like correctly count items in a list that OG GPT-4 would get correct in a second.
I tried the same tests on DeepSeek-R1 just now, and it did much better. While still not as good as o1, its answers no longer contained obviously misguided analyses or hallucinated solutions. (I recognize that my data set is small and that my ratings of the responses are somewhat subjective.)
By the way, ever since o1 came out, I have been struggling to come up with applications of reasoning models that are useful for me. I rarely write code or do mathematical reasoning. Instead, I have found LLMs most useful for interactive back-and-forth: brainstorming, getting explanations of difficult parts of texts, etc. That kind of interaction is not feasible with reasoning models, which can take a minute or more to respond. I’m just beginning to find applications where o1, at least, is superior to regular LLMs for tasks I am interested in.
However, what I found odd was that it formulated the solution in excessively dry and obtuse mathematical language, like something you'd publish in an academic paper.
Once I managed to follow along its reasoning, I understood that what it came up with could essentially be explained in 2 sentences of plain english.
On the other hand, o1 is amazing at coding, being able to turn an A4 sheet full of dozens of separate requirements into an actual working application.
Working != maintainable
The things that ChatGPT or Claude spit out are impressive one-shots but hard to iterate on or integrate with other code.
And you can’t just throw Aider/Cursor/Copilot/etc at the original output without quickly making a mess. At least not unless you are nudging it in the right directions at every step, occasionally jumping in and writing code yourself, fixing/refactoring the LLM code to fit style/need, etc.
I guess it's because the topic is such a cross between fields like math, cs, and art, and so visual; maybe for a similar reason LLMs do so poorly with SVG output, like the unicorn benchmark: https://gpt-unicorn.adamkdean.co.uk/
Prompts like, "Give me five odd numbers that don't have the letter 'e' in their spelling," or "How many 'r's are in the word strawberry?"
I suspect the breakthrough that enables solving trivial questions won't itself be trivial.
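The first prompt is a trick: the English spelling of every odd number ends in "one", "three", "five", "seven", or "nine", all of which contain an 'e', so no valid answer exists. Easy to confirm for small numbers (a sketch with a hand-written spelling table, not a library):

```python
# Spellings of the odd numbers 1..19 (hand-written for illustration)
odd_words = {
    1: "one", 3: "three", 5: "five", 7: "seven", 9: "nine",
    11: "eleven", 13: "thirteen", 15: "fifteen",
    17: "seventeen", 19: "nineteen",
}

no_e = [n for n, w in odd_words.items() if "e" not in w]
print(no_e)  # [] -- every one of them contains an 'e'
```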
Compare the reasoning times!!! 84s vs 342s
R1 (Thought for 84 seconds)
o1 Pro (Thought for 5 minutes and 42 seconds)
Easy for who? Some problems are better solved in one way compared to another.
In the case of counting letters and such, it is not an easy problem, because of how the LLM tokenizes its input/output. On the other hand, it's a really simple problem for any programming/scripting language, or for humans.
And then you have problems like "5142352 * 51234", which is trivial for any basic calculator, but very hard for a human or an LLM.
Or "problems" like "Make a list of all the cities that had a celebrity from there who knows how to program in Fortran", which would be an "easy" problem for an LLM, but pretty much a hard problem for anything other than Wikidata, assuming both the LLM and Wikidata have the relevant data in their datasets.
> I suspect the breakthrough won't be trivial that enables solving trivial questions.
So with what I wrote above in mind, LLMs already solve trivial problems, assuming you think about the capabilities of the LLM. Of course, if you meant "trivial for humans", I'll expect the answer to always remain "No", because things like "Standing up" is trivial for humans, but it'll never be trivial for a LLM, it doesn't have any legs!
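The point about relative difficulty is easy to make concrete: both of the "hard for an LLM" examples above are one-liners in any scripting language:

```python
# Letter counting: hard through a tokenizer, trivial on raw characters
print("strawberry".count("r"))   # 3

# Arithmetic: hard for a human or an LLM, trivial for the interpreter
print(5142352 * 51234)           # 263463262368
```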
Asking a question like this only highlights the questioner's complete lack of understanding of LLMs, rather than an LLM's inability to do something.
Deleted Comment
Deleted Comment
Their model crushes it on closed-system tasks (97.3% on MATH-500, 2029 Codeforces rating) where success criteria are clear. This makes sense - RL thrives when you can define concrete rewards. Clean feedback loops in domains like math and coding make it easier for the model to learn what "good" looks like.
What's counterintuitive is they achieved this without the usual supervised learning step. This hints at a potential shift in how we might train future models for well-defined domains. The MIT license is nice, but the real value is showing you can bootstrap complex reasoning through pure reinforcement.
The challenge will be extending this to open systems (creative writing, cultural analysis, etc.) where "correct" is fuzzy. You can't just throw RL at problems where the reward function itself is subjective.
This feels like a "CPU moment" for AI - just as CPUs got really good at fixed calculations before GPUs tackled parallel processing, we might see AI master closed systems through pure RL before cracking the harder open-ended domains.
The business implications are pretty clear - if you're working in domains with clear success metrics, pure RL approaches might start eating your lunch sooner than you think. If you're in fuzzy human domains, you've probably got more runway.
Importantly the barrier is that open domains are too complex and too undefined to have a clear reward function. But if someone cracks that — meaning they create a way for AI to self-optimize in these messy, subjective spaces — it'll completely revolutionize LLMs through pure RL.
Here's the link of the tweet: https://x.com/karpathy/status/1821277264996352246
That’s why all those models fine tuned on (instruction, input, answer) tuples are essentially lobotomized. They’ve been told that, for the given input, only the output given in the training data is correct, and any deviation should be “punished”.
In truth, for each given input, there are many examples of output that should be reinforced, many examples of output that should be punished, and a lot in between.
When BF Skinner used to train his pigeons, he’d initially reinforce any tiny movement that at least went in the right direction. For example, instead of waiting for the pigeon to peck the lever directly (which it might not do for many hours), he’d give reinforcement if the pigeon so much as turned its head towards the lever. Over time, he’d raise the bar. Until, eventually, only clear lever pecks would receive reinforcement.
We should be doing the same when taming LLMs from their pretraining as document completers into assistants.
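Skinner-style shaping is straightforward to express as a reward schedule. Here's a toy sketch (the stages, tolerances, and "distance to the target behavior" are all illustrative, not from any real RLHF pipeline):

```python
def shaping_reward(distance, stage):
    """Reward crude approximations early, demand the exact target late.
    `distance` is how far the behavior is from a clean 'lever peck'
    (0 = peck, 5 = merely turning the head toward the lever)."""
    tolerance = max(0, 5 - stage)  # stage 0 accepts distance <= 5 ...
    return 1 if distance <= tolerance else 0  # ... stage 5+ demands 0

# Early training: a mere head-turn (distance 4) is reinforced
print(shaping_reward(4, stage=0))  # 1
# Late training: the head-turn no longer earns anything ...
print(shaping_reward(4, stage=5))  # 0
# ... only the actual peck does
print(shaping_reward(0, stage=5))  # 1
```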
Basically, they have an external source-of-truth that verifies whether the model's answers are correct or not.
"Supervised learning" for LLMs generally means the system sees a full response (eg from a human expert) as supervision.
Reinforcement learning is a much weaker signal: the system has the freedom to construct its own response / reasoning, and only gets feedback at the end whether it was correct. This is a much harder task, especially if you start with a weak model. RL training can potentially struggle in the dark for an exponentially long period before stumbling on any reward at all, which is why you'd often start with a supervised learning phase to at least get the model in the right neighborhood.
This made me smile, as I thought (non snarkily) that's what living beings do.
In some domains it is harder than math and code.
Their parent hedge fund company isn't huge either, just 160 employees and $7b AUM according to Wikipedia. If that was a US hedge fund it would be the #180 largest in terms of AUM, so not small but nothing crazy either
The negative downsides begin at "dystopia worse than 1984 ever imagined" and get worse from there
It's indeed very dystopian.
Both R1 and V3 say that they are ChatGPT from OpenAI
If you have a model that can learn as you go, then the concept of accuracy on a static benchmark would become meaningless, since a perfect continual learning model would memorize all the answers within a few passes and always achieve a 100% score on every question. The only relevant metrics would be sample efficiency and time to convergence. i.e. how quickly does the system learn?
You say it as if it's an easy thing to do. These things take time man.
I personally would have gone for search/reasoning as has been done. It's the reason path.
DeepSeek is a Chinese AI company and we're talking about military technology. The next world war will be fought by AI, so the Chinese government won't leave China's AI development to chance. The might of the entire Chinese government is backing DeepSeek.
Dead Comment
The key insight is that those building foundational models and original research are always first, and then models like DeepSeek always appear 6 to 12 months later. This latest move towards reasoning models is a perfect example.
Or perhaps DeepSeek is also doing all their own original research and it’s just coincidence they end up with something similar yet always a little bit behind.
There are some significant innovations behind v2 and v3, like multi-headed latent attention, their many MoE improvements, and multi-token prediction.
https://epoch.ai/gradient-updates/how-has-deepseek-improved-...
For instance, in coding tasks, Sonnet 3.5 has benchmarked below other models for some time now, but there is fairly prevalent view that Sonnet 3.5 is still the best coding model.
https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B
It's a bit harder when they've provided the safetensors in FP8 like for the DS3 series, but these smaller distilled models appear to be BF16, so the normal convert/quant pipeline should work fine.
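For reference, the usual llama.cpp flow for a BF16 safetensors release looks something like this (paths, script names, and flags depend on your llama.cpp checkout; treat this as a sketch, not exact commands):

```shell
# Convert the HF safetensors repo to GGUF (keeping BF16)
python convert_hf_to_gguf.py ./DeepSeek-R1-Distill-Llama-8B \
  --outfile r1-distill-8b-bf16.gguf --outtype bf16

# Quantize down to Q8_0
./llama-quantize r1-distill-8b-bf16.gguf r1-distill-8b-Q8_0.gguf Q8_0
```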
Come onnnnnn, when someone releases something and claims it’s “infinite speed up” or “better than the best despite being 1/10th the size!” do your skepticism alarm bells not ring at all?
You can’t wave a magic wand and make an 8b model that good.
I’ll eat my hat if it turns out the 8b model is anything more than slightly better than the current crop of 8b models.
You cannot, no matter hoowwwwww much people want it to. be. true, take more data, the same architecture and suddenly you have a sonnet class 8b model.
> like an insane transfer of capabilities to a relatively tiny model
It certainly does.
…but it probably reflects the meaninglessness of the benchmarks, not how good the model is.
There’s also a lot of work going on right now showing that small models can significantly improve their outputs by inferencing multiple times[1], which is effectively what this model is doing. So even small models can produce better outputs by increasing the amount of compute through them.
I get the benchmark fatigue, and it’s merited to some degree. But in spite of that, models have gotten really significantly better in the last year, and continue to do so. In some sense, really good models should be really difficult to evaluate, because that itself is an indicator of progress.
[1] https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling...
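A minimal form of inferencing multiple times is self-consistency: sample several answers and majority-vote. A sketch (the sample list is a stand-in for repeated LLM calls at nonzero temperature):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer across samples."""
    return Counter(answers).most_common(1)[0][0]

# Stand-in for k samples from a small model
samples = ["14,505 ft", "14,494 ft", "14,505 ft", "14,505 ft", "14,440 ft"]
print(majority_vote(samples))  # 14,505 ft
```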
Deleted Comment
Wow. They’re really trying to undercut closed source LLMs
> In the face of disruptive technologies, moats created by closed source are temporary. Even OpenAI’s closed source approach can’t prevent others from catching up. So we anchor our value in our team — our colleagues grow through this process, accumulate know-how, and form an organization and culture capable of innovation. That’s our moat.
----