I think one of the reasons we are confused about what LLMs can do is that they use language. We look at the "reasoning traces" and the tokens there look human, but what is actually happening is very alien to us, as shown by "On the Biology of a Large Language Model"[1] and "Safety Alignment Should Be Made More Than Just a Few Tokens Deep"[2].
I am struggling a lot to see what the tech can and cannot do, particularly when designing systems with them, and how to build systems where the whole is bigger than the sum of its parts. And I think this is because I am constantly confused by their capabilities: despite understanding their machinery and how they work, their use of language just seems like magic. I even wrote https://punkx.org/jackdoe/language.html just to remind myself how to think about it.
I think this kind of research is amazing and we have to spend tremendously more effort on understanding how to use the tokens and how to build with them.
> how to build systems where the whole is bigger than the sum of its parts
A bit tangential, but I look at programming as inherently being that. Every task I try to break down into some smaller tasks that together accomplish something more. That leads me to think that, if you structure the process of programming right, you will only end up solving small, minimally intertwined problems. Might sound far-fetched, but I think it's doable to create such a workflow. And even the dumber LLMs would slot naturally into such a process, I imagine.
> And, even the dumber LLMs would slot in naturally into such a process
That is what I am struggling with: it is really easy at the moment to slot in an LLM and make everything worse. Mainly because its output comes from torch.multinomial, with all kinds of speculative decoding, quantization, etc.
But I am convinced it is possible, just not the way I am doing it right now. That's why I am spending most of my time studying.
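To make the sampling point concrete, here is a minimal sketch in plain Python (standard library only, standing in for torch.multinomial, and ignoring speculative decoding and quantization entirely). The vocabulary, the logits, and the `sample_token` helper are all made up for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical next-token candidates and logits, for illustration only.
vocab = ["parse", "skip", "retry", "banana"]
logits = [2.0, 1.0, 0.5, -1.0]

rng = random.Random(0)
draws = [sample_token(vocab, logits, rng=rng) for _ in range(1000)]
counts = {tok: draws.count(tok) for tok in vocab}
print(counts)  # low-probability tokens can still be drawn
```

The point of the toy is only that every step of a pipeline built on this is a weighted dice roll, so a component that is usually right can still occasionally emit the token nobody wanted.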
The opposite might apply, too; the whole system may be smaller than its parts, as it excels at individual tasks but mixes things up in combination. Improvements will be made, but I wonder if we should aim for generalists, or accept more specialist approaches as it is difficult to optimise for all tasks at once.
You know the meme "seems like we will have AGI before we can reliably parse PDFs" :)
So if you are building a system, let's say you ask it to parse a PDF, and you put a judge to evaluate the quality of the output, and then you create a meta judge to improve the prompts of the parser and the PDF judge. The question is, is this going to get better as it is running, and even more, is it going to get better as the models are getting better?
You can build the same system in a completely different way, more like 'program synthesis': imagine you don't use LLMs to parse, but use them to write parser code and tests, then a judge to judge the tests (or even escalate to a human to verify), and then you train a classifier that picks the parser. Now this system is much more likely to improve itself as it is running, and as the models are getting better.
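A rough sketch of that second shape, under heavy assumptions: the two parser functions below are hand-written stand-ins for code an LLM might produce, the test cases are invented, and a simple "highest pass rate wins" selector is swapped in for the trained classifier.

```python
import re

# Stand-ins for parser code an LLM might write; both are hypothetical.
def parser_a(text):
    """Naive: take the first run of digits as the invoice total."""
    m = re.search(r"\d+", text)
    return int(m.group()) if m else None

def parser_b(text):
    """Look for the number that follows the word 'Total'."""
    m = re.search(r"Total:\s*(\d+)", text)
    return int(m.group(1)) if m else None

# Test cases (input text, expected value); in the scheme above these would be
# generated too, then judged or escalated to a human for verification.
tests = [
    ("Invoice 17\nTotal: 250", 250),
    ("Total: 99", 99),
    ("Invoice 3\nItems: 4\nTotal: 12", 12),
]

def pass_rate(parser, cases):
    """Fraction of test cases the parser gets exactly right."""
    passed = sum(1 for text, want in cases if parser(text) == want)
    return passed / len(cases)

def pick_parser(candidates, cases):
    """Keep the candidate with the highest pass rate (a crude selector)."""
    return max(candidates, key=lambda p: pass_rate(p, cases))

best = pick_parser([parser_a, parser_b], tests)
print(best.__name__, pass_rate(best, tests))
```

What makes this shape attractive is that the artifacts (parsers and tests) are deterministic and accumulate, so a better model only has to produce a better candidate once.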
A few months ago Yannic Kilcher gave this example: it seems that current language models are very constrained mid-sentence, because above all they want to produce semantically consistent and grammatically correct text, so the entropy mid-sentence is very different from the entropy after punctuation. The "." (dot) "frees" the distribution. What does that mean for the "generalist" or "specialist" approach when sampling the wrong token can completely derail everything?
If you believe that the models will "think", then you should bet on the prompt and meta-prompt approach; if you believe they will always be limited, then you should build with program synthesis.
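For anyone who wants the entropy claim in numbers, a small sketch: Shannon entropy of two made-up next-token distributions, one peaked (as mid-sentence) and one flat (as after a period). The probabilities are invented for illustration, not measured from any model.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Made-up next-token distributions, for illustration only.
mid_sentence = [0.90, 0.05, 0.03, 0.02]   # grammar pins down the next token
after_period = [0.25, 0.25, 0.25, 0.25]   # many continuations are plausible

print(round(entropy_bits(mid_sentence), 3))
print(round(entropy_bits(after_period), 3))  # uniform over 4 outcomes: 2.0 bits
```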
And, honestly, I am totally confused :) So this kind of research is incredibly useful to clear the mist. Also things like https://www.neuronpedia.org/
E.g., why do complimenting (you can do this task), guilt (I will be fired if you don't do this task), and threatening (I will harm you if you don't do this task) work with different success rates? Sergey Brin said recently that threatening works best. I can't get myself to do it, so I take his word for it.
> build systems where the whole is bigger than the sum of its parts.
Any “product” can be thought of this way.
There are many systems nested within systems, yet a simple singular order “emerges”; usually it is the designed, intended function.
The trick to discerning systems lies in their relationships.
Actors through interfaces have a relationship (usually more than one so think of each relationship as its own system dynamic.)
A relationship is where the magic happens, usually a process with work being done (therefore interface inputs must account for this balance.)
Vectors. Vectors, I am thinking, are the real intellectual and functional mechanisms. Most systems process inputs of potential (“energy”), control signal (“information”), and assets (other actors for nested systems). Processes do the work of adding vector solutions [for some other problem] for whatever the output is.
> Rather than standard benchmarks (e.g., math problems), we adopt controllable puzzle environments that let us vary complexity systematically
Very clever, I must say. Kudos to folks who made this particular choice.
> we identify three performance regimes: (1) low complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.
This is fascinating! We need more "mapping" of regimes like this!
What I would love to see (not sure if someone on here has seen anything to this effect) is how these complexity regimes might map to economic value of the task.
For that, the eval needs to go beyond puzzles, but the complexity of the tasks still needs to be controllable.
Is (1) that surprising? If I ask someone a simple question but tell them to "think really hard about it", they'll be more likely to treat it as a trick question and look for a non-obvious answer. Overthinking it, basically.
It is hard to compare models with humans so not sure how to answer it for both. :)
But, for models, this is an interesting finding because a lot of LRMs are LLMs with a _bunch_ of post-training done on top. We know this about DeepSeek R1 (one of the models evaluated in the Apple paper) for sure. They write extensively about how they took DeepSeek-V3-Base and made R1 with it. [1]
If the post-training is resulting in lower performance on simpler tasks then it ought to inspire more research on how to make it so that it doesn't -- i.e., with more training (of any kind), we should be gaining more capabilities. This has been a problem with DNNs historically, btw. We had these issues when fine-tuning text/image classifiers as well. Some weight changes can be destructive. So, it has to be done with a _lot_ of care. And, I am sure folks are working on it, to be honest. Maybe some of them will say something here. :-)
Human language is far from perfect as a cognitive tool but still serves us well because it is not foundational. We use it both for communication and some reasoning/planning as a high level layer.
I strongly believe that human language is too weak (vague, inconsistent, not expressive enough etc.) to replace interactions with the world as a basis to build strong cognition.
We're easily fooled by the results of LLM/LRM models because we typically use language fluency and knowledge retrieval as a proxy benchmark for intelligence among our peers.
Human language is more powerful than its surface syntax or semantics: it carries meaning beyond formal correctness. We often communicate effectively even with grammatically broken sentences, using jokes, metaphors, or emotionally charged expressions. This richness makes language a uniquely human cognitive layer, shaped by context, culture, and shared experience. While it's not foundational in the same way as sensorimotor interaction, it is far more than just a high-level communication tool.
I agree that language is even more useful as a cognitive tool than as a communication medium.
But that is not my point. The map is not the territory, and this map (language) is too poor to build something that is going to give more than what it was fed with.
Agree with this. Human language is also not very information-dense; there is a lot of redundancy and uninformative repetition of words.
I also wonder about the compounding effects of luck and survivorship bias when using these systems. If you model a series of interactions with these systems probabilistically, as a series of failure/success modes, then you are bound to get a sub-population of users (of LLMs/LRMs) that will undoubtedly have “fantastic” results. This sub-population will then espouse and promote the merits of the system. There is clearly something positive these models do, but how much of the “success” is just luck?
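A back-of-the-envelope version of this, assuming each interaction succeeds independently with the same probability (a strong simplification) and using invented numbers: the chance of a perfect streak is p^n, and a quick simulation agrees with it.

```python
import random

def all_success_probability(p, n):
    """Chance that n independent interactions all succeed, each with probability p."""
    return p ** n

def simulate_lucky_users(users, p, n, seed=0):
    """Count simulated users whose n interactions all happened to succeed."""
    rng = random.Random(seed)
    lucky = 0
    for _ in range(users):
        if all(rng.random() < p for _ in range(n)):
            lucky += 1
    return lucky

# Assumed numbers, for illustration only.
p, n, users = 0.7, 10, 100_000
expected = users * all_success_probability(p, n)
print(f"expected lucky users: {expected:.0f}")
print(f"simulated lucky users: {simulate_lucky_users(users, p, n)}")
```

With these assumed values roughly 3% of users see ten successes in a row, which out of a large user base is still a lot of people with a flawless anecdote.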
Language mediates those interactions with the world. There is no unmediated interaction with the world. Those moments when one feels most directly in contact with reality, that is when one is so deep down inside language that one cannot see daylight at all.
I don't know about you, but as far as I can tell I mediate and manipulate the world with my body and senses without necessarily using language. In fact, I can often do both at once, for example, thinking about something entirely unrelated while jogging, and still making physical decisions and actions without invoking language at all. Plus, animals (especially lower order like amoebas) also mediate with the world without needing language.
As far as we can tell without messing with complex experiential concepts like qualia and the possibility of philosophical zombies, language mainly helps higher-order animals communicate with other animals and (maybe) keep a train of thought, though there are records of people who say that they don't. And now it also allows humans to talk to LLMs.
But I digress, I would say this is an open academic debate. Suggesting that there is always language deep down is speculation.
I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted. And the question they are collectively trying to ask is whether this will continue forever.
I've never seen this question quantified in a really compelling way, and while interesting, I'm not sure this PDF succeeds, at least not well enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are in fact entirely a byproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.
But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.
> I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted.
We keep assigning adjectives to this technology that anthropomorphize the neat tricks we've invented. There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.
All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.
This is a neat trick, but it doesn't solve the underlying problems that plague these models like hallucination. If the "reasoning" process contains garbage, gets stuck in loops, etc., the final answer will also be garbage. I've seen sessions where the model approximates the correct answer in the first "reasoning" step, but then sabotages it with senseless "But wait!" follow-up steps. The final answer ends up being a mangled mess of all the garbage it generated in the "reasoning" phase.
The only reason we keep anthropomorphizing these tools is because it makes us feel good. It's wishful thinking that markets well, gets investors buzzing, and grows the hype further. In reality, we're as close to artificial intelligence as we were a decade ago. What we do have are very good pattern matchers and probabilistic data generators that can leverage the enormous amount of compute we can throw at the problem. Which isn't to say that this can't be very useful, but ascribing human qualities to it only muddies the discussion.
> There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.
> All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.
I always wonder when people make comments like this if they struggle with analogies. Or if it's a lack of desire to discuss concepts at different levels of abstraction.
Clearly an LLM is not "omniscient". It doesn't require a post to refute that; OP obviously doesn't mean that literally. It's an analogy describing two semi (fairly?) independent axes: one on breadth of knowledge, one on something more similar to intelligence and being able to "reason" from smaller components of knowledge, the opposite of which is dim-witted.
So at one extreme you'd have something completely unable to generalize or synthesize new results. It is only able to respond correctly if the input identically matches prior things it has seen, but it has seen and stored a ton. At the other extreme would be something that only knows a very small set of general facts and concepts but is extremely good at reasoning from first principles on the fly. Both could "score" the same on an evaluation, but have very different projections for future growth.
It's a great analogy and way to think about the problem. And it took me multiple paragraphs to write what OP expressed in two sentences via a great analogy.
LLMs are a blend of the two skills, apparently leaning more towards the former but not completely.
> What we do have are very good pattern matchers and probabilistic data generators
This is an unhelpful description. An object is more than the sum of its parts, and higher-level behaviors emerge. The statement is factually correct, and yet it is the equivalent of describing a computer as nothing more than a collection of gates and wires that therefore shouldn't be discussed at a higher level of abstraction.
>There's nothing "omniscient" or "dim-witted" about these tools
I disagree in that that seems quite a good way of describing them. All language is a bit inexact.
Also I don't buy that we are no closer to AI than ten years ago; there seems to be lots going on. Just because LLMs are limited doesn't mean we can't find or add other algorithms. I mean, look at AlphaEvolve for example: https://www.technologyreview.com/2025/05/14/1116438/google-d...
>found a faster way to solve matrix multiplications—a fundamental problem in computer science—beating a record that had stood for more than 50 years
I figure it's hard to argue that that is not at least somewhat intelligent?
I am not sure we are on the same page: the point of my response is that this paper is not enough to prevent exactly the argument you just made.
In any event, if you want to take umbrage with this paper, I think we will need to back up a bit. The authors use a mostly standardized definition of "reasoning", which is widely accepted enough to support not just one, but several of their papers, in some of the best CS conferences in the world. I actually think you are right that it is reasonable to question this definition (and some people do), but I think it's going to be really hard for you to start that discussion here without (1) saying what your definition specifically is, and (2) justifying why it's better than theirs. Or at the very least, borrowing one from a well-known critique like, e.g., Gebru's, Bender's, etc.
> I think AI maximalists will continue to think that the models are in fact getting less dim-witted
I'm bullish (and scared) about AI progress precisely because I think they've only gotten a little less dim-witted in the last few years, but their practical capabilities have improved a lot thanks to better knowledge, taste, context, tooling etc.
What scares me is that I think there's a reasoning/agency capabilities overhang. ie. we're only one or two breakthroughs away from something which is both kinda omniscient (where we are today), and able to out-think you very quickly (if only through dint of applying parallelism to actually competent outcome-modelling and strategic decision making).
That combination is terrifying. I don't think enough people have really imagined what it would mean for an AI to be able to out-strategise humans in the same way that they can now — say — out-poetry humans (by being both decent in terms of quality and super fast). It's like when you're speaking to someone way smarter than you and you realise that they're 6 steps ahead, and actively shaping your thought process to guide you where they want you to end up. At scale. For everything.
This exact thing (better reasoning + agency) is also the top priority for all of the frontier researchers right now (because it's super useful), so I think a breakthrough might not be far away.
Another way to phrase it: I think today's LLMs are about as good at snap judgements in most areas as the best humans (probably much better at everything that rhymes with inferring vibes from text), but they kinda suck at:
1. Reasoning/strategising step-by-step for very long periods
2. Snap judgements about reasoning or taking strategic actions (in the way that expert strategic humans don't actually need to think through their actions step-by-step very often - they've built intuition which gets them straight to the best answer 90% of the time)
Getting good at the long range thinking might require more substantial architectural changes (eg. some sort of separate 'system 2' reasoning architecture to complement the already pretty great 'system 1' transformer models we have). OTOH, it might just require better training data and algorithms so that the models develop good enough strategic taste and agentic intuitions to get to a near-optimal solution quickly before they fall off a long-range reasoning performance cliff.
Of course, maybe the problem is really hard and there's no easy breakthrough (or it requires 100,000x more computing power than we have access to right now). There's no certainty to be found, but a scary breakthrough definitely seems possible to me.
I think you are right, and that the next step function can be achieved using the models we have, either by scaling the inference, or changing the way inference is done.
I am not sure if you mean this to refute something in what I've written but to be clear I am not arguing for or against what the authors think. I'm trying to state why I think there is a disconnect between them and more optimistic groups that work on AI.
> Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget.
This is exactly my experience with coding. Start simple and build up complexity, and everything is great until you get to some threshold, at which point it completely falls apart and seems to stop even trying. Getting effective utilization out of Claude + aider involves managing the complexity that the LLM sees.
To be fair, a technology sigmoid curve rises fastest right at its inflection point, so by its very nature it is hard to predict from inside the curve when innovation will slow down.
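A quick numeric check, assuming the sigmoid in question is a standard logistic curve: the growth rate peaks at the midpoint, and the early part of the curve is nearly indistinguishable from an exponential, which is why the slowdown is hard to call in advance.

```python
import math

def logistic(t):
    """Standard logistic curve with its inflection point at t = 0."""
    return 1.0 / (1.0 + math.exp(-t))

def growth_rate(t):
    """Derivative of the logistic curve: f(t) * (1 - f(t))."""
    f = logistic(t)
    return f * (1.0 - f)

ts = [x / 10 for x in range(-60, 61)]          # t from -6.0 to 6.0
steepest = max(ts, key=growth_rate)
print(steepest, growth_rate(steepest))         # peak rate of 0.25 at t = 0.0

# Early on, the ratio of successive values is close to e, as for exp(t).
print(logistic(-5.0) / logistic(-6.0))
```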
The first Boeing 747 was rolled out in 1968, only 65 years after the first successful heavier-than-air flight. If you told people back then that not much will fundamentally change in civil aviation over the next 57 years, no one would have believed you.
And not just in aviation. Consider what aviation did to make the world smaller. Huge 2nd order changes. The COVID-19 pandemic would not have happened the way it did, if there were no Boeing or Airbus.
AGI has always been "just around the corner", ever since computers were invented.
Some problems have become more tractable (e.g. language translation), mostly by lowering our expectations of what constitutes a "solution", but AGI is nowhere nearer. AGI is a secular millenarian religion.
Even if they never get better than they are today (unlikely) they are still the biggest change in software development and the software development industry in my 28 year career.
> the easy part is done but the hard part is so hard it takes years to progress
There is also no guarantee of continued progress to a breakthrough.
We have been through several "AI Winters" before where promising new technology was discovered and people in the field were convinced that the breakthrough was just around the corner and it never came.
LLMs aren't quite the same situation as they do have some undeniable utility to a wide variety of people even without AGI springing out of them, but the blind optimism that surely progress will continue at a rapid pace until the assumed breakthrough is realized feels pretty familiar to the hype cycle preceding past AI "Winters".
What do you think has changed? The situation is still about as promising for AGI in a few years, if not more so. Papers like this are the academics mapping out where the engineering efforts need to be directed to get there, and it seems to be a relatively small number of challenges that are easier than the ones already overcome. We know machine learning can solve Towers of Hanoi, for example. It isn't fundamentally complicated like Baduk is. The next wall to overcome is more of a low fence.
Besides, AI already passes the Turing test (or at least, is most likely to fail because it is too articulate and reasonable). There is a pretty good argument we've already achieved AGI and now we're working on achieving human- and superhuman-level intelligence in AGI.
> What do you think has changed? The situation is still about as promising for AGI in a few years - if not more so
It's better today. Hoping that LLMs could get us to AGI in one hop was naive. Depending on the definition of AGI, we might already be there. But for superhuman level in all possible tasks there are many steps still to be done. The obvious way is to find a solution for each type of task. We already have one for math calculations: using tools. Many other types can be solved the same way. After a while we'll gradually get to a well-rounded 'brain', or model(s) + support tools.
So far the future looks bright: there is progress and there are problems, but no deadlocks.
PS: Turing test is a <beep> nobody seriously talks about today.
People are drawing erroneous conclusions from this.
My read is that the paper demonstrates that, for a particular model (and the problems examined with it), giving more thought tokens does not help on problems above a certain complexity. It does not say anything about the capabilities of future, larger models to handle more complex tasks. (NB: humans trend similarly.)
My concern is that people are extrapolating from this to conclusions about LLMs generally, and this is not warranted.
The only part of this I find even surprising is the abstract's conclusion (1): that 'thinking' can lead to worse outcomes for certain simple problems. (Again though, maybe you can say humans are the same here. You can overthink things.)
You can absolutely extrapolate the results, because what this shows is that even when "reasoning" these models are still fundamentally repeating in-sample patterns, and that they collapse when faced with novel reasoning tasks above a small complexity threshold.
That is not a model-specific claim, it's a claim on the nature of LLMs.
For your argument to be true, there would need to be a qualitative difference, in which some models possess "true reasoning" capability and some don't, and this test only happened to look at the latter.
The authors don't say anything like this that I can see. Their conclusion specifically identifies these as weaknesses of current frontier models.
Furthermore we have clearly seen increases in reasoning from previous frontier models to current frontier models.
If the authors could show (or did show) that both previous-generation and current-generation frontier models hit a wall at similar complexity, that would be something. AFAIK they do not.
I guess the authors are making an important point (that challenges the current belief & trend in AI): adding reasoning or thinking to a model (regardless of the architecture or generation) doesn’t always lead to a net gain. In fact, once you factor in compute costs and answer quality across problems of varying complexity, the overall benefit can sometimes turn out to be negative.
Human brains work the same way. Some of us are just better at analogy. I’ve worked with plenty of people who were unable to transfer knowledge of one area to another, identical area with different terms.
All the environments they test (Tower of Hanoi, Checkers Jumping, River Crossing, Blocks World) could easily be solved perfectly by any of the LLMs if the authors had allowed them to write code.
I don't really see how this is different from "LLMs can't multiply 20 digit numbers"--which btw, most humans can't either. I tried it once (using pen and paper) and consistently made errors somewhere.
> I don't really see how this is different from "LLMs can't multiply 20 digit numbers"--which btw, most humans can't either. I tried it once (using pen and paper) and consistently made errors somewhere.
People made missiles and precise engineering like jet aircraft before we had computers, humans can do all of those things reliably just by spending more time thinking about it, inventing better strategies and using more paper.
Our brains weren't made to do such computations, but a general intelligence can solve the problem anyway by using what it has in a smart way.
Some specialized people could probably do 20x20, but I'd still expect them to make a mistake at 100x100. The level we needed for space crafts was much less than that, and we had many levels of checks to help catch errors afterwards.
I'd wager that 95% of humans wouldn't be able to do 10x10 multiplication without errors, even if we paid them $100 to get it right.
There's a reason we had to invent lots of machines to help us.
It would be an interesting social studies paper to try and recreate some "LLMs can't think" papers with humans.
Yeah, and FWIW doing this through writing code is trivial for an LLM / LRM. Testing locally, it took not even a minute to get a working solution regardless of the number of disks.
Your analogy makes sense, no reasonable person would try to solve a Tower of Hanoi type problem with e.g. 15 disks and sit there for 32,767 moves non-programmatically.
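For reference, the programmatic route is the textbook recursion. This is a generic sketch, not the code any model produced in the tests mentioned above; the peg names and the `is_valid_solution` checker are illustrative choices.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield the moves (disk, from_peg, to_peg) that solve n-disk Tower of Hanoi."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)   # clear the way
    yield (n, source, target)                        # move the largest disk
    yield from hanoi(n - 1, spare, target, source)   # restack on top of it

def is_valid_solution(n, moves):
    """Replay the moves and check the rules and the final state."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False                 # disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                 # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))

moves = list(hanoi(15))
print(len(moves))                    # 2**15 - 1 = 32767
print(is_valid_solution(15, moves))
```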
This argument is tired, as it keeps getting repeated for any flaw seen in LLMs. And the other tired argument is: wait! This is a sigmoid curve, and we have not seen the inflection point yet. If someone gave me a penny for every comment saying these, I'd be rich by now.
Humans invented machines because they could not do certain things. All the way from simple machines in physics (Archimedes lever) to the modern computer.
> Humans invented machines because they could not do certain things.
If your disappointment is that the LLM didn't invent a computer to solve the problem, maybe you need to give it access to physical tools, robots, labs etc.
>Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents
>In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task.
The reasons humans can't and the reasons LLMs can't are completely different though. LLMs are often incapable of performing multiplication. Many humans just wouldn't care to do it.
The goal isn't to assess the LLM's capability at solving any of those problems. The point isn't how good they are at Blocks World puzzles.
The point is to construct non-circular ways of quantifying model performance in reasoning. That the LLM has access to prior exemplars of any given problem is exactly the issue in establishing performance in reasoning, over historical synthesis.
[1]: https://transformer-circuits.pub/2025/attribution-graphs/bio...
[2]: https://arxiv.org/pdf/2406.05946
A bit tangential, but I look at programming as inherently being that. Every task I try to break down into some smaller tasks that together accomplish something more. That leads me to think that, if you structure the process of programming right, you will only end up solving small, minimally interwined problems. Might sound far-fetched, but I think it's doable to create such a workflow. And, even the dumber LLMs would slot in naturally into such a process, I imagine.
That is what I am struggling with, it is really easy at the moment to slot LLM and make everything worse. Mainly because its output is coming from torch.multinomial with all kinds of speculative decoding and quantizations and etc.
But I am convinced it is possible, just not the way I am doing it right now, thats why I am spending most of my time studying.
So if you are building a system, lets say you ask it to parse a pdf, and you put a judge to evaluate the quality of the output, and then you create a meta judge to improve the prompts of the parser and the pdf judge. The question is, is this going to get better as it is running, and even more, is it going to get better as the models are getting better?
You can build the same system in completely different way, more like 'program synthesis' imagine you dont use llms to parse, but you use them to write parser code, and tests, and then judge to judge the tests, or even escalate to human to verify, then you train your classifier that picks the parser. Now this system is much more likely to improve itself as it is running, and as the models are getting better.
Few months ago Yannic Kilcher gave this example as that it seems that current language models are very constrained mid-sentence, because they most importantly want produce semantically consistent and grammatically correct text, so the entropy mid sentence is very different than the entropy after punctuation. The . dot "frees" the distribution. What does that mean for "generalists" or "specialists" approach when sampling the wrong token can completely derail everything?
If you believe that the models will "think" then you should bet on the prompt and meta prompt approach, if you believe they will always be limited then you should build with program synthesis.
And, honestly, I am totally confused :) So this kind of research is incredibly useful to clear the mist. Also things like https://www.neuronpedia.org/
E.g., why do compliments (you can do this task), guilt (I will be fired if you don't do this task), and threats (I will harm you if you don't do this task) work with different success rates? Sergey Brin said recently that threatening works best. I can't bring myself to do it, so I take his word for it.
Any “product” can be thought of this way.
There are many systems, nested within other systems, yet a simple singular order “emerges”; usually it is the designed, intended function.
The trick to discerning systems lies in their relationships.
Actors through interfaces have a relationship (usually more than one so think of each relationship as its own system dynamic.)
A relationship is where the magic happens, usually a process with work being done (therefore interface inputs must account for this balance.)
Vectors. Vectors I am thinking are the real intellectual and functional mechanisms. Most systems process inputs of potential (“energy”) control signal (“information”) and assets (other actors for nested systems). Processes do the work of adding vector solutions [for some other problem] for whatever the output is.
That’s the topology as I am seeing it.
But they can also do math, logic, music notation, write code, LaTeX, SVG, etc.
Very clever, I must say. Kudos to folks who made this particular choice.
> we identify three performance regimes: (1) low complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.
This is fascinating! We need more "mapping" of regimes like this!
What I would love to see (not sure if someone on here has seen anything to this effect) is how these complexity regimes might map to economic value of the task.
For that, the eval needs to go beyond puzzles, but the complexity of the tasks still needs to be controllable.
But, for models, this is an interesting finding because a lot of LRMs are LLMs with a _bunch_ of post-training done on top. We know this about DeepSeek R1 (one of the models evaluated in the Apple paper) for sure. They write extensively about how they took DeepSeek-V3-Base and made R1 with it. [1]
If the post-training is resulting in lower performance on simpler tasks then it ought to inspire more research on how to make it so that it doesn't -- i.e., with more training (of any kind), we should be gaining more capabilities. This has been a problem with DNNs historically, btw. We had these issues when fine-tuning text/image classifiers as well. Some weight changes can be destructive. So, it has to be done with a _lot_ of care. And, I am sure folks are working on it, to be honest. Maybe some of them will say something here. :-)
[1] https://github.com/deepseek-ai/DeepSeek-R1
https://arcprize.org/
I strongly believe that human language is too weak (vague, inconsistent, not expressive enough etc.) to replace interactions with the world as a basis to build strong cognition.
We're easily fooled by the results of LLM/LRM models because we typically use language fluency and knowledge retrieval as a proxy benchmark for intelligence among our peers.
But that is not my point. The map is not the territory, and this map (language) is too poor to build something that is going to give more than what it was fed with.
I also wonder about the compounding effects of luck and survivorship bias when using these systems. If you model a series of interactions with these systems probabilistically, as a series of failure/success modes, then you are bound to get a sub-population of users (of LLMs/LRMs) that will undoubtedly have “fantastic” results. This sub-population will then espouse and promote the merits of the system. There is clearly something positive these models do, but how much of the “success” is just luck?
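A back-of-the-envelope version of that model, assuming each interaction succeeds independently with the same probability. The success rate, streak length, and user count below are invented numbers, and `lucky_users` is a hypothetical helper.

```python
def prob_all_success(p: float, n: int) -> float:
    """Probability that n independent interactions all succeed."""
    return p ** n

def lucky_users(p: float, n: int, users: int) -> float:
    """Expected number of users who see nothing but successes."""
    return users * prob_all_success(p, n)

# Illustrative assumptions: 70% per-interaction success, 10 interactions,
# one million users.
p, n, users = 0.7, 10, 1_000_000
print(f"chance of a perfect streak: {prob_all_success(p, n):.4f}")
print(f"expected 'fantastic results' users: {lucky_users(p, n, users):.0f}")
```

Under these assumptions fewer than 3% of users get a perfect streak, but in a population of a million that is still tens of thousands of people with nothing but good experiences to report.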
As far as we can tell without messing with complex experiential concepts like qualia and the possibility of philosophical zombies, language mainly helps higher-order animals communicate with other animals and (maybe) keep a train of thought, though there are records of people who say that they don't. And now it also allows humans to talk to LLMs.
But I digress, I would say this is an open academic debate. Suggesting that there is always language deep down is speculation.
Ofc I imagine they've tried similar things and that it almost takes away the point if u had to prompt that way.
I've never seen this question quantified in a really compelling way, and while interesting, I'm not sure this PDF succeeds, at least not well enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are in fact entirely a byproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.
But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.
We keep assigning adjectives to this technology that anthropomorphize the neat tricks we've invented. There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.
All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.
This is a neat trick, but it doesn't solve the underlying problems that plague these models like hallucination. If the "reasoning" process contains garbage, gets stuck in loops, etc., the final answer will also be garbage. I've seen sessions where the model approximates the correct answer in the first "reasoning" step, but then sabotages it with senseless "But wait!" follow-up steps. The final answer ends up being a mangled mess of all the garbage it generated in the "reasoning" phase.
The only reason we keep anthropomorphizing these tools is because it makes us feel good. It's wishful thinking that markets well, gets investors buzzing, and grows the hype further. In reality, we're as close to artificial intelligence as we were a decade ago. What we do have are very good pattern matchers and probabilistic data generators that can leverage the enormous amount of compute we can throw at the problem. Which isn't to say that this can't be very useful, but ascribing human qualities to it only muddies the discussion.
> All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.
I always wonder when people make comments like this if they struggle with analogies. Or if it's a lack of desire to discuss concepts at different levels of abstraction.
Clearly an LLM is not "omniscient". It doesn't require a post to refute that; OP obviously doesn't mean that literally. It's an analogy describing two semi (fairly?) independent axes. One is breadth of knowledge; the other is something more similar to intelligence and being able to "reason" from smaller components of knowledge, the opposite of which is dim-witted.
So at one extreme you'd have something completely unable to generalize or synthesize new results. Only able to correctly respond if it identically matches prior things it has seen, but has seen and stored a ton. At the other extreme would be something that only knows a very small set of general facts and concepts but is extremely good at reasoning from first principles on the fly. Both could "score" the same on an evaluation, but have very different projections for future growth.
It's a great analogy and way to think about the problem. And it took me multiple paragraphs to write what OP expressed in two sentences via a great analogy.
LLMs are a blend of the two skills, apparently leaning more towards the former but not completely.
> What we do have are very good pattern matchers and probabilistic data generators
This is an unhelpful description. An object is more than the sum of its parts, and higher-level behaviors emerge. The statement is factually correct, and yet it is the equivalent of describing a computer as nothing more than a collection of gates and wires, and concluding that it shouldn't be discussed at a higher level of abstraction.
I disagree in that that seems quite a good way of describing them. All language is a bit inexact.
Also, I don't buy that we are no closer to AI than ten years ago: there seems to be a lot going on. Just because LLMs are limited doesn't mean we can't find or add other algorithms. I mean, look at AlphaEvolve for example https://www.technologyreview.com/2025/05/14/1116438/google-d...
>found a faster way to solve matrix multiplications—a fundamental problem in computer science—beating a record that had stood for more than 50 years
I figure it's hard to argue that that is not at least somewhat intelligent?
In any event, if you want to take umbrage with this paper, I think we will need to back up a bit. The authors use a mostly-standardized definition of "reasoning", which is widely accepted enough to support not just one, but several of their papers, in some of the best CS conferences in the world. I actually think you are right that it is reasonable to question this definition (and some people do), but I think it's going to be really hard for you to start that discussion here without (1) saying what your definition specifically is, and (2) justifying why it's better than theirs. Or at the very least, borrowing one from a well-known critique like, e.g., Gebru's, Bender's, etc.
Computers can't think and submarines can't swim.
I'm bullish (and scared) about AI progress precisely because I think they've only gotten a little less dim-witted in the last few years, but their practical capabilities have improved a lot thanks to better knowledge, taste, context, tooling etc.
What scares me is that I think there's a reasoning/agency capabilities overhang, i.e., we're only one or two breakthroughs away from something which is both kinda omniscient (where we are today), and able to out-think you very quickly (if only by dint of applying parallelism to actually competent outcome-modelling and strategic decision making).
That combination is terrifying. I don't think enough people have really imagined what it would mean for an AI to be able to out-strategise humans in the same way that they can now — say — out-poetry humans (by being both decent in terms of quality and super fast). It's like when you're speaking to someone way smarter than you and you realise that they're 6 steps ahead, and actively shaping your thought process to guide you where they want you to end up. At scale. For everything.
This exact thing (better reasoning + agency) is also the top priority for all of the frontier researchers right now (because it's super useful), so I think a breakthrough might not be far away.
Another way to phrase it: I think today's LLMs are about as good at snap judgements in most areas as the best humans (probably much better at everything that rhymes with inferring vibes from text), but they kinda suck at:
1. Reasoning/strategising step-by-step for very long periods
2. Snap judgements about reasoning or taking strategic actions (in the way that expert strategic humans don't actually need to think through their actions step-by-step very often - they've built intuition which gets them straight to the best answer 90% of the time)
Getting good at the long range thinking might require more substantial architectural changes (eg. some sort of separate 'system 2' reasoning architecture to complement the already pretty great 'system 1' transformer models we have). OTOH, it might just require better training data and algorithms so that the models develop good enough strategic taste and agentic intuitions to get to a near-optimal solution quickly before they fall off a long-range reasoning performance cliff.
Of course, maybe the problem is really hard and there's no easy breakthrough (or it requires 100,000x more computing power than we have access to right now). There's no certainty to be found, but a scary breakthrough definitely seems possible to me.
This is exactly my experience with coding. Start simple and build up complexity, and everything is great until you get to some threshold, at which point it completely falls apart and seems to stop even trying. Getting effective utilization out of Claude + aider involves managing the complexity that the LLM sees.
The first Boeing 747 was rolled out in 1968, only 65 years after the first successful heavier-than-air flight. If you told people back then that not much will fundamentally change in civil aviation over the next 57 years, no one would have believed you.
Big hard-to-predict changes ahead.
Some problems have become more tractable (e.g. language translation), mostly by lowering our expectations of what constitutes a "solution", but AGI is nowhere nearer. AGI is a secular millenarian religion.
the easy part is done but the hard part is so hard it takes years to progress
There is also no guarantee of continued progress to a breakthrough.
We have been through several "AI Winters" before where promising new technology was discovered and people in the field were convinced that the breakthrough was just around the corner and it never came.
LLMs aren't quite the same situation as they do have some undeniable utility to a wide variety of people even without AGI springing out of them, but the blind optimism that surely progress will continue at a rapid pace until the assumed breakthrough is realized feels pretty familiar to the hype cycle preceding past AI "Winters".
Besides, AI already passes the Turing test (or at least, is most likely to fail because it is too articulate and reasonable). There is a pretty good argument we've already achieved AGI and now we're working on achieving human- and superhuman-level intelligence in AGI.
It's better today. Hoping that LLMs can get us to AGI in one hop was naive. Depending on the definition of AGI, we might already be there. But for superhuman level in all possible tasks there are many steps still to be done. The obvious way is to find a solution for each type of task. We already have one for math calculations: using tools. Many other types can be solved the same way. After a while we'll gradually get to a well-rounded 'brain', or model(s) + support tools.
So far the future looks bright: there is progress and there are problems, but no deadlocks.
PS: Turing test is a <beep> nobody seriously talks about today.
My read of this is that the paper demonstrates that given a particular model (and the problems examined with it) that giving more thought tokens does not help on problems above a certain complexity. It does not say anything about the capabilities of future, larger, models to handle more complex tasks. (NB: humans trend similarly)
My concern is that people are extrapolating from this to conclusions about LLMs generally, and this is not warranted.
The only part of this I find even surprising is the abstract's conclusion (1): that 'thinking' can lead to worse outcomes for certain simple problems. (Again though, maybe you can say humans are the same here. You can overthink things.)
That is not a model-specific claim, it's a claim on the nature of LLMs.
For your argument to be true, there would need to be a qualitative difference, in which some models possess "true reasoning" capability and some don't, and this test only happened to look at the latter.
Furthermore we have clearly seen increases in reasoning from previous frontier models to current frontier models.
If the authors could show (or did show) that both previous-generation and current-generation frontier models hit a wall at similar complexity, that would be something. AFAIK they do not.
I don't really see how this is different from "LLMs can't multiply 20 digit numbers"--which btw, most humans can't either. I tried it once (using pen and paper) and consistently made errors somewhere.
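For reference, the pen-and-paper procedure itself is short once written down. This is a minimal sketch of schoolbook long multiplication on digit strings; the function name and the two 20-digit operands are arbitrary choices for illustration, and Python's built-in integers are used only to check the result.

```python
def schoolbook_multiply(a: str, b: str) -> str:
    """Multiply two non-negative decimal strings digit by digit, with carries."""
    x = [int(d) for d in reversed(a)]  # least significant digit first
    y = [int(d) for d in reversed(b)]
    result = [0] * (len(x) + len(y))
    for i, dx in enumerate(x):
        carry = 0
        for j, dy in enumerate(y):
            total = result[i + j] + dx * dy + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(y)] += carry  # leftover carry for this row
    digits = "".join(map(str, reversed(result))).lstrip("0")
    return digits or "0"

# Two arbitrary 20-digit numbers.
a = "12345678901234567890"
b = "98765432109876543210"
product = schoolbook_multiply(a, b)
print(product == str(int(a) * int(b)))
```

That is 400 single-digit multiplications plus carries, which is why a single slip somewhere is so likely when doing it by hand.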
People made missiles and precise engineering like jet aircraft before we had computers, humans can do all of those things reliably just by spending more time thinking about it, inventing better strategies and using more paper.
Our brains weren't made to do such computations, but a general intelligence can solve the problem anyway by using what it has in a smart way.
I'd wager that 95% of humans wouldn't be able to do 10x10 multiplication without errors, even if we paid them $100 to get it right. There's a reason we had to invent lots of machines to help us.
It would be an interesting social studies paper to try and recreate some "LLMs can't think" papers with humans.
Yeah, and FWIW doing this by writing code is trivial for an LLM/LRM: when I tested locally, it took less than a minute to get a working solution, no matter the number of disks.
Your analogy makes sense, no reasonable person would try to solve a Tower of Hanoi type problem with e.g. 15 disks and sit there for 32,767 moves non-programmatically.
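For anyone who wants to check the move count, here is a minimal sketch of the standard recursive Tower of Hanoi solution. The peg names are arbitrary, and this is the textbook algorithm, not whatever code the models in the paper or in the comment above produced.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield (from_peg, to_peg) moves that transfer n disks from source to target."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)  # clear the way
    yield (source, target)                          # move the largest disk
    yield from hanoi(n - 1, spare, target, source)  # restack on top of it

moves = list(hanoi(15))
print(len(moves))  # 2**15 - 1 moves for 15 disks
```

The recurrence is T(n) = 2·T(n-1) + 1, which gives 2^n - 1 moves, so 15 disks means 32,767 moves.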
This argument is tired, as it keeps getting repeated for any flaw seen in LLMs. And the other tired argument is: wait! This is a sigmoid curve, and we have not seen the inflection point yet. If someone gave me a penny for every comment saying these, I'd be rich by now.
Humans invented machines because they could not do certain things. All the way from simple machines in physics (Archimedes' lever) to the modern computer.
If your disappointment is that the LLM didn't invent a computer to solve the problem, maybe you need to give it access to physical tools, robots, labs etc.
>In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task.
https://arxiv.org/abs/2311.13373
The reasons humans can't and the reasons LLMs can't are completely different though. LLMs are often incapable of performing multiplication. Many humans just wouldn't care to do it.
Doesn't that come down to allowing it to directly regurgitate training data? Surely it's seen dozens of such solutions.
The point is to construct non-circular ways of quantifying model performance in reasoning. That the LLM has access to prior exemplars of any given problem is exactly the issue in establishing performance in reasoning, over historical synthesis.