The Illusion of Thinking: Strengths and limitations of reasoning models [pdf]

I think one of the reason we are confused about what LLMs can do is because they use language. And we look at the "reasoning traces" and the tokens there look human, but what is actually happening is very alien to us, as shown by "Biology of Large Language Models"[1] and "Safety Alignment Should Be Made More Than Just a Few Tokens Deep"[2]

I am struggling a lot to see what the tech can and can not do, particularly designing systems with them, and how to build systems where the whole is bigger than the sum of its parts. And I think this is because I am constantly confused by their capabilities, despite understanding their machinery and how they work, their use of language just seems like magic. I even wrote https://punkx.org/jackdoe/language.html just to remind myself how to think about it.

I think this kind of research is amazing and we have to spend tremendous more effort into understanding how to use the tokens and how to build with them.

[1]: https://transformer-circuits.pub/2025/attribution-graphs/bio...

[2]: https://arxiv.org/pdf/2406.05946

dmos62 · 9 months ago

> how to build systems where the whole is bigger than the sum of its parts

A bit tangential, but I look at programming as inherently being that. Every task I try to break down into some smaller tasks that together accomplish something more. That leads me to think that, if you structure the process of programming right, you will only end up solving small, minimally interwined problems. Might sound far-fetched, but I think it's doable to create such a workflow. And, even the dumber LLMs would slot in naturally into such a process, I imagine.

throwaway71271 · 9 months ago

> And, even the dumber LLMs would slot in naturally into such a process

That is what I am struggling with, it is really easy at the moment to slot LLM and make everything worse. Mainly because its output is coming from torch.multinomial with all kinds of speculative decoding and quantizations and etc.

But I am convinced it is possible, just not the way I am doing it right now, thats why I am spending most of my time studying.

dleeftink · 9 months ago

The opposite might apply, too; the whole system may be smaller than its parts, as it excels at individual tasks but mixes things up in combination. Improvements will be made, but I wonder if we should aim for generalists, or accept more specialist approaches as it is difficult to optimise for all tasks at once.

jackdoe · 9 months ago

You know the meme "seems like will have AGI before we can reliably parse PDFs" :)

So if you are building a system, lets say you ask it to parse a pdf, and you put a judge to evaluate the quality of the output, and then you create a meta judge to improve the prompts of the parser and the pdf judge. The question is, is this going to get better as it is running, and even more, is it going to get better as the models are getting better?

You can build the same system in completely different way, more like 'program synthesis' imagine you dont use llms to parse, but you use them to write parser code, and tests, and then judge to judge the tests, or even escalate to human to verify, then you train your classifier that picks the parser. Now this system is much more likely to improve itself as it is running, and as the models are getting better.

Few months ago Yannic Kilcher gave this example as that it seems that current language models are very constrained mid-sentence, because they most importantly want produce semantically consistent and grammatically correct text, so the entropy mid sentence is very different than the entropy after punctuation. The . dot "frees" the distribution. What does that mean for "generalists" or "specialists" approach when sampling the wrong token can completely derail everything?

If you believe that the models will "think" then you should bet on the prompt and meta prompt approach, if you believe they will always be limited then you should build with program synthesis.

And, honestly, I am totally confused :) So this kind of research is incredibly useful to clear the mist. Also things like https://www.neuronpedia.org/

E.G. Why compliment (you can do this task), guilt (i will be fired if you don't do this task), and threatening (i will harm you if you don't do this task) work with different success rate? Sergey Brin said recently that threatening works best, I cant get my self to do it, so I take his word for it.

overu589 · 9 months ago

> build systems where the whole is bigger than the sum of its parts.

Any “product” can be thought of this way.

Of systems there are many systems nested within systems, yet a simple singular order “emerges”, usually it is the designed intended function.

The trick to discerning systems lies in their relationships.

Actors through interfaces have a relationship (usually more than one so think of each relationship as its own system dynamic.)

A relationship is where the magic happens, usually a process with work being done (therefore interface inputs must account for this balance.)

Vectors. Vectors I am thinking are the real intellectual and functional mechanisms. Most systems process inputs of potential (“energy”) control signal (“information”) and assets (other actors for nested systems). Processes do the work of adding vector solutions [for some other problem] for whatever the output is.

That’s the topology as I am seeing it.

bufferoverflow · 9 months ago

> we are confused about what LLMs can do is because they use language.

But they can also do math, logic, music notation, write code, LaTeX, SVG, etc.

throwaway71271 · 9 months ago

as this paper shows, it sees they can do tower of hanoi as well, up to a certain point that is.

I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted. And the question they are collectively trying to ask is whether this will continue forever.

I've never seen this question quantified in a really compelling way, and while interesting, I'm not sure this PDF succeeds, at least not well-enough to silence dissent. I think AI maximalists will continue to think that the models are in fact getting less dim-witted, while the AI skeptics will continue to think these apparent gains are in fact entirely a biproduct of "increasing" "omniscience." The razor will have to be a lot sharper before people start moving between these groups.

But, anyway, it's still an important question to ask, because omniscient-yet-dim-witted models terminate at "superhumanly assistive" rather than "Artificial Superintelligence", which in turn economically means "another bite at the SaaS apple" instead of "phase shift in the economy." So I hope the authors will eventually succeed.

imiric · 9 months ago

> I think the intuition the authors are trying to capture is that they believe the models are omniscient, but also dim-witted.

We keep assigning adjectives to this technology that anthropomorphize the neat tricks we've invented. There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.

All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.

This is a neat trick, but it doesn't solve the underlying problems that plague these models like hallucination. If the "reasoning" process contains garbage, gets stuck in loops, etc., the final answer will also be garbage. I've seen sessions where the model approximates the correct answer in the first "reasoning" step, but then sabotages it with senseless "But wait!" follow-up steps. The final answer ends up being a mangled mess of all the garbage it generated in the "reasoning" phase.

The only reason we keep anthropomorphizing these tools is because it makes us feel good. It's wishful thinking that markets well, gets investors buzzing, and grows the hype further. In reality, we're as close to artificial intelligence as we were a decade ago. What we do have are very good pattern matchers and probabilistic data generators that can leverage the enormous amount of compute we can throw at the problem. Which isn't to say that this can't be very useful, but ascribing human qualities to it only muddies the discussion.

BoiledCabbage · 9 months ago

> There's nothing "omniscient" or "dim-witted" about these tools. They have no wit. They do not think or reason.

> All Large "Reasoning" Models do is generate data that they use as context to generate the final answer. I.e. they do real-time tuning based on synthetic data.

I always wonder when people make comments like this if they struggle with analogies. Or if it's a lack of desire to discuss concepts at different levels of abstraction.

Clearly an LLM is not "omniscient". It doesn't require a post to refute that, OP obviously doesn't mean that literally. It's an analogy describing two semi (fairly?) independent axes. One on breadth of knowledge, one on something more similar to intelligence and being able to "reason" from smaller components of knowledge. The opposite of which is dim witted.

So at one extreme you'd have something completely unable to generalize or synthesize new results. Only able to correctly respond if it identically matches prior things it has seen, but has seen and stored a ton. At the other extreme would be something that only knows a very smal set of general facts and concepts but is extremely good at reasoning from first principles on the fly. Both could "score" the same on an evaluation, but have very different projections for future growth.

It's a great analogy and way to think about the problem. And it me multiple paragraphs to write ehat OP expressed in two sentences via a great analogy.

LLMs are a blend of the two skills, apparently leaning more towards the former but not completely.

> What we do have are very good pattern matchers and probabilistic data generators

This an unhelpful description. And object is more than the sum of its parts. And higher levels behaviors emerge. This statement is factually correct and yet the equivalent of describing a computer as nothing more than a collection of gates and wires so shouldn't be discussed at a higher level of abstraction.

tim333 · 9 months ago

>There's nothing "omniscient" or "dim-witted" about these tools

I disagree in that that seems quite a good way of describing them. All language is a bit inexact.

Also I don't buy we are no closer to AI than ten years ago - there seem lots going on. Just because LLMs are limited doesn't mean we can't find or add other algorithms - I mean look at alphaevolve for example https://www.technologyreview.com/2025/05/14/1116438/google-d...

>found a faster way to solve matrix multiplications—a fundamental problem in computer science—beating a record that had stood for more than 50 years

I figure it's hard to argue that that is not at least somewhat intelligent?

antics · 9 months ago

I am not sure we are on the same page that the point of my response is that this paper is not enough to prevent exactly the argument you just made.

In any event, if you want to take umbrage with this paper, I think we will need to back up a bit. The authors use a mostly-standardized definition of "reasoning", which is widely-accepted enough to support not just one, but several of their papers, in some of the best CS conferences in the world. I actually think you are right that it is reasonable to question this definition (and some people do), but I think it's going to be really hard for you to start that discussion here without (1) saying what your definition specifically is, and (2) justifying why its better than theirs. Or at the very least, borrowing one from a well-known critique like, e.g., Gebru's, Bender's, etc.

Kon5ole · 9 months ago

>They have no wit. They do not think or reason.

Computers can't think and submarines can't swim.

drodgers · 9 months ago

> I think AI maximalists will continue to think that the models are in fact getting less dim-witted

I'm bullish (and scared) about AI progress precisely because I think they've only gotten a little less dim-witted in the last few years, but their practical capabilities have improved a lot thanks to better knowledge, taste, context, tooling etc.

What scares me is that I think there's a reasoning/agency capabilities overhang. ie. we're only one or two breakthroughs away from something which is both kinda omniscient (where we are today), and able to out-think you very quickly (if only through dint of applying parallelism to actually competent outcome-modelling and strategic decision making).

That combination is terrifying. I don't think enough people have really imagined what it would mean for an AI to be able to out-strategise humans in the same way that they can now — say — out-poetry humans (by being both decent in terms of quality and super fast). It's like when you're speaking to someone way smarter than you and you realise that they're 6 steps ahead, and actively shaping your thought process to guide you where they want you to end up. At scale. For everything.

This exact thing (better reasoning + agency) is also the top priority for all of the frontier researchers right now (because it's super useful), so I think a breakthrough might not be far away.

Another way to phrase it: I think today's LLMs are about as good at snap judgements in most areas as the best humans (probably much better at everything that rhymes with inferring vibes from text), but they kinda suck at:

1. Reasoning/strategising step-by-step for very long periods

2. Snap judgements about reasoning or taking strategic actions (in the way that expert strategic humans don't actually need to think through their actions step-by-step very often - they've built intuition which gets them straight to the best answer 90% of the time)

Getting good at the long range thinking might require more substantial architectural changes (eg. some sort of separate 'system 2' reasoning architecture to complement the already pretty great 'system 1' transformer models we have). OTOH, it might just require better training data and algorithms so that the models develop good enough strategic taste and agentic intuitions to get to a near-optimal solution quickly before they fall off a long-range reasoning performance cliff.

Of course, maybe the problem is really hard and there's no easy breakthrough (or it requires 100,000x more computing power than we have access to right now). There's no certainty to be found, but a scary breakthrough definitely seems possible to me.

sitkack · 9 months ago

I think you are right, and that the next step function can be achieved using the models we have, either by scaling the inference, or changing the way inference is done.

sitkack · 9 months ago

There is no reason that omniscient-yet-dim-witted has to plateau at human intelligence.

antics · 9 months ago

I am not sure if you mean this to refute something in what I've written but to be clear I am not arguing for or against what the authors think. I'm trying to state why I think there is a disconnect between them and more optimistic groups that work on AI.

curious_cat_163 · 9 months ago

> Rather than standard benchmarks (e.g., math problems), we adopt controllable puzzle environments that let us vary complexity systematically

Very clever, I must say. Kudos to folks who made this particular choice.

> we identify three performance regimes: (1) low complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.

This is fascinating! We need more "mapping" of regimes like this!

What I would love to see (not sure if someone on here has seen anything to this effect) is how these complexity regimes might map to economic value of the task.

For that, the eval needs to go beyond puzzles but the complexity of the tasks still need to be controllable.

pegasus · 9 months ago

Is (1) that surprising? If I ask someone a simple question but tell them to "think really hard about it", they'll be more likely to treat it as a trick question and look for a non-obvious answer. Overthinking it, basically.

It is hard to compare models with humans so not sure how to answer it for both. :)

But, for models, this is an interesting finding because a lot of LRMs are LLMs with a _bunch_ of post-training done on top. We know this about DeepSeek R1 (one of the models evaluated in the Apple paper) for sure. They write extensively about how they took DeepSeek-V3-Base and made R1 with it. [1]

If the post-training is resulting in lower performance on simpler tasks then it ought to inspire more research on how to make it so that it doesn't -- i.e., with more training (of any kind), we should be gaining more capabilities. This has been a problem with DNNs historically, btw. We had these issues when fine-tuning text/image classifiers as well. Some weight changes can be destructive. So, it has to be done with a _lot_ of care. And, I am sure folks are working on it, to be honest. Maybe some of them will say something here. :-)

[1] https://github.com/deepseek-ai/DeepSeek-R1

Deleted Comment

make3 · 9 months ago

Using puzzles is not special or anything, it has been done a million times since before (and including) the LSTM paper (1997) https://www.bioinf.jku.at/publications/older/2604.pdf

jimmySixDOF · 9 months ago

The Arc Prize just released a new update and it's all minigame puzzles

https://arcprize.org/

stephc_int13 · 9 months ago

Human language is far from perfect as a cognitive tool but still serves us well because it is not foundational. We use it both for communication and some reasoning/planning as a high level layer.

I strongly believe that human language is too weak (vague, inconsistent, not expressive enough etc.) to replace interactions with the world as a basis to build strong cognition.

We're easily fooled by the results of LLM/LRM models because we typically use language fluency and knowledge retrieval as a proxy benchmark for intelligence among our peers.

wslh · 9 months ago

Human language is more powerful than its surface syntax or semantics: it carries meaning beyond formal correctness. We often communicate effectively even with grammatically broken sentences, using jokes, metaphors, or emotionally charged expressions. This richness makes language a uniquely human cognitive layer, shaped by context, culture, and shared experience. While it's not foundational in the same way as sensorimotor interaction, it is far more than just a high-level communication tool.

I agree that language is even more useful as a cognitive tool than as a communication medium.

But that is not my point. The map is not the territory, and this map (language) is too poor to build something that is going to give more than what it was fed with.

squidproquo · 9 months ago

Agree with this. Human language is also not very information-dense; there is a lot of redundancy and uninformative repetition of words.

I also wonder about the compounding effects of luck and survivorship bias when using these systems. If you model a series of interactions with these systems probabilistically, as a series of failure/success modes, then you are bound to get a sub-population of users (of LLM/LLRMs) that will undoubtedly have “fantastic” results. This sub-population will then espouse and promote the merits of the system. There is clearly something positive these models do, but how much of the “success” is just luck.

antithesizer · 9 months ago

Language mediates those interactions with the world. There is no unmediated interaction with the world. Those moments when one feels most directly in contact with reality, that is when one is so deep down inside language that one cannot see daylight at all.

mrbungie · 9 months ago

I don't know about you, but as far as I can tell I mediate and manipulate the world with my body and senses without necessarily using language. In fact, I can often do both at once, for example, thinking about something entirely unrelated while jogging, and still making physical decisions and actions without invoking language at all. Plus, animals (especially lower order like amoebas) also mediate with the world without needing language.

As far as we can tell without messing with complex experiental concepts like qualia and the possibility of philosophical zombies, language mainly helps higher order animals communicate with other animals and (maybe) keep a train of thought, though there are records of people that say that they don't. And now also it allows humans talk to LLMs.

But I digress, I would say this is an open academic debate. Suggesting that there is always language deep down is speculation.

anton-c · 9 months ago

Sounds like we need ai legalese as that's how we navigate the vagueness of language in the real world.

Ofc I imagine they've tried similar things and that it almost takes away the point if u had to prompt that way.

I was not referring to the prompt but to the underlying network that is built on weak cognitive foundations because all of it is coming from language.

gwd · 9 months ago

> Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget.

This is exactly my experience with coding. Start simple and build up complexity, and everything is great until you get to some threshold, at which point it completely falls apart and seems to stop even trying. Getting effective utilization out of Claude + aider involves managing the complexity that the LLM sees.

actinium226 · 9 months ago

Man, remember when everyone was like 'AGI just around the corner!' Funny how well the Gartner hype cycle captures these sorts of things

latchup · 9 months ago

To be fair, the technology sigmoid curve rises fastest just before its inflection point, so it is hard to predict at what point innovation slows down due to its very nature.

The first Boeing 747 was rolled out in 1968, only 65 years after the first successful heavier-than-air flight. If you told people back then that not much will fundamentally change in civil aviation over the next 57 years, no one would have believed you.

PantaloonFlames · 9 months ago

And not just in aviation. Consider what aviation did to make the world smaller. Huge 2nd order changes. The COVID-19 pandemic would not have happened the way it did, if there were no Boeing or Airbus.

Big hard-to-predict changes ahead.

bayindirh · 9 months ago

They're similar to self-driving vehicles. Both are around the corner, but neither can negotiate the turn.

nmca · 9 months ago

I saw your comment and counted — in May I took a Waymo thirty times.

hskalin · 9 months ago

And commerically viable nuclear fusion

kfarr · 9 months ago

Waymo's pretty good at unprotected lefts

einrealist · 9 months ago

All that to keep the investment pyramid schemes going.

brookst · 9 months ago

…but that was, like, two years ago? If we go from GPT2 to AGI in ten years that will still feel insanely fast.

piskov · 9 months ago

We won’t

otabdeveloper4 · 9 months ago

AGI has always been "just around the corner", ever since computers were invented.

Some problems have become more tractable (e.g. language translation), mostly by lowering our expectations of what constitutes a "solution", but AGI is nowhere nearer. AGI is a secular milleniarist religion.

IshKebab · 9 months ago

Yeah it's already been 2½ years! How long does it take to develop artificial life anyway? Surely no more than 3 years? I demand my money back!

yahoozoo · 9 months ago

We will be treating LLMs “like a junior developer” forever.

sneak · 9 months ago

Even if they never get better than they are today (unlikely) they are still the biggest change in software development and the software development industry in my 28 year career.

JKCalhoun · 9 months ago

And I'm fine with that.

tonyhart7 · 9 months ago

I think we just around at 80% of progress

the easy part is done but the hard part is so hard it takes years to progress

georgemcbay · 9 months ago

> the easy part is done but the hard part is so hard it takes years to progress

There is also no guarantee of continued progress to a breakthrough.

We have been through several "AI Winters" before where promising new technology was discovered and people in the field were convinced that the breakthrough was just around the corner and it never came.

LLMs aren't quite the same situation as they do have some undeniable utility to a wide variety of people even without AGI springing out of them, but the blind optimism that surely progress will continue at a rapid pace until the assumed breakthrough is realized feels pretty familiar to the hype cycle preceding past AI "Winters".

naasking · 9 months ago

Interpreting "just around the corner" as "this year" sounds like your error. Most projections are are years out, at least.

mirekrusin · 9 months ago

I remember "stochastic parrot" and people saying it's fancy markov chain/dead end. You don't hear them much after roughly agentic coding appeared.

Marazan · 9 months ago

Spicy autocomplete is still spicy autocomplete

roenxi · 9 months ago

What do you think has changed? The situation is still about as promising for AGI in a few years - if not more so. Papers like this are the academics mapping out where the engineering efforts need to be directed to get there and it seems to be a relatively small number of challenges that are easier as the ones already overcome - we know machine learning can solve Towers of Hanoi, for example. It isn't fundamentally complicated like Baduk is. The next wall to overcome is more of a low fence.

Besides, AI already passes the Turing test (or at least, is most likely to fail because it is too articulate and reasonable). There is a pretty good argument we've already achieved AGI and now we're working on achieving human- and superhuman-level intelligence in AGI.

MoonGhost · 9 months ago

> What do you think has changed? The situation is still about as promising for AGI in a few years - if not more so

It's better today. Hoping that LLMs can get us to AGI in one hop was naive. Depending on definition of AGI we might be already there. But for superhuman level in all possible tasks there are many steps to be done. The obvious way is to find a solution for each type of tasks. We have already for math calculations, it's using tools. Many other types can be solved the same way. After a while we'll gradually get to well rounded 'brain', or model(s) + support tools.

So, so far future looks bright, there is progress, problems, but not deadlocks.

PS: Turing test is a <beep> nobody seriously talks about today.

avsteele · 9 months ago

People are drawing erroneous conclusions from this.

My read of this is that the paper demonstrates that given a particular model (and the problems examined with it) that giving more thought tokens does not help on problems above a certain complexity. It does not say anything about the capabilities of future, larger, models to handle more complex tasks. (NB: humans trend similarly)

My concern is that people are extrapolating from this to conclusions about LLM's generally, and this is not warranted

The only part about this i find even surprising is he abstract's conclusion (1): that 'thinking' can lead to worse outcomes for certain simple problem. (again though, maybe you can say humans are the same here. You can overthink things)

lukev · 9 months ago

You can absolutely extrapolate the results, because what this shows is that even when "reasoning" these models are still fundamentally repeating in-sample patterns, and that they collapse when faced with novel reasoning tasks above a small complexity threshold.

That is not a model-specific claim, it's a claim on the nature of LLMs.

For your argument to be true would need to mean that there is a qualitative difference, in which some models possess "true reasoning" capability and some don't, and this test only happened to look at the latter.

The authors don't say anything like this that I can see. Their conclusion specifically identifies these as weaknesses of current frontier models.

Furthermore we have clearly seen increases in reasoning from previous frontier models to current frontier models.

If the authors could /did show that both previous-generation and current-generation frontier models hit a wall at similar complexity that would be something, AFAIK they do not.

deepsharp · 9 months ago

I guess the authors are making an important point (that challenges the current belief & trend in AI): adding reasoning or thinking to a model (regardless of the architecture or generation)doesn’t always lead to a net gain. In fact, once you factor in compute costs and answer quality across problems of varying complexity, the overall benefit can sometimes turn out to be negative.

lowbloodsugar · 9 months ago

Human brains work the same way. Some of us are just better at analogy. I’ve worked with plenty of people who were unable to transfer knowledge of one area to another, identical area with different terms.

thomasahle · 9 months ago

All the environments the test (Tower of Hanoi, Checkers Jumping, River Crossing, Block World) could easily be solved perfectly by any of the LLMs if the authors had allowed it to write code.

I don't really see how this is different from "LLMs can't multiply 20 digit numbers"--which btw, most humans can't either. I tried it once (using pen and paper) and consistently made errors somewhere.

Jensson · 9 months ago

> I don't really see how this is different from "LLMs can't multiply 20 digit numbers"--which btw, most humans can't either. I tried it once (using pen and paper) and consistently made errors somewhere.

People made missiles and precise engineering like jet aircraft before we had computers, humans can do all of those things reliably just by spending more time thinking about it, inventing better strategies and using more paper.

Our brains weren't made to do such computations, but a general intelligence can solve the problem anyway by using what it has in a smart way.

Some specialized people could probably do 20x20, but I'd still expect them to make a mistake at 100x100. The level we needed for space crafts was much less than that, and we had many levels of checks to help catch errors afterwards.

I'd wager that 95% of humans wouldn't be able to do 10x10 multiplication without errors, even if we paid them $100 to get it right. There's a reason we had to invent lots of machines to help us.

It would be an interesting social studies paper to try and recreate some "LLMs can't think" papers with humans.

jdmoreira · 9 months ago

No. a huge population of humans did while standing on the shoulders of giants.

blither · 9 months ago

> if the authors had allowed it to write code.

Yeah, and FWIW doing this through writing code is trivial in an LLM / LRM - after testing locally took not even a minute to have a working solution no matter the amount of disks.

Your analogy makes sense, no reasonable person would try to solve a Tower of Hanoi type problem with e.g. 15 disks and sit there for 32,767 moves non-programmatically.

bwfan123 · 9 months ago

> but humans cant do it either

This argument is tired as it keeps getting repeated for any flaws seen in LLMs. And the other tired argument is: wait ! this is a sigmoid curve, and we have not seen the inflection point yet. If someone have me a penny for every comment saying these, I'd be rich by now.

Humans invented machines because they could not do certain things. All the way from simple machines in physics (Archimedes lever) to the modern computer.

> Humans invented machines because they could not do certain things.

If your disappointment is that the LLM didn't invent a computer to solve the problem, maybe you need to give it access to physical tools, robots, labs etc.

Xmd5a · 9 months ago

>Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents

>In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task.

https://arxiv.org/abs/2311.13373

someothherguyy · 9 months ago

> humans can't

The reasons humans can't and the reasons LLMs can't are completely different though. LLMs are often incapable of performing multiplication. Many humans just wouldn't care to do it.

PatronBernard · 9 months ago

>write code

Doesn't that come down to allowing it to directly regurgitate training data? Surely it's seen dozens of such solutions.

mjburgess · 9 months ago

The goal isnt to assess the LLM capability at solving any of those problems. The point isnt how good they are at block world puzzles.

The point is to construct non-circular ways of quantifying model performance in reasoning. That the LLM has access to prior exemplars of any given problem is exactly the issue in establishing performance in reasoning, over historical synthesis.

How are these problems more interesting than simple arithmetic or algorithmic problems?

Well that's because all these LLMs have memorized a ton of code bases with solutions to all these problems.