> What they delivered barely moved the needle on code generation, the one capability that everything else depends on.
I don't think that holds up. GPT-5 is wildly better at coding that GPT-4o was (and got even better with GPT-5-Codex). A lot of people have been ditching Claude for GPT-5 for coding stuff, and Anthropic held the throne for "best coding model" for well over a year prior to that.
From the conclusion:
> All [AI coding startups] betting on the same assumption: models will keep getting better at generating code. If that assumption is wrong, the entire market becomes a house of cards.
The models really don't need to get better at generating code right now for the economic impact to be profound. If progress froze today we could still spend the next 12+ months finding new ways to get better results for code out of our current batch of models.
There's also the pretty simple observation that the ability to chain tool calls this way is itself a profound model improvement that occurred, in public mainstream foundation models, only a year ago.
OP here. Thanks for the thoughtful reply. Curious if you’ve measured o1’s accuracy and token cost with tool use enabled vs disabled? Wondering if python sandbox gives higher accuracy and lower cost, since internal reasoning chains are longer and pricier.
[Edit] Its probably premature to argue without the above data, but if we assume tool use gives ~100% accuracy and reasoning-only ~90%, then that 10% gap might represent the loss in the probabilistic model: either from functional ambiguity in the model itself or symbolic ambiguity from tokenization?
I can't call o1 with the Python tool via the API, so I'll have to provide the price for the GPT-5 example in https://gist.github.com/simonw/c53c373fab2596c20942cfbb235af... - that one was 777 input tokens and 140 output tokens. Why 777 input tokens? That's a bit of a mystery to me - my assumption is that a bunch of extra system prompt stuff gets stuffed on describing that coding tool.
> A lot of people have been ditching Claude for GPT-5 for coding stuff, and Anthropic held the throne for "best coding model" for well over a year prior to that.
Even if GPT-5 was less capable of a coder than Claude, I'd still not use Claude because of its ridiculous quotas, context window restrictions, slowness, and Anthropic's pedantic stance on AI safety.
I've seen research that shows that starting with reasoning models, and fine-tuning to slowly remove the reasoning steps, allows you to bake the reasoning directly into the model weights in a strong sense. Here's a recent example, and you can see the digits get baked into a pentagonal prism in the weights, allowing accurate multi-digit multiplication without needing notes: https://arxiv.org/abs/2510.00184. So, reasoning and tool use could be the first step, to collect a ton of training data to do something like this fine-tuning process.
Yeah, I have to imagine animal brains are just giant Fourier transform engines under the hood, and humans brains have just evolved to make some frequencies more precise.
This is what I deem to be more comparable to human reasoning, although in this case it happens at an extremely slow timescale. Ideally real reasoning would have an impact on the weights, although that would be practically impossible(very impractical with the current model architectures) and especially if the conversation is had with the broader public.
Yeah, kind of. Though it seems like humans do some post-analysis on the reasoning and kind of "root cause" the path to the result. Fine tuning on the raw thought output seems like it could capture a ton of unnecessary noise. I don't know if it would be able to capture the "aha" moment and encode that alone in a way that makes sense.
That's interesting, though I wonder if what's "baked into the weights" is closer to intuition than reasoning. Once reasoning traces are distilled into weights, the model stops thinking through problems and starts pattern-matching answers. That feels more like a stochastic parrot with intuition than an analytical reasoner.
I'd guess it works like any form of learning a subject. The better you have internalized the fundamentals, the better you can perform at higher level tasks.
Though to that end, I wonder if the model "knows" that it "understands" the fundamentals better once it's been trained like this, or if when it has to do a large multiplication as part of a larger reasoning task, does it still break it down step by step.
> These are not model improvements. They're engineering workarounds for models that stopped improving.
One might characterize it as an improvement in the document-style which the model operates upon.
My favorite barely-a-metaphor is that the "AI" interaction is based on a hidden document that looks like a theater script, where characters User and Bot are having a discussion. Periodically, the make_document_longer(doc) function (the stateless LLM) is invoked to to complete more Bot lines. An orchestration layer performs the Bot lines towards the (real) user, and transcribes the (real) user's submissions into User dialogue.
Recent improvements? Still a theater-script, but:
1. Reasoning - The Bot character is a film-noir detective with a constant internal commentary, not typically "spoken" to the User character and thus not "performed" by the orchestration layer: "The case was trouble, but I needed to make rent, and to do that I had to remember it was Georgia the state, not the country."
2. Tools - There are more stage-directions, such as "Bot uses [CALCULATOR] inputting [sqrt(5)*pi] and getting [PASTE_RESULT_HERE]". Regular programs are written to parse the script, run tools, and then replace the result.
Meanwhile, the fundamental architecture and the make_document_longer(doc) haven't changed as much, hence the author's title of "not model improvement."*
Exactly. Both the theater script and code are metadata that manipulates entities: characters in a play or variables in memory. There's definitely abstract-level understanding emerging: that's why models can be trained on python, but write code in java. That could be instructions like pseudo-code or the hidden document/theater script you mentioned. That capability jump from GPT3 to o1 is real.
But my point is: pure metadata manipulation has hit a ceiling or is moving at a crawling pace since o1. The breakthrough applications (like agentic AI) still depend on the underlying model's ability to generate accurate code. When that capability plateaus, all the clever orchestration on top of it plateaus too.
Just to confirm, as this topic gets "very meta" with levels of indirection, it sounds like you mean the LLM appends a "fitting" document fragment like:
This was an unusual task Bot wasn't sure how to solve directly.
Bot decided it needed to execute a program:
[CODE_START]foo(bar(baz())[CODE_END]
Which resulted in
[CODE_RESULT_PLACEHOLDER]
This stage-direction is externally parsed, executed, and substituted, and then the LLM is called upon to generate Bot-character's next reaction.
In terms of how this could go wrong, it makes me think of a meme:
> Thinking quickly, Dave constructs a homemade megaphone, using only some string, a squirrel, and a megaphone.
Hi HN,
OP here. I'd appreciate feedback from folks with deep model knowledge on a few technical claims in the essay. I want to make sure I'm getting the fundamentals right.
1. On o1's arithmetic handling: I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full transparency into o1's internals. Is this accurate?
2. On model stagnation: I argue that fundamental model capabilities (especially code generation) have plateaued, and that tool orchestration is masking this. Do folks with hands-on experience building/evaluating models agree?
3. On alternative architectures: I suggest graph transformers that preserve semantic meaning at the word level as one possible path forward. For those working on novel architectures - what approaches look promising? Are graph-based architectures, sparse attention, or hybrid systems actually being pursued seriously in research labs?
1. You can enable or disable tool use in most APIs. Generally, tools such as web search and Python interpreter give models an edge. The same is true for humans, so, no surprise. At the frontier, model performance keeps climbing - both with tool use enabled and with it disabled.
2. Model capabilities keep improving. Frontier models of today are both more capable at their peak, and pack more punch for their weight, figuratively and literally. Capability per trained model weight and capability per unit of inference compute are both rising. This is reflected directly in model pricing - "GPT-4 level of performance" is getting cheaper over time.
3. We're 3 years into the AI revolution. If I had ten bucks for every "breakthrough new architecture idea" I've seen in a meanwhile, I'd be able to buy a full GB200 NVL72 with that.
As a rule: those "breakthroughs" aren't that. At best, they offer some incremental or area-specific improvements that could find their way into frontier models eventually. Think +4% performance across the board, or +30% to usable context length for the same amount of inference memory/compute, or a full generational leap but only in challenging image understanding tasks. There are some promising hybrid approaches, but none that do away with "autoregressive transformer with attention" altogether. So if you want a shiny new architecture to appear out of nowhere and bail you out of transformer woes? Prepare to be disappointed.
Question #1 was on the model's ability to handle arithmetic. The answer to question seems to be unrelated, at least to me: "you can enable or disable tool use in most APIs".
The original question still stands: do recent LLMs have an inherent knowledge of arithmetic, or do they have to offload the calculation to some other non-LLM system?
Reasoning model doesn't imply tool calling – those shouldn't be conflated.
Reasoning just means more implicit chain-of-thought. It can be emulated by non reasoning model by explicitly constructing prompt to perform longer step by step thought process. With reasoning models it just happens implicitly, some models allow for control over reasoning effort with special tokens. Those models are simply fine tuned to do it themselves without explicit dialogue from the user.
Tool calling happens primarily on the client side. Research/web access mode etc made available by some providers (based on tool calling that they handle themselves) is not a property of a model, can be enabled on any model.
Nothing plateaued from where I'm standing – new models are being trained, releases happen frequently with impressive integration speed. New models outperform previous ones. Models gain multi modality etc.
Regarding alternative architectures – there are new ones proposed all the time. It's not easy to verify all of them at scale. Some ideas that are extending current state of art architectures end up in frontier models - but it takes time to train so lag does exist. There are also a lot of improvements that are hidden from public by commercial companies.
All i can say as someone sorta lay is that math isn't an LLM's strength. Having them defer calculations to calculators or python code seems better than it guessing that 1+1 = 2 because it's past data says 1+1 = 2
I know this isn't using tools (e.g. the Python interpreter) because you have to turn those on explicitly. That's not actually supported for o1 in the API but you can do it for GPT-5 like this:
Note this bit where the code interpreter Python tool is called:
{
"id": "rs_080a5801ca14ad990068fa91f2779081a0ad166ee263153d98",
"type": "reasoning",
"summary": [
{
"type": "summary_text",
"text": "**Calculating large product**\n\nI see that I need to compute the product of two large numbers, which involves big integer multiplication. It\u2019s a straightforward task, and given that I can use Python, that seems like the best route to avoid any potential errors. The user specifically asked for this multiplication, so I\u2019ll go ahead and use the python tool for accurate analysis. Let\u2019s get started on that!"
}
]
},
{
"id": "ci_080a5801ca14ad990068fa91f4dbe481a09eb646af049541c6",
"type": "code_interpreter_call",
"status": "completed",
"code": "a = 87654321\r\nb = 98765432\r\na*b",
"container_id": "cntr_68fa91f12f008191a359f1eeaed561290c438cc21b3fc083",
"outputs": null
}
>I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full transparency into o1's internals. Is this accurate?
Both reasoning and non-reasoning models may choose to use the Python interpreter to solve math problems. This isn't hidden from the user; it will show the interpreter ("Analyzing...") and you can click on it to see the code it ran.
It can also solve math problems by working through them step-by-step. In this case it will do long multiplication using the pencil-and-paper method, and it will show its work.
I don't think 2 is true: when OpenAI model won a gold medal in the math olympiads, it did so without tools or web search, just pure inference. Such a feat definitely would not have happened with o1.
> Just to spell it out as clearly as possible: a next-word prediction machine (because that's really what it is here, no tools no nothing) just produced genuinely creative proofs for hard, novel math problems at a level reached only by an elite handful of pre‑college prodigies.
> For OpenAI, the models had access to a code execution sandbox, so they could compile and test out their solutions. That was it though; no internet access.
True, but aren't the math (and competitive programming) achievements a bit different? They're specific models heavily RL'd on competition math problems. Obviously still ridiculously impressive, but if you haven't done competition math or programming before it's much more memorization of techniques than you might expect and it's much easier to RL on.
Point 2 is 1000% not true, the models have both gotten better at the overall act of coding, but have also gotten WAY better at USING tools. This isn't tool orchestration frameworks, this is knowing how and when to use tools effectively and it is largely inside the model. I would also say this is a fundamental model capability.
This improved think->act->sense loop that they now form, exponentially increases the possible utility of the models. We are just starting to see this with gpt-5 and the 4+ series of Claude models.
Yes, the models have gotten better at using tools because tech companies have poured an insane amount of money into improving tools and integrating them with LLMs. Is this because the models have actually improved, or because the tools and integration methods have improved? I don’t think anyone actually knows.
Try running local LLMs like Qwen3 yourself. They can calculate accurately in their reasoning traces even if you don't give them access to coding tools. In fact, even mid-range models (32b params) under 4-bit quantization can perform pretty well. No need to make guesses, you can try it yourself!
I don't really know what you mean by "preserve semantic meaning at the word level." The significant misunderstanding about tokenization present elsewhere in the article is concerning, given that the proposed path forward is to do with replacing tokenization somehow.
Not affiliated with anyone, but I think the likes of OptNet (differentiable constraint optimization) are soon going to play a role in developing AI with precise deductive reasoning.
More broadly I think what we’re looking for at the end of the day, AGI, is going come about from a diaspora of methods capturing the diverse aspects of what we recognize as intelligence. ‘Precise deductive reasoning’ is one capability out of many. Attention isn’t all you need, neither is compression, convex programming, what have you. The perceived “smoothness” or “unity” of our intelligence is an illusion like virtual memory hiding cache, and building it is going to look a lot more like stitching these capabilities together than deriving some deep and elegant equation.
I just think that LLM calls are the new transistor. Transistors don't do much, but you build computers out of them. LLMs do a lot more than transistors.
LLMs are very good at imitating moderate-length patterns. It can usually keep an apparently sensible conversation going with itself for at least a couple thousand words before it goes completely off the rails, although you never know exactly when it will go off the rails; it's very unlikely to be after the first sentence, far more likely to be after the twenty-first, and will never get past the 50th. If you inject novel input in periodically (such as reminding and clarifying prompts), you can keep the plate spinning longer.
So some tricks work right now to extend the amount of time the thing can go before falling into the inevitable entropy that comes from talking to itself too long, and I don't think that we should assume that there won't ever be a way to keep the plate spinning forever. We may be able to do it practically (making it very unusual for them to fall apart), or somebody may come up with a way to make them provably resilient.
I don't know if the current market leaders have any insight into how to do this, however. But I'm also sure that an LLM reaching for a calculator and injecting the correct answer into the context keeps that context useful for longer than if it hadn't.
If you subscribe to extended mind theory and Merleau Ponty's brand of phenomenology, tools are just an extension of your cognitive process, and "shelling out" in this way is really to be expected of high intelligence, if not consciousness. Some would say it might even be a prerequisite for consciousness, that you need to be a being-in-the-world etc etc
Not to say that GPT is conscious, in its current form I think it certainly isn't, but rather I would say reasoning is a positive development, not an embarrassing one
I can't compute 297298*248 immediately in my head, and if I were to try it I'd have to hobble through a multiplicaion algorithm, in my head... it's quite simlar to what they're doing here, it's just they can wire it right into a real calculator instead of slowly running a shitty algo on wetware
Yeah humans have done this physically for very long. We have puny little teeth but have knives and butchering tools much better than any animal's teeth. We have little hair, but make clothes that can keep us warm in the arctic or even in space. We have an underdeveloped colon and digestive system, but instead we pre-digest the food by cooking it on fire. In some sense the stove is part of our digestive system, the jacket is part of our dermis (like the shell of a snail, except we build it through a different process), and we have external teeth in the form of utensils and stone tools.
Now, ideally, the LLMs could also design their own tools, when they realize there is a recurring task that can be accomplished better and more reliably by coding up a tool.
Heidegger ready-to-hand is another take on this same idea. Something I took to heart years ago and was a big part of my using and contributing to free software as much as possible. Proprietary software is a form of mind control along these lines of thought and I don't like that one bit.
I use thinking with claude cod extensively. It’s same as reasoning they just name it differently. It definitely helps, sometimes it feels like they came up with original thoughts and ideas
This article is pretty much all wrong. Reasoning is not tool calling.
Reasoning is about working through problems step-by-step. This is always going to be necessary for some problems (logic solving, puzzles, etc) because they have a known minimum time complexity and fundamentally require many steps of computation.
Bigger models = more width to store more information. Reasoning models = more depth to apply more computation.
> When you ask o1 to multiply two large numbers, it doesn't calculate. It generates Python code, executes it in a sandbox, and returns the result.
That's not true of the model itself, see my comment here which demonstrates it multiplying two large numbers via the OpenAI API without using Python: https://news.ycombinator.com/item?id=45683113#45686295
On GPT-5 it says:
> What they delivered barely moved the needle on code generation, the one capability that everything else depends on.
I don't think that holds up. GPT-5 is wildly better at coding that GPT-4o was (and got even better with GPT-5-Codex). A lot of people have been ditching Claude for GPT-5 for coding stuff, and Anthropic held the throne for "best coding model" for well over a year prior to that.
From the conclusion:
> All [AI coding startups] betting on the same assumption: models will keep getting better at generating code. If that assumption is wrong, the entire market becomes a house of cards.
The models really don't need to get better at generating code right now for the economic impact to be profound. If progress froze today we could still spend the next 12+ months finding new ways to get better results for code out of our current batch of models.
[Edit] Its probably premature to argue without the above data, but if we assume tool use gives ~100% accuracy and reasoning-only ~90%, then that 10% gap might represent the loss in the probabilistic model: either from functional ambiguity in the model itself or symbolic ambiguity from tokenization?
My o1 call in https://gist.github.com/simonw/a6438aabdca7eed3eec52ed7df64e... used 16 input tokens and produced 2357 output tokens (1664 were reasoning). At o1's price that's 14 cents! https://www.llm-prices.com/#it=16&ot=2357&ic=15&cic=7.5&oc=6...
I can't call o1 with the Python tool via the API, so I'll have to provide the price for the GPT-5 example in https://gist.github.com/simonw/c53c373fab2596c20942cfbb235af... - that one was 777 input tokens and 140 output tokens. Why 777 input tokens? That's a bit of a mystery to me - my assumption is that a bunch of extra system prompt stuff gets stuffed on describing that coding tool.
GPT-5 is hugely cheaper than o1 so that cost 0.22 cents (almost a quarter of a cent) - but if o1 ran with the same number of tokens it would only cost 1.94 cents: https://www.llm-prices.com/#it=777&ot=130&sel=gpt-5%2Co1-pre...
Even if GPT-5 was less capable of a coder than Claude, I'd still not use Claude because of its ridiculous quotas, context window restrictions, slowness, and Anthropic's pedantic stance on AI safety.
Though to that end, I wonder if the model "knows" that it "understands" the fundamentals better once it's been trained like this, or if when it has to do a large multiplication as part of a larger reasoning task, does it still break it down step by step.
One might characterize it as an improvement in the document-style which the model operates upon.
My favorite barely-a-metaphor is that the "AI" interaction is based on a hidden document that looks like a theater script, where characters User and Bot are having a discussion. Periodically, the make_document_longer(doc) function (the stateless LLM) is invoked to to complete more Bot lines. An orchestration layer performs the Bot lines towards the (real) user, and transcribes the (real) user's submissions into User dialogue.
Recent improvements? Still a theater-script, but:
1. Reasoning - The Bot character is a film-noir detective with a constant internal commentary, not typically "spoken" to the User character and thus not "performed" by the orchestration layer: "The case was trouble, but I needed to make rent, and to do that I had to remember it was Georgia the state, not the country."
2. Tools - There are more stage-directions, such as "Bot uses [CALCULATOR] inputting [sqrt(5)*pi] and getting [PASTE_RESULT_HERE]". Regular programs are written to parse the script, run tools, and then replace the result.
Meanwhile, the fundamental architecture and the make_document_longer(doc) haven't changed as much, hence the author's title of "not model improvement."*
In terms of how this could go wrong, it makes me think of a meme:
> Thinking quickly, Dave constructs a homemade megaphone, using only some string, a squirrel, and a megaphone.
1. On o1's arithmetic handling: I claim that when o1 multiplies large numbers, it generates Python code rather than calculating internally. I don't have full transparency into o1's internals. Is this accurate?
2. On model stagnation: I argue that fundamental model capabilities (especially code generation) have plateaued, and that tool orchestration is masking this. Do folks with hands-on experience building/evaluating models agree?
3. On alternative architectures: I suggest graph transformers that preserve semantic meaning at the word level as one possible path forward. For those working on novel architectures - what approaches look promising? Are graph-based architectures, sparse attention, or hybrid systems actually being pursued seriously in research labs?
Would love to know your thoughts!
1. You can enable or disable tool use in most APIs. Generally, tools such as web search and Python interpreter give models an edge. The same is true for humans, so, no surprise. At the frontier, model performance keeps climbing - both with tool use enabled and with it disabled.
2. Model capabilities keep improving. Frontier models of today are both more capable at their peak, and pack more punch for their weight, figuratively and literally. Capability per trained model weight and capability per unit of inference compute are both rising. This is reflected directly in model pricing - "GPT-4 level of performance" is getting cheaper over time.
3. We're 3 years into the AI revolution. If I had ten bucks for every "breakthrough new architecture idea" I've seen in a meanwhile, I'd be able to buy a full GB200 NVL72 with that.
As a rule: those "breakthroughs" aren't that. At best, they offer some incremental or area-specific improvements that could find their way into frontier models eventually. Think +4% performance across the board, or +30% to usable context length for the same amount of inference memory/compute, or a full generational leap but only in challenging image understanding tasks. There are some promising hybrid approaches, but none that do away with "autoregressive transformer with attention" altogether. So if you want a shiny new architecture to appear out of nowhere and bail you out of transformer woes? Prepare to be disappointed.
The original question still stands: do recent LLMs have an inherent knowledge of arithmetic, or do they have to offload the calculation to some other non-LLM system?
Reasoning just means more implicit chain-of-thought. It can be emulated by non reasoning model by explicitly constructing prompt to perform longer step by step thought process. With reasoning models it just happens implicitly, some models allow for control over reasoning effort with special tokens. Those models are simply fine tuned to do it themselves without explicit dialogue from the user.
Tool calling happens primarily on the client side. Research/web access mode etc made available by some providers (based on tool calling that they handle themselves) is not a property of a model, can be enabled on any model.
Nothing plateaued from where I'm standing – new models are being trained, releases happen frequently with impressive integration speed. New models outperform previous ones. Models gain multi modality etc.
Regarding alternative architectures – there are new ones proposed all the time. It's not easy to verify all of them at scale. Some ideas that are extending current state of art architectures end up in frontier models - but it takes time to train so lag does exist. There are also a lot of improvements that are hidden from public by commercial companies.
If you call the OpenAI API for o1 and ask it to multiply two large numbers it cannot use Python to help it.
Try this:
Here's what I got back just now: https://gist.github.com/simonw/a6438aabdca7eed3eec52ed7df64e...o1 correctly answered the multiplication by running a long multiplication process entirely through reasoning tokens.
Note this bit where the code interpreter Python tool is called:
> "tool_choice": "auto"
> "parallel_tool_calls": true
Can you remake the API call explicitly asking it to not perform any tool calls?
Deleted Comment
Deleted Comment
Both reasoning and non-reasoning models may choose to use the Python interpreter to solve math problems. This isn't hidden from the user; it will show the interpreter ("Analyzing...") and you can click on it to see the code it ran.
It can also solve math problems by working through them step-by-step. In this case it will do long multiplication using the pencil-and-paper method, and it will show its work.
Here's OpenAI's tweet about this: https://twitter.com/SebastienBubeck/status/19465776504050567...
> Just to spell it out as clearly as possible: a next-word prediction machine (because that's really what it is here, no tools no nothing) just produced genuinely creative proofs for hard, novel math problems at a level reached only by an elite handful of pre‑college prodigies.
My notes: https://simonwillison.net/2025/Jul/19/openai-gold-medal-math...
They DID use tools for the International Collegiate Programming Contest (ICPC) programming one though: https://twitter.com/ahelkky/status/1971652614950736194
> For OpenAI, the models had access to a code execution sandbox, so they could compile and test out their solutions. That was it though; no internet access.
This improved think->act->sense loop that they now form, exponentially increases the possible utility of the models. We are just starting to see this with gpt-5 and the 4+ series of Claude models.
More broadly I think what we’re looking for at the end of the day, AGI, is going come about from a diaspora of methods capturing the diverse aspects of what we recognize as intelligence. ‘Precise deductive reasoning’ is one capability out of many. Attention isn’t all you need, neither is compression, convex programming, what have you. The perceived “smoothness” or “unity” of our intelligence is an illusion like virtual memory hiding cache, and building it is going to look a lot more like stitching these capabilities together than deriving some deep and elegant equation.
LLMs are very good at imitating moderate-length patterns. It can usually keep an apparently sensible conversation going with itself for at least a couple thousand words before it goes completely off the rails, although you never know exactly when it will go off the rails; it's very unlikely to be after the first sentence, far more likely to be after the twenty-first, and will never get past the 50th. If you inject novel input in periodically (such as reminding and clarifying prompts), you can keep the plate spinning longer.
So some tricks work right now to extend the amount of time the thing can go before falling into the inevitable entropy that comes from talking to itself too long, and I don't think that we should assume that there won't ever be a way to keep the plate spinning forever. We may be able to do it practically (making it very unusual for them to fall apart), or somebody may come up with a way to make them provably resilient.
I don't know if the current market leaders have any insight into how to do this, however. But I'm also sure that an LLM reaching for a calculator and injecting the correct answer into the context keeps that context useful for longer than if it hadn't.
Not to say that GPT is conscious, in its current form I think it certainly isn't, but rather I would say reasoning is a positive development, not an embarrassing one
I can't compute 297298*248 immediately in my head, and if I were to try it I'd have to hobble through a multiplicaion algorithm, in my head... it's quite simlar to what they're doing here, it's just they can wire it right into a real calculator instead of slowly running a shitty algo on wetware
Now, ideally, the LLMs could also design their own tools, when they realize there is a recurring task that can be accomplished better and more reliably by coding up a tool.
Reasoning is about working through problems step-by-step. This is always going to be necessary for some problems (logic solving, puzzles, etc) because they have a known minimum time complexity and fundamentally require many steps of computation.
Bigger models = more width to store more information. Reasoning models = more depth to apply more computation.