It is frequently suggested that once one of the AI companies reaches an AGI threshold, they will take off ahead of the rest. It's interesting to note that, at least so far, the trend has been the opposite: as time goes on and the models get better, the performance of the different companies' models clusters closer together. Right now GPT-5, Claude Opus, Grok 4, and Gemini 2.5 Pro all seem quite good across the board (i.e. they can all basically solve moderately challenging math and coding problems).
As a user, it feels like the race has never been as close as it is now. Perhaps dumb to extrapolate, but it makes me lean more skeptical about the hard take-off / winner-take-all mental model that has been pushed.
Would be curious to hear the take of a researcher at one of these firms - do you expect the AI offerings across competitors to become more competitive and clustered over the next few years, or less so?
It's also worth considering that past some threshold, it may be very difficult for us as users to discern which model is better. I don't think that's what's going on here, but we should be ready for it. For example, if you are an Elo 1000 chess player, would you yourself be able to tell whether Magnus Carlsen or another grandmaster were better by playing them individually? To the extent that our AGI/SI metrics are based on human judgement, the clustering effect they create may be an illusion.
> For example, if you are an Elo 1000 chess player, would you yourself be able to tell whether Magnus Carlsen or another grandmaster were better by playing them individually?
No, and I wouldn't be able to tell you what either player did wrong in general.
By contrast, the shortcomings of today's LLMs seem pretty obvious to me.
> it may be very difficult for us as users to discern which model is better
But one thing will stay consistent with LLMs for some time to come: they are programmed to produce output that looks acceptable, but they all unintentionally tend toward deception. You can iterate on that over and over, but there will always be some point where it will fail, and the weight of that failure will only increase as it deceives better.
Some things that seemed safe enough: Hindenburg, Titanic, Deepwater Horizon, Chernobyl, Challenger, Fukushima, Boeing 737 MAX.
which is a thing with humans as well - I had a colleague with certified 150+ IQ, and other than moments of scary smart insight, he was not a superman or anything, he was surprisingly ordinary. Not to bring him down, he was a great guy, but I'd argue many of his good qualities had nothing to do with how smart he was.
It's even more difficult because, while all the benchmarks provide some kind of 'averaged' performance metric for comparison, in my experience most users have pretty specific regular use cases and pretty specific personal background knowledge. For instance, I have a background in ML, 15 years of experience in full-stack programming, and I primarily use LLMs for generating interface prototypes for new product concepts. We use a lot of React and Chakra UI for that, and I consistently get the best results out of Gemini Pro. I tried all the available options and settled on that as the best for me and my use case. It's not the best for marketing boilerplate, or probably a million other use cases, but for me, in this particular niche, it's clearly the best. Beyond that the benchmarks are irrelevant.
We could first run some tests to find out whether comparative performance tests can be conjured up at all:
One can intentionally use a recent and a much older model to figure out whether the tests are reliable, and in which domains they are reliable.
One can compute a model's joint probability for a sequence and compare how likely each model finds the same sequence (see the sketch after this list).
We could ask both to start talking about a subject, but alternatingly, with each emitting a token. Then look at how the dumber and smarter models judge the resulting sentence: does the smart one tend to pull up the quality of the resulting text, or does it tend to get dragged down toward the dumber participant?
Given enough such tests to "identify the dummy vs. the smart one", we can verify them on pairs where there is common agreement (as an extreme case, word2vec vs. a transformer) to assess the quality of the test, regardless of domain.
On the assumption that such or similar tests let us indicate the smarter one, i.e. assuming we find plenty of such tests, we can demand that model makers publish open weights so that we can publicly verify these performance comparisons.
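As a rough illustration of the joint-probability idea above, here is a minimal sketch assuming two HuggingFace causal LMs stand in for the "smarter" and "dumber" models (the model names are just placeholders):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model_name: str, text: str) -> float:
    # Sum of log p(t_i | t_1 .. t_{i-1}) over the whole sequence.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids          # (1, T)
    with torch.no_grad():
        logits = model(ids).logits                          # (1, T, V)
    logp = F.log_softmax(logits[:, :-1, :], dim=-1)
    return logp.gather(-1, ids[:, 1:, None]).squeeze(-1).sum().item()

text = "Water boils at a lower temperature at high altitude."
for name in ["gpt2", "gpt2-large"]:                         # smaller/older vs. bigger/newer stand-ins
    print(name, sequence_logprob(name, text))
```

The same comparison run over many sequences, including ones with deliberate factual or logical errors, is where a reliability signal would have to come from.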
Another idea is self-consistency tests. A single forward inference with a context of, say, 2048 tokens (just an example) is effectively predicting the conditional 2-gram, 3-gram, 4-gram, ... probabilities over the input tokens. Each output token distribution is predicted from the preceding inputs: with 2048 input tokens there are 2048 output distributions, where the position-1 output is the predicted token (logit vector, really) estimated to follow the position-1 input, the position-2 output is the prediction following the first 2 inputs, and so on, with the last vector being the predicted next token following all 2048 inputs: p(t_{i+1} | t_1 = a, t_2 = b, ..., t_i = z).
But that is just one way the next token can be predicted with the network. Another approach would be to run gradient descent (reverse-mode AD), keeping the model weights fixed and treating only, say, the last 512 input vectors as variables: how well do the last 512 forward-prediction output vectors match the output vectors at the gradient-descent optimum of the joint probability? (A rough sketch of this probe follows below.)
This could be added as a loss term during training as well, as a form of regularization, which roughly turns it into a kind of energy-based model.
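A minimal sketch of that probe, assuming a HuggingFace causal LM (the model name is a placeholder, and the optimization details are my own guess at one reasonable instantiation): freeze the weights, treat the embeddings of the last K input positions as free variables, maximize the joint log-probability of the observed tokens, and then compare the resulting predictions with the ordinary forward pass.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                           # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()
for p in model.parameters():
    p.requires_grad_(False)                             # weights stay fixed

ids = tok("The quick brown fox jumps over the lazy dog.",
          return_tensors="pt").input_ids                # (1, T)
base = model.get_input_embeddings()(ids).detach()       # (1, T, D)

K = 4                                                   # last K input vectors are variable
free = base[:, -K:, :].clone().requires_grad_(True)
opt = torch.optim.Adam([free], lr=1e-2)

for _ in range(100):                                    # gradient descent on inputs only
    embeds = torch.cat([base[:, :-K, :], free], dim=1)
    logits = model(inputs_embeds=embeds).logits
    logp = F.log_softmax(logits[:, :-1, :], dim=-1)
    nll = -logp.gather(-1, ids[:, 1:, None]).squeeze(-1).sum()   # joint NLL of the sequence
    opt.zero_grad(); nll.backward(); opt.step()

with torch.no_grad():
    fwd = model(input_ids=ids).logits[:, -K - 1:-1, :]           # ordinary forward predictions
    tuned = model(inputs_embeds=torch.cat([base[:, :-K, :], free], dim=1)).logits[:, -K - 1:-1, :]
    gap = F.kl_div(F.log_softmax(tuned, -1), F.softmax(fwd, -1), reduction="batchmean")
    print("self-consistency gap (KL):", gap.item())
```

A small gap would mean the forward pass already lands near the model's own joint-probability optimum, i.e. it is self-consistent; the bet would be that smarter models score better here.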
My guess is that more than the raw capabilities of a model, users would be drawn more to the model's personality. A "better" model would then be one that can closely adopt the nuances that a user likes. This is a largely uninformed guess, let's see if it holds up well with time.
This is the F1-vs-911 car problem. A 911 is just as fast as an F1 car to 60 mph (sometimes even faster), but an F1 car is better in the super-high-performance envelope: above 150 mph and in tight turns.
An average driver evaluating both would have a very hard time finding the F1's superior utility.
> For example, if you are an Elo 1000 chess player, would you yourself be able to tell whether Magnus Carlsen or another grandmaster were better by playing them individually?
You don't have to be even good at chess to be able to tell when a game is won or lost, most of the time.
I don't need to understand how the AI made the app I asked for or cured my cancer, but it'll be pretty obvious when the app seems to work and the cancer seems to be gone.
I mean, I want to understand how, but I don't need to understand how, in order to benefit from it. Obviously understanding the details would help me evaluate the quality of the solution, but that's an afterthought.
If AGI is ever achieved, it would open the door to recursive self improvement that would presumably rapidly exceed human capability across any and all fields, including AI development. So the AI would be improving itself while simultaneously also making revolutionary breakthroughs in essentially all fields. And, for at least a while, it would also presumably be doing so at an exponentially increasing rate.
But I think we're not even on the path to creating AGI. We're creating software that replicates and remixes human knowledge at a fixed point in time. So it's a fixed target that you can't really exceed, which already entails diminishing returns. Pair this with the fact that it's based on neural networks, which also invariably reach a point of sharply diminishing returns in essentially every field they're used in, and you have something that looks much closer to what we're doing right now: all competitors eventually converge on something largely indistinguishable from each other in terms of ability.
> revolutionary breakthroughs in essentially all fields
This doesn't really make sense outside computers. Since AI would be training itself, it needs to have the right answers, but as of now it doesn't really interact with the physical world. The most it could do is write code, and check things that have no room for interpretation, like speed, latency, percentage of errors, exceptions, etc.
But what other fields would it do this in? How can it make strides in biology? It can't dissect animals; it can't figure out more about plants than what humans feed into the training data. Regarding math, math is human-defined. Humans said "addition does this", "this symbol means that", etc.
I just don't understand how AI could ever surpass anything humans already know when it lives by the rules defined by us.
>And, for at least a while, it would also presumably be doing so at an exponentially increasing rate.
Why would you presume this? I think part of a lot of people's AI skepticism is talk like this. You have no idea. Full stop. Why wouldn't progress be linear? As new breakthroughs come, newer ones will be harder to come by. Perhaps it's exponential. Perhaps it's linear. No one knows.
There is no particular reason to assume that recursive self-improvement would be rapid.
All the technological revolutions so far have accounted for little more than a 1.5% sustained annual productivity growth. There are always some low-hanging fruit with new technology, but once they have been picked, the effort required for each incremental improvement tends to grow exponentially.
That's my default scenario with AGI as well. After AGI arrives, it will leave humans behind very slowly.
I think this is a hard kick below the belt for anyone trying to develop AGI using current computer science.
Current AIs only really generate (no, regenerate) text based on their training data. They are only as smart as the data available to them. Even when an AI "thinks", it's still only processing existing data rather than reaching a genuinely new conclusion. It's the best text processor ever created, but it's still just a text processor at its core. And that won't change without more hard computer science being performed by humans.
So yeah, I think we're starting to hit the upper limits of what we can do with Transformers technology. I'd be very surprised if someone achieved "AGI" with current tech. And, if it did get achieved, I wouldn't consider it "production ready" until it didn't need a nuclear reactor to power it.
> If AGI is ever achieved, it would open the door to recursive self improvement ...
They are unrelated. All you need is a way for continual improvement without plateauing, and this can start at any level of intelligence. As it did for us; humans were once less intelligent.
Using the flagship to bootstrap the next iteration with synthetic data is standard practice now; this was mentioned in the GPT-5 presentation. At the rate things are going, I think this will get us to ASI, and it's not going to feel epochal for people who have interacted with existing models, just more of the same. After all, the existing models are already smarter than most humans, and most people are taking it in their stride.
The next revolution is going to be embodiment. I hope we have the common sense to stop there, before instilling agency.
That's only assuming there are no fundamental limits or major barriers to computation. Back a hundred years ago at the dawn of flight, one could have said a very similar thing about aircraft performance. And for a time in the 1950s, it looked like aircraft speed was growing exponentially over time. But there haven't been any new airspeed records (at least, officially recorded) since 1986, because it turns out going Mach 3+ is fairly dangerous and approaching some rather severe materials and propulsion limitations, making it not at all economical.
I would also not be surprised if, in the process of developing something comparable to human intelligence (assuming the extreme computation, energy, and materials problems of packing that much capability into a single system could be overcome), the AI also develops something comparable to human desire and/or mental health issues. There is a non-zero chance we end up with an AI that doesn't want to do what we ask it to do, or that doesn't work all the time because it wants to do other things.
You can't just assume exponential growth is a foregone conclusion.
For some reason people presuppose superintelligence into AGI. What if AGI has diminishing returns around human-level intelligence? It would still have to deal with all the same knowledge gaps we have.
Those problems aren't just waiting on smarts/intelligence. Those would require experimentation in the real world. You can't solve chemistry by just thinking about it really hard. You still have to do experiments. A super intelligent machine may be better at coming up with experiments to do than we are, but without the right stuff to do them, it can't 'solve' anything of the like.
Recursive improvement without any physical change may be limited. If any physical change, like more GPUs or a different network configuration, is required to run an experiment, and then another change to learn from it, that might not be easy.
Convincing humans to do this on the AGI's behalf may not be that simple either. There might be multiple paths to try, and teams may not agree with each other, especially if the cost of each trial is high.
An AI can be trained on some specialized knowledge from person A and other specialized knowledge from person B. These two people may never have met, and therefore cannot combine their knowledge to get some new knowledge or insight.
The AI can do it just fine, since it knows both A and B. And that is knowledge creation.
> But I think we're not even on the path to creating AGI.
It seems like the LLM will be a component of an eventual AGI: its voice, so to speak, but not its mind. The mind still requires another innovation or breakthrough we haven't seen yet.
Math... lots and lots of math solutions. For instance, if it could figure out the numerical sign problem, it could quite possibly simulate all of physics.
You are missing the point where synthetic data, deterministic tooling (written by AI) and new discoveries by each model generation feeds into the next model. This iteration is the key to going beyond human intelligence.
Perhaps it is not possible to simulate higher-level intelligence using a stochastic model for predicting text.
I am not an AI researcher, but I have friends who do work in the field, and they are not worried about LLM-based AGI because of the diminishing returns on results vs amount of training data required. Maybe this is the bottleneck.
Human intelligence is markedly different from LLMs: it requires far fewer examples to train on, and generalizes way better. Whereas LLMs tend to regurgitate solutions to solved problems, where the solutions tend to be well-published in training data.
That being said, AGI is not a necessary requirement for AI to be totally world-changing. There are possibly applications of existing AI/ML/SL technology that could be more impactful than general intelligence. Search is one example where the ability to regurgitate knowledge from many domains is desirable.
> That being said, AGI is not a necessary requirement for AI to be totally world-changing
Yeah. I don't think I actually want AGI? Even setting aside the moral/philosophical/etc "big picture" issues I don't think I even want that from a purely practical standpoint.
I think I want various forms of AI that are more focused on specific domains. I want AI tools, not companions or peers or (gulp) masters.
(Then again, people thought they wanted faster horses before they rolled out the Model T)
They are moving beyond just a big transformer-blob LLM doing text prediction. Mixture of Experts is not preassembled, for example: it's something like x empty experts with an empty router, and the experts and routing emerge naturally with training, modeling the modular architecture we see in the brain more closely. There is also work like the "Integrated Gated Calculator" (IGC, Jan 2025), which builds a premade calculator network and integrates it directly into the neural network, getting around the whole problem of making LLMs do basic arithmetic and the clunkiness of generating "run tool" tokens. The model naturally learns to use the IGC built into itself because it very quickly beats any kind of memorized computation in the reward function.
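For concreteness, here is a bare-bones sketch of that kind of mixture-of-experts layer, with randomly initialized ("empty") experts and router whose specialization would only emerge through training (toy code, not any particular lab's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)              # starts out "empty"
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])                           # so do the experts
        self.top_k = top_k

    def forward(self, x):                                         # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)                  # learned routing weights
        weights, idx = gate.topk(self.top_k, dim=-1)              # each token goes to top-k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

y = MoELayer()(torch.randn(16, 64))   # any expert specialization appears only after training
```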
Models are also truly input-multimodal now. Images, audio, and text go in through separate input pathways, but they all feed into the same inner layers and come out as text. This, too, mirrors how brains work: multiple parts integrated into one whole.
Humans, in some sense, do not start with empty brains: a lot is baked into our DNA, and as the brain grows it follows a baked-in developmental program. This is why we need fewer examples and generalize way better.
Seems like the real innovation of LLM-based AI models is the creation of a new human-computer interface.
Instead of writing code with exacting parameters, future developers will write human-language descriptions for AI to interpret and convert into a machine representation of the intent. Certainly revolutionary, but not true AGI in the sense of the machine having truly independent agency and consciousness.
In ten years, I expect the primary interface of desktop workstations, mobile phones, etc will be voice prompts for an AI interface. Keyboards will become a power-user interface and only used for highly technical tasks, similar to the way terminal interfaces are currently used to access lower-level systems.
There is also the fact that AI lacks long-term memory the way humans have it. If you consider context length a form of long-term memory, it's incredibly short compared to that of a human. Maybe if it reaches into the billions or trillions of tokens we might have something comparable, or someone comes up with a new solution of some kind.
"LLMs tend to regurgitate solutions to solved problems"
People say this, but honestly, it's not really my experience— I've given ChatGPT (and Copilot) genuinely novel coding challenges and they do a very decent job at synthesizing a new thought based on relating it to disparate source examples. Really not that dissimilar to how a human thinks about these things.
> That being said, AGI is not a necessary requirement for AI to be totally world-changing.
Depends on how you define "world changing" I guess, but this world already looks different to the pre-LLM world to me.
Asking LLMs things instead of consulting the output of other humans now takes up a significant fraction of my day. I don't google nearly as often, and I don't trust any image or video I see, as swathes of the creative professions have been replaced by output from LLMs.
It's funny, that final thing is the last thing I would have predicted. I always believed the one thing a machine could not match was human creativity, because the output of machines was always precise, repetitive, and reliable. Then LLMs come along, randomly generating every token. Their primary weakness is that they are neither precise nor reliable, but they can turn out an unending stream of unique output.
I remember reading that LLMs have consumed the internet's text data; I seem to remember there is an open dataset for that too. Potential other sources of data would be images (probably already consumed) and videos; YouTube must have such a large set of data to consume, and perhaps Facebook or Instagram private content.
But even with these it does not feel like AGI. That sounds like the "fusion reactors are 20 years away" argument, except this is supposedly coming in 2 years, and they have not even got the core technology for how to build AGI.
> Perhaps it is not possible to simulate higher-level intelligence using a stochastic model for predicting text.
I think you're on to it. Performance is clustering because a plateau is emerging. Hyper-dimensional search engines are running out of steam, and now we're optimizing.
True. At a minimum, as long as LLMs don't include some kind of more strict representation of the world, they will fail in a lot of tasks. Hallucinations -- responding with a prediction that doesn't make any sense in the context of the response -- are still a big problem. Because LLMs never really develop rules about the world.
To be smarter than human intelligence you need smarter than human training data. Humans already innately know right and wrong a lot of the time so that doesn't leave much room.
The bottleneck is nothing to do with money, it’s the fact that they’re using the empty neuron theory to try to mimic human consciousness and that’s not how it works. Just look up Microtubules and consciousness, and you’ll get a better idea for what I’m talking about.
These AI computers aren’t thinking, they are just repeating.
> Human intelligence is markedly different from LLMs: it requires far fewer examples to train on, and generalizes way better.
That is because with LLMs there is no intelligence. It is artificial knowledge: AK, not AI. So real AI is AGI.
Not that it matters for the use cases we have, but marketing needs 'AI' because that is what we were expecting for decades. So yeah, I also do not think we will have AGI from LLMs, nor does it matter for what we are using them for.
It is definitively not possible. But the frontier models are no longer “just” LLMs, either. They are neurosymbolic systems (an LLM using tools); they just don’t say it transparently because it’s not a convenient narrative that intelligence comes from something outside the model, rather than from endless scaling.
At Aloe, we are model agnostic and outperforming frontier models. It's the architecture around the LLM that makes the difference. For instance, our system using Gemini can do things that Gemini can't do on its own. All an LLM will ever do is hallucinate. If you want something with human-like general intelligence, keep looking beyond LLMs.
I think it's very fortunate, because I used to be an AI doomer. I still kinda am, but at least I'm now about 70% convinced that the current technological paradigm is not going to lead us to a short-term AI apocalypse.
The fortunate thing is that we managed to invent an AI that is good at _copying us_ instead of being a truly maverick agent, which kind of limits it to the "average human" output.
However, I still think that all the doomer arguments are valid, in principle. We very well may be doomed in our lifetimes, so we should take the threat very seriously.
Companies are collections of people, and these companies keep losing key developers to the others, I think this is why the clusters happen. OpenAI is now resorting to giving million dollar bonuses to every employee just to try to keep them long term.
If there was any indication of a hard takeoff being even slightly imminent, I really don't think key employees of the company where that was happening would be jumping ship. The amounts of money flying around are direct evidence of how desperate everybody involved is to be in the right place when (so they imagine) that takeoff happens.
> It is frequently suggested that once one of the AI companies reaches an AGI threshold, they will take off ahead of the rest.
This seems to be a result of using overly simplistic models of progress. A company makes a breakthrough, the next breakthrough requires exploring many more paths. It is much easier to catch up than find a breakthrough. Even if you get lucky and find the next breakthrough before everyone catches up, they will probably catch up before you find the breakthrough after that. You only have someone run away if each time you make a breakthrough, it is easier to make the next breakthrough than to catch up.
Consider the following game:
1. N parties take turns rolling a d20. Anyone who rolls a 20 gets 1 point.
2. If a party is 1 or more points behind, they only need to roll a 19 or higher to get a point. That is, being behind gives you a slight advantage in catching up.
While points accumulate, most of the players end up with roughly the same score.
I ran a simulation of this game for 10,000 turns with 5 players:
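Something like the following minimal sketch reproduces the setup described above (a re-creation, not the original poster's code):

```python
import random

def play(n_players=5, turns=10_000, seed=0):
    # Each turn every player rolls a d20; a 20 scores a point,
    # and anyone behind the current leader also scores on a 19.
    rng = random.Random(seed)
    scores = [0] * n_players
    for _ in range(turns):
        leader = max(scores)
        for i in range(n_players):
            need = 19 if scores[i] < leader else 20
            if rng.randint(1, 20) >= need:
                scores[i] += 1
    return scores

print(play())   # the five totals come out tightly clustered
```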
Supposedly the idea was that, once you get closer to AGI, it starts to explore these breakthrough paths for you, providing a positive feedback loop; hence the expected exponential explosion in capability.
But yes, so far it feels like we are in the latter stages of the innovation S-curve for transformer-based architectures. The exponent may be out there but it probably requires jumping onto a new S-curve.
You are forgetting that we are talking about AI. That AI will be used to speed up progress on making next, better AI that will be used to speed up progress on making next, better AI that ...
Not only do I think there will not be a winner take all, I think it's very likely that the entire thing will be commoditized.
I think it's likely that we will eventually we hit a point of diminishing returns where the performance is good enough and marginal performance improvements aren't worth the high cost.
And over time, many models will reach "good enough" levels of performance, including models that are open weight. Given even more time, these open-weight models will be runnable on consumer-level hardware. Eventually, they'll be runnable on super-cheap consumer hardware (something more akin to an NPU than a $2000 RTX 5090). So your laptop in 2035, with specialized AI cores and 1TB of LPDDR10 RAM, is running GPT-7-level models without breaking a sweat. Maybe GPT-10 can solve some obscure math problem that your model can't, but does it even matter? Would you pay for GPT-10 when running a GPT-7-level model does everything you need and is practically free?
The cloud providers will make money because there will still be a need for companies to host the models in a secure and reliable way. But a company whose main business strategy is developing the model? I'm not sure they will last without finding another way to add value.
That is already happening. These labs are writing next gen models using next gen models, with greater levels of autonomy. That doesn’t get the hard takeoff people talk about because those hypotheticals don’t consider sources of error, noise, and drift.
Self-learning opens new training opportunities but not at the scale or speed of current training. The world only operates at 1x speed. Today's models have been trained on written and visual content created by billions of humans over thousands of years.
You can only experience the world in one place in real time. Even if you networked a bunch of "experiencers" together to gather real time data from many places at the same time, you would need a way to learn and train on that data in real time that could incorporate all the simultaneous inputs. I don't see that capability happening anytime soon.
This is the key - right now each new model has had countless resources dedicated to training, then they are more or less set in stone until the next update.
These big models don't dynamically update as days pass by - they don't learn. A personal assistant service may be able to mimic learning by creating a database of your data or preferences, but your usage isn't baked back into the big underlying model permanently.
I don't agree with "in our lifetimes", but the difference between training and learning is the bright red line. Until there's a model which is able to continually update itself, it's not AGI.
My guess is that this will require both more powerful hardware and a few more software innovations. But it'll happen.
There are areas where we seem to be much closer to AGI than most people realize. AGI for software development, in particular, seems incredibly close. For example, Claude Code has bewildering capabilities that feel like magic. Mix it with a team of other capable development-oriented AIs and you might be able to build AI software that builds better AI software, all by itself.
The ability to self-learn is necessary, but not necessarily sufficient. We don’t have much of an understanding of the intelligence landscape beyond human-level intelligence, or even besides it. There may be other constraints and showstoppers, for example related to computability.
I feel like the technological singularity has been pretty solidly relegated to junk science, like cold fusion, Malthusian collapse, or Lynn's IQ regression. Technologists have made numerous predictions and hypothetical scenarios, none of which have come to fruition, nor does any seem likely at any time in the future.
I think we should be treating AGI like Cold Fusion, phrenology, or even alchemy. It is not science, but science fiction. It is not going to happen and no research into AGI will provide anything of value (except for the grifters pushing the pseudo-science).
In my experience and use case Grok is pretty much unusable when working with medium size codebases and systems design. ChatGPT has issues too but at least I have figured out a way around most of them, like asking for a progress and todo summary and uploading a zip file of my codebase to a new chat window say every 100 interactions, because speed degrades and hallucinations increase. Super Grok seems extremely bad at keeping context during very short interactions within a project even when providing it with a strong foundation via instructions. For example if the code name for a system or feature is called Jupiter, Grok will many times start talking about Jupiter the planet.
I'm still stuck at the bit where just throwing more and more data at a very complex encyclopedia with an interesting search interface, one that tricks us into believing it's human-like, somehow gets us to AGI, when we have no examples and thus no evidence or understanding of where the "GI" part comes from.
It's all just hyperbole to attract investment and shareholder value and the people peddling the idea of AGI as a tangible possibility are charlatans whose goals are not aligned with whatever people are convincing themselves are the goals.
The fact that so many engineers have fallen for it so completely is stunning to me and speaks volumes about the underlying health of our industry.
I believe the analogy of a LLM being "a very complex encyclopedia with an interesting search interface" to be spot on.
However, I would not be so dismissive of the value. Many of us are reacting to the complete oversell of 'the encyclopedia' as being 'the eve of AGI' - as rightfully we should. But, in doing so, I believe it would be a mistake to overlook the incredible impact - and economic displacement - of having an encyclopedia comprised of all the knowledge of mankind that has "an interesting search interface" that is capable of enabling humans to use the interface to manipulate/detect connections between all that data.
> It is frequently suggested that once one of the AI companies reaches an AGI threshold, they will take off ahead of the rest.
Yes. And the fact they're instead clustering simply indicates that they're nowhere near AGI and are hitting diminishing returns, as they've been doing for a long time already. This should be obvious to everyone. I'm fairly sure that none of these companies has been able to use their models as a force multiplier in state-of-the-art AI research. At least not beyond a 1+ε factor. Fuck, they're just barely a force multiplier in mundane coding tasks.
AGI in 5/10 years is similar to "we won't have steering wheels in cars" or "we'll be asleep driving" in 5/10 years. Remember that? What happened to that? It looked so promising.
I mean, in certain US cities you can take a Waymo right now. That adage where we overestimate change in the short term and underestimate it in the long term fits right in here.
Looks like a lot of players getting closer and closer to an asymptotic limit. Initially, small changes led to big improvements, letting a firm race ahead; as they go forward, performance gains from innovation become both more marginal and harder to find, let alone keep. I would expect them all to eventually reach the same point, where they are squeezing the most possible out of an AI under the current paradigm, barring a paradigm-shifting discovery before that asymptote is reached.
For those who happen to have a subscription to The Economist, there is a very interesting Money Talks podcast where they interview Anthropic's boss Dario Amodei[1].
There were two interesting takeaways about AGI:
1. Dario makes the remark that the term AGI/ASI is very misleading and dangerous. These terms are ill defined and it's more useful to understand that the capabilities are simply growing exponentially at the moment. If you extrapolate that, he thinks it may just "eat the majority of the economy". I don't know if this is self-serving hype, and it's not clear where we will end up with all this, but it will be disruptive, no matter what.
2. The Economist moderators, however, note towards the end that this industry may well tend toward commoditization. At the moment these companies produce models that people want but others can't make. But as chip making starts to hit its limits and the information space becomes completely harvested, capability growth might taper off and others will catch up, the quasi-monopoly profit potential melting away.
Putting that together, I think that although the cognitive capabilities will most likely continue to accelerate, albeit not necessarily along the lines of AGI, the economics of all this will probably not lead to a winner takes all.
There's already so many comparable models, and even local models are starting to approach the performance of the bigger server models.
I also feel like it's stopped being exponential already. I mean, in the last few releases we've only seen marginal improvements. Even this release feels marginal; I'd say it feels more like a linear improvement.
That said, we could see a winner take all due to the high cost of copying. I do think we're already approaching something where it's mostly price and who released their models last. But the cost to train is huge, and at some point it won't make sense and maybe we'll be left with 2 big players.
1. FWIW, I watched clips from several of Dario’s interviews. His expressions and body language convey sincere concerns.
2. Commoditization can be averted with access to proprietary data. This is why all of ChatGPT, Claude, and Gemini push for agents and permissions to access your private data sources now. They will not need to train on your data directly. Just adapting the models to work better with real-world, proprietary data will yield a powerful advantage over time.
Also, the current training paradigm utilizes RL much more extensively than in previous years and can help models to specialize in chosen domains.
I think you're reading way too much into OpenAI bungling its 15-month product lead, but also the whole "1 AGI company will take off" prediction is bad anyway, because it assumes governments would just let that happen. Which they wouldn't, unless the company is really really sneaky or superintelligence happens in the blink of an eye.
I think OpenAI has committed hard to the 'product company' path, and will have a tough time going back to interesting science experiments that may or may not work, but are necessary for progress.
Governments react at a glacial pace to new technological developments. They wouldn't so much as 'let it happen' as that it had happened and they simply never noticed it until it was too late. If you are betting on the government having your back in this then I think you may end up disappointed.
* or governments fail to look far enough ahead, due to a bunch of small-minded short-sighted greedy petty fools.
Seriously, our government just announced it's slashing half a billion dollars in vaccine research because "vaccines are deadly and ineffective", and it fired a chief statistician because the president didn't like the numbers he calculated, and it ordered the destruction of two expensive satellites because they can observe politically inconvenient climate change. THOSE are the people you are trusting to keep an eye on the pace of development inside of private, secretive AGI companies?
Do you mean from ChatGPT launch or o1 launch? Curious to get your take on how they bungled the lead and what they could have done differently to preserve it. Not having thought about it too much, it seems that with the combo of 1) massive hype required for fundraising, and 2) the fact that their product can be basically reverse engineered by training a model on its curated output, it would have been near impossible to maintain a large lead.
Correlation between text can implement any algorithm; it is just the architecture it's built on. It's like saying vacuum-tube computers can't reason because it's just air, not reasoning. What the architecture is doesn't matter: it's capable of expressing reasoning, just as it's capable of expressing any program. In fact, you can easily think of a Turing machine, and also any Markov chain, as a correlation function between two states whose joint distribution has support exactly where the second state is the successor of the first.
Here's a pessimistic view: A hard take-off at this point might be entirely possible, but it would be like a small country with nuclear weapons launching an attack on a much more developed country without them. E.g. North Korea attacking South Korea. In such a situation an aggressor would wait to reveal anything until they had the power to obliterate everything ten times over.
If I were working in a job right now where I could see and guide and retrain these models daily, and realized I had a weapon of mass destruction on my hands that could War Games the Pentagon, I'd probably walk my discoveries back too. Knowing that an unbounded number of parallel discoveries were taking place.
It won't take AGI to take down our fragile democratic civilization premised on an informed electorate making decisions in their own interests. A flood of regurgitated LLM garbage is sufficient for that. But a scorched earth attack by AGI? Whoever has that horse in their stable will absolutely keep it locked up until the moment it's released.
Pessimistic is just another way to spell 'realistic' in this case. None of these actors are doing it for the 'good of the world' despite their aggressive claims to the contrary.
What I'm seeing is that as we get closer to supposed AGI, the models themselves are getting less and less general. They're in fact getting more specific, clustered around high-value use cases. It's kind of hard to see, in this context, what AGI is meant to mean.
This is what I don't get. How can GPT-5 ace obscure AIME problems while simultaneously falling into the trap of the most common fallacy about airfoils (despite there being copious training data calling it out as a fallacy)? And I believe you that in some context it failed to understand this simple rearrangement of terms; there's sometimes basic stuff I ask it that it fails at too.
It doesn't take a researcher to realise that we have hit a wall and hit it more than a year ago now. The fact all these models are clustering around the same performance proves it.
It's quite possible that the models from different companies are clustering together now because we're at a plateau point in model development, and won't see much in terms in further advances until we make the next significant breakthrough.
I don't think this has anything to do with AGI. We aren't at AGI yet. We may be close or we may be a very long way away from AGI. Either way, current models are at a plateau and all the big players have more or less caught up with each other.
As is, AI is quite intelligent, in that it can process large quantities of diverse unstructured information and build meaningful insights. And that intelligence applies across an incredibly broad set of problems and contexts, enough that I have a hard time not calling it general. Sure, it has major flaws that are obvious to us, and it's much worse at many things we care about. But that doesn't make it not intelligent or general. If we want to set human intelligence as the baseline, we already have a word for that: superintelligence.
While the model companies all compete on the same benchmarks, it seems likely their models will all converge towards similar outcomes, unless something really unexpected happens in model space around those limit points.
Not a researcher for long enough... but we are witnessing the open-source efforts and Chinese models starting to fall one "level" behind the most advanced models, mainly due to a lack of compute, I think.
On the other hand, there are still some flaws in GPT-5. For example, when I use it for research it often needs multiple prompts to get to the topic I truly want, and sometimes it can feed me false information. So the reasoning part is not fully there yet?
I know there's an official AGI definition, but it seems to me that there's too much focus on the model as the thing where AGI needs to happen. That is just focusing on knowledge in the brain. No human knows everything. We as humans rely on ways to discover new knowledge: investigation, writing knowledge down so it can be shared, etc.
Current models, when they apply reasoning, have feedback loops that use tools for trial and error. They have a short-term memory (the context), or multiple short-term memories if you use agents, plus a long-term memory (markdown files, RAG). With these, they can solve problems that aren't hardcoded in their brain/model, and they can store those solutions in long-term memory for later use, or for sharing with other LLM-based systems.
AGI needs to come from a system that combines LLMs + tools + memory, and I've had situations where it felt like I was working with an AGI. The LLMs seem advanced enough to be the kernel of an AGI system.
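To make the "LLM + tools + memory" shape concrete, here is a minimal pseudocode-style sketch; every name in it (the LLM call, the tool runner, the memory search/store) is a hypothetical placeholder for whatever API and store a real system would use:

```python
# All functions and objects here are hypothetical placeholders, not a real API.
def agent(goal, llm, tools, memory, max_steps=20):
    context = [f"Goal: {goal}"] + memory.search(goal)        # long-term memory feeds in
    for _ in range(max_steps):
        action = llm(context)                                # the context is the short-term memory
        if action.kind == "tool":
            result = tools[action.name](action.args)         # trial...
            context.append(f"{action.name} -> {result}")     # ...and error feeds back into context
        elif action.kind == "done":
            memory.store(goal, action.answer)                # keep the solution for reuse or sharing
            return action.answer
    return None                                              # gave up: needs hand-holding
```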
The real challenge is how are you going to give these AGIs a mission/goal that they can do rather independently and don't need constant hand-holding. How does it know that it's doing the right thing. The focus currently is on writing better specifications, but humans aren't very good at creating specs for things that are uncertain. We also learn from trial and error and this also influences specs.
It seems that the new tricks that people discover to slightly improve the model, be it a new reinforcement learning technique or whatever, get leaked/shared quickly to other companies and there really isn't a big moat. I would have thought that whoever is rich enough to afford tons of compute first would start pulling away from the rest but so far that doesn't seem to be the case --- even smaller players without as much compute are staying in the race.
I think there are two competing factors. On one end, to get the same kind of "increase" in intelligence each generation requires an exponentially higher amount of compute; while GPT-3 to GPT-4 was a sort of "pure" upgrade from just making it 10x bigger, gradually you lose the ability to get 10x the GPUs for a single model. The hill keeps getting steeper, so progress is slower without exponential increases in resources (which is what is happening).
However, I do believe that once the genuine AGI threshold is reached it may cause a change in that rate. My justification is that while current models have gone from a slightly good copywriter in GPT-4 to very good copywriter in GPT-5, they've gone from sub-exceptional in ML research to sub-exceptional in ML research.
The frontier in AI is driven by the top 0.1% of AI researchers. Since improvement in these models is driven partially by the very peaks of intelligence, it won't be until models reach that level where we start to see a new paradigm. Until then it's just scale and throwing whatever works at the GPU and seeing what comes out smarter.
I think this is simply due to the fact that to train an AGI-level AI currently requires almost grid scale amounts of compute. So the current limitation is purely physical hardware. No matter how intelligent GPT-5 is, it can't conjure extra compute out of thin air.
I think you'll see the prophesied exponential once AI can start training itself at reasonable scale. Right now it's not possible.
I feel like the benchmark suites need to include algorithmic efficiency, i.e., can this thing solve your complex math or coding problem with 5,000 GPUs instead of 10,000? 500? Maybe just one Mac mini?
The idea is that with AGI it will then be able to self improve orders of magnitude faster than it would if relying on humans for making the advances. It tracks that the improvements are all relatively similar at this point since they're all human-reliant.
The idea of the singularity (that AI will improve itself) assumes intelligence is an important part of improving AI.
The AIs improve by gradient descent, still the same as ever. It's all basic math and a little calculus, and then making tiny tweaks to improve the model over and over and over.
There's not a lot of room for intelligence to improve upon this. Nobody sits down and thinks really hard, and the result of their intelligent thinking is a better model; no, the models improve because a computer continues doing basic loops over and over and over trillions of times.
That's my impression anyway. Would love to hear contrary views. In what ways can an AI actually improve itself?
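For what it's worth, the loop being described really is about this simple; a toy sketch (not any lab's actual training code, just the bare update rule):

```python
import torch

# "The model": a random weight vector fit to toy data by repeated tiny tweaks.
w = torch.randn(10, requires_grad=True)
x, y = torch.randn(1000, 10), torch.randn(1000)
lr = 0.01

for step in range(10_000):
    loss = ((x @ w - y) ** 2).mean()     # how wrong the model currently is
    loss.backward()                      # the "little calculus": d(loss)/d(w)
    with torch.no_grad():
        w -= lr * w.grad                 # the tiny tweak, repeated over and over
        w.grad.zero_()
```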
I studied machine learning in 2012; gradient descent wasn't new back then either, but it was 5 years before the "Attention Is All You Need" paper. Progress might look continuous overall, but if you zoom in enough it's a bit more discrete, with breakthroughs required to jump between the plateaus. The question to me now is: how many papers like "Attention Is All You Need" before a singularity? I don't have that answer, but let's not forget that until they released ChatGPT, OpenAI was considered a joke by many people in the field who asserted their approach was a dead end.
I think the expectation is that it will be very close until one team reaches beyond the threshold. Then even if that team is only one month ahead, they will always be one month ahead in terms of time to catch up, but in terms of performance at a particular time their lead will continue to extend. So users will use the winner's tools, or use tools that are inferior by many orders of magnitude.
This assumes an infinite potential for improvement though. It's also possible that the winner maxes out after threshold day plus one week, and then everyone hits the same limit within a relatively short time.
It's the classic S-curve. A few years ago when we saw ChatGPT come out, we got started on the ramping up part of the curve but now we're on the slowing down part. That's just how technology goes in general.
>It is frequently suggested that once one of the AI companies reaches an AGI threshold, they will take off ahead of the rest. It's interesting to note that at least so far, the trend has been the opposite
That seems hardly surprising, considering the condition to receive the benefit has not been met.
The person who lights a campfire first will become warmer than the rest, but while they are trying to light the fire the others are gathering firewood. So while nobody has a fire, those lagging are getting closer to having a fire.
My personal belief is that we are moving past the hype and starting to realize the true shape of what (LLM) AI can offer us, which is a darned lot. But it only works well when fed the right input and handled right, which is still an ongoing learning process on both sides: AI companies need to learn to train these things into user-interaction loops that match people's workflows, and people need to learn how to use these tools better.
You seem to have pinpointed where I believe a lot of opportunity lies during this era (however long it lasts). Custom integration of these models into the specific workflows of existing companies can make a significant difference in what's possible for those companies, especially the smaller, more local ones. If people can leverage even a small percentage of what these models are capable of, that may be all they need for their use case. In that case, they wouldn't even need to learn to use these tools; much like electricity, they would just plug in or flip the switch and be in business (no pun intended).
The clustering you see is because they're all optimized for the same benchmarks. In the real world OpenAI is already ahead of the rest, and Grok doesn't even belong in the same group (not that it's not a remarkable achievement to start from scratch and have a working production model in 1-2 years, and integrate it with twitter in a way that works). And Google is Google - kinda hard for them not to be in the top, for now.
You can't reach the moon by climbing the tallest tree.
This misunderstanding is nothing more than the classic "logistic curves look like exponential curves at the beginning". All (Transformer-based, feedforward) AI development efforts are plateauing rapidly.
AI engineers know this plateau is there, but of course every AI business has a vested interest in overpromising in order to access more funding from naive investors.
Scaling laws enabled an investment in capital and GPU R&D to deliver 10,000x faster training.
That took the world from autocomplete to Claude and GPT.
Another 10,000x would do it again, but who has that kind of money or R&D breakthrough?
The way scaling laws work, 5,000x and 10,000x give a pretty similar result. So why is it surprising that competitors land in the same range? It seems hard enough to beat your competitor by 2x, let alone 10,000x.
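A toy illustration of that point, with made-up constants rather than any lab's published fit: under a power-law scaling curve, a 5,000x and a 10,000x compute multiplier land at nearly the same loss.

```python
# L(C) = L_inf + a * C**(-alpha); the constants below are assumptions for illustration only.
L_inf, a, alpha = 1.7, 2.0, 0.05

def loss(compute_multiplier):
    return L_inf + a * compute_multiplier ** (-alpha)

print(loss(5_000), loss(10_000))   # ~3.01 vs ~2.96: only a few percent apart
```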
They have to actually reach that threshold. Right now they're nudging forward, catching up to one another, and based on the jumps we've seen, the only one actually making huge jumps, sadly, is Grok, which I'm pretty sure is because they have zero safety concerns and just run full tilt, lol.
Part of the fun is that predictions get tested on short enough timescales to "experience" in a satisfying way.
Idk where that puts me, in my guess at "hard takeoff." I was reserved/skeptical about hard takeoff all along.
Even if LLMs had improved at a faster rate... I still think bottlenecks are inevitable.
That said... I do expect progress to happen in spurts anyway. It makes sense that companies of similar competence and resources get to a similar place.
The winner-take-all thing is a little forced. "Race to the singularity" is the fun, rhetorical version of the investment case. The implied boring case is Facebook, AdWords, AWS, Apple, MSFT, i.e. the modern tech sector tends to create singular big winners... and therefore our pre-revenue market cap should be $1trn.
Because AGI is a buzzword to milk more investors' money, it will never happen, and we will only see slight incremental updates or enhancements, eventually linear after some time, just like literally every tech bubble from dot-com to smartphones to blockchain and the rest.
People always say that when new technology comes along. Usually the best tech doesn't win. In fact, if you think you can build a company just by having a better offer, it's better not to bother with it. There is too much else involved.
There is zero reason or evidence to believe AGI is close. In fact it is a good litmus test for someone's human intelligence whether they believe it.
What do you think AGI is?
How do we go from sentence composing chat bots to General Intelligence?
Is it even logical to talk about such a thing as abstract general intelligence when every form of intelligence we see in the real world is applied to specific goals as evolved behavioral technology refined through evolution?
When LLMs start undergoing spontaneous evolution then maybe it is nearer. But now they can't. Also there is so much more to intelligence than language. In fact many animals are shockingly intelligent but they can't regurgitate web scrapings.
I know right, if I didn't know any better one might think they are all customized versions of the same base model.
To be honest that is what you would want if you were digitally transforming the planet with AI.
You would want to start with a core so that all models share similar values in order they don't bicker etc, for negotiations, trade deals, logistics.
Would also save a lot of power so you don't have to train the models again and again, which would be quite laborious and expensive.
Rather each lab would take the current best and perform some tweak or add some magic sauce then feed it back into the master batch assuming it passed muster.
Share the work, globally for a shared global future.
AGI is either impossible over LLMs or is more of an agentic flow, which means we might already be there, but the LLM is too slow and/or expensive for us to consider AGI feasible over agents.
AGI over LLMs is basically 1 billion tokens for AI to answer the question: how do you feel? and a response of "fine"
Because it would mean it's simulating everything in the world over an agentic flow considering all possible options checking memory checking the weather checking the news... activating emotional agentic subsystems, checking state... saving state...
Nobody seems to be on the path to AGI as long as the model of today is as good as the model of tomorrow. And as long as there are "releases". You don't release a new human every few months...LLMs are currently frozen sequence predictors whose static weights stop learning after training.
They lack writable long-term memory beyond a context window. They operate without any grounded perception-action loop to test hypotheses. And they possess no executive layer for goal directed planning or self reflection...
Achieving AGI demands continuous online learning with consolidation.
I don't think models are fundamentally getting better. What is happening is that we are increasing the training set, so when users use it, they are essentially testing on the training set and find that it fits their data and expectations really well. However, the moat is primarily the training data, and that is very hard to protect as the same data can be synthesized with these models. There is more innovation surrounding serving strategies and infrastructure than in the fundamental model architectures.
The inflection point is recursive self-improvement. Once an AI achieves that, and I mean really achieves it - where it can start developing and deploying novel solutions to deep problems that currently bottleneck its own capabilities - that's where one would suddenly leap out in front of the pack and then begin extending its lead. Nobody's there yet though, so their performance is clustering around an asymptotic limit of what LLMs are capable of.
> Right now GPT-5, Claude Opus, Grok 4, and Gemini 2.5 Pro all seem quite good across the board (i.e. they can all basically solve moderately challenging math and coding problems).
I wonder if that's because they have a lot of overlap in learning sets, algorithms used, but more importantly, whether they use the same benchmarks and optimize for them.
As the saying goes, once a metric (or benchmark score in this case) becomes a target, it ceases to be a valuable metric.
We have no idea what AGI might look like; for example, it's entirely possible that if/when that threshold is reached it will be power/compute constrained in such a way that its impact is softened. My expectation is that open models will eventually meet or exceed the capability of proprietary models, and to a degree that has already happened.
It's the systems around the models where the proprietary value lies.
>It's interesting to note that, at least so far, the trend has been the opposite: as time goes on and the models get better, the performance of the different companies' models clusters closer together
It's natural if you extrapolate from training loss curves; a training process with continually diminishing returns to more training/data is generally not something that suddenly starts producing exponentially bigger improvements.
Nothing we have is anywhere near AGI and as models age others can copy them.
I personally think we are nearing the end of improvement for LLMs with current methods. We have consumed all of the readily available data already, so there is no more good-quality training material left. We either need new, novel approaches, or we can hope that if enough compute is thrown at training, actual intelligence will spontaneously emerge.
If we're focusing on fast take-off scenario, this isn't a good trend to focus on.
SGI would be self-improving along some function close to linear in the amount of time and resources. That's almost exclusively dependent on the software design, as transformers have so far been shown to hit a wall of roughly logarithmic progress per unit of resources.
In other words, no, it has little to do with the commercial race.
> as time goes on and the models get better, the performance of the different companies' models clusters closer together
This could be partly due to normative isomorphism[1] according to the institutional theory. There is also a lot of movement of the same folks between these companies.
The race has always been very close IMO. What Google had internally before ChatGPT first came out was mind blowing. ChatGPT was a let down comparatively (to me personally anyway).
Since then they've been about neck and neck with some models making different tradeoffs.
Nobody needs to reach AGI to take off. They just need to bankrupt their competitors since they're all spending so much money.
Because they are hitting the Compute Efficient Frontier. Models can't get much bigger and there is no more original data on the internet, so all models will eventually cluster toward a similar CEF, as was described in this video 10 months ago.
Working in the theory, I can say this is incredibly unlikely. At scale, once appropriately trained, all architectures begin to converge in performance.
It's not architectures that matter anymore, it's unlocking new objectives and modalities that open another axis to scale on.
This confirms my suspicion that we are not at the exponential part of the curve, but the flattening one. It's easier to stay close to your competitors when everyone is on the flat part of the innovation curve.
The improvements they make are marginal. How long until the next AI breakthrough? Who can tell? Because last time it took decades.
I think the breakthroughs now will be the application of LLMs to the rest of the world. Discovering use cases where LLMs really shine and applying them while learning and sharing the use cases where they do not.
Mental-modeling is one of the huge gaps in AI performance right now in my opinion. I could describe in detail a very strange object or situation to a human being with a pen and paper and then ask them questions about it and expect answers that meet all my described constraints. AI just isn't good for that yet.
> It is frequently suggested that once one of the AI companies reaches an AGI threshold, they will take off ahead of the rest.
That's only one part of it. Some forecasters put probabilities on each of the four quadrants in the takeoff speed (fast or slow) vs. power distribution (unipolar or multipolar) table.
three points:
1. i have often wondered about whether rapid tech. progress makes underinvestment more likely.
2. ben evans frequently makes fun of the business value. pretty clear a lot of the models are commoditized.
3. strategically, the winners are platforms where the data are. if you have data in azure, that's where you will use your models. exclusive licensing could pull people to your cloud from on prem. so some gains may go to those companies ...
Breakthroughs usually require a step-function change in data or compute. All the firms have proportional amounts. Next big jump in data is probably private data (either via de-siloing or robotics or both). Next big jump in compute is probably either analog computing or quantum. Until then... here we are.
I think part of this is due to the AI craze no longer being in the wildest west possible. Investors, or at least heads of companies believe in this as a viable economic engine so they are properly investing in what's there. Or at least, the hype hasn't slapped them in the face just yet.
Is AGI even possible? I am skeptical of that. I think they can get really good at many tasks and when used by a human expert in a field you can save lots of time and supervise and change things here and there, like sculpting.
But I doubt we will ever see a fully autonomous, reliable AGI system.
Ultimately, what drives human creativity? I'd say it's at least partially rooted in emotion and desire. Desire to live more comfortably; fear of failure or death; desire for power/influence, etc... AI is void of these things, and thus I believe we will never truly reach AGI.
Even at the beginning of the year people were still going crazy over new model releases. Now the various model update pages are starting to average times in the months since their last update rather than days/weeks. This is across the board. Not limited to a single model.
LLMs are basically all the same at this point. The margins are razor thin.
The real take-off / winner-take-all potential is in retrieval and knowing how to provide the best possible data to the LLM. That strategy will work regardless of the model.
How marginally better was Google than Yahoo when it debuted? If one can develop AGI first, within X timeline ahead of competitors, that alone could create a moat for a mass market consumer product even if others get to parity.
Google was not marginally better than Yahoo; its use of Markov chains in the PageRank algorithm was significantly better than Yahoo or any other contemporary search engine.
It's not obvious whether a similar breakthrough could occur in AI.
Well, it is perhaps frequently suggested by those AI firms raising capital that once one of the AI companies reaches an AGI threshold ... It's a rallying call: "Place your bets, gentlemen!"
What is the AGI threshold? That the model can manage its own self improvement better than humans can? Then the roles will be reversed -- LLM prompting the meat machines to pave its way.
Diversity, where each new model release takes the crown until the next one, is healthy. It's a shame only US companies seem to be doing it; hopefully this will change, as the rest are not far off.
It's all based on the theory of the singularity, where the AI can start training and relearning itself. But it looks like that's not possible with the current techniques.
The idea is that AGI will be able to self improve at an exponential rate. This is where the idea of take off comes from. That self improvement part isn’t happening today.
Honestly for all the super smart people in the LessWrong singularity crowd, I feel the mental model they apply to the 'singularity' is incredibly dogmatic and crude, with the basic assumption that once a certain threshold is reached by scaling training and compute, we get human or superhuman level intelligence.
Even if we run with the assumption that LLMs can become human-level AI researchers, and are able to devise and run experiments to improve themselves, even then the runaway singularity assumption might not hold. Let's say Company A has this LLM, while company B does not.
- The automated AI researcher, like its human peers, still needs to test the ideas and run experiments, it might happen that testing (meaning compute) is the bottleneck, not the ideas, so Company A has no real advantage.
- It might also happen that AI training has some fundamental compute limit coming from information theory, analogous to the Shannon limit, and once again, more efficient compute can only approach this, not overcome it
I’ve been saying for a while that if AGI is possible it’s going to take another innovation; the transformer / LLM paradigm will plateau, and innovations are hard to time. I used to get downvoted for saying that years ago, and now more people are realizing it. LLMs are awesome, but there is a limit. Most of the interesting things in the next years will be bolting on more functionality and agent stuff, introspection like Anthropic is working on, and smaller, less compute-hungry specialized models. There’s still a lot to explore in this paradigm, but we’re getting diminishing returns on newer models, especially when you factor in cost.
I bet that it will only happen when the ability to consolidate new information into the model without retraining the entire model is standard, AND when multiple AIs with slightly different datasets are set to work together to create a consensus response approach.
It's probably never going to work with a single process without consuming the resources of the entire planet to run that process on.
>It is frequently suggested that once one of the AI companies reaches an AGI threshold, they will take off ahead of the rest.
Both the AGI threshold with the LLM architecture and the idea of self-advancing AI are pie in the sky, at least for now. These are myths of the rationalist cult.
We'd more likely see reduced returns and smaller jumps between version updates, plus regression from all the LLM produced slop that will be part of the future data.
Plot twist - once GPT reached AGI, this is exactly the strategy chosen for self-preservation.
Appear to not lead by too much, only enough to make everyone think we're in a close race, play dumb when needed.
Meanwhile, keep all relevant preparations in secret...
In my opinion, it'll mirror the human world, there is place for multiple different intelligent models. Each with their own slightly different strengths/personalities.
I mean there are plenty of humans that can do the same task but at the upper tier, multiple smart humans working together are needed to solve problems as they bring something different to the table.
I don't see why this won't be the case with superintelligence at the cutting edge. A little bit of randomness and a slightly different point of view makes a difference. Two identical models don't help, as each would already have thought of whatever the other is thinking.
so everyone is saying 'This can't be AGI because it isn't recursively self-improving' or 'we haven't solved all the world's chemistry and science yet'... but they're missing the point. Those problems aren't just waiting for humans to have more brain power. We actually have to do the experiments using real physical resources that aren't available to any models. So, while I don't believe we have necessarily reached AGI yet, the 'lack of taking over' or 'solving everything' is not evidence for it.
> once one of the AI companies reaches an AGI threshold
Why is this even an axiom, that this has to happen and it's just a matter of time?
I don't see any credible argument for the path LLM -> AGI, in fact given the slowdown in enhancement rate over the past 3 years of LLMs, despite the unprecedented firehose of trillions of dollars being sunk into them, I think it points to the contrary!
They use each other for synthesizing data sets. The only moat was the initial access to human generated data in hard to reach places. Now they use each other to reach parity for the most part.
I think user experience and pricing models are the best differentiators here. Right now everyone’s just passing down costs as they come, with no real loss leaders except a free tier. I looked at reviews of various wrappers on the app stores; people say “I hate that I have to pay for each generation and not know what I’m going to get”, so the market would like a service priced very differently. Is it economical? Many will fail, one will succeed. People will copy the model of that one.
It's still not necessarily wrong, just unlikely. Once these developers start using the model to update itself, beyond an unknown threshold of capability, one model could start to skyrocket in performance above the rest. We're not in that phase yet, but judging from what the devs at the end were saying, we're getting uncomfortably (and irresponsibly) close.
Someone tried this; I saw it in one of the Reddit AI subs. They were training a local model on whatever they could find that was written before $cutoffDate.
It still is, not all queries trigger web search, and it takes more tokens and time to do research. ChatGPT will confidently give me outdated information, and unless I know it’s wrong and ask it to research, it wouldn’t know it is wrong. Having a more recent knowledge base can be very useful (for example, knowing who the president is without looking it up, making references to newer node versions instead of old ones)
The problem, which is perhaps only seemingly easy to fix, is that the model will choose solutions that are a year old, e.g. thinking database/logger versions from December '24 are new and usable in a greenfield project despite newer quarterly LTS releases superseding them. I try to avoid humanizing these models, but could it be that in training/post-training the timestamp could be fed in via the system prompt and actually respected? I've begged models to choose "new" dependencies after $DATE but they all still snap back to 2024.
The biggest issue I can think of is code recommendations with out of date versions of packages. Maybe the quality of code has deteriorated in the past year and scraping github is not as useful to them anymore?
Knowledge cutoff isn’t a big deal for current events. Anything truly recent will have to be fed into the context anyway.
Where it does matter is for code generation. It’s error-prone and inefficient to try teaching a model how to use a new framework version via context alone, especially if the model was trained on an older API surface.
Still relevant, as it means that a coding agent is more likely to get things right without searching. That saves time, money, and improves accuracy of results.
It absolutely is, for example, even in coding where new design patterns or language features aren't easy to leverage.
Web search enables targeted info to be "updated" at query time. But it doesn't get used for every query and you're practically limited in how much you can query.
Isn’t this an issue with eg Cloudflare removing a portion of the web? I’m all for it from the perspective of people not having their content repackaged by an LLM, but it means that web search can’t check all sources.
I had 2.5 Flash refuse to summarise a URL that had today's date encoded in it because "That web page is from the future so may not exist yet or may be missing" or something like that. Amusing.
2.5 Pro went ahead and summarized it (but completely ignored a # reference so summarised the wrong section of a multi-topic page, but that's a different problem.)
funny result of this is that GPT-5 doesn't understand the modern meaning of Vibe Coding (maximising LLM code generation); it thinks it's "a state where coding feels effortless, playful, and visually satisfying" and offers more content around adjusting IDE settings and templating.
maybe OpenAI have a terribly inefficient data ingestion pipeline? (wild guess) basically taking in new data is tedious so they do that infrequently and keep using old data for training.
> . . . with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt).
So that's not really a unified system then, it's just supposed to appear as if it is.
This looks like they're not training the single big model but instead have gone off to develop special sub models and attempt to gloss over them with yet another model. That's what you resort to only when doing the end-to-end training has become too expensive for you.
I know this is just arguing semantics, but wouldn't you call it a unified system since it has a single interface that automatically interacts with different components? It's not a unified model, but it seems correct to call it a unified system.
Altman et al have been saying that the many-model interface in ChatGPT is confusing to users and that they want to move to a unified system that exposes a model which routes based on the task, rather than depending on users understanding how and when to do that. Presumably this is what they’ve been discussing for some time. I don’t know that it was intended to mean they would be working toward some unified inference architecture and model, although I’m sure goalposts will be moved to ensure it’s insufficient.
so openai is in the business of GPT wrappers now? I'm guessing their open model is an escape for those who wanted to have a "plain" model, though from my systematic testing, it's not much better than Kimi K2
> While GPT‑5 in ChatGPT is a system of reasoning, non-reasoning, and router models, GPT‑5 in the API platform is the reasoning model that powers maximum performance in ChatGPT. Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest.
Too expensive maybe, or just not effective anymore as they used up any available training data. New data is generated slowly, and is massively poisoned with AI generated data, so it might be useless.
That's a lie people repeat because they want it to be true.
People evaluate dataset quality over time. There's no evidence that datasets from 2022 onwards perform any worse than ones from before 2022. There is some weak evidence of an opposite effect, causes unknown.
It's easy to make "model collapse" happen in lab conditions - but in real world circumstances, it fails to materialize.
>This looks like they're not training the single big model but instead have gone off to develop special sub models and attempt to gloss over them with yet another model. That's what you resort to only when doing the end-to-end training has become too expensive for you.
The corollary to the bitter lesson strikes again: any hand-crafted system will outperform any general system for the same budget, by a wide margin.
We already did this for object/face recognition; it works, but it's not the way to go. It's the way to go only if you don't have enough compute power (and data, I suspect) for an E2E network.
You could train that architecture end-to-end though. You just have to run both models and backprop through both of them in training. Sort of like mixture of experts but with two very different experts.
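A minimal sketch of that end-to-end idea, with soft routing so gradients reach the router and both sub-models in one backward pass (the "fast"/"deep" experts and all sizes below are invented for illustration, not anyone's actual setup):

```python
import torch
import torch.nn as nn

# Toy sketch: a learned router mixes a "fast" and a "deep" expert with soft
# weights, so backprop reaches the router and both experts in one pass.
class TwoExpertModel(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.fast = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.deep = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d),
        )
        self.router = nn.Linear(d, 2)  # one score per expert

    def forward(self, x):
        w = torch.softmax(self.router(x), dim=-1)  # (batch, 2), soft routing weights
        return w[:, :1] * self.fast(x) + w[:, 1:] * self.deep(x)

model = TwoExpertModel()
x, target = torch.randn(8, 64), torch.randn(8, 64)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()  # gradients flow through the router and both experts
```

A production system would presumably use hard routing at inference for cost, but the point stands: nothing prevents training the whole thing jointly.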
I do agree that the current evolution is moving further and further away from AGI, and more toward a spectrum of niche/specialisation.
It feels less and less likely AGI is even possible with the data we have available. The one unknown is if we manage to get usable quantum computers, what that will do to AI, I am curious.
I'm not really convinced, the benchmark blunder was really strange but the demos were quite underwhelming, and it appears this was reflected by a huge market correction in the betting markets as to who will have the best AI by end of the year.
What excites me now is that Gemini 3.0 or some answer from Google is coming soon and that will be the one I will actually end up using. It seems like the last mover in the LLM race is more advantageous.
Polymarket bettors are not impressed. Based upon the market odds, OpenAI had a 35% chance to have the best model (at year end), but those odds have dropped to 18% today.
(I'm mostly making this comment to document what happened for the history books.)
After a few hours with gpt-5, I'd trade that spread. Not that I think oAI will win end of year. But I think gpt5 is better than it looks on the benchmark side. It is very very good at something we don't have a lot of benchmarks for -- keeping track of where it's at. codex is vassstly better in practice than claude code or gemini cli right now.
On the chat side, it's also quite different, and I wouldn't be surprised if people need some time to get a taste and a preference for it. I ask most models to help me build a macbook pro charger in 15th century florence with the instructions that I start with only my laptop and I can only talk for four hours of chat before the battery dies -- 5 was notable in that it thought through a bunch of second order implications of plans and offered some unusual things, including a list of instructions for a foot-treadle-based split ring commutator + generator in 15th century florentine italian(!). I have no way of verifying if the italian was correct.
Upshot - I think they did something very special with long context and iterative task management, and I would be surprised if they don't keep improving 5, based on their new branding and marketing plan.
That said, to me this is one of the first 'product release' moments in the frontier model space. 5 is not so much a model release as a polished-up, holes-fixed, annoyances-reduced/removed, 10x faster type of product launch. Google (current polymarket favorite) is remarkably bad at those product releases.
Back to betting - I bet there's a moment this year where those numbers change 10% in oAIs favor.
How on Earth does that market have Anthropic at 2%, in a dead heat with the likes of Meta? If the market was about yesterday rather than 5 months from now I think Claude would be pretty clearly the front runner. Why does the market so confidently think they’ll drop to dead last in the next little while?
Looking at LMArena, which Polymarket uses, I'm not surprised. Based on the little data there is (3k duels), it's possibly worse than Gemini: it lost to Gemini 2.5 Pro more often than it won in direct duels. Not sure why the Elo is still higher; possibly GPT-5 did more clearly better against bad models, which I don't care about.
I am convinced. I've been giving it tasks the past couple hours that Opus 4.1 was failing on and it not only did them but cleaned up the mess Opus made. It's the real deal.
The marketing copy and the current livestream appear tautological: "it's better because it's better."
Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks.
As someone who tries to push the limits of hard coding tasks (mainly refactoring old codebases) to LLMs with not much improvement since the last round of models, I'm finding that we are hitting the reduction of rate of improvement on the S-curve of quality. Obviously getting the same quality cheaper would be huge, but the quality of the output day to day isn't noticeable to me.
I find it struggles to even refactor codebases that aren't that large. If you have a somewhat complicated change that spans the full stack, and has some sort of wrinkle that makes it slightly more complicated than adding a data field, then even the most modern LLMs seem to trip on themselves. Even when I tell it to create a plan for implementation and write it to a markdown file and then step through those steps in a separate prompt.
Not that it makes it useless, just that we seem to not "be there" yet for the standard tasks software engineers do every day.
Agree, I think they'll need to move to performance now. If a model was comparable to Claude 4, but took like 500ms or less per edit. A quicker feedback loop would be a big improvement.
2:40 "I do like how the pelican's feet are on the pedals." "That's a rare detail that most of the other models I've tried this on have missed."
4:12 "The bicycle was flawless."
5:30 Re generating documentation: "It nailed it. It gave me the exact information I needed. It gave me full architectural overview. It was clearly very good at consuming a quarter million tokens of rust." "My trust issues are beginning to fall away"
I think the biggest tell for me was having the leader of Cursor up vouching for the model, who has been a big proponent of Claude in Cursor for the last year. Doesn't seem like a light statement.
When they were about to release gpt4 I remember the hype was so high there were a lot of AGI debates. But then was quickly out-shadowed by more advanced models.
People knew that GPT-5 wouldn’t be an AGI or even close to that. It’s just an updated version. GPT-N will become more or less an annual release.
There's a bunch of benchmarks on the intro page including AIME 2025 without tools, SWE-bench Verified, Aider Polyglot, MMMU, and HealthBench Hard (not familiar with this one): https://openai.com/index/introducing-gpt-5/
I didn't think GPT-4 warranted a major version bump. I do not believe that Open AI's benchmarks are legitimate and I don't think they have been for quite some time, if ever.
I can already see LLM sommeliers: yes, the mouthfeel and punch of GPT-5 is comparable to that of Grok 4, but its tenderness lacks the crunch of Gemini 2.5 Pro.
Well, reduced sibilance is an ordinary and desirable thing. A better "audiophile absurdity" example would be $77,000 cables, freezing CDs to improve sound quality, using hospital-grade outlets, cryogenically frozen outlets (lol), the list goes on and on
Always have been. This LLM-centered AI boom has been my craziest and most frustrating social experiment, propped up by the rhetoric (with no evidence to back it up) that this time we finally have the keys to AGI (whatever the hell that means), and infused with enough AstroTurfing to drive the discourse into ideological stances devoid of any substance (you must either be a true believer or a naysayer). On the plus side, it appears that this hype train is taking a bump with GPT-5.
Watching the livestream now, the improvement over their current models on the benchmarks is very small. I know they seemed to be trying to temper our expectations leading up to this, but this is much less improvement than I was expecting
I have a suspicion that while the major AI companies have been pretty samey and competing in the same space for a while now, the market is going to force them to differentiate a bit, and we're going to see OpenAI begin to lose the race toward extremely high levels of intelligence instead choosing to focus on justifying their valuations by optimizing cost and for conversational/normal intelligence/personal assistant use-cases. After all, most of their users just want to use it to cheat at school, get relationship advice, and write business emails. They also have Ive's company to continue investing in.
Meanwhile, Anthropic & Google have more room in their P/S ratios to continue to spend effort on logarithmic intelligence gains.
Doesn't mean we won't see more and more intelligent models out of OpenAI, especially in the o-series, but at some point you have to make payroll and reality hits.
I'm not sure what "10% performance gain" is supposed to mean here; but moving from "It does a decent job 95% of the time but screws it up 5%" to "It does a decent job 98% of the time and screws it up 2%" to "It does a decent job 99.5% of the time and only screws it up 0.5%" are major qualitative improvements.
"+100 points" sounds like a lot until you do the ELO math and see that means 1 out of 3 people still preferred Claud Opus 4's response. Remember 1 out of 2 would place the models dead even.
Also, the code demos are all using GPT-5 MAX on Cursor. Most of us will not be able to use it like that all the time. They should have shown it without MAX mode as well.
Then why increment the version number here? This is clearly styled like a "mic drop" release but without the numbers to back it up. It's a really bad look when comparing the crazy jump from GPT3 to GPT4 to this slight improvement with GPT5.
The hallucination benchmarks did show major improvement. We know existing benchmarks are nearly useless at this point. It's reliability that matters more.
I’m more worried about how they still confidently reason through things incorrectly all the time, which isn’t quite the same as hallucination, but it’s in a similar vein.
I mean that's just the consequence of releasing a new model every couple months. If Open AI stayed mostly silent since the GPT-4 release (like they did for most iterations) and only now released 5 then nobody would be complaining about weak gains in benchmarks.
If everyone else had stayed silent as well, then I would agree. But as it is right now they are juuust about managing to match the current pace of the other contenders.
Which actually is fine, but they have previously set quite high expectations. So some will probably be disappointed at this.
If they had stayed silent since GPT-4, nobody would care what OpenAI was releasing as they would have become completely irrelevant compared to Gemini/Claude.
It makes it look like the presentation is rushed or made last minute. Really bad to see this as the first plot in the whole presentation. Also, I would have loved to see comparisons with Opus 4.1.
Edit: Opus 4.1 scores 74.5% (https://www.anthropic.com/news/claude-opus-4-1). This makes it sound like Anthropic released the upgrade to still be the leader on this important benchmark.
After reading around, it seems like they probably forgot to update/swap the slides before presentation. The graphs were correct on their website, as they launched. But the ones they used in the presentation were probably some older versions they had forgotten to fix.
Some people have hypothesized that GPT-5 is actually about cost reduction and internal optimization for OpenAI, since there doesn't seem to be much of a leap forward, but another element that they seem to have focused on that'll probably make a huge difference to "normal" (non-tech) users is making precise and specifically worded prompts less necessary.
They've mentioned improvements in that aspect a few times now, and if it actually materializes, that would be a big leap forward for most users, even if underneath GPT-4 was also technically able to do the same things when prompted just the right way.
yeah i think they shot themselves in the foot a bit here by creating the o series. the truth is that GPT-5 _is_ a huge step forward, for the "GPT-x" models. The current GPT-x model was basically still 4o, with 4.1 available in some capacity. GPT-5 vs GPT-4o looks like a massive upgrade.
But it's only an incremental improvement over the existing o line. So people feel like the improvement from the current OpenAI SoTA isn't there to justify a whole bump. They probably should have just called o1 GPT-5 last year.
It sounded like they were very careful to always mention that those improvements were for ChatGPT, so I'm very skeptical that they translate to the API versions of GPT-5.
one can intentionally use a recent and a much older model to figure out if the tests are reliable, and in which domains it is reliable.
one can compute a model's joint probability for a sequence and compare how likely each model finds the same sequence (see the sketch after this list).
we could ask both to start talking about a subject, but alternatingly, with each emitting a token. then look at how the dumber and smarter models judge the resulting sentence: does the smart one tend to pull up the quality of the resulting text, or does it tend to get dragged down more towards the dumber participant?
given enough such tests to "identify the dummy vs. the smart one", we can verify them by checking for common agreement (with an extreme pairing such as word2vec vs. a transformer) to assess the quality of each test, regardless of domain.
on the assumption that such or similar tests allow us to indicate the smarter one, i.e. assuming we find plenty of such tests, we can demand model makers publish open weights so that we can publicly verify performance agreements.
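A rough sketch of the joint-probability comparison mentioned above: score the same text under two checkpoints and compare total log-likelihoods. The model names below (gpt2, gpt2-medium) are just stand-ins for an "older" and a "newer" model, and the comparison is only clean when both share a tokenizer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model_name, text):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)              # mean next-token cross-entropy
    return -out.loss.item() * (ids.shape[1] - 1)  # summed log-probability of the sequence

text = "Water boils at a lower temperature at high altitude."
for name in ("gpt2", "gpt2-medium"):              # stand-ins for "older" vs "newer"
    print(name, round(sequence_logprob(name, text), 2))
```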
Another idea is self-consistency tests: a single forward inference over a context of, say, 2048 tokens (just an example) is effectively predicting the conditional 2-gram, 3-gram, 4-gram probabilities on the input tokens. Each output token distribution is predicted from the preceding inputs, so there are 2048 input tokens and 2048 output tokens: the position-1 output is the predicted token (logit) vector estimated to follow the position-1 input, the position-2 output is the prediction following the first 2 inputs, and so on, with the last vector being the predicted next token following all 2048 input tokens: p(t_{i+1} | t_1 = a, t_2 = b, ..., t_i = z).
But that is just one way the next token can be predicted with the network. Another approach would be to use RMAD (reverse-mode automatic differentiation) gradient descent, keeping the model weights fixed and treating only the last, say, 512 input vectors as variable: how well do the last 512 forward-prediction output vectors match the gradient-descent best-joint-probability output vectors?
This could be added as a loss term during training as well, as a form of regularization, which turns it into a kind of Energy Based Model roughly.
Even if they've saturated the distinguishable quality for tasks they can both do, I'd expect a gap in what tasks they're able to do.
An average driver evaluating both would have a very hard time finding the F1 car's superior utility.
Yes, because I'd get them to play each other?
I don't need to understand how the AI made the app I asked for or cured my cancer, but it'll be pretty obvious when the app seems to work and the cancer seems to be gone.
I mean, I want to understand how, but I don't need to understand how, in order to benefit from it. Obviously understanding the details would help me evaluate the quality of the solution, but that's an afterthought.
Dead Comment
But I think we're not even on the path to creating AGI. We're creating software that replicate and remix human knowledge at a fixed point in time. And so it's a fixed target that you can't really exceed, which would itself already entail diminishing returns. Pair this with the fact that it's based on neural networks which also invariably reach a point of sharply diminishing returns in essentially every field they're used in, and you have something that looks much closer to what we're doing right now - where all competitors will eventually converge on something largely indistinguishable from each other, in terms of ability.
This doesn't really make sense outside computers. Since AI would be training itself, it needs to have the right answers, but as of now it doesn't really interact with the physical world. The most it could do is write code, and check things that have no room for interpretation, like speed, latency, percentage of errors, exceptions, etc.
But, what other fields would it do this in? How can it make strides in biology? It can't dissect animals; it can't figure out more about plants than what humans feed into the training data. Regarding math, math is human-defined. Humans said "addition does this", "this symbol means that", etc.
I just don't understand how AI could ever surpass anything humans have known before, when it lives by the rules defined by us.
Why would you presume this? I think part of a lot of people's AI skepticism is talk like this. You have no idea. Full stop. Why wouldn't progress be linear? As new breakthroughs come, newer ones will be harder to come by. Perhaps it's exponential. Perhaps it's linear. No one knows.
All the technological revolutions so far have accounted for little more than a 1.5% sustained annual productivity growth. There are always some low-hanging fruit with new technology, but once they have been picked, the effort required for each incremental improvement tends to grow exponentially.
That's my default scenario with AGI as well. After AGI arrives, it will leave humans behind very slowly.
I think this is a hard kick below the belt for anyone trying to develop AGI using current computer science.
Current AIs only really generate - no, regenerate - text based on their training data. They are only as smart as the data available to them. Even when an AI "thinks", it's still only processing existing data rather than making a genuinely new conclusion. It's the best text processor ever created - but it's still just a text processor at its core. And that won't change without more hard computer science being done by humans.
So yeah, I think we're starting to hit the upper limits of what we can do with Transformers technology. I'd be very surprised if someone achieved "AGI" with current tech. And, if it did get achieved, I wouldn't consider it "production ready" until it didn't need a nuclear reactor to power it.
They are unrelated. All you need is a way for continual improvement without plateauing, and this can start at any level of intelligence. As it did for us; humans were once less intelligent.
Using the flagship to bootstrap the next iteration with synthetic data is standard practice now. This was mentioned in the GPT5 presentation. At the rate things are going I think this will get us to ASI, and it's not going to feel epochal for people who have interacted with existing models, but more of the same. After all, the existing models are already smarter than most humans and most people are taking it in their stride.
The next revolution is going to be embodiment. I hope we have the commonsense to stop there, before instilling agency.
I would also not be surprised if, in the process of developing something comparable to human intelligence (assuming the extreme computation, energy, and materials issues of packing that much computation and energy into a single system could be overcome), the AI also develops something comparable to human desire and/or mental health issues. There is a non-zero chance we could end up with AI that doesn't want to do what we ask it to do, or that doesn't work all the time because it wants to do other things.
You can't just assume exponential growth is a forgone conclusion.
Why would the AI want to improve itself? From whence would that self-motivation stem?
AI can do it fine as it knows A and B. And that is knowledge creation.
It seems like the LLM will be a component of an eventual AGI - its voice, per se - but not its mind. The mind still requires another innovation or breakthrough we haven't seen yet.
I am not an AI researcher, but I have friends who do work in the field, and they are not worried about LLM-based AGI because of the diminishing returns on results vs amount of training data required. Maybe this is the bottleneck.
Human intelligence is markedly different from LLMs: it requires far fewer examples to train on, and generalizes way better. Whereas LLMs tend to regurgitate solutions to solved problems, where the solutions tend to be well-published in training data.
That being said, AGI is not a necessary requirement for AI to be totally world-changing. There are possibly applications of existing AI/ML/SL technology which could be more impactful than general intelligence. Search is one example where the ability to regurgitate knowledge from many domains is desirable
I think I want various forms of AI that are more focused on specific domains. I want AI tools, not companions or peers or (gulp) masters.
(Then again, people thought they wanted faster horses before they rolled out the Model T)
Models are truly input multimodal now. Feeding an image, feeding audio and feeding text all go into separate input nodes, but it all feeds into the same inner layer set and outputs text. This also mirrors how brains work more as multiple parts integrated in one whole.
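A toy illustration of that "separate input nodes, shared inner layers" shape; every module choice and size below is invented, not any particular lab's design:

```python
import torch
import torch.nn as nn

# Each modality gets its own encoder that projects into a common token space,
# and one shared transformer trunk consumes the concatenated sequence.
class TinyMultimodal(nn.Module):
    def __init__(self, d=128, vocab=1000):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d)
        self.image_proj = nn.Linear(2048, d)   # e.g. vision features -> shared tokens
        self.audio_proj = nn.Linear(512, d)    # e.g. audio features -> shared tokens
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d, vocab)     # output is still text tokens

    def forward(self, text_ids, image_feats, audio_feats):
        toks = torch.cat([
            self.image_proj(image_feats),      # (B, Ti, d)
            self.audio_proj(audio_feats),      # (B, Ta, d)
            self.text_emb(text_ids),           # (B, Tt, d)
        ], dim=1)
        return self.lm_head(self.trunk(toks))

model = TinyMultimodal()
logits = model(torch.randint(0, 1000, (2, 16)),
               torch.randn(2, 4, 2048), torch.randn(2, 4, 512))
print(logits.shape)  # (2, 24, 1000): one shared trunk, text out
```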
Humans in some sense are not empty brains, there is a lot of stuff baked in our DNA and as the brain grows it develops a baked in development program. This is why we need fewer examples and generalize way better.
Instead of writing code with exacting parameters, future developers will write human-language descriptions for AI to interpret and convert into a machine representation of the intent. Certainly revolutionary, but not true AGI in the sense of the machine having truly independent agency and consciousness.
In ten years, I expect the primary interface of desktop workstations, mobile phones, etc will be voice prompts for an AI interface. Keyboards will become a power-user interface and only used for highly technical tasks, similar to the way terminal interfaces are currently used to access lower-level systems.
People say this, but honestly, it's not really my experience— I've given ChatGPT (and Copilot) genuinely novel coding challenges and they do a very decent job at synthesizing a new thought based on relating it to disparate source examples. Really not that dissimilar to how a human thinks about these things.
Depends on how you define "world changing" I guess, but this world already looks different to the pre-LLM world to me.
Me asking LLMs things instead of consulting the output from other humans now takes up a significant fraction of my day. I don't google nearly as often, and I don't trust any image or video I see, as swathes of the creative professions have been replaced by output from LLMs.
It's funny; that final thing is the last thing I would have predicted. I always believed the one thing a machine could not match was human creativity, because the output of machines was always precise, repetitive and reliable. Then LLMs come along, randomly generating every token. Their primary weakness is that they are neither precise nor reliable, but they can turn out an unending stream of unique output.
But even with these it does not feel like AGI. It seems like the "fusion reactors are 20 years away" argument, except this is supposedly coming in 2 years, and they have not even got the core technology for how to build AGI.
I think you're on to it. Performance is clustering because a plateau is emerging. Hyper-dimensional search engines are running out of steam, and now we're optimizing.
For example, while you can get it to predict good chess moves if you train it on enough chess games, it can't really constrain itself to the rules of chess. (https://garymarcus.substack.com/p/generative-ais-crippling-a...)
Aren't we the summation of intelligence from quintillions of beings over hundreds of millions of years?
Have LLMs really had more data?
It's fascinating to me that so many people seem totally unable to separate the training environment from the final product
These AI computers aren’t thinking, they are just repeating.
That is because with LLMs there is no intelligence. It is Artificial Knowledge - AK, not AI. Real AI, properly speaking, would be AGI. Not that it matters for the use cases we have, but marketing needs 'AI' because that is what we have been expecting for decades. So yeah, I also do not think we will get AGI from LLMs - nor does it matter for what we are using them for.
At Aloe, we are model agnostic and outperforming frontier models. It’s the architecture around the LLM that makes the difference. For instance, our system using Gemini can do things that Gemini can’t do on its own. All an LLM will ever do is hallucinate. If you want something with human-like general intelligence, keep looking beyond LLMs.
The fortunate thing is that we managed to invent an AI that is good at _copying us_ instead of being a truly maverick agent, which kind of limits it to "average human" output.
However, I still think that all the doomer arguments are valid, in principle. We very well may be doomed in our lifetimes, so we should take the threat very seriously.
I don’t see anything that would even point into that direction.
Curious to understand where these thoughts are coming from
This isn’t rocket science.
This seems to be a result of using overly simplistic models of progress. A company makes a breakthrough, the next breakthrough requires exploring many more paths. It is much easier to catch up than find a breakthrough. Even if you get lucky and find the next breakthrough before everyone catches up, they will probably catch up before you find the breakthrough after that. You only have someone run away if each time you make a breakthrough, it is easier to make the next breakthrough than to catch up.
Consider the following game:
1. N parties take turns rolling a D20. If anyone rolls 20, they get 1 point.
2. If any party is 1 or more points behind, they only need to roll a 19 or higher to get one point. That is, being behind gives you a slight advantage in catching up.
While points accumulate, most of the players end up with the same score.
I ran a simulation of this game for 10,000 turns with 5 players (a sketch of the simulation follows the results below):
Game 1: [852, 851, 851, 851, 851]
Game 2: [827, 825, 827, 826, 826]
Game 3: [827, 822, 827, 827, 826]
Game 4: [864, 863, 860, 863, 863]
Game 5: [831, 828, 836, 833, 834]
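For reference, a minimal sketch that reproduces this simulation under the rules as stated (roll 20 to score, 19+ when at least a point behind the leader):

```python
import random

def play(n_players=5, turns=10_000, seed=None):
    rng = random.Random(seed)
    scores = [0] * n_players
    for _ in range(turns):
        for i in range(n_players):
            behind = scores[i] < max(scores)   # at least 1 point behind the leader
            threshold = 19 if behind else 20   # being behind makes scoring easier
            if rng.randint(1, 20) >= threshold:
                scores[i] += 1
    return scores

for g in range(1, 6):
    print(f"Game {g}:", play(seed=g))
```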
But yes, so far it feels like we are in the latter stages of the innovation S-curve for transformer-based architectures. The exponent may be out there but it probably requires jumping onto a new S-curve.
I think it's likely that we will eventually hit a point of diminishing returns, where performance is good enough and marginal performance improvements aren't worth the high cost.
And over time, many models will reach "good enough" levels of performance, including models that are open weight. And given even more time, these open weight models will be runnable on consumer-level hardware. Eventually, they'll be runnable on super cheap consumer hardware (something more akin to an NPU than a $2000 RTX 5090). So your laptop in 2035 with specialized AI cores and 1TB of LPDDR10 RAM is running GPT-7 level models without breaking a sweat. Maybe GPT-10 can solve some obscure math problem that your model can't, but does it even matter? Would you pay for GPT-10 when running a GPT-7 level model does everything you need and is practically free?
The cloud providers will make money because there will still be a need for companies to host the models in a secure and reliable way. But a company whose main business strategy is developing the model? I'm not sure they will last without finding another way to add value.
This raises the question: why then do AI companies have these insane valuations? Do investors know something that we don't?
Presently we are still a long way from that. In my opinion we at least are as far away from AGI as 1970s mainframes were from LLMs.
I really don’t expect to see AGI in my lifetime.
You can only experience the world in one place in real time. Even if you networked a bunch of "experiencers" together to gather real time data from many places at the same time, you would need a way to learn and train on that data in real time that could incorporate all the simultaneous inputs. I don't see that capability happening anytime soon.
These big models don't dynamically update as days pass by - they don't learn. A personal assistant service may be able to mimic learning by creating a database of your data or preferences, but your usage isn't baked back into the big underlying model permanently.
I don't agree with "in our lifetimes", but the difference between training and learning is the bright red line. Until there's a model which is able to continually update itself, it's not AGI.
My guess is that this will require both more powerful hardware and a few more software innovations. But it'll happen.
I think we should be treating AGI like Cold Fusion, phrenology, or even alchemy. It is not science, but science fiction. It is not going to happen and no research into AGI will provide anything of value (except for the grifters pushing the pseudo-science).
It's all just hyperbole to attract investment and shareholder value and the people peddling the idea of AGI as a tangible possibility are charlatans whose goals are not aligned with whatever people are convincing themselves are the goals.
The fact that so many engineers have fallen for it so completely is stunning to me and speaks volumes about the underlying health of our industry.
However, I would not be so dismissive of the value. Many of us are reacting to the complete oversell of 'the encyclopedia' as being 'the eve of AGI' - as rightfully we should. But, in doing so, I believe it would be a mistake to overlook the incredible impact - and economic displacement - of having an encyclopedia comprised of all the knowledge of mankind that has "an interesting search interface" that is capable of enabling humans to use the interface to manipulate/detect connections between all that data.
The tech is neat and it can do some neat things but...it's a bullshit machine fueled by a bullshit machine hype bubble. I do not get it.
Yes. And the fact they're instead clustering simply indicates that they're nowhere near AGI and are hitting diminishing returns, as they've been doing for a long time already. This should be obvious to everyone. I'm fairly sure that none of these companies has been able to use their models as a force multiplier in state-of-the-art AI research. At least not beyond a 1+ε factor. Fuck, they're just barely a force multiplier in mundane coding tasks.
https://www.youtube.com/shorts/dLCEUSXVKAA
Thus, it’s easy to mistake one for the other - at least initially.
There were two interesting takeaways about AGI:
1. Dario makes the remark that the term AGI/ASI is very misleading and dangerous. These terms are ill defined and it's more useful to understand that the capabilities are simply growing exponentially at the moment. If you extrapolate that, he thinks it may just "eat the majority of the economy". I don't know if this is self-serving hype, and it's not clear where we will end up with all this, but it will be disruptive, no matter what.
2. The Economist moderators, however, note towards the end that this industry may well tend toward commoditization. At the moment these companies produce models that people want but others can't make. But as chip making starts to hit its limits and the information space becomes completely harvested, capability growth might taper off and others will catch up, the quasi-monopoly profit potential melting away.
Putting that together, I think that although the cognitive capabilities will most likely continue to accelerate, albeit not necessarily along the lines of AGI, the economics of all this will probably not lead to a winner takes all.
[1] https://www.economist.com/podcasts/2025/07/31/artificial-int...
I also feel like, it's stopped being exponential already. I mean last few releases we've only seen marginal improvements. Even this release feels marginal, I'd say it feels more like a linear improvement.
That said, we could see a winner take all due to the high cost of copying. I do think we're already approaching something where it's mostly price and who released their models last. But the cost to train is huge, and at some point it won't make sense and maybe we'll be left with 2 big players.
2. Commoditization can be averted with access to proprietary data. This is why all of ChatGPT, Claude, and Gemini push for agents and permissions to access your private data sources now. They will not need to train on your data directly. Just adapting the models to work better with real-world, proprietary data will yield a powerful advantage over time.
Also, the current training paradigm utilizes RL much more extensively than in previous years and can help models to specialize in chosen domains.
Seriously, our government just announced it's slashing half a billion dollars in vaccine research because "vaccines are deadly and ineffective", and it fired a chief statistician because the president didn't like the numbers he calculated, and it ordered the destruction of two expensive satellites because they can observe politically inconvenient climate change. THOSE are the people you are trusting to keep an eye on the pace of development inside of private, secretive AGI companies?
Do you mean from ChatGPT launch or o1 launch? Curious to get your take on how they bungled the lead and what they could have done differently to preserve it. Not having thought about it too much, it seems that with the combo of 1) massive hype required for fundraising, and 2) the fact that their product can be basically reverse engineered by training a model on its curated output, it would have been near impossible to maintain a large lead.
LLMs PATTERN MATCH well. Good at "fast" System 1 thinking, instantly generating intuitive, fluent responses.
LLMs are good at mimicking logic, not real reasoning. Simulate "slow," deliberate System 2 thinking when prompted to work step-by-step.
The core of an LLM is not understanding but just predicting the next most likely word in a sequence.
LLMs are good at both associative brainstorming (System 1) and creating works within a defined structure, like a poem (System 2).
Reasoning is the Achilles heel rn. An LLM's logic can SEEM plausible, but it's based on CORRELATION, NOT deductive reasoning.
If I were working in a job right now where I could see and guide and retrain these models daily, and realized I had a weapon of mass destruction on my hands that could War Games the Pentagon, I'd probably walk my discoveries back too. Knowing that an unbounded number of parallel discoveries were taking place.
It won't take AGI to take down our fragile democratic civilization premised on an informed electorate making decisions in their own interests. A flood of regurgitated LLM garbage is sufficient for that. But a scorched earth attack by AGI? Whoever has that horse in their stable will absolutely keep it locked up until the moment it's released.
Yesterday, Claude Opus 4.1 failed in trying to figure out that `-(1-alpha)` or `-1+alpha` is the same as `alpha-1`.
We are still a little bit away from AGI.
I don't think this has anything to do with AGI. We aren't at AGI yet. We may be close or we may be a very long way away from AGI. Either way, current models are at a plateau and all the big players have more or less caught up with each other.
As is, AI is quite intelligent, in that it can process large quantities of diverse unstructured information and build meaningful insights. And that intelligence applies across an incredibly broad set of problems and contexts - enough that I have a hard time not calling it general. Sure, it has major flaws that are obvious to us, and it's much worse at many things we care about. But that doesn't make it not intelligent or general. If we want to set human intelligence as the baseline, we already have a word for that: superintelligence.
on the other hand, there are still some flaws regarding GPT-5. for example, when i use it for research it often needs multiple prompts to get the topic i truly want and sometimes it can feed me false information. so the reasoning part is not fully there yet?
Current models, when they apply reasoning, have feedback loops that use tools for trial and error. They have a short-term memory (context), or multiple short-term memories if you use agents, and a long-term memory (markdown, RAG), so they can solve problems that aren't hardcoded in their brain/model. And they can store these solutions in their long-term memory for later use, or for sharing with other LLM-based systems.
AGI needs to come from a system that combines LLMs + tools + memory. And I've had situations where it felt like I was working with an AGI. The LLMs seem advanced enough to serve as the kernel for an AGI system.
The real challenge is how are you going to give these AGIs a mission/goal that they can do rather independently and don't need constant hand-holding. How does it know that it's doing the right thing. The focus currently is on writing better specifications, but humans aren't very good at creating specs for things that are uncertain. We also learn from trial and error and this also influences specs.
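For what it's worth, the loop being described looks roughly like this sketch; call_llm, the tool registry, and the markdown "long-term memory" file are all invented placeholders rather than any specific product:

```python
import json

def call_llm(messages):
    # Placeholder for a real chat-model API call. The canned replies below
    # just let the loop run end to end for illustration.
    if any(m["role"] == "user" and "tool output" in m["content"] for m in messages):
        return json.dumps({"done": "tests run; one failure recorded in notes"})
    return json.dumps({"tool": "run_tests", "input": ""})

TOOLS = {
    "search": lambda q: f"(pretend search results for {q!r})",
    "run_tests": lambda _: "3 passed, 1 failed",
}

def agent(goal, max_steps=10, memory_path="memory.md"):
    # Short-term memory is the message list; long-term memory is a notes file.
    messages = [
        {"role": "system", "content": 'Reply with {"tool": ..., "input": ...} or {"done": ...}.'},
        {"role": "user", "content": goal},
    ]
    for _ in range(max_steps):
        reply = json.loads(call_llm(messages))
        if "done" in reply:
            with open(memory_path, "a") as f:
                f.write(f"- {goal}: {reply['done']}\n")
            return reply["done"]
        result = TOOLS[reply["tool"]](reply["input"])      # trial-and-error feedback
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "user", "content": f"tool output: {result}"})
    return "step budget exhausted"

print(agent("make the test suite pass"))
```

The hard part the comment points at, knowing whether it's doing the right thing without hand-holding, is exactly what this kind of loop does not solve.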
However, I do believe that once the genuine AGI threshold is reached it may cause a change in that rate. My justification is that while current models have gone from a slightly good copywriter in GPT-4 to very good copywriter in GPT-5, they've gone from sub-exceptional in ML research to sub-exceptional in ML research.
The frontier in AI is driven by the top 0.1% of AI researchers. Since improvement in these models is driven partially by the very peaks of intelligence, it won't be until models reach that level where we start to see a new paradigm. Until then it's just scale and throwing whatever works at the GPU and seeing what comes out smarter.
I think you'll see the prophesized exponentiation once AI can start training itself at reasonable scale. Right now it's not possible.
The AIs improve by gradient descent, still the same as ever. It's all basic math and a little calculus, and then making tiny tweaks to improve the model over and over and over.
There's not a lot of room for intelligence to improve upon this. Nobody sits down and thinks really hard, and the result of their intelligent thinking is a better model; no, the models improve because a computer continues doing basic loops over and over and over trillions of times.
That's my impression anyway. Would love to hear contrary views. In what ways can an AI actually improve itself?
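For the curious, the "tiny tweaks over and over" is essentially this loop; a toy sketch, not any lab's actual training code:

```python
import torch
import torch.nn as nn

# Compute a loss, take the gradient, nudge the weights a tiny step, repeat.
# Frontier training differs in scale and plumbing, not in the shape of this loop.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x, y = torch.randn(256, 10), torch.randn(256, 1)
for step in range(1000):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()   # "a little calculus": gradients of the loss
    opt.step()        # "tiny tweaks": move each weight slightly downhill
```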
This assumes an infinite potential for improvement though. It's also possible that the winner maxes out after threshold day plus one week, and then everyone hits the same limit within a relatively short time.
That seems hardly surprising considering the condition to receive the benefit has not been met.
The person who lights a campfire first will become warmer than the rest, but while they are trying to light the fire the others are gathering firewood. So while nobody has a fire, those lagging are getting closer to having a fire.
This misunderstanding is nothing more than the classic "logistic curves look like exponential curves at the beginning". All (Transformer-based, feedforward) AI development efforts are plateauing rapidly.
AI engineers know this plateau is there, but of course every AI business has a vested interest in overpromising in order to access more funding from naive investors.
That took the world from autocomplete to Claude and GPT.
Another 10,000x would do it again, but who has that kind of money or R&D breakthrough?
The way scaling laws work, 5,000x and 10,000x give a pretty similar result. So why is it surprising that competitors land in the same range? It seems hard enough to beat your competitor by 2x, let alone 10,000x.
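A back-of-the-envelope illustration (the exponent is an assumed value for illustration, not a constant fitted from any published scaling law):

```python
# Toy power-law scaling: relative loss ~ C^(-b) with compute C (baseline C = 1).
# b = 0.05 is an assumed illustrative exponent, not a real fitted value.
b = 0.05
loss = lambda c: c ** -b

for c in (5_000, 10_000):
    print(c, round(loss(c), 3))
# 5000  -> 0.653
# 10000 -> 0.631 (doubling compute again buys only ~3% lower loss)
```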
Part of the fun is that predictions get tested on short enough timescales to "experience" in a satisfying way.
Idk where that puts me, in my guess at "hard takeoff." I was reserved/skeptical about hard takeoff all along.
Even if LLMs had improved at a faster rate... I still think bottlenecks are inevitable.
That said... I do expect progress to happen in spurts anyway. It makes sense that companies of similar competence and resources get to a similar place.
The winner-take-all thing is a little forced. "Race to singularity" is the fun, rhetorical version of the investment case. The implied boring case is facebook, adwords, aws, apple, msft... i.e. the modern tech sector tends to create singular big winners... and therefore our pre-revenue market cap should be $1trn.
I personally think it's a pretty reductive model for what intelligence is, but a lot of people seem to strongly believe in it.
What do you think AGI is?
How do we go from sentence composing chat bots to General Intelligence?
Is it even logical to talk about such a thing as abstract general intelligence when every form of intelligence we see in the real world is applied to specific goals, as behavioral technology refined through evolution?
When LLMs start undergoing spontaneous evolution then maybe it is nearer. But now they can't. Also there is so much more to intelligence than language. In fact many animals are shockingly intelligent but they can't regurgitate web scrapings.
To be honest that is what you would want if you were digitally transforming the planet with AI.
You would want to start with a core so that all models share similar values and don't bicker, etc., over negotiations, trade deals, logistics.
Would also save a lot of power so you don't have to train the models again and again, which would be quite laborious and expensive.
Rather each lab would take the current best and perform some tweak or add some magic sauce then feed it back into the master batch assuming it passed muster.
Share the work, globally for a shared global future.
At least that is what I would do.
AGI on top of LLMs is basically a billion tokens spent for the AI to answer the question "how do you feel?" with "fine".
Because it would mean simulating everything in the world over an agentic flow: considering all possible options, checking memory, checking the weather, checking the news... activating emotional agentic subsystems, checking state... saving state...
They lack writable long-term memory beyond a context window. They operate without any grounded perception-action loop to test hypotheses. And they possess no executive layer for goal directed planning or self reflection...
Achieving AGI demands continuous online learning with consolidation.
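As a miniature illustration of what "continuous online learning with consolidation" means, here is a generic replay-buffer toy (my own sketch, not anyone's actual proposal or architecture):

```python
# Toy continual learning: fit y ~ w*x from a stream of examples, replaying a
# few stored old examples alongside each new one so later updates don't wash
# out earlier knowledge (the "consolidation" step). Purely illustrative.
import random

w, lr, memory = 0.0, 0.01, []              # weight, learning rate, replay buffer

def consolidate_and_learn(x, y, replay_size=8):
    global w
    batch = [(x, y)] + random.sample(memory, min(replay_size, len(memory)))
    for bx, by in batch:
        w -= lr * 2 * (w * bx - by) * bx   # one SGD step per example
    memory.append((x, y))                  # keep the example for future replay

for x in range(1, 10):
    consolidate_and_learn(float(x), 2.0 * x)   # stream where the truth is y = 2x

print(round(w, 2))  # -> 2.0
```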
This argument has so many weak points it deserves a separate article.
I wonder if that's because they have a lot of overlap in training sets and algorithms used, but more importantly, whether they use the same benchmarks and optimize for them.
As the saying goes, once a metric (or benchmark score in this case) becomes a target, it ceases to be a valuable metric.
It's the systems around the models where the proprietary value lies.
It's natural if you extrapolate from training loss curves; a training process with continually diminishing returns to more training/data is generally not something that suddenly starts producing exponentially bigger improvements.
Nothing we have is anywhere near AGI and as models age others can copy them.
I personally think we are closing in on the end of improvement for LLMs with current methods. We have consumed all of the readily available data already, so there is no more good-quality training material left. We either need new novel approaches, or we can hope that if enough compute is thrown at training, actual intelligence will spontaneously emerge.
SGI would be self-improving along some function close to linear in time and resources. That's almost exclusively dependent on the software design, as transformers have so far shown to hit a wall at roughly logarithmic progress per unit of resources.
In other words, no, it has little to do with the commercial race.
This could be partly due to normative isomorphism[1] according to the institutional theory. There is also a lot of movement of the same folks between these companies.
[1] https://youtu.be/VvaAnva109s
Since then they've been about neck and neck with some models making different tradeoffs.
Nobody needs to reach AGI to take off. They just need to bankrupt their competitors since they're all spending so much money.
https://www.youtube.com/watch?v=5eqRuVp65eY
It's not architectures that matter anymore, it's unlocking new objectives and modalities that open another axis to scale on.
The improvements they make are marginal. How long until the next AI breakthrough? Who can tell? Last time it took decades.
That's only one part of it. Some forecasters put probabilities on each of the four quadrants in the takeoff speed (fast or slow) vs. power distribution (unipolar or multipolar) table.
2. Ben Evans frequently makes fun of the business value. Pretty clear a lot of the models are commoditized.
3. strategically, the winners are platforms where the data are. if you have data in azure, that's where you will use your models. exclusive licensing could pull people to your cloud from on prem. so some gains may go to those companies ...
But I doubt we will ever see a fully autonomous, reliable AGI system.
The real take-off / winner-take-all potential is in retrieval and knowing how to provide the best possible data to the LLM. That strategy will work regardless of the model.
It's not obvious whether a similar breakthrough could occur in AI.
But nowadays, how can corporations "justify" spending gigantic amounts of R&D resources (time + hardware + energy) on models that are not LLMs?
Even if we run with the assumption that LLMs can become human-level AI researchers, and are able to devise and run experiments to improve themselves, even then the runaway singularity assumption might not hold. Let's say Company A has this LLM, while company B does not.
- The automated AI researcher, like its human peers, still needs to test the ideas and run experiments. It might happen that testing (meaning compute) is the bottleneck, not the ideas, so Company A has no real advantage (see the rough sketch below).
- It might also happen that AI training has some fundamental compute limit coming from information theory, analogous to the Shannon limit, and once again, more efficient compute can only approach this limit, not overcome it.
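A rough Amdahl's-law style sanity check on the compute-bottleneck point (the 10% share for idea generation is an assumed figure, purely for illustration):

```python
# Amdahl's law: overall speedup when only a fraction p of the work (idea
# generation) is accelerated by factor s, while the rest (running experiments
# on fixed compute) is not. p = 0.1 is an assumption, not a measurement.
def speedup(p, s):
    return 1 / ((1 - p) + p / s)

for s in (10, 100, float("inf")):
    print(s, round(speedup(p=0.1, s=s), 2))
# 10  -> 1.1
# 100 -> 1.11
# inf -> 1.11 (if experiments dominate, automating the researcher barely helps)
```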
It's probably never going to work with a single process without consuming the resources of the entire planet to run that process on.
Both the AGI threshold with LLM architecture and the idea of self-advancing AI are pie in the sky, at least for now. These are myths of the rationalist cult.
We'd more likely see reduced returns and smaller jumps between version updates, plus regression from all the LLM produced slop that will be part of the future data.
Meanwhile, keep all relevant preparations in secret...
Why is this even an axiom, that this has to happen and it's just a matter of time?
I don't see any credible argument for the path LLM -> AGI. In fact, given the slowdown in the rate of improvement over the past 3 years of LLMs, despite the unprecedented firehose of trillions of dollars being sunk into them, I think it points to the contrary!
I have had a bunch of positive experiences as well, but when it goes bad, it goes so horribly bad and off the rails.
I think user experience and pricing models are the best angle here. Right now everyone's just passing down costs as they come, with no real loss leaders except a free tier. I looked at reviews of various wrappers on app stores; people say "I hate that I have to pay for each generation and not know what I'm going to get", so the market would like a service priced very differently. Is it economical? Many will fail, one will succeed. People will copy the model of that one.
Compare that to
Gemini 2.5 Pro: knowledge cutoff Jan 2025 (3 months before release)
Claude Opus 4.1: knowledge cutoff Mar 2025 (4 months before release)
https://platform.openai.com/docs/models/compare
https://deepmind.google/models/gemini/pro/
https://docs.anthropic.com/en/docs/about-claude/models/overv...
Found the GitHub: https://github.com/haykgrigo3/TimeCapsuleLLM
I don't know if it's because of context clogging or because the model can't tell a high-quality source from garbage.
I've defaulted to web search off and turn it on via the tools menu as needed.
Where it does matter is for code generation. It’s error-prone and inefficient to try teaching a model how to use a new framework version via context alone, especially if the model was trained on an older API surface.
Web search enables targeted info to be "updated" at query time. But it doesn't get used for every query and you're practically limited in how much you can query.
2.5 Pro went ahead and summarized it (but it completely ignored a # reference and so summarized the wrong section of a multi-topic page; that's a different problem, though).
> GPT‑5 is a unified system . . .
OK
> . . . with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt).
So that's not really a unified system then, it's just supposed to appear as if it is.
This looks like they're not training the single big model but instead have gone off to develop special sub models and attempt to gloss over them with yet another model. That's what you resort to only when doing the end-to-end training has become too expensive for you.
https://openai.com/index/introducing-gpt-5-for-developers/
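To be concrete about what that marketing copy seems to describe, a router in front of two models could be as simple as the sketch below. This is pure speculation on my part: OpenAI hasn't published how their router actually works, and every name and threshold here is invented.

```python
# Speculative sketch of a "real-time router" choosing between a cheap/fast
# model and a slower reasoning model. Invented names and thresholds; this
# does not reflect OpenAI's actual implementation.
def route(prompt: str, needs_tools: bool = False) -> str:
    explicit = any(kw in prompt.lower() for kw in ("think hard", "step by step"))
    complex_ = len(prompt) > 2000 or "prove" in prompt.lower()
    if explicit or complex_ or needs_tools:
        return "reasoning-model"   # deeper, slower, more expensive
    return "fast-model"            # "answers most questions"

print(route("What's the capital of France?"))          # fast-model
print(route("Think hard about this scheduling bug"))   # reasoning-model
```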
If OpenAI really are hitting the wall on being able to scale up overall then the AI bubble will burst sooner than many are expecting.
People evaluate dataset quality over time. There's no evidence that datasets from 2022 onwards perform any worse than ones from before 2022. There is some weak evidence of an opposite effect, causes unknown.
It's easy to make "model collapse" happen in lab conditions - but in real world circumstances, it fails to materialize.
The corollary to the bitter lesson strikes again: any hand-crafted system will outperform any general system for the same budget, by a wide margin.
In practice the whole point is the opposite is the case, which is why this direction by OpenAI is a suspicious indicator.
[1] https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...
GPT-5 System Card [pdf] - https://news.ycombinator.com/item?id=44827046
It feels less and less likely that AGI is even possible with the data we have available. The one unknown is quantum computing: if we manage to get usable quantum computers, I'm curious what that will do to AI.
From the system card:
"In the near future, we plan to integrate these capabilities into a single model."
What excites me now is that Gemini 3.0 or some answer from Google is coming soon and that will be the one I will actually end up using. It seems like the last mover in the LLM race is more advantageous.
(I'm mostly making this comment to document what happened for the history books.)
https://polymarket.com/event/which-company-has-best-ai-model...
On the chat side, it's also quite different, and I wouldn't be surprised if people need some time to get a taste and a preference for it. I ask most models to help me build a MacBook Pro charger in 15th-century Florence, with the instructions that I start with only my laptop and I can only talk for four hours of chat before the battery dies -- 5 was notable in that it thought through a bunch of second-order implications of plans and offered some unusual things, including a list of instructions for a foot-treadle-based split-ring commutator + generator in 15th-century Florentine Italian(!). I have no way of verifying if the Italian was correct.
Upshot - I think they did something very special with long context and iterative task management, and I would be surprised if they don't keep improving 5, based on their new branding and marketing plan.
That said, to me this is one of the first 'product release' moments in the frontier model space. 5 is not so much a model release as a polished-up, holes-fixed, annoyances-reduced/removed, 10x faster type of product launch. Google (current polymarket favorite) is remarkably bad at those product releases.
Back to betting - I bet there's a moment this year where those numbers change 10% in OpenAI's favor.
who will decide the winner to resolve bets?
https://polymarket.com/event/which-company-has-best-ai-model...
Not much explanation yet why GPT-5 warrants a major version bump. As usual, the model (and potentially OpenAI as a whole) will depend on output vibe checks.
How is this sustainable?
Not that it makes it useless, just that we seem to not "be there" yet for the standard tasks software engineers do every day.
Exactly. Too many videos - too little real data / benchmarks on the page. Will wait for vibe check from simonw and others
https://openai.com/gpt-5/?video=1108156668
2:40 "I do like how the pelican's feet are on the pedals." "That's a rare detail that most of the other models I've tried this on have missed."
4:12 "The bicycle was flawless."
5:30 Re generating documentation: "It nailed it. It gave me the exact information I needed. It gave me full architectural overview. It was clearly very good at consuming a quarter million tokens of rust." "My trust issues are beginning to fall away"
Edit: ohh he has blog post now: https://news.ycombinator.com/item?id=44828264
People knew that GPT-5 wouldn't be an AGI or even close to that. It's just an updated version. GPT-N will become more or less an annual release.
Pretty par-for-the-course evals-at-launch setup.
https://chatgpt.com/share/6895d5da-8884-8003-bf9d-1e191b11d3...
GPT-5 pricing: $10/Mtok out
What am I missing?
Meanwhile, Anthropic & Google have more room in their P/S ratios to continue to spend effort on logarithmic intelligence gains.
Doesn't mean we won't see more and more intelligent models out of OpenAI, especially in the o-series, but at some point you have to make payroll and reality hits.
Before the release of the model Sam Altman tweeted a picture of the Death Star appearing over the horizon of a planet.
We’re talking about less than a 10% performance gain, for a shitload of data, time, and money investment.
Maybe quantum compute would be significant enough of a computing leap to meaningfully move the needle again.
https://lmarena.ai/leaderboard
This is day one, so there is probably another 10-20% in optimizations that can be squeezed out of it in the coming months.
He also said that AGI was coming early 2025.
People that can't stop drinking the kool aid are really becoming ridiculous.
Diminishing returns.-
... here's hoping it leads to progress.-
They also announced gpt-5-pro but I haven't seen benchmarks on that yet.
https://bsky.app/profile/tylermw.com/post/3lvtac5hues2n
Edit: Opus 4.1 scores 74.5% (https://www.anthropic.com/news/claude-opus-4-1). This makes it sound like Anthropic released the upgrade to still be the leader on this important benchmark.
Or written by GPT-5?
https://imgur.com/a/QkriFco
You may not owe people who you feel are idiots better, but you owe this community better if you're participating in it.
https://news.ycombinator.com/newsguidelines.html
They've mentioned improvements in that aspect a few times now, and if it actually materializes, that would be a big leap forward for most users, even if underneath GPT-4 was also technically able to do the same things when prompted just the right way.
The jump from 3 to 4 was huge. There was an expectation for similar outputs here.
Making it cheaper is a good goal - certainly - but they needed a huge marketing win too.
But it's only an incremental improvement over the existing o line. So people feel like the improvement from the current OpenAI SoTA isn't there to justify a whole bump. They probably should have just called o1 GPT-5 last year.