Hey folks, OOP/original author and 20-year HN lurker here. A friend just told me about this, so I thought I'd chime in.
Reading through the comments, I think there's one key point that might be getting lost: this isn't really about whether scaling is "dead" (it's not), but rather how we continue to scale language models at the current LM frontier: 4-8h METR tasks.
Someone commented below about verifiable rewards and IMO that's exactly it: if you can find a way to produce verifiable rewards about a target world, you can essentially produce unlimited amounts of data and (likely) scale past the current bottleneck. Then the question becomes, working backwards from the set of interesting 4-8h METR tasks, what worlds can we make verifiable rewards for and how do we scalably make them? [1]
Which is to say, it's not about more data in general, it's about the specific kind of data (or architecture) we need to break a specific bottleneck. For instance, real-world data is indeed verifiable and will be amazing for robotics, etc., but that frontier is further behind: there are some cool labs building foundational robotics models, but they're maybe ~5 years behind LMs today.
[1] There's another path with better design, e.g. CLIP that improves both architecture and data, but let's leave that aside for now.
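To make "verifiable rewards" concrete, here's a minimal sketch of the kind of loop I mean, using a toy arithmetic "world" where correctness is trivially checkable. The task generator, verifier, and policy below are hypothetical stand-ins, not anyone's actual pipeline:

    import random

    def make_task():
        # Hypothetical task generator: arithmetic is trivially verifiable,
        # which is exactly what makes it a usable toy "target world".
        a, b = random.randint(1, 99), random.randint(1, 99)
        return {"prompt": f"What is {a} + {b}?", "answer": a + b}

    def verify(task, completion: str) -> float:
        # Verifiable reward: 1.0 if the answer checks out, else 0.0.
        # No human labeling in the loop, so data generation scales with compute.
        try:
            return float(int(completion.strip()) == task["answer"])
        except ValueError:
            return 0.0

    def policy(prompt: str) -> str:
        # Stand-in for sampling a completion from the current model.
        return str(random.randint(2, 198))

    # Sample tasks, sample completions, score them with the verifier, and keep
    # (prompt, completion, reward) tuples as fresh training data for RL.
    dataset = []
    for _ in range(1000):
        task = make_task()
        completion = policy(task["prompt"])
        dataset.append((task["prompt"], completion, verify(task, completion)))

    print(int(sum(r for _, _, r in dataset)), "verified-correct samples out of", len(dataset))

The hard part isn't the loop; it's finding worlds whose verifiers actually line up with the 4-8h tasks we care about.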
10+ years ago I expected we would get AI that would impact blue collar work long before AI that impacted white collar work. Not sure exactly where I got the impression, but I remember some "rising tide of AI" analogy and graphic that had artists and scientists positioned on the high ground.
Recently it doesn't seem to be playing out as such. The current best LLMs I find marvelously impressive (despite their flaws), and yet... where are all the awesome robots? Why can't I buy a robot that loads my dishwasher for me?
Last year this really started to bug me, and after digging into it with some friends I think we collectively realized something that may be a hint at the answer.
As far as we know, it took roughly 100M-1B years to evolve human-level "embodiment" (from single-celled organisms to humans), but only around 100k-1M years for humanity to evolve language, knowledge transfer, and abstract reasoning.
So it makes me wonder, is embodiment (advanced robotics) 1000x harder than LLMs from an information processing perspective?
> So it makes me wonder, is embodiment (advanced robotics) 1000x harder than LLMs from an information processing perspective?
Essentially, yes, but I would go further in saying that embodiment is harder than intelligence in and of itself.
I would argue that intelligence is a very simple and primitive mechanism compared to the evolved animal body, and the effectiveness of our own intelligence is circumstantial. We manage to dominate the world mainly by using brute force to simplify our environment and then maintaining and building systems on top of that simplified environment. If we didn't have the proper tools to selectively ablate our environment's complexity, the combinatorial explosion of factors would be too much to model and our intelligence would be of limited usefulness.
And that's what we see with LLMs: I think they model relatively faithfully what, say, separates humans from chimps, but they lack the animal library of innate world understanding which is supposed to ground intellect and stop it from hallucinating nonsense. They're trained on human language, which is basically the shadows in Plato's cave. They're very good at tasks that operate in that shadow world, like writing emails, or programming, or writing trite stories, but most of our understanding of the world isn't encoded in language, except very very implicitly, which is not enough.
What trips us up here is that we find language-related tasks difficult, but that's likely because the ability evolved recently, not because they are intrinsically difficult (likewise, we find mental arithmetic difficult, but it is not intrinsically so). As it turns out, language is simple. Programming is simple. I expect that logic and reasoning are also simple. The evolved animal primitives that actually interface with the real world, on the other hand, appear to be much more complicated (but time will tell).
Part of the answer to this puzzle is that your dishwasher itself is a robot that washes dishes, and has had enormous impact on blue collar jobs since its invention and widespread deployment. There are tons of labor saving devices out there doing blue collar work that we don't think of as robots or as AI.
We did.
Like, to the point that the AI that radically impacted blue collar work isn't even part of what is considered “AI” any more.
Not a robotics guy, but to the extent that the same fundamentals hold:
I think it's a degrees of freedom question. Given the (relatively) low conditional entropy of natural language, there aren't actually that many degrees of (true) freedom. On the other hand, in the real world, there are massively more degrees of freedom, both in general (3 dimensions, 6 degrees of movement per joint, M joints, continuous vs. discrete space, etc.) and also given the path dependence of actions, the non-standardized nature of actuators, kinematics, etc.
All in, you get crushed by the curse of dimensionality. Given N degrees of true freedom, you need O(exp(N)) data points to achieve the same performance. Folks do a bunch of clever things to address that dimensionality explosion, but I think the overly reductionist point still stands: although the real world is theoretically verifiable (and theoretically could produce infinite data), in practice we currently have exponentially less real-world data for an exponentially harder problem.
Real roboticists should chime in...
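To put rough numbers on that dimensionality argument, here's a back-of-the-envelope sketch; the bins-per-dimension resolution and the DOF counts are invented purely for illustration:

    # Toy illustration: covering a state space at a fixed resolution needs a
    # number of samples exponential in the degrees of freedom you must model.
    BINS_PER_DIM = 10  # arbitrary resolution, purely illustrative

    def naive_coverage_samples(dof: int) -> float:
        return float(BINS_PER_DIM ** dof)

    # e.g. a 7-joint arm with position + velocity per joint -> 14 "true" DOF,
    # vs. a low-dimensional language-like problem.
    for name, dof in [("toy language-like problem", 3), ("7-joint arm, pos+vel", 14)]:
        print(f"{name}: ~{naive_coverage_samples(dof):.0e} samples to tile the space")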
We think this because ten years ago we were all having our minds blown by DeepMind's game playing achievements and videos of dancing robots and thought this meant blue collar work would be solved imminently.
But most of these solutions were more crude than they let on, and you wouldn't really know unless you were working in AI already.
Watch John Carmack's recent talk at Upper Bound if you want to see him destroy like a trillion dollars worth of AI hype.
https://m.youtube.com/watch?v=rQ-An5bhkrs&t=11303s&pp=2AGnWJ...
Spoiler: we're nowhere close to AGI
> 10+ years ago I expected we would get AI that would impact blue collar work long before AI that impacted white collar work. Not sure exactly where I got the impression, but I remember some "rising tide of AI" analogy and graphic that had artists and scientists positioned on the high ground.
The moment you strip away the magical thinking and the humanization (bugs, not hallucinations), what you realize is that this is just progress. Ford in the 1960s putting in the first robot arms vs. auto manufacturing today. The phone: from switchboard operators, to mechanical switching, to digital, to... (I think the phone is in some odd hybrid era with text, but only time will tell). Draftsmen in the 1970s all replaced by AutoCAD by the 90s. Go further back to 1920: 30 percent of Americans were farmers; today that's less than 2 percent.
Humans, on very human scales, are very good at finding all new ways of making ourselves "busy" and "productive".
There is a lot of high quality text from diverse domains, there's a lot of audio or images or videos around. The largest robotics datasets are absolutely pathetic in size compared to that. We didn't collect or stockpile the right data in advance. Embodiment may be hard by itself, but doing embodiment in this data-barren wasteland is living hell.
So you throw everything but the kitchen sink at the problem. You pre-train on non-robotics data to squeeze transfer learning for all it's worth, you run hard sims, a hundred flavors of data augmentation, you get hardware and set up actual warehouses with test benches where robots try their hand at specific tasks to collect more data.
And all of that combined only gets you to "meh" real world performance - slow, flaky, fairly brittle, and on relatively narrow tasks. Often good enough for an impressive demo, but not good enough to replace human workers yet.
There's a reason why a lot of those bleeding edge AI powered robots are designed for and ship with either teleoperation capabilities, or demonstration-replay capabilities. Companies that are doing this hope to start pushing units first, and then use human operators to start building up some of the "real world" datasets they need to actually train those robots to be more capable of autonomous operation.
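For what it's worth, that teleoperation-to-training pipeline usually boils down to some form of behavior cloning on the logged (observation, action) pairs. A minimal sketch with made-up data and a linear policy; real systems use far richer observations and deep policies:

    import numpy as np

    # Pretend these are logged teleoperation demos: observation -> operator action.
    rng = np.random.default_rng(0)
    obs = rng.normal(size=(5000, 12))            # e.g. joint angles + gripper pose
    true_w = rng.normal(size=(12, 7))            # the operator's (unknown) toy policy
    actions = obs @ true_w + 0.05 * rng.normal(size=(5000, 7))  # 7-DOF commands

    # Behavior cloning = supervised regression from observations to actions.
    # Ordinary least squares stands in here for training a deep policy network.
    w_hat, *_ = np.linalg.lstsq(obs, actions, rcond=None)

    mse = float(np.mean((obs @ w_hat - actions) ** 2))
    print(f"clone MSE on the demos: {mse:.4f}")

The catch is that every one of those (observation, action) pairs has to come from a physical robot plus a human operator, which is exactly the data bottleneck described above.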
Having to deal with Capital H Hardware is the big non-AI issue. You can push ChatGPT to 100 million devices, as long as you have a product people want to use for the price of "free", and the GPUs to deal with inference demand. You can't materialize 100 million actual physical robot bodies out of nowhere for free, GPUs or no GPUs. Scaling up is hard and expensive.
Embodiment is 1000x harder from a physical perspective.
Look at how hard it is for us to make reliable laptop hinges, or at the articulated car door handle trend (started by Tesla), where the handles constantly break.
These are simple mechanisms compared to any animal or human body. Our bodies last up to 80-100 years through not just constant regeneration but organic super-materials that rival anything synthetic in terms of durability within their spec range. Nature is full of this, like spider silk that is much stronger than steel, or joints that can take repeated impacts for decades. This is what hundreds of millions to billions of years of evolution gets you.
We can build robots this good but they are expensive, so expensive that just hiring someone to do it manually is cheaper. So the problem is that good quality robots are still much more expensive than human labor.
The only areas where robots have replaced human labor are where the economics work, like huge volume manufacturing, or where humans can’t easily go or can’t perform. The latter includes tasks like lifting and moving things thousands of times larger than humans can, or environments like high temperatures, deep space, the bottom of the ocean, radioactive environments, etc.
The problem is not the robot loading the dishwasher, it is the dishwasher. The dishwasher (and general kitchen electronics) industry has not innovated in a long time.
My prediction is a new player will come in who vertically integrates these currently disjoint industries and products. The tableware used should be compatible with the dishwasher, the packaging of my groceries should be compatible with the cooking system. Like a mini-factory.
But current vendors have no financial incentive to do so, because if you take a step back, the whole notion of filling one room of your apartment with random electronics just to cook a meal once in a blue moon is deeply inefficient. End-to-end food automation is coming to the restaurant business, and I hope it pushes prices of meals so far down that having a dedicated room for a kitchen in the apartment is simply not worth it.
That's the "utopia" version of things.
In reality, we see prices for fast food (the most automated food business) going up while quality is going down. Does it make the established players more vulnerable to disruption? I think so.
> 10+ years ago I expected we would get AI that would impact blue collar work long before AI that impacted white collar work.
I'm not sure where people get this impression from, even back decades ago. Hardware is always harder than software. We had chess engines in the 20th century but a robotic hand that could move pieces? That was obviously not as easy because dealing with the physical world always has issues that dealing with the virtual doesn't.
Robots are only harder because they have expensive hardware. We already have robots that can load dishwashers and do other manual work but humans are cheaper so there isn't much of a market for them.
The rising tide idea came from a 1997 paper by Moravec. Here's a nice graphic and subsequent history https://lifearchitect.ai/flood/
Interestingly, Moravec also stated: "When the highest peaks are covered, there will be machines that can interact as intelligently as any human on any subject. The presence of minds in machines will then become self-evident." We pretty much have those today, so by 1997 standards, machines have minds, yet somehow we moved the goalposts and decided that doesn't count anymore. Even if LLMs end up being strictly more capable than every human on every subject, I'm sure we'll find some new excuse why they don't have minds or aren't really intelligent.
> if you can find a way to produce verifiable rewards about a target world
I feel like there's an interesting symmetry here between the pre- and post-LLM world, where I've always found that organisations over-optimise for things they can measure (e.g. balance sheets) and under-optimise for things they can't (e.g. developer productivity), which explains why it's so hard to keep a software product up to date in an average org, as the natural pressure is to run it into the ground until a competitor suddenly displaces it.
So in a post-LLM world, we have this gaping hole around things we either lack the data for, or as you say: lack the ability to produce verifiable rewards for. I wonder if similar patterns might play out as a consequence, and what unmodelled, unrecorded, real-world things will be entirely ignored (perhaps to great detriment) because we simply lack a decent measure/verifiable-reward for them.
> if you can find a way to produce verifiable rewards about a target world
I have significant experience modelling the physical world (mostly CFD, but also gamedev, with realistic rigid body collisions and friction).
I admit there is a domain (a range of parameters) where CFD and game physics work just fine; there is a predictable domain (on the borders of the well-working one) where they work well enough but can show strange things; and there is a domain where you will see lots of bugs.
And current computing power is such that even at small-business scale (just a median gamer desktop), we could replace more than 90% of real-world tests with simulations in the well-working domain (and simply avoid use cases in the unreliable domains).
So I think the main question is just conservative bosses and investors, who don't trust engineers and don't understand how to check (and tune) simulations against real-world tests, or what the reliable domain is.
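A toy sketch of that "check the simulation against real-world tests and map the reliable domain" workflow; the drag model, the faked measurements, and the 10% threshold are all invented for illustration:

    import numpy as np

    def sim_drag_force(velocity, area, cd=0.8, rho=1.225):
        # Simplified simulation: quadratic drag, tuned for moderate velocities.
        return 0.5 * rho * cd * area * velocity ** 2

    def real_test_drag(velocity, area):
        # Stand-in for physical measurements; we fake a regime change at high
        # velocity that the simple simulation does not capture.
        base = 0.5 * 1.225 * 0.8 * area * velocity ** 2
        return base * (1.0 + 0.4 * np.clip(velocity - 40.0, 0.0, None) / 40.0)

    velocities = np.linspace(5, 80, 16)
    rel_err = np.abs(sim_drag_force(velocities, 1.0) - real_test_drag(velocities, 1.0)) \
              / real_test_drag(velocities, 1.0)

    # "Reliable domain" = the parameter range where the sim stays within 10%
    # of the physical tests; outside it, fall back to real-world testing.
    reliable = velocities[rel_err < 0.10]
    print(f"trust the sim for v in [{reliable.min():.0f}, {reliable.max():.0f}] m/s")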
> rather how we continue to scale for language models at the current LM frontier — 4-8h METR tasks
I wonder if this doesn't reify a particular business model, of creating a general model and then renting it out SaaS-style (possibly adapted to largish customers).
It reminds me of the early excitement over mainframes, how their applications were limited by the rarity of access, and how vigorously those trained in those fine arts defended their superiority. They just couldn't compete with the hordes of smaller competitors getting into every niche.
It may instead be that customer data and use cases are both the most relevant and the most profitable. An AI that could adopt a small user model and track and apply user use cases would have an entirely different structure, and would have demonstrable price/performance ratios.
This could mean if Apple or Google actually integrated AI into their devices, they could have a decisive advantage. Or perhaps there's a next generation of web applications that model use-cases and interactions. Indeed, Cursor and other IDE companies might have a leg up if they can drive towards modeling the context instead of just feeding it as intention to the generative LLM.
Since you seem to know your stuff, why do LLMs need so much data anyway? Humans don't. Why can't we make models aware of their own uncertainty, e.g. feeding the variance of the next-token distribution back into the model, as a foundation to guide their own learning? Maybe with that kind of signal, LLMs could develop 'curiosity' and 'rigorousness' and seek out the data that best refines them. Let the AI make and test its own hypotheses, using formal mathematical systems, during training.
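Not how any current LLM is trained, as far as I know, but the uncertainty-signal idea can at least be prototyped at the data-selection level. A rough sketch using the entropy of the next-token distribution as a curiosity score; the logits here are random stand-ins for a real model's outputs:

    import numpy as np

    def next_token_entropy(logits: np.ndarray) -> float:
        # Softmax over the vocabulary, then Shannon entropy (in nats).
        # High entropy = the model is unsure about what comes next.
        z = logits - logits.max()
        p = np.exp(z) / np.exp(z).sum()
        return float(-(p * np.log(p + 1e-12)).sum())

    rng = np.random.default_rng(0)
    corpus = [f"doc-{i}" for i in range(8)]
    # Stand-in for running each document through the model and grabbing the
    # logits at some position; a real implementation would query the actual LM.
    logits_per_doc = [rng.normal(scale=rng.uniform(0.1, 3.0), size=50_000) for _ in corpus]

    scores = {doc: next_token_entropy(l) for doc, l in zip(corpus, logits_per_doc)}
    # "Curiosity": prioritize the documents the model is most uncertain about.
    for doc, s in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
        print(f"{doc}: entropy {s:.2f} nats -> send to training next")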
My focus lately is on the cost side of this. I believe strongly that it's possible to reduce the cost of compute for LLM-type loads by 95% or more. Personally, I've found it incredibly hard to get actual numbers for static and dynamic power in ASIC designs to be sure about this.
If I'm right (which I give 50/50 odds), and we can reduce the power of LLM computation by 95%, trillions can be saved in power bills, and we can break the need for Nvidia or other specialists, and get back to general purpose computation.
> there are some cool labs building foundational robotics models, but they're maybe ~5 years behind LMs today
Wouldn't the Bitter Lesson be to invest in those models over trying to be clever about eking out a little more oomph from today's language models (and language-based data)?
I believe he is referring to OpenAI's proposal to move beyond training with pure text and instead train with multimodal data. Instead of only the dictionary definition of an apple, train it with a picture of an apple, a video of someone eating an apple, etc.
Do you mean challenges for which the answer is known?
> this isn't really about whether scaling is "dead"
I think there's a good position paper by Sara Hooker [0] that mentions some of this. The key point being that while the frontier is being pushed by big models with big data, there's a very quiet revolution of models using far fewer parameters (still quite big) and less data. Maybe "Scale Is All You Need" [1], but that doesn't mean it is practical or even a good approach. It's a shame these research paths have gotten a lot of pushback, especially given today's concerns about inference costs (and this pushback still doesn't seem to be decreasing).
> verifiable rewards
There's also a current conversation in the community over world models: is it actually a world model if the model does not recover /a physics/ [2]? The argument for why they should recover a physics is that this means a counterfactual model must have been learned (no guarantees on whether it is computationally irreducible). A counterfactual model gives far greater opportunities for robust generalization. In fact, you could even argue that the study of physics is the study of compression. In a sense, physics is the study of the computability of our universe [3]. Physics is counterfactual, allowing you to answer counterfactual questions like "What would the force have been if the mass had been 10x greater?" If this were not counterfactual, we'd require different algorithms for different cases.
I'm in the recovery camp. Honestly I haven't heard a strong argument against it. Mostly "we just care that things work" which, frankly, isn't that the primary concern of all of us? I'm all for throwing shit at a wall and seeing what sticks, it can be a really efficient method sometimes (especially in early exploratory phases), but I doubt it is the most efficient way forward.
In my experience, having been a person who's created models that require orders of magnitude fewer resources for equivalent performance, I cannot stress enough the importance of quality over quantity. The tricky part is defining that quality.
[0] https://arxiv.org/abs/2407.05694
[1] Personally, I'm unconvinced. Despite the success of our LLMs, it's difficult to decouple the other variables.
[2] The "a" is important here. There's not one physics per se. There are different models. This is a level of metaphysics most people will not encounter and has many subtleties.
[3] I must stress that there's a huge difference between the universe being computable and the universe being a computation. The universe being computable does not mean we all live in a simulation.
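To make the counterfactual point concrete with the F = ma example from above: a model that has recovered the law can answer "what if the mass were 10x greater" directly, while a pure record of past observations cannot. Everything below is a toy illustration, not a claim about any specific world-model architecture:

    # A model that recovered "a physics" (here, F = m * a) supports counterfactuals.
    def force_law(mass: float, accel: float) -> float:
        return mass * accel

    observed = {"mass": 2.0, "accel": 3.0}
    factual = force_law(observed["mass"], observed["accel"])
    counterfactual = force_law(10 * observed["mass"], observed["accel"])
    print(f"observed F = {factual}, counterfactual (10x mass) F = {counterfactual}")

    # A memorized table of past observations has no answer for inputs it never saw.
    lookup_table = {(2.0, 3.0): 6.0}
    query = (10 * observed["mass"], observed["accel"])
    print("lookup-table model:", lookup_table.get(query, "no idea"))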
Just using common sense, if we had a genius, who had tremendous reasoning ability, total recall of memories, and an unlimited lifespan and patience, and he'd read what the current LLMs have read, we'd expect quite a bit more from him than what we're getting now from LLMs.
There are teenagers who win gold medals at the math olympiad - they've trained on < 1M tokens of math texts, never mind the 70T tokens that GPT-5 appears to be trained on. A difference of eight orders of magnitude.
In other words, data scarcity is not a fundamental problem, just a problem for the current paradigm.
If we can reduce the precision of the model parameters by 2~32x without much perceptible drop in performance, we are clearly dealing with something wildly inefficient.
I'm open to the possibility that over-parameterization is essential as part of the training process, much like how MSAA/SSAA oversample the frame buffer to reduce information aliasing in the final scaled result (also wildly inefficient but very effective generally). However, I think for more exotic architectures (spiking / time domain) these rules don't work the same way. You can't backpropagate through a recurrent SNN, so much of the prevailing machine learning mindset doesn't even apply.
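To put a number on the precision point, here's a toy illustration of symmetric per-tensor int8 quantization of a random weight matrix; nothing here reflects how any particular model is actually quantized:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)  # fake layer weights

    # Symmetric per-tensor int8 quantization: 32 bits -> 8 bits per parameter.
    scale = float(np.abs(w).max()) / 127.0
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    w_dequant = w_int8.astype(np.float32) * scale

    rel_err = float(np.linalg.norm(w - w_dequant) / np.linalg.norm(w))
    print(f"memory: {w.nbytes / 1e6:.0f} MB -> {w_int8.nbytes / 1e6:.0f} MB, "
          f"relative weight error: {rel_err:.2%}")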
It’s not clear that the inefficiency of the current paradigm is in the neural net architectures. It seems just as likely that it’s in the training objective.
Yes - we train only on a subset of human communication, the one using written symbols (even voice has much much more depth to it), but human brains train on the actual physical world.
Human students who have only learned some new words but have not (yet) even begun to really comprehend a subject will just throw around random words and sentences that sound great but have no basis in reality too.
For the same sentence, for example, "We need to open a new factory in country XY", the internal model lighting up inside the brain of someone who has actually participated when this was done previously will be much deeper and larger than that of someone who only heard about it in their course work. That same depth is zero for an LLM, which only knows the relations between words and has no representation of the world. Words alone cannot even begin to represent what the model created from the real-world sensors' data, which on top of the direct input is also based on many times compounded and already-internalized prior models (nobody establishes that new factory as a newly born baby with a fresh neural net; actually, even the newly born has inherited instincts that are all based on accumulated real world experiences, including the complex structure of the brain itself).
Somewhat similarly, situations reported in comments like this one (client or manager vastly underestimating the effort required to do something): https://news.ycombinator.com/item?id=45123810
The internal model for a task of those far removed from actually doing it is very small compared to the internal models of those doing the work, so trying to gauge required effort falls short spectacularly if they also don't have the awareness.
I'm not sure what point you are trying to make. Are you saying that in order to make LLMs better at learning, the missing piece is to make them capable of interacting with the outside world? Give them actuators and sensors?
This sentence really struck me in a particular way. Very interesting. It does seem like thoughts/stream of consciousness is just your brain generating random tokens to itself and learning from it lol.
My brain only needs to get mugged in a dark alley by a guy in a hoodie once to learn something.
"How could a telescope see saturn, human eyes have billions of years of evolution behind them, and we only made telescopes a few hundred years ago, so they should be much weaker than eyes"
"How can StockFish play chess better than a human, the human brain has had billions of years of evolution"
Evolution is random, slow, and does not mean we arrive at even a local optimum.
To be fair, GPT-5 didn't start off as a blank slate. The architecture probably encodes a lot, much like how DNA encodes a lot. The former requires human writing to decompress into a human-like thing, the latter requires the Earth environment and a woman to decompress into a human organism.
But it's indeed apples and oranges. There's no good way to estimate the information encoded by the GPT architecture compared to human DNA. We just have to be empirical and look at what the thing can do.
The problem I am facing in my domain is that all of the data is human generated and riddled with human errors. I am not talking about typos in phone numbers, but rather fundamental errors in critical thinking, reasoning, semantic and pragmatic oversights, etc. all in long-form unstructured text. It's very much an LLM-domain problem, but converging on the existing data is like trying to converge on noise.
The opportunity in the market is the gap between what people have been doing and what they are trying to do, and I have developed very specialized approaches to narrow this gap in my niche, and so far customers are loving it.
I seriously doubt that the gap could ever be closed by throwing more data and compute at it. I imagine though that the outputs of my approach could be used to train a base model to close the gap at a lower unit cost, but I am skeptical that it would be economically worthwhile anytime soon.
This is one reason why verifiable rewards works really well, if it's possible for a given domain. Figuring out how to extract signal and verify it for an RL loop will be very popular for a lot of niche fields.
This is my current drum I bang on when an uninformed stakeholder tries shoving LLMs blindly down everyone’s throats: it’s the data, stupid. Current data aggregates outside of industries wholly dependent on data (so anyone not in web advertising, GIS, or intelligence) are garbage, riddled with errors and in awful structures that are opaque to LLMs. For your AI strategy to have any chance of success, your data has to be pristine and fresh, otherwise you’re lighting money on fire.
Throwing more compute and data at the problem won’t magically manifest AGI. To reach those lofty heights, we must first address the gaping wounds holding us back.
Yes, for me both customers and colleagues continually suggested "hey let's just take all these samples of past work and dump it in the magical black box and then replicate what they have been doing".
Instead I developed a UX that made it as easy as possible for people to explain what they want to be done, and a system that then goes and does that. Then we compare the system's output to their historical data and there is always variance, and when the customer inspects the variance they realize that their data was wrong and the system's output is far more accurate and precise than their process (and ~3 orders of magnitude cheaper). This is around when they ask how they can buy it.
This is the difference between making what people actually want and what they say they want: it's untangling the why from the how.
I've worked in 2 of those domains (I was a geographer at a web advertising company) and let me tell you, the data is only slightly better than the median industry and in the case of the geodata from apps I'd say it's far, far, far worse.
When studying human-created data, you always need to be aware of these factors, including bias from doctrines, such as religion, older information becoming superseded, outright lies and misinformation, fiction, etc. You can't just swallow it all uncritically.
We still got pretty far by scraping internet data which we all know is not fully trustworthy.
I don't think Sutton's essay is misunderstood, but I agree with the OP's conclusion:
We're reaching scaling limits with transformers. The number of parameters in our largest transformers, N, is now on the order of trillions, which is the most we can apply given the total number of tokens of training data available worldwide, D, also on the order of trillions, resulting in a compute budget C = 6N × D, which is on the order of D². OpenAI and Google were the first to show these transformer "scaling laws." We cannot add more compute to a given compute budget C without increasing data D to maintain the relationship. As the OP puts it, if we want to increase the number of GPUs by 2x, we must also increase the number of parameters and training tokens by 1.41x, but... we've already run out of training tokens.
We must either (1) discover new architectures with different scaling laws, and/or (2) compute new synthetic data that can contribute to learning (akin to dreams).
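Writing out the arithmetic behind the 2x-compute / 1.41x-data claim; the 20-tokens-per-parameter ratio is the commonly cited Chinchilla rule of thumb, used here only for illustration:

    # Chinchilla-style rule of thumb: compute-optimal training uses D ≈ 20 * N tokens,
    # with total training compute C ≈ 6 * N * D. Substituting D = 20N gives C ≈ 120 N²,
    # i.e. compute grows with the square of N (and of D), so N and D grow as sqrt(C).

    def compute_optimal_n_d(compute_flops: float, tokens_per_param: float = 20.0):
        n = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
        return n, tokens_per_param * n

    c = 1e25  # arbitrary training budget in FLOPs
    n1, d1 = compute_optimal_n_d(c)
    n2, d2 = compute_optimal_n_d(2 * c)  # double the GPUs / training compute
    print(f"params x{n2 / n1:.3f}, tokens x{d2 / d1:.3f}")  # both ≈ 1.414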
This is true for the pre-training step. But what if advancements in the reinforcement learning steps performed later can benefit from more compute and more model parameters? If right now the RL steps only help with sampling, that is, they only make the model prefer one possible reply over another (there are papers pointing at this: if you generate many replies with just the common sampling methods, and you can verify the correctness of the replies, then you discover that what RL helps with is selecting what was already potentially within the model's output), this would be futile. But maybe advancements in RL will do to LLMs what AlphaZero-like models did for Chess/Go.
It's possible. We're talking about pretraining meaningfully larger models past the point at which they plateau, only to see if they can improve beyond that plateau with RL. Call it option (3). No one knows if it would work, and it would be very expensive, so only the largest players can try it, but why the heck not?
> We cannot add more compute to a given compute budget C without increasing data D to maintain the relationship.
> We must either (1) discover new architectures with different scaling laws, and/or (2) compute new synthetic data that can contribute to learning (akin to dreams).
Of course we can; this is a non-issue.
See e.g. AlphaZero [0] that's 8 years old at this point, and any modern RL training using synthetic data, e.g. DeepSeek-R1-Zero [1].
[0] https://en.m.wikipedia.org/wiki/AlphaZero
[1] https://arxiv.org/abs/2501.12948
AlphaZero trained itself through chess games that it played with itself. Chess positions have something very close to an objective truth about the evaluation, the rules are clear and bounded. Winning is measurable. How do you achieve this for a language model?
Yes, distillation is a thing but that is more about compression and filtering. Distillation does not produce new data in the same way that chess games produce new positions.
That's option (2) in the parent comment: synthetic data.
To be clear I also agree with your (1) and (2).
That's the endgame, but on the other hand, we already have one, it's called "humanity". No reason to believe that another one would be much cheaper. Interacting with the real world is __expensive__. It's the most expensive thing of all.
If C = D^2, and you double compute, then 2C ==> 2D^2. How do you and the original author get 1.41D from 2D^2?
In other words, the required amount of data scales with the square root of the compute. The square root of 2 ~= 1.414. If you double the compute, you need roughly 1.414 times more data.
I disagree with the author's thesis about data scarcity.
There's an infinite amount of data available in the real world. The real world is how all generally intelligent humans have been trained. Currently, LLMs have just been trained on the derived shadows (as in Plato's allegory of the cave). The grounding to base reality seems like an important missing piece. The other data type missing is the feedback: more than passively training/consuming text (and images/video), being able to push on the chair and have it push back.
Once the AI can more directly and recursively train on the real world, my guess is we'll see Sutton's bitter lesson proven out once again.
I don't know about that. LLMs have been trained mostly on text. If you add photos, audio and videos, and later even 3D games, or 3D videos, you get massively more data than the old plain text. Maybe by many orders of magnitude. And this is certainly something that can improve cognition in general. Getting to AGI without audio and video, and 3D perception seems like a non-starter. And even if we think AGI is not the goal, further improvements from these new training datasets are certainly conceivable.
That's been done already for years. OpenAI were training on bulk AI-transcribed YouTube vids already in the GPT-4 era. Modern models are all multi-modal and co-trained on audio and image tokens together with text.
The AI companies are not only out of such data but their access to it is shrinking as the people who control the hosting sites wall them off (like YouTube).
Also, even if we lacked the data to proceed with Chinchilla-optimal scaling, that wouldn't be the same as being unable to proceed with scaling; it would just require larger models and more flops than we would prefer.
darknets, the deep web, Usenet, BBS, Internet2, and all other paywalled archives.
while I don't disagree with the facts, I don't understand the... tone?
when Dennard scaling (single core performance) started to fail in the 90s-00s, I don't think there was a sentiment "how stupid was it to believe such a scaling at all"?
sure, people were compliant (and we still meme about running Crysis), but in the end the discussion resulted in "no more free lunch" - progress in one direction has hit a bottleneck, so it's time to choose some other direction to improve on (and multi-threading has now become mostly the norm)
I don't really see much of a difference?
I am not an expert in AI by any means but I think I know enough about it to comment on one thing: there was an interesting paper not too long ago that showed that if you train a randomly-initialized model from scratch on questions, like a bank of physics questions & answers, the model will end up with much higher quality if you teach it the simple physics questions first, and then move up to more complex physics questions. This shows that in some ways, these large language models really do learn like we do.
I think the next steps will be more along this vein of thinking. Treating all training data the same is a mistake. Some data is significantly more valuable to developing an intelligent model than most other training data, even when you pass quality filters. I think we need to revisit how we 'train' these models in the first place, and come up with a more intelligent/interactive system of doing so.
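A minimal sketch of what that curriculum ordering might look like in practice; the difficulty scores and the training step are placeholders, not anyone's actual recipe:

    import random

    # Placeholder corpus: each item carries a difficulty score, e.g. from a
    # heuristic (equation depth, number of reasoning steps) or a grader model.
    random.seed(0)
    corpus = [{"question": f"physics Q{i}", "difficulty": random.random()} for i in range(1000)]

    def train_on(batch):
        pass  # stand-in for optimizer steps on the actual model

    # Curriculum: sort by difficulty and train in phases from easy to hard,
    # instead of sampling the whole corpus uniformly from step one.
    corpus.sort(key=lambda ex: ex["difficulty"])
    n_phases = 4
    phase_size = len(corpus) // n_phases
    for phase in range(n_phases):
        # Each phase replays everything easier than its hardest example.
        train_on(corpus[: (phase + 1) * phase_size])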
From my personal experience training models this is only true when the parameter count is a limiting factor. When the model is past a certain size, it doesn't really lead to much improvement to use curriculum learning. I believe most research also applies it only to small models (e.g. Phi)
This is precisely why chain of thought worked. Written thoughts in plain English are a much higher-SNR encoding of the human brain's inner workings than random pages scraped from Amazon. We just want the model to recover the brain, not Amazon's frontend web framework.
Wow. I really like this take. I've seen how time and time again nature follows the Pareto principle. It makes sense that training data would follow this principle as well.
Further that the order of training matters is novel to me and seems so obvious in hindsight.
Maybe both of these points are common knowledge/practice among current leading LLM builders. I don't build LLMs, I build on and with them, so I don't know.