The Bitter Lesson Is Misunderstood

Hey folks, OOP/original author and 20-year HN lurker here — a friend just told me about this and thought I'd chime in.

Reading through the comments, I think there's one key point that might be getting lost: this isn't really about whether scaling is "dead" (it's not), but rather how we continue to scale for language models at the current LM frontier — 4-8h METR tasks.

Someone commented below about verifiable rewards and IMO that's exactly it: if you can find a way to produce verifiable rewards about a target world, you can essentially produce unlimited amounts of data and (likely) scale past the current bottleneck. Then the question becomes, working backwards from the set of interesting 4-8h METR tasks, what worlds can we make verifiable rewards for and how do we scalably make them? [1]

Which is to say, it's not about more data in general, it's about the specific kind of data (or architecture) we need to break a specific bottleneck. For instance, real-world data is indeed verifiable and will be amazing for robotics, etc. but that frontier is further behind: there are some cool labs building foundational robotics models, but they're maybe ~5 years behind LMs today.

[1] There's another path with better design, e.g. CLIP that improves both architecture and data, but let's leave that aside for now.

FloorEgg · 3 days ago

10+ years ago I expected we would get AI that would impact blue collar work long before AI that impacted white collar work. Not sure exactly where I got the impression, but I remember some "rising tide of AI" analogy and graphic that had artists and scientists positioned on the high ground.

Recently it doesn't seem to be playing out as such. The current best LLMs I find marvelously impressive (despite their flaws), and yet... where are all the awesome robots? Why can't I buy a robot that loads my dishwasher for me?

Last year this really started to bug me, and after digging into it with some friends I think we collectively realized something that may be a hint at the answer.

As far as we know, it took roughly 100M-1B years to evolve human level "embodiment" (evolve from single celled organisms to human), but it only took around ~100k-1M for humanity to evolve language, knowledge transfer and abstract reasoning.

So it makes me wonder, is embodiment (advanced robotics) 1000x harder than LLMs from an information processing perspective?

breuleux · 3 days ago

> So it makes me wonder, is embodiment (advanced robotics) 1000x harder than LLMs from an information processing perspective?

Essentially, yes, but I would go further in saying that embodiment is harder than intelligence in and of itself.

I would argue that intelligence is a very simple and primitive mechanism compared to the evolved animal body, and the effectiveness of our own intelligence is circumstantial. We manage to dominate the world mainly by using brute force to simplify our environment and then maintaining and building systems on top of that simplified environment. If we didn't have the proper tools to selectively ablate our environment's complexity, the combinatorial explosion of factors would be too much to model and our intelligence would be of limited usefulness.

And that's what we see with LLMs: I think they model relatively faithfully what, say, separates humans from chimps, but it lacks the animal library of innate world understanding which is supposed to ground intellect and stop it from hallucinating nonsense. It's trained on human language, which is basically the shadows in Plato's cave. It's very good at tasks that operate in that shadow world, like writing emails, or programming, or writing trite stories, but most of our understanding of the world isn't encoded in language, except very very implicitly, which is not enough.

What trips us up here is that we find language-related tasks difficult, but that's likely because the ability evolved recently, not because they are intrinsically difficult (likewise, we find mental arithmetic difficult, but it not intrinsically so). As it turns out, language is simple. Programming is simple. I expect that logic and reasoning are also simple. The evolved animal primitives that actually interface with the real world, on the other hand, appear to be much more complicated (but time will tell).

dragonwriter · 3 days ago

> 10+ years ago I expected we would get AI that would impact blue collar work long before AI that impacted white collar work.

We did.

Like, to the point that the AI that radically impacted blue collar work isn't even part of what is considered “AI” any more.

chrchr · 3 days ago

Part of the answer to this puzzle is that your dishwasher itself is a robot that washes dishes, and has had enormous impact on blue collar jobs since its invention and widespread deployment. There are tons of labor saving devices out there doing blue collar work that we don't think of as robots or as AI.

kushalc · 3 days ago

Not a robotics guy, but to extent that the same fundamentals hold—

I think it's a degrees of freedom question. Given the (relatively) low conditional entropy of natural language, there aren't actually that many degrees of (true) freedom. On the other hand, in the real world, there are massively more degrees of freedom both in general (3 dimensions, 6 degrees of movement per joint, M joints, continuous vs. discrete space, etc.) and also given the path dependence of actions, the non-standardized nature of actuators, actuators, kinematics, etc.

All in, you get crushed by the curse of dimensionality. Given N degrees of true freedom, you need O(exp(N)) data points to achieve the same performance. Folks do a bunch of clever things to address that dimensionality explosion, but I think the overly reductionist point still stands: although the real world is theoretically verifiable (and theoretically could produce infinite data), in practice we currently have exponentially less real-world data for an exponentially harder problem.

Real roboticists should chime in...

mbac32768 · 3 days ago

We think this because ten years ago we were all having our minds blown by DeepMind's game playing achievements and videos of dancing robots and thought this meant blue collar work would be solved imminently.

But most of these solutions were more crude than they let on, and you wouldn't really know unless you were working in AI already.

Watch John Carmack's recent talk at Upper Bound if you want him to see him destroy like a trillion dollars worth of AI hype.

https://m.youtube.com/watch?v=rQ-An5bhkrs&t=11303s&pp=2AGnWJ...

Spoiler: we're nowhere close to AGI

zer00eyz · 3 days ago

> 10+ years ago I expected we would get AI that would impact blue collar work long before AI that impacted white collar work. Not sure exactly where I got the impression, but I remember some "rising tide of AI" analogy and graphic that had artists and scientists positioned on the high ground.

The moment you strip away the magical thinking, the humanization (bugs not hallucinations) what you realize is that this is just progress. Ford in the 1960's putting in the first robot arms vs auto manufacturing today. The phone: from switch board operators, to mechanical switching to digital to... (I think phone is in some odd hybrid era with text but only time will tell). Draftsmen in the 1970's all replaced by autocad by the 90's. GO further back to 1920, 30 percent of Americans were farmers, today thats less than 2.

Humans, on very human scales are very good at finding all new ways of making ourselves "busy" and "productive".

ACCount37 · 3 days ago

The big robot AI issue is: no data!

There is a lot of high quality text from diverse domains, there's a lot of audio or images or videos around. The largest robotics datasets are absolutely pathetic in size compared to that. We didn't collect or stockpile the right data in advance. Embodiment may be hard by itself, but doing embodiment in this data-barren wasteland is living hell.

So you throw everything but the kitchen sink at the problem. You pre-train on non-robotics data to squeeze transfer learning for all its worth, you run hard sims, a hundred flavors of data augmentation, you get hardware and set up actual warehouses with test benches where robots try their hand at specific tasks to collect more data.

And all of that combined only gets you to "meh" real world performance - slow, flaky, fairly brittle, and on relatively narrow tasks. Often good enough for an impressive demo, but not good enough to replace human workers yet.

There's a reason why a lot of those bleeding edge AI powered robots are designed for and ship with either teleoperation capabilities, or demonstration-replay capabilities. Companies that are doing this hope to start pushing units first, and then use human operators to start building up some of the "real world" datasets they need to actually train those robots to be more capable of autonomous operation.

Having to deal with Capital H Hardware is the big non-AI issue. You can push ChatGPT to 100 million devices, as long as you have a product people want to use for the price of "free", and the GPUs to deal with inference demand. You can't materialize 100 million actual physical robot bodies out of nowhere for free, GPUs or no GPUs. Scaling up is hard and expensive.

api · 3 days ago

Embodiment is 1000x harder from a physical perspective.

Look at how hard it is for us to make reliable laptop hinges or the articulated car door handle trend (started by Tesla) where they constantly break.

These are simple mechanisms compared to any animal or human body. Our bodies last up to 80-100 years through not just constant regeneration but organic super-materials that rival anything synthetic in terms of durability within its spec range. Nature is full of this, like spider silk much stronger than steel or joints that can take repeated impacts for decades. This is what hundreds of millions to billions of years of evolution gets you.

We can build robots this good but they are expensive, so expensive that just hiring someone to do it manually is cheaper. So the problem is that good quality robots are still much more expensive than human labor.

The only areas where robots have replaced human labor is where the economics work, like huge volume manufacturing, or where humans can’t easily go or can’t perform. The latter includes tasks like lifting and moving things thousands of times larger than humans can or environments like high temperatures, deep space, the bottom of the ocean, radioactive environments, etc.

bflesch · 3 days ago

The problem is not the robot loading the diswasher, it is the dishwasher. The dishwasher (and general kitchen electronics) industry has not innovated in a long time.

My prediction is a new player will come in who vertically integrates these currently disjoint industries and product. The tableware used should be compatible with the dishwasher, the packaging of my groceries should be compatible with the cooking system. Like a mini-factory.

But current vendors have no financial incentive to do so, because if you take a step back the whole notion of putting one room of your apartment full with random electronics just to cook a meal once in a blue moon is deeply inefficient. End-to-end food automation is coming to the restaurant business, and I hope it pushes prices of meals so far down that having a dedicated room for a kitchen in the apartment is simply not worth it.

That's the "utopia" version of things.

In reality, we see prices for fast food (the most automated food business) going up while quality is going down. Does it make the established players more vulnerable to disruption? I think so.

petralithic · 3 days ago

> 10+ years ago I expected we would get AI that would impact blue collar work long before AI that impacted white collar work.

I'm not sure where people get this impression from, even back decades ago. Hardware is always harder than software. We had chess engines in the 20th century but a robotic hand that could move pieces? That was obviously not as easy because dealing with the physical world always has issues that dealing with the virtual doesn't.

foxglacier · 3 days ago

Robots are only harder because they have expensive hardware. We already have robots that can load dishwashers and do other manual work but humans are cheaper so there isn't much of a market for them.

The rising tide idea came from a 1997 paper by Moravec. Here's a nice graphic and subsequent history https://lifearchitect.ai/flood/

Interestingly, Moravec also stated: "When the highest peaks are covered, there will be machines than can interact as intelligently as any human on any subject. The presence of minds in machines will then become self−evident." We pretty much have those today so by 1997 standards, machines have minds, yet somehow we moved the goalposts and decided that doesn't count anymore. Even if LLMs end up being strictly more capable than every human on every subject, I'm sure we'll find some new excuse why they don't have minds or aren't really intelligent.

Quarrelsome · 3 days ago

> if you can find a way to produce verifiable rewards about a target world

I feel like there's an interesting symmetry here between the pre and post LLM world, where I've always found that organisations over-optimise for things they can measure (e.g. balance sheets) and under-optimise for things they can't (e.g. developer productivity), which explains why its so hard to keep a software product up to date in an average org, as the natural pressure is to run it into the ground until a competitor suddenly displaces it.

So in a post LLM world, we have this gaping hole around things we either lack the data for, or as you say: lack the ability to produce verifiable rewards for. I wonder if similar patterns might play out as a consequence and what unmodelled, unrecorded, real-world things will be entirely ignored (perhaps to great detriment) because we simply lack a decent measure/verifiable-reward for it.

simne · 2 days ago

> if you can find a way to produce verifiable rewards about a target world

I have significant experience on modelling physical world (mostly CFD, but also gamedev - with realistic rigid body collisions and friction).

I admit, exists domain (spectrum of parameters), where CFD and game physics working just well; exists predictable domain (on borders of well working domain), where CFD and game physics working good enough but could show strange things, and exists domain, where you will see lot of bugs.

And, current computing power is so much, that even on small business level (just median gamer desktop), we could save on more than 90% real-world tests with simulations in well working domain (and just avoid use cases in unreliable domains).

So I think, most question is just conservative bosses and investors, who don't believe to engineers and don't understand how to do checks (and tuning) of simulations with real world tests, and what reliable domain is.

w10-1 · 3 days ago

> rather how we continue to scale for language models at the current LM frontier — 4-8h METR tasks

I wonder if this doesn't reify a particular business model, of creating a general model and then renting it out Saas-style (possibly adapted to largish customers).

It reminds me of the early excitement over mainframes, how their applications were limited by the rarity of access, and how vigorously those trained in those fine arts defended their superiority. They just couldn't compete with the hordes of smaller competitors getting into every niche.

It may instead be that customer data and use cases are both the most relevant and the most profitable. An AI that could adopt a small user model and track and apply user use cases would have entirely different structure, and would have demonstrable price/performance ratios.

This could mean if Apple or Google actually integrated AI into their devices, they could have a decisive advantage. Or perhaps there's a next generation of web applications that model use-cases and interactions. Indeed, Cursor and other IDE companies might have a leg up if they can drive towards modeling the context instead of just feeding it as intention to the generative LLM.

olq_plo · 3 days ago

Since you seem to know your stuff, why do LLMs need so much data anyway? Humans don't. Why can't we make models aware of their own uncertainty, e.g. feeding the variance of the next token distribution back into the model, as a foundation to guide their own learning. Maybe with that kind of signal, LLMs could develop 'curiosity' and 'rigorousness' and seek out the data that best refines them themselves. Let the AI make and test its own hypotheses, using formal mathematical systems, during training.

mikewarot · 3 days ago

My focus lately is on the cost side of this. I believe strongly that it's possible to reduce the cost of compute for LLM type loads by 95% or more. Personally, it's been incredibly hard to get actual numbers for static and dynamic power in ASIC designs to be sure about this.

If I'm right (which I give a 50/50 odds to), and we can reduce the power of LLM computation by 95%, trillions can be saved in power bills, and we can break the need for Nvidia or other specialists, and get back to general purpose computation.

JumpCrisscross · 3 days ago

> there are some cool labs building foundational robotics models, but they're maybe ~5 years behind LMs today

Wouldn't the Bitter Lesson be to invest in those models over trying to be clever about ekeing out a little more oomph from today's language models (and langue-based data)?

Deleted Comment

amelius · 3 days ago

What do you mean by "verifiable rewards"?

Do you mean challenges for which the answer is known?

eab- · 3 days ago

What do you mean about CLIP?

rawgabbit · 3 days ago

I believe he is referring to OpenAI proposal to move beyond training with pure text. Instead train with multi modal data. Instead of only the dictionary definition of an apple. Train it with a picture of an apple. Train it with a video of someone eating an apple etc.

godelski · 3 days ago

  > this isn't really about whether scaling is "dead"

I think there's a good position paper by Sara Hooker[0] that mentions some of this. Key point being that while the frontier is being pushed by big models with big data there's a very quiet revolution of models using far fewer parameters (still quite big) and data. Maybe "Scale Is All You Need"[1], but that doesn't mean it is practical or even a good approach. It's a shame these research paths have gotten a lot of pushback, especially given today's concerns about inference costs (this pushback still doesn't seem to be decreasing)

  > verifiable rewards

There's also a current conversation in the community over world models: is it actually a world model if the model does not recover /a physics/[2]. The argument for why they should recover a physics is that this means a counterfactual model must have been learned (no guarantees on if it is computationally irreducible). A counterfactual model gives far greater opportunities for robust generalization. In fact, you could even argue that the study of physics is the study of compression. In a sense, physics is the study of the computability of our universe[3]. Physics is counterfactual, allowing you to answer counterfactual questions like "What would the force have been if the mass had been 10x greater?" If this were not counterfactual we'd require different algorithms for different cases.

I'm in the recovery camp. Honestly I haven't heard a strong argument against it. Mostly "we just care that things work" which, frankly, isn't that the primary concern of all of us? I'm all for throwing shit at a wall and seeing what sticks, it can be a really efficient method sometimes (especially in early exploratory phases), but I doubt it is the most efficient way forward.

In my experience, having been a person who's created models that require magnitudes fewer resources for equivalent performance, I cannot stress enough the importance of quality over quantity. The tricky part is defining that quality.

[0] https://arxiv.org/abs/2407.05694

[1] Personally, I'm unconvinced. Despite success of our LLMs it's difficult to decouple other variables.

[2] The "a" is important here. There's not one physics per-say. There are different models. This is a level of metaphysics most people will not encounter and has many subtleties.

[3] I must stress that there's a huge difference between the universe being computable and the universe being a computation. The universe being computable does not mean we all live in a simulation.

Just using common sense, if we had a genius, who had tremendous reasoning ability, total recall of memories, and an unlimited lifespan and patience, and he'd read what the current LLMs have read, we'd expect quite a bit more from him than what we're getting now from LLMs.

There are teenagers that win gold medals on the math olympiad - they've trained on < 1M tokens of math texts, never mind the 70T tokens that GPT5 appears to be trained on. A difference of eight orders of magnitude.

In other words, data scarcity is not a fundamental problem, just a problem for the current paradigm.

bob1029 · 3 days ago

I think quantization is the simplest canary.

If we can reduce the precision of the model parameters by 2~32x without much perceptible drop in performance, we are clearly dealing with something wildly inefficient.

I'm open to the possibility that over parameterization is essential as part of the training process, much like how MSAA/SSAA over sample the frame buffer to reduce information aliasing in the final scaled result (also wildly inefficient but very effective generally). However, I think for more exotic architectures (spiking / time domain) these rules don't work the same way. You can't back propagate a recurrent SNN so much of the prevailing machine learning mindset doesn't even apply.

jebarker · 3 days ago

It’s not clear that the inefficiency of the current paradigm is in the neural net architectures. It seems just as likely that it’s in the training objective.

flooo · 3 days ago

Now consider that the genius cannot physically interact with the world or the people therein, and uses her eyes only for reading text.

nosianu · 3 days ago

Yes - we train only on a subset of human communication, the one using written symbols (even voice has much much more depth to it), but human brains train on the actual physical world.

Human students who only learned some new words but have not (yet) even began to really comprehend a subject will just throw around random words and sentences that sound great but have no basis in reality too.

For the same sentence, for example, "We need to open a new factory in country XY", the internal model lighting up inside the brain of someone who has actually participated when this was done previously will be much deeper and larger than that of someone who only heard about it in their course work. That same depth is zero for an LLM, which only knows the relations between words and has no representation of the world. Words alone cannot even begin to represent what the model created from the real-world sensors' data, which on top of the direct input is also based on many times compounded and already-internalized prior models (nobody establishes that new factory as a newly born baby with a fresh neural net, actually, even the newly born has inherited instincts that are all based on accumulated real world experiences, including the complex very structure of the brain).

Somewhat similarly, situations reported in comments like this one (client or manager vastly underestimating the effort required to do something): https://news.ycombinator.com/item?id=45123810 The internal model for a task of those far removed from actually doing it is very small compared to the internal models of those doing the work, so trying to gauge required effort falls short spectacularly if they also don't have the awareness.

imtringued · 3 days ago

Also the geniuses get beaten with a stick if they don't memorize and perfectly reproduce the text they've read.

Fargren · 3 days ago

I'm not sure what point you are trying to make. Are you saying in order to make LLMs better at learning the missing piece is to make the capable to interact with the outside world? Give them actuators and sensors?

voxic11 · 3 days ago

Maybe human brains are constantly generating (and training on) massive amounts of synthetic data and that is how they get so smart?

garspin · 2 days ago

I doubt it. Brains run at only a few operations per second.... GPUS at TFLOPS. There just isn't enough bandwidth.

My brain only needs to get mugged in a dark alley by a guy in a hoodie once to learn something.

timeinput · 3 days ago

You mean those like 8 hours of ~~nightmares~~ dreams I have every night?

anthonypasq · 3 days ago

This sentence really struck me in a particular way. Very interesting. It does seem like thoughts/stream of consciousness is just your brain generating random tokens to itself and learning from it lol.

jimbokun · 3 days ago

What experiment could be run to test this hypothesis?

energy123 · 3 days ago

> they've trained on < 1M tokens of math texts, never mind the 70T tokens that GPT5 appears to be trained on.

Somewhat apples and oranges given billions of years of evolution behind that human. GPT-5 started off as a blank slate.

TheDong · 3 days ago

This comparison is absolute nonsense.

"How could a telescope see saturn, human eyes have billions of years of evolution behind them, and we only made telescopes a few hundred years ago, so they should be much weaker than eyes"

"How can StockFish play chess better than a human, the human brain has had billions of years of evolution"

Evolution is random, slow, and does not mean we arrive at even a local optima.

txrx0000 · 2 days ago

To be fair, GPT-5 didn't start off as a blank slate. The architecture probably encodes a lot, much like how DNA encodes a lot. The former requires human writing to decompress into a human-like thing, the latter requires the Earth environment and a woman to decompress into a human organism.

But it's indeed apples and oranges. There's no good way to estimate the information encoded by the GPT architecture compared to human DNA. We just have to be empirical and look at what the thing can do.

Dead Comment

petralithic · 3 days ago

Humans are not tabulae rasae though. Evolution has hardwired our geniosity over millions of years.