Bitter Lesson is about AI agents

For a blog post of 1,200 words the bitter lesson has done more damage to AI research and funding than blowing up a nuclear bomb at neurips would.

Every time I try to write a reasonable blog post about why it's wrong it blows up to tens of thousands of words and no one can be bothered to read it, let alone the supporting citations.

In the spirit of low effort anec-data pulled from memory:

The raw compute needed to brute force any problem can only be known after the problem is solved. There is no sane upper limit to how much computation, memory and data any given task will take and humans are terrible at estimating how hard tasks actually are. We are after all only 60 years late for the undergraduate summer project that would solve computer vision.

Today VLMs are the best brute force approach to solving computer vision we have, and they look like they will take a PB of state to solve and the compute needed to train them will be available some time around 2040.

What do we do with the problems that are too hard to solve with the limited compute that we have? Lie down for 80 years and wait for compute to catch up? Or solve a smaller problem using specialized tricks that don't require a $10B super computer to build?

The bitter lesson is nothing of the sort, there is plenty of space for thinking hard, and there always will be.

fxtentacle · 5 months ago

In my experience, "the bitter lesson" is just wrong. Yes, a hypothetical AI model that has a suitable architecture so that it can memorize your handcrafted features can improve upon them, if it is perfectly trained. But that's a huge if. There's a reason that large AI models only work with some architectures and you need to randomly initialize tens to hundreds of times and you need exactly the right optimizer with the right hyperparameters and then the right training data in the right order... and the reason is that generic function approximators get stuck in a local minima incredibly easy.

People used to write handcrafted features to limit the optimization base area. Now people use specific pretraining data and initialization functions to do the same. It's just a different way to express the same constraints.

In short, the "bitter lesson" assumes a change that did not happen yet.

imtringued · 5 months ago

>and the reason is that generic function approximators get stuck in a local minima incredibly easy.

The key insight in massively over parameterized models (aka LLMs) has been that all the local minima are very similar. Picking one over the other doesn't actually benefit you that much.

cma · 5 months ago

> There's a reason that large AI models only work with some architectures and you need to randomly initialize tens to hundreds of times and you need exactly the right optimizer with the right hyperparameters and then the right training data in the right order..

GPT4 used the tensor programs (Greg Yang,now at X. Ai) research to effectively transfer hyper parameters from smaller models where experiments can go fast to the larger one in a predictable way and got a very smooth loss curve throughout the training.

abetusk · 5 months ago

In my opinion, you've grossly mischaracterized and misunderstood the bitter lesson. The bitter lesson is not saying that brute force will win out against better algorithms, it's saying that simple, efficient algorithms, in a theoretical sense, will win out against intricate and specialized algorithms.

That is, it's saying that worrying about the constant factor in the "big-Oh" is doomed to failure. My reading is that it's not saying that we should abandon polynomial time algorithms in favor of exponential ones, it's saying we should focus on polynomial time algorithms that are conceptually simple and let Moore's law take care of the constant factor differences between intricate, complex and "clever" algorithms. It's saying that Moore's law will obliterate the constant factor differences in algorithms, and potentially the small polynomial exponent differences, not that we shouldn't care about exponential time complexity.

The "Bitter Lesson" blog post came out during a time when people were debating whether simple algorithms on large datasets could win out against more complex algorithms with intricate domain knowledge. That is, did AGI, vision, language, translation, speech, etc. need intricate and deep domain knowledge or was it good enough to get coarser and "stupider" algorithms, that were still theoretically efficient/polynomial time efficient algorithms.

I can see why you would think this way and I certainly don't claim to have deep insight into Sutton to claim whether my reading is more correct than yours. From my perspective, the good faith reading of Sutton's article is talking about constant factor differences in algorithms, not in differences between polynomial time and exponential time worst case complexities.

m463 · 5 months ago

I wonder if an analogy with graphics would be that "simple algorithms that leveraged compute" would not be "brute force everything with 1000 gpus + ray tracing", it might be a pixel shader that simulated water taking the state of the art a bit further.

mmoskal · 5 months ago

The bitter lesson is about the balance of human ingenuity and compute being thrown at the problem. We've seen a few years of LLM compute being scaled up 10x every year, but this is hitting limits (fabs), and we will see more human effort, as it becomes compartively cheaper.

Also the current crop of models are inherently limited. Even for something as simple as following a JSON schema, models alone are not good enough [0]

Of course as the Moore law refuses to die, we'll continue seeing 1.5-2x or so every year, but that's far from 10x.

[0] https://openai.com/index/introducing-structured-outputs-in-t... - see plot

noosphr · 5 months ago

>Of course as the Moore law refuses to die, we'll continue seeing 1.5-2x or so every year, but that's far from 10x.

This is another one of those anec-data throw away sentences that take thousands of words to disprove - with a lot of graphs - that no one reads.

More hot takes: Moores law has been effectively dead since the Pentium 4 on CPUs. It's been dead on GPUs since 2020. Right now we're not seeing a 1.5-2x grow of compute per year. We've seen zero growth for 5. The only way that GPUs have gotten faster is by running ever hotter, and building out a trillion dollars worth of data centers.

No one cares because the current hotness in AI is transformers which are memory bound in both training and inference. If someone manages to get diffusion models to become the next hotness all of a sudden everyone will realize this is a problem since those are compute bound by a huge margin and current gen GPUs are fire hazards when ran at 100% utilization for weeks on end.

KaoruAoiShiho · 5 months ago

It's much higher than 1.5x-2x, both scale up and scale out. Moving onto fp4 alone will offer a huge speedup.

sriku · 5 months ago

Agree (though that's perhaps not a charitable interpretation of TBL). Prematurely articulated principles can do a lot of damage.

Btw I found the blog post to be one of the lowest quality one in terms of information content posted on HN - almost like it was written by chatgpt or something.

bob1029 · 5 months ago

I don't feel like the bitter lesson becomes applicable until you've demonstrated some initial degree of success with your technique on a single machine/GPU/CPU/thread. If you cannot make it work in practical environments, it's going to be an uphill battle the entire way.

This is why I've moved toward CPU-only techniques in my experimentation. Being able to execute arbitrary UTMs with high performance provides a significantly richer computational landscape to work with than matrix multiplication. I am perfectly happy with something taking longer as if it's on a linear basis. I.e., adding another CPU provides ~2x search speed. I am NOT happy with ideas like taking a 100x hit on token generation rate because half my parameters are paged out to disk at any given moment due to not having a step-size amount of vram.

The rigidity of the GPU solution stack makes exploration of clever techniques largely a slap on the wrist experience. Anything with rapid control flow changes is verboten for a fancy pants cluster. The latency domain of L1 cache in the CPU is impossible to compete with if you need to serialize all of your events. I strongly believe that control flow is where the magic happens. This is where you can cut through 100 million parameters of linear math bullshit and solve the problem with a lookup table and 2 interpreter cycles. You get about half a billion of these cycles to work with per second per thread, so there is a lot of room to play with ideas.

imtringued · 5 months ago

Meanwhile in robotics everyone is glad that computers are getting faster.

There are a lot of things that used to be impossible to do inside a 1000 Hz control loop.

>What do we do with the problems that are too hard to solve with the limited compute that we have? Lie down for 80 years and wait for compute to catch up? Or solve a smaller problem using specialized tricks that don't require a $10B super computer to build?

Solving a smaller problem using specialized tricks has gone nowhere in robotics. Almost all the advancements in robotics control happened in the last ten years as a result of computers getting faster.

We are very close to the cusp of non linear MPC becoming a solved problem for up to 64 degrees of freedom and a horizon of 20 time steps at 1000Hz, but we aren't there yet. It would definitely be possible with an ASIC built for MPC.

>The bitter lesson is nothing of the sort, there is plenty of space for thinking hard, and there always will be.

The bitter lesson doesn't say that human ingenuity is worthless. It guides it in a useful direction. A lot of human ingenuity was put into compute scaling solutions for transformers.

_cs2017_ · 5 months ago

"The bitter lesson" didn't merely claim that a sufficiently large amount of compute would obsolete an engineered solution. Its claim was far stronger: the time it takes for the compute growth to catch up with the hand-engineered solution is so short that the investment in the latter won't pay off in the sense of a researcher's personal career investment, or in the sense of a big-tech R&D effort ROI.

You may well dispute such a claim, of course. Would be interesting to read your thoughts if you are willing to share them.

jncfhnb · 5 months ago

> The raw compute needed to brute force any problem can only be known after the problem is solved. There is no sane upper limit to how much computation, memory and data any given task will take and humans are terrible at estimating how hard tasks actually are. We are after all only 60 years late for the undergraduate summer project that would solve computer vision.

I feel like you’re conflating conceptual difficulty and computational difficulty

RobinL · 5 months ago

I appreciate this comment. I'm currently working hard on something seemly straightforward (address matching) and sometimes I feel demotivated because it feels like whatever progress I make, the bitter lesson will get me in the end. Reading your comment made me feel that maybe it's worth the effort after all. I have also taken some comfort in the fact that current LLMs cannot perform this task very well.

uxcolumbo · 5 months ago

Can you provide more about the problems in address matching and what you are trying to solve?

Do you mean street address matching? Isn’t that already solved? (excuse the naive question)

The counter argument is a bitter lesson that Tesla is learning from Waymo and the lesson might be bitter enough to tank the company. Waymo's approach to self driving isn't end to end - they have classical control combined with tons of deep learning, creating a final product that actually works in the real world, meanwhile the purely data driving approach from Tesla has failed to deliver a working product.

grbsh · 5 months ago

Tesla / Waymo is a perfect illustration of the point, but the Bitter Lesson doesn’t allow us to pick a winner here. The Bitter Lesson tells us that the Tesla approach (fully end to end, minimizing hand coded features / logic) will _ultimately_ win out. The Bitter Lesson does not tell us that this approach has to economically justify itself 1 year in, 5 years in, or that the approach when the technology is immature will allow a company to avoid bankrupting itself in the meantime while they wait for the data and compute to scale.

In other words, just because we know that ultimately (possibly in 20+ years) the Tesla compute-only approach will be simpler and more effective, Tesla might not survive to see this happen. Instead, manual feature engineering and hacking can always give temporary gains over data and compute driven approaches. The bitter lesson was clear about this. I suspect Waymo will win, and at some point in the future once they are out of their growth at all costs stage, they will transition into their maximum value extraction stage, in which vision will make significantly more economic sense than LiDAR. But once they win, they’ll have plenty of time to see the bitter lesson through its ultimate consequences. Elon is right, but he’s probably too early.

typon · 5 months ago

That's religion, not a predictive theory.

The Bitter Lesson has held up in a lot of domains where injecting human inductive bias was detrimental. Adding LIDAR for example is not inductive bias - it's a strictly superior form of sensing. You won't call a wolf's sense of smell "hand engineered features" or a cat's reflexes a failure of evolution to extract more signal from an inferior sensory input.

Waymo will win because they want to make a product that works and not be ideological about it - that's ultimately what matters.

jsight · 5 months ago

I'd argue that bitter lesson might be the other way around. Waymo has been experimenting with more end-to-end approaches and is likely to end up with something that looks more like that than a "classical control" approach, though maybe not quite the same approach as Tesla's current setup.

IMO, this is the best public description of the current state of the art: https://www.youtube.com/watch?v=92e5zD_-xDw

I expect Waymo to continue to evolve in a similar direction.

ModernMech · 5 months ago

The lesson from Tesla is that AI is not just a magic box where you can put in data and get out intelligence. There are more to working systems than compute, and when they operate in the real world, data isn't enough. The key problem with Tesla cars that keep them from succeeding is not that they don't have enough data, but they have no idea what to do with it. Even if they had infinite compute and all the driving videos in the world, it wouldn't be enough to overcome the limitations of their sensors.

xg15 · 5 months ago

> The key problem with Tesla cars that keep them from succeeding is not that they don't have enough data, but they have no idea what to do with it. Even if they had infinite compute and all the driving videos in the world, it wouldn't be enough to overcome the limitations of their sensors.

Isn't this effectively a refutation of the "bitter lesson"?

dangus · 5 months ago

Tesla is a poor counterargument because it is no longer a market leader. It has poor management compared to 10 years ago and seems to be unable to attract top talent (poor labor relations).

Tesla is being leapfrogged by competitors across the auto industry. All it has is first mover status (charging network).

Tesla purposefully limits the capabilities of its self driving by refusing to implement it with sensors that go beyond smartphone cameras.

My belief is that Tesla doesn’t want to actually deliver a car that can drive itself because the end result of Waymo is that fewer people will need to own a car and fleets of short term rental self-driving cars won’t spend frivolous money on prestige and luxury like consumer car buyers. They won’t lease a car and replace it every 2-3 years like some car owners do just because they like having a new car. Fleet vehicle operators purchase cars with razor thin margins and make decisions based solely on economics, as well as having a lot more purchasing leverage over car manufacturers.

I don’t think Tesla ever wants self driving to work, they just want to sell the idea of the software.

imtringued · 5 months ago

Tesla is actually an example of relying too much on human domain knowledge.

Wayno is brute forcing the problem with hardware. They use Lidar.

Elon Musk's argument against Lidar is that humans only need two eyes and therefore stereoscopic vision is enough.

"Human drivers use two eyes, therefore self driving cars need two eyes." is exactly the type of thing the bitter lesson is warning against if you stretch the analogy to hardware.

lsy · 5 months ago

Going back to the original "Bitter Lesson" article, I think the analogy to chess computers could be instructive here. A lot of institutional resources were spent trying to achieve "superhuman" chess performance, it was achieved, and today almost the entire TAM for computer chess is covered by good-enough Stockfish, while most of the money tied up in chess is in matching human players with each other across the world, and playing against computers is sort of what you do when you're learning, or don't have an internet connection, or you're embarrassed about your skill and don't want to get trash-talked by an Estonian teenager.

The "Second Bitter Lesson" of AI might be that "just because massive amounts of compute make something possible doesn't mean that there will be a commensurately massive market to justify that compute".

"Bitter Lesson" I think also underplays the amount of energy and structure and design that has to go into compute-intensive systems to make them succeed: Deep Blue and current engines like Stockfish take advantage of tablebases of opening and closing positions that are more like GOFAI than deep tree search. And the current crop of LLMs are not only taking advantage of expanded compute, but of the hard-won ability of companies in the 21st century to not only build and resource massive server farms, but mobilize armies of contractors in low-COL areas to hand-train models into usefulness.

diego_sandoval · 5 months ago

The main useful outcome we get from chess is entertainment.

Entertainment that comes from a Human vs. Human match is higher than Human vs. AI, at least for spectators.

But many sectors of the economy don't gain much from it being done by humans. I don't care if my car was made by all humans or all robots, as long as it's the best car I can get for the money.

I think you're extrapolating a bit too much from the specific case of chess.

ip26 · 5 months ago

It’s not really about how the compute-intensive resources come to bear. You can draw a parallel to Moore’s law. Node advancement is one of the most expensive and cutting edge efforts by humanity today. But it’s also simultaneously true that software companies have succeeded or failed by betting for or against computers getting faster. There are famous examples of companies in the 80’s that designed software that was simply not usable on the computers on hand when the project began, but was incredible on the (much faster) computers of launch day.

The bitter lesson is very similar. In essence, when building on top of AI models, bet on the AI models getting much faster and more capable.

immibis · 5 months ago

And there is software today that is simply not usable on computers today, but will be incredible on computers in 20 years time if clock speed continue doubling every 2 years.

Most of it is written in Electron.

PollardsRho · 5 months ago

The time span on which these developments take place matter a lot for whether the bitter lesson is relevant to a particular AI deployment. The best AI models of the future will not have 100K lines of hand-coded edge cases, and developing those to make the models of today better won't be a long-term way to move towards better AI.

On the other hand, most companies don't have unlimited time to wait for improvements on the core AI side of things, and even so building competitive advantages like a large existing customer base or really good private data sets to train next-gen AI tools have huge long-term benefits.

There's been an extraordinary amount of labor hours put into developing games that could run, through whatever tricks were necessary, on whatever hardware actually existed for consumers at the time the developers were working. Many of those tricks are no longer necessary, and clearly the way to high-definition real-time graphics was not in stacking 20 years of tricks onto 2000-era hardware. I don't think anyone working on that stuff actually thought that was going to happen, though. Many of the companies dominating the gaming industry now are the ones that built up brands and customers and experience in all of the other aspects of the industry, making sure that when better underlying scaling came there they had the experience, revenue, and know-how to make use of that tooling more effectively.

spongebobstoes · 5 months ago

Why must the best model not have 100k edge cases hand coded?

Our firsthand experiences as humans can be viewed as such. People constantly over index on their own anecdata, and are the best "models" so far.

Previous experience isn't manual edge cases, it's training data. Humans have incredible scale (100 trillion synapses): we're incredibly good at generalizing, e.g., how to pick up objects we've never seen before or understanding new social situations.

If you want to learn how to play chess, understanding the basic principles of the game is far more effective than trying to memorize every time you make an opening mistake. You surely need some amount of rote knowledge, but learning how to appraise new chess positions scales much, much better than trying to learn an astronomically small fraction of chess positions by heart.

wegfawefgawefg · 5 months ago

Actually companies can just wait. Multiple times my company has said: "a new model that solves this will probably come out in like 2-4 months anyways, just leave the old one as is for now".

It has been true like ten times in the past two years.

It's not that technical work is guaranteed to be in your codebase 10 years from now, it's that customers don't want to use a product that might be good six months from now. The actors in the best position to use new AI advances are the ones with good brands, customer bases, engineering know-how that does transfer, etc.

abstractcontrol · 5 months ago

> Investment Strategy: Organizations should invest more in computing infrastructure than in complex algorithmic development.

> Competitive Advantage: The winners in AI won’t be those with the cleverest algorithms, but those who can effectively harness the most compute power.

> Career Focus: As AI engineers, our value lies not in crafting perfect algorithms but in building systems that can effectively leverage massive computational resources. That is a fundamental shift in mental models of how to build software.

I think the author has a fundamental misconception what making best use of computational resources requires. It's algorithms. His recommendation boils down to not do the one thing that would allow us to make the best use of computational resources.

His assumptions would only be correct if all the best algorithms were already known, which is clearly not the case at present.

Rich Sutton said something similar, but when he said it, he was thinking of old engineering intensive approaches, so it made sense in the context in which he said it and for the audience he directed it at. It was hardly groundbreaking either, the people whom he wrote the article for all thought the same thing already.

People like the author of this article don't understand the context and are taking his words as gospel. There is no reason not to think that there won't be different machine learning methods to supplant the current ones, and it's certain they won't be found by people who are convinced that algorithmic development is useless.

aDyslecticCrow · 5 months ago

I'm by the same mind.

I dare say ChatGPT 3.0 and 4.0 are the only recent examples where pure computing produced a significant edge compared to algorithmic improvements. And that edge lasted a solid year before others caught up. Even among the recent improvements;

1. Gaussian splashing, a hand-crafted method threw the entire field of Nerf models out the water. 2. Deepseek o1 is used for training reasoning without a reasoning dataset. 3. Inception-labs 16x speedup is done using a diffusion model instead of the next token prediction. 4. Deepseek distillation, compressing a larger model into a smaller model.

That sets aside the introduction of the Transformer and diffusion model themselves, which triggered the current wave in the first place.

AI is still a vastly immature field. We have not formally explored it carefully but rather randomly tested things. Good ideas are being dismissed for whatever randomly worked elsewhere. I suspect we are still missing a lot of fundamental understanding, even at the activation function level.

We need clever ideas more than compute. But the stock market seems to have mixed them up.

amarant · 5 months ago

>There is no reason not to think that there won't be different machine learning methods to supplant the current ones,

Sorry, is that a triple negative? I'm confused, but I think you're saying there WILL be improved algorithms in the future? That seems to jive better with the rest of your comment, but I just wanted to make sure I understood you correctly!

So.. Did I?

serjester · 5 months ago

This misses that if the agent is occasionally going haywire, the user is leaving and never coming back. AI deployments are about managing expectations - you’re much better off with an agent that’s 80 +/- 10% successful than 90 +/- 40%. The more you lean into full automation, the more guardrails you give up and the more variance your system has. This is a real problem.

fancyfredbot · 5 months ago

Sutton might have said you just need a loss function which penalises variance and the model will learn to reduce variance itself. He thinks this will be more effective than hand coded guardrails. He's probably right.

I don't know how you write that loss function mind you. Sounds tricky. But I doubt Sutton was saying it's easy, just that if you can do it then it's effective.

nsonha · 5 months ago

Penalises on training? Not runtime? The risk is that.

ankit219 · 5 months ago

You don't have to tolerate agent/AI going haywire. In a simple example, say of multiple parallel generations. It's compute intensive and it reduces the probability of your agent going haywire. You need mechanisms and evals to detect the best output in this scenario of course, that is still important. With more compute, you are preventing your final output to be haywire despite the variance.

ed · 5 months ago

Do you have a real world example of this? Claude Code for example doesn’t fit the pattern of “higher success but more variance.” If anything the variance is lower as the model (and tightly coupled agent) gets better.

TRiG_Ireland · 5 months ago

The only AI I've ever dealt with is unwillingly, when companies use AI chat bots to replace human support. They certainly make me want to leave and not come back.

dtagames · 5 months ago

Good stuff but the original "Bitter Lesson" article has the real meat, which is that by applying more compute power we get better results (just more accurate token predictions, really) than with human guiderails.

extr · 5 months ago

I bring this up often at work. There is more ROI in assuming models will continue to improve, and planning/engineering with that future in mind, rather than using a worse model and spending a lot of dev time shoring up it's weaknesses, prompt engineering, etc. The best models today will be cheaper tomorrow. The worst models today will literally cease to exist. You want to lean into this - have the AI handle as much as it possibly can.

Eg: We were using Flash 1.5 for awhile. Spent a lot of time prompt engineering to get it to do exactly what we wanted and be more reliable. Probably should have just done multi-shot and said "take best of 3", because as soon as Flash 2.0 came out, all the problems evaporated.

Thats the core of the argument. We are switching from a 100% deterministic and controlled worldview (in software terms) to a scenario where it's probabilistic, and we haven't updated ourselves accordingly. Best of n (with parallelization) is probably the simplest fix instead of such rigorous prompt engineering. Still many teams do want a deterministic output and spend a lot of time on prompts (as opposed to evals to choose the best output).

Honestly I feel the takeaways are the opposite.

There’s no point in building something non functional now simply because it will be replaceable with something functional later.

You should either do it without AI or not do it at all. You’re not actually adding value with a placeholder for “future AI”.

typewithrhythm · 5 months ago

Its more that if you don't know for sure if it's possible; and usually you don't. Then adding your expertise onto an ai system is never going to pay off compared to building out the ai compute infrastructure and training data.

This has the best chance of being functional in the long term, in the face of uncertainty.

If you already know it can work, then you can improve with specific expertise, but it's a fixed solution at that point.

Eh, so in reality there are a lot of AI products people are trying to build and it's very unclear at the outset "if it's possible", where "possible" is a business question that includes factors like:

- How hard is the task? Can it be completed with cheaper/faster models or does it require heavyweight SOTA tier models?

- What's your cost envelope for AI compute?

- How are you going to test/refine the exact prompt and examples you give the AI?

- How much scaffolding (aka, dev time = $$$) do you need to set up to integrate the AI with other systems?

- Is the result reliable enough to productize and show to users?

What you realize when designing these systems is there is a sliding scale where the more scaffolding and domain expertise you put into the system as a whole, the less you need to rely on the AI, but the more expensive it is in terms of man-hours it is to develop and maintain. It looks more and more just like a traditional system. And vice versa, perhaps with the most powerful SOTA models you can just dump 20K tokens of context and get an answer that is highly reliable and accurate with almost no extra work on your end (but costs more to run).

It's very individualized and task-dependent. But we do know from recent history, you can generally assume models are going to get faster/smarter/cheaper pretty quickly. So you try to figure out how close to the latter scenario you can get away with for now, knowing that in 6 months the equation could have completely changed in favor of "let the AI do most of the work".

As an addendum, I think it's completely crazy right now to be in the business of training your own models unless you have HIGHLY specialized needs or like to light money on fire. You are never going to achieve the performance/$ of the big AI labs, and they/their investors are doing all your R&D for FREE. It's like if Ford was releasing a new car every 6 months made out of ever more efficient and stronger carbon nanotubes or whatever, because the carbon nanotube companies were all competing for market share and wanted to win the "carbon nanotube race". It's crazy, never seen anything like it.