The Bitter Lesson (2019)

YeGoblynQueenne · 5 years ago

>> In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search.

"Massive, deep search" that started from a book of opening moves and the combined expert knowledge of several chess Grandmasters. And that was an instance of the minimax algorithm with alpha-beta cutoff, i.e. a search algorithm specifically designed for two-player, deterministic games like chess. And with a hand-crafted evaluation function, whose parameters were filled-in by self-play. But still, an evaluation function; because the minimax algorithm requires one and blind search alone did not, could not, come up with minimax, or with the concept of an evaluation function in a million years. Essentially, human expertise about what matters in the game was baked-in to Deep Blue's design from the very beginning and permeated every aspect of its design.

Of course, ultimately, search was what allowed Deep Blue to beat Kasparov (3½–2½; Kasparov won two games and drew another). That, in the sense that the alpha-beta minimax algorithm itself is a search algorithm and it goes without saying that a longer, deeper, better search will inevitably eventually outperform whatever a human player is doing, which clearly is not search.

But, rather than an irrelevant "bitter" lesson about how big machines can perfom more computations than a human, a really useful lesson -and one that we haven't yet learned, as a field- is why humans can do so well without search. It is clear to anyone who has played any board game that humans can't search ahead more than a scant few ply, even for the simplest games. And yet, it took 30 years (counting from the Dartmouth workshop) for a computer chess player to beat an expert human player. And almost 60 to beat one in Go.

No, no. The biggest question in the field is not one that is answered by "a deeper search". The biggest question is "how can we do that without a search"?

Also see Rodney Brook's "better lesson" [2] addressing the other successes of big search discussed in the article.

_____________

[1] https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)#Des...

[2] https://rodneybrooks.com/a-better-lesson/

mtgp1000 · 5 years ago

>But, rather than an irrelevant "bitter" lesson about how big machines can perfom more computations than a human, a really useful lesson -and one that we haven't yet learned, as a field- is why humans can do so well without search

I think the answer is heuristics based on priors(e.g. board state), which we've demonstrated (with alphago and derivatives, especially alphago zero) that neural networks are readily able to learn.

This is why I get the impression that modern neural networks are quickly approaching humanlike reasoning - once you figure out how to

(1) encode (or train) heuristics and

(2) encode relationships between concepts in a manner which preserves a sort of topology (think for example of a graph where nodes represent generic ideas)

You're well on your way to artificial general reasoning - the only remaining question becomes one of hardware (compute, memory, and/or efficiency of architecture).

dreamcompiler · 5 years ago

Are we certain that well-trained human players are not doing search? It's possible that a search subnetwork gets "compiled without debugger symbols" and the owner of the brain is simply unaware that it's happening.

YeGoblynQueenne · 5 years ago

>> Are we certain that well-trained human players are not doing search?

Yes- because human players can only search a tiny portion of a game tree and a minimax search of the same extent is not even sufficient to beat a dedicated human in tic-tac-to, leta lone chess. That is, unless one wishes to countenance the possibility of an "unconscious search" which of course might as well be "the grace of God" or any such hand-wavy non-explanation.

>> It's possible that a search subnetwork gets "compiled without debugger symbols" and the owner of the brain is simply unaware that it's happening.

Sorry, I don't understand what you mean.

gwern · 5 years ago

I'm not sure why YeGoblynQueenne thinks this is such a mystery. (This is not the first time I've been puzzled by their pessimism on HN.) There is no mystery here: AlphaZero shows that you can get superhuman performance by searching only a few ply by sufficiently good pattern recognition in a highly parameterized and well-trained value function, and MuZero makes this point even more emphatically by doing away with the formal search entirely in favor of an more abstract recurrent pondering. What more is there to say?

bsder · 5 years ago

> Are we certain that well-trained human players are not doing search?

Some, but there's a LOT more context pruning the search space.

Watch some of the chess grandmasters play and miss obvious winning moves. Why? "Well, I didn't bother looking at that because <insert famous grandmaster> doesn't just hang a rook randomly."

new2628 · 5 years ago

At least in chess, if it is not the search, then it is probably the evaluation function.

Expert players have likely a very well-tuned evaluation function of how strong a board "feels". Some of it is explainable easily: center domination, diagonal bishop, connected pawn structure, rook supporting pawn from behind, others are more elaborate, come with experience and harder to verbalize.

When expert players play against computers, the limitation of their evaluation function becomes visible. Some board may feel strong, but you are missing some corner case that the minmax search observes and exploits.

YeGoblynQueenne · 5 years ago

I like to caution against taking concepts from computer science and AI and applying them directly to the way the human mind works. Unless we know that a player is applying a specific evaluation function (e.g. because they tell us, or because they vocalise their thought process etc) then even suggesting that "players have an evaluation function" is extrapolating far from what it is safe. For one thing- what does a "function" look like in the human mind?

Whatever human minds do, computing is only a very general metaphor for it and it's very risky to assume we understand anything about our mind just because we understand our computers.

burntoutfire · 5 years ago

> No, no. The biggest question in the field is not one that is answered by "a deeper search". The biggest question is "how can we do that without a search"?

My guess is that we're doing pattern recognision, where we recognize taht a current game state is similar to a situation that we've been in before (in some previous game), and recall the strategy we took and the outcomes it had lead to. With large enough body of experience, you can to remember lots of past attempted strategies for every kind of game state (of course, within some similarity distance).

blt · 5 years ago

This insight is the essence of the AlphaZero architecture. Whereas a pure Monte Carlo Tree Search (MCTS) starts each node in the search tree with a uniform distribution over actions, AlphaZero trains a neural network to observe the game state and output a distribution over actions. This distribution is optimized to be as similar as possible to the distribution obtained from running MCTS from that state in the past. It's very similar to the way humans play games.

kdoherty · 5 years ago

Potentially also of interest is Rod Brooks' response "A Better Lesson" (2019): https://rodneybrooks.com/a-better-lesson/

bnjmn · 5 years ago

"Potentially" is an understatement! A much better take, IMO.

auggierose · 5 years ago

I guess it depends on what you trying to do. I had a computer vision problem where I was like, hell yeah, let's machine learn the hell out of this. 2 months later, and the results were just not precise enough. It took me 2 more months, and now I am solving the task easily on an iPhone via Apple Metal in milliseconds with a hand-crafted optimisation approach ...

jefft255 · 5 years ago

His advice really concerns more scientific research and its long-term progress, and not really immediate applications. I think that injecting human knowledge can lead to faster, more immediate progress, and he seems to believe that too. The "bitter lesson" is that general, data-driven approaches will always win out eventually.

ksdale · 5 years ago

I think it's plausible that many technological advances follow a similar. Something like the steam engine is a step-improvement, but many of the subsequent improvements are basically the obvious next step, implemented once steel is strong enough, or machining precise enough, or fuel is refined enough. How many times has the world changed qualitatively, simply in the pursuit of making things quantitatively bigger or faster or stronger?

I can certainly see how it could be considered disappointing that pure intellect and creativity doesn't always win out, but I, personally, don't think it's bitter.

I also have a pet theory that the first AGI will actually be 10,000 very simple algorithms/sensors/APIs duct-taped together running on ridiculously powerful equipment rather than any sort of elegant Theory of Everything, and this wild conjecture may make me less likely to think this a bitter lesson...

StevenWaterman · 5 years ago

I agree, the first AGI probably will be bodget together with loads of expert input. However, that's not evidence against the bitter lesson.

The first of anything is usually made with the help of experts, but they're quickly overtaken by general methods that lever additional computation

ksdale · 5 years ago

Sorry, I didn't mean to suggest that the bitter lesson is wrong, just that it's not bitter, it's actually how a whole bunch of stuff progresses.

fxtentacle · 5 years ago

The current top contender on AI optical flow uses LESS CPU and LESS RAM than last year's leader. As such, I strongly disagree with the article.

Yes, many AI fields have become better from improved computational power. But this additional computational power has unlocked architectural choices which were previously impossible to execute in a timely manner.

So the conclusion may equally well be that a good network architecture results in a good result. And if you cannot use the right architecture due to RAM or CPU constraints, then you will get bad results.

And while taking an old AI algorithm and re-training it with 2x the original parameters and 2x the data does work and does improve results, I would argue that that's kind of low-level copycat "research" and not advancing the field. Yes, there's a lot of people doing it, but no, it's not significantly advancing the field. It's tiny incremental baby steps.

In the area of optical flow, this year's new top contenders introduce many completely novel approaches, such as new normalization methods, new data representations, new nonlinearities and a full bag of "never used before" augmentation methods. All of these are handcrafted elements that someone built by observing what "bug" needs fixing. And that easily halved the loss rate, compared to last year's architectures, while using LESS CPU and RAM. So to me, that is clear proof of a superior network architecture, not of additional computing power.

jmole · 5 years ago

Yup - and this year's top CPUs have almost 10x the performance per watt of CPUs from even 2-3 years ago [0]

Raw computation is only half the story. The other half is: what the hell do we do with all these extra transistors? [1]

0 - https://www.cpubenchmark.net/power_performance.html

1 - https://youtu.be/Nb2tebYAaOA?t=2167

fxtentacle · 5 years ago

Any day now people will start compiling old programs to web assembly so that you can wrap them with election, instead of compiling them to machine code. Once that happens, we have generated another 3 years of demand for Moore's law X_X

JoeAltmaier · 5 years ago

Got to believe, this is like heroin. Its a win until it isn't. Then where will AI researchers be? No progress for 20 (50?) years because the temptation to not understand but to just build performant engineering solutions, was so strong.

In fact, is the researcher supposed to be building the most performant solution? This article seems alarmingly misinformed. To understand 'artificial intelligence' isn't a race to VC money.

otoburb · 5 years ago

>>This article seems alarmingly misinformed.

I hate appeals to authority as much as anybody else on HN, but I'm not sure that we could say Rich Sutton[1] is "misinformed". He's an established expert in the field, and if we discount his academic credentials then at least consider he's understandably biased towards this line of thinking as one of the early pioneers of reinforcement learning techniques[2] and currently a research scientist at DeepMind leading their office in Alberta, Canada.

[1] https://en.wikipedia.org/wiki/Richard_S._Sutton

[2] http://incompleteideas.net/papers/sutton-88-with-erratum.pdf

JoeAltmaier · 5 years ago

He's writing that article for a reason, to be sure. Its just not the one that the article says its about, I'm thinking.

visarga · 5 years ago

AI as a field relied mostly on 'understanding' based approaches for 50 years without much success. These approaches were too brittle and ungrounded. Why return to something that doesn't work?

DNNs today can generate images that are hard to distinguish from real photos, super natural voices and surprisingly good text. They can beat us at all board games and most video games. They can write music and poetry better than the average human. Probably also drive better than an average human. Why worry about 'no progress for 50 years' at this point?

JoeAltmaier · 5 years ago

Because, they can't invent a new game. Unless of course they were only designed to invent games, and by trial and error and statistical correlation to existing games, thus producing a generic thing that relates to everything but invents nothing.

I'm not an idiot. I understand that we won't have general purpose thinking machines any time soon. But to give up entirely looking into that kind of thing, seems to me to be a mistake. To rebrand the entire field as calculating results to given problems and behaviors using existing mathematical tools, seems to do a disservice to the entire concept and future of artificial intelligence.

Imagine if the field of mathematics were stumped for a while, so investigators decided to just add up things faster and faster, and call that Mathematics.

YeGoblynQueenne · 5 years ago

>> AI as a field relied mostly on 'understanding' based approaches for 50 years without much success. These approaches were too brittle and ungrounded. Why return to something that doesn't work?

To begin with, because they do work and much better than the new approaches in a range of domains. For example, classical planners, automated theorem provers and SAT solvers are still state-of-the-art for their respective problem domains. Statistical techniques can not do any of those things very well, if at all.

Further, because the newer techniques have proven to also be brittle in their own way. Older techniques were "brittle in the sense that they didn't deal with uncertainty very well. Modern techniques are "brittle" because they are incapable of extrapolating from their training data. For example see the "elephant in the room" paper [1] or anything about adversarial examples regarding the brittleness of computer vision (probably the biggest success in modern statistical machine learning).

Finally, AI as a field did not rely on "understanding based approaches for 50 years"; there is no formal definition of "understanding" in the context of AI. A large part of Good, Old-Fashioned AI studied reasoning, which is to say, inference over rules expressed in a logic language, e.g. this was the approach exemplified by expert systems. Another large avenue of research was that on knowledge representation. And of course, machine learning itself was part of the field from its very early days, having been named by Arthur Samuel in 1959. Neural networks themselves are positively ancient: the "artifical neuron" was first described in 1938, by Pitts & McCulloch, many years before "artificial intelligence" was even coined by John McCarthy (and at the time it was a propositional-logic based circuit and nothing to do with gradient optimisation).

In general, all those obsolete dinosaurs of GOFAI could do things that modern systems cannot - for instance, deep neural nets are unrivalled classifiers but cannot do reasoning. Conversely, logic-based AI of the '70s and '80s excelled in formal reasoning. It seems that we have "progressed" by throwing out all the progress of earlier times.

____________

[1] https://arxiv.org/abs/1808.03305

P.S. Image, speech and text generation are cute, but a very poor measure for the progress of the field. There are not even good metrics for them so even saying that deep neural nets can "generate surprisingly good text" doesn't really say anything. What is "surprisingly good text"? Surprising, for whom? Good, according to what? etc. GOFAI folk were often accused of wastig time with "toy" problems, but what exactly is text generation if not a "toy problem" and a total waste of time?

astrophysician · 5 years ago

I think what he's basically saying is that priors (i.e. domain knowledge + custom, domain-inspired models) help when you're data limited or when your data is very biased, but once that's not the case (e.g. we have an infinite supply of voice samples), model capacity is usually all that matters.

sytse · 5 years ago

The article says we should focus on increasing the compute we use in AI instead of embedding domain specific knowledge. OpenAI seems to have taken this lesson to heart. They are training a generic model using more compute than anything else.

Many researchers predict a plateau for AI because it is missing the domain specific knowledge but this article and the benefits of more compute that OpenAI is demonstrating beg to differ.

throwaway7281 · 5 years ago

Model compression is an active research field and will probably be quite lucrative, as you will literally able to save millions.