Something else I find exciting, starting with one of the reflections-
The original training took 3 days on a Sun 4/260 workstation. I can't find specifics, but I believe early SPARC workstations of that era would pull about 200 watts in total (the CPU wasn't especially powerful, but the whole system, with disks and monitor running, would draw about that).
So 200 watts * 72 hours = 14400 watt-hours of energy.
Karpathy trained the equivalent on a MacBook, not even fully utilized, in 90 seconds: likely something around 20 watts * 0.025 hours = 0.5 watt-hours.
An energy efficiency improvement of nearly 30000x.
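As a sanity check on the arithmetic above (both power draws are rough estimates, not measured figures):

```python
# Rough energy comparison: 1989 Sun 4/260 vs. a modern laptop.
# The 200 W and 20 W figures are ballpark estimates from the comment.
sun_power_w = 200               # whole-system draw, estimated
sun_hours = 72                  # 3 days of training
sun_wh = sun_power_w * sun_hours        # 14400 Wh

mac_power_w = 20                # laptop, not fully utilized (estimate)
mac_hours = 90 / 3600           # 90 seconds
mac_wh = mac_power_w * mac_hours        # 0.5 Wh

print(sun_wh, mac_wh, sun_wh / mac_wh)  # 14400 0.5 28800.0
```

So the exact ratio under these assumptions is 28,800x, i.e. "nearly 30,000x".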
It totally depends on what you want to use a measure for. Just like neither height nor volume alone will tell you what will fit in your car.
By any measure that puts energy used by the brain in the denominator, humans are probably dumber than ants. But that doesn't mean those measures are always accurate.
(For contemporary neural networks, you also have to distinguish training costs from inference costs.)
For inference that could be useful, but the energy cost is not a property of the model alone; it depends on at least the tuple of (model, model architecture and compilation, hardware chosen).
30k doesn't even sound like that much to me given Moore's law; I'd expect more improvement since 1989. Supercomputer performance has increased more than a millionfold since then.
My (wrong) intuition on reading your comment was that you were over-estimating the expected growth in performance over that period. But after checking the maths based on Moore's Law, i.e. doubling every two years (understanding that was a rough estimate, more a concept prediction than something expected to be precise), you're right. So I'll share the maths for anyone else whose intuition is as poor as mine:
Doubling every 2 years = compound annual growth rate (CAGR) of ~41.42%
CAGR = ((End Value / Start Value)^(1 / Number of Years)) - 1
((2 / 1)^(1 / 2)) - 1 = 0.41421356237
Therefore in 34 years since then:
1 * (1 + 0.41421356237)^34 = ~131,072
So ~30k is about 4.4x less than 131k. Then again, that's equivalent to ~1.833x every two years, versus Moore's Law's 2x every two years, i.e. only ~8% less growth per two-year period. Given that Moore's Law is a rough rule of thumb rather than an exact fact, that doesn't seem too far off!
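The maths above can be verified in a few lines (34 years, doubling every two years, against the ~28,800x energy-efficiency figure from the earlier comment):

```python
# Check the Moore's-law comparison from the comment above.
years = 34
moore_factor = 2 ** (years / 2)          # doubling every 2 years -> 131072
cagr = 2 ** (1 / 2) - 1                  # ~0.4142 annual growth rate

observed = 28800                         # the ~30k efficiency improvement
per_two_years = observed ** (2 / years)  # ~1.83x per two years

print(moore_factor, round(cagr, 6), round(per_two_years, 3))
```

(Using the round 30,000 figure instead of 28,800 gives the ~1.833 quoted above.)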
I really enjoyed this article. My only critique is that the 2055 predictions are "meta-linear". In other words: the author avoids the (probable) mistake of taking our current tech and linearly regressing the numbers 33 years forward, but the predictions still suggest a kind of "worldline symmetry" with the present date at the origin.
It's quite possible that none of these predictions will come true simply because the timeframe is large enough for many unanticipated breakthroughs and roadblocks.
Maybe someone will figure out a much, much simpler foundational architecture than "perceptrons++", maybe we'll all be training clouds of 3D Gaussians, maybe quantum computers will finally take off and we don't even have the nouns for the building blocks we'll use.
On the negative side perhaps we hit a hard scaling limit (in hardware or training) that we didn't see coming. Or a civilizational setback.
All that said, though, if I were a betting man I wouldn't exactly wager against the article's conclusions; they're probably the best we can extrapolate knowing only the past and present state of affairs.
I think you are right, the next 33 years are likely to be very different.
I would lean to them being even more dramatic, due to the opportunity to advance algorithms, not just resources.
On the more obvious side, most libraries are not yet taking full advantage of many known gradient-optimization techniques. It’s been so much easier to just add data and compute that there is an overhang of techniques still to apply.
And large successful models are telling us important things.
For instance, it is clear that language models are learning a kind of logic of language similar to how we process thoughts, allowing highly disparate types of information to be woven together sensibly.
At some point, identifying the nature of that processing could radically simplify language processing.
That is just one opportunity for radical architecture and algorithm advances, and it would be revolutionary.
So should we spend the next 33 years doing the same things, just with more data and more compute power? That would be the logical conclusion of the breathless "I can't believe it is finally happening in my lifetime" and "we just need bigger models and more data" enthusiasm for LLMs when they first appeared. But can we really simply brute force our way to AGI?
Remember, 33 years ago "connectionist AI" wasn't the dominant AI paradigm, and "symbolic AI" wasn't the only other approach either - there were others, like "robotic functionalism" (the idea that you couldn't have true intelligence without interacting with the physical world). Maybe in 33 years some of these other approaches will have a resurgence, perhaps in combination with connectionist approaches. Or maybe there'll even be some entirely new approach.
Great article. I lived through the early days of artificial neural networks. I was on a DARPA advisory panel for neural network tooling in the mid 1980s, wrote the first version of the SAIC ANSim commercial product, and created the simple back-prop model that was deployed in the bomb detector my company built under contract to the FAA. I also managed a ‘conventional’ deep learning team at Capital One 5-6 years ago.
My world has been very exciting in the last 18 months. I spend as much time as I can exploring self hosted LLMs, APIs from Hugging Face, OpenAI, etc.
My mind is blown even thinking about tech 33 years from now!
The most fundamental change is the difference in what models are being trained on.
Little images of characters are a toy problem, very different from training on the linguistic and visual communication of essentially the whole human race.
Another 33 years of expanded computing resources won’t be training models to mimic the behavior and knowledge of humanity.
That problem (us!) will have been reduced to a toy problem long before then.
I think AI models will evolve by generating synthetic data, filtering and improving it, and then retraining. Possibly with external systems in the loop - code execution, search, human, simulation or robot. Quality won't degrade because there will be a lot of effort put into data filtering and diversity. We can always improve on a model by giving it more time.
Model architecture doesn't matter compared to the dataset. Any model from a class can learn the same skills from the same data, but change the data and they all change their abilities - the intelligence is in the data.
The future is data engineering, not model architecturing.
Human culture, by analogy, evolves faster than human biology. The data is evolving faster than the model. And we are seeing a drastic reduction in novel architectures in AI in recent years: diverse datasets applied to the same transformer models. Even among transformers, very few variants see wide use; thousands have been abandoned.
I like to think of it as language evolution by memetics being the real engine behind intelligence. We and AI are riding the language exponential together.
> Model architecture doesn't matter compared to the dataset. Any model from a class can learn the same skills from the same data, but change the data and they all change their abilities - the intelligence is in the data.
You might be right in the same sense that big-O notation is 'right'. Constant factor can matter; especially once you have to take energy use into account.
I don’t know. I find pessimistic views, like you are expressing, very strange.
My Tesla drives and navigates itself most of the time. 90-95% at least, just not 100%.
As opposed to cars 10 or more years ago, which didn’t do any of that.
To me it is much like the “God of the Gaps” when tremendous progress on a big problem is dismissed negatively, due to the (continuously shrinking) gaps of what it can’t do.
It's not clear that compute will scale as it did for the next 33 years. But it doesn't really need to.
I read the article and I was thinking "my God, I remember I used MSE that weekend in my pet ML project and it really didn't work out that well; wrong loss function." Our current crop of LLMs, or the one next year, will be perfectly able to tell me how I can improve my code and graphs, which means that I can deploy some expert-level techniques that otherwise would be "locked" to me by 50000 hours of "mastery acquisition".
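The "wrong loss function" point above can be made concrete. A toy sketch (made-up numbers) of why MSE is a poor fit for classification: on a confidently wrong prediction, cross-entropy blows up while MSE stays bounded, so the training signal is much weaker.

```python
import math

# Toy comparison of MSE vs. cross-entropy on a confidently wrong
# classification output. All numbers are illustrative.
target = [1.0, 0.0, 0.0]        # one-hot label
pred   = [0.01, 0.98, 0.01]     # softmax output, confidently wrong

mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
xent = -sum(t * math.log(p) for p, t in zip(pred, target))

print(f"MSE: {mse:.3f}, cross-entropy: {xent:.3f}")
```

MSE here is bounded near ~0.65 no matter how wrong the prediction, while cross-entropy exceeds 4.6 and keeps growing as confidence in the wrong class increases.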
A part of me is telling me that we humans are doomed, and that in 33 years we would have created a world in which we humans are irrelevant. But another part tells me that if we avoid that fate and all the other dooms, the future might just be quite bright.
We have heard, and will continue to hear, this sort of thing rather a lot. The last 5 yards are the hardest, but without them the previous 5 miles are of limited utility.
I think there’s going to be a point where we need to slow AI way, way down in order to avoid bad outcomes. I’m with Zvi Mowshowitz here: we should encourage progress and risk taking in every area except those where there are extinction risks. Applying today’s LLMs to all sorts of problems won’t end us. But I think we may only be a few years away from AGI that is conscious and can plan, and we don’t know the upper limit of how smart we’ll be able to make them.
And I think that we have a responsibility to any intelligent being we bring into the world. Some lament that there’s no test to become a parent; what about creating a million copies of a new virtual brain from scratch? And basically so they can be born into lifelong servitude.
This was really good. The only thing I didn't see explicitly discussed, although I guess it's obvious, is that what's different 33 years later is the inputs the models operate on. The '89 SOTA model used 16x16 greyscale images; today we have single-digit-megapixel color images. In 30 years, a desktop will be able to train CLIP in 90 seconds, but what will the SOTA models be trained on?
Human behaviour, in a way far more general than which token we might type next. To mimic humans as closely as might be possible with the basic deep-learning method, train something that can predict human behaviour in general. Training would require billions to quadrillions of hours of video, audio, and probably many other inputs, from many different people, engaged in the full variety of human activity.
Why? An adult by 25 only has 146k hours of video experience “training,” most of it repeated, derivative, and unproductive. And their encoded genes can be observed in their genome, so don’t need to be retrained by millions of years of evolution.
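The 146k-hours figure above is a quick back-of-envelope: 25 years of waking experience at roughly 16 hours a day (the waking-hours assumption is mine).

```python
# Back-of-envelope check on "146k hours of video experience by 25":
# 25 years * 365 days * ~16 waking hours per day (assumed).
hours = 25 * 365 * 16
print(hours)   # 146000
```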
We might have megapixel images that we can easily get with phone cameras, but virtually all vision models in common use take 224x224 resolution images as input, or maybe 384x384. Anything higher resolution than that just gets resampled down. It seems that you are better off using your compute budget on a bigger “brain” than on better “eyes” for now.
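The compute-budget point above can be illustrated with transformer-style vision models: token count grows with the square of resolution, and self-attention cost with the square of the token count. Patch size 16 is the common ViT choice; the resolutions here are illustrative.

```python
# Why higher input resolution is expensive for a ViT-style model.
# Tokens scale quadratically with resolution; self-attention scales
# quadratically with token count. Resolutions chosen for illustration.
patch = 16
for res in (224, 384, 1024):
    tokens = (res // patch) ** 2
    print(res, tokens, tokens ** 2)  # resolution, tokens, attention pairs
```

Going from 224 to 1024 pixels multiplies the token count by ~21x and the attention pairs by ~440x, which is why that budget tends to go to a bigger "brain" instead.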
I don't think that's current. Certainly the object-detection models work on bigger images, and the datasets they're pretrained on, e.g. COCO, are not 224x224. I think standard models pretrained on ImageNet, like the ResNets, usually have everything resized to 224x224, and so they favor this kind of scaling.
In this case there are good reasons for the resurgence, but that's really the case with pretty much anything software-related. Except the fashion cycles tend to be shorter with more mainstream technologies.
You mean joules (up to a constant factor)?
Not sure all the things it captures, but a model could be trained on the combination of audio/video/spatial/iris/what have you...
Now though, I’m sure people are taking LLMs and putting them together to do forward and backward chaining.
But a Turing award is pretty neat as well.