Something else I find exciting, starting with one of the reflections-
The original training took 3 days on a Sun 4/260 workstation. I can't find specifics, but I believe early SPARC workstations of that era would pull about 200 watts in total (the CPU wasn't especially powerful, but the whole system, with disks and monitor running, would draw about that).
So 200 watts * 72 hours = 14400 watt-hours of energy.
Karpathy trained the equivalent on a MacBook, not even fully utilized, in 90 seconds: likely something around 20 watts * 0.025 hours = 0.5 watt-hours.
An energy efficiency improvement of nearly 30000x.
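As a sanity check on the arithmetic above (both power draws are rough estimates, not measured figures):

```python
# Rough energy comparison: 1989 Sun 4/260 vs. a modern laptop.
# The 200 W and 20 W figures are ballpark estimates from the comment.
sun_power_w = 200               # whole-system draw, estimated
sun_hours = 72                  # 3 days of training
sun_wh = sun_power_w * sun_hours        # 14400 Wh

mac_power_w = 20                # laptop, not fully utilized (estimate)
mac_hours = 90 / 3600           # 90 seconds
mac_wh = mac_power_w * mac_hours        # 0.5 Wh

print(sun_wh, mac_wh, sun_wh / mac_wh)  # 14400 0.5 28800.0
```

So the exact ratio under these assumptions is 28,800x, i.e. "nearly 30,000x".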
It totally depends on what you want to use a measure for. Just like neither height nor volume alone will tell you what will fit in your car.
By any measure that puts energy used by the brain in the denominator, humans are probably dumber than ants. But that doesn't mean those measures are always accurate.
(For contemporary neural networks, you also have to distinguish training costs from inference costs.)
For inference that could be useful, but the energy cost is not a property of the model alone; it depends on at least the tuple of (model, model architecture and compilation, hardware chosen).
30k doesn't even sound like that much to me given Moore's law; I'd expect more improvement since 1989. Supercomputer performance has increased more than a millionfold since then.
My (wrong) intuition on reading your comment was that you were over-estimating the expected growth in performance over that period. But after checking the maths based on Moore's Law, i.e. doubling every two years (understanding that was a rough estimate, more a concept prediction than something expected to be precise), you're right. So I'll share the maths for anyone else whose intuition is as poor as mine:
Doubling every 2 years = compound annual growth rate (CAGR) of ~41.42%
CAGR = ((End Value / Start Value)^(1 / Number of Years)) - 1
((2 / 1)^(1 / 2)) - 1 = 0.41421356237
Therefore in 34 years since then:
1 * (1 + 0.41421356237)^34 = ~131,072
So ~30k is about 4.4x less than 131k. Then again, that's equivalent to ~1.833x every two years, versus Moore's Law's 2x every two years, i.e. only ~8% less growth per two-year period. Given that Moore's Law is a rough rule of thumb rather than an exact fact, that doesn't seem too far off!
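The maths above can be verified in a few lines (34 years, doubling every two years, against the ~28,800x energy-efficiency figure from the earlier comment):

```python
# Check the Moore's-law comparison from the comment above.
years = 34
moore_factor = 2 ** (years / 2)          # doubling every 2 years -> 131072
cagr = 2 ** (1 / 2) - 1                  # ~0.4142 annual growth rate

observed = 28800                         # the ~30k efficiency improvement
per_two_years = observed ** (2 / years)  # ~1.83x per two years

print(moore_factor, round(cagr, 6), round(per_two_years, 3))
```

(Using the round 30,000 figure instead of 28,800 gives the ~1.833 quoted above.)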
I really enjoyed this article. My only critique is that the 2055 predictions are "meta-linear". In other words: the author avoids the (probable) mistake of taking our current tech and linearly regressing the numbers 33 years forward, but the predictions still suggest a kind of "worldline symmetry" with the present date at the origin.
It's quite possible that none of these predictions will come true simply because the timeframe is large enough for many unanticipated breakthroughs and roadblocks.
Maybe someone will figure out a much, much simpler foundational architecture than "perceptrons++", maybe we'll all be training clouds of 3D Gaussians, maybe quantum computers will finally take off and we don't even have the nouns for the building blocks we'll use.
On the negative side perhaps we hit a hard scaling limit (in hardware or training) that we didn't see coming. Or a civilizational setback.
All that said, though, if I were a betting man I wouldn't exactly wager against the article's conclusions; they're probably the best we can extrapolate knowing only the past and present state of affairs.
I think you are right, the next 33 years are likely to be very different.
I would lean to them being even more dramatic, due to the opportunity to advance algorithms, not just resources.
On the more obvious side, most libraries are not yet taking full advantage of many known gradient-optimization techniques. It’s been so much easier to just add data and compute that there is an overhang of techniques still to apply.
And large successful models are telling us important things.
For instance, it is clear that language models are learning a kind of logic of language similar to how we process thoughts, allowing highly disparate types of information to be woven together sensibly.
At some point, identifying the nature of that processing could radically simplify language processing.
That is just one opportunity for radical architecture and algorithm advances, and it would be revolutionary.
So should we spend the next 33 years doing the same things, just with more data and more compute power? That would be the logical conclusion of the breathless "I can't believe it is finally happening in my lifetime" and "we just need bigger models and more data" enthusiasm for LLMs when they first appeared. But can we really simply brute force our way to AGI?
Remember, 33 years ago "connectionist AI" wasn't the dominant AI paradigm, and "symbolic AI" wasn't the only other approach either - there were others, like "robotic functionalism" (the idea that you couldn't have true intelligence without interacting with the physical world). Maybe in 33 years some of these other approaches will have a resurgence, perhaps in combination with connectionist approaches. Or maybe there'll even be some entirely new approach.
Great article. I lived through the early days of artificial neural networks. I was on a DARPA advisory panel for neural network tooling in the mid 1980s, wrote the first version of the SAIC ANSim commercial product, and created the simple back-prop model that was deployed in the bomb detector my company built under contract to the FAA. I also managed a ‘conventional’ deep learning team at Capital One 5-6 years ago.
My world has been very exciting in the last 18 months. I spend as much time as I can exploring self hosted LLMs, APIs from Hugging Face, OpenAI, etc.
My mind is blown even thinking about tech 33 years from now!
The most fundamental change is the difference in what models are being trained on.
Little images of characters are a toy problem, very different from training on the linguistic and visual communication of essentially the whole human race.
Another 33 years of expanded computing resources won’t be training models to mimic the behavior and knowledge of humanity.
That problem (us!) will have been reduced to a toy problem long before then.
I think AI models will evolve by generating synthetic data, filtering and improving it, and then retraining. Possibly with external systems in the loop - code execution, search, human, simulation or robot. Quality won't degrade because there will be a lot of effort put into data filtering and diversity. We can always improve on a model by giving it more time.
Model architecture doesn't matter compared to the dataset. Any model from a class can learn the same skills from the same data, but change the data and they all change their abilities - the intelligence is in the data.
The future is data engineering, not model architecturing.
Human culture, by analogy, evolves faster than human biology. The data is evolving faster than the model. And we are seeing a drastic reduction in novel architectures in AI in recent years: diverse datasets applied to the same transformer models. Even among transformers, very few variants see wide use; thousands have been abandoned.
I like to think of it as language evolution by memetics being the real engine behind intelligence. We and AI are riding the language exponential together.
> Model architecture doesn't matter compared to the dataset. Any model from a class can learn the same skills from the same data, but change the data and they all change their abilities - the intelligence is in the data.
You might be right in the same sense that big-O notation is 'right'. Constant factor can matter; especially once you have to take energy use into account.
I don’t know. I find pessimistic views, like you are expressing, very strange.
My Tesla drives and navigates itself most of the time. 90-95% at least, just not 100%.
As opposed to cars 10 or more years ago, which didn’t do any of that.
To me it is much like the “God of the Gaps” when tremendous progress on a big problem is dismissed negatively, due to the (continuously shrinking) gaps of what it can’t do.
It's not clear that compute will scale as it did for the next 33 years. But it doesn't really need to.
I read the article and I was thinking "my God, I remember I used MSE that weekend in my pet ML project and it really didn't work out that well; wrong loss function." Our current crop of LLMs, or the one next year, will be perfectly able to tell me how I can improve my code and graphs, which means that I can deploy some expert-level techniques that otherwise would be "locked" to me by 50000 hours of "mastery acquisition".
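The "wrong loss function" point above can be made concrete. A toy sketch (made-up numbers) of why MSE is a poor fit for classification: on a confidently wrong prediction, cross-entropy blows up while MSE stays bounded, so the training signal is much weaker.

```python
import math

# Toy comparison of MSE vs. cross-entropy on a confidently wrong
# classification output. All numbers are illustrative.
target = [1.0, 0.0, 0.0]        # one-hot label
pred   = [0.01, 0.98, 0.01]     # softmax output, confidently wrong

mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
xent = -sum(t * math.log(p) for p, t in zip(pred, target))

print(f"MSE: {mse:.3f}, cross-entropy: {xent:.3f}")
```

MSE here is bounded near ~0.65 no matter how wrong the prediction, while cross-entropy exceeds 4.6 and keeps growing as confidence in the wrong class increases.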
A part of me is telling me that we humans are doomed, and that in 33 years we would have created a world in which we humans are irrelevant. But another part tells me that if we avoid that fate and all the other dooms, the future might just be quite bright.
We have heard, and will continue to hear, this sort of thing rather a lot. The last 5 yards are the hardest, but without them the previous 5 miles are of limited utility.
I think there’s going to be a point where we need to slow AI way, way down in order to avoid bad outcomes. I’m with Zvi Mowshowitz here: we should encourage progress and risk taking in every area except those where there are extinction risks. Applying today’s LLMs to all sorts of problems won’t end us. But I think we may only be a few years away from AGI that is conscious and can plan, and we don’t know the upper limit of how smart we’ll be able to make them.
And I think that we have a responsibility to any intelligent being we bring into the world. Some lament that there’s no test to become a parent; what about creating a million copies of a new virtual brain from scratch? And basically so they can be born into lifelong servitude.
This was really good. The only thing I didn't see explicitly discussed, although I guess it's obvious, is that what's different 33 years later is the inputs the models operate on. The '89 SOTA model used 16x16 greyscale images; today we have single-digit-megapixel color images. In 30 years, a desktop will be able to train CLIP in 90 seconds, but what will the SOTA models be trained on?
Human behaviour, in a way far more general than which token we might type next. To mimic humans as closely as might be possible with the basic deep-learning method, train something that can predict human behaviour in general. Training would require billions to quadrillions of hours of video, audio, and probably many other inputs, from many different people, engaged in the full variety of human activity.
Why? An adult by 25 only has 146k hours of video experience “training,” most of it repeated, derivative, and unproductive. And their encoded genes can be observed in their genome, so don’t need to be retrained by millions of years of evolution.
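The 146k-hours figure above is a quick back-of-envelope: 25 years of waking experience at roughly 16 hours a day (the waking-hours assumption is mine).

```python
# Back-of-envelope check on "146k hours of video experience by 25":
# 25 years * 365 days * ~16 waking hours per day (assumed).
hours = 25 * 365 * 16
print(hours)   # 146000
```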
We might have megapixel images that we can easily get with phone cameras, but virtually all vision models in common use take 224x224 resolution images as input, or maybe 384x384. Anything higher resolution than that just gets resampled down. It seems that you are better off using your compute budget on a bigger “brain” than on better “eyes” for now.
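The compute-budget point above can be illustrated with transformer-style vision models: token count grows with the square of resolution, and self-attention cost with the square of the token count. Patch size 16 is the common ViT choice; the resolutions here are illustrative.

```python
# Why higher input resolution is expensive for a ViT-style model.
# Tokens scale quadratically with resolution; self-attention scales
# quadratically with token count. Resolutions chosen for illustration.
patch = 16
for res in (224, 384, 1024):
    tokens = (res // patch) ** 2
    print(res, tokens, tokens ** 2)  # resolution, tokens, attention pairs
```

Going from 224 to 1024 pixels multiplies the token count by ~21x and the attention pairs by ~440x, which is why that budget tends to go to a bigger "brain" instead.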
I don't think that's current. Certainly the object-detection models work on bigger images, and the datasets they're pretrained on, e.g. COCO, are not 224x224. I think standard models pretrained on ImageNet, like the ResNets, usually have everything resized to 224x224, and so they favor this kind of scaling.
In this case there are good reasons for the resurgence, but that's really the case with pretty much anything software-related. Except the fashion cycles tend to be shorter with more mainstream technologies.
You mean joules (up to a constant factor)?
Not sure all the things it captures, but a model could be trained on the combination of audio/video/spatial/iris/what have you...
Now though, I’m sure people are taking LLMs and putting them together to do forward and backward chaining.
But a Turing award is pretty neat as well.