I think there is a slight disconnect here between making AI systems which are smart and AI systems which are useful. It’s a very old fallacy in AI: pretending tools which assist human intelligence by solving human problems must themselves be intelligent.
The utility of big datasets was indeed surprising, but that skepticism came about from recognizing the scaling paradigm must be a dead end: vertebrates across the board require less data to learn new things, by several orders of magnitude. Methods to give ANNs “common sense” are essentially identical to the old LISP expert systems: hard-wiring the answers to specific common-sense questions in either code or training data, even though fish and lizards can rapidly make common-sense deductions about manmade objects they couldn’t have possibly seen in their evolutionary histories. Even spiders have generalization abilities seemingly absent in transformers: they spin webs inside human homes with unnatural geometry.
Again it is surprising that the ImageNet stuff worked as well as it did. Deep learning is undoubtedly a useful way to build applications, just like Lisp was. But I think we are about as close to AGI as we were in the 80s, since we have made zero progress on common sense: in the 80s we knew Big Data can poorly emulate common sense, and that’s where we’re at today.
> vertebrates across the board require less data to learn new things, by several orders of magnitude.
Sometimes I wonder if it’s fair to say this.
Organisms have had billions of years of training. We might come online and succeed in our environments with very little data, but we can’t ignore the information that’s been trained into our DNA, so to speak.
What’s billions of years of sensory information that drove behavior and selection, if not training data?
My primary concern is the generalization to manmade things that couldn’t possibly be in the evolutionary “training data.” As a thought experiment, it seems very plausible that you can train a transformer ANN on spiderwebs between trees, rocks, bushes, etc, and get “superspider” performance (say in a computer simulation). But I strongly doubt this will generalize to building webs between garages and pantries like actual spiders, no matter how many trees you throw at it, so such a system wouldn’t be ASI.
This extends to all sorts of animal cognitive experiments: crows understand simple pulleys simply by inspecting them, but they couldn’t have evolved to use pulleys. Mice can quickly learn that hitting a button 5 times will give them a treat: does it make sense to say that they encountered a similar situation in their evolutionary past? It makes more sense to suppose that mice and crows have powerful abilities to reason causally about their actions. These abilities are more sophisticated than mere “Pavlovian” associative reasoning, which is about understanding stimuli. With AI we can emulate associative reasoning very well because we have a good mathematical framework for Pavlovian responses as a sort of learning of correlations. But causal reasoning is much more mysterious, and we are very far from figuring out a good mathematical formalism that a computer can make sense of.
I also just detest the evolution = training data metaphor because it completely ignores architecture. Evolution is not just glomming on data, it’s trying different types of neurons, different connections between them, etc. All organisms alive today evolved with “billions of years of training,” but only architecture explains why we are so much smarter than chimps. In fact I think the “evolution = training” framing preys on our misconception that humans are “more evolved” than chimps, but our common ancestor was more primitive than a chimp.
Difficult to compare: not only are neurons vastly more complex, but the neural networks also change and adapt. That's as if GPUs were not only programmed by software, but their hardware could also change based on the training data (like more sophisticated FPGAs).
Our DNA also stores a lot of information, but it is not that much.
Our dogs can learn about things such as vehicles, which they have not been exposed to for nearly long enough on an evolutionary timescale. So can crows, which use cars to crack nuts and wait for red lights to do it. And that's completely unsupervised.
> Organisms have had billions of years of training. We might come online and succeed in our environments with very little data, but we can’t ignore the information that’s been trained into our DNA, so to speak
It's not just information (e.g. sets of innate smells and response tendencies), but it's also all of the advanced functions built into our brains (e.g. making sense of different types of input, dynamically adapting the brain to conditions, etc.).
When you say billions of years, you have to remember that change in DNA is glacial compared to computing; we're talking the equivalent of years or even decades for a single training iteration to occur. Deep learning models on the other hand experience millions of these in a matter of a month, and each iteration is exposed to what would take a human thousands of lifetimes to be exposed to.
I also think this is a lazy claim. We have so, so many internal sources of information, like the feeling of temperature, or the vestibular system reacting to anything from a change in inclination to the effective power output of the heart, in real time, every second of the day.
This argument mostly just hollows out the meaning of training: evolution gives you things like arms and ears, but if you say evolution is like training you imply that you could have grown a new kind of arm in school.
>> Organisms have had billions of years of training.
You're referring to evolution but evolution is not optimising an objective function over a large set of data (labelled, too). Evolution proceeds by random mutation. And just because an ancestral form has encountered e.g. ice and knows what that is, doesn't mean that its evolutionary descendants retain the memory of ice and know what that is because of that memory.
tl;dr evolution and machine learning are radically different processes and it doesn't make a lot of sense to say that organisms have "trained" for millions of years. They haven't! They've evolved for millions of years.
>> What’s billions of years of sensory information that drove behavior and selection, if not training data?
That's not how it works: organisms don't train on data. They adapt to environments. Very different things.
> vertebrates across the board require less data to learn new things
the human brain is absolutely inundated with data, especially from visual, audio, and kinesthetic mediums. the data is a very different form than what one would use to train a CNN or LLM, but it is undoubtedly data. newborns start out literally being unable to see, and they have to develop those neural pathways by taking in the "pixels" of the world for every millisecond of every day
Do you have, offhand, any names or references to point me toward why you think fish and lizards can make rapid common sense deductions about man made objects they couldn't have seen in their evolutionary histories?
Also, separately, I'm only assuming but it seems the reason you think these deductions are different from hard wired answers is that their evolutionary lineage can't have had to make similar deductions. If that's your reasoning, it makes me wonder if you're using a systematic description of decisions and of the requisite data and reasoning systems to make those decisions, which would be interesting to me.
> I think there is a slight disconnect here between making AI systems which are smart and AI systems which are useful. It’s a very old fallacy in AI: pretending tools which assist human intelligence by solving human problems must themselves be intelligent.
I have difficulties understanding why you could even believe in such a fallacy: just look around you: most jobs that have to be done require barely any intelligence, and on the other hand, there exist few jobs that do require an insane amount of intelligence.
Again I do think these things have utility and the unreliability of LLMs is a bit incidental here. Symbolic systems in LISP are highly reliable, but they couldn’t possibly be extended to AGI without another component, since there was no way to get the humans out of the loop: someone had to assign the symbols semantic meaning and encode the LISP function accordingly. I think there’s a similar conceptual issue with current ANNs, and LLMs in particular: they rely on far too much formal human knowledge to get off the ground.
The article credits two academics (Hinton, Fei Fei Li) and a CEO (Jensen Huang). But really it was three academics.
Jensen Huang, reasonably, was desperate for any market that could suck up more compute, which he could pivot to from GPUs for gaming when gaming saturated its ability to use compute. Screen resolutions and visible polygons and texture maps only demand so much compute; it's an S-curve like everything else. So from a marketing/market-development and capital investment perspective I do think he deserves credit. Certainly the Intel guys struggled to similarly recognize it (and to execute even on plain GPUs.)
But... the technical/academic insight of the CUDA/GPU vision in my view came from Ian Buck's "Brook" PhD thesis at Stanford under Pat Hanrahan (Pixar+Tableau co-founder, Turing Award Winner) and Ian promptly took it to Nvidia where it was commercialized under Jensen.
Jensen embraced AI as a way to recover TAM after ASICs took over crypto mining. You can see that between-period in NVidia revenue and profit graphs.
By that time, GP-GPU had been around for a long, long time. CUDA still doesn't have much to do with AI - sure, it supports AI usage, even includes some AI-specific features (low-mixed precision blocked operations).
Jensen embraced AI way before that. cuDNN was released back in 2014. I remember being at ICLR in 2015, and there were three companies with booths: Google and Facebook, who were recruiting, and NVIDIA, which was selling a 4-GPU desktop computer.
ASICs never took over mining Ethereum because the algo was memory-hard and producing ASICs wasn't as profitable as just throwing GPUs at the problem...
that's what i remember. i remember reading an academic paper about a cool hack where someone was getting the shaders in gpus to do massively parallel general purpose vector ops. it was this massive orders of magnitude scaling that enabled neural networks to jump out of obscurity and into the limelight.
i remember prior to that, support vectors and rkhs were the hotness for continuous signal style ml tasks. they weren't particularly scalable and transfer learning formulations seemed quite complicated. (they were, however, pretty good for demos and contests)
They were running a massive neural network (by the standards back then) on a GPU years before CUDA even existed. Even funnier, they demoed it on ATI cards. But it still took until 2012 and AlexNet making heavy use of CUDA's simpler interface before the Deep Learning hype started to take off outside purely academic playgrounds.
So the insight neither came from Jensen nor the other authors mentioned above, but they were the first ones to capitalise on it.
I think neural nets are just a subset of machine learning techniques.
I wonder what would have happened if we poured the same amount of money, talent and hardware into SVMs, random forests, KNN, etc.
I don't say that transformers, LLMs, deep learning and other great things that happened in the neural network space aren't very valuable, because they are.
But I think in the future we should also study other options which might be better suited than neural networks for some classes of problems.
Can a very large and expensive LLM do sentiment analysis or classification? Yes, it can. But so can simple SVMs and KNN and sometimes even better.
I saw some YouTube coders doing calls to OpenAI's o1 model for some very simple classification tasks. That isn't the best tool for the job.
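For anyone curious how little code the classical route takes, here is a minimal sketch of a k-nearest-neighbours text classifier in plain Python. It assumes bag-of-words features and cosine similarity; the function names and the six toy training sentences are made up for illustration, and a real task would need a proper labelled dataset.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case, whitespace-split word counts (a deliberately crude feature)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """Majority label among the k training examples most similar to `text`."""
    query = bag_of_words(text)
    ranked = sorted(
        train,
        key=lambda example: cosine(bag_of_words(example[0]), query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, only to show the call shape.
TOY_TRAIN = [
    ("great product works well", "pos"),
    ("really great love it", "pos"),
    ("works great very happy", "pos"),
    ("terrible broke after a day", "neg"),
    ("awful waste of money", "neg"),
    ("terrible awful experience", "neg"),
]

print(knn_predict(TOY_TRAIN, "great works well"))
print(knn_predict(TOY_TRAIN, "awful terrible thing"))
```

There is no training step at all here: the "model" is just the stored examples, which is part of why this kind of method is cheap to try before reaching for a large hosted model.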
>I wonder what would have happened if we poured the same amount of money, talent and hardware into SVMs, random forests, KNN, etc.
But that's backwards from how new techniques and progress are made. What actually happens is somebody (maybe a student at a university) has an insight or new idea for an algorithm that's near $0 cost to implement as a proof-of-concept. Then everybody else notices the improvement and then extra millions/billions get directed toward it.
New ideas -- that didn't cost much at the start -- ATTRACT the follow on billions in investments.
This timeline of tech progress in computer science is the opposite from other disciplines such as materials science or bio-medical fields. Trying to discover the next super-alloy or cancer drug all requires expensive experiments. Manipulating atoms & molecules requires very expensive specialized equipment. In contrast, computer science experiments can be cheap. You just need a clever insight.
An example of that was the 2012 AlexNet image recognition algorithm that blew all the other approaches out of the water. Alex Krizhevsky had a new insight on a convolutional neural network to run on CUDA. He bought 2 NVIDIA cards (GTX580 3GB GPU) from Amazon. It didn't require NASA levels of investment at the start to implement his idea. Once everybody else noticed his superior results, the billions began pouring in to iterate/refine on CNNs.
Both the "attention mechanism" and the refinement of "transformer architecture" were also cheap to prove out at a very small scale. In 2014, Jakob Uszkoreit thought about an "attention mechanism" instead of RNN and LSTM for machine translation. It didn't cost billions to come up with that idea. Yes, ChatGPT-the-product cost billions but the "attention mechanism algorithm" did not.
>into SVMs, random forests, KNN, etc.
If anyone has found an unknown insight into SVM, KNN, etc that everybody else in the industry has overlooked, they can do cheap experiments to prove it. E.g. The entire Wikipedia text download is currently only ~25GB. Run the new SVM classification idea on that corpus. Very low cost experiments in computer science algorithms can still be done in the proverbial "home garage".
Transformers were made for machine translation - someone had the insight that when going from one language to another the context mattered such that the tokens that came before would bias which ones came after. It just so happened that transformers were more performant on other tasks, and at the time you could demonstrate the improvement on a small scale.
>I wonder what would have happened if we poured the same amount of money, talent and hardware into SVMs, random forests, KNN, etc.
people did that to horses. No car resulted from it, just slightly better horses.
>I saw some YouTube coders doing calls to OpenAI's o1 model for some very simple classification tasks. That isn't the best tool for the job.
This "not best tool" is just there for the coders to call while the "simple SVMs and KNN" would require coding and training by those coders for the specific task they have at hand.
The best tool for the job is, I’d argue, the one that does the job most reliably for the least amount of money. When you consider how little expertise or data you need to use OpenAI offerings, I’d be surprised if sentiment analysis using classical ML methods is actually better (unless you are an expert and have a good dataset).
As a simple example, if you ask a question and part of the answer is directly quoted from a book from memory, that text is not computed/reasoned by the AI and so doesn't have an "explanation".
But I also suspect that any AGI would necessarily produce answers it can't explain. That's called intuition.
Neural networks can encode any computable function.
KANs have no advantage in terms of computability. Why are they a promising pathway?
Also, the splines in KANs are no more "explainable" than the matrix weights. Sure, we can assign importance to a node, but so what? It has no more meaning than anything else.
Deep learning is easy to adapt to various domains, use cases, training criteria. Other approaches do not have the flexibility of combining arbitrary layers and subnetworks and then training them with arbitrary loss functions. The depth in deep learning is also pretty important, as it allows the model to create hierarchical representations of the inputs.
> I wonder what would have happened if we poured the same amount of money, talent and hardware into SVMs, random forests, KNN, etc.
From my perspective, that is actually what happened between the mid-90s and 2015. Neural networks were dead in that period, but any other ML method was very, very hot.
I think neural networks are fundamental, and we will focus on and experiment a lot more with architecture, layers and the other parts involved, but emergent features arise through size.
You are supposed to call it AI now. The word "machine learning" is for GOFAI 2nd gen only. Once all investors have been money drained and the next AI winter begins, then you will be allowed to call it Machine Learning
> “Pre-ImageNet, people did not believe in data,” Li said in a September interview at the Computer History Museum. “Everyone was working on completely different paradigms in AI with a tiny bit of data.”
That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.
Not baloney. The culture around data in 2005-2010 -- at least / especially in academia -- was night and day to where it is today. It's not that people didn't understand that more data enabled richer + more accurate models, but that they accepted data constraints as a part of the problem setup.
Most methods research went into ways of building beliefs about a domain into models as biases, so that they could be more accurate in practice with less data. (This describes a lot of PGM work). This was partly because there was still a tug of war between CS and traditional statistics communities on ML, and the latter were trained to be obsessive about model specification.
One result was that the models that were practical for production inference were often trained to the point of diminishing returns on their specific tasks. Engineers deploying ML weren't wishing for more training instances, but better data at inference time.
Models that could perform more general tasks -- like differentiating 90k object classes rather than just a few -- were barely even on most people's radar.
Perhaps folks at Google or FB at the time have a different perspective. One of the reasons I went ABD in my program was that it felt industry had access to richer data streams than academia. Fei Fei Li's insistence on building an academic computer science career around giant data sets really was ingenious, and even subversive.
The culture was and is skeptical in biased manners. Between '04 and '08 I worked with a group that had trained neural nets for 3D reconstruction of human heads. They were using it for prenatal diagnostics and a facial recognition pre-processor, and I was using it for creating digital doubles in VFX film making. By '08 I'd developed a system suitable for use in mobile advertising, creating ads with people in them, and 3D games with your likeness as the player. VCs thought we were frauds, and their tech advisors told them our tech was an old discredited technique that could not do what we claimed. We spoke to every VC, some of which literally kicked us out. Finally, after years of "no" that same AlexNet success begins to change minds, but now they want the tech to create porn. At that point, after years of "no" I was making children's educational media, there was no way I was gonna do porn. Plus, president of my co was a woman, famous for creating children's media. Yeah, the culture was different then, not too long ago.
It's not quite so - we couldn't handle it, and we didn't have it, so it was a bit of a non-question.
I started with ML in 1994. I was in a small, poor lab, so we didn't have state of the art hardware. On the other hand I think my experience is fairly representative. We worked with data sets on SPARC workstations that were stored in flat files and had thousands or sometimes tens of thousands of instances. We had problems keeping our data sets on the machines and often archived them to tape.
Data came from very deliberate acquisition processes. For example I remember going to a field exercise with a particular device and directing its use over a period of days in order to collect the data that would be needed for a machine learning project.
Sometime in the 2000's data started to be generated and collected as "exhaust" from various processes. People and organisations became instrumented in the sense that their daily activities were necessarily captured digitally. For a time this data was latent, people didn't really think about using it in the way that we think about it now, but by about 2010 it was obvious that not only was this data available but we had the processing and data systems to use it effectively.
Answering the people arguing against my comment: you guys do not seem to take into account that the technical circumstances were totally different thirty, twenty or even ten years ago! People would have liked to train with more data, and there was a big interest in combining heterogeneous datasets to achieve exactly that. But one major problem was the compute! There weren't any pretrained models that you specialized in one way or the other - you always retrained from scratch. I mean, even today, who's got the capability to train a multibillion GPT from scratch? And not just retraining a tried and trusted architecture+dataset once, no, I mean as a research project trying to optimize your setup towards a certain goal.
Last week Hugging Face released SmolLM v2 1.7B trained on 11T tokens, 3 orders of magnitude more training data for roughly the same number of parameters with almost the same architecture.
So even back in 2019 we can say we were working with a tiny amount of data compared to what is routine now.
True. But my point is that the quote "people didn't believe in data" is not true. Back in 2019, when GPT-2 was trained, the reason they didn't use the 3T of today was not because they "didn't believe in data" - they totally would have had it been technically feasible (as in: they had that much data + the necessary compute).
The same has always been true. There has never been a stance along the lines of "ah, let's not collect more data - it's not worth it!". It's always been other reasons, typically the lack of resources.
Not really. This is referring back to the 80's. People weren't even doing 'ML'. And back then people were more focused on teasing out 'laws' in as few data points as possible. The focus was more on formulas and symbols, and finding relationships between individual data points. Not the broad patterns we take for granted today.
I'm surprised that the article doesn't mention that one of the key factors that enabled deep learning was the use of ReLU as the activation function in the early 2010s. ReLU behaves a lot better than the logistic sigmoid that we used until then.
Geoffrey Hinton (now a Nobel Prize winner!) himself did a summary. I think it is the single best summary on this topic.
Our labeled datasets were thousands of times too small.
Our computers were millions of times too slow.
We initialized the weights in a stupid way.
We used the wrong type of non-linearity.
As compute has outpaced memory bandwidth most recent stuff has moved away from ReLU. I think Llama 3.x uses SwiGLU. Still probably closer to ReLU than logistic sigmoid, but it's back to being something more smooth than ReLU.
Indeed, there have been so many new activation functions that I have stopped following the literature after I retired. I am glad to see that people are trying out new things.
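To make the comparison in this sub-thread concrete, here is a small sketch of the activations mentioned above, in plain Python on scalars. It shows one commonly cited reason ReLU trains better in deep stacks: the sigmoid's gradient is at most 0.25 and shrinks towards zero for large inputs, while ReLU's gradient is exactly 1 for any positive input. The `swiglu` helper is a scalar simplification of the gated form: in a real layer the gate and value would come from two separate learned projections, which are assumed here rather than shown.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), never larger than 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def silu(x):
    """SiLU / Swish: a smooth curve that approaches ReLU for large |x|."""
    return x * sigmoid(x)


def swiglu(gate, value):
    """Scalar sketch of SwiGLU gating: silu(gate) * value.
    `gate` and `value` stand in for two projections of the same input."""
    return silu(gate) * value


# Backprop multiplies one such factor per layer, so compare a 10-layer chain.
depth = 10
print("sigmoid chain, best case:", sigmoid_grad(0.0) ** depth)
print("relu chain, active units:", relu_grad(1.0) ** depth)
```

The best-case sigmoid chain is already tiny after ten layers, while the ReLU chain stays at 1 as long as the units are active. SiLU keeps most of that behaviour for large positive inputs while being smooth around zero.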
“GeForce 256 was marketed as "the world's first 'GPU', or Graphics Processing Unit", a term Nvidia defined at the time as "a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second"”
They may have been the first with a product that fitted that definition to market.
That sounds like marketing wank, not a description of an invention.
I don't think you can get a speedup by running neural networks on the GeForce 256, and the features listed there aren't really relevant (or arguably even present) in today's GPUs. As I recall, people were trying to figure out how to use GPUs to get faster processing in their Beowulfs in the late 90s and early 21st century, but it wasn't until about 02005 that anyone could actually get a speedup. The PlayStation 3's "Cell" was a little more flexible.
After actually having read the article I can say that your comment is unnecessarily negative and clueless.
The article is a very good historical one showing how 3 important things came together to make the current progress possible, namely:
1) Geoffrey Hinton's back-propagation algorithm for deep neural networks
2) Nvidia's GPU hardware used via CUDA for AI/ML and
3) Fei-Fei Li's huge ImageNet database to train the algorithm on the hardware. This team actually used "Amazon Mechanical Turk"(AMT) to label the massive dataset of 14 million images.
Excerpts:
“Pre-ImageNet, people did not believe in data,” Li said in a September interview at the Computer History Museum. “Everyone was working on completely different paradigms in AI with a tiny bit of data.”
“That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time,” Li said in a September interview at the Computer History Museum. “The first element was neural networks. The second element was big data, using ImageNet. And the third element was GPU computing.”
Wow, that is harsh. The quoted claim is in the middle of a very long article. The background of the author seems to be more on the scientific side, than the technical side. So throw out everything, because the author got one (not very important) date wrong?
Possibly technically correct, but utterly irrelevant. The 3dfx chips accelerated parts of the 3d graphics pipeline and were not general-purpose programmable computers the way a modern GPU is (and thus would be useless for deep learning or any other kind of AI).
If you are going to count 3dfx as a proper GPU and not just a geometry and lighting accelerator, then you might as well go back further and count things like the SGI Reality Engine. Either way, 3dfx wasn't really first to anything meaningful.
But the first NVidia GPUs didn't have general-purpose compute either. Google informs me that the first GPU with user-programmable shaders was the GeForce 3 in 2001.
Defining who "really" invented something is often tricky. For example I mentioned in the article that there is some dispute about who discovered backpropagation.
According to Wikipedia, Nvidia released its first product, the RV1, in November 1995, the same month 3dfx released its first Voodoo Graphics 3D chip. Is there reason to think the 3dfx card was more of a "true" GPU than the RV1? If not, I'd say Nvidia has as good a claim to inventing the GPU as 3dfx does.
Arguably the November 01981 launch of Silicon Graphics kickstarted GPU interest and OpenGL. You can read Jim Clark's 01982 paper about the Geometry Engine in https://web.archive.org/web/20170513193926/http://excelsior..... His first key point in the paper was that the chip had a "general instruction set", although what he meant by it was quite different from today's GPUs. IRIS GL started morphing into OpenGL in 01992, and certainly when I went to SIGGRAPH 93 it was full of hardware-accelerated 3-D drawn with OpenGL on Silicon Graphics Hardware. But graphics coprocessors date back to the 60s; Evans & Sutherland was founded in 01968.
I mean, I certainly don't think NVIDIA invented the GPU—that's a clear error in an otherwise pretty decent article—but it was a pretty gradual process.
The deep learning boom caught deep-learning researchers by surprise because deep-learning researchers don't understand their craft well enough to predict essential properties of their creations.
A model is grown, not crafted like a computer program, which makes it hard to predict. (More precisely, a big growth phase follows the crafting phase.)
I was a deep learning researcher. The problem is that accuracy (+ related metrics) were prioritized in research and funding. Factors like interpretability, extrapolation, efficiency, or consistency were not prioritized, but were clearly important before being implemented.
Dall-E was the only big surprising consumer model -- 2022 saw a sudden huge leap from "txt2img is kind of funny" to "txt2img is actually interesting". I would have assumed such a thing could only come in 2030 or later. But deep learning is full of counterintuitive results (like the NFL theorem not mattering, or ReLU being better than sigmoid).
But in hindsight, it was naive to think "this does not work yet" would get in the way of the products being sold and monetized.
So the AI boom of the last 12 years was made possible by three visionaries who pursued unorthodox ideas in the face of widespread criticism.
I argue that Mikolov with word2vec was instrumental in the current AI revolution. This demonstrated the ease of extracting meaning from text in a mathematical way and directly led to all the advancements we have today with LLMs. And ironically, it didn’t require a GPU.
The utility of big datasets was indeed surprising, but that skepticism came about from recognizing the scaling paradigm must be a dead end: vertebrates across the board require less data to learn new things, by several orders of magnitude. Methods to give ANNs “common sense” are essentially identical to the old LISP expert systems: hard-wiring the answers to specific common-sense questions in either code or training data, even though fish and lizards can rapidly make common-sense deductions about manmade objects they couldn’t have possibly seen in their evolutionary histories. Even spiders have generalization abilities seemingly absent in transformers: they spin webs inside human homes with unnatural geometry.
Again it is surprising that the ImageNet stuff worked as well as it did. Deep learning is undoubtedly a useful way to build applications, just like Lisp was. But I think we are about as close to AGI as we were in the 80s, since we have made zero progress on common sense: in the 80s we knew Big Data can poorly emulate common sense, and that’s where we’re at today.
Sometimes I wonder if it’s fair to say this.
Organisms have had billions of years of training. We might come online and succeed in our environments with very little data, but we can’t ignore the information that’s been trained into our DNA, so to speak.
What’s billions of years of sensory information that drove behavior and selection, if not training data?
This extends to all sorts of animal cognitive experiments: crows understand simple pulleys simply by inspecting them, but they couldn’t have evolved to use pulleys. Mice can quickly learn that hitting a button 5 times will give them a treat: does it make sense to say that they encountered a similar situation in their evolutionary past? It makes more sense to suppose that mice and crows have powerful abilities to reason causally about their actions. These abilities are more sophisticated than mere “Pavlovian” associative reasoning, which is about understanding stimuli. With AI we can emulate associative reasoning very well because we have a good mathematical framework for Pavlovian responses as a sort of learning of correlations. But causal reasoning is much more mysterious, and we are very far from figuring out a good mathematical formalism that a computer can make sense of.
I also just detest the evolution = training data metaphor because it completely ignores architecture. Evolution is not just glomming on data, it’s trying different types of neurons, different connections between them, etc. All organisms alive today evolved with “billions of years of training,” but only architecture explains why we are so much smarter than chimps. In fact I think the “evolution” preys on our misconception that humans are “more evolved” than chimps, but our common ancestor was more primitive than a chimp.
Our DNA also stores a lot of information, but it is not that much.
Our dogs can learn about things such as vehicles, which they have had nowhere near enough exposure to on an evolutionary timescale. So can crows, which use cars to crack nuts and then wait for red lights. And that's completely unsupervised.
We have a long way to go.
It's not just information (e.g. sets of innate smells and response tendencies), but it's also all of the advanced functions built into our brains (e.g. making sense of different types of input, dynamically adapting the brain to conditions, etc.).
Like, how good would LLMs be if their training set was built by humans responding with an intelligent signal at every crossroads?
There's around 600MB in our DNA. Subtract this from the size of any LLM out there and see how much you get.
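For a rough sense of scale, the comparison can be worked through as back-of-the-envelope arithmetic. Every number below is an assumption chosen for illustration: roughly 3.1 billion base pairs in one copy of the human genome, 2 bits per base with no compression, and a hypothetical 7-billion-parameter model stored as 16-bit weights.

```python
# All constants are assumptions for illustration, not measurements.
BASE_PAIRS = 3.1e9       # approximate length of one copy of the human genome
BITS_PER_BASE = 2        # four possible bases (A, C, G, T) fit in 2 bits

genome_bytes = BASE_PAIRS * BITS_PER_BASE / 8
genome_mb = genome_bytes / 1e6

PARAMS = 7e9             # hypothetical model size
BYTES_PER_PARAM = 2      # 16-bit weights
model_bytes = PARAMS * BYTES_PER_PARAM

ratio = model_bytes / genome_bytes
print(f"genome: ~{genome_mb:.0f} MB, model: ~{model_bytes / 1e9:.0f} GB, ratio: ~{ratio:.0f}x")
```

Under these assumptions the uncompressed genome is the same order of magnitude as the figure quoted above, and the hypothetical model's weights are more than an order of magnitude larger.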
You're referring to evolution but evolution is not optimising an objective function over a large set of data (labelled, too). Evolution proceeds by random mutation. And just because an ancestral form has encountered e.g. ice and knows what that is, doesn't mean that its evolutionary descendants retain the memory of ice and know what that is because of that memory.
tl;dr evolution and machine learning are radically different processes and it doesn't make a lot of sense to say that organisms have "trained" for millions of years. They haven't! They've evolved for millions of years.
>> What’s billions of years of sensory information that drove behavior and selection, if not training data?
That's not how it works: organisms don't train on data. They adapt to environments. Very different things.
the human brain is absolutely inundated with data, especially from visual, audio, and kinesthetic mediums. the data is a very different form than what one would use to train a CNN or LLM, but it is undoubtedly data. newborns start out literally being unable to see, and they have to develop those neural pathways by taking in the "pixels" of the world for every millisecond of every day
Also, separately, I'm only assuming, but it seems the reason you think these deductions are different from hard-wired answers is that their evolutionary lineage can't have had to make similar deductions. If that's your reasoning, it makes me wonder if you're using a systematic description of decisions and of the requisite data and reasoning systems to make those decisions, which would be interesting to me.
I have difficulties understanding why you could even believe in such a fallacy: just look around you: most jobs that have to be done require barely any intelligence, and on the other hand, there exist few jobs that do require an insane amount of intelligence.
Jensen Huang, reasonably, was desperate for any market that could suck up more compute, which he could pivot to from GPUs for gaming when gaming saturated its ability to use compute. Screen resolutions and visible polygons and texture maps only demand so much compute; it's an S-curve like everything else. So from a marketing/market-development and capital investment perspective I do think he deserves credit. Certainly the Intel guys struggled to similarly recognize it (and to execute even on plain GPUs.)
But... the technical/academic insight of the CUDA/GPU vision in my view came from Ian Buck's "Brook" PhD thesis at Stanford under Pat Hanrahan (Pixar+Tableau co-founder, Turing Award Winner) and Ian promptly took it to Nvidia where it was commercialized under Jensen.
For a good telling of this under-told story, see one of Hanrahan's lectures at MIT: https://www.youtube.com/watch?v=Dk4fvqaOqv4
Corrections welcome.
By that time, GP-GPU had been around for a long, long time. CUDA still doesn't have much to do with AI - sure, it supports AI usage, even includes some AI-specific features (low- and mixed-precision blocked operations).
TAM: Total Addressable Market
https://www.vijaypradeep.com/blog/2017-04-28-ethereums-memor...
At the peak, there were around 18-25m GPUs deployed worldwide.
Source: I mined with 150k AMD GPUs.
i remember prior to that, support vectors and rkhs were the hotness for continuous signal style ml tasks. they weren't particularly scalable and transfer learning formulations seemed quite complicated. (they were, however, pretty good for demos and contests)
They were running a massive neural network (by the standards back then) on a GPU years before CUDA even existed. Even funnier, they demoed it on ATI cards. But it still took until 2012 and AlexNet making heavy use of CUDA's simpler interface before the Deep Learning hype started to take off outside purely academic playgrounds.
So the insight came neither from Jensen nor from the other authors mentioned above, but they were the first ones to capitalise on it.
I wonder what would have happened if we poured the same amount of money, talent and hardware into SVMs, random forests, KNN, etc.
I don't say that transformers, LLMs, deep learning and other great things that happened in the neural network space aren't very valuable, because they are.
But I think in the future we should also study other options which might be better suited than neural networks for some classes of problems.
Can a very large and expensive LLM do sentiment analysis or classification? Yes, it can. But so can simple SVMs and KNN and sometimes even better.
I saw some YouTube coders doing calls to OpenAI's o1 model for some very simple classification tasks. That isn't the best tool for the job.
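To illustrate how little machinery a simple classifier needs, here is a minimal k-nearest-neighbours sketch in plain Python. The training sentences, labels, and word-overlap similarity are all made up for the example, and it is not a benchmark against any LLM; a real system would use a proper library and far more data.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count whitespace-separated words."""
    return Counter(text.lower().split())


def overlap(a, b):
    """Number of word occurrences shared by two bags of words."""
    return sum((a & b).values())


def knn_predict(train, text, k=3):
    """Label `text` by majority vote among the k most similar training examples."""
    query = bag_of_words(text)
    ranked = sorted(
        train,
        key=lambda example: overlap(bag_of_words(example[0]), query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, hand-written training set (hypothetical data).
TRAIN = [
    ("great movie loved it", "pos"),
    ("loved the acting great fun", "pos"),
    ("wonderful and fun", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring and terrible acting", "neg"),
    ("hated the boring plot", "neg"),
]

positive_guess = knn_predict(TRAIN, "loved it great fun")
negative_guess = knn_predict(TRAIN, "terrible boring plot")
```

The trade-off is that someone has to collect and label the training examples, whereas a hosted LLM can be called with no task-specific data at all.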
But that's backwards from how new techniques and progress are made. What actually happens is somebody (maybe a student at a university) has an insight or new idea for an algorithm that's near $0 cost to implement as a proof-of-concept. Then everybody else notices the improvement and then extra millions/billions get directed toward it.
New ideas -- that didn't cost much at the start -- ATTRACT the follow on billions in investments.
This timeline of tech progress in computer science is the opposite from other disciplines such as materials science or bio-medical fields. Trying to discover the next super-alloy or cancer drug all requires expensive experiments. Manipulating atoms & molecules requires very expensive specialized equipment. In contrast, computer science experiments can be cheap. You just need a clever insight.
An example of that was the 2012 AlexNet image recognition algorithm that blew all the other approaches out of the water. Alex Krizhevsky had a new insight on a convolutional neural network to run on CUDA. He bought 2 NVIDIA cards (GTX580 3GB GPU) from Amazon. It didn't require NASA levels of investment at the start to implement his idea. Once everybody else noticed his superior results, the billions began pouring in to iterate/refine on CNNs.
Both the "attention mechanism" and the refinement of "transformer architecture" were also cheap to prove out at a very small scale. In 2014, Jakob Uszkoreit thought about an "attention mechanism" instead of RNN and LSTM for machine translation. It didn't cost billions to come up with that idea. Yes, ChatGPT-the-product cost billions but the "attention mechanism algorithm" did not.
>into SVMs, random forests, KNN, etc.
If anyone has found an unknown insight into SVM, KNN, etc that everybody else in the industry has overlooked, they can do cheap experiments to prove it. E.g. The entire Wikipedia text download is currently only ~25GB. Run the new SVM classification idea on that corpus. Very low cost experiments in computer science algorithms can still be done in the proverbial "home garage".
This falls apart for breakthroughs that are not zero cost to do a proof-of-concept.
I think that is what the parent is referring to: that other technologies might have more potential, but would take money to build out.
I thought the main insights were embeddings, positional encoding and shortcuts through layers to improve back propagation.
People don't even think of doing anything else, and those who might are paid to pursue research on LLMs.
people did that to horses. No car resulted from it, just slightly better horses.
>I saw some YouTube coders doing calls to OpenAI's o1 model for some very simple classification tasks. That isn't the best tool for the job.
This "not best tool" is just there for the coders to call while the "simple SVMs and KNN" would require coding and training by those coders for the specific task they have at hand.
As a simple example, if you ask a question and part of the answer is directly quoted from a book from memory, that text is not computed/reasoned by the AI and so doesn't have an "explanation".
But I also suspect that any AGI would necessarily produce answers it can't explain. That's called intuition.
KANs have no advantage in terms of computability. Why are they a promising pathway?
Also, the splines in KANs are no more "explainable" than the matrix weights. Sure, we can assign importance to a node, but so what? It has no more meaning than anything else.
From my perspective, that is actually what happened from the mid-90s to 2015. Neural networks were dead in that period, but any other ML method was very, very hot.
I think neural networks are fundamental, and we will focus on and experiment a lot more with architecture, layers and the other parts involved, but emergent features arise through size.
Fact by definition
That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.
Most methods research went into ways of building beliefs about a domain into models as biases, so that they could be more accurate in practice with less data. (This describes a lot of PGM work). This was partly because there was still a tug of war between CS and traditional statistics communities on ML, and the latter were trained to be obsessive about model specification.
One result was that the models that were practical for production inference were often trained to the point of diminishing returns on their specific tasks. Engineers deploying ML weren't wishing for more training instances, but better data at inference time. Models that could perform more general tasks -- like differentiating 90k object classes rather than just a few -- were barely even on most people's radar.
Perhaps folks at Google or FB at the time have a different perspective. One of the reasons I went ABD in my program was that it felt industry had access to richer data streams than academia. Fei-Fei Li's insistence on building an academic computer science career around giant data sets really was ingenious, and even subversive.
I've never heard this be put so succinctly! Thank you
I started with ML in 1994, in a small, poor lab, so we didn't have state-of-the-art hardware. On the other hand, I think my experience is fairly representative. We worked on SPARC workstations with data sets that were stored in flat files and had thousands or sometimes tens of thousands of instances. We had problems keeping our data sets on the machines and often archived them to tape.
Data came from very deliberate acquisition processes. For example, I remember going to a field exercise with a particular device and directing its use over a period of days in order to collect the data that would be needed for a machine learning project.
Sometime in the 2000s data started to be generated and collected as "exhaust" from various processes. People and organisations became instrumented in the sense that their daily activities were necessarily captured digitally. For a time this data was latent; people didn't really think about using it in the way that we think about it now, but by about 2010 it was obvious that not only was this data available but we had the processing and data systems to use it effectively.
Last week Hugging Face released SmolLM v2 1.7B trained on 11T tokens, 3 orders of magnitude more training data for the same number of parameters with almost the same architecture.
So even back in 2019 we can say we were working with a tiny amount of data compared to what is routine now.
The same has always been true. There has never been a stance along the lines of "ah, let's not collect more data - it's not worth it!". It's always been other reasons, typically the lack of resources.
The earliest paper I know which says this explicitly is "The Unreasonable Effectiveness of Data" from 2009, only three years before AlexNet:
https://static.googleusercontent.com/media/research.google.c...
It's about machine translation.
nets too small (not enough layers)
gradients not flowing (residual connections)
layer outputs not normalized
training algorithms and procedures not optimal (Adam, warm-up, etc)
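Two of the items in this list, residual connections and normalized layer outputs, can be sketched in a few lines of plain Python. This is only an illustration: the function names are hypothetical, and the "pre-norm" arrangement (normalize first, then add the result back to the input) is one common choice rather than something the list specifies.

```python
import math


def layer_norm(x, eps=1e-5):
    """Rescale a vector to roughly zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weights):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, weights):
    """Compute x + relu(W @ layer_norm(x)); the skip path carries x through unchanged."""
    update = relu(linear(layer_norm(x), weights))
    return [a + b for a, b in zip(x, update)]


x = [1.0, 2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]
# With all-zero weights the block reduces to the identity, because of the skip path.
passthrough = residual_block(x, zero_weights)
normed = layer_norm(x)
```

Because each block adds its update to its input instead of replacing it, a deep stack always has a direct path from output back to input, which is the usual explanation for why gradients flow better through such networks.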
Arguably the November 1996 launch of 3dfx kickstarted GPU interest and OpenGL.
After reading that, it’s hard to take the author seriously on the rest of the claims.
“GeForce 256 was marketed as "the world's first 'GPU', or Graphics Processing Unit", a term Nvidia defined at the time as "a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second"”
They may have been the first with a product that fitted that definition to market.
I don't think you can get a speedup by running neural networks on the GeForce 256, and the features listed there aren't really relevant (or arguably even present) in today's GPUs. As I recall, people were trying to figure out how to use GPUs to get faster processing in their Beowulfs in the late 90s and early 21st century, but it wasn't until about 02005 that anyone could actually get a speedup. The PlayStation 3's "Cell" was a little more flexible.
The article is a very good historical one showing how 3 important things came together to make the current progress possible, viz.:
1) Geoffrey Hinton's back-propagation algorithm for deep neural networks
2) Nvidia's GPU hardware used via CUDA for AI/ML and
3) Fei-Fei Li's huge ImageNet database to train the algorithm on the hardware. This team actually used "Amazon Mechanical Turk"(AMT) to label the massive dataset of 14 million images.
Excerpts:
“Pre-ImageNet, people did not believe in data,” Li said in a September interview at the Computer History Museum. “Everyone was working on completely different paradigms in AI with a tiny bit of data.”
“That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time,” Li said in a September interview at the Computer History Museum. “The first element was neural networks. The second element was big data, using ImageNet. And the third element was GPU computing.”
If you are going to count 3dfx as a proper GPU and not just a geometry and lighting accelerator, then you might as well go back further and count things like the SGI Reality Engine. Either way, 3dfx wasn't really first to anything meaningful.
According to Wikipedia, Nvidia released its first product, the NV1, in November 1995, the same month 3dfx released its first Voodoo Graphics 3D chip. Is there reason to think the 3dfx card was more of a "true" GPU than the NV1? If not, I'd say Nvidia has as good a claim to inventing the GPU as 3dfx does.
3dfx Voodoo cards were initially more successful, but I don’t think anything not actually used for deep learning should count.
I mean, I certainly don't think NVIDIA invented the GPU—that's a clear error in an otherwise pretty decent article—but it was a pretty gradual process.
A model is grown, not crafted like a computer program, which makes it hard to predict. (More precisely, a big growth phase follows the crafting phase.)
Dall-E was the only big surprising consumer model-- 2022 saw a sudden huge leap from "txt2img is kind of funny" to "txt2img is actually interesting". I would have assumed such a thing could only come in 2030 or later. But deep learning is full of counterintuitive results (like the no-free-lunch (NFL) theorem not mattering, or ReLU being better than sigmoid).
But in hindsight, it was naive to think "this does not work yet" would get in the way of the products being sold and monetized.
I argue that Mikolov with word2vec was instrumental in the current AI revolution. It demonstrated the ease of extracting meaning from text in a mathematical way and led directly to all the advancements we have today with LLMs. And ironically, it didn’t require a GPU.