This essay sort of waves a hand at the sub-symbolic roots of knowledge that lie beneath text and that babies spend several years mastering before they are ever exposed to text. IMHO the proper measure of that latent knowledge is qualitative, not quantitative.
It's the tacit 'grounded knowledge' of the world present in humans that has the potential to fill in the causal blanks in LLMs' superficial, text-based information. This kind of knowledge is threadbare in today's LLMs, but it's essential as a basis for further self-education in any intelligent agent. I know STaR and RLHF have been suggested as synthetic means to achieve that experimental end, but I'm not sure they're sufficient to connect the dots between LLMs' high-level book learning and human babies' low-level, experiment-based intuition for cause and effect. Adding yet more text data, though, is surely NOT the way to span that chasm.
It is an intelligent argument, but it isn't leveraging its own insight enough.
> So then how do humans generalize so well from so little language data? Is “pre-training” on visual data the secret to our success? No. Because… blind people? What are we doing here?
The problem there is that the author is making the same mistake they identified earlier. Things like vision or touch are channels; what matters is the amount of data that is ingested through them. Vision obviously presents vastly more data than text. But a sense of touch is actually vastly more data too, because touch interacts with the same source of data as vision. It loses colour information and the bandwidth is lower, but there is still a lot more there than in text.
If the article had estimated how much data is present in bytes, it wouldn't have dismissed vision so easily. What matters is that the real world offers orders of magnitude more training data than all the text on the internet. Data that can't be polluted by AI blogspam, for that matter.
How many bits of all that sensory information--note that our vision is nearly optimally sensitive--actually make it to the brain? I think there's got to be a large amount of low-effort (structurally baked-in, automatic) filtering, aggregation, and lossy compression happening, otherwise it would just be way too much, right?
While I'm sure there is a lot of noise filtered out, the opposite is true as well. Much of what we perceive is interpolated, pattern-matched filler.
I don't know if this is due to source-input limitations, or if it is a compression > processing > decompression technique for efficiency. Either way, it does imply that the amount of data desired is more than what makes it through the bottleneck.
Practically it has to be higher than the bitrate of the audio-visual data presented on a computer. Call that 1 MB/s for a video stream (way undercalling the amount of data human vision reports, I would suggest). That puts a lower bound of around 50 GB/day of new data, or roughly 18 TB/year. Of course, computers can train with more than two eyes at one location and perspective. We aren't anywhere near the data cap with current training.
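The arithmetic behind those numbers, as a quick sanity check. Both inputs are rough guesses from the comment: a 1 MB/s visual stream and about 14 waking hours a day.

```python
# Back-of-the-envelope check of the sensory data-rate claim above.
# Assumptions (both rough): 1 MB/s visual stream, ~14 waking hours/day.
MB = 1_000_000
stream_rate = 1 * MB          # bytes per second
waking_seconds = 14 * 3600    # ~14 waking hours per day

per_day = stream_rate * waking_seconds   # ~50 GB/day
per_year = per_day * 365                 # ~18 TB/year

print(f"{per_day / 1e9:.1f} GB/day, {per_year / 1e12:.1f} TB/year")
# → 50.4 GB/day, 18.4 TB/year
```

So a single 1 MB/s stream already lands in the tens of terabytes per year, before adding extra "eyes" or modalities.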
Although to be fair I do suspect that most of that data is repetitive, boring and of little use. In my opinion some sort of check for novel data is probably going to be the next big breakthrough in machine learning.
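A minimal sketch of what such a "check for novel data" could look like: keep what you've seen, and only admit samples that are far from all of it. Everything here (the distance metric, the threshold, the raw-vector "embedding") is a placeholder, not a claim about how a real system would do it.

```python
import math, random

def dist(a, b):
    # Euclidean distance between two equal-length feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class NoveltyFilter:
    """Admit a sample only if it is far from everything seen so far."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.seen = []  # stored feature vectors

    def admit(self, sample):
        if self.seen and min(dist(sample, s) for s in self.seen) <= self.threshold:
            return False            # too close to known data: boring, skip it
        self.seen.append(sample)    # novel: keep it and train on it
        return True

random.seed(0)
f = NoveltyFilter(threshold=0.5)
x = [random.gauss(0, 1) for _ in range(8)]
print(f.admit(x))   # True: the first sample is always novel
print(f.admit(x))   # False: an exact repeat is filtered out
```

A real version would need learned embeddings and an approximate nearest-neighbour index to scale, but the filtering logic is the same shape.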
Building the noise filter is part of the learning. I don't see any reason to believe that it's "structural". Babies aren't known for their stellar information processing.
Citing the blind as evidence that there is no vision-based pre-training is a broken argument, in my opinion.
It could very well be that humans do multimodal pretraining (including vision), landing babies with a pre-trained brain that works even without vision.
A machine implementation of this is pre-training a self-driving car on vision + lidar then doing real-time inference on vision only.
It's weird to me how the article's very first paragraph mentions brute force, but frames it as a data problem, as if we hadn't seen a staggering rise in ops and memory bandwidth. Machines have been able to hold the textual history of humanity in short-term memory for a while; it's the ops they could perform on it that were limited. Not much point telling someone in 2005 to "add data". What were they going to do? Wait 20 years to finish a training run we do in a week now?
It's very clear to me that the progress we observe in machine intelligence is due to brute processing power. Of course evolution of learning algorithms is important! But the main evolution that drives progress is in the compute. Algorithms can be iterated on that much faster if your generations are that much shorter.
Why are all these AI companies falling over each other to buy the best compute per watt humans have ever produced? Because compute is king and our head was optimized by evolution to be very efficient at probabilistic computing. That's where machines are catching up.
The mark of intelligence is to not need much data at all.
> The mark of intelligence is to not need much data at all.
I think part of the answer might be information filtering. The eye can detect single photons, but by the time that information from 10^16 photons/s entering the eyeball gets to the meat CPU it's been filtered down to something relevant and manageable. And at no part in that pipeline is any component operating at more than like 100Hz.
So fine-tuning the filters to match the processor--and all the high-fidelity sensors--simultaneously sounds like a job for evolutionary search if ever there was one. But this is the wild-ass guess of someone who doesn't actually know much about biology or machine learning, so take it with a big chunk of salt.
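The scale of the filtering being described can be put in rough numbers. The 10^16 photons/s figure is from the comment above; the ~10 Mbit/s optic-nerve throughput is a commonly cited ballpark estimate, not a precise measurement.

```python
# Rough compression ratio of the retina-to-brain pipeline.
# 1e16 photons/s is the figure from the comment; ~10 Mbit/s per optic nerve
# is a commonly cited ballpark, used here only for order-of-magnitude.
photons_per_s = 1e16
optic_nerve_bits_per_s = 1e7

# Even crediting each photon with a single bit, the pipeline discards
# all but roughly one part in a billion before the signal reaches the brain.
reduction = photons_per_s / optic_nerve_bits_per_s
print(f"~{reduction:.0e}x reduction")   # → ~1e+09x reduction
```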
You don't need more data when the data you have already characterizes the problem well; more data is simply redundant and wastes resources. In this case, talking like people about the things people talk about is covered well by current data sets. Saying we can't get more data is really saying we have collected at least enough data. Probably more than we need.
Lots of room to improve models though:
Using convolution for vision learning didn't create or require more data than training fully connected matrices, and it considerably increased models' efficiency and effectiveness on the same amount of data. Or less.
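The parameter-count gap behind that point is easy to make concrete: a conv layer reuses one small kernel across the whole image, while a fully connected layer learns a separate weight for every input/output pair. The sizes below are just illustrative examples.

```python
# Parameter counts: one 3x3 conv layer vs. a dense layer producing an
# output of the same size. Image/channel sizes are arbitrary examples.
H = W = 224              # example image resolution
C_in, C_out, k = 3, 64, 3

conv_params = C_out * C_in * k * k + C_out    # shared 3x3 kernels + biases
fc_params = (H * W * C_in) * (H * W * C_out)  # dense map to same output size

print(conv_params)                                   # → 1792
print(f"{fc_params / conv_params:.1e}x more for the dense layer")
```

Same input, same output shape, roughly eight orders of magnitude fewer weights: the architecture encodes the prior (translation invariance) instead of having to learn it from data.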
Likewise, transformers have a limited context window. Better architectures with open-ended windows will be able to do much more, likely more efficiently and effectively, without any more data. Maybe with less.
Maybe in a few decades we will reach a wall of optimal models. At the rate models are improving now, that doesn't appear to be anytime soon.
Finally, once we start challenging models to perform tasks we can't, they will start getting data directly from reality. What works, what doesn't. Just as we have done. The original source of our knowledge wasn't an infinite loop of other people talking back to the beginning of time.
I believe this is part of the argument in the post - the "architecture" of the nervous system (and the organism it is an inseparable part of) is itself largely a product of evolution. It's already optimized to deal with the challenges the organism needs to survive/reproduce, and depending on the organism, with little or even no data.
I wonder if we'll reach the physical nanostructure wall of silicon long before that, and then all progress will have to be algorithmic efficiency gains. The era of Metal Muscle will end and we will return to the era of smart people pondering in coffee shops.
Even if transistors reach physical limits, there’s always different materials and architecture optimizations. We also know the human brain has far more intelligence per watt than any transistor architecture I know of. The real question is if those will be commercially worth researching.
> Current language models are trained on datasets fast approaching “all the text, ever”. What happens when it runs out?
Robots.
To reduce hallucinations, our AI models need more grounding in the real world. No matter how smart an AI is, it won't be able to magically come up with the answer to any possible question just by sitting and thinking about it. AIs will need to do experiments and science just as we do.
To maximize the amount of data AIs can train on, we need robots to enable AIs to do their own science in the physical world. Then there is no limit to the data they can gather.
I think it's reasonable to argue that data acquired via a sensorimotor loop in an embodied agent will go beyond what you can learn passively from a trove of internet data, but this argument goes beyond that - the "data" in evolution is "learned" (in a fashion) not just from a single agent, but from millions of agents, even those that didn't survive to replicate (the "selection", of course, being a key part of evolution).
A neat thing about the kind of artificial robots we build now is that the process can be massively sped up compared to the plodding trial and error of natural evolution.
Exactly. We have huge advantages over evolution in some regards. All of the experience from every robot can be combined into a single agent, so even if AI is not as sample efficient as human brains it could still far surpass us. And honestly the jury is still out on sample efficiency. We haven't yet attempted to train models on the same kind of data a human child gets, and once we do we may find that we are not as far away from the brain's sample efficiency as we thought.
Doesn't this then turn into a problem of sample quantity? You would need to shift into a quality mindset, because with a robot you can't perform a billion iterations; you're locked into a much more complex world with unavoidably real-time interactions. Failure is suddenly very costly.
With a million robots you can perform a billion iterations. We won't need a billion iterations on every task; we will start to see generalization and task transfer just as we did for LLMs once we have LLM-scale data.
You are right that failure is costly with today's robots. We need to reduce the cost of failure. That means cheaper and more robust robots. Robots that, like a toddler, can jump off a couch and fall over and still be OK.
Tying back to the article, this is the real evolutionary advantage that humans have over AIs. Not innate language skills or anything about the brain. It's our highly optimized, perceptive, robust, reliable, self-repairing, fail-safe, and efficient bodies, allowing us to experiment and learn in the real physical world.
AIs' advantage would be that their learning can be shared.
For example if Robot 0002 learns that trying to move a pan without using the handle is a bad idea, Robot 0001 would get that update (even if it came before)
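A minimal sketch of that fleet-learning idea, with an entirely hypothetical API: each robot reports lessons to a shared store, and every other robot can check it before repeating the experiment.

```python
# Hypothetical sketch of fleet-shared learning: one robot's lesson becomes
# instantly available to all. Class and method names are made up for
# illustration; a real system would sync learned policies, not strings.
class Fleet:
    def __init__(self):
        self.shared_knowledge = {}   # lesson -> robot that first learned it

    def report(self, robot_id, lesson):
        # First robot to learn something records it for everyone.
        self.shared_knowledge.setdefault(lesson, robot_id)

    def knows(self, lesson):
        return lesson in self.shared_knowledge

fleet = Fleet()
fleet.report("robot-0002", "only move a pan by its handle")
# robot-0001 benefits without ever repeating the experiment:
print(fleet.knows("only move a pan by its handle"))   # → True
```

Note that "even if it came before" falls out naturally: the store doesn't care which robot was built first, only who learned the lesson first.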
My roomba can't do the whole room without screwing up or getting stuck, it feels like we are eons away from a robot being able to do what you describe autonomously.
Imagine if NASA-JPL had an LLM connected to all their active spacecraft, and at the terminal you could just type, "Hey V'Ger, how are conditions on Phobos over the past Martian Year?"
Human language is great, but it fails utterly at some tasks, which is why we have all the jargon in specialized environments. I'd take any system with a reduced command interface that works well over one that takes generic commands and tries to infer what I mean (meaning it will get it wrong most of the time), especially for a vocal interface.
Right. I think the tl;dr of the article is: AI needs a different type of machine. And the "learnings" of millions of years of evolution is how to build it.
I do wonder if humans will hit upon a real AI solution soon. We developed flying machines in < 100 years. They don't work like birds but they do fly.
Odd turn of phrase. Thinking LLMs work like brains may be holding back an advance to full AGI but is that a harm or good? I'm not against all powerful models but "build and deploy this stuff as fast as possible with minimal consequence consideration" definitely seems like a harm to me. Perhaps the Sam Altmans of the world should keep believing LLMs are "brains".
I guess it would depend on how you view AGI. I personally do not believe AGI is possible under current or near-future technology, so it is not really a concern to me. Even the definition of "AGI" is a little murky - we can't even definitely nail down what "g" is in humans, how will we do that with a machine?
Anyway, that aside, yes, your general understanding of my comment is correct - if you do believe in AGI, this kind of framing is harmful. If you don't believe in AGI, like me, you will think it is harmful because we're inevitably headed into another AI winter once the bubble bursts. There are actual very useful things that can be done with ML technology, and I'd prefer we keep investing resources into that stuff without all this nonsensical hype that can bring it crashing down at any moment.
An additional concern of mine is that continuing to make comparisons this way makes the broader populace much more willing to trust/accept these machines implicitly, rather than understanding they are inherently unreliable. However, that ship has probably already sailed.
In the kindest way possible: we have no idea how the brain works, and it would be foolish to write off statistical relationships as a core mechanism the brain uses. Doubly foolish when considering that we are not even sure how LLMs work.
It's not DNA, it's embodiment in general. People learn an enormous amount in the process of existing and moving through space, and they hang all of their abstract knowledge on this framework.
Related: it's a belief of mine that bodily symmetry is essential for cognition; having duplicate reflected forms that can imitate, work against, and coordinate with each other, like two hands, gives us the ability to imagine ourselves against the environment we're surrounded by. Seeing, sensing and being in full control of two things that are almost exactly the same, but are different (the two halves of one's body) gives us our first basis for the concept of comparison itself, and even of boundaries and the distinguishing of one thing from another. I believe this is almost the only function of external symmetry; since internally, and mostly away from sensory nerves, we're wildly asymmetrical. Our symmetry is the ignition for our mental processes.
So I'm not in a DNA data wall camp, I'm in an embodiment data wall camp. And I believe that it will be solved by embodying things and letting them learn physical intuitions and associations from the world. Mixing those nonverbal physical metaphors with the language models will improve the language models. I don't even think it will turn out to be hard. Having eyes that you can move and focus, and ears that you can direct will probably get you a long way. With 2 caveats: 1) our DNA does give us hints on what to be attracted to; there's no reason for a model to look or listen in a particular direction, we have instincts and hungers, and 2) smell and touch are really really rich, especially smell, and they're really hard to implement.
Incidentally: the article says that we've been optimized by evolution for cognition, but what could have been optimized was child-rearing. Having an instinct to train might be more innate and extensive than any instinct to comprehend. Human babies are born larval, and can't survive on their own for years if not decades. Training is not an optional step. Maybe the algorithms are fine, and our training methods are still hare-brained? We're training them on language, and most of what is written is wrong or even silly. Being able to catch a ball is never wrong, and will never generate bad data.