I'm not sure folks who're putting out strong takes based on this have read this paper.
This paper uses a GPT-2-scale transformer, trained on sinusoidal data:
>We trained a decoder-only Transformer [7] model of GPT-2 scale implemented in the Jax based machine learning framework, Pax4 with 12 layers, 8 attention heads, and a 256-dimensional embedding space (9.5M parameters) as our base configuration [4].
> Building on previous work, we investigate this question in a controlled setting, where we study transformer models trained on sequences of (x,f(x)) pairs rather than natural language.
Nowhere near definitive or conclusive.
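For concreteness, the controlled setting the quoted passage describes is roughly the following: sample a function from some family, evaluate it at random inputs, and feed the interleaved (x, f(x)) pairs to a decoder-only transformer as a sequence. This sketch uses a sinusoid family with illustrative parameter ranges, not the paper's exact configuration:

```python
import numpy as np

def make_sequence(rng, n_points=32, freq_range=(0.5, 2.0)):
    """Sample one in-context sequence of (x, f(x)) pairs from a sinusoid
    with random frequency and phase. Function family and parameter
    ranges are illustrative, not the paper's exact setup."""
    freq = rng.uniform(*freq_range)
    phase = rng.uniform(0, 2 * np.pi)
    x = rng.uniform(-5, 5, size=n_points)
    y = np.sin(freq * x + phase)
    # Interleave as (x1, y1, x2, y2, ...), the order in which a
    # decoder-only transformer would consume the sequence.
    return np.stack([x, y], axis=1).reshape(-1)

rng = np.random.default_rng(0)
seq = make_sequence(rng)
print(seq.shape)  # (64,) -- 32 interleaved (x, y) pairs
```

The "generalization" question is then whether a model pretrained on sequences like these can predict f(x) for functions outside the pretraining family.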
Not sure why this is news outside of the Twitter-techno-pseudo-academic-influencer bubble.
It would be news if somebody showed transformers could generalize beyond the training data. Deep learning models generally cannot, so it's no surprise this holds for transformers.
It depends on what "generalize beyond the training data" means. If I invent a new programming language and teach it (in-context) to the model, and the model is able to use it to solve many tasks, is it generalizing beyond the training data?
I read an interesting paper recently that had a great take on this: If you add enough data, nothing is outside training data. Thus solving the generalization problem.
Wasn't the main point of that paper, but it made me go “Huh, yeah … I guess … technically correct?” It raises an interesting thought: if you just train your neural network on everything, then nothing falls outside its domain. Problem solved … now if only compute were cheap.
OpenAI showed it in 2017 with the sentiment neuron (https://openai.com/research/unsupervised-sentiment-neuron). Basically, the model learned to classify the sentiment of a text, which I would agree is a general principle, so the model learned a generalized representation based on the data.
Having said that, the real question is what percentage of the learned representations do generalize. For a perfect model, it would learn only representations that generalize and none that overfit. But, that's unreasonable to expect for a machine *and* even for a human.
Maybe we just don't know. We are staring at a black box and doing some statistical tests, but actually don't know whether the current AI architecture is capable enough to get to some kind of human intelligence equivalent.
Has it even been shown that the average human can generalize beyond their training data? Isn't this the central thrust of the controversy around IQ tests? For example, some argue that access to relevant training data is a greater determinant of performance on IQ tests than genetics[1].
> I'm not sure folks who're putting out strong takes based on this have read this paper.
They haven't read the other papers either. It's really striking to me to watch people retweet this and see it get written up in pseudo-media like Business Insider when other meta-learning papers on the distributional hypothesis of inducing meta-learning & generalization, which are at least as relevant, can't even make a peep on specialized research subreddits. For example, "Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression", Raventós et al 2023 https://arxiv.org/abs/2306.15063 (or https://arxiv.org/abs/2310.08391 ) both explain & obsolete OP, and were published months before! OP is a highly limited result which doesn't actually show anything that you wouldn't expect on ordinary Bayesian meta-reinforcement-learning grounds, but there's so much appetite for someone claiming that this time, for real, DL will 'hit the wall' that any random paper appears definitive to critics.
> Not sure why this is news outside of the Twitter-techno-pseudo-academic-influencer bubble.
The paper is making the rounds despite being a weak result because it confirms what people want, for non-technical reasons, to be true. You see this kind of thing all the time in other fields: for decades, the media has elevated p-hacked psychology studies on three undergrads into the canon of pop psychology, because these studies provide a fig leaf of objective backing for pre-determined conclusions.
There have been two criticisms of this paper floating around.
1. The test mechanism is to use prediction of sinusoidal series. While it's certainly possible to train transformers on mathematical functions, it's not clear why findings from a model trained on sinusoidal functions would generalize into the domain of written human language (which is ironic, given the paper's topic).
2. Even if it were true that these models don't generalize beyond their training, large LLMs' training corpus is basically all of written human knowledge. So then the goalpost has been moved to "well, they won't push the frontier of human knowledge forward," which seems to be a much diminished claim, since the vast majority of humans are also not pushing the frontier of human knowledge forward and instead use existing human knowledge to accomplish their daily goals.
You're using the idea of "all of human knowledge" differently in the two places it appears, and the gap in those definitions weakens the claim a bit.
LLMs are trained on a tiny subset of the written human knowledge which we've proven to probably not be garbage and which is nicely formatted in simple text formats and which was published without too many paywalls on the web and on and on and on. It's a lot, and it definitely includes enough facts that no one person knows all the things the LLM "knows", but the average child knows plenty of things which never made it into that sort of a corpus. Yes, it's probably true that the vast majority of humans are also not pushing the frontier of human knowledge forward, but the vast majority of humans are working with a slightly different (partially overlapping) set of information than what the LLMs see.
My uneducated opinion is that this paper is bollocks. Maybe they are looking at deeper mathematical results instead of everyday tasks.
But every single day I am using OpenAI's GPT-4 to handle novel tasks. I am working on a traditional SaaS vertical, except with a pure chatbot. The model works: it is able to understand which function to call, which parameters to extract, and when the inputs will not work. Sure, if you ask it to do some extraneous task, it fails.
Google/Deep Mind need to start showing up with some working results.
I use it frequently with my own custom UI framework. It's never seen my framework before, but it can output new, usable code with just a few examples. If that's not generalization, I don't know what is.
I've given it descriptions of non-existent "franken-languages", composed by telling it to imagine taking programming language A and adding various features I want to explore, and then had it correctly symbolically reason about a program written in this hypothetical language that doesn't exist anywhere. So yeah, the notion that it doesn't generalize to at least some degree is nonsense. But note the paper's tests were on a GPT-2-scale model, so it's not very surprising they had poor results.
That said, even GPT4 certainly has pretty significant limitations on what it manages to reason about. But without comparing their capabilities in other aspects, arguably so do most humans. We tend to force our way past those limitations by learning incrementally by doing over and over. Current models don't get that luxury without complicated fine-tuning steps, so if anything what should surprise us is how well they do with the limitation of only context to act as short-term memory.
The issue, though, is that the line between in-domain and out-of-domain is fuzzy, which means generalization lies on a continuum. ChatGPT has seen enough UI framework code that it can interpolate concepts. This is a form of generalization, but people are looking for a lot more. A better way to check generalization capability would be to train the model on just C++ and then see how much it can do in Python using only few-shot examples.
Another important thing to keep in mind is a paper (wish I could remember which one it was) that showed even larger-scale LLMs have trouble understanding that A=B is the same as B=A if they have not seen A or B before.
I upgraded because I wanted to see what it could do with a screenshot of a web page. I had it describe the page and create an HTML version of it. It wasn't horrible.
We humans don't even know when we are doing real extrapolation, and the vast majority of humans are interpolating. I bet many do nothing but interpolate their whole lives.
So - and I say this as someone who writes NLP papers too - who cares?
One thing is that they seem to be using relatively small models. This may be a really damning result, but I was under the impression that any generalization capabilities of LLMs appear in a non-linear fashion when you increase the parameter count to the tens of billions/trillions, as in GPT-4. It would be interesting if they could recreate the same experiment with a much larger model. Unfortunately, I don't think that's likely to happen, because of the resources required to train such models and the anti-open-source hysteria preventing larger models from being made publicly available, much less the data they were trained on. Imagine that: stifling research and fearmongering reduces the usefulness of the science that does manage to get done.
Current AI models are approximation functions with a huge number of parameters. These approximation functions are reasonably good at interpolation, meh at extrapolation, and have nothing to do with generalization.
You can always extrapolate. E.g. a linear approximation of x^2 through 2 points will extrapolate reasonably well around those 2 points but will be bad as x -> +/- infinity. Similarly, there are examples where GPT invented legal cases when asked to create a legal brief.
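The x^2 example above is easy to make concrete: fit a line through two points of the parabola and watch the error stay small near those points but blow up far away from them.

```python
# Fit a line through two points of f(x) = x**2 and compare
# interpolation near the points vs extrapolation far away.
x0, x1 = 1.0, 2.0
slope = (x1**2 - x0**2) / (x1 - x0)   # = 3.0
intercept = x0**2 - slope * x0        # = -2.0
approx = lambda x: slope * x + intercept

for x in [1.5, 3.0, 10.0]:
    print(x, abs(approx(x) - x**2))
# error is 0.25 between the points, but 72.0 at x = 10
```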
If you trained it on one function class, of course that's all it learned to do. That's all it ever saw!
If you want to learn arbitrary function classes to some degree, the solution is simple. Train it on many different function classes.
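A hypothetical pretraining mixture along those lines can be sketched as below: each training sequence is drawn from one of several function classes, so the model cannot get away with modeling a single family. The class names and parameter ranges here are made up for illustration.

```python
import numpy as np

def sample_task(rng):
    """Draw one in-context regression task from a mixture of
    function classes (illustrative choices, not from any paper)."""
    kind = rng.choice(["linear", "sine", "quadratic"])
    if kind == "linear":
        a, b = rng.normal(size=2)
        f = lambda x: a * x + b
    elif kind == "sine":
        w, p = rng.uniform(0.5, 2.0), rng.uniform(0, 2 * np.pi)
        f = lambda x: np.sin(w * x + p)
    else:
        a = rng.normal()
        f = lambda x: a * x**2
    x = rng.uniform(-2, 2, size=16)
    return kind, np.stack([x, f(x)], axis=1)  # (16, 2) array of (x, y) pairs

rng = np.random.default_rng(0)
kinds = {sample_task(rng)[0] for _ in range(100)}
print(kinds)  # with 100 draws, all three classes almost surely appear
```

This is essentially the task-diversity knob that the Raventós et al. paper mentioned upthread varies: with few classes the model behaves like a narrow in-distribution predictor, with many it starts to look like a general in-context learner.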
Untrained models are as blank slate as you could possibly imagine. They're not even comparable to new born humans with millions of years of evolution baked in. The data you feed them is their world. Their only world.
FWIW the paper's title focuses on quite a different conclusion than the submission title: "Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models"
[1] https://www.youtube.com/watch?v=FkKPsLxgpuY
Where. are. the. models. google.
https://arxiv.org/abs/2110.09485
"Supercharged Interpolation" is not a real thing.
They generalize fine when the data incentivizes that.