It's not clear from the article whether it's a dense model or an MoE. That matters when comparing parameter counts with GPT-4, which is reported to be an MoE.
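To make the comparison concrete, here's a rough sketch of why an MoE's headline parameter count overstates the compute used per token compared to a dense model. The numbers are purely illustrative, not GPT-4's actual configuration:

    # Hypothetical numbers for illustration only.
    def moe_param_counts(num_experts=16, top_k=2,
                         expert_params=50e9, shared_params=200e9):
        """Total vs. per-token-active parameters for a top-k routed MoE."""
        total = shared_params + num_experts * expert_params
        active = shared_params + top_k * expert_params
        return total, active

    total, active = moe_param_counts()
    print(f"total params:  {total/1e12:.2f}T")   # the headline number people quote
    print(f"active/token:  {active/1e12:.2f}T")  # what each forward pass actually uses

With these made-up numbers a "1T" MoE only activates ~0.3T parameters per token, which is why a dense 1T model isn't an apples-to-apples comparison.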
yes, I can't imagine the architecture of a supercomputer is the right one for LLM training. But maybe?
If not, spending years to design and build a system for weather and nuke simulations and ending up doing something that's totally not made for these systems is kind of a mind bender.
I can imagine the conversations that led to this:
"we need a government owned LLM"
"okay what do we need"
"lots of compute power"
"well we have this new supercomputer just coming online"
"not the right kind of compute"
"come on, it's top-500!"
I just asked GPT-4, and it denies being a MoE model.
User:
Are you an MoE model?
ChatGPT:
No, I am not based on a Mixture of Experts (MoE) model. My underlying architecture is based on the GPT (Generative Pre-trained Transformer) framework, specifically the GPT-4 version. This architecture is a large-scale transformer-based neural network, but it does not use the MoE approach.
In a GPT model like mine, the entire model works in a unified manner on each task, without dividing it among specialized sub-models or experts. The model is trained on a diverse range of data and tasks, enabling it to generate responses across a wide array of topics and questions. While this approach lacks the specialization that MoE models offer, it provides broad, general-purpose language understanding and generation capabilities.
Was thinking the same, so 1T might bring you into the league of GPT-4.
In the best case, actually, since it seems that Meta, Google, OpenAI and so on have the most talent.
Anyway, to bring it to the next level, how big should it be? Maybe 10T? 100T?
I don't think we have enough training data to train models that big in a way that efficiently uses all the params. We would need to generate training data, but then I don't know how effective that would be.
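For a rough sense of scale, here's a back-of-the-envelope sketch using the Chinchilla rule of thumb of roughly 20 training tokens per parameter. Treat the constant as an approximation, not a law:

    # Rough Chinchilla-style estimate: compute-optimal training wants ~20 tokens/param
    # (approximate rule of thumb from Hoffmann et al., 2022).
    TOKENS_PER_PARAM = 20

    for params in (1e12, 10e12, 100e12):
        tokens = params * TOKENS_PER_PARAM
        print(f"{params/1e12:>5.0f}T params -> ~{tokens/1e12:.0f}T tokens of training data")

    # Public web-scale text corpora are commonly estimated at tens of trillions of
    # tokens, which is roughly what a 1T-param model would already want.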
The solution won't be just "bigger". A model with a trillion parameters will be more expensive to train and to run, but is unlikely to be better. Think of the early days of flight: you had biplanes, then you had triplanes. You could have followed that farther, and added more wings - but it wouldn't have improved things.
Improving AI will involve architectural changes. No human requires the amount of training data we are already giving the models. Improvements will make more efficient use of that data, and (no idea how - innovation required) allow them to generalize and reason from that data.
> "No human requires the amount of training data we are already giving the models."
Well, humans are also trained differently. We interact with other humans in real time and get immediate feedback on their responses. We don't just learn by reading through reams of static information. We talk to people. We get into arguments. And so on.
Maybe the ideal way to train an AI is to have it interact with lots of humans, so it can try things on its own? The implication of that is maybe the best trained AI will be the center of some important web property, like, say, the Google search engine (I'm imagining something like Google search now, but more conversational -- it asks you if that was what you're looking for, and asks clarifying questions.) Whoever has the most users will have the best AI, which creates the best product, which attracts the most users... and so on.
I do agree that architectural improvements could be hugely significant too.
Yeah I'm not totally convinced humans don't have a tremendous amount of training data - interacting with the world for years with constant input from all our senses and parental corrections. I bet if you add up that data it's a lot.
But once we are partially trained, training more requires a lot less.
The first humans were kinda dumb. Yet by interacting with the world and each other, humans got smart. I wonder whether, if neural network instances interact with a simulated environment and each other, they could become smart enough for practical purposes.
Also, humans have multiple internal models for various aspects of reality, that seem to exist as a layer on top of the "raw" training. They have the ability to extrapolate based on those models (which transformers cannot do, as far as I understand). Perhaps GP is right -- perhaps what's missing in the architecture is an analog of these models.
While I agree with you that advances will come from being able to train with less data using as yet undevised techniques, I think you are jumping to a conclusion with this particular work.
First, yes, bigger appears to be better so far. We haven't yet found the plateau. No, bigger won't solve the well known problems, but it's absolutely clear that each time we build a bigger model it is qualitatively better.
Second, it's not clear that this work is trying to build AGI, which I assume you are referring to when you say "the solution." Of all the use cases for language models, building one off all the world's scientific data, like they are doing in this project, is probably the most exciting to me. If all it can do is dig up relevant work for a given topic from the entire body of scientific literature, it will be revolutionary for science.
If you train a bigger model on more text, we have a lot of confidence that the next-word prediction task will improve. So algorithmic progress is not necessary, it's a very nice bonus, but we can sort of get more powerful models for free, because we can just get a bigger computer, which we can say with some confidence we're going to get, and just train a bigger model for longer, and we are very confident we are going to get a better result.
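That claim lines up with the published scaling-law fits. As a sketch, here is the Chinchilla-style parametric loss; the constants are quoted from memory from Hoffmann et al. (2022), so treat them as approximate:

    # Chinchilla-style loss fit: L(N, D) = E + A/N^alpha + B/D^beta
    # N = parameters, D = training tokens. Constants are approximate.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params**alpha + B / n_tokens**beta

    print(loss(70e9, 1.4e12))   # roughly Chinchilla-scale
    print(loss(1e12, 20e12))    # hypothetical 1T-param model on 20T tokens

The point of the formula is just that both terms shrink monotonically as N and D grow: more parameters plus more text predictably lowers next-word prediction loss, with no algorithmic change required.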
I asked ChatGPT how many parameters the human brain has, and it said 86B neurons * 1000 connections, so 86T parameters.
It does seem like bigger models give better responses when given benchmarks. It might plateau or overfit the data at some point, but I'm not sure we've reached it yet.
Unlike biplanes, CPUs with more transistors are more powerful than those with fewer. And adding more CPU cores keeps increasing the number of threads you can run at the same time.
Why would LLMs be more like the biplane analogy, and less like the CPU analogy?
In general you can view "understanding" as a compression of information. You take in a bunch of information, detect an underlying pattern, and remember the pattern and necessary context, instead of the entire input.
The "problem" with larger neural networks is that they can store more information, so they can substitute understanding with memorization. Something similar happens with human students, who can stuff lots of raw information into short-term-memory, but to fit it into the much more precious long-term-memory you have to "understand" the topic, not just memorize it. In neural networks we call that memorization a failure to generalize. Just like a human, a network that just memorizes doesn't do well if you ask it about anything slightly different than the training data.
Of course it's a balancing act, because a network that's too small doesn't have space to store enough "understanding" and world model. A lot of the original premise of OpenAI was to figure out whether LLMs keep getting better as you make them bigger, and so far that has worked. But there is bound to be a ceiling on this, where making the model bigger starts making it dumber.
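One way to make the "understanding as compression" framing concrete: a model's cross-entropy is literally the code length it assigns to text, and memorization shows up as a gap between training and held-out compression. A minimal sketch with hypothetical loss values:

    import math

    def bits_per_token(cross_entropy_nats):
        """Convert average next-token cross-entropy (nats) to bits per token."""
        return cross_entropy_nats / math.log(2)

    # Hypothetical numbers: a model that memorizes compresses its training set
    # extremely well but does much worse on text it has never seen.
    train_loss, heldout_loss = 0.9, 3.1   # nats/token, illustrative only
    print(f"train:    {bits_per_token(train_loss):.2f} bits/token")
    print(f"held-out: {bits_per_token(heldout_loss):.2f} bits/token")
    print(f"generalization gap: {bits_per_token(heldout_loss - train_loss):.2f} bits/token")

A small gap means the model learned reusable patterns; a large gap means it mostly stored the training data.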
No one expected LLMs at this scale to be amazing, so although it may seem unlikely that these even larger models will do much, it also seemed unlikely that we'd end up in our current situation with LLMs.
What? The defining trend of the last 5 or so years is the victory of the scaling hypothesis. More scale = more intelligence. GPT-4 is way smarter than 3.5, this trend is ongoing.
You need more data to utilize more parameters, but folks at the forefront are confident that they are not going to run out of data any time soon.
If you mean “solution to AGI”, maybe. But perhaps in-context scratchpads and agent loops will be sufficient to get this architecture to human-level performance, with enough data/parameters. (Sutskever and Hinton have both said they find it plausible that the current architectures might get us there.)
All that said, it’s also possible that new architectures will be needed at some point, I’m just pushing back on your claim that we already hit the cliff.
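For what it's worth, the "scratchpad plus agent loop" idea is architecturally simple. A minimal sketch, where `call_model` and `run_tool` are hypothetical stand-ins rather than any specific vendor's API:

    # Minimal scratchpad/agent loop sketch. `call_model` and `run_tool` are
    # placeholders you would wire up to a real LLM endpoint and real tools.
    def call_model(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM call here")

    def run_tool(action: str) -> str:
        # e.g. dispatch to a calculator, search engine, or code runner
        return f"(observation for {action!r})"

    def agent_loop(task: str, max_steps: int = 5) -> str:
        scratchpad = f"Task: {task}\n"
        for _ in range(max_steps):
            step = call_model(scratchpad + "Next thought/action?")
            scratchpad += step + "\n"
            if step.startswith("FINAL:"):          # model signals it is done
                return step.removeprefix("FINAL:").strip()
            scratchpad += run_tool(step) + "\n"    # feed the observation back in
        return scratchpad                          # give up after max_steps

The base model never changes; all the "agency" lives in the loop that appends actions and observations back into the context.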
The main hero here is not model size but the dataset and the environment that created it. All this model talk misses the point: without the repository of human experience captured in language, these models would not have gotten so smart. And the improvement path is the same: assimilate more experience. This time the AI agent can create its own interactions and feedback signals, which would help it fix its flaws.
Learning in third person from the past data can only take AI so far. It needs to act and learn in the present, in first person, to be truly smart. No architectural change is needed, but the model needs to be placed in a real environment to get feedback signals.
>we found 72,508 ebook titles (including 83 from Stanford University Press) that were pirated and then widely used to train LLMs despite the protections of copyright law
It will be interesting to see what the government can do here. Can they use their powers to get their hands on the most data?
I'm still skeptical, because new techniques are going to give an order-of-magnitude efficiency boost to transformer models, so 'just waiting' seems like the best approach for now. I don't think they will be able to just skip to the finish line by having the most money.
I just realized that the NSA has probably been able to train GPT-4 equivalents on _all the data_ for a while now. We'll probably never learn about it but that's maybe scarier than just the Snowden collection story because LLMs are so good at retrieval.
Holy shit, you are right. They probably have 10-100x the data used to train gpt-4. Decades of every text message, phone call transcript, and so on. I can’t believe I haven’t seen anyone mention that yet.
People keep saying we don’t have enough data. I think there is a lot more data than we realize, even ignoring things like NSA.
That seems like a good reason for them to do this. I wonder how much non-public stuff they have, or whether it's just meant to incorporate a specific kind of information.
Haha, this is funny because everyone is talking about this as if it is designed to be like the LLMs we have access to.
The training data will be the databases of info scooped up and integrated into profiles of every person and their entire digital footprint, queryable and responsive to direct questioning.
Edit: Ah, Mixture of Experts. I hadn't heard this one yet. Thanks!
Argonne installs final components of Aurora supercomputer (22 June 2023): https://www.anl.gov/article/argonne-installs-final-component...
Aurora Supercomputer Blade Installation Complete (22 June 2023): https://www.intel.com/content/www/us/en/newsroom/news/aurora...
Intel® Data Center GPU Max Series, previously codename Ponte Vecchio (31 May 2023): https://www.intel.com/content/www/us/en/developer/articles/t...
> "But once we are partially trained, training more requires a lot less."
Kinda like AlphaGo Zero, but with "life".
> "You could have followed that farther, and added more wings - but it wouldn't have improved things."
But people did try! https://en.wikipedia.org/wiki/Besson_H-5 https://en.wikipedia.org/wiki/Multiplane_(aeronautics)#Quadr...
https://en.wikipedia.org/wiki/Horatio_Frederick_Phillips
> "If you train a bigger model on more text, we have a lot of confidence that the next-word prediction task will improve."
[1]: https://youtu.be/zjkBMFhNj_g?si=eCH04466rmgBkHDA
https://youtu.be/zjkBMFhNj_g?t=1543 (23:43)
One example: the Google Books project digitized quite a few texts, but I've never heard whether Google considers these fair game to train Bard on.
Most of the copyright discussions I've seen have been around images and code, but not much about books.
This seems to become more relevant as things scale up, as indicated by this article.
https://aicopyright.substack.com/p/the-books-used-to-train-l...