It's not clear from the article whether it's a dense model or an MoE. That matters when comparing parameter counts with GPT-4, which is reported to be an MoE.
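To make the comparison concrete, here's a rough sketch of why an MoE's headline parameter count overstates the compute used per token compared to a dense model. The numbers are purely illustrative, not GPT-4's actual configuration:

    # Hypothetical numbers for illustration only.
    def moe_param_counts(num_experts=16, top_k=2,
                         expert_params=50e9, shared_params=200e9):
        """Total vs. per-token-active parameters for a top-k routed MoE."""
        total = shared_params + num_experts * expert_params
        active = shared_params + top_k * expert_params
        return total, active

    total, active = moe_param_counts()
    print(f"total params:  {total/1e12:.2f}T")   # the headline number people quote
    print(f"active/token:  {active/1e12:.2f}T")  # what each forward pass actually uses

With these made-up numbers a "1T" MoE only activates ~0.3T parameters per token, which is why a dense 1T model isn't an apples-to-apples comparison.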
yes, I can't imagine the architecture of a supercomputer is the right one for LLM training. But maybe?
If not, spending years to design and build a system for weather and nuke simulations and ending up doing something that's totally not made for these systems is kind of a mind bender.
I can imagine the conversations that led to this:
"we need a government owned LLM"
"okay what do we need"
"lots of compute power"
"well we have this new supercomputer just coming online"
"not the right kind of compute"
"come on, it's top-500!"
I just asked GPT-4, and it denies being a MoE model.
User:
Are you an MoE model?
ChatGPT:
No, I am not based on a Mixture of Experts (MoE) model. My underlying architecture is based on the GPT (Generative Pre-trained Transformer) framework, specifically the GPT-4 version. This architecture is a large-scale transformer-based neural network, but it does not use the MoE approach.
In a GPT model like mine, the entire model works in a unified manner on each task, without dividing it among specialized sub-models or experts. The model is trained on a diverse range of data and tasks, enabling it to generate responses across a wide array of topics and questions. While this approach lacks the specialization that MoE models offer, it provides broad, general-purpose language understanding and generation capabilities.
Was thinking the same, so 1T might bring you into the league of GPT-4.
In the best case, actually, since it seems that Meta, Google, OpenAI and so on have the most talent.
Anyway, to bring it to the next level, how big should it be? Maybe 10T? 100T?
I don't think we have enough training data to train models that big in a way that efficiently uses all the params. We would need to generate training data, but then I don't know how effective that would be.
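For a rough sense of scale, here's a back-of-the-envelope sketch using the Chinchilla rule of thumb of roughly 20 training tokens per parameter. Treat the constant as an approximation, not a law:

    # Rough Chinchilla-style estimate: compute-optimal training wants ~20 tokens/param
    # (approximate rule of thumb from Hoffmann et al., 2022).
    TOKENS_PER_PARAM = 20

    for params in (1e12, 10e12, 100e12):
        tokens = params * TOKENS_PER_PARAM
        print(f"{params/1e12:>5.0f}T params -> ~{tokens/1e12:.0f}T tokens of training data")

    # Public web-scale text corpora are commonly estimated at tens of trillions of
    # tokens, which is roughly what a 1T-param model would already want.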
The solution won't be just "bigger". A model with a trillion parameters will be more expensive to train and to run, but is unlikely to be better. Think of the early days of flight: you had biplanes, then you had triplanes. You could have followed that farther, and added more wings - but it wouldn't have improved things.
Improving AI will involve architectural changes. No human requires the amount of training data we are already giving the models. Improvements will make more efficient use of that data, and (no idea how - innovation required) allow them to generalize and reason from that data.
> "No human requires the amount of training data we are already giving the models."
Well, humans are also trained differently. We interact with other humans in real time and get immediate feedback on their responses. We don't just learn by reading through reams of static information. We talk to people. We get into arguments. And so on.
Maybe the ideal way to train an AI is to have it interact with lots of humans, so it can try things on its own? The implication of that is maybe the best trained AI will be the center of some important web property, like, say, the Google search engine (I'm imagining something like Google search now, but more conversational -- it asks you if that was what you're looking for, and asks clarifying questions.) Whoever has the most users will have the best AI, which creates the best product, which attracts the most users... and so on.
I do agree that architectural improvements could be hugely significant too.
Yeah I'm not totally convinced humans don't have a tremendous amount of training data - interacting with the world for years with constant input from all our senses and parental corrections. I bet if you add up that data it's a lot.
But once we are partially trained, training more requires a lot less.
The first humans were kinda dumb. Yet by interacting with the world and each other, humans got smart. I wonder whether, if neural network instances interact with a simulated environment and each other, they could become smart enough for practical purposes.
Also, humans have multiple internal models for various aspects of reality, that seem to exist as a layer on top of the "raw" training. They have the ability to extrapolate based on those models (which transformers cannot do, as far as I understand). Perhaps GP is right -- perhaps what's missing in the architecture is an analog of these models.
While I agree with you that advances will come from being able to train with less data using as yet undevised techniques, I think you are jumping to a conclusion with this particular work.
First, yes, bigger appears to be better so far. We haven't yet found the plateau. No, bigger won't solve the well known problems, but it's absolutely clear that each time we build a bigger model it is qualitatively better.
Second, it's not clear that this work is trying to build AGI, which I assume you are referring to when you say "the solution." Of all the use cases for language models, building one off all the world's scientific data, like they are doing in this project, is probably the most exciting to me. If all it can do is dig up relevant work for a given topic from the entire body of scientific literature, it will be revolutionary for science.
If you train a bigger model on more text, we have a lot of confidence that the next-word prediction task will improve. So algorithmic progress is not necessary, it's a very nice bonus, but we can sort of get more powerful models for free, because we can just get a bigger computer, which we can say with some confidence we're going to get, and just train a bigger model for longer, and we are very confident we are going to get a better result.
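That claim lines up with the published scaling-law fits. As a sketch, here is the Chinchilla-style parametric loss; the constants are quoted from memory from Hoffmann et al. (2022), so treat them as approximate:

    # Chinchilla-style loss fit: L(N, D) = E + A/N^alpha + B/D^beta
    # N = parameters, D = training tokens. Constants are approximate.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params**alpha + B / n_tokens**beta

    print(loss(70e9, 1.4e12))   # roughly Chinchilla-scale
    print(loss(1e12, 20e12))    # hypothetical 1T-param model on 20T tokens

The point of the formula is just that both terms shrink monotonically as N and D grow: more parameters plus more text predictably lowers next-word prediction loss, with no algorithmic change required.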
I asked ChatGPT how many parameters the human brain has, and it said 86B neurons * 1000 connections, so 86T parameters.
It does seem like bigger models give better responses when given benchmarks. It might plateau or overfit the data at some point, but I'm not sure we've reached it yet.
Unlike biplanes, CPUs with more transistors are more powerful than those with fewer. And adding more CPU cores keeps increasing the number of threads you can run at the same time.
Why would LLMs be more like the biplane analogy, and less like the CPU analogy?
In general you can view "understanding" as a compression of information. You take in a bunch of information, detect an underlying pattern, and remember the pattern and necessary context, instead of the entire input.
The "problem" with larger neural networks is that they can store more information, so they can substitute understanding with memorization. Something similar happens with human students, who can stuff lots of raw information into short-term-memory, but to fit it into the much more precious long-term-memory you have to "understand" the topic, not just memorize it. In neural networks we call that memorization a failure to generalize. Just like a human, a network that just memorizes doesn't do well if you ask it about anything slightly different than the training data.
Of course it's a balancing act, because a network that's too small doesn't have space to store enough "understanding" and world model. A lot of the original premise of OpenAI was to figure out whether LLMs keep getting better as you make them bigger, and so far that has worked. But there is bound to be a ceiling on this, where making the model bigger starts making it dumber.
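One way to make the "understanding as compression" framing concrete: a model's cross-entropy is literally the code length it assigns to text, and memorization shows up as a gap between training and held-out compression. A minimal sketch with hypothetical loss values:

    import math

    def bits_per_token(cross_entropy_nats):
        """Convert average next-token cross-entropy (nats) to bits per token."""
        return cross_entropy_nats / math.log(2)

    # Hypothetical numbers: a model that memorizes compresses its training set
    # extremely well but does much worse on text it has never seen.
    train_loss, heldout_loss = 0.9, 3.1   # nats/token, illustrative only
    print(f"train:    {bits_per_token(train_loss):.2f} bits/token")
    print(f"held-out: {bits_per_token(heldout_loss):.2f} bits/token")
    print(f"generalization gap: {bits_per_token(heldout_loss - train_loss):.2f} bits/token")

A small gap means the model learned reusable patterns; a large gap means it mostly stored the training data.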
No one expected LLMs at this scale to be amazing, so although it may seem unlikely that these even larger models will do much, it also seemed unlikely that we'd end up in our current situation with LLMs.
What? The defining trend of the last 5 or so years is the victory of the scaling hypothesis. More scale = more intelligence. GPT-4 is way smarter than 3.5, this trend is ongoing.
You need more data to utilize more parameters, but folks at the forefront are confident that they are not going to run out of data any time soon.
If you mean “solution to AGI”, maybe. But perhaps in-context scratchpads and agent loops will be sufficient to get this architecture to human-level performance, with enough data/parameters. (Sutskever and Hinton have both said they find it plausible that the current architectures might get us there.)
All that said, it’s also possible that new architectures will be needed at some point, I’m just pushing back on your claim that we already hit the cliff.
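For what it's worth, the "scratchpad plus agent loop" idea is architecturally simple. A minimal sketch, where `call_model` and `run_tool` are hypothetical stand-ins rather than any specific vendor's API:

    # Minimal scratchpad/agent loop sketch. `call_model` and `run_tool` are
    # placeholders you would wire up to a real LLM endpoint and real tools.
    def call_model(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM call here")

    def run_tool(action: str) -> str:
        # e.g. dispatch to a calculator, search engine, or code runner
        return f"(observation for {action!r})"

    def agent_loop(task: str, max_steps: int = 5) -> str:
        scratchpad = f"Task: {task}\n"
        for _ in range(max_steps):
            step = call_model(scratchpad + "Next thought/action?")
            scratchpad += step + "\n"
            if step.startswith("FINAL:"):          # model signals it is done
                return step.removeprefix("FINAL:").strip()
            scratchpad += run_tool(step) + "\n"    # feed the observation back in
        return scratchpad                          # give up after max_steps

The base model never changes; all the "agency" lives in the loop that appends actions and observations back into the context.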
The main hero here is not model size but the dataset and the environment that created it. All this model talk misses the point: without the repository of human experience captured in language, these models would not have gotten so smart. And the improvement path is the same: assimilate more experience. This time the AI agent can create its own interactions and feedback signals, which would help it fix its flaws.
Learning in third person from the past data can only take AI so far. It needs to act and learn in the present, in first person, to be truly smart. No architectural change is needed, but the model needs to be placed in a real environment to get feedback signals.
>we found 72,508 ebook titles (including 83 from Stanford University Press) that were pirated and then widely used to train LLMs despite the protections of copyright law
It will be interesting to see what the government can do here. Can they use their powers to get their hands on the most data?
I'm still skeptical, because new techniques are going to give an order-of-magnitude efficiency boost to transformer models, so 'just waiting' seems like the best approach for now. I don't think they will be able to just skip to the finish line by having the most money.
I just realized that the NSA has probably been able to train GPT-4 equivalents on _all the data_ for a while now. We'll probably never learn about it but that's maybe scarier than just the Snowden collection story because LLMs are so good at retrieval.
Holy shit, you are right. They probably have 10-100x the data used to train gpt-4. Decades of every text message, phone call transcript, and so on. I can’t believe I haven’t seen anyone mention that yet.
People keep saying we don’t have enough data. I think there is a lot more data than we realize, even ignoring things like NSA.
That seems like a good reason for them to do this. I wonder how much non-public stuff they have, or whether it's just meant to incorporate a specific kind of information.
Haha, this is funny because everyone is talking about this as if it is designed to be like the LLMs we have access to.
The training data will be the databases of info scooped up and integrated into profiles of every person and their entire digital footprint, queryable and responsive to direct questioning.
Edit: Ah, Mixture of Experts. I hadn't heard this one yet. Thanks!
Argonne installs final components of Aurora supercomputer (22 June 2023): https://www.anl.gov/article/argonne-installs-final-component...
Aurora Supercomputer Blade Installation Complete (22 June 2023): https://www.intel.com/content/www/us/en/newsroom/news/aurora...
Intel® Data Center GPU Max Series, previously codename Ponte Vecchio (31 May 2023): https://www.intel.com/content/www/us/en/developer/articles/t...
> "But once we are partially trained, training more requires a lot less."
Kinda like AlphaGo Zero, but with "life".
> "You could have followed that farther, and added more wings - but it wouldn't have improved things."
But people did try! https://en.wikipedia.org/wiki/Besson_H-5 https://en.wikipedia.org/wiki/Multiplane_(aeronautics)#Quadr...
https://en.wikipedia.org/wiki/Horatio_Frederick_Phillips
> "If you train a bigger model on more text, we have a lot of confidence that the next-word prediction task will improve."
[1]: https://youtu.be/zjkBMFhNj_g?si=eCH04466rmgBkHDA
https://youtu.be/zjkBMFhNj_g?t=1543 (23:43)
One example: the Google Books project digitized quite a few texts, but I've never heard whether Google considers these fair game to train Bard on.
Most of the copyright discussions I've seen have been around images and code, but not much about books.
This seems to become more relevant as things scale up, as indicated by this article.
https://aicopyright.substack.com/p/the-books-used-to-train-l...