py4 · 2 years ago
It's not clear from the article whether it's a dense model or MoE. This matters when comparing parameter counts with GPT-4, which is reported to be MoE.
COAGULOPATH · 2 years ago
As far as I know, EVERY 1T+ LLM is an MoE: Switch-C 2048, Wu Dao 2.0, GLaM, PanGu-Σ, presumably GPT-4. Am I missing any?
freeqaz · 2 years ago
What is MoE?

Edit: Ah, Mixture of Experts. I hadn't heard this one yet. Thanks!

washadjeffmad · 2 years ago
A lot of this was new to me, but it looks like Intel hopes to use this to demonstrate the linear scaling capacity of their Aurora nodes.

Argonne installs final components of Aurora supercomputer (22 June 2023): https://www.anl.gov/article/argonne-installs-final-component...

Aurora Supercomputer Blade Installation Complete (22 June 2023): https://www.intel.com/content/www/us/en/newsroom/news/aurora...

Intel® Data Center GPU Max Series, previously codename Ponte Vecchio (31 May 2023): https://www.intel.com/content/www/us/en/developer/articles/t...

bane · 2 years ago
yes, I can't imagine the architecture of a supercomputer is the right one for LLM training. But maybe?

If not, spending years to design and build a system for weather and nuke simulations, and then ending up doing something these systems totally weren't made for, is kind of a mind bender.

I can imagine the conversations that led to this: "we need a government owned LLM" "okay what do we need" "lots of compute power" "well we have this new supercomputer just coming online" "not the right kind of compute" "come on, it's top-500!"

awongh · 2 years ago
How are the optimal architectures for weather simulations and LLM training different?
visarga · 2 years ago
Probably optimized for fp64 instead of 4-bit floats.
kirubakaran · 2 years ago
Could the weights be FOIA'd?
paxys · 2 years ago
They are pretty much guaranteed to slap the "national security" label on it, so no.
2OEH8eoCRo0 · 2 years ago
Hopefully not.
Jensson · 2 years ago
Why not? Would be cool with some new open source models.
pulse7 · 2 years ago
Isn't GPT-4 already over 1T parameters? And GPT-5 should be even "an order of magnitude" bigger than GPT-4...
checkyoursudo · 2 years ago
Was it ever confirmed whether GPT-4 is a Mixture of Experts or not?
cschmidt · 2 years ago
I just asked GPT-4, and it denies being a MoE model.

User:

Are you an MoE model?

ChatGPT:

No, I am not based on a Mixture of Experts (MoE) model. My underlying architecture is based on the GPT (Generative Pre-trained Transformer) framework, specifically the GPT-4 version. This architecture is a large-scale transformer-based neural network, but it does not use the MoE approach.

In a GPT model like mine, the entire model works in a unified manner on each task, without dividing it among specialized sub-models or experts. The model is trained on a diverse range of data and tasks, enabling it to generate responses across a wide array of topics and questions. While this approach lacks the specialization that MoE models offer, it provides broad, general-purpose language understanding and generation capabilities.
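For illustration of the distinction the model is describing: in a dense transformer every token passes through the same feed-forward block, while an MoE layer routes each token to a small subset of many expert blocks, so only a fraction of the parameters are active per token. A toy PyTorch sketch (my own illustrative example, not any particular model's implementation; real MoE models add load balancing, capacity limits, and expert parallelism):

    import torch
    import torch.nn as nn

    class ToyMoE(nn.Module):
        """Toy top-2 mixture-of-experts feed-forward layer (illustrative only)."""
        def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )
            self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
            self.top_k = top_k

        def forward(self, x):                        # x: (tokens, d_model)
            scores = self.router(x)                  # (tokens, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(x)
            # Each token is processed only by its top-k experts, so most
            # parameters stay idle for any given token -- unlike a dense FFN,
            # where every token uses every weight.
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e
                    if mask.any():
                        w = weights[:, k][mask].unsqueeze(-1)
                        out[mask] += w * expert(x[mask])
            return out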

gammalost · 2 years ago
Yes, the article even mentions that. Guess they went with that headline to attract readers.
MichaelRazum · 2 years ago
Was thinking the same: 1T might bring you into the league of GPT-4, and that's the best case, since it seems that Meta, Google, OpenAI and so on have the most talent.

Anyway, to bring it to the next level, how big should it be? Maybe 10T? 100T?

lossolo · 2 years ago
> Maybe 10T? 100T?

I don't think we have enough training data to train models that big in a way that efficiently uses all the params. We would need to generate training data, but then I don't know how effective that would be.
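
A rough back-of-envelope using the Chinchilla-style heuristic of roughly 20 training tokens per parameter (an assumption I'm adding for illustration, not a figure from the article):

    # Back-of-envelope: tokens needed for compute-optimal training at various sizes,
    # assuming the Chinchilla-style ~20 tokens per parameter rule of thumb.
    TOKENS_PER_PARAM = 20  # heuristic assumption, not a hard law

    for params in (1e12, 10e12, 100e12):  # 1T, 10T, 100T parameters
        tokens = params * TOKENS_PER_PARAM
        print(f"{params / 1e12:>5.0f}T params -> ~{tokens / 1e12:,.0f}T tokens")

    # 1T -> ~20T tokens, 10T -> ~200T tokens, 100T -> ~2,000T tokens; public
    # web-scale text corpora are usually quoted in the low tens of trillions
    # of tokens, which is roughly the scale already used for today's models.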

bradley13 · 2 years ago
The solution won't be just "bigger". A model with a trillion parameters will be more expensive to train and to run, but is unlikely to be better. Think of the early days of flight: you had biplanes; then you had triplanes. You could have followed that farther, and added more wings - but it wouldn't have improved things.

Improving AI will involve architectural changes. No human requires the amount of training data we are already giving the models. Improvements will make more efficient use of that data, and (no idea how - innovation required) allow them to generalize and reason from that data.

elihu · 2 years ago
> "No human requires the amount of training data we are already giving the models."

Well, humans are also trained differently. We interact with other humans in real time and get immediate feedback on their responses. We don't just learn by reading through reams of static information. We talk to people. We get into arguments. And so on.

Maybe the ideal way to train an AI is to have it interact with lots of humans, so it can try things on its own? The implication of that is maybe the best trained AI will be the center of some important web property, like, say, the Google search engine (I'm imagining something like Google search now, but more conversational -- it asks you if that was what you're looking for, and asks clarifying questions.) Whoever has the most users will have the best AI, which creates the best product, which attracts the most users... and so on.

I do agree that architectural improvements could be hugely significant too.

jonplackett · 2 years ago
Yeah I'm not totally convinced humans don't have a tremendous amount of training data - interacting with the world for years with constant input from all our senses and parental corrections. I bet if you add up that data it's a lot.

But once we are partially trained, training more requires a lot less.

jojobas · 2 years ago
The first humans were kinda dumb. Yet by interacting with the world and each other, humans got smart. I wonder whether, if neural network instances could interact with a simulated environment and each other, they could become smart enough for practical purposes.

Kinda like AlphaGo Zero but with "life".

circlefavshape · 2 years ago
It's not just humans we interact with - it's reality. For an LLM all "facts" are socially determined; for a human they are not.
hliyan · 2 years ago
Also, humans have multiple internal models for various aspects of reality, which seem to exist as a layer on top of the "raw" training. They have the ability to extrapolate based on those models (which transformers cannot do, as far as I understand). Perhaps GP is right -- perhaps what's missing in the architecture is an analog of these models.
tensor · 2 years ago
While I agree with you that advances will come from being able to train with less data using as yet undevised techniques, I think you are jumping to a conclusion with this particular work.

First, yes, bigger appears to be better so far. We haven't yet found the plateau. No, bigger won't solve the well known problems, but it's absolutely clear that each time we build a bigger model it is qualitatively better.

Second, it's not clear that this work is trying to build AGI, which I assume you are referring to when you say "the solution." Of all the use cases for language models, building one off all the world's scientific data like they are doing in this project is probably the most exciting to me. If all it can do is dig up relevant work for a given topic in the entire body of scientific literature, it will be revolutionary for science.

jefftk · 2 years ago
> you had biplanes; then you had triplanes. You could have followed that farther, and added more wings - but it wouldn't have improved things.

But people did try! https://en.wikipedia.org/wiki/Besson_H-5 https://en.wikipedia.org/wiki/Multiplane_(aeronautics)#Quadr...

isolli · 2 years ago
Why stop here? This engineer experimented with 20 wings and more:

https://en.wikipedia.org/wiki/Horatio_Frederick_Phillips

huijzer · 2 years ago
Karpathy in his recent video [1] agrees, but at this point scaling is still a very reliable way to get better accuracy.

[1]: https://youtu.be/zjkBMFhNj_g?si=eCH04466rmgBkHDA

scriptsmith · 2 years ago
Seems like he actually disagrees here:

> If you train a bigger model on more text, we have a lot of confidence that the next-word prediction task will improve. So algorithmic progress is not necessary, it's a very nice bonus, but we can sort of get more powerful models for free, because we can just get a bigger computer, which we can say with some confidence we're going to get, and just train a bigger model for longer, and we are very confident we are going to get a better result.

https://youtu.be/zjkBMFhNj_g?t=1543 (23:43)

py4 · 2 years ago
This. We have not exhausted all the techniques at our disposal yet. We do need to look for a new architecture at some point, but these are orthogonal.
wodenokoto · 2 years ago
OpenAI's big innovation was “bigger”. It is not clear when we should stop scaling.

TriangleEdge · 2 years ago
I asked ChatGPT how many parameters the human brain has, and it said 86B neurons * 1000 connections, so 86T parameters.

It does seem like bigger models give better responses when given benchmarks. It might plateau or overfit the data at some point, but I'm not sure we've reached it yet.

Aardwolf · 2 years ago
Unlike biplanes, CPUs with more transistors are more powerful than those with fewer. And adding more CPU cores keeps increasing the number of threads you can run at the same time.

Why would LLMs be more like the biplane analogy, and less like the CPU analogy?

wongarsu · 2 years ago
In general you can view "understanding" as a compression of information. You take in a bunch of information, detect an underlying pattern, and remember the pattern and necessary context, instead of the entire input.

The "problem" with larger neural networks is that they can store more information, so they can substitute understanding with memorization. Something similar happens with human students, who can stuff lots of raw information into short-term-memory, but to fit it into the much more precious long-term-memory you have to "understand" the topic, not just memorize it. In neural networks we call that memorization a failure to generalize. Just like a human, a network that just memorizes doesn't do well if you ask it about anything slightly different than the training data.

Of course it's a balancing act, because a network that's too small doesn't have space to store enough "understanding" and world model. A lot of the original premise of OpenAI was to figure out whether LLMs keep getting better as you make them bigger, and so far that has worked. But there is bound to be a ceiling on this, where making the model bigger starts making it dumber.
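
A minimal numerical illustration of that failure to generalize, using polynomial fitting as a stand-in for an over-sized network (my own toy example, not from the thread): the over-parameterized fit can nearly memorize the training points, but its held-out error gives the memorization away.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 40)
    y = np.sin(3 * x) + rng.normal(0, 0.1, 40)   # noisy underlying pattern
    train_x, val_x = x[:30], x[30:]
    train_y, val_y = y[:30], y[30:]

    for degree in (3, 25):                        # modest vs. over-sized "model"
        coeffs = np.polyfit(train_x, train_y, degree)
        train_err = np.mean((np.polyval(coeffs, train_x) - train_y) ** 2)
        val_err = np.mean((np.polyval(coeffs, val_x) - val_y) ** 2)
        print(f"degree {degree:>2}: train MSE {train_err:.3f}, held-out MSE {val_err:.3f}")

    # The degree-25 fit typically drives its training error close to zero while
    # doing much worse on the held-out points -- the analogue of a network that
    # memorizes instead of generalizing.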

mensetmanusman · 2 years ago
No one expected larger LLMs to be amazing, so although it's unlikely that these larger models will do anything, it was also unlikely that we would end up in our current situation with LLMs.
fiedzia · 2 years ago
The really early days of flying were like "let's add a few feathers and then add some more". Though architectural changes were made too.
theptip · 2 years ago
What? The defining trend of the last 5 or so years is the victory of the scaling hypothesis. More scale = more intelligence. GPT-4 is way smarter than 3.5, this trend is ongoing.

You need more data to utilize more parameters, but folks at the forefront are confident that they are not going to run out of data any time soon.

If you mean “solution to AGI”, maybe. But perhaps in-context scratchpads and agent loops will be sufficient to get this architecture to human-level performance, with enough data/parameters. (Sutskever and Hinton have both suggested that the current architectures might get us there.)

All that said, it’s also possible that new architectures will be needed at some point, I’m just pushing back on your claim that we already hit the cliff.

visarga · 2 years ago
The main hero here is not model size but the dataset and the environment that created it. All this model talk misses the point: without the repository of human experience captured in language, these models would not get so smart. And the improvement path is the same: assimilate more experience. This time the AI agent can create its own interactions and feedback signals, which would help it fix its flaws.

Learning in third person from the past data can only take AI so far. It needs to act and learn in the present, in first person, to be truly smart. No architectural change is needed, but the model needs to be placed in a real environment to get feedback signals.

WhitneyLand · 2 years ago
Is anything known about the extent, if any, to which non-public-domain books are used for LLMs?

One example: the Google Books project digitized quite a few texts, but I've never heard whether Google considers these fair game to train Bard on.

Most of the copyright discussions I’ve seen have been around images and code but not much about books.

Seems to become more relevant as things scale up as indicated by this article.

lporto · 2 years ago
>we found 72,508 ebook titles (including 83 from Stanford University Press) that were pirated and then widely used to train LLMs despite the protections of copyright law

https://aicopyright.substack.com/p/the-books-used-to-train-l...

Footnote7341 · 2 years ago
It will be interesting to see what the government can do here. Can they use their powers to get their hands on the most data?

I'm still skeptical because new techniques are going to give an order-of-magnitude efficiency boost to transformer models, so 'just waiting' seems like the best approach for now. I don't think they will be able to just skip to the finish line by having the most money.

raccoonDivider · 2 years ago
I just realized that the NSA has probably been able to train GPT-4 equivalents on _all the data_ for a while now. We'll probably never learn about it but that's maybe scarier than just the Snowden collection story because LLMs are so good at retrieval.
dwaltrip · 2 years ago
Holy shit, you are right. They probably have 10-100x the data used to train gpt-4. Decades of every text message, phone call transcript, and so on. I can’t believe I haven’t seen anyone mention that yet.

People keep saying we don’t have enough data. I think there is a lot more data than we realize, even ignoring things like NSA.

tyingq · 2 years ago
If not "the most" data, they may have the most access to data that's exclusively available to them.
phkahler · 2 years ago
That seems like a good reason for them to do this. I wonder how much non-public stuff they have, or whether it's just meant to incorporate a specific kind of information.
upsidesinclude · 2 years ago
Haha, this is funny because everyone is talking about this as if it is designed to be like the LLMs we have access to.

The training data will be the databases of info scooped up and integrated into profiles of every person and their entire digital footprint, queryable and responsive to direct questioning.