mkaic · 3 years ago
Ugh. I just spent the past few days reading this exact paper and preparing a detailed presentation on it for my job as an AI researcher, and headlines like this make me roll my eyes very hard.

MEGABYTE is indeed a cool new architecture, but it is still very much just a proof of concept at the moment. The paper shows that the model can compete with (but not decimate) vanilla Transformers on the scale of ~1B parameters in long-sequence prediction tasks. They did not "unleash" anything, and scalability to very large parameter counts and datasets still has not been tested.

I'm certainly very excited to see where this architecture goes as the community gets ahold of it and starts developing it, but to call it "revolutionary" this early on is disingenuous. I personally have a few experiments I want to run with it, but I put the probability of it being a true GPT-killer at <20%. I would love to be wrong, though!

xp84 · 3 years ago
Every 'news' site: "But how are we gonna get clicks if we don't vastly exaggerate and sensationalize everything?"

Thanks for the actual sober analysis. As not-an-AI-researcher, I could have been easily bamboozled by an article like this.

nwoli · 3 years ago
I wish you had a link to your blog or twitter in your bio, you have the kind of nuanced tone I’d like to hear more from
mkaic · 3 years ago
Oh, thanks! Though to be honest, my blog does have a good bit of wild speculation on it, as I'm a bit of a hypocrite on that front :)

I've added links to both in my bio.

riwsky · 3 years ago
they should consider doing this for a living
jdthedisciple · 3 years ago
When I started reading this and immediately came across the word "groundbreaking" I paused for a sec and thought to myself:

Let me just pretend this word isn't there, it's probably an exaggeration.

It takes a special kind of adaptation to filter out all the 10%-40% bluff usually found in these kinds of news articles.

kjkjadksj · 3 years ago
It's always better to just go to the underlying, usually far drier scientific paper. Even if it's not in your field, try to read the intro and the conclusion; the authors will describe why the work is generally relevant with much less flowery language than the press release. Even a word like "novel" is bold to use too many times.
derefr · 3 years ago
Now I kind of want to see a browser extension that deletes all the intensifier adjectives from runs of text, and is able to be enabled/disabled per domain.
PoignardAzur · 3 years ago
> The paper shows that the model can compete with (but not decimate) vanilla Transformers on the scale of ~1B parameters in long-sequence prediction tasks. They did not "unleash" anything, and scalability to very large parameter counts and datasets still has not been tested.

One thing to keep in mind in particular is that the vast majority of transformer alternatives/improvements/optimizations that initially showed promise ultimately ended up scaling less well than the baseline transformer architecture. The transformer has been weirdly hard to improve/dethrone.

mkaic · 3 years ago
Yeah, it's had remarkable staying power. I think it says a lot that I can see "Vaswani, et al." cited in a paper and know exactly what the author is referring to lol. I don't usually memorize researcher names but the original Attention Is All You Need paper is just freaking ubiquitous.
lordofgibbons · 3 years ago
I'd love to watch a recording of this presentation or write up, if possible!
mkaic · 3 years ago
Hmm, I may make a blog post using some of the graphics I made for the presentation. Can't make any guarantees though.
jacobn · 3 years ago
My main argument against the AI doomsayers has so far been that the current scaling laws simply make runaway singularity style scenarios algorithmically impossible (if for each step of improvement you need 10x parameters and 100x training, you quickly run into a brick wall).
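A toy back-of-envelope loop (my own illustration, not from any paper) shows how quickly that brick wall arrives under those assumed costs:

```python
# Hypothetical assumption from the comment above: each step of capability
# improvement costs 10x the parameters and 100x the training compute.
base_params = 1e9    # assumed starting model size
base_compute = 1e21  # assumed starting training FLOPs

for step in range(5):
    params = base_params * 10 ** step
    compute = base_compute * 100 ** step
    print(f"step {step}: {params:.0e} params, {compute:.0e} FLOPs")
```

Four steps in, you're at 10^29 FLOPs, which is why the argument says runaway self-improvement stalls out under these scaling laws.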

This is part of why I’m not worried about the current crop of generative AI. I am however both curious and concerned about what the tsunami of talent and $$$ chasing the current trend will achieve.

If this n^(4/3) alt transformer compute scaling is real (and there’s been many a pretender, so it’s too early to tell), then that could fundamentally change the overall AI scaling law, substantially lowering the brick wall.

And that could be a game changer.
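For what it's worth, here's a rough sketch of where the n^(4/3) figure comes from, as I read the patch idea (a cost model of my own, not code from the paper): splitting T bytes into patches of size P gives a global model over T/P patches plus a cheap local model per patch, and choosing P near T^(1/3) balances the two terms.

```python
def vanilla_attention_cost(T):
    # full self-attention over T tokens scales quadratically
    return T ** 2

def megabyte_attention_cost(T, P):
    global_cost = (T / P) ** 2    # global model attends over T/P patches
    local_cost = (T / P) * P ** 2  # one local model per patch, over P bytes
    return global_cost + local_cost

T = 1_000_000
P = round(T ** (1 / 3))  # P ~ T^(1/3) balances global and local terms
print(vanilla_attention_cost(T) / megabyte_attention_cost(T, P))  # prints 5000.0
```

The numbers are illustrative only; the real models have constants and feed-forward costs this sketch ignores.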

p-e-w · 3 years ago
I don't buy the idea (with either architecture) that "10x"-type scaling is required for another breakthrough.

Think of a human with below average intelligence. Then think of a human genius. Now consider how incredibly similar their brains are, despite the massive performance gap. It's not like one has 10x the number of neurons/synapses/connections etc. of the other. They're both healthy human brains, and you need powerful technology to even distinguish them structurally.

Considering this, it seems perfectly possible that a model like GPT-4 is just a hair's breadth away from vastly superhuman performance. It certainly beats the average human at many tasks already. The gap between a moron and a genius is a lot larger than that between GPT-4 and a superhuman.

wudangmonk · 3 years ago
You mean the average brain with about 100 billion neurons and about 1,000 connections each, bringing it to around 100 trillion connections. With an estimated 1,000 "AI" neurons required per biological neuron.

I don't think you are giving these "below average" intelligence individuals enough credit. What we consider a genius is the equivalent of a dog show obstacle course. We measure intelligence/genius as whatever is hard for humans and completely ignore what is easy, because we fail to see the complexity behind the easy stuff.

nemothekid · 3 years ago
>Think of a human with below average intelligence. Then think of a human genius.

LLMs are not AGI. A human with below average intelligence is still a league above a chimpanzee. A chimpanzee will never be able to read, not because "it's too dumb", but because a chimp's brain lacks the actual hardware for reading. The LLM is the chimpanzee. The gap between an LLM and a "human with below average intelligence" is far more than 10x.

DrScientist · 3 years ago
> Considering this, it seems perfectly possible that a model like GPT-4 is just a hair's breadth away from vastly superhuman performance.

Except that structurally the brain clearly has vastly more capacity than the GPT-4 model.

So sure, one brain doesn't look that much different from another; the difference is in the details of the learning and wiring.

But the brain looks vastly different from a GPT-4 model in terms of capacity, with trillions of connections, and with each connection and internal state being more subtle as well.

> vastly superhuman performance

In terms of specific tasks computers ( whether you write the program explicitly or it's learnt by tweaking params in a network ) have been there for decades.

So the question is really about which tasks you can apply computers to successfully. Neural nets are allowing programs to be written that weren't possible by hand.

I find it amusing that people worry about ChatGPT et al. putting programmers out of a job, when in a sense it already has: ChatGPT is a program that was built by another program.

woah · 3 years ago
The whole idea of the singularity is that AI starts improving itself. Until that happens, normal human progress in the LLM field is just normal human progress, and will probably follow a similar path of human progress where there's a breakthrough followed by lots of low hanging fruit and hype, then a plateauing and refinement/productization.

LLMs can help humans develop new LLMs faster, but mostly in implementation (CoPilot and ChatGPT), and that's not really the important part. I have yet to see an LLM come up with original ideas.

Given that training data seems to be a big bottleneck, and LLMs are really good at generating text, I think that maybe we can start to talk about the possibility of "singularity" once LLMs are able to generate their own training data that increases their abilities. After all, humans are able to do this. That is the history of human knowledge.

visarga · 3 years ago
> I don't buy the idea (with either architecture) that "10x"-type scaling is required for another breakthrough.

Scaling can happen in two dimensions: model size and dataset size. What counts is the product of n_examples x n_parameters. That's why we have the super-Chinchilla runs, where n_examples >> 20*n_parameters. Scale the data, keep the model lean. Not to mention dataset quality: if you've got clean, diverse data, you need less of it.
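As a quick illustration of that ratio (the ~20 tokens-per-parameter figure is the Chinchilla heuristic; the model size and the 100x multiplier are just examples I picked):

```python
# Chinchilla-style sizing heuristic: compute-optimal training uses roughly
# 20 tokens per parameter; "super-Chinchilla" means going well beyond that.
def training_tokens(n_params: float, tokens_per_param: float = 20) -> float:
    return tokens_per_param * n_params

n_params = 7e9  # hypothetical 7B-parameter model
print(f"{training_tokens(n_params):.1e} tokens")       # compute-optimal budget
print(f"{training_tokens(n_params, 100):.1e} tokens")  # over-trained "lean" model
```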

Another important trend is using LLMs to generate training sets. Example: ConstitutionalAI (pure RLAIF), TinyStories (scaling down LLMs), Alpaca (borrowed RLHF), AlpacaFarm - a recent paper promising fine-tuned models for $200 cost in 24h.

beefield · 3 years ago
> seems perfectly possible that a model like GPT-4 is just a hair's breadth away from vastly superhuman performance

Sorry, can't help commenting on the weirdness of high-dimensional spaces. If you take a cube with sides a hair's breadth (100 µm) long in a space of hundreds of billions of dimensions like GPT-4's, distances between random points in that cube are on the order of tens of meters...
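That estimate checks out (a sketch of my own; the dimension count is illustrative): for two points drawn uniformly in a cube of side s in d dimensions, each coordinate contributes E[(x - y)^2] = s^2/6, so the typical distance is about s·sqrt(d/6).

```python
import math

side = 100e-6  # a hair's breadth: 100 micrometers, in meters
d = 200e9      # "hundreds of billions" of dimensions (illustrative)

# Each coordinate of x - y has mean squared value side^2 / 6 for uniform
# x, y, so the expected distance grows like side * sqrt(d / 6).
expected_distance = side * math.sqrt(d / 6)
print(f"{expected_distance:.1f} m")  # prints 18.3 m
```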

somewhereoutth · 3 years ago
However, that is to equate an LLM with a human brain. Both can be conceptualised with the idea of the 'neuron' (though with wildly different actual implementations of that term), but that is their only point of comparison. Thus your conjecture is invalid.
raincole · 3 years ago
> Now consider how incredibly similar their brains are, despite the massive performance gap.

A disk filled with random bits is "similar" to a disk that stores the whole Wikipedia's text content. So writing the whole Wikipedia is an effort of a hair's breadth...?

anon291 · 3 years ago
> It certainly beats the average human at many tasks already

This is not the revolution you think it is.

```python
>>> 338383*887282
... ANSWER ...
```

Yet skynet never came.

BulgarianIdiot · 3 years ago
Your premise is wrong. This has nothing to do with 10x-ing parameters. One could argue the current parameter sizes are good enough as we observe "large breadth, shallow depth" behavior from LLM and to some extent, diffusion models.

This suggests the problem is the depth of inference, which is single pass "hot takes" for all language models right now, due to cost of inference and our limited understanding of what makes a model's response high quality.

Yes, you don't need more parameters to increase the depth. You need to iterate, instead. Loop. Imagine programming if looping was not allowed, nor recursion, or not even defining functions and calling them. Everything you write runs at most once during program execution and that's it. This is what an AI model is right now during inference. One big flat, single-pass, directed acyclic graph. And soon it won't be.

Research into Chain of Thought and Tree of Thought reveals this dimension. This means you can take existing models and make them perform much more complex tasks with much better precision, through various ways of letting them iterate. Think of how you'd perform if you always had exactly 5 seconds to answer a question. Now imagine if you have 5 minutes. 5 hours. 5 days. Lo and behold, it turns out an AI isn't different in that aspect.
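A hypothetical sketch of that iterate-instead-of-one-pass idea (`call_model` here is a placeholder stand-in for any LLM call, not a real API):

```python
def call_model(prompt: str) -> str:
    # Stand-in for a single-pass model call: one flat "hot take" per prompt.
    return f"draft answer to: {prompt}"

def iterative_answer(question: str, rounds: int = 3) -> str:
    # Wrap the single-pass call in a critique-and-revise loop, the basic
    # shape of Chain/Tree-of-Thought style iteration.
    answer = call_model(question)
    for _ in range(rounds):
        critique = call_model(f"Critique this answer: {answer}")
        answer = call_model(f"Improve the answer given: {critique}")
    return answer

print(iterative_answer("What limits single-pass inference?"))
```

Each extra round trades inference cost for a chance to revise, which is the "5 seconds vs. 5 minutes" point in loop form.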

We also need more iterations of training (on the same amount of data), we need larger context windows, and we need new architectures, like Meta's MEGABYTE, for example.

Parameter count and data size could hypothetically have already hit a hard wall (they haven't) and AI will keep exponentially improving regardless. There's too much low hanging fruit and more grows by the nanosecond.

Henk0 · 3 years ago
This. So much this.

I'm completely dumbfounded by obviously highly intelligent people consistently not getting this, and dismissing current-generation AI systems as not being intelligent because they can't reliably solve massively complex problems in one go. As if anyone would expect a human programmer or researcher to just intuitively come up with a complex program, or the correct answer to a hard problem, every time, instantly.

Human thinking and problem solving involves a lot of trial and error, iterative thinking, and sharing and discussing the problem with other humans. Processes that AI researchers are just now beginning to explore, with results like increasing reasoning ability by 900% in a recent paper. Every thinking human runs a near constant loop of thought, with no conscious control of which thought will appear next (we're very good at fooling ourselves that we have control though)

We do have super-intelligences already, but they're severely handicapped by lacking a bunch of these - apparently fairly straightforward to implement - abilities, plus a few senses and the ability to directly effect change in the physical world (which really isn't needed if they can get access to human agents who will do their bidding, wittingly or unwittingly), and to self-improve. With regards to self-improvement, the increasing coding skills combined with iterative 'thought' loops should get there in very little time considering the current rate of progress

There's also the idea that a single AI model should be able to do everything our human brains do, when our brains actually contain a number of specialised subunits that handle different aspects of our behavioural repertoire. It's reasonable to allow for the same thing with an AI system, where specialised sub-networks handle input, output and other subtasks. AI systems also have the advantage of being able to add any arbitrary number of subunits to increase their capacity to solve various problems.

We seem to suffer from a species-wide narcissism with regards to our own intelligence and capabilities, and there's this huge focus on the number of connections in the human brain – most of which deal with things that are by no means necessary to act on the world unless one has a meat body and the need to navigate social situations, make friends and mate. Fact is, we have terrible short-term memory (worse than chimpanzees), slow processing time, lots of cognitive heuristics, many of which cause more harm than good in the modern world. We are emotional and easily fooled. Even the most intelligent people historically have believed in what we now consider fairy tales. We are slow to take in information, bad at storing it, and generally bad at transmitting it. A few of us can generate great ideas – building on accumulated knowledge from our forebears and peers – but most of us are just not that great at coming up with anything original or useful

I've been actively looking for good arguments against AGI being much closer than we should be comfortable with, and reasons why we should not fear systems that surpass us in intelligence. All I've come across so far is some combination of the above, often expressed with a dismissive attitude, disparaging current LLMs as parrots (that can apparently reason on the level of university-educated humans, but much more quickly), and pejorative terms like fearmongers and doomers to describe those of us who really don't think it's a good idea to pursue more intelligent systems. My guess is these people will act surprised when the arms race inevitably leads to some very bad unintended consequences. I don't see a way to stop it though, so I'm just strapped in for the ride along with the rest of humankind.

Again, if you have good arguments against any of the points above, please do share them with me

godelski · 3 years ago
> if for each step of improvement you need 10x parameters and 100x training, you quickly run into a brick wall

Btw, PaLM 2 has far fewer parameters than PaLM 1.

> The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute. Our evaluation results show that PaLM 2 models significantly outperform PaLM on a variety of tasks, including natural language generation, translation, and reasoning. These results suggest that model scaling is not the only way to improve performance.

From the leak we know that the large version is 340B parameters, compared to the original 540B parameters. From Table 2 in the document we see that the small version (unknown size) is on par with version 1.

ML typically follows a cycle: improve, distill, repeat. It is unfortunate that the big labs lead these efforts, because many smaller labs try to work on the distill part in parallel (out of necessity) but their work gets rejected (due to lack of SOTA) or ignored. Like all research, we need to be careful and nuanced in our evaluations. There's a lot of hype, and many are trying to take advantage of the confusion and sell snake oil. I think AI/ML is and will continue to change our world, but we have to be careful not to let the salesmen dictate the conversations.

PaLM https://arxiv.org/abs/2204.02311

PaLM 2 https://arxiv.org/abs/2305.10403

primax · 3 years ago
Honestly, I welcome the talent working on AI now. Before, they were working on how to make me spend 5 more seconds on Facebook, or how to get me to click on a Google ad. AI has potentially huge positive productivity potential.
lostmsu · 3 years ago
For all you know they might now be working on how to make you spend the rest of your life in a coal mine for AI overlords.
digging · 3 years ago
Why do you think the best AI isn't going to be used to work on how to make you spend 5 more seconds on Facebook?
vlovich123 · 3 years ago
I don’t think I follow this argument. AI has been dropping in costs and complexity the more engineering time is spent on it. It seems like the bottleneck is humans creating new AI techniques right? If an AI is capable of developing new AI techniques unsupervised, isn’t that by definition the singularity? Heck doesn’t even need to be unsupervised. If it can even do most of the heavy lifting for a human I feel like that would put us into runaway territory.

Granted, we're a long way away from that, and likely we'd need an AI that could come up with its own hypotheses for new AI models to make this truly minimal-cost, and that feels even farther away. But I don't think I follow the claim that each improvement step requires so much massive extra scaling, as it doesn't seem to match what we've seen over the past few years (granted, I could be misinformed; I'm a 10,000-foot-view spectator and maybe my own mental model here is flawed).

doctor_eval · 3 years ago
I consider the singularity to be the point at which the certainty of our predictions about our future becomes close to zero. By this definition I reckon we are already in the singularity.

It’s not necessarily bad. The problem with the singularity is that that we can’t tell if it’s bad or not.

woah · 3 years ago
> If an AI is capable of developing new AI techniques unsupervised, isn’t that by definition the singularity?

AI is not capable of developing new AI techniques unsupervised.

Honestly the lowest hanging fruit should be the ability for LLMs to generate their own textual training data that leads to an improvement in model abilities, since that is what they are supposed to be good at. Until then, it's still garbage in garbage out.

iinnPP · 3 years ago
This assumes only the LLM matters. I don't think it does. I think GPT3.5 is plenty, if not overkill.

I can't imagine I am alone in my thought. I've even seen some other experiments in the same line of reasoning.

Meta seems to be thinking the same thing, or at least closer to it than the majority.

ummonk · 3 years ago
Do you believe that Moore’s Law (or rather Huang’s Law) will stop being true before AI exceeds human capabilities by orders of magnitude?

We don’t need a runaway singularity for AI to just render human intelligence obsolete.

BorisTheBrave · 3 years ago
I think you've misunderstood. Megabyte scales better with context window length. I don't know if they're saying the training data / compute are any more efficient.
lazzlazzlazz · 3 years ago
10x scaling won't take long, even putting aside improvements in our understanding of training processes.


crakenzak · 3 years ago
Paper: https://arxiv.org/abs/2305.07185

Wow, it seems like Meta AI is way ahead of the curve compared to even Google and OpenAI recently, especially with their open-sourcing pushes.

Great for the research community as a whole!

1024core · 3 years ago
Meta has no horse in the race (i.e. they don't have a search engine). So, they don't mind throwing random things out. Withholding it won't really make much of a difference for them, as they don't have a way to productionize the tech.
dumpsterdiver · 3 years ago
While I disagree that having a search engine is the only way to have a "horse in the race", I must agree that at this point Meta does not appear to have a horse in the race.

Other companies are providing services that are so useful that it makes us think twice about how secure our jobs are. Then there is Meta, who seems to think that the world at large will forget about the terrible motion sickness that their VR products have wrought upon us. I for one will not forget. I'm actually traumatized, and even thinking about putting on VR goggles at this point makes me feel queasy.

wolfd · 3 years ago
I would be _shocked_ if Zuck wasn't thinking 24/7 about how to capitalize on LLMs. I'm sure there are a thousand ideas (maybe even a few good ones?) being thrown around Meta at how to use LLMs to beat Google/Microsoft+OpenAI at the "search buddy" game.
zoogeny · 3 years ago
Meta still has the Portal line of products - which were competitors to the Alexa, Siri and similar Google audio assistant product lines. I just searched and it looks like Meta are currently partnering with Amazon to license Alexa on these devices, but I could imagine they might want to replace it with their own LLM eventually.

I am a bit surprised that no one is talking about how these new LLM models will disrupt Alexa, Siri, etc. since that seems to be the most applicable market I can imagine.

abwizz · 3 years ago
Would not be surprised if they integrated a "suggested conversations" feature based on your chat history and behaviour, where the user just picks sentences from a list and both parties can enjoy an effortless "organic" conversation.
mturmon · 3 years ago
Commoditizing their complement?
jerpint · 3 years ago
Deepmind was behind this kind of thinking years ago with Perceiver models, but meta is crushing it with high quality publications lately
joshxyz · 3 years ago
great positioning for zuck, impressive


zxexz · 3 years ago
Great paper. But wow, I really wish everyone would use more easily searchable names for their projects. In 6 months, there’s a high probability I’ll end up googling/ddging “megabyte model” trying to find this paper again.
machdiamonds · 3 years ago
I've been using myReach as a bookmark manager lately. I've found it to be pretty useful for resurfacing information from links I have saved. I think they're using embeddings to help you track down information across different documents, articles, posts, etc.

The big sell is that it's a personalized AI assistant that answers your specific queries based on what data you feed it (e.g., "what was my electricity bill last month"). However, I'm hesitant about uploading personal data, so I am just using it as a bookmark manager. Their LLM, although a little slow, works pretty well. I had this HN link saved about an open-source TTS model. A commenter said they were going to release a model later that week that could be comparable to ElevenLabs. I asked the chatbot on myReach: "What's that TTS model that's looking to rival ElevenLabs?" It surfaced the article and used the specific comment for a response.

I'm not sure about the future of these kinds of services from startups. Given their ability to integrate browsers, emails, cloud storage, photos, and more, Google and Microsoft are likely to develop a similar service. With the resources at their disposal, they could probably design something superior and streamline the process of joining. But for now, I think I'll continue using myReach.
mitthrowaway2 · 3 years ago
The trick is to search HN submissions and filter by the date you remember reading about it. That's how I deal with these unsearchable names.
bckr · 3 years ago
One step more effective is to keep track of things that keep your interest in a notes app.
sfjailbird · 3 years ago
"Metabyte". Missed opportunity.


seydor · 3 years ago
Patchformers
DrScientist · 3 years ago
There is an obvious trend toward scaling and performance by making models hierarchical (not strictly, but with an element of local learning and globally tuned connections).

The obvious next step is to have specialised models loosely connected (and trained [1]) as a whole.

[1] We've had multiple models connected with text->concept and concept->image etc., but I'm not sure if that connection is trained yet.

seanhunter · 3 years ago
That's sort of what happens in the "toolformer" paper [1], which sets out the basic architecture by which things like ChatGPT plugins work. Language models can learn to use "tools" provided by plugins. Those tools might themselves be the specialized models you are describing (although for some plugins they could be a non-AI tool such as web search or whatever).

[1] https://arxiv.org/pdf/2302.04761.pdf

riwsky · 3 years ago
My LLM is totally real! And she’s the best, way better than GPT-4. But you wouldn’t know her, she goes to another school. In Canada.
danielbln · 3 years ago
Normally I would agree, especially if it's some Arxiv paper with two dudes talking about mind reading from fmri (you know who you are..). But Meta AI has shown that they are very much capable of creating serious models and releasing them, so I remain optimistic this isn't a smoke screen.
dr_dshiv · 3 years ago
Wow: “researchers discovered that the Megabyte model's maximum capacity exceeded 1.2M tokens. For comparison, OpenAI's GPT-4 has a limit of 32,000 tokens, while Anthropic's Claude has a limit of 100,000 tokens”

In my understanding, this is accomplished through a hierarchical approach of “patches” of tokens.
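A minimal sketch of that patching idea as I understand it (my own illustration, not the paper's code): group the byte sequence into fixed-size patches, so the expensive global model sees T/P patch embeddings instead of T individual tokens.

```python
def to_patches(byte_seq: bytes, patch_size: int = 8):
    # Chop the byte stream into fixed-size patches; the last one may be short.
    return [byte_seq[i:i + patch_size] for i in range(0, len(byte_seq), patch_size)]

seq = b"hello world, this is a byte-level sequence"
patches = to_patches(seq)
print(len(seq), len(patches))  # prints 42 6
```

The patch size here is arbitrary; the point is just that the global sequence length shrinks by a factor of P, which is where the long-context headroom comes from.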

forrestthewoods · 3 years ago
First “Transformer” architecture and now “Megabyte”? I swear they’re trolling us!

We’re just a generation or two away from the “computer” model.

mkaic · 3 years ago
I think you mean the Constantly Online Meta-Programming Universal Task EnactoR model