mkaic · 3 years ago
Ugh. I just spent the past few days reading this exact paper and preparing a detailed presentation on it for my job as an AI researcher, and headlines like this make me roll my eyes very hard.

MEGABYTE is indeed a cool new architecture, but it is still very much just a proof of concept at the moment. The paper shows that the model can compete with (but not decimate) vanilla Transformers on the scale of ~1B parameters in long-sequence prediction tasks. They did not "unleash" anything, and scalability to very large parameter counts and datasets still has not been tested.

I'm certainly very excited to see where this architecture goes as the community gets ahold of it and starts developing it, but to call it "revolutionary" this early on is disingenuous. I personally have a few experiments I want to run with it, but I put the probability of it being a true GPT-killer at <20%. I would love to be wrong, though!

xp84 · 3 years ago
Every 'news' site: "But how are we gonna get clicks if we don't vastly exaggerate and sensationalize everything?"

Thanks for the actual sober analysis. As not-an-AI-researcher, I could have been easily bamboozled by an article like this.

nwoli · 3 years ago
I wish you had a link to your blog or twitter in your bio, you have the kind of nuanced tone I’d like to hear more from
mkaic · 3 years ago
Oh, thanks! Though to be honest, my blog does have a good bit of wild speculation on it, as I'm a bit of a hypocrite on that front :)

I've added links to both in my bio.

riwsky · 3 years ago
they should consider doing this for a living
jdthedisciple · 3 years ago
When I started reading this and immediately came across the word "groundbreaking" I paused for a sec and thought to myself:

Let me just pretend this word isn't there, it's probably an exaggeration.

It takes a special kind of adaptation to filter out all the 10%-40% bluff usually found in these kinds of news articles.

kjkjadksj · 3 years ago
It's always better to just go to the underlying, usually far drier scientific paper. Even if it's not in your field, try to read the intro and the conclusion; the authors will describe why the work is generally relevant with much less flowery language than the press release. Even a word like "novel" is bold to use too many times.
derefr · 3 years ago
Now I kind of want to see a browser extension that deletes all the intensifier adjectives from runs of text, and is able to be enabled/disabled per domain.
PoignardAzur · 3 years ago
> The paper shows that the model can compete with (but not decimate) vanilla Transformers on the scale of ~1B parameters in long-sequence prediction tasks. They did not "unleash" anything, and scalability to very large parameter counts and datasets still has not been tested.

One thing to keep in mind in particular is that the vast majority of transformer alternatives/improvements/optimizations that initially showed promise ultimately ended up scaling less well than the baseline transformer architecture. The transformer has been weirdly hard to improve/dethrone.

mkaic · 3 years ago
Yeah, it's had remarkable staying power. I think it says a lot that I can see "Vaswani, et al." cited in a paper and know exactly what the author is referring to lol. I don't usually memorize researcher names but the original Attention Is All You Need paper is just freaking ubiquitous.
lordofgibbons · 3 years ago
I'd love to watch a recording of this presentation or write up, if possible!
mkaic · 3 years ago
Hmm, I may make a blog post using some of the graphics I made for the presentation. Can't make any guarantees though.
jacobn · 3 years ago
My main argument against the AI doomsayers has so far been that the current scaling laws simply make runaway singularity style scenarios algorithmically impossible (if for each step of improvement you need 10x parameters and 100x training, you quickly run into a brick wall).
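A toy back-of-envelope loop (my own illustration, not from any paper) shows how quickly that brick wall arrives under those assumed costs:

```python
# Hypothetical assumption from the comment above: each step of capability
# improvement costs 10x the parameters and 100x the training compute.
base_params = 1e9    # assumed starting model size
base_compute = 1e21  # assumed starting training FLOPs

for step in range(5):
    params = base_params * 10 ** step
    compute = base_compute * 100 ** step
    print(f"step {step}: {params:.0e} params, {compute:.0e} FLOPs")
```

Four steps in, you're at 10^29 FLOPs, which is why the argument says runaway self-improvement stalls out under these scaling laws.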

This is part of why I’m not worried about the current crop of generative AI. I am however both curious and concerned about what the tsunami of talent and $$$ chasing the current trend will achieve.

If this n^(4/3) alt transformer compute scaling is real (and there’s been many a pretender, so it’s too early to tell), then that could fundamentally change the overall AI scaling law, substantially lowering the brick wall.

And that could be a game changer.
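For what it's worth, here's a rough sketch of where the n^(4/3) figure comes from, as I read the patch idea (a cost model of my own, not code from the paper): splitting T bytes into patches of size P gives a global model over T/P patches plus a cheap local model per patch, and choosing P near T^(1/3) balances the two terms.

```python
def vanilla_attention_cost(T):
    # full self-attention over T tokens scales quadratically
    return T ** 2

def megabyte_attention_cost(T, P):
    global_cost = (T / P) ** 2    # global model attends over T/P patches
    local_cost = (T / P) * P ** 2  # one local model per patch, over P bytes
    return global_cost + local_cost

T = 1_000_000
P = round(T ** (1 / 3))  # P ~ T^(1/3) balances global and local terms
print(vanilla_attention_cost(T) / megabyte_attention_cost(T, P))  # prints 5000.0
```

The numbers are illustrative only; the real models have constants and feed-forward costs this sketch ignores.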

p-e-w · 3 years ago
I don't buy the idea (with either architecture) that "10x"-type scaling is required for another breakthrough.

Think of a human with below average intelligence. Then think of a human genius. Now consider how incredibly similar their brains are, despite the massive performance gap. It's not like one has 10x the number of neurons/synapses/connections etc. of the other. They're both healthy human brains, and you need powerful technology to even distinguish them structurally.

Considering this, it seems perfectly possible that a model like GPT-4 is just a hair's breadth away from vastly superhuman performance. It certainly beats the average human at many tasks already. The gap between a moron and a genius is a lot larger than that between GPT-4 and a superhuman.

wudangmonk · 3 years ago
You mean the average brain with about 100 billion neurons and about 1,000 connections each, bringing it to around 100 trillion connections. With an estimated 1,000 "AI" neurons required per biological neuron.

I don't think you are giving these "below average" intelligence individuals enough credit. What we consider a genius is the equivalent of a dog show obstacle course. We measure intelligence/genius as whatever is hard for humans and completely ignore what is easy, because we fail to see the complexity behind the easy stuff.

nemothekid · 3 years ago
>Think of a human with below average intelligence. Then think of a human genius.

LLMs are not AGI. A human with below average intelligence is still a league above a chimpanzee. A chimpanzee will never be able to read, not because "it's too dumb", but because a chimp's brain lacks the actual hardware for reading. The LLM is the chimpanzee. The gap between an LLM and a "human with below average intelligence" is far more than 10x.

DrScientist · 3 years ago
> Considering this, it seems perfectly possible that a model like GPT-4 is just a hair's breadth away from vastly superhuman performance.

Except that structurally the brain clearly has vastly more capacity than the GPT-4 model.

So sure, one brain doesn't look that much different from another; the difference is in the details of the learning and wiring.

But the brain looks vastly different from a GPT-4 model in terms of capacity, with trillions of connections, and with each connection and internal state being more subtle as well.

> vastly superhuman performance

In terms of specific tasks computers ( whether you write the program explicitly or it's learnt by tweaking params in a network ) have been there for decades.

So the question is really about which tasks you can apply computers to successfully. Neural nets are allowing programs to be written that weren't possible by hand.

I find it amusing that people worry about ChatGPT et al. putting programmers out of a job, when in a sense it already has: ChatGPT is a program that was built by another program.

woah · 3 years ago
The whole idea of the singularity is that AI starts improving itself. Until that happens, normal human progress in the LLM field is just normal human progress, and will probably follow a similar path of human progress where there's a breakthrough followed by lots of low hanging fruit and hype, then a plateauing and refinement/productization.

LLMs can help humans develop new LLMs faster, but mostly in implementation (CoPilot and ChatGPT), and that's not really the important part. I have yet to see an LLM come up with original ideas.

Given that training data seems to be a big bottleneck, and LLMs are really good at generating text, I think that maybe we can start to talk about the possibility of "singularity" once LLMs are able to generate their own training data that increases their abilities. After all, humans are able to do this. That is the history of human knowledge.

visarga · 3 years ago
> I don't buy the idea (with either architecture) that "10x"-type scaling is required for another breakthrough.

Scaling can happen in two dimensions: model size and dataset size. What counts is the product of n_examples x n_parameters. That's why we have the super-Chinchilla runs, where n_examples >> 20*n_parameters. Scale the data, keep the model lean. Not to mention dataset quality: if you've got clean, diverse data, you need less of it.
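As a quick illustration of that ratio (the ~20 tokens-per-parameter figure is the Chinchilla heuristic; the model size and the 100x multiplier are just examples I picked):

```python
# Chinchilla-style sizing heuristic: compute-optimal training uses roughly
# 20 tokens per parameter; "super-Chinchilla" means going well beyond that.
def training_tokens(n_params: float, tokens_per_param: float = 20) -> float:
    return tokens_per_param * n_params

n_params = 7e9  # hypothetical 7B-parameter model
print(f"{training_tokens(n_params):.1e} tokens")       # compute-optimal budget
print(f"{training_tokens(n_params, 100):.1e} tokens")  # over-trained "lean" model
```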

Another important trend is using LLMs to generate training sets. Example: ConstitutionalAI (pure RLAIF), TinyStories (scaling down LLMs), Alpaca (borrowed RLHF), AlpacaFarm - a recent paper promising fine-tuned models for $200 cost in 24h.

beefield · 3 years ago
> seems perfectly possible that a model like GPT-4 is just a hair's breadth away from vastly superhuman performance

Sorry, can't help commenting on the weirdness of high-dimensional spaces. If you take a cube with sides a hair's breadth (100 µm) long in a space of hundreds of billions of dimensions like GPT-4's, distances between random points in that cube are on the order of tens of meters...
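That estimate checks out (a sketch of my own; the dimension count is illustrative): for two points drawn uniformly in a cube of side s in d dimensions, each coordinate contributes E[(x - y)^2] = s^2/6, so the typical distance is about s·sqrt(d/6).

```python
import math

side = 100e-6  # a hair's breadth: 100 micrometers, in meters
d = 200e9      # "hundreds of billions" of dimensions (illustrative)

# Each coordinate of x - y has mean squared value side^2 / 6 for uniform
# x, y, so the expected distance grows like side * sqrt(d / 6).
expected_distance = side * math.sqrt(d / 6)
print(f"{expected_distance:.1f} m")  # prints 18.3 m
```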

somewhereoutth · 3 years ago
However, that is to equate an LLM with a human brain. Both can be conceptualised with the idea of the 'neuron' (though with wildly different actual implementations of that term), but that is their only point of comparison. Thus your conjecture is invalid.
raincole · 3 years ago
> Now consider how incredibly similar their brains are, despite the massive performance gap.

A disk filled with random bits is "similar" to a disk that stores the whole Wikipedia's text content. So writing the whole Wikipedia is an effort of a hair's breadth...?

anon291 · 3 years ago
> It certainly beats the average human at many tasks already

This is not the revolution you think it is.

```python
>>> 338383*887282
... ANSWER ...
```

Yet skynet never came.

BulgarianIdiot · 3 years ago
Your premise is wrong. This has nothing to do with 10x-ing parameters. One could argue the current parameter sizes are good enough as we observe "large breadth, shallow depth" behavior from LLM and to some extent, diffusion models.

This suggests the problem is the depth of inference, which is single pass "hot takes" for all language models right now, due to cost of inference and our limited understanding of what makes a model's response high quality.

Yes, you don't need more parameters to increase the depth. You need to iterate, instead. Loop. Imagine programming if looping was not allowed, nor recursion, or not even defining functions and calling them. Everything you write runs at most once during program execution and that's it. This is what an AI model is right now during inference. One big flat, single-pass, directed acyclic graph. And soon it won't be.

Research into Chain of Thought and Tree of Thought reveals this dimension. This means you can take existing models and make them perform much more complex tasks with much better precision, through various ways of letting them iterate. Think of how you'd perform if you always had exactly 5 seconds to answer a question. Now imagine if you have 5 minutes. 5 hours. 5 days. Lo and behold, it turns out an AI isn't different in that aspect.
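A hypothetical sketch of that iterate-instead-of-one-pass idea (`call_model` here is a placeholder stand-in for any LLM call, not a real API):

```python
def call_model(prompt: str) -> str:
    # Stand-in for a single-pass model call: one flat "hot take" per prompt.
    return f"draft answer to: {prompt}"

def iterative_answer(question: str, rounds: int = 3) -> str:
    # Wrap the single-pass call in a critique-and-revise loop, the basic
    # shape of Chain/Tree-of-Thought style iteration.
    answer = call_model(question)
    for _ in range(rounds):
        critique = call_model(f"Critique this answer: {answer}")
        answer = call_model(f"Improve the answer given: {critique}")
    return answer

print(iterative_answer("What limits single-pass inference?"))
```

Each extra round trades inference cost for a chance to revise, which is the "5 seconds vs. 5 minutes" point in loop form.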

We also need more iterations of training (on the same amount of data), we need larger context windows, and we need new architectures, like Meta's MEGABYTE, for example.

Parameter count and data size could hypothetically have already hit a hard wall (they haven't) and AI will keep exponentially improving regardless. There's too much low hanging fruit and more grows by the nanosecond.

Henk0 · 3 years ago
This. So much this.

I'm completely dumbfounded by obviously highly intelligent people consistently not getting this, and dismissing current-generation AI systems as not being intelligent because they can't reliably solve massively complex problems in one go. As if anyone would expect a human programmer or researcher to just intuitively come up with a complex program, or the correct answer to a hard problem, every time, instantly.

Human thinking and problem solving involves a lot of trial and error, iterative thinking, and sharing and discussing the problem with other humans. Processes that AI researchers are just now beginning to explore, with results like increasing reasoning ability by 900% in a recent paper. Every thinking human runs a near constant loop of thought, with no conscious control of which thought will appear next (we're very good at fooling ourselves that we have control though)

We do have super-intelligences already, but they're severely handicapped by lacking a bunch of these - apparently fairly straightforward to implement - abilities, plus a few senses and the ability to directly effect change in the physical world (which really isn't needed if they can get access to human agents who will do their bidding, wittingly or unwittingly), and to self-improve. With regards to self-improvement, the increasing coding skills combined with iterative 'thought' loops should get there in very little time considering the current rate of progress

There's also the idea that a single AI model should be able to do everything our human brains do, when our brains actually contain a number of specialised subunits that handle different aspects of our behavioural repertoire. It's reasonable to allow for the same thing with an AI system, where specialised sub-networks handle input, output and other subtasks. AI systems also have the advantage of being able to add any arbitrary number of subunits to increase their capacity to solve various problems.

We seem to suffer from a species-wide narcissism with regards to our own intelligence and capabilities, and there's this huge focus on the number of connections in the human brain – most of which deal with things that are by no means necessary to act on the world unless one has a meat body and the need to navigate social situations, make friends and mate. Fact is, we have terrible short-term memory (worse than chimpanzees), slow processing time, lots of cognitive heuristics, many of which cause more harm than good in the modern world. We are emotional and easily fooled. Even the most intelligent people historically have believed in what we now consider fairy tales. We are slow to take in information, bad at storing it, and generally bad at transmitting it. A few of us can generate great ideas – building on accumulated knowledge from our forebears and peers – but most of us are just not that great at coming up with anything original or useful

I've been actively looking for good arguments against AGI being much closer than we should be comfortable with, and reasons why we should not fear systems that surpass us in intelligence. All I've come across so far is some combination of the above, often expressed with a dismissive attitude, disparaging current LLMs as parrots (that can apparently reason on the level of university-educated humans, but much more quickly), and pejorative terms like fearmongers and doomers to describe those of us who really don't think it's a good idea to pursue more intelligent systems. My guess is these people will act surprised when the arms race inevitably leads to some very bad unintended consequences. I don't see a way to stop it though, so I'm just strapped in for the ride along with the rest of humankind.

Again, if you have good arguments against any of the points above, please do share them with me

godelski · 3 years ago
> if for each step of improvement you need 10x parameters and 100x training, you quickly run into a brick wall

Btw, PaLM 2 has far fewer parameters than PaLM 1.

> The largest model in the PaLM 2 family, PaLM 2-L, is significantly smaller than the largest PaLM model but uses more training compute. Our evaluation results show that PaLM 2 models significantly outperform PaLM on a variety of tasks, including natural language generation, translation, and reasoning. These results suggest that model scaling is not the only way to improve performance.

From the leak we know that the large version is 340B parameters, compared to the original 540B parameters. From Table 2 in the document we see that the small version (unknown size) is on par with version 1.

ML typically follows a cycle: improve, distill, repeat. It is unfortunate that the big labs lead these efforts, because many smaller labs try to work on the distill part in parallel (out of necessity) but their work gets rejected (due to lack of SOTA) or ignored. Like all research, we need to be careful and nuanced in our evaluations. There's a lot of hype, and many are trying to take advantage of the confusion and sell snake oil. I think AI/ML is and will continue to change our world, but we have to be careful not to let the salesmen dictate the conversations.

PaLM https://arxiv.org/abs/2204.02311

PaLM 2 https://arxiv.org/abs/2305.10403

primax · 3 years ago
Honestly, I welcome the talent working on AI now. Before, they were working on how to make me spend 5 more seconds on Facebook, or how to get me to click on a Google ad. AI has potentially huge positive productivity potential.
lostmsu · 3 years ago
For all you know they might now be working on how to make you spend the rest of your life in a coal mine for AI overlords.
digging · 3 years ago
Why do you think the best AI isn't going to be used to work on how to make you spend 5 more seconds on Facebook?
vlovich123 · 3 years ago
I don’t think I follow this argument. AI has been dropping in costs and complexity the more engineering time is spent on it. It seems like the bottleneck is humans creating new AI techniques right? If an AI is capable of developing new AI techniques unsupervised, isn’t that by definition the singularity? Heck doesn’t even need to be unsupervised. If it can even do most of the heavy lifting for a human I feel like that would put us into runaway territory.

Granted, we're a long way away from that, and likely we'd need an AI that could come up with its own hypotheses for new AI models to make this truly minimal-cost, and that feels even farther away. But I don't think I follow the claim that each improvement step requires so much massive extra scaling, as it doesn't seem to match what we've seen over the past few years (granted, I could be misinformed; I'm a 10,000-foot-view spectator and maybe my own mental model here is flawed).

doctor_eval · 3 years ago
I consider the singularity to be the point at which the certainty of our predictions about our future becomes close to zero. By this definition I reckon we are already in the singularity.

It’s not necessarily bad. The problem with the singularity is that that we can’t tell if it’s bad or not.

woah · 3 years ago
> If an AI is capable of developing new AI techniques unsupervised, isn’t that by definition the singularity?

AI is not capable of developing new AI techniques unsupervised.

Honestly the lowest hanging fruit should be the ability for LLMs to generate their own textual training data that leads to an improvement in model abilities, since that is what they are supposed to be good at. Until then, it's still garbage in garbage out.

iinnPP · 3 years ago
This assumes only the LLM matters. I don't think it does. I think GPT3.5 is plenty, if not overkill.

I can't imagine I am alone in my thought. I've even seen some other experiments in the same line of reasoning.

Meta seems to be thinking the same thing, or at least closer to it than the majority.

ummonk · 3 years ago
Do you believe that Moore’s Law (or rather Huang’s Law) will stop being true before AI exceeds human capabilities by orders of magnitude?

We don’t need a runaway singularity for AI to just render human intelligence obsolete.

BorisTheBrave · 3 years ago
I think you've misunderstood. Megabyte scales better with context window length. I don't know if they're saying the training data / compute are any more efficient.
lazzlazzlazz · 3 years ago
10x scaling won't take long, even putting aside improvements in our understanding of training processes.


crakenzak · 3 years ago
Paper: https://arxiv.org/abs/2305.07185

Wow, it seems like Meta AI is way ahead of the curve compared to even Google and OpenAI recently, especially with their open-sourcing pushes.

Great for the research community as a whole!

1024core · 3 years ago
Meta has no horse in the race (i.e. they don't have a search engine). So, they don't mind throwing random things out. Withholding it won't really make much of a difference for them, as they don't have a way to productionize the tech.
dumpsterdiver · 3 years ago
While I disagree that having a search engine is the only way to have a "horse in the race", I must agree that at this point Meta does not appear to have a horse in the race.

Other companies are providing services that are so useful that it makes us think twice about how secure our jobs are. Then there is Meta, who seems to think that the world at large will forget about the terrible motion sickness that their VR products have wrought upon us. I for one will not forget. I'm actually traumatized, and even thinking about putting on VR goggles at this point makes me feel queasy.

wolfd · 3 years ago
I would be _shocked_ if Zuck wasn't thinking 24/7 about how to capitalize on LLMs. I'm sure there are a thousand ideas (maybe even a few good ones?) being thrown around Meta at how to use LLMs to beat Google/Microsoft+OpenAI at the "search buddy" game.
zoogeny · 3 years ago
Meta still has the Portal line of products - which were competitors to the Alexa, Siri and similar Google audio assistant product lines. I just searched and it looks like Meta are currently partnering with Amazon to license Alexa on these devices, but I could imagine they might want to replace it with their own LLM eventually.

I am a bit surprised that no one is talking about how these new LLM models will disrupt Alexa, Siri, etc. since that seems to be the most applicable market I can imagine.

abwizz · 3 years ago
Would not be surprised if they integrated a "suggested conversations" feature based on your chat history and behaviour, where the user just picks sentences from a list and both parties can enjoy an effortless "organic" conversation.
mturmon · 3 years ago
Commoditizing their complement?
jerpint · 3 years ago
Deepmind was behind this kind of thinking years ago with Perceiver models, but meta is crushing it with high quality publications lately
joshxyz · 3 years ago
great positioning for zuck, impressive


zxexz · 3 years ago
Great paper. But wow, I really wish everyone would use more easily searchable names for their projects. In 6 months, there’s a high probability I’ll end up googling/ddging “megabyte model” trying to find this paper again.
machdiamonds · 3 years ago
I've been using myReach as a bookmark manager lately. I've found it to be pretty useful for resurfacing information from links I have saved. I think they're using embeddings to help you track down information across different documents, articles, posts, etc.

The big sell is that it's a personalized AI assistant that answers your specific queries based on what data you feed it (e.g., "what was my electricity bill last month"). However, I'm hesitant about uploading personal data, so I am just using it as a bookmark manager. Their LLM, although a little slow, works pretty well. I had this HN link saved about an open-source TTS model. A commenter said they were going to release a model later that week that could be comparable to ElevenLabs. I asked the chatbot on myReach: "What's that TTS model that's looking to rival ElevenLabs?" It surfaced the article and used the specific comment for a response.

I'm not sure about the future of these kinds of services from startups. Given their ability to integrate browsers, emails, cloud storage, photos, and more, Google and Microsoft are likely to develop a similar service. With the resources at their disposal, they could probably design something superior and streamline the process of joining. But for now, I think I'll continue using myReach.
mitthrowaway2 · 3 years ago
The trick is to search HN submissions and filter by the date you remember reading about it. That's how I deal with these unsearchable names.
bckr · 3 years ago
One step more effective is to keep track of things that keep your interest in a notes app.
sfjailbird · 3 years ago
"Metabyte". Missed opportunity.


seydor · 3 years ago
Patchformers
DrScientist · 3 years ago
There is an obvious trend toward scaling and performance by making models hierarchical (not strictly, but with an element of local learning and globally tuned connections).

The obvious next step is to have specialised models loosely connected (and trained [1]) as a whole.

[1] We've had multiple models connected with text->concept and concept->image etc., but I'm not sure if that connection is trained yet.

seanhunter · 3 years ago
That's sort of what happens in the "toolformer" paper [1], which sets out the basic architecture by which things like ChatGPT plugins work. Language models can learn to use "tools" provided by plugins. Those tools might themselves be the specialized models you are describing (although for some plugins they could be a non-AI tool such as web search or whatever).

[1] https://arxiv.org/pdf/2302.04761.pdf

riwsky · 3 years ago
My LLM is totally real! And she’s the best, way better than GPT-4. But you wouldn’t know her, she goes to another school. In Canada.
danielbln · 3 years ago
Normally I would agree, especially if it's some Arxiv paper with two dudes talking about mind reading from fmri (you know who you are..). But Meta AI has shown that they are very much capable of creating serious models and releasing them, so I remain optimistic this isn't a smoke screen.
dr_dshiv · 3 years ago
Wow: “researchers discovered that the Megabyte model's maximum capacity exceeded 1.2M tokens. For comparison, OpenAI's GPT-4 has a limit of 32,000 tokens, while Anthropic's Claude has a limit of 100,000 tokens”

In my understanding, this is accomplished through a hierarchical approach of “patches” of tokens.
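A minimal sketch of that patching idea as I understand it (my own illustration, not the paper's code): group the byte sequence into fixed-size patches, so the expensive global model sees T/P patch embeddings instead of T individual tokens.

```python
def to_patches(byte_seq: bytes, patch_size: int = 8):
    # Chop the byte stream into fixed-size patches; the last one may be short.
    return [byte_seq[i:i + patch_size] for i in range(0, len(byte_seq), patch_size)]

seq = b"hello world, this is a byte-level sequence"
patches = to_patches(seq)
print(len(seq), len(patches))  # prints 42 6
```

The patch size here is arbitrary; the point is just that the global sequence length shrinks by a factor of P, which is where the long-context headroom comes from.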

forrestthewoods · 3 years ago
First “Transformer” architecture and now “Megabyte”? I swear they’re trolling us!

We’re just a generation or two away from the “computer” model.

mkaic · 3 years ago
I think you mean the Constantly Online Meta-Programming Universal Task EnactoR model