Attention models? Attention existed before those papers. What they did was show that it was enough to predict next-word sequences in a certain context. I'm certain they didn't realize what they had found. We used this framework in 2018 and it gave us wildly unusual behavior (but really fun), and we tried to solve it (really looking for HF capability more than RL), but we didn't see what another group found: that scale in compute with simple algorithms was just better. To argue that one group discovered and changed AI, ignoring all the other groups, is really annoying. I'm glad for these researchers. They deserve the accolades, but they didn't invent modern AI. They advanced it. In an interesting way. But even now we want to return to a more deterministic approach. World models. Memory. Graphs. Energy minimization. Generative is fun and it taught us something, but I'm not sure we can just keep adding more and more chips to solve AGI/SGI through compute. Or maybe we can. But that paper is not written yet.
This is an uncharitable and oddly dismissive take (i.e. perfect for HN, I suppose).
Today's incredible state-of-the-art does not exist without the transformer architecture. Transformers aren't merely some lucky passengers riding the coattails of compute scale. If they were, then the ChatGPT app which set the world ablaze would've instead been called ChatMLP, or ChatCNN. But it's not. And in 2024 we still have no competing NLP architecture. Because the transformer is a genuinely profound, remarkable idea with remarkable properties (e.g. training parallelism). It's easy to downplay GPTs as a mostly derivative idea with the benefit of hindsight. I'm sure we'll perform the same revisionist history with state-space models, or whatever architecture eventually supplants transformers. Do GPTs build on prior work? Do other approaches and ideas deserve recognition? Yeah, obviously. Like...welcome to science. But the transformer's architects earned their praise -- including via this article -- which isn't some slight against everyone else, as if accolades were a zero-sum game. These 8 people changed our world and genuinely deserve the love!
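To make the training-parallelism point concrete, here is a minimal sketch of scaled dot-product attention (my own toy numpy code, not anything from the paper; single head, no masking):

    import numpy as np

    def attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # (T, T): every pair of positions at once
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V                               # (T, d) mixture of value vectors

    T, d = 8, 16
    x = np.random.default_rng(0).normal(size=(T, d))
    out = attention(x, x, x)                             # self-attention: Q = K = V

Every output row falls out of the same two matmuls, so the whole sequence is processed in parallel during training; an RNN would have to produce position t before position t+1.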
Question for you, as someone relatively new to the world of AI (well, not exactly new - I took many courses in AI, including neural networks, but in the late 90s... the world is just a tad different now!)
Is there any good summary of the history of AI/deep learning from, say, late 00s/2010 to the present? I think learning some of this history would really help me better understand how we ended up at the current state of the art.
> And in 2024 we still have no competing NLP architecture.
No, we do. State space models are both faster and scale just as well. E.g., RWKV and Mamba.
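For anyone unfamiliar, the core of these models is a linear recurrence instead of pairwise attention. A toy sketch of the idea (a plain diagonal state-space scan with made-up parameter values, not RWKV or Mamba themselves):

    import numpy as np

    def ssm_scan(x, A, B, C):
        # h_t = A*h_{t-1} + B*x_t ; y_t = C.h_t ; diagonal A keeps the update elementwise
        h = np.zeros_like(A)
        ys = []
        for x_t in x:                    # O(T) steps with O(1) carried state
            h = A * h + B * x_t
            ys.append(C @ h)
        return np.array(ys)

    T, n = 16, 8
    A = np.full(n, 0.9)                  # per-channel decay (assumed values)
    B = np.ones(n)
    C = np.ones(n) / n
    y = ssm_scan(np.random.default_rng(1).normal(size=T), A, B, C)

The state h has constant size, so inference cost is linear in sequence length, versus the O(T^2) score matrix attention builds.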
> Transformers aren't merely some lucky passengers riding the coattails of compute scale.
Err... they are, though. They were just the type of model the right researchers were already using at the time, probably for translating between natural languages.
I kind of wonder if the reason this seems to be true is that emergent systems are just able to go a lot farther into much more complex design spaces than any system a human mind is capable of constructing.
I'm studying neuroscience but very interested in how AI works. I've read up on the old school, but phrases like 'memory graph' and 'energy minimization' are new to me. What modern papers/articles would you recommend for folks who want to learn more?
For phrases, Google's TF glossary [0] is a good resource, but it does not cover certain subsets of AI (and more specifically, it is mostly focused on TensorFlow).

[0] https://developers.google.com/machine-learning/glossary
If you are in neuroscience, I would recommend looking into neural radiance field (NeRF) rendering as well. I find it fascinating, since it's essentially an over-fitted neural network.
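If it helps, the "over-fitted network" framing can be shown in a few lines: deliberately train a tiny MLP to memorize a single signal from coordinates. This is only a 2D stand-in I wrote for illustration (a real NeRF maps 5D position plus view direction to color and density):

    import numpy as np

    rng = np.random.default_rng(0)
    xy = rng.uniform(-1, 1, size=(256, 2))                  # sampled coordinates
    target = np.sin(3 * xy[:, 0]) * np.cos(3 * xy[:, 1])    # stand-in "scene"

    W1 = rng.normal(0, 0.5, (2, 64)); b1 = np.zeros(64)
    W2 = rng.normal(0, 0.5, (64, 1)); b2 = np.zeros(1)

    for step in range(3000):                 # plain gradient descent, backprop by hand
        h = np.tanh(xy @ W1 + b1)
        err = (h @ W2 + b2).ravel() - target
        dh = (err[:, None] * W2.T) * (1 - h ** 2)
        W2 -= 0.3 * h.T @ err[:, None] / len(xy)
        b2 -= 0.3 * err.mean(keepdims=True)
        W1 -= 0.3 * xy.T @ dh / len(xy)
        b1 -= 0.3 * dh.mean(axis=0)

    print(np.mean(err ** 2))                 # small once trained: the net has (over)fit the scene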
This is a classic... and it isn't the first time this piece has been posted, either. Here's one from the FT last year [0] and one from Bloomberg [1]. You can find more. Google certainly played a major role, but it goes too far to say they invented it or that they alone created modern AI. Like Newton said: shoulders of giants. And realistically, those giants are just a bunch of people in trench coats, millions of researchers going unrecognized. I don't want to undermine the work of these researchers, but that doesn't mean we should undermine the work of the many others (and thank god for the mathematicians who get no recognition and lay all the foundations for us).
And of course, a triggered Yann[2] (who is absolutely right).
But it is odd, since the history of attention is actually a highly discussed topic. It's been discussed on HN many times before, and of course there's Lilian Weng's very famous blog post [3] that covers it in detail.
The word "attention" goes back well over a decade, to even before Schmidhuber's usage. He has a reasonable claim, but these things are always fuzzy and never exactly clear-cut.
At least the article is more correct in specifying the Transformer rather than attention, but even this is vague at best. FFormer (FFT-Transformer) was an early iteration, and there were many variants. Do we call a transformer a residual attention mechanism with a residual feed-forward? Can it be a convolution? There is no definitive definition, but generally people mean DPMHA with a skip layer plus a processing network with a skip layer. But this description fits many architectures, since every network can be decomposed into subnetworks. This even includes a 3-layer FFN (1 hidden layer).
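For what it's worth, that loose definition is just two residual sublayers. A minimal sketch (my own code; single head, no layer norms, no masking, shapes assumed (T, d)):

    import numpy as np

    def attn(x):                                  # single-head DPMHA stand-in
        s = x @ x.T / np.sqrt(x.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ x

    def ffn(x, W1, W2):                           # the "processing network"
        return np.maximum(x @ W1, 0.0) @ W2       # one hidden layer, ReLU

    def block(x, W1, W2):
        x = x + attn(x)                           # attention + skip layer
        return x + ffn(x, W1, W2)                 # feed-forward + skip layer

    T, d = 8, 16
    rng = np.random.default_rng(0)
    out = block(rng.normal(size=(T, d)),
                0.1 * rng.normal(size=(d, 4 * d)),
                0.1 * rng.normal(size=(4 * d, d)))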
Stories are nice, but I think it is bad to forget all the people who contribute in less obvious ways. If a butterfly can cause a typhoon, then even a poor paper can contribute to a revolution.

[0] https://www.ft.com/content/37bb01af-ee46-4483-982f-ef3921436...
[1] https://www.bloomberg.com/opinion/features/2023-07-13/ex-goo...
[2] https://twitter.com/ylecun/status/1770471957617836138
[3] https://lilianweng.github.io/posts/2018-06-24-attention/
A little after that. I'd put the heyday between 2003 and 2010, starting with the Gmail launch and ending with the Chrome and Android launches. That period includes Gmail, Maps, Scholar, Orkut, Reader, the acquisitions of Blogger/YouTube/Docs/Sheets/Slides/Analytics/Android, Summer of Code, OAuth, Translate, Voice Search, search suggestions, universal search, Picasa, etc. I can look at my phone or computer and basically everything I routinely use dates from that period.
Agreed. I remember the town hall meeting where they announced the transition to being Alphabet. My manager was flying home from the US at the time. He left a Google employee and landed an Alphabet employee.
I know it was probably meaningless in any real sense, but when they dropped the "Don't Be Evil" motto, it was a sign that the fun times were drawing to an end.
The quota system can kick in at whatever time the limits are reached.
And GPUs are scattered across Borg cells, limiting the ceiling. That's why XBorg was created: to give researchers a global search across all Borg cells.
And data center capex is around $5 billion each year.
Google makes hundreds of billions in revenue each year.
You are asking what people would do in an impossible situation. Like asking "what would you do after you are dead?": literally, I could do nothing after I am dead.
I cannot even understand what "I do" stands for in the context of your question. The above is my direct reaction to the premise that he assumes he had an unlimited budget.
> I cannot even understand what "I do" stands for in the context of your question
That he had a higher budget than he knew what to do with. When I worked at Google, I could bring up thousands of workers doing big tasks for hours, without issue, whenever I wanted; for me that was the same as infinite, since I never needed more, and that team didn't even have a particularly large budget. I can see a top ML team having enough compute budget to run a task over the entire Google scrape index every day to test things. You don't need that much to do that, and I wasn't far from it.
At that point the issue is no longer budget but the time for these projects to run and return a result. Of course, that was before LLMs; the models before then weren't that expensive.
Those were fun times! (& great to see you again after all these years). It's astonishing to me how far the tech has come given what we were working on at the time.
> "Realistically, we could have had GPT-3 or even 3.5 probably in 2019, maybe 2020. The big question isn’t, did they see it? The question is, why didn’t we do anything with the fact that we had seen it? The answer is tricky.”
The answer is that monopolies stifle technological innovation because one well-established part of their business (advertising-centric search) would be negatively impacted by an upstart branch (chatbots) that would cut into search ad revenue.
This is comparable to an investor-owned consortium of electric utilities, gas-fired power plants, and fracked natural gas producers. Would they want the electric utility component to install thousands of solar panels and cut off the revenue from natural gas sales to the utility? Of course not.
It's a good argument for giving Alphabet the Ma Bell anti-trust treatment, certainly.
A better example of that behaviour would be Kodak, which invented the first digital camera as early as 1975 and then killed the project because it was a threat to their chemical business.
Digital photography works not just because of the camera but because of the surrounding digital ecosystem. What would people do with digital photos in 1975?
How is that not immediate grounds for his termination? The board should be canned too for allowing such an obvious charlatan to continue ruining the company.
On the other hand, Alphabet's inability to deploy GPT-3 or GPT-3.5 has led to the possibility of its disruption, so anti-trust treatment may not be necessary.
Disrupted by whom? Microsoft? Facebook? The company formerly known as Twitter? Even if one of these takes over, we'd just be trading masters.
And that's ignoring how Alphabet's core business, Search, has little to fear from GPT-3 or GPT-3.5. These models are decent for a chatbot, but for anything where you want reliably correct answers they are lacking.
Honestly, this is part of the reason why I don't think Google will be a dominant business in 10 years. Searching the web for information helped us service a lot of useful 'jobs to be done'... but most of those are now done better by ChatGPT, Claude, etc.
Sure we have Gemini but can Google take a loss in revenue in search advertising in their existing product to maybe one day make money from Gemini search? Advertising in the LLM interface hasn't been figured out yet.
Google (kind of) feels like an old school newspaper in the age of the internet. Advertising models for the web took a while to shake out.
The problem is that chatting with an LLM is extremely disruptive to their business model and it's difficult for them to productize without killing the golden goose.
I know everyone cites this as the innovator's dilemma, but so far the evidence suggests this isn't true.
ChatGPT has been around for a while now, and it hasn't led to a collapse in Google's search revenue, and in fact now Google is rushing to roll out their version instead of trying to entrench search.
A famous example is the iPhone killing the iPod, and it took around three and a half years for the iPod to really collapse, so chat and copilots might still be early. On the other hand, handheld consumer electronics have much longer buying cycles than software tools.
No, it's the other way around. Right now web search is "disruptive" to the LLM business model.
Google is a business fundamentally oriented around loss leaders. They make money on commercial queries where someone wants to buy something. They lose money when people search for facts or knowledge. The model works because people don't want to change their search engine every five minutes, so being good at the money losing queries means people will naturally use you for the money making queries too.
Right now LLMs take all the money losing queries and spend vast sums of investor capital on serving them, but they are useless for queries like [pizza near me] or [medical injury legal advice] or [holidays in the canary islands]. So right now I'd actually expect Google to do quite well. Their competitor is burning capital taking away all the stuff that they don't really want, whilst leaving them with the gold.
Now of course, that's today. The obvious direction for OpenAI to go in is finding ways to integrate ads with the free version of ChatGPT. But that's super hard. Building an ad network is hard. It takes a lot of time and effort, and it's really unclear what the product looks like there. Ads on web search is pretty obvious: the ads look like search results. What does an ad look like in a ChatGPT response?
Google have plenty of time to figure this out because OpenAI don't seem interested. They've apparently decided that all of ChatGPT is a loss leader for their API services. Whether that's financially sustainable or not is unclear, but it's also irrelevant. People still want to do commercial queries, they still want things like real images and maps and yes even ads (the way the ad auction works on Google is a very good ranking signal for many businesses). ChatGPT is still useless for them, so for now they will continue to leave money on the table where Google can keep picking it up.
Ironically, they killed the golden goose (the web) with display ads, which incentivized low-quality content to keep users scrolling. To rub salt in their own wound, they are now indexing more genAI crap than they can handle, and search is imploding.
I think it's evidence that timing is everything. In the 2010s deep learning was still figuring out how to leverage GPUs. The scale of compute required for everything after GPT-2 would have been nearly impossible in 2017/2018; our courses at Udacity used a few hours of time on K80 GPUs. By 2020 it was becoming possible to get unbelievable amounts of compute to throw at these models to test the scale hypothesis. The rise of LLMs is stark proof of the Bitter Lesson, because it's at least as much the story of GPU advancement as of algorithms.
Sure, Google is worth more but they essentially reduced themselves to an ad business. They aren't focused on innovation as much as before. Shareholder value is the new god.
Well, at the time, before Microsoft got involved, it was sort of an unspoken rule in the AI community to be open about research while not releasing certain models to the public.
> Not only were the authors all Google employees, they also worked out of the same offices.
Subtle plug for return-to-office. In-person face-to-face collaboration (with periods of solo uninterrupted deep focus) probably is the best technology we have for innovation.
“Office” does not have to mean open office. Academics all have offices with doors for a reason. I can’t stand open office, but private office in a building with other people is great.
> The group is also culturally diverse. Six of the eight authors were born outside the United States; the other two are children of two green-card-carrying Germans who were temporarily in California and a first-generation American whose family had fled persecution, respectively.
As much as I think America has a lot of things it needs to fix, there is no other country on earth where this would be possible. That's just a fact.
> there is no other country on earth where this would be possible. That's just a fact
I don't think this is the case. If anything, the US makes life very hard even for high-skilled, work-based immigrants. Many countries have a higher percentage of foreign-born residents than the US (Singapore, Australia, Germany, Canada).
I myself used to work at Google UK, and my own team was 100% foreign-born engineers from every continent.
The bitter lesson [0] strikes again.
[0] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Personally, I think both compute and NN architecture are probably needed to get closer to AGI.
https://bbycroft.net/llm
I believe energy minimization is literal, just look at the size of that thing and imagine the power bill.
However, some advances can have huge consequences for the field compared to others, even if at the technical level they appear comparable.
One example that comes to mind is CRISPR.
I asked "What would you do if you had an unlimited budget?"
He simply said, "I do"
I worked on Borg
Sundar was afraid of the technology and how it would be received and tried to ice it.
But they are probably using various techniques in analyses, embeddings, "canned" answers to queries, and so on.
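I have no inside knowledge of how such a pipeline would look, but a "canned answers via embeddings" scheme can be sketched in a few lines (the hashed bag-of-words embed() here is a stand-in I made up; a real system would use a learned text encoder):

    import numpy as np

    def embed(text):
        # stand-in embedding: hashed bag-of-words, normalized to unit length
        v = np.zeros(64)
        for tok in text.lower().split():
            v[hash(tok) % 64] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    canned = {"capital of france": "Paris"}       # hypothetical cached Q->A pairs
    canned_vecs = {q: embed(q) for q in canned}

    def answer(query, threshold=0.8):
        qv = embed(query)
        best = max(canned_vecs, key=lambda q: canned_vecs[q] @ qv)
        if canned_vecs[best] @ qv >= threshold:   # cosine similarity of unit vectors
            return canned[best]                   # serve the canned answer
        return None                               # fall through to the full pipeline

    print(answer("capital of France"))            # -> "Paris"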
* https://www.youtube.com/watch?v=QWWgr2rN45o
* https://www.youtube.com/watch?v=E14IsFbAbpI ('mirror')
Goes over Hinton's history and why he went the direction he did with his research, as well as Li's efforts with ImageNet.
Which is usually impossible in the office. So more like a mix, which is what all reasonable people are saying.