The way I look at transformers is: they have been one of the most fertile inventions in recent history. Originally released in 2017, in the subsequent 8 years they completely transformed (heh) multiple fields, and at least partially led to one Nobel prize.
Realistically, I think the valuable idea is probabilistic graphical models, of which transformers are an example. Combining probability with sequences, or with trees and graphs, is likely to continue to be a valuable area for research exploration for the foreseeable future.
I'm skeptical that we'll see a big breakthrough in the architecture itself. As sick as we all are of transformers, they are really good universal approximators. You can get some marginal gains, but how much more _universal_ are you realistically going to get? I could be wrong, and I'm glad there are researchers out there looking at alternatives like graphical models, but for my money we need to look further afield: reconsider the auto-regressive task, cross entropy loss, even gradient descent optimization itself.
The softmax has issues regarding attention sinks [1]. The softmax also causes sharpness problems [2]. In general, making the decision boundary a Euclidean dot product isn't actually optimal for everything; there are many classes of problems where you want polyhedral cones [3]. Positional embeddings are also janky af and so is RoPE tbh; I think Canon layers are a more promising alternative for horizontal alignment [4].
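A toy sketch of the sharpness concern, as I read it: if the gap between the "right" key's logit and everything else is bounded, the softmax weight on that key has to shrink as the number of keys grows. The function names and the `gap=5.0` value are made up for illustration, and this is not code from [2].

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention_weight(n_keys, gap=5.0):
    """Largest softmax weight when one key scores `gap` above n_keys - 1 distractors."""
    # Assumption: a fixed, bounded logit gap, independent of context length.
    logits = [gap] + [0.0] * (n_keys - 1)
    return max(softmax(logits))

for n in (10, 1_000, 100_000):
    print(n, max_attention_weight(n))
```

The weight on the single relevant key works out to e^gap / (e^gap + n - 1), so attention cannot stay sharp at long context without letting the logits grow.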
I still think there is plenty of room to improve these things. But a lot of focus right now is unfortunately being spent on benchmaxxing using flawed benchmarks that can be hacked with memorization. I think a really promising and underappreciated direction is synthetically coming up with ideas and tests that mathematically do not work well and proving that current architectures struggle with them. A great example of this is the "Transformers need glasses" paper [5], or belief state transformers with their star task [6]. The Google one about the limits of embedding dimensions is also great and shows how the dimension of the QK part is actually important to getting good retrieval [7].
I agree, gradient descent implicitly assumes things have a meaningful gradient, which they don’t always. And even if we say anything can be approximated by a continuous function, we’re learning we don’t like approximations in our AI. Some discrete alternative of SGD would be nice.
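A tiny sketch of the "no meaningful gradient" point, assuming a hard-threshold unit with a 0/1-style loss. The names and the candidate grid are hypothetical, and the exhaustive search at the end is just the simplest gradient-free stand-in, not a proposed discrete alternative to SGD.

```python
def step_loss(w, x=1.0, target=1.0):
    """Squared error of a hard-threshold unit: piecewise constant in w."""
    prediction = 1.0 if w * x > 0.5 else 0.0
    return (prediction - target) ** 2

def finite_difference(f, w, eps=1e-4):
    """Central-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

# On the flat part of the loss the gradient carries no signal at all.
grad = finite_difference(step_loss, 0.0)

# Assumption: a small discrete candidate set that we can simply enumerate.
candidates = [i * 0.25 for i in range(-4, 5)]
best = min(candidates, key=step_loss)

print(grad, best, step_loss(best))
```

Gradient descent started at `w = 0.0` never moves here, while even a dumb enumeration finds a zero-loss weight. Enumeration obviously does not scale, which is why a good discrete optimizer would be nice.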
I think something with more uniform training and inference setups, and otherwise equally hardware friendly, just as easily trainable, and equally expressive could replace transformers.
Which fields have they completely transformed? How was it before and how is it now? I won't pretend like it hasn't impacted my field, but I would say the impact is almost entirely negative.
Everyone who did NLP research or product discovery in the past 5 years had to pivot real hard to salvage their shit post-transformers. They're very disruptively good at most NLP tasks.
edit: post-transformers meaning "in the era after transformers were widely adopted" not some mystical new wave of hypothetical tech to disrupt transformers themselves.
in the super public consumer space, search engines / answer engines (like chatgpt) are the big ones.
on the other hand it's also led to improvements in many places hidden behind the scenes. for example, vision transformers are much more powerful and scalable than many of the other computer vision models which has probably led to new capabilities.
in general, transformers aren't just "generate text" but it's a new foundational model architecture which enables a leap step in many things which require modeling!
In computer vision transformers have basically taken over most perception fields. If you look at paperswithcode benchmarks it’s common to find like 10/10 recent winners being transformer based against common CV problems. Note, I’m not talking about VLMs here, just small ViTs with a few million parameters. YOLOs and other CNNs are still hanging around for detection but it’s only a matter of time.
Spam detection and phishing detection are completely different than 5 years ago, as one cannot rely on typos and grammar mistakes to identify bad content.
> think the valuable idea is probabilistic graphical models- of which transformers is an example- combining probability with sequences, or with trees and graphs- is likely to continue to be a valuable area for research exploration for the foreseeable future.
As somebody who was a biiiiig user of probabilistic graphical models, and felt kind of left behind in this brave new world of stacked nets, I would love for my prior knowledge and experience to become valuable for a broader set of problem domains. However, I don't see it yet. Hope you are right!
+1, I am also a big user of PGMs, and also a big user of transformers, and I don't know what the parent comment is talking about, beyond that for e.g. LLMs, sampling the next token can be thought of as sampling from a conditional distribution (of the next token, given previous tokens). However, this connection of using transformers to sample from conditional distributions is about autoregressive generation and training using next-token prediction loss, not about the transformer architecture itself, which mostly seems to be good because it is expressive and scalable (i.e. can be hardware-optimized).
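To make the "sampling from a conditional distribution" framing concrete, here is a minimal sketch. `toy_logits` is a hypothetical stand-in for the model; anything that maps a prefix to one logit per token would slot in, transformer or not, which is the point of the comment above.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(prefix, vocab_size):
    """Hypothetical model: mildly prefers the token after the last one seen."""
    last = prefix[-1] if prefix else 0
    return [1.0 if tok == (last + 1) % vocab_size else 0.0 for tok in range(vocab_size)]

def generate(logits_fn, prefix, steps, vocab_size, rng):
    """Repeatedly sample x_t from p(x_t | x_<t) and append it to the prefix."""
    tokens = list(prefix)
    for _ in range(steps):
        probs = softmax(logits_fn(tokens, vocab_size))
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens

print(generate(toy_logits, [0], steps=8, vocab_size=5, rng=random.Random(0)))
```

The loop only ever touches the model through `logits_fn`, so the autoregressive factorization lives in the training objective and the sampling loop rather than in the architecture.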
Source: I am a PhD student, this is kinda my wheelhouse
Don't give up on older stuff just because deep learning went in a different direction. It's a perfect time to recombine the new with the old. I started DuckDuckGoing and found combinations of ("deep learning" or "neural networks") with ("gaussian," "clustering," "support vector machines," "markov," "probabilistic graphical models").
I haven't actually read these to see if they achieved anything. I'm just sharing the results from a quick search in your sub-field in case it helps you PGM folks.
I have my own probabilistic hyper-graph model which I have never written down in an article to share. You see people converging on this idea all over if you’re looking for it.
Yeah I think this is definitely the future. Recently, I too have spent considerable time on probabilistic hyper-graph models in certain domains of science. Maybe it _is_ the next big thing.
> I think the valuable idea is probabilistic graphical models- of which transformers is an example- combining probability with sequences, or with trees and graphs- is likely to continue to be a valuable area
I agree. Causal inference and symbolic reasoning would be SUPER juicy nuts to crack, more so than what we got from transformers.
In Explainable AI and hybrid studies, many people are combining multiple methods, with one doing unsupervised learning or generation but trained or analyzed by an explainable model. Try that.
In biology, PGMs were one of the first successful forms of "machine learning"- given a large set of examples, train a graphical model using probabilities using EM, and then pass many more examples through the model for classification. The HMM for proteins is pretty straightforward, basically just a probabilistic extension of using dynamic programming to do string alignment.
My perspective- which is a massive simplification- is that sequence models are a form of graphical model, although the graphs tend to be fairly "linear" and the predictions generate sequences (lists) rather than trees or graphs.
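For anyone who hasn't seen the HMM-as-dynamic-programming idea, here is a sketch of Viterbi decoding on a toy two-state model. Note this is decoding only, not the EM training step, and it is not a profile HMM as used for protein families. The states, observation symbols, and probabilities are all made up for illustration.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state path and its log-probability."""
    # best[t][s]: best log-probability of any path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Hypothetical model: "h" = hydrophobic residue, "p" = polar residue.
states = ("helix", "coil")
log_start = logs({"helix": 0.5, "coil": 0.5})
log_trans = {"helix": logs({"helix": 0.9, "coil": 0.1}),
             "coil": logs({"helix": 0.1, "coil": 0.9})}
log_emit = {"helix": logs({"h": 0.8, "p": 0.2}),
            "coil": logs({"h": 0.2, "p": 0.8})}

path, score = viterbi("hhhppp", states, log_start, log_trans, log_emit)
print(path, score)
```

The table `best` is filled left to right exactly like an alignment matrix, with log-probabilities in place of match and gap scores. That is the "fairly linear graph" structure mentioned above.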
It's got nothing to do with PGMs. However, there is the flavor of describing graph structure by soft edge weights vs. hard/pruned edge connections. It's not that surprising that one does better than the other, and it's a very obvious and classical idea. For a time there were people working on NN structure learning and this is a natural step. I don't think there is any breakthrough here, other than that computation power caught up to make it feasible.
> Now, as CTO and co-founder of Tokyo-based Sakana AI, Jones is explicitly abandoning his own creation. "I personally made a decision in the beginning of this year that I'm going to drastically reduce the amount of time that I spend on transformers," he said. "I'm explicitly now exploring and looking for the next big thing."
So, this is really just a BS hype talk. This is just trying to get more funding and VCs.
Why wouldn't this both be an attempt to get funding and also him wanting to do something new? Certainly if he was wanting to do something new he'd want it funded, too?
Ideal architecture would be the one you can patent.
Imagine if transformer architecture was patented. Imagine how much innovation patent system would generate - because that’s why it exists in the first place, right?
It’s not patented and you see how much harm it creates? Nobody knows about it, AI winter is in full swing.
He sounds a lot like how some people behave when they reach a "top". Suddenly that thing seems unworthy. It's one of the reasons you'll see your favorite music artist totally go a different direction on their next album. It's an artistic process almost. There's a core arrogance involved, that you were responsible for the outcome and can easily create another great outcome.
Many researchers who invent something new and powerful pivot quickly to something new. That's because they're researchers, and the incentive is to develop new things that subsume the old things. Other researchers will continue to work on improving existing things and finding new applications to existing problems, but they rarely get as much attention as the folks who "discover" something new.
Why "arrogance"? There are music artists that truly enjoy making music and don't just see their purpose in maximizing financial success and fan service?
There are other considerations that don't revolve around money, but I feel it's arrogant to assume success is the only motivation for musicians.
When you're overpressured to succeed, it makes a lot of sense to switch up your creative process in hopes of getting something new or better.
It doesn't mean that you'll get good results by abandoning prior art, either with LLMs or musicians. But it does signal a sort of personal stress and insecurity, for sure.
It's also plausible that the research field attracts people who want to explore the cutting edge, and now that transformers are no longer "that"... he wants to find something novel.
Or a core fear, that you'll never do something as good in the same vein as the smash hit you already made, so you strike off in a completely different direction.
Haha, I like to joke that we were on track for the singularity in 2024, but it stalled because the research time gap between "profitable" and "recursive self-improvement" was just a bit too long, so we're now stranded on the transformer model for the next two decades until every last cent has been extracted from it.
There's a massive hardware and energy infra build-out going on. None of that is specialized to run only transformers at this point, so wouldn't that create a huge incentive to find newer and better architectures to get the most out of all this hardware and energy infra?
What "AI" means for most people is the software product they see, but only a part of it is the underlying machine learning model. Each foundation model receives additional training from thousands of humans, often very lowly paid, and then many prompts are used to fine-tune it all. It's 90% product development, not ML research.
If you look at AI research papers, most of them are by people trying to earn a PhD so they can get a high-paying job. They demonstrate an ability to understand the current generation of AI and tweak it, they create content for their CVs.
There is actual research going on, but it's a tiny share of everything, and it does not look impressive because it's not a product, or a demo, but an experiment.
It's difficult to do because of how well matched they are to the hardware we have. They were partially designed to solve the mismatch between RNNs and GPUs, and they are way too good at it. If you come up with something truly new, it's quite likely you have to influence hardware makers to help scale your idea. That makes any new idea fundamentally coupled to hardware, and that's the lesson we should be taking from this. Work on the idea as a simultaneous synthesis of hardware and software. But, it also means that fundamental change is measured in decade scales.
I get the impulse to do something new, to be radically different and stand out, especially when everyone is obsessing over it, but we are going to be stuck with transformers for a while.
This is backwards. Algorithms that can be parallelized are inherently superior, independent of the hardware. GPUs were built to take advantage of the superiority and handle all kinds of parallel algorithms well - graphics, scientific simulation, signal processing, some financial calculations, and on and on.
There’s a reason so much engineering effort has gone into speculative execution, pipelining, multicore design etc - parallelism is universally good. Even when “computers” were human calculators, work was divided into independent chunks that could be done simultaneously. The efficiency comes from the math itself, not from the hardware it happens to run on.
RNNs are not parallelizable by nature. Each step depends on the output of the previous one. Transformers removed that sequential bottleneck.
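A minimal sketch of the dependency structure, using scalars so it stays readable. Assumptions: one layer, identity projections (query = key = value = the input itself), and the training-time setting where the whole sequence is already known. Function names and constants are hypothetical.

```python
import math

def rnn_states(xs, w=0.5, u=1.0):
    """Each hidden state needs the previous one, so the loop cannot be reordered."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * h + u * x)
        states.append(h)
    return states

def attention_output(xs, i):
    """Causal self-attention output at position i with scalar q, k, v.

    It reads only the inputs xs[0..i], never another position's output.
    """
    scores = [xs[i] * xs[j] for j in range(i + 1)]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * xs[j] for j, w in enumerate(weights)) / total

xs = [0.1, -0.4, 0.7, 0.2]

# Attention positions can be evaluated in any order, or all at once on parallel hardware.
in_order = [attention_output(xs, i) for i in range(len(xs))]
reversed_order = [attention_output(xs, i) for i in reversed(range(len(xs)))][::-1]

print(rnn_states(xs))
print(in_order == reversed_order)
```

This only covers a single layer over a known sequence. Autoregressive generation is still one token at a time for both, since each new token depends on the previously sampled ones.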
There are large, large gaps of parallel stuff that GPUs can't do fast. Anything sparse (or even just shuffled) is one example. There are lots of architectures that are theoretically superior but aren't popular due to not being GPU friendly.
>The project, he said, was "very organic, bottom up," born from "talking over lunch or scrawling randomly on the whiteboard in the office."
Many breakthrough, game-changing inventions came about this way, through back-of-the-envelope discussions; another popular example is Ethernet.
Some good stories of a similar culture at AT&T Bell Labs are well described in Hamming's book [1].
[1] Stripe Press The Art of Doing Science and Engineering:
All transformative inventions and innovations seem to come from similar scenarios like "I was playing around with these things" or "I just met X at lunch and we discussed ...".
I'm wondering how big an impact work from home will really have on humanity in general, when so many of our life-changing discoveries come from the odd chance of two specific people happening to be in the same place at some moment in time.
What you say is true, but let’s not forget that Ken Thompson did the first version of unix in 3 weeks while his wife had gone to California with their child to visit relatives, so deep focus is important too.
It seems, in those days, people at Bell Labs did get the best of both worlds: being able to have chance encounters with very smart people while also being able to just be gone for weeks to work undistracted.
A dream job that probably didn’t even feel like a job (at least that’s the impression I get from hearing Thompson talk about that time).
I'd go back to the office in a heartbeat provided it was an actual office, and not an "open-office" layout where people are forced to try to concentrate with all the noise and people passing behind them constantly.
The agile treadmill (with PM's breathing down our necks) and features getting planned and delivered in 2 week-sprints, has also reduced our ability to just do something we feel needs getting done. Today you go to work to feed several layers of incompetent managers - there is no room for play, or for creativity. At least in most orgs I know.
I think innovation (or even joy of being at work) needs more than just the office, or people, or a canteen, but an environment that supports it.
And it has always felt to me that it has lineage from the Neural Turing Machine line of work as a prior. The transformative part was: 1. find a good task (machine translation) and a reasonable way to stack (encoder-decoder architecture); 2. run the experiment; 3. ditch the external KV store idea and just use self-projected KV.
"""Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout,[11] and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF.[9]"""
I have a feeling there is more research being done on non-transformer based architectures now, not less. The tsunami of money pouring in to make the next chatbot powered CRM doesn’t care about that though, so it might seem to be less.
I would also just fundamentally disagree with the assertion that a new architecture will be the solution. We need better methods to extract more value from the data that already exists. Ilya Sutskever talked about this recently. You shouldn’t need the whole internet to get to a decent baseline. And that new method may or may not use a transformer, I don’t think that is the problem.
I think you misunderstood the article a bit by saying that the assertion is "that a new architecture will be the solution". That's not the assertion. It's simply a statement about the lack of balance between exploration and exploitation. And the desire to rebalance it. What's wrong with that?
It looks like almost every AI researcher and lab who existed pre-2017 is now focused on transformers somehow. I agree the total number of researchers has increased, but I suspect the ratio has moved faster, so there are now fewer total non-transformer researchers.
Well, we also still use wheels despite them being invented thousands of years ago. We have added tons of improvements on top though, just as transformers have. The fact that wheels perform poorly in mud doesn’t mean you throw out the concept of wheels. You add treads to grip the ground better.
If you check the DeepSeek OCR paper it shows text based tokenization may be suboptimal. Also all of the MoE stuff, reasoning, and RLHF. The 2017 paper is pretty primitive compared to what we have now.
Something which I haven't been able to fully parse that perhaps someone has better insight into: aren't transformers inherently only capable of inductive reasoning? In order to actually progress to AGI, which is being promised at least as an eventuality, don't models have to be capable of deduction? Wouldn't that mean fundamentally changing the pipeline in some way? And no, tools are not deduction. They are useful patches for the lack of deduction.
Models need to move beyond the domain of parsing existing information into existing ideas.
That sounds like a category mistake to me. A proof assistant or logic-programming system performs deduction, and just strapping one of those to an LLM hasn't gotten us to "AGI".
I don't see any reason to think that transformers are not capable of deductive reasoning. Stochasticity doesn't rule out that ability. It just means the model might be wrong in its deduction, just like humans are sometimes wrong.
But it can't actually deduce, can it? If 136891438 * 1294538 isn't in the training data, it won't be able to give you a valid answer using the model itself. There's no process. It has to offload that task to a tool, which will then calculate and return.
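For reference, the kind of explicit "process" being talked about here is schoolbook long multiplication on digits. A sketch is below; it only shows what the deterministic procedure looks like and says nothing either way about whether a transformer can learn or execute it internally.

```python
def long_multiply(a, b):
    """Schoolbook multiplication on decimal digits, for non-negative integers."""
    xs = [int(d) for d in reversed(str(a))]  # least significant digit first
    ys = [int(d) for d in reversed(str(b))]
    result = [0] * (len(xs) + len(ys))
    for i, x in enumerate(xs):
        carry = 0
        for j, y in enumerate(ys):
            total = result[i + j] + x * y + carry
            result[i + j] = total % 10
            carry = total // 10
        # This slot has not been written yet, and carry is at most 9.
        result[i + len(ys)] += carry
    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return int(digits or "0")

# Cross-check against Python's built-in arbitrary-precision integers.
print(long_multiply(136891438, 1294538) == 136891438 * 1294538)
```

Every step is a table lookup plus a carry, applied a number of times that depends on the length of the inputs.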
Further, any offloading needs to be manually defined at some point. You could maybe give it a way to define its own tools, but even then they would still be defined by what has come before.
[1] https://arxiv.org/abs/2309.17453
[2] https://arxiv.org/abs/2410.01104
[3] https://arxiv.org/abs/2505.17190
[4] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5240330
[5] https://arxiv.org/abs/2406.04267
[6] https://arxiv.org/abs/2410.23506
[7] https://arxiv.org/abs/2508.21038
Defenders are supposed to defend against attacks on AI, but here it misfired, so the conversation should be interesting.
That's because the defender is actually a skeptic of AI. But the first sentence sounded like a typical "nothing to see here" defense of AI.
Simultaneously discovering and leveraging the functional nature of language seems like kind of a big deal.
Takes like this are utterly insane to me
quite
the transformer innovation was to bring down the cost of producing incorrect, but plausible looking content (slop) in any modality to near zero
not a positive thing for anyone other than spammers
I haven't actually read these to see if they achieved anything. I'm just sharing the results from a quick search in your sub-field in case it helps you PGM folks.
https://arxiv.org/abs/2104.12053
https://pmc.ncbi.nlm.nih.gov/articles/PMC7831091/
And here's an intro for those wondering what PGM is:
https://arxiv.org/abs/2507.17116
Wish there were more hours in the day.
Having done my PhD in probabilistic programming... what?
We need more patents everywhere.
Edit: there is a cult around transformers.
/s
Arrogance would be if he explicitly chose to abandon it because he thought he was better.
isn't this what [etched](https://www.etched.com/) is doing?
[1] Stripe Press The Art of Doing Science and Engineering:
https://press.stripe.com/the-art-of-doing-science-and-engine...
Related thread: https://threadreaderapp.com/thread/1864023344435380613.html
According to various stories pieced together, the ideas of 4 of Pixar’s early hits were conceived on or around one lunch.
Bug’s Life, Wall-E, Monsters, Inc
Like how humans think about things and learn, which may require something different from feeding the internet in to pre-train your transformer.