'Attention is all you need' coauthor says he's 'sick' of transformers

dekhn · 5 months ago

The way I look at transformers is: they have been one of the most fertile inventions in recent history. Originally released in 2017, in the subsequent 8 years they completely transformed (heh) multiple fields, and at least partially led to one Nobel prize.

realistically, I think the valuable idea is probabilistic graphical models- of which transformers is an example- combining probability with sequences, or with trees and graphs- is likely to continue to be a valuable area for research exploration for the foreseeable future.

samsartor · 5 months ago

I'm skeptical that we'll see a big breakthrough in the architecture itself. As sick as we all are of transformers, they are really good universal approximators. You can get some marginal gains, but how much more _universal_ are you realistically going to get? I could be wrong, and I'm glad there are researchers out there looking at alternatives like graphical models, but for my money we need to look further afield. Reconsider the auto-regressive task, cross entropy loss, even gradient descent optimization itself.

kingstnap · 5 months ago

There are many many problems with attention.

The softmax has issues regarding attention sinks [1]. The softmax also causes sharpness problems [2]. In general, the decision boundary being Euclidean dot products isn't actually optimal for everything; there are many classes of problem where you want polyhedral cones [3]. Positional embeddings are also janky af and so is RoPE tbh; I think Canon layers are a more promising alternative for horizontal alignment [4].
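For anyone who hasn't run into the sink issue from [1], here is a rough sketch in plain Python of the underlying constraint: softmax weights always sum to 1, so a query cannot attend to "nothing". The scores are made up for illustration and this is not the setup from the paper, only the normalization property it builds on.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores (query . key) for one query over four keys.
# None of the keys is a good match: every score is strongly negative.
scores = [-9.0, -9.5, -10.0, -9.2]
weights = softmax(scores)

# Softmax still hands out a full unit of attention mass across the bad matches.
print(weights, sum(weights))

# Prepending one extra key with a modest score (a "sink") absorbs almost all
# of that mass, leaving the poorly matching keys with weights near zero.
sink_weights = softmax([0.0] + scores)
print(sink_weights[0])
```

In this toy the extra key plays the role of a dumping ground for attention that has nowhere useful to go.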

I still think there is plenty of room to improve these things. But a lot of focus right now is unfortunately being spent on benchmaxxing using flawed benchmarks that can be hacked with memorization. I think a really promising and underappreciated direction is synthetically coming up with ideas and tests that mathematically do not work well and proving that current architectures struggle with them. A great example of this is the VITs need glasses paper [5], or belief state transformers with their star task [6]. The Google one about the limits of embedding dimensions is also great and shows how the dimension of the QK part is actually important to getting good retrieval [7].

[1] https://arxiv.org/abs/2309.17453

[2] https://arxiv.org/abs/2410.01104

[3] https://arxiv.org/abs/2505.17190

[4] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5240330

[5] https://arxiv.org/abs/2406.04267

[6] https://arxiv.org/abs/2410.23506

[7] https://arxiv.org/abs/2508.21038

mxkopy · 5 months ago

I agree, gradient descent implicitly assumes things have a meaningful gradient, which they don’t always. And even if we say anything can be approximated by a continuous function, we’re learning we don’t like approximations in our AI. Some discrete alternative of SGD would be nice.

eldenring · 5 months ago

I think something with more uniform training and inference setups, and otherwise equally hardware friendly, just as easily trainable, and equally expressive could replace transformers.

krychu · 5 months ago

BDH

jimbo808 · 5 months ago

Which fields have they completely transformed? How was it before and how is it now? I won't pretend like it hasn't impacted my field, but I would say the impact is almost entirely negative.

isoprophlex · 5 months ago

Everyone who did NLP research or product discovery in the past 5 years had to pivot real hard to salvage their shit post-transformers. They're very disruptively good at most NLP tasks.

edit: post-transformers meaning "in the era after transformers were widely adopted" not some mystical new wave of hypothetical tech to disrupt transformers themselves.

jimmyl02 · 5 months ago

in the super public consumer space, search engines / answer engines (like chatgpt) are the big ones.

on the other hand it's also led to improvements in many places hidden behind the scenes. for example, vision transformers are much more powerful and scalable than many of the other computer vision models which has probably led to new capabilities.

in general, transformers aren't just "generate text" but it's a new foundational model architecture which enables a leap step in many things which require modeling!

dekhn · 5 months ago

Genomics, protein structure prediction, various forms of small molecule and large molecule drug discovery.

CHY872 · 5 months ago

In computer vision transformers have basically taken over most perception fields. If you look at paperswithcode benchmarks it’s common to find like 10/10 recent winners being transformer based against common CV problems. Note, I’m not talking about VLMs here, just small ViTs with a few million parameters. YOLOs and other CNNs are still hanging around for detection but it’s only a matter of time.

Profan · 5 months ago

hah well, transformative doesn't necessarily mean positive!

EGreg · 5 months ago

AI fan (type 1 -- AI made a big breakthrough) meets AI defender (type 2 -- AI has not fundamentally changed anything that was already a problem).

Defenders are supposed to defend against attacks on AI, but here it misfired, so the conversation should be interesting.

That's because the defender is actually a skeptic of AI. But the first sentence sounded like a typical "nothing to see here" defense of AI.

warkdarrior · 5 months ago

Spam detection and phishing detection are completely different than 5 years ago, as one cannot rely on typos and grammar mistakes to identify bad content.

jonas21 · 5 months ago

Out of curiosity, what field are you in?

CamperBob2 · 5 months ago

> Which fields have they completely transformed?

Simultaneously discovering and leveraging the functional nature of language seems like kind of a big deal.

mountainriver · 5 months ago

Software, and it’s wildly positive.

Takes like this are utterly insane to me

blibble · 5 months ago

> but I would say the impact is almost entirely negative.

quite

the transformer innovation was to bring down the cost of producing incorrect, but plausible looking content (slop) in any modality to near zero

not a positive thing for anyone other than spammers

epistasis · 5 months ago

> think the valuable idea is probabilistic graphical models- of which transformers is an example- combining probability with sequences, or with trees and graphs- is likely to continue to be a valuable area for research exploration for the foreseeable future.

As somebody who was a biiiiig user of probabilistic graphical models, and felt kind of left behind in this brave new world of stacked nets, I would love for my prior knowledge and experience to become valuable for a broader set of problem domains. However, I don't see it yet. Hope you are right!

cauliflower2718 · 5 months ago

+1, I am also a big user of PGMs, and also a big user of transformers, and I don't know what the parent comment is talking about, beyond that for e.g. LLMs, sampling the next token can be thought of as sampling from a conditional distribution (of the next token, given previous tokens). However, this connection of using transformers to sample from conditional distributions is about autoregressive generation and training using next-token prediction loss, not about the transformer architecture itself, which mostly seems to be good because it is expressive and scalable (i.e. can be hardware-optimized).
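To make the distinction concrete, here is a minimal sketch of autoregressive sampling where the "model" is just a hand-written lookup, not a transformer. The function `next_token_distribution` and its tiny vocabulary are hypothetical; the point is only that the sampling loop and the chain-rule factorization work with any function that returns a conditional distribution.

```python
import random

# Hypothetical toy "model": maps a context (tuple of previous tokens) to a
# conditional distribution over the next token. An LLM would compute this
# with a transformer, but the loop below does not care how it is computed.
def next_token_distribution(context):
    if not context:
        return {"the": 0.6, "a": 0.4}
    if context[-1] in ("the", "a"):
        return {"cat": 0.5, "dog": 0.5}
    return {"<eos>": 1.0}

def sample_sequence(rng, max_len=10):
    """Sample tokens one at a time, each conditioned on all previous tokens."""
    tokens = []
    while len(tokens) < max_len:
        dist = next_token_distribution(tuple(tokens))
        words = list(dist)
        probs = [dist[w] for w in words]
        token = rng.choices(words, weights=probs, k=1)[0]
        if token == "<eos>":
            break
        tokens.append(token)
    return tokens

def sequence_probability(tokens):
    """Chain rule: p(x_1..x_n) = prod_t p(x_t | x_<t), times p(<eos> | x_1..x_n)."""
    p = 1.0
    for t, token in enumerate(tokens):
        p *= next_token_distribution(tuple(tokens[:t])).get(token, 0.0)
    p *= next_token_distribution(tuple(tokens)).get("<eos>", 0.0)
    return p

rng = random.Random(0)
print(sample_sequence(rng))
print(sequence_probability(["the", "cat"]))
```

Swapping the lookup for a neural network changes how the distribution is produced, not the probabilistic structure of the generation loop.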

Source: I am a PhD student, this is kinda my wheelhouse

nickpsecurity · 5 months ago

Don't give up on older stuff just because deep learning went in a different direction. It's a perfect time to recombine the new with the old. I started DuckDuckGoing and found combinations of ("deep learning" or "neural networks") with ("gaussian," "clustering," "support vector machines," "markov," "probabilistic graphical models").

I haven't actually read these to see if they achieved anything. I'm just sharing the results from a quick search in your sub-field in case it helps you PGM folks.

https://arxiv.org/abs/2104.12053

https://pmc.ncbi.nlm.nih.gov/articles/PMC7831091/

And here's an intro for those wondering what PGM is:

https://arxiv.org/abs/2507.17116

AaronAPU · 5 months ago

I have my own probabilistic hyper-graph model which I have never written down in an article to share. You see people converging on this idea all over if you’re looking for it.

Wish there were more hours in the day.

rbartelme · 5 months ago

Yeah I think this is definitely the future. Recently, I too have spent considerable time on probabilistic hyper-graph models in certain domains of science. Maybe it _is_ the next big thing.

hammock · 5 months ago

> I think the valuable idea is probabilistic graphical models- of which transformers is an example- combining probability with sequences, or with trees and graphs- is likely to continue to be a valuable area

I agree. Causal inference and symbolic reasoning would be SUPER juicy nuts to crack, more so than what we got from transformers.

nickpsecurity · 5 months ago

In Explainable AI and hybrid studies, many people are combining multiple methods, with one doing unsupervised learning or generation but trained or analyzed by an explainable model. Try that.

eli_gottlieb · 5 months ago

> probabilistic graphical models- of which transformers is an example

Having done my PhD in probabilistic programming... what?

dekhn · 5 months ago

I was talking about things inspired by (for example) hidden markov models. See https://en.wikipedia.org/wiki/Graphical_model

In biology, PGMs were one of the first successful forms of "machine learning"- given a large set of examples, train a graphical model using probabilities using EM, and then pass many more examples through the model for classification. The HMM for proteins is pretty straightforward, basically just a probabilistic extension of using dynamic programming to do string alignment.
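As a rough illustration of the dynamic-programming flavor described above, here is a sketch of Viterbi decoding for a tiny two-state HMM. It is not a profile HMM and it skips the EM training step entirely; the states, residues, and probabilities are invented for the example, and all probabilities are assumed to be strictly positive so the logs are defined.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an HMM, computed in log space."""
    # best[t][s] = log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Made-up model: "core" positions mostly emit leucine (L),
# "surface" positions mostly emit lysine (K).
states = ("core", "surface")
start_p = {"core": 0.5, "surface": 0.5}
trans_p = {"core": {"core": 0.8, "surface": 0.2},
           "surface": {"core": 0.2, "surface": 0.8}}
emit_p = {"core": {"L": 0.9, "K": 0.1},
          "surface": {"L": 0.1, "K": 0.9}}

print(viterbi("LLLKKK", states, start_p, trans_p, emit_p))
```

The table fill has the same shape as a string-alignment DP: each cell takes the best predecessor plus a local score.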

My perspective- which is a massive simplification- is that sequence models are a form of graphical model, although the graphs tend to be fairly "linear" and the predictions generate sequences (lists) rather than trees or graphs.

pishpash · 5 months ago

It's got nothing to do with PGMs. However, there is the flavor of describing graph structure by soft edge weights vs. hard/pruned edge connections. It's not that surprising that one does better than the other, and it's a very obvious and classical idea. For a time there were people working on NN structure learning and this is a natural step. I don't think there is any breakthrough here, other than that computation power caught up to make it feasible.

cyanydeez · 5 months ago

Cancer is also fertile. It's more addiction than revolution, I'm afraid.

pigeons · 5 months ago

Not doubting in any way, but what are some fields it transformed?

bangaladore · 5 months ago

> Now, as CTO and co-founder of Tokyo-based Sakana AI, Jones is explicitly abandoning his own creation. "I personally made a decision in the beginning of this year that I'm going to drastically reduce the amount of time that I spend on transformers," he said. "I'm explicitly now exploring and looking for the next big thing."

So, this is really just a BS hype talk. This is just trying to get more funding and VCs.

brandall10 · 5 months ago

Attention is all he needs.

osener · 5 months ago

Reminds me of the headline I saw a long time ago: “50 years later, inventor of the pixel says he’s sorry that he made it square.”

LogicFailsMe · 5 months ago

Sadly, he probably needs a lot more or he's gonna go all Maslow...

elicash · 5 months ago

Why wouldn't this both be an attempt to get funding and also him wanting to do something new? Certainly if he was wanting to do something new he'd want it funded, too?

energy123 · 5 months ago

It's also how curious scientists operate, they're always itching for something creative and different.

IncreasePosts · 5 months ago

It would be hype talk if he said "and my next big thing is X."

bangaladore · 5 months ago

Well, that's why he needs funding. Hasn't figured out what the next big thing is.

cheschire · 5 months ago

Well he got your attention didn't he?

htrp · 5 months ago

anyone know what they're trying to sell here?

gwbas1c · 5 months ago

The ability to do original, academic research without the pressure to build something marketable.

aydyn · 5 months ago

probably AI

mirekrusin · 5 months ago

Ideal architecture would be the one you can patent.

Imagine if transformer architecture was patented. Imagine how much innovation patent system would generate - because that’s why it exists in the first place, right?

It’s not patented and you see how much harm it creates? Nobody knows about it, AI winter is in full swing.

We need more patents everywhere.

password54321 · 5 months ago

If it was about money it would probably be easier to double down on something proven to make revenue rather than something that doesn't even exist.

Edit: there is a cult around transformers.

YC3498723984327 · 5 months ago

His AI company is called "Fish AI"?? Does it mean their AI will have the intelligence of a fish?

astrange · 5 months ago

It's about collective intelligence, as seen in swarms of ants or fish.

v3ss0n · 5 months ago

Or Fishy?

bangaladore · 5 months ago

Without transformers, maybe.

/s

ivape · 5 months ago

He sounds a lot like how some people behave when they reach a "top". Suddenly that thing seems unworthy all of a sudden. It's one of the reasons you'll see your favorite music artist totally go a different direction on their next album. It's an artistic process almost. There's a core arrogance involved, that you were responsible for the outcome and can easily create another great outcome.

dekhn · 5 months ago

Many researchers who invent something new and powerful pivot quickly to something new. That's because they're researchers, and the incentive is to develop new things that subsume the old things. Other researchers will continue to work on improving existing things and finding new applications to existing problems, but they rarely get as much attention as the folks who "discover" something new.

moritzwarhier · 5 months ago

Why "arrogance"? There are music artists that truly enjoy making music and don't just see their purpose in maximizing financial success and fan service?

There are other considerations that don't revolve around money, but I feel it's arrogant to assume success is the only motivation for musicians.

dmix · 5 months ago

That’s just normal human behaviour to have evolving interests

Arrogance would be if he explicitly chose to abandon it because he thought he was better

bigyabai · 5 months ago

When you're overpressured to succeed, it makes a lot of sense to switch up your creative process in hopes of getting something new or better.

It doesn't mean that you'll get good results by abandoning prior art, either with LLMs or musicians. But it does signal a sort of personal stress and insecurity, for sure.

toxic72 · 5 months ago

It's also plausible that the research field attracts people who want to explore the cutting edge and now that transformers are no longer "that"... he wants to find something novel.

ambicapter · 5 months ago

Or a core fear, that you'll never do something as good in the same vein as the smash hit you already made, so you strike off in a completely different direction.

Mistletoe · 5 months ago

Sometimes it just turns out like Michael Jordan playing baseball.

Xcelerate · 5 months ago

Haha, I like to joke that we were on track for the singularity in 2024, but it stalled because the research time gap between "profitable" and "recursive self-improvement" was just a bit too long, so we're now stranded on the transformer model for the next two decades until every last cent has been extracted from it.

ai-christianson · 5 months ago

There's a massive hardware and energy infra buildout going on. None of that is specialized to run only transformers at this point, so wouldn't that create a huge incentive to find newer and better architectures to get the most out of all this hardware and energy infra?

Mehvix · 5 months ago

>None of that is specialized to run only transformers at this point

isn't this what etched (https://www.etched.com/) is doing?

Davidzheng · 5 months ago

how do you know we're not at recursive self-improvement but the rate is just slower than human-mediated improvement?

nabla9 · 5 months ago

What "AI" means for most people is the software product they see, but only a part of it is the underlying machine learning model. Each foundation model receives additional training from thousands of humans, often very lowly paid, and then many prompts are used to fine-tune it all. It's 90% product development, not ML research.

If you look at AI research papers, most of them are by people trying to earn a PhD so they can get a high-paying job. They demonstrate an ability to understand the current generation of AI and tweak it, they create content for their CVs.

There is actual research going on, but it's a tiny share of everything, and it does not look impressive because it's not a product or a demo, but an experiment.

tippytippytango · 5 months ago

It's difficult to do because of how well matched they are to the hardware we have. They were partially designed to solve the mismatch between RNNs and GPUs, and they are way too good at it. If you come up with something truly new, it's quite likely you have to influence hardware makers to help scale your idea. That makes any new idea fundamentally coupled to hardware, and that's the lesson we should be taking from this. Work on the idea as a simultaneous synthesis of hardware and software. But, it also means that fundamental change is measured in decade scales.

I get the impulse to do something new, to be radically different and stand out, especially when everyone is obsessing over it, but we are going to be stuck with transformers for a while.

danielmarkbruce · 5 months ago

This is backwards. Algorithms that can be parallelized are inherently superior, independent of the hardware. GPUs were built to take advantage of the superiority and handle all kinds of parallel algorithms well - graphics, scientific simulation, signal processing, some financial calculations, and on and on.

There’s a reason so much engineering effort has gone into speculative execution, pipelining, multicore design etc - parallelism is universally good. Even when “computers” were human calculators, work was divided into independent chunks that could be done simultaneously. The efficiency comes from the math itself, not from the hardware it happens to run on.

RNNs are not parallelizable by nature. Each step depends on the output of the previous one. Transformers removed that sequential bottleneck.
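A toy sketch of the dependency structure being described, using scalars instead of vectors so it runs in plain Python. The weights `w` and `u` and the choice of q = k = v = x are simplifying assumptions for illustration, not how a real layer is parameterized.

```python
import math

def rnn_states(xs, w=0.5, u=1.0):
    """Scalar RNN: h_t depends on h_{t-1}, so the loop order is forced."""
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w * h + u * x)
        out.append(h)
    return out

def attention_output(xs, i):
    """Causal self-attention output at position i (scalar q = k = v = x).

    It reads only the inputs xs[0..i], never another position's output,
    so each position can be computed independently of the others.
    """
    scores = [xs[i] * xs[j] for j in range(i + 1)]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(a * xs[j] for j, a in enumerate(weights)) / total

xs = [0.1, -0.4, 0.7, 0.2]

# Attention: evaluating positions in reverse order gives identical results,
# which is the property that lets hardware compute all positions at once.
forward = [attention_output(xs, i) for i in range(len(xs))]
backward = [attention_output(xs, i) for i in reversed(range(len(xs)))][::-1]
print(forward == backward)

# RNN: there is no way to get the last state without computing the earlier ones.
print(rnn_states(xs))
```

This independence applies when the whole input sequence is known up front, as in training; token-by-token generation is still sequential for both.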

Scene_Cast2 · 5 months ago

There are large, large gaps of parallel stuff that GPUs can't do fast. Anything sparse (or even just shuffled) is one example. There are lots of architectures that are theoretically superior but aren't popular due to not being GPU friendly.

anticensor · 5 months ago

When you consider hardware-software co-design, the problem quits being an algorithms problem and becomes a computer engineering problem.

teleforce · 5 months ago

>The project, he said, was "very organic, bottom up," born from "talking over lunch or scrawling randomly on the whiteboard in the office."

Many breakthrough, game-changing inventions came about this way, from back-of-the-envelope discussions; another popular example is Ethernet.

Some good stories of a similar culture at AT&T Bell Labs are well described in Hamming's book [1].

[1] Stripe Press The Art of Doing Science and Engineering:

https://press.stripe.com/the-art-of-doing-science-and-engine...

CaptainOfCoit · 5 months ago

All transformative inventions and innovations seem to come from similar scenarios like "I was playing around with these things" or "I just met X at lunch and we discussed ...".

I'm wondering how big an impact work from home will really have on humanity in general, when so many of our life changing discoveries come from the odd chance of two specific people happening to be in the same place at some moment in time.

fipar · 5 months ago

What you say is true, but let’s not forget that Ken Thompson did the first version of unix in 3 weeks while his wife had gone to California with their child to visit relatives, so deep focus is important too.

It seems, in those days, people at Bell Labs did get the best of both worlds: being able to have chance encounters with very smart people while also being able to just be gone for weeks to work undistracted.

A dream job that probably didn’t even feel like a job (at least that’s the impression I get from hearing Thompson talk about that time).

DyslexicAtheist · 5 months ago

I'd go back to the office in a heartbeat provided it was an actual office. And not an "open-office" layout, where people are forced to try to concentrate with all the noise and people passing behind them constantly.

The agile treadmill (with PMs breathing down our necks) and features getting planned and delivered in 2-week sprints has also reduced our ability to just do something we feel needs getting done. Today you go to work to feed several layers of incompetent managers - there is no room for play, or for creativity. At least in most orgs I know.

I think innovation (or even joy of being at work) needs more than just the office, or people, or a canteen, but an environment that supports it.

tagami · 5 months ago

Perhaps this is why we see AI devotees congregate in places like SF - increased probability

liuliu · 5 months ago

And it has always felt to me that it has lineage from the Neural Turing Machine line of work as prior art. The transformative part was: 1. find a good task (machine translation) and a reasonable way to stack (encoder-decoder architecture); 2. run the experiment; 3. ditch the external KV store idea and just use self-projected KV.

atonse · 5 months ago

True in creativity too.

According to various stories pieced together, the ideas of 4 of Pixar’s early hits were conceived on or around one lunch.

A Bug’s Life, Wall-E, Monsters, Inc.

emi2k01 · 5 months ago

The fourth one is Finding Nemo

bitwize · 5 months ago

One of the OG Unix guys (was it Kernighan?) literally specced out UTF-8 on a cocktail napkin.
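For a sense of how little there is to fit on a napkin, here is a sketch of the UTF-8 byte layout in its modern form, which is limited to four bytes per code point (the original 1992 design allowed longer sequences). The function name `utf8_encode` is just for illustration; in practice Python's built-in `str.encode("utf-8")` does this, and the loop below cross-checks against it.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (modern 1-4 byte form)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# One sample per length class: U+0041, U+00E9, U+20AC, U+1F600.
for ch in "A\u00e9\u20ac\U0001F600":
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # cross-check against the built-in
    print(hex(ord(ch)), encoded.hex())
```

The leading byte announces the sequence length and every continuation byte starts with the bits 10, which is what makes the encoding self-synchronizing and ASCII-compatible.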

dekhn · 5 months ago

Thompson and Pike: https://en.wikipedia.org/wiki/UTF-8

"""Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout,[11] and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF.[9]"""

janalsncm · 5 months ago

I have a feeling there is more research being done on non-transformer based architectures now, not less. The tsunami of money pouring in to make the next chatbot powered CRM doesn’t care about that though, so it might seem to be less.

I would also just fundamentally disagree with the assertion that a new architecture will be the solution. We need better methods to extract more value from the data that already exists. Ilya Sutskever talked about this recently. You shouldn’t need the whole internet to get to a decent baseline. And that new method may or may not use a transformer, I don’t think that is the problem.

marcel-c13 · 5 months ago

I think you misunderstood the article a bit by saying that the assertion is "that a new architecture will be the solution". That's not the assertion. It's simply a statement about the lack of balance between exploration and exploitation. And the desire to rebalance it. What's wrong with that?

fritzo · 5 months ago

It looks like almost every AI researcher and lab who existed pre-2017 is now focused on transformers somehow. I agree the total number of researchers has increased, but I suspect the ratio has moved faster, so there are now fewer total non-transformer researchers.

janalsncm · 5 months ago

Well, we also still use wheels despite them being invented thousands of years ago. We have added tons of improvements on top though, just as transformers have. The fact that wheels perform poorly in mud doesn’t mean you throw out the concept of wheels. You add treads to grip the ground better.

If you check the DeepSeek OCR paper it shows text based tokenization may be suboptimal. Also all of the MoE stuff, reasoning, and RLHF. The 2017 paper is pretty primitive compared to what we have now.

tim333 · 5 months ago

The assertion, or maybe idea, that a new architecture may be the thing is kind of about building AGI rather than chatbots.

Like, the way humans think about things and learn may require something different from feeding the internet in to pre-train your transformer.

mcfry · 5 months ago

Something which I haven't been able to fully parse that perhaps someone has better insight into: aren't transformers inherently only capable of inductive reasoning? In order to actually progress to AGI, which is being promised at least as an eventuality, don't models have to be capable of deduction? Wouldn't that mean fundamentally changing the pipeline in some way? And no, tools are not deduction. They are useful patches for the lack of deduction.

Models need to move beyond the domain of parsing existing information into existing ideas.

eli_gottlieb · 5 months ago

That sounds like a category mistake to me. A proof assistant or logic-programming system performs deduction, and just strapping one of those to an LLM hasn't gotten us to "AGI".

mcfry · 5 months ago

A proof assistant is a verifier, and a tool, so therefore a patch, so I really fail to see how that could be understood as the LLM having deduction.

energy123 · 5 months ago

I don't see any reason to think that transformers are not capable of deductive reasoning. Stochasticity doesn't rule out that ability. It just means the model might be wrong in its deduction, just like humans are sometimes wrong.

mcfry · 4 months ago

But it can't actually deduce, can it? If 136891438 * 1294538 isn't in the training data, it won't be able to give you a valid answer using the model itself. There's no process. It has to offload that task to a tool, which will then calculate and return.

Further, any offloading needs to be manually defined at some point. You could maybe give it a way to define its own tools, but even then they would still be defined by what has come before.
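For reference, the kind of explicit "process" in question for that multiplication is schoolbook long multiplication, sketched below on decimal digits. This only illustrates the procedure a calculator tool would stand in for; it makes no claim either way about what a model can do internally. The helper name `long_multiply` is made up for the example.

```python
def long_multiply(a, b):
    """Schoolbook multiplication of two non-negative ints, digit by digit."""
    xs = [int(d) for d in reversed(str(a))]  # least significant digit first
    ys = [int(d) for d in reversed(str(b))]
    result = [0] * (len(xs) + len(ys))
    for i, x in enumerate(xs):
        carry = 0
        for j, y in enumerate(ys):
            total = result[i + j] + x * y + carry
            result[i + j] = total % 10
            carry = total // 10
        # The slot past this row is still untouched, so the carry fits as a digit.
        result[i + len(ys)] += carry
    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return int(digits or "0")

product = long_multiply(136891438, 1294538)
assert product == 136891438 * 1294538  # cross-check against built-in integers
print(product)
```

Each step is a fixed local rule (multiply two digits, add, carry), applied the same way regardless of whether the specific operands have been seen before.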

hammock · 5 months ago

They can induct, they just can’t generate new ideas. It’s not going to discover a new quark without a human in the loop somewhere

nightshift1 · 5 months ago

maybe that's a good thing after all.