In the podcast interview that Jonathan Ross (now Groq's CEO) did[1], he talked about the creation of the original TPUs, which he built at Google. Apparently it originally started as an FPGA he did in his 20% time because he sat near the team that was having inference speed issues.
They got it working, then Jeff Dean did the math and they decided to do an ASIC.
Now of course Google should spin off the TPU team as a separate company. It's the only credible competition NVidia has, and the software support is second only to NVidia's.
[1] https://open.spotify.com/episode/0V9kRgNS7Ds6zh3GjdXUAQ?si=q...
The way I see it, NVidia only has a few advantages, ordered from most important to least:
1. Reserved fab space.
2. Highly integrated software.
3. Hardware architecture that exists today.
4. Customer relationships.
But all of these advantages are weak in one way or another:
1. Fab space is tight, and NVidia can strangle its consumer GPU market if it means selling more AI chips at a higher price. This advantage is gone if a competitor makes big bets years in advance, or if another company that has a lot of fab space (Intel?) is willing to change priorities.
2. Life is good when your proprietary software is the industry standard. Whether this actually matters will depend heavily on the use case.
3. A benefit now, but not for long. It's my estimation that the hardware design for TPUs is fundamentally much simpler than for GPUs. No need for raytracing, texture samplers, or rasterization. Mostly just needs lots of matrix multiplication and memory. Others moving into the space will be able to catch up quickly.
4. Useful to stay in the conversation, but in a field hungry for any advantage, the hardware vendor with the highest FLOPS (or equivalent) per dollar is going to win enough customers to saturate their manufacturing ability.
So overall, I give them a few years, and then the competition is going to get real quite fast.
It seems you have not worked with ML workloads, but are basing your comment on "internet wisdom" or, worse, business analysts (I'm sorry if that's inaccurate).
On GPUs, ML "just works" (inference and training), and they are always an order of magnitude faster than whatever CPU you have.
TPUs work very well for some model architectures (old ones that they were optimized and designed for), and on some novel ones they can actually be slower than a CPU (because of gathers and similar) - this was my experience working on ML stuff as an ML Researcher at Google until 2022; maybe it got better, but I doubt it. Older TPUs were ok only for inference of those specific models and useless for training. And anything new I tried (a fundamental part of research...) - the compiler would sometimes just break with an internal error, most of the time just produce terrible and slow code, and bugs filed against it would stay open for years.
GPU is so much more than a matrix multiplier - it's a fully general, programmable processor. With excellent compilers, but most importantly low-level access, so you don't need to rely on proprietary compiler engineers (like the TPU ones) and anyone can develop something like Flash Attention. And as a side note: while a Transformer might be mostly matrix multiplication, many other models are not.
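To make the "gathers and similar" point concrete, here is a minimal JAX sketch (sizes and names are my own, purely illustrative) of the kind of data-dependent indexing the comment above describes - a single fast kernel on a GPU, but the sort of thing the commenter says the TPU compiler handled badly:

    # Illustrative only: a gather-heavy op (embedding-style lookup), not the
    # dense matmul a systolic array is built for.
    import jax
    import jax.numpy as jnp

    table = jnp.ones((100_000, 256))  # e.g. an embedding table
    idx = jax.random.randint(jax.random.PRNGKey(0), (4096,), 0, 100_000)

    @jax.jit
    def lookup(table, idx):
        # Rows selected by data-dependent indices: memory-bound, irregular access.
        return jnp.take(table, idx, axis=0)

    print(lookup(table, idx).shape)  # (4096, 256)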
NVidia's biggest advantage is that AMD is unwilling to pay for top notch software engineers (and unwilling to pay the corresponding increase in hardware engineer salaries this would entail). If you check online you'll see NVidia pays both hardware and software engineers significantly more than AMD does. This is a cultural/management problem, which AMD's unlikely to overcome in the near-term future. Apple so far seems like the only other hardware company that doesn't underpay its engineers, but Apple's unlikely to release a discrete/stand-alone GPU any time soon.
Don’t underestimate CUDA as the moat. It’s been a decade of sheer dominance with multiple attempts to loosen its grip that haven’t been super fruitful.
I’ll also add that their second moat is Mellanox. They have state-of-the-art interconnect and networking that puts them ahead of competitors who are currently focusing just on the single unit.
I’ve spent the last month deep in GPU driver/compiler world and -
AMD or Apple (Metal) or someone (I haven’t tried Intel’s stuff) just needs to have a single guide to installing a driver and compiler that doesn’t segfault if you look at it wrong, and they would sweep the R&D mindshare.
It is insane how bad CUDA is; it’s even more insane how bad their competitors are.
These have always been NVIDIA's "few" advantages and yet they've still dominated for years. It's their relentless pace of innovation that is their advantage. They resemble Intel of old, and despite Intel's same "few" advantages, Intel is still dominant in the PC space (even with recent missteps).
Nvidia's datacenter AI chips don't have raytracing or rasterization. Heck, for all we know the new Blackwell chip is almost exclusively tensor cores. They gave no numbers for regular CUDA perf.
> Now of course Google should spin off the TPU team as a separate company.
Given the size of the market and its near-monopoly situation, I strongly think this has the potential to (almost immediately) surpass the Pixel hardware business. But the problem here is that TPU is a relatively scarce computing resource even inside Google and it's very likely that Google has a hard time to meet its internal demands...
> I strongly think this has the potential to (almost immediately) surpass the Pixel hardware business. But the problem here is that TPU is a relatively scarce computing resource even inside Google and it's very likely that Google has a hard time to meet its internal demands...
Yes.
But imagine how the company would do: they'd have a guaranteed market at Google for, say, 3 years, and while Google might take 100% of the production in the first 12 months, that's not a bad position to start from.
Plus there are other products they could ship that might not always need to be built on the latest process. I imagine there would be demand for inference-only, earlier-generation TPUs that can run LLMs fast if the power usage is low enough.
But they're far behind in adoption in the AI space, while TPUs have both adoption (inside Google and on top) and a very strong software offering (JAX and TF).
There's also Amazon's AWS "Trainium" chips, which is what Anthropic will be using going forward.
If you're talking about training LLMs, involving tens of thousands of processors, then the specifics of one processor vs another isn't the most important thing - it's the overall architecture and infrastructure in place to manage it.
Speaking of which, mega props to Groq - they really are awesome. So many startups launch with bullshit and promises, but Groq came to the scene with something awesome already working, which is reason enough to love them. I really respect this company, and I almost never say that.
I wouldn't call it awesome. It's just a big chip with lots of cache. You need hundreds of them to sufficiently load any decent model. At which point the cost has skyrocketed.
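Rough back-of-the-envelope on the "hundreds of them" claim, assuming the commonly cited ~230 MB of on-chip SRAM per GroqChip and a 70B-parameter model at 8 bits per weight (both assumptions mine; real deployments also need room for activations and KV cache, so this is a floor):

    # Back-of-the-envelope only; the 230 MB SRAM figure and int8 weights are
    # assumptions for illustration.
    sram_per_chip = 230e6        # bytes of on-chip SRAM per chip
    model_bytes = 70e9 * 1       # 70B parameters * 1 byte each (8-bit)

    chips_needed = model_bytes / sram_per_chip
    print(f"~{chips_needed:.0f} chips just to hold the weights")  # ~304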
How is it that Google invented the TPU and Google Research came up with the paper behind LLMs, yet NVDA and AI startup companies have captured ~100% of the value?
There's an old joke explanation about Xerox and PARC, about the difficulty of "pitching a 'paperless office' to a photocopier company".
In Google's case, an example analogy would be pitching making something like ChatGPT widely available, when that would disrupt revenue from search engine paid placements, and from ads on sites that people wouldn't need to visit. (So maybe someone says, better to phase it in subtly, as needed for competitiveness, but in non-disruptive ways.)
I doubt it's as simple as that, but would be funny if that was it.
This (innovator's dilemma / too afraid of disrupting your own ads business model) is the most common explanation folks are giving for this, but seems to be some sort of post-rationalization of why such a large company full of competent researchers/engineers would drop the ball this hard.
My read (having seen some of this on the inside), is that it was a mix of being too worried about safety issues (OMG, the chatbot occasionally says something offensive!) and being too complacent (too comfortable with incremental changes in Search, no appetite for launching an entirely new type of product / doing something really out there). There are many ways to monetize a chatbot, OpenAI for example is raking billions in subscription fees.
The answer is far weirder - they had a chat bot, and no one even discussed it in the context of search replacements. They didn’t want to release it because they just didn’t think it should be a product. Only after OpenAI actually disrupted search did they start releasing Gemini/Bard which takes advantage of search.
My take as someone who worked in Cloud, closely with the AI product teams on GTM strategy, is that it was primarily the former: Google was always extremely risk averse when it came to AI, to the point that until Andrew Moore was pushed out, Google Cloud didn't refer to anything as AI. It was ML-only, hence the BigQuery ML, Video Intelligence ML, NLP API, and so many other "ML" product names. There was strong sentiment internally that the technology wasn't mature enough to legitimately call it "AI", and that any models adequately complex to be non-trivially explainable were a no-go. Part of this was just general conservatism around product launches within Google, but it was significantly driven by EU regulation, too. Having just come off massive GDPR projects and staring down the barrel of DMA, Google didn't want to do anything that expanded the risk surface, whether it was in Cloud, Ads, Mobile or anything else.
Their hand was forced when ChatGPT was launched ... and we're seeing how that's going.
They're like a hyperactive dog chasing its own tail. How many projects did they create only to shut them down a bit later? All because there's always some nonsense to chase. Meanwhile the AI train has left the station without them and their search is now an ad-infested hot piece of garbage. Don't even get me started on their customer/dev support, or how aging things like the Google Translate API got absolutely KILLED by GPT-4-like APIs overnight.
Google has stage 4 leadership incompetency and can't be helped. The only humane option is euthanasia.
I think the TPU is simple. They do sell it (via cloud), but they focus on themselves first. When there was no shortage of compute, it was an also-ran in the ML hardware market. Now it’s trendy.
ChatGPT v Google is a far crazier history. Not only did Google invent Transformers, not only did Google open-source PaLM and BERT, but they even built chat-tuned LLM chat bots and let employees talk with them. This isn't a case where they were avoiding disruption or protecting search - they genuinely didn't see its potential. Worse, they got so much negative publicity over it that they considered it an AI safety issue to release. If that guy hadn't gone to the press and claimed LaMDA was sentient, then they may have entirely open-sourced it like PaLM. This would likely mean that GPT-3 was open-sourced and maybe never chat-tuned either.
GPT-2 was freely available and OpenAI showed off GPT-3 freely as a parlor trick before ChatGPT came out. ChatGPT was originally the same - fun text generation as chat not a full product.
TLDR - TPUs probably didn't have a lot of value until NVidia hardware became scarce, they actively invented the original ChatGPT, and "AI safety" concerns caused them to lock it down.
Pretty sure it is because if ChatGPT-likes updated as frequently as Google's website index, they would render search engines like Google obsolete and thus make their revenue nonexistent.
This article really connected a lot of abstract pieces together into how they flow through silicon. I really enjoyed seeing the simple CISC instructions and how they basically map onto LLM inference steps.
This is probably a dumb question that just shows my ignorance, but I keep hearing on the consumer end of things that the M1-M4 chips are good at some AI.
The most important for me these days would be Photoshop, Resolve, etc., and I have seen those run a lot faster on Apple's new proprietary chips than on my older machines.
That may not translate well at all to what this chip can do or what H100s can do. But does it translate at all?
Of course Apple is not selling their proprietary chips either, so for it to be practical Apple would have to release some form of external server stuffed with their GPUs and AI chips.
I’m also not quite an expert, but have benchmarked an M1 and various GPUs.
The M* chips have unified memory and (especially Pro/Max/Ultra) have very high memory bandwidth, even compared to e.g. a 1080 (an M1 Ultra has memory bandwidth between a 2080 and a 3090).
At small batch sizes (including 1, like most local tasks), inference is bottlenecked by memory bandwidth, not compute ability. This is why people say the M* chips are good for ML.
However H100s are used primarily for training (at enormous batch sizes) and require lots of interconnect to train large models. At that scale, arithmetic intensity is very high, and the M* chips aren’t very competitive (even if they could be networked) - they pick a different part of the Pareto power/efficiency curve than H100s which guzzle up power.
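A quick back-of-the-envelope to make the bandwidth-bound point concrete (the bandwidth and model-size numbers below are approximate assumptions, not measured benchmarks):

    # At batch size 1, each generated token has to stream essentially all the
    # weights once, so tokens/sec is roughly bandwidth / model size,
    # regardless of FLOPS. Numbers are approximate assumptions.
    model_bytes = 7e9 * 2        # ~7B params at fp16 -> ~14 GB of weights
    bandwidths = {
        "M1 Ultra (unified memory)": 800e9,   # ~800 GB/s
        "H100 SXM (HBM3)": 3350e9,            # ~3.35 TB/s
    }
    for name, bw in bandwidths.items():
        print(f"{name}: ~{bw / model_bytes:.0f} tokens/s upper bound")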
What Google really needs to do is get into the 2nm EUV space and go sub-2nm. When they have the EUV lithography (or whatever tech ASML has that prints the chips), then you have something really dangerous. Probably a hardcore Google X moonshot type project. Or maybe they have $500M sitting around to just buy one of the machines. If their TPUs are really that good - maybe it is a good business - especially if they can integrate all the way to having their own fab with their own tech.
This is frankly infeasible. Between the decades of trade secrets they would first need to discover, the tens- or maybe hundreds- of billions in capital needed to build their very first leading edge fab, the decade or two it would take for any such business to mature to the extent it would be functional, and the completely inconsequential volumes of devices they'd produce, they would probably be lighting half a trillion dollars on fire just to get a few years behind where the leading edge sits today, ten or more years from now. The only reason leading edge fabs are profitable today is because of decades of talent and engineering focused on producing general purpose computing devices for a wide variety of applications and customers, often with those very same customers driving innovation independently in critical focus areas (e.g. Micron with chip-on-chip HDI yield improvements, Xilinx with interdie communication fabric and multi chip substrate design). TPUs will never generate the required volumes, or attract the necessary customers, to achieve remotely profitable economies of scale, particularly when Google also has to set an attractive price against their competitors.
If Google has a compelling-enough business case, existing fabs will happily allocate space for their hardware. TPU is not remotely compelling enough.
I listened to a talk by Jim Keller from Tenstorrent about their different approach to making AI cores - 5 RISC-V cores: one core for loading data, one for uploading data, and the rest dedicated to performing matrix operations.
He did mention Google's TPU and the fact that it was like programming a VLIW machine, and that they had about 500 people dedicated to their compiler.
Quote from the OP: "The TPU v1 uses a CISC (Complex Instruction Set Computer) design with around only about 20 instructions."
*chuckle* CISC/RISC has gone from astute observation, to research program, to revolutionary technology, to marketing buzzwords... and finally to being just completely meaningless sounds.
Idk, maybe it's just me, but what I was taught in comp architecture was that CISC vs RISC has more to do with the complexity of the instructions, not the raw count. So the TPU having a smaller number of instructions can still be CISC if the instructions are fairly complex.
Granted the last time I took any comp architecture was a grad course like 15 years ago, so my memory is pretty fuzzy (also we spent most of that semester dicking around with Itanium stuff that is beyond useless now)
Right. CISC vs RISC has always been about simplifying the underlying micro-instructions and register set usage. It's definitely CISC if you have large, complex operations that work directly on multiple memory locations (albeit the lines between RISC and CISC blur, as with all such polar philosophies, when real-life performance optimizations come into play).
Guys... what are the instructions? The on-chip memory they are talking about is essentially... a big register set. So we have load from main memory into registers, store from registers into main memory, multiply matrices - source and dest are stored in registers...
We have a 20-instruction, load-store CPU... how is this not RISC? At least RISC as we used the term in 1995?
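For what it's worth, the TPU v1 paper (Jouppi et al., ISCA 2017) names five key instructions: Read_Host_Memory, Read_Weights, MatrixMultiply/Convolve, Activate, and Write_Host_Memory. Here's a toy, runnable model of how one dense layer maps onto them - everything below is an illustrative stand-in, not a real TPU API:

    # Toy model of the TPU v1 instruction flow for one dense layer. Per the
    # paper, a single MatrixMultiply instruction processes a whole B x 256
    # input block against a 256 x 256 weight tile on the systolic array.
    import numpy as np

    def dense_layer_tpu_v1_style(host_x, host_w):
        unified_buffer = np.array(host_x)             # Read_Host_Memory
        weight_fifo = np.array(host_w)                # Read_Weights
        accumulators = unified_buffer @ weight_fifo   # MatrixMultiply
        unified_buffer = np.maximum(accumulators, 0)  # Activate (ReLU; also pool/norm)
        return unified_buffer.tolist()                # Write_Host_Memory

    print(dense_layer_tpu_v1_style([[1.0, -2.0]], [[3.0], [4.0]]))  # -> [[0.0]]

Whether you read that as CISC (big multi-cycle operations against on-chip buffers) or RISC (a tiny load/compute/store set) is exactly the definitional argument above.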
> most importantly low-level access, so you don't need to rely on proprietary compiler engineers (like the TPU ones) and anyone can develop something like Flash Attention
This is the thing that lets them outperform AMD chips even on inferior hardware. And the fact that anything new gets written for CUDA first.
There is OpenAI's Triton language for this too and people are beginning to use it (shout out to Unsloth here!).
> Reserved fab space.
While this is true, it's worth noting that the inference-only Groq chip, which gets 2x-5x better LLM inference performance, is on a 12nm process.
Just the sheer number of internal ML things NVidia builds helps them tremendously to understand the market (what the market actually needs).
And they use their own inventions themselves.
'Only has a few' = 'has a handful that are easy to list but with huge implications, which are not easily matched by AMD or Intel right now'.
NVidia's software is the only reason I'm not using GPUs for ML tasks, and likely never will.
> It's the only credible competition NVidia has
This is wrong; both AMD and Intel (through Habana) have GPUs comparable to H100s in performance.
https://www.linkedin.com/posts/eolver_googles-defense-agains...
> CISC/RISC has gone from astute observation, to research program, to revolutionary technology, to marketing buzzwords... and finally to being just completely meaningless sounds.
I suppose it's the terminological circle of life.