The phrasing is very precise here: it's the first TPU for _the age of inference_, a novel marketing term they have defined to refer to chain-of-thought and Deep Research.
The first one was designed as a proof of concept that it would work at all, not really to be optimal for inference workloads. It just turns out that inference is easier.
Some honest competition in the machine learning chip space! Genuinely interested to see how this ends up playing out. Nvidia seemed 'untouchable' for so long in this space that it's nice to see things get shaken up.
I know they aren't selling the TPU as boxed units, but still, even as hardware that backs GCP services and whatnot, it's interesting to see how it'll shake out!
> Nvidia seemed 'untouchable' for so long in this space that it's nice to see things get shaken up.
Did it?
Both Mistral's LeChat (running on Cerebras) and Google's Gemini (running on TPUs) showed clearly, ages ago, that Nvidia has no advantage at all in inference.
The hundreds of billions spent on hardware so far have focused on training, but in the long run inference is going to get the lion's share of the work.
Not knowing much about special-purpose chips, I would like to understand whether chips like this would give Google a significant cost advantage over the likes of Anthropic or OpenAI when offering LLM services. Is similar technology available to Google's competitors?
GPUs: very good for pretraining, inefficient for inference.
Why?
For each new word a transformer generates, it has to move the entire set of model weights from memory to the compute units. For a 70-billion-parameter model with 16-bit weights, that means moving approximately 140 gigabytes of data to generate a single word.
GPUs have off-chip memory, which means a GPU has to push data across the chip-to-memory interface for every single word it generates. That architectural choice is an advantage for graphics processing, where large amounts of data need to be stored but not necessarily accessed as rapidly for every single computation. It's a liability in inference, where quick and frequent data access is critical.
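A rough back-of-the-envelope sketch of that claim (the 70B/16-bit model is from the comment above, but the ~3.35 TB/s HBM bandwidth figure is an illustrative assumption for a high-end GPU, not a number stated in this thread):

```python
# Bandwidth-limited ceiling on batch-size-1 decoding, assuming every
# weight must be streamed from off-chip memory once per generated token.
params = 70e9              # 70B parameters
bytes_per_param = 2        # 16-bit weights
hbm_bandwidth = 3.35e12    # bytes/s, assumed high-end GPU HBM figure

bytes_per_token = params * bytes_per_param          # ~140 GB per token
max_tokens_per_s = hbm_bandwidth / bytes_per_token  # upper bound from bandwidth alone

print(f"{bytes_per_token / 1e9:.0f} GB moved per token")  # ~140 GB
print(f"~{max_tokens_per_s:.0f} tokens/s ceiling")        # ~24 tokens/s
```

In that regime the compute units are mostly idle; the ceiling comes from how fast weights can be streamed, not from FLOPS.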
Listening to Andrew Feldman of Cerebras [0] is what helped me grok the differences. Caveat: he is the founder/CEO of a company that sells hardware for AI inference, so the guy is talking his book.
[0] https://www.youtube.com/watch?v=MW9vwF7TUI8&list=PLnJFlI3aIN...
Cerebras (and Groq) have the problem of using too much die area for compute and not enough for memory. Their method of scaling is to fan the compute out across more physical space, which takes more data center space, power, and cooling, and that is a huge issue. Funny enough, when I talked to Cerebras at SC24, they told me their largest customers are for training, not inference. They just market it as an inference product, which is even more confusing to me.
I wish I could say more about what AMD is doing in this space, but keep an eye on their MI4xx line.
Several incorrect assumptions in this take. For one thing, 16-bit weights are not necessary. For another, 140 GB/token holds only if your batch size is 1 and your sequence length is 1 (no speculative decoding). Nobody runs LLMs like that on those GPUs; if you do it like that, compute utilization becomes ridiculously low. With a batch size greater than 1 and speculative decoding, the arithmetic intensity of the kernels is much higher, and having weights "off chip" is not that much of a concern.
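A small sketch of that batching point (illustrative numbers only; it ignores KV-cache and activation traffic): the weights are streamed once per decode step no matter how many tokens are in flight, so arithmetic intensity grows roughly linearly with tokens per step.

```python
# FLOPs per byte of weight traffic for one decode step of a dense model.
params = 70e9
bytes_per_param = 2                      # 16-bit weights
flops_per_token = 2 * params             # ~2 FLOPs per parameter per token

weight_bytes = params * bytes_per_param  # paid once per step, shared by the batch

for tokens_in_step in (1, 8, 64, 256):   # batch size x speculative draft length
    intensity = (tokens_in_step * flops_per_token) / weight_bytes
    print(f"{tokens_in_step:>4} tokens/step -> {intensity:.0f} FLOPs per weight byte")
```

Once that intensity exceeds the hardware's FLOPS-to-bandwidth ratio, the kernel is no longer bound by moving weights off chip.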
The Groq interview was good too. Seems that the thought process is that companies like Groq/Cerebras can run the inference, and companies like Nvidia can keep/focus on their highly lucrative pretraining business.
Anthropic is using Google TPUs. Also jointly working with Amazon on a data center using Amazon's custom AI chips. Also Google and Amazon are both investors in Anthropic.
I might be misremembering here, but Google's own AI models (Gemini) don't use NVIDIA hardware in any way, for training or inference. Google bought large amounts of NVIDIA hardware only for Google Cloud customers, not for itself.
There are other AI/LLM 'specific' chips out there, yes. But the thing about ASICs is that you need one for each *specific* task. Eventually we'll hit an equilibrium, but for now, the stuff that Cerebras is best at is not what TPUs are best at, which is not what GPUs are best at…
It looks amazing, but I wish we could stop playing silly games with benchmarks. Why compare fp8 performance on Ironwood to architectures that don't support fp8 in hardware? Why leave TPU v6 out of the comparison?
Why compare fp64 flops in the El Capitan supercomputer to fp8 flops in the TPU pod when you know full well these are not comparable?
[Edit: it turns out that El Capitan is actually faster when compared like for like, and the statement below underestimated how much slower fp64 is; my original comment, in italics below, is not accurate.] _(The TPU would still be faster even allowing for the fact that fp64 is ~8x harder than fp8. Is it worthwhile to misleadingly claim it's 24x faster instead of honestly saying it's 3x faster? Really?)_
It comes across as a bit cheap. Using misleading statements is a tactic for snake oil salesmen. This isn't snake oil so why lower yourself?
It's even worse than I thought. El Capitan has 43,808 MI300A APUs. According to AMD, each MI300A can do 3,922 TF of sparse FP8, for a total of roughly 171 EF of sparse FP8 performance, or about 85 EF non-sparse.
In other words El Capitan is between 2 and 4 times as fast as one of these pods, yet they claim the pod is 24x faster than El Capitan.
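A quick check of those numbers (the 3,922 TF sparse FP8 per MI300A is taken from the comment above; the ~42.5 EF dense fp8 figure for a full 9,216-chip Ironwood pod is my reading of the announcement, not something stated in this thread):

```python
# Sanity check on the El Capitan vs. Ironwood-pod comparison.
apus = 43_808
sparse_fp8_per_apu_tf = 3_922                              # TFLOPS per MI300A, per AMD

el_capitan_sparse_ef = apus * sparse_fp8_per_apu_tf / 1e6  # ~171.8 EF sparse
el_capitan_dense_ef = el_capitan_sparse_ef / 2             # ~85.9 EF dense

ironwood_pod_ef = 42.5   # assumed dense fp8 for a full Ironwood pod

print(f"sparse: {el_capitan_sparse_ef / ironwood_pod_ef:.1f}x the pod")  # ~4x
print(f"dense:  {el_capitan_dense_ef / ironwood_pod_ef:.1f}x the pod")   # ~2x
```

Which is where the "between 2 and 4 times as fast" range above comes from.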
Google shouldn't do that comparison. When I worked there I strongly emphasized to the TPU leadership not to compare their systems to supercomputers: not only were the comparisons misleading, Google absolutely does not want supercomputer users to switch to TPUs. SC users are demanding and require huge support.
Google needs to sell to Enterprise Customers. It's a Google Cloud Event.
Of course they have incentives to hype, because once long-term contracts are signed you lose that customer forever. So hype is a necessity.
I went through the article and it seems you're right about the comparison with El Capitan. These performance figures are so bafflingly misleading.
And so unnecessary, too: nobody shopping for an AI inference server cares at all about its relative performance versus an fp64 machine. This language seems designed solely to wow tech-illiterate C-suites.
Actually the cost gap is even larger, because the cost ratio is not much less than the square of the ratio between the significand widths, which in this case is 52 bits / 4 bits = 13, and 13 squared is 169.
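Spelling out that estimate (a sketch of the quadratic-cost argument only; real multiplier area also depends on exponent handling and implementation details):

```python
# Multiplier cost grows roughly with the square of significand width.
fp64_significand_bits = 52   # fp64 stores a 52-bit fraction
fp8_significand_bits = 4     # fp8 E4M3: 3 stored fraction bits + implicit leading 1

ratio = fp64_significand_bits / fp8_significand_bits
print(ratio, ratio ** 2)     # 13.0 169.0 -- far beyond the ~8x assumed above
```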
Because it is a public company that aims to maximise shareholder value and thus the value of its stock. Since value is largely a matter of perception, if you can convince people your product is better than it is, your stock valuation, at least in the short term, will be higher.
Hence Tesla saying FSD and robo-taxis are 1 year away, the fusion companies saying fusion is closer than it is etc....
Nvidia, AMD, Apple, and Intel have all been publishing misleading graphs for decades, and even under constant criticism they continue to.
A big part of my issue here is that they've messed up even the misleading benchmarks.
They've failed to compare to the most obvious alternative, which is Nvidia GPUs. They look like they've got something to hide, not like they're ahead.
They've needlessly made their own current products look bad in comparison to this one, understating the long-standing advantage TPUs have given Google.
Then they've gone and produced a misleading comparison to the wrong product (who cares about El Capitan? I can't rent that!). This is a waste of credibility. If you are going to go with misleading benchmarks then at least compare to something people care about.
Why not? If we line up to race, you can't say "why compare a V8 to a V6 turbo or an electric motor?" It's a race; the drivetrain doesn't matter. Who gets to the finish line first?
No one is shopping for GPUs by fp8, fp16, fp32, or fp64. It's all about the cost/performance ratio. 8 bits is as good as 32 bits, and great performance is even being pulled out of 4 bits...
I think it’s not misleading, but rather very clear that there are problems. v7 is compared to v5e. Also, notice that it’s not compared to competitors, and the price isn’t mentioned.
Finally, I think the much bigger issue with TPU is the software and developer experience. Without improvements there, there’s close to zero chance that anyone besides a few companies will use TPU. It’s barely viable if the trend continues.
> besides a few companies will use TPU. It’s barely viable if the trend continues
That doesn't matter much if those few companies are the biggest companies. Even with Nvidia, the majority of the revenue is generated by a handful of hyperscalers.
One big area over the last two years has been algorithmic improvements feeding hardware improvements. Supercomputer folks use fp64 for everything, or did. Most training was done in fp32 four years ago. As algorithm teams have shown that fp8 can be used for training and inference, hardware has been updated to accommodate it, yielding big gains.
Unlike a lot of supercomputer algorithms, where fp error accumulates as you go, gradient-descent-based algorithms don't need as much precision: any fp errors will still show up at the next loss calculation and get corrected, which lets you make do with much lower precision.
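A toy illustration of that self-correcting behaviour (a sketch only: plain NumPy, with gradients rounded to fp16 as a stand-in, since NumPy has no fp8 type):

```python
import numpy as np

# Fit y = 3x + 1 by gradient descent while rounding every gradient to fp16.
# Rounding error made at one step is re-measured by the next loss
# evaluation, so the optimizer keeps correcting it.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 1))
y = 3.0 * x + 1.0

w, b, lr = 0.0, 0.0, 0.1
for _ in range(200):
    err = w * x + b - y
    grad_w = np.float16(2 * np.mean(err * x))   # low-precision gradient
    grad_b = np.float16(2 * np.mean(err))
    w -= lr * float(grad_w)
    b -= lr * float(grad_b)

print(round(w, 3), round(b, 3))   # close to 3.0 and 1.0 despite the rounding
```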
How is the API story for these devices? Are the drivers mainlined in Linux? Is there a specific API you use to code for them? What does the instance you rent on Google Cloud look like, and what does it come with in terms of software?
Also holy cow that was 10 years ago already? Dang.
Amusing bit: The first TPU design was based on fully connected networks; the advent of CNNs forced some design rethinking, and then the advent of RNNs (and then transformers) did it yet again.
So maybe it's reasonable to say that this is the first TPU designed for inference in the world where you have both a matrix multiply unit and an embedding processor.
(Also, the first gen was purely a co-processor, whereas the later generations included their own network fabric, a trait shared by this most recent one. So it's not totally crazy to think of the first one as a very different beast.)
What were the use cases like back then?
Certainly, RNNs are much older than TPUs?!
Can anyone suggest a better (i.e. more accurate and neutral) title, devoid of marketing tropes?
> first designed specifically for inference. For more than a decade, TPUs have powered Google’s most demanding AI training and serving workloads...
What do they think serving is? I think this marketing copy was written by someone with no idea what they are talking about, and not reviewed by anyone who did.
Also funny enough it kinda looks like they've scrubbed all their references to v4i, where the i stands for inference. https://gwern.net/doc/ai/scaling/hardware/2021-jouppi.pdf
I'm not sure - might not the equilibrium state be that we are constantly fine-tuning models with the latest data (e.g. social media firehose)?
This benefits everyone, even if you don't use Google Cloud, because of the competition it introduces.
https://www.youtube.com/watch?v=xBMRL_7msjY
https://www.datacenterknowledge.com/data-center-chips/ai-sta...
https://www.semafor.com/article/12/03/2024/amazon-announces-...
What even is an AI data center? Are the GPU/TPU boxes in a different building than the others?
No one else has access to anything similar, Amazon is just starting to scale their Trainium chip.
The end of Moore's law pretty much dictates specialization, it's just more apparent in fields without as much ossification first.
My impression from this is that they are too scared to say that their TPU pod is equivalent to 60 GB200 NVL72 racks in terms of fp8 flops.
I can only assume that they need way more than 60 racks and they want to hide this fact.
Because end users want to use fp8. Why should architectural differences matter when the speed is what matters at the end of the day?
I wonder whether Google sees this as a problem. In a way it just means more AI compute capacity for Google.
It would be awesome for things like homelabs (to run Frigate NVR, Immich ML tasks or the Home Assistant LLM).
Are there a few big things, many small things...? I'm curious what fruit is left hanging for fast SIMD matrix multiplication.
NB: Hobbyist, take all with a grain of salt
TensorFlow, PyTorch, and JAX all support XLA [1] as a backend.
[1]: https://openxla.org/
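To give a concrete feel for the developer-experience question above, a minimal JAX sketch (hypothetical usage; the same script runs unchanged on CPU, GPU, or a Cloud TPU VM, where jax.devices() reports TPU devices):

```python
import jax
import jax.numpy as jnp

# XLA picks the best available backend: TPU on a Cloud TPU VM, else GPU/CPU.
print(jax.devices())

@jax.jit  # compile with XLA for whatever accelerator is present
def predict(w, x):
    return jnp.tanh(x @ w)

w = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
x = jnp.ones((8, 1024), dtype=jnp.bfloat16)

print(predict(w, x).shape)  # (8, 1024), computed on the selected device
```

PyTorch (via torch_xla) and TensorFlow reach the same hardware through XLA as well.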