The phrasing is very precise here: it's the first TPU for _the age of inference_, a novel marketing term they have defined to refer to chain-of-thought and Deep Research.
The first one was designed as a proof of concept that it would work at all, not really to be optimal for inference workloads. It just turns out that inference is easier.
Some honest competition in the machine learning chip space! Genuinely interested to see how this ends up playing out. Nvidia seemed 'untouchable' for so long in this space that it's nice to see things get shaken up.
I know they aren't selling the TPU as boxed units, but still, even as hardware that backs GCP services and whatnot, it's interesting to see how it'll shake out!
> Nvidia seemed 'untouchable' for so long in this space that it's nice to see things get shaken up.
Did it?
Both Mistral's LeChat (running on Cerebras) and Google's Gemini (running on TPUs) showed clearly, ages ago, that Nvidia has no advantage at all in inference.
The hundreds of billions spent on hardware so far have focused on training, but in the long run inference is going to get the lion's share of the work.
Not knowing much about special-purpose chips, I would like to understand whether chips like this would give Google a significant cost advantage over the likes of Anthropic or OpenAI when offering LLM services. Is similar technology available to Google's competitors?
GPUs: very good for pretraining, inefficient for inference.
Why?
For each new word a transformer generates, it has to move the entire set of model weights from memory to the compute units. For a 70-billion-parameter model with 16-bit weights, that means moving approximately 140 gigabytes of data to generate a single word.
GPUs have off-chip memory, which means a GPU has to push data across the chip-to-memory interface for every single word it generates. That architectural choice is an advantage for graphics processing, where large amounts of data need to be stored but not necessarily accessed as rapidly for every single computation. It's a liability in inference, where quick and frequent data access is critical.
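A rough back-of-the-envelope sketch of that claim (the 70B/16-bit model is from the comment above, but the ~3.35 TB/s HBM bandwidth figure is an illustrative assumption for a high-end GPU, not a number stated in this thread):

```python
# Bandwidth-limited ceiling on batch-size-1 decoding, assuming every
# weight must be streamed from off-chip memory once per generated token.
params = 70e9              # 70B parameters
bytes_per_param = 2        # 16-bit weights
hbm_bandwidth = 3.35e12    # bytes/s, assumed high-end GPU HBM figure

bytes_per_token = params * bytes_per_param          # ~140 GB per token
max_tokens_per_s = hbm_bandwidth / bytes_per_token  # upper bound from bandwidth alone

print(f"{bytes_per_token / 1e9:.0f} GB moved per token")  # ~140 GB
print(f"~{max_tokens_per_s:.0f} tokens/s ceiling")        # ~24 tokens/s
```

In that regime the compute units are mostly idle; the ceiling comes from how fast weights can be streamed, not from FLOPS.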
Listening to Andrew Feldman of Cerebras [0] is what helped me grok the differences. Caveat: he is the founder/CEO of a company that sells hardware for AI inference, so the guy is talking his book.
[0] https://www.youtube.com/watch?v=MW9vwF7TUI8&list=PLnJFlI3aIN...
Cerebras (and Groq) have the problem of using too much die area for compute and not enough for memory. Their method of scaling is to fan the compute out across more physical space, which takes more data center space, power, and cooling, and that is a huge issue. Funny enough, when I talked to Cerebras at SC24, they told me their largest customers are for training, not inference. They just market it as an inference product, which is even more confusing to me.
I wish I could say more about what AMD is doing in this space, but keep an eye on their MI4xx line.
Several incorrect assumptions in this take. For one thing, 16-bit weights are not necessary. For another, 140 GB/token holds only if your batch size is 1 and your sequence length is 1 (no speculative decoding). Nobody runs LLMs like that on those GPUs; if you do it like that, compute utilization becomes ridiculously low. With a batch size greater than 1 and speculative decoding, the arithmetic intensity of the kernels is much higher, and having weights "off chip" is not that much of a concern.
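A small sketch of that batching point (illustrative numbers only; it ignores KV-cache and activation traffic): the weights are streamed once per decode step no matter how many tokens are in flight, so arithmetic intensity grows roughly linearly with tokens per step.

```python
# FLOPs per byte of weight traffic for one decode step of a dense model.
params = 70e9
bytes_per_param = 2                      # 16-bit weights
flops_per_token = 2 * params             # ~2 FLOPs per parameter per token

weight_bytes = params * bytes_per_param  # paid once per step, shared by the batch

for tokens_in_step in (1, 8, 64, 256):   # batch size x speculative draft length
    intensity = (tokens_in_step * flops_per_token) / weight_bytes
    print(f"{tokens_in_step:>4} tokens/step -> {intensity:.0f} FLOPs per weight byte")
```

Once that intensity exceeds the hardware's FLOPS-to-bandwidth ratio, the kernel is no longer bound by moving weights off chip.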
The Groq interview was good too. Seems that the thought process is that companies like Groq/Cerebras can run the inference, and companies like Nvidia can keep/focus on their highly lucrative pretraining business.
Anthropic is using Google TPUs. Also jointly working with Amazon on a data center using Amazon's custom AI chips. Also Google and Amazon are both investors in Anthropic.
I might be misremembering here, but Google's own AI models (Gemini) don't use NVIDIA hardware in any way, for training or inference. Google bought large amounts of NVIDIA hardware only for Google Cloud customers, not for itself.
There are other AI/LLM 'specific' chips out there, yes. But the thing about ASICs is that you need one for each *specific* task. Eventually we'll hit an equilibrium, but for now, the stuff that Cerebras is best at is not what TPUs are best at, which is not what GPUs are best at…
It looks amazing, but I wish we could stop playing silly games with benchmarks. Why compare fp8 performance on Ironwood to architectures that don't support fp8 in hardware? Why leave TPU v6 out of the comparison?
Why compare fp64 flops in the El Capitan supercomputer to fp8 flops in the TPU pod when you know full well these are not comparable?
[Edit: it turns out that El Capitan is actually faster when compared like for like, and the statement below underestimated how much slower fp64 is; my original comment, in italics below, is not accurate.] _(The TPU would still be faster even allowing for the fact that fp64 is ~8x harder than fp8. Is it worthwhile to misleadingly claim it's 24x faster instead of honestly saying it's 3x faster? Really?)_
It comes across as a bit cheap. Using misleading statements is a tactic for snake oil salesmen. This isn't snake oil so why lower yourself?
It's even worse than I thought. El Capitan has 43,808 MI300A APUs. According to AMD, each MI300A can do 3,922 TF of sparse FP8, for a total of roughly 171 EF of sparse FP8 performance, or about 85 EF non-sparse.
In other words El Capitan is between 2 and 4 times as fast as one of these pods, yet they claim the pod is 24x faster than El Capitan.
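A quick check of those numbers (the 3,922 TF sparse FP8 per MI300A is taken from the comment above; the ~42.5 EF dense fp8 figure for a full 9,216-chip Ironwood pod is my reading of the announcement, not something stated in this thread):

```python
# Sanity check on the El Capitan vs. Ironwood-pod comparison.
apus = 43_808
sparse_fp8_per_apu_tf = 3_922                              # TFLOPS per MI300A, per AMD

el_capitan_sparse_ef = apus * sparse_fp8_per_apu_tf / 1e6  # ~171.8 EF sparse
el_capitan_dense_ef = el_capitan_sparse_ef / 2             # ~85.9 EF dense

ironwood_pod_ef = 42.5   # assumed dense fp8 for a full Ironwood pod

print(f"sparse: {el_capitan_sparse_ef / ironwood_pod_ef:.1f}x the pod")  # ~4x
print(f"dense:  {el_capitan_dense_ef / ironwood_pod_ef:.1f}x the pod")   # ~2x
```

Which is where the "between 2 and 4 times as fast" range above comes from.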
Google shouldn't do that comparison. When I worked there I strongly emphasized to the TPU leadership not to compare their systems to supercomputers: not only were the comparisons misleading, Google absolutely does not want supercomputer users to switch to TPUs. SC users are demanding and require huge support.
Google needs to sell to Enterprise Customers. It's a Google Cloud Event.
Of course they have incentives to hype, because once long-term contracts are signed you lose that customer forever. So hype is a necessity.
I went through the article and it seems you're right about the comparison with El Capitan. These performance figures are so bafflingly misleading.
And so unnecessary, too: nobody shopping for an AI inference server cares at all about its relative performance versus an fp64 machine. This language seems designed solely to wow tech-illiterate C-suites.
Actually the cost gap is even larger, because the cost ratio is not much less than the square of the ratio between the significand widths, which in this case is 52 bits / 4 bits = 13, and 13 squared is 169.
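Spelling out that estimate (a sketch of the quadratic-cost argument only; real multiplier area also depends on exponent handling and implementation details):

```python
# Multiplier cost grows roughly with the square of significand width.
fp64_significand_bits = 52   # fp64 stores a 52-bit fraction
fp8_significand_bits = 4     # fp8 E4M3: 3 stored fraction bits + implicit leading 1

ratio = fp64_significand_bits / fp8_significand_bits
print(ratio, ratio ** 2)     # 13.0 169.0 -- far beyond the ~8x assumed above
```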
Because it is a public company that aims to maximise shareholder value and thus the value of its stock. Since value is largely a matter of perception, if you can convince people your product is better than it is, your stock valuation, at least in the short term, will be higher.
Hence Tesla saying FSD and robo-taxis are 1 year away, the fusion companies saying fusion is closer than it is etc....
Nvidia, AMD, Apple, and Intel have all been publishing misleading graphs for decades, and even under constant criticism they continue to.
A big part of my issue here is that they've messed up even the misleading benchmarks.
They've failed to compare to the most obvious alternative, which is Nvidia GPUs. They look like they've got something to hide, not like they're ahead.
They've needlessly made their own current products look bad in comparison to this one, understating the long-standing advantage TPUs have given Google.
Then they've gone and produced a misleading comparison to the wrong product (who cares about El Capitan? I can't rent that!). This is a waste of credibility. If you are going to go with misleading benchmarks then at least compare to something people care about.
Why not? If we line up to race, you can't say "why compare a V8 to a V6 turbo or an electric motor?" It's a race; the drivetrain doesn't matter. Who gets to the finish line first?
No one is shopping for GPUs by fp8, fp16, fp32, or fp64. It's all about the cost/performance ratio. 8 bits is as good as 32 bits, and great performance is even being pulled out of 4 bits...
I think it’s not misleading, but rather very clear that there are problems. v7 is compared to v5e. Also, notice that it’s not compared to competitors, and the price isn’t mentioned.
Finally, I think the much bigger issue with TPU is the software and developer experience. Without improvements there, there’s close to zero chance that anyone besides a few companies will use TPU. It’s barely viable if the trend continues.
> besides a few companies will use TPU. It’s barely viable if the trend continues
That doesn't matter much if those few companies are the biggest companies. Even with Nvidia, the majority of the revenue is generated by a handful of hyperscalers.
One big area over the last two years has been algorithmic improvements feeding hardware improvements. Supercomputer folks use fp64 for everything, or did. Most training was done in fp32 four years ago. As algorithm teams have shown that fp8 can be used for training and inference, hardware has been updated to accommodate it, yielding big gains.
Unlike a lot of supercomputer algorithms, where fp error accumulates as you go, gradient-descent-based algorithms don't need as much precision: any fp errors will still show up at the next loss calculation and get corrected, which lets you make do with much lower precision.
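A toy illustration of that self-correcting behaviour (a sketch only: plain NumPy, with gradients rounded to fp16 as a stand-in, since NumPy has no fp8 type):

```python
import numpy as np

# Fit y = 3x + 1 by gradient descent while rounding every gradient to fp16.
# Rounding error made at one step is re-measured by the next loss
# evaluation, so the optimizer keeps correcting it.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 1))
y = 3.0 * x + 1.0

w, b, lr = 0.0, 0.0, 0.1
for _ in range(200):
    err = w * x + b - y
    grad_w = np.float16(2 * np.mean(err * x))   # low-precision gradient
    grad_b = np.float16(2 * np.mean(err))
    w -= lr * float(grad_w)
    b -= lr * float(grad_b)

print(round(w, 3), round(b, 3))   # close to 3.0 and 1.0 despite the rounding
```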
How is the API story for these devices? Are the drivers mainlined in Linux? Is there a specific API you use to code for them? What does the instance you rent on Google Cloud look like, and what does it come with in terms of software?
Also holy cow that was 10 years ago already? Dang.
Amusing bit: The first TPU design was based on fully connected networks; the advent of CNNs forced some design rethinking, and then the advent of RNNs (and then transformers) did it yet again.
So maybe it's reasonable to say that this is the first TPU designed for inference in the world where you have both a matrix multiply unit and an embedding processor.
(Also, the first gen was purely a co-processor, whereas the later generations included their own network fabric, a trait shared by this most recent one. So it's not totally crazy to think of the first one as a very different beast.)
What were the use cases like back then?
Certainly, RNNs are much older than TPUs?!
Can anyone suggest a better (i.e. more accurate and neutral) title, devoid of marketing tropes?
> first designed specifically for inference. For more than a decade, TPUs have powered Google’s most demanding AI training and serving workloads...
What do they think serving is? I think this marketing copy was written by someone with no idea what they are talking about, and not reviewed by anyone who did.
Also funny enough it kinda looks like they've scrubbed all their references to v4i, where the i stands for inference. https://gwern.net/doc/ai/scaling/hardware/2021-jouppi.pdf
I'm not sure - might not the equilibrium state be that we are constantly fine-tuning models with the latest data (e.g. social media firehose)?
This benefits everyone, even if you don't use Google Cloud, because of the competition it introduces.
https://www.youtube.com/watch?v=xBMRL_7msjY
https://www.datacenterknowledge.com/data-center-chips/ai-sta...
https://www.semafor.com/article/12/03/2024/amazon-announces-...
What even is an AI data center? Are the GPU/TPU boxes in a different building than the others?
No one else has access to anything similar, Amazon is just starting to scale their Trainium chip.
The end of Moore's law pretty much dictates specialization, it's just more apparent in fields without as much ossification first.
My impression from this is that they are too scared to say that their TPU pod is equivalent to 60 GB200 NVL72 racks in terms of fp8 flops.
I can only assume that they need way more than 60 racks and they want to hide this fact.
Because end users want to use fp8. Why should architectural differences matter when the speed is what matters at the end of the day?
I wonder whether Google sees this as a problem. In a way it just means more AI compute capacity for Google.
It would be awesome for things like homelabs (to run Frigate NVR, Immich ML tasks or the Home Assistant LLM).
Are there a few big things, many small things...? I'm curious what fruit is left hanging for fast SIMD matrix multiplication.
NB: Hobbyist, take all with a grain of salt
TensorFlow, PyTorch, and JAX all support XLA [1] as a backend.
[1]: https://openxla.org/
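To give a concrete feel for the developer-experience question above, a minimal JAX sketch (hypothetical usage; the same script runs unchanged on CPU, GPU, or a Cloud TPU VM, where jax.devices() reports TPU devices):

```python
import jax
import jax.numpy as jnp

# XLA picks the best available backend: TPU on a Cloud TPU VM, else GPU/CPU.
print(jax.devices())

@jax.jit  # compile with XLA for whatever accelerator is present
def predict(w, x):
    return jnp.tanh(x @ w)

w = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
x = jnp.ones((8, 1024), dtype=jnp.bfloat16)

print(predict(w, x).shape)  # (8, 1024), computed on the selected device
```

PyTorch (via torch_xla) and TensorFlow reach the same hardware through XLA as well.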