Cerebras achieves 2,500T/s on Llama 4 Maverick (400B)

> At over 2,500 t/s, Cerebras has set a world record for LLM inference speed on the 400B parameter Llama 4 Maverick model, the largest and most powerful in the Llama 4 family.

This is incorrect. The unreleased Llama 4 Behemoth is the largest and most powerful in the Llama 4 family.

As for the speed record, it seems important to keep it in context. That comparison is only for performance on 1 query, but it is well known that people run potentially hundreds of queries in parallel to get their money out of the hardware. If you aggregate the tokens per second across all simultaneous queries to get the total throughput for comparison, I wonder if it will still look so competitive in absolute performance.

Also, Cerebras is the company that not only was saying that their hardware was not useful for inference until some time last year, but even partnered with Qualcomm with the claim that Qualcomm’s accelerators had a 10x price performance improvement over their things:

https://www.cerebras.ai/press-release/cerebras-qualcomm-anno...

Their hardware does inference with FP16, so they need ~20 of their CSE-3 chips to run this model. Each one costs ~$2 million, so that is $40 million. The DGX B200 that they used for their comparison costs ~$500,000:

https://wccftech.com/nvidia-blackwell-dgx-b200-price-half-a-...

You only need 1 DGX B200 to run Llama 4 Maverick. You could buy ~80 of them for the price it costs to buy enough Cerebras hardware to run Llama 4 Maverick.

Their latencies are impressive, but beyond a certain point, throughput is what counts and they don’t really talk about their throughput numbers. I suspect the cost to performance ratio is terrible for throughput numbers. It certainly is terrible for latency numbers. That is what they are not telling people.

Finally, I have trouble getting excited about Cerebras. SRAM scaling is dead, so short of figuring out how to 3D stack their wafer scale chips, during fabrication at TSMC, or designing round chips, they have a dead end product since it relies on using an entire wafer to be able to throw SRAM at problems. Nvidia, using DRAM, is far less reliant on SRAM and can use more silicon for compute, which is still shrinking.

bubblethink · 3 months ago

>Each one costs ~$2 million, so that is $40 million.

Pricing for exotic hardware that is not manufactured at scale is quite meaningless. They are selling tokens over an API. The token pricing is competitive with other token APIs.

ryao · 3 months ago

Last year, I took the time to read through public documents and estimated that their annual production was limited to ~300 wafers per year from TSMC. That is not Nvidia level scale, but it is scale.

There are many companies that sell tokens from an API and many more that need hardware to compute tokens. Cerebras posted a comparison of hardware options for these companies, so evaluating it as such is meaningful. It is perhaps less meaningful to the average person who cannot afford the barrier to entry to afford this hardware, but there are plenty of people curious what the options are for the companies that sell tokens through APIs, as those impact available capacity.

jenny91 · 3 months ago

I agree on the first. On the second: I would bet a lot of money that they aren't actually breaking even on their API (or even close to). They don't have a "pay as you go" per-token tier, it's all geared up to demonstrate use of their API as a novelty. They're probably burning cash on every single token. But their valuation and hype has surely gone way up since they got onto LLMs.

littlestymaar · 3 months ago

> This is incorrect. The unreleased Llama 4 Behemoth is the largest and most powerful in the Llama 4 family.

Emphasis mine.

Behemoth may become the largest and most powerful llama model, but right now it's nothing but vaporware. Maverick is currently the largest and more powerful llama model today (and if I had to bet, my money would be on Meta discarding Llama4 Behemoth entirely it eventually without having released it, and moving on to the next version number).

attentive · 3 months ago

> Also, Cerebras is the company that not only was saying that their hardware was not useful for inference until some time last year, but even partnered with Qualcomm with the claim that Qualcomm’s accelerators had a 10x price performance improvement over their things

Mistral says they run Le Chat on Cerebras

ryao · 3 months ago

How is that related to the claim that Cerebras themselves made about their hardware’s price performance ratio?

https://www.cerebras.ai/press-release/cerebras-qualcomm-anno...

arisAlexis · 3 months ago

Also perplexity

addaon · 3 months ago

> SRAM scaling is dead

I'm /way/ outside my expertise here, so possibly-silly question. My understanding (any of which can be wrong, please correct me!) is that (a) the memory used for LLMs is dominantly parameters, which are read-only during inference; (b) SRAM scaling may be dead, but NVM scaling doesn't seem to be; (c) NVM read bandwidth scales well locally, within an order of magnitude or two of SRAM bandwidth, for wide reads; (d) although NVM isn't currently on leading-edge processes, market forces are generally pushing NVM to smaller and smaller processes for the usual cost/density/performance reasons.

Assuming that cluster of assumptions is true, does that suggest that there's a time down the road where something like a chip-scale-integrated inference chip using NVM for parameter storage solves?

ryao · 3 months ago

The processes used for logic chips, and the processes used for NVM are typically different. The only case I know of the industry combining them onto a single chip would be Texas Instruments’ MSP430 microcontrollers with FeRAM, but the quantities of FeRAM are incredibly small there and the process technology is ancient. It seems unlikely to me that the rest of the industry will combine the processes such that you can have both on a single wafer, but you would have better luck asking a chip designer.

That said, NVM often has a wear-out problem. This is a major disincentive for using it in place of SRAM, which is frequently written. Different types of NVM have different endurance limits, but if they did build such a chip, it is only a matter of time before it stops working.

timschmidt · 3 months ago

> I have trouble getting excited about Cerebras. SRAM scaling is dead, so short of figuring out how to 3D stack their wafer scale chips

AMD and TSMC are stacking SRAM on the chip scale. I imagine they could accomplish it at the wafer scale. It'll be neat if we can get hundreds of layers in time, like flash.

Your analysis seems spot on to me.

latchkey · 3 months ago

More on the CPU side than the GPU side. GPU is still dominated by HBM.

nsteel · 3 months ago

Assume you meant Intel, rather than AMD?

Deleted Comment

skryl · 3 months ago

Performance per watt is better than h100 and b200, performance per watt per $ is worse than B200, and it does fp8 just fine

https://arxiv.org/pdf/2503.11698

skryl · 3 months ago

One caveat is that this paper only covers training, which can be done on a single CS-3 using external memory (swapping weights in and out of SRAM). There is no way that a single CS-3 will hit this record inference performance with external memory so this was likely done with 10-20 CS-3 chips and the full model in SRAM. Definitely can’t compare token/$ with that kind of setup vs a DGX.

ryao · 3 months ago

Thanks for the correction. They are currently using FP16 for inference according to OpenRouter. I had thought that implied that they could not use FP8 given the pressure that they have to use as little memory as possible from being solely reliant on SRAM. I wonder why they opted to use FP16 instead of FP8.

lern_too_spel · 3 months ago

Performance per watt per dollar is a useless metric as calculated. You can't spend more money on B200s to get more performance per watt.

x-complexity · 3 months ago

Pretty much no disagreements IMO.

By the time the CSE-5 is rolled out, it *needs* at least 500GB of SRAM to make it worthwhile. Multi-layer wafer stacking's the only path to advance this chip.

moralestapia · 3 months ago

>Their hardware does inference with FP16, so they need ~20 of their CSE-3 chips to run this model.

Care to explain? I don't see it.

acchow · 3 months ago

CSE-3 chip has 44GB, which can hold 22B parameters in FP16.

400B parameters would need 18 chips. Then you need a bit more ram for other stuff

> The most important AI applications being deployed in enterprise today—agents, code generation, and complex reasoning—are bottlenecked by inference latency

Is this really true today? I don't work in enterprise, so don't know how things look like, but I'm sure lots of people here do, and it feels unlikely that inference latency is the top bottleneck, even above humans or waiting for human input? Maybe I'm just using LLMs very differently from how they're deployed in a enterprise, but I'm by far the biggest bottleneck in my setup currently.

baq · 3 months ago

It is if you want good results. I’ve been giving Gemini pro prompts for 200+ seconds multiple times per day this week and for such tasks I really like to make it double/triple check and sometimes give the results to Claude for review, too (and vice versa).

Ideally I can just run the prompt 100x and have it pick the best solution later. That’s prohibitively expensive and a waste of time today.

diggan · 3 months ago

> That’s prohibitively expensive

Assuming you experience is working within enterprise, you're then saying that cost is the biggest bottleneck currently?

Also surprising to me that enterprises would use out-of-the-box models like that, I was expecting at least fine-tuned models be used most of the time, for very specific tasks/contexts, but maybe that's way optimistic.

tiffanyh · 3 months ago

How do you create a prompt for Gemini to spend 200 seconds and review multiple times.

Is it as simple as stating in the prompt:

  Spend 200+ seconds and review multiple times <question/task>

threeseed · 3 months ago

Only an insignificant minority of companies are running their own AI LLM models.

Everyone else is perfectly fine using whatever Azure, GCP etc provide. Enterprise companies don't need to be the fastest or have the best user experience. They need to be secure, trusted and reliable. And you get that by using cloud offerings by default and only going third party when there is a serious need.

aktuel · 3 months ago

If you think that cloud offerings are secure and trustworthy by default you truly must be living under a rock.

qu0b · 3 months ago

True, the biggest bottleneck is formulating the right task list and ensuring the LLM is directed to find the relevant context it needs. I feel LLMs in their instruction following are often to eager to output rather than using tools (read files) in their reasoning step.