maz1b · 3 months ago
Cerebras has been a true revelation when it comes to inference. I have a lot of respect for their founder, team, innovation, and technology. The colossal size of the WSE-3 chip, utilizing DRAM to mind-boggling scale... it's definitely ultra cool stuff.

I also wonder why they have not been acquired yet. Or is it intentional?

I will say, their pricing and deployment strategy is a bit murky and unclear. Paying $1500-$10,000 per month plus usage costs? I'm assuming that it has to do with chasing and optimizing for higher value contracts and deeper-pocketed customers, hence the minimum monthly spend that they require.

I'm not claiming to be an expert, but as a CEO/CTO, there were other providers in the market that had relatively comparable inference speed (obviously Cerebras is #1), easier onboarding, and better responses from the people who worked there (all of my interactions with Cerebras have been days/weeks late or simply ignored). IMHO, if Cerebras wants to gain more mindshare, they'll have to look into this aspect.

aurareturn · 3 months ago

  I also wonder why they have not been acquired yet. Or is it intentional?
A few issues:

1. To achieve high speeds, they put everything on SRAM. I estimated that they needed over $100m of chips just to do Qwen 3 at max context size (rough back-of-envelope sketched after this list). You can run the same model with max context size on $1m of Blackwell chips but at a slower speed. AnandTech had an article saying that Cerebras was selling a single chip for around $2-3m. https://news.ycombinator.com/item?id=44658198

2. SRAM has virtually stopped scaling in new nodes. Therefore, new generations of wafer scale chips won’t gain as much as traditional GPUs.

3. Cerebras was designed in the pre-ChatGPT era, when much smaller models were being trained. It is practically useless for training in 2025 because of how big LLMs have gotten. It can only do inference, but see the two problems above.

4. To run inference on very large LLMs economically, Cerebras would need to use external HBM. If it has to reach outside the wafer for memory, the benefits of a wafer scale chip greatly diminish. Remember that the whole idea was to put the entire AI model inside the wafer so memory bandwidth is ultra fast.

5. Chip interconnect technology might make wafer scale chips redundant. TSMC has a roadmap for gluing more than 2 GPU dies together. Nvidia's Feynman GPUs might have 4 dies glued together. I.e., the sweet spot for large chips might not be wafer scale but perhaps 2, 4, or 8 GPUs together.

6. Nvidia seems to be moving much faster in terms of development and responding to market needs. For example, Blackwell is focused on FP4 inferencing now. I suppose designing and building a wafer scale chip is inherently more complex than designing a GPU. Cerebras also needs to wait for new nodes to fully mature so that yields can be higher.
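
To make the point-1 estimate concrete, here is the kind of back-of-envelope I mean if weights plus KV cache all had to live in on-chip SRAM. Every input is an assumption (model size, FP16 weights, KV bytes per token, concurrency), not a Cerebras or Qwen spec figure:

    # Rough sketch only; all inputs below are assumptions, not vendor figures.
    import math

    WEIGHT_BYTES   = 480e9 * 2      # Qwen3-Coder-480B, assuming FP16 weights
    KV_PER_TOKEN   = 0.2e6          # assumed KV-cache bytes per token
    CONTEXT        = 131_072        # "max context size"
    CONCURRENCY    = 64             # assumed concurrent max-length sequences
    SRAM_PER_WAFER = 44e9           # ~44 GB of on-chip SRAM per WSE-3
    WAFER_COST     = 2.5e6          # midpoint of the ~$2-3m per-chip figure above

    total_bytes = WEIGHT_BYTES + KV_PER_TOKEN * CONTEXT * CONCURRENCY
    wafers = math.ceil(total_bytes / SRAM_PER_WAFER)
    print(wafers, f"${wafers * WAFER_COST / 1e6:.0f}m")  # ~60 wafers, ~$150m with these inputs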

There exists a niche where some applications might need super fast token generation regardless of price. Hedge funds and Wall Street might be good use cases. But it won't challenge Nvidia in training or large scale inference.

sailingparrot · 3 months ago
> I estimated that they needed over $100m of chips just to do Qwen 3 at max context size

I will point out (again :) that this math is completely wrong. There is no need (nor performance gain) to store the entire weights of the model in SRAM. You simply store n transformer blocks on-chip and then stream block l+n from external memory to on-chip memory when you start computing block l. This completely masks the communication time behind the compute time, and specifically does not require you to buy $100M worth of SRAM. This is standard stuff that is done routinely in many scenarios, e.g. FSDP.

https://www.cerebras.ai/blog/cerebras-software-release-2.0-5...
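
A minimal sketch of the idea, with hypothetical load_block_weights / run_block helpers (nothing Cerebras-specific): prefetch the weights for block l+1 while block l is computing, so the copy from external memory hides behind the compute:

    # Sketch only: double-buffered weight streaming. load_block_weights and
    # run_block are illustrative placeholders, not a real Cerebras/FSDP API.
    from concurrent.futures import ThreadPoolExecutor

    def run_layers_streamed(x, num_blocks, load_block_weights, run_block):
        with ThreadPoolExecutor(max_workers=1) as io:
            pending = io.submit(load_block_weights, 0)      # prefetch block 0
            for l in range(num_blocks):
                weights = pending.result()                  # only waits if the copy is slower than compute
                if l + 1 < num_blocks:
                    pending = io.submit(load_block_weights, l + 1)  # overlap next copy with this compute
                x = run_block(weights, x)                   # compute block l
        return x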

addaon · 3 months ago
> SRAM has virtually stopped scaling in new nodes.

But there are several 1T memories that are still scaling, more or less — eDRAM, MRAM, etc. Is there anything preventing their general architecture from moving to a 1T technology once the density advantages outweigh the need for pipelining to hide access time?

sinuhe69 · 3 months ago
No, only Groq uses the all-SRAM approach; Cerebras only uses SRAM for local context while the weights are still loaded from RAM (or HBM). With 48 KB per node, the whole wafer has only 44 GB of SRAM, much lower than the amount needed to load the whole network.
arisAlexis · 3 months ago
But apparently they serve all the models super fast and in production, so you must be wrong somewhere.
oceanplexian · 3 months ago
I’ve been using them as a customer and have been fairly impressed. The thing is, a lot of inference providers might seem better on paper but it turns out they’re not.

Recently there was a fiasco I saw posted on r/localllama where many of the OpenRouter providers were degraded on benchmarks compared to base models, implying they are serving up quantized models to save costs but lying to customers about it. Unless you're actually auditing the tokens you're purchasing, you may not be getting what you're paying for, even if the T/s and $/token seem better.

dlojudice · 3 months ago
OpenRouter should be responsible for this quality control, right? It seems to me to be the right player in the chain, with the duty and the scale to do so.
teruakohatu · 3 months ago
> many of the OpenRouter providers were degraded on benchmarks compared to base models, implying they are serving up quantized models to save costs,

Do you have information on this? This seems like brand destroying for both OpenRouter and the model providers.

throw123890423 · 3 months ago
> I will say, their pricing and deployment strategy is a bit murky and unclear. Paying $1500-$10,000 per month plus usage costs? I'm assuming that it has to do with chasing and optimizing for higher value contracts and deeper-pocketed customers, hence the minimum monthly spend that they require.

Yeah wait, why rent chips instead of sell them? Why wouldn't customers want to invest money in competition for cheaper inference hardware? It's not like Nvidia has a blacklist of companies that have bought chips from competitors, or anything. Now that would be crazy! That sure would make this market tough to compete in, wouldn't it. I'm so glad Nvidia is definitely not pressuring companies to not buy from competitors or anything.

aurareturn · 3 months ago
Their chips weren’t selling because:

1. They're useless for training in 2025. They were designed for training prior to the LLM explosion. They're not practical for training anymore because they rely on SRAM, which is not scalable.

2. No one is going to spend the resources to optimize models to run on their SDK and hardware. Open source inference engines don’t optimize for Cerebras hardware.

Given the above two reasons, it makes a lot of sense that no one is investing in their hardware and they have switched to a cloud model selling speed as the differentiator.

It’s not always “Nvidia bad”.

OkayPhysicist · 3 months ago
The UAE has sunk a lot of money into them, and I suspect it's not purely a financial move. If that's the case, an acquisition might be more complicated than it would seem at first glance.
nsteel · 3 months ago
> utilizing DRAM to mind-boggling scale

I thought it was the SRAM scaling that was impressive, no?

maz1b · 3 months ago
oops, typo! S and D are next to each other on the keyboard. thanks for pointing this out
liuliu · 3 months ago
They have been an acquisition target since 2017 (per the OpenAI internal emails). So the lack of an acquisition is not for lack of interest. Makes you wonder what happened in those due-diligence processes.
Shakahs · 3 months ago
Sonnet/Claude Code may technically be "smarter", but Qwen3-Coder on Cerebras is often more productive for me because it's just so incredibly fast. Even if it takes more LLM calls to complete a task, those calls are all happening in a fraction of the time.
nerpderp82 · 3 months ago
We must have very different workflows; I am curious about yours. What tools are you using and how are you guiding Qwen3-Coder? When I am using Claude Code, it often works for 10+ minutes at a time, so I am not aware of inference speed.
solarkraft · 3 months ago
You must write very elaborate prompts for 10 minutes to be worth the wait. What permissions are you giving it and how much do you care about the generated code? How much time did you spend on initial setup?

I've found that the best way for myself to do LLM-assisted coding at this point in time is in a somewhat tight feedback loop. I find myself wanting to refine the code and architectural approaches a fair amount as I see them coming in, and latency matters a lot to me here.

CaptainOfCoit · 3 months ago
> When I am using Claude Code, it often works for 10+ minutes at a time, so I am not aware of inference speed.

Indirectly, it sounds like you are aware of the inference speed? Imagine if it took 2 minutes instead of 10 minutes; that's what the parent means.

ripped_britches · 3 months ago
Do you use cursor or what? Interested in how you set this up
Shakahs · 3 months ago
I use it via the Kilo Code extension for VSCode, which is invoking Qwen3-Coder via a Cerebras Code subscription.

https://github.com/Kilo-Org/kilocode
https://www.cerebras.ai/blog/introducing-cerebras-code

sdesol · 3 months ago
> Sonnet/Claude Code may technically be "smarter", but Qwen3-Coder on Cerebras is often more productive for me because it's just so incredibly fast.

Saying "technically" is really underselling the difference in intelligence, in my opinion. Claude and Gemini are much, much smarter and I trust them to produce better code, but you honestly can't deny the excellent value that Qwen 3, the inference speed, and $50/month for 25M tokens per day bring to the table.

Since I paid for the Cerebras Pro plan, I've decided to force myself to use it as much as possible for the duration of the month for developing my chat app (https://github.com/gitsense/chat), and here are some of my thoughts so far:

- Qwen3 Coder is a lot dumber when it comes to prompting; Gemini and Claude are much better at reading between the lines. However, since the speed is so good, I often don't care, as I can go back to the message, make some simple clarifications, and try again.

- The max context window size of 128k for Qwen 3 Coder 480B on their platform can be a serious issue if you need a lot of documentation or code in context.

- I've never come close to the 25M tokens per day limit for their Pro Plan. The max I am using is 5M/day.

- The inference speed + a capable model like Qwen 3 will open up use cases most people might not have thought of before.

I will probably continue to pay for the $50 plan for these use cases:

1. Applying LLM generated patches

Qwen 3 Coder is very much capable of applying patches generated by Sonnet and Gemini (a minimal sketch of the kind of call I mean follows after this list). It is slower than what https://www.morphllm.com/ provides, but it is definitely fast enough for most people not to care. The cost savings can be quite significant depending on the work.

2. Building context

Since it is so fast, and because the 25M token per day limit is such a high limit for me, I am finding myself loading more files into context and just asking Qwen to identify files that I will need and/or summarize things so I can feed them into Sonnet or Gemini to save me significant money.

3. AI Assistant

Due to its blazing speed, you can analyze a lot of data quickly for deterministic searches, and because it can review results at such a great speed, you can do multiple search-and-review loops without feeling like you are waiting forever.
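
Going back to use case 1: here is a minimal sketch of what applying a patch through Cerebras looks like, assuming their OpenAI-compatible API. The base URL and model id below are from memory, so verify them against the Cerebras docs:

    # Sketch only: base_url and model id are assumptions; check the Cerebras docs.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.cerebras.ai/v1",
        api_key=os.environ["CEREBRAS_API_KEY"],
    )

    def apply_patch(original_file: str, change_description: str) -> str:
        # Use the fast/cheap model to rewrite a file with a change that a
        # smarter model (Sonnet/Gemini) already described.
        resp = client.chat.completions.create(
            model="qwen-3-coder-480b",
            temperature=0,
            messages=[
                {"role": "system", "content": "Apply the described change to the file. Return only the complete updated file."},
                {"role": "user", "content": f"FILE:\n{original_file}\n\nCHANGE:\n{change_description}"},
            ],
        )
        return resp.choices[0].message.content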

Given what I've experienced so far, I don't think Cerebras can be a serious platform for coding if Qwen 3 Coder is the only available model. Having said that, given the inference speed and Qwen being more than capable, I can see Cerebras becoming a massive cost-savings option for many companies and developers, which is where I think they might win a lot of enterprise contracts.

fcpguru · 3 months ago
Their core product is the Wafer Scale Engine (WSE-3) — the largest single chip ever made for AI, designed to train and run models much faster and more efficiently than traditional GPUs.

Just tried https://cloud.cerebras.ai wow is it fast!

mythz · 3 months ago
Running Qwen3 coder at speed is great, but would also prefer to have access to other leading OSS models like GLM 4.6, Kimi K2 and DeepSeek v3.2 before considering switching subs.

Groq also runs OSS models at speed, which is my preferred way to access Kimi K2 on their free quotas.

OGEnthusiast · 3 months ago
I'm surprised how under-the-radar Cerebras is. Being able to get near-instantaneous responses from Qwen3 and gpt-oss is pretty incredible.
data-ottawa · 3 months ago
I wish I could invest in them. Agree they're under the radar.
rbitar · 3 months ago
Congrats to the team. I'm surprised the industry hasn't been more impressed with their benchmarks on token throughput. We're using the Qwen 3 Coder 480B model and seeing ~2000 tokens/second, which is easily 10-20x faster than most LLMs on the market. Even some of the fastest models still only achieve 100-150 tokens/second (see OpenRouter stats by provider). I do feel that after around 300-400 tokens/second the gains in speed feel more incremental, so if there were a model at 300+ tokens/second, I would consider that a very competitive alternative.
rvz · 3 months ago
Sooner or later, lots of competitors, including Cerebras, are going to eat into Nvidia's data center market share, and it will cause many AI model firms to question the unnecessary spend and hoarding of GPUs.

OpenAI is still developing their own chips with Broadcom, but they are not operational yet. So for now, they're buying GPUs from Nvidia to build up their own revenue (to later spend on their own chips).

By 2030, many companies will be looking for alternatives to Nvidia, like Cerebras or Lightmatter, for both training and inference use cases.

For example [0], Meta just acquired a chip startup for this exact reason: "An alternative to training AI systems" and "to cut infrastructure costs linked to its spending on advanced AI tools."

[0] https://www.reuters.com/business/meta-buy-chip-startup-rivos...

onlyrealcuzzo · 3 months ago
There's so much optimization to be made when developing the model and the hardware it runs on that most of the big players are likely to run a non-trivial percentage of their workloads on proprietary chips eventually.

If that's 5 years into the future, that looks bad for Nvidia; if it's >10 years in the future, it doesn't affect Nvidia's current stock price very much.

JLO64 · 3 months ago
My experience with Cerebras is pretty mixed. On the one hand for simple and basic requests, it truly is mind blowing how fast it is. That said, I’ve had nothing but issues and empty responses whenever I try to use them for coding tasks (Opencode via Openrouter, GPT-OSS). It’s gotten to a point where I’ve disabled them as a provider on Openrouter.
divmain · 3 months ago
I experienced the same, but I think it is a limitation of OpenRouter. When I hit the Cerebras OpenAI-compatible endpoint directly, it works flawlessly.