eigenvalue · 2 years ago
I just want to say that this is one of the most impressive tech demos I’ve ever seen in my life, and I love that it’s truly an open demo that anyone can try without even signing up for an account or anything like that. It’s surreal to see the thing spitting out tokens at such a crazy rate when you’re used to watching them generate at less than one fifth that speed. I’m surprised you guys haven’t been swallowed up by Microsoft, Apple, or Google already for a huge premium.
tome · 2 years ago
Really glad you like it! We've been working hard on it.
jonplackett · 2 years ago
Is this useful for training as well as running a model, or is this approach specifically for running an already-trained model faster?
lokimedes · 2 years ago
The speed part or the being swallowed part?
RecycledEle · 2 years ago
If I understand correctly, each chip has 200 MB of RAM, so it takes racks to run a single LLM.

That does not sound like progress to me.

We need single PCIe boards with dozens or hundreds of GB of RAM and processors that handle it well.

elorant · 2 years ago
Perplexity Labs also has an open demo of Mixtral 8x7b although it's nowhere near as fast as this.

https://labs.perplexity.ai/

vitorgrs · 2 years ago
Poe has a bunch of them, including Groq as well!
RockyMcNuts · 2 years ago
ok... why tho? genuinely ignorant and extremely curious.

what's the TFLOPS/$ and TFLOPS/W and how does it compare with Nvidia, AMD, TPU?

from quick Googling I feel like Groq has been making these sorts of claims since 2020 and yet people pay a huge premium for Nvidia and Groq doesn't seem to be giving them much of a run for their money.

of course if you run a much smaller model than ChatGPT on similar or more powerful hardware it might run much faster but that doesn't mean it's a breakthrough on most models or use cases where latency isn't the critical metric?

larodi · 2 years ago
why sell? it would be much more delightful to beat them on their own game?
brcmthrowaway · 2 years ago
I have it on good authority Apple was very close to acquiring Groq
baq · 2 years ago
If this is true, expect a call from the SEC...
timomaxgalvin · 2 years ago
Sure, but the responses are very poor compared to MS tools.
treesciencebot · 2 years ago
The main problem with the Groq LPUs is that they don't have any HBM on them at all. Just a minuscule (230 MiB) [0] amount of ultra-fast SRAM (20x faster than HBM3, just to be clear). Which means you need ~256 LPUs (4 full server racks of compute, where each unit on a rack contains 8x LPUs and there are 8x of those units per rack) just to serve a single model [1], whereas you can get a single H200 (1/256 of the server rack density) and serve these models reasonably well.

It might work well if you have a single model with lots of customers, but as soon as you need more than a single model and a lot of finetunes/high rank LoRAs etc., these won't be usable. Or for any on-prem deployment since the main advantage is consolidating people to use the same model, together.

[0]: https://wow.groq.com/groqcard-accelerator/

[1]: https://twitter.com/tomjaguarpaw/status/1759615563586744334
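For a rough sense of scale, here's the arithmetic behind that chip count (the parameter count and the FP8-at-rest assumption are mine, not Groq's published figures):

```python
# Back-of-the-envelope: how many 230 MiB SRAM chips does Mixtral 8x7B need?
# Assumptions (illustrative): ~46.7B total params, 1 byte/param (FP8) at rest,
# and that weights alone dominate on-chip memory (ignores KV cache, activations,
# and any duplication the compiler introduces for pipelining).

params = 46.7e9              # Mixtral 8x7B total parameter count
bytes_per_param = 1          # FP8 storage
sram_per_chip = 230 * 2**20  # 230 MiB of SRAM per LPU

chips_needed = params * bytes_per_param / sram_per_chip
print(f"~{chips_needed:.0f} chips just to hold the weights")  # ~194
```

That lands in the same ballpark as the ~256 LPU figure above once you add room for activations, KV cache, and pipelining overhead.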

matanyall · 2 years ago
Groq Engineer here, I'm not seeing why being able to scale compute outside of a single card/node is somehow a problem. My preferred analogy is to a car factory: Yes, you could build a car with say only one or two drills, but a modern automated factory has hundreds of drills! With a single drill, you could probably build all sorts of cars, but a factory assembly line is only able to make specific cars in that configuration. Does that mean that factories are inefficient?

You also say that H200's work reasonably well, and that's reasonable (but debatable) for synchronous, human interaction use cases. Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

pbalcer · 2 years ago
Just curious, how does this work out in terms of TCO (even assuming the price of a Groq LPU is 0$)? What you say makes sense, but I'm wondering how you strike a balance between massive horizontal scaling vs vertical scaling. Sometimes (quite often in my experience) having a few beefy servers is much simpler/cheaper/faster than scaling horizontally across many small nodes.

Or I got this completely wrong, and your solution enables use-cases that are simply unattainable on mainstream (Nvidia/AMD) hardware, making TCO argument less relevant?

huac · 2 years ago
> 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

I believe that this is doable - my pipeline is generally closer to 400ms without RAG and with Mixtral, with a lot of non-ML hacks to get there. It would also definitely be doable with a joint speech-language model that removes the transcription step.

For these use cases, time to first byte is the most important metric, not total throughput.

treprinum · 2 years ago
> Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia

I built one, should be live soon ;-)

startupsfail · 2 years ago
I have one, with 13B, on a 5-year-old 48GB Q8000 GPU. It can also see; it's LLaVA. And it's very important that it is local, as privacy is important and streaming images to the cloud is time consuming.

You only need a few tokens, not the full 500-token response, to start running TTS. And you can pre-generate responses online while ASR is still in progress. With a bit of clever engineering the response starts with virtually no delay, the moment it's natural to start the response.

jrflowers · 2 years ago
>Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

Is your version of that on a different page from this chat bot?

mlazos · 2 years ago
You can’t scale horizontally forever because of communication. I think HBM would provide a lot more flexibility with the number of chips you need.
fennecbutt · 2 years ago
Are there voice responses in the demo? I couldn't find em?
chaunnyong · 2 years ago
Hi Matanyal, we worked with groq for a project last year. would you be open to connect on LinkedIn? :)
trsohmers · 2 years ago
Groq states in this article [0] that they used 576 chips to achieve these results. Continuing with your analysis, you also need to factor in that each additional user requires a separate KV cache, which can add multiple more gigabytes per user.

My professional independent observer opinion (not based on my 2 years of working at Groq) would have me assume that their COGS to achieve these performance numbers would exceed several million dollars, so depreciating that over expected usage at the theoretical prices they have posted seems impractical, so from an actual performance per dollar standpoint they don’t seem viable, but do have a very cool demo of an insane level of performance if you throw cost concerns out the window.

[0]: https://www.nextplatform.com/2023/11/27/groq-says-it-can-dep...
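To put numbers on the KV-cache point, here is a sketch of per-user cache size for a 70B-class model. The shapes below are illustrative (classic multi-head attention; real Llama-2-70B uses grouped-query attention with 8 KV heads, which cuts this by roughly 8x):

```python
# Rough KV-cache footprint per concurrent user at full context length.
# All shapes are assumptions for illustration, not a specific vendor's config.

layers = 80          # transformer blocks
heads = 64           # KV heads (no grouped-query attention assumed)
head_dim = 128       # per-head dimension
seq_len = 4096       # context length
bytes_per_value = 2  # FP16

# K and V each store (heads * head_dim) values per layer per position:
kv_bytes = 2 * layers * heads * head_dim * seq_len * bytes_per_value
print(f"~{kv_bytes / 2**30:.0f} GiB of KV cache per user at full context")  # ~10 GiB
```

Which is why "multiple more gigabytes per user" bites hard on a system with only a few hundred MiB of SRAM per chip.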

tome · 2 years ago
Thomas, I think for full disclosure you should also state that you left Groq to start a competitor (a competitor which doesn't have the world's lowest latency LLM engine, nor a guarantee to match the cheapest per-token prices, like Groq does).

Anyone with a serious interest in the total cost of ownership of Groq's system is welcome to email contact@groq.com.

Aeolun · 2 years ago
I think that just means it’s for people that really want it?

John doe and his friends will never have a need to have their fart jokes generated at this speed, and are more interested in low costs.

But we’d recently been doing call center operations and being able to quickly figure out what someone said was a major issue. You kind of don’t want your system to wait for a second before responding each time. I can imagine it making sense if it reduces the latency to 10ms there as well. Though you might still run up against the ‘good enough’ factor.

I guess few people want to spend millions to go from 1000ms to 10ms, but when they do they really want it.

nickpsecurity · 2 years ago
What happened to Rex? Did it hit production or get abandoned?

It was also on my list of things to consider modifying for an AI accelerator. :)

tome · 2 years ago
If you want low latency you have to be really careful with HBM, not only because of the delay involved, but also the non-determinism. One of the huge benefits of our LPU architecture is that we can build systems of hundreds of chips with fast interconnect and we know the precise timing of the whole system to within a few parts per million. Once you start integrating non-deterministic components your latency guarantees disappear very quickly.
pclmulqdq · 2 years ago
I don't know about HBM specifically, but DDR and GDDR at a protocol level are both deterministic. It's the memory controller doing a bunch of reordering that makes them non-deterministic. Presumably, if that is the reason you don't like DRAM, you could build your compiler to be memory-layout aware and have the memory controller issue commands without reordering.
frognumber · 2 years ago
From a theoretical perspective, this is absolutely not true. Asynchronous logic can achieve much lower latency guarantees than synchronous logic.

Come to think of it, this is one of the few places where asynchronous logic might be more than academic... Async logic is hard with complex control flows, which deep learning inference does not have.

(From a practical perspective, I know you were comparing to independently-clocked logic, rather than async logic)

SilverBirch · 2 years ago
Surely once you're scaling over multiple chips/servers/racks you're dealing with retries and checksums and sequence numbers anyway? How do you get around the non-determinism of networking, beyond just hoping that you don't see any errors?
pclmulqdq · 2 years ago
Groq devices are really well set up for small-batch-size inference because of the use of SRAM.

I'm not so convinced they have a Tok/sec/$ advantage at all, though, and especially at medium to large batch sizes which would be the groups who can afford to buy so much silicon.

I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1, and Nvidia cards do get meaningfully higher throughput as batch size gets into the 100's.

foundval · 2 years ago
(Groq Employee) It's hard to discuss Tok/sec/$ outside of the context of a hardware sales engagement.

This is because the relationship between Tok/s/u, Tok/s/system, Batching, and Pipelining is a complex one that involves compute utilization, network utilization, and (in particular) a host of compilation techniques that we wouldn't want to share publicly. Maybe we'll get to that level of transparency at some point, though!

As far as Batching goes, you should consider that with synchronous systems, if all the stars align, Batch=1 is all you need. Of course, the devil is in the details, and sometimes small batch numbers still give you benefits. But Batch 100's generally gives no advantages. In fact, the entire point of developing deterministic hardware and synchronous systems is to avoid batching in the first place.

frozenport · 2 years ago

> I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1
I guess if you don't have any extra junk you can pack more processing into the chip?

nabakin · 2 years ago
I've been thinking the same but on the other hand, that would mean they are operating at a huge loss which doesn't scale
londons_explore · 2 years ago
> more than a single model and a lot of finetunes/high rank LoRAs

I can imagine a way might be found to host a base model and a bunch of LoRA's whilst using barely more ram than the base model alone.

The fine-tuning could perhaps be done in such a way that only perhaps 0.1% of the weights are changed, and for every computation the difference is computed not over the weights, but of the output layer activations.
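What's being described is essentially LoRA serving: keep one copy of the base weights and apply each tenant's delta as a low-rank correction on the activations, rather than materializing a second full weight matrix. A minimal NumPy sketch (sizes and rank are illustrative):

```python
import numpy as np

# One shared base weight matrix W, plus a tiny per-tenant adapter (A, B).
# The adapter's effect is computed on the activations: y = W @ x + B @ (A @ x).

d, r = 4096, 8                                       # hidden size, adapter rank
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)).astype(np.float32)   # shared base weights
A = rng.standard_normal((r, d)).astype(np.float32)   # per-tenant adapter (down)
B = np.zeros((d, r), dtype=np.float32)               # per-tenant adapter (up),
                                                     # zero-init so delta starts at 0
x = rng.standard_normal(d).astype(np.float32)

# Base path plus low-rank correction:
y = W @ x + B @ (A @ x)

# Adapter memory is 2*d*r floats vs d*d for a full copy of W:
overhead = (2 * d * r) / (d * d)
print(f"adapter adds {overhead:.2%} of the base layer's memory")  # 0.39%
```

So hosting hundreds of fine-tunes on top of one resident base model is memory-cheap; the hard part is batching requests that use different adapters, which is what the S-LoRA work below addresses.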

kcorbitt · 2 years ago
This actually already exists! We did a writeup of the relevant optimizations here: https://openpipe.ai/blog/s-lora
xzyaoi · 2 years ago
There's also papers for hosting full-parameter fine-tuned models: https://arxiv.org/abs/2312.05215

Disclaimer: I'm one of the authors.

azeirah · 2 years ago
I recall a recent discussion about a technique to load the diff in weights between a lora and base model, zip it and transfer it on a per-needs basis.
moralestapia · 2 years ago
>The main problem with the Groq LPUs is, they don't have any HBM on them at all. Just a miniscule (230 MiB) [0] amount of ultra-fast SRAM [...]

IDGAF about any of that, lol. I just want an API endpoint.

480 tokens/sec at $0.27 per million tokens? Sign me in, I don't care about their hardware, at all.

treesciencebot · 2 years ago
there are providers out there offering $0 per million tokens; that doesn't mean it's sustainable and won't disappear as soon as the VC well runs dry. I'm not saying this is the case for Groq, but in general you probably should care if you want to build something serious on top of anything.
imtringued · 2 years ago
I honestly don't see the problem.

"just to serve a single model" could be easily fixed by adding a single LPDDR4 channel per LPU. Then you can reload the model sixty times per second and serve 60 different models per second.

treesciencebot · 2 years ago
per-chip compute is not the main thing this chip innovates on for fast inference, it is the extremely fast memory bandwidth. when you do that, you'll lose all of that and will be much worse off than any off-the-shelf accelerator.
karpathy · 2 years ago
Very impressive looking! Just wanted to caution it's worth being a bit skeptical without benchmarks as there are a number of ways to cut corners. One prominent example is heavy model quantization, which speeds up the model at a cost of model quality. Otherwise I'd love to see LLM tok/s progress exactly like CPU instructions/s did a few decades ago.
tome · 2 years ago
As a fellow scientist I concur with the approach of skepticism by default. Our chat app and API are available for everyone to experiment with and compare output quality with any other provider.

I hope you are enjoying your time of having an empty calendar :)

mr_luc · 2 years ago
Wait you have an API now??? Is it open, is there a waitlist? I’m on a plane but going to try to find that on the site. Absolutely loved your demo, been showing it around for a few months.
bsima · 2 years ago
As tome mentioned we don’t quantize, all activations are FP16

And here are some independent benchmarks https://artificialanalysis.ai/models/llama-2-chat-70b

xvector · 2 years ago
Jesus Christ, these speeds with FP16? That is simply insane.
sp332 · 2 years ago
At least for the earlier Llama 70B demo, they claimed to be running unquantized. https://twitter.com/lifebypixels/status/1757619926360096852

Update: This comment says "some data is stored as FP8 at rest" and I don't know what that means. https://news.ycombinator.com/item?id=39432025

tome · 2 years ago
The weights are quantized to FP8 when they're stored in memory, but all the activations are computed at full FP16 precision.
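The pattern tome describes (compact weights at rest, wider compute precision) can be sketched as follows. NumPy has no FP8 dtype, so int8 plus a per-tensor scale stands in for FP8 here; this illustrates the storage/compute split only, not Groq's actual scheme:

```python
import numpy as np

# Sketch of "weights quantized at rest, activations computed in FP16".
# int8 + scale emulates an 8-bit at-rest format; the matmul runs in FP16.

rng = np.random.default_rng(0)
w_fp16 = rng.standard_normal((256, 256)).astype(np.float16)

scale = float(np.abs(w_fp16).max()) / 127.0
w_q = np.round(w_fp16 / scale).astype(np.int8)   # compact "at rest" form

x = rng.standard_normal(256).astype(np.float16)

# Dequantize to FP16 just before use; compute stays in the wider format.
w_deq = w_q.astype(np.float16) * np.float16(scale)
y = w_deq @ x

err = float(np.abs(y - w_fp16 @ x).max())
print(f"max error vs unquantized: {err:.3f}")
```

The win is halved weight-memory traffic and footprint, at the cost of a small, bounded quantization error in the weights only.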
bearjaws · 2 years ago
Nothing really wrong with FP8 IMO, it performs pretty damn well usually within 98% while significantly reducing memory usage.
Gcam · 2 years ago
As part of our benchmarking of Groq we have asked Groq regarding quantization and they have assured us they are running models at full FP-16. It's a good point and important to check.

Link to benchmarking: https://artificialanalysis.ai/ (Note question was regarding API rather than their chat demo)

losvedir · 2 years ago
Maybe I'm stretching the analogy too far, but are we in the transistor regime of LLMs already? Sometimes I see these 70 billion parameter monstrosities and think we're still building ENIAC out of vacuum tubes.

In other words, are we ready to steadily march on, improving LLM tok/s year by year, or are we a major breakthrough or two away before that can even happen?

binary132 · 2 years ago
The thing is that tokens aren't an apples to apples metric.... Stupid tokens are a lot faster than clever tokens. I'd rather see token cleverness improving exponentially....
behnamoh · 2 years ago
tangent: Great to see you again on HN!
tome · 2 years ago
Hi folks, I work for Groq. Feel free to ask me any questions.

(If you check my HN post history you'll see I post a lot about Haskell. That's right, part of Groq's compilation pipeline is written in Haskell!)

michaelbuckbee · 2 years ago
Friendly fyi - I think this might just be a web interface bug, but I submitted a prompt with the Mixtral model and got a response (great!), then switched the dropdown to Llama and submitted the same prompt and got the exact same response.

It may be caching or it didn't change the model being queried or something else.

tome · 2 years ago
Thanks, I think it's because the chat context is fed back to the model for the next generation even when you switch models. If you refresh the page that should erase the history and you should get results purely from the model you choose.
itishappy · 2 years ago
Alright, I'll bite. Haskell seems pretty unique in the ML space! Any unique benefits to this decision, and would you recommend it for others? What areas of your project do/don't use Haskell?
tome · 2 years ago
Haskell is a great language for writing compilers! The end of our compilation pipeline is written in Haskell. Other stages are written in C++ (MLIR) and Python. I'd recommend anyone to look at Haskell if they have a compiler-shaped problem, for sure.

We also use Haskell on our infra team. Most of our CI infra is written in Haskell and Nix. Some of the chip itself was designed in Haskell (or maybe Bluespec, a Haskell-like language for chip design, I'm not sure).

lemie · 2 years ago
Haskell is a great language. However, when you want to build production-grade software it's the wrong language, especially when it comes to a complicated piece of software like the compiler for a novel chip. I can tell you for a fact it has always been the wrong choice (especially in the case of Groq).
jart · 2 years ago
If I understand correctly, you're using specialized hardware to improve token generation speed, which is very latency bound on the speed of computation. However, generating tokens usually only requires matrix-vector products. If I enter a prompt with ~100 tokens then your service goes much slower, probably because you have to do full matrix-matrix multiplies. What are you doing to improve the computation speed of prompt processing?
tome · 2 years ago
I don't think it should be quadratic in input length. Why do you think it is?
mechagodzilla · 2 years ago
You all seem like one of the only companies targeting low-latency inference rather than focusing on throughput (and thus $/inference) - what do you see as your primary market?
tome · 2 years ago
Yes, because we're one of the only companies whose hardware can actually support low latency! Everyone else is stuck with traditional designs and they try to make up for their high latency by batching to get higher throughput. But not all applications work with high throughput/high latency ... Low latency unlocks feeding the result of one model into the input of another model. Check out this conversational AI demo on CNN. You can't do that kind of thing unless you have low latency.

https://www.youtube.com/watch?v=pRUddK6sxDg&t=235s

ppsreejith · 2 years ago
Thank you for doing this AMA

1. How many GroqCards are you using to run the Demo?

2. Is there a newer version you're using which has more SRAM (since the one I see online only has 230MB)? Since this seems to be the number that will drive down your cost (to take advantage of batch processing, CMIIW!)

3. Can TTS pipelines be integrated with your stack? If so, we can truly have very low latency calls!

*Assuming you're using this: https://www.bittware.com/products/groq/

tome · 2 years ago
1. I think our GroqChat demo is using 568 GroqChips. I'm not sure exactly, but it's about that number.

2. We're working on our second generation chip. I don't know how much SRAM it has exactly but we don't need to increase the SRAM to get efficient scaling. Our system is deterministic, which means no need for waiting or queuing anywhere, and we can have very low latency interconnect between cards.

3. Yeah absolutely, see this video of a live demo on CNN!

https://www.youtube.com/watch?t=235&v=pRUddK6sxDg

andy_xor_andrew · 2 years ago
are your accelerator chips designed in-house? or they're some specialized silicon or FPGPU or something that you wrote very optimized code for inference?

it's really amazing! the first time I tried the demo, I had to try a few prompts to believe it wasn't just an animation :)

tome · 2 years ago
Yup, custom ASIC, designed in-house, built into a system of several racks, hundreds of chips, with fast interconnect. Really glad you enjoyed it!
zawy · 2 years ago
When you start using Samsung 4 nm are you switching from SRAM to HBM? If yes, how's that going to affect all the metrics, given that SRAM is so much faster? Someone said you'll eventually move to HBM because SRAM improvement has relatively stalled.

I read an article indicating your bill of materials is 10x Nvidia's to get 1/10 the latency, and 8x for throughput if Nvidia optimizes for throughput. Does this seem accurate? Is CAPEX the primary drawback?

https://www.semianalysis.com/p/groq-inference-tokenomics-spe...

UncleOxidant · 2 years ago
When will we be able to buy Groq accelerator cards that would be affordable for hobbyists?
tome · 2 years ago
We are prioritising building out whole systems at the moment. I don't think we'll have a consumer-level offering in the near future.
kkzz99 · 2 years ago
How does the Groq PCIe card work exactly? Does it use system RAM to stream the model data to the card? How many T/s could one expect with e.g. 3600 MHz DDR4 RAM?
tome · 2 years ago
We build out large systems where we stream in the model weights to the system once and then run multiple inferences on it. We don't really recommend streaming model weights repeatedly onto the chip because you'll lose the benefits of low latency.


tudorw · 2 years ago
As it works at inference, do you think 'Representation Engineering' could be applied to give a sort of fine-tuning ability? https://news.ycombinator.com/item?id=39414532
karthityrion · 2 years ago
Hi. Are these ASICs only for LLMs or could they accelerate other kinds of models(vision) as well?
tome · 2 years ago
It's a general purpose compute engine for numerical computing and linear algebra, so it can accelerate any ML workloads. Previously we've accelerated models for stabilising fusion reactions and for COVID drug discovery

* https://alcf.anl.gov/news/researchers-accelerate-fusion-rese...

* https://wow.groq.com/groq-accelerates-covid-drug-discovery-3...

amirhirsch · 2 years ago
It seems like you are making general purpose chips to run many models. Are we at a stage where we can consider taping out inference networks directly propagating the weights as constants in the RTL design?

Are chips and models obsoleted on roughly the same timelines?

tome · 2 years ago
I think the models change far too quickly for that to be viable. A chip has to last several years. Currently we're seeing groundbreaking models released every few months.
dkhudia · 2 years ago
@tome for the deterministic system, what if the timing for one chip/part is off due to manufacturing/environmental factors (e.g., temperature) ? How does the system handle this?
tome · 2 years ago
We know the maximum possible clock drift and so we know when we need to do a resynchronisation to keep all the chips in sync. You can read about it in section 3.3 of our recent whitepaper: https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...
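The resynchronisation cadence follows from simple arithmetic: skew accumulates linearly at the worst-case drift rate, so you resync before it exceeds your tolerance. All numbers below are illustrative, not from the whitepaper:

```python
# Toy calculation: how often must chips resynchronize, given a worst-case
# relative clock drift and an allowed skew budget?

drift_ppm = 5            # worst-case relative drift: 5 parts per million (assumed)
tolerance_cycles = 100   # allowed skew before a resync is needed (assumed)
clock_hz = 900e6         # assumed core clock

# Two chips drift apart at (clock_hz * drift_ppm * 1e-6) cycles per second,
# so the interval before the skew budget is exhausted is:
resync_interval_s = tolerance_cycles / (clock_hz * drift_ppm * 1e-6)
print(f"resync every ~{resync_interval_s * 1000:.1f} ms")  # ~22.2 ms
```

The point is that a known drift bound turns resynchronisation into a fixed, schedulable event rather than a reactive one.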
mechagodzilla · 2 years ago
Those sorts of issues are part of timing analysis for a chip, but once a chip's clock rate is set, they don't really factor in unless there is some kind of dynamic voltage/frequency scaling scheme going on. This chip probably does not do any of that and just uses a fixed frequency, so timing is perfectly predictable.
ianpurton · 2 years ago
Is it possible to buy Groq chips and how much do they cost?
pama · 2 years ago
FYI, I only see a repeating animation and nothing else on my iPhone in lockdown mode, with Safari or Firefox.
karthityrion · 2 years ago
What is the underlying architecture of the ASICs. Does it use systolic arrays?
tome · 2 years ago
Yes, our matrix engine is quite similar to a systolic array. You can find more details about our architecture in our paper:

https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...

louiscklaw · 2 years ago
Price plan?
liberix · 2 years ago
How do I sign up for API access? What payment methods do you support?
BryanLegend · 2 years ago
How well would your hardware work for image/video generation?
tome · 2 years ago
It should work great as far as I know. We've implemented some diffusion models for image generation but we don't offer them at the moment. I'm not aware of us having implemented any video models.
Oras · 2 years ago
Impressive speed. Are there any plans to run fine-tuned models?
tome · 2 years ago
Yes, we're working on a feature to give our partners the ability to deploy their own fine-tuned models.
phh · 2 years ago
You're running fp32 models, fp16 or quantized?
tome · 2 years ago
FP16 for calculating all activations. Some data is stored as FP8 at rest.
imiric · 2 years ago
Impressive demo!

However, the hardware requirements and cost make this inaccessible for anyone but large companies. When do you envision that the price could be affordable for hobbyists?

Also, while the CNN Vapi demo was impressive as well, a few weeks ago here[1] someone shared https://smarterchild.chat/. That also has _very_ low audio latency, making natural conversation possible. From that discussion it seems that https://www.sindarin.tech/ is behind it. Do we know if they use Groq LPUs or something else?

I think that once you reach ~50 t/s, real-time interaction is possible. Anything higher than that is useful for generating large volumes of data quickly, but there are diminishing returns as it's far beyond what humans can process. Maybe such speeds would be useful for AI-AI communication, transferring knowledge/context, etc.

So an LPU product that's only focused on AI-human interaction could have much lower capabilities, and thus much lower cost, no?

[1]: https://news.ycombinator.com/item?id=39180237

tome · 2 years ago
> However, the hardware requirements and cost make this inaccessible for anyone but large companies. When do you envision that the price could be affordable for hobbyists?

For API access to our tokens as a service we guarantee to beat any other provider on cost per token (see https://wow.groq.com). In terms of selling hardware, we're focused on selling whole systems, and they're only really suitable for corporations or research institutions.

pwillia7 · 2 years ago
Do you have any data on how many more tokens I would use with the increased speed?

In the demo alone I just used way more tokens than I normally would testing an LLM since it was so amazingly fast.

dsrtslnd23 · 2 years ago
How open is your early access? i.e. likelihood to get API access granted right now
stormfather · 2 years ago
>>50 t/s is absolutely necessary for real-time interaction with AI systems. Most of the LLM's output will be internal monologue and planning, performing RAG and summarization, etc, with only the final output being communicated to you. Imagine a blazingly fast GPT-5 that goes through multiple cycles of planning out how to answer you, searching the web, writing book reports, debating itself, distilling what it finds, critiquing and rewriting its answer, all while you blink a few times.
dmw_ng · 2 years ago
Given the size of the Sindarin team (3 AFAICT), that mostly looks like a clever combination of existing tech. There are some speech APIs that offer word-by-word realtime transcription (Google has one), assuming most of the special sauce is very well thought out pipelining between speech recognition->LLM->TTS

(not to denigrate their awesome achievement, I would not be interested if I were not curious about how to reproduce their result!)

charlie123hufft · 2 years ago
It's only faster sometimes; when you ask it a complicated question or give it any type of pre-prompt to speak in a different way, it still takes a while to load. Interesting, but ultimately probably going to be a flop.
neilv · 2 years ago
If the page can't access certain fonts, it will fail to work, while it keeps retrying requests:

    https://fonts.gstatic.com/s/notosansarabic/[...]
    https://fonts.gstatic.com/s/notosanshebrew/[...]
    https://fonts.gstatic.com/s/notosanssc/[...]
(I noticed this because my browser blocks these de facto trackers by default.)

rasz · 2 years ago
How to show Google how popular and interesting for acquisition you are without directly installing google trackers on your website.
sebastiennight · 2 years ago
Same problem when trying to use font replacements with a privacy plugin.

This is a very weird dependency to have :-)

tome · 2 years ago
Thanks, I've reported this internally.
SeanAnderson · 2 years ago
Sorry, I'm a bit naïve about all of this.

Why is this impressive? Can this result not be achieved by throwing more compute at the problem to speed up responses? Isn't the fact that there is a queue when under load just indicative that there's a trade-off between "# of request to process per unit of time" and "amount of compute to put into a response to respond quicker"?

https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/rel/do...

This chart from NVIDIA implies their H100 runs llama v2 70B at >500 tok/s.

MasterScrat · 2 years ago
Scaling up compute can improve throughput, but can't easily improve latency between tokens. Generation is usually bottlenecked by the time it takes to go through the network for each token. To speed that up, you need to perform these computations faster, which is a hard problem after you've exhausted all the obvious options (use the fastest accelerator you can find, cache what you can etc).
qeternity · 2 years ago
At batch size 1 LLMs are memory bandwidth bound, not compute bound…as in you spend most time waiting for model weights to load from vram. At higher batch sizes this flips.

But this is why Groq is built around large numbers of chips with small amount of very fast sram.
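The batch-1 bandwidth bound is easy to estimate: each generated token must stream every weight through the memory system once, so peak tok/s is capped at bandwidth divided by model size. Illustrative numbers (H100-class HBM, 70B model in FP16):

```python
# Upper bound on batch-1 decode speed: tok/s <= memory bandwidth / model bytes.
# Both figures below are illustrative assumptions.

model_bytes = 70e9 * 2       # 70B params at FP16 (2 bytes each)
hbm_bandwidth = 3.35e12      # ~3.35 TB/s, roughly H100 SXM HBM3 spec

max_tok_s = hbm_bandwidth / model_bytes
print(f"batch-1 ceiling: ~{max_tok_s:.0f} tok/s")  # ~24
```

Which is why hundreds of tok/s per user requires either aggressive quantization, sharding the weights across many chips' memory systems (Groq's approach), or both.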

SeanAnderson · 2 years ago
Yeah. That makes sense, thank you for clarifying. I updated my original post with a chart from NVIDIA which highlights the H100's capabilities. It doesn't seem unreasonable to expect a 7B model to run at 500 tok/s on that hardware.
tome · 2 years ago
LLM inference is inherently a sequential problem. You can't speed it up by doing more in parallel. You can't generate the 101st token before you've generated the 100th.
NorwegianDude · 2 years ago
Technically, I guess you can use speculative execution to speed it up, and in that way take a guess at what the 100th token will be and start on the 101st token at the same time? Though it probably has its own unforeseen challenges.

Everything is predictable with enough guesses.
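A toy sketch of that speculative idea (it exists and is called speculative decoding): a cheap draft model proposes k tokens, the big model verifies them all in one parallel pass, and you keep the longest correct prefix. The two "models" here are deliberately trivial stand-ins, not real LLMs:

```python
# Minimal speculative-decoding loop with stand-in models.

def draft_model(ctx, k):
    # Cheap guesser: just repeats the last token (deliberately dumb).
    return [ctx[-1]] * k

def target_model(ctx, k):
    # "Ground truth" next tokens; stands in for one batched verification pass
    # of the big model over all k draft positions at once.
    return [(ctx[-1] + i + 1) % 50 for i in range(k)]

def speculative_step(ctx, k=4):
    guesses = draft_model(ctx, k)
    truth = target_model(ctx, k)
    accepted = []
    for g, t in zip(guesses, truth):
        if g != t:
            accepted.append(t)   # first mismatch: keep the correct token, stop
            break
        accepted.append(g)       # match: the draft token is verified, keep going
    return ctx + accepted

print(speculative_step([7]))  # [7, 8] - draft guessed wrong, one verified token
```

When the draft agrees with the target for several positions in a row, you emit several tokens for one big-model pass; the sequential dependency is still respected because every accepted token was verified.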

Aeolun · 2 years ago
They’re using several hundred cards here. Clearly there is ‘something’ that can be done in parallel.
nabakin · 2 years ago
There's a difference between token throughput and latency. Token throughput is the token throughput of the whole GPU/system and latency is the token throughput for an individual user. Groq offers extremely low latency (aka extremely high token throughput per user) but we still don't have numbers on the token throughput of their entire system. Nvidia's metrics here on the other hand, show us the token throughput of the whole GPU/system. So, in reality, while you might be able to get 1.5k t/s on an H100, the latency (token throughput per user) will be something much lower like 20 t/s.

The really important metric to look for is cost per token because even though Groq is able to run at low latency, that doesn't mean it's able to do it cheaply. Determining the cost per token can be done many ways but a useful way for us is approximately the cost of the system divided by the total token throughput of the system per second. We don't have the total token throughput per second of Groq's system so we can't really say how efficient it is. It could very well be that Groq is subsidizing the cost of their system to lower prices and gain PR and will increase their prices later on.
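The cost-per-token framing suggested above is just amortization arithmetic. Every number below is a made-up placeholder; the point is the shape of the calculation, not an estimate of Groq's actual costs:

```python
# Hardware cost per token = system cost / total tokens served over its life.
# All inputs are hypothetical placeholders.

system_cost_usd = 5_000_000   # all-in hardware cost (assumed)
lifetime_years = 4            # depreciation window (assumed)
system_tok_s = 20_000         # whole-system token throughput (assumed)
utilization = 0.5             # fraction of time actually serving (assumed)

tokens_served = system_tok_s * utilization * lifetime_years * 365 * 24 * 3600
usd_per_million_tok = system_cost_usd / tokens_served * 1e6
print(f"~${usd_per_million_tok:.2f} per million tokens (hardware only)")
```

Note how sensitive the result is to total system throughput, which is exactly the number that hasn't been published.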

frozenport · 2 years ago
https://wow.groq.com/artificialanalysis-ai-llm-benchmark-dou...

Seems to have it. Looks cost competitive but a lot faster.

SushiHippie · 2 years ago
I guess it depends on how much the infrastructure from TFA costs, as the H100 only costs ~$3,300 to produce but gets sold for ~$30k on average.

https://www.hpcwire.com/2023/08/17/nvidia-h100-are-550000-gp...

moffkalast · 2 years ago
I think Nvidia is listing max throughput in terms of batching, so e.g. 50 tok/s for 10 different prompts at the same time. Groq LPUs definitely outperform an H100 in raw speed.

But fundamentally it's a system that only has 10x the speed for 500x the price, made by a company that runs a blockchain and is trying to heavily market what were intended to be crypto mining chips for LLM inference. It's really quite a funny coincidence that whenever someone posts this weekly link in amazement, there's an army of Groq engineers in the comments ready to say anything and everything.

tome · 2 years ago
Groq does not run a blockchain and our chips were never intended for crypto mining.