eigenvalue · 2 years ago
I just want to say that this is one of the most impressive tech demos I’ve ever seen in my life, and I love that it’s truly an open demo that anyone can try without even signing up for an account or anything like that. It’s surreal to see the thing spitting out tokens at such a crazy rate when you’re used to watching them generate at less than one fifth that speed. I’m surprised you guys haven’t been swallowed up by Microsoft, Apple, or Google already for a huge premium.
tome · 2 years ago
Really glad you like it! We've been working hard on it.
jonplackett · 2 years ago
Is this useful for training as well as running a model, or is this approach specifically for running an already-trained model faster?
lokimedes · 2 years ago
The speed part or the being swallowed part?
RecycledEle · 2 years ago
If I understand correctly, each chip has 200 MB of RAM, so it takes racks to run a single LLM.

That does not sound like progress to me.

We need single PCIe boards with dozens or hundreds of GB of RAM and processors that handle it well.

elorant · 2 years ago
Perplexity Labs also has an open demo of Mixtral 8x7b although it's nowhere near as fast as this.

https://labs.perplexity.ai/

vitorgrs · 2 years ago
Poe has a bunch of them, including Groq as well!
RockyMcNuts · 2 years ago
ok... why tho? genuinely ignorant and extremely curious.

what's the TFLOPS/$ and TFLOPS/W and how does it compare with Nvidia, AMD, TPU?

from quick Googling I feel like Groq has been making these sorts of claims since 2020 and yet people pay a huge premium for Nvidia and Groq doesn't seem to be giving them much of a run for their money.

of course if you run a much smaller model than ChatGPT on similar or more powerful hardware it might run much faster but that doesn't mean it's a breakthrough on most models or use cases where latency isn't the critical metric?

larodi · 2 years ago
why sell? it would be much more delightful to beat them on their own game?
brcmthrowaway · 2 years ago
I have it on good authority Apple was very close to acquiring Groq
baq · 2 years ago
If this is true, expect a call from the SEC...
timomaxgalvin · 2 years ago
Sure, but the responses are very poor compared to MS tools.
treesciencebot · 2 years ago
The main problem with the Groq LPUs is that they don't have any HBM on them at all. Just a minuscule (230 MiB) [0] amount of ultra-fast SRAM (20x faster than HBM3, just to be clear). Which means you need ~256 LPUs (4 full server racks of compute, where each unit on a rack contains 8x LPUs and there are 8x of those units per rack) just to serve a single model [1], whereas you can get a single H200 (1/256 of the server rack density) and serve these models reasonably well.

It might work well if you have a single model with lots of customers, but as soon as you need more than a single model and a lot of finetunes/high rank LoRAs etc., these won't be usable. Or for any on-prem deployment since the main advantage is consolidating people to use the same model, together.

[0]: https://wow.groq.com/groqcard-accelerator/

[1]: https://twitter.com/tomjaguarpaw/status/1759615563586744334
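For a rough sense of scale, here's the arithmetic behind that chip count (the parameter count and the FP8-at-rest assumption are mine, not Groq's published figures):

```python
# Back-of-the-envelope: how many 230 MiB SRAM chips does Mixtral 8x7B need?
# Assumptions (illustrative): ~46.7B total params, 1 byte/param (FP8) at rest,
# and that weights alone dominate on-chip memory (ignores KV cache, activations,
# and any duplication the compiler introduces for pipelining).

params = 46.7e9              # Mixtral 8x7B total parameter count
bytes_per_param = 1          # FP8 storage
sram_per_chip = 230 * 2**20  # 230 MiB of SRAM per LPU

chips_needed = params * bytes_per_param / sram_per_chip
print(f"~{chips_needed:.0f} chips just to hold the weights")  # ~194
```

That lands in the same ballpark as the ~256 LPU figure above once you add room for activations, KV cache, and pipelining overhead.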

matanyall · 2 years ago
Groq Engineer here, I'm not seeing why being able to scale compute outside of a single card/node is somehow a problem. My preferred analogy is to a car factory: Yes, you could build a car with say only one or two drills, but a modern automated factory has hundreds of drills! With a single drill, you could probably build all sorts of cars, but a factory assembly line is only able to make specific cars in that configuration. Does that mean that factories are inefficient?

You also say that H200's work reasonably well, and that's reasonable (but debatable) for synchronous, human interaction use cases. Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

pbalcer · 2 years ago
Just curious, how does this work out in terms of TCO (even assuming the price of a Groq LPU is 0$)? What you say makes sense, but I'm wondering how you strike a balance between massive horizontal scaling vs vertical scaling. Sometimes (quite often in my experience) having a few beefy servers is much simpler/cheaper/faster than scaling horizontally across many small nodes.

Or I got this completely wrong, and your solution enables use-cases that are simply unattainable on mainstream (Nvidia/AMD) hardware, making TCO argument less relevant?

huac · 2 years ago
> 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

I believe that this is doable - my pipeline is generally closer to 400ms without RAG and with Mixtral, with a lot of non-ML hacks to get there. It would also definitely be doable with a joint speech-language model that removes the transcription step.

For these use cases, time to first byte is the most important metric, not total throughput.

treprinum · 2 years ago
> Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia

I built one, should be live soon ;-)

startupsfail · 2 years ago
I have one, with 13B, on a 5-year-old 48GB Q8000 GPU. It can also see; it's LLaVA. And it's very important that it is local, as privacy is important and streaming images to the cloud is time consuming.

You only need a few tokens, not the full 500-token response, to start running TTS. And you can pre-generate responses online while ASR is still in progress. With a bit of clever engineering the response starts with virtually no delay, the moment it's natural to start the response.

jrflowers · 2 years ago
>Show me a 30b+ parameter model doing RAG as part of a conversation with voice responses in less than a second, running on Nvidia.

Is your version of that on a different page from this chat bot?

mlazos · 2 years ago
You can’t scale horizontally forever because of communication. I think HBM would provide a lot more flexibility with the number of chips you need.
fennecbutt · 2 years ago
Are there voice responses in the demo? I couldn't find em?
chaunnyong · 2 years ago
Hi Matanyal, we worked with groq for a project last year. would you be open to connect on LinkedIn? :)
trsohmers · 2 years ago
Groq states in this article [0] that they used 576 chips to achieve these results. Continuing with your analysis, you also need to factor in that each additional user requires a separate KV cache, which can add multiple more gigabytes per user.

My professional independent observer opinion (not based on my 2 years of working at Groq) would have me assume that their COGS to achieve these performance numbers would exceed several million dollars, so depreciating that over expected usage at the theoretical prices they have posted seems impractical, so from an actual performance per dollar standpoint they don’t seem viable, but do have a very cool demo of an insane level of performance if you throw cost concerns out the window.

[0]: https://www.nextplatform.com/2023/11/27/groq-says-it-can-dep...
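To put numbers on the KV-cache point, here is a sketch of per-user cache size for a 70B-class model. The shapes below are illustrative (classic multi-head attention; real Llama-2-70B uses grouped-query attention with 8 KV heads, which cuts this by roughly 8x):

```python
# Rough KV-cache footprint per concurrent user at full context length.
# All shapes are assumptions for illustration, not a specific vendor's config.

layers = 80          # transformer blocks
heads = 64           # KV heads (no grouped-query attention assumed)
head_dim = 128       # per-head dimension
seq_len = 4096       # context length
bytes_per_value = 2  # FP16

# K and V each store (heads * head_dim) values per layer per position:
kv_bytes = 2 * layers * heads * head_dim * seq_len * bytes_per_value
print(f"~{kv_bytes / 2**30:.0f} GiB of KV cache per user at full context")  # ~10 GiB
```

Which is why "multiple more gigabytes per user" bites hard on a system with only a few hundred MiB of SRAM per chip.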

tome · 2 years ago
Thomas, I think for full disclosure you should also state that you left Groq to start a competitor (a competitor which doesn't have the world's lowest latency LLM engine, nor a guarantee to match the cheapest per-token prices, like Groq does).

Anyone with a serious interest in the total cost of ownership of Groq's system is welcome to email contact@groq.com.

Aeolun · 2 years ago
I think that just means it’s for people that really want it?

John doe and his friends will never have a need to have their fart jokes generated at this speed, and are more interested in low costs.

But we’d recently been doing call center operations and being able to quickly figure out what someone said was a major issue. You kind of don’t want your system to wait for a second before responding each time. I can imagine it making sense if it reduces the latency to 10ms there as well. Though you might still run up against the ‘good enough’ factor.

I guess few people want to spend millions to go from 1000ms to 10ms, but when they do they really want it.

nickpsecurity · 2 years ago
What happened to Rex? Did it hit production or get abandoned?

It was also on my list of things to consider modifying for an AI accelerator. :)

tome · 2 years ago
If you want low latency you have to be really careful with HBM, not only because of the delay involved, but also the non-determinism. One of the huge benefits of our LPU architecture is that we can build systems of hundreds of chips with fast interconnect and we know the precise timing of the whole system to within a few parts per million. Once you start integrating non-deterministic components your latency guarantees disappear very quickly.
pclmulqdq · 2 years ago
I don't know about HBM specifically, but DDR and GDDR at a protocol level are both deterministic. It's the memory controller doing a bunch of reordering that makes them non-deterministic. Presumably, if that is the reason you don't like DRAM, you could build your compiler to be memory-layout aware and have the memory controller issue commands without reordering.
frognumber · 2 years ago
From a theoretical perspective, this is absolutely not true. Asynchronous logic can achieve much lower latency guarantees than synchronous logic.

Come to think of it, this is one of the few places where asynchronous logic might be more than academic... Async logic is hard with complex control flows, which deep learning inference does not have.

(From a practical perspective, I know you were comparing to independently-clocked logic, rather than async logic)

SilverBirch · 2 years ago
Surely once you're scaling over multiple chips/servers/racks you're dealing with retries and checksums and sequence numbers anyway? How do you get around the non-determinism of networking, beyond just hoping that you don't see any errors?
pclmulqdq · 2 years ago
Groq devices are really well set up for small-batch-size inference because of the use of SRAM.

I'm not so convinced they have a Tok/sec/$ advantage at all, though, and especially at medium to large batch sizes which would be the groups who can afford to buy so much silicon.

I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1, and Nvidia cards do get meaningfully higher throughput as batch size gets into the 100's.

foundval · 2 years ago
(Groq Employee) It's hard to discuss Tok/sec/$ outside of the context of a hardware sales engagement.

This is because the relationship between Tok/s/u, Tok/s/system, Batching, and Pipelining is a complex one that involves compute utilization, network utilization, and (in particular) a host of compilation techniques that we wouldn't want to share publicly. Maybe we'll get to that level of transparency at some point, though!

As far as Batching goes, you should consider that with synchronous systems, if all the stars align, Batch=1 is all you need. Of course, the devil is in the details, and sometimes small batch numbers still give you benefits. But Batch 100's generally gives no advantages. In fact, the entire point of developing deterministic hardware and synchronous systems is to avoid batching in the first place.

frozenport · 2 years ago

> I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1
I guess if you don't have any extra junk you can pack more processing into the chip?

nabakin · 2 years ago
I've been thinking the same but on the other hand, that would mean they are operating at a huge loss which doesn't scale
londons_explore · 2 years ago
> more than a single model and a lot of finetunes/high rank LoRAs

I can imagine a way might be found to host a base model and a bunch of LoRA's whilst using barely more ram than the base model alone.

The fine-tuning could perhaps be done in such a way that only perhaps 0.1% of the weights are changed, and for every computation the difference is computed not over the weights, but of the output layer activations.
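What's being described is essentially LoRA serving: keep one copy of the base weights and apply each tenant's delta as a low-rank correction on the activations, rather than materializing a second full weight matrix. A minimal NumPy sketch (sizes and rank are illustrative):

```python
import numpy as np

# One shared base weight matrix W, plus a tiny per-tenant adapter (A, B).
# The adapter's effect is computed on the activations: y = W @ x + B @ (A @ x).

d, r = 4096, 8                                       # hidden size, adapter rank
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)).astype(np.float32)   # shared base weights
A = rng.standard_normal((r, d)).astype(np.float32)   # per-tenant adapter (down)
B = np.zeros((d, r), dtype=np.float32)               # per-tenant adapter (up),
                                                     # zero-init so delta starts at 0
x = rng.standard_normal(d).astype(np.float32)

# Base path plus low-rank correction:
y = W @ x + B @ (A @ x)

# Adapter memory is 2*d*r floats vs d*d for a full copy of W:
overhead = (2 * d * r) / (d * d)
print(f"adapter adds {overhead:.2%} of the base layer's memory")  # 0.39%
```

So hosting hundreds of fine-tunes on top of one resident base model is memory-cheap; the hard part is batching requests that use different adapters, which is what the S-LoRA work below addresses.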

kcorbitt · 2 years ago
This actually already exists! We did a writeup of the relevant optimizations here: https://openpipe.ai/blog/s-lora
xzyaoi · 2 years ago
There's also papers for hosting full-parameter fine-tuned models: https://arxiv.org/abs/2312.05215

Disclaimer: I'm one of the authors.

azeirah · 2 years ago
I recall a recent discussion about a technique to load the diff in weights between a lora and base model, zip it and transfer it on a per-needs basis.
moralestapia · 2 years ago
>The main problem with the Groq LPUs is, they don't have any HBM on them at all. Just a miniscule (230 MiB) [0] amount of ultra-fast SRAM [...]

IDGAF about any of that, lol. I just want an API endpoint.

480 tokens/sec at $0.27 per million tokens? Sign me in, I don't care about their hardware, at all.

treesciencebot · 2 years ago
there are providers out there offering $0 per million tokens; that doesn't mean it's sustainable and won't disappear as soon as the VC well runs dry. I'm not saying this is the case for Groq, but in general you probably should care if you want to build something serious on top of anything.
imtringued · 2 years ago
I honestly don't see the problem.

"just to serve a single model" could be easily fixed by adding a single LPDDR4 channel per LPU. Then you can reload the model sixty times per second and serve 60 different models per second.

treesciencebot · 2 years ago
per-chip compute is not the main thing this chip innovates on for fast inference, it is the extremely fast memory bandwidth. when you do that, you'll lose all of that and will be much worse off than any off-the-shelf accelerator.
karpathy · 2 years ago
Very impressive looking! Just wanted to caution it's worth being a bit skeptical without benchmarks as there are a number of ways to cut corners. One prominent example is heavy model quantization, which speeds up the model at a cost of model quality. Otherwise I'd love to see LLM tok/s progress exactly like CPU instructions/s did a few decades ago.
tome · 2 years ago
As a fellow scientist I concur with the approach of skepticism by default. Our chat app and API are available for everyone to experiment with and compare output quality with any other provider.

I hope you are enjoying your time of having an empty calendar :)

mr_luc · 2 years ago
Wait you have an API now??? Is it open, is there a waitlist? I’m on a plane but going to try to find that on the site. Absolutely loved your demo, been showing it around for a few months.
bsima · 2 years ago
As tome mentioned we don’t quantize, all activations are FP16

And here are some independent benchmarks https://artificialanalysis.ai/models/llama-2-chat-70b

xvector · 2 years ago
Jesus Christ, these speeds with FP16? That is simply insane.
sp332 · 2 years ago
At least for the earlier Llama 70B demo, they claimed to be running unquantized. https://twitter.com/lifebypixels/status/1757619926360096852

Update: This comment says "some data is stored as FP8 at rest" and I don't know what that means. https://news.ycombinator.com/item?id=39432025

tome · 2 years ago
The weights are quantized to FP8 when they're stored in memory, but all the activations are computed at full FP16 precision.
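The pattern tome describes (compact weights at rest, wider compute precision) can be sketched as follows. NumPy has no FP8 dtype, so int8 plus a per-tensor scale stands in for FP8 here; this illustrates the storage/compute split only, not Groq's actual scheme:

```python
import numpy as np

# Sketch of "weights quantized at rest, activations computed in FP16".
# int8 + scale emulates an 8-bit at-rest format; the matmul runs in FP16.

rng = np.random.default_rng(0)
w_fp16 = rng.standard_normal((256, 256)).astype(np.float16)

scale = float(np.abs(w_fp16).max()) / 127.0
w_q = np.round(w_fp16 / scale).astype(np.int8)   # compact "at rest" form

x = rng.standard_normal(256).astype(np.float16)

# Dequantize to FP16 just before use; compute stays in the wider format.
w_deq = w_q.astype(np.float16) * np.float16(scale)
y = w_deq @ x

err = float(np.abs(y - w_fp16 @ x).max())
print(f"max error vs unquantized: {err:.3f}")
```

The win is halved weight-memory traffic and footprint, at the cost of a small, bounded quantization error in the weights only.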
bearjaws · 2 years ago
Nothing really wrong with FP8 IMO, it performs pretty damn well usually within 98% while significantly reducing memory usage.
Gcam · 2 years ago
As part of our benchmarking of Groq we have asked Groq regarding quantization and they have assured us they are running models at full FP-16. It's a good point and important to check.

Link to benchmarking: https://artificialanalysis.ai/ (Note question was regarding API rather than their chat demo)

losvedir · 2 years ago
Maybe I'm stretching the analogy too far, but are we in the transistor regime of LLMs already? Sometimes I see these 70 billion parameter monstrosities and think we're still building ENIAC out of vacuum tubes.

In other words, are we ready to steadily march on, improving LLM tok/s year by year, or are we a major breakthrough or two away before that can even happen?

binary132 · 2 years ago
The thing is that tokens aren't an apples to apples metric.... Stupid tokens are a lot faster than clever tokens. I'd rather see token cleverness improving exponentially....
behnamoh · 2 years ago
tangent: Great to see you again on HN!
tome · 2 years ago
Hi folks, I work for Groq. Feel free to ask me any questions.

(If you check my HN post history you'll see I post a lot about Haskell. That's right, part of Groq's compilation pipeline is written in Haskell!)

michaelbuckbee · 2 years ago
Friendly fyi - I think this might just be a web interface bug, but I submitted a prompt with the Mixtral model and got a response (great!), then switched the dropdown to Llama and submitted the same prompt and got the exact same response.

It may be caching or it didn't change the model being queried or something else.

tome · 2 years ago
Thanks, I think it's because the chat context is fed back to the model for the next generation even when you switch models. If you refresh the page that should erase the history and you should get results purely from the model you choose.
itishappy · 2 years ago
Alright, I'll bite. Haskell seems pretty unique in the ML space! Any unique benefits to this decision, and would you recommend it for others? What areas of your project do/don't use Haskell?
tome · 2 years ago
Haskell is a great language for writing compilers! The end of our compilation pipeline is written in Haskell. Other stages are written in C++ (MLIR) and Python. I'd recommend anyone to look at Haskell if they have a compiler-shaped problem, for sure.

We also use Haskell on our infra team. Most of our CI infra is written in Haskell and Nix. Some of the chip itself was designed in Haskell (or maybe Bluespec, a Haskell-like language for chip design, I'm not sure).

lemie · 2 years ago
Haskell is a great language. However, when you want to build production-grade software it's the wrong language, especially when it comes to a complicated piece of software like the compiler for a novel chip. I can tell you for a fact it has always been the wrong choice (especially in the case of Groq).
jart · 2 years ago
If I understand correctly, you're using specialized hardware to improve token generation speed, which is very latency bound on the speed of computation. However, generating tokens usually only requires matrix-vector products. If I enter a prompt with ~100 tokens then your service goes much slower, probably because you have to do full matrix-matrix multiplies. What are you doing to improve the computation speed of prompt processing?
tome · 2 years ago
I don't think it should be quadratic in input length. Why do you think it is?
mechagodzilla · 2 years ago
You all seem like one of the only companies targeting low-latency inference rather than focusing on throughput (and thus $/inference) - what do you see as your primary market?
tome · 2 years ago
Yes, because we're one of the only companies whose hardware can actually support low latency! Everyone else is stuck with traditional designs and they try to make up for their high latency by batching to get higher throughput. But not all applications work with high throughput/high latency ... Low latency unlocks feeding the result of one model into the input of another model. Check out this conversational AI demo on CNN. You can't do that kind of thing unless you have low latency.

https://www.youtube.com/watch?v=pRUddK6sxDg&t=235s

ppsreejith · 2 years ago
Thank you for doing this AMA

1. How many GroqCards are you using to run the Demo?

2. Is there a newer version you're using which has more SRAM (since the one I see online only has 230MB)? Since this seems to be the number that will drive down your cost (to take advantage of batch processing, CMIIW!)

3. Can TTS pipelines be integrated with your stack? If so, we can truly have very low latency calls!

*Assuming you're using this: https://www.bittware.com/products/groq/

tome · 2 years ago
1. I think our GroqChat demo is using 568 GroqChips. I'm not sure exactly, but it's about that number.

2. We're working on our second generation chip. I don't know how much SRAM it has exactly but we don't need to increase the SRAM to get efficient scaling. Our system is deterministic, which means no need for waiting or queuing anywhere, and we can have very low latency interconnect between cards.

3. Yeah absolutely, see this video of a live demo on CNN!

https://www.youtube.com/watch?t=235&v=pRUddK6sxDg

andy_xor_andrew · 2 years ago
are your accelerator chips designed in-house? or they're some specialized silicon or FPGPU or something that you wrote very optimized code for inference?

it's really amazing! the first time I tried the demo, I had to try a few prompts to believe it wasn't just an animation :)

tome · 2 years ago
Yup, custom ASIC, designed in-house, built into a system of several racks, hundreds of chips, with fast interconnect. Really glad you enjoyed it!
zawy · 2 years ago
When you start using Samsung 4 nm are you switching from SRAM to HBM? If yes, how's that going to affect all the metrics, given that SRAM is so much faster? Someone said you'll eventually move to HBM because SRAM improvement has relatively stalled.

I read an article indicating your bill of materials is 10x Nvidia's to get 1/10 the latency, and 8x for throughput if Nvidia optimizes for throughput. Does this seem accurate? Is CAPEX the primary drawback?

https://www.semianalysis.com/p/groq-inference-tokenomics-spe...

UncleOxidant · 2 years ago
When will we be able to buy Groq accelerator cards that would be affordable for hobbyists?
tome · 2 years ago
We are prioritising building out whole systems at the moment. I don't think we'll have a consumer-level offering in the near future.
kkzz99 · 2 years ago
How does the Groq PCIe card work exactly? Does it use system RAM to stream the model data to the card? How many T/s could one expect with e.g. 3600 MHz DDR4 RAM?
tome · 2 years ago
We build out large systems where we stream in the model weights to the system once and then run multiple inferences on it. We don't really recommend streaming model weights repeatedly onto the chip because you'll lose the benefits of low latency.


tudorw · 2 years ago
As it works at inference, do you think 'Representation Engineering' could be applied to give a sort of fine-tuning ability? https://news.ycombinator.com/item?id=39414532
karthityrion · 2 years ago
Hi. Are these ASICs only for LLMs or could they accelerate other kinds of models(vision) as well?
tome · 2 years ago
It's a general purpose compute engine for numerical computing and linear algebra, so it can accelerate any ML workloads. Previously we've accelerated models for stabilising fusion reactions and for COVID drug discovery

* https://alcf.anl.gov/news/researchers-accelerate-fusion-rese...

* https://wow.groq.com/groq-accelerates-covid-drug-discovery-3...

amirhirsch · 2 years ago
It seems like you are making general purpose chips to run many models. Are we at a stage where we can consider taping out inference networks directly propagating the weights as constants in the RTL design?

Are chips and models obsoleted on roughly the same timelines?

tome · 2 years ago
I think the models change far too quickly for that to be viable. A chip has to last several years. Currently we're seeing groundbreaking models released every few months.
dkhudia · 2 years ago
@tome for the deterministic system, what if the timing for one chip/part is off due to manufacturing/environmental factors (e.g., temperature) ? How does the system handle this?
tome · 2 years ago
We know the maximum possible clock drift and so we know when we need to do a resynchronisation to keep all the chips in sync. You can read about it in section 3.3 of our recent whitepaper: https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...
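The resynchronisation cadence follows from simple arithmetic: skew accumulates linearly at the worst-case drift rate, so you resync before it exceeds your tolerance. All numbers below are illustrative, not from the whitepaper:

```python
# Toy calculation: how often must chips resynchronize, given a worst-case
# relative clock drift and an allowed skew budget?

drift_ppm = 5            # worst-case relative drift: 5 parts per million (assumed)
tolerance_cycles = 100   # allowed skew before a resync is needed (assumed)
clock_hz = 900e6         # assumed core clock

# Two chips drift apart at (clock_hz * drift_ppm * 1e-6) cycles per second,
# so the interval before the skew budget is exhausted is:
resync_interval_s = tolerance_cycles / (clock_hz * drift_ppm * 1e-6)
print(f"resync every ~{resync_interval_s * 1000:.1f} ms")  # ~22.2 ms
```

The point is that a known drift bound turns resynchronisation into a fixed, schedulable event rather than a reactive one.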
mechagodzilla · 2 years ago
Those sorts of issues are part of timing analysis for a chip, but once a chip's clock rate is set, they don't really factor in unless there is some kind of dynamic voltage/frequency scaling scheme going on. This chip probably does not do any of that and just uses a fixed frequency, so timing is perfectly predictable.
ianpurton · 2 years ago
Is it possible to buy Groq chips and how much do they cost?
pama · 2 years ago
FYI, I only see a repeating animation and nothing else on my iPhone in lockdown mode, with Safari or Firefox.
karthityrion · 2 years ago
What is the underlying architecture of the ASICs. Does it use systolic arrays?
tome · 2 years ago
Yes, our matrix engine is quite similar to a systolic array. You can find more details about our architecture in our paper:

https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...

louiscklaw · 2 years ago
Price plan?
liberix · 2 years ago
How do I sign up for API access? What payment methods do you support?
BryanLegend · 2 years ago
How well would your hardware work for image/video generation?
tome · 2 years ago
It should work great as far as I know. We've implemented some diffusion models for image generation but we don't offer them at the moment. I'm not aware of us having implemented any video models.
Oras · 2 years ago
Impressive speed. Are there any plans to run fine-tuned models?
tome · 2 years ago
Yes, we're working on a feature to give our partners the ability to deploy their own fine-tuned models.
phh · 2 years ago
You're running fp32 models, fp16 or quantized?
tome · 2 years ago
FP16 for calculating all activations. Some data is stored as FP8 at rest.
imiric · 2 years ago
Impressive demo!

However, the hardware requirements and cost make this inaccessible for anyone but large companies. When do you envision that the price could be affordable for hobbyists?

Also, while the CNN Vapi demo was impressive as well, a few weeks ago here[1] someone shared https://smarterchild.chat/. That also has _very_ low audio latency, making natural conversation possible. From that discussion it seems that https://www.sindarin.tech/ is behind it. Do we know if they use Groq LPUs or something else?

I think that once you reach ~50 t/s, real-time interaction is possible. Anything higher than that is useful for generating large volumes of data quickly, but there are diminishing returns as it's far beyond what humans can process. Maybe such speeds would be useful for AI-AI communication, transferring knowledge/context, etc.

So an LPU product that's only focused on AI-human interaction could have much lower capabilities, and thus much lower cost, no?

[1]: https://news.ycombinator.com/item?id=39180237

tome · 2 years ago
> However, the hardware requirements and cost make this inaccessible for anyone but large companies. When do you envision that the price could be affordable for hobbyists?

For API access to our tokens as a service we guarantee to beat any other provider on cost per token (see https://wow.groq.com). In terms of selling hardware, we're focused on selling whole systems, and they're only really suitable for corporations or research institutions.

pwillia7 · 2 years ago
Do you have any data on how many more tokens I would use with the increased speed?

In the demo alone I just used way more tokens than I normally would testing an LLM since it was so amazingly fast.

dsrtslnd23 · 2 years ago
How open is your early access? i.e. likelihood to get API access granted right now
stormfather · 2 years ago
>>50 t/s is absolutely necessary for real-time interaction with AI systems. Most of the LLM's output will be internal monologue and planning, performing RAG and summarization, etc, with only the final output being communicated to you. Imagine a blazingly fast GPT-5 that goes through multiple cycles of planning out how to answer you, searching the web, writing book reports, debating itself, distilling what it finds, critiquing and rewriting its answer, all while you blink a few times.
dmw_ng · 2 years ago
Given the size of the Sindarin team (3 AFAICT), that mostly looks like a clever combination of existing tech. There are some speech APIs that offer word-by-word realtime transcription (Google has one), assuming most of the special sauce is very well thought out pipelining between speech recognition->LLM->TTS

(not to denigrate their awesome achievement, I would not be interested if I were not curious about how to reproduce their result!)

charlie123hufft · 2 years ago
It's only faster sometimes; when you ask it a complicated question or give it any type of pre-prompt to speak in a different way, it still takes a while to load. Interesting, but ultimately probably going to be a flop.
neilv · 2 years ago
If the page can't access certain fonts, it will fail to work, while it keeps retrying requests:

    https://fonts.gstatic.com/s/notosansarabic/[...]
    https://fonts.gstatic.com/s/notosanshebrew/[...]
    https://fonts.gstatic.com/s/notosanssc/[...]
(I noticed this because my browser blocks these de facto trackers by default.)

rasz · 2 years ago
How to show Google how popular and interesting for acquisition you are without directly installing google trackers on your website.
sebastiennight · 2 years ago
Same problem when trying to use font replacements with a privacy plugin.

This is a very weird dependency to have :-)

tome · 2 years ago
Thanks, I've reported this internally.
SeanAnderson · 2 years ago
Sorry, I'm a bit naïve about all of this.

Why is this impressive? Can this result not be achieved by throwing more compute at the problem to speed up responses? Isn't the fact that there is a queue when under load just indicative that there's a trade-off between "# of request to process per unit of time" and "amount of compute to put into a response to respond quicker"?

https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/rel/do...

This chart from NVIDIA implies their H100 runs llama v2 70B at >500 tok/s.

MasterScrat · 2 years ago
Scaling up compute can improve throughput, but can't easily improve latency between tokens. Generation is usually bottlenecked by the time it takes to go through the network for each token. To speed that up, you need to perform these computations faster, which is a hard problem after you've exhausted all the obvious options (use the fastest accelerator you can find, cache what you can etc).
qeternity · 2 years ago
At batch size 1 LLMs are memory bandwidth bound, not compute bound…as in you spend most time waiting for model weights to load from vram. At higher batch sizes this flips.

But this is why Groq is built around large numbers of chips with small amount of very fast sram.
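The batch-1 bandwidth bound is easy to estimate: each generated token must stream every weight through the memory system once, so peak tok/s is capped at bandwidth divided by model size. Illustrative numbers (H100-class HBM, 70B model in FP16):

```python
# Upper bound on batch-1 decode speed: tok/s <= memory bandwidth / model bytes.
# Both figures below are illustrative assumptions.

model_bytes = 70e9 * 2       # 70B params at FP16 (2 bytes each)
hbm_bandwidth = 3.35e12      # ~3.35 TB/s, roughly H100 SXM HBM3 spec

max_tok_s = hbm_bandwidth / model_bytes
print(f"batch-1 ceiling: ~{max_tok_s:.0f} tok/s")  # ~24
```

Which is why hundreds of tok/s per user requires either aggressive quantization, sharding the weights across many chips' memory systems (Groq's approach), or both.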

SeanAnderson · 2 years ago
Yeah. That makes sense, thank you for clarifying. I updated my original post with a chart from NVIDIA which highlights the H100's capabilities. It doesn't seem unreasonable to expect a 7B model to run at 500 tok/s on that hardware.
tome · 2 years ago
LLM inference is inherently a sequential problem. You can't speed it up by doing more in parallel. You can't generate the 101st token before you've generated the 100th.
NorwegianDude · 2 years ago
Technically, I guess you can use speculative execution to speed it up, and in that way take a guess at what the 100th token will be and start on the 101st token at the same time? Though it probably has its own unforeseen challenges.

Everything is predictable with enough guesses.
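A toy sketch of that speculative idea (it exists and is called speculative decoding): a cheap draft model proposes k tokens, the big model verifies them all in one parallel pass, and you keep the longest correct prefix. The two "models" here are deliberately trivial stand-ins, not real LLMs:

```python
# Minimal speculative-decoding loop with stand-in models.

def draft_model(ctx, k):
    # Cheap guesser: just repeats the last token (deliberately dumb).
    return [ctx[-1]] * k

def target_model(ctx, k):
    # "Ground truth" next tokens; stands in for one batched verification pass
    # of the big model over all k draft positions at once.
    return [(ctx[-1] + i + 1) % 50 for i in range(k)]

def speculative_step(ctx, k=4):
    guesses = draft_model(ctx, k)
    truth = target_model(ctx, k)
    accepted = []
    for g, t in zip(guesses, truth):
        if g != t:
            accepted.append(t)   # first mismatch: keep the correct token, stop
            break
        accepted.append(g)       # match: the draft token is verified, keep going
    return ctx + accepted

print(speculative_step([7]))  # [7, 8] - draft guessed wrong, one verified token
```

When the draft agrees with the target for several positions in a row, you emit several tokens for one big-model pass; the sequential dependency is still respected because every accepted token was verified.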

Aeolun · 2 years ago
They’re using several hundred cards here. Clearly there is ‘something’ that can be done in parallel.
nabakin · 2 years ago
There's a difference between token throughput and latency. Token throughput is the token throughput of the whole GPU/system and latency is the token throughput for an individual user. Groq offers extremely low latency (aka extremely high token throughput per user) but we still don't have numbers on the token throughput of their entire system. Nvidia's metrics here on the other hand, show us the token throughput of the whole GPU/system. So, in reality, while you might be able to get 1.5k t/s on an H100, the latency (token throughput per user) will be something much lower like 20 t/s.

The really important metric to look for is cost per token because even though Groq is able to run at low latency, that doesn't mean it's able to do it cheaply. Determining the cost per token can be done many ways but a useful way for us is approximately the cost of the system divided by the total token throughput of the system per second. We don't have the total token throughput per second of Groq's system so we can't really say how efficient it is. It could very well be that Groq is subsidizing the cost of their system to lower prices and gain PR and will increase their prices later on.
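The cost-per-token framing suggested above is just amortization arithmetic. Every number below is a made-up placeholder; the point is the shape of the calculation, not an estimate of Groq's actual costs:

```python
# Hardware cost per token = system cost / total tokens served over its life.
# All inputs are hypothetical placeholders.

system_cost_usd = 5_000_000   # all-in hardware cost (assumed)
lifetime_years = 4            # depreciation window (assumed)
system_tok_s = 20_000         # whole-system token throughput (assumed)
utilization = 0.5             # fraction of time actually serving (assumed)

tokens_served = system_tok_s * utilization * lifetime_years * 365 * 24 * 3600
usd_per_million_tok = system_cost_usd / tokens_served * 1e6
print(f"~${usd_per_million_tok:.2f} per million tokens (hardware only)")
```

Note how sensitive the result is to total system throughput, which is exactly the number that hasn't been published.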

frozenport · 2 years ago
https://wow.groq.com/artificialanalysis-ai-llm-benchmark-dou...

Seems to have it. Looks cost competitive but a lot faster.

SushiHippie · 2 years ago
I guess it depends on how much the infrastructure from TFA costs, as the H100 only costs ~$3,300 to produce but gets sold for ~$30k on average.

https://www.hpcwire.com/2023/08/17/nvidia-h100-are-550000-gp...

moffkalast · 2 years ago
I think Nvidia is listing max throughput in terms of batching, so e.g. 50 tok/s for 10 different prompts at the same time. Groq LPUs definitely outperform an H100 in raw speed.

But fundamentally it's a system that only has 10x the speed for 500x the price, made by a company that runs a blockchain and is trying to heavily market what were intended to be crypto mining chips for LLM inference. It's really quite a funny coincidence that whenever someone posts this weekly link in amazement, there's an army of Groq engineers in the comments ready to say anything and everything.

tome · 2 years ago
Groq does not run a blockchain and our chips were never intended for crypto mining.