simonw · 2 years ago
Lots of comments talking about the model itself. This is Llama 2 70B, a model that has been around for a while now, so we're not seeing anything in terms of model quality (or model flaws) we haven't seen before.

What's interesting about this demo is the speed at which it is running, which demonstrates the "Groq LPU™ Inference Engine".

That's explained here: https://groq.com/lpu-inference-engine/

> This is the world’s first Language Processing Unit™ Inference Engine, purpose-built for inference performance and precision. How performant? Today, we are running Llama-2 70B at over 300 tokens per second per user.

I think the LPU is a custom hardware chip, though the page talking about it doesn't make that as clear as it could.

https://groq.com/products/ makes it a bit more clear - there's a custom chip, "GroqChip™ Processor".

jkachmar · 2 years ago
this is running on custom hardware; if you're curious about the underlying architecture, check the publication below.

https://groq.com/wp-content/uploads/2023/05/GroqISCAPaper202...

EDIT: i work at Groq, but i’m commenting in a personal capacity.

happy to answer clarifying questions or forward them along to folks who can :)

m3kw9 · 2 years ago
Is it fixed to a certain LLM architecture like Llama 2? How does it deal with different architectures, like MoE for example?
cicce19 · 2 years ago
Will you be selling individual cards? Are you looking for use cases in the healthcare vertical (noticed it's not on your current list)? We're working in the medical imaging space and could use this tech as part of our offering. Reach out at 16bit.ai
m1sta_ · 2 years ago
How easy is it for companies to set up private local servers using Groq hardware (cost and complexity)? I've got money. I want throughput.
mlazos · 2 years ago
How many chips are used for this demo? Do they have DRAM? I remember the earlier versions did not have DRAM.

Are they also used for training or just inference?

moneywoes · 2 years ago
what’s the cost?
laborcontract · 2 years ago
This is really impressive. For reference, inference for Llama 2 70B on Together's API generates text at roughly 60 tokens/second.

I can't find any information about an API, though I'm guessing that the costs are eye-watering.

If they offered a Mixtral endpoint that did 300-400 tokens per second at a reasonable cost, I can’t imagine ever using another provider.
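
For a rough sense of what that gap means for a user, a back-of-envelope sketch (the 500-token response length is an assumed, illustrative figure):

    # Back-of-envelope: time to stream a typical chat response at each rate.
    response_tokens = 500  # assumed length of a typical reply

    for name, tps in [("Together, ~60 T/s", 60), ("Groq demo, ~300 T/s", 300)]:
        seconds = response_tokens / tps
        print(f"{name}: {seconds:.1f} s for {response_tokens} tokens")

    # ~8.3 s vs ~1.7 s: the difference between watching text trickle in
    # and a reply that feels close to instant.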

tome · 2 years ago
We don't have a publicly available API yet, but that's coming soon in the new year. We will be price competitive with OpenAI but much faster. Deploying Mixtral is a work in progress, so keep your eyes open for that too!
GamerAlias · 2 years ago
In case it's not blindingly obvious to people: Groq are a hardware company that have built chips designed around the training and serving of machine learning models, particularly targeted at LLMs. So the quality of the response isn't really what we're looking for here. We're looking for speed i.e. tokens per second.

I actually have a final-round interview with a subsidiary of Groq coming up and I'm very undecided as to whether to pursue it, so this felt extraordinarily serendipitous to me. Food for thought here.

mlazos · 2 years ago
tbh anyone can build fast HW for a single model; I'd audit their plan for a SW stack before joining. That said, their arch is pretty unique, so if they're able to get these speeds it is pretty compelling.
tome · 2 years ago
Our hardware architecture was not designed with LLMs in mind, let alone a specific model. It's a general purpose numerical compute fabric. Our compiler allows us to quickly deploy new models of any architecture without the need that graphics processors have for handwritten kernels. We run language models, speech models, image generation models, scientific numerical programs including for drug discovery, ...
pclmulqdq · 2 years ago
They are putting the whole LLM into SRAM across multiple computing chips, IIRC. That is a very expensive way to go about serving a model, but should give pretty great speed at low batch size.
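
A rough sketch of the arithmetic behind that (the parameter count and FP16 width are the standard figures; the ~230 MB of SRAM per chip is the publicly quoted GroqChip number, and everything else here is an assumption):

    # Back-of-envelope for holding Llama 2 70B entirely in on-chip SRAM.
    params = 70e9                  # Llama 2 70B parameter count
    bytes_per_param = 2            # FP16/BF16 weights; quantization would shrink this
    weight_bytes = params * bytes_per_param        # ~140 GB of weights
    sram_per_chip = 230e6          # ~230 MB SRAM per GroqChip
    chips_for_weights = weight_bytes / sram_per_chip

    print(f"~{weight_bytes / 1e9:.0f} GB of weights -> ~{chips_for_weights:.0f} chips "
          f"just for the weights, before KV cache and activations")

Roughly 600+ chips per model replica is why this is an expensive way to serve, and also why there is no off-chip memory bottleneck once the model is resident.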


chihuahua · 2 years ago
> the quality of the response isn't really what we're looking for here. We're looking for speed i.e. tokens per second.

But if it was generating high-quality responses, would that not make it go slower?

nomel · 2 years ago
That would involve using a different model. This is not about the model, it’s about the relative speed improvement from the hardware, with this model as a demo.
coder543 · 2 years ago
Is there any plan to show what this hardware can do for Mixtral-8x7B-Instruct? Based on the leaderboards[0], it is a better model than Llama2-70B, and I’m sure the T/s would be crazy high.

[0]: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

tome · 2 years ago
Yup, deploying Mixtral is a work in progress. Watch this space!
Mockapapella · 2 years ago
I can't wait until LLMs are fast enough that a single response can actually be a whole tree-of-thought/review process before giving you an answer, yet still fast enough that you don't even notice
joshspankit · 2 years ago
I would bet a chunk of $$ that right before that point there will be a shift to bigger structures. Maybe MoE with individual tree of thought, or "town square consensus" or something.
bsima · 2 years ago
Why wait? This is pretty much what Groq has in hardware, just need the software layer to do the review process.
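
A minimal sketch of what that software layer could look like; call_llm is a hypothetical stand-in for any fast completion endpoint, not a real Groq API:

    # Hypothetical generate -> critique -> revise loop on top of a fast LLM endpoint.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your inference endpoint here")

    def answer_with_review(question: str, rounds: int = 2) -> str:
        draft = call_llm(f"Answer the question:\n{question}")
        for _ in range(rounds):
            critique = call_llm(
                f"Question: {question}\nDraft answer: {draft}\n"
                "List any errors or omissions in the draft."
            )
            draft = call_llm(
                f"Question: {question}\nDraft: {draft}\nCritique: {critique}\n"
                "Write an improved answer."
            )
        return draft

At ~300 tokens per second, a couple of extra critique/revise round trips still finish in seconds, which is the point being made above.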
phildenhoff · 2 years ago
It’s very fast at telling me it can’t tell me things!

I asked about creating illicit substances — an obvious (and reasonable) target for censorship. And, admirably, it suggested getting help instead. That’s fine.

But I asked for a poem about pumping gas in the style of Charles Bukowski, and it moaned that I shouldn’t ask for such mean-spirited, rude things. It wouldn’t dare create such a travesty.

kromem · 2 years ago
It seems like it must be using Llama-2-chat, which has had 'safety' training.

To test which underlying model I asked it what a good sexy message for my girlfriend for Valentine's Day would be, and it lectured me about objectification.

It makes sense the chat interface is using the chat model, I just wish that people were more consistent about labeling the use of Llama-2-chat vs Llama-2 as the fine tuning really does lead to significant underlying differences.

matanyal · 2 years ago
It told me "Yeehaw Ridem Cowboy" was potentially problematic, which is news to me living out west.
microtherion · 2 years ago
It seems to reject all lyrics requests as well (In my experience, LLMs are good at the first one or two lines, and then just make it up as they go along, with sometimes hilarious results).
huevosabio · 2 years ago
I saw this in person back in September.

Really impressed by their hardware.

I'm still wondering why the uptake is so slow. My understanding from their presentations was that it was relatively simple to compile a model. Why isn't it more talked about? And why not demo Mixtral or showcase multiple models?

tome · 2 years ago
We're building out racks as fast as we can to keep up with customer demand :) A public demo of Mixtral is in the works, so watch this space.
badFEengineer · 2 years ago
This was surprisingly fast: 276.27 T/s (although Llama 2 70B is noticeably worse than GPT-4 Turbo). I'm actually curious whether there are good benchmarks for inference tokens per second. I imagine it's a bit different for throughput vs. single-inference optimization, but I'm curious if there's an analysis somewhere on this.

edit: I re-ran the same prompt on Perplexity's llama-2-70b and got 59 tokens per second there
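
For anyone wanting to reproduce that kind of number, a generic timing sketch (the token stream here is a placeholder; any streaming completion API that yields tokens as they arrive would do):

    import time

    def measure_tokens_per_second(token_stream) -> float:
        """Consume a stream of generated tokens and return overall tokens/second."""
        start = time.perf_counter()
        count = sum(1 for _ in token_stream)
        elapsed = time.perf_counter() - start
        return count / elapsed if elapsed > 0 else float("inf")

    # Replace this fake iterator with a real streaming response object.
    fake_stream = iter(["token"] * 200)
    print(f"{measure_tokens_per_second(fake_stream):.1f} tokens/s")

Note this folds time-to-first-token into the average; decide whether you want to exclude that when comparing providers.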

andygeorge · 2 years ago
fast but wrong/gibberish
razorguymania · 2 years ago
It's using vanilla Llama 2 from Meta with no fine-tuning. The point here is the speed and responsiveness of the underlying HW and SW.
retro_bear · 2 years ago
The point isn't that they are running Llama2-70B. The point is that they are running Llama2-70B faster than anyone else so far.
andygeorge · 2 years ago
Out of sheer curiosity, why did you make an account for this thread?
vinniepukh · 2 years ago
at some point each one of us made an account because of a thread