If you want to learn more, there is a writeup at https://wow.groq.com/now-available-on-groq-the-largest-and-m....
(disclaimer, I am a Groq employee)
It runs on Groq (the company I work for), so it's super snappy.
I think the problem is that realistic TTS needs quite a few tokens, because prosody can be affected by tokens that come a fair bit further down the sentence. Consider the difference in pitch between:
"The war will be long and bloody"
vs
"The war will be long and bloody?"
So to begin TTS you need quite a lot of tokens, which in turn means you have to digest the prompt and run a whole bunch of forward passes before you can start rendering. And of course you have to keep up with the speed of regular speech, which OpenAI sometimes struggles with.
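The paragraph above can be put into a back-of-envelope formula: time-to-first-audio is prompt digestion plus enough forward passes to cover the prosody lookahead. The numbers below are purely illustrative, not benchmarks of any system.

```python
# Toy model: time-to-first-audio = prefill time + decode time for the
# prosody lookahead. All rates here are made-up illustrative values.

def time_to_first_audio(prompt_tokens, lookahead_tokens,
                        prefill_tok_per_s, decode_tok_per_s):
    """Seconds before audio rendering can begin."""
    prefill = prompt_tokens / prefill_tok_per_s    # digest the prompt
    decode = lookahead_tokens / decode_tok_per_s   # forward passes for lookahead
    return prefill + decode

# 500-token prompt, 30-token lookahead, at 50 vs 500 tok/s decode
slow = time_to_first_audio(500, 30, 2000, 50)    # 0.25 + 0.60 = 0.85 s
fast = time_to_first_audio(500, 30, 2000, 500)   # 0.25 + 0.06 = 0.31 s
print(f"slow: {slow:.2f}s, fast: {fast:.2f}s")
```

Note that once generation is slower than real-time speech (roughly a few tokens per second of audio), the gap stops being a startup delay and becomes audible stalling.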
That said, the gap isn't huge. Many apps won't need it. Some use cases where low latency might matter:
- Phone support.
- Trading. Think digesting a press release into an action a few seconds faster than your competitors.
- Agents that listen in to conversations and "butt in" when they have something useful to say.
- RPGs where you can talk to NPCs in realtime.
- Real-time analysis of whatever's on screen on your computing device.
- Auto-completion.
- Using AI as a general command prompt. Think AI bash.
Undoubtedly there will be a lot more, though. When you give people performance, they find ways to use it.
Groq happens to be excellent at doing huge linear algebra operations extremely fast. If they are latency sensitive, even better. If they are meant to run in a loop, best of all - that reduces the bandwidth cost of shipping data into and out of the system. So think linear-algebra-driven search algorithms. ML training isn't in this category because of its bandwidth requirements. But using ML inference to intelligently explore a search space? Bingo.
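A minimal sketch of the "inference in a search loop" pattern. The model here is just a random linear scorer standing in for a real network; the point is that one inference call sits inside every iteration, so inference latency gates the whole search.

```python
# Hedged sketch: ML-inference-guided greedy search over a vector space.
# The "model" is a stand-in random linear map, not anything Groq-specific.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))  # stand-in for learned weights

def score(candidates):
    # one batched linear-algebra op per search step ("inference")
    return (candidates @ W).sum(axis=1)

def greedy_search(start, steps=10, branch=32):
    state = start
    for _ in range(steps):
        # expand: propose perturbed candidates, score them, keep the best.
        # Each iteration blocks on the score() call -- latency in the loop.
        candidates = state + 0.1 * rng.standard_normal((branch, 16))
        state = candidates[np.argmax(score(candidates))]
    return state

best = greedy_search(np.zeros(16))
print(score(best[None])[0])
```

Because each step depends on the previous step's result, you cannot hide the inference latency with batching here - which is the use case the comment is pointing at.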
If you dig around https://wow.groq.com/press, you'll find multiple such applications where we exceeded existing solutions by orders of magnitude.
To answer your questions:
- Spatial processors are an insanely good fit for async logic
- Matrix of processing engines are a moderately good fit -- definitely could be done, but I have no clue if it'd be a good idea.
In an SP, especially in an ASIC, each computation can start as soon as the previous one finishes. If you have a 4-bit layer, an 8-bit layer, and a 32-bit layer, those will take different amounts of time to run. Individual computations can take different amounts of time too (e.g. an ADD with a lot of carries versus one with just a few). In an SP, a computation takes as much time as it needs, and no more.
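The timing advantage above can be shown with a toy model: a synchronous pipeline must clock every stage at the worst-case delay, while a self-timed chain lets each operation hand off as soon as it actually finishes. The delay values are made up for illustration (arbitrary time units).

```python
# Toy comparison: clocked (worst-case per stage) vs self-timed (actual delays).
# Delays are illustrative only, e.g. 4-bit, 8-bit, 32-bit ops repeated.

actual_delays = [1, 2, 8, 1, 2, 8]

sync_clock = max(actual_delays)      # clock period must cover the slowest op
sync_time = sync_clock * len(actual_delays)
async_time = sum(actual_delays)      # each op takes only what it needs

print(sync_time, async_time)  # 48 vs 22
```

The gap grows with the variance between fast and slow operations, which is exactly the mixed-precision situation described above.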
Footnote: Personally, I think there are a lot of good ideas in 80's era and earlier processors for the design of individual compute units which have been forgotten. The basic move in architectures up through 2005 was optimizing serial computation speed at the cost of power and die size (Netburst went up to 3.8GHz two decades ago). With much simpler old-school compute units, we can have *many* more of them than a modern multiply unit. Critically, they could be positioned closer to the data, so there would be less data moving around. Especially the early pipelined / scalar / RISC cores seem very relevant. As a point of reference, a 4090 has 16k CUDA cores running at just north of 2GHz. It has the same number of transistors as 32,000 SA-110 processors (running at 200MHz on a 350 nanometer process in 1994).
TL;DR: I'm getting old and either nostalgic or grumpy. Dunno which.
So these specialized approaches never stood a chance next to CPUs. Nowadays the ground is... more fertile.
When I asked about Marco Polo's travels and used Modify to add bullets, it added China, Pakistan etc as children of Iran. And the same for other paragraphs.
I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1
I guess if you don't have any extra junk you can pack more processing into the chip?
We're just that much better at squeezing tokens out of transistors and optic cables than GPUs are - and you can imagine the implications for Watts/Token.
Anyways.. wait until you see our 4nm. :)
That being said, until there's another option anywhere near that speed, that point is moot, isn't it? :)
For now, Groq is the only option that lets you build a UX with near-instant response times. Or live agents that help with a human-to-human interaction. I could go on and on about the product categories this opens up.
I'm not so convinced they have a Tok/sec/$ advantage at all, though, especially at medium to large batch sizes, which is where the groups who can afford to buy that much silicon operate.
I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1, and Nvidia cards do get meaningfully higher throughput as batch size gets into the 100's.
This is because the relationship between Tok/s/u, Tok/s/system, Batching, and Pipelining is a complex one that involves compute utilization, network utilization, and (in particular) a host of compilation techniques that we wouldn't want to share publicly. Maybe we'll get to that level of transparency at some point, though!
As far as Batching goes, you should consider that with synchronous systems, if all the stars align, Batch=1 is all you need. Of course, the devil is in the details, and sometimes small batch numbers still give you benefits. But Batch 100's generally gives no advantages. In fact, the entire point of developing deterministic hardware and synchronous systems is to avoid batching in the first place.
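The batching trade-off discussed above can be sketched with a toy roofline model. On a weight-bandwidth-bound GPU, the weights must be streamed once per decode step regardless of batch size, so a batch of B users amortizes that read across B tokens; a system that isn't bound on weight bandwidth doesn't gain the same way. All numbers below are illustrative assumptions, not specs of any real hardware.

```python
# Toy roofline: why batching raises throughput on a bandwidth-bound device.
# weight_bytes, bw, and flops figures are made-up illustrative values.

def tokens_per_sec(batch, weight_bytes, bw_bytes_per_s,
                   flops_per_token, flops_per_s):
    mem_time = weight_bytes / bw_bytes_per_s            # one weight read/step
    compute_time = batch * flops_per_token / flops_per_s
    step_time = max(mem_time, compute_time)             # bound by the worse
    return batch / step_time                            # tokens/s across batch

# Illustrative 70B-class model at 1 byte/weight: ~70 GB of weights
for b in (1, 8, 64, 256):
    print(b, tokens_per_sec(b, 70e9, 2e12, 140e9, 1e15))
```

In this toy model, throughput scales almost linearly with batch size until compute time overtakes the weight-read time - and per-user latency climbs accordingly, which is the trade-off deterministic, synchronous systems are designed to sidestep.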
I gave a seminar about the overall approach recently, abstract: https://shorturl.at/E7TcA, recording: https://shorturl.at/zBcoL.
This two-part AMA has a lot more detail if you're already familiar with what we do:
https://www.youtube.com/watch?v=UztfweS-7MU
https://www.youtube.com/watch?v=GOGuSJe2C6U