If you want to learn more, there is a writeup at https://wow.groq.com/now-available-on-groq-the-largest-and-m....
(disclaimer, I am a Groq employee)
It runs on Groq (the company I work for), so it's super snappy.
I think the problem is that realistic TTS needs quite a few tokens, because prosody can be affected by tokens that come a fair bit further down the sentence. Consider the difference in pitch between:
"The war will be long and bloody"
vs
"The war will be long and bloody?"
So to begin TTS you need quite a lot of tokens, which in turn means you have to digest the prompt and run a whole bunch of forward passes before you can start rendering. And of course you have to keep up with the speed of regular speech, which OpenAI sometimes struggles with.
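The paragraph above can be put into a back-of-envelope formula: time-to-first-audio is prompt digestion plus enough forward passes to cover the prosody lookahead. The numbers below are purely illustrative, not benchmarks of any system.

```python
# Toy model: time-to-first-audio = prefill time + decode time for the
# prosody lookahead. All rates here are made-up illustrative values.

def time_to_first_audio(prompt_tokens, lookahead_tokens,
                        prefill_tok_per_s, decode_tok_per_s):
    """Seconds before audio rendering can begin."""
    prefill = prompt_tokens / prefill_tok_per_s    # digest the prompt
    decode = lookahead_tokens / decode_tok_per_s   # forward passes for lookahead
    return prefill + decode

# 500-token prompt, 30-token lookahead, at 50 vs 500 tok/s decode
slow = time_to_first_audio(500, 30, 2000, 50)    # 0.25 + 0.60 = 0.85 s
fast = time_to_first_audio(500, 30, 2000, 500)   # 0.25 + 0.06 = 0.31 s
print(f"slow: {slow:.2f}s, fast: {fast:.2f}s")
```

Note that once generation is slower than real-time speech (roughly a few tokens per second of audio), the gap stops being a startup delay and becomes audible stalling.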
That said, the gap isn't huge. Many apps won't need it. Some use cases where low latency might matter:
- Phone support.
- Trading. Think digesting a press release into an action a few seconds faster than your competitors.
- Agents that listen in to conversations and "butt in" when they have something useful to say.
- RPGs where you can talk to NPCs in realtime.
- Real-time analysis of whatever's on screen on your computing device.
- Auto-completion.
- Using AI as a general command prompt. Think AI bash.
Undoubtedly there will be a lot more, though. When you give people performance, they find ways to use it.
Groq happens to be excellent at doing huge linear algebra operations extremely fast. If they are latency sensitive, even better. If they are meant to run in a loop, best of all - that reduces the bandwidth cost of shipping data into and out of the system. So think linear-algebra-driven search algorithms. ML training isn't in this category because of its bandwidth requirements. But using ML inference to intelligently explore a search space? Bingo.
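A minimal sketch of the "inference in a search loop" pattern. The model here is just a random linear scorer standing in for a real network; the point is that one inference call sits inside every iteration, so inference latency gates the whole search.

```python
# Hedged sketch: ML-inference-guided greedy search over a vector space.
# The "model" is a stand-in random linear map, not anything Groq-specific.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))  # stand-in for learned weights

def score(candidates):
    # one batched linear-algebra op per search step ("inference")
    return (candidates @ W).sum(axis=1)

def greedy_search(start, steps=10, branch=32):
    state = start
    for _ in range(steps):
        # expand: propose perturbed candidates, score them, keep the best.
        # Each iteration blocks on the score() call -- latency in the loop.
        candidates = state + 0.1 * rng.standard_normal((branch, 16))
        state = candidates[np.argmax(score(candidates))]
    return state

best = greedy_search(np.zeros(16))
print(score(best[None])[0])
```

Because each step depends on the previous step's result, you cannot hide the inference latency with batching here - which is the use case the comment is pointing at.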
If you dig around https://wow.groq.com/press, you'll find multiple such applications where we exceeded existing solutions by orders of magnitude.
To answer your questions:
- Spatial processors are an insanely good fit for async logic
- Matrix of processing engines are a moderately good fit -- definitely could be done, but I have no clue if it'd be a good idea.
In an SP, especially in an ASIC, each computation can start as soon as the previous one finishes. If you have a 4-bit layer, an 8-bit layer, and a 32-bit layer, those will take different amounts of time to run. Individual computations can take different amounts of time too (e.g. an ADD with a lot of carries versus one with just a few). In an SP, a computation takes as much time as it needs, and no more.
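The timing advantage above can be shown with a toy model: a synchronous pipeline must clock every stage at the worst-case delay, while a self-timed chain lets each operation hand off as soon as it actually finishes. The delay values are made up for illustration (arbitrary time units).

```python
# Toy comparison: clocked (worst-case per stage) vs self-timed (actual delays).
# Delays are illustrative only, e.g. 4-bit, 8-bit, 32-bit ops repeated.

actual_delays = [1, 2, 8, 1, 2, 8]

sync_clock = max(actual_delays)      # clock period must cover the slowest op
sync_time = sync_clock * len(actual_delays)
async_time = sum(actual_delays)      # each op takes only what it needs

print(sync_time, async_time)  # 48 vs 22
```

The gap grows with the variance between fast and slow operations, which is exactly the mixed-precision situation described above.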
Footnote: Personally, I think there are a lot of good ideas in 80's era and earlier processors for the design of individual compute units which have been forgotten. The basic move in architectures up through 2005 was optimizing serial computation speed at the cost of power and die size (Netburst went up to 3.8GHz two decades ago). With much simpler old-school compute units, we can have *many* more of them than a modern multiply unit. Critically, they could be positioned closer to the data, so there would be less data moving around. Especially the early pipelined / scalar / RISC cores seem very relevant. As a point of reference, a 4090 has 16k CUDA cores running at just north of 2GHz. It has the same number of transistors as 32,000 SA-110 processors (running at 200MHz on a 350 nanometer process in 1994).
TL;DR: I'm getting old and either nostalgic or grumpy. Dunno which.
So these specialized approaches never stood a chance next to CPUs. Nowadays the ground is... more fertile.
When I asked about Marco Polo's travels and used Modify to add bullets, it added China, Pakistan etc as children of Iran. And the same for other paragraphs.
I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1
I guess if you don't have any extra junk you can pack more processing into the chip?
We're just that much better at squeezing tokens out of transistors and optic cables than GPUs are - and you can imagine the implications for Watts/Token.
Anyways.. wait until you see our 4nm. :)
That being said, until there's another option anywhere near that speed, that point is moot, isn't it? :)
For now, Groq is the only option that lets you build a UX with near-instant response times. Or live agents that help with a human-to-human interaction. I could go on and on about the product categories this opens up.
I'm not so convinced they have a Tok/sec/$ advantage at all, though, especially at medium to large batch sizes, which is where the groups who can afford to buy that much silicon operate.
I assume given the architecture that Groq actually doesn't get any faster for batch sizes >1, and Nvidia cards do get meaningfully higher throughput as batch size gets into the 100's.
This is because the relationship between Tok/s/u, Tok/s/system, Batching, and Pipelining is a complex one that involves compute utilization, network utilization, and (in particular) a host of compilation techniques that we wouldn't want to share publicly. Maybe we'll get to that level of transparency at some point, though!
As far as Batching goes, you should consider that with synchronous systems, if all the stars align, Batch=1 is all you need. Of course, the devil is in the details, and sometimes small batch numbers still give you benefits. But Batch 100's generally gives no advantages. In fact, the entire point of developing deterministic hardware and synchronous systems is to avoid batching in the first place.
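The batching trade-off discussed above can be sketched with a toy roofline model. On a weight-bandwidth-bound GPU, the weights must be streamed once per decode step regardless of batch size, so a batch of B users amortizes that read across B tokens; a system that isn't bound on weight bandwidth doesn't gain the same way. All numbers below are illustrative assumptions, not specs of any real hardware.

```python
# Toy roofline: why batching raises throughput on a bandwidth-bound device.
# weight_bytes, bw, and flops figures are made-up illustrative values.

def tokens_per_sec(batch, weight_bytes, bw_bytes_per_s,
                   flops_per_token, flops_per_s):
    mem_time = weight_bytes / bw_bytes_per_s            # one weight read/step
    compute_time = batch * flops_per_token / flops_per_s
    step_time = max(mem_time, compute_time)             # bound by the worse
    return batch / step_time                            # tokens/s across batch

# Illustrative 70B-class model at 1 byte/weight: ~70 GB of weights
for b in (1, 8, 64, 256):
    print(b, tokens_per_sec(b, 70e9, 2e12, 140e9, 1e15))
```

In this toy model, throughput scales almost linearly with batch size until compute time overtakes the weight-read time - and per-user latency climbs accordingly, which is the trade-off deterministic, synchronous systems are designed to sidestep.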
I gave a seminar about the overall approach recently, abstract: https://shorturl.at/E7TcA, recording: https://shorturl.at/zBcoL.
This two-part AMA has a lot more detail if you're already familiar with what we do:
https://www.youtube.com/watch?v=UztfweS-7MU
https://www.youtube.com/watch?v=GOGuSJe2C6U