russ · a year ago
Here’s an AI voice assistant we built this weekend that uses it:

https://x.com/dsa/status/1828481132108873979?s=46&t=uB6padbn...

ein0p · a year ago
8b models won’t even need a server a year from now. Basically the only reason to go to the server a year or two from now will be to do what edge devices can’t do: general purpose chat, long context (multimodal especially), data augmented generation that relies on pre-existing data sources in the cloud, etc. And on the server it’s very expensive to run batch size 1. You want to maximize the batch size while also keeping an eye on time to first token and time per token. Basically 20-25 tok/sec generation throughput is a good number for most non-demo workloads. TTFT for median prompt size should ideally be well under 1 sec.
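The tradeoff above is simple arithmetic: end-to-end response time is TTFT plus generated tokens divided by decode throughput. A minimal sketch with illustrative (not measured) numbers:

```python
# Back-of-envelope serving latency: total response time is
# TTFT plus generated tokens divided by decode throughput.
# The example numbers are illustrative, not measured.

def response_time(ttft_s: float, out_tokens: int, tok_per_s: float) -> float:
    return ttft_s + out_tokens / tok_per_s

# 0.5 s TTFT, 200-token answer, 20 tok/s decode:
print(response_time(0.5, 200, 20.0))  # 10.5 seconds end to end
```

At 20 tok/sec the decode phase dominates for long answers, which is why TTFT only needs to be "well under 1 sec" rather than vanishingly small.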

But I’m happy they got this far. It’s an ambitious vision, and it’s extra competition in a field where it’s severely lacking.

freediver · a year ago
Yep, it is fast. Now, what exactly Llama 8B is useful for is another matter - what are some good use cases?

One scenario I can think of is roleplaying - but I would assume the slow streaming speed was kind of a feature there.

seldo · a year ago
For agentic use cases, where you might need several round-trips to the LLM to reflect on a query, improve a result, etc., getting fast inference means you can do more round-trips while still responding in reasonable time. So basically any LLM use-case is improved by having greater speed available IMO.
freediver · a year ago
The problem with this is that tok/sec does not tell you what the time to first token is. I've seen cases (with Groq) where TTFT is large for large prompts, nullifying the advantage of faster tok/sec.
rgbrgb · a year ago
Speed is useful for batch tasks or doing a bunch of serial tasks quickly. E.g. "take these 1000 pitch decks and give me 5 bullets on each", "run this prompt 100 times and then pick the best response", "detect which of these 100k comments mention the SF Giants".
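The "100k comments" example above boils down to mapping one prompt over many inputs and firing the calls in parallel. A sketch of that shape, where `llm_complete` is a hypothetical stand-in for whatever inference API you actually use (here it's a canned placeholder so the sketch runs offline):

```python
# Sketch of a batch task like "detect which comments mention the SF Giants".
# `llm_complete` is a hypothetical stand-in for a real inference API call;
# fast serving matters here because the calls can be fired in parallel.
from concurrent.futures import ThreadPoolExecutor

def llm_complete(prompt: str) -> str:
    # Placeholder: inspect the last line (the comment) so the sketch runs offline.
    return "yes" if "Giants" in prompt.splitlines()[-1] else "no"

def mentions_giants(comment: str) -> bool:
    prompt = f"Does this comment mention the SF Giants? Answer yes or no.\n{comment}"
    return llm_complete(prompt).strip().lower().startswith("yes")

comments = ["Giants won again!", "Traffic on 101 was terrible."]
with ThreadPoolExecutor(max_workers=8) as pool:
    flags = list(pool.map(mentions_giants, comments))
print(flags)  # [True, False]
```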
drdaeman · a year ago
8B is not exactly great for roleplaying, if we set the bar at all high. It is just not sophisticated enough, as it has very limited "reasoning"-like capabilities and can normally make sensible conclusions only about very basic things (like if it's raining, maybe the character will get wet). It can and will hallucinate about stuff like inventories or rules - and it's not a context length thing. If there are multiple NPCs, things get worse, as they all start to mix up.

70B does significantly better in this regard. Nowhere close to perfection, but the frequency of WTFs about the LLM's output is [subjectively] drastically lower.

Speed can be useful in RP if we ran multiple LLM-based agents (like "plot", "goal checker", "inventory", "validation", "narrator") that function-call each other to achieve some goal.
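A minimal sketch of that multi-agent idea, with plain functions standing in for what would each be a separate LLM call (all names here are invented for illustration). With fast inference, chaining several such calls per turn stays interactive:

```python
# Toy pipeline: a "plot" agent narrates, a "validation" agent checks the
# turn against game state. In a real setup each function would be an LLM
# call; here they are stubs so the sketch is self-contained.

def plot_agent(state: dict) -> str:
    return f"The party approaches {state['location']}."

def inventory_agent(state: dict) -> bool:
    # Validate that the turn only uses items the character actually holds.
    return all(item in state["inventory"] for item in state.get("needed", []))

state = {"location": "the gate", "inventory": ["rope", "torch"], "needed": ["torch"]}
narration = plot_agent(state)
valid = inventory_agent(state)
print(narration, valid)  # The party approaches the gate. True
```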

wkat4242 · a year ago
These wafers only have 44GB of RAM though. Very curious why the quantity is so low considering the chips are absolutely massive. It's SRAM though so very fast, comparable to cache in a modern CPU. But I assume being fast and loading the whole model there is the point.
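A quick sanity check of what fits in 44 GB of on-wafer SRAM if the whole model must be resident (weights only, ignoring KV cache and activations; 1 GB taken as 1e9 bytes):

```python
# Weight footprint = parameter count (billions) * bytes per weight.
def weights_gb(params_b: float, bytes_per_weight: float) -> float:
    return params_b * bytes_per_weight  # GB, using 1 GB = 1e9 bytes

print(weights_gb(8.0, 2))   # Llama 8B in fp16: 16.0 GB -> fits in 44 GB
print(weights_gb(70.0, 2))  # 70B in fp16: 140.0 GB -> needs multiple wafers
```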
halJordan · a year ago
What kind of answer are you looking for? Just start asking it questions. The constant demand for a magic silver bullet use case applicable to every person in the country is wild. If you have to ask, you're not using it.

What exact use case did google.com enable that made it worthwhile for everyone to start using it immediately? It let you access nytimes.com? Access amazon.com? No, it let you ask off-the-wall, asinine, long-tail questions no one else asked.

bottlepalm · a year ago
Surveillance states and intelligence agencies.

Or maybe a MMO with a town of NPCs.

benopal64 · a year ago
Why can't the MMO with a town of NPCs have an intelligence agency too?
phkahler · a year ago
The winner will be one of two approaches: 1) Getting great performance using regular DRAM - system memory. 2) Bringing the compute to the RAM chips - DRAM is accessed 64Kb per row (or more?) and at ~10ns per read you can use small/slow ALUs along the row to do MAC operations. Not sure how you program that though.
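Approach 2 above can be roughed out numerically: if tiny ALUs sit along a 64 Kbit row buffer and one row opens every ~10 ns, you get one MAC per weight-sized slice per row activation. A toy model, with purely illustrative numbers:

```python
# Toy model of "compute in the row buffer": ALUs along a DRAM row each
# multiply-accumulate their slice of weights on every row activation.
ROW_BITS = 64 * 1024
WEIGHT_BITS = 8
lanes = ROW_BITS // WEIGHT_BITS  # 8192 weights available per row activation

# Effective throughput if one row opens every ~10 ns:
macs_per_second = lanes * (1 / 10e-9)
print(f"{macs_per_second:.2e} MAC/s per bank")  # ~8.19e+11
```

That is hundreds of GMAC/s from a single bank without moving the weights off-chip at all, which is the whole appeal; as the comment notes, the open question is how you'd program it.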

Current "at home" inference tends to be limited by how much RAM your graphics card has, but system RAM scales better.

eth0up · a year ago
I'll probably get stoned for asking here, but... since you seem knowledgeable on the subject:

I just got llama3.1-8b (standard and instruct). However, I cannot do anything with it on my current hardware. Can you recommend the best AI model that I: 1) can self host 2) run on 16GB ram with no dedicated graphics card and an old intel i5 3) use on Debian without installing a bunch of exo-repo mystery code?

Any recommendation, directly or semi related would be appreciated - I'm doing my 'research' but haven't made much progress nor had any questions answered.

smokel · a year ago
Running LLMs on that kind of hardware will be very slow (expect responses with only a few words per second, which is probably pretty annoying).

LM Studio [1] makes it very easy to run models locally and play with them. Llama 3.1 will only run in quantized form with 16GB RAM, and that cripples it quite badly, in my opinion.

You may try Phi-3 Mini, which has only 3.8B weights and can still do fun things.

[1] https://lmstudio.ai/
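The memory math behind the quantization point above is straightforward: weight footprint is parameter count times bytes per weight (KV cache and runtime overhead come on top):

```python
# Why 16 GB RAM forces quantization for an 8B model.
def model_gib(params_b: float, bits_per_weight: int) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

print(round(model_gib(8, 16), 1))  # fp16 8B: ~14.9 GiB, too tight for 16 GB
print(round(model_gib(8, 4), 1))   # 4-bit quantized: ~3.7 GiB, fits easily
```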

arcanemachiner · a year ago
Setting up Ollama via Docker was the easiest way for me to get up and running. Not 100% sure if it fits your constraints, but highly recommended.
ein0p · a year ago
+1. For inference especially, compute is abundant and basically free in terms of energy. Almost all of the energy is spent on memory movement. The logical solution is to not move unaggregated data.
mikewarot · a year ago
Completely eliminating the separation between RAM and compute is how FPGAs are so fast, they do most of the computing as a series of Look Up Tables (LUTs), and optimize for latency and utilization with fancy switching fabrics.

The downside of the switching fabrics is that optimizing a design to fit an FPGA can sometimes take days.

rfoo · a year ago
The winner, unfortunately, will be on cloud inference.
ChrisArchitect · a year ago
[dupe]

More discussion on official post: https://news.ycombinator.com/item?id=41369705

wkat4242 · a year ago
Wow, one chip taking up a whole wafer. I bet their yields are low, though I assume they're not using the bleeding-edge process but a slightly older one that's totally worked out.

Still, the price of one of these would be nuts if they sold them. Upwards of $1 million?

Havoc · a year ago
Guessing it’s set up in a way where they can just disable dead cores
twothreeone · a year ago
Process defects can be located and routed around statically on the chip, it's described e.g. here: https://youtu.be/8i1_Ru5siXc?t=810
bkitano19 · a year ago
Time to first token is just as important to know for many use cases, yet people rarely report it.
cheptsov · a year ago
Very interested in playing with their hardware and cloud. Also, I wonder if it’s possible to try the cloud without contacting their sales team.