Posted by u/superasn 4 months ago
Ask HN: How can ChatGPT serve 700M users when I can't run one GPT-4 locally?
Sam said yesterday that ChatGPT handles ~700M weekly users. Meanwhile, I can't even run a single GPT-4-class model locally without insane VRAM or painfully slow speeds.

Sure, they have huge GPU clusters, but there must be more going on - model optimizations, sharding, custom hardware, clever load balancing, etc.

What engineering tricks make this possible at such massive scale while keeping latency low?

Curious to hear insights from people who've built large-scale ML systems.

canyon289 · 4 months ago
I work at Google on these systems every day (caveat: these are my own words, not my employer's). So I can simultaneously tell you that it's smart people really thinking about every facet of the problem, and that I can't tell you much more than that.

However, I can share this, written by my colleagues! You'll find great explanations of accelerator architectures and the considerations made to make things fast.

https://jax-ml.github.io/scaling-book/

In particular, your questions are around inference, which is the focus of this chapter: https://jax-ml.github.io/scaling-book/inference/

Edit: Another great resource to look at is the unsloth guides. These folks are incredibly good at getting deep into various models and finding optimizations, and they're very good at writing it up. Here's the Gemma 3n guide, and you'll find others as well.

https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-...

KaiserPro · 4 months ago
Same explanation but with less mysticism:

Inference is (mostly) stateless. So unlike training where you need to have memory coherence over something like 100k machines and somehow avoid the certainty of machine failure, you just need to route mostly small amounts of data to a bunch of big machines.

I don't know what the specs of their inference machines are, but where I worked the machines research used were all 8-GPU monsters. So long as your model fitted in (combined) VRAM, the job was a good'un.

To scale, the secret ingredient was industrial amounts of cash. Sure, we had DGXs (fun fact: Nvidia sent literal gold-plated DGX machines), but they weren't dense, and they were very expensive.

Most large companies have robust RPC and orchestration, which means the hard part isn't routing the message, it's making the model fit in the boxes you have. (That's not my area of expertise though.)

zozbot234 · 4 months ago
> Inference is (mostly) stateless. ... you just need to route mostly small amounts of data to a bunch of big machines.

I think this might just be the key insight. The key advantage of doing batched inference at a huge scale is that once you maximize parallelism and sharding, your model parameters and the memory bandwidth associated with them are essentially free (since at any given moment they're being shared among a huge number of requests!); you "only" pay for the request-specific raw compute and the memory storage+bandwidth for the activations. And the proprietary models are now huge, highly quantized, extreme-MoE models where the former factor (model size) is huge and the latter (request-specific compute) has been correspondingly minimized - and where it hasn't, you're definitely paying "pro" pricing for it. I think this goes a long way towards explaining how inference at scale can work better than it does locally.

(There are "tricks" you could do locally to try and compete with this setup, such as storing model parameters on disk and accessing them via mmap, at least when doing token gen on CPU. But of course you're paying for that with increased latency, which you may or may not be okay with in that context.)
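
For illustration, a minimal sketch of that mmap trick with NumPy (the file name and shape here are invented; runtimes like llama.cpp handle this for you):

  import numpy as np
  # Map a (hypothetical) raw float32 weight file straight from disk.
  # Nothing is read eagerly; the OS pages weights in on first touch and
  # keeps the hot ones in the page cache.
  W = np.memmap("layer0.weights.f32", dtype=np.float32, mode="r",
                shape=(4096, 4096))
  x = np.random.randn(4096).astype(np.float32)
  y = W @ x  # first matmul faults pages in from disk; later ones hit cache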

abdullin · 4 months ago
> Inference is (mostly) stateless

Quite the opposite. Context caching requires state (K/V cache) close to the VRAM. Streaming requires state. Constrained decoding (known as Structured Outputs) also requires state.
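
As a rough illustration (names invented), the per-session state being described is basically the K/V tensors of every token already processed, kept near the accelerator so each new token only pays for itself:

  import numpy as np
  # Toy per-session KV cache: a list of (keys, values) arrays, one per layer.
  # Real servers keep this in VRAM (often paged); it is the state in question.
  kv_cache = {}  # session_id -> list of (K, V) pairs
  def append_kv(session_id, layer, k, v):
    layers = kv_cache.setdefault(session_id, [])
    while len(layers) <= layer:
      layers.append((np.empty((0, k.shape[-1])), np.empty((0, v.shape[-1]))))
    K, V = layers[layer]
    layers[layer] = (np.concatenate([K, k]), np.concatenate([V, v]))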

blibble · 4 months ago
> So I can simultaneously tell you that it's smart people really thinking about every facet of the problem, and that I can't tell you much more than that.

"we do 1970s mainframe style timesharing"

there, that was easy

kstrauser · 4 months ago
For real. Say it takes 1 machine 5 seconds to reply, and that a machine can only possibly form 1 reply at a time (which I doubt, but for argument).

If the requests were regularly spaced, and they certainly won’t be, but for the sake of argument, then 1 machine could serve 17,000 requests per day, or 120,000 per week. At that rate, you’d need about 5,800 machines to serve 700M requests. That’s a lot to me, but not to someone who owns a data center.

Yes, those 700M users will issue more than 1 query per week and they won’t be evenly spaced. However, I’d bet most of those queries will take well under 1 second to answer, and I’d also bet each machine can handle more than one at a time.

It’s a large problem, to be sure, but that seems tractable.
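
Spelling the arithmetic out (same toy assumptions as above: 5 s per reply, one reply at a time, one request per user per week):

  seconds_per_reply = 5
  replies_per_day = 24 * 60 * 60 // seconds_per_reply   # 17,280 (~17k)
  replies_per_week = replies_per_day * 7                # 120,960 (~120k)
  machines = 700_000_000 / replies_per_week             # ~5,787 machines
  print(round(machines))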

brookst · 4 months ago
But that’s not accurate. There are all sorts of tricks around KV cache where different users will have the same first X bytes because they share system prompts, caching entire inputs / outputs when the context and user data is identical, and more.

Not sure if you were just joking or really believe that, but for other peoples’ sake, it’s wildly wrong.
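
A sketch of the shared-prefix trick, purely illustrative (run_prefill is a stand-in for the real prefill step, not any actual API): requests that start with the same system prompt can reuse the K/V entries computed for that prefix instead of re-running prefill over it.

  import hashlib
  prefix_cache = {}  # hash of shared prefix -> precomputed K/V blocks
  def kv_for_request(system_prompt, user_prompt, run_prefill):
    key = hashlib.sha256(system_prompt.encode()).hexdigest()
    if key not in prefix_cache:
      # Pay the prefill cost for the shared system prompt only once...
      prefix_cache[key] = run_prefill(system_prompt)
    prefix_kv = prefix_cache[key]
    # ...then each request only prefills its own user-specific suffix.
    return run_prefill(user_prompt, past_kv=prefix_kv)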

claytongulick · 4 months ago
I'm pretty sure that's not right.

They're definitely running cluster knoppix.

:-)

rootsudo · 4 months ago
Makes perfect sense, completely understand now!
benreesman · 4 months ago
I don't think it's either useful or particularly accurate to characterize modern disagg racks of inference gear, well-understood RDMA and other low-overhead networking techniques, aggressive MLA and related cache optimizations that are in the literature, and all the other stuff that goes into a system like this as being some kind of mystical thing attended to by a priesthood of people from a different tier of hacker.

This stuff is well understood in public, and where a big name has something highly custom going on? Often as not it's a liability around attachment to some legacy thing. You run this stuff at scale by having the correct institutions and processes in place that it takes to run any big non-trivial system: that's everything from procurement and SRE training to the RTL on the new TPU, and all of the stuff is interesting, but if anyone was 10x out in front of everyone else? You'd be able to tell.

Signed, Someone Who Also Did Megascale Inference for a TOP-5 For a Decade.

tough · 4 months ago
Doesn't Google have TPUs that make inference of their own models much more profitable than, say, having to rent Nvidia cards?

Doesn't OpenAI depend mostly on its relationship/partnership with Microsoft to get GPUs to run inference on?

Thanks for the links, interesting book!

ActorNightly · 4 months ago
Yes. Google is probably gonna win the LLM game tbh. They had a massive head start with TPUs, which are very energy efficient compared to Nvidia cards.
canyon289 · 4 months ago
I'm a research person building models, so I can't answer your questions well (save for one part).

That is, as a research person using our GPUs and TPUs I see first hand how choices from the high level python level, through Jax, down to the TPU architecture all work together to make training and inference efficient. You can see a bit of that in the gif on the front page of the book. https://jax-ml.github.io/scaling-book/

I also see how sometimes bad choices by me can make things inefficient. Luckily for me if my code/models are running slow I can ping colleagues who are able to debug at both a depth and speed that is quite incredible.

And because we're on HN I want to preemptively call out my positive bias for Google! It's a privilege to be able to see all this technology first hand, work with great people, and do my best to ship this at scale across the globe.

ignoramous · 4 months ago
> Another great resource to look at is the unsloth guides.

And folks at LMSys: https://lmsys.org/blog/

  Large Model Systems (LMSYS Corp.) is a 501(c)(3) non-profit focused on incubating open-source projects and research. Our mission is to make large AI models accessible to everyone by co-developing open models, datasets, systems, and evaluation tools. We conduct cutting-edge machine learning research, develop open-source software, train large language models for broad accessibility, and build distributed systems to optimize their training and inference.

hnpolicestate · 4 months ago
This caught my attention: "But today even “small” models run so close to hardware limits".

Sounds analogous to the 60's and 70's i.e "even small programs run so close to hardware limits". If optimization and efficiency is dead in software engineering, it's certainly alive and well in LLM development.

jackhalford · 4 months ago
Why does the unsloth guide for gemma 3n say:

> llama.cpp an other inference engines auto add a <bos> - DO NOT add TWO <bos> tokens! You should ignore the <bos> when prompting the model!

That makes me want to try exactly that. Weird.

nwhnwh · 4 months ago
Nothing smart about making something that is not useful for humans.
revskill · 4 months ago
No, you just over complicate things.
LAC-Tech · 4 months ago
If people at google are so smart why can't google.com get a 100% lighthouse score?
jeltz · 4 months ago
I have met a lot of people at Google; they have some really good engineers and some mediocre ones. But most importantly, they are just normal engineers dealing with normal office politics.

I don't like how the grandparent mystifies this. This problem is just normal engineering. Any good engineer could learn how to do it.

usr1106 · 4 months ago
Because most smart people are not generalists. My first boss was really smart and managed to found a university institute in computer science. The 3 other professors he hired were, ahem, strange choices. We 28-year-old assistants could only shake our heads. After fighting a couple of years with his own hires, the founder left in frustration to found another institution.

One of my colleagues was only 25, really smart in his field, and became a professor less than 10 years later. But he was incredibly naive in everyday chores. Buying groceries or filing taxes resulted in major screw-ups regularly.

ranger_danger · 4 months ago
Pro-tip: they're just not. A lot of tech nerds really like to think they're geniuses with all the answers ("why don't they just do XX"), but some eventually learn that the world is not so black and white.

The Dunning-Kruger effect also applies to smart people. You don't stop when you are estimating your ability correctly. As you learn more, you gain more awareness of your ignorance and continue being conservative with your self estimates.

catigula · 4 months ago
A lot of really smart people working on problems that don't even really need to be solved is an interesting aspect of market allocation.
YossarianFrPrez · 4 months ago
Can you explain what you mean about 'not needing to be solved'? There are versions of that kind of critique that would seem, at least on the surface, to better apply to finance or flash trading.

I ask because scaling a system that a substantial chunk of the population finds incredibly useful, including for the more efficient production of public goods (scientific research, for example), does seem like a problem that a) needs to be solved from a business point of view, and b) should be solved from a civic-minded point of view.

virgil_disgr4ce · 4 months ago
> working on problems that don't even really need to be solved

Very, very few problems _need_ to be solved. Feeding yourself is a problem that needs to be solved in order for you to continue living. People solve problems for different reasons. If you don't think LLMs are valuable, you can just say that.

vermilingua · 4 months ago
Well, we all thought advertising was the worst thing to come out of the tech industry, someone had to prove us wrong!
airhangerf15 · 4 months ago
An H100 is a $20k USD card and has 80GB of vRAM. Imagine a 2U rack server with $100k of these cards in it. Now imagine an entire rack of these things, plus all the other components (CPUs, RAM, passive cooling or water cooling) and you're talking $1 million per rack, not including the costs to run them or the engineers needed to maintain them. Even the "cheaper"

I don't think people realize the size of these compute units.

When the AI bubble pops is when you're likely to be able to realistically run good local models. I imagine some of these $100k servers going for $3k on eBay in 10 years, and a lot of electricians being asked to install new 240v connectors in makeshift server rooms or garages.

semi-extrinsic · 4 months ago
What do you mean 10 years?

You can pick up a DGX-1 on Ebay right now for less than $10k. 256 GB vRAM (HBM2 nonetheless), NVLink capability, 512 GB RAM, 40 CPU cores, 8 TB SSD, 100 Gbit HBAs. Equivalent non-Nvidia branded machines are around $6k.

They are heavy, noisy like you would not believe, and a single one just about maxes out a 16A 240V circuit. Which also means it produces 13 000 BTU/hr of waste heat.

kj4ips · 4 months ago
Fair warning: the BMCs on those suck so bad, and the firmware bundles are painful, since you need a working nvidia-specific container runtime to apply them, which you might not be able to get up and running because of a firmware bug causing almost all the ram to be presented as nonvolatile.
ksherlock · 4 months ago
It's not waste heat if you only run it in the winter.
eulgro · 4 months ago
> 13 000 BTU/hr

In sane units: 3.8 kW
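
Both figures line up, assuming the full 16 A at 240 V is being drawn:

  watts = 16 * 240            # 3,840 W, i.e. ~3.8 kW
  btu_per_hr = watts * 3.412  # ~13,100 BTU/hr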

quickthrowman · 4 months ago
You’ll need (2) 240V 20A 2P breakers, one for the server and one for the 1-ton mini-split to remove the heat ;)
xtiansimon · 4 months ago
> “They are heavy, noisy like you would not believe, … produces … waste heat.”

Haha. I bought a 20-year-old IBM server off eBay for a song. It was fun for a minute. It soon became a doorstop and I sold it as pickup-only on eBay for $20. Beast. Never again will I have one in my home.

CamperBob2 · 4 months ago
Are you talking about the guy in Temecula running two different auctions with some of the same photos (356878140643 and 357146508609, both showing a missing heat sink?) Interesting, but seems sketchy.

How useful is this Tesla-era hardware on current workloads? If you tried to run the full DeepSeek R1 model on it at (say) 4-bit quantization, any idea what kind of TTFT and TPS figures might be expected?

nulltype · 4 months ago
> What do you mean 10 years?

Didn’t the DGX-1 come out 9 years ago?

invaliduser · 4 months ago
Even if the AI bubble does not pop, your prediction about those servers being available on eBay in 10 years will likely be true, because some datacenters will simply upgrade their hardware and resell their old ones to third parties.
potatolicious · 4 months ago
Would anybody buy the hardware though?

Sure, datacenters will get rid of the hardware - but only because it's no longer commercially profitable to run them, presumably because compute demands have eclipsed their abilities.

It's kind of like buying a used GeForce 980Ti in 2025. Would anyone buy them and run them besides out of nostalgia or curiosity? Just the power draw makes them uneconomical to run.

Much more likely every single H100 that exists today becomes e-waste in a few years. If you have need for H100-level compute you'd be able to buy it in the form of new hardware for way less money and consuming way less power.

For example if you actually wanted 980Ti-level compute in a desktop today you can just buy a RTX5050, which is ~50% faster, consumes half the power, and can be had for $250 brand new. Oh, and is well-supported by modern software stacks.

belter · 4 months ago
Except their insane electricity demands will still be the same, meaning nobody will buy them. You have plenty of SPARC servers on Ebay.
mattmanser · 4 months ago
Someone's take on AI was that we're collectively investing billions in data centers that will be utterly worthless in 10 years.

Unlike the investments in railways or telephone cables or roads or any other sort of architecture, this investment has a very short lifespan.

Their point was that whatever your take on AI, the present investment in data centres is a ridiculous waste and will always end up as a huge net loss compared to most other investments our societies could spend it on.

Maybe we'll invent AGI and they'll be proven wrong, as the data centres will pay for themselves many times over, but I suspect they'll ultimately be proved right and it'll all end up as landfill.

DecentShoes · 4 months ago
This seems likely. Blizzard even sold off old World of Warcraft servers. You can still get them on eBay.
torginus · 4 months ago
My personal sneaking suspicion is that publicly offered models are using way less compute than thought. In modern mixture-of-experts models, you can do top-k routing, where only some experts are evaluated, meaning even SOTA models aren't using much more compute than a 70-80b non-MoE model.
ActorNightly · 4 months ago
To piggyback on this: at the enterprise level in the modern age, the question is really not "how are we going to serve all these users"; it comes down to the fact that investors believe that eventually they will see a return on investment, and will then pay whatever is needed to get the infra.

Even if you didn't have optimizations involved in terms of job scheduling, they would just build as many warehouses as necessary filled with as many racks as necessary to serve the required user base.

brikym · 4 months ago
As a non-American the 240V thing made me laugh.

RagnarD · 4 months ago
An RTX 6000 Pro (NVIDIA Blackwell GPU) has 96GB of VRAM and can be had for around $7700 currently (at least, the lowest price I've found.) It plugs into standard PC motherboard PCIe slots. The Max Q edition has slightly less performance but a max TDP of only 300W.
eitally · 4 months ago
What I wonder is what this means for Coreweave, Lambda and the rest, who are essentially just renting out fleets of racks like this. Does it ultimately result in acquisition by a larger player? Severe loss of demand? Can they even sell enough to cover the capex costs?
cootsnuck · 4 months ago
It means they're likely going to be left holding a very expensive bag.
adw · 4 months ago
These are also depreciating assets.

torginus · 4 months ago
I wonder if it's feasible to hook up NAND flash with a high bandwidth link necessary for inference.

Each of these NAND chips has hundreds of dies of flash stacked inside, and they are hooked up to the same data line, so only one of them can talk at a time, and they still achieve >1GB/s bandwidth. If you could hook them up in parallel, you could have hundreds of GB/s of bandwidth per chip.

potatolicious · 4 months ago
NAND is very, very slow relative to RAM, so you'd pay a huge performance penalty there. But maybe more importantly my impression is that memory contents mutate pretty heavily during inference (you're not just storing the fixed weights), so I'd be pretty concerned about NAND wear. Mutating a single bit on a NAND chip a million times over just results in a large pile of dead NAND chips.
dboreham · 4 months ago
They'll be in landfill in 10 years.
neko_ranger · 4 months ago
Four H100s in a 2U server didn't sound impressive, but that is accurate:

>A typical 1U or 2U server can accommodate 2-4 H100 PCIe GPUs, depending on the chassis design.

>In a 42U rack with 20x 2U servers (allowing space for switches and PDU), you could fit approximately 40-80 H100 PCIe GPUs.

michaelt · 4 months ago
Why stop at 80 H100s for a mere 6.4 terabytes of GPU memory?

Supermicro will sell you a full rack loaded with servers [1] providing 13.4 TB of GPU memory.

And with 132kW of power draw, you can heat an olympic-sized swimming pool by 1°C every day with that rack alone. That's almost as much power consumption as 10 mid-sized cars cruising at 50 mph.

[1] https://www.supermicro.com/en/products/system/gpu/48u/srs-gb...
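
The pool figure checks out, assuming a 2,500 m³ olympic pool (50 m x 25 m x 2 m) and that essentially all of the rack's electrical draw ends up as heat in the water:

  pool_kg = 2_500 * 1000                  # 2,500 m^3 of water ~= 2.5 million kg
  joules_per_deg = pool_kg * 4186         # specific heat of water, J/(kg*degC)
  rack_joules_per_day = 132_000 * 86_400  # 132 kW running for 24 hours
  print(rack_joules_per_day / joules_per_deg)  # ~1.09 degC per day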

jzymbaluk · 4 months ago
And the big hyperscaler cloud providers are building city-block sized data centers stuffed to the gills with these racks as far as the eye can see
tootie · 4 months ago
Yeah, I think the crux of the issue is that ChatGPT is serving a huge number of users, including paid users, and is still operating at a massive loss. They are spending truckloads of money on GPUs and selling access at a loss.
scarface_74 · 4 months ago
This isn’t like how Google was able to buy up dark fiber cheaply and use it.

From what I understand, this hardware has a high failure rate over the long term especially because of the heat they generate.

shusaku · 4 months ago
> When the AI bubble pops is when you're likely to be able to realistically run good local models.

After years of “AI is a bubble, and will pop when everyone realizes they’re useless plagiarism parrots” it’s nice to move to the “AI is a bubble, and will pop when it becomes completely open and democratized” phase

cootsnuck · 4 months ago
It's not even been 3 years. Give it time. The entire boom and bust of the dot-com bubble took 7 years.

piyh · 4 months ago
You have thousands of dollars, they have tens of billions. $1,000 vs $10,000,000,000. They have 7 more zeros than you, which is one less zero than the scale difference in users: 1 user (you) vs 700,000,000 users (openai). They managed to squeak out at least one or two zeros worth of efficiency at scale vs what you're doing.

Also, you CAN run local models that are as good as GPT-4 was at launch on a MacBook with 24 gigs of RAM.

https://artificialanalysis.ai/?models=gpt-oss-20b%2Cgemma-3-...

cornholio · 4 months ago
You can knock off a zero or two just by time-shifting the 700 million distinct users across a day/week and accounting for the mere minutes of compute time they will actually use in each interaction. So they might not see peaks higher than 10 million active inference sessions at the same time.

Conversely, you can't do the same thing as a self hosted user, you can't really bank your idle compute for a week and consume it all in a single serving, hence the much more expensive local hardware to reach the peak generation rate you need.
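
As a rough illustration of the time-shifting argument (the minutes-per-week figure is an assumption, not a known number):

  weekly_users = 700_000_000
  active_minutes_per_week = 10        # assumed average compute time per user
  seconds_in_week = 7 * 24 * 3600
  avg_concurrent = weekly_users * active_minutes_per_week * 60 / seconds_in_week
  print(round(avg_concurrent))        # ~694,000 concurrent sessions on average
  # Even a peak several times the average is a few million sessions, not 700M.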

0cf8612b2e1e · 4 months ago
During times of high utilization, how do they handle more requests than they have hardware? Is the software granular enough that they can round robin the hardware per token generated? UserA token, then UserB, then UserC, back to UserA? Or is it more likely that everyone goes into a big FIFO processing the entire request before switching to the next user?

I assume the former has massive overhead, but maybe it is worthwhile to keep responsiveness up for everyone.

fergal_reid · 4 months ago
I think the most direct answer is that at scale, inference can be batched, so that processing many queries together in a parallel batch is more efficient than interactively dedicating a single GPU per user (like your home setup).

If you want a survey of intermediate level engineering tricks, this post we wrote on the Fin AI blog might be interesting. (There's probably a level of proprietary techniques OpenAI etc have again beyond these): https://fin.ai/research/think-fast-reasoning-at-3ms-a-token/

nodja · 4 months ago
This is the real answer. I don't know what people above are even discussing when batching is the biggest reduction in costs. If it costs, say, $50k to serve one request, with batching it also costs $50k to serve 100 at the same time with minimal performance loss. I don't know what the real number of users is before you need to buy new hardware, but I know it's in the hundreds, so going from $50,000 to $500 in effective cost per user is a pretty big deal (assuming you have the users to saturate the hardware).

My simple explanation of how batching works: Since the bottleneck of processing LLMs is in loading the weights of the model onto the GPU to do the computing, what you can do is instead of computing each request separately, you can compute multiple at the same time, ergo batching.

Let's make a visual example, let's say you have a model with 3 sets of weights that can fit inside the GPU's cache (A, B, C) and you need to serve 2 requests (1, 2). A naive approach would be to serve them one at a time.

(Legend: LA = Load weight set A, CA1 = Compute weight set A for request 1)

LA->CA1->LB->CB1->LC->CC1->LA->CA2->LB->CB2->LC->CC2

But you could instead batch the compute parts together.

LA->CA1->CA2->LB->CB1->CB2->LC->CC1->CC2

Now if you consider that the loading is hundreds if not thousands of times slower than computing the same data, then you'll see the big difference. Here's a "chart" visualizing the difference between the two approaches if loading were just 10 times slower. (Consider 1 letter a unit of time.)

Time spent using approach 1 (1 request at a time):

LLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLC

Time spend using approach 2 (batching):

LLLLLLLLLLCCLLLLLLLLLLCCLLLLLLLLLLCC

The difference is even more dramatic in the real world because, as I said, loading is many times slower than computing; you'd have to serve many users before you see a serious difference in speeds. I believe in the real world the restriction is actually that serving more users requires more memory to store the activation state, so you'll end up running out of memory and you'll have to balance how many people per GPU cluster you want to serve at the same time.

TL;DR: It's pretty expensive to get enough hardware to serve an LLM, but once you do, you can serve hundreds of users at the same time with minimal performance loss.
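
A tiny simulation of the two schedules above, using the same made-up assumption that loading a weight set costs 10 time units and computing it for one request costs 1:

  LOAD_COST, COMPUTE_COST = 10, 1
  WEIGHT_SETS = ["A", "B", "C"]
  def serve(num_requests, batched):
    time = 0
    if batched:
      # Load each weight set once, then compute it for every request in the batch.
      for _ in WEIGHT_SETS:
        time += LOAD_COST + COMPUTE_COST * num_requests
    else:
      # Reload all the weights from scratch for every request.
      for _ in range(num_requests):
        for _ in WEIGHT_SETS:
          time += LOAD_COST + COMPUTE_COST
    return time
  print(serve(2, batched=False))    # 66  -> matches the first "chart"
  print(serve(2, batched=True))     # 36  -> matches the second
  print(serve(100, batched=False))  # 3300
  print(serve(100, batched=True))   # 330 -> 100 users in ~10x one user's time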

superasn · 4 months ago
Thanks for the helpful reply! As I still wasn't able to fully understand it, I pasted your reply into ChatGPT and asked it some follow-up questions; here is what I understand from my interaction:

- Big models like GPT-4 are split across many GPUs (sharding).

- Each GPU holds some layers in VRAM.

- To process a request, weights for a layer must be loaded from VRAM into the GPU's tiny on-chip cache before doing the math.

- Loading into cache is slow, the ops are fast though.

- Without batching: load layer > compute user1 > load again > compute user2.

- With batching: load layer once > compute for all users > send to gpu 2 etc

- This makes cost per user drop massively if you have enough simultaneous users.

- But bigger batches need more GPU memory for activations, so there's a max size.

This does make sense to me, but does it sound accurate to you?

Would love to know if I'm still missing something important.

abathologist · 4 months ago
One clever ingredient in OpenAI's secret sauce is billions of dollars of losses. About $5 billion lost in 2024. https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t...
throwmeaway222 · 4 months ago
That's all different now with agentic workflows, which were not really a big thing until the end of 2024. Before, they were doing 1 request; now they're doing hundreds for a given task. The reason OAI/Azure win over locally run models is the parallelization that you can do with a thinking agent: simultaneous processing of multiple steps.
nickpsecurity · 4 months ago
You hit the nail on the head. Just gotta add the up to $10 billion investment from Microsoft to cover pretraining, R&D, and inference. Then, they still lost billions.

One can serve a lot of models if allowed to burn through over a billion dollars with no profit requirement. Classic VC-style, growth-focused capitalism with an unusual business structure.

DoctorOetker · 4 months ago
Due to batching, inference is profitable, very profitable.

Yet undoubtedly they are making what is declared a loss.

But is it really a loss?

If you buy an asset, is that automatically a loss? or is it an investment?

By "running at a loss" one can build a huge dataset, to stay in the running.

dbbk · 4 months ago
How batched can it really be though if every request is personalised to the user with Memory?
gregoriol · 4 months ago
With infinite resources, you can serve infinite users. Until it's gone.
93po · 4 months ago
They would be break-even if all they did was serve existing models and got rid of everything related to R&D.
mperham · 4 months ago
Have they considered replacing their engineers with AI?
Invictus0 · 4 months ago
An AI lab with no R&D. Truly a hacker news moment
TheAlchemist · 4 months ago
Would you have any numbers to back it up?
knowitnone2 · 4 months ago
They are not the only player, so getting rid of R&D would be suicide.
jp57 · 4 months ago
700M weekly users doesn't say much about how much load they have.

I think the thing to remember is that the majority of ChatGPT users, even those who use it every day, are idle 99.9% of the time. Even someone who has it actively processing for an hour a day, seven days a week, is idle 96% of the time. On top of that, many are using less-intensive models. The fact that they chose to mention weekly users implies that there is a significant tail of their user distribution who don't even use it once a day.

So your question factors into a few of easier-but-still-not-trivial problems:

- Making individual hosts that can fit their models in memory and run them at acceptable toks/sec.

- Making enough of them to handle the combined demand, as measured in peak aggregate toks/sec.

- Multiplexing all the requests onto the hosts efficiently.

Of course there are nuances, but honestly, from a high level the last problem does not seem so different from running a search engine. All the state is in the chat transcript, so I don't think there's any particular reason that successive interactions on the same chat need to be handled by the same server. They could just be load-balanced to whatever server is free.

We don't know, for example, when the chat says "Thinking..." whether the model is running or if it's just queued waiting for a free server.

joshhart · 4 months ago
A single node with GPUs has a lot of FLOPs and very high memory bandwidth. When only processing a few requests at a time, the GPUs are mostly waiting on the model weights to stream from the GPU ram to the processing units. When batching requests together, they can stream a group of weights and score many requests in parallel with that group of weights. That allows them to have great efficiency.

Some of the other main tricks - compress the model to 8 bit floating point formats or even lower. This reduces the amount of data that has to stream to the compute unit, also newer GPUs can do math in 8-bit or 4-bit floating point. Mixture of expert models are another trick where for a given token, a router in the model decides which subset of the parameters are used so not all weights have to be streamed. Another one is speculative decoding, which uses a smaller model to generate many possible tokens in the future and, in parallel, checks whether some of those matched what the full model would have produced.

Add all of these up and you get efficiency! Source - was director of the inference team at Databricks
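
A minimal sketch of the 8-bit compression trick mentioned above: naive symmetric int8 quantization with a single scale per tensor (production stacks use per-channel or FP8 schemes, but the bandwidth win is the same idea).

  import numpy as np
  def quantize_int8(w):
    scale = np.abs(w).max() / 127.0           # one scale for the whole tensor
    q = np.round(w / scale).astype(np.int8)   # 4x fewer bytes than float32
    return q, scale
  def dequantize(q, scale):
    return q.astype(np.float32) * scale       # done on the fly at compute time
  w = np.random.randn(4096, 4096).astype(np.float32)
  q, s = quantize_int8(w)
  print(q.nbytes / w.nbytes)  # 0.25: a quarter of the data to stream per token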

bawana · 4 months ago
How is speculative decoding helpful if you still have to run the full model against which you check the results?
joshhart · 4 months ago
So the inference speed at low to medium usage is memory bandwidth bound, not compute bound. By “forecasting” into the future you do not increase the memory bandwidth pressure much but you use more compute. The compute is checking each potential token in parallel for several tokens forward. That compute is essentially free though because it’s not the limiting resource. Hope this makes sense, tried to keep it simple.
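
A toy greedy version of that draft-and-verify loop (draft_next and target_next are stand-ins for the small and big models, not real APIs):

  def speculative_step(tokens, draft_next, target_next, k=4):
    # Small model proposes k tokens, one at a time (cheap, low bandwidth).
    draft = []
    for _ in range(k):
      draft.append(draft_next(tokens + draft))
    # Big model checks all k positions in ONE pass: target_next returns its
    # own greedy choice at each position given tokens + draft[:i]. That pass
    # costs extra compute but streams the big weights only once.
    verified = target_next(tokens, draft)
    accepted = []
    for proposed, wanted in zip(draft, verified):
      if proposed != wanted:
        accepted.append(wanted)   # keep the big model's token and stop
        break
      accepted.append(proposed)
    return accepted               # always >= 1 correct token per big-model pass
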
ritz_labringue · 4 months ago
The short answer is "batch size". These days, LLMs are what we call "Mixture of Experts", meaning they only activate a small subset of their weights at a time. This makes them a lot more efficient to run at high batch size.

If you try to run GPT4 at home, you'll still need enough VRAM to load the entire model, which means you'll need several H100s (each one costs like $40k). But you will be under-utilizing those cards by a huge amount for personal use.

It's a bit like saying "How come Apple can make iphones for billions of people but I can't even build a single one in my garage"

jsnell · 4 months ago
> These days, LLMs are what we call "Mixture of Experts", meaning they only activate a small subset of their weights at a time. This makes them a lot more efficient to run at high batch size.

I don't really understand why you're trying to connect MoE and batching here. Your stated mechanism is not only incorrect but actually the wrong way around.

The efficiency of batching comes from optimally balancing the compute and memory bandwidth, by loading a tile of parameters from the VRAM to cache, applying those weights to all the batched requests, and only then loading in the next tile.

So batching only helps when multiple queries need to access the same weights for the same token. For dense models, that's just what always happens. But for MoE, it's not the case, exactly due to the reason that not all weights are always activated. And then suddenly your batching becomes a complex scheduling problem, since not all the experts at a given layer will have the same load. Surely a solvable problem, but MoE is not the enabler for batching but making it significantly harder.

ritz_labringue · 4 months ago
You’re right, I conflated two things. MoE improves compute efficiency per token (only a few experts run), but it doesn’t meaningfully reduce memory footprint.

For fast inference you typically keep all experts in memory (or shard them), so VRAM still scales with the total number of experts.

Practically, that’s why home setups are wasteful: you buy a GPU for its VRAM capacity, but MoE only activates a fraction of the compute each token, and some experts/devices sit idle (because you are the only one using the model).

MoE does not make batching more efficient, but it demands larger batches to maximize compute utilization and to amortize routing. Dense models batch trivially (same weights every token). MoE batches well once the batch is large enough so each expert has work. So the point isn’t that MoE makes batching better, but that MoE needs bigger batches to reach its best utilization.
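
A sketch of the scheduling issue being described (router and experts are stand-in callables): with MoE you first group the batch's tokens by the expert they were routed to, so each expert's weights get streamed once for however many tokens happened to pick it.

  from collections import defaultdict
  import numpy as np
  def moe_layer(tokens, router, experts):
    # router(t) -> expert index; experts[e](batch) -> transformed batch.
    buckets = defaultdict(list)
    for i, t in enumerate(tokens):
      buckets[router(t)].append(i)
    out = np.empty_like(tokens)
    for e, idxs in buckets.items():
      # One pass per *used* expert: a big batch keeps every expert busy,
      # while a batch of one (home use) loads weights for a single token's work.
      out[idxs] = experts[e](tokens[idxs])
    return out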

radarsat1 · 4 months ago
I'm actually not sure I understand how MoE helps here. If you can route a single request to a specific subnetwork then yes, it saves compute for that request. But if you have a batch of 100 requests, unless they are all routed exactly the same, which feels unlikely, aren't you actually increasing the number of weights that need to be processed? (at least with respect to an individual request in the batch).
arjvik · 4 months ago
Essentially, inference is well-amortized across the many users.
robotnikman · 4 months ago
I wonder then if it's possible to keep the less-used parts in main memory and the more-used parts in VRAM.
cududa · 4 months ago
Great metaphor