

superasn commented on Ask HN: How can ChatGPT serve 700M users when I can't run one GPT-4 locally?    · Posted by u/superasn
nodja · 15 days ago
This is the real answer. I don't know what the people above are even discussing when batching is the biggest reduction in costs. If it costs, say, $50k to serve one request, with batching it also costs $50k to serve 100 at the same time with minimal performance loss. I don't know what the real number of users is before you need to buy new hardware, but I know it's in the hundreds, so going from $50,000 to $500 in effective cost per request is a pretty big deal (assuming you have the users to saturate the hardware).

My simple explanation of how batching works: since the bottleneck in processing LLMs is loading the model's weights onto the GPU to do the computing, instead of computing each request separately you can compute multiple requests at the same time, ergo batching.
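To make that concrete, here's a toy numpy sketch of the same idea (shapes made up, nothing like a real inference engine):

    import numpy as np

    # one "set of weights" and two user requests (shapes invented)
    W = np.random.randn(4096, 4096).astype(np.float32)
    x1 = np.random.randn(1, 4096).astype(np.float32)  # request 1
    x2 = np.random.randn(1, 4096).astype(np.float32)  # request 2

    # naive: the GPU has to pull W through its cache once per request
    y1 = x1 @ W
    y2 = x2 @ W

    # batched: stack the requests and pull W through once for both
    batch = np.vstack([x1, x2])  # shape (2, 4096)
    y = batch @ W                # same weights, one pass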

Let's make a visual example. Say you have a model with 3 sets of weights, each of which fits inside the GPU's cache (A, B, C), and you need to serve 2 requests (1, 2). A naive approach would be to serve them one at a time.

(Legend: LA = Load weight set A, CA1 = Compute weight set A for request 1)

LA->CA1->LB->CB1->LC->CC1->LA->CA2->LB->CB2->LC->CC2

But you could instead batch the compute parts together.

LA->CA1->CA2->LB->CB1->CB2->LC->CC1->CC2

Now if you consider that loading is hundreds if not thousands of times slower than computing the same data, you'll see the big difference. Here's a "chart" visualizing the two approaches if loading were just 10 times slower. (Consider 1 letter a unit of time.)

Time spent using approach 1 (1 request at a time):

LLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLC

Time spent using approach 2 (batching):

LLLLLLLLLLCCLLLLLLLLLLCCLLLLLLLLLLCC
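If you want to play with the numbers, this little sketch reproduces the charts above (the 10x ratio is made up, like the charts):

    # loading is 10x slower than computing in this toy model
    LOAD, COMPUTE = 10, 1        # made-up time units
    layers, requests = 3, 2      # weight sets A/B/C, requests 1/2

    naive   = requests * layers * (LOAD + COMPUTE)  # reload weights per request
    batched = layers * (LOAD + requests * COMPUTE)  # load each layer once

    print(naive, batched)        # 66 vs 36, matching the letter counts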

The difference is even more dramatic in the real world because, as I said, loading is many times slower than computing; you'd have to serve many users before you see a serious difference in speeds. I believe the real-world restriction is actually that serving more users requires more memory to store the activation state of the weights, so you'll end up running out of memory and have to balance how many people per GPU cluster you serve at the same time.
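A back-of-envelope of why that memory limit bites; in modern serving the per-user state is mostly the KV cache, and every number below is invented for illustration:

    # rough per-user KV-cache memory (all numbers invented)
    layers, kv_heads, head_dim = 80, 8, 128
    seq_len, bytes_per_val = 8192, 2   # fp16

    kv_per_request = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val
    print(kv_per_request / 2**30)      # ~2.5 GiB per request

    # with, say, 40 GiB of VRAM left after the weights, that's ~16
    # concurrent requests per GPU before you run out of memory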

TL;DR: It's pretty expensive to get enough hardware to serve an LLM, but once you do, you can serve hundreds of users at the same time with minimal performance loss.

superasn · 15 days ago
Thanks for the helpful reply! As I wasn't able to fully understand it, I pasted your reply into ChatGPT and asked it some follow-up questions. Here is what I understand from my interaction:

- Big models like GPT-4 are split across many GPUs (sharding).

- Each GPU holds some layers in VRAM.

- To process a request, weights for a layer must be loaded from VRAM into the GPU's tiny on-chip cache before doing the math.

- Loading into cache is slow; the ops themselves are fast.

- Without batching: load layer > compute user1 > load again > compute user2.

- With batching: load layer once > compute for all users > send to GPU 2, etc. (toy sketch after this list)

- This makes cost per user drop massively if you have enough simultaneous users.

- But bigger batches need more GPU memory for activations, so there's a max size.
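Here's a runnable toy of that flow, the way I picture it (two fake "GPUs"; sizes and layer counts invented):

    import numpy as np

    rng = np.random.default_rng(0)
    # model "sharded" across 2 fake GPUs, each holding 2 layers
    gpu_shards = [[rng.standard_normal((64, 64)) for _ in range(2)],
                  [rng.standard_normal((64, 64)) for _ in range(2)]]

    x = rng.standard_normal((8, 64))      # 8 users batched together
    for shard in gpu_shards:              # "send to GPU 2" = next shard
        for layer in shard:               # each layer is "loaded" once...
            x = np.maximum(x @ layer, 0)  # ...and computed for all 8 users
    print(x.shape)                        # (8, 64): one output per user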

This does make sense to me, but does it sound accurate to you?

Would love to know if I'm still missing something important.

superasn commented on Cerebras Code   cerebras.ai/blog/introduc... · Posted by u/d3vr
thanhhaimai · 23 days ago
> running at speeds of up to 2,000 tokens per second, with a 131k-token context window, no proprietary IDE lock-in, and no weekly limits!

I was excited, then I read this:

> Send up to 1,000 messages per day—enough for 3–4 hours of uninterrupted vibe coding.

I don't mind paying for services I use. But it's hard to take this seriously when the claim in the first paragraph contradicts the fine print.

superasn · 23 days ago
Pretty sure this is there to prevent this[1] from happening to them

[1] https://www.viberank.app/

superasn commented on Show HN: An AI agent that learns your product and guides your users   frigade.ai... · Posted by u/pancomplex
superasn · 24 days ago
I like the concept, but the landing page is not good and way too heavy.

My browser just froze after scrolling halfway. Not sure if it's something to do with the scroll effects, but I really don't understand why this simple site is maxing out my CPUs.

superasn commented on Stop selling “unlimited”, when you mean “until we change our minds”   blog.kilocode.ai/p/ai-pri... · Posted by u/heymax054
superasn · a month ago
I'm not a fan of usage caps either, but that Reddit post [1] (“You deserve harsh limits”) does highlight a perspective worth considering.

When some users burn massive amounts of compute just to climb leaderboards or farm karma, it's not hard to imagine why providers respond with tighter limits: not because it's ideal, but because that kind of behavior makes platforms harder to sustain and less accessible for everyone else. On the other hand, a lot of genuine customers are canceling because they get API-overload messages even after paying $200.

I still think caps are frustrating and often too blunt, but posts like that make it easier to see where the pressure might be coming from.

[1] https://www.reddit.com/r/ClaudeAI/comments/1lqrbnc/you_deser...

superasn commented on How to make websites that will require lots of your time and energy   blog.jim-nielsen.com/2025... · Posted by u/OuterVale
superasn · a month ago
Always use ORMs and then spend the next year debugging N+1 queries, bloated joins, and mysterious performance issues that only show up in prod.

Migrations randomly fail, schema changes are a nightmare, and your team forgets how SQL works.

ORMs promise to abstract the database but end up being just another layer you have to fight when things go wrong.
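For anyone who hasn't been bitten yet, here's the classic N+1 shape in plain sqlite (toy schema; roughly what an ORM's lazy loading does for you under the hood):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE posts (id INTEGER PRIMARY KEY);
        CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER);
    """)

    # N+1: one query for the posts, then one more query per post
    for (post_id,) in db.execute("SELECT id FROM posts").fetchall():
        db.execute("SELECT * FROM comments WHERE post_id = ?", (post_id,))

    # the single query the ORM was hiding from you
    db.execute("SELECT * FROM posts JOIN comments ON comments.post_id = posts.id")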

superasn commented on A media company demanded a license fee for an Open Graph image I used   alistairshepherd.uk/writi... · Posted by u/cheeaun
superasn · a month ago
This is the new source of income, and a lot of media orgs are getting paid. Take ANI in India [1].

They've been hitting YouTubers like Mohak Mangal, Nitish Rajput, and Dhruv Rathee with copyright strikes for using just a few seconds of news clips, which you would think is fair use.

Then they privately message creators demanding $60,000 to remove the strikes, or else the channel gets deleted after the third strike.

It's not about protecting content anymore; it's copyright extortion. Fair use doesn't matter. Systems like YouTube make it easy to abuse and nearly impossible to fight.

It's turning into a business model: pay, or your channels with millions of subs get deleted.

[1] https://the420.in/dhruv-rathee-mohak-mangal-nitish-rajput-an...

superasn commented on Preliminary report into Air India crash released   bbc.co.uk/news/live/cx20p... · Posted by u/cjr
dubcanada · a month ago
No, neither black box stores video. One stores audio on flash memory and the other stores flight details, sensor data, etc.

I don't think video is a bad idea; I assume there is a reason why it wasn't done. Data-wise, black boxes actually store very little (maybe 100 MB). I don't know if that is due to how old they are or the requirements of withstanding extremes.

superasn · a month ago
Not sure why something so important isn't included.

Heck, they could back up directly to the cloud in addition to the black box, considering I'm able to watch YouTube on some flights nowadays.
