I was excited, then I read this:
> Send up to 1,000 messages per day—enough for 3–4 hours of uninterrupted vibe coding.
I don't mind paying for services I use. But it's hard to take this seriously when the claim in the first paragraph contradicts the fine print.
My browser just froze after scrolling halfway. Not sure if this has something to do with the scroll effects, but I really don't understand why this simple site is maxing out my CPUs.
When some users burn massive amounts of compute just to climb leaderboards or farm karma, it's not hard to imagine why providers might respond with tighter limits: not because that's ideal, but because that kind of behavior makes platforms harder to sustain and less accessible for everyone else. On the other hand, a lot of genuine customers are canceling because they keep getting API overload messages after paying $200.
I still think caps are frustrating and often too blunt, but posts like that make it easier to see where the pressure might be coming from.
[1] https://www.reddit.com/r/ClaudeAI/comments/1lqrbnc/you_deser...
Migrations randomly fail, schema changes are a nightmare, and your team forgets how SQL works.
ORMs promise to abstract the database but end up being just another layer you have to fight when things go wrong.
They've been hitting YouTubers like Mohak Mangal, Nitish Rajput, and Dhruv Rathee with copyright strikes for using just a few seconds of news clips, which you would think is fair use.
Then they privately message creators demanding $60,000 to remove the strikes, or else the channel gets deleted after the third strike.
It's not about protecting content anymore; it's copyright extortion. Fair use doesn't matter. A system like YouTube's makes it easy to abuse and nearly impossible to fight.
It's turning into a business model: pay, or your channels with millions of subs get deleted.
[1] https://the420.in/dhruv-rathee-mohak-mangal-nitish-rajput-an...
I don't think video is a bad idea. I assume there is a reason why it wasn't done. Data-wise, black boxes actually store very little (maybe 100 MB); I don't know if that is due to how old they are or to the requirements of withstanding extremes.
Heck, they could make a backup directly to the cloud in addition to the black box, considering I'm able to watch YouTube on some flights nowadays.
My simple explanation of how batching works: since the bottleneck in serving LLMs is loading the model's weights onto the GPU to do the computing, instead of computing each request separately you can compute multiple requests at the same time, hence batching.
Let's make a visual example: say you have a model with 3 sets of weights that can fit inside the GPU's cache (A, B, C), and you need to serve 2 requests (1, 2). A naive approach would be to serve them one at a time.
(Legend: LA = Load weight set A, CA1 = Compute weight set A for request 1)
LA->CA1->LB->CB1->LC->CC1->LA->CA2->LB->CB2->LC->CC2
But you could instead batch the compute parts together.
LA->CA1->CA2->LB->CB1->CB2->LC->CC1->CC2
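To make the schedule concrete, here's a tiny Python sketch that just generates the two operation sequences above. The 3 weight sets and 2 requests are the same made-up example; nothing here touches a real GPU.

    # Illustrative only: generate the naive and batched schedules as strings.
    weight_sets = ["A", "B", "C"]
    requests = [1, 2]

    # Naive: run each request end to end, reloading every weight set per request.
    naive = []
    for r in requests:
        for w in weight_sets:
            naive.append(f"L{w}")       # load weight set w into cache
            naive.append(f"C{w}{r}")    # compute weight set w for request r

    # Batched: load each weight set once, then compute it for every request.
    batched = []
    for w in weight_sets:
        batched.append(f"L{w}")         # load weight set w into cache once
        for r in requests:
            batched.append(f"C{w}{r}")  # reuse the already-loaded weights

    print("->".join(naive))    # LA->CA1->LB->CB1->LC->CC1->LA->CA2->...
    print("->".join(batched))  # LA->CA1->CA2->LB->CB1->CB2->LC->CC1->CC2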
Now if you consider that loading is hundreds if not thousands of times slower than computing the same data, you'll see the big difference. Here's a "chart" visualizing the difference between the two approaches if loading were just 10 times slower. (Consider 1 letter a unit of time.)
Time spent using approach 1 (1 request at a time):
LLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLCLLLLLLLLLLC
Time spent using approach 2 (batching):
LLLLLLLLLLCCLLLLLLLLLLCCLLLLLLLLLLCC
The difference is even more dramatic in the real world because, as I said, loading is many times slower than computing, so you'd have to serve many users before you see a serious drop in speed. I believe the real-world restriction is actually that serving more users requires more memory to store the activations for each request, so you'll eventually run out of memory and have to balance how many users per GPU cluster you serve at the same time.
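Here's a rough back-of-the-envelope version of that "chart" in Python, assuming the same made-up 10:1 load-to-compute ratio. The numbers are illustrative, not measurements.

    # Illustrative timing model only; all constants are made up.
    LOAD = 10     # time units to load one weight set into cache
    COMPUTE = 1   # time units to compute one weight set for one request
    LAYERS = 3    # weight sets A, B, C
    USERS = 2     # requests 1, 2

    naive_time = USERS * LAYERS * (LOAD + COMPUTE)    # reload per request
    batched_time = LAYERS * (LOAD + USERS * COMPUTE)  # load once per layer

    print(naive_time)    # 66 time units, matching the first "chart"
    print(batched_time)  # 36 time units, matching the second

    # The catch: each extra user in the batch needs memory for its own
    # activations (on the order of batch_size * seq_len * hidden_dim per
    # layer), so batch size is capped by whatever VRAM is left after the
    # weights.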
TL;DR: It's pretty expensive to get enough hardware to serve an LLM, but once you do, you can serve hundreds of users at the same time with minimal performance loss.
- Big models like GPT-4 are split across many GPUs (sharding).
- Each GPU holds some layers in VRAM.
- To process a request, weights for a layer must be loaded from VRAM into the GPU's tiny on-chip cache before doing the math.
- Loading into cache is slow; the ops themselves are fast.
- Without batching: load layer > compute user1 > load again > compute user2.
- With batching: load layer once > compute for all users > send to GPU 2, etc. (see the sketch after this list)
- This makes cost per user drop massively if you have enough simultaneous users.
- But bigger batches need more GPU memory for activations, so there's a max size.
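If it helps, here's a toy numpy sketch of that per-layer loop, with made-up shapes and a plain matmul + ReLU standing in for a real transformer layer. A real sharded deployment would pipeline this across GPUs rather than looping on one.

    # Toy sketch: one batched forward pass, layer by layer. Not a real stack.
    import numpy as np

    HIDDEN = 64
    N_LAYERS = 4   # pretend each "GPU" holds one layer
    BATCH = 8      # 8 simultaneous user requests

    # One weight matrix per layer, standing in for the weights kept in VRAM.
    layers = [np.random.randn(HIDDEN, HIDDEN) / np.sqrt(HIDDEN)
              for _ in range(N_LAYERS)]

    # All users' current states stacked into a single (BATCH, HIDDEN) matrix.
    x = np.random.randn(BATCH, HIDDEN)

    for w in layers:
        # "Load layer once" corresponds to touching w a single time here;
        # the one matmul then computes it for every user in the batch.
        x = np.maximum(x @ w, 0.0)   # toy layer: matmul + ReLU

    print(x.shape)  # (8, 64): every user advanced one layer per loaded matrix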
This does make sense to me, but does it sound accurate to you?
Would love to know if I'm still missing something important.