Readit News
chillee commented on Are OpenAI and Anthropic losing money on inference?   martinalderson.com/posts/... · Posted by u/martinald
martinald · 2 hours ago
Thanks for the correction (author here). I'll update the article - very fair point on compute on input tokens which I messed up. Tbh I'm pleased my napkin math was only 7x off the laws of physics :).

Even rerunning the math on my use cases with way higher input token cost doesn't change much though.

chillee · an hour ago
The choice of 32 parallel sequences is also arbitrary and significantly changes your conclusions. For example, if they run with 256 parallel sequences, that would result in an 8x cheaper factor in your calculations for both prefill and decode.

The component about requiring long context lengths to be compute-bound for attention is also quite misleading.
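The batch-size sensitivity is simple napkin math: with a fixed instance cost, per-token cost scales inversely with the number of parallel sequences (all numbers below are hypothetical, purely to illustrate the scaling):

```python
# Napkin math: with a fixed cost per second for an 8-GPU instance,
# per-token cost falls linearly with the number of parallel sequences
# (until the instance becomes compute-bound). All numbers are hypothetical.
def cost_per_token(instance_cost_per_s, tokens_per_s_per_seq, n_seqs):
    return instance_cost_per_s / (tokens_per_s_per_seq * n_seqs)

c32 = cost_per_token(instance_cost_per_s=1.0, tokens_per_s_per_seq=50, n_seqs=32)
c256 = cost_per_token(instance_cost_per_s=1.0, tokens_per_s_per_seq=50, n_seqs=256)
print(c32 / c256)  # -> 8.0: 256 parallel sequences make each token 8x cheaper
```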

chillee commented on Are OpenAI and Anthropic losing money on inference?   martinalderson.com/posts/... · Posted by u/martinald
Den_VR · 2 hours ago
So, bottom line, do you think it’s probable that either OpenAI or Anthropic are “losing money on inference?”
chillee · 2 hours ago
No. In some sense, the article comes to the right conclusion haha. But it's probably >100x off on its central premise about output tokens costing more than input.
chillee commented on Are OpenAI and Anthropic losing money on inference?   martinalderson.com/posts/... · Posted by u/martinald
chillee · 2 hours ago
This article's math is wrong on many fundamental levels. One of the most obvious ones is that prefill is nowhere near bandwidth bound.

If you compute the MFU implied by the author's numbers, it's 1.44 million input tokens per second * 37 billion active params * 2 (FMA) / 8 [GPUs per instance] ≈ 13 petaflop/s per GPU. That's approximately 7x the absolute peak FLOPS of the hardware. Obviously, that's impossible.
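That sanity check can be reproduced directly (the ~2 PFLOP/s peak is my assumption for dense bf16 on an H100-class GPU; the rest of the numbers are from the comment):

```python
# Reproduce the sanity check: the prefill throughput the article implies
# requires more FLOPs than the hardware can physically deliver.
input_tokens_per_s = 1.44e6   # throughput implied by the article's math
active_params = 37e9          # active parameters per token
flops_per_param = 2           # one fused multiply-add = 2 FLOPs
gpus = 8                      # GPUs per instance

flops_per_gpu = input_tokens_per_s * active_params * flops_per_param / gpus
print(flops_per_gpu / 1e15)   # ~13.3 PFLOP/s required per GPU

peak = 2e15                   # assumed ~2 PFLOP/s dense bf16, H100-class
print(flops_per_gpu / peak)   # ~6.7x peak -> impossible
```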

There are many other issues with this article, such as assuming only 32 concurrent requests (?), assuming only 8 GPUs per instance as opposed to the more efficient/standard prefill-decode disaggregated setups, assuming that attention computation is the main thing that makes models compute-bound, etc. It's a bit of an indictment of HN's understanding of LLMs that most people are bringing up issues with the article that aren't any of the fundamental misunderstandings here.

chillee commented on Tokasaurus: An LLM inference engine for high-throughput workloads   scalingintelligence.stanf... · Posted by u/rsehrlich
refibrillator · 3 months ago
The code has few comments but gotta love when you can tell someone was having fun!

https://github.com/ScalingIntelligence/tokasaurus/blob/65efb...

I’m honestly impressed that a pure python implementation can beat out vLLM and SGLang. Granted they lean on FlashInfer, and of course torch.compile has gotten incredibly powerful in the last few years. Though dynamic shapes have still been a huge thorn in my side, I’ll need to look closer at how they pulled it off…

chillee · 3 months ago
I mean, vllm and sglang are both "pure python" essentially as well. But yeah, in ML you rarely require C++ to get good performance for most of the systems people are writing.
chillee commented on Blender-made movie Flow takes Oscar   reuters.com/lifestyle/flo... · Posted by u/boguscoder
tzs · 6 months ago
That's kind of surprising. Academy members are not required to watch all the nominees for Best Animated Feature before voting. In fact, they are not required to watch any of them.

Several years ago, after a year in which the movie that won Best Animated Feature was not the one that those in the animation industry overwhelmingly thought was sure to win, an animation industry magazine surveyed Academy members, asking which movie they voted for and why.

What they found was that a large number of the voters thought of animated movies as just for little kids and hadn't actually watched any of the nominees. They picked their vote by whatever they remembered children in their lives watching.

E.g., if they were parents of young children, they'd vote for whatever movie that their kids kept watching over and over. If they no longer had children at home they would ask grandkids or nieces or nephews "what cartoon did you like last year?" and vote for that.

Another factor was that a lot of these people would vote for the one they had heard the most about.

That gives Disney a big advantage. How the heck did Flow overcome that?

Inside Out 2 had a much wider theatrical release in the US, was widely advertised, made $650 million domestic, is so far the second-highest-grossing animated movie of all time worldwide, and streams on Disney+.

All that should contribute to making it likely that those large numbers of "vote even though they don't watch animated movies" Academy members would have heard of it.

Flow had a small US theatrical release at the end of the year. I didn't see any advertising for it. I'd expect a lot of Academy members hadn't heard of it.

As a guess, maybe Moana 2 is the movie that the kids are repeat streaming. That was not a nominee so maybe those "vote for what my kid watched" voters didn't vote this year and so we actually got a year where quality non-Disney movies had a chance?

chillee · 6 months ago
A couple things:

1. The academy has had a significant increase of young voters in the past 10 years or so. Generally speaking, young voters are more likely to take animation as a "serious" medium.

2. These interviews were always somewhat overstated. Of course some voters have stupid rationales, but I don't think this dominates the academy.

3. Disney's Inside Out 2 was nowhere close to winning the award this year - Flow's biggest competition was The Wild Robot, which grossed far more than Flow, but far below Inside Out 2.

If you look at the past couple of years, The Boy and the Heron (Studio Ghibli) won over Across the Spider-Verse (with Pixar's movie Elemental nowhere close) in 2023, Guillermo del Toro's Pinocchio won over Puss in Boots: The Last Wish (with Pixar's movie Turning Red nowhere close) in 2022, etc.

I'm curious what year you're thinking about above. Perhaps Toy Story 4 over Klaus in 2019?

chillee commented on Amidst the noise and haste, Google has successfully pulled a SpaceX   markmaunder.com/2025/amid... · Posted by u/mmaunder
mmaunder · 8 months ago
Mind sharing your source on that? I’ve been trying to find one.

Edit: Specifically the nature and current status of the Broadcom/Google relationship as it relates to TPUs.

chillee · 8 months ago
https://www.theinformation.com/articles/to-reduce-ai-costs-g...

which takes the figure from SemiAnalysis:

> Broadcom generates a 70% profit margin from its work on TPUs, said a person with direct knowledge of the internal analysis. SemiAnalysis, a chip research firm, earlier reported that figure.

https://semianalysis.com/2023/08/30/broadcoms-google-tpu-rev...

chillee commented on Amidst the noise and haste, Google has successfully pulled a SpaceX   markmaunder.com/2025/amid... · Posted by u/mmaunder
chillee · 8 months ago
One of the big things this article misses is that Google pays Broadcom a significant amount for the actual chip design, also around a 70% margin.

Google certainly has infra/cost advantages, but it's nowhere near 10x.
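As a quick illustration of what a 70% margin implies for what Google actually pays relative to Broadcom's cost:

```python
# A 70% profit margin means cost is only 30% of the price,
# i.e. the buyer pays roughly a 3.3x markup over cost.
margin = 0.70
markup = 1 / (1 - margin)
print(round(markup, 2))  # 3.33
```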

chillee commented on Fast LLM Inference From Scratch (using CUDA)   andrewkchan.dev/posts/yal... · Posted by u/homarp
fancyfredbot · 8 months ago
I don't think this code can make use of the tensor cores, or the wgmma instructions that you typically need to get peak performance out of them.

Programming these is a nightmare as you need to have several in flight concurrently for peak performance.

Perhaps you don't need the extra flops as you end up bandwidth bound?

Regardless, the good thing about the code in the blog is that it'll probably work pretty well for other accelerators if you port it to HIP or similar. If you use wgmma, I'm not sure it'll even be portable across Nvidia generations.

chillee · 8 months ago
For latency-bound inference (i.e. one request) you don't need tensor-cores since all your operations are just matrix vector multiplications.
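The underlying argument is about arithmetic intensity: at batch size 1, each weight is read once and used for one multiply-add, so there aren't enough FLOPs per byte for tensor cores to matter. A rough sketch (shapes and byte counts are illustrative, assuming fp16 weights):

```python
# Arithmetic intensity (FLOPs per byte of weights moved) for a dense layer.
# At batch size 1 (single-request decode), each fp16 weight is read once
# (2 bytes) and used for one multiply-add (2 FLOPs): intensity ~1 FLOP/byte.
def arithmetic_intensity(m, k, batch, bytes_per_weight=2):
    flops = 2 * m * k * batch               # multiply-accumulates over the batch
    weight_bytes = m * k * bytes_per_weight # weight traffic dominates at low batch
    return flops / weight_bytes

gemv = arithmetic_intensity(4096, 4096, batch=1)    # matrix-vector: 1.0
gemm = arithmetic_intensity(4096, 4096, batch=256)  # matrix-matrix: 256.0
print(gemv, gemm)
# Modern GPUs need hundreds of FLOPs per byte to be compute-bound, so
# batch-1 decode is bandwidth-bound and tensor cores go mostly unused.
```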
chillee commented on The GPU is not always faster   cowfreedom.de/#dot_produc... · Posted by u/CowFreedom
jcranmer · 9 months ago
AIUI, Strassen gets used moderately commonly with non-floating-point datatypes, where numerical stability is less of a concern and multiplications are more useful to minimize than memory traffic. But from what I can tell, every floating-point BLAS library eschews Strassen, despite a steady trickle of papers saying "hey, there might be some small wins if we go to Strassen!"
chillee · 9 months ago
The big issue with Strassen isn't performance - it's numerical stability.
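For reference, one level of Strassen replaces 8 block multiplies with 7 plus extra additions and subtractions, and those extra additions are where the weaker error bounds come from. A minimal NumPy sketch of one recursion level (an illustration, not how BLAS libraries implement matmul):

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen: 7 block multiplies instead of 8."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256)).astype(np.float32)
B = rng.standard_normal((256, 256)).astype(np.float32)
# Compare against a float64 reference; the extra +/- operations tend to
# amplify fp32 rounding error relative to the direct product.
err = np.abs(strassen_one_level(A, B) - A.astype(np.float64) @ B.astype(np.float64))
print(err.max())
```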
chillee commented on What Shapes Do Matrix Multiplications Like?   thonking.ai/p/what-shapes... · Posted by u/skidrow
amelius · 10 months ago
TL;DR: make sure your matrix dimensions are divisible by 2 often.
chillee · 10 months ago
Well, that'll help a lot :) But dealing with wave quantization requires dimensions that aren't necessarily a multiple of 2, and often are a multiple of the number of SMs on a GPU (i.e. 132 on an H100)
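The wave-quantization effect can be sketched as follows (assuming 132 SMs and one 128x128 output tile per SM per wave; real kernels vary in tile size and scheduling):

```python
import math

def wave_efficiency(m, n, num_sms=132, tile=128):
    """Fraction of SM-waves doing useful work for an (m, n) GEMM output."""
    tiles = math.ceil(m / tile) * math.ceil(n / tile)
    waves = math.ceil(tiles / num_sms)
    return tiles / (waves * num_sms)

print(wave_efficiency(1536, 1408))  # 12*11 = 132 tiles: one full wave -> 1.0
print(wave_efficiency(1536, 1536))  # 12*12 = 144 tiles: 2 waves -> ~0.55
```

A slightly larger matrix can be much slower: 144 tiles needs a second wave that runs nearly empty, so utilization drops to roughly half.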
