If you compute the MFU implied by the author's numbers, it's 1.44 million input tokens per second * 37 billion active params * 2 (one multiply and one add per parameter) / 8 [GPUs per instance] = ~13.3 PFLOPS per GPU. That's roughly 7x the absolute peak FLOPS of the hardware. Obviously, that's impossible.
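The arithmetic above can be checked in a few lines. The token rate, parameter count, and GPU count come from the article being criticized; the ~2 PFLOPS peak is an assumed dense FP8 figure for an H100-class GPU, not something the article states.

```python
# Sanity check on the implied FLOPS. Numbers marked "assumed" are mine.
tokens_per_s = 1.44e6                 # claimed input tokens/s per instance
active_params = 37e9                  # active parameters (MoE)
flops_per_token = 2 * active_params   # one multiply + one add per parameter
gpus_per_instance = 8

flops_per_gpu = tokens_per_s * flops_per_token / gpus_per_instance
print(f"{flops_per_gpu / 1e15:.1f} PFLOPS per GPU")  # ~13.3

assumed_peak_pflops = 2.0             # assumed FP8 dense peak, H100-class
print(f"{flops_per_gpu / 1e15 / assumed_peak_pflops:.1f}x peak")  # ~6.7x
```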
There are many other issues with this article, such as assuming only 32 concurrent requests(?), assuming 8 GPUs per instance as opposed to the more efficient/standard prefill-decode disaggregated setups, assuming that attention computation is the main thing that makes models compute-bound, etc. It's a bit of an indictment of HN's understanding of LLMs that most people are bringing up issues with the article other than these fundamental misunderstandings.
https://github.com/ScalingIntelligence/tokasaurus/blob/65efb...
I’m honestly impressed that a pure Python implementation can beat out vLLM and SGLang. Granted, they lean on FlashInfer, and of course torch.compile has gotten incredibly powerful in the last few years. Though dynamic shapes have still been a huge thorn in my side, so I’ll need to look closer at how they pulled it off…
Several years ago, after a year when the movie that won Best Animated Feature was not the one those in the animation industry overwhelmingly thought was sure to win, an animation industry magazine surveyed Academy members, asking which movie they voted for and why.
What they found was that a large number of the voters thought of animated movies as just for little kids and hadn't actually watched any of the nominees. They picked their vote by whatever they remembered children in their lives watching.
E.g., if they were parents of young children, they'd vote for whatever movie their kids kept watching over and over. If they no longer had children at home, they would ask grandkids or nieces or nephews "what cartoon did you like last year?" and vote for that.
Another factor was that a lot of these people would vote for the one they had heard the most about.
That gives Disney a big advantage. How the heck did Flow overcome that?
Inside Out 2 had a much wider theatrical release in the US, was widely advertised, made $650 million domestic, is the second highest grossing animated movie of all time so far worldwide, and streams on Disney+.
All that should contribute to making it likely that those large numbers of "vote even though they don't watch animated movies" Academy members would have heard of it.
Flow had a small US theatrical release at the end of the year. I didn't see any advertising for it. I'd expect a lot of Academy members hadn't heard of it.
As a guess, maybe Moana 2 is the movie that the kids are repeat streaming. That was not a nominee so maybe those "vote for what my kid watched" voters didn't vote this year and so we actually got a year where quality non-Disney movies had a chance?
1. The academy has had a significant increase of young voters in the past 10 years or so. Generally speaking, young voters are more likely to take animation as a "serious" medium.
2. These interviews were always somewhat overstated. Of course some voters have stupid rationales, but I don't think this dominates the academy.
3. Disney's Inside Out 2 was nowhere close to winning the award this year - Flow's biggest competition was The Wild Robot, which grossed far more than Flow but far less than Inside Out 2.
If you look at the past couple of years, The Boy and the Heron (Studio Ghibli) won over Across the Spider-Verse (with Pixar's Elemental nowhere close) in 2023, Guillermo del Toro's Pinocchio won over Puss in Boots: The Last Wish (with Pixar's Turning Red nowhere close) in 2022, etc.
I'm curious what year you're thinking about above. Perhaps Toy Story 4 over Klaus in 2019?
Edit: Specifically the nature and current status of the Broadcom/Google relationship as it relates to TPUs.
Which takes it from
> Broadcom generates a 70% profit margin from its work on TPUs, said a person with direct knowledge of the internal analysis. SemiAnalysis, a chip research firm, earlier reported that figure.
https://semianalysis.com/2023/08/30/broadcoms-google-tpu-rev...
Google certainly has infra/cost advantages, but it's nowhere near 10x.
Programming these is a nightmare, as you need to have several in flight concurrently for peak performance.
Perhaps you don't need the extra flops as you end up bandwidth bound?
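A quick roofline check supports the bandwidth-bound guess: during decode, every active parameter has to be streamed from HBM once per step, so arithmetic intensity stays low at modest batch sizes. All the numbers below are illustrative assumptions (H100-class GPU, BF16 weights), not figures from the blog.

```python
# Rough roofline sketch for decode. All hardware numbers are assumed.
peak_flops = 1.0e15      # assumed dense BF16 peak, FLOP/s (H100-class)
peak_bw = 3.35e12        # assumed HBM bandwidth, bytes/s

params = 37e9            # active parameters
bytes_per_param = 2      # BF16 weights
batch = 32               # concurrent decode requests sharing one weight read

flops = 2 * params * batch              # FLOPs per decode step
bytes_moved = params * bytes_per_param  # weights read once per step
intensity = flops / bytes_moved         # FLOP per byte moved
ridge = peak_flops / peak_bw            # ~300 FLOP/byte at these numbers

print("compute-bound" if intensity > ridge else "bandwidth-bound")
```

With a batch of 32 the intensity is only 32 FLOP/byte against a ridge point near 300, so the extra matmul throughput sits idle.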
Regardless, the good thing about the code in the blog is that it'll probably work pretty well for other accelerators if you port it to HIP or similar. If you use wgmma, I'm not sure it'll even be portable across Nvidia generations.
Even rerunning the math on my use cases with a much higher input token cost doesn't change much, though.
The part of the article claiming attention needs long context lengths to become compute-bound is also quite misleading.
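One way to see why that claim is misleading: for single-token decode, the arithmetic intensity of attention doesn't grow with context length at all, because the KV cache read scales with the same s*d factor as the FLOPs. The function below is my simplification (ignoring softmax and per-head bookkeeping), not the article's model.

```python
# Sketch: decode attention intensity is independent of sequence length.
def decode_attn_intensity(seq_len, d_model, bytes_per_elem=2):
    flops = 4 * seq_len * d_model                         # q@K^T plus attn@V
    bytes_moved = 2 * seq_len * d_model * bytes_per_elem  # read K and V caches
    return flops / bytes_moved                            # = 2 / bytes_per_elem

print(decode_attn_intensity(1_000, 7168))    # 1.0 FLOP/byte
print(decode_attn_intensity(100_000, 7168))  # 1.0 -- unchanged by context
```

Longer context raises prefill attention intensity, but decode attention stays memory-bound no matter how long the context gets.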