I don't have any reason to doubt the reasoning this article is doing or the conclusions it reaches, but it's important to recognize that this article is part of a sales pitch.
Llama 4 Maverick uses 128 experts with 17B active parameters, for roughly 400B parameters in total (around 245GB on disk at ~4-5 bits per weight).
Llama 4 Behemoth uses 16 larger experts with 288B active parameters, for roughly 2 trillion in total.
I don't have the resources to test these myself, unfortunately, but Meta claims Behemoth is superior to the best SaaS options based on internal benchmarking.
For comparison, DeepSeek R1 671B is 404GB in size, with pretty similar benchmarks.
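For a rough sense of how total parameter count maps to on-disk size, here's a back-of-envelope sketch. It assumes uniform quantization at a given bits-per-weight and ignores format overhead; the bit widths are my assumption, not published figures:

    def approx_size_gb(total_params_billions: float, bits_per_weight: float) -> float:
        """Rough on-disk size: total parameters times bits per weight, ignoring overhead."""
        total_bytes = total_params_billions * 1e9 * bits_per_weight / 8
        return total_bytes / 1e9

    # Illustrative only -- the ~4.8 bits/weight here is an assumption.
    print(approx_size_gb(400, 4.8))    # ~240 GB, in line with the Maverick figure above
    print(approx_size_gb(671, 4.8))    # ~403 GB, in line with the DeepSeek R1 figure above
    print(approx_size_gb(2000, 4.8))   # ~1200 GB for a ~2T-parameter Behemoth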
But compare DeepSeek R1 32B to any model from 2021 and it's going to be significantly superior.
So we have quality of models increasing, resources needed decreasing. In 5-10 years, do we have an LLM that loads up on a 16-32GB video card that is simply capable of doing it all?
I think the best of both worlds is a sufficiently capable reasoning model with access to external tools and data that can perform CPU-based lookups for information that it doesn't possess.
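The shape of that setup is essentially a tool-calling loop: the model decides when it needs a lookup, and cheap CPU-side code answers it. A minimal sketch, where every name (model.generate, the tools dict, the message format) is hypothetical rather than any particular framework's API:

    # Hypothetical tool-calling loop: the model emits either a final answer or a
    # lookup request, and inexpensive CPU-side code resolves the lookup.
    def run(model, tools, prompt: str, max_steps: int = 5) -> str:
        history = [{"role": "user", "content": prompt}]
        for _ in range(max_steps):
            reply = model.generate(history)               # assumed: returns a dict
            if reply.get("tool") is None:
                return reply["content"]                   # model answered directly
            result = tools[reply["tool"]](reply["args"])  # e.g. a local search-index lookup
            history.append({"role": "tool", "content": result})
        return "gave up after max_steps"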
That 80% drop in o3 was only a few weeks ago!
It's not quite that simple. Gemini 2.5 Flash previously had two prices, depending on whether you enabled "thinking" mode or not. The new 2.5 Flash has just a single price, which is a lot more if you were using the non-thinking mode and may be slightly less for thinking mode.
Another way to think about this is that they retired their Gemini 2.5 Flash non-thinking model entirely, and changed the price of their Gemini 2.5 Flash thinking model from $0.15/m input, $3.50/m output to $0.30/m input (more expensive) and $2.50/m output (less expensive).
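To make that concrete, here's a quick sketch of what a hypothetical workload costs under the old thinking price versus the new unified price. The token volumes are made up for illustration; the rates are the ones quoted above:

    def cost_usd(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
        """Total cost given per-million-token input and output rates."""
        return input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m

    # Hypothetical workload: 10M input tokens, 1M output tokens.
    old_thinking = cost_usd(10_000_000, 1_000_000, 0.15, 3.50)  # old 2.5 Flash thinking price
    new_unified  = cost_usd(10_000_000, 1_000_000, 0.30, 2.50)  # new single 2.5 Flash price
    print(old_thinking, new_unified)  # 5.0 vs 5.5 -- input-heavy workloads get pricier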
Another minor nit-pick:
> For LLM providers, API calls cost them quadratically in throughput as sequence length increases. However, API providers price their services linearly, meaning that there is a fixed cost to the end consumer for every unit of input or output token they use.
That's mostly true, but not entirely: Gemini 2.5 Pro (but oddly not Gemini 2.5 Flash) charges a higher rate for inputs over 200,000 tokens. Gemini 1.5 also had a higher rate for >128,000 tokens. As a result, I treat those as separate models in my pricing table at https://www.llm-prices.com
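In code, that tiering looks something like the sketch below. The rates are placeholders, and I'm assuming the higher rate applies to the whole request once the prompt crosses the threshold, which is how I understand Gemini's pricing to work; check the provider's actual price table:

    def tiered_input_cost(input_tokens: int,
                          base_rate_per_m: float,
                          long_rate_per_m: float,
                          threshold: int = 200_000) -> float:
        """Bill the whole request at the long-context rate once it crosses the threshold."""
        rate = long_rate_per_m if input_tokens > threshold else base_rate_per_m
        return input_tokens / 1e6 * rate

    print(tiered_input_cost(150_000, 1.0, 2.0))  # 0.15 -- under the threshold
    print(tiered_input_cost(250_000, 1.0, 2.0))  # 0.50 -- whole request at the higher rate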
One last one:
> o3 is a completely different class of model. It is at the frontier of intelligence, whereas Flash is meant to be a workhorse. Consequently, there is more room for optimization that isn’t available in Flash’s case, such as more room for pruning, distillation, etc.
OpenAI are on the record that the o3 optimizations were not through model changes such as pruning or distillation. This is backed up by independent benchmarks that find the performance of the new o3 matches the previous one: https://twitter.com/arcprize/status/1932836756791177316
One odd disconnect that still exists in LLM pricing is that providers charge linearly with respect to token consumption, while their underlying compute costs grow quadratically with sequence length.
At this point, since most models have converged on similar architectures, inference algorithms, and hardware, the chosen prices are likely based on a historical, statistical analysis of the shape of customer requests. In other words, I'm not surprised to see prices increase as providers gather more data about real-world consumption patterns.
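A toy illustration of that mismatch, with arbitrary units rather than any provider's real cost model:

    # Toy comparison: self-attention work grows roughly with the square of the
    # sequence length, while billing grows linearly with the number of tokens.
    def relative_attention_work(seq_len: int) -> int:
        return seq_len ** 2          # arbitrary units

    def billed_units(seq_len: int) -> int:
        return seq_len               # linear pricing

    for n in (1_000, 10_000, 100_000):
        print(n, billed_units(n), relative_attention_work(n))
    # Each 10x jump in context multiplies the bill by 10 but the attention work by 100.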
We are building large-scale batch inference infrastructure and a great user/developer experience around it. We believe LLMs have not yet been meaningfully unlocked as data processing tools - we're changing that.
Our work involves interesting distributed systems and LLM research problems, newly-imagined user experiences, and a meaningful focus on mission and values.
Open Roles:
Infrastructure/LLM Engineer — https://jobs.skysight.inc/Member-of-Technical-Staff-Infrastr...
Research Engineer - https://jobs.skysight.inc/Member-of-Technical-Staff-Research...
If you're interested in applying, please send an email to jobs@sutro.sh with a resume/LinkedIn Profile. For extra priority, please include [HN] in the subject line.
Context size is the real killer when you look at running open source alternatives on your own hardware. Has anything even come close to the 100k+ range yet?