sethkim commented on The End of Moore's Law for AI? Gemini Flash Offers a Warning   sutro.sh/blog/the-end-of-... · Posted by u/sethkim
ramesh31 · 2 months ago
>By embracing batch processing and leveraging the power of cost-effective open-source models, you can sidestep the price floor and continue to scale your AI initiatives in ways that are no longer feasible with traditional APIs.

Context size is the real killer when you look at running open source alternatives on your own hardware. Has anything even come close to the 100k+ range yet?

sethkim · 2 months ago
Yes! Both Llama 3.1 and Gemma 3 have 128k-token context windows.
sethkim commented on The End of Moore's Law for AI? Gemini Flash Offers a Warning   sutro.sh/blog/the-end-of-... · Posted by u/sethkim
sharkjacobs · 2 months ago
> If you’re building batch tasks with LLMs and are looking to navigate this new cost landscape, feel free to reach out to see how Sutro can help.

I don't have any reason to doubt the reasoning in this article or the conclusions it reaches, but it's important to recognize that the article is part of a sales pitch.

sethkim · 2 months ago
Yes, we're a startup! And LLM inference is a major component of what we do - more importantly, we're working on making these models accessible as analytical processing tools, so we have a strong focus on making them cost-effective at scale.
sethkim commented on The End of Moore's Law for AI? Gemini Flash Offers a Warning   sutro.sh/blog/the-end-of-... · Posted by u/sethkim
incomingpain · 2 months ago
I think the big thing that really surprised me is how small these mixture-of-experts models are relative to their dense-model equivalency.

Llama 4 Maverick is 128 experts x 17B active parameters, about 245GB of weights, with a dense equivalency around 400 billion.

Llama 4 Behemoth is 288B active parameters, with a dense equivalency around 2 trillion.

I don't have the resources to test these, unfortunately, but Meta is claiming Behemoth is superior to the best SaaS options via internal benchmarking.

For comparison, DeepSeek R1 671B is 404GB in size, with pretty similar benchmarks.

But compare DeepSeek R1 32B to any model from 2021 and it's going to be significantly superior.

So model quality is increasing while the resources needed are decreasing. In 5-10 years, do we have an LLM that loads onto a 16-32GB video card and is simply capable of doing it all?
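
For a rough sense of what fits on a card, here's a back-of-the-envelope sketch (my own arithmetic, not the parent's): weight memory is roughly parameter count times bytes per parameter, before KV cache and activations.

    # Back-of-the-envelope VRAM for model weights alone
    # (ignores KV cache and activations, which add several GB more).
    def weight_gb(params_billion: float, bits_per_param: float) -> float:
        return params_billion * bits_per_param / 8  # 1e9 params * bytes/param = GB

    print(weight_gb(32, 4))   # ~16 GB: a 4-bit 32B model just fits a 16GB card
    print(weight_gb(70, 4))   # ~35 GB: a 4-bit 70B model wants a 48GB card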

sethkim · 2 months ago
My two cents here is the classic answer - it depends. If you need general "reasoning" capabilities, I see this being a strong possibility. If you need specific, factual information baked into the weights themselves, you'll need something large enough to store that data.

I think the best of both worlds is a sufficiently capable reasoning model with access to external tools and data that can perform CPU-based lookups for information that it doesn't possess.
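
A minimal sketch of that pattern (llm() and lookup() are hypothetical stand-ins for a chat-completion call and a database or search-index query, not any particular framework's API):

    # Toy "reasoning model + external lookup" loop.
    def answer(question: str, llm, lookup) -> str:
        need = llm(f"List the facts you'd need to answer: {question}")
        facts = lookup(need)  # cheap CPU-side retrieval instead of baked-in weights
        return llm(f"Facts:\n{facts}\n\nUsing only these facts, answer: {question}")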

sethkim commented on The End of Moore's Law for AI? Gemini Flash Offers a Warning   sutro.sh/blog/the-end-of-... · Posted by u/sethkim
simonw · 2 months ago
Even given how much prices have decreased over the past 3 years, I think there's still room for them to keep going down. I expect there remain a whole lot of optimizations, in both software and hardware, that have not yet been discovered.

That 80% price drop on o3 was only a few weeks ago!

sethkim · 2 months ago
No doubt prices will continue to drop! We just don't think the declines will be anything like the orders-of-magnitude YoY improvements we're used to seeing. Consequently, developers shouldn't expect the cost of building and scaling AI applications to get anywhere close to "free" in the near future, as many expect.
sethkim commented on The End of Moore's Law for AI? Gemini Flash Offers a Warning   sutro.sh/blog/the-end-of-... · Posted by u/sethkim
simonw · 2 months ago
"In a move that at first went unnoticed, Google significantly increased the price of its popular Gemini 2.5 Flash model"

It's not quite that simple. Gemini 2.5 Flash previously had two prices, depending on if you enabled "thinking" mode or not. The new 2.5 Flash has just a single price, which is a lot more if you were using the non-thinking mode and may be slightly less for thinking mode.

Another way to think about this is that they retired their Gemini 2.5 Flash non-thinking model entirely, and changed the price of their Gemini 2.5 Flash thinking model from $0.15/m input, $3.50/m output to $0.30/m input (more expensive) and $2.50/m output (less expensive).
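
To make that concrete, here's the arithmetic for a hypothetical request using the rates above (the 10k-input/1k-output token counts are made up for illustration):

    # Cost of one request under the old vs. new Gemini 2.5 Flash thinking pricing.
    def cost(in_tok, out_tok, in_rate, out_rate):  # rates are $ per million tokens
        return in_tok / 1e6 * in_rate + out_tok / 1e6 * out_rate

    print(cost(10_000, 1_000, 0.15, 3.50))  # old: $0.0050
    print(cost(10_000, 1_000, 0.30, 2.50))  # new: $0.0055 (input-heavy jobs pay more)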

Another minor nit-pick:

> For LLM providers, API calls cost them quadratically in throughput as sequence length increases. However, API providers price their services linearly, meaning that there is a fixed cost to the end consumer for every unit of input or output token they use.

That's mostly true, but not entirely: Gemini 2.5 Pro (but oddly not Gemini 2.5 Flash) charges a higher rate for inputs over 200,000 tokens. Gemini 1.5 also had a higher rate for >128,000 tokens. As a result I treat those as separate models on my pricing table on https://www.llm-prices.com
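
That tiering is easy to mishandle if you assume flat rates. A sketch of the scheme as I understand it (the rates here are illustrative placeholders, and whether the premium applies to the whole request or only the marginal tokens is worth verifying against the current price sheet):

    # Tiered input pricing: requests over the threshold bill at a higher rate.
    # Rates are placeholders, not official Gemini prices.
    def input_cost(tokens: int, base=1.25, premium=2.50, threshold=200_000) -> float:
        rate = premium if tokens > threshold else base  # whole-request tiering
        return tokens / 1e6 * rate

    print(input_cost(150_000))  # billed at the base rate
    print(input_cost(250_000))  # crosses 200k, billed entirely at the premium rate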

One last one:

> o3 is a completely different class of model. It is at the frontier of intelligence, whereas Flash is meant to be a workhorse. Consequently, there is more room for optimization that isn’t available in Flash’s case, such as more room for pruning, distillation, etc.

OpenAI are on the record that the o3 optimizations were not through model changes such as pruning or distillation. This is backed up by independent benchmarks that find the performance of the new o3 matches the previous one: https://twitter.com/arcprize/status/1932836756791177316

sethkim · 2 months ago
Both great points, but they more or less speak to the same root cause: customer usage patterns are becoming more of a driver for pricing than underlying technology improvements. If that's the case, we've likely hit a "soft" floor on pricing for now. Do you not see it this way?
sethkim commented on Making 2.5 Flash and 2.5 Pro GA, and introducing Gemini 2.5 Flash-Lite   blog.google/products/gemi... · Posted by u/meetpateltech
sethkim · 2 months ago
I run a batch inference/LLM data processing service and we do a lot of work around cost and performance profiling of (open-weight) models.

One odd disconnect that still exists in LLM pricing is that providers charge linearly with respect to token consumption, but their compute costs grow quadratically with sequence length.

At this point, since a lot of models have converged on the same architecture, inference algorithms, and hardware, the prices we see are likely set by historical, statistical analysis of the shape of customer requests. In other words, I'm not surprised to see prices increase as providers gather more data about real-world usage patterns.
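
A toy cost model makes the mismatch visible (generic transformer arithmetic, not any provider's actual numbers): the attention term grows with the square of sequence length, while billed tokens grow linearly.

    # Toy attention cost: QK^T and attn*V are each ~seq_len^2 * d_model per layer.
    def attention_flops(seq_len: int, d_model: int = 4096, n_layers: int = 32) -> float:
        return n_layers * 2 * seq_len**2 * d_model

    for n in (8_000, 16_000, 32_000):
        print(f"{n:>6} tokens -> {attention_flops(n):.2e} attention FLOPs")
    # Each doubling of sequence length doubles the tokens billed,
    # but roughly quadruples the attention compute the provider pays for.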

sethkim commented on Ask HN: Who is hiring? (June 2025)    · Posted by u/whoishiring
sethkim · 3 months ago
Sutro.sh (fka Skysight) | Infrastructure/LLMs & Research Engineering | SF Bay Area | Full-time

We are building batch inference infrastructure and a great user and developer experience around it. We believe LLMs have not yet been meaningfully unlocked as data processing tools - we're changing that.

Our work involves interesting distributed systems and LLM research problems, newly-imagined user experiences, and a meaningful focus on mission and values.

Open Roles:

Infrastructure/LLM Engineer — https://jobs.skysight.inc/Member-of-Technical-Staff-Infrastr...

Research Engineer - https://jobs.skysight.inc/Member-of-Technical-Staff-Research...

If you're interested in applying, please send an email to jobs@sutro.sh with a resume/LinkedIn Profile. For extra priority, please include [HN] in the subject line.

sethkim commented on Ask HN: Who is hiring? (May 2025)    · Posted by u/whoishiring
sethkim · 4 months ago
Skysight | Infrastructure/LLMs & Research Engineering | SF Bay Area | Full-time

We are building large-scale batch inference infrastructure and a great user and developer experience around it. We believe LLMs have not yet been meaningfully unlocked as data processing tools - we're changing that.

Our work involves interesting distributed systems and LLM research problems, newly-imagined user experiences, and a meaningful focus on mission and values.

Open Roles:

Infrastructure/LLM Engineer — https://jobs.skysight.inc/Member-of-Technical-Staff-Infrastr...

Research Engineer - https://jobs.skysight.inc/Member-of-Technical-Staff-Research...

If you're interested in applying, please send an email to jobs@skysight.inc with a resume/LinkedIn Profile. For extra priority, please include [HN] in the subject line.

sethkim commented on Gemini 2.5 Flash   developers.googleblog.com... · Posted by u/meetpateltech
statements · 4 months ago
Absolutely agree. Granted, it is task-dependent. But for classification and attribute extraction, I've been using 2.0 Flash heavily across massive datasets. It would not even be viable cost-wise with other models.
sethkim · 4 months ago
How "huge" are these datasets? Did you build your own tooling to accomplish this?

u/sethkim

Karma: 398 · Cake day: January 23, 2021

About: Founder of Sutro (https://sutro.sh/)