The real news is that non-thinking output is now 4x more expensive, which they of course carefully avoid mentioning in the blog, only comparing the thinking prices.
How cute they are with their phrasing:
> $2.50 / 1M output tokens (*down from $3.50 output)
Which should be "up from $0.60 (non-thinking)/down from $3.50 (thinking)"
> While we strive to maintain consistent pricing between preview and stable releases to minimize disruption, this is a specific adjustment reflecting Flash's exceptional value, still offering the best cost-per-intelligence available.
Not too long ago Google was a bit of a joke in AI and their offerings were uncompetitive. For a while a lot of their preview/beta models had a price of 0.00. They were literally giving it away for free to try to get people to consider their offerings when building solutions.
As they've become legitimately competitive they have moved towards the pricing of their competitors.
I figured they were undercutting on price a lot, because at first launch Gemini's pricing made no sense: it was far cheaper than the competition (like, a lot cheaper).
It might be a bit confusing, but there's no "only thinking flash" - it's a single model, and you can turn off thinking if you set thinking budget to 0 in the API request. Previously 2.5 Flash Preview was much cheaper with the thinking budget set to 0, now the price is the same. Of course, with thinking enabled the model will still use far more output tokens than the non-thinking mode.
Apparently, you can make a request to 2.5 flash to not use thinking, but it will still sometimes do it anyways, this has been an issue for months, and hasn't been fixed by model updates: https://github.com/google-gemini/cookbook/issues/722
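For reference, the budget is set per request. Here is a minimal sketch of the request body; the field names (`generationConfig.thinkingConfig.thinkingBudget`) are an assumption based on the public REST docs, so verify against the current API reference before relying on them:

```python
# Sketch of a Gemini API request body with thinking disabled (budget = 0).
# Field names are assumed from the public REST docs; verify before use.
import json

payload = {
    "contents": [{"parts": [{"text": "Summarize: ..."}]}],
    "generationConfig": {
        # 0 requests no thinking tokens; per the thread, the model may
        # occasionally think anyway (see the linked cookbook issue).
        "thinkingConfig": {"thinkingBudget": 0},
    },
}

body = json.dumps(payload)
```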
At one point, when they made Gemini Pro free on AI Studio, Gemini was the model of choice for many people, I believe.
Somehow it's gotten worse since then, and I'm back to using Claude for serious work.
Gemini is like that guy who keeps talking but has no idea what he's actually talking about.
I still use Gemini for brainstorming, though I take its suggestions with several grains of salt. It's also useful for generating prompts that I can then refine and use with Claude.
I use only the APIs directly with Aider (so no experience with AI Studio).
My feeling with Claude is that it still performs well with weak prompts; the "taste" is maybe a little better when the direction is somewhat unknown to the prompter.
When the direction is known, I see Gemini 2.5 Pro (with thinking) ahead of Claude, with code that doesn't break. With o4-mini and o3 I see more "smart" thinking (as if there is a little bit of brain inside these models) at the expense of producing unstable code (Gemini produces more stable code).
I see problems with Claude when complexity increases and I would put it behind Gemini and o3 in my personal ranking.
So far I had no reason to go back to Claude since o3-mini was released.
I just spent $35 for Opus to solve a problem with a hardware side-project (I'm turning an old rotary phone into a meeting handset so I can quit meetings by hanging up, if you must know). It didn't solve the problem, it churned and churned and spent a ton of money.
I was much more satisfied with o3 and Aider, I haven't tried them on this specific problem but I did quite a bit of work on the same project with them last night. I think I'm being a bit unfair, because what Claude got stuck on seems to be a hard problem, but I don't like how they'll happily consume all my money trying the same things over and over, and never say "yeah I give up".
Using all of the popular coding models pretty extensively over the past year, I've been having great success with Gemini 2.5 Pro as far as getting working code the first time, instruction following around architectural decisions, and staying on-task. I use Aider and write mostly Python, JS, and shell scripts. I've spent hundreds of dollars on the Claude API over time but have switched almost entirely to Gemini. The API itself is also much more reliable.
My only complaint about 2.5 Pro is around the inane comments it leaves in the code (// Deleted varName here).
Your feelings of a little brain in there, and of stable code, are unfounded. All these models collapse pretty fast: if not due to the context limit, then due to their inability to interpret problems.
An LLM is just statistical regression with a plethora of engineering tricks, mostly NLP, to produce an illusion.
I don't mean it's useless. I mean comparing these ever-evolving models is like comparing escort staff in NYC vs. those in L.A.: hard to reach any conclusion. We are getting fooled.
On the price increase, it seems Google was aggressively looking for adoption, Gemini was for a short range of time the best value for money of all the LLMs out there. Adoption likely surged, scaling needs be astronomical, costing Google billions to keep up. The price adjustment could've been expected before they announced it.
Yeah, I had similar experiences. At first it felt like it solved complex problems really well, but then I realized I was having trouble steering it for simple things. It was also very verbose.
Overall though my primary concern is the UX, and Claude Code is the UX of choice for me currently.
Same experience here. I even built a Gem with an elaborate prompt instructing it how to be concise, but it still gives annoyingly long-winded responses and frequently expands the scope of its answer far beyond the prompt.
I feel like this is part of the AI playbook now. Launch a really strong, capable model (expensive price inference) and once users think it’s SOTA, neuter it so the cost is cheaper and most users won’t notice.
The same happened with GPT-3.5. It was so good early on and got worse as OpenAI began to cut costs. I feel like when GPT-4.1 was cloaked as Optimus on Openrouter, it was really good, but once it launched, it also got worse.
I have no inside information, but it feels like they quantized it. I've seen patterns that I usually only see in quantized models, like getting stuck repeating a single character indefinitely.
They should just roll back to the preview versions. Those were so much more even keeled and actually did some useful pushback instead of this cheerleader-on-steroids version they GA'd.
Yes I was very surprised after the whole "scandal" around ChatGPT becoming too sycophantic that there was this massive change in tone from the last preview model (05-06) to the 06-05/GA model. The tone is really off-putting, I really liked how the preview versions felt like intelligent conversation partners and recognize what you're saying about useful pushback - it was my favorite set of models (the few preview iterations before this one) and I'm sad to see them disappearing.
Many people on the Google AI Developer forums have also noted either bugs or just performance regression in the final model.
I find Gemini terrible for coding now. I gave it my code blocks and told it what to change, and it added tonnes and tonnes of needless extra code plus endless comments. It turned tight code into a papyrus.
ChatGPT is better but tends to be too agreeable, never trying to disagree with what you say even if it's stupid so you end up shooting yourself in the foot.
I used to be able to use Gemini Pro free in Cline. Now the API limits are so low that you immediately get messages about needing to top up your wallet, and API queries just don't go through. I'm back to using DeepSeek R1 free in Cline (though even that eventually stops after a few hours, and you have to wait until the next day for it to work again). It's starting to look like I need to set up a local LLM for coding, which means it's time to seriously upgrade my PC (well, it's been about 10 years, so it was getting to be time anyway).
By the time you break even on whatever you spend on a decent LLM-capable build, your hardware will be too far behind to run whatever is best locally by then. It feels cheaper, but with the pace of things, unless you are churning an insane number of tokens, it probably doesn't make sense. Never mind that local models running on 24 or 48 GB are maybe around Flash-Lite in ability while being slower than SOTA models.
Local models are mostly for hobby and privacy, not really efficiency.
Same for me. I've been using Gemini 2.5 Pro for the past week or so because people said Gemini is the best for coding! That's not at all my experience: on top of being slow and flaky, the responses are kind of bad. Claude Sonnet 4 is much better IMO.
They nerfed Pro 2.5 significantly in the last few months. Early this year, I had genuinely insightful conversations with Gemini 2.5 Pro. Now they are mostly frustrating.
I also have a personal conspiracy theory, i.e., that once a user exceeds a certain use threshold of 2.5 Pro in the Google Gemini app, they start serving a quantized version. Of course, I have no proof, but it certainly feels that way.
There was a significant nerf of Gemini 2.5 Pro (the 03-25 checkpoint) a little while ago, so much so that I detected it without knowing there was even a new release.
Totally convinced they quantized the model quietly and improved on the coding benchmark to hide that fact.
I’m frankly quite tired of LLM providers changing the model I’m paying for access to behind the scenes, often without informing me, and in Gemini’s case on the API too—at least last time I checked they updated the 3-25 checkpoint to the May update.
One of the early updates improved agentic coding scores while lowering other general benchmark scores, which may have impacted those kind of conversations.
I am very impressed with Gemini and stopped using OpenAI. Sometimes, I ping all three major models on OpenRouter but 90% is on Gemini now. Compare that to 90% ChatGPT last year.
Also me. I still pay for OpenAI: I use GPT-4 for Excel work, and it's super fast and able to do the Excel-related tasks, like combining files, that come up often for projects I work on.
I don't like the thinking time, but for coding, journaling, and other stuff I've often been impressed with Gemini Pro 2.5 out of the box.
Possibly I could do much more prompt fine-tuning to nudge openai/anthropic in the direction I want, but with the same prompts Gemini often gives me answers/structure/tone I like much better.
Example: I had Claude 3.7 generating embedded images and captions along with responses. The same prompt into Gemini gave much more varied and flavorful pictures.
Love to see it, this takes Flash Lite from "don't bother" territory for writing code to potentially useful. (Besides being inexpensive, Flash Lite is fast -- almost always sub-second, to as low as 200ms. Median around 400ms IME.)
Brokk (https://brokk.ai/) currently uses Flash 2.0 (non-Lite) for Quick Edits, we'll evaluate 2.5 Lite now.
ETA: I don't have a use case for a thinking model that is dumber than Flash 2.5, since thinking negates the big speed advantage of small models. Curious what other people use that for.
Curious to hear what folks are doing with Gemini outside of the coding space and why you chose it. Are you building your app so you can swap the underlying GenAI easily? Do you "load balance" your usage across other providers for redundancy or cost savings? What would happen if there was ever some kind of spot market for LLMs?
In my experience, Gemini 2.5 Pro really shines in some non-coding use cases such as translation and summarization via Canvas. The gigantic context window and large usage limits help in this regard.
I also believe Gemini is much better than ChatGPT in generating deep research reports. Google has an edge in web search and it shows. Gemini’s reports draw on a vast number of sources, thus tend to be more accurate. In general, I even prefer its writing style, and I like the possibility of exporting reports to Google Docs.
One thing that I don’t like about Gemini is its UI, which is miles behind the competition. Custom instructions, projects, temporary chats… these things either have no equivalent in Gemini or are underdeveloped.
If you're a power user, you should probably be using Gemini through AI studio rather than the "basic user" version. That allows you to set system instructions, temperature, structured output, etc. There's also NotebookLM. Google seems to be trying to make a bunch of side projects based on Gemini and seeing what sticks, and the generic gemini app/webchat is just one of those.
I tried swapping for my project which involves having the LLM summarize and critique medical research and didn’t have great results. The prompt I found works best with the main LLM I use fucks up the intended format when fed to other LLMs. Thinking about refining prompts for each different llm but haven’t gotten there.
My favorite personal use of Gemini right now is basically as a book club. Of course it's not as good as my real one, but I often can't get them to read the books I want, and Gemini is always ready when I want to explore themes. It's often more profound than the book club too, and seems a bit less likely to tunnel-vision. Before LLMs I found exploring book themes pretty tedious; often I would have to wait a while to find someone who had read the book, but now I can get into it as soon as I'm done reading.
I can throw a pile of NDAs at it and it neatly pulls out relevant stuff from them within a few seconds. The huge context window and excellent needle in a haystack performance is great for this kind of task.
The NIAH performance is a misleading indicator for performance on the tasks people really want the long context for. It's great as a smoke/regression test. If you're bad on NIAH, you're not gonna do well on the more holistic evals.
But the long-context eval they used (MRCR) is limited. It's multi-needle, so that's a start, but it's not evaluating long-range dependency resolution or topic modeling, which are the things you actually care about beyond raw retrieval for downstream tasks. Better than nothing, but not great for just throwing a pile of text at it and hoping for the best. Particularly for out-of-distribution token sequences.
I do give google some credit though, they didn't try to hide how poorly they did on that eval. But there's a reason you don't see them adding RULER, HELMET, or LongProc to this. The performance is abysmal after ~32k.
EDIT: I still love using 2.5 Pro for a ton of different tasks. I just tend to have all my custom agents compress the context aggressively for any long context or long horizon tasks.
Gemini Flash 2.0 is an absolute workhorse of a model at extremely low cost. It's obviously not going to measure up to frontier models in terms of intelligence but the combination of low cost, extreme speed, and highly reliable structured output generation make it really pleasant to develop with. I'll probably test against 2.5 Lite for an upgrade here.
I’ve found the 2.5 pro to be pretty insane at math. Having a lot of fun doing math that normally I wouldn’t be able to touch. I’ve always been good at math, but it’s one of those things where you have to do a LOT of learning to do anything. Being able to breeze through topics I don’t know with the help of AI and a good CAS + sympy and Mathematica verification lets me chew on problems I have no right to be even thinking about considering my mathematical background. (I did minor in math.. but the kinds of problems I’m chewing on are things people spend lifetimes working on. That I can even poke at the edges of them thanks to Gemini is really neat.)
I use it extensively for https://lexikon.ai - in particular one part of what Lexikon does involves processing large amounts of images, and the way Google charges for vision is vastly cheaper compared to the big alternatives (OpenAI, Anthropic)
I use Gemini 2.5 Flash (non thinking) as a thought partner. It helps me organize my thoughts or maybe even give some new input I didn't think of before.
I really like to use it also for self reflection where I just input my thoughts and maybe concerns and just see what it has to say.
It basically made a university physics exam for me. It almost one-shot it as well. Just uploaded some exams from previous years together with a latex template and told it to make me a similar one. Worked great. Also made it do the solutions.
It's very good at automatically segmenting and recognizing handwritten and badly scanned text. I use it to make spreadsheets out of handwritten petitions.
Web scraping - creating semi-structured data from a wide variety of horrific HTML soups.
Absolutely do swap out models sometimes, but Gemini 2.0 Flash is the right price/performance mix for me right now. Will test Gemini 2.5 Flash-Lite tomorrow though.
I've yet to run out of free image gen credits with Gemini, so I use it for any low-effort image gen like when my kids want to play with it or for testing prompts before committing my o4 tokens for better quality results.
Yes, we implemented a separate service internally that interfaces with an LLM and so the callers can be agnostic as to what provider or model is being used. Haven't needed to load balance between models though.
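To make that pattern concrete, here's a minimal sketch of such a provider-agnostic layer. All the names here (`register`, `complete`, the stub backend) are hypothetical, not any real SDK:

```python
# Hypothetical sketch of a provider-agnostic LLM service layer: callers go
# through complete() and never touch a vendor SDK directly, so swapping
# providers becomes a registry/config change rather than a code change.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Completion:
    text: str
    model: str

# provider name -> completion function (which would wrap the real SDK call)
_providers: Dict[str, Callable[[str, str], Completion]] = {}

def register(name: str, fn: Callable[[str, str], Completion]) -> None:
    _providers[name] = fn

def complete(prompt: str, provider: str = "gemini",
             model: str = "gemini-2.0-flash") -> Completion:
    return _providers[provider](prompt, model)

# Stub backend standing in for a real API call:
register("gemini", lambda prompt, model: Completion(f"echo: {prompt}", model))
```

A load-balancing or failover layer would slot into `complete()` without touching any callers.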
I had a great-ish result from 2.5 Pro the other day. I asked it to date an old photograph, and it successfully read the partial headline on a newspaper in the background (which I had initially thought was too small/blurry to make out) and identified the 1980s event it was reporting. Impressive. But then it confidently hallucinated the date of the article (which I later verified by checking in an archive).
I run a batch inference/LLM data processing service and we do a lot of work around cost and performance profiling of (open-weight) models.
One odd disconnect that still exists in LLM pricing is the fact that providers charge linearly with respect to token consumption, but costs are actually quadratic with an increase in sequence length.
At this point, since a lot of models have converged around the same model architecture, inference algorithms, and hardware - the chosen costs are likely due to a historical, statistical analysis of the shape of customer requests. In other words, I'm not surprised to see costs increase as providers gather more data about real-world user consumption patterns.
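The mismatch is easy to see with a toy count: billing is linear in the number of tokens, while self-attention compute (per layer, ignoring constants and caching) grows with the square of the sequence length.

```python
# Toy illustration: linear billing vs. quadratic attention compute.
def billed_units(n_tokens: int, price_per_token: float = 1.0) -> float:
    return n_tokens * price_per_token      # revenue scales O(n)

def attention_units(n_tokens: int) -> int:
    return n_tokens * n_tokens             # attention scales O(n^2)

# Doubling the sequence doubles revenue but quadruples attention compute:
assert billed_units(2000) / billed_units(1000) == 2.0
assert attention_units(2000) / attention_units(1000) == 4.0
```

Real serving cost is more nuanced (prefill vs. decode, KV caching, batching), so treat this as directional only.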
For 2.5 Flash Preview: https://web.archive.org/web/20250616024644/https://ai.google...
$0.15/million input text / image / video
$1.00/million audio
Output: $0.60/million non-thinking, $3.50/million thinking
The new prices for Gemini 2.5 Flash ditch the difference between thinking and non-thinking and are now: https://ai.google.dev/gemini-api/docs/pricing
$0.30/million input text / image / video (2x more)
$1.00/million audio (same)
$2.50/million output - significantly more than the old non-thinking price, less than the old thinking price.
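The multipliers check out directly (all prices in $/1M tokens, taken from the two pricing pages cited here):

```python
# Old 2.5 Flash Preview prices vs. new 2.5 Flash prices ($/1M tokens),
# as listed in this comment.
old_input, old_out_nonthink, old_out_think = 0.15, 0.60, 3.50
new_input, new_output = 0.30, 2.50

assert round(new_input / old_input, 2) == 2.0             # input: 2x up
assert round(new_output / old_out_nonthink, 2) == 4.17    # non-thinking output: ~4x up
assert new_output < old_out_think                         # thinking output: down from $3.50
```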
https://developers.googleblog.com/en/gemini-2-5-thinking-mod...
They are obviously excited about their price increase
Finally we're starting to see the real price.
And Gemini 2.0 Flash was $0.10/$0.40.
Now 2.0 -> 2.5 is another hefty price increase.
But why is there only thinking flash now?
Claude seems like the best compromise.
Just my two kopecks.
Claude 4 Sonnet with thinking just thinks for a bit and then does it.
All other AIs seem to give me errors when working with large bodies of code.
I find Flash and Flash Lite are more consistent than others as well as being really fast and cheap.
I could swap to other providers fairly easily, but don't intend to at this point. I don't operate at a large scale.
I give it the HTML, it finds the appropriate selector for the property item, and then I use an HTML-to-RSS tool to publish the feed.
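For the curious, the publishing half of that pipeline is simple enough to sketch with the standard library; the function name and feed fields here are hypothetical, not the actual tool the parent comment uses:

```python
# Hypothetical sketch of the "HTML to RSS" step: given (title, link) pairs
# already extracted with the model-suggested selector, emit minimal RSS 2.0.
import xml.etree.ElementTree as ET

def items_to_rss(title: str, link: str, items: list[tuple[str, str]]) -> str:
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    for item_title, item_link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
    return ET.tostring(rss, encoding="unicode")

feed = items_to_rss("Listings", "https://example.com",
                    [("Flat A", "https://example.com/a")])
```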
In general, when I need "cheap and fast" I choose Gemini.
Gemini 2.5 Flash Lite (Audio Input) - $0.5/million tokens
Gemini 2.0 Flash Lite (Audio Input) - $0.075/million tokens
Wonder what led to such a big bump in audio token processing.