Gemini-2.5-pro-preview-06-05

Impressive seeing Google notch up another ~25 ELO on lmarena, on top of the previous #1, which was also Gemini!

That being said, I'm starting to doubt the leaderboards as an accurate representation of model ability. While I do think Gemini is a good model, having used both Gemini and Claude Opus 4 extensively in the last couple of weeks I think Opus is in another league entirely. I've been dealing with a number of gnarly TypeScript issues, and after a bit Gemini would spin in circles or actually (I've never seen this before!) give up and say it can't do it. Opus solved the same problems with no sweat. I know that that's a fairly isolated anecdote and not necessarily fully indicative of overall performance, but my experience with Gemini is that it would really want to kludge on code in order to make things work, where I found Opus would tend to find cleaner approaches to the problem. Additionally, Opus just seemed to have a greater imagination? Or perhaps it has been tailored to work better in agentic scenarios? I saw it do things like dump the DOM and inspect it for issues after a particular interaction by writing a one-off playwright script, which I found particularly remarkable. My experience with Gemini is that it tries to solve bugs by reading the code really really hard, which is naturally more limited.

Again, I think Gemini is a great model, I'm very impressed with what Google has put out, and until 4.0 came out I would have said it was the best.

joshmlewis · 9 months ago

o3 is still my favorite over even Opus 4 in most cases. I've spent hundreds of dollars on AI code gen tools in the last month alone and my ranking is:

1. o3 - it's just really damn good at nuance, getting to the core of the goal, and writing the closest thing to quality production level code. The only negative is it's cutoff window and cost, especially with it's love of tools. That's not usually a big deal for the Rails projects I work on but sometimes it is.

2. Opus 4 via Claude Code - also really good and is my daily driver because o3 is so expensive. I will often have Opus 4 come up with the plan and first pass and then let o3 critique and make a list of feedback to make it really good.

3. Gemini 2.5 Pro - haven't tested this latest release but this was my prior #2 before last week. Now I'd say it's tied or slightly better than Sonnet 4. Depends on the situation.

4. Sonnet 4 via claude Code - it's not bad but needs a lot of coaching and oversight to produce really good code. It will definitely produce a lot of code if you just let it go do it's thing but it's not the quality, concise, and thoughtful code without more specific prompting and revisions.

I'm also extremely picky and a bit OCD with code quality and organization in projects down to little details with naming, reusability, etc. I accept only 33% of suggested code based on my Cursor stats from last month. I will often revert and go back to refine the prompt before accepting and going down a less than optimal path.

spaceman_2020 · 9 months ago

I use o3 a lot for basic research and analysis. I also find the deep research tool really useful for even basic shopping research

Like just today, it made a list of toys for my toddler that fit her developmental stage and play style. Would have taken me 1-2 hrs of browsing multiple websites otherwise

throwaway314155 · 9 months ago

It's interesting you say that because o3, while being a considerable improvement over OpenAI's other models, still doesn't match the performance of Opus 4 and Gemini 2.5 Pro by a long shot for me.

However, o3 resides in the ChatGPT app, which is still superior to the other chat apps in many ways, particularly the internet search implementation works very well.

vendiddy · 9 months ago

I find o3 to be the clearest thinker as well.

If I'm working on a complex problem and want to go back and forth on software architecture, I like having o3 research prior art and have a back and forth on trade-offs.

If o3 was faster and cheaper I'd use it a lot more.

I'm curious what your workflows are !

monkpit · 9 months ago

Have you used Cline with opus+sonnet? Do you have opinions about Claude code vs cline+api? Curious to hear your thoughts!

jonplackett · 9 months ago

How do you find o3 vs o4-mini?

pqdbr · 9 months ago

How do you choose which model to use with Claude Code?

VeejayRampay · 9 months ago

we need to stop it with the anecdotal evidence presented by one random dude

batrat · 9 months ago

What I like about Gemini is the search function that is very very good compared to others. I was blown away when I asked to compose me an email for a company that was sending spam to our domain. It literally searched and found not only the abuse email of the hosting company but all the info about the domain and the host(mx servers, ip owners, datacenters, etc.). Also if you want to convert a research paper into a podcast it did it instantly for me and it's fun to listen.

baq · 9 months ago

I’ve been giving the same tasks to claude 4 and gemini 2.5 this week and gemini provided correct solutions and claude didn’t. These weren’t hard tasks either, they were e.g. comparing sql queries before/after rewrite - Gemini found legitimate issues where claude said all is ok.

Dead Comment

Szpadel · 9 months ago

in my experience this highly depends case by case. For some cases Gemini crushed my problem, but in next one stuck and couldn't figure out simple bug.

the same with o3 and sonnet (I didn't tested 4.0 much yet to have opinion)

I feel thet we need better parallel evaluation support. where u could evaluate all top models and decide with one provided best solution

varunneal · 9 months ago

Have you tried o3 on those problems? I've found o3 to be much more impressive than Opus 4 for all of my use cases.

johnfn · 9 months ago

To be honest, I haven't, because the "This model is extremely expensive" popup on Cursor makes me a bit anxious - but given the accolades here I'll have to give it a shot.

cwbriscoe · 9 months ago

I haven't tried all of the favorites, just what is available with Jetbrains AI, but I can say that Gemini 2.5 is very good with Go. I guess that makes sense in a way.

zamadatix · 9 months ago

I think the only way to be particularly impressed with new leading models lately is to hold the opinion all of the benchmarks are inaccurate and/or irrelevant and it's vibes/anecdotes where the model is really light years ahead. Otherwise you look at the numbers on e.g. lmarena and see it's claiming a ~16% preference win rate for gpt-3.5-turbo from November of 2023 over this new world-leading model from Google.

johnfn · 9 months ago

Not sure I follow - Gemini has ELO 1470, GPT3.5-turbo is 1206, which is an 86% win rate. https://chatgpt.com/share/6841f69d-b2ec-800c-9f8c-3e802ebbc0...

Workaccount2 · 9 months ago

People can ask whatever they want on LMarena, so a question like "List some good snacks to bring to work" might elicit a win for a old/tiny/deprecated model simply because it lists the snack the user liked more.

Deleted Comment

tempusalaria · 9 months ago

I agree I find claude easily the best model, at least for programming which is the only thing I use LLMs for

lispisok · 9 months ago

>That being said, I'm starting to doubt the leaderboards as an accurate representation of model ability

Goodhart's law applies here just like everywhere else. Much more so given how much money these companies are dumping into making these models.

Alifatisk · 9 months ago

> after a bit Gemini would spin in circles or actually (I've never seen this before!) give up and say it can't do it

No way, is there any way to see the dialog or recreate this scenario!?

johnfn · 9 months ago

The chat was in Cursor, so I don't know a way to provide a public link, but here is the last paragraph that it output before I (and it) gave up. I honestly could have re-prompted it from scratch and maybe it would have gotten it, but at this point I was pretty sure that even if it did, it was going to make a total mess of things. Note that it was iterating on a test failure and had spun through multiple attempts at this point:

> Given the persistence of the error despite multiple attempts to refine the type definitions, I'm unable to fix this specific TypeScript error without a more profound change to the type structure or potentially a workaround that might compromise type safety or accuracy elsewhere. The current type definitions are already quite complex.

The two prior paragraphs, in case you're curious:

> I suspect the issue might be a fundamental limitation or bug in how TypeScript is resolving these highly recursive and conditional types when they are deeply nested. The type system might be "giving up" or defaulting to a less specific type ({ __raw: T }) prematurely.

> Since the runtime logic seems to be correctly hydrating the nested objects (as the builder.build method recursively calls hydrateHelper), the problem is confined to the type system's ability to represent this.

I found, as you can see in the first of the prior two paragraphs, that Gemini often wanted to claim that the issue was on TypeScript's side for some of these more complex issues. As proven by Opus, this simply wasn't the case.

AmazingTurtle · 9 months ago

for bulk data extraction on personal real life data I experienced that even gpt-4o-mini outperforms latest gemini models in both quality and cost. i would use reasoning models but their json schema response is different from the non-reasonig models, as in: they can not deal with union types for optional fields when using strict schemas... anyway.

idk whats the hype about gemini, it's really not that good imho

tymonPartyLate · 9 months ago

I just realized that Opus 4 is the first model that produced "beautiful" code for me. Code that is simple, easy to read, not polluted with comments, no unnecessary crap, just pretty, clean and functional. I had my first "wow" moment with it in a while. That being said it occasionally does something absolutely stupid. Like completely dumb. And when I ask it "why did you do this stupid thing", it replies "oh yeah, you're right, this is super wrong, here is an actual working, smart solution" (proceeds to create brilliant code)

I do not understand how those machines work.

diggan · 9 months ago

> Code that is simple, easy to read, not polluted with comments, no unnecessary crap, just pretty, clean and functional

I get that with most of the better models I've tried, although I'd probably personally favor OpenAI's models overall. I think a good system prompt is probably the best way there, rather than relying in some "innate" "clean code" behavior of specific models. This is a snippet of what I use today for coding guidelines: https://gist.github.com/victorb/1fe62fe7b80a64fc5b446f82d313...

> That being said it occasionally does something absolutely stupid. Like completely dumb

That's a bit tougher, but you have to carefully read through exactly what you said, and try to figure out what might have led it down the wrong path, or what you could have said in the first place for it avoid that. Try to work it into your system prompt, then slowly build up your system prompt so every one-shot gets closer and closer to being perfect on every first try.

Tostino · 9 months ago

My issue is that every time i've attempted to use Opus 4 to solve any problem, I would burn through my usage cap within a few min and not have solved the problem yet because it misunderstood things about the context and I didn't get the prompt quite right yet.

With Sonnet, at least I don't run out of usage before I actually get it to understand my problem scope.

simon1ltd · 9 months ago

I've also experienced the same, except it produced the same stupid code all over again. I usually use one model (doesn't matter which) until it starts chasing it's tail, then I feed it to a different model to have it fix the mistakes by the first model.

tomr75 · 9 months ago

how does it have access to DOM? are you using it with cursor/browser MCP?

I'd start to worry about OpenAI, from a valuation standpoint. The company has some serious competition now and is arguably no longer the leader.

its going to be interesting to see how easily they can raise more money. Their valuation is already in the $300B range. How much larger can it get given their relatively paltry revenue at the moment and increasingly rising costs for hardware and electricity.

If the next generation of llms needs new data sources, then Facebook and Google seem well positioned there, OpenAI on the other hand seems like its going to lose such race for proprietary data sets as unlike those other two, they don't have another business that generates such data.

When they were the leader in both research and in user facing applications they certainly deserved their lofty valuation.

What is new money coming into OpenAI getting now?

At even a $300B valuation a typical wall street analysts would want to value them at 2x sales which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.

Or at an extremely lofty P/E ratio of say 100 that would be $3B in annual earnings, that analysts would have to expect you to double each year for the next 10ish years looking out, ala AMZN in the 2000s, to justify this valuation.

They seem to have boxed themselves into a corner where it will be painful to go public, assuming they can ever figure out the nonprofit/profit issue their company has.

Congrats to Google here, they have done great work and look like they'll be one of the biggest winners of the AI race.

jstummbillig · 9 months ago

There is some serious confusion about the strength of OpenAIs position.

"chatgpt" is a verb. People have no idea what claude or gemini are, and they will not be interested in it, unless something absolutely fantastic happens. Being a little better will do absolutely nothing to convince normal people to change product (the little moat that ChatGPT has simply by virtue of chat history is probably enough from a convenience standpoint, add memories and no super obvious path to export/import either and you are done here).

All that OpenAI would have to do, to easily be worth their evaluation eventually, is to optimize and not become offensively bad to their, what, 500 million active users. And, if we assume the current paradigm that everyone is working with is here to stay, why would they? Instead of leading (as they have done so far, for the most part) they can at any point simply do what others have resorted to successfully and copy with a slight delay. People won't care.

aeyes · 9 months ago

Google has a text input box on google.com, as soon as this gives similar responses there is no need for the average user to use ChatGPT anymore.

I already see lots of normal people share screenshots of the AI Overview responses.

candiddevmike · 9 months ago

ChatGPT is going to be Kleenex'd. They wasted their first mover advantage. Replace ChatGPT's interface with any other LLM and most users won't be able to tell the difference.

ComplexSystems · 9 months ago

"People have no idea what claude or gemini are"

One well-placed ad campaign could easily change all that. Doesn't hurt that Google can bundle Gemini into Android.

chollida1 · 9 months ago

Chatgpt has no moat of any kind though.

I can switch tomorrow to use gemini or grok or any other llm, and I have, with zero switching cost.

That means one stumble on the next foundational model and their market share drops in half in like 2 months.

Now the same is true for the other llms as well.

potatolicious · 9 months ago

I think this pretty substantially overstates ChatGPT's stickiness. Just because something is widely (if not universally) known doesn't mean it's universally used, or that such usage is sticky.

For example, I had occasion to chat with a relative who's still in high school recently, and was curious what the situation was in their classrooms re: AI.

tl;dr: LLM use is basically universal, but ChatGPT is not the favored tool. The favored tools are LLMs/apps specifically marketed as study/homework aids.

It seems like the market is fine with seeking specific LLMs for specific kinds of tasks, as opposed to some omni-LLM one-stop-shop that does everything. The market has already and rapidly moved beyond from ChatGPT.

Not to mention I am willing to bet that Gemini has radically more usage than OpenAI's models simply by virtue of being plugged into Google Search. There are distribution effects, I just don't think OpenAI has the strongest position!

I think OpenAI has some first-mover advantage, I just don't think it's anywhere near as durable (nor as large) as you're making it out to be.

lizardking · 9 months ago

Xerox was a verb too

PantaloonFlames · 9 months ago

> At even a $300B valuation a typical wall street analysts would want to value them at 2x sales which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.

Oops I think you may have flipped the numerator and the denominator there, if I’m understanding you. Valuation of 300B , if 2x sales, would imply 150B sales.

Probably your point still stands.

jadbox · 9 months ago

Currently I only find OpenAI to be clearly better for image generation: like illustrations, comics, or photo editing for home project ideation.

bufferoverflow · 9 months ago

And open-source Flux.1 Kontext is already better than it.

energy123 · 9 months ago

Even if they're winning the AI race, their search business is still going to be cannibalized, and it's unclear if they'll be able to extract any economic rents from AI thanks to market competition. Of course they have no choice but to compete, but they probably would have preferred the pre-AI status quo of unquestioned monopoly and eyeballs on ads.

xmprt · 9 months ago

Historically, every company has failed by not adapting to new technologies and trying to protect their core business (eg. Kodak, Blockbuster, Blackberry, Intel, etc). I applaud Google for going against their instincts and actively trying to disrupt their cash cow in order to gain an advantage in the AI race.

orionsbelt · 9 months ago

I think it’s too early to say they are not the leader given they have o3 pro and GPT 5 coming out within the next month or two. Only if those are not impressive would I start to consider that they have lost their edge.

Although it does feel likely that at minimum, they are neck and neck with Google and others.

ed_mercer · 9 months ago

Source for gpt 5 coming out soon?

sebzim4500 · 9 months ago

>At even a $300B valuation a typical wall street analysts would want to value them at 2x sales which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.

What? Apple has a revenue of 400B and a market cap of 3T

Rudybega · 9 months ago

I think OpenAI has projected 12.7B in revenue this year and 29.4B in 2026.

Edit: I am dumb, ignore the second half of my post.

eamag · 9 months ago

isn't P/E about earnings, not revenue?

ketzo · 9 months ago

OpenAI has already forecast $12B in revenue by the end of this year.

I agree that Google is well-positioned, but the mindshare/product advantage OpenAI has gives them a stupendous amount of leeway

Workaccount2 · 9 months ago

The hurdle for OpenAI is going to be on the profit side. Google has their own hardware acceleration and their own data centers. OpenAI has to pay a monopolist for hardware acceleration and beholden to another tech giant for data centers. Never mind that Google can customize it's hardware specifically for it's models.

The only way for OpenAI to really get ahead on solid ground is to discover some sort of absolute game changer (new architecture, new algorithm) and manage to keep it bottled away.

chollida1 · 9 months ago

Agreed, its the doubling of that each year for the next 4-5 years that I see as being difficult.

VeejayRampay · 9 months ago

the leeway comes from the grotesque fanboyism the company benefits from

they haven't been number one for quite some time and still people can't stop presenting them as the leaders

raincole · 9 months ago

Even Google doesn't have $600B revenue. Sorry, it sounds like numbers pulled from someone's rear.

qeternity · 9 months ago

Lmfao where did you get this from? Microsoft has less than half of that revenue, and is valued > 10x than OpenAI.

Revenue is not the metric by which these companies are valued...

Yizahi · 9 months ago

The difference between Microsoft and OAI is that Microsoft can spend a lump sum of money on Excel and a fraction of that on its support and then sell it infinitely with almost no additional costs. MS can add a million of new Excel users tomorrow and that would be almost pure profit. (I'm very simplifying)

OAI on the other hand must spend a lot of additional money for every single new user, both free and paid. Adding million new OAI users tomorrow would mean gigantic negative red hole in the profits (adding to the existing negative). OAI has no or almost no benefits of scale, unlike other industries.

I have no knowledge about corporate valuations, but I strongly suspect that OAI valuation need to include this issue.

Oleksa_dr · 9 months ago

I was tempted by the ratings and immediately paid for a subscription to Gemini 2.5. Half an hour later, I canceled the subscription and got a refund. This is the laziest and stupidest LLM. What he had to do, he told me to do on my own. And also when analyzing simple short documents, he pulled up some completely strange documents from the Internet not related to the topic. Even local LLMs (3B) were not so stupid and lazy.

sigmoid10 · 9 months ago

Exactly my experience as well. I don't get why people here now seem to blindly take every new gamed benchmark as some harbinger of OpenAI's imminent downfall. Google is still way behind in day-to day personal and professional use for me.

#Set up the SFTTrainer print("Setting up SFTTrainer...") trainer = SFTTrainer( model=model, train_dataset=train_dataset, args=sft_config, processing_class=tokenizer, # DO NOT CHANGE. THIS IS NOW THE CORRECT PROPERTY NAME ) print("SFTTrainer ready.")