InkCanon · 8 months ago
The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60%, etc. performance on IMO questions. This strongly suggests AI models simply remember past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.
AIPedant · 8 months ago
Yes, here's the link: https://arxiv.org/abs/2503.21934v1

Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting; they are explicitly pedagogical. For anything requiring insight, it's either:

1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)

2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)

I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...

I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!

larodi · 8 months ago
This is a paper by INSAIT researchers - a very young institute which hired most of its PhD staff only in the last 2 years, basically onboarding anyone who wanted to be part of it. They were waving their BG-GPT on national TV in the country as a major breakthrough, while it was basically a Mistral fine-tuned model that was never released to the public, nor was its training set.

Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on LinkedIn by these guys, literally celebrating every PhD they get, which is at minimum very weird. I'd take anything they release with a grain of salt, if not outright caution.

apercu · 8 months ago
In my experience LLMs can't get basic Western music theory right; there's no way I would use an LLM for something harder than that.
JohnKemeny · 8 months ago
Discussed here: https://news.ycombinator.com/item?id=43540985 (Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad, 4 points, 2 comments).
otabdeveloper4 · 8 months ago
Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)

This effectively makes LLMs useless for education. (It also sours the next generation on LLMs in general; these things are extremely lame to the proverbial "kids these days".)

billforsternz · 8 months ago
I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages:

1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres]

2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres]

3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000]

4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls, final answer [wrong, you should have been reducing the number!]

So 1), 2) and 3) were out by 1, 1 and 3 orders of magnitude respectively (the errors partially cancelled out), and 4) was nonsensical.

This little experiment made me skeptical about the state of the art of AI. I have seen much AI output which is extraordinary; it's funny how one serious failure can impact my point of view so dramatically.
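
For the record, here's a quick Python sanity check of the corrected numbers. The ~300 cubic metre cabin and ~40 cc ball are the rough figures above; the 30% seat allowance and 64% random packing density are assumptions I'm adding, not anything the AI said:

    # Rough Fermi estimate with the corrected figures
    cabin_volume_m3 = 4 * 2 * 40      # ~320 m^3 cabin (roughly 300, not 3000)
    ball_volume_m3 = 40e-6            # ~40 cc per golf ball (not 0.000004 m^3)
    usable_fraction = 0.7             # assume seats etc. eat ~30% of the volume
    packing_density = 0.64            # random close packing of spheres

    balls = cabin_volume_m3 * usable_fraction * packing_density / ball_volume_m3
    print(f"{balls:,.0f} golf balls") # ~3,600,000

So the right ballpark is a few million, and every adjustment in step 4) should push the raw ~8,000,000 figure down, not up.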

aezart · 8 months ago
> I have seen much AI output which is extraordinary; it's funny how one serious failure can impact my point of view so dramatically.

I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.

aoeusnth1 · 8 months ago
Gemini 2.5 Pro nails each of these calculations. I don’t agree with Google’s decision to use a weak model in its search queries, but you can’t say progress on LLMs is bullshit as evidenced by a weak model no one thinks is close to SOTA.
Sunspark · 8 months ago
It's fascinating to me when you tell one that you'd like to see translated passages from authors who never wrote or translated the item in question, especially if they passed away before the piece was written.

The AI will create something for you and tell you it was theirs.

CivBase · 8 months ago
I just asked my company-approved AI chatbot the same question.

It got the golf ball volume right (0.00004068 cubic meters), but it still overestimated the cabin volume at 1000 cubic meters.

Its final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?

It didn't acknowledge other items in the cabin (like seats) reducing its volume, but it did at least acknowledge inefficiencies in packing spherical objects and suggested the actual number would be "somewhat lower", though it did not offer an estimate.

When I pressed it for an estimate, it used a packing density of 74% and gave an estimate of 18,191,766 golf balls. That's one more than the calculation should have produced, but arguably insignificant in context.

Next I asked it to account for fixtures in the cabin such as seats. It estimated a 30% reduction in cabin volume and redid the calculations with a cabin volume of 700 cubic meters. These calculations were much less accurate. It told me 700 ÷ 0.00004068 = 17,201,480 (off by ~6k). And it told me 17,201,480 × 0.74 was 12,728,096 (off by ~1k).

I told it the calculations were wrong and to try again, but it produced the same numbers. Then I gave it the correct answer for 700 ÷ 0.00004068. It told me I was correct and redid the last calculation correctly using the value I provided.

Of all the things for an AI chatbot which can supposedly "reason" to fail at, I didn't expect it to be basic arithmetic. The one I used was closer, but it was still off by a lot at times, despite the calculations being simple multiplication and division. Even if it might not matter in the context of filling an airplane cabin with golf balls, it does not inspire trust for more serious questions.
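
For what it's worth, the raw arithmetic in that exchange is trivial to verify; these are the exact results the chatbot should have produced with the figures quoted above:

    # The divisions/multiplications from the conversation
    ball = 0.00004068                      # m^3 per golf ball
    print(round(1000 / ball))              # 24,582,104
    print(round(1000 / ball * 0.74))       # 18,190,757
    print(round(700 / ball))               # 17,207,473
    print(round(17201480 * 0.74))          # 12,729,095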

senordevnyc · 8 months ago
Just tried with o3-mini-high and it came up with something pretty reasonable: https://chatgpt.com/share/67f35ae9-5ce4-800c-ba39-6288cb4685...
greenmartian · 8 months ago
Weird thing is, in Google AI Studio all their models—from the state-of-the-art Gemini 2.5 Pro to the lightweight Gemma 2—gave a roughly correct answer. Most even recognised the packing efficiency of spheres.

But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.

swader999 · 8 months ago
It'll get it right next time because they'll hoover up the parent post.
raxxorraxor · 8 months ago
This reminds me of the Google quick answers we had for a time in search. It was quite funny if you lived outside the US, because it very often got the units or numbers wrong due to different decimal delimiters.

No wonder Trump isn't afraid to put tariffs on Canada. Who could take a 3.8-square-mile country seriously?

throwawaymaths · 8 months ago
I've seen humans make exactly these sorts of mistakes?
tim333 · 8 months ago
A lot of humans are similarly good at some stuff and bad at other things.

Looking up the math ability of the average American, this is given as an example of the median level (from https://www.wyliecomm.com/2021/11/whats-the-latest-u-s-numer...):

>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.

Which is ok but easier than golf balls in a 747 and hugely easier than USAMO.
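
For comparison, that median-level item boils down to a one-liner; the 120 miles over 2 days below is just a made-up example trip:

    # 35 cents a mile plus $40 a day, for a hypothetical 120-mile, 2-day trip
    miles, days = 120, 2
    print(miles * 0.35 + days * 40)   # 122.0 dollars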

Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.

simonw · 8 months ago
I had to look up these acronyms:

- USAMO - United States of America Mathematical Olympiad

- IMO - International Mathematical Olympiad

- ICPC - International Collegiate Programming Contest

Relevant paper: https://arxiv.org/abs/2503.21934 - "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" submitted 27th March 2025.

sanxiyn · 8 months ago
Nope, no LLMs reported 50~60% performance on IMO, and SOTA LLMs scoring 5% on USAMO is expected. For 50~60% performance on IMO, you are thinking of AlphaProof, but AlphaProof is not an LLM. We don't have the full paper yet, but clearly AlphaProof is a system built on top of an LLM with lots of bells and whistles, just like AlphaFold is.
InkCanon · 8 months ago
o1 reportedly got 83% on IMO, and 89th percentile on Codeforces.

https://openai.com/index/learning-to-reason-with-llms/

The paper tested it on o1-pro as well. Correct me if I'm getting some versioning mixed up here.

bglazer · 8 months ago
Yeah I’m a computational biology researcher. I’m working on a novel machine learning approach to inferring cellular behavior. I’m currently stumped why my algorithm won’t converge.

So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.

I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get a low-tier ML blogspam author.

**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help, my email is bglazer1@gmail.com

I promise it's a fun mathematical puzzle, and the biology is pretty wild too.

root_axis · 8 months ago
It's funny, I have the same problem all the time with typical day to day programming roadblocks that these models are supposed to excel at. I'm talking about any type of bug or unexpected behavior that requires even 5 minutes of deeper analysis.

Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.

Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.

MoonGhost · 8 months ago
Some time ago I was working on an image processing model using a GAN architecture. One model produces output and tries to fool the second; both are trained together. Simple, but it requires a lot of extra effort to make it work. It's unstable and falls apart (blows up into an unrecoverable state). I found some ways to make it work by adding new loss functions, changing params, changing the models' architectures and sizes, and adjusting some coefficients throughout training to gradually rebalance the loss functions' influence.

The same may work with your problem. If it's unstable, try introducing extra 'brakes' which theoretically are not required, maybe even incorrect ones - whatever that means in your domain. Another thing to check is the optimizer; try several, and check the default parameters. I've heard Adam's defaults lead to instability later in training.
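
To make the 'brakes' idea concrete, here's a minimal sketch, assuming PyTorch; the tiny models, learning rate, and smoothing target are placeholders I'm picking for illustration, not anything from your setup. It shows two common stabilizers: one-sided label smoothing on the discriminator and non-default Adam betas.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Tiny placeholder models just to make the sketch self-contained;
    # swap in your real generator/discriminator.
    G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
    D = nn.Sequential(nn.Linear(8, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1))

    # Non-default Adam betas: beta1=0.5 is a common GAN stabilizer; the
    # default (0.9, 0.999) is one of the "defaults cause instability" suspects.
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

    def d_step(real, z):
        fake = G(z).detach()
        real_logits, fake_logits = D(real), D(fake)
        # One-sided label smoothing (target 0.9 instead of 1.0 for real samples):
        # an extra "brake" that isn't theoretically required but damps blow-ups.
        loss = (F.binary_cross_entropy_with_logits(real_logits, torch.full_like(real_logits, 0.9)) +
                F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
        opt_d.zero_grad()
        loss.backward()
        opt_d.step()
        return loss.item()

    def g_step(z):
        fake_logits = D(G(z))
        loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
        opt_g.zero_grad()
        loss.backward()
        opt_g.step()
        return loss.item()

    # One toy iteration
    real, z = torch.randn(4, 8), torch.randn(4, 16)
    print(d_step(real, z), g_step(z))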

PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.

PPS: the fact that they can do known tasks with minor variations is already a huge time saver.

torginus · 8 months ago
When I was an undergrad EE student a decade ago, I had to tangle a lot with complex maths in my Signals & Systems, and Electricity and Magnetism classes. Stuff like Fourier transforms, hairy integrals, partial differential equations etc.

Math packages of the time like Mathematica and MATLAB helped me immensely: once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations or integrate tricky functions, even though AI was nowhere to be found back then.

I feel like ChatGPT is doing something similar when doing maths with its chain-of-thought method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.
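
As a side note, the free-software descendants of that workflow still hold up; here's a tiny sketch using SymPy (my substitution - the tools back then were Mathematica/MATLAB) doing the kind of symbolic work described:

    from sympy import symbols, solve, integrate, exp, oo, fourier_transform

    x, y, t, k = symbols('x y t k', real=True)

    # Solve a small system of equations symbolically
    print(solve([x + y - 3, x - y - 1], [x, y]))   # {x: 2, y: 1}

    # A "tricky" definite integral: the Gaussian integral
    print(integrate(exp(-x**2), (x, -oo, oo)))     # sqrt(pi)

    # And a Fourier transform, the bread and butter of Signals & Systems
    print(fourier_transform(exp(-t**2), t, k))     # sqrt(pi)*exp(-pi**2*k**2)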

Deleted Comment

airstrike · 8 months ago
I tend to prefer Claude over all things ChatGPT, so maybe give the latest model a try -- although in some ways I feel like 3.7 is a step down from the prior 3.5 model.
melagonster · 8 months ago
I doubt this is because its explanations are better. I tried asking questions from Calculus I, and ChatGPT just repeated content from textbooks. It is useful, but people should keep in mind where the limitations are.
kristianp · 8 months ago
Have you tried Gemini 2.5? It's one of the best reasoning models, and it's available free in Google AI Studio.

Deleted Comment

sigmoid10 · 8 months ago
>I'm incredibly surprised no one mentions this

If you don't see anyone mentioning what you wrote, that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead, this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer), and their finding is that the "thoughts" of reasoning models are not sufficiently human-understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps final outputs nonetheless.

Workaccount2 · 8 months ago
On top of that, what the model prints out in the CoT window is not necessarily what the model is actually thinking. Anthropic just showed this in their paper from last week, where they got models to cheat at a question by "accidentally" slipping them the answer, and the CoT had no mention of the answer being slipped to them.
usaar333 · 8 months ago
And then within a week, Gemini 2.5 was tested and got 25%. The point is that AI is getting stronger.

And this only suggested LLMs aren't trained well to write formal math proofs, which is true.

selcuka · 8 months ago
> within a week

How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.

MoonGhost · 8 months ago
They are trained on a data mix with a minimal fraction of math; that's how it has been from the beginning. But we can rebalance it by adding quality generated content; the catch is that such content will cost millions of dollars to generate. Distillation at a new level looks like the logical next step.
KolibriFly · 8 months ago
Yeah, this is one of those red flags that keeps getting hand-waved away, but really shouldn't be.
yahoozoo · 8 months ago
LLMs are “next token” predictors. Yes, I realize that there’s a bit more to it and it’s not always just the “next” token, but at a very high level that’s what they are. So why are we so surprised when it turns out they can’t actually “do” math? Clearly the high benchmark scores are a result of the training sets being polluted with the answers.
utopcell · 8 months ago
This is simply using LLMs directly. Google has demonstrated that this is not the way to go when it comes to solving math problems. AlphaProof, which used AlphaZero code, got a silver medal in last year's IMO. It also didn't use any human proofs(!), only theorem statements in Lean, without their corresponding proofs [1].

[1] https://www.youtube.com/watch?v=zzXyPGEtseI

geuis · 8 months ago
Query: Could you explain the terminology to people who don't follow this that closely?
BlanketLogic · 8 months ago
Not the OP but

USAMO : USA Math Olympiad. Referenced here: https://arxiv.org/pdf/2503.21934v1

IMO : International Math Olympiad

SOTA : State of the Art

OP is probably referring to this paper: https://arxiv.org/pdf/2503.21934v1. The paper explains how rigorous testing revealed abysmal performance from LLMs (results that are at odds with how they are hyped).

cma · 8 months ago
OpenAI described how they removed it for GPT-4 in its release paper: only exact string matches. So all the discussion of bar exam questions from memory on test-taking forums etc., which wouldn't exactly match, made it in.
AstroBen · 8 months ago
This seems fairly obvious at this point. If they were actually reasoning at all they'd be capable (even if not good) of complex games like chess

Instead they're barely able to eek out wins against a bot that plays completely random moves: https://maxim-saplin.github.io/llm_chess/
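
The linked eval boils down to something like this sketch (assuming the python-chess package; ask_llm_for_move is a hypothetical stand-in for the actual prompt/parse loop):

    import random
    import chess

    def ask_llm_for_move(board: chess.Board) -> str:
        """Hypothetical stand-in: prompt the model with the position (e.g. FEN
        or the move list so far) and expect a UCI move string back."""
        raise NotImplementedError

    def play_vs_random_bot(llm_plays_white: bool = True) -> str:
        board = chess.Board()
        while not board.is_game_over():
            llm_to_move = (board.turn == chess.WHITE) == llm_plays_white
            if llm_to_move:
                try:
                    move = chess.Move.from_uci(ask_llm_for_move(board))
                except ValueError:
                    return "LLM forfeits: unparseable move"
                if move not in board.legal_moves:
                    return "LLM forfeits: illegal move"   # a common failure mode
            else:
                move = random.choice(list(board.legal_moves))  # the "random bot"
            board.push(move)
        return board.result()   # "1-0", "0-1" or "1/2-1/2"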

gilleain · 8 months ago
Just in case it wasn't a typo, and you happen not to know ... that word is probably "eke" - meaning gaining (increasing, enlarging from wiktionary) - rather than "eek" which is what mice do :)
kylebyte · 8 months ago
Every day I am more convinced that LLM hype is the equivalent of someone seeing a stage magician levitate a table across the stage and assuming this means hovercars must only be a few years away.
famouswaffles · 8 months ago
LLMs are capable of playing chess, and GPT-3.5 Turbo Instruct does so quite well (for a human) at around 1800 Elo. Does this mean they can truly reason now?

https://github.com/adamkarvonen/chess_gpt_eval

raylad · 8 months ago
Eek! You mean eke.
SergeAx · 8 months ago
Because of the vast number of problems that get reused, removing that data from training sets will just make models worse. Why would anyone do it?
anonzzzies · 8 months ago
That type of news might make investors worried or scared.
hyperbovine · 8 months ago
Is that really so surprising given what we know about how these models actually work? I feel vindicated on behalf of myself and all the other commenters who have been mercilessly downvoted over the past three years for pointing out the obvious fact that next token prediction != reasoning.
aoeusnth1 · 8 months ago
Gemini 2.5 Pro scores 25%.

It’s just a much harder math benchmark, which will fall by the end of next year just like all the others. You won’t be vindicated.

colonial · 8 months ago
Less than 5%. OpenAI's o1 burned through over $100 in tokens during the test as well!
TrackerFF · 8 months ago
What would the average human score be?

I.e. if you randomly sampled N humans to take those tests.

sanxiyn · 8 months ago
The average human score on USAMO (let alone IMO) is zero, of course. Source: I won medals at Korean Mathematical Olympiad.
iambateman · 8 months ago
The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no”, or at least not as good as a programmer would hope.

When you ask it a question, it tends to say yes.

So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.

The real challenge is that LLMs fundamentally want to seem agreeable, and that’s not improving. So even if the model gets an extra 5/100 math problems right, it feels about the same in a series of prompts which are more complicated than just a ChatGPT scenario.

I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.

I totally agree that the core benchmarks that matter should be ones which evaluate a model in agentic scenario, not just on the basis of individual responses.

bluefirebrand · 8 months ago
> The real challenge is that LLMs fundamentally want to seem agreeable, and that’s not improving

LLMs fundamentally do not want to seem anything

But the companies that are training them and making models available for professional use sure want them to seem agreeable

JohnKemeny · 8 months ago
> LLMs fundamentally do not want to seem anything

You're right that LLMs don't actually want anything. That said, in reinforcement learning, it's common to describe models as wanting things because they're trained to maximize rewards. It’s just a standard way of talking, not a claim about real agency.

mrweasel · 8 months ago
That sounds reasonable to me, but those companies forget that there are different types of agreeable. There's the LLM approach, similar to the coworker who will answer all your questions about .NET but not stop you from coding yourself into a corner, and then there's the "Let's sit down and review what it actually is that you're doing, because you're asking a fairly large number of disjoint questions right now" approach.

I've stopped trying to use LLMs for anything, due to political convictions and because I don't feel like they are particularly useful for my line of work. Where I have tried to use various models in the past is for software development, and the common mistake I see the LLMs make is that they can't pick up on mistakes in my line of thinking, or won't point them out. My problems are often down to design errors or thinking about a problem in the wrong way. The LLMs will never once tell me that what I'm trying to do is an indication of a wrong/bad design. There are ways to be agreeable and still point out problems with previously made decisions.

Terr_ · 8 months ago
Yeah, and they probably have more "agreeable" stuff in their corpus simply because very disagreeable stuff tends to be either much shorter or a prelude to a flamewar.

Deleted Comment

boesboes · 8 months ago
This rings true. What I notice is that the longer I let Claude work on some code, for instance, the more bullshit it invents. I can usually delete about 50-60% of the code & tests it comes up with.

And when you ask it to 'just write a test', 50/50 it will try to run it, fail on some trivial issues, delete 90% of your test code and start to loop deeper and deeper into the rabbit hole of its own hallucinations.

Or maybe I just suck at prompting hehe

namaria · 8 months ago
> Or maybe I just suck at prompting hehe

Every time someone argues for the utility of LLMs in software development by saying you need to be better at prompting, or add more rules for the LLM on the repository, they are making an argument against using NLP in software development.

The whole point of code is that it is a way to be very specific and exact and to exercise control over the computer behavior. The entire value proposition of using an LLM is that it is easier because you don't need to be so specific and exact. If then you say you need to be more specific and exact with the prompting, you are slowly getting at the fact that using NLP for coding is a bad idea.

tristor · 8 months ago
It's, in many ways, the same problem as having too many "yes men" on a team at work or in your middle management layer. You end up getting wishy-washy, half-assed "yes" answers to questions that everyone would have been better off having answered with "no" or "yes, with caveats", with predictable results.

In fact, this might be why so many business executives are enamored with LLMs/GenAI: it's a yes-man they don't even have to employ, and because they're not domain experts, as per usual, they can't tell that they're being fed a line of bullshit.

signa11 · 8 months ago
> The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no”, or at least not as good as a programmer would hope.

umm, it seems to me that it is this (tfa):

     But I would nevertheless like to submit, based off of internal
     benchmarks, and my own and colleagues' perceptions using these models,
     that whatever gains these companies are reporting to the public, they
     are not reflective of economic usefulness or generality.
and then a couple of lines down from the above statement, we have this:

     So maybe there's no mystery: The AI lab companies are lying, and when
     they improve benchmark results it's because they have seen the answers
     before and are writing them down.

signa11 · 8 months ago
[this went way outside the edit window and hence a separate comment] imho, the state of varying experiences with LLMs can be aptly summed up in this poem by Mr. Longfellow:

     There was a little girl,
        Who had a little curl,
     Right in the middle of her forehead.
        When she was good,
        She was very good indeed,
     But when she was bad she was horrid.

malingo · 8 months ago
"when you ask him anything, he never answers 'no' -- he just yesses you to death and then he takes your dough"
lukev · 8 months ago
This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven.

I have my own opinions, but I can't really say that they're not also based on anecdotes and personal decision-making heuristics.

But some of us are going to end up right and some of us are going to end up wrong and I'm really curious what features signal an ability to make "better choices" w/r/t AI, even if we don't know (or can't prove) what "better" is yet.

freehorse · 8 months ago
There is nothing wrong with sharing anecdotal experiences. Reading through anecdotal experiences here can help one understand how relatable one's own experiences are. Moreover, if I have experience X, it could help to know whether it is because of me doing something wrong that others have figured out.

Furthermore, as we are talking about the actual impact of LLMs, which is the point of the article, a bunch of anecdotal experiences may be more valuable than a bunch of benchmarks for figuring it out. Also, apart from the right/wrong dichotomy, people use LLMs with different goals and contexts. It may not mean that some people are doing something wrong if they do not see the same impact as others. Every time a web developer says that they do not understand how others can be so skeptical of LLMs, concludes with certainty that they must be doing something wrong, and moves on to explain how to actually use LLMs properly, I chuckle.

otterley · 8 months ago
Indeed, there’s nothing at all wrong with sharing anecdotes. The problem is when people make broad assumptions and conclusions based solely on personal experience, which unfortunately happens all too often. Doing so is wired into our brains, though, and we have to work very consciously to intercept our survival instincts.
FiniteIntegral · 8 months ago
It's not surprising that responses are anecdotal. An easy way to communicate a generic sentiment often requires being brief.

A majority of what makes a "better AI" can be condensed to how effective the gradient-descent algorithms are at reaching the local maxima we want them to reach. Until a generative model shows actual progress at "making decisions" it will forever be seen as a glorified linear algebra solver. Generative machine learning is all about giving a pleasing answer to the end user, not about creating something that is on the level of human decision making.

code_biologist · 8 months ago
At risk of being annoying, answers that feel like high-quality human decision making are extremely pleasing and desirable. In the same way, image generators aren't generating six-fingered hands because they think it's more pleasing; they're doing it because they're trying to please and aren't good enough yet.

I'm just most baffled by the "flashes of brilliance" combined with utter stupidity. I remember having a run with early GPT 4 (gpt-4-0314) where it did refactoring work that amazed me. In the past few days I asked a bunch of AIs about similar characters between a popular gacha mobile game and a popular TV show. OpenAI's models were terrible and hallucinated aggressively (4, 4o, 4.5, o3-mini, o3-mini-high), with the exception of o1. DeepSeek R1 only mildly hallucinated and gave bad answers. Gemini 2.5 was the only flagship model that did not hallucinate and gave some decent answers.

I probably should have used some type of grounding, but I honestly assumed the stuff I was asking about should have been in their training datasets.

lherron · 8 months ago
Agreed! And with all the gaming of the evals going on, I think we're going to be stuck with anecdotal evidence for some time to come.

I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.

I am hopeful the coming waves of vertical integration/guardrails/grounding applications will move us away from having to hop between models every few weeks.

InkCanon · 8 months ago
Frankly, the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claims of silver/gold-level performance on IMOs. And ARC-AGI: one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC-AGI questions and train on those.
dsign · 8 months ago
You want to block subjectivity? Write some formulas.

There are three questions to consider:

a) Have we, without any reasonable doubt, hit a wall for AI development? Emphasis on "reasonable doubt". There is no reasonable doubt that the Earth is roughly spherical. That level of certainty.

b) Depending on your answer for (a), the next question to consider is if we the humans have motivations to continue developing AI.

c) And then the last question: will AI continue improving?

If taken as boolean values, (a), (b) and (c) have a truth table with eight rows, the most interesting one being false, true, true: "(not a) and b => c". Note the implication sign, "=>". Give some values to (a) and (b), and you get a value for (c).

There are more variables you can add to your formula, but I'll abstain from giving any silly examples. I, however, think that the row (false, true, false) implied by many commentators is just fear and denial. Fear is justified, but denial doesn't help.
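
For what it's worth, those eight rows are quick to enumerate (a throwaway sketch of the formula above, nothing more):

    from itertools import product

    # Truth table for ((not a) and b) => c, where p => q is (not p) or q
    for a, b, c in product([False, True], repeat=3):
        print(a, b, c, (not ((not a) and b)) or c)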

namaria · 8 months ago
If you're gonna formulate this conversation as a satisfiability problem, you should be aware that this is an NP-complete problem (and actually working on that problem is the source of the insight that there is such a thing as NP-completeness).
lukev · 8 months ago
Invalid expression: value of type "probability distribution" cannot be cast to type "boolean".
pdimitar · 8 months ago
A lot of people judge by the lack of their desired outcome. Calling that fear and denial is disingenuous and unfair.

Deleted Comment

KolibriFly · 8 months ago
Totally agree... this space is still so new and unpredictable that everyone is operating off vibes, gut instinct, and whatever personal anecdotes they've collected. We're all sort of fumbling around in the dark, trying to reverse-engineer the flashlight
throwanem · 8 months ago
> I'm really curious what features signal an ability to make "better choices" w/r/t AI

So am I. If you promise you'll tell me after you time travel to the future and find out, I'll promise you the same in return.

nialv7 · 8 months ago
Good observation but also somewhat trivial. We are not omniscient gods, ultimately all our opinions and decisions will have to be based on our own limited experiences.
aunty_helen · 8 months ago
That’s a good point, the comments section is very anecdotal. Do you have any data to say if this is a common occurrence or specific to this topic?
ramesh31 · 8 months ago
>"This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven."

People having vastly different opinions on AI simply comes down to token usage. If you are using millions of tokens on a regular basis, you completely understand the revolutionary point we are at. If you are just chatting back and forth a bit with something here and there, you'll never see it.

lukev · 8 months ago
So this is interesting because it's anecdotal (I presume you're a high-token user who believes it's revolutionary), but it's actually a measurable, falsifiable hypothesis in principle.

I'd love to see a survey from a major LLM API provider that correlated LLM spend (and/or tokens) with optimism for future transformativity. Correlation with a view of "current utility" would be a tautology, obviously.

I actually have the opposite intuition from you: I suspect the people using the most tokens are using it for very well-defined tasks that it's good at _now_ (entity extraction, classification, etc) and have an uncorrelated position on future potential. Full disclosure, I'm in that camp.

antonvs · 8 months ago
It's a tool and like all tools, it's sensitive to how you use it, and it's better for some purposes than others.

Someone who lacks experience, skill, training, or even the ability to evaluate results may try to use a tool and blame the tool when it doesn't give good results.

That said, the hype around LLMs certainly overstates their capabilities.

wg0 · 8 months ago
Unlike many, I find the author's complaints spot on.

Once all the AI batch startups have sold subscriptions to their cohort and there's no further market growth - because businesses outside don't want to roll the dice on a probabilistic model that doesn't have an understanding of pretty much anything and is rather a clever imitation machine for the content it has seen - the AI bubble will burst, with more startups starting to pack up by the end of 2026 or 2027 at the latest.

consumer451 · 8 months ago
I would go even further than TFA. In my personal experience using Windsurf daily, Sonnet 3.5 is still my preferred model. 3.7 makes many more changes that I did not ask for, often breaking things. This is an issue with many models, but it got worse with 3.7.
cootsnuck · 8 months ago
Yea, I've experienced this too with 3.7. Not always, though. It has been helpful for me more often than not. But yeah, 3.5 "felt" better to me.

Part of me thinks this is because I expected less of 3.5 and therefore interacted with it differently.

It's funny because it's unlikely that everyone interacts with these models in the same way. And that's pretty much guaranteed to give different results.

Would be interesting to see some methods come out for individuals to measure their own personal success rate / productivity / whatever with these different models, and then have a way for people to compare them with each other, so we can figure out who is working well with these models and who isn't, and figure out why there's a difference.

behnamoh · 8 months ago
3.7 is like a wild horse. You really must ground it with clear instructions. It sucks that it doesn't automatically know that, but it's tameable.
Zetaphor · 8 months ago
I finally gave up on 3.7 in Cursor after three rounds of it completely ignoring what I asked it for so that it could instead solve an irrelevant linter error. The error in no way affected functionality.

Despite me rejecting the changes and explicitly telling it to ignore the linter it kept insisting on only trying to solve for that

jonahx · 8 months ago
My personal experience is right in line with the author's.

Also:

> I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.

I immediately thought: That's because in most situations this is the purpose of language, at least partially, and LLMs are trained on language.

ants_everywhere · 8 months ago
There are real and obvious improvements in the past few model updates and I'm not sure what the disconnect there is.

Maybe it's that I do have PhD level questions to ask them, and they've gotten much better at it.

But I suspect that these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model and it works less well with later models.

Or perhaps they have a time-sensitive task and are not able to take advantage of the thinking of modern LLMs, which have a slow thinking-based feedback loop. Or maybe their code base is getting more complicated, so it's harder to reason about.

Or perhaps they're giving the LLMs a poorly defined task that older models made assumptions about, but newer models recognize the ambiguity of and so find the space of solutions harder to navigate.

Since this is ultimately from a company doing AI scanning for security, I would think the latter plays a role to some extent. Security is insanely hard and the more you know about it the harder it is. Also adversaries are bound to be using AI and are increasing in sophistication, which would cause lower efficacy (although you could tease this effect out by trying older models with the newer threats).

pclmulqdq · 8 months ago
In the last year, things like "you are an expert on..." have gotten much less effective in my private tests, while actually describing the problem precisely has gotten better in terms of producing results.

In other words, all the sort of lazy prompt engineering hacks are becoming less effective. Domain expertise is becoming more effective.

ants_everywhere · 8 months ago
yes that would explain the effect I think. I'll try that out this week.
DebtDeflation · 8 months ago
The issue is the scale of the improvements. GPT-3.5 Instruct was an utterly massive leap over everything that came before it. GPT-4 was a very big jump over that. Everything since has seemed incremental. Yes we got multimodal, but that was part of GPT-4; they just didn't release it initially, and up until very recently it mostly handed off to another model. Yes we got reasoning models, but people had been using CoT for a while, so it was just a matter of time before RL got used to train it into models. Witness the continual delays of GPT-5 and the back and forth on whether it will be its own model or just a router model that picks the best existing model to hand a prompt off to.
stafferxrr · 8 months ago
It is like how I am not impressed by the models when it comes to progress with chemistry knowledge.

Why? Because I know so little about chemistry myself that I wouldn't even know what to start asking the model as to be impressed by the answer.

For the model to be useful at all, I would have to learn basic chemistry myself.

Many, though, I suspect are in this same situation with all subjects. They really don't know much of anything and are therefore unimpressed by the models' responses, in the same way I am not impressed with the chemistry responses.

Dead Comment

HarHarVeryFunny · 8 months ago
The disconnect between improved benchmark results and lack of improvement on real world tasks doesn't have to imply cheating - it's just a reflection of the nature of LLMs, which at the end of the day are just prediction systems - these are language models, not cognitive architectures built for generality.

Of course, if you train an LLM heavily on narrow benchmark domains then its prediction performance will improve on those domains, but why would you expect that to improve performance in unrelated areas?

If you trained yourself extensively on advanced math, would you expect that to improve your programming ability? If not, then why would you expect it to improve the programming ability of a far less sophisticated "intelligence" (prediction engine) such as a language model?! If you trained yourself on LeetCode programming, would you expect that to help with hardening corporate production systems?!

InkCanon · 8 months ago
That's fair. But look up the recent experiment running SOTA models on the then just-released USAMO 2025 questions. The highest score was 5%, while supposedly SOTA last year was at IMO silver level. There could be some methodological differences - i.e., the USAMO paper required correct proofs and not just numerical answers. But it really strongly suggests that even within limited domains, it's cheating. I'd wager a significant amount that if you tested SOTA models on a new ICPC set of questions, actual performance would be far, far worse than their supposed benchmarks.
usaar333 · 8 months ago
> Highest score was 5%, supposedly SOTA last year was IMO silver level.

No LLM last year got silver. DeepMind had a highly specialized AI system earning that.

throwawayffffas · 8 months ago
In my view as well it's not really cheating, it's just overfitting.

If a model doesn't do well on the benchmarks, it will either be retrained until it does, or you won't hear about it.

KolibriFly · 8 months ago
Your analogy is perfect. Training an LLM on math olympiad problems and then expecting it to secure enterprise software is like teaching someone chess and handing them a wrench
joelthelion · 8 months ago
I used Gemini 2.5 this weekend with aider, and it was frighteningly good.

It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.

jchw · 8 months ago
I think overall quality with Gemini 2.5 is not much better than Gemini 2 in my experience. Gemini 2 was already really good, but just like Claude 3.7, Gemini 2.5 goes some steps forward and some steps backwards. It sometimes generates some really verbose code even when you tell it to be succinct. I am pretty confident that if you evaluate 2.5 for a bit longer you'll come to the same conclusion eventually.
heresie-dabord · 8 months ago
> It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.

Even approximations must be right to be meaningful. If information is wrong, it's rubbish.

Presorting/labelling various data has value. Humans have done the real work there.

What is "leading" us at present are the exaggerated valuations of corporations. You/we are in a bubble, working to justify the bubble.

Until a tool is reliable, it is not installed where people can get hurt. Unless we have revised our concern for people.

mountainriver · 8 months ago
Yep, and with what they are doing in Cursor, the agentic stuff is really game changing.

People who can’t recognize this have their heads in the sand, intentionally.

InkCanon · 8 months ago
People are really fundamentally asking two different questions when they talk about AI "importance": AI's utility and AI's "intelligence". There's a line to be drawn carefully between the two.

1) AI undoubtedly has utility. In many agentic uses, it has very significant utility. There's absolute utility and perceived utility, which is more about user experience. In absolute utility, git is likely the single most game changing piece of software there is; it has likely saved a ten, maybe eleven digit number in engineer hours times salary through how it enables massive teams to work together in very seamless ways. In user experience, AI is amazing because it can generate so much so quickly. But it is very far from an engineer. For example, recently I tried to use Cursor to bootstrap a website in NextJS for me. It produced errors it could not fix, and each rewrite seemed to dig it deeper into its own hole. The reasons were quite obvious. A lot of it had to do with NextJS 15 and the breaking changes it introduces in cookies and auth. It's quite clear that if you have masses of NextJS code, which is disproportionately older versions, but none of it labeled well with versions, it messes up the LLM. Eventually I scrapped what it wrote and did it myself. I don't mean to use this anecdote to say LLMs are useless, but they have pretty clear limitations. They work well on problems with massive data (like front end) that don't require much principled understanding (like understanding how NextJS 15 would break so-and-so's auth). Another example: when I tried to use it to generate flags for a V8 build, it failed horribly and would simply hallucinate flags all the time. This was very likely because (despite the existence of a list of V8 flags online) many flags had very close representations in vector embeddings, and there was close to zero data/detailed examples on their use.

2) On the more theoretical side, the performance of LLMs on benchmarks (claims of being elite IMO solvers, competitive programming solvers) has become incredibly suspicious. When the new USAMO 2025 was released, the highest score was 5%, despite claims a year ago that SOTA was at least at IMO silver level. This is against the backdrop of exponential compute and data being fed in. Combined with apparently diminishing returns, this suggests that the gains from all that are running really thin.

dimitri-vs · 8 months ago
I guess you haven't been on /r/cursor or forum.cursor.com lately?

"game changing" isn't exactly the sentiment there the last couple months.