From GPT-4 to GPT-5: Measuring progress through MedHELM [pdf]

aresant · 2 days ago

Feels like a mixed bag vs regression?

eg - GPT-5 beats GPT-4 on factual recall + reasoning (HeadQA, Medbullets, MedCalc).

But then slips on structured queries (EHRSQL), fairness (RaceBias), evidence QA (PubMedQA).

Hallucination resistance better but only modestly.

Latency seems uneven (maybe more testing?) faster on long tasks, slower on short ones.

GPT-5 feels like cost engineering. The model is incrementally better, but they are optimizing for least amount of compute. I am guessing investors love that.

narrator · 2 days ago

I agree. I have found GPT-5 significantly worse on medical queries. It feels like it skips important details and is much worse than o3, IMHO. I have heard good things about GPT-5 Pro, but that's not cheap.

I wonder if part of the degraded performance is where they think you're going into a dangerous area and they get more and more vague, for example like they demoed on launch day with the fireworks example. It gets very vague when talking about non-abusable prescription drugs for example. I wonder if that sort of nerfing gradient is affecting medical queries.

After seeing some painfully bad results, I'm currently using Grok4 for medical queries with a lot of success.

RestartKernel · 2 days ago

I wonder how that math works out. GPT-5 keeps triggering a thinking flow even for relatively simple queries, so each token must be a magnitude cheaper to make this worth the trade-off in performance.

JimDabell · 2 days ago

I’ve found that it’s super likely to get stuck repeating the exact same incorrect response over and over. It used to happen occasionally with older models, but it happens frequently now.

Things like:

Me: Is this thing you claim documented? Where in the documentation does it say this?

GPT: Here’s a long-winded assertion that what I said before was correct, plus a link to an unofficial source that doesn’t back me up.

Me: That’s not official documentation and it doesn’t say what you claim. Find me the official word on the matter.

GPT: Exact same response, word-for-word.

Me: You are repeating yourself. Do not repeat what you said before. Here’s the official documentation: [link]. Find me the part where it says this. Do not consider any other source.

GPT: Exact same response, word-for-word.

Me: Here are some random words to test if you are listening to me: foo, bar, baz.

GPT: Exact same response, word-for-word.

It’s so repetitive I wonder if it’s an engineering fault, because it’s weird that the model would be so consistent in its responses regardless of the input. Once it gets stuck, it doesn’t matter what I enter, it just keeps saying the same thing over and over.

yieldcrv · 2 days ago

Yeah look at their open source models and how you get such high parameters in such low vram

Its impressive but a regression for now, in direct comparison to just high parameter model

UltraSane · 2 days ago

Since the routing is opaque they can dynamically route queries to cheaper models when demand is high.

woeirua · 2 days ago

Definitely seems like GPT5 is a very incremental improvement. Not what you’d expect if AGI were imminent.

p1esk · 18 hours ago

What would you expect?

fertrevino · 2 days ago

Mixed results indeed. While it leads the benchmark in two question types, it falls short in others which results in the overall slight regression.

xnx · 2 days ago

Have you looked at comparing to Google's foundation models or specialty medical models like MedGemma (https://developers.google.com/health-ai-developer-foundation...)?

fertrevino · 2 days ago

That would be an interesting extension. MedGemma isn't part of the original benchmark either [1]. Since Gemini 2.0 Flash is on 6th place, expectations are for MedGemma to achieve higher than that :)

[1]https://crfm.stanford.edu/helm/medhelm/latest/#/leaderboard

hypoxia · 2 days ago

Did you try it with high reasoning effort?

ares623 · 2 days ago

Sorry, not directed at you specifically. But every time I see questions like this I can’t help but rephrase in my head:

“Did you try running it over and over until you got the results you wanted?”

dcre · 2 days ago

This is not a good analogy because reasoning models are not choosing the best from a set of attempts based on knowledge of the correct answer. It really is more like what it sounds like: “did you think about it longer until you ruled out various doubts and became more confident?” Of course nobody knows quite why directing more computation in this way makes them better, and nobody seems to take the reasoning trace too seriously as a record of what is happening. But it is clear that it works!

SequoiaHope · 2 days ago

What you describe is a person selecting the best results, but if you can get better results one shot with that option enabled, it’s worth testing and reporting results.

Deleted Comment

chairmansteve · 2 days ago

Or...

"Did you try a room full of chimpanzees with typewriters?"

0xDEAFBEAD · 2 days ago

So which of these benchmarks are most relevant for an ordinary user who wants to talk to AI about their health issues?

I'm guessing HeadQA, Medbullets, MedHallu, and perhaps PubMedQA? (Seems to me that "unsupported speculation" could be a good thing for a patient who has yet to receive a diagnosis...)

Maybe in practice it's better to look at RAG benchmarks, since a lot of AI tools will search online for information before giving you an answer anyways? (Memorization of info would matter less in that scenario)

ancorevard · 2 days ago

so since reasoning_effort is not discussed anywhere, I assume you used the default which is "medium"?

energy123 · 2 days ago

Also, were tool calls allowed? The point of reasoning models is to delete the facts so finite capacity goes towards the dense reasoning engine rather than recall, with the facts sitting elsewhere.

username135 · 2 days ago

I wonder what changed with the models that created regression?

teaearlgraycold · 2 days ago

Not sure but with each release it feels like they’re just wiping the dirt around and not actually cleaning.

degamad · 2 days ago

Obligxkcd: https://xkcd.com/1838/

oezi · 2 days ago

There is some speculation that GPT-5 uses a router to decide which expert model to deploy (e.g. to mini vs o/thinking models). So the router might decide that the query can be solved by a cheaper model and this model gives worse results.

Deleted Comment

causality0 · 2 days ago

I've definitely seen some unexpected behavior from gpt5. For example, it will tell me my query is banned and then give me a full answer anyway.

andai · a day ago

Did this use reasoning or not? GPT-5 with Minimal reasoning does roughly the same as 4o on benchmarks.