From Eric Zencey:
There is seduction in apocalyptic thinking. If one lives in the Last Days, one’s actions, one’s very life, take on historical meaning and no small measure of poignance.
https://en.wikipedia.org/wiki/List_of_dates_predicted_for_ap...
- It's not a fair match; these models have more compute and memory than humans
- The contestants weren't really elite; they were just college-level programmers, not the world's best
- This doesn't matter for the real world; competitive programming is very different from regular software engineering
- It's marketing; they're just cranking up the compute to unrealistic levels to gain PR points
- It's brute force, not intelligence
I'm glad--I think LLMs are looking quite promising for medical use cases. I'm just genuinely surprised there hasn't been some big lawsuit yet over one of them giving advice that led to a negative outcome (whether due to hallucinations, the user leaving out key context, or something else).
"On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding."
The actual benchmark improvements are marginal at best - we're talking single-digit percentage gains over o3 on most metrics, which hardly justifies a major version bump. What we're seeing looks more like the plateau of an S-curve than a breakthrough. The pricing is competitive ($1.25/1M input tokens vs Claude's $15), but that's about optimization and economics, not the fundamental leap forward that "GPT-5" implies. Even their "unified system" turns out to be multiple models with a router, essentially admitting that the end-to-end training approach has hit diminishing returns.
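To make the "multiple models with a router" point concrete, here is a minimal sketch of what that kind of setup amounts to; the model names, the keyword heuristic, and the Route structure are illustrative assumptions, not OpenAI's actual routing logic.

    # Hypothetical sketch of a "unified system" built as a router over
    # separate models. Model names and the complexity heuristic are made up
    # for illustration; they are not OpenAI's implementation.
    from dataclasses import dataclass

    @dataclass
    class Route:
        model: str
        reasoning_effort: str

    def route(prompt: str, tools_requested: bool) -> Route:
        """Pick a backend model from a rough guess at prompt difficulty."""
        hard_markers = ("prove", "step by step", "debug", "optimize")
        looks_hard = tools_requested or any(m in prompt.lower() for m in hard_markers)
        if looks_hard:
            return Route(model="gpt-5-thinking", reasoning_effort="high")  # slower, pricier path
        return Route(model="gpt-5-main", reasoning_effort="minimal")       # fast, cheap default

    print(route("Summarize this email", tools_requested=False))
    print(route("Debug this segfault step by step", tools_requested=True))

A router like this is a perfectly sensible engineering choice, but it's orchestration around existing models rather than a single end-to-end trained leap.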
The irony is that while OpenAI maintains their secretive culture (remember when they claimed o1 used tree search instead of RL?), their competitors are catching up or surpassing them. Claude has been consistently better for coding tasks, Gemini 2.5 Pro has more recent training data, and everyone seems to be converging on similar performance levels. This launch feels less like a victory lap and more like OpenAI trying to maintain relevance while the rest of the field has caught up. Looking forward to seeing what Gemini 3.0 brings to the table.
GPT-5 demonstrates continued exponential growth in the length of tasks AI can complete:
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
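For context, METR's metric is the human-time length of tasks a model can complete at roughly 50% reliability, and the linked post reports that this horizon has been doubling roughly every 7 months. A back-of-envelope sketch of what such a doubling time implies; the 60-minute starting horizon below is purely illustrative, not a number from their post.

    # Back-of-envelope extrapolation of METR's task-horizon trend.
    # The ~7-month doubling time is the figure METR reports for the
    # historical trend; the 60-minute starting horizon is an assumption.
    def horizon_minutes(start_minutes: float, months_elapsed: float,
                        doubling_months: float = 7.0) -> float:
        """Exponential growth: the horizon doubles every `doubling_months`."""
        return start_minutes * 2 ** (months_elapsed / doubling_months)

    start = 60.0  # illustrative starting horizon, in minutes
    for months in (0, 7, 14, 28):
        print(f"after {months:2d} months: ~{horizon_minutes(start, months):.0f} minutes")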
Why state this as absolute fact? Seems a bit lacking in epistemic humility.