To be fair, it's with the help of OpenAI.
They did it together, before the official release.
From experience, it's much more engineering work on the integrator's side than on OpenAI's. Basically, they provide you with their new model in advance, but they don't know the specifics of your system, so it's normal that you do most of the work.
Thus I'm particularly impressed by Cerebras: they support only a few models for their extreme-performance inference, so it must have been a huge bespoke effort to integrate this one.
Here are the metrics by which the author defines this plateau: "limited by their inability to maintain coherent context across sessions, their lack of persistent memory, and their stochastic nature that makes them unreliable for complex multi-step reasoning."
If you try to benchmark any proxy for the points above, for instance "can models solve problems that require multiple steps in agentic mode" (PlanBench, BrowseComp, I've even built custom benchmarks), the progress between models is very clear and shows no sign of slowing down.
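For a rough idea of what I mean by a custom benchmark, here is a minimal sketch (the tasks and the run_agent stub are hypothetical illustrations, not my actual harness): define multi-step tasks with a checkable final answer, run the agent on each, and compare the pass rate across model versions.

    # Minimal sketch of a custom multi-step benchmark (hypothetical tasks;
    # wire run_agent to your actual model/agent loop with tools or browsing).
    TASKS = [
        {"prompt": "Find the release year of Python 2.0, then add 5.", "expected": "2005"},
        {"prompt": "Compute 17 * 23, then subtract 100.", "expected": "291"},
    ]

    def run_agent(prompt: str) -> str:
        # Placeholder: call your model here and return its final answer as text.
        return ""

    def pass_rate(tasks) -> float:
        hits = sum(1 for t in tasks if t["expected"] in run_agent(t["prompt"]))
        return hits / len(tasks)

    print(f"pass rate: {pass_rate(TASKS):.0%}")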
And this does convert to real-world tasks: yesterday, I had GPT-5 build complex React charts in one shot, whereas previous models needed constant supervision.
I think we're moving the goalposts too fast for LLMs, and that's what can lead us to believe they've plateaued: just try using past models for your current tasks (you can use open models to be sure they weren't updated) and watch them struggle.