there are plenty of other benchmarks that disagree with these, with that said. from my experience most of these benchmarks are trash. use the model yourself, apply your own set of problems and see how well it fairs.
I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by human with rubrics). Would love you to check out and give your thoughts:
Example recent one on GPT-5:
https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...
All results:
What they want is to get rid of apps like YouTube Vanced that are making them lose money (and other Play Store apps)