This looks cherry-picked. For example, Claude Opus had a higher score on SWE-Bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.
Agreed.
This deserves much more attention than it has received so far! I would consider reposting it over the weekend if I were you.