Happy to answer any questions about how this works if folks are interested.
I believe about 90% of the tasks were estimated by humans to take less than one hour to solve, so we aren't talking about very complex problems, and to boot, the contamination factor is huge: o3 (or any big model) will have in-depth knowledge of the internals of these projects, and often even know about the individual issues themselves (e.g. you can say what was Github issue #4145 in project foo, and there's a decent chance it can tell you exactly what the issue was about!)
For one, I speculate OpenAI is using a very basic agent harness to get the results they've published on SWEBench. I believe there is a fair amount of headroom to improve results above what they published, using the same models.
For two, some of the instances, even in SWEBench-Verified, require a bit of "going above and beyond" to get right. One example is an instance where the user states that a TypeError isn't properly handled. The developer who fixed it handled the TypeError but also handled a ValueError, and the golden test checks for both. I don't know how many instances fall in this category, but I suspect its more than on a simpler benchmark like MATH.
I like to jokingly call founder mode: "fine-grained multi-level oversight". Others might call it the derogatory "micromanagement".
That doesn't mean I control every decision, or that I don't give people space to be creative. What it means is: for whatever is most important for the business, I get involved with the details. The goal is that when I move out of that area, the team I worked with is able to operate closer to founder mode than when I started.
The issue is that vision fundamentally can't be communicated by telephone, or all at once. You're trying to get to a point on the map that most people can't see. The path to it is the integration of all of the tiny decisions everyone makes along the way.
If you only course correct from the highest level you'll never get there.
They tried to enforce the non-profit's charter, as is their duty. I would hardly frame that as overplaying their hand.
I've also used it several times on ~15-20min drives to memorize something I wanted to have available for immediate recall. I had it chunk & quiz me, and by the end of the drive I had it down pat. Fun use of drive time.