- Is a daily sample size of 50 really sufficient to distinguish actual model degradation from the inherent stochasticity of SWE-bench?
- Since you're running directly in the CLI, do you also track 'silent' degradations like increased token usage or latency, or is it strictly pass/fail?
- What are the API costs for daily Opus 4.5 runs on 50 SWE tasks?
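On the first question, a rough back-of-envelope calculation (my own numbers, normal approximation, not anything from the project) shows how much day-to-day noise a 50-task sample carries:

```python
import math

# Half-width of a 95% confidence interval for a pass rate
# estimated from n = 50 SWE-bench tasks (normal approximation).
n = 50
for p in (0.80, 0.76, 0.60):            # hypothetical true pass rates
    se = math.sqrt(p * (1 - p) / n)     # standard error of the sample proportion
    half_width = 1.96 * se
    print(f"p = {p:.2f} -> 95% CI roughly +/- {half_width * 100:.1f} points")
```

At an 80% pass rate the band is roughly +/- 11 percentage points, so a single day dropping a few points is well within noise; you'd need a larger sample or aggregation across several days to call it degradation.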
As for tracking more subtle degradation, it's on the roadmap.
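A hedged sketch of what that could look like: log per-task latency and token counts alongside pass/fail, so cost or speed drift shows up even when the pass rate holds steady. All names here (TaskRun, run_and_log, result.usage) are hypothetical illustrations, not the project's actual API:

```python
import json
import time
from dataclasses import dataclass, asdict

# Hypothetical per-task record; field names are illustrative, not the project's.
@dataclass
class TaskRun:
    task_id: str
    passed: bool
    wall_time_s: float
    input_tokens: int
    output_tokens: int

def run_and_log(task_id: str, runner, log_path: str = "runs.jsonl") -> TaskRun:
    """Wrap an existing task runner (assumed interface) and append metrics to a JSONL log."""
    start = time.monotonic()
    result = runner(task_id)  # assumed to expose .passed and a .usage dict with token counts
    record = TaskRun(
        task_id=task_id,
        passed=result.passed,
        wall_time_s=time.monotonic() - start,
        input_tokens=result.usage["input_tokens"],
        output_tokens=result.usage["output_tokens"],
    )
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
    return record
```

Tracking a rolling median of wall_time_s and output_tokens would then surface "silent" regressions that pass/fail alone misses.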
SWE-Bench (Pro / Verified)
Model               | Pro (%) | Verified (%)
--------------------+---------+--------------
GPT-5.2-Codex       | 56.4    | ~80
GPT-5.2             | 55.6    | ~80
Claude Opus 4.5     | n/a     | ~80.9
Gemini 3 Pro        | n/a     | ~76.2
And for terminal workflows, where agentic steps matter: Terminal-Bench 2.0
Model               | Score (%)
--------------------+-----------
Claude Opus 4.5     | ~60+
Gemini 3 Pro        | ~54
GPT-5.2-Codex       | ~47
So yes, GPT-5.2-Codex is good, but when you put it next to its real competitors:
- Claude is still ahead on strict coding + terminal-style tasks
- Gemini is better for huge context + multimodal reasoning
- GPT-5.2-Codex is strong but not clearly the new state of the art across the board
It feels a bit odd that the page only shows internal numbers instead of placing them next to the other leaders.
And I don't think your Terminal-Bench 2.0 scores are accurate. Per the latest benchmarks, Opus 4.5 is at 59% and GPT-5.2-Codex is at 64%.
See the charts at the bottom of https://marginlab.ai/blog/swe-bench-deep-dive/ and https://marginlab.ai/blog/terminal-bench-deep-dive/