Readit News
qwesr123 commented on Claude Code daily benchmarks for degradation tracking   marginlab.ai/trackers/cla... · Posted by u/qwesr123
qwesr123 · 11 days ago
FYI: the MarginLab Claude Code degradation tracker is showing a statistically significant ~4% drop in SWE-Bench-Pro accuracy over the past month.
qwesr123 commented on Claude Code Daily Degradation Tracker   marginlab.ai/trackers/cla... · Posted by u/qwesr123
7777777phil · a month ago
Cool project. I feel like I have been running my own mental, gut-feeling degradation tracker so far.

- Is a daily sample size of 50 really sufficient to distinguish actual model degradation from the inherent stochasticity of SWE-bench?
- Since you're running directly in the CLI, do you also track 'silent' degradations like increased token usage or latency, or is it strictly pass/fail?
- What are the API costs for daily Opus 4.5 runs on 50 SWE tasks?

qwesr123 · a month ago
Thanks! Daily confidence intervals are quite large and not super useful at the moment; weekly aggregation is more sensitive. We're hoping to increase the sample size, but it's quite expensive: it would be about $100-$150/day in API costs. We are using the Pro x20 subscription ($200/month).
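
To give a rough sense of why (a back-of-envelope sketch that treats each task as an independent pass/fail trial, with made-up pass counts purely for illustration): a Wilson 95% interval on a single daily run of 50 tasks is roughly +/-13 points wide, while aggregating a week of 7 x 50 tasks narrows it to roughly +/-5 points.

    import math

    def wilson_interval(passes, n, z=1.96):
        """Approximate 95% Wilson score interval for a pass rate of passes/n."""
        p = passes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return center - half, center + half

    # Hypothetical pass counts, chosen only to show interval width:
    print(wilson_interval(28, 50))    # one daily run of 50 tasks -> ~(0.42, 0.69)
    print(wilson_interval(196, 350))  # a week of 7 x 50 tasks    -> ~(0.51, 0.61)

So a ~4% shift is well inside the noise band for any single daily run, but starts to become resolvable once runs are aggregated over a week or more.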

Regarding more subtle degradation tracking, it is on the roadmap.

qwesr123 commented on GPT-5.2-Codex   openai.com/index/introduc... · Posted by u/meetpateltech
mohsen1 · 2 months ago
Since they are not showing how this model compares against other models on the benchmarks they cite, here is a quick view with the public numbers from Google and Anthropic. At least this gives some context:

    SWE-Bench (Pro / Verified)

    Model               | Pro (%) | Verified (%)
    --------------------+---------+--------------
    GPT-5.2-Codex       | 56.4    | ~80
    GPT-5.2             | 55.6    | ~80
    Claude Opus 4.5     | n/a     | ~80.9
    Gemini 3 Pro        | n/a     | ~76.2

And for terminal workflows, where agentic steps matter:

    Terminal-Bench 2.0

    Model               | Score (%)
    --------------------+-----------
    Claude Opus 4.5     | ~60+
    Gemini 3 Pro        | ~54
    GPT-5.2-Codex       | ~47

So yes, GPT-5.2-Codex is good, but when you put it next to its real competitors:

- Claude is still ahead on strict coding + terminal-style tasks

- Gemini is better for huge context + multimodal reasoning

- GPT-5.2-Codex is strong but not clearly the new state of the art across the board

It feels a bit odd that the page only shows internal numbers instead of placing them next to the other leaders.

qwesr123 · 2 months ago
Where are you getting SWE-Bench Verified scores for 5.2-Codex? AFAIK those have not been published.

And I don't think your Terminal-Bench 2.0 scores are accurate. Per the latest benchmarks, Opus 4.5 is at 59% and GPT-5.2-Codex is at 64%.

See the charts at the bottom of https://marginlab.ai/blog/swe-bench-deep-dive/ and https://marginlab.ai/blog/terminal-bench-deep-dive/

qwesr123 commented on GPT-5.2-Codex   openai.com/index/introduc... · Posted by u/meetpateltech
lacoolj · 2 months ago
lol I love how OpenAI just straight up doesn't compare their model to others on these release pages. Basically telling us they know Gemini and Opus are better but they don't want to draw attention to it
qwesr123 · 2 months ago
Not sure why they don't compare with others, but they are actually leading on the benchmarks they published. See here (bottom) for a chart comparing to other models: https://marginlab.ai/blog/swe-bench-deep-dive/

u/qwesr123

Karma: 323 · Cake day: June 13, 2025
About
marginlab.ai