- Is a daily sample size of 50 really sufficient to distinguish actual model degradation from the inherent stochasticity of SWE-bench?
- Since you're running directly in the CLI, do you also track 'silent' degradations like increased token usage or latency, or is it strictly pass/fail?
- What are the API costs for daily Opus 4.5 runs on 50 SWE tasks?
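On the first question, a rough back-of-envelope calculation (my own numbers, normal approximation, not anything from the project) shows how much day-to-day noise a 50-task sample carries:

```python
import math

# Half-width of a 95% confidence interval for a pass rate
# estimated from n = 50 SWE-bench tasks (normal approximation).
n = 50
for p in (0.80, 0.76, 0.60):            # hypothetical true pass rates
    se = math.sqrt(p * (1 - p) / n)     # standard error of the sample proportion
    half_width = 1.96 * se
    print(f"p = {p:.2f} -> 95% CI roughly +/- {half_width * 100:.1f} points")
```

At an 80% pass rate the band is roughly +/- 11 percentage points, so a single day dropping a few points is well within noise; you'd need a larger sample or aggregation across several days to call it degradation.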
As for tracking more subtle degradation, it's on the roadmap.
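A hedged sketch of what that could look like: log per-task latency and token counts alongside pass/fail, so cost or speed drift shows up even when the pass rate holds steady. All names here (TaskRun, run_and_log, result.usage) are hypothetical illustrations, not the project's actual API:

```python
import json
import time
from dataclasses import dataclass, asdict

# Hypothetical per-task record; field names are illustrative, not the project's.
@dataclass
class TaskRun:
    task_id: str
    passed: bool
    wall_time_s: float
    input_tokens: int
    output_tokens: int

def run_and_log(task_id: str, runner, log_path: str = "runs.jsonl") -> TaskRun:
    """Wrap an existing task runner (assumed interface) and append metrics to a JSONL log."""
    start = time.monotonic()
    result = runner(task_id)  # assumed to expose .passed and a .usage dict with token counts
    record = TaskRun(
        task_id=task_id,
        passed=result.passed,
        wall_time_s=time.monotonic() - start,
        input_tokens=result.usage["input_tokens"],
        output_tokens=result.usage["output_tokens"],
    )
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
    return record
```

Tracking a rolling median of wall_time_s and output_tokens would then surface "silent" regressions that pass/fail alone misses.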
SWE-Bench (Pro / Verified)
Model               | Pro (%) | Verified (%)
--------------------+---------+--------------
GPT-5.2-Codex       | 56.4    | ~80
GPT-5.2             | 55.6    | ~80
Claude Opus 4.5     | n/a     | ~80.9
Gemini 3 Pro        | n/a     | ~76.2
And for terminal workflows, where agentic steps matter: Terminal-Bench 2.0
Model               | Score (%)
--------------------+-----------
Claude Opus 4.5     | ~60+
Gemini 3 Pro        | ~54
GPT-5.2-Codex       | ~47
So yes, GPT-5.2-Codex is good, but when you put it next to its real competitors:
- Claude is still ahead on strict coding + terminal-style tasks
- Gemini is better for huge context + multimodal reasoning
- GPT-5.2-Codex is strong but not clearly the new state of the art across the board
It feels a bit odd that the page only shows internal numbers instead of placing them next to the other leaders.
And I don't think your Terminal-Bench 2.0 scores are accurate. Per the latest benchmarks, Opus 4.5 is at 59% and GPT-5.2-Codex is at 64%.
See the charts at the bottom of https://marginlab.ai/blog/swe-bench-deep-dive/ and https://marginlab.ai/blog/terminal-bench-deep-dive/