marsh_mellow (u/marsh_mellow)

marsh_mellow commented on GPT-4.1 in the API openai.com/index/gpt-4-1/... · Posted by u/maheshrijal

jsnell · 8 months ago

That's not a lot of samples for such a small effect. I don't think it's statistically significant (p-value of around 10%).

marsh_mellow · 8 months ago

p-value of 7.9%, so close to the usual 5% threshold for statistical significance, though not past it.

The p-value against a win rate of 49% is 4.92%, so at the 5% level we can say that GPT-4.1 is at least (essentially) evenly matched with Claude Sonnet 3.7, if not better.

Given that Claude Sonnet 3.7 has been generally considered to be the best (non-reasoning) model for coding, and given that GPT-4.1 is substantially cheaper ($2/million input, $8/million output vs. $3/million input, $15/million output), I think it's safe to say that this is significant news, although not a game changer.
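
For anyone who wants to check the arithmetic, here is a rough sketch. It assumes 110 wins out of 200 with no ties (the Qodo post may have counted differently) and a one-sided test; the function names are my own. Against a 50% null, the normal approximation gives about 7.9%, and the exact binomial tail comes out somewhat higher, closer to the "around 10%" mentioned above. Nobody in the thread stated their method, so this is one plausible reconstruction, and the 49% case is not guaranteed to reproduce the 4.92% figure exactly.

```python
import math


def normal_upper_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def z_test_p_value(wins, n, p0):
    """One-sided p-value via the normal approximation (no continuity
    correction) for H0: win rate == p0 against H1: win rate > p0."""
    mean = n * p0
    sd = math.sqrt(n * p0 * (1 - p0))
    return normal_upper_tail((wins - mean) / sd)


def exact_p_value(wins, n, p0):
    """One-sided exact binomial p-value: P(X >= wins) when the true rate is p0."""
    return sum(
        math.comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(wins, n + 1)
    )


if __name__ == "__main__":
    n, wins = 200, 110  # assumption: 55% of 200 pull requests, no ties
    for p0 in (0.50, 0.49):
        print(
            f"null win rate {p0:.2f}: "
            f"normal approx p = {z_test_p_value(wins, n, p0):.4f}, "
            f"exact binomial p = {exact_p_value(wins, n, p0):.4f}"
        )
```

The gap between the two methods is a reminder that with n = 200 the choice of test can move a borderline result to either side of 5%.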

InkCanon · 8 months ago

> 4.1 was better in 55% of cases

Um, isn't that just a fancy way of saying it is slightly better?

> Score of 6.81 against 6.66

So very slightly better.

marsh_mellow · 8 months ago

I don't think the absolute score means much — judge models have a tendency to score around 7/10 lol

55% vs. 45% equates to about a 36-point difference in Elo. In chess that would be two players in the same league, but one with a clear edge.
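
A quick sketch of where a number like that comes from, assuming the 55% is treated as an expected score with no draws. The function names are my own. The common logistic Elo formula gives a gap of just under 35 points, while Elo's original normal-curve model (each player's performance having a standard deviation of 200 points) gives about 35.5, which rounds to 36.

```python
import math
from statistics import NormalDist


def elo_gap_logistic(win_rate):
    """Rating gap implied by an expected score under the logistic Elo formula:
    expected = 1 / (1 + 10 ** (-gap / 400))."""
    return 400 * math.log10(win_rate / (1 - win_rate))


def elo_gap_normal(win_rate):
    """Rating gap under the normal-curve model, where the difference in
    performance between two players has standard deviation 200 * sqrt(2)."""
    return NormalDist().inv_cdf(win_rate) * 200 * math.sqrt(2)


if __name__ == "__main__":
    rate = 0.55  # GPT-4.1's reported share of better suggestions
    print(f"logistic model: {elo_gap_logistic(rate):.1f} points")
    print(f"normal model:   {elo_gap_normal(rate):.1f} points")
```

Either way the gap is in the mid-30s, so the "same league, clear edge" reading holds under both models.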

arvindh-manian · 8 months ago

Interesting link. Worth noting that the pull requests were judged by o3-mini. Further, I'm not sure that 55% vs 45% is a huge difference.

marsh_mellow · 8 months ago

Good point. They said they validated the results by testing with other models (including Claude), as well as with manual sanity checks.

55% to 45% definitely isn't a blowout, but it is meaningful: in terms of Elo it equates to about a 36-point difference. So not in a different league, but definitely a clear edge.

marsh_mellow · 8 months ago

From OpenAI's announcement:

> Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better suggestion in 55% of cases. Notably, they found that GPT‑4.1 excels at both precision (knowing when not to make suggestions) and comprehensiveness (providing thorough analysis when warranted).

https://www.qodo.ai/blog/benchmarked-gpt-4-1/

u/marsh_mellow

Karma: 186 · Cake day: December 30, 2022