Um, isn't that just a fancy way of saying it is slightly better
>Score of 6.81 against 6.66
So very slightly better
55% vs. 45% equates to about a 36 point difference in ELO. in chess that would be two players in the same league but one with a clear edge
the p-value for GPT-4.1 having a win rate of at least 49% is 4.92%, so we can say conclusively that GPT-4.1 is at least (essentially) evenly matched with Claude Sonnet 3.7, if not better.
Given that Claude Sonnet 3.7 has been generally considered to be the best (non-reasoning) model for coding, and given that GPT-4.1 is substantially cheaper ($2/million input, $8/million output vs. $3/million input, $15/million output), I think it's safe to say that this is significant news, although not a game changer