At some point I just moved on, and solved the next word, Chess, in like 2 seconds. The results claimed I spent a whole pile of time on Chess, since the game has no idea which word I'm actually trying to solve (though all the wrong guesses for Catan should have been a clue that the time should have gone towards Catan).
I'm not sure the per-word time values are useful if they can't be trusted to be accurate.
2. x86 micro-ops and ARM decode are not equivalent. x86’s variable-length instructions make the whole decode process far more complicated than it is on something like ARM. This is a penalty due to legacy design.
3. The OP was talking about M1. AFAIK, M4 is now 10-wide, and most x86 is 6-wide (Zen 5 does some weird stuff). x86 was 4-wide at the time of M1’s introduction.
4. M1 has over 600 reorder buffer entries… it’s significantly larger than competitors.
5. Close relative to x86 competitors.
And? Are you saying neither Intel nor AMD engineers were able to determine that this was a bottleneck worth chasing? The point was, anybody could add more cache, rename, reorder or whatever buffers they wanted to... it's not Apple secret-sauce.
If all the competition knew they were leaving all this performance/efficiency on the table despite there being a relatively simple fix, that's on them. They got overtaken by a competitor with a better offering.
If all the competition didn't realize they were leaving all this performance/efficiency on the table despite there being a relatively simple fix, that's also on them. They got overtaken by a competitor with a better offering AND more effective engineers.