The best measure of intelligence is probably whether a model knows its strengths and weaknesses and deals with them efficiently. That ability is the most important thing for an eval to capture.
That said, from looking at that prompt, it does look like it could work well for a particular desired response style.
You're absolutely right! This is the basis of this recent paper https://www.arxiv.org/abs/2506.06832
My second degree is in mathematics. Not only can I probably not do these, but they likely aren’t useful to my work, so I don’t actually care.
I’m not sure an LLM could replace the mathematical side of my work (modelling), mostly because it’s applied: people don’t know what they are asking for, what is possible, or how to do it, and all the problems turn out to be quite simple really.