It isn't: outlines are how stakeholders convey needs to the people charged with satisfying them (a.k.a. "requirements"). Expectations become unrealistic when people believe a language model can somehow "understand" those outlines the way a human expert would and produce an equivalent work product.
Language models produce nondeterministic output by sampling from a statistical model derived from their training data; the relevance of that output is judged by the people interpreting it, not by the model.
They do not understand "what the system should do."
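To make the "statistical model, not understanding" point concrete, here is a toy sketch. It is not a real language model; the token probabilities are made up for illustration. It just shows that generation is weighted sampling from a distribution, so the same prompt yields different continuations run to run:

```python
# Toy illustration (not a real LM): next-token generation is sampling
# from a probability distribution, which is inherently nondeterministic.
import random

# Hypothetical probabilities a model might assign after a prompt like
# "The system should..." -- the numbers are invented for this example.
next_token_probs = {
    "validate": 0.40,
    "log": 0.25,
    "retry": 0.20,
    "escalate": 0.15,
}

def sample_next_token(probs: dict[str, float]) -> str:
    """Draw one token according to the model's probability weights."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# The same prompt produces different continuations on each run; whether
# any of them is "relevant" is decided by the human reading the output.
for _ in range(5):
    print(sample_next_token(next_token_probs))
```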
AI companies have a strong incentive to make benchmark scores go up. They can employ humans to write training data that closely resembles benchmark tasks, gaming the benchmark without directly training on the test set.
Throwing your own hard problems from work at an LLM is a better metric than any benchmark.
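In practice that just means keeping a small private eval set. A minimal sketch, assuming the official openai Python client and an OPENAI_API_KEY in the environment; the task, the model name, and the crude keyword check are all placeholders you would replace with your own problems and grading:

```python
# Minimal private-eval sketch: run your own hard problems against a
# model and keep score. Tasks stay private, so unlike public benchmarks
# they can't leak into anyone's training data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder task list -- substitute real problems from your own work.
tasks = [
    {"prompt": "Explain why our nightly ETL job deadlocks under load.",
     "expect": "lock ordering"},  # hypothetical keyword to look for
]

passed = 0
for task in tasks:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: swap in whichever model you test
        messages=[{"role": "user", "content": task["prompt"]}],
    )
    answer = resp.choices[0].message.content
    # Crude keyword check; in practice grade by hand or with a rubric.
    if task["expect"] in answer.lower():
        passed += 1
    print(answer, "\n---")

print(f"passed {passed}/{len(tasks)} private tasks")
```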