What about the so-called NPUs present in some modern microcontroller chips?
If your headline metric is a score, and you constantly test against that score, it becomes very tempting to do whatever makes the score go up, i.e., train on the test set.
I believe all the major ML labs are doing this now because:
- No one talks about their datasets.
- The scores are front and center in big releases, but there is very little discussion or nuance beyond the metric itself.
- The repercussions of not posting a higher or at least comparable score are massive: the release is seen as a failure and your budget gets cut.
More in-depth discussion of capabilities, while harder to produce, is a good signal of a release's quality.
That is to say, focusing on scores is a good thing. If we want our models to improve further, we simply need better benchmarks.