Of course it is quite costly and also requires some "marketing" to actually get it established.
If it is meant to test generalisation capability, then the data the model being evaluated was trained on is crucial to drawing any conclusions.
Look at the construction of this synthetic dataset, for example: https://arxiv.org/pdf/1711.00350
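The interesting part of that construction is that the test set is deliberately *not* an i.i.d. sample of the training data: whole compositional combinations are withheld, so a model can only score well by learning the composition rule rather than memorising the training distribution. A toy sketch of that kind of split (hypothetical commands and grammar, not the actual dataset from the paper):

```python
import random

# Toy grammar: a "command" is a primitive verb plus an optional modifier.
PRIMITIVES = ["walk", "run", "look", "jump"]
MODIFIERS = ["", "twice", "left", "around right"]

def all_commands():
    return [(verb, mod) for verb in PRIMITIVES for mod in MODIFIERS]

def compositional_split(held_out_verb="jump"):
    """Train/test split where one primitive is only seen in isolation.

    The model sees 'jump' on its own during training, but every composed
    form ('jump twice', 'jump left', ...) is reserved for the test set.
    A model that merely fits the training distribution fails; one that
    learns how modifiers compose with verbs generalises.
    """
    train, test = [], []
    for verb, mod in all_commands():
        if verb == held_out_verb and mod != "":
            test.append((verb, mod))
        else:
            train.append((verb, mod))
    random.shuffle(train)
    return train, test

train, test = compositional_split()
print(len(train), "training commands,", len(test), "held-out compositions")
```

The point is that what counts as "generalisation" is defined entirely by how the split is constructed relative to the training data, which is exactly the information we don't have for the models being compared here.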
The ARC HRM blog post says:
> [...] we set out to verify HRM performance against the ARC-AGI-1 Semi-Private dataset - a hidden, hold-out set of ARC tasks used to verify that solutions are not overfit [...] 32% on ARC-AGI-1 is an impressive score with such a small model. A small drop from HRM's claimed Public Evaluation score (41%) to Semi-Private is expected. ARC-AGI-1's Public and Semi-Private sets have not been difficulty calibrated. The observed drop (-9pp) is on the high side of normal variation. If the model had been overfit to the Public set, Semi-Private performance could have collapsed (e.g., ~10% or less). This was not observed.
This reasoning would all go out the window if the model being evaluated can _see_, during training, the type of distribution shift it will encounter at test time. And it's unclear whether the shift in the hidden set is the same as in the public one.
There are also questions about the evaluations raised by comparing large-model performance against these smaller models, especially given the ablation studies. Are the large models trained on the same data as these tiny models? Should they be? And if they shouldn't be, why are we allowing the small models access to that data during training?