Can someone explain these Aider benchmarks to me? They run the same 113 tests through the LLM every time. Why do they then extrapolate the LLM's ability to pass these 113 basic Python challenges to a general ability to produce/edit code? Couldn't an LLM provider just fine-tune their model for these specific tasks - since they are static - to get ad value?
Did anyone ever try to change the test cases or wiggle the conditions a bit to see if it would still hit the same %?
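For what it's worth, that kind of "wiggle" test isn't hard to sketch. Something like the following (the task format and the `run_model_on_task` hook are hypothetical stand-ins, not Aider's actual harness) would show whether the pass rate holds up when surface details change:

```python
import random
import re

def perturb_task(prompt: str, tests: str) -> tuple[str, str]:
    """Rename the required entry-point function; leave the task logic untouched."""
    # Assumes each task asks for a function literally named `solve` - an assumption
    # made for illustration; real tasks would need their own rename rules.
    new_name = f"solve_{random.randint(1000, 9999)}"
    return (re.sub(r"\bsolve\b", new_name, prompt),
            re.sub(r"\bsolve\b", new_name, tests))

def pass_rate(tasks, run_model_on_task, perturbed: bool = False) -> float:
    """tasks: list of (prompt, tests); run_model_on_task returns True if the tests pass."""
    passed = 0
    for prompt, tests in tasks:
        if perturbed:
            prompt, tests = perturb_task(prompt, tests)
        passed += bool(run_model_on_task(prompt, tests))
    return passed / len(tasks)

# If pass_rate(tasks, run, perturbed=True) comes out far below the unperturbed
# number, that's at least a hint that the static benchmark has been memorized.
```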
They could. They would easily be found out, though, as they lose out in real-world usage or on improved, new, unique benchmarks.
If you were in charge of a large and well-funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they are specifically excluded from the training data?
I would exclude them as well as possible so I get feedback on how "real" any model improvement is. I need to develop real-world improvements in the end, and any short-term gain in usage from cheating on benchmarks seems very foolish.
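In practice that exclusion is usually a decontamination pass over the training corpus. A minimal sketch, assuming a simple 13-gram overlap rule (a common heuristic, not any particular lab's actual pipeline):

```python
def ngrams(text: str, n: int = 13) -> set:
    """All word-level n-grams of a document, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def benchmark_index(benchmark_docs: list, n: int = 13) -> set:
    """Union of n-grams across every known benchmark prompt/solution/test file."""
    index = set()
    for doc in benchmark_docs:
        index |= ngrams(doc, n)
    return index

def is_contaminated(train_doc: str, index: set, n: int = 13) -> bool:
    """True if the training document shares any long n-gram with a benchmark."""
    return not ngrams(train_doc, n).isdisjoint(index)

# Usage: build the index once, then filter the corpus before training.
# idx = benchmark_index(benchmark_docs)
# clean_corpus = [d for d in corpus if not is_contaminated(d, idx)]
```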