These companies will never admit it but AI is built on the back of piracy archives, easiest way and cheapest way to getting massive amounts of quality data.
Twenty-four years later I still regret not being able to raise money to enable us to keep working on that nascent startup. In most ways it was still too early. Google was still burning through VC money at that point and the midwestern investors we had some access to didn't get it. And, honestly they were probably correct. Compute power was still too expensive and quality data sources like published text were mostly locked up and generally not available to harvest.
The bigger/more thought-diverse the audience, the harder this is to do.
Too much surprise and the scientific audience will dismiss you out of hand. How could you be right while all the prior research is dead wrong?
Conversely, too little surprise and the reader / listener will yawn and say but of course we all know this. You are just repeating standard knowledge in the field.
Despite the impact on audience reception we tend to believe that most fields would benefit from robust replication studies and the researchers shouldn't be penalized for confirming the well known.
And, sometimes there really is paradigm breaking research and common knowledge is eventually demonstrated to be very wrong. But often the initial researchers face years or decades of rejection.