I work in machine learning for digital pathology, and I think the big problem here is the divergence between publishing papers and models that are actually helpful in real life.
What you often see in the literature is a model trained on data from a single lab that gets crazy good results.
However, apply it to a different lab (not in the paper, of course) and it sucks.
So what we do in practice is train our models on many different labs at the same time and keep a held-out test set of labs not covered during training.
That way you get a robust model that works well in practice (but doesn't hit the 99.9% metrics reported in papers).
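To make that split concrete, here is a minimal sketch of holding out whole labs rather than individual images (the `lab_id` column, the file names, and the scikit-learn `GroupShuffleSplit` are my own illustration, not our exact pipeline):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical index of training examples: one row per image patch,
# with the originating lab recorded so entire labs can be held out.
index = pd.DataFrame({
    "patch_path": [f"patch_{i}.png" for i in range(12)],
    "label":      [i % 2 for i in range(12)],
    "lab_id":     ["lab_A", "lab_B", "lab_C", "lab_D"] * 3,
})

# Hold out ~25% of *labs* (not patches), so the test set only
# contains labs the model never saw during training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(index, groups=index["lab_id"]))

train_df, test_df = index.iloc[train_idx], index.iloc[test_idx]
assert set(train_df["lab_id"]).isdisjoint(test_df["lab_id"])
```

The assert at the end is the whole point: no lab appears on both sides of the split.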
The second thing is choosing which task to let the machine do. We typically go for tasks that are boring and repetitive for the doctor, but we visualize the result so the doctor double-checks it before making the diagnosis. It still saves a lot of time.
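For the double-checking step, a minimal sketch of what surfacing the model output to the doctor can look like, assuming a patch image and a per-pixel probability map from the model (the shapes, colormap, and alpha are illustrative, not what we actually ship):

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-ins for the real inputs: an RGB tissue patch and the model's
# per-pixel probability map in [0, 1].
patch = np.random.rand(256, 256, 3)
prob_map = np.random.rand(256, 256)

# Overlay the model output as a semi-transparent heatmap so the
# pathologist reviews the prediction in context before signing off.
fig, ax = plt.subplots(figsize=(4, 4))
ax.imshow(patch)
ax.imshow(prob_map, cmap="inferno", alpha=0.4)
ax.set_title("Model output for review (not a diagnosis)")
ax.axis("off")
plt.show()
```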
I also work in this field, and what we typically do is a lot of nested cross-validation to put some bounds on the model-building process and get an idea of how it would perform on repeated unseen data. Data leakage is always on our minds, and we do our best to avoid it at every stage. We also train on data from many sites. It can be done, and it can be done properly.

As you say, it is always best to collect a completely naive test set to back up the model-building process. If you design your pipeline properly, the test set should fall within the bounds you got during cross-validation. It all depends on how much data you have, and as long as you design your pipelines with that in mind and acknowledge the limitations of smaller datasets, the research is valid and useful.
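For what it's worth, a minimal sketch of that kind of nested, site-grouped cross-validation, with scikit-learn's `GroupKFold` and a plain logistic regression standing in for whatever model and tuning grid is actually used (the data and site labels here are synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

# Hypothetical data: 200 samples from 10 sites. Grouping by site keeps
# samples from the same site out of the train and validation folds at
# the same time, which closes one common leakage path.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
sites = np.repeat(np.arange(10), 20)

inner_cv = GroupKFold(n_splits=5)   # hyperparameter tuning
outer_cv = GroupKFold(n_splits=5)   # estimate of the whole pipeline

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y, groups=sites):
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        cv=inner_cv,
    )
    # Pass groups so the inner folds are also split by site.
    search.fit(X[train_idx], y[train_idx], groups=sites[train_idx])
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

# The spread of the outer scores is the "bounds" on the model-building
# process; a later naive test set should land inside this range.
print(f"outer CV accuracy: {np.mean(outer_scores):.3f} "
      f"+/- {np.std(outer_scores):.3f}")
```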