It will work better than regex. A lot of these companies rely on "but we are clearly recognizable", e.g. via these user agents, as an excuse to put the burden on sysadmins to maintain blocklists instead of the other way round (keeping a list of known-good scrapers).
Maybe someone mathy can unburden them?
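The "other way round" idea might look something like this minimal sketch: check the user agent against a small allowlist of known-good crawlers instead of maintaining an ever-growing blocklist. The agent names here are illustrative entries, not a real list, and substring matching on the UA is of course trivially spoofable, which is the whole problem the comment raises.

```python
# Hypothetical sketch: allowlist known-good crawlers instead of blocklisting bad ones.
KNOWN_GOOD_AGENTS = {"Googlebot", "bingbot"}  # illustrative entries only

def is_allowed_crawler(user_agent: str) -> bool:
    """Return True if the UA contains an allowlisted crawler token."""
    return any(token in user_agent for token in KNOWN_GOOD_AGENTS)
```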
You could also look at who asks for nonexistent resources, and block anyone who asks for more than X (with X large enough that a config issue or similar doesn't kill regular clients). The block might be just a minute, so you don't risk much when a false positive occurs. It will likely be enough to make the scraper turn away.
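A minimal sketch of that scheme, with made-up names and thresholds: count 404s per client IP and impose a short block once the count passes X. In practice you'd want per-window counters and shared state (e.g. in the reverse proxy), but the core logic is just this:

```python
import time
from collections import defaultdict

MAX_404S = 20        # the "X": large enough that a misconfigured client won't trip it
BLOCK_SECONDS = 60   # short block, so a false positive costs little

_counts = defaultdict(int)   # ip -> number of 404s seen
_blocked_until = {}          # ip -> unix time when the block expires

def should_block(ip, status, now=None):
    """Record a response status for ip; return True while ip is blocked."""
    now = time.time() if now is None else now
    if _blocked_until.get(ip, 0) > now:
        return True
    if status == 404:
        _counts[ip] += 1
        if _counts[ip] > MAX_404S:
            _blocked_until[ip] = now + BLOCK_SECONDS
            _counts[ip] = 0  # reset so the client gets a fresh start after the block
            return True
    return False
```

A regular client that 404s a handful of times never gets near the threshold; a scraper probing thousands of guessed URLs trips it almost immediately.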
There are many things you can do depending on context, app complexity, load, etc. The problem is there's no really easy way to do any of them.
ML should be able to help a lot in a space like this?
- shopping cart fraud
- geo-restricted content (think distribution laws)
- preventing abuse (think ticket scalpers)
- preventing cheating and multi-accounting (think gaming)
- preventing account takeovers (think 2FA trigger if fingerprint suddenly changed)
There is much more, but yeah, this tech has its place. We cannot just assume everyone runs a static website with free-for-all content.
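The account-takeover case above can be sketched in a few lines: hash a handful of device signals into a fingerprint, and require a 2FA challenge whenever it differs from the one last seen for that account. The choice of signals and the in-memory store are illustrative assumptions; a real system would use far richer signals and persistent storage.

```python
import hashlib

# Hypothetical sketch: trigger step-up auth (2FA) when a login's device
# fingerprint differs from the one last recorded for that account.
_last_fingerprint = {}  # account id -> fingerprint hash

def fingerprint(user_agent, accept_language, timezone):
    """Hash a few illustrative signals into a stable fingerprint."""
    raw = "|".join((user_agent, accept_language, timezone))
    return hashlib.sha256(raw.encode()).hexdigest()

def needs_2fa(account, fp):
    """Return True when the fingerprint changed since the last login."""
    previous = _last_fingerprint.get(account)
    _last_fingerprint[account] = fp
    return previous is not None and previous != fp
```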