They say that, and yet the much larger traffic of grain from Russia somehow doesn't hurt those farmers?! Maybe it's because a lot of it is stolen Ukrainian grain, so everyone should be hush-hush about it.
What faith can be placed in User-Agent strings? The contents of this header have been faked since the birth of the web in the early 90s.
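For anyone who has not tried it, faking the header takes one line. A minimal sketch in Python, assuming the third-party "requests" library and a hypothetical target URL:

    # Any client can claim to be any crawler; the User-Agent header is free text.
    import requests

    headers = {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
    resp = requests.get("https://example.com/", headers=headers)  # hypothetical target
    print(resp.status_code)

Based on the header alone, the server has no way to distinguish this request from one sent by the real crawler.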
How does anyone know what someone will do with the data they have crawled? There are no transparency requirements and no legally enforceable agreements. There are only IP addresses and User-Agent headers.
No one can look at a User-Agent header or an IP address and conclude, "I know what someone will do with the data I let them access, because this header or address confirms it." Those accessing the data could use it for any purpose, or transfer it to someone else to use for any purpose. Unless the website operator has a binding agreement with the organisation doing the crawling, any assumptions about future behaviour made on the basis of a header or IP address offer no control whatsoever.
Perhaps an IP address could be used to conclude that the HTTP requests are being sent by Company X, but the address does not indicate what Company X will do with the data it collects. Company X can do whatever it wants. It does not need to tell anyone what it is doing; it could make up a story about what it is doing that conveniently conceals the facts.
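The closest thing to identity verification is a reverse-plus-forward DNS check on the connecting address, which a few large crawlers support. Even then, it only tells you who sent the request, never what will be done with the response. A rough sketch using Python's standard socket module; the example IP and domain suffix are illustrative assumptions:

    import socket

    def ip_matches_domain(ip, expected_suffix):
        # Reverse lookup: IP -> hostname. Fails for most spoofed clients.
        try:
            host, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        if not host.endswith(expected_suffix):
            return False
        # Forward lookup: hostname -> IPs. Confirms the PTR record was not forged.
        try:
            forward_ips = socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False
        return ip in forward_ips

    # e.g. ip_matches_domain("66.249.66.1", ".googlebot.com")

Passing this check establishes the sender at best; it places no constraint on how the collected data is used afterwards.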
These so-called "tech" companies that are training "AI" are secretive and non-transparent. Further, they will lie when it suits their interests. They do not ask for permission; they only ask for forgiveness. They are strategic and, unfortunately, deceptive in what they tell the public. "Trust" at your own risk.
Although it may be useless as a means of controlling how crawled data is used, it still makes sense to me to put something in robots.txt indicating that no consent is given for crawling for the purpose of training "AI". Better still would be to publish an explicit notice to the public that no consent is given to use data from the website to train "AI".
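For what it's worth, a few AI crawlers have published user-agent tokens that can be listed in robots.txt; GPTBot (OpenAI), CCBot (Common Crawl), and Google-Extended (Google) are documented examples. Compliance is entirely voluntary, which is exactly the trust problem described above, but the entries at least put the non-consent on record:

    # robots.txt: no consent for "AI" training crawls
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /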
Put the restrictions in a license. Let the so-called "tech" companies assent to that license. Then, when evidence of unauthorised data use becomes available, enforce the license.
If those in power decide to push forward by increasing the penalties, running a Mastodon server will eventually become as risky as being caught for murder. At that point, it would be naive to run a Mastodon server locally, wouldn't it?
P.S. Mind you, the average consumer/user doesn't care as long as Facebook, Twitter, Instagram, and TikTok are up and running. Quotes like "they already know everything" and "do you have something to hide from the government?" will become ubiquitous in public discourse.