> The AFM pre-training dataset consists of a diverse and high quality data mixture. This includes data we have licensed from publishers, curated publicly-available or open-sourced datasets, and publicly available information crawled by our web-crawler, Applebot. We respect the right of webpages to opt out of being crawled by Applebot, using standard robots.txt directives.
The fact that you can opt out in robots.txt only if you knew to list Applebot months (years?) ago when they started crawling is a little unimpressive.
That's why it's possible to have a default deny rule in robots.txt
User-agent: *
Disallow: /
And possibly allow-list the ones you accept. This probably won't change the fact that you may allow a vendor at one point in time, only to realise they changed their crawling use case and have been scraping data for AI training for the past 6 months (before they go public about it).
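A minimal sketch of that allow-list pattern, assuming Googlebot and Bingbot are the crawlers you still want (per RFC 9309 a crawler follows the most specific group matching its name, so the named groups override the catch-all deny):

# default: deny every crawler
User-agent: *
Disallow: /

# explicitly allow-listed crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /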
It can be argued that if you are a server operator, you always know which User-agents are making requests to your resources.
I thought robots.txt provided an opt-out for all web crawlers, not just vendor-specific ones? What would be the use case for not using "*" if you didn't want something to be crawled?
Not exactly. It opts you out of respectful web crawlers by any name, or, if you specify it, of some crawler in particular. Misbehaved ones require a bit more effort.
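For example, that extra effort might mean filtering by User-Agent at the web server itself; a hypothetical nginx snippet (the bot names are placeholders, and a determined scraper can spoof its User-Agent anyway):

# inside a server block: reject requests from known-bad crawlers
if ($http_user_agent ~* "(BadBot|SomeScraperBot)") {
    return 403;
}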
robots.txt has always had mechanisms for allowing or denying specific crawlers. Most of the AI labs that crawl the web support something like this now; the NY Times robots.txt file has a few relevant examples: https://www.nytimes.com/robots.txt
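Entries of that kind (using a couple of AI crawler user-agent tokens as illustrations; the actual NYT list may differ) look like:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /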
It’s very reasonable to want Google and Bing to index your page for search, but not have your data collected for training models, I think. I’m not familiar enough with robots.txt to know if it has a whitelisting mechanism.
> The fact that you can opt out in robots.txt only if you knew to list Applebot months (years?) ago when they started crawling is a little unimpressive.
I think this goes wider than Apple. Many site administrators thought robots.txt was only for (dis)allowing crawlers that built search indexes that could send them page hits in exchange, and saw nothing wrong with allowing that.
Now, many companies crawl with another goal in mind. They don’t do that in secret, but announce it in their headers.
Now the question is how much you can blame those new crawlers for doing that. Should they have made an effort to ask site administrators, “hey, you allow us to crawl your site. Are you sure you mean that?”, or should site administrators have noticed the new crawlers and started thinking about whether they meant to allow them to get their content? I think it’s a bit of both. I find it hard to judge who’s more to blame, but certainly think that newer crawlers are less to blame for this, as sites should by now know that this problem exists.
Wait, Apple releases papers? Do they have a name for their internal research division? When I think of companies doing AI research, my mind generally jumps to Google DeepMind; I didn't know Apple released any proper research papers at all.
Fun fact: Their first paper, Improving the Realism of Synthetic Images (2017; https://machinelearning.apple.com/research/gan), strongly hints at eye and hand tracking for the Apple Vision Pro released 5 years later.
I believe it; I'm just surprised that I never thought of Apple as a big player here, and I'm wondering if this is an image problem on their side (rare, I know) or something else.