> The AFM pre-training dataset consists of a diverse and high quality data mixture. This includes data we have licensed from publishers, curated publicly-available or open-sourced datasets, and publicly available information crawled by our web-crawler, Applebot. We respect the right of webpages to opt out of being crawled by Applebot, using standard robots.txt directives.
The fact that you can opt out in robots.txt only if you knew to list Applebot months (years?) ago when they started crawling is a little unimpressive.
That's why it's possible to have a default deny rule in robots.txt
User-agent: *
Disallow: /
And possibly allow-list the ones you accept. This probably won't change the fact that you may allow a vendor at one point in time, only to realise they changed their crawling use case and have been scraping data for AI training for the past 6 months (before they go public about it).
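A minimal sketch of that allow-list pattern, assuming Googlebot and Bingbot are the crawlers you still want (per RFC 9309 a crawler follows the most specific group matching its name, so the named groups override the catch-all deny):

# default: deny every crawler
User-agent: *
Disallow: /

# explicitly allow-listed crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /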
It can be argued that if you are a server operator, you always know which User-agents are making requests to your resources.
I thought robots.txt provided an opt-out for all web crawlers, not just vendor-specific ones? What would be the use case for not using "*" if you didn't want something to be crawled?
Not exactly. It opts you out of respectful web crawlers by any name, or, if you specify it, of some crawler in particular. Misbehaved ones require a bit more effort.
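For example, that extra effort might mean filtering by User-Agent at the web server itself; a hypothetical nginx snippet (the bot names are placeholders, and a determined scraper can spoof its User-Agent anyway):

# inside a server block: reject requests from known-bad crawlers
if ($http_user_agent ~* "(BadBot|SomeScraperBot)") {
    return 403;
}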
robots.txt has always had mechanisms for allowing or denying specific crawlers. Most of the AI labs that crawl the web support something like this now; the NY Times robots.txt file has a few relevant examples: https://www.nytimes.com/robots.txt
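Entries of that kind (using a couple of AI crawler user-agent tokens as illustrations; the actual NYT list may differ) look like:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /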
It’s very reasonable to want Google and Bing to index your page for search, but not have your data collected for training models, I think. I’m not familiar enough with robots.txt to know if it has a whitelisting mechanism.
> The fact that you can opt out in robots.txt only if you knew to list Applebot months (years?) ago when they started crawling is a little unimpressive.
I think this goes wider than Apple. Many site administrators thought robots.txt was only for (dis)allowing crawlers that built search indexes that could send them page hits in exchange, and saw nothing wrong with allowing that.
Now, many companies crawl with another goal in mind. They don’t do that in secret, but announce it in their headers.
Now the question is how much you can blame those new crawlers for doing that. Should they have made an effort to ask site administrators, “hey, you allow us to crawl your site. Are you sure you mean that?”, or should site administrators have noticed the new crawlers and started thinking about whether they meant to allow them to get their content? I think it’s a bit of both. I find it hard to judge who’s more to blame, but certainly think that newer crawlers are less to blame for this, as sites should by now know that this problem exists.
Wait, Apple releases papers? Do they have a name for their internal research division? When I think of companies doing AI research, my mind generally jumps to Google DeepMind; I didn't know Apple released any proper research papers at all.
Fun fact: Their first paper, Improving the Realism of Synthetic Images (2017; https://machinelearning.apple.com/research/gan), strongly hints at eye and hand tracking for the Apple Vision Pro released 5 years later.
I believe it; I'm just surprised that I never thought of Apple as a big player here, and I'm wondering if this is an image problem on their side (rare, I know) or something else.