LAION, a high school teacher’s free image database, powers AI unicorns

> To build LAION, founders scraped visual data from companies such as Pinterest, Shopify and Amazon Web Services — which did not comment on whether LAION’s use of their content

Pinterest, "their content"... not sure I agree with that, but if we're going to use the logic that things saved to Pinterest by its users become Pinterest's content, then isn't LAION doing the same thing, and the content becomes LAION's content when it's saved to their database of images...

__alexs · 2 years ago

There is a T&C agreement between Pinterest and its users. There is no T&C agreement between LAION and Pinterest's users.

debugnik · 2 years ago

There's also no agreement between Pinterest and the actual copyright owners of most of their content, so much reposted art without even credit or a link to the source.

Plus, LAION is just an index, whereas Pinterest actually hosts it.

mcv · 2 years ago

If LAION republishes people's copyrighted content, that sounds like a pretty blatant copyright violation (edit: it's not; see below). Sounds like all the artists unhappy that their art is being used to train these AI systems, should be talking to LAION to have their content removed from the dataset.

Edit: Apparently LAION doesn't republish the content, only the metadata, so it's not a copyright violation. Still, it would be nice if they got permission or offered a way for artists to be excluded from the data set.

tyingq · 2 years ago

>Apparently LAION doesn't republish the content, only the metadata

"LAION datasets are simply indexes to the internet, i.e. lists of URLs to the original images together with the ALT texts found linked to those images."

I'm surprised link rot doesn't make this a big problem.

Filligree · 2 years ago

> Still, it would be nice if they got permission or offered a way for artists to be excluded from the data set.

It obeys robots.txt, and the user-agent is documented.
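For anyone wondering what "obeys robots.txt" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The user-agent name `example-image-crawler` and the robots.txt content are made up for illustration; this is not LAION's actual crawler code or its documented user-agent string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt an artist might publish, for illustration only.
# "example-image-crawler" is an invented user-agent name.
ROBOTS_TXT = """\
User-agent: example-image-crawler
Disallow: /portfolio/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/portfolio/painting.png", "/blog/header.png"):
        url = "https://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "example-image-crawler", url))
```

A crawler that respects this file would skip everything under `/portfolio/` while other bots remain unaffected, which is the opt-out mechanism being referred to.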

archerx · 2 years ago

Why does Pinterest get away with republishing people’s content? Shouldn’t artists be suing Pinterest for its blatant copyright violations?

Silverback_VII · 2 years ago

You can read here (in German): https://www.alltageinesfotoproduzenten.de/2023/02/20/laion-v...

that the non-profit LAION is going as far as intimidating the creators of the images they used.

They don't publish the image, yes, and that is also their reasoning. However, in my opinion, the intention behind all of this seems obvious. They circumvent copyright claims by being a non-profit and not publishing, but with the clear intent that the image will be used to train some system further down the road. LAION appears to be a key player in how these text-to-image models dodge the copyright bullet.

ModernMech · 2 years ago

> Still, it would be nice if they got permission or offered a way for artists to be excluded from the data set.

If artists don't want their work to impact the world, they're free to keep it to themselves.

This whole discussion that we should allow individual artists to opt out of AI art through contracts or some other legal vehicle is a non-starter, because it'll be impossible to administer and enforce at scale, and there's too much incentive and ability for big tech to just ignore them and steamroll artists. They aren't a unified bloc, and even if they were, how would they ever compete against big tech?

So what to do? Looking at productivity gains over the decades, it's not clear why we are still working as hard as we are. It's long overdue that productivity gains should come back to the people. Maybe "artist" shouldn't be a job title associated with profit/income seeking. If you want to be an artist, maybe society can support that.

Maybe instead of using all those productivity gains to do more, more, more, we can just work less for the same. Because it seems to me the more we work, the richer they get. What if, instead, they didn't get so rich, and we gave that money to artists in the form of grants, like we do for scientists? You do some art, apply for some grants, and you get some money to do more art. It'll all be public domain, anyone can use it, and big business gets to make a profit on it just like with scientific advancements (I have issues about that, but at least there's precedent).

ThrowawayTestr · 2 years ago

It's copyright infringement all the way down.

The group used raw HTML code collected by the California nonprofit Common Crawl to locate images around the web and associate them with descriptive text. It does not use any manual or human curation.

Does the Common Crawl data already take care of the copyright issue? Or else how does the LAION crawler deal with that problem?

I mean, it's not too hard to write an image crawler. Also, scaling it up is a bit of a challenge, but it's a technical one. But the real difficulty is how to deal with all the legal strings attached...
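To make the "locate images and associate them with descriptive text" step from the quoted passage concrete, here is a rough sketch using only Python's standard library. It assumes the descriptive text is the `alt` attribute of each `<img>` tag; the class and function names are invented for this example, and the real pipeline over Common Crawl data is certainly more involved.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageAltExtractor(HTMLParser):
    """Collects (absolute image URL, alt text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        # Keep only images that have both a source and a non-empty alt text.
        if src and alt:
            self.pairs.append((urljoin(self.base_url, src), alt))


def extract_pairs(html, base_url):
    """Return a list of (image_url, alt_text) tuples found in html."""
    parser = ImageAltExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.pairs


if __name__ == "__main__":
    page = '<p><img src="/cat.jpg" alt="A cat"><img src="no-alt.png"></p>'
    print(extract_pairs(page, "https://example.com/page/index.html"))
```

Note that nothing here downloads an image; the output is only a list of URLs and captions, which matches the "index, not host" description elsewhere in the thread.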

layer8 · 2 years ago

LAION doesn’t publish the images, only the metadata, which includes the original image URLs. Anyone who wants to make use of the image set has to re-download the images from the original sources [0], and is then liable for actually using the images. Arguably, the alt text contained in the metadata is still subject to copyright though.

slow_typist · 2 years ago

Arguably, a short alt text does not meet the threshold of originality under most jurisdictions. See https://en.m.wikipedia.org/wiki/Threshold_of_originality

XorNot · 2 years ago

This is the most commonly misunderstood characteristic of LAION, and I find that alarming because it feels like it just plain means a lot of people don't understand how the internet actually works.

Which is not surprising, but guys, it's everywhere; it's also not optional to understand this.

Borrible · 2 years ago

Well, at least one content producer is pissed about it.

Sorry, only in German:

https://www.alltageinesfotoproduzenten.de/2023/04/24/laion-e...

And translated via Google translate:

https://www-alltageinesfotoproduzenten-de.translate.goog/202...

Silverback_VII · 2 years ago

After reading it, it seems that LAION is merely doing the dirty work (going as far as intimidating the creators of the materials they use) so that other companies can utilize copyrighted material without taking too much risk. Not sure if this teacher is a hero...

slow_typist · 2 years ago

According to Christoph Schuhmann, there is legislation in Europe that allows the usage of crawled data for public research institutions and nonprofit organisations. LAION is a registered German nonprofit (gemeinnütziger Verein).

Source: https://entwickler.de/machine-learning/laion-open-source-ai (found in another post).

gorbachev · 2 years ago

Commercial entities are using the data, however. I'm pretty sure the German legislation didn't mean that to be allowed.

nprateem · 2 years ago

It's not really a difficulty though, is it, for VC-funded unicorns who ask forgiveness and move fast? When the law catches up, if ever, they'll get a slap on the wrist while sitting on their superyachts.

angrais · 2 years ago

Less than 0.01% of the 5.8 billion images have an associated license. This license was taken from the webpage so cannot be trusted.

helsinkiandrew · 2 years ago

https://archive.ph/UIu69

whywhywhywhy · 2 years ago

netuffresche · 2 years ago

If someone is interested, I interviewed him recently. The interview is in German, though.

https://entwickler.de/machine-learning/laion-open-source-ai

w-m · 2 years ago

Thank you! The Bloomberg article piqued my interest, but it is actually very light on the details of the LAION project - it quickly moved into the same old license and bias discussions and left me wanting. Your interview is much more interesting, it's great to learn more about the person behind it and his motivations.

Thanks! Really happy to hear that. He's a very fascinating guy and LAION is an awesome project, even though the discussion about copyright and data sets is important.

lisasays · 2 years ago

For his part, Schuhmann hasn’t profited from LAION and says he isn’t interested in doing so. “I’m still a high school teacher. I have rejected job offers from all different kinds of companies because I wanted this to stay independent,” he said.

Brave is this man for going against the current Zeitgeist (and the obvious financial temptations).

1letterunixname · 2 years ago

One could also write an image crawler to execute from AWS.

The key is finding quality source material.

It may become necessary to:

1. Fly drones around to gather data in the world at present time

2. Direct humans to take high-quality photos of particular subjects. Fiverr for photography

TACIXAT · 2 years ago

I've built a site to pay people for images and annotations. [1] I'm trying to onboard my first paid users right now. The plan is to build out a high quality 50k image license plate recognition dataset as a proof of concept.

Right now we own all of the datasets on the site and the idea is to license them out to companies while making them available to researchers under a non-commercial license. The market might take it in a different direction, to be more of a marketplace or GitHub-style hosting. Email in bio if anyone wants to chat about this.

Also, if anyone wants to get paid 10 cents an image to take pictures of North American license plates, get in touch. Need about 1000 from each state. It's probably below most people's pay grade on here but there is a whole reverse bidding system, so you can always bid higher than 10 cents. Some user studies with a shared screen would be super helpful as well.

1. https://mekabytes.com

dahwolf · 2 years ago

"Also, if anyone wants to get paid 10 cents an image to take pictures of North American license plates, get in touch."

Your blatant disregard for privacy is shocking, but perhaps unsurprising in the field. I guess you also didn't think through the enormous risks for the photographer.

kennyloginz · 2 years ago

Wow, interesting project. Not sure how many people you can entice into providing 10k photos of license plates, firearms, and children in pools. I kid, but are you building an alert system for a superhero?

virgildotcodes · 2 years ago

Hey there, just checked out the site and signed up. I’m not seeing anything about how to get paid for uploads. Can you provide some direction here? Thanks!

What is the cost for each dataset for commercial uses?

bradgessler · 2 years ago

This will be the new captcha challenge—“take a picture of 14 crosswalks”

It will send you a link to an app that only runs on an iPhone so it can verify you actually took the pics from the phone.

moffkalast · 2 years ago

takes pictures of a screen with the phone

quickthrower2 · 2 years ago

That would not be a11y friendly!

ramraj07 · 2 years ago

You mean like what Google did with Street View and convincing everyone to upload their photos to Maps?

lmpdev · 2 years ago

Like that but properly organised and with higher quality imagery

I've been wanting this for quite some time

Tepix · 2 years ago

Or like Mapillary (bought by Facebook, but with a liberal license still)

epups · 2 years ago

I wonder how it came to be that such a critical component of a multi-billion industry relies on something so amateurish as LAION. This is no offense to the author at all, who organised a gigantic effort which we now see is very valuable. But I would imagine a company like Google could do a much better job in no time, simply due to expertise and resources.

I guess the answer is legal liability.

By all indications, it's definitely a solid engineering project. On what basis should we deride it as "amateurish"?

Because it originates in the public sector? As opposed to the private sector (where, as we know, everything is done to the highest possible engineering standards)?

I know the word sounds demeaning, and I didn't mean to criticise it on technical grounds. I find it an extremely impressive project.

I meant that it does not have the refinement you would expect for such a critical tool. A substantial portion of LAION is composed of duplicates. If you have ever browsed it, you will find that many annotations are quite basic and in some cases incorrect. In ChatGPT's case we know there was a small army of people going through their dataset to filter and refine issues that are presumably similar to those.
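As a small illustration of the duplicates point, here is a sketch of the simplest kind of cleanup: dropping rows whose image URL repeats after light normalization. The row layout `(url, caption)` and the function names are assumptions made for this example. It only catches exact URL repeats; the same image hosted at different URLs would need content hashing or embedding comparison, which is not shown.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase the scheme and host and drop the fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def drop_duplicate_urls(rows):
    """Keep the first (url, caption) row for each normalized URL."""
    seen = set()
    unique = []
    for url, caption in rows:
        key = normalize_url(url)
        if key in seen:
            continue
        seen.add(key)
        unique.append((url, caption))
    return unique


if __name__ == "__main__":
    rows = [
        ("https://Example.com/a.jpg", "cat"),
        ("https://example.com/a.jpg#top", "a cat"),
        ("https://example.com/b.jpg", "dog"),
    ]
    print(drop_duplicate_urls(rows))
```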

"I guess the answer is legal liability."

Indeed. The answer is in the article. He gets offered jobs all the time, but nobody offers to buy the data itself. Clearly because nobody wants to own it, it's all plausible deniability.

bioemerl · 2 years ago

> Clearly because nobody wants to own it

Is it open source? If that's the case, you already own it.

Maken · 2 years ago

Forever relevant: https://xkcd.com/2347/

ftxbro · 2 years ago

Something about this reminds me a lot of Michael Hart and Project Gutenberg.

kleiba · 2 years ago