> To build LAION, founders scraped visual data from companies such as Pinterest, Shopify and Amazon Web Services — which did not comment on whether LAION’s use of their content
Pinterest, "their content"... not sure I agree with that, but if we're going to use the logic that things saved to Pinterest by its users become Pinterest's content, then isn't LAION doing the same thing, and the content becomes LAION's content when it's saved to their database of images...
There's also no agreement between Pinterest and the actual copyright owners of most of their content; so much of it is reposted art without even a credit or a link to the source.
Plus, LAION is just an index, whereas Pinterest actually hosts it.
If LAION republishes people's copyrighted content, that sounds like a pretty blatant copyright violation (edit: it's not; see below). Sounds like all the artists unhappy that their art is being used to train these AI systems should be talking to LAION to have their content removed from the dataset.
Edit: Apparently LAION doesn't republish the content, only the metadata, so it's not a copyright violation. Still, it would be nice if they got permission or offered a way for artists to be excluded from the data set.
> Apparently LAION doesn't republish the content, only the metadata
"LAION datasets are simply indexes to the internet, i.e. lists of URLs to the original images together with the ALT texts found linked to those images."
I'm surprised link rot doesn't make this a big problem.
that the non-profit LAION is going as far as intimidating the creators of the images they used.
They don't publish the image, yes, and that is also their reasoning. However, in my opinion, the intention behind all of this seems obvious. They circumvent copyright claims by being a non-profit and not publishing, but with the clear intent that the image will be used to train some system further down the road. LAION appears to be a key player in how these text-to-image models dodge the copyright bullet.
> Still, it would be nice if they got permission or offered a way for artists to be excluded from the data set.
If artists don't want their work to impact the world, they're free to keep it to themselves.
This whole discussion that we should allow individual artists to opt out of AI art through contracts or some other legal vehicle is a non-starter, because it'll be impossible to administer and enforce at scale, and there's too much incentive and ability for big tech to just ignore them and steamroll artists. They aren't a unified bloc, and even if they were, how would they ever compete against big tech?
So what to do? Looking at productivity gains over the decades, it's not clear why we are still working as hard as we are. It's long overdue that productivity gains should come back to the people. Maybe "artist" shouldn't be a job title associated with profit/income seeking. If you want to be an artist, maybe society can support that.
Maybe instead of using all those productivity gains to do more, more, more, we can just work less for the same. Because it seems to me the more we work, the richer they get. What if instead, they didn't get so rich, and we gave that money to artists in the form of grants, like we do for scientists? You do some art, apply for some grants, and you get some money to do more art. It'll all be public domain, anyone can use it, and big business gets to make a profit on it just like with scientific advancements (I have issues about that, but at least there's precedent).
Thank you! The Bloomberg article piqued my interest, but it is actually very light on the details of the LAION project - it quickly moved into the same old license and bias discussions and left me wanting. Your interview is much more interesting; it's great to learn more about the person behind it and his motivations.
Thanks! Really happy to hear that. He's a very fascinating guy and LAION is an awesome project, even though the discussion about copyright and data sets is important.
For his part, Schuhmann hasn’t profited from LAION and says he isn’t interested in doing so. “I’m still a high school teacher. I have rejected job offers from all different kinds of companies because I wanted this to stay independent,” he said.
Brave is this man for going against the current Zeitgeist (and the obvious financial temptations).
I've built a site to pay people for images and annotations. [1] I'm trying to onboard my first paid users right now. The plan is to build out a high quality 50k image license plate recognition dataset as a proof of concept.
Right now we own all of the datasets on the site and the idea is to license them out to companies while making them available to researchers under a non-commercial license. The market might take it a different direction to be more of a marketplace or Github style hosting. Email in bio if anyone wants to chat about this.
Also, if anyone wants to get paid 10 cents an image to take pictures of North American license plates, get in touch. Need about 1000 from each state. It's probably below most people's pay grade on here but there is a whole reverse bidding system, so you can always bid higher than 10 cents. Some user studies with a shared screen would be super helpful as well.
"Also, if anyone wants to get paid 10 cents an image to take pictures of North American license plates, get in touch."
Your blatant disregard for privacy is shocking, but perhaps unsurprising in the field. I guess you also didn't think through the enormous risks for the photographer.
Wow, interesting project. Not sure how many people you can entice into providing 10k photos of license plates, firearms, and children in pools. I kid, but are you building an alert system for a superhero?
Hey there, just checked out the site and signed up. I’m not seeing anything about how to get paid for uploads. Can you provide some direction here? Thanks!
I wonder how it came to be that such a critical component of a multi-billion industry relies on something so amateurish as LAION. This is no offense to the author at all, who organised a gigantic effort which we now see is very valuable. But I would imagine a company like Google could do a much better job in no time, simply due to expertise and resources.
By all indications, it's definitely a solid engineering project. On what basis should we deride it as "amateurish"?
Because it originates in the public sector? As opposed to the private sector (where, as we know, everything is done to the highest possible engineering standards)?
I know the word sounds demeaning, and I didn't mean to criticise it on technical grounds. I find it an extremely impressive project.
I meant that it does not have the refinement you would expect for such a critical tool. A substantial portion of LAION is composed of duplicates. If you have ever browsed it, you will find that many annotations are quite basic and in some cases incorrect. In ChatGPT's case we know there was a small army of people going through their dataset to filter and refine issues that are presumably similar to those.
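To make the duplicates point concrete, here is a minimal sketch of one cheap cleanup pass. It assumes the metadata is a list of (URL, caption) pairs and that "duplicate" means the same image URL appearing more than once; `normalize_url` and `dedupe` are hypothetical helper names, not anything from LAION's tooling. Near-duplicate images served from different URLs would need something like perceptual hashing, which this does not attempt.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lower-case scheme and host and drop the fragment (assumed normalization)."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def dedupe(records):
    """Keep the first (url, caption) pair seen for each normalized URL."""
    seen = set()
    unique = []
    for url, caption in records:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append((url, caption))
    return unique
```

Keeping the first occurrence is an arbitrary choice; one could just as well keep the longest caption per URL.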
Indeed. The answer is in the article. He gets offered jobs all the time, but nobody offers to buy the data itself. Clearly that's because nobody wants to own it; it's all plausible deniability.
The group used raw HTML code collected by the California nonprofit Common Crawl to locate images around the web and associate them with descriptive text. It does not use any manual or human curation.
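For anyone wondering what "locate images and associate them with descriptive text" can look like in practice, here is a rough sketch using only the Python standard library. It is not LAION's actual pipeline: the class and function names are made up for illustration, and it assumes the "descriptive text" is simply the `alt` attribute of each `<img>` tag, as the dataset description quoted elsewhere in this thread suggests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgAltExtractor(HTMLParser):
    """Collect (absolute image URL, alt text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        # Skip images that have no source or no usable description.
        if src and alt:
            self.pairs.append((urljoin(self.base_url, src), alt))


def extract_pairs(html, base_url):
    """Return all (image URL, alt text) pairs found in the given HTML string."""
    parser = ImgAltExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Note that only the URL and the text are kept; the image bytes are never downloaded, which is the "index, not host" distinction people keep coming back to here.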
Does the Common Crawl data already take care of the copyright issue? Or else how does the LAION crawler deal with that problem?
I mean, it's not too hard to write an image crawler. Also, scaling it up is a bit of a challenge, but it's a technical one. But the real difficulty is how to deal with all the legal strings attached...
LAION doesn’t publish the images, only the metadata, which includes the original image URLs. Anyone who wants to make use of the image set has to re-download the images from the original sources [0], and is then liable for actually using the images. Arguably, the alt text contained in the metadata is still subject to copyright though.
This is the most commonly misunderstood characteristic of LAION, and I find that alarming because it feels like it just plain means a lot of people don't understand how the internet actually works.
Which is not surprising, but guys, it's everywhere; it's also not optional to understand this.
After reading it, it seems that LAION is merely doing the dirty work (going as far as intimidating the creators of the materials they use) so that other companies can utilize copyrighted material without taking too much risk. Not sure if this teacher is a hero...
According to Christoph Schuhmann, there is legislation in Europe that allows the usage of crawled data for public research institutions and nonprofit organisations. LAION is a registered German nonprofit (gemeinnütziger Verein).
It's not really a difficulty though, is it, for VC-funded unicorns who ask forgiveness and move fast. When the law catches up, if ever, they'll pay a slap on the wrist while sitting on their superyachts.
It obeys robots.txt, and the user-agent is documented.
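In case it helps, this is roughly what "obeying robots.txt" means for any crawler. It is a sketch with Python's standard `urllib.robotparser`, not the crawler's own code; the user-agent string `example-bot` is a placeholder I made up, so substitute the documented one if you want to test your own site.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    rp = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks one bot from /private/ only.
ROBOTS = """\
User-agent: example-bot
Disallow: /private/

User-agent: *
Disallow:
"""
```

With that file, `allowed_by_robots(ROBOTS, "example-bot", "https://example.com/private/a.jpg")` comes back `False`, while the same URL is allowed for any other agent. So a site owner who wants out can add a rule for the documented user-agent.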
https://entwickler.de/machine-learning/laion-open-source-ai
The key is finding quality source material.
It may become necessary to:
1. Fly drones around to gather data in the world at present time
2. Direct humans to take high-quality photos of particular subjects. Fiverr for photography
[1] https://mekabytes.com
It will send you a link to an app that only runs on an iPhone so it can verify you actually took the pics from the phone.
I've been wanting this for quite some time.
I guess the answer is legal liability.
Is it open source? If that's the case, you already own it.
[0] see also https://news.ycombinator.com/item?id=35681085
Sorry, only in German:
https://www.alltageinesfotoproduzenten.de/2023/04/24/laion-e...
And translated via Google translate:
https://www-alltageinesfotoproduzenten-de.translate.goog/202...
Source: https://entwickler.de/machine-learning/laion-open-source-ai (found in another post).