The argument that training inputs are "just like reading a book" seems like a fair statement IMO, albeit antiquated these days. However, generating text, audio, or images in the specific style of an individual creator seems like a slippery slope that ultimately deserves some kind of remuneration.
I'm glad I'm not a lawyer or politician trying to sort this out. If AI gets commercially crippled, I really don't want to live in a world of black market training data.
“However, generating text, audio, or images in the specific style of an individual creator seems like a slippery slope that ultimately deserves some kind of remuneration.”
Human artists already do this, extensively. We handle it by making their output the part of the process which holds relevant copyright protections. I can sell Picasso-inspired pieces all day long as long as I don’t sell them as “Picasso.”
If I faithfully reproduced “The Old Guitarist”[1] and attempted to sell it as the original, or even as a version to copy and sell prints, I’d be open to legal claims and action. Rightfully so.
I personally haven’t heard a convincing argument as to why ML training should be handled as if it’s the output of the process, rather than the input that it is. I’m open to being swayed and adjusting my worldview, so I keep looking for counterpoints.

—— [1] https://en.m.wikipedia.org/wiki/The_Old_Guitarist
> However, generating text, audio, or images in the specific style of an individual creator seems like a slippery slope that ultimately deserves some kind of remuneration.
It's hard to find a foothold. Human output doesn't have this restriction. Further, it feels like regulating solar power so coal miners can keep their jobs.
Just banning it or regulating output may seem like a solution to some, but all that means is that we'll cripple ourselves while other, more technologically progressive economies sprint past us in the affected markets. That wouldn't save the jobs in the end, and it would ultimately hurt more people than the markets we tried to save.
But we do desperately need to sort out how this is going to devastate entire markets of labor before it risks major economic upheaval with no safety nets in place.
Making "style" into private property subject to infringement seems like the far more slippery slope to me. I think existing copyright laws are more than sufficient, for cases where there are actually generations with substantial similarity to some protected work.
> However, generating text, audio, or images in the specific style of an individual creator seems like a slippery slope that ultimately deserves some kind of remuneration.
It never has before. Why now?
Someone (more commonly, some group) invents Impressionism, or Art Deco, or heavy metal, or gangsta rap, or acid-washed jeans, or buzz cuts and pretty soon there are dozens or hundreds of other people creating works in that style, none of whom are paying the originators a cent.
Yeah I would love it if these companies simply took a principled stance against IP restrictions, but alas. Like they are literally running directly into the main problem these restrictions create and then thinking “we need a way to say those don’t apply to us while making sure they still apply to everyone else”.
There are no principles involved when companies advocate for or against things. Companies will always amorally argue for whatever makes them more money. They are entirely capable of arguing two opposing viewpoints if in one context viewpoint A makes them money and in another context opposite-viewpoint B makes them money. Being consistent, either logically, morally, ethically, or in principle, is not necessary.
"Copyright is good and necessary when it makes us money, and copyright is bad and wrong when it doesn't make us money" is a mundane and totally expected opinion coming from a corporation.
Here’s a principled stance: you don’t have the right to a sound, arrangement of pixels, or ideas. IP for nobody. If you disagree, then it should be “IP for everyone” and not “whether or not I support IP depends on who benefits from it”. An answer that differs between piracy and AI training is not a principled answer.
That's the way it's always worked. You read books to learn to code, then write code and you get paid for it, but that money doesn't trickle down to the writers of the books that taught you to code.
Except, the “books” in this case are the code/content that people might like to get paid for in the first place. Also, AI companies aren’t “buying the books”.
Copyright holders make all kinds of arguments for why they should get money for incidental exposure to their work. This is all about greed and jealousy. If someone uses AI to make infringing content, existing laws already cover that. The fact that an ML model could be used to generate infringing content, and has exposure to or "knowledge" of some copyrighted material, is immaterial. People just see someone else making money and want to try to get a piece of it.
> in a way that is completely dependent upon their own prior work
Ultimately information has to come from somewhere. If something has no information about what a "car" is, it cannot paint a car more successfully than a random guess. When you draw a car or write an algorithm to do so, you'll be slightly affected by the existing car designs you've seen. It's not a limitation specific to AI - it's just more obscured for humans since there's no explicit searchable database of all the cars you've glanced at.
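To make that concrete, here's a toy sketch (Python; the one-dimensional "car" data and feature values are entirely made up for illustration): a classifier that was exposed to labeled examples does well, while one with no information about cars at all can only guess at chance.

    import random

    random.seed(0)

    # Made-up toy data: "cars" cluster near 1.0 on some feature, "not cars" near 0.0.
    train = [(random.gauss(1.0, 0.2), "car") for _ in range(50)] + \
            [(random.gauss(0.0, 0.2), "not car") for _ in range(50)]
    test  = [(random.gauss(1.0, 0.2), "car") for _ in range(50)] + \
            [(random.gauss(0.0, 0.2), "not car") for _ in range(50)]

    # "Trained": nearest-centroid classifier built from the examples it has seen.
    car_mean   = sum(x for x, y in train if y == "car") / 50
    other_mean = sum(x for x, y in train if y == "not car") / 50
    trained   = lambda x: "car" if abs(x - car_mean) < abs(x - other_mean) else "not car"

    # "Untrained": has no information about what a car is, so it can only guess.
    untrained = lambda x: random.choice(["car", "not car"])

    for name, clf in [("trained", trained), ("untrained", untrained)]:
        acc = sum(clf(x) == y for x, y in test) / len(test)
        print(f"{name}: {acc:.0%}")   # roughly 100% vs. roughly 50% (chance)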
Whether it was affected by (and dependent on in aggregate) prior work is not the standard for copyright infringement, and I'd claim would implicate essentially all action as infringement. Instead, it should be judged by whether there's substantial similarity, and if there is substantial similarity, then by the factors of fair use.
No, AI art would exist without Disney or HBO just like human art would.
It really does come back to the idea that AI is doing more or less the same thing as an art student: learning styles and structures and concepts. In which case, if AI training infringes because it's completely dependent on the work of artists who came before, then training an art student infringes for exactly the same reason.
And sure, if you ask a skilled 2D artist whether they can draw something in the style of '80s anime, or of specific artists, they can do it. There are artists who specialize in this, in fact! Can’t have retro anime porn commissions if they’re not riffing on retro anime images. Yes, Twitter, I see what you do with that account when you’re not complaining about AI.
The problem is that AI lowers the cost of doing this to zero, and thus lays bare the inherent contradictions of IP law and “intellectual ownership” in a society where everyone is diffusing and mashing up each other’s ideas and works on a continuous basis. It is one of those “everyone does it” crimes that mostly survives because it’s utterly unenforced at scale, apart from a few noxious litigants like Disney.
It is the old Luddite problem - the common idea that luddites just hated technology is inaccurate. They were textile workers who were literally seeing their livelihoods displaced by automation mass-producing what they saw as inferior goods. https://en.wikipedia.org/wiki/Luddite
In general, though, this is a problem set up by capitalism itself. Ideas can’t and shouldn’t be owned; it’s an absurd premise, and you shouldn’t be surprised that you get absurd results. Making sure people can eat is not the job of capitalism; it’s the job of safety nets and governments. Ideas have no cost of replication, and artificially creating one is distorting and destructive.
Would a neural net put a tax on neurons firing? No, that’s stupid and counterproductive.
Let people write their slash fiction in peace.
(HN probably has a good understanding of this, but in general people don't appreciate just how much these models are not just aping images they've seen; they're learning the style and relationships of pixels and objects. To wit: the only thing NVIDIA saved from DLSS 1.0 was the model, and DLSS 2.0 has nothing to do with DLSS 1.0 in terms of technical approach, yet the model encodes all the contextual understanding of how pixels are supposed to look in human images, even though it's no longer doing the original transform! And LLMs can indeed generalize reasonably accurately about things they haven't seen, as long as they know the precepts. They aren't "just guessing what word comes next"; it's the word that comes next given a conceptual understanding of the underlying ideas. That makes it difficult to draw a line between a human and a large model: college students will "riff on the things they know" if you ask them to "generalize" about a topic they haven't studied, too.)
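You can see the "given a conceptual understanding of the context" part directly in the mechanics: the model emits a probability distribution over the next token conditioned on everything that came before, so changing any earlier word reshapes the whole distribution. A minimal sketch, assuming the Hugging Face transformers API with GPT-2 as a stand-in model (any causal LM exposes the same interface):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Logits for the next token, conditioned on the entire context.
    # Edit any word in the prompt and the whole distribution shifts.
    ids = tok("The Old Guitarist was painted by", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)

    values, indices = torch.topk(probs, 5)   # five most likely next tokens
    for p, i in zip(values, indices):
        print(f"{tok.decode(int(i))!r}: {float(p):.3f}")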
AI companies monetising people’s IP need to pay up. End of story. Make a smarter AI next time that can “learn” with less content. As it stands, so-called AI is just a massive database that procedurally mixes content to generate what looks like “new” content.
All I see is AI companies poorly justifying their grift: they know they don't want to pay for the content they are commercializing without permission, so they pull out the fair-use excuses.
It's no wonder OpenAI had to pay Shutterstock for training on their data, that Getty is suing Stability AI for training on their watermarked images and using them commercially without permission, and that actors and actresses are filing lawsuits against commercial voice cloners, whose clones cost close to nothing to make; those companies either take down the cloned-voice offering or shut down.
These weak arguments from these AI folks sound like excuses justifying a newly found grift.
When you're viewing everyone with a different opinion than you as a grifter, corporate rat, or some other malicious entity, you've disabled the ability or desire for people to try to engage with you. You won't be convinced, and you're already being uncivil and fallacious.
AI outputs should be regulated, of course. Obviously impersonation and copyright law already apply to AI systems. But a discussion of training inputs is entirely novel to us and our laws, and it's a very nuanced and important topic. As AI advances it only gets harder, because the distinction between "organic" learning and "artificial" learning keeps shrinking, and because stopping AI from learning from research papers, for example, could mean missing out on life-saving medication. Where do property rights conflict with human rights?
They're important conversations to have, but you've destroyed the opportunity to have them from the starting gun.
You likely benefit from machine learning applications constantly without realizing it: the spam filters on your email, defect scanning for the products you use and the rails they were delivered on, search queries, translating a page into English, weather modelling that gives you accurate predictions and early warnings, etc.
To avoid IP law causing more damage than it already has with evergreening of medical patents, I think it strictly has to be the generation of substantially similar media that counts as infringement, as the comment you're replying to suggests - not just "this tumor detector was pretrained on a large number of web images before task-specific fine-tuning, so it's illegal because they didn't pay Getty beforehand" if training were to be infringement.
AFAIK Getty's case against Stability is strong because Stability was dumb enough not to remove the watermarks before training, so the model now recreates Getty's watermark, which is an infringement. Also, let's not pretend OpenAI licensed their dataset from all the stock-image sites they scraped. The main reason they have a deal with Shutterstock is probably easy access, plus Shutterstock partnered with OpenAI for AI tech to sell on their website.
Yes, most of this is whining from the "copyright forever" crowd. If you get out something vaguely similar to something old, they complain. What they're really worried about is becoming obsolete, not being copied.
The case against "tribute bands" is much stronger than the case against large language models built with some copyrighted content. Those are a blatant attempt to capitalize on the specific works of specific named people.
What people are worried about is the age-old problem of thieves stealing property and monetising it, which is what a lot of AI companies do. Pay up, or create your own content and it's fair game.
This is a blatant non sequitur. There are many approaches to actually having a good-faith discussion of the societal/economic/moral/humanitarian effects of large-scale AI taking over entire workforces. Being coy and asking loaded questions does nothing to convince anyone of them.
I wonder if copyrighted content will be needed at all in the future of AI training.
AlphaZero learned to play chess via self-play, not by reading books about chess.
Why couldn't the same happen for art for example?
For coding, won't a sufficiently advanced neural net be able to figure out how to use a programming language when given just the documentation?
And when most of our interactions are with AI, it will learn from our conversations with it. Asking some AI system why feature X was removed from programming language Y in version Z teaches it something. The next person who asks it which feature was removed from Y in version Z might be told "X", without the AI system ever having to read about it. The interaction with AI could become a self-learning loop in and of itself.
Chess has a very straightforward definition of objectively correct and incorrect moves; there is no such thing for art. Though I have to wonder how many rounds of looking at random garbage and rating it you'd have to do for some kind of supervised learning to eventually yield coherent output...
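Right, and that scoring function is the load-bearing piece of the whole self-play recipe. A deliberately tiny sketch (Python; the one-move "game" where anything above 0.5 wins is made up for illustration, and the update is a bare REINFORCE step): everything hinges on reward() being objective and free to evaluate, which chess gets from its rules and art simply doesn't have.

    import random

    random.seed(0)

    def reward(move):
        # Chess-like: the rules themselves say who won, at zero cost.
        # For "art" there is no analogous oracle, which is the point above.
        return 1.0 if move > 0.5 else -1.0

    mean, sigma, lr = 0.4, 0.2, 0.02       # Gaussian policy over a single move

    for _ in range(3000):                  # the "self-play" loop
        move = random.gauss(mean, sigma)   # sample a move from the current policy
        r = reward(move)                   # ask the rules who won
        mean += lr * r * (move - mean)     # REINFORCE: nudge toward rewarded moves

    print(f"policy mean after self-play: {mean:.2f}")   # ends up well above 0.5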
I am getting tired of corporations taking public resources, keeping the profits for themselves, and pushing the costs onto the public. Therefore, I suggest that the training sets (content, labeling) for AI systems should be considered worldwide common property that everyone can use and benefit from.
And your comment is just a remix of words from things you've read. The question is whether the result is a derivative work, which... I can't completely rule it out, but it's not obvious that that's all LLMs do. It pretty quickly gets to a question of exactly how the tech works and then gets philosophical (what is creativity?).
> If I take a copyrighted song and remix it do I owe the artist a royalty? Yes.
I don't think anyone's saying that the output of the model should be subject to more lenient copyright standards than human creations. If you're selling a remix with substantial similarity to an existing song and it fails the fair use factors, then it'd already be infringement regardless of whether you made it with AI or by hand.
The question is: what about the songs that you've listened to, and potentially influenced you, but don't have substantial similarity to the remix? Do you also owe royalties on all of those? For humans the answer is no, but the law doesn't necessarily have to treat AI the same way.
There are potential arguments to say one should be able to use copyrighted material to train a model, but it’s a very slippery slope that leads to logic that also says the resulting models shouldn’t be protectable intellectual property.
Thus these arguments could backfire on those making them real quick. At this point though they have no choice but to make them as they’ve all clearly used copyrighted works to build their products.
The other challenge here is there needs to be some protection and compensation for folks producing original works in the first place. If we just end up with ML models training on the generated output of other ML models this is all going to go downhill real quick.
And those arguments are finally getting attention and legal consideration, now that they're being posited by the rich and the well-capitalized.
Some of these rhyme with the fair use and similar arguments put forth by the free software and anti-corporate-owned culture folks for the last couple decades. More honest (if cynical) is A16z’s take, of “the rich already put in a bunch of money, so now you can’t stop it.”
https://www.documentcloud.org/documents/24117932-apple
I'd like you to give away 100% of your salary, ok?
Are you greedy if you say no?
For starters, because "art" does not have an objective scoring function.
> Why couldn't the same happen for art for example?
> For coding, won't a sufficiently advanced neural net be able to figure out how to use a programming language when given just the documentation?
Some domains are too complex and large to be cracked that way.
Stop making excuses. AI training on copyrighted works is straight wrong no matter how much you don't want it to be.
All of my internet comments are copyrighted btw, but I do offer a license of $1b usd/year for using them in a model if you'd like.
Authorship in many fields is well defined. This comment slips into nonsense territory, whatever one's view of or jurisdiction regarding copyright law.