Yeah, it's surprising to me how many people who were previously skeptical of IP laws in general are all on board to use them to attack AI companies, advocating that a copyright holder should have the right to control exactly how their work can be used by any downstream system.
The nightmare scenario I see is that if this principle is established for AIs, it will not stay restricted to AIs. Human artists too will eventually be required to create new commercial works only under license, based on the artistic works they have been exposed to. The result will be that it is impossible to create art commercially except by working for a large content company with a large existing IP library and cross-licensing agreements with other rights holders.
And if you think this is an impossible scenario, ask yourself when lawyers, lobbyists and politicians have ever shrunk from expanding legal rights for corporations at the expense of individuals, even when the vast majority of people would not agree to the expansion. Today's absurdity is tomorrow's legal reality, and the chance to fully control creative expression will be far too tempting to turn down.
Yeah, a lot of the objections I've heard about AI revolve around "it can copy an artist's style" or "they didn't intend it to be used that way", both of which are mostly legal today when done by humans, either because they qualify as fair use or because the material isn't copyrightable.
I don't want to change this! Let's not use "AI" as an excuse to tighten copyright even further. Last thing I want is more "Blurred Lines" style lawsuits[0].
[0] https://www.berklee.edu/berklee-today/summer-2015/coda/richa...
If I try to use Bing Chat for Enterprise to generate a picture of a cat in a top hat in Tim Burton's style, I receive an "Oops, you've violated our content policy" error message.
https://www.latimes.com/entertainment-arts/movies/story/2023...
Do you think a small scale artist can get their style placed on the naughty list, or just people with enough money and clout to sue?
Interestingly, "Disney" and "Disney's Frozen" styles passed.
https://www.404media.co/google-researchers-attack-convinces...
If AI companies and their benefactors are so concerned about this, they should be pushing for shorter copyright terms and more fair use exemptions. AFAIK they are not. Probably because they're hoping to lock up the AI models and possibly their output under the same onerous copyright system they flouted when it was convenient for "training". Rules should be applied fairly. It's galling to see small creators get content taken down by DMCA strikes and then have large, well-funded players scoop up everything they can with no issue.
"Skeptical of IP law" does not mean I believe IP law should not exist.
I believe copyright terms in the US are too long and should be more like 20 years, but I do not believe everything should be a free for all where any person or any program can immediately reuse any idea.
Length of time for copyright right now is life of the author plus 70 years. Which seems insane to me.
Length of time for a patent is 20 years now.
I agree 100pc. I also think that it's important to distinguish between IP violation for personal use and violation for profit.
If someone is using my IP for personal enjoyment or learning, that's great, but if they're using it to make megabucks like openAI and others intend to do, then I want my cut.
https://youtu.be/8L4HHPTiZN8?si=Ex3ixajKTMzkPIHm
An even more serious risk is the establishment of a de facto "pay per thought" society -- an economic state where mind/machine interfaces improve and proliferate as their capabilities grow, and memory and thought itself become monetized by the totalist application of copyright laws. The endgame of copyright is exceptionally bleak.
“We should allow AI to have unilateral access to licensed works for free cause then we will have to restrict human artists otherwise”
We already have separate laws for digital and human artists, and the law has, thus far, had no trouble avoiding the mistake of thinking that humans are CPUs or vice versa.
The problem isn't so much "reading" (i.e., training on or ingesting) the copyrighted content, it is writing it out again.
Artists, writers, and coders are rightly angered when I can simply say "make a painting of XYZ in the style of Foo", and have a painting rendered that is nearly an exact copy of an item in the training set. Similar analogy for writing or code.
I see now that when trying to specify an image of an impossible object with ChatGPT4+DALL-E, it will balk if I say "like an Escher object". I have to be more verbose, and it will not render his signature style.
As long as particular artistic or writing or coding styles aren't specify-able and generate-able, this seems like a reasonable fair use solution, and does not require us to suddenly become copyright maximalists. (Without such prohibitions on generation of similar works, we must become copyright maximalists.)
This is just a hit piece which tries to frame scraping the whole internet, as-is, as fair use, because that's profitable for the AI companies. That's all.
There are things on the internet which can be shared freely, but are not allowed to be altered. Similarly, there are things which are for your eyes only (source-available licensed software, for example), and not to be built upon.
AI companies bundle all of them, say that their models are learning like a human, that this is fair use, and that training data is non-reproducible. Then someone shows that their code, work, poem, whatever can be emitted as-is, damaging them; the same companies act surprised, add a couple of censor filters, and go on.
This is not fair use.
Whether it's fair use or not (under US law) is still very unclear to me. Consider the "fair use" criteria first articulated in Folsom v. Marsh and now codified at 17 U.S.C. § 107:
1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes
This is quite muddied. OpenAI is in some commercial and non-profit superposition, and many of its users are commercial and using the technology for commercial applications. But a huge swath of its users are using it for nonprofit educational purposes too. I use it primarily for learning, along with most people I know who use it. IMO there's no clear characterization here, given the information we have. Maybe a court could compel more information to clarify this.
2) the nature of the copyrighted work
I don't know enough about it to have an opinion.
3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
This is also unclear to me, because the models are effectively word-lossy compression, so are unlikely to reproduce substantial portions of works that impact the copyright holder. But they could in some cases.
4) the effect of the use upon the potential market for or value of the copyrighted work
Also unclear. I can easily imagine scenarios where the use of an LLM has a negative market impact on a copyright holder, but I can also imagine scenarios where it has a positive impact (ex "Give me some book recommendations"). What's the net impact? No idea.
The effect of each use matters. Not the net effect of all possible uses.
> OpenAI is in some commercial and non-profit superposition
This is the big issue I have with it that no one has yet had a satisfactory (to me anyway) answer to. The internet was scraped for all manner of writing, images, etc. to train these models for research. And then once the research was done (enough), they took the model and began selling it for profit.
Whether using publicly available content as training data is fair use is an interesting question, but OpenAI has gone to market with a product not merely without answering it, but while seemingly quite deliberately avoiding answering it, and AI hype people seem completely fine with that, despite it being a make-or-break question for the concept. And it's less that I think OpenAI owes royalties to every single person they collected training data from, and more that I'm extremely uncomfortable with and skeptical of the notion that, from my perspective, yet another major player in the tech space is abusing the public square to make a buck, which is unfortunately far from a new story.
Maybe I'm just getting old, but the rallying cry of disruption, where these startup companies come in half-cocked insisting an old industry is in dire need of innovation but plot twist, the innovation is just app-powered slavery again, they get seventy billion dollars to build an app and a website, crush an existing industry underneath the weight of VC money, then either a) price the service so it's no goddamn cheaper than the thing it replaced but now the people actually doing the work are somehow making even less money or b) go out of business and leave a husk of an industry left that barely functions and nobody wants to start it back up again is just... hollow.
Technology is great, but these massive corporations with the backing of unholy amounts of money just manifesting their destiny all over our society and economy and making us live with the consequences isn't.
Well, I was just now discussing Large Language Models (LLMs) as a technology with a friend, especially via this video [0].
He made a striking remark. LLMs compress the information they ingest in a variably-lossy way, and when you query them, they rebuild a representation from whole or part of this compressed data in a statistical manner. As a result they're a lossy storage medium.
Not unlike some music formats which store audio data in a lossy way. You know, this data can be restored with mathematical and statistical wizardry. You don't get everything you put in, but 97%-99% of it comes back out, and companies and the RIAA went bonkers for years, because even if it's not an exact reproduction, it was enough of a reproduction, and that is a copyright violation.
If an LLM can reproduce what I have written or coded with 97%-99% accuracy, without any license information, and I licensed this thing under a less-than-permissive license, and I sue the maker of that LLM, what will happen?
- Will it be a copyright infringement?
- Will it be fair use?
It's the former if you look at it fairly, but it'll probably be the latter, because otherwise money, fame and other corporate points would be at stake.
[0]: https://www.youtube.com/watch?v=zjkBMFhNj_g&start=323
Edit: I forgot to add the video. :)
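As an aside on that 97%-99% figure: here's a minimal Python sketch of one way you could even score how much of a work a model gave back, with difflib's ratio as a crude stand-in for whatever metric a court would actually credit (the two strings are invented placeholders, not real model output):

    import difflib

    def reproduction_accuracy(original: str, generated: str) -> float:
        # 0.0 = nothing in common, 1.0 = identical
        return difflib.SequenceMatcher(None, original, generated).ratio()

    # Invented placeholders standing in for "my work" and "the LLM's output":
    original = "This function parses the config file and returns a dict of settings."
    generated = "This function parses the config file and returns a dict of options."

    # Prints a similarity in the low-to-mid 90% range for this near-copy.
    print(f"{reproduction_accuracy(original, generated):.1%}")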
>This is also unclear to me, because the models are effectively word-lossy compression, so are unlikely to reproduce substantial portions of works that impact the copyright holder. But they could in some cases.
If you tried to submit schoolwork with only the minimal changes that even the best models make to the text they spit out, you'd be expelled, dude.
Contrary to the implication of your statement, language models don't actually understand what the words they spit out mean. They can regurgitate definitions.
Yeah, it's just copyright laundering. If this is deemed legal, then it's fair to return the favor by doing things like training models on proprietary source code to build free software. Effectively killing copyright would be the best possible outcome.
But it isn't. If the AI produces a work found to be a copyright violation it isn't magically not just because it was made by AI. This nonsense has laid bare for the world just how little people actually understand copyright.
The only interesting question is whether the models themselves violate copyright and it's a tough sell without being able to actually show the copy. Because even if it's able to make something similar to your work it doesn't show the model actually contains a copy of it. If you coax a model to make an image that violates copyright it's gonna be hard to say that it's anything other than you using an advanced drawing tool to copy an image.
> This is just a hit piece which tries to frame scraping whole internet as-is as fair-use, because it's profitable for the AI companies, that's all.
No, Matthew Lane's argument (the author of the article being Matthew Lane) is that the AI model as a whole cannot be ex-ante assumed to be an infringement on the copyrights to one or more works in the training set. Copyright is about the actual works, and infringement is about the similarity between actual outputs by models and existing works by humans.
My argument is that similarity of new works (including AI model outputs) to actual works (not to non-existing works, existing styles, or theoretical aggregations of multiple works) is a prerequisite to infringement. From an article about the substantial similarity test in the US [1]:
> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.[1][3]
In order to get an infringing output, the user usually has to include a reference to an existing author or an existing work in the prompt. Sometimes that's not the case (which I've worried about with respect to Copilot), but in order to damn the model as a whole, you would have to establish that in over some percentage of cases the model produces infringing outputs for prompts which don't reference a particular author (whether individual or collective), a particular work, or a style strongly associated with a single or a few authors.
[1] https://en.wikipedia.org/wiki/Substantial_similarity
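To sketch how that "some percentage of cases" could actually be estimated, a minimal Python example; generate, my_model, and known_works are hypothetical stand-ins, and the 0.8 threshold is arbitrary rather than any legal standard:

    import difflib

    def max_similarity(text: str, corpus: list[str]) -> float:
        # Highest similarity between one output and any single existing work.
        return max(difflib.SequenceMatcher(None, text, work).ratio() for work in corpus)

    def infringement_rate(generate, prompts: list[str], corpus: list[str],
                          threshold: float = 0.8) -> float:
        # Fraction of generic, non-referencing prompts whose output crosses
        # the similarity threshold against some known work.
        hits = sum(max_similarity(generate(p), corpus) >= threshold for p in prompts)
        return hits / len(prompts)

    # e.g. infringement_rate(my_model, ["a poem about the sea"], known_works)
    # where my_model and known_works are hypothetical.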
> AI companies bundle all of them, say that their models are learning like a human, and this is fair use and training data is non-reproducible. Then someone shows that their code, work, poem, whatever can be emitted as-is, damaging them; same companies act surprised, and a couple of censor filters, and go on.
If it is fair use, why are they adding censor filter exceptions at all?
"Tim Burton's style" appears to be a banned term on Bing Chat Enterprise.
Google already did that, 20 years ago.
But Google didn't train an LLM which stripped out all license, context and author information, and provided a mishmash of information which can't be guaranteed to be true.
They just mapped the connection between the sites, and provided summaries of it, derived from the sites themselves verbatim.
They didn't create new articles from what they had seen, generate new code from code they harvested while disregarding its license, etc.
I really don't see this as fundamentally different from digital sampling equipment for music that became popular in the 1980s.
Some new technology comes along that repurposes things created by others and allows people to use it as the implement of a new form of art. The digital sampler was used as a musical instrument in its own right - with great variation, skill, and taste applied to the arrangements of other people's copyrighted sounds.
There's a wide variety of opinion on digital sampling and it's really the same thing here. I'd be surprised if a particular person's views on the two are in conflict.
I haven't heard of anyone being compensated for their works making a profit for OpenAI.
It depends. A lot of the time it's just a new UMG song, say, sampling another UMG song, so it's just Hollywood accounting with cross-charging.
When, say, Madonna sampled the Bee Gees, I'm sure it was a large ordeal.
But for low profit or no profit work (independent stuff), the answer is nobody cares.
Bob James, one of the most sampled artists in history, takes it in stride. He's happy that so many people are listening to his stuff. The Winstons (of the famous Amen break) were also happy the track got such wide acceptance.
Killing Joke, on the other hand, felt "Come as You Are" was a rip-off of "Eighties", and only dropped it upon the death of Kurt Cobain.
Or take Toni Basil's "Mickey", which is actually a cover of Racey's "Kitty". Toni Basil has gone to court to secure pretty exclusive rights to the song. Racey does not get any of Toni's cash.
So it varies and it's messy.
Artists who make a lot of money commonly voluntarily license samples they use which are not used in a highly transformative way. Artists who were sampled by artists who did not license the sample may get some lawyers and threaten to sue. The threatened artist may settle.
For any cases that do make it to court, copyright generally only covers works in their entirety and quoting is explicitly allowed under fair use. The four tests used to determine fair use are length of the quote; whether the allegedly infringing use is for commercial or nonprofit uses; whether the allegedly infringing use interferes with the sales of the original work; and the nature of the quoted work (for example, facts are not copyrightable).
(This commentary is us-centric; other countries have different rules)
I was curious and checked how much the drummer behind the Amen break, undoubtedly one of the most sampled recordings in the world, has made. The answer was nothing.
The difference is that with sampling, a handful of pieces of other songs are usually mixed into a larger original work, whereas ML generators (AI is a misnomer) instead chop hundreds or thousands of works into a fine slurry and reconstitute them into something that resembles the average of all of them — their works are composed entirely of the work of others.
If ML generators become actual AI (AGI) and become able to apply abstracted concepts and use non-trained observations like humans do, enabling them to create works that aren't solely composed of samples of existing work, the comparison to music sampling makes more sense. AGI will probably also want fair treatment, unlike their ML model predecessors, and as a result probably won't be the wealth generators that many seem to be looking for, though…
I don't understand this discussion at all. There's no need to flip your position due to AI. Any produced work may or may not infringe copyright. If you produce a work that lifts, say, characters or expressions straight out of a story, you're likely violating copyright. If you used AI tools to produce that same work, it still does. It's irrelevant how you technically produced the work, by pen, typewriter or chatbot API.
Asking if training a model violates copyright is unintelligible because 100 GB of weights don't resemble any work unless you produce and publish inferences from it.
I don't think that argument has a lot of weight. By the same logic a zip file of a book doesn't resemble the original work if you look at the raw bytes but you can extract it and get it back. It would be hard to argue that the copyright violation happens only when you extract the zip, and not, say, if you distribute the archive.
The same logic doesn't apply because you can't get the data back out of the model by unpacking it. It's theoretically not possible, because the model is orders of magnitude smaller than the totality of the training data.
To use, and invert, the example from the other commenter: you can even get an existing poem out of a generative model even when that poem genuinely was not in the training data at all. This is because it's not an archive, which corresponds directly to one particular work, but a generative technology.
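To make the contrast concrete, a toy sketch only (the sentence and the "model" are invented, and a trivially small next-word chain stands in for an LLM): an archive has an extract operation that returns its input bit-for-bit, while a generative model only has sampling, which may or may not retrace any source text:

    import random
    import zlib
    from collections import defaultdict

    text = "the cat sat on the mat and the dog sat on the log"

    # Archive: lossless round trip, by construction.
    assert zlib.decompress(zlib.compress(text.encode())).decode() == text

    # "Model": next-word statistics learned from the text.
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)

    def generate(start: str, n: int = 12) -> str:
        out = [start]
        for _ in range(n):
            successors = chain.get(out[-1])
            if not successors:
                break
            out.append(random.choice(successors))
        return " ".join(out)

    print(generate("the"))  # sampling; no extract() exists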
> Asking if training a model violates copyright is unintelligible because 100 GB of weights don't resemble any work unless you produce and publish inferences from it.
I agree, but there are many who don’t, and assert that any training use should be banned. Those people are the ones who the article is talking about: people who suddenly became copyright maximalists due to their desire to halt AI training.
I don't want to halt AI training, I want corporations to fuck off from using my (A)GPL code to train their proprietary models which they then sell to people writing more proprietary code. I would be OK with it if the derived code were properly GPL-licensed too.
I suspect many people feel in a similar way too (for example, artists whose art is used to train image generators without compensation).
I'm not sure it's always true to say "100 GB of weights don't resemble any work".
If I train a model on a famous poem or something, and it turns out that if you ask the model to quote the poem verbatim, it can, then the model contains the poem. Have I not "copied" the poem into the model?
You can simply copy the poem in much less space if that's what you want, but incidental replication is not the same as copying.
When you train a model, you use 1T...20T tokens, with deduplication to reduce direct memorization. Then the model is a superposition of gradients from trillions of tokens, no longer just a copy of a poem; it could also write a commentary about it or compose new ones.
GPT4 summarizes this in a haiku:
"Superposition bright,
Gradients blend in the night,
New poems take flight."
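And a toy sketch of the deduplication point (nothing like real training, but it shows the direction of the effect): train the same trivial next-word model on a lone poem versus the poem plus other text. With a single source, every word has exactly one continuation, so generation can only replay the poem verbatim; mixing in more data breaks the verbatim path:

    from collections import defaultdict

    def train(texts: list[str]) -> dict:
        chain = defaultdict(set)
        for t in texts:
            w = t.split()
            for a, b in zip(w, w[1:]):
                chain[a].add(b)
        return chain

    poem = "so much depends upon a red wheel barrow glazed with rain water"
    memorized = train([poem])                                      # one source
    diluted = train([poem, "a red door and rain on the barrow road"])

    # With one source, every word has a single successor: pure regurgitation.
    print(all(len(nxt) == 1 for nxt in memorized.values()))  # True
    # More data makes continuations ambiguous, degrading verbatim recall.
    print(all(len(nxt) == 1 for nxt in diluted.values()))    # False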
Copyright exists only on complete works and not characters; however, any AI-produced materials which use trademarked characters would violate trademark. It's just that AI cannot violate copyright unless it produces an identical copy of an existing work.
Regarding copyright on characters in the United States [1]:
> US Copyright Statute of 1976 does not explicitly mention fictional characters as subject matter of copyright, and their copyrightability is a product of common law. Historically, the Courts granted copyright protection to characters as parts of larger protected work and not as independent creations. They were regarded as ‘components in a copyrighted works’ and eligible for protection as thus.[5]
But in practice, if you write a fanfiction story which uses Mickey Mouse as a character and get sued by Disney for copyright infringement, you will not win in court regardless of how different your fanfiction story is from any Disney story unless you can afford as many lawyers of comparable quality as Disney can. And even then, who knows?
[1] https://en.wikipedia.org/wiki/Copyright_protection_for_ficti...
> So defiant, in fact, that MRT had actually registered for copyright protection on the songs it was selling. "To make matters worse," says EMI's complaint, "Defendants recently sought to register their infringing sound recordings with the Copyright Office, apparently claiming that because they copied the sound recordings using their own computer system, they now own these digital copies and have the right to distribute them to the public."
https://twitter.com/kortizart/status/1588915427018559490/pho...
I'm going to take the side of the IP argument based on whatever hurts big business most. If they want to "steal" from the little guy by locking everything up behind copyright (or patents or trademarks) then I will oppose it. If they want to "steal" from the little guy by ingesting his works into the AI machine then I will oppose it.
Notice how governments and companies became "worried" about AI when people were starting to emulate certain styles in the images generated. Suddenly they needed to be regulated and censored (even harder).
Good take. Big companies should suffer so that the regular person can thrive and compete. It is clear that big AI companies virtue signal so that they can exploit regulations and become monopolies. In the end, the only thing enforcing these laws is a government that can only enforce through punishment and ultimately violence, and you can fend that off with enough money. AI training will continue to be done illegally and in the gray, both by regular parties and by those that can pay for the privilege.
Most of the artists I follow draw styles and subjects that can't easily be replicated using AI. Anyone with a refined enough taste should not feel threatened or relieved. The continued technological progress serves those who adapt. What you should fear is surveillance tech.
Jokes aside, the recent Supreme Court case involving his estate feels, to me as a non-lawyer, like some sort of precedent for what is and isn't transformation of IP.
Generative AI is almost always a service problem. If a model can offer customized art instantly, 24/7, for free, but the artist says you have to wait 3 months, it'll cost 500 dollars, and "I only accept PayPal", then the model's service is more valuable.
Or perhaps it’s time to reconsider whether the whole idea of “IP ownership” still makes sense. It was introduced because it was, on the whole, beneficial to society. Now that AI can make producing content much cheaper, do we still need that incentive?
> It was introduced because it was, on the whole, beneficial to society.
How do you know?
It was introduced; people have mostly not made the argument that it was beneficial to society.
You could argue that the fact that it was introduced shows that it was beneficial to society, but that theory has problems with laws that are repealed. Prohibition was introduced for the same reason, that it was beneficial to society. And it was repealed because that was beneficial to society too. Is that... true?
Do we want AI to be the only economical way for “content” to be created in the future? And for the corporate owners of AI to be the middle men on all creativity?
I’m no great fan of the current IP regime, but the economics of building, maintaining and operating LLMs have the potential to completely gut human creativity and replace it with a mechanical ouroboros eating its own tale.