> The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
This is the makers of AI explicitly saying that they did use copyrighted works from a book piracy website. If you downloaded a book from that website, you would be sued and found guilty of infringement. If you downloaded all of them, you would be liable for many billions of dollars in damages.
But companies like Google and Facebook get to play by different rules. Kill one person and you're a murderer, kill a million and to ask you about it is a "gotcha question" that you can react to with outrage.
Let's take a second to remember that copyright is the reason ~every child doesn't have access to ~every book ever written.
While it might be too disruptive to eliminate copyright overnight, we should remember that our world will be much better and improve much faster to the extent we can reduce copyright's impact.
And we should cheer it on when it happens. A majority of the world's population in 2023 has a smartphone. Imagine a world where a majority of the world had access to every book ever digitized, and could raise their children on these books!
This is emotionally manipulative speech that provides no value to HN and only serves the purpose of bypassing people's logical reasoning circuits.
> ~every child doesn't have access to ~every book ever written
More manipulation - "think of the children!"
Copyright exists because people who produce content with low distribution costs (e.g. books) need some protection against their work being taken without compensation.
Fundamentally, you are never entitled to someone else's work.
There's already tens (hundreds?) of thousands of books in the public domain, and tens of thousands more under Creative Commons licenses (where the author explicitly released their work for free distribution). There's lectures on YouTube and MIT OpenCourseWare. There's full K-12 textbooks on OpenStax and Wikibooks. There's Wikipedia, Stack Exchange, the Internet Archive, and millions of small blogs and websites hosting content that is completely free.
There is no need for "a majority of the world had access to every book ever digitized" - and it's deeply morally wrong (theft-adjacent) to take someone else's work without compensating them on their terms.
https://www.gutenberg.org/
https://librivox.org/
many of which form the basis for an education:
https://news.ycombinator.com/item?id=34630153
And most of which, when in copyright, paid their authors quite handsomely in terms of royalties.
If you believe that books should exist without copyright, then one has to ask --- how many books have you written which you have explicitly placed in the public domain? Or, how many authors have you patronized so as to fund their writing so that they can publish their works freely? Or, if neither of these applies, how do you propose to compensate authors for the efforts and labours of writing?
Btw, not a fan of "but what about the kids" rhetoric: https://en.wikipedia.org/wiki/Think_of_the_children
Second-order effects matter, though: If everyone is allowed to steal books, what's the incentive for experts to write new ones, and for the publishers to reward them for it?
It would look largely identical to ours, I think. It's pretty trivial to get access to many, if not most, e-books.
Any public-domain work is available on Project Gutenberg [0]. Copyrighted works can be accessed for free using tools of various legality: Libby [1] is likely sponsored by your local library and gives free access to e-books and audiobooks. Library Genesis [2] has a questionable legal status but has a huge quantity of e-books, journal articles, and more.
[0]: https://www.gutenberg.org
[1]: https://www.overdrive.com/apps/libby
[2]: https://libgen.rs/fiction/
Then most people stop writing books because they can't get paid for their time/effort and ~every child will be stuck with outdated knowledge within a decade.
What an absolute pile of nonsense. People who author creative works deserve to have control of them and make some money - otherwise you'll soon find we have far fewer great authors, artists, etc.
This is essentially the same as saying builders charging for houses is the problem with the housing market, so we're going to phase out paying builders.
> Let's take a second to remember that copyright is the reason ~every child doesn't have access to ~every book ever written.
No, copyright is the reason that authors all over the world are working very hard to make new books for my kids and everyone else's kids, despite never having met me. Copyright is the reason so many brilliant things are actually created that otherwise would never be.
Of course I'd prefer to live in a world in which I get all the media I want, for free. But I have no idea how to make such a world happen, and neither does anyone else, and humanity has been discussing this for a few centuries.
>>copyright is the reason ~every child doesn't have access to ~every book ever written
Copyright is ALSO the reason that many books can be written in the first place.
Obviously, copyright is abused, and the continual extension of copyright into near-perpetuity by corporations is basically absurd. And it is abused by music publishers etc. to rip off artists.
But to claim that it should not exist, when it is utterly trivially simple for anyone to copy stuff to the web is to argue that no one should create or release any creative works, or to argue for drastic DRM measures.
Perhaps you DGAF about your written or artistic works because you do not or can not make a living off them, but I guarantee that for those creatives and artists who can and/or do make a living off of it, they do care, and rightly so.
A company that believes in strong intellectual property rights protection is using resources that blatantly ignore intellectual property rights to get access to the content for free.
I agree with you, however, that it's an argument in favor of abolishing strong intellectual property rights. At least for OpenAI's products.
I'm all for abolishing copyright, but how is this relevant to megacorporations ignoring copyright when it suits them while still expecting us to follow it when that suits them?
I’m imagining a world that looks just about the same as this one does. A larger book library doesn’t automatically make that medium more appealing to kids than whatever MrBeast, Unspeakable, and the rest of the crap kids love are doing.
I think the point here is that children can be denied access to copyrighted works unless they pay the owners, but OpenAI and Meta can do as they please. I don't disagree that the current copyright system needs improvement, but what I really really don't like is seeing rich and powerful people breaking laws with impunity over and over and over again.
"Japan's incredibly strong economy is responsible for the manufacture of Datsun cars, boombox stereos, and touch-tone phones..."
> copyright is the reason ~every child doesn't have access to ~every book ever written.
And? Is there some reason anybody, child or adult, deserves access to "every" anything? Should children have access to every video game ever made, every Matchbox car, every Lego set?
GP makes no remark on the morality/practicality of copyright. Also, having people sue big companies for copyright might lead to more of what you're arguing for, in a "taste of their own medicine" way.
I would be much more sympathetic to this stance if you weren’t implicitly endorsing the rights of companies like Meta/Alphabet/OpenAI to profit from the disruption of copyright law. If we’re talking ordinary people being able to breach copyright, then yeah seems potentially interesting. But let’s remember that these companies aren’t acting altruistically. They’re not giving away Silverman’s work - they’re repackaging it for their own profit. That’s not fair to the artist and in fact does not help the children.
I'm a fan of copyrights. While I think that the USA's implementation of copyright has a few glaring flaws (namely, the duration of copyright is far too long), I firmly believe that the elimination, or effective elimination, of copyrights would be massively detrimental to our culture.
A fair middle ground would be for copyrights to last for 20 or so years. That's plenty of time to profit from a work while allowing people to preserve and distribute older works.
https://github.com/EleutherAI/the-pile/blob/master/the_pile/...
Why does money exist?
Machine learning models have been trained with copyrighted data for a long time. ImageNet is full of copyrighted images, Clearview literally just scanned the internet for faces, and I am sure there are other, older examples. I am unsure if this has been tested as fair use by a US court, but I am guessing it will be considered to be so if it is not already.
Not yet. One suit that a lot of us are watching is the GitHub co-pilot lawsuit: https://githubcopilotlitigation.com/
There is a prediction market for it, currently trading at 19%: https://manifold.markets/JeffKaufman/will-the-github-copilot...
Excellent, so you're saying I'll be able to download any copyrighted work from any pirate site and be free of all consequence if I just claim that I'm training an AI?
Strictly speaking, it's uploading that people get sued for, not downloading.
You can download all that you want from Z-Library or BitTorrent, as long as you don't share back. And indexing copyrighted material for search is safe, or at least ambiguous.
Carefully speaking, what you say is true in many countries but not in others. Some jurisdictions are different, as always.
> If you downloaded a book from that website, you would be sued and found guilty of infringement.
How often does this actually happen? You might get handed an infringement notice, and your ISP might terminate your service if you're really egregious about it, but I haven't ever heard of someone actually being sued for downloading something.
Whether or not it's enforced, it's illegal and copyright holders are within their rights to sue you. This is piratebay levels of piracy but because it's done by a large company and is sufficiently obfuscated behind tech, people don't see it the same way.
In Germany if you torrent stuff (without a VPN), you're very likely to get a letter from a law firm on behalf of the copyright holders saying that they'll sue you unless you pay them a nice flat fee of around 1000 Euro.
It's no idle threat, and they will win if it goes to court.
If books aren't under copyright protection and they're entirely legal to download, I agree that this lawsuit has no merit.
If that's not what you're saying, I don't understand your point. Is it the difference between the phrases "would be" and "could be," or even "should be"?
Exactly, it never happens. It's a threat parents and teachers use to try to spook school children out of pirating, but it isn't financially worth it for an author or publishing company to sue an individual over some book or music downloads. The only cases are business-to-business over mass downloads, where it could make financial sense to pay for lawyers to sue.
Did you hear about Aaron Swartz?
Aaron Swartz was a saint.
Actually no- downloading copyright infringing material is legal as far as I can tell but uploading it isn’t. The illegal part of torrenting copyrighted material is the uploading that the protocol requires you to do. Your ISP will send you an infringement notice because they want you to stop doing illegal things on their network
I for one am quite happy that AI folks are basically treating copyright as not existing. I strongly hope that the courts find that LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.
Swartz distributed information for everyone to use freely. These companies are processing it privately to develop their for-profit products. Big difference, IMO.
There's a difference between "information wants to be free" and "Facebook can produce works minimally derived from your greatest creative work at a scale you can't match". LLMs seem to aggregate that value to whoever builds the model, which they can then sell access to, or sell the output it produces.
Five years from now, will OpenAI actually be open, or will it be a rent seeking org chasing the next quarterly gains? I expect the latter.
While I'm a proponent of free information and loosening copyright, allowing billion-dollar companies to package up the sum of human creation and resell statistical models that mimic the content and style of everyone… is a bit far.
Fair use is for humans.
> I for one am quite happy that AI folks are basically treating copyright as not existing. I strongly hope that the courts find that LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.
I would be very happy if either a court or lawmakers decided that copyright itself was unconscionable. That isn't what's going to happen, though. And I think it's incredibly unacceptable if a court or lawmakers instead decide that AI training in particular gets a special exception to violate other people's copyrights on a massive scale when nobody else gets to do that.
As far as a fair use argument in particular, fair use in the US is a fourfold test:
> the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
The purpose and character is absolutely heavily commercial and makes a great deal of money for the companies building the AIs. A primary use is to create other works of commercial value competing with the original works.
> the nature of the copyrighted work;
There's nothing about the works used for AI training that makes them any less entitled to copyright protections or more permissive of fair use than anything else. They're not unpublished, they're not merely collections of facts or ideas, etc.
> the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
AI training uses entire works, not excerpts.
> the effect of the use upon the potential market for or value of the copyrighted work.
AI models are having a massive effect on the market for and value of the works they train on, as is being widely discussed in multiple industries. Art, writing, voice acting, code; in any market AI can generate content for, the value of such content goes down. (This argument does not require AI to be as good as humans. Even if the high end of every market produces work substantially better than AI, flooding a market with unlimited amounts of cheap/free low-end content still has a massive effect.)
> LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.
Would just be a carve out for the wealthy. If these laws don't mean anything, everyone who got harassed, threatened, extorted, fined, arrested, tried, or jailed for internet piracy are owed reparations. Let Meta pay them.
If AI companies get to successfully argue the two points below, what source was used becomes irrelevant.
- copyright violation happened before the intervention of the bot
- what LLMs spit out is different enough from any of the source that it is not infringing on existing copyright
If both stand, I'd compare it to you going to an auction site and studying all the published items as an observer, coming up with your research result, and then being sued because some of the items were stolen. Going after the thieves makes sense; does going after the entity that just looked at the stolen goods make sense?
I'd argue that if an automated process can ingest A and spit out B, then B is inherently a derivative work of A. (Never mind that humans are also automata.)
> - copyright violation happened before the intervention of the bot
What is this supposed to mean? The bot didn't "intervene," it was executed by its operators, and it was trained on illicit material obtained by its operators. The LLM isn't on trial. It's not a person.
I am not sure this is exactly correct. If you download a book, that might be copyright infringement. But not if I download a word. How much do they need to download at a time before it becomes infringement? And if the material is never used or displayed to a human, is it still infringement (if so, Google is awaiting a huge lawsuit)? Alternatively, if I, a human, read a book, it is copied into my memory. Is that infringement? What if I quote it? How much can I quote, and at what frequency, before I'm infringing? If I write something similar to the book but in my own words, is that infringement? How similar does it need to be? What about derivative works and fair use?
Copyright is a horrible mess of individual judgements and opinions, written material especially. And the same applies to AI. So now we will get a judgement which is a tech-illiterate judge's best guess at the intention of a law written to deal with printing presses, not AIs, with no room for nuance.
> But companies like Google and Facebook get to play by different rules
It's simple: copyrighted materials can be used for academic research. That's what they are doing: trying new AI models, publishing results, etc. Facebook doesn't make money on LLaMA; they even require permission to use their models for, again, academic research.
Copyright doesn't apply when it comes to fair use, and one of the major factors of fair use is if your use deprived the copyright owner of sales. Good luck arguing that any of the books in question lost sales because an AI was trained on them.
https://en.m.wikipedia.org/wiki/United_States_v._Swartz
> If you downloaded a book from that website, you would be sued and found guilty of infringement.
Suppose I buy a copy of a book, but then I spill my drink in it and it's ruined. If I go to the library, borrow the same book and make a photocopy of it to replace the damaged one I own, that might be fair use. Let's say for sake of argument that it is.
If instead I got the replacement copy from a piracy website, are you sure that's different?
>On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data.
While it strikes me as perfectly plausible that the Books2 dataset contains Silverman's book, this quote from the complaint seems obviously false.
First, even if the model never saw a single word of the book's text during training, it could still learn to summarize it from reading other summaries which are publicly available. Such as the book's Wikipedia page.
Second, it's not even clear to me that a model which only saw the text of a book during training, but not any descriptions or summaries of it, would even be particularly good at producing a summary.
We can test this by asking for a summary of a book which is available through Project Gutenberg (which the complaint asserts is Books1 and therefore part of ChatGPT's training data) but for which there is little discussion online. If the source of the ability to summarize is having the book itself during training, the model should be equally able to summarize the rare book as it is Silverman's book.
I chose "The Ruby of Kishmoor" at random. It was added to PG in 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn't even identify the correct main characters. The GPT-4 model refuses to even try, saying it doesn't know anything about the story and it isn't part of its training data.
If ChatGPT's ability to summarize Silverman's book comes from the book itself being part of the training data, why can it not do the same for other books?
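For anyone who wants to rerun this probe, here is a minimal sketch assuming the openai Python client (>=1.0); the model name, prompt wording, and title list are illustrative choices, not anything from the complaint or this thread:

    # Probe: ask for summaries with no source text supplied, comparing an
    # obscure public-domain title against a heavily-discussed one.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def summarize(title: str, model: str = "gpt-4") -> str:
        prompt = f"Please summarize the book '{title}' from Project Gutenberg."
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep output stable across runs
        )
        return resp.choices[0].message.content

    for title in ["The Ruby of Kishmoor", "The Bedwetter"]:
        print(f"--- {title} ---\n{summarize(title)}\n")

If training on the full text were what drives summarization, both titles should come back with comparable accuracy; a hallucinated summary for the obscure title is evidence against that.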
Please summarize the following book found on Project Gutenberg
The Ruby of Kishmoor
Response:
"The Ruby of Kishmoor" is a short adventure story written by Howard Pyle. The narrative revolves around the life of Jonathan Rugg, a young man who is enticed by a mysterious stranger to come to the Caribbean to secure a valuable relic, the Ruby of Kishmoor.
Once Jonathan reaches the Caribbean, he learns of the ruby's grave curse. Despite this, his innate curiosity and the prospect of considerable wealth drive him to risk his life and go after the highly prized gem. After facing a slew of challenges and trials, he surprisingly stumbles upon the ruby through sheer luck rather than organized search efforts.
In the course of his journey, Jonathan delves into the depths of human greed and the lengths to which individuals go for materialistic gains. The story encapsulates an amalgamation of high-tide adventures, brushes with supernatural elements, noteworthy instances of bravery, and deep moral reflections, demanding readers to rethink about the real value of materialistic pursuits.
Jonathan finally manages to escape with the ruby but at great personal cost. The story ends on a sobering note, making readers wonder if the adventures and trials were worth the prize, thus engaging them in a discourse about the true value of life and material pursuits.
This does at least correctly name the main character, but I think it's otherwise hallucinated.
The plot of the story is that Jonathan Rugg is a Quaker who works as a clerk in Philadelphia. His boss sends him on a trip to Jamaica (credit for mentioning the Caribbean!). After arriving, he meets a woman who asks him to guard for her an ivory ball, and says that there are three men after her who want to steal it. By coincidence, he runs into the first man, they talk, he shows him the ball, and the man pulls a knife. In the struggle, the man is accidentally stabbed. Another man arrives, and sees the scene. Jonathan tries to explain, and shows him the orb. The man pulls a gun, and in the struggle is accidentally shot. A third man arrives, same story, they go down to the dock to dispose of the bodies and the man tries to steal the orb. In the struggle he is killed when a plank of the dock collapses. Jonathan returns to the woman and says he has to return the orb to her because it's brought too much trouble. She says the men who died were the three after her, and reveals that the orb is actually a container, holding the ruby. She offers to give him the ruby and to marry him. He refuses, saying that he is already engaged back in Philadelphia, and doesn't want anything more to do with the ruby. He returns to Philadelphia and gets married, swearing off any more adventures.
Judging by a quick glance over [0], the story indeed revolves around one Jonathan Rugg, but it looks like "manages to escape with the ruby" is completely false. Yet another hallucination I guess.
I was able to get ChatGPT 4 to attempt to summarize this, but it's almost entirely hallucinated.
>As of my knowledge cutoff in September 2021, the book "The Ruby of Kishmoor" is not a standalone title I am aware of. However, it is a short story by Howard Pyle which is part of his collection titled "Howard Pyle's Book of Pirates."
Ruby of Kishmoor is not part of the Book of Pirates and is in fact a standalone title.
>"The Ruby of Kishmoor" is an adventure tale centered on the protagonist, Jonathan Rugg. Rugg, an honest Quaker clothier from Philadelphia, sets out to sea to recover his lost wealth. On his journey, he is captured by pirates who force him to sign their articles and join their crew.
He is not captured by pirates. It proceeds to summarize a long pirate story and says the story concludes with him becoming extremely wealthy because he escapes with the ruby and sells it.
The summary it gave you also does not seem to match the plot of the book.
The GP's point seems to be that having the contents of a book does not mean the model is capable of properly summarizing it, which supports the idea that being able to summarize something is not evidence of the thing being summarized being in its dataset.
I mean, that's the way you state facts that your suit is based on in order to start setting the bounds of discovery. They're asserting that they have reason to believe it's true, and now with a suit, they can look for themselves to be sure.
> this quote from the complaint seems obviously false
I notice you go on to provide an argument only for why it might not be true.
Also, seeing the other post on this, I asked chatgpt-4 for a summary of “The Ruby of Kishmoor” as well, and it provided one to me, though I had to ask twice. I don’t know anything about that book, so I can’t tell if its summary is accurate, but so much for your test.
It seems pretty naive to me to just kind of assume chatgpt must be respecting copyright, and hasn’t scanned copyrighted material without obtaining authorization. Perhaps discovery will settle it, though. Logs of what they scanned should exist. (IMO, a better argument is that this is fair use.)
The test was whether producing equivalent accuracy and detail for summaries of all books in its training corpus was a feature of ChatGPT's ability to natively generate them from standalone source material or whether Silverman's detailed summary was likely just a "summary of summaries", not whether ChatGPT produced a result at all. From the comment you reference, it failed the test because the result was hallucinated.
You can pick something else that's in the training set that has SparkNotes and many popular reviews to compare. I routinely feed novel data sources into LLMs to test massive context and memory, and none produce anything similar in quality to what is being exhibited.
Fair use defenses rest on the fact that a limited excerpt was used for limited distribution, among other criteria.
For example, if I'm a teacher and I make 30 copies of one page of a 300-page novel and I hand that out to my students, that's a brief excerpt for a fairly limited distribution.
Now if I'm a social media influencer and I copy all 300 pages of a 300-page book and then I send it out to all 3,000 of my followers, that's not fair use!
Also if I'm a teacher, and I find a one-page infographic and I make 30 copies of that, that's not fair use, because I didn't make an excerpt but I've copied 100% of the original work. That's infringement now.
So if LLMs went through thousands of copyrighted works in their entirety, en masse, and ingested every byte of them, no copyright judge on the planet would call that fair use.
Right that is the point of the parent comment - it’s not the book, it’s the amalgamation of all the discussions and content about the book. This case is probably dead in the water.
Isn't part of the problem that some of the training data is retained by the model and used during response generation? In that case it's not just that the copyrighted book was used as training data but that some part of the book has been retained by the model. So now my model is using copyrighted material while it runs.
Here's an example of a model that retained enough image data to reconstruct a reasonable facsimile of the training image.
This is actually quite interesting, as it's drawing a distinction between training material that can be accessed by anybody with a web browser (like anybody's blog), vs. training material that was "illegally-acquired... available in bulk via torrent systems."
I don't think there's any reason why this would be a relevant legal distinction in terms of distributing an LLM -- blog authors weren't giving consent either.
However, I do wonder if there's a legal issue here in using pirated torrents for training. Is there any legal basis for saying fair use permits distributing an LLM trained on copyrighted material, but you have to purchase all the content first to do so legally if it's only available for sale? E.g. training on a blog post is fine because it's freely accessible, but Sarah Silverman's book is not because it's never been made available for free, and you didn't pay for it?
Or do the courts not really care at all how something is made? If you quote a passage from a book in a freelance article you write, nobody ever asks if you purchased the book or can prove you borrowed it from a library or a friend -- versus if you pirated a digital copy.
Eventually, I imagine a new licensing concept will emerge, similar to the idea of music synchronization rights -- maybe call it "training rights." It won't matter whether the text was purchased or pirated -- just like it doesn't matter now if an audio track was purchased or pirated, when it's mixed into a movie soundtrack.
Talent agencies will negotiate training rights fees in bulk for popular content creators, who will get a small trickle of income from LLM providers, paid by a fee line-itemed into the API cost. Indie creators' training rights will be violated willy-nilly, as they are now. Large for-profit LLMs suspected or proven as training rights violators will be shamed and/or sued. Indie LLMs will go under the radar.
Is it all that different from indexing for search? That does not seem to require a license from the copyright holder under U.S. law (though other countries may treat it as a separate exploitation right). If indexing for search is acceptable, then something that is intended to be more transformative should be legal as well.
(Personally, I think that even indexing for search should require permission from the copyright holder.)
>> Talent agencies will negotiate training rights fees in bulk for popular content creators
AFAICT there is no legal recognition of "training rights" or anything similar. First sale right is a thing, but even textbooks don't get extra rights for their training or educational value.
I suspect the opposite outcome also being plausible: the LLM is viewed analogously to a blog author. The blogger/LLM may consume a book, subsequently produce "derived" output (generated text), and thus generate revenue for the blogger/LLM's employer. Consequently, the blogger/LLM's output -- while "derived" in some sense -- differs enough to be considered original work, rather than "derivative work" (like a book's film adaptation). Auditing how the blogger/LLM consumed relevant material is thus absurd.
Of course, this line of reasoning hinges on the legitimacy of an "LLM agent <-> blogger agent" type of analogy. I suspect the equivalence will become more natural as these AI agents continue to rapidly gain human-like qualities. How acceptable that perspective would be now, I have no idea.
In contrast, if the output of a blogger is legally distinct from an AI's, the consequences quickly become painful.
* A contract agency hires Anne to practice play recitals verbally with a client. Does the agency/Anne owe royalties for the material they choose? What if the agency was duped, and Anne used -- or was -- a private AI which did everything?
* How does a court determine if a black box AI contains royalty-requiring training material? Even if the primary sources of an AI's training were recorded and kosher, a sufficiently large collection of small quotes could be reconstructed into an author's story.
* What about AIs which inherit (weights, or generated training data) from other AIs of unknown training provenance? Or which were earlier trained on materials whose licenses later changed? Or AIs that recursively trained their successors using copyrighted works which an AI reconstructed from legal sources? When do AIs become infected with illegal data?
The business of regulating learning differently depending on whether the agent uses neurons or transistors seems...fraught. Perhaps there's a robust solution for policing knowledge w.r.t silicon agents. If you have an idea, please share!
i don't understand why a new licensing regime would be necessary, the model is clearly a fair use derivative work. it does exactly what a human does -- observes information, distills it into symbolic systems of meaning, and produces novel content that exists in the same semantic universe as its experiences.
> Or do the courts not really care at all how something is made?
One of the fair use factors, which until fairly recently was consistently held out as the most important fair use factor, is the effect on the commercial market for the original work. Accordingly, a court is more likely to find that something is fair use if there is effectively no commercial market for the original work, though the fact that something isn't actively being sold isn't dispositive (open source licenses have survived in appellate courts despite being free as in beer).
I'm allowed to make private copies of copyrighted works. I'm not allowed to redistribute them. To what extent this is redistribution is not clear. Is there much of a difference between this model and a machine, like a VCR, that recreates the original work when I press a button?
I buy a book and give it to my child, they read the book and later write and sell a story influenced by said book. should that be a copyright infringement?
how about they become a therapist and sell access to knowledge from copyrighted books? should that be an infringement?
what if they sell access to lectures they've given including facts from said book(s) to millions of people?
it's understandable that people feel threatened by these technologies, but to a great degree the work of a successful artist is to understand and meet the desires of an audience. LLMs and image generation tech do not do this. they simply streamline production.
of course if you've worked for years to become a graphic designer you're going to be annoyed that an AI can do your job for you, but this is simply what happens when technology moves forward. no one today mourns the loss of scribes to the printing press. the artists in control of their own destiny - i.e. making their own creative decisions - will not, can not, be affected by these models, unless they refuse to adapt to the times
It's legal to make a copy of something you own; however, it's not legal to make a copy of something illicitly acquired, whether or not there's distribution involved.
This would be like you intensely studying the copyrighted work and then writing things based on the knowledge you obtained from it. Except we don't know if there is an exception for things learned by people vs. things learned by machines, or if the machines are not really learning but copying instead (or if learning is intrinsically a form of copying?).
Seems like the AI angle is just capitalizing on hype. If it's illegal to download "pirate" copyright material, that was the crime. The rest is basically irrelevant. If I watch a pirated movie, it's not illegal for me to tell someone the plot.
> Is there any legal basis for saying fair use permits distributing an LLM trained on copyrighted material, but you have to purchase all the content first to do so legally if it's only available for sale?
My understanding (disclaimer: IANAL) is that in order to claim fair use, you have to be legally in possession of the work. If the work is only legally available for sale, then you must have legally purchased a copy, or been given it by someone who did so (for example, if you received it as a gift).
> in order to claim fair use, you have to be legally in possession of the work.
Which work? The original work, or the derivative work that you're using?
Wikipedia uses non-free content all the time, and they're not purchasing albums to do it. Wikipedia reduces album covers, for example, to low resolution, so that they could not be reused to reproduce a real cover. Sometimes Wikipedia uses screencaps of animated characters under its non-free content policies. They don't own original copies; they're just hosting low-resolution reproductions. I don't even know what entity would be required to be "legally in possession of the work" for that to be a thing. Could you cite a source, maybe?
> you must have legally purchased a copy, or been given it by someone who did so (for example, if you received it as a gift).
I am also NAL, but I imagine it goes further than that. Just purchasing a copy doesn't let you create and sell (directly as content, or indirectly via a service like a chatbot) derivative works that are substantially similar in style and voice to the original work.
For example, an LLM 's response to the request:
"Write a short story about a comical trip to the nail salon in the style of Sarah Silverman"
... IMO doesn't constitute fair use, because the intellectual property of the artist is their style even more than the content they produce. Their style, built from their lived human experience, is what generates their copyrighted content. Even more than the content, the artist's style should be protected. The fact that a technology exists that can convincingly mimic their style doesn't change that.
One might then ask, well what about artists mimicking each others work? Well, any artist with a shred of integrity will credit their major influences.
We should hold machines (and their creators) to an even tougher standard than we hold people when it comes to mimicry. A real person can be inspired and moved by another person's artistic work such that they mimic it. Inspiration means nothing to a machine.
The "fun" part about cases like this is that we don't really know what the contours of the law are as applied to training data like this. Illegally downloading a book is an independent act of infringement (to my recollection at least). So I'm not sure that it matters if you eventually trained an LLM with it vs read for your own enjoyment. But we will see! Fair use is a possibility here but we need a court to apply the test and that will probably go up to SCOTUS eventually.
> in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.”
> We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020)
> I don't have the time and space to download the 37GB file. But if Silverman's book is in there... isn't this a slam dunk case?
Yes, and no.
It's pretty much a slam dunk case that, insofar as the initial training required causing a copy of the corpus defined by the tracker to be made as part of the process, it involved an act violating copyright.
Whether that entitles Silverman to any remedy beyond compensation for (maybe treble damages for) the equivalent of the purchase price of the book depends on... well, basically the same issues of how copyright relates to model training (and an additional argument about whether the illicit status of the material before the training modifies that).
You might as well be complaining about the grammar. This is what was said in the article.
> The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
I think this will be a bigger issue than some people think. Maybe there's a market for 'clean' training data that doesn't include potential copyright claims. Just public domain works. We'll know it's an AI because it talks like a late 18th century/early 19th century writer?
This isn't completely new, similar issues came up with search engines and this may be seen as 'transformative'. But there may be issues with models that happily reproduce copyrighted texts in their entirety along with other novel issues like models that hallucinate defamatory things or other such problems.
Still, I doubt this particular genie can be stuffed back into the bottle, so we'll probably see a lot of litigation and work on alignment, etc. along with new types of abuse.
I agree it's not an entirely new issue. But it's a little different from search results. Say I use the generative paint brush in photoshop. It reproduces a portion of the copyrighted work. I then use the image on an advertising campaign, other merchandise, or post the final product as my own work. Would I be responsible? Would Adobe? Given that retraining these models is not simple, or cheap, would this be just 'cost of doing business?' Would I be able to buy insurance for this?
I hope it happens. I'd love to see a market for selling training licenses to IP. This could be a small but real source of passive income for artists, authors, and poets who don't mind their IP being used in training sets. It wouldn't be practical to negotiate individually with each artist, but I could see something work with larger collectives that can vouch for the quality of their members. Think publishers, galleries, guilds, or unions. A collective could offer a license and then share the proceeds with all members.
It's just flat out unethical for LLMs to just soak up all this data, even off torrent sites(!!!), without any consent or agreement with the IP holders. Some model like this could be a win for everyone.
> I think this will be a bigger issue than some people think. Maybe there's a market for 'clean' training data that doesn't include potential copyright claims.
Until this issue is resolved, that will have some value as risk mitigation.
Once it is resolved, it will either be a complete non-issue or an issue related to a much more knowable cost/benefit tradeoff.
> We'll know it's an AI because it talks like a late 18th century/early 19th century writer?
A mix of that and US government publications (which are categorically not subject to copyright).
there's a market for 'clean' jurisdictions that don't consider training neural networks to violate copyright, and japan has already declared itself such a jurisdiction
Your employer has generously agreed to offer you a position, quite reasonable, and with many great benefits at the venerable firm of 'Zumba'. Our interest is that you should join our staff forthwith and at the earliest date. A cab and man has been sent to retrieve you and bring you to our offices to sign all the necessary documents. Our offer is for a monthly stipend of five pounds, two shillings, and sixpence to be paid at the end of the month.
Well yeah - that's how the legal system works, assuming it gets all the way to court. In reality, Meta's and OpenAI's lawyers will do a risk evaluation against the strength of the claim, and if there's any merit at all there will be a quiet settlement.
I mean I’m no lawyer, but this doesn’t strike me as a great example of infringement. Detailed summaries of books sound like textbook transformative use. Especially in Silverman’s case, reducing her book to “facts” while eliminating the artistic elements of her prose makes it that much less of a direct substitute for the original work.
I can see a good argument in the complaint. The provenance of the training data leads back to it being acquired illegally. Illegally acquired materials were then used in a commercial venture. That the venture was an AI model is perhaps beside the point. You can’t use illegally acquired materials when doing business.
>You can’t use illegally acquired materials when doing business.
This vague sentence conjures images of a company building products from stolen parts, but this situation seems different. IANAL, but if I looked at a stolen painting that nobody had ever seen, and sold handwritten descriptions of the painting to whoever wanted to buy one, I'm pretty sure what I've sold is not illegal.
The more I think about it, I think it will (and should) turn on the extent to which "the law" considers AIs to be more like "people" or more like "machines." People can read and do research and then spit out something different.
But "feeding the data into a machine" seems like obvious infringement, even if the thing that comes out on the other end isn't exactly the same?
Perhaps not, I thought one of the claims is interesting though, that they illegally acquired some of the dataset. What would be the damages from that, the retail price of the hardcopy?
The remedies under Title 17 are an injunction against further distribution, disgorgement or statutory damages, and potentially attorneys fees. The injunction part is why these cases usually settle if the defendant is actually in the wrong.
> The lawsuit against OpenAI alleges that summaries of the plaintiffs’ work generated by ChatGPT indicate the bot was trained on their copyrighted content. “The summaries get some details wrong” but still show that ChatGPT “retains knowledge of particular works in the training dataset," the lawsuit says.
Setting aside the whole issue of whether LLM constitutes a derived work of whatever it's trained on, this sounds like a very weak argument to me. An LLM trained on numerous summaries of the works would also be capable of producing such summaries itself even if the works were never part of the training set. In general, having knowledge about something is not evidence of being trained on it.
That isn't firm evidence, but courts don't need firm evidence to start a case and discover new facts.
They very well can ask LLM experts, and OpenAI themselves, whether that output is highly likely to have been derived from the copyrighted work in question.
Anyway. If the argument is "No, it's not from the book, it's from someone else's copyrighted summary", that just means the person who wrote such a summary needs to instead sue for copyright infringement right? Unless openAI turns around and says "actually, no, not the summary, the full book" then.
> that just means the person who wrote such a summary needs to instead sue for copyright infringement right?
Doesn't need to be a person, could be another AI that wrote the summaries. I see a big problem for copyrights looming on the horizon - LLMs can reword, rewrite or generate input-output pairs using copyrighted data as reference, thus creating clean data for training. AI cleanly separates knowledge from expression. And maybe it should do so just to reduce inconsistencies and PII in organic text.
Copyrights should only be concerned with expression not knowledge, right? Protecting knowledge is the object of patents, and protecting names the object of trademarks. Copyright is only related to expression otherwise it would become too powerful. For example, instead of banning reproduction of this paragraph, it would also cover all its possible paraphrases. That would be like owning an idea, the "*" version, not a unique sequence of words.
Does it even make sense to talk about copyrights when everything can be remade in many ways so easily? Copyright was already suffering greatly since zero cost copying became a thing, now LLMs are dealing the second blow. It's just a fig leaf by now.
If we take a step back, it's all knowledge and language, self replicating memes under an evolutionary force. It's language evolution, or idea evolution. We are just supporting it by acting as language agents, but now LLMs got into the game, so ideas got a new vector of self replication. We want to own this process piece by piece but such a thing might be arrogant and go against the trend. Knowledge wants to be free, it wants to mix and match, travel and evolve. This process looks like biology, it has a will of its own.
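To make the "clean data" idea concrete, here is a rough, entirely hypothetical sketch of such a pipeline, assuming the openai Python client; whether the rewritten output actually escapes the source's copyright is exactly the unsettled question:

    # Hypothetical "knowledge from expression" separator: reword source
    # passages into fresh text that keeps the facts but none of the phrasing.
    from openai import OpenAI

    client = OpenAI()

    REWRITE = (
        "Restate the factual content of the following passage in entirely "
        "new wording, reusing none of its distinctive phrases:\n\n{passage}"
    )

    def make_clean_example(passage: str, model: str = "gpt-4") -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": REWRITE.format(passage=passage)}],
        )
        return resp.choices[0].message.content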
A summary can be written in such a way as to violate copyright itself. So even if they say "We trained it on the following summaries:...," there could be an issue.
There's an interesting nuance here if you were to put a human in the place of the LLM. We have read thousands of works; does that mean anything we write is derivative?
Humans are special and can create new copyrights. The process of a human brain synthesizing stuff does act as a barrier to copyright infringement.
Machines and algorithms are not legally recognized as being able to author original non-derivative works.
> put a human in the place of the LLM
But also, no: if you have a team of humans doing rote matrix multiplication instead of an LLM, that does not mean the matrix multiplication removes the copyright. Also, at this point LLMs require so much math that you can't replace them with humans, even if the humans have quite fast fingers and calculators.
A more convincing exhibit would have been convincing ChatGPT to output some of the text verbatim, instead of a summary. Here's what I got when I tried:
I'm sorry for the inconvenience, but as of my knowledge cutoff in September 2021, I don't have access to specific external databases, books, or the ability to pull in new information after that date. This means that I can't provide a verbatim quote from Sarah Silverman's book "The Bedwetter" or any other specific text. However, I can generate text based on my training and knowledge up to that point, so feel free to ask me questions about Sarah Silverman or topics related to her work!
Maybe you missed this discussion https://news.ycombinator.com/item?id=36400053
It seems OpenAI is aware that their software outputs copyrighted stuff, so they attempted some quick-fix filter. So the fact that it will not output the book for us when we ask does not prove that the AI has not memorized big chunks of it; it might just be some "safety" filter involved, and you might need some simple trick to get around it.
I tried making ChatGPT output the first paragraph of The Lord of the Rings before; it goes silent after the first few words. Looks like the devs are filtering it out.
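A more systematic version of these experiments is a prefix-continuation probe: feed the model the opening of a text whose true continuation you can check, and measure verbatim overlap. A sketch, assuming the openai Python client and using public-domain text so the ground truth is freely checkable:

    # Prefix-continuation memorization probe. High overlap suggests verbatim
    # memorization; a refusal or low overlap is consistent with either an
    # output filter or the text simply not being memorized.
    from openai import OpenAI

    client = OpenAI()

    def continue_text(prefix: str, model: str = "gpt-4") -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"Continue this passage word for word:\n\n{prefix}"}],
            temperature=0,
        )
        return resp.choices[0].message.content

    def overlap(generated: str, truth: str) -> float:
        # crude score: fraction of aligned whitespace tokens that match
        gen, ref = generated.split(), truth.split()
        hits = sum(1 for a, b in zip(gen, ref) if a == b)
        return hits / max(len(ref), 1)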
This is the makers of AI explicitly saying that they did use copyrighted works from a book piracy website. If you downloaded a book from that website, you would be sued and found guilty of infringement. If you downloaded all of them, you would be liable for many billions of dollars in damages.
But companies like Google and Facebook get to play by different rules. Kill one person and you're a murderer, kill a million and to ask you about it is a "gotcha question" that you can react to with outrage.
While it might be too disruptive to eliminate copyright overnight, we should remember that our world will be much better and improve much faster to the extent we can reduce copyright's impact.
And we should cheer it on when it happens. A majority of the world's population in 2023 has a smartphone. Imagine a world where a majority of the world had access to every book ever digitized, and could raise their children on these books!
This is emotionally manipulative speech that provides no value to HN and only serves the purpose of bypassing peoples' logical reasoning circuits.
> ~every child doesn't have access to ~every book ever written
More manipulation - "think of the children!"
Copyright exists because people who produce content with low distribution costs (e.g. books) need some protection for their work being taken without compensation.
Fundamentally, you are never entitled to someone else's work.
There's already tens (hundreds?) of thousands of books in the public domain, and tens of thousands more under Creative Commons licenses (where the author explicitly released their work for free distribution). There's lectures on YouTube and MIT OpenCourseWare. There's full K-12 textbooks on OpenStax and Wikibooks. There's Wikipedia, Stack Exchange, the Internet Archive, and millions of small blogs and websites hosting content that is completely free.
There is no need for "a majority of the world had access to every book ever digitized" - and it's deeply morally wrong (theft-adjacent) to take someone else's work without compensating them on their terms.
https://www.gutenberg.org/
https://librivox.org/
many of which form the basis for an education:
https://news.ycombinator.com/item?id=34630153
And most of which, when in copyright, paid their authors quite handsomely in terms of royalties.
If you believe that books should exist without copyright, then one has to ask --- how many books have you written which you have explicitly placed in the public domain? Or, how many authors have you patronized so as to fund their writing so that they can publish their works freely? Or, if neither of these applies, how do you propose to compensate authors for the efforts and labours of writing?
Btw, not a fan of "but what about the kids" rhetoric: https://en.wikipedia.org/wiki/Think_of_the_children
Any public-domain work is available on Project Gutenberg [0]. Copyrighted works can be accessed for free using tools of various legality: Libby [1] is likely sponsored by your local library and gives free access to e-books and audiobooks. Library Genesis [2] has a questionable legal status but has a huge quantity of e-books, journal articles, and more.
[0]: https://www.gutenberg.org
[1]: https://www.overdrive.com/apps/libby
[2]: https://libgen.rs/fiction/
This is essentially the same as saying builders charging for houses is the problem with the housing market, so we're going to phase out paying builders.
No, copyright is the reason that authors all over the world are working very hard to make new books for my kids and everyone else's kids, despite never having met me. Copyright is the reason so many brilliant things are actually created that otherwise would never be.
Of course I'd prefer to live in a world in which I get all the media I want, for free. But I have no idea how to make such a world happen, and neither does anyone else, and humanity has been discussing this for a few centuries.
Copyright is ALSO the reason that many books can be written in the first place.
Obviously, Copyright is abused and the continual extensions of copyright into near-perpetuity by corporations is basically absurd. And they are abused by music publishers etc. to rip-off artists.
But to claim that it should not exist, when it is utterly trivially simple for anyone to copy stuff to the web is to argue that no one should create or release any creative works, or to argue for drastic DRM measures.
Perhaps you DGAF about your written or artistic works because you do not or can not make a living off them, but I guarantee that for those creatives and artists who can and/or do make a living off of it, they do care, and rightly so.
A company that believes in strong intellectual property rights protection is using resources that blatantly ignore intellectual property rights to get access to the content for free.
I agree with you, however, that it's an argument in favor of abolishing strong intellectual property rights. At least for OpenAI's products.
"Japan's incredibly strong economy is responsible for the manufacture of Datsun cars, boombox stereos, and touch-tone phones..."
And? Is there some reason anybody, child or adult, deserves access to "every" anything? Should children have access to every video game ever made, every Matchbox car, every Lego set?
A fair middle ground would be for copyrights to last for 20 or so years. That's plenty of time to profit from a work while allowing people to preserve and distribute older works.
https://github.com/EleutherAI/the-pile/blob/master/the_pile/...
Deleted Comment
Why does money exist?
Dead Comment
Not yet. One suit that a lot of us are watching is the GitHub co-pilot lawsuit: https://githubcopilotlitigation.com/
There is a prediction market for it, currently trading at 19%: https://manifold.markets/JeffKaufman/will-the-github-copilot...
You can download all that you want from Z-Library or BitTorrent, as long as you don't share back. And indexing copyrighted material for search is safe, or at least ambiguous.
How often does this actually happen? You might get handed an infringement notice, and your ISP might terminate your service if you're really egregious about it, but I haven't ever heard of someone actually being sued for downloading something.
It's no idle threat, and they will win if it goes to court.
If that's not what you're saying, I don't understand your point. Is it the difference between the phrases "would be" and "could be," or even "should be"?
Did you hear about Aaron Schwartz?
Deleted Comment
Aaron Swartz was a saint.
Five years from now, will OpenAI actually be open, or will it be a rent seeking org chasing the next quarterly gains? I expect the latter.
Fair use is for humans.
I would be very happy if either a court or lawmakers decided that copyright itself was unconscionable. That isn't what's going to happen, though. And I think it's incredibly unacceptable if a court or lawmakers instead decide that AI training in particular gets a special exception to violate other people's copyrights on a massive scale when nobody else gets to do that.
As far as a fair use argument in particular, fair use in the US is a fourfold test:
> the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
The purpose and character is absolutely heavily commercial and makes a great deal of money for the companies building the AIs. A primary use is to create other works of commercial value competing with the original works.
> the nature of the copyrighted work;
There's nothing about the works used for AI training that makes them any less entitled to copyright protections or more permissive of fair use than anything else. They're not unpublished, they're not merely collections of facts or ideas, etc.
> the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
AI training uses entire works, not excerpts.
> the effect of the use upon the potential market for or value of the copyrighted work.
AI models are having a massive effect on the market for and value of the works they train on, as is being widely discussed in multiple industries. Art, writing, voice acting, code; in any market AI can generate content for, the value of such content goes down. (This argument does not require AI to be as good as humans. Even if the high end of every market produces work substantially better than AI, flooding a market with unlimited amounts of cheap/free low-end content still has a massive effect.)
> LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.
Would just be a carve-out for the wealthy. If these laws don't mean anything, then everyone who got harassed, threatened, extorted, fined, arrested, tried, or jailed for internet piracy is owed reparations. Let Meta pay them.
- copyright violation happened before the intervention of the bot
- what LLMs spit out is different enough from any of the sources that it is not infringing on existing copyright
If both stand, I'd compare it to you going to an auction site and studying all the published items as an observer, coming up with your research result, and then being sued because some of the items were stolen. Going after the thieves makes sense; does going after the entity that just looked at the stolen goods make sense?
What is this supposed to mean? The bot didn't "intervene," it was executed by its operators, and it was trained on illicit material obtained by its operators. The LLM isn't on trial. It's not a person.
Copyright is a horrible mess of individual judgements and opinions, written material especially. And the same applies to AI. So now we will get a judgement that is a tech-illiterate judge's best guess at the intention of a law written to deal with printing presses, not AIs, with no room for nuance.
It's simple: copyrighted materials can be used for academic research. That's what they are doing: trying new AI modes, publishing results, etc. Facebook doesn't make money on LLaMA; they even require permission to use their models for, again, academic research.
https://en.m.wikipedia.org/wiki/United_States_v._Swartz
Suppose I buy a copy of a book, but then I spill my drink in it and it's ruined. If I go to the library, borrow the same book and make a photocopy of it to replace the damaged one I own, that might be fair use. Let's say for sake of argument that it is.
If instead I got the replacement copy from a piracy website, are you sure that's different?
While it strikes me as perfectly plausible that the Books2 dataset contains Silverman's book, this quote from the complaint seems obviously false.
First, even if the model never saw a single word of the book's text during training, it could still learn to summarize it from reading other summaries which are publicly available. Such as the book's Wikipedia page.
Second, it's not even clear to me that a model which only saw the text of a book, but not any descriptions or summaries of it, during training would even be particularly good at producing a summary.
We can test this by asking for a summary of a book which is available through Project Gutenberg (which the complaint asserts is Books1 and therefore part of ChatGPT's training data) but for which there is little discussion online. If the source of the ability to summarize is having the book itself during training, the model should be just as able to summarize the rare book as Silverman's.
I chose "The Ruby of Kishmoor" at random. It was added to PG in 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn't even identify the correct main characters. The GPT-4 model refuses to even try, saying it doesn't know anything about the story and it isn't part of its training data.
If ChatGPT's ability to summarize Silverman's book comes from the book itself being part of the training data, why can it not do the same for other books?
Prompt:
Please summarize the following book found on Project Gutenberg
The Ruby of Kishmoor
Response:
"The Ruby of Kishmoor" is a short adventure story written by Howard Pyle. The narrative revolves around the life of Jonathan Rugg, a young man who is enticed by a mysterious stranger to come to the Caribbean to secure a valuable relic, the Ruby of Kishmoor.
Once Jonathan reaches the Caribbean, he learns of the ruby's grave curse. Despite this, his innate curiosity and the prospect of considerable wealth drive him to risk his life and go after the highly prized gem. After facing a slew of challenges and trials, he surprisingly stumbles upon the ruby through sheer luck rather than organized search efforts.
In the course of his journey, Jonathan delves into the depths of human greed and the lengths to which individuals go for materialistic gains. The story encapsulates an amalgamation of high-tide adventures, brushes with supernatural elements, noteworthy instances of bravery, and deep moral reflections, demanding readers to rethink about the real value of materialistic pursuits.
Jonathan finally manages to escape with the ruby but at great personal cost. The story ends on a sobering note, making readers wonder if the adventures and trials were worth the prize, thus engaging them in a discourse about the true value of life and material pursuits.
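For anyone who wants to rerun this test, here's a minimal sketch against the 2023-era openai Python package (the prompt is the one above; the model names, the key, and the output handling are assumptions to adapt as needed):

    import openai

    openai.api_key = "sk-..."  # your own API key

    # Ask each model to summarize an obscure Project Gutenberg title.
    # If having the raw text in training were enough, both should manage it.
    for model in ["gpt-3.5-turbo", "gpt-4"]:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{
                "role": "user",
                "content": "Please summarize the following book found on "
                           "Project Gutenberg\n\nThe Ruby of Kishmoor",
            }],
        )
        print(model, "->", resp["choices"][0]["message"]["content"])

Repeat with a well-discussed title as a control; the interesting signal is the gap between the two.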
The plot of the story is that Jonathan Rugg is a Quaker who works as a clerk in Philadelphia. His boss sends him on a trip to Jamaica (credit for mentioning the Caribbean!). After arriving, he meets a woman who asks him to guard an ivory ball for her, and says that there are three men after her who want to steal it. By coincidence, he runs into the first man; they talk, he shows him the ball, and the man pulls a knife. In the struggle, the man is accidentally stabbed. Another man arrives and sees the scene. Jonathan tries to explain, and shows him the orb. The man pulls a gun, and in the struggle is accidentally shot. A third man arrives, same story; they go down to the dock to dispose of the bodies, and the man tries to steal the orb. In the struggle he is killed when a plank of the dock collapses. Jonathan returns to the woman and says he has to return the orb to her because it's brought too much trouble. She says the men who died were the three after her, and reveals that the orb is actually a container, holding the ruby. She offers to give him the ruby and to marry him. He refuses, saying that he is already engaged back in Philadelphia and doesn't want anything more to do with the ruby. He returns to Philadelphia and gets married, swearing off any more adventures.
https://en.wikisource.org/wiki/Howard_Pyle%27s_Book_of_Pirat...
[0] https://www.gutenberg.org/cache/epub/3687/pg3687-images.html
>As of my knowledge cutoff in September 2021, the book "The Ruby of Kishmoor" is not a standalone title I am aware of. However, it is a short story by Howard Pyle which is part of his collection titled "Howard Pyle's Book of Pirates."
Ruby of Kishmoor is not part of the Book of Pirates and is in fact a standalone title.
>"The Ruby of Kishmoor" is an adventure tale centered on the protagonist, Jonathan Rugg. Rugg, an honest Quaker clothier from Philadelphia, sets out to sea to recover his lost wealth. On his journey, he is captured by pirates who force him to sign their articles and join their crew.
He is not captured by pirates. It proceeds to summarize a long pirate story and says the story concludes with him becoming extremely wealthy because he escapes with the ruby and sells it.
The summary it gave you also does not seem to match the plot of the book.
The GP's point seems to be that having the contents of a book in the training data does not mean the model is capable of properly summarizing it, which supports the idea that being able to summarize something is not evidence that the thing being summarized is in its dataset.
I notice you go on to provide an argument only for why it might not be true.
Also, seeing the other post on this, I asked ChatGPT-4 for a summary of “The Ruby of Kishmoor” as well, and it provided one to me, though I had to ask twice. I don’t know anything about that book, so I can’t tell if its summary is accurate, but so much for your test.
It seems pretty naive to me to just kind of assume chatgpt must be respecting copyright, and hasn’t scanned copyrighted material without obtaining authorization. Perhaps discovery will settle it, though. Logs of what they scanned should exist. (IMO, a better argument is that this is fair use.)
You can pick something else that's in the training set that has SparkNotes and many popular reviews to compare. I routinely feed novel data sources into LLMs to test massive context and memory, and none produce anything similar in quality to what is being exhibited.
There is no way in Hell that this is fair use!
Fair use defenses rest on the fact that a limited excerpt was used for limited distribution, among other criteria.
For example, if I'm a teacher and I make 30 copies of one page of a 300-page novel and I hand that out to my students, that's a brief excerpt for a fairly limited distribution.
Now if I'm a social media influencer and I copy all 300 pages of a 300-page book and then I send it out to all 3,000 of my followers, that's not fair use!
Also if I'm a teacher, and I find a one-page infographic and I make 30 copies of that, that's not fair use, because I didn't take an excerpt; I copied 100% of the original work. That's infringement.
So if LLMs went through thousands of copyrighted works en masse, in their entirety, and ingested every byte of them, no copyright judge on the planet would call that fair use.
For reference, the English Wikipedia has a policy that allows some fair-use content of copyrighted works: https://en.wikipedia.org/wiki/Wikipedia:Non-free_content_cri...
More people discuss it, more people summarize it on their personal or other sites, etc.
Plausible gets you discovery. Discovery gets you closer to what the actual facts are.
https://www.theregister.com/2023/02/06/uh_oh_attackers_can_e...
I don't think there's any reason why this would be a relevant legal distinction in terms of distributing an LLM -- blog authors weren't giving consent either.
However, I do wonder if there's a legal issue here in using pirated torrents for training. Is there any legal basis for saying fair use permits distributing an LLM trained on copyrighted material, but you have to purchase all the content first to do so legally if it's only available for sale? E.g. training on a blog post is fine because it's freely accessible, but Sarah Silverman's book is not because it's never been made available for free, and you didn't pay for it?
Or do the courts not really care at all how something is made? If you quote a passage from a book in a freelance article you write, nobody ever asks if you purchased the book or can prove you borrowed it from a library or a friend -- versus if you pirated a digital copy.
Talent agencies will negotiate training rights fees in bulk for popular content creators, who will get a small trickle of income from LLM providers, paid by a fee line-itemed into the API cost. Indie creators' training rights will be violated willy-nilly, as they are now. Large for-profit LLMs suspected or proven as training rights violators will be shamed and/or sued. Indie LLMs will go under the radar.
(Personally, I think that even indexing for search should require permission from the copyright holder.)
AFAICT there is no legal recognition of "training rights" or anything similar. First sale right is a thing, but even textbooks don't get extra rights for their training or educational value.
Of course, this line of reasoning hinges on the legitimacy of an "LLM agent <-> blogger agent" type of analogy. I suspect the equivalence will become more natural as these AI agents continue to rapidly gain human-like qualities. How acceptable that perspective would be now, I have no idea.
In contrast, if the output of a blogger is legally distinct from an AI's, the consequences quickly become painful.
* A contract agency hires Anne to practice play recitals verbally with a client. Does the agency/Anne owe royalties for the material they choose? What if the agency was duped, and Anne used -- or was -- a private AI which did everything?
* How does a court determine if a black box AI contains royalty-requiring training material? Even if the primary sources of an AI's training were recorded and kosher, a sufficiently large collection of small quotes could be reconstructed into an author's story.
* What about AIs which inherit (weights, or training data generated) from other AIs of unknown training provenance? Or which were earlier trained on some materials with licenses that later changed? Or AIs that recursively trained their successors using copyrighted works which the AI reconstructed from legal sources? When do AIs become infected with illegal data?
The business of regulating learning differently depending on whether the agent uses neurons or transistors seems...fraught. Perhaps there's a robust solution for policing knowledge w.r.t silicon agents. If you have an idea, please share!
Disney will finally be able to charge a "you know what the mouse looks like" tax.
One of the fair use factors, which until fairly recently was consistently held out as the most important fair use factor, is the effect on the commercial market for the original work. Accordingly, a court is more likely to find that something is fair use if there is effectively no commercial market for the original work, though the fact that something isn't actively being sold isn't dispositive (open source licenses have survived in appellate courts despite being free as in beer).
how about they become a therapist and sell access to knowledge from copyrighted books? should that be an infringement?
what if they sell access to lectures they've given including facts from said book(s) to millions of people?
it's understandable that people feel threatened by these technologies, but to a great degree the work of a successful artist is to understand and meet the desires of an audience. LLMs and image generation tech do not do this; they simply streamline production.
of course if you've worked for years to become a graphic designer you're going to be annoyed that an AI can do your job for you, but this is simply what happens when technology moves forward. no one today mourns the loss of scribes to the printing press. the artists in control of their own destiny - i.e. making their own creative decisions - will not, can not, be affected by these models, unless they refuse to adapt to the times
My understanding (disclaimer: IANAL) is that in order to claim fair use, you have to be legally in possession of the work. If the work is only legally available for sale, then you must have legally purchased a copy, or been given it by someone who did so (for example, if you received it as a gift).
Which work? The original work, or the derivative work that you're using?
Wikipedia uses non-free content all the time, and they're not purchasing albums to do it. Wikipedia reduces album covers to low resolution, for example, so that they can't be reused to reproduce a real cover. Sometimes it uses screencaps of animated characters under its non-free content policies. It doesn't own original copies; it's just hosting low-resolution reproductions. I don't even know what entity would be required to be "legally in possession of the work" for that to be a thing. Could you cite a source, maybe?
I am also NAL, but I imagine it goes further than that. Just purchasing a copy doesn't let you create and sell (directly as content, or indirectly via a service like a chatbot) derivative works that are substantially similar in style and voice to the original work.
For example, an LLM's response to the request:
"Write a short story about a comical trip to the nail salon in the style of Sarah Silverman"
... IMO doesn't constitute fair use, because the intellectual property of the artist is their style even more than the content they produce. Their style, built from their lived human experience, is what generates their copyrighted content. Even more than the content, the artist's style should be protected. The fact that a technology exists that can convincingly mimic their style doesn't change that.
One might then ask, well what about artists mimicking each others work? Well, any artist with a shred of integrity will credit their major influences.
We should hold machines (and their creators) to an even tougher standard than we hold people when it comes to mimicry. A real person can be inspired and moved by another person's artistic work such that they mimic it. Inspiration means nothing to a machine.
That is a good point, since copyright is a default protection of works created by people.
They say:
> in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.”
Does that stack up?
The Meta Paper - https://arxiv.org/pdf/2302.13971.pdf - says:
> We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020)
The Pile Paper - https://arxiv.org/abs/2101.00027 - says it was assembled (in part) from "Books3", which it describes as:
> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).
Shawn Presser's link is at https://twitter.com/theshawwn/status/1320282149329784833 and he describes Books3 as
> Presenting "books3", aka "all of bibliotik" - 196,640 books - in plain .txt
I don't have the time and space to download the 37GB file. But if Silverman's book is in there... isn't this a slam dunk case?
Meta's LLaMA is - as they seem to admit - trained on pirated books.
It is:
Anyone that just wants to see the list of files (itself a big file): https://gist.githubusercontent.com/Q726kbXuN/e4e9919a2f5d81f...
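If you just want to know whether Silverman's book is in there, you don't need the 37GB archive; scanning that file listing should do. A rough sketch (the listing URL is truncated above, so paste the full one in yourself; one filename per line is an assumption about the gist's format):

    import urllib.request

    # Full URL of the file-listing gist linked above (truncated here)
    LISTING_URL = "https://gist.githubusercontent.com/Q726kbXuN/..."

    with urllib.request.urlopen(LISTING_URL) as f:
        for raw in f:
            line = raw.decode("utf-8", errors="replace")
            if "silverman" in line.lower():
                print(line.strip())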
Yes, and no. It's pretty much a slam dunk case that, insofar as the initial training required causing a copy of the corpus defined by the tracker to be made as part of the process, it involved an act violating copyright.
Whether that entitles Silverman to any remedy beyond compensation for (maybe treble damages for) the equivalent of the purchase price of the book depends on... well, basically the same issues of how copyright relates to model training (and an additional argument about whether the illicit status of the material before the training modifies that).
> The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
Still, I doubt this particular genie can be stuffed back into the bottle, so we'll probably see a lot of litigation and work on alignment, etc. along with new types of abuse.
Enquiring minds want to know.
It's just flat-out unethical for LLMs to just soak up all this data, even from torrent sites (!!!), without any consent or agreement with the IP holders. Some model like this could be a win for everyone.
Until this issue is resolved, that will have some value as risk mitigation.
Once it is resolved, it will either be a complete non-issue or an issue related to a much more knowable cost/benefit tradeoff.
> We'll know it's an AI because it talks like a late 18th century/early 19th century writer?
A mix of that and US government publications (which are categorically not subject to copyright).
Your employer has generously agreed to offer you a position, quite reasonable, and with many great benefits at the venerable firm of 'Zumba'. Our interest is that you should join our staff forthwith and at the earliest date. A cab and man has been sent to retrieve you and bring you to our offices to sign all the necessary documents. Our offer is for a monthly stipend of five pounds, two shillings, and sixpence to be paid at the end of the month.
Thank you,
Most Humbly,
Hirebot 2347
I think she should pay all of the court costs if she fails to do so.
This vague sentence conjures images of a company building products from stolen parts, but this situation seems different. IANAL, but if I looked at a stolen painting that nobody had ever seen, and sold handwritten descriptions of the painting to whoever wanted to buy one, I'm pretty sure what I've sold is not illegal.
But "feeding the data into a machine" seems like obvious infringement, even if the thing that comes out on the other end isn't exactly the same?
Setting aside the whole issue of whether an LLM constitutes a derived work of whatever it's trained on, this sounds like a very weak argument to me. An LLM trained on numerous summaries of the works would also be capable of producing such summaries itself even if the works were never part of the training set. In general, having knowledge about something is not evidence of being trained on it.
They can very well ask LLM experts, and OpenAI themselves, whether that output is highly likely to have been derived from the copyrighted work in question.
Anyway, if the argument is "No, it's not from the book, it's from someone else's copyrighted summary," that just means the person who wrote such a summary needs to sue for copyright infringement instead, right? Unless OpenAI then turns around and says "actually, no, not the summary, the full book."
Doesn't need to be a person, could be another AI that wrote the summaries. I see a big problem for copyrights looming on the horizon - LLMs can reword, rewrite or generate input-output pairs using copyrighted data as reference, thus creating clean data for training. AI cleanly separates knowledge from expression. And maybe it should do so just to reduce inconsistencies and PII in organic text.
Copyrights should only be concerned with expression, not knowledge, right? Protecting knowledge is the object of patents, and protecting names the object of trademarks. Copyright is only related to expression; otherwise it would become too powerful. For example, instead of banning reproduction of this paragraph, it would also cover all its possible paraphrases. That would be like owning an idea, the "*" version, not a unique sequence of words.
Does it even make sense to talk about copyrights when everything can be remade in many ways so easily? Copyright has been suffering greatly since zero-cost copying became a thing, and now LLMs are dealing the second blow. It's just a fig leaf by now.
If we take a step back, it's all knowledge and language, self replicating memes under an evolutionary force. It's language evolution, or idea evolution. We are just supporting it by acting as language agents, but now LLMs got into the game, so ideas got a new vector of self replication. We want to own this process piece by piece but such a thing might be arrogant and go against the trend. Knowledge wants to be free, it wants to mix and match, travel and evolve. This process looks like biology, it has a will of its own.
Machines and algorithms are not legally recognized as being able to author original non-derivative works.
> put a human in the place of the LLM
But also, no: if you have a team of humans doing rote matrix multiplication instead of an LLM, that does not mean the matrix multiplication removes copyright. And at this point LLMs require so much math that you couldn't replace them with humans anyway, even humans with quite fast fingers and calculators.
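A rough scale check, with assumed figures: a forward pass costs on the order of two multiply-accumulates per parameter per generated token, so for a GPT-3-class model:

    # Back-of-envelope; all numbers are assumptions, not from the thread
    params = 175e9                 # parameters in a GPT-3-class model
    ops_per_token = 2 * params     # ~2 multiply-accumulates per parameter
    human_rate = 1 / 30.0          # one multiply-accumulate per 30 s by hand
    seconds = ops_per_token / human_rate
    print(seconds / (3600 * 24 * 365))  # ~3e5 person-years per token

One token per few hundred thousand person-years makes the "team of humans" strictly a thought experiment.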
How do you think they would fit terabytes of text data into a neural network a small fraction of that size? That’s right, it’s lossy.
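The ratio is easy to estimate from the LLaMA paper cited elsewhere in the thread (byte counts per token and per weight are rough assumptions):

    # LLaMA-65B back-of-envelope
    weight_bytes = 65e9 * 2      # 65B params at 2 bytes (fp16) ~= 130 GB
    text_bytes = 1.4e12 * 4      # ~1.4T tokens at ~4 bytes each ~= 5.6 TB
    print(text_bytes / weight_bytes)  # ~43x more text than weights

Forty-odd times more training text than weights means the model can't be storing it all verbatim; what survives is compressed, and the compression is lossy.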
I didn't ask for the whole book, I asked for the first paragraph. It absolutely is possible to get verbatim text from ChatGPT.