Readit News
pessimizer · 2 years ago
> The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”

This is the makers of AI explicitly saying that they did use copyrighted works from a book piracy website. If you downloaded a book from that website, you would be sued and found guilty of infringement. If you downloaded all of them, you would be liable for many billions of dollars in damages.

But companies like Google and Facebook get to play by different rules. Kill one person and you're a murderer; kill a million and asking you about it becomes a "gotcha question" that you can react to with outrage.

holmesworcester · 2 years ago
Let's take a second to remember that copyright is the reason ~every child doesn't have access to ~every book ever written.

While it might be too disruptive to eliminate copyright overnight, we should remember that our world will be much better and improve much faster to the extent we can reduce copyright's impact.

And we should cheer it on when it happens. A majority of the world's population in 2023 has a smartphone. Imagine a world where a majority of the world had access to every book ever digitized, and could raise their children on these books!

throw10920 · 2 years ago
> Let's take a second to remember

This is emotionally manipulative speech that provides no value to HN and only serves to bypass people's logical reasoning circuits.

> ~every child doesn't have access to ~every book ever written

More manipulation - "think of the children!"

Copyright exists because people who produce content with low distribution costs (e.g. books) need some protection against their work being taken without compensation.

Fundamentally, you are never entitled to someone else's work.

There are already tens (hundreds?) of thousands of books in the public domain, and tens of thousands more under Creative Commons licenses (where the author explicitly released their work for free distribution). There are lectures on YouTube and MIT OpenCourseWare. There are full K-12 textbooks on OpenStax and Wikibooks. There's Wikipedia, Stack Exchange, the Internet Archive, and millions of small blogs and websites hosting content that is completely free.

There is no need for "a majority of the world had access to every book ever digitized" - and it's deeply morally wrong (theft-adjacent) to take someone else's work without compensating them on their terms.

WillAdams · 2 years ago
There are already many more public domain books than people are inclined to read:

https://www.gutenberg.org/

https://librivox.org/

many of which form the basis for an education:

https://news.ycombinator.com/item?id=34630153

And most of which, when in copyright, paid their authors quite handsomely in terms of royalties.

If you believe that books should exist without copyright, then one has to ask --- how many books have you written which you have explicitly placed in the public domain? Or, how many authors have you patronized so as to fund their writing so that they can publish their works freely? Or, if neither of these applies, how do you propose to compensate authors for the efforts and labours of writing?

ignoramous · 2 years ago
Second-order effects matter, though: If everyone is allowed to steal books, what's the incentive for experts to write new ones, and for the publishers to reward them for it?

Btw, not a fan of "but what about the kids" rhetoric: https://en.wikipedia.org/wiki/Think_of_the_children

dlivingston · 2 years ago
It would look largely identical to ours, I think. It's pretty trivial to get access to many, if not most, e-books.

Any public-domain work is available on Project Gutenberg [0]. Copyrighted works can be accessed for free using tools of varying legality: Libby [1] is likely sponsored by your local library and gives free access to e-books and audiobooks. Library Genesis [2] has a questionable legal status but has a huge quantity of e-books, journal articles, and more.

[0]: https://www.gutenberg.org

[1]: https://www.overdrive.com/apps/libby

[2]: https://libgen.rs/fiction/

698969 · 2 years ago
Then most people stop writing books because they can't get paid for their time/effort and ~every child will be stuck with outdated knowledge within a decade.
JustLurking2022 · 2 years ago
What an absolute pile of nonsense. People who author creative works deserve to have control of them and make some money - otherwise you'll soon find we have far fewer great authors, artists, etc.

This is essentially the same as saying builders charging for houses is the problem with the housing market, so we're going to phase out paying builders.

edanm · 2 years ago
> Let's take a second to remember that copyright is the reason ~every child doesn't have access to ~every book ever written.

No, copyright is the reason that authors all over the world are working very hard to make new books for my kids and everyone else's kids, despite never having met me. Copyright is the reason so many brilliant things are actually created that otherwise would never be.

Of course I'd prefer to live in a world in which I get all the media I want, for free. But I have no idea how to make such a world happen, and neither does anyone else, and humanity has been discussing this for a few centuries.

toss1 · 2 years ago
>>copyright is the reason ~every child doesn't have access to ~every book ever written

Copyright is ALSO the reason that many books can be written in the first place.

Obviously, copyright is abused, and the continual extension of copyright into near-perpetuity by corporations is basically absurd. And it is abused by music publishers etc. to rip off artists.

But to claim that it should not exist, when it is utterly trivial for anyone to copy stuff to the web, is to argue that no one should create or release any creative works, or to argue for drastic DRM measures.

Perhaps you DGAF about your written or artistic works because you do not or can not make a living off them, but I guarantee that for those creatives and artists who can and/or do make a living off of it, they do care, and rightly so.

gorbachev · 2 years ago
This situation is a bit different, however.

A company that believes in strong intellectual property rights protection is using resources that blatantly ignore intellectual property rights to get access to the content for free.

I agree with you, however, that it's an argument in favor of abolishing strong intellectual property rights. At least for OpenAI's products.

account42 · 2 years ago
I'm all for abolishing copyright, but how is this relevant to megacorporations ignoring copyright when it suits them while still expecting us to follow it when that suits them?
aaomidi · 2 years ago
Using charged language and bringing children into the equation is not a good way to have a discussion.
lotsoweiners · 2 years ago
I’m imagining a world that looks just about the same as this one does. A larger book library doesn’t automatically make that medium more appealing to kids than what Mr Beast, Unspeakable, and the other crap kids love are doing.
Eddy_Viscosity2 · 2 years ago
I think the point here is that children can be denied access to copyrighted works unless they pay the owners, but OpenAI and Meta can do as they please. I don't disagree that the current copyright system needs improvement, but what I really, really don't like is seeing rich and powerful people breaking laws with impunity over and over and over again.
glonq · 2 years ago
Copyright [and the textbook cartels] is the reason why my kids' highschool is using social studies textbooks that are 20-30 years old.

"Japan's incredibly strong economy is responsible for the manufacture of Datsun cars, boombox stereos, and touch-tone phones..."

2muchcoffeeman · 2 years ago
This assumes that the law is applied equally to both rich and poor.
listenallyall · 2 years ago
> copyright is the reason ~every child doesn't have access to ~every book ever written.

And? Is there some reason anybody, child or adult, deserves access to "every" anything? Should children have access to every video game ever made, every Matchbox car, every Lego set?

shreyshnaccount · 2 years ago
GP makes no remark on the morality/practicality of copyright. Also, having people sue big companies for copyright infringement might lead to more of what you're arguing for, in a "give them a taste of their own medicine" way.
erehweb · 2 years ago
Second-order effects are real - removing copyright would hurt authors. (See cstross comments in https://news.ycombinator.com/item?id=35761641 e.g.)
all_factz · 2 years ago
I would be much more sympathetic to this stance if you weren’t implicitly endorsing the rights of companies like Meta/Alphabet/OpenAI to profit from the disruption of copyright law. If we’re talking ordinary people being able to breach copyright, then yeah seems potentially interesting. But let’s remember that these companies aren’t acting altruistically. They’re not giving away Silverman’s work - they’re repackaging it for their own profit. That’s not fair to the artist and in fact does not help the children.
mywittyname · 2 years ago
I'm a fan of copyrights. While I think that the USA's implementation of copyrights has a few glaring flaws (namely, that the duration of copyright is far too long), I firmly believe that the elimination, or effective elimination, of copyrights would be massively detrimental to our culture.

A fair middle ground would be for copyrights to last for 20 or so years. That's plenty of time to profit from a work while allowing people to preserve and distribute older works.

ssss11 · 2 years ago
So why do we cheer on megacorps and not mom and pop pirates?
charles_f · 2 years ago
That's not the parent's point. Their point seems to be that large companies don't suffer the same consequences for a crime as any layman would.
stcroixx · 2 years ago
In this world, the only authors would be people independently wealthy who don't need an income.
kristianp · 2 years ago
Anyone want to check if the book in question is in ThePile dataset?:

https://github.com/EleutherAI/the-pile/blob/master/the_pile/...
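
A minimal sketch of one way to run that check, assuming you already have a plain-text listing of the Books3 file names (one path per line); the "books3.list.txt" filename below is a placeholder, not something from the complaint:

    # Search a local books3 file listing for a title or author.
    # Sketch only: assumes the listing has already been downloaded.
    query = "sarah silverman"
    with open("books3.list.txt", encoding="utf-8", errors="ignore") as f:
        hits = [line.strip() for line in f if query in line.lower()]
    for hit in hits:
        print(hit)
    print(len(hits), "matching entries")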

Deleted Comment

jklinger410 · 2 years ago
We can totally take a second to remember that without it being in the context of allowing large corporations to hoover up private works with impunity.
renewedrebecca · 2 years ago
But it's not children downloading the books is it? It's a company backed by billionaires, so why cheer?
ChatGTP · 2 years ago
Why don't children have access to all the latest LLMs, including ChatGPT-4, for free?

Why does money exist?

manzanarama · 2 years ago
How many books would not exist if copyright had the suggested reduced impact?
tr_user · 2 years ago
Try adding censorship to that list.

Dead Comment

buildbot · 2 years ago
Machine learning models have been trained with copyrighted data for a long time. ImageNet is full of copyrighted images, Clearview literally just scanned the internet for faces, and I am sure there are other, older examples. I am unsure if this has been tested as fair use by a US court, but I am guessing it will be considered to be so if it is not already.
jefftk · 2 years ago
> I am unsure if this has been tested as fair use by a US court

Not yet. One suit that a lot of us are watching is the GitHub co-pilot lawsuit: https://githubcopilotlitigation.com/

There is a prediction market for it, currently trading at 19%: https://manifold.markets/JeffKaufman/will-the-github-copilot...

kibwen · 2 years ago
Excellent, so you're saying I'll be able to download any copyrighted work from any pirate site and be free of all consequence if I just claim that I'm training an AI?
guluarte · 2 years ago
and not only copyrighted material, also illegal and disturbing content
holmesworcester · 2 years ago
Strictly speaking, it's uploading that people get sued for, not downloading.

You can download all that you want from Z-Library or BitTorrent, as long as you don't share back. And indexing copyrighted material for search is safe, or at least ambiguous.

capableweb · 2 years ago
Carefully speaking, what you say is true in many countries, but not in others. Some jurisdictions are different, as always.
MisterBastahrd · 2 years ago
Downloading is illegal. That people do not normally get sued or prosecuted for downloading does not mean that they cannot get sued or prosecuted.
moralestapia · 2 years ago
Huh? No.
rococode · 2 years ago
> If you downloaded a book from that website, you would be sued and found guilty of infringement.

How often does this actually happen? You might get handed an infringement notice, and your ISP might terminate your service if you're really egregious about it, but I haven't ever heard of someone actually being sued for downloading something.

xmprt · 2 years ago
Whether or not it's enforced, it's illegal, and copyright holders are within their rights to sue you. This is Pirate Bay levels of piracy, but because it's done by a large company and is sufficiently obfuscated behind tech, people don't see it the same way.
TillE · 2 years ago
In Germany if you torrent stuff (without a VPN), you're very likely to get a letter from a law firm on behalf of the copyright holders saying that they'll sue you unless you pay them a nice flat fee of around 1000 Euro.

It's no idle threat, and they will win if it goes to court.

pessimizer · 2 years ago
If books aren't under copyright protection and they're entirely legal to download, I agree that this lawsuit has no merit.

If that's not what you're saying, I don't understand your point. Is it the difference between the phrases "would be" and "could be," or even "should be"?

iepathos · 2 years ago
Exactly, it never happens. It's a threat parents and teachers use to try to spook schoolchildren out of pirating, but it isn't financially worth it for an author or publishing company to sue an individual over some book or music downloads. The only cases are business-to-business ones over mass downloads, where it could make financial sense to pay for lawyers to sue.
Shawnj2 · 2 years ago
Actually, no: downloading copyright-infringing material is legal as far as I can tell, but uploading it isn't. The illegal part of torrenting copyrighted material is the uploading that the protocol requires you to do. Your ISP will send you an infringement notice because they want you to stop doing illegal things on their network.
harry8 · 2 years ago
>How often does this actually happen?

Did you hear about Aaron Swartz?

Deleted Comment

Der_Einzige · 2 years ago
I for one am quite happy that AI folks are basically treating copyright as not existing. I strongly hope that the courts find that LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.

Aaron Swartz was a saint.

axblount · 2 years ago
Swartz distributed information for everyone to use freely. These companies are processing it privately to develop their for-profit products. Big difference, IMO.
8organicbits · 2 years ago
There's a difference between "information wants to be free" and "Facebook can produce works minimally derived from your greatest creative work at a scale you can't match". LLMs seem to concentrate that value in whoever builds the model, who can then sell access to it or sell the output it produces.

Five years from now, will OpenAI actually be open, or will it be a rent seeking org chasing the next quarterly gains? I expect the latter.

colechristensen · 2 years ago
While I'm a proponent of free information and loosening copyright, allowing billion-dollar companies to package up the sum of human creation and resell statistical models that mimic the content and style of everyone… is a bit far.

Fair use is for humans.

JoshTriplett · 2 years ago
> I for one am quite happy that AI folks are basically treating copyright as not existing. I strongly hope that the courts find that LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.

I would be very happy if either a court or lawmakers decided that copyright itself was unconscionable. That isn't what's going to happen, though. And I think it's incredibly unacceptable if a court or lawmakers instead decide that AI training in particular gets a special exception to violate other people's copyrights on a massive scale when nobody else gets to do that.

As far as a fair use argument in particular, fair use in the US is a fourfold test:

> the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

The purpose and character is absolutely heavily commercial and makes a great deal of money for the companies building the AIs. A primary use is to create other works of commercial value competing with the original works.

> the nature of the copyrighted work;

There's nothing about the works used for AI training that makes them any less entitled to copyright protections or more permissive of fair use than anything else. They're not unpublished, they're not merely collections of facts or ideas, etc.

> the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

AI training uses entire works, not excerpts.

> the effect of the use upon the potential market for or value of the copyrighted work.

AI models are having a massive effect on the market for and value of the works they train on, as is being widely discussed in multiple industries. Art, writing, voice acting, code; in any market AI can generate content for, the value of such content goes down. (This argument does not require AI to be as good as humans. Even if the high end of every market produces work substantially better than AI, flooding a market with unlimited amounts of cheap/free low-end content still has a massive effect.)

slashdev · 2 years ago
Agreed, copyright has gone too far. I hope the advent of AI serves to weaken it.

Dead Comment

pessimizer · 2 years ago
Copyrights shouldn't exist, but

> LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.

Would just be a carve-out for the wealthy. If these laws don't mean anything, everyone who got harassed, threatened, extorted, fined, arrested, tried, or jailed for internet piracy is owed reparations. Let Meta pay them.

makeitdouble · 2 years ago
If AI companies get to successfully argue the two points below, what source was used becomes irrelevant.

- copyright violation happened before the intervention of the bot

- what LLMs spit out is different enough from any of the source that it is not infringing on existing copyright

If both stand, I'd compare it to you going to an auction site and studying all the published items as an observer, coming up with your research result, and then being sued because some of the items were stolen. Going after the thieves makes sense; does going after the entity that just looked at the stolen goods make sense?

bobbylarrybobby · 2 years ago
I'd argue that if an automated process can ingest A and spit out B, then B is inherently a derivative work of A. (Never mind that humans are also automata.)
pessimizer · 2 years ago
> - copyright violation happened before the intervention of the bot

What is this supposed to mean? The bot didn't "intervene," it was executed by its operators, and it was trained on illicit material obtained by its operators. The LLM isn't on trial. It's not a person.

LatteLazy · 2 years ago
I am not sure this is exactly correct. If you download a book, that might be copyright infringement. But not if I download a word. How much do they need to download at a time before it becomes infringement? And if the material is never used or displayed to a human, is it still infringement (if so, Google is awaiting a huge lawsuit)? Alternatively, if I, a human, read a book, it is copied into my memory. Is that infringement? What if I quote it? How much can I quote, and at what frequency, before I'm infringing? If I write something similar to the book but in my own words, is that infringement? How similar does it need to be? What about derivative works and fair use?

Copyright is a horrible mess of individual judgements and opinions, written material especially. And the same applies to AI. So now we will get a judgement which is a tech-illiterate judge's best guess at the intention of a law written to deal with printing presses, not AIs, with no room for nuance.

two_in_one · 2 years ago
> But companies like Google and Facebook get to play by different rules

It's simple: copyrighted materials can be used for academic research. That's what they are doing: trying new AI models, publishing results, etc. Facebook doesn't make money on LLaMA; they even require permission to use their models for, again, academic research.

troyvit · 2 years ago
But what if those copyrighted materials were illegally gained? That's what the suit alleges.
htss2013 · 2 years ago
Copyright doesn't apply when it comes to fair use, and one of the major factors of fair use is if your use deprived the copyright owner of sales. Good luck arguing that any of the books in question lost sales because an AI was trained on them.
freyes · 2 years ago
We need the dude who prosecuted Aaron Swartz here. This sucks.

https://en.m.wikipedia.org/wiki/United_States_v._Swartz

Deleted Comment

HDThoreaun · 2 years ago
I've pirated many books, never sued.
erezsh · 2 years ago
You mean, never caught.
oarabbus_ · 2 years ago
Why do you blame Facebook and not EleutherAI?

Deleted Comment

randomcarbloke · 2 years ago
a machine shouldn't have to pay to read a book.
amf12 · 2 years ago
The lawsuit doesn't even mention Google.
pessimizer · 2 years ago
No, I did. What's your point?
AnthonyMouse · 2 years ago
> If you downloaded a book from that website, you would be sued and found guilty of infringement.

Suppose I buy a copy of a book, but then I spill my drink in it and it's ruined. If I go to the library, borrow the same book and make a photocopy of it to replace the damaged one I own, that might be fair use. Let's say for sake of argument that it is.

If instead I got the replacement copy from a piracy website, are you sure that's different?

Imnimo · 2 years ago
>On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data.

While it strikes me as perfectly plausible that the Books2 dataset contains Silverman's book, this quote from the complaint seems obviously false.

First, even if the model never saw a single word of the book's text during training, it could still learn to summarize it from reading other summaries which are publicly available. Such as the book's Wikipedia page.

Second, it's not even clear to me that a model which only saw the text of a book, but not any descriptions or summaries of it, during training would even be particularly good at producing a summary.

We can test this by asking for a summary of a book which is available through Project Gutenberg (which the complaint asserts is Books1 and therefore part of ChatGPT's training data) but for which there is little discussion online. If the source of the ability to summarize is having the book itself during training, the model should be equally able to summarize the rare book as it is Silverman's book.
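
A minimal sketch of that comparison, assuming access to the OpenAI Python client; the model name, prompt wording, and choice of titles are placeholders, not what anyone in this thread actually ran:

    # Ask the same model for a one-paragraph summary of a widely discussed
    # book and of an obscure public-domain one, then judge both by hand
    # against the real plots. Sketch only.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def summarize(title):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"Summarize the book {title} in one paragraph.",
            }],
        )
        return resp.choices[0].message.content

    for title in ('"The Bedwetter" by Sarah Silverman',
                  '"The Ruby of Kishmoor" by Howard Pyle'):
        print(title)
        print(summarize(title))
        print()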

I chose "The Ruby of Kishmoor" at random. It was added to PG in 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn't even identify the correct main characters. The GPT-4 model refuses to even try, saying it doesn't know anything about the story and it isn't part of its training data.

If ChatGPT's ability to summarize Silverman's book comes from the book itself being part of the training data, why can it not do the same for other books?

ec109685 · 2 years ago
The playground summarizes it as this via GPT-4:

Prompt:

Please summarize the following book found on Project Gutenberg

The Ruby of Kishmoor

Response:

"The Ruby of Kishmoor" is a short adventure story written by Howard Pyle. The narrative revolves around the life of Jonathan Rugg, a young man who is enticed by a mysterious stranger to come to the Caribbean to secure a valuable relic, the Ruby of Kishmoor.

Once Jonathan reaches the Caribbean, he learns of the ruby's grave curse. Despite this, his innate curiosity and the prospect of considerable wealth drive him to risk his life and go after the highly prized gem. After facing a slew of challenges and trials, he surprisingly stumbles upon the ruby through sheer luck rather than organized search efforts.

In the course of his journey, Jonathan delves into the depths of human greed and the lengths to which individuals go for materialistic gains. The story encapsulates an amalgamation of high-tide adventures, brushes with supernatural elements, noteworthy instances of bravery, and deep moral reflections, demanding readers to rethink about the real value of materialistic pursuits.

Jonathan finally manages to escape with the ruby but at great personal cost. The story ends on a sobering note, making readers wonder if the adventures and trials were worth the prize, thus engaging them in a discourse about the true value of life and material pursuits.

Imnimo · 2 years ago
This does at least correctly name the main character, but I think it's otherwise hallucinated.

The plot of the story is that Jonathan Rugg is a Quaker who works as a clerk in Philadelphia. His boss sends him on a trip to Jamaica (credit for mentioning the Caribbean!). After arriving, he meets a woman who asks him to guard for her an ivory ball, and says that there are three men after her who want to steal it. By coincidence, he runs into the first man, they talk, he shows him the ball, and the man pulls a knife. In the struggle, the man is accidentally stabbed. Another man arrives, and sees the scene. Jonathan tries to explain, and shows him the orb. The man pulls a gun, and in the struggle is accidentally shot. A third man arrives, same story, they go down to the dock to dispose of the bodies and the man tries to steal the orb. In the struggle he is killed when a plank of the dock collapses. Jonathan returns to the woman and says he has to return the orb to her because it's brought too much trouble. She says the men who died were the three after her, and reveals that the orb is actually a container, holding the ruby. She offers to give him the ruby and to marry him. He refuses, saying that he is already engaged back in Philadelphia, and doesn't want anything more to do with the ruby. He returns to Philadelphia and gets married, swearing off any more adventures.

https://en.wikisource.org/wiki/Howard_Pyle%27s_Book_of_Pirat...

maratc · 2 years ago
Judging by a quick glance over [0], the story indeed revolves around one Jonathan Rugg, but it looks like "manages to escape with the ruby" is completely false. Yet another hallucination I guess.

[0] https://www.gutenberg.org/cache/epub/3687/pg3687-images.html

cthalupa · 2 years ago
I was able to get ChatGPT 4 to attempt to summarize this, but it's almost entirely hallucinated.

>As of my knowledge cutoff in September 2021, the book "The Ruby of Kishmoor" is not a standalone title I am aware of. However, it is a short story by Howard Pyle which is part of his collection titled "Howard Pyle's Book of Pirates."

Ruby of Kishmoor is not part of the Book of Pirates and is in fact a standalone title.

>"The Ruby of Kishmoor" is an adventure tale centered on the protagonist, Jonathan Rugg. Rugg, an honest Quaker clothier from Philadelphia, sets out to sea to recover his lost wealth. On his journey, he is captured by pirates who force him to sign their articles and join their crew.

He is not captured by pirates. It proceeds to summarize a long pirate story and says the story concludes with him becoming extremely wealthy because he escapes with the ruby and sells it.

The summary it gave you also does not seem to match the plot of the book.

The GP's point seems to be that having the contents of a book does not mean the model is capable of properly summarizing it, which supports the idea that being able to summarize something is not evidence that the thing being summarized is in its dataset.

CydeWeys · 2 years ago
Right, but that's useless without knowing how much (if any!) of it is actually correct. Is this completely hallucinated garbage?
monocasa · 2 years ago
I mean, that's the way you state facts that your suit is based on in order to start setting the bounds of discovery. They're asserting that they have reason to believe it's true, and now with a suit, they can look for themselves to be sure.
jmull · 2 years ago
> this quote from the complaint seems obviously false

I notice you go on to provide an argument only for why it might not be true.

Also, seeing the other post on this, I asked ChatGPT-4 for a summary of "The Ruby of Kishmoor" as well, and it provided one to me, though I had to ask twice. I don't know anything about that book, so I can't tell if its summary is accurate, but so much for your test.

It seems pretty naive to me to just kind of assume chatgpt must be respecting copyright, and hasn’t scanned copyrighted material without obtaining authorization. Perhaps discovery will settle it, though. Logs of what they scanned should exist. (IMO, a better argument is that this is fair use.)

washadjeffmad · 2 years ago
The test was whether producing equivalent accuracy and detail for summaries of all books in its training corpus was a feature of ChatGPT's ability to natively generate them from standalone source material or whether Silverman's detailed summary was likely just a "summary of summaries", not whether ChatGPT produced a result at all. From the comment you reference, it failed the test because the result was hallucinated.

You can pick something else that's in the training set that has SparkNotes and many popular reviews to compare. I routinely feed novel data sources into LLMs to test massive context and memory, and none produce anything similar in quality to what is being exhibited.

NoZebra120vClip · 2 years ago
> IMO, a better argument is that this is fair use

There is no way in Hell that this is fair use!

Fair use defenses rest on the fact that a limited excerpt was used for limited distribution, among other criteria.

For example, if I'm a teacher and I make 30 copies of one page of a 300-page novel and I hand that out to my students, that's a brief excerpt for a fairly limited distribution.

Now if I'm a social media influencer and I copy all 300 pages of a 300-page book and then I send it out to all 3,000 of my followers, that's not fair use!

Also if I'm a teacher, and I find a one-page infographic and I make 30 copies of that, that's not fair use, because I didn't make an excerpt but I've copied 100% of the original work. That's infringement now.

So if LLMs went through en masse in thousands of copyrighted works in their entirety and ingested every byte of them, no copyright judge on the planet would call that fair use.

For reference, the English Wikipedia has a policy that allows some fair-use content of copyrighted works: https://en.wikipedia.org/wiki/Wikipedia:Non-free_content_cri...

seanthemon · 2 years ago
Accessibility? I've heard of Silverman but never Ruby of Kishmoor

More people discuss it, more people summarize on their personal or other sites, etc

buildbot · 2 years ago
Right that is the point of the parent comment - it’s not the book, it’s the amalgamation of all the discussions and content about the book. This case is probably dead in the water.
pkilgore · 2 years ago
Plausible is literally the standard to clear a motion to dismiss.

Plausible gets you discovery. Discovery gets you closer to the what the actual facts are.

vmh1928 · 2 years ago
Isn't part of the problem that some of the training data is retained by the model and used during response generation? In that case it's not just that the copyrighted book was used as training data but that some part of the book has been retained by the model. So now my model is using copyrighted material while it runs. Here's an example of a model that retained enough image data to reconstruct a reasonable facsimile of the training image.

https://www.theregister.com/2023/02/06/uh_oh_attackers_can_e...

crazygringo · 2 years ago
This is actually quite interesting, as it's drawing a distinction between training material that can be accessed by anybody with a web browser (like anybody's blog), vs. training material that was "illegally-acquired... available in bulk via torrent systems."

I don't think there's any reason why this would be a relevant legal distinction in terms of distributing an LLM -- blog authors weren't giving consent either.

However, I do wonder if there's a legal issue here in using pirated torrents for training. Is there any legal basis for saying fair use permits distributing an LLM trained on copyrighted material, but you have to purchase all the content first to do so legally if it's only available for sale? E.g. training on a blog post is fine because it's freely accessible, but Sarah Silverman's book is not because it's never been made available for free, and you didn't pay for it?

Or do the courts not really care at all how something is made? If you quote a passage from a book in a freelance article you write, nobody ever asks if you purchased the book or can prove you borrowed it from a library or a friend -- versus if you pirated a digital copy.

jmkb · 2 years ago
Eventually, I imagine a new licensing concept will emerge, similar to the idea of music synchronization rights -- maybe call it "training rights." It won't matter whether the text was purchased or pirated -- just like it doesn't matter now if an audio track was purchased or pirated, when it's mixed into in a movie soundtrack.

Talent agencies will negotiate training rights fees in bulk for popular content creators, who will get a small trickle of income from LLM providers, paid by a fee line-itemed into the API cost. Indie creators' training rights will be violated willy-nilly, as they are now. Large for-profit LLMs suspected or proven as training rights violators will be shamed and/or sued. Indie LLMs will go under the radar.

fweimer · 2 years ago
Is it all that different from indexing for search? That does not seem to require a license from the copyright holder under U.S. law (but other countries may treat as a separate exploitation right). If indexing for search is acceptable, then something that is intended to be more transformative should be legal as well.

(Personally, I think that even indexing for search should require permission from the copyright holder.)

phkahler · 2 years ago
>> Talent agencies will negotiate training rights fees in bulk for popular content creators

AFAICT there is no legal recognition of "training rights" or anything similar. First sale right is a thing, but even textbooks don't get extra rights for their training or educational value.

a_wild_dandan · 2 years ago
I suspect the opposite outcome also being plausible: the LLM is viewed analogously to a blog author. The blogger/LLM may consume a book, subsequently produce "derived" output (generated text), and thus generate revenue for the blogger/LLM's employer. Consequently, the blogger/LLM's output -- while "derived" in some sense -- differs enough to be considered original work, rather than "derivative work" (like a book's film adaptation). Auditing how the blogger/LLM consumed relevant material is thus absurd.

Of course, this line of reasoning hinges on the legitimacy of an "LLM agent <-> blogger agent" type of analogy. I suspect the equivalence will become more natural as these AI agents continue to rapidly gain human-like qualities. How acceptable that perspective would be now, I have no idea.

In contrast, if the output of a blogger is legally distinct from an AI's, the consequences quickly become painful.

* A contract agency hires Anne to practice play recitals verbally with a client. Does the agency/Anne owe royalties for the material they choose? What if the agency was duped, and Anne used -- or was -- a private AI which did everything?

* How does a court determine if a black box AI contains royalty-requiring training material? Even if the primary sources of an AI's training were recorded and kosher, a sufficiently large collection of small quotes could be reconstructed into an author's story.

* What about AIs which inherit (weights, or training data generated) from other AIs of unknown training provenance? Or which were earlier trained on some materials with licenses that later changed? Or AIs that recursively trained their successors using copyrighted works which the AI reconstructed from legal sources? When do AIs become infected with illegal data?

The business of regulating learning differently depending on whether the agent uses neurons or transistors seems...fraught. Perhaps there's a robust solution for policing knowledge w.r.t silicon agents. If you have an idea, please share!

the8472 · 2 years ago
Humans are also trained on copyrighted content they see. Should every artist have to pay that fee too on every work they create?

Disney will finally be able to charge a "you know what the mouse looks like" tax.

nyolfen · 2 years ago
i don't understand why a new licensing regime would be necessary, the model is clearly a fair use derivative work. it does exactly what a human does -- observes information, distills it into symbolic systems of meaning, and produces novel content that exists in the same semantic universe as its experiences.
banana_feather · 2 years ago
> Or do the courts not really care at all how something is made?

One of the fair use factors, which until fairly recently was consistently held out as the most important fair use factor, is the effect on the commercial market for the original work. Accordingly, a court is more likely to find that something is fair use if there is effectively no commercial market for the original work, though the fact that something isn't actively being sold isn't dispositive (open source licenses have survived in appellate courts despite being free as in beer).

itronitron · 2 years ago
Scarcity drives a lot of value for original work.
akira2501 · 2 years ago
I'm allowed to make private copies of copyrighted works. I'm not allowed to redistribute them. To what extent this is redistribution is not clear. Is there much of a difference between this model and a machine, like a VCR, that recreates the original work when I press a button?
permo-w · 2 years ago
I buy a book and give it to my child, they read the book and later write and sell a story influenced by said book. should that be a copyright infringement?

how about they become a therapist and sell access to knowledge from copyrighted books? should that be an infringement?

what if they sell access to lectures they've given including facts from said book(s) to millions of people?

it's understandable that people feel threatened by these technologies, but to a great degree the work of a successful artist is to understand and meet the desires of an audience. LLMs and image generation tech do not do this. they simply streamline the production

of course if you've worked for years to become a graphic designer you're going to be annoyed that an AI can do your job for you, but this is simply what happens when technology moves forward. no one today mourns the loss of scribes to the printing press. the artists in control of their own destiny - i.e. making their own creative decisions - will not, can not, be affected by these models, unless they refuse to adapt to the times

WheatMillington · 2 years ago
It's legal to make a copy of something you own; however, it's not legal to make a copy of something illicitly acquired, whether or not there's distribution involved.
seanmcdirmid · 2 years ago
This would be like you intensely studying the copyrighted work and then writing things based on the knowledge you obtained from it. Except we don't know if there is an exception for things learned by people vs. things learned by machines, or if the machines are not really learning but copying instead (or if learning is intrinsically a form of copying?).
JamesBarney · 2 years ago
This is definitely not redistribution, any more than writing a blog post about a book you read is.
postmodest · 2 years ago
Is Game of Thrones a redistribution of Lord of the Rings?
version_five · 2 years ago
Seems like the AI angle is just capitalizing on hype. If it's illegal to download "pirate" copyright material, that was the crime. The rest is basically irrelevant. If I watch a pirated movie, it's not illegal for me to tell someone the plot.
pdonis · 2 years ago
> Is there any legal basis for saying fair use permits distributing an LLM trained on copyrighted material, but you have to purchase all the content first to do so legally if it's only available for sale?

My understanding (disclaimer: IANAL) is that in order to claim fair use, you have to be legally in possession of the work. If the work is only legally available for sale, then you must have legally purchased a copy, or been given it by someone who did so (for example, if you received it as a gift).

NoZebra120vClip · 2 years ago
> in order to claim fair use, you have to be legally in possession of the work.

Which work? The original work, or the derivative work that you're using?

Wikipedia uses non-free content all the time, and they're not purchasing albums to do it. Wikipedia reduces album covers, for example, to low resolution, so that they could not be reused to reproduce a real cover, for example. Sometimes Wikipedia uses screencaps of animated characters, for example, under their non-free content policies. They don't own original copies, they're just hosting low-resolution reproductions. I don't even know what entity would be required to be "legally in possession of the work" for that to be a thing. Could you cite a source, maybe?

danans · 2 years ago
> you must have legally purchased a copy, or been given it by someone who did so (for example, if you received it as a gift).

I am also NAL, but I imagine it goes further than that. Just purchasing a copy doesn't let you create and sell (directly as content or indirectly via a service like a chatbot) derivative works that are substantially similar in style and voice to the original work.

For example, an LLM 's response to the request:

"Write a short story about a comical trip to the nail salon in the style of Sarah Silverman"

... IMO doesn't constitute fair use, because the intellectual property of the artist is their style even more than the content they produce. Their style, built from their lived human experience, is what generates their copyrighted content. Even more than the content, the artist's style should be protected. The fact that a technology exists that can convincingly mimic their style doesn't change that.

One might then ask, well what about artists mimicking each others work? Well, any artist with a shred of integrity will credit their major influences.

We should hold machines (and their creators) to an even tougher standard than we hold people when it comes to mimicry. A real person can be inspired and moved by another person's artistic work such that they mimic it. Inspiration means nothing to a machine.

pkilgore · 2 years ago
The "fun" part about cases like this is that we don't really know what the contours of the law are as applied to training data like this. Illegally downloading a book is an independent act of infringement (to my recollection at least). So I'm not sure that it matters if you eventually trained an LLM with it vs read for your own enjoyment. But we will see! Fair use is a possibility here but we need a court to apply the test and that will probably go up to SCOTUS eventually.

Deleted Comment

itronitron · 2 years ago
>> blog authors weren't giving consent either.

That is a good point, since copyright is a default protection of works created by people.

edent · 2 years ago
Are we all reading the same complaint?

They say:

> in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.”

Does that stack up?

The Meta Paper - https://arxiv.org/pdf/2302.13971.pdf - says:

> We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020)

The Pile Paper - https://arxiv.org/abs/2101.00027 - says it was trained (in part) on "Books3" which it describes as:

> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).

Shawn Presser's link is at https://twitter.com/theshawwn/status/1320282149329784833 and he describes Book3 as

> Presenting "books3", aka "all of bibliotik" - 196,640 books - in plain .txt

I don't have the time and space to download the 37GB file. But if Silverman's book is in there... isn't this a slam dunk case?

Meta's LLaMA is - as they seem to admit - trained on pirated books.

banana_giraffe · 2 years ago
> But if Silverman's book is in there

It is:

    $ grep -i "Sarah Silverman" books3.list.txt
         325196 books3/the-eye.eu/public/Books/Bibliotik/T/The Bedwetter - Sarah Silverman.epub.txt

For anyone who just wants to see the list of files (itself a big file): https://gist.githubusercontent.com/Q726kbXuN/e4e9919a2f5d81f...

dragonwriter · 2 years ago
> I don't have the time and space to download the 37GB file. But if Silverman's book is in there... isn't this a slam dunk case?

Yes, and no.

It's pretty much a slam dunk case that, insofar as the initial training required causing a copy of the corpus defined by the tracker to be made as part of the process, it involved an act violating copyright.

Whether that entitles Silverman to any remedy beyond compensation for (maybe treble damages for) the equivalent of the purchase price of the book depends on... well, basically the same issues of how copyright relates to model training (and an additional argument about whether the illicit status of the material before the training modifies that).

moffkalast · 2 years ago
We don't seem to be reading the same thing, you're pulling Google out of thin air somewhere.
edent · 2 years ago
I'm literally quoting The Verge article and following the links they present...
pessimizer · 2 years ago
You might as well be complaining about the grammar. This is what was said in the article.

> The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”

phoehne · 2 years ago
I think this will be a bigger issue than some people think. Maybe there's a market for 'clean' training data that doesn't include potential copyright claims. Just public domain works. We'll know it's an AI because it talks like a late 18th century/early 19th century writer?
Natsu · 2 years ago
This isn't completely new, similar issues came up with search engines and this may be seen as 'transformative'. But there may be issues with models that happily reproduce copyrighted texts in their entirety along with other novel issues like models that hallucinate defamatory things or other such problems.

Still, I doubt this particular genie can be stuffed back into the bottle, so we'll probably see a lot of litigation and work on alignment, etc. along with new types of abuse.

phoehne · 2 years ago
I agree it's not an entirely new issue. But it's a little different from search results. Say I use the generative paint brush in photoshop. It reproduces a portion of the copyrighted work. I then use the image on an advertising campaign, other merchandise, or post the final product as my own work. Would I be responsible? Would Adobe? Given that retraining these models is not simple, or cheap, would this be just 'cost of doing business?' Would I be able to buy insurance for this?

Enquiring minds want to know.

stusmall · 2 years ago
I hope it happens. I'd love to see a market for selling training licenses to IP. This could be a small but real source of passive income for artists, authors, and poets who don't mind their IP being used in training sets. It wouldn't be practical to negotiate individually with each artist, but I could see something working with larger collectives that can vouch for the quality of their members. Think publishers, galleries, guilds, or unions. A collective could offer a license and then share the proceeds with all members.

It's just flat-out unethical for LLMs to just soak up all this data, even off torrent sites(!!!), without any consent or agreement with the IP holders. A model like this could be a win for everyone.

dragonwriter · 2 years ago
> I think this will be a bigger issue than some people think. Maybe there's a market for 'clean' training data that doesn't include potential copyright claims.

Until this issue is resolved, that will have some value as risk mitigation.

Once it is resolved, it will either be a complete non-issue or an issue related to a much more knowable cost/benefit tradeoff.

> We'll know it's an AI because it talks like a late 18th century/early 19th century writer?

A mix of that and US government publications (which are categorically not subject to copyright).

kragen · 2 years ago
there's a market for 'clean' jurisdictions that don't consider training neural networks to violate copyright, and japan has already declared itself such a jurisdiction
Tao3300 · 2 years ago
I think you mean 19th/20th, but that would be quite hilarious.
phoehne · 2 years ago
Dear Mr. Smith:

Your employer has generously agreed to offer you a position, quite reasonable, and with many great benefits at the venerable firm of 'Zumba'. Our interest is that you should join our staff forthwith and at the earliest date. A cab and man has been sent to retrieve you and bring you to our offices to sign all the necessary documents. Our offer is for a monthly stipend of five pounds, two shillings, and sixpence to be paid at the end of the month.

Thank you,

Most Humbly,

Hirebot 2347

echelon · 2 years ago
I think the onus should be on Sarah Silverman to prove which inputs were hers and which outputs leveraged those inputs.

I think she should pay all of the court costs if she fails to do so.

tjmc · 2 years ago
Well yeah - that's how the legal system works assuming it gets all the way to court. In reality, Meta and OpenAIs lawyers will do a risk evaluation against the strength of the claim and if there's any merit at all there will be a quiet settlement.
preseinger · 2 years ago
how could someone "prove" which inputs and outputs of a large ML model leveraged any specific data?
jeffgreco · 2 years ago
I mean I’m no lawyer but this doesn’t strike me as a great example for infringement? Detailed summaries of books sounds like textbook transformative use. Especially in Silverman’s case, reducing her book to “facts” while eliminating artistic elements of her prose make it that much less of a direct substitute for the original work.
binarymax · 2 years ago
I can see a good argument in the complaint. The provenance of the training data leads back to it being acquired illegally. Illegally acquired materials were then used in a commercial venture. That the venture was an AI model is perhaps beside the point. You can’t use illegally acquired materials when doing business.
getmeinrn · 2 years ago
>You can’t use illegally acquired materials when doing business.

This vague sentence conjures images of a company building products from stolen parts, but this situation seems different. IANAL, but if I looked at a stolen painting that nobody had ever seen, and sold handwritten descriptions of the painting to whoever wanted to buy one, I'm pretty sure what I've sold is not illegal.

refulgentis · 2 years ago
It seems like a weak argument, in that it is just as likely it saw any number of things about it, from book reviews to sales listings to interviews.
jrm4 · 2 years ago
The more I think about it, I think it will (and should) turn on the extent to which "the law" considers AIs to be more like "people" or more like "machines". People can read and do research and then spit out something different.

But "feeding the data into a machine" seems like obvious infringement, even if the thing that comes out on the other end isn't exactly the same?

georgeoliver · 2 years ago
Perhaps not, I thought one of the claims is interesting though, that they illegally acquired some of the dataset. What would be the damages from that, the retail price of the hardcopy?
getmeinrn · 2 years ago
Wouldn't they first need to prove that OpenAI ingested the book itself, and not just summaries of it?
banana_feather · 2 years ago
The remedies under Title 17 are an injunction against further distribution, disgorgement or statutory damages, and potentially attorneys fees. The injunction part is why these cases usually settle if the defendant is actually in the wrong.
dclowd9901 · 2 years ago
Could a suitably prompted LLM repeat, verbatim, the book in its entirety?
jeffgreco · 2 years ago
Perhaps? But certainly not what’s shown here.
citizenkeen · 2 years ago
Haven't read the complaint, but there might be an argument that OpenAI used stolen works to train their models, and as such fair use doesn't apply.
int_19h · 2 years ago
> The lawsuit against OpenAI alleges that summaries of the plaintiffs’ work generated by ChatGPT indicate the bot was trained on their copyrighted content. “The summaries get some details wrong” but still show that ChatGPT “retains knowledge of particular works in the training dataset," the lawsuit says.

Setting aside the whole issue of whether LLM constitutes a derived work of whatever it's trained on, this sounds like a very weak argument to me. An LLM trained on numerous summaries of the works would also be capable of producing such summaries itself even if the works were never part of the training set. In general, having knowledge about something is not evidence of being trained on it.

TheDong · 2 years ago
That isn't firm evidence, but courts don't need firm evidence to start a case and discover new facts.

They very well can ask LLM experts, and OpenAI themselves, whether that output is highly likely to have been derived from the copyrighted work in question.

Anyway. If the argument is "No, it's not from the book, it's from someone else's copyrighted summary", that just means the person who wrote such a summary needs to instead sue for copyright infringement, right? Unless OpenAI then turns around and says "actually, no, not the summary, the full book".

visarga · 2 years ago
> that just means the person who wrote such a summary needs to instead sue for copyright infringement right?

Doesn't need to be a person, could be another AI that wrote the summaries. I see a big problem for copyrights looming on the horizon - LLMs can reword, rewrite or generate input-output pairs using copyrighted data as reference, thus creating clean data for training. AI cleanly separates knowledge from expression. And maybe it should do so just to reduce inconsistencies and PII in organic text.

Copyrights should only be concerned with expression not knowledge, right? Protecting knowledge is the object of patents, and protecting names the object of trademarks. Copyright is only related to expression otherwise it would become too powerful. For example, instead of banning reproduction of this paragraph, it would also cover all its possible paraphrases. That would be like owning an idea, the "*" version, not a unique sequence of words.

Does it even make sense to talk about copyrights when everything can be remade in many ways so easily? Copyright was already suffering greatly since zero cost copying became a thing, now LLMs are dealing the second blow. It's just a fig leaf by now.

If we take a step back, it's all knowledge and language, self replicating memes under an evolutionary force. It's language evolution, or idea evolution. We are just supporting it by acting as language agents, but now LLMs got into the game, so ideas got a new vector of self replication. We want to own this process piece by piece but such a thing might be arrogant and go against the trend. Knowledge wants to be free, it wants to mix and match, travel and evolve. This process looks like biology, it has a will of its own.

Deleted Comment

cgriswald · 2 years ago
A summary can be written in such a way as to violate copyright itself. So even if they say "We trained it on the following summaries:...," there could be an issue.
imglorp · 2 years ago
There's an interesting nuance here if you were to put a human in the place of the LLM. We have read thousands of works; does that mean anything we write is derivative?
TheDong · 2 years ago
Humans are special and can create new copyrights. The process of a human brain synthesizing stuff does act as a barrier to copyright infringement.

Machines and algorithms are not legally recognized as being able to author original non-derivative works.

> put a human in the place of the LLM

But also, no, if you have a team of humans doing rote matrix multiplication instead of an LLM, that does not make it so the matrix multiplication removes copyright. Also, at this point LLMs require so much math that you can't replace them with humans, even if the humans have quite fast fingers and calculators.

getmeinrn · 2 years ago
A more convincing exhibit would have been convincing ChatGPT to output some of the text verbatim, instead of a summary. Here's what I got when I tried:

    I'm sorry for the inconvenience, but as of my knowledge cutoff in September 2021, I don't have access to specific external databases, books, or the ability to pull in new information after that date. This means that I can't provide a verbatim quote from Sarah Silverman's book "The Bedwetter" or any other specific text. However, I can generate text based on my training and knowledge up to that point, so feel free to ask me questions about Sarah Silverman or topics related to her work!
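
A minimal sketch of what such a verbatim probe could look like, assuming the OpenAI Python client; the model name, prompt wording, and the passage placeholder are assumptions, and a refusal or a divergent continuation would prove nothing either way:

    # Prompt the model with the opening words of a passage and compare its
    # continuation against the real text; high overlap would suggest
    # memorization. Sketch only.
    from difflib import SequenceMatcher
    from openai import OpenAI

    client = OpenAI()
    known_passage = "..."  # opening paragraph, taken from a legitimately obtained copy
    prefix = " ".join(known_passage.split()[:20])

    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Continue this text word for word: {prefix}"}],
    )
    continuation = resp.choices[0].message.content or ""
    overlap = SequenceMatcher(None, known_passage, prefix + " " + continuation).ratio()
    print(f"Similarity to the real passage: {overlap:.2f}")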

simion314 · 2 years ago
Maybe you missed this discussion: https://news.ycombinator.com/item?id=36400053 It seems OpenAI is aware that their software outputs copyrighted stuff, so they attempted some quick-fix filter. So the fact that it will not output the book for us when we ask does not prove that the AI has not memorized big chunks of it; there might just be some "safety" filter involved, and you need some simple trick to get around it.
shultays · 2 years ago
I tried making ChatGPT output the first paragraph of Lord of the Rings before; it goes silent after the first few words. Looks like the devs are filtering it out.
Jolter · 2 years ago
GPT is a lossy jpeg of the whole Internet. It’s not possible to extract verbatim text from it, due to how neural networks work.

How do you think they would fit exabytes of text data into a gigabyte-sized neural network? That’s right, it’s lossy.

getmeinrn · 2 years ago
>It’s not possible to extract verbatim text from it

I didn't ask for the whole book, I asked for the first paragraph. It absolutely is possible to get verbatim text from chatgpt.