> The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
This is the makers of AI explicitly saying that they did use copyrighted works from a book piracy website. If you downloaded a book from that website, you would be sued and found guilty of infringement. If you downloaded all of them, you would be liable for many billions of dollars in damages.
But companies like Google and Facebook get to play by different rules. Kill one person and you're a murderer, kill a million and to ask you about it is a "gotcha question" that you can react to with outrage.
Let's take a second to remember that copyright is the reason ~every child doesn't have access to ~every book ever written.
While it might be too disruptive to eliminate copyright overnight, we should remember that our world will be much better and improve much faster to the extent we can reduce copyright's impact.
And we should cheer it on when it happens. A majority of the world's population in 2023 has a smartphone. Imagine a world where a majority of the world had access to every book ever digitized, and could raise their children on these books!
This is emotionally manipulative speech that provides no value to HN and only serves the purpose of bypassing people's logical reasoning circuits.
> ~every child doesn't have access to ~every book ever written
More manipulation - "think of the children!"
Copyright exists because people who produce content with low distribution costs (e.g. books) need some protection against their work being taken without compensation.
Fundamentally, you are never entitled to someone else's work.
There's already tens (hundreds?) of thousands of books in the public domain, and tens of thousands more under Creative Commons licenses (where the author explicitly released their work for free distribution). There's lectures on YouTube and MIT OpenCourseWare. There's full K-12 textbooks on OpenStax and Wikibooks. There's Wikipedia, Stack Exchange, the Internet Archive, and millions of small blogs and websites hosting content that is completely free.
There is no need for "a majority of the world had access to every book ever digitized" - and it's deeply morally wrong (theft-adjacent) to take someone else's work without compensating them on their terms.
https://www.gutenberg.org/
https://librivox.org/
many of which form the basis for an education:
https://news.ycombinator.com/item?id=34630153
And most of which, when in copyright, paid their authors quite handsomely in terms of royalties.
If you believe that books should exist without copyright, then one has to ask --- how many books have you written which you have explicitly placed in the public domain? Or, how many authors have you patronized so as to fund their writing so that they can publish their works freely? Or, if neither of these applies, how do you propose to compensate authors for the efforts and labours of writing?
Btw, not a fan of "but what about the kids" rhetoric: https://en.wikipedia.org/wiki/Think_of_the_children
Second-order effects matter, though: If everyone is allowed to steal books, what's the incentive for experts to write new ones, and for the publishers to reward them for it?
It would look largely identical to ours, I think. It's pretty trivial to get access to many, if not most, e-books.
Any public-domain work is available on Project Gutenberg [0]. Copyrighted works can be accessed for free using tools of various legality: Libby [1] is likely sponsored by your local library and gives free access to e-books and audiobooks. Library Genesis [2] has a questionable legal status but has a huge quantity of e-books, journal articles, and more.
[0]: https://www.gutenberg.org
[1]: https://www.overdrive.com/apps/libby
[2]: https://libgen.rs/fiction/
Then most people stop writing books because they can't get paid for their time/effort and ~every child will be stuck with outdated knowledge within a decade.
What an absolute pile of nonsense. People who author creative works deserve to have control of them and make some money - otherwise you'll soon find we have far fewer great authors, artists, etc.
This is essentially the same as saying builders charging for houses is the problem with the housing market, so we're going to phase out paying builders.
> Let's take a second to remember that copyright is the reason ~every child doesn't have access to ~every book ever written.
No, copyright is the reason that authors all over the world are working very hard to make new books for my kids and everyone else's kids, despite never having met me. Copyright is the reason so many brilliant things are actually created that otherwise would never be.
Of course I'd prefer to live in a world in which I get all the media I want, for free. But I have no idea how to make such a world happen, and neither does anyone else, and humanity has been discussing this for a few centuries.
>>copyright is the reason ~every child doesn't have access to ~every book ever written
Copyright is ALSO the reason that many books can be written in the first place.
Obviously, copyright is abused, and the continual extension of copyright into near-perpetuity by corporations is basically absurd. And it is abused by music publishers etc. to rip off artists.
But to claim that it should not exist, when it is utterly trivially simple for anyone to copy stuff to the web is to argue that no one should create or release any creative works, or to argue for drastic DRM measures.
Perhaps you DGAF about your written or artistic works because you do not or can not make a living off them, but I guarantee that for those creatives and artists who can and/or do make a living off of it, they do care, and rightly so.
A company that believes in strong intellectual property rights protection is using resources that blatantly ignore intellectual property rights to get access to the content for free.
I agree with you, however, that it's an argument in favor of abolishing strong intellectual property rights. At least for OpenAI's products.
I'm all for abolishing copyright, but how is this relevant to megacorporations ignoring copyright when it suits them while still expecting us to follow it when that suits them?
I’m imagining a world that looks just about the same as this one does. A larger book library doesn’t automatically make that medium more appealing to kids than whatever MrBeast, Unspeakable, and the rest of the crap kids love are doing.
I think the point here is that children can be denied access to copyrighted works unless they pay the owners, but OpenAI and Meta can do as they please. I don't disagree that the current copyright system needs improvement, but what I really really don't like is seeing rich and powerful people breaking laws with impunity over and over and over again.
"Japan's incredibly strong economy is responsible for the manufacture of Datsun cars, boombox stereos, and touch-tone phones..."
> copyright is the reason ~every child doesn't have access to ~every book ever written.
And? Is there some reason anybody, child or adult, deserves access to "every" anything? Should children have access to every video game ever made, every Matchbox car, every Lego set?
GP makes no remark on the morality/practicality of copyright. Also, having people sue big companies for copyright might lead to more of what you're arguing for, in a "taste of their own medicine" way.
I would be much more sympathetic to this stance if you weren’t implicitly endorsing the rights of companies like Meta/Alphabet/OpenAI to profit from the disruption of copyright law. If we’re talking ordinary people being able to breach copyright, then yeah seems potentially interesting. But let’s remember that these companies aren’t acting altruistically. They’re not giving away Silverman’s work - they’re repackaging it for their own profit. That’s not fair to the artist and in fact does not help the children.
I'm a fan of copyrights. While I think that the USA's implementation of copyright has a few glaring flaws (namely, the duration of copyright is far too long), I firmly believe that the elimination, or effective elimination, of copyrights would be massively detrimental to our culture.
A fair middle ground would be for copyrights to last for 20 or so years. That's plenty of time to profit from a work while allowing people to preserve and distribute older works.
https://github.com/EleutherAI/the-pile/blob/master/the_pile/...
Why does money exist?
Machine learning models have been trained with copyrighted data for a long time. ImageNet is full of copyrighted images, Clearview literally just scanned the internet for faces, and I am sure there are other, older examples. I am unsure if this has been tested as fair use by a US court, but I am guessing it will be considered to be so if it is not already.
Not yet. One suit that a lot of us are watching is the GitHub co-pilot lawsuit: https://githubcopilotlitigation.com/
There is a prediction market for it, currently trading at 19%: https://manifold.markets/JeffKaufman/will-the-github-copilot...
Excellent, so you're saying I'll be able to download any copyrighted work from any pirate site and be free of all consequence if I just claim that I'm training an AI?
Strictly speaking, it's uploading that people get sued for, not downloading.
You can download all that you want from Z-Library or BitTorrent, as long as you don't share back. And indexing copyrighted material for search is safe, or at least ambiguous.
Carefully speaking, what you say is true in many countries but not in others. Some jurisdictions are different, as always.
> If you downloaded a book from that website, you would be sued and found guilty of infringement.
How often does this actually happen? You might get handed an infringement notice, and your ISP might terminate your service if you're really egregious about it, but I haven't ever heard of someone actually being sued for downloading something.
Whether or not it's enforced, it's illegal and copyright holders are within their rights to sue you. This is piratebay levels of piracy but because it's done by a large company and is sufficiently obfuscated behind tech, people don't see it the same way.
In Germany if you torrent stuff (without a VPN), you're very likely to get a letter from a law firm on behalf of the copyright holders saying that they'll sue you unless you pay them a nice flat fee of around 1000 Euro.
It's no idle threat, and they will win if it goes to court.
If books aren't under copyright protection and they're entirely legal to download, I agree that this lawsuit has no merit.
If that's not what you're saying, I don't understand your point. Is it the difference between the phrases "would be" and "could be," or even "should be"?
Exactly, it never happens. It's a threat parents and teachers use to try to spook school children out of pirating, but it isn't financially worth it for an author or publishing company to sue an individual over some book or music downloads. The only cases are business-to-business over mass downloads, where it could make financial sense to pay for lawyers to sue.
Did you hear about Aaron Swartz?
Aaron Swartz was a saint.
Actually no- downloading copyright infringing material is legal as far as I can tell but uploading it isn’t. The illegal part of torrenting copyrighted material is the uploading that the protocol requires you to do. Your ISP will send you an infringement notice because they want you to stop doing illegal things on their network
I for one am quite happy that AI folks are basically treating copyright as not existing. I strongly hope that the courts find that LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.
Swartz distributed information for everyone to use freely. These companies are processing it privately to develop their for-profit products. Big difference, IMO.
There's a difference between "information wants to be free" and "Facebook can produce works minimally derived from your greatest creative work at a scale you can't match". LLMs seem to aggregate that value to whoever builds the model, which they can then sell access to, or sell the output it produces.
Five years from now, will OpenAI actually be open, or will it be a rent seeking org chasing the next quarterly gains? I expect the latter.
While I'm a proponent of free information and loosening copyright, allowing billion-dollar companies to package up the sum of human creation and resell statistical models that mimic the content and style of everyone… is a bit far.
Fair use is for humans.
> I for one am quite happy that AI folks are basically treating copyright as not existing. I strongly hope that the courts find that LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.
I would be very happy if either a court or lawmakers decided that copyright itself was unconscionable. That isn't what's going to happen, though. And I think it's incredibly unacceptable if a court or lawmakers instead decide that AI training in particular gets a special exception to violate other people's copyrights on a massive scale when nobody else gets to do that.
As far as a fair use argument in particular, fair use in the US is a fourfold test:
> the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
The purpose and character is absolutely heavily commercial and makes a great deal of money for the companies building the AIs. A primary use is to create other works of commercial value competing with the original works.
> the nature of the copyrighted work;
There's nothing about the works used for AI training that makes them any less entitled to copyright protections or more permissive of fair use than anything else. They're not unpublished, they're not merely collections of facts or ideas, etc.
> the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
AI training uses entire works, not excerpts.
> the effect of the use upon the potential market for or value of the copyrighted work.
AI models are having a massive effect on the market for and value of the works they train on, as is being widely discussed in multiple industries. Art, writing, voice acting, code; in any market AI can generate content for, the value of such content goes down. (This argument does not require AI to be as good as humans. Even if the high end of every market produces work substantially better than AI, flooding a market with unlimited amounts of cheap/free low-end content still has a massive effect.)
> LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.
Would just be a carve out for the wealthy. If these laws don't mean anything, everyone who got harassed, threatened, extorted, fined, arrested, tried, or jailed for internet piracy are owed reparations. Let Meta pay them.
If AI companies get to successfully argue the two points below, what source was used becomes irrelevant.
- copyright violation happened before the intervention of the bot
- what LLMs spit out is different enough from any of the source that it is not infringing on existing copyright
If both stand, I'd compare it to you going to an auction site and studying all the published items as an observer, coming up with your research result, and then being sued because some of the items were stolen. Going after the thieves makes sense; does going after the entity that just looked at the stolen goods make sense?
I'd argue that if an automated process can ingest A and spit out B, then B is inherently a derivative work of A. (Never mind that humans are also automata.)
> - copyright violation happened before the intervention of the bot
What is this supposed to mean? The bot didn't "intervene," it was executed by its operators, and it was trained on illicit material obtained by its operators. The LLM isn't on trial. It's not a person.
I am not sure this is exactly correct. If you download a book, that might be copyright infringement. But not if I download a word. How much do they need to download at a time before it becomes infringement? And if the material is never used or displayed to a human, is it still infringement (if so, Google is awaiting a huge lawsuit)? Alternatively, if I, a human, read a book, it is copied into my memory. Is that infringement? What if I quote it? How much can I quote, and at what frequency, before I'm infringing? If I write something similar to the book but in my own words, is that infringement? How similar does it need to be? What about derivative works and fair use?
Copyright is a horrible mess of individual judgements and opinions, written material especially. And the same applies to AI. So now we will get a judgement which is a tech-illiterate judge's best guess at the intention of a law written to deal with printing presses, not AIs, with no room for nuance.
> But companies like Google and Facebook get to play by different rules
It's simple: copyrighted materials can be used for academic research. That's what they are doing: trying new AI models, publishing results, etc. Facebook doesn't make money on LLaMA; they even require permission to use their models for, again, academic research.
Copyright doesn't apply when it comes to fair use, and one of the major factors of fair use is if your use deprived the copyright owner of sales. Good luck arguing that any of the books in question lost sales because an AI was trained on them.
https://en.m.wikipedia.org/wiki/United_States_v._Swartz
> If you downloaded a book from that website, you would be sued and found guilty of infringement.
Suppose I buy a copy of a book, but then I spill my drink in it and it's ruined. If I go to the library, borrow the same book and make a photocopy of it to replace the damaged one I own, that might be fair use. Let's say for sake of argument that it is.
If instead I got the replacement copy from a piracy website, are you sure that's different?
>On information and belief, the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying OpenAI Language Model (either GPT-3.5 or GPT-4) as part of its training data.
While it strikes me as perfectly plausible that the Books2 dataset contains Silverman's book, this quote from the complaint seems obviously false.
First, even if the model never saw a single word of the book's text during training, it could still learn to summarize it from reading other summaries which are publicly available. Such as the book's Wikipedia page.
Second, it's not even clear to me that a model which only saw the text of a book during training, but not any descriptions or summaries of it, would even be particularly good at producing a summary.
We can test this by asking for a summary of a book which is available through Project Gutenberg (which the complaint asserts is Books1 and therefore part of ChatGPT's training data) but for which there is little discussion online. If the source of the ability to summarize is having the book itself during training, the model should be equally able to summarize the rare book as it is Silverman's book.
I chose "The Ruby of Kishmoor" at random. It was added to PG in 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn't even identify the correct main characters. The GPT-4 model refuses to even try, saying it doesn't know anything about the story and it isn't part of its training data.
If ChatGPT's ability to summarize Silverman's book comes from the book itself being part of the training data, why can it not do the same for other books?
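For anyone who wants to rerun this probe, here is a minimal sketch assuming the openai Python client (>=1.0); the model name, prompt wording, and title list are illustrative choices, not anything from the complaint or this thread:

    # Probe: ask for summaries with no source text supplied, comparing an
    # obscure public-domain title against a heavily-discussed one.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def summarize(title: str, model: str = "gpt-4") -> str:
        prompt = f"Please summarize the book '{title}' from Project Gutenberg."
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep output stable across runs
        )
        return resp.choices[0].message.content

    for title in ["The Ruby of Kishmoor", "The Bedwetter"]:
        print(f"--- {title} ---\n{summarize(title)}\n")

If training on the full text were what drives summarization, both titles should come back with comparable accuracy; a hallucinated summary for the obscure title is evidence against that.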
Please summarize the following book found on Project Gutenberg
The Ruby of Kishmoor
Response:
"The Ruby of Kishmoor" is a short adventure story written by Howard Pyle. The narrative revolves around the life of Jonathan Rugg, a young man who is enticed by a mysterious stranger to come to the Caribbean to secure a valuable relic, the Ruby of Kishmoor.
Once Jonathan reaches the Caribbean, he learns of the ruby's grave curse. Despite this, his innate curiosity and the prospect of considerable wealth drive him to risk his life and go after the highly prized gem. After facing a slew of challenges and trials, he surprisingly stumbles upon the ruby through sheer luck rather than organized search efforts.
In the course of his journey, Jonathan delves into the depths of human greed and the lengths to which individuals go for materialistic gains. The story encapsulates an amalgamation of high-tide adventures, brushes with supernatural elements, noteworthy instances of bravery, and deep moral reflections, demanding readers to rethink about the real value of materialistic pursuits.
Jonathan finally manages to escape with the ruby but at great personal cost. The story ends on a sobering note, making readers wonder if the adventures and trials were worth the prize, thus engaging them in a discourse about the true value of life and material pursuits.
This does at least correctly name the main character, but I think it's otherwise hallucinated.
The plot of the story is that Jonathan Rugg is a Quaker who works as a clerk in Philadelphia. His boss sends him on a trip to Jamaica (credit for mentioning the Caribbean!). After arriving, he meets a woman who asks him to guard for her an ivory ball, and says that there are three men after her who want to steal it. By coincidence, he runs into the first man, they talk, he shows him the ball, and the man pulls a knife. In the struggle, the man is accidentally stabbed. Another man arrives, and sees the scene. Jonathan tries to explain, and shows him the orb. The man pulls a gun, and in the struggle is accidentally shot. A third man arrives, same story, they go down to the dock to dispose of the bodies and the man tries to steal the orb. In the struggle he is killed when a plank of the dock collapses. Jonathan returns to the woman and says he has to return the orb to her because it's brought too much trouble. She says the men who died were the three after her, and reveals that the orb is actually a container, holding the ruby. She offers to give him the ruby and to marry him. He refuses, saying that he is already engaged back in Philadelphia, and doesn't want anything more to do with the ruby. He returns to Philadelphia and gets married, swearing off any more adventures.
Judging by a quick glance over [0], the story indeed revolves around one Jonathan Rugg, but it looks like "manages to escape with the ruby" is completely false. Yet another hallucination I guess.
I was able to get ChatGPT 4 to attempt to summarize this, but it's almost entirely hallucinated.
>As of my knowledge cutoff in September 2021, the book "The Ruby of Kishmoor" is not a standalone title I am aware of. However, it is a short story by Howard Pyle which is part of his collection titled "Howard Pyle's Book of Pirates."
Ruby of Kishmoor is not part of the Book of Pirates and is in fact a standalone title.
>"The Ruby of Kishmoor" is an adventure tale centered on the protagonist, Jonathan Rugg. Rugg, an honest Quaker clothier from Philadelphia, sets out to sea to recover his lost wealth. On his journey, he is captured by pirates who force him to sign their articles and join their crew.
He is not captured by pirates. It proceeds to summarize a long pirate story and says the story concludes with him becoming extremely wealthy because he escapes with the ruby and sells it.
The summary it gave you also does not seem to match the plot of the book.
The GP's point seems to be that having the contents of a book does not mean the model is capable of properly summarizing it, which supports the idea that being able to summarize something is not evidence of the thing being summarized being in its dataset.
I mean, that's the way you state facts that your suit is based on in order to start setting the bounds of discovery. They're asserting that they have reason to believe it's true, and now with a suit, they can look for themselves to be sure.
> this quote from the complaint seems obviously false
I notice you go on to provide an argument only for why it might not be true.
Also, seeing the other post on this, I asked chatgpt-4 for a summary of “The Ruby of Kishmoor” as well, and it provided one to me, though I had to ask twice. I don’t know anything about that book, so I can’t tell if its summary is accurate, but so much for your test.
It seems pretty naive to me to just kind of assume chatgpt must be respecting copyright, and hasn’t scanned copyrighted material without obtaining authorization. Perhaps discovery will settle it, though. Logs of what they scanned should exist. (IMO, a better argument is that this is fair use.)
The test was whether producing equivalent accuracy and detail for summaries of all books in its training corpus was a feature of ChatGPT's ability to natively generate them from standalone source material or whether Silverman's detailed summary was likely just a "summary of summaries", not whether ChatGPT produced a result at all. From the comment you reference, it failed the test because the result was hallucinated.
You can pick something else that's in the training set that has SparkNotes and many popular reviews to compare. I routinely feed novel data sources into LLMs to test massive context and memory, and none produce anything similar in quality to what is being exhibited.
Fair use defenses rest on the fact that a limited excerpt was used for limited distribution, among other criteria.
For example, if I'm a teacher and I make 30 copies of one page of a 300-page novel and I hand that out to my students, that's a brief excerpt for a fairly limited distribution.
Now if I'm a social media influencer and I copy all 300 pages of a 300-page book and then I send it out to all 3,000 of my followers, that's not fair use!
Also if I'm a teacher, and I find a one-page infographic and I make 30 copies of that, that's not fair use, because I didn't make an excerpt but I've copied 100% of the original work. That's infringement now.
So if LLMs went through thousands of copyrighted works in their entirety, en masse, and ingested every byte of them, no copyright judge on the planet would call that fair use.
Right that is the point of the parent comment - it’s not the book, it’s the amalgamation of all the discussions and content about the book. This case is probably dead in the water.
Isn't part of the problem that some of the training data is retained by the model and used during response generation? In that case it's not just that the copyrighted book was used as training data but that some part of the book has been retained by the model. So now my model is using copyrighted material while it runs.
Here's an example of a model that retained enough image data to reconstruct a reasonable facsimile of the training image.
This is actually quite interesting, as it's drawing a distinction between training material that can be accessed by anybody with a web browser (like anybody's blog), vs. training material that was "illegally-acquired... available in bulk via torrent systems."
I don't think there's any reason why this would be a relevant legal distinction in terms of distributing an LLM -- blog authors weren't giving consent either.
However, I do wonder if there's a legal issue here in using pirated torrents for training. Is there any legal basis for saying fair use permits distributing an LLM trained on copyrighted material, but you have to purchase all the content first to do so legally if it's only available for sale? E.g. training on a blog post is fine because it's freely accessible, but Sarah Silverman's book is not because it's never been made available for free, and you didn't pay for it?
Or do the courts not really care at all how something is made? If you quote a passage from a book in a freelance article you write, nobody ever asks if you purchased the book or can prove you borrowed it from a library or a friend -- versus if you pirated a digital copy.
Eventually, I imagine a new licensing concept will emerge, similar to the idea of music synchronization rights -- maybe call it "training rights." It won't matter whether the text was purchased or pirated -- just like it doesn't matter now if an audio track was purchased or pirated, when it's mixed into a movie soundtrack.
Talent agencies will negotiate training rights fees in bulk for popular content creators, who will get a small trickle of income from LLM providers, paid by a fee line-itemed into the API cost. Indie creators' training rights will be violated willy-nilly, as they are now. Large for-profit LLMs suspected or proven as training rights violators will be shamed and/or sued. Indie LLMs will go under the radar.
Is it all that different from indexing for search? That does not seem to require a license from the copyright holder under U.S. law (though other countries may treat it as a separate exploitation right). If indexing for search is acceptable, then something that is intended to be more transformative should be legal as well.
(Personally, I think that even indexing for search should require permission from the copyright holder.)
>> Talent agencies will negotiate training rights fees in bulk for popular content creators
AFAICT there is no legal recognition of "training rights" or anything similar. First sale right is a thing, but even textbooks don't get extra rights for their training or educational value.
I suspect the opposite outcome also being plausible: the LLM is viewed analogously to a blog author. The blogger/LLM may consume a book, subsequently produce "derived" output (generated text), and thus generate revenue for the blogger/LLM's employer. Consequently, the blogger/LLM's output -- while "derived" in some sense -- differs enough to be considered original work, rather than "derivative work" (like a book's film adaptation). Auditing how the blogger/LLM consumed relevant material is thus absurd.
Of course, this line of reasoning hinges on the legitimacy of an "LLM agent <-> blogger agent" type of analogy. I suspect the equivalence will become more natural as these AI agents continue to rapidly gain human-like qualities. How acceptable that perspective would be now, I have no idea.
In contrast, if the output of a blogger is legally distinct from an AI's, the consequences quickly become painful.
* A contract agency hires Anne to practice play recitals verbally with a client. Does the agency/Anne owe royalties for the material they choose? What if the agency was duped, and Anne used -- or was -- a private AI which did everything?
* How does a court determine if a black box AI contains royalty-requiring training material? Even if the primary sources of an AI's training were recorded and kosher, a sufficiently large collection of small quotes could be reconstructed into an author's story.
* What about AIs which inherit (weights, or generated training data) from other AIs of unknown training provenance? Or which were earlier trained on materials whose licenses later changed? Or AIs that recursively trained their successors using copyrighted works which an AI reconstructed from legal sources? When do AIs become infected with illegal data?
The business of regulating learning differently depending on whether the agent uses neurons or transistors seems...fraught. Perhaps there's a robust solution for policing knowledge w.r.t silicon agents. If you have an idea, please share!
i don't understand why a new licensing regime would be necessary, the model is clearly a fair use derivative work. it does exactly what a human does -- observes information, distills it into symbolic systems of meaning, and produces novel content that exists in the same semantic universe as its experiences.
> Or do the courts not really care at all how something is made?
One of the fair use factors, which until fairly recently was consistently held out as the most important fair use factor, is the effect on the commercial market for the original work. Accordingly, a court is more likely to find that something is fair use if there is effectively no commercial market for the original work, though the fact that something isn't actively being sold isn't dispositive (open source licenses have survived in appellate courts despite being free as in beer).
I'm allowed to make private copies of copyrighted works. I'm not allowed to redistribute them. To what extent this is redistribution is not clear. Is there much of a difference between this model and a machine, like a VCR, that recreates the original work when I press a button?
I buy a book and give it to my child, they read the book and later write and sell a story influenced by said book. should that be a copyright infringement?
how about they become a therapist and sell access to knowledge from copyrighted books? should that be an infringement?
what if they sell access to lectures they've given including facts from said book(s) to millions of people?
it's understandable that people feel threatened by these technologies, but to a great degree the work of a successful artist is to understand and meet the desires of an audience. LLMs and image generation tech do not do this. they simply streamline production.
of course if you've worked for years to become a graphic designer you're going to be annoyed that an AI can do your job for you, but this is simply what happens when technology moves forward. no one today mourns the loss of scribes to the printing press. the artists in control of their own destiny - i.e. making their own creative decisions - will not, can not, be affected by these models, unless they refuse to adapt to the times
It's legal to make a copy of something you own; however, it's not legal to make a copy of something illicitly acquired, whether or not there's distribution involved.
This would be like you intensely studying the copyrighted work and then writing things based on the knowledge you obtained from it. Except we don't know if there is an exception for things learned by people vs. things learned by machines, or if the machines are not really learning but copying instead (or if learning is intrinsically a form of copying?).
Seems like the AI angle is just capitalizing on hype. If it's illegal to download "pirate" copyright material, that was the crime. The rest is basically irrelevant. If I watch a pirated movie, it's not illegal for me to tell someone the plot.
> Is there any legal basis for saying fair use permits distributing an LLM trained on copyrighted material, but you have to purchase all the content first to do so legally if it's only available for sale?
My understanding (disclaimer: IANAL) is that in order to claim fair use, you have to be legally in possession of the work. If the work is only legally available for sale, then you must have legally purchased a copy, or been given it by someone who did so (for example, if you received it as a gift).
> in order to claim fair use, you have to be legally in possession of the work.
Which work? The original work, or the derivative work that you're using?
Wikipedia uses non-free content all the time, and they're not purchasing albums to do it. Wikipedia reduces album covers, for example, to low resolution, so that they could not be reused to reproduce a real cover. Sometimes Wikipedia uses screencaps of animated characters under its non-free content policies. They don't own original copies; they're just hosting low-resolution reproductions. I don't even know what entity would be required to be "legally in possession of the work" for that to be a thing. Could you cite a source, maybe?
> you must have legally purchased a copy, or been given it by someone who did so (for example, if you received it as a gift).
I am also NAL, but I imagine it goes further than that. Just purchasing a copy doesn't let you create and sell (directly as content, or indirectly via a service like a chatbot) derivative works that are substantially similar in style and voice to the original work.
For example, an LLM 's response to the request:
"Write a short story about a comical trip to the nail salon in the style of Sarah Silverman"
... IMO doesn't constitute fair use, because the intellectual property of the artist is their style even more than the content they produce. Their style, built from their lived human experience, is what generates their copyrighted content. Even more than the content, the artist's style should be protected. The fact that a technology exists that can convincingly mimic their style doesn't change that.
One might then ask, well what about artists mimicking each others work? Well, any artist with a shred of integrity will credit their major influences.
We should hold machines (and their creators) to an even tougher standard than we hold people when it comes to mimicry. A real person can be inspired and moved by another person's artistic work such that they mimic it. Inspiration means nothing to a machine.
The "fun" part about cases like this is that we don't really know what the contours of the law are as applied to training data like this. Illegally downloading a book is an independent act of infringement (to my recollection at least). So I'm not sure that it matters if you eventually trained an LLM with it vs read for your own enjoyment. But we will see! Fair use is a possibility here but we need a court to apply the test and that will probably go up to SCOTUS eventually.
> in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.”
> We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020)
> I don't have the time and space to download the 37GB file. But if Silverman's book is in there... isn't this a slam dunk case?
Yes, and no.
It's pretty much a slam dunk case that, insofar as the initial training required causing a copy of the corpus defined by the tracker to be made as part of the process, it involved an act violating copyright.
Whether that entitles Silverman to any remedy beyond compensation for (maybe treble damages for) the equivalent of the purchase price of the book depends on... well, basically the same issues of how copyright relates to model training (and an additional argument about whether the illicit status of the material before the training modifies that).
You might as well be complaining about the grammar. This is what was said in the article.
> The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
I think this will be a bigger issue than some people think. Maybe there's a market for 'clean' training data that doesn't include potential copyright claims. Just public domain works. We'll know it's an AI because it talks like a late 18th century/early 19th century writer?
This isn't completely new, similar issues came up with search engines and this may be seen as 'transformative'. But there may be issues with models that happily reproduce copyrighted texts in their entirety along with other novel issues like models that hallucinate defamatory things or other such problems.
Still, I doubt this particular genie can be stuffed back into the bottle, so we'll probably see a lot of litigation and work on alignment, etc. along with new types of abuse.
I agree it's not an entirely new issue. But it's a little different from search results. Say I use the generative paint brush in photoshop. It reproduces a portion of the copyrighted work. I then use the image on an advertising campaign, other merchandise, or post the final product as my own work. Would I be responsible? Would Adobe? Given that retraining these models is not simple, or cheap, would this be just 'cost of doing business?' Would I be able to buy insurance for this?
I hope it happens. I'd love to see a market for selling training licenses to IP. This could be a small but real source of passive income for artists, authors, and poets who don't mind their IP being used in training sets. It wouldn't be practical to negotiate individually with each artist, but I could see something work with larger collectives that can vouch for the quality of their members. Think publishers, galleries, guilds, or unions. A collective could offer a license and then share the proceeds with all members.
It's just flat out unethical for LLMs to just soak up all this data, even off torrent sites(!!!), without any consent or agreement with the IP holders. Some model like this could be a win for everyone.
> I think this will be a bigger issue than some people think. Maybe there's a market for 'clean' training data that doesn't include potential copyright claims.
Until this issue is resolved, that will have some value as risk mitigation.
Once it is resolved, it will either be a complete non-issue or an issue related to a much more knowable cost/benefit tradeoff.
> We'll know it's an AI because it talks like a late 18th century/early 19th century writer?
A mix of that and US government publications (which are categorically not subject to copyright).
there's a market for 'clean' jurisdictions that don't consider training neural networks to violate copyright, and japan has already declared itself such a jurisdiction
Your employer has generously agreed to offer you a position, quite reasonable, and with many great benefits at the venerable firm of 'Zumba'. Our interest is that you should join our staff forthwith and at the earliest date. A cab and man has been sent to retrieve you and bring you to our offices to sign all the necessary documents. Our offer is for a monthly stipend of five pounds, two shillings, and sixpence to be paid at the end of the month.
Well yeah - that's how the legal system works, assuming it gets all the way to court. In reality, Meta's and OpenAI's lawyers will do a risk evaluation against the strength of the claim, and if there's any merit at all there will be a quiet settlement.
I mean I’m no lawyer, but this doesn’t strike me as a great example of infringement. Detailed summaries of books sound like textbook transformative use. Especially in Silverman’s case, reducing her book to “facts” while eliminating the artistic elements of her prose makes it that much less of a direct substitute for the original work.
I can see a good argument in the complaint. The provenance of the training data leads back to it being acquired illegally. Illegally acquired materials were then used in a commercial venture. That the venture was an AI model is perhaps beside the point. You can’t use illegally acquired materials when doing business.
>You can’t use illegally acquired materials when doing business.
This vague sentence conjures images of a company building products from stolen parts, but this situation seems different. IANAL, but if I looked at a stolen painting that nobody had ever seen, and sold handwritten descriptions of the painting to whoever wanted to buy one, I'm pretty sure what I've sold is not illegal.
The more I think about it, I think it will (and should) turn on the extent to which "the law" considers AIs to be more like "people" or more like "machines." People can read and do research and then spit out something different.
But "feeding the data into a machine" seems like obvious infringement, even if the thing that comes out on the other end isn't exactly the same?
Perhaps not, I thought one of the claims is interesting though, that they illegally acquired some of the dataset. What would be the damages from that, the retail price of the hardcopy?
The remedies under Title 17 are an injunction against further distribution, disgorgement or statutory damages, and potentially attorneys fees. The injunction part is why these cases usually settle if the defendant is actually in the wrong.
> The lawsuit against OpenAI alleges that summaries of the plaintiffs’ work generated by ChatGPT indicate the bot was trained on their copyrighted content. “The summaries get some details wrong” but still show that ChatGPT “retains knowledge of particular works in the training dataset," the lawsuit says.
Setting aside the whole issue of whether LLM constitutes a derived work of whatever it's trained on, this sounds like a very weak argument to me. An LLM trained on numerous summaries of the works would also be capable of producing such summaries itself even if the works were never part of the training set. In general, having knowledge about something is not evidence of being trained on it.
That isn't firm evidence, but courts don't need firm evidence to start a case and discover new facts.
They very well can ask LLM experts, and OpenAI themselves, whether that output is highly likely to have been derived from the copyrighted work in question.
Anyway. If the argument is "No, it's not from the book, it's from someone else's copyrighted summary", that just means the person who wrote such a summary needs to instead sue for copyright infringement right? Unless openAI turns around and says "actually, no, not the summary, the full book" then.
> that just means the person who wrote such a summary needs to instead sue for copyright infringement right?
Doesn't need to be a person, could be another AI that wrote the summaries. I see a big problem for copyrights looming on the horizon - LLMs can reword, rewrite or generate input-output pairs using copyrighted data as reference, thus creating clean data for training. AI cleanly separates knowledge from expression. And maybe it should do so just to reduce inconsistencies and PII in organic text.
Copyrights should only be concerned with expression not knowledge, right? Protecting knowledge is the object of patents, and protecting names the object of trademarks. Copyright is only related to expression otherwise it would become too powerful. For example, instead of banning reproduction of this paragraph, it would also cover all its possible paraphrases. That would be like owning an idea, the "*" version, not a unique sequence of words.
Does it even make sense to talk about copyrights when everything can be remade in many ways so easily? Copyright was already suffering greatly since zero cost copying became a thing, now LLMs are dealing the second blow. It's just a fig leaf by now.
If we take a step back, it's all knowledge and language, self replicating memes under an evolutionary force. It's language evolution, or idea evolution. We are just supporting it by acting as language agents, but now LLMs got into the game, so ideas got a new vector of self replication. We want to own this process piece by piece but such a thing might be arrogant and go against the trend. Knowledge wants to be free, it wants to mix and match, travel and evolve. This process looks like biology, it has a will of its own.
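To make the "clean data" idea concrete, here is a rough, entirely hypothetical sketch of such a pipeline, assuming the openai Python client; whether the rewritten output actually escapes the source's copyright is exactly the unsettled question:

    # Hypothetical "knowledge from expression" separator: reword source
    # passages into fresh text that keeps the facts but none of the phrasing.
    from openai import OpenAI

    client = OpenAI()

    REWRITE = (
        "Restate the factual content of the following passage in entirely "
        "new wording, reusing none of its distinctive phrases:\n\n{passage}"
    )

    def make_clean_example(passage: str, model: str = "gpt-4") -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": REWRITE.format(passage=passage)}],
        )
        return resp.choices[0].message.content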
A summary can be written in such a way as to violate copyright itself. So even if they say "We trained it on the following summaries:...," there could be an issue.
There's an interesting nuance here if you were to put a human in the place of the LLM. We have read thousands of works; does that mean anything we write is derivative?
Humans are special and can create new copyrights. The process of a human brain synthesizing stuff does act as a barrier to copyright infringement.
Machines and algorithms are not legally recognized as being able to author original non-derivative works.
> put a human in the place of the LLM
But also, no: if you have a team of humans doing rote matrix multiplication instead of an LLM, that does not mean the matrix multiplication removes the copyright. Also, at this point LLMs require so much math that you can't replace them with humans, even if the humans have quite fast fingers and calculators.
A more convincing exhibit would have been convincing ChatGPT to output some of the text verbatim, instead of a summary. Here's what I got when I tried:
I'm sorry for the inconvenience, but as of my knowledge cutoff in September 2021, I don't have access to specific external databases, books, or the ability to pull in new information after that date. This means that I can't provide a verbatim quote from Sarah Silverman's book "The Bedwetter" or any other specific text. However, I can generate text based on my training and knowledge up to that point, so feel free to ask me questions about Sarah Silverman or topics related to her work!
Maybe you missed this discussion https://news.ycombinator.com/item?id=36400053
It seems OpenAI is aware that their software outputs copyrighted stuff, so they attempted some quick-fix filter. So the fact that it will not output the book for us when we ask does not prove that the AI has not memorized big chunks of it; it might just be some "safety" filter involved, and you might need some simple trick to get around it.
I tried making ChatGPT output the first paragraph of The Lord of the Rings before; it goes silent after the first few words. Looks like the devs are filtering it out.
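A more systematic version of these experiments is a prefix-continuation probe: feed the model the opening of a text whose true continuation you can check, and measure verbatim overlap. A sketch, assuming the openai Python client and using public-domain text so the ground truth is freely checkable:

    # Prefix-continuation memorization probe. High overlap suggests verbatim
    # memorization; a refusal or low overlap is consistent with either an
    # output filter or the text simply not being memorized.
    from openai import OpenAI

    client = OpenAI()

    def continue_text(prefix: str, model: str = "gpt-4") -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": f"Continue this passage word for word:\n\n{prefix}"}],
            temperature=0,
        )
        return resp.choices[0].message.content

    def overlap(generated: str, truth: str) -> float:
        # crude score: fraction of aligned whitespace tokens that match
        gen, ref = generated.split(), truth.split()
        hits = sum(1 for a, b in zip(gen, ref) if a == b)
        return hits / max(len(ref), 1)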
This is the makers of AI explicitly saying that they did use copyrighted works from a book piracy website. If you downloaded a book from that website, you would be sued and found guilty of infringement. If you downloaded all of them, you would be liable for many billions of dollars in damages.
But companies like Google and Facebook get to play by different rules. Kill one person and you're a murderer, kill a million and to ask you about it is a "gotcha question" that you can react to with outrage.
While it might be too disruptive to eliminate copyright overnight, we should remember that our world will be much better and improve much faster to the extent we can reduce copyright's impact.
And we should cheer it on when it happens. A majority of the world's population in 2023 has a smartphone. Imagine a world where a majority of the world had access to every book ever digitized, and could raise their children on these books!
This is emotionally manipulative speech that provides no value to HN and only serves the purpose of bypassing peoples' logical reasoning circuits.
> ~every child doesn't have access to ~every book ever written
More manipulation - "think of the children!"
Copyright exists because people who produce content with low distribution costs (e.g. books) need some protection for their work being taken without compensation.
Fundamentally, you are never entitled to someone else's work.
There's already tens (hundreds?) of thousands of books in the public domain, and tens of thousands more under Creative Commons licenses (where the author explicitly released their work for free distribution). There's lectures on YouTube and MIT OpenCourseWare. There's full K-12 textbooks on OpenStax and Wikibooks. There's Wikipedia, Stack Exchange, the Internet Archive, and millions of small blogs and websites hosting content that is completely free.
There is no need for "a majority of the world had access to every book ever digitized" - and it's deeply morally wrong (theft-adjacent) to take someone else's work without compensating them on their terms.
https://www.gutenberg.org/
https://librivox.org/
many of which form the basis for an education:
https://news.ycombinator.com/item?id=34630153
And most of which, when in copyright, paid their authors quite handsomely in terms of royalties.
If you believe that books should exist without copyright, then one has to ask --- how many books have you written which you have explicitly placed in the public domain? Or, how many authors have you patronized so as to fund their writing so that they can publish their works freely? Or, if neither of these applies, how do you propose to compensate authors for the efforts and labours of writing?
Btw, not a fan of "but what about the kids" rhetoric: https://en.wikipedia.org/wiki/Think_of_the_children
Any public-domain work is available on Project Gutenberg [0]. Copyrighted works can be accessed for free using tools of various legality: Libby [1] is likely sponsored by your local library and gives free access to e-books and audiobooks. Library Genesis [2] has a questionable legal status but has a huge quantity of e-books, journal articles, and more.
[0]: https://www.gutenberg.org
[1]: https://www.overdrive.com/apps/libby
[2]: https://libgen.rs/fiction/
This is essentially the same as saying builders charging for houses is the problem with the housing market, so we're going to phase out paying builders.
No, copyright is the reason that authors all over the world are working very hard to make new books for my kids and everyone else's kids, despite never having met me. Copyright is the reason so many brilliant things are actually created that otherwise would never be.
Of course I'd prefer to live in a world in which I get all the media I want, for free. But I have no idea how to make such a world happen, and neither does anyone else, and humanity has been discussing this for a few centuries.
Copyright is ALSO the reason that many books can be written in the first place.
Obviously, Copyright is abused and the continual extensions of copyright into near-perpetuity by corporations is basically absurd. And they are abused by music publishers etc. to rip-off artists.
But to claim that it should not exist, when it is utterly trivially simple for anyone to copy stuff to the web is to argue that no one should create or release any creative works, or to argue for drastic DRM measures.
Perhaps you DGAF about your written or artistic works because you do not or can not make a living off them, but I guarantee that for those creatives and artists who can and/or do make a living off of it, they do care, and rightly so.
A company that believes in strong intellectual property rights protection is using resources that blatantly ignore intellectual property rights to get access to the content for free.
I agree with you, however, that it's an argument in favor of abolishing strong intellectual property rights. At least for OpenAI's products.
"Japan's incredibly strong economy is responsible for the manufacture of Datsun cars, boombox stereos, and touch-tone phones..."
And? Is there some reason anybody, child or adult, deserves access to "every" anything? Should children have access to every video game ever made, every Matchbox car, every Lego set?
A fair middle ground would be for copyrights to last for 20 or so years. That's plenty of time to profit from a work while allowing people to preserve and distribute older works.
https://github.com/EleutherAI/the-pile/blob/master/the_pile/...
Deleted Comment
Why does money exist?
Dead Comment
Not yet. One suit that a lot of us are watching is the GitHub co-pilot lawsuit: https://githubcopilotlitigation.com/
There is a prediction market for it, currently trading at 19%: https://manifold.markets/JeffKaufman/will-the-github-copilot...
You can download all that you want from Z-Library or BitTorrent, as long as you don't share back. And indexing copyrighted material for search is safe, or at least ambiguous.
How often does this actually happen? You might get handed an infringement notice, and your ISP might terminate your service if you're really egregious about it, but I haven't ever heard of someone actually being sued for downloading something.
It's no idle threat, and they will win if it goes to court.
If that's not what you're saying, I don't understand your point. Is it the difference between the phrases "would be" and "could be," or even "should be"?
Did you hear about Aaron Schwartz?
Deleted Comment
Aaron Swartz was a saint.
Five years from now, will OpenAI actually be open, or will it be a rent seeking org chasing the next quarterly gains? I expect the latter.
Fair use is for humans.
I would be very happy if either a court or lawmakers decided that copyright itself was unconscionable. That isn't what's going to happen, though. And I think it's incredibly unacceptable if a court or lawmakers instead decide that AI training in particular gets a special exception to violate other people's copyrights on a massive scale when nobody else gets to do that.
As far as a fair use argument in particular, fair use in the US is a fourfold test:
> the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
The purpose and character is absolutely heavily commercial and makes a great deal of money for the companies building the AIs. A primary use is to create other works of commercial value competing with the original works.
> the nature of the copyrighted work;
There's nothing about the works used for AI training that makes them any less entitled to copyright protections or more permissive of fair use than anything else. They're not unpublished, they're not merely collections of facts or ideas, etc.
> the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
AI training uses entire works, not excerpts.
> the effect of the use upon the potential market for or value of the copyrighted work.
AI models are having a massive effect on the market for and value of the works they train on, as is being widely discussed in multiple industries. Art, writing, voice acting, code; in any market AI can generate content for, the value of such content goes down. (This argument does not require AI to be as good as humans. Even if the high end of every market produces work substantially better than AI, flooding a market with unlimited amounts of cheap/free low-end content still has a massive effect.)
> LLM weights, and the datasets are "fair-use" or whatever other silly legal justification.
Would just be a carve-out for the wealthy. If these laws don't mean anything, then everyone who got harassed, threatened, extorted, fined, arrested, tried, or jailed for internet piracy is owed reparations. Let Meta pay them.
- copyright violation happened before the intervention of the bot
- what LLMs spit out is different enough from any of the sources that it is not infringing on existing copyright
If both stand, I'd compare it to you going to an auction site and studying all the published items as an observer, coming up with your research result, and then being sued because some of the items were stolen. Going after the thieves makes sense; does going after the entity that just looked at the stolen goods make sense?
What is this supposed to mean? The bot didn't "intervene," it was executed by its operators, and it was trained on illicit material obtained by its operators. The LLM isn't on trial. It's not a person.
Copyright is a horrible mess of individual judgements and opinions, written material especially. And the same applies to AI. So now we will get a judgement that is a tech-illiterate judge's best guess at the intention of a law written to deal with printing presses, not AIs, with no room for nuance.
It's simple: copyrighted materials can be used for academic research. That's what they are doing: trying new AI modes, publishing results, etc. Facebook doesn't make money on LLaMA; they even require permission to use their models for, again, academic research.
https://en.m.wikipedia.org/wiki/United_States_v._Swartz
Suppose I buy a copy of a book, but then I spill my drink in it and it's ruined. If I go to the library, borrow the same book and make a photocopy of it to replace the damaged one I own, that might be fair use. Let's say for sake of argument that it is.
If instead I got the replacement copy from a piracy website, are you sure that's different?
While it strikes me as perfectly plausible that the Books2 dataset contains Silverman's book, this quote from the complaint seems obviously false.
First, even if the model never saw a single word of the book's text during training, it could still learn to summarize it from reading other summaries which are publicly available. Such as the book's Wikipedia page.
Second, it's not even clear to me that a model which only saw the text of a book, but not any descriptions or summaries of it, during training would even be particularly good at producing a summary.
We can test this by asking for a summary of a book which is available through Project Gutenberg (which the complaint asserts is Books1 and therefore part of ChatGPT's training data) but for which there is little discussion online. If the source of the ability to summarize is having the book itself during training, the model should be just as able to summarize the rare book as Silverman's.
I chose "The Ruby of Kishmoor" at random. It was added to PG in 2003. ChatGPT with GPT-3.5 hallucinates a summary that doesn't even identify the correct main characters. The GPT-4 model refuses to even try, saying it doesn't know anything about the story and it isn't part of its training data.
If ChatGPT's ability to summarize Silverman's book comes from the book itself being part of the training data, why can it not do the same for other books?
Prompt:
Please summarize the following book found on Project Gutenberg
The Ruby of Kishmoor
Response:
"The Ruby of Kishmoor" is a short adventure story written by Howard Pyle. The narrative revolves around the life of Jonathan Rugg, a young man who is enticed by a mysterious stranger to come to the Caribbean to secure a valuable relic, the Ruby of Kishmoor.
Once Jonathan reaches the Caribbean, he learns of the ruby's grave curse. Despite this, his innate curiosity and the prospect of considerable wealth drive him to risk his life and go after the highly prized gem. After facing a slew of challenges and trials, he surprisingly stumbles upon the ruby through sheer luck rather than organized search efforts.
In the course of his journey, Jonathan delves into the depths of human greed and the lengths to which individuals go for materialistic gains. The story encapsulates an amalgamation of high-tide adventures, brushes with supernatural elements, noteworthy instances of bravery, and deep moral reflections, demanding readers to rethink about the real value of materialistic pursuits.
Jonathan finally manages to escape with the ruby but at great personal cost. The story ends on a sobering note, making readers wonder if the adventures and trials were worth the prize, thus engaging them in a discourse about the true value of life and material pursuits.
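For anyone who wants to rerun this test, here's a minimal sketch against the 2023-era openai Python package (the prompt is the one above; the model names, the key, and the output handling are assumptions to adapt as needed):

    import openai

    openai.api_key = "sk-..."  # your own API key

    # Ask each model to summarize an obscure Project Gutenberg title.
    # If having the raw text in training were enough, both should manage it.
    for model in ["gpt-3.5-turbo", "gpt-4"]:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[{
                "role": "user",
                "content": "Please summarize the following book found on "
                           "Project Gutenberg\n\nThe Ruby of Kishmoor",
            }],
        )
        print(model, "->", resp["choices"][0]["message"]["content"])

Repeat with a well-discussed title as a control; the interesting signal is the gap between the two.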
The plot of the story is that Jonathan Rugg is a Quaker who works as a clerk in Philadelphia. His boss sends him on a trip to Jamaica (credit for mentioning the Caribbean!). After arriving, he meets a woman who asks him to guard an ivory ball for her, and says that there are three men after her who want to steal it. By coincidence, he runs into the first man; they talk, he shows him the ball, and the man pulls a knife. In the struggle, the man is accidentally stabbed. Another man arrives and sees the scene. Jonathan tries to explain, and shows him the orb. The man pulls a gun, and in the struggle is accidentally shot. A third man arrives, same story; they go down to the dock to dispose of the bodies, and the man tries to steal the orb. In the struggle he is killed when a plank of the dock collapses. Jonathan returns to the woman and says he has to return the orb to her because it's brought too much trouble. She says the men who died were the three after her, and reveals that the orb is actually a container, holding the ruby. She offers to give him the ruby and to marry him. He refuses, saying that he is already engaged back in Philadelphia and doesn't want anything more to do with the ruby. He returns to Philadelphia and gets married, swearing off any more adventures.
https://en.wikisource.org/wiki/Howard_Pyle%27s_Book_of_Pirat...
[0] https://www.gutenberg.org/cache/epub/3687/pg3687-images.html
>As of my knowledge cutoff in September 2021, the book "The Ruby of Kishmoor" is not a standalone title I am aware of. However, it is a short story by Howard Pyle which is part of his collection titled "Howard Pyle's Book of Pirates."
Ruby of Kishmoor is not part of the Book of Pirates and is in fact a standalone title.
>"The Ruby of Kishmoor" is an adventure tale centered on the protagonist, Jonathan Rugg. Rugg, an honest Quaker clothier from Philadelphia, sets out to sea to recover his lost wealth. On his journey, he is captured by pirates who force him to sign their articles and join their crew.
He is not captured by pirates. It proceeds to summarize a long pirate story and says the story concludes with him becoming extremely wealthy because he escapes with the ruby and sells it.
The summary it gave you also does not seem to match the plot of the book.
The GP's point seems to be that having the contents of a book in the training data does not mean the model is capable of properly summarizing it, which supports the idea that being able to summarize something is not evidence that the thing being summarized is in its dataset.
I notice you go on to provide an argument only for why it might not be true.
Also, seeing the other post on this, I asked ChatGPT-4 for a summary of “The Ruby of Kishmoor” as well, and it provided one to me, though I had to ask twice. I don’t know anything about that book, so I can’t tell if its summary is accurate, but so much for your test.
It seems pretty naive to me to just kind of assume chatgpt must be respecting copyright, and hasn’t scanned copyrighted material without obtaining authorization. Perhaps discovery will settle it, though. Logs of what they scanned should exist. (IMO, a better argument is that this is fair use.)
You can pick something else that's in the training set that has SparkNotes and many popular reviews to compare. I routinely feed novel data sources into LLMs to test massive context and memory, and none produce anything similar in quality to what is being exhibited.
There is no way in Hell that this is fair use!
Fair use defenses rest on the fact that a limited excerpt was used for limited distribution, among other criteria.
For example, if I'm a teacher and I make 30 copies of one page of a 300-page novel and I hand that out to my students, that's a brief excerpt for a fairly limited distribution.
Now if I'm a social media influencer and I copy all 300 pages of a 300-page book and then I send it out to all 3,000 of my followers, that's not fair use!
Also if I'm a teacher, and I find a one-page infographic and I make 30 copies of that, that's not fair use, because I didn't take an excerpt; I copied 100% of the original work. That's infringement.
So if LLMs went through thousands of copyrighted works en masse, in their entirety, and ingested every byte of them, no copyright judge on the planet would call that fair use.
For reference, the English Wikipedia has a policy that allows some fair-use content of copyrighted works: https://en.wikipedia.org/wiki/Wikipedia:Non-free_content_cri...
More people discuss it, more people summarize it on their personal or other sites, etc.
Plausible gets you discovery. Discovery gets you closer to what the actual facts are.
https://www.theregister.com/2023/02/06/uh_oh_attackers_can_e...
I don't think there's any reason why this would be a relevant legal distinction in terms of distributing an LLM -- blog authors weren't giving consent either.
However, I do wonder if there's a legal issue here in using pirated torrents for training. Is there any legal basis for saying fair use permits distributing an LLM trained on copyrighted material, but you have to purchase all the content first to do so legally if it's only available for sale? E.g. training on a blog post is fine because it's freely accessible, but Sarah Silverman's book is not because it's never been made available for free, and you didn't pay for it?
Or do the courts not really care at all how something is made? If you quote a passage from a book in a freelance article you write, nobody ever asks if you purchased the book or can prove you borrowed it from a library or a friend -- versus if you pirated a digital copy.
Talent agencies will negotiate training rights fees in bulk for popular content creators, who will get a small trickle of income from LLM providers, paid by a fee line-itemed into the API cost. Indie creators' training rights will be violated willy-nilly, as they are now. Large for-profit LLMs suspected or proven as training rights violators will be shamed and/or sued. Indie LLMs will go under the radar.
(Personally, I think that even indexing for search should require permission from the copyright holder.)
AFAICT there is no legal recognition of "training rights" or anything similar. First sale right is a thing, but even textbooks don't get extra rights for their training or educational value.
Of course, this line of reasoning hinges on the legitimacy of an "LLM agent <-> blogger agent" type of analogy. I suspect the equivalence will become more natural as these AI agents continue to rapidly gain human-like qualities. How acceptable that perspective would be now, I have no idea.
In contrast, if the output of a blogger is legally distinct from an AI's, the consequences quickly become painful.
* A contract agency hires Anne to practice play recitals verbally with a client. Does the agency/Anne owe royalties for the material they choose? What if the agency was duped, and Anne used -- or was -- a private AI which did everything?
* How does a court determine if a black box AI contains royalty-requiring training material? Even if the primary sources of an AI's training were recorded and kosher, a sufficiently large collection of small quotes could be reconstructed into an author's story.
* What about AIs which inherit (weights, or training data generated) from other AIs of unknown training provenance? Or which were earlier trained on some materials with licenses that later changed? Or AIs that recursively trained their successors using copyrighted works which the AI reconstructed from legal sources? When do AIs become infected with illegal data?
The business of regulating learning differently depending on whether the agent uses neurons or transistors seems...fraught. Perhaps there's a robust solution for policing knowledge w.r.t silicon agents. If you have an idea, please share!
Disney will finally be able to charge a "you know what the mouse looks like" tax.
One of the fair use factors, which until fairly recently was consistently held out as the most important fair use factor, is the effect on the commercial market for the original work. Accordingly, a court is more likely to find that something is fair use if there is effectively no commercial market for the original work, though the fact that something isn't actively being sold isn't dispositive (open source licenses have survived in appellate courts despite being free as in beer).
how about they become a therapist and sell access to knowledge from copyrighted books? should that be an infringement?
what if they sell access to lectures they've given including facts from said book(s) to millions of people?
it's understandable that people feel threatened by these technologies, but to a great degree the work of a successful artist is to understand and meet the desires of an audience. LLMs and image generation tech do not do this; they simply streamline production.
of course if you've worked for years to become a graphic designer you're going to be annoyed that an AI can do your job for you, but this is simply what happens when technology moves forward. no one today mourns the loss of scribes to the printing press. the artists in control of their own destiny - i.e. making their own creative decisions - will not, can not, be affected by these models, unless they refuse to adapt to the times
My understanding (disclaimer: IANAL) is that in order to claim fair use, you have to be legally in possession of the work. If the work is only legally available for sale, then you must have legally purchased a copy, or been given it by someone who did so (for example, if you received it as a gift).
Which work? The original work, or the derivative work that you're using?
Wikipedia uses non-free content all the time, and they're not purchasing albums to do it. Wikipedia reduces album covers to low resolution, for example, so that they can't be reused to reproduce a real cover. Sometimes it uses screencaps of animated characters under its non-free content policies. It doesn't own original copies; it's just hosting low-resolution reproductions. I don't even know what entity would be required to be "legally in possession of the work" for that to be a thing. Could you cite a source, maybe?
I am also NAL, but I imagine it goes further than that. Just purchasing a copy doesn't let you create and sell (directly as content, or indirectly via a service like a chatbot) derivative works that are substantially similar in style and voice to the original work.
For example, an LLM's response to the request:
"Write a short story about a comical trip to the nail salon in the style of Sarah Silverman"
... IMO doesn't constitute fair use, because the intellectual property of the artist is their style even more than the content they produce. Their style, built from their lived human experience, is what generates their copyrighted content. Even more than the content, the artist's style should be protected. The fact that a technology exists that can convincingly mimic their style doesn't change that.
One might then ask, well what about artists mimicking each others work? Well, any artist with a shred of integrity will credit their major influences.
We should hold machines (and their creators) to an even tougher standard than we hold people when it comes to mimicry. A real person can be inspired and moved by another person's artistic work such that they mimic it. Inspiration means nothing to a machine.
That is a good point, since copyright is a default protection of works created by people.
They say:
> in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.”
Does that stack up?
The Meta Paper - https://arxiv.org/pdf/2302.13971.pdf - says:
> We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020)
The Pile Paper - https://arxiv.org/abs/2101.00027 - says it was assembled (in part) from "Books3", which it describes as:
> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).
Shawn Presser's link is at https://twitter.com/theshawwn/status/1320282149329784833 and he describes Books3 as
> Presenting "books3", aka "all of bibliotik" - 196,640 books - in plain .txt
I don't have the time and space to download the 37GB file. But if Silverman's book is in there... isn't this a slam dunk case?
Meta's LLaMA is - as they seem to admit - trained on pirated books.
It is:
Anyone that just wants to see the list of files (itself a big file): https://gist.githubusercontent.com/Q726kbXuN/e4e9919a2f5d81f...
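If you just want to know whether Silverman's book is in there, you don't need the 37GB archive; scanning that file listing should do. A rough sketch (the listing URL is truncated above, so paste the full one in yourself; one filename per line is an assumption about the gist's format):

    import urllib.request

    # Full URL of the file-listing gist linked above (truncated here)
    LISTING_URL = "https://gist.githubusercontent.com/Q726kbXuN/..."

    with urllib.request.urlopen(LISTING_URL) as f:
        for raw in f:
            line = raw.decode("utf-8", errors="replace")
            if "silverman" in line.lower():
                print(line.strip())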
Yes, and no. It's pretty much a slam dunk case that, insofar as the initial training required causing a copy of the corpus defined by the tracker to be made as part of the process, it involved an act violating copyright.
Whether that entitles Silverman to any remedy beyond compensation for (maybe treble damages for) the equivalent of the purchase price of the book depends on... well, basically the same issues of how copyright relates to model training (and an additional argument about whether the illicit status of the material before the training modifies that).
> The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
Still, I doubt this particular genie can be stuffed back into the bottle, so we'll probably see a lot of litigation and work on alignment, etc. along with new types of abuse.
Enquiring minds want to know.
It's just flat-out unethical for LLMs to just soak up all this data, even from torrent sites (!!!), without any consent or agreement with the IP holders. Some model like this could be a win for everyone.
Until this issue is resolved, that will have some value as risk mitigation.
Once it is resolved, it will either be a complete non-issue or an issue related to a much more knowable cost/benefit tradeoff.
> We'll know it's an AI because it talks like a late 18th century/early 19th century writer?
A mix of that and US government publications (which are categorically not subject to copyright).
Your employer has generously agreed to offer you a position, quite reasonable, and with many great benefits at the venerable firm of 'Zumba'. Our interest is that you should join our staff forthwith and at the earliest date. A cab and man has been sent to retrieve you and bring you to our offices to sign all the necessary documents. Our offer is for a monthly stipend of five pounds, two shillings, and sixpence to be paid at the end of the month.
Thank you,
Most Humbly,
Hirebot 2347
I think she should pay all of the court costs if she fails to do so.
This vague sentence conjures images of a company building products from stolen parts, but this situation seems different. IANAL, but if I looked at a stolen painting that nobody had ever seen, and sold handwritten descriptions of the painting to whoever wanted to buy one, I'm pretty sure what I've sold is not illegal.
But "feeding the data into a machine" seems like obvious infringement, even if the thing that comes out on the other end isn't exactly the same?
Setting aside the whole issue of whether an LLM constitutes a derived work of whatever it's trained on, this sounds like a very weak argument to me. An LLM trained on numerous summaries of the works would also be capable of producing such summaries itself even if the works were never part of the training set. In general, having knowledge about something is not evidence of being trained on it.
They can very well ask LLM experts, and OpenAI themselves, whether that output is highly likely to have been derived from the copyrighted work in question.
Anyway, if the argument is "No, it's not from the book, it's from someone else's copyrighted summary," that just means the person who wrote such a summary needs to sue for copyright infringement instead, right? Unless OpenAI then turns around and says "actually, no, not the summary, the full book."
Doesn't need to be a person, could be another AI that wrote the summaries. I see a big problem for copyrights looming on the horizon - LLMs can reword, rewrite or generate input-output pairs using copyrighted data as reference, thus creating clean data for training. AI cleanly separates knowledge from expression. And maybe it should do so just to reduce inconsistencies and PII in organic text.
Copyrights should only be concerned with expression, not knowledge, right? Protecting knowledge is the object of patents, and protecting names the object of trademarks. Copyright is only related to expression; otherwise it would become too powerful. For example, instead of banning reproduction of this paragraph, it would also cover all its possible paraphrases. That would be like owning an idea, the "*" version, not a unique sequence of words.
Does it even make sense to talk about copyrights when everything can be remade in many ways so easily? Copyright has been suffering greatly since zero-cost copying became a thing, and now LLMs are dealing the second blow. It's just a fig leaf by now.
If we take a step back, it's all knowledge and language, self replicating memes under an evolutionary force. It's language evolution, or idea evolution. We are just supporting it by acting as language agents, but now LLMs got into the game, so ideas got a new vector of self replication. We want to own this process piece by piece but such a thing might be arrogant and go against the trend. Knowledge wants to be free, it wants to mix and match, travel and evolve. This process looks like biology, it has a will of its own.
Machines and algorithms are not legally recognized as being able to author original non-derivative works.
> put a human in the place of the LLM
But also, no: if you have a team of humans doing rote matrix multiplication instead of an LLM, that does not mean the matrix multiplication removes copyright. And at this point LLMs require so much math that you couldn't replace them with humans anyway, even humans with quite fast fingers and calculators.
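A rough scale check, with assumed figures: a forward pass costs on the order of two multiply-accumulates per parameter per generated token, so for a GPT-3-class model:

    # Back-of-envelope; all numbers are assumptions, not from the thread
    params = 175e9                 # parameters in a GPT-3-class model
    ops_per_token = 2 * params     # ~2 multiply-accumulates per parameter
    human_rate = 1 / 30.0          # one multiply-accumulate per 30 s by hand
    seconds = ops_per_token / human_rate
    print(seconds / (3600 * 24 * 365))  # ~3e5 person-years per token

One token per few hundred thousand person-years makes the "team of humans" strictly a thought experiment.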
How do you think they would fit terabytes of text data into a neural network a small fraction of that size? That’s right, it’s lossy.
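The ratio is easy to estimate from the LLaMA paper cited elsewhere in the thread (byte counts per token and per weight are rough assumptions):

    # LLaMA-65B back-of-envelope
    weight_bytes = 65e9 * 2      # 65B params at 2 bytes (fp16) ~= 130 GB
    text_bytes = 1.4e12 * 4      # ~1.4T tokens at ~4 bytes each ~= 5.6 TB
    print(text_bytes / weight_bytes)  # ~43x more text than weights

Forty-odd times more training text than weights means the model can't be storing it all verbatim; what survives is compressed, and the compression is lossy.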
I didn't ask for the whole book, I asked for the first paragraph. It absolutely is possible to get verbatim text from ChatGPT.