To be very clear on this point - this is not related to model training.
It’s important in the fair use assessment to understand that the training itself is fair use, but the pirating of the books is the issue at hand here, and is what Anthropic “whoopsied” into in acquiring the training data.
Buying used copies of books, scanning them, and training on it is fine.
> Buying used copies of books, scanning them, and training on it is fine.
But nobody was ever going to do that, not when there are billions in VC dollars at stake for whoever moves fastest. Everybody will simply risk the fine, which tends not to be anywhere close to enough to have a deterrent effect in the future.
That is like saying Uber would not have had any problems if they had just entered into a licensing contract with taxi medallion holders. It was faster to just put unlicensed taxis on the streets and use investor money to pay fines and lobby for favorable legislation. In the same way, it was faster for Anthropic to load up their models with un-DRM'd PDFs and ePUBs from wherever instead of licensing them publisher by publisher.
> It was faster to just put unlicensed taxis on the streets and use investor money to pay fines and lobby for favorable legislation
And thank god they did. There was no perfectly legal channel to fix the taxi cartel. Now you don't even have to use Uber in many of these places because taxis had to compete - they otherwise never would have stopped pulling the "credit card reader is broken" scam or taking long routes on purpose, nor started using tech that made them more accountable for these things and made it harder for them to racially profile passengers. (They would infamously pretend not to see you if they didn't want to give you service, back when you had to hail them with an IRL gesture instead of an app.)
Anthropic literally did exactly this to train its models, according to the lawsuit. The court found that Anthropic didn't even use the pirated books to train its model. So there is that.
What you describe is in fact what Waymo has had to, or chosen to, deal with. They didn't go for an end run around regulations related to vehicles on public roads. They committed to driverless vehicles and worked with local governments to roll it out as quickly as regulators were willing to allow.
Uber could have made the same decision and worked with regulators to be allowed into markets one at a time. It was an intentional choice to lean on the fact that Uber drivers blended into traffic and could hide in plain sight until Uber had enough market share and customer base to give them leverage.
> But nobody was ever going to do that, not when there are billions in VC dollars at stake for whoever moves fastest.
Anthropic did. That was the part of their operation that they didn't get in trouble for, but the news spun it as "Anthropic destroyed millions of books to make AI".
Actually, NL is training a GPT on only materials they bought fairly.
It won't be a ChatGPT or coding model of course, that's not what they're going for, but it'll be interesting to see its quality as it's all fairly and honestly done. Transparently.
I think the jury is still out on how fair use applies to AI. Fair use was not designed for what we have now.
I could read a book, but it's highly unlikely I could regurgitate it, much less months or years later. An LLM, however, can. While we can say "training is like reading", it's also not like reading at all due to permanent perfect recall.
Not only does an LLM have perfect recall, it also has the ability to distribute plagiarized ideas at a scale no human can. There are a lot of questions to be answered about where fair use starts/ends for these LLM products.
Fair use wasn't designed for AI, but AI doesn't change the motivations and goals behind copyright. We should be returning to the roots - why do we have copyright in the first place, what were the goals and the intent behind it, and how does AI affect them?
The way this technology is being used clearly violates the intent behind copyright law, it undermines its goals and results in harm that it was designed to prevent. I believe that doing this without extensive public discussion and consensus is anti-democratic.
We always end up discussing concrete implementation details of how copyright is currently enforced, never the concept itself. Is there a good word for this? Reification?
> I think the jury is still out on how fair use applies to AI.
The judge presiding over this case has already issued a ruling to the effect that training an LLM like Anthropic's AI with legally acquired material is in fact fair use. So unless someone comes up with some novel claims that weren't already attempted, claims that a different form of AI is significantly different from an LLM from a copyright perspective, or tries their hand in a different circuit to get a split decision, the "jury" is pretty much settled on how fair use applies to AI. Legally acquired material used to train LLMs is fair use. Illegally obtaining copies of material is not fair use, and the transformative nature of LLMs doesn't retroactively make it fair use.
One more fundamental difference. I can't read all of the books and then copy my brain.
Which is one of the fundamental things in how copyright is handled: copying in general, or performing multiple times. So I can accept the argument that training a model one time and then using a singular instance of that model is analogous to human learning.
But when you get to running multiple copies of the model, we are clearly past that.
To be even more clear - this is a settlement, it does not establish precedent, nor admit wrongdoing. This does not establish that training is fair use, nor that scanning books is fine. That's somebody else's battle.
Everyone has more than a right to freely read everything that is stored in a library.
(Edit: in fact initially I wrote 'is supposed to' in place of 'has more than a right to' - meaning that "knowledge is there, we made it available: you are supposed to access it, with the fullest encouragement").
I don't believe that's true. Most work I've read on fair use suggests it has to be a small amount, selectively used, substantially transformed, and not compete with content creators. These AIs' training is the opposite of all that. I was surprised by a ruling like this, but Alsup is a unique judge.
Additionally, sharing copyrighted works without permission... the data sets or data lakes... is its own tort. You're guilty just for sharing copies before even training. Some copyrighted works are also commercial, copyrighted with a ban on others' commercial use, or patented. Some are NDA'd but a 3rd party leaked them. Sources like Common Crawl probably have plenty of such content.
Additionally, there's often contractual terms of use on accessing the content. Even Singapore's and others laws allowing training on copyrighted content usually require that you lawfully accessed that content in the first place. The terms of use are the weakest link there.
I'd like to see these two issues turned by law into a copyright exception that no contract can override. It needs to specifically allow sharing scraped, publicly-visible content. Anything you can just view or download which the copyright owner put up. The law might impose or allow limits on daily scraping quantity, volume, etc to avoid damage scrapers are doing.
I have an author friend who felt like this was just adding insult to injury.
So not only had his work been consumed into this machine that is being used to threaten his day job as a court reporter, not only was that done without seeking his permission in any way, but they didn’t even pay for a single copy.
Really embodies raising your middle finger to the little guy while you steamroll him.
Exactly this. It's only us peons who will be prosecuted under the current copyright laws. The rich and well connected will base their entire business on blatant theft and will get away with it.
The Librareome project was about simply scanning books, not training AI with them. And it was a matter of trying to stop corporations from literally destroying the physical books in the process. I don't know that this is applicable.
This is excellent news because it means that folks who pay for printed books and scan them also can train with their content. It's been said already that we've already trained on "the entire (public) internet." Printed books still hold a wealth of knowledge that could be useful in training models. And cheap, otherwise unwanted copies make great fodder for "destructive" scanning where you cut the spine off and feed it to a page scanner. There are online services that offer just that.
> It’s important in the fair use assessment to understand that the training itself is fair use,
I think that this is a distinction many people miss.
If you take all the works of Shakespeare and reduce them to tokens and vectors, is it Shakespeare or is it factual information about Shakespeare? It is the latter, and as much as organizations like the MLB might want to be able to copyright a fact, you simply cannot do that.
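For anyone unfamiliar with what "reduce it to tokens and vectors" looks like mechanically, here's a toy sketch in Python (purely illustrative, with a made-up snippet; real LLM pipelines use learned subword tokenizers and high-dimensional embeddings, not whitespace splitting and word counts):

```python
# Toy illustration: turn a snippet of text into integer tokens and a
# simple count vector. The stored artifacts are numbers derived from
# the text, not the text itself.
from collections import Counter

text = "to be or not to be that is the question"

# 1. Tokenize: naive whitespace splitting stands in for a real tokenizer.
tokens = text.split()

# 2. Build a vocabulary mapping each distinct token to an integer id.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# 3. Encode the passage as a sequence of token ids.
token_ids = [vocab[tok] for tok in tokens]

# 4. A crude "vector": how often each vocabulary entry occurs.
counts = Counter(tokens)
vector = [counts[tok] for tok in sorted(vocab, key=vocab.get)]

print(vocab)      # {'be': 0, 'is': 1, ...}
print(token_ids)  # the passage as numbers
print(vector)     # one count per vocabulary entry
```

Whether those derived numbers are "Shakespeare" or merely facts about Shakespeare is exactly the question above.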
Take this one step further. If you buy the work and vectorize it, that's fine. But if you feed in the vectors for Harry Potter so many times that it can reproduce half of the book, it becomes a problem when it spits out that copy.
And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
If you use an LLM to generate your code at work, can you leave with that code when you quit? Does GPL3 or something like the Elastic Search license even apply if there is no copyright?
I suspect we're going to be talking about court cases a lot for the next few years.
Yes. Someone on this post mentioned that Switzerland allows downloading copyrighted material but not distributing it.
So things get even more dark, because what counts as distribution can have a really vague definition, and maybe the AI companies will only follow the law just barely, just for the sake of not getting hit with a lawsuit like this again. But I wonder if all this case did was compensate the authors this one time. I doubt we will see a meaningful change in AI companies' attitudes towards fair use / essentially exploiting authors.
I feel like they would try to use as much legalspeak as possible to extract as much from authors (legally) without compensating them, which I feel is unethical, but sadly the law doesn't work on ethics.
> And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
This seems too cute by half, courts are generally far more common sense than that in applying the law.
This is like saying using `rails generate model:example` results in a bunch of code that isn't yours, because the tool generated it according to your specifications.
The question is going to be how much human intellectual input there was I think. I don't think it will take much - you can write the crappiest novel on earth that is complete random drivel and you still have copyright on it.
So to me, if you are doing literally any human review, edits, control over the AI then I think you'll retain copyright. There may be a risk that if somebody can show that they could produce exactly the same thing from a generic prompt with no interaction then you may be in trouble, but let's face it should you have copyright at that point?
This is, however, why I favor stopping slightly short of full agentic development at this point. I want the human watching each step and an audit trail of the human interaction in doing it. Sure I might only get to 5x development speed instead of 10x or 20x but that is already such an enormous step up from where we were a year ago that I am quite OK with that for now.
I mean, sort of. The issue is that the compression is novel. So anything post-tokenization could arguably be considered value-add and not necessarily a derivative work.
Distributing and collecting payment for the usage of a trained model which may violate copyright, etc. that's still an open legal question and worth billions as well.
The RIAA should step in and get the money that publishers deserve. Talking millions per book and extra to make sure the pirates learned their lesson. And prison for the management.
I'm so over this shift in America's business model.
Original Silicon Valley model, and generally the engine of American innovation/growth/wealth equality for 200 years: Come up with a cool technology, build it in your garage, get people to fund it and sell it because it's a better mousetrap.
New model: Still come up with a cool idea, still get it funded and sold, but the idea involves committing crime at a staggering scale (Uber, Google, AirBnB, all AI companies, long list here), and then paying your way out of the consequences later.
Look some of these laws may have sucked, but having billionaires organize a private entity that systematically breaks them and gets off with a slap on the wrist, is not the solution. For one thing, if innovation requires breaking the law, only the rich will be able to innovate because only they can pay their way out of the law. For another, obviously no one should be able to pay their way out of following the law! This is basic "foundations of society" stuff that the vast majority of humans agree on in terms of what feels fair and just, and what doesn't.
Go to a country which has really serious corruption problems, like is really high on the corruption index, and ask the people there what they think about it. I mean I live in one and have visited many others so I can tell you, they all hate it. It not only makes them unhappy, it fills them with hopelessness about their future. They don't believe that anything can ever get better, they don't believe they can succeed by being good, they believe their own life is doomed to an unappealing fate because of when and where they were born, and they have no agency to change it. 25 years ago they all wanted to move to America, because the absence of that crushing level of corruption was what "the land of opportunity" meant. Now not so much, because America is becoming more like their country.
This timeline ends poorly for all of us, even the corrupt rich who profit from it, because in the future America will be more like a Latin American banana republic where they won't be able to leave their compounds for fear of getting Luigi'ed. We normal people get poverty, they get fear and death, everyone loses. The social contract is collapsing in front of our eyes.
I agree with you, except that you’re too positive. The United States is already a banana republic.
The federal courts are a joke - the supreme court now has at least one justice whose craven corruption is notorious — openly accepting material value (ie bribes) from various parties. The district courts are being stuffed with Trump appointees with the obvious problems that go with that.
The congress is supine. Obviously they cannot act in any meaningful capacity.
We don’t have street level corruption today. But we’ve fired half the civil service, so I doubt that will continue.
> and generally the engine of American innovation/growth/wealth equality for 200 years: Come up with a cool technology, build it in your garage, get people to fund it and sell it because it's a better mousetrap.
So exactly when was there “wealth equality” in the US? Are you glossing over that whole segregation, redlining, era of the US?
Meta did pirate basically all the books in Anna's Archive, but if I remember correctly they just whispered a quiet "sorry" and it ended at that. Why are they not also asked to pay?
I keep thinking, if they bought ebooks, would that be fine, or is this required to be paper books? If it doesn't work with ebooks, the world is going to be a nightmare.
Yes, but the cat is out of the bag now. Welcome to the era of every piece of creative work coming with an EULA that you cannot train on it. It will be like clearing samples.
Many already did this years ago for game resources on iClone, Unity, and UE.
There are also a lot of usage rules that now make many games unfeasible.
We dug into the private markets seeking less Faustian terms, but found just as many legal submarines in wait... "AI" Plagiarism driven projects are just late to the party. =3
It is related to scalable model training, however. Chopping the spine off books and putting the pages in an automated scanner is not scalable. And don't forget about the cost of 1) finding, 2) purchasing, 3) processing, and 4) recycling that volume of books.
> Chopping the spine off books and putting the pages in an automated scanner is not scalable.
That's how Google Books, the Internet Archive, and Amazon (their book preview feature) operated before ebooks were common. It's not scalable-in-a-garage but perfectly scalable for a commercial operation.
I feel like proportionality is related also to the scale. If a student pirates a textbook, I’d agree that 100x is excessive, but this is a corporation handsomely profiting off of mass piracy.
It's crazy to imagine, but there was surely a document or Slack message thread discussing where to get thousands of books, and they just decided to pirate them and that was OK. This was entirely a decision based on ease or cost, not based on the assumption it was legal. Piracy can result in jail time IIRC, so honestly it's lucky the employee who suggested this, or took the action, avoided direct legal liability.
Oh and I’m pretty sure other companies (meta) are in litigation over this issue, and the publishers knew that settlement below the full legal limit would limit future revenue.
Not if 100 companies did it and they all got away.
This is to teach a lesson because you cannot prosecute all thieves.
Yale Law Journal actually writes about this, the goal is to deter crime because in most cases damages cannot be recovered or the criminal will never be caught in the first place.
As long as they haven't been bullied into the corporate equivalent of suicide by the "justice" system, it's not disproportionate considering what happened to Aaron Swartz.
Well it's willful infringement so a court would be entitled to add a punitive multiplier anyway. But this is something Anthropic agreed to, if that wasn't clear.
I like the IA as much as anyone else, but surely there's a significant difference between distributing literal word for word exact copies of copyrighted material and distributing statistical indexes about copyrighted material right?
This is a good soundbite but doesn't make sense. The Internet Archive had to pay for redistributing copyrighted materials. Anthropic just paid too. (Note: redistributing != training)
I guess they must delete all models since they acquired the source illegally and benefitted from it, right? Otherwise it just encourages others to keep going and pay the fines later.
In a prior ruling, the court stated that Anthropic didn't train on the books subject to this settlement. The record is that Anthropic scanned physical books and used those for training. The pirated books were being held in a general purpose library and were not, according to the record, used in training.
> Buying used copies of books, scanning them, and training on it is fine.
Buying used copies of books, scanning them, and printing them and selling them: not fair use
Buying used copies of books, scanning them, and making merchandise and selling it: not fair use
The idea that training models is considered fair use just because you bought the work is naive. Fair use is not a law to leave open usage as long as it doesn’t fit a given description. It’s a law that specifically allows certain usages like criticism, comment, news reporting, teaching, scholarship, or research.
Training AI models for purposes other than purely academic fits into none of these.
Buying used copies of books, scanning them, training an employee with the scans: fair use.
Unless legislation changes, model training is pretty much analogous to that. Now of course if the employee in question - or the LLM - regurgitates a copyrighted piece verbatim, that is a violation and would be treated accordingly in either case.
The purpose and character of AI models is transformative, and the effect of the model on the copyrighted works used in the model is largely negligible. That's what makes the use of copyrighted works in creating them fair use.
Are "fantasy name generators" of the sort you find all over the place online fair use if the weighting of their generators is based on statistical information about names in fantasy novels? I would think most people would agree they're fair use, or if not in so many words, I think those people would find it pretty unfair for WotC to go around suing sites for running D&D character name generators.
Or let's talk about another form of buying copyrighted / protected content and selling the results of transforming it: emulators. The Connectix Virtual Game Station was the impetus for one of the most important lawsuits about emulation, and the ruling held that even though writing an emulator inherently involves copying copyrighted code, the result is sufficiently transformative and falls under fair use.
1. A Settlement Fund of at least $1.5 Billion: Anthropic has agreed to pay a minimum of $1.5 billion into a non-reversionary fund for the class members. With an estimated 500,000 copyrighted works in the class, this would amount to an approximate gross payment of $3,000 per work. If the final list of works exceeds 500,000, Anthropic will add $3,000 for each additional work (see the quick arithmetic sketch after this list).
2. Destruction of Datasets: Anthropic has committed to destroying the datasets it acquired from LibGen and PiLiMi, subject to any legal preservation requirements.
3. Limited Release of Claims: The settlement releases Anthropic only from past claims of infringement related to the works on the official "Works List" up to August 25, 2025. It does not cover any potential future infringements or any claims, past or future, related to infringing outputs generated by Anthropic's AI models.
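Putting item 1's arithmetic into a trivial sketch (the 500,000-work estimate and the $3,000 per-work increment are the figures quoted above; the function name and everything else here is just my own illustration):

```python
# Rough payout arithmetic from the settlement terms summarized above.
MIN_FUND = 1_500_000_000      # $1.5B minimum, non-reversionary
BASE_WORKS = 500_000          # estimated works in the class
PER_EXTRA_WORK = 3_000        # added for each work beyond the estimate

def total_fund(num_works: int) -> int:
    """Minimum fund plus $3,000 for each work beyond 500,000."""
    extra = max(0, num_works - BASE_WORKS) * PER_EXTRA_WORK
    return MIN_FUND + extra

print(MIN_FUND / BASE_WORKS)   # 3000.0 -> roughly $3,000 gross per work
print(total_fund(600_000))     # 1800000000 -> $1.8B if the list grows to 600k
```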
Don't forget: NO LEGAL PRECEDENT! Which means anybody suing has to start all over. You only settle in this scenario/point if you think you'll lose.
Edit: I'll get ratio'd for this, but it's the exact same thing Google did in its lawsuit with Epic. They delayed while the public and courts focused on Apple (oohh, EVIL Apple). Apple lost, and Google settled at a disadvantage before they had a legal judgment that couldn't be challenged later.
> You only settle in this scenario/point if you think you'll lose.
Or because you already got the judgement you wanted. Remember, Anthropic's training of the AI was determined to be fair use for all the legally acquired items, which Anthropic claims is their current acquisition model anyway. If we assume that's true for the sake of argument, there's no point in fighting a battle on the remaining part unless they have something to gain by it. Since they're not doing that anymore, they don't gain, and run a very high risk of losing more. From a purely PR perspective, this is the right move.
There is already a mountain of legal precedent that you can't just download copyrighted work. That's what this lawsuit is about. Just because one of the parties is Anthropic doesn't mean this is some new AI thing.
It's a separate suit being waged against Meta and OpenAI, etc.
There's piracy, then there's making available a model to the public which can regurgitate copyrighted works or emulate them. The latter is still unsettled
Few. This settlement potentially weakens all challenges to the use of copyrighted works in training LLMs. I'd be shocked if behind closed doors there wasn't some give and take on the matter between executives/investors.
A settlement means the claimants no longer have a claim, which means if they're also part of, say, the New York Times affiliated lawsuit, they have to withdraw. A neat way of kneecapping a country-wide decision that LLM training on copyrighted material is subject to punitive measures, don't you think?
Thank you. I assumed it would be quicker to find the link to the case PDF here, but your summary is appreciated!
Indeed, it is not only the payout, but also the destruction of the datasets. Although the article does quote:
> “Anthropic says it did not even use these pirated works,” he said. “If some other generative A.I. company took data from pirated source and used it to train on and commercialized it, the potential liability is enormous. It will shake the industry — no doubt in my mind.”
Even if true, I wonder how many cases we will see in the near future.
It's pretty incredible that the vast majority of authors will make more money for their books from this settlement than they ever have from selling their books.
But yes, one set of data is from 7-8 years ago, the other figure is from a few weeks ago. And median and average don't mean the same thing. And it's unclear how the population from the Author's Guild survey maps to the plaintiff class from the Anthropic settlement.
Thank you! I hadn't even thought that I could be affected, but I have written some programming books, and some of them show up on libgen. I've submitted my contact info, maybe something will come out of this...
I can't help but feel like this is a huge win for Chinese AI. Western companies are going to be limited in the amount of data they can collect and train on, and Chinese (or any foreign AI) is going to have access to much more and much better data.
The West can end the endless pain and legal hurdles to innovation by limiting the copyright. They can do it if there is will to open up the gates of information to everyone. The duration of 70 years after death of the author or 90 years for companies is excessively long. It should be ~25 years. For software it should be 10 years.
And if AI companies want recent stuff, they need to pay the owners.
However, the West wants to infinitely enrich the lucky old people and companies who benefited from the lax regulations at the start of the 20th century. Their people chose not to let the current generations acquire equivalent wealth, at least not without the old hags getting their cut too.
The vast majority of books don't generate any profits past the first few years, so I prefer Lawrence Lessig's proposal of copyright renewal at five-year intervals with a fee. Under this scheme, most books would enter the public domain after five years
Lessig: Not for this length of time, no. Copyright shouldn’t be anywhere close to what it is right now. In my book I proposed a system where you’d have to renew after every five years and you get a maximum term of 75 years. I thought that was pretty radical at the time. The Economist, after the Eldred decision, came out with a proposal—let’s go back to 14 years, renewable to 28 years. Nobody needs more than 14 years to earn the return back from whatever they produced.
I think western companies will be just fine -- Anthropic is settling because they illegally pirated books from LibGen back in 2021 and subsequently trained models on them. They realized this was an issue internally and pivoted to buying books en masse and scanning them into digital formats, destroying the original copies in the process (they actually hired a former lead in the Google Books project to help them in this endeavor!). And a federal judge ruled a couple months ago that training on these legally-acquired scanned copies does not constitute fair use -- that the LLM training process is sufficiently transformative.
So the data/copyright issue that you might be worried about is actually completely solved already! Anthropic is just paying a settlement here for the illegal pirating that they did way in the past. Anthropic is allowed to train on books that they legally acquire.
And sure, Chinese AI companies could probably scrape from LibGen just like Anthropic did without getting in hot water, and potentially access a bit more data that way for cheap, but it doesn't really seem like the buying/scanning process really costs that much in the grand scheme of things. And Anthropic likely already has legally acquired most of the useful texts on LibGen and scanned them into its internal library anyways.
(Furthermore, the scanning setup might actually give Anthropic an advantage, as they're able to digitize more niche texts that might be hard to find outside of print form)
>And a federal judge ruled a couple months ago that training on these legally-acquired scanned copies does not constitute fair use -- that the LLM training process is sufficiently transformative.
But most marginal training of Anthropic, OpenAI and Google models is done on LLM paraphrased user data on those platforms. That user data is proprietary and obviously way more valuable than random books.
> Although the payment is enormous, it is small compared with the amount of money that Anthropic has raised in recent years. This month, the start-up announced that it had agreed to a deal that brings an additional $13 billion into Anthropic’s coffers. The start-up has raised a total of more than $27 billion since its founding in 2021.
Maybe small compared to the money raised, but it is in fact enormous compared to the money earned. Their revenue was under $1b last year and they projected themselves as likely to make $2b this year. This payout equals their average yearly revenue of the last two years.
Isn't that how the whole system operates? Everyone is a conduit to allow rich people to enrich themselves further. The amount and quality of opportunities any individual receives are proportional to how well it serves existing capital.
So long as there is an excuse to justify money flows, that's fine, big capital doesn't really care about the excuse; so long as the excuse is just persuasive enough to satisfy the regulators and the judges.
Money flows happen independently, then later, people try to come up with good narratives. This is exactly what happened in this case. They paid the authors a lot of money as a settlement and agreed on a narrative which works for both sets of people; that training was fine, it's the pirating which was a problem...
It's likely why they settled; they preferred to pay a lot of money and agree on some false narrative which works for both groups rather than setting a precedent that AI training on copyrighted material is illegal; that would be the biggest loss for them.
> Isn't that how the whole system operates? Everyone is a conduit to allow rich people to enrich themselves further. The amount and quality of opportunities any individual receives are proportional to how well it serves existing capital.
You're joking, but that's actually a good pitch. There was a significant legal issue hanging over their heads, with some risk of a potentially business-ending judgment down the line. This makes it go away, which makes the company a safer, more valuable investment. Both in absolute terms and compared to peers who didn't settle.
It just resolves their liability with regards to books they purported they did not even train the models on, which is all that was left in this case after summary judgment. Sure the potential liability was company ending, but it's all a stupid business decision when it is ultimately for books they did not even train on.
It basically does nothing for them besides that. Given the split decisions so far, I'm not sure what value the Alsup decision is going to bring to the industry, moving forward, when it's in the context of books that Anthropic physically purchased. The other AI cases are generally not fact patterns where the LLM was trained with copyrighted materials that the AI company legally purchased copies of.
Everything talks about settlement to the 'authors'; is that meant to be shorthand for copyright holders? Because there are a lot of academic works in that library where the publisher holds exclusive copyright and the author holds nothing.
By extension, if the big publishers are getting $3000 per article, that could be a fairly significant windfall.
very unsurprisingly, new york times is going to frame this as a win for "the little guy" when in reality it's just multi-billion dollar publishers, with a long rich history of their own exploitive practices, hanging on for dear life against generative AI
Dunno if this matters but I thought the copyright always remains with the creator/author but they end up assigning the rights contractually. At least generally for books. Movies will be copyrighted by the studio.
Kinda how like patents will state the human “inventor” but Apple or whichever corp is assigned the rights.
This is sad for open source AI. Piracy for the purpose of model training should also be fair use, because otherwise only the big companies who can afford to pay off publishers like Anthropic will be able to do so. There is no way to buy billions of books just for model training; it simply can't happen.
Fair use isn't about how you access the material, it's about what you can do with it after you legally access it. If you don't legally access it, the question of fair use is moot.
No. It means model training is transformative enough to be fair use. They should just be asked to pay them back plus reimbursement/punishment, say pay 10x the price of the pirated books
I don't know if I agree with it, but you could argue that if a model was built for purely academic purposes, and then used for purely academic purposes, it could meet requirements for fair use.
Setting aside whether or not I think it should be fair use, you’re only going to be training a new foundation model these days if you have billions of dollars to spend on the endeavor anyway. Nobody is training Llama 5 in their garage.
This is a settlement. It does not set a precedent nor even admit to wrongdoing.
> otherwise only the big companies who can afford to pay off publishers like Anthropic will be able to do so
Only well funded companies can afford to hire a lot of expensive engineers and train AI models on hundreds of thousands of expensive GPUs, too.
Something tells me many of the grassroots LLM training people are less concerned about legality of their source training set than the big companies anyway.
It’s important in the fair use assessment to understand that the training itself is fair use, but the pirating of the books is the issue at hand here, and is what Anthropic “whoopsied” into in acquiring the training data.
Buying used copies of books, scanning them, and training on it is fine.
Rainbows End was prescient in many ways.
But nobody was ever going to do that, not when there are billions in VC dollars at stake for whoever moves fastest. Everybody will simply risk the fine, which tends not to be anywhere close to enough to have a deterrent effect in the future.
That is like saying Uber would not have had any problems if they had just entered into a licensing contract with taxi medallion holders. It was faster to just put unlicensed taxis on the streets and use investor money to pay fines and lobby for favorable legislation. In the same way, it was faster for Anthropic to load up their models with un-DRM'd PDFs and ePUBs from wherever instead of licensing them publisher by publisher.
And thank god they did. There was no perfectly legal channel to fix the taxi cartel. Now you don't even have to use Uber in many of these places because taxis had to compete - they otherwise never would have stopped pulling the "credit card reader is broken" scam or taking long routes on purpose, nor started using tech that made them more accountable for these things and made it harder for them to racially profile passengers. (They would infamously pretend not to see you if they didn't want to give you service, back when you had to hail them with an IRL gesture instead of an app.)
Didn't Google have a long standing project to do just that?
https://en.wikipedia.org/wiki/Google_Books
If this is a choice between risking paying 1.5 billion or just paying 15 mil safely, they might.
Uber could have made the same decision and worked with regulators to be allowed into markets one at a time. It was an intentional choice to lean on the fact that Uber drivers blended into traffic and could hide in plain sight until Uber had enough market share and customer base to give them leverage.
Anthropic did. That was the part of their operation that they didn't get in trouble for, but the news spun it as "Anthropic destroyed millions of books to make AI".
It won't be a ChatGPT or coding model of course, that's not what they're going for, but it'll be interesting to see its quality as it's all fairly and honestly done. Transparently.
Otherwise, of course they would tell them to just pound sand.
Agreed. Great book for those looking for a read: https://www.goodreads.com/book/show/102439.Rainbows_End
The author, Vernor Vinge, is also responsible for popularizing the term 'singularity'.
“Marooned in Real Time” remains my fav.
I could read a book, but it's highly unlikely I could regurgitate it, much less months or years later. An LLM, however, can. While we can say "training is like reading", it's also not like reading at all due to permanent perfect recall.
Not only does an LLM have perfect recall, it also has the ability to distribute plagiarized ideas at a scale no human can. There are a lot of questions to be answered about where fair use starts/ends for these LLM products.
The way this technology is being used clearly violates the intent behind copyright law, it undermines its goals and results in harm that it was designed to prevent. I believe that doing this without extensive public discussion and consensus is anti-democratic.
We always end up discussing concrete implementation details of how copyright is currently enforced, never the concept itself. Is there a good word for this? Reification?
This has not been my experience. These days they are pretty good at googling though.
And even if one could, it would be illegal to do. Always found this argument for AI data laundering weird.
The judge presiding over this case has already issued a ruling to the effect that training an LLM like Anthropic's AI with legally acquired material is in fact fair use. So unless someone comes up with some novel claims that weren't already attempted, claims that a different form of AI is significantly different from an LLM from a copyright perspective, or tries their hand in a different circuit to get a split decision, the "jury" is pretty much settled on how fair use applies to AI. Legally acquired material used to train LLMs is fair use. Illegally obtaining copies of material is not fair use, and the transformative nature of LLMs doesn't retroactively make it fair use.
Which is one of the fundamental things in how copyright is handled: copying in general, or performing multiple times. So I can accept the argument that training a model one time and then using a singular instance of that model is analogous to human learning.
But when you get to running multiple copies of the model, we are clearly past that.
However, the judge already ruled on the only important piece of this legal proceeding:
> Alsup ruled in June that Anthropic made fair use of the authors' work to train Claude...
It remains deranged.
Everyone has more than a right to freely read everything that is stored in a library.
(Edit: in fact initially I wrote 'is supposed to' in place of 'has more than a right to' - meaning that "knowledge is there, we made it available: you are supposed to access it, with the fullest encouragement").
Every human has the right to read those books.
And now, this is obvious, but it seems to be frequently missed - an LLM is not a human, and does not have such rights.
Also, at least so far, we don't call computers "someone".
Additionally, sharing copyrighted works without permission... the data sets or data lakes... is its own tort. You're guilty just for sharing copies before even training. Some copyrighted works are also commercial, copyrighted with a ban on others' commercial use, or patented. Some are NDA'd but a 3rd party leaked them. Sources like Common Crawl probably have plenty of such content.
Additionally, there's often contractual terms of use on accessing the content. Even Singapore's and others laws allowing training on copyrighted content usually require that you lawfully accessed that content in the first place. The terms of use are the weakest link there.
I'd like to see these two issues turned by law into a copyright exception that no contract can override. It needs to specifically allow sharing scraped, publicly-visible content. Anything you can just view or download which the copyright owner put up. The law might impose or allow limits on daily scraping quantity, volume, etc to avoid damage scrapers are doing.
What I'm wondering is if they, or others, have trained models on pirated content that has flowed through their networks?
I have an author friend who felt like this was just adding insult to injury.
So not only had his work been consumed into this machine that is being used to threaten his day job as a court reporter, not only was that done without seeking his permission in any way, but they didn’t even pay for a single copy.
Really embodies raising your middle finger to the little guy while you steamroll him.
The settlement was a smart decision by Anthropic to remove a huge uncertainty. $1.5 billion is not small, but it won't stop them or slow them significantly.
IIUC this is very far from settled, at least in US law.
Is this completely settled legally? It is not obvious to me it would be so
Awesome, so I just need enough perceptrons to overfit every possible copyrighted work then?
I think that this is a distinction many people miss.
If you take all the works of Shakespeare and reduce them to tokens and vectors, is it Shakespeare or is it factual information about Shakespeare? It is the latter, and as much as organizations like the MLB might want to be able to copyright a fact, you simply cannot do that.
Take this one step further. If you buy the work and vectorize it, that's fine. But if you feed in the vectors for Harry Potter so many times that it can reproduce half of the book, it becomes a problem when it spits out that copy.
And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
If you use an LLM to generate your code at work, can you leave with that code when you quit? Does GPL3 or something like the Elastic Search license even apply if there is no copyright?
I suspect we're going to be talking about court cases a lot for the next few years.
So things get even more dark, because what counts as distribution can have a really vague definition, and maybe the AI companies will only follow the law just barely, just for the sake of not getting hit with a lawsuit like this again. But I wonder if all this case did was compensate the authors this one time. I doubt we will see a meaningful change in AI companies' attitudes towards fair use / essentially exploiting authors.
I feel like they would try to use as much legalspeak as possible to extract as much from authors (legally) without compensating them, which I feel is unethical, but sadly the law doesn't work on ethics.
This seems too cute by half, courts are generally far more common sense than that in applying the law.
This is like saying using `rails generate model:example` results in a bunch of code that isn't yours, because the tool generated it according to your specifications.
To rephrase the question:
Is a PDF of the complete works of Shakespeare Shakespeare, or is it factual information about Shakespeare?
Reencoding human-readable information into a form that's difficult for humans to read without machine assistance is nothing new.
So to me, if you are doing literally any human review, edits, control over the AI then I think you'll retain copyright. There may be a risk that if somebody can show that they could produce exactly the same thing from a generic prompt with no interaction then you may be in trouble, but let's face it should you have copyright at that point?
This is, however, why I favor stopping slightly short of full agentic development at this point. I want the human watching each step and an audit trail of the human interaction in doing it. Sure I might only get to 5x development speed instead of 10x or 20x but that is already such an enormous step up from where we were a year ago that I am quite OK with that for now.
Sure, training by itself isn't worth anything.
Distributing and collecting payment for the usage of a trained model which may violate copyright, etc. that's still an open legal question and worth billions as well.
I'm so over this shift in America's business model.
Original Silicon Valley model, and generally the engine of American innovation/growth/wealth equality for 200 years: Come up with a cool technology, build it in your garage, get people to fund it and sell it because it's a better mousetrap.
New model: Still come up with a cool idea, still get it funded and sold, but the idea involves committing crime at a staggering scale (Uber, Google, AirBnB, all AI companies, long list here), and then paying your way out of the consequences later.
Look some of these laws may have sucked, but having billionaires organize a private entity that systematically breaks them and gets off with a slap on the wrist, is not the solution. For one thing, if innovation requires breaking the law, only the rich will be able to innovate because only they can pay their way out of the law. For another, obviously no one should be able to pay their way out of following the law! This is basic "foundations of society" stuff that the vast majority of humans agree on in terms of what feels fair and just, and what doesn't.
Go to a country which has really serious corruption problems, like is really high on the corruption index, and ask the people there what they think about it. I mean I live in one and have visited many others so I can tell you, they all hate it. It not only makes them unhappy, it fills them with hopelessness about their future. They don't believe that anything can ever get better, they don't believe they can succeed by being good, they believe their own life is doomed to an unappealing fate because of when and where they were born, and they have no agency to change it. 25 years ago they all wanted to move to America, because the absence of that crushing level of corruption was what "the land of opportunity" meant. Now not so much, because America is becoming more like their country.
This timeline ends poorly for all of us, even the corrupt rich who profit from it, because in the future America will be more like a Latin American banana republic where they won't be able to leave their compounds for fear of getting Luigi'ed. We normal people get poverty, they get fear and death, everyone loses. The social contract is collapsing in front of our eyes.
The federal courts are a joke - the supreme court now has at least one justice whose craven corruption is notorious — openly accepting material value (ie bribes) from various parties. The district courts are being stuffed with Trump appointees with the obvious problems that go with that.
The congress is supine. Obviously they cannot act in any meaningful capacity.
We don’t have street level corruption today. But we’ve fired half the civil service, so I doubt that will continue.
Not creative destruction. But pure corruption.
So exactly when was there “wealth equality” in the US? Are you glossing over that whole segregation, redlining, era of the US?
And America was built on slavery and genocide.
There are also a lot of usage rules that now make many games unfeasible.
We dug into the private markets seeking less Faustian terms, but found just as many legal submarines in wait... "AI" Plagiarism driven projects are just late to the party. =3
Or can they buy the book, and then use the pirated copy?
That's how Google Books, the Internet Archive, and Amazon (their book preview feature) operated before ebooks were common. It's not scalable-in-a-garage but perfectly scalable for a commercial operation.
It's crazy to imagine, but there was surely a document or Slack message thread discussing where to get thousands of books, and they just decided to pirate them and that was OK. This was entirely a decision based on ease or cost, not based on the assumption it was legal. Piracy can result in jail time IIRC, so honestly it's lucky the employee who suggested this, or took the action, avoided direct legal liability.
Oh and I’m pretty sure other companies (meta) are in litigation over this issue, and the publishers knew that settlement below the full legal limit would limit future revenue.
This is to teach a lesson because you cannot prosecute all thieves.
Yale Law Journal actually writes about this, the goal is to deter crime because in most cases damages cannot be recovered or the criminal will never be caught in the first place.
If anything it's too little based on precedent.
Buying used copies of books, scanning them, and printing them and selling them: not fair use
Buying used copies of books, scanning them, and making merchandise and selling it: not fair use
The idea that training models is considered fair use just because you bought the work is naive. Fair use is not a law to leave open usage as long as it doesn’t fit a given description. It’s a law that specifically allows certain usages like criticism, comment, news reporting, teaching, scholarship, or research. Training AI models for purposes other than purely academic fits into none of these.
Unless legislation changes, model training is pretty much analogous to that. Now of course if the employee in question - or the LLM - regurgitates a copyrighted piece verbatim, that is a violation and would be treated accordingly in either case.
Or let's talk about another form of buying copyrighted / protected content and selling the results of transforming it: emulators. The Connectix Virtual Game Station was the impetus for one of the most important lawsuits about emulation, and the ruling held that even though writing an emulator inherently involves copying copyrighted code, the result is sufficiently transformative and falls under fair use.
1. A Settlement Fund of at least $1.5 Billion: Anthropic has agreed to pay a minimum of $1.5 billion into a non-reversionary fund for the class members. With an estimated 500,000 copyrighted works in the class, this would amount to an approximate gross payment of $3,000 per work. If the final list of works exceeds 500,000, Anthropic will add $3,000 for each additional work.
2. Destruction of Datasets: Anthropic has committed to destroying the datasets it acquired from LibGen and PiLiMi, subject to any legal preservation requirements.
3. Limited Release of Claims: The settlement releases Anthropic only from past claims of infringement related to the works on the official "Works List" up to August 25, 2025. It does not cover any potential future infringements or any claims, past or future, related to infringing outputs generated by Anthropic's AI models.
Edit: I'll get ratio'd for this, but it's the exact same thing Google did in its lawsuit with Epic. They delayed while the public and courts focused on Apple (oohh, EVIL Apple). Apple lost, and Google settled at a disadvantage before they had a legal judgment that couldn't be challenged later.
Or because you already got the judgement you wanted. Remember, Anthropic's training of the AI was determined to be fair use for all the legally acquired items, which Anthropic claims is their current acquisition model anyway. If we assume that's true for the sake of argument, there's no point in fighting a battle on the remaining part unless they have something to gain by it. Since they're not doing that anymore, they don't gain, and run a very high risk of losing more. From a purely PR perspective, this is the right move.
There's piracy, then there's making available a model to the public which can regurgitate copyrighted works or emulate them. The latter is still unsettled
And they actually went and did that afterwards. They just pirated them first.
A settlement means the claimants no longer have a claim, which means if they're also part of, say, the New York Times affiliated lawsuit, they have to withdraw. A neat way of kneecapping a country-wide decision that LLM training on copyrighted material is subject to punitive measures, don't you think?
Indeed, it is not only the payout, but also the destruction of the datasets. Although the article does quote:
> “Anthropic says it did not even use these pirated works,” he said. “If some other generative A.I. company took data from pirated source and used it to train on and commercialized it, the potential liability is enormous. It will shake the industry — no doubt in my mind.”
Even if true, I wonder how many cases we will see in the near future.
I was under the impression they had downloaded millions of books.
It looks like you'll be able to search this site if the settlement is approved:
> https://www.anthropiccopyrightsettlement.com/
If your work is there, you qualify for a slice of the settlement. If not, you're outta luck.
You can search LibGen by author to see if your work is included. I believe this would make you a member of the class: https://www.theatlantic.com/technology/archive/2025/03/searc...
If you are a member of the class (or think you are) you can submit your contact information to the plaintiff's attorneys here: https://www.anthropiccopyrightsettlement.com/
https://authorsguild.org/news/six-takeaways-from-the-authors...
Median income was $3100 which is greater than the $3000 average award from the Anthropic settlement.
https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settl...
But yes, one set of data is from 7-8 years ago, the other figure is from a few weeks ago. And median and average don't mean the same thing. And it's unclear how the population from the Author's Guild survey maps to the plaintiff class from the Anthropic settlement.
But it seems we're in the same ballpark.
I suspected my work was in the dataset and it looks like it is! I reached out via the form.
And if AI companies want recent stuff, they need to pay the owners.
However, the West wants to infinitely enrich the lucky old people and companies who benefited from the lax regulations at the start of the 20th century. Their people chose not to let the current generations acquire equivalent wealth, at least not without the old hags getting their cut too.
https://www.econlib.org/library/Columns/y2003/Lessigcopyrigh...
Lessig: Not for this length of time, no. Copyright shouldn’t be anywhere close to what it is right now. In my book I proposed a system where you’d have to renew after every five years and you get a maximum term of 75 years. I thought that was pretty radical at the time. The Economist, after the Eldred decision, came out with a proposal—let’s go back to 14 years, renewable to 28 years. Nobody needs more than 14 years to earn the return back from whatever they produced.
So the data/copyright issue that you might be worried about is actually completely solved already! Anthropic is just paying a settlement here for the illegal pirating that they did way in the past. Anthropic is allowed to train on books that they legally acquire.
And sure, Chinese AI companies could probably scrape from LibGen just like Anthropic did without getting in hot water, and potentially access a bit more data that way for cheap, but it doesn't really seem like the buying/scanning process really costs that much in the grand scheme of things. And Anthropic likely already has legally acquired most of the useful texts on LibGen and scanned them into its internal library anyways.
(Furthermore, the scanning setup might actually give Anthropic an advantage, as they're able to digitize more niche texts that might be hard to find outside of print form)
You mean does constitute fair use?
Western companies will be fine but sharing data in ways that would be illegal in the US does help other companies outside the US.
Can only imagine the pitch: yes, please give us billions of dollars. We are going to make a huge investment, like paying off our lawsuits.
> Although the payment is enormous, it is small compared with the amount of money that Anthropic has raised in recent years. This month, the start-up announced that it had agreed to a deal that brings an additional $13 billion into Anthropic’s coffers. The start-up has raised a total of more than $27 billion since its founding in 2021.
So long as there is an excuse to justify money flows, that's fine, big capital doesn't really care about the excuse; so long as the excuse is just persuasive enough to satisfy the regulators and the judges.
Money flows happen independently, then later, people try to come up with good narratives. This is exactly what happened in this case. They paid the authors a lot of money as a settlement and agreed on a narrative which works for both sets of people; that training was fine, it's the pirating which was a problem...
It's likely why they settled; they preferred to pay a lot of money and agree on some false narrative which works for both groups rather than setting a precedent that AI training on copyrighted material is illegal; that would be the biggest loss for them.
Yes, and FWIW that's very succinctly stated.
It basically does nothing for them besides that. Given the split decisions so far, I'm not sure what value the Alsup decision is going to bring to the industry, moving forward, when it's in the context of books that Anthropic physically purchased. The other AI cases are generally not fact patterns where the LLM was trained with copyrighted materials that the AI company legally purchased copies of.
By extension, if the big publishers are getting $3000 per article, that could be a fairly significant windfall.
Kinda how like patents will state the human “inventor” but Apple or whichever corp is assigned the rights.
Obviously there would be handling costs + scanning costs, so that’s the floor.
Maybe $20 million total? Plus, of course, the time it would take to execute.
> otherwise only the big companies who can afford to pay off publishers like Anthropic will be able to do so
Only well funded companies can afford to hire a lot of expensive engineers and train AI models on hundreds of thousands of expensive GPUs, too.
Something tells me many of the grassroots LLM training people are less concerned about legality of their source training set than the big companies anyway.