I'm an author, and I've confirmed that 3 of my books are in the 500K dataset.
Thus, I stand to receive about $9,000 as a result of this settlement.
I think that's fair, considering that two of those books received advances under $20K and never earned out. Also, while I'm sure that Anthropic has benefited from training its models on this dataset, that doesn't necessarily mean that those models are a lasting asset.
You won't be put in jail if you breach copyright in almost any country, at least not just for downloading content from LibGen or torrents. If you are talking about Swartz, he was facing jail for wire fraud and hacking, not for breaching copyright.
Just an FYI that it's closer to $6,750 (Anthropic pays $9,000, but roughly 25% is likely to go to the attorneys; the exact number is up to the court).
Can't help but feel the reporting about $3,000/work is going to leave a lot of authors disappointed when they receive ~$2,250, even if they'd have been perfectly happy if that was the number they initially saw.
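For anyone wondering where those two figures come from, here is the back-of-the-envelope arithmetic as a small Python sketch. The $3,000/work, 3 works, and 25% fee are just the numbers quoted in this thread, and the fee percentage is ultimately up to the court:

```python
# Rough payout math for the reported settlement figures.
# Assumptions (not from the settlement itself): $3,000 per work,
# 3 works in the class for this author, and a 25% attorney fee award.
PER_WORK = 3_000
WORKS = 3
ATTORNEY_FEE = 0.25          # the court sets the actual percentage

gross = PER_WORK * WORKS     # the $9,000 headline number
net = gross * (1 - ATTORNEY_FEE)
print(f"gross: ${gross:,.0f}, net: ${net:,.0f}, per work: ${net / WORKS:,.0f}")
# gross: $9,000, net: $6,750, per work: $2,250
```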
Do they sell their books for more than $3000 per copy? In that case it isn't fair. Otherwise they are getting a windfall because of Anthropic's stupidity in not buying the books.
In my opinion, as a class member you should push for three things:
1. Getting the maximum statutory damages for copyright infringement, which would be something like $150,000 per work for willful infringement; you can be generous and call their training and reproduction of your works a single instance, though it's probably many more than that.
2. An admission of wrongdoing plus withdrawal from the market and permanent deletion of all models trained on infringed works.
3. A perpetual agreement to only train new models on content licensed for such training going forward, with safeguards to prevent wholesale reproduction of works.
It’s no less than what they would do if they thought you were infringing their copyrights. It’s only fair that they be subject to the same kind of serious penalties, instead of something they can write off as a slap on the wrist.
>Also, while I'm sure that Anthropic has benefited from training its models on this dataset
I thought that they didn't use this data for training; the "crime" here was making the copies.
>I think that's fair, considering that two of those books received advances under $20K and never earned out.
I don't understand your logic here: if they never earned out, that means you were already "overpaid" compared to what they were worth in the market. Shouldn't fairness mean this extra bonus goes first to cover the unmet earnout?
As I understand, this case is not about training but about illegitimately sourcing the books, so unless you sell your books at $3k per copy, I don't see how it is fair.
What's more fair is for Anthropic to put 5% of their preferred shares, at their most recent valuation, into a pool that the authors of these books can make a claim against. For 18 months, any author in this cache of books can claim their ownership and rights to their proportional share of the pool among all claimants.
Perhaps tokenize all of the books and assign proportionally for token count of each publication.
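As a rough illustration only, here is what that pro-rata split could look like in Python. The pool size, author names, and token counts are all made-up placeholders for the commenter's hypothetical, not anything in the actual settlement:

```python
# Hypothetical: split a fixed pool of shares among claimants in
# proportion to the token count of each claimed publication.
def allocate_pool(pool_shares: float, token_counts: dict[str, int]) -> dict[str, float]:
    total_tokens = sum(token_counts.values())
    return {author: pool_shares * tokens / total_tokens
            for author, tokens in token_counts.items()}

# Example: a made-up pool of 50,000 shares (e.g. 5% of a hypothetical
# 1,000,000 preferred shares) and three hypothetical claimants.
claims = {"author_a": 120_000, "author_b": 80_000, "author_c": 200_000}
print(allocate_pool(50_000, claims))
# {'author_a': 15000.0, 'author_b': 10000.0, 'author_c': 25000.0}
```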
For you it might be okay, but there are others who are probably losing way too much money because of what happened. Anthropic needs to pay 5x to 10x more; it needs to set a deterrent.
> that doesn't necessarily mean that those models are a lasting asset.
It remains to be seen, but typically this forms a moat. Other companies can't bring together the investment resources to duplicate the effort and they die.
The only reasons why this wouldn't be a moat:
1. Too many investment dollars and companies chasing the same goal, and none of them consolidate. (Non-consolidation feels impractical.)
2. Open source / commoditize-my-complement offerings that devalue foundation models. We have a few of these, but the best still require H100s and they're not building product.
I think there's a moat. I think Anthropic is well positioned to capitalize from this.
It is not about $9k for your knowledge in that book. It is $9k for taking you out.
The faster they are able to grab and process data, the less chance you have to make money from your work.
The money is irrelevant if we allow them to break the law. They might even pay you $9k for those books, but you might never get anything again, because they will have made copyright useless.
My understanding is this settlement is about the MANNER in which Anthropic acquired the text of the books. They downloaded illegal copies of the books.
There were no issues with the physical copies of books they purchased and scanned.
I believe the issue of USING these texts for AI training is a separate issue/case(s).
Penalties can be several times actual damages, and substantial similarity includes MP3 files and other lossy forms of compression which don’t directly look like the originals.
The entire point of deep learning is to copy aspects from training materials, which is why it’s unsurprising when you can reproduce substantial material from a copyrighted work given the right prompts. Proving damages for individual works in court is more expensive than the payout but that’s what class action lawsuits are for.
Given that books can be imitated by humans with no compensation, this isn't as strong an argument as you think. Moreover AFAIK the training itself has been ruled legal, so Anthropic could have theoretically bought the book for $20 (or whatever) and be in the clear, which would obviously bring less revenue than the $9k settlement.
While I'm sure it feels good and validating to have this called copyright infringement, and be compensated, it's a mixed blessing at best. Remember, this also means that your works will owe compensation to anyone you "trained" off of. Once we accept that simply "learning from previous copyrighted works to make new ones" is "infringement", then the onus is on you to establish a clean creation chain, because you'll be vulnerable to the exact same argument, and you will owe compensation to anyone whose work you looked at in learning your craft.
It's a good thing that laws can be different for AI training and human consumption. And I think the blog post you linked makes that argument, too, so I'm not sure why you'd contort it into the idea that humans will be compelled to attribute/license information that has inspired them when creating art.
LLMs cannot create copyrightable works. Only humans can do that [0]. So LLMs are not making new copyrightable works.
[0] not because we're so amazingly more creative. But because copyright is a legal invention, not something derived from first principles, and has been defined to only apply to human creations. It could be changed to apply to LLM output in the future.
An infinitely scaling, for-profit commercial product designed to replace every creative by applying software processing to previous works is treated very differently than a sentient human being and their process of creativity.
The fact AI proponents can't see that is insane. Reminds me of the quote:
"It is difficult to get a man to understand something, when his salary depends upon his not understanding it."
Name should sound familiar to those who follow tech law; he presided over Oracle v Google, along with Anthony Levandowski's criminal case for stealing Waymo tech for Uber.
He actually does understand most of what he is ruling on which is a welcome surprise. Not just legal jargon but also the technical spirit of what is at stake.
He's also the one who called bullshit when Oracle tried to claim that Java's function signatures were so novel they should be eligible for copyright. (Generally, arts are copyrightable and engineering is not - there's a creativity requirement.)
They tried to say `rangeCheck(length, start, end)` was novel. He spat back that he'd written equivalent utility functions as a hobbyist hundreds of times!
Art versus engineering is a very dangerous generalization of the law. There is a creativity requirement for copyrightability, but it's an explicitly low bar. Search query "minimal degree of creativity".
The Supreme Court decision in Oracle v Google skipped over copyrightability and addressed fair use. Fair use is a legal defense, applying only in response to a finding of infringement, which can only be found if the material is copyrightable. So the way the Supreme Court made its decision was weird, but it wasn't about the creativity requirement.
Comments so far seem to be focusing on the rejection without considering the stated reasons for rejection. AFAICT Alsup is saying that the problems are procedural (how do payouts happen, does the agreement indemnify Anthropic from civil “double jeopardy”, etc), not that he’s rejecting the negotiated payout. Definitely not a lawyer but it seems to me like the negotiators could address the rejection without changing any dollar numbers.
Yes, exactly. The article is pretty clear that it’s rejected without prejudice and that a few points need to be ironed out before he gives a preliminary approval. I suspect a lot of folks didn’t read much/any of TFA.
I do wonder if all of the kinks will be smoothed out in time. Not a lawyer either, but the timeline to create the longer list is a bit tight, and it generally feels like we could see an actual rejection, or at least a stretched-out process that goes on for a few more months before approval.
Exactly. The judge is doing exactly what he's designed to do in a civil case -- help forge an agreement between the parties that doesn't come back to bite anyone in the future. The last thing a judge wants is a case getting reopened and relitigated a year from now because there was a "bug" in the settlement.
Good. Approving this would have set a concerning precedent.
Edit: My stance on information freedom and copyright hasn't changed since Aaron Swartz's death in 2013. Intellectual property laws, patents, copyright, and similar protections feel outdated and serve mainly to protect established interests. Despite widespread piracy making virtually all media available immediately upon release, content creators and media companies continue to grow and profit. Why should publishers rely on century-old laws to restrict access?
Because whenever anyone argues that all creative and knowledge works should be freely available, accessible without compensating the creators, they conveniently leave out software and the people who make it.
Moreover, IP law protects plenty of people who aren’t “established interests”. You just, perhaps, don’t know them.
Would it actually set any kind of legal precedent, or just establish a sort of cultural vibe baseline? I know Anthropic doesn't have to admit fault, and I don't know if that establishes anything in either direction. But I'm not from the US, so I wouldn't want to pretend to have intimate knowledge of its system.
The number of bizarre, contradictory inferences this settlement asks you to make - no matter your stance on the wider question - is wild.
The settlement doesn't set any kind of precedent at all.
The existing rulings in the case establish "persuasive" precedent (i.e. future cases are entirely free to disagree and rule to the contrary) - notably including the part about training on legally acquired copies of books (e.g. from a book store) being fair use.
Only appeals courts establish binding precedent in the US (and only for the courts under them). A result of this case settling is that it won't be appealed, and thus won't establish any binding precedent one way or another.
> The number of bizarre, contradictory inferences this settlement asks you to make - no matter your stance on the wider question - is wild.
In an economy where ideas have value, it seems logical that we should have property protection, much like we do for physical goods. It's easy to argue "ideas should be freely shared", but if an idea takes 20 years and $100M to develop, and there are no protections for ideas, then no one will take the time to develop them. Most modern technology we have is due to copyright/patents (drugs, electronics, entertainment, etc.), because without those protections, no one would have invested the time and energy to develop it in the first place.
I believe you are probably only looking at the current state of the world and seeing how it "stifles competition" or "hampers innovation". Those allegations are probably true to some extent, especially in specific cases, but it's also missing the fact that without those protections, the tech likely wouldn't be created in the first place (and so you still wouldn't be able to freely use the idea, since the person who invented it wouldn't have).
This is a kind of strange example, since (for drugs, at least) the discovery tends to be government-funded research, and the safety is shown with private money.
The USSR went to space without those protections. It's not like property protections are the only thing that has driven invention.
MIT licenses are also pretty popular, as are Creative Commons licenses.
People also do things that don't make a lot of money, like teaching elementary school. It costs a ton of money to make and run all those schools, but without any intellectual property being created that can be sold or rented out.
I don't believe that nobody would want to build much of the things we have now if there wasn't IP around them. Making and inventing things is fun.
> but if an idea takes 20 years and $100M dollars to develop, and there are no protections for ideas, then no one will take the time to develop them
This sounds trivially true but I have some trouble reconciling it with reality. For example the Llama models probably cost more than this to develop but are made freely available on GitHub. So while it’s true that some things won’t be built, I think it’s also the case that many things would still be built.
I appreciate you giving the parent comment a fair chance.
As a society we’re having trouble defining abstract components of the self (consciousness, intelligence, identity) as is. What makes the legislative notion of an idea and its reification (what’s actually protected under copyright laws) secure from this same scrutiny? Then patent rights. And what do you think may happen if the viability of said economy comes into question afterwards?
It's just a fiction to allow something freely copiable - pure information - to be treated as if it were a commodity. If the AI firms have only a single redeeming feature, then it is that in them the copyright mafia finally has to face someone their own size, rather than driving little people to suicide, as they did to Aaron Swartz.
Only people who don't create anything say that. Every musician and every author I know (including myself) thinks they should have some rights concerning the distribution and sale of the products of their work. Why should a successful book author be forced to live on charity?
The judge IIRC found that training models using copyrighted materials was fair use. I disagree. Furthermore this will be a problem for anyone who generates text for a living. Eventually LLMs will undercut web, news, and book publishing because LLMs capture the value and don't pay for it. The ecosystem will be harmed.
The only problem the judge found here was training on pirated texts.
How do any of these AI companies protect authors from users uploading full PDFs, or even plaintext, of anything? Aren't the same piracy concerns real even if they train on what users are providing?
I bet you could get a court to say it was legally identical.
I think the Aereo case, and Scalia's dissent, are super relevant here. It's when the court decided to go with vibes, instead of facts. The inevitable result of that (which Scalia didn't predict) was selective enforcement.
edit: so what I really mean is that I bet you could get a court to say whatever you wanted about it if you were far wealthier and more influential than your opponents.
It's not similar at all because you can't get the book back out of the LLM like you can out of Dropbox. Copyright law is concerned with outputs, not with inputs. If you could make a machine that could create full exact copies of books without ever training on or copying those books, that would still be infringement.
This settlement has nothing to do with any criminal liability Anthropic might have, only tort liability (and it involves damages, not fines).
I wouldn't get fined $7,000 for illegally downloading 3 books, for example - much less. Although if I'm a repeat offender it can go up to prison, I think.
It may be fair to you but how about other authors? Maybe it's not fair at all to them.
> Perhaps tokenize all of the books and assign proportionally for token count of each publication.
Doesn't that mean the money should go to your publisher instead of you?
Where can I check if I'm eligible?
It is just another opinion.
Infringement was supposed to imply substantial similarity. Now it is supposed to mean statistical similarity?
The suit isn't about Anthropic training its models using copyrighted materials. Courts have generally found that to be legal.
The suit is about Anthropic procuring those materials from a pirated dataset.
The infringement, in other words, happened at the time of procurement, not at the time of training.
If it had procured them from a legitimate source (e.g. licensed them from publishers) then the suit wouldn't be happening.
> Once we accept that simply "learning from previous copyrighted works to make new ones" is "infringement", then the onus is on you to establish a clean creation chain, because you'll be vulnerable to the exact same argument, and you will owe compensation to anyone whose work you looked at in learning your craft.
This point was made earlier in this blog post:
https://blog.giovanh.com/blog/2025/04/03/why-training-ai-can...
HN discussion of the post: https://news.ycombinator.com/item?id=43663941
> Name should sound familiar to those who follow tech law; he presided over Oracle v Google, along with Anthony Levandowski's criminal case for stealing Waymo tech for Uber.
His orders and opinions are, imo, a success story of the US judicial system. I think this is true even if you disagree with them
> The number of bizarre, contradictory inferences this settlement asks you to make - no matter your stance on the wider question - is wild.
What contradictions do you see? I don't see any.
Sometimes these companies specifically seek out a settlement to avoid setting a legal precedent when they feel they would lose.
Since the violation is detected via model output, it doesn't matter what the input method is.