skilled · 20 days ago
> In response, NVIDIA defended its actions as fair use, noting that books are nothing more than statistical correlations to its AI models.

Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?

ThrowawayR2 · 20 days ago
Yes, it's been discussed many times before. All the corporations training LLMs have to have done a legal analysis and concluded that it's defensible. Even one of the white papers commissioned by the FSF ( "Copyright Implications of the Use of Code Repositories to Train a Machine Learning Model" at https://www.fsf.org/licensing/copilot/copyright-implications... ), concluded that using copyrighted data to train AI was plausibly legally defensible and outlined the potential argument. You will notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.
jkaplowitz · 20 days ago
> Even one of the white papers commissioned by the FSF

Quoting the text which the FSF put at the top of that page:

"This paper is published as part of our call for community whitepapers on Copilot. The papers contain opinions with which the FSF may or may not agree, and any views expressed by the authors do not necessarily represent the Free Software Foundation. They were selected because we thought they advanced the discussion of important questions, and did so clearly."

So, they asked the community to share thoughts on this topic, and they're publishing interesting viewpoints that clearly advance the discussion, whether or not they end up agreeing with them. I do acknowledge that they paid $500 for each paper they published, which gives some validity to your use of the verb "commissioned", but that's a separate question from whether the FSF agrees with the conclusions. They certainly didn't choose a specific author or set of authors to write a paper on a specific topic before it was written, which is what a commission usually involves. And even then, a commissioning organization doesn't necessarily agree with a paper's conclusions unless the commission isn't considered complete until the paper is revised to match the desired conclusion.

> You will notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.

This would be consistent with them agreeing with this paper's conclusion, sure. But that's not the only possibility it's consistent with.

It could alternatively be because they discovered or reasonably should have discovered the copyright infringement less than three years ago, therefore still have time remaining in their statute of limitations, and are taking their time to make sure they file the best possible legal complaint in the most favorable available venue.

Or it could simply be because they don't think they can afford the legal and PR fight that would likely result.

grayhatter · 20 days ago
> Even one of the white papers commissioned by the FSF [...] concluded that using copyrighted data to train AI was plausibly legally defensible [...] notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.

I agree with jkaplowitz, but for a different reason: your description still feels a bit misleading to me. The FSF-commissioned paper argues that Microsoft's use of code FROM GITHUB, FOR COPILOT is likely non-infringing because of the additional GitHub ToS. That feels like critical context to provide, given that in the very next statement you widened it to LLMs generally, and the FSF likely cares about code that isn't on GitHub as well.

All of that said, I'm not sure it matters, because I don't find that part of the whitepaper's argument very compelling: it rests critically on the additional grants in the ToS. IIRC (going only from memory) the ToS requires that you grant GitHub a license as needed to provide the service, and GitHub can provide the services the user reasonably understood GitHub to provide without violating the additional clauses specified in the existing FOSS license covering the code. That was a while ago, though, and I'd say it's very murky now, because everyone knows Microsoft provides Copilot, so "obviously" they need it.

Unfortunately, and importantly, when dealing with copyright, the paper also covers the transformative fair use arguments in depth, and those I do find very compelling. The paper (and likely others) argues that the code output from an LLM is likely transformative, and thus can't be infringing (or is unlikely to be). I think in many cases, the output is clearly transformative in nature.

I've also seen code generated by Claude (and likely others as well?) copy large sections from existing works, where it's clearly "copy/paste", which clearly can't be fair use, nor transformative; the output copies the soul of the work. Given that I have no idea what dataset that code is being copied from, it's scary enough to make me unwilling to take the chance on any of it.

reorder9695 · 20 days ago
So it's legal to train an "intelligence" on everything for free based on fair use, but it's not legal to train another intelligence (my brain) on it?
general1465 · 20 days ago
Did you pirate this movie? No I did not, it is fair use because this movie is nothing more than a statistical correlation to my dopamine production.
earthnail · 20 days ago
The movie played on my screen but I may or may not have seen the results of the pixels flashing. As such, we can only state with certainty that the movie triggered the TV's LEDs relative to its statistical light properties.
gruez · 20 days ago
>Did you pirate this movie? No I did not, [...]

You're probably being sarcastic but that's actually how the law works. You'll note that when people get sued for "pirating" movies, it's almost always because they were caught seeding a torrent, not for the act of watching an illegal copy. Movie studios don't go after visitors of illegal streaming sites, for instance.

thaumasiotes · 20 days ago
Note that what copyright law prohibits is the action of producing a copy for someone else, not the action of obtaining a copy for yourself.
ErroneousBosh · 20 days ago
Did you pirate this movie?

No, I acquired a block of high-entropy random numbers as a standard reference sample.

JKCalhoun · 20 days ago
I saw the movie, but I don't remember it now.
Ferret7446 · 20 days ago
Indeed, the "copy" of the movie in your brain is not illegal. It would be rather troublesome and dystopian if it were.
NitpickLawyer · 20 days ago
> Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?

It makes some sense, yeah. There's also precedent, in google scanning massive amounts of books, but not reproducing them. Most of our current copyright laws deal with reproductions. That's a no-no. It gets murky on the rest. Nvda's argument here is that they're not reproducing the works, they're not providing the works for other people, they're "scanning the books and computing some statistics over the entire set". Kinda similar to Google. Kinda not.

I don't see how they get around "procuring them" from dubious 3rd party sources, but oh well. The only certain thing is that our current laws don't cover this, and now it's probably too late.


musicale · 17 days ago
> There's also precedent, in google scanning massive amounts of books,

Except that Google acquired the books legally, and first sale doctrine applies to physical books.

> but not reproducing them

See also: "Extracting books from production language models"

https://news.ycombinator.com/item?id=46569799

olejorgenb · 20 days ago
> I don't see how they get around "procuring them" from 3rd party dubious sources

Yeah, isn't this what Anthropic was found guilty of?

bulbar · 20 days ago
If they don't reproduce the data in any form, how could the LLM be of any use?

The whole/main intention of an LLM is to reproduce knowledge.

masfuerte · 20 days ago
Scanning books is literally reproducing them. Copying books from Anna's Archive is also literally reproducing them. The idea that it is only copyright infringement if you engage in further reproduction is just wrong.

As a consumer you are unlikely to be targeted for such "end-user" infringement, but that doesn't mean it's not infringement.

threethirtytwo · 20 days ago
It does make sense. It's controversial. Your brain memorizes things in much the same way. So what NVIDIA does here is no different; the AI doesn't actually copy any of the books. To call training illegal is similar to calling reading a book and remembering it illegal.

Our copyright laws are nowhere near detailed enough to cover any of this, so there is indeed a logical and technical inconsistency here.

I can definitely see these laws evolving into things that are human centric. It’s permissible for a human to do something but not for an AI.

What is consistent is that obtaining the books was probably illegal, but if, say, NVIDIA bought one Kindle copy of each book from Amazon and scraped everything for training, then that falls into a grey zone.

ckastner · 20 days ago
> To call training illegal is similar to calling reading a book and remembering it illegal.

Perhaps, but reproducing the book from this memory could very well be illegal.

And these models are all about production.

lelanthran · 20 days ago
> To call training illegal is similar to calling reading a book and remembering it illegal.

A type of wishful thinking fallacy.

In law, scale matters. It's legal for you to possess a single joint. It's not legal to possess 400 tons of weed in a warehouse.

kalap_ur · 20 days ago
You can only read the book if you purchased it. Even if you don't have the intent to reproduce it, you must purchase it. So, I guess NVDA should just purchase all those books, no?
_trampeltier · 20 days ago
But to train the models they have to download the books first (make a copy).
Nursie · 20 days ago
But it’s not just about recall and reproduction. If they used Anna’s Archive the books were obtained and copied without a license, before they were fed in as training data.
godelski · 20 days ago
You need to pay for the books before you memorize them
Bombthecat · 20 days ago
Who cares? Only Disney had the money to fight them.

Everything else will be slurped up for AI and reused.

nancyminusone · 20 days ago
When you're responsible for 4% of the global GDP, they let you do it.
qingcharles · 20 days ago
They let you just grab any book you want.
HillRat · 20 days ago
It's not settled law as it pertains to LLMs, but, yes, creating a "statistical summary" of a book (consider, e.g., a concordance of Joyce's "Ulysses") is generally protected as fair use. However, illegally accessing pirated books to create that concordance is still illegal.
HWR_14 · 20 days ago
Copyright laws are so undefined and NVIDIA's lawyers so plentiful that the statement works in their favor. You're allowed to copy part of a work in many cases; the easiest example is that you can quote a line from a book in a review. The line is fuzzy.
tobwen · 20 days ago
Books are databases, chars their elements. We have copyright for databases in the EU :)
bulbar · 20 days ago
Of course it does not make sense; it's just the framing of a multi-billion-dollar industry, and people tend to buy it.
lencastre · 19 days ago
I would love to see these NVIDIA designs treated as mere statistical correlations of graphics card design.
RGamma · 20 days ago
The chicken is trying to become the egg.
postexitus · 20 days ago
A quite good explanation by Cory Doctorow of what copyright laws cover and what they should (and should not) cover is here: https://www.theguardian.com/us-news/ng-interactive/2026/jan/...
Elfener · 20 days ago
It seems so; stealing copyrighted content is only illegal if you do it to read it or to allow others to read it. Stealing it to create slop is legal.

(The difference is that the first use allows ordinary people to get smarter, while the second use allows rich people to get (seemingly) richer, a much more important thing)

poulpy123 · 20 days ago
I'm not saying it will change anything, but going after Anna's Archive while most of the big AI players intensely used it is quite something.
gizajob · 20 days ago
Library Genesis worked pretty great and unmolested until news came out about Meta using it, at which point a bunch of the main sites disappeared off the net. So not only do these companies take ALL the pirated material, their act of doing so even borks the pirates, ruining the fun of piracy for everyone else.
pjc50 · 20 days ago
NVIDIA are "legitimate", so anything they do is fine, while AA are "illegitimate", so it's not.
countWSS · 20 days ago
Short-term thinking: they don't care about where the data comes from, only how easy it is to get. It's probably decided at the project-manager level.
haritha-j · 20 days ago
Just to clarify, the most valuable company in the world refuses to pay for digital media?
rpdillon · 20 days ago
I see this sentiment posted quite a bit, but have the publishers made any products available that would allow AI training on their works for payment? A naive approach would be to go to an online bookstore and pay $15 for every book, but then you have copyrighted content that is encrypted, and it's a violation of the DMCA to decrypt it.

I assume you're expecting that they'll reach out and cut a deal with each publishing house separately, and then those publishing houses will have to somehow transfer their data over to NVIDIA. But that's a very custom set of discussions and deals that have to be struck.

I think they're going to the pirate libraries because the product they want doesn't exist.

haritha-j · 20 days ago
Perhaps because authors don't want their content to be used for this purpose? Because Microsoft refuses to give me a copy of the source code to Windows to 'inspire' my vibe-coded OS, Windowpanes 12, of which I will not give Microsoft a single cent of revenue, is it acceptable for me to pirate it? Someone doesn't want to sell me their work, so I'm justified in stealing it?
zaptheimpaler · 20 days ago
> I assume you're expecting that they'll reach out and cut a deal with each publishing house separately, and then those publishing houses will have to somehow transfer their data over to NVIDIA. But that's a very custom set of discussions and deals that have to be struck.

If this is the only legal way for them to train, then yes, that is what they should do instead of breaking the law... just because it's not easy doesn't mean piracy is fine.

g947o · 20 days ago
Hmm, didn't Anthropic buy a bunch of used books (like, physical ones), scan them, and then destroy them? If Anthropic can do that, surely NVIDIA can too.
dns_snek · 20 days ago
Do you believe in private property rights? If the product they want doesn't exist then they're shit out of luck and they must either make one or wait for one to get made. You're arguing that it's okay for them to break the law because doing business legally is really inconvenient.

That would be the end of discussion if we lived in a world governed by the rule of law but we're repeatedly reminded that we don't.

trueismywork · 20 days ago
The product I want doesn't exist either. But if I pirate, straight to Alcatraz I go.
kelnos · 20 days ago
That's not relevant when it comes to copyright law. The copyright holder has the sole legal right to decide how the work is distributed.

If it isn't distributed in a manner to your liking, the only legal thing you can do is not have a copy of it at all.

nexle · 20 days ago
They already paid their lawyers 10x more to ensure that torrenting for LLM training is perfectly legal; why would they want to pay more?
1over137 · 20 days ago
Not spending money (vs spending money) helps make one rich!
machomaster · 20 days ago
Not in the case of Nvidia. Famously, "the more you pay, the more you save".
GeorgeOldfield · 19 days ago
this is good. down with copyright.
NekkoDroid · 20 days ago
Well... you don't want the good guys (Nvidia) giving money to the bad guys (Anna's Archive) right??? /s
flipped · 20 days ago
Considering AA gave them ~500TB of books, which is astonishing (very expensive for AA to even store), I wonder how much NVIDIA paid them for it? It has to be at least close to half a million?
qingcharles · 20 days ago
I have a very large collection of magazines. AI companies were offering straight cash and FTP logins for them about a year or so ago. Then when things all blew up they all went quiet.
uncivilized · 15 days ago
How did AI companies find your collection?
antonmks · 20 days ago
NVIDIA executives allegedly authorized the use of millions of pirated books from Anna's Archive to fuel its AI training. In an expanded class-action lawsuit that cites internal NVIDIA documents, several book authors claim that the trillion-dollar company directly reached out to Anna's Archive, seeking high-speed access to the shadow library data.
derelicta · 20 days ago
I feel like Nvidia's CEO would be the kind to snatch sugar sachets from his local deli just to save a bit more.
songodongo · 20 days ago
“Yes officer, it was the goober thinking he looked cool in the leather jacket.”
musicale · 17 days ago
The reason why it's legal for Nvidia to download Anna's Archive and illegal for you is that they have well-paid lawyers and deep pockets to kick the can down the road for years before negotiating a settlement, making billions in the meantime. The settlement itself becomes a moat.

They also pay millions of dollars to lobbyists to encourage favorable regulation and enforcement.

utopiah · 20 days ago
People HAVE to somehow notice how hungry for proper data AI companies are when one of the largest companies propping up the fastest-growing market STILL has to go to such lengths, getting actual executive approval for pirated content while being a hardware manufacturer.

I keep hearing how it's fine because synthetic data will solve it all, how new techniques, feedback, etc. will fix it. Then why do this?

The promises don't match the resources available, and this makes it blatantly clear.